
Statistical Learning Theory for Neural Operators

Niklas Reinhardt¹, Sven Wang², and Jakob Zech¹

¹ Interdisziplinäres Zentrum für wissenschaftliches Rechnen, Universität Heidelberg, Im Neuenheimer Feld 205, 69120 Heidelberg, Germany
niklas.reinhardt@iwr.uni-heidelberg.de, jakob.zech@uni-heidelberg.de

² Institut für Mathematik, Humboldt-Universität zu Berlin, Rudower Chaussee 25, 12489 Berlin, Germany
sven.wang@hu-berlin.de

December 24, 2024

arXiv:2412.17582v1 [math.ST] 23 Dec 2024

Abstract
We present statistical convergence results for the learning of (possibly) non-linear map-
pings in infinite-dimensional spaces. Specifically, given a map G0 : X → Y between two
separable Hilbert spaces, we analyze the problem of recovering G0 from n ∈ N noisy input-output pairs (x_i, y_i)_{i=1}^n with y_i = G0(x_i) + ε_i; here the x_i ∈ X represent randomly drawn “design” points, and the ε_i are assumed to be either i.i.d. white noise processes or sub-Gaussian random variables in Y. We provide general convergence results for least-squares-type empirical risk minimizers over compact regression classes G ⊆ L^∞(X, Y), in terms of their approximation properties and metric entropy bounds, which are derived using empirical process
techniques. This generalizes classical results from finite-dimensional nonparametric regression
to an infinite-dimensional setting. As a concrete application, we study an encoder-decoder
based neural operator architecture termed FrameNet. Assuming G0 to be holomorphic, we
prove algebraic (in the sample size n) convergence rates in this setting, thereby overcoming
the curse of dimensionality. To illustrate the wide applicability, as a prototypical example
we discuss the learning of the non-linear solution operator to a parametric elliptic partial
differential equation.

Contents
1 Introduction 2
1.1 Outline and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Existing Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Regression in Hilbert Spaces 6


2.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Empirical Risk Minimization . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3 Learning Holomorphic Operators with FrameNet 12


3.1 Representation Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1.1 Frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1.2 Encoder and Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.3 Smoothness Scales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 FrameNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2.1 Feedforward Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.2.2 The FrameNet Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.3 Approximation of Holomorphic Operators . . . . . . . . . . . . . . . . . . . . . . . 15
3.3.1 Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3.2 Approximation Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3.3 Entropy Bounds for FrameNet . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.4 Statistical Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

4 Applications 19
4.1 Finite Dimensional Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 Parametric Darcy Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2.2 Sample Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

5 Conclusions 22

References 23

Appendices 30

A Auxiliary Probabilistic Lemmas 30

B Proofs of Section 2 37
B.1 Proof of Theorem 2.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
B.1.1 Existence and Measurability of Ĝn . . . . . . . . . . . . . . . . . . . . . . . 37
B.1.2 Concentration Inequality for Ĝn . . . . . . . . . . . . . . . . . . . . . . . . 38
B.2 Proof of Theorem 2.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
B.3 Proof of Corollary 2.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
B.4 Proof of Corollary 2.8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
B.5 Proof of Theorem 2.10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

C Neural Network Theory 41


C.1 Operations on Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
C.2 Neural Network Approximation Theory . . . . . . . . . . . . . . . . . . . . . . . . 45

D Proofs of Section 3 50
D.1 Proof of Proposition 3.9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
D.2 RePU-realization of Tensorized Legendre Polynomials . . . . . . . . . . . . . . . . 53
D.3 Proof of Theorem 3.10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
D.4 Proof of Lemma 3.12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
D.5 Proof of Lemma 3.14 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

E Proofs of Section 4 67
E.1 Proof of Theorem 4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

1 Introduction
Learning non-linear relationships of high- and infinite-dimensional data is a fundamental problem
in modern statistics and machine learning. In recent years, “Operator Learning” has emerged
as a powerful tool for analyzing and approximating mappings G0 between infinite-dimensional
spaces [59, 44, 10, 61, 86, 5, 79, 66, 51]. The primary motivation for considering truly infinite-
dimensional data stems from applications in the natural sciences, where inputs and outputs of
operators are elements in function spaces. For instance, G0 could be the operator relating an
initial condition x of a dynamical system to the state G0 (x) of the system after a certain time, or
a coefficient-to-solution map of a parametric partial differential equation (PDE).

For finite-dimensional inputs and outputs, nonparametric regression is the standard framework
for inferring general, non-linear relationships. There, one aims to reconstruct some “ground truth”
G0 : Rd → Rm , d, m ∈ N, from noisy data (xi , yi ) ∈ Rd × Rm , i = 1, . . . , n, generated via
yi = G0 (xi ) + εi , where xi are called the “design points” and εi are typically independent and
identically distributed (i.i.d.) noise variables. In the framework of empirical risk minimization
(ERM), one chooses a suitable function class G of mappings from Rd to Rm and some loss function
L : R^m × R^m → R measuring the discrepancy between predictions G(x_i) and the data y_i. Statistical estimation is achieved by minimizing

$\hat G_n \in \operatorname{arg\,min}_{G \in \mathcal{G}} J_n(G), \qquad J_n(G) := \frac{1}{n}\sum_{i=1}^n L(G(x_i), y_i),$   (1.1)

assuming that minimizers exist. In the finite-dimensional setting, statistically optimal conver-
gence rates for such estimators were established for least-squares, maximum likelihood, and more
generally “minimum contrast” estimators, in [36, 8, 11]; see also [89] where such results are shown
for ERMs over neural network classes. However, as is well known, both approximation rates [28, 29] and statistical convergence rates [36, 38] over classical smoothness classes deteriorate exponentially in the dimension d. This renders computations practically infeasible for large d. This phenomenon is referred to as the curse of dimensionality, see also Section 4.1 ahead.
The framework for operator learning considered in this paper can be viewed as a direct ex-
tension of (1.1) to the infinite-dimensional case. Given Hilbert spaces X and Y, and a mapping
G0 : X → Y, the goal is to reconstruct G0 from “training data” (xi , yi ) ∈ X × Y with

yi = G0 (xi ) + εi ,

where the regression class G is a suitable set of measurable mappings between X and Y and εi
are centered noise variables, see Section 2 for details. This “supervised learning” setting underlies
popular methods such as the PCA-Net [44, 10].
We also mention the framework of “physics-informed learning” which is common in opera-
tor learning (relevant e.g. for the DeepONet [61]), but which is not considered in the present
manuscript. Here, information on the ground truth G0 is not known in the form of input-output
pairs, but instead is implicitly described via

N(G0 (x), y) = 0,

where N : X × Y → Z for a third vector space Z. Typically, N(x, ·) encodes a family of differential
operators parametrized by x which represent the underlying physical model. In this case, the loss to minimize is a residual of the form $\sum_{i=1}^n \|N(x_i, G(x_i))\|_Z^2$, thus leading to an “unsupervised learning
problem”. The two cases, supervised and unsupervised learning, can also be combined. Various
different (neural network based) architectures (i.e. regression classes G ) have been proposed in
recent years for the purpose of supervised or unsupervised operator learning.

1.1 Outline and Contributions


In this paper we provide statistical convergence results for operator learning which do not suffer
from the curse of dimensionality, and which can be applied to prototypical problems in the PDE
literature. We first develop our theory in an abstract setting for ERMs over classes of mappings
between separable Hilbert spaces, and later apply our theory to concrete examples. In doing
so, we build upon and synthesize influential proof techniques from nonparametric statistics, in
particular M-estimation [37, 67], approximation theory for parametric PDEs [22, 20, 43], and
empirical process theory [96, 30].
To illustrate the scope of our contributions, we start by stating a convergence result for the
elliptic “Darcy flow” problem on the d-dimensional torus. This is a standard example in PDE
driven forward and inverse problems, e.g. [6, 14, 19, 90, 94, 68, 67]. We aim for an informal
exposition here, with full details given in Section 4: Denote by T^d the d-dimensional torus, fix a smooth source function f : T^d → (0, ∞) and let a_min > 0. For a sufficiently smooth and uniformly positive conductivity a : T^d → R, denote by G0(a) the unique solution of the elliptic PDE

$-\nabla \cdot (a \nabla u) = f \ \text{ on } \mathbb{T}^d \qquad \text{and} \qquad \int_{\mathbb{T}^d} u(x)\,dx = 0.$   (1.2)

Now let γ be some probability distribution on L^2(T^d) such that

$\mathrm{supp}(\gamma) \subseteq \Big\{ a \in H^s(\mathbb{T}^d) : \inf_{x \in \mathbb{T}^d} a(x) \ge a_{\min},\ \|a\|_{H^s(\mathbb{T}^d)} \le R \Big\},$   (1.3)

for some R > 0, s > 2d + 1. Suppose we observe noisy input-output pairs (a_i, y_i)_{i=1}^n given by y_i = G0(a_i) + ε_i, where the ε_i are independent L^2(T^d)-Gaussian white noise processes (Section 2). The operator G0 can then be learned from this data as stated in the next theorem (Section 4.2); it regards empirical risk minimizers Ĝ_n over the so-called FrameNet class G_FN, which corresponds to a neural network based class of measurable mappings from X to Y (Section 3). Formally, Ĝ_n is defined as a minimizer of the least squares objective

$\hat G_n \in \operatorname{arg\,min}_{G \in \mathcal{G}_{\mathrm{FN}}} \frac{1}{n}\sum_{i=1}^n \|y_i - G(a_i)\|^2_{L^2(\mathbb{T}^d)},$   (1.4)

although a suitable modification is required to make this mathematically rigorous (Section 2.1).
Theorem 1.1 (Informal). Consider the operator G0 from the Darcy problem on the d-dimensional torus T^d (d ≥ 2), and suppose that γ satisfies (1.3) for some s > 3d/2 + 1 and a_min > 0. Fix τ > 0 (arbitrarily small). Then there exists a constant C such that for each n ∈ N there exists a FrameNet class G_FN(n) such that any empirical risk minimizer Ĝ_n in (1.4) satisfies¹

$E_{G_0}\Big[ \int \|\hat G_n(a) - G_0(a)\|^2_{L^2(\mathbb{T}^d)} \, d\gamma(a) \Big] \le C\, n^{-\frac{2s+2-3d}{2s+2-d} + \tau}.$   (1.5)

¹ Here and in the following, E_{G0} denotes the expectation w.r.t. the random data (x_i, y_i)_i generated by the ground truth G0. Similarly, we write P_{G0} for corresponding probabilities.

The most significant feature of the above statement is that the convergence rate in (1.5) is
algebraic in n, and thus circumvents the curse of dimensionality. The classes G FN , whose existence
is postulated by the theorem, can be precisely characterized in terms of the sparsity, depth, width
and other network class parameters, which are chosen in terms of the statistical sample size n.
We also note that the regularity assumption s > 3d/2 + 1 was made here for convenience and can
be weakened to s > 3d/2, see Theorem 4.2 and Remark 4.3 below.
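For instance, for d = 2 and s = 5 the exponent in (1.5) evaluates to

$\frac{2s+2-3d}{2s+2-d} = \frac{12-6}{12-2} = \frac{3}{5},$

so the mean squared prediction error decays like n^{−3/5+τ}; as s → ∞ the exponent approaches 1, i.e. the parametric rate.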
To achieve Theorem 1.1 and several other related results, we build our theory in multiple steps.
In Section 2, a general regression framework for mappings between Hilbert spaces is considered.
Our first main result, Theorem 2.5, gives a non-asymptotic concentration upper bound on the
empirical risk between Ĝn and G0 , with respect to the design points xi . The upper bound is
quantified in terms of the metric entropy of the “regression class” G and the best approximation
of G0 from G . Theorem 2.6 strengthens this statement to L2 (γ)-loss, for the case of random design
points xi ∼ γ. These results provide an operator learning analogue to classical convergence rates
in nonparametric regression. The proofs rely on probabilistic generic chaining techniques [97, 30]
and “slicing” arguments as introduced in [37], which we generalize to the current setting. We also note that our proofs contrast with existing nonparametric statistical analyses of neural networks [89] for
real-valued regression, where generic chaining techniques were not required for obtaining optimal
rates (up to log-factors).
In the second part of this work, we apply our statistical results to the specific deep operator
network class G_FN termed “FrameNet” and introduced in [43]. FrameNet classes, together with their underlying encoder-decoder and feedforward neural network structure, are defined
in Section 3. These classes are known to satisfy good approximation properties for holomorphic operators, a property which is fulfilled for the Darcy problem (1.2) and more broadly a wide
range of PDE based problems [21, 22, 20, 49, 41, 42, 46, 93, 24]. In Section 3, we identify
such operator holomorphy as a key regularity property which allows us to derive “dimension-free”
statistical convergence rates. By extending approximation theoretic results from [43], as well
as establishing metric entropy bounds for G FN based on [89], we obtain algebraic convergence
results for ERMs over FrameNet classes, for reconstructing holomorphic operators G0 . Specifically,
Theorem 3.15 bounds the L^2-risk as E[‖Ĝ_n − G0‖²_{L²(γ)}] ≲ n^{−κ/(κ+1)}, where κ > 0 denotes the
approximation rate established in [43]. We treat the case of ReLU and RePU [57] activation
functions for both sparse and fully-connected architectures.
In Section 4, we illustrate the usefulness of our general theory in two concrete settings. First,
we show how our theory recovers well-known minimax-optimal convergence rates for real-valued
regression (i.e., Y = R) on d-dimensional domains. This proves that our abstract results from
Section 2 cannot be improved in general, although matching lower bounds are yet unknown in the
infinite-dimensional setting. Thereafter, Section 4.2 demonstrates how our theory can be used to
yield the first algebraic convergence rates for a non-linear operator arising from PDEs – see in
particular Theorem 4.2 and Remark 4.3, which underlie Theorem 1.1.

1.2 Existing Results


The approximation of mappings between infinite-dimensional spaces has been studied extensively
in the context of Uncertainty Quantification, where G0 corresponds to the solution operator of
a parameter dependent PDE. Various methodologies have been proposed and analyzed for this
task, including for example compressed sensing [31, 87], sparse-grid interpolation [16, 72], least-
squares [23, 15], and reduced basis methods [85, 45]. Recently, neural network approaches have
become increasingly popular for this task as they provide a highly expressive and fast to evaluate
parametrization of high-dimensional functions. These attributes make them particularly useful for
learning surrogates in scientific applications, e.g. [44, 61, 59, 10, 5, 74, 73, 9, 18, 25, 53].

Approximation Theory for Neural Operators First theoretical results on operator learning
focused on the approximation error, establishing the existence of neural network architectures
capable of approximating G0 up to a certain accuracy, with the error decreasing algebraically in terms
of the number of learnable network parameters. For example, [92, 54, 91] showed that neural
networks have sufficient expressivity to efficiently approximate certain (holomorphic) mappings
G0. Such results are based on the observation that the smoothness of G0 implies that the image of this operator has moderate n-widths, i.e. that it is well approximated in moderate-dimensional linear
subspaces. See for example [32, 21, 22, 20, 47, 7]. Specifically for DeepONets [61], such a result
was obtained in [56].

Statistical Theory for Neural Operators The analysis of sample complexity has received
less attention so far. In [55], the authors analysed in particular the error of PCA encoders and
decoders used for PCA-Net, but did not analyse the statistical error for the full operator.
The paper [48] provides such a result for the estimation of linear and diagonalizable mappings
from noisy data; for lower bounds see, e.g., [13]. For other work on “functional regression”, see, e.g.,
[40, 64]. An analysis for nonparametric regression of nonlinear mappings from noisy data in infinite
dimensions was provided in [60]. There, the authors considered Lipschitz continuous mappings G0 ,
and proved consistency in the large data limit. Additionally, they give convergence results, which however in general suffer from the curse of dimensionality. This is due to their very general smoothness assumption on G0: Mhaskar and Hahm [62] showed early on that the nonlinear n-width of Lipschitz operators in L^2 decays only logarithmically, i.e. the number of (exact) data points needed to reconstruct the functional is exponential in the desired accuracy. Recently, [52] generalized these results and showed a generic curse of dimensionality for the reconstruction of Lipschitz operators and C^k-operators from exact data. Moreover, the authors show that if some intrinsic low-dimensionality allows for fast approximation, then the data complexity improves as well. Concerning the case of noisy and holomorphic operators, we also refer to the recent works [3, 2], which consider a setup similar to ours but, in contrast to us, also
treat the more general case of Banach space valued functions. The authors derive upper bounds
and concentration inequalities for the L2 -error, and lower bounds for the approximation error, in
terms of a neural network based architecture. Key differences to our work include in particular that [3, 2] consider network architectures that are linear in the trainable parameters, that in the noisy case their analysis in general does not imply convergence in the large data limit n → ∞, and that they do not provide convergence rates for concrete PDE models.

M-Estimation in Nonparametric Regression Convergence theory of M-estimators and (pe-


nalized) empirical risk minimizers was investigated around the 2000s in foundational works by
van de Geer [36, 37], Birge, Massart and co-authors [11, 8]. These works build on concentration
inequalities for empirical processes using “chaining” techniques which date back to seminal con-
tributions of Talagrand and others, see, e.g., [97, 39] and references therein. These techniques are
known to produce minimax-optimal rates n^{−2s/(2s+d)} for ERMs over s-smooth Sobolev, Hölder
and more generally, Besov smoothness classes of real-valued functions on bounded d-dimensional
Euclidean domains. The analysis of neural-network based ERMs was initiated by the work [89],
which considered regression over (compositional) Hölder classes on finite-dimensional domains,
and was followed by several other works such as [95]. We also mention [67, 71, 4] which analyse
ERMs in non-linear elliptic PDE-based inverse problems such as the “Darcy” flow problem studied
here. The present setting falls outside the scope of such classical theory for real-valued functions.
However, the derivation of our concentration inequalities for ERMs does build upon the same
probabilistic empirical process machinery laid out above [97, 30].

1.3 Notation
We write N = {1, 2, . . .} and N_0 = {0, 1, 2, . . .}. We write a_n ≲ b_n, respectively a_n ≳ b_n, for real sequences (a_n)_{n∈N}, (b_n)_{n∈N} if a_n is upper, respectively lower, bounded by b_n times a positive multiplicative constant which does not depend on n (but may well depend on other ambient parameters, which we make explicit whenever confusion may arise). By a_n ≃ b_n we mean that both a_n ≲ b_n and a_n ≳ b_n.
For a pseudometric space (T, d) and any δ > 0, let N (T, d, δ) be the δ-covering number of T ,
i.e. the minimal number of open δ-balls in d needed to cover T . We denote the metric entropy of
T by
H(T, d, δ) = log N (T, d, δ). (1.6)
Given a Borel probability measure γ on X and a subset D ⊆ X, we define the norms

$\|G\|^2_{L^2(X,\gamma;Y)} := \int_X \|G(x)\|^2_Y \, d\gamma(x), \qquad \|G\|_{\infty,D} := \sup_{x \in D} \|G(x)\|_Y,$

and also write ‖·‖_{L²(γ)} and ‖·‖_∞ if the underlying spaces are clear from context. The space of
real-valued, square summable sequences indexed over N is denoted by ℓ2 (N). The complexification
of a real Hilbert space H is denoted by HC , see [50, 65].

2 Regression in Hilbert Spaces


2.1 Problem Formulation
Throughout, let X and Y denote two separable (real) Hilbert spaces with respective inner products ⟨·,·⟩_X, ⟨·,·⟩_Y, and suppose

$G_0 : X \to Y$

is some non-linear (Borel measurable) operator which we aim to reconstruct. The observed data are assumed to be noisy “input-output pairs” (x_i, y_i)_{i=1}^n ∈ (X × Y)^n given by

$x_i \overset{\mathrm{iid}}{\sim} \gamma, \quad i = 1, \dots, n,$   (2.1a)

and

$y_i = G_0(x_i) + \sigma \varepsilon_i, \quad i = 1, \dots, n,$   (2.1b)

where σ > 0 denotes a scalar “noise level”, the ε_i are independent random noise variables and γ is a probability distribution on X. The x_i ∈ X are also referred to as the “design points”, and we write x = (x_1, ..., x_n) ∈ X^n. We will derive both results which are conditional on the design x, as well as results for random design. To avoid confusion, we will use the notations P^x_{G_0}, E^x_{G_0} to denote probabilities and expectations under the distribution (2.1) with fixed design x, and we use P_{G_0}, E_{G_0} to denote probabilities and expectations with random design x_i ∼ γ.

Remark 2.1. In practice, we will often deal with scenarios in which G0 is only defined on some
measurable subset V ⊂ X, see e.g. the solution operator in the Darcy flow example in Section 4.2.1.
In this case, our results can be applied to any measurable extension of G0 on X.

White Noise Model In this article we consider two assumptions on the noise, the first being
that the (ε_i)_{i=1}^n in (2.1) are independent copies of a Y-white noise process. Recall that for any
given separable Hilbert space Y, the Y-Gaussian white noise process is defined as the mean-zero
Gaussian process WY = (WY (y) : y ∈ Y) indexed by Y with “iso-normal” covariance structure

$W_Y(y) \sim N(0, \|y\|^2_Y), \qquad \mathrm{Cov}(W_Y(y), W_Y(y')) = \langle y, y' \rangle_Y, \qquad \text{for all } y, y' \in Y.$

It is well-known that W_Y does not take values in Y unless dim(Y) < ∞, but is interpreted as a stochastic process indexed by Y, see [38, p. 19] for details. Nevertheless, we slightly abuse notation and use the common notation ⟨W_Y, y⟩_Y := W_Y(y).
Under this assumption, conditionally on x_i we interpret each observation y_i in (2.1) as a realisation of a Gaussian process (y_i(f) : f ∈ Y) with

$E[y_i(f)] = \langle G_0(x_i), f \rangle_Y, \qquad \mathrm{Cov}(y_i(f), y_i(f')) = \sigma^2 \langle f, f' \rangle_Y,$

and we shall again use the notation ⟨y_i, f⟩_Y to denote y_i(f) (see also [99, 39, 67] where this common
viewpoint is explained in detail).
Example 2.2. Let O ⊆ Rd be a bounded, smooth domain. Then, for Y = L2 (O), one can show
that draws of an L^2(O)-white noise process a.s. take values in negative Sobolev spaces H^{−κ} for
κ > d/2, see, e.g., [69, 12].
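To make the white noise observation model concrete, the following minimal sketch (our illustration, not part of the paper; the truncation level J and all numerical choices are arbitrary) simulates the Gaussian pairing ⟨y_i, f⟩_Y = ⟨G0(x_i), f⟩_Y + σ⟨W_Y, f⟩_Y after expanding f in an orthonormal basis of Y, in which the white noise acts through i.i.d. N(0, 1) coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Truncate Y to its first J coordinates in a fixed orthonormal basis (e_1, ..., e_J).
# A Y-white noise process acts on f = sum_j f_j e_j via <W_Y, f> = sum_j w_j f_j with
# w_j ~ N(0, 1) i.i.d., so that <W_Y, f> ~ N(0, ||f||_Y^2) as in the definition above.
J, sigma = 64, 0.1
w = rng.standard_normal(J)                      # coefficients of one white-noise draw eps_i
g0_coeff = 1.0 / (1.0 + np.arange(J)) ** 2      # stand-in for the basis coefficients of G0(x_i)

f_coeff = rng.standard_normal(J)
f_coeff /= np.linalg.norm(f_coeff)              # normalize so that ||f||_Y = 1

pairing = float(np.dot(g0_coeff, f_coeff) + sigma * np.dot(w, f_coeff))
print(pairing)  # one realisation of <y_i, f>_Y ~ N(<G0(x_i), f>_Y, sigma^2)
```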

Sub-Gaussian Noise Model The second setting we consider is that of sub-Gaussian noise. We
say that a random vector X taking values in Y is sub-Gaussian with parameter η > 0 if E[X] = 0 and

$P(\|X\|_Y \ge t) \le 2 \exp\Big( - \frac{t^2}{2\eta^2} \Big), \qquad \text{for all } t \ge 0.$

In the sub-Gaussian noise model, we assume that the (ε_i)_{i=1}^n in (2.1) are independent sub-Gaussian variables in Y with parameter η = 1.

2.1.1 Empirical Risk Minimization


Let G be a class of (measurable) operators G ∋ G : X → Y. We would like to study classical empirical risk minimizers of least-squares type over G. Specifically, given regression data (x_i, y_i)_{i=1}^n, consider the empirical risk

$\tilde I_n(G) := \frac{1}{n} \sum_{i=1}^n \| y_i - G(x_i) \|^2_Y, \qquad \tilde I_n : \mathcal{G} \to [0, \infty].$

However, this functional takes finite values almost surely only in the sub-Gaussian noise model. In the white noise model, since y_i ∉ Y, it holds that Ĩ_n(G) = ∞ almost surely. We thus consider a modified definition of least-squares type estimators which is common in the literature on regression with white noise [67, 39]. Instead of Ĩ_n, we consider

$I_n(G) = \frac{1}{n} \sum_{i=1}^n \Big( -2 \langle G(x_i), y_i \rangle_Y + \| G(x_i) \|^2_Y \Big), \qquad I_n : \mathcal{G} \to \mathbb{R},$   (2.2)

which takes finite values a.s. also in the white noise model. Note that the latter objective function can be obtained from Ĩ_n by formally subtracting the term $n^{-1} \sum_{i=1}^n \|y_i\|^2_Y$, which exhibits no dependency on G. Therefore, in the sub-Gaussian noise model, the minimization of Ĩ_n and of I_n are equivalent, which is why we consider (2.2) in the following. We will denote minimizers of I_n(G) by Ĝ_n.
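As a sanity check on this relation, the following sketch (our illustration; the finite-dimensional discretization, the toy ground truth and the candidate G are arbitrary choices) verifies numerically that, when y_i ∈ Y, the two objectives differ exactly by the G-independent constant n^{-1} Σ_i ‖y_i‖²_Y.

```python
import numpy as np

rng = np.random.default_rng(1)

# Discretize Y as R^J; the x_i are points in R^d; G is any map from batches in R^d to R^J.
n, d, J, sigma = 20, 3, 50, 0.05
X = rng.uniform(-1.0, 1.0, size=(n, d))

def G0(x):                       # toy ground truth
    return np.outer(np.sin(x.sum(axis=1)), np.ones(J)) / np.sqrt(J)

Y = G0(X) + sigma * rng.standard_normal((n, J))   # sub-Gaussian noise: y_i lies in Y

def I_tilde(G, X, Y):            # classical least-squares risk
    return np.mean(np.sum((Y - G(X)) ** 2, axis=1))

def I_n(G, X, Y):                # modified objective (2.2)
    GX = G(X)
    return np.mean(-2.0 * np.sum(GX * Y, axis=1) + np.sum(GX ** 2, axis=1))

def G_candidate(x):              # an arbitrary element of a regression class
    return np.outer(np.cos(x[:, 0]), np.ones(J)) / np.sqrt(J)

const = np.mean(np.sum(Y ** 2, axis=1))           # n^{-1} sum_i ||y_i||_Y^2
print(np.isclose(I_tilde(G_candidate, X, Y), I_n(G_candidate, X, Y) + const))  # True
```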
Our assumptions on the class G in the ensuing theorems will ensure that a measurable choice
of minimizers Ĝn of In exists, see Theorem 2.5 (i). However, the ERM Ĝn will in general not
be unique, since we do not impose convexity on G . The reason is that our main application, the
NN-based FrameNet class G FN , is non-convex.
Remark 2.3 (Connection to maximum likelihood). In the white noise model, it follows from the Cameron-Martin theorem (see, e.g., Theorem 2.6.13 in [39]) that −nI_n(G)/(2σ²) constitutes the log-likelihood of the (dominated) statistical model arising from (2.1) with white noise. In
this case Ĝn can also be interpreted as a (nonparametric) maximum likelihood estimator over the
class G .
Remark 2.4. Consider nonparametric regression of an unknown function f : O → R for some
bounded, smooth domain O. Here, it is well-known that the observation white noise error model,
where data is given by Y = f + σW (with W a L2 (O)-white noise process) is asymptotically equiva-
lent in a Le Cam-sense to an observation model with m “equally spaced” (random or deterministic)
observation points throughout O,

Yi = f (zi ) + ηi , i = 1, ..., m,

with i.i.d. N(0, 1) errors, where the equivalence holds for σ ≍ 1/√m, see [88]. Therefore, our
observation model (2.1) may be viewed as a simplified proxy.

2.2 Main Results


Let G be a class of operators mapping from X to Y. For any fixed x = (x1 , ..., xn ) ∈ Xn and
(Borel) measurable map G : X → Y, we denote the empirical seminorm induced by x with

$\|G\|^2_n = \frac{1}{n} \sum_{i=1}^n \|G(x_i)\|^2_Y.$   (2.3)

For any element G* ∈ G and δ > 0, define the localized classes

$\mathcal{G}^*_n(\delta) = \big\{ G \in \mathcal{G} : \|G - G^*\|_n \le \delta \big\},$

and denote its metric entropy integral by

$J(\delta) = J(\mathcal{G}^*_n(\delta), \|\cdot\|_n) := \int_0^\delta H^{\frac{1}{2}}\big(\mathcal{G}^*_n(\delta), \|\cdot\|_n, \rho\big)\, d\rho.$   (2.4)

The following result provides a general convergence theorem for empirical risk minimizers with
high probability, which relates the empirical risk of ERMs over some operator class G to the
metric entropy of G . It can be viewed as a generalisation of classical convergence results for sieved
M-estimators [100] to Hilbert space valued functions. The proof can be found in Appendix B.1.

Theorem 2.5. For some measurable G0 : X → Y, let the data (x_i, y_i)_{i=1}^n arise from (2.1) either with white noise or with sub-Gaussian noise. Let G be a class of measurable maps from X to Y, let G* ∈ G, and let x = (x_1, ..., x_n) ∈ X^n be such that the following holds.

(a) There exists a constant C > 0 s.t. G is compact with respect to some norm ‖·‖ satisfying ‖·‖_n ≤ C‖·‖.

(b) There exists Ψ_n : (0, ∞) → [0, ∞) s.t. Ψ_n(δ) ≥ J(G*_n(δ), ‖·‖_n) for all δ > 0 and

$\delta \mapsto \frac{\Psi_n(\delta)}{\delta^2} \quad \text{is non-increasing for } \delta \in (0, \infty).$

Then the following holds.

(i) Minimizers Ĝ_n of the empirical risk (2.2) exist, and there is a measurable selection (with respect to the data (x_i, y_i)_{i=1}^n) of such a minimizer.

(ii) Fix any measurable selection Ĝ_n from part (i). Then there exists a universal constant C_ch > 0 (see Lemma A.4) such that for any G* ∈ G as above, any positive sequence (δ_n)_{n∈N} satisfying

$\sqrt{n}\, \delta_n^2 \ge 32\, C_{\mathrm{ch}}\, \sigma\, \Psi_n(\delta_n),$   (2.5)

and for any

$R \ge \max\Big\{ \delta_n, \ \frac{4 C_{\mathrm{ch}} \sigma}{\sqrt{n}}, \ \sqrt{2}\, \|G^* - G_0\|_n \Big\},$

it holds that

$P^{\mathbf{x}}_{G_0}\big( \|\hat G_n - G_0\|_n \ge R \big) \le 2 \exp\Big( - \frac{n R^2}{16\, C_{\mathrm{ch}}^2 \sigma^2} \Big).$   (2.6)

The lower bound for R in (ii), which determines the convergence rate of ‖Ĝ_n − G0‖_n, is typically optimized by balancing the “stochastic term” δ_n and the empirical approximation error ‖G* − G0‖_n, see, e.g., Theorem 3.15 below. Note that Theorem 2.5 gives a convergence rate with respect to the empirical norm ‖·‖_n. Therefore, when the design points x are random, the norm itself is also random, and the assumptions (a) and (b) in Theorem 2.5 have to be understood conditional on possible realizations of x. Theorem 2.6 below will give a corresponding concentration inequality on the ‖·‖_{L²(γ)}-error under the assumption of i.i.d. random design x_i ∼ γ. In this setting, a sufficient condition for (b) to be satisfied almost surely is to take Ψ_n as an upper bound for the entropy integral formed with the uniform entropy H(G, ‖·‖_{∞,supp(γ)}, δ).
The compactness of G in part (a) is needed for the existence of the ERM Ĝ_n in (2.2). The existence result for Ĝ_n (see e.g. [70, Proposition 5]) requires compact metric spaces, which is why we assume compactness w.r.t. a norm ‖·‖ stronger than the empirical seminorm ‖·‖_n.
In particular, compactness implies separability of G with respect to ‖·‖_n, which in turn is needed to guarantee the measurability of certain suprema of empirical processes ranging over G. See the proof of Theorem 2.5 below, in particular (B.4) and (B.6). The technical growth restriction in (b) on the function Ψ_n(δ) (i.e., our upper bound for the entropy integral) is required for the “peeling device” in (B.4). In particular, this assumption is satisfied in case that H(G, ‖·‖_∞, δ) ≲ δ^{−α} for some 0 < α < 2, see Corollary 2.7 below for details.
Theorem 2.5 provides a concentration inequality for the empirical norm ‖Ĝ_n − G0‖_n. Under the assumption of randomly chosen design points x_i ∼ γ, x_i ∈ X, this statement can be extended to a convergence result for the L²(γ)-norm. To this end, we also need slightly stronger technical assumptions on the class G with respect to the ‖·‖_{∞,supp(γ)}-norm.
Assumption 1. For some probability measure γ on X, assume x = (x_1, . . . , x_n) arises from i.i.d. draws x_i ∼ γ. Let G be a class of measurable maps X → Y, G* ∈ G and ‖G0‖_{∞,supp(γ)} < ∞. Suppose

(a) G is compact with respect to ‖·‖_{∞,supp(γ)},

(b) there exists a (deterministic) upper bound Ψ_n : (0, ∞) → [0, ∞) such that for a.e. x ∼ γ^n it holds that Ψ_n(δ) ≥ J(G*_n(δ), ‖·‖_n) for all δ > 0 and

$\delta \mapsto \frac{\Psi_n(\delta)}{\delta^2} \quad \text{is non-increasing for } \delta \in (0, \infty).$   (2.7)

Note that Assumption 1 is strictly stronger than assumptions (a) and (b) in Theorem 2.5, since ‖·‖_{∞,supp(γ)} is stronger than ‖·‖_n and Ψ_n in (2.7) does not depend on x. In particular, Assumption 1 implies the existence of a measurable ERM Ĝ_n by Theorem 2.5 (i).
The uniform ‖·‖_{∞,supp(γ)} assumptions are used to control the concentration of the empirical norm ‖·‖_n around the population norm ‖·‖_{L²(γ)}, see Lemma A.5 and also Lemma A.6 below. Let us write F = {G − G0 : G ∈ G}. As an immediate consequence of Assumption 1, there exists some F_∞ < ∞ such that

$\sup_{F \in \mathcal{F}} \|F\|_{\infty,\mathrm{supp}(\gamma)} = \sup_{G \in \mathcal{G}} \|G - G_0\|_{\infty,\mathrm{supp}(\gamma)} \le F_\infty.$   (2.8)

We can now state our main concentration inequality for the convergence of ‖Ĝ_n − G0‖_{L²(γ)}, which in particular provides a bound for the mean squared error E_{G0}[‖Ĝ_n − G0‖²_{L²(γ)}] as well. The proof of Theorem 2.6 can be found in Appendix B.2.

Theorem 2.6 (L²(γ)-Concentration under Random Design). Consider the nonparametric regression model (2.1) either with white noise or with sub-Gaussian noise and any measurable empirical risk minimizer Ĝ_n from (2.2). Suppose that G0, G, G*, γ and Ψ_n(·) are such that Assumption 1 holds. Then there exists some universal constant C > 0 such that for any positive sequences (δ_n)_{n∈N} and (δ̃_n)_{n∈N} with

$\sqrt{n}\, \delta_n^2 \ge C \sigma \Psi_n(\delta_n) \qquad \text{and} \qquad n \tilde\delta_n^2 \ge C F_\infty^2\, H(\mathcal{G}, \|\cdot\|_{\infty,\mathrm{supp}(\gamma)}, \tilde\delta_n),$   (2.9)

all G* ∈ G as above and all

$R \ge C \max\Big\{ \delta_n, \ \tilde\delta_n, \ \|G^* - G_0\|_{\infty,\mathrm{supp}(\gamma)}, \ \frac{\sigma + F_\infty}{\sqrt{n}} \Big\},$   (2.10)

we have

$P_{G_0}\big( \|\hat G_n - G_0\|_{L^2(\gamma)} \ge R \big) \le 2 \exp\Big( - \frac{n R^2}{C^2 (\sigma^2 + F_\infty^2)} \Big).$   (2.11)

To keep the presentation simple, we have left the numerical constants in the preceding theorem implicit. However, they can be made explicit, see the proof for details. The following bound on the mean squared error is obtained upon integration of the concentration inequalities from Theorem 2.5 and Lemma A.5. Note that directly integrating the L²(γ)-concentration, cf. (2.11), gives an approximation term in the uniform norm ‖·‖_{∞,supp(γ)} (following (2.10)), which has weaker convergence properties in general, see Theorem 3.10. For the proof of Corollary 2.7, see Appendix B.3.

Corollary 2.7 (L²(γ)-Mean Squared Error). Consider the setting of Theorem 2.6 and assume in addition that Assumption 1 is fulfilled for all G* ∈ G (with the same Ψ_n). Then, for some universal constant C > 0 and all n ∈ N,

$E_{G_0}\big[ \|\hat G_n - G_0\|^2_{L^2(\gamma)} \big] \le C \Big( \delta_n^2 + \tilde\delta_n^2 + \frac{\sigma^2 + F_\infty^2}{n} + 8 \inf_{G^* \in \mathcal{G}} \|G^* - G_0\|^2_{L^2(\gamma)} \Big).$   (2.12)

To demonstrate the typical use-cases of our abstract results, we summarize in the following corollary the rates which can be obtained under algebraic approximation properties of G and two concrete scalings of the metric entropy of G. These correspond to the typical entropy bounds satisfied by (i) some fixed n-independent, infinite-dimensional regression class, and (ii) an N-dimensional approximation class, where N is chosen in terms of n. In the following corollary, ≲ refers to an inequality involving a constant independent of N and δ. For a proof of Corollary 2.8, see Appendix B.4.

Corollary 2.8. Consider the setting of Corollary 2.7. Let G = G(N), N ∈ N, be a sequence of regression classes² such that $\inf_{G^* \in \mathcal{G}(N)} \|G^* - G_0\|^2_{L^2(\gamma)} \lesssim N^{-\beta}$ for some β > 0 and all N ∈ N. Denote the entropy by H(N, δ) := H(G(N), ‖·‖_{∞,supp(γ)}, δ).
(i) If H(N, δ) ≲ δ^{−α} for some 0 < α < 2, then

$E_{G_0}\big[ \|\hat G_n - G_0\|^2_{L^2(\gamma)} \big] \lesssim n^{-\frac{2}{2+\alpha}}.$

(ii) If H(N, δ) ≲ N log(δ^{−1}), then for all τ > 0

$E_{G_0}\big[ \|\hat G_n - G_0\|^2_{L^2(\gamma)} \big] \lesssim n^{-\frac{\beta}{\beta+1} + \tau},$

where the constant in ≲ in general depends on τ.
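To indicate where the rate in part (ii) comes from (a sketch of the balancing argument with constants and logarithmic factors suppressed; the full argument is in Appendix B.4): under H(N, δ) ≲ N log(δ^{−1}), the conditions (2.9) hold with δ_n² and δ̃_n² of order N/n up to logarithmic factors, so that the bound (2.12) is, up to such factors, of order

$\frac{N}{n} + N^{-\beta} \;\asymp\; n^{-\frac{\beta}{\beta+1}} \qquad \text{for } N \asymp n^{\frac{1}{\beta+1}},$

and the logarithmic factors are absorbed into the arbitrarily small τ.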


Remark 2.9 (Effective smoothness). In classical nonparametric regression over s-smooth function classes on [0, 1]^d, the entropy assumption H(N, δ) ≲ δ^{−α} from part (i) is fulfilled for α = d/s, which yields the minimax-optimal rate n^{−2s/(2s+d)}, see Section 4.1 for details. Since the rate only depends on α (or equivalently α^{−1}), we can think of α^{−1} = s/d as the “effective smoothness” of the statistical model at hand.
The sub-Gaussian noise model poses a more restrictive regularity assumption on ε_i than the assumption of white noise. It is possible to obtain L²(γ)-convergence for sub-Gaussian noise even in the case that the entropy integral J(δ) in (2.4) is not finite, so that Assumption 1 (b) does not hold. The details are given in the next theorem, whose proof is deferred to Appendix B.5.
Theorem 2.10. Consider the nonparametric regression model (2.1) with sub-Gaussian noise and the empirical risk from (2.2). Let Assumption 1 (a) hold, i.e. suppose that ‖G0‖_{∞,supp(γ)} < ∞ and G is compact with respect to ‖·‖_{∞,supp(γ)}. Then there exists some universal constant C_1 > 0, such that for any positive sequences (δ_n)_{n∈N} and (δ̃_n)_{n∈N} with

$n \delta_n^4 \ge C_1 \sigma^2 F_\infty^2\, H\Big( \mathcal{G}, \|\cdot\|_{\infty,\mathrm{supp}(\gamma)}, \frac{\delta_n^2}{8\sigma^2 + \delta_n^2} \Big) \qquad \text{and} \qquad n \tilde\delta_n^2 \ge 6 F_\infty^2\, H(\mathcal{G}, \|\cdot\|_{\infty,\mathrm{supp}(\gamma)}, \tilde\delta_n),$   (2.13)

all G* ∈ G and all

$R \ge \max\Big\{ \delta_n, \ \tilde\delta_n, \ \|G^* - G_0\|_{\infty,\mathrm{supp}(\gamma)}, \ \frac{F_\infty}{\sqrt{n}} \Big\},$

we have

$P_{G_0}\big( \|\hat G_n - G_0\|_{L^2(\gamma)} \ge R \big) \le 4 \exp\Big( - \frac{n R^4}{C_1^2 \sigma^2 (1 + F_\infty^2)} \Big) + 2 \exp\Big( - \frac{n R^2}{C_1^2 F_\infty^2} \Big).$   (2.14)

Furthermore, for all n ∈ N there exists a universal constant C_2 > 0 such that

$E_{G_0}\big[ \|\hat G_n - G_0\|^2_{L^2(\gamma)} \big] \le C_2 \Big( \delta_n^2 + \tilde\delta_n^2 + \frac{\big((1 + \sigma)(1 + F_\infty)\big)^2}{\sqrt{n}} + \inf_{G^* \in \mathcal{G}} \|G^* - G_0\|^2_{L^2(\gamma)} \Big).$   (2.15)
² We may think of N as the number of parameters of G, e.g. the size of the FrameNet class G_FN in Section 3 below.

Remark 2.11. Consider α ≥ 2 in Corollary 2.8 (i). Then J(δ) in (2.4) is not necessarily finite, and hence Assumption 1 (b) need not be satisfied. Therefore Theorem 2.6 cannot be applied. In the sub-Gaussian noise case, we may however still use the L²(γ)-bound in (2.15). Similarly as in the proof of Corollary 2.8, it can then be shown that $\delta_n^2 \lesssim n^{-\frac{1}{2+\alpha}}$ and $\tilde\delta_n^2 \lesssim n^{-\frac{2}{2+\alpha}}$ satisfy (2.13). This yields

$E_{G_0}\big[ \|\hat G_n - G_0\|^2_{L^2(\gamma)} \big] \lesssim n^{-\frac{1}{2+\alpha}},$

i.e. half the convergence rate of the “chaining regime” considered in Corollary 2.8.

3 Learning Holomorphic Operators with FrameNet


In this section we first recall the NN-based operator class FrameNet from [43, Section 2], see Sections 3.1 and 3.2. This will provide the regression class G_FN over which to estimate G0. Similar to, e.g., PCA-Net [44], FrameNet consists of mappings

$G = D_Y \circ g \circ E_X,$   (3.1)

for a linear encoder E_X : X → ℓ²(N), a linear decoder D_Y : ℓ²(N) → Y, and a coefficient map g : ℓ²(N) → ℓ²(N). The encoder maps an x ∈ X to its coefficients in some representation system of X. Conversely, the decoder builds a y ∈ Y out of a coefficient sequence in ℓ²(N). These representation systems consist of a priori fixed frames. The coefficient map g is represented by a feedforward neural network, which will be trained by ERM.
Subsequently, in Sections 3.3-3.4, we apply our analysis to the learning of holomorphic operators G0 : X → Y. For such mappings, the FrameNet architecture was shown in [43] to be capable of overcoming the curse of dimensionality in terms of the approximation error. We generalize this property to the case of bounded network parameters in Section 3.3 and furthermore show metric entropy bounds. This allows us to prove that FrameNet can overcome the curse of dimensionality in the learning of holomorphic operators, both in terms of the approximation capability and in terms of sample complexity.

3.1 Representation Systems


We briefly recall basic definitions and properties of frames. For more details see for instance [17].

3.1.1 Frames
Definition 3.1. A family Ψ = {ψ_j : j ∈ N} ⊂ X is called a frame of X, if the analysis operator

$F : X \to \ell^2(\mathbb{N}), \qquad v \mapsto \big( \langle v, \psi_j \rangle_X \big)_{j \in \mathbb{N}},$

is bounded and boundedly invertible between X and range(F) ⊂ ℓ²(N).


Every orthonormal basis of X is trivially a frame. Since Definition 3.1 merely requires bounded invertibility on the range of F, F need not be surjective, and in particular Ψ need not consist of linearly independent vectors. The frame bounds of Ψ are defined as

$\Lambda_\Psi := \|F\|_{X \to \ell^2} = \sup_{0 \ne v \in X} \frac{\|F v\|_{\ell^2}}{\|v\|_X}, \qquad \lambda_\Psi := \inf_{0 \ne v \in X} \frac{\|F v\|_{\ell^2}}{\|v\|_X},$   (3.2)

the synthesis operator F′ as

$F' : \ell^2(\mathbb{N}) \to X, \qquad (v_i)_{i \in \mathbb{N}} \mapsto v^\top \Psi := \sum_{i \in \mathbb{N}} v_i \psi_i,$

and finally the frame operator as T := F′F : X → X. The following lemma gives a characterization of T. For a proof, see [17, Lemma 5.1.5].

Lemma 3.2. The frame operator T is boundedly invertible, self-adjoint and positive. Furthermore, it holds that ‖T‖_{X→X} = Λ_Ψ² and ‖T^{−1}‖_{X→X} = λ_Ψ^{−2}. The family Ψ̃ := T^{−1}Ψ is a frame of X, called the (canonical) dual frame of X. The analysis operator of the dual frame is F̃ := F(F′F)^{−1} and its frame bounds are λ_Ψ^{−1} and Λ_Ψ^{−1}.

Definition 3.3. A family Ψ = {ψ_j : j ∈ N} ⊂ X is called a Riesz basis of X if there exists a bounded, bijective operator A : X → X and an orthonormal basis (e_j)_{j∈N} with ψ_j = A e_j for all j ∈ N.

A Riesz basis is a frame Ψ which is also a basis. Equivalently, a Riesz basis is a frame with ker(F′) = {0} and therefore range(F) = ℓ²(N). Moreover, the dual frame Ψ̃ of a Riesz basis is also a Riesz basis, see e.g. [17, Section 5].
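As a finite-dimensional illustration of Definition 3.1 and Lemma 3.2 (our example, not taken from the paper: a redundant "Mercedes-Benz" frame of R²), the following sketch computes frame bounds, frame operator and canonical dual frame, and checks the reconstruction formula.

```python
import numpy as np

# A redundant frame of X = R^2 with three vectors (not linearly independent).
Psi = np.array([[0.0, 1.0],
                [-np.sqrt(3) / 2, -0.5],
                [np.sqrt(3) / 2, -0.5]])          # rows are the frame vectors psi_j

F = Psi                                           # analysis operator: F v = (<v, psi_j>)_j
T = F.T @ F                                       # frame operator T = F' F : X -> X
eigs = np.linalg.eigvalsh(T)
lam, Lam = np.sqrt(eigs[0]), np.sqrt(eigs[-1])    # lower/upper frame bounds as in (3.2)
print(lam, Lam)                                   # tight frame: both equal sqrt(3/2)

Psi_dual = (np.linalg.inv(T) @ Psi.T).T           # canonical dual frame T^{-1} psi_j
v = np.array([0.3, -1.2])
coeff = Psi_dual @ v                              # coefficients <v, dual psi_j>
print(np.allclose(Psi.T @ coeff, v))              # synthesis reproduces v: True
```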

3.1.2 Encoder and Decoder


Throughout the rest of this paper we fix frames and their duals on X, Y and denote them by

$\Psi_X = (\psi_j)_{j \in \mathbb{N}}, \quad \tilde\Psi_X = (\tilde\psi_j)_{j \in \mathbb{N}}, \quad \Psi_Y = (\eta_j)_{j \in \mathbb{N}}, \quad \tilde\Psi_Y = (\tilde\eta_j)_{j \in \mathbb{N}}.$

The corresponding analysis operators are F_X, F̃_X, F_Y and F̃_Y. We then introduce encoder and decoder maps via

$E_X := \tilde F_X : X \to \ell^2(\mathbb{N}), \ x \mapsto (\langle x, \tilde\psi_j \rangle_X)_{j \in \mathbb{N}}, \qquad D_Y := F_Y' : \ell^2(\mathbb{N}) \to Y, \ (y_j)_{j \in \mathbb{N}} \mapsto \sum_{j \in \mathbb{N}} y_j \eta_j.$   (3.3)

In case Ψ_X and Ψ_Y are Riesz bases, the mappings in (3.3) are boundedly invertible.

3.1.3 Smoothness Scales


The encoder mapping EX : X → ℓ2 (N) maps an element to its coefficients in the frame represen-
tation. For computational purposes, this coefficient sequence must be truncated, as only finitely
many coefficients can be considered. Consequently, it is essential to control the error resulting from
discarding higher-order frame coefficients. To formalize this we next introduce scales of subspaces
of X, Y, whose elements exhibit a certain coefficient decay.
Definition 3.4. Let θ = (θ_j)_{j∈N} be a strictly positive, monotonically decreasing sequence such that θ^{1+ε} ∈ ℓ¹(N) for all ε > 0. For all r, t ≥ 0 we introduce the subspaces X^r ⊂ X and Y^t ⊂ Y via

$X^r := \{x \in X : \|x\|_{X^r} < \infty\} \qquad \text{and} \qquad Y^t := \{y \in Y : \|y\|_{Y^t} < \infty\},$

where

$\|x\|^2_{X^r} := \sum_{j \in \mathbb{N}} \langle x, \tilde\psi_j \rangle_X^2\, \theta_j^{-2r} \qquad \text{and} \qquad \|y\|^2_{Y^t} := \sum_{j \in \mathbb{N}} \langle y, \tilde\eta_j \rangle_Y^2\, \theta_j^{-2t}.$

For every r ≥ 0, X^r is a Hilbert space [43, Lemma 1].

3.2 FrameNet
In this subsection we recall the FrameNet architecture from [43, Section 2]. We start by formally
introducing feedforward neural networks (NNs) following [75, 43].

3.2.1 Feedforward Neural Networks
Definition 3.5. A function f : R^{p_0} → R^{p_{L+1}} is called a neural network, if there exist σ : R → R, integers p_1, . . . , p_{L+1} ∈ N, L ∈ N and real numbers w^l_{i,j}, b^l_j ∈ R such that for all x = (x_i)_{i=1}^{p_0} ∈ R^{p_0}

$z_j^1 = \sigma\Big( \sum_{i=1}^{p_0} w_{i,j}^1 x_i + b_j^1 \Big), \qquad j = 1, \dots, p_1,$   (3.4a)

$z_j^{l+1} = \sigma\Big( \sum_{i=1}^{p_l} w_{i,j}^{l+1} z_i^l + b_j^{l+1} \Big), \qquad l = 1, \dots, L-1, \ j = 1, \dots, p_{l+1},$

$f(x) = \big( z_j^{L+1} \big)_{j=1}^{p_{L+1}} = \Big( \sum_{i=1}^{p_L} w_{i,j}^{L+1} z_i^L + b_j^{L+1} \Big)_{j=1}^{p_{L+1}}.$   (3.4b)

We call σ the activation function, L the depth, p := max_{l=0,...,L+1} p_l the width, w^l_{i,j} ∈ R the weights, and b^l_j ∈ R the biases of the NN.
While different NNs can realize the same function, for simplicity we refer to a function f : R^{p_0} → R^{p_{L+1}} as an NN of type (3.4), if it allows for (at least) one such representation. Additionally, our analysis will require some further terminology: For an NN f : R^{p_0} → R^{p_{L+1}} as in (3.4) its

• size is the number of nonzero parameters

$\mathrm{size}(f) := |\{(i, j, l) : w_{i,j}^l \ne 0\}| + |\{(j, l) : b_j^l \ne 0\}|,$

• maximum of parameters is

$\mathrm{mpar}(f) := \max\Big\{ \max_{i,j,l} |w_{i,j}^l|, \ \max_{j,l} |b_j^l| \Big\},$

• maximum range on Ω ⊆ R^{p_0} is

$\mathrm{mran}_\Omega(f) := \sup_{x \in \Omega} \|f(x)\|_2 = \sup_{x \in \Omega} \Big( \sum_{j=1}^{p_{L+1}} \big( z_j^{L+1}(x) \big)^2 \Big)^{\frac{1}{2}}.$

Throughout, for q ∈ N, q ≥ 1, we consider the activation function

$\sigma_q(x) := \max\{0, x\}^q, \qquad x \in \mathbb{R}.$

For q = 1, σ_1 is the rectified linear unit (ReLU); for q ≥ 2, σ_q is called the rectified power unit (RePU).
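The following minimal sketch (our illustration, with arbitrary widths and randomly drawn parameters) realizes a σ_q-network in the sense of Definition 3.5 and evaluates the quantities size and mpar introduced above.

```python
import numpy as np

def sigma_q(x, q=1):
    """ReLU for q = 1, RePU for q >= 2."""
    return np.maximum(0.0, x) ** q

def forward(x, weights, biases, q=1):
    """Evaluate the network (3.4): hidden layers use sigma_q, the last layer is affine."""
    z = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        z = sigma_q(W @ z + b, q)
    return weights[-1] @ z + biases[-1]

def size(weights, biases):
    """Number of nonzero weights and biases."""
    return sum(int(np.count_nonzero(W)) for W in weights) + \
           sum(int(np.count_nonzero(b)) for b in biases)

def mpar(weights, biases):
    """Largest absolute value among all weights and biases."""
    return max(max(np.abs(W).max() for W in weights),
               max(np.abs(b).max() for b in biases))

rng = np.random.default_rng(2)
p = [3, 8, 8, 2]                                     # p_0, p_1, p_2, p_{L+1} with depth L = 2
weights = [rng.uniform(-1, 1, (p[l + 1], p[l])) for l in range(len(p) - 1)]
biases = [rng.uniform(-1, 1, p[l + 1]) for l in range(len(p) - 1)]

print(forward(np.array([0.1, -0.4, 0.7]), weights, biases, q=1))
print(size(weights, biases), mpar(weights, biases))  # quantities bounded by s and M below
```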
Remark 3.6. Definition 3.5 introduces an NN as a function f : R^{p_0} → R^{p_{L+1}}. Throughout, we also understand the realization of an NN as a map f : ℓ²(N) → ℓ²(N) via extension by zeros. This is equivalent to suitably padding the weight matrices (w^l_{i,j})_{i,j} and bias vectors (b^l_j)_j in Definition 3.5 for l ∈ {0, L + 1} with infinitely many zeros, see [43, Remark 13].

3.2.2 The FrameNet Class


By definition of frames, the encoding operator EX : X → ℓ2 (N) in (3.3) is injective, and the
decoding operator DY : ℓ2 (N) → Y is surjective. Thus for every mapping G : X → Y, there exists
a coefficient map g : ℓ2 (N) → ℓ2 (N) such that

G = DY ◦ g ◦ EX .

This motivates the introduction of the following function class.

Given σ : R → R, positive integers L, p, s ∈ N, and reals M, B ∈ R, let

$g_{\mathrm{FN}}(\sigma, L, p, s, M, B) := \big\{ g : \mathbb{R}^{p_0} \to \mathbb{R}^{p_{L+1}} \text{ is an NN with activation function } \sigma \text{ s.t. } \mathrm{depth}(g) \le L,\ \mathrm{width}(g) \le p,\ \mathrm{size}(g) \le s,\ \mathrm{mpar}(g) \le M,\ \mathrm{mran}_{[-1,1]^{p_0}}(g) \le B,\ p_0, p_{L+1} \le p \big\}.$

Instead of directly considering (3.1), for our analysis, contrary to [43], with θ from Definition 3.4, fixed R > 0, and U := [−1, 1]^N, it will be convenient to introduce the linear scaling

$S_r : \ \times_{j \in \mathbb{N}} [-R\theta_j^r, R\theta_j^r] \to U, \qquad (x_j)_{j \in \mathbb{N}} \mapsto \Big( \frac{x_j}{R\theta_j^r} \Big)_{j \in \mathbb{N}}.$   (3.5)

The FrameNet class then consists of all operators

$\mathcal{G}_{\mathrm{FN}}(\sigma, L, p, s, M, B) := \big\{ G = D_Y \circ g \circ S_r \circ E_X : g \in g_{\mathrm{FN}}(\sigma, L, p, s, M, B) \big\}.$   (3.6)
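A minimal sketch of the composition (3.6) in a truncated, finite-dimensional setting (our illustration: X = Y = R^K with the canonical orthonormal basis as frame, a fixed truncation level K, and a small ReLU coefficient network; none of these concrete choices are prescribed by the text):

```python
import numpy as np

K = 16                                      # truncation level for the frame coefficients
R, r = 1.0, 1.5
theta = 1.0 / (1.0 + np.arange(K))          # decreasing sequence theta_j (Definition 3.4)

# For simplicity take X = Y = R^K with the canonical orthonormal basis as frame,
# so encoder and decoder reduce to the identity on the retained coefficients.
def E_X(x):                                 # encoder: frame coefficients <x, dual psi_j>
    return x

def D_Y(c):                                 # decoder: synthesis sum_j c_j eta_j
    return c

def S_r(x):                                 # linear scaling (3.5) onto U = [-1, 1]^K
    return x / (R * theta ** r)

def g(u, W1, b1, W2, b2):                   # coefficient map: a small ReLU network
    return W2 @ np.maximum(0.0, W1 @ u + b1) + b2

def frame_net(x, params):                   # G = D_Y o g o S_r o E_X, cf. (3.6)
    return D_Y(g(S_r(E_X(x)), *params))

rng = np.random.default_rng(3)
params = (rng.uniform(-1, 1, (32, K)), rng.uniform(-1, 1, 32),
          rng.uniform(-1, 1, (K, 32)), rng.uniform(-1, 1, K))

x = R * theta ** r * rng.uniform(-1, 1, K)  # coefficients in [-R theta_j^r, R theta_j^r]
print(frame_net(x, params).shape)           # output coefficients of an element of Y
```

In the infinite-dimensional case, E_X and D_Y would instead apply the (dual) frame analysis and synthesis operators of Section 3.1.2, and only finitely many coefficients would be passed to g.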

3.3 Approximation of Holomorphic Operators


Our statistical theory established in Section 2 shows that the sample complexity depends on

(i) the approximation quality of G w.r.t. G0, cf. the term $\inf_{G^* \in \mathcal{G}} \|G^* - G_0\|_{L^2(\gamma)}$ in (2.12),

(ii) the metric entropy of G, cf. the terms δ_n² and δ̃_n² in (2.12).
In this subsection we give results for both the approximation quality and the metric entropy of
the FrameNet class G FN .

3.3.1 Setting
Let us start by making the assumptions on G0 and the sampling measure γ on X more precise. Denote in the following U := [−1, 1]^N, let r > 1/2, R > 0, and let the frames (ψ_j)_{j∈N}, (η_j)_{j∈N} be as in Section 3, and (θ_j)_{j∈N} as in Definition 3.4. Then

$\sigma_R^r : U \to X, \qquad y \mapsto R \sum_{j \in \mathbb{N}} \theta_j^r y_j \psi_j,$   (3.7)

yields a well-defined map, since by construction (y_j θ_j^r)_{j∈N} ∈ ℓ²(N) for every y ∈ U, and (ψ_j)_{j∈N} is a frame. Next, we introduce the “cubes”

$C_R^r(X) = \Big\{ a \in X : \sup_{j \in \mathbb{N}} \theta_j^{-r} |\langle a, \tilde\psi_j \rangle_X| \le R \Big\}.$   (3.8)

Observe that with σ_R^r(U) := {σ_R^r(y) : y ∈ U}, clearly

$C_R^r(X) \subseteq \sigma_R^r(U),$
but equality holds in general only if the (ψj )j∈N form a Riesz basis [43, Remark 10].
Example 3.7. Let X = R, r = 1, R = 1, θ_1 = 3/2 and θ_2 = 1/2. Consider the frame Ψ = {1, 1} (which is not a basis) with dual frame Ψ̃ = {1/2, 1/2}. Then with U = [−1, 1]²

$C_1^1(\mathbb{R}) = [-1, 1] \subsetneq [-2, 2] = \{\theta_1 y_1 \psi_1 + \theta_2 y_2 \psi_2 : y \in U\}.$

The holomorphy assumption on G0 can now be formulated as follows, [43, Assumption 1].
Recall that for a real Hilbert space X, we denote by XC its complexification.
Assumption 2. For some r > 1, R > 0, t > 0, C_{G_0} < ∞ there exists an open set O_C ⊂ X_C containing σ_R^r(U), such that

(a) $\sup_{a \in O_{\mathbb{C}}} \|G_0(a)\|_{Y^t_{\mathbb{C}}} \le C_{G_0}$, and G0 : O_C → Y_C is holomorphic,

(b) γ is a probability measure on X with supp(γ) ⊆ C_R^r(X).
The assumption requires G0 to be holomorphic on a superset of σ_R^r(U). The probability measure γ is allowed to have support in C_R^r(X), which is potentially smaller than σ_R^r(U). In particular, G0 is then holomorphic on a superset of the support of γ.
Example 3.8. Denote by λ the Lebesgue measure. Then π := ⊗_{j∈N} (λ/2) is the uniform probability measure on U = [−1, 1]^N, and

$\gamma := (\sigma_R^r)_\sharp \pi$   (3.9)

defines a probability measure on X with support σ_R^r(U). If Ψ_X is a Riesz basis, then supp(γ) = C_R^r(X), so that supp(γ) ⊆ C_R^r(X) as required in Assumption 2.
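A sketch of how samples from the pushforward measure (3.9) could be drawn after truncating the expansion (our illustration; the truncation level, θ and all parameters are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
J, R, r = 200, 1.0, 1.5
theta = 1.0 / (1.0 + np.arange(J))      # theta_j ~ 1/j: theta^(1+eps) is summable for every eps > 0

def sample_gamma_coefficients(n_samples):
    """Frame coefficients of x = R * sum_j theta_j^r y_j psi_j with y_j ~ Unif[-1, 1] i.i.d. (truncated)."""
    y = rng.uniform(-1.0, 1.0, size=(n_samples, J))
    return R * theta ** r * y           # broadcasting over the sample dimension

A = sample_gamma_coefficients(3)
# Coefficientwise each sample lies in [-R theta_j^r, R theta_j^r]; for an orthonormal
# basis this is exactly membership in the cube C_R^r(X) from (3.8).
print(A.shape, bool(np.all(np.abs(A) <= R * theta ** r)))   # (3, 200) True
```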

3.3.2 Approximation Theory


Theorem 1 in [43] establishes a convergence rate for approximating G0 within G FN (with un-
bounded weights), in terms of the number of trainable network parameters. Our main results
on sample complexity depend on bounds on the metric entropy, which require bounded network
weights. To address this, we now extend Theorem 1 from [43] to FrameNet architectures with
bounded weights, as introduced in Section 3.
Similarly as in [92, 78, 43], the analysis builds on polynomial chaos expansions, e.g. [102]. Denote by (L_j)_{j∈N_0} the univariate Legendre polynomials normalized such that $\frac{1}{2}\int_{-1}^{1} L_j(x)^2 \, dx = 1$ for all j ∈ N_0. Then, [1, Chapter 22],

$\sup_{x \in [-1,1]} |L_j(x)| \le \sqrt{2j + 1} \qquad \forall j \in \mathbb{N}_0.$   (3.10)
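A quick numerical check of this normalization and of (3.10) (our sketch; it uses the standard Legendre polynomials P_j from numpy, which satisfy P_j(1) = 1, so that L_j = √(2j+1) P_j):

```python
import numpy as np
from numpy.polynomial import legendre as leg

nodes, wts = leg.leggauss(64)          # Gauss-Legendre quadrature, exact for these degrees
x = np.linspace(-1.0, 1.0, 5001)       # fine grid for the sup-norm check
for j in range(6):
    c = np.zeros(j + 1)
    c[j] = np.sqrt(2 * j + 1)          # L_j = sqrt(2j+1) * P_j (standard Legendre P_j)
    norm = 0.5 * np.sum(wts * leg.legval(nodes, c) ** 2)   # = (1/2) int_{-1}^1 L_j^2 dx
    sup = np.abs(leg.legval(x, c)).max()
    print(j, round(norm, 10), sup <= np.sqrt(2 * j + 1) + 1e-9)   # prints 1.0 and True
```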

Next, let F be the set of infinite-dimensional multiindices with finite support, i.e.

$\mathsf{F} := \big\{ \nu = (\nu_j)_{j \in \mathbb{N}} \in \mathbb{N}_0^{\mathbb{N}} : |\operatorname{supp} \nu| < \infty \big\},$   (3.11)

where supp ν := {j : ν_j ≠ 0}. For finite sets of multiindices Λ ⊆ F, their effective dimension and maximal order are defined as

$d(\Lambda) := \sup\{|\operatorname{supp} \nu| : \nu \in \Lambda\} \qquad \text{and} \qquad m(\Lambda) := \sup\{|\nu| : \nu \in \Lambda\},$

where $|\nu| := \sum_{j \in \mathbb{N}} \nu_j$. Moreover, with U := [−1, 1]^N, for all y ∈ U and ν ∈ F we let $L_\nu(y) := \prod_{j \in \mathbb{N}} L_{\nu_j}(y_j)$ be the corresponding multivariate Legendre polynomial. This infinite product is well-defined, since for all but finitely many j it holds that ν_j = 0, so that L_{ν_j} ≡ 1. The next proposition gives an approximation result for multivariate Legendre polynomials. It is an extension of [75, Proposition 2.13] to the case of bounded network parameters. The proof is provided in Appendix D.1.
Proposition 3.9 (σ_1-NN approximation of L_ν). Let δ ∈ (0, 1/2) and let Λ ⊂ F be finite. Then there exists a σ_1-NN f_{Λ,δ} such that its outputs {L̃_{ν,δ}}_{ν∈Λ} satisfy

$\sup_{y \in U} |L_\nu(y) - \tilde L_{\nu,\delta}(y)| \le \delta \qquad \forall \nu \in \Lambda.$   (3.12)

Furthermore, there exists a constant C > 0 independent of Λ, d(Λ), m(Λ) and δ such that

$\mathrm{depth}(f_{\Lambda,\delta}) \le C\big[ \log(d(\Lambda))\, d(\Lambda)\, \log(m(\Lambda))\, m(\Lambda) + m(\Lambda)^2 + \log(\delta^{-1}) \log(d(\Lambda)) + m(\Lambda) \big],$

$\mathrm{width}(f_{\Lambda,\delta}) \le C |\Lambda| d(\Lambda),$

$\mathrm{size}(f_{\Lambda,\delta}) \le C\big[ |\Lambda| d(\Lambda)^2 \log(m(\Lambda)) + m(\Lambda)^3 d(\Lambda)^2 + \log(\delta^{-1}) m(\Lambda)^2 + |\Lambda| d(\Lambda) \big],$

$\mathrm{mpar}(f_{\Lambda,\delta}) \le 1.$

A similar result holds for RePU neural networks, see Proposition D.1. Given q ∈ N and N ∈ N, we introduce the two FrameNet classes

$\mathcal{G}^{\mathrm{sp}}_{\mathrm{FN}}(\sigma_q, N) := \mathcal{G}_{\mathrm{FN}}(\sigma_q, \mathrm{width}_N, \mathrm{depth}_N, \mathrm{size}_N, M, B), \qquad \mathcal{G}^{\mathrm{full}}_{\mathrm{FN}}(\sigma_q, N) := \mathcal{G}_{\mathrm{FN}}(\sigma_q, \mathrm{width}_N, \mathrm{depth}_N, \infty, M, B),$   (3.14a)

where for certain constants C_L, C_p, C_s, M, B ≥ 1 (to be determined later)

$\mathrm{depth}_N = \max\{1, \lceil C_L \log(N) \rceil\}, \qquad \mathrm{width}_N = \lceil C_p N \rceil, \qquad \mathrm{size}_N = \lceil C_s N \rceil,$   (3.14b)

and ⌈x⌉ denotes the smallest integer larger than or equal to x ∈ R.


We emphasize that G^sp_FN(σ_q, N) corresponds to a sparsely connected architecture, whereas G^full_FN(σ_q, N) represents a fully connected architecture (because there is no constraint on its size). In particular, since every linear transformation in between activation functions has at most width_N + width_N² parameters, we have

$\mathrm{size}\big(\mathcal{G}^{\mathrm{full}}_{\mathrm{FN}}(\sigma_q, N)\big) \le (\mathrm{depth}_N + 1)(\mathrm{width}_N + \mathrm{width}_N^2) = O(\log(N)\, N^2) \quad \text{as } N \to \infty.$   (3.15)

Thus G^full_FN(σ_q, N) can essentially be quadratically larger than G^sp_FN(σ_q, N).
The next theorem extends [43, Theorems 1 and 2] to the case of bounded network parameters. The proof is provided in Appendix D.3.

Theorem 3.10 (Sparse network approximation). Let G0, γ satisfy Assumption 2 with r > 1, t > 0. Let q ≥ 1 be an integer and fix τ > 0 (arbitrarily small).

(i) There exists C > 0 s.t. for all N ∈ N

$\inf_{G \in \mathcal{G}^{\mathrm{sp}}_{\mathrm{FN}}(\sigma_q, N)} \big\| G - G_0 \big\|^2_{\infty,\mathrm{supp}(\gamma)} \le C N^{-2\min\{r-1,\, t\} + \tau}.$

(ii) Let Ψ_X be a Riesz basis and let γ be as in (3.9). Then there exists C > 0 s.t. for all N ∈ N

$\inf_{G \in \mathcal{G}^{\mathrm{sp}}_{\mathrm{FN}}(\sigma_q, N)} \big\| G - G_0 \big\|^2_{L^2(\gamma)} \le C N^{-2\min\{r-\frac{1}{2},\, t\} + \tau}.$

Remark 3.11. Theorem 3.10 provides an approximation rate for G^sp_FN(σ_q, N) in terms of N. Since G^full_FN(σ_q, N) ⊇ G^sp_FN(σ_q, N), the statement trivially remains true for G^full_FN(σ_q, N). However, as the size of G^full_FN(σ_q, N) can be quadratically larger by (3.15), the convergence rate in terms of network size is essentially halved for the fully connected architecture.

3.3.3 Entropy Bounds for FrameNet


In the following we bound the metric entropy (cp. (1.6)) of FrameNet for ReLU activation. Recall that Λ_{Ψ_Y} is the frame constant in (3.2). The proof of Lemma 3.12 is given in Appendix D.4.

Lemma 3.12 (cf. [89, Lemma 5]). Let L, p, s, M, B ≥ 1, let σ_R^r be as in (3.7) and U = [−1, 1]^N. Then G_FN = G_FN(σ_1, L, p, s, M, B) is compact with respect to ‖·‖_{∞,σ_R^r(U)} and ‖·‖_n. Furthermore, G_FN satisfies for all δ > 0

$H(\mathcal{G}_{\mathrm{FN}}, \|\cdot\|_{\infty,\sigma_R^r(U)}, \delta) \le (s + 1) \log\Big( 2^{L+6}\, \Lambda_{\Psi_Y}\, L^2 M^{L+1} p^{L+4} \max\{1, \delta^{-1}\} \Big).$   (3.16)

In particular, there exist constants C_H^{SP}, C_H^{FC} > 0 such that for the sparse and fully-connected FrameNet classes from (3.14) it holds that

$H(\mathcal{G}^{\mathrm{sp}}_{\mathrm{FN}}(\sigma_1, N), \|\cdot\|_{\infty,\sigma_R^r(U)}, \delta) \le C_H^{\mathrm{SP}}\, N \big( 1 + \log(N)^2 + \log\max\{1, \delta^{-1}\} \big),$

$H(\mathcal{G}^{\mathrm{full}}_{\mathrm{FN}}(\sigma_1, N), \|\cdot\|_{\infty,\sigma_R^r(U)}, \delta) \le C_H^{\mathrm{FC}}\, N^2 \big( 1 + \log(N)^3 + \log\max\{1, \delta^{-1}\} \big),$

for all δ > 0.

Remark 3.13. The entropy bound is independent of the constant B bounding the maximum range
of the network, see Subsection 3.2. However, B < ∞ will be necessary to apply Theorem 2.6.
For RePU activation, as mentioned before, the metric entropy bounds exhibit a worse depen-
dency on the network parameters, due to the lack of global Lipschitz continuity of σq if q ≥ 2.
The proof of Lemma 3.14 is given in Appendix D.5.
Lemma 3.14. Let q ∈ N, q ≥ 2, let L, p, s, M, B ≥ 1, let σ_R^r be as in (3.7) and U = [−1, 1]^N. Then G_FN = G_FN(σ_q, L, p, s, M, B) is compact with respect to ‖·‖_{∞,σ_R^r(U)} and ‖·‖_n. Furthermore, G_FN satisfies for all δ > 0

$H(\mathcal{G}_{\mathrm{FN}}, \|\cdot\|_{\infty,\sigma_R^r(U)}, \delta) \le (s + 1) \log\Big( \Lambda_{\Psi_Y}\, L\, q^{L+q}\, (2pM)^{4q^{2L+2}} \max\{1, \delta^{-1}\} \Big).$   (3.17)

Consider the constant C_L from (3.14). Then there exist C_H^{SP}, C_H^{FC} > 0 such that

$H(\mathcal{G}^{\mathrm{sp}}_{\mathrm{FN}}(\sigma_q, N), \|\cdot\|_{\infty,\sigma_R^r(U)}, \delta) \le C_H^{\mathrm{SP}}\, N^{1 + 2C_L \log(q)} \big( 1 + \log(N)^2 + \log\max\{1, \delta^{-1}\} \big),$

$H(\mathcal{G}^{\mathrm{full}}_{\mathrm{FN}}(\sigma_q, N), \|\cdot\|_{\infty,\sigma_R^r(U)}, \delta) \le C_H^{\mathrm{FC}}\, N^{2 + 2C_L \log(q)} \big( 1 + \log(N)^2 + \log\max\{1, \delta^{-1}\} \big),$

for all δ > 0.

3.4 Statistical Theory


Using our statistical results from Theorem 2.6 and Corollaries 2.7-2.8 as well as the approximation result from Theorem 3.10, we can bound E_{G0}[‖Ĝ_n − G0‖²_{L²(γ)}] solely in terms of the statistical sample size n. This is formalized in the next two theorems, for ReLU and RePU activation.
Our result distinguishes between the sparse and fully connected architectures G^sp_FN(σ_q, N) and G^full_FN(σ_q, N) in (3.14). In practice, fully connected architectures are often preferred, due to their
simpler implementation, and because training sparse NN architectures can run into problems
like “bad” local minima, see, e.g., [35]. Our theoretical upper bounds are sharper for sparse
architectures; this is because the additional free parameters in the fully connected architecture
increase the entropy of this class, but do not yield better approximation properties in our proofs.
Theorem 3.15 (ERM for white noise and ReLU). Let G0 : X → Y and γ satisfy Assumption 2 for some r > 1, t > 0. Fix τ > 0 (arbitrarily small) and set (cp. Example 3.8)

$\kappa := \begin{cases} 2\min\{r - \tfrac{1}{2}, t\} & \text{if } \Psi_X \text{ is a Riesz basis and } \gamma = (\sigma_R^r)_\sharp \pi, \\ 2\min\{r - 1, t\} & \text{otherwise.} \end{cases}$

For every n ∈ N, let (x_i, y_i)_{i=1}^n be data generated by (2.1) either in the white noise model or in the sub-Gaussian noise model. Then there exist constants C_L, C_p, C_s, M, B ≥ 1 in (3.14) and C > 0 (all independent of n) such that

(i) Sparse FrameNet: with N = N(n) = ⌈n^{1/(κ+1)}⌉ and G = G^sp_FN(σ_1, N), there exists a measurable choice of an ERM Ĝ_n of (2.2). Any such Ĝ_n satisfies

$E_{G_0}\big[ \|\hat G_n - G_0\|^2_{L^2(\gamma)} \big] \le C n^{-\frac{\kappa}{\kappa+1} + \tau},$   (3.18)

(ii) Fully connected FrameNet: with N = N(n) = ⌈n^{1/(κ+2)}⌉ and G = G^full_FN(σ_1, N), there exists a measurable choice of an ERM Ĝ_n of (2.2). Any such Ĝ_n satisfies

$E_{G_0}\big[ \|\hat G_n - G_0\|^2_{L^2(\gamma)} \big] \le C n^{-\frac{\kappa}{\kappa+2} + \tau}.$   (3.19)

One can further use Theorem 2.6 to show concentration inequalities for the risk ‖Ĝ_n − G0‖_{L²(γ)}.

Proof. The proof follows directly from the entropy bounds in Lemma 3.12 and Corollary 2.8 (ii): First, Lemma 3.12 in particular verifies Assumption 1 for all G* ∈ G, which is required for Corollary 2.8. Applying the corollary with β = κ then gives (3.18), and with β = κ/2 we obtain (3.19). Note that crucially F_∞ (see (2.8)) does not depend on N, because ‖G0‖_{∞,supp γ} < ∞ and G_FN is uniformly bounded by the N-independent constant B, see (3.6).
Theorem 3.16 (ERM for white noise and RePU). Consider the setting of Theorem 3.15 and let q ∈ N, q ≥ 2. There exist constants C_L, C_p, C_s, M, B ≥ 1 in (3.14) and C > 0 (all independent of N) such that

(i) Sparse FrameNet: with N = N(n) = ⌈n^{1/(κ+1+4C_L log(q))}⌉ and G = G^sp_FN(σ_q, N), there exists a measurable choice of an ERM Ĝ_n of (2.2). Any such Ĝ_n satisfies

$E_{G_0}\big[ \|\hat G_n - G_0\|^2_{L^2(\gamma)} \big] \le C n^{-\frac{\kappa}{\kappa+1+4C_L \log(q)} + \tau}.$   (3.20)

(ii) Fully connected FrameNet: with N = N(n) = ⌈n^{1/(κ+2+4C_L log(q))}⌉ and G = G^full_FN(σ_q, N), there exists a measurable choice of an ERM Ĝ_n of (2.2). Any such Ĝ_n satisfies

$E_{G_0}\big[ \|\hat G_n - G_0\|^2_{L^2(\gamma)} \big] \le C n^{-\frac{\kappa}{\kappa+2+4C_L \log(q)} + \tau}.$   (3.21)

Proof. Similarly to the proof of Theorem 3.15, the proof of Theorem 3.16 follows from the entropy bounds in Lemma 3.14 above and Corollary 2.8 (ii) with β = κ/(1 + 2C_L log(q)) for (3.20) and β = κ/(2 + 2C_L log(q)) for (3.21).
A few remarks are in order. In the RePU case the activation function σq (x) = max{0, x}q
has no global Lipschitz condition for q ≥ 2. As a result, the entropy bounds obtained for the
corresponding FrameNet class are larger than for ReLU. This leads to worse convergence rates.
Moreover, for the RePU case, the convergence rate depends on the constant CL in (3.14b). The
proof of Theorem 3.10 shows that CL depends on the decay properties of the Legendre coefficients
r
(cν i ,j )i,j∈N of the function G0 ◦ σR , i.e. CL depends on G0 (see (D.23) and (D.24) in Theorem
D.4). Explicit bounds on CL are possible, see [104, Lemma 1.4.15].

4 Applications
We now present two applications of our results. First, in finite dimensional regression, our analysis
recovers well-known minimax-optimal rates for standard smoothness classes. This indicates that
our main statistical result in Theorem 2.6 is in general optimal. However, we do not claim opti-
mality specifically for the approximation of holomorphic operators discussed in Section 3. Second,
for an infinite dimensional problem we address the learning of a solution operator to a parameter
dependent PDE.

4.1 Finite Dimensional Regression


Let d ∈ N and X = Rd , Y = R. Moreover, let D ⊆ Rd be a bounded, open, smooth domain, and
G0 : D → R a ground-truth regression function. Suppose that
iid iid
xi ∼ γ and εi ∼ N(0, 1) ∀i ∈ N
are independent samples for some probability measure γ on D, and let
yi = G0 (xi ) + εi ∀i ∈ N.
Given a regression class G of measurable mappings from D → R, the least-squares problem is to
determine
n
X n
X
Ĝn ∈ arg min −2G(xi )yi + G(xi )2 = arg min |G(xi ) − yi |2 . (4.2)
G
G∈G i=1 G
G∈G i=1

19
For s ≥ 0, it is well-known that the minimax-optimal rate of recovering a ground-truth function
G0 in s-smooth function classes, such as the Sobolev space H s (D) or the Besov space B∞,∞
s
(D)
2s
(see, e.g., [33] for definitions), equals n− 2s+d , e.g., [38, 100].
Denote now by G sR the ball of radius R > 0 around the origin in either H s (D) or B∞,∞ s
(D).
Then,
 R d/s
H(GG sR , k · k∞,supp(γ) , δ) ≃ ∀δ ∈ (0, 1),
δ

which holds for all s > 0 if G sR is the ball in Bs,s (D), and for all s > d/2 in case G sR is the ball in
H (D), see Theorem 4.10.3 in [98]. Corollary 2.8 (i) (with α = d/s < 2) then directly yield the
s

following theorem. It recovers the minimax optimal rate for nonparametric least squares/maximum
likelihood estimators.
Theorem 4.1. Let R > 0 and s > d/2. Then, there exists C > 0 such that for all G0 ∈ G sR , the
estimator Ĝn in (4.2) with G = G sR and data as in (4.1) satisfies
  2 2s
EG0 kĜn − G0 k2L2 (D) ≤ Cn− 2+α = Cn− 2s+d ∀n ∈ N.

4.2 Parametric Darcy Flow


As a second application we apply Theorem 3.15 to the solution operator of the diffusion equation,
extending the discussion of approximation errors in [43, Section 7.1].

4.2.1 Setup
We recall the setup from [43, Sections 7.1.1, 7.1.2].
Let d ∈ N, and denote by Td ≃ [0, 1]d the d-dimensional torus. In the following, all function
spaces on Td are understood to be one-periodic in each variable. Fix ā ∈ L∞ (Td ) and f ∈
H −1 (Td )/R such that for some constant amin > 0

ess inf (ā(x) + a(x)) > amin . (4.3)


x∈Td

We consider the ground truth G0 : a 7→ u, mapping a ∈ L∞ (Td ) to the solution u ∈ H 1 (Td ) of


Z
−∇ · ((ā + a)∇u) = f on Td and u(x) dx = 0. (4.4)
Td

Then G0 : {a ∈ L∞ (Td ) : (4.3) holds} → H 1 (Td ) is well-defined.


To represent a and u, we use Fourier expansions on Td . Denote for j ∈ N0 and j ∈ Nd0 , d ≥ 2,
d
Y
√ √
ξ0 := 1, ξ2j (x) := 2 cos(2πjx), ξ2j−1 (x) := 2 sin(2πjx), ξj (x1 , . . . , xd ) := ξjk (xk ).
k=1

Then for r ≥ 0, {max{1, |j|}r ξj : j ∈ Nd0 } forms an ONB of H r (Td ) equipped with inner product
X 2r
hu, viH r (Td ) := hu, ξj iL2 hv, ξj iL2 max {1, |jj |} .
j ∈Nd
0

In the following, fix r0 , t0 ≥ 0 and set

X := H r0 (Td ), ψj := max{1, |j|}−r0 ξj ,


(4.5)
Y := H t0 (Td ), ηj := max{1, |j|}−t0 ξj ,

so that ΨX := (ψj )j∈Nd0 , ΨY := (ηj )j∈Nd0 form ONBs of X, Y respectively. The encoder EX and
decoder DY are now as in (3.3). Direct calculation shows Xr = H r0 +rd and Yt = H t0 +td for r,
t ≥ 0; for more details see [43, Section 7.1.2].

20
4.2.2 Sample Complexity
We now analyze the sample complexity for learning the PDE solution operator G0 in Section 4.2.1.
For a proof of Theorem 4.2, see Appendix E.1.
3d τ1 d
Theorem 4.2. Let d ∈ N, d ≥ 2, s > 3d 2 and t0 ∈ [0, 1]. Fix τ1 > 0, τ2 ∈ (0, min{s − 2 , 8 })
(both arbitrarily small), and set
(
d
2 + τ2 if s ∈ ( 3d
2 , 2d + 1 − t0 ]
r0 = s+t 0 −1
2 if s > 2d + 1 − t0 .

Moreover let f ∈ C ∞ (Td ), and let


(a) ground truth: G0 : a 7→ u be given through (4.4),
(b) representation system: EX , DY be as in (3.3) with the orthonormal basis in (4.5), and
r0 , t0 from above,
(c) data: γ be the measure defined in (3.7) and (3.9) with r = s−r d
0
such that ā + a satisfies
n
(4.3) for all a ∈ supp(γ), and let (xi , yi )i=1 be generated by (2.1) with the additive white
noise model or the sub-Gaussian noise model,
(d) regression class: G = G sp FN (σ1 , N ) be the n-dependent sparse FrameNet architecture in
1
(3.14) with N (n) = ⌈n κ+1 ⌉.
Then there exists a constant C > 0 such that for all n ∈ N there exists a measurable ERM
Ĝn ∈ G sp
FN (σ1 , N (n)) in (2.2), and any such Ĝn satisfies
κ
EG0 [kĜn − G0 k2L2 (H r0 (Td ),γ;H t0 (Td )) ] ≤ Cn− κ+1 +τ1 (4.6)
where ( 
2 min ds − 1, 1−t
d
0
if s ∈ ( 3d
2 , 2d + 1 − t0 ],
κ= (4.7)
s+1−t0
d −1 if s > 2d + 1 − t0 .
r
Remark 4.3. Consider the setting of Theorem 4.2, and let supp(γ) ⊆ CR (X). A slight modifica-
tion of the proof of Theorem 4.2 similar to [43] (using the approximation bound in Theorem 3.10
(i) instead of (ii)) then yields
κ
EG0 [kĜn − G0 k2L2 (H r0 (Td ),γ;H t0 (Td )) ] ≤ Cn− κ+1 +τ1 (4.8)
where for some (small) τ2 > 0
(
( d2 + τ2 , 2s
d − 3) if s ∈ ( 3d 3d
2 , 2 + 1 − t0 ]
(r0 , κ) = s+t0 − d −1
( 2
2
, s+1−t
d
0
− 32 ) if s > 3d
2 + 1 − t0 .
Since
 
 X 
BR (H s (Td )) = BR (Xr ) = {x ∈ X : kxkXr ≤ R} = x ∈ X : hx, ψ̃j i2X θj−2r ≤ R2
 
j∈N
 
⊆ a ∈ X : sup θj−r |ha, ψ̃j iX | ≤ R = CR
r
(X),
j∈N

in particular, (4.8) holds for any γ with supp(γ) ⊆ BR (H s (Td )). This shows Theorem 1.1.
Similar rates can also be obtained for this PDE model on a convex, polygonal domain D ⊂ T2
with Dirichlet boundary conditions. The argument uses the Riesz basis constructed in [26], but
is otherwise similar to the torus, for details see [43, Section 7.2]. Moreoever, using the RePU
activation function, (4.6) holds with convergence rate κ/(κ + 1 + 4CL log(q)), where κ is from (4.7)
and CL from (3.14b). The proof is similar to Theorem 4.2 using Theorem 3.16 instead of Theorem
3.15. Finally, rates for the fully-connected class G full
FN (σq , N ) can be established using Theorems
3.15 (ii) and 3.16 (ii).

21
5 Conclusions
In this work, we established convergence theorems for empirical risk minimization to learn map-
pings G0 between infinite dimensional Hilbert spaces. Our setting assumes given data in the form
of n input-output pairs, with an additive noise model. We discuss both the case of Gaussian white
noise and sub-Gaussian noise. Our main statistical result, Theorem 2.6, bounds the mean-squared
L2 -error E[kĜn −G0 k2L2 (γ) ] in terms of the approximation error, and algebraic rates in n depending
only on the metric entropy of G . This provides a general framework to study operator learning
from the perspective of sample complexity.
In the second part of this work, we applied our statistical results to a specific operator learning
architecture from [43], termed FrameNet. As our main application, we showed that holomor-
phic operators G0 can be learned with NN-based surrogates without suffering from the curse of
dimension, cf. Theorem 3.15. Such results have wide applicability, as the required holomorphy
assumption is well-established in the literature, and has been verified for a variety of models in-
cluding for example general elliptic PDEs [21, 22, 20, 41], Maxwell’s equations [49], the Calderon
projector [42], the Helmholtz equation [46, 93] and also nonlinear PDEs such as the Navier-Stokes
equations [24].

22
References
[1] Milton Abramowitz and Irene A Stegun. Handbook of mathematical functions with formulas,
graphs, and mathematical tables. Vol. 55. US Government printing office, 1948.
[2] Ben Adcock, Nick Dexter, and Sebastian Moraga. Optimal deep learning of holomorphic
operators between Banach spaces. 2024. url: http://arxiv.org/pdf/2406.13928.
[3] Ben Adcock et al. “Near-optimal learning of Banach-valued, high-dimensional functions via
deep neural networks”. In: Neural Networks 181 (2025), p. 106761. issn: 0893-6080. doi:
https://doi.org/10.1016/j.neunet.2024.106761. url: https://www.sciencedirect
.com/science/article/pii/S0893608024006853.
[4] Sergios Agapiou and Sven Wang. “Laplace priors and spatial inhomogeneity in Bayesian
inverse problems”. In: arXiv:2112.05679 (2021).
[5] Anima Anandkumar et al. “Neural Operator: Graph Kernel Network for Partial Differential
Equations”. In: ICLR 2020 Workshop on Integration of Deep Neural Models and Differential
Equations. 2019. url: https://openreview.net/forum?id=fg2ZFmXFO3.
[6] Ivo Babuška, Fabio Nobile, and Raúl Tempone. “A stochastic collocation method for elliptic
partial differential equations with random input data”. In: SIAM Rev. 52.2 (2010), pp. 317–
355. issn: 0036-1445,1095-7200. doi: 10.1137/100786356. url: https://doi.org/10.11
37/100786356.
[7] Markus Bachmayr et al. “Sparse polynomial approximation of parametric elliptic PDEs.
Part II: Lognormal coefficients”. In: ESAIM Math. Model. Numer. Anal. 51.1 (2017),
pp. 341–363. issn: 2822-7840,2804-7214. doi: 10.1051/m2an/2016051. url: https://doi
.org/10.1051/m2an/2016051.
[8] Andrew Barron, Lucien Birgé, and Pascal Massart. “Risk bounds for model selection via
penalization”. In: Probability theory and related fields 113 (1999), pp. 301–413.
[9] Sebastian Becker et al. “Learning the random variables in Monte Carlo simulations with
stochastic gradient descent: Machine learning for parametric PDEs and financial derivative
pricing”. In: Mathematical Finance 34.1 (2024), pp. 90–150.
[10] Kaushik Bhattacharya et al. “Model reduction and neural networks for parametric PDEs”.
In: SMAI J. Comput. Math. 7 (2021), pp. 121–157.
[11] Lucien Birgé and Pascal Massart. “Rates of convergence for minimum contrast estimators”.
In: Probability Theory and Related Fields 97 (1993), pp. 113–150.
[12] Ismaël Castillo and Richard Nickl. “Nonparametric Bernstein–von Mises theorems in Gaus-
sian white noise”. In: The Annals of Statistics 41.4 (2013). issn: 0090-5364. doi: 10.1214
/13-AOS1133.
[13] Gaëlle Chagny, Anouar Meynaoui, and Angelina Roche. “Adaptive nonparametric estima-
tion in the functional linear model with functional output”. In: (2022).
[14] Abdellah Chkifa, Albert Cohen, and Christoph Schwab. “Breaking the curse of dimension-
ality in sparse polynomial approximation of parametric PDEs”. In: J. Math. Pures Appl.
(9) 103.2 (2015), pp. 400–428. issn: 0021-7824,1776-3371. doi: 10.1016/j.matpur.2014
.04.009. url: https://doi.org/10.1016/j.matpur.2014.04.009.
[15] Abdellah Chkifa et al. “Discrete least squares polynomial approximation with random eval-
uations—application to parametric and stochastic elliptic PDEs”. In: ESAIM Math. Model.
Numer. Anal. 49.3 (2015), pp. 815–837. issn: 0764-583X. doi: 10.1051/m2an/2014050.
url: https://doi.org/10.1051/m2an/2014050.
[16] Abdellah Chkifa et al. “Sparse adaptive Taylor approximation algorithms for parametric
and stochastic elliptic PDEs”. In: ESAIM Math. Model. Numer. Anal. 47.1 (2013), pp. 253–
280. issn: 0764-583X. doi: 10.1051/m2an/2012027. url: https://doi.org/10.1051/m2
an/2012027.

23
[17] Ole Christensen. An Introduction to Frames and Riesz Bases [recurso electrónico]. Second
edition 2016. Applied and Numerical Harmonic Analysis. Cham, 2016. isbn: 978-3-319-
25613-9.
[18] Ludovica Cicci, Stefania Fresca, and Andrea Manzoni. “Deep-HyROMnet: A deep learning-
based operator approximation for hyper-reduction of nonlinear parametrized PDEs”. In:
Journal of Scientific Computing 93.2 (2022), p. 57.
[19] K. A. Cliffe et al. “Multilevel Monte Carlo methods and applications to elliptic PDEs with
random coefficients”. In: Comput. Vis. Sci. 14.1 (2011), pp. 3–15. issn: 1432-9360,1433-
0369. doi: 10.1007/s00791-011-0160-x. url: https://doi.org/10.1007/s00791-011
-0160-x.
[20] Albert Cohen and Ronald DeVore. “Approximation of high-dimensional parametric PDEs”.
In: Acta Numer. 24 (2015), pp. 1–159. issn: 0962-4929. doi: 10.1017/S0962492915000033.
url: https://doi.org/10.1017/S0962492915000033.
[21] Albert Cohen, Ronald DeVore, and Christoph Schwab. “Convergence rates of best N -term
Galerkin approximations for a class of elliptic sPDEs”. In: Found. Comput. Math. 10.6
(2010), pp. 615–646. issn: 1615-3375. doi: 10.1007/s10208-010-9072-2. url: http://d
x.doi.org/10.1007/s10208-010-9072-2.
[22] Albert Cohen, Ronald Devore, and Christoph Schwab. “Analytic regularity and polynomial
approximation of parametric and stochastic elliptic PDE’s”. In: Anal. Appl. (Singap.) 9.1
(2011), pp. 11–47. issn: 0219-5305. doi: 10.1142/S0219530511001728. url: http://dx.d
oi.org/10.1142/S0219530511001728.
[23] Albert Cohen, Giovanni Migliorati, and Fabio Nobile. “Discrete least-squares approxima-
tions over optimized downward closed polynomial spaces in arbitrary dimension”. In: Con-
str. Approx. 45.3 (2017), pp. 497–519. issn: 0176-4276. doi: 10.1007/s00365-017-9364-8.
url: https://doi.org/10.1007/s00365-017-9364-8.
[24] Albert Cohen, Christoph Schwab, and Jakob Zech. “Shape Holomorphy of the stationary
Navier-Stokes Equations”. In: SIAM J. Math. Analysis 50.2 (2018), pp. 1720–1752. doi:
https://doi.org/10.1137/16M1099406.
[25] Niccolò Dal Santo, Simone Deparis, and Luca Pegolotti. “Data driven approximation of
parametrized PDEs by reduced basis and neural networks”. In: Journal of Computational
Physics 416 (2020), p. 109550.
[26] Oleg Davydov and Rob Stevenson. “Hierarchical Riesz Bases for H s (Ω), 1 < s < 5/2”. In:
Constructive Approximation 22.3 (2005), pp. 365–394. issn: 1432-0940. doi: 10.1007/s00
365-004-0593-2.
[27] Tim De Ryck, Samuel Lanthaler, and Siddhartha Mishra. “On the approximation of func-
tions by tanh neural networks”. In: Neural Networks 143 (2021), pp. 732–750.
[28] R. DeVore, R. Howard, and C. Micchelli. “Optimal nonlinear approximation.” In: Manus-
cripta mathematica 63.4 (1989), pp. 469–478. url: http://eudml.org/doc/155392.
[29] R.A. DeVore and G.G. Lorentz. Constructive Approximation. Grundlehren der mathema-
tischen Wissenschaften. Springer Berlin Heidelberg, 1993. isbn: 9783540506270. url: http
s://books.google.de/books?id=cDqNW6k7_ZwC.
[30] Sjoerd Dirksen. “Tail bounds via generic chaining”. In: Electronic Journal of Probability
20 (2015). issn: 1083-6489. doi: 10.1214/EJP.v20-3760.
[31] Alireza Doostan and Houman Owhadi. “A non-adapted sparse approximation of PDEs with
stochastic inputs”. In: Journal of Computational Physics 230.8 (2011), pp. 3015–3034. issn:
0021-9991. doi: https://doi.org/10.1016/j.jcp.2011.01.002. url: https://www.sci
encedirect.com/science/article/pii/S0021999111000106.

24
[32] Dinh Dung et al. Analyticity and sparsity in uncertainty quantification for PDEs with
Gaussian random field inputs. Vol. 2334. Lecture Notes in Mathematics. Springer, Cham,
2023, pp. xv+205. isbn: 978-3-031-38383-0. doi: 10 . 1007 / 978 - 3 - 031 - 38384 - 7. url:
https://doi.org/10.1007/978-3-031-38384-7.
[33] D. E. Edmunds and H. Triebel. Function spaces, entropy numbers, differential operators.
Vol. 120. Cambridge Tracts in Mathematics. Cambridge University Press, Cambridge, 1996,
pp. xii+252. isbn: 0-521-56036-5. doi: 10.1017/CBO9780511662201.
[34] Dennis Elbrachter et al. “Deep Neural Network Approximation Theory”. In: IEEE Trans-
actions on Information Theory 67.5 (2021), pp. 2581–2623. issn: 0018-9448. doi: 10.1109
/TIT.2021.3062161.
[35] Utku Evci et al. The Difficulty of Training Sparse Neural Networks. 2019. url: http://a
rxiv.org/pdf/1906.10732.
[36] Sara van de Geer. Empirical Processes in M-Estimation. Cambridge U. Press, 2000.
[37] Sara van de Geer. “Least squares estimation with complexity penalties”. In: Mathematical
Methods of statistics 10 (2001), pp. 355–374.
[38] Evarist Giné and Richard Nickl. Mathematical foundations of infinite-dimensional statistical
models. Cambridge series in statistical and probabilistic mathematics. New York (NY):
Cambridge University Press, 2016. isbn: 1107043166.
[39] Evarist Giné and Richard Nickl. Mathematical foundations of infinite-dimensional statis-
tical models. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge
University Press, New York, 2016, pp. xiv+690.
[40] Sonja Greven and Fabian Scheipl. “A general framework for functional regression mod-
elling”. In: Statistical Modelling 17.1-2 (2017), pp. 1–35.
[41] Helmut Harbrecht, Michael Peters, and Markus Siebenmorgen. “Analysis of the domain
mapping method for elliptic diffusion problems on random domains”. In: Numerische Math-
ematik 134.4 (2016), pp. 823–856.
[42] Fernando Henrı́quez and Christoph Schwab. “Shape holomorphy of the Calderón projector
for the Laplacian in R 2”. In: Integral Equations and Operator Theory 93.4 (2021), p. 43.
[43] Lukas Herrmann, Christoph Schwab, and Jakob Zech. “Neural and spectral operator sur-
rogates: unified construction and expression rate bounds”. In: Advances in Computational
Mathematics 50.4 (2024), pp. 1–43. issn: 1019-7168. doi: 10.1007/s10444-024-10171-2.
url: https://link.springer.com/article/10.1007/s10444-024-10171-2.
[44] J.S. Hesthaven and S. Ubbiali. “Non-intrusive reduced order modeling of nonlinear problems
using neural networks”. In: Journal of Computational Physics 363 (2018), pp. 55–78. issn:
0021-9991. doi: https://doi.org/10.1016/j.jcp.2018.02.037. url: https://www.sci
encedirect.com/science/article/pii/S0021999118301190.
[45] Jan S. Hesthaven, Gianluigi Rozza, and Benjamin Stamm. Certified reduced basis meth-
ods for parametrized partial differential equations. SpringerBriefs in Mathematics. BCAM
SpringerBriefs. Springer, Cham; BCAM Basque Center for Applied Mathematics, Bilbao,
2016, pp. xiii+131. isbn: 978-3-319-22469-5. doi: 10.1007/978- 3- 319- 22470- 1. url:
https://doi.org/10.1007/978-3-319-22470-1.
[46] R. Hiptmair et al. “Large deformation shape uncertainty quantification in acoustic scatter-
ing”. In: Advances in Computational Mathematics 44.5 (Oct. 2018), pp. 1475–1518. issn:
1572-9044. doi: 10.1007/s10444-018-9594-8. url: https://doi.org/10.1007/s10444
-018-9594-8.
[47] Viet Ha Hoang and Christoph Schwab. “N-term Wiener chaos approximation rates for ellip-
tic PDEs with lognormal Gaussian random inputs”. In: Mathematical Models and Methods
in Applied Sciences 24.04 (2014), pp. 797–826. doi: 10.1142/S0218202513500681. eprint:
https://doi.org/10.1142/S0218202513500681. url: https://doi.org/10.1142/S021
8202513500681.

25
[48] Maarten V. de Hoop et al. “Convergence Rates for Learning Linear Operators from Noisy
Data”. In: SIAM/ASA Journal on Uncertainty Quantification 11.2 (2023), pp. 480–513.
doi: 10.1137/21M1442942. eprint: https://doi.org/10.1137/21M1442942. url: https
://doi.org/10.1137/21M1442942.
[49] Carlos Jerez-Hanckes, Christoph Schwab, and Jakob Zech. “Electromagnetic wave scatter-
ing by random surfaces: shape holomorphy”. In: Math. Models Methods Appl. Sci. 27.12
(2017), pp. 2229–2259. issn: 0218-2025. doi: 10.1142/S0218202517500439. url: https
://doi.org/10.1142/S0218202517500439.
[50] Padraig Kirwan. “Complexifications of multilinear and polynomial mappings”. Ph.D. thesis,
National University of Ireland, Galway. PhD thesis. 1997.
[51] Nikola B Kovachki, Samuel Lanthaler, and Andrew M Stuart. “Operator learning: Algo-
rithms and analysis”. In: arXiv preprint arXiv:2402.15715 (2024).
[52] Nikola B. Kovachki, Samuel Lanthaler, and Hrushikesh Mhaskar. Data Complexity Esti-
mates for Operator Learning. 2024. url: http://arxiv.org/pdf/2405.15992.
[53] Fabian Kröpfl, Roland Maier, and Daniel Peterseim. “Operator compression with deep
neural networks”. In: Advances in Continuous and Discrete Models 2022.1 (2022), p. 29.
[54] Gitta Kutyniok et al. “A theoretical analysis of deep neural networks and parametric
PDEs”. In: Constr. Approx. 55.1 (2022), pp. 73–125. issn: 0176-4276,1432-0940. doi: 1
0.1007/s00365-021-09551-4. url: https://doi.org/10.1007/s00365-021-09551-4.
[55] Samuel Lanthaler. “Operator learning with PCA-Net: upper and lower complexity bounds”.
In: Journal of Machine Learning Research 24.318 (2023), pp. 1–67. url: http://jmlr.or
g/papers/v24/23-0478.html.
[56] Samuel Lanthaler, Siddhartha Mishra, and George E. Karniadakis. “Error estimates for
DeepONets: a deep learning framework in infinite dimensions”. In: Transactions of Math-
ematics and Its Applications 6.1 (2022). doi: 10.1093/imatrm/tnac001.
[57] Bo Li. “Better Approximations of High Dimensional Smooth Functions by Deep Neural
Networks with Rectified Power Units”. In: Communications in Computational Physics 27.2
(2020), pp. 379–411. issn: 1815-2406. doi: 10.4208/cicp.OA-2019-0168.
[58] Bo Li, Shanshan Tang, and Haijun Yu. “PowerNet: Efficient representations of polynomi-
als and smooth functions by deep neural networks with rectified power units”. In: arXiv
preprint arXiv:1909.05136 (2019).
[59] Zongyi Li et al. Fourier Neural Operator for Parametric Partial Differential Equations. cite
arxiv:2010.08895. 2020. url: http://arxiv.org/abs/2010.08895.
[60] Hao Liu et al. Deep Nonparametric Estimation of Operators between Infinite Dimensional
Spaces. Jan. 1, 2022. url: http://arxiv.org/pdf/2201.00217v1.
[61] Lu Lu et al. “Learning nonlinear operators via DeepONet based on the universal approx-
imation theorem of operators”. In: Nature Machine Intelligence 3.3 (Mar. 2021), pp. 218–
229. issn: 2522-5839. doi: 10.1038/s42256-021-00302-5. url: https://doi.org/10.10
38/s42256-021-00302-5.
[62] H. N. Mhaskar and N. Hahm. “Neural networks for functional approximation and system
identification”. In: Neural computation 9.1 (1997), pp. 143–159. issn: 0899-7667. doi: 10.1
162/neco.1997.9.1.143.
[63] Hrushikesh N Mhaskar. “Neural networks for optimal approximation of smooth and analytic
functions”. In: Neural computation 8.1 (1996), pp. 164–177.
[64] Jeffrey S Morris and Raymond J Carroll. “Wavelet-based functional mixed models”. In:
Journal of the Royal Statistical Society Series B: Statistical Methodology 68.2 (2006),
pp. 179–199.

26
[65] Gustavo A. Muñoz, Yannis Sarantopoulos, and Andrew Tonge. “Complexifications of real
Banach spaces, polynomials and multilinear maps”. In: Studia Math. 134.1 (1999), pp. 1–
33. issn: 0039-3223.
[66] Nicholas H Nelsen and Andrew M Stuart. “Operator learning using random features: A
tool for scientific computing”. In: SIAM Review 66.3 (2024), pp. 535–571.
[67] R. Nickl, S. van de Geer, and S. Wang. “Convergence rates for Penalised Least Squares
Estimators in PDE-constrained regression problems”. In: SIAM J. Uncert. Quant. 8 (2020).
[68] Richard Nickl. Bayesian Non-linear Statistical Inverse Problems. EMS press, 2023.
[69] Richard Nickl. “Bernstein–von Mises theorems for statistical inverse problems I: Schrödinger
equation”. In: Journal of the European Mathematical Society 22.8 (2020), pp. 2697–2750.
issn: 1435-9855. doi: 10.4171/JEMS/975.
[70] Richard Nickl. “Donsker-type theorems for nonparametric maximum likelihood estimators”.
In: Probability Theory and Related Fields 138.3-4 (2007). issn: 0178-8051. doi: 10.1007/s
00440-006-0031-4.
[71] Richard Nickl and Sven Wang. “On polynomial-time computation of high-dimensional pos-
terior measures by Langevin-type algorithms”. In: Journal of the European Mathematical
Society, to appear (2020).
[72] F. Nobile, R. Tempone, and C. G. Webster. “A Sparse Grid Stochastic Collocation Method
for Partial Differential Equations with Random Input Data”. In: SIAM Journal on Numer-
ical Analysis 46.5 (2008), pp. 2309–2345. doi: 10.1137/060663660. eprint: https://doi
.org/10.1137/060663660. url: https://doi.org/10.1137/060663660.
[73] Thomas O’Leary-Roseberry et al. “Derivative-Informed Neural Operator: An efficient frame-
work for high-dimensional parametric derivative learning”. In: Journal of Computational
Physics 496 (2024), p. 112555. issn: 0021-9991. doi: https://doi.org/10.1016/j.jcp.2
023.112555. url: https://www.sciencedirect.com/science/article/pii/S00219991
23006502.
[74] Thomas O’Leary-Roseberry et al. “Derivative-informed projected neural networks for high-
dimensional parametric maps governed by PDEs”. In: Computer Methods in Applied Me-
chanics and Engineering 388 (2022), p. 114199. issn: 0045-7825. doi: https://doi.org/1
0.1016/j.cma.2021.114199. url: https://www.sciencedirect.com/science/article
/pii/S0045782521005302.
[75] J. A. A. Opschoor, Ch. Schwab, and J. Zech. “Exponential ReLU DNN Expression of Holo-
morphic Maps in High Dimension”. In: Constructive Approximation 55.1 (2022), pp. 537–
582. issn: 1432-0940. doi: 10.1007/s00365-021-09542-5. url: https://link.springer
.com/article/10.1007/s00365-021-09542-5#citeas.
[76] J. A. A. Opschoor, Ch. Schwab, and J. Zech. “Exponential ReLU DNN expression of
holomorphic maps in high dimension”. In: Constr. Approx. 55.1 (2022), pp. 537–582. issn:
0176-4276. doi: 10.1007/s00365-021-09542-5. url: https://doi.org/10.1007/s0036
5-021-09542-5.
[77] Joost A. A. Opschoor, Philipp C. Petersen, and Christoph Schwab. “Deep ReLU net-
works and high-order finite element methods”. In: Analysis and Applications 18.05 (2020),
pp. 715–770. issn: 0219-5305. doi: 10.1142/S0219530519410136.
[78] Joost A. A. Opschoor, Christoph Schwab, and Jakob Zech. “Deep learning in high dimen-
sion: ReLU neural network expression for Bayesian PDE inversion”. In: Optimization and
control for partial differential equations—uncertainty quantification, open and closed-loop
control, and shape optimization. Vol. 29. Radon Ser. Comput. Appl. Math. De Gruyter,
Berlin, 2022, pp. 419–462. isbn: 978-3-11-069596-0. doi: 10.1515/9783110695984- 015.
url: https://doi.org/10.1515/9783110695984-015.
[79] Houman Owhadi and Gene Ryan Yoo. “Kernel flows: From learning kernels from data into
the abyss”. In: Journal of Computational Physics 389 (2019), pp. 22–47.

27
[80] Philipp Petersen, Mones Raslan, and Felix Voigtlaender. “Topological Properties of the Set
of Functions Generated by Neural Networks of Fixed Size”. In: Foundations of Computa-
tional Mathematics 21.2 (2021), pp. 375–444. issn: 1615-3375. doi: 10.1007/s10208-020
-09461-0. url: https://link.springer.com/article/10.1007/s10208-020-09461-0.
[81] Philipp Petersen and Felix Voigtlaender. “Optimal approximation of piecewise smooth func-
tions using deep ReLU neural networks”. In: Neural networks : the official journal of the
International Neural Network Society 108 (2018), pp. 296–330. doi: 10.1016/j.neunet.2
018.08.019.
[82] Allan Pinkus. “Approximation theory of the MLP model in neural networks”. In: Acta
numerica, 1999. Vol. 8. Acta Numer. Cambridge Univ. Press, Cambridge, 1999, pp. 143–
195. isbn: 0-521-77088-2. doi: 10.1017/S0962492900002919. url: https://doi.org/10
.1017/S0962492900002919.
[83] Tomaso Poggio et al. “Why and when can deep-but not shallow-networks avoid the curse
of dimensionality: a review”. In: International Journal of Automation and Computing 14.5
(2017), pp. 503–519.
[84] David Pollard. Convergence of Stochastic Processes. New York, NY: Springer New York,
1984. isbn: 978-1-4612-9758-1. doi: 10.1007/978-1-4612-5254-2.
[85] Alfio Quarteroni, Andrea Manzoni, and Federico Negri. Reduced basis methods for par-
tial differential equations. Vol. 92. Unitext. An introduction, La Matematica per il 3+2.
Springer, Cham, 2016, pp. xi+296. isbn: 978-3-319-15430-5. doi: 10.1007/978-3-319-15
431-2. url: https://doi.org/10.1007/978-3-319-15431-2.
[86] Bogdan Raonic et al. “Convolutional neural operators”. In: ICLR 2023 Workshop on
Physics for Machine Learning. 2023.
[87] Holger Rauhut and Christoph Schwab. “Compressive sensing Petrov-Galerkin approxima-
tion of high-dimensional parametric operator equations”. In: Math. Comp. 86.304 (2017).
Report 2014-14, Seminar for Applied Mathematics, ETH Zürich, pp. 661–700. issn: 0025-
5718. doi: 10.1090/mcom/3113. url: http://dx.doi.org/10.1090/mcom/3113.
[88] Markus Reiß. “Asymptotic equivalence for nonparametric regression with multivariate and
random design”. In: Ann. Statist. 36.4 (2008), pp. 1957–1982. issn: 0090-5364. doi: 10.12
14/07-AOS525. url: http://dx.doi.org/10.1214/07-AOS525.
[89] Johannes Schmidt-Hieber. “Supplement to “Nonparametric regression using deep neural
networks with ReLU activation function””. In: The Annals of Statistics 48.4 (2020). issn:
0090-5364. doi: 10.1214/19-AOS1875SUPP.
[90] C. Schwab and A. M. Stuart. “Sparse deterministic approximation of Bayesian inverse
problems”. In: Inverse Problems 28.4 (2012), pp. 045003, 32. issn: 0266-5611,1361-6420.
doi: 10.1088/0266-5611/28/4/045003. url: https://doi.org/10.1088/0266-5611/28
/4/045003.
[91] Christoph Schwab and Jakob Zech. “Deep learning in high dimension: neural network ex-
pression rates for analytic functions in L2 (Rd , γd )”. In: SIAM/ASA J. Uncertain. Quantif.
11.1 (2023), pp. 199–234. issn: 2166-2525. doi: 10.1137/21M1462738. url: https://doi
.org/10.1137/21M1462738.
[92] Christoph Schwab and Jakob Zech. “Deep learning in high dimension: neural network ex-
pression rates for generalized polynomial chaos expansions in UQ”. In: Anal. Appl. (Singap.)
17.1 (2019), pp. 19–55. issn: 0219-5305,1793-6861. doi: 10.1142/S0219530518500203. url:
https://doi.org/10.1142/S0219530518500203.
[93] Euan A Spence and Jared Wunsch. “Wavenumber-explicit parametric holomorphy of Helm-
holtz solutions in the context of uncertainty quantification”. In: SIAM/ASA Journal on
Uncertainty Quantification 11.2 (2023), pp. 567–590.
[94] Andrew M. Stuart. “Inverse problems: a Bayesian perspective”. In: Acta Numer. 19 (2010),
pp. 451–559. issn: 0962-4929. doi: 10.1017/S0962492910000061.

28
[95] Taiji Suzuki. “Adaptivity of deep ReLU network for learning in Besov and mixed smooth
Besov spaces: optimal rate and curse of dimensionality”. In: he 7th International Conference
on Learning Representations (ICLR2019). Vol. 7. 2019.
[96] Michel Talagrand. The generic chaining: Upper and lower bounds for stochastic processes.
Berlin and New York: Springer, 2005. isbn: 3-540-24518-9. doi: 10.1007/3-540-27499-5.
[97] Michel Talagrand. Upper and lower bounds for stochastic processes. Vol. 60. Springer, 2014.
[98] Hans Triebel. Interpolation theory, function spaces, differential operators. 2., rev. and enl.
ed. Heidelberg and Leipzig: Barth, 1995. isbn: 3335004205.
[99] Alexandre B Tsybakov. Introduction to Nonparametric Estimation. Springer series in statis-
tics. Dordrecht: Springer, 2009. doi: 10.1007/b13794. url: https://cds.cern.ch/reco
rd/1315296.
[100] Sara A. de van Geer. Applications of empirical process theory. Digitally printed version.
Cambridge series in statistical and probabilistic mathematics. Cambridge: Cambridge Uni-
versity Press, 2009. isbn: 052165002X.
[101] Roman Vershynin. High-dimensional probability: An introduction with applications in data
science. Vol. 47. Cambridge series in statistical and probabilistic mathematics. Cambridge,
United Kingdom and New York, NY: Cambridge University Press, 2018. isbn: 1108415199.
[102] Dongbin Xiu and George Em Karniadakis. “The Wiener–Askey Polynomial Chaos for
Stochastic Differential Equations”. In: SIAM Journal on Scientific Computing 24.2 (2002),
pp. 619–644. doi: 10.1137/S1064827501387826. eprint: https://doi.org/10.1137/S10
64827501387826. url: https://doi.org/10.1137/S1064827501387826.
[103] Dmitry Yarotsky. “Error bounds for approximations with deep ReLU networks”. In: Neu-
ral networks : the official journal of the International Neural Network Society 94 (2017),
pp. 103–114. doi: 10.1016/j.neunet.2017.07.002.
[104] Jakob Zech. “Sparse-Grid Approximation of High-Dimensional Parametric PDEs”. PhD
thesis. ETH Zurich, 2018. doi: 10.3929/ethz-b-000340651.

29
Appendices
A Auxiliary Probabilistic Lemmas
We recall the classical Bernstein inequality.
Lemma A.1 (Bernstein’s inequality). Let X1 , . . . , Xn be independent, centered RVs with finite
second moments E[Xi2 ] < ∞ and uniform bound |Xi | ≤ M for i = 1, . . . , n. Then it holds
n
!  
X t2
P Xi ≥ t ≤ 2 exp − Pn , t ≥ 0.
i=1
2 i=1 E[Xi2 ] + 32 M t

For a proof of Bernstein’s inequality, see [84, page 193].


Lemma A.2 (Basic Inequality). Let εi , i = 1, . . . , n be i.i.d. white noise or sub-Gaussian noise.
Then it holds for all G∗ ∈ G
n
2σ X
kĜn − G0 k2n ≤ kG∗ − G0 k2n + hεi , Ĝn (xi ) − G∗ (xi )iY .
n i=1

Proof. Let G∗ ∈ G be arbitrary. Using the definition of Ĝn from (2.2), it holds
kĜn − G0 k2n
n
1X
= kĜn (xi )k2Y − 2hG0 (xi ), Ĝn (xi )iY + kG0 (xi )k2Y
n i=1
n
1X
= kĜn (xi )k2Y − 2hG0 (xi ) + σεi , Ĝn (xi )iY + 2σhεi , Ĝn (xi )iY + kG0 (xi )k2Y
n i=1
n
1X ∗
≤ kG (xi )k2Y − 2hG0 (xi ) + σεi , G∗ (xi )iY + 2σhεi , Ĝn (xi )iY + kG0 (xi )k2Y
n i=1
n
1X ∗
= kG (xi )k2Y − 2hG0 (xi ), G∗ (xi )iY + kG0 (xi )k2Y + 2σhεi , Ĝn (xi ) − G∗ (xi )iY
n i=1
n
2σ X
= kG∗ − G0 k2n + hεi , Ĝn (xi ) − G∗ (xi )iY ,
n i=1

which shows the claim.


Next, we state a generic chaining result from [30, Theorem 3.2], originally derived for finite
index sets, for countable index sets T . We restrict ourselves to the case of real-valued stochastic
processes.
Lemma A.3. Let T be a countable index set and d : T ×T → [0, ∞) a pseudometric. Furthermore,
let (Xt )t∈T be an R-valued stochastic process such that for some α > 0 and all s, t ∈ T ,
P (|Xt − Xs | ≥ ud(t, s)) ≤ 2 exp (−uα ) , u ≥ 0. (A.1)
Then, there exists a constant M > 0 depending only on α, such that for all t0 ∈ T it holds that
    α
u
P sup |Xt − Xt0 | ≥ M Jα (T, d) + u sup d(s, t) ≤ exp − , u ≥ 1,
t∈T s,t∈T α
where Jα denotes the metric entropy integral
Z ∞
1
Jα (T, d) = (log N (T, d, u)) α du.
0

30
Proof. Since T is countable, we can write T = {tj : j ∈ N}. Using this, we define Tn = {tj : j ≤ n}
for n ∈ N. Since Tn is finite [30, Theorem 3.2, Eq. (3.2) and its proof] gives M̃ > 0 s.t.
   p1  
1
p
E sup |Xt − Xt0 | ≤ M̃ Jα (Tn , d) + sup d(s, t)p a (A.2)
t∈Tn s,t∈Tn

for all p ≥ 1, t0 ∈ T and n ∈ N. In (A.2), we used [30, Eq. (2.3)] to upper bound the γα -functionals
by the respective metric entropy integrals Jα . The monotone convergence theorem shows (A.2) for
T in the limit n → ∞. Applying [30, Lemma A.1] then gives the claim with M = exp(α−1 )M̃ .
We now use Lemma A.3 to establish the following concentration bound, which is tailored
towards the empirical processes appearing in our proofs, cf. Lemma A.2. Note that this lemma
can be viewed as a generalization of the key chaining Lemma 3.12 in [71] to Y-valued regression
functions; the proof follows along the same lines.
Lemma A.4 (Chaining Lemma). Let X, Y be separable Hilbert spaces, and suppose Θ is a (possibly
uncountable) set parameterizing a class of maps

H = {hθ : X → Y, θ ∈ Θ} .

Consider an empirical process of the form


n
1X
Zn (θ) = hhθ (xi ), εi iY ,
n i=1

where x1 , . . . , xn ∈ X are fixed elements and εi , . . . , εn are either (i) i.i.d. Gaussian white noise
processes indexed by Y, or (ii) i.i.d. sub-Gaussian random variables in Y with parameter 1.
Recall the empirical seminorm k · kn . Suppose that

sup khθ kn =: U < ∞, (A.3)


θ∈Θ

and define the metric entropy integral


Z Up
J(H, dn ) = log N (H, dn , τ ) dτ, dn (θ, θ′ ) = khθ − hθ′ kn .
0

Let the space (Θ, dn ) be separable. Then supθ∈Θ Zn (θ) is measurable and there exists a universal
constant CCh > 0 such that for all δ > 0 with

nδ ≥ CCh J(H, dn ), (A.4)

it holds
   
8nδ 2
P sup |Zn (θ)| ≥ δ ≤ exp − 2 2 .
θ∈Θ CCh U

Proof. In both of the cases, we will apply [30, Theorem 3.2], which we stated in Lemma A.3.
White noise case. Let θ, θ′ ∈ Θ be arbitrary. Since εi , i = 1, . . . , n are independent white
noise processes, we have Zn (θ) − Zn (θ′ ) ∼ N(0, n−1 khθ − hθ′ k2n ), i.e. the increments of Zn are
normal (recall that x is regarded as fixed here). Thus,

P (|Zn (θ) − Zn (θ′ )| ≥ t) ≤ 2 exp −nt2 /(2khθ − hθ′ k2n ) , t ≥ 0,

and
√ !  
′ 2tdn (θ, θ′ ) dn (θ, θ′ )2 t2 2

P |Zn (θ) − Zn (θ )| ≥ √ ≤ 2 exp − = 2 exp −t , t ≥ 0, (A.5)
n khθ − hθ′ k2n

31
√ √
which verifies the assumption (A.1) for α = 2 and d¯n := 2dn / n.
¯ ′
√ Eq. (A.5) shows√ that the process Zn (θ) is sub-Gaussian w.r.t. the pseudometric dn (θ, θ ) =
2khθ − hθ′ kn / n. Therefore [38, Theorem 2.3.7 (a)] yields that Zn (θ) is sample bounded and
uniformly sample continuous. Since (Θ, dn ) is separable, so is (Θ, d¯n ). Thus it holds

sup Zn (θ) = sup Zn (θ) a.s., (A.6)


θ∈Θ θ∈Θ0

where Θ0 ⊂ Θ denotes a countable, dense subset. The right hand side of (A.6) is measurable as
a countable supremum. Therefore also the left hand side supθ∈Θ Zn (θ) is measurable. Applying
Lemma A.3 (to the countable set Θ0 ) and using (A.6) gives that for some universal constant M
and all θ† ∈ Θ,
    2
¯ tU t
P sup |Zn (θ) − Zn (θ† )| ≥ M J(H, dn ) + √ ≤ exp − , t ≥ 1.
θ∈Θ n 2

Due to (A.3) it holds N (H, dn , δ) = 1 for all δ ≥ U , and thus



√ √ 2U
N (H, d¯n , τ ) = N (H, dn , nτ / 2) = 1 ∀τ ≥ √ .
n
√ √
Substituting ρ = nτ / 2 we get
Z √ √
q 2U/ n
J(H, d¯n ) = log N (H, d¯n , τ ) dτ
0
√ Z Us  √ 
2 ¯ 2ρ
=√ log N H, dn , √ dρ
n 0 n
√ Z U √
2 p 2J(H, dn )
=√ log N (H, dn , ρ) dρ = √
n 0 n

and therefore
√ !  2
2M t
P sup |Zn (θ) − Zn (θ† )| ≥ √ (J(H, dn ) + tU ) ≤ exp − , t ≥ 1. (A.7)
θ∈Θ n 2

Since Zn (θ† ) ∼ N(0, n−1 khθ† k2n ), it holds for all θ† ∈ Θ


     
M tU −M 2 U 2 t2 M 2 t2
P |Zn (θ† )| ≥ √ ≤ exp ≤ exp − , t ≥ 0. (A.8)
n 2khθ† k2n 2

Combining (A.7) and (A.8) yields for t ≥ 1


 
3M
P sup |Zn (θ)| ≥ √ (J(H, dn ) + tU )
θ∈Θ n
√ !  
2M M
≤ P sup |Zn (θ) − Zn (θ† )| ≥ √ (J(H, dn ) + tU ) + P |Zn (θ† )| ≥ √ (J(H, dn ) + tU )
θ∈Θ n n
 2  
−t M tU
≤ exp + P |Zn (θ† )| ≥ √
2 n
 2  2 2

−t −M t
≤ exp + exp
2 2
 2
−t
= 2 exp ,
2

32

where we√assumed without loss of generality that M ≥ 1. Substitute δ = 3M/ n (J(H, dn ) + tU ),
i.e. tp
= ( nδ/3M − J(H, dn ))/U√. Because N (H, dn , τ ) ≥ 2 for τ ≤ U/2, we have that J(H, dn ) ≥
U/2 log(2) > U/4. Therefore nδ ≥ 15M J(H, dn ) := CCh J(H, dn ) implies t ≥ 1 and thus
  √ 2
!  
( nδ/(3M ) − J(H, dn )) 8nδ 2
P sup |Zn (θ)| ≥ δ ≤ 2 exp − ≤ 2 exp − 2 2 , (A.9)
θ∈Θ 2U 2 CCh U

which gives the claim for the white noise case.


Sub-Gaussian case. For θ, θ′ ∈ Θ it holds
n
1X
Zn (θ) − Zn (θ′ ) = hεi , hθ (xi ) − hθ′ (xi )iY .
n i=1

Since the centered RVs n−1 hεi , hθ (xi )−hθ′ (xi )iY are i.i.d. sub-Gaussian with parameter n−1 khθ (xi )
− hθ′ (xi )kY , the ‘generalized’ Hoeffding inequality for sub-Gaussian variables (see [101, Theorem
2.6.2]) implies that for some universal
√ constant c > 0 the increment Zn (θ)−Zn (θ′ ) is sub-Gaussian
with parameter ckhθ − hθ kn / n. Therefore

√ !
2tc 
P |Zn (θ) − Zn (θ )| ≥ √ dn (θ, θ ) ≤ 2 exp −t2 , t ≥ 0.
′ ′
n

From here on, the proof is similar to the white noise case and we obtain (A.9) by absorbing c into
CCh .
iid
G, k · k∞,supp(γ) , δ). Let (δ̃n )n∈N
Lemma A.5. Let xi ∼ γ and consider the entropy H(δ) = H(G
be a positive sequence with

nδ̃n2 ≥ 6F∞
2
H(δ̃n ).

Then for R ≥ max{8δ̃n , 18F∞ / n}, it holds that
   
nR2
P kĜn − G0 kL2 (γ) ≥ R, kĜn − G0 kL2 (γ) ≥ 2kĜn − G0 kn ≤ 2 exp − 2
.
320F∞

Proof. In the following we write k · k∞ = k · k∞,supp(γ) . Let F = {G − G0 : G ∈ G }. Then for


R ≥ 0 it holds that
 
PG0 kĜn − G0 kL2 (γ) ≥ R, kĜn − G0 kL2 (γ) ≥ 2kĜn − G0 kn
!
≤ PG0 sup kF kL2 (γ) ≥ 2kF kn . (A.10)
F , kF kL2 (γ) ≥R
F ∈F

For s ∈ N, we define F s = {F ∈ F : sR ≤ kF kL2(γ) ≤ (s + 1)R}. As a union of disjoint events, it


holds
! ∞  
X
PG0 sup kF kL2 (γ) ≥ 2kF kn = PG0 sup kF kL2 (γ) ≥ 2kF kn
F , kF kL2 (γ) ≥R
F ∈F s=1 Fs
F ∈F

X∞  
sR
≤ PG0 sup kF kL2 (γ) − kF kn ≥ ,
s=1 Fs
F ∈F 2

where we used for F ∈ F s and s ∈ N


kF kL2 (γ) sR
kF kL2(γ) ≥ 2kF kn =⇒ kF kL2 (γ) − kF kn ≥ ≥ .
2 2

33
Now consider some covering F ∗s = {Fs,j }N j=1 with N = N (F F s , k · k∞ , R/8). Thus for arbitrary
s ∈ N and F ∈ F s there exists Fs,j ∈ F ∗s s.t. kF − Fs,j k∞ ≤ R/8. We get

kF kL2 (γ) − kF kn ≤ kF kL2 (γ) − kFs,j kL2 (γ) + kFs,j kL2 (γ) − kFs,j kn
+ |kFs,j kn − kF kn |
R
≤ + kFs,j kL2 (γ) − kFs,j kn .
4
Therefore

X  
sR
PG0 sup kF kL2 (γ) − kF kn ≥
s=1 Fs
F ∈F 2
X∞  
sR
≤ PG0 max kFs,j kL2 (γ) − kFs,j kn ≥
s=1
j=1,...,N 4
X∞  
sR
≤ F s , k · k∞ , R/8) max PG0
N (F kFs,j kL2 (γ) − kFs,j kn ≥ , (A.11)
s=1
j=1,...,N 4

where we used sR/2 − R/4 ≥ sR/4 for s ∈ N. Since Fs,j ∈ F s , we have kFs,j kL2 (γ) ≥ sR and
therefore

kFs,j k2L2 (γ) − kFs,j k2n


kFs,j kL2 (γ) − kFs,j kn =
kFs,j kL2 (γ) + kFs,j kn
1
≤ kFs,j k2L2 (γ) − kFs,j k2n .
sR
Inserting into (A.11) gives

X  
sR
F s , k · k∞ , R/8) max PG0
N (F kFs,j kL2 (γ) − kFs,j kn ≥
s=1
j=1,...,N 4
X∞  
s2 R 2
≤ F s , k · k∞ , R/8) max PG0
N (F kFs,j k2L2 (γ) − kFs,j k2n ≥ .
s=1
j=1,...,N 4

Define the variables


1 
Yi = kFs,j k2L2 (γ) − kFs,j (xi )k2Y , i = 1, . . . , n.
n
It holds for all i = 1, . . . , n

E[Yi ] = 0,
2
2F∞
|Yi | ≤ ,
n 
1  2 
E[Yi2 ] = E kF k 2
s,j L2 (γ) − kF (x
s,j i Y )k 2
n2
1 h i
4 2 2 4
= E kF s,j k 2
L (γ) − 2kFs,j k 2
L (γ) kF s,j (xi )k Y + kFs,j (xi )k Y
n2
1  
= E kFs,j (xi )k4Y − kFs,j k4L2 (γ)
n2
2
F∞
≤ kFs,j k2L2 (γ) .
n2

34
Applying Bernstein’s inequality (Lemma A.1) for the variables Yi yields

X  
s2 R 2
F s , k · k∞ , R/8) max PG0
N (F kFs,j k2L2 (γ) − kFs,j k2n ≥
s=1
j=1,...,N 4

!
X ns4 R4
≤ F s , k · k∞ , R/8) max exp −
N (F s2 R2
32F∞2 (kF k2
s=1
j=1,...,N s,j L2 (γ) + 6 )
X∞  
ns2 R2
≤ F s , k · k∞ , R/8) −
exp H(F 2
,
s=1
160F∞

where we used kFs,j k2L2 (γ) ≤ (s + 1)2 R2 ≤ 4s2 R2 for all j = 1, . . . , N and s ∈ N, since Fj,s ∈ F s .
Since H(δ)/δ 2 is non-increasing in δ and R ≥ 8δ̃n , we have
 2
ns2 R2 n R
2
≥ 2
G, k · k∞ , R/8) ≥ 2H(F
≥ 2H(G F s , k · k∞ , R/8).
160F∞ 3F∞ 8

Therefore we get

X  
ns2 R2
exp H(FF s , k · k∞ , R/8) − 2
s=1
160F∞
X∞   X ∞    
ns2 R2 1 nR2 nR2
≤ exp − 2
≤ exp − < 2 exp − . (A.12)
s=1
320F∞ s=1
s2 2
320F∞ 2
320F∞

Note that for x, y ≥ 1, we have exp(−xy)


√ ≤ exp(−y)/x. PApplying this with x = s2 ≥ 1 and
y = nR /(320F∞ ) ≥ 1 for R ≥ 18F∞ / n, together with ∞
2 2 2 2
s=1 1/s = π /6 < 2, gives (A.12).
Combining (A.10)–(A.12) shows the claim.
Lemma A.6. Consider the sub-Gaussian noise model, i.e. suppose kεi kY are i.i.d. sub-Gaussian
with parameter 1. Abbreviate H(δ) = H(G G , k·k∞,supp(γ) , δ). Then there exists a universal constant
C > 0 s.t. for all positive sequences (δn )n∈N with

 
δn2
nδn4 ≥C σ2 2 2
F∞ H ,
8σ 2 + δn2

all G∗ ∈ G and R ≥ max{δn , 2kG∗ − G0 kn }, it holds for every x = (x1 , ..., xn ) ∈ Xn
   
x nR4
PG0 kĜn − G0 kn ≥ R ≤ 4 exp − 2 2 2 )
. (A.13)
C σ (1 + F∞

Proof. In the following we write k · k∞ = k · k∞,supp(γ) . For R2 ≥ 2kG∗ − G0 k2n , we use the basic
inequality (Lemma A.2) to obtain
   
R2
PxG0 kĜn − G0 k2n ≥ R2 ≤ PxG0 kĜn − G0 k2n − kG∗ − G0 k2n ≥ .
2
n
!
x 1X ∗ R2
≤ PG0 hεi , Ĝn (xi ) − G (xi )iY ≥ . (A.14)
n i=1 4σ

For R > 0, let G ∗ = {G − G∗ , G ∈ G } and (Gj )N G∗ , k · k∞ , R2 /(8σE[kεi kY ] + R2 ))


j=1 with N = N (G

denote a minimal k · k∞ -cover of G .

35
It holds for R > 0
n
!
1X R2
PxG0 hεi , Ĝn (xi ) − G∗ (xi )iY ≥
n i=1 4σ
n
!
1X R2
≤ PxG0 sup hεi , G(xi ) − G∗ (xi )iY ≥
G n i=1
G∈G 4σ
n n
!
1X 1X R2
≤ PxG0 sup hεi , G(xi ) − G∗ (xi ) − Gj ∗ (xi )iY + max hεi , Gj (xi )iY ≥
G∈GG n i=1 j=1,...N n
i=1

n
!
1X R2
≤ PxG0 sup hεi , G(xi ) − G∗ (xi ) − Gj ∗ (xi )iY ≥
G∈GG n i=1 8σ
| {z }
(i)
n
!
1 X R2
+ PxG0 max hεi , Gj (xi )iY ≥ .
j=1,...N n i=1

| {z }
(ii)

We estimate the terms (i) and (ii) separatly. For (i), use kG−G∗ −Gj ∗ k∞ ≤ R2 /(8σE[kεi kY ]+R2 )
for all G ∈ G and estimate
n
!
x 1X ∗ R2
PG0 sup hεi , G(xi ) − G (xi ) − Gj ∗ (xi )iY ≥
G n i=1
G∈G 8σ
n
!
1X R2 R2
≤ PxG0 kεi kY ≥
n i=1 8σE[kεi kY ] + R2 8σ
n
!  
1X R2 nR4
≤ PxG0 kεi kY − E[kεi kY ] ≥ ≤ 2 exp − 2 2 ,
n i=1 8σ C σ

where we used the Hoeffding inequality for sub-Gaussian random variables [101, Theorem 2.6.2].
For (ii), it holds
n
!
x 1X R2
PG0 max hεi , Gj (xi )iY ≥
j=1,...N n 8σ
i=1
n
!
x 1X R2
≤ N max PG0 hεi , Gj (xi )iY ≥
j=1,...N n i=1 8σ
     
2nR4 R2 2nR4
≤ 2N exp − 2 2 2 ≤ 2 exp H − ,
C σ F∞ 8σ 2 + R2 C 2 σ 2 F∞
2

where we used E[kεi kY ] ≤ σ (Cauchy-Schwarz) and [101, Theorem 2.6.2] for Yi = n−1 hεi , Gj (xi )iY ,
i = 1, . . . , n.
Since H(δ)/δ 2 is non-increasing, also H(δ/(8σ 2 + δ))/δ 2 is non-increasing and thus it holds for
R ≥ δn
 
nR4 R2
≥ H .
C 2 σ 2 F∞
2 8σ 2 + R2
This gives for R ≥ δn
 
nR4
(ii) ≤ 2 exp − . (A.15)
C σ 2 F∞
2 2

Combining (A.14)–(A.15) gives the result.

36
B Proofs of Section 2
B.1 Proof of Theorem 2.5
B.1.1 Existence and Measurability of Ĝn
Proof of Theorem 2.5 (i). White noise case. For i = 1, . . . , n, denote the probability space of
the RVs xi and the white noise processes εi as (Ω, Σ, P). Furthermore, equip the Hilbert spaces
X and Y with Borel σ-algebras BX and BY and the space G with the Borel σ-algebra BG . Then,
there exists an orthonormal basis (ψj )j∈N of Y and i.i.d. Gaussian variables Zj ∼ N(0, 1) s.t.

εi : Ω × Y → R,

X
εi (ω, y) = hy, ψj iY Zj (ω)
j=1

for i = 1, . . . , n, see also [38, Example 2.1.11]


Recall the noise level σ > 0. For i = 1, . . . , n, we define

ui : Ω × G → R,
ui (ω, G) = 2σεi (ω, G(xi (ω))) + 2hG0 (xi (ω)), G(xi (ω))iY − kG(xi (ω))k2Y . (B.1)
Pn
We aim to apply [70, Proposition 5] to u := 1/n i=1 ui in order to get existence and measurability
of Ĝn in (2.2).
Per assumption, the metric space (G G , k·k) is compact. We show that ui from (B.1) is measurable
in the first component and continuous in the second component for all i = 1, . . . , n. Then [70,
Proposition 5] shows the claim. Consider an arbitrary G ∈ G. We show that ui (. , G) is (Σ, BR )-
measurable, where BR is the Borel σ-algebra in R.
The RVs xi are (Σ, BX )-measurable by definition. The maps G and G0 are assumed to
be (BX , BY )-measurable. Furthermore, because of their continuity, the scalar product h. , .iY is
(BY , BR )-measurable in both components and also the norm k . kY is (BY , BR )-measurable. There-
fore, since the composition of measurable functions is measurable, the latter two summands in
(B.1) are (Σ, BR )-measurable.
Proceeding with the first summand, the RVs Zj are (Σ, BR )-measurable by definition for all
j ∈ N. Therefore the products h . , ψj iY Zj are (Σ ⊗ BY , BR )-measurable for all j ∈ N. Thus
εi is, as the pointwise limit, (Σ ⊗ BY , BR )-measurable. Then, as the composition of measurable
functions, the first summand in (B.1) is (Σ, BR )-measurable. Therefore ui ( . , G), i = 1, . . . , n, and
thus u( . , G) is (Σ, BR )-measurable for all G ∈ G .
We proceed and show that u(ω, . ) is continuous w.r.t. k · k. Therefore choose G, G′ ∈ G and
ω ∈ Ω. Then it holds for i = 1, . . . , n and xi = xi (ω)

|ui (ω, G) − ui (ω, G′ )| ≤ 2σ |εi (ω, G(xi ) − G′ (xi ))| + 2 |hG0 (xi ), G(xi ) − G′ (xi )iY |
+ |kG(xi )kY − kG′ (xi )kY | |kG(xi )kY + kG′ (xi )kY |
≤ 2σ |εi (ω, G(xi ) − G′ (xi ))|

+ n (2kG0 (xi )kY + kG(xi )kY + kG′ (xi )kY ) kG − G′ kn
≤ 2σ |εi (ω, G(xi ) − G′ (xi ))|

+ C n (2kG0 (xi )kY + kG(xi )kY + kG′ (xi )kY ) kG − G′ k, (B.2)

where we used that k · kn ≤ Ck · k at the last inequality. Furthermore, [38, page 40 and Proposition
2.3.7(a)] yields that the white noise processes εi are a.s. sample dεi -continuous w.r.t. their intrinsic
pseudometrics dεi : Y × Y → R, dεi (y, y ′ ) = E[hεi , y − y ′ i2Y ]1/2 = ky − y ′ kY . Since for all G, G′ ∈ G
we have
√ √
kG(xi ) − G′ (xi )kY ≤ nkG − G′ kn ≤ C nkG − G′ k,

37
the white noise processes are also a.s. sample d-continuous, where d is the metric induced by k·k.
Together with (B.2) this shows that there exists a null-set Ω0 ⊂ Ω such that for all ω ∈ Ω\Ω0 ,
u(ω, .) is continuous w.r.t. k · k. Now we choose versions x̃i and ε̃i s.t. u(ω, .) = 0 for all ω ∈ Ω0 .
Then G 7→ u(ω, G) : (G G , k · k) → R is continuous for all ω ∈ Ω. Applying [70, Proposition 5] gives
an (Σ, BG )-measurable MLSE Ĝn in (2.2) with the desired minimization property.
Sub-Gaussian case. For i = 1, . . . , n, consider the functions ui from (B.1). The measurability
of ui ( . , G), i = 1, . . . , n follows from the measurability of the scalar product h. , .iY , the norm k·kY ,
G0 and all G ∈ G . Note that in contrast to the white noise case, εi are RVs in Y for all i = 1, . . . , n
and therefore measurable without any further investigation. Also, since εi (ω) ∈ Y for all ω,
Cauchy-Schwarz immediately shows that ui (ω, .) and therefore u(ω, .) is continuous w.r.t. k · k for
all ω ∈ Ω\Ω0 . Choosing versions x̃i and ε̃i and applying [70, Proposition 5] as above gives the
existence of an (Σ, BG )-measurable LSE Ĝn in (2.2) and therefore finishes the proof.

B.1.2 Concentration Inequality for Ĝn


Proof of Theorem 2.5 (ii). The proof follows ideas developed in [100, Section 10.3] as well as [71],
where generic chaining bounds from [30] were used to bound the relevant empirical processes
appearing below.

Slicing argument. Recall the definition (2.3) of the empirical norm. For R2 ≥ 2kG∗ − G0 k2n ,
we have
 
PxG0 kĜn − G0 k2n ≥ R2 ≤ PxG0 2 kĜn − G0 k2n − kG∗ − G0 k2n ≥ R2 . (B.3)

As a union of disjoint events, it holds



PxG0 2 kĜn − G0 k2n − kG∗ − G0 k2n ≥ R2
X∞  
= PxG0 22s R2 ≤ 2 kĜn − G0 k2n − kG∗ − G0 k2n < 22s+2 R2 .
s=0

Applying the basic inequality (Lemma A.2) and defining the empirical process (XG : G ∈ G )
indexed by the operator class G as
n
1X
XG = hεi , G(xi ) − G∗ (xi )iY ,
n i=1

gives

X  
PxG0 22s R2 ≤ 2 kĜn − G0 k2n − kG∗ − G0 k2n < 22s+2 R2
s=0

X  
22s−2 R2
≤ PxG0 XĜn ≥ , kĜn − G0 k2n − kG∗ − G0 k2n < 22s+1 R2
s=0
σ
X∞  
x 22s−2 R2
≤ PG0 sup XG ≥ . (B.4)
s=0 G∗
G∈G n (2
s+3/2 R) σ

In (B.4), we additionally used that if 2kG∗ − G0 k2n ≤ R2 and

kĜn − G0 k2n − kG∗ − G0 k2n < 22s+1 R2

then kĜn − G0 k2n ≤ 22s+1 R2 + R2 /2 and thus



kĜn − G∗ k2n ≤ 2 kĜn − G0 k2n + kG∗ − G0 k2n ≤ 22s+2 R2 + 2R2 ≤ 22s+3 R2 .

38
Concentration inequality for each slice. We wish to apply Lemma A.4 to bound the prob-
abilities in (B.4). Let CCh be the generic constant from this lemma and let δn satisfy (2.5). Then,
due to δ 7→ Ψn (δ)/δ 2 being non-increasing, (2.5) gives for all R ≥ δn and s ∈ N0
√ 2s+3 2
n(2 R ) ≥ 32CCh σΨn (2s+3/2 R)

so that with Θ := G ∗n (2s+3/2 R)

√ 22s−2 R2
n ≥ CCh Ψn (2s+3/2 R) ≥ CCh J(Θ, k · kn ). (B.5)
σ
With hG := G − G∗ and H := {hG : G ∈ Θ} we have J(Θ, k · kn ) = J(H, k · kn ) and thus (B.5)
verifies assumption (A.4) of Lemma A.4 for δ = 22s−2 R2 /σ.
Furthermore, since Θ = G ∗n (2s+3/2 R) ⊂ G for s ∈ N0 , R > 0 and (G G , k · kn ) is compact, the
space (Θ, k · kn ) is separable, which verifies the last assumption of Lemma A.4. Applying this
lemma with U = 2s+3/2 R shows that the (uncountable) suprema in (B.4) are measurable and that
for all R ≥ δn ,


X   ∞
X  
22s−2 R2 8n24s−4 R4
PxG0 sup XG ≥ ≤ exp − 2 2 2s+3 2
s=0 G∗
G∈G n (2
s+3/2 R) σ s=0
CCh σ 2 R
X∞  
22s−4 nR2
≤ exp − 2 σ2
s=0
CCh
X∞  
−2s nR2
≤ 2 exp − 2 σ2
s=0
16CCh
 
nR2
< 2 exp − 2 σ2 . (B.6)
16CCh

In (B.6)√we additionally used exp(−xy) ≤ exp(−x)/y, which holds for all x, y ≥ 1, i.e. for R ≥
4CCh σ/ n. Combining (B.3)–(B.4) and (B.6) gives (2.6) and therefore shows the claim.

B.2 Proof of Theorem 2.6


We use the concentration inequality from Theorem 2.5 for the empirical error and combine it with
a key concentration result for the empirical norm around the L2 (γ)-norm, proved in Lemma A.5.
For any R > 0,
   
R
PG0 kĜn − G0 kL2 (γ) ≥ R ≤ PG0 kĜn − G0 kn ≥
2
 
+ PG0 kĜn − G0 kL2 (γ) ≥ R, kĜn − G0 kL2 (γ) ≥ 2kĜn − G0 kn .
(B.7)

Now suppose that R satisfies


 
√ ∗ 4Cch σ 9F∞
R ≥ 2 max δn , 4δ̃n , 2kG − G0 k∞,supp(γ) , √ , √ .
n n

This is implied by (2.10) for an appropriate choice of C. In particular, R ≥ 2 2kG∗ − G0 kn for
all x ∈ Xn . Since Theorem 2.5 holds for γ n -almost every x ∈ Xn , taking expectations over x ∼ γ n
gives
   
R nR2
PG0 kĜn − G0 kn ≥ ≤ 2 exp − 2 σ2 . (B.8)
2 64CCh

39
Furthermore, Lemma A.5 gives
   
nR2
PG0 kĜn − G0 kL2 (γ) ≥ R, kĜn − G0 kL2 (γ) ≥ 2kĜn − G0 kn ≤ 2 exp − 2
, (B.9)
320F∞

since we have assumed R ≥ max{8δ̃n , 18F∞ / n}. Combining (B.8) and (B.9) shows (2.11) for
some C.

B.3 Proof of Corollary 2.7


It remains to show (2.12). We use (B.7) to estimate
 
EG0 kĜn − G0 k2L2 (γ)
Z ∞  √ 
= PG0 kĜn − G0 kL2 (γ) ≥ R dR
0
Z ∞Z √ !
x R
≤ PG0 kĜn − G0 kn ≥ dγ(xx) dR
0 Xn 2
Z ∞  √ 
+ PG0 kĜn − G0 kL2 (γ) ≥ R, kĜn − G0 kL2 (γ) ≥ 2kĜn − G0 kn dR. (B.10)
0

For G∗ ∈ G set R1 = 2 max{δn , 2kG∗ − G0 kn , 4C√Ch n
σ
}, where CCh is from Lemma A.4. Then
Theorem 2.5 gives
Z ∞Z √ !
R
PxG0 kĜn − G0 kn ≥ dγ(x) dR
0 X n 2
Z Z ∞  
nR
≤ R12 dγ(xx) + 2 exp − 2 σ2 dR
Xn R21 64CCh
2 Z ∞  
2 ∗ 2 16CCh σ2 nR
≤ 4δn + 8kG − G0 kL2 (γ) + +2 exp − 2 σ2 dR
n R21 64CCh
2
80CCh σ2
≤ 4δn2 + 8kG∗ − G0 k2L2 (γ) + .
n

Lemma A.5 gives with R2 = max{8δ̃n , 18F∞ / n}
Z ∞  √ 
PG0 kĜn − G0 kL2 (γ) ≥ R, kĜn − G0 kL2 (γ) ≥ 2kĜn − G0 kn dR
0
Z ∞  
nR
≤ R22 + 2 exp − 2
dR
R22 320F∞
2
964F∞
≤ 64δ̃n2 + . (B.11)
n
Combining (B.10)–(B.11) and taking an infimum over G∗ ∈ G shows (2.12) for some C and thus
finishes the proof of Corollary 2.7.

B.4 Proof of Corollary 2.8


We construct sequences (δn )n∈N and (δ̃n )n∈N satisfying (2.9) and balance the approximation term
and the entropy terms in (2.12) by choosing N (n) appropriately.

40
Proof of (i): Choosing δ̃n2 ≃ n−2/(2+α) immediately shows the second part of (2.9). For J(δ)
from (2.4) it holds
Z δp
J(δ) ≤ H(ρ, N ) dρ . δ 1−α/2 =: Ψn (δ).
0

Hence δn2 ≃n −2/(2+α)


satisfies the first part of (2.9). Therefore Corollary 2.7 shows
h i 2
EG0 kĜn − G0 k2L2 (γ) . N −β + n− 2+α .

Since the entropy term is independent of N , the limit N → ∞ gives the claim.

Proof of (ii): Choosing δ̃n2 ≃ N (1 + log(n))/n shows in particular log(n) & log(δ̃n−1 ), which in
turn shows that δ̃n satisfies the second part of (2.9). For J(δ) from (2.4) it holds
Z δ p √ 
J(δ) . H(ρ, N ) dρ . N δ 1 + log(δ −1 ) =: Ψn (δ).
0

Hence δn2 ≃ N (1 + log(n))/n satisfies the first part of (2.9). Therefore Corollary 2.7 shows
h i N (1 + log(n))
EG0 kĜn − G0 k2L2 (γ) . N −β + .
n
1
Choosing N (n) = ⌈n β+1 ⌉ and using log(n) ≤ nτ /τ for all τ > 0 and n ≥ 1 gives the claim.

B.5 Proof of Theorem 2.10


The concentration inequality (2.14) follows immediately from (B.7), (B.9) (which holds for sub-
Gaussian noise) and (A.13) from Lemma A.6. Application of Lemma A.6 yields
Z ∞Z √ !
δ
PxG0 kĜn − G0 kn ≥ x ) dδ
dγ(x
0 X n 2
Z ∞  
2 ∗ 2 nδ 2
≤ 4δn + 8kG − G0 kL2 (γ) + 4 exp − dδ
0 16C12 σ 2 (1 + F∞
2 )
r
π
≤ 4δn2 + 8kG∗ − G0 k2L2 (γ) + 16C1 σ(1 + F∞ ) .
n

Using (B.10) and (B.11), and minimizing over G∗ ∈ G gives the bound on the mean-squared error
(2.15) for some C2 .

C Neural Network Theory


In this section we recap elementary operations and approximation theory for neural networks,
based on [81]. For a NN f : Rp0 → RpL+1 (see Definition 3.5) we denote by sizein (f ) and sizeout (f )
the number of nonzero weights and biases of the first and last layer respectively.

C.1 Operations on Neural Networks


In this subsection, we recap elementary operations on NNs. We start with the parallelization of
two NNs following [75, Section 2.2.1].
Definition C.1 (Parallelization). Let q ∈ N, q ≥ 2 and σ ∈ {σ1 , σq }. Let f and g be two σ-NNs
realizing the functions f and g with the same depth L. Furthermore, define the input dimensions

41
of f and g as nf and ng and the output dimensions as mf and mg .3 Then there exists a σ-NN
(f, g), called the parallelization of f and g, which simultaneously realizes f and g, i.e.

(f, g) : Rnf × Rng → Rmf × Rmg : (x


x, x̃
x ) 7→ (f (x
x ), g(x̃
x)).

It holds

size ((f, g)) = size(f ) + size(g),


depth ((f, g)) = depth(f ) = depth(g),
width ((f, g)) = width(f ) + width(g),
mpar ((f, g)) = max {mpar(f ), mpar(g)} , (C.1)
2 2 2
mranΩ ((f, g)) = mranΩ (f ) + mranΩ (g) .

Let N ∈ N, N ≥ 3. We extend Definition C.1 to parallelize N σ-neural networks fi , i = 1, . . . N


with equal depth and denote the resulting σ-NN as ({fi }N
i=1 ). It holds that

  N
X
N
size {fi }i=1 = size(fi ), (C.2)
i=1
  N
X
N
sizein {fi }i=1 = sizein (fi ),
i=1
  N
X
N
sizeout {fi }i=1 = sizeout (fi ),
i=1
 
N
depth {fi }i=1 = depth(f1 ),
  N
X
N
width {fi }i=1 = width(fi ), (C.3)
i=1
 
N
mpar {fi }i=1 = max mpar(fi ),
i=1,...,N
 2 XN
N
mranΩ {fi }i=1 = mranΩ (fi )2 .
i=1

Next, we recall the concatenation of NNs, [81, Definition 2.2].


Lemma C.2 (Concatenation). Let q ∈ N, q ≥ 2 and σ ∈ {σ1 , σq }. Let f and g be two σ-NNs.
Furthermore, let the output dimension mg of g equal the input dimension nf of f . Then there
exists a σ-NN f • g realizing the composition f ◦ g : x 7→ f (g(x)) of the functions f and g. It holds

depth (f • g) = depth(f ) + depth(g),


width (f • g) = max{width(f ), width(g)},
mranΩ (f • g) = mrang(Ω) (f ) .

There is no simple control over the size and the weight bound of the concatenation f • g in
Definition C.2. The reason is that the network f • g multiplies network weights and biases of
the NNs f and g at layer l = depth(g) + 1 (for details see [81, Definition 2.2]). In the following
we use sparse concatenation to get control over the size and the weights. We first introduce the
realization of the identity map and separate the analysis for the σ1 - and σq -case. The following
lemma is proven in [81, Remark 2.4].
3 Using the syntax from (3.4a)–(3.4b), it holds nf = p0 (f ), ng = p0 (g), mf = pL+1 (f ) and mg = pL+1 (g).

Lemma C.3 (σ1 -realization of identity map). Let d ∈ N and L ∈ N. Then there exists a σ1 -
identity network IdRd of depth L, which exactly realizes the identity map IdRd : Rd → Rd , x 7→ x .
It holds

size (IdRd ) ≤ 2d(L + 1), (C.4)


width (IdRd ) ≤ 2d, (C.5)
mpar (IdRd ) ≤ 1.
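A sketch (ours) of the standard construction behind Lemma C.3: the first layer lifts x to (σ_1(x), σ_1(−x)); since these values are nonnegative, the remaining hidden layers can pass them through unchanged, and the affine output layer recombines them as σ_1(x) − σ_1(−x) = x. This uses 2d(L+1) nonzero weights, width 2d and parameters in {−1, 0, 1}, consistent with (C.4)–(C.5).

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_identity(d, L):
    lift = np.vstack([np.eye(d), -np.eye(d)])   # R^d -> R^{2d}: (relu(x), relu(-x)) after activation
    keep = np.hstack([np.eye(d), -np.eye(d)])   # R^{2d} -> R^d: relu(x) - relu(-x) = x
    layers = [(lift, np.zeros(2 * d))]
    layers += [(np.eye(2 * d), np.zeros(2 * d)) for _ in range(L - 1)]  # pass nonnegative values through
    layers.append((keep, np.zeros(d)))          # affine output layer
    return layers                               # 2d(L+1) nonzero weights, width 2d, weights in {-1,0,1}

def realize(layers, x):
    for W, b in layers[:-1]:
        x = relu(W @ x + b)
    W, b = layers[-1]
    return W @ x + b

x = np.array([1.5, -2.0, 0.0])
assert np.allclose(realize(relu_identity(3, L=4), x), x)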

We proceed with the analogous result for the RePU activation function. The following lemma
follows from the construction in [57, Theorem 2.5 (2)] with a parallelization argument.
Lemma C.4 (σq -realization of identity map). Let q ∈ N with q ≥ 2. Further let d ∈ N and L ∈ N
be arbitrary. Then there exists a σq -NN IdRd of depth L, which exactly realizes the identity map
IdRd . It holds

size(IdRd ) ≤ Cq dL, (C.6)


width(IdRd ) ≤ Cq d,
mpar(IdRd ) ≤ Cq ,

where Cq is independent of d (but does depend on q).


We next define the sparse concatenation of two NNs.
Definition C.5 (σ1 -sparse concatenation, [81, Definition 2.5]). Let f and g be two σ1 -NNs.
Furthermore, let the output dimension mg of g equal the input dimension nf of f . Then there
exists a σ1 -NN f ◦ g with
f ◦ g := f • IdRmg •g
realizing the composition f ◦ g : x 7→ f (g(x)) of the functions f and g.4 It holds

size(f ◦ g) ≤ size(g) + size_out(g) + size_in(f) + size(f) ≤ 2 size(f) + 2 size(g),      (C.7)
size_in(f ◦ g) ≤ size_in(g) if depth(g) ≥ 1, and size_in(f ◦ g) ≤ 2 size_in(g) if depth(g) = 0,
size_out(f ◦ g) ≤ size_out(f) if depth(f) ≥ 1, and size_out(f ◦ g) ≤ 2 size_out(f) if depth(f) = 0,      (C.8)
depth(f ◦ g) = depth(f) + depth(g) + 1,      (C.9)
width(f ◦ g) ≤ 2 max{width(f), width(g)},      (C.10)
mpar(f ◦ g) ≤ max{mpar(f), mpar(g)},      (C.11)
mranΩ(f ◦ g) = B_{g(Ω)}(f).      (C.12)

Proof. The bounds in (C.9), (C.10) and (C.12) follow from Lemma C.3 together with the NN calculus
from Lemma C.2. The bounds on the sizes in (C.7)–(C.8) and the weight and bias bound (C.11)
follow from the specific structure of the σ_1-identity network, see [81, Remark 2.6].
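The following sketch (ours) illustrates the point of the sparse concatenation for ReLU networks: the inserted identity layer is merged with the last layer of g and the first layer of f using only weights in {−1, 0, 1}, so — unlike the plain concatenation f • g — no two trained weight matrices are multiplied, and sizes and weight bounds simply add up as in (C.7) and (C.11).

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def realize(layers, x):
    for W, b in layers[:-1]:
        x = relu(W @ x + b)
    W, b = layers[-1]
    return W @ x + b

def sparse_concat(f_layers, g_layers):
    m_g = g_layers[-1][0].shape[0]                     # output dimension of g
    lift = np.vstack([np.eye(m_g), -np.eye(m_g)])      # identity trick x = relu(x) - relu(-x)
    keep = np.hstack([np.eye(m_g), -np.eye(m_g)])
    Wg, bg = g_layers[-1]
    Wf, bf = f_layers[0]
    mid = [(lift @ Wg, lift @ bg),                     # last layer of g merged with the lift
           (Wf @ keep, bf)]                            # first layer of f merged with the recombination
    return g_layers[:-1] + mid + f_layers[1:]

rng = np.random.default_rng(1)
g = [(rng.standard_normal((4, 3)), rng.standard_normal(4)),
     (rng.standard_normal((2, 4)), rng.standard_normal(2))]
f = [(rng.standard_normal((5, 2)), rng.standard_normal(5)),
     (rng.standard_normal((1, 5)), rng.standard_normal(1))]
x = rng.standard_normal(3)
assert np.allclose(realize(sparse_concat(f, g), x), realize(f, realize(g, x)))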
We proceed with the sparse concatenation of σq -NNs.

Definition C.6 (σq -sparse concatenation). Let q ∈ N with q ≥ 2. Let f and g be two σq -NNs.
Furthermore, let the output dimension mg of g equal the input dimension nf of f . Then there
exists a σq -NN f ◦ g with
f ◦ g := f • IdRmg •g
4 The symbol ◦ denotes either the composition of the functions f and g or the sparse concatenation of the NNs
f and g (which realizes the function f ◦ g).

realizing the composition f ◦ g : x 7→ f (g(x)) of the functions f and g. It holds

size(f ◦ g) ≤ size(g) + (C_q − 1) size_out(g) + (C_q − 1) size_in(f) + size(f) ≤ C_q size(f) + C_q size(g),      (C.13)
size_in(f ◦ g) ≤ size_in(g) if depth(g) ≥ 1, and size_in(f ◦ g) ≤ C_q size_in(g) if depth(g) = 0,
size_out(f ◦ g) ≤ size_out(f) if depth(f) ≥ 1, and size_out(f ◦ g) ≤ C_q size_out(f) if depth(f) = 0,      (C.14)
depth(f ◦ g) = depth(f) + depth(g) + 1,      (C.15)
width(f ◦ g) ≤ C_q max{width(f), width(g)},      (C.16)
mpar(f ◦ g) ≤ C_q max{mpar(f), mpar(g)},      (C.17)
mranΩ(f ◦ g) = B_{g(Ω)}(f)      (C.18)

with a constant Cq > 1 depending only on q.


Proof. The bounds in (C.15), (C.16) and (C.18) follow from Lemma C.4 together with the NN calculus
from Lemma C.2. The bounds on the sizes in (C.13)–(C.14) and the weight and bias bound in
(C.17) hold because of the specific structure of the σ_q-identity network Id_{R^n}, see [57, Eq. (2.57)].
Definitions C.5 and C.6 show that one can control the size, as well as the weights and biases, of
the concatenation of two NNs by inserting one additional identity layer between the two networks.
We end this subsection by introducing summation and scalar multiplication networks.
Definition C.7 (Summation networks). Let q ∈ N, q ≥ 2, σ ∈ {σ1 , σq } and d, m ∈ N. Then
there exists a σ-NN Σm such that for x1 , . . . , xm ∈ Rd
m
X
Σm (x1 , . . . , xm ) = xi ,
i=1

with depth(Σm ) = 0, width(Σm ) = md, size(Σm ) = md and mpar(Σm ) = 1.


Proof. Set w^1 = (1_d, . . . , 1_d) and b^1 = 0, where 1_d denotes the d × d identity matrix.
Definition C.8 (Scalar multiplication networks). Let q ∈ N with q ≥ 2 and σ ∈ {σ1 , σq }. Let
α ∈ R and d ∈ N. Then there exists a σ-NN SMα with

SMα (x) = αx, x ∈ Rd .

Furthermore, there exists a constant Cq only depending on q such that

depth (SMα ) ≤ Cq max{1, log(|α|)},


width (SMα ) ≤ Cq d,
size (SMα ) ≤ Cq d max{1, log(|α|)},
mpar (SMα ) ≤ Cq .

Proof. For the proof of the σ1 -case, see [34, Lemma A.1]. We prove the RePU case in the following.
Without loss of generality we can choose α > 1. For α < 0 we set SMα = −SM−α . For
0 < α < 1 we set SMα = α IdRd , where we directly multiply the weights and biases of the identity
network (with depth L = 1) with α.
Therefore let α > 1. Let K be the largest integer smaller than log_2(α), and set α̃ :=
2^{−(K+1)} α ≤ 1. Furthermore, set A_1 = (Id_{R^d}, Id_{R^d}) and A_2 = Σ_2 with the one-layered identity
network IdRd and the summation network Σ2 from Definition C.7. We notice that

A2 • A1 x = 2x ∀x ∈ Rd .

Using the bounds for Σ2 from Definition C.7 and IdRd from Definition C.4, we have depth(A2 •A1 ) =
1, width(A2 • A1 ) ≤ Cq d, size(A2 • A1 ) ≤ Cq d and mpar(A2 • A1 ) ≤ Cq . Setting A2k+1 = A1 ,
A2k+2 = A2 for k = 1, . . . , K and A2K+3 = α̃ IdRd , we get
A_{2K+3} ◦ A_{2K+2} • A_{2K+1} ◦ A_{2K} • A_{2K−1} ◦ · · · ◦ A_2 • A_1 x = αx   ∀x ∈ R^d.
Applying the σ_q-NN calculus for concatenation (Lemma C.2) and sparse concatenation (Definition C.6)
gives the desired bounds for SM_α.
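A sketch (ours) of the idea behind this construction for α > 1: split off a bounded factor α̃ = 2^{−(K+1)}α ≤ 1 and realize the remaining power of two by K + 1 ≲ log(α) doubling blocks, each of which parallelizes two identity networks and sums their outputs; this is what produces the log(|α|) factor in the depth and size bounds while keeping all weights bounded.

import math

def scalar_multiply(alpha, x):
    assert alpha > 1
    K = math.ceil(math.log2(alpha)) - 1          # largest integer smaller than log2(alpha)
    alpha_tilde = alpha / 2 ** (K + 1)           # bounded weight for the final layer
    assert 0 < alpha_tilde <= 1
    for _ in range(K + 1):                       # each doubling block: parallelize two identities, then sum
        x = x + x
    return alpha_tilde * x                       # depth grows like log(alpha), weights stay bounded

assert abs(scalar_multiply(37.5, 2.0) - 75.0) < 1e-12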

C.2 Neural Network Approximation Theory


In this subsection, we summarize approximation results for ReLU and RePU neural networks. In
recent years, the expressivity and approximation properties of neural network architectures have
been extensively studied in the literature (e.g., [63, 82, 103, 83, 81, 34, 76, 27]). However, with few
exceptions (e.g., [89, 27, 34]), most of these works do not provide bounds on the size of the weights,
which are crucial for controlling the entropy. Therefore, we revisit some of these arguments to
provide complete proofs of our results.
We start with the well-known result that ReLU-NNs can approximate the multiplication map
exponentially fast. The following proposition was shown in [34, Proposition III.3].
Proposition C.9 (σ1-NN approximation of multiplication, cf. [34, Proposition III.3]). Let D ∈ R,
D ≥ 1 and δ ∈ (0, 1/2). Then there exists a σ1-NN ×̃_{δ,D} : [−D, D]^2 → R satisfying

sup_{x,y∈[−D,D]} | ×̃_{δ,D}(x, y) − xy | ≤ δ.

Furthermore, there exists a constant C, independent of D and δ, such that


 
depth(×̃_{δ,D}) ≤ C (log(D) + log(δ^{−1})),      (C.19)
width(×̃_{δ,D}) ≤ 5,      (C.20)
size(×̃_{δ,D}) ≤ C (log(D) + log(δ^{−1})),      (C.21)
mpar(×̃_{δ,D}) ≤ 1.      (C.22)
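For intuition, the following sketch (ours) implements the standard ReLU mechanism underlying results of this type: approximate squaring on [0, 1] via composed sawtooth functions, combined with xy = ((x+y)^2 − (x−y)^2)/4 and a rescaling to [−D, D]. We do not claim that this reproduces the exact network of [34]; it only illustrates why depth and size scale like log(D) + log(δ^{−1}).

import numpy as np

def relu(t):
    return np.maximum(0.0, t)

def sawtooth(t):                                  # hat function, realizable with 3 ReLUs
    return 2 * relu(t) - 4 * relu(t - 0.5) + 2 * relu(t - 1.0)

def approx_square01(t, m):                        # |t^2 - approx| <= 2^(-2m-2) on [0,1]
    s, g = t, t
    for k in range(1, m + 1):
        g = sawtooth(g)
        s = s - g / 4.0**k
    return s

def approx_mul(x, y, D, m):                       # error <= 2 * D^2 * 2^(-2m-2) on [-D,D]^2
    u = abs(x + y) / (2.0 * D)                    # |x + y| <= 2D, so u in [0,1]
    v = abs(x - y) / (2.0 * D)
    return D**2 * (approx_square01(u, m) - approx_square01(v, m))

assert abs(approx_mul(1.3, -2.1, D=3.0, m=10) - (1.3 * -2.1)) < 1e-4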
The next proposition shows that the multiplication map can be exactly realized by a σq -NN,
which follows directly from [57, Theorem 2.5 and Eq. (2.59)].
Proposition C.10 (σq-NN approximation of multiplication). Let q ∈ N, q ≥ 2. There exists a
σq-NN ×̃ : R^2 → R with depth(×̃) = 1 exactly realizing the multiplication of two numbers, i.e.

×̃(x, y) = xy   ∀x, y ∈ R.
We can extend the above results to the multiplication of N numbers, see [75, Proposition 2.6].
Proposition C.11 (σ1-NN multiplication of N numbers). Let N ∈ N with N ≥ 2. Furthermore,
let D ∈ R, D ≥ 1 and δ ∈ (0, 1/2). Then there exists a σ1-NN ∏̃_{δ,D} : [−D, D]^N → R such that

sup_{(y_i)_{i=1}^N ∈ [−D,D]^N} | ∏_{j=1}^N y_j − ∏̃_{δ,D}(y_1, . . . , y_N) | ≤ δ.      (C.23)

Furthermore, there exists a constant C independent of N, δ and D such that

depth(∏̃_{δ,D}) ≤ C log(N) (log(N) + N log(D) + log(δ^{−1})),
width(∏̃_{δ,D}) ≤ 5N,      (C.24)
size(∏̃_{δ,D}) ≤ C N (log(N) + N log(D) + log(δ^{−1})),      (C.25)
mpar(∏̃_{δ,D}) ≤ 1.

Proof. Analogous to [75, Proposition 2.6] we construct ∏̃_{δ,D} as a binary tree of ×̃_{·,·}-networks from
Proposition C.9. We modify the proof of [75] to get a construction with bounded weights.
Define Ñ := min{2^k : k ∈ N, 2^k ≥ N}. We now consider the multiplication of Ñ numbers
with y_{N+1}, . . . , y_{Ñ} := 1. This can be implemented by a zero-layer network with

w^1_{i,j} = 1 if i = j ≤ N and w^1_{i,j} = 0 otherwise,      b^1_j = 0 for j ≤ N and b^1_j = 1 for N < j ≤ Ñ.

For l = 0, . . . , log_2(Ñ) − 1 we define the mapping

R_l : [−D_l, D_l]^{Ñ/2^l} → [−D_{l+1}, D_{l+1}]^{Ñ/2^{l+1}},
R_l(y^l_1, . . . , y^l_{2^{log_2(Ñ)−l}}) := ( ×̃_{δ′,D_l}(y^l_1, y^l_2), . . . , ×̃_{δ′,D_l}(y^l_{2^{log_2(Ñ)−l}−1}, y^l_{2^{log_2(Ñ)−l}}) )      (C.26)

with δ′ := δ/(Ñ^2 D^{2Ñ}) and D_l := 2^l D^{2^l}. We now set

∏̃_{δ,D} := R_{log_2(Ñ)−1} ◦ · · · ◦ R_0.      (C.27)

Eq. (C.27) shows that the map R_l describes the multiplications on level l of the binary tree ∏̃_{δ,D}.
In order for (C.27) to be well-defined, we have to show that the outputs of the NN R_l are admissible
inputs for the NN R_{l+1}.
We therefore denote by y^l_j, j = 1, . . . , 2^{log_2(Ñ)−l}, l = 1, . . . , log_2(Ñ) − 1, the output of the
network R_{l−1} ◦ · · · ◦ R_0 applied to the input y^0_k = y_k, k = 1, . . . , Ñ. Then we have to show |y^l_j| ≤ D_l
for l = 0, . . . , log_2(Ñ) − 1 and j = 1, . . . , 2^{log_2(Ñ)−l}. We will show this claim by induction. For
l = 0 it holds |y^0_j| ≤ D = D_0. Now assume |y^l_j| ≤ D_l for arbitrary but fixed l ∈ {0, . . . , log_2(Ñ) − 2}
and all j = 1, . . . , 2^{log_2(Ñ)−l}. Then, using δ′ ≤ 1 ≤ D_l^2, it holds

|y^{l+1}_j| = | ×̃_{δ′,D_l}(y^l_{2j−1}, y^l_{2j}) | ≤ | y^l_{2j−1} · y^l_{2j} | + δ′ ≤ D_l^2 + δ′ ≤ 2 D_l^2 = D_{l+1}

for all j = 1, . . . , 2^{log_2(Ñ)−(l+1)}, which shows the claim. We proceed by showing the error bound
in (C.23). Therefore define

z^l_j := ∏_{k=1}^{2^l} y_{k+2^l(j−1)}

for l = 0, . . . , log_2(Ñ) and j = 1, . . . , 2^{log_2(Ñ)−l}. The quantities z^l_j describe the exact computations
up to level l of the binary tree, i.e. the output of level l − 1 if one uses standard multiplication
instead of the multiplication networks ×̃ in the first l − 1 levels. We now prove

|y^l_j − z^l_j| ≤ 4^l D^{2^{l+1}} δ′,      j = 1, . . . , 2^{log_2(Ñ)−l}      (C.28)

by induction over l = 0, . . . , log_2(Ñ). Inserting l = log_2(Ñ) then shows the error bound in (C.23)
using the definition of δ′.
We have yj0 = yj = zj0 for j = 1, . . . , Ñ , therefore (C.28) holds for l = 0. Now assume (C.28)

to hold for an arbitrary but fixed l ∈ {0, . . . , log_2(Ñ) − 1}. For j = 1, . . . , 2^{log_2(Ñ)−(l+1)} it holds

|y^{l+1}_j − z^{l+1}_j| = | ×̃_{δ′,D_l}(y^l_{2j−1}, y^l_{2j}) − z^l_{2j−1} z^l_{2j} |
    ≤ | y^l_{2j−1} y^l_{2j} − z^l_{2j−1} z^l_{2j} | + δ′
    = | (y^l_{2j−1} − z^l_{2j−1})(y^l_{2j} − z^l_{2j}) + (y^l_{2j−1} − z^l_{2j−1}) z^l_{2j} + (y^l_{2j} − z^l_{2j}) z^l_{2j−1} | + δ′
    ≤ |y^l_{2j−1} − z^l_{2j−1}| · |y^l_{2j} − z^l_{2j}| + |y^l_{2j−1} − z^l_{2j−1}| · |z^l_{2j}| + |y^l_{2j} − z^l_{2j}| · |z^l_{2j−1}| + δ′
    ≤ δ′ ( 4^l D^{2^{l+1}} ( 1 + 2 D^{2^l} ) + 1 )
    ≤ 4 · 4^l D^{2^{l+1}} D^{2^{l+1}} δ′ ≤ 4^{l+1} D^{2^{l+2}} δ′,

where we used (C.28) for level l, |y^l_{2j} − z^l_{2j}| ≤ 1 and |z^l_{2j−1}|, |z^l_{2j}| ≤ D^{2^l},

which shows (C.28) for l + 1 and therefore the claim.
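The tree structure of (C.26)–(C.27) can be sketched as follows (ours; exact multiplications stand in for the ×̃_{δ′,D_l}-networks): pad the N inputs with ones up to Ñ = min{2^k : 2^k ≥ N} and reduce neighbouring pairs level by level, which takes log_2(Ñ) levels.

import math
import numpy as np

def tree_product(y):
    N = len(y)
    N_tilde = 2 ** math.ceil(math.log2(N))        # N_tilde = min{2^k : 2^k >= N}
    vals = np.concatenate([np.asarray(y, dtype=float), np.ones(N_tilde - N)])
    while len(vals) > 1:                          # one level R_l of the binary tree per iteration
        vals = vals[0::2] * vals[1::2]            # multiply neighbouring pairs
    return vals[0]

y = [0.5, -1.2, 3.0, 0.7, 2.0]
assert np.isclose(tree_product(y), np.prod(y))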


Q Q
We proceed by calculating the depth of e δ,D . Since e δ,D concatenates the maps ×
˜ δ′ ,Dl , we
can repeatedly use (C.9) and get

Y  log2 (Ñ )−1


X
f 
depth ≤ ˜ δ′ ,Dj + log2 (Ñ ) − 1.
depth ×
δ,D
j=0

˜ from (C.19) and calculate


We use the depth bound for ×

Y  log2 (Ñ )−1


X
f 
depth ≤C log Dj δ ′−1 + log2 (Ñ )
δ,D
j=0
log2 (Ñ )−1  
X j
=C log 2j D2 δ ′−1 + log2 (Ñ )
j=0
 
log2 (Ñ )−1
Y
2j ′−1 
= C log  j
2 D δ + log2 (Ñ )
j=0
 
log2 (Ñ )·(log 2 (Ñ )−1) log2 (Ñ ) − log2 (Ñ )
2 ′
≤ C log 2 2 D (δ ) + log2 (Ñ )
 2 
≤ C log 2(log2 (Ñ)) DÑ Ñ 2 log2 (Ñ) D2Ñ log2 (Ñ ) δ − log2 (Ñ) + log2 (Ñ )
 2

≤ C log 2(log2 (Ñ ) Ñ 2 log2 (Ñ) D3Ñ log2 (Ñ ) δ − log2 (Ñ) + log2 (Ñ )
 
≤ C log2 (Ñ ) log(Ñ ) + Ñ log(D) + log δ −1

≤ C log(N ) log(N ) + N log(D) + log δ −1 . (C.29)

The constant C changes from line to line in (C.29).


Q
For a bound on the width we use the fact that e δ,D is a parallelization of at most Ñ /2 ≤ N
networks ט δ′ ,Dl in each layer l ∈ {1, . . . , log2 (Ñ )}. With (C.3) and the width bound of ×
˜ in (C.20)
it holds Y 
f 
width ≤ N width × ˜ δ′ ,Dl ≤ 5N.
δ,Dl

Q
For a bound on the size size( e δ,D ) we observe that level l, l = 0, . . . , log2 (Ñ ) − 1, of the binary

tree ˜ δ′ ,Dl . We calculate
consists of 2log2 (Ñ )−l−1 product networks ×

Y  log2 (Ñ )−1


X
f   
size ≤ 2log2 (Ñ )−l−1 sin × ˜ δ′ ,Dl + size ×
˜ δ′ ,Dl + sout × ˜ δ′ ,Dl
δ,D
l=0
log2 (Ñ )−1
X 
≤ 2log2 (Ñ )−l−1 3C log Dl δ ′−1
l=0
log2 (Ñ )−1  
X l
≤C 2log2 (Ñ)−l−1 log 2l D2 Ñ 2 D2Ñ δ −1
l=0
log2 (Ñ )−1   
X 
≤C 2log2 (Ñ)−l−1 l + 2l log(D) + log Ñ + Ñ log(D) + log δ −1
l=0
 
≤ C Ñ log2 (Ñ ) + Ñ log2 (Ñ ) log(D) + Ñ log(Ñ ) + Ñ 2 log(D) + Ñ log δ −1

≤ CN log(N ) + N log(D) + log δ −1 . (C.30)

In (C.30) we used (C.7) to bound the size of a sparse concatenation and (C.21) for the size of the
product network × ˜ δ,D .
Q
For the bound on the weights and biases, we get mpar( e δ,D ) ≤ 1 because of mpar(× ˜ δ,D ) ≤ 1,
see (C.22), and the NN calculus for sparse concatenation in (C.11) and parallelization in (C.1).
We continue with the RePU-case.
Proposition C.12 (σq-NN realization of the multiplication of N numbers). Let N, q ∈ N with N, q ≥ 2. Then
there exists a σq-NN ∏̃ : R^N → R such that

∏̃(y_1, . . . , y_N) = ∏_{j=1}^N y_j.

Furthermore, there exists a constant C_q independent of N such that

depth(∏̃) ≤ C_q log(N),      (C.31)
width(∏̃) ≤ C_q N,      (C.32)
size(∏̃) ≤ C_q N,      (C.33)
mpar(∏̃) ≤ C_q.      (C.34)


Proof. The construction is similar to the ReLU case. We define ∏̃ as a binary tree of the exact product
networks ×̃ from Proposition C.10, see (C.26) and (C.27). The binary tree has a maximum of 2N binary
networks ×̃, a maximum height of log_2(2N) and a maximum width of N. Therefore (C.31)–(C.34) follow with
the NN calculus rules from Definition C.6.
We proceed and state the approximation results for univariate polynomials. We start with the
ReLU case. The following proposition was shown in [34, Proposition III.5].

Proposition C.13 (σ1-NN approximation of polynomials, cf. [34, Proposition III.5]). Let m ∈ N
and a = (a_i)_{i=0}^m ∈ R^{m+1}. Further let D ∈ R, D ≥ 1 and δ ∈ (0, 1/2). Define a_∞ = max{1, ‖a‖_∞}.
Then there exists a σ1-NN p̃_{δ,D} : [−D, D] → R satisfying

sup_{x∈[−D,D]} | p̃_{δ,D}(x) − Σ_{i=0}^m a_i x^i | ≤ δ.

Furthermore, there exists a constant C independent of m, a_i, D and δ such that

depth(p̃_{δ,D}) ≤ C m (m log(D) + log(δ^{−1}) + log(m) + log(a_∞)),
width(p̃_{δ,D}) ≤ 9,
size(p̃_{δ,D}) ≤ C m (m log(D) + log(δ^{−1}) + log(m) + log(a_∞)),
mpar(p̃_{δ,D}) ≤ 1.
In the RePU-case we get the well-known result that polynomials can be exactly realized by
σq -NNs, see [58].
Proposition C.14 (σq-NN realization of polynomials). Let m, q ∈ N, q ≥ 2 and a = (a_i)_{i=0}^m ∈ R^{m+1}.
Set a_∞ = max{1, max_{i=0,...,m} a_i}. Then there exists a σq-NN p̃ : R → R satisfying

p̃(x) = Σ_{i=0}^m a_i x^i   ∀x ∈ R.

Furthermore, there exists a constant C_q only depending on q such that

depth(p̃) ≤ C_q (log(a_∞) + m),
width(p̃) ≤ C_q,
size(p̃) ≤ C_q (log(a_∞) + m),
mpar(p̃) ≤ C_q.
Proof. We use Horner’s method for polynomial evaluation and write
Σ_{i=0}^m a_i x^i = a_∞ ( a_0/a_∞ + x ( a_1/a_∞ + · · · + x ( a_{m−1}/a_∞ + x · a_m/a_∞ ) . . . ) ).      (C.35)

Following (C.35), we build p̃ via

p̃ = SM_{a_∞} ◦ Σ_2( a_0/a_∞ , ×̃( Id_R , Σ_2( a_1/a_∞ , ×̃( Id_R , . . . , SM_{a_m a_∞^{−1}}(Id_R) . . . ) ) ) ).

The bounds for p̃ follow from the respective bounds for Σ2 from Definition C.7, IdR from Lemma
C.4 and SMα from Definition C.8.
We now use Propositions C.13 and C.14 to get an approximation result for univariate Legendre
polynomials.
Corollary C.15 (σ1 -NN approximation of Lj ). Let j ∈ N0 and δ ∈ (0, 1/2). Then there exists a
σ1 -NN L̃j,δ : [−1, 1] → R with
sup |L̃j,δ (x) − Lj (x)| ≤ δ.
x∈[−1,1]

Furthermore, there exists a constant C such that it holds


  
depth L̃j,δ ≤ Cj j + log(δ −1 ) , (C.36)
 
width L̃j,δ ≤ 9, (C.37)
  
size L̃j,δ ≤ Cj j + log δ −1 , (C.38)
 
mpar L̃j,δ ≤ 1.

Proof. For j ∈ N, l ∈ N_0, l ≤ j, denote the coefficients of L_j by c^j_l. In [77, Eq. (4.17)] the bound
Σ_{l=0}^j |c^j_l| ≤ 4^j is derived. With c^j = (c^j_l)_{l=0}^j it holds ‖c^j‖_∞ ≤ Σ_{l=0}^j |c^j_l| ≤ 4^j. The result now
follows with Proposition C.13.
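As a numerical sanity check (ours), and under the assumption that L_j denotes the L^2([−1, 1], dx/2)-normalized Legendre polynomial √(2j+1) P_j — consistent with the bound sup_{[−1,1]} |L_j| ≤ 2j + 2 used later in (D.4) — the coefficient bound Σ_l |c^j_l| ≤ 4^j and the sup-norm value √(2j+1) can be verified for small j:

import numpy as np
from numpy.polynomial import legendre as npleg

for j in range(8):
    # monomial coefficients of sqrt(2j+1) * P_j
    coeffs = np.sqrt(2 * j + 1) * npleg.leg2poly(np.eye(j + 1)[j])
    assert np.abs(coeffs).sum() <= 4 ** j + 1e-9
    xs = np.linspace(-1.0, 1.0, 2001)
    vals = np.sqrt(2 * j + 1) * npleg.legval(xs, np.eye(j + 1)[j])
    assert np.max(np.abs(vals)) <= np.sqrt(2 * j + 1) + 1e-9   # maximum attained at x = ±1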
We continue with the σq -case.

Corollary C.16 (σq -NN approximation of Lj ). Let j ∈ N0 . Then there exists a σq -NN L̃j : R → R
with

L̃j (x) = Lj (x) ∀x ∈ R.

Furthermore, there exists a constant Cq only depending on q such that it holds


 
depth(L̃_j) ≤ C_q j,      (C.39)
width(L̃_j) ≤ C_q,      (C.40)
size(L̃_j) ≤ C_q j,      (C.41)
mpar(L̃_j) ≤ C_q.

Proof. The bounds follow similar as in the σ1 -case using Proposition C.14.

D Proofs of Section 3
D.1 Proof of Proposition 3.9
We proceed analogously to the proof of [75, Proposition 2.13]. We define f_{Λ,δ} as a composition
of two subnetworks, f_{Λ,δ} := f^{(1)}_{Λ,δ} ◦ f^{(2)}_{Λ,δ}. The subnetwork f^{(2)}_{Λ,δ} evaluates, in parallel, all relevant
univariate Legendre polynomials, i.e.

f^{(2)}_{Λ,δ}(y) := ( { Id_R ◦ L̃_{ν_j,δ′}(y_j) }_{(j,ν_j)∈T} ),      (D.1)

where we used

T := { (j, ν_j) ∈ N^2 : ν ∈ Λ, j ∈ supp ν },      (D.2)
δ′ := (2d(Λ))^{−1} (2m(Λ) + 2)^{−d(Λ)+1} δ

and y = (y_j)_{(j,ν_j)∈T}. In (D.1) the big round brackets denote a parallelization and we use the
identity networks to synchronize the depth. The subnetwork f^{(1)}_{Λ,δ} takes the output of f^{(2)}_{Λ,δ} as input
and computes, in parallel, tensorized Legendre polynomials using the multiplication networks ∏̃_{·,·}
introduced in Proposition C.11. With M_ν := 2|ν|_1 + 2 we define

f^{(1)}_{Λ,δ}((z_k)_{k≤|T|}) = f^{(1)}_{Λ,δ}( f^{(2)}_{Λ,δ}(y) )
    := ( Id_R ◦ ∏̃_{δ/2,M_ν}( { Id_R ◦ L̃_{ν_j,δ′}(y_j) }_{j∈supp ν} ) )_{ν∈Λ}.      (D.3)

The multiplication networks in (D.3) are well-defined, since

sup_{y_j∈[−1,1]} | L̃_{ν_j,δ′}(y_j) | ≤ 2ν_j + 2 ≤ 2|ν|_1 + 2 = M_ν,      (D.4)

where we used (3.10) and δ′ < 1.
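For orientation, the following sketch (ours) computes what the outputs of f_{Λ,δ} approximate, namely the tensorized polynomials L_ν(y) = ∏_{j∈supp ν} L_{ν_j}(y_j), here with exact univariate evaluations in place of the networks L̃_{ν_j,δ′} and exact products in place of ∏̃_{δ/2,M_ν}; as before, the normalization L_k = √(2k+1) P_k is assumed.

import numpy as np
from numpy.polynomial import legendre as npleg

def L(k, t):
    """Normalized univariate Legendre polynomial sqrt(2k+1) * P_k at t."""
    return np.sqrt(2 * k + 1) * npleg.legval(t, np.eye(k + 1)[k])

def tensorized_legendre(Lambda, y):
    """Evaluate L_nu(y) for every multi-index nu in Lambda (nu given as a dict j -> nu_j)."""
    out = {}
    for idx, nu in enumerate(Lambda):
        val = 1.0
        for j, nu_j in nu.items():               # only the finitely many active coordinates
            val *= L(nu_j, y[j])
        out[idx] = val
    return out

Lambda = [{}, {0: 1}, {1: 1}, {0: 2}, {0: 1, 1: 1}]   # a small downward closed set
y = {0: 0.3, 1: -0.7}
print(tensorized_legendre(Lambda, y))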

We will first show the error bound in (3.12). Let ν ∈ Λ be arbitrary. We use the shorthand
notation ‖ · ‖ := ‖ · ‖_{L^∞([−1,1]^{|T|})} and calculate

‖ L_ν − L̃_{ν,δ} ‖
    ≤ ‖ ∏_{j∈supp ν} L_{ν_j} − ∏_{j∈supp ν} L̃_{ν_j,δ′} ‖ + ‖ ∏_{j∈supp ν} L̃_{ν_j,δ′} − ∏̃_{δ/2,M_ν}( { L̃_{ν_j,δ′} }_{j∈supp ν} ) ‖
    ≤ Σ_{k∈supp ν} ‖ ∏_{j∈supp ν: j<k} L̃_{ν_j,δ′} ‖ · ‖ L_{ν_k} − L̃_{ν_k,δ′} ‖ · ‖ ∏_{j∈supp ν: j>k} L_{ν_j} ‖ + δ/2
    ≤ d(Λ) M_ν^{d(Λ)−1} δ′ + δ/2 ≤ ( M_ν/(2m(Λ)+2) )^{d(Λ)−1} δ/2 + δ/2 ≤ δ,

where we used (D.4), M_ν ≤ 2m(Λ) + 2 and the definition of δ′.


(1) (2)
We proceed and calculate the depth L of fΛ,δ . Since fΛ,δ = fΛ,δ ◦ fΛ,δ , it holds depth(fΛ,δ ) ≤
(1) (2) (2)
depth(fΛ,δ ) + depth(fΛ,δ ) + 1, see (C.9). We start with a depth bound of fΛ,δ . Denoting by C a
universal multiplicative constant that is allowed to change from line to line, it holds that
   
(2)
depth fΛ,δ = 1 + max depth L̃νj ,δ′
ν ∈Λ
j∈supp ν

≤C max νj νj + log δ ′−1
ν ∈Λ
j∈supp ν

≤ Cm(Λ) m(Λ) + log δ ′−1

≤ Cm(Λ) log (d(Λ)) + d(Λ) log (m(Λ)) + m(Λ) + log δ −1

≤ Cm(Λ) log (d(Λ)) + d(Λ) log (m(Λ)) + m(Λ) + log δ −1 . (D.5)

In (D.5) we used the depth bound for univariate Legendre polynomials, (C.36), at the first in-
(1)
equality. Furthermore, we used νj ≤ m(Λ). For the depth of fΛ,δ it holds
  Y 
(1) f
depth fΛ,δ = 1 + max depth
ν ∈Λ δ/2,Mν

≤ 1 + C max log (| supp ν |) log (| supp ν |) + | supp ν | log (Mν ) + log δ −1
ν ∈Λ

≤ 1 + C log (d(Λ)) log (d(Λ)) + d(Λ) log (m(Λ)) + log δ −1 , (D.6)

where we used | supp ν | ≤ d(Λ) for all ν ∈ Λ, Mν ≤ 4m(Λ) and the depth bound for σ1 -
multiplication networks from Proposition C.11. Combining the two depth bounds (D.5) and
(D.6), we get
   
(1) (2)
depth (fΛ,δ ) = 1 + depth fΛ,δ + depth fΛ,δ

≤ Cm(Λ) log (d(Λ)) + d(Λ) log (m(Λ)) + m(Λ) + log δ −1

+ C log (d(Λ)) log (d(Λ)) + d(Λ) log (m(Λ)) + log δ −1
h i
≤ C log(d(Λ))d(Λ) log(m(Λ))m(Λ) + m(Λ)2 + log(δ −1 ) log(d(Λ)) + m(Λ) .

(1) (2)
For the width width(fΛ,δ ) we use width(fΛ,δ ) ≤ 2 max{width(fΛ,δ ), width(fΛ,δ )}, see (C.10). This

(2) (1)
leaves us to calculate width(fΛ,δ ) and width(fΛ,δ ). It holds
  X  
(2)
width fΛ,δ ≤ width IdR ◦L̃νj ,δ′
(j,νj )∈T
X  
≤2 width L̃νj ,δ′ ≤ 18|T |, (D.7)
(j,νj )∈T

where we used (C.5) for the width of the σ1 -identity network and (C.37) for the width of L̃νj ,δ′ .
(1)
For width(fΛ,δ ) it holds
  X  Y 
(1) f
width fΛ,δ ≤ width IdR ◦
δ/2,Mν
ν ∈Λ
X Y 
f
≤2 width
δ/2,Mν
ν ∈Λ
X
≤ 10d(Λ) = 10|Λ|d(Λ), (D.8)
ν ∈Λ

Q
again using (C.5) and (C.24) for the width of the multiplication network e . Combining (D.7) and
(D.8) gives

width (fΛ,δ ) ≤ 36|Λ|d(Λ),

where |T | ≤ |Λ|d(Λ) was used.


(1) (2)
To estimate size(fΛ,δ ), we use (C.7) and find size(fΛ,δ ) ≤ 2size(fΛ,δ )+2size(fΛ,δ ). We calculate
  n o 
(2)
size fΛ,δ = size IdR ◦L̃νj ,δ′ (yj )
(j,νj )∈T
X  
= size IdR ◦L̃νj ,δ′ (yj )
(j,νj )∈T
  
≤ 2m(Λ)d(Λ) max size (IdR ) + size L̃νj ,δ′ (yj )
(j,νj )∈T
 
≤ 10m(Λ)d(Λ) max size L̃νj ,δ′ (yj )
(j,νj )∈T

≤ Cd(Λ)m(Λ) m(Λ) + log δ ′−1
2

≤ Cd(Λ)m(Λ)2 log (d(Λ)) + d(Λ) log (m(Λ)) + m(Λ) + log δ −1 . (D.9)

In (D.9) we used the NN calculus rules for the sizes of a sparse concatenation in (C.7) and a
parallelization in (C.2). Furthermore, we used |T | ≤ m(Λ)d(Λ) at the first equality and
   
size (IdR ) ≤ 4 max depth L̃νj ,δ′ (yj ) ≤ 4 max size L̃νj ,δ′ (yj ) , (D.10)
(j,νj )∈T (j,νj )∈T

which follows from (C.4). At the third inequality in (D.9) we used the size bound for the univariate
Legendre polynomials from (C.38).

(1)
For size(fΛ,δ ) it holds
  X  Y 
(1) f
size fΛ,δ = size IdR ◦
δ/2,Mν
ν ∈Λ
X Y
f

≤2 size(IdR ) + size
δ/2,Mν
ν ∈Λ
Y 
f
≤ 10|Λ| max size
ν ∈Λ δ/2,Mν

≤ C|Λ| max | supp ν | log (| supp ν |) + | supp ν | log(Mν ) + log δ −1
ν ∈Λ

≤ C|Λ|d(Λ) log (d(Λ)) + d(Λ) log(m(Λ)) + log δ −1 . (D.11)

In (D.11), we used the size bound for from (C.25) and the argument from (D.10). Additionally
we used Mν = 2|νν |1 + 2 ≤ 4m(Λ). Combining (D.9) and (D.11) shows the size bound for fΛ,δ .

The network f_{Λ,δ} consists of sparse concatenations and parallelizations of the networks ∏̃ and
L̃_j. Because we have mpar(∏̃) ≤ 1 and mpar(L̃_j) ≤ 1, the NN calculus rules (C.11) and (C.1)
yield mpar(f_{Λ,δ}) ≤ 1. This finishes the proof.

D.2 RePU-realization of Tensorized Legendre Polynomials


We show a result analogous to Proposition 3.9 for the RePU-realization of tensorized Legendre
polynomials. The construction is similar to [75, Proposition 2.13].
Proposition D.1 (σq -NN approximation of Lν ). Consider the setting of Proposition 3.9. Let
q ∈ N, q ≥ 2. Then there exists a σq -NN fΛ such that the outputs {L̃ν }ν∈Λ of fΛ satisfy

∀ν ∈ Λ, ∀yy ∈ U : L̃ν (yy ) = Lν (yy ).

Furthermore, there exists a constant Cq > 0 depending only on q such that


 
 
depth fΛ ≤ Cq m(Λ) + log d(Λ) ,

width fΛ ≤ Cq |Λ|d(Λ),
 

size fΛ,δ ≤ Cq d(Λ) |Λ| + m(Λ)2 ,

mpar fΛ,δ ≤ Cq .

Proof. Similar to the proof of Proposition 3.9 we define fΛ as a composition of two subnetworks
(1) (2)
fΛ and fΛ . It holds n 
o
(2)
fΛ (yy ) := IdR ◦L̃νj (yj )
(j,νj )∈T

and
  
(1) (1) (2)
fΛ (zk )k≤|T | = fΛ fΛ (yy )
   
f n
Y o
:= IdR ◦ IdR ◦L̃νj (yj )
j∈supp ν ν∈Λ

with T from (D.2) and y = (yj )(j,νj )∈T . Furthermore, we use the σq -NNs L̃j from Corollary C.15

and from Proposition C.12. The calculations are similar to the proof of Proposition 3.9. It

holds
   
(2)
depth fΛ =1+ max depth L̃νj
ν ∈Λ
j∈supp ν

≤ Cq max νj
ν ∈Λ
j∈supp ν

≤ Cq m(Λ). (D.12)
In (D.12) we used the depth bound for univariate Legendre polynomials, (C.39). Furthermore, we
(1)
used νj ≤ m(Λ) for all ν ∈ Λ and j ∈ supp ν . For the depth of fΛ it holds
  Y 
(1) f
depth fΛ = 1 + max depth
ν ∈Λ

≤ 1 + Cq max log (| supp ν |)


ν ∈Λ
≤ 1 + Cq log (d(Λ)) , (D.13)
where we used | supp ν | ≤ d(Λ) for all ν ∈ Λ and the depth bound for σq -multiplication networks
from Proposition C.12. Combining the two depth bounds from (D.12) and (D.13), we get
depth(fΛ ) ≤ Cq (m(Λ) + log(d(Λ)) .
For the width width(fΛ ) we calculate
  X  
(2)
width fΛ = width IdR ◦L̃νj
(j,νj )∈T
X  
≤ Cq width L̃νj ≤ Cq |T |, (D.14)
(j,νj )∈T

where we used (C.16) for the width of a σq -sparse concatenation and (C.40) for the width of L̃νj .
(1)
For width(fΛ ) it holds
  X  Y 
(1) f
width fΛ ≤ width IdR ◦
ν ∈Λ
X Y
f
≤ Cq width
ν ∈Λ
X
≤ Cq d(Λ) = Cq |Λ|d(Λ) (D.15)
ν ∈Λ
Q
using (C.16) and (C.32) for the width of the multiplication network e . Combining (D.14) and
(D.15) gives
width (fΛ ) ≤ Cq |Λ|d(Λ),
where |T | ≤ |Λ|d(Λ) was used.
(1) (2)
To estimate size(fΛ ), we use (C.13) and find size(fΛ ) ≤ Cq (size(fΛ )+size(fΛ )). We calculate
  n o 
(2)
size fΛ = size IdR ◦L̃νj (yj )
(j,νj )∈T
X  
= size IdR ◦L̃νj (yj )
(j,νj )∈T
  
≤ Cq m(Λ)d(Λ) max size (IdR ) + size L̃νj (yj )
(j,νj )∈T
 
≤ Cq m(Λ)d(Λ) max size L̃νj (yj )
(j,νj )∈T
2
≤ Cq d(Λ)m(Λ) . (D.16)

In (D.16) we used |T | ≤ m(Λ)d at the first inequality and
   
size (IdR ) ≤ Cq max depth L̃νj (yj ) ≤ Cq max size L̃νj (yj ) , (D.17)
(j,νj )∈T (j,νj )∈T

which follows from (C.6). At the third inequality in (D.16) we used the size bound for the univariate
Legendre polynomials from (C.41).
(1)
For size(fΛ ) it holds
  X  Y 
(1) f
size fΛ = size IdR ◦
ν ∈Λ
X Y
f
≤ Cq size(IdR ) +
ν ∈Λ
Y
f
≤ Cq |Λ| max size
ν ∈Λ

≤ Cq |Λ| max | supp ν |


ν ∈Λ
≤ Cq |Λ|d(Λ). (D.18)

In (D.18) we used the size bound for from (C.33) and the argument from (D.17). Combining
(D.16) and (D.18) shows the size bound for fΛ .

The network f_Λ consists of sparse concatenations and parallelizations of the networks ∏̃ and
L̃_j. Because we have mpar(∏̃) ≤ C_q and mpar(L̃_j) ≤ C_q, the NN calculus rules (C.17) and (C.1)
yield mpar(f_Λ) ≤ C_q. This finishes the proof.

D.3 Proof of Theorem 3.10


The following two theorems are similar to [43, Theorem 5] and will be required for the proof of
Theorem 3.10.
Theorem D.2. For N, q ∈ N, consider the sparse FrameNet class G^sp_FN(σ_q, N). Let Assumption
2 be satisfied with r > 1 and t > 0. Fix τ > 0 (arbitrarily small). Then there exists a constant
C > 0 independent of N, such that there exists Γ_N ∈ G^sp_FN(σ_q, N) with

sup_{a∈C^r_R(X)} ‖Γ_N(a) − G_0(a)‖_Y ≤ C N^{−min{r−1,t}+τ}.      (D.19)

Theorem D.3. Consider the setting of Theorem D.2. Let Ψ_X be a Riesz basis. Additionally, let
γ be as in (3.9). Fix τ > 0 (arbitrarily small). Then there exists a constant C > 0 independent of
N, such that there exists Γ_N ∈ G^sp_FN(σ_q, N) with

‖Γ_N − G_0‖_{L^2(C^r_R(X),γ;Y)} ≤ C N^{−min{r−1/2,t}+τ}.      (D.20)

We first show that Theorems D.2 and D.3 imply Theorem 3.10.
Proof of Theorem 3.10. First consider the setting of Theorem D.2. Let τ > 0. Then there exists
a constant C independent of N and a FrameNet ΓN ∈ G sp FN (σq , N ) such that for all N ∈ N
2 2
kΓN − G0 k∞,supp(γ) ≤ sup kΓN (a) − G0 (a)kY ≤ CN −2 min{r−1,t}+τ ,
r (X)
a∈CR
r
where we used (D.19) with τ /2 and supp(γ) ⊆ CR (X) by Assumption 2.
Now consider the setting of Theorem D.3. Let τ > 0. Then there exists a constant C indepen-
dent of N and a FrameNet G spFN (σq , N ) with

kΓN − G0 k2L2 (γ) ≤ kΓN − G0 k2L2 (C r (X),γ;Y) ≤ CN −2 min{r− 2 ,t}+τ ,


1

r
where we used supp(γ) ⊂ CR (X) (Assumption 2) and (D.20) with τ /2.
We are left to prove Theorems D.2 and D.3. We need some auxiliary results.

Auxiliary Results
r
For r > 1, R > 0, U = [−1, 1]N and σR form (3.7) we define
r
u : U → Y, u(yy ) := (G0 ◦ σR )(yy ).

For the proofs of Theorems D.2 and D.3 we do a Y-valued tensorized Legendre expansion of u in
the frame (ηj Lν (yy ))j,νν of L2 (U, π; Y), which reads
XX
r
u(yy ) = G0 (σR (yy )) = cν ,j ηj Lν (yy ) (D.21)
j∈N ν ∈F

with Legendre coefficients Z


cν ,j := Lν (yy ) hu(yy ), η̃j iY dπ(yy ). (D.22)
U
Our aim is to construct the network ΓN out of the tensorized Legendre polynomials with the
“most important” contributions to the expansion. This contribution is quantified via the Legendre
coefficients cν ,j in (D.22). We therefore have to examine bounds on cν ,j and analyze their respective
structure. Therefore consider the following order relation on multi-indices in F from (3.11). For
µ , ν ∈ F we write µ ≤ ν if and only if µj ≤ νj for all j ∈ N. We call a set Λ ⊂ F downward closed
if and only if ν ∈ F implies µ ∈ F for all µ ≤ ν . Furthermore, for ν ∈ F, define

Y
ων := (1 + 2νj ).
j=1

The following theorem is a special case of [104, Theorem 2.2.10]. The formulation is similar to
[43, Theorem 4].
Theorem D.4. Let Assumption 2 be satisfied with r > 1 and t > 0. Fix τ > 0, p ∈ ( 1r , 1] and
t′ ∈ [0, t]. Consider F from (3.11), and let π = ⊗j∈N λ2 be the infinite product (probability) measure
on U = [−1, 1]N , where λ denotes the Lebesgue measure on [−1, 1]. Then there exists C > 0 and
a sequence (aν )ν ∈F ∈ lp (F) of positive numbers such that
(i) for each ν ∈ F Z
ωντ Lν (yy )u(yy ) dπ(yy ) ≤ Caν ,
U Yt′

(ii) there exists an enumeration (ννi )i∈N of F such that (aνi )i∈N is monotonically decreasing, the
set ΛN := {ννi : i ≤ N } ⊆ F is downward closed for each N ∈ N, and additionally

m(ΛN ) := max |ννi | = O (log(|ΛN |)) , (D.23)


i=1,...,N

d(ΛN ) := max | supp νi | = o (log(|ΛN |)) (D.24)


i=1,...,N

for N → ∞,
(iii) the following expansion holds with absolute and uniform convergence:
X Z

∀yy ∈ U : u(yy ) = Lν (yy ) x)u(x
Lν (x x ) ∈ Yt .
x ) dπ(x
ν ∈F U

The following proposition reformulates Theorem D.4 (i) into a bound for cν ,j . It was shown in

[43, Proposition 2]. Recall that θj denote the weights to define the spaces Yt , t′ > 0, see Definition
3.4.

Proposition D.5 ([43, Proposition 2]). Consider the setting of Theorem D.4. Then for each
ν ∈F X ′
ων2τ θj−2t cν2 ,j ≤ C 2 aν2 .
j∈N

Proposition D.5 gives decay of the coefficients cν ,j in both j and ν . Since θj = O(j −1+τ )

for all τ > 0 we have cν2 ,j = O(j −1−2t +τ̃ ) for τ̃ < 2τ t′ and every ν ∈ ΛN . Furthermore, since
p
(aν )ν ∈F ∈ l (F) the Legendre coefficients cν ,j decay algebraically in ν . We continue with a
technical lemma, which was shown in [43, Lemma 4].
Lemma D.6 ([43, Lemma 4]). Let α > 1, β > 0 and assume two sequences (ai )i∈N and (dj )j∈N in
R with ai . i−α and dj . j −β for all i, j ∈ N. Additionally assume that (dj )j∈N is monotonically
decreasing. Suppose that there exists a constant C < ∞ such that the sequence (ci,j )i,j∈N satisfies
∀i ∈ N :   Σ_{j∈N} c^2_{i,j} d_j^{−2} ≤ C^2 a_i^2.

Then for every τ > 0

(i) for all N ∈ N there exists (m_i)_{i∈N} ⊆ N_0^N monotonically decreasing such that Σ_{i∈N} m_i ≤ N and

    ( Σ_{i∈N} Σ_{j>m_i} c^2_{i,j} )^{1/2} ≲ N^{−min{α−1,β}+τ},

(ii) for all N ∈ N there exists (m_i)_{i∈N} ⊆ N_0^N monotonically decreasing such that Σ_{i∈N} m_i ≤ N and

    ( Σ_{i∈N} Σ_{j>m_i} c^2_{i,j} )^{1/2} ≲ N^{−min{α−1/2,β}+τ}.

In the following, we use Lemma D.6 to get a decay property for the Legendre coefficients
cνi ,j with the enumeration νi of ΛN from Theorem D.4. The sequence m = (mi )i∈N quantifies
which coefficients of the Legendre expansion are “important” and are therefore used to define the
surrogate ΓN .
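Lemma D.6 is an existence statement; for intuition only, the following sketch (ours, and not the selection used in its proof — in particular it does not enforce that (m_i) is monotonically decreasing) picks a truncation with Σ_i m_i ≤ N greedily by coefficient magnitude and reports the resulting ℓ^2-tail, which shrinks as the budget N grows when the coefficients decay algebraically in both indices.

import numpy as np

def greedy_truncation(c, N):
    """c: array of shape (I, J) of coefficients c_{nu_i, j}; keep at most N of them, prefix-wise in j."""
    I, J = c.shape
    m = np.zeros(I, dtype=int)
    for _ in range(N):
        # candidate "next" coefficient of every row is c[i, m[i]]
        gains = [abs(c[i, m[i]]) if m[i] < J else -np.inf for i in range(I)]
        i_star = int(np.argmax(gains))
        if gains[i_star] == -np.inf:
            break
        m[i_star] += 1
    tail = np.sqrt(sum(np.sum(c[i, m[i]:] ** 2) for i in range(I)))
    return m, tail

rng = np.random.default_rng(0)
i = np.arange(1, 51)[:, None]
j = np.arange(1, 101)[None, :]
c = rng.standard_normal((50, 100)) * i**-2.0 * j**-1.5   # algebraic decay in both indices
for N in (10, 50, 200):
    m, tail = greedy_truncation(c, N)
    print(N, m.sum(), round(float(tail), 4))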
We first show that Theorem D.4 yields sufficient decay on the Legendre coefficients cνi ,j s.t. the
assumptions of Lemma D.6 are satisfied.

Lemma D.7. Consider the setting of Theorem D.4. Let τ̃ > 0 such that 1/p > r − τ̃ /2. Then

the assumptions of Lemma D.6 are fulfilled for α = r − τ̃ /2, β = t − τ̃ /2, ai = aνi , dj = θjt and
1/2
ci,j = ωνi cνi ,j for i, j ∈ N.
1
Proof. Proposition D.5 with τ = 2 gives
  21
1 X ′
 τ̃

ω 
2
νi θj−2t cν2i ,j  = O(aνi ) = O i−r+ 2 . (D.25)
j∈N

P
The last equality in (D.25) holds because iaνpi ≤ p
j∈N aνj < ∞ (since aνi is monotonically

decreasing) implies aνi = O(i−1/p ) = O(i−r+τ̃ /2 ). Since (θjt )j∈N ∈ l1/(t−τ̃ /2) (see Definition 3.4)
it holds ′
θjt = O(j −t+τ̃ /2 )
with the same argument.

Proofs of Theorems D.2 and D.3
The proof of Theorems D.2 and D.3 is similar to [43, Sections 4.2-4.4, Proofs of Theorems 1, 2
and 5].
Proof of Theorem D.2. Let (aν )ν ∈F be the enumeration (ννi )i∈N from Theorem D.4, where we use
the case τ = 12 . Therefore (aνi )i∈N is monotonically decreasing and belongs to lp with p ∈ ( r1 , 1].
We further fix τ̃ > 0 and demand p1 > r − τ̃2 . Fix Ñ ∈ N and set ΛÑ := {ννj : j ≤ Ñ } ⊂ F, which
is downward closed by Theorem D.4. Now we approximate the tensorized Legendre polynomials
Lν on the index set ΛÑ . Let ρ ∈ (0, 21 ). In the ReLU case, Proposition 3.9 gives a NN fΛÑ ,ρ with
outputs {L̃ν ,ρ }ν ∈ΛÑ s.t.
sup max Lν (yy ) − L̃ν ,ρ (yy ) ≤ ρ.
y ∈U ν ∈ΛÑ

Using |ΛÑ | = Ñ , (D.23) and (D.24), it holds for Ñ ≥ 2


 
depth(fΛÑ ,ρ ) = O log(Ñ )2 log(log(Ñ ))2 + log(Ñ ) log ρ−1 ,
width(fΛÑ ,ρ ) = O(Ñ log(Ñ )),
 
size(fΛÑ ,ρ ) = O Ñ log(Ñ )2 log(log(Ñ )) + Ñ log(Ñ ) log ρ−1 ,
mpar(fΛÑ ,ρ ) = 1.

The constants hidden in O( . ) are independent of Ñ and ρ. For Ñ ∈ N, set the accuracy ρ :=
1
Ñ − min{r− 2 ,t} . Then it holds
 
depth(fΛÑ ,ρ ) = O log(Ñ )2 log(log(Ñ ))2 ,
 
size(fΛÑ ,ρ ) = O Ñ log(Ñ )2 log(log(Ñ )) .

Proposition D.1 shows that the ReLU bounds also hold for the RePU-case.
By Lemma D.7 the assumptions of Lemma D.6 are satisfied. Applying P Lemma D.6 (i) with
α := r − τ̃ /2 and β := t − τ̃ /2 gives a sequence (mi )i∈N ⊂ NN
0 such that i∈N mi ≤ Ñ and
  12
X 1 X
ω 
2
νi cν2i ,j  ≤ C Ñ − min{r−1,t}+τ̃ . (D.26)
i∈N j>mi

We now define   X
γ̃Ñ ,j := L̃νi ,ρ (yy )cνi ,j (D.27)
{i∈N:mi ≥j}

for j ∈ N, where empty sums are set to zero. Recall the uniform distribution π on U = [−1, 1]N
(Example 3.8). With γ̃Ñ = (γ̃Ñ ,j )j∈N it holds

X X X
r
kG0 ◦ σR (yy ) − DY ◦ γ̃Ñ (yy )kY = cνi ,j Lνi (yy )ηj − cνi ,j L̃νi ,ρ (yy )ηj
i,j∈N i∈N j≤mi
Y

X X X X
≤ Lνi (yy ) cνi ,j ηj + (Lνi (yy ) − L̃νi ,ρ (yy )) cνi ,j ηj
i∈N j>mi i∈N j≤mi
Y Y
  21   12
X X X X
≤ ΛΨ Y kLνi k∞,π  cν2i ,j  + ΛΨ Y ρ  cν2i ,j 
i∈N
| {z } j>m i∈N j≤m
1 i i
≤ων2i

≤ C̃ΛΨ Y Ñ − min{r−1,t}+τ̃ + C̃ΛΨ Y ρ ≤ C̃ Ñ − min{r−1,t}+τ̃ (D.28)

for all y ∈ U . In (D.28) we used the definition of DY , (D.27) and (D.21) at the first equality.
Furthermore, we used (D.26), the definition of ρ and
  21   21
X X X X
 cν2i ,j  ≤  cν2i ,j 
i∈N j≤mi i∈N j∈N
  21
X 1 X ′
≤ C̃ ω 
2
νi θj−2t cν2i ,j 
i∈N j∈N
X X
≤ C̃ aνi ≤ C̃ i−r+τ̃ /2 ≤ C̃ (D.29)
i∈N i∈N

at the second-to-last inequality. We changed the constants C̃ from line to line in (D.28) and
(D.29). The last line of (D.28) shows why the RePU-case does not improve the approximation
property qualitatively. In the RePU-case, Proposition D.1 gives a σq -NN fΛ exactly realizing the
tensorized Legendre polynomials, i.e. the case ρ = 0 from above. Therefore the second summand
in the last line of (D.28) vanishes. This does not improve the approximation rate due to the first
summand. This part depends on the summability properties of the Legendre coefficients cνi ,j
following Assumption 2 and is therefore independent of the activation function σ.
Now we argue similar to [43, Proof of Theorem 1]. Consider the scaling Sr from (3.5). It holds
r
Sr ◦ EX (a) ∈ U ∀a ∈ CR (X), (D.30)

because of (3.8) and (3.5). We define Γ̃Ñ := DY ◦ γ̃Ñ ◦ Sr ◦ EX and calculate

sup G0 (a) − Γ̃Ñ (a) = sup kG0 (a) − DY ◦ γ̃Ñ ◦ Sr ◦ EX (a)kY


r (X)
a∈CR Y r (X)
a∈CR
r r
= sup kG0 ◦ σR (yy ) − DY ◦ γ̃Ñ ◦ Sr ◦ EX ◦ σR (yy )kY
r (y
y ∈U: σR
{y r (X)}
y )∈CR
r
≤ sup kG0 ◦ σR (yy ) − DY ◦ γ̃Ñ (yy )kY ≤ C̃ Ñ − min{r−1,t}+τ̃ , (D.31)
y ∈U

where we used (D.30) and (D.28).


In order to finish the proof of Theorem D.2, we relate Ñ to N and show ΓN := Γ̃Ñ ∈
G sp
FN (σq , N ), i.e. we show that the approximation networks we constructed have the desired sparse
structure. We simultaneously prove the ReLU- and RePU-case.
In order to analyse the NNs γ̃Ñ from (D.27), we specify its structure. We set nj = |{mi ≥ j}|
and define
 n  !
o
γ̃Ñ = Σnj IdR ◦SMcνi ,j ◦ L̃νi ,ρ . (D.32)
mi ≥j j∈N

The round brackets in (D.32) denote a parallelization. The networks SMcνi ,j denote the scalar
multiplication networks from Definition C.8. Furthermore, Σnj denotes the summation network
from Definition C.7 and we use the identity networks IdR from Lemma C.3 or C.4 to synchronize
the depth. Using the respective bounds for the summation and scalar multiplication networks and

the NN calculus for parallelization and sparse concatenation we get
    
depth(γ̃Ñ ) ≤ 2 + max
2
depth L̃νi ,ρ + depth SM cνi ,j + max depth Σ nj
i,j∈N , mi ≥j j∈N
2
≤ 3 + O(log(Ñ ) log(log(Ñ ))) + max Cq log (|cνi ,j |) + 0
i,j∈N2 , mi ≥j

= O(log(Ñ )2 log(log(Ñ ))), (D.33)



X X   X X 
width(γ̃Ñ ) ≤ max width L̃νi ,ρ , width SMcνi ,j ,

j∈N mi ≥j j∈N mi ≥j

X X X 
width (IdR ) , width Σnj

j∈N mi ≥j j∈N
 
 X X X 
≤ Cq max O(Ñ log(Ñ )), Cq , nj = O(Ñ log(Ñ )),
 
j∈N mi ≥j j∈N
X X      X 
size(γ̃Ñ ) ≤ Cq size L̃νi ,ρ + size SMcνi ,j + size (IdR ) + Cq size Σnj
j∈N mi ≥j j∈N
  
  X X
= O Ñ log(Ñ )2 log(log(Ñ )) +  Cq log (|cνi ,j |) + Cq nj 
j∈N mi ≥j
  X X
≤ O Ñ log(Ñ )2 log(log(Ñ )) + Cq 1
j∈N mi ≥j
 
= O Ñ log(Ñ )2 log(log(Ñ )) , (D.34)
mpar(γ̃Ñ ) ≤ Cq .

In (D.33)–(D.34) we used log (|cνi ,j |) ≤ Cq for all i, j ∈ N independent of n. Furthermore, we used


X X X X X X
nj = 1= 1= mi ≤ Ñ .
j∈N j∈N mi ≥j i∈N j≤mi i∈N

To get rid of the logarithmic terms, we define N = N (Ñ ) := max{1, Ñ log(Ñ )3 } and obtain a NN
γN = γ̃Ñ with

depth(γN ) = O(log(N )),


width(γN ) = O(N ),
size(γN ) = N,
mpar(γN ) ≤ Cq

and error less than

C̃ Ñ − min{r−1,t}+τ̃ = C̃ Ñ −κ ≤ C̃(3κ/τ̃ )3κ N −κ+τ̃ := CN − min{r−1,t}+τ . (D.35)

Per definition depth(γN ) = O(log(N )) and width(γN ) = O(N ) yields constants C̃L , C̃p and
N1 , N2 ∈ N s.t.

depth(γN ) ≤ C̃L max{1, log(N )}, N ≥ N1 ,


width(γN ) ≤ C̃p N, N ≥ N2 .

Setting CL = max{C̃L , maxN =2,...,N1 −1 depth(γN )/ log(2)} and

Cp = max{C̃p , maxN =1,...,N2 −1 width(γN )} shows

depth(γN ) ≤ CL log(N ), N ∈ N,
width(γN ) ≤ Cp N, N ∈ N.

In order to show ΓN := DY ◦ γN ◦ Sr ◦ EX ∈ G sp
FN (σq , N ), we are left to show that the maximum
Euclidean norm k · k2 of γN in U is independent of N . It holds for all y ∈ U that DX ◦ Sr−1 (yy ) ∈
r
CR (X). We get

sup kγN (yy )k2 = sup kEY ◦ ΓN ◦ DX ◦ Sr−1 (yy )k2


y ∈U y ∈U

≤ ΛΨY sup kΓN (a)kY


r (X)
a∈CR

= ΛΨ Y sup kΓN (a) − G0 (a) + G0 (a)kY


r (X)
a∈CR

≤ ΛΨ Y sup (kΓN (a) − G0 (a)kY + ckG0 (a)kYt )


r (X)
a∈CR

≤ ΛΨ Y (C + cCG0 ) =: B, (D.36)

where ΛΨY denotes the upper frame bound of Ψ Y and c = θ0t , see Definition 3.4. In (D.36) we used
Assumption 2 and the approximation error from (D.35). Thus ΓN ∈ G sp FN (σq , N ) for all N ∈ N,
r
where we set Cs = 1. Using supp(γ) ⊂ CR (X) (Assumption 2) in (D.31) finalizes the proof of
Theorem D.2.
Proof of Theorem D.3. By Lemma D.7 the assumptions of Lemma D.6 are satisfied. Applying
Lemma D.6 (ii) with α := r − τ̃ /2 and β := t − τ̃ /2 gives a sequence (mi )i∈N ⊂ NN
P 0 such that
i∈N m i ≤ Ñ and
  21
X X
cν2i ,j  ≤ C̃ Ñ − min{r− 2 ,t}+τ̃ .
1
 ων i (D.37)
i∈N j>mi

Define γ̃Ñ = (γ̃Ñ ,j )j∈N for all y ∈ U with γ̃Ñ ,j as in (D.27). Then it holds

X X
r
kG0 ◦ σR − DY ◦ γ̃Ñ kL2 (U,π;Y) ≤ cνi ,j Lνi ηj
i∈N j>mi
L2 (U,π;Y)

X X  
+ cνi ,j ηj Lνi − L̃νi ,ρ
i∈N j≤mi
L2 (U,π;Y)
  21
  21
 X X  X X
≤ ΛΨY  kLνi k2∞,π cν2i ,j  + ΛΨY ρ  cν2i ,j 
i∈N
| {z } j>mi i∈N j≤mi
≤ωνi
1 1
≤ C̃ΛΨ Y Ñ − min{r− 2 ,t}+τ̃ + C̃ΛΨ Y ρ ≤ C̃ Ñ − min{r− 2 ,t}+τ̃ . (D.38)

In (D.38) we used the definition of DY , (D.27) and (D.21) at the first inequality. Additionally we
used that (Lν ηj )ν ,j is a frame of L2 (U, π; Y) at the second inequality. Finally we used (D.37), the
definition of ρ and an argument similar to (D.29) at the second-to-last inequality. Note that again
we changed the constants C̃ from line to line in (D.38).
Since Ψ X is a Riesz basis, we have (see Section 3.1.2 and (3.5))
r r r
CR (X) = {σR (yy ), y ∈ U } and EX ◦ σR (yy ) = Sr−1 (yy ). (D.39)

With Γ̃Ñ := DY ◦ γ̃Ñ ◦ Sr ◦ EX we calculate

Γ̃Ñ − G0 r (X),(σ r ) π;Y)


= kDY ◦ γ̃Ñ ◦ Sr ◦ EX − G0 kL2 (C r (X),(σr )# π;Y)
L2 (CR R #
R R

r r
= kDY ◦ γ̃Ñ ◦ Sr ◦ EX ◦ σR − G0 ◦ σR kL2 (U,π;Y)

kL2 (U,π;Y) ≤ C̃ Ñ − min{r− 2 ,t}+τ ,


1
r
= kDY ◦ γ̃Ñ − G0 ◦ σR

where we used (D.39) and (D.38). Defining N = N (Ñ ) := max{1, Ñ log(Ñ )3 } we can proceed
similar to the proof of Theorem D.2 from (D.31) on. The reason this works is that the NNs γ̃Ñ
are defined in the same way in the L2 - and the L∞ -case (only the sequence m changes, but not
its properties). This shows ΓN := Γ̃Ñ ∈ G sp
FN (σq , N ) for all N ∈ N and thus finishes the proof of
Theorem D.3.

D.4 Proof of Lemma 3.12


The arguments in the following proof are based on entropy bounds for feedforward neural network
classes, first established in [89, Proof of Lemma 5].
Define the supremum norm k · k∞,∞ on g FN as

kgk∞,∞ := sup kg(yy )k∞ , g ∈ g FN , (D.40)


y ∈Rp0

where k·k∞ denotes the maximum norm in Rn . Then [80, Proposition 3.5] shows that (gg FN , k·k∞,∞ )
is compact. Since the map i : g FN → G FN , g → G = DY ◦ g ◦ EX is linear, also (G G FN , k·k∞,supp(γ) )
and hence (GG FN , k·kn ) is compact. We now show the entropy bounds for G FN .
Step 1. Recall depth(g) ≤ L, width(g) ≤ p, size(g) ≤ s and mpar(g) ≤ M for g ∈ g_FN.
We first estimate the entropy H(G_FN, ‖ · ‖_{∞,supp(γ)}, δ) against the respective entropy of g_FN. For
G, G′ ∈ G_FN and g, g′ ∈ g_FN with G = D_Y ◦ g ◦ S_r ◦ E_X, G′ = D_Y ◦ g′ ◦ S_r ◦ E_X, it holds

kG − G′ k∞,σRr (U) = sup kDY ◦ g ◦ Sr ◦ EX (x) − DY ◦ g ′ ◦ Sr ◦ EX (x)kY


r
x∈σR (U)

≤ ΛΨ Y sup kg(yy ) − g ′ (yy )k2 ≤ ΛΨY pkg − g ′ k∞,∞ , (D.41)
y ∈U

r √
where we used σR = DX ◦Sr−1 and k·k∞,∞ from (D.40). Furthermore, we used kg(u)k2 ≤ pkgk∞
for all g ∈ g FN , since NNs g ∈ g FN have width(g) ≤ p. Then (D.41) yields
 
δ
GFN , k · k∞,σRr (U) , δ) ≤ H g FN , k · k∞,∞ ,
H(G √ . (D.42)
ΛΨY p

Step 2. It remains to bound H(gg FN , k · k∞,∞ , δ) = log(N (gg FN , k · k∞,∞ , δ)). To this end
we follow the proof and notation of [89, Lemma 5]. For l = 1, . . . L + 1, define the matrices
l
Wl = (wi,j )i,j ∈ Rpl−1 ×pl and the vectors Bl = (blj )j ∈ Rpl . Furthermore, define

σ Bl : Rpl → Rpl , σ Bl (x) = σ1 (x + Bl ) = max{0, x + Bl }, l = 1, . . . L,


BL+1 pL+1 pL+1 BL+1
σ :R →R , σ (x) = x + BL+1 . (D.43)

Then we can write a NN g ∈ g FN as a functional composition of σ Bl and Wl , i.e.

g : Rp0 → RpL+1 , g(x) = σ BL+1 WL+1 σ Bl . . . W2 σ B1 W1 x.

For k ∈ {1, . . . , L + 1} we define the functions

A+
kg :R
p0
→ Rpk , A+
k g(x) = σ
Bk
Wk . . . σ B1 W1 x,
A−
kg :R
pk−1
→ RpL+1 , A−
k g(x) = σ
BL+1
WL+1 . . . σ Bk Wk x. (D.44)

Furthermore, set A+ −
0 g = IdRp0 and AL+2 g = IdR L+1 . For all 1 ≤ l ≤ L + 1 holds
p

kσ Bl (x)k∞ ≤ kxk∞ + M
kW l (x)k∞ ≤ kW l k∞ kxk∞ ≤ M pkxk∞ .

We claim that for k ∈ {1, . . . , L + 1}

sup kA+
k g(x)k∞ ≤ (M (p + 1))
k
x∈[−1,1]p0

and proceed by induction. The case k = 0 is trivial. To go from k − 1 to k we compute

sup kA+
k xk∞ = sup kσ Bk Wk (σ Bk−1 Wk−1 · · · σ B1 W1 x)k∞
x∈[−1,1]p0 x∈[−1,1]p0

≤ sup kσ Bk W k xk∞
x∈[−(M(p+1))k−1 ,(M(p+1))k−1 ]pk−1

≤ (M p(M (p + 1))k−1 + M ) ≤ (M (p + 1))k , (D.45)

as claimed.
Moreover, for l = 1, . . . , L + 1, Wl : (Rpl−1 , k · k∞ ) → (Rpl , k · k∞ ) is Lipschitz with constant
M p and σ Bl : (Rpl , k · k∞ ) → (Rpl , k · k∞ ) is Lipschitz with constant 1. Thus we can estimate the
Lipschitz constant of A− k g for k = 1, . . . , L + 1. It holds

A− −
k g(x) − Ak g(y) ∞
= σ BL+1 WL+1 . . . σ Bk Wk x − σ BL+1 WL+1 . . . σ Bk Wk y ∞
BL Bk BL Bk
≤ Mp σ WL . . . σ Wk x − σ WL . . . σ Wk y ∞
L+2−k pk−1
≤ · · · ≤ (M p) kx − yk∞ for x, y ∈ R . (D.46)
l,∗
Now let g, g ∗ ∈ g FN be two NN such that |wi,j
l
− wi,j | < ε and |bli − bl,∗
i | < ε for all i ≤ pl+1 ,
j ≤ pl , l ≤ L + 1. Then
L+1
X ∗
kg − g ∗ k∞,∞ ≤ A−
k+1 gσ
Bk
Wk A+ ∗ −
k−1 g − Ak+1 gσ
Bk
Wk∗ A+
k−1 g

∞,∞
k=1
L+1
X L+1−k ∗
≤ (M p) σ Bk Wk A+ ∗
k−1 g − σ
Bk
Wk∗ A+
k−1 g

∞,∞
k=1
L+1
X  
L+1−k
≤ (M p) (Wk − Wk∗ )A+
k−1 g

∞,∞
+ kBk − Bk∗ k∞
k=1
L+1
X 
≤ε (M p)L+1−k pM k−1 (p + 1)k−1 + 1
k=1
< ε(L + 1)M L (p + 1)L+1 , (D.47)

where we used (D.45), (D.46) and M ≥ 1. The total number of weight and biases is less than
(L + 1)(p2 + p). Therefore there are at most
 
(L + 1)(p2 + p)
≤ ((L + 1)(p2 + p))s
s

combinations to pick s nonzero parameters. Since all parameters are bounded by M , we choose

ε = δ/((L + 1)M L (p + 1)L+1 ) and obtain the covering bound for all δ > 0
( s
)
X s∗
−1 2
N (gg FN , k · k∞,∞ , δ) ≤ max 1, 2M ǫ (L + 1)(p + p)
s∗ =1
( s
)
X s∗
−1 L+1 L+1 2
≤ max 1, 2δ (L + 1)M (p + 1) (L + 1)(p + p)
s∗ =1
( s
)
X ∗
−1 2 L+1 L+3 s
≤ max 1, 2δ (L + 1) M (p + 1)
s∗ =1
 s+1
≤ 2L+6 L2 M L+1 pL+3 max 1, δ −1 , (D.48)

where we used L ≥ 1 and p ≥ 1 at the last inequality. Eqs. (D.48) and (D.42) show (3.16).
Applying (3.16) to the sparse FrameNet class G spFN (σ1 , N ) gives

H G sp
FN (σ1 , N ), k · k∞,σR (U) , δ
r
  
depth +4
≤ (sizeN + 1) log 2depthN +6 ΛΨ Y depth2N M depthN +1 widthN N max 1, δ −1
≤ (Cs N + 1)
  
× log 2CL log(N )+6 ΛΨY (CL log(N ))2 M CL log(N )+1 (Cp N )CL log(N )+4 max 1, δ −1
SP
 
≤ CH N 1 + log(N )2 + log max 1, δ −1 , N ∈ N, δ > 0, (D.49)

where we defined

SP
CH = 2Cs (CL + 6) log(2) + log(ΛΨ Y ) + CL2 +

+ (CL + 1) log(M ) + (CL + 4)(log(Cp ) + 1) .

Applying (3.16) to the fully connected FrameNet class G full FN (σ1 , N ) gives

H G full
FN (σ1 , N ), k · k∞,σR
r (U) , δ
  
depth +4
≤ (sFC (N ) + 1) log 2depthN +6 ΛΨ Y depth2N M depthN +1 widthN N max 1, δ −1
 
≤ (depthN + 1) width2N + widthN + 1
  
× log 2CL log(N )+6 ΛΨY (CL log(N ))2 M CL log(N )+1 (Cp N )CL log(N )+4 max 1, δ −1
 
≤ (CL log(N ) + 1) Cp2 N 2 + Cp N + 1
  
× log 2CL log(N )+6 ΛΨY (CL log(N ))2 M CL log(N )+1 (Cp N )CL log(N )+4 max 1, δ −1
FC 2
 
≤ CH N 1 + log(N )3 + log max 1, δ −1 , N ∈ N, (D.50)

where we defined

FC
CH = 8CL Cp2 (CL + 6) log(2) + log(ΛΨ Y ) + CL2 +

+ (CL + 1) log(M ) + (CL + 4)(log(Cp ) + 1) .

Equations (D.49) and (D.50) finish the proof of Lemma 3.12.

D.5 Proof of Lemma 3.14
The following proof is a modification of [89, Proof of Lemma 5] to the case where the activation
function is not globally, but only locally Lipschitz continuous. The compactness of (G G , k·k∞,supp(γ) )
follows similarly to the ReLU case since [80, Proposition 3.5] holds for any continuous activation
function.
Let q ∈ N, q ≥ 2 and let σq : R → R, σq (x) = max{0, x}q denote the RePU activation
function. Recall depth(g) ≤ L, width(g) ≤ p, size(g) ≤ s and mpar(g) ≤ M for g ∈ g_FN.
We argue analogously to the ReLU-case in Lemma 3.12 and bound the entropy of the NN class
g FN (σq , L, p, s, M, B). Recall the definitions of σ Bl , A+ −
k g and Ak g from (D.43)-(D.44). Similar to
(D.45) it holds that

kA+
k gk∞,∞ = sup kA+
k g(x)k∞
x∈[−1,1]p0

≤ sup σ Bk Wk σ Bk−1 . . . W2 σq x ∞
x∈[−M(p+1),M(p+1)]p1

≤ sup σ Bk Wk σ Bk−1 . . . W2 x ∞
x∈[−M q (p+1)q ,M q (p+1)q ]p1
Pk
qj qk+1
≤ · · · ≤ (M (p + 1))) j=1
≤ (M (p + 1)) ,

where we used M ≥ 1.
In the RePU-case, A− ′
k g is only locally Lipschitz: Since |σq (x) | ≤ q|x|
q−1
it holds

|σq (x) − σq (y)| ≤ q max{|x|, |y|}q−1 |x − y| ∀x, y ∈ R.

Therefore for k = 1, . . . , L + 1 and x, y ∈ Rpk−1 , kxk∞ , kyk∞ ≤ C, we get

A− −
k g(x) − Ak g(y) ∞
= σ BL+1 WL+1 σ BL . . . Wk x − σ BL+1 WL+1 σ BL . . . Wk y ∞
BL BL−1 BL BL−1
≤ Mp σ WL σ . . . Wk x − σ WL σ . . . Wk y ∞
!q−1
≤ M pq sup WL σ BL−1 . . . Wk x ∞
kxk∞ ≤C

× WL σ BL−1 . . . Wk x − WL σ BL−1 . . . Wk y ∞
!q−1
≤ (M pq)L+2−k sup WL σ BL−1 . . . Wk x ∞
kxk∞ ≤C
!q−1 !q−1
× sup WL−1 σ BL−2 . . . Wk x ∞
× ···× sup kWk xk∞ kx − yk∞ .
kxk∞ ≤C kxk∞ ≤C

Using

sup Wj σ Bj−1 . . . Wk x ∞
≤ sup Wj σ Bj−1 . . . Wk+1 x ∞
kxk∞ ≤C kxk∞ ≤(M(p+1)C)q

≤ sup Wj σ Bj−1 . . . Wk+2 x ∞


2
kxk∞ ≤(M(p+1))q+q C q2

≤ ··· ≤ sup kWj xk∞


Pj−k l
j−k
kxk∞ ≤(M(p+1)) l=1 q C q
Pj−k
ql j−k qj−k+1
≤ (M (p + 1)) l=0
Cq ≤ (M (p + 1)C)

for j = k, . . . , L and C, M ≥ 1, we get
L 
Y qj−k+1
A− −
k g(x) − Ak g(y) ∞
≤ (M pq)L+2−k M (p + 1)Ĉ kx − yk∞
j=k
 qL+2−k
≤ (M pq)L+2−k M (p + 1)Ĉ kx − yk∞ , x, y ∈ Rpk−1 .(D.51)

l,∗
Now we proceed similar to (D.47). Let g, g ∗ ∈ g FN be two NN such that |wi,j l
− wi,j | < ε and
l l,∗ + −
|bi − bi | < ε for all i ≤ pl+1 , j ≤ pl , l ≤ L + 1. Then with A0 g = IdRp0 and AL+2 g = IdRpL+1 we
estimate

kg − g ∗ k∞,∞
L+1
X ∗
≤ A−
k+1 gσ
Bk
Wk A+ ∗ −
k−1 g − Ak+1 gσ
Bk
Wk∗ A+
k−1 g

∞,∞
k=1
L+1
X  qL+1−k
L+1−k qk+1
≤ (M pq) M (p + 1) (M (p + 1))
k=1

σ Bk Wk A+ ∗
k−1 g − σ
Bk
Wk∗ A+
k−1 g

∞,∞
L+1
X  k+1
qL+1−k
≤ (M pq)L+1−k M (p + 1) (M (p + 1))q
k=1
   q−1
(Wk − Wk∗ )A+
k−1 g

∞,∞
+ kB k − B ∗ +
k ∞,∞ q M (p + 1) Ak−1 g
k ∗
∞,∞
L+1
X  q L+1−k
L+2−k qk+1
≤ 2ε (M pq) M (p + 1) (M (p + 1)) A+
k−1 g

∞,∞
k=1
 q−1
M (p + 1) A+
k−1 g

∞,∞
  L
L+2 q

L+1 q
≤ 2ε(L + 1) (M (p + 1)q)L+q M (p + 1) (M (p + 1))q (M (p + 1))q
4q2L+2 √ −1
< εLq L+q (2pM ) 2M p(p2 + p)(L + 1) . (D.52)

In (D.52) we used the Lipschitz bound (D.51) with


n o
Bk∗ qk+1
C = max 1, kσ Bk Wk A+ ∗
k−1 g k∞,∞ , kσ Wk∗ A+ ∗
k−1 g k∞,∞ ≤ (M (p + 1)) ,

and p ≥ 1, L ≥ 1, q ≥ 2 at the last inequality.


As in the proof of Lemma 3.12, there are
 
(L + 1)(p2 + p)
≤ ((L + 1)(p2 + p))s
s

combinations to pick s nonzero weights and biases. Since all parameters are bounded by M , we
choose

2M p(p2 + p)(L + 1)δ
ε= 4q2L+2
Lq L+q (2pM )

and obtain the covering bound
( s
)
X s∗
N (gg FN , k · k∞,∞ , δ) ≤ max 1, 2M ǫ−1 (L + 1)(p2 + p)
s∗ =1
( s 
)
X √ s

L+q 4q2L+2
≤ max 1, Lq (2pM ) ( pδ)−1
s∗ =1
 2L+2 √  s+1
−1
≤ Lq L+q (2pM )4q p max 1, δ −1 . (D.53)

Eqs. (D.53) and (D.42) show (3.17).


Applying (3.17) to the sparse FrameNet class G sp FN (σq , N ) gives

H G sp
FN (σq , N ), k · k∞,σRr (U) , δ
  
4q2depth N +2
≤ (sSP (N ) + 1) log ΛΨ Y depthN q depthN +q (2widthN M ) max 1, δ −1
  
4q2CL log(N )+2
≤ (Cs N + 1) log ΛΨ Y CL log(N )q CL log(N )+q (2Cp N M ) max 1, δ −1
SP 1+2CL log(q)
 
≤ CH N 1 + log(N ) + log max 1, δ −1 , (D.54)

where we set
SP

CH = 2Cs log(ΛΨ Y ) + CL + (CL + q) log(q) + 4q 2 (log(2Cp M ) + 1) .

Applying (3.17) to the fully connected FrameNet class G full FN (σq , N ) gives the entropy bound

H G full
FN (σq , N ), k · k∞,σR
r (U) , δ
  
4q2depth N +2
≤ (sF C (N ) + 1) log ΛΨ Y depthN q depthN +q (2widthN M ) max 1, δ −1
 
≤ (depthN + 1) width2N + widthN + 1
  
4q2depth N +2
× log ΛΨ Y depthN q depthN +q (2widthN M ) max 1, δ −1
 
≤ (CL log(N ) + 1) Cp2 N 2 + Cp N + 1
  
4q2CL log(N )+2
× log ΛΨ Y CL log(N )q CL log(N )+q (2Cp N M ) max 1, δ −1
FC 2+2CL log(q)
 
≤ CH N 1 + log(N )2 + log max 1, δ −1 , (D.55)

where we set
 
FC
CH =8Cs Cp2 2
log(ΛΨ Y ) + CL + (CL + q) log(q) + 4q (log(2Cp M ) + 1) .

Equations (D.54) and (D.55) finish the proof of Lemma 3.14.

E Proofs of Section 4
E.1 Proof of Theorem 4.2
In [43, Proof of Proposition 3, Step 1], the holomorphy in Assumption 2 is verified for X, Y in
r
(4.5) with r0 > d/2 and t ∈ [0, (1 + r0 − d/2 − t0 )/d). Moreover, γ = (σR )# π in particular shows
r
supp(γ) ⊆ CR (X) and hence verifies the second part of Assumption 2. Substituting s = r0 + rd,
i.e. r = s−r
d , and taking t = (1 + r0 − d/2 − t0 )/d − τ with some small τ , Theorem 3.15 (i) then
0

gives
κ
EG0 [kĜn − G0 k2L2 (γ) ] ≤ Cn− κ+1 +τ ,

where

κ = 2 min{ (s − r_0)/d − 1/2 , (1 + r_0 − d/2 − t_0)/d − τ }

for all r_0 > d/2 and t_0 ∈ [0, 1].


From here on the proof is essentially the same as [43, Proof of Proposition 3, Step 2]; the only
difference is that while [43] uses the uniform bound in Theorem 3.10 (i), we require the L2 -bound
in Theorem 3.10 (ii). For completeness, we repeat the argument. The constraint r > 1 implies
s > r_0 + d on s. We now choose r_0 > d/2 in order to maximize the convergence rate. Solving

(s − r_0)/d − 1/2 = (1 + r_0 − d/2 − t_0)/d

for r_0 gives

r_0 = (s + t_0 − 1)/2.      (E.1)

The constraint r_0 > d/2 implies the constraint s > d + 1 − t_0.
We look at two cases separately. First, if s ∈ (3d/2, 2d + 1 − t_0], we set r_0 := d/2 + τ_2, where
we choose τ_2 > 0 s.t. τ_2 < s − 3d/2, which guarantees s > r_0 + d. For τ < τ_2/d, we obtain the
convergence rate

κ = 2 min{ (s − d/2 − τ_2)/d − 1/2 , (1 + d/2 + τ_2 − d/2 − t_0)/d − τ } ≥ 2 min{ s/d − 1 − 2τ_2/d , (1 − t_0)/d }.
In the case s > 2d + 1 − t_0, define r_0 as in (E.1). The constraint s > r_0 + d amounts to

s > (s + t_0 − 1)/2 + d   ⇔   s > 2d + t_0 − 1,

which holds since s > 2d + 1 − t_0 ≥ 2d + t_0 − 1 for all t_0 ∈ [0, 1]. In this case we get the convergence
rate

κ = 2 (s − r_0)/d − 1 − τ = (s + 1 − t_0)/d − 1 − τ.
Choosing τ1 > 8τ2 /d > 8τ shows (4.6) and finishes the proof of Theorem 4.2.
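For concreteness, the case distinction above can be summarized as follows (a sketch, ours, ignoring the arbitrarily small parameters τ, τ_1, τ_2):

def kappa(s, d, t0):
    """Rate exponent from the two cases above (tau-terms dropped)."""
    assert s > 1.5 * d and 0.0 <= t0 <= 1.0
    if s > 2 * d + 1 - t0:
        return (s + 1 - t0) / d - 1        # optimized r0 = (s + t0 - 1)/2
    return 2 * min(s / d - 1, (1 - t0) / d)  # r0 pinned near d/2

for s, d, t0 in [(4.0, 1, 0.0), (2.0, 1, 0.5), (10.0, 2, 1.0)]:
    k = kappa(s, d, t0)
    print(f"s={s}, d={d}, t0={t0}:  kappa={k:.3f},  mean-squared-error rate ~ n^(-{k/(k+1):.3f})")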

