Journal of Machine Learning Research 23 (2022) 1-115    Submitted 12/21; Revised 10/22; Published 12/22

Neural Operator: Learning Maps Between Function Spaces
With Applications to PDEs

Nikola Kovachki∗†    NKOVACHKI@NVIDIA.COM    Nvidia
Zongyi Li∗    ZONGYILI@CALTECH.EDU    Caltech
Burigede Liu    BL377@CAM.AC.UK    Cambridge University
Kamyar Azizzadenesheli    KAMYARA@NVIDIA.COM    Nvidia
Kaushik Bhattacharya    BHATTA@CALTECH.EDU    Caltech
Andrew Stuart    ASTUART@CALTECH.EDU    Caltech
Anima Anandkumar    ANIMA@CALTECH.EDU    Caltech

Abstract
The classical development of neural networks has primarily focused on learning mappings between finite dimensional Euclidean spaces or finite sets. We propose a generalization of neural networks to learn operators, termed neural operators, that map between infinite dimensional function spaces. We formulate the neural operator as a composition of linear integral operators and nonlinear activation functions. We prove a universal approximation theorem for our proposed neural operator, showing that it can approximate any given nonlinear continuous operator. The proposed neural operators are also discretization-invariant, i.e., they share the same model parameters among different discretizations of the underlying function spaces. Furthermore, we introduce four classes of efficient parameterization, viz., graph neural operators, multi-pole graph neural operators, low-rank neural operators, and Fourier neural operators. An important application for neural operators is learning surrogate maps for the solution operators of partial differential equations (PDEs). We consider standard PDEs such as the Burgers, Darcy subsurface flow, and the Navier-Stokes equations, and show that the proposed neural operators have superior performance compared to existing machine learning based methodologies, while being several orders of magnitude faster than conventional PDE solvers.

Keywords: Deep Learning, Operator Learning, Discretization-Invariance, Partial Differential Equations, Navier-Stokes Equation.
©2022 Nikola Kovachki, Zongyi Li, Burigede Liu, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Andrew Stuart, Anima Anandkumar.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/21-1524.html.
1. Introduction

Learning mappings between function spaces has widespread applications in science and engineering. For instance, for solving differential equations, the input is a coefficient function and the output is a solution function. A straightforward solution to this problem is to simply discretize the infinite-dimensional input and output function spaces into finite-dimensional grids and apply standard learning models such as neural networks. However, this limits applicability since the learned neural network model may not generalize well to discretizations other than the grid of the training data.

To overcome these limitations of standard neural networks, we formulate a new deep-learning framework for learning operators, called neural operators, which directly map between function spaces on bounded domains. Since our neural operators are designed on function spaces, they can be discretized by a variety of different methods, and at different levels of resolution, without the need for re-training. In contrast, standard neural network architectures depend heavily on the discretization of the training data: new architectures with new parameters may be needed to achieve the same error for data with varying discretization. We also propose the notion of discretization-invariant models and prove that our neural operators satisfy this property, while standard neural networks do not.
1.1 Our Approach

Discretization-Invariant Models. We formulate a precise mathematical notion of discretization invariance. We require any discretization-invariant model with a fixed number of parameters to satisfy the following:

1. it acts on any discretization of the input function, i.e., it accepts any set of points in the input domain,

2. it can be evaluated at any point of the output domain,

3. it converges to a continuum operator as the discretization is refined.

The first two requirements, accepting input points and returning output points anywhere in the domain, are natural for discretization invariance, while the last one ensures consistency in the limit as the discretization is refined. For example, families of graph neural networks (Scarselli et al., 2008) and transformer models (Vaswani et al., 2017) are resolution invariant, i.e., they can receive inputs at any resolution, but they fail to converge to a continuum operator as the discretization is refined. Moreover, we require the models to have a fixed number of parameters; otherwise, the number of parameters becomes unbounded in the limit as the discretization is refined, as shown in Figure 1. Thus the notion of discretization invariance allows us to define neural operator models that are consistent in function spaces and can be applied to data given at any resolution and on any mesh. We also establish that standard neural network models are not discretization invariant.

Figure 1: Discretization Invariance. A discretization-invariant operator has convergent predictions under mesh refinement.
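The following toy computation is a minimal sketch of the three requirements above, assuming a fixed hand-chosen kernel and illustrative names (it is not code from the paper): an operator implemented as a kernel quadrature accepts the input function on any grid, answers at any query points, and its predictions converge as the input mesh is refined, whereas a fixed fully-connected network would fix the input size once and for all.

```python
# A toy illustration of requirements 1-3 (all choices here are assumptions for
# the sketch): the operator sees only point samples of a, yet accepts any input
# grid, can be queried anywhere, and its output stabilizes under refinement.
import numpy as np

def apply_operator(x_in, a_vals, x_out):
    """(K a)(x) ≈ ∫ k(x, y) a(y) dy, approximated with whatever points the caller provides."""
    k = np.exp(-np.abs(x_out[:, None] - x_in[None, :]) / 0.1)   # fixed kernel, no mesh baked in
    return np.trapz(k * a_vals[None, :], x_in, axis=1)

a = lambda x: np.sin(2 * np.pi * x)                              # one input function
x_out = np.array([0.1, 0.37, 0.8])                               # arbitrary query locations
for n in (16, 64, 256, 1024):                                    # successively refined input meshes
    x_in = np.linspace(0.0, 1.0, n)
    print(n, apply_operator(x_in, a(x_in), x_out))               # predictions converge with n
```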
Neural Operators. We introduce the concept of neural operators for learning operators that are mappings between infinite-dimensional function spaces. We propose neural operator architectures to be multi-layered, where the layers are themselves operators composed with non-linear activations. This ensures that the overall end-to-end composition is an operator, and thus satisfies the discretization invariance property. The key design choice for neural operators is the operator layers. To keep it simple, we limit ourselves to layers that are linear operators. Since these layers are composed with non-linear activations, we obtain neural operator models that are expressive and able to capture any continuous operator. The latter property is known as universal approximation.

The above line of reasoning for neural operator design follows closely the design of standard neural networks, where linear layers (e.g., matrix multiplication, convolution) are composed with non-linear activations, and we have universal approximation of continuous functions defined on compact domains (Hornik et al., 1989). Neural operators replace the finite-dimensional linear layers in neural networks with linear operators in function spaces.

We formally establish that neural operator models with a fixed number of parameters satisfy discretization invariance. We further show that neural operator models are universal approximators of continuous operators acting between Banach spaces, and can uniformly approximate any continuous operator defined on a compact set of a Banach space. Neural operators are the only known class of models that guarantee both discretization-invariance and universal approximation. See Table 1 for a comparison among the deep learning models. Previous deep learning models are mostly defined on a fixed grid, and removing, adding, or moving grid points generally makes these models no longer applicable. Thus, they are not discretization invariant.

We propose several design choices for the linear operator layers in neural operators, such as a parameterized integral operator or multiplication in the spectral domain, as shown in Figure 2. Specifically, we propose four practical methods for implementing the neural operator framework: graph-based operators, low-rank operators, multipole graph-based operators, and Fourier operators. For graph-based operators, we develop a Nyström extension to connect the integral operator formulation of the neural operator to families of graph neural networks (GNNs) on arbitrary grids. For Fourier operators, we consider the spectral domain formulation of the neural operator, which leads to efficient algorithms in settings where fast transform methods are applicable.
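On a uniform grid the spectral-domain idea reduces to a few lines. The block below is a minimal 1D sketch in the spirit of such a Fourier-type layer, assuming a uniform grid and a single channel width; it is an illustration rather than the released implementation: transform to the Fourier domain, multiply the lowest `modes` frequencies by learned complex weights, and transform back, at O(n log n) cost via the FFT.

```python
# A simplified 1D spectral-convolution layer (illustrative sketch): the kernel
# integral operator becomes a truncated pointwise multiplication in Fourier space.
import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    def __init__(self, channels, modes):
        super().__init__()
        self.modes = modes
        scale = 1.0 / channels
        self.weight = nn.Parameter(
            scale * torch.randn(channels, channels, modes, dtype=torch.cfloat))

    def forward(self, v):
        # v: (batch, channels, n) values of v_t on a uniform 1D grid; modes <= n//2 + 1.
        v_hat = torch.fft.rfft(v)                             # (batch, channels, n//2 + 1)
        out_hat = torch.zeros_like(v_hat)
        out_hat[..., :self.modes] = torch.einsum(             # multiply retained modes
            "bim,iom->bom", v_hat[..., :self.modes], self.weight)
        return torch.fft.irfft(out_hat, n=v.size(-1))         # back to physical space
```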
We include an exhaustive numerical study of the four formulations of neural operators. Numerically, we show that the proposed methodology consistently outperforms all existing deep learning methods, even on the resolutions for which the standard neural networks were designed. For the two-dimensional Navier-Stokes equation, when learning the entire flow map, the method achieves < 1% error for a Reynolds number of 20 and 8% error for a Reynolds number of 200.

The proposed Fourier neural operator (FNO) has an inference time that is three orders of magnitude faster than the pseudo-spectral method used to generate the data for the Navier-Stokes problem (Chandler and Kerswell, 2013): 0.005s compared to 2.2s on a 256 × 256 uniform spatial grid. Despite its tremendous speed advantage, the method does not suffer from accuracy degradation when used in downstream applications such as solving Bayesian inverse problems. Furthermore, we demonstrate that FNO is robust to noise on the testing problems we consider here.

Property                              NNs   DeepONets   Interpolation   Neural Operators
Discretization Invariance             ✗     ✗           ✓               ✓
Is the output a function?             ✗     ✓           ✓               ✓
Can query the output at any point?    ✗     ✓           ✓               ✓
Can take the input at any point?      ✗     ✗           ✓               ✓
Universal Approximation               ✗     ✓           ✗               ✓

Table 1: Comparison of deep learning models. The first row indicates whether the model is discretization invariant. The second and third rows indicate whether the output and input are functions. The fourth row indicates whether the model class is a universal approximator of operators. Neural operators are discretization-invariant deep learning methods that output functions and can approximate any operator.
1.2 Background and Context

Data-driven approaches for solving PDEs. Over the past decades, significant progress has been made in formulating (Gurtin, 1982) and solving (Johnson, 2012) the governing PDEs in many scientific fields, from micro-scale problems (e.g., quantum and molecular dynamics) to macro-scale applications (e.g., civil and marine engineering). Despite the success in the application of PDEs to solve real-world problems, two significant challenges remain: (1) identifying the governing model for complex systems; (2) efficiently solving large-scale nonlinear systems of equations.

Identifying and formulating the underlying PDEs appropriate for modeling a specific problem usually requires extensive prior knowledge in the corresponding field, which is then combined with universal conservation laws to design a predictive model. For example, modeling the deformation and failure of solid structures requires detailed knowledge of the relationship between stress and strain in the constituent material. For complicated systems such as living cells, acquiring such knowledge is often elusive, and formulating the governing PDE for these systems remains prohibitive, or the models proposed are too simplistic to be informative. The possibility of acquiring such knowledge from data can revolutionize these fields. Second, solving complicated nonlinear PDE systems (such as those arising in turbulence and plasticity) is computationally demanding and can often make realistic simulations intractable. Again, the possibility of using instances of data to design fast approximate solvers holds great potential for accelerating numerous problems.

Learning PDE Solution Operators. In PDE applications, the governing differential equations are by definition local, whilst the solution operator exhibits non-local properties. Such non-local effects can be described by integral operators explicitly in the spatial domain, or by means of spectral domain multiplication; convolution is an archetypal example. For integral equations, the graph approximations of Nyström type (Belongie et al., 2002) provide a consistent way of connecting different grid or data structures arising in computational methods and understanding their continuum limits (Von Luxburg et al., 2008; Trillos and Slepčev, 2018; Trillos et al., 2020). For spectral domain calculations, well-developed tools exist for approximating the continuum (Boyd, 2001; Trefethen, 2000). However, these approaches for approximating integral operators are not data-driven. Neural networks present a natural approach for learning-based integral operator approximations since they can incorporate non-locality. However, standard neural networks are limited to the discretization of the training data and hence offer a poor approximation to the integral operator. We tackle this issue here by proposing the framework of neural operators.

Properties of existing deep-learning models. Previous deep learning models are mostly defined on a fixed grid, and removing, adding, or moving grid points generally makes these models no longer applicable, as seen in Table 1. Thus, they are not discretization invariant. In general, standard neural networks (NNs) (such as multilayer perceptrons (MLPs), convolutional neural networks (CNNs), ResNets, and Vision Transformers (ViTs)) that take the input grid and output grid as finite-dimensional vectors are not discretization-invariant since their input and output have to be on a fixed grid with fixed locations. On the other hand, the pointwise neural networks used in PINNs (Raissi et al., 2019), which take each coordinate as input, are discretization-invariant since they can be applied at each location in parallel (see the short sketch after this paragraph). However, PINNs only represent the solution function of one instance and do not learn the map from the input functions to the output solution functions. A special class of neural networks is convolutional neural networks (CNNs). CNNs also do not converge with grid refinement since their receptive fields change with different input grids. On the other hand, if normalized by the grid size, CNNs can be applied to uniform grids with different resolutions, and converge to differential operators, in a similar fashion to the finite difference method. Interpolation is a baseline approach to achieve discretization-invariance. While NNs+Interpolation (or in general any finite-dimensional neural network+Interpolation) are resolution invariant and their outputs can be queried at any point, they are not universal approximators of operators since the dimensions of the input and output of the internal CNN model are bounded. DeepONets (Lu et al., 2019) are a class of operators that have the universal approximation property. DeepONets consist of a branch net and a trunk net. The trunk net allows queries at any point, but the branch net constrains the input to fixed locations; however, it is possible to modify the branch net to make the methodology discretization invariant, for example by using the PCA-based approach as used in (De Hoop et al., 2022).
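A short sketch of the pointwise networks mentioned above, with illustrative names and an arbitrary architecture chosen only for the example: a coordinate network can be evaluated in parallel on grids of any resolution with one fixed set of parameters, but it represents a single solution function rather than a map from input functions to solutions.

```python
# A coordinate MLP u_θ(x), as used pointwise in PINNs (illustrative sketch):
# it is not tied to any grid, yet it encodes only one solution function.
import torch
import torch.nn as nn

u_theta = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))

coarse = torch.linspace(0, 1, 32).unsqueeze(-1)      # 32-point grid
fine = torch.linspace(0, 1, 1024).unsqueeze(-1)      # 1024-point grid, same parameters
u_coarse, u_fine = u_theta(coarse), u_theta(fine)    # both evaluations are valid
```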
Furthermore, we show that transformers (Vaswani et al., 2017) are special cases of neural operators with structured kernels that can be used with varying grids to represent the input function. However, the commonly used vision-based extensions of transformers, e.g., ViT (Dosovitskiy et al., 2020), use convolutions on patches to generate tokens, and therefore they are not discretization-invariant models.

We also show that when our proposed neural operators are applied only on fixed grids, the resulting architectures coincide with neural networks and other operator learning frameworks. In such reductions, point evaluations of the input functions are available on the grid points. In particular, we show that DeepONets (Lu et al., 2019), which are maps from finite-dimensional spaces to infinite-dimensional spaces, are special cases of the neural operator architecture when neural operators are limited to fixed input grids. Moreover, by introducing an adjustment to the DeepONet architecture, we propose the DeepONet-Operator model that fits into the full operator learning framework of maps between function spaces.
2. Learning Operators

In subsection 2.1, we describe the generic setting of PDEs to make the discussion in the following sections concrete. In subsection 2.2, we outline the general problem of operator learning as well as our approach to solving it. In subsection 2.3, we discuss the functional data that is available and how we work with it numerically.
2.1 Generic Parametric PDEs

We consider the generic family of PDEs of the following form,

\[
(\mathcal{L}_a u)(x) = f(x), \quad x \in D, \qquad u(x) = 0, \quad x \in \partial D, \tag{1}
\]

for some a ∈ A, f ∈ U∗, and D ⊂ R^d a bounded domain. We assume that the solution u : D → R lives in the Banach space U and L_a : A → L(U; U∗) is a mapping from the parameter Banach space A to the space of (possibly unbounded) linear operators mapping U to its dual U∗. A natural operator which arises from this PDE is G† := L_a^{-1} f : A → U, defined to map the parameter to the solution, a ↦ u. A simple example that we study further in Section 6.2 is when L_a is the weak form of the second-order elliptic operator −∇·(a∇) subject to homogeneous Dirichlet boundary conditions. In this setting, A = L^∞(D; R_+), U = H_0^1(D; R), and U∗ = H^{−1}(D; R). When needed, we will assume that the domain D is discretized into K ∈ N points and that we observe N ∈ N pairs of coefficient functions and (approximate) solution functions {a^{(i)}, u^{(i)}}_{i=1}^N that are used to train the model (see Section 2.2). We assume that the a^{(i)} are i.i.d. samples from a probability measure µ supported on A and the u^{(i)} are the pushforwards under G†.
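As a concrete toy instance of the map G† : a ↦ u, the sketch below solves a one-dimensional analogue of the elliptic problem above, -(a u′)′ = f with homogeneous Dirichlet conditions, by finite differences; it is our illustration (the paper's Darcy experiments are two-dimensional), and all function names and the random coefficient family are assumptions made for the example. Each call produces one input-output pair (a^{(i)}, u^{(i)}).

```python
# A minimal 1D sketch of the parameter-to-solution operator a ↦ u for
# -(a(x) u'(x))' = f(x) on (0, 1), u(0) = u(1) = 0, via finite differences.
import numpy as np

def solve_elliptic_1d(a, f, K=256):
    x = np.linspace(0.0, 1.0, K)
    h = x[1] - x[0]
    a_half = 0.5 * (a(x[:-1]) + a(x[1:]))            # coefficient at cell midpoints
    n = K - 2                                         # interior unknowns
    A = np.zeros((n, n))
    for j in range(n):
        A[j, j] = (a_half[j] + a_half[j + 1]) / h**2
        if j > 0:
            A[j, j - 1] = -a_half[j] / h**2
        if j < n - 1:
            A[j, j + 1] = -a_half[j + 1] / h**2
    u = np.zeros(K)
    u[1:-1] = np.linalg.solve(A, f(x[1:-1]))          # homogeneous Dirichlet boundary
    return x, u

# One training pair (a^(i), u^(i)): a random positive coefficient and its solution.
rng = np.random.default_rng(0)
c = rng.uniform(1.0, 3.0, size=4)
a = lambda x: 1.0 + sum(ck * np.sin((k + 1) * np.pi * x) ** 2 for k, ck in enumerate(c))
f = lambda x: np.ones_like(x)
x, u = solve_elliptic_1d(a, f)
```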
2.2 Problem Setting

Our goal is to learn a mapping between two infinite dimensional spaces by using a finite collection of observations of input-output pairs from this mapping. We make this problem concrete in the following setting. Let A and U be Banach spaces of functions defined on bounded domains D ⊂ R^d and D′ ⊂ R^{d′} respectively, and let G† : A → U be a (typically) non-linear map. Suppose we have observations {a^{(i)}, u^{(i)}}_{i=1}^N, where a^{(i)} ∼ µ are i.i.d. samples drawn from some probability measure µ supported on A and u^{(i)} = G†(a^{(i)}) is possibly corrupted with noise. We aim to build an approximation of G† by constructing a parametric map

\[
\mathcal{G}_\theta : \mathcal{A} \to \mathcal{U}, \qquad \theta \in \mathbb{R}^p \tag{2}
\]

with parameters from the finite-dimensional space R^p, and then choosing θ† ∈ R^p so that G_{θ†} ≈ G†.

We will be interested in controlling the error of the approximation on average with respect to µ. In particular, assuming G† is µ-measurable, we will aim to control the L²_µ(A; U) Bochner norm of the approximation,

\[
\|\mathcal{G}^\dagger - \mathcal{G}_\theta\|^2_{L^2_\mu(\mathcal{A};\mathcal{U})}
= \mathbb{E}_{a\sim\mu}\,\|\mathcal{G}^\dagger(a) - \mathcal{G}_\theta(a)\|^2_{\mathcal{U}}
= \int_{\mathcal{A}} \|\mathcal{G}^\dagger(a) - \mathcal{G}_\theta(a)\|^2_{\mathcal{U}}\, d\mu(a). \tag{3}
\]

This is a natural framework for learning in infinite dimensions, as one could seek to solve the associated empirical-risk minimization problem

\[
\min_{\theta\in\mathbb{R}^p} \mathbb{E}_{a\sim\mu}\,\|\mathcal{G}^\dagger(a) - \mathcal{G}_\theta(a)\|^2_{\mathcal{U}}
\approx \min_{\theta\in\mathbb{R}^p} \frac{1}{N}\sum_{i=1}^N \|u^{(i)} - \mathcal{G}_\theta(a^{(i)})\|^2_{\mathcal{U}}, \tag{4}
\]

which directly parallels the classical finite-dimensional setting (Vapnik, 1998). As well as using error measured in the Bochner norm, we will also consider the setting where error is measured uniformly over compact sets of A. In particular, given any compact K ⊂ A, we consider

\[
\sup_{a\in K} \|\mathcal{G}^\dagger(a) - \mathcal{G}_\theta(a)\|_{\mathcal{U}}, \tag{5}
\]

which is a more standard error metric in the approximation theory literature. Indeed, the classic approximation theory of neural networks is formulated analogously to equation (5) (Hornik et al., 1989).

In Section 8 we show that, for the architecture we propose and given any desired error tolerance, there exists p ∈ N and an associated parameter θ† ∈ R^p so that the loss (3) or (5) is less than the specified tolerance. However, we do not address the challenging open problems of characterizing the error with respect to either (a) a fixed parameter dimension p or (b) a fixed number of training samples N. Instead, we approach this in the empirical test-train setting where we minimize (4) based on a fixed training set and approximate (3) from new samples that were not seen during training. Because we conceptualize our methodology in the infinite-dimensional setting, all finite-dimensional approximations can share a common set of network parameters which are defined in the (approximation-free) infinite-dimensional setting. In particular, our architecture does not depend on the way the functions a^{(i)}, u^{(i)} are discretized. The notation used throughout this paper, along with a useful summary table, may be found in Appendix A.
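In practice the U-norm in (4) is itself approximated by a quadrature over grid values. The following is a minimal sketch of that discretized empirical-risk minimization, assuming functions sampled on a uniform grid with spacing h and a generic `model` standing in for G_θ; the names are illustrative rather than an API from the paper.

```python
# Discretized version of the empirical risk in (4): the integral defining the
# U-norm is approximated by a Riemann sum (weight h) over the grid values.
import torch

def empirical_risk(model, a_batch, u_batch, h):
    """(1/N) Σ_i ||u^(i) - G_θ(a^(i))||_U^2, with the norm approximated on the grid."""
    pred = model(a_batch)                                # (N, L) predicted solution values
    sq_norm = h * ((pred - u_batch) ** 2).sum(dim=-1)    # ≈ ∫ |u - G_θ(a)|^2 dx
    return sq_norm.mean()

def train(model, a_train, u_train, h, epochs=100, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = empirical_risk(model, a_train, u_train, h)
        loss.backward()
        opt.step()
    return model
```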
2.3 Discretization

Since our data a^{(i)} and u^{(i)} are, in general, functions, to work with them numerically we assume access only to their point-wise evaluations. To illustrate this, we will continue with the example of the preceding paragraph. For simplicity, assume D = D′ and suppose that the input and output functions are both real-valued. Let D^{(i)} = {x_ℓ^{(i)}}_{ℓ=1}^L ⊂ D be an L-point discretization of the domain D and assume we have observations a^{(i)}|_{D^{(i)}}, u^{(i)}|_{D^{(i)}} ∈ R^L, for a finite collection of input-output pairs indexed by i. In the next section, we propose a kernel-inspired graph neural network architecture which, while trained on the discretized data, can produce the solution u(x) for any x ∈ D given an input a ∼ µ. In particular, our discretized architecture maps into the space U and not into a discretization thereof. Furthermore, our parametric operator class is consistent, in that, given a fixed set of parameters, refinement of the input discretization converges to the true function space operator. We make this notion precise in what follows and refer to architectures that possess it as function space architectures, mesh-invariant architectures, or discretization-invariant architectures.*

*. Note that the meaning of the indexing of the sets D• in the following definitions differs from that used earlier in this paragraph.

Definition 1 We call a discrete refinement of the domain D ⊂ R^d any sequence of nested sets D_1 ⊂ D_2 ⊂ · · · ⊂ D with |D_L| = L for any L ∈ N such that, for any ϵ > 0, there exists a number L = L(ϵ) ∈ N such that

\[
D \subseteq \bigcup_{x \in D_L} \{ y \in \mathbb{R}^d : \|y - x\|_2 < \epsilon \}.
\]

Definition 2 Given a discrete refinement (D_L)_{L=1}^∞ of the domain D ⊂ R^d, any member D_L is called a discretization of D.

Since a : D ⊂ R^d → R^m, pointwise evaluation of the function (discretization) at a set of L points gives rise to the data set {(x_ℓ, a(x_ℓ))}_{ℓ=1}^L. Note that this may be viewed as a vector in R^{Ld} × R^{Lm}. An example of the mesh refinement is given in Figure 1.

Definition 3 Suppose A is a Banach space of R^m-valued functions on the domain D ⊂ R^d. Let G : A → U be an operator, D_L be an L-point discretization of D, and Ĝ : R^{Ld} × R^{Lm} → U some map. For any K ⊂ A compact, we define the discretized uniform risk as

\[
R_K(G, \hat{G}, D_L) = \sup_{a \in K} \big\| \hat{G}(D_L, a|_{D_L}) - G(a) \big\|_{\mathcal{U}}.
\]
Definition 4 Let Θ ⊆ R^p be a finite dimensional parameter space and G : A × Θ → U a map representing a parametric class of operators with parameters θ ∈ Θ. Given a discrete refinement (D_L)_{L=1}^∞ of the domain D ⊂ R^d, we say G is discretization-invariant if there exists a sequence of maps Ĝ_1, Ĝ_2, . . . where Ĝ_L : R^{Ld} × R^{Lm} × Θ → U such that, for any θ ∈ Θ and any compact set K ⊂ A,

\[
\lim_{L \to \infty} R_K\big(G(\cdot, \theta), \hat{G}_L(\cdot, \cdot, \theta), D_L\big) = 0.
\]

We prove that the architectures proposed in Section 3 are discretization-invariant. We further verify this claim numerically by showing that the approximation error is approximately constant as we refine the discretization. Such a property is highly desirable as it allows a transfer of solutions between different grid geometries and discretization sizes with a single architecture that has a fixed number of parameters.
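To make Definitions 1-4 concrete, the toy computation below uses a fixed, hand-chosen kernel operator as G, nested dyadic grids as the discrete refinement, and a three-function stand-in for the compact set K; every choice is an assumption made for the illustration, and nothing here comes from the paper's experiments. The printed quantity is an empirical analogue of the discretized uniform risk R_K(G, Ĝ, D_L), and it decays as L grows.

```python
# Toy check of Definitions 1-4: G is a fixed kernel integral operator on [0, 1],
# D_L are nested dyadic grids, and G_hat sees only the values a|_{D_L}.
import numpy as np

kernel = lambda x, y: np.exp(-((x - y) ** 2) / 0.02)
K_set = [np.sin, np.cos, lambda x: x ** 2]            # stand-in for a compact set K
x_ref = np.linspace(0.0, 1.0, 2049)                   # fine reference grid for measuring error

def G(a):                                             # "true" operator via fine quadrature
    return np.trapz(kernel(x_ref[:, None], x_ref[None, :]) * a(x_ref)[None, :], x_ref, axis=1)

def G_hat(D_L, a_on_DL):                              # discretized operator: sees only a|_{D_L}
    return np.trapz(kernel(x_ref[:, None], D_L[None, :]) * a_on_DL[None, :], D_L, axis=1)

for L in (9, 33, 129, 513):                           # nested grids: each refines the previous
    D_L = np.linspace(0.0, 1.0, L)
    risk = max(np.max(np.abs(G_hat(D_L, a(D_L)) - G(a))) for a in K_set)
    print(L, risk)                                    # empirical analogue of R_K(G, G_hat, D_L)
```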
We note that, while the application of our methodology is based on having point-wise evaluations of the function, it is not limited by it. One may, for example, represent a function numerically as a finite set of truncated basis coefficients. Invariance of the representation would then be with respect to the size of this set. Our methodology can, in principle, be modified to accommodate this scenario through a suitably chosen architecture. We do not pursue this direction in the current work. From the construction of neural operators, when the input and output functions are evaluated on fixed grids, the architecture of neural operators on these fixed grids coincides with the class of neural networks.

3. Neural Operators

In this section, we outline the neural operator framework. We assume that the input functions a ∈ A are R^{d_a}-valued and defined on the bounded domain D ⊂ R^d, while the output functions u ∈ U are R^{d_u}-valued and defined on the bounded domain D′ ⊂ R^{d′}. The proposed architecture Gθ : A → U has the following overall structure:
1. Lifting: Using a pointwise function R^{d_a} → R^{d_{v_0}}, map the input {a : D → R^{d_a}} ↦ {v_0 : D → R^{d_{v_0}}} to its first hidden representation. Usually, we choose d_{v_0} > d_a and hence this is a lifting operation performed by a fully local operator.

2. Iterative Kernel Integration: For t = 0, . . . , T − 1, map each hidden representation to the next, {v_t : D_t → R^{d_{v_t}}} ↦ {v_{t+1} : D_{t+1} → R^{d_{v_{t+1}}}}, via the action of the sum of a local linear operator, a non-local integral kernel operator, and a bias function, composing the sum with a fixed, pointwise nonlinearity. Here we set D_0 = D and D_T = D′ and impose that D_t ⊂ R^{d_t} is a bounded domain.†

3. Projection: Using a pointwise function R^{d_{v_T}} → R^{d_u}, map the last hidden representation {v_T : D′ → R^{d_{v_T}}} ↦ {u : D′ → R^{d_u}} to the output function. Analogously to the first step, we usually pick d_{v_T} > d_u and hence this is a projection step performed by a fully local operator.

†. The indexing of the sets D• here differs from the two previous indexings used in Subsection 2.3. The index t is not the physical time, but the iteration (layer) in the model architecture.

The outlined structure mimics that of a finite dimensional neural network where hidden representations are successively mapped to produce the final output. In particular, we have

\[
\mathcal{G}_\theta := \mathcal{Q} \circ \sigma_T(W_{T-1} + \mathcal{K}_{T-1} + b_{T-1}) \circ \cdots \circ \sigma_1(W_0 + \mathcal{K}_0 + b_0) \circ \mathcal{P} \tag{6}
\]
where P : Rda → Rdv0 , Q : RdvT → Rdu are the local lifting and projection mappings respectively, The outlined structure mimics that of a finite dimensional neural network where hidden repre-
dvt+1 ×dvt
Wt ∈ R are local linear operators (matrices), Kt : {vt : Dt → Rdvt } → {vt+1 : Dt+1 → sentations are successively mapped to produce the final output. In particular, we have
Rdvt+1 } are integral kernel operators, bt : Dt+1 → Rdvt+1 are bias functions, and σt are fixed
activation functions acting locally as maps Rvt+1 → Rvt+1 in each layer. The output dimensions Gθ := Q ◦ σT (WT −1 + KT −1 + bT −1 ) ◦ · · · ◦ σ1 (W0 + K0 + b0 ) ◦ P (6)
dv0 , . . . , dvT as well as the input dimensions d1 , . . . , dT −1 and domains of definition D1 , . . . , DT −1
are hyperparameters of the architecture. By local maps, we mean that the action is pointwise, in where P : Rda → Rdv0 , Q : RdvT → Rdu are the local lifting and projection mappings respectively,
particular, for the lifting and projection maps, we have (P(a))(x) = P(a(x)) for any x ∈ D Wt ∈ Rdvt+1 ×dvt are local linear operators (matrices), Kt : {vt : Dt → Rdvt } → {vt+1 : Dt+1 →
and (Q(vT ))(x) = Q(vT (x)) for any x ∈ D′ and similarly, for the activation, (σ(vt+1 ))(x) = Rdvt+1 } are integral kernel operators, bt : Dt+1 → Rdvt+1 are bias functions, and σt are fixed
σ(vt+1 (x)) for any x ∈ Dt+1 . The maps, P, Q, and σt can thus be thought of as defining Nemitskiy activation functions acting locally as maps Rvt+1 → Rvt+1 in each layer. The output dimensions
operators (Dudley and Norvaisa, 2011, Chapters 6,7) when each of their components are assumed to dv0 , . . . , dvT as well as the input dimensions d1 , . . . , dT −1 and domains of definition D1 , . . . , DT −1
be Borel measurable. This interpretation allows us to define the general neural operator architecture are hyperparameters of the architecture. By local maps, we mean that the action is pointwise, in
when pointwise evaluation is not well-defined in the spaces A or U e.g. when they are Lebesgue, particular, for the lifting and projection maps, we have (P(a))(x) = P(a(x)) for any x ∈ D
Sobolev, or Besov spaces. and (Q(vT ))(x) = Q(vT (x)) for any x ∈ D′ and similarly, for the activation, (σ(vt+1 ))(x) =
The crucial difference between the proposed architecture (6) and a standard feed-forward neural network is that all operations are directly defined in function space (noting that the activation functions, P, and Q are all interpreted through their extension to Nemitskiy operators) and therefore do not depend on any discretization of the data. Intuitively, the lifting step locally maps the data to a space where the non-local part of G† is easier to capture. We confirm this intuition numerically in Section 7; however, we note that for the theory presented in Section 8 it suffices that P is the identity map. The non-local part of G† is then learned by successively approximating using integral kernel operators composed with a local nonlinearity. Each integral kernel operator is the function space analog of the weight matrix in a standard feed-forward network since they are infinite-dimensional linear operators mapping one function space to another. We turn the biases, which are normally vectors, into functions and, using intuition from the ResNet architecture (He et al., 2016), we further add a local linear operator acting on the output of the previous layer before applying the nonlinearity. The final projection step simply gets us back to the space of our output function. We concatenate in θ ∈ R^p the parameters of P, Q, {b_t}, which are usually themselves shallow neural networks, the parameters of the kernels representing {K_t}, which are again usually shallow neural networks, and the matrices {W_t}. We note, however, that our framework is general and other parameterizations such as polynomials may also be employed.
Integral Kernel Operators We define three versions of the integral kernel operator K_t used in (6). For the first, let κ^{(t)} ∈ C(D_{t+1} × D_t; R^{d_{v_{t+1}} × d_{v_t}}) and let ν_t be a Borel measure on D_t. Then we define K_t by

(K_t(v_t))(x) = ∫_{D_t} κ^{(t)}(x, y) v_t(y) dν_t(y)   ∀x ∈ D_{t+1}.   (7)
Normally, we take ν_t to simply be the Lebesgue measure on R^{d_t} but, as discussed in Section 4, other choices can be used to speed up computation or aid the learning process by building in a priori information. The choice of integral kernel operator in (7) defines the basic form of the neural operator and is the one we analyze in Section 8 and study most in the numerical experiments of Section 7.
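For concreteness, the operator (7) can be approximated on an arbitrary point cloud {x_1, . . . , x_J} ⊂ D_t by replacing the integral with a quadrature rule; the sketch below uses plain Monte Carlo weights 1/J and parameterizes κ^{(t)} by a small feed-forward network acting on the pair (x, y). The class name, network width, and activation are illustrative choices, not those of any released implementation.

    import torch
    import torch.nn as nn

    class KernelIntegralOperator(nn.Module):
        # Monte Carlo discretization of (K_t v_t)(x) = \int kappa^{(t)}(x, y) v_t(y) dnu_t(y)
        def __init__(self, d, c_in, c_out, width=64):
            super().__init__()
            self.c_in, self.c_out = c_in, c_out
            # kappa maps a pair (x, y) in R^{2d} to a c_out x c_in matrix
            self.kappa = nn.Sequential(
                nn.Linear(2 * d, width), nn.GELU(), nn.Linear(width, c_out * c_in)
            )

        def forward(self, x, y, v):
            # x: (J_out, d) query points, y: (J_in, d) quadrature points, v: (J_in, c_in)
            pairs = torch.cat(
                [x.unsqueeze(1).expand(-1, y.shape[0], -1),
                 y.unsqueeze(0).expand(x.shape[0], -1, -1)], dim=-1)      # (J_out, J_in, 2d)
            K = self.kappa(pairs).view(x.shape[0], y.shape[0], self.c_out, self.c_in)
            # uniform quadrature weights 1/J_in stand in for dnu_t(y)
            return torch.einsum("jkoi,ki->jo", K, v) / y.shape[0]         # (J_out, c_out)

    # usage on a random 2-d point cloud
    x = torch.rand(128, 2); v = torch.rand(128, 3)
    u = KernelIntegralOperator(d=2, c_in=3, c_out=5)(x, x, v)   # (128, 5)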
For the second, let κ^{(t)} ∈ C(D_{t+1} × D_t × R^{d_a} × R^{d_a}; R^{d_{v_{t+1}} × d_{v_t}}). Then we define K_t by

(K_t(v_t))(x) = ∫_{D_t} κ^{(t)}(x, y, a(Π^D_{t+1}(x)), a(Π^D_t(y))) v_t(y) dν_t(y)   ∀x ∈ D_{t+1}.   (8)

where Π^D_t : D_t → D are fixed mappings. We have found numerically that, for certain PDE problems, the form (8) outperforms (7) due to the strong dependence of the solution u on the parameters a, for example, the Darcy flow problem considered in subsection 7.2.1. Indeed, if we think of (6) as a discrete time dynamical system, then the input a ∈ A only enters through the initial condition, hence its influence diminishes with more layers. By directly building in a-dependence into the kernel, we ensure that it influences the entire architecture.
Lastly, let κ^{(t)} ∈ C(D_{t+1} × D_t × R^{d_{v_t}} × R^{d_{v_t}}; R^{d_{v_{t+1}} × d_{v_t}}). Then we define K_t by

(K_t(v_t))(x) = ∫_{D_t} κ^{(t)}(x, y, v_t(Π_t(x)), v_t(y)) v_t(y) dν_t(y)   ∀x ∈ D_{t+1}.   (9)

where Π_t : D_{t+1} → D_t are fixed mappings. Note that, in contrast to (7) and (8), the integral operator (9) is nonlinear since the kernel can depend on the input function v_t. With this definition and a particular choice of kernel κ_t and measure ν_t, we show in Section 5.2 that neural operators are a continuous input/output space generalization of the popular transformer architecture (Vaswani et al., 2017).
Single Hidden Layer Construction Having defined possible choices for the integral kernel operator, we are now in a position to explicitly write down a full layer of the architecture defined by (6). For simplicity, we choose the integral kernel operator given by (7), but note that the other definitions (8), (9) work analogously. We then have that a single hidden layer update is given by

v_{t+1}(x) = σ_{t+1}( W_t v_t(Π_t(x)) + ∫_{D_t} κ^{(t)}(x, y) v_t(y) dν_t(y) + b_t(x) )   ∀x ∈ D_{t+1}   (10)

where Π_t : D_{t+1} → D_t are fixed mappings. We remark that, since we often consider functions on the same domain, we usually take Π_t to be the identity.
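A minimal sketch of the update (10) on a fixed discretization, assuming Π_t is the identity, uniform Monte Carlo quadrature weights, an MLP kernel, and a constant bias (all names and sizes here are illustrative):

    import torch
    import torch.nn as nn

    class NeuralOperatorLayer(nn.Module):
        # One hidden layer update (10): v_{t+1} = sigma(W_t v_t + K_t v_t + b_t) on a point cloud.
        def __init__(self, d, c, width=64):
            super().__init__()
            self.W = nn.Linear(c, c, bias=False)                  # local linear operator W_t
            self.b = nn.Parameter(torch.zeros(c))                  # constant bias function b_t
            self.kappa = nn.Sequential(nn.Linear(2 * d, width), nn.GELU(),
                                       nn.Linear(width, c * c))    # kernel kappa^{(t)}(x, y)
            self.act = nn.GELU()
            self.c = c

        def forward(self, x, v):
            # x: (J, d) grid points, v: (J, c) values of v_t; Pi_t is taken to be the identity
            J = x.shape[0]
            pairs = torch.cat([x.unsqueeze(1).expand(J, J, -1),
                               x.unsqueeze(0).expand(J, J, -1)], dim=-1)
            K = self.kappa(pairs).view(J, J, self.c, self.c)
            integral = torch.einsum("jkoi,ki->jo", K, v) / J       # Monte Carlo quadrature
            return self.act(self.W(v) + integral + self.b)

    x = torch.rand(64, 2); v = torch.rand(64, 8)
    layer = NeuralOperatorLayer(d=2, c=8)
    print(layer(x, v).shape)   # torch.Size([64, 8])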
We will now give an example of a full single hidden layer architecture, i.e. when T = 2. We choose D_1 = D, take σ_2 as the identity, and denote σ_1 by σ, assuming it is any activation function. Furthermore, for simplicity, we set W_1 = 0, b_1 = 0, and assume that ν_0 = ν_1 is the Lebesgue measure on R^d. Then (6) becomes

(G_θ(a))(x) = Q( ∫_D κ^{(1)}(x, y) σ( W_0 P(a(y)) + ∫_D κ^{(0)}(y, z) P(a(z)) dz + b_0(y) ) dy )   (11)

for any x ∈ D′. In this example, P ∈ C(R^{d_a}; R^{d_{v_0}}), κ^{(0)} ∈ C(D × D; R^{d_{v_1} × d_{v_0}}), b_0 ∈ C(D; R^{d_{v_1}}), W_0 ∈ R^{d_{v_1} × d_{v_0}}, κ^{(1)} ∈ C(D′ × D; R^{d_{v_2} × d_{v_1}}), and Q ∈ C(R^{d_{v_2}}; R^{d_u}). One can then parametrize the continuous functions P, Q, κ^{(0)}, κ^{(1)}, b_0 by standard feed-forward neural networks (or by any other means) and the matrix W_0 simply by its entries. The parameter vector θ ∈ R^p then becomes the concatenation of the parameters of P, Q, κ^{(0)}, κ^{(1)}, b_0 along with the entries of W_0. One can then optimize these parameters by minimizing a loss with respect to θ using standard gradient-based minimization techniques. To implement this minimization, the functions entering the loss need to be discretized; but the learned parameters may then be used with other discretizations. In Section 4, we discuss various choices for parametrizing the kernels, picking the integration measure, and how those choices affect the computational complexity of the architecture.
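A compact, purely illustrative sketch of (11) on a uniform grid with D = D′, together with one gradient-based update of θ on a single discretized input/output pair; the mean-squared error over grid values stands in here for whichever discretized loss one chooses, and all class names and widths are assumptions.

    import torch
    import torch.nn as nn

    def pairwise(x, y):
        # all pairs (x_j, y_k) concatenated: (J, K, 2d)
        J, K = x.shape[0], y.shape[0]
        return torch.cat([x.unsqueeze(1).expand(J, K, -1),
                          y.unsqueeze(0).expand(J, K, -1)], dim=-1)

    class SingleHiddenLayerNO(nn.Module):
        # G_theta(a) = Q( \int kappa1(x,y) sigma( W0 P(a(y)) + \int kappa0(y,z) P(a(z)) dz + b0 ) dy )
        def __init__(self, d=1, da=1, du=1, v0=16, v1=16, v2=16, width=64):
            super().__init__()
            mlp = lambda i, o: nn.Sequential(nn.Linear(i, width), nn.GELU(), nn.Linear(width, o))
            self.P, self.Q = mlp(da, v0), mlp(v2, du)              # pointwise lifting / projection
            self.kappa0, self.kappa1 = mlp(2 * d, v1 * v0), mlp(2 * d, v2 * v1)
            self.W0 = nn.Linear(v0, v1, bias=False)
            self.b0 = nn.Parameter(torch.zeros(v1))
            self.sizes = (v0, v1, v2)
            self.act = nn.GELU()

        def forward(self, x, a):
            # x: (J, d) grid, a: (J, da) input function values; quadrature weight 1/J
            v0, v1, v2 = self.sizes
            J = x.shape[0]
            Pa = self.P(a)                                                          # (J, v0)
            K0 = self.kappa0(pairwise(x, x)).view(J, J, v1, v0)
            inner = self.act(self.W0(Pa) + torch.einsum("jkoi,ki->jo", K0, Pa) / J + self.b0)
            K1 = self.kappa1(pairwise(x, x)).view(J, J, v2, v1)
            return self.Q(torch.einsum("jkoi,ki->jo", K1, inner) / J)               # (J, du)

    # one optimization step on a single discretized training pair (a, u)
    x = torch.linspace(0, 1, 64).unsqueeze(-1)
    a = torch.sin(2 * torch.pi * x); u_true = torch.cos(2 * torch.pi * x)
    model = SingleHiddenLayerNO()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss = ((model(x, a) - u_true) ** 2).mean()
    loss.backward(); opt.step()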
Preprocessing It is often beneficial to manually include features into the input functions a to help facilitate the learning process. For example, instead of considering the R^{d_a}-valued vector field a as input, we use the R^{d+d_a}-valued vector field (x, a(x)). By including the identity element, information about the geometry of the spatial domain D is directly incorporated into the architecture. This allows the neural networks direct access to information that is already known in the problem and therefore eases learning. We use this idea in all of our numerical experiments in Section 7. Similarly, when learning a smoothing operator, it may be beneficial to include a smoothed version of the inputs a_ϵ using, for example, Gaussian convolution. Derivative information may also be of interest and thus, as input, one may consider, for example, the R^{d+2d_a+dd_a}-valued vector field (x, a(x), a_ϵ(x), ∇_x a_ϵ(x)). Many other possibilities may be considered on a problem-specific basis.
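On a uniform grid this preprocessing amounts to concatenating the grid coordinates to the input channels; a small sketch for D = [0, 1]^2 (the smoothed copy a_ϵ and its gradient are omitted here):

    import torch

    def with_coordinates(a):
        # a: (s, s, d_a) values of the input function on a uniform s x s grid over [0, 1]^2.
        # Returns the (s, s, 2 + d_a) field (x, a(x)) used as the lifted network input.
        s = a.shape[0]
        grid = torch.linspace(0, 1, s)
        X, Y = torch.meshgrid(grid, grid, indexing="ij")
        coords = torch.stack([X, Y], dim=-1)          # (s, s, 2)
        return torch.cat([coords, a], dim=-1)

    a = torch.rand(32, 32, 1)
    print(with_coordinates(a).shape)                  # torch.Size([32, 32, 3])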
Discretization Invariance and Approximation In light of the discretization invariance Theorem 8 and the universal approximation Theorems 11, 12, 13, and 14, whose formal statements are given in Section 8, we may obtain a decomposition of the total error made by a neural operator as a sum of the discretization error and the approximation error. In particular, given a finite dimensional instantiation of a neural operator Ĝ_θ : R^{Ld} × R^{Ld_a} → U, for some L-point discretization of the input, we have

∥Ĝ_θ(D_L, a|_{D_L}) − G†(a)∥_U ≤ ∥Ĝ_θ(D_L, a|_{D_L}) − G_θ(a)∥_U + ∥G_θ(a) − G†(a)∥_U,

where the first term on the right-hand side is the discretization error and the second is the approximation error. Our approximation theoretic Theorems imply that we can find parameters θ so that the approximation error is arbitrarily small, while the discretization invariance Theorem states that we can find a fine enough discretization (large enough L) so that the discretization error is arbitrarily small.
Therefore, with a fixed set of parameters independent of the input discretization, a neural operator that can be implemented on a computer can approximate operators to arbitrary accuracy.
4. Parameterization and Computation

In this section, we discuss various ways of parameterizing the infinite dimensional architecture (6), Figure 2. The goal is to find an intrinsic infinite dimensional parameterization that achieves small error (say ϵ), and then rely on numerical approximation to ensure that this parameterization delivers an error of the same magnitude (say 2ϵ), for all data discretizations fine enough. In this way the number of parameters used to achieve O(ϵ) error is independent of the data discretization. In many applications we have in mind, the data discretization is something we can control, for example when generating input/output pairs from solutions of partial differential equations via numerical simulation. The proposed approach allows us to train a neural operator approximation using data from different discretizations, and to predict with discretizations different from those used in the data, all by relating everything to the underlying infinite dimensional problem.

We also discuss the computational complexity of the proposed parameterizations and suggest algorithms which yield efficient numerical methods for approximation. Subsections 4.1-4.4 delineate each of the proposed methods.
To simplify notation, we will only consider a single layer of (6), i.e. (10), and choose the input and output domains to be the same. Furthermore, we will drop the layer index t and write the single layer update as

u(x) = σ( W v(x) + ∫_D κ(x, y) v(y) dν(y) + b(x) )   ∀x ∈ D   (12)

where D ⊂ R^d is a bounded domain, v : D → R^n is the input function and u : D → R^m is the output function. When the domains D of v and u are different, we will usually extend them to be on a larger domain. We will consider σ to be fixed, and, for the time being, take dν(y) = dy to be the Lebesgue measure on R^d. Equation (12) then leaves three objects which can be parameterized: W, κ, and b. Since W is linear and acts only locally on v, we will always parametrize it by the values of its associated m × n matrix; hence W ∈ R^{m×n}, yielding mn parameters. We have found empirically that letting b : D → R^m be a constant function over any domain D works at least as well as allowing it to be an arbitrary neural network. Perusal of the proof of Theorem 11 shows that we do not lose any approximation power by doing this, and we reduce the total number of parameters in the architecture. Therefore we will always parametrize b by the entries of a fixed m-dimensional vector; in particular, b ∈ R^m, yielding m parameters. Notice that both parameterizations are independent of any discretization of v.
The rest of this section will be dedicated to choosing the kernel function κ : D × D → R^{m×n} and the computation of the associated integral kernel operator. For clarity of exposition, we consider only the simplest proposed version of this operator (7) but note that similar ideas may also be applied to (8) and (9). Furthermore, in order to focus on learning the kernel κ, here we drop σ, W, and b from (12) and simply consider the linear update

u(x) = ∫_D κ(x, y) v(y) dν(y)   ∀x ∈ D.   (13)

To demonstrate the computational challenges associated with (13), let {x_1, . . . , x_J} ⊂ D be a uniformly-sampled J-point discretization of D. Recall that we assumed dν(y) = dy and, for simplicity, suppose that ν(D) = 1; then the Monte Carlo approximation of (13) is

u(x_j) ≈ (1/J) Σ_{l=1}^{J} κ(x_j, x_l) v(x_l),   j = 1, . . . , J,   (14)

which requires O(J^2) evaluations of κ to compute u at every point of the discretization.

Kernel Matrix. It will often be useful to consider the kernel matrix associated to κ for the discrete points {x_1, . . . , x_J} ⊂ D. We define the kernel matrix K ∈ R^{mJ×nJ} to be the J × J block matrix with each block given by the value of the kernel, i.e.

K_{jl} = κ(x_j, x_l) ∈ R^{m×n},   j, l = 1, . . . , J,

where we use (j, l) to index an individual block rather than a matrix element. Various numerical algorithms for the efficient computation of (13) can be derived based on assumptions made about the structure of this matrix, for example, bounds on its rank or sparsity.
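To make the cost concrete, the sketch below assembles the block kernel matrix K and applies it to a discretized input; both the O(J^2) kernel evaluations and the O(J^2 mn) storage are explicit. The Gaussian-type kernel is an arbitrary stand-in.

    import torch

    def kernel_matrix(kappa, x, m, n):
        # K in R^{mJ x nJ}: J x J blocks K_{jl} = kappa(x_j, x_l) in R^{m x n}
        J = x.shape[0]
        K = torch.zeros(J * m, J * n)
        for j in range(J):
            for l in range(J):
                K[j * m:(j + 1) * m, l * n:(l + 1) * n] = kappa(x[j], x[l])
        return K

    # stand-in kernel: a fixed smooth R^{m x n}-valued function of (x, y)
    m, n = 2, 3
    A = torch.randn(m, n)
    kappa = lambda x, y: torch.exp(-torch.sum((x - y) ** 2)) * A

    x = torch.rand(100, 2)                 # J = 100 points in D subset of R^2
    v = torch.rand(100, n)                 # discretized input function
    K = kernel_matrix(kappa, x, m, n)      # (200, 300): quadratic in J
    u = (K @ v.reshape(-1, 1)).reshape(100, m) / 100   # Monte Carlo weight 1/J

The constructions in subsections 4.1-4.4 are precisely ways of avoiding forming or applying this matrix densely.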
4.1 Graph Neural Operator (GNO)

We first outline the Graph Neural Operator (GNO) which approximates (13) by combining a Nyström approximation with domain truncation and is implemented with the standard framework of graph neural networks. This construction was originally proposed in Li et al. (2020c).
Nyström Approximation. A simple yet effective method to alleviate the cost of computing (13) is employing a Nyström approximation. This amounts to sampling uniformly at random the points over which we compute the output function u. In particular, let x_{k_1}, . . . , x_{k_{J′}} ⊂ {x_1, . . . , x_J} be J′ ≪ J randomly selected points and, assuming ν(D) = 1, approximate (13) by

u(x_{k_j}) ≈ (1/J′) Σ_{l=1}^{J′} κ(x_{k_j}, x_{k_l}) v(x_{k_l}),   j = 1, . . . , J′.

We can view this as a low-rank approximation to the kernel matrix K, in particular,

K ≈ K_{JJ′} K_{J′J′} K_{J′J}   (15)

where K_{J′J′} is a J′ × J′ block matrix and K_{JJ′}, K_{J′J} are interpolation matrices, for example, linearly extending the function to the whole domain from the random nodal points. The complexity of this computation is O(J′^2) hence it remains quadratic but only in the number of subsampled points J′ which we assume is much less than the number of points J in the original discretization.
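A direct sketch of this subsampling follows; the interpolation back to the full discretization, i.e. the role of K_{JJ′} and K_{J′J} in (15), is omitted and could be supplied by, e.g., nearest-neighbor or linear interpolation.

    import torch

    def nystrom_update(kappa, x, v, J_sub):
        # Evaluate u only at J_sub randomly selected nodes, using only those nodes as quadrature points.
        J = x.shape[0]
        idx = torch.randperm(J)[:J_sub]                     # x_{k_1}, ..., x_{k_{J'}}
        xs, vs = x[idx], v[idx]
        # pairwise kernel on the subsampled points: (J_sub, J_sub, m, n)
        K = torch.stack([torch.stack([kappa(xi, xj) for xj in xs]) for xi in xs])
        u_sub = torch.einsum("jlmq,lq->jm", K, vs) / J_sub  # cost O(J_sub^2) instead of O(J^2)
        return idx, u_sub

    m, n = 2, 3
    A = torch.randn(m, n)
    kappa = lambda x, y: torch.exp(-torch.sum((x - y) ** 2)) * A
    x, v = torch.rand(1000, 2), torch.rand(1000, n)
    idx, u_sub = nystrom_update(kappa, x, v, J_sub=100)     # u at 100 of the 1000 nodes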
Truncation. Another simple method to alleviate the cost of computing (13) is to truncate the integral to a sub-domain of D which depends on the point of evaluation x ∈ D. Let s : D → B(D) be a mapping of the points of D to the Lebesgue measurable subsets of D denoted B(D). Define dν(x, y) = 1_{s(x)} dy, then (13) becomes

u(x) = ∫_{s(x)} κ(x, y) v(y) dy   ∀x ∈ D.   (16)

If the size of each set s(x) is smaller than D then the cost of computing (16) is O(c_s J^2) where c_s < 1 is a constant depending on s. While the cost remains quadratic in J, the constant c_s can have a significant effect in practical computations, as we demonstrate in Section 7. For simplicity and ease of implementation, we only consider s(x) = B(x, r) ∩ D where B(x, r) = {y ∈ R^d : ∥y − x∥_{R^d} < r} for some fixed r > 0. With this choice of s and assuming that D = [0, 1]^d, we can explicitly calculate that c_s ≈ r^d.
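In a discretized implementation the truncation simply restricts the sum to pairs with ∥x_j − x_l∥ < r; the fraction of retained pairs plays the role of c_s. A small (deliberately naive, loop-based) sketch:

    import torch

    def truncated_update(kappa, x, v, r):
        # u(x_j) = sum over {l : ||x_j - x_l|| < r} of kappa(x_j, x_l) v(x_l) / J   (discretized (16))
        J = x.shape[0]
        dist = torch.cdist(x, x)                      # (J, J) pairwise distances
        mask = dist < r                               # which pairs survive the truncation
        u = torch.zeros(J, kappa(x[0], x[0]).shape[0])
        for j in range(J):
            for l in torch.nonzero(mask[j]).flatten():
                u[j] += kappa(x[j], x[l]) @ v[l]
        return u / J, mask.float().mean()             # second output is the empirical c_s

    m, n = 2, 3
    A = torch.randn(m, n)
    kappa = lambda x, y: torch.exp(-torch.sum((x - y) ** 2)) * A
    x, v = torch.rand(400, 2), torch.rand(400, n)
    u, c_s = truncated_update(kappa, x, v, r=0.2)
    print(u.shape, c_s)        # c_s roughly of order r^2 for D = [0, 1]^2, up to boundary effects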
Furthermore notice that we do not lose any expressive power when we make this approximation so long as we combine it with composition. To see this, consider the example of the previous paragraph where, if we let r = √2, then (16) reverts to (13). Pick r < 1 and let L ∈ N with L ≥ 2 be the smallest integer such that 2^{L−1} r ≥ 1. Suppose that u(x) is computed by composing the right hand side of (16) L times with a different kernel every time. The domain of influence of u(x) is then B(x, 2^{L−1} r) ∩ D = D, hence it is easy to see that there exist L kernels such that computing this composition is equivalent to computing (13) for any given kernel with appropriate regularity. Furthermore the cost of this computation is O(L r^d J^2) and therefore the truncation is beneficial if r^d(log_2(1/r) + 1) < 1, which holds for any r < 1/2 when d = 1 and any r < 1 when d ≥ 2. Therefore we have shown that we can always reduce the cost of computing (13) by truncation and composition. From the perspective of the kernel matrix, truncation enforces a sparse, block diagonally-dominant structure at each layer. We further explore the hierarchical nature of this computation using the multipole method in subsection 4.3.

Besides being a useful computational tool, truncation can also be interpreted as explicitly building local structure into the kernel κ. For problems where such structure exists, explicitly enforcing it makes learning more efficient, usually requiring less data to achieve the same generalization error. Many physical systems such as interacting particles in an electric potential exhibit strong local behavior that quickly decays, making truncation a natural approximation technique.
Graph Neural Networks. We utilize the standard architecture of message passing graph networks employing edge features, as introduced in Gilmer et al. (2017), to efficiently implement (13) on arbitrary discretizations of the domain D. To do so, we treat a discretization {x_1, . . . , x_J} ⊂ D as the nodes of a weighted, directed graph and assign edges to each node using the function s : D → B(D) which, recall from the section on truncation, assigns to each point a domain of integration. In particular, for j = 1, . . . , J, we assign the node x_j the value v(x_j) and emanate from it edges to the nodes s(x_j) ∩ {x_1, . . . , x_J} = N(x_j) which we call the neighborhood of x_j. If s(x) = D then the graph is fully-connected. Generally, the sparsity structure of the graph determines the sparsity of the kernel matrix K; indeed, the adjacency matrix of the graph and the block kernel matrix have the same zero entries. The weights of each edge are assigned as the arguments of the kernel. In particular, for the case of (13), the weight of the edge between nodes x_j and x_k is simply the concatenation (x_j, x_k) ∈ R^{2d}. More complicated weighting functions are considered for the implementation of the integral kernel operators (8) or (9).

With the above definition, the message passing algorithm of Gilmer et al. (2017), with averaging aggregation, updates the value v(x_j) of the node x_j to the value u(x_j) as

u(x_j) = (1/|N(x_j)|) Σ_{y ∈ N(x_j)} κ(x_j, y) v(y),   j = 1, . . . , J,

which corresponds to the Monte-Carlo approximation of the integral (16). More sophisticated quadrature rules and adaptive meshes can also be implemented using the general framework of message passing on graphs, see, for example, Pfaff et al. (2020). We further utilize this framework in subsection 4.3.
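In edge-list form, the representation used by standard graph neural network software, the same averaged update can be written with an explicit edge index, a kernel network acting on the edge feature (x_j, x_k), and mean aggregation. The sketch below uses plain PyTorch rather than any particular graph library, and all names are illustrative.

    import torch
    import torch.nn as nn

    def radius_graph(x, r):
        # directed edges (j, l) whenever ||x_j - x_l|| < r (including the self edge)
        mask = torch.cdist(x, x) < r
        return mask.nonzero().t()                                   # (2, E): rows are (target j, source l)

    class GNOLayer(nn.Module):
        def __init__(self, d, c, width=64):
            super().__init__()
            # kernel acting on the edge feature (x_j, x_l) in R^{2d}
            self.kappa = nn.Sequential(nn.Linear(2 * d, width), nn.GELU(), nn.Linear(width, c * c))
            self.c = c

        def forward(self, x, v, edge_index):
            j, l = edge_index                                       # target and source node indices
            K = self.kappa(torch.cat([x[j], x[l]], dim=-1)).view(-1, self.c, self.c)
            msg = torch.einsum("eoi,ei->eo", K, v[l])               # kappa(x_j, x_l) v(x_l) per edge
            u = torch.zeros_like(v).index_add_(0, j, msg)           # sum messages at each target node
            deg = torch.zeros(x.shape[0]).index_add_(0, j, torch.ones(j.shape[0]))
            return u / deg.clamp(min=1).unsqueeze(-1)               # averaging aggregation 1/|N(x_j)|

    x, v = torch.rand(256, 2), torch.rand(256, 8)
    layer = GNOLayer(d=2, c=8)
    u = layer(x, v, radius_graph(x, r=0.15))                        # (256, 8)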
Convolutional Neural Networks. Lastly, we compare and contrast the GNO framework to standard convolutional neural networks (CNNs). In computer vision, the success of CNNs has largely been attributed to their ability to capture local features such as edges that can be used to distinguish different objects in a natural image. This property is obtained by enforcing the convolution kernel to have local support, an idea similar to our truncation approximation. Furthermore, by directly using a translation invariant kernel, a CNN architecture becomes translation equivariant; this is a desirable feature for many vision models, e.g. ones that perform segmentation. We will show that similar ideas can be applied to the neural operator framework to obtain an architecture with built-in local properties and translational symmetries that, unlike CNNs, remain consistent in function space.
To that end, let κ(x, y) = κ(x − y) and suppose that κ : R^d → R^{m×n} is supported on B(0, r). Let r_∗ > 0 be the smallest radius such that D ⊆ B(x_∗, r_∗), where x_∗ ∈ R^d denotes the center of mass of D, and suppose r ≪ r_∗. Then (13) becomes the convolution

u(x) = (κ ∗ v)(x) = ∫_{B(x,r)∩D} κ(x − y) v(y) dy   ∀x ∈ D.   (17)

Notice that (17) is precisely (16) when s(x) = B(x, r) ∩ D and κ(x, y) = κ(x − y). When the kernel is parameterized by e.g. a standard neural network and the radius r is chosen independently of the data discretization, (17) becomes a layer of a convolution neural network that is consistent in function space. Indeed the parameters of (17) do not depend on any discretization of v. The choice κ(x, y) = κ(x − y) enforces translational equivariance in the output while picking r small enforces locality in the kernel; hence we obtain the distinguishing features of a CNN model.
We will now show that, by picking a parameterization that is inconsistent in function space and applying a Monte Carlo approximation to the integral, (17) becomes a standard CNN. This is most easily demonstrated when D = [0, 1] and the discretization {x_1, . . . , x_J} is equispaced, i.e. |x_{j+1} − x_j| = h for any j = 1, . . . , J − 1. Let k ∈ N be an odd filter size and let z_1, . . . , z_k ∈ R be the points z_j = (j − 1 − (k − 1)/2)h for j = 1, . . . , k. It is easy to see that {z_1, . . . , z_k} ⊂ B̄(0, (k − 1)h/2), which we choose as the support of κ. Furthermore, we parameterize κ directly by its pointwise values, which are m × n matrices, at the locations z_1, . . . , z_k, thus yielding kmn parameters. Then (17) becomes

u(x_j)_p ≈ (1/k) Σ_{l=1}^{k} Σ_{q=1}^{n} κ(z_l)_{pq} v(x_j − z_l)_q,   j = 1, . . . , J,   p = 1, . . . , m,

where we define v(x) = 0 if x ∉ {x_1, . . . , x_J}. Up to the constant factor 1/k, which can be re-absorbed into the parameterization of κ, this is precisely the update of a stride 1 CNN with n input channels, m output channels, and zero-padding so that the input and output signals have the same length. This example can easily be generalized to higher dimensions and different CNN structures; we made the current choices for simplicity of exposition. Notice that if we double the number of discretization points for v, i.e. J ↦ 2J and h ↦ h/2, the support of κ becomes B̄(0, (k − 1)h/4), hence the model changes due to the discretization of the data. Indeed, if we take the limit to the continuum J → ∞, we find B̄(0, (k − 1)h/2) → {0}, hence the model becomes completely local. To fix this, we may try to increase the filter size k (or equivalently add more layers) simultaneously with J, but then the number of parameters in the model goes to infinity as J → ∞ since, as we previously noted, there are kmn parameters in this layer. Therefore standard CNNs are not consistent models in function space. We demonstrate their inability to generalize to different resolutions in Section 7.
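This correspondence is easy to check numerically: evaluating the sum above with the pointwise values κ(z_l) as filter weights agrees with a stride-1, zero-padded conv1d call up to the 1/k factor and a flip of the filter (conv1d implements cross-correlation while (17) is a convolution). A sketch with m = n = 1:

    import torch
    import torch.nn.functional as F

    J, k = 32, 5                                     # grid size and odd filter size
    v = torch.randn(J)                               # single-channel input signal v(x_j)
    kappa = torch.randn(k)                           # pointwise kernel values kappa(z_1), ..., kappa(z_k)
    half = (k - 1) // 2

    # direct evaluation of the discretized convolution (m = n = 1)
    u_direct = torch.zeros(J)
    for j in range(J):
        for l in range(k):
            idx = j - (l - half)                     # grid index of x_j - z_l
            if 0 <= idx < J:
                u_direct[j] += kappa[l] * v[idx]
    u_direct /= k

    # the same computation as a stride-1, zero-padded CNN layer (note the flipped filter)
    u_cnn = F.conv1d(v.view(1, 1, J), kappa.flip(0).view(1, 1, k), padding=half).view(J) / k

    print(torch.allclose(u_direct, u_cnn, atol=1e-6))   # True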
4.2 Low-rank Neural Operator (LNO)

By directly imposing that the kernel κ is of a tensor product form, we obtain a layer with O(J) computational complexity. We term this construction the Low-rank Neural Operator (LNO) due to its equivalence to directly parameterizing a finite-rank operator. We start by assuming κ : D × D → R is scalar valued and later generalize to the vector valued setting. We express the kernel as

κ(x, y) = Σ_{j=1}^{r} φ^{(j)}(x) ψ^{(j)}(y)   ∀x, y ∈ D

for some functions φ^{(1)}, ψ^{(1)}, . . . , φ^{(r)}, ψ^{(r)} : D → R that are normally given as the components of two neural networks φ, ψ : D → R^r or a single neural network Ξ : D → R^{2r} which couples all functions through its parameters. With this definition, and supposing that n = m = 1, we have that (13) becomes

u(x) = ∫_D Σ_{j=1}^{r} φ^{(j)}(x) ψ^{(j)}(y) v(y) dy
     = Σ_{j=1}^{r} ( ∫_D ψ^{(j)}(y) v(y) dy ) φ^{(j)}(x)
     = Σ_{j=1}^{r} ⟨ψ^{(j)}, v⟩ φ^{(j)}(x)

where ⟨·, ·⟩ denotes the L^2(D; R) inner product. Notice that the inner products can be evaluated independently of the evaluation point x ∈ D, hence the computational complexity of this method is O(rJ) which is linear in the discretization.
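A sketch of this layer with φ and ψ given by small networks and the inner products computed once by quadrature; the O(rJ) cost is visible in the two matrix-vector products (names and widths are again illustrative):

    import torch
    import torch.nn as nn

    class LowRankLayer(nn.Module):
        # u(x) = sum_j <psi^(j), v> phi^(j)(x); cost O(r J) instead of O(J^2)
        def __init__(self, d, rank, width=64):
            super().__init__()
            self.phi = nn.Sequential(nn.Linear(d, width), nn.GELU(), nn.Linear(width, rank))
            self.psi = nn.Sequential(nn.Linear(d, width), nn.GELU(), nn.Linear(width, rank))

        def forward(self, x, v):
            # x: (J, d) quadrature points, v: (J,) scalar-valued input function (m = n = 1)
            J = x.shape[0]
            coeff = self.psi(x).t() @ v / J          # (rank,): <psi^(j), v> by Monte Carlo quadrature
            return self.phi(x) @ coeff               # (J,): sum_j coeff_j phi^(j)(x)

    x = torch.rand(512, 2)
    v = torch.sin(2 * torch.pi * x[:, 0])
    u = LowRankLayer(d=2, rank=8)(x, v)              # (512,)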
We may also interpret this choice of kernel as directly parameterizing a rank r ∈ N operator on L^2(D; R). Indeed, we have

u = Σ_{j=1}^{r} (φ^{(j)} ⊗ ψ^{(j)}) v   (18)

which corresponds precisely to applying the SVD of a rank r operator to the function v. Equation (18) makes natural the vector valued generalization. Assume m, n ≥ 1 and φ^{(j)} : D → R^m and ψ^{(j)} : D → R^n for j = 1, . . . , r; then (18) defines an operator mapping L^2(D; R^n) → L^2(D; R^m) that can be evaluated as

u(x) = Σ_{j=1}^{r} ⟨ψ^{(j)}, v⟩_{L^2(D;R^n)} φ^{(j)}(x)   ∀x ∈ D.

We again note the linear computational complexity of this parameterization. Finally, we observe that this method can be interpreted as directly imposing a rank r structure on the kernel matrix. Indeed,

K = K_{Jr} K_{rJ}

where K_{Jr}, K_{rJ} are J × r and r × J block matrices respectively. This construction is similar to the DeepONet construction of Lu et al. (2019) discussed in Section 5.1, but parameterized to be consistent in function space.
4.3 Multipole Graph Neural Operator (MGNO)

A natural extension to directly working with kernels in a tensor product form as in Section 4.2 is to instead consider kernels that can be well approximated by such a form. This assumption gives rise to the fast multipole method (FMM) which employs a multi-scale decomposition of the kernel in order to achieve linear complexity in computing (13); for a detailed discussion see e.g. (E, 2011, Section 3.2). FMM can be viewed as a systematic approach to combine the sparse and low-rank approximations to the kernel matrix. Indeed, the kernel matrix is decomposed into different ranges and a hierarchy of low-rank structures is imposed on the long-range components. We employ this idea to construct hierarchical, multi-scale graphs, without being constrained to particular forms of the kernel. We will elucidate the workings of the FMM through matrix factorization. This approach was first outlined in Li et al. (2020b) and is referred to as the Multipole Graph Neural Operator (MGNO).

The key to the fast multipole method's linear complexity lies in the subdivision of the kernel matrix according to the range of interaction, as shown in Figure 3:

K = K_1 + K_2 + . . . + K_L,   (19)
where K_ℓ with ℓ = 1 corresponds to the shortest-range interaction, and ℓ = L corresponds to the longest-range interaction; more generally the index ℓ is ordered by the range of interaction. While the uniform grids depicted in Figure 3 produce an orthogonal decomposition of the kernel, the decomposition may be generalized to arbitrary discretizations by allowing slight overlap of the ranges.

Assuming the underlying discretization {x_1, . . . , x_J} ⊂ D is a uniform grid with resolution s such that s^d = J, the L multi-level discretizations will be grids with resolution s_l = s/2^{l−1}, and consequentially J_l = s_l^d = (s/2^{l−1})^d. In this case r_l can be chosen as 1/s_l for l = 1, . . . , L. To ensure orthogonality of the discretizations, the fast multipole algorithm sets the integration domains to be B(x, r_l)\B(x, r_{l−1}) for each level l = 2, . . . , L, so that the discretization on level l does not overlap with the one on level l − 1. Details of this algorithm can be found in e.g. Greengard and Rokhlin (1997).

Recursive Low-rank Decomposition. The coarse discretization representation can be understood as recursively applying an inducing points approximation (Quiñonero Candela and Rasmussen, 2005): starting from a discretization with J_1 = J nodes, we impose inducing points of size J_2, J_3, . . . , J_L, which all admit a low-rank kernel matrix decomposition of the form (15). The original J × J kernel matrix K_l is represented by a much smaller J_l × J_l kernel matrix, denoted by K_{l,l}. As shown in Figure 3, K_1 is full-rank but very sparse while K_L is dense but low-rank. Such structure can be achieved by applying equation (15) recursively to equation (19), leading to the multi-resolution matrix factorization (Kondor et al., 2014):

K ≈ K_{1,1} + K_{1,2} K_{2,2} K_{2,1} + K_{1,2} K_{2,3} K_{3,3} K_{3,2} K_{2,1} + · · ·   (20)

where K_{1,1} = K_1 represents the shortest range, K_{1,2} K_{2,2} K_{2,1} ≈ K_2 represents the second shortest range, etc. The center matrix K_{l,l} is a J_l × J_l kernel matrix corresponding to the l-th level of the discretization described above. The matrices K_{l+1,l}, K_{l,l+1} are J_{l+1} × J_l and J_l × J_{l+1} (wide and long, respectively) block transition matrices. Denote by v_l ∈ R^{J_l × n} the representation of the input v at each level of the discretization for l = 1, . . . , L, and by u_l ∈ R^{J_l × n} the output (assuming the inputs and outputs have the same dimension). We define the matrices K_{l+1,l}, K_{l,l+1} as moving the representation v_l between different levels of the discretization via an integral kernel that we learn. Combining with the truncation idea introduced in subsection 4.1, we define the transition matrices as discretizations of the following integral kernel operators:

K_{l,l} : v_l ↦ u_l = ∫_{B(x,r_{l,l})} κ_{l,l}(x, y) v_l(y) dy   (21)

K_{l+1,l} : v_l ↦ u_{l+1} = ∫_{B(x,r_{l+1,l})} κ_{l+1,l}(x, y) v_l(y) dy   (22)

K_{l,l+1} : v_{l+1} ↦ u_l = ∫_{B(x,r_{l,l+1})} κ_{l,l+1}(x, y) v_{l+1}(y) dy   (23)

where each kernel κ_{l,l′} : D × D → R^{n×n} is parameterized as a neural network and learned.
Figure 4: V-cycle. Left: the multi-level discretization. Right: one V-cycle iteration for the multipole neural operator.

V-cycle Algorithm We present a V-cycle algorithm, see Figure 4, for efficiently computing (20). It consists of two steps: the downward pass and the upward pass. Denote the representations in the downward pass and upward pass by v̌ and v̂ respectively. In the downward step, the algorithm starts from the fine discretization representation v̌_1 and updates it by applying a downward transition v̌_{l+1} = K_{l+1,l} v̌_l. In the upward step, the algorithm starts from the coarse representation v̂_L and updates it by applying an upward transition and the center kernel matrix, v̂_l = K_{l,l−1} v̂_{l−1} + K_{l,l} v̌_l. Notice that applying one level downward and upward exactly computes K_{1,1} + K_{1,2} K_{2,2} K_{2,1}, and a full L-level V-cycle leads to the multi-resolution decomposition (20).

Employing (21)-(23), we use L neural networks κ_{1,1}, . . . , κ_{L,L} to approximate the kernel operators associated to K_{l,l}, and 2(L − 1) neural networks κ_{1,2}, κ_{2,1}, . . . to approximate the transitions K_{l+1,l}, K_{l,l+1}. Following the iterative architecture (6), we introduce the linear operator W ∈ R^{n×n} (denoting it by W_l for each corresponding resolution) to help regularize the iteration, as well as the nonlinear activation function σ to increase the expressiveness. Since W acts pointwise (requiring that J remain the same for input and output), we employ it only along with the kernel K_{l,l} and not the transitions. At each layer t = 0, . . . , T − 1, we perform a full V-cycle as:

• Downward Pass

  For l = 1, . . . , L :   v̌_{l+1}^{(t+1)} = σ(v̂_{l+1}^{(t)} + K_{l+1,l} v̌_l^{(t+1)})   (24)

• Upward Pass

  For l = L, . . . , 1 :   v̂_l^{(t+1)} = σ((W_l + K_{l,l}) v̌_l^{(t+1)} + K_{l,l−1} v̂_{l−1}^{(t+1)}).   (25)

Notice that one full pass of the V-cycle algorithm defines a mapping v ↦ u.
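The following sketch illustrates one V-cycle in the spirit of (24)-(25), with the per-level operators replaced, for brevity, by dense matrices standing in for the discretized truncated kernels. We take the upward transition at level l to bring in the coarser representation from level l + 1, which reproduces K_{1,1} + K_{1,2} K_{2,2} K_{2,1} for L = 2 as stated above; an actual implementation would use the graph-based kernel operators and may order indices differently, so this is an illustration only.

    import torch
    import torch.nn as nn

    class VCycle(nn.Module):
        # One V-cycle on L levels with J_l nodes per level and n channels per node.
        def __init__(self, Js, n):
            super().__init__()
            L = len(Js)
            # dense stand-ins for the discretized kernel operators K_{l,l} and the transitions
            self.K_center = nn.ModuleList([nn.Linear(Js[l] * n, Js[l] * n, bias=False) for l in range(L)])
            self.K_down = nn.ModuleList([nn.Linear(Js[l] * n, Js[l + 1] * n, bias=False) for l in range(L - 1)])
            self.K_up = nn.ModuleList([nn.Linear(Js[l + 1] * n, Js[l] * n, bias=False) for l in range(L - 1)])
            self.W = nn.ModuleList([nn.Linear(n, n, bias=False) for _ in range(L)])  # pointwise W_l
            self.Js, self.n, self.L = Js, n, L

        def forward(self, v, v_hat_prev):
            # v: (J_1 * n,) fine-level representation; v_hat_prev: list of hat-states from layer t
            L, n = self.L, self.n
            v_check = [v] + [None] * (L - 1)
            for l in range(L - 1):                      # downward pass, cf. (24)
                v_check[l + 1] = torch.relu(v_hat_prev[l + 1] + self.K_down[l](v_check[l]))
            v_hat = [None] * L
            for l in reversed(range(L)):                # upward pass, cf. (25)
                local = self.W[l](v_check[l].view(self.Js[l], n)).view(-1)
                up = self.K_up[l](v_hat[l + 1]) if l + 1 < L else 0.0
                v_hat[l] = torch.relu(local + self.K_center[l](v_check[l]) + up)
            return v_hat                                # v_hat[0] lives on the fine discretization

    Js, n = [64, 32, 16], 4
    model = VCycle(Js, n)
    v = torch.randn(Js[0] * n)
    v_hat_prev = [torch.zeros(J * n) for J in Js]       # e.g. zero state at t = 0
    u = model(v, v_hat_prev)[0]                          # (256,)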
Multi-level Graphs. We emphasize that we view the discretization {x1, . . . , xJ} ⊂ D as a graph in order to facilitate an efficient implementation through the message passing graph neural network architecture. Since the V-cycle algorithm works at different levels of the discretization, we build multi-level graphs to represent the coarser and finer discretizations. We present and utilize two constructions of multi-level graphs, the orthogonal multipole graph and the generalized random graph. The orthogonal multipole graph is the standard grid construction used in the fast multipole method, adapted to a uniform grid, see e.g. (Greengard and Rokhlin, 1997). In this construction, the decomposition in (19) is orthogonal in that the finest graph only captures the closest range interaction, the second finest graph captures the second closest interaction minus the part already captured in the previous graph, and so on, recursively. In particular, the ranges of interaction for each kernel do not overlap. While this construction is usually efficient, it is limited to uniform grids, which may be a bottleneck for certain applications. Our second construction is the generalized random graph, as shown in Figure 3, where the ranges of the kernels are allowed to overlap. The generalized random graph is very flexible as it can be applied on any domain geometry and discretization. Further, it can also be combined with random sampling methods to work on problems where J is very large, or combined with an active learning method to adaptively choose the regions where a finer discretization is needed.
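One plausible way to realize the generalized random graph in code is nested random subsampling with radius-based edges, as in the sketch below; the nesting strategy, level sizes, and radii are our illustrative choices rather than a prescription from the text.

    import numpy as np

    def random_multilevel_graph(x, level_sizes, radii, rng):
        """Build a generalized random multi-level graph by nested subsampling.

        x           : (J, d) coordinates of the fine discretization (any geometry).
        level_sizes : decreasing list [J_1, ..., J_L] with J_1 <= J.
        radii       : list [r_1, ..., r_L] of interaction radii, typically increasing with level.
        Returns per-level (node_indices, intra-level edges) and inter-level transition edges.
        """
        levels, transitions = [], []
        idx = rng.choice(len(x), size=level_sizes[0], replace=False)
        for l, (J_l, r_l) in enumerate(zip(level_sizes, radii)):
            idx = idx[:J_l]                                       # coarser levels reuse a subset of finer nodes
            pts = x[idx]
            d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
            levels.append((idx, np.argwhere(d2 <= r_l ** 2)))     # edges on which K_{l,l} acts
            if l > 0:
                prev_pts = x[levels[l - 1][0]]
                d2 = ((prev_pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
                transitions.append(np.argwhere(d2 <= r_l ** 2))   # edges for the transitions K_{l+1,l}, K_{l,l+1}
        return levels, transitions

    # usage: 1024 random points in the unit square, three levels with growing radii
    rng = np.random.default_rng(0)
    x = rng.random((1024, 2))
    levels, transitions = random_multilevel_graph(x, [256, 64, 16], [0.05, 0.15, 0.5], rng)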
Linear Complexity. Each term in the decomposition (19) is represented by the kernel matrix Kl,l for l = 1, . . . , L, and Kl+1,l, Kl,l+1 for l = 1, . . . , L − 1, corresponding to the appropriate sub-discretization. Therefore the complexity of the multipole method is

\sum_{l=1}^{L} O(J_l^2 r_l^d) + \sum_{l=1}^{L-1} O(J_l J_{l+1} r_l^d) = \sum_{l=1}^{L} O(J_l^2 r_l^d).

By designing the sub-discretization so that O(J_l^2 r_l^d) \le O(J), we can obtain complexity linear in J. For example, when d = 2, pick r_l = 1/\sqrt{J_l} and J_l = O(2^{-l} J) such that r_L is large enough so that there exists a ball of radius r_L containing D. Then clearly \sum_{l=1}^{L} O(J_l^2 r_l^d) = O(J). By combining with a Nyström approximation, we can obtain O(J′) complexity for some J′ ≪ J.
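For instance, the d = 2 choice above can be checked directly; this is merely a worked instance of the stated bound under the scalings r_l = 1/\sqrt{J_l} and J_l = O(2^{-l} J):

\sum_{l=1}^{L} O\big(J_l^2 r_l^2\big) = \sum_{l=1}^{L} O\big(J_l^2 \cdot J_l^{-1}\big) = \sum_{l=1}^{L} O(J_l) = O\Big(J \sum_{l=1}^{L} 2^{-l}\Big) = O(J).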
4.4 Fourier Neural Operator (FNO)

Instead of working with a kernel directly on the domain D, we may consider its representation in Fourier space and directly parameterize it there. This allows us to utilize Fast Fourier Transform (FFT) methods in order to compute the action of the kernel integral operator (13) with almost linear complexity. A similar idea was used in (Nelsen and Stuart, 2021) to construct random features in function space. The method we outline was first described in Li et al. (2020a) and is termed the Fourier Neural Operator (FNO). We note that the theory of Section 4 is designed for general kernels and does not apply to the FNO formulation; however, similar universal approximation results were developed for it in (Kovachki et al., 2021) when the input and output spaces are Hilbert spaces. For simplicity, we will assume that D = Td is the unit torus and all functions are complex-valued. Let F : L2(D; Cn) → ℓ2(Zd; Cn) denote the Fourier transform of a function v : D → Cn and F−1 its inverse. For v ∈ L2(D; Cn) and w ∈ ℓ2(Zd; Cn), we have

(\mathcal{F}v)_j(k) = \langle v_j, \psi_k \rangle_{L^2(D;\mathbb{C})}, \qquad j \in \{1, \dots, n\}, \; k \in \mathbb{Z}^d,
(\mathcal{F}^{-1}w)_j(x) = \sum_{k \in \mathbb{Z}^d} w_j(k)\,\psi_k(x), \qquad j \in \{1, \dots, n\}, \; x \in D,

where, for each k ∈ Zd, we define

\psi_k(x) = e^{2\pi i k_1 x_1} \cdots e^{2\pi i k_d x_d}, \qquad x \in D,

with i = √−1 the imaginary unit. By letting κ(x, y) = κ(x − y) for some κ : D → Cm×n in (13) and applying the convolution theorem, we find that

u(x) = \big(\mathcal{F}^{-1}\big(\mathcal{F}(\kappa) \cdot \mathcal{F}(v)\big)\big)(x) \qquad \forall x \in D.

We therefore propose to directly parameterize κ by its Fourier coefficients. We write

u(x) = \big(\mathcal{F}^{-1}\big(R_\phi \cdot \mathcal{F}(v)\big)\big)(x) \qquad \forall x \in D,   (26)

where Rϕ is the Fourier transform of a periodic function κ : D → Cm×n parameterized by some ϕ ∈ Rp.
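To make (26) concrete, here is a minimal NumPy sketch of the spectral multiplication for a one-dimensional, n-channel input. It is our own illustration rather than the reference implementation, and for brevity it keeps only the lowest non-negative frequencies; the treatment of the full truncated mode set Zkmax is described below.

    import numpy as np

    def spectral_conv_1d(v, R):
        """Apply u = F^{-1}(R . F(v)) on a uniform 1-D grid, cf. (26).

        v : (J, n) array, the input function sampled at J points.
        R : (k_max, m, n) complex array of weights for the retained Fourier modes.
        Returns u : (J, m) array.
        """
        J, n = v.shape
        k_max, m, _ = R.shape
        v_hat = np.fft.fft(v, axis=0)               # Fourier coefficients, (J, n)
        u_hat = np.zeros((J, m), dtype=complex)     # higher modes are truncated (left at zero)
        u_hat[:k_max] = np.einsum("kij,kj->ki", R, v_hat[:k_max])
        return np.fft.ifft(u_hat, axis=0)           # back to physical space

    # usage on random data: J = 64 grid points, n = 3 input and m = 4 output channels
    rng = np.random.default_rng(0)
    v = rng.standard_normal((64, 3))
    R = rng.standard_normal((12, 4, 3)) + 1j * rng.standard_normal((12, 4, 3))
    u = spectral_conv_1d(v, R)                      # (64, 4)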
For frequency mode k ∈ Zd, we have (Fv)(k) ∈ Cn and Rϕ(k) ∈ Cm×n. We pick a finite-dimensional parameterization by truncating the Fourier series at a maximal number of modes kmax = |Zkmax| = |{k ∈ Zd : |kj| ≤ kmax,j, for j = 1, . . . , d}|. This choice improves the empirical performance and sensitivity of the resulting model with respect to the choices of discretization. We thus parameterize Rϕ directly as a complex-valued (kmax × m × n)-tensor comprising a collection of truncated Fourier modes and therefore drop ϕ from our notation. In the case where we have real-valued v and we want u to also be real-valued, we impose that κ is real-valued by enforcing conjugate symmetry in the parameterization, i.e.

R(-k)_{j,l} = R^{*}(k)_{j,l} \qquad \forall k \in Z_{k_{\max}}, \; j = 1, \dots, m, \; l = 1, \dots, n.

We note that the set Zkmax is not the canonical choice for the low frequency modes of vt. Indeed, the low frequency modes are usually defined by placing an upper bound on the ℓ1-norm of k ∈ Zd. We choose Zkmax as above since it allows for an efficient implementation. Figure 5 gives a pictorial representation of an entire Neural Operator architecture employing Fourier layers.

Figure 5: Top: the architecture of the neural operators; bottom: Fourier layer. (a) The full architecture of the neural operator: start from input a. 1. Lift to a higher dimensional channel space by a neural network P. 2. Apply T (typically T = 4) layers of integral operators and activation functions. 3. Project back to the target dimension by a neural network Q. Output u. (b) Fourier layers: start from input v. On top: apply the Fourier transform F; a linear transform R on the lower Fourier modes, which also filters out the higher modes; then apply the inverse Fourier transform F−1. On the bottom: apply a local linear transform W.

The Discrete Case and the FFT. Assuming the domain D is discretized with J ∈ N points, we can treat v ∈ CJ×n and F(v) ∈ CJ×n. Since we convolve v with a function which only has kmax Fourier modes, we may simply truncate the higher modes to obtain F(v) ∈ Ckmax×n. Multiplication by the weight tensor R ∈ Ckmax×m×n is then

\big(R \cdot (\mathcal{F}v)\big)_{k,l} = \sum_{j=1}^{n} R_{k,l,j}\,(\mathcal{F}v)_{k,j}, \qquad k = 1, \dots, k_{\max}, \; l = 1, \dots, m.   (27)

When the discretization is uniform with resolution s1 × · · · × sd = J, F can be replaced by the Fast Fourier Transform. For v ∈ CJ×n, k = (k1, . . . , kd) ∈ Zs1 × · · · × Zsd, and x = (x1, . . . , xd) ∈ D, the FFT F̂ and its inverse F̂−1 are defined as

(\hat{\mathcal{F}}v)_l(k) = \sum_{x_1=0}^{s_1-1} \cdots \sum_{x_d=0}^{s_d-1} v_l(x_1, \dots, x_d)\, e^{-2i\pi \sum_{j=1}^{d} x_j k_j / s_j},
(\hat{\mathcal{F}}^{-1}v)_l(x) = \sum_{k_1=0}^{s_1-1} \cdots \sum_{k_d=0}^{s_d-1} v_l(k_1, \dots, k_d)\, e^{2i\pi \sum_{j=1}^{d} x_j k_j / s_j},

for l = 1, . . . , n. In this case, the set of truncated modes becomes

Z_{k_{\max}} = \big\{(k_1, \dots, k_d) \in \mathbb{Z}_{s_1} \times \cdots \times \mathbb{Z}_{s_d} \;\big|\; k_j \le k_{\max,j} \text{ or } s_j - k_j \le k_{\max,j}, \text{ for } j = 1, \dots, d\big\}.

When implemented, R is treated as a (s1 × · · · × sd × m × n)-tensor and the above definition of Zkmax corresponds to the “corners” of R, which allows for a straightforward parallel implementation of (27) via matrix-vector multiplication. In practice, we have found that choosing kmax,j roughly between 1/3 and 2/3 of the maximum number of Fourier modes in the Fast Fourier Transform of the grid evaluation of the input function provides desirable performance. In our empirical studies, we set kmax,j = 12, which yields kmax = 12^d parameters per channel; this proves sufficient for all the tasks that we consider.
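Putting these pieces together, the following is a hedged sketch of one discrete Fourier layer on a uniform two-dimensional grid. The function name, the shapes, and the use of NumPy's real FFT are our assumptions: with rfft2, only the non-negative frequencies along the second axis are stored, so the “corner” selection reduces to the first and last kmax,1 rows together with the leading kmax,2 columns, and conjugate symmetry of the output is handled automatically by irfft2.

    import numpy as np

    def fourier_layer_2d(v, R, W, sigma=np.tanh):
        """One Fourier layer sigma(W v + F^{-1}(R . F(v))) on an s1 x s2 grid.

        v : (s1, s2, n) real array, the input function on the grid.
        R : (2*k1, k2, m, n) complex weights for the retained modes
            (rows 0..k1-1 and -k1..-1 of the first frequency axis, columns 0..k2-1 of the second).
        W : (m, n) real matrix applied pointwise (the local linear transform).
        """
        s1, s2, n = v.shape
        two_k1, k2, m, _ = R.shape
        k1 = two_k1 // 2

        v_hat = np.fft.rfft2(v, axes=(0, 1))                 # (s1, s2//2 + 1, n) half-spectrum
        u_hat = np.zeros((s1, s2 // 2 + 1, m), dtype=complex)
        corner = np.concatenate([v_hat[:k1, :k2], v_hat[-k1:, :k2]], axis=0)       # (2*k1, k2, n)
        u_hat[np.r_[0:k1, s1 - k1:s1], :k2] = np.einsum("xyij,xyj->xyi", R, corner)
        spectral = np.fft.irfft2(u_hat, s=(s1, s2), axes=(0, 1))                   # real, (s1, s2, m)

        return sigma(v @ W.T + spectral)                     # pointwise W v plus the spectral term

    # usage: a 64 x 64 grid with n = m = 3 channels and k1 = k2 = 12 retained modes per direction
    rng = np.random.default_rng(0)
    v = rng.standard_normal((64, 64, 3))
    R = (rng.standard_normal((24, 12, 3, 3)) + 1j * rng.standard_normal((24, 12, 3, 3))) / 3
    W = rng.standard_normal((3, 3)) / 3
    u = fourier_layer_2d(v, R, W)                            # (64, 64, 3)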
Choices for R. In general, R can be defined to depend on (Fa), the Fourier transform of the input a ∈ A, to parallel our construction (8). Indeed, we can define Rϕ : Zd × Cda → Cm×n as a parametric function that maps (k, (Fa)(k)) to the values of the appropriate Fourier modes. We have experimented with the following parameterizations of Rϕ:

• Direct. Define the parameters ϕk ∈ Cm×n for each wave number k:
  Rϕ(k, (Fa)(k)) := ϕk.

• Linear. Define the parameters ϕk1 ∈ Cm×n×da, ϕk2 ∈ Cm×n for each wave number k:
  Rϕ(k, (Fa)(k)) := ϕk1 (Fa)(k) + ϕk2.

• Feed-forward neural network. Let Φϕ : Zd × Cda → Cm×n be a neural network with parameters ϕ:
  Rϕ(k, (Fa)(k)) := Φϕ(k, (Fa)(k)).

We find that the linear parameterization has a similar performance to the direct parameterization above; however, it is not as efficient both in terms of computational complexity and the number of parameters required. On the other hand, we find that the feed-forward neural network parameterization has a worse performance. This is likely due to the discrete structure of the space Zd; numerical evidence suggests neural networks are not adept at handling this structure. Our experiments in this work focus on the direct parameterization presented above.
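Schematically, the three options can be written as follows. The shapes are hypothetical (kmax retained modes, an m × n block per mode, da input channels), and this is a sketch of the three signatures rather than code used for the reported experiments, which rely on the direct variant only.

    import numpy as np

    def R_direct(k_index, Fa_k, phi):
        # phi: (k_max, m, n) complex; the weight ignores the transformed input entirely.
        return phi[k_index]

    def R_linear(k_index, Fa_k, phi1, phi2):
        # phi1: (k_max, m, n, da), phi2: (k_max, m, n); affine in (Fa)(k).
        return np.einsum("mnd,d->mn", phi1[k_index], Fa_k) + phi2[k_index]

    def R_mlp(k, Fa_k, mlp):
        # mlp: any callable mapping the concatenated (k, Re (Fa)(k), Im (Fa)(k)) to an (m, n) block.
        return mlp(np.concatenate([np.asarray(k, dtype=float), Fa_k.real, Fa_k.imag]))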
Rϕ k, (Fa)(k) := ϕk .
Invariance to Discretization. The Fourier layers are discretization-invariant because they can
learn from and evaluate functions which are discretized in an arbitrary way. Since parameters are • Linear. Define the parameters ϕk1 ∈ Cm×n×da , ϕk2 ∈ Cm×n for each wave number k:
learned directly in Fourier space, resolving the functions in physical space simply amounts to pro-
Rϕ k, (Fa)(k) := ϕk1 (Fa)(k) + ϕk2 .
jecting on the basis elements e2πi⟨x,k⟩ ; these are well-defined everywhere on Cd .
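One way to see this property concretely is the following standalone NumPy sketch (our own illustration): a fixed set of retained Fourier coefficients defines one and the same band-limited function, and evaluating it on a grid of any resolution is just a zero-padded inverse FFT of those same coefficients.

    import numpy as np

    rng = np.random.default_rng(0)
    coeffs = rng.standard_normal((12, 12)) + 1j * rng.standard_normal((12, 12))   # the learned modes

    def on_grid(coeffs, s1, s2):
        padded = np.zeros((s1, s2), dtype=complex)
        padded[:12, :12] = coeffs                        # place the retained modes in the low-frequency corner
        return np.fft.ifft2(padded, norm="forward")      # sample sum_k c_k exp(2*pi*i<k,x>) on the grid

    coarse = on_grid(coeffs, 64, 64)
    fine = on_grid(coeffs, 256, 256)
    print(np.allclose(coarse, fine[::4, ::4]))           # True: both grids sample the same function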
Quasi-linear Complexity. The weight tensor R contains kmax < J modes, so the inner multiplication has complexity O(kmax). Therefore, the majority of the computational cost lies in computing the Fourier transform F(v) and its inverse. General Fourier transforms have complexity O(J^2); however, since we truncate the series, the complexity is in fact O(J kmax), while the FFT has complexity O(J log J). Generally, we have found using FFTs to be very efficient; however, a uniform discretization is required.

Non-uniform and Non-periodic Geometry. The Fourier neural operator model is defined based on Fourier transform operations accompanied by local residual operations and potentially additive bias function terms. These operations are mainly defined on general geometries, function spaces, and choices of discretization. They are not limited to rectangular domains, periodic functions, or uniform grids. In this paper, we instantiate these operations on uniform grids and periodic functions in order to develop fast implementations that enjoy spectral convergence and utilize methods such as the fast Fourier transform. In order to maintain a fast and memory-efficient method, our implementation of the Fourier neural operator relies on the fast Fourier transform, which is only defined on uniform mesh discretizations of D = Td, or for functions on the square satisfying homogeneous Dirichlet (fast Fourier sine transform) or homogeneous Neumann (fast Fourier cosine transform) boundary conditions. However, the fast implementation of the Fourier neural operator can be applied in more general geometries via Fourier continuations. Given any compact manifold D = M, we can always embed it into a periodic cube (torus),

i : M → T^d,

where the regular FFT can be applied. Conventionally, in numerical analysis applications, the embedding i is defined through a continuous extension by fitting polynomials (Bruno et al., 2007). However, in the Fourier neural operator, the idea can be applied simply by padding the input with zeros. The loss is computed only on the original space during training. The Fourier neural operator will automatically generate a smooth extension to the padded domain in the output space.
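A minimal sketch of this zero-padding trick (our own illustration; the pad width and names are arbitrary): pad the non-periodic input, run the Fourier layers on the padded grid, and crop back to the original region, on which the training loss is evaluated.

    import numpy as np

    def padded_forward(model, a, pad=8):
        """Apply an FNO-style model defined on a periodic grid to a non-periodic input.

        model : callable mapping a (s1, s2, n) array to a (s1, s2, m) array (e.g. stacked Fourier layers).
        a     : (s1, s2, n) input sampled on the physical, non-periodic domain.
        """
        a_pad = np.pad(a, ((0, pad), (0, pad), (0, 0)))   # zero-pad to an artificial periodic extension
        u_pad = model(a_pad)
        return u_pad[: a.shape[0], : a.shape[1]]          # crop: the loss sees only the original region

    # usage with a placeholder model (identity), just to show the shapes
    a = np.random.default_rng(0).standard_normal((64, 64, 3))
    u = padded_forward(lambda x: x, a)                    # (64, 64, 3)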
4.5 Summary

We summarize the main computational approaches presented in this section and their complexity:

• GNO: Subsample J′ points from the J-point discretization and compute the truncated integral

u(x) = \int_{B(x,r)} \kappa(x, y)\, v(y)\, \mathrm{d}y   (28)

at a O(JJ′) complexity.

• LNO: Decompose the kernel function into a tensor product form and compute

u(x) = \sum_{j=1}^{r} \langle \psi^{(j)}, v \rangle\, \varphi^{(j)}(x)   (29)

at a O(J) complexity.

• MGNO: Compute a multi-scale decomposition of the kernel

K = K_{1,1} + K_{1,2} K_{2,2} K_{2,1} + K_{1,2} K_{2,3} K_{3,3} K_{3,2} K_{2,1} + \cdots, \qquad u(x) = (Kv)(x),   (30)

at a O(J) complexity.

• FNO: Parameterize the kernel in the Fourier domain and compute it using the FFT,

u(x) = \big(\mathcal{F}^{-1}\big(R_\phi \cdot \mathcal{F}(v)\big)\big)(x),   (31)

at a O(J log J) complexity.
5. Neural Operators and Other Deep Learning Models

In this section, we provide a discussion of recent related methods, in particular DeepONets, and demonstrate that their architectures are subsumed by generic neural operators when neural operators are parametrized inconsistently (Section 5.1). When only applied and queried on fixed grids, we show neural operator architectures subsume neural networks and, furthermore, we show how transformers are special cases of neural operators (Section 5.2).

5.1 DeepONets

We will now draw a parallel between the recently proposed DeepONet architecture in Lu et al. (2019), a map from finite-dimensional spaces to function spaces, and the neural operator framework. We will show that if we use a particular, point-wise parameterization of the first kernel in a NO and discretize the integral operator, we obtain a DeepONet. However, such a parameterization breaks the notion of discretization invariance because the number of parameters depends on the discretization of the input function. Therefore such a model cannot be applied to arbitrarily discretized functions and its number of parameters goes to infinity as we take the limit to the continuum. This phenomenon is similar to our discussion in subsection 4.1 where a NO parametrization which is inconsistent in function space and breaks discretization invariance yields a CNN. We propose a modification to the DeepONet architecture, based on the idea of the LNO, which addresses this issue and gives a discretization invariant neural operator.

Proposition 5 A neural operator with a point-wise parameterized first kernel and discretized integral operators yields a DeepONet.

Proof We work with (11) where we choose W0 = 0 and denote b0 by b. For simplicity, we will consider only real-valued functions, i.e. da = du = 1, and set dv0 = dv1 = n and dv2 = p for some n, p ∈ N. Define P : R → Rn by P(x) = (x, . . . , x) and Q : Rp → R by Q(x) = x1 + · · · + xp. Furthermore let κ(1) : D′ × D → Rp×n be defined by some κ(1)_{jk} : D′ × D → R for j = 1, . . . , p and k = 1, . . . , n. Similarly let κ(0) : D × D → Rn×n be given as κ(0)(x, y) = diag(κ(0)_1(x, y), . . . , κ(0)_n(x, y)) for some κ(0)_1, . . . , κ(0)_n : D × D → R. Then (11) becomes

(\mathcal{G}_\theta(a))(x) = \sum_{k=1}^{p} \sum_{j=1}^{n} \int_D \kappa^{(1)}_{jk}(x, y)\, \sigma\Big( \int_D \kappa^{(0)}_{j}(y, z)\, a(z)\, \mathrm{d}z + b_j(y) \Big)\, \mathrm{d}y,

where b(y) = (b1(y), . . . , bn(y)) for some b1, . . . , bn : D → R. Let x1, . . . , xq ∈ D be the points at which the input function a is evaluated and denote by ã = (a(x1), . . . , a(xq)) ∈ Rq the vector of evaluations. Choose κ(0)_j(y, z) = 1(y) wj(z) for some w1, . . . , wn : D → R, where 1 denotes the constant function taking the value one. Let

w_j(x_l) = \frac{q}{|D|}\, \tilde{w}_{jl}

for j = 1, . . . , n and l = 1, . . . , q, where w̃jl ∈ R are some constants. Furthermore let bj(y) = b̃j 1(y) for some constants b̃j ∈ R. Then the Monte Carlo approximation of the inner integral yields

(\mathcal{G}_\theta(a))(x) = \sum_{k=1}^{p} \sum_{j=1}^{n} \int_D \kappa^{(1)}_{jk}(x, y)\, \sigma\big( \langle \tilde{w}_j, \tilde{a} \rangle_{\mathbb{R}^q} + \tilde{b}_j \big)\, \mathbb{1}(y)\, \mathrm{d}y,

where w̃j = (w̃j1, . . . , w̃jq). Choose κ(1)_{jk}(x, y) = (c̃jk/|D|) φk(x) 1(y) for some constants c̃jk ∈ R and functions φ1, . . . , φp : D′ → R. Then we obtain

(\mathcal{G}_\theta(a))(x) = \sum_{k=1}^{p} \Big( \sum_{j=1}^{n} \tilde{c}_{jk}\, \sigma\big( \langle \tilde{w}_j, \tilde{a} \rangle_{\mathbb{R}^q} + \tilde{b}_j \big) \Big) \varphi_k(x) = \sum_{k=1}^{p} G_k(\tilde{a})\, \varphi_k(x),   (32)

where Gk : Rq → R can be viewed as the components of a single hidden layer neural network G : Rq → Rp with parameters w̃jl, b̃j, c̃jk. The maps φ1, . . . , φp form the trunk net while G is the branch net of a DeepONet. Our construction above can clearly be generalized to yield arbitrary depth branch nets by adding more kernel integral layers, and, similarly, the trunk net can be chosen arbitrarily deep by parameterizing each φk as a deep neural network.
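As a complement to the proof, the following is a minimal sketch of the resulting architecture (32): a branch network acting on the q fixed sensor values and a trunk network producing the p basis functions. The names, widths, and the choice of tanh are illustrative assumptions, not the DeepONet reference code.

    import numpy as np

    def deeponet(a_sensors, x_query, params, sigma=np.tanh):
        """Evaluate G(a)(x) = sum_k G_k(a~) phi_k(x), cf. (32).

        a_sensors : (q,) values of the input function a at q fixed sensor points.
        x_query   : (d,) query location x.
        params    : dict with branch weights W (n, q), b (n,), C (p, n)
                    and trunk weights U (h, d), c (h,), V (p, h).
        """
        hidden = sigma(params["W"] @ a_sensors + params["b"])              # branch net hidden layer
        branch = params["C"] @ hidden                                      # (G_1(a~), ..., G_p(a~))
        trunk = params["V"] @ sigma(params["U"] @ x_query + params["c"])   # (phi_1(x), ..., phi_p(x))
        return branch @ trunk                                              # scalar G(a)(x)

    # usage with random parameters: q = 32 sensors, p = 16 basis functions, inputs on a 2-D domain
    rng = np.random.default_rng(0)
    q, n, p, h, d = 32, 64, 16, 64, 2
    params = {"W": rng.standard_normal((n, q)), "b": rng.standard_normal(n),
              "C": rng.standard_normal((p, n)), "U": rng.standard_normal((h, d)),
              "c": rng.standard_normal(h), "V": rng.standard_normal((p, h))}
    value = deeponet(rng.standard_normal(q), rng.standard_normal(d), params)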
Since the mappings w1, . . . , wn are point-wise parametrized based on the input values ã, it is clear that the construction in the above proof is not discretization invariant. In order to make this model a discretization invariant neural operator, we propose DeepONet-Operator where, for each j, we replace the inner product in the finite dimensional space ⟨w̃j, ã⟩Rq with an appropriate inner product in the function space ⟨wj, a⟩:

(\mathcal{G}_\theta(a))(x) = \sum_{k=1}^{p} \sum_{j=1}^{n} \tilde{c}_{jk}\, \sigma\big( \langle w_j, a \rangle + \tilde{b}_j \big)\, \varphi_k(x).   (33)

This operation is a projection of the function a onto wj. Parametrizing wj by neural networks makes DeepONet-Operator a discretization invariant model.
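The only change relative to (32) is how the branch input is formed; a hedged sketch of that change is below, where the quadrature rule and the stand-in parameterization of the wj are our own choices.

    import numpy as np

    def function_space_inner_products(w_net, a_values, x_points, domain_volume=1.0):
        """Approximate <w_j, a> = int_D w_j(x) a(x) dx for j = 1, ..., n by a quadrature sum.

        w_net    : callable mapping an (q, d) array of points to an (q, n) array of values w_j(x_i).
        a_values : (q,) samples of the input function a at x_points.
        x_points : (q, d) evaluation points, which may change from sample to sample.
        """
        q = len(a_values)
        weight = domain_volume / q                        # uniform quadrature weight |D| / q
        return weight * w_net(x_points).T @ a_values      # (n,) features, independent of how q was chosen

    # usage: w_j given here by fixed random Fourier features as a stand-in for a small neural network
    rng = np.random.default_rng(0)
    freqs = rng.standard_normal((2, 8))                   # d = 2, n = 8 functionals
    w_net = lambda X: np.sin(X @ freqs)
    x = rng.random((128, 2)); a = np.cos(x[:, 0])         # a 128-point discretization of a on [0, 1]^2
    features = function_space_inner_products(w_net, a, x) # (8,) -- same parameters at any resolution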
G is the branch net of a DeepONet. Our construction above can clearly be generalized to yield
There are other ways in which the issue can be resolved for DeepONets. For example, by
arbitrary depth branch nets by adding more kernel integral layers, and, similarly, the trunk net can
fixing the set of points on which the input function is evaluated independently of its discretization,
be chosen arbitrarily deep by parameterizing each φk as a deep neural network.
by taking local spatial averages as in (Lanthaler et al., 2021) or more generally by taking a set of
linear functionals on A as input to a finite-dimensional branch neural network (a generalization to
DeeONet-Operator) as in the specific PCA-based variant on DeepONet in (De Hoop et al., 2022). Since the mappings w1 , . . . , wn are point-wise parametrized based on the input values ã, it is
We demonstrate numerically in Section 7 that, when applied in the standard way, the error incurred clear that the construction in the above proof is not discretization invariant. In order to make this
by DeepONet(s) grows with the discretization of a while it remains constant for neural operators. model a discretization invariant neural operator, we propose DeepONet-Operator where, for each
Linear Approximation and Nonlinear Approximation. We point out that parametrizations of the form (32) fall within the class of linear approximation methods since the nonlinear space G†(A) is approximated by the linear space span{φ1, . . . , φp} (DeVore, 1998). The quality of the best possible linear approximation to a nonlinear space is given by the Kolmogorov n-width, where n is the dimension of the linear space used in the approximation (Pinkus, 1985). The rate of decay of the n-width as a function of n quantifies how well the linear space approximates the nonlinear one. It is well known that for some problems, such as the flow maps of advection-dominated PDEs, the n-widths decay very slowly; hence a very large n is needed for a good approximation for such problems (Cohen and DeVore, 2015). This can be limiting in practice as more parameters are needed in order to describe more basis functions φj and therefore more data is needed to fit these parameters.

On the other hand, we point out that parametrizations of the form (6), and the particular case (11), constitute (in general) a form of nonlinear approximation. The benefits of nonlinear approximation are well understood in the setting of function approximation, see e.g. (DeVore, 1998); however, the theory for the operator setting is still in its infancy (Bonito et al., 2020; Cohen et al., 2020). We observe numerically in Section 7 that nonlinear parametrizations such as (11) outperform linear ones such as DeepONets or the low-rank method introduced in Section 4.2 when implemented with similar numbers of parameters. We acknowledge, however, that the theory presented in Section 8 is based on the reduction to a linear approximation and therefore does not capture the benefits of the nonlinear approximation. Furthermore, in practice, we have found that deeper architectures than (11) (usually four to five layers are used in the experiments of Section 7) perform better. The benefits of depth are likewise not captured in our analysis in Section 8. We leave further theoretical studies of approximation properties as an interesting avenue of investigation for future work.

Function Representation. An important difference between neural operators, introduced here, PCA-based operator approximation, introduced in Bhattacharya et al. (2020), and DeepONets, introduced in Lu et al. (2019), is the manner in which the output function space is finite-dimensionalized. Neural operators as implemented in this paper typically use the same finite-dimensionalization in both the input and output function spaces; however, different variants of the neural operator idea use different finite-dimensionalizations. As discussed in Section 4, the GNO and MGNO are finite-dimensionalized using pointwise values as the nodes of graphs; the FNO is finite-dimensionalized in Fourier space, requiring finite-dimensionalization on a uniform grid in real space; the Low-rank neural operator is finite-dimensionalized on a product space formed from the Barron space of neural networks. The PCA approach finite-dimensionalizes in the span of PCA modes. DeepONet, on the other hand, uses different input and output space finite-dimensionalizations; in its basic form it uses pointwise (grid) values on the input (branch net) whilst its output (trunk net) is represented as a function in Barron space. There also exist POD-DeepONet variants that finite-dimensionalize the output in the span of PCA modes (Lu et al., 2021b), bringing them closer to the method introduced in Bhattacharya et al. (2020), but with a different finite-dimensionalization of the input space.

As is widely quoted, “all models are wrong, but some are useful” (Box, 1976). For operator approximation, each finite-dimensionalization has its own induced biases and limitations, and therefore works best on a subset of problems. Finite-dimensionalization introduces a trade-off between flexibility and representation power of the resulting approximate architecture. The Barron space representation (Low-rank operator and DeepONet) is usually the most generic and flexible as it is widely applicable. However, this can lead to induced biases and reduced representation power on specific problems; in practice, DeepONet sometimes needs problem-specific feature engineering and architecture choices as studied in Lu et al. (2021b). We conjecture that these problem-specific features compensate for the induced bias and reduced representation power that the basic form of the method (Lu et al., 2019) sometimes exhibits. The PCA (PCA operator, POD-DeepONet) and graph-based (GNO, MGNO) discretizations are also generic, but more specific compared to the DeepONet representation; for this reason POD-DeepONet can outperform DeepONet on some problems (Lu et al., 2021b). On the other hand, the uniform grid-based representation of the FNO is the most specific of all those operator approximators considered in this paper: in its basic form it applies by discretizing the input functions, assumed to be specified on a periodic domain, on a uniform grid. As shown in Section 7, FNO usually works out of the box on such problems. But, as a trade-off, it requires substantial additional treatments to work well on non-uniform geometries, such as extension, interpolation (explored in Lu et al. (2021b)), and Fourier continuation (Bruno et al., 2007).
5.2 Transformers as a Special Case of Neural Operators

We will now show that our neural operator framework can be viewed as a continuum generalization of the popular transformer architecture (Vaswani et al., 2017), which has been extremely successful in natural language processing tasks (Devlin et al., 2018; Brown et al., 2020) and, more recently, is becoming a popular choice in computer vision tasks (Dosovitskiy et al., 2020). The parallel stems from the fact that we can view sequences of arbitrary length as arbitrary discretizations of functions. Indeed, in the context of natural language processing, we may think of a sentence as a “word”-valued function on, for example, the domain [0, 1]. Assuming our function is linked to a sentence with a fixed semantic meaning, adding or removing words from the sentence simply corresponds to refining or coarsening the discretization of [0, 1]. We will now make this intuition precise in the proof of the following statement.
Proposition 6 The attention mechanism in transformer models is a special case of a neural operator layer.

Proof We will show that by making a particular choice of the nonlinear integral kernel operator (9) and discretizing the integral by a Monte Carlo approximation, a neural operator layer reduces to a pre-normalized, single-headed attention, transformer block as originally proposed in (Vaswani et al., 2017). For simplicity, we assume dvt = n ∈ N and that Dt = D for any t = 0, . . . , T, the bias term is zero, and W = I is the identity. Furthermore, to simplify notation, we will drop the layer index t from (10) and, employing (9), obtain

u(x) = \sigma\Big( v(x) + \int_D \kappa_v\big(x, y, v(x), v(y)\big)\, v(y)\, \mathrm{d}y \Big) \qquad \forall x \in D,   (34)

a single layer of the neural operator, where v : D → Rn is the input function to the layer and we denote by u : D → Rn the output function. We use the notation κv to indicate that the kernel depends on the entirety of the function v as well as on its pointwise values v(x) and v(y). While this is not explicitly done in (9), it is a straightforward generalization. We now pick a specific form for the kernel; in particular, we assume κv : Rn × Rn → Rn×n does not explicitly depend on the spatial variables (x, y) but only on the input pair (v(x), v(y)). Furthermore, we let

\kappa_v\big(v(x), v(y)\big) = g_v\big(v(x), v(y)\big)\, R,

where R ∈ Rn×n is a matrix of free parameters, i.e. its entries are concatenated in θ so they are learned, and gv : Rn × Rn → R is defined as

g_v\big(v(x), v(y)\big) = \Big( \int_D \exp\Big( \frac{\langle A v(s), B v(y) \rangle}{\sqrt{m}} \Big)\, \mathrm{d}s \Big)^{-1} \exp\Big( \frac{\langle A v(x), B v(y) \rangle}{\sqrt{m}} \Big).

Here A, B ∈ Rm×n are again matrices of free parameters, m ∈ N is a hyperparameter, and ⟨·, ·⟩ is the Euclidean inner-product on Rm. Putting this together, we find that (34) becomes

u(x) = \sigma\Bigg( v(x) + \int_D \frac{\exp\big( \langle A v(x), B v(y) \rangle / \sqrt{m} \big)}{\int_D \exp\big( \langle A v(s), B v(y) \rangle / \sqrt{m} \big)\, \mathrm{d}s}\, R\, v(y)\, \mathrm{d}y \Bigg) \qquad \forall x \in D.   (35)

Equation (35) can be thought of as the continuum limit of a transformer block. To see this, we will discretize to obtain the usual transformer block.

To that end, let {x1, . . . , xk} ⊂ D be a uniformly-sampled, k-point discretization of D and denote vj = v(xj) ∈ Rn and uj = u(xj) ∈ Rn for j = 1, . . . , k. Approximating the inner integral in (35) by Monte Carlo, we have

\int_D \exp\Big( \frac{\langle A v(s), B v(y) \rangle}{\sqrt{m}} \Big)\, \mathrm{d}s \approx \frac{|D|}{k} \sum_{l=1}^{k} \exp\Big( \frac{\langle A v_l, B v(y) \rangle}{\sqrt{m}} \Big).

Plugging this into (35) and using the same approximation for the outer integral yields

u_j = \sigma\Bigg( v_j + \sum_{q=1}^{k} \frac{\exp\big( \langle A v_j, B v_q \rangle / \sqrt{m} \big)}{\sum_{l=1}^{k} \exp\big( \langle A v_l, B v_q \rangle / \sqrt{m} \big)}\, R\, v_q \Bigg), \qquad j = 1, \dots, k.   (36)

Equation (36) can be viewed as a Nyström approximation of (35). Define the vectors zq ∈ Rk by

z_q = \frac{1}{\sqrt{m}} \big( \langle A v_1, B v_q \rangle, \dots, \langle A v_k, B v_q \rangle \big), \qquad q = 1, \dots, k.

Define S : Rk → ∆k, where ∆k denotes the k-dimensional probability simplex, as the softmax function

S(w) = \Big( \frac{\exp(w_1)}{\sum_{j=1}^{k} \exp(w_j)}, \dots, \frac{\exp(w_k)}{\sum_{j=1}^{k} \exp(w_j)} \Big), \qquad \forall w \in \mathbb{R}^k.

Then we may re-write (36) as

u_j = \sigma\Big( v_j + \sum_{q=1}^{k} S_j(z_q)\, R\, v_q \Big), \qquad j = 1, \dots, k.

Furthermore, if we re-parametrize R = Rout Rval, where Rout ∈ Rn×m and Rval ∈ Rm×n are matrices of free parameters, we obtain

u_j = \sigma\Big( v_j + R_{\mathrm{out}} \sum_{q=1}^{k} S_j(z_q)\, R_{\mathrm{val}}\, v_q \Big), \qquad j = 1, \dots, k,

which is precisely the single-headed attention, transformer block with no layer normalization applied inside the activation function. In the language of transformers, the matrices A, B, and Rval correspond to the queries, keys, and values functions respectively. We note that tricks such as layer normalization (Ba et al., 2016) can be adapted in a straightforward manner to the continuum setting and incorporated into (35). Furthermore, multi-headed self-attention can be realized by simply allowing κv to be a sum over multiple functions of the form gv R, all of which have separate trainable parameters. Including such generalizations yields the continuum limit of the transformer as implemented in practice. We do not pursue this here as our goal is simply to draw a parallel between the two methods.
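For concreteness, here is a standalone NumPy sketch of the discretized layer (36), keeping the normalization over the first index exactly as written in the proof; it is our own illustration rather than a transformer implementation.

    import numpy as np

    def attention_kernel_layer(v, A, B, R, sigma=np.tanh):
        """Compute u_j = sigma(v_j + sum_q softmax_j(z_q) R v_q), cf. (36).

        v    : (k, n) function values at the k discretization points.
        A, B : (m, n) query and key matrices; R : (n, n) value matrix.
        """
        m = A.shape[0]
        scores = (v @ A.T) @ (v @ B.T).T / np.sqrt(m)             # scores[j, q] = <A v_j, B v_q> / sqrt(m)
        weights = np.exp(scores)
        weights = weights / weights.sum(axis=0, keepdims=True)    # normalize over j, i.e. the l-sum in (36)
        return sigma(v + weights @ (v @ R.T))                     # (k, n)

    # usage on a random 16-point discretization with n = 8 channels and m = 4
    rng = np.random.default_rng(0)
    v = rng.standard_normal((16, 8))
    A, B = rng.standard_normal((4, 8)), rng.standard_normal((4, 8))
    R = rng.standard_normal((8, 8)) / 8
    u = attention_kernel_layer(v, A, B, R)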
q=1
Furthermore, if we re-parametrize R = Rout Rval where Rout ∈ Rn×m and Rval ∈ Rm×n are
Even though transformers are special cases of neural operators, the standard attention mech-
matrices of free parameters, we obtain
anism is memory and computation intensive, as seen in Section 6, compared to neural operator
k
architectures developed here (7)-(9). The high computational complexity of transformers is evident X
uj = σ vj + Rout Sj (zq )Rval vq , j = 1, . . . , k
is (35) since we must evaluate a nested integral of v for each x ∈ D. Recently, efficient attention q=1
35 35
KOVACHKI , L I , L IU , A ZIZZADENESHELI , B HATTACHARYA , S TUART, A NANDKUMAR KOVACHKI , L I , L IU , A ZIZZADENESHELI , B HATTACHARYA , S TUART, A NANDKUMAR
mechanisms have been explored, e.g. long-short Zhu et al. (2021) and adaptive FNO-based attention which is precisely the single-headed attention, transformer block with no layer normalization ap-
mechanisms (Guibas et al., 2021). However, many of the efficient vision transformer architectures plied inside the activation function. In the language of transformers, the matrices A, B, and Rval
(Choromanski et al., 2020; Dosovitskiy et al., 2020) like ViTs are not special cases of neural opera- correspond to the queries, keys, and values functions respectively. We note that tricks such as layer
tors since they use CNN layers to generate tokens, which are not discretization invariant. normalization (Ba et al., 2016) can be adapted in a straightforward manner to the continuum setting
and incorporated into (35). Furthermore multi-headed self-attention can be realized by simply al-
6. Test Problems lowing κv to be a sum over multiple functions with form gv R all of which have separate trainable
parameters. Including such generalizations yields the continuum limit of the transformer as imple-
A central application of neural operators is learning solution operators defined by parametric mented in practice. We do not pursue this here as our goal is simply to draw a parallel between the
partial differential equations. In this section, we define four test problems for which we numerically two methods.
study the approximation properties of neural operators. To that end, let (A, U, F) be a triplet of
Banach spaces. The first two problem classes considered are derived from the following general
Even though transformers are special cases of neural operators, the standard attention mech-
class of PDEs:
anism is memory and computation intensive, as seen in Section 6, compared to neural operator
architectures developed here (7)-(9). The high computational complexity of transformers is evident
La u = f (37)
is (35) since we must evaluate a nested integral of v for each x ∈ D. Recently, efficient attention
mechanisms have been explored, e.g. long-short Zhu et al. (2021) and adaptive FNO-based attention
where, for every a ∈ A, La : U → F is a, possibly nonlinear, partial differential operator, and u ∈ U
mechanisms (Guibas et al., 2021). However, many of the efficient vision transformer architectures
corresponds to the solution of the PDE (37) when f ∈ F and appropriate boundary conditions are
(Choromanski et al., 2020; Dosovitskiy et al., 2020) like ViTs are not special cases of neural opera-
imposed. The second class will be evolution equations with initial condition a ∈ A and solution
tors since they use CNN layers to generate tokens, which are not discretization invariant.
u(t) ∈ U at every time t > 0. We seek to learn the map from a to u := u(τ ) for some fixed time
τ > 0; we will also study maps on paths (time-dependent solutions).
6. Test Problems
Our goal will be to learn the mappings
A central application of neural operators is learning solution operators defined by parametric
partial differential equations. In this section, we define four test problems for which we numerically
G † : a 7→ u or G † : f 7→ u;
study the approximation properties of neural operators. To that end, let (A, U, F) be a triplet of
Banach spaces. The first two problem classes considered are derived from the following general
we will study both cases, depending on the test problem considered. We will define a probability
class of PDEs:
measure µ on A or F which will serve to define a model for likely input data. Furthermore, measure
µ will define a topology on the space of mappings in which G † lives, using the Bochner norm (3). La u = f (37)
We will assume that each of the spaces (A, U, F) are Banach spaces of functions defined on a where, for every a ∈ A, La : U → F is a, possibly nonlinear, partial differential operator, and u ∈ U
bounded domain D ⊂ Rd . All reported errors will be Monte-Carlo estimates of the relative error corresponds to the solution of the PDE (37) when f ∈ F and appropriate boundary conditions are
imposed. The second class will be evolution equations with initial condition a ∈ A and solution
∥G † (a) − Gθ (a)∥L2 (D)
Ea∼µ u(t) ∈ U at every time t > 0. We seek to learn the map from a to u := u(τ ) for some fixed time
∥G † (a)∥L2 (D)
τ > 0; we will also study maps on paths (time-dependent solutions).
Our goal will be to learn the mappings
or equivalently replacing a with f in the above display and with the assumption that U ⊆ L2 (D).
The domain D will be discretized, usually uniformly, with J ∈ N points. G † : a 7→ u or G † : f 7→ u;
36 36
N EURAL O PERATOR : L EARNING M APS B ETWEEN F UNCTION S PACES W ITH A PPLICATIONS TO PDE S N EURAL O PERATOR : L EARNING M APS B ETWEEN F UNCTION S PACES W ITH A PPLICATIONS TO PDE S
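For concreteness, the following is a minimal sketch of how such a Monte-Carlo estimate of the relative L²(D) error can be computed from discretized test data; the array shapes, the placeholder predictions, and the assumption of a uniform grid (so quadrature weights cancel in the ratio) are illustrative and not part of the paper's released code.

```python
import numpy as np

def relative_l2_error(pred, true):
    """Relative L2(D) error for a batch of discretized functions.

    pred, true: arrays of shape (n_samples, n_grid_points). On a uniform grid the
    quadrature weight is a common factor and cancels between numerator and denominator.
    """
    num = np.linalg.norm(pred - true, axis=1)
    den = np.linalg.norm(true, axis=1)
    return num / den

# Monte-Carlo estimate of E_{a ~ mu} ||G*(a) - G_theta(a)|| / ||G*(a)||,
# where pred would be model outputs and true the reference solutions (placeholders here).
pred = np.random.rand(200, 421)   # stand-in for G_theta(a) on a 421-point grid
true = np.random.rand(200, 421)   # stand-in for G*(a)
print(relative_l2_error(pred, true).mean())
```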
6.1 Poisson Equation

First we consider the one-dimensional Poisson equation with a zero boundary condition. In particular, (37) takes the form

−d²u/dx²(x) = f(x),   x ∈ (0, 1),
u(0) = u(1) = 0,    (38)

for some source function f : (0, 1) → R. In particular, for D(L) := H¹₀((0, 1); R) ∩ H²((0, 1); R), we have L : D(L) → L²((0, 1); R) defined as −d²/dx², noting that L has no dependence on any parameter a ∈ A in this case. We will consider the weak form of (38) with source function f ∈ H⁻¹((0, 1); R) and therefore the solution operator G† : H⁻¹((0, 1); R) → H¹₀((0, 1); R) defined as

G† : f ↦ u.

We define the probability measure µ = N(0, C) where

C = (L + I)⁻²,

defined through the spectral theory of self-adjoint operators. Since µ charges a subset of L²((0, 1); R), we will learn G† : L²((0, 1); R) → H¹₀((0, 1); R) in the topology induced by (3).

In this setting, G† has a closed-form solution given as

G†(f) = ∫₀¹ G(·, y) f(y) dy

where

G(x, y) = ½(x + y − |y − x|) − xy,   ∀(x, y) ∈ [0, 1]²,

is the Green's function. Note that while G† is a linear operator, the Green's function G is non-linear as a function of its arguments. We will consider only a single layer of (6) with σ1 = Id, P = Id, Q = Id, W0 = 0, b0 = 0, and

K0(f) = ∫₀¹ κθ(·, y) f(y) dy,

where κθ : R² → R will be parameterized as a standard neural network with parameters θ.
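As an illustration of this single-layer construction, the sketch below evaluates K0(f) on a uniform grid by replacing the integral with a quadrature sum over the grid points; the small two-layer network for κθ and the grid size are illustrative stand-ins, not the architecture used in the experiments.

```python
import torch

class KernelIntegralOperator1D(torch.nn.Module):
    """u(x) = integral_0^1 kappa_theta(x, y) f(y) dy, discretized on a uniform grid."""

    def __init__(self, width=64):
        super().__init__()
        # kappa_theta: R^2 -> R, a small feed-forward network in (x, y).
        self.kappa = torch.nn.Sequential(
            torch.nn.Linear(2, width), torch.nn.ReLU(), torch.nn.Linear(width, 1)
        )

    def forward(self, f, grid):
        # f: (n,) function values on the grid; grid: (n,) points in [0, 1].
        n = grid.shape[0]
        x = grid.repeat_interleave(n)                 # all pairs (x_i, y_j), row-major
        y = grid.repeat(n)
        k = self.kappa(torch.stack([x, y], dim=-1)).view(n, n)   # kernel matrix k[i, j]
        w = torch.full((n,), 1.0 / (n - 1))           # trapezoidal quadrature weights
        w[0] = w[-1] = 0.5 / (n - 1)
        return k @ (w * f)                            # quadrature sum over y

grid = torch.linspace(0, 1, 85)
f = torch.sin(torch.pi * grid)
u = KernelIntegralOperator1D()(f, grid)
print(u.shape)  # torch.Size([85])
```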
The purpose of the current example is two-fold. First, we will test the efficacy of the neural operator framework in a simple setting where an exact solution is analytically available. Second, we will show that by building in the right inductive bias, in particular paralleling the form of the Green's function solution, we obtain a model that generalizes outside the distribution µ. That is, once trained, the model will generalize to any f ∈ L²((0, 1); R) that may be outside the support of µ. For example, as defined, the random variable f ∼ µ is a continuous function; however, if κθ approximates the Green's function well, then the model Gθ will approximate the solution to (38) accurately even for discontinuous inputs.

To create the dataset used for training, solutions to (38) are obtained by numerical integration using the Green's function on a uniform grid with 85 collocation points. We use N = 1000 training examples.
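The data-generation step can be reproduced in a few lines: sample f from N(0, (L + I)⁻²) by a truncated sine (Dirichlet) eigen-expansion and apply the Green's function by quadrature. This is a sketch under the stated assumptions (85 collocation points; truncation at 128 modes is an illustrative choice), not the authors' script.

```python
import numpy as np

def sample_source(x, n_modes=128, rng=None):
    """Draw f ~ N(0, (L + I)^-2) with L = -d^2/dx^2 and zero Dirichlet boundary conditions.

    Eigenpairs of L on (0, 1): phi_k(x) = sqrt(2) sin(k pi x), lambda_k = (k pi)^2,
    so the Karhunen-Loeve expansion uses standard deviations (lambda_k + 1)^-1.
    """
    rng = rng or np.random.default_rng()
    k = np.arange(1, n_modes + 1)
    coeff = rng.standard_normal(n_modes) / ((k * np.pi) ** 2 + 1.0)
    return (np.sqrt(2.0) * np.sin(np.outer(x, k * np.pi))) @ coeff

def greens_solution(f, x):
    """u = G* f via the Green's function G(x, y) = 0.5(x + y - |y - x|) - x y."""
    X, Y = np.meshgrid(x, x, indexing="ij")
    G = 0.5 * (X + Y - np.abs(Y - X)) - X * Y
    return G @ f / (len(x) - 1)   # uniform-grid quadrature of the integral over y

x = np.linspace(0.0, 1.0, 85)     # 85 collocation points, as in the text
f = sample_source(x)
u = greens_solution(f, x)
print(u[0], u[-1])                # boundary values are (approximately) zero
```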
6.2 Darcy Flow

Next we consider the steady-state Darcy flow equation, the second-order elliptic problem

−∇ · (a(x)∇u(x)) = f(x),   x ∈ D,
u(x) = 0,   x ∈ ∂D,    (39)

where D = (0, 1)² is the unit square. In this setting A = L∞(D; R+), U = H¹₀(D; R), and F = H⁻¹(D; R). We fix f ≡ 1 and consider the weak form of (39) and therefore the solution operator G† : L∞(D; R+) → H¹₀(D; R) defined as

G† : a ↦ u.    (40)

Note that while (39) is a linear PDE, the solution operator G† is nonlinear. We define the probability measure µ = T♯N(0, C) as the pushforward of a Gaussian measure under the operator T, where the covariance of the Gaussian is

C = (−∆ + 9I)⁻²

with D(−∆) defined to impose zero Neumann boundary conditions on the Laplacian. We define T to be the Nemytskii operator acting on functions, induced by the map T : R → R+ given by

T(x) = 12 for x ≥ 0,   T(x) = 3 for x < 0.

The random variable a ∼ µ is a piecewise-constant function with random interfaces given by the underlying Gaussian random field. Such constructions are prototypical models for many physical systems such as permeability in sub-surface flows and (in a vector generalization) material microstructures in elasticity.

To create the dataset used for training, solutions to (39) are obtained using a second-order finite difference scheme on a uniform grid of size 421 × 421. All other resolutions are downsampled from this data set. We use N = 1000 training examples.
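To make the construction of µ concrete, the following is a minimal sketch (not the paper's data-generation code) of how one could sample the piecewise-constant coefficient a = T(ξ) with ξ ∼ N(0, (−∆ + 9I)⁻²) on a uniform grid, using the cosine (Neumann) eigenbasis of the Laplacian on the unit square; the grid size and truncation level are illustrative choices.

```python
import numpy as np

def sample_darcy_coefficient(s=85, n_modes=64, seed=0):
    """Sample a ~ mu = T# N(0, (-Lap + 9 I)^-2) on an s x s grid of (0, 1)^2.

    The Gaussian field is drawn via a truncated Karhunen-Loeve expansion in the cosine
    basis (zero Neumann boundary conditions); T thresholds the field to the values 12 and 3.
    """
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, s)
    X, Y = np.meshgrid(x, x, indexing="ij")

    xi = np.zeros((s, s))
    for k1 in range(n_modes):
        for k2 in range(n_modes):
            if k1 == 0 and k2 == 0:
                continue  # skip the constant mode (mean-zero field)
            lam = (np.pi ** 2) * (k1 ** 2 + k2 ** 2)   # Neumann Laplacian eigenvalue
            std = (lam + 9.0) ** (-1.0)                # sqrt of eigenvalue of (-Lap + 9I)^-2
            norm = (np.sqrt(2.0) if k1 > 0 else 1.0) * (np.sqrt(2.0) if k2 > 0 else 1.0)
            phi = norm * np.cos(np.pi * k1 * X) * np.cos(np.pi * k2 * Y)
            xi += std * rng.standard_normal() * phi

    # Nemytskii map T: 12 where the field is nonnegative, 3 where it is negative.
    return np.where(xi >= 0.0, 12.0, 3.0)

a = sample_darcy_coefficient()
print(a.shape, np.unique(a))
```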
6.3 Burgers' Equation

We consider the one-dimensional viscous Burgers' equation

∂u/∂t(x, t) + (1/2) ∂(u²)/∂x(x, t) = ν ∂²u/∂x²(x, t),   x ∈ (0, 2π), t ∈ (0, ∞),
u(x, 0) = u0(x),   x ∈ (0, 2π),    (41)

with periodic boundary conditions and a fixed viscosity ν = 10⁻¹. Let Ψ : L²_per((0, 2π); R) × R+ → H^s_per((0, 2π); R), for any s > 0, be the flow map associated to (41); in particular,

Ψ(u0, t) = u(·, t),   t > 0.

We consider the solution operator defined by evaluating Ψ at a fixed time. Fix any s ≥ 0. Then we may define G† : L²_per((0, 2π); R) → H^s_per((0, 2π); R) by

G† : u0 ↦ Ψ(u0, τ)    (42)

for the fixed evaluation time τ > 0.
6.4 Navier-Stokes Equation

We consider the two-dimensional Navier-Stokes equation for a viscous, incompressible fluid,

∂t u(x, t) + u(x, t) · ∇u(x, t) + ∇p(x, t) = ν∆u(x, t) + f(x),   x ∈ T², t ∈ (0, ∞),
∇ · u(x, t) = 0.

Equivalently, we study the vorticity-streamfunction formulation of the equation,

∂t w(x, t) + ∇⊥ψ · ∇w(x, t) = ν∆w(x, t) + g(x),   x ∈ T², t ∈ (0, ∞),    (44a)
−∆ψ = w,   x ∈ T², t ∈ (0, ∞),    (44b)

with the fixed forcing

g(x1, x2) = 0.1(sin(2π(x1 + x2)) + cos(2π(x1 + x2))),   ∀(x1, x2) ∈ T².

The corresponding Reynolds number is estimated as Re = √0.1 / (ν(2π)^(3/2)) (Chandler and Kerswell, 2013). Let Ψ : L²(T²; R) × R+ → H^s(T²; R), for any s > 0, be the flow map associated to (44); in particular,

Ψ(w0, t) = w(·, t),   t > 0.

Notice that this is well-defined for any w0 ∈ L²(T²; R).

We will define two notions of the solution operator. In the first, we will proceed as in the previous examples; in particular, G† : L²(T²; R) → H^s(T²; R) is defined as

G† : w0 ↦ Ψ(w0, T)    (45)

for some fixed T > 0. In the second, we will map an initial part of the trajectory to a later part of the trajectory. In particular, we define G† : L²(T²; R) × C((0, 10]; H^s(T²; R)) → C((10, T]; H^s(T²; R)) by

G† : (w0, Ψ(w0, t)|_{t∈(0,10]}) ↦ Ψ(w0, t)|_{t∈(10,T]}    (46)

for some fixed T > 10. We define the probability measure µ = N(0, C) where

C = 7^(3/2)(−∆ + 49I)^(−2.5)

with periodic boundary conditions on the Laplacian. We model the initial vorticity to (44) as w0 ∼ µ, noting that µ charges a subset of L²(T²; R). Its pushforward onto Ψ(w0, t)|_{t∈(0,10]} is required to define the measure on input space in the second case, defined by (46).
To create the dataset used for training, solutions to (44) are obtained using a pseudo-spectral split-step method where the viscous terms are advanced using a Crank–Nicolson update and the nonlinear and forcing terms are advanced using Heun's method. Dealiasing is used with the 2/3 rule. For further details on this approach see (Chandler and Kerswell, 2013). Data is obtained on a uniform 256 × 256 grid and all other resolutions are subsampled from this data set. We experiment with different viscosities ν, final times T, and amounts of training data N.
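The following is a minimal sketch of one time step of this kind of pseudo-spectral split-step update (Crank–Nicolson for the viscous term, Heun for the nonlinear and forcing terms, 2/3-rule dealiasing); the grid size, time step, and placeholder initial vorticity are illustrative choices, and details of the splitting differ from the solver actually used to generate the data.

```python
import numpy as np

def make_operators(n):
    """Angular wavenumbers and 2/3-rule dealiasing mask for an n x n grid on the unit torus."""
    k_full = 2.0 * np.pi * np.fft.fftfreq(n, d=1.0 / n)
    k_half = 2.0 * np.pi * np.fft.rfftfreq(n, d=1.0 / n)
    K1, K2 = np.meshgrid(k_full, k_half, indexing="ij")     # shapes (n, n//2 + 1)
    Ksq = K1 ** 2 + K2 ** 2
    cutoff = (2.0 / 3.0) * np.pi * n                          # 2/3 of the Nyquist wavenumber
    dealias = (np.abs(K1) < cutoff) & (np.abs(K2) < cutoff)
    return K1, K2, Ksq, dealias

def step(w_hat, nu, dt, K1, K2, Ksq, dealias, g_hat):
    """One split step: Crank-Nicolson for viscosity, Heun for nonlinearity + forcing."""
    def nonlinear(wh):
        wh = wh * dealias
        Ksq_safe = np.where(Ksq == 0.0, 1.0, Ksq)
        psi_hat = wh / Ksq_safe                               # -Lap(psi) = w
        u = np.fft.irfft2(1j * K2 * psi_hat)                  # u = d(psi)/dy
        v = np.fft.irfft2(-1j * K1 * psi_hat)                 # v = -d(psi)/dx
        wx, wy = np.fft.irfft2(1j * K1 * wh), np.fft.irfft2(1j * K2 * wh)
        return -np.fft.rfft2(u * wx + v * wy) + g_hat
    a, b = 1.0 - 0.5 * dt * nu * Ksq, 1.0 + 0.5 * dt * nu * Ksq
    n0 = nonlinear(w_hat)
    w_pred = (a * w_hat + dt * n0) / b                        # Heun predictor
    return (a * w_hat + 0.5 * dt * (nonlinear(w_pred) + n0)) / b   # Heun corrector

n, nu, dt = 256, 1e-3, 1e-3
x = np.linspace(0.0, 1.0, n, endpoint=False)
X1, X2 = np.meshgrid(x, x, indexing="ij")
g_hat = np.fft.rfft2(0.1 * (np.sin(2 * np.pi * (X1 + X2)) + np.cos(2 * np.pi * (X1 + X2))))
K1, K2, Ksq, dealias = make_operators(n)
w_hat = np.fft.rfft2(np.sin(2 * np.pi * X1) * np.cos(2 * np.pi * X2))  # placeholder initial vorticity
for _ in range(10):
    w_hat = step(w_hat, nu, dt, K1, K2, Ksq, dealias, g_hat)
```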
6.4.1 Bayesian Inverse Problem

As an application of operator learning, we consider the inverse problem of recovering the initial vorticity in the Navier-Stokes equation (44) from partial, noisy observations of the vorticity at a later time. Consider the first solution operator defined in subsection 6.4, in particular, G† : L²(T²; R) → H^s(T²; R) defined as

G† : w0 ↦ Ψ(w0, 50)

where Ψ is the flow map associated to (44). We then consider the inverse problem

y = O(G†(w0)) + η    (47)

of recovering w0 ∈ L²(T²; R), where O : H^s(T²; R) → R^49 is the evaluation operator on a uniform 7 × 7 interior grid, and η ∼ N(0, Γ) is observational noise with covariance Γ = (1/γ²)I and γ = 0.1. We view (47) as the Bayesian inverse problem mapping the prior measure µ on w0 to the posterior measure π^y on w0|y. In particular, π^y has density with respect to µ given by the Radon-Nikodym derivative

dπ^y/dµ (w0) ∝ exp( −(1/2) ‖y − O(G†(w0))‖²_Γ ),

where ‖·‖_Γ = ‖Γ^(−1/2) ·‖ and ‖·‖ is the Euclidean norm in R^49. For further details on Bayesian inversion for functions see (Cotter et al., 2009; Stuart, 2010), and see (Cotter et al., 2013) for MCMC methods adapted to the function-space setting.

We solve (47) by computing the posterior mean E_{w0∼π^y}[w0] using the pre-conditioned Crank–Nicolson (pCN) MCMC method described in Cotter et al. (2013) for this task. We employ pCN in two cases: (i) using G† evaluated with the pseudo-spectral method described in section 6.4; and (ii) using Gθ, the neural operator approximating G†. After a 5,000 sample burn-in period, we generate 25,000 samples from the posterior using both approaches and use them to compute the posterior mean.
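For reference, a minimal sketch of the pCN proposal-accept loop (Cotter et al., 2013) applied to a posterior of this form is given below; the functions `forward`, `observe`, and `sample_prior`, the step size `beta`, and the noise weighting are placeholders, and in practice `forward` would be either the pseudo-spectral solver or the trained neural operator surrogate.

```python
import numpy as np

def log_likelihood(w0, y, forward, observe, gamma_inv_sqrt):
    """-0.5 || Gamma^{-1/2} (y - O(G(w0))) ||^2, as in the Radon-Nikodym density above."""
    misfit = gamma_inv_sqrt * (y - observe(forward(w0)))
    return -0.5 * float(np.sum(misfit ** 2))

def pcn(y, forward, observe, sample_prior, n_samples=25000, burn_in=5000,
        beta=0.1, gamma_inv_sqrt=1.0, rng=None):
    """Pre-conditioned Crank-Nicolson MCMC for a Gaussian prior mu = N(0, C).

    Proposal: w' = sqrt(1 - beta^2) * w + beta * xi with xi ~ N(0, C); accepted with
    probability min(1, exp(loglik(w') - loglik(w))). The prior is preserved exactly.
    """
    rng = rng or np.random.default_rng()
    w = sample_prior()
    ll = log_likelihood(w, y, forward, observe, gamma_inv_sqrt)
    mean, kept = np.zeros_like(w), 0
    for i in range(n_samples + burn_in):
        prop = np.sqrt(1.0 - beta ** 2) * w + beta * sample_prior()
        ll_prop = log_likelihood(prop, y, forward, observe, gamma_inv_sqrt)
        if np.log(rng.uniform()) < ll_prop - ll:
            w, ll = prop, ll_prop
        if i >= burn_in:
            mean += w
            kept += 1
    return mean / kept   # Monte-Carlo estimate of the posterior mean E[w0 | y]
```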
6.4.2 Spectra

Because of the constant-in-time forcing term, the energy reaches a non-zero equilibrium in time which is statistically reproducible for different initial conditions. To compare the complexity of the solution to the Navier-Stokes problem outlined in subsection 6.4, we show, in Figure 6, the Fourier spectrum of the solution data at time t = 50 for three different choices of the viscosity ν. The figure demonstrates that, for a wide range of wavenumbers k, which grows as ν decreases, the rate of decay of the spectrum is −5/3, matching what is expected in the turbulent regime (Kraichnan, 1967). This is a statistically stationary property of the equation, sustained for all positive times.

Figure 6: Spectral decay of Navier-Stokes. The spectral decay of the Navier-Stokes equation data. The y-axis represents the value of each mode; the x-axis is the wavenumber |k| = k1 + k2. From left to right, the solutions have viscosity ν = 10⁻³, 10⁻⁴, 10⁻⁵ respectively.
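Spectra such as those in Figure 6 can be reproduced from solution snapshots with a short post-processing routine that bins the Fourier energy by the wavenumber |k| = k1 + k2 used on the x-axis; the sketch below is illustrative, with the binning convention taken from the caption and a random field standing in for an actual vorticity snapshot.

```python
import numpy as np

def spectrum(w, n_bins=None):
    """Average squared Fourier magnitude of a 2-d field, binned by |k| = k1 + k2."""
    n = w.shape[0]
    w_hat = np.fft.fft2(w) / n ** 2
    k = np.abs(np.fft.fftfreq(n, d=1.0 / n)).astype(int)   # nonnegative integer wavenumbers
    K1, K2 = np.meshgrid(k, k, indexing="ij")
    kk = K1 + K2                                            # |k| = k1 + k2, as in Figure 6
    n_bins = n_bins or kk.max() + 1
    energy, counts = np.zeros(n_bins), np.zeros(n_bins)
    np.add.at(energy, kk.ravel(), np.abs(w_hat.ravel()) ** 2)
    np.add.at(counts, kk.ravel(), 1.0)
    return energy / np.maximum(counts, 1.0)

w = np.random.randn(64, 64)        # placeholder for a vorticity snapshot at t = 50
print(spectrum(w)[:5])
```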
6.5 Choice of Loss Criteria

In general, the model has the best performance when trained and tested using the same loss criteria. If one trains the model using one norm and tests with another norm, the model may overfit in the training norm. Furthermore, the choice of loss function plays a key role. In this work, we use the relative L² error to measure the performance in all our problems. Both the L² error and its square, the mean squared error (MSE), are common choices of testing criteria in the numerical analysis and machine learning literature. We observed that using the relative error to train the model has a good normalization and regularization effect that prevents overfitting. In practice, training with the relative L² loss results in around half the testing error compared to training with the MSE loss.
7. Numerical Results

In this section, we compare the proposed neural operator with other supervised learning approaches, using the four test problems outlined in Section 6. In Subsection 7.1 we study the Poisson equation and learning a Green's function; Subsection 7.2 considers the coefficient-to-solution map for steady Darcy flow and the initial condition to solution at positive time map for the Burgers equation. In Subsection 7.3 we study the Navier-Stokes equation.

We compare with a variety of architectures found by discretizing the data and applying finite-dimensional approaches, as well as with other operator-based approximation methods; further detailed comparison of other operator-based approximation methods may be found in De Hoop et al. (2022), where the issue of error versus cost (with cost defined in various ways, such as evaluation time of the network or the amount of data required) is studied. We do not compare against traditional solvers (FEM/FDM/Spectral), although our methods, once trained, enable evaluation of the input-to-output map orders of magnitude more quickly than by use of such traditional solvers on complex problems. We demonstrate the benefits of this speed-up in a prototypical application, Bayesian inversion, in Subsubsection 7.3.4.

All the computations are carried out on a single Nvidia V100 GPU with 16GB memory. The code is available at https://github.com/zongyi-li/graph-pde and https://github.com/zongyi-li/fourier_neural_operator.
Setup of the Four Methods: We construct the neural operator by stacking four integral operator layers as specified in (6) with the ReLU activation. No batch normalization is needed. Unless otherwise specified, we use N = 1000 training instances and 200 testing instances. We use the Adam optimizer to train for 500 epochs with an initial learning rate of 0.001 that is halved every 100 epochs. We set the channel dimensions dv0 = · · · = dv3 = 64 for all one-dimensional problems and dv0 = · · · = dv3 = 32 for all two-dimensional problems. The kernel networks κ(0), . . . , κ(3) are standard feed-forward neural networks with three layers and widths of 256 units. A minimal sketch of this training configuration is given after the list below. We use the following abbreviations to denote the methods introduced in Section 4.

• GNO: The method introduced in subsection 4.1, truncating the integral to a ball with radius r = 0.25 and using the Nyström approximation with J′ = 300 sub-sampled nodes.

• LNO: The low-rank method introduced in subsection 4.2 with rank r = 4.

• MGNO: The multipole method introduced in subsection 4.3. On the Darcy flow problem, we use the random construction with three graph levels, sampling J1 = 400, J2 = 100, and J3 = 25 nodes respectively. On the Burgers' equation problem, we use the orthogonal construction without sampling.

• FNO: The Fourier method introduced in subsection 4.4. We set kmax,j = 16 for all one-dimensional problems and kmax,j = 12 for all two-dimensional problems.
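The optimization setup described above corresponds, in PyTorch terms, to something like the following loop; the model, data loader, and per-sample relative L² loss are placeholders standing in for the released implementations at the repositories above.

```python
import torch

def relative_l2_loss(pred, true):
    # Relative L2 loss, computed per sample and averaged over the batch.
    diff = torch.norm(pred.flatten(1) - true.flatten(1), dim=1)
    return (diff / torch.norm(true.flatten(1), dim=1)).mean()

def train(model, loader, epochs=500, lr=1e-3, device="cpu"):
    """Adam for 500 epochs, initial learning rate 1e-3 halved every 100 epochs."""
    model = model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=100, gamma=0.5)
    for _ in range(epochs):
        for a, u in loader:                   # (input function, target function) batches
            a, u = a.to(device), u.to(device)
            opt.zero_grad()
            loss = relative_l2_loss(model(a), u)
            loss.backward()
            opt.step()
        sched.step()
    return model
```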
Remark on the Resolution. Traditional PDE solvers such as FEM and FDM approximate a single function and therefore their error relative to the continuum decreases as the resolution is increased. The figures we show here exhibit something different: the error is independent of resolution, once enough resolution is used, but it is not zero. This reflects the fact that there is a residual approximation error, in the infinite-dimensional limit, from the use of a finitely-parametrized neural operator trained on a finite amount of data. Invariance of the error with respect to (sufficiently fine) resolution is a desirable property that demonstrates that an intrinsic approximation of the operator has been learned, independent of any specific discretization; see Figure 8. Furthermore, resolution-invariant operators can do zero-shot super-resolution, as shown in Subsubsection 7.3.1.
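One way to see why the error can be resolution-invariant is that the learned parameters live on a fixed number of Fourier modes rather than on a grid. The sketch below applies the same set of learned complex weights (a single channel with kmax = 12, an illustrative simplification of the FNO layer, not the released implementation) to inputs at two different resolutions.

```python
import torch

class SpectralConv2d(torch.nn.Module):
    """Single-channel Fourier layer: keep k_max low-frequency modes, multiply by learned weights."""

    def __init__(self, k_max=12):
        super().__init__()
        self.k_max = k_max
        # Learned complex weights on the retained modes only: the parameter count
        # is independent of the input resolution.
        self.weight = torch.nn.Parameter(0.02 * torch.randn(2 * k_max, k_max, dtype=torch.cfloat))

    def forward(self, v):
        s = v.shape[-1]
        v_hat = torch.fft.rfft2(v)
        out = torch.zeros_like(v_hat)
        k = self.k_max
        out[..., :k, :k] = v_hat[..., :k, :k] * self.weight[:k, :k]     # positive frequencies
        out[..., -k:, :k] = v_hat[..., -k:, :k] * self.weight[k:, :k]   # negative frequencies
        return torch.fft.irfft2(out, s=(s, s))

layer = SpectralConv2d()
coarse = layer(torch.randn(64, 64))     # same parameters ...
fine = layer(torch.randn(256, 256))     # ... evaluated at a finer discretization
print(coarse.shape, fine.shape)
```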
7.1 Poisson Equation

Recall the Poisson equation (38) introduced in subsection 6.1. We use a zero-hidden-layer neural operator construction without lifting the input dimension. In particular, we simply learn a kernel κθ : R² → R parameterized as a standard feed-forward neural network with parameters θ. Using only N = 1000 training examples, we obtain a relative test error of 10⁻⁷. The neural operator gives an almost perfect approximation to the true solution operator in the topology of (3). To examine the quality of the approximation in the much stronger uniform topology, we check whether the kernel κθ approximates the Green's function for this problem. To see why this is enough, let K ⊂ L²([0, 1]; R) be a bounded set, i.e. sup_{f∈K} ‖f‖_{L²} ≤ M for some M > 0. By the Cauchy-Schwarz inequality, for every f ∈ K and every x ∈ [0, 1],

|(G†(f))(x) − (Gθ(f))(x)| ≤ M ‖G(x, ·) − κθ(x, ·)‖_{L²((0,1))},

so controlling the error in the kernel controls the error in the solution uniformly over K; in particular, we obtain an approximation in the topology of uniform convergence over bounded sets, while having trained only in the topology of the Bochner norm (3). Figure 7 shows the results, from which we can see that κθ does indeed approximate the Green's function well. This result implies that by constructing a suitable architecture, we can generalize to the entire space and data that is well outside the support of the training set.

Figure 7: Kernel for the one-dimensional Green's function, with the Nyström approximation method. Left: learned kernel function; right: the analytic Green's function. This is a proof of concept of the graph kernel network on the 1-dimensional Poisson equation and a comparison of the learned and true kernel.

7.2 Darcy and Burgers Equations

In the following section, we compare the four methods presented in this paper with different operator approximation benchmarks; we study the Darcy flow problem introduced in Subsection 6.2 and the Burgers' equation problem introduced in Subsection 6.3. The solution operators of interest are defined by (40) and (42). We use the following abbreviations for the methods against which we benchmark.
• NN is a standard point-wise feedforward neural network. It is mesh-free, but performs badly due to lack of neighbor information. We use standard fully connected neural networks with 8 layers and width 1000.

• FCN is the state-of-the-art neural network method based on Fully Convolution Networks (Zhu and Zabaras, 2018). It has a dominating performance for small grids s = 61. But fully convolutional networks are mesh-dependent and therefore their error grows when moving to a larger grid.

• PCANN is the PCA-based neural network method of Bhattacharya et al. (2020), which maps between PCA latent representations of the input and output data with a standard fully connected neural network with width 200. The method provably obtains mesh-independent error and can learn purely from data; however, the solution can only be evaluated on the same mesh as the training data.

• RBM is the classical Reduced Basis Method (using a PCA basis), which is widely used in applications and provably obtains mesh-independent error (DeVore, 2014). This method has good performance, but the solutions can only be evaluated on the same mesh as the training data and one needs knowledge of the PDE to employ it.

• DeepONet is the Deep Operator network (Lu et al., 2019) that comes equipped with an approximation theory (Lanthaler et al., 2021). We use the unstacked version with width 200 which is precisely defined in the original work (Lu et al., 2019). We use standard fully connected neural networks with 8 layers and width 200.

Figure 8: Benchmark on Burgers' equation and Darcy Flow. (a) Benchmarks on Burgers' equation; (b) benchmarks on Darcy Flow for different resolutions. Train and test on the same resolution. For acronyms, see Section 7; details in Tables 3 and 2.
7.2.1 Darcy Flow

The results of the experiments on Darcy flow are shown in Figure 8 and Table 2. All the methods, except for FCN, achieve invariance of the error with respect to the resolution s. In the experiment, we tune each model across a range of different widths and depths to obtain the choices used here; for DeepONet, for example, this leads to 8 layers and width 200 as reported above.

Within our hyperparameter search, the Fourier neural operator (FNO) obtains the lowest relative error. The Fourier-based method likely sees this advantage because the output functions are smooth in these test problems. We also note that it is possible to obtain better results on each model using modified architectures and problem-specific feature engineering. For example, for DeepONet, using a CNN on the branch net and PCA on the trunk net (the latter being similar to the method used in Bhattacharya et al. (2020)) can achieve 0.0232 relative L² error, as shown in Lu et al. (2021b), about half the size of the error we obtain here, but for a very coarse grid with s = 29. In the experiments, the different approximation architectures are such that their training costs are similar across all the methods considered, for a given s. Noting this, and for example comparing the graph-based neural operator methods such as GNO and MGNO that use Nyström sampling in physical space with FNO, we see that FNO is more accurate.
Networks     s = 85    s = 141   s = 211   s = 421
NN           0.1716    0.1716    0.1716    0.1716
FCN          0.0253    0.0493    0.0727    0.1097
PCANN        0.0299    0.0298    0.0298    0.0299
RBM          0.0244    0.0251    0.0255    0.0259
DeepONet     0.0476    0.0479    0.0462    0.0487
GNO          0.0346    0.0332    0.0342    0.0369
LNO          0.0520    0.0461    0.0445    −
MGNO         0.0416    0.0428    0.0428    0.0420
FNO          0.0108    0.0109    0.0109    0.0098

Table 2: Relative error on 2-d Darcy Flow for different resolutions s.
7.2.2 Burgers' Equation

The results of the experiments on Burgers' equation are shown in Figure 8 and Table 3. As for the Darcy problem, our instantiation of the Fourier neural operator obtains nearly one order of magnitude lower relative error compared to any of the benchmarks. The Fourier neural operator has standard deviation 0.0010 and mean training error 0.0012. If one replaces the ReLU activation by GeLU, the test error of the FNO is further reduced from 0.0018 to 0.0007. We again observe the invariance of the error with respect to the resolution. It is possible to improve the performance of each model using modified architectures and problem-specific feature engineering. Similarly, the PCA-enhanced DeepONet with a proper scaling can achieve 0.0194 relative L² error, as shown in Lu et al. (2021b), on a grid of resolution s = 128.
Networks     s = 256   s = 512   s = 1024  s = 2048  s = 4096  s = 8192
NN           0.4714    0.4561    0.4803    0.4645    0.4779    0.4452
GCN          0.3999    0.4138    0.4176    0.4157    0.4191    0.4198
FCN          0.0958    0.1407    0.1877    0.2313    0.2855    0.3238
PCANN        0.0398    0.0395    0.0391    0.0383    0.0392    0.0393
DeepONet     0.0569    0.0617    0.0685    0.0702    0.0833    0.0857
GNO          0.0555    0.0594    0.0651    0.0663    0.0666    0.0699
LNO          0.0212    0.0221    0.0217    0.0219    0.0200    0.0189
MGNO         0.0243    0.0355    0.0374    0.0360    0.0364    0.0364
FNO          0.0018    0.0018    0.0018    0.0019    0.0020    0.0019

Table 3: Relative errors on 1-d Burgers' equation for different resolutions s.

7.2.3 Zero-Shot Super-Resolution

The neural operator is mesh-invariant, so it can be trained on a lower resolution and evaluated at a higher resolution, without seeing any higher-resolution data (zero-shot super-resolution). Figure 9 shows an example of the Darcy equation where we train the GNO model on 16 × 16 resolution data in the setting above and transfer to 256 × 256 resolution, demonstrating super-resolution in space.

Figure 9: Darcy, trained on 16 × 16, tested on 241 × 241. Graph kernel network for the solution of (6.2). It can be trained on a small resolution and will generalize to a large one. The error is point-wise absolute squared error.
7.3 Navier-Stokes Equation

In the following section, we compare our four methods with different benchmarks on the Navier-Stokes equation introduced in subsection 6.4. The operator of interest is given by (46). We use the following abbreviations for the methods against which we benchmark.

• U-Net: A popular choice for image-to-image regression tasks, consisting of four blocks with 2-d convolutions and deconvolutions (Ronneberger et al., 2015).

• TF-Net: A network designed for learning turbulent flows based on a combination of spatial and temporal convolutions (Wang et al., 2020).

• FNO-2d: 2-d Fourier neural operator with an auto-regressive structure in time. We use the Fourier neural operator to model the local evolution from the previous 10 time steps to the next one time step, and iteratively apply the model to get the long-term trajectory. We set kmax,j = 12, dv = 32.

• FNO-3d: 3-d Fourier neural operator that directly convolves in space-time. We use the Fourier neural operator to model the global evolution from the initial 10 time steps directly to the long-term trajectory. We set kmax,j = 12, dv = 20.

Figure 10: Benchmark on the Navier-Stokes equation. The learning curves on Navier-Stokes ν = 10⁻³ with different benchmarks. Train and test on the same resolution. For acronyms, see Section 7; details in Table 4.
Config     Parameters    Time per epoch   ν = 10⁻³, T = 50, N = 1000   ν = 10⁻⁴, T = 30, N = 1000   ν = 10⁻⁴, T = 30, N = 10000   ν = 10⁻⁵, T = 20, N = 1000
FNO-3D     6,558,537     38.99s           0.0086                       0.1918                       0.0820                        0.1893
FNO-2D     414,517       127.80s          0.0128                       0.1559                       0.0834                        0.1556
U-Net      24,950,491    48.67s           0.0245                       0.2051                       0.1190                        0.1982
TF-Net     7,451,724     47.21s           0.0225                       0.2253                       0.1168                        0.2268
ResNet     266,641       78.47s           0.0701                       0.2871                       0.2311                        0.2753

Table 4: Benchmarks on Navier-Stokes (fixing resolution 64 × 64 for both training and testing).

As shown in Table 4, the FNO-3D has the best performance when there is sufficient data (ν = 10⁻³, N = 1000 and ν = 10⁻⁴, N = 10000). For the configurations where the amount of data is insufficient (ν = 10⁻⁴, N = 1000 and ν = 10⁻⁵, N = 1000), all methods have > 15% error, with FNO-2D achieving the lowest error within our hyperparameter search. Note that we only present results for spatial resolution 64 × 64 since all benchmarks we compare against are designed for this resolution. Increasing the spatial resolution degrades their performance while FNO achieves the same errors.
Auto-regressive (2D) and Temporal Convolution (3D). We investigate two standard formulations to model the time evolution: the auto-regressive model (2D) and the temporal convolution model (3D). Auto-regressive models: FNO-2D, U-Net, TF-Net, and ResNet all do 2D convolution in the spatial domain and recurrently propagate in the time domain (2D+RNN). The operator maps the solution at previous time steps to the next time step (2D functions to 2D functions). Temporal convolution models: on the other hand, FNO-3D performs convolution in space-time; it approximates the integral in time by a convolution. FNO-3D maps the initial time interval directly to the full trajectory (3D functions to 3D functions). The 2D+RNN structure can propagate the solution to any arbitrary time T in increments of a fixed interval length ∆t, while the Conv3D structure is fixed to the interval [0, T] but can transfer the solution to an arbitrary time-discretization. We find that the 2D method works better for short time sequences, while the 3D method is more expressive and easier to train on longer sequences.
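The 2D+RNN formulation can be summarized in a few lines: the model consumes the previous 10 time steps and predicts the next one, and the prediction is fed back in to extend the trajectory. The sketch below assumes a `model` mapping a (10, s, s) history to an (s, s) field; it is schematic, not the benchmarked implementation.

```python
import torch

def rollout(model, history, n_future):
    """Auto-regressively extend a trajectory.

    history: tensor of shape (10, s, s) holding the known initial time steps.
    Returns a tensor of shape (n_future, s, s) with the predicted continuation.
    """
    window = history.clone()
    predictions = []
    with torch.no_grad():
        for _ in range(n_future):
            nxt = model(window.unsqueeze(0)).squeeze(0)   # (s, s) next-step prediction
            predictions.append(nxt)
            # Slide the 10-step window forward by one step.
            window = torch.cat([window[1:], nxt.unsqueeze(0)], dim=0)
    return torch.stack(predictions)

# Example with a placeholder "model" that simply averages the window.
model = lambda w: w.mean(dim=1)
traj = rollout(model, torch.randn(10, 64, 64), n_future=40)
print(traj.shape)  # torch.Size([40, 64, 64])
```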
Networks    s = 64    s = 128   s = 256
FNO-3D      0.0098    0.0101    0.0106
FNO-2D      0.0129    0.0128    0.0126
U-Net       0.0253    0.0289    0.0344
TF-Net      0.0277    0.0278    0.0301

Table 5: Resolution study on the Navier-Stokes equation (ν = 10⁻³, N = 200, T = 20).

7.3.1 Zero-Shot Super-Resolution

The neural operator is mesh-invariant, so it can be trained on a lower resolution and evaluated at a higher resolution, without seeing any higher-resolution data (zero-shot super-resolution). Figure 11 shows an example where we train the FNO-3D model on 64 × 64 × 20 resolution data in the setting above with (ν = 10⁻⁴, N = 10000) and transfer to 256 × 256 × 80 resolution, demonstrating super-resolution in space-time. The Fourier neural operator is the only model among the benchmarks (FNO-2D, U-Net, TF-Net, and ResNet) that can do zero-shot super-resolution; the method works well not only on the spatial but also on the temporal domain.
Figure 11: Zero-shot super-resolution. Vorticity field of the solution to the two-dimensional Navier-Stokes equation with viscosity ν = 10⁻⁴ (Re ≈ 200); ground truth on top and prediction on bottom. The model is trained on data that is discretized on a uniform 64 × 64 spatial grid and on a 20-point uniform temporal grid. The model is evaluated with a different initial condition that is discretized on a uniform 256 × 256 spatial grid and an 80-point uniform temporal grid.
7.3.2 Spectral Analysis

Figure 12 shows that all the methods are able to capture the spectral decay of the Navier-Stokes equation. Notice that, while the Fourier method truncates the higher frequency modes during the convolution, FNO can still recover the higher frequency components in the final prediction. Due to the way we parameterize Rϕ, the function output by (26) has at most kmax,j Fourier modes per channel. This, however, does not mean that the Fourier neural operator can only approximate functions up to kmax,j modes. Indeed, the activation functions which occur between integral operators and the final decoder network Q recover the high frequency modes. As an example, consider a solution to the Navier-Stokes equation with viscosity ν = 10⁻³. Truncating this function at 20 Fourier modes yields an error of around 2%, as shown in Figure 13, while the Fourier neural operator learns the parametric dependence and produces approximations to an error of ≤ 1% with only kmax,j = 12 parameterized modes.
Figure 12: The spectral decay of the predictions of different methods. The spectral decay of the predictions of different models on the Navier-Stokes equation. The y-axis is the spectrum; the x-axis is the wavenumber. Left is the spectrum of one trajectory; right is the average of 40 trajectories.

Figure 13: Spectral decay in terms of kmax. The error of truncation in one single Fourier layer without applying the linear transform R. The y-axis is the normalized truncation error; the x-axis is the truncation mode kmax.

7.3.3 Non-Periodic Boundary Conditions

Traditional Fourier methods work only with periodic boundary conditions. However, the Fourier neural operator does not have this limitation. This is due to the linear transform W (the bias term), which keeps track of the non-periodic boundary. As an example, the Darcy flow problem and the time domain of the Navier-Stokes problem have non-periodic boundary conditions, and the Fourier neural operator still learns the solution operator with excellent accuracy.
7.3.4 BAYESIAN I NVERSE P ROBLEM
to the way we parameterize Rϕ , the function output by (26) has at most kmax,j Fourier modes per
As discussed in Section 6.4.1, we use the pCN method of Cotter et al. (2013) to draw sam- channel. This, however, does not mean that the Fourier neural operator can only approximate func-
ples from the posterior distribution of initial vorticities in the Navier-Stokes equation given sparse, tions up to kmax,j modes. Indeed, the activation functions which occurs between integral operators
noisy observations at time T = 50. We compare the Fourier neural operator acting as a surro- and the final decoder network Q recover the high frequency modes. As an example, consider a solu-
gate model with the traditional solvers used to generate our train-test data (both run on GPU). We tion to the Navier-Stokes equation with viscosity ν = 10−3 . Truncating this function at 20 Fourier
52 52
N EURAL O PERATOR : L EARNING M APS B ETWEEN F UNCTION S PACES W ITH A PPLICATIONS TO PDE S N EURAL O PERATOR : L EARNING M APS B ETWEEN F UNCTION S PACES W ITH A PPLICATIONS TO PDE S
generate 25,000 samples from the posterior (with a 5,000 sample burn-in period), requiring 30,000 modes yields an error around 2% as shown in Figure 13, while the Fourier neural operator learns
evaluations of the forward operator. the parametric dependence and produces approximations to an error of ≤ 1% with only kmax,j = 12
parameterized modes.
As shown in Figure 14, FNO and the traditional solver recover almost the same posterior mean
which, when pushed forward, recovers well the later-time solution of the Navier-Stokes equation.
In sharp contrast, FNO takes 0.005s to evaluate a single instance while the traditional solver, after
being optimized to use the largest possible internal time-step which does not lead to blow-up, takes 2.2s. This amounts to 2.5 minutes for the MCMC using FNO and over 18 hours for the traditional solver. Even if we account for data generation and training time (offline steps), which take 12 hours, using FNO is still faster. Once trained, FNO can be used to quickly perform multiple MCMC runs for different initial conditions and observations, while the traditional solver will take 18 hours for every instance. Furthermore, since FNO is differentiable, it can easily be applied to PDE-constrained optimization problems in which adjoint calculations are used as part of the solution procedure.
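For concreteness, the sketch below shows the structure of a pCN sampler with a learned surrogate inside the likelihood. It is a schematic under stated assumptions, not the authors' experimental code: `surrogate` stands for any trained forward map (for example an FNO) from an initial vorticity to the predicted observations, `sample_prior` draws from the Gaussian prior on the initial condition, and `y_obs` and `sigma` are the noisy observations and noise level; all of these names are placeholders.

```python
import numpy as np

def pcn_sample(surrogate, sample_prior, y_obs, sigma,
               n_steps=30_000, beta=0.1, thin=10, rng=None):
    """Preconditioned Crank-Nicolson MCMC (Cotter et al., 2013) with a surrogate forward map.

    surrogate(u) maps a candidate initial condition u to predicted observations;
    sample_prior() draws a sample from the Gaussian prior N(0, C).
    """
    rng = rng if rng is not None else np.random.default_rng()
    log_lik = lambda u: -0.5 * np.sum((surrogate(u) - y_obs) ** 2) / sigma**2

    u = sample_prior()
    ll = log_lik(u)
    samples = []
    for step in range(n_steps):
        # The pCN proposal preserves the Gaussian prior exactly.
        u_prop = np.sqrt(1.0 - beta**2) * u + beta * sample_prior()
        ll_prop = log_lik(u_prop)
        # The acceptance probability then depends only on the likelihood ratio.
        if np.log(rng.uniform()) < ll_prop - ll:
            u, ll = u_prop, ll_prop
        if step % thin == 0:
            samples.append(u.copy())
    return samples
```

Swapping the numerical Navier-Stokes solver for the trained FNO inside `surrogate` is what reduces the 30,000 forward evaluations from roughly 18 hours to a few minutes in the experiment above.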
Figure 14: Results of the Bayesian inverse problem for the Navier-Stokes equation. The top left panel shows the true initial vorticity while the bottom left panel shows the true observed vorticity at T = 50, with black dots indicating the locations of the observation points placed on a 7 × 7 grid. The top middle panel shows the posterior mean of the initial vorticity given the noisy observations estimated with MCMC using the traditional solver, while the top right panel shows the same quantity using FNO as a surrogate model. The bottom middle and right panels show the vorticity at T = 50 when the respective approximate posterior means are used as initial conditions.

7.4 Discussion and Comparison of the Four Methods

In this section we compare the four methods in terms of expressiveness, complexity, refinability, and ingenuity.
7.4.1 Ingenuity
First we will discuss ingenuity, in other words, the design of the frameworks. The first method,
GNO, relies on the Nyström approximation of the kernel, or the Monte Carlo approximation of the
integration. It is the simplest and most straightforward method. The second method, LNO, relies on
the low-rank decomposition of the kernel operator. It is efficient when the kernel has a near low-
rank structure. The third method, MGNO, is the combination of the first two. It has a hierarchical,
multi-resolution decomposition of the kernel. The last one, FNO, is different from the first three; it
restricts the integral kernel to induce a convolution.
GNO and MGNO are implemented using graph neural networks, which help to define sampling and integration. The graph network library also allows sparse and distributed message passing. LNO and FNO do not involve sampling, and they are faster since they do not rely on the graph library.

Method   Scheme                                   Graph-based   Kernel network
GNO      Nyström approximation                    Yes           Yes
LNO      Low-rank approximation                   No            Yes
MGNO     Multi-level graphs on GNO                Yes           Yes
FNO      Convolution theorem; Fourier features    No            No

Table 6: Ingenuity.

7.4.2 Expressiveness

We measure the expressiveness by the training and testing error of the method. The full O(J^2) integration always has the best results, but it is usually too expensive. As shown in the experiments of Sections 7.2.1 and 7.2.2, GNO usually has good accuracy, but its performance suffers from sampling. LNO works best on the 1d problem (Burgers' equation); it has difficulty on the 2d problems because it does not employ sampling to speed up evaluation. MGNO has the multi-level structure, which gives it the benefits of the first two. Finally, FNO has the best overall performance. It is also the only method that can capture the challenging Navier-Stokes equation.
7.4.3 Complexity

The complexities of the four methods are listed in Table 7. GNO and MGNO rely on sampling: their complexity depends on the number of sampled nodes J′, and when all nodes are used they are still quadratic. LNO has the lowest complexity, O(J). FNO, when using the fast Fourier transform, has complexity O(J log J).

In practice, FNO is faster than the other three methods because it does not have the kernel network κ. MGNO is relatively slower because of its multi-level graph structure.

Method   Complexity                          Time per epoch in training
GNO      O(J′^2 r^2)                         4s
LNO      O(J)                                20s
MGNO     Σ_l O(J_l^2 r_l^2) ∼ O(J)           8s
FNO      O(J log J)                          4s

Table 7: Complexity (time per epoch rounded to the nearest second, on a single Nvidia V100 GPU).
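To make the complexity comparison concrete, the following sketch contrasts a dense kernel quadrature, which costs O(J^2) per channel, with an FFT-based spectral convolution of the kind used in a Fourier layer, which costs O(J log J). It is a single-channel, 1-D illustration rather than the released FNO implementation; the kernel matrix and spectral weights are random placeholders.

```python
import numpy as np

J, kmax = 2048, 16
rng = np.random.default_rng(0)
v = rng.standard_normal(J)                    # input function sampled on a uniform grid

# O(J^2): generic kernel quadrature  (K v)(x_i) = (1/J) * sum_j kappa(x_i, x_j) v(x_j)
kappa = rng.standard_normal((J, J))           # stands in for kernel-network evaluations
u_dense = kappa @ v / J

# O(J log J): convolution theorem with kmax retained modes (one Fourier layer, W omitted)
R = rng.standard_normal(kmax) + 1j * rng.standard_normal(kmax)   # learned spectral weights
v_hat = np.fft.rfft(v)
u_hat = np.zeros_like(v_hat)
u_hat[:kmax] = R * v_hat[:kmax]               # multiply and truncate in Fourier space
u_fourier = np.fft.irfft(u_hat, n=J)

print(u_dense.shape, u_fourier.shape)
```

The first path requires forming and applying a J × J kernel matrix, which is what the kernel network κ amortizes in GNO, LNO, and MGNO; the second touches only k_max modes after an FFT, which is consistent with the epoch times reported in Table 7.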
7.4.4 Refinability

Refinability measures the number of parameters used in the framework. Table 8 lists the relative error on Darcy flow with respect to different numbers of parameters. Because GNO, LNO, and MGNO have kernel networks, the slopes of their error curves are flat: they can work with a very small number of parameters. On the other hand, FNO does not have the sub-network; it needs an order of magnitude more parameters to obtain an acceptable error rate.

Number of parameters   10^3    10^4    10^5    10^6
GNO                    0.075   0.065   0.060   0.035
LNO                    0.080   0.070   0.060   0.040
MGNO                   0.070   0.050   0.040   0.030
FNO                    0.200   0.035   0.020   0.015

Table 8: Refinability. The relative error on Darcy flow with respect to different numbers of parameters. The errors above are approximate values rounded to the nearest 0.005. They are the lowest test errors achieved by the model, given that the model's number of parameters |θ| is bounded by 10^3, 10^4, 10^5, and 10^6 respectively.
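A rough parameter count explains the trend in Table 8. For the kernel-based methods the learnable parameters live in a small kernel network κ whose size is independent of the resolution, whereas a Fourier layer stores a complex weight tensor for every retained mode combination, so its count grows like k_max^d · width^2. The sketch below tallies both under illustrative hyperparameters; the widths, depths, and the per-channel kernel variant are assumptions for illustration, not the paper's exact settings.

```python
# Illustrative parameter counts (hypothetical widths/depths, not the paper's configuration).
def mlp_params(sizes):
    """Number of parameters of a fully connected network with the given layer sizes."""
    return sum(m * n + n for m, n in zip(sizes[:-1], sizes[1:]))

d, width, kmax = 2, 64, 12

# Kernel-network layer (GNO/LNO/MGNO style): kappa(x, y) -> per-channel weights.
kernel_net = mlp_params([2 * d, 64, 64, width])

# Fourier layer (FNO style): complex weights R for each retained mode combination,
# i.e. 2 * kmax**d * width**2 real parameters, plus the pointwise linear map W.
fourier_layer = 2 * (kmax ** d) * width ** 2 + width * width

print(f"kernel-network layer: {kernel_net:,} parameters")
print(f"Fourier layer (kmax={kmax}): {fourier_layer:,} parameters")
```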
7.4.5 Robustness

We conclude with experiments investigating the robustness of the Fourier neural operator to noise. We study: a) training on clean (noiseless) data and testing on clean and noisy data; b) training on noisy data and testing on clean and noisy data. When creating noisy data we map a to a noisy a′ by perturbing its value at every grid point x with Gaussian noise scaled to the stated noise level, where ξ ∼ N(0, 1) is drawn i.i.d. at every grid point; this is similar to the setting adopted in Lu et al. (2021b). We also study the 1d advection equation as an additional test case, following the setting in Lu et al. (2021b) in which the input data is a random square wave, defined by an R^3-valued random variable.

As shown in the top half of Table 9 and Figure 15, we observe that the Fourier neural operator is robust with respect to the (test) noise level on all four problems. In particular, on the advection problem it has about 10% error with 10% noise. The Darcy and Navier-Stokes operators are smoothing, and the Fourier neural operator obtains lower than 10% error in all scenarios. However, the FNO is less robust on the advection equation, which is not smoothing, and on the Burgers equation which, whilst smoothing, also forms steep fronts.

A straightforward approach to enhance the robustness is to train the model with noise. As shown in the bottom half of Table 9, the Fourier neural operator has no gap between clean and noisy data when trained with noise. However, noise in training may degrade the performance on the clean data, as a trade-off. In general, augmenting the training data with noise leads to robustness.
For example, in the auto-regressive modeling of dynamical systems, training the model with noise will reduce error accumulation in time, and thereby help the model to predict over longer time-horizons (Pfaff et al., 2020). We also observed that other regularization techniques, such as early stopping and weight decay, improve robustness. Using a higher spatial resolution also helps.
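A minimal way to implement the "train with noise" rows of Table 9 is to perturb each training input with freshly drawn Gaussian noise at every iteration. The sketch below shows this augmentation inside a generic PyTorch training step; `model`, `loader`, `optimizer`, and the 10% noise level are placeholders rather than the paper's exact configuration, and the per-sample RMS scaling is one reasonable choice among several.

```python
import torch

def train_epoch_with_noise(model, loader, optimizer, noise_level=0.1):
    """One training epoch in which every input sample is perturbed by i.i.d. Gaussian noise."""
    model.train()
    for a, u in loader:                       # a: input functions on a grid, u: target solutions
        # Scale the pointwise noise by each sample's root-mean-square magnitude.
        rms = a.flatten(1).pow(2).mean(dim=1).sqrt().view(-1, *([1] * (a.ndim - 1)))
        a_noisy = a + noise_level * rms * torch.randn_like(a)
        loss = torch.mean((model(a_noisy) - u) ** 2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```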
The advection problem is a hard problem for the FNO since it has discontinuities; similar issues arise when using spectral methods for conservation laws. One can modify the architecture to address such discontinuities accordingly. For example, Wen et al. (2021) enhance the FNO by composing a CNN or UNet branch with the Fourier layer; the resulting composite model outperforms the basic FNO on multiphase flow with high contrast and sharp shocks. However, the CNN and UNet take the method out of the realm of discretization-invariant methods; further work is required to design discretization-invariant image-processing tools, such as the identification of discontinuities.
Problems                           Training error   Test (clean)   Test (noisy)
Burgers                            0.002            0.002          0.018
Advection                          0.002            0.002          0.094
Darcy                              0.006            0.011          0.012
Navier-Stokes                      0.024            0.024          0.039
Burgers (train with noise)         0.011            0.004          0.011
Advection (train with noise)       0.020            0.010          0.019
Darcy (train with noise)           0.007            0.012          0.012
Navier-Stokes (train with noise)   0.026            0.026          0.025

Table 9: Robustness. Training and test errors on clean and noisy test data; the bottom half reports models trained with noise.
Figure 15: Robustness on the advection and Burgers equations. (a) The input of the advection equation (s = 40): the orange curve is the clean input; the blue curve is the noisy input. (b) The output of the advection equation: the green curve is the ground-truth output; the orange curve is the prediction of FNO with clean input (overlapping with the ground truth); the blue curve is the prediction on the noisy input. Panels (c) and (d) show the corresponding plots for Burgers' equation (s = 1000).
8. Approximation Theory

The paper by Chen and Chen (1995) provides the first universal approximation theorem for operator approximation via neural networks, and the paper by Bhattacharya et al. (2020) provides an alternative architecture and approximation result. The analysis of Chen and Chen (1995) was recently extended in significant ways in the paper by Lanthaler et al. (2021) where, for the first time, the curse of dimensionality is addressed, and resolved, for certain specific operator learning problems, using the DeepONet generalization Lu et al. (2019, 2021a) of Chen and Chen (1995). The paper Lanthaler et al. (2021) was generalized to study operator approximation, and the curse of dimensionality, for the FNO, in Kovachki et al. (2021).
Unlike the finite-dimensional setting, the choice of input and output spaces A and U for the mapping G† plays a crucial role in the approximation theory due to the distinctiveness of the induced norm topologies. In this section, we prove universal approximation theorems for neural operators both with respect to the topology of uniform convergence over compact sets and with respect to the topology induced by the Bochner norm (3). We focus our attention on the Lebesgue, Sobolev, continuous, and continuously differentiable function classes as they have numerous applications in scientific computing and machine learning problems. Unlike the results of Bhattacharya et al. (2020); Kovachki et al. (2021), which rely on the Hilbertian structure of the input and output spaces, or the results of Chen and Chen (1995); Lanthaler et al. (2021), which rely on continuous functions, our results extend to more general Banach spaces as specified by Assumptions 9 and 10 (stated in Section 8.3) and are, to the best of our knowledge, the first of their kind to apply at this level of generality.

Our method of proof proceeds by making use of the following two observations. First, we establish the Banach space approximation property (Grothendieck, 1955) for the input and output spaces of interest, which allows for a finite dimensionalization of the problem. In particular, we prove that the Banach space approximation property holds for various function spaces defined on Lipschitz domains; the precise result we need, while unsurprising, seems to be missing from the functional analysis literature and so we provide statement and proof. Details are given in Appendix A. Second, we establish that integral kernel operators with smooth kernels can be used to approximate linear functionals on various input spaces. In doing so, we establish a Riesz-type representation theorem for the continuously differentiable functions. Such a result is not surprising and mimics the well-known result for Sobolev spaces; however, in the form we need it, we could not find the result in the functional analysis literature and so we provide statement and proof. Details are given in Appendix B. With these two facts, we construct a neural operator which linearly maps any input function to a finite vector, then non-linearly maps this vector to a new finite vector, which is then used to form the coefficients of a basis expansion for the output function. We reemphasize that our approximation theory uses the fact that neural operators can be reduced to a linear method of approximation (as pointed out in Section 5.1) and does not capture any benefits of nonlinear approximation. However, these benefits are present in the architecture and are exploited by the trained networks we find in practice. Exploiting their nonlinear nature to potentially obtain improved rates of approximation remains an interesting direction for future research.
The rest of this section is organized as follows. In Subsection 8.1, we define allowable activation functions and the set of neural operators used in our theory, noting that they constitute a subclass of the neural operators defined in Section 5. In Subsection 8.3, we state and prove our main universal approximation theorems.

8.1 Neural Operators

For any n ∈ N and σ : R → R, we define the set of real-valued n-layer neural networks on R^d by

N_n(σ; R^d) := { f : R^d → R : f(x) = W_n σ(... W_1 σ(W_0 x + b_0) + b_1 ...) + b_n,
    W_0 ∈ R^{d_0 × d}, W_1 ∈ R^{d_1 × d_0}, ..., W_n ∈ R^{1 × d_{n-1}},
    b_0 ∈ R^{d_0}, b_1 ∈ R^{d_1}, ..., b_n ∈ R,  d_0, d_1, ..., d_{n-1} ∈ N }.

We define the set of R^{d′}-valued neural networks simply by stacking real-valued networks,

N_n(σ; R^d, R^{d′}) := { f : R^d → R^{d′} : f(x) = (f_1(x), ..., f_{d′}(x)),  f_1, ..., f_{d′} ∈ N_n(σ; R^d) }.

We remark that we could have defined N_n(σ; R^d, R^{d′}) by letting W_n ∈ R^{d′ × d_{n-1}} and b_n ∈ R^{d′} in the definition of N_n(σ; R^d) because we allow arbitrary width, making the two definitions equivalent; however, the definition as presented is more convenient for our analysis. We also employ the preceding definition with R^d and R^{d′} replaced by spaces of matrices. For any m ∈ N_0, we define the set of allowable activation functions as the continuous R → R maps which make neural networks dense in C^m(R^d) on compacta at any fixed depth,

A_m := { σ ∈ C(R) : ∃ n ∈ N s.t. N_n(σ; R^d) is dense in C^m(K) for all compact K ⊂ R^d }.

It is shown in (Pinkus, 1999, Theorem 4.1) that { σ ∈ C^m(R) : σ is not a polynomial } ⊆ A_m with n = 1. Clearly A_{m+1} ⊆ A_m.

We define the set of linearly bounded activations as

A_m^L := { σ ∈ A_m : σ is Borel measurable, sup_{x ∈ R} |σ(x)| / (1 + |x|) < ∞ },

noting that any globally Lipschitz, non-polynomial C^m-function is contained in A_m^L. Most activation functions used in practice fall within this class; for example, ReLU ∈ A_0^L and ELU ∈ A_1^L, while tanh, sigmoid ∈ A_m^L for any m ∈ N_0.
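To connect the set N_n(σ; R^d, R^{d′}) with code: the sketch below builds one real-valued n-layer network of the form W_n σ(... W_1 σ(W_0 x + b_0) + b_1 ...) + b_n and stacks d′ of them into a vector-valued map, mirroring the definition above. It is a didactic PyTorch sketch, not part of the paper's released code; the widths and the choice of activation are arbitrary.

```python
import torch
import torch.nn as nn

def real_valued_net(d, widths, sigma=nn.GELU()):
    """An element of N_n(sigma; R^d): affine maps interleaved with sigma, scalar output."""
    sizes, layers = [d, *widths, 1], []
    for i, (m, k) in enumerate(zip(sizes[:-1], sizes[1:])):
        layers.append(nn.Linear(m, k))                 # W_i x + b_i
        if i < len(sizes) - 2:
            layers.append(sigma)                       # no activation after the last affine map
    return nn.Sequential(*layers)

class StackedNet(nn.Module):
    """An element of N_n(sigma; R^d, R^{d'}): d' independent real-valued networks, stacked."""
    def __init__(self, d, d_out, widths):
        super().__init__()
        self.components = nn.ModuleList(real_valued_net(d, widths) for _ in range(d_out))

    def forward(self, x):                              # x: (batch, d) -> (batch, d')
        return torch.cat([f(x) for f in self.components], dim=-1)

f = StackedNet(d=2, d_out=3, widths=[64, 64])
print(f(torch.randn(5, 2)).shape)                      # torch.Size([5, 3])
```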
For approximation in a Bochner norm, we will be interested in constructing globally bounded neural networks which can approximate the identity over compact sets, as done in (Lanthaler et al., 2021; Bhattacharya et al., 2020). This allows us to control the potential unboundedness of the support of the input measure by exploiting the fact that the probability of an input must decay to zero in unbounded regions. Following (Lanthaler et al., 2021), we introduce the forthcoming definition, which uses the notion of the diameter of a set. In particular, the diameter of any set S ⊆ R^d is defined, for |·|_2 the Euclidean norm on R^d, as

diam_2(S) := sup_{x, y ∈ S} |x − y|_2.

Definition 7 We denote by BA the set of maps σ ∈ A_0 such that, for any compact set K ⊂ R^d, ϵ > 0, and C ≥ diam_2(K), there exists a number n ∈ N and a neural network f ∈ N_n(σ; R^d, R^d) such that

|f(x) − x|_2 ≤ ϵ, ∀ x ∈ K,
|f(x)|_2 ≤ C, ∀ x ∈ R^d.

It is shown in (Lanthaler et al., 2021, Lemma C.1) that ReLU ∈ A_0^L ∩ BA with n = 3.

We will now define the specific class of neural operators for which we prove a universal approximation theorem. It is important to note that the class with which we work is a simplification of the one given in (6). In particular, the lifting and projection operators Q, P, together with the final activation function σ_n, are set to the identity, and the local linear operators W_0, ..., W_{n−1} are set to zero. In our numerical studies we have in any case typically set σ_n to the identity. However, we have found that learning the local operators Q, P and W_0, ..., W_{n−1} is beneficial in practice; extending the universal approximation theorems given here to explain this benefit would be an important but non-trivial development of the analysis we present here.

Let D ⊂ R^d be a domain. For any σ ∈ A_0, we define the set of affine kernel integral operators by

IO(σ; D, R^{d_1}, R^{d_2}) := { f ↦ ∫_D κ(·, y) f(y) dy + b : κ ∈ N_{n_1}(σ; R^d × R^d, R^{d_2 × d_1}), b ∈ N_{n_2}(σ; R^d, R^{d_2}), n_1, n_2 ∈ N },

for any d_1, d_2 ∈ N. Clearly, since σ ∈ A_0, any S ∈ IO(σ; D, R^{d_1}, R^{d_2}) acts as S : L^p(D; R^{d_1}) → L^p(D; R^{d_2}) for any 1 ≤ p ≤ ∞, since κ ∈ C(D̄ × D̄; R^{d_2 × d_1}) and b ∈ C(D̄; R^{d_2}). For any n ∈ N_{≥2}, d_a, d_u ∈ N, domains D ⊂ R^d and D′ ⊂ R^{d′}, and σ_1 ∈ A_0^L, σ_2, σ_3 ∈ A_0, we define the set of
n-layer neural operators by

NO_n(σ_1, σ_2, σ_3; D, D′, R^{d_a}, R^{d_u}) := { f ↦ ∫_D κ_n(·, y) (S_{n−1} σ_1(... S_2 σ_1(S_1(S_0 f)) ...))(y) dy :
    S_0 ∈ IO(σ_2, D; R^{d_a}, R^{d_1}), ..., S_{n−1} ∈ IO(σ_2, D; R^{d_{n−1}}, R^{d_n}),
    κ_n ∈ N_l(σ_3; R^{d′} × R^d, R^{d_u × d_n}),  d_1, ..., d_n, l ∈ N }.

When d_a = d_u = 1, we will simply write NO_n(σ_1, σ_2, σ_3; D, D′). Since σ_1 is linearly bounded, we can use a result about compositions of maps in L^p spaces, such as (Dudley and Norvaiša, 2010, Theorem 7.13), to conclude that any G ∈ NO_n(σ_1, σ_2, σ_3; D, D′, R^{d_a}, R^{d_u}) acts as G : L^p(D; R^{d_a}) → L^p(D′; R^{d_u}). Note that it is only in the last layer that we transition from functions defined over the domain D to functions defined over the domain D′.

When the input space of an operator of interest is C^m(D̄), for m ∈ N, we will need to take the derivatives in explicitly, as they cannot be learned using kernel integration as employed in the current construction given in Lemma 30; note that this is not the case for W^{m,p}(D), as shown in Lemma 28. We will therefore define the set of m-th order neural operators by

NO_n^m(σ_1, σ_2, σ_3; D, D′, R^{d_a}, R^{d_u}) := { (∂^{α_1} f, ..., ∂^{α_{J_m}} f) ↦ G(∂^{α_1} f, ..., ∂^{α_{J_m}} f) : G ∈ NO_n(σ_1, σ_2, σ_3; D, D′, R^{J_m d_a}, R^{d_u}) }.
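On a concrete discretization, the affine kernel integral operators IO, and hence each layer of NO_n, reduce to weighted sums over grid points; this Riemann-sum viewpoint is also what the proof of Theorem 8 below exploits. The sketch that follows is a schematic NumPy discretization under stated simplifications (uniform 1-D grid, small random MLPs standing in for the kernel network κ and the bias b), not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    """A random small MLP used here as a stand-in for the kernel network kappa or bias b."""
    Ws = [rng.standard_normal((m, k)) / np.sqrt(m) for m, k in zip(sizes[:-1], sizes[1:])]
    bs = [np.zeros(k) for k in sizes[1:]]
    def apply(x):
        for W, b in zip(Ws[:-1], bs[:-1]):
            x = np.tanh(x @ W + b)
        return x @ Ws[-1] + bs[-1]
    return apply

def kernel_integral_operator(kappa, bias, grid, f):
    """(S f)(x_i) = (1/J) * sum_j kappa(x_i, y_j) f(y_j) + b(x_i), a Riemann sum on a uniform grid."""
    J, d_in = f.shape
    pairs = np.concatenate([np.repeat(grid, J, axis=0), np.tile(grid, (J, 1))], axis=1)
    K = kappa(pairs).reshape(J, J, -1, d_in)           # kappa(x_i, y_j) as a (d_out, d_in) block
    return np.einsum("ijkl,jl->ik", K, f) / J + bias(grid)

# One hidden layer of a (simplified) neural operator: v <- sigma(S v), on a 1-D grid.
J, d_v = 64, 4
grid = np.linspace(0.0, 1.0, J, endpoint=False)[:, None]
kappa = mlp([2, 32, d_v * d_v])                        # kappa: (x, y) -> R^{d_v x d_v}
bias = mlp([1, 32, d_v])
v = rng.standard_normal((J, d_v))
v = np.tanh(kernel_integral_operator(kappa, bias, grid, v))
print(v.shape)                                         # (64, 4)
```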
Theorem 8 Let D ⊂ R^d and D′ ⊂ R^{d′} be two domains for some d, d′ ∈ N. Let A and U be real-valued Banach function spaces on D and D′ respectively. Suppose that A and U can be continuously embedded in C(D̄) and C(D̄′) respectively and that σ_1, σ_2, σ_3 ∈ C(R). Then, for any n ∈ N, the set of neural operators NO_n(σ_1, σ_2, σ_3; D, D′), whose elements are viewed as maps A → U, is discretization-invariant.

The proof, provided in Appendix E, constructs a sequence of finite-dimensional maps which approximate the neural operator by Riemann sums and shows uniform convergence of the error over compact sets of A.

8.3 Approximation Theorems

Let A and U be Banach function spaces on the domains D ⊂ R^d and D′ ⊂ R^{d′} respectively. We will work in the setting where functions in A or U are real-valued, but note that all results generalize in a straightforward fashion to the vector-valued setting. We are interested in the approximation of nonlinear operators G† : A → U by neural operators. We will make the following assumptions on the spaces A and U.

Assumption 9 Let D ⊂ R^d be a Lipschitz domain for some d ∈ N. One of the following holds:
1. A = L^{p_1}(D) for some 1 ≤ p_1 < ∞;
2. A = W^{m_1, p_1}(D) for some 1 ≤ p_1 < ∞ and m_1 ∈ N;
3. A = C(D̄).

Assumption 10 Let D′ ⊂ R^{d′} be a Lipschitz domain for some d′ ∈ N. One of the following holds:
1. U = L^{p_2}(D′) for some 1 ≤ p_2 < ∞, and m_2 = 0;
2. U = W^{m_2, p_2}(D′) for some 1 ≤ p_2 < ∞ and m_2 ∈ N;
3. U = C^{m_2}(D̄′) and m_2 ∈ N_0.

Figure 16: A schematic overview of the maps used to approximate G† : A → U: the input space A is mapped to R^J by the linear functionals F, then to R^{J′} by the continuous map ψ, and finally into U by the map G.

We first show that neural operators are dense in the continuous operators G† : A → U in the topology of uniform convergence on compacta. The proof proceeds by making three main approximations, which are schematically shown in Figure 16. First, inputs are mapped to a finite-dimensional representation through a set of appropriate linear functionals on A, denoted by F : A → R^J. We show in Lemmas 21 and 23 that, when A satisfies Assumption 9, elements of A* can be approximated by integration against smooth functions. This generalizes the idea from (Chen and Chen, 1995) where functionals on C(D̄) are approximated by a weighted sum of Dirac measures. We then show in Lemma 25 that, by lifting the dimension, this representation can be approximated by a single element of IO. Second, the representation is non-linearly mapped to a new representation by a continuous function ψ : R^J → R^{J′} which finite-dimensionalizes the action of G†.
We show, in Lemma 28, that this map can be approximated by a neural operator by reducing the architecture to that of a standard neural network. Third, the new representation is used as the coefficients of an expansion onto representers of U, the map denoted G : R^{J′} → U, which we show can be approximated by a single IO layer in Lemma 27 using density results for continuous functions. The structure of the overall approximation is similar to (Bhattacharya et al., 2020) but generalizes the ideas from working on Hilbert spaces to the spaces in Assumptions 9 and 10. Statements and proofs of the lemmas used in the theorems are given in the appendices.

Theorem 11 Let Assumptions 9 and 10 hold and suppose G† : A → U is continuous. Let σ_1 ∈ A_0^L, σ_2 ∈ A_0, and σ_3 ∈ A_{m_2}. Then for any compact set K ⊂ A and 0 < ϵ ≤ 1, there exists a number N ∈ N and a neural operator G ∈ NO_N(σ_1, σ_2, σ_3; D, D′) such that

sup_{a ∈ K} ∥G†(a) − G(a)∥_U ≤ ϵ.

Furthermore, if U is a Hilbert space and σ_1 ∈ BA and, for some M > 0, we have that ∥G†(a)∥_U ≤ M for all a ∈ A, then G can be chosen so that

∥G(a)∥_U ≤ 4M, ∀ a ∈ A.

The proof is provided in Appendix F. In the following theorem, we extend this result to the case A = C^{m_1}(D̄), showing density of the m_1-th order neural operators.

Theorem 12 Let D ⊂ R^d be a Lipschitz domain, m_1 ∈ N, define A := C^{m_1}(D̄), suppose Assumption 10 holds, and assume that G† : A → U is continuous. Let σ_1 ∈ A_0^L, σ_2 ∈ A_0, and σ_3 ∈ A_{m_2}. Then for any compact set K ⊂ A and 0 < ϵ ≤ 1, there exists a number N ∈ N and a neural operator G ∈ NO_N^{m_1}(σ_1, σ_2, σ_3; D, D′) such that

sup_{a ∈ K} ∥G†(a) − G(a)∥_U ≤ ϵ.

Furthermore, if U is a Hilbert space and σ_1 ∈ BA and, for some M > 0, we have that ∥G†(a)∥_U ≤ M for all a ∈ A, then G can be chosen so that

∥G(a)∥_U ≤ 4M, ∀ a ∈ A.

Proof The proof follows as in Theorem 11, replacing the use of Lemma 32 with Lemma 33.

With these results in hand, we show density of neural operators in the space L²_µ(A; U), where µ is a probability measure and U is a separable Hilbert space. The Hilbertian structure of U allows us to uniformly control the norm of the approximation due to the isomorphism with ℓ² as shown in Theorem 11. It remains an interesting future direction to obtain similar results for Banach spaces. The proof follows the ideas in (Lanthaler et al., 2021), where similar results are obtained for DeepONet(s) on L²(D), by using Lusin's theorem to restrict the approximation to a large enough compact set and exploit the decay of µ outside it. Bhattacharya et al. (2020) also employ a similar approach but explicitly construct the necessary compact set after finite-dimensionalizing.

Theorem 13 Let D′ ⊂ R^{d′} be a Lipschitz domain, m_2 ∈ N_0, and suppose Assumption 9 holds. Let µ be a probability measure on A and suppose G† : A → H^{m_2}(D′) is µ-measurable and G† ∈ L²_µ(A; H^{m_2}(D′)). Let σ_1 ∈ A_0^L ∩ BA, σ_2 ∈ A_0, and σ_3 ∈ A_{m_2}. Then for any 0 < ϵ ≤ 1, there exists a number N ∈ N and a neural operator G ∈ NO_N(σ_1, σ_2, σ_3; D, D′) such that

∥G† − G∥_{L²_µ(A; H^{m_2}(D′))} ≤ ϵ.
The proof is provided in Appendix G. In the following we extend this result to the case A = C^{m_1}(D̄) using the m_1-th order neural operators.

Theorem 14 Let D ⊂ R^d be a Lipschitz domain, m_1 ∈ N, define A := C^{m_1}(D̄), and suppose Assumption 10 holds. Let µ be a probability measure on C^{m_1}(D̄) and let G† : C^{m_1}(D̄) → U be µ-measurable, and suppose G† ∈ L²_µ(C^{m_1}(D̄); U). Let σ_1 ∈ A_0^L ∩ BA, σ_2 ∈ A_0, and σ_3 ∈ A_{m_2}. Then for any 0 < ϵ ≤ 1, there exists a number N ∈ N and a neural operator G ∈ NO_N^{m_1}(σ_1, σ_2, σ_3; D, D′) such that

∥G† − G∥_{L²_µ(C^{m_1}(D̄); U)} ≤ ϵ.

Proof The proof follows as in Theorem 13 by replacing the use of Theorem 11 with Theorem 12.

9. Literature Review

We outline the major neural network-based approaches for the solution of PDEs.

Finite-dimensional Operators. An immediate approach to approximate G† is to parameterize it as a deep convolutional neural network (CNN) between the finite-dimensional Euclidean spaces on which the data is discretized, i.e. G : R^K × Θ → R^K (Guo et al., 2016; Zhu and Zabaras, 2018; Adler and Oktem, 2017; Bhatnagar et al., 2019; Kutyniok et al., 2022). Khoo et al. (2021) concerns a similar setting, but with output space R. Such approaches are, by definition, not mesh-independent and need modifications to the architecture for different resolutions and discretizations of D in order to achieve consistent error (if at all possible). We demonstrate this issue numerically in Section 7. Furthermore, these approaches are limited to the discretization size and geometry of the training data and hence it is not possible to query solutions at new points in the domain. In contrast, for our method we show in Section 7 both invariance of the error to grid resolution and the ability to transfer the solution between meshes. The work Ummenhofer et al. (2020) proposed a continuous convolution network for fluid problems, where off-grid points are sampled and linearly interpolated. However, the continuous convolution method is still constrained by the underlying grid, which prevents generalization to higher resolutions. Similarly, to obtain finer-resolution solutions, Jiang et al. (2020) proposed learning super-resolution with a U-Net structure for fluid mechanics problems. However, fine-resolution data is needed for training, while neural operators are capable of zero-shot super-resolution with no new data.
DeepONet. A novel operator regression architecture, named DeepONet, was recently proposed by Lu et al. (2019, 2021a); it builds an iterated or deep structure on top of the shallow architecture proposed in Chen and Chen (1995). The architecture consists of two neural networks: a branch net applied to the input functions and a trunk net applied to the querying locations in the output space. The original work of Chen and Chen (1995) provides a universal approximation theorem, and more recently Lanthaler et al. (2021) developed an error estimate for DeepONet itself. The standard DeepONet structure is a linear approximation of the target operator, where the trunk net and branch net learn the coefficients and basis. On the other hand, the neural operator setting is heavily inspired by the advances in deep learning and is a non-linear approximation, which makes it constructively more expressive. A detailed discussion of DeepONet is provided in Section 5.1, as well as a numerical comparison to DeepONet in Section 7.2.

Physics-Informed Neural Networks (PINNs), Deep Ritz Method (DRM), and Deep Galerkin Method (DGM). A different approach is to directly parameterize the solution u as a neural network u : D̄ × Θ → R (E and Yu, 2018; Raissi et al., 2019; Sirignano and Spiliopoulos, 2018; Bar and Sochen, 2019; Smith et al., 2020; Pan and Duraisamy, 2020; Beck et al., 2021). This approach is designed to model one specific instance of the PDE, not the solution operator. It is mesh-independent, but for any given new parameter coefficient function a ∈ A, one would need to train a new neural network u_a, which is computationally costly and time-consuming. Such an approach closely resembles classical methods such as finite elements, replacing the linear span of a finite set of local basis functions with the space of neural networks.

ML-based Hybrid Solvers. Similarly, another line of work proposes to enhance existing numerical solvers with neural networks by building hybrid models (Pathak et al., 2020; Um et al., 2020a; Greenfeld et al., 2019). These approaches suffer from the same computational issue as classical methods: one needs to solve an optimization problem for every new parameter, similarly to the PINNs setting. Furthermore, the approaches are limited to a setting in which the underlying PDE is known. Purely data-driven learning of a map between spaces of functions is not possible.
KOVACHKI , L I , L IU , A ZIZZADENESHELI , B HATTACHARYA , S TUART, A NANDKUMAR KOVACHKI , L I , L IU , A ZIZZADENESHELI , B HATTACHARYA , S TUART, A NANDKUMAR
Greenfeld et al., 2019) These approaches suffer from the same computational issue as classical constructively more expressive. A detailed discussion of DeepONet is provided in Section 5.1 and
methods: one needs to solve an optimization problem for every new parameter similarly to the as well as a numerical comparison to DeepONet in Section 7.2.
PINNs setting. Furthermore, the approaches are limited to a setting in which the underlying PDE is
Physics Informed Neural Networks (PINNs), Deep Ritz Method (DRM), and Deep Galerkin
known. Purely data-driven learning of a map between spaces of functions is not possible.
Reduced Basis Methods. Our methodology most closely resembles the classical reduced basis method (RBM) (DeVore, 2014) or the method of Cohen and DeVore (2015). The method introduced here, along with the contemporaneous work in the papers (Bhattacharya et al., 2020; Nelsen and Stuart, 2021; Opschoor et al., 2020; Schwab and Zech, 2019; O'Leary-Roseberry et al., 2020; Lu et al., 2019; Fresca and Manzoni, 2022), is, to the best of our knowledge, amongst the first practical supervised learning methods designed to learn maps between infinite-dimensional spaces. Our methodology addresses the mesh-dependent nature of the approach in the papers (Guo et al., 2016; Zhu and Zabaras, 2018; Adler and Oktem, 2017; Bhatnagar et al., 2019) by producing a single set of network parameters that can be used with different discretizations. Furthermore, it has the ability to transfer solutions between meshes and indeed between different discretization methods. Moreover, it needs only to be trained once on the equation set $\{a_j, u_j\}_{j=1}^{N}$. Then, obtaining a solution for a new a ∼ µ only requires a forward pass of the network, alleviating the major computational issues incurred in (E and Yu, 2018; Raissi et al., 2019; Herrmann et al., 2020; Bar and Sochen, 2019), where a different network would need to be trained for each input parameter. Lastly, our method requires no knowledge of the underlying PDE: it is purely data-driven and therefore non-intrusive. Indeed, the true map can be treated as a black-box, perhaps to be learned from experimental data or from the output of a costly computer simulation, not necessarily from a PDE.
Continuous Neural Networks. Using continuity as a tool to design and interpret neural networks is gaining currency in the machine learning community, and the formulation of ResNet as a continuous time process over the depth parameter is a powerful example of this (Haber and Ruthotto, 2017; E, 2017). The concept of defining neural networks in infinite-dimensional spaces is a central problem that has long been studied (Williams, 1996; Neal, 1996; Roux and Bengio, 2007; Globerson and Livni, 2016; Guss, 2016). The general idea is to take the infinite-width limit, which yields a non-parametric method and has connections to Gaussian Process Regression (Neal, 1996; Matthews et al., 2018; Garriga-Alonso et al., 2018), leading to the introduction of deep Gaussian processes (Damianou and Lawrence, 2013; Dunlop et al., 2018). Thus far, such methods have not yielded efficient numerical algorithms that can parallel the success of convolutional or recurrent neural networks for the problem of approximating mappings between finite dimensional spaces. Despite the superficial similarity with our proposed work, this body of work differs substantially from what we are proposing: in our work we are motivated by the continuous dependence of the data, in the input or output spaces, on spatial or spatio-temporal variables; in contrast, the work outlined in this paragraph uses continuity in an artificial algorithmic depth or width parameter to study the network architecture when the depth or width approaches infinity, but the input and output spaces remain of fixed finite dimension.
Nyström Approximation, GNNs, and Graph Neural Operators (GNOs). The graph neural operator (Section 4.1) has an underlying Nyström approximation formulation (Nyström, 1930) which links different grids to a single set of network parameters. This perspective relates our continuum approach to Graph Neural Networks (GNNs). GNNs are a recently developed class of neural networks that apply to graph-structured data; they have been used in a variety of applications. Graph networks incorporate an array of techniques from neural network design such as graph convolution, edge convolution, attention, and graph pooling (Kipf and Welling, 2016; Hamilton et al., 2017; Gilmer et al., 2017; Veličković et al., 2017; Murphy et al., 2018). GNNs have also been applied to the modeling of physical phenomena such as molecules (Chen et al., 2019) and rigid body systems (Battaglia et al., 2018), since these problems exhibit a natural graph interpretation: the particles are the nodes and the interactions are the edges. The work (Alet et al., 2019) performs an initial study that employs graph networks on the problem of learning solutions to Poisson's equation, among other physical applications. They propose an encoder-decoder setting, constructing graphs in the latent space, and utilizing message passing between the encoder and decoder. However, their model uses a nearest-neighbor structure that is unable to capture non-local dependencies as the mesh size is increased. In contrast, we directly construct a graph in which the nodes are located on the spatial domain of the output function. Through message passing, we are then able to directly learn the kernel of the network which approximates the PDE solution. When querying a new location, we simply add a new node to our spatial graph and connect it to the existing nodes, avoiding interpolation error by leveraging the power of the Nyström extension for integral operators.
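For illustration, the sketch below approximates a kernel integral operator (Kv)(x) = ∫ κ(x, y) v(y) dy by a Nyström/Monte Carlo sum over a subsampled set of nodes, with the kernel given by a small neural network; evaluating at a new query point only requires connecting it to the existing nodes. This is a minimal sketch of the idea, not the GNO implementation used in the paper: the kernel-network architecture, the number of nodes, and the unit-interval domain are all assumptions made for the example.

```python
import torch

# A small network kappa_theta(x, y) standing in for the learnable kernel;
# its architecture is an assumption, not the paper's exact choice.
kernel_net = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.GELU(), torch.nn.Linear(64, 1)
)

def kernel_integral(query_x, nodes_y, v_at_nodes):
    """Nystrom / Monte Carlo approximation of (Kv)(x) = int kappa(x, y) v(y) dy
    on [0, 1]: average kappa(x, y_j) v(y_j) over the sampled nodes y_j."""
    pairs = torch.cartesian_prod(query_x, nodes_y)           # all (x, y_j) pairs
    k = kernel_net(pairs).view(len(query_x), len(nodes_y))   # kappa(x, y_j)
    return k @ v_at_nodes / len(nodes_y)                     # quadrature weight 1/J

# Subsample J nodes from the domain; the same parameters work for any J.
nodes = torch.rand(128)                # y_j ~ Uniform(0, 1)
v = torch.sin(2 * torch.pi * nodes)    # input function sampled at the nodes
x_query = torch.rand(16)               # new query locations: just connect them
print(kernel_integral(x_query, nodes, v).shape)  # torch.Size([16])
```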
Low-rank Kernel Decomposition and Low-rank Neural Operators (LNOs). Low-rank decomposition is a popular method used in kernel methods and Gaussian processes (Kulis et al., 2006; Bach, 2013; Lan et al., 2017; Gardner et al., 2018). We present the low-rank neural operator in Section 4.2, where we structure the kernel network as a product of two factor networks, inspired by Fredholm theory. The low-rank method, while simple, is very efficient and easy to train, especially when the target operator is close to linear. Khoo and Ying (2019) proposed a related neural network with low-rank structure to approximate the inverse of differential operators. The framework of two factor networks is also similar to the trunk and branch networks used in DeepONet (Lu et al., 2019). But in our work, the factor networks are defined on the physical domain and non-local information is accumulated through integration with respect to the Lebesgue measure. In contrast, DeepONet(s) integrate against delta measures at a set of pre-defined nodal points that are usually taken to be the grid on which the data is given. See Section 5.1 for further discussion.
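As a schematic of this factorization, the kernel can be written as a finite sum of separable terms, κ(x, y) ≈ Σ_{j=1}^{r} ψ_j(x) φ_j(y), so that the integral operator costs O(rJ) rather than O(J^2) to evaluate on J points. The sketch below is a minimal illustration under that assumption; the rank, the factor-network sizes, and the uniform quadrature on [0, 1] are assumptions, not the paper's exact construction.

```python
import torch

rank = 8  # number of separable terms r (an illustrative choice)
# Factor networks psi(x) and phi(y), each mapping a point to r features.
psi = torch.nn.Sequential(torch.nn.Linear(1, 64), torch.nn.GELU(), torch.nn.Linear(64, rank))
phi = torch.nn.Sequential(torch.nn.Linear(1, 64), torch.nn.GELU(), torch.nn.Linear(64, rank))

def low_rank_integral(x, y, v):
    """(Kv)(x) ~= sum_j psi_j(x) * (1/J) sum_i phi_j(y_i) v(y_i):
    integrate against phi first (cost O(rJ)), then expand in psi."""
    coeffs = phi(y.unsqueeze(-1)).T @ v / len(y)   # (r,) inner products <phi_j, v>
    return psi(x.unsqueeze(-1)) @ coeffs           # output function values at x

y_grid = torch.linspace(0, 1, 256)       # quadrature nodes on [0, 1]
v = torch.cos(2 * torch.pi * y_grid)     # input function samples
x_query = torch.rand(32)
print(low_rank_integral(x_query, y_grid, v).shape)  # torch.Size([32])
```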
Multipole, Multi-resolution Methods, and Multipole Graph Neural Operators (MGNOs). To efficiently capture long-range interactions, multi-scale methods such as the classical fast multipole method (FMM) have been developed (Greengard and Rokhlin, 1997). Based on the assumption that long-range interactions decay quickly, FMM decomposes the kernel matrix into different ranges and hierarchically imposes low-rank structures on the long-range components (hierarchical matrices) (Börm et al., 2003). This decomposition can be viewed as a specific form of the multi-resolution matrix factorization of the kernel (Kondor et al., 2014; Börm et al., 2003). For example, the works of Fan et al. (2019c,b) and He and Xu (2019) propose a similar multipole expansion for solving parametric PDEs on structured grids. However, the classical FMM requires nested grids as well as the explicit form of the PDEs. In Section 4.3, we propose the multipole graph neural operator (MGNO) by generalizing this idea to arbitrary graphs in the data-driven setting, so that the corresponding graph neural networks can learn discretization-invariant solution operators which are fast and can work on complex geometries.
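A two-level caricature of this range decomposition, in the spirit of FMM and hierarchical matrices rather than the MGNO architecture itself, is sketched below: the short-range part of the kernel is kept at fine resolution, while the long-range part is evaluated on a coarse grid and prolonged back. The toy exponential kernel, the cutoff radius, and the grid sizes are assumptions for the example.

```python
import torch

J, Jc, cutoff = 256, 32, 0.1               # fine nodes, coarse nodes, near-field radius (assumed)
x_f = (torch.arange(J) + 0.5) / J           # fine grid on (0, 1)
x_c = (torch.arange(Jc) + 0.5) / Jc         # coarse grid for the long-range part
kappa = lambda a, b: torch.exp(-torch.abs(a[:, None] - b[None, :]) / 0.2)  # toy kernel (assumed)
v = torch.sin(2 * torch.pi * x_f)           # input function on the fine grid

K_f = kappa(x_f, x_f)
dist_f = torch.abs(x_f[:, None] - x_f[None, :])
u_full = K_f @ v / J                        # dense O(J^2) reference

# Level 1 (fine): keep only the short-range entries of the kernel (sparse in practice).
u_near = torch.where(dist_f < cutoff, K_f, torch.zeros_like(K_f)) @ v / J

# Level 2 (coarse): long-range part evaluated on the coarse grid, then prolonged back.
cell = (x_f * Jc).long()                    # coarse cell containing each fine node
R = torch.zeros(Jc, J); R[cell, torch.arange(J)] = Jc / J     # averaging restriction
K_c = kappa(x_c, x_c)
dist_c = torch.abs(x_c[:, None] - x_c[None, :])
K_far = torch.where(dist_c >= cutoff, K_c, torch.zeros_like(K_c))
u_far = (K_far @ (R @ v) / Jc)[cell]        # piecewise-constant prolongation to the fine grid

# Relative error of the two-level approximation against the dense evaluation.
print(torch.norm(u_near + u_far - u_full) / torch.norm(u_full))
```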
Fourier Transform, Spectral Methods, and Fourier Neural Operators (FNOs). The Fourier transform is frequently used in spectral methods for solving differential equations since differentiation is equivalent to multiplication in the Fourier domain. Fourier transforms have also played an important role in the development of deep learning. They are used in theoretical work, such as the proof of the neural network universal approximation theorem (Hornik et al., 1989) and related results for random feature methods (Rahimi and Recht, 2008); empirically, they have been used to speed up convolutional neural networks (Mathieu et al., 2013). Neural network architectures involving the Fourier transform or the use of sinusoidal activation functions have also been proposed and studied (Bengio et al., 2007; Mingo et al., 2004; Sitzmann et al., 2020). Recently, some spectral methods for PDEs have been extended to neural networks (Fan et al., 2019a,c; Kashinath et al., 2020). In Section 4.4, we build on these works by proposing the Fourier neural operator architecture defined directly in Fourier space with quasi-linear time complexity and state-of-the-art approximation capabilities.
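The core computational step can be illustrated with a minimal spectral convolution: transform the input to Fourier space, apply a learned linear map to the lowest modes, truncate the rest, and transform back, giving quasi-linear cost via the FFT. The sketch below is an illustrative single 1D layer under assumed sizes, not the full FNO architecture of Section 4.4 (which adds lifting and projection maps, a pointwise linear term, and nonlinearities between layers); the channel count, number of retained modes, and initialization are assumptions.

```python
import torch

class SpectralConv1d(torch.nn.Module):
    """One Fourier layer: FFT -> multiply the lowest `modes` frequencies by
    learned complex weights -> inverse FFT. Cost is O(n log n) in the grid size n."""
    def __init__(self, channels: int, modes: int):
        super().__init__()
        self.modes = modes
        self.weight = torch.nn.Parameter(
            0.02 * torch.randn(channels, channels, modes, dtype=torch.cfloat)
        )

    def forward(self, v):                      # v: (batch, channels, n grid points)
        v_hat = torch.fft.rfft(v)              # (batch, channels, n // 2 + 1), complex
        out_hat = torch.zeros_like(v_hat)
        out_hat[..., : self.modes] = torch.einsum(
            "bim,oim->bom", v_hat[..., : self.modes], self.weight
        )                                      # mix channels mode-by-mode, truncate the rest
        return torch.fft.irfft(out_hat, n=v.shape[-1])

layer = SpectralConv1d(channels=4, modes=16)
v = torch.randn(8, 4, 256)                     # a batch of functions on a 256-point grid
print(layer(v).shape)                          # torch.Size([8, 4, 256]); resolution is unchanged
```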
Sources of Error. In this paper we will study the error resulting from approximating an operator (mapping between Banach spaces) from within a class of finitely-parameterized operators. We show that the resulting error, expressed in terms of universal approximation of operators over a compact set or in terms of a resulting risk, can be driven to zero by increasing the number of parameters and refining the approximations inherent in the neural operator architecture. In practice there will be two other sources of approximation error: firstly, from the discretization of the data; and secondly, from the use of empirical risk minimization over a finite data set to determine the parameters. Balancing all three sources of error is key to making algorithms efficient. However, we do not study these other two sources of error in this work. Furthermore, we do not study how the number of parameters in our approximation grows as the error tolerance is refined. Generally, this growth may be super-exponential, as shown in (Kovachki et al., 2021). However, for certain classes of operators and related approximation methods, it is possible to beat the curse of dimensionality; we refer the reader to the works (Lanthaler et al., 2021; Kovachki et al., 2021) for detailed analyses demonstrating this. Finally, we also emphasize that there is a potential source of error from the optimization procedure which attempts to minimize the empirical risk: it may not achieve the global minimum. Analysis of this error in the context of operator approximation has not been undertaken.
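As an organizing device only (not a bound proved in this paper, and written in an illustrative operator norm $\|\cdot\|$), the overall error of the trained model $\mathcal{G}_{\hat\theta}$ against the true operator $\mathcal{G}^\dagger$ can be split by the triangle inequality as
\[
\|\mathcal{G}^\dagger - \mathcal{G}_{\hat\theta}\|
\;\le\;
\underbrace{\|\mathcal{G}^\dagger - \mathcal{G}_{\theta^\ast}\|}_{\text{approximation error of the class}}
\;+\;
\underbrace{\|\mathcal{G}_{\theta^\ast} - \mathcal{G}_{\theta_N}\|}_{\text{finite-data / empirical-risk error}}
\;+\;
\underbrace{\|\mathcal{G}_{\theta_N} - \mathcal{G}_{\hat\theta}\|}_{\text{discretization and optimization error}},
\]
where $\theta^\ast$ denotes the best parameters in the class, $\theta_N$ the empirical risk minimizer computed from $N$ samples, and $\hat\theta$ the parameters actually returned by the optimizer on discretized data; only the first term is analyzed in this paper.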
10. Conclusions

We have introduced the concept of Neural Operator, the goal being to construct a neural network architecture adapted to the problem of mapping elements of one function space into elements of another function space. The network is composed of three steps which, in turn, (i) extract features from the input functions, (ii) iterate a recurrent neural network on feature space, defined through composition of a sigmoid function and a nonlocal operator, and (iii) a final mapping from feature space into the output function.

We have studied four nonlocal operators in step (ii), one based on graph kernel networks, one based on the low-rank decomposition, one based on the multi-level graph structure, and the last one based on convolution in Fourier space. The designed network architectures are constructed to be mesh-free and our numerical experiments demonstrate that they have the desired property of being able to train and generalize on different meshes. This is because the networks learn the mapping between infinite-dimensional function spaces, which can then be shared with approximations at different levels of discretization. A further advantage of the integral operator approach is that data may be incorporated on unstructured grids, using the Nyström approximation; these methods, however, are quadratic in the number of discretization points; we describe variants on this methodology, using low-rank and multiscale ideas, to reduce this complexity. On the other hand, the Fourier approach leads directly to fast methods, log-linear in the number of discretization points, provided structured grids are used. We demonstrate that our methods can achieve competitive performance with other mesh-free approaches developed in the numerical analysis community. Specifically, the Fourier neural operator achieves the best numerical performance among our experiments, potentially due to the smoothness of the solution function and the underlying uniform grids. The methods developed in the numerical analysis community are less flexible than the approach we introduce here, relying heavily on the structure of an underlying PDE mapping input to output; our method is entirely data-driven.
10.1 Future Directions

We foresee three main directions in which this work will develop: firstly, as a method to speed up scientific computing tasks which involve repeated evaluation of a mapping between spaces of functions, following the example of the Bayesian inverse problem of Section 7.3.4, or when the underlying model is unknown, as in computer vision or robotics; secondly, the development of more advanced methodologies, beyond the four approximation schemes presented in Section 4, that are more efficient or better in specific situations; and thirdly, the development of an underpinning theory which captures the expressive power and approximation error properties of the proposed neural network, following Section 8, and quantifies the computational complexity required to achieve a given error.

10.1.1 New Applications
The proposed neural operator is a black-box surrogate model for function-to-function mappings. It naturally fits into solving PDEs for physics and engineering problems. In the paper we mainly studied three partial differential equations: Darcy flow, Burgers' equation, and the Navier-Stokes equation, which cover a broad range of scenarios. Due to its black-box structure, the neural operator is easily applied to other problems. We foresee applications to more challenging turbulent flows, such as those arising in subgrid models within climate GCMs, high-contrast media in geological models generalizing the Darcy model, and general physics simulation for games and visual effects. The operator setting leads to an efficient and accurate representation, and the resolution-invariant properties make it possible to train on a lower-resolution dataset and evaluate at arbitrarily high resolution.

The operator learning setting is not restricted to scientific computing. For example, in computer vision, images can naturally be viewed as real-valued functions on 2D domains and videos simply add a temporal structure. Our approach is therefore a natural choice for problems in computer vision where invariance to discretization is crucial. We leave this as an interesting future direction.
10.1.2 New Methodologies

Despite their excellent performance, there is still room for improvement upon the current methodologies. For example, the full $O(J^2)$ integration method still outperforms the FNO by about 40%, albeit at greater cost. It is of potential interest to develop more advanced integration techniques or approximation schemes that follow the neural operator framework. For example, one can use adaptive graphs or probability estimation in the Nyström approximation. It is also possible to use bases other than the Fourier basis, such as the PCA basis or the Chebyshev basis.

Another direction for new methodologies is to use the neural operator in other settings. The current problem is set as a supervised learning problem. Instead, one can combine the neural operator with solvers (Pathak et al., 2020; Um et al., 2020b), augmenting and correcting the solvers to get faster and more accurate approximations. Similarly, one can combine operator learning with physics constraints (Wang et al., 2021; Li et al., 2021).
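A schematic of the solver-augmentation idea, in the spirit of the solver-in-the-loop approach of Um et al. (2020b) rather than a method defined in this paper, is sketched below: a cheap solver step is alternated with a learned correction. The coarse solver step and the correction network here are placeholders introduced for illustration.

```python
import torch

# A stand-in coarse solver step and a learned correction network; both are
# assumptions for illustration, not components defined in this paper.
def coarse_solver_step(u, dt=0.01):
    return u + dt * torch.roll(u, 1, dims=-1)      # placeholder cheap update

correction_net = torch.nn.Sequential(
    torch.nn.Linear(128, 128), torch.nn.GELU(), torch.nn.Linear(128, 128)
)

def hybrid_rollout(u0, steps=10):
    """Alternate a cheap solver step with a learned correction of its output."""
    u = u0
    for _ in range(steps):
        u = coarse_solver_step(u)
        u = u + correction_net(u)                  # learned correction of the coarse update
    return u

print(hybrid_rollout(torch.randn(4, 128)).shape)   # torch.Size([4, 128])
```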
10.1.3 Theory

In this work, we develop a universal approximation theory (Section 8) for neural operators. As in the work of Lu et al. (2019) studying universal approximation for DeepONet, we use linear approximation techniques. The power of non-linear approximation (DeVore, 1998), which is likely intrinsic to the success of neural operators in some settings, is still less studied, as discussed in Section 5.1; we note that DeepONet is intrinsically limited by linear approximation properties. For functions between Euclidean spaces, it is well known that a network combining two layers of linear functions with one layer of non-linear activation functions can approximate arbitrary continuous functions, and that deep neural networks can be exponentially more expressive compared to shallow networks (Poole et al., 2016). However, the issues are less clear when it comes to the choice of architecture and the scaling of the number of parameters within neural operators between Banach spaces. The approximation theory of operators is much more complex and challenging compared to that of functions over Euclidean spaces. It is important to study the class of neural operators with respect to their architecture: what spaces the true solution operators lie in, and which classes of PDEs neural operators approximate efficiently. We leave these as exciting, but open, research directions.
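To make the contrast with linear approximation concrete, recall the schematic forms of the two constructions (the notation here is illustrative; the precise definitions are given in Section 5.1 and Section 4):
\[
\mathcal{G}^{\mathrm{DeepONet}}_\theta(a)(y) \;=\; \sum_{k=1}^{p} \beta_k(a)\,\tau_k(y),
\qquad
\mathcal{G}^{\mathrm{NO}}_\theta \;=\; \mathcal{Q}\circ\sigma\big(W_L+\mathcal{K}_L\big)\circ\cdots\circ\sigma\big(W_1+\mathcal{K}_1\big)\circ\mathcal{P},
\]
so every DeepONet output lies in the fixed p-dimensional subspace spanned by the trunk functions $\tau_1,\dots,\tau_p$, which is the sense in which it is a linear approximation, while the neural operator output is produced by a composition of nonlinear integral layers and is not confined to a fixed finite-dimensional subspace.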
Acknowledgements

Z. Li gratefully acknowledges the financial support from the Kortschak Scholars, PIMCO Fellows, and Amazon AI4Science Fellows programs. A. Anandkumar is supported in part by the Bren endowed chair. K. Bhattacharya, N. B. Kovachki, B. Liu and A. M. Stuart gratefully acknowledge the financial support of the Army Research Laboratory through the Cooperative Agreement Number W911NF-12-0022. Research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-12-2-0022. AMS is also supported by NSF (award DMS-1818977). Part of this research was developed when K. Azizzadenesheli was with Purdue University. The authors are grateful to Siddhartha Mishra for his valuable feedback on this work.

The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

The computations presented here were conducted on the Resnick High Performance Cluster at the California Institute of Technology.
References
J. Aaronson. An Introduction to Infinite Ergodic Theory. Mathematical Surveys and Monographs. American Mathematical Society, 1997. ISBN 9780821804940.
R. A. Adams and J. J. Fournier. Sobolev Spaces. Elsevier Science, 2003.
Jonas Adler and Ozan Oktem. Solving ill-posed inverse problems using iterative deep neural networks. Inverse Problems, Nov 2017. doi: 10.1088/1361-6420/aa9581. URL https://doi.org/10.1088%2F1361-6420%2Faa9581.
Fernando Albiac and Nigel J. Kalton. Topics in Banach Space Theory. Graduate Texts in Mathematics. Springer, 1 edition, 2006.
Ferran Alet, Adarsh Keshav Jeewajee, Maria Bauza Villalonga, Alberto Rodriguez, Tomas Lozano-Perez, and Leslie Kaelbling. Graph element networks: adaptive, structured computation and memory. In 36th International Conference on Machine Learning. PMLR, 2019. URL http://proceedings.mlr.press/v97/alet19a.html.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
Francis Bach. Sharp analysis of low-rank kernel matrix approximations. In Conference on Learning Theory, pages 185–209, 2013.
Leah Bar and Nir Sochen. Unsupervised deep learning algorithm for PDE-based forward and inverse problems. arXiv preprint arXiv:1904.05417, 2019.
Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
Christian Beck, Sebastian Becker, Philipp Grohs, Nor Jaafari, and Arnulf Jentzen. Solving the Kolmogorov PDE by means of deep learning. Journal of Scientific Computing, 88(3), 2021.
Serge Belongie, Charless Fowlkes, Fan Chung, and Jitendra Malik. Spectral partitioning with indefinite kernels using the Nyström extension. In European Conference on Computer Vision. Springer, 2002.
Yoshua Bengio, Yann LeCun, et al. Scaling learning algorithms towards AI. Large-Scale Kernel Machines, 34(5):1–41, 2007.
Saakaar Bhatnagar, Yaser Afshar, Shaowu Pan, Karthik Duraisamy, and Shailendra Kaushik. Prediction of aerodynamic flow fields using convolutional neural networks. Computational Mechanics, pages 1–21, 2019.
Kaushik Bhattacharya, Bamdad Hosseini, Nikola B Kovachki, and Andrew M Stuart. Model reduction and neural networks for parametric PDEs. arXiv preprint arXiv:2005.03180, 2020.
V. I. Bogachev. Measure Theory, volume 2. Springer-Verlag Berlin Heidelberg, 2007.
Andrea Bonito, Albert Cohen, Ronald DeVore, Diane Guignard, Peter Jantsch, and Guergana Petrova. Nonlinear methods for model reduction. arXiv preprint arXiv:2005.02565, 2020.
Steffen Börm, Lars Grasedyck, and Wolfgang Hackbusch. Hierarchical matrices. Lecture Notes, 21:2003, 2003.
George EP Box. Science and statistics. Journal of the American Statistical Association, 71(356):791–799, 1976.
John P Boyd. Chebyshev and Fourier Spectral Methods. Courier Corporation, 2001.
Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
Alexander Brudnyi and Yuri Brudnyi. Methods of Geometric Analysis in Extension and Trace Problems, volume 1. Birkhäuser Basel, 2012.
Oscar P Bruno, Youngae Han, and Matthew M Pohlman. Accurate, high-order representation of complex three-dimensional surfaces via Fourier continuation analysis. Journal of Computational Physics, 227(2):1094–1125, 2007.
Gary J. Chandler and Rich R. Kerswell. Invariant recurrent solutions embedded in a turbulent two-dimensional Kolmogorov flow. Journal of Fluid Mechanics, 722:554–595, 2013.
Chi Chen, Weike Ye, Yunxing Zuo, Chen Zheng, and Shyue Ping Ong. Graph networks as a universal machine learning framework for molecules and crystals. Chemistry of Materials, 31(9):3564–3572, 2019.
Tianping Chen and Hong Chen. Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems. IEEE Transactions on Neural Networks, 6(4):911–917, 1995.
Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.
Z. Ciesielski and J. Domsta. Construction of an orthonormal basis in C^m(I^d) and W^m_p(I^d). Studia Mathematica, 41:211–224, 1972.
Albert Cohen and Ronald DeVore. Approximation of high-dimensional parametric PDEs. Acta Numerica, 2015. doi: 10.1017/S0962492915000033.
Albert Cohen, Ronald DeVore, Guergana Petrova, and Przemyslaw Wojtaszczyk. Optimal stable nonlinear approximation. arXiv preprint arXiv:2009.09907, 2020.
J. B. Conway. A Course in Functional Analysis. Springer-Verlag New York, 2007.
S. L. Cotter, G. O. Roberts, A. M. Stuart, and D. White. MCMC methods for functions: Modifying old algorithms to make them faster. Statistical Science, 28(3):424–446, Aug 2013. ISSN 0883-4237. doi: 10.1214/13-sts421. URL http://dx.doi.org/10.1214/13-STS421.
Simon L Cotter, Massoumeh Dashti, James Cooper Robinson, and Andrew M Stuart. Bayesian inverse problems for functions and applications to fluid mechanics. Inverse Problems, 25(11):115008, 2009.
Andreas Damianou and Neil Lawrence. Deep Gaussian processes. In Artificial Intelligence and Statistics, pages 207–215, 2013.
Maarten De Hoop, Daniel Zhengyu Huang, Elizabeth Qian, and Andrew M Stuart. The cost-accuracy trade-off in operator learning with neural networks. Journal of Machine Learning, to appear; arXiv preprint arXiv:2203.13181, 2022.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Ronald A. DeVore. Nonlinear approximation. Acta Numerica, 7:51–150, 1998.
Ronald A. DeVore. Chapter 3: The Theoretical Foundation of Reduced Basis Methods. 2014. doi: 10.1137/1.9781611974829.ch3.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
R. Dudley and Rimas Norvaisa. Concrete Functional Calculus, volume 149. 01 2011. ISBN 978-1-4419-6949-1.
R.M. Dudley and R. Norvaiša. Concrete Functional Calculus. Springer Monographs in Mathematics. Springer New York, 2010.
J. Dugundji. An extension of Tietze's theorem. Pacific Journal of Mathematics, 1(3):353–367, 1951.
Matthew M Dunlop, Mark A Girolami, Andrew M Stuart, and Aretha L Teckentrup. How deep are deep Gaussian processes? The Journal of Machine Learning Research, 19(1):2100–2145, 2018.
Weinan E. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017.
Weinan E. Principles of Multiscale Modeling. Cambridge University Press, Cambridge, 2011.
Weinan E and Bing Yu. The deep Ritz method: A deep learning-based numerical algorithm for solving variational problems. Communications in Mathematics and Statistics, 3 2018. ISSN 2194-6701. doi: 10.1007/s40304-018-0127-z.
Yuwei Fan, Cindy Orozco Bohorquez, and Lexing Ying. BCR-Net: A neural network based on the nonstandard wavelet form. Journal of Computational Physics, 384:1–15, 2019a.
Yuwei Fan, Jordi Feliu-Faba, Lin Lin, Lexing Ying, and Leonardo Zepeda-Núnez. A multiscale neural network based on hierarchical nested bases. Research in the Mathematical Sciences, 6(2):21, 2019b.
Yuwei Fan, Lin Lin, Lexing Ying, and Leonardo Zepeda-Núnez. A multiscale neural network based on hierarchical matrices. Multiscale Modeling & Simulation, 17(4):1189–1213, 2019c.
Charles Fefferman. C^m extension by linear operators. Annals of Mathematics, 166:779–835, 2007.
Stefania Fresca and Andrea Manzoni. POD-DL-ROM: Enhancing deep learning-based reduced order models for nonlinear parametrized PDEs by proper orthogonal decomposition. Computer Methods in Applied Mechanics and Engineering, 388:114–181, 2022.
Jacob R Gardner, Geoff Pleiss, Ruihan Wu, Kilian Q Weinberger, and Andrew Gordon Wilson. Product kernel interpolation for scalable Gaussian processes. arXiv preprint arXiv:1802.08903, 2018.
Adrià Garriga-Alonso, Carl Edward Rasmussen, and Laurence Aitchison. Deep convolutional networks as shallow Gaussian processes. arXiv e-prints, art. arXiv:1808.05587, Aug 2018.
Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, 2017.
Amir Globerson and Roi Livni. Learning infinite-layer networks: Beyond the kernel trick. CoRR, abs/1606.05316, 2016. URL http://arxiv.org/abs/1606.05316.
Daniel Greenfeld, Meirav Galun, Ronen Basri, Irad Yavneh, and Ron Kimmel. Learning to optimize multigrid PDE solvers. In International Conference on Machine Learning, pages 2415–2423. PMLR, 2019.
Leslie Greengard and Vladimir Rokhlin. A new version of the fast multipole method for the Laplace equation in three dimensions. Acta Numerica, 6:229–269, 1997.
A Grothendieck. Produits tensoriels topologiques et espaces nucléaires, volume 16. American Mathematical Society, Providence, 1955.
John Guibas, Morteza Mardani, Zongyi Li, Andrew Tao, Anima Anandkumar, and Bryan Catanzaro. Adaptive Fourier neural operators: Efficient token mixers for transformers. arXiv preprint arXiv:2111.13587, 2021.
Xiaoxiao Guo, Wei Li, and Francesco Iorio. Convolutional neural networks for steady flow approximation. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
Morton E Gurtin. An Introduction to Continuum Mechanics. Academic Press, 1982.
William H. Guss. Deep function machines: Generalized neural networks for topological layer expression. arXiv e-prints, art. arXiv:1612.04799, Dec 2016.
Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 2017.
Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
Juncai He and Jinchao Xu. MgNet: A unified framework of multigrid and convolutional neural network. Science China Mathematics, 62(7):1331–1354, 2019.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
L Herrmann, Ch Schwab, and J Zech. Deep ReLU neural network expression rates for data-to-QoI maps in Bayesian PDE inversion. 2020.
Kurt Hornik, Maxwell Stinchcombe, Halbert White, et al. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
Chiyu Max Jiang, Soheil Esmaeilzadeh, Kamyar Azizzadenesheli, Karthik Kashinath, Mustafa Mustafa, Hamdi A Tchelepi, Philip Marcus, Anima Anandkumar, et al. MeshfreeFlowNet: A physics-constrained deep continuous space-time super-resolution framework. arXiv preprint arXiv:2005.01463, 2020.
Claes Johnson. Numerical Solution of Partial Differential Equations by the Finite Element Method. Courier Corporation, 2012.
Karthik Kashinath, Philip Marcus, et al. Enforcing physical constraints in CNNs through differentiable PDE layer. In ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations, 2020.
Yuehaw Khoo and Lexing Ying. SwitchNet: a neural network model for forward and inverse scattering problems. SIAM Journal on Scientific Computing, 41(5):A3182–A3201, 2019.
Yuehaw Khoo, Jianfeng Lu, and Lexing Ying. Solving parametric PDE problems with artificial neural networks. European Journal of Applied Mathematics, 32(3):421–435, 2021.
Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
Risi Kondor, Nedelina Teneva, and Vikas Garg. Multiresolution matrix factorization. In International Conference on Machine Learning, pages 1620–1628, 2014.
Nikola Kovachki, Samuel Lanthaler, and Siddhartha Mishra. On universal approximation and error bounds for Fourier Neural Operators. arXiv preprint arXiv:2107.07562, 2021.
Robert H. Kraichnan. Inertial ranges in two-dimensional turbulence. The Physics of Fluids, 10(7):
Juncai He and Jinchao Xu. Mgnet: A unified framework of multigrid and convolutional neural
1417–1423, 1967.
network. Science china mathematics, 62(7):1331–1354, 2019.
Brian Kulis, Mátyás Sustik, and Inderjit Dhillon. Learning low-rank kernel matrices. In Proceedings
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recog-
of the 23rd international conference on Machine learning, pages 505–512, 2006.
nition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages
Gitta Kutyniok, Philipp Petersen, Mones Raslan, and Reinhold Schneider. A theoretical analysis of 770–778, 2016.
deep neural networks and parametric pdes. Constructive Approximation, 55(1):73–125, 2022.
L Herrmann, Ch Schwab, and J Zech. Deep relu neural network expression rates for data-to-qoi
Liang Lan, Kai Zhang, Hancheng Ge, Wei Cheng, Jun Liu, Andreas Rauber, Xiao-Li Li, Jun Wang, maps in bayesian PDE inversion. 2020.
and Hongyuan Zha. Low-rank decomposition meets kernel learning: A generalized nyström
Kurt Hornik, Maxwell Stinchcombe, Halbert White, et al. Multilayer feedforward networks are
method. Artificial Intelligence, 250:1–15, 2017.
universal approximators. Neural networks, 2(5):359–366, 1989.
Samuel Lanthaler, Siddhartha Mishra, and George Em Karniadakis. Error estimates for deeponets:
Chiyu Max Jiang, Soheil Esmaeilzadeh, Kamyar Azizzadenesheli, Karthik Kashinath, Mustafa
A deep learning framework in infinite dimensions. arXiv preprint arXiv:2102.09618, 2021.
Mustafa, Hamdi A Tchelepi, Philip Marcus, Anima Anandkumar, et al. Meshfreeflownet: A
G. Leoni. A First Course in Sobolev Spaces. Graduate studies in mathematics. American Mathe- physics-constrained deep continuous space-time super-resolution framework. arXiv preprint
matical Soc., 2009. arXiv:2005.01463, 2020.
78 78
N EURAL O PERATOR : L EARNING M APS B ETWEEN F UNCTION S PACES W ITH A PPLICATIONS TO PDE S N EURAL O PERATOR : L EARNING M APS B ETWEEN F UNCTION S PACES W ITH A PPLICATIONS TO PDE S
Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, An- Claes Johnson. Numerical solution of partial differential equations by the finite element method.
drew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential Courier Corporation, 2012.
equations, 2020a.
Karthik Kashinath, Philip Marcus, et al. Enforcing physical constraints in cnns through differen-
Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, An- tiable PDE layer. In ICLR 2020 Workshop on Integration of Deep Neural Models and Differential
drew Stuart, and Anima Anandkumar. Multipole graph neural operator for parametric partial Equations, 2020.
differential equations, 2020b.
Yuehaw Khoo and Lexing Ying. Switchnet: a neural network model for forward and inverse scat-
Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, An- tering problems. SIAM Journal on Scientific Computing, 41(5):A3182–A3201, 2019.
drew Stuart, and Anima Anandkumar. Neural operator: Graph kernel network for partial differ-
ential equations. arXiv preprint arXiv:2003.03485, 2020c. Yuehaw Khoo, Jianfeng Lu, and Lexing Ying. Solving parametric PDE problems with artificial
neural networks. European Journal of Applied Mathematics, 32(3):421–435, 2021.
Zongyi Li, Hongkai Zheng, Nikola Kovachki, David Jin, Haoxuan Chen, Burigede Liu, Kamyar
Azizzadenesheli, and Anima Anandkumar. Physics-informed neural operator for learning partial Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional net-
differential equations. arXiv preprint arXiv:2111.03794, 2021. works. arXiv preprint arXiv:1609.02907, 2016.
Lu Lu, Pengzhan Jin, and George Em Karniadakis. Deeponet: Learning nonlinear operators for Risi Kondor, Nedelina Teneva, and Vikas Garg. Multiresolution matrix factorization. In Interna-
identifying differential equations based on the universal approximation theorem of operators. tional Conference on Machine Learning, pages 1620–1628, 2014.
arXiv preprint arXiv:1910.03193, 2019.
Nikola Kovachki, Samuel Lanthaler, and Siddhartha Mishra. On universal approximation and error
Lu Lu, Pengzhan Jin, Guofei Pang, Zhongqiang Zhang, and George Em Karniadakis. Learning bounds for Fourier Neural Operators. arXiv preprint arXiv:2107.07562, 2021.
nonlinear operators via deeponet based on the universal approximation theorem of operators.
Robert H. Kraichnan. Inertial ranges in two-dimensional turbulence. The Physics of Fluids, 10(7):
Nature Machine Intelligence, 3(3):218–229, 2021a.
1417–1423, 1967.
Lu Lu, Xuhui Meng, Shengze Cai, Zhiping Mao, Somdatta Goswami, Zhongqiang Zhang, and
Brian Kulis, Mátyás Sustik, and Inderjit Dhillon. Learning low-rank kernel matrices. In Proceedings
George Em Karniadakis. A comprehensive and fair comparison of two neural operators (with
of the 23rd international conference on Machine learning, pages 505–512, 2006.
practical extensions) based on fair data. arXiv preprint arXiv:2111.05512, 2021b.
Michael Mathieu, Mikael Henaff, and Yann LeCun. Fast training of convolutional networks through Gitta Kutyniok, Philipp Petersen, Mones Raslan, and Reinhold Schneider. A theoretical analysis of
ffts, 2013. deep neural networks and parametric pdes. Constructive Approximation, 55(1):73–125, 2022.
Alexander G. de G. Matthews, Mark Rowland, Jiri Hron, Richard E. Turner, and Zoubin Ghahra- Liang Lan, Kai Zhang, Hancheng Ge, Wei Cheng, Jun Liu, Andreas Rauber, Xiao-Li Li, Jun Wang,
mani. Gaussian Process Behaviour in Wide Deep Neural Networks. Apr 2018. and Hongyuan Zha. Low-rank decomposition meets kernel learning: A generalized nyström
method. Artificial Intelligence, 250:1–15, 2017.
Luis Mingo, Levon Aslanyan, Juan Castellanos, Miguel Diaz, and Vladimir Riazanov. Fourier
neural networks: An approach with sinusoidal activation functions. 2004. Samuel Lanthaler, Siddhartha Mishra, and George Em Karniadakis. Error estimates for deeponets:
A deep learning framework in infinite dimensions. arXiv preprint arXiv:2102.09618, 2021.
Ryan L Murphy, Balasubramaniam Srinivasan, Vinayak Rao, and Bruno Ribeiro. Janossy pool-
ing: Learning deep permutation-invariant functions for variable-size inputs. arXiv preprint G. Leoni. A First Course in Sobolev Spaces. Graduate studies in mathematics. American Mathe-
arXiv:1811.01900, 2018. matical Soc., 2009.
79 79
KOVACHKI , L I , L IU , A ZIZZADENESHELI , B HATTACHARYA , S TUART, A NANDKUMAR KOVACHKI , L I , L IU , A ZIZZADENESHELI , B HATTACHARYA , S TUART, A NANDKUMAR
Radford M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag, 1996. ISBN Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, An-
0387947248. drew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential
equations, 2020a.
Nicholas H Nelsen and Andrew M Stuart. The random feature model for input-output maps between
banach spaces. SIAM Journal on Scientific Computing, 43(5):A3212–A3243, 2021. Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, An-
drew Stuart, and Anima Anandkumar. Multipole graph neural operator for parametric partial
Evert J Nyström. Über die praktische auflösung von integralgleichungen mit anwendungen auf
differential equations, 2020b.
randwertaufgaben. Acta Mathematica, 1930.
Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, An-
Thomas O’Leary-Roseberry, Umberto Villa, Peng Chen, and Omar Ghattas. Derivative-informed
drew Stuart, and Anima Anandkumar. Neural operator: Graph kernel network for partial differ-
projected neural networks for high-dimensional parametric maps governed by pdes. arXiv
ential equations. arXiv preprint arXiv:2003.03485, 2020c.
preprint arXiv:2011.15110, 2020.
Zongyi Li, Hongkai Zheng, Nikola Kovachki, David Jin, Haoxuan Chen, Burigede Liu, Kamyar
Joost A.A. Opschoor, Christoph Schwab, and Jakob Zech. Deep learning in high dimension: Relu
Azizzadenesheli, and Anima Anandkumar. Physics-informed neural operator for learning partial
network expression rates for bayesian PDE inversion. SAM Research Report, 2020-47, 2020.
differential equations. arXiv preprint arXiv:2111.03794, 2021.
Shaowu Pan and Karthik Duraisamy. Physics-informed probabilistic learning of linear embeddings
Lu Lu, Pengzhan Jin, and George Em Karniadakis. Deeponet: Learning nonlinear operators for
of nonlinear dynamics with guaranteed stability. SIAM Journal on Applied Dynamical Systems,
identifying differential equations based on the universal approximation theorem of operators.
19(1):480–509, 2020.
arXiv preprint arXiv:1910.03193, 2019.
Jaideep Pathak, Mustafa Mustafa, Karthik Kashinath, Emmanuel Motheau, Thorsten Kurth, and
Lu Lu, Pengzhan Jin, Guofei Pang, Zhongqiang Zhang, and George Em Karniadakis. Learning
Marcus Day. Using machine learning to augment coarse-grid computational fluid dynamics sim-
nonlinear operators via deeponet based on the universal approximation theorem of operators.
ulations, 2020.
Nature Machine Intelligence, 3(3):218–229, 2021a.
Aleksander Pełczyński and Michał Wojciechowski. Contribution to the isomorphic classification of
Lu Lu, Xuhui Meng, Shengze Cai, Zhiping Mao, Somdatta Goswami, Zhongqiang Zhang, and
sobolev spaces lpk(omega). Recent Progress in Functional Analysis, 189:133–142, 2001.
George Em Karniadakis. A comprehensive and fair comparison of two neural operators (with
Tobias Pfaff, Meire Fortunato, Alvaro Sanchez-Gonzalez, and Peter W. Battaglia. Learning mesh- practical extensions) based on fair data. arXiv preprint arXiv:2111.05512, 2021b.
based simulation with graph networks, 2020.
Michael Mathieu, Mikael Henaff, and Yann LeCun. Fast training of convolutional networks through
A. Pinkus. N-Widths in Approximation Theory. Springer-Verlag Berlin Heidelberg, 1985. ffts, 2013.
Allan Pinkus. Approximation theory of the mlp model in neural networks. Acta Numerica, 8: Alexander G. de G. Matthews, Mark Rowland, Jiri Hron, Richard E. Turner, and Zoubin Ghahra-
143–195, 1999. mani. Gaussian Process Behaviour in Wide Deep Neural Networks. Apr 2018.
Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponen- Luis Mingo, Levon Aslanyan, Juan Castellanos, Miguel Diaz, and Vladimir Riazanov. Fourier
tial expressivity in deep neural networks through transient chaos. Advances in neural information neural networks: An approach with sinusoidal activation functions. 2004.
processing systems, 29:3360–3368, 2016.
Ryan L Murphy, Balasubramaniam Srinivasan, Vinayak Rao, and Bruno Ribeiro. Janossy pool-
Joaquin Quiñonero Candela and Carl Edward Rasmussen. A unifying view of sparse approximate ing: Learning deep permutation-invariant functions for variable-size inputs. arXiv preprint
gaussian process regression. J. Mach. Learn. Res., 6:1939–1959, 2005. arXiv:1811.01900, 2018.
80 80
N EURAL O PERATOR : L EARNING M APS B ETWEEN F UNCTION S PACES W ITH A PPLICATIONS TO PDE S N EURAL O PERATOR : L EARNING M APS B ETWEEN F UNCTION S PACES W ITH A PPLICATIONS TO PDE S
Ali Rahimi and Benjamin Recht. Uniform approximation of functions with random bases. In 2008 Radford M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag, 1996. ISBN
46th Annual Allerton Conference on Communication, Control, and Computing, pages 555–561. 0387947248.
IEEE, 2008.
Nicholas H Nelsen and Andrew M Stuart. The random feature model for input-output maps between
Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A banach spaces. SIAM Journal on Scientific Computing, 43(5):A3212–A3243, 2021.
deep learning framework for solving forward and inverse problems involving nonlinear partial
Evert J Nyström. Über die praktische auflösung von integralgleichungen mit anwendungen auf
differential equations. Journal of Computational Physics, 378:686–707, 2019.
randwertaufgaben. Acta Mathematica, 1930.
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedi- Thomas O’Leary-Roseberry, Umberto Villa, Peng Chen, and Omar Ghattas. Derivative-informed
cal image segmentation. In International Conference on Medical image computing and computer- projected neural networks for high-dimensional parametric maps governed by pdes. arXiv
assisted intervention, pages 234–241. Springer, 2015. preprint arXiv:2011.15110, 2020.
Nicolas Le Roux and Yoshua Bengio. Continuous neural networks. In Marina Meila and Xiaotong Joost A.A. Opschoor, Christoph Schwab, and Jakob Zech. Deep learning in high dimension: Relu
Shen, editors, Proceedings of the Eleventh International Conference on Artificial Intelligence and network expression rates for bayesian PDE inversion. SAM Research Report, 2020-47, 2020.
Statistics, 2007.
Shaowu Pan and Karthik Duraisamy. Physics-informed probabilistic learning of linear embeddings
Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. of nonlinear dynamics with guaranteed stability. SIAM Journal on Applied Dynamical Systems,
The graph neural network model. IEEE transactions on neural networks, 20(1):61–80, 2008. 19(1):480–509, 2020.
Christoph Schwab and Jakob Zech. Deep learning in high dimension: Neural network expression Jaideep Pathak, Mustafa Mustafa, Karthik Kashinath, Emmanuel Motheau, Thorsten Kurth, and
rates for generalized polynomial chaos expansions in UQ. Analysis and Applications, 17(01): Marcus Day. Using machine learning to augment coarse-grid computational fluid dynamics sim-
19–55, 2019. ulations, 2020.
Jonathan D Smith, Kamyar Azizzadenesheli, and Zachary E Ross. Eikonet: Solving the eikonal Allan Pinkus. Approximation theory of the mlp model in neural networks. Acta Numerica, 8:
equation with deep neural networks. arXiv preprint arXiv:2004.00361, 2020. 143–195, 1999.
Elias M. Stein. Singular Integrals and Differentiability Properties of Functions. Princeton Univer- Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponen-
sity Press, 1970. tial expressivity in deep neural networks through transient chaos. Advances in neural information
processing systems, 29:3360–3368, 2016.
A. M. Stuart. Inverse problems: A bayesian perspective. Acta Numerica, 19:451–559, 2010.
Joaquin Quiñonero Candela and Carl Edward Rasmussen. A unifying view of sparse approximate
Lloyd N Trefethen. Spectral methods in MATLAB, volume 10. Siam, 2000. gaussian process regression. J. Mach. Learn. Res., 6:1939–1959, 2005.
81 81
KOVACHKI , L I , L IU , A ZIZZADENESHELI , B HATTACHARYA , S TUART, A NANDKUMAR KOVACHKI , L I , L IU , A ZIZZADENESHELI , B HATTACHARYA , S TUART, A NANDKUMAR
Nicolas Garcia Trillos and Dejan Slepčev. A variational approach to the consistency of spectral Ali Rahimi and Benjamin Recht. Uniform approximation of functions with random bases. In 2008
clustering. Applied and Computational Harmonic Analysis, 45(2):239–281, 2018. 46th Annual Allerton Conference on Communication, Control, and Computing, pages 555–561.
IEEE, 2008.
Nicolás Garcı́a Trillos, Moritz Gerlach, Matthias Hein, and Dejan Slepčev. Error estimates for
spectral convergence of the graph laplacian on random geometric graphs toward the laplace– Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A
beltrami operator. Foundations of Computational Mathematics, 20(4):827–887, 2020. deep learning framework for solving forward and inverse problems involving nonlinear partial
differential equations. Journal of Computational Physics, 378:686–707, 2019.
Kiwon Um, Philipp Holl, Robert Brand, Nils Thuerey, et al. Solver-in-the-loop: Learning from
differentiable physics to interact with iterative PDE-solvers. arXiv preprint arXiv:2007.00016, Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedi-
2020a. cal image segmentation. In International Conference on Medical image computing and computer-
assisted intervention, pages 234–241. Springer, 2015.
Kiwon Um, Raymond, Fei, Philipp Holl, Robert Brand, and Nils Thuerey. Solver-in-the-loop:
Learning from differentiable physics to interact with iterative PDE-solvers, 2020b. Nicolas Le Roux and Yoshua Bengio. Continuous neural networks. In Marina Meila and Xiaotong
Shen, editors, Proceedings of the Eleventh International Conference on Artificial Intelligence and
Benjamin Ummenhofer, Lukas Prantl, Nils Thürey, and Vladlen Koltun. Lagrangian fluid simu- Statistics, 2007.
lation with continuous convolutions. In International Conference on Learning Representations,
2020. Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini.
The graph neural network model. IEEE transactions on neural networks, 20(1):61–80, 2008.
Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
Christoph Schwab and Jakob Zech. Deep learning in high dimension: Neural network expression
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł rates for generalized polynomial chaos expansions in UQ. Analysis and Applications, 17(01):
ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. 19–55, 2019.
Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural
Justin Sirignano and Konstantinos Spiliopoulos. Dgm: A deep learning algorithm for solving partial
Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
differential equations. Journal of computational physics, 375:1339–1364, 2018.
Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua
Vincent Sitzmann, Julien NP Martel, Alexander W Bergman, David B Lindell, and Gordon
Bengio. Graph attention networks. 2017.
Wetzstein. Implicit neural representations with periodic activation functions. arXiv preprint
Ulrike Von Luxburg, Mikhail Belkin, and Olivier Bousquet. Consistency of spectral clustering. The arXiv:2006.09661, 2020.
Annals of Statistics, pages 555–586, 2008.
Jonathan D Smith, Kamyar Azizzadenesheli, and Zachary E Ross. Eikonet: Solving the eikonal
Rui Wang, Karthik Kashinath, Mustafa Mustafa, Adrian Albert, and Rose Yu. Towards physics- equation with deep neural networks. arXiv preprint arXiv:2004.00361, 2020.
informed deep learning for turbulent flow prediction. In Proceedings of the 26th ACM SIGKDD
Elias M. Stein. Singular Integrals and Differentiability Properties of Functions. Princeton Univer-
International Conference on Knowledge Discovery & Data Mining, pages 1457–1466, 2020.
sity Press, 1970.
Sifan Wang, Hanwen Wang, and Paris Perdikaris. Learning the solution operator of parametric
A. M. Stuart. Inverse problems: A bayesian perspective. Acta Numerica, 19:451–559, 2010.
partial differential equations with physics-informed deeponets. arXiv preprint arXiv:2103.10974,
2021. Lloyd N Trefethen. Spectral methods in MATLAB, volume 10. Siam, 2000.
82 82
N EURAL O PERATOR : L EARNING M APS B ETWEEN F UNCTION S PACES W ITH A PPLICATIONS TO PDE S N EURAL O PERATOR : L EARNING M APS B ETWEEN F UNCTION S PACES W ITH A PPLICATIONS TO PDE S
Gege Wen, Zongyi Li, Kamyar Azizzadenesheli, Anima Anandkumar, and Sally M Benson. U- Nicolas Garcia Trillos and Dejan Slepčev. A variational approach to the consistency of spectral
fno–an enhanced fourier neural operator based-deep learning model for multiphase flow. arXiv clustering. Applied and Computational Harmonic Analysis, 45(2):239–281, 2018.
preprint arXiv:2109.03697, 2021.
Nicolás Garcı́a Trillos, Moritz Gerlach, Matthias Hein, and Dejan Slepčev. Error estimates for
Hassler Whitney. Functions differentiable on the boundaries of regions. Annals of Mathematics, 35 spectral convergence of the graph laplacian on random geometric graphs toward the laplace–
(3):482–485, 1934. beltrami operator. Foundations of Computational Mathematics, 20(4):827–887, 2020.
Christopher K. I. Williams. Computing with infinite networks. In Proceedings of the 9th Interna-
Kiwon Um, Philipp Holl, Robert Brand, Nils Thuerey, et al. Solver-in-the-loop: Learning from
tional Conference on Neural Information Processing Systems, Cambridge, MA, USA, 1996. MIT
differentiable physics to interact with iterative PDE-solvers. arXiv preprint arXiv:2007.00016,
Press.
2020a.
Chen Zhu, Wei Ping, Chaowei Xiao, Mohammad Shoeybi, Tom Goldstein, Anima Anandkumar,
Kiwon Um, Raymond, Fei, Philipp Holl, Robert Brand, and Nils Thuerey. Solver-in-the-loop:
and Bryan Catanzaro. Long-short transformer: Efficient transformers for language and vision. In
Learning from differentiable physics to interact with iterative PDE-solvers, 2020b.
Advances in Neural Information Processing Systems, 2021.
Yinhao Zhu and Nicholas Zabaras. Bayesian deep convolutional encoder–decoder networks Benjamin Ummenhofer, Lukas Prantl, Nils Thürey, and Vladlen Koltun. Lagrangian fluid simu-
for surrogate modeling and uncertainty quantification. Journal of Computational Physics, lation with continuous convolutions. In International Conference on Learning Representations,
2018. ISSN 0021-9991. doi: https://doi.org/10.1016/j.jcp.2018.04.018. URL http://www. 2020.
sciencedirect.com/science/article/pii/S0021999118302341.
Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł
ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S.
Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural
Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua
Bengio. Graph attention networks. 2017.
Ulrike Von Luxburg, Mikhail Belkin, and Olivier Bousquet. Consistency of spectral clustering. The
Annals of Statistics, pages 555–586, 2008.
Rui Wang, Karthik Kashinath, Mustafa Mustafa, Adrian Albert, and Rose Yu. Towards physics-
informed deep learning for turbulent flow prediction. In Proceedings of the 26th ACM SIGKDD
International Conference on Knowledge Discovery & Data Mining, pages 1457–1466, 2020.
Sifan Wang, Hanwen Wang, and Paris Perdikaris. Learning the solution operator of parametric
partial differential equations with physics-informed deeponets. arXiv preprint arXiv:2103.10974,
2021.
83 83
KOVACHKI , L I , L IU , A ZIZZADENESHELI , B HATTACHARYA , S TUART, A NANDKUMAR KOVACHKI , L I , L IU , A ZIZZADENESHELI , B HATTACHARYA , S TUART, A NANDKUMAR
Appendix A.

Notation   Meaning

Operator Learning
D ⊂ R^d   The spatial domain for the PDE.
x ∈ D   Points in the spatial domain.
a ∈ A = A(D; R^{d_a})   The input functions (coefficients, boundaries, and/or initial conditions).
u ∈ U = U(D; R^{d_u})   The target solution functions.
D_j   The discretization of (a_j, u_j).
G† : A → U   The operator mapping the coefficients to the solutions.
µ   A probability measure from which the a_j are sampled.

Neural Operator
v(x) ∈ R^{d_v}   The neural network representation of u(x).
d_a   Dimension of the input a(x).
d_u   Dimension of the output u(x).
d_v   The dimension of the representation v(x).
t = 0, . . . , T   The layer (iteration) in the neural operator.
P, Q   The pointwise linear transformations P : a(x) ↦ v_0(x) and Q : v_T(x) ↦ u(x).
K   The integral operator in the iterative update v_t ↦ v_{t+1}.
κ : R^{2(d+1)} → R^{d_v×d_v}   The kernel mapping (x, y, a(x), a(y)) to a d_v × d_v matrix.
K ∈ R^{n×n×d_v×d_v}   The kernel matrix with K_{xy} = κ(x, y).
W ∈ R^{d_v×d_v}   The pointwise linear transformation used as the bias term in the iterative update.
σ   The activation function.

Table 10: Table of notations: operator learning and neural operators.

In the paper, we will use lowercase letters such as v, u to represent vectors and functions, and uppercase letters such as W, K to represent matrices and operators. For a Banach space X, we denote by X∗ its topological dual, the Banach space of continuous linear functionals f : X → R with the norm

∥f∥_{X∗} = sup_{x∈X, ∥x∥_X=1} |f(x)| < ∞.
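For concreteness, the entries of Table 10 can be assembled into the layer update they describe; the following display is only an assembly of the notation above (the iterative update v_t ↦ v_{t+1} driven by K, W, and σ), not a new definition:

v_{t+1}(x) = σ( W v_t(x) + (K v_t)(x) ),   (K v_t)(x) = ∫_D κ(x, y, a(x), a(y)) v_t(y) dy,   x ∈ D,

so that a ↦ v_0 = P a, the representation is updated through T such layers, and u = Q v_T.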
For any Banach space Y, we denote by L(X; Y) the Banach space of continuous linear maps T : X → Y with the operator norm

∥T∥_{X→Y} = sup_{x∈X, ∥x∥_X=1} ∥Tx∥_Y < ∞.

We will abuse notation and write ∥ · ∥ for any operator norm when there is no ambiguity about the spaces in question.

Let d ∈ N. We say that D ⊂ R^d is a domain if it is a bounded and connected open set that is topologically regular, i.e. int(D̄) = D. Note that, in the case d = 1, a domain is any bounded, open interval. For d ≥ 2, we say D is a Lipschitz domain if ∂D can be locally represented as the graph of a Lipschitz continuous function defined on an open ball of R^{d−1}. If d = 1, we will call any domain a Lipschitz domain. For any multi-index α ∈ N_0^d, we write ∂^α f for the α-th weak partial derivative of f when it exists.

Let D ⊂ R^d be a domain. For any m ∈ N_0, we define the following spaces

C(D) = {f : D → R : f is continuous},
C^m(D) = {f : D → R : ∂^α f ∈ C^{m−|α|_1}(D) ∀ 0 ≤ |α|_1 ≤ m},
C_b^m(D) = {f ∈ C^m(D) : max_{0≤|α|_1≤m} sup_{x∈D} |∂^α f(x)| < ∞},
C^m(D̄) = {f ∈ C_b^m(D) : ∂^α f is uniformly continuous ∀ 0 ≤ |α|_1 ≤ m}

and make the equivalent definitions when D is replaced with R^d. Note that any function in C^m(D̄) has a unique, bounded, continuous extension from D to D̄ and is hence uniquely defined on ∂D. We will work with this extension without further notice. We remark that when D is a Lipschitz domain, the following definition for C^m(D̄) is equivalent

C^m(D̄) = {f : D̄ → R : ∃F ∈ C^m(R^d) such that f ≡ F|_D̄},

see Whitney (1934); Brudnyi and Brudnyi (2012). We define C^∞(D) = ∩_{m=0}^∞ C^m(D) and, similarly, C_b^∞(D) and C^∞(D̄). We further define

C_c^∞(D) = {f ∈ C^∞(D) : supp(f) ⊂ D is compact}

and, again, note that all definitions hold analogously for R^d. We denote by ∥·∥_{C^m} : C_b^m(D) → R_{≥0} the norm

∥f∥_{C^m} = max_{0≤|α|_1≤m} sup_{x∈D} |∂^α f(x)|

which makes C_b^m(D) (also with D = R^d) and C^m(D̄) Banach spaces.
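As a simple illustration of the distinction between these spaces (a standard example, not taken from the paper): let D = (0, 1) and f(x) = √x. Then f ∈ C^∞(D) and f ∈ C(D̄), since f extends continuously to [0, 1]; however ∂f(x) = 1/(2√x) is unbounded on D, so f ∈ C^1(D) but f ∉ C_b^1(D), and hence f ∉ C^1(D̄).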
For any n ∈ N, we write C(D; R^n) for the n-fold Cartesian product of C(D) and similarly for all other spaces we have defined or will define subsequently. We will continue to write ∥ · ∥_{C^m} for the norm on C_b^m(D; R^n) and C^m(D̄; R^n) defined as

∥f∥_{C^m} = max_{j∈{1,...,n}} ∥f_j∥_{C^m}.

For any m ∈ N and 1 ≤ p ≤ ∞, we use the notation W^{m,p}(D) for the standard L^p-type Sobolev space with m derivatives; we refer the reader to Adams and Fournier (2003) for a formal definition. Furthermore, we, at times, use the notation W^{0,p}(D) = L^p(D) and W^{m,2}(D) = H^m(D). Since we use the standard definitions of Sobolev spaces that can be found in any reference on the subject, we do not give the specifics here.

Appendix B.

In this section we gather various results on the approximation property of Banach spaces. The main results are Lemma 22, which states that if two Banach spaces have the approximation property then continuous maps between them can be approximated in a finite-dimensional manner, and Lemma 26, which states that the spaces in Assumptions 9 and 10 have the approximation property.

Definition 15 A Banach space X has a Schauder basis if there exist some {φ_j}_{j=1}^∞ ⊂ X and {c_j}_{j=1}^∞ ⊂ X∗ such that

1. c_j(φ_k) = δ_{jk} for any j, k ∈ N,

2. lim_{n→∞} ∥x − Σ_{j=1}^n c_j(x)φ_j∥_X = 0 for all x ∈ X.

We remark that Definition 15 is equivalent to the following. The elements {φ_j}_{j=1}^∞ ⊂ X are called a Schauder basis for X if, for each x ∈ X, there exists a unique sequence {α_j}_{j=1}^∞ ⊂ R such that

lim_{n→∞} ∥x − Σ_{j=1}^n α_j φ_j∥_X = 0.

For the equivalence, see, for example (Albiac and Kalton, 2006, Theorem 1.1.3). Throughout this paper we will simply write the term basis to mean Schauder basis. Furthermore, we note that if {φ_j}_{j=1}^∞ is a basis then so is {φ_j/∥φ_j∥_X}_{j=1}^∞, so we will assume that any basis we use is normalized.
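A standard example (not specific to this paper): in X = ℓ^p for 1 ≤ p < ∞, the unit vectors e_j = (0, ..., 0, 1, 0, ...) together with the coordinate functionals c_j(x) = x_j form a Schauder basis, since c_j(e_k) = δ_{jk} and ∥x − Σ_{j=1}^n x_j e_j∥_{ℓ^p}^p = Σ_{j>n} |x_j|^p → 0 as n → ∞ for every x ∈ ℓ^p.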
Definition 16 Let X be a Banach space and U ∈ L(X; X). U is called a finite rank operator if U(X) ⊆ X is finite dimensional.

By noting that any finite dimensional subspace has a basis, we may equivalently define a finite rank operator U ∈ L(X; X) to be one such that there exists a number n ∈ N and some {φ_j}_{j=1}^n ⊂ X and {c_j}_{j=1}^n ⊂ X∗ such that

U x = Σ_{j=1}^n c_j(x) φ_j,   ∀x ∈ X.

Definition 17 A Banach space X is said to have the approximation property (AP) if, for any compact set K ⊂ X and ϵ > 0, there exists a finite rank operator U : X → X such that

∥x − U x∥_X ≤ ϵ,   ∀x ∈ K.

We now state and prove some well-known results about the relationship between basis and the AP. We were unable to find the statements of the following lemmas in the form given here in the literature and therefore we provide full proofs.
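For instance (an illustrative example, not from the paper), on X = C([0, 1]) fix nodes x_1, ..., x_n ∈ [0, 1] and functions φ_1, ..., φ_n ∈ C([0, 1]); then U f = Σ_{j=1}^n f(x_j) φ_j defines a finite rank operator, since the point evaluations c_j(f) = f(x_j) are continuous linear functionals and the range of U is contained in span{φ_1, ..., φ_n}.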
Lemma 18 Let X be a Banach space with a basis. Then X has the AP.

Proof Let {c_j}_{j=1}^∞ ⊂ X∗ and {φ_j}_{j=1}^∞ ⊂ X be a basis for X. Note that there exists a constant C > 0 such that, for any x ∈ X and n ∈ N,

∥Σ_{j=1}^n c_j(x)φ_j∥_X ≤ sup_{J∈N} ∥Σ_{j=1}^J c_j(x)φ_j∥_X ≤ C∥x∥_X,

see, for example (Albiac and Kalton, 2006, Remark 1.1.6). Assume, without loss of generality, that C ≥ 1. Let K ⊂ X be compact and ϵ > 0. Since K is compact, we can find a number n = n(ϵ, C) ∈ N and elements y_1, ..., y_n ∈ K such that for any x ∈ K there exists a number l ∈ {1, ..., n} with the property that

∥x − y_l∥_X ≤ ϵ/(3C).

Since the basis expansion of each y_l converges, we can further find a number J = J(ϵ, n) ∈ N such that max_{l∈{1,...,n}} ∥y_l − Σ_{j=1}^J c_j(y_l)φ_j∥_X ≤ ϵ/3. Define the finite rank operator U : X → X by

U x = Σ_{j=1}^J c_j(x) φ_j,   ∀x ∈ X.

The triangle inequality implies that, for any x ∈ K,

∥x − U(x)∥_X ≤ ∥x − y_l∥_X + ∥y_l − U(y_l)∥_X + ∥U(y_l) − U(x)∥_X
            ≤ 2ϵ/3 + ∥Σ_{j=1}^J (c_j(y_l) − c_j(x)) φ_j∥_X
            ≤ 2ϵ/3 + C∥y_l − x∥_X
            ≤ ϵ

as desired.

Lemma 19 Let X be a Banach space with a basis and Y be any Banach space. Suppose there exists a continuous linear bijection T : X → Y. Then Y has a basis.

Proof Let y ∈ Y and ϵ > 0. Since T is a bijection, there exists an element x ∈ X so that Tx = y and T^{−1}y = x. Since X has a basis, we can find {φ_j}_{j=1}^∞ ⊂ X and {c_j}_{j=1}^∞ ⊂ X∗ and a number n = n(ϵ, ∥T∥) ∈ N such that

∥x − Σ_{j=1}^n c_j(x)φ_j∥_X ≤ ϵ/∥T∥.

Note that

∥y − Σ_{j=1}^n c_j(T^{−1}y) Tφ_j∥_Y = ∥Tx − T Σ_{j=1}^n c_j(x)φ_j∥_Y ≤ ∥T∥ ∥x − Σ_{j=1}^n c_j(x)φ_j∥_X ≤ ϵ,

hence {Tφ_j}_{j=1}^∞ ⊂ Y and {c_j(T^{−1}·)}_{j=1}^∞ ⊂ Y∗ form a basis for Y by linearity and continuity of T and T^{−1}.
Lemma 20 Let X be a Banach space with the AP and Y be any Banach space. Suppose there exists a continuous linear bijection T : X → Y. Then Y has the AP.

Proof Let K ⊂ Y be a compact set and ϵ > 0. The set R = T^{−1}(K) ⊂ X is compact since T^{−1} is continuous. Since X has the AP, there exists a finite rank operator U : X → X such that

∥x − Ux∥_X ≤ ϵ/∥T∥,   ∀x ∈ R.

Define the operator W : Y → Y by W = T U T^{−1}. Clearly W is a finite rank operator since U is a finite rank operator. Let y ∈ K then, since K = T(R), there exists x ∈ R such that Tx = y and x = T^{−1}y. Then

∥y − Wy∥_Y = ∥Tx − TUx∥_Y ≤ ∥T∥ ∥x − Ux∥_X ≤ ϵ,

hence Y has the AP.

The following lemma shows that the infinite union of compact sets is compact if each set is the image of a fixed compact set under a convergent sequence of continuous maps. The result is instrumental in proving Lemma 22.

Lemma 21 Let X, Y be Banach spaces and F : X → Y be a continuous map. Let K ⊂ X be a compact set in X and {F_n : X → Y}_{n=1}^∞ be a sequence of continuous maps such that

lim_{n→∞} sup_{x∈K} ∥F(x) − F_n(x)∥_Y = 0.

Then the set

W := ∪_{n=1}^∞ F_n(K) ∪ F(K)

is compact in Y.

Proof Let ϵ > 0 then there exists a number N = N(ϵ) ∈ N such that

sup_{x∈K} ∥F(x) − F_n(x)∥_Y ≤ ϵ/2,   ∀n ≥ N.

Define the set

W_N = ∪_{n=1}^N F_n(K) ∪ F(K)

which is compact since F and each F_n are continuous. We can therefore find a number J = J(ϵ, N) ∈ N and elements y_1, ..., y_J ∈ W_N such that, for any z ∈ W_N, there exists a number l = l(z) ∈ {1, ..., J} such that

∥z − y_l∥_Y ≤ ϵ/2.

Let y ∈ W \ W_N then there exists a number m > N and an element x ∈ K such that y = F_m(x). Since F(x) ∈ W_N, we can find a number l ∈ {1, ..., J} such that

∥F(x) − y_l∥_Y ≤ ϵ/2.

The triangle inequality then gives ∥y − y_l∥_Y ≤ ∥F_m(x) − F(x)∥_Y + ∥F(x) − y_l∥_Y ≤ ϵ, so {y_1, ..., y_J} is an ϵ-net for W and W is totally bounded. To see that W is closed, let {p_n}_{n=1}^∞ ⊂ W be a sequence converging to some p ∈ Y and write p_n = F_{α_n}(x_n) with α_n ∈ N_0 and x_n ∈ K, where we set F_0 := F. Since K is closed, lim_{n→∞} x_n = x ∈ K thus, for each fixed n ∈ N,

lim_{j→∞} F_{α_n}(x_j) = F_{α_n}(x) ∈ W

by continuity of F_{α_n}. Since uniform convergence implies point-wise convergence,

p = lim_{n→∞} F_{α_n}(x) = F_α(x) ∈ W

for some α ∈ N_0, thus p ∈ W, showing that W is closed. Being closed and totally bounded, W is compact in Y.
The following lemma shows that any continuous operator acting between two Banach spaces with the AP can be approximated in a finite-dimensional manner. The approximation proceeds in three steps which are shown schematically in Figure 16. First an input is mapped to a finite-dimensional representation via the action of a set of functionals on X. This representation is then mapped by a continuous function to a new finite-dimensional representation which serves as the set of coefficients onto representers of Y. The resulting expansion is an element of Y that is ϵ-close to the action of G on the input element. A similar finite-dimensionalization was used in (Bhattacharya et al., 2020) by using PCA on X to define the functionals acting on the input and PCA on Y to define the output representers. However the result in that work is restricted to separable Hilbert spaces; here, we generalize it to Banach spaces with the AP.

Lemma 22 Let X, Y be two Banach spaces with the AP and let G : X → Y be a continuous map. For every compact set K ⊂ X and ϵ > 0, there exist numbers J, J′ ∈ N and continuous linear maps F_J : X → R^J, G_{J′} : R^{J′} → Y as well as φ ∈ C(R^J; R^{J′}) such that

sup_{x∈K} ∥G(x) − (G_{J′} ◦ φ ◦ F_J)(x)∥_Y ≤ ϵ.

Furthermore there exist w_1, ..., w_J ∈ X∗ such that F_J has the form

F_J(x) = (w_1(x), ..., w_J(x)),   ∀x ∈ X,

and there exist β_1, ..., β_{J′} ∈ Y such that G_{J′} has the form

G_{J′}(v) = Σ_{j=1}^{J′} v_j β_j,   ∀v ∈ R^{J′}.
Proof Since X has the AP, there exists a sequence of finite rank operators {U_n^X : X → X}_{n=1}^∞ such that

lim_{n→∞} sup_{x∈K} ∥x − U_n^X x∥_X = 0.

Define the set

Z = ∪_{n=1}^∞ U_n^X(K) ∪ K

which is compact by Lemma 21. Therefore, G is uniformly continuous on Z hence there exists a modulus of continuity ω : R_{≥0} → R_{≥0} which is non-decreasing and satisfies ω(t) → ω(0) = 0 as t → 0 as well as

∥G(z_1) − G(z_2)∥_Y ≤ ω(∥z_1 − z_2∥_X)   ∀z_1, z_2 ∈ Z.

We can thus find a number N = N(ϵ) ∈ N such that

sup_{x∈K} ω(∥x − U_N^X x∥_X) ≤ ϵ/2.

Let J = dim U_N^X(X) < ∞. There exist elements {α_j}_{j=1}^J ⊂ X and {w_j}_{j=1}^J ⊂ X∗ such that

U_N^X x = Σ_{j=1}^J w_j(x) α_j,   ∀x ∈ X.

Define the maps F_J^X : X → R^J and G_J^X : R^J → X by

F_J^X(x) = (w_1(x), ..., w_J(x)),   ∀x ∈ X,
G_J^X(v) = Σ_{j=1}^J v_j α_j,   ∀v ∈ R^J,

so that U_N^X = G_J^X ◦ F_J^X. Define the compact set W := G(Z) ⊂ Y. Since Y has the AP, there exists a finite rank operator U_{J′}^Y : Y → Y, with J′ = dim U_{J′}^Y(Y) < ∞, such that

sup_{y∈W} ∥y − U_{J′}^Y y∥_Y ≤ ϵ/2.
Analogously, define the maps F_{J′}^Y : Y → R^{J′} and G_{J′}^Y : R^{J′} → Y by

F_{J′}^Y(y) = (q_1(y), ..., q_{J′}(y)),   ∀y ∈ Y,
G_{J′}^Y(v) = Σ_{j=1}^{J′} v_j β_j,   ∀v ∈ R^{J′}

for some {β_j}_{j=1}^{J′} ⊂ Y and {q_j}_{j=1}^{J′} ⊂ Y∗ such that U_{J′}^Y = G_{J′}^Y ◦ F_{J′}^Y. Clearly if Y admits a basis then we could have defined F_{J′}^Y and G_{J′}^Y through it instead of through U_{J′}^Y. Define φ : R^J → R^{J′} by

φ(v) = (F_{J′}^Y ◦ G ◦ G_J^X)(v),   ∀v ∈ R^J

which is clearly continuous and note that G_{J′}^Y ◦ φ ◦ F_J^X = U_{J′}^Y ◦ G ◦ U_N^X. Set F_J = F_J^X and G_{J′} = G_{J′}^Y then, for any x ∈ K,

∥G(x) − (G_{J′} ◦ φ ◦ F_J)(x)∥_Y ≤ ∥G(x) − G(U_N^X x)∥_Y + ∥G(U_N^X x) − (U_{J′}^Y ◦ G ◦ U_N^X)(x)∥_Y
                                 ≤ ω(∥x − U_N^X x∥_X) + sup_{y∈W} ∥y − U_{J′}^Y y∥_Y
                                 ≤ ϵ

as desired.
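A minimal numerical sketch of the factorization G ≈ G_{J′} ◦ φ ◦ F_J in Lemma 22 may help fix ideas. The choices below are illustrative assumptions, not the paper's construction: the spaces are taken as X = Y = L^2(0, 1) sampled on s grid points, the functionals w_j are inner products against sine modes (so F_J is linear), the representers β_j are the same modes, and φ is an arbitrary continuous map on the J coefficients.

import numpy as np

s, J = 256, 16                      # grid size and number of functionals/representers (assumed)
x = (np.arange(s) + 0.5) / s        # quadrature nodes on (0, 1)
modes = np.sqrt(2.0) * np.sin(np.pi * np.outer(np.arange(1, J + 1), x))  # shape (J, s)

def F_J(a):
    # Encode: w_j(a) = ∫ a(x) φ_j(x) dx, approximated by a Riemann sum.
    return modes @ a / s            # shape (J,)

def G_Jprime(v):
    # Decode: expand the coefficients onto the representers β_j.
    return v @ modes                # shape (s,)

def phi(v):
    # Any continuous map R^J -> R^{J'}; a pointwise nonlinearity serves as a stand-in here.
    return np.tanh(v)

a = np.exp(-10 * (x - 0.5) ** 2)    # an input function sampled on the grid
u = G_Jprime(phi(F_J(a)))           # the composed approximation G_{J'}(φ(F_J(a)))
print(u.shape)                      # (256,)

This encode / finite-dimensional map / decode structure is exactly the three-step scheme described before the lemma; in practice φ would be a learned map rather than a fixed nonlinearity.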
We now state and prove some results about isomorphisms of function spaces defined on different domains. These results are instrumental in proving Lemma 26.

Lemma 23 Let D, D′ ⊂ R^d be domains. Suppose that, for some m ∈ N_0, there exists a C^m-diffeomorphism τ : D̄′ → D̄. Then the mapping T : C^m(D̄) → C^m(D̄′) defined as

T(f)(x) = f(τ(x)),   ∀f ∈ C^m(D̄), x ∈ D̄′

is a continuous linear bijection.

Proof Clearly T is linear since the evaluation functional is linear. To see that it is continuous, note that by the chain rule we can find a constant Q = Q(m) > 0 such that

∥T(f)∥_{C^m} ≤ Q ∥τ∥_{C^m} ∥f∥_{C^m},   ∀f ∈ C^m(D̄).

We will now show that it is bijective. Let f, g ∈ C^m(D̄) so that f ≠ g. Then there exists a point x ∈ D̄ such that f(x) ≠ g(x). Then T(f)(τ^{−1}(x)) = f(x) and T(g)(τ^{−1}(x)) = g(x) hence T(f) ≠ T(g) thus T is injective. Now let g ∈ C^m(D̄′) and define f : D̄ → R by f = g ◦ τ^{−1}. Since τ^{−1} ∈ C^m(D̄; D̄′), we have that f ∈ C^m(D̄). Clearly, T(f) = g hence T is surjective.

Corollary 24 Let M > 0 and m ∈ N_0. There exists a continuous linear bijection T : C^m([0, 1]^d) → C^m([−M, M]^d).

Proof Let 1 ∈ R^d denote the vector in which all entries are 1. Define the map τ : R^d → R^d by

τ(x) = (1/(2M)) x + (1/2) 1,   ∀x ∈ R^d.    (48)

Clearly τ is a C^∞-diffeomorphism between [−M, M]^d and [0, 1]^d hence Lemma 23 implies the result.
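As a quick check of (48): at x = −M·1 we get τ(x) = −(1/2)1 + (1/2)1 = 0, and at x = M·1 we get τ(x) = (1/2)1 + (1/2)1 = 1, so τ maps [−M, M]^d onto [0, 1]^d componentwise; its inverse is the affine map τ^{−1}(y) = 2M(y − (1/2)1), which is likewise C^∞.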
Lemma 25 Let M > 0 and m ∈ N. There exists a continuous linear bijection T : W^{m,1}((0, 1)^d) → W^{m,1}((−M, M)^d).

Proof Define the map τ : R^d → R^d by (48). We have that τ((−M, M)^d) = (0, 1)^d. Define the operator T by

T f = f ◦ τ,   ∀f ∈ W^{m,1}((0, 1)^d),

which is clearly linear since composition is linear. We compute that, for any 0 ≤ |α|_1 ≤ m,

∂^α(f ◦ τ) = (2M)^{−|α|_1} (∂^α f) ◦ τ

hence, by the change of variables formula,

∥T f∥_{W^{m,1}((−M,M)^d)} = Σ_{0≤|α|_1≤m} (2M)^{d−|α|_1} ∥∂^α f∥_{L^1((0,1)^d)}.

We can therefore find numbers C_1, C_2 > 0, depending on M and m, such that

C_1 ∥f∥_{W^{m,1}((0,1)^d)} ≤ ∥T f∥_{W^{m,1}((−M,M)^d)} ≤ C_2 ∥f∥_{W^{m,1}((0,1)^d)}.

This shows that T : W^{m,1}((0, 1)^d) → W^{m,1}((−M, M)^d) is continuous and injective. Now let g ∈ W^{m,1}((−M, M)^d) and define f = g ◦ τ^{−1}. A similar argument shows that f ∈ W^{m,1}((0, 1)^d) and, clearly, T f = g hence T is surjective.
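For instance, with d = m = 1 the displayed identity reads ∥T f∥_{W^{1,1}((−M,M))} = 2M ∥f∥_{L^1((0,1))} + ∥f′∥_{L^1((0,1))}, so one may take C_1 = min{1, 2M} and C_2 = max{1, 2M}.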
N EURAL O PERATOR : L EARNING M APS B ETWEEN F UNCTION S PACES W ITH A PPLICATIONS TO PDE S N EURAL O PERATOR : L EARNING M APS B ETWEEN F UNCTION S PACES W ITH A PPLICATIONS TO PDE S
For every l ∈ {1, . . . , n}, define gl = RM (E(yl )) and note that gl ∈ C m ([−M, M ]d ) hence there Our proof shows the stronger result that W m,p (D) for m ∈ N0 and 1 ≤ p < ∞ has a basis, but,
exists a number J = J(ϵ, n) ∈ N such that for C m (D̄), we only establish the AP and not necessarily a basis. The discrepancy comes from the
J fact that there is an isomorphism between W m,p (D) and W m,p (Rd ) while there is not one between
X ϵ
max ∥gl − cj (gl )φj ∥C m ([−M,M ]d ) ≤ . C m (D̄) and C m (Rd ).
l∈{1,...,n} 3
j=1
Notice that, since yl = RD (gl ), we have Lemma 26 Let Assumptions 9 and 10 hold. Then A and U have the AP.
J J
X X ϵ
max ∥yl − cj RM (E(yl )) RD (φj )∥C m (D̄) ≤ ∥RD ∥ max ∥gl − cj (gl )φj ∥C m ([−M,M ]d ) ≤ .
l∈{1,...,n}
j=1
l∈{1,...,n}
j=1
3 Proof It is enough to show that the spaces W m,p (D), and C m (D̄) for any 1 ≤ p < ∞ and m ∈ N0
with D ⊂ Rd a Lipschitz domain have the AP. Consider first the spaces W 0,p (D) = Lp (D). Since
Define the finite rank operator U : C m (D̄) → C m (D̄) by
the Lebesgue measure on D is σ-finite and has no atoms, Lp (D) is isometrically isomorphic to
J
X Lp ((0, 1)) (see, for example, (Albiac and Kalton, 2006, Chapter 6)). Hence by Lemma 20, it is
∀f ∈ C m (D̄).
Uf = cj RM (E(f )) RD (φj ),
j=1 enough to show that Lp ((0, 1)) has the AP. Similarly, consider the spaces W m,p (D) for m > 0 and
p > 1. Since D is Lipschitz, there exists a continuous linear operator W m,p (D) → W m,p (Rd )
We then have that, for any f ∈ K,
(Stein, 1970, Chapter 6, Theorem 5) (this also holds for p = 1). We can therefore apply (Pełczyński
∥f − U f ∥C m (D̄) ≤ ∥f − yl ∥C m (D̄) + ∥yl − U yl ∥C m (D̄) + ∥U yl − U f ∥C m (D̄) and Wojciechowski, 2001, Corollary 4) (when p > 1) to conclude that W m,p (D) is isomorphic to
2ϵ
J
X Lp ((0, 1)). By (Albiac and Kalton, 2006, Proposition 6.1.3), Lp ((0, 1)) has a basis hence Lemma
≤ +∥ cj RM (E(yl − f )) φj ∥C m ([−M,M ]d )
3 18 implies the result.
j=1
2ϵ Now, consider the spaces C m (D̄). Since D is bounded, there exists a number M > 0 such
≤ + C1 ∥RM (E(yl − f ))∥C m ([−M,M ]d )
3 that D̄ ⊆ [−M, M ]d . Hence, by Corollary 24, C m ([0, 1]d ) is isomorphic to C m ([−M, M ]d ). Since
2ϵ
≤ + C1 ∥E∥∥yl − f ∥C m (D̄) C m ([0, 1]d ) has a basis (Ciesielski and Domsta, 1972, Theorem 5), Lemma 19 then implies that
3
C m ([−M, M ]d ) has a basis. By (Fefferman, 2007, Theorem 1), there exists a continuous linear
≤ϵ
operator E : C m (D̄) → Cbm (Rd ) such that E(f )|D̄ = f for all f ∈ C(D̄). Define the restriction
hence C m (D̄) has the AP. operators RM : Cbm (Rd ) → C m ([−M, M ]d ) and RD : C m ([−M, M ]d ) → C m (D̄) which are both
We are left with the case W m,1 (D). A similar argument as for the C m (D̄) case holds. In par- clearly linear and continuous and ∥RM ∥ = ∥RD ∥ = 1. Let {cj }∞ m d ∗ and
j=1 ⊂ C ([−M, M ] )
ticular the basis from (Ciesielski and Domsta, 1972, Theorem 5) is also a basis for W m,1 ((0, 1)d ). {φj }∞ m d m d
j=1 ⊂ C ([−M, M ] ) be a basis for C ([−M, M ] ). As in the proof of Lemma 18, there
Lemma 25 gives an isomorphism between W m,1 ((0, 1)d ) and W m,1 ((−M, M )d ) hence we may exists a constant C1 > 0 such that, for any n ∈ N and f ∈ C m ([−M, M ]d ),
use the extension operator W m,1 (D) → W m,1 (Rd ) from (Stein, 1970, Chapter 6, Theorem 5) to
n
complete the argument. In fact, the same construction yields a basis for W m,1 (D) due to the iso-
X
∥ cj (f )φj ∥C m ([−M,M ]d ) ≤ C1 ∥f ∥C m ([−M,M ]d ) .
morphism with W m,1 (Rd ), see, for example (Pełczyński and Wojciechowski, 2001, Theorem 1). j=1
Suppose, without loss of generality, that C1 ∥E∥ ≥ 1. Let K ⊂ C m (D̄) be a compact set and ϵ > 0.
Since K is compact, we can find a number n = n(ϵ) ∈ N and elements y1 , . . . , yn ∈ K such that,
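As a concrete illustration of the objects used above (an editorial aside, not needed in the sequel): for m = 0 and d = 1 the classical Faber-Schauder system of piecewise linear hat functions is a Schauder basis of C([0,1]), and the associated partial sums f ↦ Σ_{j=1}^J c_j(f) φ_j are finite rank operators of exactly the form of U constructed in the proof, so this special case already exhibits the approximation property directly.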
Appendix C.

In this section, we prove various results about the approximation of linear functionals by kernel integral operators. Lemma 27 establishes a Riesz-representation theorem for C^m. The proof proceeds exactly as in the well-known result for W^{m,p} but, since we did not find it in the literature, we give full details here. Lemma 28 shows that linear functionals on W^{m,p} can be approximated uniformly over compact sets by integral kernel operators with a C^∞ kernel. Lemmas 30 and 31 establish similar results for C and C^m respectively by employing Lemma 27. These lemmas are crucial in showing that NO(s) are universal since they imply that the functionals from Lemma 22 can be approximated by elements of IO.

Lemma 27 Let D ⊂ R^d be a domain and m ∈ N_0. For every L ∈ (C^m(D̄))^* there exist finite, signed, Radon measures {λ_α}_{0≤|α|_1≤m} such that

L(f) = Σ_{0≤|α|_1≤m} ∫_{D̄} ∂^α f dλ_α,    ∀f ∈ C^m(D̄).

Proof The case m = 0 follows directly from (Leoni, 2009, Theorem B.111), so we assume that m > 0. Let α_1, …, α_J be an enumeration of the set {α ∈ N^d : |α|_1 ≤ m}. Define the mapping T : C^m(D̄) → C(D̄; R^J) by

T f = (∂^{α_1} f, …, ∂^{α_J} f),    ∀f ∈ C^m(D̄).

Clearly ∥T f∥_{C(D̄;R^J)} = ∥f∥_{C^m(D̄)} hence T is an injective, continuous linear operator. Define W := T(C^m(D̄)) ⊂ C(D̄; R^J) then T^{−1} : W → C^m(D̄) is a continuous linear operator since T preserves norm. Thus W = (T^{−1})^{−1}(C^m(D̄)) is closed as the pre-image of a closed set under a continuous map. In particular, W is a Banach space since C(D̄; R^J) is a Banach space and T is an isometric isomorphism between C^m(D̄) and W. Therefore, there exists a continuous linear functional L̃ ∈ W^* such that

L(f) = L̃(T f),    ∀f ∈ C^m(D̄).

By the Hahn-Banach theorem, L̃ can be extended to a continuous linear functional L̄ ∈ (C(D̄; R^J))^* such that ∥L∥_{(C^m(D̄))^*} = ∥L̃∥_{W^*} = ∥L̄∥_{(C(D̄;R^J))^*}. We have that

L(f) = L̃(T f) = L̄(T f),    ∀f ∈ C^m(D̄).

Since

C(D̄; R^J) ≅ ×_{j=1}^J C(D̄),    hence    (C(D̄; R^J))^* ≅ ⊕_{j=1}^J (C(D̄))^*,

we have, by applying (Leoni, 2009, Theorem B.111) J times, that there exist finite, signed, Radon measures {λ_α}_{0≤|α|_1≤m} such that

L̄(T f) = Σ_{0≤|α|_1≤m} ∫_{D̄} ∂^α f dλ_α,    ∀f ∈ C^m(D̄)

as desired.
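As a quick illustration of the representation in Lemma 27 (an editorial example, not used in any later proof), take d = 1, m = 1, D = (0, 1), and the functional L(f) = f(0) + f′(1/2). Then the conclusion of the lemma holds with λ_0 = δ_0 and λ_1 = δ_{1/2} (for the multi-indices α = 0 and α = 1), since

L(f) = ∫_{D̄} f dδ_0 + ∫_{D̄} f′ dδ_{1/2},    ∀f ∈ C^1([0, 1]),

and both measures are finite, signed, Radon measures on D̄ = [0, 1].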
Lemma 28 Let D ⊂ R^d be a bounded, open set and L ∈ (W^{m,p}(D))^* for some m ≥ 0 and 1 ≤ p < ∞. For any closed and bounded set K ⊂ W^{m,p}(D) (compact if p = 1) and ϵ > 0, there exists a function κ ∈ C_c^∞(D) such that

sup_{u∈K} |L(u) − ∫_D κ u dx| < ϵ.

Proof First consider the case m = 0 and 1 ≤ p < ∞. By the Riesz Representation Theorem (Conway, 2007, Appendix B), there exists a function v ∈ L^q(D) such that

L(u) = ∫_D v u dx.

Since K is bounded, there is a constant M > 0 such that

sup_{u∈K} ∥u∥_{L^p} ≤ M.

Suppose p > 1, so that 1 < q < ∞. Density of C_c^∞(D) in L^q(D) (Adams and Fournier, 2003, Corollary 2.30) implies there exists a function κ ∈ C_c^∞(D) such that

∥v − κ∥_{L^q} < ϵ/M.

By the Hölder inequality,

|L(u) − ∫_D κ u dx| ≤ ∥u∥_{L^p} ∥v − κ∥_{L^q} < ϵ.

Suppose that p = 1 then q = ∞. Since K is totally bounded, there exists a number n ∈ N and functions g_1, …, g_n ∈ K such that, for any u ∈ K,

∥u − g_l∥_{L^1} < ϵ / (3∥v∥_{L^∞})

for some l ∈ {1, …, n}. Let ψ_η ∈ C_c^∞(D) denote a standard mollifier for any η > 0. We can find η > 0 small enough such that

max_{l∈{1,…,n}} ∥ψ_η ∗ g_l − g_l∥_{L^1} < ϵ / (9∥v∥_{L^∞}).

Define f = ψ_η ∗ v ∈ C(D) and note that ∥f∥_{L^∞} ≤ ∥v∥_{L^∞}. By Fubini's theorem, we find

|∫_D (f − v) g_l dx| = |∫_D v (ψ_η ∗ g_l − g_l) dx| ≤ ∥v∥_{L^∞} ∥ψ_η ∗ g_l − g_l∥_{L^1} < ϵ/9.

Since g_l ∈ L^1(D), by Lusin's theorem, we can find a compact set A ⊂ D such that

max_{l∈{1,…,n}} ∫_{D∖A} |g_l| dx < ϵ / (18∥v∥_{L^∞}).

Since C_c^∞(D) is dense in C(D) over compact sets (Leoni, 2009, Theorem C.16), we can find a function κ ∈ C_c^∞(D) such that

sup_{x∈A} |κ(x) − f(x)| ≤ ϵ / (9M)

and ∥κ∥_{L^∞} ≤ ∥f∥_{L^∞} ≤ ∥v∥_{L^∞}. We have,

|∫_D (κ − v) g_l dx| ≤ ∫_A |(κ − v) g_l| dx + ∫_{D∖A} |(κ − v) g_l| dx
    ≤ ∫_A |(κ − f) g_l| dx + ∫_D |(f − v) g_l| dx + 2∥v∥_{L^∞} ∫_{D∖A} |g_l| dx
    ≤ sup_{x∈A} |κ(x) − f(x)| ∥g_l∥_{L^1} + 2ϵ/9
    < ϵ/3.

Finally,

|L(u) − ∫_D κ u dx| ≤ |∫_D v u dx − ∫_D v g_l dx| + |∫_D v g_l dx − ∫_D κ u dx|
    ≤ ∥v∥_{L^∞} ∥u − g_l∥_{L^1} + |∫_D κ u dx − ∫_D κ g_l dx| + |∫_D κ g_l dx − ∫_D v g_l dx|
    ≤ ϵ/3 + ∥κ∥_{L^∞} ∥u − g_l∥_{L^1} + |∫_D (κ − v) g_l dx|
    ≤ 2ϵ/3 + ∥v∥_{L^∞} ∥u − g_l∥_{L^1}
    < ϵ.

Suppose m ≥ 1. By the Riesz Representation Theorem (Adams and Fournier, 2003, Theorem 3.9), there exist elements (v_α)_{0≤|α|_1≤m} of L^q(D) where α ∈ N^d is a multi-index such that

L(u) = Σ_{0≤|α|_1≤m} ∫_D v_α ∂_α u dx.

Since K is bounded, there is a constant M > 0 such that

sup_{u∈K} ∥u∥_{W^{m,p}} ≤ M.

Suppose p > 1, so that 1 < q < ∞. Density of C_0^∞(D) in L^q(D) implies there exist functions (f_α)_{0≤|α|_1≤m} in C_c^∞(D) such that

∥f_α − v_α∥_{L^q} < ϵ / (MJ)

where J = |{α ∈ N^d : |α|_1 ≤ m}|. Let

κ = Σ_{0≤|α|_1≤m} (−1)^{|α|_1} ∂_α f_α

then, by definition of a weak derivative,

∫_D κ u dx = Σ_{0≤|α|_1≤m} (−1)^{|α|_1} ∫_D ∂_α f_α u dx = Σ_{0≤|α|_1≤m} ∫_D f_α ∂_α u dx.

By the Hölder inequality,

|L(u) − ∫_D κ u dx| ≤ Σ_{0≤|α|_1≤m} ∥∂_α u∥_{L^p} ∥f_α − v_α∥_{L^q} < Σ_{0≤|α|_1≤m} M ϵ/(MJ) = ϵ.

Suppose that p = 1 then q = ∞. Define the constant C_v > 0 by

C_v = Σ_{0≤|α|_1≤m} ∥v_α∥_{L^∞}.

Since K is totally bounded, there exists a number n ∈ N and functions g_1, …, g_n ∈ K such that, for any u ∈ K,

∥u − g_l∥_{W^{m,1}} < ϵ / (3C_v)

for some l ∈ {1, …, n}. Let ψ_η ∈ C_c^∞(D) denote a standard mollifier for any η > 0. We can find η > 0 small enough such that

max_α max_{l∈{1,…,n}} ∥ψ_η ∗ ∂_α g_l − ∂_α g_l∥_{L^1} < ϵ / (9C_v).

Define f_α = ψ_η ∗ v_α ∈ C(D) and note that ∥f_α∥_{L^∞} ≤ ∥v_α∥_{L^∞}. By Fubini's theorem, we find

Σ_{0≤|α|_1≤m} |∫_D (f_α − v_α) ∂_α g_l dx| = Σ_{0≤|α|_1≤m} |∫_D v_α (ψ_η ∗ ∂_α g_l − ∂_α g_l) dx|
    ≤ Σ_{0≤|α|_1≤m} ∥v_α∥_{L^∞} ∥ψ_η ∗ ∂_α g_l − ∂_α g_l∥_{L^1}
    < ϵ/9.

Since ∂_α g_l ∈ L^1(D), by Lusin's theorem, we can find a compact set A ⊂ D such that

max_α max_{l∈{1,…,n}} ∫_{D∖A} |∂_α g_l| dx < ϵ / (18C_v).

Since C_c^∞(D) is dense in C(D) over compact sets, we can find functions w_α ∈ C_c^∞(D) such that

sup_{x∈A} |w_α(x) − f_α(x)| ≤ ϵ / (9MJ)

where J = |{α ∈ N^d : |α|_1 ≤ m}| and ∥w_α∥_{L^∞} ≤ ∥f_α∥_{L^∞} ≤ ∥v_α∥_{L^∞}. We have,

Σ_{0≤|α|_1≤m} ∫_D |(w_α − v_α) ∂_α g_l| dx = Σ_{0≤|α|_1≤m} ( ∫_A |(w_α − v_α) ∂_α g_l| dx + ∫_{D∖A} |(w_α − v_α) ∂_α g_l| dx )
    ≤ Σ_{0≤|α|_1≤m} ( ∫_A |(w_α − f_α) ∂_α g_l| dx + ∫_D |(f_α − v_α) ∂_α g_l| dx + 2∥v_α∥_{L^∞} ∫_{D∖A} |∂_α g_l| dx )
    ≤ Σ_{0≤|α|_1≤m} sup_{x∈A} |w_α(x) − f_α(x)| ∥∂_α g_l∥_{L^1} + 2ϵ/9
    < ϵ/3.

Let

κ = Σ_{0≤|α|_1≤m} (−1)^{|α|_1} ∂_α w_α.

Then, by definition of a weak derivative,

∫_D κ u dx = Σ_{0≤|α|_1≤m} (−1)^{|α|_1} ∫_D ∂_α w_α u dx = Σ_{0≤|α|_1≤m} ∫_D w_α ∂_α u dx.

Finally,

|L(u) − ∫_D κ u dx| ≤ Σ_{0≤|α|_1≤m} ∫_D |v_α ∂_α u − w_α ∂_α u| dx
    ≤ Σ_{0≤|α|_1≤m} ( ∫_D |v_α (∂_α u − ∂_α g_l)| dx + ∫_D |v_α ∂_α g_l − w_α ∂_α u| dx )
    ≤ Σ_{0≤|α|_1≤m} ∥v_α∥_{L^∞} ∥u − g_l∥_{W^{m,1}} + Σ_{0≤|α|_1≤m} ( ∫_D |(v_α − w_α) ∂_α g_l| dx + ∫_D |(∂_α g_l − ∂_α u) w_α| dx )
    < 2ϵ/3 + Σ_{0≤|α|_1≤m} ∥w_α∥_{L^∞} ∥u − g_l∥_{W^{m,1}}
    < ϵ.
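For concreteness (an editorial aside; any mollifier with the stated properties works in the two proofs above and in Lemma 30 below), one standard choice of the family ψ_η is

ψ_η(x) = η^{−d} ψ(x/η),    ψ(x) = c exp(1/(|x|_2^2 − 1)) for |x|_2 < 1,    ψ(x) = 0 otherwise,

with the constant c > 0 chosen so that ∫_{R^d} ψ dx = 1. Then ψ_η ∈ C_c^∞(R^d), ψ_η ≥ 0, ∫_{R^d} ψ_η dx = 1, and supp ψ_η = B̄_η(0), which are precisely the properties invoked when it is said that mollifiers are non-negative and integrate to one.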
Lemma 29 Let D ⊂ R^d be a domain and L ∈ (C^m(D̄))^* for some m ∈ N_0. For any compact set K ⊂ C^m(D̄) and ϵ > 0, there exist distinct points y_{11}, …, y_{1n_1}, …, y_{Jn_J} ∈ D and numbers c_{11}, …, c_{1n_1}, …, c_{Jn_J} ∈ R such that

sup_{u∈K} |L(u) − Σ_{j=1}^J Σ_{k=1}^{n_j} c_{jk} ∂^{α_j} u(y_{jk})| ≤ ϵ

where α_1, …, α_J is an enumeration of the set {α ∈ N_0^d : 0 ≤ |α|_1 ≤ m}.

Proof By Lemma 27, there exist finite, signed, Radon measures {λ_α}_{0≤|α|_1≤m} such that

L(u) = Σ_{0≤|α|_1≤m} ∫_{D̄} ∂^α u dλ_α,    ∀u ∈ C^m(D̄).

Let α_1, …, α_J be an enumeration of the set {α ∈ N_0^d : 0 ≤ |α|_1 ≤ m}. By weak density of the Dirac measures (Bogachev, 2007, Example 8.1.6), we can find points y_{11}, …, y_{1n_1}, …, y_{J1}, …, y_{Jn_J} ∈ D̄ as well as numbers c_{11}, …, c_{Jn_J} ∈ R such that

|∫_{D̄} ∂^{α_j} u dλ_{α_j} − Σ_{k=1}^{n_j} c_{jk} ∂^{α_j} u(y_{jk})| ≤ ϵ/(4J),    ∀u ∈ C^m(D̄)

for any j ∈ {1, …, J}. Therefore,

Σ_{j=1}^J |∫_{D̄} ∂^{α_j} u dλ_{α_j} − Σ_{k=1}^{n_j} c_{jk} ∂^{α_j} u(y_{jk})| ≤ ϵ/4,    ∀u ∈ C^m(D̄).

Define the constant

Q := Σ_{j=1}^J Σ_{k=1}^{n_j} |c_{jk}|.

Since K is compact, we can find functions g_1, …, g_N ∈ K such that, for any u ∈ K, there exists l ∈ {1, …, N} such that

∥u − g_l∥_{C^m} ≤ ϵ/(4Q).

Suppose that some y_{jk} ∈ ∂D. By uniform continuity, we can find a point ỹ_{jk} ∈ D such that

max_{l∈{1,…,N}} |∂^{α_j} g_l(y_{jk}) − ∂^{α_j} g_l(ỹ_{jk})| ≤ ϵ/(4Q).

Denote

S(u) = Σ_{j=1}^J Σ_{k=1}^{n_j} c_{jk} ∂^{α_j} u(y_{jk})

and by S̃(u) the sum S(u) with y_{jk} replaced by ỹ_{jk}. Then, for any u ∈ K, we have

|L(u) − S̃(u)| ≤ |L(u) − S(u)| + |S(u) − S̃(u)|
    ≤ ϵ/4 + |c_{jk} ∂^{α_j} u(ỹ_{jk}) − c_{jk} ∂^{α_j} u(y_{jk})|
    ≤ ϵ/4 + |c_{jk} ∂^{α_j} u(ỹ_{jk}) − c_{jk} ∂^{α_j} g_l(ỹ_{jk})| + |c_{jk} ∂^{α_j} g_l(ỹ_{jk}) − c_{jk} ∂^{α_j} u(y_{jk})|
    ≤ ϵ/4 + |c_{jk}| ∥u − g_l∥_{C^m} + |c_{jk} ∂^{α_j} g_l(ỹ_{jk}) − c_{jk} ∂^{α_j} g_l(y_{jk})| + |c_{jk} ∂^{α_j} g_l(y_{jk}) − c_{jk} ∂^{α_j} u(y_{jk})|
    ≤ ϵ/4 + 2|c_{jk}| ∥u − g_l∥_{C^m} + |c_{jk}| |∂^{α_j} g_l(ỹ_{jk}) − ∂^{α_j} g_l(y_{jk})|
    ≤ ϵ.

Since there are a finite number of points, this implies that all points y_{jk} can be chosen in D. Suppose now that y_{jk} = y_{qp} for some (j, k) ≠ (q, p). As before, we can always find a point ỹ_{jk} distinct from all others such that

max_{l∈{1,…,N}} |∂^{α_j} g_l(y_{jk}) − ∂^{α_j} g_l(ỹ_{jk})| ≤ ϵ/(4Q).

Repeating the previous argument then shows that all points y_{jk} can be chosen distinctly as desired.
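To make the Dirac-approximation step above more tangible (an editorial example, not part of the argument): take m = 0, D = (0, 1), and λ_0 the Lebesgue measure, so that L(u) = ∫_0^1 u dx. Choosing the midpoints y_k = (2k − 1)/(2n) with equal weights c_k = 1/n gives, for every u ∈ C([0, 1]),

|∫_0^1 u dx − (1/n) Σ_{k=1}^n u(y_k)| ≤ sup_{|x−y|≤1/(2n)} |u(x) − u(y)|,

which can be made uniformly small over any compact K ⊂ C([0, 1]) by equicontinuity, exactly as the weak density of Dirac measures is used in the proof.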
Lemma 30 Let D ⊂ R^d be a domain and L ∈ (C(D̄))^*. For any compact set K ⊂ C(D̄) and ϵ > 0, there exists a function κ ∈ C_c^∞(D) such that

sup_{u∈K} |L(u) − ∫_D κ u dx| < ϵ.

Proof By Lemma 29, we can find distinct points y_1, …, y_n ∈ D as well as numbers c_1, …, c_n ∈ R such that

sup_{u∈K} |L(u) − Σ_{j=1}^n c_j u(y_j)| ≤ ϵ/3.

Define the constant Q := Σ_{j=1}^n |c_j|. Since K is compact, there exist functions g_1, …, g_J ∈ K such that, for any u ∈ K, there exists some l ∈ {1, …, J} such that

∥u − g_l∥_C ≤ ϵ/(6nQ).

Let r > 0 be such that the open balls B_r(y_j) ⊂ D are pairwise disjoint. Let ψ_η ∈ C_c^∞(R^d) denote the standard mollifier with parameter η > 0, noting that supp ψ_r = B_r(0). We can find a number 0 < γ ≤ r such that

max_{l∈{1,…,J}, j∈{1,…,n}} |∫_D ψ_γ(x − y_j) g_l(x) dx − g_l(y_j)| ≤ ϵ/(3nQ).

Define κ : R^d → R by

κ(x) = Σ_{j=1}^n c_j ψ_γ(x − y_j),    ∀x ∈ R^d.

Since supp ψ_γ(· − y_j) ⊆ B_r(y_j), we have that κ ∈ C_c^∞(D). Then, for any u ∈ K,

|L(u) − ∫_D κ u dx| ≤ |L(u) − Σ_{j=1}^n c_j u(y_j)| + |Σ_{j=1}^n c_j u(y_j) − ∫_D κ u dx|
    ≤ ϵ/3 + Σ_{j=1}^n |c_j| |u(y_j) − ∫_D ψ_γ(x − y_j) u(x) dx|
    ≤ ϵ/3 + Q Σ_{j=1}^n ( |u(y_j) − g_l(y_j)| + |g_l(y_j) − ∫_D ψ_γ(x − y_j) u(x) dx| )
    ≤ ϵ/3 + nQ ∥u − g_l∥_C + Q Σ_{j=1}^n ( |g_l(y_j) − ∫_D ψ_γ(x − y_j) g_l(x) dx| + |∫_D ψ_γ(x − y_j)(g_l(x) − u(x)) dx| )
    ≤ ϵ/3 + nQ ∥u − g_l∥_C + nQ ϵ/(3nQ) + Q ∥g_l − u∥_C Σ_{j=1}^n ∫_D ψ_γ(x − y_j) dx
    = 2ϵ/3 + 2nQ ∥u − g_l∥_C
    ≤ ϵ

where we use the fact that mollifiers are non-negative and integrate to one.
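A minimal special case of this construction (editorial, for intuition only): if L is a single point evaluation, L(u) = u(y_1) for some y_1 ∈ D, then n = 1, c_1 = 1, and the kernel produced by the lemma is the shifted mollifier κ = ψ_γ(· − y_1). The bound above then reduces to the familiar statement that

sup_{u∈K} |u(y_1) − ∫_D ψ_γ(x − y_1) u(x) dx| → 0    as γ → 0,

uniformly over any compact K ⊂ C(D̄).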
Lemma 31 Let D ⊂ R^d be a domain and L ∈ (C^m(D̄))^*. For any compact set K ⊂ C^m(D̄) and ϵ > 0, there exist functions κ_1, …, κ_J ∈ C_c^∞(D) such that

sup_{u∈K} |L(u) − Σ_{j=1}^J ∫_D κ_j ∂^{α_j} u dx| < ϵ

where α_1, …, α_J is an enumeration of the set {α ∈ N_0^d : 0 ≤ |α|_1 ≤ m}.

Proof By Lemma 29, we find distinct points y_{11}, …, y_{1n_1}, …, y_{Jn_J} ∈ D and numbers c_{11}, …, c_{Jn_J} ∈ R such that

sup_{u∈K} |L(u) − Σ_{j=1}^J Σ_{k=1}^{n_j} c_{jk} ∂^{α_j} u(y_{jk})| ≤ ϵ/2.

Applying the proof of Lemma 30 J times to each of the inner sums, we find functions κ_1, …, κ_J ∈ C_c^∞(D) such that

max_{j∈{1,…,J}} |∫_D κ_j ∂^{α_j} u dx − Σ_{k=1}^{n_j} c_{jk} ∂^{α_j} u(y_{jk})| ≤ ϵ/(2J).

Then, for any u ∈ K,

|L(u) − Σ_{j=1}^J ∫_D κ_j ∂^{α_j} u dx| ≤ |L(u) − Σ_{j=1}^J Σ_{k=1}^{n_j} c_{jk} ∂^{α_j} u(y_{jk})| + Σ_{j=1}^J |∫_D κ_j ∂^{α_j} u dx − Σ_{k=1}^{n_j} c_{jk} ∂^{α_j} u(y_{jk})| ≤ ϵ

as desired.
Appendix D.

The following lemmas show that the three pieces used in constructing the approximation from Lemma 22, which are schematically depicted in Figure 16, can all be approximated by NO(s). Lemma 32 shows that F_J : A → R^J can be approximated by an element of IO by mapping to a vector-valued constant function. Similarly, Lemma 34 shows that G_{J′} : R^{J′} → U can be approximated by an element of IO by mapping a vector-valued constant function to the coefficients of a basis expansion. Finally, Lemma 35 shows that NO(s) can exactly represent any standard neural network by viewing the inputs and outputs as vector-valued constant functions.

Lemma 32 Let Assumption 9 hold. Let {c_j}_{j=1}^n ⊂ A^* for some n ∈ N. Define the map F : A → R^n by

F(a) = (c_1(a), …, c_n(a)),    ∀a ∈ A.

Then, for any compact set K ⊂ A, σ ∈ A_0, and ϵ > 0, there exists a number L ∈ N and neural network κ ∈ N_L(σ; R^d × R^d, R^{n×1}) such that

sup_{a∈K} sup_{y∈D̄} |F(a) − ∫_D κ(y, x) a(x) dx|_1 ≤ ϵ.

Proof Since K is bounded, there exists a number M > 0 such that

sup_{a∈K} ∥a∥_A ≤ M.

Define the constant

Q := M if A = W^{m,p}(D),    Q := M|D| if A = C(D̄),

and let p = 1 if A = C(D̄). By Lemma 28 and Lemma 30, there exist functions f_1, …, f_n ∈ C_c^∞(D) such that

max_{j∈{1,…,n}} sup_{a∈K} |c_j(a) − ∫_D f_j a dx| ≤ ϵ / (2n^{1/p}).

Since σ ∈ A_0, there exists some L ∈ N and neural networks ψ_1, …, ψ_n ∈ N_L(σ; R^d) such that

max_{j∈{1,…,n}} ∥ψ_j − f_j∥_C ≤ ϵ / (2Qn^{1/p}).

By setting all weights associated to the first argument to zero, we can modify each neural network ψ_j to a neural network ψ_j ∈ N_L(σ; R^d × R^d) so that

ψ_j(y, x) = ψ_j(x) 1(y),    ∀y, x ∈ R^d.

Define κ ∈ N_L(σ; R^d × R^d, R^{n×1}) by

κ(y, x) = [ψ_1(y, x), …, ψ_n(y, x)]^T.

Then for any a ∈ K and y ∈ D̄, we have

|F(a) − ∫_D κ(y, x) a dx|_p^p = Σ_{j=1}^n |c_j(a) − ∫_D 1(y) ψ_j(x) a(x) dx|^p
    ≤ 2^{p−1} Σ_{j=1}^n ( |c_j(a) − ∫_D f_j a dx|^p + |∫_D (f_j − ψ_j) a dx|^p )
    ≤ ϵ^p/2 + 2^{p−1} n Q^p ∥f_j − ψ_j∥_C^p
    ≤ ϵ^p

and the result follows by finite dimensional norm equivalence.

Lemma 33 Suppose D ⊂ R^d is a domain and let {c_j}_{j=1}^n ⊂ (C^m(D̄))^* for some m, n ∈ N. Define the map F : A → R^n by

F(a) = (c_1(a), …, c_n(a)),    ∀a ∈ C^m(D̄).

Then, for any compact set K ⊂ C^m(D̄), σ ∈ A_0, and ϵ > 0, there exists a number L ∈ N and neural network κ ∈ N_L(σ; R^d × R^d, R^{n×J}) such that

sup_{a∈K} sup_{y∈D̄} |F(a) − ∫_D κ(y, x) (∂^{α_1} a(x), …, ∂^{α_J} a(x)) dx|_1 ≤ ϵ

where α_1, …, α_J is an enumeration of the set {α ∈ N^d : 0 ≤ |α|_1 ≤ m}.

Proof The proof follows as in Lemma 32 by replacing the use of Lemmas 28 and 30 by Lemma 31.

Lemma 34 Let Assumption 10 hold. Let {φ_j}_{j=1}^n ⊂ U for some n ∈ N. Define the map G : R^n → U by

G(w) = Σ_{j=1}^n w_j φ_j,    ∀w ∈ R^n.

Then, for any compact set K ⊂ R^n, σ ∈ A_{m_2}, and ϵ > 0, there exists a number L ∈ N and a neural network κ ∈ N_L(σ; R^{d′} × R^{d′}, R^{1×n}) such that

sup_{w∈K} ∥G(w) − ∫_{D′} κ(·, x) w 1(x) dx∥_U ≤ ϵ.

Proof Since K ⊂ R^n is compact, there is a number M > 1 such that

sup_{w∈K} |w|_1 ≤ M.

If U = L^{p_2}(D′), then density of C_c^∞(D′) implies there are functions ψ̃_1, …, ψ̃_n ∈ C^∞(D̄′) such that

max_{j∈{1,…,n}} ∥φ_j − ψ̃_j∥_U ≤ ϵ / (2nM).

Similarly if U = W^{m_2,p_2}(D′), then density of the restriction of functions in C_c^∞(R^{d′}) to D′ (Leoni, 2009, Theorem 11.35) implies the same result. If U = C^{m_2}(D̄′) then we set ψ̃_j = φ_j for any j ∈ {1, …, n}. Define κ̃ : R^{d′} × R^{d′} → R^{1×n} by

κ̃(y, x) = (1/|D′|) [ψ̃_1(y), …, ψ̃_n(y)].

Then, for any w ∈ K,

∥G(w) − ∫_{D′} κ̃(·, x) w 1(x) dx∥_U = ∥Σ_{j=1}^n w_j φ_j − Σ_{j=1}^n w_j ψ̃_j∥_U ≤ Σ_{j=1}^n |w_j| ∥φ_j − ψ̃_j∥_U ≤ ϵ/2.

Since σ ∈ A_{m_2}, there exist neural networks ψ_1, …, ψ_n ∈ N_1(σ; R^{d′}) such that

max_{j∈{1,…,n}} ∥ψ̃_j − ψ_j∥_{C^{m_2}} ≤ ϵ / (2nM (J|D′|)^{1/p_2})

where, if U = C^{m_2}(D̄′), we set J = 1/|D′| and p_2 = 1, and otherwise J = |{α ∈ N^d : |α|_1 ≤ m_2}|. By setting all weights associated to the second argument to zero, we can modify each neural network ψ_j to a neural network ψ_j ∈ N_1(σ; R^{d′} × R^{d′}) so that

ψ_j(y, x) = ψ_j(y) 1(x),    ∀y, x ∈ R^{d′}.

Define κ ∈ N_1(σ; R^{d′} × R^{d′}, R^{1×n}) as

κ(y, x) = (1/|D′|) [ψ_1(y, x), …, ψ_n(y, x)].

Then, for any w ∈ R^n,

∫_{D′} κ(y, x) w 1(x) dx = Σ_{j=1}^n w_j ψ_j(y).

We compute that, for any j ∈ {1, …, n},

∥ψ_j − ψ̃_j∥_U ≤ |D′|^{1/p_2} ∥ψ_j − ψ̃_j∥_{C^{m_2}}    if U = L^{p_2}(D′),
∥ψ_j − ψ̃_j∥_U ≤ (J|D′|)^{1/p_2} ∥ψ_j − ψ̃_j∥_{C^{m_2}}    if U = W^{m_2,p_2}(D′),
∥ψ_j − ψ̃_j∥_U ≤ ∥ψ_j − ψ̃_j∥_{C^{m_2}}    if U = C^{m_2}(D̄′),

hence, for any w ∈ K,

∥∫_{D′} κ(y, x) w 1(x) dx − Σ_{j=1}^n w_j ψ̃_j∥_U ≤ Σ_{j=1}^n |w_j| ∥ψ_j − ψ̃_j∥_U ≤ ϵ/2.

By triangle inequality, for any w ∈ K, we have

∥G(w) − ∫_{D′} κ(·, x) w 1(x) dx∥_U ≤ ∥G(w) − ∫_{D′} κ̃(·, x) w 1(x) dx∥_U + ∥∫_{D′} κ̃(·, x) w 1(x) dx − ∫_{D′} κ(·, x) w 1(x) dx∥_U
    ≤ ϵ/2 + ∥∫_{D′} κ(·, x) w 1(x) dx − Σ_{j=1}^n w_j ψ̃_j∥_U
    ≤ ϵ

as desired.

Lemma 35 Let N, d, d′, p, q ∈ N, m, n ∈ N_0, D ⊂ R^p and D′ ⊂ R^q be domains and σ_1 ∈ A_m^L. For any φ ∈ N_N(σ_1; R^d, R^{d′}) and σ_2, σ_3 ∈ A_n, there exists a G ∈ NO_N(σ_1, σ_2, σ_3; D, D′, R^d, R^{d′}) such that

φ(w) = G(w 1)(x),    ∀w ∈ R^d, ∀x ∈ D′.

Proof We have that

φ(x) = W_N σ_1(… W_1 σ_1(W_0 x + b_0) + b_1 …) + b_N,    ∀x ∈ R^d

where W_0 ∈ R^{d_0×d}, W_1 ∈ R^{d_1×d_0}, …, W_N ∈ R^{d′×d_{N−1}} and b_0 ∈ R^{d_0}, b_1 ∈ R^{d_1}, …, b_N ∈ R^{d′} for some d_0, …, d_{N−1} ∈ N. By setting all parameters to zero except for the last bias term, we can find κ_0 ∈ N_1(σ_2; R^p × R^p, R^{d_0×d}) such that

κ_0(x, y) = (1/|D|) W_0,    ∀x, y ∈ R^p.

Similarly, we can find b̃_0 ∈ N_1(σ_2; R^p, R^{d_0}) such that

b̃_0(x) = b_0,    ∀x ∈ R^p.

Then

∫_D κ_0(y, x) w 1(x) dx + b̃_0(y) = (W_0 w + b_0) 1(y),    ∀w ∈ R^d, ∀y ∈ D.

Continuing a similar construction for all layers clearly yields the result.
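To make the construction in Lemma 35 concrete (an editorial sketch using the same constant kernels as in the proof; nothing here is needed later), consider the single-hidden-layer case N = 1, φ(w) = W_1 σ_1(W_0 w + b_0) + b_1. With κ_0 ≡ W_0/|D|, κ_1 ≡ W_1/|D|, b̃_0 ≡ b_0 and b̃_1 ≡ b_1, the corresponding neural operator is

G(f)(y) = ∫_D κ_1(y, x) σ_1(∫_D κ_0(x, z) f(z) dz + b̃_0(x)) dx + b̃_1(y),

and for a constant input f = w 1 each inner integral collapses: ∫_D κ_0(x, z) w 1(z) dz = W_0 w, so that G(w 1)(y) = W_1 σ_1(W_0 w + b_0) + b_1 = φ(w) for every y ∈ D′, as claimed.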
Appendix E.

Proof [of Theorem 8] Without loss of generality, we will assume that D = D′ and, by continuous embedding, that A = U = C(D̄). Furthermore, note that, by continuity, it suffices to show the result for the single layer

NO = { f ↦ σ_1(∫_D κ(·, y) f(y) dy + b) : κ ∈ N_{n_1}(σ_2; R^d × R^d), b ∈ N_{n_2}(σ_2; R^d), n_1, n_2 ∈ N }.

Let K ⊂ A be a compact set and (D_j)_{j=1}^∞ be a discrete refinement of D. To each discretization D_j associate partitions P_j^{(1)}, …, P_j^{(j)} ⊆ D which are pairwise disjoint, each contains a single, unique point of D_j, each has positive Lebesgue measure, and

⊔_{k=1}^j P_j^{(k)} = D.

We can do this since the points in each discretization D_j are pairwise distinct. For any G ∈ NO with parameters κ, b define the sequence of maps Ĝ_j : R^{jd} × R^j → Y by

Ĝ_j(y_1, …, y_j, w_1, …, w_j) = σ_1(Σ_{k=1}^j κ(·, y_k) w_k |P_j^{(k)}| + b(·))

for any y_k ∈ R^d and w_k ∈ R. Since K is compact, there is a constant M > 0 such that

sup_{a∈K} ∥a∥_U ≤ M.

Therefore,

sup_{x∈D̄} sup_{j∈N} ( |∫_D κ(x, y) a(y) dy| + |Σ_{k=1}^j κ(x, y_k) a(y_k) |P_j^{(k)}|| + 2|b(x)| ) ≤ 2(M|D| ∥κ∥_{C(D̄×D̄)} + ∥b∥_{C(D̄)}) := R.

Hence we need only consider σ_1 as a map [−R, R] → R. Thus, by uniform continuity, there exists a modulus of continuity ω : R_{≥0} → R_{≥0} which is continuous, non-negative, and non-decreasing on R_{≥0}, satisfies ω(z) → ω(0) = 0 as z → 0 and

|σ_1(z_1) − σ_1(z_2)| ≤ ω(|z_1 − z_2|)    ∀z_1, z_2 ∈ [−R, R].    (49)

Let ϵ > 0. Equation (49) and the non-decreasing property of ω imply that, in order to show the result, it is enough to show that there exists Q = Q(ϵ) ∈ N such that

sup_{a∈K} sup_{x∈D̄} |∫_D κ(x, y) a(y) dy − Σ_{k=1}^m κ(x, y_k) a(y_k) |P_m^{(k)}|| < ϵ

for any m ≥ Q. Since K is compact, we can find functions a_1, …, a_N ∈ K such that, for any a ∈ K, there is some n ∈ {1, …, N} such that

∥a − a_n∥_{C(D̄)} ≤ ϵ / (4|D| ∥κ∥_{C(D̄×D̄)}).

Since (D_j) is a discrete refinement, by convergence of Riemann sums, we can find some q ∈ N such that for any t ≥ q, we have

sup_{x∈D̄} |Σ_{k=1}^t κ(x, y_k) |P_t^{(k)}| − ∫_D κ(x, y) dy| < |D| ∥κ∥_{C(D̄×D̄)}

where D_t = {y_1, …, y_t}. Similarly, we can find p_1, …, p_N ∈ N such that, for any t_n ≥ p_n, we have

sup_{x∈D̄} |Σ_{k=1}^{t_n} κ(x, y_k^{(n)}) a_n(y_k^{(n)}) |P_{t_n}^{(k)}| − ∫_D κ(x, y) a_n(y) dy| < ϵ/4

where D_{t_n} = {y_1^{(n)}, …, y_{t_n}^{(n)}}. Let m ≥ max{q, p_1, …, p_N} and denote D_m = {y_1, …, y_m}. Note that,

sup_{x∈D̄} |∫_D κ(x, y) (a(y) − a_n(y)) dy| ≤ |D| ∥κ∥_{C(D̄×D̄)} ∥a − a_n∥_{C(D̄)}.

Furthermore,

sup_{x∈D̄} |Σ_{k=1}^m κ(x, y_k) (a_n(y_k) − a(y_k)) |P_m^{(k)}|| ≤ ∥a_n − a∥_{C(D̄)} sup_{x∈D̄} Σ_{k=1}^m |κ(x, y_k)| |P_m^{(k)}|
    ≤ ∥a_n − a∥_{C(D̄)} ( sup_{x∈D̄} |Σ_{k=1}^m κ(x, y_k) |P_m^{(k)}| − ∫_D κ(x, y) dy| + sup_{x∈D̄} ∫_D |κ(x, y)| dy )
    ≤ 2|D| ∥κ∥_{C(D̄×D̄)} ∥a_n − a∥_{C(D̄)}.

Therefore, for any a ∈ K, by repeated application of the triangle inequality, we find that

sup_{x∈D̄} |∫_D κ(x, y) a(y) dy − Σ_{k=1}^m κ(x, y_k) a(y_k) |P_m^{(k)}|| ≤ sup_{x∈D̄} |Σ_{k=1}^m κ(x, y_k) a_n(y_k) |P_m^{(k)}| − ∫_D κ(x, y) a_n(y) dy| + 3|D| ∥κ∥_{C(D̄×D̄)} ∥a − a_n∥_{C(D̄)}
    < ϵ/4 + 3ϵ/4 = ϵ

which completes the proof.
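A concrete instance of the partitions used in this proof (an editorial illustration; the choice of D and of the discretization below is assumed, not taken from the theorem): let D = (0, 1) and let D_j consist of the midpoints y_k = (2k − 1)/(2j), k = 1, …, j. Taking

P_j^{(k)} = ((k − 1)/j, k/j] for k = 1, …, j − 1,    P_j^{(j)} = ((j − 1)/j, 1),

gives pairwise disjoint sets of Lebesgue measure 1/j whose union is D, each containing exactly one point of D_j, and the sum Σ_{k=1}^j κ(·, y_k) w_k |P_j^{(k)}| appearing in Ĝ_j is then the ordinary midpoint Riemann sum (1/j) Σ_{k=1}^j κ(·, y_k) w_k.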
Appendix F.

Proof [of Theorem 11] The statement in Lemma 26 allows us to apply Lemma 22 to find a mapping G_1 : A → U such that

sup_{a∈K} ∥G^†(a) − G_1(a)∥_U ≤ ϵ/2

where G_1 = G ∘ ψ ∘ F with F : A → R^J, G : R^{J′} → U continuous linear maps and ψ ∈ C(R^J; R^{J′}) for some J, J′ ∈ N. By Lemma 32, we can find a sequence of maps F_t ∈ IO(σ_2; D, R, R^J) for t = 1, 2, … such that

sup_{a∈K} sup_{x∈D̄} |F_t(a)(x) − F(a)|_1 ≤ 1/t.

In particular, F_t(a)(x) = w_t(a) 1(x) for some w_t : A → R^J which is constant in space. We can therefore identify the range of F_t(a) with R^J. Define the set

Z := ∪_{t=1}^∞ F_t(K) ∪ F(K) ⊂ R^J

which is compact by Lemma 21. Since ψ is continuous, it is uniformly continuous on Z hence there exists a modulus of continuity ω : R_{≥0} → R_{≥0} which is continuous, non-negative, and non-decreasing on R_{≥0}, satisfies ω(s) → ω(0) = 0 as s → 0 and

|ψ(z_1) − ψ(z_2)|_1 ≤ ω(|z_1 − z_2|_1)    ∀z_1, z_2 ∈ Z.

We can thus find T ∈ N large enough such that

sup_{a∈K} ω(|F(a) − F_T(a)|_1) ≤ ϵ / (6∥G∥).

Since F_T is continuous, F_T(K) is compact. Since ψ is a continuous function on the compact set F_T(K) ⊂ R^J mapping into R^{J′}, we can use any classical neural network approximation theorem such as (Pinkus, 1999, Theorem 4.1) to find an ϵ-close (uniformly) neural network. Since Lemma 35 shows that neural operators can exactly mimic standard neural networks, it follows that we can find S_1 ∈ IO(σ_1; D, R^J, R^{d_1}), …, S_{N−1} ∈ IO(σ_1; D, R^{d_{N−1}}, R^{J′}) for some N ∈ N_{≥2} and d_1, …, d_{N−1} ∈ N such that

ψ̃(f) := S_{N−1} ∘ σ_1 ∘ ⋯ ∘ S_2 ∘ σ_1 ∘ S_1(f),    ∀f ∈ L^1(D; R^J)

satisfies

sup_{q∈F_T(K)} sup_{x∈D̄} |ψ(q) − ψ̃(q 1)(x)|_1 ≤ ϵ / (6∥G∥).

By construction, ψ̃ maps constant functions into constant functions and is continuous in the appropriate subspace topology of constant functions hence we can identify it as an element of C(R^J; R^{J′}) for any input constant function taking values in R^J. Then (ψ̃ ∘ F_T)(K) ⊂ R^{J′} is compact. Therefore, by Lemma 34, we can find a neural network κ ∈ N_L(σ_3; R^{d′} × R^{d′}, R^{1×J′}) for some L ∈ N such that

G̃(f) := ∫_{D′} κ(·, y) f(y) dy,    ∀f ∈ L^1(D′; R^{J′})

satisfies

sup_{y∈(ψ̃∘F_T)(K)} ∥G(y) − G̃(y 1)∥_U ≤ ϵ/6.

Define

G(a) := (G̃ ∘ ψ̃ ∘ F_T)(a) = ∫_{D′} κ(·, y) ((S_{N−1} ∘ σ_1 ∘ ⋯ ∘ σ_1 ∘ S_1 ∘ F_T)(a))(y) dy,    ∀a ∈ A,

noting that G ∈ NO_N(σ_1, σ_2, σ_3; D, D′). For any a ∈ K, define a_1 := (ψ ∘ F)(a) and ã_1 := (ψ̃ ∘ F_T)(a) so that G_1(a) = G(a_1) and G(a) = G̃(ã_1) then

∥G_1(a) − G(a)∥_U ≤ ∥G(a_1) − G(ã_1)∥_U + ∥G(ã_1) − G̃(ã_1)∥_U
    ≤ ∥G∥ |a_1 − ã_1|_1 + sup_{y∈(ψ̃∘F_T)(K)} ∥G(y) − G̃(y 1)∥_U
    ≤ ϵ/6 + ∥G∥ |(ψ ∘ F)(a) − (ψ ∘ F_T)(a)|_1 + ∥G∥ |(ψ ∘ F_T)(a) − (ψ̃ ∘ F_T)(a)|_1
    ≤ ϵ/6 + ∥G∥ ω(|F(a) − F_T(a)|_1) + ∥G∥ sup_{q∈F_T(K)} |ψ(q) − ψ̃(q)|_1
    ≤ ϵ/2.

Finally we have

∥G^†(a) − G(a)∥_U ≤ ∥G^†(a) − G_1(a)∥_U + ∥G_1(a) − G(a)∥_U ≤ ϵ/2 + ϵ/2 = ϵ

as desired.

To show boundedness, we will exhibit a neural operator G̃ that is ϵ-close to G in K and is uniformly bounded by 4M. Note first that

∥G(a)∥_U ≤ ∥G(a) − G^†(a)∥_U + ∥G^†(a)∥_U ≤ ϵ + M ≤ 2M,    ∀a ∈ K

where, without loss of generality, we assume that M ≥ 1. By construction, we have that

G(a) = Σ_{j=1}^{J′} ψ̃_j(F_T(a)) φ_j,    ∀a ∈ A

for some neural network ψ̃ : R^J → R^{J′}. Since U is a Hilbert space and by linearity, we may assume that the components φ_j are orthonormal since orthonormalizing them only requires multiplying the last layers of ψ̃ by an invertible linear map. Therefore

|ψ̃(F_T(a))|_2 = ∥G(a)∥_U ≤ 2M,    ∀a ∈ K.

Define the set W := (ψ̃ ∘ F_T)(K) ⊂ R^{J′} which is compact as before. We have

diam_2(W) = sup_{x,y∈W} |x − y|_2 ≤ sup_{x,y∈W} (|x|_2 + |y|_2) ≤ 4M.

Since σ_1 ∈ BA, there exists a number R ∈ N and a neural network β ∈ N_R(σ_1; R^{J′}, R^{J′}) such that

|β(x) − x|_2 ≤ ϵ,    ∀x ∈ W,
|β(x)|_2 ≤ 4M,    ∀x ∈ R^{J′}.

Define

G̃(a) := Σ_{j=1}^{J′} β_j(ψ̃(F_T(a))) φ_j,    ∀a ∈ A.

Lemmas 34 and 35 then show that G̃ ∈ NO_{N+R}(σ_1, σ_2, σ_3; D, D′). Notice that

sup_{a∈K} ∥G(a) − G̃(a)∥_U ≤ sup_{w∈W} |w − β(w)|_2 ≤ ϵ.

Furthermore,

∥G̃(a)∥_U ≤ ∥G̃(a) − G(a)∥_U + ∥G(a)∥_U ≤ ϵ + 2M ≤ 3M,    ∀a ∈ K.

Let a ∈ A ∖ K then there exists q ∈ R^{J′} ∖ W such that ψ̃(F_T(a)) = q and

∥G̃(a)∥_U = |β(q)|_2 ≤ 4M

as desired.
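As a quick bookkeeping check of the error budget in the first part of the proof (editorial; all quantities are as defined above): the three contributions are at most ϵ/6, ∥G∥ · ϵ/(6∥G∥) = ϵ/6 and ∥G∥ · ϵ/(6∥G∥) = ϵ/6, so ∥G_1(a) − G(a)∥_U ≤ ϵ/2, which combined with sup_{a∈K} ∥G^†(a) − G_1(a)∥_U ≤ ϵ/2 gives the stated bound ϵ.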
Appendix G.

Proof [of Theorem 13] Let U = H^{m_2}(D). For any R > 0, define

G_R^†(a) := G^†(a) if ∥G^†(a)∥_U ≤ R,    G_R^†(a) := (R / ∥G^†(a)∥_U) G^†(a) otherwise,

for any a ∈ A. Since G_R^† → G^† as R → ∞ µ-almost everywhere, G^† ∈ L_µ^2(A; U), and clearly ∥G_R^†(a)∥_U ≤ ∥G^†(a)∥_U for any a ∈ A, we can apply the dominated convergence theorem for Bochner integrals to find R > 0 large enough such that

∥G_R^† − G^†∥_{L_µ^2(A;U)} ≤ ϵ/3.

Since A and U are Polish spaces, by Lusin's theorem (Aaronson, 1997, Theorem 1.0.0) we can find a compact set K ⊂ A such that

µ(A ∖ K) ≤ ϵ^2 / (153R^2)

and G_R^†|_K is continuous. Since K is closed, by a generalization of the Tietze extension theorem (Dugundji, 1951, Theorem 4.1), there exists a continuous mapping G̃_R^† : A → U such that G̃_R^†(a) = G_R^†(a) for all a ∈ K and

sup_{a∈A} ∥G̃_R^†(a)∥ ≤ sup_{a∈A} ∥G_R^†(a)∥ ≤ R.

Applying Theorem 11 to G̃_R^†, we find that there exists a number N ∈ N and a neural operator G ∈ NO_N(σ_1, σ_2, σ_3; D, D′) such that

sup_{a∈K} ∥G(a) − G_R^†(a)∥_U ≤ √2 ϵ/3

and

sup_{a∈A} ∥G(a)∥_U ≤ 4R.

We then have

∥G^† − G∥_{L_µ^2(A;U)} ≤ ∥G^† − G_R^†∥_{L_µ^2(A;U)} + ∥G_R^† − G∥_{L_µ^2(A;U)}
    ≤ ϵ/3 + ( ∫_K ∥G_R^†(a) − G(a)∥_U^2 dµ(a) + ∫_{A∖K} ∥G_R^†(a) − G(a)∥_U^2 dµ(a) )^{1/2}
    ≤ ϵ/3 + ( 2ϵ^2/9 + 2 sup_{a∈A} (∥G_R^†(a)∥_U^2 + ∥G(a)∥_U^2) µ(A ∖ K) )^{1/2}
    ≤ ϵ/3 + ( 2ϵ^2/9 + 34R^2 µ(A ∖ K) )^{1/2}
    ≤ ϵ/3 + ( 4ϵ^2/9 )^{1/2}
    = ϵ

as desired.
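For the reader's convenience, the arithmetic behind the last three inequalities (a check of the constants used above, nothing new): since sup_{a∈A} ∥G_R^†(a)∥_U ≤ R and sup_{a∈A} ∥G(a)∥_U ≤ 4R, we have 2 sup_{a∈A} (∥G_R^†(a)∥_U^2 + ∥G(a)∥_U^2) ≤ 2(R^2 + 16R^2) = 34R^2, and 34R^2 µ(A ∖ K) ≤ 34R^2 · ϵ^2/(153R^2) = 2ϵ^2/9, so the bracketed term is at most 2ϵ^2/9 + 2ϵ^2/9 = 4ϵ^2/9, whose square root is 2ϵ/3; together with the leading ϵ/3 this yields ϵ.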