Learning data-driven discretizations for partial differential equations
Yohai Bar-Sinai a,1,2, Stephan Hoyer b,1,2, Jason Hickey b, and Michael P. Brenner a,b

a School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138; and b Google Research, Mountain View, CA 94043
Edited by John B. Bell, Lawrence Berkeley National Laboratory, Berkeley, CA, and approved June 21, 2019 (received for review August 14, 2018)
The numerical solution of partial differential equations (PDEs) is challenging because of the need to resolve spatiotemporal features over wide length- and timescales. Often, it is computationally intractable to resolve the finest features in the solution. The only recourse is to use approximate coarse-grained representations, which aim to accurately represent long-wavelength dynamics while properly accounting for unresolved small-scale physics. Deriving such coarse-grained equations is notoriously difficult and often ad hoc. Here we introduce data-driven discretization, a method for learning optimized approximations to PDEs based on actual solutions to the known underlying equations. Our approach uses neural networks to estimate spatial derivatives, which are optimized end to end to best satisfy the equations on a low-resolution grid. The resulting numerical methods are remarkably accurate, allowing us to integrate in time a collection of nonlinear equations in 1 spatial dimension at resolutions 4× to 8× coarser than is possible with standard finite-difference methods.

coarse graining | machine learning | computational physics
Consider a generic PDE, describing the evolution of a continuous field v(x, t),

\frac{\partial v}{\partial t} = F\left( t, x, v, \frac{\partial v}{\partial x_i}, \frac{\partial^2 v}{\partial x_i \partial x_j}, \cdots \right).   [1]

Most PDEs in the exact sciences can be cast in this form, including equations that describe hydrodynamics, electrodynamics, chemical kinetics, and elasticity. A common algorithm to numerically solve such equations is the method of lines (18): Given a spatial discretization x_1, ..., x_N, the field v(x, t) is represented by its values at node points v_i(t) = v(x_i, t) (finite differences) or by its averages over a grid cell, v_i(t) = \Delta x^{-1} \int_{x_i - \Delta x/2}^{x_i + \Delta x/2} v(x', t) \, dx' (finite volumes), where \Delta x = x_i - x_{i-1} is the spatial resolution (19). The time evolution of v_i can be computed directly from Eq. 1 by approximating the spatial derivatives at these points. There are various methods for this approximation—polynomial expansion, spectral differentiation, etc.—all yielding formulas resembling

\frac{\partial v_i}{\partial t} = F(t, x, v_1, \ldots, v_N)   [3]

that can be numerically integrated using standard techniques. The accuracy of the solution to Eq. 3 depends on \Delta x, converging to a solution of Eq. 1 as \Delta x \to 0. Qualitatively, accuracy requires that \Delta x be smaller than the spatial scale of the smallest feature of the field v(x, t).
However, the scale of the smallest features is often orders of magnitude smaller than the system size. High-performance computing has been driven by the ever-increasing need to accurately resolve smaller-scale features in PDEs. Even with petascale computational resources, the largest direct numerical simulation of a turbulent fluid flow ever performed has a Reynolds number of order 1,000, using about 5 × 10^11 grid points (22–24). Simulations at higher Reynolds number require replacing the physical equations with effective equations that model the unresolved physics. These equations are then discretized and solved numerically, e.g., using the method of lines. This overall procedure essentially modifies Eq. 2, by changing the α_i to account for the unresolved degrees of freedom, replacing the discrete equations in Eq. 3 with a different set of discrete equations. [Our approach is] different from coarse-graining techniques that are currently in use: Instead of analyzing equations of motion to derive effective behavior, we directly learn from high-resolution solutions to these equations.

Related Work

Several related approaches for computationally extracting effective dynamics have been previously introduced. Classic works used neural networks for discretizing dynamical systems (5, 6). Similarly, equation-free modeling approximates coarse-scale derivatives by remapping coarse initial conditions to fine scales which are integrated exactly (7). The method has a similar spirit to our approach, but it does not learn from fine-scale dynamics and then use the memorized statistics at subsequent times to reduce the computational load. Recent works have applied machine learning to partial differential equations (PDEs), either focusing on speed (8–10) or recovering unknown dynamics (11, 12). Models focused on speed often replace the slowest component of a physical model with machine learning, e.g., the solution of Poisson's equation in incompressible fluid simulations (9), subgrid cloud models in climate simulations (10), or building [...]

[In data-driven discretization, spatial derivatives are approximated by a generalized finite-difference formula,

\frac{\partial^n v}{\partial x^n} \approx \sum_i \alpha_i^{(n)} v_i,   [2]

whose coefficients α_i^{(n)}] are equation dependent. Different regions in space (e.g., inside and outside a shock) will use different coefficients. To discover these formulas, we use machine learning: We first generate a training set of high-resolution data and then learn the discrete approximations to the derivatives in Eq. 2 from this dataset. This produces a tradeoff in computational cost, which can be alleviated by carrying out high-resolution simulations on small systems to develop local approximations to the solution manifold and using them to solve equations in much larger systems at significantly reduced spatial resolution.

Burgers' Equation. For concreteness, we demonstrate this approach with a specific example in 1 spatial dimension. Burgers' equation is a simple nonlinear equation which models fluid dynamics in 1D and features shock formation. In its conservative form, it is written as

\frac{\partial v}{\partial t} + \frac{\partial}{\partial x} J\left( v, \frac{\partial v}{\partial x} \right) = f(x, t), \qquad J \equiv \frac{v^2}{2} - \eta \frac{\partial v}{\partial x}.   [4]
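The training set just described consists of fully resolved solutions projected onto a coarse grid. A minimal sketch of that projection by cell averaging (matching the finite-volume representation used throughout); the snapshot and resample factor are illustrative:

```python
import numpy as np

def coarse_grain(v_fine, factor):
    """Cell-average a fully resolved snapshot onto a grid `factor` times
    coarser, giving the low-resolution training representation."""
    assert v_fine.size % factor == 0
    return v_fine.reshape(-1, factor).mean(axis=1)

# A 512-point "exact" snapshot becomes a 64-point training sample.
x = np.linspace(0, 2 * np.pi, 512, endpoint=False)
v_coarse = coarse_grain(np.sin(x), factor=8)
```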
Fig. 1. Polynomial vs. neural net-based interpolation. (A) Interpolation between known points (blue diamonds) on a segment of a typical solution of Burgers' equation. Polynomial interpolation exhibits spurious "overshoots" in the vicinity of shock fronts. These errors compound when integrated in time, such that a naive finite-difference method at this resolution quickly diverges. In contrast, the neural network interpolation is so close to the exact solution that it cannot be visually distinguished. (B) Histogram of exact vs. interpolated function values over our full validation dataset. The neural network vastly reduces the number of poor predictions. (C) Absolute error vs. local curvature. The thick line shows the median and the shaded region shows the central 90% of the distribution over the validation set. The neural network makes much smaller errors in regions of high curvature, which correspond to shocks.

With this in mind, consider a typical segment of a solution to Burgers' equation (Fig. 1A). We want to compute the time derivative of the field given a low-resolution set of points (blue diamonds in Fig. 1). Standard finite-difference formulas predict this time derivative by approximating v as a piecewise-polynomial function passing through the given points (orange curves in Fig. 1). But solutions to Burgers' equation are not polynomials: They are shocks with characteristic properties. By using this information, we can derive a more accurate, albeit equation-specific, formula for the spatial derivatives. For the method to work it should be possible to reconstruct the fine-scale solution from low-resolution data. To this end, we ran many simulations of Eq. 4 and used the resulting data to train a neural network. Fig. 1 compares the predictions of our neural net (details below and in SI Appendix) to fourth-order polynomial interpolation. This learned model is clearly far superior to the polynomial approximation, demonstrating that the spatial resolution required for parameterizing the solution manifold can be greatly reduced with equation-specific approximations rather than finite differences.
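The polynomial "overshoots" of Fig. 1A are easy to reproduce. A self-contained sketch, with made-up sample values standing in for a shock (not the paper's data):

```python
import numpy as np

# Fourth-order polynomial interpolation through 5 coarse samples of a
# step-like profile; values are illustrative.
x_coarse = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
v_coarse = np.array([1.0, 1.0, 0.0, -1.0, -1.0])  # sharp transition near x = 0

coeffs = np.polyfit(x_coarse, v_coarse, deg=4)  # exact interpolant here
x_fine = np.linspace(-2, 2, 401)
v_interp = np.polyval(coeffs, x_fine)

# Prints True: the interpolant exceeds the sample range near the jump,
# the spurious overshoot visible in Fig. 1A.
print(v_interp.max() > v_coarse.max())
```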
Models for Time Integration

The natural question to ask next is whether such parameterizations can be used for time integration. For this to work well, integration in time must be numerically stable, and our models need a strong generalization capacity: Even a single error could throw off the solution for later times.

To achieve this, we use multilayer neural networks to parameterize the solution manifold, because of their flexibility, including the ability to impose physical constraints and interpretability through choice of model architecture. The high-level aspects of the network's design, which we believe are of general interest, are described below. Additional technical details are described in SI Appendix, and source code is available online at https://github.com/google/data-driven-discretization-1d.
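The released implementation (at the link above) is built on TensorFlow; purely for illustration, here is a hypothetical PyTorch sketch of the kind of fully convolutional network involved. It maps the coarse field to N = 6 stencil coefficients per grid cell and is translation invariant by construction; layer widths and depths are our assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class CoefficientNet(nn.Module):
    """Hypothetical sketch: a fully convolutional network mapping the
    coarse field v to stencil_size coefficients per grid cell, as in the
    pseudolinear representation (Eq. 2)."""

    def __init__(self, stencil_size=6, width=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, width, kernel_size=5, padding=2, padding_mode='circular'),
            nn.ReLU(),
            nn.Conv1d(width, width, kernel_size=5, padding=2, padding_mode='circular'),
            nn.ReLU(),
            nn.Conv1d(width, stencil_size, kernel_size=5, padding=2, padding_mode='circular'),
        )

    def forward(self, v):               # v: (batch, num_cells)
        return self.net(v[:, None, :])  # -> (batch, stencil_size, num_cells)
```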
Pseudolinear Representation. Our network represents spatial derivatives with a generalized finite-difference formula similar to Eq. 2: The output of the network is a list of coefficients α_1, ..., α_N such that the nth derivative is expressed as a pseudolinear filter, Eq. 2, where the coefficients α_i^{(n)}(v_1, v_2, ...) depend on space and time through their dependence on the field values in the neighboring cells. Finding the optimal coefficients is the crux of our method.

The pseudolinear representation is a direct generalization of the finite-difference scheme of Eq. 2. Moreover, exactly as in the case of Eq. 2, a Taylor expansion allows us to guarantee formal polynomial accuracy. That is, we can impose that approximation errors decay as O(\Delta x^m) for some m ≤ N − n, by layering a fixed affine transformation (SI Appendix). We found the best results when imposing linear accuracy, m = 1, with a 6-point stencil (N = 6), which we used for all results shown here. Finally, we note that this pseudolinear form is also a generalization of the popular essentially nonoscillatory (ENO) and weighted ENO (WENO) methods (20, 21), which choose a local linear filter (or a combination of filters) from a precomputed list according to an estimate of the solution's local curvature. WENO is an efficient, human-understandable way of adaptively choosing filters, inspired by nonlinear approximation theory. We improve on WENO by replacing heuristics with directly optimized quantities.
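A sketch of how formal polynomial accuracy can be imposed by a fixed affine transformation: Taylor-expanding Eq. 2 yields linear moment conditions on the coefficients, and any admissible coefficient vector is a particular solution plus an element of the constraint null space, so the network only predicts the free components. The stencil, dx, and interface here are illustrative; the authors' construction is detailed in the SI Appendix.

```python
import numpy as np
from math import factorial
from scipy.linalg import null_space

def affine_constraint(stencil, dx, n, m=1):
    """Affine map guaranteeing O(dx**m) accuracy for the n-th derivative.
    Returns (alpha0, basis) such that alpha = alpha0 + basis @ theta
    satisfies the Taylor-moment conditions for any network output theta."""
    k = np.arange(n + m)
    # Moment conditions: sum_i alpha_i * (s_i*dx)**k / k! == delta(k, n)
    facts = np.array([factorial(int(j)) for j in k])[:, None]
    A = (stencil * dx) ** k[:, None] / facts
    b = (k == n).astype(float)
    alpha0 = np.linalg.lstsq(A, b, rcond=None)[0]  # one valid coefficient set
    return alpha0, null_space(A)

# 6-point stencil (N = 6) around a cell boundary, in units of cell widths.
stencil = np.array([-2.5, -1.5, -0.5, 0.5, 1.5, 2.5])
alpha0, basis = affine_constraint(stencil, dx=0.1, n=1)
theta = np.zeros(basis.shape[1])   # in practice, predicted by the network
alpha = alpha0 + basis @ theta
print(alpha @ (stencil * 0.1))     # ~= 1: exactly differentiates v(x) = x
```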
Physical Constraints. Since Burgers' equation is an instance of the continuity equation, as with traditional methods, a major increase in stability is obtained when using a finite-volume scheme, ensuring the coarse-grained solution satisfies the conservation law implied by the continuity equation. That is, coarse-grained equations are derived for the cell averages of the field v, rather than its nodal values (19). During training we provide the cell average to the network as the "true" value of the discretized field.

Integrating Eq. 4, it is seen that the rate of change of the cell averages is completely determined by the fluxes at cell boundaries. This is an exact relation, in which the only challenge is estimating the flux given the cell averages. Thus, prediction is carried out in 3 steps: First, the network reconstructs the spatial derivatives on the boundary between grid cells (staggered grid). Then, the approximated derivatives are used to calculate the flux J using the exact formula Eq. 4. Finally, the temporal derivative of the cell averages is obtained by calculating the total change at each cell by subtracting J at the cell's left and right boundaries. The calculation of the time derivative from the flux can also be done using traditional techniques that promote stability, such as monotone numerical fluxes (19). For some experiments we use Godunov flux, inspired by finite-volume ENO schemes (20, 21), but it did not improve predictions for our neural network models.
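A minimal sketch of steps 2 and 3 for Burgers' equation: a monotone (Godunov) numerical flux for the convective part, and the exact finite-volume update from boundary fluxes. Step 1 (reconstructing the boundary states) is where the network enters; the indexing conventions are our own.

```python
import numpy as np

def godunov_flux(v_left, v_right):
    """Godunov flux for the convex flux v**2/2 in Eq. 4, given states
    reconstructed on the two sides of each cell boundary (step 1 provides
    these). The diffusive term -eta * dv/dx would be added separately."""
    return np.maximum(np.maximum(v_left, 0.0) ** 2,
                      np.minimum(v_right, 0.0) ** 2) / 2.0

def cell_average_rhs(J, dx, forcing=0.0):
    """Step 3: the rate of change of cell average i is set exactly by the
    flux difference across its boundaries; J[i] sits between cells i-1 and i."""
    return -(np.roll(J, -1) - J) / dx + forcing
```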
Dividing the inference procedure into these steps is favorable in a few aspects: First, it allows us to constrain the model at the various stages using traditional techniques; the conservative constraint, numerical flux, and formal polynomial accuracy constraints are what we use here, but other constraints are also conceivable. Second, this scheme limits the machine-learning part to reconstructing the unknown solution at cell boundaries, which is the main conceptual challenge, while the rest of the scheme follows either the exact dynamics or traditional approximations for them. Third, it makes the trained model more interpretable, since the intermediate outputs (e.g., J or α_i) have clear physical meaning. Finally, these physical constraints contribute to more accurate and stable models, as detailed in the ablation study in SI Appendix.

Choice of Loss. The loss of a neural net is the objective function minimized during training. Rather than optimizing the prediction accuracy of the spatial derivatives, we optimize the accuracy of the resulting time derivative.* This allows us to incorporate physical constraints in the training procedure and directly optimize the final predictions rather than intermediate stages. Our loss is the mean-squared error between the predicted time derivative and labeled data produced by coarse graining the fully resolved simulations.

Note that a low value of our training loss is a necessary but not sufficient condition for accurate and stable numerical integration over time. Many models with low training loss exhibited poor stability when numerically integrated (e.g., without the conservative constraint), particularly for equations with low dissipation. From a machine-learning perspective, this is unsurprising: Imitation-learning approaches, such as our models, often exhibit such issues because the distribution of inputs produced by the model's own predictions can differ from the training data (30). Incorporating the time-integrated solution into the loss improved predictions in some cases (as in ref. 9), but did not guarantee stability, and could cause the training procedure itself to diverge due to decreased stability in calculating the loss. Stability for learned numerical methods remains an important area of exploration for future work.
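In code, the loss described above might look as follows; a hypothetical sketch (PyTorch, for consistency with the earlier network sketch), where `model` bundles steps 1 and 2 and maps cell averages to boundary fluxes:

```python
import torch

def training_loss(model, v_coarse, dvdt_labels, dx):
    """Mean-squared error on the time derivative. `dvdt_labels` are
    obtained by coarse graining fully resolved simulations."""
    J = model(v_coarse)                                  # fluxes, (batch, cells)
    dvdt_pred = -(torch.roll(J, -1, dims=-1) - J) / dx   # conservative update
    return torch.mean((dvdt_pred - dvdt_labels) ** 2)
```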
Learned Coefficients. We consider 2 different parameterizations for learned coefficients. In our first parameterization, we learn optimized time- and space-independent coefficients. These fixed [...]

*For one specific case, namely the constant-coefficient model of Burgers' equation with Godunov flux limiting, trained models showed poor performance (e.g., not monotonically increasing with resample factor) unless the loss explicitly included the time-integrated solution, as done in ref. 9. Results shown in Figs. 3 and 4 use this loss for the constant-coefficient models with Burgers' equation. See details in SI Appendix.
For example, coefficients for both ∂v/∂x (Fig. 2 B, Inset) and v (SI Appendix, Fig. S3C) are either right or left biased, opposite the sign of v. This is in line with our physical intuition: Burgers' equation describes fluid flow, and the sign of v corresponds to the direction of flow. Coefficients that are biased in the opposite direction of v essentially look "upwind," a standard strategy in traditional numerical methods for solving hyperbolic PDEs (19), which helps constrain the scheme from violating temporal causality. Alternatively, upwinding could be built into the model structure by construction, as we do in models which use Godunov flux.
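The "upwind" bias the learned coefficients rediscover corresponds, in its simplest traditional form, to one-sided differences taken on the side the flow comes from; a minimal sketch for a periodic grid:

```python
import numpy as np

def upwind_derivative(v, dx):
    """First derivative with an upwind-biased stencil: backward difference
    where v > 0 (flow to the right), forward difference where v < 0."""
    backward = (v - np.roll(v, 1)) / dx
    forward = (np.roll(v, -1) - v) / dx
    return np.where(v > 0, backward, forward)
```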
Results

Burgers' Equation. To assess the accuracy of the time integration from our coarse-grained model, we computed "exact" solutions [...] method at low resolution. Importantly, the ringing effect around the shocks, which leads to numerical instabilities, is practically eliminated.

Since our model is trained on fully resolved simulations, a crucial requirement for our method to be of practical use is that training can be done on small systems, but still produce models that perform well on larger ones. We expect this to be the case, since our models, being based on convolutional neural networks, use only local features and by construction are translation invariant. Fig. 3B illustrates the performance of our model trained on the domain [0, 2π] for predictions on a 10-times larger spatial domain of size [0, 20π]. The learned model generalizes well. For example, it shows good performance when function values are all positive in a region of size greater than 2π, which due to the conservation law cannot occur in the training dataset.

To make this assessment quantitative, we averaged over many realizations of the forcing and calculated the mean absolute error integrated over time and space. Results on the 10-times larger inference domain are shown in Fig. 3C: The solution from the full neural network has equivalent accuracy to increasing the resolution for the baseline by a factor of about 8×. Interestingly, even the simpler constant-coefficient method significantly outperforms the baseline scheme. The constant-coefficient model with Godunov flux is particularly compelling: It is faster than WENO because there is no need to calculate coefficients on the fly, while achieving comparable accuracy and better numerical stability at coarse resolution, as shown in Figs. 3A and 4.

These calculations demonstrate that neural networks can carry out coarse graining. Even if the mesh spacing is much larger than the shock width, the model is still able to accurately propagate dynamics over time, showing that it has learned an internal representation of the shock structure.
Discussion and Conclusion

It has long been remarked that even simple nonlinear PDEs can generate solutions of great complexity. But even very complex, possibly chaotic, solutions are not just arbitrary functions: They are highly constrained by the equations they solve. In mathematical terms, despite the fact that the solution set of a PDE is nominally infinite dimensional, the inertial manifold of solutions is much smaller and can be understood in terms of interactions between local features of the solutions to nonlinear PDEs. The dynamical rules for interactions between these features have been well studied over the past 50 years. Examples include, among many others, interactions of shocks in complex media, interactions of solitons (32), and the turbulent energy cascade (34).

Machine learning offers a different approach for modeling these phenomena, by using training data to parameterize the inertial manifold itself; said differently, it learns both the features and their interactions from experience of the solutions. Here we propose a simple algorithm for achieving this, motivated [...]
References

1. J. D. Jackson, Classical Electrodynamics (John Wiley & Sons, 1999).
2. D. Sholl, J. A. Steckel, Density Functional Theory: A Practical Introduction (Wiley & Sons, 2011).
3. C. J. Chen, Fundamentals of Turbulence Modelling (CRC Press, 1997).
4. M. Van Dyke, Perturbation Methods in Fluid Mechanics (NASA STI/Recon Technical Report A 75, 1975).
5. R. Gonzalez-Garcia, R. Rico-Martinez, I. Kevrekidis, Identification of distributed parameter systems: A neural net based approach. Comput. Chem. Eng. 22, S965–S968 (1998).
6. R. Rico-Martinez, I. Kevrekidis, K. Krischer, "Nonlinear system identification using neural networks: Dynamics and instabilities" in Neural Networks for Chemical Engineers, A. B. Bulsari, Ed. (Elsevier, Amsterdam, The Netherlands, 1995), pp. 409–442.
7. I. G. Kevrekidis, G. Samaey, Equation-free multiscale computation: Algorithms and applications. Annu. Rev. Phys. Chem. 60, 321–344 (2009).
8. B. Kim et al., Deep fluids: A generative network for parameterized fluid simulations. Computer Graphics Forum 38, 59–70 (2019).
9. J. Tompson, K. Schlachter, P. Sprechmann, K. Perlin, "Accelerating Eulerian fluid simulation with convolutional networks" in Proceedings of the 34th International Conference on Machine Learning (ICML), D. Precup, Y. W. Teh, Eds. (PMLR, 2017), vol. 70, pp. 3424–3433.
10. S. Rasp, M. S. Pritchard, P. Gentine, Deep learning to represent subgrid processes in climate models. Proc. Natl. Acad. Sci. U.S.A. 115, 9684–9689 (2018).
11. S. L. Brunton, J. L. Proctor, J. N. Kutz, Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proc. Natl. Acad. Sci. U.S.A. 113, 3932–3937 (2016).
12. E. de Bezenac, A. Pajot, P. Gallinari, "Deep learning for physical processes: Incorporating prior scientific knowledge" in International Conference on Learning Representations (2018). https://iclr.cc/Conferences/2018/Schedule?showEvent=40. Accessed 11 July 2019.
13. B. Lusch, J. N. Kutz, S. L. Brunton, Deep learning for universal linear embeddings of nonlinear dynamics. Nat. Commun. 9, 4950 (2018).
14. J. Morton, F. D. Witherden, A. Jameson, M. J. Kochenderfer, "Deep dynamical modeling and control of unsteady fluid flows" in Advances in Neural Information Processing Systems, S. Bengio et al., Eds. (Curran Associates, Inc., 2018), vol. 31, pp. 9258–9268.
15. J. Ling, A. Kurzawski, J. Templeton, Reynolds averaged turbulence modelling using deep neural networks with embedded invariance. J. Fluid Mech. 807, 155–166 (2016).
16. A. D. Beck, D. G. Flad, C. D. Munz, Deep neural networks for data-driven turbulence models. arXiv:1806.04482 (15 June 2018).
17. A. Roberts, Holistic discretization ensures fidelity to Burgers' equation. Appl. Numer. Math. 37, 371–396 (2001).
18. W. E. Schiesser, The Numerical Method of Lines: Integration of Partial Differential Equations (Academic Press, San Diego, 1991).
19. R. J. LeVeque, Numerical Methods for Conservation Laws (Birkhauser Verlag, 1992).
20. A. Harten, B. Engquist, S. Osher, S. R. Chakravarthy, Uniformly high order accurate essentially non-oscillatory schemes, III. J. Comput. Phys. 71, 231–303 (1987).
21. C. W. Shu, "Essentially non-oscillatory and weighted essentially non-oscillatory schemes for hyperbolic conservation laws" in Advanced Numerical Approximation of Nonlinear Hyperbolic Equations, A. Quarteroni, Ed. (Springer, 1998), pp. 325–432.
22. M. Lee, R. D. Moser, Direct numerical simulation of turbulent channel flow up to Re_τ ≈ 5200. J. Fluid Mech. 774, 395–415 (2015).
23. M. Clay, D. Buaria, T. Gotoh, P. Yeung, A dual communicator and dual grid-resolution algorithm for petascale simulations of turbulent mixing at high Schmidt number. Comput. Phys. Commun. 219, 313–328 (2017).
24. K. P. Iyer, K. R. Sreenivasan, P. K. Yeung, Reynolds number scaling of velocity increments in isotropic turbulence. Phys. Rev. E 95, 021101 (2017).
25. P. Constantin, C. Foias, B. Nicolaenko, R. Temam, Integral Manifolds and Inertial Manifolds for Dissipative Partial Differential Equations (Springer Science & Business Media, 2012), vol. 70.
26. C. Foias, G. R. Sell, R. Temam, Inertial manifolds for nonlinear evolutionary equations. J. Differ. Equations 73, 309–353 (1988).
27. M. Jolly, I. Kevrekidis, E. Titi, Approximate inertial manifolds for the Kuramoto-Sivashinsky equation: Analysis and computations. Physica D Nonlinear Phenom. 44, 38–60 (1990).
28. E. S. Titi, On approximate inertial manifolds to the Navier-Stokes equations. J. Math. Anal. Appl. 149, 540–557 (1990).
29. M. Marion, Approximate inertial manifolds for reaction-diffusion equations in high space dimension. J. Dyn. Differ. Equations 1, 245–267 (1989).
30. S. Ross, D. Bagnell, "Efficient reductions for imitation learning" in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, Y. W. Teh, M. Titterington, Eds. (PMLR, Chia Laguna Resort, Sardinia, Italy, 2010), vol. 9, pp. 661–668.
31. I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (MIT Press, Cambridge, MA, 2016).
32. N. J. Zabusky, M. D. Kruskal, Interaction of "solitons" in a collisionless plasma and the recurrence of initial states. Phys. Rev. Lett. 15, 240–243 (1965).
33. D. Zwillinger, Handbook of Differential Equations (Gulf Professional Publishing, 1998).
34. U. Frisch, Turbulence: The Legacy of A. N. Kolmogorov (Cambridge University Press, 1996).
35. M. Sundararajan, A. Taly, Q. Yan, "Axiomatic attribution for deep networks" in Proceedings of the 34th International Conference on Machine Learning (ICML), D. Precup, Y. W. Teh, Eds. (PMLR, 2017), vol. 70, pp. 3319–3328.
36. A. Shrikumar, P. Greenside, A. Kundaje, "Learning important features through propagating activation differences" in Proceedings of the 34th International Conference on Machine Learning (ICML), D. Precup, Y. W. Teh, Eds. (PMLR, 2017), vol. 70, pp. 3145–3153.
37. Y. Romano, J. Isidoro, P. Milanfar, RAISR: Rapid and accurate image super resolution. IEEE Trans. Comput. Imaging 3, 110–125 (2017).
38. P. Getreuer et al., "BLADE: Filter learning for general purpose computational photography" in 2018 IEEE International Conference on Computational Photography (ICCP) (IEEE, 2018).
39. J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, G. E. Dahl, "Neural message passing for quantum chemistry" in Proceedings of the 34th International Conference on Machine Learning (ICML), D. Precup, Y. W. Teh, Eds. (PMLR, 2017), vol. 70, pp. 1263–1272.
40. C. R. Qi, L. Yi, H. Su, L. J. Guibas, "PointNet++: Deep hierarchical feature learning on point sets in a metric space" in Advances in Neural Information Processing Systems, I. Guyon et al., Eds. (Curran Associates, Inc., 2017), vol. 30.