Closed-form continuous-time neural networks
Article https://doi.org/10.1038/s42256-022-00556-7
Ramin Hasani1,5, Mathias Lechner1,2,5, Alexander Amini1, Lucas Liebenwein1, Aaron Ray1, Max Tschaikowski3, Gerald Teschl4 & Daniela Rus1
Received: 23 March 2022 | Accepted: 5 October 2022 | Published online: 15 November 2022
Continuous-time neural networks are a class of machine learning systems
that can tackle representation learning on spatiotemporal decision-making
tasks. These models are typically represented by continuous differential
equations. However, their expressive power when they are deployed on
computers is bottlenecked by numerical differential equation solvers.
This limitation has notably slowed down the scaling and understanding of
numerous natural physical phenomena such as the dynamics of nervous
systems. Ideally, we would circumvent this bottleneck by solving the given
dynamical system in closed form. This is known to be intractable in general.
Here, we show that it is possible to closely approximate the interaction
between neurons and synapses—the building blocks of natural and artificial
neural networks—constructed by liquid time-constant networks efficiently
in closed form. To this end, we compute a tightly bounded approximation
of the solution of an integral appearing in liquid time-constant dynamics
that has had no known closed-form solution so far. This closed-form
solution impacts the design of continuous-time and continuous-depth
neural models. For instance, since time appears explicitly in closed
form, the formulation relaxes the need for complex numerical solvers.
Consequently, we obtain models that are between one and five orders of
magnitude faster in training and inference compared with differential
equation-based counterparts. More importantly, in contrast to ordinary
differential equation-based continuous networks, closed-form networks
can scale remarkably well compared with other deep learning instances.
Lastly, as these models are derived from liquid networks, they show good
performance in time-series modelling compared with advanced recurrent
neural network models.
Continuous neural network architectures built by ordinary differential equations (ODEs)2 are expressive models useful in modelling data with complex dynamics. These models transform the depth dimension of static neural networks and the time dimension of recurrent neural networks (RNNs) into a continuous vector field, enabling parameter sharing, adaptive computations and function approximation for non-uniformly sampled data.
These continuous-depth (time) models have shown promise in density estimation applications3–6, as well as in modelling sequential and irregularly sampled data1,7–9.
1Massachusetts Institute of Technology, Cambridge, MA, USA. 2Institute of Science and Technology Austria, Klosterneuburg, Austria. 3Aalborg University, Aalborg, Denmark. 4University of Vienna, Vienna, Austria. 5These authors contributed equally: Ramin Hasani, Mathias Lechner. e-mail: rhasani@mit.edu
[Fig. 1 schematic: presynaptic stimuli I(t) reach a postsynaptic neuron x(t) through a nonlinear synapse carrying the current S(t). The LTC DE instance shown is dx(t)/dt = −x(t)/τ + S(t), and the closed-form solution we derive for it is x(t) = (x(0) − A) e^{−[1/τ + f(I(t))] t} f(−I(t)) + A.]
Fig. 1 | Neural and synapse dynamics. A postsynaptic neuron receives the stimuli I(t) through a nonlinear conductance-based synapse model. Here, S(t) stands for the synaptic current. The dynamics of the membrane potential of this postsynaptic neuron are given by the DE presented in the middle. This equation is a fundamental building block of LTC networks1, for which there is no known closed-form expression. Here, we provide an approximate solution for this equation, which shows the interaction of nonlinear synapses with postsynaptic neurons in closed form.
While ODE-based neural networks with careful memory and gradient propagation design9 perform competitively with advanced discretized recurrent models on relatively small benchmarks, their training and inference are slow owing to the use of advanced numerical differential equation (DE) solvers10. This becomes even more troublesome as the complexity of the data, task and state space increases (that is, requiring more precision)11, for instance, in open-world problems such as medical data processing, self-driving cars, financial time series and physics simulations.
The research community has developed solutions for resolving this computational overhead and for facilitating the training of neural ODEs, for instance by relaxing the stiffness of a flow by state augmentation techniques4,12, reformulating the forward pass as a root-finding problem13, using regularization schemes14–16 or improving the inference time of the network17.
Here, we derive a closed-form continuous-depth model that has the modelling capabilities of ODE-based models but does not require any solver to model data (Fig. 1).
Intuitively, in this work, we replace the integration (that is, solution) of a nonlinear DE describing the interaction of a neuron with its input nonlinear synaptic connections, with their corresponding nonlinear operators. This could be achieved in principle using functional Taylor expansions (in the spirit of the Volterra series)18. However, in the particular case of liquid time-constant (LTC) networks, we can leverage a closed-form expression for the system's response to input. This allows one to evaluate the system's response to exogenous input (I) and recurrent inputs from hidden states (x) as a function of time. One way of looking at this is to regard the closed-form solution as the application of a nonlinear forward operator to the inputs of each hidden state or neuron in the network, where the outputs of one neuron constitute the inputs for others. Effectively, this rests on approximating a conductance-based model with a neural mass model, of the kind used in the dynamic causal modelling of real neuronal networks19.
The proposed continuous neural networks yield considerably faster training and inference speeds while being as expressive as their ODE-based counterparts. We provide a derivation for the approximate closed-form solution to a class of continuous neural networks that explicitly models time. We demonstrate how this transformation can be formulated into a novel neural model and scaled to create flexible, performant and fast neural architectures on challenging sequential datasets.

Deriving an approximate closed-form solution for neural interactions
Two neurons interact with each other through synapses as shown in Fig. 1. There are three principal mechanisms for information propagation in natural brains that are abstracted away in the current building blocks of deep learning systems: (1) neural dynamics are typically continuous processes described by DEs (see the dynamics of x(t) in Fig. 1), (2) synaptic release is much more than scalar weights, involving a nonlinear transmission of neurotransmitters, the probability of activation of receptors and the concentration of available neurotransmitters, among other nonlinearities (see S(t) in Fig. 1), and (3) the propagation of information between neurons is induced by feedback and memory apparatuses (see how I(t) stimulates x(t) through a nonlinear synapse S(t), which also has a multiplicative difference of potential to the postsynaptic neuron, accounting for a negative feedback mechanism).
One could read I(t) as a mixture of exogenous input to the (neural) network and presynaptic inputs from other neurons that result in a depolarization x(t). This depolarization is mediated by the current S(t), which depends upon the depolarization and a reversal threshold A. LTC networks1, which are expressive continuous-depth models obtained by a bilinear approximation20 of a neural ODE formulation2, are designed on the basis of these mechanisms. Correspondingly, we take their ODE semantics and approximate a closed-form solution for the scalar case of a postsynaptic neuron receiving an input stimulus from a presynaptic source through a nonlinear synapse.
To this end, we apply the theory of linear ODEs21 to analytically solve the dynamics of an LTC DE as shown in Fig. 1. We then simplify the solution to the point where there is one integral left to solve. This integral compartment, ∫0^t f(I(s)) ds, in which f is a positive, continuous, monotonically increasing and bounded nonlinearity, is challenging to solve in closed form since it has dependencies on an input signal I(s) that is arbitrarily defined (such as real-world sensory readouts). To approach this problem, we discretize I(s) into piecewise constant segments and obtain the discrete approximation of the integral in terms of the sum of piecewise constant compartments over intervals. This piecewise constant approximation inspired us to introduce an approximate closed-form solution for the integral ∫0^t f(I(s)) ds that is provably tight when the integral appears as the exponent of an exponential decay, which is the case for LTCs. We theoretically justify how this closed-form solution represents LTCs' ODE semantics and is as expressive (Fig. 1).

Explicit time dependence
We then dissect the properties of the obtained closed-form solution and design a new class of neural network models we call closed-form continuous-depth networks (CfC). CfCs have an explicit time dependence in their formulation that does not require a numerical ODE solver to obtain their temporal rollouts. Thus, they maximize the trade-off between accuracy and efficiency of solvers. Formally, this property corresponds to obtaining lower time complexity for models without numerical instabilities and errors, as illustrated in Table 1 (left). For example, Table 1 (left) shows that the complexity of a pth-order numerical ODE solver is 𝒪(Kp), where K is the number of ODE steps, while a CfC system (which has explicit time dependence) requires 𝒪(K̃), where K̃ is the number of exogenous input time steps, which is typically one to three orders of magnitude smaller than K. Moreover, the approximation error of a pth-order numerical ODE solver scales with 𝒪(ϵ^{p+1}), whereas CfCs are closed-form continuous-time systems, thus the notion of approximation error becomes irrelevant to them.
Table 1 (footnote). Left: the time complexity of the process to compute K solver steps. ϵ is the step size. ϵ̃ is the maximum step size and δ ≪ 0. K̃ is the number of time steps for CfCs corresponding to the input time step, which is typically one to three orders of magnitude smaller than K. The left portion is reproduced with permission from ref. 17. Right: sequence and time-step prediction complexity. n is the sequence length. k is the number of hidden units. p is the order of the ODE solver.
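To make this contrast concrete, the sketch below compares a fixed-step Euler rollout of a scalar LTC neuron, which must march through K solver steps to reach a query time t, with the solver-free closed-form expression of Fig. 1, evaluated once per query point. All parameter values, the toy sinusoidal input and the symmetrized form of the ODE used here are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def f(z):
    # bounded, monotonically increasing nonlinearity (a sigmoid), as assumed for LTCs
    return 1.0 / (1.0 + np.exp(-z))

# illustrative scalar parameters and a toy input; none of these values come from the paper
w_tau, A, x0 = 0.5, 1.0, 0.0
I = lambda t: np.sin(t)

def euler_rollout(t_end, K):
    """O(K): fixed-step Euler integration of dx/dt = -[w_tau + f(I(t))] (x(t) - A)."""
    dt, x = t_end / K, x0
    for k in range(K):
        x += dt * (-(w_tau + f(I(k * dt))) * (x - A))
    return x

def closed_form(t):
    """O(1) per query point: the solver-free closed-form expression of Fig. 1."""
    return (x0 - A) * np.exp(-(w_tau + f(I(t))) * t) * f(-I(t)) + A

print(euler_rollout(5.0, K=1000))  # must march through K solver steps
print(closed_form(5.0))            # a single evaluation at the query time
```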
This explicit time dependence allows CfCs to perform computations at least one order of magnitude faster in terms of training and inference time compared with their ODE-based counterparts, without loss of accuracy.

Sequence and time-step prediction efficiency
In sequence modelling tasks, one can perform predictions based on an entire sequence of observations, or perform auto-regressive modelling where the model predicts the next time-step output given the current time-step input. Table 1 (right) depicts the time complexity of different neural network instances at inference, for a given sequence of length n and a neural network with k hidden units. We observe that the complexity of ODE-based networks and Transformer modules is at least an order of magnitude higher than that of discrete RNNs and CfCs in both sequence prediction and auto-regressive modelling (time-step prediction) frameworks. This is desirable because not only do CfCs establish a continuous flow similar to ODE models1 to achieve better expressivity in representation learning, but they do so with the efficiency of discrete RNN models.

CfCs: flexible deep models for sequential tasks
Additionally, CfCs are equipped with novel time-dependent gating mechanisms that explicitly control their memory. CfCs are as expressive as their ODE-based peers and can be supplied with mixed memory architectures9 to avoid gradient issues in sequential data processing applications with long-range dependences. Beyond accuracy and performance metrics, our results indicate that, when considering accuracy per compute time, CfCs exhibit over 150-fold improvements over ODE-based counterparts. We perform a diverse set of advanced time-series modelling experiments and present the performance and speed gain achievable by using CfCs in tasks with long-term dependences, irregular data and modelling of physical dynamics, among others.

Deriving a closed-form solution
In this section, we derive an approximate closed-form solution for LTC networks, an expressive subclass of time-continuous models. We discuss how the scalar closed-form expression derived from a small LTC system can inspire the design of CfC models. In this regard, we define the LTC semantics. We then state the main theorem that computes a closed-form approximation of a given LTC system for the scalar case. To prove the theorem, we first find the integral solution of the given LTC ODE system. We then compute a closed-form analytical solution for the integral solution for the case of piecewise constant inputs. Afterwards, we generalize the closed-form solution of the piecewise constant inputs to the case of arbitrary inputs with our novel approximation and finally provide sharpness results (that is, measure the rate and accuracy of the approximation error) for the derived solution.
The hidden state of an LTC network is determined by the solution of the following initial value problem (IVP)1:

dx(t)/dt = −[wτ + f(x, I, θ)] ⊙ x(t) + A ⊙ f(x, I, θ),    (1)

where, at a time step t, x^(D×1)(t) defines the hidden state of an LTC layer with D cells, and I^(m×1)(t) is an exogenous input to the system with m features. Here, wτ^(D×1) is a time-constant parameter vector, A^(D×1) is a bias vector, f is a neural network parametrized by θ and ⊙ is the Hadamard product. The dependence of f(.) on x(t) denotes the possibility of having recurrent connections.

Theorem 1
Given an LTC system determined by the IVP in equation (1), constructed by one cell, receiving a single-dimensional time-series exogenous input I(t) with no self-connections, the following expression is an approximation of its closed-form solution:

x(t) ≈ (x0 − A) e^{−[wτ + f(I(t), θ)] t} f(−I(t), θ) + A.    (2)

The full proof of theorem 1 is given in Methods. The theorem formally demonstrates that the approximated closed-form solution for the given LTC system is given by equation (2) and that this approximation is tightly bounded, with the bounds given in the proof.

Tightness of the closed-form solution in practice
In the following, we show an illustrative example of this tightness result in practice. To do this, we first present an instantiation of LTC networks and their approximate closed-form expressions. Extended Data Fig. 1 shows a liquid network with two neurons and five synaptic connections. The network receives an input signal I(t). Extended Data Fig. 1 further derives the DE expression for the network along with its closed-form approximate solution. In general, it is possible to compile an LTC network into its closed-form expression as illustrated in Extended Data Fig. 1. This compilation can be performed using Algorithm 1 provided in Methods.
Figure 2 shows an LTC-based network trained for autonomous driving22. The figure further illustrates how closely the proposed solution fits the actual dynamics exhibited by a single-neuron ODE given the same parametrization. The details of this experiment are given in Methods.
We next show how to design a novel neural network instance inspired by this closed-form solution that has well-behaved gradient properties and approximation capabilities.
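As a purely numerical companion to this tightness discussion (a sketch with arbitrary parameters and a test input, not the trained driving network of Fig. 2), the following checks that the gap between a finely integrated scalar LTC trajectory, written in the symmetrized form used in Methods, and the approximation of equation (2) stays below the bound |x(0) − A| e^{−wτ t} derived there.

```python
import numpy as np

f = lambda z: 1.0 / (1.0 + np.exp(-z))     # sigmoidal nonlinearity, 0 <= f <= 1
w_tau, A, x0 = 0.8, 1.5, 0.2               # illustrative scalar LTC parameters
I = lambda t: np.cos(2.0 * t)              # arbitrary continuous test input

# Reference trajectory of dx/dt = -[w_tau + f(I(t))] (x - A) via small-step Euler integration.
T, N = 4.0, 40000
ts, dt = np.linspace(0.0, T, N + 1), T / N
x = np.empty(N + 1); x[0] = x0
for k in range(N):
    x[k + 1] = x[k] - dt * (w_tau + f(I(ts[k]))) * (x[k] - A)

# Approximate closed-form solution of equation (2), evaluated at all time stamps at once.
x_tilde = (x0 - A) * np.exp(-(w_tau + f(I(ts))) * ts) * f(-I(ts)) + A

gap = np.abs(x - x_tilde)
bound = np.abs(x0 - A) * np.exp(-w_tau * ts)
print("largest gap:", gap.max(), " largest bound violation:", (gap - bound).max())
```

Up to the Euler discretization error, the reported bound violation should be non-positive, mirroring the exponentially decaying error bound derived in Methods.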
Designing CfC models from the solution
Leveraging the scalar closed-form solution expressed by equation (2), we can now distil this model into a neural network model that can be trained at scale. The solution provides a grounded theoretical basis […]. We also replace A with another neural network instance, h(.), to enhance the flexibility of the model. To obtain a more general network architecture, we allow the nonlinearity f(−x, −I; θ) present in equation (3) to have both shared (backbone) and independent (g(.)) network compartments. The time-decaying sigmoidal term can play a gating role if we additionally multiply h(.) with (1 − σ(.)). This way, the time-decaying sigmoid function stands for a gating mechanism that interpolates between the two limits of t → −∞ and t → ∞ of the ODE trajectory.
[Figure schematic: a perception module feeds an input stream I(t) into the LTC module; the dynamics of each node are dx/dt = −(wτ + f(x, I)) x(t) + A f(x, I), with neuron state x(t), nonlinearity f(·) and parameters wτ and A. A shared backbone, a gating balance and output neuron dynamics produce the outputs.]
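To make the gating construction above concrete, here is a minimal sketch of such a cell in PyTorch. Everything below (layer sizes, the tanh backbone, the tanh activations on the g and h heads) is an illustrative assumption rather than the released CfC implementation: a shared backbone reads [input, hidden state], three heads produce f, g and h, and the time-decaying sigmoid σ(−f·t) interpolates between g and h.

```python
import torch
import torch.nn as nn

class CfCCell(nn.Module):
    """Sketch of a closed-form continuous-depth (CfC) cell under the assumptions above."""
    def __init__(self, in_dim: int, hidden_dim: int, backbone_dim: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(in_dim + hidden_dim, backbone_dim), nn.Tanh()
        )
        self.f = nn.Linear(backbone_dim, hidden_dim)  # liquid time-constant head
        self.g = nn.Linear(backbone_dim, hidden_dim)  # state towards t -> -inf
        self.h = nn.Linear(backbone_dim, hidden_dim)  # state towards t -> +inf

    def forward(self, x, I, t):
        # x: (batch, hidden), I: (batch, in_dim), t: (batch, 1) elapsed time since last sample
        z = self.backbone(torch.cat([I, x], dim=-1))
        gate = torch.sigmoid(-self.f(z) * t)          # time-decaying sigmoidal gate
        return gate * torch.tanh(self.g(z)) + (1.0 - gate) * torch.tanh(self.h(z))

# usage on an irregularly sampled sequence of five observations
cell = CfCCell(in_dim=3, hidden_dim=8)
x = torch.zeros(1, 8)
for I_t, dt in zip(torch.randn(5, 1, 3), torch.rand(5, 1, 1)):
    x = cell(x, I_t, dt)
```

Feeding the elapsed time t explicitly is what lets such a cell consume irregularly sampled sequences without an ODE solver.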
Table 2 | Human activity recognition, per time-step classification
Model | Accuracy (%) | Time per epoch (min)
†RNN-Impute7 | 79.50 ± 0.8 | 0.38
†RNN-Δt7 | 79.50 ± 0.8 | 0.45
†RNN-Decay7 | 80.00 ± 1.0 | 0.39
†GRU-D51 | 80.60 ± 0.7 | 0.15
†RNN-VAE7 | 34.30 ± 4.0 | 2.63
†Latent-ODE-RNN7 | 83.50 ± 1.0 | 7.71
†ODE-RNN7 | 82.90 ± 1.6 | 3.15
†Latent-ODE-ODE7 | 84.60 ± 1.3 | 8.49
Cf-S (current work) | 87.04 ± 0.47 | 0.097
CfC-noGate (current work) | 85.57 ± 0.34 | 0.093
CfC (current work) | 84.87 ± 0.42 | 0.084
CfC-mmRNN (current work) | 85.97 ± 0.25 | 0.128
Numbers represent mean ± s.d. (n = 5). The performance of the models marked by † is reported from ref. 7. Bold values indicate the highest accuracy and best time per epoch (min).

Each of these four proposed variants leverages our proposed solution and is thus at least one order of magnitude faster than continuous-time ODE models.
To investigate their representation learning power, in the following we extensively evaluate CfCs on a series of sequence modelling tasks. The objective is to test the effectiveness of the CfCs in learning spatiotemporal dynamics, compared with a wide range of advanced models.

Baselines
We compare CfCs with a diverse set of advanced algorithms developed for sequence modelling by both discretized and continuous mechanisms. These baselines are given in full in Methods.

Human activity recognition
The human activity dataset contains 6,554 sequences of humans demonstrating activities such as walking, lying and sitting. The input space is formed of 561-dimensional inertial sensor measurements per time step, recorded from the user's smartphone26, categorized into six groups of activities (per time step) as output. We set up our dataset split (training, validation and test) to carefully reflect the modifications made by Rubanova et al.7 on this task. The results of this experiment are reported in Table 2. We observe
that not only do the CfC variants Cf-S, CfC-noGate and CfC-mmRNN outperform other models by a large margin, but they do so with a speed-up of more than 8,752% over the best-performing ODE-based instance (Latent-ODE-ODE). The reason for such a large speed difference is the complexity of the dataset dynamics, which causes the ODE solvers of ODE-based models such as Latent-ODE-ODE to compute many steps over stiff dynamics. This issue does not exist for closed-form models as they do not use any ODE solver to account for dynamics. The hyperparameter details of this experiment are provided in Extended Data Fig. 3.

Physical dynamics modelling
The Walker2D dataset consists of kinematic simulations of the MuJoCo physics engine27 (see Methods for more details). As shown in Table 3, CfCs outperform the other baselines by a large margin, supporting their strong capability to model irregularly sampled physical dynamics with missing phases. It is worth mentioning that, on this task, CfCs even outperform Transformers by a considerable 18% margin. The hyperparameter details of this experiment are provided in Extended Data Fig. 3.

Event-based sequential image processing
We next assess the performance of CfCs on a challenging sequential image processing task. This task is generated from the sequential modified National Institute of Standards and Technology (MNIST) dataset following the steps described in Methods. Moreover, the hyperparameter details of this experiment are provided in Extended Data Fig. 4.
Table 4 summarizes the results on this event-based sequence classification task. We observe that models such as ODE-RNN, CT-RNN, GRU-ODE and LSTMs struggle to learn a good representation of the input data and therefore show poor performance. In contrast, RNNs endowed with explicit memory, such as bi-directional RNNs, GRU-D, Lipschitz RNN, coRNN, CT-LSTM and ODE-LSTM, perform well on this task. All CfC variants perform well and establish the state of the art on this task, with CfC-mmRNN achieving 98.09% and CfC-noGate achieving 96.99% accuracy in classifying irregularly sampled sequences. It is worth mentioning that they do so around 200–400% faster than ODE-based models such as GRU-ODE and ODE-RNN.

Regularly and irregularly sampled bit-stream XOR
The bit-stream XOR dataset9 considers the classification of bit streams by implementing an XOR function in time. That is, each item in the sequence contributes equally to the correct output. The details are given in Methods.
Extended Data Fig. 5 compares the performance of many RNN baselines. Many architectures, such as Augmented LSTM, CT-GRU, GRU-D, ODE-LSTM, coRNN and Lipschitz RNN, and all variants of CfC, can successfully solve the task with 100% accuracy when the bit-stream samples are equidistant from each other. However, when the bit-stream samples arrive at non-uniform distances, only architectures that are immune to the vanishing gradient in irregularly sampled data can solve the task. These include GRU-D, ODE-LSTM, CfC and CfC-mmRNNs. ODE-based RNNs cannot solve the event-based encoding tasks regardless of their choice of solvers, as they have vanishing/exploding gradient issues9. The hyperparameter details of this experiment are provided in Extended Data Fig. 4.

PhysioNet Challenge
The PhysioNet Challenge 2012 dataset considers the prediction of the mortality of 8,000 patients admitted to the intensive care unit. The features represent time series of medical measurements taken during the first 48 h after admission. The data are irregularly sampled in time and over features, that is, only a subset of the 37 possible features is given at each time point. We perform the same test–train split and preprocessing as in ref. 7, and report the area under the curve (AUC) on the test set as a metric in Extended Data Fig. 6. We observe that CfCs perform competitively with other baselines while training 160 times faster than ODE-RNN and 220 times faster than continuous latent models. CfCs are also, on average, three times faster than advanced discretized gated recurrent models. The hyperparameter details of this experiment are provided in Extended Data Fig. 7.

Sentiment analysis using IMDB
The Internet Movie Database (IMDB) sentiment analysis dataset28 consists of 25,000 training and 25,000 test sentences (see Methods for more details). Extended Data Fig. 8 shows how CfCs equipped with mixed memory instances outperform advanced RNN benchmarks. The hyperparameter details of this experiment are provided in Extended Data Fig. 7.

Performance of CfCs in autonomous driving
In this experiment, our objective is to evaluate how robustly CfCs learn to perform autonomous navigation in comparison with their ODE-based counterparts, LTC networks. The task is to map incoming […]
Table 3 | Per-time-step regression on the Walker2D kinematic dataset
Model | Squared error | Time per epoch (min)
†AugmentedLSTM44 | 1.065 ± 0.006 | 0.10
†Bi-directional RNN53 | 1.071 ± 0.009 | 0.39
†PhasedLSTM52 | 1.063 ± 0.010 | 0.25
†CT-LSTM50 | 1.014 ± 0.014 | 0.31
coRNN57 | 3.241 ± 0.215 | 0.18
Lipschitz RNN58 | 1.781 ± 0.013 | 0.17
LTC1 | 0.662 ± 0.013 | 0.78
Transformer36 | 0.761 ± 0.032 | 0.80
Cf-S (current work) | 0.948 ± 0.009 | 0.12
CfC-noGate (current work) | 0.650 ± 0.008 | 0.21
CfC (current work) | 0.643 ± 0.006 | 0.08
CfC-mmRNN (current work) | 0.617 ± 0.006 | 0.34
Modelling the physical dynamics of a walker agent in simulation. Numbers present mean ± s.d. (n = 5). The performance of the models marked by † is reported from ref. 9. Bold values indicate the lowest error and best time per epoch (min).

Table 4 | Event-based sequential MNIST classification
Model | Accuracy (%) | Time per epoch (min)
CT-RNN48 | 72.05 ± 0.71 | 17.30
RNN-Decay7 | 88.93 ± 4.06 | 3.64
GRU-D51 | 95.44 ± 0.34 | 3.42
GRU-ODE7 | 80.95 ± 1.52 | 6.76
coRNN57 | 94.44 ± 0.24 | 3.90
Lipschitz RNN58 | 95.92 ± 0.16 | 3.86
ODE-LSTM9 | 95.73 ± 0.24 | 6.35
Cf-S (current work) | 95.23 ± 0.16 | 2.73
CfC-noGate (current work) | 96.99 ± 0.30 | 3.36
CfC (current work) | 95.42 ± 0.21 | 3.62
CfC-mmRNN (current work) | 98.09 ± 0.18 | 5.50
Test accuracy shown as mean ± s.d. (n = 5). Bold values indicate the highest accuracy and best time per epoch (min).
dx(t)/dt = −[wτ + f(I(t))] ⋅ x(t) + A f(I(t)).    (5)

Therefore, we can use the theory of linear ODEs to obtain an integral closed-form solution (section 1.10 in ref. 21) consisting of two nested integrals. The inner integral can be eliminated by means of integration by substitution43. The remaining integral expression can then be solved in the case of piecewise constant inputs and approximated in the case of general inputs. The three steps of the proof are outlined below.

Integral closed-form solution of LTC
We consider the ODE semantics of a single neuron that receives some arbitrary continuous input signal I and has a positive, bounded, continuous and monotonically increasing nonlinearity f:

dx(t)/dt = −[wτ + f(I(t))] ⋅ x(t) + A ⋅ [wτ + f(I(t))].

Assumption. We assumed a second constant value wτ in the above representation of a single LTC neuron. This is done to introduce symmetry in the structure of the ODE, yielding a simpler expression for the solution. The inclusion of this second constant may appear to profoundly alter the dynamics. However, as shown below, numerical experiments suggest that this simplifying assumption has a marginal effect on the ability to approximate LTC cell dynamics.
Using the variation of constants formula (section 1.10 in ref. 21), we obtain after some simplifications:

x(t) = (x(0) − A) e^{−wτ t − ∫0^t f(I(s)) ds} + A.    (6)

Analytical LTC solution for piecewise constant inputs
The derivation of a useful closed-form expression of x requires us to solve the integral expression ∫0^t f(I(s)) ds for any t ≥ 0. Fortunately, the integral ∫0^t f(I(s)) ds enjoys a simple closed-form expression for piecewise constant inputs I. Specifically, assume that we are given a sequence of time points

0 = τ0 < τ1 < τ2 < … < τn−1 < τn = ∞,

such that τ1, …, τn−1 ∈ ℝ and I(t) = γi for all t ∈ [τi; τi+1) with 0 ≤ i ≤ n − 1. Then, it holds that

∫0^t f(I(s)) ds = f(γk)(t − τk) + Σ_{i=0}^{k−1} f(γi)(τi+1 − τi),    (7)

when τk ≤ t < τk+1 for some 0 ≤ k ≤ n − 1 (as usual, one defines Σ_{i=0}^{−1} := 0). With this, we have

x(t) = (x(0) − A) e^{−wτ t} e^{−f(γk)(t−τk) − Σ_{i=0}^{k−1} f(γi)(τi+1−τi)} + A,    (8)

when τk ≤ t < τk+1 for some 0 ≤ k ≤ n − 1. While any continuous input can be approximated arbitrarily well by a piecewise constant input43, a tight approximation may require a large number of discretization points.

The approximate closed-form solution under consideration is

x̃(t) = (x(0) − A) e^{−[wτ t + f(I(t)) t]} f(−I(t)) + A.    (9)

Then, |x(t) − x̃(t)| ≤ |x(0) − A| e^{−wτ t} for all t ≥ 0. Writing c = x(0) − A for convenience, we can obtain the following sharpness results, additionally:
1. For any t ≥ 0, we have sup { (1/c)(x(t) − x̃(t)) | I : [0; t] → ℝ } = e^{−wτ t}.
2. For any t ≥ 0, we have inf { (1/c)(x(t) − x̃(t)) | I : [0; t] → ℝ } = e^{−wτ t}(e^{−t} − 1).

Above, the supremum and infimum are meant to be taken across all continuous input signals. These statements settle the question about the worst-case errors of the approximation. The first statement implies, in particular, that our bound is sharp.
The full proof is given in the next section. Lemma 1 demonstrates that the integral solution we obtained and showed in equation (6) is tightly close to the approximate closed-form solution we proposed in equation (9). Note that, as wτ is positively defined, the derived bound between equations (6) and (9) ensures an exponentially decaying error as time goes by. Therefore, we have the statement of the theorem. □

Proof of lemma 1
We start by noting that

x(t) − x̃(t) = c e^{−wτ t} [e^{−∫0^t f(I(s)) ds} − e^{−f(I(t)) t} f(−I(t))].

Since 0 ≤ f ≤ 1, we conclude that e^{−∫0^t f(I(s)) ds} ∈ [0; 1] and e^{−f(I(t)) t} f(−I(t)) ∈ [0; 1]. This shows that |x(t) − x̃(t)| ≤ |c| e^{−wτ t}. To see the sharpness results, pick some arbitrarily small ε > 0 and a sufficiently large C > 0 such that f(−C) ≤ ε and 1 − ε ≤ f(C). With this, for any 0 < δ < t, we consider the piecewise constant input signal I such that I(s) = −C for s ∈ [0; t − δ] and I(s) = C for s ∈ (t − δ; t]. Then, it can be noted that

e^{−∫0^t f(I(s)) ds} − e^{−f(I(t)) t} f(−I(t)) ≥ e^{−εt − δ·1} − e^{−(1−ε)t} ε → 1, when ε, δ → 0.

Statement 1 follows by noting that there exists a family of continuous signals In : [0; t] → ℝ such that |In(·)| ≤ C for all n ≥ 1 and In → I pointwise as n → ∞. This is because

lim_{n→∞} | ∫0^t f(I(s)) ds − ∫0^t f(In(s)) ds | ≤ lim_{n→∞} ∫0^t | f(I(s)) − f(In(s)) | ds ≤ lim_{n→∞} L ∫0^t | I(s) − In(s) | ds = 0,

where L is the Lipschitz constant of f, and the last identity is due to the dominated convergence theorem43. To see statement 2, we first note that the negation of the signal −I provides us with

e^{−∫0^t f(−I(s)) ds} − e^{−f(−I(t)) t} f(I(t)) ≤ e^{−(1−ε)(t−δ) − δ·0} − e^{−εt} (1 − ε) → e^{−t} − 1,

if ε, δ → 0. The fact that the left-hand side of the last inequality must be at least e^{−t} − 1 follows by observing that e^{−t} ≤ e^{−∫0^t f(I′(s)) ds} and e^{−f(I″(t)) t} f(−I″(t)) ≤ 1 for any I′, I″ : [0; t] → ℝ. □
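The following sketch evaluates equations (7) and (8) directly for a piecewise constant input: the integral collapses to a weighted sum of segment lengths, which then enters the exponent of the solution. The switching times, segment values and parameters are made up purely for illustration.

```python
import numpy as np

f = lambda z: 1.0 / (1.0 + np.exp(-z))   # bounded, monotonically increasing nonlinearity
w_tau, A, x0 = 0.5, 1.0, 0.0             # illustrative scalar parameters

# piecewise constant input: I(t) = gamma[i] on [tau[i], tau[i+1]), last segment open-ended
tau   = np.array([0.0, 1.0, 2.5])        # switching times tau_0 < tau_1 < tau_2
gamma = np.array([-2.0, 3.0, 0.5])       # segment values gamma_0, gamma_1, gamma_2

def integral_f_I(t):
    """Equation (7): int_0^t f(I(s)) ds for the piecewise constant input above."""
    k = np.searchsorted(tau, t, side="right") - 1            # index of the segment containing t
    completed = sum(f(gamma[i]) * (tau[i + 1] - tau[i]) for i in range(k))
    return f(gamma[k]) * (t - tau[k]) + completed

def x_piecewise(t):
    """Equation (8): exact solution of the symmetric scalar LTC for this input."""
    return (x0 - A) * np.exp(-w_tau * t - integral_f_I(t)) + A

print(x_piecewise(1.7))
```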
This adjacency matrix can have an arbitrary sparsity (that is, there is no need to use a directed acyclic graph for the coupling between neurons).

Algorithm 1. Translate the architecture of an LTC network into its closed-form variant
Inputs: LTC inputs I^(N×T)(t), the activity x^(H×T)(t) and initial states x^(H×1)(0) of the LTC neurons, the adjacency matrix for synapses WAdj^[(N+H)×(N+H)], an LTC ODE solver with step Δt, time-instance vectors of the inputs tI^(1×T)(t), time instances of the LTC neurons tx(t) ▷ time might be sampled irregularly, the LTC neuron parameters τ^(H×1) and the LTC network synaptic parameters {σ^(N×H), μ^(N×H), A^(N×H)}
Outputs: LTC closed-form approximation of the hidden-state neurons, x̂^(N×T)(t)
xpre(t) = WAdj × [I0…IN, x0…xH]  ▷ all presynaptic signals to the nodes
for the ith neuron in neurons 1 to H do
  for j in synapses to the ith neuron do
    x̂i += (x0 − Aij) e^{−tx(t) ⊙ [1/τi + 1/(1 + e^{−σij (xpre_ij − μij)})]} ⊙ 1/(1 + e^{σij (xpre_ij − μij)}) + Aij
  end for
end for

Experimental details for the Walker2D dataset
This task is designed based on the Walker2d-v2 OpenAI gym59 environment using data from four different stochastic policies. The objective is to predict the physics state in the next time step. The training and testing sequences are provided at irregularly sampled intervals. We report the squared error on the test set as a metric.

Description of the event-based MNIST experiment
We first sequentialize each image by transforming each 28 × 28 image into a long series of length 784. The objective is to predict the class corresponding to each image from the long input sequence. Advanced sequence modelling frameworks such as coRNN57, Lipschitz RNN58 and mixed memory ODE-LSTM9 can solve this task very well with accuracy of up to 99.0%. However, we aim to make the task even more challenging by sparsifying the input vectors with event-like irregularly sampled mechanisms. To this end, in each vector input (that is, flattened image), we transform each consecutive occurrence of values into one event. For instance, within the long binary vector of an image, the sequence 1, 1, 1, 1 is transformed to (1, t = 4) (ref. 9). This way, sequences of length 784 are condensed into event-based irregularly sampled sequences of length 256 that are far more challenging to handle than equidistant input signals. A recurrent model now has to learn to memorize input information of length 256 while keeping track of the time lags between the events.
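As a small illustration of this encoding (a sketch only; the paper's actual preprocessing pipeline may differ in details such as how time stamps are normalized), the run-length compression described above can be written as:

```python
def to_events(bits):
    """Run-length encode a binary sequence into (value, duration) events,
    e.g. [1, 1, 1, 1] -> [(1, 4)], as in the event-based sMNIST encoding above."""
    events = []
    for b in bits:
        if events and events[-1][0] == b:
            events[-1] = (b, events[-1][1] + 1)   # extend the current run
        else:
            events.append((b, 1))                 # start a new run
    return events

print(to_events([0, 0, 1, 1, 1, 1, 0]))  # [(0, 2), (1, 4), (0, 1)]
```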
8. Gholami, A., Keutzer, K. & Biros, G. ANODE: unconditionally accurate memory-efficient gradients for neural ODEs. In Proceedings of the 28th International Joint Conference on Artificial Intelligence 730–736 (IJCAI, 2019).
9. Lechner, M. & Hasani, R. Learning long-term dependencies in irregularly-sampled time series. Preprint at https://arxiv.org/abs/2006.04418 (2020).
10. Prince, P. J. & Dormand, J. R. High order embedded Runge–Kutta formulae. J. Comput. Appl. Math. 7, 67–75 (1981).
11. Raissi, M., Perdikaris, P. & Karniadakis, G. E. Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378, 686–707 (2019).
12. Massaroli, S., Poli, M., Park, J., Yamashita, A. & Asma, H. Dissecting neural ODEs. In Proc. of 33rd Conference on Neural Information Processing Systems (Eds. Larochelle, H. et al.) (NeurIPS, 2020).
13. Bai, S., Kolter, J. Z. & Koltun, V. Deep equilibrium models. Adv. Neural Inform. Process. Syst. 32, 690–701 (2019).
14. Finlay, C., Jacobsen, J.-H., Nurbekyan, L. & Oberman, A. M. How to train your neural ODE: the world of Jacobian and kinetic regularization. In International Conference on Machine Learning (Eds. Daumé III, H. & Singh, A.) 3154–3164 (PMLR, 2020).
15. Massaroli, S. et al. Stable neural flows. Preprint at https://arxiv.org/abs/2003.08063 (2020).
16. Kidger, P., Chen, R. T. & Lyons, T. "Hey, that's not an ODE": faster ODE adjoints via seminorms. In Proceedings of the 38th International Conference on Machine Learning (Eds. Meila, M. & Zhang, T.) 139 (PMLR, 2021).
17. Poli, M. et al. Hypersolvers: toward fast continuous-depth models. In Proc. of Advances in Neural Information Processing Systems (Eds. Larochelle, H.) 21105–21117 (NeurIPS, 2020).
18. Schumacher, J., Haslinger, R. & Pipa, G. Statistical modeling approach for detecting generalized synchronization. Phys. Rev. E 85, 056215 (2012).
19. Moran, R., Pinotsis, D. A. & Friston, K. Neural masses and fields in dynamic causal modeling. Front. Comput. Neurosci. 7, 57 (2013).
20. Friston, K. J., Harrison, L. & Penny, W. Dynamic causal modelling. Neuroimage 19, 1273–1302 (2003).
21. Perko, L. Differential Equations and Dynamical Systems (Springer-Verlag, 1991).
22. Lechner, M. et al. Neural circuit policies enabling auditable autonomy. Nat. Mach. Intell. 2, 642–652 (2020).
23. Hochreiter, S. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Technische Universität München 91 (1991).
24. Vorbach, C., Hasani, R., Amini, A., Lechner, M. & Rus, D. Causal navigation by continuous-time neural networks. In Proc. of Advances in Neural Information Processing Systems (Eds. Ranzato, M. et al.) 12425–12440 (NeurIPS, 2021).
25. Hasani, R. et al. Response characterization for auditing cell dynamics in long short-term memory networks. In Proc. of 2019 International Joint Conference on Neural Networks 1–8 (IEEE, 2019).
26. Anguita, D., Ghio, A., Oneto, L., Parra Perez, X. & Reyes Ortiz, J. L. A public domain dataset for human activity recognition using smartphones. In Proc. of the 21st International European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning 437–442 (i6doc, 2013).
27. Todorov, E., Erez, T. & Tassa, Y. MuJoCo: a physics engine for model-based control. In Proc. of 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems 5026–5033 (IEEE, 2012).
28. Maas, A. et al. Learning word vectors for sentiment analysis. In Proc. of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies 142–150 (ACM, 2011).
29. Lu, L., Jin, P., Pang, G., Zhang, Z. & Karniadakis, G. E. Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nat. Mach. Intell. 3, 218–229 (2021).
30. Karniadakis, G. E. et al. Physics-informed machine learning. Nat. Rev. Phys. 3, 422–440 (2021).
31. Wang, S., Wang, H. & Perdikaris, P. Learning the solution operator of parametric partial differential equations with physics-informed DeepONets. Sci. Adv. 7, eabi8605 (2021).
32. Rezende, D. & Mohamed, S. Variational inference with normalizing flows. In Proc. of International Conference on Machine Learning (Eds. Bach, F. & Blei, D.) 1530–1538 (PMLR, 2015).
33. Gu, A., Goel, K. & Re, C. Efficiently modeling long sequences with structured state spaces. In Proc. of International Conference on Learning Representations (2022). https://openreview.net/forum?id=uYLFoz1vlAC
34. Hasani, R. et al. Liquid structural state-space models. Preprint at https://arxiv.org/abs/2209.12951 (2022).
35. Grunbacher, S. et al. On the verification of neural ODEs with stochastic guarantees. Proc. AAAI Conf. Artif. Intell. 35, 11525–11535 (2021).
36. Vaswani, A. et al. Attention is all you need. In Proc. of Advances in Neural Information Processing Systems (Eds. Guyon, I. et al.) 5998–6008 (NeurIPS, 2017).
37. Lechner, M., Hasani, R., Grosu, R., Rus, D. & Henzinger, T. A. Adversarial training is not ready for robot learning. In 2021 IEEE International Conference on Robotics and Automation (ICRA) 4140–4147 (IEEE, 2021).
38. Brunnbauer, A. et al. Latent imagination facilitates zero-shot transfer in autonomous racing. In 2022 International Conference on Robotics and Automation (ICRA) 7513–7520 (IEEE, 2021).
39. Hasani, R. M., Haerle, D. & Grosu, R. Efficient modeling of complex analog integrated circuits using neural networks. In Proc. of 12th Conference on Ph.D. Research in Microelectronics and Electronics 1–4 (IEEE, 2016).
40. Wang, G., Ledwoch, A., Hasani, R. M., Grosu, R. & Brintrup, A. A generative neural network model for the quality prediction of work in progress products. Appl. Soft Comput. 85, 105683 (2019).
41. DelPreto, J. et al. Plug-and-play supervisory control using muscle and brain signals for real-time gesture and error detection. Auton. Robots 44, 1303–1322 (2020).
42. Hasani, R. Interpretable Recurrent Neural Networks in Continuous-Time Control Environments. PhD dissertation, Technische Univ. Wien (2020).
43. Rudin, W. Principles of Mathematical Analysis 3rd edn (McGraw-Hill, 1976).
44. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
45. Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. Preprint at https://arxiv.org/abs/1412.3555 (2014).
46. Shukla, S. N. & Marlin, B. Interpolation–prediction networks for irregularly sampled time series. In Proc. of International Conference on Learning Representations (2018). https://openreview.net/forum?id=r1efr3C9Ym
47. Horn, M., Moor, M., Bock, C., Rieck, B. & Borgwardt, K. Set functions for time series. In Proc. of International Conference on Machine Learning (Eds. Daumé III, H. & Singh, A.) 4353–4363 (PMLR, 2020).
48. Funahashi, K.-i & Nakamura, Y. Approximation of dynamical systems by continuous time recurrent neural networks. Neural Netw. 6, 801–806 (1993).
49. Mozer, M. C., Kazakov, D. & Lindsey, R. V. Discrete event, continuous time RNNs. Preprint at https://arxiv.org/abs/1710.04110 (2017).
50. Mei, H. & Eisner, J. The neural Hawkes process: a neurally self-modulating multivariate point process. In Proc. of 31st International Conference on Neural Information Processing Systems (Eds. Guyon, I. et al.) 6757–6767 (Curran Associates Inc., 2017).
51. Che, Z., Purushotham, S., Cho, K., Sontag, D. & Liu, Y. Recurrent neural networks for multivariate time series with missing values. Sci. Rep. 8, 1–12 (2018).
52. Neil, D., Pfeiffer, M. & Liu, S.-C. Phased LSTM: accelerating recurrent network training for long or event-based sequences. In Proc. of 30th International Conference on Neural Information Processing Systems (Eds. Lee, D. D. et al.) 3889–3897 (Curran Associates Inc., 2016).
53. Schuster, M. & Paliwal, K. K. Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45, 2673–2681 (1997).
54. Voelker, A. R., Kajić, I. & Eliasmith, C. Legendre memory units: continuous-time representation in recurrent neural networks. In Proceedings of the 33rd International Conference on Neural Information Processing Systems (Eds. Wallach, H. et al.) 15570–15579 (ACM, 2019).
55. Gu, A., Dao, T., Ermon, S., Rudra, A. & Ré, C. HiPPO: recurrent memory with optimal polynomial projections. In Proc. of Advances in Neural Information Processing Systems (Eds. Larochelle, H. et al.) 1474–1487 (NeurIPS, 2020).
56. Lezcano-Casado, M. & Martınez-Rubio, D. Cheap orthogonal constraints in neural networks: a simple parametrization of the orthogonal and unitary group. In Proc. of International Conference on Machine Learning (Eds. Chaudhuri, K. & Salakhutdinov, R.) 3794–3803 (PMLR, 2019).
57. Rusch, T. K. & Mishra, S. Coupled oscillatory recurrent neural network (coRNN): an accurate and (gradient) stable architecture for learning long time dependencies. In Proc. of International Conference on Learning Representations (2021). https://openreview.net/forum?id=F3s69XzWOia
58. Erichson, N. B., Azencot, O., Queiruga, A., Hodgkinson, L. & Mahoney, M. W. Lipschitz recurrent neural networks. In Proc. of International Conference on Learning Representations (2021). https://openreview.net/forum?id=-N7PBXqOUJZ
59. Brockman, G. et al. OpenAI gym. Preprint at https://arxiv.org/abs/1606.01540 (2016).
60. Lechner, M., Hasani, R., Zimmer, M., Henzinger, T. A. & Grosu, R. Designing worm-inspired neural networks for interpretable robotic control. In Proc. of International Conference on Robotics and Automation 87–94 (IEEE, 2019).
61. Tylkin, P. et al. Interpretable autonomous flight via compact visualizable neural circuit policies. IEEE Robot. Autom. Lett. 7, 3265–3272 (2022).
62. Amini, A. et al. Vista 2.0: an open, data-driven simulator for multimodal sensing and policy learning for autonomous vehicles. In 2022 International Conference on Robotics and Automation (ICRA) 2419–2426 (IEEE, 2022).
63. Amini, A. et al. Learning robust control policies for end-to-end autonomous driving from data-driven simulation. IEEE Robot. Autom. Lett. 5, 1143–1150 (2020).
64. Levine, S. & Koltun, V. Guided policy search. In Proc. of International Conference on Machine Learning (Eds. Dasgupta, S. & McAllester, D.) 1–9 (PMLR, 2013).
65. Bojarski, M. et al. VisualBackProp: efficient visualization of CNNs for autonomous driving. In Proc. of IEEE International Conference on Robotics and Automation 1–8 (IEEE, 2018).
66. Zhang, H., Wang, Z. & Liu, D. A comprehensive review of stability analysis of continuous-time recurrent neural networks. IEEE Trans. Neural Netw. Learn. Syst. 25, 1229–1262 (2014).
67. Weinan, E. A proposal on machine learning via dynamical systems. Commun. Math. Stat. 5, 1–11 (2017).
68. Lu, Z., Pu, H., Wang, F., Hu, Z. & Wang, L. The expressive power of neural networks: a view from the width. In Proc. of Advances in Neural Information Processing Systems Vol. 30 (Eds. Guyon, I. et al.) (Curran Associates, Inc., 2017).
69. Li, Q., Chen, L., Tai, C. et al. Maximum principle based algorithms for deep learning. J. Mach. Learn. Res. 18, 5998–6026 (2018).
70. Cohen, M. A. & Grossberg, S. Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Trans. Syst. Man Cybern. 5, 815–826 (1983).
71. Mathieu, E. & Nickel, M. Riemannian continuous normalizing flows. In Proc. of Advances in Neural Information Processing Systems Vol. 33 (Eds. Larochelle, H. et al.) 2503–2515 (Curran Associates, Inc., 2020).
72. Hodgkinson, L., van der Heide, C., Roosta, F. & Mahoney, M. W. Stochastic normalizing flows. In Proc. of Advances in Neural Information Processing Systems (Eds. Larochelle, H. et al.) 5933–5944 (NeurIPS, 2020).
73. Haber, E., Lensink, K., Treister, E. & Ruthotto, L. IMEXnet: a forward stable deep neural network. In Proc. of International Conference on Machine Learning (Eds. Chaudhuri, K. & Salakhutdinov, R.) 2525–2534 (PMLR, 2019).
74. Chang, B., Chen, M., Haber, E. & Chi, E. H. AntisymmetricRNN: a dynamical system view on recurrent neural networks. In International Conference on Learning Representations (2018). https://openreview.net/forum?id=ryxepo0cFX
75. Lechner, M., Hasani, R., Rus, D. & Grosu, R. Gershgorin loss stabilizes the recurrent neural network compartment of an end-to-end robot learning scheme. In Proc. of IEEE International Conference on Robotics and Automation 5446–5452 (IEEE, 2020).
76. Gleeson, P., Lung, D., Grosu, R., Hasani, R. & Larson, S. D. c302: a multiscale framework for modelling the nervous system of Caenorhabditis elegans. Philos. Trans. R. Soc. B 373, 20170379 (2018).
77. Li, X., Wong, T.-K. L., Chen, R. T. & Duvenaud, D. Scalable gradients for stochastic differential equations. In Proc. of International Conference on Artificial Intelligence and Statistics 3870–3882 (PMLR, 2020).
78. Shukla, S. N. & Marlin, B. M. Multi-time attention networks for irregularly sampled time series. In International Conference on Learning Representations (2020). https://openreview.net/forum?id=4c0J6lwQ4_
79. Xiong, Y. et al. Nyströmformer: a Nyström-based algorithm for approximating self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence Vol. 35, 14138–14148 (AAAI, 2021).

Acknowledgements
This research was supported in part by the AI2050 program at Schmidt Futures (grant G-22-63172), the Boeing Company, and the United States Air Force Research Laboratory and the United States Air Force Artificial Intelligence Accelerator, and was accomplished under cooperative agreement number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes, notwithstanding any copyright notation herein. This work was further supported by The Boeing Company and Office of Naval Research grant N00014-18-1-2830. M.T. is supported by the Poul Due Jensen Foundation, grant 883901. M.L. was supported in part by the Austrian Science Fund under grant Z211-N23 (Wittgenstein Award). A.A. was supported by the National Science Foundation Graduate Research Fellowship Program. We thank T.-H. Wang, P. Kao, M. Chahine, W. Xiao, X. Li, L. Yin and Y. Ben for useful suggestions and for testing of CfC models to confirm the results across other domains.
Extended Data Fig. 1 | Instantiation of LTCs in ODE and closed-form representations. a) A sample LTC network with two nodes and five synapses. b) the ODE
representation of this two-neuron system. c) the approximate closed-form representation of the network.
Extended Data Fig. 2 | Closed-form Continuous-depth neural architecture. A backbone neural network layer delivers the input signals into three head networks g, f
and h. f acts as a liquid time-constant for the sigmoidal time-gates of the network. g and h construct the nonlinearities of the overall CfC network.
Extended Data Fig. 3 | Hyperparameters for Human activity and Walker. List of hyperparameters used to obtain results in Human activity and Walker2D
Experiments.
Extended Data Fig. 4 | Hyperparameters for ET-sMNIST and Bit-stream XOR. List of hyperparameters used to obtain results in Event-based MNIST and Bit-stream
XOR Experiments.
Extended Data Fig. 5 | Bit-stream XOR sequence classification. The performance values (accuracy %) for all baseline models are reproduced from ref. 9. Numbers present mean ± standard deviations (n=5). Note: the performance of models marked by † is reported from ref. 9. Bold declares highest accuracy and best time per epoch (min).
Extended Data Fig. 6 | PhysioNet. AUC stands for area under the curve. Numbers present mean ± standard deviations (n=5). Note: the performance of the models marked by † is reported from ref. 7 and that of the ones marked with * from ref. 78. Bold declares highest AUC score and best time per epoch (min).
Extended Data Fig. 7 | Hyperparameters for Physionet and IMDB. List of hyperparameters used to obtain results in Physionet and IMDB sentiment classification
experiments.
Extended Data Fig. 8 | Results on the IMDB dataset. The experiment is performed without any pretraining or pretrained word-embeddings. Thus, we excluded advanced attention-based models78,79 such as Transformers36 and RNN structures that use pretraining. Numbers present mean ± standard deviations (n=5). Note: the performance of the models marked by † is reported from ref. 55, and that of the ones marked with * from ref. 57. The n/a standard deviation denotes that the original report of these experiments did not provide the statistics of their analysis. Bold declares highest accuracy and best time per epoch (min).
Extended Data Fig. 9 | Lane-keeping models’ parameter count. CfC and NCP networks perform lane-keeping in unseen scenarios with a compact representation.
Extended Data Fig. 10 | Attention profile of networks. Trained networks receive unseen inputs (first column in each tab) and generate acceleration and steering commands. We use the Visual-Backprop algorithm65 to compute the saliency maps of the convolutional part of each network. a) results for networks tested on data collected in summer. b) results for networks tested on data collected in winter. c) results for inputs corrupted by a zero-mean Gaussian noise with variance σ² = 0.35.