Timemachine: A Time Series Is Worth 4 Mambas For Long-Term Forecasting
Timemachine: A Time Series Is Worth 4 Mambas For Long-Term Forecasting
Long-term Forecasting
Md Atik Ahameda, * and Qiang Chenga,b, **
a Department of Computer Science, University of Kentucky
b Institute for Biomedical Informatics, University of Kentucky
Abstract. Long-term time-series forecasting remains challenging Recently, state-space models (SSMs) [10, 11, 12, 13, 25] have
due to the difficulty in capturing long-term dependencies, achieving emerged as powerful engines for sequence-based inference and have
linear scalability, and maintaining computational efficiency. We in- attracted growing research interest. These models are capable of
troduce TimeMachine, an innovative model that leverages Mamba, inferring over very long sequences and exhibit distinctive proper-
a state-space model, to capture long-term dependencies in multivari- ties, including the ability to capture long-range correlations with lin-
arXiv:2403.09898v1 [cs.LG] 14 Mar 2024
ate time series data while maintaining linear scalability and small ear complexity and context-aware selectivity with hidden attention
memory footprints. TimeMachine exploits the unique properties of mechanisms [11, 2]. SSMs have demonstrated great potential in var-
time series data to produce salient contextual cues at multi-scales ious domains, including genomics [11], tabular learning [1], graph
and leverage an innovative integrated quadruple-Mamba architecture data [3], and images [22], yet they remain unexplored for LTSF.
to unify the handling of channel-mixing and channel-independence The under-utilization of SSMs in LTSF can be attributed to two
situations, thus enabling effective selection of contents for predic- main reasons. First, highly content- and context-selective SSMs have
tion against global and local contexts at different scales. Experimen- only been recently developed [11]. Second, and more importantly, ef-
tally, TimeMachine achieves superior performance in prediction ac- fectively representing the context in time series data remains a chal-
curacy, scalability, and memory efficiency, as extensively validated lenge. Many Transformer-based models, such as Autoformer [29]
using benchmark datasets. and Informer [36], regard each time point as a token in a sequence,
Code availability: https://github.com/Atik-Ahamed/TimeMachine while more recent models like PatchTST [24] and iTransformer [21]
leverage patches of the time series as tokens. However, our empiri-
cal experiments on real-world MTS data suggest that directly utiliz-
1 Introduction ing SSMs for LTSF by using either time points or patches as tokens
could hardly achieve performance comparable to Transformer-based
Long-term time-series forecasting (LTSF) is essential in various models. Considering the particular characteristics of MTS data, it is
tasks across diverse fields, such as weather forecasting, anomaly de- essential to extract more salient contextual cues tailored to SSMs.
tection, and resource planning in energy, agriculture, industry, and MTS data typically have many channels with each variate corre-
defense. Although numerous approaches have been developed for sponding to a channel. Many models, such as Informer [36], FED-
LTSF, they typically can achieve only one or two desired properties former [37], and Autoformer [29], handle MTS data to extract use-
such as capturing long-term dependencies in multivariate time series ful representations in a channel-mixing way, where the MTS in-
(MTS), linear scalability in the amount of model parameters with put is treated as a two-dimensional matrix whose size is the num-
respect to data, and computational efficiency or applicability in edge ber of channels multiplied by the length of history. Nonetheless, re-
computing. It is still challenging to achieve these desirable properties cently a few works such as PatchTST [23] and TiDE [6] have shown
simultaneously. that a channel-independence way for handling MTS may achieve
Capturing long-term dependencies, which are generally abundant SOTA accuracy, where each channel is input to the model as a one-
in MTS data, is pivotal to LTSF. While linear models such as DLin- dimensional vector independent of the other channels. We believe
eaer [33] and TiDE [6] achieve competitive performance with linear that these two ways of handling LTSF need to be adopted as per
complexity and scalability, with accuracy on par with Transformer- the characteristics of the MTS data, rather than using a one-size-fits-
based models, they usually rely on MLPs and linear projections that all approach. When there are strong between-channel correlations,
may not well capture long-range correlations [4]. Transformer-based channel mixing usually can help capture such dependencies; other-
models such as iTransformer [35], PatchTST [23], and Crossformer wise, channel independence is a more sensible choice. Therefore, it is
[35] have a strong ability to capture long-range dependencies and necessary to design a unified architecture applicable to both channel-
superior performance in LTSF accuracy, thanks to the self-attention mixing and channel-independence scenarios.
mechanisms in Transformers [28]. However, they typically suffer Moreover, time series data exhibit a unique property – Tempo-
from the quadratic complexity [6], limiting their scalability and ap- ral relations are largely preserved after downsampling into two sub-
plicability, e.g., in edge computing. sequences. Few methods such as Scinet [19] have explored this prop-
erty in designing their models; however, it is under-utilized in other
∗ Email: atikahamed@uky.edu
∗∗
approaches. Due to the high redundancy of MTS values at consecu-
Email: qiang.cheng@uky.edu, Corresponding Author.
tive time points, directly using time points as tokens may have redun- CNN-based methods, such as TimesNet [30] and Scinet [19], uti-
dant values obscure context-based selection and, more importantly, lize convolutional filters to extract valuable temporal features and
overlook long-range dependencies. Rather than relying on individual model complex temporal dynamics. These approaches exhibit highly
time points, using patches may provide contextual clues within each competitive performance, often comparable to or even occasionally
time window of a patch length. However, a pre-defined small patch outperforming more sophisticated Transformer-based models.
length only provides contexts at a fixed temporal or frequency reso- Transformer-based Supervised Learning Methods, such as
lution, whereas long-range contexts may span different patches. To iTransformer [21], PatchTST [23], Crossformer [35], FEDformer
best capture long-range dependencies, it is sensible to supply multi- [37], stationary [20], and autoformer [29], have gained popularity
scale contexts and, at each scale, automatically produce global-level for LTSF due to their superior accuracy. These methods convert time
tokens as contexts similar to iTransformer [21] that tokenizes the series to token sequences and leverage the self-attention mechanism
whole look-back window. Further, while models like Transformer to discover dependencies between arbitrary time steps, making them
and the selective SSMs [11] have the ability to select sub-token con- particularly effective for modeling complex temporal relationships.
tents, such ability is limited in the channel-independence case, for They may also exploit Transformers’ ability to process data in paral-
which local contexts need to be enhanced when leveraging SSMs for lel, enabling long-term dependency discovery sometimes with even
LTSF. linear scalability. Despite their distinctive advantages, these methods
In this paper, we introduce a novel approach that effectively cap- typically have quadratic time and memory complexity due to point-
tures long-range dependencies in time series data by providing sensi- wise correlations in self-attention mechanisms.
ble multi-scale contexts and particularly enhancing local contexts in Self-Supervised Representation Learning Models: Self-
the channel-independence situation. Our model, built upon a selec- supervised learning has been leveraged to learn useful representa-
tive scan SSM called Mamba [11], serves as a core inference engine tions of MTS for downstream tasks, using non-Transformer-based
with a strong ability to capture long-range dependencies in MTS data models for time series [31, 9, 26, 32], and Transformer-based models
while maintaining linear scalability and small memory footprints. such as time series Transformer (TST) and TS-TCC [34, 7, 27].
The proposed model exploits the unique property of time series data Currently, Transformer-based self-supervised models have not yet
in a bottom-up manner by producing contextual cues at two scales achieved performance on par with supervised learning approaches
through consecutive resolution reduction or downsampling using lin- [27]. This paper focuses on LTSF in a supervised learning setting.
ear mapping. The first level operates at a high resolution, while the
second level works at a low resolution. At each level, we employ two
Mamba modules to glean contextual cues from global perspectives 3 Proposed Method
for the channel-mixing case and from both global and local perspec-
tives for the channel-independence case. In this section, we describe each component of our proposed
In summary, our major contributions are threefold: architecture and how we use our model to solve the LTSF problem.
• We develop an innovative model called TimeMachine that is the Assume a collection of MTS samples is given, denoted by dataset
first to leverage purely SSM modules to capture long-term depen- D, which comprises an input sequence x = [x1 , . . . , xL ], with each
dencies in multivariate time series data for context-aware predic- xt ∈ RM representing a vector of M measurements at time point t.
tion, with linear scalability and small memory footprints superior The sequence length L is also known as the look-back window. The
or comparable to linear models. goal is to predict T future values, denoted by [xL+1 , . . . , xL+T ].
• Our model constitutes an innovative architecture that unifies the The architecture of our proposed model, referred to as TimeMachine,
handling of channel-mixing and channel-independence situations is depicted in Figure 1. The pillars of this architecture consist of four
with four SSM modules, exploiting potential between-channel Mambas, which are employed in an integrated way to tap contextual
correlations. Moreover, our model can effectively select contents cues from MTS. This design choice enables us to harness Mamba’s
for prediction against global and local contextual information, at robust capabilities of inferring sequential data for LTSF.
different scales in the MTS data.
• Experimentally, TimeMachine achieves superior performance in Normalization: Before feeding the data to our model, we normalize
(0) (0)
prediction accuracy, scalability, and memory efficiency. We exten- the original MTS x into x0 = [x1 , · · · , xL ] ∈ RM ×L , via
(0)
sively validate the model using standard benchmark datasets and x = N ormalize(x). Here, N ormalize(·) represents a nor-
perform rigorous ablation studies to demonstrate its effectiveness. malization operation with two different options. The first is to use
the reversible instance normalization (RevIN) [16], which is also
adopted in PatchTST [23]. The second option is to employ regular
2 Related Works (0)
Z-score normalization: xi,j = (xi,j − mean(xi,j ))/σj , where
Numerous methods for LTSF have been proposed, which can be σj is the standard deviation for channel j, with j = 1, · · · , M .
grouped into three main categories: non-Transformer-based super- Empirically we find that RevIN is often more helpful compared to
vised approaches, Transformer-based supervised learning models, Z-score. Apart from normalizing the data in the forward pass of our
and self-supervised representation learning models. approach, in experiments, we also follow the standardization process
of the data when compared with baseline methods.
Non-Transformer-based Supervised Approaches include classi-
cal methods like ARIMA, VARMAX, GARCH [5], and RNN [15], Channel Mixing vs. Channel Independence: Our model can handle
as well as deep learning-based methods that achieve state-of-the- both channel independence and channel mixing cases. In channel in-
art (SOTA) performance using multi-layer perceptrons (MLPs) and dependence, each channel is processed independently by our model,
convolutional neural networks (CNNs). MLP-based models, such as while in channel mixing, the MTS sequence is processed with mul-
DLinear [33], TiDE [6], and RLinear [18], leverage the simplicity of tiple channels combined throughout our architecture. Regardless of
linear structures to achieve complexity and scalability. the case, our model accepts input of the shape BM L and produces
Element-wise addition Concatenation Transposable
Figure 1: Schematic diagram of our proposed methodology, TimeMachine. Our method incorporates a configuration of four Mambas, with
two specialized Mambas capable of processing the transposed signal data in each branch. To the left, an example of the TimeSeries signals is
depicted, while the right side offers a detailed zoomed-in view of a Mamba’s structure. Mambas are capable of accepting an input of shape
BM ni while providing the same shape as the output. In our method i ∈ {1, 2}.
the desired output of the shape BM T , eliminating the need for addi-
Integrated Quadruple Mambas: With the two processed embedded
tional manual pre-processing.
representations from E1 , E2 , we can now learn more comprehensive
Channel independence has been proven effective in reducing over-
representations by leveraging Mamba, a type of SSM with selective
fitting by PatchTST [23]. We found this strategy helpful for datasets
scan ability. At each embedding level, we employ a pair of Mambas
with a smaller number of channels. However, for datasets with a
to capture long-term dependencies within the look-back samples and
number of channels comparable to the look-back, channel mixing
provide sufficient local contexts. Denote the input to one of the four
is more effective in capturing the correlations among channels and
Mamba blocks by u, which is either DO(x(1) ) obtained after E1 and
reaching the desired minimum loss during training.
the subsequent dropout layer for the two outer Mambas, or DO(x(2) )
Our architecture is robust and versatile, capable of benefiting
obtained after E2 and the subsequent dropout layer for the two inner
from potentially strong inter-channel correlations in channel-mixing
Mambas (Figure 1). The input tensors may be reshaped per channel
case and exploiting independence in channel-independence case.
mixing or channel independence cases as described previously.
When dealing with channel independence, we reshape the input
Inside a Mamba block, two fully-connected layers in two branches
from BM L to (B × M )1L after the normalization step. The
calculate linear projections. The output of the linear mapping in the
reshaped input is then processed throughout the network and later
first branch passes through a 1D causal convolution and SiLU acti-
merged to provide an output shape of BM T . In contrast, for channel
vation S(·) [8], then a structured SSM. The continuous-time SSM
mixing, no reshaping is necessary. The channels are kept together
maps an input function or sequence u(t) to output v(t) through a
and processed throughout the network.
latent state h(t):
Embedded Representations: Before processing the input sequence dh(t)/dt = A h(t) + B u(t), v(t) = C h(t), (2)
with Mambas, we provide two-stage embedded representations of the
input sequence with length L by E1 and E2 : where h(t) is N -dimensional, with N also known as a state expan-
sion factor, u(t) is D-dimensional, with D being the dimension fac-
x(1) = E1 (x(0) ), x(2) = E2 (DO(x(1) )), (1) tor for an input token, v(t) is an output of dimension D, and A, B,
and C are coefficient matrices of proper sizes. This dynamic system
where DO stands for the dropout operation, and the embedding induces a discrete SSM governing state evolution and outputs given
operations E1 : RM ×L → RM ×n1 and E2 : RM ×n1 → RM ×n2 the input token sequence through time sampling at {k∆} with a ∆
are achieved through MLPs. Thus, for the channel mixing case, time interval. This discrete SSM is
the batch-formed tensors will have the following changes in size:
BM n1 ← E1 (BM L), and BM n2 ← E2 (BM n1 ). This enables hk = Ā hk−1 + B̄ uk , v k = C hk , (3)
us to deal with the fixed-length tokens of n1 and n2 regardless of the
variable input sequence length L, and both n1 and n2 are configured where hk , uk , and vk are respectively samples of h(t), u(t), and v(t)
to take values from the set {512, 256, 128, 64, 32} satisfying at time k∆,
n1 > n2 . Since MLPs are fully connected, we introduce dropouts to Ā = exp(∆A), B̄ = (∆A)−1 (exp(∆A) − I)∆B. (4)
reduce overfitting. Although we have the linear mappings (MLPs)
before Mambas, the performance of our model does not heavily For SSMs, diagonal A is often used. Mamba makes B, C, and ∆ lin-
rely on them, as demonstrated with the ablation study (see Section 5). ear time-varying functions dependent on the input. In particular, for a
token u, B, C ← LinearN (u), and ∆ ← sof tplus(parameter + Table 1: Overview of the characteristics of used benchmarking
LinearD (Linear1 (u))), where Linearp (u) is a linear projection datasets. Time points illustrate the total length of each dataset
to a p-dimensional space, and sof tplus activation function. Further- Dataset (D) Channels (M) Time Points Frequency
more, Mamba also has an option to expand the model dimension
Weather 21 52696 10 Minutes
factor D by a controllable dimension expansion factor E. Such co- Traffic 862 17544 Hourly
efficient matrices enable context and input selectivity properties [11] Electricity 321 26304 Hourly
to selectively propagate or forget information along the input token ETTh1 7 17420 Hourly
sequence based on the current token. Subsequently, the SSM output ETTh2 7 17420 Hourly
ETTm1 7 69680 15 Minutes
is multiplicatively modulated with the output from the second branch
ETTm2 7 69680 15 Minutes
before another fully connected projection. The second branch simply
consists of a linear mapping followed by a SiLU.
tions is verified by experimental results (see Supplementary Table 1).
Processed embedded representation with tensor size BM n1 is
Residual connections are demonstrated by arrows and element-wise
transformed by outer Mambas, while that with BM n2 is transformed
addition in our method (Figure 1).
by inner Mambas, as depicted in Figure 1. For the channel-mixing
To retain the information of both outer and inner pairs of Mam-
case, the whole univariate sequence of each channel is used as a token
bas we concatenate their representations before processing via P2 .
with dimension factor n2 for the inner Mambas. The outputs from
In summary, we concatenate the outputs of the L four Mambas with
the left-side and right-side inner Mambas, vL,k , vR,k ∈ Rn2 , are
(2) (3) a skip connection to have x(6) = x(5) ∥(x(4) x(1) ), where ∥ de-
element-wise added with xk to obtain xk for the k-th token, k = notes concatenation. Finally, the output y is obtained by applying P2
M ×n2
1, · · · , M . That is, by denoting vL = [vL,1 , · · · , vL,M
L] ∈ R to x(6) , i.e., y = P2 (x(6) ).
M ×n2 (3) L (2)
and similarly vR ∈ R , we have x = vL vR x ,
being element-wise addition. Then, x(3) is linearly mapped
L
with
to x(4) with P1 : x(3) → x(4) ∈ RM ×n1 . Similarly, the outputs 4 Result Analysis
∗ ∗
from the outer Mambas, vL,k , vR,k ∈ Rn1 are element-wise added
(5) M ×n1
In this segment, we present the main results of our experiments on
to obtain x ∈ R . widely recognized benchmark datasets for long-term MTS forecast-
For the channel independence case, the input is reshaped, ing. We also conduct extensive ablation studies to demonstrate the
BM L 7→ (B × M )1L, and the embedded representations become effectiveness of each component of our method.
(B × M )1n1 and (B × M )1n2 . One Mamba in each pair of outer
Mambas or inner Mambas considers the input dimension as 1 and the
token length as n1 or n2 , while the other Mamba learns with input 4.1 Datasets
dimension n2 or n1 and token length 1. This design enables learning We evaluate our model on seven benchmark datasets extensively
both global context and local context simultaneously. The outer and used for LTSF: Weather, Traffic, Electricity, and four ETT datasets
inner pairs of Mambas will extract salient features and context cues (ETTh1, ETTh2, ETTm1, ETTm2). Table 1 illustrates the relevant
are fine and coarse scales with high- and low-resolution, respectively. statistics of these datasets, highlighting that the Traffic and Electric-
Channel mixing is performed when the datasets contain a signifi- ity datasets notably large, with 862 and 321 channels, respectively,
cantly large number of channels, in particular, when the look-back L and tens of thousands of temporal points in each sequence. More de-
is comparable to the channel number M , taking the whole sequence tails on these datasets can be found in Wu et al. [29], Zhou et al.
as a token to better provide context cues. All four Mambas are used [36]. Focusing on long-term forecasting, we exclude the ILI dataset,
to capture the global context of the sequences at different scales which has a shorter temporal horizon, similar to Das et al. [6]. We
and learn from the channel correlations. This helps stabilize the demonstrate the superiority of our model in two parts: quantitative
training and reduce overfitting with large M . To switch between the (main results) and qualitative results. For a fair comparison, we used
channel-independence and channel-mixing cases, the input sequence the code from PatchTST [23] 1 and iTransformer [21] 2 , and we took
is simply transposed, with one Mamba in each branch processing the the results for the baseline methods from iTransformer [21].
transposed input, as demonstrated in Figure 1.
These integrated Mamba blocks empower our model for content-
dependent feature extraction and reasoning with long-range 4.2 Experimental Environment
dependencies and feature interactions.
All experiments were conducted using the PyTorch framework [24]
with NVIDIA 4XV100 GPUs (32GB each). The model was opti-
Output Projection: After receiving the output tokens from the
mized using the ADAM algorithm [17] with L2 loss. The batch size
Mambas, our goal is to project these tokens to generate predictions
varied depending on the dataset, but the training was consistently set
with the desired sequence length. To accomplish this task, we utilize
to 100 epochs. We measure the prediction errors using mean square
two MLPs, P1 and P2 , which output n1 and T time points, respec-
error (MSE) and mean absolute error (MAE) metrics, where smaller
tively, with each point having M channels. Specifically, projector
values indicate better prediction accuracy.
P1 performs a mapping RM ×n2 → RM ×n1 , as discussed above
for obtaining x(4) . Subsequently, projector P2 performs a mapping
Baseline Models: We compared our model, TimeMachine, with
RM ×2n1 → RM ×T , transforming the concatenated output from the
11 SOTA models, including iTransformer [21], PatchTST [23],
Mambas into the final predictions. The use of a two-stage output pro-
DLinear [33], RLinear [18], Autoformer [29], Crossformer [35],
jection via P1 and P2 symmetrically aligns with the two-stage em-
TiDE [6], Scinet [19], TimesNet [30], FEDformer [37], and Station-
bedded representation obtained through E1 and E2 .
ary [20]. Although another variant of SSMs, namely S4 [12], exists,
In addition to the token transformation, we also employ residual
connections. One residual connection is added before P1 , and an- 1 https://github.com/yuqinie98/PatchTST
2
other is added after P1 . The effectiveness of these residual connec- https://github.com/thuml/iTransformer
we do not include it in our comparison because TiDE [6] has already 3
iTransformer
demonstrated superior performance over S4. Ground Truth
2 TimeMachine
Observation
1
4.3 Quantitative Results
We demonstrate TimeMachine’s performance in supervised long- 0
Observation
tion 5. Table 2 clearly shows that our method demonstrates supe- 0.7
rior performance compared to all the strong baselines in almost all 0.8
datasets. Moreover, iTransformer [21] has significantly better perfor- 0.9
mance than other baselines on the Traffic and Electricity datasets, 1.0
which contain a large number of channels. Our method also demon- 1.1
0 20 40 60 80 100
strates comparable or superior performance on these two datasets, Time points
outperforming the existing strong baselines by a large margin. This (b) Traffic
demonstrates the effectiveness of our method in handling LTSF tasks Figure 3: Qualitative comparison between TimeMachine and second-
with varying number of channels and datasets. best-performing methods from Table 2. Observations are demon-
In addition to Table 2, we conducted experiments with TimeMa- strated from the test set for the case of L = 96 and T = 720 with a
chine using different look-back windows L = {192, 336, 720}. Ta- randomly selected channel and window frame of 100 time points.
ble 3 and Supplementary Table 2 demonstrate TimeMachine’s perfor- iTransformer [21] paper. To ensure a fair comparison, we set the ex-
mance under these settings. An examination of these tables reveals perimental settings for our method similar to those of iTransformer.
that the implementation of extended look-back windows markedly The results clearly show very small memory footprints compared
enhances the performance of our method across the majority of the to SOTA baselines. Specifically for Traffic, our method consumes a
datasets examined. This also demonstrates TimeMachine’s capability very similar amount of memory to the DLinear [33] method. More-
for handling longer look-back windows while maintaining consistent over, our method is capable of handling longer look-back windows
performance. with a relatively linear increase in the number of learnable param-
eters, as demonstrated in Supplementary Figure 4 for two datasets.
Electricity This is due to the robustness of our method, where E1 is only depen-
Tra
0.17 ffic dent on the input sequence length L, and the rest of the networks are
0.43 relatively independent of L, leading to a highly scalable model.
h1
ETT
0.42
0.24
follow the actual trend in the predicted future time horizon for the test
0.27
set. In the case of the Electricity dataset, there is a clear difference
0.38
m2
192 0.211 0.250 0.221 0.254 0.240 0.271 0.225 0.259 0.206 0.277 0.242 0.298 0.219 0.261 0.237 0.296 0.261 0.340 0.276 0.336 0.245 0.285 0.307 0.367
336 0.256 0.290 0.278 0.296 0.292 0.307 0.278 0.297 0.272 0.335 0.287 0.335 0.280 0.306 0.283 0.335 0.309 0.378 0.339 0.380 0.321 0.338 0.359 0.395
720 0.342 0.343 0.358 0.349 0.364 0.353 0.354 0.348 0.398 0.418 0.351 0.386 0.365 0.359 0.345 0.381 0.377 0.427 0.403 0.428 0.414 0.410 0.419 0.428
96 0.397 0.268 0.395 0.268 0.649 0.389 0.544 0.359 0.522 0.290 0.805 0.493 0.593 0.321 0.650 0.396 0.788 0.499 0.587 0.366 0.612 0.338 0.613 0.388
Traffic
192 0.417 0.274 0.417 0.276 0.601 0.366 0.540 0.354 0.530 0.293 0.756 0.474 0.617 0.336 0.598 0.370 0.789 0.505 0.604 0.373 0.613 0.340 0.616 0.382
336 0.433 0.281 0.433 0.283 0.609 0.369 0.551 0.358 0.558 0.305 0.762 0.477 0.629 0.336 0.605 0.373 0.797 0.508 0.621 0.383 0.618 0.328 0.622 0.337
720 0.467 0.300 0.467 0.302 0.647 0.387 0.586 0.375 0.589 0.328 0.719 0.449 0.640 0.350 0.645 0.394 0.841 0.523 0.626 0.382 0.653 0.355 0.660 0.408
Electricity
96 0.142 0.236 0.148 0.240 0.201 0.281 0.195 0.285 0.219 0.314 0.237 0.329 0.168 0.272 0.197 0.282 0.247 0.345 0.193 0.308 0.169 0.273 0.201 0.317
192 0.158 0.250 0.162 0.253 0.201 0.283 0.199 0.289 0.231 0.322 0.236 0.330 0.184 0.289 0.196 0.285 0.257 0.355 0.201 0.315 0.182 0.286 0.222 0.334
336 0.172 0.268 0.178 0.269 0.215 0.298 0.215 0.305 0.246 0.337 0.249 0.344 0.198 0.300 0.209 0.301 0.269 0.369 0.214 0.329 0.200 0.304 0.231 0.338
720 0.207 0.298 0.225 0.317 0.257 0.331 0.256 0.337 0.280 0.363 0.284 0.373 0.220 0.320 0.245 0.333 0.299 0.390 0.246 0.355 0.222 0.321 0.254 0.361
96 0.364 0.387 0.386 0.405 0.386 0.395 0.414 0.419 0.423 0.448 0.479 0.464 0.384 0.402 0.386 0.400 0.654 0.599 0.376 0.419 0.513 0.491 0.449 0.459
ETTh1
192 0.415 0.416 0.441 0.436 0.437 0.424 0.460 0.445 0.471 0.474 0.525 0.492 0.436 0.429 0.437 0.432 0.719 0.631 0.420 0.448 0.534 0.504 0.500 0.482
336 0.429 0.421 0.487 0.458 0.479 0.446 0.501 0.466 0.570 0.546 0.565 0.515 0.491 0.469 0.481 0.459 0.778 0.659 0.459 0.465 0.588 0.535 0.521 0.496
720 0.458 0.453 0.503 0.491 0.481 0.470 0.500 0.488 0.653 0.621 0.594 0.558 0.521 0.500 0.519 0.516 0.836 0.699 0.506 0.507 0.643 0.616 0.514 0.512
96 0.275 0.334 0.297 0.349 0.288 0.338 0.302 0.348 0.745 0.584 0.400 0.440 0.340 0.374 0.333 0.387 0.707 0.621 0.358 0.397 0.476 0.458 0.346 0.388
ETTh2
192 0.349 0.381 0.380 0.400 0.374 0.390 0.388 0.400 0.877 0.656 0.528 0.509 0.402 0.414 0.477 0.476 0.860 0.689 0.429 0.439 0.512 0.493 0.456 0.452
336 0.340 0.381 0.428 0.432 0.415 0.426 0.426 0.433 1.043 0.731 0.643 0.571 0.452 0.452 0.594 0.541 1.000 0.744 0.496 0.487 0.552 0.551 0.482 0.486
720 0.411 0.433 0.427 0.445 0.420 0.440 0.431 0.446 1.104 0.763 0.874 0.679 0.462 0.468 0.831 0.657 1.249 0.838 0.463 0.474 0.562 0.560 0.515 0.511
96 0.317 0.355 0.334 0.368 0.355 0.376 0.329 0.367 0.404 0.426 0.364 0.387 0.338 0.375 0.345 0.372 0.418 0.438 0.379 0.419 0.386 0.398 0.505 0.475
ETTm1
192 0.357 0.378 0.377 0.391 0.391 0.392 0.367 0.385 0.450 0.451 0.398 0.404 0.374 0.387 0.380 0.389 0.439 0.450 0.426 0.441 0.459 0.444 0.553 0.496
336 0.379 0.399 0.426 0.420 0.424 0.415 0.399 0.410 0.532 0.515 0.428 0.425 0.410 0.411 0.413 0.413 0.490 0.485 0.445 0.459 0.495 0.464 0.621 0.537
720 0.445 0.436 0.491 0.459 0.487 0.450 0.454 0.439 0.666 0.589 0.487 0.461 0.478 0.450 0.474 0.453 0.595 0.550 0.543 0.490 0.585 0.516 0.671 0.561
96 0.175 0.256 0.180 0.264 0.182 0.265 0.175 0.259 0.287 0.366 0.207 0.305 0.187 0.267 0.193 0.292 0.286 0.377 0.203 0.287 0.192 0.274 0.255 0.339
ETTm2
192 0.239 0.299 0.250 0.309 0.246 0.304 0.241 0.302 0.414 0.492 0.290 0.364 0.249 0.309 0.284 0.362 0.399 0.445 0.269 0.328 0.280 0.339 0.281 0.340
336 0.287 0.332 0.311 0.348 0.307 0.342 0.305 0.343 0.597 0.542 0.377 0.422 0.321 0.351 0.369 0.427 0.637 0.591 0.325 0.366 0.334 0.361 0.339 0.372
720 0.371 0.385 0.412 0.407 0.407 0.398 0.402 0.400 1.730 1.042 0.558 0.524 0.408 0.403 0.554 0.522 0.960 0.735 0.421 0.415 0.417 0.413 0.433 0.432
Table 3: Results for the long-term forecasting task with varying L = 5.2 Sensitivity of Dropouts
{192, 336, 720} and T = {96, 192, 336, 720}
In our model (Figure 1), we include two dropouts after processing the
Prediction (T )→ 96 192 336 720 signals from E1 and E2 . These dropouts are necessary, especially
for datasets with a small number of channels, e.g., ETTs. Supple-
D L MSE MAE MSE MAE MSE MAE MSE MAE
mentary Figure 1 shows the effect of dropouts on both ETTh1 and
192 0.362 0.252 0.386 0.262 0.402 0.270 0.431 0.288
Traffic
ETTh2 datasets. As expected, too low or too high dropout rates are
336 0.355 0.249 0.378 0.259 0.391 0.266 0.418 0.283
720 0.348 0.249 0.364 0.255 0.376 0.263 0.410 0.281 not helpful. To maintain balance, we set the dropout rates to 0.7 for
both datasets while tuning other variations for the rest.
192 0.135 0.230 0.167 0.258 0.176 0.269 0.213 0.302
Elec.
r
ST
er
ine
ar
e
e
TiD
rm
rm
rm
orm
ine
hT
ch
0.50
o
sfo
sfo
tc
DL
a
T=96 T=192 T=336 T=720
ssf
wf
eM
Pa
an
n
iFlo
Cro
Tra
Tim
0.48
iTr
Methods
0.46
(a) Traffic
0.44
MSE
2.00 0.42
0.40
1.75
0.38
1.50
0.36
Memory Footprint
r
ST
ar
e
e
e
hin
TiD
0.375
orm
rm
rm
orm
ine
hT
ac
sfo
sfo
tc
DL
ssf
wf
eM
Pa
0.350
n
an
MSE
iFlo
Cro
Tra
Tim
iTr
Methods 0.325
(b) Weather 0.300
Figure 4: Memory footprint (in GB) for Traffic (with 862 channels) 0.275
and Weather (with 21 channels) following iTransformer [21]. 0.250 8 16 32 64 128 256
T=96 State Expansion Factor (N)
0.48 T=192 (b) ETTh2
T=336
0.46 T=720 Figure 6: MSE versus the state expansion factor (N ) with the input
0.44 sequence length L = 96.
MSE
0.42 the highest possible value of 256. Figure 6 demonstrates the effec-
0.40
tiveness of this expansion factor while keeping all other parameters
0.38
fixed. With a higher state expansion factor, there is a certain chance
of performance improvement for varying prediction lengths. There-
0.36
fore, we set N = 256 as the default value for all datasets, and the
)
2)
6
64
32
64
32
64
32
25
12
12
,3
2,
2,
6,
6,
8,
8,
(64
2,
6,
(51
(51
(25
(25
(12
(12
(51
(51
(25
Combination
with this default value.
(a) ETTh1
5.6 Ablation on Mamba Dimension Expansion Factor
0.42 T=96
0.40 T=192 We also experimented with the dimension expansion factor (E) of
T=336
0.38 T=720 the Mambas, demonstrated in Supplementary Figure 3. Increasing
0.36 the block expansion factor does not lead to consistent improvements
MSE
8)
8)
2)
64
32
64
32
64
32
25
12
12
,3
2,
2,
6,
6,
8,
8,
(64
2,
2,
6,
(51
(51
(25
(25
(12
(12
(51
(51
(25
Observation
0.25
✗ 0.366 0.395 0.423 0.425 0.430 0.427 0.474 0.462
ETTh1 0.00
✓ 0.364 0.387 0.415 0.416 0.429 0.421 0.458 0.453
0.25 PatchTST
✗ 0.281 0.337 0.347 0.386 0.352 0.383 0.415 0.435 0.50 Ground Truth
ETTh2 TimeMachine
✓ 0.275 0.334 0.349 0.381 0.340 0.381 0.411 0.433 0.75
0 20 40 60 80 100
Time points
(a) ETTm1
0.48
0.46 0.1
0.44 0.2
0.3
MSE
0.42
Observation
0.4
0.40 T=96 0.5
T=192
0.38 T=336 0.6
T=720 PatchTST
0.36 0.7 Ground Truth
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 TimeMachine
Dropout 0.8
0 20 40 60 80 100
(a) ETTh1 Time points
(b) ETTm2
0.42 T=96 Figure 2: Qualitative comparison between TimeMachine and second
T=192 best-performing methods from Table 2. Observations are demon-
0.40 T=336
0.38 T=720 strated from the test set for the case of L = 96 and T = 720 with a
0.36 randomly selected channel and window frame of 100 time points.
MSE
0.34
0.32
0.30
0.28
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Dropout
(b) ETTh2
Figure 1: Performance (MSE) comparison concerning a diverse range
of dropouts with input sequence length, L = 96.
0.50
T=96 T=192 T=336 T=720
0.48
Table 2: Results for the long-term forecasting task with varying input
0.46
sequence length L = {192, 336, 720} and T = {96, 192, 336, 720}
0.44
MSE
300000
200000
100000
400000
Learnable Parameters
300000
200000
100000