TimeMachine: A Time Series is Worth 4 Mambas for Long-term Forecasting
Md Atik Ahamed a,* and Qiang Cheng a,b,**
a Department of Computer Science, University of Kentucky
b Institute for Biomedical Informatics, University of Kentucky
* Email: atikahamed@uky.edu
** Email: qiang.cheng@uky.edu, Corresponding Author

Abstract. Long-term time-series forecasting remains challenging due to the difficulty in capturing long-term dependencies, achieving linear scalability, and maintaining computational efficiency. We introduce TimeMachine, an innovative model that leverages Mamba, a state-space model, to capture long-term dependencies in multivariate time series data while maintaining linear scalability and small memory footprints. TimeMachine exploits the unique properties of time series data to produce salient contextual cues at multiple scales and leverages an innovative integrated quadruple-Mamba architecture to unify the handling of channel-mixing and channel-independence situations, thus enabling effective selection of contents for prediction against global and local contexts at different scales. Experimentally, TimeMachine achieves superior performance in prediction accuracy, scalability, and memory efficiency, as extensively validated using benchmark datasets.
Code availability: https://github.com/Atik-Ahamed/TimeMachine

1 Introduction

Long-term time-series forecasting (LTSF) is essential in various tasks across diverse fields, such as weather forecasting, anomaly detection, and resource planning in energy, agriculture, industry, and defense. Although numerous approaches have been developed for LTSF, they typically achieve only one or two desired properties, such as capturing long-term dependencies in multivariate time series (MTS), linear scalability in the number of model parameters with respect to data, and computational efficiency or applicability in edge computing. It is still challenging to achieve these desirable properties simultaneously.

Capturing long-term dependencies, which are generally abundant in MTS data, is pivotal to LTSF. While linear models such as DLinear [33] and TiDE [6] achieve competitive performance with linear complexity and scalability, with accuracy on par with Transformer-based models, they usually rely on MLPs and linear projections that may not capture long-range correlations well [4]. Transformer-based models such as iTransformer [21], PatchTST [23], and Crossformer [35] have a strong ability to capture long-range dependencies and superior performance in LTSF accuracy, thanks to the self-attention mechanisms in Transformers [28]. However, they typically suffer from quadratic complexity [6], limiting their scalability and applicability, e.g., in edge computing.

Recently, state-space models (SSMs) [10, 11, 12, 13, 25] have emerged as powerful engines for sequence-based inference and have attracted growing research interest. These models are capable of inferring over very long sequences and exhibit distinctive properties, including the ability to capture long-range correlations with linear complexity and context-aware selectivity with hidden attention mechanisms [11, 2]. SSMs have demonstrated great potential in various domains, including genomics [11], tabular learning [1], graph data [3], and images [22], yet they remain unexplored for LTSF.

The under-utilization of SSMs in LTSF can be attributed to two main reasons. First, highly content- and context-selective SSMs have only recently been developed [11]. Second, and more importantly, effectively representing the context in time series data remains a challenge. Many Transformer-based models, such as Autoformer [29] and Informer [36], regard each time point as a token in a sequence, while more recent models like PatchTST [23] and iTransformer [21] leverage patches of the time series as tokens. However, our empirical experiments on real-world MTS data suggest that directly utilizing SSMs for LTSF by using either time points or patches as tokens could hardly achieve performance comparable to Transformer-based models. Considering the particular characteristics of MTS data, it is essential to extract more salient contextual cues tailored to SSMs.

MTS data typically have many channels, with each variate corresponding to a channel. Many models, such as Informer [36], FEDformer [37], and Autoformer [29], handle MTS data to extract useful representations in a channel-mixing way, where the MTS input is treated as a two-dimensional matrix whose size is the number of channels multiplied by the length of history. Nonetheless, recently a few works such as PatchTST [23] and TiDE [6] have shown that a channel-independence way of handling MTS may achieve SOTA accuracy, where each channel is input to the model as a one-dimensional vector independent of the other channels. We believe that these two ways of handling LTSF need to be adopted according to the characteristics of the MTS data, rather than using a one-size-fits-all approach. When there are strong between-channel correlations, channel mixing usually helps capture such dependencies; otherwise, channel independence is a more sensible choice. Therefore, it is necessary to design a unified architecture applicable to both channel-mixing and channel-independence scenarios.

Moreover, time series data exhibit a unique property: temporal relations are largely preserved after downsampling into two sub-sequences. A few methods, such as Scinet [19], have explored this property in designing their models; however, it is under-utilized in other approaches.
Due to the high redundancy of MTS values at consecutive time points, directly using time points as tokens may let redundant values obscure context-based selection and, more importantly, overlook long-range dependencies. Rather than relying on individual time points, using patches may provide contextual clues within each time window of a patch length. However, a pre-defined small patch length only provides contexts at a fixed temporal or frequency resolution, whereas long-range contexts may span different patches. To best capture long-range dependencies, it is sensible to supply multi-scale contexts and, at each scale, automatically produce global-level tokens as contexts, similar to iTransformer [21], which tokenizes the whole look-back window. Further, while models like the Transformer and the selective SSMs [11] have the ability to select sub-token contents, such ability is limited in the channel-independence case, for which local contexts need to be enhanced when leveraging SSMs for LTSF.

In this paper, we introduce a novel approach that effectively captures long-range dependencies in time series data by providing sensible multi-scale contexts and particularly enhancing local contexts in the channel-independence situation. Our model, built upon a selective scan SSM called Mamba [11], serves as a core inference engine with a strong ability to capture long-range dependencies in MTS data while maintaining linear scalability and small memory footprints. The proposed model exploits the unique property of time series data in a bottom-up manner by producing contextual cues at two scales through consecutive resolution reduction or downsampling using linear mapping. The first level operates at a high resolution, while the second level works at a low resolution. At each level, we employ two Mamba modules to glean contextual cues from global perspectives for the channel-mixing case and from both global and local perspectives for the channel-independence case.

In summary, our major contributions are threefold:
• We develop an innovative model called TimeMachine that is the first to leverage purely SSM modules to capture long-term dependencies in multivariate time series data for context-aware prediction, with linear scalability and small memory footprints superior or comparable to linear models.
• Our model constitutes an innovative architecture that unifies the handling of channel-mixing and channel-independence situations with four SSM modules, exploiting potential between-channel correlations. Moreover, our model can effectively select contents for prediction against global and local contextual information at different scales in the MTS data.
• Experimentally, TimeMachine achieves superior performance in prediction accuracy, scalability, and memory efficiency. We extensively validate the model using standard benchmark datasets and perform rigorous ablation studies to demonstrate its effectiveness.

2 Related Works

Numerous methods for LTSF have been proposed, which can be grouped into three main categories: non-Transformer-based supervised approaches, Transformer-based supervised learning models, and self-supervised representation learning models.

Non-Transformer-based Supervised Approaches include classical methods like ARIMA, VARMAX, GARCH [5], and RNNs [15], as well as deep learning-based methods that achieve state-of-the-art (SOTA) performance using multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs). MLP-based models, such as DLinear [33], TiDE [6], and RLinear [18], leverage the simplicity of linear structures to achieve low complexity and good scalability. CNN-based methods, such as TimesNet [30] and Scinet [19], utilize convolutional filters to extract valuable temporal features and model complex temporal dynamics. These approaches exhibit highly competitive performance, often comparable to or even occasionally outperforming more sophisticated Transformer-based models.

Transformer-based Supervised Learning Methods, such as iTransformer [21], PatchTST [23], Crossformer [35], FEDformer [37], Stationary [20], and Autoformer [29], have gained popularity for LTSF due to their superior accuracy. These methods convert time series to token sequences and leverage the self-attention mechanism to discover dependencies between arbitrary time steps, making them particularly effective for modeling complex temporal relationships. They may also exploit Transformers' ability to process data in parallel, enabling long-term dependency discovery, sometimes even with linear scalability. Despite their distinctive advantages, these methods typically have quadratic time and memory complexity due to point-wise correlations in self-attention mechanisms.

Self-Supervised Representation Learning Models: Self-supervised learning has been leveraged to learn useful representations of MTS for downstream tasks, using non-Transformer-based models for time series [31, 9, 26, 32] and Transformer-based models such as the time series Transformer (TST) and TS-TCC [34, 7, 27]. Currently, Transformer-based self-supervised models have not yet achieved performance on par with supervised learning approaches [27]. This paper focuses on LTSF in a supervised learning setting.

3 Proposed Method

In this section, we describe each component of our proposed architecture and how we use our model to solve the LTSF problem. Assume a collection of MTS samples is given, denoted by dataset D, which comprises an input sequence x = [x1, . . . , xL], with each xt ∈ R^M representing a vector of M measurements at time point t. The sequence length L is also known as the look-back window. The goal is to predict T future values, denoted by [x_{L+1}, . . . , x_{L+T}]. The architecture of our proposed model, referred to as TimeMachine, is depicted in Figure 1. The pillars of this architecture consist of four Mambas, which are employed in an integrated way to tap contextual cues from MTS. This design choice enables us to harness Mamba's robust capabilities of inferring sequential data for LTSF.

Normalization: Before feeding the data to our model, we normalize the original MTS x into x(0) = [x1(0), . . . , xL(0)] ∈ R^(M×L) via x(0) = Normalize(x). Here, Normalize(·) represents a normalization operation with two different options. The first is to use reversible instance normalization (RevIN) [16], which is also adopted in PatchTST [23]. The second option is to employ regular Z-score normalization: x(0)_{i,j} = (x_{i,j} − mean(x_{:,j}))/σ_j, where σ_j is the standard deviation of channel j, with j = 1, . . . , M. Empirically, we find that RevIN is often more helpful than Z-score. Apart from normalizing the data in the forward pass of our approach, in experiments we also follow the standardization process of the data when comparing with baseline methods.
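To make the two options concrete, the following is a minimal sketch (not the authors' released code) of instance-wise normalization in PyTorch; RevIN [16] additionally learns affine parameters and reverses the statistics on the model output, which the second helper imitates.

```python
import torch

def instance_normalize(x: torch.Tensor, eps: float = 1e-5):
    # x: (B, M, L) look-back windows; statistics are computed per channel
    # over the look-back window (a RevIN-style, instance-wise variant of
    # the Z-score option described above).
    mean = x.mean(dim=-1, keepdim=True)
    std = x.std(dim=-1, keepdim=True) + eps
    return (x - mean) / std, mean, std

def denormalize(y: torch.Tensor, mean: torch.Tensor, std: torch.Tensor):
    # y: (B, M, T) predictions mapped back to the original scale,
    # mirroring the reversible step of RevIN.
    return y * std + mean
```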
Figure 1: Schematic diagram of our proposed methodology, TimeMachine. Our method incorporates a configuration of four Mambas, with two specialized Mambas capable of processing the transposed signal data in each branch. To the left, an example of the time series signals is depicted, while the right side offers a detailed zoomed-in view of a Mamba's structure. Mambas are capable of accepting an input of shape B M n_i while providing the same shape as the output; in our method i ∈ {1, 2}. (Legend symbols in the figure denote element-wise addition, concatenation, transposition, dropout, SiLU activation, and nonlinearity.)

Channel Mixing vs. Channel Independence: Our model can handle both channel independence and channel mixing cases. In channel independence, each channel is processed independently by our model, while in channel mixing, the MTS sequence is processed with multiple channels combined throughout our architecture. Regardless of the case, our model accepts input of the shape B M L and produces the desired output of the shape B M T, eliminating the need for additional manual pre-processing.

Channel independence has been proven effective in reducing overfitting by PatchTST [23]. We found this strategy helpful for datasets with a smaller number of channels. However, for datasets with a number of channels comparable to the look-back length, channel mixing is more effective in capturing the correlations among channels and reaching the desired minimum loss during training.

Our architecture is robust and versatile, capable of benefiting from potentially strong inter-channel correlations in the channel-mixing case and exploiting independence in the channel-independence case. When dealing with channel independence, we reshape the input from B M L to (B × M) 1 L after the normalization step. The reshaped input is then processed throughout the network and later merged to provide an output shape of B M T. In contrast, for channel mixing, no reshaping is necessary: the channels are kept together and processed throughout the network.
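The switch between the two modes amounts to a reshape of the batch tensor. A small sketch of the shape bookkeeping (helper names are illustrative, not from the released code):

```python
import torch

def to_channel_independent(x: torch.Tensor) -> torch.Tensor:
    # (B, M, L) -> (B*M, 1, L): every channel becomes its own univariate sample.
    B, M, L = x.shape
    return x.reshape(B * M, 1, L)

def merge_channels(y: torch.Tensor, B: int, M: int) -> torch.Tensor:
    # (B*M, 1, T) -> (B, M, T): fold the channels back after prediction.
    return y.reshape(B, M, -1)

# Channel mixing keeps the (B, M, L) layout; no reshaping is required.
```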
Embedded Representations: Before processing the input sequence with Mambas, we provide two-stage embedded representations of the input sequence with length L by E1 and E2:

x(1) = E1(x(0)),   x(2) = E2(DO(x(1))),   (1)

where DO stands for the dropout operation, and the embedding operations E1: R^(M×L) → R^(M×n1) and E2: R^(M×n1) → R^(M×n2) are achieved through MLPs. Thus, for the channel mixing case, the batch-formed tensors will have the following changes in size: B M n1 ← E1(B M L), and B M n2 ← E2(B M n1). This enables us to deal with the fixed-length tokens of n1 and n2 regardless of the variable input sequence length L, and both n1 and n2 are configured to take values from the set {512, 256, 128, 64, 32} satisfying n1 > n2. Since MLPs are fully connected, we introduce dropouts to reduce overfitting. Although we have the linear mappings (MLPs) before the Mambas, the performance of our model does not heavily rely on them, as demonstrated by the ablation study (see Section 5).
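A minimal sketch of Eq. (1) as a PyTorch module (a hypothetical implementation for illustration; n1, n2, and the dropout rate are configuration choices discussed in Section 5):

```python
import torch.nn as nn

class TwoStageEmbedding(nn.Module):
    # E1: R^(M x L) -> R^(M x n1), E2: R^(M x n1) -> R^(M x n2), with dropout (DO).
    def __init__(self, L: int, n1: int = 256, n2: int = 128, p_drop: float = 0.7):
        super().__init__()
        self.E1 = nn.Linear(L, n1)
        self.E2 = nn.Linear(n1, n2)
        self.DO = nn.Dropout(p_drop)

    def forward(self, x0):                 # x0: (B, M, L), normalized input
        x1 = self.E1(x0)                   # (B, M, n1), high-resolution tokens
        x2 = self.E2(self.DO(x1))          # (B, M, n2), low-resolution tokens
        return x1, x2
```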
Integrated Quadruple Mambas: With the two processed embedded representations from E1 and E2, we can now learn more comprehensive representations by leveraging Mamba, a type of SSM with selective scan ability. At each embedding level, we employ a pair of Mambas to capture long-term dependencies within the look-back samples and provide sufficient local contexts. Denote the input to one of the four Mamba blocks by u, which is either DO(x(1)), obtained after E1 and the subsequent dropout layer, for the two outer Mambas, or DO(x(2)), obtained after E2 and the subsequent dropout layer, for the two inner Mambas (Figure 1). The input tensors may be reshaped per the channel-mixing or channel-independence cases as described previously.

Inside a Mamba block, two fully-connected layers in two branches calculate linear projections. The output of the linear mapping in the first branch passes through a 1D causal convolution and SiLU activation S(·) [8], then a structured SSM. The continuous-time SSM maps an input function or sequence u(t) to an output v(t) through a latent state h(t):

dh(t)/dt = A h(t) + B u(t),   v(t) = C h(t),   (2)

where h(t) is N-dimensional, with N also known as the state expansion factor, u(t) is D-dimensional, with D being the dimension factor for an input token, v(t) is an output of dimension D, and A, B, and C are coefficient matrices of proper sizes. This dynamic system induces a discrete SSM governing state evolution and outputs given the input token sequence, through time sampling at {kΔ} with a time interval Δ. This discrete SSM is

h_k = Ā h_{k−1} + B̄ u_k,   v_k = C h_k,   (3)

where h_k, u_k, and v_k are respectively samples of h(t), u(t), and v(t) at time kΔ, and

Ā = exp(ΔA),   B̄ = (ΔA)^(−1) (exp(ΔA) − I) ΔB.   (4)

For SSMs, a diagonal A is often used. Mamba makes B, C, and Δ linear time-varying functions dependent on the input. In particular, for a token u, B, C ← Linear_N(u) and Δ ← softplus(parameter + Linear_D(Linear_1(u))), where Linear_p(u) is a linear projection to a p-dimensional space and softplus is the activation function. Furthermore, Mamba also has an option to expand the model dimension factor D by a controllable dimension expansion factor E. Such coefficient matrices enable context and input selectivity properties [11] to selectively propagate or forget information along the input token sequence based on the current token. Subsequently, the SSM output is multiplicatively modulated with the output from the second branch before another fully-connected projection. The second branch simply consists of a linear mapping followed by a SiLU.
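To make Eqs. (2)-(4) and the selectivity mechanism concrete, below is a small, unoptimized reference scan (a sketch for exposition only, not Mamba's hardware-aware kernel). It assumes a diagonal A and, for brevity, a single linear map for Δ instead of the factorized Linear_D(Linear_1(u)) used in Mamba.

```python
import torch
import torch.nn.functional as F

def selective_ssm_scan(u, A, lin_B, lin_C, lin_dt, dt_bias):
    # u: (L, D) token sequence; A: (D, N) diagonal state-matrix entries;
    # lin_B, lin_C: nn.Linear(D, N); lin_dt: nn.Linear(D, D); dt_bias: (D,).
    # All argument names are illustrative.
    L, D = u.shape
    N = A.shape[1]
    h = u.new_zeros(D, N)                       # latent state h_k
    outputs = []
    for k in range(L):
        uk = u[k]                               # current token, (D,)
        Bk = lin_B(uk)                          # B <- Linear_N(u), (N,)
        Ck = lin_C(uk)                          # C <- Linear_N(u), (N,)
        dt = F.softplus(lin_dt(uk) + dt_bias)   # Delta per channel, (D,)
        A_bar = torch.exp(dt[:, None] * A)      # Eq. (4), elementwise for diagonal A
        B_bar = (A_bar - 1.0) / A * Bk[None, :] # Eq. (4), simplified for diagonal A
        h = A_bar * h + B_bar * uk[:, None]     # Eq. (3): h_k = A_bar h_{k-1} + B_bar u_k
        outputs.append(h @ Ck)                  # Eq. (3): v_k = C h_k, (D,)
    return torch.stack(outputs)                 # (L, D)
```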
The processed embedded representation with tensor size B M n1 is transformed by the outer Mambas, while that with B M n2 is transformed by the inner Mambas, as depicted in Figure 1. For the channel-mixing case, the whole univariate sequence of each channel is used as a token, with dimension factor n2 for the inner Mambas. The outputs from the left-side and right-side inner Mambas, v_{L,k}, v_{R,k} ∈ R^(n2), are element-wise added with x_k(2) to obtain x_k(3) for the k-th token, k = 1, . . . , M. That is, by denoting v_L = [v_{L,1}, . . . , v_{L,M}] ∈ R^(M×n2) and similarly v_R ∈ R^(M×n2), we have x(3) = v_L ⊕ v_R ⊕ x(2), with ⊕ being element-wise addition. Then, x(3) is linearly mapped to x(4) with P1: x(3) → x(4) ∈ R^(M×n1). Similarly, the outputs from the outer Mambas, v*_{L,k}, v*_{R,k} ∈ R^(n1), are element-wise added to obtain x(5) ∈ R^(M×n1).

For the channel-independence case, the input is reshaped, B M L ↦ (B × M) 1 L, and the embedded representations become (B × M) 1 n1 and (B × M) 1 n2. One Mamba in each pair of outer or inner Mambas considers the input dimension as 1 and the token length as n1 or n2, while the other Mamba learns with input dimension n2 or n1 and token length 1. This design enables learning both global context and local context simultaneously. The outer and inner pairs of Mambas extract salient features and context cues at fine and coarse scales with high and low resolution, respectively.

Channel mixing is performed when the datasets contain a significantly large number of channels, in particular when the look-back L is comparable to the channel number M, taking the whole sequence as a token to better provide context cues. All four Mambas are then used to capture the global context of the sequences at different scales and learn from the channel correlations. This helps stabilize the training and reduce overfitting with large M. To switch between the channel-independence and channel-mixing cases, the input sequence is simply transposed, with one Mamba in each branch processing the transposed input, as demonstrated in Figure 1.

These integrated Mamba blocks empower our model for content-dependent feature extraction and reasoning with long-range dependencies and feature interactions.

Output Projection: After receiving the output tokens from the Mambas, our goal is to project these tokens to generate predictions with the desired sequence length. To accomplish this task, we utilize two MLPs, P1 and P2, which output n1 and T time points, respectively, with each point having M channels. Specifically, projector P1 performs a mapping R^(M×n2) → R^(M×n1), as discussed above for obtaining x(4). Subsequently, projector P2 performs a mapping R^(M×2n1) → R^(M×T), transforming the concatenated output from the Mambas into the final predictions. The use of a two-stage output projection via P1 and P2 symmetrically aligns with the two-stage embedded representation obtained through E1 and E2.

In addition to the token transformation, we also employ residual connections. One residual connection is added before P1, and another is added after P1. The effectiveness of these residual connections is verified by experimental results (see Supplementary Table 1). Residual connections are indicated by arrows and element-wise addition in our method (Figure 1).

To retain the information of both the outer and inner pairs of Mambas, we concatenate their representations before processing via P2. In summary, we concatenate the outputs of the four Mambas with a skip connection to obtain x(6) = x(5) ∥ (x(4) ⊕ x(1)), where ∥ denotes concatenation. Finally, the output y is obtained by applying P2 to x(6), i.e., y = P2(x(6)).
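Putting the notation together, the fusion of the four Mamba outputs and the two-stage projection can be sketched as follows (module and argument names are illustrative; this is our reading of the formulas above, not the released implementation):

```python
import torch

def fuse_and_project(x1, x2, v_out_L, v_out_R, v_in_L, v_in_R, P1, P2):
    # x1: (B, M, n1) from E1; x2: (B, M, n2) from E2.
    # v_out_L, v_out_R: outputs of the two outer Mambas, same shape as x1.
    # v_in_L, v_in_R: outputs of the two inner Mambas, same shape as x2.
    # P1: nn.Linear(n2, n1); P2: nn.Linear(2 * n1, T).
    x3 = v_in_L + v_in_R + x2                 # x(3) = vL (+) vR (+) x(2), residual from E2
    x4 = P1(x3)                               # x(4): (B, M, n1)
    x5 = v_out_L + v_out_R                    # x(5): outer pair, high-resolution scale
    x6 = torch.cat([x5, x4 + x1], dim=-1)     # x(6) = x(5) || (x(4) (+) x(1)): (B, M, 2*n1)
    return P2(x6)                             # y: (B, M, T)
```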
4 Result Analysis

In this section, we present the main results of our experiments on widely recognized benchmark datasets for long-term MTS forecasting. We also conduct extensive ablation studies to demonstrate the effectiveness of each component of our method.

4.1 Datasets

We evaluate our model on seven benchmark datasets extensively used for LTSF: Weather, Traffic, Electricity, and four ETT datasets (ETTh1, ETTh2, ETTm1, ETTm2). Table 1 summarizes the relevant statistics of these datasets, highlighting that the Traffic and Electricity datasets are notably large, with 862 and 321 channels, respectively, and tens of thousands of time points in each sequence. More details on these datasets can be found in Wu et al. [29] and Zhou et al. [36]. Focusing on long-term forecasting, we exclude the ILI dataset, which has a shorter temporal horizon, similar to Das et al. [6]. We demonstrate the superiority of our model in two parts: quantitative (main) results and qualitative results. For a fair comparison, we used the code from PatchTST [23]1 and iTransformer [21]2, and we took the results for the baseline methods from iTransformer [21].

1 https://github.com/yuqinie98/PatchTST
2 https://github.com/thuml/iTransformer

Table 1: Overview of the characteristics of the benchmark datasets. Time points indicate the total length of each dataset.
Dataset (D)   Channels (M)   Time Points   Frequency
Weather       21             52696         10 Minutes
Traffic       862            17544         Hourly
Electricity   321            26304         Hourly
ETTh1         7              17420         Hourly
ETTh2         7              17420         Hourly
ETTm1         7              69680         15 Minutes
ETTm2         7              69680         15 Minutes

4.2 Experimental Environment

All experiments were conducted using the PyTorch framework [24] with 4X NVIDIA V100 GPUs (32GB each). The model was optimized using the ADAM algorithm [17] with L2 loss. The batch size varied depending on the dataset, but training was consistently set to 100 epochs. We measure the prediction errors using the mean squared error (MSE) and mean absolute error (MAE) metrics, where smaller values indicate better prediction accuracy.
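A minimal training/evaluation sketch matching this setup (the learning rate and device are placeholders, not values reported by the authors):

```python
import torch

def train(model, loader, epochs: int = 100, lr: float = 1e-3, device: str = "cuda"):
    # Adam optimizer with an L2 (mean squared error) loss, trained for 100 epochs.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    model.to(device).train()
    for _ in range(epochs):
        for x, y in loader:                    # x: (B, M, L), y: (B, M, T)
            opt.zero_grad()
            loss = loss_fn(model(x.to(device)), y.to(device))
            loss.backward()
            opt.step()
    return model

def mse(pred, target):
    return ((pred - target) ** 2).mean().item()

def mae(pred, target):
    return (pred - target).abs().mean().item()
```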
Baseline Models: We compared our model, TimeMachine, with 11 SOTA models, including iTransformer [21], PatchTST [23], DLinear [33], RLinear [18], Autoformer [29], Crossformer [35], TiDE [6], Scinet [19], TimesNet [30], FEDformer [37], and Stationary [20]. Although another variant of SSMs, namely S4 [12], exists, we do not include it in our comparison because TiDE [6] has already demonstrated superior performance over S4.

4.3 Quantitative Results

We demonstrate TimeMachine's performance on supervised long-term forecasting tasks in Table 2. Following the protocol used in iTransformer [21], we fix L = 96 and T = {96, 192, 336, 720} for all baselines, including our method. For all results achieved by our model, we utilized the training-related values mentioned in Section 4. In addition to the training hyperparameters, we set default values for all Mambas: dimension factor D = 256, local convolutional width = 2, and state expand factor N = 1. We provide an experimental justification for these parameters in Section 5. Table 2 clearly shows that our method demonstrates superior performance compared to all the strong baselines on almost all datasets. Moreover, iTransformer [21] has significantly better performance than other baselines on the Traffic and Electricity datasets, which contain a large number of channels. Our method also demonstrates comparable or superior performance on these two datasets, outperforming the existing strong baselines by a large margin. This demonstrates the effectiveness of our method in handling LTSF tasks with varying numbers of channels and datasets.
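For concreteness, one of the four Mamba blocks could be instantiated as below, assuming the public mamba_ssm package and assuming that the paper's D, N, local convolutional width, and E correspond to its d_model, d_state, d_conv, and expand arguments (this mapping is our reading, not stated by the authors; the state expansion factor N is ablated in Section 5.5).

```python
# pip install mamba-ssm   (requires a CUDA-enabled PyTorch build)
from mamba_ssm import Mamba

def make_mamba_block(N: int):
    # Values quoted above: dimension factor D = 256, local convolutional
    # width = 2; dimension expansion factor E = 1 (see Section 5.6).
    return Mamba(d_model=256, d_state=N, d_conv=2, expand=1)

# The block maps tensors of shape (batch, tokens, d_model) to the same shape.
```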
In addition to Table 2, we conducted experiments with TimeMachine using different look-back windows L = {192, 336, 720}. Table 3 and Supplementary Table 2 demonstrate TimeMachine's performance under these settings. An examination of these tables reveals that extended look-back windows markedly enhance the performance of our method across the majority of the datasets examined. This also demonstrates TimeMachine's capability of handling longer look-back windows while maintaining consistent performance.

Figure 2: Average performance (in MSE) comparison among TimeMachine and the latest SOTA baselines (PatchTST, iTransformer, RLinear) with input sequence length L = 96, shown per dataset (Electricity, Traffic, ETTh1, ETTh2, ETTm1, ETTm2, Weather). The center of the circle represents the maximum possible error; the closer to the boundary, the better the performance.

Following iTransformer [21], Figure 2 demonstrates the normalized percentage gain of TimeMachine with respect to three other SOTA methods, indicating a clear improvement over the strong baselines. In addition to the general performance comparison using the MSE and MAE metrics, we also compare the memory footprints and scalability of our method against other baselines in Figure 4. We measured the GPU memory utilization of our method and compared it against other baselines, with results for the other baseline models taken from the iTransformer [21] paper. To ensure a fair comparison, we set the experimental settings for our method similar to those of iTransformer. The results clearly show very small memory footprints compared to the SOTA baselines. Specifically, for Traffic, our method consumes a very similar amount of memory to the DLinear [33] method. Moreover, our method is capable of handling longer look-back windows with a relatively linear increase in the number of learnable parameters, as demonstrated in Supplementary Figure 4 for two datasets. This is due to the robustness of our method, where E1 is the only part dependent on the input sequence length L, while the rest of the network is relatively independent of L, leading to a highly scalable model.

Figure 3: Qualitative comparison between TimeMachine and the second-best-performing methods from Table 2 (here iTransformer), together with the ground truth. Observations are shown from the test set for the case of L = 96 and T = 720 with a randomly selected channel and a window of 100 time points. Panels: (a) Electricity, (b) Traffic.

4.4 Qualitative Results

Figure 3 and Supplementary Figure 2 demonstrate TimeMachine's effectiveness in visual comparison. It is evident that TimeMachine can follow the actual trend in the predicted future time horizon for the test set. In the case of the Electricity dataset, there is a clear difference between the performance of TimeMachine and iTransformer. For the Traffic dataset, although both iTransformer's and TimeMachine's predictions align with the ground truth, in the range of approximately 75-90 time points, TimeMachine's prediction is more closely aligned with the ground truth than iTransformer's. For better visualization, we show a window of 100 predicted time points.

5 Hyperparameter Sensitivity Analysis and Ablation Study

In this section, we conduct experiments on various hyperparameters, including training and method-specific parameters. For each parameter, we provide experimental justification based on the achieved results. While conducting an ablation experiment on one parameter, the other parameters were kept fixed at their default values, ensuring a clear justification for that specific parameter.
Table 2: Results in MSE and MAE (the lower the better) for the long-term forecasting task. We compare extensively with baselines under
different prediction lengths, T = {96, 192, 336, 720} following the setting of iTransformer [21]. The length of the input sequence (L) is set
to 96 for all baselines. The best results are in bold and the second best are underlined.
Methods→ TimeMachine iTransformer RLinear PatchTST Crossformer TiDE TimesNet DLinear SCINet FEDformer Stationary Autoformer
Dataset (D) | T | MSE MAE reported for each method, in the order listed above
Weather
96 0.164 0.208 0.174 0.214 0.192 0.232 0.177 0.218 0.158 0.230 0.202 0.261 0.172 0.220 0.196 0.255 0.221 0.306 0.217 0.296 0.173 0.223 0.266 0.336
192 0.211 0.250 0.221 0.254 0.240 0.271 0.225 0.259 0.206 0.277 0.242 0.298 0.219 0.261 0.237 0.296 0.261 0.340 0.276 0.336 0.245 0.285 0.307 0.367
336 0.256 0.290 0.278 0.296 0.292 0.307 0.278 0.297 0.272 0.335 0.287 0.335 0.280 0.306 0.283 0.335 0.309 0.378 0.339 0.380 0.321 0.338 0.359 0.395
720 0.342 0.343 0.358 0.349 0.364 0.353 0.354 0.348 0.398 0.418 0.351 0.386 0.365 0.359 0.345 0.381 0.377 0.427 0.403 0.428 0.414 0.410 0.419 0.428
Traffic
96 0.397 0.268 0.395 0.268 0.649 0.389 0.544 0.359 0.522 0.290 0.805 0.493 0.593 0.321 0.650 0.396 0.788 0.499 0.587 0.366 0.612 0.338 0.613 0.388
192 0.417 0.274 0.417 0.276 0.601 0.366 0.540 0.354 0.530 0.293 0.756 0.474 0.617 0.336 0.598 0.370 0.789 0.505 0.604 0.373 0.613 0.340 0.616 0.382
336 0.433 0.281 0.433 0.283 0.609 0.369 0.551 0.358 0.558 0.305 0.762 0.477 0.629 0.336 0.605 0.373 0.797 0.508 0.621 0.383 0.618 0.328 0.622 0.337
720 0.467 0.300 0.467 0.302 0.647 0.387 0.586 0.375 0.589 0.328 0.719 0.449 0.640 0.350 0.645 0.394 0.841 0.523 0.626 0.382 0.653 0.355 0.660 0.408
Electricity

96 0.142 0.236 0.148 0.240 0.201 0.281 0.195 0.285 0.219 0.314 0.237 0.329 0.168 0.272 0.197 0.282 0.247 0.345 0.193 0.308 0.169 0.273 0.201 0.317
192 0.158 0.250 0.162 0.253 0.201 0.283 0.199 0.289 0.231 0.322 0.236 0.330 0.184 0.289 0.196 0.285 0.257 0.355 0.201 0.315 0.182 0.286 0.222 0.334
336 0.172 0.268 0.178 0.269 0.215 0.298 0.215 0.305 0.246 0.337 0.249 0.344 0.198 0.300 0.209 0.301 0.269 0.369 0.214 0.329 0.200 0.304 0.231 0.338
720 0.207 0.298 0.225 0.317 0.257 0.331 0.256 0.337 0.280 0.363 0.284 0.373 0.220 0.320 0.245 0.333 0.299 0.390 0.246 0.355 0.222 0.321 0.254 0.361
ETTh1
96 0.364 0.387 0.386 0.405 0.386 0.395 0.414 0.419 0.423 0.448 0.479 0.464 0.384 0.402 0.386 0.400 0.654 0.599 0.376 0.419 0.513 0.491 0.449 0.459
192 0.415 0.416 0.441 0.436 0.437 0.424 0.460 0.445 0.471 0.474 0.525 0.492 0.436 0.429 0.437 0.432 0.719 0.631 0.420 0.448 0.534 0.504 0.500 0.482
336 0.429 0.421 0.487 0.458 0.479 0.446 0.501 0.466 0.570 0.546 0.565 0.515 0.491 0.469 0.481 0.459 0.778 0.659 0.459 0.465 0.588 0.535 0.521 0.496
720 0.458 0.453 0.503 0.491 0.481 0.470 0.500 0.488 0.653 0.621 0.594 0.558 0.521 0.500 0.519 0.516 0.836 0.699 0.506 0.507 0.643 0.616 0.514 0.512
ETTh2
96 0.275 0.334 0.297 0.349 0.288 0.338 0.302 0.348 0.745 0.584 0.400 0.440 0.340 0.374 0.333 0.387 0.707 0.621 0.358 0.397 0.476 0.458 0.346 0.388
192 0.349 0.381 0.380 0.400 0.374 0.390 0.388 0.400 0.877 0.656 0.528 0.509 0.402 0.414 0.477 0.476 0.860 0.689 0.429 0.439 0.512 0.493 0.456 0.452
336 0.340 0.381 0.428 0.432 0.415 0.426 0.426 0.433 1.043 0.731 0.643 0.571 0.452 0.452 0.594 0.541 1.000 0.744 0.496 0.487 0.552 0.551 0.482 0.486
720 0.411 0.433 0.427 0.445 0.420 0.440 0.431 0.446 1.104 0.763 0.874 0.679 0.462 0.468 0.831 0.657 1.249 0.838 0.463 0.474 0.562 0.560 0.515 0.511
ETTm1
96 0.317 0.355 0.334 0.368 0.355 0.376 0.329 0.367 0.404 0.426 0.364 0.387 0.338 0.375 0.345 0.372 0.418 0.438 0.379 0.419 0.386 0.398 0.505 0.475
192 0.357 0.378 0.377 0.391 0.391 0.392 0.367 0.385 0.450 0.451 0.398 0.404 0.374 0.387 0.380 0.389 0.439 0.450 0.426 0.441 0.459 0.444 0.553 0.496
336 0.379 0.399 0.426 0.420 0.424 0.415 0.399 0.410 0.532 0.515 0.428 0.425 0.410 0.411 0.413 0.413 0.490 0.485 0.445 0.459 0.495 0.464 0.621 0.537
720 0.445 0.436 0.491 0.459 0.487 0.450 0.454 0.439 0.666 0.589 0.487 0.461 0.478 0.450 0.474 0.453 0.595 0.550 0.543 0.490 0.585 0.516 0.671 0.561
ETTm2
96 0.175 0.256 0.180 0.264 0.182 0.265 0.175 0.259 0.287 0.366 0.207 0.305 0.187 0.267 0.193 0.292 0.286 0.377 0.203 0.287 0.192 0.274 0.255 0.339
192 0.239 0.299 0.250 0.309 0.246 0.304 0.241 0.302 0.414 0.492 0.290 0.364 0.249 0.309 0.284 0.362 0.399 0.445 0.269 0.328 0.280 0.339 0.281 0.340
336 0.287 0.332 0.311 0.348 0.307 0.342 0.305 0.343 0.597 0.542 0.377 0.422 0.321 0.351 0.369 0.427 0.637 0.591 0.325 0.366 0.334 0.361 0.339 0.372
720 0.371 0.385 0.412 0.407 0.407 0.398 0.402 0.400 1.730 1.042 0.558 0.524 0.408 0.403 0.554 0.522 0.960 0.735 0.421 0.415 0.417 0.413 0.433 0.432

Table 3: Results for the long-term forecasting task with varying L = {192, 336, 720} and T = {96, 192, 336, 720}.
Prediction (T)→       96           192          336          720
Dataset (D) | L       MSE   MAE    MSE   MAE    MSE   MAE    MSE   MAE
Traffic     | 192     0.362 0.252  0.386 0.262  0.402 0.270  0.431 0.288
            | 336     0.355 0.249  0.378 0.259  0.391 0.266  0.418 0.283
            | 720     0.348 0.249  0.364 0.255  0.376 0.263  0.410 0.281
Elec.       | 192     0.135 0.230  0.167 0.258  0.176 0.269  0.213 0.302
            | 336     0.133 0.225  0.160 0.255  0.172 0.268  0.211 0.303
            | 720     0.133 0.225  0.160 0.257  0.167 0.269  0.204 0.300
ETTm2       | 192     0.170 0.252  0.230 0.294  0.273 0.325  0.351 0.376
            | 336     0.165 0.254  0.223 0.291  0.264 0.323  0.345 0.375
            | 720     0.163 0.253  0.222 0.295  0.265 0.325  0.336 0.376

5.1 Effect of MLPs' Parameters (n1, n2)

As demonstrated in Figure 1, we have two stages of compression with two MLPs E1, E2 of output dimensions n1 and n2, respectively, and P1 performing an expansion by converting n2 → n1. Since several strong baseline methods, e.g., DLinear, leverage mainly MLPs, we aim at understanding the effect of the MLPs on performance. To this end, we explored 10 different combinations of (n1, n2) from {512, 256, 128, 64, 32} and demonstrate the performance in MSE for two datasets (ETTh1, ETTh2) in Figure 5. These figures show that our method is not heavily dependent on the MLPs. Rather, we see more improvement with very small MLPs for T = 720 on the ETTh1 dataset and mostly stable performance on the ETTh2 dataset.

Figure 5: MSE comparison with combinations of n1 and n2 for input sequence length L = 96 for the ETTh1 and ETTh2 datasets. Panels: (a) ETTh1, (b) ETTh2.

5.2 Sensitivity of Dropouts

In our model (Figure 1), we include two dropouts after processing the signals from E1 and E2. These dropouts are necessary, especially for datasets with a small number of channels, e.g., the ETTs. Supplementary Figure 1 shows the effect of dropouts on both the ETTh1 and ETTh2 datasets. As expected, too low or too high dropout rates are not helpful. To maintain balance, we set the dropout rates to 0.7 for both datasets while tuning other variations for the rest.

5.3 Ablation of Residual Connections

Studies have shown the effectiveness of residual connections, including models using SSMs [1] and CNNs [14]. In this section, we justify the two residual connections in our architecture: one from E2 to the output of the two inner Mambas, and the other from E1 to the output of P1. These residual connections help stabilize training and reduce overfitting, especially for the smaller datasets with channel independence. Supplementary Table 1 provides experimental justification, where the Res. column indicates the presence (✓) or absence (✗) of residual connections. We observe clear improvement on both datasets when residual connections are used. This motivated us to include residual connections in our architecture, and all results presented in Tables 2 and 3 incorporate these connections.

5.4 Effects of Mambas' Local Convolutional Width

In addition to experimenting with the different components of our architecture (Figure 1), we also investigated the effectiveness of the Mamba parameters. For example, we tested two variations of the local convolutional kernel width (2 and 4) for the Mambas and found that a kernel width of 2 yields more promising results compared to 4 (Table 4). Therefore, we set the default kernel width to 2 for all datasets and Mambas.
Table 4: Ablation experiment on the local convolution width with input sequence length L = 96.
Prediction (T)→            96           192          336          720
Dataset (D) | d_conv       MSE   MAE    MSE   MAE    MSE   MAE    MSE   MAE
ETTh1       | 4            0.365 0.389  0.419 0.418  0.439 0.424  0.465 0.457
            | 2            0.364 0.387  0.415 0.416  0.429 0.421  0.458 0.453
ETTh2       | 4            0.275 0.333  0.347 0.383  0.350 0.382  0.411 0.433
            | 2            0.275 0.334  0.349 0.381  0.340 0.381  0.411 0.433

Figure 4: Memory footprint (in GB) for Traffic (with 862 channels) and Weather (with 21 channels), following iTransformer [21]. Panels: (a) Traffic, (b) Weather.
5.5 Ablation on State Expansion Factor of Mambas

The SSM state expansion factor (N) is another crucial parameter of Mamba. We ablate this parameter from a very small value of 8 up to the highest possible value of 256. Figure 6 demonstrates the effectiveness of this expansion factor while keeping all other parameters fixed. With a higher state expansion factor, there is a certain chance of performance improvement for varying prediction lengths. Therefore, we set N = 256 as the default value for all datasets, and the results in Tables 2 and 3 contain TimeMachine's performance with this default value.

Figure 6: MSE versus the state expansion factor (N) with input sequence length L = 96, for T = {96, 192, 336, 720}. Panels: (a) ETTh1, (b) ETTh2.

5.6 Ablation on Mamba Dimension Expansion Factor

We also experimented with the dimension expansion factor (E) of the Mambas, as demonstrated in Supplementary Figure 3. Increasing the block expansion factor does not lead to consistent improvements in performance. Instead, higher expansion factors come with a heavy cost in memory and training time. Therefore, we set this value to 1 by default in all Mambas and report the results in Tables 2 and 3.
6 Conclusion

This paper introduces TimeMachine, a novel model that captures long-term dependencies in multivariate time series data while maintaining linear scalability and small memory footprints. By leveraging an integrated quadruple-Mamba architecture to predict with rich global and local contextual cues at multiple scales, TimeMachine unifies channel-mixing and channel-independence situations, enabling accurate long-term forecasting. Extensive experiments demonstrate the model's superior performance in accuracy, scalability, and memory efficiency compared to state-of-the-art methods. Future work will explore TimeMachine's application in a self-supervised learning setting.
Acknowledgements

This research is supported in part by the NSF under Grant IIS 2327113 and the NIH under Grants R21AG070909, P30AG072946, and R01HD101508-01. We thank the University of Kentucky Center for Computational Sciences and Information Technology Services Research Computing for their support and the use of the Lipscomb Compute Cluster and associated research computing resources.

References

[1] M. A. Ahamed and Q. Cheng. MambaTab: A simple yet effective approach for handling tabular data. arXiv preprint arXiv:2401.08867, 2024.
[2] A. Ali, I. Zimerman, and L. Wolf. The hidden attention of Mamba models. arXiv preprint arXiv:2403.01590, 2024.
[3] A. Behrouz and F. Hashemi. Graph Mamba: Towards learning on graphs with state space models. arXiv preprint arXiv:2402.08678, 2024.
[4] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
[5] G. E. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung. Time Series Analysis: Forecasting and Control. John Wiley & Sons, 2015.
[6] A. Das, W. Kong, A. Leach, S. K. Mathur, R. Sen, and R. Yu. Long-term forecasting with TiDE: Time-series dense encoder. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=pCbC3aQB5W.
[7] E. Eldele, M. Ragab, Z. Chen, M. Wu, C. K. Kwoh, X. Li, and C. Guan. Time-series representation learning via temporal and contextual contrasting. arXiv preprint arXiv:2106.14112, 2021.
[8] S. Elfwing, E. Uchibe, and K. Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3-11, 2018. ISSN 0893-6080. doi: 10.1016/j.neunet.2017.12.012. URL https://www.sciencedirect.com/science/article/pii/S0893608017302976.
[9] J.-Y. Franceschi, A. Dieuleveut, and M. Jaggi. Unsupervised scalable representation learning for multivariate time series. Advances in Neural Information Processing Systems, 32, 2019.
[10] D. Y. Fu, T. Dao, K. K. Saab, A. W. Thomas, A. Rudra, and C. Re. Hungry Hungry Hippos: Towards language modeling with state space models. In International Conference on Learning Representations, 2022.
[11] A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
[12] A. Gu, K. Goel, and C. Re. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2021.
[13] A. Gu, I. Johnson, K. Goel, K. Saab, T. Dao, A. Rudra, and C. Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in Neural Information Processing Systems, 34:572-585, 2021.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[15] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[16] T. Kim, J. Kim, Y. Tae, C. Park, J.-H. Choi, and J. Choo. Reversible instance normalization for accurate time-series forecasting against distribution shift. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=cGDAkQo1C0p.
[17] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[18] Z. Li, S. Qi, Y. Li, and Z. Xu. Revisiting long-term time series forecasting: An investigation on linear mapping. arXiv preprint arXiv:2305.10721, 2023.
[19] M. Liu, A. Zeng, M. Chen, Z. Xu, Q. Lai, L. Ma, and Q. Xu. SCINet: Time series modeling and forecasting with sample convolution and interaction. Advances in Neural Information Processing Systems, 35:5816-5828, 2022.
[20] Y. Liu, H. Wu, J. Wang, and M. Long. Non-stationary Transformers: Exploring the stationarity in time series forecasting. Advances in Neural Information Processing Systems, 35:9881-9893, 2022.
[21] Y. Liu, T. Hu, H. Zhang, H. Wu, S. Wang, L. Ma, and M. Long. iTransformer: Inverted Transformers are effective for time series forecasting. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=JePfAI8fah.
[22] J. Ma, F. Li, and B. Wang. U-Mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722, 2024.
[23] Y. Nie, N. H. Nguyen, P. Sinthong, and J. Kalagnanam. A time series is worth 64 words: Long-term forecasting with Transformers. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=Jbdc0vTOcol.
[24] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
[25] Y. Schiff, C.-H. Kao, A. Gokaslan, T. Dao, A. Gu, and V. Kuleshov. Caduceus: Bi-directional equivariant long-range DNA sequence modeling. arXiv preprint arXiv:2403.03234, 2024.
[26] S. Tonekaboni, D. Eytan, and A. Goldenberg. Unsupervised representation learning for time series with temporal neighborhood coding. arXiv preprint arXiv:2106.00750, 2021.
[27] P. Trirat, Y. Shin, J. Kang, Y. Nam, J. Na, M. Bae, J. Kim, B. Kim, and J.-G. Lee. Universal time-series representation learning: A survey. arXiv preprint arXiv:2401.03717, 2024.
[28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[29] H. Wu, J. Xu, J. Wang, and M. Long. Autoformer: Decomposition Transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems, 34:22419-22430, 2021.
[30] H. Wu, T. Hu, Y. Liu, H. Zhou, J. Wang, and M. Long. TimesNet: Temporal 2D-variation modeling for general time series analysis. In The Eleventh International Conference on Learning Representations, 2022.
[31] L. Yang and S. Hong. Unsupervised time-series representation learning with iterative bilinear temporal-spectral fusion. In International Conference on Machine Learning, pages 25038-25054. PMLR, 2022.
[32] Z. Yue, Y. Wang, J. Duan, T. Yang, C. Huang, Y. Tong, and B. Xu. TS2Vec: Towards universal representation of time series. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 8980-8987, 2022.
[33] A. Zeng, M. Chen, L. Zhang, and Q. Xu. Are Transformers effective for time series forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 11121-11128, 2023.
[34] G. Zerveas, S. Jayaraman, D. Patel, A. Bhamidipaty, and C. Eickhoff. A Transformer-based framework for multivariate time series representation learning. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 2114-2124, 2021.
[35] Y. Zhang and J. Yan. Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting. In The Eleventh International Conference on Learning Representations, 2022.
[36] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang. Informer: Beyond efficient Transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 11106-11115, 2021.
[37] T. Zhou, Z. Ma, Q. Wen, X. Wang, L. Sun, and R. Jin. FEDformer: Frequency enhanced decomposed Transformer for long-term series forecasting. In International Conference on Machine Learning, pages 27268-27286. PMLR, 2022.
Supplementary Materials

Table 1: Ablation experiment on the residual connections with input sequence length L = 96 and T = {96, 192, 336, 720}.
Prediction (T)→           96           192          336          720
Dataset (D) | Res.        MSE   MAE    MSE   MAE    MSE   MAE    MSE   MAE
ETTh1       | ✗           0.366 0.395  0.423 0.425  0.430 0.427  0.474 0.462
            | ✓           0.364 0.387  0.415 0.416  0.429 0.421  0.458 0.453
ETTh2       | ✗           0.281 0.337  0.347 0.386  0.352 0.383  0.415 0.435
            | ✓           0.275 0.334  0.349 0.381  0.340 0.381  0.411 0.433

Figure 1: Performance (MSE) comparison for a diverse range of dropout rates (0.0 to 0.9) with input sequence length L = 96 and T = {96, 192, 336, 720}. Panels: (a) ETTh1, (b) ETTh2.

Figure 2: Qualitative comparison between TimeMachine and the second-best-performing methods from Table 2 (here PatchTST), together with the ground truth. Observations are shown from the test set for the case of L = 96 and T = 720 with a randomly selected channel and a window of 100 time points. Panels: (a) ETTm1, (b) ETTm2.

Table 2: Results for the long-term forecasting task with varying input sequence length L = {192, 336, 720} and T = {96, 192, 336, 720}.
Prediction (T)→       96           192          336          720
Dataset (D) | L       MSE   MAE    MSE   MAE    MSE   MAE    MSE   MAE
Weather     | 192     0.155 0.204  0.198 0.243  0.241 0.281  0.327 0.336
            | 336     0.151 0.201  0.192 0.240  0.236 0.278  0.318 0.334
            | 720     0.151 0.203  0.195 0.246  0.239 0.285  0.321 0.340
ETTh1       | 192     0.365 0.386  0.415 0.413  0.406 0.417  0.447 0.459
            | 336     0.360 0.387  0.398 0.410  0.386 0.411  0.443 0.457
            | 720     0.363 0.395  0.402 0.418  0.396 0.420  0.468 0.476
ETTh2       | 192     0.274 0.334  0.340 0.379  0.327 0.378  0.402 0.432
            | 336     0.267 0.334  0.324 0.375  0.316 0.375  0.392 0.429
            | 720     0.260 0.332  0.314 0.372  0.316 0.377  0.394 0.433
ETTm1       | 192     0.286 0.337  0.331 0.365  0.354 0.384  0.421 0.421
            | 336     0.286 0.337  0.328 0.364  0.355 0.381  0.408 0.413
            | 720     0.289 0.344  0.334 0.369  0.357 0.382  0.416 0.413

Figure 3: Comparative analysis for the dimension expansion factor (E) from 1 to 10, with input sequence length L = 96 and T = {96, 192, 336, 720}. Panels: (a) ETTh1, (b) ETTh2.

Figure 4: Scalability in terms of learnable parameters with respect to the look-back window (L = 96, 192, 336, 720, 1440). Panels: (a) ETTh2, (b) Weather.