Integrating Mamba and Transformer For Long-Short Range Time Series Forecasting
Xiongxiao Xu1, Yueqing Liang1, Baixiang Huang1, Zhiling Lan2, Kai Shu1
1 Department of Computer Science, Illinois Institute of Technology, Chicago, IL, USA
2 Department of Computer Science, University of Illinois Chicago, Chicago, IL, USA
{xxu85,yliang40,bhuang15}@hawk.iit.edu,zlan@uic.edu,kshu@iit.edu
ABSTRACT
Time series forecasting is an important problem and plays a key role in a variety of applications including weather forecasting, stock market, and scientific simulations. Although Transformers have proven effective in capturing dependencies, the quadratic complexity of the attention mechanism prevents their further adoption in long-range time series forecasting, thus limiting them to short-range dependencies. Recent progress on State Space Models (SSMs) has shown impressive performance in modeling long-range dependencies due to their subquadratic complexity. Mamba, as a representative SSM, enjoys linear-time complexity and has achieved strong scalability on tasks that require scaling to long sequences, such as language, audio, and genomics. In this paper, we propose to leverage a hybrid framework, Mambaformer, that internally combines Mamba for long-range dependencies and Transformer for short-range dependencies for long-short range forecasting. To the best of our knowledge, this is the first paper to combine the Mamba and Transformer architectures for time series data. We investigate possible hybrid architectures that combine the Mamba layer and the attention layer for long-short range time series forecasting. The comparative study shows that the Mambaformer family can outperform Mamba and Transformer on the long-short range time series forecasting problem. The code is available at https://github.com/XiongxiaoXu/Mambaformer-in-Time-Series.

KEYWORDS
Mamba, Transformer, Time Series Forecasting
1 INTRODUCTION
Time series forecasting is an important problem and has been widely used in many real-world scenarios, including weather forecasting [1], the stock market [27], and scientific simulations [38]. For example, in scientific simulation, researchers are interested in building a surrogate model on top of a machine learning model to forecast the behavior of a supercomputer across timescales, thereby accelerating simulations and bypassing billions or even trillions of events [9].

Deep learning models, especially Transformer-based models, have achieved progress in time series forecasting. Benefiting from the attention mechanism, Transformers can effectively depict pairwise dependencies in time series data. However, recent research [41] has questioned the validity of Transformer-based forecasters by comparing them with a linear model. Although the effectiveness of Transformer-based models has been confirmed in later work [20, 22], the quadratic complexity of the attention mechanism is still computationally challenging. When inferring the next token, the Transformer has to find relationships with all past tokens in the sequence, which, albeit effective, is costly for long sequences.

An emerging body of research suggests that State Space Models (SSMs) [11–14, 33] have shown promising progress in sequence modeling. As a representative SSM, Mamba achieves performance comparable to the Transformer in language modeling while enjoying linear-time complexity. On the performance side, Mamba introduces a selective mechanism to remember relevant information and filter out irrelevant information indefinitely. On the computation side, Mamba implements a hardware-aware algorithm for parallel training like a CNN and can be regarded as an RNN for linear-time inference. Considering the above two advantages, Mamba is exceptional in handling long-range time series data.

Recent findings show that SSMs and Transformers are complementary for language modeling [10, 17, 23]. We are interested in whether this observation also holds for time series data. In this work, we propose to leverage a hybrid architecture, Mambaformer [23], that internally integrates the strengths of Transformer and Mamba for long-short range time series forecasting. The comparative experiments demonstrate that the Mambaformer family can integrate the advantages of Mamba and Transformer, thus facilitating time series forecasting. To summarize, our contributions are as follows:
• We are the first to explore the potential of integrating Mamba and Transformer in time series.
• We propose to adopt a hybrid architecture, Mambaformer, to capture long-short range dependencies in time series.
• We conduct a comparative study to demonstrate the superiority of the Mambaformer family compared with Mamba and Transformer in long-short range time series forecasting.

2 RELATED WORK
2.1 Time Series Forecasting
Time series forecasting research has a long history. Earlier researchers leveraged statistical and traditional machine learning methods, such as ARIMA [3], simple neural networks [6], and support vector machines [15], to forecast road traffic. However, these approaches are relatively weak due to their oversimplified assumptions and limited representation capabilities. Although more expressive deep learning models, including RNNs [8] and LSTMs [40], have been utilized to model time series data, they suffer from the gradient vanishing problem [30] when dealing with long-range sequences. Inspired by the success of Transformer [31] models on text data, a variety of Transformer variants [16, 19, 20, 22, 35, 36, 42, 43] have proven effective on time series data. For example, the latest iTransformer [20], which simply applies the attention and feed-forward network on the inverted dimensions, achieves SOTA performance.
Additionally, recent work [2, 24, 28, 34] based on SSMs proposes to leverage Mamba for time series forecasting. For instance, TimeMachine [2] utilizes four Mamba blocks to capture long-range dependencies in multivariate time series data. Different from the previous work, our paper makes the first attempt to combine Transformer and Mamba for time series forecasting.

2.2 State Space Models and Mamba
State Space Models (SSMs) [11–14, 33] emerge as a promising class of architectures for sequence modeling. S4 is a structured SSM where the specialized HiPPO [12] structure is imposed on the matrix A to capture long-range dependencies. Building upon S4, Mamba [11] designs a selective mechanism to filter out irrelevant information and a hardware-aware algorithm for efficient implementation. Benefiting from these designs, Mamba has achieved impressive performance across modalities such as language, audio, and genomics while requiring only linear complexity in the sequence length, making it a potential alternative to the Transformer. Benefiting from its modeling capability and scalability, Mamba has recently shown significant progress in various communities, such as computer vision [29, 44], medicine [21, 37], graphs [4, 32], and recommendation [18, 39]. A noteworthy line of research combines the Transformer and Mamba for the purpose of language modeling [10, 17, 23]. A comparative study [23] shows that Mambaformer is effective in in-context learning tasks. Jamba [17] is the first production-grade attention-SSM hybrid model, with 12B active and 52B total available parameters, and shows desirable performance for long contexts. We are interested in whether the observation is consistent in time series data and propose to adapt Mambaformer for time series forecasting.

Figure 1: The overview of the Mambaformer. (From inputs to outputs, the model consists of an embedding layer with token encoding and temporal encoding, a Mamba pre-processing block, L× Mambaformer layers combining masked multi-head attention and a Mamba block, each followed by Add & Norm, and a forecasting layer; each Mamba block comprises linear projections, a convolution, σ-gated activations, and an SSM.)

In the long-short range time series forecasting problem, given historical time series samples with a look-back window ℒ = (x_1, x_2, ..., x_L) of length L, where each x_t ∈ R^M at time step t has M variates, we aim to forecast F future values ℱ = (x_{L+1}, x_{L+2}, ..., x_{L+F}) of length F. Besides, the associated temporal context information (c_1, c_2, ..., c_L) with dimension C is assumed to be known [16], e.g., day-of-the-week and hour-of-the-day. Note that this work is under the rolling forecasting setting [42], where upon the completion of a forecast for ℱ, the look-back window ℒ moves forward F steps towards the future so that the model can make the next forecast.
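To make the rolling forecasting setting concrete, below is a minimal NumPy sketch of how look-back/target windows could be generated. It is not the paper's data pipeline; the `make_rolling_windows` helper and its defaults (which mirror the L = 196 look-back and a 96-step horizon used in the comparative study) are illustrative assumptions.

```python
import numpy as np

def make_rolling_windows(series: np.ndarray, L: int = 196, F: int = 96):
    """Yield (look-back, target) pairs under the rolling forecasting setting.

    series: array of shape (T, M) -- T time steps, M variates.
    L: look-back window length; F: forecasting length (horizon).
    After each forecast, the look-back window rolls forward by F steps.
    """
    T = series.shape[0]
    start = 0
    while start + L + F <= T:
        lookback = series[start : start + L]        # (L, M) model input
        target = series[start + L : start + L + F]  # (F, M) values to forecast
        yield lookback, target
        start += F                                  # roll forward by the horizon

# Toy usage: a multivariate series with M = 7 variates.
toy = np.random.randn(5000, 7)
for x_window, y_window in make_rolling_windows(toy):
    pass  # x_window -> model -> prediction, evaluated against y_window
```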
An SSM maps an input signal x(t) to an output y(t) through a latent state h(t) ∈ R^N:

h′(t) = A h(t) + B x(t),   y(t) = C h(t)   (1)

where A ∈ R^{N×N}, B ∈ R^{N×1}, and C ∈ R^{1×N} are learnable matrices. The SSM can be discretized from a continuous signal into discrete sequences by a step size Δ. The discretized version is as follows:

h_t = Ā h_{t−1} + B̄ x_t,   y_t = C h_t   (2)

where the discrete parameters (Ā, B̄) can be obtained from the continuous parameters (Δ, A, B) through a discretization rule, such as the zero-order hold (ZOH) rule Ā = exp(ΔA), B̄ = (ΔA)^{−1}(exp(ΔA) − I) · ΔB. After discretization, the model can be computed in two ways: either as a linear recurrence for inference, as shown in Equation 2, or as a global convolution for training, as in Equation 3:

K̄ = (C B̄, C Ā B̄, ..., C Ā^k B̄, ...),   y = x ∗ K̄   (3)

where K̄ is a convolutional kernel.
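The duality between Equations 2 and 3 can be illustrated with a minimal NumPy sketch for a single-input, single-output SSM. This is not Mamba itself: the selective (input-dependent) parameters and the hardware-aware parallel scan are omitted, and the helper names (`discretize_zoh`, `ssm_recurrence`, `ssm_convolution`) are ours.

```python
import numpy as np
from scipy.linalg import expm

def discretize_zoh(A, B, delta):
    """Zero-order hold (ZOH): A_bar = exp(dA), B_bar = (dA)^{-1}(exp(dA) - I) dB."""
    dA = delta * A
    A_bar = expm(dA)
    B_bar = np.linalg.inv(dA) @ (A_bar - np.eye(A.shape[0])) @ (delta * B)
    return A_bar, B_bar

def ssm_recurrence(A_bar, B_bar, C, x):
    """Linear recurrence (Equation 2): h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t."""
    h = np.zeros((A_bar.shape[0], 1))
    ys = []
    for x_t in x:
        h = A_bar @ h + B_bar * x_t
        ys.append((C @ h).item())
    return np.array(ys)

def ssm_convolution(A_bar, B_bar, C, x):
    """Global convolution (Equation 3): y = x * K_bar with K_bar_k = C A_bar^k B_bar."""
    T = len(x)
    K = np.array([(C @ np.linalg.matrix_power(A_bar, k) @ B_bar).item() for k in range(T)])
    return np.convolve(x, K)[:T]  # causal convolution: y_t = sum_{k<=t} K_k x_{t-k}

# The two views agree, which is what allows CNN-style parallel training
# and RNN-style linear-time inference with the same parameters.
rng = np.random.default_rng(0)
N, T = 4, 32
A = -np.eye(N) + 0.1 * rng.standard_normal((N, N))
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
A_bar, B_bar = discretize_zoh(A, B, delta=0.1)
x = rng.standard_normal(T)
assert np.allclose(ssm_recurrence(A_bar, B_bar, C, x),
                   ssm_convolution(A_bar, B_bar, C, x))
```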
Figure 2: The structures of the Mambaformer family, Mamba, and Transformer: (a) Mambaformer, (b) Attention-Mamba Hybrid, (c) Mamba-Attention Hybrid, (d) Mamba, (e) Transformer. For illustration, we ignore the residual connections and layer normalization associated with the Mamba layer, attention layer, and feed-forward layer in the figure.
Table 2: Multivariate time series forecasting results of the comparative study. The values are averaged over multiple forecasting lengths F ∈ {96, 192, 336, 720}, where 96 and 192 correspond to short-range forecasting, and 336 and 720 correspond to long-range forecasting. The length of the look-back window is fixed at L = 196. The best results are in bold and the second best results are underlined.

• Mamba-Attention Hybrid adopts a Mamba-Attention layer where a Mamba block is followed by an attention layer, without positional encoding.

The other models in Figure 2 are as follows:
• Mamba leverages two Mamba blocks as a layer.
• Transformer is a decoder-only Transformer architecture.
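To make these hybrid layer structures concrete, here is a minimal PyTorch-style sketch of a Mambaformer-like layer that applies masked self-attention followed by a Mamba block, each with Add & Norm as in Figure 1. It is a sketch under stated assumptions rather than the authors' implementation: `MambaBlockStub` merely stands in for a real Mamba block (e.g., the `Mamba` module from the `mamba_ssm` package), and the module names, dimensions, and wiring are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlockStub(nn.Module):
    """Stand-in for a real Mamba block (e.g., mamba_ssm.Mamba); a gated causal
    convolution is used here only so the sketch runs without Mamba's CUDA kernels."""
    def __init__(self, d_model: int):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=2, groups=d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                              # x: (batch, seq_len, d_model)
        z, gate = self.in_proj(x).chunk(2, dim=-1)
        z = self.conv(z.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)  # causal conv
        return self.out_proj(F.silu(z) * F.silu(gate))  # placeholder for SSM + gating

class MambaformerLayer(nn.Module):
    """One hybrid layer: masked self-attention (short-range dependencies) followed
    by a Mamba block (long-range dependencies), each with residual Add & Norm."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mamba = MambaBlockStub(d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                              # x: (batch, seq_len, d_model)
        L = x.size(1)
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool, device=x.device), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=causal)
        x = self.norm1(x + attn_out)                   # Add & Norm after attention
        x = self.norm2(x + self.mamba(x))              # Add & Norm after Mamba block
        return x

# Toy usage: a 196-step look-back window embedded into d_model = 64.
x = torch.randn(2, 196, 64)
print(MambaformerLayer(d_model=64)(x).shape)           # torch.Size([2, 196, 64])
```

Stacking L such layers on top of an embedding layer (token plus temporal encoding) and a Mamba pre-processing block, and adding a forecasting head, yields the overall structure sketched in Figure 1.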