
Integrating Mamba and Transformer for Long-Short Range Time Series Forecasting
Xiongxiao Xu¹, Yueqing Liang¹, Baixiang Huang¹, Zhiling Lan², Kai Shu¹
¹Department of Computer Science, Illinois Institute of Technology, Chicago, IL, USA
²Department of Computer Science, University of Illinois Chicago, Chicago, IL, USA
{xxu85,yliang40,bhuang15}@hawk.iit.edu,zlan@uic.edu,kshu@iit.edu

arXiv:2404.14757v1 [cs.LG] 23 Apr 2024

ABSTRACT
Time series forecasting is an important problem and plays a key role in a variety of applications, including weather forecasting, the stock market, and scientific simulations. Although Transformers have proven effective at capturing dependencies, the quadratic complexity of the attention mechanism prevents their further adoption in long-range time series forecasting, limiting them to short-range forecasting. Recent progress on state space models (SSMs) has shown impressive performance in modeling long-range dependencies due to their subquadratic complexity. Mamba, as a representative SSM, enjoys linear time complexity and has achieved strong scalability on tasks that require scaling to long sequences, such as language, audio, and genomics. In this paper, we propose to leverage a hybrid framework, Mambaformer, that internally combines Mamba for long-range dependencies and the Transformer for short-range dependencies, for long-short range forecasting. To the best of our knowledge, this is the first paper to combine the Mamba and Transformer architectures for time series data. We investigate possible hybrid architectures combining the Mamba layer and the attention layer for long-short range time series forecasting. The comparative study shows that the Mambaformer family can outperform Mamba and the Transformer on the long-short range time series forecasting problem. The code is available at https://github.com/XiongxiaoXu/Mambaformer-in-Time-Series.

KEYWORDS
Mamba, Transformer, Time Series Forecasting

1 INTRODUCTION
Time series forecasting is an important problem and has been widely used in many real-world scenarios, including weather forecasting [1], the stock market [27], and scientific simulations [38]. For example, in scientific simulation, researchers are interested in building a surrogate model on top of a machine learning model to forecast the behavior of a supercomputer across timescales and accelerate simulations, thus bypassing billions or even trillions of events [9].

Deep learning models, especially Transformer-based models, have made progress in time series forecasting. Benefiting from the attention mechanism, Transformers can achieve great advantages in depicting pairwise dependencies in time series data. However, recent research [41] has questioned the validity of Transformer-based forecasters using a simple linear model. Although the effectiveness of Transformer-based models was proven in later work [20, 22], the quadratic complexity of the attention mechanism remains computationally challenging. When inferring the next token, the Transformer has to find relationships in sequences across all past tokens. Albeit effective, such expensive computation is prohibitive for long distances and limits Transformers to short-range time series forecasting.

An emerging body of research suggests that State Space Models (SSMs) [11–14, 33] have shown promising progress on the sequence modeling problem. As a representative SSM, Mamba achieves performance comparable to the Transformer in language modeling while enjoying linear-time complexity. On the performance side, Mamba introduces a selective mechanism to remember relevant information and filter out irrelevant information indefinitely. On the computation side, Mamba implements a hardware-aware algorithm for parallel training like a CNN and can be regarded as an RNN for linear-time inference. Considering these two advantages, Mamba is exceptional at handling long-range time series data.

Recent findings show that SSMs and Transformers are complementary for language modeling [10, 17, 23]. We are interested in whether this observation is consistent for time series data. In this work, we propose to leverage a hybrid architecture, Mambaformer [23], that internally integrates the strengths of the Transformer and Mamba for long-short range series forecasting. The comparative experiments demonstrate that the Mambaformer family can integrate the advantages of Mamba and the Transformer, thus facilitating time series forecasting. To summarize, our contributions are as follows:
• We are the first to explore the potential of integrating Mamba and the Transformer for time series.
• We propose to adopt a hybrid architecture, Mambaformer, to capture long-short range dependencies in time series.
• We conduct a comparative study to demonstrate the superiority of the Mambaformer family compared with Mamba and the Transformer in long-short range time series forecasting.

2 RELATED WORK
2.1 Time Series Forecasting
Time series forecasting research has existed for a long time. Early researchers leveraged statistical and traditional machine learning methods, such as ARIMA [3], simple neural networks [6], and support vector machines [15], to forecast road traffic. However, these approaches are relatively weak due to their oversimplified assumptions and limited representation capabilities. Although more expressive deep learning models, including RNNs [8] and LSTMs [40], have been utilized to model time series data, they suffer from the gradient vanishing problem [30] when dealing with long-range sequences. Inspired by the success of Transformer [31] models on text data, a variety of Transformer variants [16, 19, 20, 22, 35, 36, 42, 43] have proven effective on time series data. For example, the latest iTransformer [20], which simply applies the attention and feed-forward network on the inverted dimensions, achieves SOTA performance.
Additionally, recent work [2, 24, 28, 34] based on SSMs proposes to leverage Mamba for time series forecasting. For instance, TimeMachine [2] utilizes four Mamba blocks to capture long-range dependencies in multivariate time series data. Different from the previous work, our paper makes the first attempt to combine the Transformer and Mamba for time series forecasting.

2.2 State Space Models and Mamba
State Space Models (SSMs) [11–14, 33] emerge as a promising class of architectures for sequence modeling. S4 is a structured SSM in which the specialized HiPPO [12] structure is imposed on the matrix A to capture long-range dependencies. Building upon S4, Mamba [11] designs a selective mechanism to filter out irrelevant information, and a hardware-aware algorithm for efficient implementation. Benefiting from these designs, Mamba has achieved impressive performance across modalities such as language, audio, and genomics while requiring only linear complexity in the sequence length, making it a potential alternative to the Transformer. Benefiting from its modeling capability and scalability, Mamba has recently shown significant progress in various communities, such as computer vision [29, 44], medicine [21, 37], graphs [4, 32], and recommendation [18, 39]. A noteworthy line of research combines the Transformer and Mamba for the purpose of language modeling [10, 17, 23]. A comparative study [23] shows that Mambaformer is effective on in-context learning tasks. Jamba [17] is the first production-grade attention-SSM hybrid model, with 12B active and 52B total available parameters, and shows desirable performance for long contexts. We are interested in whether the observation is consistent for time series data and propose to adapt Mambaformer for time series forecasting.

3 PRELIMINARIES
3.1 Problem Statement
In the long-short range time series forecasting problem, given historical time series samples with a look-back window $\mathcal{L} = (\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_L)$ of length $L$, where each $\mathbf{x}_t \in \mathbb{R}^M$ at time step $t$ has $M$ variates, we aim to forecast $F$ future values $\mathcal{F} = (\mathbf{x}_{L+1}, \mathbf{x}_{L+2}, ..., \mathbf{x}_{L+F})$ of length $F$. Besides, the associated temporal context information $(\mathbf{c}_1, \mathbf{c}_2, ..., \mathbf{c}_L)$ with dimension $C$ is assumed to be known [16], e.g., day-of-the-week and hour-of-the-day. Note that this work is under the rolling forecasting setting [42], where upon the completion of a forecast for $\mathcal{F}$, the look-back window moves forward $F$ steps towards the future so that the model can make the next forecast.
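To make the rolling forecasting setting concrete, the following minimal sketch (our own illustration, not code from the paper; the array shapes, the function name, and the default stride are assumptions) builds look-back/forecast pairs from a multivariate series:

```python
import numpy as np

def rolling_windows(series, lookback, horizon, stride=None):
    """Split a (T, M) multivariate series into (look-back, future) pairs.

    Under the rolling forecasting setting, after a forecast of `horizon` steps
    the look-back window advances by `horizon` steps (stride defaults to horizon).
    """
    stride = horizon if stride is None else stride
    X, Y, t = [], [], 0
    while t + lookback + horizon <= len(series):
        X.append(series[t : t + lookback])                       # (L, M) look-back window
        Y.append(series[t + lookback : t + lookback + horizon])  # (F, M) future values
        t += stride
    return np.stack(X), np.stack(Y)

# Toy usage: T = 1000 steps, M = 7 variates, look-back L = 196, horizon F = 96.
series = np.random.randn(1000, 7)
X, Y = rolling_windows(series, lookback=196, horizon=96)
print(X.shape, Y.shape)  # (8, 196, 7) (8, 96, 7)
```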

3.2 State Space Models
The State Space Model (SSM) is a recent class of sequence modeling frameworks that are broadly related to RNNs, CNNs, and classical state space models [14]. S4 [13] and Mamba [11] are two representative SSMs, and they are inspired by a continuous system that maps an input function or sequence $x(t) \in \mathbb{R}$ to an output function or sequence $y(t) \in \mathbb{R}$ through an implicit latent state $h(t) \in \mathbb{R}^N$ as follows:

$$h'(t) = \mathbf{A}h(t) + \mathbf{B}x(t), \qquad y(t) = \mathbf{C}h(t) \tag{1}$$

where $\mathbf{A} \in \mathbb{R}^{N \times N}$, $\mathbf{B} \in \mathbb{R}^{N \times 1}$, and $\mathbf{C} \in \mathbb{R}^{1 \times N}$ are learnable matrices. The SSM can be discretized from a continuous signal into discrete sequences by a step size $\Delta$. The discretized version is as follows:

$$h_t = \bar{\mathbf{A}}h_{t-1} + \bar{\mathbf{B}}x_t, \qquad y_t = \mathbf{C}h_t \tag{2}$$

where the discrete parameters $(\bar{\mathbf{A}}, \bar{\mathbf{B}})$ can be obtained from the continuous parameters $(\Delta, \mathbf{A}, \mathbf{B})$ through a discretization rule, such as the zero-order hold (ZOH) rule $\bar{\mathbf{A}} = \exp(\Delta\mathbf{A})$, $\bar{\mathbf{B}} = (\Delta\mathbf{A})^{-1}(\exp(\Delta\mathbf{A}) - \mathbf{I}) \cdot \Delta\mathbf{B}$. After discretization, the model can be computed in two ways: either as a linear recurrence for inference, as shown in Equation 2, or as a global convolution for training, as in Equation 3:

$$\bar{\mathbf{K}} = (\mathbf{C}\bar{\mathbf{B}}, \mathbf{C}\bar{\mathbf{A}}\bar{\mathbf{B}}, ..., \mathbf{C}\bar{\mathbf{A}}^{k}\bar{\mathbf{B}}, ...), \qquad y = x * \bar{\mathbf{K}} \tag{3}$$

where $\bar{\mathbf{K}}$ is a convolutional kernel.
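To make the discretization and the two computation modes concrete, here is a minimal NumPy sketch (our own illustration, not the paper's code; the scalar input channel, the random parameters, and the step size are assumptions) of the ZOH rule, the linear recurrence of Equation 2, and the global convolution of Equation 3:

```python
import numpy as np
from scipy.linalg import expm

N, T = 4, 8                               # state size, sequence length
rng = np.random.default_rng(0)
A = -np.diag(rng.uniform(0.5, 1.5, N))    # stable continuous state matrix (N, N)
B = rng.standard_normal((N, 1))           # input matrix (N, 1)
C = rng.standard_normal((1, N))           # output matrix (1, N)
delta = 0.1                               # step size

# Zero-order hold (ZOH): A_bar = exp(dA), B_bar = (dA)^-1 (exp(dA) - I) dB.
A_bar = expm(delta * A)
B_bar = np.linalg.inv(delta * A) @ (A_bar - np.eye(N)) @ (delta * B)

x = rng.standard_normal(T)                # scalar input sequence x_1..x_T

# (a) Linear recurrence (Equation 2): h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t.
h, y_rec = np.zeros((N, 1)), []
for t in range(T):
    h = A_bar @ h + B_bar * x[t]
    y_rec.append((C @ h).item())

# (b) Global convolution (Equation 3): K = (C B_bar, C A_bar B_bar, C A_bar^2 B_bar, ...).
K = np.array([(C @ np.linalg.matrix_power(A_bar, k) @ B_bar).item() for k in range(T)])
y_conv = [sum(K[j] * x[t - j] for j in range(t + 1)) for t in range(T)]

print(np.allclose(y_rec, y_conv))         # True: both modes produce the same outputs
```

The recurrent view is what makes a Mamba-style block inherently order-aware, which Section 4.3 exploits to replace the positional encoding.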

4 METHODOLOGY
4.1 Overview of Mambaformer
Inspired by the advantages of hybrid architectures in language modeling [23], we propose to leverage Mambaformer to integrate Mamba and the Transformer to capture long-short range dependencies in time series data, leading to enhanced performance. Mambaformer adopts a decoder-only style like the GPT [5, 25, 26] family.

Figure 1: The overview of the Mambaformer. [Architecture diagram omitted; labeled components: Inputs; Embedding Layer (Token Encoding, Temporal Encoding); Mamba Pre-Processing Block; L× Mambaformer Layer (Masked Multi-head Attention and Mamba Block, each with Add & Norm); Forecasting Layer; Outputs. The Mamba Block panel shows Linear projections, Conv, σ activations, an SSM, and an output Linear.]
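As a rough sketch of this decoder-only stacking (our own illustration, not the authors' released code; the module interfaces, names, and the stand-in sub-modules are assumptions), the composition can be written as:

```python
import torch
from torch import nn

class MambaformerModel(nn.Module):
    """Decoder-only stack following Figure 1: embedding -> Mamba pre-processing
    -> L x Mambaformer layers -> linear forecasting head. Sub-modules are
    injected so any attention/Mamba implementation can be plugged in."""

    def __init__(self, embedding, mamba_pre, layers, d_model, n_vars):
        super().__init__()
        self.embedding = embedding                   # token + temporal embedding -> E
        self.mamba_pre = mamba_pre                   # adds order information     -> H1
        self.layers = nn.ModuleList(layers)          # L x Mambaformer layer
        self.forecast = nn.Linear(d_model, n_vars)   # back to the M variates

    def forward(self, x, ctx):
        h = self.mamba_pre(self.embedding(x, ctx))
        for layer in self.layers:
            h = layer(h)
        return self.forecast(h)                      # (B, L, M) forecasts

# Smoke test with simple stand-ins for the injected parts.
class _ToyEmbed(nn.Module):
    def __init__(self, n_vars, n_ctx, d_model):
        super().__init__()
        self.tok, self.tem = nn.Linear(n_vars, d_model), nn.Linear(n_ctx, d_model)
    def forward(self, x, ctx):
        return self.tok(x) + self.tem(ctx)

model = MambaformerModel(_ToyEmbed(7, 4, 32), nn.Identity(), [nn.Identity()],
                         d_model=32, n_vars=7)
print(model(torch.randn(2, 196, 7), torch.randn(2, 196, 4)).shape)  # torch.Size([2, 196, 7])
```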
4.2 Embedding Layer
We utilize an embedding layer to map the low-dimensional time series data into a high-dimensional space, including a token embedding and a temporal embedding.
Token Embedding. To convert raw time series data into high-dimensional tokens, we utilize a one-dimensional convolutional layer as the token embedding module because it can retain local semantic information within the time series data [7].
Temporal Embedding. Besides the numerical values themselves, temporal context information also provides informative clues, such as hierarchical timestamps (week, month, and year) and agnostic timestamps (holidays and events) [41]. We employ a linear layer to embed the temporal context information.
Formally, let $\mathbf{X} \in \mathbb{R}^{B \times L \times M}$ denote the input sequences with batch size $B$ and $\mathbf{C} \in \mathbb{R}^{B \times L \times C}$ denote the associated temporal context. The embedding layer can be expressed as follows:

$$\mathbf{E} = E_{token}(\mathbf{X}) + E_{tem}(\mathbf{C}) \tag{4}$$

where $\mathbf{E} \in \mathbb{R}^{B \times L \times D}$ is the output embedding, $D$ is the dimension of the embedding, and $E_{token}$ and $E_{tem}$ denote the token embedding layer and the temporal embedding layer, respectively.
Note that we do not need the positional embedding that typically exists in Transformer models. Instead, a Mamba pre-processing block, introduced in the next subsection, is leveraged to internally incorporate positional information into the embedding.
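A minimal PyTorch sketch of this embedding layer (our own illustration based on the description above; the kernel size, circular padding, and dimension names are assumptions):

```python
import torch
from torch import nn

class EmbeddingLayer(nn.Module):
    """E = E_token(X) + E_tem(C): Conv1d token embedding plus a linear
    temporal embedding (Equation 4)."""

    def __init__(self, n_vars, n_ctx, d_model, kernel_size=3):
        super().__init__()
        # Conv1d over the time axis keeps local semantic information.
        self.token = nn.Conv1d(n_vars, d_model, kernel_size,
                               padding=kernel_size // 2, padding_mode="circular")
        self.temporal = nn.Linear(n_ctx, d_model)

    def forward(self, x, ctx):
        # x: (B, L, M) raw series, ctx: (B, L, C) temporal context -> (B, L, D)
        tok = self.token(x.transpose(1, 2)).transpose(1, 2)
        return tok + self.temporal(ctx)

emb = EmbeddingLayer(n_vars=7, n_ctx=4, d_model=32)
print(emb(torch.randn(2, 196, 7), torch.randn(2, 196, 4)).shape)  # torch.Size([2, 196, 32])
```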
4.3 Mamba Pre-Processing Layer
To endow the embedding with positional information, we pre-process the sequence with a Mamba block to internally embed the order information of the input tokens. Mamba can be regarded as an RNN where the hidden state $h_t$ at the current time $t$ is updated from the hidden state $h_{t-1}$ at the previous time $t-1$, as shown in Equation 2. Such a recurrence mechanism for processing tokens enables Mamba to naturally consider the order information of sequences. Therefore, unlike positional encoding, which is an essential component of the Transformer, Mambaformer replaces positional encoding with a Mamba pre-processing block. The Mamba pre-processing block can be expressed as follows:

$$\mathbf{H}_1 = \mathrm{Mamba}(\mathbf{E}) \tag{5}$$

where $\mathbf{H}_1 \in \mathbb{R}^{B \times L \times D}$ is a mixed representation that includes token embedding, temporal embedding, and positional information.

4.4 Mambaformer Layer
The core Mambaformer layer interleaves a Mamba layer and a self-attention layer to combine the advantages of Mamba and the Transformer and facilitate long-short range time series forecasting.
Attention Layer. To inherit the Transformer's impressive ability to depict short-range time series dependencies, we leverage a masked multi-head attention layer to obtain correlations between tokens. In particular, each head $i = 1, 2, ..., h$ in the attention layer transforms the embedding $\mathbf{H}_1$ into queries $\mathbf{Q}_i = \mathbf{H}_1\mathbf{W}_i^Q$, keys $\mathbf{K}_i = \mathbf{H}_1\mathbf{W}_i^K$, and values $\mathbf{V}_i = \mathbf{H}_1\mathbf{W}_i^V$, where $\mathbf{W}_i^Q \in \mathbb{R}^{D \times d_k}$, $\mathbf{W}_i^K \in \mathbb{R}^{D \times d_k}$, and $\mathbf{W}_i^V \in \mathbb{R}^{D \times d_v}$ are learnable matrices. Afterwards, scaled dot-product attention is utilized:

$$\mathbf{O}_i = \mathrm{Attention}(\mathbf{Q}_i, \mathbf{K}_i, \mathbf{V}_i) = \mathrm{softmax}\left(\frac{\mathbf{Q}_i\mathbf{K}_i^T}{\sqrt{d_k}}\right)\mathbf{V}_i \tag{6}$$

where the outputs $\mathbf{O}_i$ of each head are concatenated into an output vector $\mathbf{O}$ with embedding dimension $hd_v$. Following a learnable projection matrix $\mathbf{W}^O \in \mathbb{R}^{hd_v \times D}$, the output of the attention layer is $\mathbf{H}_2 = \mathbf{O}\mathbf{W}^O \in \mathbb{R}^{B \times L \times D}$. We adopt the masking mechanism to prevent positions from attending to subsequent positions, and set $d_k = d_v = D/h$ following the vanilla Transformer setting [31].
Mamba Layer. To overcome the computational challenges of the Transformer and go beyond its performance, we incorporate the Mamba layer into the model to enhance its capability of capturing long-range time series dependencies. As shown in Figure 1, the Mamba block is a sequence-to-sequence module with the same input and output dimension. In particular, Mamba takes an input $\mathbf{H}_2$ and expands the dimension through two input linear projections. For one projection, Mamba processes the expanded embedding through a convolution and SiLU activation before feeding it into the SSM. The core discretized SSM module is able to select input-dependent knowledge and filter out irrelevant information. The other projection, followed by a SiLU activation, acts as a residual connection and is combined with the output of the SSM module through a multiplicative gate. Finally, Mamba delivers the output $\mathbf{H}_3 \in \mathbb{R}^{B \times L \times D}$ through an output linear projection.

4.5 Forecasting Layer
At this layer, we obtain the forecasting results with a linear layer that converts the high-dimensional embedding space back to the original dimension of the time series data:

$$\hat{\mathbf{X}} = \mathrm{Linear}(\mathbf{H}_3) \tag{7}$$

where $\hat{\mathbf{X}} \in \mathbb{R}^{B \times L \times M}$ denotes the forecasting results.
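The sketch below (our own reading of Sections 4.4 and 4.5, not the authors' code) composes a Mambaformer layer from PyTorch's built-in masked multi-head attention and a gated, Mamba-style block whose data flow follows Figure 1. The selective SSM inside the block is left as a commented placeholder (a real model would plug in an official Mamba implementation), and the final lines apply the linear forecasting head of Equation 7:

```python
import torch
from torch import nn

class GatedMambaStyleBlock(nn.Module):
    """Data flow described in Section 4.4: two input projections, Conv1d + SiLU
    on one branch, a SiLU gate on the other, a multiplicative gate, and an
    output projection. The selective SSM itself is omitted to keep the sketch
    self-contained."""

    def __init__(self, d_model, expand=2, d_conv=4):
        super().__init__()
        d_inner = expand * d_model
        self.in_proj = nn.Linear(d_model, 2 * d_inner)   # the two input projections
        self.conv = nn.Conv1d(d_inner, d_inner, d_conv, groups=d_inner, padding=d_conv - 1)
        self.act = nn.SiLU()
        self.out_proj = nn.Linear(d_inner, d_model)

    def forward(self, x):                                # x: (B, L, D)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        u = self.act(self.conv(u.transpose(1, 2))[..., : x.size(1)].transpose(1, 2))
        # A real Mamba block applies the selective (input-dependent) SSM to `u` here.
        return self.out_proj(u * self.act(gate))         # multiplicative gate + output projection

class MambaformerLayer(nn.Module):
    """Masked multi-head attention followed by a Mamba-style block, each with Add & Norm."""

    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mamba = GatedMambaStyleBlock(d_model)
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, h1):                               # h1: (B, L, D)
        L = h1.size(1)
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool, device=h1.device), diagonal=1)
        a, _ = self.attn(h1, h1, h1, attn_mask=causal)   # prevent attending to future positions
        h2 = self.norm1(h1 + a)                          # Add & Norm
        return self.norm2(h2 + self.mamba(h2))           # Add & Norm -> H3

layer, forecast = MambaformerLayer(d_model=32, n_heads=4), nn.Linear(32, 7)
h3 = layer(torch.randn(2, 196, 32))
print(forecast(h3).shape)  # torch.Size([2, 196, 7]); Equation 7 maps back to the M variates
```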
5 EXPERIMENTS
5.1 Datasets and Evaluation Metrics
Datasets. To evaluate the Mambaformer family, we adopt three popular real-world datasets [41]: ETTh1, Electricity, and Exchange-Rate. All of them are multivariate time series. The statistics of the datasets are summarized in Table 1.

Table 1: The statistics of three real-world time series datasets.

Datasets      ETTh1     Electricity   Exchange_Rate
Variates      7         321           8
Timestamps    17,420    26,304        7,588

Metrics. We use the MSE (Mean Square Error) and MAE (Mean Absolute Error) metrics [41] to assess the Mambaformer family.
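For reference, both metrics can be computed directly from the forecasts and the ground truth, as in this small sketch (our own illustration; the array shapes are assumptions):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Square Error averaged over all samples, time steps, and variates."""
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean Absolute Error averaged over all samples, time steps, and variates."""
    return float(np.mean(np.abs(y_true - y_pred)))

y_true = np.random.randn(32, 96, 7)   # (batch, forecast length F, variates M)
y_pred = np.random.randn(32, 96, 7)
print(mse(y_true, y_pred), mae(y_true, y_pred))
```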
[Figure 2 diagram omitted; panels: (a) Mambaformer, (b) Attention-Mamba Hybrid, (c) Mamba-Attention Hybrid, (d) Mamba, (e) Transformer. Each panel stacks token and temporal encodings, an optional positional encoding, and L× repeated Mamba/attention/feed-forward layers in the order named by the panel.]

Figure 2: The structures of the Mambaformer family, Mamba, and the Transformer. For illustration, we omit the residual connections and layer normalization associated with the Mamba layer, attention layer, and feed-forward layer in the figure.

5.2 Mambaformer Family
As shown in Figure 2, we conduct a comparative study to investigate hybrid structures of Mamba and the Transformer [23]. In particular, we interleave the Mamba layer and the attention layer in different orders and compare them with Mamba and the Transformer. The structures of the Mambaformer family are as follows:
• Mambaformer utilizes a pre-processing Mamba block and a Mambaformer layer, without a positional encoding.
• Attention-Mamba Hybrid leverages an Attention-Mamba layer, where an attention layer is followed by a Mamba layer, with a positional encoding.
• Mamba-Attention Hybrid adopts a Mamba-Attention layer, where a Mamba layer is followed by an attention layer, without a positional encoding.
The other models in Figure 2 are as follows:
• Mamba leverages two Mamba blocks as a layer.
• Transformer is a decoder-only Transformer architecture.
Positional encoding is optional in the above architectures because the Mamba layer internally considers positional information while the Transformer does not. For the Mambaformer family, if a Mamba layer comes before an attention layer, the model does not need a positional encoding; otherwise, it does.
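One compact way to express these configurations (our own illustration; the field names and string labels are assumptions, not the paper's code) is as an ordering of sub-layers plus flags for the pre-processing block and the positional encoding:

```python
from dataclasses import dataclass

@dataclass
class Variant:
    name: str
    sublayer_order: tuple      # order of sub-layers inside one repeated (xL) layer
    pre_mamba: bool            # Mamba pre-processing block before the stack?
    needs_pos_encoding: bool   # explicit positional encoding required?

FAMILY = [
    # A Mamba layer placed before the attention layer supplies order information,
    # so no explicit positional encoding is needed.
    Variant("Mambaformer",            ("attention", "mamba"), pre_mamba=True,  needs_pos_encoding=False),
    Variant("Attention-Mamba Hybrid", ("attention", "mamba"), pre_mamba=False, needs_pos_encoding=True),
    Variant("Mamba-Attention Hybrid", ("mamba", "attention"), pre_mamba=False, needs_pos_encoding=False),
    # Baselines from Figure 2.
    Variant("Mamba",                  ("mamba", "mamba"),            pre_mamba=False, needs_pos_encoding=False),
    Variant("Transformer",            ("attention", "feed_forward"), pre_mamba=False, needs_pos_encoding=True),
]

for v in FAMILY:
    print(f"{v.name:24s} order={v.sublayer_order} pre_mamba={v.pre_mamba} pos_enc={v.needs_pos_encoding}")
```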
5.3 Comparative Performance
We show the comparative performance of the Mambaformer family, Mamba, and the Transformer in Table 2.

Table 2: Multivariate time series forecasting results of the comparative study. The values are averaged over multiple forecasting lengths $F \in \{96, 192, 336, 720\}$, where 96 and 192 correspond to short-range forecasting, and 336 and 720 correspond to long-range forecasting. The length of the look-back window is fixed at $L = 196$. The best results are in bold and the second best results are underlined.

Methods            ETTh1           Electricity     Exchange
                   MSE     MAE     MSE     MAE     MSE     MAE
Mambaformer        0.962   0.721   0.317   0.386   1.878   1.123
Attention-Mamba    0.995   0.792   0.349   0.409   2.029   1.126
Mamba-Attention    0.973   0.727   0.327   0.404   2.317   1.238
Mamba              1.417   0.914   0.322   0.400   2.423   1.174
Transformer        0.991   0.790   0.355   0.414   2.173   1.165

Accordingly, we have the following observations:
following observations:
• Mambaformer achieves superior performance compared to Mamba and the Transformer. This demonstrates that Mambaformer can integrate the strengths of Mamba and the Transformer and capture both short-range and long-range dependencies in time series data, thus outperforming them. The observations are consistent with the large-scale hybrid Mamba-Transformer language model Jamba [17].
• Mambaformer obtains the best performance in the Mambaformer family. This further supports the design of Mambaformer. Compared to the Attention-Mamba hybrid architecture, Mambaformer achieves better performance, demonstrating that the Mamba layer can pre-process time series data and internally provide positional information, eliminating the need for explicit positional encoding.
• The performance of the Attention-Mamba hybrid is comparable to that of the Mamba-Attention hybrid. This indicates that the order in which the Mamba layer and the attention layer are interleaved does not have a significant impact on the performance of long-short range time series forecasting.

6 DISCUSSION
This paper takes a first step in investigating the potential of hybrid Mamba-Transformer architectures for time series data. We propose to utilize a Mambaformer architecture for long-short range time series forecasting. We conduct a comparative study to investigate various combinations of Mamba and the Transformer. The results show that the Mambaformer family can integrate the advantages of Mamba and the Transformer, thus outperforming them in long-short range time series forecasting.
This work adapts Mambaformer for time series data but does not compare with SOTA methods in time series forecasting. Future directions include (1) proposing a new hybrid Mamba-Transformer architecture specifically for long-short range time series forecasting and achieving SOTA results on comprehensive datasets, (2) scaling to a large-scale hybrid Mamba-Transformer architecture specifically for long-short range time series forecasting, and (3) investigating combinations of the Transformer and other sequence modeling frameworks with subquadratic complexity on time series data.
REFERENCES
[1] Kumar Abhishek, Maheshwari Prasad Singh, Saswata Ghosh, and Abhishek Anand. 2012. Weather forecasting model using artificial neural network. Procedia Technology 4 (2012), 311–318.
[2] Md Atik Ahamed and Qiang Cheng. 2024. TimeMachine: A Time Series is Worth 4 Mambas for Long-term Forecasting. arXiv preprint arXiv:2403.09898 (2024).
[3] Mohammed S Ahmed and Allen R Cook. 1979. Analysis of freeway traffic time-series data by using Box-Jenkins techniques. Number 722.
[4] Ali Behrouz and Farnoosh Hashemi. 2024. Graph Mamba: Towards Learning on Graphs with State Space Models. arXiv preprint arXiv:2402.08678 (2024).
[5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
[6] Kit Yan Chan, Tharam S Dillon, Jaipal Singh, and Elizabeth Chang. 2011. Neural-network-based models for short-term traffic flow forecasting using a hybrid exponential smoothing and Levenberg–Marquardt algorithm. IEEE Transactions on Intelligent Transportation Systems 13, 2 (2011), 644–654.
[7] Ching Chang, Wen-Chih Peng, and Tien-Fu Chen. 2023. Llm4ts: Two-stage fine-tuning for time-series forecasting with pre-trained llms. arXiv preprint arXiv:2308.08469 (2023).
[8] Xingyi Cheng, Ruiqing Zhang, Jie Zhou, and Wei Xu. 2018. Deeptransport: Learning spatial-temporal dependency for traffic condition forecasting. In 2018 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–8.
[9] Elkin Cruz-Camacho, Kevin A Brown, Xin Wang, Xiongxiao Xu, Kai Shu, Zhiling Lan, Robert B Ross, and Christopher D Carothers. 2023. Hybrid PDES Simulation of HPC Networks Using Zombie Packets. In Proceedings of the 2023 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation. 128–132.
[10] Mahan Fathi, Jonathan Pilault, Pierre-Luc Bacon, Christopher Pal, Orhan Firat, and Ross Goroshin. 2023. Block-state transformer. arXiv preprint arXiv:2306.09539 (2023).
[11] Albert Gu and Tri Dao. 2023. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752 (2023).
[12] Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. 2020. Hippo: Recurrent memory with optimal polynomial projections. Advances in neural information processing systems 33 (2020), 1474–1487.
[13] Albert Gu, Karan Goel, and Christopher Ré. 2021. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396 (2021).
[14] Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. 2021. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in neural information processing systems 34 (2021), 572–585.
[15] Xuexiang Jin, Yi Zhang, and Danya Yao. 2007. Simultaneously prediction of network traffic flow based on PCA-SVR. In Advances in Neural Networks–ISNN 2007: 4th International Symposium on Neural Networks, ISNN 2007, Nanjing, China, June 3-7, 2007, Proceedings, Part II 4. Springer, 1022–1031.
[16] Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng Yan. 2019. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. Advances in neural information processing systems 32 (2019).
[17] Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, Omri Abend, Raz Alon, Tomer Asida, Amir Bergman, Roman Glozman, Michael Gokhman, Avashalom Manevich, Nir Ratner, Noam Rozen, Erez Shwartz, Mor Zusman, and Yoav Shoham. 2024. Jamba: A Hybrid Transformer-Mamba Language Model. arXiv:2403.19887 [cs.CL]
[18] Chengkai Liu, Jianghao Lin, Jianling Wang, Hanzhou Liu, and James Caverlee. 2024. Mamba4Rec: Towards Efficient Sequential Recommendation with Selective State Space Models. arXiv preprint arXiv:2403.03900 (2024).
[19] Shizhan Liu, Hang Yu, Cong Liao, Jianguo Li, Weiyao Lin, Alex X Liu, and Schahram Dustdar. 2021. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. In International conference on learning representations.
[20] Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. 2023. iTransformer: Inverted transformers are effective for time series forecasting. arXiv preprint arXiv:2310.06625 (2023).
[21] Jun Ma, Feifei Li, and Bo Wang. 2024. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722 (2024).
[22] Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. 2022. A time series is worth 64 words: Long-term forecasting with transformers. arXiv preprint arXiv:2211.14730 (2022).
[23] Jongho Park, Jaeseung Park, Zheyang Xiong, Nayoung Lee, Jaewoong Cho, Samet Oymak, Kangwook Lee, and Dimitris Papailiopoulos. 2024. Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks. arXiv preprint arXiv:2402.04248 (2024).
[24] Badri N Patro and Vijay S Agneeswaran. 2024. SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series. arXiv preprint arXiv:2403.15360 (2024).
[25] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training. (2018).
[26] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
[27] Omer Berat Sezer, Mehmet Ugur Gudelek, and Ahmet Murat Ozbayoglu. 2020. Financial time series forecasting with deep learning: A systematic literature review: 2005–2019. Applied soft computing 90 (2020), 106181.
[28] Zhuangwei Shi. 2024. MambaStock: Selective state space model for stock prediction. arXiv preprint arXiv:2402.18959 (2024).
[29] Yujin Tang, Peijie Dong, Zhenheng Tang, Xiaowen Chu, and Junwei Liang. 2024. VMRNN: Integrating Vision Mamba and LSTM for Efficient and Accurate Spatiotemporal Forecasting. arXiv preprint arXiv:2403.16536 (2024).
[30] Igor V Tetko, David J Livingstone, and Alexander I Luik. 1995. Neural network studies. 1. Comparison of overfitting and overtraining. Journal of chemical information and computer sciences 35, 5 (1995), 826–833.
[31] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
[32] Chloe Wang, Oleksii Tsepa, Jun Ma, and Bo Wang. 2024. Graph-mamba: Towards long-range graph sequence modeling with selective state spaces. arXiv preprint arXiv:2402.00789 (2024).
[33] Xiao Wang, Shiao Wang, Yuhe Ding, Yuehang Li, Wentao Wu, Yao Rong, Weizhe Kong, Ju Huang, Shihao Li, Haoxiang Yang, et al. 2024. State Space Model for New-Generation Network Alternative to Transformers: A Survey. arXiv preprint arXiv:2404.09516 (2024).
[34] Zihan Wang, Fanheng Kong, Shi Feng, Ming Wang, Han Zhao, Daling Wang, and Yifei Zhang. 2024. Is Mamba Effective for Time Series Forecasting? arXiv preprint arXiv:2403.11144 (2024).
[35] Qingsong Wen, Tian Zhou, Chaoli Zhang, Weiqi Chen, Ziqing Ma, Junchi Yan, and Liang Sun. 2022. Transformers in time series: A survey. arXiv preprint arXiv:2202.07125 (2022).
[36] Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. 2021. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in neural information processing systems 34 (2021), 22419–22430.
[37] Zhaohu Xing, Tian Ye, Yijun Yang, Guang Liu, and Lei Zhu. 2024. Segmamba: Long-range sequential modeling mamba for 3d medical image segmentation. arXiv preprint arXiv:2401.13560 (2024).
[38] Xiongxiao Xu, Xin Wang, Elkin Cruz-Camacho, Christopher D. Carothers, Kevin A. Brown, Robert B. Ross, Zhiling Lan, and Kai Shu. 2023. Machine Learning for Interconnect Network Traffic Forecasting: Investigation and Exploitation. In Proceedings of the 2023 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation. 133–137.
[39] Jiyuan Yang, Yuanzi Li, Jingyu Zhao, Hanbing Wang, Muyang Ma, Jun Ma, Zhaochun Ren, Mengqi Zhang, Xin Xin, Zhumin Chen, et al. 2024. Uncovering Selective State Space Model's Capabilities in Lifelong Sequential Recommendation. arXiv preprint arXiv:2403.16371 (2024).
[40] Huaxiu Yao, Xianfeng Tang, Hua Wei, Guanjie Zheng, and Zhenhui Li. 2019. Revisiting spatial-temporal similarity: A deep learning framework for traffic prediction. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 5668–5675.
[41] Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. 2023. Are transformers effective for time series forecasting?. In Proceedings of the AAAI conference on artificial intelligence, Vol. 37. 11121–11128.
[42] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. 2021. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI conference on artificial intelligence, Vol. 35. 11106–11115.
[43] Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. 2022. Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting. In International Conference on Machine Learning. PMLR, 27268–27286.
[44] Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. 2024. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417 (2024).
