
Dartmouth College

Dartmouth Digital Commons

Computer Science Senior Theses Computer Science

Spring 2024

Exploring Tokenization Techniques to Optimize Patch-Based Time-Series Transformers
Gabriel L. Asher
gabriel.l.asher.24@dartmouth.edu

Follow this and additional works at: https://digitalcommons.dartmouth.edu/cs_senior_theses

Part of the Computer Sciences Commons

Recommended Citation
Asher, Gabriel L., "Exploring Tokenization Techniques to Optimize Patch-Based Time-Series Transformers"
(2024). Computer Science Senior Theses. 47.
https://digitalcommons.dartmouth.edu/cs_senior_theses/47

This Thesis (Undergraduate) is brought to you for free and open access by the Computer Science at Dartmouth
Digital Commons. It has been accepted for inclusion in Computer Science Senior Theses by an authorized
administrator of Dartmouth Digital Commons. For more information, please contact
dartmouthdigitalcommons@groups.dartmouth.edu.
Exploring Tokenization Techniques to Optimize Patch-Based
Time-Series Transformers
Gabriel Asher
Spring 2024

A Thesis submitted to the Faculty in partial fulfillment of the requirements for the
degree of Bachelor of Arts in Computer Science

Advised by Professor Sarah Preum

DARTMOUTH COLLEGE
Hanover, New Hampshire
Keywords: Time-Series Forecasting, Transformers, Tokenization, Deep Learning, Machine Learning

Abstract
Transformer architectures have revolutionized deep learning, impacting natural language
processing and computer vision. Recently, PatchTST has advanced long-sequence time-series
forecasting by embedding patches of time-steps to use as tokens for transformers. This study
examines and seeks to enhance PatchTST's embedding techniques. Using eight benchmark
datasets, we explore novel token embedding techniques. To this end, we introduce
several PatchTST variants, which alter the embedding methods of the original paper. These
variants consist of the following architectural changes: using CNNs to embed inputs to tokens,
embedding an aggregate measure like the mean, max, or sum of a patch, adding the exponential
moving average (EMA) of prior tokens to any given token, and adding a residual between neigh-
boring tokens. Our findings show that CNN-based patch embeddings outperform PatchTST's
linear layer strategy, and that simple aggregate measures, particularly embedding just the mean of
a patch, provide comparable results to PatchTST for some datasets. These insights highlight
the potential for optimizing time-series transformers through improved embedding strategies.
Additionally, they point to PatchTST's inefficiency at exploiting all of the information available in a
patch during the token embedding process.

1 Introduction
Since their introduction in 2017, transformer architectures have revolutionized the field of deep
learning [1]. In natural language processing, the transformer led to Large Language Models (LLMs)
like the GPT series, the Llama series, T5, and others [2, 3, 4]. In computer vision, transformers
have had a similar impact through the development of vision transformers (ViTs), which take
patches of images and use these as their tokens [5]. In long-sequence time-series signal forecasting,
although there is a history of transformer-based approaches having success [6, ?, 7], only with the
recent development of PatchTST [8] have transformers consistently outperformed non-transformer
methods [9]. Long-sequence time-series forecasting refers to time-series forecasting approaches in
settings like weather, traffic, electricity usage, etc., which contain long sequences of time-steps
across many features and data collection frequencies. PatchTST, similarly to a ViT,
patches a time-series signal, embeds these patches, then treats these embedded patches as tokens
to a transformer model which outputs a series of time-series forecasts. Naturally, given that trans-
former backbones are input-datatype independent, PatchTST's improvements in performance come
first and foremost from improvements in tokenization and embedding strategies. This observation
leads to the guiding question behind this study: how can we develop better embedding techniques
to transform time-series inputs into readable tokens for a transformer?
Through this exploration, we take methods similar to those in PatchTST and expand on them.
Using the same 8 datasets as prior works [8, 10], we explore different methods of aggregating patches,
direct tokenization of time-step values, architectural differences in tokenizing patches, and applying
EMA and residuals to tokens. We explore several embedding methods which are not covered in the
original paper. These methods are as follows: treating each time-series step as a token, embedding the
maximum of each patch, embedding the sum of each patch, embedding the mean of each patch,
using a Convolutional Neural Network (CNN) [11] to embed patches, applying an exponential mov-
ing average (EMA) of tokens prior to passing into the transformer backbone, and applying a residual
of neighboring tokens prior to passing into the transformer backbone. These additional explorations
serve to improve the general understanding of time-series transformers and can help maximize the
performance of these models.
This paper makes the following contributions to the study of deep learning in time-series, and
PatchTST variants in particular:
• We explore a number of novel embedding strategies in time-series, and show that one of
these, a CNN-based patch embedding method, outperforms PatchTST's linear layer strategy.
Additionally, we suggest a hypothesis as to why this strategy may be more effective.

• We show that embedding and learning based on aggregate measures of a patch - particularly
the mean - offers comparable results to using all time-steps for some datasets. This conclusion
supports three insights: transformers have difficulties learning on time series, PatchTST does
not embed information efficiently, and the mean of patches alone may capture sufficient
information for time-series forecasting.

2 Related Work
2.0.1 Non-Transformer Based Time-Series Forecasting
Time-series forecasting has had a long history. Early approaches involve RNNs, such as LSTMs [12].
This method involves passing an input through a series of learnable gates to reach a final output
prediction. Other methods use CNNs to extract features which are used to predict output sequences
[13]. These methods have been used successfully in time-series forecasting [14]; however, they face issues
with long-range dependencies since information extraction is localized by the kernel. Finally, there
have been more recent non-transformer based time-series forecasters which have had success. Most
notable among these is DLinear [9], which separates a signal into a global trend, extracted using
a moving average kernel, and the remainder of said signal. It passes each of these two separated signals
into its own learnable linear layer, and then combines the results into its final forecasted predictions.
However, although this method has outperformed some transformer-based methods, it still falls short
of PatchTST, our main baseline.

2.0.2 ViTs and Patch-Based Transformers


One of the most difficult challenges in dealing with time-series data is its lack of discretization and
continuous-valued nature. This challenge becomes especially pronounced when using transformers
for any sort of task. In fields like NLP, transformer-based models are given embeddings generated
from a fixed set of tokens. This strategy of using a discrete set of embeddings has had success in that
domain [4, 15]. However, there is no such strategy in domains that have more continuous values like
computer vision (CV) and time-series. Given the considerably larger amount of research conducted
on applying transformers to CV, a significant body of literature exists on how to effectively generate
embeddings from continuous-valued data. The dominant paradigm in CV, which PatchTST applies,
is a concept called "patching". First proposed in Vision Transformers (ViTs) [5], patching consists of
subdividing inputs into parts or "patches", then passing each of these patches through a featurizer
(often a learnable linear layer) to generate tokens for the subsequent transformer. This strategy has
many variations and has been instrumental to many popular CV architectures [16, 17, 18, 19, 20, 21].
PatchTST builds on ideas introduced by ViTs, and applies patch-based embedding strategies to time-
series data. However, despite their similarities, there are some unique characteristics of time-series
which separate it from CV. Notably, while images in CV may be multi-channelled, patches in
time-series are 1D instead of 2D, which diminishes the amount of spatial information available.

2.0.3 Transformer-based Time-Series


Prior to the patch-based time-series transformer coined by PatchTST [8], which we explore further
in this work, several other transformer-based time-series forecasting models existed. Most of these
methods seek to reduce the O(n²) time complexity of the attention mechanism, which poses a
challenge to the use of transformers in long-sequence time-series data, since time-step sequences are
often long. LogTrans [22] reduces the time complexity to O(n(log n)²) by using CNNs to generate the
transformer's key and query values. Informer [7] selects only the most relevant keys and also employs
distillation to reduce the dimensionality of inputs to subsequent attention layers. FEDformer [6]
uses an FFT, low-rank approximation, and sparse attention to further reduce time complexity to a
linear level. Autoformer [10] does not try to decrease the time complexity of the attention mechanism,
but it uses a similar strategy to [9] in decomposing inputs into trend and seasonal components, using
the sum of these for forecasts.
Although we recognize the existence of these transformer-based baselines, we choose not to
include them in the set of baselines for this paper given that [8] shows that PatchTST, our primary
baseline, outperforms these methods on all of the datasets we use.

3 Methods
3.1 Problem Formulation
The base problem of time-series forecasting consists of the following. Given a series of time steps
x_{1,...,n}, how accurately can we forecast time steps x_{n+1,...,n+T}, where n is the number of time-steps
which are seen by the model, and T is the length of the sequence of time-steps that the model
is predicting? To this end, we use a transformer-based deep-learning model to forecast future
time-steps.
A transformer-based model employs layers that utilize self-attention mechanisms to selectively
prioritize certain input tokens over others, effectively learning to assign varying degrees of importance
to different parts of the input data. This is achieved through the concepts of keys, queries, and values,
which are critical components of the attention mechanism. Each of these elements is represented
by matrices that are learned during training.
In the context of self-attention, the input data is transformed into three matrices:

• Queries (Q): Represent the elements for which the model seeks relevant information.
• Keys (K): Represent all possible elements that can be attended to (i.e., the information that
can be retrieved).
• Values (V): Represent the actual information that is retrieved.
The self-attention mechanism can be mathematically described by the following equation:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

where QK^T is the dot product of the queries and keys, which measures the alignment or similarity
between these elements. This product captures the relationship between any two tokens fed to
the transformer. The result is then scaled by sqrt(d_k), where d_k is the dimension of the keys and
queries, to help stabilize the gradients during training. The softmax function is applied to convert these
scores into probabilities that sum to one, representing the weights assigned to each value. Finally, these
probabilities, which represent the relationship between elements, are multiplied by the values
matrix V to return a final embedding. Each of the three key matrices in a transformer layer has a
corresponding learnable weight matrix (W^Q, W^K, W^V), where W^Q and W^K have the shape (d, d_k)
and W^V has the shape (d, d_v). Here, d is the embedding dimension, d_v is the value dimension,
and d_k is the key and query dimension. The final product is an embedding of size (d, d_v), which, if
d ≠ d_v, is reshaped to the embedding dimension d.
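To make the mechanics above concrete, the following minimal PyTorch sketch implements scaled dot-product attention exactly as written in the equation; the tensor shapes and batched layout are illustrative assumptions rather than details taken from the thesis code.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (batch, n_tokens, d_k); V: (batch, n_tokens, d_v).
    d_k = Q.size(-1)
    # Pairwise similarity between tokens, scaled by sqrt(d_k) to stabilize gradients.
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    # Softmax over the key axis yields weights that sum to one for each query.
    weights = F.softmax(scores, dim=-1)
    # Weighted combination of the value vectors.
    return weights @ V

# Example: 10 tokens with d_k = d_v = 64.
q = k = v = torch.randn(1, 10, 64)
out = scaled_dot_product_attention(q, k, v)  # shape (1, 10, 64)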
The transformer utilized in this paper follows the generalized encoder/decoder structure which is
outlined in Figure 1. In this transformer, time steps x_{(1,...,n),1} are fed through some sort of tokenizer
to create tokens t_{(1,...,n),d}, where d is the model dimensionality. This tokenization strategy is crucial
to the performance of a transformer, since the tokens used are the base numerical representations
of the data, and information-rich tokens can subsequently be leveraged by the transformer. In the
case of this paper, we prepend a special token, CLS_{1,d}, to t_{(1,...,n),d}. After these tokens are created,
we then add a learnable positional encoding to them. They are subsequently fed into a
basic transformer encoder. For the case of this study, we utilize a fixed transformer of depth
2 with 4 attention heads. We then use the special token embedding, CLS_{1,d}, as an input to a
decoder which outputs forecasted time steps x_{(n+1,...,n+T),1} using a single fully connected linear
layer. Finally, for normalization and re-normalization of data, we use a RevIN layer [23], which uses
learnable parameters in addition to traditional statistical values (mean, variance) to minimize the
effects of distribution shift in normalization.

[Figure 1 diagram: input time steps x_1, ..., x_n flow through the tokenization stage, a transformer
encoder, and a decoder to produce forecasts x_{n+1}, ..., x_{n+T}.]
Figure 1: Transformer architecture used in this paper. First, we tokenize input time-series
steps into embeddings. Next, these embeddings are passed through a transformer encoder. The
transformer encoder then generates a data representation, which we then feed into a simple decoder
to generate forecasted time-series steps.

This study in particular aims to examine the effects of different tokenization methods on downstream
performance. To this end, we keep all other parts of the model as unchanged as possible.

3.2 Data Preparation


For this study, we utilize the same eight datasets as previous works [8, 10, 24], ensuring compara-
bility and continuity. Table 1 outlines each dataset, their sizes, and their number of features. Each
dataset comprises multivariate time series, x_{(1,...,n),l}, where l is the number of features in a given
step. We use the dataset splits provided by [8, 10], which are 70%, 10%, 20% train, validation, test
splits respectively. Data is accessible from the Autoformer github.1 Like [8], despite the multivariate
forecasting framework, we treat all features in a given time-step as independent. Thus, our method
is multi-variate with channel (feature) independence. These datasets, which are widely used as
benchmarks in other works, comprise various tasks, time intervals, and modalities. Illness contains
weekly data of influenza-like illnesses from the US CDC from 2002-2021. Electricity contains hourly
electricity consumption by 321 customers. Traffic is hourly data which measures road occupancy
rates. Weather is data recorded every 10 minutes, measuring 21 different weather indicators (tem-
perature, humidity, etc.). Finally, ETTm1 and ETTm2 collect data from electricity transformers
on a 15 minute basis, while ETTh1 and ETTh2 collect data from electricity transformers on a 1 hour
basis.
1 https://github.com/thuml/Autoformer/

Table 1: Datasets used, size of datasets, and number of features per dataset.

Split Illness Electricity Traffic Weather ETTh1 ETTh2 ETTm1 ETTm2


Train 557 17981 11849 36456 11763 11763 48345 48345
Val 74 2537 1661 5175 1647 1647 6873 6873
Test 170 5165 3413 10444 3389 3389 13841 13841
Features 7 321 862 21 7 7 7 7
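For reference, the chronological 70/10/20 split described above can be reproduced with a few lines; this is a generic sketch using stand-in data, not the loaders from the thesis repository, and the counts in Table 1 may additionally reflect windowing, so a naive split need not match them exactly.

import numpy as np

def chronological_split(series, train_frac=0.7, val_frac=0.1):
    """Split a (n_steps, n_features) array into train/val/test without shuffling."""
    n = len(series)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return series[:train_end], series[train_end:val_end], series[val_end:]

# Stand-in multivariate series with 21 features, as in the Weather dataset.
data = np.random.randn(50_000, 21)
train, val, test = chronological_split(data)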

3.3 Time-Series Tokenization


We investigate multiple tokenization techniques to improve the transformer model's understanding
of time-series data and its ability to forecast on the modality. The tokenization methods are as follows,
and are additionally outlined in Figure 2; a minimal code sketch of several of these variants follows the list.

• Direct Tokenization: Each time step is treated as a separate token. Thus, we use
a linear layer to transform time steps from shape x_{(1,...,n),1} to t_{(1,...,n),d}.
• Patch-Based Embedding: Most of our tokenization strategies utilize some sort of patch-
based embedding technique like [8]. We list our various patch-based methods below.

– PatchTST. Given time-steps x_{(1,...,n),1}, we split the sequence into patches of length P, such
that P < n and n div P = N, the number of patches. Then, we pass each patch p_{(1,...,patch_len),1}
through a linear layer to generate a token t_{(1,d)}. We are left with a final set of tokens T_{(N,d)}.
– Max/Mean/Sum: we do the same as PatchTST, except instead of passing the whole patch
into the learnable linear layer, we just pass the max/mean/sum of the patch. Thus, for
each given patch p_{(1,...,patch_len),1}, we use an aggregation to turn it into p_{(1,1)}, before
passing it through a linear layer to generate a token t_{(1,d)}. We are left with a final set
of tokens T_{(N,d)}.
– CNN: For a given patch p_{(1,...,patch_len),1}, we pass this patch through two 1D CNN layers to
generate p_{(patch_len,cnn_dim)}, where patch_len × cnn_dim = d, and d is the embedding
dimension. Then, for each patch we flatten the patch into its final token, yielding the
set of tokens T_{(N,d)}.
• Exponential Moving Average (EMA): We embed each token using the PatchTST em-
bedding strategy, generating a set of tokens T_{(N,d)}. Then, for any given token t_n, we add the
EMA of tokens t_{(1,...,n-1)} to this token. We explore this tokenization strategy because steps
in a time-series sequence may not be independent; it allows us to capture this dependence and
condition tokens on prior context.
• Residuals: Here, we embed each token using the PatchTST embedding strategy, generating a
set of tokens T_{(N,d)}. Then, for any given token t_n, we add a residual of its neighboring tokens
t_{n-1}, t_{n+1}. Similarly to EMA, in this instance we operate under the assumption that steps in
a time-series sequence may not be independent. We believe that looking at neighboring tokens
may help inform dependencies hidden in the data.
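To make these variants concrete, the sketch below shows one possible implementation of the aggregate, CNN, EMA, and neighbor-residual tokenizers described above. The kernel sizes, the EMA decay, and the handling of edge tokens are illustrative assumptions; the 0.1 residual weight mirrors Figure 2, and the thesis repository should be treated as the reference implementation.

import torch
import torch.nn as nn

def make_patches(x, patch_len=8):
    # x: (batch, n_steps) -> (batch, n_patches, patch_len), non-overlapping patches.
    return x.unfold(-1, patch_len, patch_len)

class AggregateEmbed(nn.Module):
    """Mean/Max/Sum variant: embed a single aggregate value per patch."""
    def __init__(self, d_model=256, mode="mean"):
        super().__init__()
        self.mode = mode
        self.proj = nn.Linear(1, d_model)

    def forward(self, patches):
        if self.mode == "mean":
            agg = patches.mean(-1, keepdim=True)
        elif self.mode == "max":
            agg = patches.max(-1, keepdim=True).values
        else:
            agg = patches.sum(-1, keepdim=True)
        return self.proj(agg)                                # (batch, n_patches, d_model)

class CNNEmbed(nn.Module):
    """CNN variant: two 1D convolutions over each patch, then flatten to one token."""
    def __init__(self, patch_len=8, d_model=256):
        super().__init__()
        cnn_dim = d_model // patch_len                       # so patch_len * cnn_dim == d_model
        self.net = nn.Sequential(
            nn.Conv1d(1, cnn_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(cnn_dim, cnn_dim, kernel_size=3, padding=1),
        )

    def forward(self, patches):
        b, n, p = patches.shape
        feats = self.net(patches.reshape(b * n, 1, p))       # (b*n, cnn_dim, patch_len)
        return feats.reshape(b, n, -1)                       # flatten -> (batch, n_patches, d_model)

def add_ema(tokens, decay=0.9):
    """EMA variant: add the EMA of all earlier tokens to each token."""
    out, ema = [tokens[:, 0]], tokens[:, 0]
    for i in range(1, tokens.size(1)):
        out.append(tokens[:, i] + ema)                       # condition on prior context
        ema = decay * ema + (1 - decay) * tokens[:, i]
    return torch.stack(out, dim=1)

def add_neighbor_residual(tokens, w=0.1):
    """Neighbor-residual variant: add weighted residuals of adjacent tokens (zeros at the edges)."""
    zeros = torch.zeros_like(tokens[:, :1])
    left = torch.cat([zeros, tokens[:, :-1]], dim=1)         # token i-1
    right = torch.cat([tokens[:, 1:], zeros], dim=1)         # token i+1
    return tokens + w * left + w * right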

[Figure 2 diagram, panels (a) and (b): panel (a) shows the PatchTST, No Patch, and Mean/Max/Sum
pipelines, each feeding a linear layer over patches of 8 time-steps (or single steps, or their aggregates)
into the transformer encoder; panel (b) shows PatchTST - CNN (two 1D CNNs followed by a flatten),
PatchTST - EMA (Token_j + EMA(Token_1, ..., Token_{j-1})), and PatchTST - Neighbor Residual
(Token_i + 0.1·Token_{i-1} + 0.1·Token_{i+1}).]

Figure 2: Depiction of embedding strategies analysed. a) Represents PatchTST, NoPatch and Patch
Mean/Max/Sum strategies. PatchTST is the strategy conceived of in [8], No Patch treats every time-
step as a token, Mean/Max/Sum takes the aggregate value of a patch and only embeds this. b)
Represents more alternative tokenization strategies. PatchTST - CNN uses CNN layers to extract
features from any given patch. PatchTST - EMA adds the EMA of prior tokens to any given token.
PatchTST - Neighbor Residual adds a residual of neighboring tokens to each token.

3.4 Training
All models were trained with depth 2, 4 heads, a hidden dimension of 256, a learning rate of 1e-4,
and a patch length of 8 for all models and datasets. Our batch size and number of epochs varied per
dataset. We train for between 5 and 100 epochs (for Electricity and Illness respectively), and our batch
size ranges from 4 to 128. We use MSE as our loss function, and we evaluate the test set on the top
performing checkpoint on our validation set. Furthermore, for all of the datasets but Illness, our
model takes in a sequence of 336 time steps (equivalent to 42 patches) and forecasts the subsequent 96
time steps. For the Illness dataset, since it is much smaller, our model takes in 96 time steps, and forecasts
the next 24 time steps. Finally, we use Phil Wang's x-transformers transformer implementation 2
and all of our code is available here: https://github.com/gaasher/embedding_thesis. We use
pytorch-lightning for our training setup, and Weights & Biases for logging and checkpointing.
For our evaluation metrics, similarly to [8], we use mean-squared error (MSE) and mean-absolute
error (MAE) as our reported metrics.
2 https://github.com/lucidrains/x-transformers
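The shared hyperparameters above can be collected into a small configuration, and the two reported metrics are straightforward to compute; the field names below are illustrative rather than taken from the repository.

import torch

# Illustrative configuration mirroring Section 3.4 (field names are assumptions).
config = {
    "depth": 2,              # transformer encoder layers
    "n_heads": 4,            # attention heads
    "d_model": 256,          # hidden dimension
    "learning_rate": 1e-4,
    "patch_len": 8,
    "seq_len": 336,          # 96 for Illness
    "pred_len": 96,          # 24 for Illness
    "loss": "mse",
}

def mse(pred, target):
    return torch.mean((pred - target) ** 2)

def mae(pred, target):
    return torch.mean(torch.abs(pred - target))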

4 Results

Table 2: Results of models with different embedding strategies on datasets (MAE and MSE values)

Model Metric Illness Electricity Traffic Weather ETTh1 ETTh2 ETTm1 ETTm2
Base PatchTST MAE 0.8604 0.2274 0.2747 0.2049 0.4602 0.3039 0.3813 0.2279
MSE 1.869 0.1324 0.3902 0.1529 0.438 0.1934 0.3318 0.115
No Patch MAE 0.7995 0.2355 0.2903 0.205 0.4594 0.2961 0.3993 0.2317
MSE 1.703 0.1372 0.4037 0.1556 0.4384 0.1832 0.356 0.1214
Patch Max MAE 0.9247 0.3765 0.3616 0.2053 0.5421 0.3 0.3998 0.2359
MSE 1.959 0.2816 0.5097 0.1571 0.6022 0.1883 0.3508 0.1232
Patch Mean MAE 0.8647 0.3001 0.3395 0.2022 0.4968 0.3065 0.3838 0.2359
MSE 1.905 0.1978 0.497 0.1542 0.4916 0.1892 0.333 0.1243
Patch Sum MAE 0.9052 0.3067 0.3509 0.206 0.5121 0.3175 0.4005 0.2414
MSE 1.913 0.2039 0.5205 0.1559 0.5113 0.2067 0.357 0.1273
PatchTST EMA MAE 0.825 0.2274 0.2747 0.2046 0.4605 0.3009 0.3815 0.2297
MSE 1.869 0.1324 0.3902 0.1529 0.4354 0.1867 0.3292 0.1178
PatchTST - Neighbor Residual MAE 0.8624 0.2274 0.2745 0.203 0.4591 0.2977 0.3832 0.2295
MSE 1.874 0.1327 0.391 0.1528 0.4336 0.1842 0.3312 0.1177
PatchTST - CNN MAE 0.8364 0.2279 0.2738 0.2035 0.4512 0.2807 0.3779 0.2248
MSE 1.789 0.132 0.3848 0.1544 0.4177 0.1707 0.3254 0.1127

Table 2 reports MAE and MSE results for each respective embedding strategy and dataset.
Bold results represent the top performing model for any given metric and dataset. PatchTST - CNN
outperforms all other methods on most datasets. For all ETT datasets, PatchTST - CNN outperforms
alternative methods for both MAE and MSE. This improvement is especially pronounced on
ETTh1 and ETTh2, where PatchTST - CNN outperforms PatchTST in MSE by 0.0203 and 0.0227
respectively. After PatchTST - CNN, performance on these datasets is comparable across the board,
with PatchTST - Neighbor Residual, Max, No Patch, and PatchTST - EMA all outperforming Base
PatchTST on certain ETT datasets. Performance varies more on Illness, Electricity, Traffic, and
Weather. However, even in these datasets, PatchTST - CNN outperforms Base PatchTST in all
but MSE for Weather and MAE for Electricity. PatchTST - CNN has especially large performance
improvements for the Illness dataset. This may be because the Illness dataset has fewer tokens
than the others, and PatchTST - CNN may capture more information across any given patch.
Additionally, No Patch significantly outperforms other methods on the Illness dataset. This is
likely because the transformer model benefits from processing each time step individually, capturing
temporal dependencies more effectively without the need for aggregation within patches. This
result suggests that, for certain datasets, simplifying the tokenization process by avoiding patching
altogether can lead to better performance. Finally, certain datasets seem to be harder to predict
than others. Notably, the Illness dataset has higher MAE and MSE values on aggregate than any
of the other datasets. This may occur because the data collection frequency is weekly, and thus the
model has limited signal to go off of.

5 Discussion
5.1 PatchTST - Mean performance
This exploration of embedding strategies on time-series data reveals a few key insights on the work-
ings and effectiveness of PatchTST and its methods. Primarily, we found that although PatchTST
does outperform simpler methods which embed the mean, max, or sum of a patch, outside of
the Electricity and Traffic datasets the differences in performance are not as large as one
might think. In particular, we found that using the mean of a patch, and embedding this value
alone, performs similarly well on most time-series tasks evaluated, and even outperforms the base
PatchTST on MAE for the large Weather dataset. This has a few potentially insightful implications.
Firstly, it could indicate that transformer architectures just have difficulties with modelling time-
series data. Chen et al. [24] show that an MLP-only approach achieves similar performance to
PatchTST and also that simple linear modelling outperforms many other transformer-based time-
series forecasting models. Thus, the results which we present add to the existing literature which
suggests challenges with time-series forecasting using transformers.
Secondly, this under-performance of PatchTST suggests that it does not embed information
efficiently and effectively. Given our patch size of 8, and the fact that PatchTST feeds all 8 of
these values into a linear layer to generate an embedding, one would expect the performance to be
significantly improved over only using one of these values or some aggregate of them. Thus, this
study has exhibited limited efficiency in the current PatchTST method, and has highlighted
that future work centering on information-rich tokens is needed.
Thirdly, the under-performance of PatchTST, and the near equivalent performance of embedding
the mean alone, may be quite literally a regression to the mean. This may suggest two things. Firstly,
given the localized nature of our patching strategy, taking the localized mean of time-step windows
may be an effective strategy for capturing macro-trends. Secondly and alternatively, it may suggest
that the inherent noise of time-series patches may mean that the primary captured signal in any given
patch is the mean. Thus, PatchTST may be approximating a simple patch mean-embedding strategy,
not vice-versa.

5.2 PatchTST - CNN performance


Another interesting insight gathered from this embedding strategy exploration was the success of our
Patch CNN strategy. This strategy involves first patching up our time-step sequence, then passing a
CNN over these patches to achieve the desired embedding dimension for a transformer. Interestingly,
this approach outperformed the baseline PatchTST on most datasets evaluated with regards to both
MAE and MSE. This superior performance could be attributed to a couple of factors. Firstly, the Patch
CNN strategy might be better at capturing local dependencies and intricate patterns within the data
due to its convolutional nature, which is ideal for spatial and temporal feature extraction. CNNs are
particularly effective at capturing local dependencies because the CNN kernel processes a limited
number of inputs at once, allowing for a focused and detailed analysis of localized patterns. Secondly,
the combination of CNNs for feature extraction followed by transformers for sequence modeling
might offer a more robust and adaptable architecture. Particularly, we leverage the strengths of
both strategies: CNNs for feature extraction and transformers for long-range dependencies.

5.3 Limitations
Despite these insights, there are a few experimental limitations which may negatively affect the results
in this paper. The first such limitation is the lack of statistical significance testing. All experi-
ments were conducted in Google Colab with the following public repository: https://github.com/
gaasher/embedding_thesis, and training curves and additional information are stored in the fol-
lowing Weights & Biases project: https://wandb.ai/gaasher/timeseries_embeds. However,
due to the nature of training in Colab, model checkpoints were not saved for each implementation.
This makes statistical testing not possible, unless all models were re-trained, which would cost too
much time and money. A similar limitation to this study is a lack of ablations. For all datasets
aside from Illness, we exclusively trained and ran experiments with number of seen time-steps n = 336
and forecasted steps T = 96. Furthermore, we don't change the depth and number of heads in our
transformer model. We also don't conduct any sort of sweep or grid search for hyperparameters
such as model dimension, learning rate, and EMA/Neighbor-Residual related parameters. We be-
lieve that with proper ablations, significant improvements in model performance can be achieved.
Finally, we only use a select few datasets which are not necessarily representative of the vast array
of long-sequence time-series data domains. Thus, our findings may not generalize to new time-series
domains.

6 Conclusion
This paper reveals several insights and potential improvements to token embedding strategies for
time-series transformers. As far as improvements go, we show that a CNN-based patch embedding
outperforms PatchTST's linear layer strategy. Furthermore, we show that other alternate methods
such as applying an EMA to tokens and adding neighbor residuals to tokens could also yield marginal
improvements over base PatchTST. We believe that PatchTST - CNN may yield superior results
due to the better local feature extraction abilities of CNNs, which especially improve performance
when combined with transformers' strengths with long-range dependencies. This paper additionally
builds on work which shows the difficulties of transformers in time-series forecasting. We show that
for several datasets, using patch aggregate measures, especially the mean, performs similarly to or even
outperforms PatchTST, which has access to all time-step values. We argue that this shows an inher-
ent limitation to PatchTST's ability to aggregate information. Finally, this paper builds on general
understanding and insights with regards to transformer embedding strategies. Through experiment-
ing on several novel strategies with otherwise fixed hyperparameters, we build interpretability and
context for future works in the space.

6.1 Future works
We believe that there are several paths for future works based on this paper. Firstly, we believe
that there is still significant work that can be done in bridging long-range and local dependencies
at the patch level. Given the often dependent status of time-series data, and the arbitrary nature
of patch cutoffs, we believe that there is significant room to explore strategies which incorporate
long-range context into patch-level embeddings. We believe that one potential and under-explored
solution is to exploit methods similar to [21], where transformers use hierarchical feature maps.
These feature maps merge patches at different levels of the transformer encoder, which allows the
model to incorporate relationships between tokens at different scales. Another potential area of
work is in the discretization of patches and time-steps. As [25] explore in their paper, binning time-series
intensities and discretizing them into a vocabulary for a GPT-style decoder-only approach shows
promising results. Similarly, [26] uses a decoder-only approach, where the model is trained autoregressively,
learning on a next-patch prediction task. These decoder-only approaches could be particularly
useful for time-series. Currently, we forecast T time-step predictions in one forward pass. However,
in an autoregressive paradigm, since the model would produce one output at a time, it would
be able to condition outputs on all previous values, even predictions that it has already made (a
minimal sketch of such a rollout follows below). Finally, we believe that given the performance
improvements shown in [25, 26], pre-training is still an under-studied subject in time-series transformers.
However, further improvement in the domain-generalizability of embedding strategies will be needed
to fully crack the code on this task.
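As a sketch of the autoregressive rollout discussed above (this paradigm is not part of the thesis experiments, and the one-patch-ahead model interface is a hypothetical assumption), next-patch prediction could be rolled out as follows.

import torch

def autoregressive_forecast(model, context, n_patches):
    """Roll out a hypothetical next-patch model one patch at a time.

    model:   callable mapping a (batch, n_steps) context to the next (batch, patch_len) patch.
    context: (batch, n_steps) observed time steps.
    """
    history = context
    for _ in range(n_patches):
        next_patch = model(history)                            # predict one patch ahead
        history = torch.cat([history, next_patch], dim=-1)     # condition on own predictions
    return history[:, context.size(-1):]                       # keep only the forecasted steps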

7 Acknowledgements
I’d like to acknowledge Professor Sarah Preum, my thesis advisor, and Parker Seegmiller for their
support throughout the research process for this work.

References
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information
processing systems, 30, 2017.
[2] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Tim-
othée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open
and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[3] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena,
Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified
text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020.

[4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari-
wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models
are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[5] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly,
et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv
preprint arXiv:2010.11929, 2020.

[6] Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. Fedformer:
Frequency enhanced decomposed transformer for long-term series forecasting. In International
conference on machine learning, pages 27268–27286. PMLR, 2022.
[7] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai
Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In
Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 11106–11115,
2021.
[8] Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is
worth 64 words: Long-term forecasting with transformers. arXiv preprint arXiv:2211.14730,
2022.
[9] Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series
forecasting? In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages
11121–11128, 2023.
[10] Minghao Chen, Houwen Peng, Jianlong Fu, and Haibin Ling. Autoformer: Searching trans-
formers for visual recognition. In Proceedings of the IEEE/CVF international conference on
computer vision, pages 12270–12280, 2021.
[11] Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time series.
The handbook of brain theory and neural networks, 3361(10):1995, 1995.
[12] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation,
9(8):1735–1780, 1997.
[13] Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional
and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
[14] Pradeep Hewage, Ardhendu Behera, Marcello Trovati, Ella Pereira, Morteza Ghahremani,
Francesco Palmieri, and Yonghuai Liu. Temporal convolutional neural (tcn) network for an
effective weather forecasting using time-series data from the local weather station. Soft Com-
puting, 24:16453–16482, 2020.
[15] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of
deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,
2018.
[16] Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and
Han Hu. Simmim: A simple framework for masked image modeling. In Proceedings of the
IEEE/CVF conference on computer vision and pattern recognition, pages 9653–9663, 2022.
[17] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: Bert pre-training of image trans-
formers. arXiv preprint arXiv:2106.08254, 2021.
[18] Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung
Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv
preprint arXiv:2203.03605, 2022.
[19] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael
Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-
embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 15619–15629, 2023.

[20] Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido
Assran, and Nicolas Ballas. V-jepa: Latent video prediction for visual representation learning.
2023.
[21] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining
Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings
of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
[22] Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng
Yan. Enhancing the locality and breaking the memory bottleneck of transformer on time series
forecasting. Advances in neural information processing systems, 32, 2019.

[23] Taesung Kim, Jinhee Kim, Yunwon Tae, Cheonbok Park, Jang-Ho Choi, and Jaegul Choo.
Reversible instance normalization for accurate time-series forecasting against distribution shift.
In International Conference on Learning Representations, 2021.
[24] Si-An Chen, Chun-Liang Li, Nate Yoder, Sercan O Arik, and Tomas Pfister. Tsmixer: An
all-mlp architecture for time series forecasting. arXiv preprint arXiv:2303.06053, 2023.

[25] Abdul Fatir Ansari, Lorenzo Stella, Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin
Shen, Oleksandr Shchur, Syama Sundar Rangapuram, Sebastian Pineda Arango, Shubham
Kapoor, et al. Chronos: Learning the language of time series. arXiv preprint arXiv:2403.07815,
2024.

[26] Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. A decoder-only foundation model
for time-series forecasting. arXiv preprint arXiv:2310.10688, 2023.

