
Dartmouth College

Dartmouth Digital Commons

Computer Science Senior Theses Computer Science

Spring 2024

Exploring Tokenization Techniques to Optimize Patch-Based Time-Series Transformers
Gabriel L. Asher
gabriel.l.asher.24@dartmouth.edu

Follow this and additional works at: https://digitalcommons.dartmouth.edu/cs_senior_theses

Part of the Computer Sciences Commons

Recommended Citation
Asher, Gabriel L., "Exploring Tokenization Techniques to Optimize Patch-Based Time-Series Transformers"
(2024). Computer Science Senior Theses. 47.
https://digitalcommons.dartmouth.edu/cs_senior_theses/47

This Thesis (Undergraduate) is brought to you for free and open access by the Computer Science at Dartmouth
Digital Commons. It has been accepted for inclusion in Computer Science Senior Theses by an authorized
administrator of Dartmouth Digital Commons. For more information, please contact
dartmouthdigitalcommons@groups.dartmouth.edu.
Exploring Tokenization Techniques to Optimize Patch-Based
Time-Series Transformers
Gabriel Asher
Spring 2024

A Thesis submitted to the Faculty in partial fulfillment of the requirements for the
degree of Bachelor of Arts in Computer Science

Advised by Professor Sarah Preum

DARTMOUTH COLLEGE
Hanover, New Hampshire
Keywords: Time-Series Forecasting, Transformers, Tokenization, Deep Learning, Machine Learning

Abstract
Transformer architectures have revolutionized deep learning, impacting natural language
processing and computer vision. Recently, PatchTST has advanced long-sequence time-series
forecasting by embedding patches of time-steps to use as tokens for transformers. This study
examines and seeks to enhance PatchTST's embedding techniques. Using eight benchmark
datasets, we explore novel token embedding techniques. To this end, we introduce
several PatchTST variants, which alter the embedding methods of the original paper. These
variants consist of the following architectural changes: using CNNs to embed inputs to tokens,
embedding an aggregate measure like the mean, max, or sum of a patch, adding the exponential
moving average (EMA) of prior tokens to any given token, and adding a residual between neigh-
boring tokens. Our findings show that CNN-based patch embeddings outperform PatchTST's
linear layer strategy, and that simple aggregate measures, particularly embedding just the mean of
a patch, provide comparable results to PatchTST for some datasets. These insights highlight
the potential for optimizing time-series transformers through improved embedding strategies.
Additionally, they point to PatchTST's inefficiency at exploiting all of the information available in a
patch during the token embedding process.

1 Introduction
Since their introduction in 2017, transformer architectures have revolutionized the field of deep
learning [1]. In natural language processing, the transformer led to Large Language Models (LLMs)
like the GPT series, the Llama series, T5, and others [2, 3, 4]. In computer vision, transformers
have had a similar impact through the development of vision transformers (ViTs), which take
patches of images and use these as their tokens [5]. In long-sequence time-series signal forecasting,
although there is a history of transformer-based approaches having success [6, ?, 7], only with the
recent development of PatchTST [8] have transformers consistently outperformed non-transformer
methods [9]. Long-sequence time-series forecasting refers to time-series forecasting approaches in
settings like weather, traffic, electricity usage, etc., which contain long sequences of time-steps
across many features and data collection frequencies. PatchTST, similarly to a ViT,
patches a time-series signal, embeds these patches, then treats these embedded patches as tokens
to a transformer model which outputs a series of time-series forecasts. Naturally, given that trans-
former backbones are input-datatype independent, PatchTST's improvements in performance come
first and foremost from improvements in tokenization and embedding strategies. This observation
leads to the guiding question behind this study: how can we develop better embedding techniques
to transform time-series inputs into readable tokens for a transformer?
Through this exploration, we take methods similar to those in PatchTST and expand on them.
Using the same 8 datasets as prior works [8, 10], we explore different methods of aggregating patches,
direct tokenization of time-step values, architectural differences in tokenizing patches, and applying
EMA and residuals to tokens. We explore several embedding methods which are not covered in the
original paper. These methods are as follows: treating each time-series step as a token, embedding the
maximum of each patch, embedding the sum of each patch, embedding the mean of each patch,
using a Convolutional Neural Network (CNN) [11] to embed patches, applying an exponential mov-
ing average (EMA) of tokens prior to passing into the transformer backbone, and applying a residual
of neighboring tokens prior to passing into the transformer backbone. These additional explorations
serve to improve the general understanding of time-series transformers and can help maximize the
performance of these models.
This paper makes the following contributions to the study of deep learning in time-series, and
PatchTST variants in particular:
• We explore a number of novel embedding strategies in time-series, and show that one of
these, a CNN-based patch embedding method, outperforms PatchTST's linear layer strategy.
Additionally, we suggest a hypothesis as to why this strategy may be more effective.

• We show that embedding and learning based on aggregate measures of a patch - particularly
the mean - offers comparable results to using all time-steps for some datasets. This conclusion
supports three insights: transformers have difficulties learning on time series, PatchTST does
not embed information efficiently, and the mean of patches alone may capture sufficient
information for time-series forecasting.

2 Related Work
2.0.1 Non-Transformer Based Time-Series Forecasting
Time-series forecasting has had a long history. Early approaches involve RNNs, such as LSTMs [12].
This method involves passing an input through a series of learnable gates to reach a final output
prediction. Other methods use CNNs to extract features which are used to predict output sequences
[13]. These methods have been used successfully in time-series forecasting [14]; however, they face issues
with long-range dependencies since information extraction is localized by the kernel. Finally, there
have been more recent non-transformer based time-series forecasters which have had success. Most
notable among these is DLinear [9], which separates a signal into a global trend, extracted using
a moving average kernel, and the remainder of said signal. It passes each of these two separated signals
into its own learnable linear layer, and then combines the results into its final forecasted predictions.
However, although this method has outperformed some transformer-based methods, it still falls short
of PatchTST, our main baseline.

2.0.2 ViTs and Patch-Based Transformers


One of the most difficult challenges in dealing with time-series data is its lack of discretization and
continuous-valued nature. This challenge becomes especially pronounced when using transformers
for any sort of task. In fields like NLP, transformer-based models are given embeddings generated
from a fixed set of tokens. This strategy of using a discrete set of embeddings has had success in that
domain [4, 15]. However, there is no such strategy in domains that have more continuous values like
computer vision (CV) and time-series. Given the considerably larger amount of research conducted
on applying transformers to CV, a significant body of literature exists on how to effectively generate
embeddings from continuous-valued data. The dominant paradigm in CV, which PatchTST applies,
is a concept called "patching". First proposed in Vision Transformers (ViTs) [5], patching consists of
subdividing inputs into parts or "patches", then passing each of these patches through a featurizer
(often a learnable linear layer) to generate tokens for the subsequent transformer. This strategy has
many variations and has been instrumental to many popular CV architectures [16, 17, 18, 19, 20, 21].
PatchTST builds on ideas introduced by ViTs, and applies patch-based embedding strategies to time-
series data. However, despite their similarities, there are some unique characteristics of time-series
which separate it from CV. Notably, while images in CV may be multi-channelled, patches in
time-series are 1D instead of 2D, which diminishes the amount of spatial information available.

2.0.3 Transformer-based Time-Series


Prior to the patch-based time-series transformer coined by PatchTST [8], which we explore further
in this work, several other transformer-based time-series forecasting models existed. Most of these
methods seek to reduce the O(n²) time complexity of the attention mechanism, which poses a
challenge to the use of transformers in long-sequence time-series data, since time-step sequences are
often long. LogTrans [22] reduces the time complexity to O(n(log n)²) by using CNNs to generate the
transformer's key and query values. Informer [7] selects only the most relevant keys and also employs
distillation to reduce the dimensionality of inputs to subsequent attention layers. FEDformer [6]
uses an FFT, low-rank approximation, and sparse attention to further reduce time complexity to a
linear level. Autoformer [10] does not try to decrease the time complexity of the attention mechanism,
but it uses a similar strategy to [9] in decomposing inputs into trend and seasonal components, using
the sum of these for forecasts.
Although we recognize the existence of these transformer-based baselines, we choose not to
include them in the set of baselines for this paper given that [8] shows that PatchTST, our primary
baseline, outperforms these methods on all of the datasets we use.

3 Methods
3.1 Problem Formulation
The base problem of time-series forecasting consists of the following. Given a series of time steps
x_{1,...,n}, how accurately can we forecast time steps x_{n+1,...,n+T}, where n is the number of time-steps
which are seen by the model, and T is the length of the sequence of time-steps that the model
is predicting? To this end, we use a transformer-based deep-learning model to forecast future
time-steps.
A transformer-based model employs layers that utilize self-attention mechanisms to selectively
prioritize certain input tokens over others, effectively learning to assign varying degrees of importance
to different parts of the input data. This is achieved through the concepts of keys, queries, and values,
which are critical components of the attention mechanism. Each of these elements is represented
by matrices that are learned during training.
In the context of self-attention, the input data is transformed into three matrices:

• Queries (Q): Represent the elements for which the model seeks relevant information.
• Keys (K): Represent all possible elements that can be attended to (i.e., the information that
can be retrieved).
• Values (V): Represent the actual information that is retrieved.
The self-attention mechanism can be mathematically described by the following equation:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

where QK^T is the dot product of the queries and keys, which measures the alignment or similarity
between these elements. This product captures the relationship between any two tokens fed to
the transformer. The result is then scaled by sqrt(d_k), where d_k is the dimension of the keys and
queries, to help stabilize the gradients during training. The softmax function is applied to convert these
scores into probabilities that sum to one, representing the weights assigned to each value. Finally, these
probabilities, which represent the relationship between elements, are multiplied by the values
matrix V to return a final embedding. Each of the three key matrices in a transformer layer has a
corresponding learnable weight matrix (W^Q, W^K, W^V), where W^Q and W^K have the shape (d, d_k)
and W^V has the shape (d, d_v). Here, d is the embedding dimension, d_v is the value dimension,
and d_k is the key and query dimension. The final product is an embedding of size (d, d_v), which, if
d ≠ d_v, is reshaped to the embedding dimension d.
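To make the mechanics above concrete, the following minimal PyTorch sketch implements scaled dot-product attention exactly as written in the equation; the tensor shapes and batched layout are illustrative assumptions rather than details taken from the thesis code.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (batch, n_tokens, d_k); V: (batch, n_tokens, d_v).
    d_k = Q.size(-1)
    # Pairwise similarity between tokens, scaled by sqrt(d_k) to stabilize gradients.
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    # Softmax over the key axis yields weights that sum to one for each query.
    weights = F.softmax(scores, dim=-1)
    # Weighted combination of the value vectors.
    return weights @ V

# Example: 10 tokens with d_k = d_v = 64.
q = k = v = torch.randn(1, 10, 64)
out = scaled_dot_product_attention(q, k, v)  # shape (1, 10, 64)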
The transformer utilized in this paper follows the generalized encoder/decoder structure which is
outlined in Figure 1. In this transformer, time steps x_{(1,...,n),1} are fed through some sort of tokenizer
to create tokens t_{(1,...,n),d}, where d is the model dimensionality. This tokenization strategy is crucial
to the performance of a transformer, since the tokens used are the base numerical representations
of the data, and information-rich tokens can subsequently be leveraged by the transformer. In the
case of this paper, we prepend a special token, CLS_{1,d}, to t_{(1,...,n),d}. After these tokens are created,
we then add a learnable positional encoding to them. They are subsequently fed into a
basic transformer encoder. For the case of this study, we utilize a fixed transformer of depth
2 with 4 attention heads. We then use the special token embedding, CLS_{1,d}, as an input to a
decoder which outputs forecasted time steps x_{(n+1,...,n+T),1} using a single fully connected linear
layer. Finally, for normalization and re-normalization of data, we use a RevIN layer [23], which uses
learnable parameters in addition to traditional statistical values (mean, variance) to minimize the
effects of distribution shift in normalization.

[Figure 1 diagram: input time steps x_1, ..., x_n flow through the tokenization stage, a transformer
encoder, and a decoder to produce forecasts x_{n+1}, ..., x_{n+T}.]
Figure 1: Transformer architecture used in this paper. First, we tokenize input time-series
steps into embeddings. Next, these embeddings are passed through a transformer encoder. The
transformer encoder then generates a data representation, which we then feed into a simple decoder
to generate forecasted time-series steps.

This study in particular aims to examine the effects of different tokenization methods on downstream
performance. To this end, we keep all other parts of the model as unchanged as possible.

3.2 Data Preparation


For this study, we utilize the same eight datasets as previous works [8, 10, 24], ensuring compara-
bility and continuity. Table 1 outlines each dataset, their sizes, and their number of features. Each
dataset comprises multivariate time series, x_{(1,...,n),l}, where l is the number of features in a given
step. We use the dataset splits provided by [8, 10], which are 70%, 10%, 20% train, validation, test
splits respectively. Data is accessible from the Autoformer github.1 Like [8], despite the multivariate
forecasting framework, we treat all features in a given time-step as independent. Thus, our method
is multi-variate with channel (feature) independence. These datasets, which are widely used as
benchmarks in other works, comprise various tasks, time intervals, and modalities. Illness contains
weekly data of influenza-like illnesses from the US CDC from 2002-2021. Electricity contains hourly
electricity consumption by 321 customers. Traffic is hourly data which measures road occupancy
rates. Weather is data recorded every 10 minutes, measuring 21 different weather indicators (tem-
perature, humidity, etc.). Finally, ETTm1 and ETTm2 collect data from electricity transformers
on a 15 minute basis, while ETTh1 and ETTh2 collect data from electricity transformers on a 1 hour
basis.
1 https://github.com/thuml/Autoformer/

Table 1: Datasets used, size of datasets, and number of features per dataset.

Split Illness Electricity Traffic Weather ETTh1 ETTh2 ETTm1 ETTm2


Train 557 17981 11849 36456 11763 11763 48345 48345
Val 74 2537 1661 5175 1647 1647 6873 6873
Test 170 5165 3413 10444 3389 3389 13841 13841
Features 7 321 862 21 7 7 7 7
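For reference, the chronological 70/10/20 split described above can be reproduced with a few lines; this is a generic sketch using stand-in data, not the loaders from the thesis repository, and the counts in Table 1 may additionally reflect windowing, so a naive split need not match them exactly.

import numpy as np

def chronological_split(series, train_frac=0.7, val_frac=0.1):
    """Split a (n_steps, n_features) array into train/val/test without shuffling."""
    n = len(series)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return series[:train_end], series[train_end:val_end], series[val_end:]

# Stand-in multivariate series with 21 features, as in the Weather dataset.
data = np.random.randn(50_000, 21)
train, val, test = chronological_split(data)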

3.3 Time-Series Tokenization


We investigate multiple tokenization techniques to improve the transformer model's understanding
of time-series data and its ability to forecast on the modality. The tokenization methods are as follows,
and are additionally outlined in Figure 2; a minimal code sketch of several of these variants follows the list.

• Direct Tokenization: Each time step is treated as a separate token. Thus, we use
a linear layer to transform time steps from shape x_{(1,...,n),1} to t_{(1,...,n),d}.
• Patch-Based Embedding: Most of our tokenization strategies utilize some sort of patch-
based embedding technique like [8]. We list our various patch-based methods below.

– PatchTST. Given time-steps x_{(1,...,n),1}, we split the sequence into patches of length P, such
that P < n and n div P = N, the number of patches. Then, we pass each patch p_{(1,...,patch_len),1}
through a linear layer to generate a token t_{(1,d)}. We are left with a final set of tokens T_{(N,d)}.
– Max/Mean/Sum: we do the same as PatchTST, except instead of passing the whole patch
into the learnable linear layer, we just pass the max/mean/sum of the patch. Thus, for
each given patch p_{(1,...,patch_len),1}, we use an aggregation to turn it into p_{(1,1)}, before
passing it through a linear layer to generate a token t_{(1,d)}. We are left with a final set
of tokens T_{(N,d)}.
– CNN: For a given patch p_{(1,...,patch_len),1}, we pass this patch through two 1D CNN layers to
generate p_{(patch_len,cnn_dim)}, where patch_len × cnn_dim = d, and d is the embedding
dimension. Then, for each patch we flatten the patch into its final token, yielding the
set of tokens T_{(N,d)}.
• Exponential Moving Average (EMA): We embed each token using the PatchTST em-
bedding strategy, generating a set of tokens T_{(N,d)}. Then, for any given token t_n, we add the
EMA of tokens t_{(1,...,n-1)} to this token. We explore this tokenization strategy because steps
in a time-series sequence may not be independent; it allows us to capture this dependence and
condition tokens on prior context.
• Residuals: Here, we embed each token using the PatchTST embedding strategy, generating a
set of tokens T_{(N,d)}. Then, for any given token t_n, we add a residual of its neighboring tokens
t_{n-1}, t_{n+1}. Similarly to EMA, in this instance we operate under the assumption that steps in
a time-series sequence may not be independent. We believe that looking at neighboring tokens
may help inform dependencies hidden in the data.
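To make these variants concrete, the sketch below shows one possible implementation of the aggregate, CNN, EMA, and neighbor-residual tokenizers described above. The kernel sizes, the EMA decay, and the handling of edge tokens are illustrative assumptions; the 0.1 residual weight mirrors Figure 2, and the thesis repository should be treated as the reference implementation.

import torch
import torch.nn as nn

def make_patches(x, patch_len=8):
    # x: (batch, n_steps) -> (batch, n_patches, patch_len), non-overlapping patches.
    return x.unfold(-1, patch_len, patch_len)

class AggregateEmbed(nn.Module):
    """Mean/Max/Sum variant: embed a single aggregate value per patch."""
    def __init__(self, d_model=256, mode="mean"):
        super().__init__()
        self.mode = mode
        self.proj = nn.Linear(1, d_model)

    def forward(self, patches):
        if self.mode == "mean":
            agg = patches.mean(-1, keepdim=True)
        elif self.mode == "max":
            agg = patches.max(-1, keepdim=True).values
        else:
            agg = patches.sum(-1, keepdim=True)
        return self.proj(agg)                                # (batch, n_patches, d_model)

class CNNEmbed(nn.Module):
    """CNN variant: two 1D convolutions over each patch, then flatten to one token."""
    def __init__(self, patch_len=8, d_model=256):
        super().__init__()
        cnn_dim = d_model // patch_len                       # so patch_len * cnn_dim == d_model
        self.net = nn.Sequential(
            nn.Conv1d(1, cnn_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(cnn_dim, cnn_dim, kernel_size=3, padding=1),
        )

    def forward(self, patches):
        b, n, p = patches.shape
        feats = self.net(patches.reshape(b * n, 1, p))       # (b*n, cnn_dim, patch_len)
        return feats.reshape(b, n, -1)                       # flatten -> (batch, n_patches, d_model)

def add_ema(tokens, decay=0.9):
    """EMA variant: add the EMA of all earlier tokens to each token."""
    out, ema = [tokens[:, 0]], tokens[:, 0]
    for i in range(1, tokens.size(1)):
        out.append(tokens[:, i] + ema)                       # condition on prior context
        ema = decay * ema + (1 - decay) * tokens[:, i]
    return torch.stack(out, dim=1)

def add_neighbor_residual(tokens, w=0.1):
    """Neighbor-residual variant: add weighted residuals of adjacent tokens (zeros at the edges)."""
    zeros = torch.zeros_like(tokens[:, :1])
    left = torch.cat([zeros, tokens[:, :-1]], dim=1)         # token i-1
    right = torch.cat([tokens[:, 1:], zeros], dim=1)         # token i+1
    return tokens + w * left + w * right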

[Figure 2 diagram, panels (a) and (b): panel (a) shows the PatchTST, No Patch, and Mean/Max/Sum
pipelines, each feeding a linear layer over patches of 8 time-steps (or single steps, or their aggregates)
into the transformer encoder; panel (b) shows PatchTST - CNN (two 1D CNNs followed by a flatten),
PatchTST - EMA (Token_j + EMA(Token_1, ..., Token_{j-1})), and PatchTST - Neighbor Residual
(Token_i + 0.1·Token_{i-1} + 0.1·Token_{i+1}).]

Figure 2: Depiction of embedding strategies analysed. a) Represents PatchTST, NoPatch and Patch
Mean/Max/Sum strategies. PatchTST is the strategy conceived of in [8], No Patch treats every time-
step as a token, Mean/Max/Sum takes the aggregate value of a patch and only embeds this. b)
Represents more alternative tokenization strategies. PatchTST - CNN uses CNN layers to extract
features from any given patch. PatchTST - EMA adds the EMA of prior tokens to any given token.
PatchTST - Neighbor Residual adds a residual of neighboring tokens to each token.

3.4 Training
All models were trained with depth 2, 4 heads, a hidden dimension of 256, a learning rate of 1e-4,
and a patch length of 8 for all models and datasets. Our batch size and number of epochs varied per
dataset. We train for between 5 and 100 epochs (for Electricity and Illness respectively), and our batch
size ranges from 4 to 128. We use MSE as our loss function, and we evaluate the test set on the top
performing checkpoint on our validation set. Furthermore, for all of the datasets but Illness, our
model takes in a sequence of 336 time steps (equivalent to 42 patches) and forecasts the subsequent 96
time steps. For the Illness dataset, since it is much smaller, our model takes in 96 time steps, and forecasts
the next 24 time steps. Finally, we use Phil Wang's x-transformers transformer implementation 2
and all of our code is available here: https://github.com/gaasher/embedding_thesis. We use
pytorch-lightning for our training setup, and Weights & Biases for logging and checkpointing.
For our evaluation metrics, similarly to [8], we use mean-squared error (MSE) and mean-absolute
error (MAE) as our reported metrics.
2 https://github.com/lucidrains/x-transformers
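The shared hyperparameters above can be collected into a small configuration, and the two reported metrics are straightforward to compute; the field names below are illustrative rather than taken from the repository.

import torch

# Illustrative configuration mirroring Section 3.4 (field names are assumptions).
config = {
    "depth": 2,              # transformer encoder layers
    "n_heads": 4,            # attention heads
    "d_model": 256,          # hidden dimension
    "learning_rate": 1e-4,
    "patch_len": 8,
    "seq_len": 336,          # 96 for Illness
    "pred_len": 96,          # 24 for Illness
    "loss": "mse",
}

def mse(pred, target):
    return torch.mean((pred - target) ** 2)

def mae(pred, target):
    return torch.mean(torch.abs(pred - target))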

4 Results

Table 2: Results of models with different embedding strategies on datasets (MAE and MSE values)

Model Metric Illness Electricity Traffic Weather ETTh1 ETTh2 ETTm1 ETTm2
Base PatchTST MAE 0.8604 0.2274 0.2747 0.2049 0.4602 0.3039 0.3813 0.2279
MSE 1.869 0.1324 0.3902 0.1529 0.438 0.1934 0.3318 0.115
No Patch MAE 0.7995 0.2355 0.2903 0.205 0.4594 0.2961 0.3993 0.2317
MSE 1.703 0.1372 0.4037 0.1556 0.4384 0.1832 0.356 0.1214
Patch Max MAE 0.9247 0.3765 0.3616 0.2053 0.5421 0.3 0.3998 0.2359
MSE 1.959 0.2816 0.5097 0.1571 0.6022 0.1883 0.3508 0.1232
Patch Mean MAE 0.8647 0.3001 0.3395 0.2022 0.4968 0.3065 0.3838 0.2359
MSE 1.905 0.1978 0.497 0.1542 0.4916 0.1892 0.333 0.1243
Patch Sum MAE 0.9052 0.3067 0.3509 0.206 0.5121 0.3175 0.4005 0.2414
MSE 1.913 0.2039 0.5205 0.1559 0.5113 0.2067 0.357 0.1273
PatchTST EMA MAE 0.825 0.2274 0.2747 0.2046 0.4605 0.3009 0.3815 0.2297
MSE 1.869 0.1324 0.3902 0.1529 0.4354 0.1867 0.3292 0.1178
PatchTST - Neighbor Residual MAE 0.8624 0.2274 0.2745 0.203 0.4591 0.2977 0.3832 0.2295
MSE 1.874 0.1327 0.391 0.1528 0.4336 0.1842 0.3312 0.1177
PatchTST - CNN MAE 0.8364 0.2279 0.2738 0.2035 0.4512 0.2807 0.3779 0.2248
MSE 1.789 0.132 0.3848 0.1544 0.4177 0.1707 0.3254 0.1127

Table 2 reports MAE and MSE results for each respective embedding strategy and dataset.
Bold results represent the top performing model for any given metric and dataset. PatchTST - CNN
outperforms all other methods on most datasets. For all ETT datasets, PatchTST - CNN outperforms
alternative methods for both MAE and MSE. This improvement is especially pronounced on
ETTh1 and ETTh2, where PatchTST - CNN outperforms PatchTST in MSE by 0.0203 and 0.0227
respectively. After PatchTST - CNN, performance on these datasets is comparable across the board,
with PatchTST - Neighbor Residual, Max, No Patch, and PatchTST - EMA all outperforming Base
PatchTST on certain ETT datasets. Performance varies more on Illness, Electricity, Traffic, and
Weather. However, even in these datasets, PatchTST - CNN outperforms Base PatchTST in all
but MSE for Weather and MAE for Electricity. PatchTST - CNN has especially large performance
improvements for the Illness dataset. This may be because the Illness dataset has fewer tokens
than the others, and PatchTST - CNN may capture more information across any given patch.
Additionally, No Patch significantly outperforms other methods on the Illness dataset. This is
likely because the transformer model benefits from processing each time step individually, capturing
temporal dependencies more effectively without the need for aggregation within patches. This
result suggests that, for certain datasets, simplifying the tokenization process by avoiding patching
altogether can lead to better performance. Finally, certain datasets seem to be harder to predict
than others. Notably, the Illness dataset has higher MAE and MSE values on aggregate than any
of the other datasets. This may occur because the data collection frequency is weekly, and thus the
model has limited signal to go off of.

5 Discussion
5.1 PatchTST - Mean performance
This exploration of embedding strategies on time-series data reveals a few key insights on the work-
ings and effectiveness of PatchTST and its methods. Primarily, we found that although PatchTST
does outperform simpler methods which embed the mean, max, or sum of a patch, outside of
the Electricity and Traffic datasets the differences in performance are not as large as one
might think. In particular, we found that using the mean of a patch, and embedding this value
alone, performs similarly well on most time-series tasks evaluated, and even outperforms the base
PatchTST on MAE for the large Weather dataset. This has a few potentially insightful implications.
Firstly, it could indicate that transformer architectures just have difficulties with modelling time-
series data. Chen et al. [24] show that an MLP-only approach achieves similar performance to
PatchTST and also that simple linear modelling outperforms many other transformer-based time-
series forecasting models. Thus, the results which we present add to the existing literature which
suggests challenges with time-series forecasting using transformers.
Secondly, this under-performance of PatchTST suggests that it does not embed information
efficiently and effectively. Given our patch size of 8, and the fact that PatchTST feeds all 8 of
these values into a linear layer to generate an embedding, one would expect the performance to be
significantly improved over only using one of these values or some aggregate of them. Thus, this
study has exhibited limited efficiency in the current PatchTST method, and has highlighted
that future work centering on information-rich tokens is needed.
Thirdly, the under-performance of PatchTST, and the near equivalent performance of embedding
the mean alone, may be quite literally a regression to the mean. This may suggest two things. Firstly,
given the localized nature of our patching strategy, taking the localized mean of time-step windows
may be an effective strategy for capturing macro-trends. Secondly and alternatively, it may suggest
that the inherent noise of time-series patches may mean that the primary captured signal in any given
patch is the mean. Thus, PatchTST may be approximating a simple patch mean-embedding strategy,
not vice-versa.

5.2 PatchTST - CNN performance


Another interesting insight gathered from this embedding strategy exploration was the success of our
Patch CNN strategy. This strategy involves first patching up our time-step sequence, then passing a
CNN over these patches to achieve the desired embedding dimension for a transformer. Interestingly,
this approach outperformed the baseline PatchTST on most datasets evaluated with regards to both
MAE and MSE. This superior performance could be attributed to a couple of factors. Firstly, the Patch
CNN strategy might be better at capturing local dependencies and intricate patterns within the data
due to its convolutional nature, which is ideal for spatial and temporal feature extraction. CNNs are
particularly effective at capturing local dependencies because the CNN kernel processes a limited
number of inputs at once, allowing for a focused and detailed analysis of localized patterns. Secondly,
the combination of CNNs for feature extraction followed by transformers for sequence modeling
might offer a more robust and adaptable architecture. Particularly, we leverage the strengths of
both strategies: CNNs for feature extraction and transformers for long-range dependencies.

5.3 Limitations
Despite these insights, there are a few experimental limitations which may negatively affect the results
in this paper. The first such limitation is the lack of statistical significance testing. All experi-
ments were conducted in Google Colab with the following public repository: https://github.com/
gaasher/embedding_thesis, and training curves and additional information are stored in the fol-
lowing Weights & Biases project: https://wandb.ai/gaasher/timeseries_embeds. However,
due to the nature of training in Colab, model checkpoints were not saved for each implementation.
This makes statistical testing not possible, unless all models were re-trained, which would cost too
much time and money. A similar limitation to this study is a lack of ablations. For all datasets
aside from Illness, we exclusively trained and ran experiments with number of seen time-steps n = 336
and forecasted steps T = 96. Furthermore, we don't change the depth and number of heads in our
transformer model. We also don't conduct any sort of sweep or grid search for hyperparameters
such as model dimension, learning rate, and EMA/Neighbor-Residual related parameters. We be-
lieve that with proper ablations, significant improvements in model performance can be achieved.
Finally, we only use a select few datasets which are not necessarily representative of the vast array
of long-sequence time-series data domains. Thus, our findings may not generalize to new time-series
domains.

6 Conclusion
This paper reveals several insights and potential improvements to token embedding strategies for
time-series transformers. As far as improvements go, we show that a CNN-based patch embedding
outperforms PatchTST's linear layer strategy. Furthermore, we show that other alternate methods
such as applying an EMA to tokens and adding neighbor residuals to tokens could also yield marginal
improvements over base PatchTST. We believe that PatchTST - CNN may yield superior results
due to the better local feature extraction abilities of CNNs, which especially improve performance
when combined with transformers' strengths with long-range dependencies. This paper additionally
builds on work which shows the difficulties of transformers in time-series forecasting. We show that
for several datasets, using patch aggregate measures, especially the mean, performs similarly to or even
outperforms PatchTST, which has access to all time-step values. We argue that this shows an inher-
ent limitation to PatchTST's ability to aggregate information. Finally, this paper builds on general
understanding and insights with regards to transformer embedding strategies. Through experiment-
ing on several novel strategies with otherwise fixed hyperparameters, we build interpretability and
context for future works in the space.

6.1 Future works
We believe that there are several paths for future works based on this paper. Firstly, we believe
that there is still significant work that can be done in bridging long-range and local dependencies
at the patch level. Given the often dependent status of time-series data, and the arbitrary nature
of patch cutoffs, we believe that there is significant room to explore strategies which incorporate
long-range context into patch-level embeddings. We believe that one potential and under-explored
solution is to exploit methods similar to [21], where transformers use hierarchical feature maps.
These feature maps merge patches at different levels of the transformer encoder, which allows the
model to incorporate relationships between tokens at different scales. Another potential area of
work is in the discretization of patches and time-steps. As [25] explore in their paper, binning time-series
intensities and discretizing them into a vocabulary for a GPT-style decoder-only approach shows
promising results. Similarly, [26] uses a decoder-only approach, where the model is trained autoregressively,
learning on a next-patch prediction task. These decoder-only approaches could be particularly
useful for time-series. Currently, we forecast T time-step predictions in one forward pass. However,
in an autoregressive paradigm, since the model would produce one output at a time, it would
be able to condition outputs on all previous values, even predictions that it has already made (a
minimal sketch of such a rollout follows below). Finally, we believe that given the performance
improvements shown in [25, 26], pre-training is still an under-studied subject in time-series transformers.
However, further improvement in the domain-generalizability of embedding strategies will be needed
to fully crack the code on this task.
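As a sketch of the autoregressive rollout discussed above (this paradigm is not part of the thesis experiments, and the one-patch-ahead model interface is a hypothetical assumption), next-patch prediction could be rolled out as follows.

import torch

def autoregressive_forecast(model, context, n_patches):
    """Roll out a hypothetical next-patch model one patch at a time.

    model:   callable mapping a (batch, n_steps) context to the next (batch, patch_len) patch.
    context: (batch, n_steps) observed time steps.
    """
    history = context
    for _ in range(n_patches):
        next_patch = model(history)                            # predict one patch ahead
        history = torch.cat([history, next_patch], dim=-1)     # condition on own predictions
    return history[:, context.size(-1):]                       # keep only the forecasted steps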

7 Acknowledgements
I’d like to acknowledge Professor Sarah Preum, my thesis advisor, and Parker Seegmiller for their
support throughout the research process for this work.

References
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information
processing systems, 30, 2017.
[2] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Tim-
othée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open
and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[3] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena,
Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified
text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020.

[4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari-
wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models
are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[5] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly,
et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv
preprint arXiv:2010.11929, 2020.

[6] Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. Fedformer:
Frequency enhanced decomposed transformer for long-term series forecasting. In International
conference on machine learning, pages 27268–27286. PMLR, 2022.
[7] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai
Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In
Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 11106–11115,
2021.
[8] Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is
worth 64 words: Long-term forecasting with transformers. arXiv preprint arXiv:2211.14730,
2022.
[9] Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series
forecasting? In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages
11121–11128, 2023.
[10] Minghao Chen, Houwen Peng, Jianlong Fu, and Haibin Ling. Autoformer: Searching trans-
formers for visual recognition. In Proceedings of the IEEE/CVF international conference on
computer vision, pages 12270–12280, 2021.
[11] Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time series.
The handbook of brain theory and neural networks, 3361(10):1995, 1995.
[12] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation,
9(8):1735–1780, 1997.
[13] Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional
and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
[14] Pradeep Hewage, Ardhendu Behera, Marcello Trovati, Ella Pereira, Morteza Ghahremani,
Francesco Palmieri, and Yonghuai Liu. Temporal convolutional neural (tcn) network for an
effective weather forecasting using time-series data from the local weather station. Soft Com-
puting, 24:16453–16482, 2020.
[15] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of
deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,
2018.
[16] Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and
Han Hu. Simmim: A simple framework for masked image modeling. In Proceedings of the
IEEE/CVF conference on computer vision and pattern recognition, pages 9653–9663, 2022.
[17] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: Bert pre-training of image trans-
formers. arXiv preprint arXiv:2106.08254, 2021.
[18] Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung
Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv
preprint arXiv:2203.03605, 2022.
[19] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael
Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-
embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 15619–15629, 2023.

[20] Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido
Assran, and Nicolas Ballas. V-jepa: Latent video prediction for visual representation learning.
2023.
[21] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining
Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings
of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
[22] Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng
Yan. Enhancing the locality and breaking the memory bottleneck of transformer on time series
forecasting. Advances in neural information processing systems, 32, 2019.

[23] Taesung Kim, Jinhee Kim, Yunwon Tae, Cheonbok Park, Jang-Ho Choi, and Jaegul Choo.
Reversible instance normalization for accurate time-series forecasting against distribution shift.
In International Conference on Learning Representations, 2021.
[24] Si-An Chen, Chun-Liang Li, Nate Yoder, Sercan O Arik, and Tomas Pfister. Tsmixer: An
all-mlp architecture for time series forecasting. arXiv preprint arXiv:2303.06053, 2023.

[25] Abdul Fatir Ansari, Lorenzo Stella, Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin
Shen, Oleksandr Shchur, Syama Sundar Rangapuram, Sebastian Pineda Arango, Shubham
Kapoor, et al. Chronos: Learning the language of time series. arXiv preprint arXiv:2403.07815,
2024.

[26] Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. A decoder-only foundation model
for time-series forecasting. arXiv preprint arXiv:2310.10688, 2023.

