Review
A Historical Survey of Advances in Transformer Architectures
Ali Reza Sajun * , Imran Zualkernan and Donthi Sankalpa
Computer Science and Engineering Department, American University of Sharjah, Sharjah P.O. Box 26666,
United Arab Emirates; izualkernan@aus.edu (I.Z.); dsankalpa@aus.edu (D.S.)
* Correspondence: b00068908@aus.edu
Abstract: In recent times, transformer-based deep learning models have risen in prominence in the
field of machine learning for a variety of tasks such as computer vision and text generation. Given this
increased interest, a historical outlook at the development and rapid progression of transformer-based
models becomes imperative in order to gain an understanding of the rise of this key architecture.
This paper presents a survey of key works related to the early development and implementation
of transformer models in various domains such as generative deep learning and as backbones of
large language models. Previous works are classified based on their historical approaches, followed
by key works in the domain of text-based applications, image-based applications, and miscella-
neous applications. A quantitative and qualitative analysis of the various approaches is presented.
Additionally, recent directions of transformer-related research, such as those in the biomedical and time series domains, are discussed. Finally, future research opportunities, especially regarding the
multi-modality and optimization of the transformer training process, are identified.
Keywords: transformers; deep learning; generative deep learning; large language models; GPT;
computer vision
1. Introduction
Ever since the introduction of the transformer model in June 2017 by Vaswani et al. [1], the world of deep learning has seen a rapid adoption of the model in pushing the state of the art in a number of previously challenging tasks. Due to its prowess in sequence modeling and machine translation, the transformer architecture was initially widely implemented and indeed emerged as the predominant deep learning model for natural language processing (NLP) and generative deep-learning tasks [2]. Indeed, the introduction of transformers has been a key factor in the development of large language models such as GPT-3 and GPT-4, which are the basis of culturally significant tools such as ChatGPT [3]. Moreover, inspired by the revolutionary self-attention mechanism in transformers, the architecture has since been implemented in various application domains such as those of images, audio, and time series data [4]. Indeed, in recent times, transformers have been touted as a potential replacement for Convolutional Neural Networks (CNNs) for vision applications [5], with the introduction of the Vision Transformer (ViT) opening a new realm of architectures which build upon it. Considering the rapid increase in interest in the transformer architecture, it becomes pertinent to examine in detail the architecture of the transformer as well as its historical progression from being introduced as an alternative to RNN-like architectures for sequence-to-sequence mapping to being one of the most impactful architectures in the current realm of deep learning. Finally, it may be beneficial to examine the various prevalent transformer architectures applicable to the different data domains.
Prior to the introduction of transformers to the deep learning space, the established state of the art in sequence modeling had long been Long Short-Term Memory (LSTM) networks [6] and other forms of Recurrent Neural Networks (RNNs) [7]. These were especially prevalent for transduction problems such as language modeling and machine translation due to their recurrence, which allows for recent information to be accounted for in order to maintain
1.1. Transformers
In order to take a deeper look and investigate the success seen by the transformer model, it is imperative to examine, in detail, the architecture and workings of the solution proposed by Vaswani et al. [1]. Unlike previously proposed sequence transduction models like [11] and [12], transformers maintain the encoder–decoder structure, as seen in Figure 1, but discard the recurrence and convolution aspects. This is made possible thanks to the novel multi-head attention mechanism proposed in addition to the point-wise feedforward networks ingrained in the transformer model. Figure 1 shows the overall transformer architecture as proposed by Vaswani et al. [1]. The following sections describe the various blocks contributing to this architecture in further detail.
Figure 1. A depiction of transformer architecture.
Figure 2. The structure of the attention layer. Left: Scaled Dot-Product Attention. Right: a multi-head attention mechanism.
In the encoder version of this layer, the inputs consist of queries, keys, and values. The attention function is then applied to these vectors as seen in (1).

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \qquad (1)$$
where
- Q is the matrix of the queries;
- K is the matrix of the keys;
- V is the matrix of the values.
The equation is applied in a way that the dot product between the query and the key is first computed to form the score $S = Q \cdot K^{\top}$. These scores are important as they determine how much attention is given to other words when encoding words at the current position. These scores are then normalized in order to ensure the stability of the gradient and to enhance training, thereby giving the normalized score $S_n = S/\sqrt{d_k}$. The softmax function is then applied to the normalized scores in order to translate them into probabilities $P = \mathrm{softmax}(S_n)$. These probabilities can then be applied to the value matrix to obtain $Z = V \cdot P$. This would mean that vectors with larger probabilities would receive a greater focus from the consequent layers [5]. In transformers, a multi-head attention system is used wherein the original queries, keys, and values are projected into H different sets of learned projections. For each projection, the attention equation from (1) is applied to formulate the output. The output across the H projections is then concatenated to form the multi-head output. The formulation for this process can be found in (2).
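To make the computation concrete, the following is a minimal NumPy sketch of scaled dot-product attention and the multi-head scheme described above; the function names, toy dimensions, and random projection matrices are our own illustrative choices rather than details from [1].

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Equation (1): softmax(Q K^T / sqrt(d_k)) V for 2-D arrays of shape (seq_len, d)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # normalized scores S_n
    scores = scores - scores.max(axis=-1, keepdims=True)
    P = np.exp(scores)
    P = P / P.sum(axis=-1, keepdims=True)         # softmax turns scores into probabilities P
    return P @ V                                  # attention-weighted sum of the values

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Project X into n_heads learned (Q, K, V) sets, attend per head, concatenate, project."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v           # learned projections, each (seq_len, d_model)
    heads = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)   # slice out head h's subspace
        heads.append(scaled_dot_product_attention(Q[:, s], K[:, s], V[:, s]))
    return np.concatenate(heads, axis=-1) @ W_o   # concatenate the H outputs, final projection

# toy usage: 4 tokens, model width 8, 2 heads
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads=2).shape)  # -> (4, 8)
```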
This process improves upon the performance seen by a single attention layer as it allows the model to focus on multiple equally important words based on different criteria instead of simply attributing a single word per input. This allows for multiple
This residual connection boosts the flow of data by relaying the information forward
and therefore serves to enhance the model’s performance. The ‘+’ operator in this equation
refers to element-wise addition which helps combat the vanishing gradient problem. In the
context of the example discussed, these residual connections would make sure essential
characteristics of the word “bank” are not lost in the depth of the model’s layers.
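As a small illustration of the residual wrapper described above (a sketch under our own simplifications: the layer normalization omits its learned gain and bias, and the feedforward sublayer is a toy stand-in):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance (learned scale/shift omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    """Post-norm residual wrapper: LayerNorm(x + Sublayer(x)); the '+' is element-wise addition."""
    return layer_norm(x + sublayer(x))

# e.g. wrapping a toy position-wise feedforward sublayer
ffn = lambda x: np.maximum(0.0, x @ np.eye(x.shape[-1]))   # stand-in feedforward network
out = residual_sublayer(np.ones((4, 8)), ffn)
print(out.shape)  # (4, 8)
```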
Wherein pos is the position of a word within a sentence, d_model is the model dimension, and i is the current dimension of the positional encoding. Using this, each element of the positional encoding corresponds to a sinusoid, thereby allowing the transformer model to learn to pay attention based on relative positions as well, consequently allowing it to extrapolate to longer sequences. These encodings have indeed been a focal point of the consequent research aiming to optimize the learning process. Indeed, numerous works have proposed modifications such as a learning process for the encodings [17,18] or a relative form of position encoding [19].
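A minimal NumPy sketch of the sinusoidal positional encoding scheme described above, following the standard formulation from [1]; the function and variable names are ours, and an even model dimension is assumed:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: even dimensions use sin, odd dimensions use cos."""
    pos = np.arange(seq_len)[:, None]                    # word positions within the sequence
    i = np.arange(d_model // 2)[None, :]                 # encoding dimension index
    angles = pos / np.power(10000, 2 * i / d_model)      # pos / 10000^(2i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe                                            # added to the input embeddings

# e.g. encodings for a 10-token sequence with model width 16
print(sinusoidal_positional_encoding(10, 16).shape)      # (10, 16)
```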
Having discussed the importance and the workings of the transformer architecture, and given the rapid advances in the field of deep learning brought forth by this model, it is worthwhile to examine the historical progression since its introduction in 2017
leading up to transformers taking over many of the state-of-the-art techniques. While there
exist surveys on the various types of transformer architectures that have been proposed,
there seems to be a gap in the analysis from a historical viewpoint. Therefore, the rest
of the paper examines a historical perspective on the progression of notable transformer
architectures in addition to discussing the state-of-the-art techniques and architectures for
data of different types.
2. Survey Methodology
The search for sources for this work was done following the PRISMA checklist [20].
The following subsections illustrate the points focused on for the survey’s methodology.
2.2. Search
The following search terms were used through all the above-mentioned databases:
Transformers, State-of-The-Art Transformers, Key Transformer Architectures, Transformer
Deep Learning, Transformer Vision, Transformer NLP, BERT.
3. Survey Results
3.1. Early Transformer Implementations
3.1.1. Introductory Works
Since the introduction of the aforementioned transformer model in 2017, a vast array of works have aimed to build upon its novel architecture in order to optimize its performance for a variety of domains. Indeed, the work proposing the transformer model has been cited more than 90,500 times as of 29 September 2023, according to Google Scholar [21]. Among the thousands of consequential works, a few emerge as notable models which have consequently contributed to pushing the overall state-of-the-art techniques and have established themselves as standards in their fields. Figure 3 displays a timeline of these notable works arranged chronologically and coded according to the domain of implementation.
Figure 3. The timeline of the state-of-the-art transformer models [1,17,19,22–47].
In order to benchmark these works, a number of datasets have been utilized by the various works. A few of the commonly used datasets are BookCorpus [48], WMT 2014 [49], Wikipedia [50], C4 [22], ImageNet [51], and COCO [52].
An early work building upon the transformer model was that of Shaw et al. [19], which simply involved extending the self-attention mechanism of transformers to efficiently consider representations of the relative positions or distances between sequence elements. This is done by modeling the input as a labeled, fully connected graph with the edges between input elements $x_i$ and $x_j$ represented by vectors $a^V_{ij}, a^K_{ij} \in \mathbb{R}^{d_a}$. A modification is then made to the transformer equation wherein edge information is then propagated to the sublayer output as seen in Equation (6).

$$z_i = \sum_{j=1}^{n} \alpha_{ij}\left(x_j W^V + a^V_{ij}\right) \qquad (6)$$
Using these improved embeddings, the authors were able to report improvements in
both the EN-DE and EN-FR tasks over the vanilla transformer architecture.
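A hedged sketch of how Equation (6) changes the value aggregation follows, reusing the simplified unbatched setting from the earlier attention example. Only the value-side term a^V_ij is shown (the analogous key-side term is omitted), and the clipping distance and embedding table are illustrative choices rather than values from [19].

```python
import numpy as np

def relative_position_attention(Q, K, V, a_V, clip_dist):
    """Adds a relative-position vector a^V_ij to each attended value, as in Equation (6)."""
    seq_len, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=-1, keepdims=True)          # attention weights alpha_ij
    # relative distance j - i, clipped to [-clip_dist, clip_dist], indexes the embedding table
    rel = np.clip(np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None],
                  -clip_dist, clip_dist) + clip_dist
    a_ij = a_V[rel]                                      # (seq_len, seq_len, d) relative vectors
    # z_i = sum_j alpha_ij (x_j W^V + a^V_ij); V already holds x W^V here
    return np.einsum('ij,ijd->id', alpha, V[None, :, :] + a_ij)

# toy usage: 5 tokens, width 4, distances clipped at 2 (table holds 2*2 + 1 = 5 embeddings)
rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(5, 4)) for _ in range(3))
a_V = rng.normal(size=(5, 4))
print(relative_position_attention(Q, K, V, a_V, clip_dist=2).shape)  # (5, 4)
```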
Another early and majorly consequential work was that of Radford et al. [23], who proposed the famous Generative Pre-Training (GPT) model. The base model used for the work was the transformer architecture as it allowed the authors to capture long-range linguistic structures. The idea proposed by the authors was one where the model can perform more optimally for small amounts of labeled text data when it is generatively trained in an unsupervised manner on a large unlabeled text corpus consisting of diverse samples and then discriminatively fine-tuned on the specific task at hand. They do this by utilizing a multi-layer transformer-decoder [53] architecture which applies a multi-headed self-attention operation over the input context tokens followed by position-wise feedforward layers to produce an output distribution over the target tokens. These trained weights can then be used with an auxiliary objective for classification tasks. The architecture used by the model can be seen in Figure 4.
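As a small illustration of the decoder-style, masked self-attention applied over the context tokens in such models (our own sketch; the causal masking convention shown is the standard one and is assumed rather than taken verbatim from [23]):

```python
import numpy as np

def causal_self_attention(Q, K, V):
    """Masked self-attention: token i may only attend to positions j <= i."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)   # True above the diagonal
    scores = np.where(mask, -1e9, scores)                   # block attention to future tokens
    scores -= scores.max(axis=-1, keepdims=True)
    P = np.exp(scores)
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

rng = np.random.default_rng(3)
Q, K, V = (rng.normal(size=(6, 8)) for _ in range(3))
print(causal_self_attention(Q, K, V).shape)  # (6, 8)
```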
Figure 5. The pre-training and fine-tuning process of the BERT model.
Using this relatively simple conceptual approach, the BERT model was able to obtain state-of-the-art results on eleven natural language processing (NLP) tasks, thereby establishing it as a notable work which numerous consequent models have been built upon.
It was soon after that, in the beginning of 2019, that Radford et al. followed up their proposed GPT model with a model they called GPT-2, which followed a similar philosophy of multi-task learning which they based on a framework proposed by Caruana [53]. In their work, Radford et al. aimed to unify the two dominant approaches, namely, pre-training followed by supervised fine-tuning as well as a technique with unsupervised approaches towards specific tasks such as commonsense reasoning [55] and sentiment analysis [24]. They achieve this by performing language modeling where, in addition to conditioning a model on the input, it is also conditioned on the task. They train their model in an unsupervised manner on a dataset consisting of millions of web pages, called WebText, producing GPT-2, which is an enormous 1.5 billion parameter model which achieved state-of-the-art results on seven language modeling tasks in a zero-shot setting. The authors hypothesized that a large enough model would learn tasks embedded within language and would not require explicit, supervised training, which was proven by their results.
Meanwhile, Wang et al. [25], in 2019, proposed a direct improvement upon the transformer model itself by formulating a deep transformer model which they claimed would surpass the prevalent big transformer counterpart. They achieved this using a dual approach where, firstly, they implemented the proper use of layer normalization in addition
to introducing a novel way to pass the combinations of previous layers to the next ones.
Furthermore, they trained a 30-layer encoder, which they claim was the deepest at the time.
Using this approach, the authors were able to outperform the results of both the shallow
and the big transformers on the WMT’16 EN-DE, the NIST OpenMT’12 Chinese-English,
and the WMT’18 Chinese-English tasks.
Liu et al.’s proposed Robustly Optimized BERT Pre-training Approach (RoBERTa)
model [26] was introduced with the idea of improving the limitations of the BERT model
which were caused by significant undertraining. The authors achieved this by training the
model over a larger dataset, which consisted of CC-News and OpenWebText in addition to
the two datasets used to train the original BERT model, and training on longer sequences.
The performance was further improved by making the following changes on the original
model: dynamically changing the masking pattern that was applied to the training data and
removing the Next Sentence Prediction (NSP) objective. Unlike in the BERT model, where
the mask was generated only once during the data preprocessing stage, for the RoBERTa,
the authors generate a masking pattern every time a sequence is fed into the model.
The authors came to the conclusion that removing NSP matched or slightly improved
the downstream task performance after comparing the training of their model with and
without NSP. Throughout their experimentation, for a more accurate comparison, the
original optimization hyperparameters of the BERT model were initially maintained. The
model was able to achieve state-of-the-art results on GLUE [56], RACE [57] and the Stanford
Question-Answering Dataset (SQuAD) [58], which are notable NLP tasks.
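To illustrate the difference between static and dynamic masking, the following is a hedged Python sketch; the 15% masking rate and the [MASK] placeholder follow common BERT-style practice and are assumptions rather than exact details from [26].

```python
import random

MASK_TOKEN = "[MASK]"

def dynamically_mask(tokens, mask_prob=0.15, seed=None):
    """RoBERTa-style dynamic masking: a fresh mask is sampled every time a sequence is fed in."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)   # hide the token; the model must predict it
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)         # no prediction target at this position
    return masked, labels

sentence = "the transformer model was introduced in 2017".split()
# static masking would compute this once during preprocessing; dynamic masking re-samples per epoch
for epoch in range(2):
    print(dynamically_mask(sentence)[0])
```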
Another notable proposed modification of the transformer model is that outlined by
Sukhbaatar et al. [27], which suggests removing the feedforward layer from the transformer
architecture and solely using the attention layers. This is done by augmenting the attention
layers with persistent memory vectors which serve the same purpose as the feedforward
layers. They first show that a feedforward sublayer can be viewed as an attention layer. This argument can then be used to merge them into a single layer
which performs both functions by applying the attention mechanism simultaneously on the
sequence of input vectors, as in the attention layer, as well as a set of vectors not conditioned
on the input. Using this approach, they report outperforming models of similar sizes on
the enwik8 and WikiText-103 datasets.
An interesting work published in late 2019 that explored the NLP landscape is that
of Raffel et al.’s T5 model [22]; the researchers followed a transfer learning approach in
introducing a unified framework which converted all text-based language problems into
a text-to-text format. They experiment with a variety of pre-training objectives, architec-
tures, datasets and transfer approaches in addition to developing a new dataset they call
the Colossal Clean Crawled Corpus. Using this pre-training regime, they report having
achieved state-of-the-art results on a number of prevalent challenges in summarization,
question answering, and text classification.
Using this approach, the proposed model reduces the computation of the transformer base model by 2.5× with only a 0.3 BLEU score degradation. Furthermore, the authors report implementing pruning and quantization processes to compress the model size by 18.2×.
Figure 6. The Lite Transformer block.
Carion et al. [30] propose a ground-breaking object detecting transformer named DETR that views object detection as a direct set prediction problem. The main components of the model are a set-based loss that forces predictions via a bipartite matching and a transformer encoder and decoder. The overall architecture of the model is illustrated in the following Figure 7.
The CNN is used to extract a compact feature representation of the input image by
generating a low-resolution activation map. The transformer’s encoder and decoder follow
the model architecture of Vaswani et al. [1]. The decoder output encodings are decoded
into box coordinates and class labels by the feedforward network. The object detection set
prediction loss produces a bipartite matching between the predicted and the ground truth
objects and then optimizes the object-specific losses. This model is on par with the state-of-the-art Faster R-CNN baseline on the famous COCO object detection dataset. The Faster R-CNN was a model proposed by Ren et al., which used a Region Proposal Network to generate region proposals which were then used by a Fast R-CNN for detection [61].
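The set-based loss hinges on a one-to-one bipartite matching between predictions and ground-truth objects, typically solved with the Hungarian algorithm. The following is a hedged sketch of just that matching step, using a toy cost of negative class probability plus an L1 box distance; the exact cost terms and weights used in [30] differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_probs, pred_boxes, gt_labels, gt_boxes):
    """Find the one-to-one assignment of predictions to ground-truth objects with minimal cost."""
    # cost of assigning prediction i to ground truth j:
    # how unlikely the correct class is, plus how far the predicted box is (L1 distance)
    class_cost = -pred_probs[:, gt_labels]                          # (n_pred, n_gt)
    box_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    cost = class_cost + box_cost
    pred_idx, gt_idx = linear_sum_assignment(cost)                  # Hungarian matching
    return list(zip(pred_idx, gt_idx))

# toy example: 3 predictions (class probabilities over 4 classes + boxes), 2 ground-truth objects
rng = np.random.default_rng(4)
pred_probs = rng.dirichlet(np.ones(4), size=3)       # each row sums to 1
pred_boxes = rng.random((3, 4))                      # (cx, cy, w, h) in [0, 1]
gt_labels = np.array([1, 3])
gt_boxes = rng.random((2, 4))
print(match_predictions(pred_probs, pred_boxes, gt_labels, gt_boxes))
```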
Around mid-2020, Brown et al. [31] proposed a work which improved on the state-of-the-art NLP transformer model by proposing their improved GPT-3 model. The authors scale up the model by training it with 175 billion parameters, which results in a model which can perform a variety of tasks without requiring task-specific gradient updates or fine-tuning, unlike the previous generations of the model. The other variation from the architecture of GPT-2 is that of the use of alternating dense and locally banded sparse attention patterns in the layers of the transformer. The model is able to perform well and even achieve SOTA results on famous NLP dataset tasks with few-shot demonstrations which are specified purely via text interactions with the model.
Dosovitskiy et al. [32] introduced the Vision Transformer (ViT) in late 2020, which caused a shift in the research field. In order to adapt the transformer for image tasks, the authors applied a standard transformer to images by splitting an image into patches and providing the sequence of the linear embeddings of the patches as the input to the transformer. The overview of the ViT model can be seen in Figure 8.
Figure 8. An overview of the ViT model's architecture.
The image is first broken down into patches which are passed through a trainable linear projection resulting in a D-dimension latent vector, where D is the latent vector size used by the transformer in its layers. An additional embedding at position 0 is added, which serves as a class label. A classification head consisting of simple, dense layers is added, with a hidden layer during pre-training and a single linear layer while fine-tuning. The authors report improvements on the state-of-the-art results achieved by CNN-based models for a range of benchmark datasets such as ImageNet [51], CIFAR10, CIFAR100 [62] and Oxford-IIIT Pets [63].
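A minimal sketch of the patch-embedding step described above; the random projection matrix, placeholder class token, and position embeddings stand in for the trainable parameters of the actual ViT.

```python
import numpy as np

def patchify_and_embed(image, patch_size, W_proj, cls_token, pos_embed):
    """Split an image into patches, flatten them, project to D dims, prepend the class token."""
    H, W, C = image.shape
    ph = pw = patch_size
    patches = (image.reshape(H // ph, ph, W // pw, pw, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(-1, ph * pw * C))              # (num_patches, ph*pw*C)
    tokens = patches @ W_proj                               # trainable linear projection to D
    tokens = np.concatenate([cls_token[None, :], tokens])   # class embedding at position 0
    return tokens + pos_embed                               # add position embeddings

# toy usage: a 32x32 RGB image, 16x16 patches, latent size D = 64
D, P = 64, 16
rng = np.random.default_rng(5)
img = rng.random((32, 32, 3))
n_patches = (32 // P) ** 2                                  # 4 patches
W_proj = rng.normal(size=(P * P * 3, D))
cls_token = rng.normal(size=D)
pos_embed = rng.normal(size=(n_patches + 1, D))
print(patchify_and_embed(img, P, W_proj, cls_token, pos_embed).shape)  # (5, 64)
```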
An interesting implementation using transformer architecture was that created by Zheng et al. [33], who proposed a segmentation model named the Segmentation Transformer (SETR). They implement a solution wherein semantic segmentation is treated as a sequence-to-sequence prediction task with a transformer being deployed to encode an image as a sequence of patches. They combine the encoder with a single decoder by modeling the global context in each layer of the transformer. Figure 9 shows the architecture of their proposed system.
Figure 9. The SETR's architecture.
In the system, the image is first split into fixed patches which are linearly embedded with position encodings added. The resulting sequence of vectors is then fed into a standard transformer encoder. They propose two different decoder designs for pixel-wise segmentation, as can be seen in parts (b) and (c) of Figure 9. They then put these features together through a multi-level feature aggregation system as seen in part (c) in Figure 9. Using this methodology, they were able to achieve state-of-the-art results on the ADE20K [64] and Pascal Context [65] challenges.
To study the low-level vision tasks like denoising, super-resolution, and deraining,
Chen et al. [34] worked on developing a new pre-trained model using transformer archi-
tecture, called the image processing transformer (IPT). The entire network is composed
of multiple pairs of heads and tails corresponding to different tasks and a single shared
body, so the pre-trained model becomes more compatible with different image processing
tasks. Multiple corrupted counterparts were generated for each image in the famous bench-
mark ImageNet dataset using several carefully designed operations. The model was then
trained on the dataset’s original images in addition to the newly generated images, and it
outperformed the current state-of-the-art methods on several low-level benchmarks.
Touvron et al. [35] proposed a major non-convolutional transformer model, called the
DeiT, that has fewer parameters than the ResNet model, which makes it trainable on a
single computer in less than 3 days. Furthermore, a teacher–student strategy which relies
on a distillation token procedure was used to ensure that the student learns from the teacher
through attention. Using the distillation technique enables image transformers to learn more from a convnet than from another comparably performing transformer. Therefore, a
combination of those techniques results in a top accuracy of 85.2% on ImageNet with no
external data. Consequently, transferring these models to a different downstream task, such
as a fine-grained classification on popular benchmark datasets like CIFAR-10, Oxford-102
flowers, and Stanford Cars, achieved competitive results.
the data by using a larger version of the JFT-300M dataset, namely, the JFT-3B dataset.
Using this model, the authors were able to achieve a new state-of-the-art result on the
ImageNet dataset with a top accuracy of 90.45%. They also showed that they achieved a decent accuracy of 84.86% with few-shot learning, limited to only 10 examples per class from the ImageNet dataset for fine-tuning.
To make large-scale language models more accessible, less complex, and less resource-intensive, Zhang et al. [39] propose a suite of eight decoder-only pre-trained transformers
that consist of 125 million to 175 billion parameters, namely, Open Pre-trained Transformers
(OPTs). Their model is comparable to the state-of-the-art GPT-3 model with only 1/7th of
the carbon footprint. The model is directly developed from the GPT-3 model with a change
in the number of layers and attention heads to vary the parameter size. The smallest model,
consisting of 125M parameters, consists of 12 layers and 12 attention heads, while the
biggest model, consisting of 175B parameters, consists of 96 layers with 96 attention heads.
The batch size is varied from the original model to increase computational efficiency. While
training the OPT-175B model, the authors faced an issue of loss divergence, which they
fixed by lowering the learning rate and restarting the training from an earlier checkpoint.
The authors noticed a correlation between the loss divergence, the dynamic loss scalar
crashing to zero, and the l2 -norm of the activations of the final layer spiking. From this,
the authors derived a conclusion to pick restart points where the dynamic scalar loss was
still in the healthy state, which is greater than 1. The models were also additionally trained
with a larger set of data, including datasets that were used to train the RoBERTa, The Pile
dataset, and the PushShift.io Reddit dataset. The models were evaluated across 14 NLP
tasks, and it was seen that for zero-shot, the average performance follows the trend of
GPT-3 for 10 tests.
the patch generation process employed by ViTs but modifies it to avoid overlap. An
extremely recent solution making use of transformers in the field of vision is that of the
Unsupervised Semantic Segmentation Transformer (STEGO), proposed in March 2022
by Hamilton et al. [75]. This model makes use of transformers to localize semantically
meaningful categories within image corpora without any form of annotation. This is done
by using a novel loss function that encourages features to form compact clusters while
preserving their relationships across the corpora.
Furthermore, with the rise in the adaptation and use of transformers, an increase in
the focus on developing a lighter version of transformers has been noted. This is because,
while transformers have produced revolutionary results, it has been at a huge computation
cost, thereby preventing the models from being as easily adapted as earlier deep-learning
techniques such as CNNs [88]. To this end, numerous researchers have proposed works
aiming to scale or slim the weights of a traditional transformer. A notable attempt is
that of the EfficientFormer and EfficientFormerV2 proposed by Li et al. [41,42]. These
models make use of a process called latency-driven slimming to reduce the time taken for
inferencing using the trained transformers. The EfficientFormerV2 work further introduces
a fine-grained joint-search strategy that can find efficient architectures by optimizing the
latency and the number of parameters simultaneously. A similar work aiming to achieve
efficient image recognition was that of the AdaViT proposed by Meng et al. [43], which
serves as a computational framework learning to derive policies on which patches, self-
attention heads, and transformer blocks to use throughout the backbone on a per-input
basis. This is done by attaching a lightweight decision network to the backbone to produce
on-the-fly decisions. A similar thought process was seen in the case of the A-ViT method
proposed by Yin et al. [44] that adaptively adjusts the inference cost for images of different
complexities. This is done by reducing the number of tokens in the ViT as the inference
proceeds. Using the proposed method requires no extra parameters or sub-networks,
unlike the AdaViT, as the learning of the adaptive halting is based on the original network
parameters. A recent work aiming to improve the efficiency of transformer inference is
that of Pope et al. [45], who develop an analytical model for inference efficiency to select
the best multi-dimensional partitioning techniques. These are combined with low-level
optimizations to achieve a Pareto frontier on latency and FLOPS utilization tradeoffs.
Another key work was that of Zhang et al. in the introduction of the MiniViT
model [46], which applies weight multiplexing to reduce the complexity of the traditionally
immense vision transformer. This is done by multiplexing the weights of consecutive trans-
former blocks, wherein weights are shared across layers, while imposing a transformation
on the weights to increase diversity. Furthermore, the weight distillation over self-attention
is also applied to transfer knowledge from the large ViT models to the weight-multiplexed
compact models.
Yu and Wu [47] proposed a pruning framework to be applied to ViTs in order to simplify
all components in a transformer without altering the structure. This framework, called the UP-
ViT, estimates the importance score of each filter in a pre-trained ViT model before removing
redundant channels. Furthermore, they propose a progressive block-pruning method that
removes the least important block and proposes new hybrid blocks for ViTs.
An interesting area of recent work has been in making the training of transformers
a more data-efficient process. An early work in this space was that of the previously
discussed DeiT model proposed by Touvron et al. [35], who proposed using what they
called a distillation token to effectively learn from a teacher in a teacher–student method
employed to train transformers. This distillation token is learned through backpropagation,
through the interaction with the class and patch tokens through self-attention layers. A
more recent approach towards achieving data-efficient training is proposed by Wang
et al. [89], who aim to achieve this by claiming that the sparse feature sampling from local
image areas is key and, therefore, they propose a procedure where they alternate how
key and value sequences are constructed in the cross-attention layer. Furthermore, they
also introduce a label augmentation method which provides richer supervision, in turn,
achieving greater data efficiency.
4. Discussion
4.1. Historical Insight
Table 1 summarizes the historical works discussed in the previous section. The works
are color-coded in the timeline, wherein the works targeted towards text and NLP tasks are
color-coded in blue and the works targeted at image-related tasks are color-coded in orange.
Table 1. Cont.
The table above summarizes key information from the history of the studies discussed
in the previous section. In addition to the name of the study, the author and the date,
the table also outlines the approach presented as well as the datasets evaluated upon, the
models benchmarked against, and the obtained results. Finally, the number of citations attained by each paper as of the writing of this paper is also listed in order to emphasize
the importance of some of the presented studies.
In general, it can be seen that a number of works have chosen to add or modify layers
of the base transformer models, which has overall been seen to achieve good performance.
Indeed, such an approach is seen in works such as those of Shaw et al. [19], Wang et al. [25],
Sukhbaatar et al. [27], and Shazeer [28].
Another common approach for NLP tasks which has been shown to work particularly well
is to increase the size of the model to a very large number of parameters and to pre-train
it in an unsupervised fashion on a large corpus of data. This has been seen in numerous
state-of-the-art models such as the GPT [23], BERT [17], GPT-2 [90], RoBERTa [26], T5 [22],
and the GPT-3 [31] models, in the work of Radford et al. [37], and in the OPT model [39].
Yet another form of approach involves the addition or modification of the loss
functions associated with the transformer model. Such an approach was seen in the case of
the work performed by Carion et al. [30].
When it comes to images, the general procedure followed by the previous studies was
to split images into patches and apply position embeddings on these patches, much like
what is done for texts. This was indeed the process followed by the Vision Transformer
(ViT) [32]. Other vision models implemented varied decoders such as the work proposed
by Zheng et al. [33]. Similarly, studies such as that by Chen et al. [34] make use of multiple
pairs of heads and tails corresponding to different low-level vision tasks. The ViT-G model
proposed by Zhai et al. [38] followed a procedure where the class token was removed and
the non-linear projection before the final layer was removed.
model [17], which involves modifying the input encodings to make them bidirectional.
The RoBERTa [26] builds upon this by adding an optimized pre-training process. Indeed,
most of the other NLP-solving approaches have involved modifications to input encodings
such as the TENER [69], ETC [67], and the Big Bird [68] models, thereby demonstrating the
importance of encodings to the NLP process. Table 2 below displays a summary of notable
transformer studies in the domain of NLP.
Table 2. A summary of notable transformer studies in the domain of NLP.

| Name of Paper | Author | Date | Proposed Model | Datasets | Models Benchmarked Against | Results |
|---|---|---|---|---|---|---|
| BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [17] | Devlin et al. | 2018 | NLP Transformer | BooksCorpus, English Wikipedia | GLUE, SQuAD v1.1, SQuAD v2.0 | BERT Large average score of 82.1 on GLUE testing |
| RoBERTa: A Robustly Optimized BERT Pre-training Approach [26] | Liu et al. | 2019 | Variant of BERT | BooksCorpus, English Wikipedia, CC-News, OpenWebText, Stories | BERT Large, XLNet Large, BERT Ensembles-ALICE, MT-DNN, XLNet | Best results: SQuAD 1.1 F1-94.6, RACE-Middle-86.5, GLUE-SST-96.4 |
| TENER: Adapting Transformer Encoder for Named Entity Recognition [69] | Yan et al. | 2019 | Sequence labeling (NER) transformer | English NER: CoNLL2003, OntoNotes 5.0; Chinese NER: Chinese part of OntoNotes 4.0, MSRA, Weibo NER, Resume NER | Chinese NER: BiLSTM, 1D-CNN, CAN-NER, BiLSTM-CRF, CNN-BiLSTM-CRF, BiLSTM-BiLSTM-CRF; English NER: CNN-BiLSTM-CRF, 1D-CNN, LM-LSTM-CRF, CRF + HSCRF, BiLSTM-BiLSTM-CRF, LS + BiLSTM-CRF, CN^3, GRN | F1-scores: Chinese NER: Weibo-58.17, Resume-95.00, OntoNotes 4.0-72.43, MSRA-92.74; English NER: OntoNotes 5.0-88.43, model + CNN-char gets 91.45 for CoNLL 2003 |
| ETC: Encoding Long and Structured Inputs in Transformers [67] | Ainslie et al. | 2020 | Variation of BERT (lifted weights from RoBERTa) | BooksCorpus, English Wikipedia | BERT, RoBERTa | Leaderboard results: SOTA (1st) NQ long answer-77.78, HotpotQA SUP.F1-89.09, WikiHop-82.25, OpenKP-42.05 |
| Big Bird: Transformers for Longer Sequences [68] | Zaheer et al. | 2021 | Variation of BERT | MLM | HGN, GSAN, ReflectionNet, RikiNet-v2, Fusion-in-decoder, SpanBERT, MRC-GCN, MultiHop, Longformer | QA tasks, best results (F1 score): HotpotQA-Sup-89.1, NaturalQ-LA-77.8, TriviaQA-Verified-92.4, WikiHop-82.3 (accuracy) |
segmentation, proven through their decent results of an accuracy of 76.1%. This solution would benefit many real-world applications, as those datasets are often unbalanced or have small amounts of labeled data. A concrete quantitative analysis across
the previous studies is difficult to achieve due to the fact that all the authors report results
on different datasets and also report different evaluation metrics.
Table 3. A summary of the transformer-related works in the vision domain.

| Name of Paper | Author | Date | Proposed Model | Datasets | Models Benchmarked Against | Results |
|---|---|---|---|---|---|---|
| Image Transformer [73] | Parmar et al. | 2018 | Image Transformer | CIFAR10, ImageNet | Generative image modeling: Pixel CNN, Row Pixel RNN, Gated Pixel CNN, Attention Pixel CNN+, PixelSNAIL; further inference: ResNet, srez GAN, Pixel Recursive | GIM: 4.06 bits/dim on CIFAR10 validation; second best with 3.77 on ImageNet, very close to Pixel RNN with 3.86 |
| An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale [32] | Dosovitskiy et al. | 2020 | Vision Transformer (ViT) | Trained on ILSVRC-2012, ImageNet-21k, JFT; transferred on ReaL labels, CIFAR10/100, Oxford-IIIT Pets, Oxford Flowers-102, VTAB (19 tasks) | BiT-L (ResNet152x4), Noisy Student (EfficientNet-L2) | ImageNet-88.55 (ViT-H); ReaL-90.72 (ViT-H); CIFAR-10-99.50 (ViT-H); CIFAR-100-94.55 (ViT-H); Oxford-IIIT Pets-97.56 (ViT-H); Oxford Flowers-99.74 (ViT-L); VTAB (19 tasks)-77.63 (ViT-H) |
| Training data-efficient image transformers and distillation through attention [35] | Touvron et al. | 2020 | DeiT (based on ViT) | ImageNet | ResNet, RegNetY, EfficientNet, KDforAA, ViT (all versions) | DeiT-B 384/1000 epochs outperforms ViT and EfficientNet with 85.2 acc |
| Feature Fusion Vision Transformer for Fine-Grained Visual Categorization [74] | Wang et al. | 2021 | Introduction of MAWS (mutual attention weight selection) | CUB-200-2011, Stanford Dogs, iNaturalist2017 | CUB-200-2011: ResNet-50, RA-CNN, GP-256, MaxExt, DFL-CNN, NTS-Net, Cross-X, DCL, CIN, DBTNet, ASNet, S3N, FDL, PMG, API-Net, StackedLSTM, MMAL-Net, ViT, TransFG & PSM; iNaturalist2017: ResNet152, SSN, Huang et al., IncResNetv2, TASN, ViT, TransFG & PSM; Stanford Dogs: MaxEnt, FDL, RA-CNN, SEF, Cross-X, API-Net, ViT, TransFG & PSM | CUB-91.3% accuracy; iNaturalist2017-68.5%; Stanford Dogs-92.4% |
| Unsupervised Semantic Segmentation by Distilling Feature Correspondences [75] | Hamilton et al. | 2022 | Unsupervised Semantic Segmentation Transformer (STEGO) | 27-class COCOStuff, 27 classes of Cityscapes | ResNet50, MoCoV2, DINO, Deep Cluster, SIFT, Doersch et al., Isola et al., AC, InMARS, IIC, MDC, PiCIE, PiCIE + H | Unsupervised: Accuracy-56.9, mIoU-28.2; linear probe: Accuracy-76.1, mIoU-41.0 |
recently used in audio and time series domains. Here, too, it is difficult to do a concrete
quantitative analysis as the specific application domains of the works summarized above
are all different. An interesting work to note is that of Koizumi et al. [77], which merges
NLP analysis within the audio domain and is quite successful in outperforming the results
of the traditional LSTM model that is usually used for such an application, with a best BLEU-1 score of 52.1. Dong et al. [76] achieve a WER of 10.9 on the
eval92 subset of the Wall Street Journal dataset, and Gong et al. [78] achieve their best
results on the Speech commands v2 dataset with an accuracy of 98.11% without adding
additional audio data while training. The second half of the table demonstrates different
areas in the domain of time series using transformers. The domains illustrated are those of
the Time Series Classification by Liu et al. [79], who were able to beat the state-of-the-art
results on 7 out of 13 competitive datasets, those of the Time Series forecasting proposed by
Zhou et al. [80], who achieved SOTA results in all 6 datasets, and the Time Series Anomaly
Detection proposed by Tuli et al. [81], who also beat the SOTA results in their domain on 7
out of 10 competitive datasets.
Table 4. A summary of the transformer-related works in the audio and time series domains.
| Name of Paper | Author | Date | Proposed Model | Datasets | Models Benchmarked Against | Results |
|---|---|---|---|---|---|---|
| Speech-Transformer: A No-Recurrence Sequence-To-Sequence Model For Speech Recognition [76] | Dong et al. | 2018 | Audio Transformer | Wall Street Journal dataset | CTC, seq2seq, seq2seq + deep convolutional, seq2seq + Unigram LS | WER 10.9 on eval92 |
| A Transformer-based Audio-Captioning Model with Keyword Estimation [77] | Koizumi et al. | 2020 | Audio-Captioning Transformer (TRACKE) | Clotho dataset | Baseline LSTM, Transformer from the same challenge | Beats the baseline in BLEU-1 with 52.1; BLEU-2-30.9, BLEU-3-18.8, BLEU-4-10.8, CIDEr-25.8, METEOR-14.9, ROUGE-L-34.5, SPICE-9.7, SPIDEr-17.7 |
| AST: Audio Spectrogram Transformer [78] | Gong et al. | 2021 | Audio Transformer (AST): converted pre-trained ViT, used DeiT weights | AudioSet, ESC-50, Speech Commands V2 | AudioSet: Baseline, PANN, PSLA single, PSLA Ensemble-S, PSLA Ensemble-M; ESC-50 and Speech Commands V2: SOTA-S (without additional audio data), SOTA-P (with additional audio data) | AudioSet: AST (Ensemble-M) balanced mAP-0.378, full mAP-0.485; ESC-50: AST-P (trained using additional audio data)-95.6%; Speech Commands V2: AST-S (trained without additional audio data)-98.11% |
| Gated Transformer Networks for Multivariate Time Series Classification [79] | Liu et al. | 2021 | Time series classification transformer | AUSLAN, ArabicDigits, CMUsubject1, CharacterTrajectories, ECG, JapaneseVowels, KickvsPunch, Libras, NetFlow, UWave, Wafer, WalkvsRun, PEMS | MLP, FCN, ResNet, Encoder, MCNN, t-LeNet, MCDCNN, Time-CNN, TWIESN | Best SOTA results in 7/13 datasets, with best scores of 100% for CMUsubject1, NetFlow and WalkvsRun |
| TranAD: Deep Transformer Networks for Anomaly Detection in Multivariate Time Series Data [81] | Tuli et al. | 2022 | Anomaly detection in time series transformer | NAB, UCR, MBA, SMAP, MSL, SWaT, WADI, SMD, MSDS | MERLIN, LSTM-NDT, DAGMM, OmniAnomaly, MSCRED, MAD-GAN, USAD, MTAD-GAT, CAE-M, GDN | Beats the SOTA results in 7/10 datasets for both F1-score and AUC; best score is AUC of 0.9994 and F1 of 0.9694 for the UCR dataset |
Figure 10. The progression of transformer architectures [1,17,23,32,41].
However, despite this rapid progression, certain gaps in the field remain. One major gap seen in contemporary research is that transformers generally have a quadratic computation and memory complexity due to their being required to model arbitrary long dependencies [91]. This has presented a major issue in the accessibility of the use of transformers and has led to a promising avenue of research aimed at simplifying the training process of transformer models [92]. Indeed, the Lite Transformer [29] discussed earlier was introduced with the intention of addressing this very issue, as were implementations such as the Longformer [93], Reformer [94], Linformer [95], Performer [96], and the OPT [39]. However, these models are a start to what is a vast potential research space in optimizing transformer-training procedures. This is a pressing issue, as many of the state-of-the-art models aim to simply increase a model's size (GPT-4, for instance) [97], and, therefore, make it impractical for that model to be used in many real-world applications.
Another interesting research issue is the problem of integrating all modalities without changing the architecture towards a single modality. Early implementations of this have been seen in models such as the Perceiver [98], which accepts all kinds of input but can only generate fixed outputs such as class probabilities, and the Perceiver IO, which has flexible inputs and outputs but still relies on the specifics of the modalities, such as augmentation or position encoding, to properly learn [99]. This research area is ripe for expansion, as a model that is truly adaptable to anything would lead to massive progress in the field of deep learning and would broaden the scope of the real-world applications that could be improved with artificial intelligence.
A final research area which can be worked upon is that, generally, large amounts of
data are needed to train a good transformer. This is less than ideal as many real-world
applications do not contain adequate amounts of labeled data and therefore would not be
able to leverage this powerful model. Promising research towards achieving this is that
of the ViT-G [38], which reports having achieved few-shot learning by training with just
10 examples per class in the ImageNet dataset. More work needs to be done in this realm to
truly make transformers accessible for wide implementations. A possible avenue to achieve
this could be exploring ways to train transformers in a semi-supervised fashion [100]. With
the successful exploration of these avenues of research, it might be possible to leverage the
great power and achievements attained by transformers in real-world applications which would affect our daily lives.
Author Contributions: Conceptualization, A.R.S. and I.Z.; methodology, A.R.S., I.Z. and D.S.; in-
vestigation, A.R.S. and I.Z.; resources, I.Z.; writing—original draft preparation, A.R.S. and D.S.;
writing—review and editing, I.Z.; visualization, D.S.; supervision, I.Z. All authors have read and
agreed to the published version of the manuscript.
Funding: The work in this paper was supported, in part, by the Open Access Program from the
American University of Sharjah [grant number: OAPCEN-1410-E00291].
Acknowledgments: This paper represents the opinions of the authors and does not mean to represent
the position or opinions of the American University of Sharjah.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need.
In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9
December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6000–6010.
2. Li, X.; Metsis, V.; Wang, H.; Ngu, A.H.H. TTS-GAN: A Transformer-Based Time-Series Generative Adversarial Network. In
Artificial Intelligence in Medicine; Michalowski, M., Abidi, S.S.R., Abidi, S., Eds.; Lecture Notes in Computer Science; Springer
International Publishing: Cham, Switzerland, 2022; Volume 13263, pp. 133–143. ISBN 978-3-031-09341-8.
3. Myers, D.; Mohawesh, R.; Chellaboina, V.I.; Sathvik, A.L.; Venkatesh, P.; Ho, Y.-H.; Henshaw, H.; Alhawawreh, M.; Berdik, D.;
Jararweh, Y. Foundation and large language models: Fundamentals, challenges, opportunities, and social impacts. Cluster Comput.
2024, 27, 1–26. [CrossRef]
4. Liu, Y.; Zhang, Y.; Wang, Y.; Hou, F.; Yuan, J.; Tian, J.; Zhang, Y.; Shi, Z.; Fan, J.; He, Z. A Survey of Visual Transformers. IEEE
Trans. Neural Netw. Learn. Syst. 2023, 1–21. [CrossRef] [PubMed]
5. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A Survey on Vision Transformer.
IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110. [CrossRef] [PubMed]
6. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef] [PubMed]
7. Rumelhart, D.E.; McClelland, J.L. Learning Internal Representations by Error Propagation. In Parallel Distributed Processing:
Explorations in the Microstructure of Cognition: Foundations; MIT Press: Cambridge, MA, USA, 1987; pp. 318–362. ISBN 978-0-262-
29140-8.
8. Zeyer, A.; Bahar, P.; Irie, K.; Schlüter, R.; Ney, H. A Comparison of Transformer and LSTM Encoder Decoder Models for ASR. In
Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 14–18 December
2019; pp. 8–15.
9. Pascanu, R.; Mikolov, T.; Bengio, Y. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International
Conference on Machine Learning, Atlanta, GA, USA, 17–19 June 2013; Dasgupta, S., McAllester, D., Eds.; PMLR: Atlanta, GA,
USA, 2013; Volume 28, pp. 1310–1318.
10. Kim, Y.; Denton, C.; Hoang, L.; Rush, A.M. Structured Attention Networks. arXiv 2017, arXiv:1702.00887.
11. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. arXiv 2014, arXiv:1409.3215.
12. Greff, K.; Srivastava, R.K.; Koutník, J.; Steunebrink, B.R.; Schmidhuber, J. LSTM: A Search Space Odyssey. IEEE Trans. Neural
Netw. Learning Syst. 2017, 28, 2222–2232. [CrossRef] [PubMed]
13. Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in Vision: A Survey. ACM Comput. Surv. 2021, 54,
1–41. [CrossRef]
14. Voita, E.; Talbot, D.; Moiseev, F.; Sennrich, R.; Titov, I. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy
Lifting, the Rest Can Be Pruned. arXiv 2019, arXiv:1905.09418.
15. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
16. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450.
17. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understand-
ing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T.,
Eds.; Association for Computational Linguistics: Pittsburgh, PA, USA, 2019; Volume 1, (Long and Short Papers), pp. 4171–4186.
18. Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; Dauphin, Y.N. Convolutional Sequence to Sequence Learning. In Proceedings of the
34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Precup, D., Teh, Y.W., Eds.; PMLR:
London, UK, 2017; Volume 70, pp. 1243–1252.
19. Shaw, P.; Uszkoreit, J.; Vaswani, A. Self-Attention with Relative Position Representations. arXiv 2018, arXiv:1803.02155.
20. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.;
Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, 71.
[CrossRef] [PubMed]
21. Attention Is All You Need Search Results. Available online: https://scholar.google.ae/scholar?q=Attention+Is+All+You+Need&
hl=en&as_sdt=0&as_vis=1&oi=scholart (accessed on 5 June 2022).
22. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer
Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67.
23. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. 2018.
Available online: https://api.semanticscholar.org/CorpusID:49313245 (accessed on 14 May 2024).
24. Radford, A.; Jozefowicz, R.; Sutskever, I. Learning to Generate Reviews and Discovering Sentiment. arXiv 2017, arXiv:1704.01444.
25. Wang, Q.; Li, B.; Xiao, T.; Zhu, J.; Li, C.; Wong, D.F.; Chao, L.S. Learning Deep Transformer Models for Machine Translation. arXiv
2019, arXiv:1906.01787.
26. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly
Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692.
27. Sukhbaatar, S.; Grave, E.; Lample, G.; Jegou, H.; Joulin, A. Augmenting Self-attention with Persistent Memory. arXiv 2019,
arXiv:1907.01470.
28. Shazeer, N. GLU Variants Improve Transformer. arXiv 2020, arXiv:2002.05202.
29. Wu, Z.; Liu, Z.; Lin, J.; Lin, Y.; Han, S. Lite Transformer with Long-Short Range Attention. In Proceedings of the International
Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020.
30. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In
Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Lecture Notes in Computer Science; Springer
International Publishing: Cham, Switzerland, 2020; Volume 12346, pp. 213–229. ISBN 978-3-030-58451-1.
31. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al.
Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M.,
Hadsell, R., Balcan, M.F., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 1877–1901.
32. Kolesnikov, A.; Dosovitskiy, A.; Weissenborn, D.; Heigold, G.; Uszkoreit, J.; Beyer, L.; Minderer, M.; Dehghani, M.; Houlsby, N.;
Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International
Conference on Learning Representations, Vienna, Austria, 4 May 2021.
33. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.; et al. Rethinking semantic segmentation
from a sequence-to-sequence perspective with transformers. In Proceedings of the 2021 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: Piscataway, NJ, USA, 2022; pp. 6877–6886.
34. Chen, H.; Wang, Y.; Guo, T.; Xu, C.; Deng, Y.; Liu, Z.; Ma, S.; Xu, C.; Xu, C.; Gao, W. Pre-Trained Image Processing Transformer. In
Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25
June 2021; pp. 12294–12305.
35. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jegou, H. Training data-efficient image transformers & distillation
through attention. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Meila, M.,
Zhang, T., Eds.; PMLR: London, UK, 2021; Volume 139, pp. 10347–10357.
36. Fedus, W.; Zoph, B.; Shazeer, N. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.
arXiv 2022, arXiv:2101.03961.
37. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning
Transferable Visual Models From Natural Language Supervision. arXiv 2021, arXiv:2103.00020.
38. Zhai, X.; Kolesnikov, A.; Houlsby, N.; Beyer, L. Scaling Vision Transformers. arXiv 2021, arXiv:2106.04560.
39. Zhang, S.; Roller, S.; Goyal, N.; Artetxe, M.; Chen, M.; Chen, S.; Dewan, C.; Diab, M.; Li, X.; Lin, X.V.; et al. OPT: Open Pre-trained
Transformer Language Models. arXiv 2022, arXiv:2205.01068.
40. Wang, H.; Ma, S.; Dong, L.; Huang, S.; Zhang, D.; Wei, F. DeepNet: Scaling Transformers to 1000 Layers. arXiv 2022,
arXiv:2203.00555. [CrossRef]
41. Li, Y.; Yuan, G.; Wen, Y.; Hu, J.; Evangelidis, G.; Tulyakov, S.; Wang, Y.; Ren, J. Efficientformer: Vision transformers at mobilenet
speed. Adv. Neural Inf. Process. Syst. 2022, 35, 12934–12949.
42. Li, Y.; Hu, J.; Wen, Y.; Evangelidis, G.; Salahi, K.; Wang, Y.; Tulyakov, S.; Ren, J. Rethinking Vision Transformers for MobileNet
Size and Speed. In Proceedings of the IEEE International Conference on Computer Vision, Paris, France, 2–3 October 2023.
43. Meng, L.; Li, H.; Chen, B.-C.; Lan, S.; Wu, Z.; Jiang, Y.-G.; Lim, S.-N. AdaViT: Adaptive Vision Transformers for Efficient Image
Recognition. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New
Orleans, LA, USA, 18–24 June 2022; pp. 12299–12308.
44. Yin, H.; Vahdat, A.; Alvarez, J.M.; Mallya, A.; Kautz, J.; Molchanov, P. A-ViT: Adaptive Tokens for Efficient Vision Transformer. In
Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24
June 2022; pp. 10799–10808.
45. Pope, R.; Douglas, S.; Chowdhery, A.; Devlin, J.; Bradbury, J.; Heek, J.; Xiao, K.; Agrawal, S.; Dean, J. Efficiently Scaling
Transformer Inference. Proc. Mach. Learn. Syst. 2023, 5.
46. Zhang, J.; Peng, H.; Wu, K.; Liu, M.; Xiao, B.; Fu, J.; Yuan, L. MiniViT: Compressing Vision Transformers with Weight Multiplexing.
In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24
June 2022; pp. 12135–12144.
47. Yu, H.; Wu, J. A unified pruning framework for vision transformers. Sci. China Inf. Sci. 2023, 66, 179101. [CrossRef]
48. Zhu, Y.; Kiros, R.; Zemel, R.; Salakhutdinov, R.; Urtasun, R.; Torralba, A.; Fidler, S. Aligning Books and Movies: Towards
Story-Like Visual Explanations by Watching Movies and Reading Books. In Proceedings of the IEEE International Conference on
Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015.
49. Bojar, O.; Buck, C.; Federmann, C.; Haddow, B.; Koehn, P.; Leveling, J.; Monz, C.; Pecina, P.; Post, M.; Saint-Amand, H.; et al.
Findings of the 2014 Workshop on Statistical Machine Translation. In Proceedings of the Ninth Workshop on Statistical Machine
Translation, Baltimore, MD, USA, 26–27 June 2014; Association for Computational Linguistics: Pittsburgh, PA, USA, 2014; pp.
12–58.
50. Lim, D.; Hohne, F.; Li, X.; Huang, S.L.; Gupta, V.; Bhalerao, O.; Lim, S.-N. Large Scale Learning on Non-Homophilous Graphs:
New Benchmarks and Strong Simple Methods. arXiv 2021, arXiv:2110.14446. [CrossRef]
51. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Li, F.F. ImageNet: A large-scale hierarchical image database. In Proceedings of the
2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
52. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in
Context. In Computer Vision—ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Lecture Notes in Computer Science;
Springer International Publishing: Cham, Switzerland, 2014; Volume 8693, pp. 740–755. ISBN 978-3-319-10601-4.
53. Caruana, R. Multitask Learning. Mach. Learn. 1997, 28, 41–75. [CrossRef]
54. Taylor, W.L. “Cloze procedure”: A new tool for measuring readability. J. Q. 1953, 30, 415–433. [CrossRef]
55. Schwartz, R.; Sap, M.; Konstas, I.; Zilles, L.; Choi, Y.; Smith, N.A. Story Cloze Task: UW NLP System. In Proceedings of the
LSDSem 2017, Valencia, Spain, 3 April 2017.
56. Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural
Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural
Networks for NLP, Brussels, Belgium, 1 November 2018; Association for Computational Linguistics: Brussels, Belgium, 2018; pp.
353–355.
57. Lai, G.; Xie, Q.; Liu, H.; Yang, Y.; Hovy, E. RACE: Large-scale ReAding Comprehension Dataset From Examinations. In
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11
September 2017; Association for Computational Linguistics: Copenhagen, Denmark, 2017; pp. 785–794.
58. Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of
the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; Association
for Computational Linguistics: Austin, TX, USA, 2016; pp. 2383–2392.
59. Dauphin, Y.N.; Fan, A.; Auli, M.; Grangier, D. Language Modeling with Gated Convolutional Networks. In Proceedings of the
34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 933–941.
60. Wang, A.; Pruksachatkun, Y.; Nangia, N.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S. SuperGLUE: A Stickier Benchmark
for General-Purpose Language Understanding Systems. In Proceedings of the Advances in Neural Information Processing
Systems, Vancouver, BC, Canada, 8–14 December 2019; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.,
Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32.
61. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv
2016, arXiv:1506.01497. [CrossRef]
62. Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. 2009. Volume 7, pp. 32–33. Available online: https:
//www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf (accessed on 14 May 2024).
63. Parkhi, O.M.; Vedaldi, A.; Zisserman, A.; Jawahar, C.V. Cats and Dogs. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012.
64. Zhou, B.; Zhao, H.; Puig, X.; Fidler, S.; Barriuso, A.; Torralba, A. Scene Parsing through ADE20K Dataset. In Proceedings of the
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 June 2017; pp. 5122–5130.
65. Mottaghi, R.; Chen, X.; Liu, X.; Cho, N.-G.; Lee, S.-W.; Fidler, S.; Urtasun, R.; Yuille, A. The Role of Context for Object Detection
and Semantic Segmentation in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), Columbus, OH, USA, 23–28 June 2014.
66. Patwardhan, N.; Marrone, S.; Sansone, C. Transformers in the Real World: A Survey on NLP Applications. Information 2023, 14,
242. [CrossRef]
67. Ainslie, J.; Ontanon, S.; Alberti, C.; Cvicek, V.; Fisher, Z.; Pham, P.; Ravula, A.; Sanghai, S.; Wang, Q.; Yang, L. ETC: Encoding
Long and Structured Inputs in Transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language
Processing (EMNLP), Online, 16–20 November 2020; Association for Computational Linguistics: Pittsburgh, PA, USA, 2020;
pp. 268–284.
68. Zaheer, M.; Guruganesh, G.; Dubey, K.A.; Ainslie, J.; Alberti, C.; Ontanon, S.; Pham, P.; Ravula, A.; Wang, Q.; Yang, L.; et al. Big
bird: Transformers for longer sequences. Adv. Neural Inf. Process. Syst. 2020, 33, 17283–17297.
69. Yan, H.; Deng, B.; Li, X.; Qiu, X. TENER: Adapting Transformer Encoder for Named Entity Recognition. arXiv 2019,
arXiv:1911.04474.
70. Zhou, Q.; Yang, N.; Wei, F.; Tan, C.; Bao, H.; Zhou, M. Neural Question Generation from Text: A Preliminary Study. arXiv 2017,
arXiv:1704.01792.
71. Miwa, M.; Bansal, M. End-to-End Relation Extraction using LSTMs on Sequences and Tree Structures. In Proceedings of the 54th
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016;
Association for Computational Linguistics: Pittsburgh, PA, USA, 2016; pp. 1105–1116.
72. Fragkou, P. Applying named entity recognition and co-reference resolution for segmenting English texts. Prog. Artif. Intell. 2017,
6, 325–346. [CrossRef]
73. Parmar, N.J.; Vaswani, A.; Uszkoreit, J.; Kaiser, L.; Shazeer, N.; Ku, A.; Tran, D. Image Transformer. In Proceedings of the
International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018.
74. Wang, J.; Yu, X.; Gao, Y. Feature Fusion Vision Transformer for Fine-Grained Visual Categorization. arXiv 2021, arXiv:2107.02341.
75. Hamilton, M.; Zhang, Z.; Hariharan, B.; Snavely, N.; Freeman, W.T. Unsupervised Semantic Segmentation by Distilling Feature
Correspondences. arXiv 2022, arXiv:2203.08414.
76. Dong, L.; Xu, S.; Xu, B. Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition. In
Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB,
Canada, 15–20 April 2018; pp. 5884–5888.
77. Koizumi, Y.; Masumura, R.; Nishida, K.; Yasuda, M.; Saito, S. A Transformer-based Audio Captioning Model with Keyword
Estimation. arXiv 2020, arXiv:2007.00222.
78. Gong, Y.; Chung, Y.-A.; Glass, J. AST: Audio Spectrogram Transformer. arXiv 2021, arXiv:2104.01778.
79. Liu, M.; Ren, S.; Ma, S.; Jiao, J.; Chen, Y.; Wang, Z.; Song, W. Gated Transformer Networks for Multivariate Time Series
Classification. arXiv 2021, arXiv:2103.14438.
80. Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; Jin, R. FEDformer: Frequency Enhanced Decomposed Transformer for Long-term
Series Forecasting. arXiv 2022, arXiv:2201.12740.
81. Tuli, S.; Casale, G.; Jennings, N.R. TranAD: Deep Transformer Networks for Anomaly Detection in Multivariate Time Series Data.
arXiv 2022, arXiv:2201.07284. [CrossRef]
82. He, K.; Gan, C.; Li, Z.; Rekik, I.; Yin, Z.; Ji, W.; Gao, Y.; Wang, Q.; Zhang, J.; Shen, D. Transformers in medical image analysis.
Intell. Med. 2023, 3, 59–78. [CrossRef]
83. Sajun, A.R.; Zualkernan, I.; Sankalpa, D. Investigating the Performance of FixMatch for COVID-19 Detection in Chest X-rays.
Appl. Sci. 2022, 12, 4694. [CrossRef]
84. Ziani, S. Enhancing fetal electrocardiogram classification: A hybrid approach incorporating multimodal data fusion and advanced
deep learning models. Multimed. Tools Appl. 2023, 83, 55011–55051. [CrossRef]
85. Ziani, S.; Farhaoui, Y.; Moutaib, M. Extraction of Fetal Electrocardiogram by Combining Deep Learning and SVD-ICA-NMF
Methods. Big Data Min. Anal. 2023, 6, 301–310. [CrossRef]
86. Li, J.; Chen, J.; Tang, Y.; Wang, C.; Landman, B.A.; Zhou, S.K. Transforming medical imaging with Transformers? A comparative
review of key properties, current progresses, and future perspectives. Med. Image Anal. 2023, 85, 102762. [CrossRef] [PubMed]
87. Shamshad, F.; Khan, S.; Zamir, S.W.; Khan, M.H.; Hayat, M.; Khan, F.S.; Fu, H. Transformers in medical imaging: A survey. Med.
Image Anal. 2023, 88, 102802. [CrossRef]
88. Fournier, Q.; Caron, G.M.; Aloise, D. A Practical Survey on Faster and Lighter Transformers. ACM Comput. Surv. 2023, 55, 1–40.
[CrossRef]
89. Wang, W.; Zhang, J.; Cao, Y.; Shen, Y.; Tao, D. Towards Data-Efficient Detection Transformers. In Proceedings of the Computer
Vision—ECCV 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer Nature Switzerland: Cham,
Switzerland, 2022; pp. 88–105.
90. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI
Blog 2019, 1, 9.
91. Liu, P.J.; Saleh, M.; Pot, E.; Goodrich, B.; Sepassi, R.; Kaiser, L.; Shazeer, N. Generating Wikipedia by Summarizing Long Sequences.
arXiv 2018, arXiv:1801.10198.
92. Tay, Y.; Dehghani, M.; Bahri, D.; Metzler, D. Efficient Transformers: A Survey. ACM Comput. Surv. 2023, 55, 1–28. [CrossRef]
93. Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The Long-Document Transformer. arXiv 2020, arXiv:2004.05150.
94. Kitaev, N.; Kaiser, Ł.; Levskaya, A. Reformer: The Efficient Transformer. arXiv 2020, arXiv:2001.04451.
95. Wang, S.; Li, B.Z.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-Attention with Linear Complexity. arXiv 2020, arXiv:2006.04768.
96. Choromanski, K.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.; Mohiuddin, A.; Kaiser, L.;
et al. Rethinking Attention with Performers. arXiv 2021, arXiv:2009.14794.
97. OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [CrossRef]
98. Jaegle, A.; Gimeno, F.; Brock, A.; Zisserman, A.; Vinyals, O.; Carreira, J. Perceiver: General Perception with Iterative Attention.
arXiv 2021, arXiv:2103.03206.
99. Jaegle, A.; Borgeaud, S.; Alayrac, J.-B.; Doersch, C.; Ionescu, C.; Ding, D.; Koppula, S.; Zoran, D.; Brock, A.; Shelhamer, E.; et al.
Perceiver IO: A General Architecture for Structured Inputs & Outputs. arXiv 2022, arXiv:2107.14795.
100. Weng, Z.; Yang, X.; Li, A.; Wu, Z.; Jiang, Y.-G. Semi-supervised vision transformers. In Proceedings of the ECCV 2022, Tel Aviv,
Israel, 23–27 October 2022.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.