ViT Survey On Segmentation
Abstract—Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their
application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input
sequence elements and support parallel processing of sequences as compared to recurrent networks, e.g., Long Short-Term Memory
(LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited
as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos,
text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge
datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to
provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to
fundamental concepts behind the success of Transformers i.e., self-attention, large-scale pre-training, and bidirectional feature
encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification,
object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual
reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image
super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We
compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental
value. Finally, we provide an analysis on open research directions and possible future works. We hope this effort will ignite further
interest in the community to solve current challenges towards the application of transformer models in computer vision.
Index Terms—Self-attention, transformers, bidirectional encoders, deep neural networks, convolutional networks, self-supervision.
1 INTRODUCTION
Fig. 1: Statistics on the number of times keywords such as BERT, Self-Attention, and Transformers appear in the titles of Peer-
reviewed and arXiv papers over the past few years (in Computer Vision and Machine Learning). The plots show consistent growth
in recent literature. This survey covers recent progress on Transformers in the computer vision domain.
Fig. 3: Architecture of the Transformer Model [1]. The model was first developed for the language translation task where an input
sequence in one language is required to be converted to the output sequence in another language. The Transformer encoder
(middle row) operates on the input language sequence and converts it to an embedding before passing it on to the encoder blocks.
The Transformer decoder (bottom row) operates on the previously generated outputs in the translated language and the encoded
input sequence from the middle branch to output the next word in the output sequence. The sequence of previous outputs (used
as input to the decoder) is obtained by shifting the output sentence to the right by one position and appending a start-of-sentence
token at the beginning. This shifting prevents the model from simply learning to copy the decoder input to the output. The ground-truth
to train the model is simply the output language sequence (without any right shift) appended with an end-of-sentence token. The
blocks consisting of multi-head attention (top row) and feed-forward layers are repeated N times in both the encoder and decoder.
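To make the shifting described in the caption concrete, here is a minimal sketch of how decoder inputs and targets are built for teacher forcing; the token strings and the <sos>/<eos> markers are illustrative placeholders, not taken from the original paper.

```python
# A minimal sketch of building decoder input/target pairs for teacher forcing.
# Token strings and special symbols are illustrative only.
SOS, EOS = "<sos>", "<eos>"

def make_decoder_io(target_tokens):
    # Decoder input: target shifted right by one, with <sos> prepended.
    decoder_input = [SOS] + target_tokens
    # Ground truth: original target sequence with <eos> appended (no right shift).
    ground_truth = target_tokens + [EOS]
    return decoder_input, ground_truth

dec_in, gt = make_decoder_io(["das", "ist", "gut"])
print(dec_in)  # ['<sos>', 'das', 'ist', 'gut']
print(gt)      # ['das', 'ist', 'gut', '<eos>']
```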
For a given entity in the sequence, the self-attention basically computes the dot-product of the query with all keys, which is then normalized using a softmax operator to obtain the attention scores. Each entity then becomes the weighted sum of all entities in the sequence, where the weights are given by the attention scores (Fig. 2 and Fig. 3, top row-left block).

Masked Self-Attention: The standard self-attention layer attends to all entities. For the Transformer model [1], which is trained to predict the next entity of the sequence, the self-attention blocks used in the decoder are masked to prevent attending to subsequent future entities. This is simply done by an element-wise multiplication with a mask M ∈ R^(n×n), where M is an upper-triangular matrix. The masked self-attention is defined by

softmax( QK^T / √dq ◦ M ),

where ◦ denotes the Hadamard product. Basically, while predicting an entity in the sequence, the attention scores of the future entities are set to zero in masked self-attention.

Multi-Head Attention: In order to encapsulate multiple complex relationships amongst different elements in the sequence, multi-head attention comprises multiple self-attention blocks (h = 8 in the original Transformer model [1]). Each block has its own set of learnable weight matrices {W_i^Q, W_i^K, W_i^V}, where i = 0 · · · (h−1). For an input X, the outputs of the h self-attention blocks in multi-head attention are concatenated into a single matrix [Z_0, Z_1, · · · , Z_(h−1)] ∈ R^(n×h·dv) and projected onto a weight matrix W ∈ R^(h·dv×d) (Fig. 3, top row).

The main difference between self-attention and the convolution operation is that the filters are dynamically calculated instead of static (staying the same for any input) as in the case of convolution. Further, self-attention is invariant to permutations and changes in the number of input points. As a result, it can easily operate on irregular inputs, as opposed to standard convolution which requires a grid structure. Furthermore, it has been shown in the literature how self-attention (with positional encodings) is theoretically a more flexible operation which can model the behaviour of convolutional models towards encoding local features [40]. Cordonnier et al. [41] further studied the relationships between self-attention and convolution operations. Their empirical results confirm that multi-head self-attention (with sufficient parameters) is a more generic operation which can model the expressiveness of convolution as a special case. In fact, self-attention provides the capability to learn global as well as local features, and the expressivity to adaptively learn both kernel weights and the receptive field (similar to deformable convolutions [42]).
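To ground the masked multi-head self-attention described above, the following is a minimal PyTorch-style sketch. The dimensions, variable names and per-head weight lists are our own choices, not the exact formulation of [1]; it also uses the standard additive-mask variant, which has the same effect as the Hadamard masking in the equation above.

```python
import math
import torch
import torch.nn.functional as F

def masked_multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """Minimal sketch: X is (n, d); Wq/Wk/Wv are lists of h per-head
    projection matrices of shape (d, d_v); Wo is (h*d_v, d)."""
    n = X.shape[0]
    # Causal mask: position i may only attend to positions j <= i.
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
    heads = []
    for i in range(h):
        Q, K, V = X @ Wq[i], X @ Wk[i], X @ Wv[i]
        scores = Q @ K.T / math.sqrt(Q.shape[-1])
        # Set future positions to -inf before the softmax so their
        # attention weights become zero (equivalent to the masking above).
        scores = scores.masked_fill(~causal, float("-inf"))
        heads.append(F.softmax(scores, dim=-1) @ V)        # (n, d_v)
    # Concatenate the h head outputs and project them back to dimension d.
    return torch.cat(heads, dim=-1) @ Wo                    # (n, d)
```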
2.2 (Self) Supervised Pre-training

Self-attention based Transformer models generally operate in a two-stage training mechanism. First, pre-training is performed on a large-scale dataset (and sometimes a combination of several available datasets [22], [43]) in either a supervised [11] or a self-supervised manner [3], [44], [45]. Later, the pre-trained weights are adapted to the downstream tasks using small- to mid-scale datasets. Examples of downstream tasks include image classification [46], object detection [13], zero-shot classification [20], question answering [10] and action recognition [18]. The effectiveness of pre-training for large-scale Transformers has been advocated in both the language and vision domains. For example, the Vision Transformer model (ViT-L) [11] experiences an absolute 13% drop in accuracy on the ImageNet test set when trained only on the ImageNet train set, as compared to the case when it is pre-trained on the JFT dataset [47] with 300 million images.

Since acquiring manual labels at a massive scale is cumbersome, self-supervised learning has been used very effectively in the pre-training stage. The self-supervision based pre-training stage has played a crucial role in unleashing the scalability and generalization of Transformer networks, enabling training of even above-a-trillion-parameter networks (e.g., the latest Switch Transformer [10] from Google). An extensive survey on SSL can be found in [48], [49]. As nicely summarized by Y. LeCun [50], the basic idea of SSL is to fill in the blanks, i.e., try to predict the occluded data in images, future or past frames in temporal video sequences, or predict a pretext task, e.g., the amount of rotation applied to inputs, the permutation applied to image patches, or the color of a gray-scale image. Another effective way to impose self-supervised constraints is via contrastive learning. In this case, nuisance transformations are used to create two types of modified versions of the same image, i.e., without changing the underlying class semantics (e.g., image stylizing, cropping) and with semantic changes (e.g., replacing an object with another in the same scene, or changing the class with minor adversarial changes to the image). Subsequently, the model is trained to be invariant to the nuisance transformations and to emphasize modeling minor changes that can alter semantic labels.

Self-supervised learning provides a promising learning paradigm since it enables learning from a vast amount of readily available non-annotated data. In the SSL-based pre-training stage, a model is trained to learn a meaningful representation of the underlying data by solving a pretext task. The pseudo-labels for the pretext task are automatically generated (without requiring any expensive manual annotations) based on data attributes and the task definition. Therefore, the pretext task definition is a critical choice in SSL. We can broadly categorize existing SSL methods based upon their pretext tasks into (a) generative approaches which synthesize images or videos (given conditional inputs), (b) context-based methods which exploit the relationships between image patches or video frames, and (c) cross-modal methods which leverage multiple data modalities. Examples of generative approaches include conditional generation tasks such as masked image modeling [43] and image colorization [51], image super-resolution [52], image in-painting [53], and GAN-based methods [54], [55]. The context-based pretext methods solve problems such as a jigsaw puzzle on image patches [56]–[58], masked object classification [22], predicting a geometric transformation such as rotation [46], [59], or verifying the temporal sequence of video frames [60]–[62]. Cross-modal pretext methods verify the correspondence of two input modalities, e.g., text & image [63], audio & video [64], [65], or RGB & flow [66].

2.3 Transformer Model

The architecture of the Transformer model proposed in [1] is shown in Fig. 3. It has an encoder-decoder structure. The encoder (middle row) consists of six identical blocks (i.e., N = 6 in Fig. 3), with each block having two sub-layers: a multi-head self-attention network, and a simple position-wise fully connected feed-forward network. Residual connections [67] alongside layer normalization [68] are employed after each block as in Fig. 3. Note that, different from regular convolutional networks where feature aggregation and feature transformation are performed simultaneously (e.g., with a convolution layer followed by a non-linearity), these two steps are decoupled in the Transformer model, i.e., the self-attention layer only performs aggregation while the feed-forward layer performs transformation. Similar to the encoder, the decoder (bottom row) in the Transformer model comprises six identical blocks. Each decoder block has three sub-layers: the first two (multi-head self-attention, and feed-forward) are similar to the encoder, while the third sub-layer performs multi-head attention on the outputs of the corresponding encoder block, as shown in Fig. 3.

The original Transformer model in [1] was trained for the Machine Translation task. The input to the encoder is a sequence of words (a sentence) in one language. Positional encodings are added to the input sequence to capture the relative position of each word in the sequence. Positional encodings have the same dimensions as the input, d = 512, and can be learned or pre-defined, e.g., by sine and cosine functions. Being an auto-regressive model, the decoder of the Transformer [1] uses previous predictions to output the next word in the sequence. The decoder, therefore, takes inputs from the encoder as well as the previous outputs to predict the next word of the sentence in the translated language. To facilitate residual connections, the output dimensions of all layers are kept the same, i.e., d = 512. The dimensions of the query, key and value weight matrices in multi-head attention are set to dq = 64, dk = 64, dv = 64.

2.4 Bidirectional Representations

The training strategy of the original Transformer model [1] could only attend to the context on the left of a given word in the sentence. This is limiting, since for most language tasks, contextual information from both the left and right sides is important. Bidirectional Encoder Representations from Transformers (BERT) [3] proposed to jointly encode the right and left context of a word in a sentence, thus improving the learned feature representations for textual data in a self-supervised manner. To this end, BERT [3] introduced two pretext tasks to pre-train the Transformer model [1] in a self-supervised manner: Masked Language Model and Next Sentence Prediction. For adapting the pre-trained model to downstream tasks, a task-specific additional output module is appended to the pre-trained model, and the full model is fine-tuned end-to-end. Here, we briefly touch upon these pretext tasks. (1) Masked Language Model (MLM) - A fixed percentage (15%) of words in a sentence are randomly masked and the model is trained to predict these masked words using a cross-entropy loss. In predicting the masked words, the model learns to incorporate the bidirectional context. (2) Next Sentence Prediction (NSP) - Given a pair of sentences, the model predicts a binary label, i.e., whether the pair is valid from the original document or not. The training data for this can easily be generated from any monolingual text corpus. A pair of sentences A and B is
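As an illustration of the MLM pretext task just described, the following is a minimal sketch of the input-corruption step. The 15% masking rate follows BERT [3], but the token ids, the [MASK] id and the ignore index are placeholders; BERT additionally replaces a fraction of masked positions with random or unchanged tokens, which is omitted here for brevity.

```python
import random

MASK_ID = 103          # placeholder id for the [MASK] token
IGNORE_INDEX = -100    # positions that do not contribute to the loss

def mask_tokens(token_ids, mask_prob=0.15):
    """Return (corrupted_input, labels) for masked language modeling.
    Only masked positions keep their original id as the label; the rest
    are ignored by the cross-entropy loss."""
    inputs, labels = [], []
    for tid in token_ids:
        if random.random() < mask_prob:
            inputs.append(MASK_ID)
            labels.append(tid)           # predict the original token here
        else:
            inputs.append(tid)
            labels.append(IGNORE_INDEX)  # no loss at unmasked positions
    return inputs, labels
```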
Fig. 4: A taxonomy of self-attention design space. Existing approaches based on self-attention explore single-head or multi-head
(transformer) designs for vision tasks. We note that interesting efforts have been made to utilize knowledge from convolution
based architectures to improve ViTs (e.g., multi-scale and hybrid designs). We categorize the upcoming sections of this survey
according to the types of self-attention block (left tree diagram) as well as the prominent tasks in computer vision (right).
fixed across positions, while a non-local operation such as introduced in [69] is a bottom-up method as it aggregates input features over the full image. The local relation layer belongs to the category of bottom-up methods but it is restricted to a fixed window size, e.g., a 7x7 neighborhood.

Bello et al. [79] explore the possibility of employing self-attention as an alternative to convolutional operators. They employ the relative position encoding [80] in two dimensions to develop a new self-attention mechanism that maintains translation equivariance, a desirable property for handling images. Although this self-attention provides competitive results as a stand-alone computational primitive, the best performance is obtained in combination with convolutional operations. The authors show that attention augmentation leads to systematic performance gains in image classification and object detection for different architectures.

3.1.2 Self-Attention as Stand-alone Primitive

As discussed above, convolutional layers possess translation equivariance but can not scale with a large receptive field, and therefore can not capture long-range interactions [81]. On the other hand, global attention [1], which attends to all spatial locations of the input, can be computationally intensive and is preferred on down-sampled small images, image patches [11] or for augmenting the convolutional feature space [79]. Ramachandran et al. [81] proposed to replace convolutional layers in deep neural networks with a local self-attention layer which can be applied to small or large inputs without increasing the computational cost. At a basic level, the proposed self-attention layer [81] considers all pixel positions in a specific window size around a given pixel, computes queries, keys and value vectors for these pixels, and then aggregates the spatial information within this window. The value vectors are aggregated after projecting the softmax score of queries and keys. This process is repeated for all given pixels and the responses are concatenated to produce the output pixel. ResNet models with a local self-attention layer can solve ImageNet classification and COCO object detection with fewer parameters as compared to ResNet models based on convolutional layers [81].

Zhao et al. [82] note that a traditional convolution operator performs feature aggregation and transformation jointly (by applying a filter and then passing it through a non-linearity). In contrast, they propose to perform feature aggregation separately with self-attention followed by transformation using an element-wise perceptron layer. For feature aggregation, they propose two alternate strategies: (a) pairwise self-attention and (b) patch-wise self-attention. The pairwise self-attention is a permutation and cardinality invariant operation, while the patch-wise self-attention does not have such invariance properties (similar to convolution). Both pairwise and patch-wise self-attentions are implemented as a vector attention [82] that learns weights for both the spatial and channel dimensions. This provides an alternate approach for attention that is conventionally performed using scalar weights (by taking a dot-product). The pairwise self-attention is a set operator that computes a vector attention keeping in view the relationships of a particular feature with its neighbors in a given local neighborhood. In contrast, patch-wise self-attention is a generalization of the convolution operator (not a set operator) and looks at all the feature vectors in the local neighbourhood when deriving the attention vectors. The authors show that with considerably fewer parameters, self-attention networks (SAN) can beat ResNet baselines on the ImageNet dataset. They further show robustness against adversarial perturbations [83], [84] and generalization to unseen transformations [85]. This behaviour is due to the dynamic nature of attention that makes it difficult for the adversary to calculate useful fooling directions.

3.2 Multi-head Self-Attention (Transformers)

Unlike the approaches discussed in Sec. 3.1, which insert self-attention as a component in CNN-inspired architectures, the Vision Transformer (ViT) [11] adapts the architecture of [1] (see Fig. 3), which cascades multiple Transformer layers. ViTs have gained significant research attention, and a number of recent approaches have been proposed which build upon ViTs. Below, we discuss these methods by categorizing them into: uniform-scale ViTs having single-scale features through all layers (Sec. 3.2.1), multi-scale ViTs that learn hierarchical features which are more suitable for dense prediction tasks (Sec. 3.2.2), and hybrid designs having convolution operations within ViTs (Sec. 3.2.3).

3.2.1 Uniform-scale Vision Transformers

The original Vision Transformer [11] model belongs to this family, where the multi-head self-attention is applied to a consistent scale in the input image and the spatial scale is maintained through the network hierarchy. We name such models the uniform-scale ViTs, as described below.

Vision Transformer (ViT) [11] (Fig. 6) is the first work to showcase how Transformers can 'altogether' replace standard convolutions in deep neural networks on large-scale image datasets. They applied the original Transformer model [1] (with minimal changes) on a sequence of image 'patches' flattened as vectors. The model was pre-trained on a large proprietary dataset (the JFT dataset [47] with 300 million images) and then fine-tuned to downstream recognition benchmarks, e.g., ImageNet classification. This is an important step since pre-training ViT on a medium-range dataset would not give competitive results, because CNNs encode prior knowledge about images (inductive biases, e.g., translation equivariance) that reduces the need for data, as compared to Transformers which must discover such information from very large-scale data. Notably, compared to the iGPT [19] model that also applied Transformers to full-sized images but performs training as a generative task, ViT pre-trains the model with a supervised classification task (although a self-supervision variant is also explored which results in lower performance).
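A minimal sketch of the ViT-style patch tokenization described above is given below; the patch size, embedding width and learnable tokens are illustrative choices rather than the exact ViT configuration.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping P x P patches, flatten each patch
    and project it to the Transformer width; prepend a class token and add
    learned position embeddings (a sketch of a ViT-style tokenizer)."""
    def __init__(self, img_size=224, patch=16, in_ch=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch) ** 2
        # A strided convolution is equivalent to flatten-and-project per patch.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):                                   # x: (B, 3, 224, 224)
        tokens = self.proj(x).flatten(2).transpose(1, 2)    # (B, 196, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)     # (B, 1, dim)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed
```

The resulting token sequence is then processed by a stack of standard Transformer encoder layers, with the class token used for classification.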
DeiT [12] is the first work to demonstrate that Transformers can be learned on mid-sized datasets (i.e., 1.2 million ImageNet examples compared to the 300 million images of JFT [11] used in ViT [11]) in relatively shorter training episodes. Besides using augmentation and regularization procedures common in CNNs, the main contribution of DeiT [12] is a novel native distillation approach for Transformers which uses a CNN as a teacher model (RegNetY-16GF [86]) to train the Transformer model. The outputs from the CNN aid the Transformer in efficiently figuring

3.2.3 Hybrid ViTs with Convolutions

Convolutions do an excellent job at capturing low-level local features in images, and have been explored in multiple hybrid ViT designs, especially at the beginning to "patchify and tokenize" an input image. For example, the Convolutional vision Transformer (CvT) [96] incorporates convolution-based projection to capture the spatial structure and low-level details, for the tokenization of image patches. CvT has a hierarchical design, where the number of tokens is progressively reduced while the token width is increased, thus imitating the effect of spatial downsampling as in CNNs. Convolution enhanced image Transformers [105] employ a convolution-based image-to-token module to extract low-level features. The Compact Convolutional Transformer (CCT) [106] introduces a new sequence pooling scheme, and incorporates convolutional blocks (conv-pool-reshape) for tokenization. CCT can be trained from scratch on smaller datasets, e.g., CIFAR10 with ∼95% accuracy, which is a remarkable property not possible with traditional ViTs.

LocalViT [107] introduces depthwise convolutions to enhance the local feature modeling capability of ViTs. LeViT [108] (its name inspired by LeNet [109]) applies a four-layered CNN block (with 3×3 convolutions) at the beginning, with progressively increasing channels (3, 32, 64, 128, 256). For a 3×224×224 input image, the resulting 256×14×14 output from the CNN block becomes the input to a hierarchical ViT. By virtue of its design, LeViT is 5× faster than EfficientNet [87] on CPU at inference. ResT [110] is another hierarchical architecture which applies a CNN block at the beginning for patch embedding. It incorporates depth-wise convolutions and adaptive position encoding to tackle varying image sizes. A recent approach, NesT [111], proposes a simple technique to introduce hierarchy in ViTs. NesT divides an image into non-overlapping blocks (each block is further split into patches). It first separately applies local self-attention on patches within each block, and then enables global interaction between blocks by aggregating them into an image space and applying a convolution operation, followed by downsampling. The number of blocks is gradually reduced along the hierarchy of the model, while the number of local patches is kept fixed. This simple scheme performs favorably compared with more sophisticated designs [36], [97], and enables training NesT on smaller datasets (e.g., CIFAR-10) from scratch.

Depthwise Convolution and self-Attention Networks (CoAtNets) [112] introduce a relative attention module (which combines depthwise convolutions and self-attention), and vertically stack convolution and attention layers. CoAtNets demonstrate an impressive 86% ImageNet top-1 accuracy without extra data (i.e., trained only on ImageNet-1k). Shuffle Transformer [95] performs self-attention within a window and has depth-wise convolutions between the window-based multi-head self-attention and the MLP. It introduces a shuffle operation to build stronger cross-patch connections. Co-scale conv-attentional image Transformers (CoaT) [98] is a hybrid hierarchical pyramid design with serial and parallel blocks, where the serial block is similar to a standard transformer block except for the attention layer being replaced with depthwise convolution. The parallel block is applied on the output of the serial blocks and encodes relationships between tokens at multiple scales using cross-attention. Twins [37] builds upon PVT [93] (an attention-only pyramid design) by replacing the absolute position embedding in PVT with relative conditional position embedding [113], and incorporating separable depth-wise convolutions instead of the standard spatial attention, to capture both local and global context of the image. In this sense, the hybrid designs tend to combine the strengths of both convolution and transformer models. TransCNN [114] proposes a hierarchical multi-head self-attention block, which first learns interactions within small grids (tokens) using self-attention, and then gradually merges the smaller grids into larger grids. The proposed block can then be plugged into existing CNN architectures.

3.2.4 Self-Supervised Vision Transformers

Contrastive learning based self-supervised approaches, which have gained significant success for CNN-based vision tasks, have also been investigated for ViTs. Chen et al. [115] evaluate different self-supervised frameworks and propose practical strategies, including MoCo v3 (extended from v1/v2 [116], [117]), for stabilized training of self-supervised ViTs. Xie et al. [118] combine MoCo v2 [117] and BYOL [119] to train DeiT [12] and SwinTransformer [36]. They demonstrate the generalization of self-supervised SwinTransformer to the dense prediction tasks of detection and segmentation. Self-distillation with no labels (DINO) [120] demonstrates that self-supervised ViTs can automatically segment the background pixels of an image, even though they were never trained using pixel-level supervision, a phenomenon otherwise not observed in CNNs or fully supervised ViTs. Efficient self-supervised vision transformer (EsViT) [121] proposes a multi-stage design, where neighboring tokens are gradually merged along the hierarchy of the network, and uses DINO for self-supervision. Apart from standard image-level self-supervision as in DINO, they incorporate additional patch-level self-supervision in which correspondence is promoted between similar patches within augmented versions of an image. EsViT demonstrates excellent performance under self-supervision settings, and its off-the-shelf features transfer better than supervised SwinTransformer on 17 out of 18 evaluated datasets.

3.3 Transformers for Object Detection

Transformer-based modules have been used for object detection in the following manner: (a) Transformer backbones for feature extraction, with an R-CNN based head for detection (see Sec. 3.2.2), (b) a CNN backbone for visual features and a Transformer-based decoder for object detection [13], [14], [122], [123] (see Sec. 3.3.1), and (c) a purely transformer-based design for end-to-end object detection [124] (see Sec. 3.3.2).

3.3.1 Detection Transformers with CNN Backbone

Detection Transformer (DETR) [13] treats object detection as a set prediction task, i.e., given a set of image features, the objective is to predict the set of object bounding boxes. The Transformer model enables the prediction of a set of objects (in a single shot) and also allows modeling their relationships. DETR adapts a set loss function which allows
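DETR's set loss hinges on a one-to-one bipartite matching between predictions and ground-truth boxes. A minimal sketch of such a matching step is shown below; the cost terms and the weighting are simplified placeholders rather than DETR's exact formulation, which additionally includes a generalized IoU term.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_probs, pred_boxes, gt_labels, gt_boxes, l1_weight=5.0):
    """pred_probs: (N, C) class probabilities, pred_boxes: (N, 4),
    gt_labels: (M,), gt_boxes: (M, 4). Returns matched (pred_i, gt_j) pairs."""
    # Classification cost: negative probability of the ground-truth class.
    cost_class = -pred_probs[:, gt_labels]                             # (N, M)
    # Box cost: L1 distance between predicted and ground-truth boxes.
    cost_box = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    cost = cost_class + l1_weight * cost_box
    pred_idx, gt_idx = linear_sum_assignment(cost)   # Hungarian algorithm
    return list(zip(pred_idx, gt_idx))
```

The final loss is then computed only over the matched pairs, with unmatched predictions supervised towards the "no object" class.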
does not use any image-specific knowledge in the design (e.g., the 2D position embeddings used in Image Transformer [142]). The features learned with iGPT's unsupervised training mechanism compete impressively against other unsupervised approaches, achieving state-of-the-art performance on the CIFAR-10/100 [148] and STL [149] datasets while performing comparably to SimCLR (a contrastive learning approach) [150] on the ImageNet dataset. This is an astounding result, since the iGPT architecture is exactly the same as that used for language modeling tasks, and therefore it does not incorporate any prior domain-specific knowledge. Notably, the competing unsupervised CNN-based solutions widely adopt such priors in the form of architectural design, attention mechanisms, loss functions, and regularization [117], [151]–[154]. However, on the downside, iGPT has a high compute cost, e.g., the iGPT-L version has roughly 36× higher training cost compared to MoCo [117], which is a state-of-the-art self-supervised feature learning approach. For this reason, training was generally limited to low resolutions of ≤ 64×64, while convolutional architectures can effectively learn from high-resolution inputs.

Transformers typically incur a high compute cost when applied on high-dimensional sequences. To overcome this limitation, Esser et al. [144] proposed to include inductive biases (commonly used in CNNs) alongside Transformers to improve their efficiency. Specifically, the local connectivity and spatial invariance biases built into the CNN structure are leveraged by learning a rich dictionary of visual patterns (using a Generative Adversarial approach). A Transformer is then used to learn the long-range interactions between the dictionary items to generate the outputs. In turn, they develop a conditional image generation model capable of producing very high-resolution images (up to the megapixel range) using Transformers. This is the first work that demonstrates the application of Transformers to generate such high-resolution images.

Generative Adversarial Networks (GANs) [54] with CNNs as the default backbone have been very successful for visually appealing image synthesis [155]–[157]. TransGAN [145] builds a strong GAN model, free of any convolution operation, with both the generator and discriminator based upon the Transformer model [1]. The architecture of both the generator and discriminator is based upon the encoder in the original Transformer model [1]. For memory efficiency, the generator contains multiple stages, with up-sampling modules in between, which gradually increase the resolution of the feature maps (input sequence length) while reducing the embedding dimension. The discriminator of TransGAN takes flattened image-patches as tokens, similar to [132]. The authors introduce different training techniques including data augmentation, training with an auxiliary task and injecting locality into self-attention to scale up their model for high-quality image synthesis [144]. The TransGAN model achieves state-of-the-art results in terms of Inception Score and Fréchet Inception Distance (FID) on STL-10 and performs favorably compared with CNN-based GAN counterparts on other datasets.

Unlike previous image generation methods [142]–[144], which directly predict image outputs, [23] learns to generate parameters of 3D objects to be placed in a given scene. Specifically, SceneFormer [23] studies the 3D room layout conditioned scene generation task. Given the empty room shape, [23] can propose new object configurations in the room while maintaining realism. Remarkably, the model does not use any appearance information and only learns to generate new scenes by modeling the inter-object relationships using self-attention in Transformers. Similar to how a Transformer operates on a sentence, it is applied to a sequence of objects to predict the next suitable object in a scene. Specifically, the size, pose, location, and category of the next object are predicted by the Transformer model. A start token indicates the initiation of inference and the number of output tokens indicates the objects generated by the model in a sequence. The authors also explore generating new scenes given a textual description of the room layout. The independence from appearance makes the approach efficient, enabling interactive scene generation.

The task of generating realistic images from text is interesting and practically valuable (e.g., for artistic content creation), but at the same time highly challenging. Prior text-to-image synthesis approaches [158]–[161] are mostly based on GANs [54]. Although these methods produce encouraging results, they are far from being photo-realistic. Ramesh et al. [20] recently proposed DALL·E, a Transformer model capable of generating high-fidelity images from a given text description. The DALL·E model has 12 billion parameters and is trained on a large set of text-image pairs taken from the internet. Before training, images are first resized to 256×256 resolution, and subsequently compressed to a 32×32 grid of latent codes using a pre-trained discrete variational autoencoder [162], [163]. DALL·E takes as input a single stream of 1280 tokens (256 for the text and 1024 for the image), and is trained to generate all other tokens autoregressively (one after another). It provides the flexibility to generate images either from scratch (Fig. 10a) or by extending existing images (Fig. 10b), while staying faithful to the text caption.

The authors demonstrate the effectiveness of DALL·E by creating images from text describing a wide variety of real and fictional concepts. While generating images purely from textual captions, DALL·E shows impressive performance at controlling multiple objects and their attributes (Fig. 10c), rendering certain viewpoints (Fig. 10d), capturing an object's internal structure (Fig. 10e), and combining unrelated objects (Fig. 10f). Furthermore, DALL·E can perform image-to-image translation (Fig. 10g) guided by the input text.
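Both iGPT and DALL·E reduce image generation to next-token prediction over a discrete sequence. A minimal sketch of that sampling loop is shown below; the model interface, codebook and prompt handling are hypothetical placeholders (here, `model` is assumed to map a (1, T) token tensor to (1, T, vocab) logits).

```python
import torch

@torch.no_grad()
def generate_image_tokens(model, text_tokens, num_image_tokens=1024):
    """Autoregressively sample image tokens conditioned on text tokens."""
    seq = list(text_tokens)                        # e.g., 256 text tokens
    for _ in range(num_image_tokens):              # e.g., a 32x32 latent grid
        logits = model(torch.tensor(seq).unsqueeze(0))[0, -1]   # next-token logits
        probs = torch.softmax(logits, dim=-1)
        seq.append(torch.multinomial(probs, 1).item())          # sample one code
    return seq[len(text_tokens):]                  # the generated image codes
```

The sampled latent codes would then be decoded back to pixels by the discrete variational autoencoder mentioned above.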
3.6 Transformers for Low-level Vision

After witnessing the success of Transformer models on high-level vision problems, numerous Transformer-based methods have been proposed for low-level vision tasks, including image super-resolution [16], [19], [164], denoising [19], [165], deraining [19], [165], and colorization [24]. Image restoration requires pixel-to-pixel correspondence from the input to the output images. One major goal of restoration algorithms is to preserve desired fine image details (such as edges and texture) in the restored images. CNNs achieve this by employing a single-scale architecture design that does not involve any downsampling operation. Since the computational complexity of self-attention in Transformer models increases quadratically with the number of image patches, it is infeasible to develop a Transformer model that operates on a single-scale feature processing pipeline. Consequently, Transformer-based image restoration models make use of various strategies to reduce the computational burden, such as computing attention on local image windows [164], performing spatial reduction attention [166], and employing an encoder-decoder design [19], [165]. Here, we briefly discuss a few image restoration Transformer models.

3.6.1 Transformers for Image Processing Tasks

Top-performing algorithms for high-level computer vision tasks such as object detection and semantic segmentation often employ backbone models that are pre-trained on large-scale datasets, e.g., ImageNet. In contrast, algorithms for low-level vision tasks such as image denoising, super-resolution, and deraining are directly trained on task-specific data, and thereby suffer from these limitations: (i) the small number of images available in task-specific datasets (e.g., the commonly used DIV2K dataset for image super-resolution contains only 2000 images), and (ii) the model trained for one image processing task does not adapt well to other related tasks.

Chen et al. [19] propose a pre-trained model based on the Transformer architecture, named the Image Processing Transformer (IPT). It is capable of performing various image restoration tasks such as super-resolution, denoising, and deraining. The overall architecture of IPT consists of multi-heads and multi-tails to deal with different tasks separately, and a shared encoder-decoder Transformer body. Since exploiting Transformers at their full potential requires training on large-scale data, [19] takes the clean (ground-truth) images from the ImageNet benchmark and synthesizes their degraded versions for the different tasks. For example, bicubic interpolation is used for generating low-resolution images, additive white Gaussian noise is added to prepare noisy data, and hand-crafted rain streaks are applied to obtain rainy images. In total, 10 million images are used to pre-train the IPT model. During training, each task-specific head takes as input a degraded image and generates visual features. These feature maps are divided into small crops and subsequently flattened before being fed to the Transformer encoder (whose architecture is the same as [1]). The outputs of the encoder, along with the task-specific embeddings, are given as input to the Transformer decoder. The features from the decoder output are reshaped and passed to the multi-tail that yields the restored images. The IPT model is optimized with an L1 loss. Experimental results show that the pre-trained IPT model, when fine-tuned for a specific low-level vision task, can provide significant performance gains over the state-of-the-art methods [167]–[169].

3.6.2 Transformers for Super-Resolution

Recent years have seen major performance breakthroughs in super-resolution (SR) due to convolutional neural networks (CNNs). Principally, the quality of super-resolved images generated by CNNs is dependent on the choice of optimization objective. While the SR methods [167], [170]–[173] that are based on pixel-wise loss functions (e.g., L1, MSE, etc.) yield impressive results in terms of image fidelity metrics such as PSNR and SSIM, they struggle to recover fine texture details and often produce images that are overly smooth and perceptually less pleasant. Further, perceptual SR approaches [52], [174]–[177], in addition to a per-pixel loss, employ an adversarial loss [54] and a perceptual loss [178] based on deep features extracted from pre-trained CNNs. While these methods generate images that are sharp, visually pleasant, and perceptually plausible, they show a substantial decrease in reconstruction accuracy measured in PSNR/SSIM. Moreover, the perceptual SR algorithms have a tendency to hallucinate fake textures and cause artifacts. The above-mentioned SR approaches follow two distinct (but conflicting) research directions: one maximizing the reconstruction accuracy and the other maximizing the perceptual quality, but never both.

To alleviate the trade-off between perceptual reproduction and accurate reproduction, Yang et al. [16] propose a Transformer network (TTSR) for super-resolution. During training, TTSR uses paired LR-HR images, as well as reference (Ref) images with content similar to the LR images. TTSR learns to search relevant regions in the Ref image and transfers rich textures to help super-resolve the input LR image. The texture Transformer module of the TTSR method (see Fig. 11) consists of four core components: (1) Learnable texture extractor: takes as input the LR↑, Ref↓↑, and Ref images, and generates texture features query (Q), key (K), and value (V), respectively. Here, ↑ denotes a bicubic upsampling operation, and ↓↑ represents bicubic down-sampling followed by an upsampling operation. (2) Relevance embedding: first unfolds Q and K into patches and then computes the similarity of each patch in Q with each patch in K in order to generate hard and soft attention maps. (3) Hard-attention: transfers HR texture features from V to (LR features) Q using the hard attention map. (4) Soft-attention: further enhances relevant features while suppressing less relevant ones.

While the TTSR [16] method deals with reference-based image super-resolution, most of the research is conducted
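The degradation pipeline used to synthesize IPT-style pre-training pairs can be sketched as follows; the scale factor, noise level and the omission of the rain-streak model are our simplifications of the procedure described above.

```python
import numpy as np
from PIL import Image

def synthesize_pairs(clean, scale=4, sigma=25):
    """clean: HxWx3 uint8 array. Returns (low_res, noisy) degraded inputs
    paired with the clean ground truth, mimicking the bicubic downsampling
    and additive white Gaussian noise corruptions (rain streaks omitted)."""
    h, w = clean.shape[:2]
    img = Image.fromarray(clean)
    low_res = img.resize((w // scale, h // scale), Image.BICUBIC)  # for SR
    noise = np.random.normal(0.0, sigma, clean.shape)              # for denoising
    noisy = np.clip(clean.astype(np.float32) + noise, 0, 255).astype(np.uint8)
    return np.array(low_res), noisy
```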
Fig. 12: An overview of Transformer models used for multi-modal tasks in computer vision. The Transformer designs in this
category can be grouped into single-stream (UNITER [43], OSCAR [44], VideoBERT [17], Unicoder-VL [180], VisualBERT [63] and
VL-BERT [22]) and dual-stream architectures (LXMERT [21], ViLBERT [181] and PEMT [182]). A key distinction between models
is the choice of loss functions. While most of the multi-modal methods are focused on images as visual data, VideoBERT [17] and
PEMT [182] are designed to work on video streams and leverage unique modalities e.g., audio signals in videos [182].
to reconstructing masked words in text in the BERT model [3]). This way, the model learns the inherent structure in the data during pre-training and also models cross-domain associations. With evaluations on several tasks, [17] demonstrated that a two-stream model can perform better than a single-stream model that uses shared parameters to model both the language and vision domains [17].

Similar to ViLBERT [181], Learning Cross-Modality Encoder Representations from Transformers (LXMERT) [21] also uses a two-stream architecture based on the BERT framework. The main difference lies in the object-relationship encoder that is used to model the visual features, instead of the simple image-level features used in ViLBERT. The information in the two streams is then fused across modalities using cross-attention blocks similar to [181].

Compared to the two pretext tasks used for VLP in [181], LXMERT uses five pre-training tasks including masked object and language prediction, cross-modality matching, and visual question answering (Fig. 12-g). The pre-trained model is fine-tuned on the VQA task; however, the high similarity between the pre-training and fine-tuning tasks raises questions on the generalizability of the learned representations to new tasks. To this end, the authors conducted generalization experiments on the Visual Reasoning for Real (NLVR) task [186], demonstrating impressive improvements on novel tasks.
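The cross-attention fusion used by the two-stream models above can be sketched with a single PyTorch layer in which one modality provides the queries and the other the keys and values; the dimensions and single-block setup are illustrative, not the configuration of any specific model.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """One co-attention block: language tokens attend to visual tokens
    (queries from text, keys/values from vision), as in two-stream designs."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, visual_tokens):
        # text_tokens: (B, T, dim), visual_tokens: (B, V, dim)
        fused, _ = self.attn(query=text_tokens, key=visual_tokens,
                             value=visual_tokens)
        return self.norm(text_tokens + fused)   # residual connection + norm
```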
Lee et al. [182] note that multi-modal representation learning approaches like VideoBERT [17] and ViLBERT [181] generally keep the language processing part fixed to a pre-trained model (e.g., BERT [3]) to reduce training complexity. For the first time in the literature, they propose to learn an end-to-end multi-modal bidirectional Transformer model, called PEMT, on audio-visual data from unlabeled videos. First, short-term (e.g., 1-3 seconds) video dynamics are encoded using CNNs, followed by a modality-specific Transformer (audio/visual) to model long-term dependencies (e.g., 30 seconds). A multi-modal Transformer is then applied to the modality-specific Transformer outputs to exchange information across the visual-linguistic domains. However, learning such a model in a naive form would incur huge memory requirements. To reduce the parametric complexity, the parameters are shared across layers within each Transformer, which leads up to an 80% parameter reduction. The Transformer is trained using a contrastive learning approach based on content-aware negative sampling (Fig. 12-i). Specifically, the model uses the features obtained from the CNNs learned during the training phase to select negative samples that are visually similar to the positive instances. This work also compares various fusion strategies adopted in earlier works, such as early (VideoBERT [17] and VL-BERT [22]), mid-level (ViLBERT [181] and LXMERT [21]) and late fusion mechanisms, and shows that mid-level fusion is the optimal choice. The proposed model is pre-trained on the Kinetics-700 [187] dataset and later fine-tuned on downstream video classification tasks such as short video classification on UCF101 [188], audio classification on ESC50 [189] and long-term action recognition on the Charades [190] and Kinetics-Sounds [65] datasets.

Tan and Bansal [191] introduce the concept of 'vokens' (images related to language tokens extracted from sentences). The vokens (visualized tokens) provide visual supervision to the language model to learn better features. The motivation is that humans learn languages by correlating visual information with semantic concepts. In a similar spirit to other self-supervised language representation learning methods [3], [181], they learn representations by defining an auxiliary voken-prediction task. Since the existing datasets encode limited visually grounded tokens, they propose a vokenization method to map language tokens to visual vokens, as illustrated in Fig. 13. The approach uses language-based retrieval for such a mapping and transfers a model trained on a small labeled dataset (MS-COCO) to a large dataset (Wikipedia). Furthermore, it is ensured that the sentence-wide context is considered to obtain the token-voken mapping. The resulting model trained using the generated tokens outperforms the state-of-the-art BERT model on a diverse set of NLP tasks. In this sense, the proposed model does not evaluate vision tasks; however, it uses vision as a useful grounding cue to train the language model, hence we include it in the multi-modal representation learning group.

Vision-and-Language Navigation (VLN) aims to predict a navigation plan on a map based on vision and language inputs. Transformer models were used earlier in [192], [193] for the VLN task. These works first pre-train a cross-modal Transformer using self-supervision on vision and language pairs and subsequently fine-tune it on the specific VLN tasks. While these works learn attention between image regions and language, Chen et al. [194] propose to learn cross-modal attention between language inputs and spatial topological maps (which represent an agent's environment as a graph whose nodes denote places and whose edges denote their connectivity). Given the topological map and natural language inputs, a VLN task using the Transformer model bears resemblance to sequence prediction in NLP. Specifically, at each time instance, the cross-modal Transformer predicts a single node of the topological map in the navigation plan. The individual language and map encodings are first processed using uni-modal encoders and later a cross-modal encoder (similar to LXMERT [21]) is applied to aggregate information across modalities. To denote positions in the map, a learned trajectory position encoding is appended to the map features. Based on this Transformer setup, [194] reports a full navigation system that can freely explore the environment and intelligently plan its actions.

CLIP [195] is a contrastive approach to learn image representations from text, with a learning objective that maximizes the similarity of correct text-image pair embeddings within a large batch. Specifically, given a batch of N image-text pairs, CLIP learns a multi-modal embedding space by jointly training an image encoder and a text encoder, such that the cosine similarity of the N valid image-text pairs is maximized, while that of the remaining N²−N pairs is minimized. The authors consider ResNet-50 [67] and the Vision Transformer (ViT) [132] for encoding images. The modified Transformer model [1], as in [5], is employed for encoding text. CLIP is trained on a large corpus of 400 million image-text pairs and demonstrates excellent zero-shot transfer capabilities. At inference, the names of the classes are used as input to the text encoder, and the similarity of the encoded image is computed with all encoded texts (classes) to find the image-text pair with the highest match. CLIP achieves an astounding zero-shot classification accuracy of 75% on ImageNet, without using any supervision from the ImageNet training set. The authors further demonstrate the zero-shot transfer capabilities of the CLIP model on 30 different computer vision benchmarks. Note that CLIP with ResNet took 18 days to train on 592 V100 GPUs, while CLIP with ViT took 12 days on 256 V100 GPUs. This highlights the computational cost of CLIP.

3.7.2 Single-stream Transformers

Different from two-stream networks like ViLBERT [181] and LXMERT [21], VisualBERT [63] uses a single stack of Transformers to model both domains (images and text). The input sequence of text (e.g., a caption) and the visual features corresponding to the object proposals are fed to the Transformer, which automatically discovers relations between the two domains. Notably, the VisualBERT architecture is somewhat similar to VideoBERT [17] (explained in Sec. 3.8), but instead of only focusing on cooking videos, VisualBERT evaluates on various visual-linguistic tasks (e.g., VCR, NLVR, VQA, and visual grounding). The VisualBERT model first applies task-agnostic pre-training using two objectives (Fig. 12-e). The first objective simply attempts to predict missing text tokens using the image features and the remaining textual tokens. The second objective attempts to differentiate between the true and false caption of a given image. After the task-agnostic pre-training, the authors propose to perform task-specific pre-training to bridge the domain gap before the final fine-tuning to the downstream task.

Su et al. [22] propose a multi-modal pre-training approach to learn features that are generalizable to multi-modal downstream tasks such as Visual Commonsense Reasoning and Visual Question Answering. This endeavor requires adequately aligning the visual and linguistic cues so that an effective composite representation is learned. To this end, [22] builds on the BERT model and inputs both the visual and language features. The language features correspond to the tokens in the input sentence and the visual features correspond to the regions of interest (RoI) from the input image (obtained via a standard Faster R-CNN). Specifically, the model is pre-trained on both a visual-lingual dataset (Conceptual Captions [196]) as well as language-only datasets (e.g., Wikipedia). The loss function is identical to BERT, where the model is trained to predict the masked-out words or visual RoIs (Fig. 12-f). In contrast to other works such as UNITER [43], VL-BERT claims that the visual-linguistic matching tasks are not useful during pre-training, which is in contrast to evidence from later efforts
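For concreteness, the symmetric contrastive objective that CLIP optimizes over a batch of N image-text pairs (Sec. 3.7.1) can be sketched as follows; the encoders producing the embeddings and the temperature value are placeholders.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (N, d) embeddings of N matched image-text pairs.
    The N diagonal pairs are pulled together; the N^2 - N off-diagonal
    pairs act as negatives."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature      # cosine similarities
    targets = torch.arange(image_emb.shape[0])
    loss_i = F.cross_entropy(logits, targets)            # image -> text
    loss_t = F.cross_entropy(logits.t(), targets)        # text -> image
    return (loss_i + loss_t) / 2
```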
using a loss which predicts a uniform distribution over all sentence is temporally aligned with the sequence of visual
relevant text query tokens specific to the predicted bounding tokens. Further, the learned representations are shown to be
boxes. An additional contrastive loss term ensures corre- very useful for downstream tasks such as action classifica-
spondence between visual and text embedding. TransVG tion, zero-shot classification, and video captioning.
[204] is a simple design, where visual and text features are Zhou et al. [210] explore Masked Transformers for dense
fused together in a transformer module, and the bounding- video captioning. This requires generating language de-
box corresponding to the query is directly regressed us- scriptions for all events occurring in a video. Existing works
ing a learnable token (input to the Transformer module, on this problem generally operate sequentially i.e., first
along-with visual and text features). Referring Transformer detect events and then generate captions in separate sub-
[205] is also a simple one stage design where the text blocks. [210] proposes a unified Transformer network to
and image features are fused in a Transformer encoder, tackle both tasks jointly, thereby seamlessly integrating the
and the Transformer based decoder then directly regresses multi-modal tasks of event detection and captioning. First, a
bounding boxes or segmentation masks. Visual Grounding video encoder is used to obtain frame-wise representations
with Transformer [206] has an encoder-decoder architecture, followed by two decoder blocks focused on proposing the
where visual tokens (features extracted from a pretrained video events and the captions. Since untrimmed videos are
CNN model) and text tokens (parsed through an RNN considered, a masking network is used in the captioning
module) are processed in parallel with two distinct branches decoder to focus on describing a single event proposal.
in the encoder, with cross-modality attention to generate Remarkably, [210] was the first approach to target dense
text-guided visual features. The decoder then computes video captioning using non-recurrent models and used self-
attention between the text queries and visual features and attention in the encoder(applied on CNN derived features)
predicts query-specific bounding boxes. to model broad range context between video frames. Ex-
periments on ActivityNet Captions [214] and YouCookII
[215] datasets showed good improvements over previous
3.8 Video Understanding
recurrent network and two-stage based approaches.
Existing approaches for audio-video data analysis generally
learn representations on short-length videos (up to a few 3.8.2 Video Action Recognition
seconds long), that allow them to encode only short-range The traditional CNN based methods in video classification
dependencies [1], [32]. Long-range dependency modeling is generally perform 3D spatio-temporal processing over lim-
desirable in various uni-modal and multi-modal learning ited intervals to understand videos. Neimark et al. [211]
tasks such as activity recognition [71], [187], [207]–[209]. Below, we explain recent approaches that seek to resolve this challenge using the expressivity of Transformer networks. It is important to note that several of these works [17], [18], [182], [210] still employ (pretrained) CNNs to encode image/frame-level features in the videos, on top of which Transformers are applied to model the wider context. A few exceptions include [209], [211]–[213], which obtain frame-level features using ViT-based backbones as well.

3.8.1 Joint Video and Language Modeling
The VideoBERT [17] model leverages Transformer networks and the strength of self-supervised learning to learn effective multi-modal representations. Specifically, VideoBERT uses the prediction of masked visual and linguistic tokens as a pretext task (Fig. 12-c). This allows modeling high-level semantics and long-range temporal dependencies, which are important for video understanding tasks. Given a video, [17] converts speech to text using off-the-shelf speech recognition systems and applies vector quantization (clustering) to obtain visual features from pre-trained video classification models. The BERT model is then directly applied to the concatenated sequences of language and visual tokens to learn their joint distribution. The model can be trained in text-only, video-only, and video+text regimes. The resulting model showcases interesting capabilities for cross-modal predictions such as video generation from a given textual input (e.g., captions or a cooking recipe) and (video-based) future forecasting. The video+text model uses a visual-linguistic alignment task to learn cross-modality relationships. The definition of this pretext task is simple: given the latent state of the [cls] token, the task is to predict whether the text is temporally aligned with the video.

The Video Transformer Network (VTN) first obtains frame-wise features using a 2D CNN and applies a Transformer encoder (Longformer [103]) on top to learn temporal relationships. Longformer is an attractive choice for processing long sequences (of arbitrary length n) due to its O(n) complexity. The classification token is passed through a fully connected layer to recognize actions or events. The advantage of using a Transformer encoder on top of spatial features is two-fold: (a) it allows processing a complete video in a single pass, and (b) it considerably improves training and inference efficiency by avoiding expensive 3D convolutions. This makes VTN particularly suitable for modeling long videos where interactions between entities are spread throughout the video length. Experiments on the Kinetics-400 dataset [71] with various backbones (ResNet [67], ViT [11] and DeiT [12]) show competitive performance.

Girdhar et al. [18] use a variant of the Transformer architecture to aggregate person-specific contextual cues in a video for action classification and localization. Initially, the model uses Faster R-CNN [125] style processing, where a backbone model generates features that are forwarded to a Region Proposal Network to obtain object proposals, and RoI pooling is then applied to generate object-specific features. Multi-head self-attention [1] is then applied on top of the object features as a cascade of self-attention layers. In each Transformer unit, a particular person feature is treated as the 'query' (Q), while the features from the neighboring video clip are used as 'key' (K) and 'value' (V). The location information is explicitly encoded in the input feature map from which K, V and Q are derived, thus incorporating positional information into the self-attention.
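To make the query/key/value roles just described concrete, the following is a minimal sketch of a single attention step in which one person feature attends over the features of a video clip. The feature dimension, the number of clip locations and the random projections are illustrative placeholders, not the configuration used in [18].

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative shapes (hypothetical, not those of the actual model):
# one person RoI feature acts as the query, while the spatio-temporal
# clip features provide the keys and values.
d = 128                                          # feature dimension
person_feat = np.random.randn(1, d)              # query: a single person feature
clip_feats = np.random.randn(16 * 25 * 25, d)    # keys/values: clip features

# Learned projections (randomly initialised here, for the sketch only).
Wq, Wk, Wv = (np.random.randn(d, d) * d ** -0.5 for _ in range(3))

Q = person_feat @ Wq
K = clip_feats @ Wk
V = clip_feats @ Wv

attn = softmax(Q @ K.T / np.sqrt(d))   # (1, num_clip_locations) attention map
updated_person = attn @ V              # context-aggregated person feature, (1, d)
```

In the actual model several such units are cascaded, so the updated person feature becomes the query of the next self-attention layer.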
also included transition down/up blocks to reduce/increase the number of points in the input (in a typical encoding-decoding pipeline style). The resulting architecture shows promising results on the 3D classification and segmentation tasks.

The Point Cloud Transformer (PCT) [231] is a parallel work to [230] and is motivated by the permutation-invariance property of Transformers. However, compared to [230], it is more directly based on the conventional Transformer architecture [1] and does not involve vector attention. The key modifications include a 3D coordinate-based position encoding, an offset-attention module, and a neighbor embedding that encodes local 3D structure in point clouds. Specifically, the offset-attention layer calculates the difference between the self-attended features and the input features using element-wise subtraction. The local neighbor embedding simply finds self-attention relationships among a group of points instead of individual 3D points. Explicitly incorporating local neighbourhood information makes this a more efficient architecture compared to [230]. The method shows promising performance on 3D shape classification, normal estimation and segmentation tasks on the ModelNet40 [232] and ShapeNet [233] datasets.

The Mesh Transformer (METRO) [45] model targets 3D human pose and mesh reconstruction from a single 2D image. A key challenge here is to faithfully learn the non-local interactions between body joints and mesh vertices (e.g., hand and foot). The expressivity of the Transformer network is used to jointly model vertex-to-vertex relationships in a mesh as well as vertex-to-body-joint relationships. The self-attention mechanism can attend to any combination of vertices in the mesh, thereby encoding non-local relationships. The multi-layer Transformer architecture sequentially performs dimensionality reduction to map the 2D image to the 3D mesh. Position encoding is performed using the 3D coordinates (x, y, z) of each vertex and each body joint. Similar to masked language modeling in NLP, METRO uses masked vertex modeling (MVM), which randomly masks some percentage of the input queries (see Fig. 16). The Transformer is tasked with regressing all the joints and vertices, which helps encode inter-dependencies between them. METRO obtains state-of-the-art results on human mesh reconstruction on the Human3.6M [234] and 3DPW [235] datasets. Since the approach does not depend on a parametric mesh model, it generalizes well to other reconstruction tasks such as 3D hand reconstruction [236]. Overall, this is the first effort to employ Transformers for 3D human reconstruction tasks and it leads to fairly good results.

4 OPEN CHALLENGES & FUTURE DIRECTIONS
Despite excellent performance from Transformer models and their interesting salient features (Table 1), several challenges remain for their applicability to practical settings (Table 2). The most important bottlenecks are the requirement for large amounts of training data and the associated high computational costs. There have also been challenges in visualizing and interpreting Transformer models. In this section, we provide an overview of these challenges, mention some of the recent efforts to address those limitations, and highlight the open research questions.

4.1 High Computational Cost
As discussed in Sec. 1, a strength of Transformer models is their flexibility to scale to high parametric complexity. While this is a remarkable property that allows training enormous-sized models, it results in high training and inference cost (a detailed comparison between CNNs and ViTs is shown in Table 3). As an example, the BERT [3] base model (with 109 million parameters) took around 1.89 peta-flop days² for training, while the latest GPT3 [6] model (175 billion parameters) took around 3640 peta-flop days for training (a staggering ∼1925× increase). This comes with a huge price tag; e.g., according to one estimate [237], GPT3 training might have cost OpenAI 4.6 million USD. Additionally, these large-scale models require aggressive compression (e.g., distillation) to make them feasible for real-world settings.

². A peta-flop day is a measure of computation equal to performing 10^15 neural-network operations per second for one complete day.

An empirical study on the scalability of Vision Transformers with respect to the number of parameters (ranging from five million to two billion), the size of the training datasets (ranging from 30 million to three billion training images), and the compute budget (1-10000 TPU core-days) is presented in [238]. From this study, we can draw the following conclusions: (a) scaling up compute, model size and the number of training samples improves performance; (b) only large models (with more parameters) can benefit from more training data, while the performance of smaller models plateaus quickly and cannot leverage additional data. This indicates that large-scale models have the capacity to further enhance their representation learning capabilities. However, with the current designs, scaling up Transformer models is expensive and compute prohibitive, thus necessitating efficient designs.

In the language domain, recent works focus on reducing the high complexity of Transformer models (basically arising from the self-attention mechanism [1], where a token's representation is updated by considering all tokens from the previous layer). For example, [103], [245] explore selective or sparse attention to previous-layer tokens while updating each next-layer token. Linformer [38] reduces the complexity of the standard self-attention operation from O(n²) to O(n) (both in time and memory requirements). The main idea is to show that a low-rank matrix is sufficient to model the self-attention mechanism. The Reformer model [246] employed locality-sensitive hashing (LSH) to minimize the complexity of self-attention from O(n²) to O(n log n). In a similar pursuit, the recent Lambda Networks propose to model local context as a linear function, which helps reduce the complexity of self-attention [247]. These linear-function lambdas are applied to the input query to model contextual relationships between pixels.

Vyas et al. [248] developed an efficient cluster attention that approximates the original self-attention in order to deal with large input sequences. The cluster attention groups queries into clusters and then computes attention between cluster centers (instead of attention between all the queries, which leads to quadratic complexity). The main idea is that queries close in Euclidean space should have similar attention distributions. With a fixed number of clusters, this intuition helps reduce the quadratic complexity to a linear complexity of O(nc) with respect to the input sequence length n (where c is the number of clusters).
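As a rough illustration of this clustered-attention idea, the sketch below groups queries into a fixed number of clusters with a single k-means-style assignment step, computes attention only for the cluster centroids, and broadcasts the result back to the cluster members. The sequence length, dimensionality and cluster count are arbitrary toy values, and the actual method in [248] is considerably more refined.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n, d, c = 1024, 64, 16                 # sequence length, feature dim, clusters
Q = np.random.randn(n, d)
K = np.random.randn(n, d)
V = np.random.randn(n, d)

# One k-means-style step: pick c queries as centroids and assign members.
centroids = Q[np.random.choice(n, c, replace=False)]
assign = np.argmin(((Q[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
centroids = np.stack([Q[assign == j].mean(0) for j in range(c)])

attn = softmax(centroids @ K.T / np.sqrt(d))   # (c, n): one attention row per cluster
out_per_cluster = attn @ V                     # (c, d)
out = out_per_cluster[assign]                  # broadcast back to all n queries, (n, d)
# Cost: O(c * n) attention scores instead of O(n * n).
```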
Task | Method | Design Highlights (focus on differences with the standard form) | Input Data Type | Label Type | Loss
Image Classification | ViT [11] | Directly adopted the NLP Transformer encoder for images; mechanism to linearly embed image patches with positional embedding suitable for the encoder. | 2D Image | Class labels | Cross-entropy
Image Classification | DeiT [12] | Transformer as a student while a CNN acts as the teacher; distillation tokens to produce estimated labels from the teacher; attention between class and distillation tokens. | 2D Image | Class labels | Cross-entropy; distillation loss based on KL-divergence
Image Classification | CLIP [195] | Jointly train image and text encoders on image-text pairs to maximize the similarity of valid pairs and minimize it otherwise. | 2D Images & texts | Image-text pairs | Symmetric cross-entropy
Object Detection | DETR [13] | Linear projection layer to reduce CNN feature dimension; spatial positional embedding added to each multi-head self-attention layer of both encoder and decoder; object queries (output positional encoding) added to each multi-head self-attention layer of the decoder. | 2D Image | Class labels | Hungarian loss based on bipartite matching between predictions and ground truths
Object Detection | D-DETR [14] | Deformable Transformer consisting of deformable attention layers to introduce sparse priors in Transformers; multi-scale attention module. | 2D Image | Class labels | Hungarian loss
Low-Shot Learning | CT [25] | Self-supervised pretraining; query-aligned class prototypes that provide spatial correspondence between the support-set images and the query image. | 2D Image | Pretraining without labels and few-shot learning with class labels | Normalized cross-entropy
Image Colorization | ColTran [24] | Conditional row/column multi-head attention layers; progressive multi-scale colorization scheme. | 2D Image | 2D Image | Negative log-likelihood of the images
Action Recognition | ST-TR [216] | Spatial and temporal self-attention to operate on graph data such as joints in skeletons. | Skeleton | Action classes | Cross-entropy
Super-resolution | TTSR [16] | Texture-enhancing Transformer module; relevance embeddings to compute the relevance between the low-resolution and reference image. | 2D Image | 2D Image | Reconstruction loss; perceptual loss defined on pretrained VGG19 features
Multi-Modal Learning | Oscar [44] | Transformer layer to jointly process the triplet representation of image-text [words, tags, features]; masked tokens to represent text data. | 2D Image | Captions, class labels, object tags | Negative log-likelihood of masked tokens; contrastive binary cross-entropy
3D Classification/Segmentation | PT [230] | Point Transformer block; transition down block to reduce cardinality of the point set; transition up for dense prediction tasks. | CAD models, 3D object part segmentation | Object and shape categories | Cross-entropy
3D Mesh Reconstruction | METRO [45] | Progressive dimensionality reduction across Transformer layers; positional encoding with 3D joint and 3D vertex coordinates; masked vertex/joint modeling. | 2D Image | 3D mesh + human pose | L1 loss on mesh vertices and joints in 3D and 2D projection
Vision and Language Navigation | Chen et al. [194] | Uni-modal encoders on language and map inputs followed by a cross-modal transformer; trajectory position encodings in the map encoder. | Instruction text + RGBD panorama + topological environment map | Navigation plan | Cross-entropy over nodes and [stop] action
Referring Image Segmentation | CMSA [15] | Multimodal features; cross-modal self-attention on multiple levels and their fusion using learned gates. | 2D Image + language expression | Segmentation mask | Binary cross-entropy loss
Video Classification | Lee et al. [182] | Operates on real-valued audio-visual signals instead of tokens; contrastive learning for pre-training; end-to-end multi-modal transformer learning. | Audio-visual | Activity labels | Contrastive InfoNCE loss and binary cross-entropy

TABLE 1: A summary of key design choices adopted in different variants of transformers for a representative set of computer vision applications. The main changes relate to specific loss function choices, architectural modifications, different position embeddings and variations in input data modalities.
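Several of the loss choices summarized in Table 1 are compact enough to state in code. As one example, the following is a hedged sketch of a DeiT-style soft distillation objective (cross-entropy on the class token plus a KL term pulling the distillation token towards a CNN teacher's softened predictions); the weight `lam` and temperature `tau` are hypothetical values, not those reported in [12].

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(-1, keepdims=True)
    return x - np.log(np.exp(x).sum(-1, keepdims=True))

def deit_soft_distillation_loss(cls_logits, dist_logits, teacher_logits, y,
                                lam=0.5, tau=3.0):
    """Sketch of a soft distillation objective in the spirit of DeiT:
    cross-entropy on the class-token logits plus a temperature-scaled
    KL term between teacher and distillation-token logits."""
    ce = -log_softmax(cls_logits)[np.arange(len(y)), y].mean()
    p_teacher = np.exp(log_softmax(teacher_logits / tau))
    kl = (p_teacher * (log_softmax(teacher_logits / tau) -
                       log_softmax(dist_logits / tau))).sum(-1).mean()
    return (1 - lam) * ce + lam * (tau ** 2) * kl

# toy usage: batch of 4 samples, 10 classes
B, C = 4, 10
loss = deit_soft_distillation_loss(np.random.randn(B, C), np.random.randn(B, C),
                                   np.random.randn(B, C), np.random.randint(0, C, B))
```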
Task | Method | Metric | Dataset | Performance | Advantages | Limitations
Image Classification | ViT [11], ICLR'21 | Top-1 Acc. | ImageNet | 88.55 | a) First application of a Transformer (global self-attention) directly on image patches; b) convolution-free network architecture; c) outperforms CNN models such as ResNet. | a) Requires training on large-scale data, e.g., 300 million images; b) requires careful transfer learning to the new task; c) requires a large model with 632 million parameters to achieve SOTA results.
Image Classification | DeiT [12], arXiv'20 | Top-1 Acc. | ImageNet | 83.10 | a) Successfully trains a Transformer on ImageNet only; b) introduces an attention-based distillation method; c) produces competitive performance with small (86-million-parameter) Transformers. | Requires access to a pretrained CNN-based teacher model, thus performance depends on the quality of the teacher model.
Image Classification | Swin-T [36], arXiv'21 | Top-1 Acc. | ImageNet | 84.5 | a) Provides a general-purpose backbone for different vision tasks, e.g., classification, detection and segmentation; b) a hierarchical design using the shifted-windows operation. | a) Hard to train from scratch on smaller datasets; b) quadratic compute complexity inherent to the self-attention operation.
Low-Shot Learning | CT [25], NeurIPS'20 | Top-1 Acc. | ImageNet / COCO | 62.25 / 60.35 | a) Self-supervised pre-training mechanism that does not need manual labels; b) dynamic inference using the Transformer, achieving state-of-the-art results. | The proposed algorithm is limited in its capacity to perform on datasets that lack spatial details such as texture.
Object Detection | DETR [13], ECCV'20 | AP | COCO | 44.9 | a) Use of the Transformer allows an end-to-end training pipeline for object detection; b) removes the need for hand-crafted post-processing steps. | a) Performs poorly on small objects; b) requires long training time to converge.
Object Detection | D-DETR [14], ICLR'21 | AP | COCO | 43.8 | a) Achieves better performance on small objects than DETR [13]; b) faster convergence than DETR [13]. | Obtains SOTA results (52.3 AP) only with a two-stage detector design and test-time augmentations.
Image Colorization | ColTran [24], ICLR'21 | FID | ImageNet | 19.71 | a) First successful application of a Transformer to image colorization; b) achieves SOTA FID score. | a) Lacks end-to-end training; b) limited to images of size 256×256.
Action Recognition | ST-TR [216], arXiv'20 | Top-1 Acc. | NTU 60/120 | 94.0 / 84.7 | a) Successfully applies Transformers to model relations between body joints in both the spatial and temporal domains; b) achieves SOTA results. | The proposed Transformers do not process joints directly but rather operate on features extracted by a CNN, thus the overall model is based on a hand-crafted design.
Super-Resolution | TTSR [16], CVPR'20 | PSNR / SSIM | CUFED5; Sun80; Urban100; Manga109 | 27.1/0.8; 30.0/0.81; 25.9/0.78; 30.1/0.91 | a) Achieves state-of-the-art super-resolution by using attention; b) novel Transformer-inspired architectures that can process multi-scale features. | a) The proposed Transformer does not process images directly but features extracted by a convolution-based network; b) model with a large number of trainable parameters; c) compute intensive.
Multi-Modal Learning | ViLBERT [181], NeurIPS'19 | Acc. / mAP (R@1) | VQA [183] / Retrieval [239] | 70.6 / 58.2 | a) The proposed Transformer architecture can combine text and visual information to understand inter-task dependencies; b) achieves pre-training on unlabelled datasets. | a) Requires a large amount of data for pre-training; b) requires fine-tuning to the new task.
Multi-Modal Learning | Oscar [44], ECCV'20 | Acc. / mAP (R@1) | VQA [240] / COCO | 80.37 / 57.5 | a) Exploits a novel supervisory signal via object tags to achieve text and image alignment; b) achieves state-of-the-art results. | Requires extra supervision through pre-trained object detectors, thus performance is dependent on the quality of the object detectors.
Multi-Modal Learning | UNITER [43], ECCV'20 | Acc. / Avg. (R@1/5/10) | VQA [183] / Flickr30K [241] | 72.47 / 83.72 | Learns fine-grained relation alignment between text and images. | Requires large multi-task datasets for Transformer training, which leads to high computational cost.
3D Analysis | Point Transformer [230], arXiv'20 | Top-1 Acc. / IoU | ModelNet40 [232] | 92.8 / 85.9 | a) Transformer-based attention capable of processing unordered and unstructured point sets; b) permutation-invariant architecture. | a) Only moderate improvements over previous SOTA; b) large number of trainable parameters, around 6× higher than PointNet++ [242].
3D Analysis | METRO [45], arXiv'20 | MPJPE / PA-MPJPE / MPVE | 3DPW [235] | 77.1 / 47.9 / 88.2 | a) Does not depend on parametric mesh models, so it is easily extendable to different objects; b) achieves SOTA results using Transformers. | Dependent on hand-crafted network design.

TABLE 2: A summary of advantages and limitations of different Transformer-based methods in different tasks. (CT: Cross Transformers, AP: Average Precision, mAP: mean AP, IoU: Intersection over Union, FID: Fréchet inception distance, MPJPE: Mean Per Joint Position Error, MPVE: Mean Per Vertex Error).
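The Hungarian loss noted for DETR in Tables 1 and 2 rests on a bipartite matching between object queries and ground-truth boxes. The sketch below shows only that matching step, with illustrative cost terms and weights; it is not the exact cost formulation used in [13].

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(pred_probs, pred_boxes, gt_labels, gt_boxes, w_cls=1.0, w_box=5.0):
    """One-to-one assignment of predictions to ground truths via a cost matrix
    combining a classification term and an L1 box term (both simplified)."""
    cost_cls = -pred_probs[:, gt_labels]                       # (num_queries, num_gt)
    cost_box = np.abs(pred_boxes[:, None] - gt_boxes[None]).sum(-1)
    cost = w_cls * cost_cls + w_box * cost_box
    rows, cols = linear_sum_assignment(cost)                   # optimal bipartite matching
    return list(zip(rows, cols))

# toy usage: 5 object queries, 2 ground-truth objects, 4 classes
probs = np.random.dirichlet(np.ones(4), size=5)
matches = match(probs, np.random.rand(5, 4), np.array([1, 3]), np.random.rand(2, 4))
```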
Fig. 16: Mesh Transformer architecture. The joint and vertex queries are appended with positional embeddings and passed
through multiple self-attention layers to jointly regress 3D coordinates of joints and mesh vertices. Figure is from [45].
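To give a concrete flavour of the masked vertex modeling objective that Fig. 16 illustrates, the toy sketch below masks a random subset of the joint/vertex queries and regresses 3D coordinates for all of them with an L1 loss. The dimensions, mask ratio and the stand-in encoder are hypothetical and greatly simplified relative to METRO [45].

```python
import numpy as np

# Hypothetical sizes: a small set of joint queries plus coarse mesh-vertex queries,
# each query carrying template 3D coordinates concatenated with an image feature.
num_joints, num_vertices, d = 14, 431, 3 + 2048
queries = np.random.randn(num_joints + num_vertices, d)

mask_ratio = 0.3
masked = np.random.rand(len(queries)) < mask_ratio
queries_in = queries.copy()
queries_in[masked] = 0.0                       # mask selected queries

def encoder(x):
    # Stand-in for the multi-layer Transformer that maps each query to (x, y, z).
    return x @ np.random.randn(x.shape[1], 3)

pred_xyz = encoder(queries_in)                 # predictions for ALL joints and vertices
target_xyz = np.random.randn(len(queries), 3)  # placeholder ground truth
l1_loss = np.abs(pred_xyz - target_xyz).mean() # L1 on every joint and vertex
```

Regressing all outputs while only a subset of inputs is visible is what forces the model to encode inter-dependencies between joints and vertices.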
Method | #Param (M) | GFLOPs | Top-1 Acc (%)
ResNet18 [67]* | 11.7 | 1.8 | 69.8
EfficientNet-B3 [87]* | 12.0 | 1.8 | 81.6
DeiT-T [12] | 5.7 | 1.3 | 72.2
T2T-ViTt-7 [35] | 5.0 | 1.3 | 71.7
LocalViT-T [107] | 5.9 | 1.3 | 74.8
CrossViT-T [104] | 6.9 | 1.6 | 73.4
PVTv1-T [93] | 13.2 | 1.9 | 75.1
ResT-Lite [110] | 10.5 | 1.4 | 77.2
CaiT-XXS-24 [243] | 12.0 | 2.5 | 77.6
PVTv2-B1 [97] | 13.1 | 2.1 | 78.7
Lv-ViT-T [89] | 8.5 | – | 79.1
RegionViT-T [100] | 13.8 | 2.4 | 80.4
ResNet50 [67]* | 25.6 | 4.1 | 76.1
ResNeXt50-32x4d [244]* | 25.0 | 4.3 | 77.6
RegNetY-4G [86]* | 21.0 | 4.0 | 80.0
EfficientNet-B4 [87]* | 19.0 | 4.2 | 82.9
DeiT-S [12] | 22.1 | 4.6 | 79.9
PVTv1-S [93] | 24.5 | 3.8 | 79.8
LocalViT-S [107] | 22.4 | 4.6 | 80.8
CrossViT-S [104] | 26.7 | 5.6 | 81.0
TNT-S [88] | 23.8 | 5.2 | 81.3
Swin-T [36] | 29.0 | 4.5 | 81.3
NesT-T [111] | 17.0 | 5.8 | 81.5
T2T-ViTt-14 [35] | 21.5 | 5.2 | 81.5
CvT-13 [96] | 20.0 | 4.5 | 81.6
ResT-B [110] | 30.3 | 4.3 | 81.6
Twins-SVT-S [37] | 24.0 | 2.8 | 81.7
PVTv2-B2-Li [97] | 22.6 | 3.9 | 82.1
RegionViT-S [100] | 30.6 | 5.6 | 82.5
Lv-ViT-S [89] | 26.0 | 6.6 | 83.3
ResNet101 [67]* | 44.7 | 7.9 | 77.4
ResNeXt101-32x4d [244]* | 44.2 | 8.0 | 78.8
RegNetY-8G [86]* | 39.0 | 8.0 | 81.7
EfficientNet-B5 [87]* | 30.0 | 9.9 | 83.6
CvT-21 [96] | 32.0 | 7.1 | 82.5
CaiT-S-24 [243] | 32.2 | 9.4 | 82.7
T2T-ViTt-19 [35] | 39.0 | 9.8 | 81.4
PVTv1-M [93] | 44.2 | 6.7 | 81.2
PVTv2-B3 [97] | 45.2 | 6.9 | 83.2
NesT-S [111] | 38.0 | 10.4 | 83.3
ResNet152 [67]* | 60.2 | 11.6 | 78.3
CaiT-S-36 [243] | 48.0 | 13.9 | 83.3
T2T-ViTt-24 [35] | 64.0 | 15.0 | 82.2
PVTv1-L [93] | 61.4 | 9.8 | 81.7
TNT-B [88] | 66.0 | 14.1 | 82.8
Swin-S [36] | 50.0 | 8.7 | 83.0
Twins-SVT-B [37] | 56.0 | 8.3 | 83.2
RegionViT-B [100] | 72.7 | 13.0 | 83.3
PVTv2-B4 [97] | 62.6 | 10.1 | 83.6
ResNeXt101-64x4d [244]* | 83.5 | 15.6 | 79.6
RegNetY-16G [86]* | 84.0 | 16.0 | 82.9
EfficientNet-B6 [87]* | 43.0 | 19.0 | 84.0
NesT-B [111] | 68.0 | 17.9 | 83.8
ViT-B/16 [11] | 86.6 | 17.6 | 79.8
DeiT-B/16 [12] | 86.6 | 17.6 | 81.8
Swin-B [36] | 88.0 | 15.4 | 83.3
Twins-SVT-L [37] | 99.2 | 14.8 | 83.7
PVTv2-B5 [97] | 82.0 | 11.8 | 83.8
Lv-ViT-M [89] | 56.0 | 16.0 | 84.1

TABLE 3: A comparative analysis between different vision transformer and CNN models in terms of their parameter complexity and top-1 (%) accuracy on the ImageNet validation set. For a direct comparison, we consider models that are trained on ImageNet from scratch on inputs of size 224×224. * denotes pure CNN-based methods.
We refer interested readers to a survey on efficient Transformers in NLP [34].

Similar to the NLP domain, computer vision models also suffer from the high computational cost of Transformer models. For example, image generators based on sequence-based Transformers (e.g., iGPT) have a high compute cost, limiting their applicability to high-resolution inputs. The time and memory cost of the core self-attention operation in Transformers increases quadratically with the number of patches, i.e., O(n²) for n image patches (in some applications, e.g., low-level vision, n = H × W, where H and W denote the height and width of the image). This is a major drawback of existing Transformers that hinders their application to most tasks involving high-resolution (HR) images, such as object detection and segmentation (in high-level vision), and super-resolution, deblurring, denoising, etc. (in low-level vision). Numerous methods have been proposed that make special design choices to perform self-attention more 'efficiently', for instance employing pooling/downsampling in self-attention [97], [219], [249], local window-based attention [36], [250], axial attention [179], [251], low-rank projection attention [38], [252], [253], kernelizable attention [254], [255], and similarity-clustering based methods [246], [256].
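A quick back-of-the-envelope count shows why such approximations matter at high resolution; the resolution and patch size below are illustrative and not tied to any specific model in this survey.

```python
# Rough cost of full self-attention on a high-resolution image (illustrative numbers).
H = W = 1024          # input resolution
patch = 16            # ViT-style patch size
n = (H // patch) * (W // patch)     # 64 * 64 = 4096 tokens
attn_entries = n * n                # 16,777,216 attention scores per head, per layer
bytes_fp32 = attn_entries * 4       # ~64 MiB for a single fp32 attention map
print(n, attn_entries, bytes_fp32 / 2**20)
```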
However, almost all of these approaches either come with a trade-off between complexity and accuracy, require special hardware, or are still not applicable to very large images. Therefore, there is a pressing need to develop an efficient self-attention mechanism that can be applied to HR images on resource-limited systems without compromising accuracy. It will be interesting to explore how existing models can be extended to high-dimensional cases, e.g., using a multi-scale transformer design with somewhat local context modeling. By inducing inductive biases based on our understanding of visual learning tasks (e.g., spatial relationships in the local neighbourhood), the high computational cost can be reduced. Similarly, using sparse attention maps modeled with a low-rank factorization of the matrices can also help reduce the computational cost [211].

4.2 Large Data Requirements
Since Transformer architectures do not inherently encode inductive biases (prior knowledge) to deal with visual data, they typically require large amounts of training to figure out the underlying modality-specific rules. For example, a CNN has inbuilt translation invariance, weight sharing, and partial scale invariance due to pooling operations or multi-scale processing blocks. However, a Transformer network needs to figure out these image-specific concepts on its own from the training examples. Similarly, relationships between video frames need to be discovered automatically by the self-attention mechanism by looking at a large database of video sequences. This results in longer training times, a significant increase in computational requirements, and large datasets for processing. For example, the ViT [11] model requires hundreds of millions of image examples to obtain reasonable performance on the ImageNet benchmark dataset. The question of learning a Transformer in a data-efficient manner is an open research problem, and recent works report encouraging steps towards its resolution. For example, DeiT [12] uses a distillation approach to achieve data efficiency, while T2T (Tokens-to-Token) ViT [35] models local structure by combining spatially close tokens, leading to competitive performance when trained only on ImageNet from scratch (without pre-training). By incorporating CNN-like feature hierarchies in ViTs to effectively capture local image cues, ViTs (e.g., CCT [106], NesT [111]) can be trained from scratch even on small-scale datasets (e.g., CIFAR-10). Another approach to data-efficient training of ViTs is proposed in [257]. The authors show that by smoothing the local loss surface using the sharpness-aware minimizer (SAM) [258], ViTs can be trained with a simple data augmentation scheme (random crop and horizontal flip) [259], instead of employing compute-intensive strong data augmentation strategies, and can outperform their counterpart ResNet models.

4.3 Vision Tailored Transformer Designs
We note that most of the existing works focused on vision tasks tend to directly apply NLP Transformer models to computer vision problems. These include architectures designed for image recognition [11], video understanding [17] and especially multi-modal processing [181]. Although the initial results from these simple applications are quite encouraging and motivate us to look further into the strengths of self-attention and self-supervised learning, current architectures may still remain better tailored for language problems (with a sequence structure) and need further intuitions to make them more efficient for visual inputs. For example, the vector attention of [82] is a nice work in this direction, which attempts to specifically tailor the self-attention operation for visual inputs via learned channel-wise attention. Similarly, [260] uses a Jigsaw-puzzle-based self-supervision loss as a parallel branch in the Transformer to improve person re-identification. A recent work [35] rearranges spatially close tokens to better model relationships in spatially proximal locations. Token distillation [12] from pre-trained CNN models has also been used as a remedy to inject domain biases into the representations. While one may argue that architectures like Transformers should remain generic to be directly applicable across domains, we note that the high computational and time cost of pre-training such models demands novel design strategies to make their training more affordable on vision problems.

4.4 Neural Architecture Search for ViTs
While Neural Architecture Search (NAS) has been well explored for CNNs to find an optimized architecture, it is relatively less explored for Transformers (even for language transformers [261], [262]). Chen et al. [263] propose a one-shot NAS for vision transformers, called AutoFormer. BossNAS [264] searches for a hybrid architecture (CNN and Transformer). Another recent effort studies the trade-off between global and local information in Transformers in the context of vision applications [265]. It will be insightful to further explore the domain-specific design choices (e.g., the contrasting requirements between language and vision domains) using NAS to design more efficient and lightweight models, similar to CNNs [87].

4.5 Interpretability of Transformers
Through an extensive set of carefully designed experiments, Naseer et al. [266] investigate multiple intriguing properties of ViTs in terms of their generalization and robustness. They show that, compared with CNNs, ViTs demonstrate strong robustness against texture changes and severe occlusions; e.g., ViTs retain up to 60% top-1 accuracy on ImageNet even when 80% of the image content is randomly occluded. Given the strong performance of Transformer architectures, it is interesting and critical to interpret their decisions, e.g., by visualizing the regions of an image that are relevant to a given classification decision. The main challenge is that the attention originating in each layer gets inter-mixed in the subsequent layers in a complex manner, making it difficult to visualize the relative contribution of input tokens towards the final predictions. This is an open problem; however, some recent works [267]–[269] target enhanced interpretability of Transformers and report encouraging results. Attention roll-out and attention flow methods were proposed in [268] to estimate accurate attentions. However, this method functions in an ad-hoc manner and makes simplistic assumptions, e.g., that input tokens are linearly combined using attention weights across the layers.
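For reference, a minimal sketch of the attention roll-out computation under exactly this linear-mixing assumption is shown below; the head-averaging of the attention maps, the 0.5 residual weighting, and the toy inputs are simplifying assumptions rather than the exact formulation of [268].

```python
import numpy as np

def attention_rollout(attn_per_layer):
    """Attention roll-out sketch: tokens are assumed to be linearly mixed by the
    (row-normalised) attention matrices, with an identity term accounting for the
    residual connections. attn_per_layer: list of (n, n) maps averaged over heads."""
    n = attn_per_layer[0].shape[0]
    rollout = np.eye(n)
    for A in attn_per_layer:
        A_hat = 0.5 * A + 0.5 * np.eye(n)              # add residual contribution
        A_hat = A_hat / A_hat.sum(axis=-1, keepdims=True)
        rollout = A_hat @ rollout                       # propagate mixing through layers
    return rollout                                      # (n, n) token-to-token relevance

# toy usage: 4 layers, 10 tokens with random row-stochastic attention maps
layers = [np.random.dirichlet(np.ones(10), size=10) for _ in range(4)]
relevance_of_token0 = attention_rollout(layers)[0]      # contribution of each input to token 0
```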
Chefer et al. [269] note that the attention scores obtained directly via the self-attention process (encoding relationships between tokens) or the reassignments in [268] do not provide an optimal solution. As an alternative, they propose to assign and propagate relevancy scores in the Transformer network such that the sum of relevancy is constant throughout the network. Their design can handle both the positive and negative attributions encountered in the self-attention layers. The proposed framework has the added advantage of being able to provide class-specific visualizations. Despite these seminal works, visualizing and interpreting Transformers remains an unsolved problem, and methods are needed to obtain spatially precise, activation-specific visualizations. Further progress in this direction can help in better understanding Transformer models and in diagnosing erroneous behaviors and biases in the decision process. It can also help us design novel architectures that avoid such biases.

4.6 Hardware Efficient Designs
Large-scale Transformer networks can have intensive power and computation requirements, hindering their deployment on edge devices and in resource-constrained environments such as internet-of-things (IoT) platforms. Some recent efforts have been reported to compress and accelerate NLP models on embedded systems such as FPGAs [270]. Li et al. [270] used an enhanced block-circulant matrix-based representation to compress NLP models and proposed a new Field Programmable Gate Array (FPGA) architecture design to efficiently manage resources for high throughput and low latency. They achieve 27×, 3× and 81× improvements in performance (throughput measured in FPS), power consumption, and energy efficiency, respectively, relative to a CPU running the RoBERTa model [7]. Towards this goal, [262] proposed to design Hardware-Aware Transformers (HAT) using neural architecture search strategies [271]–[273]. Specifically, a SuperTransformer model is first trained for performance approximation, which can estimate a model's performance without fully training it. This model comprises the largest possible model in the search space while sharing weights between common parts. Eventually, an evolutionary search is performed, taking hardware latency constraints into account, to find a suitable SubTransformer model for a target hardware platform (e.g., IoT device, GPU, CPU). However, such hardware-efficient designs are currently lacking for vision Transformers, which would enable their seamless deployment on resource-constrained devices. Further, the search cost of the evolutionary algorithms remains significant, with the associated impact of CO2 emissions on the environment.

4.7 Towards Integrating All Modalities
Since Transformers provide a unified design for processing different modalities, recent efforts also focus on proposing more generic, general-purpose reasoning systems based on Transformers. Inspired by biological systems that can process information from a diverse range of modalities, the Perceiver model [274] aims to learn a unified model that can process any given input modality without making domain-specific architectural assumptions. In order to scale to high-dimensional inputs, Perceiver uses an asymmetric cross-attention method to distill the input information into low-dimensional latent bottleneck features. Once the features are distilled into a compact and fixed-dimensional form, regular Transformer blocks are applied in the latent space. The original Perceiver model shows performance competitive with ResNets and ViTs on image classification and can process 3D data, audio, images, video, or their combinations. However, this model can only generate fixed outputs, e.g., class probabilities. A recent improvement called Perceiver IO [275] aims to learn models with both flexible inputs and arbitrary-sized outputs. This allows application to problems which demand structured outputs, such as natural language tasks and visual comprehension. While these models avoid modality-dependent architectural choices, the learning itself still involves modality-dependent choices, e.g., specific augmentations or positional encodings. An interesting and open future direction is to achieve total modality-agnosticism in the learning pipeline.

5 CONCLUSION
Attention has played a key role in delivering efficient and accurate computer vision systems, while simultaneously providing insights into the function of deep neural networks. This survey reviews the self-attention approaches and specifically focuses on the Transformer and bi-directional encoding architectures that are built on the principle of self-attention. We first cover fundamental concepts pertaining to self-attention architectures and later provide an in-depth analysis of competing approaches for a broad range of computer vision applications. Specifically, we include state-of-the-art self-attention models for image recognition, object detection, semantic and instance segmentation, video analysis and classification, visual question answering, visual commonsense reasoning, image captioning, vision-language navigation, clustering, few-shot learning, and 3D data analysis. We systematically highlight the key strengths and limitations of the existing methods and particularly elaborate on the important future research directions. With its specific focus on computer vision tasks, this survey provides a unique view of the recent progress in self-attention and Transformer-based methods. We hope this effort will drive further interest in the vision community to leverage the potential of Transformer models and improve on their current limitations, e.g., reducing their carbon footprint.

ACKNOWLEDGMENTS
The authors would like to thank Tim Prangemeier (TU Darmstadt), Luowei Zhou (Microsoft Research), Jason Corso (University of Michigan), Pichao Wang (Alibaba Group), Yuqing Wang (Meituan), Alex Meinke (Uni-Tuebingen), Irwan Bello (Google Brain) and Manoj Kumar (Google Brain) for their helpful feedback on the survey. We would also like to thank Mohamed Afham for his help with a figure.

REFERENCES
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in NeurIPS, 2017.
[2] M. Ott, S. Edunov, D. Grangier, and M. Auli, "Scaling neural machine translation," in WMT, 2018.
[3] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre- [29] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A
training of deep bidirectional transformers for language under- neural image caption generator,” in CVPR, 2015.
standing,” arXiv preprint arXiv:1810.04805, 2018. [30] Y. Bengio, I. Goodfellow, and A. Courville, Deep learning. MIT
[4] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Im- press, 2017.
proving language understanding by generative pre-training,” [31] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature,
tech. rep., OpenAI, 2018. 2015.
[5] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, [32] S. Hochreiter and J. Schmidhuber, “Long short-term memory,”
“Language models are unsupervised multitask learners,” tech. Neural computation, 1997.
rep., OpenAI, 2019. [33] D. Hu, “An introductory survey on attention mechanisms in nlp
[6] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, problems,” in IntelliSys, 2019.
P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, [34] Y. Tay, M. Dehghani, D. Bahri, and D. Metzler, “Efficient trans-
et al., “Language models are few-shot learners,” arXiv preprint formers: A survey,” arXiv preprint arXiv:2009.06732, 2020.
arXiv:2005.14165, 2020. [35] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, F. E. Tay, J. Feng, and
[7] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, S. Yan, “Tokens-to-token vit: Training vision transformers from
M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A ro- scratch on imagenet,” arXiv preprint arXiv:2101.11986, 2021.
bustly optimized bert pretraining approach,” arXiv preprint [36] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo,
arXiv:1907.11692, 2019. “Swin transformer: Hierarchical vision transformer using shifted
[8] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, windows,” arXiv preprint arXiv:2103.14030, 2021.
Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer [37] X. Chu, Z. Tian, Y. Wang, B. Zhang, H. Ren, X. Wei, H. Xia,
learning with a unified text-to-text transformer,” arXiv preprint and C. Shen, “Twins: Revisiting the design of spatial attention
arXiv:1910.10683, 2019. in vision transformers,” arXiv preprint arXiv:2104.13840, 2021.
[9] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, [38] S. Wang, B. Li, M. Khabsa, H. Fang, and H. Ma, “Linformer: Self-
M. Krikun, N. Shazeer, and Z. Chen, “Gshard: Scaling giant attention with linear complexity,” arXiv preprint arXiv:2006.04768,
models with conditional computation and automatic sharding,” 2020.
arXiv preprint arXiv:2006.16668, 2020. [39] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-
[10] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling attention generative adversarial networks,” in International con-
to trillion parameter models with simple and efficient sparsity,” ference on machine learning, pp. 7354–7363, PMLR, 2019.
arXiv preprint arXiv:2101.03961. [40] J. Pérez, J. Marinković, and P. Barceló, “On the turing complete-
[11] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, ness of modern neural network architectures,” in International
T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, Conference on Learning Representations, 2018.
et al., “An image is worth 16x16 words: Transformers for image [41] J.-B. Cordonnier, A. Loukas, and M. Jaggi, “On the relationship
recognition at scale,” arXiv preprint arXiv:2010.11929, 2020. between self-attention and convolutional layers,” in International
[12] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and Conference on Learning Representations, 2019.
H. Jégou, “Training data-efficient image transformers & distilla- [42] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei,
tion through attention,” arXiv preprint arXiv:2012.12877, 2020. “Deformable convolutional networks,” in Proceedings of the IEEE
[13] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and international conference on computer vision, pp. 764–773, 2017.
S. Zagoruyko, “End-to-end object detection with transformers,” [43] Y.-C. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng,
arXiv preprint arXiv:2005.12872, 2020. and J. Liu, “UNITER: Universal image-text representation learn-
[14] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable ing,” in ECCV, 2020.
DETR: Deformable transformers for end-to-end object detection,” [44] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu,
arXiv preprint arXiv:2010.04159, 2020. L. Dong, F. Wei, et al., “Oscar: Object-semantics aligned pre-
[15] L. Ye, M. Rochan, Z. Liu, and Y. Wang, “Cross-modal self- training for vision-language tasks,” in ECCV, 2020.
attention network for referring image segmentation,” in CVPR, [45] K. Lin, L. Wang, and Z. Liu, “End-to-end human pose
2019. and mesh reconstruction with transformers,” arXiv preprint
[16] F. Yang, H. Yang, J. Fu, H. Lu, and B. Guo, “Learning texture arXiv:2012.09760, 2020.
transformer network for image super-resolution,” in CVPR, 2020. [46] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised represen-
[17] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid, tation learning by predicting image rotations,” in ICLR, 2018.
“VideoBERT: A joint model for video and language represen- [47] “Revisiting the unreasonable effectiveness of data.” https://ai.
tation learning,” in ICCV, 2019. googleblog.com/2017/07/revisiting-unreasonable-effectiveness.
[18] R. Girdhar, J. Carreira, C. Doersch, and A. Zisserman, “Video html. Accessed: 2020-12-31.
action transformer network,” in CVPR, 2019. [48] L. Jing and Y. Tian, “Self-supervised visual feature learning with
[19] H. Chen, Y. Wang, T. Guo, C. Xu, Y. Deng, Z. Liu, S. Ma, C. Xu, deep neural networks: A survey,” TPAMI, 2020.
C. Xu, and W. Gao, “Pre-trained image processing transformer,” [49] X. Liu, F. Zhang, Z. Hou, Z. Wang, L. Mian, J. Zhang, and
arXiv preprint arXiv:2012.00364, 2020. J. Tang, “Self-supervised learning: Generative or contrastive,”
[20] A. Ramesh, M. Pavlov, G. Goh, and S. Gray, “DALL·E: Creating arXiv preprint arXiv:2006.08218, 2020.
images from text,” tech. rep., OpenAI, 2021. [50] “Aaai 2020 keynotes turing award winners event.” https://www.
[21] H. Tan and M. Bansal, “LXMERT: Learning cross-modality en- youtube.com/watch?v=UX8OubxsY8w. Accessed: 2020-12-31.
coder representations from transformers,” in EMNLP-IJCNLP, [51] R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,”
2019. in ECCV, 2016.
[22] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai, “VL-BERT: [52] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham,
Pre-training of generic visual-linguistic representations,” arXiv A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al., “Photo-
preprint arXiv:1908.08530, 2019. realistic single image super-resolution using a generative adver-
[23] X. Wang, C. Yeshwanth, and M. Nießner, “SceneFormer: sarial network,” in CVPR, 2017.
Indoor scene generation with transformers,” arXiv preprint [53] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. Efros,
arXiv:2012.09793, 2020. “Context encoders: Feature learning by inpainting,” in CVPR,
[24] M. Kumar, D. Weissenborn, and N. Kalchbrenner, “Colorization 2016.
transformer,” in ICLR, 2021. [54] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-
[25] C. Doersch, A. Gupta, and A. Zisserman, “CrossTransformers: Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adver-
spatially-aware few-shot transfer,” NeurIPS, 2020. sarial nets,” in NeurIPS, 2014.
[26] H.-J. Ye, H. Hu, D.-C. Zhan, and F. Sha, “Few-shot learning via [55] D. Lin, K. Fu, Y. Wang, G. Xu, and X. Sun, “MARTA GANs:
embedding adaptation with set-to-set functions,” in CVPR, 2020. Unsupervised representation learning for remote sensing image
[27] S. Chaudhari, G. Polatkan, R. Ramanath, and V. Mithal, classification,” GRSL, 2017.
“An attentive survey of attention models,” arXiv preprint [56] U. Ahsan, R. Madhok, and I. Essa, “Video jigsaw: Unsupervised
arXiv:1904.02874, 2019. learning of spatiotemporal context for video action recognition,”
[28] A. de Santana Correia and E. L. Colombini, “Attention, please! in WACV, 2019.
asurvey of neural attention models in deep learning,” arXiv [57] M. Noroozi and P. Favaro, “Unsupervised learning of visual
preprint arXiv:2103.16775, 2021. representations by solving jigsaw puzzles,” in ECCV, 2016.
[58] D. Kim, D. Cho, D. Yoo, and I. S. Kweon, “Learning image [88] K. Han, A. Xiao, E. Wu, J. Guo, C. Xu, and Y. Wang, “Transformer
representations by completing damaged jigsaw puzzles,” WACV, in transformer,” arXiv preprint arXiv:2103.00112, 2021.
2018. [89] Z. Jiang, Q. Hou, L. Yuan, D. Zhou, Y. Shi, X. Jin, A. Wang, and
[59] L. Jing, X. Yang, J. Liu, and Y. Tian, “Self-supervised spatiotempo- J. Feng, “All tokens matter: Token labeling for training better
ral feature learning via video rotation prediction,” arXiv preprint vision transformers,” arXiv preprint arXiv:2104.10858, 2021.
arXiv:1811.11387, 2018. [90] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, “Cutmix:
[60] H.-Y. Lee, J.-B. Huang, M. Singh, and M.-H. Yang, “Unsupervised Regularization strategy to train strong classifiers with localizable
representation learning by sorting sequences,” in ICCV, 2017. features,” in Proceedings of the IEEE/CVF International Conference
[61] I. Misra, C. L. Zitnick, and M. Hebert, “Shuffle and learn: unsu- on Computer Vision, pp. 6023–6032, 2019.
pervised learning using temporal order verification,” in ECCV, [91] A. El-Nouby, H. Touvron, M. Caron, P. Bojanowski, M. Douze,
2016. A. Joulin, I. Laptev, N. Neverova, G. Synnaeve, J. Verbeek, and
[62] D. Wei, J. J. Lim, A. Zisserman, and W. T. Freeman, “Learning H. Jegou, “Xcit: Cross-covariance image transformers,” 2021.
and using the arrow of time,” in CVPR, 2018. [92] D. Zhou, B. Kang, X. Jin, L. Yang, X. Lian, Z. Jiang, Q. Hou, and
[63] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang, J. Feng, “Deepvit: Towards deeper vision transformer,” 2021.
“VisualBERT: A simple and performant baseline for vision and [93] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo,
language,” in Arxiv preprint arXiv:1908.03557, 2019. and L. Shao, “Pyramid vision transformer: A versatile back-
[64] B. Korbar, D. Tran, and L. T., “Cooperative learning of audio and bone for dense prediction without convolutions,” arXiv preprint
video models from self-supervised synchronization,” in NeurIPS, arXiv:2102.12122, 2021.
2018. [94] J. Yang, C. Li, P. Zhang, X. Dai, B. Xiao, L. Yuan, and J. Gao, “Focal
[65] R. Arandjelovic and A. Zisserman, “Look, listen and learn,” in self-attention for local-global interactions in vision transformers,”
ICCV, 2017. 2021.
[66] N. Sayed, B. Brattoli, and B. Ommer, “Cross and learn: Cross- [95] Z. Huang, Y. Ben, G. Luo, P. Cheng, G. Yu, and B. Fu, “Shuffle
modal self-supervision,” in GCPR, 2018. transformer: Rethinking spatial shuffle for vision transformer,”
[67] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for 2021.
image recognition,” in CVPR, 2016. [96] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang,
[68] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” “Cvt: Introducing convolutions to vision transformers,” arXiv
arXiv preprint arXiv:1607.06450, 2016. preprint arXiv:2103.15808, 2021.
[69] A. Buades, B. Coll, and J.-M. Morel, “A non-local algorithm for [97] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo,
image denoising,” in CVPR, 2005. and L. Shao, “Pvtv2: Improved baselines with pyramid vision
[70] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural transformer,” 2021.
networks,” in CVPR, 2018. [98] W. Xu, Y. Xu, T. Chang, and Z. Tu, “Co-scale conv-attentional
[71] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vi- image transformers,” 2021.
jayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al., [99] W. Wang, L. Yao, L. Chen, D. Cai, X. He, and W. Liu, “Cross-
“The kinetics human action video dataset,” arXiv preprint former: A versatile vision transformer based on cross-scale atten-
arXiv:1705.06950, 2017. tion,” arXiv preprint arXiv:2108.00154, 2021.
[72] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu, [100] C.-F. Chen, R. Panda, and Q. Fan, “Regionvit: Regional-to-local
“CCNet: Criss-cross attention for semantic segmentation,” in attention for vision transformers,” 2021.
ICCV, 2019.
[101] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and
[73] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler,
P. Luo, “Segformer: Simple and efficient design for semantic
R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes
segmentation with transformers,” 2021.
dataset for semantic urban scene understanding,” in CVPR, 2016.
[102] P. Zhang, X. Dai, J. Yang, B. Xiao, L. Yuan, L. Zhang, and J. Gao,
[74] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba,
“Multi-scale vision longformer: A new vision transformer for
“Scene parsing through ade20k dataset,” in CVPR, 2017.
high-resolution image encoding,” ICCV 2021, 2021.
[75] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan,
P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects [103] I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long-
in context,” in ECCV, 2014. document transformer,” arXiv preprint arXiv:2004.05150, 2020.
[76] X. Liang, K. Gong, X. Shen, and L. Lin, “Look into person: Joint [104] C.-F. Chen, Q. Fan, and R. Panda, “Crossvit: Cross-attention
body parsing & pose estimation network and a new benchmark,” multi-scale vision transformer for image classification,” arXiv
TPAMI, 2018. preprint arXiv:2103.14899, 2021.
[77] G. J. Brostow, J. Fauqueur, and R. Cipolla, “Semantic object [105] K. Yuan, S. Guo, Z. Liu, A. Zhou, F. Yu, and W. Wu, “Incorporat-
classes in video: A high-definition ground truth database,” Pat- ing convolution designs into visual transformers,” arXiv preprint
tern Recognition Letters, 2009. arXiv:2103.11816, 2021.
[78] H. Hu, Z. Zhang, Z. Xie, and S. Lin, “Local relation networks for [106] A. Hassani, S. Walton, N. Shah, A. Abuduweili, J. Li, and H. Shi,
image recognition,” in ICCV, 2019. “Escaping the big data paradigm with compact transformers,”
[79] I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le, “Attention 2021.
augmented convolutional networks,” in ICCV, 2019. [107] Y. Li, K. Zhang, J. Cao, R. Timofte, and L. V. Gool, “Localvit:
[80] P. Shaw, J. Uszkoreit, and A. Vaswani, “Self-attention with rela- Bringing locality to vision transformers,” 2021.
tive position representations,” in NAACL, 2018. [108] B. Graham, A. El-Nouby, H. Touvron, P. Stock, A. Joulin,
[81] N. Parmar, P. Ramachandran, A. Vaswani, I. Bello, A. Levskaya, H. Jégou, and M. Douze, “Levit: a vision transformer in convnet’s
and J. Shlens, “Stand-alone self-attention in vision models,” in clothing for faster inference,” 2021.
NeurIPS, 2019. [109] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard,
[82] H. Zhao, J. Jia, and V. Koltun, “Exploring self-attention for image W. Hubbard, and L. D. Jackel, “Backpropagation applied to
recognition,” in CVPR, 2020. handwritten zip code recognition,” Neural computation, vol. 1,
[83] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Good- no. 4, pp. 541–551, 1989.
fellow, and R. Fergus, “Intriguing properties of neural networks,” [110] Q. Zhang and Y. Yang, “Rest: An efficient transformer for visual
arXiv preprint arXiv:1312.6199, 2013. recognition,” arXiv preprint arXiv:2105.13677, 2021.
[84] M. M. Naseer, S. H. Khan, M. H. Khan, F. S. Khan, and F. Porikli, [111] Z. Zhang, H. Zhang, L. Zhao, T. Chen, and T. Pfister, “Aggre-
“Cross-domain transferability of adversarial perturbations,” in gating nested transformers,” in arXiv preprint arXiv:2105.12723,
NeurIPS, 2019. 2021.
[85] M. Naseer, K. Ranasinghe, S. Khan, F. S. Khan, and F. Porikli, “On [112] Z. Dai, H. Liu, Q. V. Le, and M. Tan, “Coatnet: Marrying convo-
improving adversarial transferability of vision transformers,” lution and attention for all data sizes,” 2021.
arXiv preprint arXiv:2106.04169, 2021. [113] X. Chu, Z. Tian, B. Zhang, X. Wang, X. Wei, H. Xia, and C. Shen,
[86] I. Radosavovic, R. P. Kosaraju, R. Girshick, K. He, and P. Dollár, “Conditional positional encodings for vision transformers,” 2021.
“Designing network design spaces,” in CVPR, 2020. [114] Y. Liu, G. Sun, Y. Qiu, L. Zhang, A. Chhatkuli, and L. Van Gool,
[87] M. Tan and Q. V. Le, “EfficientNet: Rethinking model scaling for “Transformer in convolutional neural networks,” arXiv preprint
convolutional neural networks,” in ICML, 2019. arXiv:2106.03180, 2021.
[115] X. Chen, S. Xie, and K. He, “An empirical study of training self- [142] N. Parmar, A. Vaswani, J. Uszkoreit, Ł. Kaiser, N. Shazeer, A. Ku,
supervised visual transformers,” arXiv e-prints, pp. arXiv–2104, and D. Tran, “Image transformer,” in ICML, 2018.
2021. [143] M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, and
[116] X. Chen, H. Fan, R. Girshick, and K. He, “Improved base- I. Sutskever, “Generative pretraining from pixels,” in ICML, 2020.
lines with momentum contrastive learning,” arXiv preprint [144] P. Esser, R. Rombach, and B. Ommer, “Taming transformers for
arXiv:2003.04297, 2020. high-resolution image synthesis,” arXiv:2012.09841, 2020.
[117] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum [145] Y. Jiang, S. Chang, and Z. Wang, “Transgan: Two transformers
contrast for unsupervised visual representation learning,” in can make one strong gan,” 2021.
Proceedings of the IEEE/CVF Conference on Computer Vision and [146] A. K. Bhunia, S. Khan, H. Cholakkal, R. M. Anwer, F. S.
Pattern Recognition, pp. 9729–9738, 2020. Khan, and M. Shah, “Handwriting transformers,” arXiv preprint
[118] Z. Xie, Y. Lin, Z. Yao, Z. Zhang, Q. Dai, Y. Cao, and H. Hu, arXiv:2104.03964, 2021.
“Self-supervised learning with swin transformers,” arXiv preprint [147] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals,
arXiv:2105.04553, 2021. A. Graves, et al., “Conditional image generation with pixelcnn
[119] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, decoders,” in NeurIPS, 2016.
E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, [148] A. Krizhevsky, “Learning multiple layers of features from tiny
et al., “Bootstrap your own latent: A new approach to self- images,” tech. rep., 2009.
supervised learning,” arXiv preprint arXiv:2006.07733, 2020. [149] A. Coates, A. Ng, and H. Lee, “An analysis of single-layer
[120] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, networks in unsupervised feature learning,” in AISTATS, 2011.
and A. Joulin, “Emerging properties in self-supervised vision [150] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple
transformers,” arXiv preprint arXiv:2104.14294, 2021. framework for contrastive learning of visual representations,”
[121] C. Li, J. Yang, P. Zhang, M. Gao, B. Xiao, X. Dai, L. Yuan, arXiv preprint arXiv:2002.05709, 2020.
and J. Gao, “Efficient self-supervised vision transformers for [151] P. Bachman, R. Hjelm, and W. Buchwalter, “Learning represen-
representation learning,” arXiv preprint arXiv:2106.09785, 2021. tations by maximizing mutual information across views,” in
[122] Y. Wang, X. Zhang, T. Yang, and J. Sun, “Anchor detr: Query NeurIPS, 2019.
design for transformer-based detector,” 2021. [152] O. J. Hénaff, A. Srinivas, J. De Fauw, A. Razavi, C. Doersch, S. Es-
[123] T. Chen, S. Saxena, L. Li, D. J. Fleet, and G. Hinton, “Pix2seq: A lami, and A. v. d. Oord, “Data-efficient image recognition with
language modeling framework for object detection,” 2021. contrastive predictive coding,” arXiv preprint arXiv:1905.09272,
[124] Y. Fang, B. Liao, X. Wang, J. Fang, J. Qi, R. Wu, J. Niu, and W. Liu, 2019.
“You only look at one sequence: Rethinking transformer in vision [153] Y. Tian, D. Krishnan, and P. Isola, “Contrastive multiview cod-
through object detection,” 2021. ing,” arXiv preprint arXiv:1906.05849, 2019.
[125] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: To- [154] S. Khan, H. Rahmani, S. A. A. Shah, and M. Bennamoun, “A
wards real-time object detection with region proposal networks,” guide to convolutional neural networks for computer vision,”
TPAMI, 2016. Synthesis Lectures on Computer Vision, 2018.
[126] R. Girshick, “Fast R-CNN,” in ICCV, 2015. [155] A. Radford, L. Metz, and S. Chintala, “Unsupervised represen-
tation learning with deep convolutional generative adversarial networks," arXiv preprint arXiv:1511.06434, 2015.
[127] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in ICCV, 2017.
[128] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in CVPR, 2016.
[129] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in ECCV, 2016.
[130] T. Prangemeier, C. Reich, and H. Koeppl, "Attention-based transformers for instance segmentation of cells in microstructures," in 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 700–707, IEEE, 2020.
[131] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in CVPR, 2017.
[132] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An image is worth 16x16 words: Transformers for image recognition at scale," 2020.
[133] H. Wang, Y. Zhu, B. Green, H. Adam, A. Yuille, and L.-C. Chen, "Axial-DeepLab: Stand-alone axial-attention for panoptic segmentation," arXiv preprint arXiv:2003.07853, 2020.
[134] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. S. Torr, and L. Zhang, "Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers," 2021.
[135] R. Strudel, R. Garcia, I. Laptev, and C. Schmid, "Segmenter: Transformer for semantic segmentation," 2021.
[136] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár, "Panoptic segmentation," in CVPR, 2019.
[137] G. Neuhold, T. Ollmann, S. Rota Bulo, and P. Kontschieder, "The mapillary vistas dataset for semantic understanding of street scenes," in ICCV, 2017.
[138] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.
[139] L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg, "Modeling context in referring expressions," in ECCV, 2016.
[140] J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy, "Generation and comprehension of unambiguous object descriptions," in CVPR, 2016.
[141] S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg, "Referitgame: Referring to objects in photographs of natural scenes," in EMNLP, 2014.
[156] C. Gao, Y. Chen, S. Liu, Z. Tan, and S. Yan, "Adversarialnas: Adversarial neural architecture search for gans," in CVPR, pp. 5680–5689, 2020.
[157] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, "Analyzing and improving the image quality of stylegan," in CVPR, pp. 8110–8119, 2020.
[158] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, "Generative adversarial text to image synthesis," in ICML, 2016.
[159] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, "StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks," in ICCV, 2017.
[160] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, "StackGAN++: Realistic image synthesis with stacked generative adversarial networks," TPAMI, 2018.
[161] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, "AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks," in CVPR, 2018.
[162] D. P. Kingma and M. Welling, "Auto-encoding variational bayes," arXiv preprint arXiv:1312.6114, 2013.
[163] A. Razavi, A. van den Oord, and O. Vinyals, "Generating diverse high-fidelity images with vq-vae-2," in NeurIPS, 2019.
[164] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, "Swinir: Image restoration using swin transformer," in ICCVW, 2021.
[165] Z. Wang, X. Cun, J. Bao, and J. Liu, "Uformer: A general u-shaped transformer for image restoration," arXiv preprint arXiv:2106.03106, 2021.
[166] Z. Lu, H. Liu, J. Li, and L. Zhang, "Efficient transformer for single image super-resolution," arXiv preprint arXiv:2108.11084, 2021.
[167] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, "Image super-resolution using very deep residual channel attention networks," in ECCV, 2018.
[168] T. Dai, J. Cai, Y. Zhang, S. Xia, and L. Zhang, "Second-order attention network for single image super-resolution," in CVPR, 2019.
[169] B. Niu, W. Wen, W. Ren, X. Zhang, L. Yang, S. Wang, K. Zhang, X. Cao, and H. Shen, "Single image super-resolution via a holistic attention network," in ECCV, 2020.
[170] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, "Enhanced deep residual networks for single image super-resolution," in CVPRW, 2017.
[171] Y. Tai, J. Yang, and X. Liu, "Image super-resolution via deep recursive residual network," in CVPR, 2017.
[172] W. Han, S. Chang, D. Liu, M. Yu, M. Witbrock, and T. Huang, "Image super-resolution via dual-state recurrent networks," in CVPR, 2018.
[173] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, "Residual dense network for image restoration," TPAMI, 2020.
[174] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy, "ESRGAN: enhanced super-resolution generative adversarial networks," in ECCVW, 2018.
[175] S.-J. Park, H. Son, S. Cho, K.-S. Hong, and S. Lee, "SRFEAT: Single image super-resolution with feature discrimination," in ECCV, 2018.
[176] M. S. Sajjadi, B. Scholkopf, and M. Hirsch, "EnhanceNet: Single image super-resolution through automated texture synthesis," in ICCV, 2017.
[177] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al., "Photo-realistic single image super-resolution using a generative adversarial network," in CVPR, 2017.
[178] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," in ECCV, 2016.
[179] J. Ho, N. Kalchbrenner, D. Weissenborn, and T. Salimans, "Axial attention in multidimensional transformers," arXiv preprint arXiv:1912.12180, 2019.
[180] G. Li, N. Duan, Y. Fang, M. Gong, D. Jiang, and M. Zhou, "Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training," in AAAI, 2020.
[181] J. Lu, D. Batra, D. Parikh, and S. Lee, "Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks," in NeurIPS, 2019.
[182] S. Lee, Y. Yu, G. Kim, T. Breuel, J. Kautz, and Y. Song, "Parameter efficient multimodal transformers for video representation learning," arXiv preprint arXiv:2012.04124, 2020.
[183] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh, "VQA: Visual question answering," in ICCV, 2015.
[184] R. Zellers, Y. Bisk, A. Farhadi, and Y. Choi, "From recognition to cognition: Visual commonsense reasoning," in CVPR, 2019.
[185] K.-H. Lee, X. Chen, G. Hua, H. Hu, and X. He, "Stacked cross attention for image-text matching," in ECCV, 2018.
[186] A. Suhr, S. Zhou, A. Zhang, I. Zhang, H. Bai, and Y. Artzi, "A corpus for reasoning about natural language grounded in photographs," arXiv preprint arXiv:1811.00491, 2018.
[187] J. Carreira, E. Noland, C. Hillier, and A. Zisserman, "A short note on the kinetics-700 human action dataset," arXiv:1907.06987, 2019.
[188] K. Soomro, A. R. Zamir, and M. Shah, "UCF101: A dataset of 101 human actions classes from videos in the wild," arXiv preprint arXiv:1212.0402, 2012.
[189] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio set: An ontology and human-labeled dataset for audio events," in ICASSP, 2017.
[190] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta, "Hollywood in homes: Crowdsourcing data collection for activity understanding," in ECCV, 2016.
[191] H. Tan and M. Bansal, "Vokenization: Improving language understanding with contextualized, visual-grounded supervision," in EMNLP, 2020.
[192] W. Hao, C. Li, X. Li, L. Carin, and J. Gao, "Towards learning a generic agent for vision-and-language navigation via pre-training," in CVPR, 2020.
[193] A. Majumdar, A. Shrivastava, S. Lee, P. Anderson, D. Parikh, and D. Batra, "Improving vision-and-language navigation with image-text pairs from the web," arXiv preprint arXiv:2004.14973, 2020.
[194] K. Chen, J. K. Chen, J. Chuang, M. Vázquez, and S. Savarese, "Topological planning with transformers for vision-and-language navigation," arXiv preprint arXiv:2012.05292, 2020.
[195] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., "Learning transferable visual models from natural language supervision," in ICML, 2021.
[196] P. Sharma, N. Ding, S. Goodman, and R. Soricut, "Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning," in ACL, 2018.
[197] L. Zhou, H. Palangi, L. Zhang, H. Hu, J. Corso, and J. Gao, "Unified vision-language pre-training for image captioning and vqa," in AAAI, vol. 34, pp. 13041–13049, 2020.
[198] C. Sun, F. Baradel, K. Murphy, and C. Schmid, "Learning video representations using contrastive bidirectional transformer," arXiv preprint arXiv:1906.05743, 2019.
[199] C. Alberti, J. Ling, M. Collins, and D. Reitter, "Fusion of detected objects in text for visual question answering," in EMNLP, 2019.
[200] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al., "Visual genome: Connecting language and vision using crowdsourced dense image annotations," IJCV, 2017.
[201] V. Ordonez, G. Kulkarni, and T. L. Berg, "Im2text: Describing images using 1 million captioned photographs," in NeurIPS, 2011.
[202] P. Zhang, X. Li, X. Hu, J. Yang, L. Zhang, L. Wang, Y. Choi, and J. Gao, "Vinvl: Revisiting visual representations in vision-language models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5579–5588, 2021.
[203] A. Kamath, M. Singh, Y. LeCun, I. Misra, G. Synnaeve, and N. Carion, "Mdetr–modulated detection for end-to-end multi-modal understanding," arXiv preprint arXiv:2104.12763, 2021.
[204] J. Deng, Z. Yang, T. Chen, W. Zhou, and H. Li, "Transvg: End-to-end visual grounding with transformers," 2021.
[205] M. Li and L. Sigal, "Referring transformer: A one-step approach to multi-task visual grounding," arXiv preprint arXiv:2106.03089, 2021.
[206] Y. Du, Z. Fu, Q. Liu, and Y. Wang, "Visual grounding with transformers," arXiv preprint arXiv:2105.04281, 2021.
[207] S. Ging, M. Zolfaghari, H. Pirsiavash, and T. Brox, "COOT: Cooperative hierarchical transformer for video-text representation learning," arXiv preprint arXiv:2011.00597, 2020.
[208] H. Seong, J. Hyun, and E. Kim, "Video multitask transformer network," in ICCV Workshops, 2019.
[209] Y. Wang, Z. Xu, X. Wang, C. Shen, B. Cheng, H. Shen, and H. Xia, "End-to-end video instance segmentation with transformers," arXiv preprint arXiv:2011.14503, 2020.
[210] L. Zhou, Y. Zhou, J. Corso, R. Socher, and C. Xiong, "End-to-end dense video captioning with masked transformer," in CVPR, 2018.
[211] D. Neimark, O. Bar, M. Zohar, and D. Asselmann, "Video transformer network," arXiv preprint arXiv:2102.00719, 2021.
[212] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid, "Vivit: A video vision transformer," arXiv preprint arXiv:2103.15691, 2021.
[213] G. Bertasius, H. Wang, and L. Torresani, "Is space-time attention all you need for video understanding?," in Proceedings of the International Conference on Machine Learning (ICML), July 2021.
[214] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles, "Dense-captioning events in videos," in ICCV, pp. 706–715, 2017.
[215] L. Zhou, C. Xu, and J. Corso, "Towards automatic learning of procedures from web instructional videos," in AAAI, vol. 32, 2018.
[216] C. Plizzari, M. Cannici, and M. Matteucci, "Spatial temporal transformer network for skeleton-based action recognition," arXiv preprint arXiv:2008.07404, 2020.
[217] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, "NTU RGB+D: A large scale dataset for 3d human activity analysis," in CVPR, 2016.
[218] J. Liu, A. Shahroudy, M. Perez, G. Wang, L.-Y. Duan, and A. C. Kot, "NTU RGB+D 120: A large-scale benchmark for 3d human activity understanding," TPAMI, 2019.
[219] H. Fan, B. Xiong, K. Mangalam, Y. Li, Z. Yan, J. Malik, and C. Feichtenhofer, "Multiscale vision transformers," 2021.
[220] J. Wang, G. Bertasius, D. Tran, and L. Torresani, "Long-short temporal contrastive learning of video transformers," arXiv preprint arXiv:2106.09212, 2021.
[221] L. Yang, Y. Fan, and N. Xu, "Video instance segmentation," in ICCV, pp. 5188–5197, 2019.
[222] G. Bertasius and L. Torresani, "Classifying, segmenting, and tracking object instances in video with mask propagation," in CVPR, pp. 9739–9748, 2020.
[223] E. Triantafillou, T. Zhu, V. Dumoulin, P. Lamblin, U. Evci, K. Xu, R. Goroshin, C. Gelada, K. Swersky, P.-A. Manzagol, et al., "Meta-dataset: A dataset of datasets for learning to learn from few examples," in ICLR, 2020.
[224] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.
[225] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola, "Deep sets," in NeurIPS, 2017.
[226] L. Liu, W. Hamilton, G. Long, J. Jiang, and H. Larochelle, "A universal representation transformer layer for few-shot image classification," 2020.
[227] H. Edwards and A. Storkey, "Towards a neural statistician," arXiv preprint arXiv:1606.02185, 2016.
[228] J. Lee, Y. Lee, J. Kim, A. Kosiorek, S. Choi, and Y. W. Teh, "Set transformer: A framework for attention-based permutation-invariant neural networks," in ICML, 2019.
[229] J. Lee, Y. Lee, and Y. W. Teh, "Deep amortized clustering," arXiv preprint arXiv:1909.13433, 2019.
[230] H. Zhao, L. Jiang, J. Jia, P. Torr, and V. Koltun, "Point transformer," arXiv preprint arXiv:2012.09164, 2020.
[231] M.-H. Guo, J.-X. Cai, Z.-N. Liu, T.-J. Mu, R. R. Martin, and S.-M. Hu, "Pct: Point cloud transformer," arXiv preprint arXiv:2012.09688, 2020.
[232] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, "3D ShapeNets: A deep representation for volumetric shapes," in CVPR, 2015.
[233] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu, "ShapeNet: An information-rich 3d model repository," arXiv preprint arXiv:1512.03012, 2015.
[234] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, "Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments," TPAMI, 2013.
[235] T. von Marcard, R. Henschel, M. J. Black, B. Rosenhahn, and G. Pons-Moll, "Recovering accurate 3d human pose in the wild using imus and a moving camera," in ECCV, 2018.
[236] C. Zimmermann, D. Ceylan, J. Yang, B. Russell, M. Argus, and T. Brox, "FreiHAND: A dataset for markerless capture of hand pose and shape from single rgb images," in ICCV, 2019.
[237] "OpenAI's GPT-3 language model: A technical overview." https://lambdalabs.com/blog/demystifying-gpt-3/. Accessed: 2020-12-31.
[238] X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer, "Scaling vision transformers," 2021.
[239] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, "From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions," TACL, 2014.
[240] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, "Making the v in vqa matter: Elevating the role of image understanding in visual question answering," in CVPR, 2017.
[241] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik, "Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models," in ICCV, 2015.
[242] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, "PointNet++: Deep hierarchical feature learning on point sets in a metric space," NeurIPS, 2017.
[243] H. Touvron, M. Cord, A. Sablayrolles, G. Synnaeve, and H. Jégou, "Going deeper with image transformers," arXiv preprint arXiv:2103.17239, 2021.
[244] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," in CVPR, 2017.
[245] R. Child, S. Gray, A. Radford, and I. Sutskever, "Generating long sequences with sparse transformers," arXiv:1904.10509, 2019.
[246] N. Kitaev, Ł. Kaiser, and A. Levskaya, "Reformer: The efficient transformer," in ICLR, 2020.
[247] I. Bello, "Lambdanetworks: Modeling long-range interactions without attention," in International Conference on Learning Representations, 2021.
[248] A. Vyas, A. Katharopoulos, and F. Fleuret, "Fast transformers with clustered attention," NeurIPS, 2020.
[249] Y.-H. Wu, Y. Liu, X. Zhan, and M.-M. Cheng, "P2t: Pyramid pooling transformer for scene understanding," arXiv preprint arXiv:2106.12011, 2021.
[250] A. Vaswani, P. Ramachandran, A. Srinivas, N. Parmar, B. Hechtman, and J. Shlens, "Scaling local self-attention for parameter efficient visual backbones," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12894–12904, 2021.
[251] X. Dong, J. Bao, D. Chen, W. Zhang, N. Yu, L. Yuan, D. Chen, and B. Guo, "Cswin transformer: A general vision transformer backbone with cross-shaped windows," arXiv preprint arXiv:2107.00652, 2021.
[252] Y. Xiong, Z. Zeng, R. Chakraborty, M. Tan, G. Fung, Y. Li, and V. Singh, "Nyströmformer: A Nyström-based algorithm for approximating self-attention," in AAAI, 2021.
[253] Y. Tay, D. Bahri, D. Metzler, D. Juan, Z. Zhao, and C. Zheng, "Synthesizer: Rethinking self-attention in transformer models," in ICML, 2021.
[254] H. Peng, N. Pappas, D. Yogatama, R. Schwartz, N. A. Smith, and L. Kong, "Random feature attention," in ICLR, 2021.
[255] K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, et al., "Rethinking attention with performers," in ICLR, 2021.
[256] Y. Tay, D. Bahri, L. Yang, D. Metzler, and D.-C. Juan, "Sparse sinkhorn attention," in ICML, 2020.
[257] X. Chen, C.-J. Hsieh, and B. Gong, "When vision transformers outperform resnets without pretraining or strong data augmentations," arXiv preprint arXiv:2106.01548, 2021.
[258] P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur, "Sharpness-aware minimization for efficiently improving generalization," arXiv preprint arXiv:2010.01412, 2020.
[259] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826, 2016.
[260] S. He, H. Luo, P. Wang, F. Wang, H. Li, and W. Jiang, "Transreid: Transformer-based object re-identification," arXiv:2102.04378, 2021.
[261] D. R. So, C. Liang, and Q. V. Le, "The evolved transformer," 2019.
[262] H. Wang, Z. Wu, Z. Liu, H. Cai, L. Zhu, C. Gan, and S. Han, "Hat: Hardware-aware transformers for efficient natural language processing," 2020.
[263] M. Chen, H. Peng, J. Fu, and H. Ling, "Autoformer: Searching transformers for visual recognition," arXiv preprint arXiv:2107.00651, 2021.
[264] C. Li, T. Tang, G. Wang, J. Peng, B. Wang, X. Liang, and X. Chang, "Bossnas: Exploring hybrid cnn-transformers with block-wisely self-supervised neural architecture search," arXiv preprint arXiv:2103.12424, 2021.
[265] B. Chen, P. Li, C. Li, B. Li, L. Bai, C. Lin, M. Sun, W. Ouyang, et al., "Glit: Neural architecture search for global and local image transformer," arXiv preprint arXiv:2107.02960, 2021.
[266] M. Naseer, K. Ranasinghe, S. Khan, M. Hayat, F. S. Khan, and M.-H. Yang, "Intriguing properties of vision transformers," arXiv preprint arXiv:2105.10497, 2021.
[267] E. Voita, D. Talbot, F. Moiseev, R. Sennrich, and I. Titov, "Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned," arXiv preprint arXiv:1905.09418, 2019.
[268] S. Abnar and W. Zuidema, "Quantifying attention flow in transformers," arXiv preprint arXiv:2005.00928, 2020.
[269] H. Chefer, S. Gur, and L. Wolf, "Transformer interpretability beyond attention visualization," arXiv preprint arXiv:2012.09838, 2020.
[270] B. Li, S. Pandey, H. Fang, Y. Lyv, J. Li, J. Chen, M. Xie, L. Wan, H. Liu, and C. Ding, "FTRANS: energy-efficient acceleration of transformers using fpga," in ISLPED, 2020.
[271] G. Bender, P.-J. Kindermans, B. Zoph, V. Vasudevan, and Q. Le, "Understanding and simplifying one-shot architecture search," in ICML, 2018.
[272] Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei, and J. Sun, "Single path one-shot neural architecture search with uniform sampling," arXiv preprint arXiv:1904.00420, 2019.
[273] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean, "Efficient neural architecture search via parameter sharing," in ICML, 2018.
[274] A. Jaegle, F. Gimeno, A. Brock, A. Zisserman, O. Vinyals, and J. Carreira, "Perceiver: General perception with iterative attention," arXiv preprint arXiv:2103.03206, 2021.
[275] A. Jaegle, S. Borgeaud, J.-B. Alayrac, C. Doersch, C. Ionescu, D. Ding, S. Koppula, D. Zoran, A. Brock, E. Shelhamer, et al., "Perceiver io: A general architecture for structured inputs & outputs," arXiv preprint arXiv:2107.14795, 2021.