0% found this document useful (0 votes)
53 views30 pages

ViT Survey On Segmentation

This document provides an overview of recent applications of transformer models in computer vision. It discusses how transformer models can process visual data without strong inductive biases, and how they have been applied successfully to tasks like image recognition, object detection, segmentation, video understanding, and more. The survey aims to comprehensively cover the use of transformers in computer vision and provide references for readers.

Uploaded by

opekkhasu
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
53 views30 pages

ViT Survey On Segmentation

This document provides an overview of recent applications of transformer models in computer vision. It discusses how transformer models can process visual data without strong inductive biases, and how they have been applied successfully to tasks like image recognition, object detection, segmentation, video understanding, and more. The survey aims to comprehensively cover the use of transformers in computer vision and provide references for readers.

Uploaded by

opekkhasu
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 30

1

Transformers in Vision: A Survey


Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir,
Fahad Shahbaz Khan, and Mubarak Shah

Abstract—Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their
application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input
sequence elements and support parallel processing of sequence as compared to recurrent networks e.g., Long short-term memory
(LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited
as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos,
text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge
datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to
provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to
arXiv:2101.01169v5 [cs.CV] 19 Jan 2022

fundamental concepts behind the success of Transformers i.e., self-attention, large-scale pre-training, and bidirectional feature
encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification,
object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual
reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image
super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We
compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental
value. Finally, we provide an analysis on open research directions and possible future works. We hope this effort will ignite further
interest in the community to solve current challenges towards the application of transformer models in computer vision.

Index Terms—Self-attention, transformers, bidirectional encoders, deep neural networks, convolutional networks, self-supervision.

1 I NTRODUCTION

T RANSFORMER models [1] have recently demonstrated


exemplary performance on a broad range of language
tasks e.g., text classification, machine translation [2] and
However, visual data follows a typical structure (e.g., spatial
and temporal coherence), thus demanding novel network
designs and training schemes. As a result, Transformer mod-
question answering. Among these models, the most popular els and their variants have been successfully used for image
ones include BERT (Bidirectional Encoder Representations recognition [11], [12], object detection [13], [14], segmenta-
from Transformers) [3], GPT (Generative Pre-trained Trans- tion [15], image super-resolution [16], video understanding
former) v1-3 [4]–[6], RoBERTa (Robustly Optimized BERT [17], [18], image generation [19], text-image synthesis [20]
Pre-training) [7] and T5 (Text-to-Text Transfer Transformer) and visual question answering [21], [22], among several
[8]. The profound impact of Transformer models has become other use cases [23]–[26]. This survey aims to cover such
more clear with their scalability to very large capacity mod- recent and exciting efforts in the computer vision domain,
els [9], [10]. For example, the BERT-large [3] model with providing a comprehensive reference to interested readers.
340 million parameters was significantly outperformed by Transformer architectures are based on a self-attention
the GPT-3 [6] model with 175 billion parameters while the mechanism that learns the relationships between elements
latest mixture-of-experts Switch transformer [10] scales up of a sequence. As opposed to recurrent networks that pro-
to a whopping 1.6 trillion parameters! cess sequence elements recursively and can only attend to
The breakthroughs from Transformer networks in Nat- short-term context, Transformers can attend to complete
ural Language Processing (NLP) domain has sparked great sequences thereby learning long-range relationships. Al-
interest in the computer vision community to adapt these though attention models have been extensively used in
models for vision and multi-modal learning tasks (Fig. 1). both feed-forward and recurrent networks [27], [28], Trans-
formers are based solely on the attention mechanism and
have a unique implementation (i.e., multi-head attention)
• S. Khan, M. Naseer and F. S. Khan are with the MBZ University of
Artificial Intelligence, Abu Dhabi, UAE. optimized for parallelization. An important feature of these
E-mail: firstname.lastname@mbzuai.ac.ae models is their scalability to high-complexity models and
• M. Hayat is with the Faculty of IT, Monash University, Clayton VIC large-scale datasets e.g., in comparison to some of the other
3800, Australia.
• S. W. Zamir is with the Inception Institute of Artificial Intelligence, Abu
alternatives such as hard attention [29] which is stochastic in
Dhabi, UAE. nature and requires Monte Carlo sampling for sampling at-
• S. Khan and M. Naseer are also with the CECS, Australian National tention locations. Since Transformers assume minimal prior
University, Canberra ACT 0200, Australia. knowledge about the structure of the problem as compared
• F. S. Khan is also with the Computer Vision Laboratory, Linköping
University, Sweden. to their convolutional and recurrent counterparts [30]–[32],
• M. Shah is with the Center for Research in Computer Vision, University they are typically pre-trained using pretext tasks on large-
of Central Florida, Orlando, FL 32816, United States. scale (unlabelled) datasets [1], [3]. Such a pre-training avoids
Manuscript received March, 2021. costly manual annotations, thereby encoding highly expres-
2

Fig. 1: Statistics on the number of times keywords such as BERT, Self-Attention, and Transformers appear in the titles of Peer-
reviewed and arXiv papers over the past few years (in Computer Vision and Machine Learning). The plots show consistent growth
in recent literature. This survey covers recent progress on Transformers in the computer vision domain.

sive and generalizable representations that model rich rela-


tionships between the entities present in a given dataset. The
learned representations are then fine-tuned on the down-
stream tasks in a supervised manner to obtain favorable
results.
This paper provides a holistic overview of the trans-
former models developed for computer vision applications.
We develop a taxonomy of the network design space and
highlight the major strengths and shortcomings of the ex-
isting methods. Other literature reviews mainly focus on Fig. 2: An example self-attention block used in the vision
the NLP domain [33], [34] or cover generic attention-based domain [39]. Given the input sequence of image features, the
approaches [27], [33]. By focusing on the newly emerging triplet of (key, query, value) is calculated followed by attention
area of visual transformers, we comprehensively organize calculation and applying it to reweight the values. A single
the recent approaches according to the intrinsic features of head is shown here and an output projection (W) is finally
self-attention and the investigated task. We first provide an applied to obtain output features with the same dimension as
the input. Figure adapted from [39].
introduction to the salient concepts underlying Transformer
networks and then elaborate on the specifics of recent vision networks (Sec. 2.3 and 2.4) where these ideas have been
transformers. Where ever possible, we draw parallels be- applied. This background will help us better understand
tween the Transformers used in the NLP domain [1] and the the forthcoming Transformer based models used in the
ones developed for vision problems to flash major novelties computer vision domain (Sec. 3).
and interesting domain-specific insights. Recent approaches
show that convolution operations can be fully replaced
2.1 Self-Attention in Transformers
with attention-based transformer modules and have also
been used jointly in a single design to encourage symbiosis Given a sequence of items, self-attention estimates the rel-
between the two complementary set of operations. This sur- evance of one item to other items (e.g., which words are
vey finally details open research questions with an outlook likely to come together in a sentence). The self-attention
towards the possible future work. mechanism is an integral component of Transformers, which
explicitly models the interactions between all entities of a
sequence for structured prediction tasks. Basically, a self-
2 F OUNDATIONS
attention layer updates each component of a sequence by
There exist two key ideas that have contributed towards aggregating global information from the complete input
the development of conventional transformer models. (a) sequence. Lets denote a sequence of n entities (x1 , x2 , · · · xn )
The first one is self-attention, which allows capturing ‘long- by X ∈ Rn×d , where d is the embedding dimension to rep-
term’ dependencies between sequence elements as com- resent each entity. The goal of self-attention is to capture the
pared to conventional recurrent models that find it chal- interaction amongst all n entities by encoding each entity
lenging to encode such relationships. (b) The second key in terms of the global contextual information. This is done
idea is that of pre-training1 on a large (un)labelled corpus in by defining three learnable weight matrices to transform
a (self)supervised manner, and subsequently fine-tuning to Queries (WQ ∈ Rd×dq ), Keys (WK ∈ Rd×dk ) and Values
the target task with a small labeled dataset [3], [7], [38]. Be- (WV ∈ Rd×dv ), where dq = dk . The input sequence X is
low, we provide a brief tutorial on these two ideas (Sec. 2.2 first projected onto these weight matrices to get Q = XWQ ,
and 2.1), along with a summary of seminal Transformer K = XWK and V = XWV . The output Z ∈ Rn×dv of the
1. Several recent Vision Transformers demonstrate that the model
self attention layer is,
can be learned end-to-end on ImageNet-1K without any dedicated pre-
!
training phase [35]–[37]. However, the performance generally remains QKT
Z = softmax p V.
lower than the pre-trained counter-parts. dq
3

Fig. 3: Architecture of the Transformer Model [1]. The model was first developed for the language translation task where an input
sequence in one language is required to be converted to the output sequence in another language. The Transformer encoder
(middle row) operates on the input language sequence and converts it to an embedding before passing it on to the encoder blocks.
The Transformer decoder (bottom row) operates on the previously generated outputs in the translated language and the encoded
input sequence from the middle branch to output the next word in the output sequence. The sequence of previous outputs (used
as input to the decoder) is obtained by shifting the output sentence to the right by one position and appending start-of-sentence
token at the beginning. This shifting avoids the model to learn to simply copy the decoder input to the output. The ground-truth
to train the model is simply the output language sequence (without any right shift) appended with an end-of-sentence token. The
blocks consisting of multi-head attention (top row) and feed-forward layers are repeated N times in both the encoder and decoder.

For a given entity in the sequence, the self-attention basi- stead of static filters (that stay the same for any input) as in
cally computes the dot-product of the query with all keys, the case of convolution. Further, self-attention is invariant
which is then normalized using softmax operator to get the to permutations and changes in the number of input points.
attention scores. Each entity then becomes the weighted sum As a result, it can easily operate on irregular inputs as op-
of all entities in the sequence, where weights are given by posed to standard convolution that requires grid structure.
the attention scores (Fig. 2 and Fig. 3, top row-left block). Furthermore, it has been shown in the literature how self-
Masked Self-Attention: The standard self-attention attention (with positional encodings) is theoretically a more
layer attends to all entities. For the Transformer model [1] flexible operation which can model the behaviour of convo-
which is trained to predict the next entity of the sequence, lutional models towards encoding local features [40]. Cor-
the self-attention blocks used in the decoder are masked to donnier et al. [41] further studied the relationships between
prevent attending to the subsequent future entities. This is self-attention and convolution operations. Their empirical
simply done by an element-wise multiplication operation results confirm that multi-head self-attention (with sufficient
with a mask M ∈ Rn×n , where M is an upper-triangular parameters) is a more generic operation which can model
matrix. The masked self-attention is defined by, the expressiveness of convolution as a special case. In fact,
! self-attention provides the capability to learn the global as
QKT well as local features, and provide expressivity to adaptively
softmax p ◦M ,
dq learn kernel weights as well as the receptive field (similar to
deformable convolutions [42]).
where ◦ denotes Hadamard product. Basically, while pre-
dicting an entity in the sequence, the attention scores of the
future entities are set to zero in masked self-attention. 2.2 (Self) Supervised Pre-training
Multi-Head Attention: In order to encapsulate multiple Self-attention based Transformer models generally operate
complex relationships amongst different elements in the in a two-stage training mechanism. First, pre-training is
sequence, the multi-head attention comprises multiple self- performed on a large-scale dataset (and sometimes a com-
attention blocks (h = 8 in the original Transformer model bination of several available datasets [22], [43]) in either a
[1]). Each block has its own set of learnable weight ma- supervised [11] or a self-supervised manner [3], [44], [45].
trices {WQi , WKi , WVi }, where i = 0 · · · (h−1). For an Later, the pre-trained weights are adapted to the down-
input X, the output of the h self-attention blocks in multi- stream tasks using small-mid scale datasets. Examples of
head attention is then concatenated into a single matrix downstream tasks include image classification [46], ob-
[Z0 , Z1 , · · · Zh−1 ] ∈ Rn×h·dv and projected onto a weight ject detection [13], zero-shot classification [20], question-
matrix W ∈ Rh·dv ×d (Fig. 3, top row). answering [10] and action recognition [18]. The effective-
The main difference of self-attention with convolution ness of pre-training for large-scale Transformers has been
operation is that the filters are dynamically calculated in- advocated in both the language and vision domains. For
4

example, Vision Transformer model (ViT-L) [11] experiences N =6 in Fig. 3), with each block having two sub-layers: a
an absolute 13% drop in accuracy on ImageNet test set multi-head self-attention network, and a simple position-
when trained only on ImageNet train set as compared to the wise fully connected feed-forward network. Residual con-
case when pretrained on JFT dataset [47] with 300 million nections [67] alongside layer normalization [68] are em-
images. ployed after each block as in Fig. 3. Note that, different from
Since acquiring manual labels at a massive scale is cum- regular convolutional networks where feature aggregation
bersome, self-supervised learning has been very effectively and feature transformation are simultaneously performed
used in the pre-training stage. The self-supervision based (e.g., with a convolution layer followed by a non-linearity),
pre-training stage training has played a crucial role in un- these two steps are decoupled in the Transformer model
leashing the scalability and generalization of Transformer i.e., self-attention layer only performs aggregation while the
networks, enabling training even above a trillion parame- feed-forward layer performs transformation. Similar to the
ter networks (e.g., the latest Switch Transformer [10] from encoder, the decoder (bottom row) in the Transformer model
Google). An extensive survey on SSL can be found in [48], comprises six identical blocks. Each decoder block has three
[49]. As nicely summarized by Y. LeCun [50], the basic sub-layers, first two (multi-head self-attention, and feed-
idea of SSL is to fill in the blanks, i.e., try to predict the forward) are similar to the encoder, while the third sub-
occluded data in images, future or past frames in temporal layer performs multi-head attention on the outputs of the
video sequences or predict a pretext task e.g., the amount corresponding encoder block, as shown in Fig. 3.
of rotation applied to inputs, the permutation applied to The original Transformer model in [1] was trained for
image patches or the color of a gray-scale image. Another the Machine Translation task. The input to the encoder is
effective way to impose self-supervised constraints is via a sequence of words (sentence) in one language. Positional
contrastive learning. In this case, nuisance transformations encodings are added to the input sequence to capture the
are used to create two types of modified versions of the same relative position of each word in the sequence. Positional
image i.e., without changing the underlying class semantics encodings have the same dimensions as the input d = 512,
(e.g., image stylizing, cropping) and with semantic changes and can be learned or pre-defined e.g., by sine or cosine
(e.g., replacing an object with another in the same scene, or functions. Being an auto-regressive model, the decoder of
changing the class with minor adversarial changes to the the Transformer [1] uses previous predictions to output the
image). Subsequently, the model is trained to be invariant to next word in the sequence. The decoder, therefore, takes
the nuisance transformations and emphasize on modeling inputs from the encoder as well as the previous outputs
minor changes that can alter semantic labels. to predict the next word of the sentence in the translated
Self-supervised learning provides a promising learning language. To facilitate residual connections the output di-
paradigm since it enables learning from a vast amount of mensions of all layers are kept the same i.e., d = 512.
readily available non-annotated data. In the SSL based pre- The dimensions of query, key and value weight matrices
training stage, a model is trained to learn a meaningful in multi-head attention are set to dq = 64, dk = 64, dv = 64.
representation of the underlying data by solving a pretext
task. The pseudo-labels for the pretext task are automati-
2.4 Bidirectional Representations
cally generated (without requiring any expensive manual
annotations) based on data attributes and task definition. The training strategy of the original Transformer model [1]
Therefore, the pretext task definition is a critical choice in could only attend to the context on the left of a given word
SSL. We can broadly categorize existing SSL methods based in the sentence. This is limiting, since for most language
upon their pretext tasks into (a) generative approaches which tasks, contextual information from both left and right sides
synthesize images or videos (given conditional inputs), (b) is important. Bidirectional Encoder Representations from
context-based methods which exploit the relationships be- Transformers (BERT) [3] proposed to jointly encode the right
tween image patches or video frames, and (c) cross-modal and left context of a word in a sentence, thus improving
methods which leverage from multiple data modalities. the learned feature representations for textual data in an
Examples of generative approaches include conditional gen- self-supervised manner. To this end, BERT [3] introduced
eration tasks such as masked image modeling [43] and two pretext tasks to pre-train the Transformer model [1] in
image colorization [51], image super-resolution [52], image a self-supervised manner: Masked Language Model and Next
in-painting [53], and GANs based methods [54], [55]. The Sentence Prediction. For adapting the pre-trained model for
context-based pretext methods solve problems such as a downstream tasks, a task-specific additional output module
jigsaw puzzle on image patches [56]–[58], masked object is appended to the pre-trained model, and the full model
classification [22], predict geometric transformation such as is fine-tuned end-to-end. Here, we briefly touch upon the
rotation [46], [59], or verify temporal sequence of video pretext tasks. (1) Masked Language Model (MLM) - A
frames [60]–[62]. Cross-modal pretext methods verify the fixed percentage (15%) of words in a sentence are randomly
correspondence of two input modalities e.g., text & image masked and the model is trained to predict these masked
[63], audio & video [64], [65] or RGB & flow [66]. words using cross-entropy loss. In predicting the masked
words, the model learns to incorporate the bidirectional
context. (2) Next Sentence Prediction (NSP) - Given a pair
2.3 Transformer Model of sentences, the model predicts a binary label i.e., whether
The architecture of the Transformer model proposed in [1] the pair is valid from the original document or not. The
is shown in Fig. 3. It has an encoder-decoder structure. The training data for this can easily be generated from any
encoder (middle row) consists of six identical blocks (i.e., monolingual text corpus. A pair of sentences A and B is
5

Fig. 4: A taxonomy of self-attention design space. Existing approaches based on self-attention explore single-head or multi-head
(transformer) designs for vision tasks. We note that interesting efforts have been made to utilize knowledge from convolution
based architectures to improve ViTs (e.g., multi-scale and hybrid designs). We categorize the upcoming sections of this survey
according to the types of self-attention block (left tree diagram) as well as the prominent tasks in computer vision (right).

formed, such that B is the actual sentence (next to A) 50% of


the time, and B is a random sentence for other 50% of the
time. NSP enables the model to capture sentence-to-sentence
relationships which are crucial in many language modeling
tasks such as Question Answering and Natural Language
Inference.
(a) Non-local block [70] (b) Criss-cross attention [72]

Fig. 5: Comparison of two different self-attention approaches:


3 S ELF -ATTENTION & T RANSFORMERS IN V ISION Non-local self-attention block [70] and Criss-cross self-attention
module [72]. Figure is from [72].
We broadly categorize vision models with self-attention
into two categories: the models which use single-head self-
attention (Sec. 3.1), and the models which employ multi- Although the self-attention allows us to model full-
head self-attention based Transformer modules into their image contextual information, it is both memory and com-
architectures (Sec. 3.2). Below, we first discuss the first pute intensive. As shown in Fig. 5(a), in order to encode
category of single-head self-attention based frameworks, global context for a given pixel location, non-local block [70]
which generally apply global or local self-attention within computes a dense attention map (in green). The non-local
CNN architectures, or utilize matrix factorization to enhance block [70] has a high complexity of O(N 2 ), where N de-
design efficiency and use vectorized attention models. We notes the number of input feature maps. To reduce this
then discuss the Transformer-based vision architectures in computational burden, Huang et al. [72] propose the criss-
Sec. 3.2. cross attention module that for each pixel position generates
a sparse attention map only on the criss-cross path, as illus-
3.1 Single-head Self-Attention trated in Fig. 5(b). Further, by applying criss-cross attention
recurrently, each pixel position can capture context from all
3.1.1 Self-Attention in CNNs other pixels. Compared to non-local block, the criss-cross
Inspired by non-local means operation [69] which was uses√11× lesser GPU memory, and has a complexity of
mainly designed for image denoising, Wang et al. [70] pro- O(2 N ). State-of-the-art results are reported [72] for the
posed a differentiable non-local operation for deep neural semantic and instance segmentation tasks on several bench-
networks to capture long-range dependencies both in space mark datasets including Cityscapes [73], ADE20K [74],
and time in a feed-forward fashion. Given a feature map, COCO [75], LIP [76] and CamVid [77].
their proposed operator [70] computes the response at a Another shortcoming of the convolutional operator
position as a weighted sum of the features at all positions comes from the fact that after training, it applies fixed
in the feature map. This way, the non-local operation is weights regardless of any changes to the visual input. Hu
able to capture interactions between any two positions in et al. [78] proposed local relation networks to adaptively
the feature map regardless of the distance between them. compose pixels in a local window. They introduced a new
Videos classification is an example of a task where long- differentiable layer that adapts its weight aggregation based
range interactions between pixels exist both in space and on the compositional relations (similarity) between pix-
time. Equipped with the capability to model long-range els/features within a local window. Such adaptive weight
interactions, [70] demonstrated the superiority of non-local aggregation introduces geometric priors into the network
deep neural networks for more accurate video classification which are important for the recognition tasks [78]. Convo-
on Kinetics dataset [71]. lution is considered to be a top-down operator as it remains
6

fixed across positions while a non-local operation such as all the feature vectors in the local neighbourhood when
introduced in [69] is a bottom-up method as it aggregates deriving the attention vectors. Authors show that with con-
input features over the full image. The local relation layer siderably fewer parameters, self-attention networks (SAN)
belongs to the category of bottom-up methods but it is can beat ResNet baselines on the ImageNet dataset. They
restricted to a fixed window size e.g., 7x7 neighborhood. further show robustness against adversarial perturbations
Bello et al. [79] explore the possibility of employing [83], [84] and generalization to unseen transformations [85].
self-attention as an alternative to convolutional operators. This behaviour is due to the dynamic nature of attention
They employ the relative position encoding [80] in two that makes it difficult for the adversary to calculate useful
dimensions to develop a new self-attention mechanism that fooling directions.
maintains translation equivariance, a desirable property for
handling images. Although this self-attention provides com-
3.2 Multi-head Self-Attention (Transformers)
petitive results as a stand-alone computational primitive,
the best performance is obtained in combination with the Unlike the approaches discussed in Sec. 3.1 which insert
convolutional operations. Authors show that attention aug- self-attention as a component in CNN inspired architectures,
mentation leads to systematic performance gains in image Vision Transformer (ViTs) [11] adapts the architecture of [1]
classification and object detection for different architectures. (see Fig. 3), which cascades multiple Transformer layers.
ViTs have gained significant research attention, and a num-
3.1.2 Self-Attention as Stand-alone Primitive ber of recent approaches have been proposed which build
As discussed above, convolutional layers possess transla- upon ViTs. Below, we discuss these methods by categorizing
tion equivariance but can not scale with a large receptive them into: uniform scale ViTs having single-scale features
field, therefore can not capture long-range interactions [81]. through all layers (Sec. 3.2.1), multi-scale ViTs that learn
On the other hand, global attention [1] which attend to hierarchical features which are more suitable for dense
all spatial locations of the input can be computationally prediction tasks (Sec. 3.2.2), and hybrid designs having
intensive and is preferred on down-sampled small images, convolution operations within ViTs (Sec. 3.2.3).
image patches [11] or augmenting the convolutional features
space [79]. Ramachandran et al. [81] proposed to replace 3.2.1 Uniform-scale Vision Transformers
convolutional layers in deep neural networks with a local The original Vision Transformer [11] model belongs to this
self-attention layer which can be applied to small or large family, where the multi-head self-attention is applied to a
inputs without increasing the computational cost. At a basic consistent scale in the input image where the spatial scale is
level, the proposed self-attention layer [81] considers all maintained through the network hierarchy. We name such
pixel positions in a specific window size around a given models as the uniform-scale ViTs, as described below.
pixel, compute queries, keys and value vectors for these Vision Transformer (ViT) [11] (Fig. 6) is the first work
pixels, and then aggregates the spatial information within to showcase how Transformers can ‘altogether’ replace
this window. The value vectors are aggregated after pro- standard convolutions in deep neural networks on large-
jecting the softmax score of queries and keys. This process scale image datasets. They applied the original Transformer
is repeated for all given pixels and the response is concate- model [1] (with minimal changes) on a sequence of image
nated to produce the output pixel. ResNet models with local ’patches’ flattend as vectors. The model was pre-trained
self-attention layer can solve ImageNet and COCO object on a large propriety dataset (JFT dataset [47] with 300
detection with fewer parameters as compared to ResNet million images) and then fine-tuned to downstream recog-
models based on convolutional layers [81]. nition benchmarks e.g., ImageNet classification. This is an
Zhao et al. [82] note that a traditional convolution important step since pre-training ViT on a medium-range
operator performs feature aggregation and transformation dataset would not give competitive results, because the
jointly (by applying a filter and then passing it through CNNs encode prior knowledge about the images (inductive
a non-linearity). In contrast, they propose to perform fea- biases e.g., translation equivariance) that reduces the need of
ture aggregation separately with self-attention followed by data as compared to Transformers which must discover such
transformation using an element-wise perceptron layer. For information from very large-scale data. Notably, compared
feature aggregation, they propose two alternate strategies: to the iGPT [19] model that also applied Transformers to
(a) pairwise self-attention and (b) patch-wise self-attention. full-sized images but performs training as a generative task,
The pairwise self-attention is permutation and cardinality ViT pre-trains the model with a supervised classification
invariant operation, while the patch-wise self-attention does task (although a self-supervision variant is also explored
not have such invariance properties (similar to convolu- which results in a less performance).
tion). Both pairwise and patch-wise self-attentions are im- The DeiT [12] is the first work to demonstrate that
plemented as a vector attention [82] that learns weights for Transformers can be learned on mid-sized datasets (i.e., 1.2
both the spatial and channel dimensions. This provides an million ImageNet examples compared to 300 million images
alternate approach for attention that is conventionally per- of JFT [11] used in ViT [11]) in relatively shorter training
formed using scalar weights (by taking a dot-product). The episodes. Besides using augmentation and regularization
pairwise self-attention is a set operator that computes a vec- procedures common in CNNs, the main contribution of
tor attention keeping in view the relationships of a particular DeiT [12] is a novel native distillation approach for Trans-
feature with its neighbors in a given local neighborhood. formers which uses a CNN as a teacher model (RegNetY-
In contrast, patch-wise self-attention is a generalization of 16GF [86]) to train the Transformer model. The outputs
the convolution operator (not a set operator) and looks at from the CNN aid the Transformer in efficiently figuring
7

supervised and fully supervised image classification and


dense prediction (detection, segmentation). DeepViT [92]
observes that the similarity between attention maps of
deeper layer is high and hinders scaling models depth.
They propose to re-attend the attention maps in a multi-
head block instead of simple aggregation of these attention
maps, and show consistent gains over standard multi-head
self attention based ViTs.

3.2.2 Multi-scale Vision Transformers


Fig. 6: An overview of Vision Transformer (on the left) and the
details of Transformer encoder (on the right). The architecture
resembles Transformers used in the NLP domain and the image In standard ViTs, the number of the tokens and token feature
patches are simply fed to the model after flattening. After dimension are kept fixed throughout different blocks of
training, the feature obtained from the first token position is the network. This is limiting, since the model is unable
used for classification. Image obtained from [11]. to capture fine spatial details at different scales. Initial
Transformer based dense prediction methods (e.g., DETR
[13]) therefore have a convolutional backend. Multi-stage
out useful representations for input images. A distillation hierarchical design for ViTs, where number of tokens is
token is appended with the input patch embeddings and gradually reduced while the token feature dimension is
the class token. The self-attention layers operate on these progressively increased, has been shown to produce ef-
tokens to learn their inter-dependencies and outputs the fective features for dense prediction tasks [36], [93]–[96].
learned class, patch, and distillation tokens. The network is These models generally also perform well for recognition
trained with a cross-entropy loss defined on the output class tasks. These architectures mostly sparsify tokens by merg-
token and a distillation loss to match the distillation token ing neighboring tokens and projecting them to a higher
with the teacher output. Both soft and hard label choices dimensional feature space. Examples of multi-stage ViTs
were explored for distillation, where the hard distillation include Pyramid ViT [93], [97], Twins [37], CoaT [98], Swin
was found to perform better. Interestingly, the learned class Transformer [36], Convolutional vision Transformer (CvT)
and distillation tokens do not exhibit a high correlation indi- [96], Shuffle Transformer [95], CrossFormer [99], RegionViT
cating their complementary nature. The learned representa- [100] and Focal Transformer models [94]. Some of them are
tions compare favorably well against top-performing CNN hybrid designs (with both convolution and self-attention
architectures such as EfficientNet [87] and also generalize operations, see Sec. 3.2.3), while others only employ pure
well for a number of downstream recognition tasks. self-attention based design (discussed next).
Token to Token (T2T) ViT [35] recursively combines Pyramid ViT (PVT) [93] is the first hierarchical design
neighboring tokens into a single token to reduce tokens for ViT, and proposes a progressive shrinking pyramid
length and aggregate spatial context. Transformer in Trans- and spatial-reduction attention. PVTv2 [97] and SegFormer
former [88] computes attention at two levels: patch-level [101] improve original PVT [93] by introducing overlapping
(as done is standard ViTs [11]) and local sub-patch-level patch embedding, depth-wise convolution, and efficient
(e.g.by subdividing a 16 × 16 patch into four 4 × 4 blocks, attention. Swin Transformer [36] has a multi-stage hierar-
and computing attention amongst these blocks). In token chical architecture which computes attention within a local
labelling ViT [89], all patch tokens contribute towards loss window, by partitioning the window into multiple sub-
calculation, different from regular ViTs that only use clas- patches. To capture interactions between different windows
sification token in the loss. This process includes auxiliary (image locations), window partitioning is gradually shifted,
supervision where each image-patch (token) is labeled using along the hierarchy of the network, to capture overlapping
a pre-trained CNN model. Similar to CutMix augmentation regions. Focal Transformer models [94] is another hierar-
[90], tokens from different images are mixed as an augmen- chical design, where focal self-attention is introduced to
tation strategy, and the model is trained using the standard simultaneously capture global and local relationships. Simi-
classification loss and auxiliary token-label loss. Their model larly, CrossFormer [99] has a hierarchical pyramid structure,
demonstrates excellent performance specially for smaller and introduces cross-scale embedding module, along-with
sized models. long short distance attention and dynamic position bias
The quadratic complexity of self-attention hinders its to faithfully capture both local and global visual cues.
applicability to longer sequences (high-resolution images). RegionViT [100] proposes a regional-to-local attention to
Cross-Covariance Image Transformers (XCiT) [91] incor- encode hierarchical features. Multi-Scale Vision Longformer
porate attention across feature-channels instead of to- [102] also considers a local context in self-attention, but
kens, i.e., their cross-covariance
 attention is given by employs the efficient Longformer [103] design for self-
T
K√ QT
Vsoftmax τ
. The proposed cross-covariance atten- attention. CrossViT [104] encodes multi-scale features with
tion has linear complexity (since it depends upon feature two branches (each with multiple transformer blocks), by
dimension instead of the number of tokens). XCiT can separately processesing smaller and larger image patches.
therefore handle large resolution images and demonstrate The information from these two multi-scale bracnches is
excellent performance across different vision tasks i.e., self- then fused together using a cross-attention module.
8

3.2.3 Hybrid ViTs with Convolutions and encodes relationships between tokens at multiple scales
using cross-attention. Twins [37] builds upon PVT [93] (an
Convolutions do an excellent job at capturing low-level local attention only pyramid design), by replacing the absolute
features in images, and have been explored in multiple hy- position embedding in PVT with relative conditional po-
brid ViT designs, specially at the beginning to “patchify and sition embedding [113], and incorporating the separable
tokenize” an input image. For example, Convolutional vi- depth-wise convolutions instead of the standard spatial
sion Transformer (CvT) [96] incorporate convolution based attention, to capture local and global context of the image. In
projection to capture the spatial structure and low-level this sense, the hybrid designs tend to combine the strengths
details, for tokenization of image patches. CvT has a hier- of both convolution and transformer models. TransCNN
archical design, where number of tokens is progressively re- [114] propose a hierarchical multi-head self attention block,
duced while the token-width is increased, thus imitating the which first learns interactions within small grids (tokens)
impact of spatial downsampling as in CNNs. Convolution using self-attention, and then gradually merges the smaller
enhanced image Transformers [105] employ convolutions grids into larger grids. The proposed block can then be
based image-to-token module to extract low-level features. plugged into existing CNN architectures.
Compact Convolutional Transformer (CCT) [106] introduces
a new sequence pooling scheme, and incorporates convolu- 3.2.4 Self-Supervised Vision Transformers
tional blocks (conv-pool-reshape) for tokenization. CCT can
Contrastive learning based self-supervised approaches,
be trained from scratch on smaller datasets, e.g., CIFAR10
which have gained significant success for CNN based vision
with ∼ 95% accuracy, which is a remarkable property not
tasks, have also been investigated for ViTs. Chen et al. [115]
possible with the traditional ViTs.
evaluate different self-supervised frameworks and propose
LocalViT [107] introduces depthwise convolutions to en- practical strategies including MoCo v3 (extended from
hance local features modeling capability of ViTs. LeViT [108] v1/v2 [116], [117]) for stabilized training of self-supervised
(name inspired from LeNet [109]) applies a four-layered ViTs. Xie et al. [118] combine MoCo v2 [117] and BYOL [119]
CNN block (with 3 × 3 convolutions) at the beginning with to train DeiT [12] and SwinTransformer [36]. They demon-
progressively increasing channels (3,32,64,128,256). For a strate generalization of self-supervised SwinTransformer
3×224×224 input image, the resulting 256×14×14 output for dense prediction tasks of detection and segmentation.
from the CNN block becomes input to a hierarchical ViT. Self distillation with no labels (DINO) [120] demonstrate
By virtue of its design, LeViT is 5× faster than EfficientNet that self-supervised ViTs can automatically segment the
[87] on CPU, at inference. ResT [110] is another hierarchical background pixels of an image, even though they were
architecture which applies a CNN block at the beginning for never trained using pixel-level supervision, a phenomena
patch-embedding. It incorporates depth-wise convolutions otherwise not observed in CNNs or fully supervised ViTs.
and adaptive position encoding to tackle varying image Efficient self-supervised vision transformer (EsViT) [121]
sizes. A recent approach NesT [111] proposes a simple propose a multi-stage design, where neighboring tokens are
technique to introduce hierarchy in ViTs. NesT divides an gradually merged along the hierarchy of the network, and
image into non-overlapping blocks (each block is further use DINO for self-supervision. Apart from standard image-
split into patches). It first separately applies local self- level self-supervision as in DINO, they incorporate addi-
attention on patches within each block, and then enables tional patch-level self-supervision in which correspondence
global interaction between blocks by aggregating them into is promoted between similar patches within augmented
an image space and applying convolution operation, fol- versions of an image. EsViT demonstrates excellent perfor-
lowed by downsampling. The number of blocks is gradually mance under self-supervision settings, and its off-the-shelf
reduced along the hierarchy of the model, while number features transfer better than supervised SwinTransformer on
of local-patches is kept fixed. This simple scheme performs 17 out of 18 evaluated datasets.
favorably compared with more sophisticated designs [36],
[97], and enables training NesT on smaller datasets (e.g.,
CIFAR-10) from scratch. 3.3 Transformers for Object Detection
Depthwise Convolution and self-Attention Networks Transformers based modules have been used for object
(CoAtNets) [112] introduce a relative attention mod- detection in the following manner: (a) Transformer back-
ule (which combines depthwise convolutions and self- bones for feature extraction, with a R-CNN based head
attention), and vertically stack convolution and attention for detection (see Sec. 3.2.2), (b) CNN backbone for visual
layers. CoAtNets demonstrate an impressive 86% Ima- features and a Transformer based decoder for object detec-
geNet top-1 accuracy without extra data (i.e. trained only tion [13], [14], [122], [123] (see Sec. 3.3.1, and (c) a purely
on ImageNet-1k). Shuffle Transformer [95] performs self- transformer based design for end-to-end object detection
attention within a window and has depth-wise convolutions [124] (see Sec. 3.3.2).
between the window-based multi-head self-attention and
MLP. It introduces a shuffle operation to build stronger 3.3.1 Detection Transformers with CNN Backbone
cross-patch connections. Co-scale conv-attentional image Detection Transformer (DETR) [13] treats object detection
Transformers (CoaT) [98], is a hybrid hierarchical pyramid as a set prediction task i.e., given a set of image features,
design, with serial and parallel blocks, where the serial the objective is to predict the set of object bounding boxes.
block is similar to standard transformer block except for The Transformer model enables the prediction of a set of
the attention layer replaced with depthwise convolution. objects (in a single shot) and also allows modeling their
The parallel blocks is applied on the output of serial blocks relationships. DETR adapts a set loss function which allows
9

Fig. 7: Detection Transformer (DETR) [13] treats the object


detection task as a set prediction problem and uses the Trans-
former network to encode relationships between set elements.
A bipartite set loss is used to uniquely match the box predic- Fig. 8: Axial attention module [133] that sequentially applies
tions with the ground-truth boxes (shown on the right two multi-head axial attention operations along height and width
columns). In case of no match, a ’no object’ class prediction axes. Image from [133].
is selected. Its simple design with minimal problem-specific
modifications can beat a carefully built and popular Faster R-
CNN model. Figure from [13]. requires a large number of training epochs to tune attention
weights to converge to meaningfully sparse locations. This
approach contributes to a slow convergence rate of DETR.
bipartite matching between predictions and ground-truth To mitigate the above-mentioned issues, [14] proposed a
boxes. The main advantage of DETR is that it removes deformable attention module to process the feature maps.
the dependence on hand-crafted modules and operations, Inspired from deformable convolutions [42], deformable
such as the RPN (region proposal network) and NMS (non- attention module [14] only attends to sparse set of elements
maximal suppression) commonly used in object detection from the whole feature map regardless of its spatial size.
[125]–[129]. In this manner, the dependence on prior knowl- This further allows cross-scale aggregation of feature maps
edge and careful engineering design is relaxed for complex with the help of multi-scale attention modules without
structured tasks like object detection. increasing the computational cost significantly. Deformable
Given spatial feature maps from the CNN backbone, the DETR not only performs better but its training time also
encoder first flattens the spatial dimensions (see Fig. 7). This remains 10× lower than the original DETR model [14].
gives a sequence of features d × n, where d is the feature Anchor DETR [122] replaces the learnable query tokens in
dimension and n = h × w with h, w being the height [13] with anchor-point based queries, such that each query
and width of the spatial feature maps. These features are focuses on predicting the object near the anchor point. The
then encoded and decoded using multi-head self-attention anchor points can be fixed on 2D grid, or learned from
modules as in [1]. The main difference in the decoding uniformly distributed points. Anchor DETR [122] requires
stage is that all boxes are predicted in parallel while [1] 10 × fewer training epochs with comparable performance.
uses an RNN to predict sequence elements one by one. Pix2Seq [123] is a generic Transformer-based framework,
Since the encoder and decoder are permutation invariant, without any specialized task-specific modules, and learns
learned positional encodings are used as the object queries to directly produce a sequence of tokens with object de-
by the decoder to generate different boxes. Note that the scriptions (bounding-boxes and class-labels). A quantization
spatial structure in a CNN detector (e.g., Faster R-CNN) and serialization scheme first converts bounding boxes and
automatically encodes the positional information. DETR class-labels into a sequence of discrete tokens. A generic
obtains performance comparable to the popular Faster R- Transformer based encoder-decoder network is then used
CNN model [125] which is an impressive feat given its to generate these tokens in an auto-regressive manner con-
simple design. The DETR has also been extended to inter- ditioned on previous predictions and image features.
esting applications in other domains, e.g., Cell-DETR [130]
extends it for instance segmentation of biological cells. A 3.3.2 Detection with Pure Transformers
dedicated attention branch is added to obtain instance-wise You Only Look at One Sequence (YOLOS) [124] is a sim-
segmentations in addition box predictions that are enhanced ple, attention-only architecture directly built upon the ViT
with a CNN decoder to generate accurate instance masks. [1], [132]. It replaces the class-token in ViT with multiple
The DETR [13] model successfully combines convolu- learnable object query tokens, and the bipartite matching
tional networks with Transformers [1] to remove hand- loss is used for object detection similar to [13]. YOLOS
crafted design requirements and achieves an end-to-end demonstrates the flexibility of ViTs to object detection, in a
trainable object detection pipeline. However, it struggles pure sequence-to-sequence learning manner, with minimal
to detect small objects and suffers from slow convergence image related 2D inductive biases. In similar spirit, PVT [93]
and a relatively high computational cost [14]. DETR maps is combined with DETR [13] to perform object detection
images to features space before using the Transformer for with an end-to-end transformer pipeline. We note that it is
the relation modeling. Thus, the computational cost of self- feasible to combine other recent ViTs with transformer based
attention grows quadratically with the spatial size of the detection heads as well to create pure ViT based designs
feature map i.e., O(H 2 W 2 C), where H and W represent [124], and we hope to see more such efforts in future.
the height and width of the feature map. This inherently
puts a limitation on the use of multi-scale hierarchical
features [131] in DETR training framework which is ulti- 3.4 Transformers for Segmentation
mately important to detect small objects. Furthermore, at the Self-attention can be leveraged for dense prediction tasks
beginning of training, the attention module simply projects like image segmentation that requires modeling rich interac-
uniform attention to all the locations of the feature map and tions between pixels. Below, we discuss axial self-attention
10

[133], a cross-modal approach [15] that can segment regions


corresponding to a given language expression, and ViTs
based segmentation architectures [101], [134], [135].
Panoptic segmentation [136] aims to jointly solve the
otherwise distinct tasks of semantic segmentation and in-
stance segmentation by assigning each pixel a semantic label
and an instance id. Global context can provide useful cues
to deal with such a complex visual understanding task.
Self-attention is effective at modeling long-range contextual
information, albeit applying it to large inputs for a dense
prediction task like panoptic segmentation is prohibitively
expensive. A naive solution is to apply self-attention either
to downsampled inputs or to limited regions around each
pixel [81]. Even after introducing these constraints, the self-
attention still has quadratic complexity and sacrifices the
global context. To tackle these issues, Wang et al. [133]
propose the position-sensitive axial-attention where the 2D
self-attention mechanism is reformulated as two 1D axial-
attention layers, applied to height-axis and width-axis se-
quentially (see Fig. 8). The axial-attention is compute effi-
cient and enables models to capture the full-image context.
It achieves competitive performance for the panoptic seg- Fig. 9: (a) Self-attention block in Image Transformer [142].
Given one channel for a pixel q , the block attends to the mem-
mentation task on COCO [75], Mapillary Vistas [137], and ory of previous synthesized pixels (mi ), followed by a feed-
Cityscapes [73] benchmarks and for the image classification forward sub-network. Positional encodings pi are added in the
on ImageNet dataset [138]. first layer. (b) The operation performed in Local Self-Attention
Cross-modal Self-attention (CMSA) [15] encodes long- (example of a 2D case is shown). The image is partitioned into
range multi-modal dependencies between linguistic and a grid of spatial blocks known as query blocks. In the self-
visual features for referring image segmentation task, that aims attention operation, each pixel in a query block attends to all
pixels in the memory block (shown in cyan rectangle). White
to segment entities in an image referred by a language grid locations show masked inputs that have zero-contribution
description.For this purpose, a set of cross-modal features is towards the self-attention.
obtained by concatenating image features with each word
embedding and the spatial coordinate features. The self-
attention operates on these features and generates attention
the perspective of generative modeling and learning unsu-
over the image corresponding to each word in the sentence.
pervised representations for down-stream tasks.
The segmentation network then performs self-attention at
multiple spatial levels and uses a gated multi-level fusion Parmar et al. [142] develop an image generation model
module to refine segmentation masks via information ex- that can sequentially predict each pixel of an output image
change across multi-resolution features. A binary CE loss is given its previously generated pixels (Fig. 9). Their approach
used to train the overall model that achieves good improve- models the joint distribution of the image pixels by factor-
ments on UNC [139], G-Ref [140] and ReferIt [141] datasets. izing it as a product of pixel-wise conditional distributions.
While the segmentation approaches discussed above in- Previously developed auto-regressive models for this task,
sert self-attention in their CNN based architectures, some such as the PixelCNN [147], suffer from a limited receptive
recent works have proposed transformer based encoder- field which hinders in modeling long term relationships in
decoder architectures. Segmentation Transformer (SETR) an image e.g., part relationships or occlusions. Using self-
[134] has a ViT encoder, and two decoder designs based attention, [142] enhances the receptive field without incur-
upon progressive upsampling, and multi-level feature ag- ring a high computational cost (e.g., effective receptive field
gregation. SegFormer [101] has a hierarchical pyramid ViT up to 256 pixels can be achieved as compared to 25 pixels
[93] (without position encoding) as an encoder, and a simple of PixelCNN [147]). The generative pipeline was also tested
MLP based decoder with upsampling operation to get the on conditional generation tasks e.g., image super-resolution,
segmentation mask. Segmenter [135] uses ViT encoder to image completion, and denoising.
extract image features, and the decoder is a mask Trans- Inspired by the success of GPT model [5] in the lan-
former module which predicts segmentation masks, using guage domain, image GPT (iGPT) [143] demonstrated that
learnable mask tokens and image-patch tokens as inputs. such models can be directly used for image generation
The authors also propose a baseline linear decoder which tasks, and to learn strong features for downstream vision
projects the patch-embeddings to classification space, thus tasks (e.g., image classification). Specifically, iGPT trains
producing coarse patch-level labels. GPT v2 model [5] on flattened image sequences (1D pixel
arrays) and shows that it can generate plausible image
outputs without any external supervision. The generated
3.5 Transformers for Image and Scene Generation samples depict the model’s ability to understand spatial
Here, we discuss Transformer-based architectures [23], relationships between pixels and high-level attributes such
[142]–[146] for image synthesis, which is interesting from as object classes, texture, and scale. Notably, the design
11

does not use any image-specific knowledge in the design conditioned scene generation task. Given the empty room
(e.g., the 2D position embeddings used in Image Trans- shape, [23] can propose new object configurations in the
former [142]). The features learned with iGPT’s unsuper- room while maintaining realism. Remarkably, the model
vised training mechanism compete impressively against does not use any appearance information and only learns to
other unsupervised approaches, achieving state-of-the-art generate new scenes by modeling the inter-object relation-
performance on CIFAR-10/100 [148] and STL [149] datasets ships using self-attention in Transformers. Similar to how
while performing comparably to SimCLR (a contrastive a Transformer operates on a sentence, it is applied to a
learning approach) [150] on ImageNet dataset. This is an sequence of objects to predict the next suitable object in a
astounding result, since the iGPT architecture is exactly the scene. Specifically, the size, pose, location, and category of
same as used for language modeling tasks, and therefore it the next object is predicted by the Transformer model. A
does not incorporate any prior domain-specific knowledge. start token indicates the initiation of inference and the num-
Notably, the competing unsupervised CNN based solutions ber of output token indicate the objects generated by the
widely adopt such priors in the form of architectural design, model in a sequence. The authors also explore generating
attention mechanisms, loss functions, and regularization new scenes given a textual description of the room layout.
[117], [151]–[154]. However, on the downside, iGPT has a The independence from the appearance makes the approach
high compute cost e.g., iGPT-L version has roughly 36× efficient, enabling interactive scene generation.
high training cost compared to MoCo [117] which is a The task of generating realistic images from text is inter-
state of the art self-supervised feature learning approach. esting and practically valuable (e.g., for artistic content cre-
For this reason, the training was generally limited to low- ation), but at the same time highly challenging. Prior text-to-
resolution of ≤ 64 × 64, while convolutional architectures image synthesis approaches [158]–[161] are mostly based on
can effectively learn from high-resolution inputs. GANs [54]. Although these methods produce encouraging
Transformers typically incur a high compute cost when results, they are far from being photo-realistic. Ramesh et
applied on high-dimensional sequences. To overcome this al. [20] recently proposed DALL·E which is a Transformer
limitation, Esser et al. [144] proposed to include inductive model capable of generating high-fidelity images from a
biases (commonly used in the CNNs) alongside Transform- given text description. DALL·E model has 12 billion param-
ers to improve their efficiency. Specifically, local connectivity eters and it is trained on a large set of text-image pairs taken
and spatial invariance biases inbuilt in the CNN structure from the internet. Before training, images are first resized
are leveraged by learning a rich dictionary of visual patterns to 256×256 resolution, and subsequently compressed to
(using a Generative Adversarial approach). A Transformer a 32×32 grid of latent codes using a pre-trained discrete
is then used to learn the long-range interactions between variational autoencoder [162], [163]. DALL·E takes as input
the dictionary items to generate the outputs. In turn, they a single stream of 1280 tokens (256 for the text and 1024
develop a conditional image generation model capable of for the image), and is trained to generate all other tokens
producing very high-resolution images (up to megapixel autoregressively (one after another). It provides flexibility
range) using Transformers. This is the first work that to generate images either from scratch (Fig. 10a) or by
demonstrates the application of Transformers to generate extending existing images (Fig. 10b), while staying faithful
such high-resolution images. to the text caption.
Generative Adversarial Networks (GANs) [54] with The authors demonstrate the effectiveness of DALL·E by
CNNs as default backbone have been very successful for creating images from text describing a wide variety of real
visually appealing image synthesis [155]–[157]. TransGAN and fictional concepts. While generating images purely from
[145] builds a strong GAN model, free of any convolution textural captions, DALL·E shows impressive performance at
operation, with both generator and discriminator based controlling multiple objects and their attributes (Fig. 10c),
upon the Transformer model [1]. The architecture of both rendering certain viewpoint (Fig. 10d), capturing object’s
generator and discriminator is based upon the encoder in internal structure (Fig. 10e), and combining unrelated ob-
original Transformer model [1]. For memory efficiency, the jects (Fig. 10f). Furthermore, DALL·E can perform image-to-
generator contains multiple stages, with up-sampling mod- image translation (Fig. 10g) guided by the input text.
ules in-between, which gradually increase the resolution
of feature maps (input sequence length) while reducing
the embedding dimension. The discriminator of TransGAN 3.6 Transformers for Low-level Vision
takes flattened image-patches as tokens similar to [132]. After witnessing the success of Transformer models in high-
Authors introduce different training techniques including level vision problems, numerous Transformer-based meth-
data augmentation, training with an auxiliary task and ods have been proposed for low-level vision tasks, including
injecting locality to self-attention to scale-up their model image super-resolution [16], [19], [164], denoising [19], [165],
for high quality image synthesis [144]. The TransGAN deraining [19], [165], and colorization [24]. Image restoration
model achieves state-of-the-art results in terms of Inception requires pixel-to-pixel correspondence from the input to the
Score and Fréchet Inception Distance (FID) on STL-10 and output images. One major goal of restoration algorithms
performs favorably compared with their CNN-based GAN is to preserve desired fine image details (such as edges
counterparts on other datasets. and texture) in the restored images. CNNs achieve this by
Unlike previous image generation methods [142]–[144], employing a single-scale architecture design that does not
which directly predict image outputs, [23] learns to generate involve any downsampling operation. Since the computa-
parameters of 3D objects to be placed in a given scene. tional complexity of self-attention in Transformer models
Specifically, SceneFormer [23] studies the 3D room layout increases quadratically with number of image patches, it is
12

(a) (b) (c) (d) (e) (f) (g)


Fig. 10: Images generated by DALL·E [20] from the following text prompts. (a) An armchair in the shape of an avocado. (b) A photo
of San Francisco’s golden gate bridge. Given a part of the image (in green box), DALL·E performs the image completion. (c) An emoji
of a baby penguin wearing a blue hat, red gloves, green shirt, and yellow pants. (d) An extreme close-up view of a capybara sitting in a field.
(e) A cross-section view of a pomegranate. (f) A penguin made of watermelon. (g) The exact same cat on the top as a sketch on the bottom.

infeasible to develop Transformer model that can operate on task, can provide significant performance gains over the
single-scale feature processing pipeline. Consequently, these state-of-the-art methods [167]–[169].
Transformer-based image restoration models make use of
various strategies to reduce the computational burden, such 3.6.2 Transformers for Super-Resolution
as computing attention on local image windows [164], per- Recent years have seen major performance breakthroughs
forming spatial reduction attention [166], and employing for super-resolution (SR) due to convolutional neural net-
encoder-decoder design [19], [165]. Here, we briefly discuss works (CNNs). Principally, the quality of super-resolved
a few image restoration Transformer models. images generated by CNNs is dependent on the choice of
optimization objective. While the SR methods [167], [170]–
3.6.1 Transformers for Image Processing Tasks [173] that are based on pixel-wise loss functions (e.g., L1,
MSE, etc.) yield impressive results in terms of image fi-
Top performing algorithms for high-level computer vision delity metrics such as PSNR and SSIM, they struggle to
tasks such as object detection and semantic segmentation recover fine texture details and often produce images that
often employ backbone models that are pre-trained on large- are overly-smooth and perceptually less pleasant. Further,
scale datasets e.g., ImageNet. In contrast, algorithms for low- perceptual SR approaches [52], [174]–[177], in addition to
level vision tasks such as image denoising, super-resolution, per-pixel loss, employ adversarial loss [54] and perceptual
and deraining are directly trained on task-specific data, loss [178] based on deep features extracted from pre-trained
thereby suffer from these limitations: (i) small number of im- CNNs. While these methods generate images that are sharp,
ages available in task-specific datasets (e.g., the commonly visually pleasant, and perceptually plausible, they show a
used DIV2K dataset for image super-resolution contains substantial decrease in reconstruction accuracy measured in
only 2000 images), (ii) the model trained for one image PSNR/SSIM. Moreover, the perceptual SR algorithms have a
processing task does not adapt well to other related tasks. tendency to hallucinate fake textures and cause artifacts. The
Chen et al. [19] propose a pre-trained model based on above mentioned SR approaches follow two distinct (but
Transformer architecture, named as Image Processing Trans- conflicting) research directions: one maximizing the recon-
former (IPT). It is capable of performing various image struction accuracy and the other maximizing the perceptual
restoration tasks such as super-resolution, denoising, and quality, but never both.
deraining. The overall architecture of IPT consists of multi- To alleviate the trade-off between perceptual reproduc-
heads and multi-tails to deal with different tasks separately, tion and accurate reproduction, Yang et al. [16] propose a
and a shared encoder-decoder Transformer body. Since ex- Transformer network (TTSR) for super-resolution. During
ploiting Transformers at full potential requires training on training, TTSR uses paired LR-HR images, as well as ref-
large-scale data, [19] takes the clean (ground-truth) images erence (Ref) images with similar content as of LR images.
from the ImageNet benchmark and synthesize their de- TTSR learns to search relevant regions in the Ref image and
graded versions for different tasks. For example, bicubic in- transfers rich textures to help super-resolving the input LR
terpolation is used for generating low-resolution images, ad- image. The texture Transformer module of TTSR method
ditive white Gaussian noise is added to prepare noisy data, (see Fig. 11) consists of four core components: (1) Learnable
and hand-crafted rain streaks are applied to obtain rainy texture extractor: takes as input LR↑, Ref↓↑, and Ref images,
images. In total, 10 million images are used to pre-train the and generates texture features query (Q), key (K), and value
IPT model. During training, each task-specific head takes as (V), respectively. Here, ↑ denotes bicubic upsampling opera-
input a degraded image and generates visual features. These tion, and ↓↑ represents bicubic down-sampling followed by
feature maps are divided into small crops and subsequently an upsampling operation. (2) Relevance embedding: first un-
flattened before feeding them to the Transformer encoder folds Q and K into patches and then computes the similarity
(whose architecture is the same as [1]). The outputs of the of each patch in Q with each patch in K in order to generate
encoder along with the task-specific embeddings are given hard and soft attention maps. (3) Hard-attention: transfers
as input to the Transformer decoder. The features from the HR texture features from V to (LR features) Q using the hard
decoder output are reshaped and passed to the multi-tail attention map. (4) Soft-attention: further enhances relevant
that yields restored images. The IPT model is optimized features while suppressing less relevant ones.
with L1 loss. Experimental results show that the pre-trained While TTSR [16] method deals with reference-based
IPT model, when fine-tuned for a specific low-level vision image super-resolution, most of the research is conducted
13

attention to all pixels in a given row, while the column-wise


attention layer considers pixels only in a given column of
an image. This work [24] is the first successful application
of Transformers trained to colorize grey-scale images at high
(256×256) resolution.

3.7 Transformers for Multi-Modal Tasks


Transformer models have also been extensively used for
vision-language tasks such as visual question answering
(VQA) [183], visual commonsense reasoning (VSR) [184],
cross-modal retrieval [185] and image captioning [29]. Sev-
eral works in this direction target effective vision-language
pre-training (VLP) on large-scale multi-modal datasets to
learn generic representations that effectively encode cross-
modality relationships (e.g., grounding semantic attributes
of a person in a given image). These representations can
Fig. 11: Diagram of the texture Transformer module. Q (query), then be transferred to downstream tasks, often obtaining
K (key) and V (value) represent texture features extracted from state of the art results. Notably, several of these models
a (bicubic upsampled) low-resolution image, a sequentially still use CNNs as vision backbone to extract visual features
down/upsampled reference image, and an original reference while Transformers are used mainly used to encode text
image, respectively. The relevance embedding aims to estimate followed by the fusion of language and visual features.
similarity between low-resolution and reference images. H
Such models generally apply the vanilla multi-layer Trans-
and S respectively denote hard and soft attentions computed
from relevance embedding. T indicates high-resolution texture former [1] with multi-modal inputs and do not introduce
features that are then transferred to the features F of low- fundamental changes to the core attention block. However,
resolution image. Figure is from [16]. their main distinction is in the configuration of Transformers
and the loss functions, based on which we categorize them
into: (a) Multi-stream Transformers (see Sec. 3.7.1) and (b)
on single image super-resolution problem in which only Single-stream Transformers (see Sec. 3.7.2). The single-stream
LR-HR paired images are available. Since the computa- designs feed the multi-modal inputs to a single Transformer
tional complexity of the original self-attention operation while the multi-stream designs first use independent Trans-
is prohibitively high for high-resolution images, recently formers for each modality and later learn cross-modal repre-
a few efficient transformer models have been proposed sentations using another Transformer (see Fig. 12). Besides
that employ window-based attention (SwinIR [164]) and these vision language pretraining methods, we also explain
spatial resolution reduction operation in attention module visual grounding approaches towards the end of this section
(ESRT [166]) to perform super-resolution. (see Sec. 3.7.3).

3.6.3 Colorization Transformer 3.7.1 Multi-stream Transformers


Given a grayscale image, colorization seeks to produce the Vision and Language BERT (ViLBERT) [63] was the first
corresponding colorized sample. It is a one-to-many task as extension of the BERT model to the multi-modal domain.
for a given grayscale input, there exist many possibilities The goal was to learn representations that can jointly model
in the colorized output space. The challenging nature of images and natural language. For this purpose, ViLBERT
this task requires probabilistic models capable of produc- developed a two-stream architecture where each stream is
ing multiple colorized output samples. Colorization Trans- dedicated to model the vision or language inputs (Fig. 12-h).
former [24] is a probabilistic model based on conditional The architecture of both parallel streams is a series of Trans-
attention mechanism [179]. It divides the image colorization former blocks similar to the BERT model. Subsequently, co-
task into three sub-problems and proposes to solve each attentional Transformer layers are applied to learn cross-
task sequentially by a different Transformer network. The modal relationships. The co-attentional framework is very
authors first train a Transformer network to map a low- simple. Query, key, and value matrices are computed for
resolution grey-scale image to a 3-bit low-resolution col- each modality in the standard way [1] and then key-value
ored image. Low-resolution images in turn allow training pairs for one modality are passed on to the other modality’s
of larger models. The 3-bit low-resolution colored image attention head.
is then upsampled to an 8-bit RGB sample by another ViLBERT applies VLP on a set of proxy tasks defined on
Transformer network in the second stage of training. Finally, the Conceptual Concepts dataset (with 3.3M images with
a third stage Transformer is trained to increase the spatial weak captions) and later fine-tune the model on down-
resolution of the 8-bit RGB sample produced by the second- stream tasks such as VQA. The pre-training phase oper-
stage Transformer. Self-attention used in the colorization ates in a self-supervised manner, i.e., pretext tasks are cre-
Transformer is based on row/column attention layers intro- ated without manual labeling on the large-scale unlabelled
duced in [179]. These layers capture the interaction between dataset. These pretext tasks include predicting whether the
each pixel of an input image while being computation- text and image inputs are related and predicting the seman-
ally less costly. The row-wise attention layer applies self- tics of masked image regions and textual inputs (e.g., similar
14

Fig. 12: An overview of Transformer models used for multi-modal tasks in computer vision. The Transformer designs in this
category can be grouped into single-stream (UNITER [43], OSCAR [44], VideoBERT [17], Unicoder-VL [180], VisualBERT [63] and
VL-BERT [22]) and dual-stream architectures (LXMERT [21], ViLBERT [181] and PEMT [182]). A key distinction between models
is the choice of loss functions. While most of the multi-modal methods are focused on images as visual data, VideoBERT [17] and
PEMT [182] are designed to work on video streams and leverage unique modalities e.g., audio signals in videos [182].

to reconstructing masked words in text in the BERT model experiments on Visual Reasoning for Real (NLVR) task [186]
[3]). This way, the model learns the inherent structure in demonstrating impressive improvements on novel tasks.
the data during pre-training and also models cross-domain
associations. With evaluations on several tasks, [17] demon- Lee et al. [182] note that the multi-modal representation
strated that a two-stream model can perform better than a learning approaches like VideoBERT [17] and ViLBERT [181]
single-stream model that uses shared parameters to model generally keep the language processing part fixed to a pre-
both language and vision domains [17]. trained model (e.g., BERT [3]) to reduce training complex-
ity. For the first time in the literature, they propose to
Similar to ViLBERT [181], Learning Cross-Modality En- learn an end-to-end multi-modal bidirectional Transformer
coder Representations from Transformers (LXMERT) [21] model called PEMT on audio-visual data from unlabeled
also uses a two-stream architecture based on BERT frame- videos. First, short-term (e.g., 1-3 seconds) video dynamics
work. The main difference lies in the object-relationship are encoded using CNNs, followed by a modality-specific
encoder that is used to model the visual features instead Transformer (audio/visual) to model long-term dependen-
of simple image-level features used in ViLBERT. The infor- cies (e.g., 30 seconds). A multi-modal Transformer is then
mation in two streams is then fused across modalities using applied to the modality-specific Transformer outputs to ex-
cross-attention blocks similar to [181]. change information across visual-linguistic domains. How-
Compared to two pre-texts tasks used for VLP in [181], ever, learning such a model in a naive form would incur
LXMERT uses five pre-training tasks including masked ob- huge memory requirements. To reduce parametric complex-
ject and language prediction, cross-modality matching, and ity, the parameters are shared across layers within each
visual question answering (Fig. 12-g). The pre-trained model Transformer which leads upto 80% parameter reduction.
is fine-tuned on the VQA task, however, a high similarity The Transformer is trained using a contrastive learning ap-
between pre-training and fine-tuned tasks raises questions proach based on a content-aware negative sampling (Fig. 12-
on the generalizability of the learned representations to new i). Specifically, the model uses the features obtained from
tasks. To this end, the authors conducted generalization CNNs learned during the training phase to select negative
15

samples that are visually similar to the positive instances. text pairs, CLIP learns a multi-modal embedding space, by
This work also compares various fusion strategies adopted jointly training an image-encoder and a text-encoder, such
in earlier works such as early (VideoBERT [17] and VL- that the cosine similarity of the valid N image-text pairs is
BERT [22]), mid-level (ViL-BERT [181] and LXMERT [21]) maximized, while the remaining N 2 −N pairs is minimized.
and late fusion mechanisms and shows that the mid-level The authors consider ResNet-50 [67] and Vision Transformer
fusion is the optimal choice. The proposed model is pre- (ViT) [132] for encoding images. The modified Transformer
trained on Kinetics-700 [187] dataset and later fine-tuned on model [1] as in [5] is employed for encoding text. CLIP is
downstream video classification tasks such as short video trained on a large corpus of 400 million image-text pairs and
classification on UCF101 [188], audio classification on ESC50 demonstrates excellent zero-shot transfer capabilities. At
[189] and long-term action recognition on Charades [190] inference, the names of classes are used as input to the text-
and Kinetics-Sounds [65] datasets. encoder, and similarity of the encoded image is computed
Tan and Bansal [191] introduce the concept of ‘vokens’ with all encoded texts (classes) to find the image-text pair
(images related to language tokens extracted from sen- with highest match. The CLIP achieves an astounding zero-
tences). The vokens (visualized tokens) provide visual su- shot classification accuracy of 75% on ImageNet, without us-
pervision to the language model to learn better features. The ing an supervision from ImageNet training set. The authors
motivation is that humans learn languages by correlating further demonstrate zero-shot transfer capabilities of the
visual information with semantic concepts. In a similar spirit CLIP model on 30 different computer vision benchmarks.
to other self-supervised language representation learning Note that CLIP with ResNet took 18 days to train on 592
methods [3], [181], they learn representations by defining V100 GPUs while CLIP with ViT took 12 days on 256 V100
an auxiliary task of voken-prediction task. Since the exist- GPUs. This highlights the computational cost of CLIP.
ing datasets encode limited visually grounded tokens, they
propose a vokenization method to map language tokens to
3.7.2 Single-stream Transformers
visual vokens, as illustrated in Fig. 13. The approach uses
language-based retrieval for such a mapping and transfers Different from two-stream networks like ViLBERT [181]
a model trained on a small labeled dataset (MS-COCO) to a and LXMERT [21], VisualBERT [63] uses a single stack of
large dataset (Wikipedia). Furthermore, it was ensured that Transformers to model both the domains (images and text).
the sentence-wide context is considered to obtain the token- The input sequence of text (e.g., caption) and the visual
voken mapping. The resulting model trained using gener- features corresponding to the object proposals are fed to
ated tokens outperforms the state of the art BERT model on the Transformer that automatically discovers relations be-
a diverse set of NLP tasks. In this sense, the proposed model tween the two domains. Notably, VisualBERT architecture is
does not evaluate vision tasks, however, uses vision as a somewhat similar to VideoBERT [17] (explained in Sec. 3.8),
useful grounding cue to train the language model, hence we but instead of only focusing on cooking videos, Visual-
include it in the multi-modal representation learning group. BERT evaluates on various visual-linguistic tasks (e.g., VCR,
Vision-and-Language Navigation (VLN) aims to predict NLVR, VQA, and visual grounding). The VisualBERT model
a navigation plan on a map based on the vision and first applies task-agnostic pre-training using two objectives
language inputs. Transformer models were used earlier in (Fig. 12-e). The first objective simply attempts to predict
[192], [193] for VLN task. These works first pre-train a cross- missing text tokens using the image features and remaining
modal Transformer using self-supervision on vision and textual tokens. The second objective attempts to differentiate
language pairs and subsequently fine-tune on the specific between the true and false caption of a given image. After
VLN tasks. While these works learn attention between im- task-agnostic pre-training, the authors propose to perform
age region and language, Chen et al. [194] propose to learn task-specific pre-training to bridge the domain gap before
cross-modal attention between language inputs and spatial the final fine-tuning to the downstream task.
topological maps (to represent an agent’s environment as Su et al. [22] propose a multi-modal pre-training ap-
a graph whose nodes denote places and the edges denote proach to learn features that are generalizable to multi-
their connectivity). Given the topological map and natural modal downstream tasks such as Visual Commonsense
language inputs, a VLN task using the Transformer model Reasoning and Visual Question Answering. This endeavor
bears resemblance to sequence prediction in NLP. Specif- requires adequately aligning the visual and linguistic cues
ically, at each time instance, the cross-modal Transformer so that an effective composite representation is learned. To
predicts a single node of the topological map in the nav- the end, [22] builds on the BERT model and inputs both
igation plan. The individual language and map encodings the visual and language features. The language features
are first processed using uni-modal encoders and later a correspond to the token in the input sentence and the visual
cross-modal encoder (similar to LXMERT [21]) is applied features correspond to the region of interest (RoI) from
to aggregate information across modalities. To denote posi- the input image (obtained via a standard Faster R-CNN).
tions in the map, a learned trajectory position encoding is Specifically, the model is pre-trained on both the visual-
appended with the map features. Based on this Transformer lingual dataset (Conceptual Captions [196]) as well as the
setup, [194] reports a full navigation system that can freely language-only datasets (e.g., Wikipedia). The loss function is
explore the environment and intelligently plan its actions. identical to BERT, where the model is trained to predict the
CLIP [195] is a contrastive approach to learn image rep- masked out words or visual ROIs (Fig. 12-f). In contrary to
resentations from text, with a learning objective which max- other works such as UNITER [43], VL-BERT claims that the
imizes similarity of correct text-image pairs embeddings in visual-linguistic matching tasks are not useful during pre-
a large batch size. Specifically, given a batch of N image- training, which is in contrast to evidence from later efforts
16

[180]. Their results on several multi-modal tasks show their


benefit over the language-only pre-training (e.g., in BERT).
Universal Encoder for Vision and Language (Unicoder-
VL) [180] learns multi-modal representations using large-
scale image-caption pairs. The language and image inputs
are fed to a single Transformer model (with multiple suc-
cessive encoders) to learn joint embeddings. To this end,
it uses masked word prediction, masked object classifica-
tion, and visual-linguistic matching as self-supervision tasks
during pre-training (Fig. 12-d). Notably, the visual-linguistic
matching is carried out only at the global level (i.e., image-
sentence alignment). The model is evaluated on image-
text retrieval, zero-shot learning, and visual commonsense
reasoning where it performs better than the previous models
such as ViLBERT [181] and VisualBERT [63]. This shows Fig. 13: Visualized tokens (Vokens) [191]: A language model
is visually supervised using closely related images that leads
the significance of rich self-supervised tasks and advocates to better feature representations from the pretrained model.
for a unified Transformer architecture to learn multi-modal Figure from [191].
features in a common framework.
The Unified Vision-Language Pre-training (VLP) [197]
model uses a single Transformer network for both encod- UNITER adopts a single Transformer applied to the textual
ing and decoding stages. This stands in contrast to BERT and image inputs like [22], [63], [180].
inspired VLP models [17], [22], [63], [198] which use in- VisualBert [63], Uniter [43], VL-BERT [22], VilBERT [181],
dependent encoder and decoder networks. Joint modeling and Unicoder-VL [180] models for VLP concatenate im-
of encoding and decoding stages allows the Unified VLP age and text features and leave it to the self-attention to
model to perform well for both image captioning and visual- automatically discover cross-modal relationships. This can
question answering tasks, when fine-tuned on these individ- complicate the visual grounding of semantic concepts in an
ual tasks. The intuition for shared modeling of encoding and image. To address this problem, Object-Semantics Aligned
decoding stage stems from the need to better share cross- Pre-Training (Oscar) [44] first uses an object detector to
task information during pre-training. The unified model obtain object tags (labels), which are then subsequently used
consists of a stack of 12 Transformer blocks, each with a self- as a mechanism to align relevant visual features with the
attention layer followed by a feed-forward module. The self- semantic information (Fig. 12-b). The motivation is that the
supervised objectives used for pre-training include masked textual content generally pertains to major objects in the
vision-language predictions. Here, the authors explore two image, therefore by explicitly adding those image labels to
variants i.e., bidirectional and sequence-to-sequence predic- the input, visual features can be better attended. Similar to
tion of masked works where different context encodings are BERT [3], Oscar uses a Masked Token Loss for VLP, where
used for both types of objectives. The proposed approach is different tokens in the textual input and image tags are ran-
evaluated on COCO Captions, Flick 30K Captions and VQA domly masked and the model predicts these missing tokens.
2.0 and obtains encouraging results compared to previous Further, it also uses a contrastive loss that discriminates
methods on image captioning and VQA [199]. between the original and noisy/fake image-tag pairs. The
Universal image-text representation (UNITER) [43] per- representations thus learned are fine-tuned on VQA, cross-
forms pre-training on four large-scale visual-linguistic modality retrieval, natural language reasoning, and image
datasets (MS-COCO [75], Visual Genome [200], Conceptual captioning tasks to obtain better performances compared to
Captions [196] and SBU Captions [201]). The learned repre- VLP methods that do not use object tags. The recent VinVL
sentations transfer well on downstream tasks such as VQA, [202] approach extends Oscar for the object detection task
Multi-modal retrieval, Visual Commonsense reasoning, and and learns object instance-centered relationships between
NLVR. In order to emphasize on learning the relationships visual and language domains using an adapted pretraining
between visual and language domains, [43] specifically de- scheme. The model is trained on a collection of datasets
signs pre-training tasks to predict masked visual or text (MS-COCO, OpenImages, Visual Genome and Objects365)
region conditioned on the other domain input, and align and was demonstrated to precisely relate semantic attributes
language and visual inputs on both the global (image-text) with the visual information and provided better transfer-
and local (word-region) levels (Fig. 12-a). These tasks are ability to the downstream visual comprehension tasks.
beside the conventional masked language modeling task
used in BERT and explicitly include fine-grained word-
region alignment alongside conditional masking of inputs 3.7.3 Transformers for Visual Grounding
that were not considered in the earlier works such as VL- Modulated DETR (MDETR) [203] has a CNN and BERT
BERT [22], Visual-BERT [63], Vilbert [181] and Unicoder- backbone to extract features from image and text inputs,
VL [180]. Common to the other approaches, they adopt the respectively. The visual and text features are then separately
Transformer architecture proposed in BERT that operates linearly projected to a shared space, concatenated and fed to
on both the visual and language embeddings. In contrast a transformer model (with an architecture similar to DETR)
to applying independent Transformers to the language and to predict the bounding boxes for objects corresponding to
visual inputs (as in ViLBERT [181] and LXMERT [21]), the queries in the grounding text. The model is trained by
17

using a loss which predicts a uniform distribution over all sentence is temporally aligned with the sequence of visual
relevant text query tokens specific to the predicted bounding tokens. Further, the learned representations are shown to be
boxes. An additional contrastive loss term ensures corre- very useful for downstream tasks such as action classifica-
spondence between visual and text embedding. TransVG tion, zero-shot classification, and video captioning.
[204] is a simple design, where visual and text features are Zhou et al. [210] explore Masked Transformers for dense
fused together in a transformer module, and the bounding- video captioning. This requires generating language de-
box corresponding to the query is directly regressed us- scriptions for all events occurring in a video. Existing works
ing a learnable token (input to the Transformer module, on this problem generally operate sequentially i.e., first
along-with visual and text features). Referring Transformer detect events and then generate captions in separate sub-
[205] is also a simple one stage design where the text blocks. [210] proposes a unified Transformer network to
and image features are fused in a Transformer encoder, tackle both tasks jointly, thereby seamlessly integrating the
and the Transformer based decoder then directly regresses multi-modal tasks of event detection and captioning. First, a
bounding boxes or segmentation masks. Visual Grounding video encoder is used to obtain frame-wise representations
with Transformer [206] has an encoder-decoder architecture, followed by two decoder blocks focused on proposing the
where visual tokens (features extracted from a pretrained video events and the captions. Since untrimmed videos are
CNN model) and text tokens (parsed through an RNN considered, a masking network is used in the captioning
module) are processed in parallel with two distinct branches decoder to focus on describing a single event proposal.
in the encoder, with cross-modality attention to generate Remarkably, [210] was the first approach to target dense
text-guided visual features. The decoder then computes video captioning using non-recurrent models and used self-
attention between the text queries and visual features and attention in the encoder(applied on CNN derived features)
predicts query-specific bounding boxes. to model broad range context between video frames. Ex-
periments on ActivityNet Captions [214] and YouCookII
[215] datasets showed good improvements over previous
3.8 Video Understanding
recurrent network and two-stage based approaches.
Existing approaches for audio-video data analysis generally
learn representations on short-length videos (up to a few 3.8.2 Video Action Recognition
seconds long), that allow them to encode only short-range The traditional CNN based methods in video classification
dependencies [1], [32]. Long-range dependency modeling is generally perform 3D spatio-temporal processing over lim-
desirable in various uni-modal and multi-modal learning ited intervals to understand videos. Neimark et al. [211]
tasks such as activity recognition [71], [187], [207]–[209]. propose Video Transformer Network (VTN) that first ob-
Below, we explain recent approaches that seek to resolve this tains frame-wise features using 2D CNN and apply a Trans-
challenge using the expressivity of Transformer networks. former encoder (Longformer [103]) on top to learn temporal
It is important to note that several of these works [17], relationships. Longformer is an attractive choice to process
[18], [182], [210] still employ (pretrained) CNNs to encode long sequences (with an arbitrary length n) due to its O(n)
image/frame-level features in the videos on top of which complexity. The classification token is passed through a
Transformers are applied to model wide context. A few fully connected layer to recognize actions or events. The
exceptions include [209], [211]–[213] which obtain frame- advantage of using Transformer encoder on top of spatial
level features also using the ViT based backbones. features is two fold: (a) it allows processing a complete video
in a single pass, and (b) considerably improves training and
3.8.1 Joint Video and Language Modeling inference efficiency by avoiding the expensive 3D convolu-
The VideoBERT [17] model leverages Transformer networks tions. This makes VTN particularly suitable for modeling
and the strength of self-supervised learning to learn effec- long videos where interactions between entities are spread
tive multi-modal representations. Specifically, VideoBERT throughout the video length. Their experiments on Kinetics-
uses the prediction of masked visual and linguistic tokens as 400 dataset [71] with various backbones (ResNet [67], ViT
a pretext task (Fig. 12-c). This allows modeling high-level se- [11] and DeiT [12]) shows competitive performance.
mantics and long-range temporal dependencies, important Girdhar et al. [18] use a variant of Transformer archi-
for video understanding tasks. Given a video, [17] converts tecture to aggregate person-specific contextual cues in a
speech to text using off-the-shelf speech recognition systems video for action classification and localization. Initially, the
and applies vector quantization (clustering) to obtain visual model uses a Faster-RCNN [125] style processing where a
features from pre-trained video classification models. The backbone model generates features that are forwarded to the
BERT model is then directly applied to these concatenated Region Proposal Network to obtain object proposals. Then
sequences of language and visual tokens to learn their RoI pooling is applied to generate object-specific features.
joint distribution. The model can be trained with only-text, Multi-head self-attention [1] is then applied on top of the
video-only, and video+text domains. The resulting model object features as a cascade of self-attention layers. In each
showcases interesting capabilities for cross-modal predic- Transformer unit, a particular person feature is treated as
tions such as video generation from a given textual input the ‘query’ (Q), while the features from the neighboring
(e.g., captions or cooking recipe) and (video-based) future video clip are used as ‘key’ (K) and ‘value’ (V). The location
forecasting. The video+text model uses a visual-linguistic information is explicitly encoded in the input feature map
alignment task to learn cross-modality relationships. The from which K, V and Q are derived, thus incorporating
definition of this pre-text task is simple, given the latent the positional information in the self-attention. For a given
state of the [cls] token, the task is to predict whether the 400×400×64 video clip, the key and value tensors are of size
18

16×25×25×128, while the query is 128 dimensional vector.


Although [18] uses only RGB stream, additional modalities
like optical flow and audio signal (as in competing works)
would further increase the compute complexity. Further, the
Transformer model was found to be sub-optimal for action
localization, perhaps due to its tendency to incorporate
global information. Therefore, it is important to achieve
the right trade-off between the global and local context
for problems that demand precise delineation (e.g., action
localization and segmentation).
Human action recognition based on skeleton representa- (a) Spatial Self-Attention
tion requires understanding relationships between different
joints of a body in a given frame as well as between different
frames of a video. Plizzari et al. [216] proposed a two-stream
Transformer network to model such relationships. They
introduced spatial self-attention (SSA) to model relations
between different body-joints (Fig. 14a) while temporal self-
attention (TSA) to capture long-range inter-frame depen-
dencies (Fig. 14b). They first used a small residual network
to extract features from skeleton data and then used SSA
and TSA modules to process those feature maps. SSA finds
the correlation between each pair of joints independently,
while TSA focuses on how features of a certain joint change
between frames along the temporal dimension. The purpose (b) Temporal Self-Attention
of SSA is to discover relationships among the surrounding
Fig. 14: Spatial/Temporal Attention for Skeleton Data Repre-
joints in the same way as the Transformer relates different sentations. Relationships between body-joints and inter-frame
words in a phrase. On the other hand, TSA finds long-range dependencies are modeled using two dedicated self-attention
relations between frames, similar to how relations among modules. Figure is from [216].
phrases are built in NLP. The two streamed model achieves
state-of-the-art results on NTU-RGB+D 60 [217] and NTU-
RGB+D 120 [218] datasets. backbone CNN on a collection of video frames. An encoder
Multiscale Vision Transformers (MViT) [219] build a and a decoder Transformer is used similar to DETR to
feature hierarchy by progressively expanding the channel frame the instance segmentation problem as a sequence to
capacity and reducing the spatio-temporal resolution in sequence prediction task. The input frame-level features are
videos. They introduce multi-head pooling attention to concatenated to form clip representations and the Trans-
gradually change the visual resolution in their pyramid former outputs instance predictions in a order that is consis-
structure. TimeSFormer [213] extends ViTs [132] to videos, tent across frames. This integrates the object detection and
by considering the video as a sequence of patches ex- tracking with-in a single unified architecture. The predicted
tracted from individual frames. To capture spatio-temporal outputs are matched with the ground-truth using bipartitie
relationships, they propose divided attention i.e., spatial matching. Similar to Mask R-CNN [127], a separate head is
and temporal attentions are separately applied within each used to predict the instance mask based on self-attention
block. TimeSFormer demonstrates SoTA performance on and 3D convolutions. The overall results are competitive
action recognition, and can be applied to clips over one among the single model approaches on YouTube VIS dataset
minute. Another notable pure-transformer based model is [221], but performs somewhat lower compared to more
the Video Vision Transformer (ViViT) [212]. First, the spatio- complex CNN-based models such as MaskProp [222].
temporal tokens are extracted and then efficient factorised
versions of self-attention are applied to encode relationships
between tokens. However, they require initialization with 3.9 Transformers in Low-shot Learning
image-pretrained models to effectively learn the ViT models. In the few-shot learning settings, a support set is provided
There has also been concurrent work on learning sound at the inference to adapt to a novel set of categories. Trans-
pretrained models using self-supervised learning with ViTs. former models have been used to learn set-to-set mappings
An important recent effort is the long-short contrastive on this support set [26] or learn the spatial relationships
learning (LSTCL) framework [220], which reconstructs rep- between a given input query and support set samples [25].
resentations from different time-scales (narrow and broad) In terms of absolute performance, the patch-wise spatial
as auxiliary learning tasks and demonstrates good down- self-attention between query and support set images excels
stream performance. compared to an image level association learned in [26].
However, the patch-wise attention computation is computa-
3.8.3 Video Instance Segmentation tionally expensive. We elaborate on these approaches below.
The Video Instance Segmentation Transformer (VisTR) [209] Doersch et al. [25] explore the utility of self-supervision
model extends DETR [13] for video object instance seg- and Transformer model for few-shot fine-grained classifica-
mentation (VIS) task. Local features are obtained using a tion, where distribution mismatch exists between training
19

3.10 Transformers for Clustering


Clustering aims to discover structure in the data by group-
ing similar data points together. It has numerous applica-
tions such as data visualization and interpretation, anomaly
detection, and open-set categorization. Neural networks
have been developed for set prediction problems [225],
[227], however, the setpoints are processed individually
which can lose information about inter-point relationships.
Recent works employ Transformers that operate on set
inputs called the Set Transformers (ST) [228] for amortized
Fig. 15: An overview of FEAT [26]. Compared to the con- clustering. Amortized clustering is a challenging problem
ventional instance embedding methods in FSL that keep the that seeks to learn a parametric function that can map an
embedding function same for all tasks (a), FEAT uses a set- input set of points to their corresponding cluster centers. Lee
to-set function to adapt the embedding function to each FSL
et al. [228] propose to learn such a mapping function using
task (b). It evaluates several set-to-set functions and found the
Transformer module to be the most suitable choice for FSL. a Transformer architecture comprising of multi-head self-
Figure from [26]. attention blocks [1]. The Transformer model is permutation
invariant by design and allows encoding both pair-wise and
higher-order relationships between the input points. How-
ever, a full Transformer would lead to a high computational
cost of O(n2 ) in each self-attention layer, where n is the
and evaluation phases. They develop Cross-Transformer number of points in the set. ST reduces this cost to O(mn)
model to relate a given query image with the few-examples by using an Induced Self-Attention Block that uses a low-
available in the support set. To this end, the Transformer rank projection (H ∈ Rm ) to allow operating on large sets.
finds spatially similar regions in the query and support The model was trained to learn optimal parameters that
set images, and the corresponding features are then used maximize the likelihood of a mixture of Gaussians (MoGs).
to obtain class decisions for the query. The queries in the Thus MoG parameters are estimated by the ST given a set
Transformer architecture are derived from the grid features of data points. Beyond amortized clustering, ST is a generic
obtained using the query image. Similarly, grid features framework which can handle other set-input problems such
from the support images are used to construct keys and as counting unique elements in an input set, multi-instance
values which are in turn used to derive attended outputs. learning, set anomaly detection, and 3D point-cloud clas-
This approach, besides a contrastive self-supervision based sification. More recently, [229] improves [228] by taking a
training mechanism, leads to the best performance on the sequential approach to cluster generation, thereby allowing
challenging Meta-dataset [223]. assignment to a variable number of clusters.

Ye et al. [26] propose to adapt the few-shot embeddings


3.11 Transformers for 3D Analysis
learned on the base classes to the few-shot target classes
during inference using a Transformer module. This leads Given the irregular (variable number of points) and permu-
to task-specific embeddings that perform better on the tation invariant nature of 3D point cloud representations,
discriminative tasks such as few-shot classification. While Transformers provide a promising mechanism to encode
many other set-to-set functions are also evaluated, such as rich relationships between 3D data points. To this end,
Graph convolutional networks [224], Bidirectional LSTMs recent works [230], [231] are motivated by the capability of
[32] and DeepSets [225], the best performance is achieved Transformers to learn set-functions. Specifically, [230] intro-
with the Transformer-based mapping. This is attributed to duced a Point Transformer which uses vector attention to
the better contextualization, task interpolation and extrap- learn weights for each channel, while [231] suggest an alter-
olation capability of Transformers and their permutation nate design where local 3D structure is explicitly encoded.
invariance while maintaining a relatively lower parameter The non-local nature of Transformers is exploited in [45]
complexity. The Transformer architecture in [26] follows the towards an accurate human pose and mesh reconstruction
standard model [1]. The embeddings are adapted using algorithm. We discuss these approaches below.
a contrastive loss function for preserving discriminative Self-attention being a set-operator is ideally suited for
properties (Fig. 15). The resulting model achieves strong processing point clouds, a 3D data representation that de-
performance on inductive, transductive, and generalized mands invariance to number of points and their permuta-
FSL tasks. tions. Zhao et al. [230] propose a point Transformer layer that
applies self-attention in the local neighborhood of 3D points.
Liu et al. [226] learn a multi-head self-attention based The proposed layer builds on vectorized self-attention net-
module, to integrate the visual representation learned by the work (SAN) [82] where attention weights are represented
models trained on different domains present in the meta- with vectors.Furthermore, a positional encoding is added
dataset [223]. The Universal Representation Transformer both to the attention vector and transformed features (value
(URT) layer dynamically re-weights the representations vectors) to represent location information. The point Trans-
from different domain-specific backbones, and proves very former layer is sandwiched between two linear layers to
effective in handling few shot tasks across a variety of data create a point Transformer block that is stacked multiple
distributions. times in the developed network architecture. Their design
20

also included transition down/up blocks to reduce/increase 4.1 High Computational Cost
the number of points in the input (in a typical encoding- As discussed in Sec. 1, a strength of Transformer models
decoding pipeline style). The resulting architecture shows is their flexibility to scale to high parametric complexity.
promising results on the 3D classification and segmentation While this is a remarkable property that allows training
tasks. enormous sized models, this results in high training and
The Point Cloud Transformer (PCT) [231] is a parallel inference cost (a detailed comparison between CNN and
work to [230] and motivated by the permutation invariance ViTs is shown in Table 3). As an example, the BERT [3]
property of Transformers. However, compared to [230], it basic model (with 109 million parameters) took around 1.89
is more directly based on the conventional Transformer peta-flop days2 for training, while the latest GPT3 [6] model
architecture [1] and does not involve vector attention. The (175 billion parameters) took around 3640 peta-flop days
key modifications include a 3D coordinate-based position for training (a staggering ∼1925× increase). This comes
encoding, an offset attention module, and a neighbor em- with a huge price tag, e.g., according to one estimate [237],
bedding that encodes local 3D structure in point-clouds. GPT3 training might have cost OpenAI 4.6 million USD.
Specifically, the offset attention layer calculates the dif- Additionally, these large-scale models require aggressive
ference between the self-attended features and the input compression (e.g., distillation) to make them feasible for real-
features using element-wise subtraction. The local neighbor world settings.
embedding simply finds self-attention relationships among An empirical study on the scalability of Vision Trans-
a group of points instead of individual 3D points. Explicitly formers for number of parameters (ranging from five million
incorporating local neighbourhood information makes this to two billion), size of the training datasets (ranging from 30
a more efficient architecture compared to [230]. The method million to three billion training images), and compute bud-
shows promising performance on 3D shape classification, get (1-10000 TPU core-days) is presented in [238]. From this
normal estimation and segmentation tasks on ModelNet40 study, We can draw the following conclusions (a) scaling up
[232] and ShapeNet [233] datasets. on compute, model and size of training samples improves
The Mesh Transformer (METRO) [45] model targets 3D human pose and mesh reconstruction from a single 2D image. A key challenge here is to faithfully learn the non-local interactions between body-joints and mesh vertices (e.g., hand and foot). The expressivity of the Transformer network is used to jointly model vertex-to-vertex relationships in a mesh as well as vertex-to-body-joint relationships. The self-attention mechanism can attend to any combination of vertices in the mesh, thereby encoding non-local relationships. The multi-layer Transformer architecture sequentially performs dimensionality reduction to map the 2D image to the 3D mesh. Position encoding is performed using the 3D coordinates (x, y, z) of each vertex and each body-joint. Similar to masked language modeling in NLP, METRO uses masked vertex modeling (MVM), which randomly masks some percentage of the input queries (see Fig. 16). The Transformer is tasked with regressing all the joints and vertices, which helps encode inter-dependencies between them. METRO obtains state-of-the-art results on human mesh reconstruction on the Human3.6M [234] and 3DPW [235] datasets. Since the approach does not depend on a parametric mesh model, it generalizes well to other reconstruction tasks such as 3D hand reconstruction [236]. Overall, this is the first effort to employ Transformers for 3D human reconstruction tasks and leads to fairly good results.

4 OPEN CHALLENGES & FUTURE DIRECTIONS

Despite excellent performance from Transformer models and their interesting salient features (Table 1), there exist several challenges associated with their applicability to practical settings (Table 2). The most important bottlenecks include the requirement for large amounts of training data and the associated high computational costs. There have also been challenges in visualizing and interpreting Transformer models. In this section, we provide an overview of these challenges, mention some of the recent efforts to address those limitations, and highlight the open research questions.

4.1 High Computational Cost

As discussed in Sec. 1, a strength of Transformer models is their flexibility to scale to high parametric complexity. While this is a remarkable property that allows training enormous sized models, it results in high training and inference cost (a detailed comparison between CNNs and ViTs is shown in Table 3). As an example, the BERT [3] base model (with 109 million parameters) took around 1.89 peta-flop days² for training, while the latest GPT3 [6] model (175 billion parameters) took around 3640 peta-flop days for training (a staggering ∼1925× increase). This comes with a huge price tag, e.g., according to one estimate [237], GPT3 training might have cost OpenAI 4.6 million USD. Additionally, these large-scale models require aggressive compression (e.g., distillation) to make them feasible for real-world settings.
² A peta-flop day is a measure of computation equal to performing 10^15 neural-net operations per second for one complete day.

An empirical study on the scalability of Vision Transformers with respect to the number of parameters (ranging from five million to two billion), the size of the training dataset (ranging from 30 million to three billion training images), and the compute budget (1-10000 TPU core-days) is presented in [238]. From this study, we can draw the following conclusions: (a) scaling up compute, model size and the number of training samples improves performance; (b) only large models (with more parameters) can benefit from more training data, while the performance of smaller models plateaus quickly and cannot leverage additional data. This indicates that large-scale models have the capacity to further enhance their representation learning capabilities. However, with the current designs, scaling up Transformer models is expensive and compute prohibitive, thus necessitating the need for efficient designs.

In the language domain, recent works focus on reducing the high complexity of Transformer models (basically arising from the self-attention mechanism [1], where a token's representation is updated by considering all tokens from the previous layer). For example, [103], [245] explore selective or sparse attention to previous-layer tokens while updating each next-layer token. Linformer [38] reduces the complexity of the standard self-attention operation from O(n^2) to O(n) (both in time and memory requirements). The main idea is to show that a low-rank matrix is sufficient to model the self-attention mechanism. The Reformer model [246] employed locality-sensitive hashing (LSH) to minimize the complexity of self-attention from O(n^2) to O(n log n). In similar pursuit, the recent Lambda Networks propose to model local context as a linear function, which helps reduce the complexity of self-attention [247]. These linear-function lambdas are applied to the input query to model contextual relationships between pixels.

Vyas et al. [248] developed an efficient cluster attention to deal with large input sequences that approximates the original self-attention. The cluster attention groups queries into clusters and then computes attention between cluster centers (instead of attention between all the queries, which leads to quadratic complexity). The main idea is that queries close in Euclidean space should have similar attention distributions. With a fixed number of clusters, this intuition helps reduce the quadratic complexity to linear

Task | Method | Design Highlights (focus on differences with the standard form) | Input Data Type | Label Type | Loss
Image Classification | ViT [11] | Directly adopts the NLP Transformer encoder for images; mechanism to linearly embed image patches with a positional embedding suitable for the encoder. | 2D Image | Class labels | Cross-entropy
Image Classification | DeiT [12] | Transformer as a student while a CNN acts as teacher; distillation tokens to produce estimated labels from the teacher; attention between class and distillation tokens. | 2D Image | Class labels | Cross-entropy, distillation loss based on KL-divergence
Image Classification | CLIP [195] | Jointly trains image and text encoders on image-text pairs to maximize the similarity of valid pairs and minimize it otherwise. | 2D Images & text | Image-text pairs | Symmetric cross-entropy
Object Detection | DETR [13] | Linear projection layer to reduce CNN feature dimension; spatial positional embedding added to each multi-head self-attention layer of both encoder and decoder; object queries (output positional encoding) added to each multi-head self-attention layer of the decoder. | 2D Image | Class labels | Hungarian loss based on bipartite matching between predictions and ground truths
Object Detection | D-DETR [14] | Deformable Transformer consisting of deformable attention layers to introduce sparse priors in Transformers; multi-scale attention module. | 2D Image | Class labels | Hungarian loss
Low-Shot Learning | CT [25] | Self-supervised pretraining; query-aligned class prototypes that provide spatial correspondence between the support-set images and the query image. | 2D Image | Pretraining without labels and few-shot learning with class labels | Normalized cross-entropy
Image Colorization | ColTran [24] | Conditional row/column multi-head attention layers; progressive multi-scale colorization scheme. | 2D Image | 2D Image | Negative log-likelihood of the images
Action Recognition | ST-TR [216] | Spatial and temporal self-attention to operate on graph data such as joints in skeletons. | Skeleton | Action classes | Cross-entropy
Super-Resolution | TTSR [16] | Texture-enhancing Transformer module; relevance embeddings to compute the relevance between the low-resolution and reference images. | 2D Image | 2D Image | Reconstruction loss, perceptual loss defined on pretrained VGG19 features
Multi-Modal Learning | Oscar [44] | Transformer layer to jointly process the triplet representation of image-text [words, tags, features]; masked tokens to represent text data. | 2D Image | Captions, class labels, object tags | Negative log-likelihood of masked tokens, contrastive binary cross-entropy
3D Classification/Segmentation | PT [230] | Point Transformer block; transition down block to reduce the cardinality of the point set; transition up for dense prediction tasks. | CAD models, 3D object part segmentation | Object and shape categories | Cross-entropy
3D Mesh Reconstruction | METRO [45] | Progressive dimensionality reduction across Transformer layers; positional encoding with 3D joint and 3D vertex coordinates; masked vertex/joint modeling. | 2D Image | 3D mesh + human pose | L1 loss on mesh vertices and joints in 3D and 2D projection
Vision and Language Navigation | Chen et al. [194] | Uni-modal encoders on language and map inputs followed by a cross-modal transformer; trajectory position encodings in the map encoder. | Instruction text + RGBD panorama + topological environment map | Navigation plan | Cross-entropy over nodes and [stop] action
Referring Image Segmentation | CMSA [15] | Multimodal features; cross-modal self-attention at multiple levels and their fusion using learned gates. | 2D Image + language expression | Segmentation mask | Binary cross-entropy loss
Video Classification | Lee et al. [182] | Operates on real-valued audio-visual signals instead of tokens; contrastive learning for pre-training; end-to-end multi-modal transformer learning. | Audio-visual | Activity labels | Contrastive InfoNCE loss and binary cross-entropy

TABLE 1: A summary of key design choices adopted in different variants of transformers for a representative set of computer vision applications. The main changes relate to specific loss function choices, architectural modifications, different position embeddings and variations in input data modalities.
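Several detection entries in Table 1 rely on a Hungarian loss, i.e., a set loss computed after bipartite matching between predictions and ground truths. As a toy illustration (not DETR's implementation; the cost terms, sizes and variable names are our own, and scipy is assumed to be available), the matching step can be sketched as follows.

import torch
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_boxes, gt_boxes, pred_logits, gt_labels):
    # Build a cost matrix from an L1 box distance and a (negative) class probability,
    # then find the minimum-cost one-to-one assignment (Hungarian matching).
    prob = pred_logits.softmax(dim=-1)                  # (num_preds, num_classes)
    cost_class = -prob[:, gt_labels]                    # (num_preds, num_gt)
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)   # (num_preds, num_gt) L1 distance
    cost = (cost_box + cost_class).detach().cpu().numpy()
    pred_idx, gt_idx = linear_sum_assignment(cost)      # optimal bipartite matching
    return list(zip(pred_idx.tolist(), gt_idx.tolist()))

preds, gts = torch.rand(100, 4), torch.rand(5, 4)       # e.g., 100 object queries, 5 ground-truth boxes
logits, labels = torch.randn(100, 92), torch.randint(0, 92, (5,))
matches = match_predictions(preds, gts, logits, labels) # matched (prediction, ground-truth) index pairs

The classification and box-regression losses are then computed only on the matched pairs, which is what makes the pipeline end-to-end without hand-crafted assignment heuristics.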

Task | Method | Metric | Dataset | Performance | Highlights | Limitations
Image Classification | ViT [11] (ICLR'21) | Top-1 Acc. | ImageNet | 88.55 | a) First application of a Transformer (global self-attention) directly on image patches; b) convolution-free network architecture; c) outperforms CNN models such as ResNet. | a) Requires training on large-scale data, e.g., 300 million images; b) requires careful transfer learning to the new task; c) requires a large model with 632 million parameters to achieve SOTA results.
Image Classification | DeiT [12] (arXiv'20) | Top-1 Acc. | ImageNet | 83.10 | a) Successfully trains a Transformer on ImageNet only; b) introduces an attention-based distillation method; c) produces competitive performance with small (86 million parameter) Transformers. | Requires access to a pretrained CNN-based teacher model, thus performance depends on the quality of the teacher model.
Image Classification | Swin-T [36] (arXiv'21) | Top-1 Acc. | ImageNet | 84.5 | a) Provides a general-purpose backbone for different vision tasks, e.g., classification, detection and segmentation; b) a hierarchical design using the shifted-windows operation. | a) Hard to train from scratch on smaller datasets; b) quadratic compute complexity inherent to the self-attention operation.
Low-Shot Learning | CT [25] (NeurIPS'20) | Top-1 Acc. | ImageNet / COCO | 62.25 / 60.35 | a) Self-supervised pre-training mechanism that does not need manual labels; b) dynamic inference using a Transformer achieving state-of-the-art results. | The proposed algorithm is limited in its capacity to perform on datasets that lack spatial details such as texture.
Object Detection | DETR [13] (ECCV'20) | AP | COCO | 44.9 | a) Use of a Transformer allows an end-to-end training pipeline for object detection; b) removes the need for hand-crafted post-processing steps. | a) Performs poorly on small objects; b) requires long training time to converge.
Object Detection | D-DETR [14] (ICLR'21) | AP | COCO | 43.8 | a) Achieves better performance on small objects than DETR [13]; b) faster convergence than DETR [13]. | Obtains SOTA results of 52.3 AP but only with a two-stage detector design and test-time augmentations.
Image Colorization | ColTran [24] (ICLR'21) | FID | ImageNet | 19.71 | a) First successful application of a Transformer to image colorization; b) achieves SOTA FID score. | a) Lacks end-to-end training; b) limited to images of size 256×256.
Action Recognition | ST-TR [216] (arXiv'20) | Top-1 Acc. | NTU 60/120 | 94.0 / 84.7 | a) Successfully applies a Transformer to model relations between body joints in both the spatial and temporal domains; b) achieves SOTA results. | The proposed Transformers do not process joints directly but operate on features extracted by a CNN, thus the overall model is based on a hand-crafted design.
Super-Resolution | TTSR [16] (CVPR'20) | PSNR / SSIM | CUFED5; Sun80; Urban100; Manga109 | 27.1/0.8; 30.0/0.81; 25.9/0.78; 30.1/0.91 | a) Achieves state-of-the-art super-resolution by using attention; b) novel Transformer-inspired architectures that can process multi-scale features. | a) The proposed Transformer does not process images directly but features extracted by a convolution-based network; b) model with a large number of trainable parameters; c) compute intensive.
Multi-Modal Learning | ViLBERT [181] (NeurIPS'19) | Acc. / mAP (R@1) | VQA [183] / Retrieval [239] | 70.6 / 58.2 | a) The proposed Transformer architecture can combine text and visual information to understand inter-task dependencies; b) achieves pre-training on unlabelled data. | a) Requires a large amount of data for pre-training; b) requires fine-tuning to the new task.
Multi-Modal Learning | Oscar [44] (ECCV'20) | Acc. / mAP (R@1) | VQA [240] / COCO | 80.37 / 57.5 | a) Exploits a novel supervisory signal via object tags to achieve text and image alignment; b) achieves state-of-the-art results. | Requires extra supervision through pre-trained object detectors, thus performance depends on the quality of the object detectors.
Multi-Modal Learning | UNITER [43] (ECCV'20) | Acc. / Avg. (R@1/5/10) | VQA [183] / Flickr30K [241] | 72.47 / 83.72 | Learns fine-grained relation alignment between text and images. | Requires large multi-task datasets for Transformer training, which leads to high computational cost.
3D Analysis | Point Transformer [230] (arXiv'20) | Top-1 Acc. / IoU | ModelNet40 [232] | 92.8 / 85.9 | a) Transformer-based attention capable of processing unordered and unstructured point sets; b) permutation-invariant architecture. | a) Only moderate improvements over the previous SOTA; b) large number of trainable parameters, around 6× higher than PointNet++ [242].
3D Analysis | METRO [45] (arXiv'20) | MPJPE / PA-MPJPE / MPVE | 3DPW [235] | 77.1 / 47.9 / 88.2 | a) Does not depend on parametric mesh models so it is easily extendable to different objects; b) achieves SOTA results using Transformers. | Dependent on a hand-crafted network design.

TABLE 2: A summary of advantages and limitations of different Transformer-based methods in different tasks. (CT: Cross Transformers, AP: Average Precision, mAP: mean AP, IoU: Intersection over Union, FID: Fréchet inception distance, MPJPE: Mean Per Joint Position Error, MPVE: Mean Per Vertex Error.)

Fig. 16: Mesh Transformer architecture. The joint and vertex queries are appended with positional embeddings and passed
through multiple self-attention layers to jointly regress 3D coordinates of joints and mesh vertices. Figure is from [45].
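Complementing the figure, the snippet below gives a minimal, illustrative sketch of the masked vertex modeling objective described in the text (this is not the official METRO code; the query counts, masking ratio, zero-masking strategy and tiny encoder are assumptions used only to convey the idea).

import torch
import torch.nn as nn

# Masked vertex modelling: a random percentage of the joint/vertex input queries is masked,
# yet the Transformer must regress the 3D coordinates of *all* joints and vertices, which
# forces it to exploit non-local dependencies between them.
num_joints, num_verts, dim = 14, 431, 128                  # illustrative sizes
queries = torch.rand(1, num_joints + num_verts, dim)       # per-query image feature + positional embedding

mask_ratio = 0.3                                            # fraction of queries to mask (assumed value)
mask = torch.rand(1, queries.shape[1], 1) < mask_ratio
masked_queries = queries.masked_fill(mask, 0.0)             # simple zero-masking for illustration

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2)
regressor = nn.Linear(dim, 3)                               # predict (x, y, z) for every joint/vertex
pred_xyz = regressor(encoder(masked_queries))               # (1, num_joints + num_verts, 3)

target_xyz = torch.rand_like(pred_xyz)                      # dummy ground-truth coordinates
loss = nn.functional.l1_loss(pred_xyz, target_xyz)          # L1 loss on joints and mesh vertices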

Method | #Param (M) | GFLOPs | Top-1 Acc (%)
ResNet18 [67]* | 11.7 | 1.8 | 69.8
EfficientNet-B3 [87]* | 12.0 | 1.8 | 81.6
DeiT-T [12] | 5.7 | 1.3 | 72.2
T2T-ViT_t-7 [35] | 5.0 | 1.3 | 71.7
LocalViT-T [107] | 5.9 | 1.3 | 74.8
CrossViT-T [104] | 6.9 | 1.6 | 73.4
PVTv1-T [93] | 13.2 | 1.9 | 75.1
ResT-Lite [110] | 10.5 | 1.4 | 77.2
CaiT-XXS-24 [243] | 12.0 | 2.5 | 77.6
PVTv2-B1 [97] | 13.1 | 2.1 | 78.7
Lv-ViT-T [89] | 8.5 | – | 79.1
RegionViT-T [100] | 13.8 | 2.4 | 80.4
ResNet50 [67]* | 25.6 | 4.1 | 76.1
ResNeXt50-32x4d [244]* | 25.0 | 4.3 | 77.6
RegNetY-4G [86]* | 21.0 | 4.0 | 80.0
EfficientNet-B4 [87]* | 19.0 | 4.2 | 82.9
DeiT-S [12] | 22.1 | 4.6 | 79.9
PVTv1-S [93] | 24.5 | 3.8 | 79.8
LocalViT-S [107] | 22.4 | 4.6 | 80.8
CrossViT-S [104] | 26.7 | 5.6 | 81.0
TNT-S [88] | 23.8 | 5.2 | 81.3
Swin-T [36] | 29.0 | 4.5 | 81.3
NesT-T [111] | 17.0 | 5.8 | 81.5
T2T-ViT_t-14 [35] | 21.5 | 5.2 | 81.5
CvT-13 [96] | 20.0 | 4.5 | 81.6
ResT-B [110] | 30.3 | 4.3 | 81.6
Twins-SVT-S [37] | 24.0 | 2.8 | 81.7
PVTv2-B2-Li [97] | 22.6 | 3.9 | 82.1
RegionViT-S [100] | 30.6 | 5.6 | 82.5
Lv-ViT-S [89] | 26.0 | 6.6 | 83.3
ResNet101 [67]* | 44.7 | 7.9 | 77.4
ResNeXt101-32x4d [244]* | 44.2 | 8.0 | 78.8
RegNetY-8G [86]* | 39.0 | 8.0 | 81.7
EfficientNet-B5 [87]* | 30.0 | 9.9 | 83.6
CvT-21 [96] | 32.0 | 7.1 | 82.5
CaiT-S-24 [243] | 32.2 | 9.4 | 82.7
T2T-ViT_t-19 [35] | 39.0 | 9.8 | 81.4
PVTv1-M [93] | 44.2 | 6.7 | 81.2
PVTv2-B3 [97] | 45.2 | 6.9 | 83.2
NesT-S [111] | 38.0 | 10.4 | 83.3
ResNet152 [67]* | 60.2 | 11.6 | 78.3
CaiT-S-36 [243] | 48.0 | 13.9 | 83.3
T2T-ViT_t-24 [35] | 64.0 | 15.0 | 82.2
PVTv1-L [93] | 61.4 | 9.8 | 81.7
TNT-B [88] | 66.0 | 14.1 | 82.8
Swin-S [36] | 50.0 | 8.7 | 83.0
Twins-SVT-B [37] | 56.0 | 8.3 | 83.2
RegionViT-B [100] | 72.7 | 13.0 | 83.3
PVTv2-B4 [97] | 62.6 | 10.1 | 83.6
ResNeXt101-64x4d [244]* | 83.5 | 15.6 | 79.6
RegNetY-16G [86]* | 84.0 | 16.0 | 82.9
EfficientNet-B6 [87]* | 43.0 | 19.0 | 84.0
NesT-B [111] | 68.0 | 17.9 | 83.8
ViT-B/16 [11] | 86.6 | 17.6 | 79.8
DeiT-B/16 [12] | 86.6 | 17.6 | 81.8
Swin-B [36] | 88.0 | 15.4 | 83.3
Twins-SVT-L [37] | 99.2 | 14.8 | 83.7
PVTv2-B5 [97] | 82.0 | 11.8 | 83.8
Lv-ViT-M [89] | 56.0 | 16.0 | 84.1

TABLE 3: A comparative analysis between different vision transformer and CNN models in terms of their parameter complexity and top-1 (%) accuracy on the ImageNet validation set. For a direct comparison, we consider models that are trained on ImageNet from scratch on inputs of size 224×224. * denotes pure CNN-based methods.
complexity of O(nc) with respect to the input sequence length n (where c is the number of clusters). We refer interested readers to a survey on efficient Transformers in NLP [34].
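To convey the core intuition behind the clustered attention of [248] summarised above, the following is a toy sketch: queries are grouped with plain k-means, attention is computed only for the cluster centroids, and each query reuses its centroid's output. The actual method includes further refinements (e.g., an improved approximation using the top keys per cluster), which are omitted here; sizes and the number of iterations are illustrative.

import torch

def clustered_attention(q, k, v, n_clusters=64, iters=5):
    # Cost drops from O(n^2) to roughly O(n*c) for c clusters, since the softmax over all
    # keys is evaluated only once per centroid instead of once per query.
    n, d = q.shape
    centroids = q[torch.randperm(n)[:n_clusters]].clone()       # initialise centroids from queries
    for _ in range(iters):                                       # a few Lloyd (k-means) iterations
        assign = torch.cdist(q, centroids).argmin(dim=1)         # (n,) nearest centroid per query
        for c in range(n_clusters):
            members = q[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(dim=0)
    attn = torch.softmax(centroids @ k.t() / d ** 0.5, dim=-1)   # (c, n) centroid attention over keys
    centroid_out = attn @ v                                      # (c, d)
    return centroid_out[assign]                                  # broadcast centroid outputs to queries

q, k, v = torch.rand(4096, 64), torch.rand(4096, 64), torch.rand(4096, 64)
out = clustered_attention(q, k, v)                               # -> (4096, 64)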
Similar to the NLP domain, computer vision models also suffer from the high computational cost of Transformer models. For example, image generators that are based on sequence-based Transformers (e.g., iGPT) have a high compute cost, limiting their applicability to high-resolution inputs. The time and memory cost of the core self-attention operation in Transformers increases quadratically with the number of patches, i.e., O(n^2) for n image patches (in some applications, e.g., low-level vision, n = H × W where H, W denote the height and width of the image). This is a major drawback of existing Transformers that hinders their application to most tasks involving high-resolution (HR) images, such as object detection and segmentation (in high-level vision), and super-resolution, deblurring, denoising, etc. (in low-level vision). Numerous methods have been proposed that make special design choices to perform self-attention more 'efficiently', for instance employing pooling/downsampling in self-attention [97], [219], [249], local window-based attention [36], [250], axial-attention [179], [251], low-rank projection attention [38], [252], [253],

kernelizable attention [254], [255], and similarity-clustering based methods [246], [256]. However, almost all of these approaches either come with a trade-off between complexity and accuracy, require special hardware specifications, or are still not applicable to very large images. Therefore, there is a pressing need to develop an efficient self-attention mechanism that can be applied to HR images on resource-limited systems without compromising accuracy. It will be interesting to explore how existing models can be extended to high-dimensional cases, e.g., using a multi-scale transformer design with a somewhat local context modeling. By inducing inductive biases based on our understanding of the visual learning tasks (e.g., spatial relationships in the local neighbourhood), the high computational cost can be reduced. Similarly, using sparse attention maps modeled with low-rank factorization of the matrices can also help towards reducing the computational cost [211].
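As a concrete illustration of the low-rank route discussed above (in the spirit of Linformer [38], but not its implementation; the projection length, dimensions and class/variable names below are illustrative assumptions), keys and values of length n can be projected down to a fixed length r << n so that the attention map is n × r rather than n × n. For a 1024 × 1024 image split into 16 × 16 patches, n = 64 × 64 = 4096, so a full attention map already holds about n^2 ≈ 16.8 million entries per head, which is exactly what makes HR inputs expensive.

import torch
import torch.nn as nn

class LowRankSelfAttention(nn.Module):
    # Single-head sketch of low-rank (projected) self-attention.
    def __init__(self, dim, seq_len, r=256):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.proj_k = nn.Linear(seq_len, r, bias=False)    # learned projections along the sequence axis
        self.proj_v = nn.Linear(seq_len, r, bias=False)

    def forward(self, x):                                   # x: (B, n, C) patch tokens
        q, k, v = self.q(x), self.k(x), self.v(x)
        k = self.proj_k(k.transpose(1, 2)).transpose(1, 2)  # (B, r, C) projected keys
        v = self.proj_v(v.transpose(1, 2)).transpose(1, 2)  # (B, r, C) projected values
        attn = torch.softmax(q @ k.transpose(1, 2) / (x.shape[-1] ** 0.5), dim=-1)  # (B, n, r)
        return attn @ v                                     # (B, n, C): linear in n

tokens = torch.rand(1, 4096, 64)                            # e.g., a 64x64 patch grid
out = LowRankSelfAttention(64, seq_len=4096)(tokens)        # memory O(n*r) instead of O(n^2)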
4.2 Large Data Requirements

Since Transformer architectures do not inherently encode inductive biases (prior knowledge) to deal with visual data, they typically require large amounts of training data to figure out the underlying modality-specific rules. For example, a CNN has inbuilt translation invariance, weight sharing, and partial scale invariance due to pooling operations or multi-scale processing blocks. However, a Transformer network needs to figure out these image-specific concepts on its own from the training examples. Similarly, relationships between video frames need to be discovered automatically by the self-attention mechanism by looking at a large database of video sequences. This results in longer training times, a significant increase in computational requirements, and large datasets for processing. For example, the ViT [11] model requires hundreds of millions of image examples to obtain reasonable performance on the ImageNet benchmark dataset. The question of learning a Transformer in a data-efficient manner is an open research problem, and recent works report encouraging steps towards its resolution. For example, DeiT [12] uses a distillation approach to achieve data efficiency, while T2T (Tokens-to-Token) ViT [35] models local structure by combining spatially close tokens together, thus leading to competitive performance when trained only on ImageNet from scratch (without pre-training). By incorporating CNN-like feature hierarchies in ViTs to effectively capture local image cues, ViTs (e.g., CCT [106], NesT [111]) can be trained from scratch even on small-scale datasets (e.g., CIFAR-10). Another approach to data-efficient training of ViTs is proposed in [257]. The authors show that by smoothing the local loss surface using a sharpness-aware minimizer (SAM) [258], ViTs can be trained with a simple data augmentation scheme (random crop and horizontal flip) [259], instead of employing compute-intensive strong data augmentation strategies, and can outperform their counterpart ResNet models.
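The distillation route mentioned above can be sketched in a few lines. The following is a simplified, hard-label variant of a DeiT-style objective (the exact token handling and loss weighting in [12] differ, and the tensors here are dummies): the class token is supervised by the ground-truth labels, while a separate distillation token is supervised by the hard predictions of a pre-trained CNN teacher.

import torch
import torch.nn.functional as F

def hard_distillation_loss(student_cls_logits, student_dist_logits, teacher_logits, labels):
    teacher_targets = teacher_logits.argmax(dim=-1)           # hard labels from the CNN teacher
    loss_cls = F.cross_entropy(student_cls_logits, labels)    # supervision of the class token
    loss_dist = F.cross_entropy(student_dist_logits, teacher_targets)  # supervision of the distillation token
    return 0.5 * loss_cls + 0.5 * loss_dist

logits_cls, logits_dist = torch.randn(8, 1000), torch.randn(8, 1000)
with torch.no_grad():
    teacher_logits = torch.randn(8, 1000)                     # output of a frozen CNN teacher (dummy)
labels = torch.randint(0, 1000, (8,))
loss = hard_distillation_loss(logits_cls, logits_dist, teacher_logits, labels)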
4.3 Vision Tailored Transformer Designs

We note that most of the existing works focused on vision tasks tend to directly apply NLP Transformer models to computer vision problems. These include architectures designed for image recognition [11], video understanding [17] and especially multi-modal processing [181]. Although the initial results from these simple applications are quite encouraging and motivate us to look further into the strengths of self-attention and self-supervised learning, current architectures may still remain better tailored for language problems (with a sequence structure) and need further intuitions to make them more efficient for visual inputs. For example, vector attention from [82] is a nice work in this direction, which attempts to specifically tailor the self-attention operation for visual inputs by learning channel-wise attention. Similarly, [260] uses a Jigsaw puzzle based self-supervision loss as a parallel branch in the Transformer to improve person re-identification. A recent work [35] rearranges spatially close tokens to better model relationships in spatially proximal locations. Token distillation [12] from pre-trained CNN models has also been used as a remedy to inject domain biases into the representations. One may argue that architectures like Transformer models should remain generic to be directly applicable across domains; however, we notice that the high computational and time cost of pre-training such models demands novel design strategies to make their training more affordable on vision problems.

4.4 Neural Architecture Search for ViTs

While Neural Architecture Search (NAS) has been well explored for CNNs to find an optimized architecture, it is relatively less explored for Transformers (even for language transformers [261], [262]). Chen et al. [263] propose a one-shot NAS for vision transformers, called AutoFormer. BossNAS [264] searches for a hybrid architecture (CNN and Transformer). Another recent effort studies the trade-off between global and local information in Transformers in the context of vision applications [265]. It will be insightful to further explore the domain-specific design choices (e.g., the contrasting requirements between language and vision domains) using NAS to design more efficient and light-weight models similar to CNNs [87].

4.5 Interpretability of Transformers

Through an extensive set of carefully designed experiments, Naseer et al. [266] investigate multiple intriguing properties of ViTs in terms of their generalization and robustness. They show that, compared with CNNs, ViTs demonstrate strong robustness against texture changes and severe occlusions, e.g., ViTs retain up to 60% top-1 accuracy on ImageNet once 80% of the image content is randomly occluded. Given the strong performance of Transformer architectures, it is interesting and critical to interpret their decisions, e.g., by visualizing relevant regions in an image for a given classification decision. The main challenge is that the attention originating in each layer gets inter-mixed in the subsequent layers in a complex manner, making it difficult to visualize the relative contribution of input tokens towards the final predictions. This is an open problem; however, some recent works [267]–[269] target enhanced interpretability of Transformers and report encouraging results. Attention roll-out and attention flow methods were proposed in [268] to estimate accurate attentions. However, this method functions in an ad-hoc manner and makes simplistic assumptions, e.g., that input tokens are

linearly combined using attention weights across the layers.
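The roll-out computation of [268] referred to above can be sketched as follows, under the usual simplifications (head-averaging and an equal residual/attention mixing); the token counts in the toy example are illustrative.

import torch

def attention_rollout(attentions):
    # attentions: list of per-layer attention maps, each of shape (heads, n, n).
    # Per layer: average over heads, mix with the identity to account for residual
    # connections, re-normalise, and multiply across layers.
    rollout = torch.eye(attentions[0].shape[-1])
    for attn in attentions:
        a = attn.mean(dim=0)                              # average over heads
        a = 0.5 * a + 0.5 * torch.eye(a.shape[-1])        # add the residual (identity) path
        a = a / a.sum(dim=-1, keepdim=True)               # keep rows normalised
        rollout = a @ rollout                             # accumulate across layers
    return rollout                                        # (n, n) token-to-token relevance estimate

# toy example: 12 layers, 12 heads, 197 tokens (CLS + 14x14 patches)
layers = [torch.softmax(torch.randn(12, 197, 197), dim=-1) for _ in range(12)]
relevance = attention_rollout(layers)
cls_to_patches = relevance[0, 1:]                         # relevance of the CLS token w.r.t. image patches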
Chefer et al. [269] note that the attention scores obtained directly via the self-attention process (encoding relationships between tokens), or the reassignments in [268], do not provide an optimal solution. As an alternative, they propose to assign and propagate relevancy scores in the Transformer network such that the sum of relevancy is constant throughout the network. Their design can handle both the positive and negative attributions experienced in the self-attention layer. The proposed framework has an added advantage of being able to provide class-specific visualizations. Despite these seminal works, visualizing and interpreting Transformers remains an unsolved problem, and methods are needed to obtain spatially precise activation-specific visualizations. Further progress in this direction can help in better understanding Transformer models and diagnosing any erroneous behaviors and biases in the decision process. It can also guide the design of novel architectures that avoid such biases.
4.6 Hardware Efficient Designs

Large-scale Transformer networks can have intensive power and computation requirements, hindering their deployment on edge devices and in resource-constrained environments such as internet-of-things (IoT) platforms. Some recent efforts have been reported to compress and accelerate NLP models on embedded systems such as FPGAs [270]. Li et al. [270] used an enhanced block-circulant matrix-based representation to compress NLP models and proposed a new Field Programmable Gate Array (FPGA) architecture design to efficiently manage resources for high throughput and low latency. They could achieve 27×, 3× and 81× improvements in performance (throughput measured in FPS), reduced power consumption, and energy efficiency relative to a CPU for the RoBERTa model [7]. Towards this goal, [262] proposed to design Hardware-Aware Transformers (HAT) using neural architecture search strategies [271]–[273]. Specifically, a SuperTransformer model is first trained for performance approximation, which can estimate a model's performance without fully training it. This model comprises the largest possible model in the search space while sharing weights between common parts. Eventually, an evolutionary search is performed considering the hardware latency constraints to find a suitable SubTransformer model for a target hardware platform (e.g., IoT device, GPU, CPU). However, such hardware-efficient designs are currently lacking for vision Transformers to enable their seamless deployment in resource-constrained devices. Further, the search cost of the evolutionary algorithms remains significant, with the associated impact of CO2 emissions on the environment.

4.7 Towards Integrating All Modalities

Since Transformers provide a unified design to process different modalities, recent efforts also focus on proposing more generic general-purpose reasoning systems based on Transformers. Inspired by biological systems that can process information from a diverse range of modalities, the Perceiver model [274] aims to learn a unified model that can process any given input modality without making domain-specific architectural assumptions. In order to scale to high-dimensional inputs, Perceiver uses an asymmetric cross-attention method to distill input information into low-dimensional latent bottleneck features. Once the features are distilled into a compact and fixed-dimensional form, regular Transformer blocks are applied in the latent space. The original Perceiver model shows performance competitive with ResNets and ViTs on image classification and can process 3D data, audio, images, video or their combinations. However, this model can only generate fixed outputs, e.g., class probabilities. A recent improvement called Perceiver IO [275] aims to learn models with both flexible inputs as well as arbitrarily sized outputs. This allows application to problems which demand structured outputs, such as natural language tasks and visual comprehension. While these models avoid modality-dependent architectural choices, the learning itself still involves modality-dependent choices, e.g., specific augmentations or positional encodings. An interesting and open future direction is to achieve total modality-agnosticism in the learning pipeline.

5 CONCLUSION

Attention has played a key role in delivering efficient and accurate computer vision systems, while simultaneously providing insights into the function of deep neural networks. This survey reviews the self-attention approaches and specifically focuses on the Transformer and bi-directional encoding architectures that are built on the principle of self-attention. We first cover fundamental concepts pertaining to self-attention architectures and later provide an in-depth analysis of competing approaches for a broad range of computer vision applications. Specifically, we include state-of-the-art self-attention models for image recognition, object detection, semantic and instance segmentation, video analysis and classification, visual question answering, visual commonsense reasoning, image captioning, vision-language navigation, clustering, few-shot learning, and 3D data analysis. We systematically highlight the key strengths and limitations of the existing methods and particularly elaborate on the important future research directions. With its specific focus on computer vision tasks, this survey provides a unique view of the recent progress in self-attention and Transformer-based methods. We hope this effort will drive further interest in the vision community to leverage the potential of Transformer models and improve on their current limitations, e.g., reducing their carbon footprint.

ACKNOWLEDGMENTS

The authors would like to thank Tim Prangemeier (TU Darmstadt), Luowei Zhou (Microsoft Research), Jason Corso (University of Michigan), Pichao Wang (Alibaba Group), Yuqing Wang (Meituan), Alex Meinke (Uni-Tuebingen), Irwan Bello (Google Brain) and Manoj Kumar (Google Brain) for their helpful feedback on the survey. We would also like to thank Mohamed Afham for his help with a figure.

REFERENCES

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in NeurIPS, 2017.
[2] M. Ott, S. Edunov, D. Grangier, and M. Auli, "Scaling neural machine translation," in WMT, 2018.
[3] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre- [29] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A
training of deep bidirectional transformers for language under- neural image caption generator,” in CVPR, 2015.
standing,” arXiv preprint arXiv:1810.04805, 2018. [30] Y. Bengio, I. Goodfellow, and A. Courville, Deep learning. MIT
[4] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Im- press, 2017.
proving language understanding by generative pre-training,” [31] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature,
tech. rep., OpenAI, 2018. 2015.
[5] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, [32] S. Hochreiter and J. Schmidhuber, “Long short-term memory,”
“Language models are unsupervised multitask learners,” tech. Neural computation, 1997.
rep., OpenAI, 2019. [33] D. Hu, “An introductory survey on attention mechanisms in nlp
[6] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, problems,” in IntelliSys, 2019.
P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, [34] Y. Tay, M. Dehghani, D. Bahri, and D. Metzler, “Efficient trans-
et al., “Language models are few-shot learners,” arXiv preprint formers: A survey,” arXiv preprint arXiv:2009.06732, 2020.
arXiv:2005.14165, 2020. [35] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, F. E. Tay, J. Feng, and
[7] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, S. Yan, “Tokens-to-token vit: Training vision transformers from
M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A ro- scratch on imagenet,” arXiv preprint arXiv:2101.11986, 2021.
bustly optimized bert pretraining approach,” arXiv preprint [36] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo,
arXiv:1907.11692, 2019. “Swin transformer: Hierarchical vision transformer using shifted
[8] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, windows,” arXiv preprint arXiv:2103.14030, 2021.
Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer [37] X. Chu, Z. Tian, Y. Wang, B. Zhang, H. Ren, X. Wei, H. Xia,
learning with a unified text-to-text transformer,” arXiv preprint and C. Shen, “Twins: Revisiting the design of spatial attention
arXiv:1910.10683, 2019. in vision transformers,” arXiv preprint arXiv:2104.13840, 2021.
[9] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, [38] S. Wang, B. Li, M. Khabsa, H. Fang, and H. Ma, “Linformer: Self-
M. Krikun, N. Shazeer, and Z. Chen, “Gshard: Scaling giant attention with linear complexity,” arXiv preprint arXiv:2006.04768,
models with conditional computation and automatic sharding,” 2020.
arXiv preprint arXiv:2006.16668, 2020. [39] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-
[10] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling attention generative adversarial networks,” in International con-
to trillion parameter models with simple and efficient sparsity,” ference on machine learning, pp. 7354–7363, PMLR, 2019.
arXiv preprint arXiv:2101.03961. [40] J. Pérez, J. Marinković, and P. Barceló, “On the turing complete-
[11] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, ness of modern neural network architectures,” in International
T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, Conference on Learning Representations, 2018.
et al., “An image is worth 16x16 words: Transformers for image [41] J.-B. Cordonnier, A. Loukas, and M. Jaggi, “On the relationship
recognition at scale,” arXiv preprint arXiv:2010.11929, 2020. between self-attention and convolutional layers,” in International
[12] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and Conference on Learning Representations, 2019.
H. Jégou, “Training data-efficient image transformers & distilla- [42] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei,
tion through attention,” arXiv preprint arXiv:2012.12877, 2020. “Deformable convolutional networks,” in Proceedings of the IEEE
[13] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and international conference on computer vision, pp. 764–773, 2017.
S. Zagoruyko, “End-to-end object detection with transformers,” [43] Y.-C. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng,
arXiv preprint arXiv:2005.12872, 2020. and J. Liu, “UNITER: Universal image-text representation learn-
[14] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable ing,” in ECCV, 2020.
DETR: Deformable transformers for end-to-end object detection,” [44] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu,
arXiv preprint arXiv:2010.04159, 2020. L. Dong, F. Wei, et al., “Oscar: Object-semantics aligned pre-
[15] L. Ye, M. Rochan, Z. Liu, and Y. Wang, “Cross-modal self- training for vision-language tasks,” in ECCV, 2020.
attention network for referring image segmentation,” in CVPR, [45] K. Lin, L. Wang, and Z. Liu, “End-to-end human pose
2019. and mesh reconstruction with transformers,” arXiv preprint
[16] F. Yang, H. Yang, J. Fu, H. Lu, and B. Guo, “Learning texture arXiv:2012.09760, 2020.
transformer network for image super-resolution,” in CVPR, 2020. [46] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised represen-
[17] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid, tation learning by predicting image rotations,” in ICLR, 2018.
“VideoBERT: A joint model for video and language represen- [47] “Revisiting the unreasonable effectiveness of data.” https://ai.
tation learning,” in ICCV, 2019. googleblog.com/2017/07/revisiting-unreasonable-effectiveness.
[18] R. Girdhar, J. Carreira, C. Doersch, and A. Zisserman, “Video html. Accessed: 2020-12-31.
action transformer network,” in CVPR, 2019. [48] L. Jing and Y. Tian, “Self-supervised visual feature learning with
[19] H. Chen, Y. Wang, T. Guo, C. Xu, Y. Deng, Z. Liu, S. Ma, C. Xu, deep neural networks: A survey,” TPAMI, 2020.
C. Xu, and W. Gao, “Pre-trained image processing transformer,” [49] X. Liu, F. Zhang, Z. Hou, Z. Wang, L. Mian, J. Zhang, and
arXiv preprint arXiv:2012.00364, 2020. J. Tang, “Self-supervised learning: Generative or contrastive,”
[20] A. Ramesh, M. Pavlov, G. Goh, and S. Gray, “DALL·E: Creating arXiv preprint arXiv:2006.08218, 2020.
images from text,” tech. rep., OpenAI, 2021. [50] “Aaai 2020 keynotes turing award winners event.” https://www.
[21] H. Tan and M. Bansal, “LXMERT: Learning cross-modality en- youtube.com/watch?v=UX8OubxsY8w. Accessed: 2020-12-31.
coder representations from transformers,” in EMNLP-IJCNLP, [51] R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,”
2019. in ECCV, 2016.
[22] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai, “VL-BERT: [52] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham,
Pre-training of generic visual-linguistic representations,” arXiv A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al., “Photo-
preprint arXiv:1908.08530, 2019. realistic single image super-resolution using a generative adver-
[23] X. Wang, C. Yeshwanth, and M. Nießner, “SceneFormer: sarial network,” in CVPR, 2017.
Indoor scene generation with transformers,” arXiv preprint [53] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. Efros,
arXiv:2012.09793, 2020. “Context encoders: Feature learning by inpainting,” in CVPR,
[24] M. Kumar, D. Weissenborn, and N. Kalchbrenner, “Colorization 2016.
transformer,” in ICLR, 2021. [54] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-
[25] C. Doersch, A. Gupta, and A. Zisserman, “CrossTransformers: Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adver-
spatially-aware few-shot transfer,” NeurIPS, 2020. sarial nets,” in NeurIPS, 2014.
[26] H.-J. Ye, H. Hu, D.-C. Zhan, and F. Sha, “Few-shot learning via [55] D. Lin, K. Fu, Y. Wang, G. Xu, and X. Sun, “MARTA GANs:
embedding adaptation with set-to-set functions,” in CVPR, 2020. Unsupervised representation learning for remote sensing image
[27] S. Chaudhari, G. Polatkan, R. Ramanath, and V. Mithal, classification,” GRSL, 2017.
“An attentive survey of attention models,” arXiv preprint [56] U. Ahsan, R. Madhok, and I. Essa, “Video jigsaw: Unsupervised
arXiv:1904.02874, 2019. learning of spatiotemporal context for video action recognition,”
[28] A. de Santana Correia and E. L. Colombini, “Attention, please! in WACV, 2019.
asurvey of neural attention models in deep learning,” arXiv [57] M. Noroozi and P. Favaro, “Unsupervised learning of visual
preprint arXiv:2103.16775, 2021. representations by solving jigsaw puzzles,” in ECCV, 2016.

[58] D. Kim, D. Cho, D. Yoo, and I. S. Kweon, “Learning image [88] K. Han, A. Xiao, E. Wu, J. Guo, C. Xu, and Y. Wang, “Transformer
representations by completing damaged jigsaw puzzles,” WACV, in transformer,” arXiv preprint arXiv:2103.00112, 2021.
2018. [89] Z. Jiang, Q. Hou, L. Yuan, D. Zhou, Y. Shi, X. Jin, A. Wang, and
[59] L. Jing, X. Yang, J. Liu, and Y. Tian, “Self-supervised spatiotempo- J. Feng, “All tokens matter: Token labeling for training better
ral feature learning via video rotation prediction,” arXiv preprint vision transformers,” arXiv preprint arXiv:2104.10858, 2021.
arXiv:1811.11387, 2018. [90] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, “Cutmix:
[60] H.-Y. Lee, J.-B. Huang, M. Singh, and M.-H. Yang, “Unsupervised Regularization strategy to train strong classifiers with localizable
representation learning by sorting sequences,” in ICCV, 2017. features,” in Proceedings of the IEEE/CVF International Conference
[61] I. Misra, C. L. Zitnick, and M. Hebert, “Shuffle and learn: unsu- on Computer Vision, pp. 6023–6032, 2019.
pervised learning using temporal order verification,” in ECCV, [91] A. El-Nouby, H. Touvron, M. Caron, P. Bojanowski, M. Douze,
2016. A. Joulin, I. Laptev, N. Neverova, G. Synnaeve, J. Verbeek, and
[62] D. Wei, J. J. Lim, A. Zisserman, and W. T. Freeman, “Learning H. Jegou, “Xcit: Cross-covariance image transformers,” 2021.
and using the arrow of time,” in CVPR, 2018. [92] D. Zhou, B. Kang, X. Jin, L. Yang, X. Lian, Z. Jiang, Q. Hou, and
[63] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang, J. Feng, “Deepvit: Towards deeper vision transformer,” 2021.
“VisualBERT: A simple and performant baseline for vision and [93] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo,
language,” in Arxiv preprint arXiv:1908.03557, 2019. and L. Shao, “Pyramid vision transformer: A versatile back-
[64] B. Korbar, D. Tran, and L. T., “Cooperative learning of audio and bone for dense prediction without convolutions,” arXiv preprint
video models from self-supervised synchronization,” in NeurIPS, arXiv:2102.12122, 2021.
2018. [94] J. Yang, C. Li, P. Zhang, X. Dai, B. Xiao, L. Yuan, and J. Gao, “Focal
[65] R. Arandjelovic and A. Zisserman, “Look, listen and learn,” in self-attention for local-global interactions in vision transformers,”
ICCV, 2017. 2021.
[66] N. Sayed, B. Brattoli, and B. Ommer, “Cross and learn: Cross- [95] Z. Huang, Y. Ben, G. Luo, P. Cheng, G. Yu, and B. Fu, “Shuffle
modal self-supervision,” in GCPR, 2018. transformer: Rethinking spatial shuffle for vision transformer,”
[67] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for 2021.
image recognition,” in CVPR, 2016. [96] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang,
[68] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” “Cvt: Introducing convolutions to vision transformers,” arXiv
arXiv preprint arXiv:1607.06450, 2016. preprint arXiv:2103.15808, 2021.
[69] A. Buades, B. Coll, and J.-M. Morel, “A non-local algorithm for [97] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo,
image denoising,” in CVPR, 2005. and L. Shao, “Pvtv2: Improved baselines with pyramid vision
[70] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural transformer,” 2021.
networks,” in CVPR, 2018. [98] W. Xu, Y. Xu, T. Chang, and Z. Tu, “Co-scale conv-attentional
[71] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vi- image transformers,” 2021.
jayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al., [99] W. Wang, L. Yao, L. Chen, D. Cai, X. He, and W. Liu, “Cross-
“The kinetics human action video dataset,” arXiv preprint former: A versatile vision transformer based on cross-scale atten-
arXiv:1705.06950, 2017. tion,” arXiv preprint arXiv:2108.00154, 2021.
[72] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu, [100] C.-F. Chen, R. Panda, and Q. Fan, “Regionvit: Regional-to-local
“CCNet: Criss-cross attention for semantic segmentation,” in attention for vision transformers,” 2021.
ICCV, 2019.
[101] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and
[73] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler,
P. Luo, “Segformer: Simple and efficient design for semantic
R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes
segmentation with transformers,” 2021.
dataset for semantic urban scene understanding,” in CVPR, 2016.
[102] P. Zhang, X. Dai, J. Yang, B. Xiao, L. Yuan, L. Zhang, and J. Gao,
[74] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba,
“Multi-scale vision longformer: A new vision transformer for
“Scene parsing through ade20k dataset,” in CVPR, 2017.
high-resolution image encoding,” ICCV 2021, 2021.
[75] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan,
P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects [103] I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long-
in context,” in ECCV, 2014. document transformer,” arXiv preprint arXiv:2004.05150, 2020.
[76] X. Liang, K. Gong, X. Shen, and L. Lin, “Look into person: Joint [104] C.-F. Chen, Q. Fan, and R. Panda, “Crossvit: Cross-attention
body parsing & pose estimation network and a new benchmark,” multi-scale vision transformer for image classification,” arXiv
TPAMI, 2018. preprint arXiv:2103.14899, 2021.
[77] G. J. Brostow, J. Fauqueur, and R. Cipolla, “Semantic object [105] K. Yuan, S. Guo, Z. Liu, A. Zhou, F. Yu, and W. Wu, “Incorporat-
classes in video: A high-definition ground truth database,” Pat- ing convolution designs into visual transformers,” arXiv preprint
tern Recognition Letters, 2009. arXiv:2103.11816, 2021.
[78] H. Hu, Z. Zhang, Z. Xie, and S. Lin, “Local relation networks for [106] A. Hassani, S. Walton, N. Shah, A. Abuduweili, J. Li, and H. Shi,
image recognition,” in ICCV, 2019. “Escaping the big data paradigm with compact transformers,”
[79] I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le, “Attention 2021.
augmented convolutional networks,” in ICCV, 2019. [107] Y. Li, K. Zhang, J. Cao, R. Timofte, and L. V. Gool, “Localvit:
[80] P. Shaw, J. Uszkoreit, and A. Vaswani, “Self-attention with rela- Bringing locality to vision transformers,” 2021.
tive position representations,” in NAACL, 2018. [108] B. Graham, A. El-Nouby, H. Touvron, P. Stock, A. Joulin,
[81] N. Parmar, P. Ramachandran, A. Vaswani, I. Bello, A. Levskaya, H. Jégou, and M. Douze, “Levit: a vision transformer in convnet’s
and J. Shlens, “Stand-alone self-attention in vision models,” in clothing for faster inference,” 2021.
NeurIPS, 2019. [109] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard,
[82] H. Zhao, J. Jia, and V. Koltun, “Exploring self-attention for image W. Hubbard, and L. D. Jackel, “Backpropagation applied to
recognition,” in CVPR, 2020. handwritten zip code recognition,” Neural computation, vol. 1,
[83] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Good- no. 4, pp. 541–551, 1989.
fellow, and R. Fergus, “Intriguing properties of neural networks,” [110] Q. Zhang and Y. Yang, “Rest: An efficient transformer for visual
arXiv preprint arXiv:1312.6199, 2013. recognition,” arXiv preprint arXiv:2105.13677, 2021.
[84] M. M. Naseer, S. H. Khan, M. H. Khan, F. S. Khan, and F. Porikli, [111] Z. Zhang, H. Zhang, L. Zhao, T. Chen, and T. Pfister, “Aggre-
“Cross-domain transferability of adversarial perturbations,” in gating nested transformers,” in arXiv preprint arXiv:2105.12723,
NeurIPS, 2019. 2021.
[85] M. Naseer, K. Ranasinghe, S. Khan, F. S. Khan, and F. Porikli, “On [112] Z. Dai, H. Liu, Q. V. Le, and M. Tan, “Coatnet: Marrying convo-
improving adversarial transferability of vision transformers,” lution and attention for all data sizes,” 2021.
arXiv preprint arXiv:2106.04169, 2021. [113] X. Chu, Z. Tian, B. Zhang, X. Wang, X. Wei, H. Xia, and C. Shen,
[86] I. Radosavovic, R. P. Kosaraju, R. Girshick, K. He, and P. Dollár, “Conditional positional encodings for vision transformers,” 2021.
“Designing network design spaces,” in CVPR, 2020. [114] Y. Liu, G. Sun, Y. Qiu, L. Zhang, A. Chhatkuli, and L. Van Gool,
[87] M. Tan and Q. V. Le, “EfficientNet: Rethinking model scaling for “Transformer in convolutional neural networks,” arXiv preprint
convolutional neural networks,” in ICML, 2019. arXiv:2106.03180, 2021.

[115] X. Chen, S. Xie, and K. He, “An empirical study of training self- [142] N. Parmar, A. Vaswani, J. Uszkoreit, Ł. Kaiser, N. Shazeer, A. Ku,
supervised visual transformers,” arXiv e-prints, pp. arXiv–2104, and D. Tran, “Image transformer,” in ICML, 2018.
2021. [143] M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, and
[116] X. Chen, H. Fan, R. Girshick, and K. He, “Improved base- I. Sutskever, “Generative pretraining from pixels,” in ICML, 2020.
lines with momentum contrastive learning,” arXiv preprint [144] P. Esser, R. Rombach, and B. Ommer, “Taming transformers for
arXiv:2003.04297, 2020. high-resolution image synthesis,” arXiv:2012.09841, 2020.
[117] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum [145] Y. Jiang, S. Chang, and Z. Wang, “Transgan: Two transformers
contrast for unsupervised visual representation learning,” in can make one strong gan,” 2021.
Proceedings of the IEEE/CVF Conference on Computer Vision and [146] A. K. Bhunia, S. Khan, H. Cholakkal, R. M. Anwer, F. S.
Pattern Recognition, pp. 9729–9738, 2020. Khan, and M. Shah, “Handwriting transformers,” arXiv preprint
[118] Z. Xie, Y. Lin, Z. Yao, Z. Zhang, Q. Dai, Y. Cao, and H. Hu, arXiv:2104.03964, 2021.
“Self-supervised learning with swin transformers,” arXiv preprint [147] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals,
arXiv:2105.04553, 2021. A. Graves, et al., “Conditional image generation with pixelcnn
[119] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, decoders,” in NeurIPS, 2016.
E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, [148] A. Krizhevsky, “Learning multiple layers of features from tiny
et al., “Bootstrap your own latent: A new approach to self- images,” tech. rep., 2009.
supervised learning,” arXiv preprint arXiv:2006.07733, 2020. [149] A. Coates, A. Ng, and H. Lee, “An analysis of single-layer
[120] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, networks in unsupervised feature learning,” in AISTATS, 2011.
and A. Joulin, “Emerging properties in self-supervised vision [150] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple
transformers,” arXiv preprint arXiv:2104.14294, 2021. framework for contrastive learning of visual representations,”
[121] C. Li, J. Yang, P. Zhang, M. Gao, B. Xiao, X. Dai, L. Yuan, arXiv preprint arXiv:2002.05709, 2020.
and J. Gao, “Efficient self-supervised vision transformers for [151] P. Bachman, R. Hjelm, and W. Buchwalter, “Learning represen-
representation learning,” arXiv preprint arXiv:2106.09785, 2021. tations by maximizing mutual information across views,” in
[122] Y. Wang, X. Zhang, T. Yang, and J. Sun, “Anchor detr: Query NeurIPS, 2019.
design for transformer-based detector,” 2021. [152] O. J. Hénaff, A. Srinivas, J. De Fauw, A. Razavi, C. Doersch, S. Es-
[123] T. Chen, S. Saxena, L. Li, D. J. Fleet, and G. Hinton, “Pix2seq: A lami, and A. v. d. Oord, “Data-efficient image recognition with
language modeling framework for object detection,” 2021. contrastive predictive coding,” arXiv preprint arXiv:1905.09272,
[124] Y. Fang, B. Liao, X. Wang, J. Fang, J. Qi, R. Wu, J. Niu, and W. Liu, 2019.
“You only look at one sequence: Rethinking transformer in vision [153] Y. Tian, D. Krishnan, and P. Isola, “Contrastive multiview cod-
through object detection,” 2021. ing,” arXiv preprint arXiv:1906.05849, 2019.
[125] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: To- [154] S. Khan, H. Rahmani, S. A. A. Shah, and M. Bennamoun, “A
wards real-time object detection with region proposal networks,” guide to convolutional neural networks for computer vision,”
TPAMI, 2016. Synthesis Lectures on Computer Vision, 2018.
[126] R. Girshick, “Fast R-CNN,” in ICCV, 2015. [155] A. Radford, L. Metz, and S. Chintala, “Unsupervised represen-
tation learning with deep convolutional generative adversarial
[127] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in
networks,” arXiv preprint arXiv:1511.06434, 2015.
ICCV, 2017.
[156] C. Gao, Y. Chen, S. Liu, Z. Tan, and S. Yan, “Adversarialnas: Ad-
[128] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only
versarial neural architecture search for gans,” in CVPR, pp. 5680–
look once: Unified, real-time object detection,” in CVPR, 2016.
5689, 2020.
[129] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and
[157] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila,
A. C. Berg, “SSD: Single shot multibox detector,” in ECCV, 2016.
“Analyzing and improving the image quality of stylegan,” in
[130] T. Prangemeier, C. Reich, and H. Koeppl, “Attention-based trans- CVPR, pp. 8110–8119, 2020.
formers for instance segmentation of cells in microstructures,” in [158] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee,
2020 IEEE International Conference on Bioinformatics and Biomedicine “Generative adversarial text to image synthesis,” in ICML, 2016.
(BIBM), pp. 700–707, IEEE, 2020.
[159] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N.
[131] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and Metaxas, “StackGAN: Text to photo-realistic image synthesis
S. Belongie, “Feature pyramid networks for object detection,” in with stacked generative adversarial networks,” in ICCV, 2017.
CVPR, 2017. [160] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N.
[132] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, Metaxas, “StackGAN++: Realistic image synthesis with stacked
T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, generative adversarial networks,” TPAMI, 2018.
J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: [161] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and
Transformers for image recognition at scale,” 2020. X. He, “AttnGAN: Fine-grained text to image generation with
[133] H. Wang, Y. Zhu, B. Green, H. Adam, A. Yuille, and L.-C. attentional generative adversarial networks,” in CVPR, 2018.
Chen, “Axial-DeepLab: Stand-alone axial-attention for panoptic [162] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,”
segmentation,” arXiv preprint arXiv:2003.07853, 2020. arXiv preprint arXiv:1312.6114, 2013.
[134] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, [163] A. Razavi, A. van den Oord, and O. Vinyals, “Generating diverse
J. Feng, T. Xiang, P. H. S. Torr, and L. Zhang, “Rethinking high-fidelity images with vq-vae-2,” in NeurISP, 2019.
semantic segmentation from a sequence-to-sequence perspective [164] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte,
with transformers,” 2021. “Swinir: Image restoration using swin transformer,” in ICCVW,
[135] R. Strudel, R. Garcia, I. Laptev, and C. Schmid, “Segmenter: 2021.
Transformer for semantic segmentation,” 2021. [165] Z. Wang, X. Cun, J. Bao, and J. Liu, “Uformer: A general
[136] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár, “Panoptic u-shaped transformer for image restoration,” arXiv preprint
segmentation,” in CVPR, 2019. arXiv:2106.03106, 2021.
[137] G. Neuhold, T. Ollmann, S. Rota Bulo, and P. Kontschieder, “The [166] Z. Lu, H. Liu, J. Li, and L. Zhang, “Efficient transformer for single
mapillary vistas dataset for semantic understanding of street image super-resolution,” arXiv preprint arXiv:2108.11084, 2021.
scenes,” in ICCV, 2017. [167] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image
[138] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, super-resolution using very deep residual channel attention net-
“ImageNet: A large-scale hierarchical image database,” in CVPR, works,” in ECCV, 2018.
2009. [168] T. Dai, J. Cai, Y. Zhang, S. Xia, and L. Zhang, “Second-order
[139] L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg, “Modeling attention network for single image super-resolution,” in CVPR,
context in referring expressions,” in ECCV, 2016. 2019.
[140] J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and [169] B. Niu, W. Wen, W. Ren, X. Zhang, L. Yang, S. Wang, K. Zhang,
K. Murphy, “Generation and comprehension of unambiguous X. Cao, and H. Shen, “Single image super-resolution via a holistic
object descriptions,” in CVPR, 2016. attention network,” in ECCV, 2020.
[141] S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg, “Refer- [170] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, “Enhanced deep
itgame: Referring to objects in photographs of natural scenes,” residual networks for single image super-resolution,” in CVPRW,
in EMNLP, 2014. 2017.

[171] Y. Tai, J. Yang, and X. Liu, “Image super-resolution via deep [197] L. Zhou, H. Palangi, L. Zhang, H. Hu, J. Corso, and J. Gao,
recursive residual network,” in CVPR, 2017. “Unified vision-language pre-training for image captioning and
[172] W. Han, S. Chang, D. Liu, M. Yu, M. Witbrock, and T. Huang, vqa,” in AAAI, vol. 34, pp. 13041–13049, 2020.
“Image super-resolution via dual-state recurrent networks,” in [198] C. Sun, F. Baradel, K. Murphy, and C. Schmid, “Learning video
CVPR, 2018. representations using contrastive bidirectional transformer,”
[173] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, “Residual dense arXiv preprint arXiv:1906.05743, 2019.
network for image restoration,” TPAMI, 2020. [199] C. Alberti, J. Ling, M. Collins, and D. Reitter, “Fusion of detected
[174] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and objects in text for visual question answering,” in EMNLP, 2019.
C. Change Loy, “ESRGAN: enhanced super-resolution generative [200] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz,
adversarial networks,” in ECCVW, 2018. S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al., “Visual
[175] S.-J. Park, H. Son, S. Cho, K.-S. Hong, and S. Lee, “SRFEAT: Single genome: Connecting language and vision using crowdsourced
image super-resolution with feature discrimination,” in ECCV, dense image annotations,” IJCV, 2017.
2018. [201] V. Ordonez, G. Kulkarni, and T. L. Berg, “Im2text: Describing
[176] M. S. Sajjadi, B. Scholkopf, and M. Hirsch, “EnhanceNet: Single images using 1 million captioned photographs,” in NeurIPS, 2011.
image super-resolution through automated texture synthesis,” in [202] P. Zhang, X. Li, X. Hu, J. Yang, L. Zhang, L. Wang, Y. Choi,
ICCV, 2017. and J. Gao, “Vinvl: Revisiting visual representations in vision-
[177] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, language models,” in Proceedings of the IEEE/CVF Conference on
A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al., “Photo- Computer Vision and Pattern Recognition, pp. 5579–5588, 2021.
realistic single image super-resolution using a generative adver- [203] A. Kamath, M. Singh, Y. LeCun, I. Misra, G. Synnaeve, and
sarial network,” in CVPR, 2017. N. Carion, “Mdetr–modulated detection for end-to-end multi-
[178] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real- modal understanding,” arXiv preprint arXiv:2104.12763, 2021.
time style transfer and super-resolution,” in ECCV, 2016. [204] J. Deng, Z. Yang, T. Chen, W. Zhou, and H. Li, “Transvg: End-to-
[179] J. Ho, N. Kalchbrenner, D. Weissenborn, and T. Salimans, “Ax- end visual grounding with transformers,” 2021.
ial attention in multidimensional transformers,” arXiv preprint [205] M. Li and L. Sigal, “Referring transformer: A one-step approach
arXiv:1912.12180, 2019. to multi-task visual grounding,” arXiv preprint arXiv:2106.03089,
[180] G. Li, N. Duan, Y. Fang, M. Gong, D. Jiang, and M. Zhou, 2021.
“Unicoder-VL: A universal encoder for vision and language by [206] Y. Du, Z. Fu, Q. Liu, and Y. Wang, “Visual grounding with
cross-modal pre-training.,” in AAAI, 2020. transformers,” arXiv preprint arXiv:2105.04281, 2021.
[181] J. Lu, D. Batra, D. Parikh, and S. Lee, “Vilbert: Pretraining task- [207] S. Ging, M. Zolfaghari, H. Pirsiavash, and T. Brox, “COOT: Co-
agnostic visiolinguistic representations for vision-and-language operative hierarchical transformer for video-text representation
tasks,” in NeurIPS, 2019. learning,” arXiv preprint arXiv:2011.00597, 2020.
[182] S. Lee, Y. Yu, G. Kim, T. Breuel, J. Kautz, and Y. Song, “Param- [208] H. Seong, J. Hyun, and E. Kim, “Video multitask transformer
eter efficient multimodal transformers for video representation network,” in ICCV Workshops, pp. 0–0, 2019.
learning,” arXiv preprint arXiv:2012.04124, 2020.
[209] Y. Wang, Z. Xu, X. Wang, C. Shen, B. Cheng, H. Shen, and H. Xia,
[183] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, “End-to-end video instance segmentation with transformers,”
C. Lawrence Zitnick, and D. Parikh, “VQA: Visual question arXiv preprint arXiv:2011.14503, 2020.
answering,” in ICCV, 2015.
[210] L. Zhou, Y. Zhou, J. Corso, R. Socher, and C. Xiong, “End-to-
[184] R. Zellers, Y. Bisk, A. Farhadi, and Y. Choi, “From recognition to
end dense video captioning with masked transformer,” in CVPR,
cognition: Visual commonsense reasoning,” in CVPR, 2019.
2018.
[185] K.-H. Lee, X. Chen, G. Hua, H. Hu, and X. He, “Stacked cross
[211] D. Neimark, O. Bar, M. Zohar, and D. Asselmann, “Video trans-
attention for image-text matching,” in ECCV, 2018.
former network,” arXiv preprint arXiv:2102.00719, 2021.
[186] A. Suhr, S. Zhou, A. Zhang, I. Zhang, H. Bai, and Y. Artzi,
[212] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and
“A corpus for reasoning about natural language grounded in
C. Schmid, “Vivit: A video vision transformer,” arXiv preprint
photographs,” arXiv preprint arXiv:1811.00491, 2018.
arXiv:2103.15691, 2021.
[187] J. Carreira, E. Noland, C. Hillier, and A. Zisserman, “A short
note on the kinetics-700 human action dataset,” arXiv:1907.06987, [213] G. Bertasius, H. Wang, and L. Torresani, “Is space-time attention
2019. all you need for video understanding?,” in Proceedings of the
International Conference on Machine Learning (ICML), July 2021.
[188] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101
human actions classes from videos in the wild,” arXiv preprint [214] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles,
arXiv:1212.0402, 2012. “Dense-captioning events in videos,” in ICCV, pp. 706–715, 2017.
[189] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, [215] L. Zhou, C. Xu, and J. Corso, “Towards automatic learning of
R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology procedures from web instructional videos,” in AAAI, vol. 32,
and human-labeled dataset for audio events,” in ICASSP, 2017. 2018.
[190] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and [216] C. Plizzari, M. Cannici, and M. Matteucci, “Spatial tempo-
A. Gupta, “Hollywood in homes: Crowdsourcing data collection ral transformer network for skeleton-based action recognition,”
for activity understanding,” in ECCV, 2016. arXiv preprint arXiv:2008.07404, 2020.
[191] H. Tan and M. Bansal, “Vokenization: Improving language un- [217] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, “NTU RGB+D: A
derstanding with contextualized, visual-grounded supervision,” large scale dataset for 3d human activity analysis,” in CVPR,
in EMNLP, 2020. 2016.
[192] W. Hao, C. Li, X. Li, L. Carin, and J. Gao, “Towards learning [218] J. Liu, A. Shahroudy, M. Perez, G. Wang, L.-Y. Duan, and A. C.
a generic agent for vision-and-language navigation via pre- Kot, “NTU RGB+D 120: A large-scale benchmark for 3d human
training,” in CVPR, 2020. activity understanding,” TPAMI, 2019.
[193] A. Majumdar, A. Shrivastava, S. Lee, P. Anderson, D. Parikh, [219] H. Fan, B. Xiong, K. Mangalam, Y. Li, Z. Yan, J. Malik, and
and D. Batra, “Improving vision-and-language navigation with C. Feichtenhofer, “Multiscale vision transformers,” 2021.
image-text pairs from the web,” arXiv preprint arXiv:2004.14973, [220] J. Wang, G. Bertasius, D. Tran, and L. Torresani, “Long-short tem-
2020. poral contrastive learning of video transformers,” arXiv preprint
[194] K. Chen, J. K. Chen, J. Chuang, M. Vázquez, and S. Savarese, arXiv:2106.09212, 2021.
“Topological planning with transformers for vision-and- [221] L. Yang, Y. Fan, and N. Xu, “Video instance segmentation,” in
language navigation,” arXiv preprint arXiv:2012.05292, 2020. ICCV, pp. 5188–5197, 2019.
[195] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agar- [222] G. Bertasius and L. Torresani, “Classifying, segmenting, and
wal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning tracking object instances in video with mask propagation,” in
transferable visual models from natural language supervision,” CVPR, pp. 9739–9748, 2020.
Image, vol. 2, p. T2, 2021. [223] E. Triantafillou, T. Zhu, V. Dumoulin, P. Lamblin, U. Evci, K. Xu,
[196] P. Sharma, N. Ding, S. Goodman, and R. Soricut, “Conceptual R. Goroshin, C. Gelada, K. Swersky, P.-A. Manzagol, et al., “Meta-
captions: A cleaned, hypernymed, image alt-text dataset for dataset: A dataset of datasets for learning to learn from few
automatic image captioning,” in ACL, 2018. examples,” in ICLR, 2020.
30

[224] T. N. Kipf and M. Welling, “Semi-supervised classification with [251] X. Dong, J. Bao, D. Chen, W. Zhang, N. Yu, L. Yuan, D. Chen,
graph convolutional networks,” arXiv preprint arXiv:1609.02907, and B. Guo, “Cswin transformer: A general vision trans-
2016. former backbone with cross-shaped windows,” arXiv preprint
[225] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhut- arXiv:2107.00652, 2021.
dinov, and A. J. Smola, “Deep sets,” in NeurIPS, 2017. [252] Y. Xiong, Z. Zeng, R. Chakraborty, M. Tan, G. Fung, Y. Li, and
[226] L. Liu, W. Hamilton, G. Long, J. Jiang, and H. Larochelle, “A V. Singh, “Nystr\” omformer: A nystr\” om-based algorithm for
universal representation transformer layer for few-shot image approximating self-attention,” in AAAI, 2021.
classification,” 2020. [253] Y. Tay, D. Bahri, D. Metzler, D. Juan, Z. Zhao, and C. Zheng,
[227] H. Edwards and A. Storkey, “Towards a neural statistician,” arXiv “Synthesizer: Rethinking self-attention in transformer models,”
preprint arXiv:1606.02185, 2016. in ICML, 2021.
[228] J. Lee, Y. Lee, J. Kim, A. Kosiorek, S. Choi, and Y. W. Teh, [254] H. Peng, N. Pappas, D. Yogatama, R. Schwartz, N. A. Smith, and
“Set transformer: A framework for attention-based permutation- L. Kong, “Random feature attention,” in ICLR, 2021.
invariant neural networks,” in ICML, 2019. [255] K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane,
[229] J. Lee, Y. Lee, and Y. W. Teh, “Deep amortized clustering,” arXiv T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, et al.,
preprint arXiv:1909.13433, 2019. “Rethinking attention with performers,” in ICLR, 2021.
[230] H. Zhao, L. Jiang, J. Jia, P. Torr, and V. Koltun, “Point trans- [256] Y. Tay, D. Bahri, L. Yang, D. Metzler, and D.-C. Juan, “Sparse
former,” arXiv preprint arXiv:2012.09164, 2020. sinkhorn attention,” in ICML, 2020.
[231] M.-H. Guo, J.-X. Cai, Z.-N. Liu, T.-J. Mu, R. R. Martin, [257] X. Chen, C.-J. Hsieh, and B. Gong, “When vision transformers
and S.-M. Hu, “Pct: Point cloud transformer,” arXiv preprint outperform resnets without pretraining or strong data augmen-
arXiv:2012.09688, 2020. tations,” arXiv preprint arXiv:2106.01548, 2021.
[232] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, [258] P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur, “Sharpness-
“3D ShapeNets: A deep representation for volumetric shapes,” in aware minimization for efficiently improving generalization,”
CVPR, 2015. arXiv preprint arXiv:2010.01412, 2020.
[233] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, [259] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna,
Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and “Rethinking the inception architecture for computer vision,” in
F. Yu, “ShapeNet: An information-rich 3d model repository,” Proceedings of the IEEE conference on computer vision and pattern
arXiv preprint arXiv:1512.03012, 2015. recognition, pp. 2818–2826, 2016.
[234] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, “Hu- [260] S. He, H. Luo, P. Wang, F. Wang, H. Li, and W. Jiang, “Transreid:
man3.6M: Large scale datasets and predictive methods for 3D Transformer-based object re-identification,” arXiv:2102.04378,
human sensing in natural environments,” TPAMI, 2013. 2021.
[235] T. von Marcard, R. Henschel, M. J. Black, B. Rosenhahn, and [261] D. R. So, C. Liang, and Q. V. Le, “The evolved transformer,” 2019.
G. Pons-Moll, “Recovering accurate 3d human pose in the wild [262] H. Wang, Z. Wu, Z. Liu, H. Cai, L. Zhu, C. Gan, and S. Han, “Hat:
using imus and a moving camera,” in ECCV, 2018. Hardware-aware transformers for efficient natural language pro-
[236] C. Zimmermann, D. Ceylan, J. Yang, B. Russell, M. Argus, and cessing,” 2020.
T. Brox, “FreiHAND: A dataset for markerless capture of hand [263] M. Chen, H. Peng, J. Fu, and H. Ling, “Autoformer:
pose and shape from single rgb images,” in ICCV, 2019. Searching transformers for visual recognition,” arXiv preprint
arXiv:2107.00651, 2021.
[237] “OpenAI’s GPT-3 language model: A technical overview.” https:
[264] C. Li, T. Tang, G. Wang, J. Peng, B. Wang, X. Liang, and
//lambdalabs.com/blog/demystifying-gpt-3/. Accessed: 2020-
X. Chang, “Bossnas: Exploring hybrid cnn-transformers with
12-31.
block-wisely self-supervised neural architecture search,” arXiv
[238] X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer, “Scaling vision
preprint arXiv:2103.12424, 2021.
transformers,” 2021.
[265] B. Chen, P. Li, C. Li, B. Li, L. Bai, C. Lin, M. Sun, W. Ouyang,
[239] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, “From image
et al., “Glit: Neural architecture search for global and local image
descriptions to visual denotations: New similarity metrics for
transformer,” arXiv preprint arXiv:2107.02960, 2021.
semantic inference over event descriptions,” TACL, 2014.
[266] M. Naseer, K. Ranasinghe, S. Khan, M. Hayat, F. S. Khan, and
[240] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, M.-H. Yang, “Intriguing properties of vision transformers,” arXiv
“Making the v in vqa matter: Elevating the role of image un- preprint arXiv:2105.10497, 2021.
derstanding in visual question answering,” in CVPR, 2017. [267] E. Voita, D. Talbot, F. Moiseev, R. Sennrich, and I. Titov, “Ana-
[241] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hock- lyzing multi-head self-attention: Specialized heads do the heavy
enmaier, and S. Lazebnik, “Flickr30k entities: Collecting region- lifting, the rest can be pruned,” arXiv preprint arXiv:1905.09418,
to-phrase correspondences for richer image-to-sentence models,” 2019.
in ICCV, 2015. [268] S. Abnar and W. Zuidema, “Quantifying attention flow in trans-
[242] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deep formers,” arXiv preprint arXiv:2005.00928, 2020.
hierarchical feature learning on point sets in a metric space,” [269] H. Chefer, S. Gur, and L. Wolf, “Transformer interpretability
NeurIPS, 2017. beyond attention visualization,” arXiv preprint arXiv:2012.09838,
[243] H. Touvron, M. Cord, A. Sablayrolles, G. Synnaeve, and 2020.
H. Jégou, “Going deeper with image transformers,” arXiv preprint [270] B. Li, S. Pandey, H. Fang, Y. Lyv, J. Li, J. Chen, M. Xie, L. Wan,
arXiv:2103.17239, 2021. H. Liu, and C. Ding, “FTRANS: energy-efficient acceleration of
[244] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated transformers using fpga,” in ISLPED, 2020.
residual transformations for deep neural networks,” in CVPR, [271] G. Bender, P.-J. Kindermans, B. Zoph, V. Vasudevan, and Q. Le,
2017. “Understanding and simplifying one-shot architecture search,”
[245] R. Child, S. Gray, A. Radford, and I. Sutskever, “Generating long in ICML, 2018.
sequences with sparse transformers,” arXiv:1904.10509, 2019. [272] Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei, and J. Sun,
[246] N. Kitaev, Ł. Kaiser, and A. Levskaya, “Reformer: The efficient “Single path one-shot neural architecture search with uniform
transformer,” in ICLR, 2020. sampling,” arXiv preprint arXiv:1904.00420, 2019.
[247] I. Bello, “Lambdanetworks: Modeling long-range interactions [273] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean, “Efficient
without attention,” in International Conference on Learning Repre- neural architecture search via parameter sharing,” in ICML, 2018.
sentations, 2021. [274] A. Jaegle, F. Gimeno, A. Brock, A. Zisserman, O. Vinyals, and
[248] A. Vyas, A. Katharopoulos, and F. Fleuret, “Fast transformers J. Carreira, “Perceiver: General perception with iterative atten-
with clustered attention,” NeurIPS, 2020. tion,” arXiv preprint arXiv:2103.03206, 2021.
[249] Y.-H. Wu, Y. Liu, X. Zhan, and M.-M. Cheng, “P2t: Pyramid [275] A. Jaegle, S. Borgeaud, J.-B. Alayrac, C. Doersch, C. Ionescu,
pooling transformer for scene understanding,” arXiv preprint D. Ding, S. Koppula, D. Zoran, A. Brock, E. Shelhamer, et al.,
arXiv:2106.12011, 2021. “Perceiver io: A general architecture for structured inputs &
[250] A. Vaswani, P. Ramachandran, A. Srinivas, N. Parmar, B. Hecht- outputs,” arXiv preprint arXiv:2107.14795, 2021.
man, and J. Shlens, “Scaling local self-attention for parameter
efficient visual backbones,” in Proceedings of the IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition, pp. 12894–12904,
2021.
