0% found this document useful (0 votes)

17 views30 pages

Transformers in Computational Visual Media A Surve

This survey reviews the application of transformers in computational visual media, highlighting their advantages over traditional convolutional neural networks (CNNs) in modeling long-range relationships and improving performance in various vision tasks. The paper categorizes recent advancements in visual transformers across four main areas: backbone design, high-level vision, low-level vision and generation, and multimodal learning, providing detailed comparisons and insights into the latest methodologies. Additionally, the survey includes quantitative evaluations and source code links to facilitate further research in the field.

Uploaded by

Miftah

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

17 views30 pages

Transformers in Computational Visual Media A Surve

Uploaded by

Miftah

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 30

Computational Visual Media

https://doi.org/10.1007/s41095-021-0247-3 Vol. 8, No. 1, March 2022, 33–62

Review Article

Transformers in computational visual media: A survey

Yifan Xu1,2 , Huapeng Wei3 , Minxuan Lin1,2 , Yingying Deng1,2 , Kekai Sheng4 , Mengdan Zhang4 ,
Fan Tang3 , Weiming Dong1,2,5 ( ), Feiyue Huang4 , and Changsheng Xu1,2,5

c The Author(s) 2021.

Abstract Transformers, the dominant architecture Keywords visual transformer; computational visual
for natural language processing, have also recently media (CVM); high-level vision; low-level
attracted much attention from computational visual vision; image generation; multi-modal
media researchers due to their capacity for long-range learning
representation and high performance. Transformers
are sequence-to-sequence models, which use a self-
1 Introduction
attention mechanism rather than the RNN sequential
structure. Thus, such models can be trained in parallel Convolutional neural networks (CNNs) [1–3] have
and can represent global information. This study become the fundamental architecture in computa-
comprehensively surveys recent visual transformer tional visual media (CVM). Researchers began to
works. We categorize them according to task scenario: incorporate a self-attention mechanism into CNNs to
backbone design, high-level vision, low-level vision and model long-range relationships, due to the problem
generation, and multimodal learning. Their key ideas of locality of convolutional kernels [4–8]. Recently,
are also analyzed. Differing from previous surveys, Dosovitskiy et al. [9] found that using a self-attention-
we mainly focus on visual transformer methods in only structure, without convolution, works well
low-level vision and generation. The latest works in computer vision. Since then, the transformer
on backbone design are also reviewed in detail. For
architecture [10], a non-convolutional architecture
ease of understanding, we precisely describe the main
dominating the research field of natural language
contributions of the latest works in the form of tables.
processing (NLP), has has been used in computer
As well as giving quantitative comparisons, we also
vision. Introducing transformers into computer vision
present image results for low-level vision and generation
tasks. Computational costs and source code links for provides four advantages that CNNs lack:
various important works are also given in this survey to • Transformers learn with more inductive bias and
assist further development. performs better when trained on large datasets
(e.g., ImageNet-21K or JFT-300M) [9, 11].
1 NLPR, Institute of Automation, Chinese Academy of Sciences, • Transformers provide a more general architecture
Beijing 100190, China. E-mail: Y. Xu, xuyifan2019@ia.ac.cn; suitable for most fields, including NLP, CV, and
M. Lin, linminxuan2018@ia.ac.cn; Y. Deng, dengyingying2017@
multimodal learning.
ia.ac.cn; W. Dong, weiming.dong@ia.ac.cn ( ); C. Xu,
changsheng.xu@ia.ac.cn. • Transformers powerfully model long-range
2 School of Artificial Intelligence, University of Chinese interactions in a computationally-efficient manner
Academy of Sciences, Beijing 100040, China. [12, 13].
3 School of Artificial Intelligence, Jilin University, Changchun • The learned representation of relationships is
130012, China. E-mail: H. Wei, weihp20@jlu.edu.cn; more general and robust than the local patterns
F. Tang, tangfan@jlu.edu.cn.
from convolution modules [14].
4 Youtu Lab, Tencent Inc., Shanghai 200233, China. E-mail:
As Table 1 shows, an increasing number of works on
K. Sheng, saulsheng@tencent.com; M. Zhang, davinazhang@
tencent.com; F. Huang, garyhuang@tencent.com. visual transformers have come out in various subfields
5 CASIA-LLVISION Joint Lab, Beijing 100190, China. of computational visual media. An instructive survey
Manuscript received: 2021-06-17; accepted: 2021-07-16 is important because of the difficulties in arranging

33
34 Y. Xu, H. Wei, M. Lin, et al.

Table 1 Recent visual transformers introduced in this survey

Area Secondary area Method Contributions

T2T ViT [15] An effective and efficient tokens-to-token module
TNT [16] The first to exploit the benefit of pixel-level relations
CPVT [17] An instance-level position embedding module
Backbone Classification ConViT [18] Adaptive reception field in visual transformers
network DeepViT [19] A Re-Attention module for deep-layer ViTs
Swin Transformer [20] A shifted-window based MSA & a deep-narrow module
PiT [21] The first to investigate the benefit of pooling in ViTs
LocalViT [22] A depth-wise convolution based module to exploit locality
Visualization Transformer-Explainability [23] A better tool to visualize feature maps from ViT models

DETR [24] First transformer-based detection SOTA model

Deformable DETR [25] An eﬃcient attention module reducing time consumption
High-level Detection
UP-DETR [26] An unsupervised pre-training method for DETR
vision
PVT [27] A general transformer architecture for dense prediction
VisTR [28] First transformer-based segmentation model
Segmentation
SegFormer [29] A lightweight eﬃcient segmentation transformer model

Colorization ColTran [30] First transformer-based image colorization model

TIME [31] Text-to-image generation
Text-to-image
DALL·E [32] Zero-shot text-to-image generation framework
IPT [11] Image processing model
Low-level Super resolution
TTSR [33] Flexible application of transformer
vision
TransGAN [34] First pure transformer-based GAN for generation
Image generation GANsformer [35] A bipartite transformer
VQGAN [36] A transformer-based high-resolution image generator
Image restoration Uformer [37] A transformer-based hierarchical encoder–decoder network
Style transfer StyTr2 [38] First transformer-based style transfer model
Point cloud learning PCT [39] Among the ﬁrst transformer-based point cloud models
Two-stream model ViLBERT [40] The ﬁrst proposed two-stream model for V+L tasks
Multi-modality
Single-stream model UNITER [41] A universal model for joint multi-modal embedding
learning
Mixed model SemVLP [42] First mixed single- and two-stream model

such fast and abundant developments. Due to the level vision, we introduce the mainstream of DETR-
fast development of visual transformer backbones, based transformer detection models [24]. For low-level
this survey specifically focuses on the latest works in vision and generation, we arrange papers according
that area, as well as low-level vision tasks. to different subareas including colorization [30, 43–
Specifically, this study is mainly arranged into four 45], text-to-image [31, 32, 46], super-resolution [47–
specific fields: backbone design, high-level vision (e.g., 49], and image generation [50–54]. For multimodal
object detection and semantic segmentation), low- learning, we review some recent representative
level vision and generation, and multimodal learning. works on vision-plus-language (V+L) models and
We highlight backbone design and low-level vision summarize pretraining objectives in this field.
as our main focus in Fig. 1. The developments to We comprehensively compare results in different
be introduced are summarised in Table 1. For fields and give training details, including computa-
backbone design, several latest works are introduced, tional cost and source code links to facilitate and
considering two aspects: (i) injecting convolutional encourage further research. Some images resulting
prior knowledge into ViT, and (ii) boosting the from low-level vision models are also illustrated. The
richness of visual features. We also summarize the rest of the paper is organized as follows. Section 2
breakthrough ideas of each work in Fig. 1. For high- introduces visual transformers. Section 3 lists latest
Transformers in computational visual media: A survey 35

Fig. 1 Organisation of recent works on visual transformers.

developments in backbone networks for visual blocks [55] (i.e., layer normalization (LN) [56] + multi-
transformers in image classification. Section 4 head self-attention (MSA) [57] + skip-connection
describes several recent advanced designs using visual layer [1] + multilayer perception (MLP) or feed-
transformers in object detection. Section 5 introduces forward network (FFN)), and post-process module.
transformer-based methods for various low-level Formally, given an input image X ∈ RH×W ×C and
vision tasks. Section 6 reviews recent representative its labels Y , X is first reshaped into a sequence of
2
works on multimodal learning. Finally, we draw flattened 2D image patches Xp ∈ RN ×(P ·C) . Then,
conclusions from different research fields in Section 7. following BERT [10], a class token and several
position tokens are used to record extra meaningful
2 Visual transformers information for inference. Together, the input is
formulated as follows:
Before introducing the latest developments, we give
z0 = [xcls ; x1p · E; · · · ; xN
p · E]
the basic formulation of visual transformers by using cls 1 N
ViT [9] as an example. As shown in Fig. 2, a typical + [Epos ; Epos ; · · · ; Epos ]
2
ViT mainly contains five basic procedures: splitting where xcls ∈ RD is the class token, E ∈ R(P ·C)×D is
i
input images into smaller local patches, preparing a linear projection of each patch Xp , and Epos ∈ RD
the input token (patch tokens, class token, and is the learnable position embedding for the i-th token.
position embedding), a series of stacked transformer Then, the input is sent into several sequential

Fig. 2 Framework of ViT (left) and typical pipeline of a transformer encoder (right). Reproduced with permission from Ref. [9],
c The
Author(s) 2021.
36 Y. Xu, H. Wei, M. Lin, et al.

transformer blocks: patches and reduce the length of tokens progressively

zl+1 = zl + MSA(LN(zl )) for the sake of computational and parameter efficiency.
Inspired by CNN architectures [1–3], they also
zl+1 = zl+1 + MLP(LN(zl+1 ))
devise a deep-narrow ViT framework to reduce the
where l ∈ {0, · · · , L − 1} denotes the layer, L is number of parameters and enhance training efficiency.
the number of transformer blocks, the MLP includes Overall, they train ViT models from scratch on
two fully-connected layers using GELU [58] as the ImageNet without additional datasets.
activation function, LN(·) is a layer-normalization 3.1.2 TNT
module [56], and the MSA module is formulated as Han et al. [16] propose a novel Transformer-iN-
MSA(z) = [SA1 (z); · · · ; SAH (z)] × Umsa Transformer (TNT) framework to further exploit
the intrinsic spatial structural information in image
Q · KT data. As Fig. 3 shows, TNT considers patch and
SAi (Q, K, V ) = σ √ ·V
dk pixel level relations in learning useful visual features.
i i
where z is the input, [Q, K, V ] = z × Uqkv , Uqkv ∈ They propose a TNT block to utilize the pixel-
D×(3·Dh )
R projects the D-dimensional input z to Dh - level representations effectively and efficiently. They
dimensional Q, K, and V in the head i, σ(·) is the introduce an additional transformer called an Inner
softmax function, and Umsa ∈ R(H·Dh )×D re-casts the T-Block to model pixel-level relationships in each
output from H heads of the MSA module into one patch and then reinforce the patch-level features
D-dimensional output. Several variants of MSA, like with the calculated pixel-level ones. Consequently,
Reformer [59], Performer [60], and LinFormer [61], TNT achieves 81.3% top-1 classification accuracy
are available. on ImageNet [64] at the cost of only moderate
additional computation. The experimental results
3 Backbone design verify the positive effects of pixel-level relation
modeling.
In this section, we describe several recent designs
3.1.3 ConViT
for the backbone of ViT models. Without loss
of generality, we focus on the image classification D’Ascoli et al. [18] propose a novel ViT model
task. We divide recent progress into two mainstream with soft convolutional inductive biases (ConViT) to
approaches: (i) injecting convolutional prior know- endow transformers with an adaptive receptive field.
ledge into ViT, works including T2T-ViT [15], Figure 4 schematically shows the core block, called
ConViT [18], PiT [21], and Swin Transformer [20], a gated positional self-attention (GPSA) module. A
and (ii) boosting the richness of visual features, GPSA block has two branches: Wqry or Wkey is used
including TNT [16], CPVT [17], DeepViT [19], to model the global or long-range relationship, and
and LocalViT [22]. We also briefly describe recent vpos is utilized to model the relationship within local
developments in visualizing feature maps of ViT regions. To adaptively trade-off between the two
models [23, 62, 63], which help to better understand the branches, they adopt a learnable parameter λ, which
working mechanism of ViT models. We list core details
of their performance on ImageNet [64] in Table 2.
3.1 Latest developments
3.1.1 T2T-ViT
Yuan et al. [15] note that the method to convert input
images into tokens in a typical ViT [9] ineffectively
models the spatial structure of image data and
may lead to poor training efficiency and suboptimal
performance. They propose two effective approaches
to address the aforementioned problem. First, they
propose a token-to-token (T2T) module to inject Fig. 3 Framework of TNT. Reproduced with permission from
spatial information into the tokenization of image Ref. [16],
c The Author(s) 2021.
Transformers in computational visual media: A survey 37

Table 2 Classiﬁcation accuracy on ImageNet [64] for various visual transformers

Method Image size FLOGs (G) #Param (M) Acc (%) Source (GitHub)

Convolution-based neural network

ResNet [1] 2242 4.1 25.6 76.2 —

RegNetY-4G [3] 2242 4.0 21 80.0
facebookresearch/pycls
RegNetY-16G [3] 2242 16.0 84 82.9
EfficientNet-B0 [2] 2242 0.4 5.3 77.1
EfficientNet-B1 [2] 2242 0.7 7.8 79.1
EfficientNet-B3 [2] 3002 1.8 12 81.6 rwightman/gen-efficientnet-pytorch
EfficientNet-B5 [2] 4562 9.9 30 83.6
EfficientNet-B7 [2] 6002 37.0 66 84.3

Visual transformer

3842 55.4 86 77.9

ViT [9] google-research/vision_transformer
3842 190.7 307 76.5
2242 4.6 22 79.8
DeiT [65] facebookresearch/deit
3842 55.4 86 83.1
2
T2T ViT [15] 224 5.2 21.5 80.7 yitu-opensource/T2T-ViT
2242 5.2 23.8 81.3
TNT [16] huawei-noah/noah-research/tree/master/TNT
2242 14.1 65.6 82.8
2242 — 23 81.5
CPVT [17] Meituan-AutoML/CPVT
2242 — 88 82.3
2242 5.4 27 81.3
ConViT [18] —
2242 17 86 82.4
2242 — 27 82.3
DeepViT [19] zhoudaquan/dvit_repo
2242 — 55 83.1
2242 4.5 29 81.3
Swin Transformer [20] microsoft/Swin-Transformer
3842 47.0 88 84.2
2242 4.6 22.1 81.9
PiT [21] naver-ai/pit
2242 12.5 73.8 84.0
LocalViT [22] 2
224 4.6 22.4 80.8 ofsoundof/LocalViT

is initialized as 1 for all layers and all heads in MSA.

With the proposed GPSA module, they manage to
adaptively expand the self-attention receptive field
during training.
3.1.4 CPVT
Chu et al. [17] resort to a novel design of position
embedding module to further reinforce the richness
of learned visual features from ViT. Instead of a
predefined position embedding that is independent
of the input data, they propose a conditional
position embedding scheme to generate different
positional encodings for various input tokens, akin
to dynamic neural network design [66]. In their
implementation, they also rearrange the input
tokens in a spatial manner and apply convolution
operations to extract the position embedding in a
Fig. 4 Framework of ConViT and the gated positional self-attention learnable way. In this way, they also maintain the
mechanism. Reproduced with permission from Ref. [18], c The
Author(s) 2021.
local neighborhood information during tokenization,
benefiting classification performance.
38 Y. Xu, H. Wei, M. Lin, et al.

Two further ViT models, LeViT [12] x and CoaT [67] y, Swin Transformer enhances efficient use of parameters
investigate the importance of position embedding and achieves state-of-the-art object detection and
and propose different implementations. We do not semantic segmentation.
describe them further due to lack of space. Dong et al. [68] also propose another vision transfor-
3.1.5 Swin Transformer mer model, CSWin Transformer, which utilizes a
On the basis of the observations that image data cross-shaped window self-attention mechanism (akin
contain much redundant spatial information and to criss-cross attention [69] or strip pooling [70])
given the success of deep-narrow CNN architectures, and a locally enhanced position encoding. CSWin
Liu et al. [20] propose a novel hierarchical visual Transformer obtains even better performance than
transformer design. Figure 5(a) illustrates the SWin Transformer.
core idea of the window MSA (W-MSA) and the 3.1.6 DeepViT
shifted W-MSA (SW-MSA) within Swin Transformer, Layer scaling (e.g., 152-layer ResNet [1]) is an
which separate local patches into several windows important aspect of CNN architectures. With regard
and run the MSA module window by window. to ViT models, Zhou et al. [19] empirically find that
With the W-MSA mechanism, they reduce the the performance of deep layer ViT models saturates
computation complexity from O(4HW C 2 +2(HW )2C) when we stack more than 20 transformer blocks even
to O(4HW C 2 + 2M 2HW C), where H and W with the help of skip-connection layers. They unveil
represent the size of input patches, M × M is that the reason is attention collapse: the feature
the number of windows, and C is the feature maps extracted from each head in one MSA module
dimension. A shifted window design is also proposed share increasingly similar patterns, leading to huge
to encourage cross-window communication for rich information redundancy and low training efficiency.
visual features. They also propose a deep-narrow If the communication between the MSA heads is
architecture (see Fig. 5(b)). Extensive experiments on promoted, the information redundancy between each
ImageNet, COCO, and ADE-20K demonstrate that head and rich learned visual feature can be reduced.

Fig. 5 (a) Window MSA (W-MSA) greatly reduces computational cost and facilitates communication between each isolated W-MSA.
(b) Overview of Swin Transformer. Reproduced with permission from Ref. [20],
c The Author(s) 2021.

x https://github.com/facebookresearch/LeViT
y https://github.com/mlpc-ucsd/CoaT
Transformers in computational visual media: A survey 39

On the basis of the aforementioned motivation, they 3.1.8 LocalViT

propose a simple and effective Re-Attention module: Li et al. [22] study the differences between ViT

T Q · KT models and CNN architectures. They find that visual
N orm Θ sof tmax √ ·V
d transformers are good at modeling global relations
where Θ ∈ RH×H is a learnable parameter while lacking a local scheme to learn interactions
to facilitate the communication between the H within a local region, which is the characteristic of
heads within one MSA module. Experiments on convolution. A local mechanism is important and
ImageNet [64] verify that a 32-layer ViT model can useful for modeling spatial structures for image data.
be trained without performance saturation with the Thus, they believe that visual transformers must
help of a Re-Attention module. reinforce the model’s capability for local relation
Notably, concurrent work, which is termed modeling to promote the learned visual features
CaiT [71], also investigates the topic of layer scaling from ViT models. Specifically, they investigate
and proposes a different perspective. Further details several possible blocks and then propose local ViT
can be obtained from their paper. (LocalViT), as shown in Fig. 7(right). Experiments
on ImageNet [64] indicate that the LocalViT module
3.1.7 PiT
is a practical local mechanism which boosts the
Considering the importance of the pooling layer to performance of various ViT models [15, 16, 27, 65].
model capability and generalization performance of
CNN architectures, Heo et al. [21] investigate the 3.2 Comparison on ImageNet
possibility of taking advantage of pooling modules We compare the classification accuracies of the latest
in ViT. The pooling layer in a conventional CCN ViT models on the ImageNet benchmark [64] in
architecture conducts spatial information aggregation Table 2, together with their implementation details,
for spatially invariant features. On the basis of namely, #FLOGs, #Param, and source code, to
this observation, they propose to implement spatial facilitate further research. The experimental values
information condensation via depth-wise convolution. indicate that ViT models have potential to achieve
As shown in Fig. 6, they first split the obtained input comparable performance or even outperform state-
tokens into class tokens and spatial ones, and then of-the-art CNN architectures like RegNet [3] and
they recover the spatial shape of the latter. Next, EfficientNet [2], which are based on expert-designed
they leverage a depth-wise convolution operation basic modules and the power of neural architecture
on the spatial branch for the purpose of a pooling search (NAS) techniques. We also observe very
layer. Meanwhile, they apply a fully connected layer recent exciting progress, in that the latest proposed
to project the class token into the same dimension. ViT models possess higher model capability and
With a simple and effective pooling module, they better parameter efficiency for vision than the original
propose a pooling-based ViT (PiT) and achieve an version of ViT.
optimal trade-off between computation efficiency and
classification performance.

Fig. 7 Comparison of the convolutional version of the FFN module in

ViT models (left), inverted residual blocks (center), and the proposed
Fig. 6 Pooling layer in the PiT architecture. Reproduced with module in LocalViT to exploit the beneﬁt of locality in ViT (right).
permission from Ref. [21],
c The Author(s) 2021. Reproduced with permission from Ref. [22], c The Author(s) 2021.
40 Y. Xu, H. Wei, M. Lin, et al.

3.3 Visualization of ViT and key-point detection [80–85]. As the focus of

Visualizing the feature maps in ViT is also an this survey is low-level vision tasks, we only
interesting and worthy research topic. As ViT briefly introduce some interesting works in object
models leverage different basic components from detection. Modern detection methods address the
set prediction task by defining a large set of propos-
CNN models, we should adopt different visualization
als [86, 87], anchors [88], or window centers [89, 90].
methods correspondingly. As shown in Fig. 8, the
Unlike previous attempts [91–96], transformer-based
latest tools specialized for MSA modules and ViT
detection raises the possibility of total anchor-free
models, namely, partial LRP [63] and Transformer-
and end-to-end models. We begin with the stream
Explainability [23]x , can generate better results for
of DETR [24], followed by Deformable DETR [25]
feature map visualization than the visualization
and UP-DETR [26]. A more complete approach,
methods for CNN. The visualizations indicate that
PVT [27], which is the earliest transformer backbone
ViT models can learn additional meaningful spatial
for dense prediction tasks like detection, is also
information with image-level annotations alone.
introduced. Additional recent high-level backbones like
Therefore, ViT models have potential values in weakly
Swin Transformer [20] and Twins [97] are introduced
supervision scenarios, such as weakly supervised
in Section 3. A comparison is provided in Table 3.
object detection.
4.1 DETR
Carion et al. [24] were the first to provide a completely
4 High-level vision
end-to-end detection model based on the transformer
In this section, we focus on representative recent high- encoder–decoder architecture. It gives researchers a
level vision tasks based on transformer framework. new insight that the transformer architecture can
High-level vision refers to stages of visual processing achieve state-of-the-art performance in detection.
that transition from analyzing local image structure Unlike previous detection models, DETR does not
to exploring the structure of the external world that rely on artificially designed anchors. The overall
produced those images. The main tasks include structure is illustrated in Fig. 9. The transformer
object detection [24–26], segmentation [28, 29, 74–79], encoders are arranged after a convolution feature

Fig. 8 Class-speciﬁc visualization results from ViT. Left to right: input image, rollout [62], raw-attention, GradCAM [72], LRP [73], partial
LRP [63], and Transformer-Explainability [23]. Reproduced with permission from Ref. [23], c The Author(s) 2021.

x https://github.com/hila-chefer/Transformer-Explainability
Transformers in computational visual media: A survey 41

Table 3 Comparison of transformer-based detection models on the COCO 2017 val set

Training #Param FLOPS Source

Method AP AP50 AP75 APS APM APL FPS
epochs (M) (G) (GitHub)

Convolution-based models

FCOS [90] 36 41.0 59.8 44.1 26.2 44.6 52.2 — 23 177 tianzhi0549/FCO
Faster R-CNN+FPN [86] 109 42.0 62.1 45.5 26.6 45.4 53.4 42 26 180 rbgirshick/py-faster-rcnn

Transformer-based models
DETR [24] 500 42.0 62.4 44.2 20.5 45.8 61.1 41 28 86
facebookresearch/detr
DETR-DC5 [24] 500 43.3 63.1 45.9 22.5 47.3 61.1 41 12 187
Deformable DETR [25] 50 43.8 62.6 47.7 26.4 47.1 58.0 40 19 173 fundamentalvision/Deformable-DETR
UP-DETR [26] 150 40.5 60.8 42.6 19.0 44.1 60.0 41 — —
dddzg/up-detr
UP-DETR [26] 300 42.8 63.0 45.3 20.8 47.1 61.7 41 — —
PVT-T [27] 300 36.7 56.9 38.9 22.6 38.8 50.0 23.0 — —
whai362/PVT
PVT-M [27] 300 41.9 63.1 44.3 25.0 44.9 57.6 53.9 — —
ViT-B/16-FRCNN [98] 21 37.8 57.4 40.1 17.8 41.4 57.3 21 — — —

injected into all encoder blocks rather than only the

input layer to preserve the positional information;
they claim that high-level vision detection needs
more positional information than classification. Only
queries and keys are also injected into the positional
embedding.
The decoder has two inputs. The first is the object
query, which only serves as queries in the second
MSA layers. The second is the output of the encoder
module, which serves as values and keys for the second
MSA layers. To easily understand the mechanism,
readers can treat the object queries as information of
different target objects and suppose the decoder aims
to find whether the similar patterns to the object
queries exist in the image features. The output of
the decoder is then passed through two branches,
namely, the box and class branches. The box branch
predicts the positions of the target objects, while the
class branch serves to predict the category of each
predicted box.
Fig. 9 Overall structure of DETR. Reproduced with permission
from Ref. [24],
c Springer Nature Switzerland AG 2020. 4.2 Deformable DETR
Although considerable progress has been achieved
extractor. We introduce the encoder and decoder by DETR in transformer-based detection, its main
modules in order. deficiency is its huge computational cost. Training
In the DETR encoder, first, the output feature map a DETR on one V100 GPU is reported to take 48
of CNN is decomposed into patches as for ViT [9] days, which is unaffordable for common institutions.
introduced in Section 3. Then, the patches are Thus, Zhu et al. [25] propose the deformable self-
mapped to one-dimensional vectors to go through attention module to reduce the training time to 340
several traditional transformer encoders. DETR GPU hours at the same time as improving the original
and traditional BERT encoders differ only in the performance. The core idea of deformable attention
positional embedding. The positional embedding is is to find the nearest K values of an input query to
42 Y. Xu, H. Wei, M. Lin, et al.

calculate attention. Nearest here refers to semantic the query patch. This pretraining method leads to
distance rather than spatial distance. Deformable more flexible training. As shown in Fig. 11(b), a more
attention is illustrated in Fig. 10, which is drawn robust representation can be learned after augmenting
from deformable convolution [99]. A linear model is the random query patches.
established to learn the offsets of the nearest K values,
4.4 PVT
and then another linear model is established to learn
the attention score of each value. In summary, the Diverging from DETR, Wang et al. [27] also
main contributions of deformable attention module propose a pure transformer-based backbone, called
are (i) only K corresponding values rather than all a pyramid vision transformer (PVT), for detection
values are required to calculate the attention of one and segmentation. Its framework is shown in Fig. 12.
query and (ii) the attention scores are learned by After each stage, the output is rearranged to recover
a network rather than by simple multiplication of spatial structure and is then down-sampled to half
queries and keys. resolution. Notably, the spatial reduction is only
conducted on the key K and value V while the
4.3 UP-DETR
spatial size of the query Q is maintained. In practice,
Dai et al. [26] propose a novel unsupervised pre- the full architecture of PVT-based detection models
training method called random query patch detection includes a PVT backbone and a general detection
for DETR [24, 25], which leads to better performance. head, such as RetinaNet [88] and Mask R-CNN [100].
Figure 11(a) illustrates their pretraining method. A Recently, several dense prediction backbones have
random query patch is randomly cropped from an come out after PVT [27], like Swin Transformer [20],
input image. Then, the query patch is added to the CPVT [17], and Twins [97], which we introduced in
object queries of the DETR decoder. The final goal Section 3.
is to predict two things: (i) Lcls , that is, existence
of objects in the query patch, and (ii) Lbox , that
5 Low-level vision and generation
is, the location of the query patch in the image. A
reconstruction loss Lrec is also designed to ensure In this section, we focus on some representative
that the CNN has extracted full information from recent transformer-based works on low-level vision

Fig. 10 Deformable attention module. Reproduced with permission from Ref. [25],
c The Author(s) 2020.
Transformers in computational visual media: A survey 43

Fig. 11 (a) Random single query patch detection of UP-DETR. (b) More robust representation is derived by augmenting the query patch.
Reproduced with permission from Ref. [26],
c The Author(s) 2020.

Fig. 12 Framework of PVT. Reproduced with permission from Ref. [27],

c The Author(s) 2021.

tasks, as listed in Table 4. Low-level vision tasks colorisation [30], text-to-image generation [31], and
include super-resolution [101], denoising [101], image image generation [34, 35]. We separately introduce
how these tasks use transformers to achieve good
Table 4 Source code links for ViT-based models for low-level vision results (see examples in Fig. 13).
tasks
Method Source (GitHub) 5.1 TIME
TIME [31] —
As a pre-trained NLP model is always required for the
IPT [101] —
text-to-image (T2I) task, it may introduce inflexibility
google-research/google-research
ColTran [30] for the whole model. Liu et al. [31] propose an
/tree/master/coltran
TTSR [73] researchmm/TTSR efficient model for T2I tasks: Text and Image
GANsformer [35] — Mutual Translation Adversarial Networks (TIME).
TransGAN [34] VITA-Group/TransGAN TIME can jointly handle T2I and image captioning
DALL·E [32] openai/DALL-E using a single network without a pretrained NLP
VQGAN [102] CompVis/taming-transformers model. As Fig. 14 shows, TIME introduces a multi-
StyTr2 [38] —
head and multi-layer transformer to the generator
PCT [39] Strawberry-Eat-Mango/PCT_Pytorch
and text decoder, which can be used to effectively
44 Y. Xu, H. Wei, M. Lin, et al.

Fig. 13 Representative results for low-level tasks, such as text-to-image generation, basic image processing tasks, colorization, image super
resolution, and image generation. Images are taken from the corresponding papers.

combine image features and the sequence of word sequence of word embeddings ft , and outputs the
embeddings into the output. The Text-Conditioned revised image fit according to the word embeddings.
Image Transformer takes image feature fi and the The Image-Captioning Transformer is similar to the
Transformers in computational visual media: A survey 45

Fig. 14 TIME model overview. Reproduced with permission from Ref. [31],
c Association for the Advancement of Artiﬁcial Intelligence 2021.

Text-Conditioned Image Transformer but for image pretraining procedure solves the problem of task-
captioning [103–105]. T2I and the image captioning specific data limitation. Therefore, Chen et al. [101]
task are jointly trained in the generative adversarial develop a pretrained model for image processing using
network (GAN) manner. TIME achieves state-of-the- the transformer architecture, the Image Processing
art T2I performance without pretraining. Transformer (IPT). The model architecture is shown
DALL·E. Text-to-image generation is a classical in Fig. 15. To adapt to different vision tasks,
generation problem, which needs to construct a Chen et al. [101] design a multi-head and multi-
mapping between two streams. Ramesh et al. [32] tail architecture, which involves three convolutional
propose a transformer-based framework to better layers. The transformer body consists of an encoder
align text and image semantic information. A two- and a decoder described in Ref. [57]. Like the
stage model is applied to model the text and image discriminator in Ref. [34], they split the given features
tokens. They first train a discrete variational into patches and each patch is regarded as a “word”
autoencoder [106] to build 1024 image tokens and before features are input into the transformer body.
adopt 256 BPE-encoded text tokens to represent Unlike the original transformer, they utilize a task-
the text information. Thereafter, an auto-regressive specific embedding as an additional input to the
transformer is used to capture the joint distribution decoder. The model is pretrained on ImageNet, which
of the text and image tokens. They also use a mixed- is a key factor for success.
precision training strategy and PowerSGD [57] to save 5.3 Uformer
GPU memory. The model consumes approximately
24 GB memory in 16-bit precision. Wang et al. [37] propose an effective and efficient
transformer-based architecture for image restoration.
5.2 IPT It uses a transformer module to construct a
Classification models can be pretrained on large- hierarchical encoder–decoder network. Two core
scale datasets to enlarge model representation ability. designs of Uformer make it suitable for image
Related low-level vision tasks such as image super- restoration. The first is a local-enhanced window
resolution, inpainting, and deraining are combined transformer block. Specifically, a nonoverlapping
in a model to help one another. The generalized window-based self-attention is used to reduce the
46 Y. Xu, H. Wei, M. Lin, et al.

Fig. 15 Overview of IPT. Reproduced with permission from Ref. [101],

c The Author(s) 2021.

computational cost, and depth-wise convolution 5.4 TransGAN

is used in the FFN to further improve its Driven by curiosity, Jiang et al. [34] first design
ability to capture local context. The second is the a GAN using pure transformer-based structures
skip-connection mechanism, which is explored to to determine whether transformers perform well
effectively deliver the encoder information to the when applied to generative adversarial networks
decoder. Uformer can capture useful dependencies (GANs) [107]. This network consists of a memory-
for image restoration because of the two designs friendly transformer-based generator and a patch-
above. The network structure of Uformer is shown level discriminator. Jiang et al. [34] also imitate
in Fig. 16. Its performance has been verified through the philosophy in CNN-based GANs and design a
several image restoration tasks, including denoising, novel structure for image generation to avoid the
deraining, and deblurring. high cost when applying transformers from NLP

Fig. 16 (a) Overview of Uformer. (b) Structure of the LeWin transformer block. Reproduced with permission from Ref. [37],
c The Author(s) 2021.
Transformers in computational visual media: A survey 47

Fig. 17 Model overview of TransGAN. Reproduced with permission from Ref. [34],
c The Author(s) 2021.

to visual tasks. As shown in Fig. 17(left), the

memory-friendly transformer-based generator has
multiple stages, thus increasing the feature resolution
while decreasing the embedding dimension. The
discriminator splits the generated images into small
patches and regards them as “words”. The tokens
are taken by the classification head to output the
real/fake prediction. The whole net is trained with
three ingenious strategies: data augmentation, self-
supervised auxiliary task (super task) cooperative
training, and locality-aware initialization. The results
in CIFAR-10 and STL-10 are comparable to those of
some state-of-the-art works using CNN-based GANs.
5.5 TTSR
Texture is often damaged during downsampling and
also cannot be easily recovered. Traditional single
image super-resolution always leads to blurring effects
in the output. Therefore, Yang et al. [33] propose Fig. 18 Model overview of TTSR. Reproduced with permission from
Ref. [33],
c IEEE 2020.
a reference-based image super resolution method,
namely, the Texture Transformer Network for Image
Super Resolution (TTSR). As shown in Fig. 18, the Hard-Attention and a Soft-Attention, and it is
Learnable Texture Extractor is first used to extract applied to the high-resolution feature guided by the
proper texture information, which is crucial for super reference image. Finally, they propose a cross-scale
resolution. Then, the input to the texture transformer feature integration module to exchange information
can be expressed as follows: between the features at different scales for better
Q = LTE(LR ↑) representation at different scales.
K = LTE(Ref ↓↑) 5.6 ColTran
V = LTE(Ref) Image colorization is a challenging task that needs
where LR↑, Ref, and Ref↓↑ denote the image to be to determine the image semantics. Most colorization
reconstructed, the reference image, and the reference models estimate log-likelihood based on neural
image that is down-sampled and then up-sampled generative approaches. Kumar et al. [30] propose
respectively. The texture transformer contains a the Colorization Transformer (ColTran) using a
48 Y. Xu, H. Wei, M. Lin, et al.

self-attention mechanism to promote the effects around relational attention and dynamic interaction.
of a probabilistic colorization model. ColTran They propose a Bipartite Transformer to eliminate
replaces self-attention blocks with axial self-attention, the limitation of huge computational complexity
which decreases √ the computational complexity from of self-attention of transformers. Unlike the self-
2
O(D ) to O(D D). Kumar et al. [30] adopt a attention operator which considers all pairwise
conditional variant of the Axial Transformer [108] relations between input elements, the Bipartite
for low-resolution coarse colorization. As shown in Transformer generalizes this formulation by featuring
Fig. 19, the ColTran core consists of Conditional a bipartite graph between two groups of variables
Self-Attention, MLP, and Layer Norm modules, and (latent and image features) instead. As shown in
it applies conditioning to the auto-regressive core. Fig. 20, simplex attention distributes information in a
They also design a Color Upsampler and Spatial single direction over the Bipartite Transformer, while
Upsampler to produce high-fidelity colorized images Duplex attention supports bidirectional interaction
from low resolution results. The Color Upsampler between the elements. The bipartite structure makes
converts the coarse image of 512 colors back into a a good balance between expressiveness and efficiency,
3-bit RGB image with 8 symbols per channel. The and it constructs the interaction between latent and
Spatial Upsampler generates colorized images with visual features to generate good results.
high resolution. ColTran can handle grayscale images 5.8 StyTr2
of 256 × 256 pixels.
Considering the limited receptive fields of CNNs,
5.7 GANsformer obtaining global information about input images is
The cognitive science literature talks about two mech- difficult but is critical for the image style transfer
anisms by which human perception interacts, namely, task. The content leak problem also occurs when
bottom–up and top–down processing. Previous vision CNN-based models are adopted for style transfer.
tasks using CNNs do not reflect this bidirectional Therefore, Deng et al. [38] propose the first transformer-
nature because the local receptive field reduces their based style transfer model using the ability for long-
ability to model long-range dependencies. Therefore, range extraction (Fig. 21). The unbiased Style
Hudson and Zitnick [35] aim to design a transformer Transfer Transformer framework StyTr2 contains
network with a highly adaptive architecture centered two transformer encoders to obtain domain-specific

Fig. 19 Overview of colorization transformer. Reproduced with permission from Ref. [30],
c The Author(s) 2021.
Transformers in computational visual media: A survey 49

The number of codes in the codebook is 512–4096

according to the dataset. The model can synthesize
the results containing 1280 × 460 pixels.

5.10 PCT
Unlike CNNs, transformers are inherently permuta-
Fig. 20 Overview of the GANsformer framework. Reproduced with tion invariant when processing a series of points and
permission from Ref. [35],
c The Author(s) 2021. are thus suitable for point cloud learning. Guo et
al. [39] propose a state-of-the-art transformer-based
information. Following encoding, a multilayer trans- point cloud model based on offset-attention with an
former decoder generates the output sequences. implicit Laplace operator. They enhance the input
Moreover, Deng et al. [38] propose a content-aware embedding based on farthest point sampling and
mechanism to learn the positional encoding based on nearest neighbor search to better capture the local
image semantic features and dynamically expand the context in the point cloud.
position to suit different image sizes.
5.9 VQGAN
6 Multimodal learning
High-resolution image synthesis is a difficult genera-
tion problem which aims to generate high-fidelity The above sections cover developments in conven-
images within a reasonable time. Convolutional tional computer vision. Apart from pure vision
approaches exploit the local structure of the image, tasks, transformer-based models have also achieved
while transformer methods are good at establishing promising progress in language and vision multimodal
long-range interactions. Esser et al. [102] utilize the tasks, such as visual question answering (VQA) [109,
advantages of CNNs and transformers to build a high- 110], image captioning [111], and image retrieval [112],
resolution image generation framework. They propose due to the high performance achieved by the NLP
a variant of VQVAE [36] and adopt adversarial transformers. Transformer-based vision-language
learning to achieve vivid results. The content hidden (V+L) approaches are often pretrained on multiple
space consists of a discrete codebook, and different tasks and fine-tuned on diverse downstream sub-tasks.
codes in the codebook are combined according Inputs of different modalities share the analogous
to a certain probability to represent the content single- or two-stream architecture.
information. The key to sampling in a discrete In this section, we start from recently representative
space is to predict the distribution of discrete codes, transformer-based works on V+L tasks with different
and the transformer can deal with the issue. Given frameworks (Section 6.1), and then summarise
the first i codes, the transformer module is used to pretraining objectives (Section 6.2) and compare
predict the probability of occurrence of the i-th code. details (Section 6.3).

Fig. 21 Overview of StyTr2 . Reproduced with permission from Ref. [38],

c The Author(s) 2021.
50 Y. Xu, H. Wei, M. Lin, et al.

6.1 Transformer-based V+L works depth for each modality and enables cross-modal
Most transformer-based V+L works are based on two connections at different depths.
kinds of structures: the two-stream (each stream 6.1.2 UNITER
for a single modality) framework or the single- Chen et al. [41] propose UNITER: UNiversal Image-
stream (common stream for jointly learning cross- TExt Representation. It can power heterogeneous
modal representation) framework. ViLBERT [40] and downstream V+L tasks with joint multimodal
UNITER [41] are representative works for two- and embeddings. As shown in Fig. 24, UNITER first
single-stream frameworks, respectively. Meanwhile, encodes image regions (visual features and bounding
SemVLP [42] unifies the two mainstream architectures box features) and textual words (tokens and positions)
for aligning the cross-modal semantics. into a shared embedding space with image and text
6.1.1 ViLBERT embedders. Then, UNITER applies a transformer
ViLBERT [40] is a representative two-stream module to learn the joint embedding of the two
transformer-based model for V+L. Two separate modalities through designed pretraining tasks that
streams are used for vision and language processing. include classic image–text matching (ITM), masked
Figure 23 shows the architecture of ViLBERT. Two language modeling (MLM), and masked region
parallel BERT-style models operate on image regions modeling (MRM). UNITER uses conditional masking
and text tokens. Each stream connects a series on MLM and MRM, which means masking only one
of transformer blocks (TRM) and co-attentional modality while keeping the other untainted. A novel
transformer layers (Co-TRM). As shown in Fig. 22, word–region alignment pretraining task via optimal
the Co-TRM layers enable information exchange transport is also proposed to encourage fine-grained
between modalities, and the modified attention alignment between words and image regions. The
mechanism is the key technical innovation. By authors consider the matching of word tokens and
exchanging key–value pairs in multi-headed attention, RoI regions as minimizing the distance of two discrete
the Co-TRM structure allows for variable network distributions, where the distance is computed based
on optimal transport. UNITER, as a single-stream
model, achieved state-of-the-art performance when
proposed. ViLLA [113], which combines UNITER
and adversarial training, achieves higher performance.
6.1.3 SemVLP
Li et al. [42] present a novel V+L framework, SemVLP.
It unifies both mainstream architectures. By fusing
single- and two-stream architectures, SemVLP utilizes
cross-modal semantics. Its framework is detailed
in Fig. 25. On the basis of a shared bidirectional
transformer encoder with cross-modal attention
module, SemVLP can encode the input text and
Fig. 22 (a) Architecture of a standard encoder transformer block. image into different semantics. It adopts common
(b) Co-attention transformer layer in ViLBERT. Reproduced with
permission from Ref. [40],
c The Author(s) 2019. pretraining methods with a special training strategy:
single- and two-stream frameworks are updated in

Fig. 23 Overview of ViLBERT. Reproduced with permission from Ref. [40],

c The Author(s) 2019.
Transformers in computational visual media: A survey 51

Fig. 24 Overview of UNITER. Reproduced with permission from Ref. [41],

c Springer Nature Switzerland AG 2020.

Fig. 25 Overview of SemVLP. Reproduced with permission from Ref. [42],

c The Author(s) 2021.

each half of the training time for each mini-batch of objectives for multimodal learning. In this section,
image–text pairs. we briefly introduce pretraining tasks extended from
6.2 Multimodal pretraining BERT. These extended approaches include MLM,
masked region modeling (MRM), and image–text
Designing reasonable pretraining objectives for
matching (ITM). We also list other specially designed
transformer-based models, such as masked language
pretraining tasks for multimodal learning.
modeling (MLM) and next sentence classification
from BERT, has brought excellent results on NLP 6.2.1 Masked language modeling
tasks. These methods also work in the cross-modal Most recent V+L works follow BERT in using
field with V+L. The key challenge is the way to MLM for cross-modal tasks. UNITER modifies
replicate or extend large-scale pretraining to cross- MLM by introducing visual information. Specifically,
modal methods and to design novel pretraining UNITER attempts to predict masked words based on
52 Y. Xu, H. Wei, M. Lin, et al.

observation of the surrounding words and all image 6.2.4 Other designs for V+L
regions. InterBERT [114] changes MLM to masked Some models are also trained with unique, newly
segment modeling. In the case of using a random designed pretraining strategies. In Oscar [121], each
word to replace the selected word, masked segment image–text pair is defined as a triple and thus consists
modeling masks a continuous segment of text instead of a word sequence, a set of object tags, and a set of
of random words. image region features. Therefore, in addition to MLM
6.2.2 Image–text matching on words and object tags, Oscar uses a contrastive
For another pretraining task of BERT, next sentence loss to encourage the model to distinguish the original
classification has been converted to an ITM problem, and modified triple. By differently using contrastive
which determines whether a pair of sentence and learning, UNIMO creates image–text pairs by a novel
image regions match. This task is widely used in text rewriting method. ERNIE-ViL [122] introduces
advanced V+L works. InterBERT [114] performs a scene graph to design advanced pretrained tasks,
ITM with hard negatives by regarding the image– including object prediction, attribute prediction, and
text pairs in the dataset as positive samples, pairing relationship prediction. Li et al. [123] add masked
the images with uncorrelated texts, and regarding sentence generation to optimize their model: a cross-
the pairs as negative samples. VL-BERT [115] and modal decoder is taught to autoregressively decode
UnifiedVLP [116] also do not use ITM, tending to the input sentence word-by-word conditioned on
use other efficient choices like MRM introduced next. the input image. Training directly on downstream
tasks like QA is also used in LXMERT [124] and
6.2.3 Masked region modeling
SemVLP [42].
The existing masking method, MRM, is the dual
task of MLM. MLM can be easily applied to visual 6.3 Comparisons and implementation de-
input. Some researchers have proposed several novel tails
pretraining methods by masking input visual tokens Table 5 details implementations and open source. The
to extend masked modeling to vision. Masked MSCOCO dataset [111], maintained by Microsoft, is
region feature regression (MRFR) is one of these widely used in multiple tasks like object detection. The
approaches applied by ViLBERT [40]. ViLBERT Conceptual Captions Dataset (CC) [125] is provided
trains the model to regress the masked input RoI by Google AI and consists of nearly 3.3 million
pooled feature, which is extracted by Faster R- images annotated with captions harvested from the
CNN [86]. Most models perform optimization with web. The SBU Captions Dataset [126] includes
L2 loss. VL-BERT [115] also follows MRFR instead image captions collected from 1 million images from
of using ITM. It uses masked RoI classification with Flickrx . MSCOCO, CC, and SBU all can be used
linguistic clues, predicting the category label of the for image caption tasks. The Visual Genome [127]
masked RoI obtained by Fast R-CNN [117] from is a dataset including images and image content
the other clues. On the contrary, some models semantic information. Visual Genome, VQA [109],
choose masked region classification, which lets the VQAv2 [110], and GQA [128] datasets can all be used
model predict the object semantic class for each for VQA pretraining. Notably, partial datasets are
masked region. Models are often optimized by used as benchmarks simultaneously. Table 6 shows
cross-entropy loss or KL-divergence to learn the the performance of models reported above on different
class distribution. These MRM tasks are performed V+L benchmark datasets. The results are obtained
in UNITER and UNIMO [118]. InterBERT [114] by models fine-tuned on the corresponding datasets.
also changes MRM strategy in the visual modality
by masking objects which have a high proportion 7 Conclusions and discussion
of mutual intersection with zero vectors to avoid
7.1 Backbone design
information leakage due to overlap between objects.
Notably, earlier transformer-based works, such as Section 3 describes several recent developments in
VisualBERT [119] and B2T2 [120], do not extend the backbone design of visual transformers, including
MLM to the visual domain. x https://www.flickr.com
Transformers in computational visual media: A survey 53

Table 5 Model setting in various papers. COCO refers to MS COCO [129], CC to Conceptual Captions [125], VG to Visual Genome [127],
SBU to SBU captions [126], and OI to OpenImages [130]

Model Dataset(s) for pre-training Params Batch size Hard-aware Source (GitHub)

ViLBERT [40] CC 221M 512 8 TitanX jiasenlu/vilbert_beta

VL-BERT [115] CC 110M 256 16 V100 jackroos/VL-BERT
UNITER [41] CC/COCO/VG/SBU 110M Dynamic 16 V100 ChenRocks/UNITER
Oscar [121] CC/COCO/VG/SBU/Flicker30K/VQA/GQA 110M 512 — microsoft/Oscar
VILLA [113] CC/COCO/VG/SBU — Task-speciﬁc — zhegan27/VILLA
ERNIE-ViL [122] CC/SBU 210M 512 8 V100 PaddlePaddle/ERNIE
UNIMO [118] CC/COCO/VG/SBU — — — weili-baidu/UNIMO
VinVL [131] CC/COCO/VG/SBU/Flicker30K/VQA/GQA/OI — 1024 — pzzhang/VinVL
TDEN [123] CC — 1024 16 P40 YehLi/TDEN
UniT [132] COCO/VG/VQAv2 — 64 64 V100 —
SemVLP [42] CC/COCO/VG/SBU/VQAv2/GQA 140M 256 4 V100 —

Table 6 Comparison of transformer-based V+L models on VQA [109], GQA [128], Flickr30K [112], CoCo Caption [111], NLVR2 [133],
SNLI-VE [134], VCR [135], and RefCOCO+ [136] benchmarks

VQA GQA IR-Flickr30K TR-Flickr30K CoCo Caption NLVR2 SNLI-VE VCR RefCOCO+
test-dev test-std test-dev test-std R@1 R@5 R@10 R@1 R@5 R@10 BLUE4 CIDEr dev test-P val test Q/A QA/R Q/AR val testA testB
ViLBERT [40] 70.55 70.92 — — 58.20 84.90 91.52 — — — — — — — — — 73.3 74.6 54.8 72.34 78.52 58.20
VL-BERT [115] 71.79 72.22 — — — — — — — — — — — — — — 75.8 78.4 59.7 80.31 83.62 75.45
UNITER [41] 73.82 74.02 — — 75.56 94.08 96.76 87.3 98.0 99.2 — — 79.12 79.98 79.39 79.38 77.3 80.8 62.8 84.25 86.34 79.75
Oscar [121] 73.61 73.82 61.58 61.62 — — — — — — 41.7 140.0 79.12 80.37 — — — — — 84.40 86.22 80.00
VILLA [113] 74.69 74.87 76.26 94.24 96.84 87.9 97.5 98.8 — — 79.76 81.47 80.18 80.02 78.9 79.1 60.6 84.40 86.22 80.00
ERNIE-ViL [122] 74.95 75.10 — — 76.66 94.16 96.76 89.2 98.5 99.2 — — — — — — 79.2 83.5 66.3 75.89 82.37 66.91
UNIMO [118] 73.79 74.02 — — — — — — — — 38.6 124.1 — — 80.00 79.10 — — — — — —
VinVL [131] 76.52 76.60 65.05 64.65 75.40 92.90 93.30 58.8 83.5 90.3 41.0 140.9 82.67 83.98 — — — — — — — —
TDEN [123] 72.50 72.80 — — — — — — — — 40.2 133.4 — — — — 75.7 76.4 58.0 — — —
SemVLP [42] 74.52 74.68 62.87 63.62 74.80 93.43 96.12 87.7 98.2 99.3 — — 79.00 79.55 — — — — — — — —

feature map visualization approaches. Recent pro- transformer models, and making the transformer
gress can be technically divided into two main more computationally efficient, are of interest.
streams: (i) enhancing the capability of visual The versatility of ViT models in additional
transformers in modeling spatial structure and real-world scenarios, such as aesthetic visual
locality mechanism, such as a better image-to-token analysis [137–139], face anti-spoofing [140, 141],
module, a pixel-level transformer block, a depth- and point cloud learning [142], is also worthy of
wise convolution-based pooling layer, and an SW- exploration.
MSA module, and (ii) boosting the richness of • The transformer block can be placed in the
learned visual features and promoting efficient use perspective of NAS. One of the goals of the
of parameters, such as conditional position encoding, NAS framework [143–145] is to search for optimal
a message communication scheme between the MSA network architectures for a given task without
heads, and deep-narrow ViT architectures. human intervention. Interesting architectures can
As the first visual transformer was proposed very be considered and practical insights for further
recently (October 2020), we believe that the potential developments can be gained by building on a well-
of the ViT model has not been fully exploited and designed search space that contains a transformer
several research topics are worthy of consideration block. Several recent works have investigated this
and effort: topic. Wang et al. [146] and So et al. [147] leverage
• Advanced designs of basic ViT operation or NAS techniques to seek for effective and efficient
modules and the corresponding learning scheme transformer-based architectures automatically. Li
for CV tasks, like injecting prior knowledge of et al. [148] propose a novel scheme, BossNAS, to
image data or the computer vision task into the achieve optimal solutions which trade-off CNN
module design or the learning scheme of visual architecture and transformer blocks.
54 Y. Xu, H. Wei, M. Lin, et al.

• Understanding of the working mechanism and Natural Science Foundation of China under Grant
theoretical rationale of visual transformers can Nos. 61832016 and U20B2070.
be enhanced. Several researchers have achieved
promising progress in unveiling the power of References
transformer models, from such perspectives as
[1] He, K. M.; Zhang, X. Y.; Ren, S. Q.; Sun, J.
information bottlenecks [149, 150] and better
Deep residual learning for image recognition. In:
visualization tools [23, 63].
Proceedings of the IEEE Conference on Computer
7.2 High-level vision Vision and Pattern Recognition, 770–778, 2016.
In Section 4, we introduce several representative [2] Tan, M.; Le, Q. EfficientNet: Rethinking model
works on object detection. The basic logic follows scaling for convolutional neural networks. In:
the line of DETR [24]. PVT [27], which is a general Proceedings of the 36th International Conference on
Machine Learning, 2019.
backbone for dense prediction, is also introduced.
Several problems still need to be addressed despite [3] Radosavovic, I.; Kosaraju, R. P.; Girshick, R.; He,
K. M.; Dollár, P. Designing network design spaces.
improvements brought by these works. Unlike
In: Proceedings of the IEEE/CVF Conference on
CNN-based methods, such as Faster-RCNN [86],
Computer Vision and Pattern Recognition, 10425–
current transformers for dense prediction tasks suffer
10433, 2020.
from high computation time. Thus, efficiency of
[4] Yin, M. H.; Yao, Z. L.; Cao, Y.; Li, X.; Zhang, Z.; Lin,
transformers for high-level vision remains a pressing
S.; Hu, H. Disentangled non-local neural networks.
research direction.
In: Computer Vision–ECCV 2020. Lecture Notes in
7.3 Low-level vision and generation Computer Science, Vol. 12360. Vedaldi, A.; Bischof,
In Section 5, we introduce some low-level vision H.; Brox, T.; Frahm, J. M. Eds. Springer Cham, 191–
and image generation tasks using transformer-based 207, 2020.
models. They can achieve outstanding results but [5] Hu, H.; Gu, J. Y.; Zhang, Z.; Dai, J. F.; Wei,
have difficulty generating large images. Therefore, Y. C. Relation networks for object detection. In:
extending a pure transformer with CNN layers is Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 3588–3597, 2018.
widely adopted by many works. A pure transformer
structure still faces the challenge of high computation [6] Wang, X. L.; Girshick, R.; Gupta, A.; He, K. M.
Non-local neural networks. In: Proceedings of the
time.
IEEE/CVF Conference on Computer Vision and
7.4 Multimodal learning Pattern Recognition, 7794–7803, 2018.
In Section 6, we introduce several representative [7] Hu, H.; Zhang, Z.; Xie, Z. D.; Lin, S. Local relation
transformer-based models proposed in the past 2 networks for image recognition. In: Proceedings of the
years for vision and language tasks. We also review IEEE/CVF International Conference on Computer
mainstream pretraining tasks in the V+L field. Vision, 3463–3472, 2019.
Meanwhile, transformer-based models have succeeded [8] Yuan, Y.; Huang, L.; Guo, J.; Zhang, C.; Chen, X.;
for the tasks listed in Table 6, but performance can Wang, J. OCNet: Object context network for scene
still be improved: parsing. arXiv preprint arXiv:1809.00916, 2018.
• Pure transformers may be an alternative choice [9] Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weis-
for the image mode. senborn, D.; Zhai, X.; Unterthiner, T.; Dehghani,
• Design of efficient pretraining tasks can lead to M.; Minderer, M.; Heigold, G.; Gelly, S.; Uszkoreit,
better results and performance. J.; Houlsby, N. An image is worth 16x16 words:
Transformers for image recognition at scale. In:
Proceedings of the International Conference on
Acknowledgements Learning Representations, 2021.
We thank the anonymous reviewers for their [10] Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K.
valuable comments. This work was supported BERT: Pre-training of deep bidirectional transformers
by National Key R&D Program of China under for language understanding. In: Proceedings of the
Grant No. 2020AAA0106200, and by National Conference of the North American Chapter of the
Transformers in computational visual media: A survey 55

Association for Computational Linguistics: Human [24] Carion, N.; Massa, F.; Synnaeve, G.; Usunier,
Language Technologies, 4171–4186, 2019. N.; Kirillov, A.; Zagoruyko, S. End-to-end object
[11] Chen, M.; Radford, A.; Child, R.; Wu, J.; Jun, detection with transformers. In: Computer Vision–
H.; Luan, D.; Sutskever, I. Generative pretraining ECCV 2020. Lecture Notes in Computer Science, Vol.
from pixels. In: Proceedings of the 37th International 12346. Vedaldi, A.; Bischof, H.; Brox, T.; Frahm, J.
Conference on Machine Learning, 1691–1703, 2020. M. Eds. Springer Cham, 213–229, 2020.
[12] Graham, B.; El-Nouby, A.; Touvron, H.; Stock, P.; [25] Zhu, X. Z.; Su, W. J.; Lu, L. W.; Li, B.; Dai,
Joulin, A.; Jégou, H.; Douze, M. LeViT: A vision J. F. Deformable DETR: Deformable transformers
transformer in ConvNet’s clothing for faster inference. for end-to-end object detection. arXiv preprint
arXiv preprint arXiv:2104.01136, 2021. arXiv:2010.04159, 2020.
[13] Tay, Y.; Dehghani, M.; Bahri, D.; Metzler, D. [26] Dai, Z. G.; Cai, B. L.; Lin, Y. G.; Chen, J. Y. UP-
Efficient transformers: A survey. arXiv preprint DETR: Unsupervised pre-training for object detection
arXiv:2009.06732, 2020. with transformers. arXiv preprint arXiv:2011.09094,
[14] Liang, J.; Hu, D.; He, R.; Feng, J. Distill and fine- 2020.
tune: Effective adaptation from a black-box source [27] Wang, W.; Xie, E.; Li, X.; Fan, D.-P.; Song, K.;
model. arXiv preprint arXiv:2104.01539, 2021. Liang, D.; Lu, T.; Luo, P.; Shao. L. Pyramid
[15] Yuan, L.; Chen, Y.; Wang, T.; Yu, W.; Shi, Y.; Tay, F. vision transformer: A versatile backbone for dense
E.; Feng, J.; Yan, S. Tokens-to-Token ViT: Training prediction without convolutions. arXiv preprint
vision transformers from scratch on ImageNet. arXiv arXiv:2102.12122, 2021.
preprint arXiv:2101.11986, 2021. [28] Wang, Y.; Xu, Z.; Wang, X.; Shen, C.; Cheng,
[16] Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; B.; Shen, H.; Xia, H. End-to-end video instance
Wang, Y. Transformer in transformer. arXiv preprint segmentation with transformers. In: Proceedings of
arXiv:2103.00112, 2021. the IEEE/CVF Conference on Computer Vision and
[17] Chu, X. X.; Tian, Z.; Zhang, B.; Wang, X. L.; Wei, Pattern Recognition, 8741–8750, 2021.
X. L.; Xia, H. X.; Shen, C. Conditional positional
[29] Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez,
encodings for vision transformers. arXiv preprint
J. M.; Luo, P. SegFormer: Simple and efficient design
arXiv:2102.10882, 2021.
for semantic segmentation with transformers. arXiv
[18] D’Ascoli, S.; Touvron, H.; Leavitt, M. L.; Morcos, A. preprint arXiv:2105.15203, 2021.
S.; Biroli, G.; Sagun, L. ConViT: Improving vision
[30] Kumar, M.; Weissenborn, D.; Kalchbrenner, N.
transformers with soft convolutional inductive biases.
Colorization transformer. In: Proceedings of the 9th
In: Proceedings of the 38th International Conference
International Conference on Learning Representations,
on Machine Learning, 2286–2296, 2021.
2021.
[19] Zhou, D.; Kang, B.; Jin, X.; Yang, L.; Lian, X.;
[31] Liu, B. C.; Song, K. P.; Zhu, Y. Z.; de Melo,
Hou, Q.; Feng, J. DeepViT: Towards deeper vision
G.; Elgammal, A. TIME: Text and image mutual-
transformer. arXiv preprint arXiv:2103.11886, 2021.
translation adversarial networks. In: Proceedings of
[20] Liu, Z.; Lin, Y. T.; Cao, Y.; Hu, H.; Guo, B. N. Swin
the 35th AAAI Conference on Artificial Intelligence,
transformer: Hierarchical vision transformer using
2082–2090, 2021.
shifted windows. arXiv preprint arXiv:2103.14030,
[32] Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.;
2021.
Radford, A.; Chen, M.; Sutskever, I. Zero-shot text-
[21] Heo, B.; Yun, S.; Han, D.; Chun, S.; Oh, S. J.
to-image generation. arXiv preprint arXiv:2102.12092,
Rethinking spatial dimensions of vision transformers.
2021.
arXiv preprint arXiv:2103.16302, 2021.
[22] Li, Y. W.; Zhang, K.; Cao, J. Z.; Timofte, R.; Gool, L. [33] Yang, F. Z.; Yang, H.; Fu, J. L.; Lu, H. T.;
V. LocalViT: Bringing locality to vision transformers. Guo, B. N. Learning texture transformer network
arXiv preprint arXiv:2104.05707, 2021. for image super-resolution. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and
[23] Chefer, H.; Gur, S.; Wolf, L. Transformer
interpretability beyond attention visualization. In: Pattern Recognition, 5790–5799, 2020.
Proceedings of the IEEE/CVF Conference on [34] Jiang, Y. F.; Chang, S. Y.; Wang, Z. Y. TransGAN:
Computer Vision and Pattern Recognition, 782–791, Two transformers can make one strong GAN. arXiv
2021. preprint arXiv:2102.07074, 2021.
56 Y. Xu, H. Wei, M. Lin, et al.

[35] Hudson, D. A.; Zitnick, C. L. Generative adversarial [47] Dong, C.; Loy, C. C.; He, K. M.; Tang, X. O. Image
transformers. arXiv preprint arXiv:2103.01209, 2021. super-resolution using deep convolutional networks.
[36] Van den Oord, A.; Vinyals, O.; Kavukcuoglu, IEEE Transactions on Pattern Analysis and Machine
K. Neural discrete representation learning. In: Intelligence Vol. 38, No. 2, 295–307, 2016.
Proceedings of the 31st International Conference on [48] Zhang, Y. L.; Tian, Y. P.; Kong, Y.; Zhong, B. N.; Fu,
Neural Information Processing Systems, 6309–6318, Y. Residual dense network for image super-resolution.
2017. In: Proceedings of the IEEE/CVF Conference on
[37] Wang, Z.; Cun, X.; Bao, J.; Liu, J. Uformer: A Computer Vision and Pattern Recognition, 2472–2481,
general U-shaped transformer for image restoration. 2018.
arXiv preprint arXiv:2106.03106, 2021. [49] Haris, M.; Shakhnarovich, G.; Ukita, N. Deep
[38] Deng, Y. Y.; Tang, F.; Pan, X. J.; Dong, W. M.; back-projection networks for super-resolution. In:
Xu, C. S. StyTr2: Unbiased image style transfer with Proceedings of the IEEE/CVF Conference on
transformers. arXiv preprint arXiv:2105.14576, 2021. Computer Vision and Pattern Recognition, 1664–1673,
[39] Guo, M.-H.; Cai, J.-X.; Liu, Z.-N.; Mu, T.-J.; Martin, 2018.
R. R.; Hu, S.-M. PCT: Point cloud transformer. [50] Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.;
Computational Visual Media Vol. 7, No. 2, 187–199, Sutskever, I.; Abbeel, P. InfoGAN: Interpretable
2021. representation learning by information maximiz-
[40] Lu, J.; Batra, D.; Parikh, D.; Lee, S. ViLBERT: Pre- ing generative adversarial nets. arXiv preprint
training task-agnostic visiolinguistic representations arXiv:1606.03657, 2016.
for vision-and-language tasks. In: Proceedings of the [51] Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung,
33rd Conference on Neural Information Processing V.; Radford, A.; Chen, X. Improved techniques
Systems, 13–23, 2019. for training GANs. arXiv preprint arXiv:1606.03498,
[41] Chen, Y.-C.; Li, L. J.; Yu, L. C.; El Kholy, A.; 2016.
Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. UNITER: [52] Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler,
UNiversal image-TExt representation learning. In: B.; Hochreiter, S. GANs trained by a two time-scale
Computer Vision–ECCV 2020. Lecture Notes in update rule converge to a local Nash equilibrium.
Computer Science, Vol. 12375. Vedaldi, A.; Bischof, arXiv preprint arXiv:1706.08500, 2017.
H.; Brox, T.; Frahm, J. M. Eds. Springer Cham, 104– [53] Karras, T.; Laine, S.; Aila, T. M. A style-based gener-
120, 2020. ator architecture for generative adversarial networks.
[42] Li, C. L.; Yan, M.; Xu, H. Y.; Luo, F. L.; In: Proceedings of the IEEE/CVF Conference on
Huang, S. F. SemVLP: Vision-language pre-training Computer Vision and Pattern Recognition, 4396–4405,
by aligning semantics at multiple levels. arXiv preprint 2019.
arXiv:2103.07829, 2021. [54] Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.;
[43] Zhang, R.; Isola, P.; Efros, A. A. Colorful image Courville, A. Improved training of wasserstein GANs.
colorization. In: Computer Vision–ECCV 2016. arXiv preprint arXiv:1704.00028, 2017.
Lecture Notes in Computer Science, Vol. 9907. Leibe, [55] Bebis, G.; Georgiopoulos, M. Feed-forward neural
B.; Matas, J.; Sebe, N.; Welling, M. Eds. Springer networks. IEEE Potentials Vol. 13, No. 4, 27–31, 1994.
Cham, 649–666, 2016. [56] Ba, J. L.; Kiros, J. R.; Hinton, G. E. Layer
[44] Zhang, R.; Zhu, J.-Y.; Isola, P.; Geng, X. Y.; Lin, normalization. arXiv preprint arXiv:1607.06450, 2016.
A. S.; Yu, T. H.; Efros, A. A. Real-time user-guided [57] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.;
image colorization with learned deep priors. arXiv Jones, L.; Gomez, A. N.; Kaiser, L.; Polosukhin, I.
preprint arXiv:1705.02999, 2017. Attention is all you need. In: Proceedings of the
[45] Su, J.-W.; Chu, H.-K.; Huang, J.-B. Instance- 31st International Conference on Neural Information
aware image colorization. In: Proceedings of the Processing Systems, 6000–6010, 2017.
IEEE/CVF Conference on Computer Vision and [58] Hendrycks, D.; Gimpel, K. Gaussian error linear units
Pattern Recognition, 7965–7974, 2020. (GELUs). arXiv preprint arXiv:1606.08415, 2016.
[46] Pang, L.; Lan, Y.; Guo, J.; Xu, J.; Wan, S.; Cheng, X. [59] Kitaev, N.; Kaiser, L.; Levskaya, A. Reformer:
Text matching as image recognition. In: Proceedings The eﬃcient transformer. In: Proceedings of the
of the 30th AAAI Conference on Artiﬁcial Intelligence, International Conference on Learning Representations,
2793–2799, 2016. 2020.
Transformers in computational visual media: A survey 57

[60] Choromanski, K. M.; Likhosherstov, V.; Dohan, D.; [71] Touvron, H.; Cord, M.; Sablayrolles, A.; Synnaeve,
Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, G.; Jegou, H. Going deeper with image transformers.
J. Q.; Mohiuddin, A.; Kaiser, L. et al. Rethinking arXiv preprint arXiv:2103.17239, 2021.
attention with performers. In: Proceedings of the [72] Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.;
International Conference on Learning Representations, Parikh, D.; Batra, D. Grad-CAM: Visual explanations
2021. from deep networks via gradient-based localization.
[61] Wang, S.; Li, B.; Khabsa, M.; Fang, H.; Ma, H. In: Proceedings of the IEEE International Conference
Linformer: Self-attention with linear complexity. on Computer Vision, 618–626, 2017.
arXiv preprint arXiv:2006.04768, 2020. [73] Binder, A.; Montavon, G.; Lapuschkin, S.; Müller,
[62] Abnar, S.; Zuidema, W. Quantifying attention flow in K.-R.; Samek, W. Layer-wise relevance propagation
transformers. arXiv preprint arXiv:2005.00928, 2020. for neural networks with local renormalization layers.
[63] Voita, E.; Talbot, D.; Moiseev, F.; Sennrich, R.; Titov, In: Artificial Neural Networks and Machine Learning–
I. Analyzing multi-head self-attention: Specialized ICANN 2016. Lecture Notes in Computer Science,
heads do the heavy lifting, the rest can be pruned. Vol. 9887. Villa, A.; Masulli, P.; Pons Rivero, A. Eds.
In: Proceedings of the 57th Annual Meeting of the Springer Cham, 63–71, 2016.
Association for Computational Linguistics, 5797–5808, [74] Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.;
2019. Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.
[64] Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; H. et al. Rethinking semantic segmentation from a
Satheesh, S.; Ma, S. A.; Huang, Z.; Karpathy, A.; sequence-tosequence perspective with transformers.
Khosla, A.; Bernstein, M. et al. ImageNet large scale In: Proceedings of the IEEE/CVF Conference on
visual recognition challenge. International Journal of Computer Vision and Pattern Recognition, 6881–6890,
Computer Vision Vol. 115, No. 3, 211–252, 2015. 2021.
[65] Touvron, H.; Cord, M.; Douze, M.; Massa, F.; [75] Duke, B.; Ahmed, A.; Wolf, C.; Aarabi, P.;
Sablayrolles, A.; Jegou, H. Training data-efficient Taylor, G. W. SSTVOS: Sparse spatiotemporal
image transformers & distillation through attention. transformers for video object segmentation. arXiv
In: Proceedings of the 38th International Conference preprint arXiv:2101.08833, 2021.
on Machine Learning, 10347–10357, 2021. [76] Chen, J. N.; Lu, Y. Y.; Yu, Q. H.; Luo, X. D.;
[66] Han, Y. Z.; Huang, G.; Song, S. J.; Yang, L.; Wang, Y. Zhou, Y. Y. TransUNet: Transformers make strong
L. Dynamic neural networks: A survey. arXiv preprint encoders for medical image segmentation. arXiv
arXiv:2102.04906, 2021. preprint arXiv:2102.04306, 2021.
[67] Xu, W.; Xu, Y.; Chang, T.; Tu, Z. Co-scale [77] Ye, L. W.; Rochan, M.; Liu, Z.; Wang, Y. Cross-
conv-attentional image transformers. arXiv preprint modal self-attention network for referring image
arXiv:2104.06399, 2021. segmentation. In: Proceedings of the IEEE/CVF
[68] Dong, X. Y.; Bao, J. M.; Chen, D. D.; Zhang, Conference on Computer Vision and Pattern
W. M.; Yu, N. H.; Yuan, L.; Chen, D.; Guo, B. Recognition, 10494–10503, 2019.
CSWin transformer: A general vision transformer [78] Wang, H.; Zhu, Y.; Adam, H.; Yuille, A.; Chen, L.-C.
backbone with cross-shaped windows. arXiv preprint Max-deeplab: End-to-end panoptic segmentation with
arXiv:2107.00652, 2021. mask transformers. arXiv preprint arXiv:2012.00759,
[69] Huang, Z. L.; Wang, X. G.; Huang, L. C.; Huang, 2020.
C.; Wei, Y. C.; Liu, W. CCNet: Criss-cross attention [79] Durner, M.; Boerdijk, W.; Sundermeyer, M.; Friedl,
for semantic segmentation. In: Proceedings of the W.; Marton, Z.-C.; Triebel, R. Unknown object
IEEE/CVF International Conference on Computer segmentation from stereo images. arXiv preprint
Vision, 603–612, 2019. arXiv:2103.06796, 2021.
[70] Hou, Q. B.; Zhang, L.; Cheng, M. M.; Feng, J. S. Strip [80] Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.-E.; Sheikh, Y.
pooling: Rethinking spatial pooling for scene parsing. OpenPose: Realtime multi-person 2D pose estimation
In: Proceedings of the IEEE/CVF Conference on using part affinity fields. IEEE Transactions on
Computer Vision and Pattern Recognition, 4002–4011, Pattern Analysis and Machine Intelligence Vol. 43,
2020. No. 1, 172–186, 2021.
58 Y. Xu, H. Wei, M. Lin, et al.

[81] Simon, T.; Joo, H.; Matthews, I.; Sheikh, Y. [93] Rezatofighi, S. H.; Kaskman, R.; Motlagh, F. T.;
Hand keypoint detection in single images using Shi, Q. F.; Cremers, D.; Leal-Taixé, L.; Reid, I.
multiview bootstrapping. In: Proceedings of the Deep perm-set net: Learn to predict sets with
IEEE Conference on Computer Vision and Pattern unknown permutation and cardinality using deep
Recognition, 4645–4653, 2017. neural networks. arXiv preprint arXiv:1805.00613,
[82] Cao, Z.; Simon, T.; Wei, S.-E.; Sheikh, Y. Realtime 2018.
multi-person 2D pose estimation using part affinity [94] Pan, X. J.; Tang, F.; Dong, W. M.; Gu, Y.; Song,
fields. In: Proceedings of the IEEE Conference on Z. C.; Meng, Y. P.; Xu, P.; Deussen, O.; Xu,
Computer Vision and Pattern Recognition, 1302–1310, C. Self-supervised feature augmentation for large
2017. image object detection. IEEE Transactions on Image
[83] Fang, H.-S.; Xie, S. Q.; Tai, Y.-W.; Lu, C. W. Processing Vol. 29, 6745–6758, 2020.
RMPE: Regional multi-person pose estimation. In: [95] Pan, X.; Gao, Y.; Lin, Z.; Tang, F.; Dong, W.;
Proceedings of the IEEE International Conference on Yuan, H.; Huang, F.; Xu, C. Unveiling the potential
Computer Vision, 2353–2362, 2017. of structure preserving for weakly supervised object
[84] Zhang, F.; Zhu, X. T.; Dai, H. B.; Ye, M.; Zhu, localization. In: Proceedings of the IEEE/CVF
C. Distribution-aware coordinate representation for
Conference on Computer Vision and Pattern
human pose estimation. In: Proceedings of the
Recognition, 11642–11651, 2021.
IEEE/CVF Conference on Computer Vision and
[96] Pan, X. J.; Ren, Y. Q.; Sheng, K. K.; Dong, W.
Pattern Recognition, 7091–7100, 2020.
M.; Yuan, H. L.; Guo, X. W.; Ma, C.; Xu, C.
[85] Sun, K.; Xiao, B.; Liu, D.; Wang, J. D. Deep
Dynamic refinement network for oriented and densely
high-resolution representation learning for human
packed object detection. In: Proceedings of the
pose estimation. In: Proceedings of the IEEE/CVF
IEEE/CVF Conference on Computer Vision and
Conference on Computer Vision and Pattern
Pattern Recognition, 11204–11213, 2020.
Recognition, 5686–5696, 2019.
[97] Chu, X. X.; Tian, Z.; Wang, Y. Q.; Zhang, B.; Shen,
[86] Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-
C. H. Twins: Revisiting spatial attention design in
CNN: Towards real-time object detection with region
vision transformers. arXiv preprint arXiv:2104.13840,
proposal networks. arXiv preprint arXiv:1506.01497,
2021.
2015.
[98] Beal, J.; Kim, E.; Tzeng, E.; Park, D. H.; Kislyuk,
[87] Cai, Z. W.; Vasconcelos, N. Cascade R-CNN: High
quality object detection and instance segmentation. D. Toward transformer-based object detection. arXiv
IEEE Transactions on Pattern Analysis and Machine preprint arXiv:2012.09958, 2020.
Intelligence Vol. 43, No. 5, 1483–1498, 2021. [99] Dai, J. F.; Qi, H. Z.; Xiong, Y. W.; Li, Y.; Zhang,
[88] Lin, T. Y.; Goyal, P.; Girshick, R.; He, K. M.; G. D.; Hu, H.; Wei, Y. Deformable convolutional
Dollár, P. Focal loss for dense object detection. In: networks. In: Proceedings of the IEEE International
Proceedings of the IEEE International Conference on Conference on Computer Vision, 764–773, 2017.
Computer Vision, 2999–3007, 2017. [100] He, K. M.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask
[89] Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. R-CNN. In: Proceedings of the IEEE International
arXiv preprint arXiv:1904.07850, 2019. Conference on Computer Vision, 2980–2988, 2017.
[90] Tian, Z.; Shen, C. H.; Chen, H.; He, T. FCOS: [101] Chen, H.; Wang, Y.; Guo, T.; Xu, C.; Deng, Y.; Liu,
Fully convolutional one-stage object detection. Z.; Ma, S.; Xu, C.; Xu, C.; Gao, W. Pre-trained
In: Proceedings of the IEEE/CVF International image processing transformer. In: Proceedings of
Conference on Computer Vision, 9626–9635, 2019. the IEEE/CVF Conference on Computer Vision and
[91] Stewart, R.; Andriluka, M.; Ng, A. Y. End-to-end Pattern Recognition, 12299–12310, 2021.
people detection in crowded scenes. In: Proceedings [102] Esser, P.; Rombach, R.; Ommer, B. Taming
of the IEEE Conference on Computer Vision and transformers for high-resolution image synthesis.
Pattern Recognition, 2325–2333, 2016. arXiv preprint arXiv:2012.09841, 2020.
[92] Hosang, J.; Benenson, R.; Schiele, B. Learning [103] Kaiser, L.; Bengio, S. Can active memory replace
non-maximum suppression. In: Proceedings of the attention? In: Proceedings of the 30th International
IEEE Conference on Computer Vision and Pattern Conference on Neural Information Processing Systems,
Recognition, 6469–6477, 2017. 3781–3789, 2016.
Transformers in computational visual media: A survey 59

[104] Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show [115] Su, W.; Zhu, X.; Cao, Y.; Li, B.; Lu, L.; Wei, F.; Dai,
and tell: Lessons learned from the 2015 MSCOCO J. VL-BERT: Pre-training of generic visual-linguistic
image captioning challenge. IEEE Transactions on representations. In: Proceedings of the International
Pattern Analysis and Machine Intelligence Vol. 39, Conference on Learning Representations, 2020.
No. 4, 652–663, 2016. [116] Zhou, L. W.; Palangi, H.; Zhang, L.; Hu, H. D.; Corso,
[105] Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show J.; Gao, J. F. Unified vision-language pre-training for
and tell: A neural image caption generator. In: image captioning and VQA. In: Proceedings of the
Proceedings of the IEEE Conference on Computer AAAI Conference on Artificial Intelligence, Vol. 34,
Vision and Pattern Recognition, 3156–3164, 2015. No. 7, 13041–13049, 2020.
[106] Rolfe, J. T. Discrete variational autoencoders. arXiv [117] Girshick, R. Fast R-CNN. In: Proceedings of the
preprint arXiv:1609.02200, 2016. IEEE International Conference on Computer Vision,
1440–1448, 2015.
[107] Goodfellow, I. J.; Pouget-Abadie, J.; Mirza, M.; Xu,
B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, [118] Li, W.; Gao, C.; Niu, G. C.; Xiao, X. Y.; Wang, H. F.
Y. Generative adversarial networks. arXiv preprint UNIMO: Towards unified-modal understanding and
arXiv:1406.2661, 2014. generation via cross-modal contrastive learning. arXiv
preprint arXiv:2012.15409, 2020.
[108] Ho, J.; Kalchbrenner, N.; Weissenborn, D.; Salimans,
[119] Li, L. H.; Yatskar, M.; Yin, D.; Hsieh, C. J.; Chang, K.
T. Axial attention in multidimensional transformers.
W. VisualBERT: A simple and performant baseline for
arXiv preprint arXiv:1912.12180, 2019.
vision and language. arXiv preprint arXiv:1908.03557,
[109] Antol, S.; Agrawal, A.; Lu, J. S.; Mitchell, M.; Batra,
2019.
D.; Zitnick, C. L.; Parikh, D. VQA: Visual question
[120] Alberti, C.; Ling, J.; Collins, M.; Reitter, D. Fusion of
answering. In: Proceedings of the IEEE International
detected objects in text for visual question answering.
Conference on Computer Vision, 2425–2433, 2015.
In: Proceedings of the Conference on Empirical
[110] Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.;
Methods in Natural Language Processing and the 9th
Parikh, D. Making the V in VQA matter: Elevating
International Joint Conference on Natural Language
the role of image understanding in visual question
Processing, 2131–2140, 2019.
answering. In: Proceedings of the IEEE Conference
[121] Li, X. J.; Yin, X.; Li, C. Y.; Zhang, P. C.; Hu, X. W.;
on Computer Vision and Pattern Recognition, 6325–
Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F. et
6334, 2017.
al. OSCAR: Object-semantics aligned pre-training for
[111] Chen, X. L.; Fang, H.; Lin, T.-Y.; Vedantam, R.; vision-language tasks. In: Computer Vision–ECCV
Gupta, S.; Dollar, P.; Zitnick, C. L. Microsoft COCO 2020. Lecture Notes in Computer Science, Vol. 12375.
captions: Data collection and evaluation server. arXiv Vedaldi, A.; Bischof, H.; Brox, T.; Frahm, J. M. Eds.
preprint arXiv:1504.00325, 2015. Springer Cham, 121–137, 2020.
[112] Young, P.; Lai, A.; Hodosh, M.; Hockenmaier, J. [122] Yu, F.; Tang, J.; Yin, W.; Sun, Y.; Tian, H.; Wu, H.;
From image descriptions to visual denotations: New Wang, H. ERNIE-ViL: Knowledge enhanced vision-
similarity metrics for semantic inference over event language representations through scene graph. In:
descriptions. Transactions of the Association for Proceedings of the AAAI Conference on Artificial
Computational Linguistics Vol. 2, 67–78, 2014. Intelligence, 2021.
[113] Gan, Z.; Chen, Y.-C.; Li, L.; Zhu, C.; Cheng, Y.; [123] Li, Y.; Pan, Y.; Yao, T.; Chen, J.; Mei, T. Scheduled
Liu, J. Large-scale adversarial training for vision- sampling in vision-language pretraining with decou-
and-language representation learning. In: Advances pled encoder–decoder network. In: Proceedings of the
in Neural Information Processing Systems, Vol. 33. AAAI Conference on Artificial Intelligence, 8518–8526,
Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M. 2021.
F.; Lin, H. Eds. Curran Associates, Inc., 6616–6628, [124] Tan, H.; Bansal, M. LXMERT: Learning cross-
2020. modality encoder representations from transformers.
[114] Lin, J. Y.; Yang, A.; Zhang, Y. C.; Liu, J.; Yang, H. X. In: Proceedings of the Conference on Empirical
InterBERT: Vision-and-language interaction for multi- Methods in Natural Language Processing and the 9th
modal pretraining. arXiv preprint arXiv:2003.13198, International Joint Conference on Natural Language
2020. Processing, 5100–5111, 2019.
60 Y. Xu, H. Wei, M. Lin, et al.

[125] Sharma, P.; Ding, N.; Goodman, S.; Soricut, R. [134] Xie, N.; Lai, F.; Doran, D.; Kadav, A. Visual
Conceptual captions: A cleaned, hypernymed, image entailment: A novel task for fine-grained image
alt-text dataset for automatic image captioning. In: understanding. arXiv preprint arXiv:1901.06706,
Proceedings of the 56th Annual Meeting of the 2019.
Association for Computational Linguistics, 2556–2565, [135] Zellers, R.; Bisk, Y.; Farhadi, A.; Choi, Y. From
2018. recognition to cognition: Visual commonsense reason-
[126] Ordonez, V.; Kulkarni, G.; Berg, T. L. Im2Text: De- ing. In: Proceedings of the IEEE/CVF Conference on
scribing images using 1 million captioned photographs. Computer Vision and Pattern Recognition, 6713–6724,
In: Proceedings of the 24th International Conference 2019.
on Neural Information Processing Systems, 1143–1151, [136] Kazemzadeh, S.; Ordonez, V.; Matten, M.; Berg, T.
2011. ReferItGame: Referring to objects in photographs of
[127] Krishna, R.; Zhu, Y. K.; Groth, O.; Johnson, J.; natural scenes. In: Proceedings of the Conference on
Empirical Methods in Natural Language Processing,
Hata, K. J.; Kravitz, J.; Chen, S.; Kalantidis, Y.;
787–798, 2014.
Li, L.-J.; Shamma, D. A. et al. Visual genome:
Connecting language and vision using crowdsourced [137] Sheng, K. K.; Dong, W. M.; Ma, C. Y.; Mei, X.;
dense image annotations. International Journal of Huang, F. Y.; Hu, B.-G. Attention-based multi-
patch aggregation for image aesthetic assessment.
Computer Vision Vol. 123, No. 1, 32–73, 2017.
In: Proceedings of the 26th ACM International
[128] Hudson, D. A.; Manning, C. D. GQA: A new dataset
Conference on Multimedia, 879–886, 2018.
for real-world visual reasoning and compositional
[138] Sheng, K. K.; Dong, W. M.; Chai, M. L.; Wang,
question answering. In: Proceedings of the IEEE/CVF
G. H.; Zhou, P.; Huang, F. Y.; Hu, B.-G.; Ji, R.;
Conference on Computer Vision and Pattern
Ma, C. Revisiting image aesthetic assessment via self-
Recognition, 6693–6702, 2019.
supervised feature learning. In: Proceedings of the
[129] Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; AAAI Conference on Artificial Intelligence, Vol. 34,
Perona, P.; Ramanan, D., Dollar, P.; Zitnick, C. No. 4, 5709–5716, 2020.
L. Microsoft COCO: Common objects in context.
[139] Sheng, K. K.; Dong, W. M.; Huang, H. B.; Chai,
In: Computer Vision–ECCV 2014. Lecture Notes in
M. L.; Zhang, Y.; Ma, C. Y.; Hu, B.-G. Learning to
Computer Science, Vol. 8693. Fleet, D.; Pajdla, T.; assess visual aesthetics of food images. Computational
Schiele, B.; Tuytelaars, T. Eds. Springer Cham, 740– Visual Media Vol. 7, No. 1, 139–152, 2021.
755, 2014.
[140] Zhang, S. F.; Wang, X. B.; Liu, A.; Zhao, C. X.;
[130] Kuznetsova, A.; Rom, H.; Alldrin, N.; Uijlings, J.; Wan, J.; Escalera, S.; Shi, H.; Wang, Z.; Li, S.
Krasin, I.; Pont-Tuset, J.; Kamali, S.; Popov, S.; Z. A dataset and benchmark for large-scale multi-
Malloci, M.; Kolesnikov, A. et al. The open images modal face anti-spoofing. In: Proceedings of the
dataset V4. International Journal of Computer Vision IEEE/CVF Conference on Computer Vision and
Vol. 128, No. 7, 1956–1981, 2020. Pattern Recognition, 919–928, 2019.
[131] Zhang, P.; Li, X.; Hu, X.; Yang, J.; Zhang, L.; [141] Chen, Z.; Yao, T.; Sheng, K.; Ding, S.; Tai, Y.; Li,
Wang, L.; Choi, Y.; Gao, J. VinVL: Revisiting J.; Huang, F.; Jin, X. Generalizable representation
visual representations in vision-language models. learning for mixture domain face anti-spoofing. In:
In: Proceedings of the IEEE/CVF Conference on Proceedings of the AAAI Conference on Artificial
Computer Vision and Pattern Recognition, 5579–5588, Intelligence, 1132–1139, 2021.
2021. [142] Zhao, H.; Jiang, L.; Jia, J.; Torr, P.; Koltun, V. Point
[132] Hu, R.; Singh, A. UniT: Multimodal multitask transformer. arXiv preprint arXiv:2012.09164, 2020.
learning with a unified transformer. arXiv preprint [143] Zoph, B.; Le, Q. V. Neural architecture search
arXiv:2102.10772, 2021. with reinforcement learning. In: Proceedings of the
[133] Suhr, A.; Zhou, S.; Zhang, A.; Zhang, I.; Bai, H. International Conference on Learning Representations,
J.; Artzi, Y. A corpus for reasoning about natural 2017.
language grounded in photographs. In: Proceedings [144] Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q. V.
of the 57th Annual Meeting of the Association for Learning transferable architectures for scalable image
Computational Linguistics, 6418–6428, 2019. recognition. In: Proceedings of the IEEE/CVF
Transformers in computational visual media: A survey 61

Conference on Computer Vision and Pattern Minxuan Lin received his B.Sc. degree
Recognition, 8697–8710, 2018. in computer science and technology
from the Ocean University of China in
[145] Real, E.; Aggarwal, A.; Huang, Y. P.; Le, Q. V.
2018. He is currently a postgraduate
Regularized evolution for image classifier architecture
of NLPR. His research interests include
search. In: Proceedings of the AAAI Conference on computational visual media and machine
Artificial Intelligence, Vol. 33, 4780–4789, 2019. learning.
[146] Wang, H. R.; Wu, Z. H.; Liu, Z. J.; Cai, H.; Zhu,
L. G.; Gan, C.; Han, S. HAT: Hardware-aware
transformers for efficient natural language processing.
In: Proceedings of the 58th Annual Meeting of the Yingying Deng received her B.Sc.
Association for Computational Linguistics, 7675–7688, degree in automation from the University
2020. of Science and Technology, Beijing in
2017. She is currently working towards
[147] So, D.; Le, Q.; Liang, C. The evolved transformer. In:
her Ph.D. degree in NLPR. Her research
Proceedings of the 36th International Conference on
interests include computational visual
Machine Learning, 5877–5886, 2019. media and machine learning.
[148] Li, C. L.; Tang, T.; Wang, G. R.; Peng, J. F.; Chang,
X. J. BossNAS: Exploring hybrid CNN-transformers
with Block-wisely Self-supervised neural architecture
search. arXiv preprint arXiv:2103.12424, 2021. Kekai Sheng received his Ph.D. degree
[149] Schulz, K.; Sixt, L.; Tombari, F.; Landgraf, T. from NLPR in 2019. He received
Restricting the flow: Information bottlenecks for his B.Eng. degree in telecommunication
attribution. In: Proceedings of the International engineering from the University of
Conference on Learning Representations, 2019. Science and Technology, Beijing in 2014.
[150] Jiang, Z.; Tang, R.; Xin, J.; Lin, J. Inserting He is currently a research engineer at
Youtu Lab, Tencent Inc. His research
information bottleneck for attribution in transformers.
interests include domain adaptation,
In: Proceedings of the Conference on Empirical neural architecture search, and AutoML.
Methods in Natural Language Processing: Findings,
3850–3857, 2020.

Mengdan Zhang received her Ph.D.

degree from NLPR in 2018. She received
Yifan Xu is currently a postgraduate her B.Eng. degree in automation from
of the National Laboratory of Pattern Xi’an Jiao Tong University in 2013. She
Recognition (NLPR) at the Institute is currently a research engineer at Youtu
of Automation, Chinese Academy of Lab, Tencent Inc. Her research interests
Sciences. He received a his B.Eng. degree include computer vision and machine
from Beijing Institute of Technology in learning.
2015. His research interests include
transfer learning, machine learning, and
computational visual media.

Fan Tang is an assistant professor in

the School of Artiﬁcial Intelligence, Jilin
University. He received his B.Sc. degree
Huapeng Wei is a postgraduate of the in computer science from North China
School of Artiﬁcial Intelligence, Jilin Electric Power University in 2013 and
University. He received his B.Sc. degree his Ph.D. degree from NLPR in 2019.
from Jilin University in 2020. His His research interests include computer
research interests include computational graphics, computer vision, and machine
visual media and image processing. learning.
62 Y. Xu, H. Wei, M. Lin, et al.

Weiming Dong is a professor in NLPR. special session organizer, session chair and TPC member
He received his B.Eng. and M.S. degrees for over 20 prestigious IEEE and ACM multimedia journals,
in computer science in 2001 and 2004 conferences, and workshops. Currently he is the editor-in-
from Tsinghua University. He received chief of Multimedia Systems. Changsheng Xu is an IEEE
his Ph.D. degree in information tech- Fellow, IAPR Fellow, and ACM Distinguished Scientist.
nology from the University of Lorraine,
France, in 2007. His research interests
include visual media synthesis and
evaluation. Weiming Dong is a member of the ACM and
Open Access This article is licensed under a Creative
IEEE.
Commons Attribution 4.0 International License, which
permits use, sharing, adaptation, distribution and reproduc-
tion in any medium or format, as long as you give appropriate
Feiyue Huang is the director of the credit to the original author(s) and the source, provide a link
Youtu Lab, Tencent Inc. He received his to the Creative Commons licence, and indicate if changes
B.Sc. and Ph.D. degrees in computer were made.
science in 2001 and 2008 respectively,
The images or other third party material in this article are
both from Tsinghua University, China.
included in the article’s Creative Commons licence, unless
His research interests include image
indicated otherwise in a credit line to the material. If material
understanding and face recognition.
is not included in the article’s Creative Commons licence and
your intended use is not permitted by statutory regulation or
exceeds the permitted use, you will need to obtain permission
Changsheng Xu is a professor in directly from the copyright holder.
NLPR. His research interests include To view a copy of this licence, visit http://
multimedia content analysis, indexing creativecommons.org/licenses/by/4.0/.
and retrieval, pattern recognition, and Other papers from this open access journal are available
computer vision. Prof. Xu has served free of charge from http://www.springer.com/journal/41095.
as associate editor, guest editor, general To submit a manuscript, please go to https://www.
chair, program chair, area/track chair, editorialmanager.com/cvmj.

LLM
No ratings yet
LLM
28 pages
AN IMAGE IS WORTH 16X16 WORDS TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE Hirtika Mirghani
No ratings yet
AN IMAGE IS WORTH 16X16 WORDS TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE Hirtika Mirghani
2 pages
ViT Transformers SEMINAR
No ratings yet
ViT Transformers SEMINAR
16 pages
A Survey of The Vision Transformers and Its CNN-Transformer Based Variants - Khan Et Al
No ratings yet
A Survey of The Vision Transformers and Its CNN-Transformer Based Variants - Khan Et Al
82 pages
Behavior Cloning For Self Driving Cars Using Attention Models
No ratings yet
Behavior Cloning For Self Driving Cars Using Attention Models
5 pages
Vision Transformers in Medical Imaging - A Review
No ratings yet
Vision Transformers in Medical Imaging - A Review
31 pages
A Survey of Visual Transformers
No ratings yet
A Survey of Visual Transformers
23 pages
ViViT: A Video Vision Transformer
No ratings yet
ViViT: A Video Vision Transformer
14 pages
TSP CMC 50790
No ratings yet
TSP CMC 50790
24 pages
A Survey On Visual Transformer
No ratings yet
A Survey On Visual Transformer
23 pages
Lec25 Architectures
No ratings yet
Lec25 Architectures
52 pages
2022 - ViTAEv2 - Zhang Et Al - Arxiv
No ratings yet
2022 - ViTAEv2 - Zhang Et Al - Arxiv
22 pages
Vision Transformers in Medical Imaging: A Comprehensive Review of Advancements and Applications Across Multiple Diseases
No ratings yet
Vision Transformers in Medical Imaging: A Comprehensive Review of Advancements and Applications Across Multiple Diseases
44 pages
Gaurav Vision Transformer
No ratings yet
Gaurav Vision Transformer
10 pages
Transformer Segmentation
No ratings yet
Transformer Segmentation
35 pages
Abstract
No ratings yet
Abstract
2 pages
Vitae: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias
No ratings yet
Vitae: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias
23 pages
Neural Architecture Search For Transformers A Surv
No ratings yet
Neural Architecture Search For Transformers A Surv
39 pages
Research Paper (2) Done
No ratings yet
Research Paper (2) Done
17 pages
A Survey of Visual Transformers
No ratings yet
A Survey of Visual Transformers
21 pages
CVT: Introducing Convolutions To Vision Transformers
No ratings yet
CVT: Introducing Convolutions To Vision Transformers
10 pages
Comprehensive Survey of Model Compression and Speed Up For Vision Transformers - Chen Et Al
No ratings yet
Comprehensive Survey of Model Compression and Speed Up For Vision Transformers - Chen Et Al
12 pages
Deep Learning Paper About Vit
No ratings yet
Deep Learning Paper About Vit
12 pages
Challenging Task
No ratings yet
Challenging Task
21 pages
Good Note - ViT
No ratings yet
Good Note - ViT
13 pages
Lecture 28 TransformerIntroductionFinal 1
No ratings yet
Lecture 28 TransformerIntroductionFinal 1
69 pages
Vision Transformers For Dense Prediction Tasks: Junyong Lee Computer Graphics Lab
No ratings yet
Vision Transformers For Dense Prediction Tasks: Junyong Lee Computer Graphics Lab
22 pages
Ai Lakshmana Sai Vision Transformer
No ratings yet
Ai Lakshmana Sai Vision Transformer
19 pages
Applsci 13 05521 v2
No ratings yet
Applsci 13 05521 v2
17 pages
Paper 3
No ratings yet
Paper 3
7 pages
Deep Learning Book PDF
No ratings yet
Deep Learning Book PDF
272 pages
2024 GVT Shan Chen Arxiv
No ratings yet
2024 GVT Shan Chen Arxiv
9 pages
Video Quality Assessment (VQA) Using Vision Transformers
No ratings yet
Video Quality Assessment (VQA) Using Vision Transformers
6 pages
CVT: Introducing Convolutions To Vision Transformers
No ratings yet
CVT: Introducing Convolutions To Vision Transformers
10 pages
10 Transformers
No ratings yet
10 Transformers
22 pages
Transformer-Based Visual Segmentation - A Survey
No ratings yet
Transformer-Based Visual Segmentation - A Survey
23 pages
Yang 2022
No ratings yet
Yang 2022
20 pages
Yearly Scheme of Work Science Y6 2025-2026
No ratings yet
Yearly Scheme of Work Science Y6 2025-2026
16 pages
An Overview of Vision Transformers For Image Processing A Survey
No ratings yet
An Overview of Vision Transformers For Image Processing A Survey
17 pages
03 - ViViT - A Video Vision Transformer
No ratings yet
03 - ViViT - A Video Vision Transformer
13 pages
Vision Transformer Adapter For Dense Predictions
No ratings yet
Vision Transformer Adapter For Dense Predictions
20 pages
Yan Multiview Transformers For Video Recognition CVPR 2022 Paper
No ratings yet
Yan Multiview Transformers For Video Recognition CVPR 2022 Paper
11 pages
2021 NeurIPS VAAT Akbari, Yuan, Qian, Chuang, Chang, Cui, Gong
No ratings yet
2021 NeurIPS VAAT Akbari, Yuan, Qian, Chuang, Chang, Cui, Gong
16 pages
Video Quality Assessment (VQA) Using Vision Transformers
No ratings yet
Video Quality Assessment (VQA) Using Vision Transformers
5 pages
Vision Transformers: Revolutionizing Computer Vision
No ratings yet
Vision Transformers: Revolutionizing Computer Vision
14 pages
A Survey On Vision Transformer
No ratings yet
A Survey On Vision Transformer
23 pages
Pixel To Phrases
No ratings yet
Pixel To Phrases
6 pages
ViT Survey On Segmentation
No ratings yet
ViT Survey On Segmentation
30 pages
Assignment Transforming Computer Vision The Rise of Vision Transformers and Its Impact
No ratings yet
Assignment Transforming Computer Vision The Rise of Vision Transformers and Its Impact
3 pages
Transformer-Based Visual Segmentation: A Survey
No ratings yet
Transformer-Based Visual Segmentation: A Survey
25 pages
Synthesis Lectures On Computer Vision: Series Editors
No ratings yet
Synthesis Lectures On Computer Vision: Series Editors
8 pages
Video Transformer Network
No ratings yet
Video Transformer Network
11 pages
Revised Bioethics Syllabus
100% (1)
Revised Bioethics Syllabus
7 pages
Transformer in Transformer: Kai Han An Xiao Enhua Wu Jianyuan Guo Chunjing Xu Yunhe Wang
No ratings yet
Transformer in Transformer: Kai Han An Xiao Enhua Wu Jianyuan Guo Chunjing Xu Yunhe Wang
14 pages
Paper 2
No ratings yet
Paper 2
8 pages
Research Notes
No ratings yet
Research Notes
9 pages
Video Swin Transformer
No ratings yet
Video Swin Transformer
12 pages
Are Native English Speakers Really Better Teachers
No ratings yet
Are Native English Speakers Really Better Teachers
3 pages
CCE Form - 7th Cycle
100% (1)
CCE Form - 7th Cycle
5 pages
A Survey On Vision Transformer
No ratings yet
A Survey On Vision Transformer
24 pages
Transformers For Vision
No ratings yet
Transformers For Vision
28 pages
Lesson Plan Test Taking
100% (1)
Lesson Plan Test Taking
3 pages
NeurIPS 2021 Transformer in Transformer Paper
No ratings yet
NeurIPS 2021 Transformer in Transformer Paper
12 pages
Vi Transformer
No ratings yet
Vi Transformer
21 pages
Civics Lesson 2 Why Do People Form Governments
No ratings yet
Civics Lesson 2 Why Do People Form Governments
3 pages
Slide Akuntansi Manahemen
50% (2)
Slide Akuntansi Manahemen
10 pages
Power Factor Measurement and Correction Techniques
No ratings yet
Power Factor Measurement and Correction Techniques
3 pages
BSN Learning Plan Senior Capstone
100% (1)
BSN Learning Plan Senior Capstone
3 pages
Sample Reflection Statement Document
No ratings yet
Sample Reflection Statement Document
1 page
q3 Week 1-2 Lesson Plan
No ratings yet
q3 Week 1-2 Lesson Plan
9 pages
Training Need Identification Format
No ratings yet
Training Need Identification Format
1 page
Cambridge Ict Starters Syllabus English
No ratings yet
Cambridge Ict Starters Syllabus English
84 pages
Portfolio Resume-2
No ratings yet
Portfolio Resume-2
4 pages
55 Limba Engleza 10 Lotul 55 Reevaluare Compressed
No ratings yet
55 Limba Engleza 10 Lotul 55 Reevaluare Compressed
152 pages
TX EN Comm V1.4 20190402 W
No ratings yet
TX EN Comm V1.4 20190402 W
25 pages
Characterization of Developed Low Thermal Scanner For DC Voltage Measurement - Rev1
No ratings yet
Characterization of Developed Low Thermal Scanner For DC Voltage Measurement - Rev1
9 pages
Characterization of Developed Low Thermal Scanner For DC Voltage Measurement - RevSYah
No ratings yet
Characterization of Developed Low Thermal Scanner For DC Voltage Measurement - RevSYah
10 pages
Z Durrani CV 2024
No ratings yet
Z Durrani CV 2024
9 pages
Light Unit Lesson 2
No ratings yet
Light Unit Lesson 2
4 pages
Wunderkeys - Final
100% (3)
Wunderkeys - Final
18 pages
MDT Magnetometer CAL01 Datasheet EN V1.0
No ratings yet
MDT Magnetometer CAL01 Datasheet EN V1.0
5 pages
Automatic Vehicle Classification System For Monitoring Highways
No ratings yet
Automatic Vehicle Classification System For Monitoring Highways
3 pages
MeasurementAn Exciting Journey Across Centuries
No ratings yet
MeasurementAn Exciting Journey Across Centuries
7 pages
DLL Gr8 Edited
No ratings yet
DLL Gr8 Edited
61 pages
Azum Abstract Rev2
No ratings yet
Azum Abstract Rev2
2 pages
The Project Metrology For Static and Dynamic Characterization of Supercapacitors MetSuperCap
No ratings yet
The Project Metrology For Static and Dynamic Characterization of Supercapacitors MetSuperCap
2 pages
Gossen Metrawatt Metraclip 87 User Manual
No ratings yet
Gossen Metrawatt Metraclip 87 User Manual
74 pages
CBSE - Application From Private Candidates
No ratings yet
CBSE - Application From Private Candidates
2 pages
LP For Grade 8 Final PDF
No ratings yet
LP For Grade 8 Final PDF
7 pages
Lesson 3 Table
No ratings yet
Lesson 3 Table
4 pages
Hamburg School District Comprehensive School Counseling Plan 2020-2021
No ratings yet
Hamburg School District Comprehensive School Counseling Plan 2020-2021
27 pages
LRP Monitoring Tool
No ratings yet
LRP Monitoring Tool
2 pages
Variasi Model TVC Ballantine 1398 Dan Jangkauan Tegangannya
No ratings yet
Variasi Model TVC Ballantine 1398 Dan Jangkauan Tegangannya
1 page
Welcome To FCE (Cambridge First Certificate Exam)
No ratings yet
Welcome To FCE (Cambridge First Certificate Exam)
24 pages
Sptve - Techdrwg 8 - q1 - m10
No ratings yet
Sptve - Techdrwg 8 - q1 - m10
13 pages
DLL AOM (Week 0 June 20-21)
No ratings yet
DLL AOM (Week 0 June 20-21)
3 pages
Minggu/Week Hari /day Tarikh/Date: Cuti Bergilir SPM 2018
No ratings yet
Minggu/Week Hari /day Tarikh/Date: Cuti Bergilir SPM 2018
2 pages
English 9 Q1 Module 1
No ratings yet
English 9 Q1 Module 1
27 pages
Class Demonstration of Audio-Lingual Method
No ratings yet
Class Demonstration of Audio-Lingual Method
2 pages
Phils Vs Singapore
No ratings yet
Phils Vs Singapore
6 pages
Learning Organization 1
No ratings yet
Learning Organization 1
24 pages
Live Trace Visualization for System and Program Comprehension in Large Software Landscapes
From Everand
Live Trace Visualization for System and Program Comprehension in Large Software Landscapes
Florian Fittkau
No ratings yet

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

Transformers in Computational Visual Media A Surve

Uploaded by

Transformers in Computational Visual Media A Surve

Uploaded by

Computational Visual Media

https://doi.org/10.1007/s41095-021-0247-3 Vol. 8, No. 1, March 2022, 33–62

Transformers in computational visual media: A survey

Table 1 Recent visual transformers introduced in this survey

Area Secondary area Method Contributions

DETR [24] First transformer-based detection SOTA model

Colorization ColTran [30] First transformer-based image colorization model

Fig. 1 Organisation of recent works on visual transformers.

transformer blocks: patches and reduce the length of tokens progressively

Table 2 Classiﬁcation accuracy on ImageNet [64] for various visual transformers

Convolution-based neural network

ResNet [1] 2242 4.1 25.6 76.2 —

3842 55.4 86 77.9

is initialized as 1 for all layers and all heads in MSA.

On the basis of the aforementioned motivation, they 3.1.8 LocalViT

Fig. 7 Comparison of the convolutional version of the FFN module in

3.3 Visualization of ViT and key-point detection [80–85]. As the focus of

Training #Param FLOPS Source

injected into all encoder blocks rather than only the

Fig. 12 Framework of PVT. Reproduced with permission from Ref. [27],

Fig. 15 Overview of IPT. Reproduced with permission from Ref. [101],

computational cost, and depth-wise convolution 5.4 TransGAN

to visual tasks. As shown in Fig. 17(left), the

The number of codes in the codebook is 512–4096

Fig. 21 Overview of StyTr2 . Reproduced with permission from Ref. [38],

Fig. 23 Overview of ViLBERT. Reproduced with permission from Ref. [40],

Fig. 24 Overview of UNITER. Reproduced with permission from Ref. [41],

Fig. 25 Overview of SemVLP. Reproduced with permission from Ref. [42],

ViLBERT [40] CC 221M 512 8 TitanX jiasenlu/vilbert_beta

Mengdan Zhang received her Ph.D.

Fan Tang is an assistant professor in

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.