Transformers in Computational Visual Media: A Survey
Review Article
Yifan Xu1,2 , Huapeng Wei3 , Minxuan Lin1,2 , Yingying Deng1,2 , Kekai Sheng4 , Mengdan Zhang4 ,
Fan Tang3 , Weiming Dong1,2,5 ( ), Feiyue Huang4 , and Changsheng Xu1,2,5
© The Author(s) 2021.
Abstract  Transformers, the dominant architecture for natural language processing, have also recently attracted much attention from computational visual media researchers due to their capacity for long-range representation and high performance. Transformers are sequence-to-sequence models, which use a self-attention mechanism rather than the RNN sequential structure. Thus, such models can be trained in parallel and can represent global information. This study comprehensively surveys recent visual transformer works. We categorize them according to task scenario: backbone design, high-level vision, low-level vision and generation, and multimodal learning. Their key ideas are also analyzed. Differing from previous surveys, we mainly focus on visual transformer methods in low-level vision and generation. The latest works on backbone design are also reviewed in detail. For ease of understanding, we precisely describe the main contributions of the latest works in the form of tables. As well as giving quantitative comparisons, we also present image results for low-level vision and generation tasks. Computational costs and source code links for various important works are also given in this survey to assist further development.

Keywords  visual transformer; computational visual media (CVM); high-level vision; low-level vision; image generation; multi-modal learning

1 NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China. E-mail: Y. Xu, xuyifan2019@ia.ac.cn; M. Lin, linminxuan2018@ia.ac.cn; Y. Deng, dengyingying2017@ia.ac.cn; W. Dong, weiming.dong@ia.ac.cn ( ); C. Xu, changsheng.xu@ia.ac.cn.
2 School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100040, China.
3 School of Artificial Intelligence, Jilin University, Changchun 130012, China. E-mail: H. Wei, weihp20@jlu.edu.cn; F. Tang, tangfan@jlu.edu.cn.
4 Youtu Lab, Tencent Inc., Shanghai 200233, China. E-mail: K. Sheng, saulsheng@tencent.com; M. Zhang, davinazhang@tencent.com; F. Huang, garyhuang@tencent.com.
5 CASIA-LLVISION Joint Lab, Beijing 100190, China.
Manuscript received: 2021-06-17; accepted: 2021-07-16

1 Introduction

Convolutional neural networks (CNNs) [1–3] have become the fundamental architecture in computational visual media (CVM). Researchers began to incorporate a self-attention mechanism into CNNs to model long-range relationships, owing to the locality of convolutional kernels [4–8]. Recently, Dosovitskiy et al. [9] found that a self-attention-only structure, without convolution, works well in computer vision. Since then, the transformer architecture [10], a non-convolutional architecture dominating the research field of natural language processing (NLP), has been used in computer vision. Introducing transformers into computer vision provides four advantages that CNNs lack:
• Transformers learn with less inductive bias and perform better when trained on large datasets (e.g., ImageNet-21K or JFT-300M) [9, 11].
• Transformers provide a more general architecture suitable for most fields, including NLP, CV, and multimodal learning.
• Transformers powerfully model long-range interactions in a computationally efficient manner [12, 13].
• The learned representation of relationships is more general and robust than the local patterns from convolution modules [14].
As Table 1 shows, an increasing number of works on visual transformers have come out in various subfields of computational visual media. An instructive survey is important because of the difficulties in arranging such fast and abundant developments.
Due to the fast development of visual transformer backbones, this survey specifically focuses on the latest works in that area, as well as low-level vision tasks.

Specifically, this study is mainly arranged into four specific fields: backbone design, high-level vision (e.g., object detection and semantic segmentation), low-level vision and generation, and multimodal learning. We highlight backbone design and low-level vision as our main focus in Fig. 1. The developments to be introduced are summarised in Table 1. For backbone design, several latest works are introduced, considering two aspects: (i) injecting convolutional prior knowledge into ViT, and (ii) boosting the richness of visual features. We also summarize the breakthrough ideas of each work in Fig. 1. For high-level vision, we introduce the mainstream of DETR-based transformer detection models [24]. For low-level vision and generation, we arrange papers according to different subareas including colorization [30, 43–45], text-to-image [31, 32, 46], super-resolution [47–49], and image generation [50–54]. For multimodal learning, we review some recent representative works on vision-plus-language (V+L) models and summarize pretraining objectives in this field.

We comprehensively compare results in different fields and give training details, including computational cost and source code links, to facilitate and encourage further research. Some images resulting from low-level vision models are also illustrated. The rest of the paper is organized as follows. Section 2 introduces visual transformers.
Section 3 lists the latest developments in backbone networks for visual transformers in image classification. Section 4 describes several recent advanced designs using visual transformers in object detection. Section 5 introduces transformer-based methods for various low-level vision tasks. Section 6 reviews recent representative works on multimodal learning. Finally, we draw conclusions from the different research fields in Section 7.

2 Visual transformers

Before introducing the latest developments, we give the basic formulation of visual transformers by using ViT [9] as an example. As shown in Fig. 2, a typical ViT mainly contains five basic procedures: splitting input images into smaller local patches, preparing the input tokens (patch tokens, class token, and position embedding), a series of stacked transformer blocks [55] (i.e., layer normalization (LN) [56] + multi-head self-attention (MSA) [57] + skip-connection layer [1] + multilayer perceptron (MLP) or feed-forward network (FFN)), and a post-processing module. Formally, given an input image X ∈ R^(H×W×C) and its label Y, X is first reshaped into a sequence of flattened 2D image patches X_p ∈ R^(N×(P²·C)). Then, following BERT [10], a class token and several position tokens are used to record extra meaningful information for inference. Together, the input is formulated as follows:

z_0 = [x_cls; x_p^1 E; x_p^2 E; · · · ; x_p^N E] + [E_pos^cls; E_pos^1; · · · ; E_pos^N]

where x_cls ∈ R^D is the class token, E ∈ R^((P²·C)×D) is a linear projection applied to each patch x_p^i, and E_pos^i ∈ R^D is the learnable position embedding for the i-th token. Then, the input is sent into several sequential transformer blocks.
Fig. 2 Framework of ViT (left) and typical pipeline of a transformer encoder (right). Reproduced with permission from Ref. [9], © The Author(s) 2021.
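To make the tokenization step above concrete, the following minimal sketch (illustrative code, not the official ViT implementation; the module name PatchEmbedding and the default sizes are assumptions) splits an image into flattened patches, projects them with E, prepends the class token, and adds learnable position embeddings.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into P x P patches, linearly project each patch,
    then prepend a class token and add learnable position embeddings."""
    def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        assert image_size % patch_size == 0
        self.num_patches = (image_size // patch_size) ** 2           # N
        patch_dim = in_channels * patch_size ** 2                    # P^2 * C
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_dim, embed_dim)                  # E in the formula
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))  # x_cls
        self.pos_embed = nn.Parameter(                               # E_pos for cls + N patches
            torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                                            # x: (B, C, H, W)
        B, C, H, W = x.shape
        P = self.patch_size
        # (B, C, H, W) -> (B, N, P*P*C): flatten non-overlapping patches
        x = x.unfold(2, P, P).unfold(3, P, P)                        # (B, C, H/P, W/P, P, P)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)
        tokens = self.proj(x)                                        # (B, N, D)
        cls = self.cls_token.expand(B, -1, -1)                       # (B, 1, D)
        z0 = torch.cat([cls, tokens], dim=1) + self.pos_embed        # (B, N+1, D)
        return z0

z0 = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(z0.shape)  # torch.Size([2, 197, 768])
```

For a 224 × 224 image with 16 × 16 patches this yields N = 196 patch tokens plus one class token, matching the formulation above.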
Two further ViT models, LeViT [12] x and CoaT [67] y, investigate the importance of position embedding and propose different implementations. We do not describe them further due to lack of space.

3.1.5 Swin Transformer

On the basis of the observation that image data contain much redundant spatial information, and given the success of deep-narrow CNN architectures, Liu et al. [20] propose a novel hierarchical visual transformer design. Figure 5(a) illustrates the core idea of the window MSA (W-MSA) and the shifted W-MSA (SW-MSA) within Swin Transformer, which separate local patches into several windows and run the MSA module window by window. With the W-MSA mechanism, they reduce the computation complexity from O(4HWC² + 2(HW)²C) to O(4HWC² + 2M²HWC), where H and W represent the size of the input patch grid, M × M is the window size, and C is the feature dimension. A shifted window design is also proposed to encourage cross-window communication for rich visual features. They also propose a deep-narrow architecture (see Fig. 5(b)). Extensive experiments on ImageNet, COCO, and ADE-20K demonstrate that Swin Transformer enhances efficient use of parameters and achieves state-of-the-art object detection and semantic segmentation.

Dong et al. [68] also propose another vision transformer model, CSWin Transformer, which utilizes a cross-shaped window self-attention mechanism (akin to criss-cross attention [69] or strip pooling [70]) and a locally enhanced position encoding. CSWin Transformer obtains even better performance than Swin Transformer.

3.1.6 DeepViT

Layer scaling (e.g., the 152-layer ResNet [1]) is an important aspect of CNN architectures. With regard to ViT models, Zhou et al. [19] empirically find that the performance of deep ViT models saturates when more than 20 transformer blocks are stacked, even with the help of skip-connection layers. They unveil that the reason is attention collapse: the feature maps extracted from each head in one MSA module share increasingly similar patterns, leading to huge information redundancy and low training efficiency. If communication between the MSA heads is promoted, the information redundancy between heads can be reduced and richer visual features can be learned.
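As a sanity check on the complexity expressions quoted above for Swin Transformer, the small helper below (a sketch, not code from Ref. [20]) evaluates the two estimates for a typical early-stage feature map; the chosen sizes (a 56 × 56 patch grid, C = 96, 7 × 7 windows) are illustrative.

```python
def msa_flops(h, w, c):
    """Global multi-head self-attention: 4hwC^2 + 2(hw)^2 C."""
    return 4 * h * w * c**2 + 2 * (h * w)**2 * c

def wmsa_flops(h, w, c, m):
    """Window MSA with M x M windows: 4hwC^2 + 2 M^2 hw C."""
    return 4 * h * w * c**2 + 2 * m**2 * h * w * c

# a 56 x 56 patch grid with C = 96 and 7 x 7 windows
h, w, c, m = 56, 56, 96, 7
print(f"MSA   : {msa_flops(h, w, c) / 1e9:.2f} GFLOPs")
print(f"W-MSA : {wmsa_flops(h, w, c, m) / 1e9:.2f} GFLOPs")
```

The quadratic (HW)² term dominates global MSA even at this moderate resolution, whereas the W-MSA cost grows only linearly in the number of patches.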
Fig. 5 (a) Window MSA (W-MSA) greatly reduces computational cost and facilitates communication between each isolated W-MSA. (b) Overview of Swin Transformer. Reproduced with permission from Ref. [20], © The Author(s) 2021.
x https://github.com/facebookresearch/LeViT
y https://github.com/mlpc-ucsd/CoaT
Fig. 8 Class-specific visualization results from ViT. Left to right: input image, rollout [62], raw attention, GradCAM [72], LRP [73], partial LRP [63], and Transformer-Explainability [23]. Reproduced with permission from Ref. [23], © The Author(s) 2021.
x https://github.com/hila-chefer/Transformer-Explainability
Table 3 Comparison of transformer-based detection models on the COCO 2017 val set
Method | Epochs | AP | AP50 | AP75 | APS | APM | APL | #Params (M) | FPS | GFLOPs | Source (GitHub)
Convolution-based models
FCOS [90] | 36 | 41.0 | 59.8 | 44.1 | 26.2 | 44.6 | 52.2 | — | 23 | 177 | tianzhi0549/FCO
Faster R-CNN+FPN [86] | 109 | 42.0 | 62.1 | 45.5 | 26.6 | 45.4 | 53.4 | 42 | 26 | 180 | rbgirshick/py-faster-rcnn
Transformer-based models
DETR [24] | 500 | 42.0 | 62.4 | 44.2 | 20.5 | 45.8 | 61.1 | 41 | 28 | 86 | facebookresearch/detr
DETR-DC5 [24] | 500 | 43.3 | 63.1 | 45.9 | 22.5 | 47.3 | 61.1 | 41 | 12 | 187 | facebookresearch/detr
Deformable DETR [25] | 50 | 43.8 | 62.6 | 47.7 | 26.4 | 47.1 | 58.0 | 40 | 19 | 173 | fundamentalvision/Deformable-DETR
UP-DETR [26] | 150 | 40.5 | 60.8 | 42.6 | 19.0 | 44.1 | 60.0 | 41 | — | — | dddzg/up-detr
UP-DETR [26] | 300 | 42.8 | 63.0 | 45.3 | 20.8 | 47.1 | 61.7 | 41 | — | — | dddzg/up-detr
PVT-T [27] | 300 | 36.7 | 56.9 | 38.9 | 22.6 | 38.8 | 50.0 | 23.0 | — | — | whai362/PVT
PVT-M [27] | 300 | 41.9 | 63.1 | 44.3 | 25.0 | 44.9 | 57.6 | 53.9 | — | — | whai362/PVT
ViT-B/16-FRCNN [98] | 21 | 37.8 | 57.4 | 40.1 | 17.8 | 41.4 | 57.3 | 21 | — | — | —
calculate attention. Nearest here refers to semantic distance rather than spatial distance. Deformable attention is illustrated in Fig. 10 and is derived from deformable convolution [99]. A linear model is established to learn the offsets of the nearest K values, and then another linear model is established to learn the attention score of each value. In summary, the main contributions of the deformable attention module are that (i) only K corresponding values, rather than all values, are required to calculate the attention of one query, and (ii) the attention scores are learned by a network rather than by simple multiplication of queries and keys.

4.3 UP-DETR

Dai et al. [26] propose a novel unsupervised pre-training method called random query patch detection for DETR [24, 25], which leads to better performance. Figure 11(a) illustrates their pretraining method. A random query patch is cropped from an input image. Then, the query patch is added to the object queries of the DETR decoder. The final goal is to predict two things: (i) Lcls, that is, the existence of objects in the query patch, and (ii) Lbox, that is, the location of the query patch in the image. A reconstruction loss Lrec is also designed to ensure that the CNN has extracted full information from the query patch. This pretraining method leads to more flexible training. As shown in Fig. 11(b), a more robust representation can be learned after augmenting the random query patches.

4.4 PVT

Diverging from DETR, Wang et al. [27] propose a pure transformer-based backbone, the pyramid vision transformer (PVT), for detection and segmentation. Its framework is shown in Fig. 12. After each stage, the output is rearranged to recover spatial structure and is then down-sampled to half resolution. Notably, the spatial reduction is only conducted on the key K and value V, while the spatial size of the query Q is maintained. In practice, the full architecture of PVT-based detection models includes a PVT backbone and a general detection head, such as RetinaNet [88] and Mask R-CNN [100]. Recently, several dense prediction backbones have come out after PVT [27], like Swin Transformer [20], CPVT [17], and Twins [97], which we introduced in Section 3.
Fig. 10 Deformable attention module. Reproduced with permission from Ref. [25], © The Author(s) 2020.
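The following single-scale, single-head sketch illustrates the two points above — learned offsets for K sampled values and learned attention scores — using bilinear sampling; it is a simplified illustration under assumed shapes, not the official Deformable DETR module (which is multi-scale, multi-head, and implemented with a custom CUDA kernel).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Single-scale, single-head sketch: each query predicts K sampling
    offsets and K attention weights with linear layers, then aggregates
    bilinearly sampled values -- the core idea of deformable attention."""
    def __init__(self, dim=256, k_points=4):
        super().__init__()
        self.k = k_points
        self.offset_proj = nn.Linear(dim, 2 * k_points)   # (dx, dy) per sampled value
        self.weight_proj = nn.Linear(dim, k_points)       # attention score per value
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, query, ref_points, value_map):
        # query: (B, Nq, D); ref_points: (B, Nq, 2) in [0, 1]; value_map: (B, D, H, W)
        B, Nq, D = query.shape
        value_map = self.value_proj(value_map.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        offsets = self.offset_proj(query).view(B, Nq, self.k, 2)        # learned offsets
        weights = self.weight_proj(query).softmax(dim=-1)               # (B, Nq, K)
        loc = (ref_points.unsqueeze(2) + offsets).clamp(0, 1) * 2 - 1   # [-1, 1] for grid_sample
        sampled = F.grid_sample(value_map, loc, align_corners=False)    # (B, D, Nq, K)
        out = (sampled * weights.unsqueeze(1)).sum(dim=-1)              # weighted sum over K values
        return self.out_proj(out.permute(0, 2, 1))                      # (B, Nq, D)

attn = SimpleDeformableAttention()
q = torch.randn(2, 100, 256)                 # 100 object queries
ref = torch.rand(2, 100, 2)                  # normalized reference points
feat = torch.randn(2, 256, 32, 32)           # encoder feature map
print(attn(q, ref, feat).shape)              # torch.Size([2, 100, 256])
```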
Fig. 11 (a) Random single query patch detection of UP-DETR. (b) A more robust representation is derived by augmenting the query patch. Reproduced with permission from Ref. [26], © The Author(s) 2020.
5 Low-level vision and generation

In this section, we focus on some representative recent transformer-based works on low-level vision tasks, as listed in Table 4. Low-level vision tasks include super-resolution [101], denoising [101], image colorization [30], text-to-image generation [31], and image generation [34, 35]. We separately introduce how these tasks use transformers to achieve good results (see examples in Fig. 13).

Table 4 Source code links for ViT-based models for low-level vision tasks
Method | Source (GitHub)
TIME [31] | —
IPT [101] | —
ColTran [30] | google-research/google-research/tree/master/coltran
TTSR [33] | researchmm/TTSR
GANsformer [35] | —
TransGAN [34] | VITA-Group/TransGAN
DALL·E [32] | openai/DALL-E
VQGAN [102] | CompVis/taming-transformers
StyTr2 [38] | —
PCT [39] | Strawberry-Eat-Mango/PCT_Pytorch

5.1 TIME

As a pre-trained NLP model is always required for the text-to-image (T2I) task, it may introduce inflexibility into the whole model. Liu et al. [31] propose an efficient model for T2I tasks: Text and Image Mutual Translation Adversarial Networks (TIME). TIME can jointly handle T2I and image captioning using a single network without a pretrained NLP model.
Fig. 13 Representative results for low-level tasks, such as text-to-image generation, basic image processing tasks, colorization, image super-resolution, and image generation. Images are taken from the corresponding papers.
As Fig. 14 shows, TIME introduces a multi-head and multi-layer transformer into the generator and text decoder, which can be used to effectively combine image features and the sequence of word embeddings into the output. The Text-Conditioned Image Transformer takes an image feature f_i and a sequence of word embeddings f_t, and outputs the revised image feature f_it according to the word embeddings.
Fig. 14 TIME model overview. Reproduced with permission from Ref. [31], © Association for the Advancement of Artificial Intelligence 2021.
The Image-Captioning Transformer is similar to the Text-Conditioned Image Transformer but for image captioning [103–105]. T2I and the image captioning task are jointly trained in the generative adversarial network (GAN) manner. TIME achieves state-of-the-art T2I performance without pretraining.

DALL·E. Text-to-image generation is a classical generation problem, which needs to construct a mapping between two streams. Ramesh et al. [32] propose a transformer-based framework to better align text and image semantic information. A two-stage model is applied to model the text and image tokens. They first train a discrete variational autoencoder [106] to build 1024 image tokens per image and adopt 256 BPE-encoded text tokens to represent the text information. Thereafter, an auto-regressive transformer is used to capture the joint distribution of the text and image tokens. They also use a mixed-precision training strategy and PowerSGD [57] to save GPU memory. The model consumes approximately 24 GB of memory in 16-bit precision.
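The two-stage recipe can be summarized by a toy sketch: a discrete autoencoder (not shown; random codes stand in for its output) turns each image into a grid of codebook indices, and a decoder-only transformer is trained to predict the next token over the concatenated text-and-image sequence. All names, vocabulary sizes, and sequence lengths below are illustrative, not taken from the DALL·E implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, IMG_VOCAB = 1000, 512   # toy vocabularies (DALL-E: 16384 BPE + 8192 image codes)
TEXT_LEN, IMG_LEN = 16, 64          # toy lengths (DALL-E: 256 text tokens + 32x32 code grid)

class TinyJointAR(nn.Module):
    """Decoder-only transformer over [text tokens ; image tokens]."""
    def __init__(self, dim=256, layers=4, heads=8):
        super().__init__()
        # image tokens are shifted into a separate id range so one softmax covers both
        self.embed = nn.Embedding(TEXT_VOCAB + IMG_VOCAB, dim)
        self.pos = nn.Parameter(torch.zeros(1, TEXT_LEN + IMG_LEN, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(dim, TEXT_VOCAB + IMG_VOCAB)

    def forward(self, text_ids, image_ids):
        seq = torch.cat([text_ids, image_ids + TEXT_VOCAB], dim=1)
        x = self.embed(seq) + self.pos[:, : seq.size(1)]
        L = seq.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        h = self.blocks(x, mask=causal)
        logits = self.head(h)
        # next-token prediction over the joint text+image sequence
        return F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                               seq[:, 1:].reshape(-1))

text = torch.randint(0, TEXT_VOCAB, (2, TEXT_LEN))
codes = torch.randint(0, IMG_VOCAB, (2, IMG_LEN))   # stand-in for dVAE codes
print(TinyJointAR()(text, codes).item())
```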
5.2 IPT

Classification models can be pretrained on large-scale datasets to enlarge model representation ability. Related low-level vision tasks such as image super-resolution, inpainting, and deraining are combined in one model to help one another. The generalized pretraining procedure solves the problem of task-specific data limitation. Therefore, Chen et al. [101] develop a pretrained model for image processing using the transformer architecture, the Image Processing Transformer (IPT). The model architecture is shown in Fig. 15. To adapt to different vision tasks, Chen et al. [101] design a multi-head and multi-tail architecture, which involves three convolutional layers. The transformer body consists of an encoder and a decoder as described in Ref. [57]. Like the discriminator in Ref. [34], they split the given features into patches, and each patch is regarded as a "word" before the features are input into the transformer body. Unlike the original transformer, they utilize a task-specific embedding as an additional input to the decoder. The model is pretrained on ImageNet, which is a key factor in its success.
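The multi-head/multi-tail idea can be sketched as follows: one task-specific head and tail per task, wrapped around a single shared body (here a plain convolutional stack stands in for the transformer encoder–decoder). The task list and layer sizes are illustrative, not those of the IPT release.

```python
import torch
import torch.nn as nn

class MultiTaskIPTLike(nn.Module):
    """Schematic multi-head / multi-tail wrapper: a task-specific head and
    tail per task around one shared body (a conv stack stands in here
    for the transformer encoder-decoder)."""
    def __init__(self, tasks=("denoise", "x2_sr", "derain"), dim=64):
        super().__init__()
        self.heads = nn.ModuleDict(
            {t: nn.Conv2d(3, dim, 3, padding=1) for t in tasks})
        self.shared_body = nn.Sequential(                 # shared across all tasks
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU())
        self.tails = nn.ModuleDict(
            {t: nn.Conv2d(dim, 3, 3, padding=1) for t in tasks})

    def forward(self, x, task):
        feat = self.heads[task](x)        # task-specific feature extraction
        feat = self.shared_body(feat)     # shared (pretrainable) representation
        return self.tails[task](feat)     # task-specific reconstruction

model = MultiTaskIPTLike()
out = model(torch.randn(1, 3, 48, 48), task="denoise")
print(out.shape)  # torch.Size([1, 3, 48, 48])
```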
5.3 Uformer

Wang et al. [37] propose an effective and efficient transformer-based architecture for image restoration. It uses a transformer module to construct a hierarchical encoder–decoder network. Two core designs of Uformer make it suitable for image restoration. The first is a locally-enhanced window (LeWin) transformer block. Specifically, nonoverlapping window-based self-attention is used to reduce the computational cost.

Fig. 16 (a) Overview of Uformer. (b) Structure of the LeWin transformer block. Reproduced with permission from Ref. [37], © The Author(s) 2021.
Fig. 17 Model overview of TransGAN. Reproduced with permission from Ref. [34], © The Author(s) 2021.
self-attention mechanism to promote the effects of a probabilistic colorization model. ColTran replaces self-attention blocks with axial self-attention, which decreases the computational complexity from O(D²) to O(D√D). Kumar et al. [30] adopt a conditional variant of the Axial Transformer [108] for low-resolution coarse colorization. As shown in Fig. 19, the ColTran core consists of Conditional Self-Attention, MLP, and Layer Norm modules, and it applies conditioning to the auto-regressive core. They also design a Color Upsampler and a Spatial Upsampler to produce high-fidelity colorized images from low-resolution results. The Color Upsampler converts the coarse image of 512 colors back into a 3-bit RGB image with 8 symbols per channel. The Spatial Upsampler generates colorized images with high resolution. ColTran can handle grayscale images of 256 × 256 pixels.
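The cost saving of axial self-attention comes from attending along rows and then columns, so each 1D attention covers only H or W positions instead of all H·W. A minimal sketch (illustrative module names, not the ColTran code) is given below.

```python
import torch
import torch.nn as nn

class AxialSelfAttention(nn.Module):
    """Row attention followed by column attention over a feature map.
    Each 1D attention attends over H or W positions instead of H*W."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (B, H, W, D)
        B, H, W, D = x.shape
        rows = x.reshape(B * H, W, D)           # attend along each row (length W)
        rows, _ = self.row_attn(rows, rows, rows)
        x = rows.reshape(B, H, W, D)
        cols = x.permute(0, 2, 1, 3).reshape(B * W, H, D)  # attend along each column
        cols, _ = self.col_attn(cols, cols, cols)
        return cols.reshape(B, W, H, D).permute(0, 2, 1, 3)

x = torch.randn(2, 32, 32, 64)                  # a 32 x 32 grid of 64-d features
print(AxialSelfAttention()(x).shape)            # torch.Size([2, 32, 32, 64])
```

With N = H·W positions, the two 1D passes cost roughly O(N√N) for a square grid, instead of the O(N²) of full self-attention, which is the reduction quoted above.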
5.7 GANsformer

The cognitive science literature describes two mechanisms by which human perception interacts, namely, bottom-up and top-down processing. Previous vision tasks using CNNs do not reflect this bidirectional nature because the local receptive field reduces their ability to model long-range dependencies. Therefore, Hudson and Zitnick [35] aim to design a transformer network with a highly adaptive architecture centered around relational attention and dynamic interaction. They propose a Bipartite Transformer to eliminate the limitation of the huge computational complexity of transformer self-attention. Unlike the self-attention operator, which considers all pairwise relations between input elements, the Bipartite Transformer generalizes this formulation by featuring a bipartite graph between two groups of variables (latent and image features) instead. As shown in Fig. 20, simplex attention distributes information in a single direction over the Bipartite Transformer, while duplex attention supports bidirectional interaction between the elements. The bipartite structure strikes a good balance between expressiveness and efficiency, and it constructs the interaction between latent and visual features to generate good results.

Fig. 20 Overview of the GANsformer framework. Reproduced with permission from Ref. [35], © The Author(s) 2021.

5.8 StyTr2

Considering the limited receptive fields of CNNs, obtaining global information about input images is difficult but is critical for the image style transfer task. The content leak problem also occurs when CNN-based models are adopted for style transfer. Therefore, Deng et al. [38] propose the first transformer-based style transfer model, using its ability for long-range extraction (Fig. 21).
Fig. 19 Overview of the colorization transformer. Reproduced with permission from Ref. [30], © The Author(s) 2021.
The unbiased Style Transfer Transformer framework StyTr2 contains two transformer encoders to obtain domain-specific information. Following encoding, a multilayer transformer decoder generates the output sequences. Moreover, Deng et al. [38] propose a content-aware mechanism to learn the positional encoding based on image semantic features and dynamically expand the positions to suit different image sizes.

5.9 VQGAN

High-resolution image synthesis is a difficult generation problem which aims to generate high-fidelity images within a reasonable time. Convolutional approaches exploit the local structure of the image, while transformer methods are good at establishing long-range interactions. Esser et al. [102] utilize the advantages of both CNNs and transformers to build a high-resolution image generation framework. They propose a variant of VQVAE [36] and adopt adversarial learning to achieve vivid results. The content hidden space consists of a discrete codebook, and different codes in the codebook are combined according to a certain probability to represent the content information. The key to sampling in a discrete space is to predict the distribution of the discrete codes, and the transformer can deal with this issue. Given the first i − 1 codes, the transformer module is used to predict the probability of occurrence of the i-th code.
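The sampling loop can be sketched as follows: given the codes generated so far, the transformer predicts a distribution over the next codebook index, one index is sampled, and the completed grid would finally be decoded back to pixels by the VQ decoder (not shown). The helper name, grid size, and codebook size are illustrative, not those of the taming-transformers code; real implementations also condition the first step on a start token or on conditioning codes.

```python
import torch

@torch.no_grad()
def sample_codes(transformer, seq_len=256, codebook_size=1024, temperature=1.0):
    """Autoregressively sample a grid of discrete codes, one index at a time."""
    codes = torch.zeros(1, 0, dtype=torch.long)          # empty prefix
    for _ in range(seq_len):
        logits = transformer(codes)                      # (1, i+1, codebook): last position = next code
        probs = (logits[:, -1] / temperature).softmax(-1)
        nxt = torch.multinomial(probs, num_samples=1)    # sample the next code
        codes = torch.cat([codes, nxt], dim=1)
    return codes                                         # to be decoded by the VQ decoder

# toy stand-in "transformer": uniform logits over the codebook
toy = lambda codes: torch.zeros(1, codes.shape[1] + 1, 1024)
print(sample_codes(toy).shape)   # torch.Size([1, 256])
```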
5.10 PCT

Unlike CNNs, transformers are inherently permutation invariant when processing a series of points, and are thus suitable for point cloud learning. Guo et al. [39] propose a state-of-the-art transformer-based point cloud model based on offset-attention with an implicit Laplace operator. They enhance the input embedding based on farthest point sampling and nearest neighbor search to better capture the local context in the point cloud.
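Farthest point sampling itself is simple to state; the sketch below (a generic implementation, not the PCT code) greedily picks points that are far from those already chosen, and the selected points would then seed nearest-neighbour grouping for local feature aggregation.

```python
import numpy as np

def farthest_point_sampling(points, k):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set.
    points: (N, 3) array; returns indices of k well-spread points."""
    n = points.shape[0]
    chosen = np.zeros(k, dtype=int)
    dist = np.full(n, np.inf)
    chosen[0] = np.random.randint(n)
    for i in range(1, k):
        # distance of every point to the nearest already-chosen point
        d = np.linalg.norm(points - points[chosen[i - 1]], axis=1)
        dist = np.minimum(dist, d)
        chosen[i] = int(dist.argmax())
    return chosen

pts = np.random.rand(2048, 3)
idx = farthest_point_sampling(pts, 512)
print(idx.shape)   # (512,)
```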
6 Multimodal learning

The above sections cover developments in conventional computer vision. Apart from pure vision tasks, transformer-based models have also achieved promising progress in language-and-vision multimodal tasks, such as visual question answering (VQA) [109, 110], image captioning [111], and image retrieval [112], due to the high performance achieved by NLP transformers. Transformer-based vision-language (V+L) approaches are often pretrained on multiple tasks and fine-tuned on diverse downstream sub-tasks. Inputs of different modalities share an analogous single- or two-stream architecture.

In this section, we start from recent representative transformer-based works on V+L tasks with different frameworks (Section 6.1), then summarise pretraining objectives (Section 6.2), and compare implementation details (Section 6.3).
6.1 Transformer-based V+L works

Most transformer-based V+L works are based on two kinds of structures: the two-stream (each stream for a single modality) framework or the single-stream (a common stream for jointly learning a cross-modal representation) framework. ViLBERT [40] and UNITER [41] are representative works for two- and single-stream frameworks, respectively. Meanwhile, SemVLP [42] unifies the two mainstream architectures for aligning cross-modal semantics.

6.1.1 ViLBERT

ViLBERT [40] is a representative two-stream transformer-based model for V+L. Two separate streams are used for vision and language processing. Figure 23 shows the architecture of ViLBERT. Two parallel BERT-style models operate on image regions and text tokens. Each stream connects a series of transformer blocks (TRM) and co-attentional transformer layers (Co-TRM). As shown in Fig. 22, the Co-TRM layers enable information exchange between modalities, and the modified attention mechanism is the key technical innovation. By exchanging key–value pairs in multi-headed attention, the Co-TRM structure allows for variable network depth for each modality and enables cross-modal connections at different depths.

Fig. 22 (a) Architecture of a standard encoder transformer block. (b) Co-attention transformer layer in ViLBERT. Reproduced with permission from Ref. [40], © The Author(s) 2019.
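The key–value exchange can be written down in a few lines: the visual stream attends with its own queries over the linguistic keys and values, and vice versa. The sketch below is illustrative (module and tensor names are assumptions) and omits the feed-forward sub-layers, residual connections, and layer normalization of a full Co-TRM block; it is not the ViLBERT code.

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Each stream queries the *other* stream's keys and values,
    i.e., the key-value exchange of a co-attention (Co-TRM) layer."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.vis_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_attends_vis = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis_tokens, txt_tokens):
        vis_out, _ = self.vis_attends_txt(vis_tokens, txt_tokens, txt_tokens)
        txt_out, _ = self.txt_attends_vis(txt_tokens, vis_tokens, vis_tokens)
        return vis_out, txt_out     # each followed by its own FFN in a full block

block = CoAttentionBlock()
v = torch.randn(2, 36, 768)        # e.g., 36 region features
t = torch.randn(2, 20, 768)        # e.g., 20 text tokens
v2, t2 = block(v, t)
print(v2.shape, t2.shape)          # torch.Size([2, 36, 768]) torch.Size([2, 20, 768])
```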
6.1.2 UNITER

Chen et al. [41] propose UNITER: UNiversal Image-TExt Representation. It can power heterogeneous downstream V+L tasks with joint multimodal embeddings. As shown in Fig. 24, UNITER first encodes image regions (visual features and bounding box features) and textual words (tokens and positions) into a shared embedding space with image and text embedders. Then, UNITER applies a transformer module to learn the joint embedding of the two modalities through designed pretraining tasks that include classic image–text matching (ITM), masked language modeling (MLM), and masked region modeling (MRM). UNITER uses conditional masking for MLM and MRM, which means masking only one modality while keeping the other untainted. A novel word–region alignment pretraining task via optimal transport is also proposed to encourage fine-grained alignment between words and image regions. The authors consider the matching of word tokens and RoI regions as minimizing the distance between two discrete distributions, where the distance is computed based on optimal transport. UNITER, as a single-stream model, achieved state-of-the-art performance when proposed. VILLA [113], which combines UNITER and adversarial training, achieves higher performance.

6.1.3 SemVLP

Li et al. [42] present a novel V+L framework, SemVLP. It unifies both mainstream architectures. By fusing single- and two-stream architectures, SemVLP utilizes cross-modal semantics. Its framework is detailed in Fig. 25. On the basis of a shared bidirectional transformer encoder with a cross-modal attention module, SemVLP can encode the input text and image at multiple semantic levels. It adopts common pretraining methods with a special training strategy: the single- and two-stream frameworks are updated in each half of the training time for each mini-batch of image–text pairs.
6.2 Multimodal pretraining

Designing reasonable pretraining objectives for transformer-based models, such as masked language modeling (MLM) and next sentence classification from BERT, has brought excellent results on NLP tasks. These methods also work in the cross-modal field with V+L. The key challenge is how to replicate or extend large-scale pretraining to cross-modal methods and how to design novel pretraining objectives for multimodal learning. In this section, we briefly introduce pretraining tasks extended from BERT. These extended approaches include MLM, masked region modeling (MRM), and image–text matching (ITM). We also list other specially designed pretraining tasks for multimodal learning.

6.2.1 Masked language modeling

Most recent V+L works follow BERT in using MLM for cross-modal tasks. UNITER modifies MLM by introducing visual information. Specifically, UNITER attempts to predict masked words based on observation of the surrounding words and all image regions.
InterBERT [114] changes MLM to masked segment modeling: instead of masking random words (or replacing a selected word with a random one), masked segment modeling masks a continuous segment of text.

6.2.2 Image–text matching

As another pretraining task of BERT, next sentence classification has been converted into an ITM problem, which determines whether a sentence and a set of image regions match. This task is widely used in advanced V+L works. InterBERT [114] performs ITM with hard negatives by regarding the image–text pairs in the dataset as positive samples, pairing the images with uncorrelated texts, and regarding those pairs as negative samples. VL-BERT [115] and UnifiedVLP [116] do not use ITM, tending to use other efficient choices like the MRM introduced next.

6.2.3 Masked region modeling

MRM is the dual task of MLM, extending the masking idea to visual input. Researchers have proposed several novel pretraining methods that mask input visual tokens to extend masked modeling to vision. Masked region feature regression (MRFR) is one of these approaches, applied by ViLBERT [40]. ViLBERT trains the model to regress the masked input RoI pooled feature, which is extracted by Faster R-CNN [86]. Most models perform optimization with an L2 loss. VL-BERT [115] also adopts MRM instead of using ITM: it uses masked RoI classification with linguistic clues, predicting the category label of the masked RoI, obtained by Fast R-CNN [117], from the other clues. Alternatively, some models choose masked region classification, which lets the model predict the object semantic class for each masked region. Models are often optimized by cross-entropy loss or KL-divergence to learn the class distribution. These MRM tasks are performed in UNITER and UNIMO [118]. InterBERT [114] also changes the MRM strategy in the visual modality by masking, with zero vectors, objects which have a high proportion of mutual intersection, to avoid information leakage due to overlap between objects. Notably, earlier transformer-based works, such as VisualBERT [119] and B2T2 [120], do not extend MLM to the visual domain.
approaches applied by ViLBERT [40]. ViLBERT Conceptual Captions Dataset (CC) [125] is provided
trains the model to regress the masked input RoI by Google AI and consists of nearly 3.3 million
pooled feature, which is extracted by Faster R- images annotated with captions harvested from the
CNN [86]. Most models perform optimization with web. The SBU Captions Dataset [126] includes
L2 loss. VL-BERT [115] also follows MRFR instead image captions collected from 1 million images from
of using ITM. It uses masked RoI classification with Flickrx . MSCOCO, CC, and SBU all can be used
linguistic clues, predicting the category label of the for image caption tasks. The Visual Genome [127]
masked RoI obtained by Fast R-CNN [117] from is a dataset including images and image content
the other clues. On the contrary, some models semantic information. Visual Genome, VQA [109],
choose masked region classification, which lets the VQAv2 [110], and GQA [128] datasets can all be used
model predict the object semantic class for each for VQA pretraining. Notably, partial datasets are
masked region. Models are often optimized by used as benchmarks simultaneously. Table 6 shows
cross-entropy loss or KL-divergence to learn the the performance of models reported above on different
class distribution. These MRM tasks are performed V+L benchmark datasets. The results are obtained
in UNITER and UNIMO [118]. InterBERT [114] by models fine-tuned on the corresponding datasets.
also changes MRM strategy in the visual modality
by masking objects which have a high proportion 7 Conclusions and discussion
of mutual intersection with zero vectors to avoid
7.1 Backbone design
information leakage due to overlap between objects.
Notably, earlier transformer-based works, such as Section 3 describes several recent developments in
VisualBERT [119] and B2T2 [120], do not extend the backbone design of visual transformers, including
MLM to the visual domain. x https://www.flickr.com
Table 5 Model settings in various papers. COCO refers to MS COCO [129], CC to Conceptual Captions [125], VG to Visual Genome [127], SBU to SBU Captions [126], and OI to OpenImages [130]
Model | Dataset(s) for pre-training | Params | Batch size | Hard-aware | Source (GitHub)
Table 6 Comparison of transformer-based V+L models on VQA [109], GQA [128], Flickr30K [112], COCO Caption [111], NLVR2 [133], SNLI-VE [134], VCR [135], and RefCOCO+ [136] benchmarks
Model | VQA (test-dev, test-std) | GQA (test-dev, test-std) | IR-Flickr30K (R@1, R@5, R@10) | TR-Flickr30K (R@1, R@5, R@10) | COCO Caption (BLEU4, CIDEr) | NLVR2 (dev, test-P) | SNLI-VE (val, test) | VCR (Q/A, QA/R, Q/AR) | RefCOCO+ (val, testA, testB)
ViLBERT [40] 70.55 70.92 — — 58.20 84.90 91.52 — — — — — — — — — 73.3 74.6 54.8 72.34 78.52 58.20
VL-BERT [115] 71.79 72.22 — — — — — — — — — — — — — — 75.8 78.4 59.7 80.31 83.62 75.45
UNITER [41] 73.82 74.02 — — 75.56 94.08 96.76 87.3 98.0 99.2 — — 79.12 79.98 79.39 79.38 77.3 80.8 62.8 84.25 86.34 79.75
Oscar [121] 73.61 73.82 61.58 61.62 — — — — — — 41.7 140.0 79.12 80.37 — — — — — 84.40 86.22 80.00
VILLA [113] 74.69 74.87 76.26 94.24 96.84 87.9 97.5 98.8 — — 79.76 81.47 80.18 80.02 78.9 79.1 60.6 84.40 86.22 80.00
ERNIE-ViL [122] 74.95 75.10 — — 76.66 94.16 96.76 89.2 98.5 99.2 — — — — — — 79.2 83.5 66.3 75.89 82.37 66.91
UNIMO [118] 73.79 74.02 — — — — — — — — 38.6 124.1 — — 80.00 79.10 — — — — — —
VinVL [131] 76.52 76.60 65.05 64.65 75.40 92.90 93.30 58.8 83.5 90.3 41.0 140.9 82.67 83.98 — — — — — — — —
TDEN [123] 72.50 72.80 — — — — — — — — 40.2 133.4 — — — — 75.7 76.4 58.0 — — —
SemVLP [42] 74.52 74.68 62.87 63.62 74.80 93.43 96.12 87.7 98.2 99.3 — — 79.00 79.55 — — — — — — — —
7 Conclusions and discussion

7.1 Backbone design

Section 3 describes several recent developments in the backbone design of visual transformers, including feature map visualization approaches. Recent progress can be technically divided into two main streams: (i) enhancing the capability of visual transformers in modeling spatial structure and locality, such as a better image-to-token module, a pixel-level transformer block, a depth-wise convolution-based pooling layer, and an SW-MSA module, and (ii) boosting the richness of learned visual features and promoting efficient use of parameters, such as conditional position encoding, a message communication scheme between MSA heads, and deep-narrow ViT architectures.

As the first visual transformer was proposed very recently (October 2020), we believe that the potential of the ViT model has not been fully exploited and several research topics are worthy of consideration and effort:
• Advanced designs of basic ViT operations or modules and the corresponding learning schemes for CV tasks are of interest, like injecting prior knowledge of image data or of the computer vision task into the module design or the learning scheme of visual transformer models, and making the transformer more computationally efficient. The versatility of ViT models in additional real-world scenarios, such as aesthetic visual analysis [137–139], face anti-spoofing [140, 141], and point cloud learning [142], is also worthy of exploration.
• The transformer block can be considered from the perspective of NAS. One of the goals of the NAS framework [143–145] is to search for optimal network architectures for a given task without human intervention. Interesting architectures can be considered and practical insights for further developments can be gained by building on a well-designed search space that contains a transformer block. Several recent works have investigated this topic. Wang et al. [146] and So et al. [147] leverage NAS techniques to seek effective and efficient transformer-based architectures automatically. Li et al. [148] propose a novel scheme, BossNAS, to achieve optimal solutions which trade off CNN architectures and transformer blocks.
• Understanding of the working mechanism and theoretical rationale of visual transformers can be enhanced. Several researchers have achieved promising progress in unveiling the power of transformer models, from such perspectives as information bottlenecks [149, 150] and better visualization tools [23, 63].

7.2 High-level vision

In Section 4, we introduce several representative works on object detection. The basic logic follows the line of DETR [24]. PVT [27], which is a general backbone for dense prediction, is also introduced. Several problems still need to be addressed despite the improvements brought by these works. Unlike CNN-based methods, such as Faster R-CNN [86], current transformers for dense prediction tasks suffer from high computation time. Thus, the efficiency of transformers for high-level vision remains a pressing research direction.

7.3 Low-level vision and generation

In Section 5, we introduce some low-level vision and image generation tasks using transformer-based models. They can achieve outstanding results but have difficulty generating large images. Therefore, extending a pure transformer with CNN layers is widely adopted by many works. A pure transformer structure still faces the challenge of high computation time.

7.4 Multimodal learning

In Section 6, we introduce several representative transformer-based models proposed in the past two years for vision and language tasks. We also review mainstream pretraining tasks in the V+L field. Meanwhile, transformer-based models have succeeded on the tasks listed in Table 6, but performance can still be improved:
• Pure transformers may be an alternative choice for the image modality.
• Design of efficient pretraining tasks can lead to better results and performance.

Acknowledgements

We thank the anonymous reviewers for their valuable comments. This work was supported by the National Key R&D Program of China under Grant No. 2020AAA0106200, and by the National Natural Science Foundation of China under Grant Nos. 61832016 and U20B2070.

References

[1] He, K. M.; Zhang, X. Y.; Ren, S. Q.; Sun, J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778, 2016.
[2] Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In: Proceedings of the 36th International Conference on Machine Learning, 2019.
[3] Radosavovic, I.; Kosaraju, R. P.; Girshick, R.; He, K. M.; Dollár, P. Designing network design spaces. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10425–10433, 2020.
[4] Yin, M. H.; Yao, Z. L.; Cao, Y.; Li, X.; Zhang, Z.; Lin, S.; Hu, H. Disentangled non-local neural networks. In: Computer Vision–ECCV 2020. Lecture Notes in Computer Science, Vol. 12360. Vedaldi, A.; Bischof, H.; Brox, T.; Frahm, J. M. Eds. Springer Cham, 191–207, 2020.
[5] Hu, H.; Gu, J. Y.; Zhang, Z.; Dai, J. F.; Wei, Y. C. Relation networks for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3588–3597, 2018.
[6] Wang, X. L.; Girshick, R.; Gupta, A.; He, K. M. Non-local neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7794–7803, 2018.
[7] Hu, H.; Zhang, Z.; Xie, Z. D.; Lin, S. Local relation networks for image recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 3463–3472, 2019.
[8] Yuan, Y.; Huang, L.; Guo, J.; Zhang, C.; Chen, X.; Wang, J. OCNet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916, 2018.
[9] Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; Uszkoreit, J.; Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In: Proceedings of the International Conference on Learning Representations, 2021.
[10] Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4171–4186, 2019.
[11] Chen, M.; Radford, A.; Child, R.; Wu, J.; Jun, H.; Luan, D.; Sutskever, I. Generative pretraining from pixels. In: Proceedings of the 37th International Conference on Machine Learning, 1691–1703, 2020.
[12] Graham, B.; El-Nouby, A.; Touvron, H.; Stock, P.; Joulin, A.; Jégou, H.; Douze, M. LeViT: A vision transformer in ConvNet's clothing for faster inference. arXiv preprint arXiv:2104.01136, 2021.
[13] Tay, Y.; Dehghani, M.; Bahri, D.; Metzler, D. Efficient transformers: A survey. arXiv preprint arXiv:2009.06732, 2020.
[14] Liang, J.; Hu, D.; He, R.; Feng, J. Distill and fine-tune: Effective adaptation from a black-box source model. arXiv preprint arXiv:2104.01539, 2021.
[15] Yuan, L.; Chen, Y.; Wang, T.; Yu, W.; Shi, Y.; Tay, F. E.; Feng, J.; Yan, S. Tokens-to-Token ViT: Training vision transformers from scratch on ImageNet. arXiv preprint arXiv:2101.11986, 2021.
[16] Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in transformer. arXiv preprint arXiv:2103.00112, 2021.
[17] Chu, X. X.; Tian, Z.; Zhang, B.; Wang, X. L.; Wei, X. L.; Xia, H. X.; Shen, C. Conditional positional encodings for vision transformers. arXiv preprint arXiv:2102.10882, 2021.
[18] D'Ascoli, S.; Touvron, H.; Leavitt, M. L.; Morcos, A. S.; Biroli, G.; Sagun, L. ConViT: Improving vision transformers with soft convolutional inductive biases. In: Proceedings of the 38th International Conference on Machine Learning, 2286–2296, 2021.
[19] Zhou, D.; Kang, B.; Jin, X.; Yang, L.; Lian, X.; Hou, Q.; Feng, J. DeepViT: Towards deeper vision transformer. arXiv preprint arXiv:2103.11886, 2021.
[20] Liu, Z.; Lin, Y. T.; Cao, Y.; Hu, H.; Guo, B. N. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021.
[21] Heo, B.; Yun, S.; Han, D.; Chun, S.; Oh, S. J. Rethinking spatial dimensions of vision transformers. arXiv preprint arXiv:2103.16302, 2021.
[22] Li, Y. W.; Zhang, K.; Cao, J. Z.; Timofte, R.; Gool, L. V. LocalViT: Bringing locality to vision transformers. arXiv preprint arXiv:2104.05707, 2021.
[23] Chefer, H.; Gur, S.; Wolf, L. Transformer interpretability beyond attention visualization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 782–791, 2021.
[24] Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In: Computer Vision–ECCV 2020. Lecture Notes in Computer Science, Vol. 12346. Vedaldi, A.; Bischof, H.; Brox, T.; Frahm, J. M. Eds. Springer Cham, 213–229, 2020.
[25] Zhu, X. Z.; Su, W. J.; Lu, L. W.; Li, B.; Dai, J. F. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.
[26] Dai, Z. G.; Cai, B. L.; Lin, Y. G.; Chen, J. Y. UP-DETR: Unsupervised pre-training for object detection with transformers. arXiv preprint arXiv:2011.09094, 2020.
[27] Wang, W.; Xie, E.; Li, X.; Fan, D.-P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122, 2021.
[28] Wang, Y.; Xu, Z.; Wang, X.; Shen, C.; Cheng, B.; Shen, H.; Xia, H. End-to-end video instance segmentation with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8741–8750, 2021.
[29] Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J. M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. arXiv preprint arXiv:2105.15203, 2021.
[30] Kumar, M.; Weissenborn, D.; Kalchbrenner, N. Colorization transformer. In: Proceedings of the 9th International Conference on Learning Representations, 2021.
[31] Liu, B. C.; Song, K. P.; Zhu, Y. Z.; de Melo, G.; Elgammal, A. TIME: Text and image mutual-translation adversarial networks. In: Proceedings of the 35th AAAI Conference on Artificial Intelligence, 2082–2090, 2021.
[32] Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; Sutskever, I. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092, 2021.
[33] Yang, F. Z.; Yang, H.; Fu, J. L.; Lu, H. T.; Guo, B. N. Learning texture transformer network for image super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5790–5799, 2020.
[34] Jiang, Y. F.; Chang, S. Y.; Wang, Z. Y. TransGAN: Two transformers can make one strong GAN. arXiv preprint arXiv:2102.07074, 2021.
[35] Hudson, D. A.; Zitnick, C. L. Generative adversarial transformers. arXiv preprint arXiv:2103.01209, 2021.
[36] Van den Oord, A.; Vinyals, O.; Kavukcuoglu, K. Neural discrete representation learning. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, 6309–6318, 2017.
[37] Wang, Z.; Cun, X.; Bao, J.; Liu, J. Uformer: A general U-shaped transformer for image restoration. arXiv preprint arXiv:2106.03106, 2021.
[38] Deng, Y. Y.; Tang, F.; Pan, X. J.; Dong, W. M.; Xu, C. S. StyTr2: Unbiased image style transfer with transformers. arXiv preprint arXiv:2105.14576, 2021.
[39] Guo, M.-H.; Cai, J.-X.; Liu, Z.-N.; Mu, T.-J.; Martin, R. R.; Hu, S.-M. PCT: Point cloud transformer. Computational Visual Media Vol. 7, No. 2, 187–199, 2021.
[40] Lu, J.; Batra, D.; Parikh, D.; Lee, S. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Proceedings of the 33rd Conference on Neural Information Processing Systems, 13–23, 2019.
[41] Chen, Y.-C.; Li, L. J.; Yu, L. C.; El Kholy, A.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. UNITER: UNiversal image-TExt representation learning. In: Computer Vision–ECCV 2020. Lecture Notes in Computer Science, Vol. 12375. Vedaldi, A.; Bischof, H.; Brox, T.; Frahm, J. M. Eds. Springer Cham, 104–120, 2020.
[42] Li, C. L.; Yan, M.; Xu, H. Y.; Luo, F. L.; Huang, S. F. SemVLP: Vision-language pre-training by aligning semantics at multiple levels. arXiv preprint arXiv:2103.07829, 2021.
[43] Zhang, R.; Isola, P.; Efros, A. A. Colorful image colorization. In: Computer Vision–ECCV 2016. Lecture Notes in Computer Science, Vol. 9907. Leibe, B.; Matas, J.; Sebe, N.; Welling, M. Eds. Springer Cham, 649–666, 2016.
[44] Zhang, R.; Zhu, J.-Y.; Isola, P.; Geng, X. Y.; Lin, A. S.; Yu, T. H.; Efros, A. A. Real-time user-guided image colorization with learned deep priors. arXiv preprint arXiv:1705.02999, 2017.
[45] Su, J.-W.; Chu, H.-K.; Huang, J.-B. Instance-aware image colorization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7965–7974, 2020.
[46] Pang, L.; Lan, Y.; Guo, J.; Xu, J.; Wan, S.; Cheng, X. Text matching as image recognition. In: Proceedings of the 30th AAAI Conference on Artificial Intelligence, 2793–2799, 2016.
[47] Dong, C.; Loy, C. C.; He, K. M.; Tang, X. O. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence Vol. 38, No. 2, 295–307, 2016.
[48] Zhang, Y. L.; Tian, Y. P.; Kong, Y.; Zhong, B. N.; Fu, Y. Residual dense network for image super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2472–2481, 2018.
[49] Haris, M.; Shakhnarovich, G.; Ukita, N. Deep back-projection networks for super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1664–1673, 2018.
[50] Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; Abbeel, P. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. arXiv preprint arXiv:1606.03657, 2016.
[51] Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved techniques for training GANs. arXiv preprint arXiv:1606.03498, 2016.
[52] Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. arXiv preprint arXiv:1706.08500, 2017.
[53] Karras, T.; Laine, S.; Aila, T. M. A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4396–4405, 2019.
[54] Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; Courville, A. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028, 2017.
[55] Bebis, G.; Georgiopoulos, M. Feed-forward neural networks. IEEE Potentials Vol. 13, No. 4, 27–31, 1994.
[56] Ba, J. L.; Kiros, J. R.; Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[57] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, 6000–6010, 2017.
[58] Hendrycks, D.; Gimpel, K. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.
[59] Kitaev, N.; Kaiser, L.; Levskaya, A. Reformer: The efficient transformer. In: Proceedings of the International Conference on Learning Representations, 2020.
[60] Choromanski, K. M.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J. Q.; Mohiuddin, A.; Kaiser, L. et al. Rethinking attention with performers. In: Proceedings of the International Conference on Learning Representations, 2021.
[61] Wang, S.; Li, B.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
[62] Abnar, S.; Zuidema, W. Quantifying attention flow in transformers. arXiv preprint arXiv:2005.00928, 2020.
[63] Voita, E.; Talbot, D.; Moiseev, F.; Sennrich, R.; Titov, I. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 5797–5808, 2019.
[64] Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S. A.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M. et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision Vol. 115, No. 3, 211–252, 2015.
[65] Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jegou, H. Training data-efficient image transformers & distillation through attention. In: Proceedings of the 38th International Conference on Machine Learning, 10347–10357, 2021.
[66] Han, Y. Z.; Huang, G.; Song, S. J.; Yang, L.; Wang, Y. L. Dynamic neural networks: A survey. arXiv preprint arXiv:2102.04906, 2021.
[67] Xu, W.; Xu, Y.; Chang, T.; Tu, Z. Co-scale conv-attentional image transformers. arXiv preprint arXiv:2104.06399, 2021.
[68] Dong, X. Y.; Bao, J. M.; Chen, D. D.; Zhang, W. M.; Yu, N. H.; Yuan, L.; Chen, D.; Guo, B. CSWin transformer: A general vision transformer backbone with cross-shaped windows. arXiv preprint arXiv:2107.00652, 2021.
[69] Huang, Z. L.; Wang, X. G.; Huang, L. C.; Huang, C.; Wei, Y. C.; Liu, W. CCNet: Criss-cross attention for semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 603–612, 2019.
[70] Hou, Q. B.; Zhang, L.; Cheng, M. M.; Feng, J. S. Strip pooling: Rethinking spatial pooling for scene parsing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4002–4011, 2020.
[71] Touvron, H.; Cord, M.; Sablayrolles, A.; Synnaeve, G.; Jegou, H. Going deeper with image transformers. arXiv preprint arXiv:2103.17239, 2021.
[72] Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision, 618–626, 2017.
[73] Binder, A.; Montavon, G.; Lapuschkin, S.; Müller, K.-R.; Samek, W. Layer-wise relevance propagation for neural networks with local renormalization layers. In: Artificial Neural Networks and Machine Learning–ICANN 2016. Lecture Notes in Computer Science, Vol. 9887. Villa, A.; Masulli, P.; Pons Rivero, A. Eds. Springer Cham, 63–71, 2016.
[74] Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P. H. et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6881–6890, 2021.
[75] Duke, B.; Ahmed, A.; Wolf, C.; Aarabi, P.; Taylor, G. W. SSTVOS: Sparse spatiotemporal transformers for video object segmentation. arXiv preprint arXiv:2101.08833, 2021.
[76] Chen, J. N.; Lu, Y. Y.; Yu, Q. H.; Luo, X. D.; Zhou, Y. Y. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021.
[77] Ye, L. W.; Rochan, M.; Liu, Z.; Wang, Y. Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10494–10503, 2019.
[78] Wang, H.; Zhu, Y.; Adam, H.; Yuille, A.; Chen, L.-C. MaX-DeepLab: End-to-end panoptic segmentation with mask transformers. arXiv preprint arXiv:2012.00759, 2020.
[79] Durner, M.; Boerdijk, W.; Sundermeyer, M.; Friedl, W.; Marton, Z.-C.; Triebel, R. Unknown object segmentation from stereo images. arXiv preprint arXiv:2103.06796, 2021.
[80] Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.-E.; Sheikh, Y. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence Vol. 43, No. 1, 172–186, 2021.
[81] Simon, T.; Joo, H.; Matthews, I.; Sheikh, Y. Hand keypoint detection in single images using multiview bootstrapping. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4645–4653, 2017.
[82] Cao, Z.; Simon, T.; Wei, S.-E.; Sheikh, Y. Realtime multi-person 2D pose estimation using part affinity fields. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1302–1310, 2017.
[83] Fang, H.-S.; Xie, S. Q.; Tai, Y.-W.; Lu, C. W. RMPE: Regional multi-person pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision, 2353–2362, 2017.
[84] Zhang, F.; Zhu, X. T.; Dai, H. B.; Ye, M.; Zhu, C. Distribution-aware coordinate representation for human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7091–7100, 2020.
[85] Sun, K.; Xiao, B.; Liu, D.; Wang, J. D. Deep high-resolution representation learning for human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5686–5696, 2019.
[86] Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497, 2015.
[87] Cai, Z. W.; Vasconcelos, N. Cascade R-CNN: High quality object detection and instance segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence Vol. 43, No. 5, 1483–1498, 2021.
[88] Lin, T. Y.; Goyal, P.; Girshick, R.; He, K. M.; Dollár, P. Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, 2999–3007, 2017.
[89] Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
[90] Tian, Z.; Shen, C. H.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 9626–9635, 2019.
[91] Stewart, R.; Andriluka, M.; Ng, A. Y. End-to-end people detection in crowded scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2325–2333, 2016.
[92] Hosang, J.; Benenson, R.; Schiele, B. Learning non-maximum suppression. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6469–6477, 2017.
[93] Rezatofighi, S. H.; Kaskman, R.; Motlagh, F. T.; Shi, Q. F.; Cremers, D.; Leal-Taixé, L.; Reid, I. Deep perm-set net: Learn to predict sets with unknown permutation and cardinality using deep neural networks. arXiv preprint arXiv:1805.00613, 2018.
[94] Pan, X. J.; Tang, F.; Dong, W. M.; Gu, Y.; Song, Z. C.; Meng, Y. P.; Xu, P.; Deussen, O.; Xu, C. Self-supervised feature augmentation for large image object detection. IEEE Transactions on Image Processing Vol. 29, 6745–6758, 2020.
[95] Pan, X.; Gao, Y.; Lin, Z.; Tang, F.; Dong, W.; Yuan, H.; Huang, F.; Xu, C. Unveiling the potential of structure preserving for weakly supervised object localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11642–11651, 2021.
[96] Pan, X. J.; Ren, Y. Q.; Sheng, K. K.; Dong, W. M.; Yuan, H. L.; Guo, X. W.; Ma, C.; Xu, C. Dynamic refinement network for oriented and densely packed object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11204–11213, 2020.
[97] Chu, X. X.; Tian, Z.; Wang, Y. Q.; Zhang, B.; Shen, C. H. Twins: Revisiting spatial attention design in vision transformers. arXiv preprint arXiv:2104.13840, 2021.
[98] Beal, J.; Kim, E.; Tzeng, E.; Park, D. H.; Kislyuk, D. Toward transformer-based object detection. arXiv preprint arXiv:2012.09958, 2020.
[99] Dai, J. F.; Qi, H. Z.; Xiong, Y. W.; Li, Y.; Zhang, G. D.; Hu, H.; Wei, Y. Deformable convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, 764–773, 2017.
[100] He, K. M.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, 2980–2988, 2017.
[101] Chen, H.; Wang, Y.; Guo, T.; Xu, C.; Deng, Y.; Liu, Z.; Ma, S.; Xu, C.; Xu, C.; Gao, W. Pre-trained image processing transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12299–12310, 2021.
[102] Esser, P.; Rombach, R.; Ommer, B. Taming transformers for high-resolution image synthesis. arXiv preprint arXiv:2012.09841, 2020.
[103] Kaiser, L.; Bengio, S. Can active memory replace attention? In: Proceedings of the 30th International Conference on Neural Information Processing Systems, 3781–3789, 2016.
[104] Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence Vol. 39, No. 4, 652–663, 2016.
[105] Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3156–3164, 2015.
[106] Rolfe, J. T. Discrete variational autoencoders. arXiv preprint arXiv:1609.02200, 2016.
[107] Goodfellow, I. J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. arXiv preprint arXiv:1406.2661, 2014.
[108] Ho, J.; Kalchbrenner, N.; Weissenborn, D.; Salimans, T. Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180, 2019.
[109] Antol, S.; Agrawal, A.; Lu, J. S.; Mitchell, M.; Batra, D.; Zitnick, C. L.; Parikh, D. VQA: Visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision, 2425–2433, 2015.
[110] Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; Parikh, D. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6325–6334, 2017.
[111] Chen, X. L.; Fang, H.; Lin, T.-Y.; Vedantam, R.; Gupta, S.; Dollar, P.; Zitnick, C. L. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
[112] Young, P.; Lai, A.; Hodosh, M.; Hockenmaier, J. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics Vol. 2, 67–78, 2014.
[113] Gan, Z.; Chen, Y.-C.; Li, L.; Zhu, C.; Cheng, Y.; Liu, J. Large-scale adversarial training for vision-and-language representation learning. In: Advances in Neural Information Processing Systems, Vol. 33. Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M. F.; Lin, H. Eds. Curran Associates, Inc., 6616–6628, 2020.
[114] Lin, J. Y.; Yang, A.; Zhang, Y. C.; Liu, J.; Yang, H. X. InterBERT: Vision-and-language interaction for multi-modal pretraining. arXiv preprint arXiv:2003.13198, 2020.
[115] Su, W.; Zhu, X.; Cao, Y.; Li, B.; Lu, L.; Wei, F.; Dai, J. VL-BERT: Pre-training of generic visual-linguistic representations. In: Proceedings of the International Conference on Learning Representations, 2020.
[116] Zhou, L. W.; Palangi, H.; Zhang, L.; Hu, H. D.; Corso, J.; Gao, J. F. Unified vision-language pre-training for image captioning and VQA. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, No. 7, 13041–13049, 2020.
[117] Girshick, R. Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, 1440–1448, 2015.
[118] Li, W.; Gao, C.; Niu, G. C.; Xiao, X. Y.; Wang, H. F. UNIMO: Towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv preprint arXiv:2012.15409, 2020.
[119] Li, L. H.; Yatskar, M.; Yin, D.; Hsieh, C. J.; Chang, K. W. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
[120] Alberti, C.; Ling, J.; Collins, M.; Reitter, D. Fusion of detected objects in text for visual question answering. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2131–2140, 2019.
[121] Li, X. J.; Yin, X.; Li, C. Y.; Zhang, P. C.; Hu, X. W.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F. et al. OSCAR: Object-semantics aligned pre-training for vision-language tasks. In: Computer Vision–ECCV 2020. Lecture Notes in Computer Science, Vol. 12375. Vedaldi, A.; Bischof, H.; Brox, T.; Frahm, J. M. Eds. Springer Cham, 121–137, 2020.
[122] Yu, F.; Tang, J.; Yin, W.; Sun, Y.; Tian, H.; Wu, H.; Wang, H. ERNIE-ViL: Knowledge enhanced vision-language representations through scene graph. In: Proceedings of the AAAI Conference on Artificial Intelligence, 2021.
[123] Li, Y.; Pan, Y.; Yao, T.; Chen, J.; Mei, T. Scheduled sampling in vision-language pretraining with decoupled encoder–decoder network. In: Proceedings of the AAAI Conference on Artificial Intelligence, 8518–8526, 2021.
[124] Tan, H.; Bansal, M. LXMERT: Learning cross-modality encoder representations from transformers. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 5100–5111, 2019.
[125] Sharma, P.; Ding, N.; Goodman, S.; Soricut, R. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2556–2565, 2018.
[126] Ordonez, V.; Kulkarni, G.; Berg, T. L. Im2Text: Describing images using 1 million captioned photographs. In: Proceedings of the 24th International Conference on Neural Information Processing Systems, 1143–1151, 2011.
[127] Krishna, R.; Zhu, Y. K.; Groth, O.; Johnson, J.; Hata, K. J.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.-J.; Shamma, D. A. et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision Vol. 123, No. 1, 32–73, 2017.
[128] Hudson, D. A.; Manning, C. D. GQA: A new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6693–6702, 2019.
[129] Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollar, P.; Zitnick, C. L. Microsoft COCO: Common objects in context. In: Computer Vision–ECCV 2014. Lecture Notes in Computer Science, Vol. 8693. Fleet, D.; Pajdla, T.; Schiele, B.; Tuytelaars, T. Eds. Springer Cham, 740–755, 2014.
[130] Kuznetsova, A.; Rom, H.; Alldrin, N.; Uijlings, J.; Krasin, I.; Pont-Tuset, J.; Kamali, S.; Popov, S.; Malloci, M.; Kolesnikov, A. et al. The open images dataset V4. International Journal of Computer Vision Vol. 128, No. 7, 1956–1981, 2020.
[131] Zhang, P.; Li, X.; Hu, X.; Yang, J.; Zhang, L.; Wang, L.; Choi, Y.; Gao, J. VinVL: Revisiting visual representations in vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5579–5588, 2021.
[132] Hu, R.; Singh, A. UniT: Multimodal multitask learning with a unified transformer. arXiv preprint arXiv:2102.10772, 2021.
[133] Suhr, A.; Zhou, S.; Zhang, A.; Zhang, I.; Bai, H. J.; Artzi, Y. A corpus for reasoning about natural language grounded in photographs. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 6418–6428, 2019.
[134] Xie, N.; Lai, F.; Doran, D.; Kadav, A. Visual entailment: A novel task for fine-grained image understanding. arXiv preprint arXiv:1901.06706, 2019.
[135] Zellers, R.; Bisk, Y.; Farhadi, A.; Choi, Y. From recognition to cognition: Visual commonsense reasoning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6713–6724, 2019.
[136] Kazemzadeh, S.; Ordonez, V.; Matten, M.; Berg, T. ReferItGame: Referring to objects in photographs of natural scenes. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 787–798, 2014.
[137] Sheng, K. K.; Dong, W. M.; Ma, C. Y.; Mei, X.; Huang, F. Y.; Hu, B.-G. Attention-based multi-patch aggregation for image aesthetic assessment. In: Proceedings of the 26th ACM International Conference on Multimedia, 879–886, 2018.
[138] Sheng, K. K.; Dong, W. M.; Chai, M. L.; Wang, G. H.; Zhou, P.; Huang, F. Y.; Hu, B.-G.; Ji, R.; Ma, C. Revisiting image aesthetic assessment via self-supervised feature learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, No. 4, 5709–5716, 2020.
[139] Sheng, K. K.; Dong, W. M.; Huang, H. B.; Chai, M. L.; Zhang, Y.; Ma, C. Y.; Hu, B.-G. Learning to assess visual aesthetics of food images. Computational Visual Media Vol. 7, No. 1, 139–152, 2021.
[140] Zhang, S. F.; Wang, X. B.; Liu, A.; Zhao, C. X.; Wan, J.; Escalera, S.; Shi, H.; Wang, Z.; Li, S. Z. A dataset and benchmark for large-scale multi-modal face anti-spoofing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 919–928, 2019.
[141] Chen, Z.; Yao, T.; Sheng, K.; Ding, S.; Tai, Y.; Li, J.; Huang, F.; Jin, X. Generalizable representation learning for mixture domain face anti-spoofing. In: Proceedings of the AAAI Conference on Artificial Intelligence, 1132–1139, 2021.
[142] Zhao, H.; Jiang, L.; Jia, J.; Torr, P.; Koltun, V. Point transformer. arXiv preprint arXiv:2012.09164, 2020.
[143] Zoph, B.; Le, Q. V. Neural architecture search with reinforcement learning. In: Proceedings of the International Conference on Learning Representations, 2017.
[144] Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q. V. Learning transferable architectures for scalable image recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8697–8710, 2018.
[145] Real, E.; Aggarwal, A.; Huang, Y. P.; Le, Q. V. Regularized evolution for image classifier architecture search. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, 4780–4789, 2019.
[146] Wang, H. R.; Wu, Z. H.; Liu, Z. J.; Cai, H.; Zhu, L. G.; Gan, C.; Han, S. HAT: Hardware-aware transformers for efficient natural language processing. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7675–7688, 2020.
[147] So, D.; Le, Q.; Liang, C. The evolved transformer. In: Proceedings of the 36th International Conference on Machine Learning, 5877–5886, 2019.
[148] Li, C. L.; Tang, T.; Wang, G. R.; Peng, J. F.; Chang, X. J. BossNAS: Exploring hybrid CNN-transformers with block-wisely self-supervised neural architecture search. arXiv preprint arXiv:2103.12424, 2021.
[149] Schulz, K.; Sixt, L.; Tombari, F.; Landgraf, T. Restricting the flow: Information bottlenecks for attribution. In: Proceedings of the International Conference on Learning Representations, 2019.
[150] Jiang, Z.; Tang, R.; Xin, J.; Lin, J. Inserting information bottleneck for attribution in transformers. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing: Findings, 3850–3857, 2020.

Minxuan Lin received his B.Sc. degree in computer science and technology from the Ocean University of China in 2018. He is currently a postgraduate of NLPR. His research interests include computational visual media and machine learning.

Yingying Deng received her B.Sc. degree in automation from the University of Science and Technology, Beijing in 2017. She is currently working towards her Ph.D. degree in NLPR. Her research interests include computational visual media and machine learning.

Kekai Sheng received his Ph.D. degree from NLPR in 2019. He received his B.Eng. degree in telecommunication engineering from the University of Science and Technology, Beijing in 2014. He is currently a research engineer at Youtu Lab, Tencent Inc. His research interests include domain adaptation, neural architecture search, and AutoML.
Weiming Dong is a professor in NLPR. He received his B.Eng. and M.S. degrees in computer science in 2001 and 2004 from Tsinghua University. He received his Ph.D. degree in information technology from the University of Lorraine, France, in 2007. His research interests include visual media synthesis and evaluation. Weiming Dong is a member of the ACM and IEEE.

Feiyue Huang is the director of the Youtu Lab, Tencent Inc. He received his B.Sc. and Ph.D. degrees in computer science in 2001 and 2008 respectively, both from Tsinghua University, China. His research interests include image understanding and face recognition.

Changsheng Xu is a professor in NLPR. His research interests include multimedia content analysis, indexing and retrieval, pattern recognition, and computer vision. Prof. Xu has served as associate editor, guest editor, general chair, program chair, area/track chair, special session organizer, session chair and TPC member for over 20 prestigious IEEE and ACM multimedia journals, conferences, and workshops. Currently he is the editor-in-chief of Multimedia Systems. Changsheng Xu is an IEEE Fellow, IAPR Fellow, and ACM Distinguished Scientist.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.