
Vision Transformers: State of the Art and Research Challenges

Bo-Kai Ruan, Hong-Han Shuai, Wen-Huang Cheng
National Yang Ming Chiao Tung University
justin.ee08@nycu.edu.tw, hhshuai@nycu.edu.tw, whcheng@nycu.edu.tw

arXiv:2207.03041v1 [cs.CV] 7 Jul 2022

Abstract

Transformers have achieved great success in natural language processing. Due to the powerful capability of the self-attention mechanism in transformers, researchers have developed vision transformers for a variety of computer vision tasks, such as image recognition, object detection, image segmentation, pose estimation, and 3D reconstruction. This paper presents a comprehensive overview of the literature on different architecture designs and training tricks (including self-supervised learning) for vision transformers. Our goal is to provide a systematic review together with the open research opportunities.

1 Introduction
Transformers [Vaswani et al., 2017] originally achieved great success in natural language processing [Devlin et al., 2018; Radford et al., 2018] and can be adopted in various applications, including sentiment classification, machine translation, word prediction, and summarization. The key feature of transformers is the self-attention mechanism, which helps a model learn the global context and enables it to acquire long-range dependencies. Motivated by this success in natural language processing, transformers have been adopted for computer vision tasks, leading to the development of vision transformers. Vision transformers have become prevalent in recent years and have reached considerable success in many fields such as image classification [Dosovitskiy et al., 2021; Liu et al., 2021], video classification [Arnab et al., 2021], object detection [Carion et al., 2020; Fang et al., 2021], semantic segmentation [Xie et al., 2021a; Li et al., 2021b], and pose estimation [Zhu et al., 2021].

Despite the success of the architecture, there are still several drawbacks that should be addressed, e.g., data hunger and the lack of locality and cross-patch information. Therefore, a recent line of research has been proposed to further enhance vision transformers. This survey introduces important ideas for solving these problems and aims to shed light on these topics for future research. In addition, as self-supervised learning methods play an important role in vision transformers, we also present several self-supervised learning methods employed on vision transformers. The rest of the paper is organized as follows. We start with the preliminaries of transformers and vision transformers, and then introduce variant architectures of vision transformers. Afterward, the training tricks used in vision transformers are presented, together with self-supervised learning. Finally, we conclude this paper and discuss future research directions and challenges.

2 Preliminary
2.1 Self-attention
The attention mechanism is one of the most beneficial breakthroughs in deep learning research; it measures the importance of a feature with respect to its contribution to the final result. Using an attention mechanism often teaches a model to concentrate on specific features. For self-attention, the input and output sizes are the same, while the self-attention mechanism allows interaction between the inputs and discovers which of them should receive more attention. Afterward, each output is enhanced by the weighted inputs according to the attention scores. For instance, given a sentence stating "a dog is saved from a pond after it fell through the ice," self-attention can enhance the embedding of "it" by attending to "dog". Self-attention is designed to help the model learn the global context through its long-range dependencies.

Fig. 1a illustrates the self-attention module. Let X ∈ R^{L×d} denote a sequence of vectors (x1, x2, ..., xL), where d is the embedding dimension of each vector. To help the model learn the relations between the vectors, query, key, and value matrices are projected from X with linear layers and denoted by Q, K, and V, respectively. For instance, the query matrix is obtained by projecting X with a linear layer Wq, i.e., Q = XWq. Specifically, the attention weights E are calculated as the normalized product of Q and K^T with the softmax function, i.e.,

E = softmax(QK^T / √d).   (1)

The softmax function normalizes the attention weight of each query-key pair to (0, 1), where 1 means the most important and 0 means useless information. Finally, the output features O are enhanced by applying the attention weights E to V as follows:

O = Attention(Q, K, V) = EV = softmax(QK^T / √d) V.   (2)
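To make Eqs. (1) and (2) concrete, here is a minimal PyTorch sketch of single-head scaled dot-product self-attention; the module name and the omission of masking and dropout are illustrative assumptions rather than details taken from any particular model in this survey.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention (Eqs. 1-2)."""
    def __init__(self, d: int):
        super().__init__()
        # Linear projections W_q, W_k, W_v that map X to Q, K, V.
        self.w_q = nn.Linear(d, d, bias=False)
        self.w_k = nn.Linear(d, d, bias=False)
        self.w_v = nn.Linear(d, d, bias=False)
        self.scale = d ** 0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, L, d) -- a sequence of L vectors of dimension d.
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        # E = softmax(Q K^T / sqrt(d)); shape (batch, L, L).
        e = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        # O = E V; each output is a weighted combination of the values.
        return e @ v

# Usage: 4 tokens with embedding dimension 16.
x = torch.randn(1, 4, 16)
print(SelfAttention(d=16)(x).shape)  # torch.Size([1, 4, 16])
```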
Figure 1: Illustration of (a) self-attention and (b) image to sequence with P = 3, C = 1 and embedding dimension d. (Diagram not reproduced; panel (a) shows the Q/K/V projections followed by the scaled matrix multiplication and softmax, and panel (b) shows an image in R^{H×W} being flattened, projected into patch embeddings, prepended with a cls token, and combined with positional embeddings.)

Figure 2: Architecture of Transformer. (Diagram not reproduced; it depicts the encoder and decoder stacks built from multi-head attention, FFN, and add & norm sub-layers, with positional encodings added to the input and output embeddings.)
Multi-Head Attention. In order to learn different representations at different positions, the input X is transformed into nh different representations (heads), denoted by h1, ..., hi, ..., hnh. The attention is first computed for each head hi from Qi, Ki, and Vi, which are obtained with projection matrices Wqi, Wki, and Wvi, respectively:

hi = Attention(Qi, Ki, Vi).   (3)

Afterward, all the heads are concatenated to form the multi-head representation H as follows:

H = Concat(h1, h2, ..., hnh).   (4)

Finally, H is transformed back to dimension d with the projection matrix W^O, i.e., O = HW^O. As the number of relations between inputs is usually unknown, multi-head attention is commonly used to capture different relations between input elements in a data-driven manner.

2.2 Transformer
The Transformer, proposed by Vaswani et al. [Vaswani et al., 2017], is ubiquitous in NLP tasks [Devlin et al., 2018; Brown et al., 2020; Peyrard et al., 2021]. Fig. 2 illustrates the architecture of the Transformer. A single transformer block can be separated into an Encoder and a Decoder, which can both be further decomposed into 1) self-attention, 2) position-wise feed-forward networks, and 3) positional encoding.

Position-wise Feed-Forward Networks. Feed-forward networks (FFN) are mainly composed of fully-connected layers and can be written as:

FFN(X) = δ(XW1 + b1)W2 + b2,   (5)

where W1 ∈ R^{d×h} and W2 ∈ R^{h×d} are the weights, b1 and b2 are the bias terms, and δ is the ReLU activation function. h is the hidden dimension and is usually set to 4d.

Positional Encoding. Since self-attention relaxes the sequential order, transformers use positional encoding to maintain the ordinal information. The authors propose to use sine and cosine functions with different frequencies to derive the positional encoding, which allows the model to learn relative positions, since any positional encoding can be linearly combined from other positional encodings. Specifically, the i-th dimension of the positional encoding at position pos, denoted by PE(pos, i), can be calculated as follows:

PE(pos, i) = sin(pos / 10000^{i/d})      if i is even,
PE(pos, i) = cos(pos / 10000^{(i-1)/d})  otherwise,   (6)

where d is the dimension of the input features.

Encoder. A single encoder block has two sub-layers, i.e., multi-head attention and position-wise feed-forward networks. Each sub-layer is followed by a residual connection [He et al., 2016] and layer normalization [Ba et al., 2016]. The outputs of the encoder are sent into the decoder as K and V in the second multi-head attention layer.

Decoder. Each decoder block is made of three sub-layers. The first two are multi-head attention blocks, and the final one is a position-wise feed-forward network. Similar to the encoder block, all three layers are followed by a residual connection and layer normalization as well. Since the decoder is auto-regressive, i.e., it sequentially predicts a new result based only on previous predictions, the multi-head attention layers in the decoder utilize a mask operation to attend only to the already predicted results, preventing a violation of causality.
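Returning to the multi-head attention of Eqs. (3) and (4), the sketch below realizes it in PyTorch by splitting the embedding dimension across heads; the class name, the per-head dimension d/nh, and the per-head scaling by √(d/nh) are common conventions assumed for illustration, not prescriptions from this paper.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Multi-head self-attention (Eqs. 3-4): per-head attention, concat, projection W^O."""
    def __init__(self, d: int, n_heads: int):
        super().__init__()
        assert d % n_heads == 0, "embedding dim must be divisible by the number of heads"
        self.n_heads = n_heads
        self.d_head = d // n_heads
        self.w_q = nn.Linear(d, d, bias=False)
        self.w_k = nn.Linear(d, d, bias=False)
        self.w_v = nn.Linear(d, d, bias=False)
        self.w_o = nn.Linear(d, d, bias=False)  # projection W^O back to dimension d

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, L, d = x.shape
        # Project and reshape to (batch, heads, L, d_head): one Q_i, K_i, V_i per head.
        def split(t):
            return t.view(b, L, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        # h_i = Attention(Q_i, K_i, V_i) for every head in parallel (scaled per head).
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = attn @ v                         # (batch, heads, L, d_head)
        # H = Concat(h_1, ..., h_nh), then O = H W^O.
        h = heads.transpose(1, 2).reshape(b, L, d)
        return self.w_o(h)

x = torch.randn(2, 8, 64)
print(MultiHeadSelfAttention(d=64, n_heads=8)(x).shape)  # torch.Size([2, 8, 64])
```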
2.3 Vision Transformer (ViT)
ViT [Dosovitskiy et al., 2021] replaces all the CNN structures with several transformer layers, reaches state-of-the-art performance on image recognition, and is known as the pioneer of vision transformers. ViT contains three segments: 1) patch and positional embedding, 2) the Transformer encoder, and 3) a multi-layer perceptron (MLP) head.

Patch and Positional Embedding. In order to transform an image X ∈ R^{C×H×W} into a 1D sequence of vectors, the image is reshaped into Xp ∈ R^{N×(P^2 C)} as illustrated in Fig. 1b, where P × P is the resolution of each patch and the number of patches is N = HW/P^2. The patches are then projected into patch embeddings by a linear layer. Akin to BERT [Devlin et al., 2018], a learnable class token xcls ∈ R^{1×(P^2 C)} is added to Xp. Afterward, the positional embeddings are added to the output [xcls; Xp] to obtain the positional information.

Transformer Encoder. Instead of using both the encoder and the decoder, ViT only utilizes the Transformer encoder, since the goal is to find a better representation rather than autoregressive prediction. The image patches are transformed into sequences and then sent into the encoder. The only difference is that layer normalization is added before the sub-layers (pre-norm) [Xiong et al., 2020].

Class Tokens. The class token is trained to include the class information and can be used to represent the entire set of features. Therefore, it can be used to classify the data into different categories.

MLP Head. The MLP head is commonly used for the downstream tasks. Specifically, let Z = [Z0, Z1, ..., ZN] denote the output of the Transformer encoder. As the class token prepended to the input sequence can be regarded as the image representation, the final result y is determined based on the first token Z0 of the encoder output through a single MLP layer, i.e.,

Z = TransEncoder([xcls; Xp]),   (7)
y = MLP(Z0).   (8)
are introduced in Sec. 3.2, which aim to diversify the fea- four convolutional layers, which the model can extract the
ture representations. Finally, hierarchical-based models are local features at the beginning and also reduce the input size.
On top of that, LeViT further lowers the input size in some 3.3 Hierarchical-based Models
attention blocks, which speeds up the inference time. These
The original version of ViT is notorious for its heavy com-
advantages help strike the balance between accuracy, data ef-
putation. Significantly, the cost raises when the input size
ficiency and the training speed.
increases. Therefore, decreasing the feature size would def-
CeiT [Yuan et al., 2021a] obtains the embedding features initely help reduce the training time. Below are some ap-
directly from convolutional blocks to add the locality in the proaches to solve this issue.
beginning of the model, similar to LeViT. Moreover, the au-
thors include a depth-wise convolutional layer in FFN to en- PVT [Wang et al., 2021] is mainly designed for solving
courage the model to extract the local features. In order dense prediction problems (e.g., object detection and seman-
to exchange the class information in different layers, CeiT tic segmentation). It uses Spatial Reduction Layer to reduce
utilizes Layer-wise Class-token Attention to collect different the computation by decreasing the dimension of K and V .
class representations by computing the self-attention on class The feature size is reduced by a patch embedding layer at
tokens. the beginning of each stage. Due to its pyramid structure,
the model can generate multi-scale feature maps and can be
CvT [Yuan et al., 2021a] also obtains the embeddings with trained faster with its smaller feature size.
convolutional layers. Besides, CvT uses convolutional layers
to create query, key, and value before self-attention operation, PiT [Heo et al., 2021] includes Pooling Layer, which uses
which provides the local spatial information. Furthermore, a depth-wise convolutional layer to achieve the dimension re-
CvT also shrinks the size of the key and the value by using duction. The authors also test PiT on different robustness
stride = 2 to accelerate the computation. benchmarks and get outstanding performance.
LocalViT [Li et al., 2021c] is designed to use convolutional Swin-Transformer [Liu et al., 2021] is proposed to derive
layers in FFN to extract the local features in every transformer a general-purpose backbone based on ViT, which can be used
blocks. The authors also try to apply different activation func- for different applications, e.g., image classification and se-
tions (ReLU6, h-swish) and architectures (SE-Block, ECA mantic segmentation. Since applying self-attention pixel-by-
module) in FFN layers to improve the performance. pixel results in a tremendous computation complexity, Swin-
CCT [Hassani et al., 2021] is proposed to solve the data- Transformer forms a hierarchical structure, which merges
hungry problem to make vision transformers perform well on patches after each Swin-Transformer block and therefore has
small datasets. Specifically, CCT gets the embeddings with approximate linear computation time complexity to input im-
convolutional layers and makes the input shape flexible by age size due to the computation of self-attention only within
removing the positional embeddings. On top of that, CCT in- each local shifted window.
corporates the Sequence Pooling at the end of the transformer Twins-SVT [Chu et al., 2021] also computes the attention
layers to compute the weights of the output sequence. within a shifted window, while the global attention is evalu-
ated after the local window attention. This helps each win-
3.2 Featured-based Models dow retains the outside-window information, similar to the
These models put their efforts on diversifying the features, overlapping strategy. Likewise, the patches are merged at the
e.g., token maps, attention maps in vision transformers. Hav- beginning of the stages to form the hierarchical shape.
ing distinct feature maps indicates that the model can extract
various features, which allows the model to perform well. NexT [Zhang et al., 2021] is created on a different strategy
to perform the hierarchical computation. Specifically, an im-
DeepViT [Zhou et al., 2021] is created after the authors ex- age is first partitioned into n blocks and every four blocks are
amine the attention maps of ViT and find out that the atten- merged after a transformer layer. Moreover, Gradient-based
tion collapsing takes place in the deeper layer. This problem Class-aware Tree-traversal is proposed to visualize the most
hinders the model to be representative and would lower the critical path from child to root, which reveals how the model
performance. The authors alter the self-attention layers and makes the decision given an input image. It is worth noting
provide a learnable transformation matrix after the attention that NexT can be used to generate images by turning the tree
layer to address the issue by stimulating the model to generate upside down due to its tree structure.
a new set of attention maps.
T2T-ViT [Yuan et al., 2021b] is designed after the authors 3.4 Others
observe the feature maps reshaped from the tokens and point In the following, we introduce several promising directions
out that most of the token maps are meaningless in ViT. To for improving ViT which are not classified before.
improve the diversity of the token maps, the authors inserts
a T2T-module in the beginning of the model. T2T-module is CrossViT [Chen et al., 2021a] can extract multi-scale fea-
made up of few T2T-Transformers. These T2T-Transformers tures from two branches. These branches are L-Branch and
can be viewed as the original transformer layers or can be re- S-Branch, where the former uses a larger patch size and the
placed by the Performer used in [Choromanski et al., 2021]. latter uses a smaller patch size. By using the dual-branch
Between the T2T-Transformers is the T2T-process, which structure, the model is able to obtain different scales of spa-
crops the images into several overlapping patches. The over- tial information. To fuse the spatial information between the
lapping strategy enables the model to share the information branches, Cross Attention is applied by inserting the class to-
between the neighbors to improve the features diversity. ken from one to the other.
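For intuition about the fusion step described above, here is a hedged sketch of cross-attention in which a class token from one branch queries the patch tokens of the other branch; the function name, shapes, and single-head form are illustrative assumptions and do not reproduce the CrossViT implementation.

```python
import torch
import torch.nn as nn

def cross_attention(cls_token: torch.Tensor, other_tokens: torch.Tensor,
                    w_q: nn.Linear, w_k: nn.Linear, w_v: nn.Linear) -> torch.Tensor:
    """cls_token: (b, 1, d) from one branch; other_tokens: (b, N, d) from the other branch."""
    q = w_q(cls_token)                           # the class token acts as the query...
    k, v = w_k(other_tokens), w_v(other_tokens)  # ...against the other branch's patch tokens
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)  # (b, 1, N)
    return attn @ v                              # fused class token, shape (b, 1, d)

d = 64
w_q, w_k, w_v = (nn.Linear(d, d, bias=False) for _ in range(3))
fused = cross_attention(torch.randn(2, 1, d), torch.randn(2, 49, d), w_q, w_k, w_v)
print(fused.shape)  # torch.Size([2, 1, 64])
```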
RVT [Mao et al., 2021] was designed after studying different components of ViT with regard to robustness, e.g., accuracy under adversarial attacks. Based on the results, RVT decides to 1) remove the class token, which is not important to a vision transformer, by averaging the features, 2) add CNNs to the embedding and the FFN layers to increase locality, and 3) use more attention heads to obtain different features. Moreover, the original self-attention is replaced by Position-Aware Attention Scaling (PAAS), which adds a learnable matrix to the self-attention to indicate the importance of each Q-K pair. PAAS is shown to suppress unrelated signals in noisy inputs. Moreover, Patch-wise Augmentation applies different augmentations to different patches to diversify the training data.

CaiT [Touvron et al., 2021b] employs normalization with LayerScale, which uses learnable factors to adaptively normalize the features. The normalization speeds up the convergence rate and allows deep models to be trained well. CaiT also includes a Class Attention Layer for computing the attention between the class embedding and the overall features to gain better knowledge of the inputs.

XCiT [Ali et al., 2021] is proposed to reduce the time complexity of self-attention. The reduction is done by Cross-Covariance Attention, which uses a transposed version of self-attention, i.e., the self-attention is computed over the feature channels instead of the tokens. As such, the time complexity is reduced from O(N^2 d) to O(N d^2 / h). Moreover, a Local Patch Interaction Block employs depth-wise convolutional networks to further extract information between different patches.

4 Training Tricks for Vision Transformers
To better train a vision transformer, several tricks have been proposed to increase the diversity of the data and to improve the generality of the model.

Data Augmentation is used to increase the diversity of the training data, e.g., by translation and cropping, which helps a model learn the main features by altering the input patterns. To find the best combination for a variety of datasets, AutoAugment [Cubuk et al., 2019] and RandAugment [Cubuk et al., 2020] are designed to search for a better combination. These augmentation strategies are shown to be transferable to different datasets.

Exponential moving average (EMA) is often added to stabilize the training process. Let θl and θl′ respectively denote the model parameters in the l-th iteration and the parameters updated by an optimizer. The model parameters in the (l+1)-th iteration can be calculated by

θ(l+1) = λθl + (1 − λ)θl′,   (9)

where λ is a hyperparameter in the range [0, 1]. EMA stabilizes the training process by smoothing old and new parameters.

Stochastic Depth (SD) [Huang et al., 2016] was first proposed to train deep networks such as ResNet [He et al., 2016]; it drops entire blocks as a regularization method. Similarly, when training a vision transformer, stochastic depth randomly sets some samples to 0 after an attention or an FFN layer but before the residual connection. This step can be viewed as randomly replacing the sub-networks with the identity function for some samples.

Fixing Resolution Discrepancy [Touvron et al., 2019] aims to relax the discrepancy between the training and testing image sizes that results from random cropping in data augmentation. The authors find that upsampling the training data can mitigate the discrepancy caused by different resolutions. This is why many models are trained with a size of 384 or larger.

5 Self-supervised Learning in Vision Transformers
Self-Supervised Learning (SSL) trains a model in the same way as supervised learning but only uses the data itself to create labels instead of manual annotation. SSL has a considerable advantage in exploiting the data, which is especially useful for extensive datasets. It also helps the model learn the essential information that lies in the data and makes the model robust and transferable.

Recently, SSL in computer vision can be mainly categorized into pretext tasks and contrastive learning. The former designs a particular job for a model to learn before fine-tuning on downstream tasks, e.g., predicting the rotation degree, colorization, or solving jigsaw puzzles. In contrast, the latter generates similar features for same-class data and pushes away other, negative samples. In the following, we introduce several SSL methods used in vision transformers.

SiT [Atito et al., 2021] includes two pretext tasks together with a contrastive loss after passing the input data through a vision transformer. These tasks are predicting the rotation degree (0, 90, 180, 270) and image reconstruction. The image is first augmented and partitioned into different patches. The rotation task is to match the predicted degree with the actual degree of the input image. The reconstruction task is to reconstruct the augmented image back to the original image. Finally, the contrastive loss maximizes the similarity between the same inputs.

MoCoV3 [Chen et al., 2021b] is designed explicitly for vision transformers based on V1 and V2 [He et al., 2020; Chen et al., 2020]. Each image has two different versions generated by data augmentation, and the two versions are fed into two different encoders. The model then learns to reduce the difference between the two identical images and increase the distance to negative samples, where the difference is measured by InfoNCE [Oord et al., 2018]. Other modifications are 1) removing the memory queue, which reduces the requirement of a substantial batch size, and 2) adopting a symmetrized loss. Suppose the first encoder outputs q1, q2 and the second encoder outputs k1, k2; the symmetrized loss L is computed by:

L = InfoNCE(q1, k2) + InfoNCE(q2, k1).   (10)
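A minimal sketch of the symmetrized objective in Eq. (10) is given below, with InfoNCE written as cross-entropy over cosine-similarity logits and in-batch negatives; the temperature value and the use of in-batch negatives are assumptions for illustration rather than the exact MoCoV3 recipe.

```python
import torch
import torch.nn.functional as F

def info_nce(q: torch.Tensor, k: torch.Tensor, temperature: float = 0.2) -> torch.Tensor:
    """q, k: (batch, dim). q[i] and k[i] are two views of the same image (positives);
    every other k[j] in the batch serves as a negative."""
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    logits = q @ k.t() / temperature                    # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)   # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

def symmetrized_loss(q1, q2, k1, k2):
    # Eq. (10): L = InfoNCE(q1, k2) + InfoNCE(q2, k1).
    return info_nce(q1, k2) + info_nce(q2, k1)

b, d = 8, 128
q1, q2, k1, k2 = (torch.randn(b, d) for _ in range(4))
print(symmetrized_loss(q1, q2, k1, k2).item())
```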
DINO [Caron et al., 2021] is also trained with InfoNCE but with different methods. The inputs are cropped into global and local views, where the global views have a higher resolution. The teacher model can only see the global views, while the student model can utilize all the views. During the updating stage, the student model is updated by an optimizer, while the teacher model updates its parameters using an EMA of the student model. To avoid collapse, DINO adds centers, which can be viewed as a bias term, to each teacher output to help the model generate a uniform distribution. The centers are updated smoothly by the average of the teacher outputs.

MoBY [Xie et al., 2021b] incorporates the training strategies from MoCoV2 [Chen et al., 2020] and BYOL [Grill et al., 2020]. Instead of using the vanilla ViT as the backbone, MoBY replaces it with the Swin-Transformer [Liu et al., 2021]. Additionally, the updating strategy is similar to DINO. On top of that, to avoid a large batch size, MoBY follows MoCoV2 in creating a memory queue to reuse past features.

EsViT [Li et al., 2022] uses a hierarchical transformer to reduce the computational cost. Instead of using positional embeddings, EsViT uses a Relative Position Bias to prevent the positional information from being affected by different cropping resolutions. Furthermore, EsViT uses the same updating method adopted by DINO and MoBY. On top of that, EsViT adds extra region-level tasks to capture the inter-region relationships.

BEiT [Bao et al., 2022] is designed based on BERT. The images are separated into patches, which are tokenized into different discrete values by training a discrete VAE [Ramesh et al., 2021]. Afterward, the vision transformer is trained to predict the tokens of the masked patches. The ground truth of the tokens is generated from the non-masked patches by the discrete VAE trained in the first stage.

MAE [He et al., 2021] is created based on the autoencoder approach to train a vision transformer. Distinctly, the input image is first masked by up to 75%. The masking strategies applied to images include random, block-wise, and grid masking. Then, the masked images are encoded into features by a ViT, and the decoder, another ViT, decodes the features back into the original non-masked images. Astonishingly, this strategy ends up with an unprecedentedly high accuracy, and the masked images can be well reconstructed.

6 Challenges and Discussions
Although existing works have achieved considerable success, there are still many challenges left to be resolved. In the following, several open challenges and future directions of vision transformers are discussed.

Universal pre-trained weights. Vision transformers are inspired by the self-attention in the Transformer, which was originally used for solving NLP problems. However, the adaptation from text to vision is not completely explored. One promising direction is to find universal pre-trained weights that are suitable for different kinds of inputs, e.g., texts, images, or audio. Currently, [Li et al., 2021a] discuss the relationship between text and vision. More exploration of general conditions would enable us to unveil the underlying principles of transformers.

Fixed input size. Although self-attention accepts varied sequence lengths, positional embeddings require a fixed length for each input. One approach to deal with this issue is interpolation, which helps expand or compress the embeddings to a given size. Nevertheless, this causes information loss when the input size is extremely different from the training size. Currently, the only feasible alternative is to directly extract the features from convolutional layers without adding positional embeddings. However, using only the convolutional layers lacks the global positional information.

Robustness. It is important to test the robustness of a model when the input images are altered or corrupted for some uncontrolled reason. These alterations include brightness, background, blur, noise, digital artifacts, or even adversarial attacks. Although RVT examines the robustness of different components, it only focuses on the vision transformer architecture. Other aspects such as data augmentation or the learning objective have not been explored yet. The former concerns what combinations of data augmentation can resist noisy inputs, and the latter is associated with learning criteria that enable the model to filter out noise attacks.

Lightweight models for mobile devices. Since deep learning has gradually become popular in recent years, more and more manufacturers are transplanting deep learning models into mobile devices. However, due to the limitations of size and cost, the computing resources within a mobile device are not suitable for running a vision transformer. Therefore, reducing the model size is also an important topic for extended usage on mobile devices. Currently, only a few publications [Mehta and Rastegari, 2022] focus on solving the related issues.

Feature Collapsing. The ability to extract features highly influences the model performance. As described in Sec. 3.2, the original vision transformer is subject to feature collapsing and cannot generate various representations, which is a serious issue when training a deep vision transformer. Current methods often add additional modules (T2T-ViT) or learnable parameters (DeepViT) to solve the problem, which can cause overhead at inference time. Investigating other methods, such as altering the architectures or training strategies, adding additional criteria, and adopting different data augmentations, is also a promising direction for preventing vision transformers from collapsing.

7 Conclusion
We present several vision transformer models and highlight their innovative components. Specifically, variant architectures are introduced to deal with weaknesses such as data hunger, low efficiency, and weak robustness. These ideas include transferring the inductive bias from CNNs, adding locality, strong data augmentation, cross-window information exchange, and reducing the computational cost. We also review the training tricks, as well as self-supervised learning, which trains on datasets without requiring any labels but can even reach a higher accuracy than that of supervised methods. At the end of our paper, we discuss some open challenges for future research.
References

[Ali et al., 2021] Alaaeldin Ali, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, et al. Xcit: Cross-covariance image transformers. In NIPS, 2021.

[Arnab et al., 2021] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. Vivit: A video vision transformer. In ICCV, 2021.

[Atito et al., 2021] Sara Atito, Muhammad Awais, and Josef Kittler. Sit: Self-supervised vision transformer. arXiv preprint arXiv:2104.03602, 2021.

[Ba et al., 2016] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

[Bao et al., 2022] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. In ICLR, 2022.

[Brown et al., 2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NIPS, 2020.

[Carion et al., 2020] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.

[Caron et al., 2021] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.

[Chen et al., 2020] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.

[Chen et al., 2021a] Chun-Fu Richard Chen, Quanfu Fan, and Rameswar Panda. Crossvit: Cross-attention multi-scale vision transformer for image classification. In ICCV, 2021.

[Chen et al., 2021b] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In ICCV, 2021.

[Choromanski et al., 2021] Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, et al. Rethinking attention with performers. In ICLR, 2021.

[Chu et al., 2021] Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. Twins: Revisiting the design of spatial attention in vision transformers. In NIPS, 2021.

[Cordonnier et al., 2020] Jean-Baptiste Cordonnier, Andreas Loukas, and Martin Jaggi. On the relationship between self-attention and convolutional layers. In ICLR, 2020.

[Cubuk et al., 2019] Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. Autoaugment: Learning augmentation strategies from data. In CVPR, 2019.

[Cubuk et al., 2020] Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. Randaugment: Practical automated data augmentation with a reduced search space. In CVPR, 2020.

[Devlin et al., 2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[Dosovitskiy et al., 2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.

[d'Ascoli et al., 2021] Stéphane d'Ascoli, Hugo Touvron, Matthew L Leavitt, Ari S Morcos, Giulio Biroli, and Levent Sagun. Convit: Improving vision transformers with soft convolutional inductive biases. In ICML, 2021.

[Fang et al., 2021] Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, and Wenyu Liu. You only look at one sequence: Rethinking transformer in vision through object detection. In NIPS, 2021.

[Graham et al., 2021] Benjamin Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, and Matthijs Douze. Levit: A vision transformer in convnet's clothing for faster inference. In ICCV, 2021.

[Grill et al., 2020] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, et al. Bootstrap your own latent: A new approach to self-supervised learning. In NIPS, 2020.

[Hassani et al., 2021] Ali Hassani, Steven Walton, Nikhil Shah, Abulikemu Abuduweili, Jiachen Li, and Humphrey Shi. Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704, 2021.

[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

[He et al., 2020] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.

[He et al., 2021] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377, 2021.

[Heo et al., 2021] Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, and Seong Joon Oh. Rethinking spatial dimensions of vision transformers. In ICCV, 2021.

[Hinton et al., 2015] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[Huang et al., 2016] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In ECCV, 2016.

[Li et al., 2021a] Qing Li, Boqing Gong, Yin Cui, Dan Kondratyuk, Xianzhi Du, Ming-Hsuan Yang, and Matthew Brown. Towards a unified foundation model: Jointly pre-training transformers on unpaired images and text. arXiv preprint arXiv:2112.07074, 2021.

[Li et al., 2021b] Shaohua Li, Xiuchao Sui, Xiangde Luo, Xinxing Xu, Yong Liu, and Rick Goh. Medical image segmentation using squeeze-and-expansion transformers. In IJCAI, 2021.

[Li et al., 2021c] Yawei Li, Kai Zhang, Jiezhang Cao, Radu Timofte, and Luc Van Gool. Localvit: Bringing locality to vision transformers. arXiv preprint arXiv:2104.05707, 2021.

[Li et al., 2022] Chunyuan Li, Jianwei Yang, Pengchuan Zhang, Mei Gao, Bin Xiao, Xiyang Dai, Lu Yuan, and Jianfeng Gao. Efficient self-supervised vision transformers for representation learning. In ICLR, 2022.

[Liu et al., 2021] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.

[Mao et al., 2021] Xiaofeng Mao, Gege Qi, Yuefeng Chen, Xiaodan Li, Ranjie Duan, Shaokai Ye, Yuan He, and Hui Xue. Towards robust vision transformer. arXiv preprint arXiv:2105.07926, 2021.

[Mehta and Rastegari, 2022] Sachin Mehta and Mohammad Rastegari. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. In ICLR, 2022.

[Oord et al., 2018] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

[Peyrard et al., 2021] Maxime Peyrard, Beatriz Borges, Kristina Gligorić, and Robert West. Laughing heads: Can transformers detect what makes a sentence funny? In IJCAI, 2021.

[Radford et al., 2018] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.

[Ramesh et al., 2021] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, 2021.

[Russakovsky et al., 2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 2015.

[Sun et al., 2017] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017.

[Touvron et al., 2019] Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Hervé Jégou. Fixing the train-test resolution discrepancy. In NIPS, 2019.

[Touvron et al., 2021a] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, 2021.

[Touvron et al., 2021b] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. In ICCV, 2021.

[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.

[Wang et al., 2021] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV, 2021.

[Xie et al., 2021a] Enze Xie, Wenjia Wang, Wenhai Wang, Peize Sun, Hang Xu, Ding Liang, and Ping Luo. Segmenting transparent objects in the wild with transformer. In IJCAI, 2021.

[Xie et al., 2021b] Zhenda Xie, Yutong Lin, Zhuliang Yao, Zheng Zhang, Qi Dai, Yue Cao, and Han Hu. Self-supervised learning with swin transformers. arXiv preprint arXiv:2105.04553, 2021.

[Xiong et al., 2020] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture. In ICML, 2020.

[Yuan et al., 2021a] Kun Yuan, Shaopeng Guo, Ziwei Liu, Aojun Zhou, Fengwei Yu, and Wei Wu. Incorporating convolution designs into visual transformers. In ICCV, 2021.

[Yuan et al., 2021b] Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zi-Hang Jiang, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In ICCV, 2021.

[Zhang et al., 2021] Zizhao Zhang, Han Zhang, Long Zhao, Ting Chen, and Tomas Pfister. Aggregating nested transformers. arXiv preprint arXiv:2105.12723, 2021.

[Zhou et al., 2021] Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, and Jiashi Feng. Deepvit: Towards deeper vision transformer. arXiv preprint arXiv:2103.11886, 2021.

[Zhu et al., 2021] Yiran Zhu, Xing Xu, Fumin Shen, Yanli Ji, Lianli Gao, and Heng Tao Shen. Posegtac: Graph transformer encoder-decoder with atrous convolution for 3d human pose estimation. In IJCAI, 2021.
