
CvT: Introducing Convolutions to Vision Transformers

Haiping Wu1,2*   Bin Xiao2†   Noel Codella2   Mengchen Liu2   Xiyang Dai2   Lu Yuan2   Lei Zhang2
1 McGill University   2 Microsoft Cloud + AI
haiping.wu2@mail.mcgill.ca, {bixi, ncodella, mengcliu, xidai, luyuan, leizhang}@microsoft.com

* This work was done when Haiping Wu was an intern at Microsoft.
† Corresponding author

arXiv:2103.15808v1 [cs.CV] 29 Mar 2021

Abstract

We present in this paper a new architecture, named Convolutional vision Transformer (CvT), that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs. This is accomplished through two primary modifications: a hierarchy of Transformers containing a new convolutional token embedding, and a convolutional Transformer block leveraging a convolutional projection. These changes introduce desirable properties of convolutional neural networks (CNNs) to the ViT architecture (i.e. shift, scale, and distortion invariance) while maintaining the merits of Transformers (i.e. dynamic attention, global context, and better generalization). We validate CvT by conducting extensive experiments, showing that this approach achieves state-of-the-art performance over other Vision Transformers and ResNets on ImageNet-1k, with fewer parameters and lower FLOPs. In addition, performance gains are maintained when pretrained on larger datasets (e.g. ImageNet-22k) and fine-tuned to downstream tasks. Pretrained on ImageNet-22k, our CvT-W24 obtains a top-1 accuracy of 87.7% on the ImageNet-1k val set. Finally, our results show that the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in our model, simplifying the design for higher resolution vision tasks. Code will be released at https://github.com/leoxiaobin/CvT.

Figure 1: Top-1 accuracy on ImageNet validation compared to other methods with respect to model parameters. (a) Comparison to the CNN-based model BiT [18] and the Transformer-based model ViT [11], when pretrained on ImageNet-22k. Larger marker size indicates a larger architecture. (b) Comparison to concurrent works DeiT [30], T2T [41], PVT [34], and TNT [14], when pretrained on ImageNet-1k.

1. Introduction

Transformers [31, 10] have recently dominated a wide range of tasks in natural language processing (NLP) [32]. The Vision Transformer (ViT) [11] is the first computer vision model to rely exclusively on the Transformer architecture to obtain competitive image classification performance at large scale. The ViT design adapts Transformer architectures [10] from language understanding with minimal modifications. First, images are split into discrete non-overlapping patches (e.g. 16 × 16). Then, these patches are treated as tokens (analogous to tokens in NLP), summed with a special positional encoding to represent coarse spatial information, and input into repeated standard Transformer layers to model global relations for classification.

Despite the success of vision Transformers at large scale, their performance is still below that of similarly sized convolutional neural network (CNN) counterparts (e.g. ResNets [15]) when trained on smaller amounts of data. One possible reason may be that ViT lacks certain desirable properties inherently built into the CNN architecture that make CNNs uniquely suited to solve vision tasks. For example, images have a strong 2D local structure: spatially neighboring pixels are usually highly correlated.

Method | Needs Position Encoding (PE) | Token Embedding | Projection for Attention | Hierarchical Transformers
ViT [11], DeiT [30] | yes | non-overlapping | linear | no
CPVT [6] | no (w/ PE Generator) | non-overlapping | linear | no
TNT [14] | yes | non-overlapping (patch + pixel) | linear | no
T2T [41] | yes | overlapping (concatenate) | linear | partial (tokenization)
PVT [34] | yes | non-overlapping | spatial reduction | yes
CvT (ours) | no | overlapping (convolution) | convolution | yes

Table 1: Representative works of vision Transformers.

The CNN architecture forces the capture of this local structure by using local receptive fields, shared weights, and spatial subsampling [20], and thus also achieves some degree of shift, scale, and distortion invariance. In addition, the hierarchical structure of convolutional kernels learns visual patterns that take into account local spatial context at varying levels of complexity, from simple low-level edges and textures to higher-order semantic patterns.

In this paper, we hypothesize that convolutions can be strategically introduced into the ViT structure to improve performance and robustness, while concurrently maintaining a high degree of computational and memory efficiency. To verify our hypothesis, we present a new architecture, called the Convolutional vision Transformer (CvT), which incorporates convolutions into the Transformer and is inherently efficient, both in terms of floating point operations (FLOPs) and parameters.

The CvT design introduces convolutions to two core sections of the ViT architecture. First, we partition the Transformers into multiple stages that form a hierarchical structure of Transformers. The beginning of each stage consists of a convolutional token embedding that performs an overlapping convolution operation with stride on a 2D-reshaped token map (i.e., reshaping flattened token sequences back to the spatial grid), followed by layer normalization. This allows the model to not only capture local information, but also progressively decrease the sequence length while simultaneously increasing the dimension of the token features across stages, achieving spatial downsampling while concurrently increasing the number of feature maps, as is performed in CNNs [20]. Second, the linear projection prior to every self-attention block in the Transformer module is replaced with our proposed convolutional projection, which employs an s × s depth-wise separable convolution [5] operation on a 2D-reshaped token map. This allows the model to further capture local spatial context and reduce semantic ambiguity in the attention mechanism. It also permits management of computational complexity, as the stride of the convolution can be used to subsample the key and value matrices to improve efficiency by 4× or more, with minimal degradation of performance.

In summary, our proposed Convolutional vision Transformer (CvT) employs all the benefits of CNNs: local receptive fields, shared weights, and spatial subsampling, while keeping all the advantages of Transformers: dynamic attention, global context fusion, and better generalization. Our results demonstrate that this approach attains state-of-the-art performance when CvT is pre-trained with ImageNet-1k, while being lightweight and efficient: CvT improves the performance compared to CNN-based models (e.g. ResNet) and prior Transformer-based models (e.g. ViT, DeiT) while utilizing fewer FLOPs and parameters. In addition, CvT achieves state-of-the-art performance when evaluated at larger scale pretraining (e.g. on the public ImageNet-22k dataset). Finally, we demonstrate that in this new design we can drop the positional embedding for tokens without any degradation of model performance. This not only simplifies the architecture design, but also makes it readily capable of accommodating variable resolutions of input images, which is critical to many vision tasks.

2. Related Work

Transformers that exclusively rely on the self-attention mechanism to capture global dependencies have dominated natural language modelling [31, 10, 25]. Recently, the Transformer-based architecture has been viewed as a viable alternative to convolutional neural networks (CNNs) in visual recognition tasks, such as classification [11, 30], object detection [3, 45, 43, 8, 28], segmentation [33, 36], image enhancement [4, 40], image generation [24], video processing [42, 44] and 3D point cloud processing [12].

Vision Transformers. The Vision Transformer (ViT) is the first to prove that a pure Transformer architecture can attain state-of-the-art performance (e.g. over ResNets [15], EfficientNet [29]) on image classification when the data is large enough (i.e. on ImageNet-22k, JFT-300M). Specifically, ViT decomposes each image into a sequence of tokens (i.e. non-overlapping patches) with fixed length, and then applies multiple standard Transformer layers, consisting of a Multi-Head Self-Attention module (MHSA) and a Position-wise Feed-forward module (FFN), to model these tokens. DeiT [30] further explores data-efficient training and distillation for ViT.
Figure 2: The pipeline of the proposed CvT architecture. (a) Overall architecture, showing the hierarchical multi-stage
structure facilitated by the Convolutional Token Embedding layer. (b) Details of the Convolutional Transformer Block,
which contains the convolution projection as the first layer.

In this work, we study how to combine CNNs and Transformers to model both local and global dependencies for image classification in an efficient way.

In order to better model local context in vision Transformers, some concurrent works have introduced design changes. For example, the Conditional Position encodings Visual Transformer (CPVT) [6] replaces the predefined positional embedding used in ViT with conditional position encodings (CPE), enabling Transformers to process input images of arbitrary size without interpolation. Transformer-iN-Transformer (TNT) [14] utilizes both an outer Transformer block that processes the patch embeddings, and an inner Transformer block that models the relation among pixel embeddings, to model both patch-level and pixel-level representations. Tokens-to-Token (T2T) [41] mainly improves tokenization in ViT by concatenating multiple tokens within a sliding window into one token. However, this operation fundamentally differs from convolutions, especially in normalization details, and the concatenation of multiple tokens greatly increases complexity in computation and memory. PVT [34] incorporates a multi-stage design (without convolutions) for Transformers, similar to multi-scales in CNNs, favoring dense prediction tasks.

In contrast to these concurrent works, this work aims to achieve the best of both worlds by introducing convolutions, with image domain specific inductive biases, into the Transformer architecture. Table 1 shows the key differences, in terms of necessity of positional encodings, type of token embedding, type of projection, and Transformer structure in the backbone, between the above representative concurrent works and ours.

Introducing Self-attention to CNNs. Self-attention mechanisms have been widely applied to CNNs in vision tasks. Among these works, the non-local networks [35] are designed for capturing long range dependencies via global attention. The local relation networks [17] adapt their weight aggregation based on the compositional relations (similarity) between pixels/features within a local window, in contrast to convolution layers which employ fixed aggregation weights over spatially neighboring input features. Such an adaptive weight aggregation introduces geometric priors into the network which are important for recognition tasks. Recently, BoTNet [27] proposes a simple yet powerful backbone architecture that replaces the spatial convolutions with global self-attention in the final three bottleneck blocks of a ResNet and achieves strong performance in image recognition. Our work instead pursues the opposite direction: introducing convolutions to Transformers.

Introducing Convolutions to Transformers. In NLP and speech recognition, convolutions have been used to modify the Transformer block, either by replacing multi-head attention with convolution layers [38], or by adding additional convolution layers in parallel [39] or sequentially [13], to capture local relationships. Other prior work [37] proposes to propagate attention maps to succeeding layers via a residual connection, which is first transformed by convolutions. Different from these works, we propose to introduce convolutions to two primary parts of the vision Transformer: first, to replace the existing Position-wise Linear Projection for the attention operation with our Convolutional Projection, and second, to use our hierarchical multi-stage structure to enable varied resolution of 2D-reshaped token maps, similar to CNNs. Our unique design affords significant performance and efficiency benefits over prior works.

3. Convolutional vision Transformer

The overall pipeline of the Convolutional vision Transformer (CvT) is shown in Figure 2. We introduce two convolution-based operations into the Vision Transformer architecture, namely the Convolutional Token Embedding and the Convolutional Projection. As shown in Figure 2 (a), a multi-stage hierarchy design borrowed from CNNs [20, 15] is employed, where three stages in total are used in this work. Each stage has two parts. First, the input image (or 2D-reshaped token map) is fed to the Convolutional Token Embedding layer, which is implemented as a convolution with overlapping patches, with tokens reshaped to the 2D spatial grid as the input (the degree of overlap can be controlled via the stride length). An additional layer normalization is applied to the tokens. This allows each stage to progressively reduce the number of tokens (i.e. feature resolution) while simultaneously increasing the width of the tokens (i.e. feature dimension), thus achieving spatial downsampling and increased richness of representation, similar to the design of CNNs. Different from other prior Transformer-based architectures [11, 30, 41, 34], we do not sum an ad-hoc position embedding to the tokens. Next, a stack of the proposed Convolutional Transformer Blocks comprises the remainder of each stage. Figure 2 (b) shows the architecture of the Convolutional Transformer Block, where a depth-wise separable convolution operation [5], referred to as the Convolutional Projection, is applied to the query, key, and value embeddings respectively, instead of the standard position-wise linear projection in ViT [11]. Additionally, the classification token is added only in the last stage. Finally, an MLP (i.e. fully connected) head is applied to the classification token of the final stage output to predict the class.

We first elaborate on the proposed Convolutional Token Embedding layer. Next we show how to perform the Convolutional Projection for the Multi-Head Self-Attention module, and its efficient design for managing computational cost.

3.1. Convolutional Token Embedding

This convolution operation in CvT aims to model local spatial contexts, from low-level edges to higher order semantic primitives, over a multi-stage hierarchy, similar to CNNs.

Formally, given a 2D image or a 2D-reshaped output token map from a previous stage x_{i−1} ∈ R^{H_{i−1} × W_{i−1} × C_{i−1}} as the input to stage i, we learn a function f(·) that maps x_{i−1} into new tokens f(x_{i−1}) with a channel size C_i, where f(·) is a 2D convolution operation of kernel size s × s, stride s − o, and padding p (to deal with boundary conditions). The new token map f(x_{i−1}) ∈ R^{H_i × W_i × C_i} has height and width

H_i = ⌊(H_{i−1} + 2p − s) / (s − o) + 1⌋,   W_i = ⌊(W_{i−1} + 2p − s) / (s − o) + 1⌋.   (1)

f(x_{i−1}) is then flattened into size H_i W_i × C_i and normalized by layer normalization [1] for input into the subsequent Transformer blocks of stage i.

The Convolutional Token Embedding layer allows us to adjust the token feature dimension and the number of tokens at each stage by varying the parameters of the convolution operation. In this manner, in each stage we progressively decrease the token sequence length while increasing the token feature dimension. This gives the tokens the ability to represent increasingly complex visual patterns over increasingly larger spatial footprints, similar to the feature layers of CNNs.
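To make the mapping above concrete, the following PyTorch-style sketch shows one way the Convolutional Token Embedding could be written; the module and argument names are our own and the padding value is an assumption, not necessarily the released implementation. With the Stage-1 setting of Table 2 (7 × 7 kernel, stride 4, 64 channels) and p = 2, Eq. (1) gives ⌊(224 + 4 − 7)/4 + 1⌋ = 56, i.e. a 56 × 56 token map for a 224 × 224 input.

```python
import torch
import torch.nn as nn

class ConvTokenEmbedding(nn.Module):
    """Sketch of the Convolutional Token Embedding (names are ours).

    A strided, overlapping 2D convolution maps an input image or a
    2D-reshaped token map of shape (B, C_in, H, W) to a new token map,
    which is then flattened to (B, H_out*W_out, C_out) and layer-normalized.
    """
    def __init__(self, in_chans, embed_dim, kernel_size, stride, padding):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=kernel_size,
                              stride=stride, padding=padding)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        x = self.proj(x)                    # (B, C_out, H_out, W_out)
        B, C, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)    # (B, H_out*W_out, C_out)
        x = self.norm(x)
        return x, (H, W)                    # keep (H, W) to reshape later


# Stage-1 setting from Table 2: 7x7 conv, stride 4, 64 channels (padding assumed).
embed = ConvTokenEmbedding(in_chans=3, embed_dim=64,
                           kernel_size=7, stride=4, padding=2)
tokens, (H, W) = embed(torch.randn(1, 3, 224, 224))
print(tokens.shape, H, W)   # torch.Size([1, 3136, 64]) 56 56
```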
3.2. Convolutional Projection for Attention

The goal of the proposed Convolutional Projection layer is to achieve additional modeling of local spatial context, and to provide efficiency benefits by permitting the undersampling of the K and V matrices.

Fundamentally, the proposed Transformer block with Convolutional Projection is a generalization of the original Transformer block. While previous works [13, 39] try to add additional convolution modules to the Transformer block for speech recognition and natural language processing, they result in a more complicated design and additional computational cost. Instead, we propose to replace the original position-wise linear projection for Multi-Head Self-Attention (MHSA) with depth-wise separable convolutions, forming the Convolutional Projection layer.

3.2.1 Implementation Details

Figure 3 (a) shows the original position-wise linear projection used in ViT [11] and Figure 3 (b) shows our proposed s × s Convolutional Projection. As shown in Figure 3 (b), tokens are first reshaped into a 2D token map. Next, a Convolutional Projection is implemented using a depth-wise separable convolution layer with kernel size s. Finally, the projected tokens are flattened into 1D for subsequent processing. This can be formulated as:

x_i^{q/k/v} = Flatten(Conv2d(Reshape2D(x_i), s)),   (2)

where x_i^{q/k/v} is the token input for the Q/K/V matrices at layer i, x_i is the unperturbed token prior to the Convolutional Projection, Conv2d is a depth-wise separable convolution [5] implemented by Depth-wise Conv2d → BatchNorm2d → Point-wise Conv2d, and s refers to the convolution kernel size.

The resulting new Transformer block with the Convolutional Projection layer is a generalization of the original Transformer block design.
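A minimal sketch of Eq. (2) is given below, following the depth-wise separable layout described above (depth-wise Conv2d → BatchNorm2d → point-wise Conv2d). The function names and bias choices are our own assumptions rather than the exact released implementation.

```python
import torch
import torch.nn as nn

def conv_projection(dim, kernel_size=3, stride=1):
    """Depth-wise separable projection for one of Q, K or V (a sketch).

    Depth-wise Conv2d -> BatchNorm2d -> point-wise (1x1) Conv2d, so the extra
    cost over a linear projection is only on the order of s^2 * C parameters.
    """
    padding = kernel_size // 2
    return nn.Sequential(
        nn.Conv2d(dim, dim, kernel_size, stride=stride,
                  padding=padding, groups=dim, bias=False),  # depth-wise
        nn.BatchNorm2d(dim),
        nn.Conv2d(dim, dim, kernel_size=1, bias=False),      # point-wise
    )

def project_tokens(x, proj, h, w):
    """Reshape2D -> Conv2d -> Flatten, as in Eq. (2)."""
    B, N, C = x.shape                      # tokens: (B, H*W, C)
    x = x.transpose(1, 2).reshape(B, C, h, w)
    x = proj(x)                            # (B, C, h', w')
    return x.flatten(2).transpose(1, 2)    # (B, h'*w', C)

# Example: a 56x56 token map with C = 64; stride 1 for queries and
# stride 2 for keys/values (the "squeezed" variant of Figure 3 (c)).
x = torch.randn(2, 56 * 56, 64)
q = project_tokens(x, conv_projection(64, stride=1), 56, 56)  # (2, 3136, 64)
k = project_tokens(x, conv_projection(64, stride=2), 56, 56)  # (2, 784, 64)
print(q.shape, k.shape)
```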

Figure 3: (a) Linear projection in ViT [11]. (b) Convolutional projection. (c) Squeezed convolutional projection. Unless
otherwise stated, we use (c) Squeezed convolutional projection by default.

The original position-wise linear projection layer could be trivially implemented using a convolution layer with kernel size 1 × 1.

3.2.2 Efficiency Considerations

There are two primary efficiency benefits from the design of our Convolutional Projection layer.

First, we utilize efficient convolutions. Directly using standard s × s convolutions for the Convolutional Projection would require s²C² parameters and O(s²C²T) FLOPs, where C is the token channel dimension and T is the number of tokens to process. Instead, we split the standard s × s convolution into a depth-wise separable convolution [16]. In this way, each of the proposed Convolutional Projections introduces only an extra s²C parameters and O(s²CT) FLOPs compared to the original position-wise linear projection, which is negligible with respect to the total parameters and FLOPs of the models.

Second, we leverage the proposed Convolutional Projection to reduce the computation cost of the MHSA operation. The s × s Convolutional Projection permits reducing the number of tokens by using a stride larger than 1. Figure 3 (c) shows the Convolutional Projection where the key and value projections are subsampled by a convolution with stride larger than 1. We use a stride of 2 for the key and value projections, leaving the stride of 1 for the query unchanged. In this way, the number of tokens for key and value is reduced 4 times, and the computational cost of the subsequent MHSA operation is reduced by 4 times. This comes with a minimal performance penalty, as neighboring pixels/patches in images tend to have redundancy in appearance/semantics. In addition, the local context modeling of the proposed Convolutional Projection compensates for the loss of information incurred by the resolution reduction.
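As a rough illustration of the second point, the back-of-the-envelope computation below (our own numbers, using the Stage-1 token map of Table 2) compares the dominant attention term with and without stride-2 subsampling of the key/value tokens; the quadratic term drops by roughly 4×.

```python
# Rough attention-cost comparison for one MHSA layer (a sketch, not profiling).
C = 64                       # token channel dimension (Stage 1 of Table 2)
T_q = 56 * 56                # query tokens (stride 1)

def attention_flops(t_q, t_kv, c):
    # QK^T plus the attention-weighted sum of V; softmax and projections ignored.
    return 2 * t_q * t_kv * c

dense    = attention_flops(T_q, 56 * 56, C)   # stride 1 for K, V
squeezed = attention_flops(T_q, 28 * 28, C)   # stride 2 for K, V
print(dense / squeezed)      # -> 4.0: K/V tokens shrink 4x, and so does this term
```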
3.3. Methodological Discussions

Removing Positional Embeddings: The introduction of Convolutional Projections for every Transformer block, combined with the Convolutional Token Embedding, gives us the ability to model local spatial relationships through the network. This built-in property allows dropping the position embedding from the network without hurting performance, as evidenced by our experiments (Section 4.4), simplifying the design for vision tasks with variable input resolution.

Relations to Concurrent Work: Recently, two related concurrent works also propose to improve ViT by incorporating elements of CNNs into Transformers. Tokens-to-Token ViT [41] implements a progressive tokenization, and then uses a Transformer-based backbone in which the length of tokens is fixed. By contrast, our CvT implements a progressive tokenization by a multi-stage process, containing both convolutional token embeddings and convolutional Transformer blocks in each stage. As the length of the tokens is decreased in each stage, the width of the tokens (feature dimension) can be increased, allowing increased richness of representation at each feature spatial resolution. Additionally, whereas T2T concatenates neighboring tokens into one new token, increasing the complexity of memory and computation, our convolutional token embedding directly performs contextual learning without concatenation, while providing the flexibility of controlling stride and feature dimension. To manage the complexity, T2T has to consider a deep-narrow architecture design with smaller hidden dimensions and MLP size than ViT in the subsequent backbone. Instead, we change previous Transformer modules by replacing the position-wise linear projection with our convolutional projection.

Pyramid Vision Transformer (PVT) [34] overcomes the difficulties of porting ViT to various dense prediction tasks. In ViT, the output feature map has only a single scale with low resolution. In addition, computation and memory costs are relatively high, even for common input image sizes. To address this problem, both PVT and our CvT incorporate pyramid structures from CNNs into the Transformer structure. Compared with PVT, which only spatially subsamples the feature map or the key/value matrices in projection, our CvT instead employs convolutions with stride to achieve this goal. Our experiments (Section 4.4) demonstrate that the fusion of local neighboring information plays an important role in the performance.

Output Size | Layer Name | CvT-13 | CvT-21 | CvT-W24

Stage 1
56 × 56 | Conv. Embed. | 7 × 7, 64, stride 4 | 7 × 7, 64, stride 4 | 7 × 7, 192, stride 4
56 × 56 | Conv. Proj. + MHSA + MLP | [3 × 3, 64; H1 = 1, D1 = 64; R1 = 4] × 1 | [3 × 3, 64; H1 = 1, D1 = 64; R1 = 4] × 1 | [3 × 3, 192; H1 = 3, D1 = 192; R1 = 4] × 2

Stage 2
28 × 28 | Conv. Embed. | 3 × 3, 192, stride 2 | 3 × 3, 192, stride 2 | 3 × 3, 768, stride 2
28 × 28 | Conv. Proj. + MHSA + MLP | [3 × 3, 192; H2 = 3, D2 = 192; R2 = 4] × 2 | [3 × 3, 192; H2 = 3, D2 = 192; R2 = 4] × 4 | [3 × 3, 768; H2 = 12, D2 = 768; R2 = 4] × 2

Stage 3
14 × 14 | Conv. Embed. | 3 × 3, 384, stride 2 | 3 × 3, 384, stride 2 | 3 × 3, 1024, stride 2
14 × 14 | Conv. Proj. + MHSA + MLP | [3 × 3, 384; H3 = 6, D3 = 384; R3 = 4] × 10 | [3 × 3, 384; H3 = 6, D3 = 384; R3 = 4] × 16 | [3 × 3, 1024; H3 = 16, D3 = 1024; R3 = 4] × 20

Head | 1 × 1 | Linear 1000 | Linear 1000 | Linear 1000
Params | | 19.98 M | 31.54 M | 276.7 M
FLOPs | | 4.53 G | 7.13 G | 60.86 G

Table 2: Architectures for ImageNet classification. The input image size is 224 × 224 by default. Conv. Embed.: Convolutional Token Embedding. Conv. Proj.: Convolutional Projection. Hi and Di are the number of heads and the embedding feature dimension in the i-th MHSA module. Ri is the feature dimension expansion ratio in the i-th MLP layer.

4. Experiments

In this section, we evaluate the CvT model on large-scale image classification datasets and transfer it to various downstream datasets. In addition, we perform thorough ablation studies to validate the design of the proposed architecture.

4.1. Setup

For evaluation, we use the ImageNet dataset, with 1.3M images and 1k classes, as well as its superset ImageNet-22k with 22k classes and 14M images [9]. We further transfer the models pretrained on ImageNet-22k to downstream tasks, including CIFAR-10/100 [19], Oxford-IIIT-Pet [23], and Oxford-IIIT-Flower [22], following [18, 11].

Model Variants We instantiate models with different parameters and FLOPs by varying the number of Transformer blocks in each stage and the hidden feature dimension used, as shown in Table 2. Three stages are adopted. We define CvT-13 and CvT-21 as basic models, with 19.98M and 31.54M parameters. CvT-X stands for Convolutional vision Transformer with X Transformer blocks in total. Additionally, we experiment with a wider model with a larger token dimension for each stage, namely CvT-W24 (W stands for Wide), resulting in 298.3M parameters, to validate the scaling ability of the proposed architecture.

Training The AdamW [21] optimizer is used with a weight decay of 0.05 for CvT-13, and 0.1 for CvT-21 and CvT-W24. We train our models with an initial learning rate of 0.02 and a total batch size of 2048 for 300 epochs, with a cosine learning rate decay scheduler. We adopt the same data augmentation and regularization methods as in ViT [30]. Unless otherwise stated, all ImageNet models are trained with a 224 × 224 input size.

Fine-tuning We adopt the fine-tuning strategy from ViT [30]. An SGD optimizer with a momentum of 0.9 is used for fine-tuning. As in ViT [30], we pre-train our models at resolution 224 × 224 and fine-tune at resolution 384 × 384. We fine-tune each model with a total batch size of 512, for 20,000 steps on ImageNet-1k, 10,000 steps on CIFAR-10 and CIFAR-100, and 500 steps on Oxford-IIIT Pets and Oxford-IIIT Flowers-102.
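For reference, the recipe above maps onto a standard PyTorch setup roughly as follows. This is a hedged sketch of the stated hyperparameters (AdamW with weight decay 0.05 for CvT-13, initial learning rate 0.02, 300 epochs, cosine decay; SGD with momentum 0.9 for fine-tuning), not the authors' released training script; the data pipeline, batch-size handling, and the fine-tuning learning rate are omitted or assumed.

```python
import torch

# model = ...  # a CvT-13-style network as sketched in Section 3
model = torch.nn.Linear(384, 1000)   # placeholder so the snippet runs

epochs = 300
optimizer = torch.optim.AdamW(model.parameters(), lr=0.02, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # for images, targets in loader:   # global batch size 2048 in the paper
    #     loss = criterion(model(images), targets)
    #     optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step()

# Fine-tuning at 384x384 uses SGD with momentum 0.9 (batch size 512);
# the learning rate here is an assumed placeholder, not from the paper.
ft_optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```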
4.2. Comparison to state of the art

We compare our method with state-of-the-art classification methods, including Transformer-based models and representative CNN-based models, on the ImageNet [9], ImageNet Real [2] and ImageNet V2 [26] datasets in Table 3.

Compared to Transformer-based models, CvT achieves a much higher accuracy with fewer parameters and FLOPs. CvT-21 obtains 82.5% ImageNet Top-1 accuracy, which is 0.5% higher than DeiT-B, with a reduction of 63% in parameters and 60% in FLOPs. When comparing to concurrent works, CvT still shows superior advantages. With fewer parameters, CvT-13 achieves 81.6% ImageNet Top-1 accuracy, outperforming PVT-Small [34], T2T-ViTt-14 [41], and TNT-S [14] by 1.7%, 0.8%, and 0.2% respectively.

Our architecture design can be further improved in terms of model parameters and FLOPs by neural architecture search (NAS) [7]. In particular, we search the proper stride for each convolutional projection of key and value (stride = 1, 2) and the expansion ratio of each MLP layer (ratio_MLP = 2, 4). Architecture candidates with FLOPs ranging from 2.59G to 4.03G and model parameters ranging from 13.66M to 19.88M construct the search space. The NAS is evaluated directly on ImageNet-1k.

Type | Network | #Param. (M) | Image size | FLOPs (G) | ImageNet top-1 (%) | Real top-1 (%) | V2 top-1 (%)
Convolutional Networks | ResNet-50 [15] | 25 | 224² | 4.1 | 76.2 | 82.5 | 63.3
Convolutional Networks | ResNet-101 [15] | 45 | 224² | 7.9 | 77.4 | 83.7 | 65.7
Convolutional Networks | ResNet-152 [15] | 60 | 224² | 11 | 78.3 | 84.1 | 67.0
Transformers | ViT-B/16 [11] | 86 | 384² | 55.5 | 77.9 | 83.6 | –
Transformers | ViT-L/16 [11] | 307 | 384² | 191.1 | 76.5 | 82.2 | –
Transformers | DeiT-S [30] (arXiv 2020) | 22 | 224² | 4.6 | 79.8 | 85.7 | 68.5
Transformers | DeiT-B [30] (arXiv 2020) | 86 | 224² | 17.6 | 81.8 | 86.7 | 71.5
Transformers | PVT-Small [34] (arXiv 2021) | 25 | 224² | 3.8 | 79.8 | – | –
Transformers | PVT-Medium [34] (arXiv 2021) | 44 | 224² | 6.7 | 81.2 | – | –
Transformers | PVT-Large [34] (arXiv 2021) | 61 | 224² | 9.8 | 81.7 | – | –
Transformers | T2T-ViTt-14 [41] (arXiv 2021) | 22 | 224² | 6.1 | 80.7 | – | –
Transformers | T2T-ViTt-19 [41] (arXiv 2021) | 39 | 224² | 9.8 | 81.4 | – | –
Transformers | T2T-ViTt-24 [41] (arXiv 2021) | 64 | 224² | 15.0 | 82.2 | – | –
Transformers | TNT-S [14] (arXiv 2021) | 24 | 224² | 5.2 | 81.3 | – | –
Transformers | TNT-B [14] (arXiv 2021) | 66 | 224² | 14.1 | 82.8 | – | –
Convolutional Transformers | Ours: CvT-13 | 20 | 224² | 4.5 | 81.6 | 86.7 | 70.4
Convolutional Transformers | Ours: CvT-21 | 32 | 224² | 7.1 | 82.5 | 87.2 | 71.3
Convolutional Transformers | Ours: CvT-13↑384 | 20 | 384² | 16.3 | 83.0 | 87.9 | 71.9
Convolutional Transformers | Ours: CvT-21↑384 | 32 | 384² | 24.9 | 83.3 | 87.7 | 71.9
Convolutional Transformers | Ours: CvT-13-NAS | 18 | 224² | 4.1 | 82.2 | 87.5 | 71.3
Convolutional Networks (22k) | BiT-M↑480 [18] | 928 | 480² | 837 | 85.4 | – | –
Transformers (22k) | ViT-B/16↑384 [11] | 86 | 384² | 55.5 | 84.0 | 88.4 | –
Transformers (22k) | ViT-L/16↑384 [11] | 307 | 384² | 191.1 | 85.2 | 88.4 | –
Transformers (22k) | ViT-H/16↑384 [11] | 632 | 384² | – | 85.1 | 88.7 | –
Convolutional Transformers (22k) | Ours: CvT-13↑384 | 20 | 384² | 16 | 83.3 | 88.7 | 72.9
Convolutional Transformers (22k) | Ours: CvT-21↑384 | 32 | 384² | 25 | 84.9 | 89.8 | 75.6
Convolutional Transformers (22k) | Ours: CvT-W24↑384 | 277 | 384² | 193.2 | 87.7 | 90.6 | 78.8

Table 3: Accuracy of manually designed architectures on ImageNet [9], ImageNet Real [2] and ImageNet V2 matched frequency [26]. "(22k)" indicates the model was pre-trained on ImageNet-22k [9] and fine-tuned on ImageNet-1k with an input size of 384 × 384, except BiT-M [18], which was fine-tuned with an input size of 480 × 480.

The searched CvT-13-NAS, a bottleneck-like architecture with stride = 2, ratio_MLP = 2 at the first and last stages, and stride = 1, ratio_MLP = 4 at most layers of the middle stage, reaches 82.2% ImageNet Top-1 accuracy with fewer model parameters than CvT-13.

Compared to CNN-based models, CvT further closes the performance gap of Transformer-based models. Our smallest model, CvT-13, with 20M parameters and 4.5G FLOPs, surpasses the large ResNet-152 model by 3.2% on ImageNet Top-1 accuracy, while ResNet-152 has 3 times the parameters of CvT-13.

Furthermore, when more data are involved, our wide model CvT-W24 pretrained on ImageNet-22k reaches 87.7% Top-1 accuracy on ImageNet without extra data (e.g. JFT-300M), surpassing the previous best Transformer-based model, ViT-L/16, by 2.5% with a similar number of model parameters and FLOPs.

4.3. Downstream task transfer

We further investigate the ability of our models to transfer by fine-tuning on various tasks, with all models being pre-trained on ImageNet-22k. Table 4 shows the results. Our CvT-W24 model obtains the best performance across all the downstream tasks considered, even when compared to the large BiT-R152x4 [18] model, which has more than 3× the number of parameters of CvT-W24.

4.4. Ablation Study

We design various ablation experiments to investigate the effectiveness of the proposed components of our architecture. First, we show that with our introduction of convolutions, position embeddings can be removed from the model. Then, we study the impact of each of the proposed Convolutional Token Embedding and Convolutional Projection components.

Model | Param (M) | CIFAR-10 | CIFAR-100 | Pets | Flowers-102
BiT-M [18] | 928 | 98.91 | 92.17 | 94.46 | 99.30
ViT-B/16 [11] | 86 | 98.95 | 91.67 | 94.43 | 99.38
ViT-L/16 [11] | 307 | 99.16 | 93.44 | 94.73 | 99.61
ViT-H/16 [11] | 632 | 99.27 | 93.82 | 94.82 | 99.51
Ours: CvT-13 | 20 | 98.83 | 91.11 | 93.25 | 99.50
Ours: CvT-21 | 32 | 99.16 | 92.88 | 94.03 | 99.62
Ours: CvT-W24 | 277 | 99.39 | 94.09 | 94.73 | 99.72

Table 4: Top-1 accuracy on downstream tasks. All models are pre-trained on ImageNet-22k data.

Method | Model | Param (M) | Pos. Emb. | ImageNet Top-1 (%)
a | DeiT-S | 22 | Default | 79.8
b | DeiT-S | 22 | N/A | 78.0
c | CvT-13 | 20 | Every stage | 81.5
d | CvT-13 | 20 | First stage | 81.4
e | CvT-13 | 20 | Last stage | 81.4
f | CvT-13 | 20 | N/A | 81.6

Table 5: Ablations on position embedding.

Method | Conv. Embed. | Pos. Embed. | #Param (M) | ImageNet top-1 (%)
a | | | 19.5 | 80.7
b | | ✓ | 19.9 | 81.1
c | ✓ | ✓ | 20.3 | 81.4
d | ✓ | | 20.0 | 81.6

Table 6: Ablations on Convolutional Token Embedding. ✓ indicates the corresponding component is used.

Method | Conv. Proj. KV stride | Params (M) | FLOPs (G) | ImageNet top-1 (%)
a | 1 | 20 | 6.55 | 82.3
b | 2 | 20 | 4.53 | 81.6

Table 7: Ablations on Convolutional Projection with different strides for key and value projection. Conv. Proj. KV: Convolutional Projection for key and value. We apply Convolutional Projection in all Transformer blocks.

Method | Conv. Projection, Stage 1 (1 block) | Stage 2 (2 blocks) | Stage 3 (10 blocks) | ImageNet top-1 (%)
a | | | | 80.6
b | ✓ | | | 80.8
c | ✓ | ✓ | | 81.0
d | ✓ | ✓ | ✓ | 81.6

Table 8: Ablations on Convolutional Projection v.s. Position-wise Linear Projection. ✓ indicates the use of Convolutional Projection; otherwise Position-wise Linear Projection is used.

Removing Position Embedding Given that we have introduced convolutions into the model, allowing local context to be captured, we study whether position embedding is still needed for CvT. The results are shown in Table 5, and demonstrate that removing the position embedding from our model does not degrade the performance. Therefore, position embeddings have been removed from CvT by default. As a comparison, removing the position embedding of DeiT-S would lead to a 1.8% drop of ImageNet Top-1 accuracy, as it does not model image spatial relationships other than by adding the position embedding. This further shows the effectiveness of our introduced convolutions. Position embeddings are often realized by fixed-length learnable vectors, limiting the adaptation of the trained model to variable-length inputs. However, a wide range of vision applications take variable image resolutions. Recent work CPVT [6] tries to replace the explicit position embedding of Vision Transformers with a conditional position encoding module to model position information on-the-fly. CvT is able to completely remove the positional embedding, providing the possibility of simplifying adaptation to more vision tasks without requiring a re-design of the embedding.

Convolutional Token Embedding We study the effectiveness of the proposed Convolutional Token Embedding, and Table 6 shows the results. Table 6d is the CvT-13 model. When we replace the Convolutional Token Embedding with non-overlapping Patch Embedding [11], the performance drops 0.8% (Table 6a v.s. Table 6d). When position embedding is used, the introduction of Convolutional Token Embedding still obtains a 0.3% improvement (Table 6b v.s. Table 6c). Further, when using both Convolutional Token Embedding and position embedding, accuracy slightly drops by 0.1% (Table 6c v.s. Table 6d). These results validate that the introduction of Convolutional Token Embedding not only improves the performance, but also helps CvT model spatial relationships without position embedding.

Convolutional Projection First, we compare the proposed Convolutional Projection with different strides in Table 7. By using a stride of 2 for the key and value projection, we observe a 0.3% drop in ImageNet Top-1 accuracy, but with 30% fewer FLOPs. We choose Convolutional Projection with stride 2 for key and value as the default, for its lower computational cost and memory usage.

Then, we study how the proposed Convolutional Projection affects the performance by choosing whether to use Convolutional Projection or the regular Position-wise Linear Projection for each stage. The results are shown in Table 8. We observe that replacing the original Position-wise Linear Projection with the proposed Convolutional Projection improves the Top-1 accuracy on ImageNet from 80.6% to 81.5%. In addition, performance continually improves as more stages use the design, validating this approach as an effective modeling strategy.

5. Conclusion

In this work, we have presented a detailed study of introducing convolutions into the Vision Transformer architecture to merge the benefits of Transformers with the benefits of CNNs for image recognition tasks. Extensive experiments demonstrate that the introduced convolutional token embedding and convolutional projection, along with the multi-stage design of the network enabled by convolutions, make our CvT architecture achieve superior performance while maintaining computational efficiency. Furthermore, due to the built-in local context structure introduced by convolutions, CvT no longer requires a position embedding, giving it a potential advantage for adaptation to a wide range of vision tasks requiring variable input resolution.

References

[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization, 2016.
[2] Lucas Beyer, Olivier J. Hénaff, Alexander Kolesnikov, Xiaohua Zhai, and Aäron van den Oord. Are we done with ImageNet? arXiv preprint arXiv:2006.07159, 2020.
[3] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
[4] Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao. Pre-trained image processing transformer. arXiv preprint arXiv:2012.00364, 2020.
[5] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1251–1258, 2017.
[6] Xiangxiang Chu, Bo Zhang, Zhi Tian, Xiaolin Wei, and Huaxia Xia. Do we really need explicit position encodings for vision transformers? arXiv preprint arXiv:2102.10882, 2021.
[7] Xiyang Dai, Dongdong Chen, Mengchen Liu, Yinpeng Chen, and Lu Yuan. DA-NAS: Data adapted pruning for efficient neural architecture search. In European Conference on Computer Vision, 2020.
[8] Zhigang Dai, Bolun Cai, Yugeng Lin, and Junying Chen. UP-DETR: Unsupervised pre-training for object detection with transformers. arXiv preprint arXiv:2011.09094, 2020.
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, 2019. Association for Computational Linguistics.
[11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[12] Nico Engel, Vasileios Belagiannis, and Klaus Dietmayer. Point transformer. arXiv preprint arXiv:2011.00931, 2020.
[13] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100, 2020.
[14] Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, and Yunhe Wang. Transformer in transformer. arXiv preprint arXiv:2103.00112, 2021.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[16] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[17] Han Hu, Zheng Zhang, Zhenda Xie, and Stephen Lin. Local relation networks for image recognition. arXiv preprint arXiv:1904.11491, 2019.
[18] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big Transfer (BiT): General visual representation learning. arXiv preprint arXiv:1912.11370, 2019.
[19] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[20] Yann Lecun, Patrick Haffner, Léon Bottou, and Yoshua Bengio. Object recognition with gradient-based learning. In Contour and Grouping in Computer Vision. Springer, 1999.
[21] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

[22] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics and Image Processing, Dec 2008.
[23] Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. Cats and dogs. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[24] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In International Conference on Machine Learning, pages 4055–4064. PMLR, 2018.
[25] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.
[26] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning, pages 5389–5400. PMLR, 2019.
[27] Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, and Ashish Vaswani. Bottleneck transformers for visual recognition. arXiv preprint arXiv:2101.11605, 2021.
[28] Zhiqing Sun, Shengcao Cao, Yiming Yang, and Kris Kitani. Rethinking transformer-based set prediction for object detection. arXiv preprint arXiv:2011.10881, 2020.
[29] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114. PMLR, 2019.
[30] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877, 2020.
[31] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008, 2017.
[32] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR), 2019.
[33] Huiyu Wang, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. MaX-DeepLab: End-to-end panoptic segmentation with mask transformers. arXiv preprint arXiv:2012.00759, 2020.
[34] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122, 2021.
[35] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
[36] Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, and Huaxia Xia. End-to-end video instance segmentation with transformers. arXiv preprint arXiv:2011.14503, 2020.
[37] Yujing Wang, Yaming Yang, Jiangang Bai, Mingliang Zhang, Jing Bai, Jing Yu, Ce Zhang, Gao Huang, and Yunhai Tong. Evolving attention with residual convolutions. arXiv preprint arXiv:2102.12895, 2021.
[38] Felix Wu, Angela Fan, Alexei Baevski, Yann N. Dauphin, and Michael Auli. Pay less attention with lightweight and dynamic convolutions. arXiv preprint arXiv:1901.10430, 2019.
[39] Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, and Song Han. Lite transformer with long-short range attention. arXiv preprint arXiv:2004.11886, 2020.
[40] Fuzhi Yang, Huan Yang, Jianlong Fu, Hongtao Lu, and Baining Guo. Learning texture transformer network for image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5791–5800, 2020.
[41] Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Francis E. H. Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-token ViT: Training vision transformers from scratch on ImageNet. arXiv preprint arXiv:2101.11986, 2021.
[42] Yanhong Zeng, Jianlong Fu, and Hongyang Chao. Learning joint spatial-temporal transformations for video inpainting. In European Conference on Computer Vision, pages 528–543. Springer, 2020.
[43] Minghang Zheng, Peng Gao, Xiaogang Wang, Hongsheng Li, and Hao Dong. End-to-end object detection with adaptive clustering transformer. arXiv preprint arXiv:2011.09315, 2020.
[44] Luowei Zhou, Yingbo Zhou, Jason J. Corso, Richard Socher, and Caiming Xiong. End-to-end dense video captioning with masked transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[45] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.

