0% found this document useful (0 votes)

8 views11 pages

Scalable Vision Transformers With Hierarchical Pooling

The document presents a novel Hierarchical Visual Transformer (HVT) that improves the efficiency and scalability of Visual Transformers by progressively pooling visual tokens to reduce sequence length during inference. This approach enhances computational savings while maintaining performance, outperforming existing models like DeiT on image classification benchmarks. The authors also propose predictions without relying on a class token, using average pooling over patch tokens instead, which yields better results.

Uploaded by

Chor Yin Ho

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

8 views11 pages

Scalable Vision Transformers With Hierarchical Pooling

Uploaded by

Chor Yin Ho

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 11

Scalable Vision Transformers with Hierarchical Pooling

Zizheng Pan Bohan Zhuang† Jing Liu Haoyu He Jianfei Cai

Dept of Data Science and AI, Monash University

Abstract
arXiv:2103.10619v2 [cs.CV] 18 Aug 2021

DeiT-B
The recently proposed Visual image Transformers (ViT) DeiT-S
with pure attention have achieved promising performance 80
R-152

Top-1 Acc (%)

on image recognition tasks, such as image classification. HVT-S-1 R-101
However, the routine of the current ViT model is to main- R-50
tain a full-length patch sequence during inference, which 75 Scale HVT-Ti-4
is redundant and lacks hierarchical representation. To this R-34
DeiT-Ti
end, we propose a Hierarchical Visual Transformer (HVT)
which progressively pools visual tokens to shrink the se- 70 R-18
HVT-Ti-1
HVT
quence length and hence reduces the computational cost,
DeiT
analogous to the feature maps downsampling in Convolu-
tional Neural Networks (CNNs). It brings a great benefit
ResNet
65
that we can increase the model capacity by scaling dimen- 0 4 8 12 16 20
sions of depth/width/resolution/patch size without introduc- GFLOPs
ing extra computational complexity due to the reduced se-
quence length. Moreover, we empirically find that the av-
Figure 1: Performance comparisons on ImageNet. With
erage pooled visual tokens contain more discriminative in-
comparable GFLOPs (1.25 vs. 1.39), our proposed Scale
formation than the single class token. To demonstrate the
HVT-Ti-4 surpasses DeiT-Ti by 3.03% in Top-1 accuracy.
improved scalability of our HVT, we conduct extensive ex-
periments on the image classification task. With compara-
To improve the efficiency, there are emerging efforts to
ble FLOPs, our HVT outperforms the competitive baselines
design efficient and scalable Transformers. On the one
on ImageNet and CIFAR-100 datasets. Code is available at
hand, some methods follow the idea of model compression
https://github.com/MonashAI/HVT.
to reduce the number of parameters and computational over-
head. Typical methods include knowledge distillation [19],
low-bit quantization [29] and pruning [12]. On the other
1. Introduction hand, the self-attention mechanism has quadratic memory
Equipped with the self-attention mechanism that has and computational complexity, which is the key efficiency
strong capability of capturing long-range dependencies, bottleneck of Transformer models. The dominant solu-
Transformer [37] based models have achieved significant tions include kernelization [20, 28], low-rank decomposi-
breakthroughs in many computer vision (CV) and natural tion [41], memory [30], sparsity [4] mechanisms, etc.
language processing (NLP) tasks, such as machine trans- Despite much effort has been made, there still lacks spe-
lation [10, 9], image classification [11, 36], segmentation cific efficient designs for Visual Transformers considering
[43, 39] and object detection [3, 48]. However, the good taking advantage of characteristics of visual patterns. In
performance of Transformers comes at a high computa- particular, ViT models maintain a full-length sequence in
tional cost. For example, a single Transformer model re- the forward pass across all layers. Such a design can suffer
quires more than 10G Mult-Adds to translate a sentence of from two limitations. Firstly, different layers should have
only 30 words. Such a huge computational complexity hin- different redundancy and contribute differently to the accu-
ders the widespread adoption of Transformers, especially racy and efficiency of the network. This statement can be
on resource-constrained devices, such as smart phones. supported by existing compression methods [35, 23], where
each layer has its optimal spatial resolution, width and bit-
† Corresponding author. Email: bohan.zhuang@monash.edu width. As a result, the full-length sequence may contain
huge redundancy. Secondly, it lacks multi-level hierarchi- • Extensive experiments show that, with comparable
cal representations, which is well known to be essential for FLOPs, our HVT outperforms the competitive base-
the success of image recognition tasks. line DeiT on image classification benchmarks, includ-
To solve the above limitations, we propose to gradually ing ImageNet and CIFAR-100.
downsample the sequence length as the model goes deeper.
Specifically, inspired by the design of VGG-style [33] and 2. Related Work
ResNet-style [14] networks, we partition the ViT blocks
Visual Transformers. The powerful multi-head self-
into several stages and apply the pooling operation (e.g.,
attention mechanism has motivated the studies of applying
average/max pooling) in each stage to shrink the sequence
Transformers on a variety of CV tasks. In general, cur-
length. Such a hierarchical design is reasonable since a re-
rent Visual Transformers can be mainly divided into two
cent study [7] shows that a multi-head self-attention layer
categories. The first category seeks to combine convolu-
with a sufficient number of heads can express any convo-
tion with self-attention. For example, Carion et al. [3] pro-
lution layers. Moreover, the sequence of visual tokens in
pose DETR for object detection, which firstly extracts vi-
ViT can be analogous to the flattened feature maps of CNNs
sual features with CNN backbone, followed by the feature
along the spatial dimension, where the embedding of each
refinement with Transformer blocks. BotNet [34] is a re-
token can be seen as feature channels. Hence, our design
cent study that replaces the convolution layers with mul-
shares similarities with the spatial downsampling of feature
tiheaded self-attention layers at the last stage of ResNet.
maps in CNNs. To be emphasized, the proposed hierarchi-
Other works [48, 18] also present promising results with
cal pooling has several advantages. (1) It brings consider-
this hybrid architecture. The second category aims to
able computational savings and improves the scalability of
design a pure attention-based architecture without convo-
current ViT models. With comparable floating-point opera-
lutions. Recently, Ramachandran et al. [27] propose a
tions (FLOPs), we can scale up our HVT by expanding the
model which replaces all instances of spatial convolutions
dimensions of width/depth/resolution. In addition, the re-
with a form of self-attention applied to ResNet. Hu et
duced sequential resolution also empowers the partition of
al. [17] propose LR-Net [17] that replaces convolution lay-
the input image into smaller patch sizes for high-resolution
ers with local relation layers, which adaptively determines
representations, which is needed for low-level vision and
aggregation weights based on the compositional relation-
dense prediction tasks. (2) It naturally leads to the generic
ship of local pixel pairs. Axial-DeepLab [40] is also pro-
pyramidal hierarchy, similar to the feature pyramid network
posed to use Axial-Attention [16], a generalization form of
(FPN) [24], which extracts the essential multi-scale hidden
self-attention, for Panoptic Segmentation. Dosovitskiy et
representations for many image recognition tasks.
al. [11] first transfers Transformer to image classification.
In addition to hierarchical pooling, we further propose The model inherits a similar architecture from standard
to perform predictions without the class token. Inherited Transformer in NLP and achieves promising results on Ima-
from NLP, conventional ViT models [11, 36] equip with a geNet, whereas it suffers from prohibitively expensive train-
trainable class token, which is appended to the input patch ing complexity. To solve this, the following work DeiT [36]
tokens, then refined by the self-attention layers, and is fi- propose a more advanced optimization strategy and a dis-
nally used for prediction. However, we argue that it is not tillation token, with improved accuracy and training effi-
necessary to rely on the extra class token for image clas- ciency. Moreover, T2T-ViT [45] aims to overcome the lim-
sification. To this end, we instead directly apply average itations of simple tokenization of input images in ViT and
pooling over patch tokens and use the resultant vector for propose to progressively structurize the image to tokens to
prediction, which achieves improved performance. We are capture rich local structural patterns. Nevertheless, the pre-
aware of a concurrent work [6] that also observes the similar vious literature all assumes the same architecture to the NLP
phenomenon. task, without the adaptation to the image recognition tasks.
Our contributions can be summarized as follows: In this paper, we propose several simple yet effective mod-
• We propose a hierarchical pooling regime that grad- ifications to improve the scalability of current ViT models.
ually reduces the sequence length as the layer goes
deeper, which significantly improves the scalability
and the pyramidal feature hierarchy of Visual Trans- Efficient Transformers. Transformer-based models are
formers. The saved FLOPs can be utilized to improve resource-hungry and compute-intensive despite their state-
the model capacity and hence the performance. of-the-art performance. We roughly summarize the efficient
Transformers into two categories. The first category fo-
• Empirically, we observe that the average pooled visual cuses on applying generic compression techniques to speed
tokens contain richer discriminative patterns than the up the inference, either based on quantization [47], prun-
class token for classification. ing [26, 12], and distillation [32] or seeking to use Neu-
0
1

0
1
Linear Projection
2

Transformer Block

Transformer Block
Transformer Block

Transformer Block

Transformer Block
cat

Average Pool
2

MLP Head
Max Pool
Max Pool

Max Pool
dog

0
3

1
…

1
4
bird

2
5

3
…
13

6
14

7
15

Stage 1 Stage 2 Stage 3

Patch + Pos
Without ‘CLS’

Figure 2: Overview of the proposed Hierarchical Visual Transformer. To reduce the redundancy in the full-length patch
sequence and construct a hierarchical representation, we propose to progressively pool visual tokens to shrink the sequence
length. To this end, we partition the ViT [11] blocks into several stages. At each stage, we insert a pooling layer after the first
Transformer block to perform down-sampling. In addition to the pooling layer, we perform predictions using the resultant
vector of average pooling the output visual tokens of the last stage instead of the class token only.

ral Architecture Search (NAS) [38] to explore better con- 3. Proposed Method
figurations. Another category aims to solve the quadratic
complexity issue of the self-attention mechanism. A rep- In this section, we first briefly revisit the preliminaries of
resentative approach [5, 20] is to express the self-attention Visual Transformers [11] and then introduce our proposed
weights as a linear dot-product of kernel functions and make Hierarchical Visual Transformer.
use of the associative property of matrix products to re-
duce the overall self-attention complexity from O(n2 ) to 3.1. Preliminary
O(n). Moreover, some works alternatively study diverse Let I ∈ RH×W ×C be an input image, where H, W
sparse patterns of self-attention [4, 21], or consider the low- and C represent the height, width, and the number of chan-
rank structure of the attention matrix [41], leading to lin- nels, respectively. To handle a 2D image, ViT first splits
ear time and memory complexity with respect to the se- the image into a sequence of flattened 2D patches X =
quence length. There are also some NLP literatures that P 2C
[x1p ; x2p ; ...; xN i
p ], where xp ∈ R is the i-th patch of the
tend to reduce the sequence length during processing. For
input image and [·] is the concatenation operation. Here,
example, Goyal et al. [13] propose PoWER-BERT, which
N = HW/P 2 is the number of patches and P is the size
progressively eliminates word tokens during the forward
of each patch. ViT then uses a trainable linear projection
pass. Funnel-Transformer [8] presents a pool-query-only
that maps each vectorized patch to a D dimension patch
strategy, pooling the query vector within each self-attention
embedding. Similar to the class token in BERT [10], ViT
layer. However, there are few literatures targeting improv-
prepends a learnable embedding xcls ∈ RD to the sequence
ing the efficiency of the ViT models.
of patch embeddings. To retain positional information, ViT
introduces an additional learnable positional embeddings
To compromise FLOPs, current ViT models divide the
E ∈ R(N +1)×D . Mathematically, the resulting represen-
input image into coarse patches (i.e., large patch size), hin-
tation of the input sequence can be formulated as
dering their generalization to dense predictions. In order to
bridge this gap, we propose a general hierarchical pooling
X0 = [xcls ; x1p W; x2p W; ...; xN
p W] + E, (1)
strategy that significantly reduces the computational cost
while enhancing the scalability of important dimensions 2
of the ViT architectures, i.e., depth, width, resolution and where W ∈ RP C×D is a learnable linear projection pa-
patch size. Moreover, our generic encoder also inherits the rameter. Then, the resulting sequence of embeddings serves
pyramidal feature hierarchy from classic CNNs, potentially as the input to the Transformer encoder [37].
benefiting many downstream recognition tasks. Also note Suppose that the encoder in a Transformer consists of
that different from a concurrent work [42] which applies L blocks. Each block contains a multi-head self-attention
2D patch merging, this paper introduces the feature hierar- (MSA) layer and a position-wise multi-layer perceptron
chy with 1D pooling. We discuss the impact of 2D pooling (MLP). For each layer, layer normalization (LN) [1] and
in Section 5.2. residual connections [14] are employed, which can be for-
mulated as follows needs to be updated. Moreover, previous work [8] in NLP
0 also find it important to complement positional information
Xl−1 = Xl−1 + MSA(LN(Xl−1 )), (2) after changing the sequence length. Therefore, at the m-th
0 0
Xl = Xl−1 + MLP(LN(Xl−1 )), (3) stage, we introduce an additional learnable positional em-
bedding Ebm to capture the positional information, which
where l ∈ [1, ..., L] is the index of Transformer blocks. can be formulated as
Here, a MLP contains two fully-connected layers with a
GELU non-linearity [15]. In order to perform classifica- X̂bm = MaxPool1D(Xbm ) + Ebm , (5)
tion, ViT applies a layer normalization layer and a fully-
where Xbm is the output of the Transformer block bm . We
connected (FC) layer to the first token of the Transformer
then forward the resulting embeddings X̂bm into the next
encoder’s output X0L . In this way, the output prediction y
Transformer block bm + 1.
can be computed by

y = FC(LN(X0L )). (4) 3.2.2 Prediction without the Class Token

3.2. Hierarchical Visual Transformer Previous works [11, 36] make predictions by taking the
class token as input in classification tasks as described in
In this paper, we propose a Hierarchical Visual Trans- Eq. (4). However, such structure relies solely on the sin-
former (HVT) to reduce the redundancy in the full-length gle class token with limited capacity while discarding the
patch sequence and construct a hierarchical representation. remaining sequence that is capable of storing more discrim-
In the following, we first propose a hierarchical pooling to inative information. To this end, we propose to remove the
gradually shrink the sequence length and hence reduce the class token in the first place and predict with the remaining
computational cost. Then, we propose to perform predic- output sequence on the last stage.
tions without the class token. The overview of the proposed Specifically, given the output sequence without the class
HVT is shown in Figure 2. token on the last stage XL , we first apply average pooling,
then directly apply an FC layer on the top of the pooled
3.2.1 Hierarchical Pooling embeddings and make predictions. The process can be for-
mulated as
We propose to apply hierarchical pooling in ViT for two
reasons: (1) Recent studies [13, 8] on Transformers show y = FC(AvgPool(LN(XL ))). (6)
that tokens tend to carry redundant information as it goes
deeper. Therefore, it would be beneficial to reduce these
3.3. Complexity Analysis
redundancies through the pooling approaches. (2) The in-
put sequence projected from image patches in ViT can be In this section, we analyse the block-wise compression
seen as flattened CNN feature maps with encoded spatial ratio with hierarchical pooling. Following ViT [11], we use
information, hence pooling from the nearby tokens can be FLOPs to measure the computational cost of a Transformer.
analogous to the spatial pooling methods in CNNs. Let n be the number of tokens in a sequence and d is the di-
Motivated by the hierarchical pipeline of VGG-style [33] mension of each token. The FLOPs of a Transformer block
and ResNet-style [14] networks, we partition the Trans- φBLK (n, d) can be computed by
former blocks into M stages and apply downsampling op-
eration to each stage to shrink the sequence length. Let φBLK (n, d) = φM SA (n, d) + φM LP (n, d),
(7)
{b1 , b2 , . . . , bM } be the indexes of the first block in each = 12nd2 + 2n2 d,
stage. At the m-th stage, we apply a 1D max pooling oper-
ation with a kernel size of k and stride of s to the output of where φM SA (n, d) and φM LP (n, d) are the FLOPs of the
the Transformer block bm ∈ {b1 , b2 , . . . , bM } to shrink the MSA and MLP, respectively. Details about Eq. (7) can be
sequence length. found in the supplementary material.
Note that the positional encoding is important for a Without loss of generality, suppose that the sequence
Transformer since the positional encoding is able to cap- length n is reduced by half after performing hierarchical
ture information about the relative and absolute position of pooling. In this case, the block-wise compression ratio α
the token in the sequence [37, 3]. In Eq. (1) of ViT, each can be computed by
patch is equipped with positional embedding E at the be- φBLK (n, d) 2
ginning. However, in our HVT, the original positional em- α= =2+ . (8)
φBLK (n/2, d) 12(d/n) + 1
bedding E may no longer be meaningful after pooling since
the sequence length is reduced after each pooling operation. Clearly, Eq. (8) is monotonic, thus the block-wise com-
In this case, positional embedding in the pooled sequence pression ratio α is bounded by (2, 4), i.e., α ∈ (2, 4).
ResNet50: conv1 ResNet50: conv42

DeiT-S: Linear Projection, N = 196 DeiT-S: Block1, N = 196

HVT-S-1: Linear Projection, N = 196 HVT-S-1: Block1, N = 97

Figure 3: Feature visualization of ResNet50 [14], DeiT-S [36] and our HVT-S-1 trained on ImageNet. DeiT-S and our HVT-
S-1 correspond to the small setting in DeiT, except that our model applies a pooling operation and performing predictions
without the class token. The resolution of the feature maps from ResNet50 conv1 and conv4 2 are 112×112 and 14×14,
respectively. For DeiT and HVT, the feature maps are reshaped from tokens. For our model, we interpolate the pooled
sequence to its initial length then reshape it to a 2D map.

4. Discussions 4.2. Scalability of HVT

The computational complexity reduction equips HVT
4.1. Analysis of Hierarchical Pooling with strong scalability in terms of width/depth/patch
size/resolution. Take DeiT-S for an example, the model
In CNNs, feature maps are usually downsampled to consists of 12 blocks and 6 heads. Given a 224×224 image
smaller sizes in a hierarchical way [33, 14]. In this pa- with a patch size of 16, the computational cost of DeiT-S is
per, we show that this principle can be applied to ViT mod- around 4.6G FLOPs. By applying four pooling operations,
els by comparing the visualized feature maps from ResNet our method is able to achieve nearly 3.3× FLOPs reduc-
conv4 2, DeiT-S [36] block1 and HVT-S-1 block1 in Fig- tion. Furthermore, to re-allocate the reduced FLOPs, we
ure 3. From the figure, in ResNet, the initial feature maps may construct wider or deeper HVT-S, with 11 heads or 48
after the first convolutional layer contain rich edge informa- blocks, then the overall FLOPs would be around 4.51G and
tion. After feeding the features to consecutive convolutional 4.33G, respectively. Moreover, we may consider a longer
layers and a pooling layer, the output feature maps tend sequence by setting a smaller patch size or using a larger
to preserve more high-level discriminative information. In resolution. For example, with a patch size of 8 and an im-
DeiT-S, following the ViT structure, although the image res- age resolution of 192×192, the FLOPs for HVT-S is around
olution for the feature maps has been reduced to 14 × 14 by 4.35G. Alternatively, enlarging the image resolution into
the initial linear projection layer, we can still observe clear 384×384 will lead to 4.48G FLOPs. In all of the above
edges and patterns. Then, the features get refined in the mentioned cases, the computational costs are still lower
first block to obtain sharper edge information. In contrast than that of DeiT-S while the model capacity is enhanced.
to DeiT-S that refines features at the same resolution level, It is worth noting that finding a principled way to scale
after the first block, the proposed HVT downsamples the up HVT to obtain the optimal efficiency-vs-accuracy trade-
hidden sequence through a pooling layer and reduces the off remains an open question. At the current stage, we take
sequence length by half. We then interpolate the sequence an early exploration by evenly partitioning blocks and fol-
back to 196 and reshape it to 2D feature maps. We can find lowing model settings in DeiT [36] for a fair comparison. In
that the hidden representations contain more abstract infor- fact, the improved scalability of HVT makes it possible for
mation with high discriminative power, which is similar to using Neural Architecture Search (NAS) to automatically
ResNet. find optimal configurations, such as EfficientNet [35]. We
7.0 80
leave for more potential studies for future work.
6.5
5. Experiments 6.0 60

Top-1 Acc.(%)
Training Loss
Compared methods. To investigate the effectiveness of 5.5
DeiT-Ti 40
HVT, we compare our method with DeiT [36] and a BERT- 5.0 Scale HVT-Ti-4
based pruning method PoWER-BERT [13]. DeiT is a rep-
4.5
resentative Vision Transformer and PoWER progressively
20
prunes unimportant tokens in pretrained BERT models for 4.0
inference acceleration. Moreover, we consider two archi- 3.5
tectures in DeiT for comparisons: HVT-Ti: HVT with the 0
tiny setting. HVT-S: HVT with the small setting. For con- 0 50 100 150 200 250 300
venience, we use “Architecture-M ” to represent our model
Epoch
with M pooling stages, e.g., HVT-S-1. Figure 4: Performance comparisons of DeiT-Ti (1.25G
FLOPs) and the proposed Scale HVT-Ti-4 (1.39G FLOPs).
All the models are evaluated on ImageNet. Solid lines de-
Datasets and Evaluation metrics. We evaluate our note the Top-1 accuracy (y-axis on the right). Dash lines
proposed HVT on two image classification benchmark denote the training loss (y-axis on the left).
datasets: CIFAR-100 [22] and ImageNet [31]. We measure
can significantly reduce redundancy while maintaining per-
the performance of different methods in terms of the Top-1
formance. Second, compared to PoWER, HVT-Ti-1 uses
and Top-5 accuracy. Following DeiT [36], we measure the
less FLOPs while achieving better performance. Besides,
computational cost by FLOPs. Moreover, we also measure
HVT-S-1 reduces more FLOPs than PoWER, while achiev-
the model size by the number of parameters (Params).
ing slightly lower performance than PoWER. Also note that
PoWER involves three training steps, while ours is a sim-
Implementation details. For experiments on ImageNet, pler one-stage training scheme.
we train our models for 300 epochs with a total batch size Moreover, we also compare the scaled HVT with DeiT
of 1024. The initial learning rate is 0.0005. We use AdamW under similar FLOPs. Specifically, we enlarge the embed-
optimizer [25] with a momentum of 0.9 for optimization. ding dimensions and add extra heads in HVT-Ti. From Ta-
We set the weight decay to 0.025. For fair comparisons, ble 1 and Figure 4, by re-allocating the saved FLOPs to
we keep the same data augmentation strategy as DeiT [36]. scale up the model, HVT can converge to a better solution
For the downsampling operation, we use max pooling by and yield improved performance. For example, the Top-
default. The kernel size k and stride s are set to 3 and 2, 1 accuracy on ImageNet can be improved considerably by
respectively, chosen by a simple grid search on CIFAR100. 3.03% in the tiny setting. More empirical studies on the
Besides, all learnable positional embeddings are initialized effect of model scaling can be found in Section 5.2.
in the same way as DeiT. More detailed settings on the other
hyper-parameters can be found in DeiT. For experiments 5.2. Ablation Study
on CIFAR-100, we train our models with a total batch size Effect of the prediction without the class token. To in-
of 128. The initial learning rate is set to 0.000125. Other vestigate the effect of the prediction without the class token,
hyper-parameters are kept the same as those on ImageNet. we train DeiT-Ti with and without the class token and show
the results in Table 2. From the results, the models without
5.1. Main Results the class token outperform the ones with the class token.
We compare the proposed HVT with DeiT and PoWER, The performance gains mainly come from the extra discrim-
and report the results in Table 1. First, compared to DeiT, inative information stored in the entire sequence without
our HVT achieves nearly 2× FLOPs reduction with a hi- the class token. Note that the performance improvement on
erarchical pooling. However, the significant FLOPs reduc- CIFAR-100 is much larger than that on ImageNet. It may
tion also leads to performance degradation in both the tiny be attributed that CIFAR-100 is a small dataset, which lacks
and small settings. Additionally, the performance drop of varieties compared with ImageNet. Therefore, the model
HVT-S-1 is smaller than that of HVT-Ti-1. For example, for trained on CIFAR-100 benefits more from the increase of
HVT-S-1, it only leads to 1.80% drop in the Top-1 accuracy. model’s discriminative power.
In contrast, it results in 2.56% drop in the Top-1 accuracy
for HVT-Ti-1. It can be attributed to that, compared with Effect of different pooling stages. We train HVT-S with
HVT-Ti-1, HVT-S-1 is more redundant with more parame- different pooling stages M ∈ {0, 1, 2, 3, 4} and show the
ters. Therefore, applying hierarchical pooling to HVT-S-1 results in Table 4. Note that HVT-S-0 is equivalent to the
Table 1: Performance comparisons with DeiT and PoWER on ImageNet. “Embedding Dim” refers to the dimension of
each token in the sequence. “#Heads” and “#Blocks” are the number of self-attention heads and blocks in Transformer,
respectively. “FLOPs” is measured with a 224×224 image. “Ti” and “S” are short for the tiny and small settings, respectively.
“Architecture-M ” denotes the model with M pooling stages. “Scale” denotes that we scale up the embedding dimension
and/or the number of self-attention heads. “DeiT-Ti/S + PoWER” refers to the model that applies the techniques in PoWER-
BERT [13] to DeiT-Ti/S.

Model Embedding Dim #Heads #Blocks FLOPs (G) Params (M) Top-1 Acc. (%) Top-5 Acc. (%)
DeiT-Ti [36] 192 3 12 1.25 5.72 72.20 91.10
DeiT-Ti + PoWER [13] 192 3 12 0.80 5.72 69.40 (-2.80) 89.20 (-1.90)
HVT-Ti-1 192 3 12 0.64 5.74 69.64 (-2.56) 89.40 (-1.70)
Scale HVT-Ti-4 384 6 12 1.39 22.12 75.23 (+3.03) 92.30 (+1.20)
DeiT-S [36] 384 6 12 4.60 22.05 79.80 95.00
DeiT-S + PoWER [13] 384 6 12 2.70 22.05 78.30 (-1.50) 94.00 (-1.00)
HVT-S-1 384 6 12 2.40 22.09 78.00 (-1.80) 93.83 (-1.17)

Table 2: Effect of the prediction without the class token. “CLS” denotes the class token.
ImageNet CIFAR-100
Model FLOPs (G) Params (M)
Top-1 Acc. (%) Top-5 Acc. (%) Top-1 Acc. (%) Top-5 Acc. (%)
DeiT-Ti with CLS 1.25 5.72 72.20 91.10 64.49 89.27
DeiT-Ti without CLS 1.25 5.72 72.42 (+0.22) 91.55 (+0.45) 65.93 (+1.44) 90.33 (+1.06)

Table 3: Performance comparisons on HVT-S-4 with three Table 5: Performance comparisons on HVT-S-4 with differ-
downsampling operations: convolution, max pooling and ent number of Transformer blocks. We report the Top-1 and
average pooling. We report the Top-1 and Top-5 accuracy Top-5 accuracy on CIFAR-100.
on CIFAR-100.
#Blocks FLOPs (G) Params (M) Top-1 Acc. (%) Top-5 Acc. (%)
Model Operation FLOPs (G) Params (M) Top-1 Acc. (%) Top-5 Acc. (%) 12 1.39 21.77 75.43 93.56
HVT-S Conv 1.47 23.54 69.75 92.12 16 1.72 28.87 75.32 93.30
HVT-S Avg 1.39 21.77 70.38 91.39
HVT-S Max 1.39 21.77 75.43 93.56 20 2.05 35.97 75.35 93.35
24 2.37 43.07 75.04 93.39
Table 4: Performance comparisons on HVT-S with different
pooling stages M . We report the Top-1 and Top-5 accuracy Table 6: Performance comparisons on HVT-Ti-4 with dif-
on CIFAR-100. ferent number of self-attention heads. We report the Top-1
M FLOPs Params
ImageNet CIFAR100 and Top-5 accuracy on CIFAR-100.
Top-1 (%) Top-5 (%) Top-1 (%) Top-5 (%)
0 4.57 21.70 80.39 95.13 71.99 92.44 #Heads FLOPs (G) Params (M) Top-1 Acc. (%) Top-5 Acc. (%)
1 2.40 21.74 78.00 93.83 74.27 93.07 3 0.38 5.58 69.51 91.78
2 1.94 21.76 77.36 93.55 75.37 93.69 6 1.39 21.77 75.43 93.56
3 1.62 21.77 76.32 92.90 75.22 93.90 12 5.34 86.01 76.26 93.39
4 1.39 21.77 75.23 92.30 75.43 93.56 16 9.39 152.43 76.30 93.16

DeiT-S without the class token. With the increase of M , introduces additional FLOPs and parameters. Besides, av-
HVT-S achieves better performance with decreasing FLOPs erage pooling performs slightly better than convolution in
on CIFAR-100, while on ImageNet we observe the accu- terms of the Top-1 accuracy. Compared with the two set-
racy degrades. One possible reason is that HVT-S is very tings, HVT-S-4 with max pooling performs much better as
redundant on CIFAR-100, such that pooling acts as a reg- it significantly surpasses average pooling by 5.05% in the
ularizer to avoid the overfitting problem and improves the Top-1 accuracy and 2.17% in the Top-5 accuracy. The re-
generalization of HVT on CIFAR-100. On ImageNet, we sult is consistent with the common sense [2] that max pool-
assume HVT is less redundant and a better scaling strategy ing performs well in a large variety of settings. To this end,
is required to improve the performance. we use max pooling in all other experiments by default.

Effect of different downsampling operations. To inves- Effect of model scaling. One of the important advan-
tigate the effect of different downsampling operations, we tages of the proposed hierarchical pooling is that we can
train HVT-S-4 with three downsampling strategies: convo- re-allocate the saved computational cost for better model ca-
lution, average pooling and max pooling. As Table 3 shows, pacity by constructing a model with a wider, deeper, larger
downsampling with convolution performs the worst even it resolution or smaller patch size configuration. Similar to the
CNNs literature [14, 44, 46], we study the effect of model Table 7: Performance comparisons on HVT-S-4 with dif-
scaling in the following. ferent image resolutions. We report the Top-1 and Top-5
Based on HVT-S-4, we first construct deeper models by accuracy on CIFAR-100.
increasing the number of blocks in Transformers. Specif- Resolution FLOPs (G) Params (M) Top-1 Acc. (%) Top-5 Acc. (%)
ically, we train 4 models with different number of blocks 160 0.69 21.70 73.84 92.90
L ∈ {12, 16, 20, 24}. As a result, each pooling stage for 224 1.39 21.77 75.43 93.56
320 3.00 21.92 75.54 94.18
different models would have 3, 4, 5, and 6 blocks, respec- 384 4.48 22.06 76.31 94.02
tively. We train 4 models on CIFAR-100 and report the re- Table 8: Performance comparisons on HVT-S-4 with differ-
sults in Table 5. From the results, we observe no more gains ent patch sizes P . We report the Top-1 and Top-5 accuracy
by stacking more blocks in HVT. on CIFAR-100.
Based on HVT-Ti-4, we then construct wider models by
increasing the number of self-attention heads. To be spe- P FLOPs (G) Params (M) Top-1 Acc. (%) Top-5 Acc. (%)
cific, we train 4 models with different numbers of self- 8 6.18 21.99 77.29 94.22
attention heads, i.e., 3, 6, 12, and 16, on CIFAR-100 and 16 1.39 21.77 75.43 93.56
32 0.37 22.55 68.15 90.19
report the results in Table 6. From the results, our mod-
els achieve better performance with the increase of width. Table 9: Effect of 2D pooling on HVT-S-2. We report the
For example, the model with 16 self-attention heads out- Top-1 and Top-5 accuracy on CIFAR-100. For HVT-S-2,
performs those with 3 self-attention heads by 6.79% in the we apply 2D max pooling and use a patch size of 8.
Top-1 accuracy and 1.38% in the Top-5 accuracy. Model FLOPs (G) Params (M) Top-1 Acc. (%) Top-5 Acc. (%)
Based on HVT-S-4, we further construct models with DeiT-S 4.60 21.70 71.99 92.44
HVT-S-2 (2D) 4.62 21.80 77.58 94.40
larger input image resolutions. Specifically, we train 4 mod-
els with different input image resolutions, i.e., 160, 224, and compare it with DeiT-S. The results show that HVT-S-
320, and 384, on CIFAR-100 and report the results in Ta- 2 with 2D pooling outperforms DeiT-S on CIFAR100 by a
ble 7. From the results, with the increase of image resolu- large margin with similar FLOPs. In this case, we assume
tion, our models achieve better performance. For example, that HVT can achieve promising performance with a ded-
the model with the resolution of 384 outperforms those with icated scaling scheme for 2D pooling. We will leave this
the resolution of 160 by 2.47% in the Top-1 accuracy and exploration for future work.
1.12% in the Top-5 accuracy. Nevertheless, increasing im-
age resolutions also leads to high computational cost. To 6. Conclusion and Future Work
make a trade-off between computational cost and accuracy,
we set the image resolution to 224 by default. In this paper, we have presented a Hierarchical Visual
We finally train HVT-S-4 with different patch sizes P ∈ Transformer, termed HVT, for image classification. In par-
{8, 16, 32} and show the results in Table 8. From the re- ticular, the proposed hierarchical pooling can significantly
sults, HVT-S-4 performs better with the decrease of patch compress the sequential resolution to save computational
size. For example, when the patch size decreases from 32 cost in a simple yet effective form. More importantly, this
to 8, our HVT-S achieves 9.14% and 4.03% gain in terms of strategy greatly improves the scalability of visual Trans-
the Top-1 and Top-5 accuracy. Intuitively, a smaller patch formers, making it possible to scale various dimensions -
size leads to fine-grained image patches and helps to learn depth, width, resolution and patch size. By re-allocating
high-resolution representations, which is able to improve the saved computational cost, we can scale up these dimen-
the classification performance. However, with a smaller sions for better model capacity with comparable or fewer
patch size, the patch sequence will be longer, which sig- FLOPs. Moreover, we have empirically shown that the vi-
nificantly increases the computational cost. To make a bal- sual tokens are more important than the single class token
ance between the computational cost and accuracy, we set for class prediction. Note that the scope of this paper only
the patch size to 16 by default. targets designing our HVT as an encoder. Future works may
include extending our HVT model to decoder and to solve
other mainstream CV tasks, such as object detection and se-
Exploration on 2D pooling. Compared to 1D pooling, mantic/instance segmentation. In addition, it would be in-
2D pooling brings more requirements. For example, it re- teresting to find a principled way to scale up HVT that can
quires a smaller patch size to ensure a sufficient sequence achieve better accuracy and efficiency.
length. Correspondingly, it is essential to reduce the heads
at the early stages to save FLOPs and memory consumption 7. Acknowledgements
from high-resolution feature maps. Besides, it also requires
to vary the blocks at each stage to control the overall model This research is partially supported by Monash FIT
complexity. In Table 9, we apply 2D pooling to HVT-S-2 Start-up Grant and Sensetime Gift Fund.
References [16] Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim
Salimans. Axial attention in multidimensional transformers.
[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hin- arXiv preprint arXiv:1912.12180, 2019. 2
ton. Layer normalization. arXiv preprint arXiv:1607.06450,
[17] Han Hu, Zheng Zhang, Zhenda Xie, and Stephen Lin. Local
2016. 3
relation networks for image recognition. In ICCV, 2019. 2
[2] Y-Lan Boureau, Jean Ponce, and Yann LeCun. A theoretical
analysis of feature pooling in visual recognition. In ICML, [18] Zilong Huang, Xinggang Wang, Lichao Huang, Chang
pages 111–118, 2010. 7 Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross
attention for semantic segmentation. In ICCV, 2019. 2
[3] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas
Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to- [19] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao
end object detection with transformers. In ECCV, 2020. 1, Chen, Linlin Li, Fang Wang, and Qun Liu. Tinybert: Distill-
2, 4 ing BERT for natural language understanding. In EMNLP,
[4] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. pages 4163–4174, 2020. 1
Generating long sequences with sparse transformers. arXiv [20] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and
preprint arXiv:1904.10509, 2019. 1, 3 François Fleuret. Transformers are rnns: Fast autoregressive
[5] Krzysztof Choromanski, Valerii Likhosherstov, David Do- transformers with linear attention. In ICML, pages 5156–
han, Xingyou Song, Andreea Gane, Tamás Sarlós, Peter 5165. PMLR, 2020. 1, 3
Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, [21] Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Re-
David Belanger, Lucy Colwell, and Adrian Weller. Rethink- former: The efficient transformer. In ICLR, 2020. 3
ing attention with performers. In ICLR, 2021. 3 [22] Alex Krizhevsky and Geoffrey Hinton. Learning multiple
[6] Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, Xi- layers of features from tiny images. Master’s thesis, Depart-
aolin Wei, Huaxia Xia, and Chunhua Shen. Conditional po- ment of Computer Science, University of Toronto, 2009. 6
sitional encodings for vision transformers. Arxiv preprint [23] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and
2102.10882, 2021. 2 Hans Peter Graf. Pruning filters for efficient convnets. In
[7] Jean-Baptiste Cordonnier, Andreas Loukas, and Martin ICLR, 2017. 1
Jaggi. On the relationship between self-attention and con- [24] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He,
volutional layers. In ICLR, 2020. 2 Bharath Hariharan, and Serge Belongie. Feature pyramid
[8] Zihang Dai, Guokun Lai, Yiming Yang, and Quoc Le. networks for object detection. In CVPR, pages 2117–2125,
Funnel-transformer: Filtering out sequential redundancy for 2017. 2
efficient language processing. In NeurIPS, 2020. 3, 4 [25] Ilya Loshchilov and Frank Hutter. Decoupled weight decay
[9] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, regularization. In ICLR, 2019. 6
Quoc Le, and Ruslan Salakhutdinov. Transformer-xl: Atten-
[26] Paul Michel, Omer Levy, and Graham Neubig. Are sixteen
tive language models beyond a fixed-length context. In ACL,
heads really better than one? In NeurIPS, pages 14014–
2019. 1
14024, 2019. 2
[10] J. Devlin, Ming-Wei Chang, Kenton Lee, and Kristina
[27] Niki Parmar, Prajit Ramachandran, Ashish Vaswani, Irwan
Toutanova. Bert: Pre-training of deep bidirectional trans-
Bello, Anselm Levskaya, and Jon Shlens. Stand-alone self-
formers for language understanding. In NAACL-HLT, 2019.
attention in vision models. In NeurIPS, pages 68–80, 2019.
1, 3
2
[11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov,
Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, [28] Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz,
Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- Noah A Smith, and Lingpeng Kong. Random feature atten-
vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is tion. In ICLR, 2021. 1
worth 16x16 words: Transformers for image recognition at [29] Gabriele Prato, Ella Charlaix, and Mehdi Rezagholizadeh.
scale. ICLR, 2021. 1, 2, 3, 4 Fully quantized transformer for machine translation. In
[12] Mitchell A. Gordon, Kevin Duh, and Nicholas Andrews. Trevor Cohn, Yulan He, and Yang Liu, editors, EMNLP,
Compressing BERT: studying the effects of weight pruning pages 1–14, 2020. 1
on transfer learning. In RepL4NLP@ACL, pages 143–155, [30] Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, and
2020. 1, 2 Timothy P Lillicrap. Compressive transformers for long-
[13] Saurabh Goyal, Anamitra Roy Choudhury, Saurabh Raje, range sequence modelling. In ICLR, 2020. 1
Venkatesan T. Chakaravarthy, Yogish Sabharwal, and Ashish [31] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, San-
Verma. Power-bert: Accelerating BERT inference via pro- jeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy,
gressive word-vector elimination. In ICML, 2020. 3, 4, 6, Aditya Khosla, Michael Bernstein, et al. Imagenet large
7 scale visual recognition challenge. IJCV, 115(3):211–252,
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. 6
Deep residual learning for image recognition. In CVPR, [32] Victor Sanh, Lysandre Debut, Julien Chaumond, and
2016. 2, 3, 4, 5, 8 Thomas Wolf. Distilbert, a distilled version of BERT:
[15] Dan Hendrycks and Kevin Gimpel. Gaussian error linear smaller, faster, cheaper and lighter. In NeurIPS EMC2 Work-
units (gelus). arXiv: Learning, 2016. 4 shop, 2019. 2
[33] Karen Simonyan and Andrew Zisserman. Very deep convo- Appendix
lutional networks for large-scale image recognition. In ICLR,
2015. 2, 4, 5 We organize our supplementary material as follows.
[34] Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon
Shlens, Pieter Abbeel, and Ashish Vaswani. Bottleneck • In Section S1, we elaborate on the components of
transformers for visual recognition. In CVPR, 2021. 2 a Transformer block, including the multi-head self-
[35] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model attention layer (MSA) and the position-wise multi-
scaling for convolutional neural networks. In ICML, 2019. layer perceptron (MLP).
1, 5
[36] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco • In Section S2, we provide details for the FLOPs calcu-
Massa, Alexandre Sablayrolles, and Herve Jegou. Training lation of a Transformer block.
data-efficient image transformers & distillation through at-
tention. In ICML, pages 10347–10357, 2021. 1, 2, 4, 5, 6, S1. Transformer Block
7
[37] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- S1.1. Multi-head Self-Attention
reit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia
Let X ∈ RN ×D be the input sentence, where N is the
Polosukhin. Attention is all you need. In NeurIPS, 2017. 1,
3, 4
sequence length and D the embedding dimension. First, a
self-attention layer computes query, key and value matrices
[38] Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng
from X using linear transformations
Zhu, Chuang Gan, and Song Han. HAT: hardware-aware
transformers for efficient natural language processing. In
ACL, pages 7675–7688, 2020. 3 [Q, K, V] = XWqkv , (9)
[39] Huiyu Wang, Yukun Zhu, Hartwig Adam, Alan Yuille, and
Liang-Chieh Chen. Max-deeplab: End-to-end panoptic seg- where Wqkv ∈ RD×3Dh is a learnable parameter and Dh is
mentation with mask transformers. In CVPR, 2021. 1 the dimension of each self-attention head. Next, the atten-
[40] Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam,
tion map A can be calculated by scaled inner product from
Alan L. Yuille, and Liang-Chieh Chen. Axial-deeplab: Q and K and normalized by a softmax function
Stand-alone axial-attention for panoptic segmentation. In p
ECCV, 2020. 2 A = Softmax(QK> / Dh ), (10)
[41] Sinong Wang, Belinda Li, Madian Khabsa, Han Fang, and
Hao Ma. Linformer: Self-attention with linear complexity. where A ∈ RN ×N and Aij represents for the attention
arXiv preprint arXiv:2006.04768, 2020. 1, 3 score between the Qi and Kj . Then, the self-attention op-
[42] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao eration is applied on the value vectors to produce an output
Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyra- matrix
mid vision transformer: A versatile backbone for dense pre- O = AV, (11)
diction without convolutions. In ICCV, 2021. 3
[43] Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, where O ∈ RN ×Dh . For a multi-head self-attention layer
Baoshan Cheng, Hao Shen, and Huaxia Xia. End-to-end with D/Dh heads, the outputs can be calculated by a linear
video instance segmentation with transformers. In CVPR, projection for the concatenated self-attention outputs
2021. 1
0
[44] Zifeng Wu, Chunhua Shen, and Anton van den Hengel. X = [O1 ; O2 ; ...; OD/Dh ]Wproj , (12)
Wider or deeper: Revisiting the resnet model for visual
recognition. arXiv preprint arXiv:1611.10080, 2016. 8 where Wproj ∈ RD×D is a learnable parameter and [·] de-
[45] Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, notes the concatenation operation.
Francis EH Tay, Jiashi Feng, and Shuicheng Yan. Tokens-
to-token vit: Training vision transformers from scratch on S1.2. Position-wise Multi-Layer Perceptron
imagenet. In ICCV, 2021. 2 0
[46] Sergey Zagoruyko and Nikos Komodakis. Wide residual net- Let X be the output from the MSA layer. An MLP layer
works. In BMVC, 2016. 8 which contains two fully-connected layers with a GELU
[47] Wei Zhang, Lu Hou, Yichun Yin, Lifeng Shang, Xiao Chen, non-linearity can be represented by
Xin Jiang, and Qun Liu. Ternarybert: Distillation-aware 0
ultra-low bit BERT. In EMNLP, pages 509–521, 2020. 2 X = GELU(X Wf c1 )Wf c2 , (13)
[48] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang,
and Jifeng Dai. Deformable DETR: deformable transformers where Wf c1 ∈ RD×4D and Wf c2 ∈ R4D×D are learnable
for end-to-end object detection. ICLR, 2021. 1, 2 parameters.
S2. FLOPs of a Transformer Block
We denote φ(n, d) as a function of FLOPs with respect to
the sequence length n and the embedding dimension d. For
an MSA layer, The FLOPs mainly comes from four parts:
(1) The projection of Q,K,V matrices φqkv (n, d) = 3nd2 .
(2) The calculation of the attention map φA (n, d) = n2 d.
(3) The self-attention operation φO (n, d) = n2 d. (4)
And finally, a linear projection for the concatenated self-
attention outputs φproj (n, d) = nd2 . Therefore, the overall
FLOPs for an MSA layer is

φM SA (n, d) = φqkv (n, d) + φA (n, d) + φO (n, d) + φproj (n, d)

= 3nd2 + n2 d + n2 d + nd2
= 4nd2 + 2n2 d.
(14)

For an MLP layer, the FLOPs mainly comes from two

fully-connected (FC) layers. The first FC layer f c1 is used
to project each token from Rd to R4d . The next FC layer
f c2 projects each token back to Rd . Therefore, the FLOPs
for an MLP layer is

φM LP (n, d) = φf c1 (n, d)+φf c2 (n, d) = 4nd2 +4nd2 = 8nd2 .

(15)
By combining Eq. (14) and Eq. (15), we can get the total
FLOPs of one Transformer block

φBLK (n, d) = φM SA (n, d)+φM LP (n, d) = 12nd2 +2n2 d.

(16)

A Survey On Efficient Vision Transformers Algorithms Techniques and Performance Benchmarking
No ratings yet
A Survey On Efficient Vision Transformers Algorithms Techniques and Performance Benchmarking
19 pages
Deitmar2012 Book AutomorphicForms 220408 050805
No ratings yet
Deitmar2012 Book AutomorphicForms 220408 050805
255 pages
An Image Is Worth More Than 16x16 Patches - Explorting Transformers On Individual Pixels
No ratings yet
An Image Is Worth More Than 16x16 Patches - Explorting Transformers On Individual Pixels
21 pages
LLM
No ratings yet
LLM
28 pages
NeurIPS 2024
No ratings yet
NeurIPS 2024
14 pages
Classification of Compact Complex Homogeneous Mani
No ratings yet
Classification of Compact Complex Homogeneous Mani
33 pages
Bhojanapalli Understanding Robustness of Transformers For Image Classification ICCV 2021 Paper
No ratings yet
Bhojanapalli Understanding Robustness of Transformers For Image Classification ICCV 2021 Paper
11 pages
Pyramid Vision Transformer: A Versatile Backbone For Dense Prediction Without Convolutions
No ratings yet
Pyramid Vision Transformer: A Versatile Backbone For Dense Prediction Without Convolutions
15 pages
Lightweight
No ratings yet
Lightweight
23 pages
Vision Transformer (ViT)
No ratings yet
Vision Transformer (ViT)
26 pages
Bottleneck Transformers For Visual Recognition
No ratings yet
Bottleneck Transformers For Visual Recognition
11 pages
Vitae: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias
No ratings yet
Vitae: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias
23 pages
2022 - ViTAEv2 - Zhang Et Al - Arxiv
No ratings yet
2022 - ViTAEv2 - Zhang Et Al - Arxiv
22 pages
ViT Transformers SEMINAR
No ratings yet
ViT Transformers SEMINAR
16 pages
Demailly
No ratings yet
Demailly
11 pages
【图片重叠采样+VIT多层提炼+架构】Tokens-To-Token ViT Training Vision Transformers From Scratch on ImageNet
No ratings yet
【图片重叠采样+VIT多层提炼+架构】Tokens-To-Token ViT Training Vision Transformers From Scratch on ImageNet
10 pages
Understanding Robustness of Transformers For Image
No ratings yet
Understanding Robustness of Transformers For Image
23 pages
Transnext: Robust Foveal Visual Perception For Vision Transformers
No ratings yet
Transnext: Robust Foveal Visual Perception For Vision Transformers
22 pages
【PVT】Wang Pyramid Vision Transformer a Versatile Backbone for Dense Prediction Without ICCV 2021 Paper
No ratings yet
【PVT】Wang Pyramid Vision Transformer a Versatile Backbone for Dense Prediction Without ICCV 2021 Paper
11 pages
Ef Cient Training of Visual Transformers With Small Datasets - Liu Et Al
No ratings yet
Ef Cient Training of Visual Transformers With Small Datasets - Liu Et Al
13 pages
Comprehensive Survey of Model Compression and Speed Up For Vision Transformers - Chen Et Al
No ratings yet
Comprehensive Survey of Model Compression and Speed Up For Vision Transformers - Chen Et Al
12 pages
An Image Is Worth More Than 16x16 Patches
No ratings yet
An Image Is Worth More Than 16x16 Patches
23 pages
Chen CrossViT Cross-Attention Multi-Scale Vision Transformer For Image Classification ICCV 2021 Paper
No ratings yet
Chen CrossViT Cross-Attention Multi-Scale Vision Transformer For Image Classification ICCV 2021 Paper
10 pages
NeurIPS 2021 Not All Images Are Worth 16x16 Words Dynamic Transformers For Efficient Image Recognition Paper
No ratings yet
NeurIPS 2021 Not All Images Are Worth 16x16 Words Dynamic Transformers For Efficient Image Recognition Paper
14 pages
Deit Iii: Revenge of The Vit: Hugo Touvron Matthieu Cord Herv E J Egou Meta Ai Sorbonne University
No ratings yet
Deit Iii: Revenge of The Vit: Hugo Touvron Matthieu Cord Herv E J Egou Meta Ai Sorbonne University
27 pages
AE-ViT: Token Enhancement For Vision Transformers Via CNN-based Autoencoder Ensembles.
No ratings yet
AE-ViT: Token Enhancement For Vision Transformers Via CNN-based Autoencoder Ensembles.
12 pages
Solution 02
No ratings yet
Solution 02
7 pages
2024 GVT Shan Chen Arxiv
No ratings yet
2024 GVT Shan Chen Arxiv
9 pages
2022 - PVT v2
No ratings yet
2022 - PVT v2
10 pages
A Simple Single-Scale Vision Transformer For Object Detection and Instance Segmentation
No ratings yet
A Simple Single-Scale Vision Transformer For Object Detection and Instance Segmentation
23 pages
Deep Learning Paper About Vit
No ratings yet
Deep Learning Paper About Vit
12 pages
Good Note - ViT
No ratings yet
Good Note - ViT
13 pages
Rethinking Local Perception in Lightweight Vision Transformer
No ratings yet
Rethinking Local Perception in Lightweight Vision Transformer
14 pages
CVT: Introducing Convolutions To Vision Transformers
No ratings yet
CVT: Introducing Convolutions To Vision Transformers
10 pages
Bf 00148221
No ratings yet
Bf 00148221
9 pages
Higher Operads, Higher Categories: Tom Leinster
No ratings yet
Higher Operads, Higher Categories: Tom Leinster
410 pages
2CPSC531 Simulation
No ratings yet
2CPSC531 Simulation
28 pages
Marin Token Pooling in Vision Transformers For Image Classification WACV 2023 Paper
No ratings yet
Marin Token Pooling in Vision Transformers For Image Classification WACV 2023 Paper
10 pages
Whitehead 1948 Hopf Invariants
No ratings yet
Whitehead 1948 Hopf Invariants
47 pages
ViT Lite
No ratings yet
ViT Lite
11 pages
Vision Transformer Adapter For Dense Predictions
No ratings yet
Vision Transformer Adapter For Dense Predictions
20 pages
Understanding The Robustness in Vision Transformers
No ratings yet
Understanding The Robustness in Vision Transformers
17 pages
Challenging Task
No ratings yet
Challenging Task
21 pages
Li Et Al. - 2022 - EfficientFormer Vision Transformers at MobileNet Speed
No ratings yet
Li Et Al. - 2022 - EfficientFormer Vision Transformers at MobileNet Speed
19 pages
10 Transformers
No ratings yet
10 Transformers
22 pages
Li Et Al. - 2022 - Rethinking Vision Transformers For MobileNet Size and Speed
No ratings yet
Li Et Al. - 2022 - Rethinking Vision Transformers For MobileNet Size and Speed
15 pages
GPU友好稀疏量化Boost Vision Transformer
No ratings yet
GPU友好稀疏量化Boost Vision Transformer
11 pages
AN IMAGE IS WORTH 16X16 WORDS TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE Hirtika Mirghani
No ratings yet
AN IMAGE IS WORTH 16X16 WORDS TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE Hirtika Mirghani
2 pages
Better Vision Transformer Via Token Pooling and Attention Sharing
No ratings yet
Better Vision Transformer Via Token Pooling and Attention Sharing
13 pages
Chen GLiT Neural Architecture Search For Global and Local Image Transformer ICCV 2021 Paper
No ratings yet
Chen GLiT Neural Architecture Search For Global and Local Image Transformer ICCV 2021 Paper
10 pages
Crossvit: Cross-Attention Multi-Scale Vision Transformer For Image Classification
No ratings yet
Crossvit: Cross-Attention Multi-Scale Vision Transformer For Image Classification
12 pages
Vision Transformer For Small-Size Datasets
No ratings yet
Vision Transformer For Small-Size Datasets
11 pages
Efficient V It
No ratings yet
Efficient V It
11 pages
XXXBetter Plain ViT Baselines For ImageNet-1k
No ratings yet
XXXBetter Plain ViT Baselines For ImageNet-1k
3 pages
NeurIPS 2021 Transformer in Transformer Paper
No ratings yet
NeurIPS 2021 Transformer in Transformer Paper
12 pages
CVT: Introducing Convolutions To Vision Transformers
No ratings yet
CVT: Introducing Convolutions To Vision Transformers
10 pages
用于目标检测的视觉Transformer的训练策略
No ratings yet
用于目标检测的视觉Transformer的训练策略
9 pages
Chen Visformer The Vision-Friendly Transformer ICCV 2021 Paper
No ratings yet
Chen Visformer The Vision-Friendly Transformer ICCV 2021 Paper
10 pages
3muri Brief Theory PDF
No ratings yet
3muri Brief Theory PDF
19 pages
Yin a-ViT Adaptive Tokens For Efficient Vision Transformer CVPR 2022 Paper
No ratings yet
Yin a-ViT Adaptive Tokens For Efficient Vision Transformer CVPR 2022 Paper
10 pages
Chapter 3 An Illustrative Example of Case 1 Best-Worst Scaling - Non-Market Valuation With R
No ratings yet
Chapter 3 An Illustrative Example of Case 1 Best-Worst Scaling - Non-Market Valuation With R
41 pages
Linear Programming: X X 0, y 0 Are Co-Ordinate Axes. X 0, y 0 Represents The Region in 1
No ratings yet
Linear Programming: X X 0, y 0 Are Co-Ordinate Axes. X 0, y 0 Represents The Region in 1
4 pages
Research Notes
No ratings yet
Research Notes
9 pages
Transformer in Transformer: Kai Han An Xiao Enhua Wu Jianyuan Guo Chunjing Xu Yunhe Wang
No ratings yet
Transformer in Transformer: Kai Han An Xiao Enhua Wu Jianyuan Guo Chunjing Xu Yunhe Wang
14 pages
03 - ViViT - A Video Vision Transformer
No ratings yet
03 - ViViT - A Video Vision Transformer
13 pages
Applsci 13 05521 v2
No ratings yet
Applsci 13 05521 v2
17 pages
Video Swin Transformer
No ratings yet
Video Swin Transformer
12 pages
Chapter 5 (5.3)
No ratings yet
Chapter 5 (5.3)
84 pages
AMA2111 Tutorial 3.1
No ratings yet
AMA2111 Tutorial 3.1
2 pages
A Simple Single-Scale Vision Transformer For Object Localization
No ratings yet
A Simple Single-Scale Vision Transformer For Object Localization
12 pages
ViViT: A Video Vision Transformer
No ratings yet
ViViT: A Video Vision Transformer
14 pages
MCQ Question
No ratings yet
MCQ Question
5 pages
Vi Transformer
No ratings yet
Vi Transformer
21 pages
Animated Algorithm Visualizer For Graph Based Algorithms and Recursive Programs
No ratings yet
Animated Algorithm Visualizer For Graph Based Algorithms and Recursive Programs
5 pages
What Is Machine Learning - IBM
No ratings yet
What Is Machine Learning - IBM
1 page
SECA4002
No ratings yet
SECA4002
65 pages
K Means Clustering Algorithm - Explained - DnI Institute
0% (1)
K Means Clustering Algorithm - Explained - DnI Institute
11 pages
Notes On Log-Linearization
No ratings yet
Notes On Log-Linearization
15 pages
Forced Vibration of Two Degrees of Freedom System
No ratings yet
Forced Vibration of Two Degrees of Freedom System
4 pages
Cracking The Horse Racing Code SUMMARY
No ratings yet
Cracking The Horse Racing Code SUMMARY
7 pages
Updating Weight
No ratings yet
Updating Weight
9 pages
Data Structures Unit-1 Chapter-1
No ratings yet
Data Structures Unit-1 Chapter-1
17 pages
(Big Data Analysis) : Python Scikit-Learn 機器學習
No ratings yet
(Big Data Analysis) : Python Scikit-Learn 機器學習
97 pages
Session 08022022
No ratings yet
Session 08022022
128 pages
03 A Polynomial Linear Regression
No ratings yet
03 A Polynomial Linear Regression
6 pages
The AM-GM Inequality
No ratings yet
The AM-GM Inequality
2 pages
References
No ratings yet
References
9 pages
Chapter 7 Complexity
No ratings yet
Chapter 7 Complexity
21 pages
Adsaa Sem Important Questions
No ratings yet
Adsaa Sem Important Questions
3 pages
Intersection Theory Notes
No ratings yet
Intersection Theory Notes
68 pages
Helmet and Vehicle License Plate Detection System
No ratings yet
Helmet and Vehicle License Plate Detection System
26 pages
BE Honours (Text, Web and Social Media Analytics
No ratings yet
BE Honours (Text, Web and Social Media Analytics
1 page
Bs 341 Exam Tutorial 1
No ratings yet
Bs 341 Exam Tutorial 1
6 pages
Multilevel Queue Scheduling Algorithm
No ratings yet
Multilevel Queue Scheduling Algorithm
9 pages
Franc Ois Greer: N 2 I n+1 T
No ratings yet
Franc Ois Greer: N 2 I n+1 T
8 pages
Quintic B-Spline Method For Numerical Solution of Fourth Order Singular Perturbation Boundary Value Problems
No ratings yet
Quintic B-Spline Method For Numerical Solution of Fourth Order Singular Perturbation Boundary Value Problems
11 pages
Tree Search Using MPI With Static and Dynamic Partitioning PDF
No ratings yet
Tree Search Using MPI With Static and Dynamic Partitioning PDF
9 pages
CS 170, Spring 2020 HW1 A. Chiesa & J. Nelson
No ratings yet
CS 170, Spring 2020 HW1 A. Chiesa & J. Nelson
4 pages
Quantum Breakdown.: Catalysts Coding Contest 3 April 2020
No ratings yet
Quantum Breakdown.: Catalysts Coding Contest 3 April 2020
7 pages
Solving Initial Value Problems Using Laplace Transforms
No ratings yet
Solving Initial Value Problems Using Laplace Transforms
4 pages
Scanline Rendering: Exploring Visual Realism Through Scanline Rendering Techniques
From Everand
Scanline Rendering: Exploring Visual Realism Through Scanline Rendering Techniques
Fouad Sabry
No ratings yet

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

Scalable Vision Transformers With Hierarchical Pooling

Uploaded by

Scalable Vision Transformers With Hierarchical Pooling

Uploaded by

Scalable Vision Transformers with Hierarchical Pooling

Zizheng Pan Bohan Zhuang† Jing Liu Haoyu He Jianfei Cai

Top-1 Acc (%)

Stage 1 Stage 2 Stage 3

y = FC(LN(X0L )). (4) 3.2.2 Prediction without the Class Token

DeiT-S: Linear Projection, N = 196 DeiT-S: Block1, N = 196

HVT-S-1: Linear Projection, N = 196 HVT-S-1: Block1, N = 97

4. Discussions 4.2. Scalability of HVT

φM SA (n, d) = φqkv (n, d) + φA (n, d) + φO (n, d) + φproj (n, d)

For an MLP layer, the FLOPs mainly comes from two

φM LP (n, d) = φf c1 (n, d)+φf c2 (n, d) = 4nd2 +4nd2 = 8nd2 .

φBLK (n, d) = φM SA (n, d)+φM LP (n, d) = 12nd2 +2n2 d.

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.