
Engineering Applications of Artificial Intelligence 126 (2023) 107021


Survey paper

Transformer for object detection: Review and benchmark✩


Yong Li a,*, Naipeng Miao a, Liangdi Ma b, Feng Shuang a, Xingwen Huang a

a Guangxi Key Laboratory of Intelligent Control and Maintenance of Power Equipment, School of Electrical Engineering, Guangxi University, No. 100, Daxuedong Road, Xixiangtang District, Nanning, 530004, Guangxi, China
b School of Software, Tsinghua University, No. 30 Shuangqing Road, Haidian District, Beijing, 100084, China

ARTICLE INFO

Keywords: Review; Object detection; Transformer-based models; COCO2017 dataset; Benchmark

ABSTRACT

Object detection is a crucial task in computer vision (CV). With the rapid advancement of Transformer-based models in natural language processing (NLP) and various visual tasks, Transformer structures are becoming increasingly prevalent in CV tasks. In recent years, numerous Transformer-based object detectors have been proposed, achieving performance comparable to mainstream convolutional neural network-based (CNN-based) approaches. To provide researchers with a comprehensive understanding of the development, advantages, disadvantages, and future potential of Transformer-based object detectors in Artificial Intelligence (AI), this paper systematically reviews the mainstream methods and analyzes the limitations and challenges encountered in their current applications, while also offering insights into future research directions. We have reviewed a large number of papers, selected the most prominent Transformer detection methods, and divided them into Transformer Neck and Transformer Backbone categories for introduction and comparative analysis. Furthermore, we have constructed a benchmark using the COCO2017 dataset to evaluate different object detection algorithms. Finally, we summarize the challenges and prospects in this field.

1. Introduction

Object detection is a fundamental task in computer vision that requires simultaneous classification and localization of potential objects within a single image (Zhao et al., 2019). As such, it plays a crucial role in various applications, including autonomous driving (Chen et al., 2015, 2017), face recognition (Sung and Poggio, 1998), pedestrian detection (Dollar et al., 2012), and medical detection (Kobatake and Yoshinaga, 1996). The performance of object detection directly influences object tracking, environment perception, and scene understanding (Felzenszwalb et al., 2010). Recently, deep learning-based object detection methods have gained considerable attention due to the rapid development of deep learning. However, numerous challenges remain, such as balancing accuracy and efficiency, handling multi-scale objects, and creating lightweight models.

Traditional mainstream object detection methods have predominantly utilized convolutional neural networks (CNNs), including Faster R-CNN (Ren et al., 2016), SSD (Liu et al., 2016), and YOLO with its variants (Redmon et al., 2016; Redmon and Farhadi, 2018, 2017; Bochkovskiy et al., 2020; Ge et al., 2021). Owing to the remarkable success of Transformers in natural language processing (NLP), researchers have endeavored to adapt Transformer architectures for computer vision tasks. As a result, numerous Transformer-based vision models have emerged in recent years, achieving performance levels that are comparable or even superior to their CNN counterparts.

Transformer (Vaswani et al., 2017) was initially proposed as an architecture based on the self-attention mechanism for machine translation and sequence modeling tasks (Sutskever et al., 2014). In recent years, Transformer has experienced significant advancements in NLP and has become a mainstream deep learning model, such as BERT (Devlin et al., 2018) and its variants (Lan et al., 2019; Liu et al., 2019), the GPT series (Radford et al., 2018, 2019; Brown et al., 2020), and others. Due to its scalability, Transformer can be pre-trained on large datasets and subsequently fine-tuned for downstream tasks.

Transformers in object detection have garnered increasing attention, particularly over the last three years. Several high-performance models have been proposed, such as DETR (Carion et al., 2020), Deformable DETR (Zhu et al., 2021), Swin Transformer (Liu et al., 2021b,a), DINO (Zhang et al., 2022a), and more. Currently, Transformer-based models have emerged as a new paradigm in object detection, making a systematic analysis and evaluation of numerous existing Transformer-based detectors essential for future research.

✩ This work was supported by the Guangxi Science and Technology base and Talent Project (Grant No. Guike AD22080043), the Key Laboratory of
Advanced Manufacturing Technology, Ministry of Education (Grant No. GZUAMT2021KF04), and the National Natural Science Foundation of China (Grant No.
61720106009).
∗ Corresponding author.
E-mail address: yongli@gxu.edu.cn (Y. Li).

https://doi.org/10.1016/j.engappai.2023.107021
Received 22 October 2022; Received in revised form 25 May 2023; Accepted 19 August 2023
Available online 4 September 2023
0952-1976/© 2023 Elsevier Ltd. All rights reserved.

Fig. 1. Chronological overview of most Transformer-based object detection methods.

Some reviews (Khan et al., 2021; Liu et al., 2021c; Arkin et al.,
2021; Han et al., 2022; Arkin et al., 2022) have provided detailed
introductions and analyses of Transformer-based detectors. In contrast
to these surveys, our study not only presents a thorough comparison
of the strengths and weaknesses of object detectors based on both
Transformer and CNN architectures, but also classifies the prevalent
Transformer-based detectors into Transformer Backbone and Trans-
former Neck categories. Moreover, we systematically analyze their
performance, potential, and limitations. We investigate the advance-
ments and constraints of various state-of-the-art Transformer-based
detectors (Table 6) and establish benchmarks for these methods using
the COCO2017 dataset (Tables 4 and 5). We hope this review delivers
a comprehensive understanding of Transformer-based object detectors
for researchers.
We have categorized existing methods into two groups based on
the role of Transformer in the overall model architecture: Transformer
Neck and Transformer Backbone, as illustrated in Fig. 1. We present a detailed analysis of representative methods, compare these methods horizontally on the COCO2017 dataset (Lin et al., 2014), and summarize the novelty of each method, such as Transformer Backbone with hierarchical representation structure, spatial prior acceleration based on sparse attention, and pure sequence processing for object detection, among others. The main contributions of this paper are as follows:

1. We provide a comprehensive summary of state-of-the-art Transformer-based object detectors from the past three years, highlighting recent breakthroughs in Transformer architecture for object detection. For each representative model, we offer an in-depth analysis while examining its relationship and connections with other models, both incrementally and comparatively. Moreover, we compare the strengths and weaknesses of Transformer and CNN architectures, and further discuss the performance, key features, and limitations of both Transformer Neck (DETR-like models) and Transformer Backbone (ViT-like models).
2. We comprehensively compare mainstream models on the same dataset, establish a benchmark based on the COCO2017 dataset, and offer insightful discussions.
3. We present an in-depth analysis of the transition as Transformer architecture extends from sequence to visual tasks. Furthermore, we discuss the future development of Transformer and CNN approaches in object detection.

The rest of this paper is organized as follows. Section 2 introduces the main object detection datasets and evaluation metrics, as well as the attention mechanism and the basic Transformer architecture. Section 3 outlines the current mainstream Transformer-based object detectors. Section 4 discusses the methods of these models in a multi-level comparison. Section 5 concludes the paper with an outlook.
2. Transformer architecture

Transformer is an architecture based on the attention mechanism proposed by Vaswani et al. (2017) in 2017, which was initially used for machine translation tasks and subsequently achieved great success in NLP (Devlin et al., 2018). The success of Transformer is attributed to its unique architecture, whose core design is the Encoder–Decoder structure based on self-attention. As shown in Fig. 2, Transformer consists of three main blocks: multi-head attention, positional encoding, and the feed-forward network. The multi-head attention (MHA) block and the feed-forward network block are the main modules of the Encoder and the Decoder. Positional encoding is a vital module for all Transformer variants and is responsible for attaching position information to the input sequence. In this section, these fundamental techniques are described in detail.

Fig. 2. Transformer structure. The Input Embedding module of the Transformer encoder (left column) maps the input sequence to the embedding space and passes it to the encoder module for processing. The Transformer decoder (right column) receives the previous output sequence and the output sequence from the intermediate encoder. The previous output sequence is shifted one position to the right, and the start token is prepended to obtain the decoder input. The feed-forward network and the multi-head attention module are repeated N times to form the encoder and decoder.

2.1. Basic architecture

The structure of Transformer is based on the encoder–decoder. The encoder consists of N basic encoder modules, as shown in Fig. 2. Every encoder module consists of a multi-head attention module (MHA) and a feed-forward network (FFN), which are cascaded one by one with residual connections and layer normalization. The output of each encoder sub-layer is given in Eq. (1):

$$\mathrm{Output} = \mathrm{LayerNorm}(x + \mathrm{SubLayer}(x)), \quad (1)$$

where $x$ is the input sequence, and $\mathrm{SubLayer}$ represents the attention module or the feed-forward network.
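To make Eq. (1) concrete, the following is a minimal PyTorch sketch (not taken from the paper) of one encoder module: each sub-layer, the multi-head attention and the FFN, is wrapped as LayerNorm(x + SubLayer(x)). The layer widths and head count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TransformerEncoderLayer(nn.Module):
    """One encoder module: MHA and FFN, each wrapped as LayerNorm(x + SubLayer(x))."""
    def __init__(self, d_model: int = 256, n_heads: int = 8, d_ff: int = 1024):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Eq. (1) applied to the attention sub-layer.
        x = self.norm1(x + self.mha(x, x, x)[0])
        # Eq. (1) applied to the feed-forward sub-layer.
        x = self.norm2(x + self.ffn(x))
        return x

# Example: a batch of 2 sequences, 100 tokens each, model width 256.
tokens = torch.randn(2, 100, 256)
print(TransformerEncoderLayer()(tokens).shape)  # torch.Size([2, 100, 256])
```

Stacking N copies of this module reproduces the encoder described above; the decoder differs only in its extra cross-attention sub-layer.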


2.2. Self-attention

2.2.1. Scaled dot-product attention

The self-attention mechanism module, as the core component of Transformer, consists of two main parts. (1) Linear projection layer: the input sequence is mapped into three different vectors (query $Q$, key $K$, value $V$). The input sequences are $X \in \mathbb{R}^{n_x \times d_x}$ and $Y \in \mathbb{R}^{n_y \times d_y}$, where $n$ and $d$ denote the length and dimension of the input sequence, respectively. Then $Q$, $K$, and $V$ are generated as follows:

$$Q = XW^Q, \quad K = YW^K, \quad V = YW^V, \quad (2)$$

where $W^Q \in \mathbb{R}^{d_x \times d_q}$, $W^K \in \mathbb{R}^{d_y \times d_k}$ and $W^V \in \mathbb{R}^{d_y \times d_v}$ are the learnable weight matrices. Here $d_q$ and $d_k$ denote the output dimensions of $W^Q$ and $W^K$, respectively, and the output dimension of $W^V$ is $d_v$. When $X = Y$, Eq. (2) is the self-attention computation, and when $X \neq Y$, it is the cross-attention computation in the Decoder module.

(2) Attention layer: Transformer adopts a special attention method called Scaled Dot-Product Attention, as shown in Fig. 3 (left). The input consists of $Q$ in $d_q$ dimensions, $K$ in $d_k$ dimensions and $V$ in $d_v$ dimensions, and the scaled attention matrix is calculated as shown in Eq. (3):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V, \quad (3)$$

where $1/\sqrt{d_k}$ is the scaling factor. The attention scores are obtained by computing the dot product of $Q$ with all $K$, and are then normalized by the scaling factor $1/\sqrt{d_k}$ and the softmax layer. The resulting weights are assigned to the corresponding elements of $V$ to obtain the final attention matrix.

Fig. 3. (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel.
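The following is a minimal sketch of Eqs. (2)–(3) in PyTorch; the sequence lengths and projection dimensions are arbitrary illustrative values, not settings from any particular detector.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Eq. (3): softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (n_q, n_k) similarity matrix
    weights = scores.softmax(dim=-1)                     # attention weights over the keys
    return weights @ v                                   # weighted sum of the values

# Eq. (2): project the input sequences X (queries) and Y (keys/values).
n_x, n_y, d_x, d_y, d = 5, 7, 32, 32, 64
X, Y = torch.randn(n_x, d_x), torch.randn(n_y, d_y)
W_q, W_k, W_v = torch.randn(d_x, d), torch.randn(d_y, d), torch.randn(d_y, d)
Q, K, V = X @ W_q, Y @ W_k, Y @ W_v        # self-attention when X is Y, cross-attention otherwise
print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([5, 64])
```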
2.2.2. Multi-head attention

However, the modeling ability of single-head attention is weak. To address this problem, Vaswani et al. (2017) proposed multi-head attention (MHA). The structure is shown in Fig. 3 (right). MHA can enhance the modeling ability of each attention layer without changing the number of parameters.

Compared to single-head attention, MHA maps $Q$, $K$, and $V$ linearly into different subspaces of dimensions $(d_q, d_k, d_v)$ to compute similarity and computes the attention function in parallel. As shown in Eq. (4), the resulting vectors are concatenated and mapped again to obtain the final output:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V), \quad (4)$$

where $W_i^Q \in \mathbb{R}^{d_\mathrm{model} \times d_q}$, $W_i^K \in \mathbb{R}^{d_\mathrm{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_\mathrm{model} \times d_v}$, and $W^O \in \mathbb{R}^{hd_v \times d_\mathrm{model}}$ is the projection parameter matrix. Multi-head attention reduces the dimensionality of each vector when calculating the attention of each head, which reduces overfitting to a certain extent. Since attention has different distributions in different subspaces, this module fuses the feature relationships between different sequence dimensions in the vector concatenation.
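A compact sketch of Eq. (4) follows, assuming the common convention $d_q = d_k = d_v = d_\mathrm{model}/h$; the per-head projections $W_i^Q$, $W_i^K$, $W_i^V$ are stored as single stacked linear layers for brevity.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Eq. (4): h parallel attention heads, concatenated and projected by W^O."""
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_head = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)   # stacks W_i^Q for all heads
        self.w_k = nn.Linear(d_model, d_model)   # stacks W_i^K
        self.w_v = nn.Linear(d_model, d_model)   # stacks W_i^V
        self.w_o = nn.Linear(d_model, d_model)   # W^O

    def forward(self, x_q, x_kv):
        b, n_q, _ = x_q.shape
        n_k = x_kv.size(1)
        # Split the projections into h subspaces of dimension d_head.
        q = self.w_q(x_q).view(b, n_q, self.h, self.d_head).transpose(1, 2)
        k = self.w_k(x_kv).view(b, n_k, self.h, self.d_head).transpose(1, 2)
        v = self.w_v(x_kv).view(b, n_k, self.h, self.d_head).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1) / self.d_head ** 0.5).softmax(dim=-1)
        heads = attn @ v                                      # one output per head
        heads = heads.transpose(1, 2).reshape(b, n_q, -1)     # Concat(head_1, ..., head_h)
        return self.w_o(heads)

x = torch.randn(2, 100, 256)
print(MultiHeadAttention()(x, x).shape)  # torch.Size([2, 100, 256]), self-attention case
```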
2.3. Position-wise feed-forward networks

The output of the MHA layer is fed into the feed-forward network (FFN). The FFN is mainly composed of two linear transformations with a ReLU activation in between. The output of the FFN can be expressed as shown in Eq. (5):

$$\mathrm{FFN}(x) = \max\left(0, xW_1 + b_1\right)W_2 + b_2, \quad (5)$$

where $W_1$ and $W_2$ denote the weight matrices of the two fully connected layers.

2.4. Positional encoding

Unlike CNN and RNN, self-attention computation brings the advantage of parallel computing while losing word-order information. Therefore, positional encoding is used to provide positional information to the model. In detail, a position-dependent signal is added to each word embedding of each input sequence to help the model incorporate the order of words. The output of the positional encoding has the same dimension as the embedding layer, so it can be superimposed directly on the embedding. The positional information of each token (a sequence of primitives obtained after the text has been divided into words) and its semantic information (the embedding) are thus fully integrated and passed to the subsequent layer.

There are many variants of positional encoding. The original Transformer uses sine and cosine functions for positional encoding, as shown in Eq. (6):

$$PE_{(pos,2i)} = \sin\!\left(pos/10000^{2i/d_\mathrm{model}}\right), \quad PE_{(pos,2i+1)} = \cos\!\left(pos/10000^{2i/d_\mathrm{model}}\right), \quad (6)$$

where $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding corresponds to a sine wave. The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. For any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.
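A short sketch of Eq. (6) is given below; it builds the full sinusoidal table at once and adds it to the token embeddings, mirroring the description above. Sequence length and width are illustrative.

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Eq. (6): even dimensions use sine, odd dimensions use cosine."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimension indices
    angle = pos / torch.pow(10000.0, i / d_model)                   # pos / 10000^{2i/d_model}
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

# The encoding has the same width as the token embeddings, so it is simply added to them.
embeddings = torch.randn(50, 256)
inputs = embeddings + sinusoidal_positional_encoding(50, 256)
print(inputs.shape)  # torch.Size([50, 256])
```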
2.5. Remarks

The self-attention mechanism allows Transformer to break through the limitation that RNN models cannot be computed in parallel, improving computational efficiency. Compared with CNN, the self-attention mechanism has a global perceptual field: the number of operations required to compute the association between two locations does not grow with distance, so it has a stronger ability to learn long-range dependencies. In addition, Transformer has a general modeling capability. It can be regarded as a fully connected graph modeling method that can model heterogeneous nodes by projecting them into a comparable space to compute similarity. Therefore, there is a sufficient theoretical basis for using Transformer in various computer vision tasks based on this general modeling capability. Considering the dimensional differences between images and text, images are converted into sequences and can then be input into the model for processing.

Moreover, we compare the characteristics of CNN and Transformer. As shown in Table 1, Transformer tends to model shapes more, but requires massive data for training. In contrast, CNN tends to model local textures more, but has to stack many convolutional layers to obtain a receptive field large enough to capture global information (Geirhos et al., 2019).

3. Transformer for object detection

This section first introduces common datasets and evaluation metrics for object detection and then analyzes classic Transformer-based object detectors. According to their structural differences, we classify the listed detectors as Transformer Neck-based detectors and Transformer Backbone-based detectors. The Transformer Neck-based detector infers the class labels and bounding box coordinates with a set of learnable object queries but does not change the backbone used for feature extraction. Transformer Backbone-based detectors propose a generic visual backbone that flattens the image into a sequence instead of using convolution for feature extraction. Multi-scale feature fusion is also incorporated in many methods to improve detection accuracy and to replace the CNN backbone in classical detectors. In reviewing these methods, we summarize the optimization innovations or modules of the different methods. Finally, we compare their performance in Table 4 and Table 5 and give an analysis and discussion of the improvements of the above methods.


Table 1
Summary of the highlights and limitations of CNN and Transformer.

Architecture: Transformer
Highlights: (1) The attention mechanism amplifies the significance of crucial aspects of an image while reducing the rest, thereby concentrating on more relevant features. This mechanism assists the Transformer in modeling the long-range dependencies of input sequence elements and thus enhances its generalization ability for samples outside the distribution (Bai et al., 2021). (2) Unlike methods such as RNN and LSTM, the Transformer allows for parallel computation. (3) Given its straightforward yet adaptable design, the Transformer can tackle multiple tasks simultaneously, rendering it a potential candidate for a general-purpose model handling various tasks.
Limitations: (1) Transformer-based models are known for their substantial data requirements and computationally expensive nature, particularly when applied to vision tasks (He et al., 2021). (2) They are also characterized by a slower rate of convergence, which can pose challenges in their utilization (Gao et al., 2021). (3) Further, these models often involve high computational overhead, which exacerbates their deployment issues in resource-constrained settings (Li et al., 2022).

Architecture: CNN
Highlights: (1) CNN-based models have strong local feature extraction ability, benefiting from inductive bias properties such as translation invariance, weight sharing, and sparse connectivity. (2) CNNs can operate in parallel with lower computational complexity than Transformer.
Limitations: (1) CNN rarely encodes relative feature positions, instead favoring receptive field expansion via larger kernels or stacked layers, often reducing local convolution's computational and statistical efficiency. (2) CNN's global feature capture is comparatively weaker than that of Transformer models (Liu et al., 2021b).

3.1. Common datasets and evaluation metrics

3.1.1. Common datasets for object detection

Datasets are the basis for measuring and comparing algorithm performance. The commonly used object detection datasets are Pascal VOC2007 (Everingham et al., 2007), Pascal VOC2012 (Everingham et al., 2012) and Microsoft COCO2017 (Lin et al., 2014), as shown in Table 2. The Pascal VOC dataset has only 20 object categories and is regarded as a benchmark dataset for object detection. Compared with VOC, the COCO dataset has more small objects and more objects per image, and most of the objects are non-centrally distributed and more similar to the real environment. Thus the COCO dataset is more difficult for object detection and has been the mainstream object detection dataset in recent years.

Table 2
Briefing on datasets for object detection.

Name      Image volume   Classes   Source      Annotation format
VOC2007   9963           20        PASCAL      XML
VOC2012   17112          20        PASCAL      XML
COCO2017  121408         80        Microsoft   JSON

3.1.2. Evaluation metrics

Common evaluation metrics for object detection include Precision, Recall, Average Precision (AP), and mean Average Precision (mAP). In addition to classification, the object detection task further localizes the object with a bounding box associated with a confidence score that reports how certain it is that the bounding box of the object class is detected. Therefore, to determine how many objects were detected correctly and how many false positives were generated, we use the Intersection over Union (IoU) metric.

Intersection over Union (IoU). IoU is an evaluation metric that quantifies the similarity between the ground truth bounding box (gt box) and the predicted bounding box (pd box) to evaluate how good the predicted box is. The IoU score ranges from 0 to 1; the closer the two boxes, the higher the IoU score. It can be calculated as follows:

$$IoU(gt, pd) = \frac{\mathrm{area}(gt\,box \cap pd\,box)}{\mathrm{area}(gt\,box \cup pd\,box)}, \quad (7)$$

For an IoU threshold $\alpha$, a True Positive (TP) is a detection for which $IoU(gt, pd) \geq \alpha$, and a False Positive (FP) is a detection for which $IoU(gt, pd) < \alpha$. A False Negative (FN) is a ground truth box that is missed, i.e., one with no associated detection reaching $IoU(gt, pd) \geq \alpha$. The definitions of TP, TN, FP and FN are shown in Table 3.
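A minimal sketch of Eq. (7) for axis-aligned boxes follows; the (x1, y1, x2, y2) coordinate format is an assumption made for illustration.

```python
def box_iou(gt_box, pd_box):
    """Eq. (7): IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(gt_box[0], pd_box[0]), max(gt_box[1], pd_box[1])
    ix2, iy2 = min(gt_box[2], pd_box[2]), min(gt_box[3], pd_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)          # area(gt ∩ pd)
    area_gt = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    area_pd = (pd_box[2] - pd_box[0]) * (pd_box[3] - pd_box[1])
    union = area_gt + area_pd - inter                          # area(gt ∪ pd)
    return inter / union if union > 0 else 0.0

# A detection counts as a true positive when IoU(gt, pd) >= alpha (e.g. alpha = 0.5).
print(round(box_iou((10, 10, 50, 50), (20, 20, 60, 60)), 3))   # 0.391
```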
Precision. Precision is the probability that the predicted bounding boxes match actual ground truth boxes, also referred to as the positive predictive value. Precision scores range from 0 to 1, with a high precision implying that most detected objects match ground truth objects.

Recall. Recall is the true positive rate, also referred to as sensitivity, which measures the probability of ground truth objects being correctly detected. Similarly, Recall ranges from 0 to 1, where a high recall score means that most ground truth objects were detected.

Precision and Recall can be calculated as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad (8)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \quad (9)$$

Average precision (AP). AP is the area under the Precision–Recall curve evaluated at a specific IoU threshold. AP is a single-number metric that combines precision and recall by summarizing the Precision–Recall curve over recall values ranging from 0 to 1. It is used to evaluate the performance of object detectors.


Table 3
Definition of terms.
Terms Definitions
TP (True Positive) Positive samples are correctly identified as positive samples.
TN (True Negative) Negative samples are correctly identified as negative samples.
FP (False Positive) False positive samples, that is, negative samples are mistakenly identified as positive samples.
FN (False Negative) False negative samples, that is, positive samples are wrongly identified as negative samples.

Fig. 4. The pipeline of DETR. The backbone is a convolutional neural network (CNN) that serves as a feature extractor, and the Transformer, consisting of an encoder and a decoder, is the core of the DETR architecture. The high-dimensional feature map from the backbone is flattened and fed into the encoder, which processes the spatial information and outputs a sequence of encoded feature vectors. Finally, the output of the decoder is passed through a series of linear layers to predict the final bounding box coordinates and class probabilities for each object query.
Source: Image from Carion et al. (2020).

Mean average precision (mAP). AP is calculated for each class individually, and mAP is the average of the AP values across all classes. The mAP can be calculated as in Eq. (10). Two kinds of mAP are commonly used: (1) the PASCAL VOC challenge uses mAP as a metric with an IoU threshold of 0.5; (2) MS COCO averages mAP over IoU thresholds from 50% to 95% with a step of 0.05, a metric denoted in papers by mAP@[.5,.95]. Therefore, COCO averages AP not only over all classes but also over the defined IoU thresholds.

$$\mathrm{mAP} = \frac{\sum_{i=1}^{k} AP_i}{k} \quad \text{for } k \text{ classes}, \quad (10)$$

Frames Per Second (FPS). FPS measures how fast an object detection model processes the input images (e.g., video frames) and generates the desired output.
tion model processes your video and generates the desired output. ( )
is considered as a sequence 𝑦 ∶ 𝑦𝑖 = 𝑐𝑖 , 𝑏𝑖 (the length must be less
than N, so the sequence is filled with 𝜙 (for no object), which can be
3.2. Transformer neck
interpreted as the category of background, to make its length equal to
N), where 𝑐𝑖 denotes the true category to which the object belongs,
In this section, we review the classic Transformer Neck-based ob-
and 𝑏𝑖 denotes a quaternion (containing the center point coordinates
ject detection models in last two years, starting from the original
and the width and height of the object box, and both are relative to
Transformer detector DETR (Carion et al., 2020). The original DETR
the scale coordinates of the image).
regards object detection as end-to-end set prediction, thus removing
So the prediction task can be viewed as a bipartite matching prob-
hand-designed components such as anchor boxes and non-maximum
lem between 𝑦 and 𝑦, ̃ with the Hungarian algorithm as the solution
suppression (NMS). However, some drawbacks need to be solved in
method, defining the strategy for minimum matching as follows:
DETR, such as slow convergence and poor detection of small objects.
Therefore, many approaches (sparse attention, spatial prior acceler- ∑
𝑁
( )
ation, multi-scale detection) have been proposed to improve it by 𝜎̂ = arg min Lmatch 𝑦𝑖 , 𝑦̂𝜎(𝑖) , (11)
𝜎∈S𝑁 𝑖
researchers. We compare the performance of all methods together on
the COCO2017 dataset with the benchmark shown in Table 4. where 𝜎̃ denotes the matching strategy when finding the minimum
loss, for L while considering the similarity prediction between Ground
( )
3.2.1. DETR Truth boxes. For 𝜎(𝑖), 𝑐𝑖 the predicted category confidence is 𝑃̃𝜎(𝑖) 𝑐𝑖
DETR proposed by Carion et al. (2020) is the first object detector and the bounding box prediction is 𝑏̃ 𝜎(𝑖) , for non-empty matches, define
( ) ( ) ( )
that successfully uses the Transformer as the main module in object Lmatch 𝑦𝑖 , 𝑦̂𝜎(𝑖) as: −1{𝑐𝑖 ≠∅} 𝑝̂𝜎(𝑖) 𝑐𝑖 + 1{𝑐𝑖 ≠∅} Lbox 𝑏𝑖 , 𝑏̂ 𝜎(𝑖) .
detection. DETR not only has a simpler and more flexible structure In this way, the overall loss is obtained as
but also has comparable performance compared to previous SOTA 𝑁 [
∑ ( ) ( )]
approaches, such as the highly optimized Faster R-CNN. Unlike classical LHungarian (𝑦, 𝑦)
̂ = − log 𝑝̂𝜎(𝑖)
̂ 𝑐𝑖 + 1{𝑐𝑖 ≠∅} Lbox 𝑏𝑖 , 𝑏̂ 𝜎̂ (𝑖) , (12)
object detectors, DETR is an end-to-end object detection model. It gets 𝑖=1
rid of the autoregressive model, performs parallel inference on object Considering the bounding box scale, the 𝐿1 loss and the IoU loss are
relationships and global image context, and then outputs the final linearly combined to obtain the L𝑏𝑜𝑥 loss:
predictions. The structure of DETR is shown in Fig. 4. ( ) ‖ ‖
DETR treats the object detection task as an intuitive set prediction L𝑏𝑜𝑥 = 𝜆iou Liou 𝑏𝑖 , 𝑏̂ 𝜎(𝑖) + 𝜆L1 ‖𝑏𝑖 − 𝑏̂ 𝜎(𝑖) ‖ , (13)
‖ ‖1
problem and discards some traditional hand-craft components such as
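To illustrate the matching step in Eq. (11), the sketch below uses SciPy's Hungarian solver on a simplified cost built from the class-probability and L1 terms only; the full matching cost also includes the IoU-based box term of Eq. (13), and the weight and shapes used here are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_probs, pred_boxes, gt_labels, gt_boxes, l1_weight=5.0):
    """Simplified Eq. (11): pair each ground-truth object with one of the N predictions.

    pred_probs: (N, num_classes) softmax scores, pred_boxes: (N, 4) in [0, 1],
    gt_labels: (M,), gt_boxes: (M, 4). The IoU term of Eq. (13) is omitted for brevity.
    """
    # -p_hat_{sigma(i)}(c_i): reward predictions confident in the right class.
    class_cost = -pred_probs[:, gt_labels]                                     # (N, M)
    # lambda_L1 * ||b_i - b_hat_{sigma(i)}||_1: penalize box disagreement.
    box_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)   # (N, M)
    cost = class_cost + l1_weight * box_cost
    pred_idx, gt_idx = linear_sum_assignment(cost)      # minimum-cost bipartite matching
    return list(zip(pred_idx, gt_idx))

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(92), size=100)    # N = 100 object queries, 91 classes + background
boxes = rng.random((100, 4))
print(hungarian_match(probs, boxes, np.array([3, 17]), rng.random((2, 4))))
```

The unmatched queries are then supervised toward the "no object" class, which is how the padded sequence described above is realized in practice.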


Table 4
Comparison between Transformer Necks and representative CNNs on the COCO2017 val set. "Multi-Scale" refers to multi-scale inputs. AP denotes the IoU threshold range .50:.05:.95; AP50 and AP75 denote IoU thresholds of .50 and .75. In addition, AP_S, AP_M, AP_L denote different object scales: Small means area < 32^2, Medium means 32^2 < area < 96^2, and Large means area > 96^2.
Method Backbone Epochs GFLOPs #Params(M) Multi-scale FPS AP AP50 AP75 AP_S AP_M AP_L
Faster R-CNN + FPN (Ren et al., 2016) ResNet50 109 180 42 – 26 42.0 62.1 45.5 26.6 45.4 53.4
DETR+ (Carion et al., 2020) ResNet50 500 86 41 – 28 42.0 62.4 44.2 20.5 45.8 61.1
DETR-DC5+ (Carion et al., 2020) ResNet50 500 187 41 – 12 43.4 63.1 45.9 22.5 47.3 61.1
DETR (Carion et al., 2020) ResNet50 50 86 41 – 12 42.0 62.4 44.2 20.5 45.8 61.1
DETR-DC5 (Carion et al., 2020) ResNet50 50 187 41 – 12 43.4 63.1 45.9 22.5 47.3 61.1
UP-DETR (Dai et al., 2021a) ResNet50 150 86 41 – 28 40.5 60.8 42.6 19.0 44.4 60.0
UP-DETR+ (Dai et al., 2021a) ResNet50 300 86 41 – 28 42.8 63.0 45.3 20.8 47.1 61.7
Deformable DETR (Zhu et al., 2021) ResNet50 50 173 40 – 19 43.8 62.6 47.7 26.4 47.1 58.0
Two-stage Deformable DETR (Zhu et al., 2021) ResNet50 50 173 40 – 19 46.2 65.2 50.0 28.8 49.2 61.7
Conditional DETR (Meng et al., 2021) ResNet50 108 90 44 – – 43.0 64.0 45.7 22.7 46.7 61.5
Conditional DETR-DC5 (Meng et al., 2021) ResNet50 108 195 44 – – 45.1 65.4 48.5 25.3 49.0 62.2
ACT-MTKD(L=16) (Zheng et al., 2021) ResNet50 – 156 – – 14 40.6 – – 18.5 44.3 59.7
ACT-MTKD(L=32) (Zheng et al., 2021) ResNet50 – 169 – – 16 43.1 – – 22.2 47.1 61.4
SMCA (Gao et al., 2021) ResNet50 50 152 40 – 10 43.7 63.6 47.2 24.2 47.0 60.4
SMCA+ (Gao et al., 2021) ResNet50 50 152 108 – 10 45.6 65.5 49.1 25.9 49.3 62.6
Efficient DETR (Yao et al., 2021) ResNet50 36 159 32 – – 44.2 62.2 48.0 28.4 47.5 56.6
Efficient DETR* (Yao et al., 2021) ResNet50 36 210 35 – – 45.1 65.4 48.5 25.3 49.0 62.2
TSP-FCOS (Sun et al., 2021) ResNet50 36 189 51.5 – 15 43.1 62.3 47.0 26.6 46.8 55.9
TSP-RCNN (Sun et al., 2021) ResNet50 36 188 64 – 11 43.8 63.3 48.3 28.6 46.9 55.7
TSP-RCNN+ (Sun et al., 2021) ResNet50 96 188 64 – 11 45.0 64.5 49.6 29.7 47.7 58.0
YOLOS-S (Fang et al., 2021) DeiT-S 150 200 30.7 – 7 36.1 56.4 37.1 15.3 38.5 56.1
YOLOS-S (Fang et al., 2021) DeiT-S 150 179 27.9 – 5 37.6 57.6 39.2 15.9 40.2 57.3
YOLOS-B (Fang et al., 2021) DeiT-B 150 537 127 – – 42.0 62.2 44.5 19.5 45.3 62.1
PnP-DETR-R50-DC5-α-0.33 (Wang et al., 2021b) ResNet50 500 20.7 (omit backbone) – – – 42.7 62.8 45.1 22.4 46.2 60.0
PnP-DETR-R50-DC5-α-0.5 (Wang et al., 2021b) ResNet50 500 32.9 (omit backbone) – – – 43.1 63.4 45.3 22.7 46.5 61.1
Dynamic DETR (Dai et al., 2021c) ResNet50 40 – – – – 47.2 65.9 51.1 28.6 49.3 59.1
Anchor DETR-C5 (Wang et al., 2021c) ResNet50 50 – – – – 42.1 63.1 44.9 22.3 46.2 60.0
Anchor DETR-DC5 (Wang et al., 2021c) ResNet50 50 – – – – 44.2 64.7 47.5 24.7 48.2 60.6
D2ETR (Lin et al., 2022) PVT2 50 82 35 – – 43.2 62.9 46.2 22.0 48.5 62.4
Deformable D2ETR (Lin et al., 2022) PVT2 50 93 40 – – 50.0 67.9 54.1 31.7 53.4 66.7
Sparse DETR-ρ=10% (Roh et al., 2022) ResNet50 50 105 41 – 25.3 45.3 65.8 49.3 28.4 48.3 60.1
Sparse DETR-ρ=10% (Roh et al., 2022) Swin-T 50 113 41 – 21.2 48.2 69.2 52.3 29.8 51.2 64.5
DAB-DETR (Liu et al., 2022a) ResNet50 50 202 44 – – 44.5 65.1 47.7 25.3 48.2 62.3
DAB-DETR* (Liu et al., 2022a) ResNet50 50 216 44 – – 45.7 66.2 49.0 26.1 49.4 63.1
DN-DETR (Li et al., 2022) ResNet50 50 94 44 – – 44.1 64.4 46.7 22.9 48.0 63.4
DN-DETR-DC5 (Li et al., 2022) ResNet50 50 202 44 – – 46.3 66.4 49.7 26.7 50.0 64.3
DN-Deformable-DETR (Li et al., 2022) ResNet50 50 195 48 – – 48.6 67.4 52.7 31.0 52.0 63.7
DINO-4scale (Zhang et al., 2022a) ResNet50 12 279 47 – 24 47.9 65.3 52.1 31.2 50.9 61.9
DINO-5scale (Zhang et al., 2022a) ResNet50 12 860 47 – 10 48.3 65.8 52.4 32.2 51.3 62.2
DINO-4scale (Zhang et al., 2022a) ResNet50 36 – – – – 50.5 68.3 55.1 32.7 53.9 64.9
DINO-5scale (Zhang et al., 2022a) ResNet50 36 – – – – 51.0 69.0 55.6 34.1 53.6 65.6
SAM-DETR (Zhang et al., 2022b) ResNet50 50 100 58 – – 39.8 61.8 41.6 20.5 43.4 59.6
SAM-DETR-DC5 (Zhang et al., 2022b) ResNet50 50 210 58 – – 43.3 64.4 46.2 25.1 46.9 61.0
Pix2Seq (Chen et al., 2021) ResNet50 50 – 37 – – 43.0 61.0 45.6 25.1 46.9 59.4
Pix2Seq-DC5 (Chen et al., 2021) ResNet50 50 – 38 – – 43.2 61.0 46.1 26.6 47.0 58.6

Additionally, we have presented the attention visualizations of the encoder and decoder (as shown in Figs. 5 and 6). These visualizations aid in understanding how the model focuses on various parts of the input image and utilizes attention mechanisms for object detection. The encoder processes the input image, captures its spatial information, and creates a set of contextualized feature representations. Attention visualization in the encoder demonstrates how the model concentrates on specific regions of the image, emphasizing crucial areas that contribute to the comprehension of the objects present. The decoder uses the encoded features to generate final object detections, employing a series of self-attention and cross-attention mechanisms to iteratively refine the predicted object bounding boxes and class labels.

In summary, DETR, the first Transformer-based end-to-end object detector, exhibited performance comparable to state-of-the-art (SOTA) methods at the time. However, there are evident drawbacks in its application: slow convergence and low accuracy on small objects. Nonetheless, its end-to-end architecture possesses significant potential and has attracted numerous researchers to explore improvements.

3.2.2. UP-DETR

Since DETR faces great challenges in training and optimization, it requires a huge amount of training data and an extremely long training schedule, which limits its application on small datasets. Moreover, existing pretext tasks cannot be directly applied to train the Transformer module of DETR, because DETR focuses mainly on spatial localization rather than image instance-based or cluster-based learning. To address the above issues, Dai et al. (2021a) proposed UP-DETR, a DETR-like model capable of unsupervised pre-training, whose structure is shown in Fig. 7.

Multiple query patches are randomly cropped from a given image, and the Transformer for detection is pre-trained to predict the bounding boxes of these query patches in the given image. In the pre-training


Fig. 5. Encoder self-attention for a set of reference points. It demonstrates the attention distribution after the input image is processed through the Transformer encoder.

Fig. 6. Visualization of decoder attention for each predicted object in images from the COCO validation set, using the DETR-DC5 model. Attention scores are represented by
distinct colors for different objects. The decoder primarily focuses on object extremities, such as legs and heads, highlighting the model’s ability to capture fine-grained details. It
is recommended to view this figure in color for better understanding.

Fig. 7. UP-DETR pre-training architecture by random query patch detection: (a) For a single query patch, which is added to all object queries. (b) For multi-query patches, each query patch is added to N/M object queries with object query shuffle and an attention mask.
Source: Image from Dai et al. (2021a).

process, the method addresses the following two key problems. (1) To trade off the preference between classification and localization in the pretext task, the backbone network is frozen and a patch feature reconstruction branch is proposed that is jointly optimized with patch detection. (2) For multi-query patches, UP-DETR is first introduced with a single query patch and then extended to multi-query patches with object query shuffle and an attention mask.

In summary, UP-DETR proposes a new unsupervised pretext task, random query patch detection, to pre-train the Transformer. The results show that UP-DETR has significantly better performance than DETR in


object detection, panoptic segmentation, and one-shot detection, even on the PASCAL VOC dataset where the training data is insufficient.

3.2.3. YOLOS

Inspired by the fact that a pre-trained Transformer can be fine-tuned on token-level tasks (Rajpurkar et al., 2016; Sang and De Meulder, 2003), Fang et al. (2021) proposed YOLOS, a pure sequence-to-sequence Transformer built on the basis of DETR (Carion et al., 2020) and ViT (Dosovitskiy et al., 2021). It replaces the class token of the original ViT with detection tokens and replaces the image classification loss with the bipartite matching loss of DETR in the training phase, which allows object detection by set prediction. YOLOS demonstrates the generality and transferability of the pre-trained Transformer from image classification to the downstream object detection task: it is pre-trained on the classification task and then transferred to the detection task for fine-tuning. Experiments demonstrate that YOLOS-Base, pre-trained only on the medium-sized ImageNet dataset, can achieve 42.0 box AP.

3.2.4. Deformable DETR

Inspired by deformable convolution (Dai et al., 2017), Zhu et al. (2021) proposed Deformable DETR. This method combines the advantages of the sparse spatial sampling of deformable convolution with the relational modeling capability of Transformer. The Deformable Attention Module (DEM) is introduced to accelerate convergence and fuse multi-scale features to improve accuracy. Moreover, the authors introduce multi-scale features from FPN (Lin et al., 2016) and then propose Multi-Scale Deformable Attention (MSDA) to replace the Transformer attention module for processing feature maps, as shown in Fig. 8. Let $\{x^l\}_{l=1}^{L}$ be the input multi-scale feature maps, where $x^l \in \mathbb{R}^{C \times H_l \times W_l}$. Let $\hat{p}_q \in [0, 1]^2$ be the normalized coordinates of the reference point of each query element $q$, and then compute Multi-Scale Deformable Attention as

$$\mathrm{MSDeformAttn}\!\left(z_q, \hat{p}_q, \{x^l\}_{l=1}^{L}\right) = \sum_{m=1}^{M} W_m\!\left[\sum_{l=1}^{L}\sum_{k=1}^{K} A_{mlqk} \cdot W'_m\, x^l\!\left(\phi_l(\hat{p}_q) + \Delta p_{mlqk}\right)\right], \quad (14)$$

where $m$ is the index of the attention head, $l$ is the index of the input feature level, and $k$ is the index of the sampling point. $\Delta p_{mlqk}$ and $A_{mlqk}$ denote, respectively, the sampling offset and the attention weight of the $k$th sampling point in the $l$th feature level and the $m$th attention head. The scalar attention weights $A_{mlqk}$ are normalized so that $\sum_{l=1}^{L}\sum_{k=1}^{K} A_{mlqk} = 1$. The normalized coordinates $(0, 0)$ and $(1, 1)$ of $\hat{p}_q \in [0, 1]^2$ denote the upper-left and lower-right corners of the image, respectively. The function $\phi_l(\hat{p}_q)$ in Eq. (14) rescales the normalized coordinates $\hat{p}_q$ to the input feature map of the $l$th level. The computational complexity of MSDA is $O\!\left(2N_qC^2 + \min\!\left(HWC^2, N_qKC^2\right)\right)$. Compared to the original DETR, Deformable DETR requires less than one-tenth of the training epochs to achieve better performance (especially on small objects).
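To illustrate the sampling idea behind Eq. (14), the sketch below implements a single-head, single-scale deformable aggregation using bilinear sampling (torch.nn.functional.grid_sample). This is a simplified analogue, not the MSDA implementation itself: the real module additionally loops over feature levels l and heads m, and predicts the offsets and weights from the query embedding.

```python
import torch
import torch.nn.functional as F

def deformable_attention_single_scale(value, ref_points, offsets, attn_weights):
    """Simplified, single-scale, single-head version of Eq. (14).

    value: (1, C, H, W) feature map; ref_points: (N_q, 2) in [0, 1] with (x, y) order;
    offsets: (N_q, K, 2) normalized sampling offsets; attn_weights: (N_q, K), softmaxed.
    """
    loc = ref_points[:, None, :] + offsets                 # p_q + delta_p_qk, shape (N_q, K, 2)
    grid = loc * 2.0 - 1.0                                 # grid_sample expects coords in [-1, 1]
    sampled = F.grid_sample(value, grid[None], align_corners=False)   # (1, C, N_q, K)
    sampled = sampled[0].permute(1, 2, 0)                  # (N_q, K, C)
    return (attn_weights[..., None] * sampled).sum(dim=1)  # A_qk-weighted sum -> (N_q, C)

value = torch.randn(1, 256, 32, 32)                        # one feature level
ref = torch.rand(300, 2)                                   # 300 queries, K = 4 sampling points
out = deformable_attention_single_scale(value, ref, 0.05 * torch.randn(300, 4, 2),
                                         torch.softmax(torch.randn(300, 4), dim=-1))
print(out.shape)  # torch.Size([300, 256])
```

Because each query attends to only K sampled locations per level rather than the full feature map, the cost no longer grows quadratically with the spatial resolution, which is the source of the speed-up described above.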
on the COCO dataset.
3.2.5. Conditional DETR
Meng et al. (2021) proposed Conditional DETR. They visualized ex- 3.2.8. ACT
periments on the operation of DETR and concluded that cross-attention Due to the slow convergence of DETR, Zheng et al. (2021) proposed
in DETR is highly dependent on content embedding to locate the the Adaptive Clustering Transformer (ACT) to address the problem of
four vertices and predict the bounding box. Thus, it increases the high trial and error costs for improving DETR. ACT is a plug-and-
training difficulty. So they improved the cross-attention of DETR by play module that is fully compatible with Transformer and can be
concatenating the content query 𝑐𝑞 and spatial query 𝑝𝑞 , and the key ported to DETR without any training. Its core design is first, to perform
by splicing the content key 𝑐𝑘 and spatial key 𝑝𝑘 . This inner product of feature clustering adaptively for the attention redundancy (points with
query and key gives the following result: similar semantics and similar spatial locations produce similar atten-
tion maps) of encoder, select representative prototypes, and broadcast
𝐜⊤ ⊤
𝑞 𝐜𝑘 + 𝐩 𝑞 𝐩 𝑘 , (15)
feature updates to their nearest neighboring points based on Euclidean
This separates the functions of content query and spatial query distance. Second, an adaptive clustering algorithm is designed for the
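A small sketch of Eq. (15) follows: concatenating the content and spatial parts of the query and key makes the attention logits decompose into a content term plus a spatial term. The dimensions are illustrative, not taken from the paper.

```python
import torch

def conditional_cross_attention_weights(c_q, p_q, c_k, p_k):
    """Eq. (15): attention logits decompose into a content term and a spatial term.

    c_q, p_q: (N_q, d) content / conditional spatial queries;
    c_k, p_k: (N_k, d) content / spatial keys.
    """
    q = torch.cat([c_q, p_q], dim=-1)            # concatenated query
    k = torch.cat([c_k, p_k], dim=-1)            # concatenated key
    logits = q @ k.t()                           # equals c_q @ c_k.T + p_q @ p_k.T
    return logits.softmax(dim=-1)

c_q, p_q = torch.randn(300, 256), torch.randn(300, 256)     # 300 object queries
c_k, p_k = torch.randn(950, 256), torch.randn(950, 256)     # 950 encoder tokens
print(conditional_cross_attention_weights(c_q, p_q, c_k, p_k).shape)  # torch.Size([300, 950])
```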
3.2.6. Efficient DETR

Yao et al. (2021) analyzed the mechanisms of DETR and Deformable DETR and found that their common feature is a cascade structure of six stacked Decoders, which is used to iteratively update the object queries. The reference points proposed by Deformable DETR make the object query visualizable and address the difficulty of analyzing the object query directly. However, different initialization methods for the reference points have a great impact on decoder performance. In order to investigate a more efficient way to initialize the object container, Yao et al. proposed Efficient DETR, a two-stage object detector that consists of dense prediction and sparse set prediction, with both parts sharing the same detection head.

The model generates region proposals using dense detection before initializing the object container, and then uses the highest-scoring 4-dimensional proposals and their 256-dimensional encoder features as the initialization values of the object container, which results in better performance and fast convergence. The experimental results show that Efficient DETR combines the features of dense detection and set prediction and can converge quickly while achieving high performance. The model achieved the SOTA performance of its time on the COCO dataset with only one encoder layer and three decoder layers, while requiring 14x fewer epochs.

3.2.7. SMCA

To strengthen the relationship between the visual region of common interest for each object query and the bounding box to be predicted by that query, Gao et al. (2021) introduced spatial priors and multi-scale features and proposed Spatially Modulated Co-Attention (SMCA), which replaces the cross-attention in the original Decoder while keeping the other components unchanged.

The decoder of SMCA has multiple cross-attention heads, each of which estimates the object center and scale from a slightly different location, resulting in a series of different spatial weight maps. These weight maps are used to spatially modulate the co-attention features, which improves detection performance. Based on these improvements, SMCA achieves 43.7 mAP in 50 epochs and 45.6 mAP in 108 epochs on the COCO dataset.

3.2.8. ACT

Due to the slow convergence of DETR, Zheng et al. (2021) proposed the Adaptive Clustering Transformer (ACT) to address the high trial-and-error cost of improving DETR. ACT is a plug-and-play module that is fully compatible with Transformer and can be ported to DETR without any training. Its core design is, first, to perform adaptive feature clustering to exploit the attention redundancy of the encoder (points with similar semantics and similar spatial locations produce similar attention maps), select representative prototypes, and broadcast feature updates to their nearest neighboring points based on Euclidean distance. Second, an adaptive clustering algorithm is designed for the encoder feature diversity problem (for different inputs, the feature distribution of each encoder layer is quite different), and a multi-round exact Euclidean locality-sensitive hashing (E2LSH) scheme is chosen for this algorithm to adaptively determine the number of prototypes. Thanks


Fig. 8. The architecture of Deformable DETR. Its attention module focuses on only a small number of key sampling points around the reference point, and assigns a fixed and
small number of keys to each object query, thus alleviating the problems of slow convergence and low feature resolution.
Source: Image from Zhu et al. (2021).

to these improvements, ACT can reduce the FLOPs of DETR from 73.4 GFLOPs to 58.2 GFLOPs (excluding the backbone ResNet FLOPs) without additional training, while the loss of AP is only 0.7%. The AP loss can be further reduced to 0.2% by multi-task knowledge distillation. Given its excellent performance, exploring ACT training from scratch and its fusion with multi-scale features is a worthy research direction for the future.

Fig. 9. A Decoder layer of Conditional DETR. The gray shaded box indicates that the conditional spatial query is predicted from the learnable 2D coordinates s and the embedding output of the previous Decoder layer.
Source: Image from Meng et al. (2021).

3.2.9. TSP

Sun et al. (2021) concluded after extensive analysis that the cross-attention part of the decoder and the Hungarian loss of DETR are the main reasons for its slow convergence. So they proposed two encoder-only improved models of DETR, TSP-FCOS and TSP-RCNN, corresponding to one-stage and two-stage object detection methods, respectively. Both models can be viewed as feature pyramid (Lin et al., 2016) based. The model uses a feature of interest (FoI) selection mechanism that helps the encoder process multi-scale features. In addition, the model applies matching distillation to solve the instability of bipartite graph matching. Experiments show that TSP achieves better results with reduced training cost, using only 36 epochs to reach the results of the original DETR trained for 500 epochs.

3.2.10. DINO

The Hungarian algorithm has been used in DETR (Carion et al., 2020) to match the objects output by the Decoder with the Ground Truth. However, the discreteness of Hungarian matching and the randomness of model training cause the matching process to be dynamic and unstable, resulting in the slow convergence of DETR. By deeply studying the iteration mechanism and optimization problems of the DETR model, Zhang et al. (2022a) proposed DINO (DETR with Improved deNoising anchor boxes) based on DN-DETR (Li et al., 2022), DAB-DETR (Liu et al., 2022a) and Deformable DETR (Zhu et al., 2021). The key design of DINO is that the training phase uses denoising training as a shortcut to learning the relative offsets of anchors: noise is first added near the Ground Truth box, and the Hungarian matching then directly reconstructs the ground-truth bounding box, thus improving the stability of matching. Secondly, the model also uses a query-based


Fig. 10. The architecture of the Pyramid Vision Transformer. The whole model is divided into 4 stages to generate feature maps at different scales. Each stage consists of a patch embedding layer, $L_i$ encoder layers, and a reshape operation.
Source: Image from Wang et al. (2021a).

dynamic anchor formulation to initialize the queries and corrects the parameters of adjacent earlier layers with the gradients of later layers. DINO breaks the dominance of classical-architecture detectors (SwinV2-G (Liu et al., 2021a), Florence (Yuan et al., 2021), DyHead (Dai et al., 2021b), etc.). DINO-Res50, which combines multi-scale features, achieves 48.3 AP and 51.0 AP on the COCO2017 dataset with 12-epoch and 36-epoch training schemes, respectively. Moreover, DINO-Swin-L even achieves the highest performance of 63.3 AP after training on a larger dataset.

3.3. Transformer backbone

Other efforts such as ViT (Dosovitskiy et al., 2021) have used Transformer for image classification and achieved comparable results. However, there are still limitations in other, more complex CV tasks. The challenges of transferring the high performance of Transformer in NLP to CV can be explained by the differences between the two domains:

1. The object entities in CV tasks often have dramatic scale variation.
2. Compared to text, the matrix nature of images means that even a single image contains at least hundreds of pixels of expressible information. In particular, the very long sequences unfolded from high-resolution images are difficult for Transformer to model.
3. Many CV tasks such as semantic segmentation require pixel-level dense prediction, and the computational complexity of the self-attention mechanism in ViT increases quadratically with image size, which leads to unacceptable computational overhead.
4. In existing Transformer-based models, tokens are fixed in scale and are not specifically designed for CV tasks.

To address the above challenges, many Transformer-based backbones have been proposed for CV tasks and combined with methods such as multi-scale features to compensate for the shortcomings of ViT (e.g., that it can only detect at low resolution). These methods can replace the backbone of mainstream object detection models; in the benchmark of Table 5, we list the performance of Mask R-CNN (He et al., 2017) and RetinaNet (Ross and Dollár, 2017) after replacing the backbone, and we review the classical models in this subsection.

3.3.1. PVT & PVTv2

The feature maps output by ViT (Dosovitskiy et al., 2021) are difficult to apply to dense prediction due to their single scale and low resolution. Wang et al. (2021a) proposed the Pyramid Vision Transformer (PVT) by incorporating multi-scale features into Transformer. PVT can be used as a backbone for various dense detection tasks; in particular, it can replace the CNN backbone of DETR-like models or be combined into a pure Transformer model without manual components such as NMS.

Benefiting from the progressive shrinking pyramid structure in PVT, the Transformer sequence length decreases as the network gets deeper. Meanwhile, in order to further reduce the computation caused by fine-grained partitioning of images, they propose spatial-reduction attention (SRA) to reduce the cost of learning high-resolution feature maps (as shown in Fig. 10).

Compared with CNN methods based on the feature pyramid structure, PVT not only generates multi-scale feature maps to detect objects of different sizes but also fuses global information through the self-attention mechanism. PVTv2 (Wang et al., 2022), subsequently proposed by the same team, improves PVT by adding a linear-complexity attention layer, overlapping patch embedding, and a convolutional feed-forward network to improve its performance as a backbone. On the COCO dataset, both achieved competitive results at the time.
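The sketch below shows the idea behind spatial-reduction attention under simple assumptions (a strided convolution as the reduction operator and a single head); it is meant to illustrate why SRA shrinks the attention cost on high-resolution stages, not to reproduce the exact PVT implementation.

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """SRA-style attention sketch: keys/values are spatially downsampled by a ratio R before
    attention, shrinking the attention matrix from (HW x HW) to (HW x HW/R^2)."""
    def __init__(self, dim: int = 64, n_heads: int = 1, reduction: int = 8):
        super().__init__()
        self.sr = nn.Conv2d(dim, dim, kernel_size=reduction, stride=reduction)  # spatial reduction
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x, h, w):
        # x: (B, H*W, C) token sequence from one pyramid stage.
        b, n, c = x.shape
        kv = x.transpose(1, 2).reshape(b, c, h, w)
        kv = self.sr(kv).flatten(2).transpose(1, 2)          # (B, HW/R^2, C)
        kv = self.norm(kv)
        out, _ = self.attn(x, kv, kv)                        # queries keep full resolution
        return out

tokens = torch.randn(2, 56 * 56, 64)                         # a stage-1-like resolution
print(SpatialReductionAttention()(tokens, 56, 56).shape)     # torch.Size([2, 3136, 64])
```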
3.3.2. Swin Transformer

Liu et al. (2021b) proposed Swin Transformer, which creatively uses a hierarchical design to make the Transformer available as a backbone for most CV tasks, rather than just a detection head. As shown in Fig. 11, unlike other Transformer models, Swin Transformer builds feature maps with a hierarchical representation, similar to the feature pyramid structure in CNNs. As the network level deepens, the receptive field expands, enabling the extraction of multi-scale features of the image. Secondly, Swin Transformer divides the feature map into multiple windows, and each non-overlapping window performs local multi-head attention computation without correspondence between windows, which greatly reduces the computation and makes it linear with the image size, as shown in Eq. (16). In contrast, ViT produces a single low-resolution feature map and computes global attention,


3.3.3. Swin TransformerV2


After Swin Transformer, Liu et al. (2021a) proposed Swin Trans-
formerV2 to address the problems of expansion of CV models and
training with high-resolution images, as well as the excessive GPU
memory consumption for large models. Swin Transformer is optimized
to scale up to 3 billion parameters and can be trained with images up
to 1536 × 1536 resolutions. The improved method is shown in Fig. 15.
Post normalization technique: They found that when scaling up
the model, the activation values in the deep layer increase dramatically.
In fact, in the pre-normalized configuration (Layer Norm layer before
the Attention layer), the output activation values of each residual block
are directly merged back to the main branch, and the amplitude of the
Fig. 11. Compare Swin Transformer (left) with ViT (right). main branch becomes in the deeper layers. The huge amplitude differ-
Source: Image form Liu et al. (2021b). ences between different layers may cause training instability problems.
Therefore, they propose a post normalization technique, in which the
output of each residual block is normalized before it is merged back
so the computational complexity and image size are quadratically into the main branch, and the amplitude of the main branch does not
related, as shown in Eq. (17) accumulate as the number of layers deepens.
Scaled cosine attention: In the original self-attention computation,
𝛺(W − MSA) = 4ℎ𝑤𝐶 2 + 2𝑀 2 ℎ𝑤𝐶, (16) the similarity terms of pixel pairs are computed as dot products of
queries and keys vectors. However, when using this approach for
𝛺(MSA) = 4ℎ𝑤𝐶 2 + 2(ℎ𝑤)2 𝐶, (17) large visual models, the learned attention graph for some blocks and
attention heads is often dominated by several pixel pairs, especially
where M is a fixed window size (set to 7 by default), computing global
in post-normalization configurations. To alleviate this problem, the
attention for ViT is unacceptable for large image sizes 𝐻𝑊 , while
authors propose a scaled cosine attention (Scaled cosine attention)
window-based multi-head self-attention (W-MSA) is scalable.
method, which computes the number of attention pairs for a pixel pair
The Pipeline of the Swin Transformer is shown in Fig. 12(a). The
input image is spreading into a sequence after Patch Partition and 𝑖 and 𝑗 by a scaled cosine function:
Linear Embedding layers, and then input into 4 stages. The Swin ( ) ( )
Sim 𝐪𝑖 , 𝐤𝑗 = cos 𝐪𝑖 , 𝐤𝑗 ∕𝜏 + 𝐵𝑖𝑗 , (19)
Transformer block in each stage replaces the standard multi-head self-
attention (MSA) module in the Transformer module with window-based where 𝐵𝑖𝑗 is the relative position bias between pixels 𝑖 and 𝑗; 𝜏 is a
self-attention (W-MSA) or a shift window-based module (SW-MSA), learnable scalar that cannot be shared across heads and layers. The 𝜏 is
which introduces a relative position bias in the computation of at- set to be greater than 0.01. The cosine function is naturally normalized
tention to account for the geometric relationships in the self-attention so that it can have milder attention values, which improves the stability
computation, as shown in Eq. (18). This parameter accounts for the of large visual models and makes the model capacity easier to be scaled
relative spatial configuration of the visual elements and is shown to be up.
critical in various visual tasks, especially for intensive recognition tasks Log-spaced continuous position bias:They found that the original
such as object detection and semantic segmentation. relative position encoding method was weak for scale generalization of
( √ ) the model, and proposed log-spaced continuous position bias so that
Attention(𝑄, 𝐾, 𝑉 ) = Sof tMax 𝑄𝐾 𝑇 ∕ 𝑑 + 𝐵 𝑉 , (18)
Although W-MSA greatly reduces the computation, it loses the ability to model relationships between different windows, and the lack of information exchange between non-overlapping windows limits the representational power of the model. The authors therefore introduced SW-MSA, which shifts the window partition in the Swin Transformer block of the next layer so that the new windows straddle the boundaries of the previous layer's windows. This operation greatly increases the effective receptive field, as shown in Fig. 13: multi-head attention is computed inside the new windows, which cover the boundaries of the original windows and thereby model the relationships between windows.

However, this shifting changes the number of windows and makes the window sizes non-uniform. A simple remedy is to pad the small windows, but this increases the computation. The authors instead proposed cyclic shifting, a more efficient batched computation: the feature map is cyclically shifted and the small windows are merged, so a single window may contain content from several originally separate windows, and a masked MSA mechanism restricts the self-attention computation to each sub-window, as shown in Fig. 14.
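The cyclic shift itself reduces to a roll of the feature map plus window partitioning and an attention mask. The sketch below is a simplified illustration assuming a feature map in (B, H, W, C) layout; it is not the official Swin code.

```python
import torch

def cyclic_shift(x: torch.Tensor, shift: int) -> torch.Tensor:
    """Cyclically shift a (B, H, W, C) feature map so that shifted windows
    can be batched together; the inverse shift uses +shift."""
    return torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))

def window_partition(x: torch.Tensor, m: int) -> torch.Tensor:
    """Split a (B, H, W, C) map into (num_windows * B, m*m, C) windows."""
    b, h, w, c = x.shape
    x = x.view(b, h // m, m, w // m, m, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, m * m, c)

# Usage sketch: shift by half the window size, then partition; a masked MSA
# then prevents tokens from different original windows attending to each other.
x = torch.randn(2, 56, 56, 96)           # assumed stage-1 resolution of Swin-T
windows = window_partition(cyclic_shift(x, shift=3), m=7)
print(windows.shape)                      # torch.Size([128, 49, 96])
```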
Swin Transformer has achieved SOTA performance on classification, detection, and segmentation tasks. Its biggest contribution is a backbone that can be widely used in CV, and most of the hyperparameters commonly found in CNNs, such as the number of network blocks and the size of the input images, can still be tuned manually in Swin Transformer. The method combines the advantages of Transformer and CNN, fully considers the scale invariance of CNNs and the relationship between receptive field and number of layers, and eases the previously slow adoption of the Transformer in CV.

Liu et al. (2021a) subsequently scaled this design up to Swin Transformer V2 and found that, when scaling up the capacity of the model, the activation values in the deep layers increase dramatically. In the pre-normalization configuration (Layer Norm placed before the attention layer), the output activations of each residual block are merged directly back into the main branch, so the amplitude of the main branch keeps growing in the deeper layers. The large amplitude differences between layers may cause training instability. They therefore propose a post-normalization technique, in which the output of each residual block is normalized before it is merged back into the main branch, so that the amplitude of the main branch does not accumulate as the number of layers deepens (see Fig. 15).

Scaled cosine attention: In the original self-attention computation, the similarity of a pixel pair is computed as the dot product of its query and key vectors. When this approach is used in large visual models, however, the learned attention maps of some blocks and heads are often dominated by a few pixel pairs, especially in the post-normalization configuration. To alleviate this problem, the authors propose scaled cosine attention, which computes the attention logit of a pixel pair i and j with a scaled cosine function:

Sim(qᵢ, kⱼ) = cos(qᵢ, kⱼ)/τ + Bᵢⱼ,   (19)

where Bᵢⱼ is the relative position bias between pixels i and j, and τ is a learnable scalar that is not shared across heads and layers and is set to be larger than 0.01. Because the cosine function is naturally normalized, it produces milder attention values, which improves the training stability of large visual models and makes the model capacity easier to scale up.
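A compact sketch of Eq. (19) is shown below. It assumes per-head query/key/value tensors and a learnable log-temperature clamped so that τ stays above 0.01, and it omits the surrounding projection layers; it is an illustration, not the official Swin Transformer V2 code.

```python
import math
import torch
import torch.nn.functional as F

def scaled_cosine_attention(q, k, v, bias, logit_scale):
    """q, k, v: (B, heads, N, d); bias: (heads, N, N); logit_scale: (heads, 1, 1).

    Implements Eq. (19): cosine similarity scaled by a learnable 1/tau
    (stored in log space), plus a relative position bias.
    """
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    # Clamp log(1/tau) at log(100) so that tau stays above 0.01, as in the text.
    scale = torch.clamp(logit_scale, max=math.log(1.0 / 0.01)).exp()
    attn = (q @ k.transpose(-2, -1)) * scale + bias
    return attn.softmax(dim=-1) @ v
```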
Log-spaced continuous position bias: The authors also found that the original relative position encoding generalizes poorly across scales, and they propose a log-spaced continuous position bias so that the relative position bias can be transferred smoothly across windows at different resolutions, effectively transferring models pre-trained on low-resolution images and windows to their higher-resolution counterparts:

Δx̂ = sign(x) · log(1 + |Δx|),
Δŷ = sign(y) · log(1 + |Δy|),   (20)

where Δx, Δy and Δx̂, Δŷ are the relative coordinates in linear scale and in logarithmic space, respectively. The optimized architecture is named Swin Transformer V2, and the model achieves a box/mask mAP of 63.1/54 on the COCO2017 dataset.
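The coordinate transform of Eq. (20) is straightforward to express in code. The snippet below is an illustrative sketch (not the original implementation) that maps the table of relative offsets of a window into log-spaced coordinates, which can then be fed to a small MLP that produces the continuous bias.

```python
import torch

def log_spaced_coords(window_size: int) -> torch.Tensor:
    """Relative offsets of an M x M window mapped to log space (Eq. 20)."""
    m = window_size
    rel = torch.arange(-(m - 1), m, dtype=torch.float32)   # linear-scale offsets
    dy, dx = torch.meshgrid(rel, rel, indexing="ij")
    coords = torch.stack([dy, dx], dim=-1)                  # (2M-1, 2M-1, 2)
    return torch.sign(coords) * torch.log1p(coords.abs())

# Because the transform compresses large offsets, coordinates for an 8x8 window
# stay numerically close to those of a 16x16 window, which is what makes
# transfer across window sizes smooth.
print(log_spaced_coords(8).shape)   # torch.Size([15, 15, 2])
```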
3.3.4. Other representative methods

In addition to the conventional approaches, our benchmark extends to comparisons with several further cutting-edge techniques. ViL, introduced by Zhang et al. (2021), realizes a multi-scale configuration by stacking multiple ViT stages sequentially and additionally refines the attention mechanism, improving both efficiency and classification performance. The Focal Transformer (Yang et al., 2021) introduces the novel focal self-attention mechanism, which integrates fine-grained local and coarse-grained global interactions so that both short-range and long-range visual dependencies are captured effectively. Twins (Chu et al., 2021) proposes two highly efficient vision transformer architectures, Twins-PCPVT and Twins-SVT, both leveraging a restructured spatial attention mechanism that combines locally-grouped self-attention with global sub-sampled attention, again capturing fine-grained local and coarse-grained global information. Dong et al. (2022) introduced the CSWin Transformer, a robust Transformer-based backbone for vision tasks; it integrates the Cross-Shaped Window self-attention mechanism, varies stripe widths with network depth, and introduces a novel Locally-enhanced Positional Encoding (LePE) scheme to handle local positional information, resulting in competitive performance across standard vision tasks.


Fig. 12. (a) Swin Transformer (Swin-T) (b) Swin Transformer Block.
Source: Image from Liu et al. (2021b).

Fig. 13. The shifted window approach computes self-attention across the window boundaries of the previous layer.
Source: Image from Liu et al. (2021b).

Fig. 14. An illustration of the cyclic shift.
Source: Image from Liu et al. (2021b).

Fig. 15. Comparison of the attention modules of Swin Transformer V1 and V2.
Source: Image from Liu et al. (2021a).

3.4. Analysis and discussion for detectors

This section provides a succinct review of conventional Transformer-based object detectors, offering a detailed performance comparison in Tables 4 and 5. Each method was evaluated on an NVIDIA A100 GPU and adhered to the DETR training protocol. The AdamW optimizer (Loshchilov and Hutter, 2017) was uniformly employed across all methods, with the initial learning rate for the transformer set to 10⁻⁴, the backbone's to 10⁻⁵, and the weight decay to 10⁻⁴. The transformer weights were initialized with Xavier init (Glorot and Bengio, 2010), while the backbone leveraged the ImageNet-pretrained ResNet model from torchvision, with frozen batch normalization layers.
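A minimal sketch of this optimizer setup is shown below. The parameter-group split follows the description above; the `model` with `backbone` and transformer submodules is a hypothetical stand-in, and this is not the exact benchmark script.

```python
import torch

def build_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    """AdamW with lr 1e-4 for the transformer, 1e-5 for the backbone, weight decay 1e-4."""
    backbone_params, transformer_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue  # skip frozen parameters, e.g. frozen batch-norm weights
        (backbone_params if "backbone" in name else transformer_params).append(p)
    return torch.optim.AdamW(
        [{"params": transformer_params, "lr": 1e-4},
         {"params": backbone_params, "lr": 1e-5}],
        weight_decay=1e-4,
    )
```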
For Transformer Neck-based models, object detection is treated as a direct set prediction problem, removing hand-designed components (such as anchor sets and NMS) that cannot be optimized, thus enabling end-to-end detection. Starting from the original DETR, which converges slowly and detects small objects poorly, subsequent researchers have proposed optimization strategies from different perspectives.

1. To address the problem of slow convergence, researchers often start by improving the attention mechanism. Deformable DETR (Zhu et al., 2021) accelerates convergence by 12× with its deformable attention module. Conditional DETR (Meng et al., 2021) improves the cross-attention of DETR and obtains 8× faster convergence, while the box mAP on the COCO dataset improves by 1.8%. Unlike the above methods, ACT (Zheng et al., 2021) proposes a plug-and-play module for adaptive clustering, which reduces the GFLOPs of DETR by 15.2 without additional training, while the AP loss is only 0.7%. Sparse DETR (Roh et al., 2022) achieves higher performance at the same detection speed (FPS) as Faster R-CNN while reducing the GFLOPs by 75.

2. For the problem of poor detection of small objects, multi-scale features are currently the main focus. Methods such as SMCA (Gao et al., 2021) (as shown in Table 4) introduce multi-scale features with different operations and significantly improve the accuracy of the detector. Moreover, DINO (Zhang et al., 2022a) reaches 63.3 AP, surpassing all classical object detection methods.
Presently, most Transformer backbones are primarily active in image classification, and only a few researchers have transferred them to traditional object detectors for dense prediction; those that have done so have achieved state-of-the-art (SOTA) performance. Compared to CNN-based backbones, Transformer-based backbones can integrate global contextual information while outputting multi-scale feature maps, thereby enhancing feature extraction. Although Transformers have challenged CNN's dominance in object detection, recent advancements such as FAIR's redesign of the ConvNet (Liu et al., 2022b), which draws from the strengths of the Transformer structure, underscore the continued potential of CNNs. In the future, CNNs and visual Transformers are expected to continue improving by leveraging each other's strengths.
4. Discussion

Although the Transformer model has made great progress (as shown in Table 6) and has shown excellent performance (Tables 4 and 5), Transformer-based detectors still face several challenges, as well as limitations in practical applications. This section will summarize the innovative improvements of current methods, analyze the problems encountered by the Transformer detector, and give an outlook on future development prospects.


Table 5
The prediction results of RetinaNet and Mask R-CNN with Transformers as backbones on the COCO2017 val set, where 3×schedule denotes 36 training epochs, MS denotes multi-scale input, and the numbers before and after "/" denote the values for RetinaNet and Mask R-CNN, respectively.
Backbone #Params FLOPs RetinaNet 3×schedule + MS Mask R-CNN 3×schedule + MS
(M) (G) 𝐀𝐏𝑏 𝐀𝐏𝑏50 𝐀𝐏𝑏75 𝐀𝐏𝑆 𝐀𝐏𝑀 𝐀𝐏𝐿 𝐀𝐏𝑏 𝐀𝐏𝑏50 𝐀𝐏𝑏75 𝐀𝐏𝑚 𝐀𝐏𝑚50 𝐀𝐏𝑚75
ResNet50 (He et al., 2015) 38/44 239/260 39 58.4 41.8 22.4 42.8 51.6 41 61.7 44.9 37.1 58.4 40.1
PVTv1-S (Wang et al., 2021a) 34/44 226/245 42.2 62.7 45.0 26.2 45.2 57.2 43.0 65.3 46.9 39.9 62.5 42.8
ViL-S (Zhang et al., 2021) 36/45 252/174 42.9 63.8 45.6 27.8 46.4 56.3 43.4 64.9 47.0 39.6 62.1 42.4
Swin-T (Liu et al., 2021b) 39/48 245/264 45.0 65.9 48.4 29.7 48.9 58.1 46.0 68.1 50.3 41.6 65.1 44.9
PVTv2-B2-Li (Wang et al., 2022) 32/42 -/- – – – – – – 46.8 68.7 51.4 42.3 65.7 45.4
Focal-T (Yang et al., 2021) 39/49 265/291 45.5 66.3 48.8 31.2 49.2 58.7 47.2 69.4 51.9 42.7 66.5 45.9
TwinsP-S (Chu et al., 2021) 34/44 -/245 45.2 66.5 48.6 30.0 48.8 58.9 46.8 69.3 51.8 42.6 66.3 46.0
Twins-S (Chu et al., 2021) 34/55 -/228 45.6 67.1 48.6 29.8 49.3 60.0 46.8 69.2 51.2 42.6 66.3 45.8
CSwin-T (Dong et al., 2022) -/42 -/279 – – – – – – 49.0 70.7 53.7 43.6 67.9 46.6
PVTv2-B2 (Wang et al., 2022) 35/45 -/- – – – – – – 47.8 69.7 52.6 43.1 66.8 46.7
ResNet101 (He et al., 2015) 57/63 315/336 40.9 60.1 44.0 23.7 45.0 53.8 42.8 63.2 47.1 38.5 60.1 41.3
ResNeXt101-32×4d (He et al., 2015) 56/63 319/340 41.4 61.0 44.3 23.9 45.5 53.7 44.0 64.4 48.0 39.2 61.4 41.9
PVTv1-M (Wang et al., 2021a) 54/64 283/302 43.2 63.8 46.1 27.3 46.3 58.9 44.2 66.0 48.2 40.5 63.1 43.5
ViL-M (Zhang et al., 2021) 51/60 339/261 43.7 64.6 46.4 27.9 47.1 56.9 44.6 66.3 48.5 40.7 63.8 43.7
TwinsP-B (Chu et al., 2021) 54/64 -/302 46.4 67.7 49.8 31.3 50.2 61.4 47.9 70.1 52.5 43.2 67.2 46.3
Twins-B (Chu et al., 2021) 67/76 -/340 46.9 68.0 50.2 31.7 50.3 61.8 48.0 69.5 52.7 43.0 66.8 46.6
Swin-S (Liu et al., 2021b) 60/69 335/354 46.4 67.0 50.1 31.0 50.1 60.3 48.5 70.2 53.5 43.3 67.3 46.6
Focal-S (Yang et al., 2021) 62/71 367/401 47.3 67.8 51.0 31.6 50.9 61.1 48.8 70.5 53.6 43.8 67.7 47.2
CSwin-S (Dong et al., 2022) -/54 -/342 – – – – – – 50.0 71.3 54.7 44.5 68.4 47.7
ResNeXt101-64 × 4d (He et al., 2015) 96/102 473/493 41.8 61.5 44.4 25.2 45.4 54.6 44.4 64.9 48.8 39.7 61.9 42.6
PVTv1-Large (Wang et al., 2021a) 71/81 345/364 43.4 63.6 46.1 26.1 46.0 59.5 44.5 66.0 48.3 40.7 63.4 43.7
ViL-Base (Zhang et al., 2021) 67/76 443/365 44.7 65.5 47.6 29.9 48.0 58.1 45.7 67.2 49.9 41.3 64.4 44.5
Swin-Base (Liu et al., 2021b) 98/107 477/496 45.8 66.4 49.1 29.9 49.4 60.3 48.5 69.8 53.2 43.4 66.8 46.9
Focal-Base (Yang et al., 2021) 101/110 514/533 46.9 67.8 50.3 31.9 50.3 61.5 49.0 70.1 53.6 43.7 67.6 47.0
CSWin-B (Dong et al., 2022) -/97 -/526 – – – – – – 50.8 72.1 55.8 44.9 69.1 48.3

4.1. Challenges

High computational overhead. Typical properties of CNNs include inductive biases such as translation invariance, weight sharing, and sparse connectivity (Dosovitskiy et al., 2021). These properties grant CNNs a robust local feature extraction capability and enable them to achieve high performance through the simple sliding matching of convolutional kernels. As a result, compared to Transformers, CNNs often exhibit competitive performance with lower computational overhead. However, current CNN architectures possess less potential than Transformers due to their weaker extraction of global features and contextual information. The self-attention mechanism in Transformers can also emulate convolutional layers, requiring only a sufficient number of heads to focus on each pixel within the convolutional receptive field and employing relative positional encoding to ensure translation invariance (Cordonnier et al., 2020). This full-attention operation can effectively integrate local and global attention while dynamically generating attention weights based on feature relationships. Nevertheless, Transformers face certain limitations in practical applications, one of the main challenges being their high computational complexity.

The expensive computational overhead restricts the application of Transformer-based detectors on mobile computing platforms. At present, most mobile detection platforms primarily rely on one-stage detectors (Zhao et al., 2019), while the trend for Transformer detectors leans towards offline high-precision detection. Additionally, Transformers require large amounts of data, and common remedies include data augmentation and self-supervised or semi-supervised learning approaches (He et al., 2021). Compared to state-of-the-art CNN-based approaches, their deployment on mobile platforms is constrained by higher computational complexity.

The impact of computational overhead on deploying Transformer-based object detection models in practical scenarios is determined by factors such as the number of parameters, running time (FPS), and floating-point operations (FLOPs). However, the influence of these metrics varies with the application context and hardware environment. For example, in situations like autonomous driving or robotic navigation, FPS is a critical factor, as algorithms must process video streams at high frame rates to respond quickly to external changes. In the case of mobile devices and embedded systems, the number of parameters and FLOPs are more influential due to energy and memory constraints. Consequently, deploying algorithms on mobile platforms necessitates balancing performance, energy consumption, and memory usage. In cloud computing and high-performance hardware settings, computational overhead is not the most critical factor since computational resources are relatively abundant; in these scenarios, model performance and accuracy are paramount.

According to the data in Tables 4 and 5, modern Transformer-based models have outperformed classical two-stage object detection algorithms (e.g., Faster R-CNN) in terms of FPS and achieved improved accuracy, rendering them viable for practical applications. To ensure efficient deployment and application in real-world engineering scenarios, researchers typically optimize object detection algorithms for specific contexts, minimizing computational overhead and improving real-time performance and energy efficiency. This optimization may involve techniques such as model compression, knowledge distillation, and network architecture design.
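When weighing these trade-offs, the parameter count and throughput of a candidate detector can be estimated directly. The sketch below is illustrative only: it uses a torchvision Faster R-CNN as a stand-in detector, an assumed input resolution, and a crude CPU timing loop rather than a rigorous benchmark.

```python
import time
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights=None).eval()   # stand-in detector
num_params = sum(p.numel() for p in model.parameters()) / 1e6
print(f"parameters: {num_params:.1f} M")

# Rough FPS estimate: average latency over a few forward passes on a dummy image.
images = [torch.randn(3, 800, 800)]
with torch.no_grad():
    model(images)                        # warm-up
    start = time.perf_counter()
    for _ in range(10):
        model(images)
    fps = 10 / (time.perf_counter() - start)
print(f"approx. FPS (CPU, batch size 1): {fps:.2f}")
```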
Insufficient understanding of visual Transformers. Compared to the well-established research on and applications of CNNs, our current understanding of the underlying mechanisms of visual Transformers is still limited. The Transformer architecture was originally designed for sequence processing tasks (Vaswani et al., 2017). Although Transformers have demonstrated strong performance when applied to computer vision tasks, there is relatively little explanation of their specific roles and functions in this context. Consequently, gaining a deeper understanding of the principles behind visual Transformers is crucial to enable more fundamental optimizations and to enhance model interpretability. Such an understanding could involve investigating the attention mechanisms, the hierarchical feature representation, and the interactions between different layers within visual Transformer models; by exploring these aspects, we can potentially uncover novel optimization strategies and improve overall performance in various computer vision tasks.


Table 6
Summary of the advantages and limitations of Transformer-based object detection models.
Transformer Neck methods:

DETR (Carion et al., 2020). Highlights: (1) proposed a Transformer-based end-to-end object detection framework; (2) removed the hand-designed anchor set and non-maximum suppression (NMS). Limitations: (1) requires massive training data; (2) very slow convergence; (3) poor performance on small objects.

SMCA (Gao et al., 2021). Highlights: (1) combining a learnable co-attention map with a hand-crafted spatial prior speeds up the convergence of DETR; (2) incorporates a scale-selection network in the decoder. Limitations: (1) good performance on large objects but poor performance on small objects; (2) high computational overhead.

Deformable DETR (Zhu et al., 2021). Highlights: (1) proposed the deformable attention mechanism, which focuses on local information and improves convergence speed; (2) combined with multi-scale features; (3) reference points make the object queries interpretable; (4) a two-stage Deformable DETR is also proposed. Limitations: (1) lower accuracy on large objects; (2) deformable attention brings unordered memory access; (3) high computational overhead.

Efficient DETR (Yao et al., 2021). Highlights: (1) found that different object-container initialization methods have a great impact on the decoder; (2) proposed an efficient way of initializing object containers using the characteristics of dense and sparse detection. Limitations: (1) poor performance on small objects; (2) high computational overhead.

DINO (Zhang et al., 2022a). Highlights: (1) proposes a contrastive denoising training method; (2) combines DETR-like and two-stage models and proposes a mixed query selection method to better initialize the object queries; (3) "look forward twice": introduces information from adjacent decoder layers to update parameters and improve the detection of small objects. Limitations: (1) high computational overhead at high scales; (2) diminishing marginal benefit from stacking too many scales.

YOLOS (Fang et al., 2021). Highlights: (1) replaces the [cls] token with [det] tokens and the image classification loss with a bipartite matching loss; (2) proposes a pre-trained Transformer object detection paradigm. Limitations: (1) low detection accuracy; (2) high computational overhead.

UP-DETR (Dai et al., 2021a). Highlights: (1) proposes a new unsupervised pretext task to pre-train the Transformer; (2) proposes a patch feature reconstruction branch that is jointly optimized with patch detection. Limitations: (1) slow convergence; (2) poor performance on small objects.

Transformer Backbone methods:

FPT (Zhang et al., 2020). Highlights: (1) proposes a feature interaction method across space and scale; (2) high compatibility. Limitations: (1) low detection accuracy; (2) high computational overhead.

PVT (Wang et al., 2021a). Highlights: (1) outputs multi-scale, high-resolution feature maps; (2) the proposed spatial-reduction attention module allows PVT to be applied to dense prediction. Limitations: (1) high computational overhead for high-resolution images; (2) the simple image division loses the connection information between different patches.

Swin Transformer (Liu et al., 2021b). Highlights: (1) hierarchical representation; (2) introduces communication between windows by computing attention within shifted windows and reduces the computational complexity to be linear in the image size. Limitations: (1) excessive GPU memory consumption at higher image resolutions; (2) difficult to retrain on small datasets; (3) difficult to transfer models pre-trained at low resolutions to higher resolutions.

The inefficient image-to-sequence information transformation. Unlike images, human-created languages have a high semantic density: each word in a sentence can be treated as high-dimensional semantic information embedded in a low-dimensional vector representation. Images, by contrast, are natural signals with high spatial redundancy and a low information density per pixel. For example, He et al. (2021) randomly masked images at a high ratio and then reconstructed them well with a decoder, demonstrating that features of much higher semantic density than raw pixels can be captured from images. However, the current way of representing image information as sequences for the Transformer is not efficient enough, which can bring about accuracy degradation as well as high computational overhead. Establishing efficient transformations from images to sequences can help unlock the potential of the Transformer for CV tasks.
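The high-ratio random masking referred to above is simple to express. The sketch below illustrates the idea only and is not the MAE implementation of He et al. (2021): it keeps a random 25% of the patch tokens of an image sequence and discards the rest.

```python
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """patches: (B, N, D) patch tokens. Returns the kept tokens and their indices."""
    b, n, _ = patches.shape
    n_keep = int(n * (1 - mask_ratio))
    noise = torch.rand(b, n)                         # one random score per patch
    keep = noise.argsort(dim=1)[:, :n_keep]          # indices of patches to keep
    kept = torch.gather(patches, 1,
                        keep.unsqueeze(-1).expand(-1, -1, patches.size(-1)))
    return kept, keep

tokens = torch.randn(2, 196, 768)                    # e.g. 14x14 patches of a 224x224 image
kept, idx = random_masking(tokens)
print(kept.shape)                                    # torch.Size([2, 49, 768])
```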
4.2. Future development outlook

Visual Transformers have made great progress in recent years, especially in object detection, where their performance has surpassed SOTA CNN-based models on the COCO dataset. However, the Transformer is not yet mature enough for practical deployment: for example, its computational overhead is too large for platforms with limited computing resources, and its real-time performance is not as good as that of CNN-based one-stage approaches.

Self-supervised learning. While self-supervised learning has achieved great success in natural language processing, current object detection models are mainly trained with supervised learning and require large amounts of high-quality manually labeled data, which is usually too expensive.


Therefore, it is natural to use self-supervised learning for visual tasks in order to pre-train models on the large amounts of cheap data available on the Internet. For example, the MAE proposed by He et al. (2021) uses masked autoencoders for self-supervised learning; the models are adequately pre-trained and then transferred to specific tasks for fine-tuning.

Lightweight Transformer. Since the performance of current Transformer-based detectors is already powerful enough, the development of lightweight Transformer architectures should be considered to broaden their applicability. Key considerations include reducing computational demands, optimizing specifically for object detection, and designing intelligent queries to ensure high performance while minimizing computational overhead. This would enable deployment on mobile platforms with limited computational resources.

Multitasking. Within CNN-based methods, Mask R-CNN (He et al., 2017) successfully performs instance segmentation alongside object detection, yielding superior results. Could a Transformer detector also undertake multiple tasks simultaneously and derive benefits from this approach? For instance, the performance of the object detector could be enhanced by incorporating semantic segmentation: semantic segmentation captures object boundaries, aiding object localization, and segments the background to delineate the contextual information of the object, improving detection probability. Such an approach is especially useful because objects typically exist within specific contexts, such as cars appearing on roads.

5. Conclusion

For the past decade, CNN-based models have reigned supreme in the field of object detection. However, the Transformer has recently demonstrated superior performance and substantial potential in computer vision (CV), rendering Transformer-based models a burgeoning research topic within object detection. In this paper, we have conducted an extensive review of mainstream Transformer-based object detectors developed over the past three years. Our focus has mainly been on their concepts, innovative aspects, and detection accuracy. We have categorized these methods according to their model structure and established a benchmark based on the COCO2017 dataset. Furthermore, we have conducted a multi-perspective analysis and comparison of these methods, summarizing their innovations and enhancements. We have also provided a comprehensive analysis of their limitations and summarized the challenges that persist in the application of Transformers to object detection. This study aims to help readers deepen their understanding of Transformer object detectors, spark research interest to unleash the potential of the Transformer model, and enhance its practical applications.

CRediT authorship contribution statement

Yong Li: Writing – review & editing, Supervision, Project administration. Naipeng Miao: Conceptualization, Methodology, Validation, Formal analysis, Investigation, Writing – original draft, Writing – review & editing, Visualization. Liangdi Ma: Writing – review & editing. Feng Shuang: Funding acquisition, Resources. Xingwen Huang: Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The data is public.

References

Arkin, E., Yadikar, N., Muhtar, Y., Ubul, K., 2021. A survey of object detection based on CNN and transformer. In: 2021 IEEE 2nd International Conference on Pattern Recognition and Machine Learning (PRML). pp. 99–108. http://dx.doi.org/10.1109/PRML52754.2021.9520732.
Arkin, E., Yadikar, N., Xu, X., Aysa, A., Ubul, K., 2022. A survey: Object detection methods from CNN to transformer. Multimedia Tools Appl. http://dx.doi.org/10.1007/s11042-022-13801-3.
Bai, Y., Mei, J., Yuille, A., Xie, C., 2021. Are transformers more robust than CNNs? http://dx.doi.org/10.48550/arXiv.2111.05464.
Bochkovskiy, A., Wang, C.-Y., Liao, H.-Y.M., 2020. YOLOv4: Optimal speed and accuracy of object detection.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., 2020. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S., 2020. End-to-end object detection with transformers.
Chen, X., Ma, H., Wan, J., Li, B., Xia, T., 2017. Multi-view 3D object detection network for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1907–1915.
Chen, T., Saxena, S., Li, L., Fleet, D.J., Hinton, G., 2021. Pix2seq: A language modeling framework for object detection. arXiv:2109.10852 [cs].
Chen, C., Seff, A., Kornhauser, A., Xiao, J., 2015. DeepDriving: Learning affordance for direct perception in autonomous driving. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2722–2730.
Chu, X., Tian, Z., Wang, Y., Zhang, B., Ren, H., Wei, X., Xia, H., Shen, C., 2021. Twins: Revisiting the design of spatial attention in vision transformers. arXiv:2104.13840 [cs].
Cordonnier, J.-B., Loukas, A., Jaggi, M., 2020. On the relationship between self-attention and convolutional layers. arXiv:1911.03584 [cs, stat].
Dai, Z., Cai, B., Lin, Y., Chen, J., 2021a. UP-DETR: Unsupervised pre-training for object detection with transformers. arXiv:2011.09094 [cs].
Dai, X., Chen, Y., Xiao, B., Chen, D., Liu, M., Yuan, L., Zhang, L., 2021b. Dynamic head: Unifying object detection heads with attentions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7373–7382.
Dai, X., Chen, Y., Yang, J., Zhang, P., Yuan, L., Zhang, L., 2021c. Dynamic DETR: End-to-end object detection with dynamic attention. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, Montreal, QC, Canada, pp. 2968–2977. http://dx.doi.org/10.1109/ICCV48922.2021.00298.
Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y., 2017. Deformable convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 764–773.
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dollar, P., Wojek, C., Schiele, B., Perona, P., 2012. Pedestrian detection: An evaluation of the state of the art. IEEE Trans. Pattern Anal. Mach. Intell. 34 (4), 743–761.
Dong, X., Bao, J., Chen, D., Zhang, W., Yu, N., Yuan, L., Chen, D., Guo, B., 2022. CSWin transformer: A general vision transformer backbone with cross-shaped windows. arXiv:2107.00652 [cs].
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N., 2021. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv:2010.11929 [cs].
Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A., 2007. The PASCAL visual object classes challenge 2007 (VOC2007) results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A., 2012. The PASCAL visual object classes challenge 2012 (VOC2012) results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
Fang, Y., Liao, B., Wang, X., Fang, J., Qi, J., Wu, R., Niu, J., Liu, W., 2021. You only look at one sequence: Rethinking transformer in vision through object detection. arXiv:2106.00666 [cs].
Felzenszwalb, P.F., Girshick, R.B., McAllester, D., Ramanan, D., 2010. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 32 (9), 1627–1645.
Gao, P., Zheng, M., Wang, X., Dai, J., Li, H., 2021. Fast convergence of DETR with spatially modulated co-attention. arXiv:2101.07448 [cs].
Ge, Z., Liu, S., Wang, F., Li, Z., Sun, J., 2021. YOLOX: Exceeding YOLO series in 2021.
Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F.A., Brendel, W., 2019. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv:1811.12231 [cs, q-bio, stat].
Glorot, X., Bengio, Y., 2010. Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, pp. 249–256.


Han, K., Wang, Y., Chen, H., Chen, X., Guo, J., Liu, Z., Tang, Y., Xiao, A., Xu, C., Xu, Y., Yang, Z., Zhang, Y., Tao, D., 2022. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. http://dx.doi.org/10.1109/TPAMI.2022.3152247.
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R., 2021. Masked autoencoders are scalable vision learners.
He, K., Gkioxari, G., Dollár, P., Girshick, R., 2017. Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2961–2969.
He, K., Zhang, X., Ren, S., Sun, J., 2015. Deep residual learning for image recognition. arXiv:1512.03385 [cs].
Khan, S., Naseer, M., Hayat, M., Zamir, S.W., Khan, F.S., Shah, M., 2021. Transformers in vision: A survey. arXiv:2101.01169.
Kobatake, H., Yoshinaga, Y., 1996. Detection of spicules on mammogram based on skeleton analysis. IEEE Trans. Med. Imaging 15 (3), 235–245. http://dx.doi.org/10.1109/42.500062.
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R., 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
Li, F., Zhang, H., Liu, S., Guo, J., Ni, L.M., Zhang, L., 2022. DN-DETR: Accelerate DETR training by introducing query denoising. arXiv:2203.01305 [cs].
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S., 2016. Feature pyramid networks for object detection. arXiv:1612.03144.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L., 2014. Microsoft COCO: Common objects in context. In: European Conference on Computer Vision. Springer, pp. 740–755.
Lin, J., Mao, X., Chen, Y., Xu, L., He, Y., Xue, H., 2022. D2ETR: Decoder-only DETR with computationally efficient cross-scale attention. arXiv:2203.00860 [cs].
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., Berg, A.C., 2016. SSD: Single shot multibox detector. In: Computer Vision – ECCV 2016. pp. 21–37. http://dx.doi.org/10.1007/978-3-319-46448-0_2.
Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., Wei, F., Guo, B., 2021a. Swin transformer V2: Scaling up capacity and resolution. arXiv:2111.09883 [cs].
Liu, S., Li, F., Zhang, H., Yang, X., Qi, X., Su, H., Zhu, J., Zhang, L., 2022a. DAB-DETR: Dynamic anchor boxes are better queries for DETR. arXiv:2201.12329 [cs].
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B., 2021b. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv:2103.14030 [cs].
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S., 2022b. A ConvNet for the 2020s. arXiv:2201.03545 [cs].
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V., 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692 [cs].
Liu, Y., Zhang, Y., Wang, Y., Hou, F., Yuan, J., Tian, J., Zhang, Y., Shi, Z., Fan, J., He, Z., 2021c. A survey of visual transformers. arXiv:2111.06091 [cs].
Loshchilov, I., Hutter, F., 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
Meng, D., Chen, X., Fan, Z., Zeng, G., Li, H., Yuan, Y., Sun, L., Wang, J., 2021. Conditional DETR for fast training convergence. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3651–3660.
Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., 2018. Improving language understanding by generative pre-training.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., 2019. Language models are unsupervised multitask learners. OpenAI Blog 1 (8), 9.
Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv:1606.05250 [cs].
Redmon, J., Divvala, S., Girshick, R., Farhadi, A., 2016. You only look once: Unified, real-time object detection. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Las Vegas, NV, USA, pp. 779–788.
Redmon, J., Farhadi, A., 2017. YOLO9000: Better, faster, stronger. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6517–6525.
Redmon, J., Farhadi, A., 2018. YOLOv3: An incremental improvement.
Ren, S., He, K., Girshick, R., Sun, J., 2016. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv:1506.01497 [cs].
Roh, B., Shin, J., Shin, W., Kim, S., 2022. Sparse DETR: Efficient end-to-end object detection with learnable sparsity. arXiv:2111.14330 [cs].
Ross, T.-Y., Dollár, G., 2017. Focal loss for dense object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2980–2988.
Sang, E.F.T.K., De Meulder, F., 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv:cs/0306050.
Sun, Z., Cao, S., Yang, Y., Kitani, K.M., 2021. Rethinking transformer-based set prediction for object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3611–3620.
Sung, K.-K., Poggio, T., 1998. Example-based learning for view-based human face detection. IEEE Trans. Pattern Anal. Mach. Intell. 20 (1), 39–51. http://dx.doi.org/10.1109/34.655648.
Sutskever, I., Vinyals, O., Le, Q.V., 2014. Sequence to sequence learning with neural networks.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need. In: Advances in Neural Information Processing Systems, Vol. 30. Curran Associates, Inc.
Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., Lu, T., Luo, P., Shao, L., 2021a. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. arXiv:2102.12122 [cs].
Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., Lu, T., Luo, P., Shao, L., 2022. PVTv2: Improved baselines with pyramid vision transformer. arXiv:2106.13797 [cs].
Wang, T., Yuan, L., Chen, Y., Feng, J., Yan, S., 2021b. PnP-DETR: Towards efficient visual analysis with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4661–4670.
Wang, Y., Zhang, X., Yang, T., Sun, J., 2021c. Anchor DETR: Query design for transformer-based object detection.
Yang, J., Li, C., Zhang, P., Dai, X., Xiao, B., Yuan, L., Gao, J., 2021. Focal self-attention for local-global interactions in vision transformers. arXiv:2107.00641 [cs].
Yao, Z., Ai, J., Li, B., Zhang, C., 2021. Efficient DETR: Improving end-to-end object detector with dense prior. arXiv:2104.01318 [cs].
Yuan, L., Chen, D., Chen, Y.-L., Codella, N., Dai, X., Gao, J., Hu, H., Huang, X., Li, B., Li, C., Liu, C., Liu, M., Liu, Z., Lu, Y., Shi, Y., Wang, L., Wang, J., Xiao, B., Xiao, Z., Yang, J., Zeng, M., Zhou, L., Zhang, P., 2021. Florence: A new foundation model for computer vision. arXiv:2111.11432 [cs].
Zhang, P., Dai, X., Yang, J., Xiao, B., Yuan, L., Zhang, L., Gao, J., 2021. Multi-scale vision longformer: A new vision transformer for high-resolution image encoding. arXiv preprint arXiv:2103.15358.
Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L.M., Shum, H.-Y., 2022a. DINO: DETR with improved denoising anchor boxes for end-to-end object detection. arXiv:2203.03605 [cs].
Zhang, G., Luo, Z., Yu, Y., Cui, K., Lu, S., 2022b. Accelerating DETR convergence via semantic-aligned matching. arXiv:2203.06883 [cs].
Zhang, D., Zhang, H., Tang, J., Wang, M., Hua, X., Sun, Q., 2020. Feature pyramid transformer. In: European Conference on Computer Vision. Springer, pp. 323–339.
Zhao, Z.-Q., Zheng, P., Xu, S.-T., Wu, X., 2019. Object detection with deep learning: A review. IEEE Trans. Neural Netw. Learn. Syst. 30 (11), 3212–3232. http://dx.doi.org/10.1109/TNNLS.2018.2876865.
Zheng, M., Gao, P., Zhang, R., Li, K., Wang, X., Li, H., Dong, H., 2021. End-to-end object detection with adaptive clustering transformer. arXiv:2011.09315 [cs].
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J., 2021. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv:2010.04159 [cs].

