Transformer For Object Detection Review and Benchmark

Survey paper

Keywords: Review; Object detection; Transformer-based models; COCO2017 dataset; Benchmark

Abstract: Object detection is a crucial task in computer vision (CV). With the rapid advancement of Transformer-based models in natural language processing (NLP) and various visual tasks, Transformer structures are becoming increasingly prevalent in CV tasks. In recent years, numerous Transformer-based object detectors have been proposed, achieving performance comparable to mainstream convolutional neural network-based (CNN-based) approaches. To provide researchers with a comprehensive understanding of the development, advantages, disadvantages, and future potential of Transformer-based object detectors in Artificial Intelligence (AI), this paper systematically reviews the mainstream methods and analyzes the limitations and challenges encountered in their current applications, while also offering insights into future research directions. We have reviewed a large number of papers, selected the most prominent Transformer detection methods, and divided them into Transformer Neck and Transformer Backbone categories for introduction and comparative analysis. Furthermore, we have constructed a benchmark using the COCO2017 dataset to evaluate different object detection algorithms. Finally, we summarize the challenges and prospects in this field.
Some reviews (Khan et al., 2021; Liu et al., 2021c; Arkin et al., 2021; Han et al., 2022; Arkin et al., 2022) have provided detailed introductions and analyses of Transformer-based detectors. In contrast to these surveys, our study not only presents a thorough comparison of the strengths and weaknesses of object detectors based on both Transformer and CNN architectures, but also classifies the prevalent Transformer-based detectors into Transformer Backbone and Transformer Neck categories. Moreover, we systematically analyze their performance, potential, and limitations. We investigate the advancements and constraints of various state-of-the-art Transformer-based detectors (Table 6) and establish benchmarks for these methods using the COCO2017 dataset (Tables 4 and 5). We hope this review delivers a comprehensive understanding of Transformer-based object detectors for researchers.
We have categorized existing methods into two groups based on the role of the Transformer in the overall model architecture: Transformer Neck and Transformer Backbone, as illustrated in Fig. 1. We present a detailed analysis of representative methods, compare these methods horizontally on the COCO2017 dataset (Lin et al., 2014), and summarize the novelty of each method, such as Transformer Backbones with a hierarchical representation structure, spatial prior acceleration based on sparse attention, and pure sequence processing for object detection, among others. The main contributions of this paper are as follows:

Fig. 2. Transformer structure. The Input Embedding module of the Transformer encoder (left column) maps the input sequence to the embedding space and passes it to the encoder module for processing. The Transformer decoder (right column) receives the previous output sequence and the output sequence from the intermediate encoder. The previous output sequence is shifted one position to the right, and the start token is prepended to the sequence to obtain the decoder input. The feed-forward network and the multi-head attention module are repeated N times to form the encoder and decoder.
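As a concrete reference for the encoder-decoder structure described in Fig. 2, the following is a minimal sketch built on PyTorch's nn.Transformer. The sequence lengths, model width, and vocabulary size are illustrative assumptions, and positional encodings are omitted for brevity; this is not the configuration of any detector discussed in this survey.

```python
import torch
import torch.nn as nn

# Minimal encoder-decoder Transformer in the spirit of Fig. 2.
# All hyperparameters below are illustrative only; positional encoding is omitted.
d_model, nhead, num_layers, vocab = 256, 8, 6, 1000

embed = nn.Embedding(vocab, d_model)          # "Input Embedding" / "Output Embedding"
transformer = nn.Transformer(
    d_model=d_model, nhead=nhead,
    num_encoder_layers=num_layers, num_decoder_layers=num_layers,
    batch_first=True,
)
head = nn.Linear(d_model, vocab)              # projects decoder output back to tokens

src = torch.randint(0, vocab, (2, 32))        # source sequence (batch=2, length=32)
tgt = torch.randint(0, vocab, (2, 16))        # previous outputs, shifted right with a start token

# Causal mask so each decoder position only attends to earlier positions.
tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))

out = transformer(embed(src), embed(tgt), tgt_mask=tgt_mask)
logits = head(out)                            # shape: (2, 16, vocab)
print(logits.shape)
```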
Table 1
Summary of the highlights and limitations of CNN and Transformer.

Transformer
Highlights: (1) The attention mechanism amplifies the significance of crucial aspects of an image while reducing the rest, thereby concentrating on more relevant features. This mechanism assists the Transformer in modeling the long-range dependencies of input sequence elements and thus enhances its generalization ability for samples outside the distribution (Bai et al., 2021). (2) Unlike methods such as RNN and LSTM, the Transformer allows for parallel computation. (3) Given its straightforward yet adaptable design, the Transformer can tackle multiple tasks simultaneously, rendering it a potential candidate for a general-purpose model handling various tasks.
Limitations: (1) Transformer-based models are known for their substantial data requirements and computationally expensive nature, particularly when applied to vision tasks (He et al., 2021). (2) They are also characterized by a slower rate of convergence, which can pose challenges in their utilization (Gao et al., 2021). (3) Further, these models often involve high computational overhead, which exacerbates their deployment issues in resource-constrained settings (Li et al., 2022).

CNN
Highlights: (1) CNN-based models have strong local feature extraction ability, benefiting from inductive bias properties such as translation invariance, weight sharing, and sparse connectivity. (2) CNNs can operate in parallel with lower computational complexity than Transformers.
Limitations: (1) CNNs rarely encode relative feature positions, instead favoring receptive-field expansion via larger kernels or stacked layers, which often reduces the computational and statistical efficiency of local convolution. (2) CNN's global feature capture is comparatively weaker than that of Transformer models (Liu et al., 2021b).
Table 3
Definition of terms.
Terms Definitions
TP (True Positive) Positive samples are correctly identified as positive samples.
TN (True Negative) Negative samples are correctly identified as negative samples.
FP (False Positive) False positive samples, that is, negative samples are mistakenly identified as positive samples.
FN (False Negative) False negative samples, that is, positive samples are wrongly identified as negative samples.
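Using the definitions in Table 3, precision and recall follow directly from the TP/FP/FN counts, and AP can be viewed as the area under a precision-recall curve. The short plain-Python sketch below illustrates these definitions with made-up counts; it is not the official COCO evaluation code.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def average_precision(precisions, recalls):
    """Approximate AP as the area under the PR curve (rectangle rule over recall steps)."""
    ap, prev_recall = 0.0, 0.0
    for p, r in sorted(zip(precisions, recalls), key=lambda pair: pair[1]):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap

# Illustrative counts and a three-point PR curve.
print(precision_recall(tp=80, fp=20, fn=10))
print(average_precision([0.9, 0.8, 0.6], [0.3, 0.5, 0.8]))
```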
Fig. 4. The pipeline of DETR. The backbone is a convolutional neural network (CNN) that serves as a feature extractor, and the Transformer, consisting of an encoder and a decoder, is the core of the DETR architecture. The high-dimensional feature map from the backbone is flattened and fed into the encoder. The encoder processes the spatial information and outputs a sequence of encoded feature vectors. Finally, the output of the decoder is passed through a series of linear layers to predict the final bounding box coordinates and class probabilities for each object query.
Source: Image from Carion et al. (2020).
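To make the pipeline in Fig. 4 concrete, the following is a simplified sketch of a DETR-style forward pass in PyTorch: a ResNet-50 backbone, a transformer encoder-decoder, learned object queries, and class/box heads. It follows the structure described in the caption rather than the official implementation; the layer sizes, the 91-class COCO head, and the omission of positional encodings are simplifying assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class MiniDETR(nn.Module):
    """Simplified DETR-style detector: backbone -> transformer -> fixed-size set of predictions."""
    def __init__(self, num_classes=91, num_queries=100, d_model=256):
        super().__init__()
        backbone = resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # keep the C5 feature map
        self.input_proj = nn.Conv2d(2048, d_model, kernel_size=1)       # reduce channel dimension
        self.transformer = nn.Transformer(d_model, nhead=8,
                                          num_encoder_layers=6,
                                          num_decoder_layers=6,
                                          batch_first=True)
        self.query_embed = nn.Embedding(num_queries, d_model)           # learned object queries
        self.class_head = nn.Linear(d_model, num_classes + 1)           # +1 for "no object"
        self.bbox_head = nn.Linear(d_model, 4)                          # (cx, cy, w, h) in [0, 1]

    def forward(self, images):
        feat = self.input_proj(self.backbone(images))                   # (B, d, H, W)
        b = feat.shape[0]
        src = feat.flatten(2).transpose(1, 2)                           # (B, H*W, d) token sequence
        queries = self.query_embed.weight.unsqueeze(0).expand(b, -1, -1)
        # Positional encodings are omitted here for brevity.
        hs = self.transformer(src, queries)                             # decoder output (B, Nq, d)
        return self.class_head(hs), self.bbox_head(hs).sigmoid()

logits, boxes = MiniDETR()(torch.randn(1, 3, 224, 224))
print(logits.shape, boxes.shape)   # (1, 100, 92) and (1, 100, 4)
```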
Mean average precision (mAP). AP is calculated for each class individually, and mAP is the average of AP values across all classes. The mAP can be calculated as Eq. (10). Two kinds of mAP are commonly used. (1) The PASCAL VOC challenge uses mAP as a metric with an IoU threshold of 0.5. (2) MS COCO averages mAP over IoU thresholds from 50% to 95% with a step of 0.05; this metric is denoted in papers by mAP@[.5,.95]. Therefore, COCO averages AP not only over all classes but also over the defined IoU thresholds.

\mathrm{mAP} = \frac{\sum_{i=1}^{k} \mathrm{AP}_i}{k} \quad \text{for } k \text{ classes}, \tag{10}

Frames Per Second (FPS). FPS measures how fast an object detection model processes video and generates the desired output.

3.2. Transformer neck

In this section, we review the classic Transformer Neck-based object detection models of the last two years, starting from the original Transformer detector, DETR (Carion et al., 2020). The original DETR regards object detection as end-to-end set prediction, thus removing hand-designed components such as anchor boxes and non-maximum suppression (NMS). However, some drawbacks need to be solved in DETR, such as slow convergence and poor detection of small objects. Therefore, researchers have proposed many approaches (sparse attention, spatial prior acceleration, multi-scale detection) to improve it. We compare the performance of all methods on the COCO2017 dataset with the benchmark shown in Table 4.

3.2.1. DETR

DETR, proposed by Carion et al. (2020), is the first object detector that successfully uses the Transformer as the main module for object detection. DETR not only has a simpler and more flexible structure but also has performance comparable to previous SOTA approaches, such as the highly optimized Faster R-CNN. Unlike classical object detectors, DETR is an end-to-end object detection model. It gets rid of the autoregressive model, performs parallel inference on object relationships and the global image context, and then outputs the final predictions. The structure of DETR is shown in Fig. 4.

DETR treats the object detection task as an intuitive set prediction problem and discards traditional hand-crafted components such as hand-designed anchor sets and non-maximum suppression (NMS). As shown in Fig. 4, DETR uses a CNN backbone to learn the 2D features of the input image. The feature maps are then unfolded into sequences and fed to the Transformer encoder module (with positional encoding). The output of the Transformer decoder module is then obtained under the constraint of the object queries. Finally, the class and bounding box regression parameters are obtained after a feed-forward network.

Based on the idea of sequential prediction, DETR regards the prediction of the network as a fixed sequence \tilde{y} of length N, \tilde{y} = \{\tilde{y}_i\}, i \in (1, N) (where the value of N is fixed and much larger than the number of Ground Truth objects in the image), with \tilde{y}_i = (\tilde{c}_i, \tilde{b}_i). Meanwhile, the Ground Truth is considered as a sequence y: y_i = (c_i, b_i) (its length must be less than N, so the sequence is padded with \varnothing (no object), which can be interpreted as the background category, to make its length equal to N), where c_i denotes the true category to which the object belongs, and b_i denotes a quadruple containing the center point coordinates and the width and height of the object box, all relative to the image scale.

So the prediction task can be viewed as a bipartite matching problem between y and \tilde{y}, with the Hungarian algorithm as the solution method, and the minimum-cost matching is defined as follows:

\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N} \sum_{i}^{N} \mathcal{L}_{\mathrm{match}}\left(y_i, \hat{y}_{\sigma(i)}\right), \tag{11}

where \hat{\sigma} denotes the matching that minimizes the total cost, and \mathcal{L}_{\mathrm{match}} considers both the class prediction and the similarity between the predicted and Ground Truth boxes. For the prediction with index \sigma(i), the predicted confidence for category c_i is \hat{p}_{\sigma(i)}(c_i) and the predicted bounding box is \hat{b}_{\sigma(i)}; for non-empty matches, \mathcal{L}_{\mathrm{match}}(y_i, \hat{y}_{\sigma(i)}) is defined as -\mathbb{1}_{\{c_i \neq \varnothing\}}\,\hat{p}_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\,\mathcal{L}_{\mathrm{box}}(b_i, \hat{b}_{\sigma(i)}).

In this way, the overall loss is obtained as

\mathcal{L}_{\mathrm{Hungarian}}(y, \hat{y}) = \sum_{i=1}^{N}\left[-\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\,\mathcal{L}_{\mathrm{box}}\left(b_i, \hat{b}_{\hat{\sigma}(i)}\right)\right], \tag{12}

Considering the bounding box scale, the L_1 loss and the IoU loss are linearly combined to obtain the \mathcal{L}_{\mathrm{box}} loss:

\mathcal{L}_{\mathrm{box}} = \lambda_{\mathrm{iou}}\,\mathcal{L}_{\mathrm{iou}}\left(b_i, \hat{b}_{\sigma(i)}\right) + \lambda_{L1}\left\| b_i - \hat{b}_{\sigma(i)} \right\|_1, \tag{13}
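The bipartite matching in Eqs. (11)-(13) is typically solved with the Hungarian algorithm. The sketch below is a deliberately simplified illustration using scipy's linear_sum_assignment: it keeps only a class-probability term and an L1 box term and omits the generalized-IoU term, so the cost weights and function names are our own assumptions rather than DETR's exact formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_probs, pred_boxes, gt_labels, gt_boxes, l1_weight=5.0):
    """Assign each ground-truth object to one prediction by minimum matching cost.

    pred_probs: (N, num_classes) softmax scores; pred_boxes: (N, 4) in [0, 1]
    gt_labels:  (M,) class indices;              gt_boxes:  (M, 4) in [0, 1]
    """
    # Cost of assigning prediction j to ground truth i:
    #   -p_j(c_i) + lambda_L1 * ||b_i - b_j||_1     (GIoU term omitted in this sketch)
    class_cost = -pred_probs[:, gt_labels]                                  # (N, M)
    l1_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None]).sum(-1)       # (N, M)
    cost = class_cost + l1_weight * l1_cost                                 # (N_pred, M_gt)
    pred_idx, gt_idx = linear_sum_assignment(cost)                          # minimizes total cost
    return list(zip(gt_idx, pred_idx))          # (ground-truth index, matched prediction index)

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(5), size=10)      # 10 predictions over 5 classes
boxes = rng.random((10, 4))
print(hungarian_match(probs, boxes, np.array([1, 3]), rng.random((2, 4))))
```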
Table 4
Comparison between Transformer Necks and representative CNNs on the COCO2017 val set. "Multi-scale" refers to multi-scale inputs. AP denotes IoU threshold = .50:.05:.95; AP50 and AP75 denote IoU thresholds = .50 and .75. In addition, APS, APM, APL denote different scales of objects: Small means area < 32^2, Medium means 32^2 < area < 96^2, Large means area > 96^2.

Method Backbone Epochs GFLOPs #Params(M) Multi-scale FPS AP AP50 AP75 APS APM APL
Faster R-CNN + FPN (Ren et al., 2016) ResNet50 109 180 42 – 26 42.0 62.1 45.5 26.6 45.4 53.4
DETR+ (Carion et al., 2020) ResNet50 500 86 41 – 28 42.0 62.4 44.2 20.5 45.8 61.1
DETR-DC5+ (Carion et al., 2020) ResNet50 500 187 41 – 12 43.4 63.1 45.9 22.5 47.3 61.1
DETR (Carion et al., 2020) ResNet50 50 86 41 – 12 42.0 62.4 44.2 20.5 45.8 61.1
DETR-DC5 (Carion et al., 2020) ResNet50 50 187 41 – 12 43.4 63.1 45.9 22.5 47.3 61.1
UP-DETR (Dai et al., 2021a) ResNet50 150 86 41 – 28 40.5 60.8 42.6 19.0 44.4 60.0
UP-DETR+ (Dai et al., 2021a) ResNet50 300 86 41 – 28 42.8 63.0 45.3 20.8 47.1 61.7
Deformable DETR (Zhu et al., 2021) ResNet50 50 173 40 – 19 43.8 62.6 47.7 26.4 47.1 58.0
Two-stage Deformable DETR (Zhu et al., 2021) ResNet50 50 173 40 – 19 46.2 65.2 50.0 28.8 49.2 61.7
Conditional DETR (Meng et al., 2021) ResNet50 108 90 44 – – 43.0 64.0 45.7 22.7 46.7 61.5
Conditional DETR-DC5 (Meng et al., 2021) ResNet50 108 195 44 – – 45.1 65.4 48.5 25.3 49.0 62.2
ACT-MTKD (L=16) (Zheng et al., 2021) ResNet50 – 156 – – 14 40.6 – – 18.5 44.3 59.7
ACT-MTKD (L=32) (Zheng et al., 2021) ResNet50 – 169 – – 16 43.1 – – 22.2 47.1 61.4
SMCA (Gao et al., 2021) ResNet50 50 152 40 – 10 43.7 63.6 47.2 24.2 47.0 60.4
SMCA+ (Gao et al., 2021) ResNet50 50 152 108 – 10 45.6 65.5 49.1 25.9 49.3 62.6
Efficient DETR (Yao et al., 2021) ResNet50 36 159 32 – – 44.2 62.2 48.0 28.4 47.5 56.6
Efficient DETR* (Yao et al., 2021) ResNet50 36 210 35 – – 45.1 65.4 48.5 25.3 49.0 62.2
TSP-FCOS (Sun et al., 2021) ResNet50 36 189 51.5 – 15 43.1 62.3 47.0 26.6 46.8 55.9
TSP-RCNN (Sun et al., 2021) ResNet50 36 188 64 – 11 43.8 63.3 48.3 28.6 46.9 55.7
TSP-RCNN+ (Sun et al., 2021) ResNet50 96 188 64 – 11 45.0 64.5 49.6 29.7 47.7 58.0
YOLOS-S (Fang et al., 2021) DeiT-S 150 200 30.7 – 7 36.1 56.4 37.1 15.3 38.5 56.1
YOLOS-S (Fang et al., 2021) DeiT-S 150 179 27.9 – 5 37.6 57.6 39.2 15.9 40.2 57.3
YOLOS-B (Fang et al., 2021) DeiT-B 150 537 127 – – 42.0 62.2 44.5 19.5 45.3 62.1
PnP-DETR-R50-DC5-α-0.33 (Wang et al., 2021b) ResNet50 500 20.7 (omit backbone) – – – 42.7 62.8 45.1 22.4 46.2 60.0
PnP-DETR-R50-DC5-α-0.5 (Wang et al., 2021b) ResNet50 500 32.9 (omit backbone) – – – 43.1 63.4 45.3 22.7 46.5 61.1
Dynamic DETR (Dai et al., 2021c) ResNet50 40 – – – – 47.2 65.9 51.1 28.6 49.3 59.1
Anchor DETR-C5 (Wang et al., 2021c) ResNet50 50 – – – – 42.1 63.1 44.9 22.3 46.2 60.0
Anchor DETR-DC5 (Wang et al., 2021c) ResNet50 50 – – – – 44.2 64.7 47.5 24.7 48.2 60.6
D2ETR (Lin et al., 2022) PVTv2 50 82 35 – – 43.2 62.9 46.2 22.0 48.5 62.4
Deformable D2ETR (Lin et al., 2022) PVTv2 50 93 40 – – 50.0 67.9 54.1 31.7 53.4 66.7
Sparse DETR (ρ = 10%) (Roh et al., 2022) ResNet50 50 105 41 – 25.3 45.3 65.8 49.3 28.4 48.3 60.1
Sparse DETR (ρ = 10%) (Roh et al., 2022) Swin-T 50 113 41 – 21.2 48.2 69.2 52.3 29.8 51.2 64.5
DAB-DETR (Liu et al., 2022a) ResNet50 50 202 44 – – 44.5 65.1 47.7 25.3 48.2 62.3
DAB-DETR* (Liu et al., 2022a) ResNet50 50 216 44 – – 45.7 66.2 49.0 26.1 49.4 63.1
DN-DETR (Li et al., 2022) ResNet50 50 94 44 – – 44.1 64.4 46.7 22.9 48.0 63.4
DN-DETR-DC5 (Li et al., 2022) ResNet50 50 202 44 – – 46.3 66.4 49.7 26.7 50.0 64.3
DN-Deformable-DETR (Li et al., 2022) ResNet50 50 195 48 – – 48.6 67.4 52.7 31.0 52.0 63.7
DINO-4scale (Zhang et al., 2022a) ResNet50 12 279 47 – 24 47.9 65.3 52.1 31.2 50.9 61.9
DINO-5scale (Zhang et al., 2022a) ResNet50 12 860 47 – 10 48.3 65.8 52.4 32.2 51.3 62.2
DINO-4scale (Zhang et al., 2022a) ResNet50 36 – – – – 50.5 68.3 55.1 32.7 53.9 64.9
DINO-5scale (Zhang et al., 2022a) ResNet50 36 – – – – 51.0 69.0 55.6 34.1 53.6 65.6
SAM-DETR (Zhang et al., 2022b) ResNet50 50 100 58 – – 39.8 61.8 41.6 20.5 43.4 59.6
SAM-DETR-DC5 (Zhang et al., 2022b) ResNet50 50 210 58 – – 43.3 64.4 46.2 25.1 46.9 61.0
Pix2Seq (Chen et al., 2021) ResNet50 50 – 37 – – 43.0 61.0 45.6 25.1 46.9 59.4
Pix2Seq-DC5 (Chen et al., 2021) ResNet50 50 – 38 – – 43.2 61.0 46.1 26.6 47.0 58.6
Additionally, we have presented the attention visualization of the encoder and decoder (as shown in Figs. 5 and 6). This visualization aids in understanding how the model focuses on various parts of the input image and utilizes attention mechanisms for object detection. The encoder processes the input image, captures its spatial information, and creates a set of contextualized feature representations. Attention visualization in the encoder demonstrates how the model concentrates on specific regions of the image, emphasizing crucial areas that contribute to the comprehension of the objects present. The decoder uses the encoded features to generate the final object detections, employing a series of self-attention and cross-attention mechanisms to iteratively refine the predicted object bounding boxes and class labels.

In summary, DETR, the first Transformer-based end-to-end object detector, exhibited performance comparable to state-of-the-art (SOTA) methods at the time. However, there are evident drawbacks in its application: slow convergence and low accuracy on small objects. Nonetheless, its end-to-end architecture possesses significant potential and has attracted numerous researchers to explore improvements.

3.2.2. UP-DETR

Since DETR faces great challenges in training and optimization, it requires a huge amount of training data and an extremely long training schedule, which limits its application on small datasets. Moreover, the existing pretext tasks cannot be directly applied to train the Transformer module of DETR, because DETR focuses mainly on spatial localization rather than image instance-based or cluster-based segmentation learning. To address the above issues, Dai et al. (2021a) proposed UP-DETR, a DETR-like model capable of unsupervised pre-training, whose structure is shown in Fig. 7.

Multiple query patches are randomly cropped from a given image, and the Transformer for detection is pre-trained to predict the bounding boxes of these query patches in the given image.
Fig. 5. Encoder self-attention for a set of reference points. It demonstrates the attention distribution after the input image is processed through the Transformer encoder.
Fig. 6. Visualization of decoder attention for each predicted object in images from the COCO validation set, using the DETR-DC5 model. Attention scores are represented by
distinct colors for different objects. The decoder primarily focuses on object extremities, such as legs and heads, highlighting the model’s ability to capture fine-grained details. It
is recommended to view this figure in color for better understanding.
Fig. 7. UP-DETR pre-training architecture by random query patch detection: (a) for a single query patch, the patch is added to all object queries; (b) for multi-query patches, each query patch is added to N/M object queries, with object query shuffle and an attention mask.
Source: Image from Dai et al. (2021a).
In the pre-training process, the method addresses the following two key problems. (1) To trade off the preference between classification and localization in the pretext task, the backbone network is frozen and a patch feature reconstruction branch is proposed that is jointly optimized with patch detection. (2) For multi-query patches, UP-DETR is first introduced with a single-query patch and then extended to multi-query patches with object query shuffle and an attention mask.

In summary, UP-DETR proposes a new unsupervised pretext task, random query patch detection, to pre-train the Transformer. The results show that UP-DETR has significantly better performance than DETR in object detection, panoptic segmentation, and one-shot detection, even on the PASCAL VOC dataset, where the training data is insufficient.
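To illustrate the random query patch detection pretext task described above, the sketch below crops random patches from an unlabeled image to produce pseudo ground-truth boxes, and a frozen backbone pools each patch into a feature that would condition the object queries. The function name, patch-size ranges, and the global-average-pooling choice are our own simplifications, not the UP-DETR implementation.

```python
import torch

def random_query_patches(image, feature_extractor, num_patches=4, min_size=0.1, max_size=0.4):
    """Crop random patches from an unlabeled image for the pretext task.

    Returns (patch_features, pseudo_boxes): the features act as query conditions,
    and the normalized boxes (cx, cy, w, h) are the pretext-task targets.
    """
    _, H, W = image.shape
    boxes, feats = [], []
    for _ in range(num_patches):
        w = int(W * torch.empty(1).uniform_(min_size, max_size))
        h = int(H * torch.empty(1).uniform_(min_size, max_size))
        x0 = int(torch.randint(0, W - w, (1,)))
        y0 = int(torch.randint(0, H - h, (1,)))
        patch = image[:, y0:y0 + h, x0:x0 + w].unsqueeze(0)
        # A frozen backbone plus global average pooling summarizes the patch.
        with torch.no_grad():
            f = feature_extractor(patch).mean(dim=(2, 3)).squeeze(0)
        feats.append(f)
        boxes.append(torch.tensor([(x0 + w / 2) / W, (y0 + h / 2) / H, w / W, h / H]))
    return torch.stack(feats), torch.stack(boxes)

# Illustrative usage with a tiny convolution standing in for the real backbone.
backbone = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1)
feats, boxes = random_query_patches(torch.rand(3, 256, 256), backbone)
print(feats.shape, boxes.shape)   # torch.Size([4, 8]) torch.Size([4, 4])
```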
3.2.3. YOLOS

Inspired by the fact that a pre-trained Transformer can be fine-tuned on token-level tasks (Rajpurkar et al., 2016; Sang and De Meulder, 2003), Fang et al. (2021) proposed YOLOS, a pure sequence-to-sequence Transformer built on DETR (Carion et al., 2020) and ViT (Dosovitskiy et al., 2021). It replaces the class token of the original ViT with detection tokens and replaces the image classification loss with the bipartite matching loss of DETR in the training phase, which allows object detection by set prediction. YOLOS demonstrates the generality and transferability of the pre-trained Transformer from image classification to the downstream object detection task: the model is pre-trained on the classification task and then transferred to the detection task for fine-tuning. Experiments demonstrate that YOLOS-Base, pre-trained only on the medium-sized ImageNet dataset, can achieve 42.0 box AP.

3.2.4. Deformable DETR

Inspired by deformable convolution (Dai et al., 2017), Zhu et al. (2021) proposed Deformable DETR. This method combines the advantages of the sparse spatial sampling of deformable convolution with the relational modeling capability of the Transformer. The Deformable Attention Module (DEM) is introduced to accelerate convergence and fuse multi-scale features to improve accuracy. Moreover, the authors introduce multi-scale features from FPN (Lin et al., 2016) and then propose Multi-Scale Deformable Attention (MSDA) to replace the Transformer attention module for processing feature maps, as shown in Fig. 8. Let \{x^l\}_{l=1}^{L} be the input multi-scale feature maps, where x^l \in \mathbb{R}^{C \times H_l \times W_l}. Let \hat{p}_q \in [0, 1]^2 be the normalized coordinates of the reference point of each query element q, and then compute Multi-Scale Deformable Attention as

\mathrm{DeformAttn}\left(z_q, p_q, x\right) = \sum_{m=1}^{M} W_m \left[ \sum_{k=1}^{K} A_{mqk} \cdot W'_m\, x\!\left(p_q + \Delta p_{mqk}\right) \right], \tag{14}

where m is the index of the attention head, l is the index of the input feature level, and k is the index of the sampling points. \Delta p_{mlqk} and A_{mlqk} denote, respectively, the sampling offset and attention weight of the kth sampling point in the lth feature level and the mth attention head. The scalar attention weights A_{mlqk} are normalized so that \sum_{l=1}^{L}\sum_{k=1}^{K} A_{mlqk} = 1. The normalized coordinates (0, 0) and (1, 1) of \hat{p}_q \in [0, 1]^2 denote the upper-left and lower-right corners of the image, respectively. The function \phi_l(\hat{p}_q) in Eq. (14) rescales the normalized coordinates \hat{p}_q to the input feature map of the lth level. The computational complexity of MSDA is O(2N_qC^2 + \min(HWC^2, N_qKC^2)). Compared to the original DETR, Deformable DETR requires less than one-tenth of the training epochs to achieve better performance (especially on small objects).

3.2.5. Conditional DETR

Meng et al. (2021) proposed Conditional DETR. They ran visualization experiments on the operation of DETR and concluded that cross-attention in DETR is highly dependent on content embeddings to locate the four vertices and predict the bounding box, which increases the training difficulty. So they improved the cross-attention of DETR by concatenating the content query c_q and spatial query p_q, and the key by concatenating the content key c_k and spatial key p_k. The inner product of query and key then gives the following result:

c_q^{\top} c_k + p_q^{\top} p_k, \tag{15}

This separates the functions of the content query and the spatial query so that they focus on the content and spatial weights, respectively. As shown in Fig. 9, the improved decoder layer consists of three main modules: (1) the self-attention layer, which takes the output of the previous decoder layer and is used to remove duplicate predictions as well as predict categories and bounding boxes; (2) the cross-attention layer, which uses the embedding output of the encoder to complete the embedding of the decoder; (3) the feed-forward network layer (FFN).

The core of the conditional cross-attention mechanism is to learn a conditional spatial query from the decoder embedding and reference points, which can explicitly find the boundary regions of the object, thus narrowing down the search region, helping to locate the object, and alleviating the over-reliance on the quality of content embeddings in DETR training. These refinements improve the convergence speed of DETR by 8x and the box mAP on the COCO dataset by 1.8%.

3.2.6. Efficient DETR

Yao et al. analyzed the mechanisms of DETR and Deformable DETR and found that their common feature is a cascade structure of six stacked decoders, which is used to iteratively update the object queries. The reference point proposed by Deformable DETR visualizes the object query and alleviates the problem that the object query is difficult to analyze directly. However, different initialization methods of reference points have a great impact on decoder performance. In order to investigate a more efficient way to initialize the object container, Yao et al. proposed Efficient DETR, a two-stage object detector that consists of dense prediction and sparse set prediction, and these two parts share the same detection head.

The model generates region proposals using dense detection before initializing the object container, and then uses the highest-scoring 4-dimensional proposals and their 256-dimensional encoder features as the initialization values of the object container, which results in better performance and fast convergence. The experimental results show that Efficient DETR combines the features of dense detection and set prediction, and can converge quickly while achieving high performance. The model achieved the SOTA performance of its time on the COCO dataset with only one encoder layer and three decoder layers, while the number of training epochs is reduced by 14x.

3.2.7. SMCA

To strengthen the relationship between the visual region of common interest for each object query and the bounding box to be predicted by that query, Gao et al. (2021) introduced spatial priors and multi-scale features and proposed Spatially Modulated Co-Attention (SMCA), which replaces the cross-attention in the original decoder while keeping the other components unchanged.

The decoder of SMCA has multiple cross-attention heads, each of which estimates the object center and scale from a slightly different location, resulting in a series of different spatial weight maps. These weight maps are used to spatially modulate the co-attention features, which improves detection performance. Based on these improvements, SMCA achieves 43.7 mAP in 50 epochs and 45.6 mAP in 108 epochs on the COCO dataset.
Fig. 8. The architecture of Deformable DETR. Its attention module focuses on only a small number of key sampling points around the reference point, and assigns a fixed and
small number of keys to each object query, thus alleviating the problems of slow convergence and low feature resolution.
Source: Image from Zhu et al. (2021).
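Below is a simplified, single-scale, single-head sketch of the deformable attention sampling described in Section 3.2.4 (Eq. (14)) and in the Fig. 8 caption: each query predicts K sampling offsets and attention weights, the value map is sampled at the offset locations by bilinear interpolation, and the weighted samples are aggregated. The tensor shapes, offset scaling, and use of grid_sample are our simplifications of the actual multi-scale, multi-head module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformAttn(nn.Module):
    """Single-scale, single-head deformable attention (didactic sketch of Eq. (14))."""
    def __init__(self, dim=256, num_points=4):
        super().__init__()
        self.num_points = num_points
        self.offsets = nn.Linear(dim, num_points * 2)   # delta p_qk for each sampling point
        self.weights = nn.Linear(dim, num_points)       # A_qk (softmax-normalized)
        self.value_proj = nn.Linear(dim, dim)           # W' in Eq. (14)
        self.out_proj = nn.Linear(dim, dim)             # W  in Eq. (14)

    def forward(self, queries, ref_points, value_map):
        # queries: (B, Nq, C); ref_points: (B, Nq, 2) in [0, 1]; value_map: (B, C, H, W)
        B, Nq, C = queries.shape
        value = self.value_proj(value_map.flatten(2).transpose(1, 2))   # (B, H*W, C)
        value = value.transpose(1, 2).reshape(B, C, *value_map.shape[2:])
        offsets = self.offsets(queries).view(B, Nq, self.num_points, 2)
        attn = self.weights(queries).softmax(-1)                        # (B, Nq, K)
        # Small offsets around the reference point, mapped to [-1, 1] for grid_sample.
        loc = (ref_points.unsqueeze(2) + 0.05 * offsets.tanh()).clamp(0, 1) * 2 - 1
        sampled = F.grid_sample(value, loc, align_corners=False)        # (B, C, Nq, K)
        out = (sampled * attn.unsqueeze(1)).sum(-1).transpose(1, 2)     # (B, Nq, C)
        return self.out_proj(out)

attn = SimpleDeformAttn()
out = attn(torch.randn(2, 100, 256), torch.rand(2, 100, 2), torch.randn(2, 256, 32, 32))
print(out.shape)   # torch.Size([2, 100, 256])
```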
3.2.8. ACT

Due to the slow convergence of DETR, Zheng et al. (2021) proposed the Adaptive Clustering Transformer (ACT) to address the high trial-and-error cost of improving DETR. ACT is a plug-and-play module that is fully compatible with the Transformer and can be ported to DETR without any training. Its core design is, first, to perform adaptive feature clustering to exploit the attention redundancy of the encoder (points with similar semantics and similar spatial locations produce similar attention maps), select representative prototypes, and broadcast feature updates to their nearest neighboring points based on Euclidean distance. Second, an adaptive clustering algorithm is designed for the encoder feature diversity problem (for different inputs, the feature distribution of each encoder layer is quite different), and a multi-round exact Euclidean locality-sensitive hash (E2LSH) is chosen for this algorithm to adaptively determine the number of prototypes. Thanks to these improvements, ACT can reduce the FLOPs of DETR from 73.4 GFLOPs to 58.2 GFLOPs (excluding backbone ResNet FLOPs) without additional training, while the loss of AP is only 0.7%. The AP loss can be further reduced to 0.2% by multi-task knowledge distillation. Given its excellent performance, exploring ACT training from scratch is also worth investigating.

Fig. 9. A decoder layer of Conditional DETR. The gray shaded box indicates that the conditional spatial query is predicted from the learnable 2D coordinates s and the embedding output of the previous decoder layer.
Source: Image from Meng et al. (2021).

3.2.9. TSP

Sun et al. (2021) concluded after extensive analysis that the cross-attention part of the decoder and the Hungarian loss of DETR are the main reasons for its slow convergence. So they proposed two encoder-only improved models of DETR, TSP-FCOS and TSP-RCNN, corresponding to one-stage and two-stage object detection methods, respectively. Both models can be viewed as feature pyramid (Lin et al., 2016) based. The model uses a Feature of Interest (FoI) selection mechanism that helps the encoder process multi-scale features. In addition, the model applies matching distillation to resolve the instability of bipartite graph matching. Experiments show that TSP achieves better results with reduced training cost, using only 36 epochs to reach the results of the original 500-epoch DETR training.

3.2.10. DINO

The Hungarian algorithm has been used in DETR (Carion et al., 2020) to match the decoder outputs with the Ground Truth. However, the discreteness of Hungarian matching and the randomness of model training make the matching process dynamic and unstable, which ultimately leads to the slow convergence of DETR. By deeply studying the iteration mechanism and optimization problems of the DETR model, Zhang et al. (2022a) proposed DINO (DETR with Improved deNoising anchor boxes) based on DN-DETR (Li et al., 2022), DAB-DETR (Liu et al., 2022a), and Deformable DETR (Zhu et al., 2021). The key design of DINO is that the training phase uses denoising training as a shortcut to learn the relative offsets of anchors: noise is first added near the Ground Truth boxes, and then the model directly reconstructs the true bounding boxes instead of relying on Hungarian matching for these queries, thus improving the stability of matching. Secondly, the model also uses a query-based dynamic anchor formulation to initialize the queries and corrects the parameters of adjacent earlier layers with the gradients of later layers. DINO breaks the dominance of classical-architecture detectors (SwinV2-G (Liu et al., 2021a), Florence (Yuan et al., 2021), DyHead (Dai et al., 2021b), etc.). DINO-Res50, which combines multi-scale features, achieves 48.3 AP and 51.0 AP on the COCO2017 dataset with 12-epoch and 36-epoch training schemes, respectively. Moreover, DINO-Swin-L even achieves the highest performance of 63.3 AP after training on a larger dataset.
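A toy sketch of the denoising idea behind DN-DETR and DINO follows: ground-truth boxes are perturbed with small center and size noise, and the decoder is trained to reconstruct the original boxes from these noised "anchors", bypassing the unstable Hungarian matching for that part of the loss because the correspondence is known. The noise scale and the plain L1 reconstruction loss here are illustrative choices, not the papers' exact settings.

```python
import torch

def make_denoising_queries(gt_boxes, box_noise_scale=0.4):
    """Perturb ground-truth boxes (cx, cy, w, h in [0, 1]) to build denoising queries."""
    noise = (torch.rand_like(gt_boxes) * 2 - 1) * box_noise_scale   # in [-scale, scale]
    noised = gt_boxes.clone()
    noised[:, :2] += noise[:, :2] * gt_boxes[:, 2:]   # shift centers relative to box size
    noised[:, 2:] *= (1 + noise[:, 2:])               # jitter width and height
    return noised.clamp(1e-4, 1.0)

def denoising_loss(decoder, noised_boxes, gt_boxes):
    """Reconstruction loss: the decoder should map noised boxes back to the originals."""
    refined = decoder(noised_boxes)                   # predicted box refinements
    return torch.nn.functional.l1_loss(refined, gt_boxes)

# Illustrative usage with a tiny MLP standing in for the real decoder.
gt = torch.tensor([[0.5, 0.5, 0.2, 0.3], [0.3, 0.7, 0.1, 0.1]])
dn_queries = make_denoising_queries(gt)
toy_decoder = torch.nn.Sequential(torch.nn.Linear(4, 64), torch.nn.ReLU(), torch.nn.Linear(64, 4))
print(denoising_loss(toy_decoder, dn_queries, gt))
```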
Fig. 10. The architecture of Pyramid Vision Transformer, the whole model is divided into 4 stages to generate feature maps at different scales. Each stage consists of a patch
embedding layer, 𝐿𝑖 -layer, and reshape operation.
Source: Image from Wang et al. (2021a).
3.3. Transformer backbone

Other efforts such as ViT (Dosovitskiy et al., 2021) have used the Transformer for image classification and achieved comparable results. However, there are some limitations in other, more complex CV tasks. These challenges in transferring the high performance of the Transformer in NLP to CV can be explained by the differences between the two domains.

1. The object entities in CV tasks often have dramatic scale variation.
2. Compared to text, the matrix nature of images means that an image contains at least hundreds of pixels to express information. In particular, the very long sequences unfolded from high-resolution images are difficult for the Transformer to model.
3. Many CV tasks such as semantic segmentation require pixel-level dense prediction, and the computational complexity of the self-attention mechanism in ViT increases quadratically with image size, which leads to unacceptable computational overhead.
4. In existing Transformer-based models, tokens are fixed in scale and their design is not adapted to CV tasks.

To address the above challenges, many Transformer-based backbones have been proposed for CV tasks, combined with techniques such as multi-scale features to compensate for the shortcoming that ViT can only detect at low resolution. These methods can replace the backbone of mainstream object detection models; in the benchmark in Table 5 we list the performance of Mask R-CNN (He et al., 2017) and RetinaNet (Ross and Dollár, 2017) after replacing the backbone, and we review the classical models in this subsection.

3.3.1. PVT & PVTv2

The feature maps output by ViT (Dosovitskiy et al., 2021) are difficult to apply to dense prediction due to their single scale and low resolution. Wang et al. (2021a) proposed the Pyramid Vision Transformer (PVT) by incorporating multi-scale features into the Transformer. PVT can be used as a backbone for various dense detection tasks; in particular, it can replace the CNN backbone of DETR-like models or be combined into a pure Transformer model without manual components such as NMS.

Benefiting from the progressive shrinking pyramid structure in PVT, the Transformer sequence length decreases as the network gets deeper. Meanwhile, in order to further reduce the computation caused by fine-grained division of images, they propose spatial-reduction attention (SRA) to reduce the cost of learning high-resolution feature maps (as shown in Fig. 10).

Compared with CNN methods based on the feature pyramid structure, PVT not only generates multi-scale feature maps to detect objects of different sizes but also fuses global information through the self-attention mechanism. PVTv2 (Wang et al., 2022), proposed subsequently by the same team, improves PVT by adding a linear-complexity attention layer, overlapping patch embedding, and a convolutional feed-forward network to improve the performance of PVT as a backbone. On the COCO dataset, both achieved competitive results at the time.

3.3.2. Swin Transformer

Liu et al. (2021b) proposed the Swin Transformer, which creatively uses a hierarchical design to make the Transformer available as a backbone for most CV tasks, rather than just a detection head. As shown in Fig. 11, unlike other Transformer models, the Swin Transformer builds feature maps with a hierarchical representation, similar to the feature pyramid structure in CNNs. As the network level deepens, the receptive field expands, enabling the extraction of multi-scale features of the image. Secondly, the Swin Transformer divides the feature map into multiple windows, and each non-overlapping window performs local multi-head attention computation without correspondence between windows, which greatly reduces the computation and makes it linear in the image size, as shown in Eq. (16). In contrast, ViT produces a single low-resolution feature map and computes global attention, whose cost grows quadratically with the image size.
Fig. 12. (a) Swin Transformer (Swin-T) (b) Swin Transformer Block.
Source: Image from Liu et al. (2021b).
Fig. 13. The shifted window approach computes self-attention across the window boundaries of the previous layer.
Source: Image from Liu et al. (2021b).

Fig. 14. An illustration of circular shift.
Source: Image from Liu et al. (2021b).

Fig. 15. Comparison of the attention modules of Swin Transformer V1 and V2.
Source: Image from Liu et al. (2021a).

CSWin Transformer (Dong et al., 2022) serves as a general-purpose backbone for vision tasks. It integrates the Cross-Shaped Window self-attention mechanism, varies stripe widths based on network depth, and introduces a novel Locally-enhanced Positional Encoding (LePE) scheme to handle local positional information, resulting in competitive performance across standard vision tasks.

3.4. Analysis and discussion for detectors

This section provides a succinct review of conventional Transformer-based object detectors, offering a detailed performance comparison in Tables 4 and 5. Each method was evaluated using an NVIDIA A100 GPU and adhered to the DETR training protocol. The AdamW optimizer (Loshchilov and Hutter, 2017) was uniformly employed across all methods, with the initial learning rate for the transformer set to 10^-4, the backbone's to 10^-5, and weight decay at 10^-4. The transformer weights were initialized with Xavier init (Glorot and Bengio, 2010), while the backbone leveraged the ImageNet-pretrained ResNet model from torchvision, with frozen batch normalization layers.

Transformer Neck-based models treat object detection as a straightforward set prediction, removing manual components (such as anchor sets and NMS) that cannot be optimized, thus enabling end-to-end detection. Starting from the original DETR, with its slow convergence and poor detection of small objects, subsequent researchers have proposed optimization strategies from different perspectives.

1. To address the problem of slow convergence, researchers often start by improving the attention mechanism. Deformable DETR (Zhu et al., 2021) accelerates convergence by 12x with the Deformable Attention Module. Conditional DETR improves the cross-attention of DETR and achieves 8x faster convergence; meanwhile, the box mAP on the COCO dataset is improved by 1.8%. Unlike the above methods, ACT (Zheng et al., 2021) proposes a plug-and-play module for adaptive clustering, which reduces the GFLOPs of DETR by 15.2 without additional training, while the AP loss is only 0.7%. Sparse DETR achieves higher performance and the same detection speed (FPS) as Faster R-CNN by reducing the GFLOPs by 75.

2. For the problem of poor detection of small objects, multi-scale features are currently the main focus. Methods such as SMCA (Gao et al., 2021) (as shown in Table 4) introduce multi-scale features through different operations and significantly improve the accuracy of the detector. Moreover, DINO (Zhang et al., 2022a) reaches 63.3 AP, surpassing all classical object detection methods.

Presently, most Transformer Backbones are primarily active in image classification, with only a few researchers transplanting them into traditional object detectors for dense prediction; these have then achieved state-of-the-art (SOTA) performance. Compared to CNN-based backbones, Transformer-based backbones can integrate global contextual information while outputting multi-scale feature maps, thereby enhancing feature extraction. Although Transformers have challenged CNN's dominance in object detection, recent advancements such as FAIR's redesign of ConvNet (Liu et al., 2022b), which draws on the strengths of the Transformer structure, underscore the continued potential of CNNs. In the future, CNNs and visual Transformers are expected to continue improving by leveraging each other's strengths.
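To make the evaluation protocol described at the start of this subsection concrete, the following is a minimal PyTorch sketch of the optimizer configuration (AdamW, 10^-4 transformer learning rate, 10^-5 backbone learning rate, 10^-4 weight decay, Xavier-initialized transformer weights, frozen batch normalization in the pretrained backbone). The attribute names `model.backbone` and `model.transformer` are placeholders standing in for the corresponding parts of any benchmarked detector.

```python
import torch
import torch.nn as nn

def configure_training(model):
    """Optimizer and initialization settings matching the benchmark protocol in Section 3.4."""
    # Xavier init for the transformer weights.
    for p in model.transformer.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    # Freeze batch-norm statistics and parameters in the ImageNet-pretrained backbone
    # (real implementations typically swap BN for a dedicated frozen variant).
    for m in model.backbone.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.eval()
            for p in m.parameters():
                p.requires_grad_(False)
    param_groups = [
        {"params": [p for p in model.backbone.parameters() if p.requires_grad], "lr": 1e-5},
        {"params": model.transformer.parameters(), "lr": 1e-4},
    ]
    return torch.optim.AdamW(param_groups, weight_decay=1e-4)
```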
Table 5
The prediction results of RetinaNet and Mask R-CNN with a Transformer as backbone on the COCO2017 val set, where 3x schedule denotes 36 epochs, MS denotes multi-scale input, and the numbers before and after "/" denote the values for RetinaNet and Mask R-CNN, respectively.
Backbone #Params(M) FLOPs(G) RetinaNet 3x schedule + MS Mask R-CNN 3x schedule + MS
APb APb50 APb75 APS APM APL APb APb50 APb75 APm APm50 APm75
ResNet50 (He et al., 2015) 38/44 239/260 39 58.4 41.8 22.4 42.8 51.6 41 61.7 44.9 37.1 58.4 40.1
PVTv1-S (Wang et al., 2021a) 34/44 226/245 42.2 62.7 45.0 26.2 45.2 57.2 43.0 65.3 46.9 39.9 62.5 42.8
ViL-S (Zhang et al., 2021) 36/45 252/174 42.9 63.8 45.6 27.8 46.4 56.3 43.4 64.9 47.0 39.6 62.1 42.4
Swin-T (Liu et al., 2021b) 39/48 245/264 45.0 65.9 48.4 29.7 48.9 58.1 46.0 68.1 50.3 41.6 65.1 44.9
PVTv2-B2-Li (Wang et al., 2022) 32/42 -/- – – – – – – 46.8 68.7 51.4 42.3 65.7 45.4
Focal-T (Yang et al., 2021) 39/49 265/291 45.5 66.3 48.8 31.2 49.2 58.7 47.2 69.4 51.9 42.7 66.5 45.9
TwinsP-S (Chu et al., 2021) 34/44 -/245 45.2 66.5 48.6 30.0 48.8 58.9 46.8 69.3 51.8 42.6 66.3 46.0
Twins-S (Chu et al., 2021) 34/55 -/228 45.6 67.1 48.6 29.8 49.3 60.0 46.8 69.2 51.2 42.6 66.3 45.8
CSwin-T (Dong et al., 2022) -/42 -/279 – – – – – – 49.0 70.7 53.7 43.6 67.9 46.6
PVTv2-B2 (Wang et al., 2022) 35/45 -/- – – – – – – 47.8 69.7 52.6 43.1 66.8 46.7
ResNet101 (He et al., 2015) 57/63 315/336 40.9 60.1 44.0 23.7 45.0 53.8 42.8 63.2 47.1 38.5 60.1 41.3
ResNeXt101-32 × 4d (He et al., 2015) 56/63 319/340 41.4 61.0 44.3 23.9 45.5 53.7 44.0 64.4 48.0 39.2 61.4 41.9
PVTv1-M (Wang et al., 2021a) 54/64 283/302 43.2 63.8 46.1 27.3 46.3 58.9 44.2 66.0 48.2 40.5 63.1 43.5
ViL-M (Zhang et al., 2021) 51/60 339/261 43.7 64.6 46.4 27.9 47.1 56.9 44.6 66.3 48.5 40.7 63.8 43.7
TwinsP-B (Chu et al., 2021) 54/64 -/302 46.4 67.7 49.8 31.3 50.2 61.4 47.9 70.1 52.5 43.2 67.2 46.3
Twins-B (Chu et al., 2021) 67/76 -/340 46.9 68.0 50.2 31.7 50.3 61.8 48.0 69.5 52.7 43.0 66.8 46.6
Swin-S (Liu et al., 2021b) 60/69 335/354 46.4 67.0 50.1 31.0 50.1 60.3 48.5 70.2 53.5 43.3 67.3 46.6
Focal-S (Yang et al., 2021) 62/71 367/401 47.3 67.8 51.0 31.6 50.9 61.1 48.8 70.5 53.6 43.8 67.7 47.2
CSwin-S (Dong et al., 2022) -/54 -/342 – – – – – – 50.0 71.3 54.7 44.5 68.4 47.7
ResNeXt101-64 × 4d (He et al., 2015) 96/102 473/493 41.8 61.5 44.4 25.2 45.4 54.6 44.4 64.9 48.8 39.7 61.9 42.6
PVTv1-Large (Wang et al., 2021a) 71/81 345/364 43.4 63.6 46.1 26.1 46.0 59.5 44.5 66.0 48.3 40.7 63.4 43.7
ViL-Base (Zhang et al., 2021) 67/76 443/365 44.7 65.5 47.6 29.9 48.0 58.1 45.7 67.2 49.9 41.3 64.4 44.5
Swin-Base (Liu et al., 2021b) 98/107 477/496 45.8 66.4 49.1 29.9 49.4 60.3 48.5 69.8 53.2 43.4 66.8 46.9
Focal-Base (Yang et al., 2021) 101/110 514/533 46.9 67.8 50.3 31.9 50.3 61.5 49.0 70.1 53.6 43.7 67.6 47.0
CSWin-B (Dong et al., 2022) -/97 -/526 – – – – – – 50.8 72.1 55.8 44.9 69.1 48.3
4. Discussion

Although the Transformer model has made great progress (as shown in Table 6) and has shown excellent performance (Tables 4 and 5), it still faces some challenges, as well as limitations in practical applications. This section summarizes the innovative improvements of current methods, analyzes the problems encountered by Transformer detectors, and gives an outlook on future development prospects.

4.1. Challenges

High computational overhead. Typical properties of CNNs include inductive bias, which is expressed as translation invariance, weight sharing, and sparse connectivity (Dosovitskiy et al., 2021). These properties grant CNNs a robust local feature extraction capability and enable them to achieve high performance through the simple sliding matching of convolutional kernels. As a result, compared to Transformers, CNNs often exhibit competitive performance with lower computational overhead. However, current CNN architectures possess less potential than Transformers due to their weaker extraction of global features and contextual information. The self-attention mechanism in Transformers can also emulate convolutional layers, requiring only a sufficient number of heads to focus on each pixel within the convolutional receptive field and employing relative positional encoding to ensure translation invariance (Cordonnier et al., 2020). This full-attention operation can effectively integrate local and global attention while dynamically generating attention weights based on feature relationships. Nevertheless, Transformers face certain limitations in practical applications. One of the main challenges stems from their high computational complexity.

The expensive computational overhead restricts the application of Transformer-based detectors on mobile computing platforms. At present, most mobile detection platforms primarily rely on one-stage detectors (Zhao et al., 2019), while the trend for Transformer detectors leans towards offline high-precision detection. Additionally, Transformers require large amounts of data, and common solutions include data augmentation and self-supervised or semi-supervised learning approaches (He et al., 2021). Compared to state-of-the-art CNN-based approaches, their deployment on mobile platforms is constrained by higher computational complexity.

The impact of computational overhead on deploying Transformer-based object detection models in practical scenarios is influenced by factors such as the number of parameters, running time (FPS), and floating-point operations (FLOPs). However, the influence of these metrics varies depending on the application context and hardware environment. For example, in situations like autonomous driving or robotic navigation, FPS is a critical factor, as algorithms must process video streams at high frame rates to respond quickly to external changes. In the case of mobile devices and embedded systems, the number of parameters and FLOPs are more influential due to energy and memory constraints. Consequently, deploying algorithms on mobile platforms necessitates balancing performance, energy consumption, and memory usage. In cloud computing and high-performance hardware settings, computational overhead is not the most critical factor since computational resources are relatively abundant. In these scenarios, model performance and accuracy are paramount.

According to the data in Tables 4 and 5, modern Transformer-based models have outperformed classical two-stage object detection algorithms (e.g., Faster R-CNN) in terms of FPS and achieved improved accuracy, rendering them viable for practical applications. To ensure efficient deployment and application in real-world engineering scenarios, researchers typically optimize object detection algorithms for specific contexts, minimizing computational overhead and improving real-time performance and energy efficiency. This optimization may involve techniques such as model compression, knowledge distillation, and network architecture design.

Insufficient understanding of the visual Transformer. Compared to the well-established research and applications of CNNs, our current understanding of the underlying mechanisms behind visual Transformers is still limited. The Transformer architecture was originally designed for sequence processing tasks (Vaswani et al., 2017). Although Transformers have demonstrated strong performance when applied to computer vision tasks, there is relatively little explanation regarding their specific roles and functions in this context. Consequently, gaining a deeper understanding of the principles behind visual Transformers is crucial to facilitate more fundamental optimization improvements and enhance model interpretability. This deeper understanding could potentially involve investigating the attention mechanisms, hierarchical feature representation, and the interaction between different layers within visual Transformer models. By exploring these aspects, we can potentially uncover novel optimization strategies and improve the models' overall performance in various computer vision tasks.
Table 6
Summary of the advantages and limitations of Transformer-based object detection models.

Transformer Neck

DETR (Carion et al., 2020)
Highlights: (1) Proposed a Transformer-based end-to-end object detection framework; (2) Removed the hand-designed anchor set and non-maximal suppression (NMS).
Limitations: (1) Requires massive dataset training; (2) Convergence is very slow; (3) Poor performance for small objects.

SMCA (Gao et al., 2021)
Highlights: (1) Combining a learnable co-attention map and a manual spatial prior speeds up the convergence of DETR; (2) Incorporates a scale selection network in the decoder.
Limitations: (1) Good performance for large objects and poor performance for small objects; (2) High computational overhead.

Deformable DETR (Zhu et al., 2021)
Highlights: (1) Proposed a deformable attention mechanism, which pays more attention to local information and improves convergence speed; (2) Combined with multi-scale features; (3) Proposed reference-point visualization of the object query; (4) A two-stage Deformable DETR is also proposed.
Limitations: (1) Low accuracy for large objects; (2) Deformable attention brings unordered memory access; (3) High computational overhead.

Efficient DETR (Yao et al., 2021)
Highlights: (1) Found that different object container initialization methods have a great impact on the decoder; (2) Proposed an efficient way of initializing object containers using the characteristics of dense detection and sparse detection.
Limitations: (1) Poor performance for small objects; (2) High computational overhead.

DINO (Zhang et al., 2022a)
Highlights: (1) Proposes a contrastive denoising training method; (2) Combines the DETR-like and two-stage models and proposes a mixed query selection method to better initialize object queries; (3) Look Forward Twice: introduces proximity-layer information to update parameters and improve the detection of small objects.
Limitations: (1) High computational overhead at high scales; (2) Diminishing marginal benefit from stacking too many scales.

YOLOS (Fang et al., 2021)
Highlights: (1) Replaces the [cls] token with a [det] token and the image classification loss with the bipartite matching loss; (2) Proposes a pre-trained Transformer object detection paradigm.
Limitations: (1) Low detection accuracy; (2) High computational overhead.

UP-DETR (Dai et al., 2021a)
Highlights: (1) Proposes a new unsupervised pretext task to perform unsupervised pre-training of the Transformer; (2) Proposes a patch feature reconstruction branch that is jointly optimized with patch detection.
Limitations: (1) Slow convergence; (2) Poor performance for small objects.

Transformer Backbone

FPT (Zhang et al., 2020)
Highlights: (1) Proposes a feature interaction method across space and scale; (2) High compatibility.
Limitations: (1) Low detection accuracy; (2) High computational overhead.

PVT (Wang et al., 2021a)
Highlights: (1) Can output multi-scale high-resolution feature maps; (2) The proposed spatial-reduction attention module allows PVT to be successfully applied to dense prediction.
Limitations: (1) High computational overhead for high-resolution images; (2) Simple image division loses the connection information between different patches.

Swin Transformer (Liu et al., 2021b)
Highlights: (1) Hierarchical representation; (2) Introduces communication between windows by computing attention within shifted windows and reduces the computational complexity to be linear in the image size.
Limitations: (1) Excessive GPU memory consumption at higher image resolutions; (2) Difficult to retrain on small datasets; (3) Difficult to transfer pre-trained models at low resolutions to higher resolutions.
The inefficient image-sequence information transformation. Unlike images, human-created languages have a high semantic density. Each word in a sentence can be treated as high-dimensional semantic information embedded in a low-dimensional vector representation. Images, however, as a natural signal with high spatial redundancy, have a low information density per pixel. For example, He et al. (2021) performed random high-ratio masking of images and then reconstructed the images well with a decoder, demonstrating that semantic features of much higher density than raw pixel information can be captured from images. But the current way of representing image information as sequences in the Transformer is not efficient enough, which can bring about accuracy degradation as well as high computational overhead. Establishing efficient transformations between images and sequences can help unlock the potential of the Transformer for CV tasks.

4.2. Future development outlook

The visual Transformer has made great progress in recent years, especially in object detection, and its performance has surpassed SOTA CNN-based models on the COCO dataset. However, the Transformer is not yet mature enough for practical application deployment. For example, the computational overhead is too large for deployment on platforms with limited computing resources, and the real-time performance is not as good as CNN-based one-stage approaches.

Self-supervised learning. While self-supervised learning has achieved great success in natural language processing, current object detection models, which mainly rely on supervised learning, require large amounts of labeled data.
Han, K., Wang, Y., Chen, H., Chen, X., Guo, J., Liu, Z., Tang, Y., Xiao, A., Xu, C., Xu, Y., Yang, Z., Zhang, Y., Tao, D., 2022. A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence. p. 1. http://dx.doi.org/10.1109/TPAMI.2022.3152247.
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R., 2021. Masked autoencoders are scalable vision learners.
He, K., Gkioxari, G., Dollár, P., Girshick, R., 2017. Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2961–2969.
He, K., Zhang, X., Ren, S., Sun, J., 2015. Deep residual learning for image recognition. arXiv:1512.03385 [cs].
Khan, S., Naseer, M., Hayat, M., Zamir, S.W., Khan, F.S., Shah, M., 2021. Transformers in vision: A survey. arXiv:2101.01169.
Kobatake, H., Yoshinaga, Y., 1996. Detection of spicules on mammogram based on skeleton analysis. IEEE Trans. Med. Imaging 15 (3), 235–245. http://dx.doi.org/10.1109/42.500062.
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R., 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
Li, F., Zhang, H., Liu, S., Guo, J., Ni, L.M., Zhang, L., 2022. DN-DETR: Accelerate DETR training by introducing query denoising. arXiv:2203.01305 [cs].
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S., 2016. Feature pyramid networks for object detection. arXiv:1612.03144.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L., 2014. Microsoft COCO: Common objects in context. In: European Conference on Computer Vision. Springer, pp. 740–755.
Lin, J., Mao, X., Chen, Y., Xu, L., He, Y., Xue, H., 2022. D2ETR: Decoder-only DETR with computationally efficient cross-scale attention. arXiv:2203.00860 [cs].
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., Berg, A.C., 2016. SSD: Single shot multibox detector. In: Computer Vision – ECCV 2016. pp. 21–37. http://dx.doi.org/10.1007/978-3-319-46448-0_2.
Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., Wei, F., Guo, B., 2021a. Swin Transformer V2: Scaling up capacity and resolution. arXiv:2111.09883 [cs].
Liu, S., Li, F., Zhang, H., Yang, X., Qi, X., Su, H., Zhu, J., Zhang, L., 2022a. DAB-DETR: Dynamic anchor boxes are better queries for DETR. arXiv:2201.12329 [cs].
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B., 2021b. Swin Transformer: Hierarchical vision transformer using shifted windows. arXiv:2103.14030 [cs].
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S., 2022b. A ConvNet for the 2020s. arXiv:2201.03545 [cs].
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V., 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692 [cs].
Liu, Y., Zhang, Y., Wang, Y., Hou, F., Yuan, J., Tian, J., Zhang, Y., Shi, Z., Fan, J., He, Z., 2021c. A survey of visual transformers. arXiv:2111.06091 [cs].
Loshchilov, I., Hutter, F., 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
Meng, D., Chen, X., Fan, Z., Zeng, G., Li, H., Yuan, Y., Sun, L., Wang, J., 2021. Conditional DETR for fast training convergence. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3651–3660.
Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., 2018. Improving language understanding by generative pre-training.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., 2019. Language models are unsupervised multitask learners. OpenAI Blog 1 (8), 9.
Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv:1606.05250 [cs].
Redmon, J., Divvala, S., Girshick, R., Farhadi, A., 2016. You only look once: Unified, real-time object detection. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Las Vegas, NV, USA, pp. 779–788.
Redmon, J., Farhadi, A., 2017. YOLO9000: Better, faster, stronger. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6517–6525.
Redmon, J., Farhadi, A., 2018. YOLOv3: An incremental improvement.
Ren, S., He, K., Girshick, R., Sun, J., 2016. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv:1506.01497 [cs].
Roh, B., Shin, J., Shin, W., Kim, S., 2022. Sparse DETR: Efficient end-to-end object detection with learnable sparsity. arXiv:2111.14330 [cs].
Ross, T.-Y., Dollár, G., 2017. Focal loss for dense object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2980–2988.
Sang, E.F.T.K., De Meulder, F., 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv:cs/0306050.
Sun, Z., Cao, S., Yang, Y., Kitani, K.M., 2021. Rethinking transformer-based set prediction for object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3611–3620.
Sung, K.-K., Poggio, T., 1998. Example-based learning for view-based human face detection. IEEE Trans. Pattern Anal. Mach. Intell. 20 (1), 39–51. http://dx.doi.org/10.1109/34.655648.
Sutskever, I., Vinyals, O., Le, Q.V., 2014. Sequence to sequence learning with neural networks.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need. In: Advances in Neural Information Processing Systems, Vol. 30. Curran Associates, Inc.
Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., Lu, T., Luo, P., Shao, L., 2021a. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. arXiv:2102.12122 [cs].
Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., Lu, T., Luo, P., Shao, L., 2022. PVTv2: Improved baselines with pyramid vision transformer. arXiv:2106.13797 [cs].
Wang, T., Yuan, L., Chen, Y., Feng, J., Yan, S., 2021b. PnP-DETR: Towards efficient visual analysis with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4661–4670.
Wang, Y., Zhang, X., Yang, T., Sun, J., 2021c. Anchor DETR: Query design for transformer-based object detection.
Yang, J., Li, C., Zhang, P., Dai, X., Xiao, B., Yuan, L., Gao, J., 2021. Focal self-attention for local-global interactions in vision transformers. arXiv:2107.00641 [cs].
Yao, Z., Ai, J., Li, B., Zhang, C., 2021. Efficient DETR: Improving end-to-end object detector with dense prior. arXiv:2104.01318 [cs].
Yuan, L., Chen, D., Chen, Y.-L., Codella, N., Dai, X., Gao, J., Hu, H., Huang, X., Li, B., Li, C., Liu, C., Liu, M., Liu, Z., Lu, Y., Shi, Y., Wang, L., Wang, J., Xiao, B., Xiao, Z., Yang, J., Zeng, M., Zhou, L., Zhang, P., 2021. Florence: A new foundation model for computer vision. arXiv:2111.11432 [cs].
Zhang, P., Dai, X., Yang, J., Xiao, B., Yuan, L., Zhang, L., Gao, J., 2021. Multi-scale vision longformer: A new vision transformer for high-resolution image encoding. arXiv preprint arXiv:2103.15358.
Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L.M., Shum, H.-Y., 2022a. DINO: DETR with improved denoising anchor boxes for end-to-end object detection. arXiv:2203.03605 [cs].
Zhang, G., Luo, Z., Yu, Y., Cui, K., Lu, S., 2022b. Accelerating DETR convergence via semantic-aligned matching. arXiv:2203.06883 [cs].
Zhang, D., Zhang, H., Tang, J., Wang, M., Hua, X., Sun, Q., 2020. Feature pyramid transformer. In: European Conference on Computer Vision. Springer, pp. 323–339.
Zhao, Z.-Q., Zheng, P., Xu, S.-T., Wu, X., 2019. Object detection with deep learning: A review. IEEE Trans. Neural Netw. Learn. Syst. 30 (11), 3212–3232. http://dx.doi.org/10.1109/TNNLS.2018.2876865.
Zheng, M., Gao, P., Zhang, R., Li, K., Wang, X., Li, H., Dong, H., 2021. End-to-end object detection with adaptive clustering transformer. arXiv:2011.09315 [cs].
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J., 2021. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv:2010.04159 [cs].