

SuperYOLO: Super Resolution Assisted Object Detection in Multimodal Remote Sensing Imagery

Jiaqing Zhang, Jie Lei, Member, IEEE, Weiying Xie, Member, IEEE, Zhenman Fang, Member, IEEE, Yunsong Li, Member, IEEE, and Qian Du, Fellow, IEEE

Abstract—This is the pre-acceptance version; to read the final version please go to IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING on IEEE Xplore. Accurately and timely detecting multiscale small objects that contain tens of pixels from remote sensing images (RSI) remains challenging. Most of the existing solutions primarily design complex deep neural networks to learn strong feature representations for objects separated from the background, which often results in a heavy computation burden. In this paper, we propose an accurate yet fast object detection method for RSI, named SuperYOLO, which fuses multimodal data and performs high resolution (HR) object detection on multiscale objects by utilizing assisted super resolution (SR) learning, considering both the detection accuracy and the computation cost. First, we utilize a symmetric compact multimodal fusion (MF) to extract supplementary information from various data for improving small object detection in RSI. Furthermore, we design a simple and flexible SR branch to learn HR feature representations that can discriminate small objects from vast backgrounds with low-resolution (LR) input, thus further improving the detection accuracy. Moreover, to avoid introducing additional computation, the SR branch is discarded in the inference stage and the computation of the network model is reduced due to the LR input. Experimental results show that, on the widely used VEDAI RS dataset, SuperYOLO achieves an accuracy of 75.09% (in terms of mAP50), which is more than 10% higher than SOTA large models such as YOLOv5l, YOLOv5x, and the RS-designed YOLOrs. Meanwhile, the parameter size and GFLOPs of SuperYOLO are about 18x and 3.8x less than those of YOLOv5x. Our proposed model shows a favorable accuracy-speed trade-off compared to the state-of-the-art models. The code will be open sourced at https://github.com/icey-zhang/SuperYOLO.

Index Terms—Object detection, multimodal remote sensing image, super resolution, feature fusion.

This work was supported in part by the National Natural Science Foundation of China under Grant 62071360. (Corresponding author: Jie Lei)
Jiaqing Zhang, Jie Lei, Weiying Xie, and Yunsong Li are with the State Key Laboratory of Integrated Services Networks, Xidian University, Xi'an 710071, China (e-mail: jqzhang_2@stu.xidian.edu.cn; jielei@mail.xidian.edu.cn; wyxie@xidian.edu.cn; ysli@mail.xidian.edu.cn).
Zhenman Fang is with the School of Engineering Science, Simon Fraser University, Burnaby, BC, Canada (e-mail: zhenman@sfu.ca).
Qian Du is with the Department of Electrical and Computer Engineering, Mississippi State University, Starkville, MS 39759 USA (e-mail: du@ece.msstate.edu).

I. INTRODUCTION

Object detection plays an important role in various fields involving computer-aided diagnosis or autonomous piloting. Over the past decades, numerous excellent deep neural network (DNN) based object detection frameworks [1], [2], [3], [4], [5] have been proposed, updated, and optimized in computer vision. The remarkable accuracy enhancement of DNN-based object detection frameworks owes to the application of large-scale natural datasets with accurate annotations [6], [7], [8].

Compared with natural scenarios, there are several vital challenges for accurate object detection in remote sensing images (RSIs). First, the number of labeled samples is relatively small, which limits the training of DNNs to achieve high detection accuracy. Second, the size of objects in RSI is much smaller, accounting for merely tens of pixels in relation to the complicated and broad backgrounds [9], [10]. Moreover, the scale of those objects is diverse with multiple categories [11]. As shown in Fig. 1 (a), the object car is considerably small within a vast area. As shown in Fig. 1 (b), the objects have large-scale variations, to which the scale of a car is smaller than that of a camping vehicle.

Currently, most object detection techniques are solely designed and applied for a single modality such as RGB and infrared (IR) [12], [13]. Consequently, the capability to recognize objects on the earth's surface remains insufficient due to the deficiency of complementary information between different modalities [14]. As imaging technology flourishes, RSIs collected from multiple modalities become available and provide an opportunity to improve detection accuracy. For example, as shown in Fig. 1, the fusion of two different modalities (RGB and IR) can effectively enhance the detection accuracy in RSI. Sometimes the resolution of one modality is low, which requires techniques to improve the resolution and thus enhance the information. Recently, super resolution technology has shown great potential in remote sensing fields [15], [16], [17], [18]. Benefiting from the vigorous development of convolutional neural networks (CNNs), the resolution of remote sensing images has been enhanced so that richer texture information can be interpreted. However, due to the high computation cost of CNNs, the application of SR networks in real-time practical tasks has become a hot topic in current research.

In this study, our motivation is to propose an on-board real-time object detection framework for multimodal RSIs to achieve high detection accuracy and high inference speed without introducing additional computation overhead. Inspired by recent advances in real-time compact neural network models, we choose the small-size YOLOv5s [19] structure as our detection baseline. It can reduce deployment costs and facilitate rapid deployment of the model. Considering the high resolution (HR) retention requirements for small objects, we remove the Focus module in the baseline YOLOv5s model, which not only benefits defining the location of small dense objects but also enhances the detection performance.

Fig. 1. Visual comparison of RGB image, IR image, and ground truth (GT). The IR image provides vital complementary information for resolving the
challenges in RGB detection. The object car in (a) is considerably small within a vast area. In (b), the objects have large-scale variation, to which the scale
of a car is smaller than that of a camping vehicle. The fusion of RGB and IR modalities effectively enhances detection performance.

Considering the complementary characteristics of different modalities, we propose a multimodal fusion (MF) scheme to improve the detection performance for RSI. We evaluate different fusion alternatives (pixel-level or feature-level) and choose pixel-level fusion for its low computation cost.

Lastly and most importantly, we develop a super resolution (SR) assurance module to guide the network to generate HR features that are capable of identifying small objects in vast backgrounds, thereby reducing false alarms induced by background-contaminated objects in RSI. Nevertheless, a naive SR solution can significantly increase the computation cost. Therefore, we engage the auxiliary SR branch only in the training process and remove it in the inference stage, facilitating spatial information extraction in HR without increasing the computation cost.

In summary, this paper makes the following contributions.
• We propose a computation-friendly pixel-level fusion method to combine inner information bi-directionally in a symmetric and compact manner. It efficiently decreases the computation cost without sacrificing accuracy compared with feature-level fusion.
• We introduce an assisted SR branch into multimodal object detection for the first time. Our approach not only makes a breakthrough in limited detection performance but also paves a more flexible way to study outstanding HR feature representations that are capable of discriminating small objects from vast backgrounds with LR input.
• Considering the demand for high-quality results and low computation cost, the SR module functioning as an auxiliary task is removed during the inference stage without introducing additional computation. The SR branch is general and extensible and can be inserted into existing fully convolutional network (FCN) frameworks.
• The proposed SuperYOLO markedly improves the performance of object detection, outperforming SOTA detectors in real-time multimodal object detection. Our proposed model shows a favorable accuracy-speed trade-off compared to the state-of-the-art models.

II. RELATED WORK

A. Object Detection with Multimodal Data

Recently, multimodal data has been widely leveraged in numerous practical application scenarios, including visual question answering [20], auto-pilot vehicles [21], saliency detection [22], and remote sensing classification [23]. It has been found that combining the internal information of multimodal data can efficiently transfer complementary features and avoid certain information of a single modality from being omitted. In the field of RSI processing, there exist various modalities (e.g., Red-Green-Blue (RGB), Synthetic Aperture Radar (SAR), Light Detection and Ranging (LiDAR), Infrared (IR), panchromatic (PAN), and multispectral (MS) images) from diverse sensors, which can be fused with complementary characteristics to enhance the performance of various tasks [24], [25], [26]. For example, the additional IR modality [27] captures longer thermal wavelengths to improve detection under difficult weather conditions. Manish et al. [27] proposed a real-time framework for object detection in multimodal remote sensing imaging, in which the extended version conducted mid-level fusion and merged data from multiple modalities. Although multi-sensor fusion can enhance the detection performance as shown in Fig. 1, its limited detection accuracy and to-be-improved computing speed can hardly meet the requirements of real-time detection tasks.


Fig. 2. The overview of the proposed SuperYOLO framework. Our new contributions include 1) removal of the Focus module to reserve high resolution, 2)
multimodal fusion, and 3) assisted SR branch. The architecture is optimized in terms of Mean Square Error (MSE) loss for the SR branch and task-specific loss
for object detection. During the training stage, the SR branch guides the related learning of the spatial dimension to enhance the high resolution information
preservation for the backbone. During the test stage, the SR branch is removed to accelerate the inference speed equal to the baseline.

The fusion methods are primarily grouped into three strategies, i.e., pixel-level fusion, feature-level fusion, and decision-level fusion [28]. The decision-level fusion methods fuse the detection results during the last stage, which may consume enormous computation resources due to repeated calculations for the different multimodal branches. In the field of remote sensing, feature-level fusion methods are mainly adopted with multiple branches. The multimodal images are input into parallel branches to extract the respective independent features of the different modalities, and then these features are combined by some operations, such as an attention module or simple concatenation. The parallel branches bring repeated computation as the number of modalities increases, which is not friendly to real-time tasks in remote sensing.

In contrast, the adoption of pixel-level fusion methods can reduce unnecessary computation. In this paper, our proposed SuperYOLO fuses the modalities at the pixel level to significantly reduce the computation cost and designs operations in the spatial and channel domains to extract inner information in the different modalities, which can help enhance the detection accuracy.

B. Super Resolution in Object Detection

In recent literature, the performance of small object detection has been improved by multi-scale feature learning [29], [30] and context-based detection [31]. These methods enhance the information representation ability of the network at different scales but ignore the preservation of high-resolution contextual information. Conducted as a pre-processing step, SR has proven to be effective and efficient in various object detection tasks [32], [33]. Shermeyer et al. [34] quantified its effect on the detection performance of satellite imaging over multiple resolutions of RSI. Based on generative adversarial networks (GANs), Courtrai et al. [35] utilized SR to generate HR images, which were fed into the detector to improve its detection performance. Rabbi et al. [36] leveraged a Laplacian operator to extract edges from the input image to enhance the capability of reconstructing HR images, thus improving performance in object localization and classification. Hong et al. [37] introduced a cycle-consistent GAN structure as an SR network and modified the Faster R-CNN architecture to detect vehicles from enhanced images produced by the SR network. In these works, the adoption of the SR structure has effectively addressed the challenges regarding small objects. However, compared with single detection models, additional computation is introduced, which is attributed to the enlarged scale of the input image in the HR design.

Recently, Wang et al. [38] proposed an SR module that can maintain HR representations with LR input while reducing the model computation in segmentation tasks. Inspired by [38], we design an SR assisted branch. In contrast to the aforementioned works in which the SR is performed at the start stage, the assisted SR module guides the learning of high-quality HR representations for the detector, which not only strengthens the response of small dense objects but also improves the performance of object detection in the spatial space. Moreover, the SR module is removed in the inference stage to avoid extra computation.

Fig. 3. The backbone structure of YOLOv5s. The low-level texture and high-level semantic features are extracted by stacked CSP, CBS, and SPP structures.

III. BASELINE ARCHITECTURE

As shown in Fig. 2, the baseline YOLOv5 network consists of two main components: the Backbone and the Head (including the Neck). The Backbone is designed to extract low-level texture and high-level semantic features. Next, these hint features are fed to the Head to construct the enhanced feature pyramid network, from top to bottom to transfer robust semantic features and from bottom to top to propagate a strong response of local texture and pattern features. This resolves the varying-scale issue of the objects by yielding an enhancement of detection at diverse scales.

In Fig. 3, CSPNet [39] is utilized as the Backbone to extract the feature information, consisting of numerous simple Convolution-Batch normalization-SiLU (CBS) components and Cross Stage Partial (CSP) modules. The CBS is composed of a convolution, batch normalization, and the activation function SiLU [40]. The CSP duplicates the feature map of the previous layer into two branches and then halves the channel numbers through 1 × 1 convolutions, by which the computation is reduced. With respect to the two copies of the feature map, one is connected to the end of the stage, and the other is sent into ResNet blocks or CBS blocks as the input. Finally, the two copies of the feature map are concatenated to combine the features, which is followed by a CBS block. The SPP (Spatial Pyramid Pooling) module [41] is composed of parallel Maxpool layers with different kernel sizes and is utilized to extract multiscale deep features. The low-level texture and high-level semantic features are extracted by stacked CSP, CBS, and SPP structures.
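To make the block structure concrete, the following is a minimal PyTorch sketch of a CBS unit and a CSP-style block as described above; the module names, channel sizes, and the number of inner units are illustrative assumptions rather than the exact YOLOv5s configuration.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Convolution + Batch Normalization + SiLU, as described for the backbone."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class CSPBlock(nn.Module):
    """CSP-style block: duplicate the feature map into two branches with 1x1 convs
    that halve the channels, process one branch with stacked units, then concatenate
    the two branches and fuse them with a final CBS."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_half = c_out // 2                       # 1x1 convs halve the channel number
        self.split1 = CBS(c_in, c_half, k=1)      # copy sent through the stacked units
        self.split2 = CBS(c_in, c_half, k=1)      # copy connected to the end of the stage
        self.blocks = nn.Sequential(*[CBS(c_half, c_half, k=3) for _ in range(n)])
        self.fuse = CBS(2 * c_half, c_out, k=1)   # CBS after the concatenation

    def forward(self, x):
        y1 = self.blocks(self.split1(x))
        y2 = self.split2(x)
        return self.fuse(torch.cat((y1, y2), dim=1))

# Example: a 3-channel 512x512 input mapped to a 64-channel feature map.
feat = CSPBlock(32, 64, n=2)(CBS(3, 32, k=3, s=2)(torch.randn(1, 3, 512, 512)))
print(feat.shape)  # torch.Size([1, 64, 256, 256])
```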
Limitation 1: It is worth mentioning that the Focus module is introduced to decrease the number of computations. As shown in Fig. 2 (bottom left), the inputs are partitioned into individual pixels, reconstructed at intervals, and finally concatenated in the channel dimension. The inputs are thus resized to a smaller scale to reduce the computation cost and accelerate the network training and inference speed. However, this may sacrifice object detection accuracy to a certain extent, especially for small objects that are vulnerable to resolution.

Limitation 2: It is known that the backbone of YOLO employs deep convolutional neural networks to extract hierarchical features with a stride of 2, through which the size of the extracted features is halved. Hence, the feature size retained for multiscale detection is far smaller than that of the original input image. For example, when the input image size is 608, the sizes of the output features for the last detection layers are 76, 38, and 19, respectively. LR features may result in the missing of some small objects.

IV. SUPERYOLO ARCHITECTURE

As summarized in Fig. 2, we introduce three new contributions to our SuperYOLO network architecture. First, we remove the Focus module in the Backbone and replace it with an MF module, to avoid resolution degradation and thus accuracy degradation. Second, we explore different fusion methods and choose the computation-efficient pixel-level fusion to fuse the RGB and IR modalities and refine dissimilar and complementary information. Finally, we add an assisted SR module in the training stage, which reconstructs the HR images to guide the related Backbone learning in the spatial dimension and thus maintain HR information. In the inference stage, the SR branch is discarded to avoid introducing additional computation overhead.

A. Focus Removal

As presented in Section III and Fig. 2 (bottom left), the Focus module in the YOLOv5 backbone partitions images at intervals in the spatial domain and then reorganizes the new image to resize the input. Specifically, this operation collects a value for every group of pixels in an image and then reconstructs it to obtain smaller complementary images. The size of the rebuilt images decreases as the number of channels increases. As a result, it causes resolution degradation and spatial information loss for small targets. Considering that the detection of small targets depends more heavily on higher resolution, the Focus module is abandoned and replaced by an MF module (shown in Fig. 4) to prevent the resolution from being degraded.
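For reference, the Focus (space-to-depth) slicing that SuperYOLO removes can be sketched as below; this is a minimal illustration of the pixel-interval partitioning described above, not the exact ultralytics implementation.

```python
import torch

def focus_slice(x: torch.Tensor) -> torch.Tensor:
    """Space-to-depth slicing: sample pixels at intervals of 2 in each spatial
    direction and stack the four complementary sub-images along the channel axis.
    A (B, C, H, W) input becomes (B, 4C, H/2, W/2), i.e. the spatial resolution
    is halved, which is exactly the degradation SuperYOLO avoids by removing Focus."""
    return torch.cat(
        [x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]],
        dim=1,
    )

x = torch.randn(1, 3, 512, 512)
print(focus_slice(x).shape)  # torch.Size([1, 12, 256, 256]) -- half resolution, 4x channels
```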


Fig. 4. The architecture of the multimodal fusion (MF) module at the pixel level.

B. Multimodal Fusion

The more information is utilized to distinguish objects, the better performance can be achieved in object detection. Multimodal fusion is an effective path for merging different information from various sensors. The decision-level, feature-level, and pixel-level fusions are the three mainstream fusion methods, which can be deployed at different depths of the network. Since decision-level fusion requires enormous computation, it is not considered in SuperYOLO.

We propose a pixel-level multimodal fusion (MF) to extract the shared and specific information from the different modalities. The MF can combine multimodal inner information bi-directionally in a symmetric and compact manner. As shown in Fig. 4, for the pixel-level fusion, we first normalize an input RGB image and an input IR image into the interval [0, 1]. The input modalities X_RGB, X_IR ∈ R^{C×H×W} are subsampled to I_RGB, I_IR ∈ R^{C×(H/n)×(W/n)}, which are fed to SE blocks extracting inner information in the channel domain [42] to generate F_RGB and F_IR:

F_RGB = SE(I_RGB), F_IR = SE(I_IR).   (1)

Then the attention maps that reveal the inner relationship of the different modalities in the spatial domain are defined as

m_IR = f_1(F_IR), m_RGB = f_2(F_RGB),   (2)

where f_1 and f_2 represent 1 × 1 convolutions for the IR and RGB modalities, respectively. Let ⊗ denote element-wise matrix multiplication. The inner spatial information between the different modalities is produced by

F_in1 = m_RGB ⊗ F_RGB, F_in2 = m_IR ⊗ F_IR.   (3)

To incorporate the internal inner-view information and the spatial texture information, the features are added to the original input modalities and then fed into 1 × 1 convolutions, so that the full features are

F_ful1 = f_3(F_in1 + I_RGB), F_ful2 = f_4(F_in2 + I_IR),   (4)

where f_3 and f_4 represent 1 × 1 convolutions. Finally, the features are fused by

F_o = SE(Concat(F_ful1, F_ful2)),   (5)

where Concat(·) denotes the concatenation operation along the channel axis. The result is then fed to the Backbone to produce multi-level features. Note that X is subsampled to 1/n of the size of the original image to accomplish the SR module discussed in Section IV-C and to accelerate the training process. Here X represents the RGB or IR modality, and the sampled image, denoted as I ∈ R^{C×(H/n)×(W/n)}, is generated by

I = D(X),   (6)

where D(·) represents an n-times downsampling operation using bilinear interpolation.
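A minimal PyTorch sketch of this pixel-level MF module is given below. It follows Eqs. (1)–(6) as written above (SE blocks in the channel domain, 1 × 1 convolutions f_1–f_4, element-wise multiplication, addition of the subsampled inputs, concatenation, and a final SE block); the SE implementation, channel counts, and reduction ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SE(nn.Module):
    """Squeeze-and-Excitation block (channel attention), assumed reduction ratio r."""
    def __init__(self, channels, r=4):
        super().__init__()
        hidden = max(channels // r, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))              # squeeze over H, W
        return x * w[:, :, None, None]               # re-weight channels

class MultimodalFusion(nn.Module):
    """Pixel-level MF following Eqs. (1)-(6): SE on each subsampled modality,
    1x1 convs producing attention maps, element-wise multiplication, addition of
    the subsampled inputs, 1x1 convs, concatenation, and a final SE block."""
    def __init__(self, channels=3):
        super().__init__()
        self.se_rgb, self.se_ir = SE(channels), SE(channels)
        self.f1 = nn.Conv2d(channels, channels, 1)   # for the IR branch
        self.f2 = nn.Conv2d(channels, channels, 1)   # for the RGB branch
        self.f3 = nn.Conv2d(channels, channels, 1)
        self.f4 = nn.Conv2d(channels, channels, 1)
        self.se_out = SE(2 * channels)

    def forward(self, x_rgb, x_ir, n=2):
        # Eq. (6): bilinear downsampling D(X) by a factor of n.
        i_rgb = F.interpolate(x_rgb, scale_factor=1 / n, mode="bilinear", align_corners=False)
        i_ir = F.interpolate(x_ir, scale_factor=1 / n, mode="bilinear", align_corners=False)
        f_rgb, f_ir = self.se_rgb(i_rgb), self.se_ir(i_ir)        # Eq. (1)
        m_ir, m_rgb = self.f1(f_ir), self.f2(f_rgb)               # Eq. (2)
        f_in1, f_in2 = m_rgb * f_rgb, m_ir * f_ir                 # Eq. (3)
        f_ful1 = self.f3(f_in1 + i_rgb)                           # Eq. (4)
        f_ful2 = self.f4(f_in2 + i_ir)
        return self.se_out(torch.cat([f_ful1, f_ful2], dim=1))    # Eq. (5)

rgb, ir = torch.rand(1, 3, 1024, 1024), torch.rand(1, 3, 1024, 1024)
print(MultimodalFusion()(rgb, ir).shape)  # torch.Size([1, 6, 512, 512])
```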

Fig. 5. The super resolution (SR) structure of SuperYOLO. The SR structure can be regarded as a simple Encoder-Decoder model. The low-level and high-level features of the backbone are selected to fuse local texture patterns and semantic information, respectively.

C. Super Resolution

As mentioned in Section III, the feature size retained for multiscale detection in the backbone is far smaller than that of the original input image. Most of the existing methods conduct upsampling operations to recover the feature size. Unfortunately, this approach has produced limited success due to the information loss in texture and pattern, which explains why it is inappropriate to employ this operation to detect small targets that require HR preservation in RSI.

To address this issue, as shown in Fig. 2, we introduce an auxiliary SR branch. First, the introduced branch shall facilitate the extraction of HR information in the backbone and achieve satisfactory performance. Second, the branch should not add computation that reduces the inference speed; it shall realize a trade-off between accuracy and computation time during the inference stage. Inspired by the study of Wang et al. [38], where the proposed super resolution succeeded in facilitating segmentation tasks without additional requirements, we introduce a simple and effective branch named SR into the framework. Our proposal can improve detection accuracy without computation and memory overload, especially under circumstances of LR input.

Specifically, the SR structure can be regarded as a simple Encoder-Decoder model. We select the backbone's low-level and high-level features to fuse local textures and patterns and semantic information, respectively. As depicted in Fig. 4, we select the results of the fourth and ninth modules as the low-level and high-level features, respectively. The Encoder integrates the low-level feature and the high-level feature generated in the backbone. As illustrated in Fig. 5, in the Encoder, the first CR module is conducted on the low-level feature. For the high-level feature, we use an upsampling operation to match the spatial size of the low-level feature, and then we use a concatenation operation and two CR modules to merge the low-level and high-level features. The CR module includes a convolution and a ReLU. For the Decoder, the LR feature is upscaled to the HR space, in which the SR module's output size is twice as large as that of the input image. As illustrated in Fig. 5, the Decoder is implemented using three deconvolutional layers. The SR guides the related learning of the spatial dimension and transfers it to the main branch, thereby improving the performance of object detection. In addition, we introduce EDSR [43] as our Encoder structure to explore the SR performance and its influence on detection performance.

To present a more visually interpretable description, we visualize the features of the backbones for YOLOv5s, YOLOv5x, and SuperYOLO in Fig. 6. The features are upsampled to the same scale as the input image for comparison. By comparing the pairwise images of (c), (f) and (i); (d), (g) and (j); (e), (h) and (k) in Fig. 6, it can be observed that SuperYOLO contains clearer object structures with higher resolution, with the assistance of the SR. Eventually, we obtain a bumper harvest in high-quality HR representation with the SR branch and utilize the Head of YOLOv5 to detect small objects.
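The following is a minimal PyTorch sketch of such an assisted SR branch (CR = convolution + ReLU in the Encoder, three deconvolutional layers in the Decoder, output twice the size of the network input). The channel widths, the assumed feature-map scales, and the plain convolutional Encoder in place of EDSR are illustrative assumptions, and the branch is meant to be attached only during training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cr(c_in, c_out):
    """CR module: 3x3 convolution followed by ReLU."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True))

class SRBranch(nn.Module):
    """Auxiliary SR branch: the Encoder fuses a low-level and a high-level backbone
    feature; the Decoder upscales the fused LR feature with three deconvolutions so
    that the reconstruction is 2x the (already downsampled) network input."""
    def __init__(self, c_low=128, c_high=512, c_mid=128, out_channels=3):
        super().__init__()
        self.enc_low = cr(c_low, c_mid)                        # first CR on the low-level feature
        self.enc_fuse = nn.Sequential(cr(c_mid + c_high, c_mid), cr(c_mid, c_mid))
        self.decoder = nn.Sequential(                          # three deconvolutional layers
            nn.ConvTranspose2d(c_mid, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, out_channels, 4, stride=2, padding=1),
        )

    def forward(self, feat_low, feat_high):
        low = self.enc_low(feat_low)
        # Upsample the high-level feature to the spatial size of the low-level one.
        high = F.interpolate(feat_high, size=low.shape[-2:], mode="bilinear", align_corners=False)
        fused = self.enc_fuse(torch.cat([low, high], dim=1))
        return self.decoder(fused)

# Assumed shapes: 512x512 network input, low-level feature at 1/4 scale,
# high-level feature at 1/16 scale; the SR output is then 1024x1024.
feat_low = torch.randn(1, 128, 128, 128)
feat_high = torch.randn(1, 512, 32, 32)
print(SRBranch()(feat_low, feat_high).shape)  # torch.Size([1, 3, 1024, 1024])
```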
D. Loss Function

The overall loss of our network consists of two components: the detection loss L_o and the SR construction loss L_s, which can be expressed as

L_total = c_1 L_o + c_2 L_s,   (7)

where c_1 and c_2 are the coefficients balancing the two training tasks. The L1 loss (rather than the L2 loss) [44] is used to calculate the SR construction loss L_s between the input image X and the SR result S, whose expression is written as

L_s = ||S − X||_1.   (8)

The detection loss involves three components [19]: the loss of judging whether there is an object, L_obj; the loss of object location, L_loc; and the loss of object classification, L_cls, which are used to evaluate the loss of the prediction as

L_o = λ_loc Σ_{l=0}^{2} a_l L_loc + λ_obj Σ_{l=0}^{2} b_l L_obj + λ_cls Σ_{l=0}^{2} c_l L_cls.   (9)

In Eq. (9), l represents the layer of the output in the Head; a_l, b_l, and c_l are the weights of the different layers for the three loss functions; and the weights λ_loc, λ_obj, and λ_cls regulate the error emphasis among box coordinates, box dimensions, objectness, no-objectness, and classification.
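As a sketch of how the two objectives could be combined during training, consider the snippet below; the balancing coefficients and the placeholder detection loss are illustrative assumptions, not the exact values used in the paper.

```python
import torch
import torch.nn.functional as F

def total_loss(det_loss: torch.Tensor,
               sr_output: torch.Tensor,
               hr_target: torch.Tensor,
               c1: float = 1.0,
               c2: float = 1.0) -> torch.Tensor:
    """Eq. (7): L_total = c1 * L_o + c2 * L_s, where the SR construction loss L_s
    is the L1 distance between the SR reconstruction S and the HR input X (Eq. (8))."""
    sr_loss = F.l1_loss(sr_output, hr_target)
    return c1 * det_loss + c2 * sr_loss

# Toy usage: a scalar detection loss stands in for the YOLO objectness/box/class terms.
det_loss = torch.tensor(1.5, requires_grad=True)
sr_out = torch.rand(1, 3, 1024, 1024, requires_grad=True)
hr_img = torch.rand(1, 3, 1024, 1024)
loss = total_loss(det_loss, sr_out, hr_img, c1=1.0, c2=1.0)
loss.backward()  # gradients flow into both the detector and the SR branch
print(float(loss))
```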

Fig. 6. Feature-level visualization of backbone for YOLOv5s, YOLOv5x and SuperYOLO with the same input: (a) RGB input, (b) IR input; (c), (d), and (e)
are the features of YOLOv5s; (f), (g), and (h) are the features of YOLOv5x; (i), (j) and (k) are the features of SuperYOLO. The features are upsampled to
the same scale as the input image for comparison. (c), (f) and (i) are the features in the first layer. (d), (g) and (j) are the low-level features. (e), (h) and (k)
are the high-level features in layers at the same depth.

V. EXPERIMENTAL RESULTS

A. Dataset

The popular Vehicle Detection in Aerial Imagery (VEDAI) dataset [45] is used in the experiments, which contains cropped images obtained from the much larger Utah Automated Geographic Reference Center (AGRC) dataset. Each image collected from the same altitude in AGRC has approximately 16,000 × 16,000 pixels, with a resolution of about 12.5 cm × 12.5 cm per pixel. RGB and IR are the two modalities for each image of the same scene. The VEDAI dataset consists of 1246 smaller images that cover diverse backgrounds involving grass, highways, mountains, and urban areas. All images are of size 1024 × 1024 or 512 × 512. The task is to detect 11 classes of different vehicles, such as car, pickup, camping, and truck.

B. Implementation Details

Our proposed framework is implemented in PyTorch and runs on a workstation with an NVIDIA 3090 GPU. The VEDAI dataset is used to train our SuperYOLO. Following [27], the VEDAI dataset is divided for 10-fold cross-validation. In each split, 1089 images are used for training and another 121 images are used for testing. The ablation experiments are conducted on the first fold of the data, while the comparisons with previous methods are performed on the 10 folds by averaging their results. The annotations for each object in an image contain the coordinates of the bounding box center, the orientation of the object with respect to the positive x-axis, the four corners of the bounding box, the class ID, a binary flag identifying whether an object is occluded, and another binary flag identifying whether an object is cropped. We do not consider classes with fewer than 50 instances in the dataset, such as plane, motorcycle, and bus. The annotations of the VEDAI dataset are therefore converted to YOLOv5 format, and we map the IDs of the classes of interest to 0, 1, ..., 7, i.e., N = 8. Then the center coordinates of the bounding box are normalized and the absolute coordinates are transformed into relative coordinates. Similarly, the length and width of the bounding box are normalized to [0, 1]. To realize the SR assisted branch, the input images of the network are downsampled from 1024 × 1024 to 512 × 512 during the training process. In the test process, the image size is 512 × 512, which is consistent with the input of the other compared algorithms. In addition, the data is augmented with Hue-Saturation-Value (HSV), multi-scale, translation, left-right flip, and mosaic augmentations. The augmentation strategy is canceled in the test stage. Standard Stochastic Gradient Descent (SGD) [46] is used to train the network with a momentum of 0.937, a weight decay of 0.0005 with Nesterov accelerated gradients, and a batch size of 2. The learning rate is set to 0.01 initially. The entire training process involves 300 epochs.

C. Accuracy Metrics

The accuracy assessment measures the agreements and differences between the detection result and the reference mask. The recall, precision, and mAP (mean average precision) are used as accuracy metrics to evaluate the performance of the compared methods. The precision and recall metrics are defined as

Precision = TP / (TP + FP),   (10)

Recall = TP / (TP + FN),   (11)

where true positives (TP) and true negatives (TN) denote correct predictions, and false positives (FP) and false negatives (FN) denote incorrect outcomes. The precision and recall are correlated with the commission and omission errors, respectively. The mAP is a comprehensive indicator obtained by averaging AP values, which uses an integral method to calculate the area enclosed by the Precision-Recall curve and the coordinate axes over all categories.

TABLE I
The Comparison Results of Model Size and Inference Ability of Different Baseline YOLO Frameworks on the First Fold of the VEDAI Validation Set.

Method   | Layers ↓ | Params ↓ | GFLOPs ↓ | mAP50 ↑
YOLOv3   | 270      | 61.5M    | 52.8     | 62.6
YOLOrs   | 241      | 20.2M    | 46.4     | 55.8
YOLOv4   | 393      | 52.5M    | 38.2     | 65.7
YOLOv5s  | 224      | 7.1M     | 5.32     | 62.2
YOLOv5m  | 308      | 21.1M    | 16.1     | 64.5
YOLOv5l  | 397      | 46.6M    | 36.7     | 63.9
YOLOv5x  | 476      | 87.3M    | 69.7     | 64.0

TABLE II
Influence of Removing the Focus Module in the Network on the First Fold of the VEDAI Validation Set.

Method  |         | Params ↓  | GFLOPs ↓ | mAP50 ↑
YOLOv5s | Focus   | 7.0739M   | 5.3      | 62.2
        | noFocus | 7.0705M   | 20.4     | 69.5 (+7.3)
YOLOv5m | Focus   | 21.0677M  | 16.1     | 64.5
        | noFocus | 21.0625M  | 63.6     | 72.2 (+7.7)
YOLOv5l | Focus   | 46.6406M  | 36.7     | 63.7
        | noFocus | 46.6337M  | 145.0    | 72.5 (+8.8)
YOLOv5x | Focus   | 87.2487M  | 69.7     | 64.0
        | noFocus | 87.2400M  | 276.6    | 69.2 (+5.2)

Hence, the mAP can be calculated by

mAP = (Σ AP) / N = (Σ ∫_0^1 p(r) dr) / N,   (12)

where p denotes the precision, r denotes the recall, and N is the number of categories.

GFLOPs (giga floating-point operations) and the parameter size are used to measure the model complexity and computation cost. In addition, PSNR and SSIM are used for image quality evaluation of the SR branch. Generally, higher PSNR and SSIM values represent a better quality of the generated image.

TABLE III
The Comparison Results of Pixel-Level and Feature-Level Fusions in YOLOv5s (noFocus) for the Multimodal Dataset on the First Fold of the VEDAI Validation Set.

Fusion strategy                 | Method  | Params ↓ | GFLOPs ↓ | mAP50 ↑
Pixel-level Fusion              | Concat  | 7.0705M  | 20.37    | 69.5
                                | MF      | 7.0897M  | 21.67    | 70.3
Feature-level Fusion            | Fusion1 | 7.0887M  | 21.76    | 66.0
                                | Fusion2 | 7.0744M  | 22.04    | 68.5
                                | Fusion3 | 7.1442M  | 24.22    | 64.8
                                | Fusion4 | 7.0870M  | 24.50    | 63.8
Multistage Feature-level Fusion |         | 7.7545M  | 34.56    | 59.3
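As a concrete illustration of Eqs. (10)–(12), the following sketch computes precision, recall, and a per-class AP from scored detections under a simplified all-points interpolation; the IoU-matching step that labels each detection as a true or false positive is assumed to have been done beforehand.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    """Eqs. (10) and (11): Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(scores, is_tp, num_gt):
    """Simplified AP for one class: area under the precision-recall curve built from
    confidence-ranked detections (the integral inside Eq. (12)). `is_tp` marks
    detections already matched to ground truth at the chosen IoU (0.5 for mAP50)."""
    order = np.argsort(-scores)                       # rank detections by confidence
    tps = np.cumsum(is_tp[order])
    fps = np.cumsum(~is_tp[order])
    recall = np.concatenate(([0.0], tps / num_gt))    # sentinel point at recall 0
    precision = np.concatenate(([1.0], tps / (tps + fps)))
    precision = np.maximum.accumulate(precision[::-1])[::-1]   # monotone envelope
    return float(np.trapz(precision, recall))

# Toy example: 6 detections of one class, 4 ground-truth objects in total.
scores = np.array([0.95, 0.90, 0.80, 0.70, 0.60, 0.40])
is_tp = np.array([True, True, False, True, False, True])
print(precision_recall(tp=4, fp=2, fn=0))                     # ~(0.667, 1.0) for this toy split
print(round(average_precision(scores, is_tp, num_gt=4), 3))
# mAP is then the mean of such per-class AP values over the N categories (Eq. (12)).
```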

D. Ablation Study

First of all, we verify the effectiveness of our proposed method by designing a series of ablation experiments, which are conducted on the first fold of the validation set.

1) Validation of the Baseline Framework: In Table I, the model size and inference ability of different base frameworks are evaluated in terms of the number of layers, parameter size, and GFLOPs. The detection performance of these models is measured by mAP50, i.e., the mAP detection metric at IoU (Intersection over Union) = 0.5. Although YOLOv4 achieves the best detection performance, it has 169 more layers than YOLOv5s (393 vs. 224), its parameter size (params) is 7.4 times larger than that of YOLOv5s (52.5M vs. 7.1M), and its GFLOPs is 7.2 times higher than that of YOLOv5s (38.2 vs. 5.3). With respect to YOLOv5s, although its mAP is slightly lower than those of YOLOv4 and YOLOv5m, its number of layers, parameter size, and GFLOPs are much smaller than those of the other models. Therefore, it is easier to deploy YOLOv5s on board to achieve real-time performance in practical applications. The above facts verify the rationality of YOLOv5s as the baseline detection framework.

2) Impact of Removing the Focus Module: As presented in Section IV-A, the Focus module reduces the resolution of the input images, which imposes an encumbrance on the detection performance for small objects in RSI. To investigate the influence of the Focus module, we conduct experiments on four YOLOv5 network frameworks: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. Note that the results here are collected after the concatenation pixel-level fusion of the RGB and IR modalities. As listed in Table II, after removing the Focus module, we observe a noticeable improvement in the detection performance of YOLOv5s (62.2%→69.5% in mAP50), YOLOv5m (64.5%→72.2%), YOLOv5l (63.7%→72.5%), and YOLOv5x (64.0%→69.2%). This is because, by removing the Focus module, not only can the resolution degradation be avoided, but the spatial interval information can also be retained for small objects in RSI, thereby reducing the missing errors of object detection. Generally, removing the Focus module brings more than a 5% improvement in the detection performance (mAP50) across all the frameworks.

Meanwhile, we notice that the above removal increases the inference computation cost (GFLOPs) in YOLOv5s (5.3→20.4), YOLOv5m (16.1→63.6), YOLOv5l (36.7→145.0), and YOLOv5x (69.7→276.6). However, the GFLOPs of YOLOv5s-noFocus (20.4) is smaller than those of YOLOv3 (52.8), YOLOv4 (38.2), and YOLOrs (46.4), as shown in Table I. The parameters of these models are slightly reduced after removing the Focus module. In summary, in order to retain the resolution and better detect smaller objects, priority shall be given to the detection accuracy, for which a convolution operation is adopted to replace the Focus module.

3) Comparison of Different Fusion Methods: To evaluate the influence of the devised fusion methods, we compare five fusion results on YOLOv5s-noFocus, as presented in Section IV-B. As shown in Fig. 7, Fusion1, Fusion2, Fusion3, and Fusion4 represent the concatenation fusion operation performed in the first, second, third, and fourth blocks, respectively. The IR image is expanded to three bands in the feature-level fusion to obtain features with equal channels for the two modalities. The final results are listed in Table III.
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. X, NO. X, 2023 9

TABLE IV
The Influence of Different Resolutions of the Input Image on Network Performance on the First Fold of the VEDAI Validation Set.

Method                 | Train-Val Size | Test Size | Params ↓ | GFLOPs ↓ | mAP50 ↑
YOLOv5s                | 512            | 512       | 7.0739M  | 5.3      | 62.2
YOLOv5s                | 512            | 1024      | 7.0739M  | 21.3     | 10.6
YOLOv5s                | 1024           | 1024      | 7.0739M  | 21.3     | 77.7
YOLOv5s                | 1024           | 512       | 7.0739M  | 5.3      | 48.2
YOLOv5s (noFocus)      | 512            | 512       | 7.0705M  | 20.4     | 69.5
YOLOv5s (noFocus)      | 512            | 1024      | 7.0705M  | 81.5     | 13.4
YOLOv5s (noFocus)      | 1024           | 1024      | 7.0705M  | 81.5     | 79.3
YOLOv5s (noFocus)      | 1024           | 512       | 7.0705M  | 20.4     | 62.9
YOLOv5s (noFocus) + SR | 512            | 512       | 7.0705M  | 20.4     | 78.0
Fig. 7. The feature-level fusion of different blocks in the latent layers. Fusion-n represents the concatenation fusion operation performed in the n-th block. (a) and (b) are feature-level fusion and (c) is multistage feature-level fusion.

The parameter size, GFLOPs, and mAP50 of the pixel-level fusion with the concatenation operation are 7.0705M, 20.37, and 69.5%, and those of the pixel-level fusion with the MF module are 7.0897M, 21.67, and 70.3%, which is the best among all the compared methods. There are some reasons why the model parameters of the feature-level fusions are close to those of the pixel-level fusion. First, the feature-level fusion is completed in the latent layers rather than in two whole separate models. Second, the modules before the concatenation fusion are different, so the different fusion channels lead to different parameter counts. However, it can be observed that the calculation cost increases as the fusion layer becomes deeper. In addition, we compare the multistage feature-level fusion (shown in Fig. 7 (c)) with the proposed pixel-level fusion. As shown in Table III, the accuracy of the multistage feature-level fusion is only 59.3% mAP50, lower than that of the pixel-level fusion, while its computation cost is 34.56 GFLOPs with 7.7545M parameters, which is higher than that of the pixel-level fusion. These findings suggest that the pixel-level fusion methods are more effective than multistage shallow feature-level fusion, because the multiple stages of fusion can lead to the accumulation of redundant information. The above results suggest that pixel-level fusion can accurately detect objects while reducing the computation. Our proposed MF fusion can improve the detection accuracy at a small additional computation cost. Overall, the proposed method only uses pixel-level fusion to keep the computation cost low.

4) Impact of High Resolution: We compare different training and test modes to explore more possibilities in terms of the input image resolution in Table IV. First, we compare cases where the image resolutions of the training set and test set are the same. Comparing the results of YOLOv5s, the detection metric mAP50 is improved from 62.2% to 77.7%, a 15.5% increase, when the image size is doubled from 512 to 1024. Similarly, YOLOv5s-noFocus (1024) outperforms YOLOv5s-noFocus (512) by a 9.8% mAP50 score (79.3% vs. 69.5%). The mean recall and mean precision increase simultaneously, suggesting that ensuring resolution reduces the commission and omission errors in object detection. Based on the above analysis, we argue that the characteristics of HR significantly influence the final performance of object detection. However, it is noteworthy that maintaining an HR input image of the network introduces a certain amount of calculation. The GFLOPs with a size of 1024 (high resolution) is higher than that with 512 (low resolution) in both YOLOv5s (21.3 vs. 5.3) and YOLOv5s-noFocus (81.5 vs. 20.4).

As shown in Table IV, the use of different image sizes during the training process (train size) and the test process (test size) results in a reduction of the mAP50 score, i.e., (10.6% vs. 62.2%), (48.2% vs. 77.7%), (13.4% vs. 69.5%), and (62.9% vs. 79.3%). This may be attributed to the inconsistent scale of objects between the test process and the training process, where the size of the predicted bounding box is no longer suitable for the objects of the test images.

Finally, the mAP50 of YOLOv5s-noFocus+SR is close to that of the YOLOv5s-noFocus HR (1024) setting (78.0% vs. 79.3%), and its GFLOPs is equal to that of the YOLOv5s-noFocus LR (512) setting (20.4 vs. 20.4). Our proposed network decreases the resolution of the input images in the test process to reduce computation and maintains accuracy by keeping the resolution of the training and testing data identical, thereby highlighting the advantage of the proposed SR branch.

5) Impact of the Super Resolution Branch: Several ablation experiments on the SR branch are reported in Table V. Compared with the upsampling operation, YOLOv5s (noFocus) with the added super resolution network shows favorable performance and obtains an mAP50 1.8% better than the upsampling operation. The SR network is a learnable upsampling method with a stronger reconstruction ability that can help the feature extraction in the backbone for detection. We deleted the PANet structure and two detectors, which are responsible for enhancing middle-scale and large-scale target detection, because the objects in RSI datasets such as VEDAI are on a small scale and can be detected with the small-scale detector.

TABLE V
The Ablation Experiment Results on the Influence of the SR Branch on Detection Performance on the First Fold of the VEDAI Validation Set.

Method: YOLOv5s (noFocus)
Branch   | Small-scale Detector | Decoder (EDSR) | L1 Loss | Params ↓ | GFLOPs ↓ | mAP50 ↑ | PSNR ↑ | SSIM ↑
Upsample |                      |                |         | 7.0705M  | 20.37    | 76.2    | -      | -
SR       |                      |                |         | 7.0705M  | 20.37    | 78.0    | -      | -
SR       | X                    |                |         | 4.8259M  | 16.68    | 79.0    | 23.811 | 0.602
SR       | X                    | X              |         | 4.8259M  | 16.68    | 79.9    | 23.902 | 0.604
SR       | X                    | X              | X       | 4.8259M  | 16.68    | 80.9    | 26.203 | 0.659

TABLE VI
The Effective Validation of the Super Resolution Branch for Different Baselines on the First Fold of the VEDAI Validation Set.

Method     | Layers | Params ↓ | GFLOPs ↓ | mAP50 ↑
YOLOv3     | 270    | 61.5M    | 52.8     | 62.6
YOLOv3+SR  | 270    | 61.5M    | 52.8     | 71.8
YOLOv4     | 393    | 52.5M    | 38.2     | 65.7
YOLOv4+SR  | 393    | 52.5M    | 38.2     | 69.0
YOLOv5s    | 224    | 7.1M     | 5.3      | 62.2
YOLOv5s+SR | 224    | 7.1M     | 5.3      | 64.4

When we only use one detector, the number of parameters (7.0705M vs. 4.8259M) and GFLOPs (20.37 vs. 16.68) can be decreased, and the detection accuracy can be increased (78.0% vs. 79.0%). When we utilize the EDSR network (rather than three ordinary deconvolutions) as the Decoder and the L1 loss (rather than the L2 loss) as the SR loss function in the SR branch, both of which are powerful in the SR task, not only is the SR performance improved, but the performance of the detection network is enhanced at the same time, because the SR branch helps the detection network extract more effective and superior features in the backbone, accelerating the convergence of the detection network and thus improving its performance. The performance of super resolution and object detection is complementary and cooperative.

Table VI shows the favorable accuracy-complexity trade-off of the SR branch. For the different baselines, the influence of the SR branch on object detection is positive. Compared with the bare baselines, the baselines with added super resolution show favorable performance: YOLOv3+SR performs 9.2% better than YOLOv3 in mAP50, YOLOv4+SR is 3.3% better than YOLOv4, and YOLOv5s+SR performs 2.2% better than YOLOv5s. Notably, the super resolution branch can be removed in the inference stage. Hence no extra parameters and computation costs are introduced, which is impressive considering that the SR branch does not require a lot of manpower to refine the design of the detection network. The SR branch is general and extensible and can be utilized in existing fully convolutional network (FCN) frameworks.

E. Comparisons with Previous Methods

The visual detection results of the compared YOLO methods and SuperYOLO are shown in Fig. 8, for a diverse set of scenes. It can be observed that SuperYOLO can accurately detect those objects that are not detected, predicted into a wrong category, or detected with uncertainty by YOLOv4, YOLOv5s, and YOLOv5m. The objects in RSIs are challenging to detect at small scales. In particular, Pickup and Car, or Van and Boat, are easily confused in the detection process due to their similarities. Hence, improving the detection classification, in addition to location detection, is of essential necessity in object detection tasks, which can be accomplished by the proposed SuperYOLO with better performance.

Table VII summarizes the performance of YOLOv3 [47], YOLOv4 [48], YOLOv5s-x [19], YOLOrs [27], YOLO-Fine [49], YOLOFusion [50], and our proposed SuperYOLO. Note that the AP scores of the multimodal modes are significantly higher than those of the unimodal (RGB or IR) modes for most classes. The overall mAP50 of the multimodal (multi) modes outperforms those of the RGB or IR modes. These results confirm that multimodal fusion is an effective and efficient strategy for object detection based on the information complementation between multimodal inputs. However, it should be noted that the slight increase in parameters and GFLOPs with multimodal fusion reflects the necessity of choosing pixel-level fusion rather than feature-level fusion.

It is obvious that SuperYOLO achieves a higher mAP50 than the other frameworks except for YOLOFusion. The results of YOLOFusion are slightly better than those of SuperYOLO, as YOLOFusion uses pre-trained weights trained on MS COCO [7]. However, its parameter count is approximately three times that of SuperYOLO. The performance of YOLO-Fine is good on a single modality, but it lacks the development of multi-modality fusion techniques. In particular, SuperYOLO outperforms YOLOv5x by a 12.44% mAP50 score in the multimodal mode. Meanwhile, the parameter size and GFLOPs of SuperYOLO are about 18x and 3.8x less than those of YOLOv5x.

In addition, it can be noticed that superior performance is achieved for the classes Car, Pickup, Tractor, and Camping, which have the most training instances. YOLOv5s performs best on GFLOPs, which relies on the Focus module to slim the input image, but this results in poor detection performance, especially for small objects. SuperYOLO performs 18.30% better in mAP50 than YOLOv5s. Our proposed SuperYOLO shows a favorable speed-accuracy trade-off compared to the state-of-the-art models.

Fig. 8. Visual results of object detection using different methods involving YOLOv4, YOLOv5s, YOLOv5m and the proposed SuperYOLO. The red circles represent the False Alarms, the yellow ones denote the False Positive detection results, and the blue ones are the False Negative detection results. (a)-(e) are different images in the VEDAI dataset.

F. Generalization to Single-Modal Remote Sensing Images

At present, although there are massive multimodal images in remote sensing, labeled datasets for object detection tasks are lacking due to the expensive cost of manual annotation. To validate the generalization of our proposed network, we compare SuperYOLO with different one-stage and two-stage methods using data from a single modality, including the large-scale Dataset for Object Detection in Aerial images (DOTA), the object DetectIon in Optical Remote sensing images (DIOR) dataset, and the Northwestern Polytechnical University Very-High-Resolution 10-class (NWPU VHR-10) dataset.

1) DOTA: The DOTA dataset was proposed in 2018 for object detection in remote sensing. It contains 2806 large images and 188,282 instances, which are divided into 15 categories. The size of each original image is 4000 × 4000, and the images are cropped into 1024 × 1024 pixels with an overlap of 200 pixels in the experiment. We select half of the original images as the training set, 1/6 as the validation set, and 1/3 as the testing set. The size of the image is fixed to 512 × 512.

2) NWPU VHR-10: The NWPU VHR-10 dataset was proposed in 2016. It contains 800 images, of which 650 contain objects, so we use 520 images as the training set and 130 images as the testing set. The dataset contains 10 categories, and the size of the image is fixed to 512 × 512.
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. X, NO. X, 2023 12

TABLE VII
Class-Wise Average Precision (AP), Mean Average Precision (mAP50), Parameters, and GFLOPs for the Proposed SuperYOLO, YOLOv3, YOLOv4, YOLOv5s-x, YOLOrs, YOLO-Fine, and YOLOFusion, Including Unimodal and Multimodal Configurations, on the VEDAI Dataset. * Represents Using Pre-Trained Weights.

Method (per modality row: IR / RGB / Multi) | Car | Pickup | Camping | Truck | Other | Tractor | Boat | Van | mAP50 ↑ | Params ↓ | GFLOPs ↓
IR 80.21 67.03 65.55 47.78 25.86 40.11 32.67 53.33 51.54 61.5351M 49.55
YOLOv3 [47] RGB 83.06 71.54 69.14 59.30 48.93 67.34 33.48 55.67 61.06 61.5351M 49.55
Multi 84.57 72.68 67.13 61.96 43.04 65.24 37.10 58.29 61.26 61.5354M 49.68
IR 80.45 67.88 68.84 53.66 30.02 44.23 25.40 51.41 52.75 52.5082M 38.16
YOLOv4 [48] RGB 83.73 73.43 71.17 59.09 51.66 65.86 34.28 60.32 62.43 52.5082M 38.16
Multi 85.46 72.84 72.38 62.82 48.94 68.99 34.28 54.66 62.55 52.5085M 38.23
IR 77.31 65.27 66.47 51.56 25.87 42.36 21.88 48.88 49.94 7.0728M 5.24
YOLOv5s [19] RGB 80.07 68.01 66.12 51.52 45.76 64.38 21.62 40.93 54.82 7.0728M 5.24
Multi 80.81 68.48 69.06 54.71 46.76 64.29 24.25 45.96 56.79 7.0739M 5.32
IR 79.23 67.32 65.43 51.75 26.66 44.28 26.64 56.14 52.19 21.0659M 16.13
YOLOv5m [19] RGB 81.14 70.26 65.53 53.98 46.78 66.69 36.24 49.87 58.80 21.0659M 16.13
Multi 82.53 72.32 68.41 59.25 46.20 66.23 33.51 57.11 60.69 21.0677M 16.24
IR 80.14 68.57 65.37 53.45 30.33 45.59 27.24 61.87 54.06 46.6383M 36.55
YOLOv5l [19] RGB 81.36 71.70 68.25 57.45 45.77 70.68 35.89 55.42 60.81 46.6383M 36.55
Multi 82.83 72.32 69.92 63.94 48.48 63.07 40.12 56.46 62.16 46.6406M 36.70
IR 79.01 66.72 65.93 58.49 31.39 41.38 31.58 58.98 54.18 87.2458M 69.52
YOLOv5x [19] RGB 81.66 72.23 68.29 59.07 48.47 66.01 39.15 61.85 62.09 87.2458M 69.52
Multi 84.33 72.95 70.09 61.15 49.94 67.35 38.71 56.65 62.65 87.2487M 69.71
IR 82.03 73.92 63.80 54.21 43.99 54.39 21.97 43.38 54.71 - -
YOLOrs [27] RGB 85.25 72.93 70.31 50.65 42.67 76.77 18.65 38.92 57.00 - -
Multi 84.15 78.27 68.81 52.60 46.75 67.88 21.47 57.91 59.73 - -
IR 76.77 74.35 64.74 63.45 45.04 78.12 70.04 77.91 68.18 - -
YOLO-Fine [49]
RGB 79.68 74.49 77.09 80.97 37.33 70.65 60.84 63.56 68.83 - -
IR 86.7 75.9 66.6 77.1 43.0 62.3 70.7 84.3 70.8 - -
YOLOFusion* [50] RGB 91.1 82.3 75.1 78.3 33.3 81.2 71.8 62.2 71.9 - -
Multi 91.7 85.9 78.9 78.1 54.7 71.9 71.7 75.2 75.9 12.5M -
IR 87.90 81.39 76.90 61.56 39.39 60.56 46.08 71.00 65.60 4.8256M 16.61
SuperYOLO RGB 90.30 82.66 76.69 68.55 53.86 79.48 58.08 70.30 72.49 4.8256M 16.61
Multi 91.13 85.66 79.30 70.18 57.33 80.41 60.24 76.50 75.09 4.8451M 17.98

3) DIOR: The DIOR dataset was proposed in 2020 for the task of object detection; it involves 23,463 images and 192,472 instances. The size of each image is 800 × 800. We choose 11,725 images as the training set and 11,738 images as the testing set. The size of the image is fixed to 512 × 512.

The training strategy is modified to accommodate the new datasets. The entire training process involves 150 epochs for the NWPU and DIOR datasets and 100 epochs for DOTA. The batch size is 16 for DOTA and DIOR and 8 for NWPU. To verify the superiority of the SuperYOLO proposed in this paper, we selected 11 generic methods for comparison: one-stage algorithms (YOLOv3 [47], FCOS [53], ATSS [54], RetinaNet [51], GFL [52]); a two-stage method (Faster R-CNN [5]); lightweight models (MobileNetV2 [55] and ShuffleNet [56]); a distillation-based method (ARSD [59]); and remote sensing designed approaches (FMSSD [58] and O2-DNet [57]).

As presented in Table VIII, our SuperYOLO achieves the optimal detection results (69.99%, 93.30%, and 71.82% mAP50), and its model parameters (7.70 M, 7.68 M, and 7.70 M) and GFLOPs (20.89, 20.86, and 20.93) are much smaller than those of the other SOTA detectors, regardless of whether they are two-stage, one-stage, lightweight, or distillation-based methods. The PANet structure and three detectors are responsible for enhancing small-scale, middle-scale, and large-scale target detection in consideration of the big objects, such as playgrounds, in these three datasets; hence the model parameters of SuperYOLO are larger than those in Table VII. We also compare two detectors designed for remote sensing imagery, FMSSD [58] and O2-DNet [57]. Although these models have performance close to our lightweight model, their much larger parameter counts and GFLOPs incur a massive cost in computation resources. Hence, our model strikes a better balance between detection efficiency and efficacy.
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. X, NO. X, 2023 13

TABLE VIII
Performance of Different Algorithms on the DOTA, NWPU, and DIOR Testing Sets.

Method | DOTA-v1.0: mAP50, Params(M), GFLOPs | NWPU: mAP50, Params(M), GFLOPs | DIOR: mAP50, Params(M), GFLOPs
Faster R-CNN [5] 60.64 60.19 289.25 77.80 41.17 127.70 54.10 60.21 182.20
RetinaNet [51] 50.39 55.39 293.36 89.40 36.29 123.27 65.70 55.49 180.62
YOLOv3 [47] 60.00 61.63 198.92 88.30 61.57 121.27 57.10 61.95 122.22
GFL [52] 66.53 19.13 159.18 88.80 19.13 91.73 68.00 19.13 97.43
FCOS [53] 67.72 31.57 202.15 89.65 31.86 116.63 67.60 31.88 123.51
ATSS [54] 66.84 18.97 156.01 90.50 18.96 89.90 67.70 18.98 95.50
MobileNetV2 [55] 56.91 10.30 124.24 76.90 10.29 71.49 58.20 10.32 76.10
ShuffleNet [56] 57.73 12.11 142.60 83.00 12.10 82.17 61.30 12.12 87.31
O2-DNet [57] 71.10 209.0 - - - - 68.3 209.0 -
FMSSD [58] 72.43 136.04 - - - - 69.5 136.03 -
ARSD [59] 68.28 13.08 68.03 90.92 11.57 26.65 70.10 13.10 41.60
SuperYOLO 69.99 7.70 20.89 93.30 7.68 20.86 71.82 7.70 20.93

VI. CONCLUSION AND FUTURE WORK

In this paper, we have presented SuperYOLO, a real-time lightweight network built on top of the widely used YOLOv5s to improve the detection of small objects in RSI. First, we have modified the baseline network by removing the Focus module to avoid resolution degradation, which significantly improves the baseline and reduces missed detections of small objects. Second, we have studied the fusion of multimodal data, based on mutual information, to further improve detection performance. Lastly, and most importantly, we have introduced a simple and flexible SR branch that helps the backbone construct HR feature representations, by which small objects can be easily recognized from vast backgrounds with merely an LR input. The SR branch is removed in the inference stage, so detection is accomplished without changing the original structure of the network and without increasing its GFLOPs. With the joint contributions of these ideas, the proposed SuperYOLO achieves 75.09% mAP50 at a lower computation cost on the VEDAI dataset, which is 18.30% higher than YOLOv5s and more than 12.44% higher than YOLOv5x.

The performance and inference efficiency of our proposal highlight the value of SR in remote sensing tasks, paving the way for future studies of multimodal object detection. Our future work will focus on designing a low-parameter model for extracting HR features, thereby further satisfying real-time and high-accuracy requirements.
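To make the train-only SR branch concrete, the following PyTorch sketch illustrates the general pattern of an auxiliary super-resolution head that shares the detection backbone, is supervised only during training, and is skipped at inference. The layer sizes, channel counts, and loss terms here are illustrative assumptions and do not reproduce the exact SuperYOLO architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DetectorWithSRBranch(nn.Module):
    """Illustrative pattern only: a detection backbone with an auxiliary SR head
    that is supervised during training and skipped at inference."""

    def __init__(self, in_ch: int = 4, num_outputs: int = 16, scale: int = 2):
        super().__init__()
        # Shared backbone on the low-resolution (LR) input (e.g., stacked RGB + IR).
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),
        )
        # Detection head (stand-in for a YOLO-style head).
        self.det_head = nn.Conv2d(64, num_outputs, 1)
        # Auxiliary SR head: reconstructs an HR image from the shared features,
        # encouraging them to retain high-resolution detail.
        self.sr_head = nn.Sequential(
            nn.Conv2d(64, 3 * (2 * scale) ** 2, 3, padding=1),
            nn.PixelShuffle(2 * scale),  # stride-2 backbone + x2 SR => 4x upsampling
        )

    def forward(self, x: torch.Tensor):
        feats = self.backbone(x)
        det = self.det_head(feats)
        if self.training:            # SR branch is only evaluated during training
            sr = self.sr_head(feats)
            return det, sr
        return det                   # inference: no extra computation from SR

# Training step (sketch): detection loss plus an L1 SR loss against an HR reference.
model = DetectorWithSRBranch()
lr_img = torch.randn(2, 4, 128, 128)   # LR multimodal input
hr_img = torch.randn(2, 3, 256, 256)   # HR reference supervising the SR head
det_out, sr_out = model(lr_img)
loss = det_out.mean() + F.l1_loss(sr_out, hr_img)  # placeholder detection loss
loss.backward()

model.eval()
with torch.no_grad():
    det_only = model(lr_img)           # same detector structure, no SR cost

In a deployment setting the sr_head submodule can also be deleted outright before export, so that neither its parameters nor its computation appear in the inference-time model.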
REFERENCES

[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2014, pp. 580–587.
[2] R. Girshick, "Fast R-CNN," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1440–1448.
[3] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 779–788.
[4] P. Tang, X. Wang, X. Bai, and W. Liu, "Multiple instance detection network with online instance classifier refinement," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 3059–3067.
[5] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, 2016.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2009, pp. 248–255.
[7] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 740–755.
[8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, 2010.
[9] Z. Zheng, Y. Zhong, J. Wang, and A. Ma, "Foreground-aware relation network for geospatial object segmentation in high spatial resolution remote sensing imagery," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2020, pp. 4096–4105.
[10] J. Pang, C. Li, J. Shi, Z. Xu, and H. Feng, "R2-CNN: Fast tiny object detection in large-scale remote sensing images," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 8, pp. 5512–5524, 2019.
[11] Z. Deng, H. Sun, S. Zhou, J. Zhao, L. Lei, and H. Zou, "Multi-scale object detection in remote sensing imagery with convolutional neural networks," ISPRS J. Photogramm. Remote Sens., vol. 145, pp. 3–22, 2018.
[12] J. Ding, N. Xue, Y. Long, G. Xia, and Q. Lu, "Learning RoI transformer for oriented object detection in aerial images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 2844–2853.
[13] Z. Liu, H. Wang, L. Weng, and Y. Yang, "Ship rotated bounding box space for ship extraction from high-resolution optical satellite images with complex backgrounds," IEEE Geosci. Remote Sens. Lett., vol. 13, no. 8, pp. 1074–1078, 2016.
[14] D. Hong, L. Gao, N. Yokoya, J. Yao, J. Chanussot, Q. Du, and B. Zhang, "More diverse means better: Multimodal deep learning meets remote-sensing imagery classification," IEEE Trans. Geosci. Remote Sens., vol. 59, no. 5, pp. 4340–4354, 2021.
[15] Z. Wang, K. Jiang, P. Yi, Z. Han, and Z. He, "Ultra-dense GAN for satellite imagery super-resolution," Neurocomputing, vol. 398, pp. 328–337, 2020.
[16] M. T. Razzak, G. Mateo-García, G. Lecuyer, L. Gómez-Chova, Y. Gal, and F. Kalaitzis, "Multi-spectral multi-image super-resolution of Sentinel-2 with radiometric consistency losses and its effect on building delineation," ISPRS J. Photogramm. Remote Sens., vol. 195, pp. 1–13, 2023.
[17] K. Jiang, Z. Wang, P. Yi, G. Wang, T. Lu, and J. Jiang, "Edge-enhanced GAN for remote sensing image super-resolution," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 8, pp. 5799–5812, 2019.
[18] Y. Xiao, X. Su, Q. Yuan, D. Liu, H. Shen, and L. Zhang, "Satellite video super-resolution via multiscale deformable convolution alignment and temporal grouping projection," IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1–19, 2021.
[19] G. Jocher et al., "ultralytics/yolov5: v5.0," 2021. [Online]. Available: https://github.com/ultralytics/yolov5

[20] S. Zhang, M. Chen, J. Chen, F. Zou, Y.-F. Li, and P. Lu, "Multimodal feature-wise co-attention method for visual question answering," Inf. Fusion, vol. 73, pp. 1–10, 2021.
[21] Y. Chen, J. Shi, C. Mertz, S. Kong, and D. Ramanan, "Multimodal object detection via Bayesian fusion," arXiv, 2021. [Online]. Available: https://arxiv.org/abs/2104.02904
[22] Q. Chen, K. Fu, Z. Liu, G. Chen, H. Du, B. Qiu, and L. Shao, "EF-Net: A novel enhancement and fusion network for RGB-D saliency detection," Pattern Recognit., vol. 112, p. 107740, 2021.
[23] H. Zhu, M. Ma, W. Ma, L. Jiao, S. Hong, J. Shen, and B. Hou, "A spatial-channel progressive fusion ResNet for remote sensing classification," Inf. Fusion, vol. 70, pp. 72–87, 2021.
[24] Y. Sun, Z. Fu, C. Sun, Y. Hu, and S. Zhang, "Deep multimodal fusion network for semantic segmentation using remote sensing image and LiDAR data," IEEE Trans. Geosci. Remote Sens., 2021.
[25] W. Li, Y. Gao, M. Zhang, R. Tao, and Q. Du, "Asymmetric feature fusion network for hyperspectral and SAR image classification," IEEE Trans. Neural Netw. Learn. Syst., 2022.
[26] Y. Gao, W. Li, M. Zhang, J. Wang, W. Sun, R. Tao, and Q. Du, "Hyperspectral and multispectral classification for coastal wetland using depthwise feature interaction network," IEEE Trans. Geosci. Remote Sens., 2021.
[27] M. Sharma, M. Dhanaraj, S. Karnam, D. G. Chachlakis, R. Ptucha, P. P. Markopoulos, and E. Saber, "YOLOrs: Object detection in multimodal remote sensing imagery," IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., vol. 14, pp. 1497–1508, 2021.
[28] L. Gómez-Chova, D. Tuia, G. Moser, and G. Camps-Valls, "Multimodal classification of remote sensing images: A review and future directions," Proc. IEEE, vol. 103, no. 9, pp. 1560–1584, 2015.
[29] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 2117–2125.
[30] C. Li, T. Yang, S. Zhu, C. Chen, and S. Guan, "Density map guided object detection in aerial images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), 2020, pp. 190–191.
[31] C. Chen, M.-Y. Liu, O. Tuzel, and J. Xiao, "R-CNN for small object detection," in Proc. Asian Conf. Comput. Vis., 2017, pp. 214–230.
[32] J. Noh, W. Bae, W. Lee, J. Seo, and G. Kim, "Better to follow, follow to be better: Towards precise supervision of feature super-resolution for small object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 9725–9734.
[33] M. Haris, G. Shakhnarovich, and N. Ukita, "Task-driven super resolution: Object detection in low-resolution images," arXiv, 2018. [Online]. Available: https://arxiv.org/abs/1803.11316
[34] J. Shermeyer and A. Van Etten, "The effects of super-resolution on object detection performance in satellite imagery," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), 2019, pp. 1432–1441.
[35] L. Courtrai, M. Pham, and S. Lefèvre, "Small object detection in remote sensing images based on super-resolution with auxiliary generative adversarial networks," Remote Sens., vol. 12, no. 19, p. 3152, 2020.
[36] J. Rabbi, N. Ray, M. Schubert, S. Chowdhury, and D. Chao, "Small-object detection in remote sensing images with end-to-end edge-enhanced GAN and object detector network," Remote Sens., vol. 12, no. 9, p. 1432, 2020.
[37] H. Ji, Z. Gao, T. Mei, and B. Ramesh, "Vehicle detection in remote sensing images leveraging on simultaneous super-resolution," IEEE Geosci. Remote Sens. Lett., vol. 17, no. 4, pp. 676–680, 2019.
[38] L. Wang, D. Li, Y. Zhu, L. Tian, and Y. Shan, "Dual super-resolution learning for semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2020, pp. 3773–3782.
[39] C. Wang, H. Mark Liao, Y. Wu, P. Chen, J. Hsieh, and I. Yeh, "CSPNet: A new backbone that can enhance learning capability of CNN," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), 2020, pp. 1571–1580.
[40] S. Elfwing, E. Uchibe, and K. Doya, "Sigmoid-weighted linear units for neural network function approximation in reinforcement learning," Neural Netw., vol. 107, pp. 3–11, 2018.
[41] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, 2015.
[42] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 7132–7141.
[43] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, "Enhanced deep residual networks for single image super-resolution," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), 2017, pp. 136–144.
[44] H. Zhao, O. Gallo, I. Frosio, and J. Kautz, "Loss functions for image restoration with neural networks," IEEE Trans. Comput. Imaging, vol. 3, no. 1, pp. 47–57, 2016.
[45] S. Razakarivony and F. Jurie, "Vehicle detection in aerial imagery: A small target detection benchmark," J. Vis. Commun. Image Represent., vol. 34, pp. 187–203, 2016.
[46] L. Bottou, "Large-scale machine learning with stochastic gradient descent," in Proc. 19th Int. Conf. Comput. Statist., 2010, pp. 177–186.
[47] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv, 2018. [Online]. Available: https://arxiv.org/abs/1804.02767
[48] A. Bochkovskiy, C. Wang, and H. M. Liao, "YOLOv4: Optimal speed and accuracy of object detection," arXiv, 2020. [Online]. Available: https://arxiv.org/abs/2004.10934v1
[49] M.-T. Pham, L. Courtrai, C. Friguet, S. Lefèvre, and A. Baussard, "YOLO-Fine: One-stage detector of small objects under various backgrounds in remote sensing images," Remote Sens., vol. 12, no. 15, p. 2501, 2020.
[50] F. Qingyun and W. Zhaokui, "Cross-modality attentive feature fusion for object detection in multispectral remote sensing imagery," Pattern Recognit., vol. 130, p. 108786, 2022.
[51] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 2980–2988.
[52] X. Li, W. Wang, L. Wu, S. Chen, X. Hu, J. Li, J. Tang, and J. Yang, "Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection," in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 21002–21012.
[53] Z. Tian, C. Shen, H. Chen, and T. He, "FCOS: Fully convolutional one-stage object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 9627–9636.
[54] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li, "Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2020, pp. 9759–9768.
[55] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 4510–4520.
[56] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 6848–6856.
[57] H. Wei, Y. Zhang, Z. Chang, H. Li, H. Wang, and X. Sun, "Oriented objects as pairs of middle lines," ISPRS J. Photogramm. Remote Sens., vol. 169, pp. 268–279, 2020.
[58] P. Wang, X. Sun, W. Diao, and K. Fu, "FMSSD: Feature-merged single-shot detection for multiscale objects in large-scale remote sensing imagery," IEEE Trans. Geosci. Remote Sens., vol. 58, no. 5, pp. 3377–3390, 2020.
[59] Y. Yang, X. Sun, W. Diao, H. Li, Y. Wu, X. Li, and K. Fu, "Adaptive knowledge distillation for lightweight remote sensing object detectors optimizing," IEEE Trans. Geosci. Remote Sens., 2022.
