Abstract—This is the pre-acceptance version; to read the final version please go to IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING on IEEE Xplore. Accurately and
to the application of large-scale natural datasets with accurate annotations [6], [7], [8]. Compared with natural scenarios, there are several vital
Fig. 1. Visual comparison of an RGB image, an IR image, and the ground truth (GT). The IR image provides vital complementary information for resolving the challenges in RGB detection. The object car in (a) is considerably small within a vast area. In (b), the objects have large scale variation, where the scale of a car is smaller than that of a camping vehicle. The fusion of RGB and IR modalities effectively enhances detection performance.
Exploiting the complementary characteristics in different modalities, we propose a multimodal fusion (MF) scheme to improve the detection performance for RSI. We evaluate different fusion alternatives (pixel-level or feature-level) and choose pixel-level fusion for its low computation cost.

Lastly and most importantly, we develop a super resolution (SR) assisted module to guide the network to generate HR features that are capable of identifying small objects in vast backgrounds, thereby reducing false alarms induced by background-contaminated objects in RSI. Nevertheless, a naive SR solution can significantly increase the computation cost. Therefore, we engage the auxiliary SR branch only in the training process and remove it in the inference stage, facilitating spatial information extraction in HR without increasing the computation cost.

In summary, this paper makes the following contributions.
• We propose a computation-friendly pixel-level fusion method to combine inner information bi-directionally in a symmetric and compact manner. It efficiently decreases the computation cost without sacrificing accuracy compared with feature-level fusion.
• We introduce an assisted SR branch into multimodal object detection for the first time. Our approach not only makes a breakthrough in limited detection performance but also paves a more flexible way to study outstanding HR feature representations that are capable of discriminating small objects from vast backgrounds with LR input.
• Considering the demand for high-quality results and low computation cost, the SR module functioning as an auxiliary task is removed during the inference stage without introducing additional computation. The SR branch is general and extensible and can be inserted in the existing fully convolutional network (FCN) framework.
• The proposed SuperYOLO markedly improves the performance of object detection, outperforming SOTA detectors in real-time multimodal object detection. Our proposed model shows a favorable accuracy-speed trade-off compared to the state-of-the-art models.

II. RELATED WORK

A. Object Detection with Multimodal Data

Recently, multimodal data has been widely leveraged in numerous practical application scenarios, including visual question answering [20], auto-pilot vehicles [21], saliency detection [22], and remote sensing classification [23]. Combining the internal information of multimodal data can efficiently transfer complementary features and avoid the information of a single modality being omitted. In the field of RSI processing, there exist various modalities (e.g., Red-Green-Blue (RGB), Synthetic Aperture Radar (SAR), Light Detection and Ranging (LiDAR), Infrared (IR), panchromatic (PAN) and multispectral (MS) images) from diverse sensors, which can be fused with complementary characteristics to enhance the performance of various tasks [24], [25], [26]. For example, the additional IR modality [27] captures longer thermal wavelengths to improve detection under difficult weather conditions. Sharma et al. [27] proposed a real-time framework for object detection in multimodal remote sensing imagery, in which an extended version conducted mid-level fusion and merged data from multiple modalities. Although multi-sensor fusion can enhance detection performance, as shown in Fig. 1, the limited detection accuracy and computing speed of existing methods can hardly meet the requirements of real-time detection tasks.
Fig. 2. The overview of the proposed SuperYOLO framework. Our new contributions include 1) removal of the Focus module to preserve high resolution, 2) multimodal fusion, and 3) an assisted SR branch. The architecture is optimized in terms of the Mean Square Error (MSE) loss for the SR branch and the task-specific loss for object detection. During the training stage, the SR branch guides the learning in the spatial dimension to enhance the preservation of high-resolution information in the backbone. During the test stage, the SR branch is removed, so the inference speed is equal to that of the baseline.
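The caption above summarizes how the two objectives are combined during training. As a rough, hedged sketch (not the authors' released code), the snippet below adds an auxiliary SR reconstruction term to the detection loss and simply skips the SR branch at inference; the callables `backbone`, `head`, `sr_branch`, `detection_loss_fn`, the weight `sr_weight`, and the choice of an L1 reconstruction term are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def training_step(backbone, head, sr_branch, x_lr, targets, hr_image,
                  detection_loss_fn, sr_weight=1.0):
    """One hedged training step: detection loss + auxiliary SR reconstruction loss."""
    feats = backbone(x_lr)                      # multi-level features from the LR input
    preds = head(feats)                         # detection predictions (cls / reg / obj)
    l_det = detection_loss_fn(preds, targets)   # task-specific detection loss

    sr_out = sr_branch(feats)                   # reconstructed HR image (training only)
    l_sr = F.l1_loss(sr_out, hr_image)          # reconstruction term (L1 or MSE)

    return l_det + sr_weight * l_sr

@torch.no_grad()
def inference(backbone, head, x_lr):
    """At test time the SR branch is dropped, so the cost equals the detection baseline."""
    return head(backbone(x_lr))
```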
The fusion methods are primarily grouped into three strategies, i.e., pixel-level fusion, feature-level fusion, and decision-level fusion [28]. Decision-level fusion methods fuse the detection results in the last stage, which may consume enormous computation resources due to repeated calculations in the different multimodal branches. In the field of remote sensing, feature-level fusion methods with multiple branches are mainly adopted. The multimodal images are fed into parallel branches to extract independent features for each modality, and these features are then combined by operations such as attention modules or simple concatenation. The parallel branches introduce repeated computation as the number of modalities increases, which is unfriendly to real-time tasks in remote sensing.

In contrast, the adoption of pixel-level fusion methods can reduce unnecessary computation. In this paper, the proposed SuperYOLO fuses the modalities at the pixel level to significantly reduce the computation cost, and designs operations in the spatial and channel domains to extract inner information from the different modalities, which helps enhance detection accuracy.

B. Super Resolution in Object Detection

In recent literature, the performance of small object detection can be improved by multi-scale feature learning [29], [30] and context-based detection [31]. These methods enhance the information representation ability of the network at different scales but ignore the preservation of high-resolution contextual information. Conducted as a pre-processing step, SR has proven to be effective and efficient in various object detection tasks [32], [33]. Shermeyer et al. [34] quantified the effect of SR on the detection performance in satellite imagery across multiple resolutions of RSI. Based on generative adversarial networks (GANs), Courtrai et al. [35] utilized SR to generate HR images, which were fed into the detector to improve its detection performance. Rabbi et al. [36] leveraged a Laplacian operator to extract edges from the input image to enhance the capability of reconstructing HR images, thus improving performance in object localization and classification. Ji et al. [37] introduced a cycle-consistent GAN structure as an SR network and a modified Faster R-CNN architecture to detect vehicles from the enhanced images produced by the SR network. In these works, the adoption of the SR structure has effectively addressed the challenges regarding small objects. However, compared with single detection models, additional computation is introduced, which is attributed to the enlarged scale of the input image required by the HR design.

Recently, Wang et al. [38] proposed an SR module that can maintain HR representations with LR input while reducing the model computation in segmentation tasks. Inspired by [38], we design an SR-assisted branch. In contrast to the aforementioned works, in which SR is applied at the input stage, the assisted SR module guides the learning of high-quality HR representations for the detector, which not only strengthens the response of small dense objects but also improves the performance of object detection in the spatial space. Moreover, the SR module is removed in the inference stage to avoid extra computation.
Fig. 3. The backbone structure of YOLOv5s. The low-level texture and high-level semantic features are extracted by stacked CSP, CBS, and SPP structures.
III. BASELINE ARCHITECTURE

As shown in Fig. 2, the baseline YOLOv5 network consists of two main components: the Backbone and the Head (including the Neck). The Backbone is designed to extract low-level texture and high-level semantic features. Next, these features are fed to the Head to construct an enhanced feature pyramid network, which transfers robust semantic features from top to bottom and propagates a strong response of local texture and pattern features from bottom to top. This alleviates the scale-variation issue of objects and enhances detection at diverse scales.

In Fig. 3, CSPNet [39] is utilized as the Backbone to extract the feature information, consisting of numerous simple Convolution-Batch normalization-SiLU (CBS) components and Cross Stage Partial (CSP) modules. The CBS is composed of a convolution, batch normalization, and the SiLU activation function [40]. The CSP duplicates the feature map of the previous layer into two branches and then halves the channel numbers through a 1 × 1 convolution, by which the computation is reduced. With respect to the two copies of the feature map, one is connected to the end of the stage, and the other is sent into ResNet blocks or CBS blocks as the input. Finally, the two copies of the feature map are concatenated to combine the features, which is followed by a CBS block. The SPP (Spatial Pyramid Pooling) module [41] is composed of parallel Maxpool layers with different kernel sizes and is utilized to extract multiscale deep features. The low-level texture and high-level semantic features are extracted by stacked CSP, CBS, and SPP structures.

Limitation 1: It is worth mentioning that the Focus module is introduced to decrease the amount of computation. As shown in Fig. 2 (bottom left), the input is partitioned into individual pixels, reorganized at intervals, and finally concatenated in the channel dimension. The input is thereby resized to a smaller scale to reduce the computation cost and accelerate network training and inference. However, this may sacrifice object detection accuracy to a certain extent, especially for small objects that are vulnerable to resolution.

Limitation 2: It is known that the backbone of YOLO employs deep convolutional neural networks to extract hierarchical features with a stride of 2, through which the size of the extracted features is halved. Hence, the feature size retained for multiscale detection is far smaller than that of the original input image. For example, when the input image size is 608, the sizes of the output features for the detection layers are 76, 38, and 19, respectively. LR features may result in some small objects being missed.

IV. SUPERYOLO ARCHITECTURE

As summarized in Fig. 2, we introduce three new contributions to our SuperYOLO network architecture. First, we remove the Focus module in the Backbone and replace it with an MF module to avoid resolution degradation and thus accuracy degradation. Second, we explore different fusion methods and choose the computation-efficient pixel-level fusion to fuse the RGB and IR modalities and refine dissimilar and complementary information. Finally, we add an assisted SR module in the training stage, which reconstructs the HR images to guide the related Backbone learning in the spatial dimension and thus maintain HR information. In the inference stage, the SR branch is discarded to avoid introducing additional computation overhead.

A. Focus Removal

As presented in Section III and Fig. 2 (bottom left), the Focus module in the YOLOv5 backbone partitions the image at intervals in the spatial domain and then reorganizes the pixels to resize the input. Specifically, this operation collects one value from every group of pixels and reconstructs them into smaller complementary images. The size of the rebuilt images decreases as the number of channels increases. As a result, it causes resolution degradation and spatial information loss for small targets. Considering that the detection of small targets depends more heavily on higher resolution, the Focus module is abandoned and replaced by an MF module (shown in Fig. 4) to prevent the resolution from being degraded.
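To make Limitation 1 concrete, the snippet below reproduces the slicing-and-concatenation pattern used by the standard YOLOv5 Focus layer (before its convolution): every 2 × 2 pixel group is split across the channel dimension, so a C × H × W input becomes 4C × (H/2) × (W/2) and half of the spatial resolution is lost before any feature is extracted. This is an illustrative sketch of the generic operation, not the SuperYOLO code itself.

```python
import torch

def focus_slice(x: torch.Tensor) -> torch.Tensor:
    """Space-to-depth rearrangement used by the YOLOv5 Focus module.

    x: (N, C, H, W) with even H and W  ->  (N, 4C, H/2, W/2)
    """
    return torch.cat([x[..., ::2, ::2],     # top-left pixel of each 2x2 group
                      x[..., 1::2, ::2],    # bottom-left
                      x[..., ::2, 1::2],    # top-right
                      x[..., 1::2, 1::2]],  # bottom-right
                     dim=1)

x = torch.randn(1, 3, 608, 608)
print(focus_slice(x).shape)  # torch.Size([1, 12, 304, 304]) -- resolution halved
```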
Fig. 4. The architecture of the multimodal fusion (MF) module at the pixel level.
B. Multimodal Fusion

The more information is utilized to distinguish objects, the better performance can be achieved in object detection. Multimodal fusion is an effective path for merging different information from various sensors. Decision-level, feature-level, and pixel-level fusion are the three mainstream fusion methods, which can be deployed at different depths of the network. Since decision-level fusion requires enormous computation, it is not considered in SuperYOLO.

We propose a pixel-level multimodal fusion (MF) to extract the shared and specific information from the different modalities. The MF can combine multimodal inner information bi-directionally in a symmetric and compact manner. As shown in Fig. 4, for the pixel-level fusion, we first normalize an input RGB image and an input IR image to the interval [0, 1]. The input modalities X_RGB, X_IR ∈ R^(C×H×W) are subsampled to I_RGB, I_IR ∈ R^(C×(H/n)×(W/n)), which are fed to SE blocks that extract inner information in the channel domain [42] to generate F_RGB and F_IR:

F_RGB = SE(I_RGB), F_IR = SE(I_IR).   (1)

Then the attention map that reveals the inner relationship of the different modalities in the spatial domain is defined as:

m_IR = f_1(F_IR), m_RGB = f_2(F_RGB),   (2)

where f_1 and f_2 represent 1 × 1 convolutions for the RGB and IR modalities, respectively. Here, ⊗ denotes element-wise matrix multiplication. Inner spatial information between the different modalities is produced by:

F_in1 = m_RGB ⊗ F_RGB, F_in2 = m_IR ⊗ F_IR.   (3)

To incorporate internal inner-view information and spatial texture information, the features are added to the original input modalities and then fed into 1 × 1 convolutions. Then the full features are:

F_ful1 = f_3(F_in1 + I_RGB), F_ful2 = f_4(F_in2 + I_IR),   (4)

where f_3 and f_4 represent 1 × 1 convolutions. Finally, the features are fused by:

F_o = SE(Concat(F_ful1, F_ful2)),   (5)

where Concat(·) denotes the concatenation operation along the channel axis. The result is then fed to the Backbone to produce multi-level features. Note that X is subsampled to 1/n of the original image size to enable the SR module discussed in Section IV-C and to accelerate the training process. X represents the RGB or IR modality, and the subsampled image is denoted as I ∈ R^(C×(H/n)×(W/n)).
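The following is a minimal PyTorch sketch of Eqs. (1)-(5), assuming a standard squeeze-and-excitation block for the channel attention; the channel count and reduction ratio are illustrative choices, not the authors' exact settings.

```python
import torch
import torch.nn as nn

class SE(nn.Module):
    """Standard squeeze-and-excitation block [42] used for channel attention."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.fc = nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(inplace=True),
                                nn.Linear(hidden, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))        # squeeze over the spatial dimensions
        return x * w[:, :, None, None]         # re-weight the channels

class MF(nn.Module):
    """Pixel-level multimodal fusion following Eqs. (1)-(5); a sketch, not the released code."""
    def __init__(self, channels: int = 3):
        super().__init__()
        self.se_rgb, self.se_ir = SE(channels), SE(channels)
        self.f1 = nn.Conv2d(channels, channels, 1)   # 1x1 conv producing m_IR
        self.f2 = nn.Conv2d(channels, channels, 1)   # 1x1 conv producing m_RGB
        self.f3 = nn.Conv2d(channels, channels, 1)
        self.f4 = nn.Conv2d(channels, channels, 1)
        self.se_out = SE(2 * channels)

    def forward(self, i_rgb, i_ir):
        f_rgb, f_ir = self.se_rgb(i_rgb), self.se_ir(i_ir)        # Eq. (1)
        m_ir, m_rgb = self.f1(f_ir), self.f2(f_rgb)               # Eq. (2)
        f_in1, f_in2 = m_rgb * f_rgb, m_ir * f_ir                 # Eq. (3)
        f_ful1 = self.f3(f_in1 + i_rgb)                           # Eq. (4)
        f_ful2 = self.f4(f_in2 + i_ir)
        return self.se_out(torch.cat([f_ful1, f_ful2], dim=1))    # Eq. (5)
```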
Fig. 5. The super resolution (SR) structure of SuperYOLO. The SR structure can be regarded as a simple encoder-decoder model. The low-level and high-level features of the backbone are selected to fuse local texture patterns and semantic information, respectively.
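Based on Fig. 5, the SR branch can be read as a light encoder-decoder over two backbone feature levels. The sketch below is an assumed arrangement (channel widths, the number of EDSR-style residual blocks, and the upsampling factor are illustrative), not the released implementation.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """EDSR-style residual block [43] used in the decoder."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                                  nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class SRBranch(nn.Module):
    """Auxiliary SR branch: fuse low/high-level features, decode, upsample to HR (training only)."""
    def __init__(self, c_low, c_high, mid=64, scale=4, out_ch=3, n_blocks=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(c_low + c_high, mid, 3, padding=1),
                                     nn.ReLU(inplace=True))
        self.decoder = nn.Sequential(*[ResBlock(mid) for _ in range(n_blocks)],
                                     nn.Conv2d(mid, out_ch, 3, padding=1))
        self.scale = scale

    def forward(self, f_low, f_high):
        # bring the deeper (smaller) feature map up to the low-level resolution, then concatenate
        f_high = nn.functional.interpolate(f_high, size=f_low.shape[-2:],
                                           mode='bilinear', align_corners=False)
        x = self.encoder(torch.cat([f_low, f_high], dim=1))
        x = self.decoder(x)
        # final upsampling to the HR target; the reconstruction is supervised with an L1 loss
        return nn.functional.interpolate(x, scale_factor=self.scale,
                                         mode='bilinear', align_corners=False)
```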
Fig. 6. Feature-level visualization of backbone for YOLOv5s, YOLOv5x and SuperYOLO with the same input: (a) RGB input, (b) IR input; (c), (d), and (e)
are the features of YOLOv5s; (f), (g), and (h) are the features of YOLOv5x; (i), (j) and (k) are the features of SuperYOLO. The features are upsampled to
the same scale as the input image for comparison. (c), (f) and (i) are the features in the first layer. (d), (g) and (j) are the low-level features. (e), (h) and (k)
are the high-level features in layers at the same depth.
TABLE I
The Comparison Results of Model Size and Inference Ability in Different Baseline YOLO Frameworks on the First Fold of the VEDAI Validation Set.

Method     Layers ↓   Params ↓   GFLOPs ↓   mAP50 ↑
YOLOv3     270        61.5M      52.8       62.6
YOLOrs     241        20.2M      46.4       55.8
YOLOv4     393        52.5M      38.2       65.7
YOLOv5s    224        7.1M       5.32       62.2
YOLOv5m    308        21.1M      16.1       64.5
YOLOv5l    397        46.6M      36.7       63.9
YOLOv5x    476        87.3M      69.7       64.0

TABLE II
Influence of Removing the Focus Module in the Network on the First Fold of the VEDAI Validation Set.

Method     Variant    Params ↓    GFLOPs ↓   mAP50 ↑
YOLOv5s    Focus      7.0739M     5.3        62.2
YOLOv5s    noFocus    7.0705M     20.4       69.5 (+7.3)
YOLOv5m    Focus      21.0677M    16.1       64.5
YOLOv5m    noFocus    21.0625M    63.6       72.2 (+7.7)
YOLOv5l    Focus      46.6406M    36.7       63.7
YOLOv5l    noFocus    46.6337M    145.0      72.5 (+8.8)
YOLOv5x    Focus      87.2487M    69.7       64.0
YOLOv5x    noFocus    87.2400M    276.6      69.2 (+5.2)
coordinate axis of all categories. Hence, the mAP can be calculated by

mAP = AP / N = (∫_0^1 p(r) dr) / N,   (12)

where p denotes Precision, r denotes Recall, and N is the number of categories.

GFLOPs (giga floating-point operations) and parameter size are used to measure the model complexity and computation cost. In addition, PSNR and SSIM are used for image quality evaluation of the SR branch. Generally, higher PSNR and SSIM values represent better quality of the generated image.

TABLE III
The Comparison Result of Pixel-level and Feature-level Fusions in YOLOv5s (noFocus) for the Multimodal Dataset on the First Fold of the VEDAI Validation Set.

Method                            Params ↓   GFLOPs ↓   mAP50 ↑
Pixel-level Fusion (Concat)       7.0705M    20.37      69.5
Pixel-level Fusion (MF)           7.0897M    21.67      70.3
Feature-level Fusion1             7.0887M    21.76      66.0
Feature-level Fusion2             7.0744M    22.04      68.5
Feature-level Fusion3             7.1442M    24.22      64.8
Feature-level Fusion4             7.0870M    24.50      63.8
Multistage Feature-level Fusion   7.7545M    34.56      59.3
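As a worked illustration of Eq. (12), the AP of one class can be approximated by numerically integrating precision over recall, and the mAP by averaging over the N categories; the bookkeeping that produces the precision-recall points from ranked detections is omitted here.

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """AP = integral of p(r) dr over [0, 1], approximated with the trapezoidal rule."""
    order = np.argsort(recall)
    return float(np.trapz(precision[order], recall[order]))

def mean_average_precision(per_class_pr: list) -> float:
    """mAP = (sum of per-class AP) / N, with N the number of categories."""
    aps = [average_precision(r, p) for r, p in per_class_pr]
    return float(np.mean(aps))

# toy example: two classes with hand-made precision-recall points
pr = [(np.array([0.0, 0.5, 1.0]), np.array([1.0, 0.8, 0.6])),
      (np.array([0.0, 0.5, 1.0]), np.array([1.0, 0.6, 0.3]))]
print(mean_average_precision(pr))
```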
TABLE IV
The Influence of Different Resolutions of the Input Image on Network Performance on the First Fold of the VEDAI Validation Set.

Method                  Train-Val Size   Test Size   Params ↓   GFLOPs ↓   mAP50 ↑
YOLOv5s                 512              512         7.0739M    5.3        62.2
YOLOv5s                 512              1024        7.0739M    21.3       10.6
YOLOv5s                 1024             1024        7.0739M    21.3       77.7
YOLOv5s                 1024             512         7.0739M    5.3        48.2
YOLOv5s (noFocus)       512              512         7.0705M    20.4       69.5
YOLOv5s (noFocus)       512              1024        7.0705M    81.5       13.4
YOLOv5s (noFocus)       1024             1024        7.0705M    81.5       79.3
YOLOv5s (noFocus)       1024             512         7.0705M    20.4       62.9
YOLOv5s (noFocus)+SR    512              512         7.0705M    20.4       78.0
Fig. 7. The feature-level fusion of different blocks in the latent layers. Fusion-n represents the concatenation fusion operation performed in the n-th block. (a) and (b) are feature-level fusion and (c) is multistage feature-level fusion.

concatenation operation are 7.0705M, 20.37 and 69.5%, and those of the pixel-level fusion with the MF module are 7.0897M, 21.67 and 70.3%, which are the best among all the compared methods. There are some reasons why the model parameters of the feature-level fusions are close to those of the pixel-level fusion. First, the feature-level fusion is completed in the latent layers rather than in two whole separate models. Second, the modules before the concatenation fusion are different, so the different fusion channels lead to different parameter counts. However, it can be observed that the calculation cost increases as the fusion layer becomes deeper. In addition, we compare the multistage feature-level fusion (shown in Fig. 7 (c)) with the proposed pixel-level fusion. As shown in TABLE III, the accuracy of multistage feature-level fusion is only 59.3% mAP50, lower than that of pixel-level fusion, while its computation cost is 34.56 GFLOPs with 7.7545M parameters, which is higher than that of pixel-level fusion. These findings suggest that the pixel-level fusion method is more effective than multistage shallow feature-level fusion, because the multiple stages of fusion can lead to the accumulation of redundant information. The above results suggest that pixel-level fusion can accurately detect objects while reducing the computation. Our proposed MF fusion improves detection accuracy at a small additional computation cost. Overall, the proposed method only uses pixel-level fusion to keep the computation cost low.

4) Impact of High Resolution: We compare different training and test modes to explore more possibilities in terms of the input image resolution in TABLE IV. First, we compare cases where the image resolutions of the training set and test set are the same. Comparing the results of YOLOv5s, the detection metric mAP50 is improved from 62.2% to 77.7%, a 15.5% increase, when the image size is doubled from 512 to 1024. Similarly, YOLOv5s-noFocus (1024) outperforms YOLOv5s-noFocus (512) by a 9.8% mAP50 score (79.3% vs. 69.5%). The mean recall and mean precision increase simultaneously, suggesting that preserving resolution reduces the commission and omission errors in object detection. Based on the above analysis, we argue that the characteristics of HR significantly influence the final performance of object detection. However, it is noteworthy that maintaining an HR input image of the network introduces a certain amount of calculation: the GFLOPs with a size of 1024 (high resolution) are higher than those with 512 (low resolution) for both YOLOv5s (21.3 vs. 5.3) and YOLOv5s-noFocus (81.5 vs. 20.4).

As shown in TABLE IV, using different image sizes during the training process (train size) and the test process (test size) results in a reduction of the mAP50 score, i.e., (10.6% vs. 62.2%), (48.2% vs. 77.7%), (13.4% vs. 69.5%) and (62.9% vs. 79.3%). This may be attributed to the inconsistent scale of objects between the test and training processes, where the size of the predicted bounding box is no longer suitable for the objects in the test images.

Finally, the mAP50 of YOLOv5s-noFocus+SR is close to that of YOLOv5s-noFocus HR (1024) (78.0% vs. 79.3%), while its GFLOPs are equal to those of YOLOv5s-noFocus LR (512) (20.4 vs. 20.4). Our proposed network decreases the resolution of input images in the test process to reduce computation and maintains accuracy by keeping the resolution of the training and testing data identical, thereby highlighting the advantage of the proposed SR branch.

5) Impact of Super Resolution Branch: Some ablation experiments on the SR branch are reported in TABLE V. Compared with the upsampling operation, YOLOv5s (noFocus) with the added super resolution network shows favorable performance and obtains an mAP50 1.8% better than the upsampling operation. The SR network is a learnable upsampling method with a stronger reconstruction ability that can help the feature extraction in the backbone for detection. We deleted the PANet structure and two detectors, which are responsible for enhancing middle-scale and large-scale target detection.
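The resolution numbers discussed in this subsection follow directly from the stride-2 downsampling of the backbone and the roughly quadratic dependence of convolutional cost on the input side length; the short sketch below only reproduces that arithmetic (it is an illustration, not a profiler).

```python
def detection_grid_sizes(input_size: int, strides=(8, 16, 32)):
    """Feature-map sizes of the three YOLO detection layers for a square input."""
    return [input_size // s for s in strides]

print(detection_grid_sizes(608))    # [76, 38, 19], as quoted for a 608x608 input
print(detection_grid_sizes(512))    # [64, 32, 16]
print(detection_grid_sizes(1024))   # [128, 64, 32]

# Convolutional FLOPs grow ~quadratically with the input side length,
# which matches the measured jumps (5.3 -> 21.3 and 20.4 -> 81.5 GFLOPs).
print(21.3 / 5.3, 81.5 / 20.4, (1024 / 512) ** 2)   # ~4.0, ~4.0, 4.0
```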
TABLE V
The Ablation Experiment Results About the Influence of the SR Branch on Detection Performance on the First Fold of the VEDAI Validation Set. (Base model: YOLOv5s (noFocus).)

Branch     Small-scale Detector   Decoder (EDSR)   L1 Loss   Params ↓   GFLOPs ↓   mAP50 ↑   PSNR ↑   SSIM ↑
Upsample   -                      -                -         7.0705M    20.37G     76.2      -        -
SR         -                      -                -         7.0705M    20.37G     78.0      -        -
SR         X                      -                -         4.8259M    16.68G     79.0      23.811   0.602
SR         X                      X                -         4.8259M    16.68G     79.9      23.902   0.604
SR         X                      X                X         4.8259M    16.68G     80.9      26.203   0.659
Fig. 8. Visual results of object detection using different methods, involving YOLOv4, YOLOv5s, YOLOv5m and the proposed SuperYOLO. The red circles represent the False Alarms, the yellow ones denote the False Positive detection results, and the blue ones are the False Negative detection results. (a)-(e) are different images in the VEDAI dataset.
F. Generalization to Single-Modality Remote Sensing Images

At present, although there are massive multimodal images in remote sensing, labeled datasets for object detection tasks are lacking due to the expensive cost of manual annotation. To validate the generalization of our proposed network, we compare SuperYOLO with different one-stage and two-stage methods using data from a single modality, including the large-scale Dataset for Object Detection in Aerial images (DOTA), the object DetectIon in Optical Remote sensing images (DIOR) dataset, and the Northwestern Polytechnical University Very-High-Resolution 10-class (NWPU VHR-10) dataset.

1) DOTA: The DOTA dataset was proposed in 2018 for object detection in remote sensing. It contains 2806 large images and 188 282 instances, which are divided into 15 categories. The size of each original image is 4000 × 4000, and the images are cropped into 1024 × 1024 pixels with an overlap of 200 pixels in the experiment. We select half of the original images as the training set, 1/6 as the validation set, and 1/3 as the testing set. The size of the image is fixed to 512 × 512.

2) NWPU VHR-10: The NWPU VHR-10 dataset was proposed in 2016. It contains 800 images, of which 650 contain objects, so we use 520 images as the training set and 130 images as the testing set. The dataset contains 10 categories, and the size of the image is fixed to 512 × 512.
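The DOTA pre-processing described above (4000 × 4000 scenes cropped into 1024 × 1024 tiles with a 200-pixel overlap) can be reproduced with a simple sliding window; the tile bookkeeping and border handling below are assumptions for illustration, not the authors' exact tooling.

```python
import numpy as np

def crop_tiles(image: np.ndarray, tile: int = 1024, overlap: int = 200):
    """Slide a tile x tile window with the given overlap and yield (y, x, patch)."""
    stride = tile - overlap
    h, w = image.shape[:2]
    ys = list(range(0, max(h - tile, 0) + 1, stride))
    xs = list(range(0, max(w - tile, 0) + 1, stride))
    # make sure the right and bottom borders are covered
    if ys[-1] + tile < h:
        ys.append(h - tile)
    if xs[-1] + tile < w:
        xs.append(w - tile)
    for y in ys:
        for x in xs:
            yield y, x, image[y:y + tile, x:x + tile]

scene = np.zeros((4000, 4000, 3), dtype=np.uint8)   # stand-in for a DOTA scene
print(sum(1 for _ in crop_tiles(scene)))             # number of 1024x1024 patches
```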
TABLE VII
Class-wise Average Precision (AP), Mean Average Precision (mAP50), Parameters and GFLOPs for the Proposed SuperYOLO, YOLOv3, YOLOv4, YOLOv5s-x, YOLOrs, YOLO-Fine and YOLOFusion, Including Unimodal and Multimodal Configurations, on the VEDAI Dataset. * Represents Using Pre-trained Weights.
Method Car Pickup Camping Truck Other Tractor Boat Van mAP50 ↑ Params ↓ GFLOPs ↓
IR 80.21 67.03 65.55 47.78 25.86 40.11 32.67 53.33 51.54 61.5351M 49.55
YOLOv3 [47] RGB 83.06 71.54 69.14 59.30 48.93 67.34 33.48 55.67 61.06 61.5351M 49.55
Multi 84.57 72.68 67.13 61.96 43.04 65.24 37.10 58.29 61.26 61.5354M 49.68
IR 80.45 67.88 68.84 53.66 30.02 44.23 25.40 51.41 52.75 52.5082M 38.16
YOLOv4 [48] RGB 83.73 73.43 71.17 59.09 51.66 65.86 34.28 60.32 62.43 52.5082M 38.16
Multi 85.46 72.84 72.38 62.82 48.94 68.99 34.28 54.66 62.55 52.5085M 38.23
IR 77.31 65.27 66.47 51.56 25.87 42.36 21.88 48.88 49.94 7.0728M 5.24
YOLOv5s [19] RGB 80.07 68.01 66.12 51.52 45.76 64.38 21.62 40.93 54.82 7.0728M 5.24
Multi 80.81 68.48 69.06 54.71 46.76 64.29 24.25 45.96 56.79 7.0739M 5.32
IR 79.23 67.32 65.43 51.75 26.66 44.28 26.64 56.14 52.19 21.0659M 16.13
YOLOv5m [19] RGB 81.14 70.26 65.53 53.98 46.78 66.69 36.24 49.87 58.80 21.0659M 16.13
Multi 82.53 72.32 68.41 59.25 46.20 66.23 33.51 57.11 60.69 21.0677M 16.24
IR 80.14 68.57 65.37 53.45 30.33 45.59 27.24 61.87 54.06 46.6383M 36.55
YOLOv5l [19] RGB 81.36 71.70 68.25 57.45 45.77 70.68 35.89 55.42 60.81 46.6383M 36.55
Multi 82.83 72.32 69.92 63.94 48.48 63.07 40.12 56.46 62.16 46.6406M 36.70
IR 79.01 66.72 65.93 58.49 31.39 41.38 31.58 58.98 54.18 87.2458M 69.52
YOLOv5x [19] RGB 81.66 72.23 68.29 59.07 48.47 66.01 39.15 61.85 62.09 87.2458M 69.52
Multi 84.33 72.95 70.09 61.15 49.94 67.35 38.71 56.65 62.65 87.2487M 69.71
IR 82.03 73.92 63.80 54.21 43.99 54.39 21.97 43.38 54.71 - -
YOLOrs [27] RGB 85.25 72.93 70.31 50.65 42.67 76.77 18.65 38.92 57.00 - -
Multi 84.15 78.27 68.81 52.60 46.75 67.88 21.47 57.91 59.73 - -
IR 76.77 74.35 64.74 63.45 45.04 78.12 70.04 77.91 68.18 - -
YOLO-Fine [49]
RGB 79.68 74.49 77.09 80.97 37.33 70.65 60.84 63.56 68.83 - -
IR 86.7 75.9 66.6 77.1 43.0 62.3 70.7 84.3 70.8 - -
YOLOFusion* [50] RGB 91.1 82.3 75.1 78.3 33.3 81.2 71.8 62.2 71.9 - -
Multi 91.7 85.9 78.9 78.1 54.7 71.9 71.7 75.2 75.9 12.5M -
IR 87.90 81.39 76.90 61.56 39.39 60.56 46.08 71.00 65.60 4.8256M 16.61
SuperYOLO RGB 90.30 82.66 76.69 68.55 53.86 79.48 58.08 70.30 72.49 4.8256M 16.61
Multi 91.13 85.66 79.30 70.18 57.33 80.41 60.24 76.50 75.09 4.8451M 17.98
3) DIOR: The DIOR dataset was proposed in 2020 for the task of object detection; it involves 23 463 images and 192 472 instances. The size of each image is 800 × 800. We choose 11 725 images as the training set and 11 738 images as the testing set. The size of the image is fixed to 512 × 512.

The training strategy is modified to accommodate the new datasets. The entire training process involves 150 epochs for the NWPU and DIOR datasets and 100 epochs for DOTA. The batch size is 16 for DOTA and DIOR and 8 for NWPU. To verify the superiority of the SuperYOLO proposed in this paper, we selected 11 generic methods for comparison: one-stage algorithms (YOLOv3 [47], FCOS [53], ATSS [54], RetinaNet [51], GFL [52]); a two-stage method (Faster R-CNN [5]); lightweight models (MobileNetV2 [55] and ShuffleNet [56]); a distillation-based method (ARSD [59]); and approaches designed for remote sensing (FMSSD [58] and O2DNet [57]).

As presented in TABLE VIII, our SuperYOLO achieves the optimal detection results (69.99%, 93.30%, 71.82% mAP50), and its model parameters (7.70 M, 7.68 M, and 7.70 M) and GFLOPs (20.89, 20.86, and 20.93) are much smaller than those of the other SOTA detectors, regardless of whether they are two-stage, one-stage, lightweight, or distillation-based methods. The PANet structure and three detectors are responsible for enhancing small-scale, middle-scale and large-scale target detection in consideration of the big objects, such as playgrounds, in these three datasets. Hence the model parameters of SuperYOLO are larger than those in TABLE VII. We also compare two detectors designed for remote sensing imagery, FMSSD [58] and O2DNet [57]. Although these models achieve performance close to that of our lightweight model, their much larger parameter counts and GFLOPs incur a massive cost in computation resources. Hence, our model achieves a better balance between detection efficiency and efficacy.
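For reference, the single-modality training settings quoted above can be collected in a small configuration table; the key names are hypothetical and only the values come from the text.

```python
# Hypothetical summary of the single-modality training settings quoted above.
TRAIN_CONFIG = {
    "DOTA": {"epochs": 100, "batch_size": 16, "input_size": 512},
    "DIOR": {"epochs": 150, "batch_size": 16, "input_size": 512},
    "NWPU": {"epochs": 150, "batch_size": 8,  "input_size": 512},
}
```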
TABLE VIII
Performance of Different Algorithms on the DOTA, NWPU and DIOR Testing Sets.
VI. CONCLUSION AND FUTURE WORK

In this paper, we have presented SuperYOLO, a real-time lightweight network built on top of the widely used YOLOv5s to improve the detection performance of small objects in RSI. First, we have modified the baseline network by removing the Focus module to avoid resolution degradation, through which the baseline is significantly improved and the missed detection of small objects is reduced. Second, we have conducted research on multimodal fusion to improve the detection performance based on mutual information. Lastly and most importantly, we have introduced a simple and flexible SR branch that facilitates the backbone in constructing HR representation features, by which small objects can be easily recognized from vast backgrounds with merely LR input required. We remove the SR branch in the inference stage, accomplishing the detection without changing the original structure of the network and thus keeping the same GFLOPs. With the joint contributions of these ideas, the proposed SuperYOLO achieves 75.09% mAP50 with lower computation cost on the VEDAI dataset, which is 18.30% higher than that of YOLOv5s, and more than 12.44% higher than that of YOLOv5x.

The performance and inference ability of our proposal highlight the value of SR in remote sensing tasks, paving the way for future study of multimodal object detection. Our future interest will focus on the design of a low-parameter model to extract HR features, thereby further satisfying real-time and high-accuracy requirements.

REFERENCES

[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2014, pp. 580–587.
[2] R. Girshick, "Fast R-CNN," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1440–1448.
[3] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 779–788.
[4] P. Tang, X. Wang, X. Bai, and W. Liu, "Multiple instance detection network with online instance classifier refinement," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 3059–3067.
[5] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, 2016.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2009, pp. 248–255.
[7] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 740–755.
[8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, 2010.
[9] Z. Zheng, Y. Zhong, J. Wang, and A. Ma, "Foreground-aware relation network for geospatial object segmentation in high spatial resolution remote sensing imagery," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2020, pp. 4096–4105.
[10] J. Pang, C. Li, J. Shi, Z. Xu, and H. Feng, "R2-CNN: Fast tiny object detection in large-scale remote sensing images," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 8, pp. 5512–5524, 2019.
[11] Z. Deng, H. Sun, S. Zhou, J. Zhao, L. Lei, and H. Zou, "Multi-scale object detection in remote sensing imagery with convolutional neural networks," ISPRS J. Photogramm. Remote Sens., vol. 145, pp. 3–22, 2018.
[12] J. Ding, N. Xue, Y. Long, G. Xia, and Q. Lu, "Learning RoI transformer for oriented object detection in aerial images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 2844–2853.
[13] Z. Liu, H. Wang, L. Weng, and Y. Yang, "Ship rotated bounding box space for ship extraction from high-resolution optical satellite images with complex backgrounds," IEEE Geosci. Remote Sens. Lett., vol. 13, no. 8, pp. 1074–1078, 2016.
[14] D. Hong, L. Gao, N. Yokoya, J. Yao, J. Chanussot, Q. Du, and B. Zhang, "More diverse means better: Multimodal deep learning meets remote-sensing imagery classification," IEEE Trans. Geosci. Remote Sens., vol. 59, no. 5, pp. 4340–4354, 2021.
[15] Z. Wang, K. Jiang, P. Yi, Z. Han, and Z. He, "Ultra-dense GAN for satellite imagery super-resolution," Neurocomputing, vol. 398, pp. 328–337, 2020.
[16] M. T. Razzak, G. Mateo-García, G. Lecuyer, L. Gómez-Chova, Y. Gal, and F. Kalaitzis, "Multi-spectral multi-image super-resolution of Sentinel-2 with radiometric consistency losses and its effect on building delineation," ISPRS J. Photogramm. Remote Sens., vol. 195, pp. 1–13, 2023.
[17] K. Jiang, Z. Wang, P. Yi, G. Wang, T. Lu, and J. Jiang, "Edge-enhanced GAN for remote sensing image superresolution," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 8, pp. 5799–5812, 2019.
[18] Y. Xiao, X. Su, Q. Yuan, D. Liu, H. Shen, and L. Zhang, "Satellite video super-resolution via multiscale deformable convolution alignment and temporal grouping projection," IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1–19, 2021.
[19] G. Jocher et al., "ultralytics/yolov5: v5.0," 2021. [Online]. Available: https://github.com/ultralytics/yolov5
[20] S. Zhang, M. Chen, J. Chen, F. Zou, Y.-F. Li, and P. Lu, "Multimodal feature-wise co-attention method for visual question answering," Inf. Fusion, vol. 73, pp. 1–10, 2021.
[21] Y. Chen, J. Shi, C. Mertz, S. Kong, and D. Ramanan, "Multimodal object detection via Bayesian fusion," arXiv, 2021. [Online]. Available: https://arxiv.org/abs/2104.02904
[22] Q. Chen, K. Fu, Z. Liu, G. Chen, H. Du, B. Qiu, and L. Shao, "EF-Net: A novel enhancement and fusion network for RGB-D saliency detection," Pattern Recognit., vol. 112, p. 107740, 2021.
[23] H. Zhu, M. Ma, W. Ma, L. Jiao, S. Hong, J. Shen, and B. Hou, "A spatial-channel progressive fusion ResNet for remote sensing classification," Inf. Fusion, vol. 70, pp. 72–87, 2021.
[24] Y. Sun, Z. Fu, C. Sun, Y. Hu, and S. Zhang, "Deep multimodal fusion network for semantic segmentation using remote sensing image and LiDAR data," IEEE Trans. Geosci. Remote Sens., 2021.
[25] W. Li, Y. Gao, M. Zhang, R. Tao, and Q. Du, "Asymmetric feature fusion network for hyperspectral and SAR image classification," IEEE Trans. Neural Netw. Learn. Syst., 2022.
[26] Y. Gao, W. Li, M. Zhang, J. Wang, W. Sun, R. Tao, and Q. Du, "Hyperspectral and multispectral classification for coastal wetland using depthwise feature interaction network," IEEE Trans. Geosci. Remote Sens., 2021.
[27] M. Sharma, M. Dhanaraj, S. Karnam, D. G. Chachlakis, R. Ptucha, P. P. Markopoulos, and E. Saber, "YOLOrs: Object detection in multimodal remote sensing imagery," IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., vol. 14, pp. 1497–1508, 2021.
[28] L. Gómez-Chova, D. Tuia, G. Moser, and G. Camps-Valls, "Multimodal classification of remote sensing images: A review and future directions," Proc. IEEE, vol. 103, no. 9, pp. 1560–1584, 2015.
[29] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 2117–2125.
[30] C. Li, T. Yang, S. Zhu, C. Chen, and S. Guan, "Density map guided object detection in aerial images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), 2020, pp. 190–191.
[31] C. Chen, M.-Y. Liu, O. Tuzel, and J. Xiao, "R-CNN for small object detection," in Asian Conf. on Comput. Vis. Springer, 2017, pp. 214–230.
[32] J. Noh, W. Bae, W. Lee, J. Seo, and G. Kim, "Better to follow, follow to be better: Towards precise supervision of feature super-resolution for small object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 9725–9734.
[33] M. Haris, G. Shakhnarovich, and N. Ukita, "Task-driven super resolution: Object detection in low-resolution images," arXiv, 2018. [Online]. Available: https://arxiv.org/abs/1803.11316
[34] J. Shermeyer and A. Van Etten, "The effects of super-resolution on object detection performance in satellite imagery," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), 2019, pp. 1432–1441.
[35] L. Courtrai, M. Pham, and S. Lefèvre, "Small object detection in remote sensing images based on super-resolution with auxiliary generative adversarial networks," Remote Sens., vol. 12, no. 19, p. 3152, 2020.
[36] J. Rabbi, N. Ray, M. Schubert, S. Chowdhury, and D. Chao, "Small-object detection in remote sensing images with end-to-end edge-enhanced GAN and object detector network," Remote Sens., vol. 12, no. 9, p. 1432, 2020.
[37] H. Ji, Z. Gao, T. Mei, and B. Ramesh, "Vehicle detection in remote sensing images leveraging on simultaneous super-resolution," IEEE Geosci. Remote Sens. Lett., vol. 17, no. 4, pp. 676–680, 2019.
[38] L. Wang, D. Li, Y. Zhu, L. Tian, and Y. Shan, "Dual super-resolution learning for semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2020, pp. 3773–3782.
[39] C. Wang, H. Mark Liao, Y. Wu, P. Chen, J. Hsieh, and I. Yeh, "CSPNet: A new backbone that can enhance learning capability of CNN," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), 2020, pp. 1571–1580.
[40] S. Elfwing, E. Uchibe, and K. Doya, "Sigmoid-weighted linear units for neural network function approximation in reinforcement learning," Neural Netw., vol. 107, pp. 3–11, 2018.
[41] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, 2015.
[42] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 7132–7141.
[43] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, "Enhanced deep residual networks for single image super-resolution," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), 2017, pp. 136–144.
[44] H. Zhao, O. Gallo, I. Frosio, and J. Kautz, "Loss functions for image restoration with neural networks," IEEE Trans. Comput. Imaging, vol. 3, no. 1, pp. 47–57, 2016.
[45] S. Razakarivony and F. Jurie, "Vehicle detection in aerial imagery: A small target detection benchmark," J. Vis. Commun. Image Represent., vol. 34, pp. 187–203, 2016.
[46] L. Bottou, "Large-scale machine learning with stochastic gradient descent," in Proc. 19th Int. Conf. Comput. Statist., 2010, pp. 177–186.
[47] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv, 2018. [Online]. Available: https://arxiv.org/abs/1804.02767
[48] A. Bochkovskiy, C. Wang, and H. M. Liao, "YOLOv4: Optimal speed and accuracy of object detection," arXiv, 2020. [Online]. Available: https://arxiv.org/abs/2004.10934v1
[49] M.-T. Pham, L. Courtrai, C. Friguet, S. Lefèvre, and A. Baussard, "YOLO-Fine: One-stage detector of small objects under various backgrounds in remote sensing images," Remote Sens., vol. 12, no. 15, p. 2501, 2020.
[50] F. Qingyun and W. Zhaokui, "Cross-modality attentive feature fusion for object detection in multispectral remote sensing imagery," Pattern Recognit., vol. 130, p. 108786, 2022.
[51] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 2980–2988.
[52] X. Li, W. Wang, L. Wu, S. Chen, X. Hu, J. Li, J. Tang, and J. Yang, "Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection," Proc. Adv. Neural Inf. Process. Syst., vol. 33, pp. 21002–21012, 2020.
[53] Z. Tian, C. Shen, H. Chen, and T. He, "FCOS: Fully convolutional one-stage object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 9627–9636.
[54] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li, "Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2020, pp. 9759–9768.
[55] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 4510–4520.
[56] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 6848–6856.
[57] H. Wei, Y. Zhang, Z. Chang, H. Li, H. Wang, and X. Sun, "Oriented objects as pairs of middle lines," ISPRS J. Photogramm. Remote Sens., vol. 169, pp. 268–279, 2020.
[58] P. Wang, X. Sun, W. Diao, and K. Fu, "FMSSD: Feature-merged single-shot detection for multiscale objects in large-scale remote sensing imagery," IEEE Trans. Geosci. Remote Sens., vol. 58, no. 5, pp. 3377–3390, 2020.
[59] Y. Yang, X. Sun, W. Diao, H. Li, Y. Wu, X. Li, and K. Fu, "Adaptive knowledge distillation for lightweight remote sensing object detectors optimizing," IEEE Trans. Geosci. Remote Sens., 2022.