
remote sensing

Article
Improved YOLO Network for Free-Angle Remote Sensing
Target Detection
Yuhao Qing, Wenyi Liu *, Liuyan Feng and Wanjia Gao

School of Instrument and Electronics, North University of China, Taiyuan 030000, China;
s2006262@st.nuc.edu.cn (Y.Q.); s2006261@st.nuc.edu.cn (L.F.); b1806014@st.nuc.edu.cn (W.G.)
* Correspondence: liuwenyi@nuc.edu.cn; Tel.: +86-139-3460-7107

Abstract: Despite significant progress in object detection tasks, remote sensing image target detection
is still challenging owing to complex backgrounds, large differences in target sizes, and uneven
distribution of rotating objects. In this study, we consider model accuracy, inference speed, and
detection of objects at any angle. We also propose a RepVGG-YOLO network using an improved
RepVGG model as the backbone feature extraction network, which performs the initial feature
extraction from the input image and considers network training accuracy and inference speed. We
use an improved feature pyramid network (FPN) and path aggregation network (PANet) to reprocess
feature output by the backbone network. The FPN and PANet module integrates feature maps of
different layers, combines context information on multiple scales, accumulates multiple features, and
strengthens feature information extraction. Finally, to maximize the detection accuracy of objects
of all sizes, we use four target detection scales at the network output to enhance feature extraction
from small remote sensing target pixels. To solve the angle problem of any object, we improved the
loss function for classification using circular smooth label technology, turning the angle regression
 problem into a classification problem, and increasing the detection accuracy of objects at any angle.
 We conducted experiments on two public datasets, DOTA and HRSC2016. Our results show the
proposed method performs better than previous methods.

Keywords: image target detection; deep learning; multiple scales; any angle object; remote sensing of small objects

Citation: Qing, Y.; Liu, W.; Feng, L.; Gao, W. Improved YOLO Network for Free-Angle Remote Sensing Target Detection. Remote Sens. 2021, 13, 2171. https://doi.org/10.3390/rs13112171

Academic Editor: Fahimeh Farahnakian

Received: 24 April 2021; Accepted: 29 May 2021; Published: 1 June 2021

1. Introduction

Target detection is a basic task in computer vision and helps estimate the category of objects in a scene and mark their locations. The rapid deployment of airborne and spaceborne sensors has made ultra-high-resolution aerial images common. However, object detection in remote sensing images remains a challenging task. Research on remote sensing images has crucial applications in the military, disaster control, environmental management, and transportation planning [1–4]. Therefore, it has attracted significant attention from researchers in recent years.

Object detection in aerial images has become a prevalent topic in computer vision [5–7]. In the past few years, machine learning methods have been successfully applied for remote sensing target detection [8–10]. David et al. [8] used the Defense Science and Technology Organization Analysts' Detection Support System, which is a system developed particularly for ship detection in remote sensing images. Wang et al. [9] proposed an intensity-space domain constant false alarm rate ship detector. Leng et al. [10] presented a highly adaptive ship detection scheme for spaceborne synthetic-aperture radar (SAR) imagery.

Although these remote sensing target detection methods based on machine learning have achieved good results, the missed detection rate remains very high in complex ground environments. Deep neural networks, particularly the convolutional neural network (CNN) class, significantly improve the detection of objects in natural images owing to the advantages in robust feature extraction using large-scale datasets. In recent years,
systems employing the powerful feature learning capabilities of CNN have demonstrated
remarkable success in various visual tasks such as classification [11,12], segmentation [13],
tracking [14], and detection [15–17]. CNN-based target detectors can be divided into
two categories: single-stage and two-stage target detection networks. Single-stage target
detection networks discussed in the literature [18–21] include a you only look once (YOLO)
detector optimized end-to-end, which was proposed by Joseph et al. [18,19]. Liu et al. [20]
presented a method for detecting objects in images using a deep neural network single-shot
detector (SSD). Lin et al. [21] designed and trained a simple dense object detector, RetinaNet,
to evaluate the effectiveness of the focal loss. The works of [22–27], describing two-stage
target detection networks, include the proposal by Girshick et al. [22] of a simple and
scalable detection algorithm that combines region proposals with a CNN
(R-CNN). Subsequently, Girshick et al. [23] developed a fast region-based convolutional
network (fast R-CNN) to efficiently classify targets and improve the training speed and
detection accuracy of the network. Ren et al. [24] merged the convolutional features of
RPN and fast R-CNN into a neural network with an attention mechanism (faster R-CNN).
Dai et al. [25] proposed a region-based fully convolutional network (R-FCN), and Lin
et al. [26] proposed a top-down structure, feature pyramid network (FPN), with horizontal
connections, which considerably improved the accuracy of target detection.
General object detection methods, generally based on horizontal bounding boxes
(HBBs), have proven quite successful in natural scenes. Recently, HBB-based methods
have also been widely used for target detection in aerial images [27–31]. Li et al. [27]
proposed a weakly supervised deep learning method that uses separate scene category
information and mutual prompts between scene pairs to fully train deep networks. Ming
et al. [28] proposed a deep learning method for remote sensing image object detection
using a polarized attention module and a dynamic anchor learning strategy. Pang et al. [29]
proposed a self-enhanced convolutional neural network, rotational region CNN (R2-CNN),
based on the content of remotely sensed regions. Han et al. [30] used a feature alignment
module and orientation detection module to form a single-shot alignment network (S2A-
Net) for target detection in remote sensing images. Deng et al. [31] redesigned the feature
extractor using cascaded rectified linear unit and inception modules, used two detection
networks with different functions, and proposed a new target detection method.
Most targets in remote sensing images have the characteristics of arbitrary directional-
ity, high aspect ratio, and dense distribution. Therefore, the HBB-based model may cause
severe overlap and noise. In subsequent work, an oriented bounding box (OBB) was used
to process rotating remote sensing targets [32–40], enabling more accurate target capture
and introducing considerably less background noise. Feng et al. [32] proposed a robust
Student’s t-distribution-aided one-stage orientation detector. Ding et al. [34] proposed an
RoI transformer that transforms horizontal regions of interest into rotating regions of inter-
est. Azimi et al. [36] minimized the joint horizontal and OBB loss functions. Liu et al. [37]
applied a newly defined rotatable bounding box (RBox) to develop a method to detect
objects at any angle. Yang et al. [39] proposed a rotating dense feature pyramid framework
(R-DFPN), and Yang et al. [40] designed a circular smooth label (CSL) technology to analyze
the angle of rotating objects.
To improve feature extraction, a few studies have integrated the attention mechanism
into their network model [41–43]. Chen et al. [41] proposed a multi-scale spatial and
channel attention mechanism remote sensing target detector, and Cui et al. [42] proposed
using a dense attention pyramid network to detect multi-sized ships in SAR images. Zhang
et al. [43] used attention-modulated features and context information to develop a novel
object detection network (CAD-Net).
A few studies have focused on the effect of context information in target detection, extract-
ing different proportions of context information as well as deep low-resolution high-level
and high-resolution low-level semantic features [44–49]. Zhu et al. [44] constructed a target
detection problem as an inference in a Markov random field. Gidaris et al. [45] proposed an
object detection system that relies on a multi-region deep CNN. Zhang et al. [46] proposed
a hierarchical target detector with deep environmental characteristics. Bell et al. [47] used
a spatial recurrent neural network (S-RNN) to integrate contextual information outside
the region of interest, proposing an object detector that uses information both inside and
outside the target. Marcu et al. [48] proposed a dual-stream deep neural network model
using two independent paths to process local and global information inference. Kang
et al. [49] proposed a multi-layer neural network that tends to merge based on context.
In this article, we propose the RepVGG-YOLO model to detect targets in remote
sensing images. RepVGG-YOLO uses the improved RepVGG module as the backbone
feature extraction network (Backbone) of the model; spatial pyramid pooling (SPP), multi-
layer FPN, and path aggregation network (PANet) as the enhanced feature extraction
networks; and CSL to correct the rotating angle of objects. In this model, we increased
the number of target detection scales to four. The main contributions of this article are as
follows:
1. We used the improved RepVGG as the backbone feature extraction module. This
module employs different networks in the training and inference parts, while consid-
ering the training accuracy and inference speed. The module uses a single-channel
architecture, which has high speed, high parallelism, good flexibility, and memory-
saving features. It provides a research foundation for the deployment of models on
hardware systems.
2. We used the combined FPN and PANet and the top-down and bottom-up feature
pyramid structures to accumulate low-level and process high-level features. Simul-
taneously, we used four detection scales to enhance the network's ability to extract features
from small remote sensing target pixels, ensuring accurate detection of objects of all sizes.
3. We used CSL to determine the angle of rotating objects, thereby turning the angle
regression problem into a classification problem and more accurately detecting objects
at any angle.
4. Compared with seven other recent remote sensing target detection networks, the
proposed RepVGG-YOLO network demonstrated the best performance on two public
datasets.
The rest of this paper is arranged as follows. Section 2 introduces the proposed model
for remote sensing image target detection. Section 3 describes the experimental validation
and discusses the results. Section 4 summarizes the study.

2. Materials and Methods


In this section, we first introduce the proposed network framework for target detection
in remote sensing images. Next, we present a formula derivation of the Backbone network
and multi-scale pyramid structure (Neck) for extracting and processing target features.
Then, we discuss the prediction structure of the proposed model and, finally, we detail the
loss function of the model.

2.1. Overview of the Proposed Model


We first perform operations such as random scaling, random cropping, and random
arrangement of the original dataset images, followed by data enhancement on the data to
balance the size and target sample ratio and segmentation of the image with overlapping
areas to retain the small target edge information. Simultaneously, we crop the original
data of the different sized segments into pictures of 608 × 608 pixels, which serve as the
input to the model. As shown in Figure 1, we first extract the low-level general features
from the processed image through the Backbone network. To detect targets of different
scales and categories, Backbone provides several combinations of receptive field size and
center step length. Then, we select the corresponding feature maps from different parts
of the Backbone input for Neck. Feature maps of varying sizes {152 × 152, 76 × 76, 38 ×
38, 19 × 19} are selected from the hierarchical feature maps to detect targets of different
sizes. By coupling the feature maps of different receptive field sizes, Neck enhances the
network expressivity and distributes the multi-scale learning tasks to multiple networks.
The Backbone aligns the feature maps by width once, and directly outputs the feature maps
of the same width to the head network. Finally, we integrate the feature information and
convert it into detection predictions. We elaborate on these parts in the following sections.

Figure 1. Overall network framework model.

2.2. Backbone Feature Extraction Network

The Backbone network is a reference network for many computer tasks, often used to extract low-level general features, such as color, shape, and texture. It can provide several combinations of receptive field size and center step length to meet the requirements of different scales and categories in target detection. ResNet and MobileNet comprise two networks often used in various computer-vision tasks. The former can realize a combination of different resolution features and extract a robust feature representation. The latter, with its faster inference speed and fewer network parameters, finds use in embedded devices with low computing power. The RepVGG [50] model has improved speed and accuracy compared with ResNet34, ResNet50, ResNet101, ResNet152, and VGG-16. While MobileNet and VGG have improved inference speed compared with models such as VGG-16, they have lower accuracy. Therefore, considering both accuracy and inference speed, we use the improved RepVGG as the backbone network in this study. The network improvements arise from VGG network enhancements: identity and residual branches are added to the VGG network block to utilize the advantages of the ResNet network. On the basis of the RepVGG-B [50] network, we add a Block_A module at the end of the network to enhance feature extraction and, at the same time, pass the feature map input of a specific shape to the subsequent network. Figure 2 shows the execution process of the backbone feature extraction network. The two-dimensional convolution in the Block_A module has a step size of 2; thus, the feature map size will be halved after the Block_A module. Similarly, because the two-dimensional convolution in the Block_B module has a step size of 1, the size of the feature map remains unchanged after the Block_B module.
Figure 2. Backbone feature extraction network.
For the input picture size of 608 × 608, Figure 2 shows the shape of the output feature map of each layer. After each continuous Block_B module (Block_B_3, Block_B_5, Block_B_15), a branch is output, and the high-level features are passed to the subsequent network for feature fusion, thereby enhancing the feature extraction capability of the model. Finally, the feature map with the shape {19, 19, 512} is passed to strengthen the feature extraction network.

In addition, different network architectures are used in the training and inference stages while considering training accuracy and inference speed. Figure 3 shows the training and structural re-parameterization network architectures.

Figure 3. (a) Block_A and Block_B modules in the training phase; (b) structural re-parameterization of Block_A and Block_B.

Figure 3a shows the training network of the RepVGG. The network uses two branch structures: Block_A, the residual structure that contains only the Conv1*1 residual branch, and Block_B, the residual structure with both the Conv1*1 residual and the identity branch. Because the training network has multiple gradient flow paths, a deeper network model can not only handle the problem of gradient disappearance in the deep layers of the network, but also obtain a more robust feature representation in the deep layers.

Figure 3b shows that RepVGG converts the multi-channel training model to a single-channel test model. To improve the inference speed, the convolutional and batch normalization (BN) layers are merged. Equations (1) and (2) express the formulas for the convolutional and BN layers, respectively.

$$\mathrm{Conv}(x) = W(x) + b \tag{1}$$

$$\mathrm{BN}(x) = \gamma \cdot \frac{x - \mathrm{mean}}{\sigma} + \beta \tag{2}$$

Replacing the argument in the BN layer equation with the convolution layer formula yields the following:

$$\mathrm{BN}(\mathrm{Conv}(x)) = \frac{\gamma \cdot W(x)}{\sigma} + \frac{\gamma \cdot (b - \mathrm{mean})}{\sigma} + \beta \tag{3}$$

Here, µ, σ, γ, and β represent the cumulative mean, standard deviation, scaling factor, and bias of the BN layer, respectively. We use $W^{k} \in \mathbb{R}^{C_2 \times C_1 \times k \times k}$ to denote the kernel of a k × k convolution with $C_1$ input channels and $C_2$ output channels. With $M^{1} \in \mathbb{R}^{N \times C_1 \times H_1 \times W_1}$ and $M^{2} \in \mathbb{R}^{N \times C_2 \times H_2 \times W_2}$ denoting the input and output, respectively, the BN layer fused with the convolution can be simplified to yield the following:

$$W'_{i,:,:,:} = \frac{\gamma_i}{\sigma_i} W_{i,:,:,:}, \qquad b'_i = -\frac{\mu_i \gamma_i}{\sigma_i} + \beta_i, \qquad \mathrm{BN}(M \ast W, \mu, \sigma, \gamma, \beta)_{:,i,:,:} = (M \ast W')_{:,i,:,:} + b'_i \tag{4}$$

where i ranges from 1 to $C_2$; $\ast$ represents the convolution operation; and $W'$ and $b'_i$ are the weight and bias of the convolution after fusion, respectively. Let $C_1 = C_2$, $H_1 = H_2$, and $W_1 = W_2$; then, the output can be expressed as follows:

$$M^{2} = \mathrm{BN}\left(M^{1} \ast W^{3}, \mu^{3}, \sigma^{3}, \gamma^{3}, \beta^{3}\right) + \mathrm{BN}\left(M^{1} \ast W^{1}, \mu^{1}, \sigma^{1}, \gamma^{1}, \beta^{1}\right) + \mathrm{BN}\left(M^{1}, \mu^{0}, \sigma^{0}, \gamma^{0}, \beta^{0}\right) \tag{5}$$

where $\mu^k$, $\sigma^k$, $\gamma^k$, and $\beta^k$ represent the BN parameters obtained after the k × k convolution, and $\mu^0$, $\sigma^0$, $\gamma^0$, and $\beta^0$ represent the parameters of the identity branch. For the output of
three different scales, we adopt the following strategy for fusion. We can regard the identity
branch structure as a 1 × 1 convolution; for the Conv1*1 and the identity branches, the 1 ×
1 convolution kernel can be filled and converted into a 3 × 3 convolution kernel; finally,
we add the three 3 × 3 convolution kernels from the three output scales to obtain the final
convolution kernel, and add the three deviations to obtain the final deviation. The Block_B
module can be represented by Equation (5); further, because the Block_A module does
not contain the identity branch structure, it can be represented by the first two items in
Equation (5).
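As a concrete illustration of Equations (4) and (5), the following NumPy sketch (our own, not the authors' released code; the BN statistics are random stand-ins) folds each branch's BN layer into its convolution kernel, pads the 1 × 1 kernel to 3 × 3, expresses the identity branch as a 3 × 3 kernel, and sums the branches into the single fused kernel and bias used at inference time.

```python
import numpy as np

def fuse_conv_bn(W, mu, sigma, gamma, beta):
    """Fold a BN layer (per-output-channel mu, sigma, gamma, beta) into the
    preceding convolution kernel W of shape (C2, C1, k, k), as in Equation (4).
    sigma is the BN standard deviation (sqrt(running_var + eps) in a real network)."""
    scale = gamma / sigma                      # per-output-channel scale
    W_fused = W * scale[:, None, None, None]   # W'_i = (gamma_i / sigma_i) * W_i
    b_fused = beta - mu * scale                # b'_i = -mu_i*gamma_i/sigma_i + beta_i
    return W_fused, b_fused

def pad_1x1_to_3x3(W1):
    """Place a (C2, C1, 1, 1) kernel at the center of a 3x3 kernel."""
    W3 = np.zeros((W1.shape[0], W1.shape[1], 3, 3), dtype=W1.dtype)
    W3[:, :, 1, 1] = W1[:, :, 0, 0]
    return W3

def identity_as_3x3(channels, dtype=np.float32):
    """Express the identity branch as an equivalent 3x3 convolution (C1 = C2)."""
    W = np.zeros((channels, channels, 3, 3), dtype=dtype)
    for c in range(channels):
        W[c, c, 1, 1] = 1.0
    return W

# Example: merge the three branches of a Block_B with 64 channels (Equation (5)).
C = 64
rng = np.random.default_rng(0)
branches = {
    "3x3": rng.standard_normal((C, C, 3, 3), dtype=np.float32),
    "1x1": rng.standard_normal((C, C, 1, 1), dtype=np.float32),
    "id":  identity_as_3x3(C),
}
W_total = np.zeros((C, C, 3, 3), dtype=np.float32)
b_total = np.zeros(C, dtype=np.float32)
for name, W in branches.items():
    # Each branch has its own BN statistics; random values stand in for them here.
    mu, sigma = rng.standard_normal(C).astype(np.float32), np.ones(C, dtype=np.float32)
    gamma, beta = np.ones(C, dtype=np.float32), np.zeros(C, dtype=np.float32)
    Wf, bf = fuse_conv_bn(W if name != "1x1" else pad_1x1_to_3x3(W), mu, sigma, gamma, beta)
    W_total += Wf
    b_total += bf
```

For a Block_A, the same merge is performed without the identity branch, matching the first two terms of Equation (5).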

2.3. Strengthening the Feature Extraction Network (Neck)


In the target detection task, to make the model learn diverse features and improve
detection performance, the Neck network can reprocess the features extracted by the
Backbone, disperse the learning of different scales applied to the multiple levels of feature
maps, and couple the feature maps with different receptive field sizes. In this study, we
use SPP [51], improved FPN [26], and PANet [52] structure to extract the features. Figure 4
shows the detailed execution process of the model. The SPP structure uses pooling methods
of different scales to perform multi-scale feature fusion, which can improve the receptive
field of the model, significantly increase the receiving range of the main features, and
more effectively separate the most important context features, thereby avoiding problems
such as image distortion caused by cropping and zooming the image area. The CBL module
(Conv + BN + Leaky ReLU) comprises a two-dimensional convolution, a BN layer, and a
Leaky_ReLU activation function. The input of the CSP2_1 module is divided into two parts. One part goes through two CBL modules and then through a two-dimensional convolution; the other part directly undergoes a two-dimensional convolution operation. Finally, the feature maps obtained from the two parts are spliced, then put through the BN layer and Leaky_ReLU activation function, and output after the CBL module.

Figure 4. Strengthening the feature extraction network.

Figure 4 shows the shape of the feature map of the key parts of the entire network. Note that the light-colored CBL modules (the three detection scale output parts at the bottom right) have a two-dimensional convolution step size of 2, whereas the other two-dimensional convolutions have a step size of 1. FPN is top-down, and transfers and integrates high-level feature information through up-sampling. FPN also transfers high-level strong semantic features to enhance the entire pyramid, but only enhances semantic information, not positioning information. We also added a bottom-up feature pyramid behind the FPN layer that accumulates low-level and processed high-level features. Because low-level features can provide more accurate location information, the additional layer creates a deeper feature pyramid, adding the ability to aggregate different detection layers from different backbone layers, which enhances the feature extraction performance of the network.

2.4. Target Boundary Processing at Any Angle

Because remote sensing images contain many complex and dense rotating targets, we need to correct these rotating objects for more accurate detection of objects at any angle. Common angle regression methods include the OpenCV, long-edge, and ordered-quadrilateral definitions. The predictions of these methods often exceed the initially set range. Because the angle parameters being learned are periodic, they can lie at the boundary of a period. This condition can cause a sudden increase in the loss value that increases the difficulty of learning by the network, leading to boundary problems. We use the circular smooth label (CSL) [40] to handle the angle problem, as shown in Figure 5.
Figure 5. Circular smooth label.

Equation (6) expresses the CSL, where g(x) is the window function:

$$\mathrm{CSL}(x) = \begin{cases} g(x), & \theta - r < x < \theta + r \\ 0, & \text{otherwise} \end{cases} \tag{6}$$

where θ represents the angle passed by the longest side when the x-axis rotates clockwise, and r represents the window radius. We convert angle prediction from a regression problem to a classification problem and treat each angle in the defined range as one category. We choose a Gaussian function for the window function to measure the angular distance between the predicted and ground truth labels. The predicted value loss becomes smaller the closer it comes to the true value within a certain range. Introducing periodicity, i.e., making the two angles 89° and −90° neighbors, solves the problem of angular periodicity. Using discrete rather than continuous angle predictions avoids boundary problems.
2.5. Target Prediction Network

After subjecting the image to feature extraction twice, we integrate the feature information and transform it into a prediction, as shown in Figure 6. We use the k-means clustering algorithm to generate 12 prior boxes with different scales according to the labels of the training set. Because remote sensing target detection involves detecting small targets, to enhance the feature extraction of small pixel targets, we use four detection scales with sizes of 19 × 19, 38 × 38, 76 × 76, and 152 × 152.

Taking the 19 × 19 detection scale as an example, we divide the input image into multiple 19 × 19 grids. Each grid point is preset with three boxes of corresponding scales. When these grids enclose an object, we use the corresponding grid for object detection. Finally, the shape of the feature map output by the detection feature layer is {19, 19, 603}. The third quantity implies that each of the three anchors in the corresponding grid consists of 201 dimension predictions: the width and height of the box and the coordinates of the center point (x_offset, y_offset, h, w), confidence, 16 classification results, and 180 classification angles (described in Section 2.4). Based on the set loss function (described in Section 2.6.3), iterative calculations for the backpropagation operation are performed and the position and angle of the prediction box are continually adjusted and, finally, to attain the highest confidence test results, non-maximum suppression screening is applied [53].
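As a quick arithmetic check of the output layout described above (our own illustration), the 603 channels per detection scale come from 3 anchors × (4 box parameters + 1 confidence + 16 classes + 180 angle bins):

```python
# Per-anchor prediction layout: 4 box offsets + 1 confidence
# + 16 class scores + 180 angle bins = 201 values, and 3 anchors per cell.
NUM_CLASSES, NUM_ANGLE_BINS, NUM_ANCHORS = 16, 180, 3
per_anchor = 4 + 1 + NUM_CLASSES + NUM_ANGLE_BINS      # 201
channels = NUM_ANCHORS * per_anchor                     # 603

for grid in (19, 38, 76, 152):
    print(f"detection scale {grid}x{grid}: output shape ({grid}, {grid}, {channels})")
```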
RemoteRemote
Sens. Sens. 13, x13,
2021,2021, FOR2171PEER REVIEW 9 of 20 9 of 20

Prediction

Conv
{152,152,603}

Conv

{76,76,603}

Conv

{38,38,603}
Conv

{19,19,603}

Figure 6.6.Target
Figure Targetprediction network.
prediction network.

2.6.
2.6. LossFunction
Loss Function

InInthis
this section, we describe the bounding box regression loss function, the confidence
section, we describe the bounding box regression loss function, the confi-
loss function with weight coefficients, and the classification loss function with increased
dence loss function with weight coefficients, and the classification loss function with in-
angle calculation.
creased angle calculation.
2.6.1. Bounding Box Border Regression Loss

The most commonly used indicator in target detection, often used to calculate the bounding box regression loss, the intersection over union (IoU) [54] value, is defined as the ratio of the intersection and union of the areas of two rectangular boxes. Equation (7) shows the IoU and the bounding box regression loss.

$$\mathrm{IoU} = \frac{|B \cap B^{gt}|}{|B \cup B^{gt}|}, \qquad \mathrm{LOSS}_{IoU} = 1 - \mathrm{IoU} \tag{7}$$

where B represents the predicted bounding box, $B^{gt}$ represents the real bounding box, $|B \cap B^{gt}|$ represents the B and $B^{gt}$ intersection area, and $|B \cup B^{gt}|$ represents the B and $B^{gt}$ union area. The following problems arise in calculating the loss function defined in Equation (7):
1. When B and $B^{gt}$ do not intersect, IoU = 0, the distance between B and $B^{gt}$ cannot be expressed, and the loss function LOSS_IoU cannot be differentiated or optimized.
2. When the size of B remains the same in different situations, the IoU values obtained do not change, making it impossible to distinguish different intersections of B and $B^{gt}$.

To overcome these problems, the generalized IoU (GIoU) [55] was proposed in 2019, with the formulation shown below:

$$\mathrm{GIoU} = \mathrm{IoU} - \frac{|C \setminus (B \cup B^{gt})|}{|C|}, \qquad \mathrm{LOSS}_{GIoU} = 1 - \mathrm{GIoU} \tag{8}$$

where |C| represents the area of the smallest rectangular box containing B and $B^{gt}$, and
C \ B ∪ Bgt represents the area of the C rectangle excluding B ∪ Bgt . The calculation


of the bounding box frame regression loss uses the GIoU. Compared with using the IoU,
using the GIoU improves the measurement method of the intersection scale and alleviates
the above-mentioned problems to a certain extent, but still does not consider the situation
when B is inside Bgt . Furthermore, when the size of B remains the same and the position
changes, the GIoU value also remains the same, and the model cannot be optimized.
In response to this situation, distance-IoU (DIoU) [56] was proposed in 2020. Based
on IoU and GIoU, and incorporating the center point of the bounding box, DIoU can be
expressed as follows:
$$\mathrm{DIoU} = \mathrm{IoU} - \frac{\rho^2(B, B^{gt})}{c^2}, \qquad \mathrm{LOSS}_{DIoU} = 1 - \mathrm{DIoU} \tag{9}$$

where $\rho^2(B, B^{gt})$ is the squared Euclidean distance between the center points of B and $B^{gt}$, and c represents the diagonal distance of the smallest rectangle that can cover B and $B^{gt}$
simultaneously. LOSSDIoU can be minimized by calculating the distance between B and
Bgt and using the distance between the center points of B and Bgt as a penalty term, which
improves the convergence speed.
Using both GIoU and DIoU, recalculating the aspect ratio of B and Bgt , and increasing
the impact factor av, the complete IoU (CIoU) [56] was proposed, as expressed below:

$$\mathrm{CIoU} = \mathrm{IoU} - \frac{\rho^2(B, B^{gt})}{c^2} - a v, \qquad a = \frac{v}{1 - \mathrm{IoU} + v}, \qquad v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2, \qquad \mathrm{LOSS}_{CIoU} = 1 - \mathrm{IoU} + \frac{\rho^2(B, B^{gt})}{c^2} + a v \tag{10}$$
where h gt and w gt are the length and width of Bgt , respectively; h and w are the length and
width of B, respectively; a is the weight coefficient; and v is the distance between the aspect
ratios of B and Bgt . We use LOSSCIoU as the bounding box border regression loss function,
which brings the predicted bounding box more in line with the real bounding box, and
improves the model convergence speed, regression accuracy, and detection performance.
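For reference, a minimal Python sketch of the CIoU loss of Equations (7)–(10) for axis-aligned boxes in (center x, center y, width, height) form is given below; it is our own simplified illustration, not the training implementation.

```python
import math

def ciou_loss(box, box_gt):
    """CIoU loss for axis-aligned boxes given as (x_center, y_center, w, h),
    following Equations (7)-(10); a small sketch, not the authors' code."""
    x1, y1, w1, h1 = box
    x2, y2, w2, h2 = box_gt

    # Intersection and union areas (Equation (7)).
    ix = max(0.0, min(x1 + w1 / 2, x2 + w2 / 2) - max(x1 - w1 / 2, x2 - w2 / 2))
    iy = max(0.0, min(y1 + h1 / 2, y2 + h2 / 2) - max(y1 - h1 / 2, y2 - h2 / 2))
    inter = ix * iy
    union = w1 * h1 + w2 * h2 - inter
    iou = inter / union if union > 0 else 0.0

    # Squared center distance and squared diagonal of the enclosing box (Equation (9)).
    rho2 = (x1 - x2) ** 2 + (y1 - y2) ** 2
    cw = max(x1 + w1 / 2, x2 + w2 / 2) - min(x1 - w1 / 2, x2 - w2 / 2)
    ch = max(y1 + h1 / 2, y2 + h2 / 2) - min(y1 - h1 / 2, y2 - h2 / 2)
    c2 = cw ** 2 + ch ** 2

    # Aspect-ratio term and its weight (Equation (10)).
    v = (4 / math.pi ** 2) * (math.atan(w2 / h2) - math.atan(w1 / h1)) ** 2
    a = v / (1 - iou + v + 1e-9)
    return 1 - iou + rho2 / c2 + a * v

print(ciou_loss((50, 50, 40, 20), (55, 52, 38, 22)))
```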

2.6.2. Confidence Loss Function


We use cross-entropy to calculate the object confidence loss. Regardless of whether
there is an object to be detected in the grid, the confidence error must be calculated. Because
only a small part of the input image may contain objects to be detected, we add a weight
coefficient (λno ) to constrain the confidence loss for the image area that does not contain
the target object, thereby reducing the number of negative samples. The object confidence
loss can be expressed as follows:

$$\mathrm{LOSS}_{Conf} = -\sum_{i=0}^{S^2}\sum_{j=0}^{B} \Big\{ I_i^j \big[\hat{C}_i^j \log C_i^j + (1-\hat{C}_i^j)\log(1-C_i^j)\big] R_{IoU} + \big(1-I_i^j\big) \big[\hat{C}_i^j \log C_i^j + (1-\hat{C}_i^j)\log(1-C_i^j)\big] \lambda_{no} \Big\} \tag{11}$$
where S is the number of grids in the network output layer and B is the number of anchors. $I_i^j$ indicates whether the j-th anchor in the i-th grid can detect this object (the detected value is 1 and the undetected value is 0), and the value of $\hat{C}_i^j$ is determined by whether the bounding box of the grid is responsible for predicting an object (if it is responsible for prediction, the value of $\hat{C}_i^j$ is 1, otherwise it is 0). $C_i^j$ is the predicted value after parameter normalization (the value lies between 0 and 1). $R_{IoU}$ represents the IoU of the rotating bounding box.

The complete decoupling of the correlation between the prediction angle and the prediction confidence means the confidence loss is not only related to the frame parameters, but also to the rotation angle. Table 1 summarizes the recalculation of the IoU [35] of the rotating bounding box as the confidence loss coefficient, along with its pseudocode.

Table 1. Rotating intersection over union (IoU) calculation pseudocode.
Algorithm 1 RIoU computation
1:  Input: Rectangles R1, R2, ..., RN
2:  Output: RIoU between rectangle pairs RIoU
3:  for each pair <Ri, Rj> (i < j) do
4:      Point set PSet ← ∅
5:      Add intersection points of Ri and Rj to PSet
6:      Add the vertices of Ri inside Rj to PSet
7:      Add the vertices of Rj inside Ri to PSet
8:      Sort PSet into anticlockwise order
9:      Compute intersection I of PSet by triangulation
10:     RIoU[i, j] ← Area(I) / (Area(Ri) + Area(Rj) − Area(I))
11: end for

Figure 7 shows the geometric principle of rotating IoU calculations. We divide the overlapping part into multiple triangles with the same vertex, calculate the area of each triangle separately, and finally add the calculated areas to obtain the area of the overlapping polygon. The detailed calculation principle is as follows. Given a set of rotating rectangles R1, R2, ..., RN, calculate the RIoU of each pair <Ri, Rj>. First, build the point set PSet of Ri and Rj (the intersection points of the two rectangles and the vertices of each rectangle lying inside the other form the set PSet, corresponding to rows 4–7 of Table 1); then, calculate the intersection area, I, of PSet and, finally, calculate the RIoU according to the formula in row 10 of Table 1 (combine the points in PSet into a polygon, divide the polygon into multiple triangles, take the sum of the triangle areas as the polygon area, and use this area to compute the RIoU of the rotated boxes; corresponding to rows 8–10 of Table 1).

Figure 7. Intersection over union (IoU) calculation for rotating intersecting rectangles: (a) intersecting graph is a quadrilateral, (b) intersecting graph is a hexagon, and (c) intersecting graph is an octagon.
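The same rotated IoU can be checked numerically with an off-the-shelf polygon library; the sketch below (our own, assuming the third-party Shapely package is available) lets Shapely perform the clipping and area computation that rows 4–10 of Table 1 describe.

```python
import math
from shapely.geometry import Polygon

def rotated_rect(cx, cy, w, h, angle_deg):
    """Return the four corners of a w x h rectangle rotated by angle_deg around its center."""
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    corners = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + x * cos_a - y * sin_a, cy + x * sin_a + y * cos_a) for x, y in corners]

def rotated_iou(rect1, rect2):
    """IoU of two rotated rectangles given as corner lists; Shapely handles the
    polygon intersection and area computation."""
    p1, p2 = Polygon(rect1), Polygon(rect2)
    inter = p1.intersection(p2).area
    return inter / (p1.area + p2.area - inter)

print(rotated_iou(rotated_rect(0, 0, 4, 2, 30), rotated_rect(0.5, 0.2, 4, 2, 75)))
```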

2.6.3. Classification Loss Function

Because we converted the angle calculation from a regression problem into a classification problem, we calculate both the category and angle loss when calculating the classification loss function. Here, we use the cross-entropy loss function for the calculation. When the j-th anchor box of the i-th grid is responsible for a real target, we calculate the classification loss function for the bounding box generated by this anchor box, using Equation (12).

$$\mathrm{LOSS}_{Class} = -\sum_{i=0}^{S^2}\sum_{j=0}^{B} I_i^j \sum_{c \in Class,\; \theta \in (0,180]} \Big[\hat{P}_i(c+\theta)\log P_i(c+\theta) + \big(1-\hat{P}_i(c+\theta)\big)\log\big(1-P_i(c+\theta)\big)\Big] \tag{12}$$

where c belongs to the target classification category; θ belongs to the angle processed by the CSL [40] algorithm; S is the number of grids in the network output layer; B is the number of anchors; and $I_i^j$ indicates whether the j-th anchor in the i-th grid can detect this object (the detected value is 1 and the undetected value is 0).
The final total loss function equals the sum of the three loss functions, as shown in
Equation (13). Furthermore, the three loss functions have the same effect on the total loss
function; that is, the reduction of any one of the loss functions will lead to the optimization
of the total loss function.

LOSS = LOSSCIoU + LOSSConf + LOSSClass (13)

3. Experiments, Results, and Discussion


3.1. Introduction to DOTA and HRSC2016 Datasets
3.1.1. DOTA Dataset
The DOTA dataset [57] comprises 2806 aerial images obtained from different sensors
and platforms, including 15 classification categories: plane (PL), baseball diamond (BD),
bridge (BR), ground track field (GTF), small vehicle (SV), large vehicle (LV), ship (SH), tennis
court (TC), basketball court (BC), storage tank (ST), soccer ball field (SBF), roundabout
(RA), harbor (HA), swimming pool (SP), and helicopter (HC). The image data
can be divided into 1411 training sets, 937 test sets, and 458 verification sets. The image
size ranges between 800 × 800 and 4000 × 4000 pixels. Dataset labeling consisted of a
horizontal and a directional bounding box for a total of 188,282 instances.

3.1.2. HRSC2016 Dataset


The HRSC2016 dataset [58] comes from six different ports, with a total of 1061 remote
sensing pictures. Examples of detection objects include ships on the sea and ships docked
on the shore. The images can be divided into 436 training sets (1207 labeled examples in
total), 444 test sets (1228 labeled examples in total), and 181 validation sets (541 labeled
examples in total). The image size ranges from 300 × 300 to 1500 × 900 pixels.

3.2. Image Preprocessing and Parameter Optimization


In this section, we describe image preprocessing, experimental parameter settings,
and experimental evaluation standards.

3.2.1. Image Preprocessing


Owing to the complex background of remote sensing target detection [59], large
changes in the target scale [60], special viewing angle [61–63], unbalanced categories [31],
and so on, we preprocess the original data. Directly processing the original high-resolution
remote sensing images not only increases equipment requirements, but also significantly
reduces detection accuracy. We cut the entire picture and send it to the proposed model
training module. During the test, we cut the test pictures into pictures of the same size as
those in the training set, and after the test, we splice the predicted results one by one to
obtain the total result. To ensure the loss of small target information at the cutting edge
during the cutting process, we allow the cut image to have a certain proportion of overlap
area (in this study, we set the overlap area to 30%). If the size of the original image is smaller
than the size of the cut image, we perform an edge pixel filling operation on the original
image to make its size reach the training size. In the remote sensing dataset (e.g., DOTA),
Remote Sens. 2021, 13, 2171 13 of 20

the sample target size changes drastically, and small targets can be densely distributed
and large and small targets can be considerably unevenly distributed (the number of
small targets is much larger than the number of large targets). In this regard, we use the
Mosaic data enhancement method to splice the pictures in random zooming, cropping, and
arrangement, which substantially enriches the dataset and makes the distribution of targets
of different sizes more uniform. Mixing multiple images with different semantic content also enhances network robustness, because the detector learns to recognize targets outside their conventional context.
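A minimal sketch of the overlapping-crop bookkeeping described above is given below; the 608-pixel tile size and 30% overlap follow the text, while the helper function itself is our own illustration.

```python
def tile_offsets(image_size, tile=608, overlap=0.3):
    """Top-left offsets for cutting one image axis into tile-sized crops that
    overlap by the given fraction; the last crop is shifted back so it stays
    inside the image (edge padding for undersized images is handled elsewhere)."""
    stride = int(tile * (1 - overlap))
    offsets = list(range(0, max(image_size - tile, 0) + 1, stride))
    if offsets[-1] + tile < image_size:
        offsets.append(image_size - tile)
    return offsets

# A 4000-pixel-wide DOTA image is covered by crops starting at these x offsets:
print(tile_offsets(4000))
```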

3.2.2. Experimental Parameter Settings


We evaluated the performance of the proposed model on two NVIDIA GeForce RTX
2080 Ti GPUs with 11 GB of RAM. We used the PyTorch 1.7 deep learning framework and
Python 3.7 compiler run on Windows 10. To optimize the network, we used stochastic
gradient descent with momentum, setting the learning rate momentum and weight decay
coefficients to 0.857 and 0.00005, respectively; the iterative learning rate for the first 50 K to
0.001; and the later iterative learning rate to 0.0001. The CIoU loss and classification loss
coefficients were set to 0.0337 and 0.313, respectively. The weight coefficient, λno , of the
confidence loss function was set to 0.4. The batch size was set to eight, and the epoch was
set to 500.
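For orientation, these optimizer settings map onto PyTorch 1.7 roughly as follows (a sketch with a placeholder module standing in for the full RepVGG-YOLO network; the learning-rate drop after the first 50 K iterations would be handled by a separate scheduler):

```python
import torch

# Hyperparameters from Section 3.2.2; the model object itself is a placeholder here.
model = torch.nn.Conv2d(3, 603, kernel_size=1)   # stand-in for RepVGG-YOLO
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.001,            # first 50 K iterations; reduced to 0.0001 afterwards
    momentum=0.857,
    weight_decay=0.00005,
)
```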

3.2.3. Evaluation Criteria


To verify the performance of the proposed method, two broad criteria were used to
evaluate the test results [64]: precision and recall. Precision indicates the fraction of predicted positive samples that are true positives, and recall indicates the fraction of ground-truth positive samples that are correctly detected. Precision and recall can be expressed as follows.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{14}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{15}$$
TP represents a real positive sample, TN represents a real negative sample, FP is a false
positive sample, and FN is a false negative sample. This study adopts the mean average
precision (mAP) [45–47] to evaluate all methods, which can be expressed as follows:
$$\mathrm{mAP} = \frac{\sum_{i=1}^{N_{class}} \int P_i(R_i)\,\mathrm{d}R_i}{N_{class}} \tag{16}$$

where $P_i$ and $R_i$ represent the precision and recall of the i-th class of objects, respectively, and $N_{class}$ represents the number of object classes in the dataset.
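As a simplified illustration of Equation (16) (our own sketch using plain trapezoidal integration and toy numbers rather than the interpolated-AP protocol of the benchmark toolkits), the per-class AP is the area under the precision–recall curve and mAP is their mean:

```python
import numpy as np

def average_precision(recall, precision):
    """Area under a precision-recall curve by trapezoidal integration,
    i.e., one term of the sum in Equation (16); inputs are sorted by recall."""
    return float(np.trapz(precision, recall))

# Toy per-class curves; mAP is the mean of the per-class areas.
curves = {
    "plane": ([0.0, 0.5, 1.0], [1.0, 0.9, 0.6]),
    "ship":  ([0.0, 0.5, 1.0], [1.0, 0.8, 0.5]),
}
ap = {name: average_precision(np.array(r), np.array(p)) for name, (r, p) in curves.items()}
print(ap, "mAP:", sum(ap.values()) / len(ap))
```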

3.3. Experimental Results


Figure 8 shows the precision–recall curve of the DOTA detection object category. We
focus on the interval between 0.6 and 0.9, where the recall rate is concentrated. Except for
BR, when the recall value is greater than 0.6, the decline in the curves of the other types
of objects increases. The BD, PL, and TC curves all drop sharply when the recall value is
greater than 0.8. The results show that the overall performance of the proposed method is
stable and has good detection effectiveness.

To prove that the proposed method has better performance, we compared the pro-
posed method (RepVGG-YOLO NET) to seven other recent methods: SSD [20], joint train-
ing method for target detection and classification (YOLOV2) [19], rotation dense feature
pyramid network (R-DFPN) [39], toward real-time object detection with RPN (FR-C) [25],
joint image cascade and functional pyramid network and multi-size convolution kernel to
extract multi-scale strong and weak semantic feature framework (ICN) [36], fine FPN and
multi-layer attention network (RADET) [65], and end-to-end refined single-stage rotation
detector (R3Det) [66]. Table 2 summarizes the quantitative comparison results of the eight
methods on the DOTA dataset. The table indicates that the proposed model has achieved
the most advanced results, achieving relatively stable detection results in all categories,
with an mAP of 74.13%. The SSD and YOLOV2 networks perform poorly overall, and especially on small targets, reflecting the limited capacity of their feature extraction networks. The FR-C, ICN, and RADET network models
achieved good detection results.
Compared with other methods, owing to the increased processing of targets at any
angle and the use of four target detection scales, the proposed model achieved good
classification results for small objects with complex backgrounds and dense distributions
(for example, SV and SH achieved 71.02% and 78.41% mAP values). Compared with the
suboptimal method (i.e., R3Det), the suggested method achieved a 1.32% better mAP value.
In addition, using the FPN and PANet structures to accumulate high-level and low-level
features helped improve the detection of categories whose target scales differ greatly within
the same image (for example, BR and LV in the same image), with BR and LV achieving
classification results of 52.34% and 76.27%, respectively. We also obtained
relatively stable mAP values in single-category detection (PL, BR, SV, LV, TC, BC, SBF, RA,
SP, and HC achieved the highest mAP values).
Table 3 compares the proposed model with five other methods, namely rotation-sensitive
regression for oriented scene text detection (RRD) [67], rotated region-based CNNs for ship
detection (BL2 and RC2) [68], the refined single-stage detector with feature refinement for
rotating objects (R3Det) [66], and rotated region proposal and discrimination networks
(R2PN) [69], summarizing the quantitative comparison on the HRSC2016 dataset. The results
demonstrate that the proposed method achieves an mAP of 91.54%, which is better than the
other methods evaluated on this dataset; compared with the suboptimal method (R3Det),
the mAP of the proposed model is 2.21% higher. Good results were achieved for ship
instances with large aspect ratios and varied rotation directions. The proposed method also
runs at 22 frames per second (FPS), faster than the suboptimal method (R3Det).
Figure 9 shows partial visualization results of the proposed method on the DOTA and
HRSC2016 datasets. The first three rows are visualization results from the DOTA dataset,
and the last row shows results from the HRSC2016 dataset. Figure 9 shows that the proposed
model handles noise in complex environments well and detects densely distributed small
objects effectively. Good results were also obtained for samples with drastic scale changes
and unusual viewing angles.

Table 2. Comparison of the results with the other seven latest methods on the DOTA dataset (highest performance is in
boldface).

Method PL BD BR GTF SV LV SH TC BC ST SBF RA HA SP HC mAP (%)


SSD 57.85 32.79 16.14 18.67 0.05 36.93 24.74 81.16 25.10 47.47 11.22 31.53 14.12 9.09 0.00 29.86
YOLOV2 76.90 33.87 22.73 34.88 38.73 32.02 52.37 61.65 48.54 33.91 29.27 36.83 36.44 38.26 11.61 39.20
R-DFPN 80.92 65.82 33.77 58.94 55.77 50.94 54.78 90.33 66.34 68.66 48.73 51.76 55.1 51.32 35.88 57.94
FR-C 80.2 77.55 32.86 68.13 53.66 52.49 50.04 90.41 75.05 59.59 57.00 49.81 61.69 56.46 41.85 60.46
ICN 81.36 74.3 47.7 70.32 64.89 67.82 69.98 90.76 79.06 78.20 53.64 62.90 67.02 64.17 50.23 68.16
RADET 79.45 76.99 48.05 65.83 65.46 74.40 68.86 89.70 78.14 74.97 49.92 64.63 66.14 71.58 62.16 69.09
R3 Det 89.24 80.81 51.11 65.62 70.67 76.03 78.32 90.83 84.89 84.42 65.10 57.18 68.1 68.98 60.88 72.81
proposed 90.27 79.34 52.34 64.35 71.02 76.27 77.41 91.04 86.21 84.17 66.82 63.07 67.23 69.75 62.07 74.13

Table 3. Comparison of the results with five other recent methods on the HRSC2016 dataset.

Method      mAP (%)   FPS
BL2         69.6      –
RC2         75.7      –
R2PN        79.6      –
RRD         84.3      –
R3Det       89.33     10
proposed    91.54     22

Figure 8. Precision-recall curve of the DOTA dataset.

Figure 9. Visualization results of the DOTA dataset and HRSC2016 dataset. The first three groupings of images are part of
the test results of the DOTA dataset, whereas the last grouping is part of the test results of the HRSC2016 dataset.

3.4. Ablation Study

We conducted a series of comparative experiments on the DOTA dataset, as shown
in Table 4. We considered the influence of different combinations of five factors: the
backbone network, the bounding box border regression loss (BBRL), data enhancement
(DE), the multi-scale setting, and CSL. We used mAP and FPS as evaluation criteria to
verify the effectiveness of each component.

Table 4. Ablation study on components on the DOTA dataset.

N  Proposed  Backbone  BBRL  DE  Multi Scale  CSL  mAP    FPS
1  ✓         RepVGG-A  DIoU  –   –            –    66.98  25
2  ✓         RepVGG-A  CIoU  –   –            –    67.19  25
3  ✓         RepVGG-B  DIoU  –   –            –    68.03  23
4  ✓         RepVGG-B  CIoU  –   –            –    69.98  23
5  ✓         RepVGG-B  CIoU  ✓   –            –    71.03  23
6  ✓         RepVGG-B  CIoU  ✓   ✓            –    72.25  22
7  ✓         RepVGG-B  CIoU  ✓   ✓            ✓    74.13  22

The first row of Table 4 is the baseline, in which the improved RepVGG-A is used as the
backbone and DIoU is used as the BBRL. The backbone network is a reference for many
computer vision tasks. We set up the first and third groups, and the second and fourth
groups, of experiments to verify the choice of backbone. The results show that RepVGG-B
has more parameters and is deeper than RepVGG-A; consequently, using the improved
RepVGG-B as the backbone (groups 3 and 4), mAP increased by 1.05% and 2.79%,
respectively. Choosing an appropriate loss function can improve the convergence speed
and prediction accuracy of the model. Here, we set up the first and second groups, and the
third and fourth groups, of experiments to analyze the BBRL. Because CIoU adds a penalty
for the aspect-ratio consistency between the predicted and ground-truth boxes, together
with its influence factor, the predicted box is driven to align more closely with the ground
truth. Under the same conditions, better results were therefore obtained when CIoU was
used as the BBRL. The objective of DE is to increase
the number and diversity of samples, which can significantly improve the problem of
sample imbalance. According to the experimental results of the fourth and fifth groups,
mAP increased by 1.06% after the image was processed by cropping, zooming, and random
arrangement. Because different detection scales have different sensitivities to objects of
different scales, there are many detection targets with large differences in size in remote
sensing images. We can observe from the experimental results of the fifth and sixth groups
that mAP improved by 1.21% when four detection scales were used. The increased number
of detection scales enhances the detection of small target objects. Because there are many
dense rotating targets in remote sensing images, we assume that the bounding box can be
predicted more accurately. Next, we set up the sixth and seventh groups of experiments.
The results show that, after using CSL, we can change the angle prediction from a regression
problem into a classification problem, and the periodicity problem of the angle was solved.
mAP improved by 1.88%, reaching 74.13%. We therefore chose the improved RepVGG-B
model as the backbone network with CIoU as the BBRL, applied DE, the multi-scale setting,
and CSL simultaneously, and obtained the final RepVGG-YOLO network.
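
To make the CSL step concrete, the sketch below encodes an angle as a circular smooth label vector with a Gaussian window function, which is our reading of the technique in [40]; the number of angle bins and the window radius shown here are illustrative assumptions rather than the exact values used in our implementation.

import numpy as np

def circular_smooth_label(angle_deg, num_bins=180, radius=6.0):
    # Convert angle regression into classification over num_bins discrete angle classes.
    # Neighbouring bins receive smoothly decaying scores, and the bin distance is taken
    # modulo num_bins, so bins near 0 and near num_bins - 1 are treated as adjacent,
    # which removes the periodicity problem of the angle.
    center = int(round(angle_deg)) % num_bins
    bins = np.arange(num_bins)
    dist = np.minimum(np.abs(bins - center), num_bins - np.abs(bins - center))
    label = np.exp(-(dist ** 2) / (2 * radius ** 2))  # Gaussian window
    label[dist > radius] = 0.0                        # truncate outside the window radius
    return label

# Example: angles of 1 degree and 179 degrees produce overlapping label vectors,
# whereas one-hot encoding would treat them as maximally different classes.
label_a, label_b = circular_smooth_label(1), circular_smooth_label(179)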

4. Conclusions
In this article, we introduce a method for detecting targets at arbitrary angles in geographic
remote sensing images. A RepVGG-YOLO model is proposed, which uses an improved
RepVGG module as the backbone feature extraction network (Backbone) of the model,
and uses SPP, feature pyramid network (FPN), and path aggregation network (PANet)
as the enhanced feature extraction networks. The model combines context information
on multiple scales, accumulates multi-layer features, and strengthens feature information
extraction. In addition, we use four target detection scales to enhance the feature extrac-
tion of remote sensing small target pixels and the CSL method to increase the detection
accuracy of objects at any angle. We redefine the classification loss function and add the
angle problem to the loss calculation. The proposed model achieved the best detection
performance among the eight methods evaluated. The proposed model obtained an mAP
of 74.13% and 22 FPS on the DOTA dataset, wherein the mAP value exceeded that of the
suboptimal method (R3Det) by 1.32%. The proposed model obtained an mAP of 91.54%
on the HRSC2016 dataset. The mAP value and the FPS exceeded those of the suboptimal
method (R3Det) by 2.21% and 12 FPS, respectively. We expect to conduct further research on
the detection of blurred, dense small objects and obscured objects.

Author Contributions: Conceptualization, Y.Q. and W.L.; methodology, Y.Q.; software, Y.Q. and W.L.;
validation, Y.Q., L.F. and W.G.; formal analysis, Y.Q. and L.F.; writing—original draft preparation,
Y.Q., W.L. and L.F.; writing—review and editing, Y.Q. and W.L.; visualization, Y.Q. and W.L. All
authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Data Availability Statement: Not applicable.
Acknowledgments: The authors would like to thank Guigan Qing and Chaoxiu Li for their support,
secondly, thanks to Lianshu Qing and Niuniu Feng for their support.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Zhang, F.; Du, B.; Zhang, L.; Xu, M. Weakly supervised learning based on coupled convolutional neural networks for aircraft
detection. IEEE Trans. Geosci. Remote Sens. 2016, 54, 5553–5563. [CrossRef]
2. Kamusoko, C. Importance of remote sensing and land change modeling for urbanization studies. In Urban Development in Asia
and Africa; Springer: Singapore, 2017.
3. Ahmad, K.; Pogorelov, K.; Riegler, M.; Conci, N.; Halvorsen, P. Social media and satellites. Multimed. Tools Appl. 2019, 78,
2837–2875. [CrossRef]
4. Tang, T.; Zhou, S.; Deng, Z.; Zou, H.; Lei, L. Vehicle detection in aerial images based on region convolutional neural networks and
hard negative example mining. Sensors 2017, 17, 336. [CrossRef]
5. Cheng, G.; Zhou, P.; Han, J. RIFD-CNN: Rotation-invariant and fisher discriminative convolutional neural networks for object
detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26
June–1 July 2016; pp. 2884–2893.
6. Deng, Z.; Sun, H.; Zhou, S.; Zhao, J.; Zou, H. Toward fast and accurate vehicle detection in aerial images using coupled
region-based convolutional neural networks. J-STARS 2017, 10, 3652–3664. [CrossRef]
7. Long, Y.; Gong, Y.; Xiao, Z.; Liu, Q. Accurate object localization in remote sensing images based on convolutional neural networks.
IEEE Trans. Geosci. Remote Sens. 2017, 55, 2486–2498. [CrossRef]
8. Crisp, D.J. A ship detection system for RADARSAT-2 dual-pol multi-look imagery implemented in the ADSS. In Proceedings of
the 2013 IEEE International Conference on Radar, Adelaide, Australia, 9–12 September 2013; pp. 318–323.
9. Wang, C.; Bi, F.; Zhang, W.; Chen, L. An intensity-space domain CFAR method for ship detection in HR SAR images. IEEE Geosci.
Remote Sens. Lett. 2017, 14, 529–533. [CrossRef]
10. Leng, X.; Ji, K.; Zhou, S.; Zou, H. An adaptive ship detection scheme for spaceborne SAR imagery. Sensors 2016, 16, 1345.
[CrossRef] [PubMed]
11. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. NIPS 2012, 25,
1097–1105. [CrossRef]
12. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500.
13. Chen, K.; Pang, J.; Wang, J.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Shi, J.; Ouyang, W.; et al. Hybrid task cascade for instance
segmentation. arXiv 2019, arXiv:1901.07518.
14. Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High performance visual tracking with Siamese region proposal network. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp.
8971–8980.
15. Tian, L.; Cao, Y.; He, B.; Zhang, Y.; He, C.; Li, D. Image Enhancement Driven by Object Characteristics and Dense Feature Reuse
Network for Ship Target Detection in Remote Sensing Imagery. Remote Sens. 2021, 13, 1327. [CrossRef]
16. Li, Y.; Li, X.; Zhang, C.; Lou, Z.; Zhu, Y.; Ding, Z.; Qin, T. Infrared Maritime Dim Small Target Detection Based on Spatiotemporal
Cues and Directional Morphological Filtering. Infrared Phys. Technol. 2021, 115, 103657. [CrossRef]
17. Yao, Z.; Wang, L. ERBANet: Enhancing Region and Boundary Awareness for Salient Object Detection. Neurocomputing 2021, 448,
152–167. [CrossRef]
18. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788.
19. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
20. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, S.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of
the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland,
2016; pp. 21–37.
21. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International
Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988.
22. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Region-based convolutional networks for accurate object detection and segmentation.
IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 142–158. [CrossRef]
23. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Araucano Park, Las
Condes, Chile, 11–18 December 2015; pp. 1440–1448.
24. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans.
Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [CrossRef] [PubMed]
25. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object detection via region-based fully convolutional networks. NIPS 2016, 29, 379–387.
26. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
27. Li, Y.; Zhang, Y.; Huang, X.; Yuille, A.L. Deep networks under scene-level supervision for multi-class geospatial object detection
from remote sensing images. ISPRS J. Photogramm. Remote Sens. 2018, 146, 182–196. [CrossRef]
28. Ming, Q.; Miao, L.; Zhou, Z.; Dong, Y. CFC-Net: A critical feature capturing network for arbitrary-oriented object detection in
remote sensing images. arXiv 2021, arXiv:2101.06849.
29. Pang, J.; Li, C.; Shi, J.; Xu, Z.; Feng, H. R2-CNN: Fast tiny object detection in large-scale remote sensing images. IEEE Trans. Geosci.
Remote Sens. 2019, 57, 5512–5524. [CrossRef]
30. Han, J.; Ding, J.; Li, J.; Xia, G.S. Align deep features for oriented object detection. IEEE Trans. Geosci. Remote Sens. 2021, 1–11.
31. Deng, Z.; Sun, H.; Zhou, S.; Zhao, J.; Lei, L.; Zou, H. Multi-scale object detection in remote sensing imagery with convolutional
neural networks. ISPRS J. Photogramm. Remote Sens. 2018, 145, 3–22. [CrossRef]
32. Feng, P.; Lin, Y.; Guan, J.; He, G.; Shi, H.; Chambers, J. TOSO: Student’s-T distribution aided one-stage orientation target detection
in remote sensing images. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 4057–4061.
33. Xu, Y.; Fu, M.; Wang, Q.; Wang, Y.; Chen, K.; Xia, G.; Bai, X. Gliding vertex on the horizontal bounding box for multi-oriented
object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 1452–1459. [CrossRef]
34. Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning RoI Transformer for Detecting Oriented Objects in Aerial Images. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Los Angeles, CA, USA, 16–19 June 2019.
35. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-Scale Dataset for Object
Detection in Aerial Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City,
UT, USA, 18–23 June 2018; pp. 3974–3983.
36. Azimi, S.M.; Vig, E.; Bahmanyar, R.; Körner, M.; Reinartz, P. Towards multi-class object detection in unconstrained remote sensing
imagery. arXiv 2018, arXiv:1807.02700.
37. Liu, L.; Pan, Z.; Lei, B. Learning a rotation invariant detector with rotatable bounding box. arXiv 2017, arXiv:1711.09405.
38. Wang, J.; Ding, J.; Guo, H.; Cheng, W.; Pan, T.; Yang, W. Mask OBB: A Semantic Attention-Based Mask Oriented Bounding Box
Representation for Multi-Category Object Detection in Aerial Images. Remote Sens. 2019, 11, 2930. [CrossRef]
39. Yang, X.; Sun, H.; Fu, K.; Yang, J.; Sun, X.; Yan, M.; Guo, Z. Automatic ship detection in remote sensing images from Google Earth
of complex scenes based on multiscale rotation dense feature pyramid networks. Remote Sens. 2018, 10, 132. [CrossRef]
40. Yang, X.; Yan, J. Arbitrary-oriented object detection with circular smooth label. In Proceedings of the 16th European Conference
on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 677–694.
41. Chen, J.; Wan, L.; Zhu, J.; Xu, G.; Deng, M. Multi-scale spatial and channel-wise attention for improving object detection in remote
sensing imagery. IEEE Geosci. Remote Sens. Lett. 2020, 17, 681–685. [CrossRef]
42. Cui, Z.; Li, Q.; Cao, Z.; Liu, N. Dense attention pyramid networks for multi-scale ship detection in SAR images. IEEE Trans.
Geosci. Remote Sens. 2019, 57, 8983–8997. [CrossRef]
43. Zhang, G.; Lu, S.; Zhang, W. CAD-net: A context-aware detection network for objects in remote sensing imagery. IEEE Trans.
Geosci. Remote Sens. 2019, 57, 10015–10024. [CrossRef]
44. Zhu, Y.; Urtasun, R.; Salakhutdinov, R.; Fidler, S. segDeepM: Exploiting segmentation and context in deep neural networks for
object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA,
7–12 June 2015; pp. 4703–4711.
45. Gidaris, S.; Komodakis, N. Object detection via a multi-region and semantic segmentation-aware CNN model. In Proceedings of
the IEEE International Conference on Computer Vision (ICCV), Araucano Park, Las Condes, Chile, 11–18 December 2015; pp.
1134–1142.
46. Zhang, L.; Shi, Z.; Wu, J. A hierarchical oil tank detector with deep surrounding features for high-resolution optical satellite
imagery. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2015, 8, 4895–4909. [CrossRef]
47. Bell, S.; Zitnick, C.L.; Bala, K.; Girshick, R. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural
networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26
June–1 July 2016; pp. 2874–2883.
48. Marcu, A.; Leordeanu, M. Dual local-global contextual pathways for recognition in aerial imagery. arXiv 2016, arXiv:1605.05462.
49. Kang, M.; Ji, K.; Leng, X.; Lin, Z. Contextual region-based convolutional neural network with multilayer fusion for SAR ship
detection. Remote Sens. 2017, 9, 860. [CrossRef]
50. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-style ConvNets Great Again. arXiv 2021,
arXiv:2101.03697v3.
51. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans.
Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [CrossRef] [PubMed]
52. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768.
53. Bai, J.; Zhu, J.; Zhao, R.; Gu, F.; Wang, J. Area-based non-maximum suppression algorithm for multi-object fault detection. Front.
Optoelectron. 2020, 13, 425–432. [CrossRef]
54. Rezatofighi, H.; Tsoi, N.; Gwak, J.Y.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss
for bounding box regression. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [CrossRef]
55. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression.
In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp.
12993–13000. [CrossRef]
56. Ma, J.; Shao, W.; Ye, H.; Wang, L.; Wang, H.; Zheng, Y.; Xue, X. Arbitrary-oriented scene text detection via rotation proposals.
IEEE Trans. Multimed. 2018, 20, 3111–3122. [CrossRef]
57. Liu, Z.; Yuan, L.; Weng, L.; Yang, Y. A high resolution optical satellite image dataset for ship recognition and some new baselines.
In Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods (ICPRAM), Porto, Portugal,
24–26 February 2017; pp. 324–331.
58. Wang, C.; Bai, X.; Wang, S.; Zhou, J.; Ren, P. Multiscale visual attention networks for object detection in VHR remote sensing
images. IEEE Geosci. Remote Sens. Lett. 2018, 16, 310–314. [CrossRef]
59. Zhang, Y.; Yuan, Y.; Feng, Y.; Liu, X. Hierarchical and robust convolutional neural network for very high-resolution remote
sensing object detection. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5535–5548. [CrossRef]
60. Cheng, G.; Zhou, P.; Han, J. Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote
sensing images. IEEE Trans. Geosci. Remote Sens. 2016, 54, 7405–7415. [CrossRef]
61. Li, K.; Cheng, G.; Bu, S.; You, X. Rotation-insensitive and context-augmented object detection in remote sensing images. IEEE
Trans. Geosci. Remote Sens. 2017, 56, 2337–2348. [CrossRef]
62. Wu, X.; Hong, D.; Tian, J.; Chanussot, J.; Li, W.; Tao, R. ORSIm detector: A novel object detection framework in optical remote
sensing imagery using spatial-frequency channel features. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5146–5158. [CrossRef]
63. Zou, Z.; Shi, Z. Random access memories: A new paradigm for target detection in high resolution aerial remote sensing images.
IEEE Trans. Image Process. 2017, 27, 1100–1111. [CrossRef] [PubMed]
64. Guo, W.; Yang, W.; Zhang, H.; Hua, G. Geospatial object detection in high resolution satellite images based on multi-scale
convolutional neural network. Remote Sens. 2018, 10, 131. [CrossRef]
65. Li, Y.; Huang, Q.; Pei, X.; Jiao, L.; Shang, R. RADet: Refine feature pyramid network and multi-layer attention network for
arbitrary-oriented object detection of remote sensing images. Remote Sens. 2020, 12, 389. [CrossRef]
66. Yang, X.; Liu, Q.; Yan, J.; Li, A.; Zhang, Z.; Yu, G. R3det: Refined single-stage detector with feature refinement for rotating object.
arXiv 2019, arXiv:1908.05612.
67. Liao, M.; Zhu, Z.; Shi, B.; Xia, G.S.; Bai, X. Rotation-sensitive regression for oriented scene text detection. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 5909–5918.
68. Liu, Z.; Hu, J.; Weng, L.; Yang, Y. Rotated region based CNN for ship detection. In Proceedings of the IEEE International
Conference on Image Processing, Beijing, China, 17–20 September 2017; pp. 900–904.
69. Zhang, Z.; Guo, W.; Zhu, S.; Yu, W. Toward arbitrary-oriented ship detection with rotated region proposal and discrimination
networks. IEEE Geosci. Remote Sens. Lett. 2018, 15, 1745–1749. [CrossRef]
