Article
Real-Time Object Detection and Tracking for Unmanned Aerial
Vehicles Based on Convolutional Neural Networks
Shao-Yu Yang 1 , Hsu-Yung Cheng 1, * and Chih-Chang Yu 2
Abstract: This paper presents a system applied to unmanned aerial vehicles based on the Robot Operating
System (ROS). The study addresses the challenges of efficient object detection and real-time target
tracking for unmanned aerial vehicles. The system utilizes a pruned YOLOv4 architecture for fast
object detection and the SiamMask model for continuous target tracking. A Proportional Integral
Derivative (PID) module adjusts the flight attitude, enabling stable target tracking automatically in
indoor and outdoor environments. The contributions of this work include exploring the feasibility of
pruning existing models systematically to construct a real-time detection and tracking system for
drone control with very limited computational resources. Experiments validate the system’s feasibility,
demonstrating efficient object detection, accurate target tracking, and effective attitude control. This
ROS-based system contributes to advancing UAV technology in real-world environments.
Keywords: UAV; deep learning; ROS; convolutional neural network; pruned network; target tracking
network; PID control
tools for handling inter-system connections, enabling developers to establish, control, and
monitor UAV systems more easily. It also offers numerous software modules and libraries
for various functionalities, such as sensor data processing, motion control, and mission
planning. The goal of this paper is to apply ROS technology to UAVs, allowing for a
modular system design and simplifying the overall development process, making UAV
development more accessible and efficient.
Many newly introduced drones on the market are equipped with tracking and fol-
lowing capabilities. Typically, this functionality is achieved through electronic devices
worn by the target, utilizing GPS coordinates to track and follow the target. However,
in environments where GPS signals are unavailable, such as tunnels or basements, GPS
positioning becomes ineffective or even impossible. Therefore, image tracking serves as
an auxiliary method to realize target tracking. One significant aspect of this research is to
explore the feasibility of controlling UAV systems via detection and tracking techniques
based on images. By utilizing detection and tracking techniques, drones can capture target
images through cameras and perform real-time analysis to achieve precise detection and
tracking of the position and orientation of the target. With the rapid development of deep
learning, many deep learning-based models have been proposed to address the problem of
object detection based on images. R-CNN [4] applies high-capacity convolutional neural
networks to bottom-up region proposals in order to localize and segment objects. SPP-
Net [5] utilizes spatial pyramid pooling to eliminate the requirement of fixed size input
images. Faster R-CNN [6] improves R-CNN and SPP-Net to reduce the training and testing
speed while also increasing the detection accuracy. The Single Shot MultiBox Detector
(SSD) [7] utilizes multi-scale convolutional bounding box outputs attached to multiple
feature maps at the top of the network to detect objects in images using a single deep
neural network. Unlike prior works that treat detection as a classification problem, the
work named You Only Look Once (YOLO) [8] considers object detection as a regression
problem to spatially separated bounding boxes and the associated class probabilities. A
single neural network that can be optimized end-to-end is used to predict bounding boxes
and class probabilities directly from full images in one evaluation. Therefore, YOLO has
achieved great success. YOLO9000 and YOLOv2 [9] improve the original YOLO by in-
troducing the concepts of batch normalization [10], high resolution classifier, convolution
with anchor boxes [6], dimension clusters, direct location prediction, fine-grained features,
and multi-scale training. YOLOv3 [11] introduces a set of incremental refinements that further
improve the detector. YOLOv4 [12] performs extensive experiments on the techniques of weighted
residual connections, cross-stage partial connections, cross mini-batch normalization, self-
adversarial training, mish-activation, mosaic data augmentation, drop-block regularization,
and CIoU loss, and it combines a subset of these techniques to achieve state-of-the-art
results. Person detection is a specialized form of object detection designed to identify the
specific class “person” within images or video frames. Therefore, we utilize YOLOv4 to
perform the detection task for the drone. To further reduce the computation complexity
of YOLOv4 so that it can be applied in the environment with very limited computational
resources, we perform pruning on the original YOLOv4 model.
Pruning methods have been proposed to reduce the complexity of CNN models [13–17].
Channel pruning intends to exploit the redundancy of feature maps between channels and
remove channels with the minimal performance loss [13]. Li et al. [14] proposed pruning
deep learning models using both channel-level and layer-level compression techniques. Liu
et al. [16] designed a pruning method that can be directly applied to existing modern CNN
architectures by enforcing channel-level sparsity in the network to reduce the model size,
decrease the run-time memory footprint and lower the number of computing operations
while maintaining the accuracy of the model. In [17], the authors demonstrate how to prune
YOLOv3 and YOLOv4 models and then deploy them on OpenVINO with an increased
frame rate and little accuracy loss. Since we utilize YOLOv4 for detection in the framework,
we refer to the pruning methods described in [16,17].
Tracking algorithms based on Siamese Networks have become mainstream for visual
tracking recently [18]. Bertinetto et al. [19] utilized a fully convolutional Siamese network
that can be trained end-to-end for tracking applications. Zhu et al. [20] designed a distractor-
aware module to perform incremental learning, which is able to transfer the general
embedding to the current video domain effectively. Li et al. [21] proposed a tracker based
on a Siamese region proposal network that is trained offline with large-scale image pairs.
A ResNet-driven Siamese tracker is trained in [22]. SiamMask [23] improves the offline
training procedure of popular fully convolutional Siamese methods for visual tracking by
augmenting their loss with a binary segmentation task.
In this work, we utilize the ROS [24] (Robot Operating System) to implement image
detection and tracking for controlling UAVs. Due to hardware constraints on the laptop,
lightweight models are required. Therefore, for the object detector, we train a convolutional
neural network based on the YOLOv4 architecture and prune it accordingly. In this work,
the target object for detection is a person. We employ the pruned version of the YOLOv4
object detector and the SiamMask [23] monocular object tracker to detect and track the
target person captured by the camera of the drone. Our system consists of four main
components: (1) object detection, (2) target tracking, (3) Proportional Integral Derivative
(PID) control, and (4) the UAV driver package. We utilize the Tello drone for implementing
the object detection and tracking system. During the tracking process, the UAV control
parameters include the roll, pitch, yaw, and altitude, all of which are controlled using PID
controllers. These PID controllers take the position and distance of the target object as
inputs. The position and distance are calculated using the monocular front-facing camera
of the UAV.
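To illustrate how these components can be wired together under ROS, a minimal rospy node is sketched below. The topic names, the command mapping, and the detect_or_track() helper are hypothetical placeholders used only to show the subscribe-process-publish pattern; they are not the released implementation.

```python
#!/usr/bin/env python
# Minimal sketch of the subscribe-process-publish wiring described above.
# Topic names and the detect_or_track() stub are illustrative placeholders,
# not the actual implementation used in this work.
import rospy
from sensor_msgs.msg import Image
from geometry_msgs.msg import Twist
from cv_bridge import CvBridge


def detect_or_track(frame):
    """Stand-in for Pruned-YOLOv4 detection, SiamMask tracking, and the PID
    controllers; returns (roll, pitch, yaw, altitude) commands (here: hover)."""
    return 0.0, 0.0, 0.0, 0.0


class TrackerNode:
    def __init__(self):
        self.bridge = CvBridge()
        # Hypothetical topic names; the real Tello driver package may differ.
        self.cmd_pub = rospy.Publisher("/tello/cmd_vel", Twist, queue_size=1)
        rospy.Subscriber("/tello/image_raw", Image, self.on_image, queue_size=1)

    def on_image(self, msg):
        frame = self.bridge.imgmsg_to_cv2(msg, "bgr8")   # 30 Hz image stream
        roll, pitch, yaw, alt = detect_or_track(frame)
        cmd = Twist()
        cmd.linear.y, cmd.linear.x = roll, pitch          # lateral / forward-backward
        cmd.linear.z, cmd.angular.z = alt, yaw            # vertical / rotation
        self.cmd_pub.publish(cmd)


if __name__ == "__main__":
    rospy.init_node("uav_detect_track")
    TrackerNode()
    rospy.spin()
```

In practice, the detector, tracker, and PID controllers can equally well run as separate ROS nodes that communicate over topics, which is what makes the modular design easy to extend.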
2. Approach
The details of the methods used in the proposed framework, including object detection,
model pruning, and visual tracking, are elaborated in this section. Figure 1 illustrates the
system framework. A laptop computer (PC) is connected to the Tello drone via Wi-Fi for
communication. The drone transmits images at a constant frequency of 30 Hz, which is
preconfigured in the drone’s driver software. These images are processed on the PC using
a pruned version of the YOLOv4 algorithm for object detection. Users have the ability
to select bounding boxes based on their requirements. The system utilizes the Siamese
network, called SiamMask, for object tracking. Based on the tracked object’s position and
distance, a tracking algorithm based on a PID controller is employed to calculate estimates
of the roll, pitch, yaw, and altitude. These estimated values for the roll, pitch, yaw, and
altitude are then sent back to the drone to initiate the tracking process and to utilize the
texture information of the background to enhance the final results.
The overall flowchart of the system architecture is shown in Figure 1. The drone sends
the image feed to the PC, where the received images are processed using Pruned-YOLOv4
for person detection. If a person is detected in the image, their bounding box is displayed
on the screen. The green boxes in Figure 1 represent the human detection results. If the user
selects a specific object of interest by clicking on its bounding box, the system extracts the
person within that bounding box as a template frame for the SiamMask network, enabling
subsequent tracking. The tracking algorithm calculates the error between the target and the
center of the frame. This error serves as the input for the PID controller, which generates
flight commands for the yaw, roll, and altitude. As for the fourth flight command, pitch, it
is calculated based on the relative distance of the tracked object using its position data. If
no target is detected in the image, the drone maintains its position until a target appears.
Figure 1. System framework.
Figure 2. Training steps for the detection model.
The second and third hyperparameters to adjust are the batch size and subdivisions. These settings are adjusted based on the GPU's performance. The batch size hyperparameter represents the number of images to load during training, with a default value of 64. However, if the GPU's memory size is insufficient, it will not be able to load 64 images at once. To address this issue, each batch is further subdivided into multiple sub-batches. Each sub-batch is fed into the GPU one by one until the batch is completed. In this study, we set the batch size and subdivisions to 64 and 8, respectively. The fourth hyperparameter to adjust is the number of iterations (note that in the Darknet framework, training is measured in iterations, not epochs). According to the Darknet framework's guidelines, each object class should have at least 2000 iterations. Since we have only one class, the number of iterations should exceed 2000. We set the number of iterations to 2200 to achieve higher accuracy.
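For reference, these three settings correspond to the batch, subdivisions, and max_batches keys of the Darknet .cfg file, where max_batches is the iteration count. The helper below, with an assumed file name, is one possible way to apply them; it is only a sketch.

```python
# Sketch: apply the hyperparameters discussed above to a Darknet .cfg file.
# batch, subdivisions, and max_batches are standard Darknet [net] keys
# (max_batches is the iteration count); the file name is hypothetical.
def patch_cfg(path, batch=64, subdivisions=8, max_batches=2200):
    overrides = {"batch": batch, "subdivisions": subdivisions, "max_batches": max_batches}
    patched = []
    with open(path) as f:
        for line in f:
            key = line.split("=")[0].strip()
            if "=" in line and key in overrides:
                line = "{}={}\n".format(key, overrides[key])
            patched.append(line)
    with open(path, "w") as f:
        f.writelines(patched)

# patch_cfg("yolov4-person.cfg")  # hypothetical single-class training config
```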
B. Pruning Stage
Due to the hardware limitations of the laptop, a lightweight model needs to be used. Therefore, after training the model using the Darknet framework, it needs to be pruned to achieve the goal of lightweighting. We use the metrics of accuracy (mAP@0.5) and inference speed (BFLOPs) to evaluate the pruned model. However, it is important to note that there is a trade-off between accuracy and inference speed. Assuming the hardware configuration is fixed, when the model is pruned to a very small size, its inference speed may increase but its accuracy is typically reduced.
Before pruning, the weights from the Darknet framework undergo a basic training process. Once the basic training is completed, the obtained model is pruned using the pruning strategy from [27]. This strategy involves first conducting sparse training on the model, where the channel sparsity in deep models helps with channel pruning. To facilitate channel pruning, each channel in the convolutional layers is associated with a scaling factor. During training, L1 regularization is applied to these scaling factors to automatically identify unimportant channels. Channels with smaller scaling factor values (orange color) are pruned (left side). After pruning, we obtain a compact model (right side), which is then fine-tuned to achieve comparable (or even higher) accuracy with the fully trained network. The pruning process is illustrated in Figure 3.
Figure 3. Pruning method.
C. Sparsity Training
We add a Batch Normalization (BN) layer after each convolutional layer in YOLOv4 to accelerate convergence and improve generalization. The BN layer utilizes batch statistics to normalize the convolutional features as:

$$ y = \gamma \times \frac{x - \bar{x}}{\sqrt{\sigma^2 + \varepsilon}} + \beta \quad (1) $$
Here, x̄ and σ² represent the mean and variance of the input features in the mini-batch, respectively. γ and β represent the trainable scale factor and bias in the BN layer. In this study, we directly use the scale factor in the BN layer as an indicator of channel importance. To effectively distinguish between important and unimportant channels, we apply L1 regularization to γ, enabling channel-level sparse training. The loss function for sparse training is shown as:
$$ L = loss_{yolo} + \alpha \sum_{\gamma \in \Gamma} f(\gamma) \quad (2) $$
The function f (γ) represents the L1 norm applied to γ, which is widely used in the
sparsification step. α represents the penalty factor that balances the two loss terms.
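A minimal PyTorch-style sketch of the sparsity term in Equation (2) is given below, under the assumption that the detector is expressed with standard BatchNorm2d layers; the YOLO loss itself is omitted.

```python
import torch.nn as nn

def bn_sparsity_penalty(model, alpha=1e-4):
    """L1 penalty on all BN scale factors (the gamma terms in Eq. (2))."""
    penalty = 0.0
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            penalty = penalty + m.weight.abs().sum()   # m.weight holds gamma
    return alpha * penalty

# total_loss = yolo_loss + bn_sparsity_penalty(model)   # Eq. (2)
```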
The effectiveness of pruning depends on the sparsity of the model. Prior to sparse
training, the distribution of γ in the BN layer of YOLOv4 is expected to be uniform. After
sparse training, most of the γ values in the BN layer are compressed toward zero. This
brings two benefits:
(1) Achieving network pruning and compression to improve model efficiency: The
weights in the BN layer are typically used for standardizing and scaling each input
sample in the network. When the weights are close to zero, the corresponding stan-
dardization and scaling operations are reduced, thereby reducing the computational
complexity.
(2) By sparsifying the weights of the BN layer close to zero, it becomes possible to identify
parameters that have minimal impact on network performance and prune them.
D. Channel cutting
Once sparse training is completed, channel cutting can be performed. Here is an
explanation of how to proceed with channel cutting. First, the total number of channels in
the backbone is computed. Once the number of channels is determined, the corresponding
γ values are stored in a variable and sorted in ascending order. The next step is to decide
which channels to keep and which ones to prune. This can be achieved by setting a pruning
rate, which represents the proportion of channels to be pruned. The pruning rate is typically
a value between 0 and 1, where a higher value indicates a greater degree of pruning. By
following these steps, the channel-cutting process can be carried out to selectively retain or
remove channels based on the specified pruning rate.
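The selection step can be summarized as in the following sketch, again assuming a PyTorch-style model: all γ values are gathered, sorted in ascending order, and a global threshold is taken at the position implied by the pruning rate.

```python
import torch
import torch.nn as nn

def channel_keep_masks(model, prune_rate=0.5):
    """Global threshold on BN gammas: sort ascending, cut the lowest fraction."""
    gammas = torch.cat([m.weight.data.abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    sorted_gammas, _ = torch.sort(gammas)                       # ascending order
    idx = min(int(prune_rate * len(sorted_gammas)), len(sorted_gammas) - 1)
    threshold = sorted_gammas[idx]
    return [m.weight.data.abs() > threshold                     # True = keep channel
            for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
```

Channels whose mask entry is False would then be removed from the corresponding convolutional layer and from the layers that consume its output.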
E. Layer cutting
Within the YOLOv4 backbone, there are multiple CSPX modules, where each CSPX
module consists of three CBL layers and X ResUnit modules. The resulting features of
these modules are concatenated together, as depicted in Figure 4a. For layer cutting, we
mainly prune the ResUnit within YOLOv4. The architecture of the ResUnit is illustrated
in Figure 4b, which consists of two CBL layers and a shortcut connection. The CBL layer
comprises a Conv layer, a BN layer, and a Leaky ReLU activation function, as shown in
Figure 4c. In layer cutting, the mean values of γ for each layer are first sorted, and by
evaluating the previous CBL layer of each shortcut, the minimum value can be selected for
layer pruning. To ensure the structural integrity of YOLOv4, when pruning one ResUnit,
both the shortcut layer and the preceding CBL layer are simultaneously pruned, resulting
in the pruning of three layers in total.
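The selection rule can be sketched as follows; the per-ResUnit mean γ values are assumed to have been collected beforehand, and the data structure is purely illustrative.

```python
def select_resunits_to_cut(mean_gamma_per_resunit, num_to_cut):
    """mean_gamma_per_resunit: {resunit index: mean gamma of the CBL layer that
    precedes its shortcut}. Returns the indices with the smallest mean gamma;
    each selected ResUnit is removed together with its shortcut and the
    preceding CBL layer (three layers per cut)."""
    ranked = sorted(mean_gamma_per_resunit, key=mean_gamma_per_resunit.get)
    return ranked[:num_to_cut]
```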
F. Fine-tuning
Different pruning strategies and threshold settings yield different effects on the pruned
model. Sometimes, the accuracy of the pruned model may even increase, although in
most cases, pruning can have a negative impact on model accuracy. In such cases, it is
necessary to perform fine-tuning on the pruned model to compensate for the accuracy loss
caused by pruning. Fine-tuning is crucial for restoring the accuracy of the pruned model.
In our experiments, we directly retrained the Pruned-YOLOv4 using the same training
hyperparameters as the normal training process for YOLOv4.
Figure 4. (a) CSPX, (b) ResUnit, and (c) CBL layer in YOLOv4.
2.3. Object Tracking and Drone Control
The laptop performs real-time detection on the images received from the drone, allowing users to track the detected targets until the tracking is completed. The objects detected using Pruned-YOLOv4 are represented by bounding boxes, each containing four coordinates (x, y, w, h). Here, x and y represent the coordinates of the top left corner of the bounding box, while w and h represent the width and height of the bounding box, respectively. Once the coordinates are obtained, we continuously detect the user's mouse position. If the mouse click falls within a bounding box, the four coordinates of the bounding box are passed to the object tracking module, which utilizes SiamMask [23]. SiamMask is a target-tracking algorithm based on Siamese Neural Networks [28]. Siamese Neural Networks were initially proposed by Bromley and LeCun to address signature verification problems [29] and have since been widely applied in various fields, such as image matching and target tracking.

In the task of target tracking, Siamese Neural Networks employ two identical subnetworks with shared parameters and weights. The tracking template is fed into the network, and the output weights are obtained. These weights are then matched with the output weights of the search region to calculate the similarity score. The target's location to be tracked is determined by computing the response score map. Building upon the traditional Siamese network, SiamMask incorporates target segmentation computation, which allows for the extraction of the target's contour. This helps mitigate the effects of target feature variations caused by rotation and deformation.

While performing tracking, SiamMask simultaneously returns an image with the bounding box of the tracked object. This bounding box contains information about the
object's position in the image. The bounding box of the tracked object consists of four points: $(x_{min}, y_{min})$, $(x_{max}, y_{min})$, $(x_{min}, y_{max})$, and $(x_{max}, y_{max})$.
We can use these four points to calculate the center of the object's position, which is computed as:

$$ (x_{center}, y_{center}) = \left( \frac{x_{min} + x_{max}}{2}, \frac{y_{min} + y_{max}}{2} \right) \quad (3) $$
In order to track an object accurately, it is necessary to know the exact center position of the drone's screen. This is because the detected object's center should always align with the center of the drone's screen for proper tracking. The calculation of the disparity between the center of the drone's screen and the object's center is performed as:

$$ e_x = imgx_{center} - x_{center} \quad (4) $$

$$ e_y = imgy_{center} - y_{center} \quad (5) $$

$e_x$ and $e_y$ should always be equal to or close to zero to achieve effective tracking.
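Equations (3)-(5) translate directly into a few lines of code. The sketch below assumes that the screen center is simply half of the frame width and height; the variable names are illustrative.

```python
def tracking_errors(box, frame_w, frame_h):
    """box = (x_min, y_min, x_max, y_max) from the tracker.
    Implements Eqs. (3)-(5): object center and its offset from the frame center
    (here taken as half of the frame width and height)."""
    x_min, y_min, x_max, y_max = box
    x_center = (x_min + x_max) / 2.0          # Eq. (3)
    y_center = (y_min + y_max) / 2.0
    e_x = frame_w / 2.0 - x_center            # Eq. (4)
    e_y = frame_h / 2.0 - y_center            # Eq. (5)
    return e_x, e_y
```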
The drone has a total of four control parameters: roll, yaw, altitude, and pitch. Roll controls the drone's lateral movement, yaw controls the drone's clockwise or counterclockwise rotation, altitude controls the drone's vertical movement, and pitch controls the drone's forward or backward movement. Figure 5 illustrates the basic flight maneuvers of the drone.
Figure 5. Fundamental maneuvers of a drone.
Next, we will explain how PID controls drone flight. It is evident that by using the center point of the tracked object and the center point of the screen, we can obtain the error in the X-axis. This error is related to the drone's roll for lateral movement and yaw for clockwise or counterclockwise rotation. If the drone detects that the object is moving left or right, we can choose to adjust the drone's heading to face the object or perform lateral movements to keep up with it. Additionally, there is the pitch axis, which involves forward and backward movements. By subtracting the distance between the drone and the real object from the desired ideal distance, we can calculate the distance error and control the drone's forward or backward movements accordingly. Finally, regarding altitude, by subtracting the Y-coordinate of the tracked object from the Y-coordinate of the screen center, we can calculate the altitude error and control the drone's vertical movement accordingly.
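The per-axis mapping described above can be sketched with one PID controller per axis, as follows. The gains and the reference distance are illustrative values that would require tuning; roll can be driven from the same x-axis error when lateral translation is preferred over yaw.

```python
class PID:
    """Textbook discrete PID controller; gains are illustrative, not the tuned
    values used in the experiments."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error, dt):
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt if dt > 0 else 0.0
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# One controller per axis, fed by the errors defined in Eqs. (4) and (5) and by
# the distance error; the reference distance of 150 corresponds to about 1.5 m.
yaw_pid, alt_pid, pitch_pid = PID(0.4, 0.0, 0.1), PID(0.5, 0.0, 0.1), PID(0.3, 0.0, 0.05)

def flight_commands(e_x, e_y, distance, dt, ref_distance=150):
    yaw = yaw_pid.step(e_x, dt)                           # rotate toward the target
    altitude = alt_pid.step(e_y, dt)                      # climb/descend to center it
    pitch = pitch_pid.step(ref_distance - distance, dt)   # hold the reference distance
    return yaw, altitude, pitch
```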
Figure 7. Examples of object movement (a) Roll (b) Yaw.
3. Experimental Results
In this section, we demonstrate and analyze the performance of the detection model
pruning, object detection, tracking and drone control.
Cut ResUnit Numbers    Precision    Recall    mAP@0.5
11                     0.914        0.175     0.5336
12                     0.91         0.132     0.494
13                     0.91         0.131     0.483
14                     0.912        0.116     0.463
15                     0.929        0.0188    0.335
Cut ResUnit Numbers    Params    Size of .weights    BFLOPs
11                     4.6 M     18 M                19.490
12                     4.4 M     17.2 M              19.220
13                     4.3 M     17.1 M              18.991
14                     4.2 M     16.8 M              18.610
15                     4.15 M    16.2 M              17.869
3.2. Subject Tracking and Drone Control
The aim of this study is to explore the feasibility and effectiveness of automated control in real-world applications. To achieve this goal, we design a series of experiments to simulate the exploration needs of drones in real environments and require the drones to successfully track target objects automatically. To ensure the reliability of the experimental results, we perform experiments in both indoor and outdoor environments. By conducting experiments in these locations, we are able to better assess the adaptability and performance of the automated control in various real-world scenarios. Figures 8 and 9 list the selected outdoor scenes and indoor scenes in the experimental videos, respectively. The subjects being tracked include ten different people. Each person is tracked for 50 to 90 s in outdoor and indoor environments five times. During the tracking process, a random number of 0 to 7 other people would appear as passersby in the scene.
Figure 8. Selected scenes in outdoor environments.
Figure 10. Drone three-axis orientation.

B. Analysis of PID control for drones
In this sub-section, we demonstrate how the drone continuously tracks the target object in flight and adjusts its flight actions as the target object moves. Two experimental videos are selected to demonstrate the tracking and control processes. In video 1, the tracked subject walks on a flat surface, as shown in Figure 11a. The four directions that the subject moves are represented as the red, blue, yellow, and purple arrows in Figure 11a. The response of the PID control to the error in the x-axis position of the tracked object in video 1 is plotted in Figure 11b. Figure 11b shows that as the target object moves, the error increases, and the PID control quickly corrects the error to minimize it toward zero. The drone continuously tracks the target object, while the PID control attempts to reduce the x-axis error of the target object. For the error in the y-axis, since the subject in video 1 does not undergo significant changes in height, we can observe from Figure 11c that there is not a significant variation in the y-axis error. As for the distance in the z-axis position of the tracked object, the distance between the drone and the target object is fixed at a reference value of 150, which corresponds to a distance of 1.5 m on the ground between the target object and the drone. When the target object moves forward and backward, the drone has to continuously track the target object. Figure 11d plots the response of the PID control to the error in the z-axis position of the tracked object in the selected video. It demonstrates that the error varies with the distance between the target object and the drone, and the PID control attempts to reduce the z-axis error of the target object.
In video 2, the tracked subject moves upstairs and downstairs, as shown in Figure 12a. The red arrow and blue arrow represent the directions moving up and down the stairs, respectively. In video 2, the drone needs to follow the subject and fly straight up or down along the stairs. The main focus is to test whether the drone can adjust the y-axis error in real time. Figure 12a–c show the response of the PID control to the error in the x-axis and y-axis positions and the distance in the z-axis position of the tracked object in video 2, respectively. Figure 12 demonstrates that the PID control continuously adjusts the x-axis and y-axis errors to approach zero as the subject moves forward and backward while ascending or descending the stairs.
C. Tracking Accuracy Evaluation
The mean absolute errors between the target and the center of the frame for tracking are listed in Table 10. The errors in the x-axis are slightly higher than the errors in the y-axis because the subjects change their moving directions in most experimental videos. When the targets being tracked are moving on the ground plane without ascending or descending stairs, the errors in the y-axis are close to zero. The tracking errors in the outdoor environments are higher than the errors in indoor environments due to the influence of wind.
Table 10. Mean absolute error of tracking.
          Indoor    Outdoor
x-axis    128.95    137.62
y-axis    98.73     116.98
3.3. Discussion
It is a challenging issue to balance the size of the model parameters and the accuracy of the model in the process of pruning. Through analyzing the Precision, Recall, and mAP as well as the parameter sizes under various pruning rates, we are able to determine a suitable pruning rate to balance the trade-offs. Also, fine-tuning the pruned model is helpful to recover and increase the model accuracy. The PID control process, which continuously minimizes the error between the subject being tracked and the center of the frame, can complete the task of automatic drone control and maintain a stable flight path in real time. This is attributed to the ability of the automatic control system to promptly adjust to the position and movement of the target object to minimize the error between the reference position and the actual position. This allows the drone to track target objects accurately and respond quickly to changes. Automated control is of great significance to human–machine collaboration. It extends the high cognitive capabilities of human operators. At the same time, automated control ensures good execution efficiency and stability. Based on the above observations and analysis, automatic drone control has obvious advantages. It can provide accurate and stable tracking capabilities while taking into account the execution efficiency of automated control. This collaborative model allows researchers and operators to participate in drone missions and leverage their respective expertise while achieving higher efficiency with the help of automatic control systems.
4. Conclusions
In this paper, we propose an implementation method for an object detection and target tracking system based on the Robot Operating System (ROS) and apply it to the Tello drone. The system achieves efficient object detection and target-tracking capabilities in real-time environments. We utilize the pruned YOLOv4 architecture as the detection model
and select SiamMask as the tracking model. Additionally, we introduce a PID module
to calculate the errors and determine the flight attitude and action. For the detection
module, we choose the pruned YOLOv4 architecture, which provides a faster execution
speed while maintaining the detection accuracy. By reducing the redundant parameters
and computations in the model, we achieve lightweight and accelerated performance. This
allows our system to efficiently perform object detection tasks in real-time environments.
For the tracking module, we adopt the SiamMask model. SiamMask is a single-object
tracking method capable of real-time target tracking. In our system, SiamMask is used to
track the objects detected by YOLOv4, enabling continuous object tracking and positioning.
Furthermore, we introduce the PID module to calculate the errors and adjust the flight
attitudes. PID is a classical control algorithm that computes control signals based on the
current error, accumulated error, and rate of error change, aiming to bring the system
output closer to the desired value. In our system, the PID module calculates errors based
on the target object’s position and the drone’s current state, and adjusts the drone’s attitude
control signals to stably track the target object. Through flight experiments, we validate
the feasibility of applying this system in everyday environments. The pruned YOLOv4
model provides efficient object detection capabilities, enabling fast target detection in real-
time environments. SiamMask is used for tracking the target object, and the PID module
accurately calculates the errors and adapts to different flight situations, allowing the drone
to stably track the target object.
References
1. Attention Drone Geeks! Here’s Some Answers You’ve Been Looking for. The Local Brand. Available online: https://thelocalbrand.
com/attention-drone-geeks-some-answers/ (accessed on 18 November 2023).
2. Amazon Plans to Start Drone Deliveries in the UK and Italy Next Year. Engadget. Available online: https://www.engadget.com/
amazon-plans-to-start-drone-deliveries-in-the-uk-and-italy-next-year-185027120.html (accessed on 18 November 2023).
3. Operation and Certification of Small Unmanned Aircraft. Federal Aviation Administration. Available online: https:
//www.federalregister.gov/documents/2016/06/28/2016-15079/operation-and-certification-of-small-unmanned-aircraft-
systems#h-33 (accessed on 18 November 2023).
4. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), Columbus, OH, USA,
23–28 June 2014; pp. 580–587.
5. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. In Proceedings
of the 13th European Conference on Computer Vision (ECCV 2014), Zurich, Switzerland, 6–12 September 2014; pp. 346–361.
6. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-time Object Detection with Region Proposal Networks. In
Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS 2015), Montreal, QC, Canada,
7–12 December 2015; pp. 91–99.
7. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.; Berg, A.C. SSD: Single Shot Multibox Detector. In Proceedings of
the 14th European Conference on Computer Vision (ECCV 2016), Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37.
8. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of
the 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 26 June–1 July 2016;
pp. 779–788.
9. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 30th IEEE International Conference on
Computer Vision (CVPR 2017), Venice, Italy, 22–29 October 2017; pp. 6517–6525.
10. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In
Proceedings of the 32nd International Conference on International Conference on Machine Learning (ICML 2015), Lille, France,
6–11 July 2015; pp. 448–456.
11. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [CrossRef]
12. Bochkovskiy, A.; Wang, C.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
[CrossRef]
13. He, Y.; Zhang, X.; Sun, J. Channel Pruning for Accelerating Very Deep Neural Networks. In Proceedings of the IEEE International
Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1398–1406.
14. Li, Q.; Li, H.; Meng, L. Deep Learning Architecture Improvement Based on Dynamic Pruning and Layer Fusion. Electronics 2023,
12, 1208. [CrossRef]
15. Liu, X.; Li, C.; Jiang, Z.; Han, L. Low-Complexity Pruned Convolutional Neural Network Based Nonlinear Equalizer in Coherent
Optical Communication Systems. Electronics 2023, 12, 3120. [CrossRef]
16. Liu, Z.; Li, J.; Shen, Z.; Huang, G.; Yan, S.; Zhang, C. Learning Efficient Convolutional Networks Through Network Slimming.
In Proceedings of the 30th IEEE International Conference on Computer Vision (CVPR 2017), Venice, Italy, 22–29 October 2017;
pp. 2736–2744.
17. Pruned-OpenVINO-YOLO. TNTWEN. Available online: https://github.com/TNTWEN/Pruned-OpenVINO-YOLO (accessed
on 10 May 2023).
18. Li, J.; Zhang, K.; Gao, Z.; Yang, L.; Zhuo, L. SiamPRA: An Effective Network for UAV Visual Tracking. Electronics 2023, 12, 2374.
[CrossRef]
19. Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H.S. Fully-Convolutional Siamese Networks for Object Tracking.
In Proceedings of the European Conference on Computer Vision (ECCV 2016), Amsterdam, The Netherlands, 11–14 October 2016;
pp. 850–865.
20. Zhu, Z.; Wang, Q.; Li, B.; Wu, W.; Yan, J.; Hu, W. Distractor-aware Siamese Networks for Visual Object Tracking. In Proceedings
of the European Conference on Computer Vision (ECCV 2018), Munich, Germany, 8–14 September 2018; pp. 101–117.
21. Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High Performance Visual Tracking with Siamese Region Proposal Network. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 18–23 June 2018;
pp. 8971–8980.
22. Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks.
In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA,
15–20 June 2019; pp. 4282–4291.
23. Wang, Q.; Zhang, L.; Bertinetto, L.; Hu, W.; Torr, P. Fast Online Object Tracking and Segmentation: A Unifying Approach. In
Proceedings of the 32nd IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA,
15–20 June 2019; pp. 1328–1338.
24. Quigley, M.; Gerkey, B.; Conley, K.; Faust, J.; Foote, T.; Leibs, J.; Berger, E.; Wheeler, R.; Ng, A. ROS: An Open-Source Robot
Operating System. In Proceedings of the IEEE International Conference on Robotics and Automation, Workshop on Open Source
Software (ICRA 2009), Kobe, Japan, 12–17 May 2009; pp. 1–6.
25. Tello Edu. Ryze Robotics. Available online: https://www.ryzerobotics.com/zh-tw/tello-edu?site=brandsite&from=
landing_page (accessed on 18 November 2023).
26. YOLOv4 Baseline Training. Available online: https://github.com/AlexeyAB/Darknet (accessed on 1 June 2023).
27. Zhang, P.; Zhong, Y.; Li, X. SlimYolov3: Narrower, Faster, and Better for Real-Time UAV Applications. In Proceedings of the
IEEE/CVF International Conference on Computer Vision Workshop (ICCVW 2019), Seoul, Republic of Korea, 27–28 October
2019; pp. 37–45.
28. Koch, G.; Zemel, R.; Salakhutdinov, R. Siamese Neural Networks for One-Shot Image Recognition. In Proceedings of the
International Conference on Machine Learning Deep Learning Workshop (ICML 2015), Lille, France, 6–11 July 2015; pp. 1–8.
29. Bromley, J.; LeCun, Y. Signature Verification Using a “Siamese” Time Delay Neural Network. In Proceedings of the Advances in
the 6th Neural Information Processing Systems, Denver, CO, USA, November 1993; pp. 737–744.
30. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in
Context. In Proceedings of the 13th European Conference on Computer Vision (ECCV 2014), Zurich, Switzerland, 6–12 September
2014; pp. 740–755.
31. Drone-Face-Tracking. Available online: https://github.com/murtazahassan/Drone-Face-Tracking (accessed on 12 March 2023).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.