Raqamli Transformatsiya va Sun’iy Intellekt ilmiy jurnali VOLUME 2, ISSUE 4, AUGUST 2024

ISSN: 3030-3346

AN OBJECT TRACKING METHOD BASED ON IMPROVED YOLOV3 MODEL AND KALMAN FILTER FOR UAV APPLICATIONS

Sukhrob Atoev¹, Akhram Nishanov²

¹ Department of Software of Information Technologies, Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, Tashkent, Uzbekistan
² Department of System and Applied Programming, Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, Tashkent, Uzbekistan

E-mail: sukhrob.reus@gmail.com

KEYWORDS: object detection, improved YOLOv3 model, Kalman filter, unmanned aerial vehicle.

ABSTRACT: Unmanned Aerial Vehicles (UAVs) have gained significant attention in various applications, including surveillance, search and rescue, military tasks, delivery services, and object tracking. Efficient and accurate object tracking is a crucial task for enabling UAVs to perform complex missions autonomously. In this research paper, we propose an object tracking method that combines the improved YOLOv3 model and the Kalman filter to enhance the tracking capabilities of UAVs. The improved YOLOv3 model is utilized for real-time object detection, providing initial bounding box predictions. However, due to the inherent limitations of YOLOv3 in handling occlusions and abrupt motion changes, the proposed method incorporates a Kalman filter to refine and predict the object's state over time. By fusing the object detection results with the Kalman filter, the proposed method achieves robust and accurate tracking, even in challenging scenarios.

INTRODUCTION

In recent years, deep learning has revolutionized the field of object detection, achieving remarkable accuracy and performance. Object detection is a computer vision technique that involves training deep neural networks to identify and locate objects within images or video frames [1-3]. Unmanned Aerial Vehicles (UAVs) have likewise transformed various domains, from aerial surveillance and mapping to disaster response and delivery services. Over the years, numerous object detection and tracking methods [4],[5] have been developed to address the specific challenges posed by UAV applications, such as occlusions, scale variations, and abrupt motion changes.

Object detection algorithms are commonly divided into three categories: traditional computer vision, two-stage detectors, and single-stage detectors. Traditional computer vision algorithms [6-9] were widely used for object detection before the dominance of deep learning models. While deep learning models have achieved state-of-the-art performance in recent years, traditional algorithms remain relevant, especially in scenarios involving limited computational resources, interpretability requirements, or other specific constraints.

Two-stage detectors are object detection models that typically consist of two main components: region proposal and object classification/refinement. These detectors aim to improve accuracy by employing a two-step process. The region-based convolutional neural network (R-CNN) is one of the pioneering two-stage detectors [10]. However, R-CNN is computationally expensive, so faster successors had to be introduced to overcome its problems; Fast R-CNN [11] was proposed to address several of these pre-existing issues.

The single-shot detector (SSD) [12] for multi-box predictions is one of the fastest ways to achieve real-time object detection. While the Faster R-CNN methodology can achieve high prediction accuracy, its overall pipeline is quite time-consuming, running at only about 7 frames per second, which is far from the desired real-time rate. SSD addresses this issue, improving the frame rate to almost five times that of the Faster R-CNN model.

You Only Look Once (YOLO) [13] is one of the most popular model architectures and algorithms for object detection; it is usually the first architecture encountered when searching for object detection algorithms. Among the various object detection and tracking frameworks, the YOLO version 3 (YOLOv3) model has gained significant attention due to its real-time performance and high accuracy [14]. In this research paper, we propose an object tracking method that leverages the improved YOLOv3 model and the Kalman filter to enhance the tracking capabilities of UAVs. Our goal is to overcome the challenges associated with occlusions, scale changes, and abrupt motion, enabling UAVs to track objects accurately and efficiently.

PROPOSED METHOD

Improved YOLOv3 Model.

Object detection using the YOLOv3 model is a popular and powerful approach for real-time object detection. YOLOv3 can detect multiple objects within an image and provide their bounding box coordinates along with class probabilities. As shown in Figure 1, the backbone consists of 53 layers which, combined with the detection layers, results in a fully convolutional network of 106 layers. The improved YOLOv3 utilizes a deep CNN as its backbone: a modified version of Darknet consisting of multiple convolutional layers followed by downsampling operations. The backbone extracts rich feature representations from the input image, capturing both the low-level and high-level features necessary for object detection.

The process starts by splitting the image into an S × S grid (at three different scales, S is equal to 13, 26, and 52). Each cell is responsible for predicting bounding boxes and their associated class probabilities. Within each cell, the improved YOLOv3 uses anchor boxes of different sizes and aspect ratios to detect objects of various scales and shapes. The model has multiple prediction layers at different scales, each responsible for detecting objects of different sizes; these layers predict bounding boxes, class probabilities, and confidence scores at their respective scales.

Figure 1. The architecture of an improved YOLOv3 model
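To make the multi-scale prediction concrete, the sketch below computes the shape of each YOLOv3-style output tensor for the three grid scales. It is an illustrative calculation only, not code from the authors' implementation; the anchor count of 3 per scale follows standard YOLOv3, and the 6 classes are those used later in this paper.

```python
# Shape of the prediction tensor at each of the three YOLOv3 scales.
# Each grid cell predicts B anchor boxes, and each box carries
# 4 coordinates + 1 objectness score + C class probabilities.
NUM_ANCHORS = 3   # standard YOLOv3 uses 3 anchors per scale
NUM_CLASSES = 6   # human, drone, car, dog, horse, bird

for S in (13, 26, 52):
    depth = NUM_ANCHORS * (5 + NUM_CLASSES)  # 3 * (5 + 6) = 33 channels
    print(f"scale {S}x{S}: output tensor shape = ({S}, {S}, {depth}), "
          f"{S * S * NUM_ANCHORS} candidate boxes")
```

Summing over the three scales gives (13² + 26² + 52²) × 3 = 10,647 candidate boxes per image, which are then filtered by confidence thresholding and non-maximum suppression.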

In this research work, the YOLOv3 model is improved by using the residual block instead of the regular convolution block. Moreover, the Mish activation function is used to increase the accuracy of the object detection network. The architectural difference between a normal convolution block and a residual block is the addition of a skip connection, which carries the input to the deeper layers.

Deep neural networks are difficult to train: as depth increases, network accuracy can saturate, which leads to higher training error. The residual block was introduced to solve this problem. For the activation function, Leaky ReLU would be more suitable if the goal were to maximize speed without sacrificing much accuracy:

$$f(x) = \begin{cases} 0.01x, & x < 0 \\ x, & x \geq 0 \end{cases} \quad (1)$$

If, instead, accuracy should be maximized, Mish is the better option:

$$f(x) = x \tanh(\ln(1 + e^x)) \quad (2)$$
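The following is a minimal sketch, assuming a PyTorch implementation (the paper does not specify its framework), of a residual block that uses Mish from Eq. (2). The layer sizes follow the Darknet convention of a 1×1 bottleneck followed by a 3×3 convolution; the exact channel counts in the authors' model may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mish(x):
    # Mish activation, Eq. (2): f(x) = x * tanh(ln(1 + e^x)) = x * tanh(softplus(x))
    return x * torch.tanh(F.softplus(x))

class ResidualBlock(nn.Module):
    """Darknet-style residual block: 1x1 reduce -> 3x3 expand -> skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels // 2, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels // 2)
        self.conv2 = nn.Conv2d(channels // 2, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = mish(self.bn1(self.conv1(x)))
        out = mish(self.bn2(self.conv2(out)))
        return out + x  # the skip connection carries the input to deeper layers
```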
Moreover, the proposed model is optimized with a loss function that consists of three parts. The localization loss defines how well the generated bounding box correlates with the ground-truth bounding box. The confidence loss reflects the confidence that an object is present in the grid cell and that the generated anchor is responsible for its prediction. The last part is the classification loss, which reflects the difference between the actual and predicted class probabilities, as shown in the following equation:

$$L_{loss} = L_{loc} + L_{conf} + L_{class} \quad (3)$$
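As an illustration of Eq. (3), the sketch below combines three standard loss terms in the way YOLO-family implementations typically do. The specific choice of mean-squared error for localization and binary cross-entropy for the confidence and class terms is an assumption made for illustration; the paper does not state which base losses its implementation uses.

```python
import torch.nn.functional as F

def yolo_total_loss(pred_boxes, true_boxes, pred_conf, true_conf,
                    pred_cls, true_cls):
    # Localization loss: how well predicted boxes match ground-truth boxes (assumed MSE).
    l_loc = F.mse_loss(pred_boxes, true_boxes)
    # Confidence loss: whether an object is present in the cell/anchor (assumed BCE).
    l_conf = F.binary_cross_entropy_with_logits(pred_conf, true_conf)
    # Classification loss: actual vs. predicted class probabilities (assumed BCE).
    l_class = F.binary_cross_entropy_with_logits(pred_cls, true_cls)
    return l_loc + l_conf + l_class  # Eq. (3)
```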

In the final step, YOLOv3 predicts class


probabilities for each bounding box. The model
assigns class probabilities to a predefined set of
object classes. The class probabilities represent the
confidence of the model that an object of a specific
class is present within a bounding box.

Figure 2. The flowchart of the proposed method
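A minimal on-board processing loop corresponding to Figure 2 might look like the sketch below. The `detect_objects` and `KalmanTracker` calls are hypothetical placeholders for the improved YOLOv3 inference and the filter described in the next subsection, and `send_to_gcs` stands in for the unspecified wireless link to the ground control station.

```python
import cv2  # OpenCV, assumed here for frame capture from the gimbal camera

def tracking_loop(video_source=0):
    cap = cv2.VideoCapture(video_source)  # gimbal camera stream
    tracker = KalmanTracker()             # hypothetical: filter from Eqs. (4)-(10)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        detections = detect_objects(frame)  # hypothetical: improved YOLOv3 inference
        track = tracker.update(detections)  # predict + update steps of the Kalman filter
        send_to_gcs(track)                  # hypothetical: wireless link to the GCS
    cap.release()
```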

As shown in Figure 2, the object tracking method consists of three modules: the target selection module in the first frame of the sequence, the object detection module, and the Kalman filter module. The position of the object in the first frame is chosen by the selection module, which sets the initialization parameters, including the size, width, and length of the search window. Once an object is detected, the improved YOLOv3 creates the corresponding bounding box. To estimate the position of the object in each subsequent frame of the sequence, the Kalman filter is used in this work. The input parameters of the Kalman filter are the position of the object in the image at time $k$, the size of the object, and the width and length of the object search window, which change due to the object's mobility during the sequence.

The state vector and the measurement vector of the Kalman filter can be represented by these input parameters. The state vector consists of the initial position $(x_k, y_k)$, the width $W_k$ and length $L_k$ of the search window, and the center of mass of the object $(x_c, y_c)$ at time $t_k$:

$$s_k = (x_k, y_k, W_k, L_k, x_c, y_c) \quad (4)$$

The measurement vector, in turn, consists of the initial position, width, and length of the search window of the object at time $t_k$:

$$z_k = (x_k, y_k, W_k, L_k) \quad (5)$$

The Kalman filter estimates the state as a discrete-time process modeled by the linear equation:

$$s_k = A s_{k-1} + w_{k-1} \quad (6)$$

where $A$ is the transition matrix and $w_k$ is the process noise. The measurement model can be written as:

$$z_k = H s_k + v_k \quad (7)$$

where $H$ is the measurement matrix and $v_k$ is the measurement noise.

In the prediction step, the Kalman filter estimates the future state of the object based on the previous state and the system's dynamic model. The predicted state of the object at time $k + 1$ is calculated as:

$$x_{k+1|k} = A_k x_{k|k} + B_k u_k \quad (8)$$

where $x_{k+1|k}$ is the estimated state of the object at time $k + 1$, $B_k$ is the control input matrix that represents any external influences on the object's state, and $u_k$ is the control input vector.

The predicted covariance matrix $P$ of the state is calculated as:

$$P_{k+1|k} = A_k P_{k|k} A_k^T + G_k Q_k G_k^T \quad (9)$$

where $G_k$ is the process noise gain matrix and $Q_k$ is the process noise covariance.

In the update step, the Kalman filter corrects the state estimate based on the measurements received from sensors. The predicted measurement is calculated as:

$$z_{k+1|k} = H_{k+1} x_{k+1|k} \quad (10)$$

By iteratively performing the prediction and update steps, the Kalman filter continuously estimates and updates the state of the object, providing accurate tracking even in the presence of noise and uncertainties.
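Below is a minimal NumPy sketch of the prediction and update steps in Eqs. (6)-(10). It assumes no control input ($B_k u_k = 0$) and an identity process noise gain ($G = I$), neither of which the paper specifies; the measurement matrix $H$ would be a 4×6 matrix selecting the $(x, y, W, L)$ components of the state in Eq. (4), matching Eq. (5).

```python
import numpy as np

class KalmanTracker:
    """Kalman filter over the state s = (x, y, W, L, xc, yc), Eqs. (4)-(10)."""
    def __init__(self, s0, P0, A, H, Q, R):
        self.s, self.P = s0, P0   # state estimate and its covariance
        self.A, self.H = A, H     # transition and measurement matrices
        self.Q, self.R = Q, R     # process and measurement noise covariances

    def predict(self):
        # Eq. (8) with no control input, and Eq. (9) with G = I
        self.s = self.A @ self.s
        self.P = self.A @ self.P @ self.A.T + self.Q
        return self.H @ self.s    # predicted measurement, Eq. (10)

    def update(self, z):
        # Standard Kalman correction: residual, gain, then state/covariance update
        y = z - self.H @ self.s                   # innovation (measurement residual)
        S = self.H @ self.P @ self.H.T + self.R   # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)  # Kalman gain
        self.s = self.s + K @ y
        self.P = (np.eye(len(self.s)) - K @ self.H) @ self.P
        return self.s
```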
Evaluation Metrics

Evaluating object detection models requires various metrics that assess performance across different aspects. To evaluate the proposed object detection method, we use the precision metric, which measures the proportion of true positives among all detections:

$$Precision = \frac{TP}{TP + FP} \quad (11)$$

where $TP$ is the number of true positives (correctly detected objects) and $FP$ is the number of false positives (incorrectly detected objects).

The recall metric measures the proportion of true positives detected out of all actual objects:

$$Recall = \frac{TP}{TP + FN} \quad (12)$$

where $FN$ is the number of false negatives (missed objects).

Mean Average Precision (mAP) is the mean of the AP values calculated over multiple object classes. It provides an overall performance measure for an object detection algorithm across different object classes and is usually calculated as the average of the per-class AP values:

$$mAP = \frac{1}{n} \sum_{k=1}^{n} AP_k \quad (13)$$

where $AP_k$ is the average precision of class $k$ and $n$ is the number of classes.

To evaluate the object tracking system in our experiments, we used the center location error (CLE) of the tracked target. CLE is defined as the Euclidean distance between the estimated center of the target and the ground-truth center in each image frame:

$$CLE = \| O_k^E - O_k^{GT} \| \quad (14)$$

where $O_k^E$ and $O_k^{GT}$ are the estimated and ground-truth centers of the target object.
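The sketch below, a minimal NumPy illustration rather than the authors' evaluation code, computes precision and recall from detection counts (Eqs. (11)-(12)), mAP from per-class AP values (Eq. (13)), and CLE from center coordinates (Eq. (14)).

```python
import numpy as np

def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)                 # Eq. (11)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)                 # Eq. (12)

def mean_average_precision(ap_per_class) -> float:
    return float(np.mean(ap_per_class))   # Eq. (13)

def center_location_error(center_est, center_gt) -> float:
    # Eq. (14): Euclidean distance between estimated and ground-truth centers
    return float(np.linalg.norm(np.asarray(center_est) - np.asarray(center_gt)))

# Example: mAP over the six classes in Table 1 for the improved YOLOv3
print(mean_average_precision([90.2, 88.7, 88.4, 91.2, 92.7, 92.3]))  # ~90.58
```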
3. EXPERIMENTAL RESULTS

In this research work, we used a computer with an Intel(R) Core(TM) i7-8800 CPU at 3.80 GHz, 32 GB of RAM, a 256 GB SSD, a 1 TB HDD, and the Windows 10 (64-bit) operating system to evaluate the performance of the proposed approach in terms of CLE and mAP. The proposed method was implemented in the Python programming language; the Electron.js framework and the JavaScript programming language were used to develop the ODDA 1.1 desktop application. Mean Average Precision (mAP) results for different object detection algorithms are shown in Table 1. It is observed that R-CNN achieves the lowest accuracy among the object detectors at localizing objects within images by generating region proposals and aligning them with ground-truth bounding boxes.

Table 1. Mean Average Precision (mAP) results for different algorithms

| Video | R-CNN | Faster R-CNN | YOLOv3 | YOLOv5 | YOLOv7 | Improved YOLOv3 |
|-------|-------|--------------|--------|--------|--------|-----------------|
| Human | 72.1  | 80.4         | 82.4   | 83.3   | 85.1   | 90.2            |
| Drone | 64.0  | 70.3         | 75.2   | 84.2   | 86.2   | 88.7            |
| Bird  | 60.2  | 63.2         | 78.3   | 84.3   | 88.7   | 88.4            |
| Dog   | 71.3  | 82.5         | 85.6   | 86.5   | 89.6   | 91.2            |
| Car   | 70.2  | 84.2         | 88.4   | 88.2   | 90.3   | 92.7            |
| Horse | 68.1  | 74.7         | 80.0   | 90.3   | 91.2   | 92.3            |

The center location error (CLE) results for different object tracking algorithms are shown in Table 2. Based on these results, it can be seen that the MIL algorithm demonstrated the worst performance in terms of the CLE metric.

Table 2. Center location error (CLE) results for different algorithms

| Video | MIL  | STRUCK | KCF  | SORT | GLOM | Proposed |
|-------|------|--------|------|------|------|----------|
| Human | 13.2 | 13.6   | 11.5 | 10.3 | 9.5  | 7.5      |
| Drone | 15.6 | 14.2   | 12.6 | 11.5 | 9.6  | 7.4      |
| Bird  | 13.7 | 13.3   | 12.8 | 11.7 | 11.4 | 9.6      |
| Dog   | 11.4 | 7.5    | 8.2  | 8.2  | 8.3  | 7.5      |
| Car   | 11.3 | 8.3    | 10.3 | 9.4  | 9.5  | 6.4      |
| Horse | 12.2 | 9.6    | 11.3 | 11.7 | 11.6 | 8.2      |

Here, MIL – Multiple Instance Learning [15], STRUCK – Structured output tracking with kernels [16], KCF – Kernelized Correlation Filters [17], SORT – Simple online and real-time tracking [18], GLOM – Global-Local Object Model [19].
The research work was tested on six different classes of objects. From the experimental results shown in Table 1 and Table 2, we can observe that the proposed method demonstrated strong performance on each image sequence.

4. CONCLUSION

Our proposed approach is based on deep learning algorithms. For the object detection part, we developed an improved YOLOv3 model based on neural networks; the detected objects were then tracked using the Kalman filter. By using modern CNN-based methods such as YOLOv3, a single-stage detector that predicts bounding boxes directly rather than relying on the selective search algorithm for region proposals, we can move closer to real-time object detection, as the results demonstrate. We can conclude that the improved YOLOv3 represents a significant advancement in object detection, providing a robust and accurate framework for detecting objects in images. Its ability to perform localization and classification within a unified architecture has made it a popular choice for researchers and practitioners in the field.

REFERENCES

1. K. P. Kumar and K. S. Sudeep, "Preprocessing for Image Classification by Convolutional Neural Networks," in IEEE International Conference on Recent Trends in Electronics, Information and Communication Technology, May 20-21, 2016, India.
2. H. Law and J. Deng, "CornerNet: Detecting objects as paired keypoints," in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
3. N. Carion, et al., "End-to-end object detection with transformers," in European Conference on Computer Vision, Springer, Cham, 2020.
4. L. Zhang, D. Du, X. Wang, and G. Wu, "Object detection and tracking for unmanned aerial vehicle systems: A review," IEEE Transactions on Aerospace and Electronic Systems, 55(2), 2019, pp. 957-975.
5. X. Bai, X. Wang, and Z. Luo, "Real-time object detection and tracking for UAVs using deep learning," Sensors, 19(21), 4807, 2019.
6. N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2005, pp. 886-893.
7. H. Bay, et al., "Speeded-up robust features (SURF)," Computer Vision and Image Understanding, 110(3), 2008, pp. 346-359.
8. P. Arbeláez, et al., "Contour detection and hierarchical image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 33(5), 2011, pp. 898-916.
9. P. Dollar and C. L. Zitnick, "Structured forests for fast edge detection," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013, pp. 1841-1848.
10. R. Girshick and T. Darrell, "Region-based convolutional networks for accurate object detection and segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 38(1), 2015, pp. 142-158.
11. R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440-1448.
12. W. Liu, et al., "SSD: Single shot multibox detector," in European Conference on Computer Vision (ECCV), 2016, pp. 21-37.
13. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You Only Look Once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779-788.
14. J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
15. B. Ophir and D. Kedem, "Robust visual tracking using multiple instance learning," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013, pp. 1177-1184.
16. S. Hare, A. Saffari, and P. H. S. Torr, "Struck: Structured output tracking with kernels," in 2011 International Conference on Computer Vision, Barcelona, Spain, 2011, pp. 263-270.
17. J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "High-speed tracking with kernelized correlation filters," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 37(3), 2015, pp. 583-596.
18. A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, "Simple online and realtime tracking," in Proceedings of the IEEE International Conference on Image Processing (ICIP), 2016, pp. 3464-3468.
19. Y. S. Yoo, S. H. Lee, and S. H. Bae, "Effective multi-object tracking via global object models and object constraint learning," Sensors, 22(20), 7943, 2022.

