V2I41
ISSN: 3030-3346
KEYWORDS: object detection, improved YOLOv3 model, Kalman filter, unmanned aerial vehicle.

ABSTRACT

Unmanned Aerial Vehicles (UAVs) have gained significant attention in various applications, including surveillance, search and rescue, military tasks, delivery services, and object tracking. Efficient and accurate object tracking is a crucial task for enabling UAVs to perform complex missions autonomously. In this research paper, we propose an object tracking method that combines the improved YOLOv3 model and the Kalman filter to enhance the tracking capabilities of UAVs. The improved YOLOv3 model is utilized for real-time object detection, providing initial bounding box predictions. However, due to the inherent limitations of YOLOv3 in handling occlusions and abrupt motion changes, the proposed method incorporates a Kalman filter to refine and predict the object’s state over time. By fusing the object detection results with the Kalman filter, the proposed method achieves robust and accurate tracking, even in challenging scenarios.
The single-shot detector (SSD) [12] for multi-box predictions is one of the fastest ways to achieve real-time computation of object detection tasks. While the Faster R-CNN methodologies can achieve high prediction accuracies, the overall process is quite time-consuming, and the real-time task runs at only about 7 frames per second, which is far from desirable. SSD addresses this issue by improving the frame rate to almost five times that of the Faster R-CNN model.

You Only Look Once (YOLO) [13] is one of the most popular model architectures and algorithms for object detection. Usually, the first concept found in a Google search for object detection algorithms is the YOLO architecture. Among the various object detection and tracking frameworks, the YOLO version 3 (YOLOv3) model has gained significant attention due to its real-time performance and high accuracy [14]. In this research paper, we propose an object tracking method that leverages the improved YOLOv3 model and the Kalman filter to enhance the tracking capabilities of UAVs. Our goal is to overcome the challenges associated with occlusions, scale changes, and abrupt motion, enabling UAVs to track objects accurately and efficiently.

PROPOSED METHOD

Improved YOLOv3 Model.

Object detection using the YOLOv3 model is a popular and powerful approach for real-time object detection. YOLOv3 is an object detection algorithm that can detect multiple objects within an image and provide their bounding box coordinates along with class probabilities. As shown in Figure 1, the detection module consists of 53 backbone layers, which are combined into a fully convolutional deep network of 106 layers. The improved YOLOv3 utilizes a deep CNN as its backbone. The backbone network is a modified version of Darknet, which consists of multiple convolutional layers followed by downsampling operations. The backbone extracts rich feature representations from the input image, capturing both the low-level and high-level features necessary for object detection.

The process is initialized by splitting the image into an S × S grid (at three different scales, S is equal to 13, 26, and 52). Each cell is responsible for predicting bounding boxes and their associated class probabilities. Within each cell, the improved YOLOv3 uses anchor boxes of different sizes and aspect ratios to detect objects of various scales and shapes. The improved YOLOv3 has multiple prediction layers at different scales, each responsible for detecting objects of a different size range. The prediction layers make predictions for bounding boxes, class probabilities, and confidence scores at their respective scales.
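The multi-scale grid scheme described in this section can be sketched by computing the shape of each detection head’s output tensor. The 80-class count and three anchors per scale below are the standard COCO configuration and are illustrative assumptions, not necessarily the improved model’s exact settings:

```python
# Shape of a YOLOv3-style detection output at each grid scale.
# Assumes 3 anchor boxes per scale and 80 object classes (standard
# COCO settings); the paper's improved model may differ.
NUM_CLASSES = 80
ANCHORS_PER_SCALE = 3

def head_shape(s, num_classes=NUM_CLASSES, anchors=ANCHORS_PER_SCALE):
    """Each of the S x S cells predicts `anchors` boxes, and every box
    carries 4 coordinates, 1 objectness score and `num_classes`
    class probabilities."""
    return (s, s, anchors * (5 + num_classes))

# The three detection scales for a 416 x 416 input image.
for s in (13, 26, 52):
    print(head_shape(s))  # (13, 13, 255), (26, 26, 255), (52, 52, 255)
```

The 52 × 52 head, with the smallest cells, is the one responsible for small objects, which matters for aerial footage where targets occupy few pixels.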
As shown in Figure 2, the object tracking method consists of three modules: the target selection module in the first frame of the sequence, the object detection module, and the Kalman filter module. The position of the object in the first frame is selected by the selection module, which consists of initialization parameters, including the size, width, and length of the search window. Once an object is detected, the improved YOLOv3 creates the corresponding bounding box. In order to estimate the position of the object in each frame of the sequence, the Kalman filter is used in this work. The input parameters of the Kalman filter are the position of the object in the image at time 𝑘, the size of the object, as well as the width and length of the object search window, which change due to the object’s mobility during the sequence.

The state vector and the measurement vector of the Kalman filter can be represented by the input parameters. The state vector consists of the initial position (𝑥𝑘 , 𝑦𝑘 ), the width 𝑊𝑘 and length 𝐿𝑘 of the search window, and the center of mass of the object (𝑥𝑐 , 𝑦𝑐 ) at time 𝑡𝑘 , respectively. This vector is presented by:

𝑠𝑘 = (𝑥𝑘 , 𝑦𝑘 , 𝑊𝑘 , 𝐿𝑘 , 𝑥𝑐 , 𝑦𝑐 ) (4)

On the other hand, the measurement vector consists of the initial position, width, and length of the search window of the object at time 𝑡𝑘 , correspondingly, and this vector can be written as:

𝑧𝑘 = (𝑥𝑘 , 𝑦𝑘 , 𝑊𝑘 , 𝐿𝑘 ) (5)

Using a discrete process, the Kalman filter estimates the state. This state is modeled by the linear equation, as given by:

𝑠𝑘 = 𝐴 𝑠𝑘−1 + 𝑤𝑘−1 (6)

where 𝐴 is the transition matrix and 𝑤𝑘−1 is the process noise. The measurement model can be written as:

𝑧𝑘 = 𝐻 𝑠𝑘 + 𝑣𝑘 (7)

where 𝐻 is the measurement matrix and 𝑣𝑘 is the measurement noise.

In the prediction step, the Kalman filter estimates the future state of the object based on the previous state and the system’s dynamic model. The predicted state of the object at time 𝑘 + 1 is calculated as:

𝑥𝑘+1|𝑘 = 𝐴𝑘 𝑥𝑘|𝑘 + 𝐵𝑘 𝑢𝑘 (8)

where 𝑥𝑘+1|𝑘 is the predicted state of the object at time 𝑘 + 1, 𝐵𝑘 is the control input matrix that represents any external influences on the object’s state, and 𝑢𝑘 is the discretized control input.

The predicted covariance matrix 𝑃 of the state at time 𝑘 + 1 is calculated as:

𝑃𝑘+1|𝑘 = 𝐴𝑘 𝑃𝑘|𝑘 𝐴𝑘ᵀ + 𝐺𝑘 𝑄𝑘 𝐺𝑘ᵀ (9)

where 𝐺𝑘 is the process noise gain matrix and 𝑄𝑘 is the process noise covariance.

In the update step, the Kalman filter updates the state estimate based on the measurements received from sensors. The measurement residual 𝑦𝑘+1 is calculated as:

𝑦𝑘+1 = 𝑧𝑘+1 − 𝐻 𝑥𝑘+1|𝑘 (10)

By iteratively performing the prediction and update steps, the Kalman filter continuously estimates and updates the state of the object, providing accurate tracking even in the presence of noise and uncertainties.

Evaluation Metrics

Evaluating object detection models requires various metrics to assess their performance across different aspects. To evaluate the proposed object detection method, we use the precision metric, which measures the proportion of true positives among all detections:

𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 = 𝑇𝑃 / (𝑇𝑃 + 𝐹𝑃) (11)
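The Kalman prediction and update steps described in this section can be sketched with NumPy. The constant-velocity state (x, y, vx, vy), the identity noise gain, and the noise covariances below are illustrative assumptions, not the paper’s tuned search-window model:

```python
import numpy as np

def predict(x, P, A, Q):
    """Prediction step: propagate state and covariance (no control input)."""
    x_pred = A @ x               # predicted state
    P_pred = A @ P @ A.T + Q     # predicted covariance (noise gain G = I)
    return x_pred, P_pred

def update(x_pred, P_pred, z, H, R):
    """Update step: correct the prediction with a new measurement z."""
    y = z - H @ x_pred                    # measurement residual
    S = H @ P_pred @ H.T + R              # residual covariance
    K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x_new, P_new

dt = 1.0  # time step between frames (illustrative)
A = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]])      # constant-velocity transition matrix
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]])      # only position is measured
Q = 0.01 * np.eye(4)              # process noise covariance (assumed)
R = 0.1 * np.eye(2)               # measurement noise covariance (assumed)

x, P = np.zeros(4), np.eye(4)     # initial state and covariance
x, P = predict(x, P, A, Q)
x, P = update(x, P, np.array([1.0, 2.0]), H, R)  # detector reports (1, 2)
```

In the full tracker, the measurement z would come from the improved YOLOv3 bounding box at each frame, and the state vector would also carry the search-window width and length as described in this section.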
[16], KCF – Kernelized Correlation Filters [17], SORT – Simple online and real-time tracking [18], GLOM – Global-Local Object Model [19].

The research work was tested on 6 different classes of objects. From the experimental results shown in Table 1 and Table 2, we can observe that the proposed method demonstrated great performance for each image sequence.

4. CONCLUSION

Our proposed approach is based on deep learning algorithms. For the object detection part, we have developed an improved YOLOv3 model based on neural networks. Afterwards, the objects were tracked using the Kalman filter. It is clear that by using newer CNN-based methods such as YOLOv3, which performs detection in a single pass instead of relying on a selective search algorithm, we can come closer to real-time object detection with improved results. We can conclude that the improved YOLOv3 represents a significant advancement in object detection, providing a robust and accurate framework for detecting objects in images. Its ability to combine object localization and classification within a unified architecture has made it a popular choice for researchers and practitioners in the field.

REFERENCES

1. K. P. Kumar and K. S. Sudeep, “Preprocessing for Image Classification by Convolutional Neural Networks,” in IEEE International Conference on Recent Trends in Electronics, Information and Communication Technology, May 20-21, 2016, India.
2. H. Law and J. Deng, “CornerNet: Detecting objects as paired keypoints,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
3. N. Carion et al., “End-to-end object detection with transformers,” in European Conference on Computer Vision, Springer, Cham, 2020.
4. L. Zhang, D. Du, X. Wang, and G. Wu, “Object detection and tracking for unmanned aerial vehicle systems: A review,” IEEE Transactions on Aerospace and Electronic Systems, 55(2), 2019, pp. 957-975.
5. X. Bai, X. Wang, and Z. Luo, “Real-time object detection and tracking for UAVs using deep learning,” Sensors, 19(21), 4807, 2019.
6. N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2005, pp. 886-893.
7. H. Bay et al., “Speeded-up robust features (SURF),” Computer Vision and Image Understanding, 110(3), 2008, pp. 346-359.
8. P. Arbeláez et al., “Contour detection and hierarchical image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 33(5), 2011, pp. 898-916.
9. P. Dollar and C. L. Zitnick, “Structured forests for fast edge detection,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013, pp. 1841-1848.
10. R. Girshick and T. Darrell, “Region-based convolutional networks for accurate object detection and segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 38(1), 2015, pp. 142-158.
11. R. Girshick, “Fast R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440-1448.
12. W. Liu et al., “SSD: Single shot multibox detector,” in European Conference on Computer Vision (ECCV), 2016, pp. 21-37.
13. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779-788.
14. J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
15. B. Ophir and D. Kedem, “Robust visual tracking using multiple instance learning,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013, pp. 1177-1184.
16. S. Hare, A. Saffari, and P. H. S. Torr, “Struck: Structured output tracking with kernels,” in 2011 International Conference on Computer Vision, Barcelona, Spain, 2011, pp. 263-270.
17. J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed tracking with kernelized correlation filters,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 37(3), 2015, pp. 583-596.
18. A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple online and realtime tracking,” in Proceedings of the IEEE International Conference on Image Processing (ICIP), 2016, pp. 3464-3468.
19. Y. S. Yoo, S. H. Lee, and S. H. Bae, “Effective Multi-Object Tracking via Global Object Models and Object Constraint Learning,” Sensors, 22(20), 7943, 2022.

©Sukhrob Atoev, Akhram Nishanov
Raqamli Transformatsiya va Sun’iy Intellekt ilmiy jurnali, VOLUME 2, ISSUE 4, AUGUST 2024
dtai.tsue.uz