YOLOv10 to Its Genesis: A Decadal and Comprehensive Review of the You Only Look Once (YOLO) Series
Ranjan Sapkota1* , Rizwan Qureshi2 , Marco Flores-Calero3 , Chetan Badgujar4 , Upesh Nepal5 ,
Alwin Poulose6 , Peter Zeno7 , Uday Bhanu Prakash Vaddevolu8 , Sheheryar Khan9 ,
Maged Shoman10 , Hong Yan11 , and Manoj Karkee1
1 Department of Biological Systems Engineering, Washington State University, United States; ranjan.sapkota@wsu.edu; 2 Center for Research in Computer
Vision, The University of Central Florida, Orlando, USA; 3 Department of Electrical, Electronics and Telecommunications, Universidad de las Fuerzas
Armadas, Av. General Rumiñahui s/n, Sangolquí 171-5-231B, Ecuador; 4 Biosystems Engineering and Soil Sciences, The University of Tennessee,
Knoxville, TN 37996; 5 Cooper Machine Company, Inc., Wadley, Georgia, 30477; 6 School of Data Science, Indian Institute of Science Education and
Research Thiruvananthapuram (IISER TVM), Vithura, Thiruvananthapuram 695551, Kerala, India; 7 ZenoRobotics, LLC, Billings, MT 59106, USA; 8
Agricultural and Biological Engineering, University of Florida, Gainesville, FL 32611; 9 School of Professional Education and Executive Development,
The Hong Kong Polytechnic University, Hong Kong, 999077, Hong Kong, SAR China; 10 Department of Civil, Environmental and Construction
Engineering, The University of Central Florida, Orlando, Florida, United States; 11 Department of Electrical Engineering, and Center for Intelligent
Multidimensional Data Analysis, City University of Hong Kong, 999077, Hong Kong, China
Figure 1: Technical performance of YOLO models: comparing speed (FPS) and accuracy (mAP) of YOLOv1 to
YOLOv10.
ABSTRACT
This review systematically examines the progression of the You Only Look Once (YOLO) object
detection algorithms from YOLOv1 to the recently unveiled YOLOv10. Employing a reverse
chronological analysis, this study examines the advancements introduced by YOLO algorithms,
beginning with YOLOv10 and progressing through YOLOv9, YOLOv8, and subsequent versions to
explore each version’s contributions to enhancing speed, accuracy, and computational efficiency in
real-time object detection. The study highlights the transformative impact of YOLO models across five
critical application areas: automotive safety, healthcare, industrial manufacturing, surveillance, and
agriculture. By detailing the incremental technological advancements in subsequent YOLO versions,
this review chronicles the evolution of YOLO and discusses the challenges and limitations of each earlier version. The evolution signifies a path towards integrating YOLO with multimodal, context-aware, and Artificial General Intelligence (AGI) systems for the next YOLO decade, promising
significant implications for future developments in AI-driven applications.
Keywords You Only Look Once, YOLO, YOLOv10 to YOLOv1, YOLO configurations, CNN, Deep learning, Object
detection, Real-time object detection, Artificial intelligence, Computer vision, Healthcare, Autonomous Vehicles,
Industrial manufacturing, Surveillance, Agriculture
1 Introduction
Object detection is a critical component of computer vision systems, which enables automated systems to identify and
locate objects of interest within images or video frames [1]. Real-time object detection has become integral to numerous
applications requiring real- and near-real-time analysis, monitoring and interaction with dynamic environments such
as agriculture and healthcare [2, 3, 4]. For instance, real-time object detection is the foundational technology for the
success of autonomous vehicles and robotic systems [5], allowing the system to quickly recognize and track different
objects of interest such as vehicles, pedestrians, bicycles, and other obstacles, enhancing navigational safety and
efficiency [6, 7]. The utility of object recognition extends beyond vehicular applications and is also pivotal in action
recognition within video sequences, useful in digital surveillance, monitoring, sports analysis, and human-machine
interaction [8, 2, 9]. These areas benefit from the capability to analyze and respond to situational dynamics in real-time,
illustrating its broad applicability, acceptance, and impact. However, the problem of object detection involves several
challenges:
• Complexity of Real-World Environments: Real-world environments/scenes are highly variable and unpre-
dictable. Objects can appear in various orientations, scales, distances and lighting conditions, making it
difficult for a detection algorithm to generalize and maintain accuracy [10].
• Occlusions and Clutter: Objects may be partially or fully obscured by other objects, creating cluttered scenes
that result in incomplete information, which requires careful interpretation for accurate analysis [11, 12].
• Speed and Efficiency: Many applications necessitate rapid processing of visual data to enable timely decision-
making. This requires detection algorithms to achieve a balance between high accuracy and low latency,
ensuring that the systems can deliver efficient and reliable results in real- or near-real-time scenarios, such as
autonomous driving, surveillance, and industrial and agricultural automation [13].
Before the advent of deep learning, object detection relied on a combination of hand-crafted features and machine
learning classifiers [14]. Some of the notable traditional methods include:
• Correlation Filters: Used to detect objects by correlating a filter with the image, often struggling with
variations in object appearance and lighting conditions [15].
• Sliding Window Approach: This method involves moving a fixed-size window across the image and applying
a classifier to each window to determine whether it contains an object [16].
• Viola-Jones Detector: This algorithm uses Haar-like features and a cascade of classifiers trained with AdaBoost to detect objects in images efficiently [17], as illustrated in the sketch below.
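For illustration, Viola-Jones-style detectors remain available through OpenCV's pretrained Haar cascades. The following minimal sketch shows the classic usage pattern; the cascade file is OpenCV's stock frontal-face model, and the image path is a placeholder:

```python
import cv2

# Load a pretrained Haar cascade bundled with opencv-python
# (a Viola-Jones-style detector trained with AdaBoost).
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

img = cv2.imread("scene.jpg")                 # placeholder image path
assert img is not None, "replace scene.jpg with a real image"
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # cascades operate on grayscale

# detectMultiScale slides the cascade over an image pyramid; scaleFactor and
# minNeighbors trade recall against false positives.
boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in boxes:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
```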
Supporting these methods are various hand-crafted feature extraction techniques, including:
• Gabor Features: Extracted texture features using Gabor filters, which are effective for texture representation
but computationally intensive [18].
• Histogram of Oriented Gradients (HOG): Captures edge or gradient structures that characterize the shape of objects, typically combined with Support Vector Machines (SVM) for classification [19] (see the sketch after this list).
• Local Binary Patterns (LBP): Utilizes pixel intensity comparisons to form a binary pattern, used in texture
classification and face recognition [20].
• Haar-like features: They consider adjacent rectangular regions in a detection window, sum up the pixel
intensities in each region, and calculate the difference between these sums. This difference is then used to
categorize subsections of an image [21].
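As a concrete example of pairing hand-crafted features with a classical classifier, the sketch below trains a linear SVM on HOG descriptors; random arrays stand in for labeled image patches:

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

# Hypothetical 128x64 grayscale crops: positives contain the object,
# negatives are background patches (random placeholders here).
patches = np.random.rand(20, 128, 64)
labels = np.array([1] * 10 + [0] * 10)

# HOG captures local gradient-orientation structure (Dalal & Triggs style).
features = np.array([
    hog(p, orientations=9, pixels_per_cell=(8, 8),
        cells_per_block=(2, 2), block_norm="L2-Hys")
    for p in patches
])

# A linear SVM over HOG features was the classic pedestrian detector recipe.
clf = LinearSVC().fit(features, labels)
print(clf.predict(features[:2]))
```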
Some of the most commonly employed classification methods for these detectors include Random Forest, Support
Vector Machine (SVM), statistical classifiers (e.g., Bayesian Classifier and Adaboost) and Multilayer Perceptrons
(MLP) [22, 23]. These traditional methods in early computer vision, reliant on hand-crafted features and classical
classifiers, offered moderate success under controlled conditions but struggled with robustness and generalization in
diverse real-world scenarios, lacking the accuracy achieved by modern deep learning techniques [24]. The Viola-Jones
Detector [17], introduced in 2001, was a pioneering method for real-time face detection, utilizing Haar-like features [21]
and an AdaBoost classifier for fast and accurate detections. Figure 2 shows the historical development of computer
vision systems, emphasizing how object detection algorithms evolved.
The introduction of Convolutional Neural Networks (CNNs) revolutionized object detection [25, 26, 27] by automating
feature extraction and enabling end-to-end learning. CNNs are particularly effective because:
• Hierarchical Feature Learning: CNNs learn to extract low-level features (e.g., edges, textures) in early layers and high-level features (e.g., object parts, shapes) in deeper layers, facilitating robust object representation [28] (see the sketch after this list).
• Spatial Invariance: Convolutional layers enable CNNs to recognize objects regardless of their position within
the image, enhancing detection robustness [29].
• Scalability and Generalizability: CNNs can be scaled to handle larger datasets and more complex models,
improving performance and robustness on a wide range of tasks and application environments [30].
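These properties can be seen in a few lines of PyTorch (a toy backbone for illustration, not any particular detector):

```python
import torch
import torch.nn as nn

# A toy three-stage backbone: early layers see small receptive fields
# (edges, textures); deeper, downsampled layers aggregate object-scale context.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),   # low-level features
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # mid-level parts
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # high-level shapes
)

x = torch.randn(1, 3, 224, 224)    # dummy RGB image
feat = backbone(x)
print(feat.shape)                  # torch.Size([1, 64, 28, 28])
# Because convolution weights are shared across positions, a shifted input
# yields a correspondingly shifted feature map (translation equivariance),
# which is what makes detection robust to object position.
```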
Figure 2: Timeline of Object detection paradigms and evolution of the YOLO Object detection. The figure
shows the progression from traditional methods like VJ Detector and HOG to deep learning-based approaches,
including R-CNN, Fast R-CNN, Faster R-CNN, and YOLO series. Recent advancements feature transformer-
based models such as DETR [31] and ViTDet [32].
Object detection presents a unique challenge for CNNs due to the variable number of objects in an image, which
prevents the direct application of CNNs with fixed output layers [27]. While a sliding window-based brute force search
could be used to select and classify regions [16], this approach is computationally expensive because it requires applying
the CNN model to numerous region proposals of varying sizes and aspect ratios, making it inefficient for real-time and
near-real-time applications.
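This brute-force cost can be made concrete with a rough count of candidate windows; the numbers below are illustrative, not taken from the cited works:

```python
# Rough count of windows a brute-force sliding-window detector must classify
# for a single 1080p frame (illustrative stride, scales, and aspect ratios).
W, H, stride = 1920, 1080, 8
scales = [32, 64, 128, 256]            # nominal window sizes in pixels
aspect_ratios = [0.5, 1.0, 2.0]

total = 0
for s in scales:
    for ar in aspect_ratios:
        w, h = int(s * ar ** 0.5), int(s / ar ** 0.5)
        total += ((W - w) // stride + 1) * ((H - h) // stride + 1)

# Roughly several hundred thousand CNN forward passes for one frame,
# which is why per-window classification cannot run in real time.
print(f"{total:,} windows per frame")
```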
In 2013, Ross Girshick et al. proposed the R-CNN (Region-based CNN) architecture to address these challenges.
R-CNN uses the selective search algorithm to generate about 2000 region proposals, which are then processed by a
CNN to extract features [33]. Fast R-CNN improved this process by integrating region proposal feature extraction
and classification in a single pass [34]. Faster R-CNN further advanced the approach by introducing Region Proposal
Networks (RPNs) for end-to-end training, eliminating the need for selective search [4].
The "You Only Look Once" (YOLO) object detection algorithm was first introduced by Joseph Redmon et al., [35] in
2015, revolutionized real-time object detection by combining region proposal and classification into a single neural
network, significantly reducing computation time. YOLO’s unified architecture divides the image into a grid, predicting
bounding boxes and class probabilities directly for each cell, enabling end-to-end learning [35]. YOLO is versatile, and
its real-time detection capabilities have revolutionized agriculture [36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46], healthcare
[47, 48, 49], surveillance [50, 51] and industrial applications [52, 53], where accuracy and speed are crucial.
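Concretely, YOLOv1 [35] predicts an S x S x (B*5 + C) tensor in one forward pass. The sketch below decodes one grid cell of such an output using the original paper's settings (S = 7, B = 2, C = 20), with random values standing in for network predictions:

```python
import numpy as np

# YOLOv1-style unified output: an S x S grid, each cell predicting B boxes
# (x, y, w, h, confidence) plus C shared class probabilities [35].
S, B, C = 7, 2, 20                      # values from the original paper
pred = np.random.rand(S, S, B * 5 + C)  # dummy network output, shape (7, 7, 30)

cell = pred[3, 4]                       # one grid cell
boxes = cell[: B * 5].reshape(B, 5)     # B boxes: (x, y, w, h, conf)
class_probs = cell[B * 5 :]             # class distribution shared by the cell

# Class-specific confidence = box confidence * class probability.
scores = boxes[:, 4:5] * class_probs    # shape (B, C)
print(pred.shape, scores.shape)         # (7, 7, 30) (2, 20)
```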
In agriculture, YOLO models have been applied to detect and classify crops [54], pests, and diseases [55, 56, 57], facili-
tating precision agriculture techniques and automating farming operations to increase productivity and optimize inputs.
Additionally, in remote sensing, YOLO contributes to object recognition in satellite [58, 59] and aerial imagery [60, 61],
which supports urban planning, land use mapping, and environmental monitoring. These capabilities demonstrate
YOLO’s contribution to critical global challenges such as urban development and environmental conservation.
In healthcare, YOLO has been instrumental in assisting and improving diagnostic processes and treatment outcomes.
The applications include, but are not limited to, cancer detection [62, 63], skin segmentation [64], and pill identification
[65, 66], which showcase the model's ability to adapt to different needs and essential tasks.
Surveillance and Security systems also leverage YOLO for real-time monitoring and rapid identification of suspicious
activities [50, 51]. By integrating these models into surveillance systems, security personnel can more effectively
monitor and respond to potential threats, enhancing public safety [67]. Similarly, in the context of public health
measures like social distancing and face mask detection during pandemics [68, 69], YOLO models provided essential
support in enforcing health regulations.
In industrial applications, YOLO aids in surface inspection processes to detect defects and anomalies [52, 53], ensuring quality control in manufacturing and production.
Since "You Only Look Once" has been widely adopted in the field of computer vision, a search for this keyword in
Google Scholar yields approximately 5,550,000 results as of June 9, 9:05 PM Pacific Daylight Time. The acronym
"YOLO" further emphasizes its popularity, generating around 210,000 search results. Thousands of researchers have
cited YOLO papers, highlighting its significant influence. This study aims to review and critically summarize YOLO's decadal progress and advancements over time, as visually summarized in the mind-map shown in Figure 3.
Figure 3: Block diagram showing organization of this review article: The structure includes YOLO Trajectory
discussing the development path, Prior YOLO literature: Context and Distinctions providing background
and differentiations, Review of YOLO Versions detailing each version, Applications highlighting various use
cases, Challenges, Limitations and Future Directions addressing current issues and potential advancements, and
Conclusion summarizing the findings. Each section systematically contributes to a comprehensive understanding
of the YOLO framework’s evolution and impact.
The comprehensive analysis begins with YOLO Trajectory, discussing the development path from YOLOv1 to
YOLOv10. Next, Context and Distinctions of Prior YOLO literature will be presented to provide background and
differentiations among existing works. Then, Review of YOLO Versions details the key features and improvements
of each version. The Applications section highlights various use cases across different domains. Following this,
Challenges, Limitations and Future Directions addresses current issues and potential advancements. Finally, the
Conclusion section summarizes the findings of this comprehensive review. Each section is further divided into
various sub-subsections to present and discuss specific topic areas relevant to the corresponding sections. YOLO
Trajectory includes Significance of Latency and mAP Scores in YOLO and Single stage detection in YOLO;
Prior YOLO literature: Context and Distinctions; Review of YOLO Versions covers YOLOv10, YOLOv9 and
YOLOv8, YOLOv7, YOLOv6 and YOLOv5, and YOLOv4, YOLOv3, YOLOv2 and YOLOv1; Applications
discusses Autonomous Vehicles, Healthcare and Medical Imaging, Security and Surveillance, Manufacturing,
and Agriculture; Challenges, Limitations and Future Directions explores YOLO and the Artificial General
Intelligence (AGI), YOLO on Edge Devices, and Future Prospects. This structured approach ensures a detailed and
systematic review of the YOLO framework’s evolution and impact.
2 YOLO Trajectory
YOLOv1 [35] was introduced in 2015 as a novel approach to object detection, offering good accuracy and computational
speed by processing images using a single-stage network architecture. The first YOLO version laid the foundation for
real-time applications of machine vision systems, setting a new standard for subsequent developments.
Figure 2 shows the development timeline of YOLO models from the release of YOLOv1 to the latest YOLOv10.
YOLOv2, or YOLO9000 [70, 71], expanded on the foundation of YOLOv1 by improving the resolution at which the
model operated and by expanding the capability to detect over 9000 object categories, thus enhancing its versatility and
accuracy. YOLOv3 further advanced these capabilities by implementing multi-scale predictions and a deeper network
architecture, which allowed better detection of smaller objects [72]. The series continued to evolve with YOLOv4 and
YOLOv5, each introducing more refined techniques and optimizations to improve detection performance (i.e., accuracy
and speed) even further [73, 74, 75]. YOLOv4 incorporated features like Cross-Stage Partial (CSP) connections and
Mosaic data augmentation, while YOLOv5, developed by Ultralytics, brought significant improvements in terms of
ease of use and performance, establishing itself as a popular choice in the computer vision community. Subsequent
versions, YOLOv6 through YOLOv10, have continued to build on this success, focusing on enhancing model scalability,
reducing computational demands, and improving real-time performance metrics. Each iteration of the YOLO series
has set new benchmarks for object detection capabilities and significantly impacted various application areas, from
autonomous driving and traffic monitoring to healthcare, industrial automation and smart farming.
YOLOv10 [76], the latest iteration, introduces multiple model variants such as YOLOv10-N, YOLOv10-S, YOLOv10-
M, YOLOv10-B, YOLOv10-L, and YOLOv10-X, achieving average precision (AP) scores ranging from 38.5% to 54.4% on the MS-COCO dataset [76]. Notably, YOLOv10-N and YOLOv10-S exhibit the lowest latencies at 1.84 ms and
2.49 ms, respectively, making them highly suitable for applications requiring low latency. These models outperform
their predecessors, with YOLOv10-X achieving the highest mAP of 54.4% and a latency of 10.70 ms, reflecting a
well-balanced enhancement in both accuracy and inference speed. According to Wang et al. [76], comparing YOLOv10
with YOLOv9 and YOLOv8 reveals a trend of incremental improvements. Similar to YOLOv10, YOLOv9 features
various model configurations including YOLOv9-N, YOLOv9-S, YOLOv9-M, YOLOv9-C, and YOLOv9-X, and
achieves mAP scores from 39.5% to 54.4% [77]. While the mAP scores are comparable to YOLOv10, the latency
of YOLOv9 models is generally higher, particularly for YOLOv9-X, which matches YOLOv10-X in mAP but not
in latency, indicating YOLOv10’s superior efficiency [77]. YOLOv8 model configurations, including YOLOv8-N,
YOLOv8-S, YOLOv8-M, YOLOv8-L, and YOLOv8-X, show mAP scores ranging from 37.3% to 53.9% and latencies
from 6.16 ms to 16.86 ms. Although YOLOv8 models perform well, they lag slightly behind YOLOv10 and YOLOv9
in terms of both accuracy and latency, suggesting that the architectural refinements in YOLOv10 have effectively
enhanced both detection performance and computational efficiency.
Further analysis of earlier YOLO versions, such as YOLOv7, YOLOv6, YOLOv5, YOLOv4, YOLOv3, YOLOv2,
and YOLOv1, underscores the rapid advancements in this domain [78]. YOLOv7 model configurations achieve mAP scores of up to 56.4% for the larger variants and 51.2% for the base model, but with significantly higher latencies, indicating a focus on higher accuracy at the cost of speed [79]. YOLOv6 models (YOLOv6-N, YOLOv6-S, YOLOv6-M,
YOLOv6-L) achieve mAP scores from 37.0% to 51.8% with moderate latencies. YOLOv5, a popular model, shows
a competitive mAP of 50.7% and a latency of 140 ms [80]. Earlier versions like YOLOv4, YOLOv3, YOLOv2, and YOLOv1, with reported mAP scores of 43.5%, 57.9%, 76.8%, and 63.4% respectively [73] (figures drawn from different benchmarks, MS-COCO for YOLOv4 and YOLOv3 versus PASCAL VOC for YOLOv2 and YOLOv1, and therefore not directly comparable), laid the groundwork for subsequent improvements, though they exhibit higher latencies compared to the latest versions.
(a) Performance metrics for YOLOv1 to YOLOv4, illustrating advancements in object detection technology. Detailed in [35], [81], [82], [83]. (b) Performance analysis of YOLOv5 by Ultralytics, highlighting significant improvements in speed and accuracy. Refer to [83], [84], [85], [86].
The evolution of the earlier YOLO models (YOLOv1 through YOLOv4) has been extensively presented and discussed
in scholarly articles [35], [81], [82], [83]. These versions, showcased in Figure 4a, were fundamental in advancing object
detection technologies, providing robust source code on GitHub and paving the way for further innovations. With the
commercial landscape evolving, Ultralytics released YOLOv5 and YOLOv8, not through traditional academic channels
but directly on GitHub, creating a pivotal shift in deployment and adaptation [84], [85], [86]. Subsequent versions,
YOLOv6 and YOLOv7, marked a return to the academic realm, with detailed documentations and enhancements
presented in [80], [79]. Figure 4b shows the FPS and mAP comparison.
The technical analysis of these versions, as visualized from YOLOv1 to YOLOv10, highlights a progressive enhancement
in both speed and accuracy. Performance metrics such as FPS and mAP were critically analyzed in this study using
Python and Matplotlib, illustrating the trade-offs inherent in each version’s design. YOLOv6 through YOLOv10,
documented in Figures 4c and 4d, illustrate the continuous improvements, with later models optimizing computational
efficiency and detection precision [79], [77], [76]. Each figure reflects a balance between processing speed and
accuracy, providing insights into the model’s performance across various configurations and input resolutions. This
ongoing development trajectory also showcases the dynamic interplay and collaboration between academic research
and commercial applications, driving forward the capabilities of object detection systems in real-world scenarios.
2.1 Significance of Latency and mAP Scores in YOLO
Latency (L) and mAP are important metrics for describing the performance of object detection models like YOLO
[26, 87]. Latency measures the time taken by the model to process an image and produce predictions [87]. This includes
all the steps required for the detection process, such as image preprocessing, model inference, and postprocessing, and
is typically measured in milliseconds (ms). Lower latency is essential for real-time applications such as autonomous
driving, surveillance, and robotics, where timely and accurate detection is crucial [88]. High latency can result in delays
that are detrimental in these fast-paced environments, potentially compromising operational safety and effectiveness
[89]. FPS (Frames per second) is another critical metric that complements latency by indicating how many images the
model can process each second. Together, latency and FPS provide a comprehensive overview of a model’s performance
in real-time scenarios. Figure 4a illustrates the mAP and FPS rates, while Figure 4b illustrates the latency values of all 10 YOLO versions, showcasing their evolution and effectiveness in real-time applications.
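The sketch below shows how such latency and FPS figures are typically measured in practice. The callable is a placeholder for a full preprocess-infer-postprocess pipeline, and the FPS-as-reciprocal relation assumes sequential single-image processing (batching or pipelining can raise throughput beyond it):

```python
import time
import numpy as np

def measure_latency(model_fn, frame, warmup=10, iters=100):
    """Wall-clock latency (ms) and FPS for a single-image inference callable.
    model_fn stands in for preprocessing + inference + postprocessing."""
    for _ in range(warmup):               # let caches and lazy init settle
        model_fn(frame)
    start = time.perf_counter()
    for _ in range(iters):
        model_fn(frame)
    latency_ms = (time.perf_counter() - start) / iters * 1000.0
    return latency_ms, 1000.0 / latency_ms  # FPS is the reciprocal of latency

# Placeholder "model": in practice this would wrap a YOLO forward pass.
dummy = lambda img: np.clip(img * 0.5, 0, 1).sum()
frame = np.random.rand(640, 640, 3)
print(measure_latency(dummy, frame))
```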
Likewise, mAP is a comprehensive metric used to evaluate the accuracy of object detection models [90]. It considers
both precision and recall (Table 1), and it is calculated by taking the average precision (AP) across all classes and then
averaging these AP scores [91, 90]. It provides a balanced view of how well the model performs across different object
categories and varying conditions within the dataset. Other metrics used for comprehensive evaluation of YOLO models
[92, 93] are detailed in Table 1.
Here, True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) are the key performance evaluators. A TP is an instance where the model correctly identifies an object as present. A TN occurs when the model correctly predicts the absence of an object. An FP arises when the model incorrectly identifies an object as present, and an FN happens
when the model fails to detect an object that is actually present. These metrics are crucial for assessing the accuracy and
reliability of the YOLO object detection [91, 90, 93].
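To make these definitions operational, the following simplified sketch computes per-class AP by sorting detections by confidence and integrating precision over recall; benchmark implementations (e.g., COCO's) add interpolation and IoU-matching details omitted here:

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP for one class: sort detections by confidence, sweep the threshold,
    and integrate precision over recall (simplified all-point form)."""
    order = np.argsort(-scores)
    tp = np.cumsum(is_tp[order])
    fp = np.cumsum(~is_tp[order])
    recall = tp / num_gt                 # TP / (TP + FN)
    precision = tp / (tp + fp)           # TP / (TP + FP)
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):  # sum precision at each recall step
        ap += p * (r - prev_r)
        prev_r = r
    return ap

# Toy example: 5 detections for one class, 4 ground-truth objects.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5])
is_tp = np.array([True, True, False, True, False])
print(average_precision(scores, is_tp, num_gt=4))  # 0.6875
# mAP is this AP averaged over all classes (and, for COCO, IoU thresholds).
```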
Table 2: Presenting the latency in milliseconds for various YOLO versions, highlighting the progression and
improvements in speed across different iterations. The data reflects the enhancements made in real-time object
detection capabilities, with references to the respective research papers for each version. These values are
for reference only, as they are directly extracted from the source documents and have not been standardized.
Consequently, direct comparisons between the different applications discussed here are not feasible.
YOLO Version Latency (ms) Reference
YOLOv1 Base 22.22 [35]
YOLOv1 Fast 6.45 [35]
YOLOv2 (416x416) 14.93 [81]
YOLOv2 (544x544) 25 [81]
YOLOv3 (320x320) 22 [82]
YOLOv3 (416x416) 29 [82]
YOLOv3 (608x608) 51 [82]
YOLOv4 15.38 [83]
YOLOv5n (640x640) 158.73 [83, 84, 85, 86]
YOLOv5s (640x640) 156.25 [83, 84, 85, 86]
YOLOv5m (640x640) 121.95 [83, 86]
YOLOv5l (640x640) 99.01 [83, 86]
YOLOv5x (640x640) 82.64 [83, 86]
YOLOv5n6 (1280x1280) 123.46 [83, 86]
YOLOv5s6 (1280x1280) 121.95 [83, 86]
YOLOv5m6 (1280x1280) 90.09 [83, 84, 86]
YOLOv5l6 (1280x1280) 63.29 [83, 84, 86]
2.2 Single-Stage Detection in YOLO
The Single Shot MultiBox Detector (SSD) [94], introduced in 2015, revolutionized object detection by streamlining the process through a single-stage approach, significantly inspiring subsequent developments in YOLO models
[94, 95, 96]. Unlike two-stage models like R-CNN, which rely on a region proposal step before actual object detection,
SSD and by extension, YOLO variants, perform detection and classification in a single sweep across the image. This
paradigm shift enhances the detection process by eliminating intermediate steps, thus facilitating faster and more
efficient object detection suitable for real-time applications. The architecture of SSD, which YOLO models have
adapted, utilizes multiple feature maps at different resolutions to detect objects of various sizes, employing a diverse
array of anchor boxes at each feature map location to improve localization accuracy [97, 98].
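A simplified sketch of this multi-scale default-box idea follows; the feature-map sizes and scales are illustrative, not the exact values from [94]:

```python
import numpy as np

def anchors_for_feature_map(fm_size, scale, ratios=(0.5, 1.0, 2.0)):
    """Center-form anchor boxes (cx, cy, w, h) in normalized coordinates,
    one set per cell of an fm_size x fm_size feature map."""
    boxes = []
    step = 1.0 / fm_size
    for i in range(fm_size):
        for j in range(fm_size):
            cx, cy = (j + 0.5) * step, (i + 0.5) * step
            for r in ratios:
                # Same area per scale; aspect ratio reshapes the box.
                boxes.append([cx, cy, scale * np.sqrt(r), scale / np.sqrt(r)])
    return np.array(boxes)

# Fine maps with small anchors catch small objects; coarse maps, large ones.
for fm, sc in [(38, 0.1), (19, 0.3), (5, 0.6)]:   # illustrative sizes/scales
    print(fm, anchors_for_feature_map(fm, sc).shape)
```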
Figure 5: Enhanced YOLO model architecture incorporating SSD’s single-stage detection approach with Multi-
Headed Attention (MA) layers for superior real-time object detection performance [99].
Figure 5 shows an example of a YOLO model that integrates SSD’s architecture principles, specifically focusing on
enhancing real-time detection capabilities through improved feature extraction using Multi-Headed Attention (MA)
layers. These adaptations from SSD’s methodology have enabled YOLO models such as YOLOv8, YOLOv9, and
YOLOv10 to achieve significant improvements in processing speed and detection accuracy, making them highly effective
for applications requiring rapid and reliable object detection [100, 101]. The SSD-inspired single-shot mechanism
directly classifies and localizes objects, reducing computational overhead and enabling the deployment of these models
in resource-constrained environments such as mobile and edge devices. The continuous refinement of these techniques
in YOLO models underscores an ongoing evolution aimed at balancing the demanding accuracy requirements with the
need for speed in diverse real-world scenarios [96].
• "A Review of YOLO Algorithm Developments" by Peiyuan Jiang et al. [102] provided an insightful overview
on YOLO algorithm development and its evolution through its versions. The authors analyze the fundamental
aspects of YOLO’s to object detection, comparing its various iterations to traditional CNNs. They emphasize
the ongoing improvements in YOLO, particularly in enhancing target recognition and feature extraction
capabilities. It also discusses YOLO’s application in specific fields like finance, highlighting its practical
implications in feature extraction for image-based news analysis [102].
• "A Comprehensive Systematic Review of YOLO for Medical Object Detection (2018 to 2023)" by Ragab
et al. [103] presented a systematic review of YOLO's application in the medical field, analyzing how
different variants, particularly YOLOv7 and YOLOv8, have been employed for various medical detection
tasks. They highlight the algorithm’s significant performance in lesion detection, skin lesion classification,
and other critical areas, demonstrating YOLO’s superiority over traditional methods in terms of accuracy
and computational efficiency. Despite its successes, the review identifies challenges, such as the need for
well-annotated datasets, and addresses the high computational demands of YOLO implementations. The paper
suggested directions for future research to optimize YOLO’s application in medical object detection [103].
• "A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and
YOLO-NAS" by Terven et al. [104] provides an extensive analysis of the evolutionary trajectory of the
YOLO algorithm, detailing how each iteration has contributed to advancements in real-time object detection.
Their review covers the significant architectural and training enhancements from YOLOv1 through YOLOv8
and introduces YOLO-NAS and YOLO with Transformers. This study serves as a valuable resource for
understanding the progression in network architecture, which has progressively improved YOLO’s efficacy in
diverse applications such as robotics and autonomous driving.
• "YOLOv1 to v8: Unveiling Each Variant–A Comprehensive Review of YOLO" by Hussain [78], provided
in-depth analyses of the internal components and architectural innovations of each YOLO variant. It provided
a deep dive into the structural details and incremental improvements that have marked the evolution of YOLO,
presenting a well-structured analysis complete with performance benchmarks. This methodological approach
not only highlights the capabilities of each variant but also discusses their practical impact across different
domains, suggesting the potential for future enhancements like federated learning to improve privacy and
model generalization [78].
• "YOLO-v1 to YOLO-v8, the rise of YOLO and its complementary nature toward digital manufacturing and
industrial defect detection" by Muhammad Hussain [105] reviewed and showed rapid progression of the
YOLO variants, focusing on their critical role in industrial applications, specifically for defect detection in
manufacturing. Starting with YOLOv1 and extending through YOLOv8, the paper illustrates how each version
has been optimized to meet the demanding needs of real-time, high-accuracy defect detection on constrained
devices. Hussain’s work not only examines the technical advancements within each YOLO iteration but
also validates their practical efficacy through deployment scenarios in the manufacturing sector, emphasizing
YOLO’s alignment with industrial needs [105].
The existing literature shows a significant lack of comprehensive reviews incorporating the latest YOLO releases,
specifically YOLOv9 and YOLOv10. As we mark a decade of the YOLO algorithm’s evolution, it is crucial to
systematically document and critically analyze the newer models to provide well-documented, synthesized, up-to-date
insights and comparative analyses across a broader range of applications to the broad research and technical community.
This state-of-the-art review paper aims to bridge this gap by exploring the advancements and capabilities of YOLOv9
and YOLOv10, offering a detailed perspective on their impact and potential within the ever-evolving landscape of
object detection technologies.
In this review paper, we adopt a unique reverse-chronological approach to analyze the progression of YOLO, beginning
with the most recent versions and moving backward. The analysis is divided into three distinct subsections. The
first section covers the latest iterations, YOLOv10, YOLOv9, and YOLOv8, where we delve into the architecture
and advancements that define the forefront of object detection technology. This approach not only shows the most
cutting-edge developments but also sets the stage for understanding the incremental improvements that have been
realized over time. The second section reviews YOLOv7, YOLOv6, and YOLOv5, tracing further back in the series to
highlight the evolutionary steps that have contributed to the enhancements observed in the later versions. We analyze
each model’s technical and scientific aspects to provide a comprehensive view of the progress within these iterations.
The third section addresses the earlier YOLO versions, offering a complete historical perspective that enriches the
reader’s understanding of the foundational technologies and the methodologies, refined through successive updates.
Additionally, we discuss the application of the YOLO models in reverse order across five critical real-world domains:
autonomous vehicles, healthcare and medical image analysis, security and surveillance, manufacturing industry, and
agriculture. For each application, we present a detailed examination and a corresponding tabular data in reverse
chronological order, showcasing how YOLO technologies have been adapted and implemented to meet specific industry
needs and challenges. This reverse review strategy not only emphasizes the state-of-the-art but also provides a narrative
of technological evolution, illustrating how each iteration builds upon the last to push the boundaries of what’s possible
in object detection. By understanding where YOLO technology stands today and how it got there, readers gain a
comprehensive view of its capabilities and potential future directions. This methodical unpacking of the YOLO series
not only highlights technological advancements but also offers insights into the broader implications and utility of these
models in practical scenarios, setting the groundwork for anticipating future innovations in object detection technology.
4 Review of YOLO Versions
This section reviews YOLO series models, starting from the latest and most advanced version, YOLOv10, and progressively
tracing back to the foundational YOLOv1. By first highlighting the most recent technological advancements, this
approach enables immediate insights into the state-of-the-art capabilities of object detection. Subsequently, the narrative
is focused on exploring how earlier models laid the groundwork for these innovations.
YOLOv10 [76], developed at Tsinghua University, China, represents a breakthrough in the YOLO series for real-time
object detection, achieving unprecedented performance. This version eliminates the need for non-maximum suppression
(NMS) [106], a traditional bottleneck in earlier models, thereby drastically reducing latency. YOLOv10 introduces a
dual assignment strategy in its training protocol, which uses one-to-many and one-to-one label assignments to optimize detection accuracy without sacrificing speed, ensuring robust detection with lower latency [107]. The
architecture of YOLOv10 includes several innovative components that enhance both computational efficiency and
detection performance. Among these are lightweight classification heads [108] that reduce computational demands,
spatial-channel decoupled downsampling to minimize information loss during feature reduction [109], and rank-
guided block design that optimizes parameter use [110]. These architectural advancements ensure that YOLOv10
operates synergistically across various scales—from YOLOv10-N (Nano) to YOLOv10-X (Extra Large), making it
adaptable to diverse computational constraints and operational requirements [76]. According to Wang et al. [76],
performance evaluations on benchmark datasets like MS-COCO [111] demonstrate that YOLOv10 not only surpasses
its predecessors—YOLOv9 and YOLOv8—in both accuracy and efficiency but also sets new industry standards. For
instance, YOLOv10-S substantially outperforms comparable models with an improved mAP and lower
latency. This version also incorporates holistic efficiency-accuracy driven design, large-kernel convolutions, and partial
self-attention modules, which collectively improve the trade-off between computational cost and detection capability.
The architecture diagrams of YOLOv10, YOLOv9, and YOLOv8 are summarized in Figures 6, 7, and 8, respectively.
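Since NMS removal is YOLOv10's headline change, the sketch below implements the classic greedy NMS step that earlier YOLO versions run at inference time; it is the standard textbook algorithm rather than code from [76]:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS over (x1, y1, x2, y2) boxes: keep the highest-scoring box,
    drop boxes that overlap it too much, repeat. YOLOv10's one-to-one head is
    trained to emit a single box per object, so this step can be skipped."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the kept box with all remaining candidates.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_o = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_o - inter + 1e-9)
        order = order[1:][iou <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
print(nms(boxes, np.array([0.9, 0.8, 0.7])))  # -> [0, 2]
```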
The YOLOv10 model offers a broad array of configurations, each tailored to specific performance needs within real-time
object detection frameworks. Starting with YOLOv10-N (Nano), it demonstrates a rapid detection capability with a
mAP of 38.5% at an exceptionally low latency of 1.84 ms, making it highly suitable for scenarios demanding quick
responses. Progressing through the series, YOLOv10-S (Small) and YOLOv10-M (Medium) offer progressively higher
mAP values of 46.3% and 51.1% at latencies of 2.49 ms and 4.74 ms, respectively, providing a balanced performance
for versatile applications. The larger variants, YOLOv10-B (Balanced) and YOLOv10-L (Large), cater to environments
requiring detailed detections, with mAPs of 52.5% and 53.2% and latencies of 5.74 ms and 7.28 ms respectively. The
largest model, YOLOv10-X (Extra Large), excels with the highest mAP of 54.4% at a latency of 10.70 ms, designed
Figure 6: YOLOv10 architecture, which employs a dual label assignment strategy to improve detection accuracy.
A backbone processes the input image, while PAN (Path Aggregation Network) enhances feature representation.
Employed heads are (1) one-to-many head for regression and classification tasks, and (2) one-to-one head for
precise localization [76].
Figure 7: YOLOv9 architecture [77] with CSPNet, ELAN, and GELAN modules. CSPNet enhances gradient
flow and reduces computational load through feature map partitioning. ELAN focuses on linear aggregation of
features for improved learning efficiency, while GELAN generalizes this approach to combine features from
multiple depths and pathways, providing greater flexibility and accuracy in feature extraction.
Figure 8: YOLOv8 architecture [112]: showcasing the key components and their connections. The backbone
network processes the input image through multiple convolutional layers (C1 to C5), extracting hierarchical
features. These features are then passed through the Feature Pyramid Network (FPN) to create a feature
pyramid (P3, P4, P5), which enhances detection at different scales. The network heads perform final predictions,
incorporating convolutional blocks and upsample blocks to refine features.
for complex detection tasks where precision is paramount. These configurations underscore YOLOv10’s adaptability
across a spectrum of operational requirements.
Reflecting on YOLO's evolution, starting from YOLOv1, which set the benchmark with an mAP of 63.4% at 45 FPS (roughly 22 ms of latency), to the latest YOLOv10, significant technological strides have been evident. YOLOv10's predecessors,
YOLOv9 and YOLOv8, display comparable mAP scores to YOLOv10 but with marginally higher latency, indicating
the incremental enhancements YOLOv10 brings to the table. Specifically, YOLOv9 and YOLOv8 models, such
as YOLOv9-N and YOLOv8-N, showcase mAPs of 39.5% and 37.3%, respectively, at latencies indicative of their generational improvements. Meanwhile, the higher end of these series, YOLOv9-X and YOLOv8-X, achieve mAPs of
54.4% and 53.9%, respectively, with YOLOv10 outperforming them in efficiency. The YOLO series, from YOLOv1
through YOLOv8, YOLOv9, and now YOLOv10, has continually advanced the frontier of real-time object detection,
enhancing both the speed and accuracy of detections, and thus broadening the scope for practical applications in sectors
like autonomous driving, surveillance, and real-time video analytics.
YOLOv9 [77] marks a significant advancement in real-time object detection by addressing the efficiency and accuracy
challenges associated with earlier versions, particularly through the mitigation of information loss in deep neural
processing. It introduces the innovative Programmable Gradient Information (PGI) and the Generalized Efficient Layer
Aggregation Network (GELAN) architecture. These enhancements focus on preserving crucial information across
the network, ensuring robust and reliable gradients that prevent data degradation, which is common in deep neural
networks [113]. Compared to its successor, YOLOv10, YOLOv9 sets a foundational stage by addressing the information
bottleneck problem that typically hinders deep learning models. While YOLOv9’s PGI strategically maintains data
integrity throughout the processing layers, YOLOv10 builds upon this foundation by completely eliminating the
need for NMS and further optimizing model architecture for reduced latency and enhanced computational efficiency.
YOLOv10 also introduces dual assignment strategies for NMS-free training, significantly enhancing the system’s
response time without compromising accuracy, which reflects a direct evolution from the groundwork laid by YOLOv9’s
innovations [114]. Furthermore, YOLOv9’s GELAN architecture represents a pivotal improvement in network design,
offering a flexible and efficient structure that effectively integrates multi-scale features. While GELAN contributes
significantly to YOLOv9’s performance, YOLOv10 extends these architectural improvements to achieve even greater
efficiency and adaptability [115]. It reduces computational overhead and increases the model’s applicability to various
real-time scenarios, showcasing an advanced level of refinement that leverages and enhances the capabilities introduced
by YOLOv9.
YOLOv8 was released in January 2023 by Ultralytics, marking a significant progression in the YOLO series with
an introduction of multiple scaled versions designed to cater to a wide range of applications [84, 116]. These
versions included YOLOv8n (nano), YOLOv8s (small), YOLOv8m (medium), YOLOv8l (large), and YOLOv8x
(extra-large), each optimized for specific performance and computational needs. This flexibility made YOLOv8 highly
versatile, supporting a multitude of vision tasks such as object detection, segmentation, pose estimation, tracking, and
classification, significantly broadening its application scope in real-world scenarios [116]. The architecture of YOLOv8
underwent substantial refinements to enhance its detection capabilities. It retained a similar backbone to YOLOv5
but introduced modifications in the CSP Layer, now evolved into the C2f module—a cross-stage partial bottleneck
with dual convolutions that effectively combine high-level features with contextual information to bolster detection
accuracy. YOLOv8 transitioned to an anchor-free model with a decoupled head, allowing independent
processing of objectness, classification, and regression tasks which, in turn, improved overall model accuracy [117].
The output layer employed a sigmoid activation function for objectness scores and softmax for class probabilities,
enhancing the precision of bounding box predictions. YOLOv8 also integrated advanced loss functions like CIoU [118]
and Distribution Focal Loss (DFL) [119] for bounding-box optimization and binary cross-entropy for classification,
which proved particularly effective in enhancing detection performance for smaller objects. YOLOv8’s architecture,
demonstrated in detailed diagrams, features the modified CSPDarknet53 backbone with the innovative C2f module,
augmented by a spatial pyramid pooling fast (SPPF) layer that accelerates computation by pooling features into a
fixed-size map. This model also introduced a semantic segmentation variant, YOLOv8-Seg, which utilized the backbone
and C2f module, followed by two segmentation heads designed to predict semantic segmentation masks efficiently.
This segmentation model achieved state-of-the-art results on various benchmarks while maintaining high speed and
accuracy, evident in its performance on the MS COCO dataset where YOLOv8x reached an AP of 53.9% at 640 pixels
image size—surpassing the 50.7% AP of YOLOv5—with a remarkable speed of 280 FPS on an NVIDIA A100 using
TensorRT. As we progress backward through the YOLO series, from YOLOv10 to YOLOv8 and soon to YOLOv7,
these architectural and functional advancements highlight the series’ evolutionary trajectory in optimizing real-time
object detection networks.
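For reference, here is a single-pair sketch of the CIoU loss [118] mentioned above, following its published formulation with an IoU term, a normalized center-distance term, and an aspect-ratio consistency term:

```python
import math

def ciou_loss(box_p, box_g):
    """CIoU loss for predicted/ground-truth (x1, y1, x2, y2) boxes.
    A sketch of the published formula for one box pair."""
    px1, py1, px2, py2 = box_p
    gx1, gy1, gx2, gy2 = box_g
    # Intersection over union.
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / (union + 1e-9)
    # Squared center distance over squared enclosing-box diagonal.
    rho2 = ((px1 + px2 - gx1 - gx2) ** 2 + (py1 + py2 - gy1 - gy2) ** 2) / 4
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2 + 1e-9
    # Aspect-ratio consistency term.
    v = (4 / math.pi ** 2) * (math.atan((gx2 - gx1) / (gy2 - gy1 + 1e-9))
                              - math.atan((px2 - px1) / (py2 - py1 + 1e-9))) ** 2
    alpha = v / (1 - iou + v + 1e-9)
    return 1 - iou + rho2 / c2 + alpha * v

print(ciou_loss((0, 0, 2, 2), (1, 1, 3, 3)))  # ~0.97 for weakly overlapping boxes
```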
The YOLOv7 model introduces enhancements in object detection tailored for drone-captured scenarios, drawing on the Transformer Prediction Head idea popularized by the TPH-YOLOv5 variant [121], which emphasizes improvements in handling scale variations and densely packed objects [79]. By incorporating TPH and the Convolutional Block Attention Module
(CBAM) [122], YOLOv7 substantially boosts its capacity to focus on relevant regions in cluttered environments.
These features particularly enhance the model’s ability to detect objects across varied scales, an essential trait for
drone applications where altitude changes affect object size perception drastically. The model integrates sophisticated
strategies like multi-scale testing [123] and a self-trained classifier, which refines its performance on challenging
categories by specifically addressing common issues in drone imagery such as motion blur and occlusion. These
adaptations have shown notable improvements, with YOLOv7 achieving competitive results in drone-specific datasets
and challenges [124]. The model’s adaptability and robustness in such specialized conditions demonstrate its potential
beyond conventional settings, catering effectively to next-generation applications like urban surveillance and wildlife
monitoring.
Figure 9: Comparative architectures of YOLOv5 [125], YOLOv6 [126], and YOLOv7 [127]. (a) Decoupled head
structures for YOLOv5 and YOLOv6, showing feature extraction from the Feature Pyramid Network (FPN) and
subsequent classification (Cls.), regression (Reg.), and objectness (Obj.) predictions. (b) Detailed backbone, neck,
and prediction modules of YOLOv7, highlighting ELAN and other components. (c) Overall pipeline of YOLOv5,
including backbone, detection heads, and feature extraction blocks, showcasing the architectural advancements
across versions.
YOLOv6 emerges as a robust solution in industrial applications by delivering a finely balanced trade-off between
speed and accuracy, crucial for deployment across various hardware platforms [80]. It iterates on previous versions by
incorporating cutting-edge network designs, training strategies, and quantization techniques to enhance its efficiency
and performance significantly. This model has been optimized for diverse operational requirements with its scalable
architecture, ranging from YOLOv6-N to YOLOv6-X, each offering different levels of performance to suit specific
computational budgets [128]. Significant innovations in YOLOv6 include the use of advanced label assignment
techniques and loss functions that refine the model’s predictive accuracy and operational efficiency. By leveraging
state-of-the-art advancements in machine learning, YOLOv6 not only excels in traditional object detection metrics
but also sets new standards in throughput and latency, making it exceptionally suitable for real-time applications in
industrial and commercial domains.
The subsequent versions of YOLO, namely YOLOv6 and YOLOv7, each introduce innovative features that build on the foundation set by YOLOv5. YOLOv6, released in 2022, introduced lightweight nano models optimized for mobile and CPU environments, alongside a more effective backbone for improved small object detection. YOLOv7 further advanced this development by incorporating the PANet feature aggregation network [129], enhancing feature representation, and employing the CIoU loss function for better object scaling and aspect ratio handling.
YOLOv6 significantly shifts the architecture to an anchor-free design, incorporating a self-attention mechanism to
better capture long-range dependencies and employing adaptive training techniques to optimize performance during
training [130]. These versions collectively push the boundaries of object detection performance, emphasizing speed,
accuracy, and adaptability across a range of deployment scenarios.
YOLOv5 has significantly contributed to the YOLO series evolution, focusing on user-friendliness and performance
enhancements [85, 86]. Its introduction by Ultralytics brought a streamlined, accessible framework that lowered the
barriers to implementing high-speed object detection across various platforms. YOLOv5’s architecture incorporates a
series of optimizations including improved backbone, neck, and head designs which collectively enhance its detection
capabilities. The model supports multiple size variants, facilitating a broad range of applications from mobile devices to
cloud-based systems [85]. YOLOv5’s adaptability is further evidenced by its continuous updates and community-driven
enhancements, which ensure it remains at the forefront of object detection technologies. This version stands out for its
balance of speed, accuracy, and utility, making it a preferred choice for developers and researchers looking to deploy
state-of-the-art detection systems efficiently.
YOLOv5 marks a significant evolution in the YOLO series, focusing on production-ready deployments with streamlined
architecture for real-world applications. This version emphasizes reducing the model’s complexity by refining its
layers and components, enhancing its inference speed without sacrificing detection accuracy. The backbone and feature
extraction layers were optimized to accelerate processing, and the network’s architecture was simplified to facilitate
faster data throughput. Importantly, YOLOv5 enhances its deployment flexibility, catering to edge devices with limited computational resources through model modularity and efficient activations. These architectural refinements ensure YOLOv5 operates effectively in diverse environments, from high-resource servers to mobile devices, making it a
versatile tool in the arsenal of object detection technologies.
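The user-friendliness discussed above is visible in how little code a YOLOv5 inference run takes via the official torch.hub entry point; the image URL below is illustrative, and the first call downloads pretrained weights:

```python
import torch

# Ultralytics publishes YOLOv5 as a torch.hub entry point, which is a large
# part of its accessibility. 'yolov5s' is the small pretrained variant.
model = torch.hub.load("ultralytics/yolov5", "yolov5s")

results = model("https://ultralytics.com/images/zidane.jpg")  # path, URL, or array
results.print()                 # class counts and speed summary
boxes = results.xyxy[0]         # tensor rows: (x1, y1, x2, y2, conf, class)
print(boxes[:3])
```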
The introduction of YOLOv4 [83] in 2020 marked a major step in these developments, employing CSPDarknet-53 [131] as its backbone. This modified version of Darknet-53 uses Cross-Stage Partial connections to reduce computational demands while enhancing learning capacity. YOLOv4 incorporates innovative features such as Mish activation [132], replacing traditional ReLU to maintain smooth gradients, and utilizes new data augmentation techniques like Mosaic and CutMix [133]. Additionally, it introduces advanced regularization methods including DropBlock regularization [134] and Class Label Smoothing to prevent overfitting [135], alongside optimization strategies termed BoF (Bag of Freebies) [136] and BoS (Bag of Specials) that enhance training and inference efficiency. Preceding YOLOv4, YOLOv3 was introduced in 2018; it utilized the Darknet-53 architecture with influences from residual learning. Its backbone was initially pretrained on ImageNet, helping it to effectively detect objects across various sizes through its multi-scale detection capabilities.
Figure 10: Comparison of YOLOv4 [83] and YOLOv3 [82] architectures. (a) YOLOv4 architecture showing backbone, neck, dense prediction (one-stage), and sparse prediction (two-stage) modules. (b) YOLOv3 architecture featuring convolutional and upsampling layers leading to multi-scale predictions. This highlights the structural advancements in object detection between the two versions.
Figure 11: Comparison of YOLOv1 [137] and YOLOv2 [81] architectures. (a) YOLOv1 architecture, showing
the sequence of convolutional layers, max-pooling layers, and fully connected layers used for object detection.
This model performs feature extraction and prediction in a single unified step, aiming for real-time performance.
(b) YOLOv2 architecture, illustrating improvements such as the use of batch normalization, higher resolution
input, and anchor boxes.
YOLOv3 [82] improved detection accuracy, especially for small objects, through its use of three different scales
for detection, thereby capturing essential features at various resolutions. Earlier, YOLOv2 and the original YOLO
(YOLOv1) laid the groundwork for these advancements [137]. Released in 2016, YOLOv2 introduced a new 30-layer
architecture with anchor boxes from Faster R-CNN and batch normalization [138] to speed up convergence and enhance
model performance. YOLOv1, introduced in 2015 by Joseph Redmon, revolutionized object detection with its single-shot mechanism that predicted bounding boxes and class probabilities in one network pass, utilizing a comparatively simple GoogLeNet-inspired backbone of 24 convolutional layers. This initial approach significantly accelerated the detection process, establishing the foundational
techniques that would be refined in later versions of the YOLO series.
YOLOv4 and YOLOv3, showcasing their advanced architectures and features, are illustrated in Figure 10a and b, respectively, while YOLOv2 and YOLOv1 are depicted in Figure 11a and b, illustrating the foundational developments in the series.
5 Applications
YOLO has many real-time practical applications, such as obstacle detection in autonomous vehicles, pedestrian pose
estimation for intention prediction, and traffic sign recognition, enhancing safety and navigation [60]. Additionally,
YOLO is employed in surveillance for intrusion detection and anomaly identification, and in healthcare for detecting
anomalies in medical images, aiding accurate and efficient diagnostics [139].
5.1 Autonomous Vehicles
Each YOLO version has been pivotal in advancing the capabilities of autonomous vehicles by providing highly efficient
and accurate real-time detection systems. Successive iterations have brought improvements that enhance a vehicle's
ability to perceive its environment quickly and accurately, which is critical for safe navigation and decision-making
[140]. Starting with YOLOv1 [35], the YOLO algorithm revolutionized the approach by performing detection tasks
directly from full images in a single network pass, allowing for the detection of objects at a remarkable speed [141].
This initial model was pivotal, setting a high standard for real-time object detection and establishing a framework that
future versions would build upon. Subsequent iterations, including YOLOv2 and YOLOv3, continued to refine this
approach by introducing concepts such as real-time multi-scale processing and improved anchor box adjustments, which
enhanced the accuracy and robustness of the detections. These versions were particularly adept at handling the variable
scales of objects seen in driving environments—from nearby pedestrians to distant road signs—making them invaluable
for autonomous driving applications. YOLOv4 and later versions further pushed the boundaries by integrating advanced
neural network techniques and optimizations that improved detection accuracy while maintaining the high-speed
processing necessary for real-time applications [142, 143]. These advancements in YOLO technology have not only
bolstered the capabilities of autonomous vehicles in terms of environmental perception and decision-making but have
also significantly contributed to advancements in automotive safety and operational reliability [144].
Ye et al. (2022) developed an end-to-end adaptive neural network control for autonomous vehicles that predicts steering
angles using YOLOv5, enhancing vehicle navigation precision [145]. Mostafa et al. (2022) compared the effectiveness
of YOLOv5, YOLOX, and Faster R-CNN in detecting occluded objects for autonomous vehicles, improving detection
reliability [12]. Jia et al. (2023) proposed an enhanced YOLOv5 detector for autonomous driving, which offers
increased speed and accuracy [146]. Chen et al. (2023) utilized an improved YOLOv5-OBB algorithm for autonomous
parking space detection in electric vehicles, enhancing operational efficiency [147]. Liu and Yan (2022) customized
YOLOv7 for vehicle-related distance estimation, providing essential metrics for safe navigation [148]. Mehla et al.
(2023) evaluated YOLOv8 against EfficientDet in autonomous maritime vehicles, highlighting the superior detection
capabilities of YOLOv8 [149]. Patel et al. (2024) enhanced traffic sign detection using YOLOv8, promoting safer
driving environments [150].
YOLOv8 and YOLOv9 are at the forefront of transforming the landscape of autonomous vehicle technologies, playing
a pivotal role in enhancing the operational safety and efficiency of self-driving cars. These models have excelled
in real-time object detection, a crucial aspect of autonomous driving, especially under the challenging and variable
conditions typical in real-world traffic environments. For instance, in the Robotaxi-Full Scale Autonomous Vehicle
Competition, YOLOv8 was specifically adapted to recognize and interpret traffic signs, providing real-time alerts that
are essential for safe driving [151]. Moreover, YOLOv8-QSD, an enhanced version, addresses the need for detecting
smaller objects such as traffic signs and signals, demonstrating its utility with a notable accuracy rate and efficiency in
processing, making it ideal for high-speed driving scenarios [152].
Further advancements with YOLOv8 have led to significant improvements in object detection in adverse weather
conditions, an area of particular concern for autonomous driving. The application of transfer learning techniques using
datasets from diverse weather conditions has markedly increased the detection performance of YOLOv8, ensuring
reliable recognition of crucial road elements like pedestrians and obstacles under challenging weather scenarios
[153]. Additionally, the development of YOLOv8 for specific tasks such as brake light status detection illustrates the
algorithm’s flexibility and its potential in enhancing interpretability and safety for autonomous vehicles [154]. These
innovations underscore the critical role of YOLOv8 and YOLOv9 in pushing the boundaries of what is possible in the
autonomous vehicle industry, highlighting their impact in meeting the rigorous demands for safety and reliability in
self-driving technologies [155]. Table 3 illustrates different applications of YOLO in the autonomous vehicle industry,
presented in reverse chronological order from the most recent versions to the older ones.
Table 3: Studies on YOLO applications in autonomous vehicles, focusing on object detection and real-time performance improvements for enhanced safety.
• "Transforming Aircraft Detection Through LEO Satellite Imagery and YOLOv9 for Improved Aviation Safety" (YOLOv9; [156], 2024). Work: Utilizes YOLOv9 with LEO satellite imagery for enhanced detection of aircraft in wide-area airport environments. Purpose: Improve airport security and aviation safety by integrating advanced YOLO-based object detection with satellite imagery.
• "YOLOv8-QSD: An Improved Small Object Detection Algorithm [...] Systems" (YOLOv8-QSD; [152], 2024). Work: Developed an anchor-free, BiFPN-enhanced YOLOv8 model for better detection of small objects like traffic signs and traffic lights. Purpose: Enhances detection of small objects for autonomous vehicles, with accuracy compared [...] to YOLOv5, tested on the BDD100K, TT100K, and DTLD datasets.
• "Deep convolutional neural network for enhancing traffic sign recognition developed on Yolo V4" (YOLOv4; [160], 2022). Work: Analyzed YOLO V4 and YOLO V4-tiny with SPP for better feature extraction in traffic sign recognition. Purpose: Compares the enhancement of traffic sign recognition performance by integrating SPP into YOLO V4 backbones.
• "The improvement in obstacle detection in autonomous vehicles using YOLO non-maximum suppression fuzzy algorithm" (YOLOv3; [161], 2021). Work: Employed a hybrid of fuzzy logic and NMS in YOLO for better obstacle detection in autonomous driving. Purpose: Enhances obstacle detection accuracy and speed using a modified YOLO algorithm.
• "Object Tracking for Autonomous Vehicle Using YOLOV3" (YOLOv3; [162], 2022). Work: Evaluated YOLOv3 for object tracking in autonomous vehicles. Purpose: Two models were provided, one trained using only the online COCO dataset and the other trained with additional images from various locations at Universiti Malaysia Pahang (UMP).
5.2 Pedestrian Pose Estimation for Intention Prediction and Behavioral Analysis
Ali et al. [163] presented a Bayesian Generalized Extreme Value Model designed to evaluate real-time pedestrian crash
risks at signalized intersections, leveraging advanced AI-based video analytics. This framework employs deep learning
algorithms like YOLO for precise object detection and DeepSORT for effective tracking. The model concentrates on
crucial safety indicators such as Post Encroachment Time (PET). Through this approach, the study underscores the
significant role of AI-driven video analysis in boosting intersection safety by delivering real-time risk assessments.
This development signifies a substantial advancement in the proactive management of traffic safety. Hussain et al.
[164] explored the enhancement of pedestrian crash estimation using machine learning techniques focused on anomaly
detection. Their study addresses the limitations of traditional Extreme Value Theory (EVT) models by applying
unconventional sampling methods, thereby increasing the accuracy and reducing uncertainty in crash risk estimations.
The use of YOLO for object detection and DeepSORT for tracking is pivotal in this methodology, enhancing detection
accuracy and tracking reliability in real-time scenarios. Ghaziamin et al. [165] developed a privacy-preserving real-time
passenger counting system for bus stops using overhead fisheye cameras. This innovative system employs YOLOv4,
along with DetectNet-V2 and Faster R-CNN, for detection purposes, and DeepSORT for tracking. The system processes
data in real-time at 30 frames per second (FPS) when utilizing YOLOv4 as the detection model. This technology
significantly enhances transit planning by providing accurate passenger counts, while also maintaining passenger
privacy and energy efficiency.
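A minimal detect-then-track loop in the spirit of these pipelines is sketched below; it assumes the ultralytics and deep-sort-realtime packages, the weight file and video path are placeholders, and indicators such as PET would be computed from the per-track trajectories:

import cv2
from ultralytics import YOLO
from deep_sort_realtime.deepsort_tracker import DeepSort

model = YOLO("yolov8n.pt")                   # any YOLO detector works here
tracker = DeepSort(max_age=30)               # drop tracks unseen for 30 frames
cap = cv2.VideoCapture("intersection.mp4")   # placeholder footage

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    result = model(frame, classes=[0], verbose=False)[0]   # class 0 = person
    # deep-sort-realtime expects ([left, top, width, height], conf, class) tuples.
    detections = []
    for box in result.boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        detections.append(([x1, y1, x2 - x1, y2 - y1], float(box.conf), "person"))
    for track in tracker.update_tracks(detections, frame=frame):
        if track.is_confirmed():
            print(track.track_id, track.to_ltrb())   # trajectory per pedestrian
cap.release()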
Additionally, pedestrian-vehicle conflict prediction was explored by Zhang et al. [166], who proposed a model employing a
Long Short-Term Memory (LSTM) neural network to forecast pedestrian-vehicle conflicts at signalized intersections by
analyzing video data. The model uses YOLOv3 for object detection and Deep SORT for tracking, achieving impressive
accuracy rates and demonstrating the transformative potential of LSTM networks in collision warning systems. This
approach suggests a proactive enhancement of pedestrian safety in connected vehicle environments.
Crossing intention prediction and behavioral analysis were also explored by Zhang et al. [166], who utilized an LSTM neural
network to predict pedestrian red-light crossing intentions at intersections by analyzing video data of real traffic
scenarios. The model uses YOLOv3 for detection and DeepSORT for tracking, recognizing patterns that indicate
potential red-light crossings with a high accuracy rate. This capability aims to improve traffic safety through vehicle-to-
infrastructure communication systems that alert drivers to potential pedestrian violations, thereby preventing accidents.
Yang et al. [167] introduced the VENUS smart node, a cooperative traffic signal assistance system for non-motorized
users and individuals with disabilities. This novel infrastructure leverages computer vision and edge AI to integrate
real-time data on pedestrian movement and intent. The system employs YOLOv4 for detection and OpenPose for
pose estimation, achieving high accuracy in detecting crossing intentions and mobility status across various test sites.
This innovation has significant potential for widespread use in smart city infrastructures, greatly enhancing safety and
accessibility.
Jiao et al. [168] conducted a study on monitoring pedestrian walking speeds at the street level using drones. The
research utilized UAV-based video footage to measure walking speeds of pedestrians on a commercial street. Deep
learning algorithms, particularly YOLOv5 for object detection and DeepSORT for tracking, were employed in this
study. Speed calculations were adjusted for geometric distortions using the SIFT and RANSAC algorithms, achieving
high accuracy. The study found that 90.5% of the corrected speeds had an absolute error of less than 0.1 m/s, providing
a precise and non-intrusive method for analyzing pedestrian walking speeds. Wang et al. [169] used drone-captured
video footage to examine "safe spaces" for pedestrians and e-bicyclists at urban crosswalks. The study discovered that
e-bicyclists maintain larger semi-elliptical safe zones that are sensitive to speed changes compared to the semi-circular
zones maintained by pedestrians. By quantifying these safe spaces and examining variations due to speed and traffic
presence, the study offers valuable insights for enhancing crosswalk safety and managing urban traffic more effectively.
The use of YOLOv3 for object detection and DeepSORT for tracking plays a critical role in this analysis. Zhou et al.
[170] developed an innovative model that integrates a pedestrian-centric environment graph with Graph Convolutional
Networks (GCNs) and a pedestrian-state encoder. This model effectively captures dynamic interactions between
pedestrians and their environments, providing advanced safety warnings by predicting crossing intentions up to three
seconds in advance. This model holds significant potential for applications in intelligent transportation systems. The
integration of YOLOv5 for detection, DeepSORT for tracking, and HRNet for pose estimation enhances the model’s
predictive accuracy and real-time application. Table 4 illustrates different applications of YOLO in pedestrian pose
estimation for intention prediction and behavioral analysis.
Table 4: Studies on YOLO usage in pedestrian pose estimation, for intention prediction and behavioral analysis.
• "Multi-Object Pedestrian Tracking Using Improved YOLOv8 and OC-SORT" (YOLOv8; [171], 2023). Work: Proposes a comprehensive approach for pedestrian tracking by combining the improved YOLOv8 object detection algorithm with the OC-SORT tracking algorithm, integrating advanced techniques such as SoftNMS, GhostConv, and C3Ghost modules. Purpose: Aimed to enhance multi-object pedestrian tracking for autonomous driving systems by improving detection accuracy and model efficiency with YOLOv8, and integrating it with the OC-SORT tracking algorithm for robust tracking in challenging scenarios.
• "Forecast Pedestrian-Vehicle Collisions at Traffic Lights" (YOLOv3; [166], 2020). Work: Implements YOLOv3 for detecting pedestrian-vehicle interactions and classifying them into safe interactions, slight conflicts, and severe conflicts. Purpose: Enhance pedestrian safety at intersections by modeling and predicting potential pedestrian-vehicle conflicts.
• "Prediction of Pedestrian Crossing Intentions at Intersections Based On Long Short-Term Memory Recurrent Neural Network" (YOLOv3; [166], 2020). Work: Predicts pedestrian red-light crossing behavior at intersections using LSTM networks; YOLOv3 is employed to detect pedestrians and extract relevant characteristics from video data, which are then used for behavioral prediction. Purpose: YOLOv3 is used to identify pedestrians and extract relevant characteristics from the video data, which are then passed into the LSTM neural network for prediction.
5.3 Healthcare and Medicine
YOLO has marked a significant technological advancement, especially with the introduction of newer versions such as
YOLOv7 and YOLOv8 [172, 173, 174]. The recent iterations of YOLO, particularly YOLOv7, YOLOv8, and YOLOv9,
could significantly enhance medical diagnostics by offering advanced computational efficiency and improved feature
extraction capabilities, making them suitable for real-time medical imaging applications. Such capabilities are crucial in
urgent care scenarios, where swift diagnosis can be pivotal. For instance, YOLOv8’s sophisticated algorithms excel in
accurately delineating complex biological structures, vital for identifying pathologies in conditions like vascular diseases
or tumors. Similarly, YOLOv9’s rapid processing power enables immediate analysis of medical images, essential in
emergency medical responses where timely intervention is critical. These versions have the potential to revolutionize
healthcare by facilitating early detection of diseases and supporting continuous patient monitoring, transforming the
traditional approach of healthcare diagnostics into one that integrates accurate, swift diagnostics seamlessly with routine
medical examinations. Unlike traditional methods, which depend heavily on manual annotation and are prone to
errors and subjectivity, YOLO algorithms automate the detection and localization of medical anomalies such as tumors,
lesions, and other pathological markers across various imaging modalities. This automation is driven by YOLO’s unique
architecture that efficiently predicts multiple bounding boxes and class probabilities in a single analysis, enhancing
diagnostic accuracy and reducing the potential for human error.
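As a minimal illustration of this single-pass usage with the ultralytics API, the sketch below runs a detector over a scan; the weight file (a hypothetical model fine-tuned on annotated scans) and the image name are placeholders:

from ultralytics import YOLO

model = YOLO("lesion_yolov8.pt")          # hypothetical fine-tuned detector
result = model("chest_xray_0142.png")[0]  # one forward pass per image
for box in result.boxes:
    cls_name = result.names[int(box.cls)]  # e.g. "nodule" or "lesion"
    print(f"{cls_name}: conf={float(box.conf):.2f}, bbox={box.xyxy[0].tolist()}")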
In the field of medical imaging and diagnostics, the adoption of the YOLO object detection algorithm has showcased
promising improvements in accuracy and efficiency, particularly with its latest versions like YOLOv5, YOLOv6,
YOLOv7, and YOLOv8. For instance, Luo et al. (2021) leveraged YOLOv5 in conjunction with ResNet50 to
enhance chest abnormality detection, demonstrating the algorithm’s proficiency in identifying subtle medical conditions
[48]. Similarly, Wu et al. (2022) developed Me-YOLO, an adapted version of YOLOv5, to improve the detection
of medical personal protective equipment, highlighting the model’s adaptability to varied medical use cases [175].
Moreover, advancements like the CSFF-YOLOv5 by Zhao et al. (2024) introduced modifications for better feature
fusion, significantly boosting the detection accuracy in femoral neck fracture cases [176]. This specificity is further
explored by Goel and Patel (2024), who enhanced YOLOv6 for lung cancer detection using an advanced PSO
optimizer, underscoring the potential of YOLO algorithms in facilitating early disease diagnosis and treatment [177].
Additionally, the extension of YOLOv6 by Norkobil Saydirasulovich et al. (2023) for improved fire detection in smart
city environments exemplifies the algorithm’s versatility beyond traditional medical applications, proving its efficacy in
diverse environmental conditions [178]. Each of these developments not only enhances specific medical diagnostic
processes but also paves the way for integrating these advanced object detection systems into broader healthcare
applications, as illustrated by the innovative uses of YOLOv7 and YOLOv8 in detecting whole body bone fractures and
enhancing hospital efficiency [179, 49]. These studies collectively demonstrate the significant advancements brought by
YOLO in the healthcare sector, ensuring more precise, efficient, and versatile diagnostic solutions.
Recent versions such as YOLOv7, YOLOv8 and YOLOv9 have been effectively demonstrated across a variety of
healthcare applications. Razaghi et al. (2024) utilized YOLOv8 for the innovative diagnosis of dental diseases,
highlighting its precision in identifying dental pathologies [180]. Similarly, Pham and Le (2024) leveraged YOLOv8
for the detection and classification of ovarian tumors from ultrasound images, showcasing the model’s adaptability
to different medical imaging modalities [181]. Krishnamurthy et al. (2023) applied custom YOLO architectures to
enhance object detection capabilities during endoscopic surgeries, illustrating the potential of YOLO in surgical settings
[182]. Furthermore, Palanivel et al. (2023) discussed the application of YOLOv8 in cancer diagnosis through medical
imaging, further cementing YOLO’s role in critical healthcare applications [183].
Continuing with advancements, Karaköse et al. (2024) introduced CSFF-YOLOv5, an improved YOLO model for
femoral neck fracture detection, utilizing advanced feature fusion techniques [184]. Inui et al. (2023) demonstrated
YOLOv8’s effectiveness in detecting elbow osteochondritis dissecans in ultrasound images, which supports its use in
orthopedic diagnostics [174]. Bhojane et al. (2023) employed YOLOv8 for detecting liver lesions from MRI and CT
images, underscoring the algorithm’s capability across various imaging technologies [185]. Additionally, Zhang et
al. (2023) developed an improved detection model for microaneurysms using YOLOv8, which illustrates continuous
enhancements in YOLO’s application to highly specific medical tasks [114].
Table 5 illustrates the different uses of YOLO versions in healthcare and medicine:
Table 5: Studies on YOLO applications in healthcare and medicine, emphasizing object detection for diagnostic imaging and real-time medical analysis.
• "Efficient Skin Lesion Detection using YOLOv9 Network" (YOLOv9; [173], 2023). Work: Utilized YOLOv9 for advanced skin lesion detection, leveraging deep learning to enhance diagnostic accuracy and speed. Purpose: Developed improved skin lesion identification using YOLOv9, showcasing significant advances in detection performance.
• "Fracture detection in pediatric wrist trauma X-ray images using YOLOv8 algorithm" (YOLOv8; [173], 2023). Work: Employed YOLOv8 with data augmentation on the GRAZPEDWRI-DX dataset [...]. Purpose: Enhanced fracture detection in pediatric wrist trauma using YOLOv8, achieving su[...].
• "[...] from smartphone imaging" (YOLOv5M). Work: [...] lesions from smartphone images. Purpose: [...] and R metrics validated performance.
• "An Improved Method of Polyp Detection Using Custom YOLOv4-Tiny" (YOLOv4-Tiny; [188], 2022). Work: Customized YOLOv4-tiny with an Inception-ResNet-A block for enhanced detection of polyps in wireless endoscopic images. Purpose: Developed to improve the detection performance of polyp detection using a modified YOLOv4-tiny; demonstrated significant performance improvement.
• "Detection of dental caries in oral photographs taken by mobile phones based on the YOLOv3 algorithm" (YOLOv3; [189], 2021). Work: Utilized YOLOv3 for detecting dental caries from mobile phone images, employing image augmentation and enhancement for improved accuracy. Purpose: Enhanced detection and diagnosis of dental caries using YOLOv3, with evaluation of diagnostic precision, recall, and F1-score across different datasets.
• "Automatic thyroid nodule recognition and diagnosis in ultrasound imaging with the YOLOv2 neural network" (YOLOv2; [190], 2019). Work: Employed YOLOv2 for automatic detection and diagnosis of thyroid nodules in ultrasound images, enhancing diagnostic precision. Purpose: Compared AI performance with radiologists using YOLOv2, showing improved accuracy and specificity in thyroid nodule diagnosis; ROC curve analysis confirms effectiveness.
• "Real-Time Facial Features Detection from Low Resolution Thermal Images with Deep Classification Models" (Custom Deep Classification Model and YOLO; [191], 2018). Work: Developed a method to localize facial features from low-resolution thermal images by modifying existing deep classification networks for real-time detection. Purpose: Demonstrates how spatial information can be restored and utilized from classification models for facial feature detection, significantly reducing dataset preparation time while maintaining high precision.
5.4 Security and Surveillance
In the ever-evolving field of security systems, YOLO’s application extends to detecting unauthorized entries and
identifying potential threats swiftly, thereby bolstering security measures [192, 193]. Recent YOLO models such as
YOLOv6 build on this by improving detection accuracy through deeper network layers that process images with greater
precision [194]. Meanwhile, YOLOv7 offers advanced customization options that allow security systems to be finely
tuned to specific surveillance needs, enhancing the adaptability and effectiveness of threat detection [194, 195]. These
YOLO versions support high-resolution video feeds, ensuring that security personnel can engage with real-time data
to make informed decisions quickly. Further advancements in surveillance systems are embodied by YOLOv8 and
YOLOv9, which introduce significant innovations in deep learning for security applications [196, 197, 198]. YOLOv8’s
architecture is designed to handle complex environments where traditional surveillance systems may fail, such as
varying lighting and weather conditions. This version’s robust performance in diverse scenarios enhances its utility in
comprehensive security strategies. On the other hand, YOLOv9 pushes the boundaries of speed and accuracy, providing
unparalleled real-time analysis and detection capabilities. Its deployment in surveillance systems ensures that even
the subtlest anomalies are detected, reducing the likelihood of security breaches. The integration of recent versions of
YOLO such as YOLOv8 and YOLOv9 into security frameworks not only streamlines operations but also ensures a
proactive approach to threat management, keeping public and private spaces safer across the globe [199, 200, 201].
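A simplified intrusion-alert loop illustrating this pattern is sketched below; the restricted-zone polygon, camera index, and alert handling are illustrative, and the ultralytics package is assumed:

import cv2
import numpy as np
from ultralytics import YOLO

# Pixel-space polygon marking the restricted zone (illustrative coordinates).
RESTRICTED = np.array([[100, 400], [500, 400], [500, 700], [100, 700]],
                      dtype=np.int32).reshape(-1, 1, 2)
model = YOLO("yolov8s.pt")
cap = cv2.VideoCapture(0)                 # live camera feed

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    for box in model(frame, classes=[0], verbose=False)[0].boxes:  # persons only
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        foot = ((x1 + x2) / 2, y2)        # bottom centre approximates ground contact
        if cv2.pointPolygonTest(RESTRICTED, foot, False) >= 0:
            print("ALERT: person inside restricted zone at", foot)
cap.release()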
The application of YOLO models in surveillance and security systems highlights their pivotal role in enhancing
real-time response and precision. Majeed et al. [192] investigated the effectiveness of a YOLOv5-based security
system within a real-time environment, underscoring its capability to significantly improve operational efficiency in
dynamic settings. Similarly, Affes et al. [194] conducted a comparative study across YOLOv5, YOLOv6, YOLOv7,
and YOLOv8, focusing on their performance in intelligent video surveillance systems. Their analysis demonstrated
the incremental improvements in detection accuracy and processing speed, crucial for real-time security applications.
Further advancing the field, Cao and Ma [195] utilized a refined YOLOv7 model to enhance campus security through
improved target detection capabilities, highlighting the model’s precision in identifying potential threats in densely
populated environments. Chatterjee et al. [196] introduced a YOLOv8-based intrusion detection system specifically
tailored for physical security and surveillance, which significantly contributes to safeguarding assets and individuals by
detecting unauthorized entries or activities effectively. Additionally, Sandhya and Kashyap [197] employed YOLOv8 for
real-time object-removal tampering localization in surveillance videos, a crucial technology for maintaining the integrity
of video evidence and ensuring the reliability of surveillance feeds. Together, these studies showcase the robustness of
YOLO architectures in addressing diverse and complex security challenges, providing substantial improvements in both
the efficacy and efficiency of surveillance operations.
Recent studies have significantly leveraged advanced YOLO models to enhance surveillance and security across various
domains. Bakirci and Bayraktar [199] discussed optimizing ground surveillance for aircraft monitoring using YOLOv9,
highlighting its efficacy in real-time security applications. Similarly, Chakraborty et al. [202] explored a multi-model
approach for violence detection, incorporating YOLOv8 to improve public safety through automated surveillance.
These advancements indicate a shift towards reliable and efficient security systems for complex scenarios.
Chen et al. [203] delve into the application of an enhanced YOLOv8 model for large-scale security and low-altitude
drone-based law enforcement, demonstrating its potential in managing security risks effectively. Further, Pashayev et
al. [204] utilize YOLOv8 for intelligent face recognition in smart cameras, contributing to the development of smarter,
more responsive surveillance technologies. Additionally, Kaç et al. [205] investigate image-based security techniques
for critical water infrastructure surveillance, employing YOLO models to ensure robust monitoring. Lastly, Gao et al.
[206] introduce an improved YOLOv8s network model for contraband detection in X-ray images, underscoring the
versatility and precision of YOLO models in enhancing contraband security measures.
Recent advancements in surveillance technologies have leveraged YOLO's capabilities, particularly in managing
crowd dynamics and detecting critical events. Antony et al. [207] explored the use of YOLOv8 alongside ByteTrack
for crowd management, emphasizing the system’s efficiency in improving surveillance and public safety. This
integration marks a significant step towards enhancing real-time monitoring capabilities during large public gatherings.
Concurrently, Zhang [208] utilized a YOLO model to detect fire and smoke in IoT surveillance systems, showcasing the
model’s ability to respond swiftly to emergency situations, thus bolstering safety protocols within environments.
In security, Khin et al. [209] conducted a comparative study of YOLOv8 with other models like RetinaNet and
EfficientDet for gun detection, emphasizing YOLOv8’s superior accuracy in detecting firearms within a custom dataset.
It underlines the critical role of precise object detection to prevent potential threats. Additionally, Nkuzo et al. [210]
provided a comprehensive analysis of the YOLOv7 in detecting car safety belts in real-time, illustrating its importance in
enforcing road safety measures. Moreover, Chang et al. [211] developed an improved YOLOv7, equipped with feature
fusion and attention mechanisms, tailored for detecting safety gear violations in high-risk environments like construction,
to enhance workplace safety standards. Table 6 presents the various YOLO usage in security and surveillance.
Table 6: Studies on YOLO usage in security and surveillance, for real-time threat detection and enhanced monitoring to improve safety measures.
• "YOLOv9-Enabled Vehicle Detection for Urban Security and Forensics Applications" (YOLOv9; [200], 2024). Work: Implements YOLOv9 for aerial vehicle detection via UAVs, enhancing urban security and forensic capabilities. Purpose: Focuses on utilizing YOLOv9 for real-time vehicle monitoring, facilitating efficient law enforcement and forensic analysis in urban settings.
• "SC-YOLOv8: A Security Check Model for the Inspection [...]" (YOLOv8; [212], 2023). Work: Developed a custom YOLOv8 model for X-ray image analysis to detect prohibited items. En[...]. Purpose: Aimed to improve security screening effectiveness and reduce error rates in detecting prohibited [...].
• "[...] warning with Deep Neural Network based on YOLO-V5" (YOLO-v5). Work: [...] unauthorized entry, and vehicle misplacement in real-time; combines deep learning with video surveillance to reduce the need for extra hardware. Purpose: [...] detecting various security threats simultaneously using YOLO-v5; demonstrates the application of YOLO-v5 in critical infrastructure protection.
• "Fighting against terrorism: A real-time CCTV autonomous weapons detection based on improved YOLO v4" (YOLOv4; [216], 2023). Work: Improved YOLOv4 with an SCSP-ResNet backbone and F-PaNet module for detecting weapons in CCTV footage, integrating synthetic and real-world data to enhance detection. Purpose: Aims to bolster security and counter-terrorism efforts by accurately identifying weapons in CCTV using an advanced YOLOv4 architecture, demonstrating significant performance improvements.
• "Automatic tracking of objects using improvised Yolov3 algorithm and alarm human activities in case of anomalies" (YOLOv3; [217], 2022). Work: Utilizes an enhanced YOLOv3 model to automatically track objects and alert for anomalies in live video feeds, comparing performance with CNNs and decision trees. Purpose: Designed to enhance surveillance systems by detecting and alerting on anomalies like bag stealing and lock-breaking, demonstrating rapid processing and high detection accuracy.
• "Multi-Object Detection using Enhanced YOLOv2 and LuNet Algorithms in Surveillance Videos" (YOLOv2; [218], 2024). Work: Employs a novel YOLOv2-LuNet combination for efficient multi-object tracking in video surveillance, enhancing feature extraction and object detection accuracy. Purpose: Designed to improve real-time surveillance by enabling robust multi-object tracking in challenging conditions; highlights the effectiveness of the combined YOLOv2 and LuNet approach.
• "From Silence to Propagation: Understanding the Relationship between 'Stop Snitchin'' and 'YOLO'" (N/A; [219], 2015). Work: Examines the cultural shift from 'Stop Snitchin'' to 'YOLO' in urban hip-hop culture, highlighting the role of social media in promoting individualism and exceptionalism. Purpose: Aims to explore how social media influences criminal behavior and public perception, applying cultural criminology to assess changes in social interactions and deviance.
5.5 Manufacturing
In the landscape of industrial manufacturing, the deployment of YOLO algorithms significantly enhances the capability
of automated optical inspection (AOI) systems. Each iteration of the YOLO family, from YOLOv2 to YOLOv5, and
beyond into the latest versions like YOLOv6 and YOLOv7, brings forward substantial improvements in detecting
defects across various manufacturing domains [105, 220, 221, 222]. The high accuracy and real-time processing
capabilities of YOLOv6 and YOLOv7, for instance, allow for immediate identification of production flaws, crucial
for maintaining workflow efficiency on fast-paced production lines [223, 224]. Advancing into the domain of smart
manufacturing, YOLO algorithms are pivotal in revolutionizing quality control mechanisms [225, 226]. The continuous
evolution from YOLOv5 through YOLOv6, YOLOv7, and YOLOv8, up to the 10th version of YOLO, exemplifies the adaptation
of deep learning to meet the stringent quality demands of modern manufacturing processes. These algorithms reduce
the need for labor-intensive manual inspections, thereby minimizing the margin for human error and enhancing the
overall speed of quality assessments [105, 220, 221, 222, 225, 226].
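As a hedged sketch of how such an inspection detector is typically obtained with the ultralytics API, the snippet below fine-tunes a pretrained model on a defect dataset; defects.yaml is a placeholder configuration listing image paths and class names such as scratch, dent, or pit:

from ultralytics import YOLO

model = YOLO("yolov8n.pt")       # start from weights pretrained on COCO
model.train(
    data="defects.yaml",         # placeholder dataset config (paths + classes)
    epochs=100,
    imgsz=640,
    batch=16,
)
metrics = model.val()            # held-out evaluation on the defect classes
print(metrics.box.map)           # mAP50-95 for the inspection task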
For instance, [227] pioneered YOLO-IMF, an enhanced version of YOLOv8 tailored for precise surface defect detection
in industrial settings, exemplifying the algorithm’s efficacy in real-time environments. This refinement aims to
cater to the high demands for accuracy in manufacturing sectors where defects can significantly impact quality and
safety. Continuing this trend, [228] introduced Yolo-SD, which utilizes simulated feature fusion for few-shot learning,
enhancing YOLOv8’s capability in detecting industrial defects under varied conditions. Similarly, [229] extended
YOLOv8’s utility in monitoring 3D printing processes by optimizing hyperparameters to detect faults more accurately,
reflecting a targeted approach to maintaining production integrity. [230] adapted YOLOv8 to inspect cylindrical parts,
a critical aspect of quality control in specialized manufacturing. Lastly, [231] leveraged a conditioned version of
YOLOv8, named Cond-YOLOv8-seg, to assess the uniformity of industrially produced materials, showcasing the
model’s versatility across different manufacturing scenarios. These innovations underscore the pivotal role of YOLO
algorithms in driving forward the capabilities of industrial inspection systems, highlighting their impact on enhancing
operational efficiency and product quality.
Additionally [232] introduced DCS-YOLOv8, a variant optimized for detecting steel surface defects, demonstrating
its effectiveness in addressing the complexities of steel manufacturing. This adaptation ensures that even minor
imperfections are identified, crucial for maintaining the structural integrity of steel products. Likewise, [233] further
refined YOLOv8 to develop BL-YOLOv8, focusing on road defect detection. This model enhances the safety and
maintenance of transportation infrastructure by enabling more accurate and real-time detection of road surface anomalies.
Similarly, [234] presented a "Hardware-Friendly" YOLOv8 model designed for foreign object identification on belt
conveyors, crucial for preventing equipment damage in materials handling. This version of YOLOv8 is tailored to
perform well on the limited computational resources typical of industrial hardware systems. Finally, [235] employed an
improved YOLOv8 algorithm for the detection of defects in automotive adhesives, a critical quality control measure
for ensuring vehicle safety and durability. These applications of YOLOv8 exemplify its adaptability and precision in
industrial settings, where high accuracy and efficiency are paramount for operational success and safety compliance.
The recent advancements in YOLOv7 have paved the way for significant improvements in industrial inspection and
monitoring systems. Wu et al. (2023) developed an enhanced YOLOv7 model specifically tailored for detecting objects
in complex industrial equipment scenarios, highlighting its application in real-world settings [236]. Similarly, Kim et
al. (2022) implemented YOLOv7 in a real-time inspection system that leverages Moire patterns to detect defects in
highly reflective injection molding products, demonstrating the algorithm’s capability in manufacturing quality control
[237]. Further, Chen et al. (2023) explored the defect detection capabilities of YOLOv7 for automotive running lights,
contributing to safer automotive systems through precise quality assurance techniques [238].
Hussain et al. (2022) applied domain feature mapping with YOLOv7 to automate inspections of pallet racking in
storage facilities, enhancing safety and efficiency in logistics operations [239]. Zhu et al. (2023) extended YOLOv7’s
utility to the identification and classification of surface defects in belt grinding processes, aiding in maintaining the
integrity of manufacturing workflows [240]. Lastly, Zhang et al. (2024) innovated with YOLO-RDP, a lightweight
version of YOLOv7, optimized for detecting steel defects in real-time, showcasing the adaptability of YOLOv7 to
resource-constrained environments and promoting sustainable manufacturing practices [208]. Table 7 illustrates the
different uses of YOLO versions in the field of industrial manufacturing:
Table 7: Studies on YOLO applications in the manufacturing industry, focusing on real-time defect detection and process optimization for improved efficiency.
• "YOLO-IMF: An Improved YOLOv8 Algorithm for Surface Defect Detection in Industrial Manufacturing Field" (YOLOv8; [227], 2023). Work: Proposes an enhanced YOLOv8, YOLO-IMF, for surface defect detection on aluminum plates; replaces the CIOU with the EIOU loss function to better handle small and irregularly shaped targets, achieving significant [...]. Purpose: Demonstrates YOLOv8's extended applicability in industrial settings by enhancing accuracy and defect detection capabilities.
• "[...] Manufacturing Using YOLOv5" (YOLOv5; 2023). Work: [...] tool detection in manufacturing environments, optimizing object detection capabilities for precise tool localization. Purpose: [...] leveraging YOLOv5 for accurate and real-time detection of various tools, contributing significantly to Industry 4.0 initiatives.
• "Efficient Automobile Assembly State Monitoring System Based on Channel-Pruned YOLOv4" (YOLOv4; [243], 2024). Work: Implements a channel-pruned YOLOv4 algorithm to optimize monitoring in automobile assembly, enhancing detection speed without compromising accuracy. Purpose: Designed to streamline assembly monitoring in industrial environments, showcasing YOLOv4's utility in enhancing operational efficiency and deployment readiness.
• "YOLO V3 + VGG16-based Automatic Operations Monitoring in Manufacturing Workshop" (YOLO V3, VGG16; [244], 2022). Work: Utilizes a combined YOLO V3 and VGG16 framework to recognize and monitor industrial operations accurately for Industry 4.0 manufacturing workshops. Purpose: Aims to enhance production efficiency and quality by automating action analysis and process monitoring using advanced YOLO V3 and VGG16 technologies.
• "Improvements of Detection Accuracy by YOLOv2 with Data Set Augmentation" (YOLOv2; [245], 2023). Work: Employs YOLOv2 with an innovative data set augmentation method to enhance the detection accuracy and confidence in identifying defective areas in industrial products. Purpose: Seeks to optimize defect detection and visualization on production lines, demonstrating YOLOv2's effectiveness with limited data augmentation options.
5.6 Agriculture
In agricultural environments, advanced object detection techniques such as YOLOv5 [40, 246, 247], YOLOv6 [248, 249],
YOLOv7 [41, 250], and YOLOv8 [42, 251] have proven to be instrumental in transforming traditional farming into
precision agriculture [252]. YOLOv5, for example, has been adept at weed detection [42, 253], enabling farmers to
apply herbicides more effectively and economically by precisely identifying and localizing weed species amidst crops.
This level of precision not only conserves resources but also mitigates the adverse environmental impact of excessive
chemical usage. Furthermore, YOLOv6, YOLOv7, and YOLOv8 have enhanced capabilities in broader agricultural
applications such as monitoring and analyzing crop health and growth patterns, significantly improving yield predictions
and crop management strategies [43, 254, 255].
The recent introduction of YOLOv7 and YOLOv8 has further pushed the boundaries of agricultural innovation.
YOLOv7 [44, 256, 45] and YOLOv8 [251, 120, 257] have been specifically refined to detect small pests and subtle
disease symptoms on crops, which are often overlooked by human inspectors. Their enhanced deep learning frameworks
allow for integrating complex image recognition tasks that facilitate early detection, thereby preventing widespread
crop damage. YOLOv8, in particular, has made significant strides in fruit detection tasks. Its application in
orchards for detecting fruits such as apples [258] supports optimal harvesting by determining the right stage of
fruit maturity. This maximizes harvest quality and ensures that fruits are picked at their nutritional peak, thereby
enhancing their market value. The application of these advanced YOLO models (YOLOv5, YOLOv6, YOLOv7, and
YOLOv8) represents a leap towards a more sustainable and efficient agricultural sector.
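As a small illustration of maturity-aware detection, the sketch below tallies detections per class, assuming a hypothetical detector fine-tuned with ripe and unripe classes; the weight file and image path are placeholders:

from collections import Counter
from ultralytics import YOLO

model = YOLO("apple_maturity_yolov8.pt")          # hypothetical fine-tuned weights
result = model("orchard_row_03.jpg", conf=0.5)[0]

# Per-class counts, e.g. Counter({'ripe': 34, 'unripe': 12}), feed directly
# into harvest scheduling and yield estimation.
counts = Counter(result.names[int(box.cls)] for box in result.boxes)
print(counts)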
Recent studies have demonstrated the efficacy of YOLO-based models in enhancing various aspects of agricultural
automation and efficiency. Junos et al. (2021) optimized a YOLO-based object detection model to improve crop
harvesting systems, showcasing the potential to boost yield and reduce labor costs [259]. Zhao et al. (2024) extended
this application to real-time object detection combined with robotic manipulation, further aligning agricultural practices
with advanced automation technologies [260]. Chen et al. (2021) developed an apple detection method using a tailored
YOLOv4 algorithm, specifically designed to support harvesting robots operating in complex environments, which
significantly enhances the precision and efficiency of fruit picking [261].
Further contributions include work by Nergiz (2023), who utilized YOLOv7 to enhance strawberry harvesting efficiency,
providing practical solutions for small to medium-sized enterprises in the agricultural sector [262]. Wang et al. (2024)
focused on planning harvesting operations in large strawberry fields using a deep learning-based image processing
method, demonstrating the scalability of YOLO for larger agricultural operations [263]. Lastly, Zhang et al. (2023)
introduced DCF-YOLOv8, an improved algorithm for agricultural pests and diseases detection by aggregating low-level
features, which helps in early detection and management of crop health [251]. These studies collectively illustrate the
transformative impact of YOLO-based models in modernizing agricultural practices, ensuring higher productivity and
sustainability.
In orchard automation, the YOLO object detection models have been specifically pivotal in enhancing the accuracy and
efficiency of fruit detection [264, 265, 266], flower identification [267, 268, 269], and automated harvesting processes
[270, 259, 271]. These models adeptly identify and classify fruits at various stages of ripeness, detect flowers with
high precision, and facilitate efficient harvesting operations. The development of YOLO models has introduced
significant improvements that cater specifically to the challenges of agricultural environments. For instance, YOLOv5’s
introduction of multi-scale predictions improved the detection of small and clustered objects like flowers and young
fruits, which are critical during the early stages of crop yield management [272]. As the models advanced, YOLOv7
and YOLOv8 incorporated better segmentation techniques, which enhanced the differentiation between fruit types and
maturity stages, critical for targeted harvesting [273, 274].
Moreover, the recent iteration YOLOv9 has leveraged advanced algorithms with spatial pyramid pooling and attention
mechanisms, which have refined detection capabilities in plant disease detection [275]. That study performed a
comparative evaluation of several important YOLO versions (v5, v8, and v9) on a real-world dataset for tomato plant
disease detection and found that YOLOv9 outperforms YOLOv5 and YOLOv8.
Table 8 illustrates the different uses of YOLO versions in the field of agriculture:
Table 8: Studies on YOLO usage in agriculture, emphasizing automated crop monitoring, pest detection, and yield estimation for enhanced productivity.
• "Automating Tomato Ripeness Classification and Counting with YOLOv9" (YOLOv9; [276], 2024). Work: Implements YOLOv9 to automate and enhance the accuracy of classifying and counting ripe tomatoes, replacing labor-intensive visual inspections. Purpose: Aims to streamline tomato ripeness monitoring and counting to enhance agricultural productivity and quality; utilizes YOLOv9 for high accuracy in detection.
• "A Lightweight YOLOv8 Tomato Detection [...]" (YOLOv8; [110]). Work: Enhances YOLOv8 for tomato detection [...]. Purpose: Aims to advance agricultural automation [...].
• "[...] disease detection method" (improved YOLOv5; 2022). Work: [...] vegetable diseases by upgrading the CSP, FPN, and NMS modules to handle complex environmental interference. Purpose: [...] the accuracy and speed of disease detection in vegetables using an improved YOLOv5 algorithm.
• "Using channel pruning-based YOLO v4 deep learning algorithm for the real-time and accurate detection of apple flowers in natural environments" (YOLOv4; [267], 2020). Work: Implements a channel-pruned YOLOv4 model to enhance efficiency and accuracy in detecting apple flowers, supporting the development of flower thinning robots. Purpose: Aims to optimize apple flower detection in orchards by applying channel pruning to YOLOv4, significantly reducing model size and improving processing speed while maintaining high accuracy.
• "Fast and accurate detection of kiwifruit in orchard using improved YOLOv3-tiny model" (YOLOv3-tiny; [38], 2021). Work: Enhances YOLOv3-tiny with additional convolutional kernels for improved kiwifruit detection in orchards under occlusions and varying lighting conditions. Purpose: Focuses on increasing the efficiency of kiwifruit detection in dynamic orchard environments with a modified YOLOv3-tiny, demonstrating high performance.
• "A Detection Method for Tomato Fruit Common Physiological Diseases Based on YOLOv2" (YOLOv2; [277], 2019). Work: Implements YOLOv2 to detect and identify healthy and diseased tomatoes, using advanced image processing and data augmentation to enhance detection accuracy. Purpose: Aims to boost tomato yield and quality control through efficient detection of physiological diseases, demonstrating the effectiveness of YOLOv2 in agriculture.
• "A Vision-Based Counting and Recognition System for Flying Insects in Intelligent Agriculture" (YOLO, SVM; [39], 2018). Work: Utilizes YOLO for initial detection and counting, and SVM for fine classification of flying insects, for efficient pest control. Purpose: Demonstrates a robust, efficient system for insect monitoring, greatly enhancing accuracy and speed in pest management.
YOLOv10:
• As the latest version in the YOLO series, YOLOv10 has not yet seen widespread adoption in published
research. Its release promises cutting-edge improvements in object detection capabilities, but the lack of
extensive testing and real-world application data makes it difficult to ascertain its full potential and limitations.
• Preliminary evaluations suggest that while YOLOv10 might offer advancements in speed and accuracy,
integrating it into existing systems could present challenges due to compatibility and computational demands.
Potential users may hesitate to adopt this version until more comprehensive studies and benchmarks are
available, which articulate its advantages over previous models.
• The expectation with YOLOv10, much like its predecessors, is that it will drive further research in object
detection technologies. Its eventual widespread implementation could pave the way for addressing complex
detection scenarios with higher accuracy, particularly in dynamic environments. However, as with any new
technology, the adaptation phase will be crucial in understanding its practical limitations and operational
challenges.
YOLOv9:
• Despite YOLOv9’s enhancements in detection capabilities, it has only been featured in a handful of studies,
which limits a comprehensive understanding of its performance across diverse applications. This lack of
extensive validation may deter organizations from adopting it until more empirical evidence and comparative
analyses establish its efficacy and efficiency over earlier versions.
• While YOLOv9 improves upon the speed and accuracy of its predecessors, it may still struggle with detecting
small or overlapping objects in cluttered scenes. This is a recurring challenge in high-density environments
like crowded urban areas or complex natural scenes, where precise detection is critical for applications such as
autonomous driving and wildlife monitoring.
• Future developments for YOLOv9 could focus on enhancing its robustness in adverse conditions, such as
varying weather, lighting, or occlusions. Integrating more adaptive and context-aware mechanisms could help
in mitigating false positives and improving the reliability of the system under different operational conditions.
The implementation of advanced training techniques such as federated learning could also be explored to
enhance its adaptability and learning efficiency from decentralized data sources.
YOLOv8:
• YOLOv8 has shown significant improvements in object detection tasks, particularly in real-time applications.
However, it continues to face challenges in terms of computational efficiency and resource consumption when
deployed on lower-end hardware [278]. This can limit its applicability in resource-constrained environments
where deploying advanced hardware solutions is not feasible [151].
• The future direction for YOLOv8 could involve optimizing its architectural design to reduce computational
load without compromising detection accuracy. Enhancing its scalability to efficiently process images of
varying resolutions and conditions can broaden its application scope. Moreover, incorporating adaptive scaling
and context-aware training methods could potentially address the detection challenges in complex scenes,
making it more robust against diverse operational challenges.
YOLOv7:
• Although YOLOv7 introduces significant improvements in detection accuracy and speed, its adoption across
varied real-world applications reveals a persistent challenge in handling highly dynamic scenes. For instance,
in environments with rapid motion or in scenarios involving occlusions, YOLOv7 can still experience drops in
performance. The algorithm’s ability to generalize across different types of blur and motion artifacts remains
an area for further research and enhancement.
• The complexity of YOLOv7’s architecture, while beneficial for accuracy, imposes a substantial computational
burden. This makes it less ideal for deployment on edge devices or platforms with limited processing
capabilities, where maintaining a balance between speed and power efficiency is crucial [279, 194]. Efforts to
streamline the model for such applications without significant loss of performance are necessary.
• Looking forward, there is significant potential in expanding YOLOv7’s capabilities through the integration of
semi-supervised or unsupervised learning paradigms. This would enable the model to leverage unlabeled data
effectively, a common challenge in the real-world where annotated datasets are often scarce or expensive to
produce. Additionally, enhancing the model’s resilience to adversarial attacks and variability in data quality
could further solidify its utility in security-sensitive applications like surveillance and fraud detection.
YOLOv6:
• One of the notable challenges with YOLOv6 is its handling of scale variability within images, which can affect
its efficacy in environments where objects appear at diverse distances from the camera. While YOLOv6 shows
improved accuracy and speed over its predecessors, it sometimes struggles with small or partially occluded
objects, which are common in crowded scenes or complex industrial environments [178, 280]. This limitation
can be critical in applications such as automated surveillance or advanced manufacturing monitoring.
• YOLOv6, while efficient, still requires considerable computational resources when compared to other models
optimized for edge devices. Its deployment in resource-constrained environments such as mobile or embedded
systems often requires a trade-off between detection performance and operational efficiency. Further optimiza-
tions and model pruning are necessary to achieve the best of both worlds—real-time performance with reduced
computational demands.
• Future enhancements for YOLOv6 could focus on incorporating more advanced feature extraction techniques
that improve its robustness to variations in object appearance and environmental conditions. Additionally,
integrating more adaptive and context-aware learning mechanisms could help overcome some of the challenges
related to background clutter and similar adversities. Enhancing the model’s capacity to learn from a limited
number of training samples, through techniques such as few-shot learning or transfer learning, could address
the scarcity of labeled training data in specialized applications.
YOLOv5:
• YOLOv5 has made significant strides in improving detection speed and accuracy, but it faces challenges
in consistently detecting small objects due to its spatial resolution constraints. This is particularly evident
in fields like medical imaging or satellite image analysis, where precision is crucial for identifying fine
details. Techniques such as spatial pyramid pooling or enhanced up-sampling may be needed to increase the
receptive field and improve the detection of smaller objects without compromising the model’s efficiency
[140, 281, 282]; a minimal sketch of such a pooling block follows this list.
• While YOLOv5 offers faster training and inference times compared to previous versions, its deployment
on edge devices is limited by high memory and processing requirements [283, 146]. Although optimized
models like YOLOv5s provide a solution, they sometimes do so at the cost of detection accuracy. Optimizing
network architecture through neural architecture search (NAS) could potentially offer a more balanced solution,
enhancing both performance and efficiency for real-time object detection applications.
• The adaptability of YOLOv5 to varied environmental conditions and different data distributions remains an area for development. Future research could focus on enhancing the robustness of YOLOv5 through advanced data augmentation techniques and domain adaptation strategies (an illustrative augmentation pipeline is sketched after this list). This would enable the model to maintain high accuracy across diverse application settings, from urban surveillance to complex natural environments, effectively handling variations in lighting, weather, and seasonal changes.
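To ground the spatial-pyramid-pooling suggestion in the first bullet above, here is a compact SPP block in PyTorch following the pool-and-concatenate pattern used by several YOLO variants; the channel split and kernel sizes are illustrative choices, not the exact YOLOv5 configuration.

```python
# Spatial Pyramid Pooling (SPP) block: parallel max-pooling at several kernel
# sizes enlarges the effective receptive field without shrinking the feature map.
import torch
import torch.nn as nn

class SPP(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernels=(5, 9, 13)):
        super().__init__()
        hidden = in_ch // 2
        self.reduce = nn.Conv2d(in_ch, hidden, kernel_size=1)
        # stride=1 with padding k//2 keeps the spatial size for odd kernels
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernels
        )
        # identity branch plus one branch per pooling kernel are concatenated
        self.project = nn.Conv2d(hidden * (len(kernels) + 1), out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.reduce(x)
        return self.project(torch.cat([x] + [p(x) for p in self.pools], dim=1))

feat = torch.randn(1, 512, 20, 20)     # e.g. a deep backbone feature map
print(SPP(512, 512)(feat).shape)       # torch.Size([1, 512, 20, 20])
```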
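Likewise, for the augmentation direction in the last bullet, the sketch below assembles a detection-safe pipeline with the Albumentations library, simulating lighting and weather variation; the specific transforms and probabilities are illustrative assumptions.

```python
# Illustrative detection-time augmentation pipeline (Albumentations). The
# bbox_params setting keeps YOLO-format boxes consistent with the image.
import albumentations as A

train_transform = A.Compose(
    [
        A.RandomBrightnessContrast(p=0.5),  # lighting changes
        A.HueSaturationValue(p=0.3),        # seasonal color shifts
        A.RandomFog(p=0.1),                 # weather effects
        A.RandomRain(p=0.1),
        A.HorizontalFlip(p=0.5),
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Usage: out = train_transform(image=img, bboxes=boxes, class_labels=labels)
```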
YOLOv4, YOLOv3, YOLOv2 and YOLOv1:
• YOLOv4 brought significant improvements in speed and accuracy, but its performance remains inconsistent across datasets, especially under class imbalance and in rare-object recognition. Its computational demand limits practical deployment on low-power devices. Efforts to enhance model compression and environmental adaptability could further broaden its utility in real-world applications.
• YOLOv3 improved the balance of speed and accuracy, yet it struggles with small object detection due to its grid limitation. Its computational demands pose challenges for deployment in resource-constrained environments, prompting research into optimization techniques that improve efficiency without sacrificing performance. Additionally, enhancing the model’s robustness to environmental variations could improve its reliability for applications like autonomous driving and urban surveillance.
• Despite the incremental improvements introduced in YOLOv2, it faces challenges in detecting small objects,
balancing speed with accuracy, and maintaining relevance with the advent of more capable successors. This
version’s reliance on a fixed grid system hampers its ability to perform in high-precision detection tasks. Future
developments may shift towards adapting YOLOv2’s core strengths in new architectures that enhance its
spatial resolution and dynamic scaling capabilities.
For YOLO versions prior to YOLOv5, usage may decline and eventually cease as newer releases supersede them in overall performance and efficiency.
• The potential for YOLOv4, YOLOv3, and YOLOv2 in future research involves exploring adaptive mechanisms
that can tailor learning rates and augment data to better handle diverse operational scenarios. Integrating these
models with newer technologies like model pruning and feature fusion may address existing inefficiencies and
extend their applicability to a wider range of applications.
• YOLOv1 was revolutionary for its time, introducing real-time object detection by processing the entire image at once as a single regression problem. However, it faces significant challenges with small objects because each grid cell predicts only two bounding boxes and a single set of class probabilities (the arithmetic sketch after this list quantifies this constraint). This structure often leads to poor performance on groups of small objects that are close together, such as flocks of birds or distant vehicles in traffic scenes. Improvements in subsequent models focus on increasing the number of predictions per grid cell and incorporating finer-grained feature maps to enhance small object detection.
• Another limitation of YOLOv1 is the spatial constraints of its bounding boxes. Since each cell in the grid can
only predict two boxes and has limited context about its neighboring cells, the precision in localizing objects,
especially those with complex or irregular shapes, is often compromised. This challenge is particularly evident
in medical imaging and satellite image analysis, where the exact contours of the objects are crucial. Advances
in convolutional neural network designs and cross-layer feature integration in later versions seek to address
these drawbacks.
• Despite the foundational advancements introduced by YOLOv1, its direct application has waned over the years, superseded by more robust iterations like YOLOv2 and YOLOv3. These later versions build upon the core principles of YOLOv1 but offer refined mechanisms for handling varied object sizes and aspect ratios. Future research directions are less likely to focus on YOLOv1 itself but may explore its integration into hybrid models or specialized adaptations that can leverage its speed for real-time applications where latency is critical, albeit with trade-offs in detection accuracy and granularity.
• Future iterations could focus on dynamic grid systems, lighter network architectures, and advanced scaling
features to tackle the challenges of small object detection and computational limitations. These improvements
could enhance their deployment in emerging areas such as edge computing, where real-time processing and
low power consumption are crucial.
• As newer models like YOLOv8 and YOLOv9 continue to evolve, the foundational aspects of YOLOv4,
YOLOv3, and YOLOv2 can still offer valuable insights for developing hybrid models or specialized appli-
cations. Research may increasingly focus on leveraging these older versions for their speed attributes while
compensating for their detection limitations through composite and hybrid modeling approaches.
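The YOLOv1 capacity constraint noted in the bullets above can be stated numerically: with an S×S grid, B boxes per cell, and C classes, the prediction tensor holds S×S×(B·5+C) values, so a cell covering several small objects still emits at most B boxes. A short check under the paper’s published settings (S=7, B=2, C=20):

```python
# YOLOv1 output-tensor arithmetic (settings from the original paper): each
# grid cell predicts B boxes (x, y, w, h, confidence) plus one shared class
# distribution, capping the number of detectable objects per image.
S, B, C = 7, 2, 20
per_cell = B * 5 + C       # 30 values per grid cell
total = S * S * per_cell   # 7 * 7 * 30 = 1470 outputs per image
max_boxes = S * S * B      # at most 98 boxes, however crowded the scene
print(per_cell, total, max_boxes)  # 30 1470 98
```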
Over the past decade, the You Only Look Once (YOLO) family of models has significantly impacted various sectors, demonstrating the power of deep learning in real-world applications. As a pioneering object detection algorithm, YOLO has facilitated rapid advancements across diverse fields by offering high-speed, real-time detection
with commendable accuracy. One of the most notable applications has been in public safety and surveillance, where
YOLO models have improved the efficacy of monitoring systems, enhancing the detection of suspicious activities and
ensuring public safety more efficiently. In the realm of automotive technology, YOLO has been integral in developing
advanced driver-assistance systems (ADAS), contributing to object detection that supports collision avoidance systems
and pedestrian safety. Furthermore, YOLO has transformed the healthcare sector by accelerating medical image analysis,
enabling quicker and more accurate detection of pathologies, which is critical for diagnostics and treatment planning.
In industrial settings, YOLO has optimized quality control processes by identifying defects in manufacturing lines in
real-time, thereby reducing waste and increasing production efficiency. Additionally, in the retail sector, YOLO has
supported inventory management through automated checkouts and stock monitoring, enhancing customer experience
and operational efficiency.
could transform security mechanisms, making them more interactive and responsive. In the healthcare sector, the
incorporation of medical imaging with historical patient data and live symptom descriptions could significantly improve
the personalization and accuracy of medical responses.
Looking further, YOLO’s potential to adapt to such multimodal advancements will be instrumental in pioneering the
next wave of intelligent applications. From autonomous vehicles that interpret both road signs and pedestrian gestures
to smart homes that react to visual cues and voice instructions, the integration of YOLO with a broader spectrum of data
types and deeper contextual understanding heralds a new era in artificial intelligence. This transformative phase promises to significantly improve the interactivity and cognitive capabilities of machine vision systems, marking a pivotal shift in visual process automation.
AGI refers to an intelligent agent with human-level or higher intelligence, capable of solving a variety of complex
problems in diverse domains [284, 285]. YOLO, a specialized AI focused on object detection, exemplifies the ability to process and interpret visual data, making visual perception a key building block of AGI. An AGI system would need to combine
object detection, similar to YOLO, with other cognitive capabilities, such as natural language understanding and
reasoning, to perform a wide range of tasks in real time. For example, a robot equipped with AGI could use YOLO for
visual recognition to navigate and interact with its environment while simultaneously using natural language models to
understand and respond to verbal instructions. This integration would demonstrate a level of versatility and general
intelligence akin to human capabilities, allowing the AGI system to seamlessly perform complex and diverse tasks, thus
moving closer to the achievement of true AGI.
This generation of neural networks has amazed us with its advanced vision and language capabilities, pushing the
boundaries of what AI can perceive and interpret. The next wave of neural networks, however, will be defined by their
ability to not only understand but also to act and execute tasks in real-time. YOLO is poised to be a key player in this
transformation. Its unparalleled speed and accuracy in object detection make it the ideal candidate for applications
requiring immediate action, such as autonomous driving, robotics, and real-time surveillance. As we move towards
a future where AI not only sees and speaks but also performs complex tasks autonomously, YOLO’s role will be
instrumental in bridging the gap between perception and action. One such project is BEHAVIOR, a human-centered simulation benchmark developed at Stanford University to evaluate embodied AI solutions [286, 287].
Embodied Artificial Intelligence (EAI) refers to AI systems that are integrated with physical entities or bodies, allowing
them to interact with the real world in a more natural and human-like manner [288]. Integrating YOLO into these
systems can significantly enhance their sensory capabilities, enabling more efficient and accurate interaction with the
physical world. Applications of YOLO in embodied AI include autonomous vehicles and robots [289], human-robot interaction [290], healthcare [291], and others [287].
The deployment of YOLO on edge devices unlocks several promising avenues for future research and development. One
potential direction involves enhancing the algorithm’s efficiency and accuracy for even more constrained environments, such as ultra-low-power microcontrollers and embedded systems. This can be achieved through further optimization techniques, including model pruning, quantization, and the development of specialized hardware accelerators (a minimal pruning-and-quantization sketch follows this paragraph). Additionally, integrating YOLO with advanced communication protocols and edge computing frameworks could facilitate more
seamless collaboration between edge devices and centralized cloud services, enhancing the overall system performance
and scalability. Exploring the integration of YOLO with other AI-driven functionalities, such as anomaly detection and
predictive analytics, may unlock new applications in areas like healthcare, smart cities, and industrial automation. As
edge computing continues to evolve, adapting YOLO to support federated learning paradigms could preserve data privacy while enabling continuous learning and improvement of object detection models. These future directions
will not only expand the capabilities of YOLO but also contribute significantly to the advancement of intelligent edge
computing systems.
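As one concrete instance of the pruning and quantization techniques mentioned above, the sketch below applies standard PyTorch utilities; the sparsity level and targeted module types are illustrative assumptions, not a validated edge-deployment recipe (dynamic quantization targets linear layers, while convolutional layers would need static quantization with calibration).

```python
# Two standard model-compression steps for edge deployment (illustrative).
import torch
import torch.nn.utils.prune as prune

def prune_conv_layers(model: torch.nn.Module, amount: float = 0.3):
    """Zero the smallest-magnitude 30% of weights in every conv layer."""
    for module in model.modules():
        if isinstance(module, torch.nn.Conv2d):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")  # bake the sparsity mask in

def quantize_linear_layers(model: torch.nn.Module) -> torch.nn.Module:
    """Post-training dynamic quantization of linear layers to int8."""
    return torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )
```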
Threat: Relying on a single summary statistic to measure YOLO detection capability may not fully reflect the performance of systems across the varied YOLO applications, necessitating the use of several complementary metrics (illustrated in the sketch below).
Mitigation: Despite this limitation, our main premise is that the selected metrics enable us to compare various YOLO
systems and adequately assess their overall effectiveness. Recognizing the inherent limitations of statistical summaries
is crucial when conducting a comprehensive evaluation of detection systems across different applications. Therefore,
we aim to improve the clarity and reliability of our review by openly acknowledging these potential threats to construct
validity. This approach provides a more nuanced understanding of the limitations associated with various aspects of
YOLO techniques for object detection in diverse domains.
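To illustrate why several complementary metrics are needed, the short sketch below computes IoU, precision, recall, and F1 from raw detection counts; the counts and boxes are made-up numbers for illustration only.

```python
# Complementary detection metrics from raw counts (made-up numbers): a
# detector can look strong on one metric while failing on another.
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

tp, fp, fn = 80, 10, 30        # hypothetical counts for one class
precision = tp / (tp + fp)     # 0.89: few false alarms...
recall = tp / (tp + fn)        # 0.73: ...but many missed objects
f1 = 2 * precision * recall / (precision + recall)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f} "
      f"IoU={iou((0, 0, 10, 10), (5, 5, 15, 15)):.2f}")
```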
Training and retraining YOLO models is highly resource-intensive, consuming substantial energy and water and producing significant carbon dioxide emissions. This environmental impact underscores concerns about the sustainability
of AI development, emphasizing the urgent need for more efficient practices to reduce the ecological footprint of
large-scale model training [292, 293].
7 Conclusion
In this comprehensive review, we explored the evolution of the YOLO models from the most recent YOLOv10
to the inaugural YOLOv1. This retrospective analysis covered a decade of advancements, highlighting the pivotal
improvements in each version and their respective impacts across five critical application areas: public safety, automotive
technology, healthcare, industrial manufacturing, and retail. Our review outlined the significant enhancements in
detection speed, accuracy, and computational efficiency that each iteration brought, while also addressing the specific
challenges and limitations faced by earlier versions. Furthermore, we have identified gaps in the current capabilities of
YOLO models and proposed potential directions for future research. Predicting the trajectory of YOLO’s development,
we anticipate a shift towards multimodal data processing, leveraging advancements in large language models and
natural language processing to enhance object detection systems. This fusion is expected to broaden the utility of
YOLO models, enabling more sophisticated, context-aware applications that could revolutionize the interaction between AI systems and their environments. Thus, this review not only serves as a detailed chronicle of YOLO’s evolution but also
sets a prospective blueprint for its integration into the next generation of technological innovations.
8 Author Contributions
Ranjan Sapkota: principal conceptualizer, idea generation, research design, formal analysis, original draft preparation,
manuscript writing, and editing. Rizwan Qureshi, Marco Flores-Calero, Chetan Badgujar, Upesh Nepal, Alwin
Poulose, Peter Zeno, Uday Bhanu Prakash Vaddevolu, Sheheryar Khan, Maged Shoman, Hong Yan, Manoj
Karkee: methodology refinement, critical revisions, manuscript review, and editing.
References
[1] Li Liu, Wanli Ouyang, Xiaogang Wang, Paul Fieguth, Jie Chen, Xinwang Liu, and Matti Pietikäinen. Deep
learning for generic object detection: A survey. International journal of computer vision, 128:261–318, 2020.
[2] Ramon A Suarez Fernandez, Jose Luis Sanchez-Lopez, Carlos Sampedro, Hriday Bavle, Martin Molina, and
Pascual Campoy. Natural user interfaces for human-drone multi-modal interaction. In 2016 International
Conference on Unmanned Aircraft Systems (ICUAS), pages 1013–1022. IEEE, 2016.
[3] Robert J Wang, Xiang Li, and Charles X Ling. Pelee: A real-time object detection system on mobile devices.
Advances in neural information processing systems, 31, 2018.
[4] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with
region proposal networks. Advances in neural information processing systems, 28, 2015.
[5] Daniel Flippo, Sujith Gunturu, Carolyn Baldwin, and Chetan Badgujar. Tree Trunk Detection of Eastern Red
Cedar in Rangeland Environment with Deep Learning Technique. Croatian journal of forest engineering,
44(2):357–368, 2023.
[6] Juan Guerrero-Ibáñez, Sherali Zeadally, and Juan Contreras-Castillo. Sensor technologies for intelligent
transportation systems. Sensors, 18(4):1212, 2018.
[7] Maged Shoman, Dongdong Wang, Armstrong Aboah, and Mohamed Abdel-Aty. Enhancing traffic safety with
parallel dense video captioning for end-to-end event analysis. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR) Workshops, pages 7125–7133, June 2024.
[8] Rasheed Hussain and Sherali Zeadally. Autonomous cars: Research results, issues, and future challenges. IEEE
Communications Surveys & Tutorials, 21(2):1275–1313, 2018.
[9] Maged Shoman, Gabriel Lanzaro, Tarek Sayed, and Suliman Gargoum. Autonomous vehicle–pedestrian
interaction modeling platform: A case study in four major cities. Journal of Transportation Engineering, Part A:
Systems, 150(9):04024045, 2024.
[10] Manisha Kaushal, Baljit S Khehra, and Akashdeep Sharma. Soft computing based object detection and tracking
approaches: State-of-the-art survey. Applied Soft Computing, 70:423–464, 2018.
[11] Saad M Khan and Mubarak Shah. Tracking multiple occluding people by localizing on multiple scene planes.
IEEE transactions on pattern analysis and machine intelligence, 31(3):505–519, 2008.
[12] Tanzim Mostafa, Sartaj Jamal Chowdhury, Md Khalilur Rhaman, and Md Golam Rabiul Alam. Occluded
object detection for autonomous vehicles employing yolov5, yolox and faster r-cnn. In 2022 IEEE 13th Annual
Information Technology, Electronics and Mobile Communication Conference (IEMCON), pages 0405–0410.
IEEE, 2022.
[13] Abhishek Gupta, Alagan Anpalagan, Ling Guan, and Ahmed Shaharyar Khwaja. Deep learning for object
detection and scene perception in self-driving cars: Survey, challenges, and open issues. Array, 10:100057, 2021.
[14] Zhengxia Zou, Keyan Chen, Zhenwei Shi, Yuhong Guo, and Jieping Ye. Object detection in 20 years: A survey.
Proceedings of the IEEE, 111(3):257–276, 2023.
[15] Shuai Liu, Dongye Liu, Gautam Srivastava, Dawid Połap, and Marcin Woźniak. Overview and methods of
correlation filter algorithms in object tracking. Complex & Intelligent Systems, 7:1895–1917, 2021.
[16] Michael Teutsch and Wolfgang Kruger. Robust and fast detection of moving vehicles in aerial videos using
sliding windows. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops,
pages 26–34, 2015.
[17] Qian Li, Usman Niaz, and Bernard Merialdo. An improved algorithm on viola-jones object detector. In 2012
10th International Workshop on Content-Based Multimedia Indexing (CBMI), pages 1–6. IEEE, 2012.
[18] Xiao-dong Hu, Xin-qing Wang, Fan-jie Meng, Xia Hua, Yu-ji Yan, Yu-yang Li, Jing Huang, and Xun-lin Jiang.
Gabor-CNN for object detection based on small samples. Defence Technology, 16(6):1116–1129, 2020.
[19] Thattapon Surasak, Ito Takahiro, Cheng-hsuan Cheng, Chi-en Wang, and Pao-you Sheng. Histogram of oriented
gradients for human detection in video. In 2018 5th International conference on business and industrial research
(ICBIR), pages 172–176. IEEE, 2018.
[20] Mohd Safirin Karis, Nur Rafiqah Abdul Razif, Nursabillilah Mohd Ali, M Asyraf Rosli, Mohd Shahrieel Mohd
Aras, and Mariam Md Ghazaly. Local binary pattern (lbp) with application to variant object detection: A survey
and method. In 2016 IEEE 12th international colloquium on signal processing & its applications (CSPA), pages
221–226. IEEE, 2016.
[21] Takeshi Mita, Toshimitsu Kaneko, and Osamu Hori. Joint haar-like features for face detection. In Tenth IEEE
International Conference on Computer Vision (ICCV’05) Volume 1, volume 2, pages 1619–1626. IEEE, 2005.
[22] Huan-Jung Chiu, Tzuu-Hseng S Li, and Ping-Huan Kuo. Breast cancer–detection system using pca, multilayer
perceptron, transfer learning, and support vector machine. IEEE Access, 8:204309–204324, 2020.
[23] Ibomoiye Domor Mienye and Yanxia Sun. A survey of ensemble learning: Concepts, algorithms, applications, and prospects. IEEE Access, 10:99129–99149, 2022.
[24] Yu Xiang, Roozbeh Mottaghi, and Silvio Savarese. Beyond pascal: A benchmark for 3d object detection in the
wild. In IEEE winter conference on applications of computer vision, pages 75–82. IEEE, 2014.
[25] Xingxing Xie, Gong Cheng, Jiabao Wang, Xiwen Yao, and Junwei Han. Oriented r-cnn for object detection. In
Proceedings of the IEEE/CVF international conference on computer vision, pages 3520–3529, 2021.
[26] Shijian Tang and Ye Yuan. Object detection based on convolutional neural network. In International Conference-
IEEE–2016, 2015.
[27] Wang Zhiqiang and Liu Jun. A review of object detection based on convolutional neural network. In 2017 36th
Chinese control conference (CCC), pages 11104–11109. IEEE, 2017.
[28] Xuelong Li, Dawei Song, and Yongsheng Dong. Hierarchical feature fusion network for salient object detection.
IEEE Transactions on Image Processing, 29:9165–9175, 2020.
[29] Eric Crawford and Joelle Pineau. Spatially invariant unsupervised object detection with convolutional neural
networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3412–3420, 2019.
[30] Mingxing Tan, Ruoming Pang, and Quoc V Le. Efficientdet: Scalable and efficient object detection. In
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10781–10790,
2020.
[31] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko.
End-to-end object detection with transformers. In European conference on computer vision, pages 213–229.
Springer, 2020.
[32] Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for
object detection. In European conference on computer vision, pages 280–296. Springer, 2022.
[33] Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik, and Ezgi Mercan. R-cnn for object detection. In
IEEE Conference, 2014.
[34] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages
1440–1448, 2015.
[35] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object
detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788,
2016.
[36] Chetan M Badgujar, Alwin Poulose, and Hao Gan. Agricultural object detection with you look only once (yolo)
algorithm: A bibliometric and systematic literature review. arXiv preprint arXiv:2401.10379, 2024.
[37] Jiawei Li, Yongliang Qiao, Sha Liu, Jiaheng Zhang, Zhenchao Yang, and Meili Wang. An improved yolov5-based
vegetable disease detection method. Computers and Electronics in Agriculture, 202:107345, 2022.
[38] Longsheng Fu, Yali Feng, Jingzhu Wu, Zhihao Liu, Fangfang Gao, Yaqoob Majeed, Ahmad Al-Mallahi, Qin
Zhang, Rui Li, and Yongjie Cui. Fast and accurate detection of kiwifruit in orchard using improved yolov3-tiny
model. Precision Agriculture, 22:754–776, 2021.
[39] Yuanhong Zhong, Junyuan Gao, Qilun Lei, and Yao Zhou. A vision-based counting and recognition system for
flying insects in intelligent agriculture. Sensors, 18(5):1489, 2018.
[40] Yifan Wang, Lin Yang, Hong Chen, Aamir Hussain, Congcong Ma, and Malek Al-gabri. Mushroom-yolo: A
deep learning algorithm for mushroom growth recognition based on improved yolov5 in agriculture 4.0. In 2022
IEEE 20th International Conference on Industrial Informatics (INDIN), pages 239–244. IEEE, 2022.
[41] Kailin Jiang, Tianyu Xie, Rui Yan, Xi Wen, Danyang Li, Hongbo Jiang, Ning Jiang, Ling Feng, Xuliang Duan,
and Jianjun Wang. An attention mechanism-improved yolov7 object detection algorithm for hemp duck count
estimation. Agriculture, 12(10):1659, 2022.
[42] Guojun Chen, Yongjie Hou, Tao Cui, Huihui Li, Fengyang Shangguan, and Lei Cao. Yolov8-cml: A lightweight
target detection method for color-changing melon ripening in intelligent agriculture. ResearchSquare, 2023.
[43] Xun Yu, Dameng Yin, Honggen Xu, Francisco Pinto Espinosa, Urs Schmidhalter, Chenwei Nie, Yi Bai, Sindhuja
Sankaran, Bo Ming, Ningbo Cui, et al. Maize tassel number and tasseling stage monitoring based on near-ground
and uav rgb images by improved yolov8. Precision Agriculture, pages 1–39, 2024.
[44] Liangquan Jia, Tao Wang, Yi Chen, Ying Zang, Xiangge Li, Haojie Shi, and Lu Gao. Mobilenet-ca-yolo:
An improved yolov7 based on the mobilenetv3 and attention mechanism for rice pests and diseases detection.
Agriculture, 13(7):1285, 2023.
[45] Muhammad Umar, Saud Altaf, Shafiq Ahmad, Haitham Mahmoud, Adamali Shah Noor Mohamed, and Rashid
Ayub. Precision agriculture through deep learning: Tomato plant multiple diseases recognition with cnn and
improved yolov7. IEEE Access, 2024.
[46] Ranjan Sapkota, Dawood Ahmed, and Manoj Karkee. Comparing yolov8 and mask r-cnn for instance segmenta-
tion in complex orchard environments. Artificial Intelligence in Agriculture, 2024.
[47] Govind S Patel, Ashish A Desai, Yogesh Y Kamble, Ganesh V Pujari, Priyanka A Chougule, and Varsha A
Jujare. Identification and separation of medicine through robot using yolo and cnn algorithms for healthcare.
In 2023 International Conference on Artificial Intelligence for Innovations in Healthcare Industries (ICAIIHI),
volume 1, pages 1–5. IEEE, 2023.
[48] Yu Luo, Yifan Zhang, Xize Sun, Hengwei Dai, Xiaohui Chen, et al. Intelligent solutions in chest abnormality
detection based on yolov5 and resnet50. Journal of healthcare engineering, 2021, 2021.
[49] Alejandro Salinas-Medina and Antonio Neme. Enhancing hospital efficiency through web-deployed object
detection: A yolov8-based approach for automating healthcare operations. In 2023 Mexican International
Conference on Computer Science (ENC), pages 1–6. IEEE, 2023.
[50] Miguel A Arroyo, M Tarek Ibn Ziad, Hidenori Kobayashi, Junfeng Yang, and Simha Sethumadhavan. Yolo:
frequently resetting cyber-physical systems for security. In Autonomous Systems: Sensors, Processing, and
Security for Vehicles and Infrastructure 2019, volume 11009, pages 166–183. SPIE, 2019.
[51] Nipunjita Bordoloi, Anjan Kumar Talukdar, and Kandarpa Kumar Sarma. Suspicious activity detection from
videos using yolov3. In 2020 IEEE 17th India Council International Conference (INDICON), pages 1–5. IEEE,
2020.
[52] Dinh-Lam Pham, Tai-Woo Chang, et al. A yolo-based real-time packaging defect detection system. Procedia
Computer Science, 217:886–894, 2023.
[53] Jaromír Klarák, Robert Andok, Peter Malík, Ivan Kuric, Mário Ritomskỳ, Ivana Klačková, and Hung-Yin Tsai.
From anomaly detection to defect classification. Sensors, 24(2):429, 2024.
[54] Oluibukun Gbenga Ajayi, John Ashi, and Blessed Guda. Performance evaluation of yolo v5 model for automatic
crop and weed classification on uav images. Smart Agricultural Technology, 5:100231, 2023.
[55] Achyut Morbekar, Ashi Parihar, and Rashmi Jadhav. Crop disease detection using yolo. In 2020 international
conference for emerging technology (INCET), pages 1–5. IEEE, 2020.
[56] Dawei Li, Foysal Ahmed, Nailong Wu, and Arlin I Sethi. Yolo-jd: A deep learning network for jute diseases and
pests detection from images. Plants, 11(7):937, 2022.
[57] Shashidhar Cheeti, GAE Satish Kumar, J Swetha Priyanka, Ghazala Firdous, and Pogaku Rani Ranjeeva. Pest
detection and classification using yolo and cnn. Annals of the Romanian Society for Cell Biology, pages
15295–15300, 2021.
[58] Minh-Tan Pham, Luc Courtrai, Chloé Friguet, Sébastien Lefèvre, and Alexandre Baussard. Yolo-fine: One-stage
detector of small objects under various backgrounds in remote sensing images. Remote Sensing, 12(15):2501,
2020.
[59] Libo Cheng, Jia Li, Ping Duan, and Mingguo Wang. A small attentional yolo model for landslide detection from
satellite remote sensing images. Landslides, 18(8):2751–2765, 2021.
[60] Chunling Chen, Ziyue Zheng, Tongyu Xu, Shuang Guo, Shuai Feng, Weixiang Yao, and Yubin Lan. Yolo-based
uav technology: A review of the research and its applications. Drones, 7(3):190, 2023.
[61] Xudong Luo, Yiquan Wu, and Langyue Zhao. Yolod: A target detection method for uav aerial imagery. Remote
Sensing, 14(14):3240, 2022.
[62] Francesco Prinzi, Marco Insalaco, Alessia Orlando, Salvatore Gaglio, and Salvatore Vitabile. A yolo-based
model for breast cancer detection in mammograms. Cognitive Computation, 16(1):107–120, 2024.
[63] Ghada Hamed Aly, Mohammed Marey, Safaa Amin El-Sayed, and Mohamed Fahmy Tolba. Yolo based breast
masses detection and classification in full-field digital mammograms. Computer methods and programs in
biomedicine, 200:105823, 2021.
[64] Halil Murat Ünver and Enes Ayan. Skin lesion segmentation in dermoscopic images with combination of yolo
and grabcut algorithm. Diagnostics, 9(3):72, 2019.
[65] Lu Tan, Tianran Huangfu, Liyao Wu, and Wenying Chen. Comparison of retinanet, ssd, and yolo v3 for real-time
pill identification. BMC medical informatics and decision making, 21:1–11, 2021.
[66] Ureerat Suksawatchon, Supawadee Srikamdee, Jakkarin Suksawatchon, and Worawit Werapan. Shape recognition
using unconstrained pill images based on deep convolution network. In 2022 6th International Conference on
Information Technology (InCIT), pages 309–313. IEEE, 2022.
[67] Asmita Gorave, Srinibas Misra, Omkar Padir, Anirudha Patil, and Kshitij Ladole. Suspicious activity detection
using live video analysis. In Proceeding of International Conference on Computational Science and Applications:
ICCSA 2019, pages 203–214. Springer, 2020.
[68] Rupali Kolpe, Shubham Ghogare, MA Jawale, P William, and AB Pawar. Identification of face mask and social
distancing using yolo algorithm based on machine learning approach. In 2022 6th International conference on
intelligent computing and control systems (ICICCS), pages 1399–1403. IEEE, 2022.
[69] Saba Bashir, Rizwan Qureshi, Abbas Shah, Xinqi Fan, and Tanvir Alam. Yolov5-m: A deep neural network
for medical object detection in real-time. In 2023 IEEE Symposium on Industrial Electronics & Applications
(ISIEA), pages 1–6. IEEE, 2023.
[70] Rui Li and Jun Yang. Improved yolov2 object detection model. In 2018 6th international conference on
multimedia computing and systems (ICMCS), pages 1–6. IEEE, 2018.
[71] Hiroki Nakahara, Haruyoshi Yonekawa, Tomoya Fujii, and Shimpei Sato. A lightweight yolov2: A binarized
cnn with a parallel support vector regression for an fpga. In Proceedings of the 2018 ACM/SIGDA International
Symposium on field-programmable gate arrays, pages 31–40, 2018.
[72] Kwang-Ju Kim, Pyong-Kun Kim, Yun-Su Chung, and Doo-Hyun Choi. Performance enhancement of yolov3 by
adding prediction layers with spatial pyramid pooling for vehicle detection. In 2018 15th IEEE international
conference on advanced video and signal based surveillance (AVSS), pages 1–6. IEEE, 2018.
[73] Upesh Nepal and Hossein Eslamiat. Comparing yolov3, yolov4 and yolov5 for autonomous landing spot
detection in faulty uavs. Sensors, 22(2):464, 2022.
[74] Marco Sozzi, Silvia Cantalamessa, Alessia Cogato, Ahmed Kayad, and Francesco Marinello. Automatic bunch
detection in white grape varieties using yolov3, yolov4, and yolov5 deep learning algorithms. Agronomy,
12(2):319, 2022.
[75] Nikita Mohod, Prateek Agrawal, and Vishu Madaan. Yolov4 vs yolov5: Object detection on surveillance videos.
In International Conference on Advanced Network Technologies and Intelligent Computing, pages 654–665.
Springer, 2022.
[76] Ao Wang, Hui Chen, Lihao Liu, Kai Chen, Zijia Lin, Jungong Han, and Guiguang Ding. Yolov10: Real-time
end-to-end object detection. arXiv preprint arXiv:2405.14458, 2024.
[77] Chien-Yao Wang, I-Hau Yeh, and Hong-Yuan Mark Liao. Yolov9: Learning what you want to learn using
programmable gradient information. arXiv preprint arXiv:2402.13616, 2024.
[78] Muhammad Hussain. Yolov1 to v8: Unveiling each variant–a comprehensive review of yolo. IEEE Access,
12:42816–42833, 2024.
[79] Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. Yolov7: Trainable bag-of-freebies sets new
state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF conference on computer vision
and pattern recognition, pages 7464–7475, 2023.
[80] Chuyi Li, Lulu Li, Hongliang Jiang, Kaiheng Weng, Yifei Geng, Liang Li, Zaidan Ke, Qingyuan Li, Meng
Cheng, Weiqiang Nie, et al. Yolov6: A single-stage object detection framework for industrial applications. arXiv
preprint arXiv:2209.02976, 2022.
[81] Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 7263–7271, 2017.
[82] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[83] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. Yolov4: Optimal speed and accuracy of
object detection. arXiv preprint arXiv:2004.10934, 2020.
[84] Ultralytics. Home — docs.ultralytics.com. https://docs.ultralytics.com/. [Accessed 28-05-2024].
[85] Ultralytics. Comprehensive Guide to Ultralytics YOLOv5 — docs.ultralytics.com. https://docs.ultralytics.com/yolov5/. [Accessed 28-05-2024].
[86] GitHub - ultralytics/yolov5: YOLOv5 in PyTorch > ONNX > CoreML > TFLite — github.com. https://github.com/ultralytics/yolov5. [Accessed 28-05-2024].
[87] Huizi Mao, Xiaodong Yang, and William J Dally. A delay metric for video object detection: What average
precision fails to tell. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages
573–582, 2019.
[88] Bo Chen, Golnaz Ghiasi, Hanxiao Liu, Tsung-Yi Lin, Dmitry Kalenichenko, Hartwig Adam, and Quoc V Le.
Mnasfpn: Learning latency-aware pyramid architecture for object detection on mobile devices. In Proceedings
of the IEEE/CVF conference on computer vision and pattern recognition, pages 13607–13616, 2020.
[89] Daniel Pestana, Pedro R Miranda, João D Lopes, Rui P Duarte, Mário P Véstias, Horácio C Neto, and José T
De Sousa. A full featured configurable accelerator for object detection with yolo. IEEE Access, 9:75864–75877,
2021.
[90] Peng Zhou, Bingbing Ni, Cong Geng, Jianguo Hu, and Yi Xu. Scale-transferrable object detection. In proceedings
of the IEEE conference on computer vision and pattern recognition, pages 528–537, 2018.
[91] David Hall, Feras Dayoub, John Skinner, Haoyang Zhang, Dimity Miller, Peter Corke, Gustavo Carneiro, Anelia
Angelova, and Niko Sünderhauf. Probabilistic object detection: Definition and evaluation. In Proceedings of the
IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1031–1040, 2020.
[92] Cyril Goutte and Eric Gaussier. A probabilistic interpretation of precision, recall and f-score, with implication
for evaluation. In European conference on information retrieval, pages 345–359. Springer, 2005.
[93] Zhidong Liang, Zehan Zhang, Ming Zhang, Xian Zhao, and Shiliang Pu. Rangeioudet: Range image based
real-time 3d object detector optimized by intersection over union. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 7140–7149, 2021.
[94] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C
Berg. Ssd: Single shot multibox detector. In Computer Vision–ECCV 2016: 14th European Conference,
Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, pages 21–37. Springer, 2016.
[95] Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C Berg. Dssd: Deconvolutional single
shot detector. arXiv preprint arXiv:1701.06659, 2017.
[96] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z Li. Single-shot refinement neural network for
object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages
4203–4212, 2018.
[97] Lisha Cui, Rui Ma, Pei Lv, Xiaoheng Jiang, Zhimin Gao, Bing Zhou, and Mingliang Xu. Mdssd: multi-scale
deconvolutional single shot detector for small objects. arXiv preprint arXiv:1805.07009, 2018.
[98] Tianwei Lin, Xu Zhao, and Zheng Shou. Single shot temporal action detection. In Proceedings of the 25th ACM
international conference on Multimedia, pages 988–996, 2017.
[99] Jie Jiang, Hui Xu, Shichang Zhang, and Yujie Fang. Object detection algorithm based on multiheaded attention.
Applied Sciences, 9(9):1829, 2019.
[100] Xu Tang, Daniel K Du, Zeqiang He, and Jingtuo Liu. Pyramidbox: A context-assisted single shot face detector.
In Proceedings of the European conference on computer vision (ECCV), pages 797–813, 2018.
[101] Zuoxin Li, Lu Yang, and Fuqiang Zhou. Fssd: feature fusion single shot multibox detector. arXiv preprint
arXiv:1712.00960, 2017.
[102] Peiyuan Jiang, Daji Ergu, Fangyao Liu, Ying Cai, and Bo Ma. A review of yolo algorithm developments.
Procedia computer science, 199:1066–1073, 2022.
[103] Mohammed Gamal Ragab, Said Jadid Abdulkader, Amgad Muneer, Alawi Alqushaibi, Ebrahim Hamid Sumiea,
Rizwan Qureshi, Safwan Mahmood Al-Selwi, and Hitham Alhussian. A comprehensive systematic review of
yolo for medical object detection (2018 to 2023). IEEE Access, 2024.
[104] Juan Terven, Diana-Margarita Córdova-Esparza, and Julio-Alejandro Romero-González. A comprehensive
review of yolo architectures in computer vision: From yolov1 to yolov8 and yolo-nas. Machine Learning and
Knowledge Extraction, 5(4):1680–1716, 2023.
[105] Muhammad Hussain. Yolo-v1 to yolo-v8, the rise of yolo and its complementary nature toward digital manufac-
turing and industrial defect detection. Machines, 11(7):677, 2023.
[106] Rasmus Rothe, Matthieu Guillaumin, and Luc Van Gool. Non-maximum suppression for object detection by
passing messages between windows. In Computer Vision–ACCV 2014: 12th Asian Conference on Computer
Vision, Singapore, Singapore, November 1-5, 2014, Revised Selected Papers, Part I 12, pages 290–306. Springer,
2015.
[107] Shuai Li, Minghan Li, Ruihuang Li, Chenhang He, and Lei Zhang. One-to-few label assignment for end-to-end
dense detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages
7350–7359, 2023.
[108] Sandesh Bhagat, Manesh Kokare, Vineet Haswani, Praful Hambarde, and Ravi Kamble. Wheatnet-lite: A novel
light weight network for wheat head detection. In Proceedings of the IEEE/CVF international conference on
computer vision, pages 1332–1341, 2021.
[109] Yuting Hu, Wen Tan, Fanyang Meng, and Yongsheng Liang. A decoupled spatial-channel inverted bottleneck for
image compression. In 2023 IEEE International Conference on Image Processing (ICIP), pages 1740–1744.
IEEE, 2023.
[110] Guoliang Yang, Jixiang Wang, Ziling Nie, Hao Yang, and Shuaiying Yu. A lightweight yolov8 tomato detection
algorithm combining feature enhancement and attention. Agronomy, 13(7):1824, 2023.
[111] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and
C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th
European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755.
Springer, 2014.
[112] Glenn Jocher et al. Yolov8: A comprehensive improvement of the yolo object detection series. https://docs.ultralytics.com/yolov8/, 2022. Accessed: 2024-06-05.
[113] Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In 2015 ieee
information theory workshop (itw), pages 1–5. IEEE, 2015.
[114] Bowei Zhang, Jing Li, Yun Bai, Qing Jiang, Biao Yan, and Zhenhua Wang. An improved microaneurysm
detection model based on swinir and yolov8. Bioengineering, 10(12):1405, 2023.
[115] Chun-Tse Chien, Rui-Yang Ju, Kuang-Yi Chou, and Jen-Shiun Chiang. Yolov9 for fracture detection in pediatric
wrist trauma x-ray images. arXiv preprint arXiv:2403.11249, 2024.
[116] YOLOv8 Object Detection Model: What is, How to Use — roboflow.com. https://roboflow.com/model/yolov8. [Accessed 28-05-2024].
[117] Ultralytics. Ultralytics YOLOv8 Solutions: Quick Walkthrough — ultralytics.medium.com. https://ultralytics.medium.com/ultralytics-yolov8-solutions-quick-walkthrough-b802fd6da5d7. [Accessed 28-05-2024].
[118] Shuangjiang Du, Baofu Zhang, Pin Zhang, and Peng Xiang. An improved bounding box regression loss function
based on ciou loss for multi-scale object detection. In 2021 IEEE 2nd International Conference on Pattern
Recognition and Machine Learning (PRML), pages 92–98. IEEE, 2021.
[119] Shangliang Xu, Xinxin Wang, Wenyu Lv, Qinyao Chang, Cheng Cui, Kaipeng Deng, Guanzhong Wang, Qingqing
Dang, Shengyu Wei, Yuning Du, et al. Pp-yoloe: An evolved version of yolo. arXiv preprint arXiv:2203.16250,
2022.
[120] Xiang Yue, Kai Qi, Xinyi Na, Yang Zhang, Yanhua Liu, and Cuihong Liu. Improved yolov8-seg network for
instance segmentation of healthy and diseased tomato plants in the growth stage. Agriculture, 13(8):1643, 2023.
[121] Xingkui Zhu, Shuchang Lyu, Xu Wang, and Qi Zhao. Tph-yolov5: Improved yolov5 based on transformer
prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF international
conference on computer vision, pages 2778–2788, 2021.
[122] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional block attention
module. In Proceedings of the European conference on computer vision (ECCV), pages 3–19, 2018.
[123] Mazin Hnewa and Hayder Radha. Integrated multiscale domain adaptive yolo. IEEE Transactions on Image
Processing, 32:1857–1867, 2023.
[124] Zhen Bai, Xinbiao Pei, Zheng Qiao, Guangxin Wu, and Yue Bai. Improved yolov7 target detection algorithm
based on uav aerial photography. Drones, 8(3):104, 2024.
[125] Glenn Jocher and Ultralytics Team. Yolov5. https://github.com/ultralytics/yolov5, 2020. Accessed:
2024-06-05.
[126] Meituan Team. Yolov6: A single-stage object detection framework for industrial applications. https://github.com/meituan/YOLOv6, 2022. Accessed: 2024-06-05.
[127] Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. Yolov7: Trainable bag-of-freebies sets new
state-of-the-art for real-time object detectors. arXiv preprint arXiv:2207.02696, 2022. Accessed: 2024-06-05.
[128] U Sirisha, S Phani Praveen, Parvathaneni Naga Srinivasu, Paolo Barsocchi, and Akash Kumar Bhoi. Statistical
analysis of design aspects of various yolo-based deep learning models for object detection. International Journal
of Computational Intelligence Systems, 16(1):126, 2023.
[129] Kaixin Wang, Jun Hao Liew, Yingtian Zou, Daquan Zhou, and Jiashi Feng. Panet: Few-shot image semantic
segmentation with prototype alignment. In proceedings of the IEEE/CVF international conference on computer
vision, pages 9197–9206, 2019.
[130] Zixiao Zhang, Xiaoqiang Lu, Guojin Cao, Yuting Yang, Licheng Jiao, and Fang Liu. Vit-yolo: Transformer-based
yolo for object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages
2799–2808, 2021.
[131] Marsa Mahasin and Irma Amelia Dewi. Comparison of cspdarknet53, cspresnext-50, and efficientnet-b0
backbones on yolo v4 as object detector. International Journal of Engineering, Science and Information
Technology, 2(3):64–72, 2022.
[132] Diganta Misra. Mish: A self regularized non-monotonic activation function. arXiv preprint arXiv:1908.08681,
2019.
[133] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix:
Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF
international conference on computer vision, pages 6023–6032, 2019.
[134] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. Dropblock: A regularization method for convolutional networks.
Advances in neural information processing systems, 31, 2018.
[135] Rafael Müller, Simon Kornblith, and Geoffrey E Hinton. When does label smoothing help? Advances in neural
information processing systems, 32, 2019.
[136] Zhi Zhang, Tong He, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li. Bag of freebies for training object
detection neural networks. arXiv preprint arXiv:1902.04103, 2019.
[137] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object
detection. arXiv preprint arXiv:1506.02640, 2016. Accessed: 2024-06-05.
[138] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help
optimization? Advances in neural information processing systems, 31, 2018.
[139] Ajantha Vijayakumar and Subramaniyaswamy Vairavasundaram. Yolo-based object detection models: A review
and its applications. Multimedia Tools and Applications, pages 1–40, 2024.
[140] Aduen Benjumea, Izzeddin Teeti, Fabio Cuzzolin, and Andrew Bradley. Yolo-z: Improving small object detection
in yolov5 for autonomous vehicles. arXiv preprint arXiv:2112.11798, 2021.
[141] Abhishek Sarda, Shubhra Dixit, and Anupama Bhan. Object detection for autonomous driving using yolo [you
only look once] algorithm. In 2021 Third international conference on intelligent communication technologies
and virtual mobile networks (ICICV), pages 1370–1374. IEEE, 2021.
[142] Yingfeng Cai, Tianyu Luan, Hongbo Gao, Hai Wang, Long Chen, Yicheng Li, Miguel Angel Sotelo, and
Zhixiong Li. Yolov4-5d: An effective and efficient object detector for autonomous driving. IEEE Transactions
on Instrumentation and Measurement, 70:1–13, 2021.
[143] Jingyi Zhao, Shengnan Hao, Chenxu Dai, Haiyang Zhang, Li Zhao, Zhanlin Ji, and Ivan Ganchev. Improved
vision-based vehicle detection and classification by optimized yolov4. IEEE Access, 10:8590–8603, 2022.
[144] Joo Woo, Ji-Hyeon Baek, So-Hyeon Jo, Sun Young Kim, and Jae-Hoon Jeong. A study on object detection
performance of yolov4 for autonomous driving of tram. Sensors, 22(22):9026, 2022.
[145] Cunliang Ye, Yongfu Wang, Yunlong Wang, and Ming Tie. Steering angle prediction yolov5-based end-to-end
adaptive neural network control for autonomous vehicles. Proceedings of the Institution of Mechanical Engineers,
Part D: Journal of Automobile Engineering, 236(9):1991–2011, 2022.
[146] Xiang Jia, Ying Tong, Hongming Qiao, Man Li, Jiangang Tong, and Baoling Liang. Fast and accurate object
detector for autonomous driving based on improved yolov5. Scientific reports, 13(1):9711, 2023.
[147] Zhaoyan Chen, Xiaolan Wang, Weiwei Zhang, Guodong Yao, Dongdong Li, and Li Zeng. Autonomous parking
space detection for electric vehicles based on improved yolov5-obb algorithm. World Electric Vehicle Journal,
14(10):276, 2023.
[148] Xiaoxu Liu and Wei Qi Yan. Vehicle-related distance estimation using customized yolov7. In International
Conference on Image and Vision Computing New Zealand, pages 91–103. Springer, 2022.
[149] Nandni Mehla, Ishita, Ritika Talukdar, and Deepak Kumar Sharma. Object detection in autonomous maritime
vehicles: Comparison between yolo v8 and efficientdet. In International Conference on Data Science and
Network Engineering, pages 125–141. Springer, 2023.
[150] Pranav Patel, Vipul Vekariya, Jaimeel Shah, and Brijesh Vala. Detection of traffic sign based on yolov8. In AIP
Conference Proceedings, volume 3107. AIP Publishing, 2024.
[151] Emel Soylu and Tuncay Soylu. A performance comparison of yolov8 models for traffic sign detection in the
robotaxi-full scale autonomous vehicle competition. Multimedia Tools and Applications, 83(8):25005–25035,
2024.
[152] Hai Wang, Chenyu Liu, Yingfeng Cai, Long Chen, and Yicheng Li. Yolov8-qsd: An improved small object
detection algorithm for autonomous vehicles based on yolov8. IEEE Transactions on Instrumentation and
Measurement, 2024.
[153] Debasis Kumar and Naveed Muhammad. Object detection in adverse weather for autonomous driving through
data merging and yolov8. Sensors, 23(20):8471, 2023.
[154] Geesung Oh and Sejoon Lim. One-stage brake light status detection based on yolov8. Sensors, 23(17):7436,
2023.
[155] Afdhal Afdhal, Khairun Saddami, Sugiarto Sugiarto, Zahrul Fuadi, and Nasaruddin Nasaruddin. Real-time
object detection performance of yolov8 models for self-driving cars in a mixed traffic environment. In 2023 2nd
International Conference on Computer System, Information Technology, and Electrical Engineering (COSITE),
pages 260–265. IEEE, 2023.
[156] Murat Bakirci and Irem Bayraktar. Transforming aircraft detection through leo satellite imagery and yolov9
for improved aviation safety. In 2024 26th International Conference on Digital Signal Processing and its
Applications (DSPA), pages 1–6. IEEE, 2024.
[157] Ari Wibowo, Bambang Riyanto Trilaksono, Egi Muhammad Idris Hidayat, and Rinaldi Munir. Object detection
in dense and mixed traffic for autonomous vehicles with modified yolo. IEEE Access, 11:134866–134877, 2023.
[158] Ravinder Kaur and Jitendra Singh. Local regression based real-time traffic sign detection using yolov6. In 2022
4th International Conference on Advances in Computing, Communication Control and Networking (ICAC3N),
pages 522–526. IEEE, 2022.
[159] Bharat Mahaur and KK Mishra. Small-object detection based on yolov5 in autonomous driving systems. Pattern
Recognition Letters, 168:115–122, 2023.
[160] Christine Dewi, Rung-Ching Chen, Xiaoyi Jiang, and Hui Yu. Deep convolutional neural network for enhancing
traffic sign recognition developed on yolo v4. Multimedia Tools and Applications, 81(26):37821–37845, 2022.
[161] Nayereh Zaghari, Mahmood Fathy, Seyed Mahdi Jameii, and Mohammad Shahverdy. The improvement in
obstacle detection in autonomous vehicles using yolo non-maximum suppression fuzzy algorithm. The Journal
of Supercomputing, 77(11):13421–13446, 2021.
[162] William Chin Wei Hung, Muhammad Aizzat Zakaria, MI Ishak, and PM Heerwan. Object tracking for
autonomous vehicle using yolo v3. In Enabling Industry 4.0 through Advances in Mechatronics: Selected
Articles from iM3F 2021, Malaysia, pages 265–273. Springer, 2022.
[163] Yasir Ali, Md Mazharul Haque, and Fred Mannering. A bayesian generalised extreme value model to estimate
real-time pedestrian crash risks at signalised intersections using artificial intelligence-based video analytics.
Analytic methods in accident research, 38:100264, 2023.
[164] Fizza Hussain, Yasir Ali, Yuefeng Li, and Md Mazharul Haque. Revisiting the hybrid approach of anomaly
detection and extreme value theory for estimating pedestrian crashes using traffic conflicts obtained from artificial
intelligence-based video analytics. Accident Analysis & Prevention, 199:107517, 2024.
[165] Pardis Ghaziamin, Kairavi Bajaj, Nizar Bouguila, and Zachary Patterson. A privacy-preserving edge computing
solution for real-time passenger counting at bus stops using overhead fisheye camera. In 2024 IEEE 18th
International Conference on Semantic Computing (ICSC), pages 25–32. IEEE, 2024.
[166] Shile Zhang, Mohamed Abdel-Aty, Jinghui Yuan, and Pei Li. Prediction of pedestrian crossing intentions
at intersections based on long short-term memory recurrent neural network. Transportation research record,
2674(4):57–65, 2020.
[167] Hao Frank Yang, Yifan Ling, Cole Kopca, Sam Ricord, and Yinhai Wang. Cooperative traffic signal assistance
system for non-motorized users and disabilities empowered by computer vision and edge artificial intelligence.
Transportation research part C: emerging technologies, 145:103896, 2022.
[168] Dan Jiao and Teng Fei. Pedestrian walking speed monitoring at street scale by an in-flight drone. PeerJ Computer
Science, 9:e1226, 2023.
[169] Yongjie Wang, Yuqi Jia, Wenqiang Chen, Tao Wang, and Airen Zhang. Examining safe spaces for pedestrians
and e-bicyclists at urban crosswalks: an analysis based on drone-captured video. Accident Analysis & Prevention,
194:107365, 2024.
[170] Wei Zhou, Yuqing Liu, Lei Zhao, Sixuan Xu, and Chen Wang. Pedestrian crossing intention prediction from
surveillance videos for over-the-horizon safety warning. IEEE Transactions on Intelligent Transportation
Systems, 2023.
[171] Xin Xiao and Xinlong Feng. Multi-object pedestrian tracking using improved yolov8 and oc-sort. Sensors,
23(20), 2023.
[172] Sumit Pandey, Kuan-Fu Chen, and Erik B Dam. Comprehensive multimodal segmentation in medical imaging:
Combining yolov8 with sam and hq-sam models. In Proceedings of the IEEE/CVF International Conference on
Computer Vision, pages 2592–2598, 2023.
[173] Rui-Yang Ju and Weiming Cai. Fracture detection in pediatric wrist trauma x-ray images using yolov8 algorithm.
Scientific Reports, 13(1):20077, 2023.
[174] Atsuyuki Inui, Yutaka Mifune, Hanako Nishimoto, Shintaro Mukohara, Sumire Fukuda, Tatsuo Kato, Takahiro
Furukawa, Shuya Tanaka, Masaya Kusunose, Shunsaku Takigami, et al. Detection of elbow ocd in the ultrasound
image by artificial intelligence using yolov8. Applied Sciences, 13(13):7623, 2023.
[175] Baizheng Wu, Chengxin Pang, Xinhua Zeng, and Xing Hu. Me-yolo: Improved yolov5 for detecting medical
personal protective equipment. Applied Sciences, 12(23):11978, 2022.
[176] Xiaonan Zhao, Qi Wang, Min Zhang, Zixian Wei, Rui Ku, Zihao Zhang, Yang Yu, Bo Zhang, Yuan Liu, and
Cheng Wang. Csff-yolov5: Improved yolov5 based on channel split and feature fusion in femoral neck fracture
detection. Internet of Things, 26:101190, 2024.
[177] Lavika Goel and Pankaj Patel. Improving yolov6 using advanced pso optimizer for weight selection in lung
cancer detection and classification. Multimedia Tools and Applications, pages 1–34, 2024.
[178] Saydirasulov Norkobil Saydirasulovich, Akmalbek Abdusalomov, Muhammad Kafeel Jamil, Rashid Nasimov,
Dinara Kozhamzharova, and Young-Im Cho. A yolov6-based improved fire detection approach for smart city
environments. Sensors, 23(6):3161, 2023.
[179] Junting Zou and Mohd Rizal Arshad. Detection of whole body bone fractures based on improved yolov7.
Biomedical Signal Processing and Control, 91:105995, 2024.
[180] Marzieh Razaghi, Hossein Ebrahimpour Komleh, Fereshteh Dehghani, and Zahra Shahidi. Innovative diagnosis
of dental diseases using yolo v8 deep learning model. In 2024 13th Iranian/3rd International Machine Vision
and Image Processing Conference (MVIP), pages 1–5. IEEE, 2024.
[181] Thi-Loan Pham and Van-Hung Le. Ovarian tumors detection and classification from ultrasound images based on
yolov8. Journal of Advances in Information Technology, 15(2), 2024.
[182] Vallidevi Krishnamurthy, Surendiran Balasubramanian, R Sujithra Kanmani, S Srividhya, Jaladi Deepika, and
G Narayanee Nimeshika. Endoscopic surgical operation and object detection using custom architecture models.
In International Conference on Human-Centric Smart Computing, pages 637–654. Springer, 2023.
[183] N Palanivel, S Deivanai, B Sindhuja, et al. The art of yolov8 algorithm in cancer diagnosis using medical
imaging. In 2023 International Conference on System, Computation, Automation and Networking (ICSCAN),
pages 1–6. IEEE, 2023.
[184] Mehmet Karaköse, Hasan Yetış, and Mert Çeçen. A new approach for effective medical deepfake detection in
medical images. IEEE Access, 2024.
[185] Rhugved Bhojane, Siddhi Chourasia, Snehal V Laddha, and Rohini S Ochawar. Liver lesion detection from mr t1
in-phase and out-phase fused images and ct images using yolov8. In International Conference on Data Science
and Applications, pages 121–135. Springer, 2023.
[186] R Julia, Shajin Prince, and D Bini. Medical image analysis of masses in mammography using deep learning
model for early diagnosis of cancer tissues. In Computational Intelligence and Modelling Techniques for Disease
Detection in Mammogram Images, pages 75–89. Elsevier, 2024.
[187] SM Siamus Salahin, MD Shefat Ullaa, Saif Ahmed, Nabeel Mohammed, Taseef Hasan Farook, and James
Dudley. One-stage methods of computer vision object detection to classify carious lesions from smartphone
imaging. Oral, 3(2):176–190, 2023.
[188] Mukhtorov Doniyorjon, Rakhmonova Madinakhon, Muksimova Shakhnoza, and Young-Im Cho. An improved
method of polyp detection using custom yolov4-tiny. Applied Sciences, 12(21):10856, 2022.
[189] Baichen Ding, Zhuo Zhang, Yiran Liang, Weiwei Wang, Siwei Hao, Ze Meng, Lian Guan, Ying Hu, Bin Guo,
Runlian Zhao, et al. Detection of dental caries in oral photographs taken by mobile phones based on the yolov3
algorithm. Annals of Translational Medicine, 9(21), 2021.
[190] Lei Wang, Shujian Yang, Shan Yang, Cheng Zhao, Guangye Tian, Yuxiu Gao, Yongjian Chen, and Yun Lu.
Automatic thyroid nodule recognition and diagnosis in ultrasound imaging with the yolov2 neural network.
World journal of surgical oncology, 17:1–9, 2019.
[191] Alicja Kwaśniewska, Jacek Rumiński, Krzysztof Czuszyński, and Maciej Szankin. Real-time facial features
detection from low resolution thermal images with deep classification models. Journal of Medical Imaging and
Health Informatics, 8(5):979–987, 2018.
[192] Fahad Majeed, Farrukh Zeeshan Khan, Maria Nazir, Zeshan Iqbal, Majed Alhaisoni, Usman Tariq, Muhammad Attique Khan, and Seifedine Kadry. Investigating the efficiency of deep learning based security system in a real-time environment using yolov5. Sustainable Energy Technologies and Assessments, 53:102603, 2022.
[193] Armstrong Aboah, Maged Shoman, Vishal Mandal, Sayedomidreza Davami, Yaw Adu-Gyamfi, and Anuj Sharma.
A vision-based system for traffic anomaly detection using deep learning and decision trees. In 2021 IEEE/CVF
Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 4202–4207, 2021.
[194] Nesrine Affes, Jalel Ktari, Nader Ben Amor, Tarek Frikha, and Habib Hamam. Comparison of yolov5, yolov6, yolov7 and yolov8 for intelligent video surveillance. Journal of Information Assurance & Security, 18(5), 2023.
[195] Fengyun Cao and Shuai Ma. Enhanced campus security target detection using a refined yolov7 approach.
Traitement du Signal, 40(5), 2023.
[196] Narendra Chatterjee, Ajay Vikram Singh, and Rekha Agarwal. You only look once (yolov8) based intrusion detection system for physical security and surveillance. In 2024 11th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), pages 1–5. IEEE, 2024.
[197] Sandhya and Abhishek Kashyap. Real-time object-removal tampering localization in surveillance videos by
employing yolo-v8. Journal of Forensic Sciences, 2024.
[198] Dai Quoc Tran, Armstrong Aboah, Yuntae Jeon, Maged Shoman, Minsoo Park, and Seunghee Park. Low-light
image enhancement framework for improved object detection in fisheye lens datasets. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 7056–7065, June
2024.
[199] Murat Bakirci and Irem Bayraktar. Boosting aircraft monitoring and security through ground surveillance
optimization with yolov9. In 2024 12th International Symposium on Digital Forensics and Security (ISDFS),
pages 1–6. IEEE, 2024.
[200] Murat Bakirci and Irem Bayraktar. Yolov9-enabled vehicle detection for urban security and forensics applications.
In 2024 12th International Symposium on Digital Forensics and Security (ISDFS), pages 1–6. IEEE, 2024.
[201] Maged Shoman, Tarek Ghoul, Gabriel Lanzaro, Tala Alsharif, Suliman Gargoum, and Tarek Sayed. Enforcing
traffic safety: A deep learning approach for detecting motorcyclists’ helmet violations using yolov8 and deep
convolutional generative adversarial network-generated images. Algorithms, 17(5), 2024.
[202] Sovon Chakraborty, Sabrina Zahir, Nabiha Tasnim Orchi, Md Ferdous Bin Hafiz, AOM Shamsuddoha, and
Shakib Mahmud Dipto. Violence detection: A multi-model approach towards automated video surveillance and
public safety. In 2024 International Conference on Advances in Computing, Communication, Electrical, and
Smart Systems (iCACCESS), pages 1–6. IEEE, 2024.
[203] Guancheng Chen, Wenzhuo Du, Tong Xu, Siyuan Wang, Xiang Qi, and Yuxin Wang. Investigating enhanced yolov8 model applications for large-scale security risk management and drone-based low-altitude law enforcement. Highlights in Science, Engineering and Technology, 98:390–396, 2024.
[204] Farid Pashayev, Leyla Babayeva, Zuleykha Isgandarova, and Behnam Kiani Kalejahi. Face recognition in smart cameras by yolo8. Khazar Journal of Science and Technology (KJSAT), page 67, 2023.
[205] Seda Balta Kaç, Süleyman Eken, Deniz Dural Balta, Musa Balta, Murat İskefiyeli, and İbrahim Özçelik. Image-based security techniques for water critical infrastructure surveillance. Applied Soft Computing, page 111730, 2024.
[206] Qingji Gao, Haozhi Deng, and Gaowei Zhang. A contraband detection scheme in x-ray security images based on
improved yolov8s network model. Sensors, 24(4):1158, 2024.
[207] J Cruz Antony, Ch Leela Sri Chowdary, E Murali, Albert Mayan, et al. Advancing crowd management
through innovative surveillance using yolov8 and bytetrack. In 2024 International Conference on Wireless
Communications Signal Processing and Networking (WiSPNET), pages 1–6. IEEE, 2024.
[208] Dawei Zhang. A yolo-based approach for fire and smoke detection in iot surveillance systems. International
Journal of Advanced Computer Science & Applications, 15(1), 2024.
[209] Pyone Pyone Khin and Nay Min Htaik. Gun detection: A comparative study of retinanet, efficientdet and yolov8
on custom dataset. In 2024 IEEE Conference on Computer Applications (ICCA), pages 1–7. IEEE, 2024.
[210] Lwando Nkuzo, Malusi Sibiya, and Elisha Didam Markus. A comprehensive analysis of real-time car safety belt
detection using the yolov7 algorithm. Algorithms, 16(9):400, 2023.
[211] Rong Chang, Bingzhen Zhang, Qianxin Zhu, Shan Zhao, Kai Yan, Yang Yang, et al. Ffa-yolov7: Improved
yolov7 based on feature fusion and attention mechanism for wearing violation detection in substation construction
safety. Journal of Electrical and Computer Engineering, 2023, 2023.
[212] Li Han, Chunhai Ma, Yan Liu, Junyang Jia, and Jiaxing Sun. Sc-yolov8: A security check model for the
inspection of prohibited items in x-ray images. Electronics, 12(20):4208, 2023.
[213] Jinhao Yuan, Nanfeng Zhang, Yuexuan Xie, and Xiangdong Gao. Detection of prohibited items based upon x-ray images and improved yolov7. In Journal of Physics: Conference Series, volume 2390, page 012114, Guangzhou, China, 2022. 3rd International Conference on Advanced Materials and Intelligent Manufacturing (ICAMIM 2022).
[214] Suryanti Awang, Mohd Qhairel Rafiqi Rokei, and Junaida Sulaiman. Suspicious activity trigger system using
yolov6 convolutional neural network. In 2023 International Conference on Artificial Intelligence in Information
and Communication (ICAIIC), pages 527–532. IEEE, 2023.
[215] Yaohui Xiao, An Chang, Yufeng Wang, Yu Huang, Junsong Yu, and Lihai Huo. Real-time object detection
for substation security early-warning with deep neural network based on yolo-v5. In 2022 IEEE IAS Global
Conference on Emerging Technologies (GlobConET), pages 45–50. IEEE, 2022.
[216] Guanbo Wang, Hongwei Ding, Mingliang Duan, Yuanyuan Pu, Zhijun Yang, and Haiyan Li. Fighting against
terrorism: A real-time cctv autonomous weapons detection based on improved yolo v4. Digital Signal Processing,
132:103790, 2023.
[217] PH Kashika and Rekha B Venkatapur. Automatic tracking of objects using improvised yolov3 algorithm and
alarm human activities in case of anomalies. International Journal of Information Technology, 14(6):2885–2891,
2022.
[218] T Mohandoss and J Rangaraj. Multi-object detection using enhanced yolov2 and lunet algorithms in surveillance
videos. e-Prime-Advances in Electrical Engineering, Electronics and Energy, 8:100535, 2024.
[219] Calvin John Smiley. From silence to propagation: Understanding the relationship between “stop snitchin” and
“yolo”. Deviant Behavior, 36(1):1–16, 2015.
[220] Hafiz Mughees Ahmad and Afshin Rahimi. Deep learning methods for object detection in smart manufacturing:
A survey. Journal of Manufacturing Systems, 64:181–196, 2022.
[221] Rohit Pendse, Harshal Rajput, Shubham Saraf, Atharva Sarwate, Jyoti Jadhav, et al. Defect detection in manufacturing using yolov7. International Journal of Research and Analytical Reviews (IJRAR), 10(2):179–185, 2023.
[222] Feifan Yi, Haigang Zhang, Jinfeng Yang, Liming He, Ahmad Sufril Azlan Mohamed, and Shan Gao. Yolov7-siamff: Industrial defect detection algorithm based on improved yolov7. Computers and Electrical Engineering, 114:109090, 2024.
[223] Hongjun Wang, Xiujin Xu, Yuping Liu, Deda Lu, Bingqiang Liang, and Yunchao Tang. Real-time defect
detection for metal components: a fusion of enhanced canny–devernay and yolov6 algorithms. Applied Sciences,
13(12):6898, 2023.
[224] Adinda Sekar Ludwika and Achmad Pratama Rifai. Deep learning for detection of proper utilization and
adequacy of personal protective equipment in manufacturing teaching laboratories. Safety, 10(1):26, 2024.
[225] Seunghyo Beak, Yo-Han Han, Yeeun Moon, Jieun Lee, and Jongpil Jeong. Yolov7-based anomaly detection
using intensity and ng types in labeling in cosmetic manufacturing processes. Processes, 11(8):2266, 2023.
[226] Hongyu Zhao, Xiangyu Wang, Junbo Sun, Yufei Wang, Zhaohui Chen, Jun Wang, and Xinglong Xu. Artificial
intelligence powered real-time quality monitoring for additive manufacturing in construction. Construction and
Building Materials, 429:135894, 2024.
[227] Ziqiang Liu and Kejiang Ye. Yolo-imf: an improved yolov8 algorithm for surface defect detection in industrial
manufacturing field. In International Conference on Metaverse, pages 15–28. Springer, 2023.
[228] Yihao Wen and Li Wang. Yolo-sd: simulated feature fusion for few-shot industrial defect detection based on
yolov8 and stable diffusion. International Journal of Machine Learning and Cybernetics, pages 1–13, 2024.
[229] Nyoman Karna, Made Adi Paramartha Putra, Syifa Maliah Rachmawati, Mideth Abisado, and Gabriel Avelino
Sampedro. Towards accurate fused deposition modeling 3d printer fault detection using improved yolov8 with
hyperparameter optimization. IEEE Access, 2023.
[230] Wei Li, Mahmud Iwan Solihin, and Hanung Adi Nugroho. Rca: Yolov8-based surface defects detection on the
inner wall of cylindrical high-precision parts. Arabian Journal for Science and Engineering, pages 1–19, 2024.
[231] Yike Hu, Jiajun Wang, Xiaoling Wang, Yuheng Sun, Hongling Yu, and Jun Zhang. Real-time evaluation of
the blending uniformity of industrially produced gravelly soil based on cond-yolov8-seg. Journal of Industrial
Information Integration, 39:100603, 2024.
[232] Shuxin Yang, Zexin Zhang, Bi Wang, and Jianqing Wu. Dcs-yolov8: An improved steel surface defect detection
algorithm based on yolov8. In Proceedings of the 2024 7th International Conference on Image and Graphics
Processing, pages 39–46, 2024.
[233] Xueqiu Wang, Huanbing Gao, Zemeng Jia, and Zijian Li. Bl-yolov8: An improved road defect detection model
based on yolov8. Sensors, 23(20):8361, 2023.
[234] Bingxin Luo, Ziming Kou, Cong Han, and Juan Wu. A “hardware-friendly” foreign object identification method
for belt conveyors based on improved yolov8. Applied Sciences, 13(20):11464, 2023.
[235] Chunjie Wang, Qibo Sun, Xiaogang Dong, and Jia Chen. Automotive adhesive defect detection based on
improved yolov8. Signal, Image and Video Processing, pages 1–13, 2024.
[236] Qian Wu, Xi Kuang, Xin Tang, Dongdong Guo, and Zhonghao Luo. Industrial equipment object detection based
on improved yolov7. In International Conference on Computer, Artificial Intelligence, and Control Engineering
(CAICE 2023), volume 12645, pages 600–608. SPIE, 2023.
[237] Oungsub Kim, Yohan Han, and Jongpil Jeong. Real-time inspection system based on moire pattern and yolov7
for coated high-reflective injection molding product. WSEAS Transactions on Computer Research, 10:120–125,
2022.
[238] Jincheng Chen, Shoujun Bai, Guoyang Wan, and Yunfei Li. Research on yolov7-based defect detection method
for automotive running lights. Systems Science & Control Engineering, 11(1):2185916, 2023.
[239] Muhammad Hussain, Hussain Al-Aqrabi, Muhammad Munawar, Richard Hill, and Tariq Alsboui. Domain
feature mapping with yolov7 for automated edge-based pallet racking inspections. Sensors, 22(18):6927, 2022.
[240] Bao Zhu, Guijian Xiao, Youdong Zhang, and Hui Gao. Multi-classification recognition and quantitative
characterization of surface defects in belt grinding based on yolov7. Measurement, 216:112937, 2023.
[241] Chhaya Gupta, Nasib Singh Gill, Preeti Gulia, and Jyotir Moy Chatterjee. A novel finetuned yolov6 transfer
learning model for real-time object detection. Journal of Real-Time Image Processing, 20(3):42, 2023.
[242] Niloofar Zendehdel, Haodong Chen, and Ming C Leu. Real-time tool detection in smart manufacturing using
you-only-look-once (yolo) v5. Manufacturing Letters, 35:1052–1059, 2023.
[243] Daqi Jiang, Hong Wang, and Yanzheng Lu. An efficient automobile assembly state monitoring system based on
channel-pruned yolov4 algorithm. International Journal of Computer Integrated Manufacturing, 37(3):372–382,
2024.
[244] Jihong Yan and Zipeng Wang. Yolo v3+ vgg16-based automatic operations monitoring and analysis in a
manufacturing workshop under industry 4.0. Journal of Manufacturing Systems, 63:134–142, 2022.
[245] Koki Arima, Fusaomi Nagata, Tatsuki Shimizu, Akimasa Otsuka, Hirohisa Kato, Keigo Watanabe, and Maki K
Habib. Improvements of detection accuracy and its confidence of defective areas by yolov2 using a data set
augmentation method. Artificial Life and Robotics, 28(3):625–631, 2023.
[246] Chetan M. Badgujar, Paul R. Armstrong, Alison R. Gerken, Lester O. Pordesimo, and James F. Campbell. Real-time stored product insect detection and identification using deep learning: System integration and extensibility to mobile platforms. Journal of Stored Products Research, 104:102196, December 2023.
[247] Maged Shoman, Armstrong Aboah, Alex Morehead, Ye Duan, Abdulateef Daud, and Yaw Adu-Gyamfi. A
region-based deep learning approach to automated retail checkout. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 3210–3215, June 2022.
[248] Srikanth Bhat, K Annapoorna Shenoy, Moulya R Jain, and K Manasvi. Detecting crops and weeds in fields
using yolov6 and faster r-cnn object detection models. In 2023 International Conference on Recent Advances in
Information Technology for Sustainable Development (ICRAIS), pages 43–48. IEEE, 2023.
[249] Ramesh Bahadur Bist, Sachin Subedi, Xiao Yang, and Lilong Chai. A novel yolov6 object detector for monitoring
piling behavior of cage-free laying hens. AgriEngineering, 5(2):905–923, 2023.
[250] Praveen Kumar and Naveen Kumar. Drone-based apple detection: Finding the depth of apples using yolov7
architecture with multi-head attention mechanism. Smart Agricultural Technology, 5:100311, 2023.
[251] Lijuan Zhang, Gongcheng Ding, Chaoran Li, and Dongming Li. Dcf-yolov8: An improved algorithm for
aggregating low-level features to detect agricultural pests and diseases. Agronomy, 13(8):2012, 2023.
[252] Chetan M Badgujar, Alwin Poulose, and Hao Gan. Agricultural object detection with You Only Look Once
(YOLO) Algorithm: A bibliometric and systematic literature review. Computers and Electronics in Agriculture,
223:109090, August 2024.
[253] Luiz Carlos M Junior and José Alfredo C Ulson. Real time weed detection using computer vision and deep
learning. In 2021 14th IEEE International Conference on Industry Applications (INDUSCON), pages 1131–1137.
IEEE, 2021.
[254] Mahnoor Khalid, Muhammad Shahzad Sarfraz, Uzair Iqbal, Muhammad Umar Aftab, Gniewko Niedbała, and
Hafiz Tayyab Rauf. Real-time plant health detection using deep convolutional neural networks. Agriculture,
13(2):510, 2023.
[255] Ignazio Gallo, Anwar Ur Rehman, Ramin Heidarian Dehkordi, Nicola Landro, Riccardo La Grassa, and Mirco
Boschetti. Deep object detection of crop weeds: Performance of yolov7 on a real case dataset from uav images.
Remote Sensing, 15(2):539, 2023.
[256] Shreya Vaidya, Sameer Kavthekar, and Amit Joshi. Leveraging yolov7 for plant disease detection. In 2023 4th International Conference on Innovative Trends in Information Technology (ICITIIT), pages 1–6. IEEE, 2023.
[257] Hafedh Mahmoud Zayani, Ikhlass Ammar, Refka Ghodhbani, Albia Maqbool, Taoufik Saidani, Jihane Ben
Slimane, Amani Kachoukh, Marouan Kouki, Mohamed Kallel, Amjad A Alsuwaylimi, et al. Deep learning for
tomato disease detection with yolov8. Engineering, Technology & Applied Science Research, 14(2):13584–13591,
2024.
[258] Baoling Ma, Zhixin Hua, Yuchen Wen, Hongxing Deng, Yongjie Zhao, Liuru Pu, and Huaibo Song. Using
an improved lightweight yolov8 model for real-time detection of multi-stage apple fruit in complex orchard
environments. Artificial Intelligence in Agriculture, 2024.
[259] Mohamad Haniff Junos, Anis Salwa Mohd Khairuddin, Subbiah Thannirmalai, and Mahidzal Dahari. An
optimized yolo-based object detection model for crop harvesting system. IET Image Processing, 15(9):2112–
2125, 2021.
[260] Hongyu Zhao, Zezhi Tang, Zhenhong Li, Yi Dong, Yuancheng Si, Mingyang Lu, and George Panoutsos. Real-time object detection and robotic manipulation for agriculture using a yolo-based learning approach. arXiv preprint arXiv:2401.15785, 2024.
[261] Wei Chen, Jingfeng Zhang, Biyu Guo, Qingyu Wei, and Zhiyu Zhu. An apple detection method based on
des-yolo v4 algorithm for harvesting robots in complex environment. Mathematical Problems in Engineering,
2021:1–12, 2021.
[262] Mehmet Nergiz. Enhancing strawberry harvesting efficiency through yolo-v7 object detection assessment. Turkish Journal of Science and Technology, 18(2):519–533, 2023.
[263] Chenglin Wang, Qiyu Han, Chunjiang Li, Jianian Li, Dandan Kong, Faan Wang, and Xiangjun Zou. Assisting
the planning of harvesting plans for large strawberry fields through image-processing method based on deep
learning. Agriculture, 14(4):560, 2024.
[264] Wenkang Chen, Shenglian Lu, Binghao Liu, Ming Chen, Guo Li, and Tingting Qian. Citrusyolo: an algorithm for citrus detection under orchard environment based on yolov4. Multimedia Tools and Applications, 81(22):31363–31389, 2022.
[265] Hamzeh Mirhaji, Mohsen Soleymani, Abbas Asakereh, and Saman Abdanan Mehdizadeh. Fruit detection and
load estimation of an orange orchard using the yolo models through simple approaches in different imaging and
illumination conditions. Computers and Electronics in Agriculture, 191:106533, 2021.
[266] Ranjan Sapkota, Dawood Ahmed, Martin Churuvija, and Manoj Karkee. Immature green apple detection and
sizing in commercial orchards using yolov8 and shape fitting techniques. IEEE Access, 12:43436–43452, 2024.
[267] Dihua Wu, Shuaichao Lv, Mei Jiang, and Huaibo Song. Using channel pruning-based yolo v4 deep learning
algorithm for the real-time and accurate detection of apple flowers in natural environments. Computers and
Electronics in Agriculture, 178:105742, 2020.
[268] Jizhang Wang, Zhiheng Gao, Yun Zhang, Jing Zhou, Jianzhi Wu, and Pingping Li. Real-time detection and
location of potted flowers based on a zed camera and a yolo v4-tiny deep learning algorithm. Horticulturae,
8(1):21, 2021.
[269] Salik Ram Khanal, Ranjan Sapkota, Dawood Ahmed, Uddhav Bhattarai, and Manoj Karkee. Machine vision system for early-stage apple flowers and flower clusters detection for precision thinning and pollination. IFAC-PapersOnLine, 56(2):8914–8919, 2023.
[270] Feng Xiao, Haibin Wang, Yueqin Xu, and Ruiqing Zhang. Fruit detection and recognition based on deep learning
for automatic harvesting: an overview and review. Agronomy, 13(6):1625, 2023.
[271] Yijing Wu, Yi Yang, Xue-fen Wang, Jian Cui, and Xinyun Li. Fig fruit recognition method based on yolo v4 deep learning. In 2021 18th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), pages 303–306. IEEE, 2021.
[272] Yunfeng Zhang, Li Li, Changpin Chun, Yifeng Wen, and Gang Xu. Multi-scale feature adaptive fusion model
for real-time detection in complex citrus orchard environments. Computers and Electronics in Agriculture,
219:108836, 2024.
[273] Jialiang Zhou, Yueyue Zhang, and Jinpeng Wang. A dragon fruit picking detection method based on yolov7 and
psp-ellipse. Sensors, 23(8):3803, 2023.
[274] Xiuyan Gao and Yanmin Zhang. Detection of fruit using yolov8-based single stage detectors. International Journal of Advanced Computer Science & Applications, 14(12), 2023.
[275] Boudjemaa Boudaa, Kamel Abada, Walid Aymen Aichouche, and Ahmed Nabil Belakermi. Advancing plant
diseases detection with pre-trained yolo models. In 2024 6th International Conference on Pattern Analysis and
Intelligent Systems (PAIS), pages 1–6. IEEE, 2024.
[276] Hoang-Tu Vo, Kheo Chau Mui, Nhon Nguyen Thien, and Phuc Pham Tien. Automating tomato ripeness
classification and counting with yolov9. International Journal of Advanced Computer Science & Applications,
15(4), 2024.
[277] Jiayue Zhao and Jianhua Qu. A detection method for tomato fruit common physiological diseases based on yolov2. In 2019 10th International Conference on Information Technology in Medicine and Education (ITME), pages 559–563. IEEE, 2019.
[278] Rong Ye, Quan Gao, Ye Qian, Jihong Sun, and Tong Li. Improved yolov8 and sahi model for the collaborative
detection of small targets at the micro scale: A case study of pest detection in tea. Agronomy, 14(5):1034, 2024.
[279] Oluwaseyi Ezekiel Olorunshola, Martins Ekata Irhebhude, and Abraham Eseoghene Evwiekpaefe. A comparative
study of yolov5 and yolov7 object detection algorithms. Journal of Computing and Social Informatics, 2(1):1–12,
2023.
[280] Ning Li, Mingliang Wang, Gaochao Yang, Bo Li, Baohua Yuan, and Shoukun Xu. Dens-yolov6: A small object
detection model for garbage detection on water surface. Multimedia Tools and Applications, pages 1–21, 2023.
[281] Hyun-Ki Jung and Gi-Sang Choi. Improved yolov5: Efficient object detection using drone images under various
conditions. Applied Sciences, 12(14):7255, 2022.
[282] Aichen Wang, Tao Peng, Huadong Cao, Yifei Xu, Xinhua Wei, and Bingbo Cui. Tia-yolov5: An improved
yolov5 network for real-time detection of crop and weed in the field. Frontiers in Plant Science, 13:1091655,
2022.
[283] Tian-Hao Wu, Tong-Wen Wang, and Ya-Qi Liu. Real-time vehicle and distance detection based on improved
yolo v5 network. In 2021 3rd World Symposium on Artificial Intelligence (WSAI), pages 24–28. IEEE, 2021.
[284] Aarush Kaunteya Pande, Preston Brantley, Muhammad Hassan Tanveer, and Razvan Cristian Voicu. From ai to
agi-the evolution of real-time systems with gpt integration. In SoutheastCon 2024, pages 699–707. IEEE, 2024.
[285] Youzhi Qu, Chen Wei, Penghui Du, Wenxin Che, Chi Zhang, Wanli Ouyang, Yatao Bian, Feiyang Xu, Bin Hu,
Kai Du, Haiyan Wu, Jia Liu, and Quanying Liu. Integration of cognitive tasks into artificial general intelligence
test for large models. iScience, 27, 2024.
[286] Riccardo Manzotti. Embodied ai beyond embodied cognition and enactivism. Philosophies, 4(3):39, 2019.
[287] Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martín-Martín, Chen
Wang, Gabrael Levine, Wensi Ai, Benjamin Martinez, et al. Behavior-1k: A human-centered, embodied ai
benchmark with 1,000 everyday activities and realistic simulation. arXiv preprint arXiv:2403.09227, 2024.
[288] Rolf Pfeifer and Fumiya Iida. Embodied artificial intelligence: Trends and challenges. Lecture Notes in Computer Science, pages 1–26, 2004.
[289] Nitin Jagannatha Sanket. Active vision based embodied-ai design for nano-uav autonomy. PhD thesis, University
of Maryland, College Park, 2021.
[290] Tian Wang, Pai Zheng, Shufei Li, and Lihui Wang. Multimodal human–robot interaction for human-centric
smart manufacturing: A survey. Advanced Intelligent Systems, 6(3):2300359, 2024.
[291] Aarthi Lakshmipathy, Madhurima Vardhineedi, Venkata Ramana Patnaik Sekharamahanthi, Devanshi Dineshbhai
Patel, Saurav Saini, and Sabah Mohammed. Medicaption: Integrating yolo-driven computer vision and nlp for
advanced pharmaceutical package recognition and annotation. Authorea Preprints, 2024.
[292] Ruiheng Xu, Kaiwen Ji, Zichen Yuan, Chenye Wang, and Yihan Xia. Exploring the evolution trend of china’s digital carbon footprint: A simulation based on system dynamics approach. Sustainability, 16(10), 2024.
[293] Payal Dhar. The carbon impact of artificial intelligence. Nature Machine Intelligence, 2(10), 2020.