Article
Leveraging Perspective Transformation for Enhanced Pothole
Detection in Autonomous Vehicles
Abdalmalek Abu-raddaha *, Zaid A. El-Shair and Samir Rawashdeh
Abstract: Road conditions, often degraded by insufficient maintenance or adverse weather, signifi-
cantly contribute to accidents, exacerbated by the limited human reaction time to sudden hazards like
potholes. Early detection of distant potholes is crucial for timely corrective actions, such as reducing
speed or avoiding obstacles, to mitigate vehicle damage and accidents. This paper introduces a novel
approach that utilizes perspective transformation to enhance pothole detection at different distances,
focusing particularly on distant potholes. Perspective transformation improves the visibility and
clarity of potholes by virtually bringing them closer and enlarging their features, which is particularly
beneficial given the fixed-size input requirement of object detection networks, typically significantly
smaller than the raw image resolutions captured by cameras. Our method automatically identifies
the region of interest (ROI)—the road area—and calculates the corner points to generate a perspec-
tive transformation matrix. This matrix is applied to all images and corresponding bounding box
labels, enhancing the representation of potholes in the dataset. This approach significantly boosts
detection performance when used with YOLOv5-small, achieving a 43% improvement in the average
precision (AP) metric at intersection-over-union thresholds of 0.5 to 0.95 for single class evaluation,
and notable improvements of 34%, 63%, and 194% for near, medium, and far potholes, respectively,
after categorizing them based on their distance. To the best of our knowledge, this work is the first to
employ perspective transformation specifically for enhancing the detection of distant potholes.
Keywords: autonomous vehicles; perspective transformation; deep learning; pothole detection; computer vision; mobile robotics

Citation: Abu-raddaha, A.; El-Shair, Z.A.; Rawashdeh, S. Leveraging Perspective Transformation for Enhanced Pothole Detection in Autonomous Vehicles. J. Imaging 2024, 10, 227. https://doi.org/10.3390/jimaging10090227

Academic Editors: Pier Luigi Mazzeo and Alessandro Bruno

Received: 6 August 2024; Revised: 10 September 2024; Accepted: 11 September 2024; Published: 14 September 2024

Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

1.1. Pothole Detection Importance for Autonomous Vehicles

Potholes, commonly found in asphalt pavements, are caused by water weakening the underlying soil and repeated traffic wear. Factors such as temperature fluctuations causing expansion and contraction, poor drainage allowing water infiltration, and the use of low-quality materials further contribute to the formation of depressions or holes in the road surface [1–3]. These can vary in severity and pose significant hazards, such as suspension damage, tire punctures, and even accidents, by causing loss of control or immobilization of vehicles. The dangers of potholes extend to both vehicles and pedestrians, highlighting the critical need for efficient detection systems.

In 2011, poor road conditions caused around 2200 deaths in India, while in the U.S., one-third of the 38,824 traffic deaths in 2020 were linked to substandard roads. Michigan, with some of the worst potholes, spent millions annually on repairs, highlighting the widespread impact of this issue. Effective pothole detection is crucial, particularly for autonomous vehicles, which rely on accurate hazard detection to ensure safe operation. This need is underscored by the potential damages and safety risks associated with potholes, emphasizing the importance of accurate and timely detection systems to mitigate these dangers.
Object detection models typically rescale input images to a fixed, smaller size to ensure reasonable processing times. This resizing is necessary
because models trained on high-resolution images demand substantial computational
resources, leading to impracticalities in real-time applications. As a result, small objects,
such as potholes, especially those far away, often become indistinguishable when images
are downscaled. The reduction in size leads to a loss of crucial details, making it challenging
for the model to accurately detect and classify these objects. This issue is exacerbated by
the fact that more complex models, while potentially offering increased accuracy, do not
necessarily resolve the problem of lost detail due to image downscaling.
Moreover, the utilization of high-resolution images in training object detection models
is hindered by the immense complexity of the search space. High-dimensional data require
more extensive computational resources and can significantly slow down the training and
inference processes. This trade-off between image resolution and processing efficiency
is particularly problematic in real-time applications such as autonomous driving, where
rapid detection and response are crucial for safety. The need to rescale high-definition
images to lower resolutions introduces a bottleneck in object detection systems. The act
of rescaling can lead to a substantial loss of fine-grained features that are essential for
accurately identifying potholes, thereby compromising the model’s performance. One
proposed solution to mitigate these challenges is the use of perspective transformation.
Unlike conventional resizing, perspective transformation selectively focuses on a region
of interest (ROI) within an image, such as the area containing a pothole. This approach
preserves critical features by altering the viewing angle, effectively enlarging the ROI and
reducing the prominence of irrelevant areas. While this method does not introduce new
features, it helps retain more of the significant details associated with the potholes, thereby
improving detection accuracy. Although theoretically training a model on full-resolution
images would be ideal, it is practically unfeasible due to computational constraints. Hence,
perspective transformation offers a practical compromise, allowing the retention of essential
features while maintaining manageable processing times, thus enhancing the robustness
and effectiveness of pothole detection in autonomous vehicle systems.
The main contributions of this work are summarized as follows:
1. We propose an automated algorithm that identifies the road region of interest (ROI) from a dataset's bounding-box labels and generates the corresponding perspective transformation matrix.
2. Using this matrix to transform the input data before model inference significantly
boosts the accuracy and robustness of pothole detection with limited impact on
runtime. To the best of our knowledge, this is the first study to utilize perspective
transformation in this manner to enhance pothole detection performance.
3. We propose an intuitive evaluation strategy to assess the pothole detection models’ per-
formance across potholes at three distance ranges (near, medium, and far), demonstrating
the potential of our approach in improving pothole detection at far distances.
Figure 1. Comparison between the naive approach and our proposed approach. The naive approach
involves loading the raw input image and then simply downscaling it to the required input resolution
for the object detection network, losing significant image features and resulting in undetected potholes.
Meanwhile, our approach demonstrates successful and robust pothole detection by transforming the
input image to primarily retain the region of interest and minimize irrelevant segments of the image.
Ground-truth pothole labels and predicted potholes are represented by green and red bounding
boxes, respectively. Street and rescaling icons created by Trevor Dsouza and Doodle Icons via
TheNounProject.com (accessed on 1 August 2024).
2. Related Work
Numerous studies have investigated various computer vision and image processing
techniques for automated pothole detection using visual road imagery. However, the
majority of these approaches have focused more on pothole classification than on detection. Furthermore, most of these papers have not used the same dataset as
our study, making direct comparisons between our work and these studies challenging.
Additionally, other research has highlighted the significance of utilizing perspective trans-
formation in object detection applications. The following sections review some of the
aforementioned approaches.
Deep learning techniques have shown great potential in developing more powerful
models for pothole detection and classification. Chen et al. [16] focused on classification
tasks, using location-aware CNNs to identify areas likely to contain potholes. Additionally,
Dhiman et al. [17] reported promising results by employing transfer learning with Mask
R-CNN and YOLOv2. Although these models demonstrate high classification accuracy,
they are computationally expensive and struggle with precise localization. Additionally,
the two-stage nature of some pipelines, such as those of Chen et al., can introduce further
computational overhead. The performance of these models is also heavily dependent on
high-quality input data, and challenges in real-time applications due to processing latency
remain a significant drawback.
Stereo vision approaches, explored by Dhiman et al. [17,18], utilize depth information
to identify road defects by analyzing road elevation and depth variations. Although these
methods theoretically provide detailed spatial information, they are highly dependent
on the quality of stereo images and the precise camera calibration. Issues such as noise,
distortion, and the need for well-aligned image pairs can significantly affect the accuracy
of depth estimation and pothole detection. The computational intensity further hinders
the practical deployment of these techniques in real-time applications, particularly in the
context of autonomous vehicles.
Data augmentation and enhancement techniques have also been explored to improve
pothole detection performance. Maeda et al. [19] utilized Generative Adversarial Networks
(GANs) to generate synthetic training data in combination with a Single-Shot Multibox
Detection (SSD) model. The addition of synthetic data led to moderate improvements,
increasing the F-score by 5% when the synthetic data were less than 50% of the original
dataset and by 2% when they constituted about 50%. However, performance deteriorated
when the synthetic data exceeded 50% of the original data. Despite these improvements,
the approach faced challenges, including increased computational complexity, instability
during GAN training, and concerns about the generalizability of the model to real-world
conditions. Similarly, Salaudeen et al. [20] introduced a pothole detection approach that
combines an image enhancement GAN with an object detection network. The GAN,
specifically ESRGAN, enhances image quality to make potholes more distinguishable,
while the detection network identifies and localizes them. Using a combination of datasets,
their method produced notable results. EfficientDet demonstrated an improvement in
mAP when applied to super-resolution images compared to low-resolution ones. Similarly,
YOLOv5, in conjunction with ESRGAN, showed better performance in super-resolution
images compared to low-resolution counterparts, both evaluated within the same IoU range.
Despite improved detection metrics, this approach introduces significant computational
overhead and risks of overfitting, particularly since results depend mainly on the quality of
the generated data.
Specific YOLO variants have also been investigated. Al-Shaghouri et al. [21] investi-
gated real-time pothole detection using YOLOv3 and YOLOv4. However, the evaluation
was performed at a low IoU threshold of 25%, which is less stringent than typical object
detection standards. This low threshold, along with performance variability at different
distances, highlights some limitations of their approach despite the promising precision
results. Buko et al. [22] examined the effectiveness of YOLOv3 and Sparse R-CNN under
various challenging conditions, revealing a substantial performance degradation under low
light and adverse weather conditions, indicating limited applicability in various real-world
scenarios. Moreover, this study used the same dataset for training and testing, which
affects the generalizability of this approach. Rastogi et al. [23] modified YOLOv2 to address
issues such as vanishing gradients and irrelevant feature learning. However, the reliance
on close-range smartphone images limits the model’s applicability to broader contexts,
such as autonomous vehicles where variable distances and angles are encountered.
While previous research in pothole detection has largely concentrated on improving
algorithmic architectures or combining multiple techniques to enhance detection accuracy,
our contribution addresses a fundamental gap by focusing on the quality and effectiveness
of the input dataset. By leveraging perspective transformation, our approach optimizes the
dataset, maintains the desired objects’ features, and enhances them. This method effectively
tackles the common issue of limited data without incurring additional computational
costs during training. In fact, it often reduces training time compared to using regular
or cropped images, making it a practical and efficient solution. Unlike other approaches
that are computationally intensive and dependent on high-quality data, our contribution
ensures better utilization of the current dataset, enhancing the robustness and real-world
applicability of pothole detection systems without the trade-offs associated with complex
algorithmic fusions.
3. Methodology
Potholes significantly impact vehicles and road users, increasing the likelihood of
hazardous situations. Therefore, implementing an effective early detection system for
potholes is crucial to mitigate potential risks and prevent undesirable or harmful incidents.
In this section, we present a detailed breakdown of our proposed approach detailing the
techniques employed in this work.
YOLOv5 [28] is tailored for an input spatial resolution of 640 × 640. Although model input resolutions have trended upward, they remain quite small relative to raw camera output resolutions. Further, training these models on larger
input resolution negatively impacts inference times [28,29], which is not ideal for real-time
applications. This necessitates resizing the images to a predetermined, smaller size for
effective training and generalization. However, when these images are resized, the features
of small objects, such as potholes, can become significantly less discernible. Potholes may
appear very small relative to the overall image size, resulting in insufficient features for
the model to detect and differentiate them effectively, as shown in Figure 3. Furthermore,
portions of the image that do not contain regions of interest, such as sidewalks or the
sky, are often retained, limiting the focus on the relevant areas necessary for pothole de-
tection. To improve detection accuracy, it is beneficial to adjust the image perspective to
emphasize the road—the primary ROI—while minimizing the inclusion of non-essential
parts of the image. This approach ensures that more useful features are preserved after
rescaling, enhancing the model’s ability to detect potholes without significantly increasing
computational overhead [31,32].
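To make the loss concrete, the following back-of-the-envelope sketch shows how few pixels a distant pothole retains after naive downscaling. The dataset resolution and network input size are from this paper; the pothole dimensions are an assumption for illustration:

```python
# Pixel footprint of a pothole after naive downscaling (illustrative numbers).
raw_w, raw_h = 3680, 2760        # raw dataset resolution (Section 4.1)
net_in = 640                     # YOLOv5 input resolution
pothole_w, pothole_h = 80, 40    # an assumed distant pothole, in raw pixels

scale_x, scale_y = net_in / raw_w, net_in / raw_h
print(f"after resize: {pothole_w * scale_x:.0f} x {pothole_h * scale_y:.0f} px")
# -> roughly 14 x 9 px, leaving very few features for the detector
```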
Figure 2. Overview of the proposed framework. Raw input images are initially transformed using the
transformation matrix generated by our proposed automated algorithm. Then, the resulting images
are rescaled to the required input resolution and fed to the object detection network (e.g., YOLOv5).
Ground truth pothole labels and predicted potholes are represented by the green and red bounding boxes,
respectively. Neural network icon by Lucas Rathgeb via TheNounProject.com (accessed on 1 August 2024).
Figure 3. Comparison of the resulting preprocessed input images between (a) the naive approach and
(b) the automated perspective transformation approach. The naive approach involves reading the
image as is and then downscaling to a fixed input resolution (800 × 800 in this example). A 50 × 50
image crop demonstrates very low resolution for the potholes in the scene. Instead, our proposed
approach transforms the image to mainly focus on the ROI (i.e., the street) where, after rescaling
to the same input resolution, the resulting spatial resolutions of the potholes are much larger with
clearer image features as depicted by the 50 × 50 image crops.
The transformation involves selecting corresponding source and target points, computing the homography, and warping the image using a 3 × 3 homography matrix M. This process emphasizes relevant features, thereby enhancing detection accuracy, as follows:

t′ = M · t, (1)
where t′ and t are the coordinates of the ROI in the transformed and source images, respectively.
The homography matrix M is calculated using the corresponding points as explained earlier,
aiding in feature extraction and pothole bounding-box regression by mimicking a bird’s
eye view for more robust pothole detection in AVs. Moreover, it holds the transformation
parameters that map points from the source image to the target image [39].
The homography matrix M can be represented as follows:

    ⎡ m11  m12  m13 ⎤
M = ⎢ m21  m22  m23 ⎥    (2)
    ⎣ m31  m32  m33 ⎦
Each element (mij ) within the matrix contributes to the transformation:
• m11 , m12 , m13 : Affect the x-coordinate of the transformed point.
• m21 , m22 , m23 : Affect the y-coordinate of the transformed point.
• m31 , m32 , m33 : Homogenization factors (usually, m33 is set to 1).
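As an illustration, the sketch below computes and applies such a homography with OpenCV; the corner coordinates are hypothetical placeholders rather than values produced by our algorithm:

```python
import cv2
import numpy as np

# Hypothetical ROI corners in the source image and their targets in the output frame
# (order: top-left, top-right, bottom-left, bottom-right).
src_pts = np.float32([[900, 1300], [2800, 1300], [100, 2700], [3500, 2700]])
trg_pts = np.float32([[0, 0], [3680, 0], [0, 2760], [3680, 2760]])

# 3x3 homography matrix M mapping the ROI corners onto the full image frame.
M = cv2.getPerspectiveTransform(src_pts, trg_pts)

# Warp the whole image: every pixel t maps to t' = M . t (Equation (1)).
image = cv2.imread("frame.jpg")
warped = cv2.warpPerspective(image, M, (image.shape[1], image.shape[0]))

# The same mapping for a single point, written out in homogeneous coordinates.
t = np.array([1500.0, 1800.0, 1.0])      # (x, y, 1)
t_prime = M @ t
x_new, y_new = t_prime[:2] / t_prime[2]  # divide by the homogenization factor
```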
Applying perspective transformation to a set of images requires manually selecting
the boundary points of the ROI. In our application, this ROI would primarily be the road
as viewed from the perspective of the vehicle itself. However, the boundaries of the
road change significantly from one scene to another depending on many factors, such as
the type of road, curvature, number of lanes, etc. Therefore, an ideal application of this
technique would be to select the corners of the road for each given image. However, this
approach is not feasible for object detection applications due to its time-consuming and
error-prone nature, especially when dealing with different scenes that include not only
straight streets but street curvatures, u-turns, etc. Manually specifying four points for each
image based on the shape of the road is labor-intensive and introduces significant variability
and inaccuracies, making it unsuitable for large-scale datasets. Furthermore, per-image point selection would require generating a new transformation matrix for every image, adding run-time overhead. Therefore, automating this process is essential to ensure a
usable workflow and to generalize the technique to most, if not all, datasets with similar
structures. To address this challenge, we propose an algorithm that automatically finds a set
of ROI corner points that enables the generation of the perspective transformation matrix
M. Using this matrix, a set of images from the same source can be similarly transformed to
better represent the ROI. Consequently, this approach ensures consistency and precise ROI
selection in any given dataset of the same source.
To achieve an optimal transformation, it is crucial to accurately identify the best
ROI that encompasses all bounding boxes of potholes within an image, ensuring that no
potholes are excluded. The automatic transformation process involves determining the
ROI by calculating the coordinates of all bounding boxes to establish four defining points.
Specifically, the boundary coordinates of the ROI are determined as follows: the top-left
point is defined by the minimum x and y coordinates among all bounding boxes, the
top-right point by the maximum x and minimum y coordinates, the bottom-left point by
the minimum x and maximum y coordinates, and the bottom-right point by the maximum
x and y coordinates. An offset value α is added to each of these points to ensure that the
ROI extends slightly beyond the boundaries of the bounding boxes. This offset allows for
full coverage of the image or ROI boundaries, depending on how much extension the user
desires. This approach guarantees that the ROI fully covers the outermost boundaries of the
bounding boxes, thereby ensuring comprehensive inclusion of all potholes. Additionally,
the alpha value was tested at 0.2, 0.1, and 0 to control the inclusion of background features,
particularly near the top of the image. A higher alpha value (e.g., 0.2) allows more back-
ground context, helping the model better differentiate between objects and background,
especially when bounding boxes are close to the image edge or not perfectly accurate. The
optimal alpha value was chosen to balance the inclusion of background features with the
model’s ability to handle imperfect bounding box annotations.
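A minimal sketch of this corner computation is shown below; the function name and the exact form of the offset are our own illustration, assuming the offset is derived from the maximum bounding-box dimensions scaled by α:

```python
import numpy as np

def roi_corners(boxes, img_w, img_h, alpha=0.2):
    """Derive the four ROI corner points from all pothole bounding boxes.

    boxes: iterable of [x_min, y_min, x_max, y_max] in pixels, aggregated over
    the training set. alpha scales the offset extending the ROI slightly beyond
    the outermost boxes (0, 0.1, and 0.2 were tested in this work).
    """
    boxes = np.asarray(boxes, dtype=np.float32)
    x_min, y_min = boxes[:, 0].min(), boxes[:, 1].min()
    x_max, y_max = boxes[:, 2].max(), boxes[:, 3].max()

    # Offset proportional to the largest box dimensions (our reading of the
    # algorithm: alpha times the maximum bounding-box width/height).
    dx = alpha * (boxes[:, 2] - boxes[:, 0]).max()
    dy = alpha * (boxes[:, 3] - boxes[:, 1]).max()

    corners = np.float32([
        [x_min - dx, y_min - dy],  # top-left: min x, min y
        [x_max + dx, y_min - dy],  # top-right: max x, min y
        [x_min - dx, y_max + dy],  # bottom-left: min x, max y
        [x_max + dx, y_max + dy],  # bottom-right: max x, max y
    ])
    # Clip the corners so the ROI stays within the image boundaries.
    corners[:, 0] = corners[:, 0].clip(0, img_w - 1)
    corners[:, 1] = corners[:, 1].clip(0, img_h - 1)
    return corners
```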
Implementing automatic perspective transformation assumes a labeled object detection dataset. The algorithm is then applied to the selected images and their corresponding labels, following the steps outlined in Algorithm 1 below. The transformed images and label files are saved to their designated output directories.
Algorithm 1: Automated ROI determination and perspective transformation matrix generation (returns M).
Algorithm 1 automates the process of determining the ROI and generating the per-
spective transformation matrix M. The main steps of the algorithm are as follows:
1. Initialize lists: Store coordinates and dimensions of bounding boxes for all images,
including minimum and maximum x and y coordinates, width, and height for each
bounding box.
2. Read bounding boxes: Extract bounding box data from each image’s corresponding
label file, calculate all ROI boundary points, and update the respective lists.
3. Calculate offsets: Determine the ROI offsets using a specific α value and the max
width and height of all bounding boxes to define a slightly larger ROI.
4. Determine ROI corners: Use the minimum and maximum coordinates from the lists,
along with the calculated offsets, to determine the corners of the ROI. These corners
are the source points (src_pts) for the perspective transformation.
5. Clip ROI corners: Ensure ROI corners stay within image boundaries.
6. Define target points: Set target points (trg_pts) based on the image dimensions,
representing the transformed image corners.
7. Calculate transformation matrix: Compute the perspective transformation matrix M
using the source and target points. This matrix is used to transform the coordinates of
the ROI to the new perspective.
8. Transform images and bounding boxes: Apply the transformation matrix M to each
image and its bounding boxes. This involves transforming the image and adjusting
the bounding box coordinates accordingly. The transformed images and bounding
boxes are then saved.
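A condensed sketch of steps 6–8 could look as follows; the function names and the axis-aligned recovery of warped boxes are our own illustration under the assumptions above, not a verbatim transcription of Algorithm 1:

```python
import cv2
import numpy as np

def build_matrix(src_pts, img_w, img_h):
    # Steps 6-7: the target points are the output image corners; M maps the
    # ROI onto the full frame. Corner order: TL, TR, BL, BR.
    trg_pts = np.float32([[0, 0], [img_w, 0], [0, img_h], [img_w, img_h]])
    return cv2.getPerspectiveTransform(np.float32(src_pts), trg_pts)

def transform_sample(image, boxes, M):
    # Step 8: warp the image and remap each bounding box with the same matrix M.
    h, w = image.shape[:2]
    warped = cv2.warpPerspective(image, M, (w, h))
    new_boxes = []
    for x_min, y_min, x_max, y_max in boxes:
        corners = np.float32([[[x_min, y_min]], [[x_max, y_min]],
                              [[x_min, y_max]], [[x_max, y_max]]])
        tc = cv2.perspectiveTransform(corners, M).reshape(-1, 2)
        xs, ys = tc[:, 0].clip(0, w - 1), tc[:, 1].clip(0, h - 1)
        # Axis-aligned box enclosing the four warped corners.
        new_boxes.append([xs.min(), ys.min(), xs.max(), ys.max()])
    return warped, new_boxes
```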
The algorithm detailed in Algorithm 1 was initially applied to the training dataset to
compute the perspective transformation matrix M. Subsequently, this matrix was utilized
to transform the images and bounding boxes in the testing dataset. The proposed algorithm
automates the perspective transformation process, ensuring consistent and precise selection
of the region of interest (ROI) across extensive datasets. This automation minimizes
manual intervention and the associated variability in selecting ROI corners for each image,
making it an effective and scalable solution for applications like pothole detection in
autonomous vehicles.
4. Experiment Design
4.1. Evaluation Dataset
To evaluate our proposed method, we utilized the dataset introduced by Nienaber et al.
in [14,40]. This dataset, one of the few publicly available labeled pothole datasets, comprises
4405 images extracted from video footage captured with a GoPro camera mounted on a
vehicle’s windshield. Unlike most of the other pothole datasets collected using mobile
phones or drones, this dataset provides a realistic representation of South African road
conditions from a driver’s perspective, making it particularly relevant for applications
involving AVs and ground mobile robots. The dataset is organized into positive and negative directories: positive samples contain at least one pothole instance (1119 images in total), while negative samples contain no potholes (2658 images in total). Each image is provided with a label file specifying, for each pothole, the class label and bounding box coordinates, width, and height. Moreover, the dataset is divided into training and testing subsets, with 628 images
designated for testing. All images are provided in JPEG format with a resolution of
3680 × 2760 pixels. Figure 4 showcases six representative samples from the dataset,
illustrating the challenges posed by varying illumination levels and pothole appearances,
which are critical for developing robust and accurate pothole detection systems.
Figure 4. Demonstration of different samples of the dataset used in this work. These samples are
some examples of the variance in lighting intensity and road conditions observed in this dataset.
The dataset presents several significant challenges, particularly given the nature of the
objects it aims to detect—potholes. Potholes vary widely in size, shape, and appearance,
making them inherently difficult to detect. These variations are further exacerbated by the
fact that potholes at greater distances appear smaller, complicating the task of accurately
identifying them. Moreover, the color of potholes can differ depending on the surrounding
environment, such as sandy areas, pavement, or other types of ground surfaces. This
variation in appearance makes it challenging for a model to generalize across different
scenarios, as the model must learn to recognize potholes in various contexts and lighting
conditions. The difficulty of this task is amplified by the relatively small size of the dataset.
Detecting small objects like potholes typically requires a large dataset to effectively learn
the complex features necessary for accurate detection. To overcome this challenge, we apply various augmentation techniques to enlarge the training set and enhance the robustness and accuracy of the pothole detection model. These include affine scaling, rotation, and shearing, which adjust the image size, orientation, and viewpoint to help the model recognize potholes of different sizes and
angles. Horizontal flipping provides different perspectives, whereas Gaussian blur mimics
motion blur to handle imperfect image captures. Adjustments to gamma contrast, bright-
ness, and contrast normalization manage varying lighting conditions, ensuring that the
model performs well under different environments. Additionally, additive Gaussian noise
is added to make the model resilient to grainy images, and crop and pad transformations
simulate occlusions and varying distances from the camera. As a result, the number of
training images increased from 1119 images to 2658 images. These augmentations simulate
real-world conditions, helping the model generalize better and improve pothole detection
performance under diverse scenarios encountered by autonomous vehicles. By creating a
diverse and representative training dataset, the model becomes more robust and capable of
accurately detecting potholes in various challenging conditions.
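These transformations map naturally onto off-the-shelf augmentation libraries. The sketch below assembles one plausible pipeline with imgaug; the parameter ranges and probabilities are assumptions for illustration, not the exact values used in this work:

```python
import cv2
import imgaug.augmenters as iaa
from imgaug.augmentables.bbs import BoundingBox, BoundingBoxesOnImage

# One plausible pipeline mirroring the augmentations described above.
seq = iaa.Sequential([
    iaa.Affine(scale=(0.8, 1.2), rotate=(-10, 10), shear=(-8, 8)),  # size/orientation/viewpoint
    iaa.Fliplr(0.5),                                         # horizontal flipping
    iaa.Sometimes(0.3, iaa.GaussianBlur(sigma=(0.0, 1.5))),  # mimic motion blur
    iaa.GammaContrast((0.7, 1.3)),                           # gamma contrast
    iaa.Add((-20, 20)),                                      # brightness shifts
    iaa.LinearContrast((0.75, 1.25)),                        # contrast normalization
    iaa.AdditiveGaussianNoise(scale=(0, 0.03 * 255)),        # grainy images
    iaa.CropAndPad(percent=(-0.1, 0.1)),                     # occlusions / distance changes
])

# imgaug transforms the bounding boxes together with the image,
# keeping the labels consistent with the augmented pixels.
image = cv2.imread("positive_sample.jpg")
bbs = BoundingBoxesOnImage(
    [BoundingBox(x1=1200, y1=1600, x2=1400, y2=1700, label="pothole")],
    shape=image.shape)
image_aug, bbs_aug = seq(image=image, bounding_boxes=bbs)
```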
4.2. Evaluation Metrics

Intersection over union (IoU) quantifies the overlap between a predicted bounding box and a ground-truth box, as defined in Equation (3).

IoU = Area of Intersection / Area of Union (3)
Precision is the ratio of TPs to the sum of TPs and FPs, indicating the accuracy of the
positive predictions made by the model. It is defined as shown in Equation (4).
Precision = TP / (TP + FP) (4)
Recall measures the proportion of actual positives correctly identified by the model,
calculated as the ratio of TPs to the sum of TPs and false negatives (FNs). This is expressed
in Equation (5).
Recall = TP / (TP + FN) (5)
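Equations (3)–(5) translate directly into code. Below is a minimal sketch with our own helper functions and a small worked example:

```python
def iou(box_a, box_b):
    """IoU of two [x_min, y_min, x_max, y_max] boxes (Equation (3))."""
    ix_min, iy_min = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix_max, iy_max = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision(tp, fp):
    return tp / (tp + fp)   # Equation (4)

def recall(tp, fn):
    return tp / (tp + fn)   # Equation (5)

# A prediction counts as a TP when its IoU with a ground-truth box meets the
# threshold (e.g., 0.5 for AP50). Two half-overlapping boxes give IoU = 1/3:
assert abs(iou([0, 0, 10, 10], [5, 0, 15, 10]) - 1 / 3) < 1e-9
```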
Average precision (AP), derived from the precision–recall curve, is calculated by
integrating the area under this curve. AP at a specific IoU threshold (e.g., AP50 for
IoU ≥ 0.5) represents the precision averaged across different recall levels at that threshold.
AP50:95 refers to the average precision computed at multiple IoU thresholds ranging from
0.5 to 0.95 with a step size of 0.05. This metric provides a comprehensive evaluation of
model performance across various IoU thresholds. AP50 and AP75 specifically denote AP
at IoU thresholds of 0.5 and 0.75, respectively, offering insights into model precision at
different levels of overlap criteria.
Average recall (AR) reflects the average recall over different numbers of detections
per image, providing an aggregate measure of the model’s ability to identify relevant
instances among all actual positives. ARmax=1 and ARmax=10 denote the average recall
when considering a maximum of one detection per image and ten detections per image,
respectively, across IoU thresholds of 0.5 to 0.95. These metrics help to evaluate the model’s
recall capability, considering different levels of detection strictness.
These metrics collectively offer a detailed assessment of the detection model’s perfor-
mance, highlighting its strengths and weaknesses across various detection thresholds and
conditions.
We first evaluated the detection performance using a single class (pothole). Subsequently, we expanded the analysis to include
three classes (near, medium, and far) by categorizing the bounding boxes based on the
y-coordinates of their top-left corners, with each region representing a different class. This
classification aimed to measure the effectiveness of our approach in enhancing the detec-
tion of potholes at different distances. We conducted a comparative analysis against other
dataset processing techniques, applying these evaluation strategies to each dataset using
predefined thresholds as follows:
For the Image As Is and the cropping methods:
• Far: y ≤ 1350
• Medium: 1351 ≤ y ≤ 1500
• Near: 1501 ≤ y
For the Automatic Transformation approach:
• Far: y ≤ 670
• Medium: 671 ≤ y ≤ 1099
• Near: 1100 ≤ y
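In code, this categorization reduces to a simple threshold check on each box's top-left y-coordinate (a hypothetical helper; boxes lower in the frame, i.e., with larger y, are closer to the vehicle). Shown here with the automatic-transformation thresholds:

```python
def distance_class(y_top_left, far_max=670, medium_max=1099):
    """Assign a distance category from a box's top-left y-coordinate (pixels)."""
    if y_top_left <= far_max:
        return "far"
    if y_top_left <= medium_max:
        return "medium"
    return "near"

# For the Image As Is and cropping methods, the same helper would be called
# with far_max=1350 and medium_max=1500.
```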
Based on the transformation outcomes, we selected an α value of 0.2 for the automatic transformation algorithm
presented in Algorithm 1.
Table 1. Experiment 1 results. This experiment compares the different approaches presented in this
work by fine-tuning YOLOv5-small under each configuration and then evaluating their performance
on the test set using various object detection metrics. Our proposed approach demonstrates superior
performance across all metrics and pothole distance categories.
Approach | Pothole Distance | AP50:95 (%) | AP50 (%) | AP75 (%) | ARmax=1 (%) | ARmax=10 (%)
Image As Is | All | 19.8 | 47.7 | 12.8 | 17.1 | 25.8
The significant improvements observed in our results are due to the effectiveness
of the automatic perspective transformation approach, which virtually brings potholes
closer to the vehicle, amplifying their features and making them more discernible to the
detection model, as Figure 3 shows. This perspective adjustment enhances the model’s
ability to learn and recognize pothole patterns, resulting in more accurate detections.
The amplification of pothole features simplifies the learning process for the YOLOv5
model, leading to significant improvements in average precision across various classes and
IoU thresholds.
J. Imaging 2024, 10, 227 17 of 22
The proposed approach not only improved the overall pothole detection performance
but also excelled in detecting the more challenging cases, particularly medium- and far-
distance potholes, which are the most critical for safety. Our method significantly improved
the detection accuracy for far potholes, an area where other methods have notably un-
derperformed. The consistent performance gains across different IoU thresholds validate
the robustness of our approach. Traditional detection methods struggle with varying per-
spectives and angles, while our method standardizes these perspectives, offering a more
uniform dataset for the model to train on, which is crucial for real-world applications. The
success of our approach in enhancing pothole detection accuracy has broader implications
for other object detection tasks, potentially leading to advancements in multiple areas of
computer vision.
Table 2. Experiment 2 results. This experiment compares the naive (i.e., Image As Is) approach with
our proposed approach on three YOLOv5 variants. In each configuration, a YOLOv5 variant is fine-
tuned on the corresponding approach’s training set and then evaluated on the test set using various
object detection metrics. Results show that our proposed approach always surpasses the performance
of the naive approach regardless of the utilized variant. Additionally, combining YOLOv5-small with
our proposed approach significantly outperforms the naive approach even when compared to the
YOLOv5-large configuration, for which the model is over six times larger in terms of the number of
parameters, across all metrics and pothole distance categories.
Approach | Model | Pothole Distance | AP50:95 (%) | AP50 (%) | AP75 (%) | ARmax=1 (%) | ARmax=10 (%)
Image As Is | YOLOv5-medium | All | 21.2 | 49.0 | — | — | —
Image As Is | YOLOv5-medium | Near | 25.0 | 55.1 | 18.1 | 21.1 | 31.1
Image As Is | YOLOv5-medium | Medium | 21.9 | 51.8 | 15.1 | 22.9 | 29.3
Image As Is | YOLOv5-medium | Far | 7.4 | 25.0 | 1.7 | 9.6 | 12.3
Image As Is | YOLOv5-large | All | 21.7 | 50.5 | 14.2 | 17.5 | 27.7
Notably, YOLOv5-small combined with our proposed approach even outperforms YOLOv5-large applied to the baseline approach, although YOLOv5-large has almost six times the
number of parameters of YOLOv5-small, as explained in Section 3.1.2. This highlights the
effectiveness of our method in detecting potholes across various distances while requiring
lower computational resources compared to the traditional approach.
The results in Table 2 reveal that our proposed approach significantly enhances detec-
tion accuracy across all YOLOv5 variants for every class compared to the baseline method.
The better performance of the YOLOv5-small variant under the AP50 metric is particularly
noteworthy, which outperformed both the medium and large variants when using our
method. We hypothesize that this counterintuitive result stems from the larger and medium
YOLOv5 models being more susceptible to the poor quality of some labels, potentially
learning and incorporating these inaccuracies into their detection processes more than the
smaller variant under the lower IoU threshold. Consequently, the small version’s relatively
simpler architecture may have enabled it to generalize better and avoid overfitting to the
noisy data, resulting in enhanced detection accuracy [41]. This finding underscores the
effectiveness of our automatic perspective transformation approach and suggests that
smaller, less complex models can be more robust in scenarios where data quality is variable,
offering valuable insights for similar projects in object detection.
Table 3. Ablation study results comparing the different preprocessing configurations. These results
are based on the automated transformation approach on all test set potholes using YOLOv5-small.
Including the preprocessing augmentations while excluding the negative samples (training images
without potholes) produced the best performance across all metrics.
Moreover, the results show that the configuration combining manual preprocessing augmentations with only the positive dataset consistently outperformed all other configurations in pothole detection accuracy. This setup, which excluded negative images and
relied on extensive augmentations, proved superior across all metrics. We hypothesize that
the inclusion of negative images introduced noise into the training process, as these images
lack bounding boxes or pothole features, which are critical for improving the model’s
robustness and pattern recognition capabilities. Additionally, the extensive use of manual
preprocessing augmentations exposed the model to a wider variety of pothole shapes,
colors, and orientations, enhancing its ability to generalize across different scenarios. In
contrast, relying solely on YOLOv5’s framework augmentation step limited the model’s
exposure to diverse cases, thereby restricting its generalization potential.
Table 4. Computational latency analysis of different components within the pothole detection pipeline
across various YOLOv5 model variants. Specifically, we measure the average latency of each of the
pre-processing, perspective transformation, inference, and post-processing stages.
Object Detection Model | Pre-Processing (ms) | Perspective Transformation (ms) | Inference (ms) | Post-Processing (ms) | Total Time (ms)
YOLOv5-small | 0.6 | 14.4 | 7.2 | 1.2 | 23.4
YOLOv5-medium | 0.6 | 14.4 | 14.3 | 1.4 | 30.7
YOLOv5-large | 0.6 | 14.4 | 25.5 | 1.3 | 41.8
The best result is highlighted in bold.
Despite the small overhead, this technique significantly enhances pothole detection
performance, as evidenced by our experimental results. This demonstrates the efficiency
and effectiveness of incorporating perspective transformation in our detection pipeline.
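For reproducibility, per-stage latency can be approximated by timing each stage in isolation. Below is a rough sketch for the warp stage alone; the setup is hypothetical and actual figures depend on hardware and image resolution (cf. Table 4):

```python
import time
import cv2
import numpy as np

# Time the perspective-transformation stage on a frame of the dataset's resolution.
image = np.zeros((2760, 3680, 3), dtype=np.uint8)  # stand-in for a real frame
M = np.eye(3)                                      # stand-in transformation matrix

runs = 100
start = time.perf_counter()
for _ in range(runs):
    cv2.warpPerspective(image, M, (3680, 2760))
elapsed_ms = (time.perf_counter() - start) / runs * 1e3
print(f"perspective transformation: {elapsed_ms:.1f} ms per frame")
```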
6. Conclusions
In this paper, we introduced a novel method for improving pothole detection by
leveraging perspective transformation to automatically extract ROI from images and their
corresponding labels. The transformed dataset was then fed into the YOLOv5-small
object detection model. Our approach resulted in a notable improvement in detection
accuracy using YOLOv5-small, achieving a 43% increase in AP for a single class at IoU
thresholds of 0.5 to 0.95 (AP50:95 ), compared to the naive use of unchanged images. Similarly,
improvements of 29% and 32% in the same metric were achieved for YOLOv5-medium and YOLOv5-large, respectively. In addition, the method significantly improved the detection of potholes at various distances, a crucial aspect of road safety, achieving increases of 34%, 63%, and 194% in the same metric (AP50:95) for near, medium, and far potholes, respectively. Moreover, Table 2 shows further improvement
using both YOLOv5-medium and YOLOv5-large. The findings underscore the critical
role of preprocessing techniques, such as perspective transformation, in enhancing the
performance of object detection tasks.
For future work, we propose developing a deep-learning model capable of dynamically
regressing the four corner points of the street in each image to generate a perspective
transformation matrix. This approach would necessitate labeled data, potentially obtainable
from semantic segmentation datasets, to further automate and refine the pre-processing
pipeline.
Author Contributions: Conceptualization, A.A.-r., Z.A.E.-S., and S.R.; data curation, A.A.-r.; method-
ology, A.A.-r., Z.A.E.-S., and S.R.; software, A.A.-r.; validation, Z.A.E.-S.; formal analysis, A.A.-r.
and Z.A.E.-S.; investigation, A.A.-r.; resources, S.R.; writing—original draft preparation, A.A.-r.
and Z.A.E.-S.; writing—review and editing, Z.A.E.-S. and S.R.; visualization, A.A.-r. and Z.A.E.-S.;
supervision, Z.A.E.-S. and S.R.; project administration, S.R. All authors have read and agreed to the
published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The original data presented in the study are openly available in Kaggle
at https://www.kaggle.com/datasets/sovitrath/road-pothole-images-for-pothole-detection/data
(accessed on 1 August 2024).
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Tamrakar, N.K. Overview on causes of flexible pavement distresses. Bull. Nepal Geol. Soc. 2019, 36, 245–250.
2. Wada, S.A. Bituminous pavement failures. J. Eng. Res. Appl. 2016, 6, 94–100.
3. Adlinge, S.S.; Gupta, A. Pavement deterioration and its causes. Int. J. Innov. Res. Dev. 2013, 2, 437–450.
33. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32.
34. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object
detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada,
17–24 June 2023; pp. 7464–7475.
35. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLO. Available online: https://github.com/ultralytics/ultralytics (accessed on 4
August 2024).
36. Nozick, V. Multiple view image rectification. In Proceedings of the 2011 1st International Symposium on Access Spaces (ISAS),
Yokohama, Japan, 17–19 June 2011; pp. 277–282.
37. El Shair, Z.; Rawashdeh, S. High-temporal-resolution event-based vehicle detection and tracking. Opt. Eng. 2023, 62, 031209.
[CrossRef]
38. Kocur, V.; Ftáčnik, M. Detection of 3D bounding boxes of vehicles using perspective transformation for accurate speed
measurement. Mach. Vis. Appl. 2020, 31, 62. [CrossRef]
39. Barath, D.; Hajder, L. Novel Ways to Estimate Homography from Local Affine Transformations; Distributed Event Analysis Research
Laboratory: Budapest, Hungary, 2016.
40. Nienaber, S.; Kroon, R.; Booysen, M.J. A comparison of low-cost monocular vision techniques for pothole distance estimation. In Proceedings of the 2015 IEEE Symposium Series on Computational Intelligence, Cape Town, South Africa, 7–10 December 2015; pp. 419–426.
41. Quach, L.D.; Quoc, K.N.; Quynh, A.N.; Ngoc, H.T. Evaluating the effectiveness of YOLO models in different sized object detection
and feature-based classification of small objects. J. Adv. Inf. Technol. 2023, 14, 907–917. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.