A Novel Convolutional Neural Network For Enhancing The Continuity of Pavement Crack Detection
Keywords Pavement crack detection, Weighted edge cross entropy loss function, Deep learning, Corrected
upsampling bias
Highways serve as the backbone of the national comprehensive transportation system, playing a vital foundational
role in the overall socioeconomic development of the country. Due to factors such as road structure, climate
conditions, and traffic loads, roads often suffer varying degrees of damage. Therefore, comprehensive promotion
of technical condition detection is a necessary means to improve the scientific decision-making level of highway
maintenance. Intelligent detection of road surface cracks is a major technological bottleneck in this field.1
Traditional image processing techniques for road surface crack segmentation fall into several
categories: filtering-based segmentation2, segmentation based on texture and fractal geometric
features3, threshold-based segmentation4, edge detection-based segmentation5, and methods based on minimum
distance6. Although traditional crack extraction algorithms have low computational costs, they are sensitive
to lighting conditions and camera imaging quality, making it difficult to extract crack features directly
from the original images. The performance of these methods largely depends on the quality of the images being
processed.
With the rapid development of computer vision technology, machine learning methods identify cracks by
learning patterns on the image surface, which can mitigate the interference of background noise7. The main
techniques include Random Forest8, Support Vector Machine9, and Artificial Neural Networks10, among others.
Because traditional machine learning methods rely on manually set color or texture features to model
cracks, they depend on domain experts to extract features, so feature quality depends heavily on expert
judgment. The manually set features can only satisfy crack detection under certain specific conditions. When new
crack environments emerge, these methods require reconfiguration, making them unable to meet the detection
requirements for all road crack scenarios.
1School of Geomatics, Liaoning Technical University, Fuxin 123000, China. 2Collaborative Innovation Institute
of Geospatial Information Service, Liaoning Technical University, Fuxin 123000, China. 3State Key Laboratory of
Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China.
email: shangyu_sun@126.com
In recent years, deep learning has been widely used in the field of computer vision, which brings new
opportunities for automatic identification of pavement cracks through automatic learning instead of manual
feature setting11. Researchers achieve automatic identification and extraction of road cracks by constructing
various deep convolutional neural network models and iteratively training them using a dataset of road crack
samples. Wang et al.12 proposed a Rectangular Convolutional Pyramid and Edge Enhanced Network, which utilizes a deep
network architecture to construct a rectangular convolutional pyramid module to describe crack features of
different structures. Then, through hierarchical feature fusion refinement modules and boundary refinement
modules, they effectively promote the fusion of features at different scales. Tang et al.13 proposed EDNet to
address the issue of class imbalance in crack segmentation. The encoder fits feature maps with road surface
images, enhancing segmentation accuracy, while the decoder generates feature maps from ground truth
images in an autoencoding manner, reducing the imbalance between crack and non-crack pixels. Guo et al.14
proposed BARNet, a network that adaptively adjusts and refines crack boundaries. However, it requires manual
adjustment of penalty weights for different types of cracks. Qu et al.15 proposed a deep supervised convolutional
neural network for crack detection, utilizing a multi-scale convolutional feature fusion module. High-level
features are directly introduced into low-level features at different convolutional stages, providing integrated
direct supervision for convolutional feature fusion. Al-Huda et al.16 proposed a road surface crack segmentation
network based on class activation maps and an encoder-decoder architecture, which fuses the crack localization map
generated by a classification network with an encoder and then achieves accurate segmentation of road surface
cracks through a decoder network. Yu et al.17 proposed CCapFPN, which enhances the accuracy of crack
detection by integrating features from different levels and scales. Yang et al.18 proposed PAFNet
for road crack segmentation, which addresses the issue of information loss in crack detection through context
fusion, dual attention, and dynamic weight learning. Jaziri et al.19 introduced a fractal-based crack simulator
along with a corresponding crack dataset. They generated crack images using simulation techniques and
achieved generalization to real cracks through effective learning methods.
The Transformer model was initially designed for natural language processing tasks. However, with further
research, it has also been successfully applied in the field of computer vision. Some Transformer-based models
and methods have also achieved success in crack detection. CrackFormer20 adopts the SegNet architecture and
introduces self-attention blocks and local attention. It enhances crack detection clarity through multi-stage lateral
fusion. Another CrackFormer21 employs a multi-scale window strategy, utilizing four parallel feature extraction
branches for local and global crack feature extraction. The model undergoes multiple stages of transformation,
gradually reducing spatial resolution while increasing feature channel dimensions. It merges multi-scale feature
representations to enhance performance. Compared to traditional convolutional neural networks, Transformers
perform better in handling long range dependencies. However, they lack the ability to capture local relationships
and have high computational complexity. As a result, some researchers have begun to combine CNNs and
Transformers for crack detection tasks. Quan et al.22 proposed a model for crack extraction by utilizing a hybrid
CNN and Transformer architecture. They leverage the advantages of convolutional neural networks in capturing
local correlations while combining the strengths of Transformers in modeling global relationships to enhance fine
extraction of crack boundaries. Bai et al.23 proposed a Dual Encoding Multi-Scale Fusion Network (DMFNet)
based on CNN and Transformer networks. By learning global and local feature interactions, they introduced
attention enhancement and deep supervision mechanisms, achieving efficient crack detection. Guo et al.24
utilized the Swin Transformer as an encoder to provide global crack semantic features and employed UperNet as
a decoder to retrieve more detailed crack information, thus enhancing the accuracy of crack detection. Wang et
al.25 proposed CGTrNet, which incorporates a Transformer and convolutional feature fusion module to address
the issue of dimension inconsistency and semantic gap between CNN and Transformer outputs. This effectively
integrates both local and global information of cracks.
While existing crack detection and segmentation models have made significant progress in automation and
accuracy, they still face several challenges. Because cracks are slender, the network may fail to cover
a sufficiently long area to maintain continuity. Even when continuous crack features exist, they may be only
partially covered by some convolution kernels, leaving the network unable to fully extract continuous cracks.
Additionally, pooling operations reduce the resolution of feature maps, which may lead to the loss or blurring
of part of the cracks, further affecting continuity. To address this problem, this paper proposes CPCDNet, a
network designed to maintain continuity in crack extraction. The main contributions of this paper are as follows:
1) Cracks, being long and narrow structures, typically appear as slender and curved features in images.
Traditional convolutional neural networks, while performing exceptionally well in many image processing tasks,
may lack sensitivity to such specific structures. In particular, traditional convolutional kernels may not
adequately capture the details and shape variations of long, narrow features like cracks. To address this issue,
this paper introduces the Dynamic Snake Convolution method, which dynamically adjusts the convolutional
kernels to better accommodate the elongated structure of cracks, thereby improving crack detection
performance.
2) In convolutional neural networks, the resolution of feature maps is often reduced due to downsampling
operations. During the upsampling phase, these low-resolution features need to be restored to the original
image size. However, due to the complexity of the downsampling process, pixel position discrepancies often
arise during the restoration, leading to cracks appearing broken or discontinuous. To address this issue, this
paper proposes the Crack Align Module, which uses offset values learned by the model to guide the
restoration of pixel values during upsampling, ensuring the continuity of crack structures.
3) A weighted edge cross entropy loss function has been designed, which adjusts weights by applying different
penalties based on the distance of each pixel point from the crack edge. Since pixels near the crack edges
often exhibit higher uncertainty, the distance transform values near the edges require smoothing. This paper
addresses the limited precision issue at the crack edges by attenuating the distance transform values near the
edges, thereby slowing down the model’s learning in these areas.
The remainder of this paper is organized as follows: “Related work” reviews crack extraction methods based on
convolution, feature fusion, and loss functions. Then, in “CPCDNet model overview”, we describe our proposed
model. In “Experiments and results”, we present and analyze experimental results. Finally, in “Conclusions”,
we summarize our work and discuss future prospects.
Related work
Because the method proposed in this paper draws on convolution-based, feature fusion-based, and loss
function-based crack detection methods, we introduce related work on each of these aspects in the following
subsections.
Figure 3. Feature map visualization during training with or without DSC addition. After adding DSC, the
focus on the ruler in the feature map is significantly reduced, while the crack extraction is significantly more
refined.
detection. However, since DSC introduces additional parameters, applying it to all layers of UNet may result in
excessive parameterization, increasing computational complexity. Given that cracks in images are relatively small
in proportion, we choose to add DSC only to the first layer of UNet to balance
model performance and computational efficiency. This approach allows us to retain sensitivity to elongated
structures in crack detection tasks while effectively controlling the number of parameters, avoiding excessive
model complexity. DSC is illustrated in Fig. 2.
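In Fig. 2, the changes along the x-axis and y-axis within the receptive field are given by the following equations:
$$K_{i\pm c} = \begin{cases} (x_{i+c}, y_{i+c}) = \left(x_i + c,\; y_i + \sum_{i}^{i+c} \Delta y\right), \\ (x_{i-c}, y_{i-c}) = \left(x_i - c,\; y_i + \sum_{i-c}^{i} \Delta y\right), \end{cases} \tag{1}$$
$$K_{j\pm c} = \begin{cases} (x_{j+c}, y_{j+c}) = \left(x_j + \sum_{j}^{j+c} \Delta x,\; y_j + c\right), \\ (x_{j-c}, y_{j-c}) = \left(x_j + \sum_{j-c}^{j} \Delta x,\; y_j - c\right), \end{cases} \tag{2}$$
where K represents the fractional positions in Eqs. (1) and (2), and K′ enumerates all integer spatial positions.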
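The coordinate rule in Eqs. (1) and (2) can be illustrated with a minimal sketch for the x-axis case. The function name `snake_kernel_x` and its arguments are hypothetical, and the sketch assumes that the vertical shift of the c-th sampling point is the running sum of the learned offsets for steps 1 to c on that side of the centre; the offsets themselves are passed in rather than learned.

```python
from itertools import accumulate


def snake_kernel_x(center, offsets_pos, offsets_neg):
    """Fractional sampling positions of a snake kernel laid along the x-axis.

    center      -- (x_i, y_i), the kernel centre
    offsets_pos -- vertical offsets (delta y) for steps c = 1..C to the right
    offsets_neg -- vertical offsets (delta y) for steps c = 1..C to the left
    The y coordinate of each point accumulates the offsets of every step
    between it and the centre, so neighbouring points stay connected.
    """
    x_i, y_i = center
    right = [(x_i + c, y_i + dy)
             for c, dy in enumerate(accumulate(offsets_pos), start=1)]
    left = [(x_i - c, y_i + dy)
            for c, dy in enumerate(accumulate(offsets_neg), start=1)]
    # Order the points from the far left to the far right of the kernel.
    return left[::-1] + [(x_i, y_i)] + right


points = snake_kernel_x((5, 5), [0.5, 0.25], [-0.5, 0.0])
```

Because each point inherits the shift of the point before it, the kernel bends gradually instead of scattering, which is the property that suits thin, curved cracks.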
As shown in Fig. 3, because the structure of the ruler does not match the elongated structure of the cracks
in the image, UNet with DSC pays significantly less attention to the ruler than the original UNet and
more attention to the cracks. This demonstrates the effectiveness of DSC.
a complex structure consisting of alligator crack, in which case the cracks are also very susceptible to extraction
discontinuities.
UNet requires continuous upsampling operations during the decoding stage, and after each upsampling,
it performs feature fusion with the corresponding part of the encoder. However, the upsampling method
using bilinear interpolation cannot accurately restore the position information of the crack edges, resulting in
discontinuities in crack extraction. This is because positional information is gradually lost during
downsampling, so different input images may yield the same output after downsampling, as
shown in Fig. 5. The pixel values at positions A and B in the same image both converge to position C after
downsampling. During upsampling, however, position C cannot determine which of the two original positions
should be restored. When the feature maps from the encoder are fused pixel-wise with these misaligned results,
incorrect fusion outcomes easily arise, and repeated upsampling and downsampling exacerbate this
misalignment.
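The ambiguity described above can be reproduced on a toy one-dimensional signal. This is only an illustrative sketch: it uses max pooling and nearest-neighbour upsampling (both chosen here for brevity, not taken from the paper), and the helper names are hypothetical.

```python
def max_pool_1d(row, size=2):
    """Non-overlapping max pooling over a 1-D list of pixel values."""
    return [max(row[k:k + size]) for k in range(0, len(row), size)]


def upsample_nearest_1d(row, size=2):
    """Repeat every value `size` times (nearest-neighbour upsampling)."""
    return [v for v in row for _ in range(size)]


crack_at_a = [0, 1, 0, 0]  # crack pixel at position A (index 1)
crack_at_b = [1, 0, 0, 0]  # crack pixel at position B (index 0)

pooled_a = max_pool_1d(crack_at_a)
pooled_b = max_pool_1d(crack_at_b)

# Both inputs collapse to the same low-resolution value (position C),
# so the restored signal cannot tell A from B.
restored = upsample_nearest_1d(pooled_a)
```

Here `pooled_a` and `pooled_b` are identical, and `restored` matches neither original input exactly, which is the kind of positional error that accumulates across repeated down- and upsampling stages.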
The Crack Align Module proposed in this paper addresses this issue, as shown in Fig. 6.
Figure 7. Feature map visualization during training with or without CAM addition. After adding CAM, the
extraction of tiny cracks in the feature map is significantly more continuous.
The Crack Align Module first performs upsampling on high-level feature maps and concatenates them with
low-level feature maps. Then, it introduces a 1 × 1 convolution to generate a feature map with a depth of 2,
where the first layer encodes the shift information in the x-direction, and the second layer encodes the shift
information in the y-direction. Adjustments are made through learnable offset values. By predicting position
offsets on top of traditional bilinear interpolation upsampling, the model corrects the results of
traditional upsampling, enabling more accurate localization of crack pixel positions. The expression is as follows:
$$F_{offset} = \mathrm{Conv}_{1\times 1}(\mathrm{Concatenate}(\mathrm{UpSample}(H_{high}), H_{low}), W) \tag{3}$$
$$I_{translated} = \mathrm{Translate}(I, F_{offset\_x}, F_{offset\_y}) \tag{4}$$
where W is the offset weight. After generating the translation information, CAM uses it to guide
the upsampling of high-level feature maps, aligning feature maps at different levels more effectively and preserving
more boundary position information of cracks. At the same time, the model can adjust the upsampling positions
more accurately, reducing or eliminating discontinuities caused by interpolation and making the pixel value
changes in the crack area smoother, which improves the continuity of cracks. Finally, fusing the aligned result
with the feature maps from the encoder at each pixel reduces misalignment and yields a reasonable fusion
result. The final expression is as follows:
$$H = \mathrm{Fuse}(H_{low}, \mathrm{UpSample}(H_{high}, F_{offset\_x}, F_{offset\_y})) \tag{5}$$
From Fig. 7, it can be observed that the small cracks in the image are very similar to the background. With the
addition of CAM, UNet can accurately capture these cracks, while the original UNet pays less attention to them,
resulting in discontinuities. This demonstrates the effectiveness of CAM.
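The idea of offset-guided upsampling can be sketched in plain Python. This is a simplified illustration rather than the CAM implementation: the offsets are supplied directly instead of being predicted by a 1 × 1 convolution, they are assumed to be expressed in low-resolution pixel units, and the names `bilinear_sample` and `aligned_upsample` are hypothetical.

```python
import math


def bilinear_sample(feat, x, y):
    """Sample a 2-D list `feat` at fractional (x, y), clamping to the border."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = feat[y0][x0] * (1 - ax) + feat[y0][x1] * ax
    bottom = feat[y1][x0] * (1 - ax) + feat[y1][x1] * ax
    return top * (1 - ay) + bottom * ay


def aligned_upsample(high, offset_x, offset_y, scale=2):
    """Upsample `high` by `scale`, shifting each sampling position by an offset.

    offset_x and offset_y have the output resolution; with all-zero offsets
    this reduces to ordinary bilinear upsampling.
    """
    out_h, out_w = len(high) * scale, len(high[0]) * scale
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Position of output pixel (r, c) in the low-resolution grid.
            src_x = (c + 0.5) / scale - 0.5
            src_y = (r + 0.5) / scale - 0.5
            row.append(bilinear_sample(high,
                                       src_x + offset_x[r][c],
                                       src_y + offset_y[r][c]))
        out.append(row)
    return out


high = [[0.0, 1.0]]
zeros = [[0.0] * 4 for _ in range(2)]
shift = [[0.25] * 4 for _ in range(2)]
plain = aligned_upsample(high, zeros, zeros)
shifted = aligned_upsample(high, shift, zeros)
```

With zero offsets the output is the usual bilinear ramp; a positive x offset moves every sampling position to the right, which is how a learned offset field could pull a crack edge back to its correct location before fusion.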
enhance the edge features of pavement cracks. In Fig. 8, assuming the green region A represents the actual crack
area and the blue region represents the model’s prediction, the yellow region B denotes the incorrectly predicted
parts. These errors contribute to the loss value.
Near the edges of cracks, the model’s accuracy is limited. If a prediction error occurs, a smaller penalty
should be applied. However, in regions far away from the crack edges, where the characteristics are significantly
different from those of the crack area, the model has a high probability of identifying this region as a background
region. In case of a prediction error, a larger penalty should be applied. In this paper, the distance transform
method is used to calculate the nearest distance L from each pixel to the edge. Pixels near the edge will have
smaller values, while those further away will have larger distance values. To avoid extreme weighting, smooth
processing is required. The expression is as follows:
$$L_{ij} = \log_2(L + 2) \tag{6}$$
where Lij represents the distance value at pixel position (i,j) , and this value is taken as the weight for the
corresponding pixel point. The distance value inside the crack should be negative, thus: −1 × Lij . The distance
value outside the crack should be positive, thus: 1 × Lij . This way, when the model’s prediction is correct, the
forward operation result inside the crack tends towards 1. When multiplied by the distance value inside the
crack, it becomes a very small value. The forward operation result outside tends towards 0. When multiplied
by the distance value outside, it becomes a very small value. Finally, taking the average of all pixels’ results
yields the global minimum value. Due to the disproportionate ratio of foreground to background in the dataset,
the distance transform values for the exterior region should be adjusted. Otherwise, the extensive background
region may receive too much attention, which is detrimental to model convergence. Therefore, for Lij outside
the mask, we need to set an upper limit. The expression is as follows:
$$L_{ij} = \min(L_{max}, L_{ij}) \tag{7}$$
Additionally, the distance transform values near the edge also need to be smoothed.
Because both the model’s prediction accuracy and the annotation accuracy are poor near the edge, forcibly
applying the distance transform values there may lead to overfitting. In this paper, pixels at positions where
the absolute value of the distance transform is less than β are multiplied by a scaling factor of 0.5, ensuring
that the resulting loss values are not too large. This is illustrated in Fig. 9.
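The per-pixel weighting steps described above (logarithmic smoothing, sign by region, an upper limit for the background, and attenuation near the edge) can be sketched as follows. Several details are assumptions of this sketch and are not stated in the text: the value of `l_max`, the choice to compare the raw distance L against β, and the function name `edge_weight`.

```python
import math


def edge_weight(distance, inside_crack, l_max=4.0, beta=5, near_edge_scale=0.5):
    """Signed, smoothed distance weight for one pixel.

    distance     -- nearest distance L from the pixel to the crack edge (>= 0)
    inside_crack -- True if the pixel lies inside the annotated crack
    l_max        -- hypothetical upper limit for background weights
    beta         -- pixels closer to the edge than this are attenuated
    """
    weight = math.log2(distance + 2)  # Eq. (6): smooth the raw distance
    if not inside_crack:
        weight = min(weight, l_max)   # upper limit for the background region
    if distance < beta:
        weight *= near_edge_scale     # attenuate uncertain pixels near the edge
    # Negative inside the crack, positive outside.
    return -weight if inside_crack else weight
```

For example, a background pixel 62 pixels from the edge has a smoothed distance of 6, which the cap reduces to `l_max`, while a pixel on the edge inside the crack receives the smallest magnitude.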
Table 1 shows the accuracy obtained with different values of β. All four metrics reach their highest value
at β = 5, so β is set to 5 in this paper. Figure 10 shows the recognition results of the model for
different values of β.
During the model training process, the cross-entropy function aids in model convergence, thus it is also
included as part of the loss function. Additionally, the Weighted Edge Loss and cross-entropy functions are
combined to form the final Weighted Edge CE Loss.
Figure 8. A represents the actual crack area, B represents the model’s predicted result, and C represents the
background area.
Table 1. Results for different values of β (%).
β          1      2      3      4      5      6      7      8
mIoU       85.04  85.03  86.98  87.06  90.87  82.98  86.27  87.33
Recall     84.92  84.85  86.77  86.86  91.20  82.17  85.97  87.16
Precision  98.88  98.90  99.04  99.04  99.37  98.75  98.99  99.06
F-score    91.37  91.34  92.50  92.55  95.11  89.70  92.02  92.73
Figure 11. Feature map visualization during training with or without WECEL addition. After adding WECEL,
the boundaries of the feature map become noticeably more refined.
$$WE = 0.5 \times \beta \times \sum_{ij} L_{ij} \tag{8}$$
$$WECEL = W \times CE + (1 - W) \times WE \tag{9}$$
where CE represents the cross entropy function, WE represents the edge loss function, and W is the weight
assigned to both loss functions. From Fig. 11, it can be observed that after incorporating WECEL, UNet exhibits
finer boundary detection of cracks, demonstrating the effectiveness of WECEL.
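The combination in Eq. (9) can be sketched as below. The edge term here follows one reading of the prose description, namely the mean over pixels of the predicted probability multiplied by the signed distance weight; this reading, the binary form of the cross entropy, and all function names are assumptions of the sketch rather than the paper's implementation.

```python
import math


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross entropy over flat lists of probabilities and labels."""
    total = 0.0
    for p, t in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(probs)


def weighted_edge_term(probs, signed_weights):
    """Mean of prediction times signed weight (negative inside, positive outside)."""
    return sum(p * w for p, w in zip(probs, signed_weights)) / len(probs)


def wecel(probs, labels, signed_weights, w):
    """Eq. (9): W * CE + (1 - W) * WE."""
    ce = binary_cross_entropy(probs, labels)
    we = weighted_edge_term(probs, signed_weights)
    return w * ce + (1 - w) * we


labels = [1, 0]
weights = [-1.0, 2.0]   # crack pixel, then a background pixel
good = [0.9, 0.1]       # prediction that agrees with the labels
bad = [0.1, 0.9]        # prediction that contradicts the labels
```

Under this reading, a correct prediction drives the edge term down (large probability where the weight is negative, small probability where it is positive), so `wecel(good, ...)` is lower than `wecel(bad, ...)` for any W between 0 and 1.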
Training strategy
The parameters used for model training in this study are listed in Table 3. The training was conducted on a
server with the following specifications: CPU: Intel(R) Core(TM) i7-9700 CPU, GPU: Nvidia GeForce RTX
3090. All models were implemented using the PyTorch framework. At the start of training, the weight W for
the loss function is set to 1, so only the cross-entropy loss function is used. With each epoch, the
weight W for the cross-entropy function is gradually decreased, while the weight for the Weighted Edge Loss is
increased. The weight for the Weighted Edge Loss grows as (epoch/300) × 0.1, increasing gradually with each
epoch until it reaches 0.1.
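One plausible reading of this schedule is a linear ramp in which the Weighted Edge Loss weight, 1 − W, equals (epoch/300) × 0.1 and is held at 0.1 afterwards. The sketch below encodes that reading; the function name `loss_weights` and the behaviour after epoch 300 are assumptions.

```python
def loss_weights(epoch, ramp_epochs=300, max_edge_weight=0.1):
    """Return (W, 1 - W): cross-entropy weight and Weighted Edge Loss weight."""
    # Linear ramp from 0 to max_edge_weight over ramp_epochs, then constant.
    edge_w = min(epoch / ramp_epochs, 1.0) * max_edge_weight
    return 1.0 - edge_w, edge_w
```

At epoch 0 this gives W = 1 (cross entropy only), and from epoch 300 onward the cross-entropy weight stays at 0.9.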
Evaluation metrics
In this study, precision, recall, mIoU, and F-score are used as metrics for crack identification accuracy. Here,
TP represents the number of true positives, FP the number of false positives, FN the number of false
negatives, and TN the number of true negatives. The four metrics are calculated as follows:
$$Precision = \frac{TP}{TP + FP} \tag{10}$$
$$Recall = \frac{TP}{TP + FN} \tag{11}$$
$$F = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{12}$$
$$mIoU = \frac{1}{2}\left(\frac{TP}{TP + FP + FN} + \frac{TN}{TN + FN + FP}\right) \tag{13}$$
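As a worked illustration of Eqs. (10) to (13), the following sketch computes the four metrics from pixel counts. The function name and the example counts are hypothetical, and the sketch does not guard against zero denominators.

```python
def segmentation_metrics(tp, fp, fn, tn):
    """Precision, recall, F-score and two-class mIoU from pixel counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    iou_crack = tp / (tp + fp + fn)        # IoU of the crack class
    iou_background = tn / (tn + fn + fp)   # IoU of the background class
    miou = 0.5 * (iou_crack + iou_background)
    return {"precision": precision, "recall": recall,
            "f_score": f_score, "miou": miou}


example = segmentation_metrics(tp=8, fp=2, fn=2, tn=88)
```

With these counts, precision, recall, and F-score are all 0.8, while mIoU averages the crack IoU (8/12) with the much higher background IoU (88/92), which shows how a large background area lifts mIoU above the crack-only score.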
(1) Crack500: This dataset contains clearer cracks than the other three datasets, with most cracks
occupying a higher proportion of the images. However, we observed that the annotations are rougher than
the actual cracks and contain some errors, possibly due to subjective judgment and
uncertainty during the annotation process, which increases the difficulty of the crack detection task.
CPCDNet achieves an mIoU of 80.36% and an F-score of 87.12%. Compared to the original UNet, CPCDNet
improves mIoU by 0.67%, Recall by 0.85%, Precision by 0.05%, and F-score by 0.54%.
(2) CFD: This dataset primarily consists of asphalt road surfaces with complex background information and
numerous interferences, making it easy for the model to misclassify some background information as
cracks. Additionally, the crack structures in this dataset are highly complex, which makes it extremely prone
to missing some cracks during detection. In column (c) of Fig. 13, we can observe that most models
identify stains in the background as cracks. CPCDNet achieves an mIoU of 77.71% and an F-score of 85.57%.
Compared to the original UNet, CPCDNet improves mIoU by 1.68%, Recall by 0.51%, Precision by 1.06%,
and F-score by 0.86%.
(3) Deepcrack537: This dataset mainly comprises cement road surfaces with relatively less background noise,
resulting in the best performance among all models across the four datasets. Deepcrack537 contains
numerous small cracks, often appearing alongside clear cracks, making it challenging for models to identify
them. For instance, in column (d) of Fig. 13, small cracks are difficult to extract as they have similar contrast
to the ground. Our model achieves a Recall of 90.98% and an F-score of 95.05% on this dataset. Compared
to the original UNet, CPCDNet improves mIoU by 6.16%, Recall by 6.79%, Precision by 0.48%, and F-score
by 4.03%.
(4) GAPs384: This dataset presents significant challenges for crack identification. Firstly, the low contrast
between cracks and the background makes it difficult to distinguish cracks, leading to the possibility of
mistaking background clutter for cracks. Secondly, the cracks in this dataset are relatively small and sparse in
the images, making them hard for models to capture. Consequently, the performance of all models on this
dataset is the poorest among the four datasets. Our model achieves the best results with an mIoU of 71.16%
and an F-score of 79.82%. Compared to the original UNet, CPCDNet improves mIoU by 7.73%, Recall by
11.62%, Precision by 0.1%, and F-score by 8.94%.
Our model outperforms other mainstream models, especially in the extraction of fine cracks and crack
boundaries, producing more refined and continuous results. It captures tiny cracks more sensitively and
extracts finer, more complete crack boundaries. The visual results show that our model preserves detail and
boundary clarity better than the other models. Figure 14 shows the PR curves of each model on different datasets.
Figure 15 shows the loss curves of each model on different datasets.
Effectiveness analysis
We tested the model with several images outside the existing image database. Figure 16 shows images (a)–(e)
with pavements featuring water seepage and various types of interference, while (f) and (g) are two negative
samples. CPCDNet was able to accurately detect cracks even without training on similar samples and did not
misidentify pavement stains or manhole covers as cracks, demonstrating the effectiveness of the model.
We also applied a 10% zoom to the images from the four datasets, and the model was still able
to accurately detect pavement cracks. The recognition results are shown in Fig. 17.
Figure 12. Recognition results of each model on each dataset: (a–c) are from GAPs384 and (d–f) are from Crack500.
Figure 13. Recognition results of each model on each dataset: (a–c) are from CFD and (d–f) are from DeepCrack537.
Ablation analysis
To validate the effectiveness of our approach, we conducted five sets of ablation experiments on the
Deepcrack537 dataset: (1) UNet, (2) UNet+DSC representing UNet with DSC convolution added to the
first layer of the encoder, (3) UNet+CAM representing UNet with CAM added to the skip connections, (4)
UNet+WECEL representing UNet with the loss function replaced by WECEL, and (5) CPCDNet proposed in
this paper. Figure 18 presents the recognition results of the different models on the test set. In the first row,
UNet fails to recognize the marked area, while with the addition of our algorithm, the recognition becomes
more complete and detailed. In the second row, UNet exhibits misrecognition, which is substantially reduced
after adding our algorithm. In the third row, UNet fails to detect the small cracks on the right side, but with
the addition of our algorithm, the small cracks are effectively identified. Meanwhile, in the fourth row, UNet
shows crack fragmentation, which is resolved by adding our algorithm. CPCDNet not only enhances the
capability to extract small features but also addresses the issue of inaccurate boundaries. The analysis indicates
the effectiveness and superiority of our algorithm.
From Table 8, it can be observed that adding DSC increased mIoU by 1.23%. This means that DSC can
enhance the model’s ability to capture detailed information about crack pillars during the encoding phase, thus
confirming the hypothesis that DSC can better fit the shape of road cracks. Adding CAM increased mIoU by
4.90%, indicating that the introduction of this module improved the positional accuracy of pixel recovery after
upsampling. Adding WECEL resulted in an increase in mIoU by 3.73%, suggesting that WECEL, through weight
adjustment, enabled the model to focus more on the edge regions of cracks, thereby improving the predictive
performance of crack edges. This allows the model to more accurately capture and emphasize edge information
in crack detection tasks.
Conclusions
To address the issues of discontinuity in crack detection models, a pavement crack image segmentation
algorithm called CPCDNet has been proposed. Extensive experiments on four crack datasets have demonstrated
the superior segmentation performance of CPCDNet. The main contributions of this paper are as follows:
(1) Introduced DSC to enhance the perception of elongated structures, thereby improving the capture of crack
structures.
(2) Designed the CAM module, which uses learned offset values to guide the pixel value recovery during the
upsampling process, thereby enhancing the continuity of crack extraction.
(3) Developed WECEL, which adjusts weights by applying different penalties based on the distance of each
pixel to the crack edges, improving crack edge detection capability.
In the design of WECEL, the β value was set manually using empirical methods. In the future, β should be
made a dynamically varying parameter based on crack width to improve the algorithm’s applicability and
accuracy. Additionally, we observed that some cracks are overly smoothed during edge extraction. While this
enhances the clarity and continuity of crack extraction, it can also lead to the loss or blurring of crack edges,
resulting in the loss of some detailed information and impacting the algorithm’s precision. Therefore, further
optimization of the crack edge extraction process is needed to balance smoothing with detail preservation.
Finally, the current model’s parameter count still does not meet the requirements for real-time crack detection.
Future work should focus on further reducing the model’s complexity to better meet the needs of routine inspections.
Data availability
The data used to support the findings of this study is available from the corresponding author upon request.
References
1. Yang, L., Bai, S., Liu, Y. & Yu, H. Multi-scale triple-attention network for pixelwise crack segmentation. Autom. Constr. 150, 104853.
https://doi.org/10.1016/j.autcon.2023.104853 (2023).
2. Zalama, E., Gómez-García-Bermejo, J., Medina, R. & Llamas, J. Road crack detection using visual features extracted by gabor
filters. Comput. Aid. Civ. Infrastruct. Eng. 29, 342–358. https://doi.org/10.1111/mice.12042 (2014).
3. Liu, M., Liu, Y., Hu, H. & Nie, L. Genetic algorithm and mathematical morphology based binarization method for strip steel defect
image with non-uniform illumination. J. Vis. Commun. Image Represent. 37, 70–77. https://doi.org/10.1016/j.jvcir.2015.04.005
(2014).
4. Jiang, K. et al. Atmfn: Adaptive-threshold-based multi-model fusion network for compressed face hallucination. IEEE Trans.
Multimedia 22, 2734–2747. https://doi.org/10.1109/TMM.2019.2960586 (2019).
5. Luo, Q., Ge, B. & Tian, Q. A fast adaptive crack detection algorithm based on a double-edge extraction operator of fsm. Constr.
Build. Mater. 204, 244–254. https://doi.org/10.1016/j.conbuildmat.2019.01.150 (2019).
6. Chen, Y. et al. An improved minimal path selection approach with new strategies for pavement crack segmentation. Measurement
184, 109877. https://doi.org/10.1016/j.measurement.2021.109877 (2021).
7. Hsieh, Y.-A. & Tsai, Y. J. Machine learning for crack detection: Review and model performance comparison. J. Comput. Civ. Eng.
34, 04020038. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000918 (2020).
8. Shi, Y., Cui, L., Qi, Z., Meng, F. & Chen, Z. Automatic road crack detection using random structured forests. IEEE Trans. Intell.
Transp. Syst. 17, 3434–3445. https://doi.org/10.1109/TITS.2016.2552248 (2016).
9. Sari, Y., Prakoso, P. B. & Baskara, A. R. Road crack detection using support vector machine (svm) and otsu algorithm. In 2019 6th
International Conference on Electric Vehicular Technology (ICEVT) 349–354. https://doi.org/10.1109/ICEVT48285.2019.8993969
(2019).
10. Saleem, M. & Gutierrez, H. Using artificial neural network and non-destructive test for crack detection in concrete surrounding
the embedded steel reinforcement. Struct. Concr. 22, 2849–2867. https://doi.org/10.1002/suco.202000767 (2021).
11. Ali, L. et al. Performance evaluation of deep cnn-based crack detection and localization techniques for concrete structures. Sensors
21, 1688. https://doi.org/10.3390/s21051688 (2021).
12. Wang, Y. et al. Renet: Rectangular convolution pyramid and edge enhancement network for salient object detection of pavement
cracks. Measurement 170, 108698. https://doi.org/10.1016/j.measurement.2020.108698 (2021).
13. Tang, Y., Zhang, A. A., Luo, L., Wang, G. & Yang, E. Pixel-level pavement crack segmentation with encoder-decoder network.
Measurement 184, 109914. https://doi.org/10.1016/j.measurement.2021.109914 (2021).
14. Guo, J.-M., Markoni, H. & Lee, J.-D. Barnet: Boundary aware refinement network for crack detection. IEEE Trans. Intell. Transp.
Syst. 23, 7343–7358. https://doi.org/10.1109/TITS.2021.3069135 (2021).
15. Qu, Z., Cao, C., Liu, L. & Zhou, D.-Y. A deeply supervised convolutional neural network for pavement crack detection with
multiscale feature fusion. IEEE Trans. Neural Netw. Learn. Syst. 33, 4890–4899. https://doi.org/10.1109/TNNLS.2021.3062070
(2021).
16. Al-Huda, Z. et al. A hybrid deep learning pavement crack semantic segmentation. Eng. Appl. Artif. Intell. 122, 106142.
https://doi.org/10.1016/j.engappai.2023.106142 (2021).
17. Yu, Y. et al. Ccapfpn: A context-augmented capsule feature pyramid network for pavement crack detection. IEEE Trans. Intell.
Transp. Syst. 23, 3324–3335. https://doi.org/10.1109/TITS.2020.3035663 (2023).
18. Yang, L., Huang, H., Kong, S., Liu, Y. & Yu, H. Paf-net: A progressive and adaptive fusion network for pavement crack segmentation.
IEEE Trans. Intell. Transp. Syst. 33, 8636–8646. https://doi.org/10.1109/TITS.2023.3287533 (2023).
19. Jaziri, A., Mundt, M., Fernandez, A. & Ramesh, V. Designing a hybrid neural system to learn real-world crack segmentation from
fractal-based simulation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision 33, 8636–8646.
https://doi.org/10.48550/arXiv.2309.0963 (2024).
20. Liu, H., Miao, X., Mertz, C., Xu, C. & Kong, H. Crackformer: Transformer network for fine-grained crack detection. Proceedings of
the IEEE/CVF international conference on computer vision 3783–3792, https://doi.org/10.1109/ICCV48922.2021.00376 (2021).
21. Xiao, S. et al. Pavement crack detection with hybrid-window attentive vision transformers. Int. J. Appl. Earth Obs. Geoinf. 116,
103172. https://doi.org/10.1016/j.jag.2022.103172 (2021).
22. Quan, J., Ge, B. & Wang, M. Crackvit: a unified cnn-transformer model for pixel-level crack extraction. Neural Comput. Appl. 35,
10957–10973. https://doi.org/10.1007/s00521-023-08277-7 (2021).
23. Quan, J., Ge, B. & Wang, M. Dmf-net: A dual-encoding multi-scale fusion network for pavement crack detection. IEEE Trans.
Intell. Transp. Syst. https://doi.org/10.1109/TITS.2023.3331769 (2023).
24. Guo, F., Liu, J., Lv, C. & Yu, H. A novel transformer-based network with attention mechanism for automatic pavement crack
detection. Constr. Build. Mater. 391, 131852. https://doi.org/10.1016/j.conbuildmat.2023.131852 (2023).
25. Wang, Z., Leng, Z. & Zhang, Z. A weakly-supervised transformer-based hybrid network with multi-attention for pavement crack
detection. Constr. Build. Mater. 411, 134134. https://doi.org/10.1016/j.conbuildmat.2023.134134 (2024).
26. Zhou, Q., Qu, Z., Wang, S.-Y. & Bao, K.-H. A method of potentially promising network for crack detection with enhanced convolution
and dynamic feature fusion. IEEE Trans. Intell. Transp. Syst. 23, 18736–18745. https://doi.org/10.1109/TITS.2022.3154746 (2022).
27. Qi, L., Li, C. & Mei, T. Crackunet: a novel network with joint network-in-network structure and deformable convolution for
pavement crack detection. Int. J. Mach. Learn. Cybern. 1–12. https://doi.org/10.1007/s13042-023-02054-7 (2023).
28. Zhou, Q., Qu, Z., Wang, S.-Y. & Bao, K.-H. Deepcrackat: An effective crack segmentation framework based on learning multi-scale
crack features. Eng. Appl. Artif. Intell. 126, 106876. https://doi.org/10.1016/j.engappai.2023.106876 (2023).
29. Choi, W. & Cha, Y.-J. Sddnet: Real-time crack segmentation. IEEE Trans. Industr. Electron. 67, 8016–8025.
https://doi.org/10.1109/TIE.2019.2945265 (2019).
30. Zhong, J., Zhu, J., Huyan, J., Ma, T. & Zhang, W. Multi-scale feature fusion network for pixel-level pavement distress detection.
Autom. Constr. 141, 104436. https://doi.org/10.1016/j.autcon.2022.104436 (2022).
31. Ye, W., Ren, J., Zhang, A. A. & Lu, C. Automatic pixel-level crack detection with multi-scale feature fusion for slab tracks. Comput.
Aid. Civ. Infrastruct. Eng. 38, 2648–2665. https://doi.org/10.1111/mice.12984 (2023).
32. Liu, C., Zhu, C., Xia, X., Zhao, J. & Long, H. Ffedn: Feature fusion encoder decoder network for crack detection. IEEE Trans. Intell.
Transp. Syst. 23, 15546–15557. https://doi.org/10.1109/TITS.2022.3141827 (2022).
33. Qu, Z., Wang, C.-Y., Wang, S.-Y. & Ju, F.-R. A method of hierarchical feature fusion and connected attention architecture for
pavement crack detection. IEEE Trans. Intell. Transp. Syst. 23, 16038–16047. https://doi.org/10.1109/TITS.2022.3147669 (2022).
34. Qu, Z., Wang, C.-Y., Wang, S.-Y. & Ju, F.-R. Cycleadc-net: A crack segmentation method based on multi-scale feature fusion.
Measurement 204, 112107. https://doi.org/10.1016/j.measurement.2022.112107 (2022).
35. Du Nguyen, Q. & Thai, H.-T. Crack segmentation of imbalanced data: The role of loss functions. Eng. Struct. 297, 116988.
https://doi.org/10.1016/j.engstruct.2023.116988 (2023).
36. Mei, Q., Gül, M. & Azim, M. R. Densely connected deep neural network considering connectivity of pixels for automatic crack
detection. Autom. Constr. 110, 103018. https://doi.org/10.1016/j.autcon.2019.103018 (2020).
37. Ali, R., Chuah, J. H., Talip, M. S. A., Mokhtar, N. & Shoaib, M. A. Automatic pixel-level crack segmentation in images using fully
convolutional neural network based on residual blocks and pixel local weights. Eng. Appl. Artif. Intell. 104, 104391.
https://doi.org/10.1016/j.engappai.2021.104391 (2021).
38. Fang, J., Qu, B. & Yuan, Y. Distribution equalization learning mechanism for road crack detection. Neurocomputing 424, 193–204.
https://doi.org/10.1016/j.neucom.2019.12.057 (2020).
39. Li, K., Wang, B., Tian, Y. & Qi, Z. Fast and accurate road crack detection based on adaptive cost-sensitive loss function. IEEE Trans.
Cybern. 53, 1051–1062. https://doi.org/10.1109/TCYB.2021.3103885 (2021).
40. Zhu, X., Hu, H., Lin, S. & Dai, J. Deformable convnets v2: More deformable, better results. Proceedings of the IEEE/CVF conference
on computer vision and pattern recognition 9308–9316, https://doi.org/10.1109/CVPR.2019.00953 (2019).
41. Qi, Y., He, Y., Qi, X., Zhang, Y. & Yang, G. Dynamic snake convolution based on topological geometric constraints for tubular
structure segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision 6070–6079.
https://doi.org/10.1109/ICCV51070.2023.00558 (2023).
42. Yang, F. et al. Feature pyramid and hierarchical boosting network for pavement crack detection. IEEE Trans. Intell. Transp. Syst. 21,
1525–1535. https://doi.org/10.1109/TITS.2019.2910595 (2019).
43. Eisenbach, M. et al. How to get pavement distress detection ready for deep learning? a systematic approach. In 2017 international
joint conference on neural networks (IJCNN) 2039–2047, https://doi.org/10.1109/IJCNN.2017.7966101 (2017).
44. Zou, Q. et al. Deepcrack: Learning hierarchical convolutional features for crack detection. IEEE Trans. Image Process. 28, 1498–
1512. https://doi.org/10.1109/TIP.2018.2878966 (2018).
45. Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical image
computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015,
proceedings, part III 18 234–241, https://doi.org/10.1007/978-3-319-24574-4_28 (2015).
46. Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F. & Adam, H. Encoder-decoder with atrous separable convolution for semantic
image segmentation. In Proceedings of the European conference on computer vision (ECCV) 801–818,
https://doi.org/10.1007/978-3-030-01234-2_49 (2018).
47. Wang, J. et al. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 43,
3349–3364. https://doi.org/10.1109/TPAMI.2020.2983686 (2020).
48. Xie, E. et al. Segformer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 34,
12077–12090. https://doi.org/10.48550/arXiv.2105.15203 (2021).
49. Benz, C., Debus, P., Ha, H. K. & Rodehorst, V. Crack segmentation on uas-based imagery using transfer learning. In 2019 International
Conference on Image and Vision Computing New Zealand (IVCNZ) 1–6. https://doi.org/10.1109/IVCNZ48456.2019.8960998
(2019).
50. Han, C., Ma, T., Huyan, J., Huang, X. & Zhang, Y. Crackw-net: A novel pavement crack image segmentation convolutional neural
network. IEEE Trans. Intell. Transp. Syst. 23, 22135–22144. https://doi.org/10.1109/TITS.2021.3095507 (2021).
Acknowledgements
This research was funded by the National Natural Science Foundation of China (grant number 42071343).
Author contributions
Conceptualization, J.Z. and S.S.; methodology, J.Z.; validation, J.Z. and S.S.; formal analysis, J.Z.; investigation,
Y.L.; resources, W.S.; data curation, Y.L.; writing-original draft preparation, J.Z.; writing-review and editing, J.Z.
and S.S.; visualization, Z.J. and Q.T.; supervision, S.S.; project administration, W.S.; funding acquisition, W.S. All
the authors have read and agreed to the published version of the manuscript.
Declarations
Competing interests
The authors declare no competing interests.
Additional information
Correspondence and requests for materials should be addressed to S.S.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives
4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in
any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide
a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have
permission under this licence to share adapted material derived from this article or parts of it. The images or
other third party material in this article are included in the article’s Creative Commons licence, unless indicated
otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence
and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to
obtain permission directly from the copyright holder. To view a copy of this licence, visit
http://creativecommons.org/licenses/by-nc-nd/4.0/.