
Article

E2S: A UAV-Based Levee Crack Segmentation Framework Using the Unsupervised Deblurring Technique
Fangyi Wang 1, Zhaoli Wang 2, Xushu Wu 1,*, Di Wu 1, Haiying Hu 1, Xiaoping Liu 3 and Yan Zhou 3

1 School of Civil Engineering and Transportation, State Key Laboratory of Subtropical Building and Urban
Science, South China University of Technology, Guangzhou 510641, China;
202321008526@mail.scut.edu.cn (F.W.); 202321009243@mail.scut.edu.cn (D.W.); cthyhu@scut.edu.cn (H.H.)
2 Pazhou Laboratory, Guangzhou 510335, China; wangzhl@scut.edu.cn
3 School of Geography and Planning, Sun Yat-sen University, Guangzhou 510275, China;
liuxp3@mail.sysu.edu.cn (X.L.); zhouyan9103@whu.edu.cn (Y.Z.)
* Correspondence: xshwu@scut.edu.cn

Abstract: The accurate detection and monitoring of levee cracks is critical for maintaining
the structural integrity and safety of flood protection infrastructure. Yet the application of UAVs for the automatic, rapid detection of levee cracks remains limited, and there is a lack of effective deblurring methods specifically tailored for UAV-based levee crack images. In this study, we present E2S, a novel two-stage framework specifically
designed for UAV-based levee crack segmentation, which leverages an unsupervised de-
blurring technique to enhance image quality. In the first stage, we introduce an Improved
CycleGAN model that mainly performs motion deblurring on UAV-captured images, effec-
tively enhancing crack visibility and preserving crucial structural details. The enhanced
images are then fed into the second stage, where an Attention U-Net is employed for
precise crack segmentation. The experimental results demonstrate that the E2S framework
significantly outperforms traditional supervised models, achieving an F1-score of 81.3%
and a crack IoU of 71.84%, surpassing the best-performing baseline, Unet++. The findings
confirm that the integration of unsupervised image enhancement can substantially benefit
downstream segmentation tasks, providing a robust and scalable solution for automated
levee crack monitoring.
Academic Editors: Mehdi Khaki and Jun Liu

Keywords: UAV inspection; levee crack; unsupervised deblurring; crack segmentation

Received: 20 January 2025; Revised: 1 March 2025; Accepted: 4 March 2025; Published: 6 March 2025

Citation: Wang, F.; Wang, Z.; Wu, X.; Wu, D.; Hu, H.; Liu, X.; Zhou, Y. E2S: A UAV-Based Levee Crack Segmentation Framework Using the Unsupervised Deblurring Technique. Remote Sens. 2025, 17, 935. https://doi.org/10.3390/rs17050935

Copyright: © 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction
Levees are vital flood protection structures that prevent water from flooding low-lying areas. Their failure can result in severe property damage and loss of life. Cracks often develop during operation, and understanding their morphology is crucial for maintaining the structural health and safety of levees [1]. Traditional methods for crack inspection have predominantly relied on manual labor, which poses challenges such as being time-consuming, risky for personnel, and often yielding unreliable results. In recent years, unmanned aerial vehicles (UAVs) have emerged as an effective solution to the limitations of traditional inspection methods. UAVs provide key advantages, including an efficient coverage of large areas, an adaptability to varying levee slopes for capturing images of crack-prone surfaces, and reduced safety risks by eliminating the need for manual inspections in hazardous zones.
Automated crack identification using UAVs typically relies on digital image processing algorithms to extract the crack location and morphology. Nowadays, deep learning-based
crack detection has been widely adopted and developed due to its robust recognition
capabilities and ease of deployment. These methods can be categorized into three types:
classification, detection, and segmentation. Among them, segmentation, which delineates
crack boundaries at the pixel level, is particularly valuable for identifying crack morphology.
In recent years, scholars have improved classic networks such as U-Net [2], FCN [3], and
SegNet [4] to make them suitable for the crack detection task. Zou et al. [5] introduced a Crack Detection on the Dam Surface (CDDS) method based on SegNet for the pixel-level detection of dam cracks in UAV images, outperforming SegNet, U-Net, and FCN. Feng et al. [6] likewise built on the SegNet structure for the pixel-level detection of dam cracks in UAV images, again surpassing SegNet, U-Net, and FCN. Ren et al. [7] employed a pre-trained VGG16 as the encoder for U-Net
and adopted focal loss to alleviate the imbalance problem of data, achieving a pixel-level
crack detection on a self-made hydraulic crack dataset. Sun, et al. [8] developed DMA-Net,
an enhanced DeepLabv3+ framework, attaining state-of-the-art results on the Crack500,
DeepCrack, and FMA datasets. Liu, et al. [9] presented CrackFormer-II, an advanced self-
attention network for pavement crack segmentation that integrates innovative Transformer
encoder modules. This model achieved state-of-the-art performance on four benchmark
datasets, including CrackTree260, CrackLS315, Stone331, and DeepCrack537. It is notewor-
thy that the effectiveness of these algorithms is highly related to the image quality, with a
common assumption that the obtained images must be of a sufficiently high quality, taken
under favorable conditions without image degradation. However, the use of computer
vision-based autonomous systems in outdoor environments often exposes them to external
conditions that can cause image degradation [10]. In particular, UAV-captured images often
suffer from motion blur caused by environmental factors encountered when shooting above the water surface, such as windy conditions [11]. Motion blur in UAV images can significantly reduce the accuracy of crack detection, thereby compromising the reliability of monitoring results. In this case, studying deblurring techniques tailored for UAV-based levee inspection is of fundamental practical significance.
Traditional deblurring techniques often treat the task as an inverse filtering problem [12–16], in which a blurred image is modeled as a convolution with a spatially invariant or spatially varying blur kernel. These approaches heavily depend on prior information to estimate
the required blur kernels. Despite their effective performance, these methods require
various problem-specific parameters, such as camera internal settings and external motion
functions, making them challenging to implement and generalize in practical scenarios.
The advent of deep learning has led to the widespread application of methods for image
deblurring, leveraging architectures such as convolutional neural networks (CNN) [17–22],
generative adversarial networks (GAN) [23–26], recurrent neural networks (RNN) [27–31],
and Transformer networks [32–36]. Due to the challenges in obtaining paired clear and
blurred images, unsupervised learning techniques provide a promising alternative for
image deblurring. Unsupervised learning-based deblurring algorithms, trained on un-
paired datasets, approach deblurring as an image style conversion task, treating blurred
and clear images as two distinct styles. Madam, et al. [37] proposed an unsupervised
GAN-based method for the motion deblurring of images. This approach integrates a
deblurring CNN and a gradient module to prevent mode collapse and artifacts. Wen
et al. [38] improved the cycle-consistent generative adversarial network (CycleGAN) [39] by
introducing structural-aware strategies and multi-adversarial optimization, significantly
improving edge restoration and detail recovery for unsupervised high-resolution image
deblurring. Lu, et al. [40] introduced UID-GAN, an unsupervised method that enhances
single-image deblurring by disentangling content and blur features without the need for
paired training images. Zhao, et al. [41] developed FCL-GAN, an unsupervised blind
image deblurring model. This model addresses the challenges of large model sizes and
lengthy inference times by incorporating lightweight domain transformation units and
frequency-domain contrastive learning, thereby improving the real-time deblurring per-
formance. Pham et al. [42] developed a tailored image deblurring framework for specific
cameras. This method transforms challenging blurry images into more deblur-friendly
ones using unpaired sharp and blurry data, simplifying the task via a blur pattern modifi-
cation. It achieves a superior performance over SOTA methods in both quantitative and
qualitative metrics.
However, there is currently a significant lack of research focused on developing and
validating deblurring methods specifically tailored for UAV-based levee crack images.
Moreover, the prevalence of hairline cracks in UAV-captured images increases their sus-
ceptibility to distortion during the deblurring process, underscoring the need for further
modifications to adapt these techniques to the unique challenges posed by UAV-based
levee crack inspections.
To address this gap, we propose the Enhance2Segment (E2S) framework, which lever-
ages a learning-based deblurring method to tackle image degradation, enabling UAV-based
levee crack inspection. Unlike conventional segmentation methods, which struggle to pre-
cisely detect fine details such as small cracks in blurry or noisy images, the key advantage
of E2S lies in its unsupervised deblurring method, which is tailored for UAV-based crack
images and helps to detect cracks more precisely in terms of both location and morphology.
The first stage of E2S employs an Improved CycleGAN for image enhancement, designed
to effectively address motion deblurring challenges in UAV-captured images. In the second
stage, an Attention U-Net model is employed to accurately extract the crack morphol-
ogy, ensuring a precise identification of crack features from the enhanced images. The
performance of E2S was evaluated against traditional supervised crack detection models.
The experimental results show that E2S significantly outperforms these baseline models,
achieving a superior segmentation accuracy on our custom-built dataset of levee crack
images. This research demonstrates that E2S not only improves segmentation accuracy
but also offers a scalable and practical solution for UAV-based levee crack monitoring,
contributing to the enhanced safety and reliability of flood protection infrastructure.

2. Materials and Methods


2.1. Study Area and Data Acquisition
The crack images used in this study were collected from levees along the Hengmen
Waterway in Zhongshan City, China. Zhongshan is located between 22°11′N–22°47′N and 113°09′E–113°46′E, in the southern coastal area of the Greater Bay Area. The geographical location of the study area is shown in Figure 1a, where the blue area represents Zhongshan City and the orange point indicates the location of the Hengmen Waterway. The city experiences
a subtropical monsoon climate, with abundant sunlight, warmth, and significant rainfall.
Precipitation is primarily concentrated in the summer and autumn months, while winters
are relatively dry. Typhoons, heavy rainstorms, and severe convection events are common
meteorological hazards in this region. The Hengmen Waterway, one of the major estuaries
of the Pearl River, serves as an important route for maritime traffic in Zhongshan City. It
plays a critical role in flood control and coastal protection. Ensuring the structural integrity
of the levees along this waterway is crucial for safeguarding the region. Thus, establishing
an effective crack monitoring system is essential for the timely detection of potential risks,
helping to protect both the local population and economic activities along the waterway.
The levee section investigated is shown in Figure 1b. Figure 1c follows with a photograph
of the levee, captured by a UAV during the data collection.

Figure 1. (a) The geographical location of the study area. (b) A satellite image of the investigated levee section. (c) A UAV-captured photograph of the levee during the data collection.

A rotary-wing DJI Phantom 4 Multispectral (P4M) drone (DJI Inc., Shenzhen, China) was deployed to capture images of levee cracks, as illustrated in Figure 1c. The P4M is equipped with six 1/2.9-inch CMOS sensors, comprising one color sensor for visible-light imaging and five monochrome sensors designed for multispectral imaging. For this study, only the RGB camera was used to capture images at a resolution of 1600 × 1300 pixels. The ground resolution of each pixel is H/18.9 cm when the UAV operates at an altitude of H meters above the mapping area, which is sufficient for precise crack detection. The RGB camera features a focal length of 5.74 mm and a field of view of 62.7°. The P4M is outfitted with a gimbal that stabilizes the camera, significantly reducing image distortions. This gimbal also enables the camera to tilt from −90° to +30°, enhancing the flexibility to adjust shooting angles, which is particularly useful on sloped terrain. For enhanced positional accuracy, the P4M incorporates an RTK module that records geographic coordinates and the corresponding timestamps. The high-precision positioning results from the RTK are compensated in real time to the center of the camera's CMOS, ultimately achieving centimeter-level accuracy.
The UAV was manually controlled throughout the collection process to capture crack images from the slope and crest of the levee. During this process, we kept the camera lens as parallel as possible to the surface to minimize distortion. To improve model robustness and generalization, images were taken under varying conditions, capturing cracks with different shapes, orientations, and backgrounds (such as pipes, branches, and stones) and in diverse lighting conditions. In total, 138 raw images were collected for further processing. Each image was then cropped into smaller patches at a resolution of 448 × 448 pixels using a sliding window technique. This step reduced computational demands and improved the model's ability to capture fine details within smaller regions. Finally, 300 sub-images containing cracks were selected as the final sample set for model training and analysis.
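For illustration, the patch-extraction step described above can be sketched in Python as follows; the stride value and file paths are illustrative assumptions rather than settings reported in this paper.

```python
# A minimal sketch of sliding-window cropping into 448 x 448 patches (assumed stride).
from pathlib import Path
from PIL import Image

def crop_patches(image_path, out_dir, patch=448, stride=448):
    """Crop one raw UAV image into patch x patch tiles and save them as PNG files."""
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for top in range(0, max(h - patch, 0) + 1, stride):
        for left in range(0, max(w - patch, 0) + 1, stride):
            tile = img.crop((left, top, left + patch, top + patch))
            tile.save(out_dir / f"{Path(image_path).stem}_{top}_{left}.png")
            count += 1
    return count

# Hypothetical usage: crop_patches("raw/levee_001.jpg", "patches/")
```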
For the image enhancement task in Stage 1 of E2S, which used the Improved CycleGAN model, the focus was on unsupervised learning, which requires only unpaired data. The Improved CycleGAN enables unpaired image-to-image translation, meaning that images from the
source domain and target domain do not need to be directly paired or correspond to each
other. Thus, we created the deblurring dataset incorporating UAV-based samples alongside
high-resolution crack images captured from building surfaces using a high-quality camera.
To ensure effective learning and successful image enhancement, the target domain data
(high-resolution images) were carefully selected to share key characteristics with the source
domain. These included similar resolutions, lighting conditions, and surface textures,
ensuring that the model can accurately capture the crack morphology and context during
the translation process.
For the segmentation task in Stage 2 of E2S, we manually labeled samples at the pixel
level to create ground truth masks, classifying each pixel as either “crack” or “background”.
The enhanced images from stage 1, along with their corresponding ground truth masks,
constituted the segmentation dataset for Stage 2. The dataset was then split into training,
validation, and test sets in an 8:1:1 ratio. This split remained constant throughout all of the
experiments. To prevent overfitting and enhance robustness to real-world variations, we ap-
plied several data augmentation techniques to the training set, including rotation, flipping,
scaling, shearing, and intensity adjustment. Specifically, rotation helps the model recognize
cracks from different angles, while flipping ensures the model can handle symmetric cracks.
Scaling improves the model’s ability to handle cracks at different sizes, and intensity adjust-
ment ensures robustness under varying lighting conditions. These transformations expose
the model to a wider range of data variations, which helps it generalize better to unseen
data and improve its performance under real-world conditions. Such transformations are
commonly used in deep learning models and the effectiveness of these augmentation tech-
niques in improving model robustness has been well documented in prior research [43–45].
It is worth noting that for the other segmentation models discussed in Section 3.2, their
segmentation data consist of UAV-based samples and corresponding ground truth masks.
These datasets were also divided using the same 8:1:1 ratio and subjected to the same data
augmentation techniques to ensure consistency across the models.
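As a concrete (hypothetical) example of the augmentation and splitting strategy described above, the sketch below uses torchvision-style transforms; the parameter ranges are illustrative assumptions rather than the exact values used in this study, and in practice the geometric transforms must be applied identically to each image and its ground truth mask.

```python
# Illustrative augmentation pipeline and 8:1:1 split (parameter ranges are assumed).
import random
from torchvision import transforms

train_augment = transforms.Compose([
    transforms.RandomRotation(degrees=30),                 # rotation
    transforms.RandomHorizontalFlip(p=0.5),                # flipping
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomAffine(degrees=0, scale=(0.8, 1.2),   # scaling
                            shear=10),                     # shearing
    transforms.ColorJitter(brightness=0.3, contrast=0.3),  # intensity adjustment
    transforms.ToTensor(),
])

def split_8_1_1(samples, seed=42):
    """Shuffle a list of samples and split it into train/val/test with an 8:1:1 ratio."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n = len(samples)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return samples[:n_train], samples[n_train:n_train + n_val], samples[n_train + n_val:]
```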

2.2. Improved CycleGAN for Crack Image Enhancement


CycleGAN [39] is a powerful generative adversarial network (GAN) architecture that
enables the translation of images between two different domains without paired data. It
comprises two generator networks G and F, along with two discriminators, DY and DX.
Generator G maps images from domain X to Y, while DY aims to distinguish the generative
samples G(x) from the real samples y. Likewise, generator F maps domain Y to domain
X and DX attempts to differentiate real samples from generated ones. In this study, we
conceptualize sharpness and blurriness as two distinct domains, or more precisely, as two
distinct visual styles. This distinction enables us to approach the deblurring process as a
style transfer task, where sharp and blurry images represent two identifiable and distinct
visual representations. Our goal is to train a model that can understand and learn the
differences between these two styles, and effectively restore the lost details and texture
information in blurry images to achieve a clear state. To this end, we have introduced
modifications to the original CycleGAN architecture.
The discriminator network of CycleGAN is a 70 × 70 PatchGAN [46] discriminator
that operates on overlapping 70 × 70 image patches. It uses several convolutional layers
to down-sample the image, eventually outputting a matrix of probabilities that indicates
whether each patch is real or fake. The PatchGAN discriminator thus assesses the realism
of local image regions, helping the generator to produce more detailed and realistic images.
The CycleGAN generator includes an encoder, transfer layers (residual blocks), and a decoder. The transfer layers in the default architecture are composed of several residual blocks [47], which are effective at preserving global structures. However, this design may not fully capture fine-grained local features, such as the detailed morphology of cracks in UAV-captured images. Small, hairline cracks contain subtle geometric patterns that can be easily distorted or lost during the image generation process. Thus, the residual blocks in the CycleGAN generator are replaced with residual dense blocks (RDBs) [48], since RDBs can better extract local features, leading to an improved image detail reconstruction. An RDB consists of densely connected layers [49], local feature fusion (LFF), and local residual learning (LRL), and its structure is illustrated in Figure 2. The mathematical reasoning and detailed description of the RDB's internal workings are elaborated in Appendix A.1.

Figure 2. The architecture of the generator in the Improved CycleGAN model.
After extracting local dense features through a series of RDBs, global feature fusion (GFF) is introduced to combine features from all of the RDB layers. This helps in preserving and refining global image features, ensuring that both local crack details and the broader structural integrity are maintained. Finally, the generator architecture of our Improved CycleGAN is shown in Figure 2.
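One possible PyTorch realization of an RDB (densely connected convolutions, local feature fusion, and local residual learning) stacked with global feature fusion is sketched below; the channel widths, growth rate, and layer counts are illustrative assumptions, with the exact configuration given in Appendix A.1.

```python
# Sketch of a residual dense block (RDB) and an RDB trunk with global feature fusion (GFF).
import torch
import torch.nn as nn

class RDB(nn.Module):
    """Residual dense block: dense connections -> local feature fusion -> local residual learning."""
    def __init__(self, channels=64, growth=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        in_ch = channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(in_ch, growth, 3, padding=1),
                nn.ReLU(inplace=True)))
            in_ch += growth                       # dense connectivity: inputs accumulate
        self.lff = nn.Conv2d(in_ch, channels, 1)  # local feature fusion (1 x 1 conv)

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return x + self.lff(torch.cat(feats, dim=1))  # local residual learning

class RDBTrunk(nn.Module):
    """Stack of RDBs whose outputs are concatenated and fused globally (GFF)."""
    def __init__(self, channels=64, num_blocks=6):
        super().__init__()
        self.blocks = nn.ModuleList([RDB(channels) for _ in range(num_blocks)])
        self.gff = nn.Sequential(
            nn.Conv2d(channels * num_blocks, channels, 1),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x):
        outs, h = [], x
        for block in self.blocks:
            h = block(h)
            outs.append(h)
        return x + self.gff(torch.cat(outs, dim=1))  # global feature fusion + global residual

# Shape check: RDBTrunk()(torch.randn(1, 64, 112, 112)).shape -> (1, 64, 112, 112)
```

In a generator of this kind, a trunk of stacked RDBs takes the place of the residual blocks between the encoder and decoder.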
CycleGAN incorporates both adversarial losses and a cycle consistency loss to drive the image translation process. Adversarial losses guide generator G to produce realistic images in the target domain, which are then distinguished from real images by the discriminator DY. Similarly, the reverse generator F is also trained with an adversarial loss using discriminator DX. As mentioned above, CycleGAN can be used without paired training data. In this case, an unregularized network may map inputs to arbitrary locations in the target domain, resulting in structural distortions or a loss of fine details. To address this, the cycle consistency loss enforces the principle that an image should remain unchanged after being translated to another domain and back. Traditionally, this loss is computed using an L1 distance between the original and reconstructed images. However, L1 loss alone may not sufficiently capture perceptual and structural characteristics, leading to a potential loss of fine-grained details.

To enhance the structural fidelity of the generated images, the Improved CycleGAN incorporates the Structural Similarity Index Measure (SSIM) into the cycle consistency loss. SSIM evaluates the perceptual similarity between images by considering the luminance, contrast, and structural components. The enhanced cycle consistency loss balances the L1 loss and the SSIM loss, ensuring that the generated images better preserve structural details. The complete loss function of the Improved CycleGAN integrates both adversarial and enhanced cycle consistency losses, controlling their relative importance by weighting factors. See the detailed mathematical derivations and formulae in Appendix A.2.
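A minimal sketch of such an enhanced cycle consistency term is shown below; it assumes a differentiable SSIM implementation such as the pytorch-msssim package, and the simple convex combination with weight alpha is an illustrative choice, with the exact formulation given in Appendix A.2.

```python
# Sketch of an SSIM-augmented cycle consistency loss (weighting scheme is an assumption).
import torch.nn.functional as F
from pytorch_msssim import ssim  # differentiable SSIM for tensors scaled to [0, 1]

def enhanced_cycle_loss(real, reconstructed, alpha=0.3):
    """Balance L1 distance and an SSIM term; (1 - SSIM) is used so that lower is better."""
    l1 = F.l1_loss(reconstructed, real)
    ssim_term = 1.0 - ssim(reconstructed, real, data_range=1.0)
    return (1.0 - alpha) * l1 + alpha * ssim_term

def total_loss(adv_G, adv_F, cycle_X_term, cycle_Y_term, lam=10.0):
    """Full objective: adversarial losses plus the weighted enhanced cycle consistency."""
    return adv_G + adv_F + lam * (cycle_X_term + cycle_Y_term)
```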
In summary, the Improved CycleGAN introduces RDBs to improve the retention of fine-grained local features, crucial for crack morphology in UAV-captured images. Combined with the use of SSIM in the cycle consistency loss, this architecture more effectively captures both global and local features, enabling an improved image enhancement and reconstruction for crack images. These enhancements ultimately contribute to more accurate crack morphology preservation, crucial for downstream tasks such as segmentation. The Improved CycleGAN architecture is depicted in Figure 3.

Figure 3. The architecture of the Improved CycleGAN model, incorporating RDB and an enhanced loss function.
2.3. Attention U-Net for Crack Segmentation
Building on the enhanced crack images generated by the Improved CycleGAN in the previous section, the next crucial step is to train a pixel-level segmentation model specifically for levee cracks. A common challenge in crack segmentation tasks is maintaining a precise localization and accurate segmentation. While traditional CNN architectures, commonly used as encoders, are effective at capturing global contextual information through gradual down-sampling, they tend to lose fine details, which can lead to inaccuracies in the final segmentation output, particularly increasing the risk of false positive predictions for small targets and targets with a large shape variability.

In this study, we adopt an Attention U-Net [50], which incorporates attention gates (AGs), a self-attention mechanism. AGs work by dynamically adjusting feature responses, suppressing irrelevant background regions while amplifying features from the target areas, ultimately guiding the model to focus on the object, thereby increasing accuracy.

The schematic of the attention gate (AG) is shown in Figure 4. An AG takes two inputs: a gating signal g and feature maps x^l. The gating signal is a vector containing contextual information from coarser layers. It is combined with the feature maps at the current layer to compute an attention coefficient for each pixel, determining which areas of the image should be emphasized. Specifically, the attention coefficient is computed using additive attention [50,51], which is formulated as follows:

q_{att}^{l} = \psi^{T}\left(\sigma_{1}\left(W_{x}^{T} x_{i}^{l} + W_{g}^{T} g_{i} + b_{g}\right)\right) + b_{\psi}  (1)

\alpha_{i}^{l} = \sigma_{2}\left(q_{att}^{l}\right)  (2)

where σ1 and σ2 denote the ReLU activation function and the sigmoid activation function, respectively. W_x ∈ R^(F_l × F_int) and W_g^T ∈ R^(F_g × F_int) are linear transformations that are calculated by using channel-wise 1 × 1 convolutions for the input tensors. b_g and b_ψ represent the bias terms. Finally, the output x̂^l in layer l is the element-wise multiplication of the input feature maps x_i^l and the attention coefficients α_i^l.

Figure 4. The architecture of the attention gate.

Figure 5 illustrates the architecture of the Attention U-Net, highlighting the application of attention gates (AGs) before the skip connections. Skip connections integrate feature maps from the encoder's shallow layers with those from the decoder's deeper layers, thereby recovering spatial details lost during downsampling. Direct concatenation, however, can introduce noise or irrelevant features from the encoder, weakening the decoder's focus on semantically significant regions. AGs mitigate this by filtering the encoder's feature map x using a gating signal g from the decoder. The gating signal g carries high-level semantic information, such as the location and shape of a crack, while x from the encoder provides detailed spatial data, including edges, textures, and background clutter. AGs use g to emphasize crack-relevant features in x and suppress irrelevant ones, producing a refined feature map. This refined map is then concatenated with the decoder's feature map, balancing precise spatial details with semantic context.

AGs enhance segmentation in two ways. During the forward pass, AGs emphasize the crack regions, refining the model's focus on key areas. In the backward pass, AGs optimize gradient flow by down-weighting gradients from irrelevant areas, allowing the model to prioritize updates to critical regions. This dual process of refining both feature responses and gradient propagation enhances segmentation accuracy, reduces false positives, and improves the robustness to noisy backgrounds. In summary, Attention U-Net improves spatial focus and gradient flow, resulting in a more accurate and robust crack segmentation.

Figure 5. The architecture of the Attention U-Net.
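The additive attention gate of Equations (1) and (2) can be implemented compactly in PyTorch as sketched below; the intermediate channel size F_int is an illustrative assumption.

```python
# Sketch of the attention gate (AG) used before each skip connection.
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Re-weight encoder features x with a gating signal g via additive attention."""
    def __init__(self, f_l, f_g, f_int):
        super().__init__()
        self.w_x = nn.Conv2d(f_l, f_int, kernel_size=1, bias=False)  # W_x
        self.w_g = nn.Conv2d(f_g, f_int, kernel_size=1, bias=True)   # W_g and b_g
        self.psi = nn.Conv2d(f_int, 1, kernel_size=1, bias=True)     # psi and b_psi
        self.relu = nn.ReLU(inplace=True)   # sigma_1
        self.sigmoid = nn.Sigmoid()         # sigma_2

    def forward(self, x, g):
        # g is assumed to be resampled to the spatial size of x beforehand.
        q_att = self.psi(self.relu(self.w_x(x) + self.w_g(g)))  # Equation (1)
        alpha = self.sigmoid(q_att)                             # Equation (2)
        return x * alpha                                        # element-wise re-weighting of x

# Example: AttentionGate(f_l=64, f_g=128, f_int=32)(torch.randn(1, 64, 56, 56),
#                                                   torch.randn(1, 128, 56, 56))
```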

Given that the crack regions are relatively small compared to the background in the entire image, the Dice loss function was selected as the loss function for the Attention U-Net, as it enables the network to focus more effectively on the classification of crack pixels. The Dice loss function assesses the similarity and overlap between the predicted results and the ground truth and is formulated as follows:

L_{dice} = 1 - \frac{2 \times |y_i \cap p_i|}{|y_i| + |p_i|}  (3)

where p_i denotes the predicted binary image and y_i represents its ground truth. |p_i| is the number of pixels in p_i and |y_i| is the number of pixels in y_i. |y_i ∩ p_i| denotes the number of pixels in their overlapping region.
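A soft (differentiable) form of Equation (3) can be sketched as follows; the smoothing constant is an implementation-level assumption that avoids division by zero.

```python
# Sketch of a soft Dice loss for binary crack masks (PyTorch tensors).
def dice_loss(pred, target, smooth=1e-6):
    """pred: crack probabilities in [0, 1], shape (N, 1, H, W); target: {0, 1} mask, same shape."""
    pred = pred.reshape(pred.size(0), -1)
    target = target.reshape(target.size(0), -1)
    intersection = (pred * target).sum(dim=1)        # soft |y ∩ p|
    denom = pred.sum(dim=1) + target.sum(dim=1)      # |y| + |p|
    dice = (2.0 * intersection + smooth) / (denom + smooth)
    return 1.0 - dice.mean()
```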
The overall workflow of E2S is illustrated in Figure 6, which consists of two stages. Stage 1 utilizes an Improved CycleGAN for image enhancement, focusing primarily on deblurring but also on incorporating background adjustment as a part of the unsupervised learning process. It is important to note that background adjustment is an unintended consequence, arising from the nature of the unsupervised style transfer. While this adjustment is not the primary objective, it inevitably occurs and carries a potential risk of introducing artifacts, as discussed in Section 4. Stage 2 employs a supervised segmentation model based on Attention U-Net, which aims to produce a pixel-level segmentation of the enhanced images from Stage 1.

Figure 6. The overall workflow of the E2S methodology.
2.4. Evaluation Metrics
To comprehensively evaluate the performance of the Improved CycleGAN, we employed the Structural Similarity Index (SSIM) and an offset distance based on the Scale Invariant Feature Transform (SIFT) algorithm. SSIM measures the perceptual similarity between two images by comparing their luminance, contrast, and structural information. Its value ranges from −1 to 1, with higher values indicating a better structural consistency between the original and enhanced images. The calculation formula for SSIM is as follows:

\mathrm{SSIM}(I_1, I_2) = \frac{(2\mu_{I_1}\mu_{I_2} + C_1)(2\sigma_{I_1 I_2} + C_2)}{(\mu_{I_1}^2 + \mu_{I_2}^2 + C_1)(\sigma_{I_1}^2 + \sigma_{I_2}^2 + C_2)}  (4)

where I_1 and I_2 represent the two images for comparison. µ and σ denote the mean and standard deviation of the image, respectively. C_1 and C_2 are constants that ensure the denominator is always greater than 0.
Additionally, we introduce the SIFT-based offset distance as a quantitative metric to
evaluate the preservation of geometric structure during the image enhancement process.
Preserving the geometric structure of cracks is particularly critical because the ground
truth was originally created based on the UAV-captured images. If geometric distortions
are introduced during the enhancement process, the crack structures may be altered,
leading to a misalignment between the enhanced images and their corresponding labels.
This misalignment could reduce the accuracy of segmentation models trained on the
enhanced images.
The process begins with the detection of feature points using SIFT [51], applied
to both the UAV-based samples and the corresponding enhanced images. Feature points,
representing visually distinctive regions such as crack edges or corners, are extracted based
on their local structural properties. SIFT is renowned for its robustness in extracting feature
points that are invariant to scale, rotation, and partial illumination changes, making it
particularly suitable for tasks requiring structural consistency. Once the feature points are
detected, correspondences between the two images are established. Using the FLANN (Fast
Library for Approximate Nearest Neighbors) algorithm, the feature descriptors generated
by SIFT are compared to identify matching feature points across the original and enhanced
images. These matches represent points that are highly similar in local features and likely
correspond to the same physical locations in the two images, despite minor positional
discrepancies caused by the enhancement process. Then, we calculate the displacement
between each pair of matched feature points. Since the enhancement process does not
involve geometric transformations such as scaling or perspective changes, the feature
points in the original and enhanced images are inherently aligned within the same coor-
dinate system. This alignment allows for a straightforward comparison of their positions
without any need for additional transformations or coordinate adjustments. For each pair
of matched points, the displacement, referred to as the “offset distance”, reflects how much
the position of the corresponding feature in the enhanced image deviates from its original
location. Smaller offset distances indicate that the enhancement process has effectively
preserved the geometric structure of the original image, whereas larger distances may
suggest distortions or misalignments introduced during the enhancement. Finally, the
overall geometric consistency is assessed by averaging the offset distances across all of the
matched feature points. This average offset distance provides a robust and quantitative
measure of how well the enhancement process maintains the structural integrity of the
UAV-based images. In summary, the SIFT-based offset distance provides a reliable method
to evaluate the fidelity of image enhancement techniques in preserving critical geometric
details. Figure 7 visually represents this process, illustrating the detection, match, and
calculation of offset distances.
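For reference, this offset distance can be computed with OpenCV roughly as follows; the FLANN index parameters and the Lowe-style ratio test threshold are common defaults and are assumptions here rather than values reported above.

```python
# Sketch of the SIFT + FLANN offset distance between an original and an enhanced image.
import cv2
import numpy as np

def sift_offset_distance(original_path, enhanced_path, ratio=0.75):
    """Average displacement (pixels) between matched SIFT keypoints of two aligned images."""
    img1 = cv2.imread(original_path, cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread(enhanced_path, cv2.IMREAD_GRAYSCALE)

    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5), dict(checks=50))
    matches = flann.knnMatch(des1, des2, k=2)

    offsets = []
    for m, n in matches:
        if m.distance < ratio * n.distance:          # keep reliable matches only
            p1 = np.array(kp1[m.queryIdx].pt)
            p2 = np.array(kp2[m.trainIdx].pt)
            offsets.append(np.linalg.norm(p1 - p2))  # Euclidean offset of this match
    return float(np.mean(offsets)) if offsets else None

# Hypothetical usage: sift_offset_distance("uav_patch_012.png", "enhanced_patch_012.png")
```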
To evaluate the segmentation accuracy of various models, we selected four classical metrics: Precision, Recall, F1 Score, and Intersection over Union (IoU). In this study, crack pixels are considered positive (P) and background pixels are negative (N). True Positive (TP) refers to correctly predicted crack pixels; True Negative (TN) refers to correctly predicted background pixels; False Positive (FP) refers to background pixels mistakenly predicted as cracks; and False Negative (FN) refers to crack pixels mistakenly predicted as background. These metrics are defined as follows:

\mathrm{Precision} = \frac{TP}{TP + FP}  (5)

\mathrm{Recall} = \frac{TP}{TP + FN}  (6)

F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Recall} + \mathrm{Precision}}  (7)

\mathrm{IoU} = \frac{TP}{TP + FN + FP}  (8)

Figure 7. An illustration of the SIFT-based offset distance calculation process. (a) The detection and matching of SIFT feature points between the UAV-based original image (left) and the enhanced image (right). Green circles represent detected feature points, and the connecting lines indicate matched points across the two images. (b) A conceptual representation of matched feature points and the offset distance calculation. Blue points denote feature positions in the original image, while yellow points represent their corresponding positions in the enhanced image. The Euclidean distance between each matched point pair quantifies the displacement introduced by the enhancement process, serving as the SIFT-based offset distance.
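Equations (5)-(8) reduce to simple pixel-wise confusion counts; a small sketch (assuming binary masks with crack pixels equal to 1) is given below.

```python
# Sketch of Precision, Recall, F1, and crack IoU from binary masks.
import numpy as np

def segmentation_metrics(pred, gt, eps=1e-12):
    """pred and gt are binary numpy arrays of identical shape (1 = crack, 0 = background)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()

    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    crack_iou = tp / (tp + fn + fp + eps)
    return {"precision": precision, "recall": recall, "f1": f1, "crack_iou": crack_iou}
```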
3. Results
3.1. Evaluation of the Improved CycleGAN for Crack Image Enhancement
To validate the effectiveness of the Improved CycleGAN model, we designed and conducted ablation experiments. These ablation experiments systematically removed key improvement modules from the model to assess their individual contributions to the overall performance of image enhancement. We tested four versions of the model: The
Baseline-CycleGAN, which is the original CycleGAN model serving as the baseline;
dicted as cracks; and False Negative (FN) refers to crack pixels mistakenly predicted the as
SSIM-CycleGAN, a CycleGAN model incorporating SSIM loss in the loss function; the
background. These metrics are defined as follows:
RDB-CycleGAN, a CycleGAN model with the generator’s residual module replaced by
the RDB module; and the Improved CycleGAN, the final improved model that combines
both the SSIM loss and the RDB module. In the ablation experiments, the batch size was
set to 2, and the Adam optimizer with an initial learning rate of 1 × 10−4 was employed.
The weight decay was set to 0.001. For the loss function, the weight for cycle-consistency
loss (λ) was set to 10. If SSIM loss was involved, its weight (α) was set to 0.3. When using
Residual Dense Blocks (RDB) to replace the residual blocks in the generator, the number of
RDB units was set to 6. All four models were trained on the deblurring dataset including
90 clear crack images and 300 UAV-based samples.
The results of the ablation experiments are presented in Figure 8. The results demon-
strate that all of the models generate images that are visually clearer than the original,

with the cracks appearing more prominent. However, distinct differences can be observed
among the various models in terms of detail preservation. For typical samples, such as
row 1, the Baseline-CycleGAN produces images where crack edges are blurred, and the
crack shape deviates significantly from the original. The SSIM-CycleGAN generates cracks
with an overall shape closer to the original, though certain areas lose structural detail; for
instance, the upper section of the crack generated by SSIM-enhanced in row 1 is simplified
into a straight line. Additionally, similar to the baseline model, the cracks appear thicker
than in the original image. The RDB-CycleGAN shows an improvement in the alignment
of crack shapes and dimensions with the original, thanks to an enhanced local feature
extraction. However, the result exhibits a noticeable grid pattern, and the edges of the
cracks lack smoothness. In contrast, the Improved CycleGAN generates cracks with shapes
and dimensions closely matching the original. Moreover, the edges are smoother, and
the detailed structure of the cracks is better preserved compared to other models. For
samples with water stains, as seen in row 2, the general trends mirror those observed in
row 1. The SSIM-CycleGAN captures the overall crack shape with greater fidelity, while
the RDB-CycleGAN enhances the local detail consistency. The Improved CycleGAN again
produces the most accurate results, with the crack shape and size closely aligning with
the original image. It is worth noting that the presence of water stains can occasionally
introduce artifacts, which may impact subsequent segmentation tasks. In samples prone
to generating artifacts, such as those in row 3, the Baseline-CycleGAN, SSIM-CycleGAN,
and RDB-CycleGAN all exhibit varying degrees of artifact generation. However, Improved
CycleGAN, due to its improved capability in local feature extraction and maintaining over-
all image structure, preserves the image features consistent with the original and avoids
generating artifacts. For samples containing distracting objects in the background, such
as row 4, the Baseline-CycleGAN struggles to capture crack details accurately, leading
to inconsistencies in the local crack shape and an overall thicker appearance. Both the
RDB-CycleGAN and SSIM-CycleGAN improve the depiction of cracks, but noticeable
artifacts remain. The Improved CycleGAN delivers the most satisfactory results, with
the crack structure closely resembling the original and with minimal interference from
background objects, indicating that these distractions did not significantly affect the final
output. Considering the crack shape, size, and the clarity and smoothness of the crack
edges, the Improved CycleGAN demonstrates the most effective performance in image
enhancement. In the Improved CycleGAN model, SSIM and RDB each play distinct roles,
with their own strengths and limitations. SSIM loss helps the model preserve the overall
structural similarity between the enhanced and original images, particularly in maintaining
the general shape and continuity of the cracks. However, it can oversimplify fine details,
causing parts of the cracks to lose their intricate structure, and may result in slightly thicker
cracks than the original. The RDB module, on the other hand, enhances the model’s abil-
ity to capture local features and details, making the crack shapes and sizes align more
closely with the original image. However, this comes at the cost of introducing a grid-like
pattern, which reduces edge smoothness and can distort crack boundaries, potentially
affecting tasks requiring precise edge information. Improved CycleGAN combining these
two modules has the best performance on the image enhancement task.
For a quantitative analysis of the experimental results, the performance of the models is
evaluated based on SSIM and offset distance, as shown in Table 1. The Baseline-CycleGAN
model achieved an SSIM of 0.63 and an offset distance of 3.15 pixels. Adding the RDB
module slightly reduced the SSIM to 0.60 but significantly lowered the offset distance to
2.15 pixels. When incorporating the SSIM loss, the model’s SSIM increased to 0.64, with
an offset distance of 2.34 pixels. The fully improved CycleGAN, combining both the RDB module and SSIM loss, maintained an SSIM of 0.63 and achieved the best offset distance of 1.91 pixels.

Figure 8. A visual comparison of the crack image enhancement results from the ablation experiments. (a) Input images. (b) Results from the Baseline-CycleGAN. (c) Results from the SSIM-CycleGAN. (d) Results from the RDB-CycleGAN. (e) Results from the Improved CycleGAN.
Table 1. An evaluation of the ablation experiments on crack image enhancement models.

Model Variant SSIM [%] Offset Distance [px]


Baseline (CycleGAN) 63 3.15
Baseline + RDB 60 2.15
Baseline + SSIM 64 2.34
Improved CycleGAN (Full Model) 63 1.91

The decrease in offset distance when RDB is introduced suggests that RDB effectively
preserves geometric features in the image, especially in maintaining local details such as
crack edges and shapes. This explains the more precise crack boundaries observed in the
RDB-enhanced models. However, the grid-like pattern introduced by RDB disrupts the
smoothness of the image, which may explain why the SSIM value drops when RDB is
included. The SSIM metric evaluates brightness, contrast, and structure, and the grid effect
could negatively impact the perceived structural consistency, leading to a lower SSIM score
despite a better local detail retention. On the other hand, SSIM loss improves the structural
consistency of the cracks on a global level, as reflected by the slight increase in the SSIM
value when SSIM loss is applied. However, it does not enhance local details as effectively
as RDB, which explains why the offset distance does not decrease as much. The balance
between global consistency (SSIM) and local detail preservation (offset distance) becomes
clearer in the complete model, which integrates both approaches to minimize geometric
distortion and preserve the overall crack shape. The relatively low SSIM values across all
of the models can be attributed to the variations in brightness and contrast between the
two domains used in this study. The data from the Y domain are clear and well lit, while
the X domain (drone-captured images) has an inconsistent brightness and contrast due
to environmental factors. As SSIM is sensitive to these factors, the background changes
during enhancement contribute to the lower overall SSIM values. This variation highlights
the challenge of maintaining both visual quality and structural consistency when working
with diverse image sets.
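For reference, the two quantities reported in Table 1 can be reproduced with standard tools. The sketch below is not the exact evaluation script used in this study; it assumes grayscale inputs, uses scikit-image's structural_similarity for SSIM, and approximates the SIFT-based offset distance as the mean displacement of keypoints matched with Lowe's ratio test (the 0.75 threshold is an illustrative choice).

```python
import cv2
import numpy as np
from skimage.metrics import structural_similarity as ssim


def ssim_score(reference_gray: np.ndarray, enhanced_gray: np.ndarray) -> float:
    """Structural similarity between two grayscale uint8 images."""
    return ssim(reference_gray, enhanced_gray, data_range=255)


def sift_offset_distance(original_gray: np.ndarray, enhanced_gray: np.ndarray) -> float:
    """Mean pixel displacement of SIFT keypoints matched between the two images."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(original_gray, None)
    kp2, des2 = sift.detectAndCompute(enhanced_gray, None)
    if des1 is None or des2 is None:
        return float("nan")

    # Keep only reliable correspondences via Lowe's ratio test on 2-NN matches.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    pairs = matcher.knnMatch(des1, des2, k=2)
    good = [p[0] for p in pairs if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    if not good:
        return float("nan")

    # Offset distance = average Euclidean shift between matched keypoint coordinates.
    shifts = [
        np.linalg.norm(np.array(kp1[m.queryIdx].pt) - np.array(kp2[m.trainIdx].pt))
        for m in good
    ]
    return float(np.mean(shifts))
```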

3.2. Comparison with Supervised Crack Semantic Segmentation Networks


The proposed E2S framework was compared with existing methods to evaluate its
performance. Four mainstream crack semantic segmentation algorithms were selected for
comparison: Attention U-Net [47], Deeplab3+ [50], Unet++ [51], and U-Net [33].
Deeplabv3+ uses MobileNetv2 as its backbone. For the E2S framework, we utilized the
enhanced output from stage 1 as the input for the segmentation model, which is based on
Attention U-Net, as described in Section 2.3. It is important to note that while these four
segmentation models are widely used in crack segmentation tasks, they are not the latest
state-of-the-art (SOTA) models. The purpose of this comparison is to evaluate whether
the E2S framework, specifically designed for UAV-based crack images, can outperform
traditional supervised models by integrating an image enhancement stage tailored for
UAV-captured data. Additionally, the segmentation model used in stage 2 of E2S is not
the latest SOTA model since the focus is not on developing an advanced supervised
model, but rather on demonstrating the benefits of integrating an unsupervised image
enhancement for improving segmentation performance. All of the networks were trained
using the same hyperparameters and the Dice loss function. To prevent overfitting, L1
and L2 regularization were incorporated into the Dice loss. The batch size was set to 2,
and the Adam optimizer with an initial learning rate of 1e-3 was used. The weight decay
was set to 0.001. Additionally, a learning rate decay strategy was implemented during
training to enhance efficiency and stability, and an early stopping strategy was adopted to
prevent overfitting. Specifically, if the validation performance does not show a significant
improvement (or even declines) for eight consecutive epochs, the training is terminated to
avoid unnecessary overfitting.
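The following PyTorch sketch illustrates the training configuration described above (Dice loss with L1/L2 penalties, Adam with a 1e-3 learning rate and 0.001 weight decay, learning-rate decay, and early stopping after eight stagnant epochs). It is a minimal illustration rather than the authors' code: model, train_loader, and val_loader are placeholders, and the L1/L2 weights and scheduler settings are assumed values not specified in the text.

```python
import torch


def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice loss for a single foreground (crack) channel."""
    pred = torch.sigmoid(pred)
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()


def regularized_loss(pred, target, model, l1_w=1e-5, l2_w=1e-5):
    """Dice loss plus explicit L1/L2 penalties on the model weights (weights are assumptions)."""
    l1 = sum(p.abs().sum() for p in model.parameters())
    l2 = sum((p ** 2).sum() for p in model.parameters())
    return dice_loss(pred, target) + l1_w * l1 + l2_w * l2


def train(model, train_loader, val_loader, device="cuda", epochs=200, patience=8):
    # train_loader is assumed to yield batches of 2 images, matching the batch size in the text.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-3)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
    best_val, epochs_without_gain = float("inf"), 0
    for epoch in range(epochs):
        model.train()
        for images, masks in train_loader:
            images, masks = images.to(device), masks.to(device)
            optimizer.zero_grad()
            loss = regularized_loss(model(images), masks, model)
            loss.backward()
            optimizer.step()
        scheduler.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(
                dice_loss(model(x.to(device)), y.to(device)).item() for x, y in val_loader
            ) / max(len(val_loader), 1)

        # Early stopping: terminate after `patience` epochs without validation improvement.
        if val_loss < best_val:
            best_val, epochs_without_gain = val_loss, 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                break
```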
The results presented in Table 2 show that the proposed E2S framework achieves the
highest performance across all metrics compared to the other models. E2S achieves a preci-
sion of 82.5%, recall of 80.3%, F1-score of 81.3%, and crack IoU of 71.84%, outperforming
the other methods by a notable margin. Among the comparison models, Unet++ ranks
second with an F1-score of 79.6% and a crack IoU of 70.03%, followed by Attention U-Net,
which performs slightly lower at 78.5% F1 and 69.8% crack IoU. U-Net and Deeplab3+
exhibit a lower overall performance, with Deeplab3+ yielding the weakest results at a 72.0%
F1 and 63.9% crack IoU.

Table 2. A comparison of the segmentation performance across different models.

Models | Precision [%] | Recall [%] | F1 [%] | Crack_IoU [%]
E2S | 82.5 | 80.3 | 81.3 | 71.8
Unet++ | 80.1 | 79.4 | 79.6 | 70.03
Attention U-Net | 78.3 | 78.8 | 78.5 | 69.8
U-Net | 73.7 | 75.5 | 74.6 | 65.7
Deeplab3+ | 69.4 | 74.7 | 72.0 | 63.9
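For clarity, the pixel-level metrics in Table 2 can be computed from binary prediction and ground-truth masks as sketched below. This is a generic illustration of the standard definitions (crack IoU is the IoU of the foreground class only), not the evaluation code used in the paper.

```python
import numpy as np


def crack_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-9) -> dict:
    """Precision, recall, F1, and crack IoU for binary masks (1 = crack pixel)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()

    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    crack_iou = tp / (tp + fp + fn + eps)  # IoU of the crack (foreground) class only
    return {"precision": precision, "recall": recall, "f1": f1, "crack_iou": crack_iou}
```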

Figure 9 presents the segmentation results of five networks on the test set of the levee
crack dataset. For samples without a significant background interference, such as the one
in the first row and column (a), all networks deliver relatively accurate predictions with
no noticeable noise. Among them, the E2S framework produces the most precise results,
closely matching the ground truth in terms of crack location and morphology. For samples
with background interference, as depicted in the second and third rows of column (a), the
networks exhibit varying levels of performance. For samples with water stains (the second
row in column (a)), most networks effectively ignore the interference, although Deeplab3+
introduces some minor noise. For samples with fine cracks and low contrast (the third row
in column (a)), networks such as Unet++, U-Net, and Deeplab3+ miss significant portions
of the cracks, resulting in an incomplete segmentation. A small part of the crack is also
missed in the output of Attention U-Net, whereas E2S successfully maintains the continuity
of the predicted cracks. Lastly, in samples with noise resembling cracks (the fourth row
in column (a)), all of the networks incorrectly classify some of these elements as cracks,
leading to block-like noise. Unet++ exhibits the least noise, followed by E2S and Attention
U-Net. In terms of crack completeness, E2S demonstrates the highest level of prediction
accuracy across all of the models.

Figure 9. Crack segmentation results of five networks on the test set. (a) Input image. (b) Ground truth. (c) Results from E2S. (d) Results from UNet++. (e) Results from Attention U-Net. (f) Results from U-Net. (g) Results from Deeplab3+.
The key advantage of E2S lies in its ability to enhance crack boundaries and define crack structures more clearly in the first stage, which effectively reduces the probability of FNs. This, in turn, significantly increases precision in both crack location and morphology. For example, the image in the third row, column (a) of Figure 9, is prone to false negatives. It can be observed that all of the models except E2S fail to produce a complete crack with clear edges due to excessive FNs. To highlight the difference in FN performance, we marked the FN pixels in the outputs of the top three models ranked by IoU, as shown in Figure 10, where FN pixels are visualized in yellow. It is evident that E2S produces significantly fewer FN pixels in regions with a high FN frequency.
Figure 10. A comparison of the False Negative (FN) performance among E2S, Attention U-Net, and U-Net++. The window highlights the area prone to FN errors. FN pixels are visualized in yellow. "Attn" denotes Attention.
Although E2S preserves continuous and well defined crack shapes, the enhancement process in Stage 1 does not effectively suppress noise. Here, noise refers to background pixels misclassified as cracks, often due to their resemblance to crack features, leading to false positives (FPs). We hypothesize that the enhancement process in Stage 1 may amplify both cracks and noise, as the network treats them similarly. As a result, noise reduction depends more on the supervised segmentation model in Stage 2. For example, in the last row, column (a) of Figure 9, E2S demonstrates a superior precision in crack morphology, yet its noise level remains comparable to other models. Specifically, E2S and Attention U-Net exhibit similar levels of noise, and their primary difference lies in Stage 1. To sum up, while E2S demonstrates clear advantages in capturing crack continuity and morphology, its performance in reducing noise and resisting background interference is less effective, indicating that improvements in these areas may depend on the segmentation model itself.
Furthermore, in levee crack semantic segmentation, DeepLabv3+ underperforms due to architectural limitations. It struggles with multi-scale feature reuse and faces challenges with the extreme class imbalance, where cracks occupy a small portion of the image. DeepLabv3+'s lack of cross-layer connections between compressed features and up-sampling hampers its ability to accurately segment fine cracks, as it cannot effectively leverage global information. Conversely, U-Net-based networks, which establish crucial cross-layer connections and incorporate high-dimensional features from both lower and higher levels, perform better. This architecture enables U-Net models to utilize multi-scale information more effectively, making them better suited for crack segmentation tasks with extreme class imbalances. As a result, U-Net-based networks outperform DeepLabv3+ in terms of preserving and segmenting crack details, highlighting the significance of architecture design in addressing the unique challenges posed by crack segmentation tasks.

4. Discussion
This study introduces a novel two-stage framework, E2S, for UAV-based levee crack detection. In the first stage, an unsupervised model based on Improved CycleGAN is applied to enhance the UAV-based images, primarily focusing on motion deblurring. The second stage uses an Attention U-Net model to segment crack regions from the enhanced images. Experimental results demonstrate that the E2S framework outperforms conventional supervised models in terms of segmentation accuracy. The findings of this study align with previous research [52,53], which demonstrated that motion deblurring enhances the accuracy of UAV-based crack inspection. Similar to earlier studies, this research confirms that image enhancement is an effective way to improve the performance of downstream tasks, such as crack segmentation.
E2S offers a potential solution to the image blurring issues commonly encountered in drone-based inspections by incorporating an unsupervised enhancement algorithm. By improving segmentation accuracy, E2S enables a more reliable automated monitoring, which has significant practical implications. For instance, key crack features, such as length, can be extracted from segmented images captured during UAV inspections at fixed intervals. This proactive monitoring supports long-term maintenance, reducing the risk of structural failures and enhancing infrastructure resilience and safety.
Despite the promising results, this study has several limitations. One notable limitation is the fact that the clear images used for the deblurring process were not captured in the same application scenarios. This introduces domain differences, such as variations in scene, lighting, and camera angles, which may prevent the model from accurately learning the blur characteristics typical of real-world scenarios. These discrepancies could result in a suboptimal visual recovery of fine details during practical applications. In the UAV-based levee inspection, where obtaining perfectly clear images is often challenging, leveraging data augmentation and domain adaptation techniques could help bridge this gap by increasing image diversity and reducing domain differences.
could helpcould helpthis
bridge bridge
gapthis
by
gap by increasing image diversity and reducing
increasing image diversity and reducing domain differences. domain differences.
Another
Another limitation
limitation isis the
the potential
potential for
for generating
generating visual
visual artifacts
artifacts during
during the
the image
image
enhancement stage. These artifacts arise from the unsupervised model
enhancement stage. These artifacts arise from the unsupervised model used for deblur- used for deblur-
ring,
ring, which
whichinevitably
inevitablyadjusts
adjuststhe background
the background duedueto the imperfect
to the alignment
imperfect of clear
alignment and
of clear
blurred samples. While these adjustments sometimes enhance cracks,
and blurred samples. While these adjustments sometimes enhance cracks, they can also they can also unin-
tentionally amplify
unintentionally background
amplify background elements
elementsresembling
resembling cracks, leading
cracks, leadingtotovisual
visualartifacts.
artifacts.
Such
Such artifacts persist despite fine-tuning hyper parameters and may adversely affect seg-
artifacts persist despite fine-tuning hyper parameters and may adversely affect seg-
mentation
mentation accuracy. Examples exhibiting
accuracy. Examples exhibiting these
these artifacts are shown
artifacts are shown inin Figure
Figure 11,
11, where
where the
the
affected regions are circled with red rectangles for clarity. Future research
affected regions are circled with red rectangles for clarity. Future research should focus onshould focus
on methods
methods to minimize
to minimize unintended
unintended background
background changes
changes in unsupervised
in the the unsupervised deblurring
deblurring pro-
process or develop artifact detection techniques that can identify and mitigate
cess or develop artifact detection techniques that can identify and mitigate these distortions these dis-
tortions before
before the the segmentation
segmentation stage. stage.

11. Instances
Figure 11.
Figure Instancesof
ofartifacts
artifactsleading
leadingtoto
incorrect segmentation
incorrect segmentationresults. (a,b)
results. Original
(a,b) UAV-captured
Original UAV-cap-
images. (c,d) The corresponding enhanced images, where artifacts impacted segmentation
tured images. (c,d) The corresponding enhanced images, where artifacts impacted segmentation accuracy.
accuracy.
Other limitations include the absence of a precise evaluation metric to assess crack fidelity after enhancement. While the SIFT-based offset distance quantifies geometric structure preservation during the enhancement process, it primarily measures the overall displacement of local feature points and does not capture subtle changes in crack shape or edge details. Furthermore, this metric does not specifically evaluate crack regions, potentially overlooking minor shifts in crack position and morphology. Thus, it may fail to provide a comprehensive assessment of crack fidelity. To more accurately evaluate crack preservation, new metrics tailored specifically to crack shape and position should be developed, offering a more precise reflection of whether the enhancement stage distorts critical crack information.
Lastly, the architecture of the E2S framework can be improved. While the first stage's deblurring method improves crack visibility, it may not be the optimal approach for preserving crack position and morphology. Additionally, the segmentation model used in the second stage is not a state-of-the-art (SOTA) model, which limits the framework's performance. Future iterations of E2S should explore more advanced architectures for both the enhancement and segmentation stages, potentially leveraging cutting-edge models to unlock further improvements.
In summary, while the E2S framework demonstrates a considerable potential for
improving UAV-based crack detection by enhancing crack visibility and segmentation
accuracy, it also introduces risks related to visual artifacts and domain gaps. Future
research incorporating more advanced techniques for deblurring, noise reduction, and
model architecture, along with methods to manage artifacts, could result in significant
performance improvements and help advance the field of UAV-based crack inspection.

5. Conclusions
In this study, we proposed the Enhance2Segment (E2S) framework, specifically de-
signed for UAV-based levee crack detection. E2S consists of two stages: the first stage
involves image enhancement through motion deblurring, and the second stage focuses on
crack segmentation. The experimental results demonstrate that E2S significantly outper-
forms traditional supervised models, achieving an F1-score of 81.3% and a crack IoU of
71.84%. These results indicate that incorporating an unsupervised image enhancement prior
to segmentation substantially improves the accuracy of downstream tasks. Consequently,
the E2S framework offers a promising solution for automated levee crack monitoring.
Future work can focus on two key areas: addressing the challenge of acquir-
ing clear images for training under unpaired data conditions, and exploring more advanced
architectures for both the enhancement and segmentation stages, potentially leveraging
state-of-the-art models to achieve further performance gains.

Author Contributions: Conceptualization: X.W., F.W. and Z.W.; methodology: X.W., F.W., Z.W., D.W.,
H.H. and Y.Z.; validation: F.W. and D.W.; investigation: F.W. and D.W.; data curation: F.W. and H.H.;
writing—original draft preparation: F.W.; writing—review and editing: X.W. and X.L.; supervision:
Z.W. and Y.Z.; project administration: X.W.; funding acquisition: X.W. and Z.W. All authors have
read and agreed to the published version of the manuscript.

Funding: This work was supported by the National Natural Science Foundation of China (52479015,
52379010), the Fundamental Research Funds for the Central Universities (2024ZYGXZR084),
the Basic and Applied Basic Research Foundation of Guangdong Province (2023A1515030191,
2022A1515010019), and the Fund of Science and Technology Program of Guangzhou (2024A04J3674).

Data Availability Statement: The data that support the findings of this study are available from the
corresponding author upon reasonable request.

Conflicts of Interest: The authors declare no conflicts of interest.

Abbreviations
The following abbreviations are used in this manuscript:

UAV unmanned aerial vehicle


RDB residual dense blocks
SSIM Structural Similarity Index Measure
SIFT Scale-Invariant Feature Transform

Appendix A
Appendix A presents the mathematical foundations of the improvements made in the Improved CycleGAN.

Appendix A.1
Appendix A.1 describes the mathematical foundations and architectural details of the RDB.
An RDB is a powerful architectural unit designed to enhance feature extraction in deep
networks. It consists of multiple layers, each comprising a convolution operation followed
by an activation function. The outputs of each layer, as well as the outputs from the
previous RDB, are directly connected to all subsequent layers. This dense connectivity
allows the model to preserve critical local features across multiple layers, ensuring that
fine details, such as crack edges and shapes, are not lost as the image passes through the
network. This mechanism also enhances the flow of information and gradients, enabling
the network to effectively learn both local and global features crucial for tasks like image
reconstruction. Specifically, the output of the c-th convolution layer of d-th RDB can be
formulated as follows:

Fd,c = σ (Wd,c [ Fd−1 , Fd,1 , · · · , Fd,c−1 ]) (A1)

where σ represents the ReLU activation function [54]. Wd,c is the weights of the c-th
convolution layer, where the bias term is omitted for simplicity. [ Fd−1 , Fd,1 , · · · , Fd,c−1 ] is
the concatenation of outputs of the preceding RDB and the previous layers within the
current RDB. In addition to this dense connectivity, the RDB incorporates Local Feature
Fusion (LFF), which adaptively integrates features from both the preceding RDB and the
convolutional layers within the current RDB. LFF ensures that the model retains high-
resolution local features critical for tasks like accurately reconstructing fine image details
(e.g., crack textures). The fusion process can be expressed as follows:

Fd,LFF = H d LFF ([ Fd−1 , Fd,1 , · · · , Fd,c , · · · , Fd,C ]) (A2)

where H d LFF denotes the function of a 1 × 1 convolution layer in the d-th RDB.
[ Fd−1 , Fd,1 , · · · , Fd,c−1 ] refers to the concatenation of outputs from the previous RDB and
all the convolution layers in the current RDB. Finally, Local Residual Learning (LRL) is
introduced to further improve the information flow. The final output of the dth RDB can be
computed by the following:
Fd = Fd−1 + Fd,LFF (A3)

In summary, the RDB’s architecture with dense connectivity, local feature fusion, and
residual learning enables the network to extract and preserve crucial local image features,
which is fundamental for the reconstruction of crack images.
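As an illustration of Equations (A1)-(A3), a compact PyTorch version of an RDB might look as follows. The channel width, growth rate, and number of layers are illustrative assumptions, not the configuration used in the Improved CycleGAN generator.

```python
import torch
import torch.nn as nn


class ResidualDenseBlock(nn.Module):
    def __init__(self, channels: int = 64, growth: int = 32, num_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList()
        for c in range(num_layers):
            # Eq. (A1): each layer sees F_{d-1} and all previous layer outputs, concatenated.
            self.layers.append(
                nn.Sequential(
                    nn.Conv2d(channels + c * growth, growth, kernel_size=3, padding=1),
                    nn.ReLU(inplace=True),
                )
            )
        # Eq. (A2): 1x1 convolution for local feature fusion over all concatenated features.
        self.lff = nn.Conv2d(channels + num_layers * growth, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        fused = self.lff(torch.cat(features, dim=1))
        return x + fused  # Eq. (A3): local residual learning
```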

Appendix A.2
Appendix A.2 provides the detailed mathematical formulation of the loss functions
involved in Improved CycleGAN, including adversarial loss, enhanced cycle consistency
loss, and the final loss function.
Adversarial losses are applied to restrict the translated images to follow data distri-
bution in the target domain. For generator G and its corresponding discriminator DY , the
adversarial loss is defined as follows:

LGAN(G, DY, X, Y) = Ey∼pdata(y)[log DY(y)] + Ex∼pdata(x)[log(1 − DY(G(x)))]   (A4)

where DY aims to distinguish real images y from generated images G(x), and G attempts to
generate images G(x) that can deceive DY . A similar formulation applies for the reverse
mapping F and its discriminator DX, where the adversarial loss for F is defined as follows:

LGAN(F, DX, Y, X) = Ex∼pdata(x)[log DX(x)] + Ey∼pdata(y)[log(1 − DX(F(y)))]   (A5)
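A minimal PyTorch rendering of the adversarial terms in Equations (A4) and (A5) is sketched below for the G/DY direction (the F/DX direction is symmetric). It assumes DY outputs probabilities in (0, 1); practical CycleGAN implementations often substitute a least-squares loss, which is not shown here.

```python
import torch


def adversarial_loss_D(D_Y, G, real_y: torch.Tensor, real_x: torch.Tensor) -> torch.Tensor:
    """Discriminator objective: log D_Y(y) + log(1 - D_Y(G(x))), negated for minimization."""
    eps = 1e-7
    real_term = torch.log(D_Y(real_y) + eps).mean()
    fake_term = torch.log(1 - D_Y(G(real_x).detach()) + eps).mean()
    return -(real_term + fake_term)


def adversarial_loss_G(D_Y, G, real_x: torch.Tensor) -> torch.Tensor:
    """Generator objective: fool D_Y by minimizing log(1 - D_Y(G(x)))."""
    eps = 1e-7
    return torch.log(1 - D_Y(G(real_x)) + eps).mean()
```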

In the absence of paired training data, a network with a sufficient capacity could map
inputs to any permutation in the target domain, potentially leading to distorted outputs or
a lack of fine details and structural accuracy expected from the target domain. To address
this, cycle consistency loss is introduced to ensure that an image, when translated from one
domain to another and then back to the original domain, remains consistent. The cycle
consistency loss is traditionally calculated using L1 loss, formulated as follows:

Lcyc(G, F) = Ex∼pdata(x)[∥F(G(x)) − x∥1] + Ey∼pdata(y)[∥G(F(y)) − y∥1]   (A6)

However, L1 loss alone does not effectively capture perceptual and structural details,
which may result in the loss of important features. By incorporating SSIM loss into the
cycle consistency loss, the Improved CycleGAN ensures that the generated images preserve
structural details more faithfully, maintaining a closer alignment with the original images.
The SSIM loss compares the structural similarity between the reconstructed and original
images. It considers three key components: luminance, contrast, and structure. The SSIM
loss between two images x and y is defined as follows:

LSSIM(x, y) = 1 − [(2µxµy + C1)(2σxy + C2)] / [(µx² + µy² + C1)(σx² + σy² + C2)]   (A7)

where µx and µy are the means of x and y, σx² and σy² are the variances, and σxy is the covariance between x and y. C1 and C2 are constants to stabilize the division.
The enhanced cycle consistency loss that combines both the L1 loss and SSIM loss is
given by the following:

Lcyc_imp(G, F) = (1 − α)Lcyc + αEy∼pdata(y)[1 − SSIM(G(F(y)), y)] + αEx∼pdata(x)[1 − SSIM(F(G(x)), x)]   (A8)

where α controls the relative importance between the L1 loss and the SSIM loss. Finally, the
complete objective function for the Improved CycleGAN model, combining adversarial
and enhanced cycle consistency losses, is expressed as follows:

Limp (G, F, D X , DY ) = LGAN ( G, DY , X, Y ) + LGAN ( F, DX , Y, X ) + λLcyc_imp ( G, F ) (A9)

where λ balances the relative importance between adversarial loss and enhanced cycle
consistency loss.
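To make the combination in Equations (A6)-(A9) concrete, the enhanced cycle-consistency term of Equation (A8) can be sketched as follows. The SSIM implementation from the pytorch-msssim package is an assumed stand-in, and α = 0.5 is only a placeholder for the balance weight.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim  # assumed third-party SSIM implementation


def enhanced_cycle_loss(x, y, rec_x, rec_y, alpha: float = 0.5) -> torch.Tensor:
    """rec_x = F(G(x)), rec_y = G(F(y)); images are expected in [0, 1]."""
    l1_term = F.l1_loss(rec_x, x) + F.l1_loss(rec_y, y)  # Eq. (A6)
    ssim_term = (1 - ssim(rec_x, x, data_range=1.0)) + (1 - ssim(rec_y, y, data_range=1.0))  # Eq. (A7)
    return (1 - alpha) * l1_term + alpha * ssim_term  # Eq. (A8)
```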

References
1. Rafiei, M.H.; Adeli, H. A novel machine learning-based algorithm to detect damage in high-rise building structures. Struct. Des.
Tall Spec. Build. 2017, 26, e1400. [CrossRef]
2. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image
Computing and Computer-Assisted Intervention—MICCAI 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer
International Publishing: Cham, Switzerland, 2015; pp. 234–241.
3. Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach.
Intell. 2017, 39, 640–651. [CrossRef]
4. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation.
IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [CrossRef]
5. Zou, Q.; Zhang, Z.; Li, Q.; Qi, X.; Wang, Q.; Wang, S. DeepCrack: Learning Hierarchical Convolutional Features for Crack
Detection. IEEE Trans. Image Process. 2019, 28, 1498–1512. [CrossRef] [PubMed]
6. Feng, C.; Zhang, H.; Wang, H.; Wang, S.; Li, Y. Automatic Pixel-Level Crack Detection on Dam Surface Using Deep Convolutional
Network. Sensors 2020, 20, 2069. [CrossRef]
7. Ren, Q.; Li, M.; Yang, S.; Zhang, Y.; Bai, S. Pixel-level shape segmentation and feature quantification of hydraulic concrete cracks
based on digital images. J. Hydroelectr. Eng. 2021, 40, 234–246.

8. Sun, X.; Xie, Y.; Jiang, L.; Cao, Y.; Liu, B. DMA-Net: DeepLab with Multi-Scale Attention for Pavement Crack Segmentation. IEEE
Trans. Intell. Transp. Syst. 2022, 23, 18392–18403. [CrossRef]
9. Liu, H.; Yang, J.; Miao, X.; Mertz, C.; Kong, H. CrackFormer Network for Pavement Crack Segmentation. IEEE Trans. Intell.
Transp. Syst. 2023, 24, 9240–9252. [CrossRef]
10. Kapoor, C.; Warrier, A.; Singh, M.; Narang, P.; Puppala, H.; Rallapalli, S.; Singh, A.P. Fast and Lightweight UAV-based Road
Image Enhancement Under Multiple Low-Visibility Conditions. In Proceedings of the 2023 IEEE International Conference on
Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops), Atlanta, GA, USA,
13–17 March 2023; pp. 154–159.
11. Wang, J.; Li, Y.; Chen, W. UAV Aerial Image Generation of Crucial Components of High-Voltage Transmission Lines Based on
Multi-Level Generative Adversarial Network. Remote Sens. 2023, 15, 1412. [CrossRef]
12. Richardson, W.H. Bayesian-Based Iterative Method of Image Restoration*. J. Opt. Soc. Am. 1972, 62, 55–59. [CrossRef]
13. Helstrom, C.W. Image Restoration by the Method of Least Squares. J. Opt. Soc. Am. 1967, 57, 297–303. [CrossRef]
14. Bahat, Y.; Efrat, N.; Irani, M. Non-uniform Blind Deblurring by Reblurring. In Proceedings of the IEEE International Conference
on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 3306–3314.
15. Cho, S.; Lee, S. Fast Motion Deblurring; Association for Computing Machinery: New York, NY, USA, 2009; Volume 28, pp. 1–8.
[CrossRef]
16. Fergus, R.; Singh, B.; Hertzmann, A.; Roweis, S.T.; Freeman, W.T. Removing camera shake from a single photograph. In
Proceedings of the SIGGRAPH’06: Special Interest Group on Computer Graphics and Interactive Techniques Conference, Boston,
MA, USA, 30 July–3 August 2006; Association for Computing Machinery: New York, NY, USA, 2006; pp. 787–794.
17. Chakrabarti, A. A Neural Approach to Blind Motion Deblurring. arXiv 2016, arXiv:1603.04771.
18. Vitoria, P.A.G.S. Event-Based Image Deblurring with Dynamic Motion Awareness. In Computer Vision—ECCV 2022 Workshops;
Springer: Cham, Switzerland, 2023; pp. 95–112.
19. Park, D.A.K.D. Multi-Temporal Recurrent Neural Networks for Progressive Non-uniform Single Image Deblurring with Incre-
mental Temporal Training. In Computer Vision—ECCV 2020; Springer: Cham, Switzerland, 2020; pp. 327–343.
20. Zhang, H.; Zhang, L.; Dai, Y.; Li, H.; Koniusz, P. Event-guided Multi-patch Network with Self-supervision for Non-uniform
Motion Deblurring. Int. J. Comput. Vis. 2023, 131, 453–470. [CrossRef]
21. Kaufman, A.; Fattal, R. Deblurring Using Analysis-Synthesis Networks Pair. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5810–5819.
22. Wu, X.; Guo, S.; Yin, J.; Yang, G.; Zhong, Y.; Liu, D. On the event-based extreme precipitation across China: Time distribution
patterns, trends, and return levels. J. Hydrol. 2018, 562, 305–317. [CrossRef]
23. Ramakrishnan, S.; Pachori, S.; Gangopadhyay, A.; Raman, S. Deep Generative Filter for Motion Deblurring. In Proceedings of the
IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 2993–3000.
24. Kupyn, O.; Martyniuk, T.; Wu, J.; Wang, Z. DeblurGAN-v2: Deblurring (Orders-of-Magnitude) Faster and Better. In Proceedings of
the IEEE International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019; pp. 8877–8886.
25. Peng, J.; Guan, T.; Liu, F.; Liang, J. MND-GAN: A Research on Image Deblurring Algorithm Based on Generative Adversarial
Network. In Proceedings of the 42nd Chinese Control Conference (CCC), Tianjin, China, 24–26 July 2023; pp. 7584–7589.
26. Zhang, K.; Luo, W.; Zhong, Y.; Ma, L.; Stenger, B.; Liu, W.; Li, H. Deblurring by Realistic Blurring. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 2734–2743.
27. Zhang, J.; Pan, J.; Ren, J.; Song, Y.; Bao, L.; Lau, R.W.H.; Yang, M. Dynamic Scene Deblurring Using Spatially Variant Recurrent
Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA,
18–22 June 2018; pp. 2521–2529.
28. LipingLIU, J.S. Overview of Blind Deblurring Methods for Single Image. J. Front. Comput. Sci. Technol. 2022, 16, 552–564.
[CrossRef]
29. Ren, W.; Zhang, J.; Pan, J.; Liu, S.; Ren, J.S.; Du, J.; Cao, X.; Yang, M. Deblurring Dynamic Scenes via Spatially Varying Recurrent
Neural Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 3974–3987. [CrossRef]
30. Tao, X.; Gao, H.; Shen, X.; Wang, J.; Jia, J. Scale-Recurrent Network for Deep Image Deblurring. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8174–8182.
31. Wu, X.; Wang, Z.; Guo, S.; Liao, W.; Zeng, Z.; Chen, X. Scenario-based projections of future urban inundation within a coupled
hydrodynamic model framework: A case study in Dongguan City, China. J. Hydrol. 2017, 547, 428–442. [CrossRef]
32. Wang, Z.; Cun, X.; Bao, J.; Liu, J. Uformer: A General U-Shaped Transformer for Image Restoration. In Proceedings of
the IEEE/CVF Conference on Computer Vision And Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022;
pp. 17662–17672.
33. Tsai, F.A.P.Y. Stripformer: Strip Transformer for Fast Image Deblurring. In Computer Vision—ECCV 2022; Springer: Cham,
Switzerland, 2022; pp. 146–162.

34. Zou, Y.; Ma, Y. Edgeformer: Edge-Enhanced Transformer for High-Quality Image Deblurring. In Proceedings of the 2023 IEEE
International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 504–509.
35. Wu, Y.; Liang, L.; Ling, S.; Gao, Z. Hierarchical Patch Aggregation Transformer for Motion Deblurring. Neural Process. Lett. 2024,
56, 139. [CrossRef]
36. Wu, X.; Guo, S.; Qian, S.; Wang, Z.; Lai, C.; Li, J.; Liu, P. Long-range precipitation forecast based on multipole and preceding
fluctuations of sea surface temperature. Int. J. Clim. 2022, 42, 8024–8039. [CrossRef]
37. Madam, N.T.; Kumar, S.; Rajagopalan, A.N. Unsupervised Class-Specific Deblurring. In Proceedings of the European Conference
on Computer Vision (ECCV), Munich, Germany, 8 September 2018; Springer: Berlin, Heidelberg, 2018; pp. 358–374.
38. Wen, Y.; Chen, J.; Sheng, B.; Chen, Z.; Li, P.; Tan, P.; Lee, T. Structure-Aware Motion Deblurring Using Multi-Adversarial
Optimized CycleGAN. IEEE Trans. Image Process. 2021, 30, 6142–6155. [CrossRef] [PubMed]
39. Zhu, J.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In
Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2242–2251.
40. Lu, B.; Chen, J.; Chellappa, R. UID-GAN: Unsupervised Image Deblurring via Disentangled Representations. IEEE Trans. Biom.
Behav. Identity Sci. 2020, 2, 26–39. [CrossRef]
41. Zhao, S.; Zhang, Z.; Hong, R.; Xu, M.; Yang, Y.; Wang, M. FCL-GAN: A Lightweight and Real-Time Baseline for Unsupervised
Blind Image Deblurring. In Proceedings of the 30th ACM International Conference on Multimedia, MM ’22, Lisboa, Portugal,
10–14 October 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 6220–6229.
42. Pham, B.; Tran, P.; Tran, A.; Pham, C.; Nguyen, R.; Hoai, M. Blur2Blur: Blur Conversion for Unsupervised Image Deblurring on
Unknown Domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA,
USA, 17–21 June 2024; pp. 2804–2813.
43. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.;
Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 2021,
8, 53. [CrossRef] [PubMed]
44. Zhou, K.; Liu, Z.; Yu, Q.; Xiang, T.; Chen, C.L. Domain Generalization: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45,
4396–4415. [CrossRef]
45. Archana, R.; Jeevaraj, P.S.E. Deep learning models for digital image processing: A review. Artif. Intell. Rev. 2024, 57, 11. [CrossRef]
46. Li, C.; Wand, M. Precomputed Real-Time Texture Synthesis with Markovian Generative Adversarial Networks. In Computer
Vision—ECCV 2016; Springer: Cham, Switzerland, 2016.
47. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
48. Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual Dense Network for Image Restoration. IEEE Trans. Pattern Anal. Mach.
Intell. 2021, 43, 2480–2495. [CrossRef]
49. Huang, G.; Liu, Z.; Laurens, V.D.M.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
50. Oktay, O.; Schlemper, J.; Le Folgoc, L.I.C.; Lee, M.J.; Heinrich, M.P.; Misawa, K.; Mori, K.; McDonagh, S.G.; Hammerla, N.Y.;
Kainz, B.; et al. Attention U-Net: Learning Where to Look for the Pancreas. arXiv 2018, arXiv:1804.03999.
51. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [CrossRef]
52. Lee, J.; Gwon, G.; Kim, I.; Jung, H. A Motion Deblurring Network for Enhancing UAV Image Quality in Bridge Inspection. Drones
2023, 7, 657. [CrossRef]
53. Cai, E.; Deng, C. Restoration of motion-blurred UAV images based on deep belief hourglass networkor. Comput. Appl. Softw. 2022,
39, 260–266.
54. Glorot, X.; Bordes, A.; Bengio, Y. Deep Sparse Rectifier Neural Networks. J. Mach. Learn. Res. 2011, 15, 315–323.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
