
Construction and Building Materials 411 (2024) 134134


Review

A weakly-supervised transformer-based hybrid network with multi-attention for pavement crack detection

Zhenlin Wang a, Zhufei Leng b,*, Zhixin Zhang a

a School of Information and Communication Engineering, University of Electronic Science and Technology of China, Sichuan 611731, PR China
b School of Mechanical and Electrical Engineering, University of Electronic Science and Technology of China, Sichuan 611731, PR China

ARTICLE INFO

Keywords: Pavement crack segmentation; Semantic segmentation; Transformer; Weakly-supervised learning; Convolutional neural network (CNN); Deep learning

ABSTRACT

At present, crack detection is of great importance for the maintenance of infrastructure, and in China roads are among the most critical kinds. Road safety accidents, which are mainly caused by cracks, significantly affect people's property, life security and the economic development of society. It is therefore essential to accurately identify pavement defects and promptly repair them in order to prolong the lifespan of the road, minimize maintenance expenses, prevent further deterioration and reduce the occurrence of hazards. In recent years, deep neural networks have achieved considerable success in crack detection, yielding substantial savings in manpower, time and money compared with conventional approaches. Nevertheless, owing to numerous difficulties, including time-consuming pixel annotation, inadequate information acquisition, discontinuous cracks and low-quality images, pavement defect detection remains challenging, with several open issues still demanding effective solutions. To this end, we propose a novel weakly-supervised hybrid network with multi-attention, termed CGTr-Net, for pavement crack detection. To alleviate the loss of information and extract both local and global features well, the backbone CG-Trans was designed. It combines a Convolutional Neural Network (CNN), which excels at extracting local features but has difficulty capturing global representations, with a Gated Axial Transformer, whose gated position-sensitive axial attention can efficiently capture long-distance feature dependencies but degrades when capturing local feature details. To enhance feature fusion between the Transformer layer and the convolution layer, a feature fusion module (TCFF) was added to the network. The two feature maps obtained from the Transformer and the CNN are used to generate Grad-CAMs. Subsequently, we use a Conditional Random Field (CRF) to further refine the Grad-CAM and adapt Affinity from Attention (AFA), which learns semantic affinity from the Gated Axial Transformer and the CNN, to produce more accurate pseudo labels. The proposed CGTr-Net is evaluated on two different crack segmentation datasets, where it achieves the highest Recall (Re), F-score (F1) and mean intersection-over-union (mIoU), surpassing all competitors in the experiment. These results demonstrate the robustness, effectiveness and superiority of CGTr-Net compared with existing state-of-the-art methods.

1. Introduction

Road safety is closely related to people's daily life and the national economy. Under external pressure, environment, climate, prolonged service and other factors, cracks gradually arise on the surface of the pavement, posing hidden threats to road safety and even causing heavy casualties and economic losses. Therefore, to ensure road safety and the sustainability and convenience of traffic, it is critical to accurately identify and promptly repair cracks. Cracks, usually curved or reticulated, are a sign of structural deterioration, and if not detected in time they may lead to more serious problems, such as collapse and potholes. It is thus necessary and meaningful to accurately detect and promptly repair cracks, mitigating safety risks and ensuring convenient transportation and driving safety.

The traditional approach to pavement crack detection is manual inspection, a process that is time-consuming and laborious. Manual inspection is also prone to safety accidents and causes traffic inconvenience.

* Corresponding author.
E-mail address: 767648785@qq.com (Z. Leng).

https://doi.org/10.1016/j.conbuildmat.2023.134134
Received 19 August 2023; Received in revised form 14 October 2023; Accepted 7 November 2023
Available online 30 November 2023
0950-0618/© 2023 Elsevier Ltd. All rights reserved.

Compared with manual inspection, machine learning has higher accuracy, takes less time and delivers more stable performance than inefficient manual recognition. Recently, with the development of deep learning, more and more models have been proposed to solve the problem of crack detection. The majority of these methods take advantage of Convolutional Neural Networks (CNN) [1–3], which mainly focus on local information, lack a grasp of the overall information and often produce inaccurate results. Others attempt to combine the CNN with a Transformer [4–6], but the results are still not satisfactory. Therefore, to improve network performance and further extract information, a novel architecture integrating a Gated Axial Transformer into CNNs, named CGTr-Net, is designed in this paper to address the crack detection problem.

Crack detection can essentially be reduced to a semantic segmentation problem: the process of splitting an image into different meaningful regions or categories, where each region contains pixels that share similar properties such as color, texture and edges. If the image segmentation is done well, it greatly benefits the subsequent processes, making the other operations easy. Consequently, the accuracy and reliability of the image segmentation determine the quality of the subsequent image analysis. However, the biggest problem with image semantic segmentation is the high cost of annotation. Some cracks may be very thin or discontinuous, which makes handcrafting pixel-level annotation difficult and time-consuming. Even adopting image-level annotations that completely cover the cracks is time-consuming, and this kind of labeling also requires extremely high image quality. Therefore, we aim to use image-level annotation that does not need to accurately cover the crack labels to obtain the semantic segmentation of low-quality and possibly discontinuous crack images.

In this paper, we construct a novel and simple architecture named CGTr-Net. The first part of the architecture is the backbone, named CG-Trans, which combines a Convolutional Neural Network (CNN) and a Gated Axial Transformer to extract multi-scale features: the CNN extracts the local features, completing the details that the Transformer struggles to capture, while the Gated Axial Transformer uses gates to obtain more accurate location information, effectively improving the quality and identifiability of the feature coding. Second, a module called TCFF, utilizing pixel attention (PA) and the Convolutional Block Attention Module (CBAM) [7], is designed to fuse the feature maps obtained respectively from the Gated Axial Transformer and the CNN. Afterwards, the extracted feature map is processed to generate a Grad-CAM, which is then further refined by Conditional Random Fields (CRF), a model proficient in grasping global relationships. Inspired by Affinity from Attention (AFA) [8], we align the CNN's affinity prediction with the affinity label generated by the transformer's refined CAM. The last step compares the segmentation prediction with the pseudo label obtained from the refined CAM and the affinity prediction to further refine the pseudo label. This mechanism helps the network obtain more comprehensive image semantic information and further improves the accuracy of the Grad-CAM. Experimental results on Crack500 [9] and DeepCrack [10] demonstrate that our method surpasses state-of-the-art methods remarkably.

The main contributions of this paper can be summarized as follows:

• We propose a novel network structure termed CGTr-Net, retaining the structural and generalization advantages of both U-Net and the gated axial transformer, which behaves extremely well in extracting both local and global features.
• We utilize CRF, which can grasp the connection between pixels, to refine the Grad-CAM, improving its accuracy and addressing the problem of discontinuous or low-quality cracks.
• A Transformer-convolution feature fusion module (TCFF) utilizing pixel attention (PA) and the Convolutional Block Attention Module (CBAM) is proposed to concatenate the feature maps from the transformer and convolution branches and feed the fused feature map to the next layer.
• A hybrid loss function is developed to optimize the segmentation network, capture information about thin and discontinuous cracks and tackle the problem of costly annotation.
• A comprehensive investigation demonstrates that the proposed CGTr-Net has a superior capability to capture thin and discontinuous crack details under noisy conditions, and achieves state-of-the-art results on two benchmark datasets, Crack500 and DeepCrack.

2. Related works

Our work draws on recent works on CNNs, the Gated Axial Transformer, Grad-CAM and Weakly-Supervised Semantic Segmentation. In this part, we briefly introduce these works and the improvements we made upon them.

2.1. CNN and transformer

As the earliest proposed network in deep learning, the CNN is widely used in multiple computer vision tasks, including image classification, object detection and instance segmentation. It is extraordinarily powerful in extracting local features but poor at grasping the relationship between patches. To alleviate this limitation, a variety of methods were proposed. Some introduce a more complex architecture or more pooling operations to obtain a larger receptive field. Deformable convolution [11] learned the sampling positions, while dilated convolution methods [12] increased the sampling step size. CBAM [7] used global max pooling and global average pooling to refine features independently in the spatial and channel dimensions. Additionally, both GENet [13] and SENet [14] exploit global average pooling to aggregate global context and then use it to reweight feature channels. Others choose to add attention mechanisms into their modules. Relation Networks [15] proposed an object attention module, which can process a set of objects simultaneously through interaction between their appearance features and geometry. Attention augmented convolutional networks [16] concatenate convolutional feature maps with feature maps produced via self-attention to capture long-distance relationships and optimize the convolutional operation. BoTNet [17] incorporates self-attention into the classic ResNet [18], replacing the spatial convolutions in the final three bottleneck blocks with global self-attention; this approach turns out to be a strong baseline for instance segmentation and object detection while also reducing the parameters, with minimal overhead in latency. Although these methods have made some progress, there are still defects. For the former, a larger receptive field means the loss of spatial resolution. The latter needs accurate position encoding and appropriate fusion between convolution operations and attention mechanisms; if they are not fused well, the local feature details deteriorate severely.

Contrary to the CNN, a transformer comprises embedding and self-attention [19] and has advantages in extracting global and long-range information [20]. Recently, transformers have been implemented as stand-alone architectures to fulfil computer vision tasks such as image classification, semantic segmentation and target detection, and numerous works have demonstrated that the accuracy of transformers can surpass that of sophisticated CNNs for image segmentation [21,22]. As a groundbreaking work, ViT [23] validated that the transformer is not only powerful in natural language processing tasks: a pure transformer without reliance on a CNN can also be applied in computer vision and perform very well. Motivated by ViT, a variety of transformer-based methods have been proposed to address computer vision tasks, such as object detection [24,25], image classification [26,27], semantic segmentation [28] and so on. Nevertheless, due to the nature of the attention mechanism, ViT has a higher demand for training data and often fails to capture the details of local features. To remedy this defect, DeiT [27] introduced a token-based strategy to distill CNN-based features and transfer them to the visual transformer, while DETR [29] fed local features extracted by a CNN to the transformer to model the global relationships between features in a serial fashion.


Besides, the Swin Transformer [30] proposed a hierarchical transformer whose representation is computed with shifted windows, bringing greater efficiency by allowing self-attention computation for cross-window connections instead of limiting it to non-overlapping local windows. Zheng et al. [31] proposed the Segmentation Transformer (SETR) for semantic segmentation; in the conventional encoder-decoder architecture, SETR is used as an encoder to produce improved segmentation outcomes. SegFormer, introduced by Xie et al. [32], is a transformer encoder combined with a lightweight multilayer perceptron decoder, and it has demonstrated exceptional performance in high-speed image segmentation.

With the boom of deep learning, a gated position-sensitive axial attention mechanism was introduced in MedT [33] to solve the problem that training with fewer images makes it difficult to learn the positional encoding for the images. The axial attention mechanism decomposes the self-attention module into two modules: one computes self-attention along the feature map height axis, and the other along the width axis. Meanwhile, a position bias term is added to make the computed affinities sensitive to the positional information. This position bias term is usually referred to as the relative positional encoding; it is typically learnable through training and has been proven to have the capacity to encode the spatial structure of the image. For any given input feature map x, the updated self-attention mechanism with positional encodings along the width axis can be written as:

y_{ij} = \sum_{w=1}^{W} \mathrm{softmax}\left( q_{ij}^{T} k_{iw} + q_{ij}^{T} r_{iw}^{q} + k_{iw}^{T} r_{iw}^{k} \right) \left( v_{iw} + r_{iw}^{v} \right)   (1)

where the formulation in Eq. (1) follows the attention model proposed in [34], and r^q, r^k, r^v ∈ R^{W×W} for the width-wise axial attention model. Note that Eq. (1) describes the axial attention applied along the width axis of the tensor. A similar formulation is also used to apply axial attention along the height axis, and together they form a single self-attention model that is computationally efficient.

Nonetheless, when the datasets are not large enough, the positional bias will no longer be accurate in encoding long-range interactions. Hence, to avoid introducing inaccurate relative positional encodings, the gated position-sensitive axial attention mechanism was proposed in [33]. In this mechanism, if a relative positional encoding is learned accurately, the gating mechanism assigns it a high weight compared with the ones that are not learned accurately. With the proposed modification, the self-attention mechanism applied on the width axis can be formally written as:

y_{ij} = \sum_{w=1}^{W} \mathrm{softmax}\left( q_{ij}^{T} k_{iw} + G_{Q} q_{ij}^{T} r_{iw}^{q} + G_{K} k_{iw}^{T} r_{iw}^{k} \right) \left( G_{V1} v_{iw} + G_{V2} r_{iw}^{v} \right)   (2)

where the self-attention formula closely follows Eq. (1) with an added gating mechanism. G_Q, G_K, G_{V1}, G_{V2} ∈ R are learnable parameters, and together they create a gating mechanism that controls the influence the learned relative positional encodings have on encoding non-local context. Typically, if a relative positional encoding is learned accurately, the gating mechanism assigns it a high weight compared with the ones that are not learned accurately.
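For concreteness, the width-axis attention of Eqs. (1)–(2) can be sketched in PyTorch as follows. This is a minimal single-head illustration written for this review, not the authors' released code: the tensor layout, the scalar form of the gates and the omission of the height-axis branch are all simplifying assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAxialAttentionW(nn.Module):
    # Single-head gated axial attention along the width axis (Eq. (2)).
    def __init__(self, dim, width):
        super().__init__()
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        # Relative positional encodings r^q, r^k, r^v over width offsets.
        self.r_q = nn.Parameter(torch.randn(width, width, dim) * 0.02)
        self.r_k = nn.Parameter(torch.randn(width, width, dim) * 0.02)
        self.r_v = nn.Parameter(torch.randn(width, width, dim) * 0.02)
        # Scalar gates G_Q, G_K, G_V1, G_V2 weighting the positional terms.
        self.g_q = nn.Parameter(torch.tensor(0.1))
        self.g_k = nn.Parameter(torch.tensor(0.1))
        self.g_v1 = nn.Parameter(torch.tensor(1.0))
        self.g_v2 = nn.Parameter(torch.tensor(0.1))

    def forward(self, x):  # x: (B, H, W, C); attention runs over W
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        # Content term q^T k plus the two gated positional terms of Eq. (2).
        logits = torch.einsum('bhjc,bhwc->bhjw', q, k)
        logits = logits + self.g_q * torch.einsum('bhjc,jwc->bhjw', q, self.r_q)
        logits = logits + self.g_k * torch.einsum('bhwc,jwc->bhjw', k, self.r_k)
        attn = F.softmax(logits, dim=-1)
        out = self.g_v1 * torch.einsum('bhjw,bhwc->bhjc', attn, v)
        out = out + self.g_v2 * torch.einsum('bhjw,jwc->bhjc', attn, self.r_v)
        return out

# e.g. GatedAxialAttentionW(dim=64, width=16)(torch.randn(2, 8, 16, 64))

Because the gates start small for the positional terms, a poorly learned encoding contributes little until training shows it to be reliable, which is the behavior described above.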
Although transformer-based models perform significantly well in capturing global relationships, they still experience difficulty extracting local details. To overcome this challenge, several models combining CNNs and Transformers have been proposed for image segmentation. These models, including Conformer [35], which defined the first concurrent network structure that fuses features extracted from a CNN and a Transformer in an interactive fashion, Fused Transformers and CNNs (TransFuse) [36], the Convolutional Transformer (CvT) [37], the Feature Adaptive Transformer Network (FAT-Net) [38], the Squeeze-Excitation Transformer U-net (SETUnet) [39] and the U-shaped hybrid Transformer Network (UTNet) [40], have exhibited promising results when applied to image segmentation tasks, including medical and remote sensing image segmentation. Although previous models like SegCrack [41] and CrackFormer [42] embedded a CNN at the top of the transformer network, transformer models still require further advancement to effectively address the challenges associated with crack-image segmentation.

More sophisticated methods combining CNNs and Transformers have been explored. For medical image segmentation, TransFuse, a parallel in-branch coding architecture, was designed by Zhang et al. [43]; it fuses a CNN and a Transformer to enhance the modeling of global information while preserving the precise extraction of low-level details. Jha et al. [44] discovered that utilizing dual encoders in U-shaped networks surpasses the use of a single encoder, and the superiority of this parallel encoder design over models with transformers and CNNs in series as encoders has been demonstrated by FAT-Net [45].

Previous research has indicated that a dual encoder combining transformers and CNNs can inherit the advantages of both, retaining the capability to accurately extract target details while effectively capturing background information, leading to improved segmentation of fine cracks in complex backgrounds. However, it is essential to examine the generalization capability of models when transformers and CNNs are employed in parallel as encoders. In summary, the successful integration of transformers and CNNs in crack segmentation networks remains a challenging endeavor, especially when dealing with crack detection in complex scenes. In this paper, we devise a novel network, CGTr-Net, to fuse the CNN with the gated axial attention transformer, significantly enhancing the performance of the network.

2.2. Grad-CAM

In recent years, deep learning has developed rapidly, and an increasing number of works have explored weak supervision in deep learning, trying to alleviate the reliance on time-consuming pixel-level annotation. To obtain the remarkable ability to localize objects while training only on image-level labels, Zhou et al. [46] proposed a general technique called Class Activation Mapping (CAM). Using global average pooling, this technique enables classification-trained CNNs to learn to localize objects without using any bounding box annotations. Class activation maps also allow us to visualize the predicted class scores on given images, highlighting the discriminative object parts detected by the CNN, which makes the CNN explainable and helps others understand the mechanism of discrimination the CNN uses. However, CAM has a severe limitation: it requires feature maps to directly precede the softmax layers, so it can only be applied to the particular kind of CNN architecture that performs global average pooling over convolutional maps immediately prior to prediction. To make Convolutional Neural Network (CNN)-based models more transparent and explainable, Selvaraju et al. [47] put forward a novel method called Gradient-weighted Class Activation Mapping (Grad-CAM), which combines feature maps using the gradient signal and is applicable to a wide variety of CNNs. Due to its advantageous performance, we apply Grad-CAM in our weakly supervised semantic segmentation part, expecting a better result.

2.3. Weakly-supervised semantic segmentation

Weakly-Supervised Semantic Segmentation (WSSS) methods can be mainly divided into two categories, and both of them need class activation maps (CAM) to generate pseudo labels. Hou et al. [48] proposed a method that erases the discriminative region to make the classifier pay more attention to the complete object region, while Wu et al. [49] proposed an Embedded Discriminative Attention Mechanism (EDAM), which integrates the activation map generation directly into the classification network for WSSS.


Furthermore, an increasing number of works have explored ways to explain the working mechanism of CAM and to refine CAM generation. IRNet [50] and CRF have been applied to networks to refine the rough boundary of the initial localization map. In addition, Zhang et al. [51] developed Context Adjustment (CONTA) to remove the confounding bias in image-level classification and provide much better pseudo-masks as ground truth, while Explicit Pseudo-pixel Supervision (EPS) was devised to learn from pixel-level feedback by combining two weak supervisions: the image-level label offers the object identity via the localization map, while the saliency map generated from an off-the-shelf saliency detection model provides rich boundaries. Different from the above methods, Xu et al. devised a novel method called SCD [52], utilizing the feature correspondence derived from the network's own CAM as a distillation target to refine the CAM.

3. Methodology

In this study, we introduce a novel segmentation network, CGTr-Net, to tackle the task of precise pavement crack detection. To enhance feature representations and better capture multi-scale information, several optimization blocks are presented, including the backbone CG-Trans, TCFF and the hybrid loss function. These blocks are meticulously designed to further extract relevant information from the feature maps and enhance the network's performance in crack detection tasks. In the subsequent sections, we present an elaborate explanation of the proposed network and its individual constituent blocks.

3.1. Model overview

A CNN is a convolution process from shallow to deep, which extracts informative features by fusing spatial and channel-wise information together within local receptive fields. The CNN is powerful in extracting local features but insufficient for capturing global cues, and it loses part of the correlation on the global scale due to the limited receptive fields and the pooling process. Besides, convolutional architectures carry inherent inductive biases, which make the learning algorithm prioritize solutions with specific properties, implying that CNNs lack an understanding of long-range dependencies in the image. Different from the traditional convolution operation that uses short-distance local feature semantics, the Transformer focuses on long-distance global features, which can better capture the connections on the global scale of the semantic entity of interest and obtain key information about the overall perspective. Moreover, the Gated Axial Transformer, leveraging gated axial attention, obtains much more accurate location encoding, further improving the performance. In the proposed network, CGTr-Net, the first part is a convolution operation performed on the image. Then, inspired by U-Net [53], we utilize a U-like structure, named CG-Trans, combining the CNN and the gated axial transformer as the backbone to obtain more comprehensive feature information. The initially processed data is input into the CNN module and the Transformer module simultaneously, and the TCFF module then fuses the processed data from each Transformer layer and each convolution layer. Afterwards, the fused data is delivered to the next convolution layer and to the corresponding decoder convolution layer to better integrate global information.

Considering that pixel-level annotations are extremely costly and that image-level annotations completely covering cracks are time-consuming as well, we introduce Grad-CAM into the network, adopt the weakly supervised semantic segmentation (WSSS) mode to train the model, and utilize CRF as the refinement module to polish the generated Grad-CAM, so that the training efficiency is improved and the problems of fuzzy and discontinuous cracks are addressed. Thus, CAM-T and CAM-U are obtained respectively from the Transformer and the CNN and then refined by CRF. Subsequently, we adapt Affinity from Attention (AFA), which learns semantic affinity from the Gated Axial Transformer and the Convolutional Neural Network, to produce more accurate pseudo labels. Additionally, a hybrid loss function is designed to obtain more accurate results. Fusing the above characteristics, we refer to this developed architecture as CGTr-Net. The architecture overview is illustrated in Fig. 1.

3.2. CNN encoder

The backbone of our architecture starts with a convolution stem, which consists of N convolutional layers applying 3 × 3 convolutions with stride 1 and padding 1, for low-level image feature exploitation. The experimental results validate that the model extracts low-level feature semantics much more efficiently with this pre-connection of convolution operations than without it [54]. After preprocessing the image, the low-level features are transferred to the convolution encoder and the gated axial transformer encoder respectively.

The convolution block consists of nine convolution layers and shares a similar structure with U-Net [53]. The block is divided into a contracting path on the left side and an expansive path on the right side. The contracting path follows a standard convolutional network architecture: it iteratively applies two unpadded 3 × 3 convolutions, each followed by a rectified linear unit (ReLU), and a 2 × 2 max pooling with stride 2 for downsampling. At each downsampling step, the number of feature channels is doubled. On the expansive path, each step upsamples the feature map and applies a 2 × 2 convolution (referred to as "up-convolution") that halves the number of feature channels; the result is then concatenated with the correspondingly cropped feature map from the contracting path, after which two 3 × 3 convolutions are applied, each followed by a ReLU. Cropping is necessary to account for the loss of border pixels during convolutions. Lastly, a 1 × 1 convolution is employed in the final layer to map each 64-component feature vector to the desired number of classes. To enable seamless tiling of the output segmentation map, it is crucial to choose an input tile size that ensures all 2 × 2 max-pooling operations are applied to a layer with an even x- and y-size; this consideration is essential for achieving consistent and accurate segmentation across tiled regions. Afterwards, Conv Layers 2–5 of the expansive path each pass through a Conv Block, and their processed results are concatenated to generate CAM-U. The Conv Block consists of a 1 × 1 convolution layer, a Batch Normalization (BN), a ReLU activation function, a 3 × 3 convolution layer and a BN.
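A minimal PyTorch sketch of the convolution stem and the Conv Block described above follows; the channel widths, the stem depth N and the example input size are illustrative assumptions rather than values reported by the authors.

import torch
import torch.nn as nn

class ConvStem(nn.Module):
    # N stacked 3x3 conv layers (stride 1, padding 1) for low-level features.
    def __init__(self, in_ch=3, out_ch=64, n_layers=3):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(n_layers):
            layers += [nn.Conv2d(ch, out_ch, 3, stride=1, padding=1),
                       nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
            ch = out_ch
        self.stem = nn.Sequential(*layers)

    def forward(self, x):
        return self.stem(x)

class ConvBlock(nn.Module):
    # Conv Block used before CAM-U: 1x1 conv -> BN -> ReLU -> 3x3 conv -> BN.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        return self.block(x)

# Example: stem followed by one Conv Block on a dummy pavement patch.
x = torch.randn(1, 3, 224, 224)
feat = ConvStem()(x)               # (1, 64, 224, 224)
cam_feat = ConvBlock(64, 2)(feat)  # two classes: crack / background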


Fig. 1. The structure of the proposed segmentation network.

3.3. Transformer encoder

The gated axial transformer encoder in the proposed CGTr-Net consists of two modules, an input module and an encoding module, as shown in Fig. 1. The input module comprises a patch-partition operation and a positional embedding operation. First, different from the traditional equal-size, non-overlapping, uniform patch embedding strategy [55], which lacks the ability to share complementary information between adjacent patches and to access different contextual details of entities, we use a multi-scale token embedding method aimed at enhancing the quality of the generated token representations. As shown in Fig. 3, two patches of different sizes slide over the feature map to obtain the token embeddings. Specifically, to get the same dimension of token embeddings under different partition sizes for the subsequent process, the sliding intervals are configured with the same value, thereby producing the same number of patches in the different sets. For each patch, a linear mapping converts it into a one-dimensional vector, which is then passed through the embedding layer. By doing so, the input module converts the input image format with dimensions [H, W, C] into a sequence of tokens (vectors) suitable for the standard transformer module's input format. To capture the positional information and enable the model to determine the precise location of feature tokens during attention computation, a positional embedding operation is employed. This operation establishes a relative position relationship between the patch features and ensures that the model can effectively learn and incorporate spatial information during the training process [45].

The output of the input module is then fed into the encoding module, consisting of a series of consecutive transformer layers, as illustrated in Fig. 2. The design of this network drew inspiration from the DeiT model [56], where each block comprises three transformer layers. In our proposed network, we incorporate 12 transformer layers based on findings from Fang et al. [57], who experimentally determined that this configuration achieves optimal performance on the CrackTree200 dataset. Each transformer layer consists of two sublayers: gated position-sensitive axial attention and a feed-forward network (FFN) [58]. To facilitate the flow of information and aid optimization, residual connections and normalization operations are included between these sublayers. The self-attention mechanism employed within each layer captures feature dependencies within the global feature representation, enabling effective fusion of global information and strengthening the discriminative features within the model. The details of gated position-sensitive axial attention are given in Section 2.1.
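The multi-scale token embedding can be sketched as follows. A strided convolution is equivalent to flattening sliding patches and applying a shared linear map; the two patch sizes (4 and 8), the shared sliding interval and the embedding width are our assumptions, chosen only so that the two token grids align as the text requires.

import torch
import torch.nn as nn

class MultiScaleTokenEmbedding(nn.Module):
    # Two patch sizes, one shared stride, so both branches yield the same
    # number of tokens; the two embeddings are summed per position.
    def __init__(self, in_ch=64, embed_dim=256, grid=56):
        super().__init__()
        self.small = nn.Conv2d(in_ch, embed_dim, kernel_size=4, stride=4)
        self.large = nn.Conv2d(in_ch, embed_dim, kernel_size=8, stride=4,
                               padding=2)  # pad so the token grids align
        self.pos = nn.Parameter(torch.zeros(1, grid * grid, embed_dim))

    def forward(self, x):                     # x: (B, C, H, W), H = W = 4*grid
        tok = self.small(x) + self.large(x)   # (B, D, grid, grid)
        tok = tok.flatten(2).transpose(1, 2)  # (B, N, D) token sequence
        return tok + self.pos                 # learnable positional embedding

# e.g. MultiScaleTokenEmbedding()(torch.randn(1, 64, 224, 224)) -> (1, 3136, 256)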
3.4. Transformer-convolution feature fusion module

The feature dimensionality of the feature map extracted by the convolution block is inconsistent with that of the gated axial transformer block, and there is a distinct semantic gap between feature maps and patch embeddings. Hence, it is impracticable to add or multiply the two feature maps directly, and eliminating the misalignment between them is a tough issue. To this end, a Transformer-Convolution Feature Fusion module (TCFF) is designed to consecutively couple the local features with the global representations.

In this module, the feature map F_t coming from the Transformer encoder goes through the Dconv Module and a 1 × 1 convolution operation, yielding a new feature map F_1. The Dconv Module is composed of a 2 × 2 deconvolution, a 3 × 3 convolution, a Batch Normalization and a ReLU activation function. The feature map F_u from the CNN encoder goes through a 1 × 1 convolution operation to obtain a new feature map F_2. F_1 is then added element-wise to F_2 to obtain a fused feature map F_f. Afterwards, F_f passes through Pixel Attention (PA) and the Convolutional Block Attention Module (CBAM) [7] respectively, and the two results are concatenated. Finally, the processed feature map F_p is fed to the next convolution layer. Consequently, multi-scale semantic features can be attained via this module. The procedure described above can be formulated as:

F_f = \mathrm{Conv}_{1\times1}(F_u) + \mathrm{Conv}_{1\times1}(\mathrm{Dconv}(F_t))   (3)

F_p = \mathrm{CBAM}(F_f) \circ \mathrm{PA}(F_f)   (4)

where Conv_{1×1} represents a 1 × 1 convolution, Dconv represents the Dconv Module, CBAM represents the Convolutional Block Attention Module, PA represents Pixel Attention, and ∘ represents concatenation.
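A minimal sketch of Eqs. (3)–(4) in PyTorch follows. The exact form of Pixel Attention is not specified above, so a common per-pixel sigmoid gate is assumed, and any CBAM implementation [7] can be passed in (an identity stand-in is used in the smoke test).

import torch
import torch.nn as nn

class PixelAttention(nn.Module):
    # Assumed PA form: a per-pixel sigmoid gate produced by a 1x1 conv.
    def __init__(self, ch):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.gate(x)

class TCFF(nn.Module):
    # Eq. (3): F_f = Conv1x1(F_u) + Conv1x1(Dconv(F_t));
    # Eq. (4): F_p = concat(CBAM(F_f), PA(F_f)).
    def __init__(self, ch_t, ch_u, ch_out, cbam):
        super().__init__()
        self.dconv = nn.Sequential(
            nn.ConvTranspose2d(ch_t, ch_t, kernel_size=2, stride=2),  # 2x2 deconv
            nn.Conv2d(ch_t, ch_t, kernel_size=3, padding=1),
            nn.BatchNorm2d(ch_t), nn.ReLU(inplace=True))
        self.proj_t = nn.Conv2d(ch_t, ch_out, 1)
        self.proj_u = nn.Conv2d(ch_u, ch_out, 1)
        self.pa = PixelAttention(ch_out)
        self.cbam = cbam

    def forward(self, f_t, f_u):
        f1 = self.proj_t(self.dconv(f_t))   # transformer branch
        f2 = self.proj_u(f_u)               # CNN branch
        f_fused = f1 + f2                   # Eq. (3)
        return torch.cat([self.cbam(f_fused), self.pa(f_fused)], dim=1)  # Eq. (4)

# Smoke test with an identity stand-in for CBAM:
f_t = torch.randn(1, 256, 28, 28)   # transformer feature (coarser grid)
f_u = torch.randn(1, 128, 56, 56)   # CNN feature
out = TCFF(256, 128, 64, cbam=nn.Identity())(f_t, f_u)  # (1, 128, 56, 56)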


3.5. Grad-CAM module architecture

After obtaining the multi-scale semantic features and the initial classification using the front CGTr-Net architecture, the next step is to generate a Grad-CAM (Gradient-weighted Class Activation Mapping). Grad-CAM utilizes the gradient information that flows into the last convolutional layer of a convolutional block; its objective is to assign importance values to the individual neurons in that layer. To obtain the class-discriminative localization map L^c_{Grad-CAM} of width u and height v for any class c, we first compute the gradient of the score y^c for class c (before the softmax) with respect to the feature map activations A^k of a convolutional layer, i.e. ∂y^c/∂A^k. The gradients flowing back from the last convolutional layer are global-average-pooled across the width and height dimensions to obtain the neuron importance weights α^c_k; this pooling operation calculates the average gradient value for each neuron in the layer, giving weights that indicate the overall significance of each neuron in the prediction process:

\alpha_{k}^{c} = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^{c}}{\partial A_{ij}^{k}}   (5)

The weight α^c_k represents the relevance of feature map k for a specific target class c. To incorporate this importance information into the forward activation maps, we perform a weighted combination of the feature maps: each feature map is multiplied by its corresponding weight and the results are summed together. This linear combination of maps is followed by a Rectified Linear Unit (ReLU) activation. The purpose of applying the ReLU is to discard negative or irrelevant features that do not positively contribute to the class of interest; by discarding the negative influences, we focus solely on the features that have a positive impact on the classification and contribute to the generation of the Grad-CAM map.

L_{\mathrm{Grad\text{-}CAM}}^{c} = \mathrm{ReLU}\left( \sum_{k} \alpha_{k}^{c} A^{k} \right)   (6)
Fig. 2. Transformer layer.
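Eqs. (5)–(6) above amount to a few lines of autograd; a minimal sketch, assuming the feature maps A^k have been captured (e.g. with a forward hook on the last convolutional layer) and class_score is the pre-softmax logit y^c:

import torch
import torch.nn.functional as F

def grad_cam(feature_maps, class_score):
    # feature_maps A^k: (1, K, H, W) tensor inside the autograd graph;
    # class_score y^c: scalar pre-softmax logit for the target class.
    grads = torch.autograd.grad(class_score, feature_maps, retain_graph=True)[0]
    alpha = grads.mean(dim=(2, 3), keepdim=True)     # Eq. (5): GAP of gradients
    cam = F.relu((alpha * feature_maps).sum(dim=1))  # Eq. (6): weighted sum + ReLU
    # Normalize to [0, 1] for visualization / pseudo-label seeding.
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam                                       # (1, H, W)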


3.6. Loss function

Four components constitute the total loss function used in this study: a classification loss for the MLP layer that processes the feature map from the Transformer block, a CRF loss for refining the CAM, an affinity loss for comparing the affinity predictions with the affinity labels, and a segmentation loss for obtaining more accurate pseudo labels. In this part, the loss functions used in this architecture are detailed.

Fig. 3. Diagram of Transformer-Convolution Feature Fusion(TCFF).


3.6.1. Classification loss

As shown in Fig. 1, the feature map obtained from the Transformer block is fed into the classification layer to compute the prediction vector p for the classification. The multi-label soft-margin loss is adopted as the classification loss for the subsequent network training. For a total number of classes C, the classification loss is defined as:

L_{cls} = -\frac{1}{C} \sum_{c=1}^{C} \left[ l^{c} \log\frac{1}{1+e^{-p^{c}}} + \left(1-l^{c}\right) \log\frac{e^{-p^{c}}}{1+e^{-p^{c}}} \right]   (7)

where l is the ground truth for the class labels and C is equal to two, corresponding to crack and background.
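Eq. (7) is exactly PyTorch's built-in multi-label soft-margin loss, so a minimal sketch needs no custom code; the two-column label layout below is an assumed encoding of the crack/background classes:

import torch
import torch.nn as nn

criterion = nn.MultiLabelSoftMarginLoss()   # implements Eq. (7)
logits = torch.randn(4, 2)                  # p: image-level predictions, C = 2
labels = torch.tensor([[1., 0.], [1., 1.], [0., 1.], [1., 0.]])  # l: class presence
loss_cls = criterion(logits, labels)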
3.6.2. CRF loss

To further refine the Grad-CAM, we introduce the CRF loss [59], which can grasp the relationships between pixels.

L_{crf} = -\log\frac{P_{RealPath}}{P_{1}+P_{2}+\dots+P_{N}} = -\log\frac{e^{S_{RealPath}}}{e^{S_{1}}+e^{S_{2}}+\dots+e^{S_{N}}}   (8)

In the above function, S_i = EmissionScore + TransitionScore. Through the classification layer, each pixel obtains its own score for being crack and for being background. P_{RealPath} corresponds to the ground truth, and P_i, i = 1, 2, 3, ..., N, ranges over all combinations of the two classifications over all pixels.
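To make Eq. (8) concrete, the toy sketch below scores a three-pixel sequence and enumerates all 2^3 label paths; the emission and transition scores are made-up numbers, and real CRF implementations compute the denominator with dynamic programming rather than enumeration:

import torch
import itertools

emission = torch.tensor([[1.2, 0.1], [0.3, 0.9], [1.5, 0.2]])  # per-pixel scores: [crack, bg]
transition = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])          # label-pair compatibility
gold = [0, 1, 0]                                                # ground-truth path

def path_score(path):
    # S_i = EmissionScore + TransitionScore for one labeling of all pixels.
    s = sum(emission[i, l] for i, l in enumerate(path))
    s = s + sum(transition[a, b] for a, b in zip(path, path[1:]))
    return s

scores = torch.stack([path_score(p) for p in itertools.product([0, 1], repeat=3)])
loss_crf = -(path_score(gold) - torch.logsumexp(scores, dim=0))  # Eq. (8)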
3.6.3. Affinity loss

Inspired by [60], the affinity loss is defined as follows:

CL\left(p_{x}, p_{y}\right) = -p_{y}^{\lambda} \times \left|p_{y}-p_{x}\right|^{\lambda-r} \times \log\left(p_{x}\right)   (9)

where p_y ∈ [0,1] and p_x ∈ [0,1] are the prediction probabilities from the CNN affinity prediction and the affinity label from the Transformer, respectively. The loss function consists of two novel terms: (1) a self-weighting term and (2) a difference-modulating term, as described in Eq. (9). Different from the initial form of the function, a novel parameter λ is added to prevent the training from being biased towards the CNN or the Transformer. λ - r ≥ 0 and r ≥ 0 are weighting parameters.
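Eq. (9) translates directly into code; a minimal sketch, with illustrative values for λ and r (the paper's tuned values are not given in this excerpt):

import torch

def affinity_loss(p_x, p_y, lam=2.0, r=1.0, eps=1e-8):
    # Eq. (9): self-weighting term p_y^lam and difference-modulating term
    # |p_y - p_x|^(lam - r) applied to -log(p_x). p_x: CNN affinity
    # prediction; p_y: affinity label distilled from the Transformer.
    return -(p_y ** lam) * (p_y - p_x).abs() ** (lam - r) * torch.log(p_x + eps)

p_x = torch.rand(16, 16).clamp(0.01, 0.99)
p_y = torch.rand(16, 16)
loss_aff = affinity_loss(p_x, p_y).mean()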
3.6.4. Total loss

To compare the Grad-CAMs from the Transformer and the CNN, Representation Learning via Invariant Causal Mechanisms (ReLIC) [61] is introduced into our network as the Grad-CAM loss function L_{CAM}. Additionally, the cross-entropy loss is adopted as the segmentation loss L_{seg}. The total loss is defined as follows:

L = \lambda_{1} L_{seg} + \lambda_{2} L_{crf} + \lambda_{3} L_{cls} + \lambda_{4} L_{aff} + \lambda_{5} L_{CAM}   (10)
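Wiring Eq. (10) together is then a weighted sum; the λ weights below are hypothetical placeholders (the excerpt does not report the tuned values), and the five loss tensors stand in for the terms defined above:

import torch

# Dummy stand-ins for the five terms of Eq. (10); in training these come
# from the segmentation, CRF, classification, affinity and ReLIC losses.
loss_seg, loss_crf, loss_cls, loss_aff, loss_cam = (torch.rand(()) for _ in range(5))
lam = (1.0, 0.5, 1.0, 0.5, 0.1)   # hypothetical lambda_1 .. lambda_5
total = (lam[0] * loss_seg + lam[1] * loss_crf + lam[2] * loss_cls
         + lam[3] * loss_aff + lam[4] * loss_cam)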
4. Experimental dataset and setup

This section provides a comprehensive introduction to the experimental datasets and the evaluation metrics used for performance testing and evaluation.

4.1. Datasets

4.1.1. Classification dataset

The crack image classification (CIC) dataset is utilized to train the proposed pavement crack classification network and generate class activation maps. The CIC dataset comprises a total of 458 original images, each with a resolution of 4032 × 3024 pixels. These images are subsequently cropped into patches of 227 × 227 pixels, yielding approximately 40,000 images suitable for training a classification network. It includes two classes: the positive class (crack), represented by 20,000 images, and the negative class (non-crack), represented by another 20,000 images.

4.1.2. Segmentation datasets

To train our segmentation network, we utilize the training sets provided by the Crack500 [9] and DeepCrack [10] datasets. DeepCrack is a widely used benchmark dataset for evaluating crack detection methods. It consists of 300 high-resolution training images and 237 high-resolution test images, with cracks of varying sizes and locations in different scenarios; the images have a resolution of 544 × 384 pixels. The Crack500 dataset, on the other hand, presents additional challenges for crack segmentation, as it contains 1896 training images and 1124 testing images with a resolution of 360 × 640 pixels. These images exhibit cracks with diverse widths and shapes, posing a more complex segmentation task. Since the number of images in the Crack500 dataset is limited, we apply augmentation techniques to increase the diversity of the dataset: rotation by 90°, 180° and -90°, as well as flips along the horizontal and vertical axes. The same augmentations are applied to the corresponding ground truth images. By incorporating these diverse datasets and applying appropriate augmentations, we aim to enhance the robustness and generalizability of the crack segmentation network.

4.2. Evaluation metrics

Four well-known evaluation metrics are utilized here: Recall (Re), Precision (Pr), F-score (F1) and the mean intersection-over-union (mIoU). These metrics are defined as:

Pr = \frac{TP}{TP + FP}   (11)

Re = \frac{TP}{TP + FN}   (12)

F1 = 2 \times \frac{Re \times Pr}{Re + Pr}   (13)

mIoU = \frac{|X \cap Y|}{|X \cup Y|}   (14)

where true positives, false negatives and false positives are denoted by TP, FN and FP, respectively, and |X| and |Y| are the numbers of pixels in the predicted and labeled binary mask images. The mean Intersection over Union (mIoU) metric is employed to assess the similarity between the labeled mask and the binary segmentation result.
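Eqs. (11)–(14) can be computed directly from binary masks; a minimal sketch:

import torch

def crack_metrics(pred, gt, eps=1e-8):
    # pred, gt: binary {0, 1} tensors of equal shape.
    tp = ((pred == 1) & (gt == 1)).sum().float()
    fp = ((pred == 1) & (gt == 0)).sum().float()
    fn = ((pred == 0) & (gt == 1)).sum().float()
    pr = tp / (tp + fp + eps)             # Eq. (11)
    re = tp / (tp + fn + eps)             # Eq. (12)
    f1 = 2 * re * pr / (re + pr + eps)    # Eq. (13)
    iou = tp / (tp + fp + fn + eps)       # Eq. (14): |X ∩ Y| / |X ∪ Y|
    return pr.item(), re.item(), f1.item(), iou.item()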


5. Experiment

5.1. Experiment results and discussion

In this section, we provide implementation details of the proposed CGTr-Net and assess its effectiveness by comparing it with different state-of-the-art pavement crack segmentation methods, such as CrackNet [62], SegNet [63], DMA-Net [64], CrackSeg [65], U2-Net [66] and DeepLabv3+ [67]. Our evaluation is conducted on two widely used pavement crack image datasets, allowing a comprehensive performance comparison. The corresponding results are listed in Table 1, Table 2 and Table 3 respectively, and the highest scores are bolded.

Table 1
Comparison among CNN architectures on the Crack500 dataset.

Method       Pr     Re     F1     mIoU
U-net        0.642  0.641  0.623  0.703
CrackNet     0.631  0.700  0.665  0.729
SegNet       0.562  0.624  0.593  0.685
DMA-Net      0.695  0.801  0.743  0.559
CrackSeg     0.644  0.700  0.673  0.734
U2-Net       0.673  0.702  0.687  0.547
DeepLabv3+   0.546  0.830  0.655  0.723
CGTr-Net     0.686  0.838  0.755  0.745

5.1.1. Evaluation on Crack500

In Table 1, we present a comparison between CGTr-Net and the current state-of-the-art methods on the Crack500 dataset. CGTr-Net achieves a mean Intersection over Union (mIoU) of 74.5%, a recall of 83.8% and an F1-score of 75.5%. While the precision of CGTr-Net is slightly lower than that of DMA-Net, it outperforms the other methods in terms of F1-score, which is a comprehensive measure of precision and recall.

Fig. 4 illustrates a visual comparison of the segmentation results obtained by CGTr-Net and re-implemented networks, such as U-Net [53] and DeepLabv3+ [67], on the Crack500 dataset. It is evident from the comparison that CGTr-Net generates crack segmentation maps with significantly enhanced detail and refinement compared with DeepLabv3+ and U-Net. In particular, CGTr-Net excels in detecting and segmenting short and small cracks, showcasing its superior performance on the Crack500 dataset.

5.1.2. Evaluation on DeepCrack

Table 2 presents the evaluation results on the DeepCrack dataset, comparing the performance of different methods, including the proposed CGTr-Net. The metrics used for evaluation include precision, recall, mean Intersection over Union (mIoU) and F1-score. CGTr-Net achieves impressive results, outperforming all other competitive methods. Specifically, CGTr-Net achieves a precision of 88.8%, a recall of 88.3%, an mIoU of 89.4% and an F1-score of 88.6%.

Table 2
Comparison among CNN architectures on the DeepCrack dataset.

Method       Pr     Re     F1     mIoU
U-net        0.825  0.784  0.84   0.827
CrackNet     0.838  0.833  0.836  0.855
SegNet       0.802  0.768  0.785  0.812
CrackSeg     0.815  0.805  0.811  0.832
U2-Net       0.824  0.867  0.863  0.875
DeepLabv3+   0.759  0.767  0.762  0.801
CGTr-Net     0.888  0.883  0.887  0.894

In Fig. 5, a visual comparison is provided between the results obtained by CGTr-Net and other re-implemented networks, U-Net and DeepLabv3+, on the DeepCrack testing set. It is evident from the comparison that CGTr-Net delivers more accurate crack segmentation results than U-Net and DeepLabv3+, highlighting its superior performance in accurately identifying and segmenting cracks in the DeepCrack dataset.

5.1.3. Discussion

After evaluating on the two datasets, it is clear that our proposed network CGTr-Net performs extraordinarily well. In the experiments conducted on the Crack500 dataset, the networks' Pr, Re, F1 and mIoU values were lower than those on the DeepCrack dataset; this difference can be attributed to variations in data distribution and quality. However, CGTr-Net outperformed all other competitors in terms of Recall (Re), F-score (F1) and mean intersection-over-union (mIoU) on both benchmark datasets, achieving the highest scores. Profiting from the transformer's superior ability to obtain global information and the CNN's superior ability to capture local information, we are able to extract information much more exhaustively. Additionally, since spatial structure may be lost during the pooling process, an attention mechanism is employed to perform weighted enhancement of features, resulting in higher scores on the datasets than the other networks. Nevertheless, the precision of CGTr-Net is slightly lower than that of DMA-Net, perhaps because of problems in the cascade process of the network.

5.2. Ablation results

This section delves into the effectiveness of each component within the proposed CGTr-Net framework. The assessment was carried out by conducting all comparison experiments on the DeepCrack dataset. A comprehensive set of experiments was executed to evaluate the performance of the different loss functions in the context of crack segmentation. Analyzing the scores in Table 3, it can be observed that Baseline+AFA, Baseline+CRF, Baseline+L_seg and Baseline+L_CAM achieve higher scores in Pre, Re and mIoU compared with the pure Baseline, indicating that each loss function contributes to improving the network's performance to some extent. Additionally, the proposed network, which combines all the loss functions, achieves the highest scores in Pre, Re and mIoU, surpassing all the competitors in the experiment. This suggests that each loss function has its own advantages, and combining them maximizes the network's performance.

Table 3
Ablation experiments on the DeepCrack dataset.

Method            Pre    Re     mIoU
Baseline          0.892  0.876  0.811
Baseline+AFA      0.917  0.917  0.872
Baseline+CRF      0.911  0.898  0.889
Baseline+Lseg     0.926  0.931  0.892
Baseline+LCAM     0.920  0.922  0.904
Proposed          0.951  0.959  0.913

6. Conclusion

This paper introduces CGTr-Net, a novel hybrid deep learning approach for pavement crack semantic segmentation. It combines a Convolutional Neural Network (CNN) and a Gated Axial Transformer to leverage the strengths of both architectures. Additionally, to enhance feature fusion between the Transformer intermediate layer and the convolutional layer, a feature fusion module called TCFF is introduced. The feature maps obtained from the Transformer and the CNN are utilized to generate Grad-CAMs. Subsequently, we use CRF to further refine the Grad-CAM and adapt Affinity from Attention (AFA), which learns semantic affinity from the Gated Axial Transformer and the Convolutional Neural Network, to produce more accurate pseudo labels. A new hybrid loss function has also been designed to optimize the segmentation network. Experimental evaluations on two datasets, DeepCrack and Crack500, demonstrate the effectiveness of CGTr-Net in accurately segmenting pavement cracks, outperforming state-of-the-art methods. In addition, ablation studies are conducted to analyze and showcase the effectiveness of each component of CGTr-Net. In future work, our aim is to further enhance the accuracy and robustness of pavement crack semantic segmentation by refining the architecture of CGTr-Net. Although the currently proposed CGTr-Net has only been validated for pavement crack semantic segmentation, we believe that CGTr-Net can also be well-suited to other defect segmentation tasks in surface inspection. We will investigate this issue further in future research.

CRediT authorship contribution statement

Zhang Zhixin: Writing – review & editing, Validation, Supervision, Software. Wang Zhenlin: Writing – review & editing, Supervision, Resources, Methodology, Formal analysis. Leng Zhufei: Writing – review & editing, Writing – original draft, Visualization, Validation, Software, Methodology, Formal analysis, Data curation.

Declaration of Competing Interest

The authors declare that they have no known competing financial

Fig. 4. Visual comparison of the proposed CGTr-Net with other methods on the Crack500 dataset.

Fig. 5. Visual comparison of the proposed CGTr-Net with other methods on the DeepCrack dataset.


interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

References

[1] S.L.H. Lau, E.K.P. Chong, X. Yang, X. Wang, Automated pavement crack segmentation using u-net-based convolutional neural network, IEEE Access 8 (2020) 114892–114899, https://doi.org/10.1109/ACCESS.2020.3003638.
[2] A. Ji, X. Xue, Y. Wang, X. Luo, W. Xue, An integrated approach to automatic pixel-level crack detection and quantification of asphalt pavement, Autom. Constr. 114 (2020) 103176, https://doi.org/10.1016/j.autcon.2020.103176.
[3] H. Fu, D. Meng, W. Li, Y. Wang, Bridge crack semantic segmentation based on improved deeplabv3+, J. Mar. Sci. Eng. 9 (2021) 671, https://doi.org/10.3390/jmse9060671.
[4] Y. Gao, M. Zhou, D. Metaxas, UTNet: a hybrid transformer architecture for medical image segmentation, in: Medical Image Computing and Computer Assisted Intervention (MICCAI), Springer, Strasbourg, France, 2021, pp. 61–71, https://doi.org/10.1007/978-3-030-87199-4_6.
[5] H. Wu, S. Chen, G. Chen, W. Wang, B. Lei, Z. Wen, FAT-net: feature adaptive transformers for automated skin lesion segmentation, Med. Image Anal. 76 (2022) 102327, https://doi.org/10.1016/j.media.2021.102327.
[6] Y. Zhang, H. Liu, Q. Hu, TransFuse: fusing transformers and cnns for medical image segmentation, in: Medical Image Computing and Computer Assisted Intervention (MICCAI), Strasbourg, France, 2021, pp. 14–24, https://doi.org/10.1007/978-3-030-87193-2_2.
[7] S. Woo, J. Park, J. Lee, I. Kweon, CBAM: convolutional block attention module, arXiv preprint arXiv:1807.06521, 2018.
[8] L. Ru, Y. Zhan, B. Yu, B. Du, Learning affinity from attention: end-to-end weakly-supervised semantic segmentation with transformers, in: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022, pp. 16825–16834, https://doi.org/10.1109/CVPR52688.2022.01634.
[9] F. Yang, L. Zhang, S. Yu, D. Prokhorov, X. Mei, H. Ling, Feature pyramid and hierarchical boosting network for pavement crack detection, IEEE Trans. Intell. Transp. Syst. 21 (4) (2020) 1525–1535.
[10] Y. Liu, J. Yao, X. Lu, R. Xie, L. Li, DeepCrack: a deep hierarchical feature learning architecture for crack segmentation, Neurocomputing 338 (2019) 139–153.
[11] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, Y. Wei, Deformable convolutional networks, in: IEEE ICCV, 2017, pp. 764–773.
[12] F. Yu, V. Koltun, T. Funkhouser, Dilated residual networks, in: IEEE CVPR, 2017, pp. 472–480.
[13] J. Hu, L. Shen, S. Albanie, G. Sun, A. Vedaldi, Gather-excite: exploiting feature context in convolutional neural networks, arXiv preprint arXiv:1810.12348, 2018.
[14] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: IEEE CVPR, 2018, pp. 7132–7141.
[15] H. Hu, J. Gu, Z. Zhang, J. Dai, Y. Wei, Relation networks for object detection, in: IEEE CVPR, 2018, pp. 3588–3597.
[16] I. Bello, B. Zoph, A. Vaswani, J. Shlens, Q.V. Le, Attention augmented convolutional networks, in: IEEE ICCV, 2019, pp. 3286–3295.
[17] A. Srinivas, T.-Y. Lin, N. Parmar, J. Shlens, P. Abbeel, A. Vaswani, Bottleneck transformers for visual recognition, in: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2021, pp. 16514–16524, https://doi.org/10.1109/CVPR46437.2021.01625.
[18] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[19] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, California, USA, 2017, pp. 5998–6008, https://doi.org/10.48550/arXiv.1706.03762.
[20] S. Khan, M. Naseer, M. Hayat, S.W. Zamir, F.S. Khan, M. Shah, Transformers in vision: a survey, ACM Comput. Surv. 1 (2022) 1–38, https://doi.org/10.1145/3505244.
[21] X. He, E.L. Tan, H. Bi, X. Zhang, S. Zhao, B. Lei, Fully transformer network for skin lesion analysis, Med. Image Anal. 77 (2022) 102357, https://doi.org/10.1016/j.media.2022.102357.
[22] X. Shen, J. Xu, H. Jia, P. Fan, F. Dong, B. Yu, S. Ren, Self-attentional microvessel segmentation via squeeze-excitation transformer unet, Comput. Med. Imaging Graph. 97 (2022) 102055, https://doi.org/10.1016/j.compmedimag.2022.102055.
[23] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16×16 words: transformers for image recognition at scale, arXiv preprint arXiv:2010.11929, 2020.
[25] J. Beal, E. Kim, E. Tzeng, D.H. Park, A. Zhai, D. Kislyuk, Toward transformer-based object detection, arXiv preprint arXiv:2012.09958, 2020.
[26] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, F.E.H. Tay, J. Feng, S. Yan, Tokens-to-token ViT: training vision transformers from scratch on ImageNet, arXiv preprint arXiv:2101.11986, 2021.
[27] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, H. Jegou, Training data-efficient image transformers & distillation through attention, in: Proceedings of the 38th International Conference on Machine Learning, PMLR, vol. 139, 2021, pp. 10347–10357.
[28] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P.H. Torr, L. Zhang, Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers, in: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 6877–6886.
[29] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end object detection with transformers, in: ECCV, Springer, 2020, pp. 213–229.
[30] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin Transformer: hierarchical vision transformer using shifted windows, arXiv preprint arXiv:2103.14030, 2021.
[31] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P.H.S. Torr, L. Zhang, Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6877–6886, https://doi.org/10.1109/cvpr46437.2021.00681.
[32] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J.M. Alvarez, P. Luo, SegFormer: simple and efficient design for semantic segmentation with transformers, in: Neural Information Processing Systems (NeurIPS), 2021, pp. 12077–12090, https://doi.org/10.48550/arXiv.2105.15203.
[33] J.M.J. Valanarasu, P. Oza, I. Hacihaliloglu, V.M. Patel, Medical Transformer: gated axial-attention for medical image segmentation, arXiv preprint arXiv:2102.10662, 2021.
[34] H. Wang, Y. Zhu, B. Green, H. Adam, A.L. Yuille, L. Chen, Axial-DeepLab: stand-alone axial-attention for panoptic segmentation, in: European Conference on Computer Vision, 2020.
[35] Z. Peng, et al., Conformer: local features coupling global representations for visual recognition, in: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 2021, pp. 357–366, https://doi.org/10.1109/ICCV48922.2021.00042.
[36] Y. Zhang, H. Liu, Q. Hu, TransFuse: fusing transformers and cnns for medical image segmentation, in: Medical Image Computing and Computer Assisted Intervention (MICCAI), Strasbourg, France, 2021, pp. 14–24, https://doi.org/10.1007/978-3-030-87193-2_2.
[37] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, L. Zhang, CvT: introducing convolutions to vision transformers, in: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 51–60, https://doi.org/10.1109/iccv48922.2021.00009.
[38] H. Wu, S. Chen, G. Chen, W. Wang, B. Lei, Z. Wen, FAT-net: feature adaptive transformers for automated skin lesion segmentation, Med. Image Anal. 76 (2022) 102327, https://doi.org/10.1016/j.media.2021.102327.
[39] X. Shen, J. Xu, H. Jia, P. Fan, F. Dong, B. Yu, S. Ren, Self-attentional microvessel segmentation via squeeze-excitation transformer unet, Comput. Med. Imaging Graph. 97 (2022) 102055, https://doi.org/10.1016/j.compmedimag.2022.102055.
[40] Y. Gao, M. Zhou, D. Metaxas, UTNet: a hybrid transformer architecture for medical image segmentation, in: Medical Image Computing and Computer Assisted Intervention (MICCAI), Springer, Strasbourg, France, 2021, pp. 61–71, https://doi.org/10.1007/978-3-030-87199-4_6.
[41] W. Wang, C. Su, Automatic concrete crack segmentation model based on transformer, Autom. Constr. 139 (2022) 104275, https://doi.org/10.1016/j.autcon.2022.104275.
[42] H. Liu, X. Miao, C. Mertz, C. Xu, H. Kong, CrackFormer: transformer network for fine-grained crack detection, in: International Conference on Computer Vision (ICCV), Montreal, Canada, 2021, pp. 3783–3792, https://openaccess.thecvf.com/content/ICCV2021/papers/Liu_CrackFormer_Transformer_Network_for_Fine-Grained_Crack_Detection_ICCV_2021_paper.pdf.
[43] Y. Zhang, H. Liu, Q. Hu, TransFuse: fusing transformers and cnns for medical image segmentation, in: Medical Image Computing and Computer Assisted Intervention (MICCAI), Strasbourg, France, 2021, pp. 14–24, https://doi.org/10.1007/978-3-030-87193-2_2.
[44] D. Jha, M.A. Riegler, D. Johansen, P. Halvorsen, H.D. Johansen, DoubleU-net: a deep convolutional neural network for medical image segmentation, in: Proceedings - IEEE Symposium on Computer-Based Medical Systems, 2020, pp. 558–564, https://doi.org/10.1109/CBMS49503.2020.00111.
[45] H. Wu, S. Chen, G. Chen, W. Wang, B. Lei, Z. Wen, FAT-net: feature adaptive transformers for automated skin lesion segmentation, Med. Image Anal. 76 (2022) 102327, https://doi.org/10.1016/j.media.2021.102327.
[46] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, A. Torralba, Learning deep features for discriminative localization, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 2921–2929, https://doi.org/10.1109/CVPR.2016.319.
[47] R.R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh and D. Batra, "Grad-
[24] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, CAM: Visual Explanations from Deep Networks via Gradient-Based Localization,"
Alexander Kirillov, Sergey Zagoruyko, End-to-end object detection with 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy,
transformers, in: ECCV, 2, Springer,, 2020, pp. 213–229. 2017, pp. 618–626, doi: 10.1109/ICCV.2017.74.

10
Z. Wang et al. Construction and Building Materials 411 (2024) 134134

[48] Hou, Q.; Jiang, P.; Wei, Y.; and Cheng, M.-M. 2018. Self-erasing network for [57] J. Fang, C. Yang, Y. Shi, N. Wang, Y. Zhao, External attention based transunet and
integral object attention. Advances in Neural Information Processing Systems, 31. label expansion strategy for crack detection, IEEE Trans. Intell. Transp. Syst. 23
Hou, Y.; Ma, Z.; Liu, C.; and Loy, C. C. 2019. Learning lightweight lane detection (2022) 19054–19063, https://doi.org/10.1109/TITS.2022.3154407.
cnns by self attention distillation. In Proceedings of the IEEE/CVF international [58] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, I.
conference on computer vision, 1013–1021. Polosukhin, Attention is all you need, in: Proceedings of the 31st International
[49] Wu, T.; Huang, J.; Gao, G.; Wei, X.; Wei, X.; Luo, X.; and Liu, C.H. 2021. Embedded Conference on Neural Information Processing Systems, Long Beach California,
discriminative attention mecha-nism for weakly supervised semantic USA, 2017, pp. 5998–6008, https://doi.org/10.48550/arXiv.1706.03762.
segmentation. In Pro-ceedings of the IEEE/CVF Conference on Computer Vision [59] A.Sutton Charles, Mc.Callum Andrew, Rohanimanesh Khashayar, Dynamic
and Pattern Recognition, 16765–16774. Conditional Random Fields: Factorized Probabilistic Models for Labeling and
[50] Ahn, J.; Cho, S.; and Kwak, S. 2019. Weakly supervised learning of instance Segmenting Sequence Data, J. Mach. Learn. Res. (2007).
segmentation with inter-pixel relations. In Proceedings of the IEEE/CVF conference [60] Zhu, Q., Bi, Y., Wang, D., Chu, X., Chen, J., & Wang, Y., 2023, Coordinated
on computer vision and pattern recognition, 2209–2218. Transformer with Position & Sample-aware Central Loss for Anatomical Landmark
[51] D. Zhang, H. Zhang, J. Tang, X. Hua, Q. Sun, Causal Intervention for Weakly- Detection.
Supervised Semantic Segmentation, ArXiv, abs/2009 (2020) 12547. [61] Mitrovic, J., McWilliams, B., Walker, J., Buesing, L., & Blundell, C. (2020).
[52] Xu, Rongtao and Wang, Changwei and Sun, Jiaxi and Xu, Shibiao and Meng, Representation Learning via Invariant Causal Mechanisms [Preprint]. arXiv:
Weiliang and Zhang, Xiaopeng. Self Correspondence Distillation for End-to-End 2010.07922. Retrieved from 〈http://arxiv.org/abs/2010.07922〉.
Weakly-Supervised Semantic Segmentation. arXiv preprint arXiv:2302.13765, [62] W. Song, G. Jia, D. Jia, H. Zhu, Automatic pavement crack detection and
2023. 2, 27. classification using multiscale feature attention network, IEEE Access 7 (2019)
[53] Ronneberger, O., Fischer, P., Brox, T. (2015). U-Net: Convolutional Networks for 171001–171012.
Biomedical Image Segmentation. In: Navab, N., Hornegger, J., Wells, W., Frangi, A. [63] V. Badrinarayanan, A. Kendall, R. Cipolla, SegNet: A deep convolutional encoder-
(eds) Medical Image Computing and Computer-Assisted Intervention – MICCAI decoder architecture for image segmentation, IEEE Trans. Pattern Anal. Mach.
2015. MICCAI 2015. Lecture Notes in Computer Science, vol 9351. Springer, Cham. Intell. 39 (12) (2017) 2481–2495.
https://doi.org/10.1007/978–3-319–24574-4_28. [64] X. Sun, Y. Xie, L. Jiang, Y. Cao, B. Liu, DMA-Net: DeepLab with multi-scale
[54] S. Xiao, K. Shang, K. Lin, Q. Wu, H. Gu, Z. Zhang, Pavement crack detection with attention for pavement crack segmentation, IEEE Trans. Intell. Transp. Syst. (2022)
hybrid-window attentive vision transformers, Int. J. Appl. Earth Obs. Geoinf. 116 1–12.
(2023), 103172, https://doi.org/10.1016/j.jag.2022.103172. [65] W. Song, G. Jia, H. Zhu, D. Jia, L. Gao, Automated pavement crack damage
[55] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. detection using deep multiscale convolutional features, J. Adv. Transp. (2020)
Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An Image is 2020.
Worth 16×16 Words: Transformers for Image Recognition at Scale, in: ArXiv: [66] X. Qin, Z. Zhang, C. Huang, M. Dehghan, O.R. Zaiane, M. Jagersand, U2 -Net: Going
2010.11929, 2020, https://doi.org/10.48550/arXiv.2010.11929. deeper with nested U-structure for salient object detection, . Pattern Recognit. 106
[56] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, H. Jégou, Training data- (2020), 107404.
efficient image transformers & distillation through attention, in: International [67] Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H., 2018c. Encoder-
Conference on Machine Learning, PMLR, Online Conference, 2021, pp. decoder with atrous separable convolution for semantic image segmentation. In:
10347–10357, https://doi.org/10.48550/arXiv.2012.12877. Proceedings of the European Conference on Computer Vision. ECCV.

11
Update
Construction and Building Materials, Volume 411, 12 January 2024, Article 134677
DOI: https://doi.org/10.1016/j.conbuildmat.2023.134677

Corrigendum

Corrigendum to “A Weakly-Supervised transformer-based hybrid network with multi-attention for pavement crack detection” [Constr. Build. Mater. vol. 411, 12 January 2024, 134134]
Zhenlin Wang ᵃ, Zhufei Leng ᵇ,*, Zhixin Zhang ᵃ
ᵃ School of Mechanical and Electrical Engineering, University of Electronic Science and Technology of China, Sichuan 611731, PR China
ᵇ School of Information and Communication Engineering, University of Electronic Science and Technology of China, Sichuan 611731, PR China

The authors regret that the affiliation in the published article was incorrect. The correct order of the affiliations should read as:
ᵃ School of Mechanical and Electrical Engineering, University of Electronic Science and Technology of China, Sichuan, 611731, PR China.
ᵇ School of Information and Communication Engineering, University of Electronic Science and Technology of China, Sichuan, 611731, PR China.
The authors would like to apologise for any inconvenience caused.

DOI of original article: https://doi.org/10.1016/j.conbuildmat.2023.134134.


* Corresponding author.
E-mail address: 767648785@qq.com (Z. Leng).


Available online 28 December 2023


