
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015

MS-Former: Memory-Supported Transformer for Weakly Supervised Change Detection with Patch-Level Annotations

Zhenglai Li, Chang Tang, Senior Member, IEEE, Xinwang Liu, Senior Member, IEEE, Changdong Li, Xianju Li, Wei Zhang

arXiv:2311.09726v1 [cs.CV] 16 Nov 2023

Abstract—Fully supervised change detection methods have achieved significant advancements in performance, yet they depend severely on acquiring costly pixel-level labels. Considering that patch-level annotations also contain abundant information corresponding to both changed and unchanged objects in bi-temporal images, an intuitive solution is to segment the changes with patch-level annotations. How to capture the semantic variations associated with the changed and unchanged regions from the patch-level annotations to obtain promising change results is the critical challenge for the weakly supervised change detection task. In this paper, we propose the memory-supported transformer (MS-Former), a novel framework consisting of a bi-directional attention block (BAB) and a patch-level supervision scheme (PSS) tailored for weakly supervised change detection with patch-level annotations. More specifically, the BAB captures contexts associated with the changed and unchanged regions from the temporal difference features to construct informative prototypes stored in a memory bank. In the other direction, the BAB extracts useful information from the prototypes as supplementary contexts to enhance the temporal difference features, thereby better distinguishing changed and unchanged regions. After that, the PSS guides the network in learning valuable knowledge from the patch-level annotations, thus further elevating the performance. Experimental results on three benchmark datasets demonstrate the effectiveness of our proposed method in the change detection task. The demo code for our work will be publicly available at https://github.com/guanyuezhen/MS-Former.

Index Terms—Remote sensing change detection, weakly-supervised learning, memory bank mechanism, patch-level annotations.

I. INTRODUCTION

As a fundamental technique in the area of remote sensing image understanding, change detection (CD) locates landscape changes and assigns each pixel of co-registered bi-temporal images, captured at diverse periods within a given geographical area, a binary category, e.g., changed or unchanged. During the past several years, CD has received increased attention and has been applied to a wide range of applications, such as land-use change detection [1], [2], urban development monitoring [3], [4], global resources monitoring [5], and damage assessment [6], [7].

With the rapid development of deep learning models, numerous approaches tailored specifically for the CD task have been introduced [8]. However, the impressive performance exhibited by most existing CD methods depends heavily on well-annotated pixel-level labels, whose acquisition is notoriously time-consuming and labor-intensive. The global advent of high-resolution satellites has spurred interest in weakly supervised CD techniques, as they offer more cost-effective alternatives, such as image-level labels, to strike a balance between performance and annotation expenditure. Several methods leveraging diverse techniques have been proposed to enhance CD performance guided by image-level annotations. For instance, Andermatt et al. [9] exploited the Conditional Random Field (CRF) to refine the change mask in the hidden layer. Kalita et al. [10] combined Principal Component Analysis (PCA) and the k-means clustering algorithm to produce the change maps. Wu et al. [11] introduced a Generative Adversarial Network (GAN) to solve unsupervised, weakly-supervised, and region-supervised CD in a unified framework. Huang et al. [12] proposed a Background-Mixed sample augmentation approach (BGMix) to augment samples with the help of a set of background-changing images.

Fig. 1 gives co-registered bi-temporal images with pixel-level, patch-level, and image-level labels. The initial H × W bi-temporal images are cropped into small patches of size h × w. These patches are assigned binary labels indicating the presence or absence of changes, thereby formulating the patch-level annotations. Notably, as the patch size increases, the patch-level labels align more closely with image-level annotations, while decreasing the patch size results in labels close to pixel-wise annotations. In this work, we observe that a slight reduction in patch size substantially enhances change detection performance, as given in Fig. 2. Specifically, on the BCDD dataset¹, the F1 scores for change detection under patch sizes 256×256, 128×128, 64×64, and 32×32 are 0.7431, 0.8414, 0.8906, and 0.9014, respectively. Moreover, the F1 scores demonstrate a significant increase of approximately 9.83,

Z. Li, C. Tang, and X. Li are with the School of Computer, China University of Geosciences, Wuhan, China (E-mail: {yuezhenguan, tangchang, ddwhlxj}@cug.edu.cn).
X. Liu is with the School of Computer, National University of Defense Technology, Changsha 410073, China (E-mail: xinwangliu@nudt.edu.cn).
C. Li is with the Faculty of Engineering, China University of Geosciences, Wuhan, China (E-mail: lichangdong@cug.edu.cn).
W. Zhang is with the Shandong Provincial Key Laboratory of Computer Networks, Shandong Computer Science Center (National Supercomputing Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan 250000, China (E-mail: wzhang@qlu.edu.cn).
Manuscript received April 19, 2021; revised August 16, 2021. This work was supported in part by the National Science Foundation of China under Grant 62076228, and in part by the Natural Science Foundation of Shandong Province under Grant ZR2021LZH001. (Corresponding author: Chang Tang.)
¹ http://study.rsgis.whu.edu.cn/pages/download/building dataset.html

Fig. 1: Comparison of the pixel-level, image-level, and our patch-level labels for remote sensing change detection.

14.75, and 15.84 percentage points when transitioning from patch size 256×256 to 128×128, 64×64, and 32×32, respectively. This observation suggests the potential of exploring patch-level annotations for remote sensing change detection.

In this paper, we propose a novel memory-supported transformer (MS-Former) for the purpose of weakly supervised CD. Inspired by the recent advantage of the memory bank mechanism in deep learning, we introduce a memory bank to store the prototypes depicting semantic variations associated with the changed and unchanged regions. To this end, our MS-Former incorporates a bi-directional attention block (BAB) designed to enhance context extraction from bi-temporal images in a dual-directional manner. First, the BAB selects salient feature representations within each patch and fuses them with the prototypes stored in the memory bank, progressively capturing the distinctive characteristics of changes across the entire dataset in an online procedure. Second, the BAB extracts supplementary contextual information from the prototypes to refine the temporal difference features, thus facilitating a more effective distinction between changed and unchanged regions within bi-temporal images. Concurrently, our patch-level supervision scheme (PSS) delivers patch-level labels at both high and low resolutions, guiding the network in acquiring valuable insights from patch-level annotations and thereby further enhancing its performance.

Fig. 2: Comparison of the change detection performance measured by F1 of our proposed MS-Former using patch-level labels across different patch size settings on the BCDD dataset.

The main contributions of this work are summarized as follows:
1) We establish a novel benchmark for remote sensing CD incorporating patch-level supervision. The introduced patch-level annotations strike a balance between change detection performance and the associated label annotation costs compared with both image-level and pixel-level labels.
2) We design a novel memory-supported transformer (MS-Former) tailored for remote sensing change detection under patch-level supervision. The MS-Former integrates a bi-directional attention block (BAB) and a patch-level supervision scheme (PSS) to achieve considerable change detection performance.
3) A comprehensive set of experiments is conducted on three benchmark datasets to assess the effectiveness and superiority of our proposed MS-Former. Considerable experimental results suggest that patch-level supervision could offer an effective solution to the remote sensing CD task.

The structure of this paper is organized as follows: Section II offers a review of related literature concerning remote sensing CD and memory bank mechanisms. In Section III, the proposed method is elaborated in detail. Extensive experiments, conducted on three benchmark datasets to assess the performance of the method, along with comprehensive discussions and model analyses, are presented in Section IV. The paper concludes in Section V.

II. RELATED WORK

A. Change Detection

1) Fully supervised change detection: With the rapid advancement of deep learning techniques, CD methods have made significant strides, particularly under the guidance of pixel-wise supervision. Daudt et al. [13] proposed three end-to-end CD networks based on U-net [14] and explored the effectiveness of three diverse architectures, i.e., FC-EF, FC-Siam-conc, and FC-Siam-diff. Building upon this foundation, subsequent efforts have focused on enhancing CD performance. For instance, the integration of long short-term memory (LSTM) and skip connections is employed to extract more distinguishable multi-level features [15]. A full-scale feature fusion approach has been introduced to aggregate information across various feature scales. Additionally, a hybrid attention mechanism is utilized to capture long-range contextual information among bi-temporal features, thereby enhancing the accuracy of CD [16]. Li et al. [17] proposed a guided

refinement model, which aggregates multi-level features and iteratively refines them, effectively filtering out irrelevant noise from the features. Further insights into recent developments in this field can be found in the comprehensive survey [8].

Previous CD methods primarily explore multi-level feature fusion [18]–[20], temporal difference extraction [1], [21], [22], and attention mechanisms [23]–[25] to capture distinctive feature representations, thus optimizing performance under the guidance of pixel-level annotations. However, the labor-intensive nature of pixel-wise labeling renders it impractical and inefficient for addressing large-scale remote sensing data challenges.

2) Weakly supervised change detection: Weakly supervised CD adopts more economical weak supervision, such as image-level labels, as opposed to intricate pixel-wise annotations, aiming to strike a balance between performance and expenses. Several techniques have been proposed to enhance CD performance relying on image-level annotations. For example, Andermatt et al. [9] utilized the Conditional Random Field (CRF) in the hidden layer to refine the change mask. Kalita et al. [10] combined Principal Component Analysis (PCA) with the k-means clustering algorithm to generate change maps. Wu et al. [11] introduced Generative Adversarial Networks (GAN) to tackle unsupervised, weakly-supervised, and region-supervised CD challenges. Huang et al. [12] devised a sample augmentation approach, integrating background images effectively into the samples. These methods demonstrate the diverse approaches employed to optimize CD performance with the support of image-level annotations, avoiding costly pixel-wise annotations. In contrast to prior approaches utilizing image-level labels, our study introduces novel patch-level annotations with flexible patch size settings, enabling a fine balance between performance and costs.

B. Memory Bank Mechanisms

The memory bank mechanism is a widely used technique in deep learning. For instance, Li et al. [26] introduced diverse memory prototypes for each modality to capture its semantic structure for enhancing cross-modal retrieval. Cui et al. [27] proposed an appearance-motion memory consistency network to make full use of the appearance and motion knowledge in the high-level feature space to perform video anomaly detection. In [28], a regional memory network is presented to capture object cues from past frames in a local-to-local matching manner to boost the performance of video object segmentation. In recent years, the memory bank mechanism has received much attention in weakly-supervised semantic segmentation [29], [30]. Fan et al. [29] utilized the memory bank mechanism to explore cross-image contexts to refine the pseudo-masks. Zhou et al. [30] introduced a regional memory bank to store diverse object patterns, which are constrained by a regional semantic contrast regularization to recover the dataset-level semantic structure, leading to improved segmentation performance. In [31], a visual words learning strategy is proposed to force the network to focus on partial discriminative object regions to generate more accurate Class Activation Maps (CAMs). Inspired by the great progress of the memory bank mechanism in deep learning, we apply the memory bank to weakly supervised CD to explore the semantic variations associated with the changed and unchanged regions, which is a novel design in this field.

III. PROPOSED METHOD

In this section, we first introduce the weakly supervised CD problem and provide an overview of our proposed MS-Former. Subsequently, we explicate the individual modules of our proposed MS-Former in detail. Finally, we illustrate the training loss of the method.

A. Overview

Given a registered pair of images It ∈ R^(H×W×3), t ∈ {1, 2}, where H, W, and t respectively denote the height, width, and temporal order of the images, the bi-temporal images can be cropped into multiple non-overlapping paired patches as,

P(Pt^k, k ∈ {1, 2, · · · , K}) = Crop(It, t ∈ {1, 2}),   (1)

where Pt^k ∈ R^(h×w×3), t ∈ {1, 2} is the pair of bi-temporal image patches, and h and w represent the height and width of the patches. Then, a binary indicator is utilized to label whether there are changes in each patch. Such a process can be denoted as,

Yk = 1 if changes exist in P1^k and P2^k, and Yk = 0 otherwise,   (2)

where Yk is the label of the k-th patch in the patch-level annotation Y ∈ R^(H×W).

As highlighted in the previous section, it is evident that when the patch size is set to match the image size, the patch-level labels align with the image-level labels. As the patch size decreases, the obtained patch-level annotations closely approximate the pixel-level labels. Therefore, fine-tuning the patch size within appropriate ranges enables a balance between change detection performance and the costs associated with label annotations. During the training phase, our model is trained to utilize bi-temporal images with patch-level annotations, enabling precise identification of changes within the bi-temporal images. Notably, in comparison to pixel-wise annotations, the utilization of patch-level annotations significantly reduces the overall labeling costs.

As illustrated in Fig. 3, the proposed MS-Former consists of three key modules: a feature extractor, a bi-directional attention block, and a patch-level supervision scheme. The feature extractor is applied to extract temporal difference feature representations from the bi-temporal images. The bi-directional attention block is designed to learn changes-aware prototypes from the whole training dataset and leverage them to refine the temporal difference features to generate more accurate change maps. Finally, the patch-level supervision scheme is introduced to guide the model learning from the patch-level annotations. More details of the feature extractor, bi-directional attention block, and patch-level supervision scheme will be presented in Subsections III-B, III-C, and III-D, respectively.
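The patch-label construction of Eqs. (1) and (2) amounts to block-wise max pooling of a binary pixel-level change mask. The following is a minimal NumPy sketch of that idea (the helper name is illustrative, not from the paper), assuming square patches and image dimensions divisible by the patch size:

```python
import numpy as np

def patch_level_labels(pixel_mask: np.ndarray, patch: int) -> np.ndarray:
    """Derive patch-level labels Y from a binary pixel-level mask (Eq. 2).

    A patch is labeled 1 if any changed pixel falls inside it, else 0.
    Assumes H and W are divisible by `patch`, as with the 256x256 crops
    used in the paper.
    """
    H, W = pixel_mask.shape
    blocks = pixel_mask.reshape(H // patch, patch, W // patch, patch)
    # Max over each patch block acts as the binary change indicator.
    return blocks.max(axis=(1, 3)).astype(np.uint8)

# Toy 8x8 mask with one changed region; 4x4 patches give a 2x2 label map.
mask = np.zeros((8, 8), dtype=np.uint8)
mask[1:3, 5:7] = 1
labels = patch_level_labels(mask, 4)
# labels -> [[0, 1], [0, 0]]
```

Shrinking `patch` makes the label map approach the pixel-level mask, which is exactly the trade-off studied in Fig. 2.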

Fig. 3: A framework of the proposed MS-Former. Initially, the bi-temporal images pass through a feature extractor to capture the temporal difference features. After that, the temporal difference features and the prototypes stored in the memory bank are jointly learned by a series of bi-directional attention blocks. Finally, a patch-level supervision scheme is introduced to guide the network learning from the patch-level annotations.

B. Feature Extractor

The feature extractor contains a weight-shared backbone to obtain multi-level bi-temporal features, denoted as Ft^i ∈ R^(H/2^(i+1) × W/2^(i+1) × ci), i ∈ {2, 3, 4, 5}, t ∈ {1, 2}. Then, we utilize an element-wise subtraction operation to extract temporal difference information, i.e., D^i ∈ R^(H/2^(i+1) × W/2^(i+1) × ci), i ∈ {2, 3, 4, 5}, from the multi-level bi-temporal features. Finally, a decoder is exploited to aggregate the multi-level temporal difference information into a unified feature representation for change detection.

C. Bi-directional Attention Block

The memory bank mechanism is a widely used tool in deep learning for exploring the semantic structure of the whole dataset. Inspired by this, we formulate a memory bank to store the semantic variations associated with the changed and unchanged regions. Afterward, a bi-directional attention block (BAB) is introduced to construct the connection between the prototypes stored in the memory bank and the pixel-wise temporal difference feature representations. The goal of the BAB is to employ the prototypes, which are learned from the whole dataset, to enhance the temporal feature representations, enabling the generation of better change maps, as shown in Fig. 3. The BAB is composed of a pixel-to-memory (P2M) attention layer, a memory-to-pixel (M2P) attention layer, and two feed-forward layers [32]. In the following, the P2M and M2P layers are presented in detail.

1) Pixel-to-Memory Attention: The memory bank stores the contextual prototypes of changed and unchanged regions. In this part, we introduce the pixel-to-memory (P2M) attention, which extracts contexts from the pixel-wise temporal difference feature representations to update the prototypes. Let Ps ∈ R^(Np×C) and Ms ∈ R^(Nm×C) be the inputs of the s-th BAB, where Np, Nm, and C are the number of temporal difference feature tokens, the memory length, and the number of channels, respectively.

We first leverage a representative feature extractor to obtain contexts from each patch in the paired bi-temporal images. The adaptive max pooling operation is utilized to extract the most representative feature Pm in each patch of the pixel-wise temporal difference feature representations. However, Pm only encodes the most representative information in each patch, resulting in losing the contextual information in bi-temporal images. To this end, the pyramid adaptive average pooling used in [33] is introduced to capture the global semantic contexts at diverse pooling sizes and construct a series of pyramid features Pa1, · · · , Pan. Such a process can be denoted as,

Pm = MP(Ps),
Pa1 = AP1(Ps),
· · · ,
Pan = APn(Ps),   (3)

where AP(·) and MP(·) represent the adaptive average pooling and adaptive max pooling operations, respectively.

Then, the global and local contexts are passed through the corresponding convolution layer for feature transfer and concatenated with the memory prototypes, as,

M̂s = Cat(Ms, Conv1(Pm), Conv1(Pa1), · · · , Conv1(Pan)),   (4)

where Conv1(·) denotes a 1×1 convolution layer, Cat(·) is the feature concatenation operation, and M̂s denotes the pixel-augmented memory prototypes.

With the above augmentation, we utilize cross-attention to aggregate information from the pixel-augmented memory prototypes M̂s into Ms. The forward pass of the attention is formulated as,

Ms+1 = softmax(Ms Wq (M̂s Wk)⊤) M̂s Wv,
Ms+1 = Ms + Ms+1 Wo,   (5)

where Wq, Wk, Wv, and Wo denote four learnable linear projections. The cross-attention updates the memory prototypes with the pixel-level temporal difference representations and models the relationships among diverse memory prototypes, which makes it possible to learn the prototypes of various changes adaptively for different bi-temporal images.

2) Memory-to-Pixel Attention: After updating the memory prototypes, we apply the memory-to-pixel (M2P) attention, implemented by cross-attention, to query additional information from the memory prototypes to generate more

discriminative temporal difference feature representations. The cross-attention can be formulated as,

Ps+1 = softmax(Ps Wq (Ms+1 Wk)⊤) Ms+1 Wv,
Ps+1 = Ps + Ps+1 Wo,   (6)

The cross-attention updates the pixel-level temporal difference feature representations with the memory prototypes and makes them aware of the whole dataset, for better adjusting the diversities of changed and unchanged regions in the embedding space.

D. Patch-level Supervision Scheme

After obtaining the updated temporal difference features from the bi-directional attention blocks, we generate the change maps G with a convolution layer. Then, patch-level annotations are utilized to guide the learning process of the network. To achieve this goal, we introduce a patch-level supervision scheme.

The patch-level supervisions indicate whether a pair of bi-temporal image patches contains changes or not, as defined in Eq. (2). In paired bi-temporal image patches, we select the most significant pixel, via an adaptive max pooling operation, to formulate a local-scale change map, which preserves the change information of the patch-level label:

Yl = MP(Y), Gl = MP(G),   (7)

where Yl and Gl are the ground-truth and predicted local-scale change maps, respectively. Then, the binary cross-entropy (BCE) loss Lbce is exploited to supervise the predicted local-scale change maps as well as classify bi-temporal image patches into changed or unchanged classes, as,

Lpcl = Lbce(Gl, Yl) = −(Yl · log(Gl) + (1 − Yl) · log(1 − Gl)),   (8)

where Lpcl is the patch classification loss.

In addition to the patch classification loss, we construct an unchanged patch consistency loss to filter the mis-predictions that occur in unchanged patches as,

Lupcl = ∥(1 − Y) · (G − Y)∥1,   (9)

where ∥·∥1 is the ℓ1 loss.

Thus, the loss of the patch-level supervision scheme can be formulated as,

Lpss = Lpcl(Gl, Yl) + Lupcl(G, Y).   (10)

E. Loss Function

In our BAB, we extract the most representative feature Pm from each patch in the paired bi-temporal images. Here, we additionally use the BCE loss to ensure the semantic properties of Pm. Supposing Qs is the predicted semantic map of Pm in the s-th BAB, obtained by a single convolution layer, the loss can be represented as,

Lsp = Lbce(Qs, Yl).   (11)

With the loss of the patch-level supervision scheme, the total training loss of MS-Former can be denoted as,

L = Σ_{s=1}^{S} Lsp(Qs, Yl) + Lpss,   (12)

where S is the number of BABs in the MS-Former.

IV. EXPERIMENTS

A. Datasets

We perform experiments on three widely used CD datasets: BCDD², LEVIR [3], and SYSU [34]. The detailed information is given as follows.

BCDD: A high-resolution (0.3 m) urban building CD dataset, consisting of a pair of aerial images of size 32507 × 15354, collected in 2012 and 2016. We leverage the data processed in BGMix [12], which crops the bi-temporal images into 256 × 256 patches. Then, 90% and 10% of the samples are randomly selected for training and testing, respectively. The cropped dataset consists of 1890 changed and 5544 unchanged paired image patches.

LEVIR: A high-resolution (0.3 m) building CD dataset, which consists of 637 pairs of bi-temporal remote sensing images of 1024 × 1024 spatial size. The dataset is officially split 7:1:2 for training, validation, and testing [3]. We crop the original images into 256 × 256 non-overlapping patches and obtain a total of 7120/1024/2048 image pairs for training/validation/testing, respectively.

SYSU: This dataset consists of 20000 pairs of bi-temporal remote sensing image patches with 256 × 256 spatial size and 0.5 m spatial resolution. The ratios of training, validation, and testing are officially set to 6:2:2. This dataset contains various types of complex change scenes, including road expansion, newly built urban buildings, change of vegetation, suburban dilation, and groundwork before construction.

B. Evaluation Metrics

In our experiments, five widely used metrics, namely the Kappa coefficient (κ), intersection over union (IoU), F1-score (F1), recall (Rec), and precision (Pre), are employed to evaluate performance on the CD task. The detailed calculations of the five metrics can be found in [17], [20].

C. Implementation Details

In this paper, ResNet18 [35] pre-trained on ImageNet is exploited as the image encoder to extract bi-temporal features from the bi-temporal images, for a fair comparison. The decoder is set as UperNet [36], which is widely used for aggregating multi-level features in the semantic segmentation task. The memory length Nm and the number of BABs are set as 128 and 3, respectively. As done in [33], the pyramid adaptive average pooling ratios are set as 12, 16, 20, and 24.

The proposed method is implemented via the PyTorch toolbox [37] and runs on a single Nvidia Titan V GPU. The Adam optimizer [38], in which the momentum,

² http://study.rsgis.whu.edu.cn/pages/download/building dataset.html
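The patch-level supervision scheme of Eqs. (7)–(10) can be sketched numerically as follows. This is a minimal NumPy illustration, not the authors' implementation: maps are simplified to single-channel 2-D arrays, mean reductions are assumed for both losses, and the block-max pooling stands in for adaptive max pooling:

```python
import numpy as np

def pss_loss(G: np.ndarray, Y: np.ndarray, patch: int) -> float:
    """Patch-level supervision scheme, Eqs. (7)-(10), as a sketch.

    G: predicted change probability map, shape (H, W), values in (0, 1).
    Y: patch-level label map expanded to pixel resolution, binary (H, W).
    """
    H, W = G.shape
    # Eq. (7): max pooling keeps the most significant pixel per patch.
    pool = lambda x: x.reshape(H // patch, patch, W // patch, patch).max(axis=(1, 3))
    Gl, Yl = pool(G), pool(Y)
    eps = 1e-7
    # Eq. (8): binary cross-entropy patch classification loss (mean reduction assumed).
    l_pcl = -np.mean(Yl * np.log(Gl + eps) + (1 - Yl) * np.log(1 - Gl + eps))
    # Eq. (9): l1 consistency suppressing predictions inside unchanged patches.
    l_upcl = np.mean(np.abs((1 - Y) * (G - Y)))
    return float(l_pcl + l_upcl)  # Eq. (10)

# Example: an 8x8 map split into 4x4 patches, one patch changed.
Y = np.zeros((8, 8)); Y[:4, :4] = 1.0
G = np.where(Y == 1, 0.9, 0.1)
loss = pss_loss(G, Y, 4)
```

A prediction that matches the patch labels drives both terms toward zero, while false alarms inside unchanged patches are penalized twice: once through Gl in Eq. (8) and once pixel-wise through Eq. (9).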



TABLE I: Ablation studies of the proposed method with diverse settings in terms of κ, IoU, and F1 on the BCDD dataset. Each column group reports κ / IoU / F1 for one patch size.

| No. | Variant | 256×256 | 128×128 | 64×64 | 32×32 |
| #01 | WCDNet | 0.7332 / 0.5912 / 0.7431 | 0.8342 / 0.7262 / 0.8414 | 0.8857 / 0.8028 / 0.8906 | 0.8980 / 0.8221 / 0.9024 |
(a) Bi-directional Attention Block (BAB)
| #02 | w/o BAB | 0.7172 / 0.5711 / 0.7270 | 0.7882 / 0.6621 / 0.7967 | 0.8640 / 0.7694 / 0.8697 | 0.8294 / 0.7186 / 0.8363 |
| #03 | Nm = 64 | 0.7213 / 0.5768 / 0.7316 | 0.8360 / 0.7291 / 0.8433 | 0.8808 / 0.7951 / 0.8859 | 0.8800 / 0.7939 / 0.8851 |
| #04 | Nm = 192 | 0.7164 / 0.5712 / 0.7271 | 0.8288 / 0.7186 / 0.8362 | 0.8770 / 0.7893 / 0.8822 | 0.8814 / 0.7960 / 0.8864 |
| #05 | BAB w/o P2M | 0.6612 / 0.5056 / 0.6716 | 0.8155 / 0.7001 / 0.8236 | 0.8651 / 0.7712 / 0.8708 | 0.8230 / 0.7093 / 0.8299 |
| #06 | BAB w/o MP | 0.7389 / 0.6002 / 0.7501 | 0.8150 / 0.6993 / 0.8230 | 0.8720 / 0.7816 / 0.8774 | 0.8442 / 0.7399 / 0.8505 |
| #07 | BAB w/o AP | 0.7011 / 0.5529 / 0.7121 | 0.8321 / 0.7232 / 0.8394 | 0.8748 / 0.7859 / 0.8801 | 0.8712 / 0.7803 / 0.8766 |
(b) Patch-level Supervision Scheme (PSS)
| #08 | PSS w/o L1 | 0.6863 / 0.5352 / 0.6973 | 0.8355 / 0.7280 / 0.8426 | 0.8741 / 0.7849 / 0.8795 | 0.8872 / 0.8051 / 0.8920 |
| #09 | PSS w/o BCE | 0.7235 / 0.5797 / 0.7339 | 0.8300 / 0.7198 / 0.8085 | 0.7994 / 0.6763 / 0.8069 | 0.0063 / 0.0033 / 0.0066 |
| #10 | Directly sup | 0.2976 / 0.2101 / 0.3472 | 0.4207 / 0.2977 / 0.4588 | 0.5646 / 0.4191 / 0.5906 | 0.6986 / 0.5564 / 0.7150 |
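The κ, IoU, F1, Rec, and Pre values reported in the tables can all be derived from a binary confusion matrix. The sketch below is a generic illustration (the exact formulations used in the paper follow [17], [20]) and assumes both classes occur in prediction and ground truth so no denominator vanishes:

```python
import numpy as np

def cd_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Kappa, IoU, F1, recall, and precision for binary change maps.

    pred, gt: binary arrays of the same shape (1 = changed).
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)     # changed pixels detected as changed
    fp = np.sum(pred & ~gt)    # false alarms
    fn = np.sum(~pred & gt)    # missed changes
    tn = np.sum(~pred & ~gt)   # unchanged pixels kept unchanged
    n = tp + fp + fn + tn
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * pre * rec / (pre + rec)
    iou = tp / (tp + fp + fn)
    po = (tp + tn) / n                                            # observed agreement
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2   # chance agreement
    kappa = (po - pe) / (1 - pe)
    return {"kappa": kappa, "iou": iou, "f1": f1, "rec": rec, "pre": pre}

# Example: one false alarm out of four pixels.
m = cd_metrics(np.array([[1, 1], [0, 0]]), np.array([[1, 0], [0, 0]]))
# m["iou"] -> 0.5
```

Note that F1 and IoU are monotonically related (IoU = F1 / (2 − F1)), which is why the table columns rise and fall together.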

TABLE II: Quantitative comparisons of the proposed method with fully supervised change detection methods in terms of κ, IoU, F1, Rec, and Pre on the LEVIR and SYSU datasets. Parentheses give the difference from the best fully supervised result.

| Metric | STANet | L-Unet | SNUNet | DSIFN | BIT | TFI-GR | A2Net | Ours 128×128 | Ours 64×64 | Ours 32×32 |
| FLOPs (G) | 25.69 | 34.63 | 246.22 | 164.56 | 16.88 | 19.44 | 6.02 | 22.14 | 22.14 | 22.14 |
| Params (M) | 12.28 | 8.45 | 27.07 | 50.71 | 3.01 | 28.37 | 3.78 | 15.20 | 15.20 | 15.20 |
LEVIR
| κ | 0.8839 | 0.8935 | 0.8972 | 0.9090 | 0.9007 | 0.9045 | 0.9129 | 0.6982 (↓0.2147) | 0.7561 (↓0.1568) | 0.8034 (↓0.1095) |
| IoU | 0.8014 | 0.8164 | 0.8220 | 0.8408 | 0.8276 | 0.8336 | 0.8472 | 0.5571 (↓0.2901) | 0.6258 (↓0.2214) | 0.6864 (↓0.1608) |
| F1 | 0.8898 | 0.8989 | 0.9023 | 0.9135 | 0.9057 | 0.9093 | 0.9173 | 0.7156 (↓0.2017) | 0.7698 (↓0.1475) | 0.8141 (↓0.1032) |
| Rec | 0.8761 | 0.8915 | 0.8897 | 0.8944 | 0.8940 | 0.8973 | 0.9059 | 0.8223 (↓0.0836) | 0.8606 (↓0.0453) | 0.8692 (↓0.0367) |
| Pre | 0.9038 | 0.9064 | 0.9154 | 0.9335 | 0.9177 | 0.9215 | 0.9290 | 0.6334 (↓0.2956) | 0.6964 (↓0.2326) | 0.7655 (↓0.1635) |
SYSU
| κ | 0.7100 | 0.7398 | 0.7391 | 0.7205 | 0.7333 | 0.7902 | 0.7916 | 0.5979 (↓0.1937) | 0.6681 (↓0.1235) | 0.7364 (↓0.0552) |
| IoU | 0.6332 | 0.6662 | 0.6673 | 0.6442 | 0.6584 | 0.7240 | 0.7237 | 0.5065 (↓0.2175) | 0.5794 (↓0.1446) | 0.6559 (↓0.0681) |
| F1 | 0.7754 | 0.7996 | 0.8004 | 0.7836 | 0.7940 | 0.8399 | 0.8397 | 0.6724 (↓0.1675) | 0.7337 (↓0.1062) | 0.7922 (↓0.0477) |
| Rec | 0.7430 | 0.7808 | 0.7979 | 0.7511 | 0.7668 | 0.8437 | 0.8224 | 0.5538 (↓0.2899) | 0.6314 (↓0.2123) | 0.7181 (↓0.1256) |
| Pre | 0.8108 | 0.8195 | 0.8030 | 0.8190 | 0.8232 | 0.8361 | 0.8577 | 0.8556 (↓0.0021) | 0.8755 (↑0.0178) | 0.8832 (↑0.0255) |

weight decay, and parameters β1 and β2 are respectively set to 0.9, 0.0001, 0.9, and 0.99, is adopted to optimize the network. Then, the poly learning scheme is applied to adjust the learning rate as (1 − cur_iteration/max_iteration)^power × lr, where power and max_iteration are set as 0.9 and 40000, respectively. We set the initial learning rate as 0.0005 and the batch size as 32. Random flipping and temporal exchanging are employed on the image patches for data augmentation.

D. Ablation Studies

In this part, we perform ablation studies on the BCDD dataset to explore the effectiveness of diverse components in the proposed method.

1) Effectiveness of the bi-directional attention block (BAB): In the proposed method, memory prototypes are introduced to preserve the attributes of changed and unchanged regions. The bi-directional attention block (BAB) is proposed to learn the memory prototypes from the pixel-wise temporal difference feature representations and to enhance the temporal difference representation with the updated memory prototypes. To validate the effectiveness of the BAB, we directly remove the BAB from the network and term the method w/o BAB (#02 in TABLE I). From the results in TABLE I, the performance of w/o BAB decreases about 1.61, 4.47, 2.09, and 6.61 percentage points in terms of F1 compared with #01 on the BCDD dataset in the 256×256, 128×128, 64×64, and 32×32 patch size settings, respectively. This indicates the effectiveness of the proposed BAB and emphasizes the importance of the memory-augmented representation learning manner in boosting performance for the weakly-supervised CD task.

The length of memory prototypes Nm is a parameter in the BAB. We search the parameter in the range of {64, 128, 192}. From the results in TABLE I, we observe that #01 outperforms #03 and #04 on the BCDD dataset. Thus, the length of memory prototypes is set to 128 in the proposed method.

To further study the effectiveness of diverse components in the BAB, we construct three comparators, i.e., BAB w/o P2M, BAB w/o MP, and BAB w/o AP (#05, #06, #07 in TABLE I). For BAB w/o P2M, we drop the pixel-to-memory attention and just utilize self-attention to self-update the memory prototypes. As shown in TABLE I, the performance of BAB w/o P2M declines about 7.15, 1.78, 1.89, and 7.29 percentage points measured by F1 compared with #01 on the BCDD dataset under the four patch size settings, respectively. Thus, leveraging the temporal difference information to update the memory prototypes is important in the proposed method. As to BAB
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 7

w/o MP and BAB w/o AP, we respectively remove the global 64 × 64, and 32 × 32. The above observations strongly demon-
and local representation feature extractors in the pixel to strate the effectiveness and superiority of the proposed method.
memory attention. From the results in TABLE I, such two In addition, when the patch size decreases, our proposed
methods both decrease a lot compared with #01 on the BCDD method can capture more accurate change detection results,
dataset. Therefore, both global and local representation feature verifying the effectiveness of the idea of employing patch-level
extractors play a key role in updating the memory prototypes annotations to achieve the trade-offs between performance and
and consequently boost the change detection performance. costs.
2) Effectiveness of patch-level supervision scheme (PSS):
The Patch-level Supervision Scheme (PSS) is formulated to
F. Comparison With Fully-supervised CD Methods
guide the network learning knowledge from the patch-level
annotations. In PSS, the patch classification loss and consis- We compare the proposed model with several state-of-
tency loss are jointly utilized. To verify their effectiveness, the-art remote sensing change detection approaches, includ-
we respectively remove the patch classification loss and con- ing STANet [3], L-Unet [15], SNUNet [40], DSIFN [41],
sistency loss termed as PSS w/o pcl and PSS w/o cl (#08 BIT [42], TFI-GR [17], and A2Net [20].
and #09 in TABLE I). As seen from the results, the change 1) Quantitative comparisons: The quantitative evaluation
detection performance of PSS w/o pcl and PSS w/o cl methods results in terms of κ, IoU, F1, Rec, and Pre of our weakly
both decrease significantly on BCDD datasets. Thus, we can supervised MS-Former and other fully supervised change
observe that the patch classification loss and consistency loss detection methods are reported in TABLE II. The model
can improve the change detection performance, indicating the parameters (Params), and computation costs (FLOPs) related
effectiveness of PSS. In addition, we formulate a method to all approaches are also given in TABLE II. From the results,
that directly uses the patch-level annotation to supervise the we observe that there are about 20.17 and 16.75 percentage
network via binary cross entropy loss as w DS (#10 in TABLE gaps between our MS-Former with 128 × 128 patch-size
I). Compared with PSS, w DS method reduces the performance patch-level annotations and the best fully supervised change
a lot. detection performers measured by F1 on LEVIR and SYSU
datasets. Our MS-Former with 32 × 32 patch-size patch-level
annotations can significantly narrow the gap to 10.32 and 4.77
E. Comparison With Weakly-supervised CD Methods percentages in terms of F1 on LEVIR and SYSU datasets. It
We compare the proposed model with ten state-of-the- should be noted that the change detection performance gap
art remote sensing change detection approaches, including keeps getting smaller as the patch size goes smaller. Such
WCDNet [9], WS-C [10], CAM [11], FCD-GAN [11], BG- observations strongly indicate the effectiveness of the patch-
Mix [12], TransWCD [39], and TransWCD-DL [39]. The level annotation and our MS-Former method.
change detection performance of various methods in terms of 2) Visual comparisons: To intuitively compare our MS-
κ, IoU, F1, Rec, and Pre on the BCDD dataset is given in Former and other methods, we present the visual comparison
TABLE III. results of BIT, TFI-GR, A2Net, and our MS-Former on
LEVIR+ and BCDD datasets in Figs. 4 and 5, respectively.
TABLE III: Quantitative comparisons of the proposed method White, red, black, and blue are used to indicate true positives,
with other image-level supervised change detection methods in false positives, true negatives, and false negatives, respectively,
terms of κ, IoU, F1, OA, Rec, and Pre on the BCDD dataset. for better visualization. Based on the visual results in Figs.
Methods Patch-size κ IoU F1 Rec Pre 4 and 5, we observe that the proposed method demonstrates
WCDNet 256 × 256 - 0.2210 0.3930 - -
superiority in the following aspects:
WS-C 256 × 256 - 0.1937 0.3245 0.2387 0.5067 The visual results in Fig. 4 demonstrate that the proposed
CAM 256 × 256 - 0.3755 0.5460 0.4940 0.6102 MS-Former generally identifies the changed regions and ef-
FCD-GAN 256 × 256 - 0.3932 0.5645 0.4893 0.6678
BGMix 256 × 256 - 0.4270 0.6240 - - fectively eliminates the irrelevant pseudo changes caused by
TransWCD 256 × 256 - 0.5236 0.6873 0.7534 0.6319 background clutters in bi-temporal images. And, the prediction
TransWCD-DL 256 × 256 - 0.5619 0.7195 0.6446 0.8142
errors mainly lie in the boundary of changed buildings due
256 × 256 0.7332 0.5912 0.7431 0.6501 0.8672
128 × 128 0.8342 0.7262 0.8414 0.8483 0.8346 to lacking fine-gained supervision. In addition, with the patch
Ours
64 × 64 0.8857 0.8028 0.8906 0.8854 0.8958 size decreasing, our MS-Former obtains more accurate change
32 × 32 0.8980 0.8221 0.9024 0.8983 0.9065
detection results. In Fig. 5, it can be observed that BIT, TFI-
GR, and A2Net fail to detect the entire changed regions. On
As shown in TABEL III, our proposed method exhibits su- the other hand, the proposed MS-Former with 32 × 32 patch
perior performance compared with the other seven image-level size patch-level annotations can identify the changed objects
annotations supervised approaches. For instance, our proposed with fine content integrity. The above comparable performance
method obtains about 2.93, 2.36, 0.55, and 5.30 percentages of our approach is attributed to two main reasons. Firstly,
higher performances in terms of IoU, F1, Rec, and Pre than the we employ a memory-supported transformer to recover the
second-best approach (TransWCD-DL) under 256 × 256 patch fine-grained change information from the bi-temporal image
size setting, respectively. Meanwhile, our proposed method patches, and we jointly utilize the global memory prototypes
consistently achieves superior change detection results than to comprehensively augment the temporal difference features
other approaches when the patch size is set to 128 × 128, to generate accurate change results. Secondly, our method
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 8

Fig. 4: Visual comparisons of the proposed MS-Former and other methods on the LEVIR dataset. (a) t1 images; (b) t2 images;
(c) Ground-truth; (d) BIT; (e) TFI-GR; (f) A2Net; (g) our MS-Former with 128 × 128 patch size; (h) our MS-Former with
64 × 64 patch size; (i) our MS-Former with 32 × 32 patch size. The rendered colors represent true positives (white), false
positives (red), true negatives (black), and false negatives (blue).

Fig. 5: Visual comparisons of the proposed MS-Former and other methods on the SYSU dataset. (a) t1 images; (b) t2 images;
(c) Ground-truth; (d) BIT; (e) TFI-GR; (f) A2Net; (g) our MS-Former with 128 × 128 patch size; (h) our MS-Former with
64 × 64 patch size; (i) our MS-Former with 32 × 32 patch size. The rendered colors represent true positives (white), false
positives (red), true negatives (black), and false negatives (blue).
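The color convention used in these figures (white = true positive, red = false positive, black = true negative, blue = false negative) can be reproduced with a small rendering helper. The sketch below is a hypothetical implementation of that convention, not code from the paper:

```python
import numpy as np

# Color scheme from the figure captions (helper and constants are ours).
COLORS = {
    "tp": (255, 255, 255),  # white: changed pixel, correctly detected
    "fp": (255, 0, 0),      # red: unchanged pixel, wrongly flagged as changed
    "tn": (0, 0, 0),        # black: unchanged pixel, correctly rejected
    "fn": (0, 0, 255),      # blue: changed pixel, missed by the model
}

def render_comparison(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Render a binary prediction against the ground truth as an RGB map."""
    h, w = gt.shape
    out = np.zeros((h, w, 3), dtype=np.uint8)
    out[(pred == 1) & (gt == 1)] = COLORS["tp"]
    out[(pred == 1) & (gt == 0)] = COLORS["fp"]
    out[(pred == 0) & (gt == 0)] = COLORS["tn"]
    out[(pred == 0) & (gt == 1)] = COLORS["fn"]
    return out
```

`render_comparison(pred, gt)` yields an (H, W, 3) uint8 image ready to be saved with any image library.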

TABLE IV: Quantitative comparisons of diverse approaches in terms of κ, IoU, F1, Rec, and Pre on the GVLM dataset. The first three methods are single-temporal fully supervised, the next six are bi-temporal fully supervised, and the last three columns are our weakly supervised method at different patch sizes.

Metric | SINet-V2 | GeleNet | BASNet | DSIFN | BIT | MSCANet | TFI-GR | A2Net | AR-CDNet | Ours 128 × 128 | Ours 64 × 64 | Ours 32 × 32
κ | 0.8747 | 0.8774 | 0.8759 | 0.8566 | 0.8851 | 0.8893 | 0.8974 | 0.8916 | 0.8987 | 0.7098 (↓0.1889) | 0.8049 (↓0.0938) | 0.8251 (↓0.0736)
IoU | 0.7906 | 0.7946 | 0.7930 | 0.7638 | 0.8067 | 0.8131 | 0.8256 | 0.8162 | 0.8275 | 0.5705 (↓0.2570) | 0.6915 (↓0.1360) | 0.7191 (↓0.1084)
F1 | 0.8831 | 0.8855 | 0.8846 | 0.8661 | 0.8930 | 0.8969 | 0.9045 | 0.8988 | 0.9056 | 0.7265 (↓0.1791) | 0.8176 (↓0.0880) | 0.8366 (↓0.0689)
Rec | 0.8563 | 0.8571 | 0.8947 | 0.8324 | 0.8936 | 0.9021 | 0.9069 | 0.8749 | 0.9031 | 0.6258 (↓0.2811) | 0.7783 (↓0.1286) | 0.8009 (↓0.1060)
Pre | 0.9116 | 0.9159 | 0.8746 | 0.9026 | 0.8925 | 0.8918 | 0.9021 | 0.9241 | 0.9082 | 0.8659 (↓0.0582) | 0.8611 (↓0.0630) | 0.8755 (↓0.0486)

Fig. 6: Visual comparisons of the proposed MS-Former and other methods on the GVLM dataset. (a) t1 images; (b) t2 images; (c) Ground-truth; (d) SINet-V2; (e) GeleNet; (f) BASNet; (g) TFI-GR; (h) A2Net; (i) AR-CDNet; (j) our MS-Former with 32 × 32 patch size. The rendered colors represent true positives (white), false positives (red), true negatives (black), and false negatives (blue).
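The experiments above rely on patch-level annotations at several patch sizes (from 256 × 256 down to 32 × 32). The paper's exact labeling procedure is not reproduced in this excerpt; under the common assumption that a patch is labeled changed when it contains at least one changed pixel, such labels can be derived from a pixel-level mask by a block-wise max-pool:

```python
import numpy as np

def patch_labels(mask: np.ndarray, patch: int) -> np.ndarray:
    """Assumed scheme (not the paper's verified code): a patch is labeled
    'changed' if any pixel inside it is changed.

    mask: (H, W) binary pixel-level change mask, H and W divisible by `patch`.
    Returns an (H // patch, W // patch) binary patch-level label map.
    """
    h, w = mask.shape
    # Split the mask into non-overlapping patch × patch blocks...
    blocks = mask.reshape(h // patch, patch, w // patch, patch)
    # ...and reduce each block with a max (block-wise max-pool).
    return blocks.max(axis=(1, 3))
```

With patch = 32 on a 256 × 256 mask this yields an 8 × 8 label map, which is far cheaper to annotate than a full pixel-level mask.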

employs a patch-level supervision scheme to guide the MS-Former in learning from the patch-level annotations, resulting in precise change detection results.

G. Downstream Application on Landslide Detection

As one of the most widespread disasters, landslides may cause mass human casualties and considerable economic losses every year [43]. Thus, detecting such events is an important procedure in landslide hazard monitoring. With the development of earth observation technology, identifying landslides with the help of very-high-resolution remote sensing imagery has attracted increasing attention in recent years [44].

In this part, we assess the generalization capability of our proposed MS-Former on the landslide detection task with patch-level annotations. To this end, we perform experiments on the carefully collected GVLM [45] dataset with nine approaches. GVLM is a large-scale and challenging landslide detection dataset, in which the bi-temporal images are collected from 17 diverse regions across the world, e.g., Vietnam, Zimbabwe, and New Zealand. In our experiments, we cropped the bi-temporal images into 512 × 512 patches and obtained 1389, 199, and 389 paired images for training, validation, and testing, respectively. The compared methods consist of three single-temporal fully supervised ones (SINet-V2 [46], GeleNet [47], BASNet [48]) and six bi-temporal fully supervised ones (DSIFN [41], BIT [42], MSCANet [49], TFI-GR [17], A2Net [20], and AR-CDNet [50]). The quantitative comparisons of the diverse approaches in terms of κ, IoU, F1, Rec, and Pre on the GVLM dataset are reported in TABLE IV. Fig. 6 shows the visualization results generated by the different methods. From the above results, we observe that our proposed MS-Former with patch-level supervision obtains considerable performance compared with the other strong fully supervised methods on the landslide detection task.

V. CONCLUSION

In this work, we introduce a novel memory-supported transformer for weakly supervised change detection with patch-level annotations that is capable of achieving the trade-offs between change detection performance and label annotation costs. The proposed method comprises a bi-directional attention block and a patch-level supervision scheme, which respectively aim to achieve accurate change detection from

the feature representation learning and loss function. The experimental results obtained on three high-spatial-resolution remote sensing change detection datasets demonstrate that the proposed approach outperforms state-of-the-art weakly supervised change detection methods, and obtains considerable performance compared with state-of-the-art fully supervised change detection approaches. We hope that MS-Former will serve as a solid baseline and help ease future research in weakly supervised change detection.

ACKNOWLEDGMENT

The authors wish to gratefully acknowledge the anonymous reviewers for their constructive comments on this paper.

REFERENCES

[1] Z. Zheng, Y. Zhong, S. Tian, A. Ma, and L. Zhang, “Changemask: Deep multi-task encoder-transformer-decoder architecture for semantic change detection,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 183, pp. 228–239, 2022.
[2] Y. Hu, Y. Dong et al., “An automatic approach for land-change detection and land updates based on integrated ndvi timing analysis and the cvaps method with gee support,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 146, pp. 347–359, 2018.
[3] H. Chen and Z. Shi, “A spatial-temporal attention-based method and a new dataset for remote sensing image change detection,” Remote Sensing, vol. 12, no. 10, p. 1662, 2020.
[4] C. Pang, J. Wu, J. Ding, C. Song, and G.-S. Xia, “Detecting building changes with off-nadir aerial images,” Science China Information Sciences, vol. 66, no. 4, p. 140306, 2023.
[5] P. Yuan, Q. Zhao, X. Zhao, X. Wang, X. Long, and Y. Zheng, “A transformer-based siamese network and an open optical dataset for semantic change detection of remote sensing images,” International Journal of Digital Earth, vol. 15, no. 1, pp. 1506–1525, 2022.
[6] Z. Li, W. Shi, P. Lu, L. Yan, Q. Wang, and Z. Miao, “Landslide mapping from aerial photographs using change detection-based markov random field,” Remote Sensing of Environment, vol. 187, pp. 76–90, 2016.
[7] Z. Zheng, Y. Zhong, J. Wang, A. Ma, and L. Zhang, “Building damage assessment for rapid disaster response with a deep object-based semantic change detection framework: From natural disasters to man-made disasters,” Remote Sensing of Environment, vol. 265, p. 112636, 2021.
[8] A. Shafique, G. Cao, Z. Khan, M. Asad, and M. Aslam, “Deep learning-based change detection in remote sensing images: a review,” Remote Sensing, vol. 14, no. 4, p. 871, 2022.
[9] P. Andermatt and R. Timofte, “A weakly supervised convolutional network for change segmentation and classification,” in Asian Conference on Computer Vision, 2020, pp. 103–119.
[10] I. Kalita, S. Karatsiolis, and A. Kamilaris, “Land use change detection using deep siamese neural networks and weakly supervised learning,” in Computer Analysis of Images and Patterns. Springer, 2021, pp. 24–35.
[11] C. Wu, B. Du, and L. Zhang, “Fully convolutional change detection framework with generative adversarial network for unsupervised, weakly supervised and regional supervised change detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[12] R. Huang, R. Wang, Q. Guo, J. Wei, Y. Zhang, W. Fan, and Y. Liu, “Background-mixed augmentation for weakly supervised change detection,” in AAAI Conference on Artificial Intelligence, vol. 37, no. 7, 2023, pp. 7919–7927.
[13] R. C. Daudt, B. Le Saux, and A. Boulch, “Fully convolutional siamese networks for change detection,” in International Conference on Image Processing, 2018, pp. 4063–4067.
[14] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-assisted Intervention, 2015, pp. 234–241.
[15] M. Papadomanolaki, M. Vakalopoulou, and K. Karantzalos, “A deep multitask learning framework coupling semantic segmentation and fully convolutional lstm networks for urban change detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 9, pp. 7651–7668, 2021.
[16] Z. Li, C. Yan, Y. Sun, and Q. Xin, “A densely attentive refinement network for change detection based on very-high-resolution bitemporal remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–18, 2022.
[17] Z. Li, C. Tang, L. Wang, and A. Y. Zomaya, “Remote sensing change detection via temporal feature interaction and guided refinement,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–11, 2022.
[18] X. Zhang, S. Cheng, L. Wang, and H. Li, “Asymmetric cross-attention hierarchical network based on cnn and transformer for bitemporal remote sensing images change detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–15, 2023.
[19] Y. Feng, J. Jiang, H. Xu, and J. Zheng, “Change detection on remote sensing images using dual-branch multilevel intertemporal network,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–15, 2023.
[20] Z. Li, C. Tang, X. Liu, W. Zhang, J. Dou, L. Wang, and A. Y. Zomaya, “Lightweight remote sensing change detection with progressive feature aggregation and supervised attention,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–12, 2023.
[21] T. Lei, J. Wang, H. Ning, X. Wang, D. Xue, Q. Wang, and A. K. Nandi, “Difference enhancement and spatial–spectral nonlocal network for change detection in vhr remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–13, 2022.
[22] S. Fang, K. Li, and Z. Li, “Changer: Feature interaction is what you need for change detection,” IEEE Transactions on Geoscience and Remote Sensing, 2023.
[23] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
[24] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, “Cbam: Convolutional block attention module,” in European Conference on Computer Vision, 2018, pp. 3–19.
[25] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.
[26] H. Li, J. Song, L. Gao, X. Zhu, and H. T. Shen, “Prototype-based aleatoric uncertainty quantification for cross-modal retrieval,” in NeurIPS, 2023.
[27] R. Cai, H. Zhang, W. Liu, S. Gao, and Z. Hao, “Appearance-motion memory consistency network for video anomaly detection,” in AAAI Conference on Artificial Intelligence, vol. 35, no. 2, 2021, pp. 938–946.
[28] H. Xie, H. Yao, S. Zhou, S. Zhang, and W. Sun, “Efficient regional memory network for video object segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 1286–1295.
[29] J. Fan and Z. Zhang, “Memory-based cross-image contexts for weakly supervised semantic segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 5, pp. 6006–6020, 2022.
[30] T. Zhou, M. Zhang, F. Zhao, and J. Li, “Regional semantic contrast and aggregation for weakly supervised semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 4299–4309.
[31] L. Ru, B. Du, Y. Zhan, and C. Wu, “Weakly-supervised semantic segmentation with visual words learning and hybrid pooling,” International Journal of Computer Vision, vol. 130, no. 4, pp. 1127–1144, 2022.
[32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[33] Y.-H. Wu, Y. Liu, X. Zhan, and M.-M. Cheng, “P2t: Pyramid pooling transformer for scene understanding,” IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 01, pp. 1–12, 2022.
[34] Q. Shi, M. Liu, S. Li, X. Liu, F. Wang, and L. Zhang, “A deeply supervised attention metric-based network and an open aerial image dataset for remote sensing change detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–16, 2022.
[35] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[36] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun, “Unified perceptual parsing for scene understanding,” in European Conference on Computer Vision, 2018, pp. 418–434.
[37] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[38] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[39] Z. Zhao, L. Ru, and C. Wu, “Exploring effective priors and efficient models for weakly-supervised change detection,” arXiv preprint arXiv:2307.10853, 2023.
[40] S. Fang, K. Li, J. Shao, and Z. Li, “Snunet-cd: A densely connected siamese network for change detection of vhr images,” IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1–5, 2022.
[41] C. Zhang, P. Yue, D. Tapete, L. Jiang, B. Shangguan, L. Huang, and G. Liu, “A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 166, pp. 183–200, 2020.
[42] H. Chen, Z. Qi, and Z. Shi, “Remote sensing image change detection with transformers,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–14, 2022.
[43] N. Casagli, E. Intrieri, V. Tofani, G. Gigli, and F. Raspini, “Landslide detection, monitoring and prediction with remote-sensing techniques,” Nature Reviews Earth & Environment, vol. 4, no. 1, pp. 51–64, 2023.
[44] O. Ghorbanzadeh, Y. Xu, P. Ghamisi, M. Kopp, and D. Kreil, “Landslide4sense: Reference benchmark data and deep learning models for landslide detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–17, 2022.
[45] X. Zhang, W. Yu, M.-O. Pun, and W. Shi, “Cross-domain landslide mapping from large-scale remote sensing images using prototype-guided domain-aware progressive representation learning,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 197, pp. 1–17, 2023.
[46] D.-P. Fan, G.-P. Ji, M.-M. Cheng, and L. Shao, “Concealed object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 10, pp. 6024–6042, 2021.
[47] G. Li, Z. Bai, Z. Liu, X. Zhang, and H. Ling, “Salient object detection in optical remote sensing images driven by transformer,” IEEE Transactions on Image Processing, 2023.
[48] W. Bo, J. Liu, X. Fan, T. Tjahjadi, Q. Ye, and L. Fu, “Basnet: Burned area segmentation network for real-time detection of damage maps in remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–13, 2022.
[49] M. Liu, Z. Chai, H. Deng, and R. Liu, “A cnn-transformer network with multiscale context aggregation for fine-grained cropland change detection,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 15, pp. 4297–4306, 2022.
[50] Z. Li, C. Tang, X. Li, W. Xie, K. Sun, and X. Zhu, “Towards accurate and reliable change detection of remote sensing images via knowledge review and online uncertainty estimation,” arXiv preprint arXiv:2305.19513, 2023.
