Abstract
In remote sensing image fusion, conventional Convolutional Neural Networks (CNNs) extract local features of the image through layered convolution; limited by the receptive field, they struggle to capture global features. The Transformer uses self-attention to capture long-range dependencies in images and therefore has a global receptive field, but its computational cost on high-resolution images is excessively high. In response to these issues, this paper draws inspiration from the FusionNet network, harnessing the local detail extraction capability of CNNs and the global modeling capacity of the Transformer. It presents a novel method for remote sensing image sharpening named Guided Filtering-Cross Stage Partial Network-Transformer, abbreviated as GF-CSTNet. This solution unifies the strengths of Guided Filtering (GF), the Cross Stage Partial Network (CSPNet), and the Transformer. Firstly, the method uses GF to enhance the acquired remote sensing image data. The CSPNet and Transformer structures are then combined to further improve fusion performance by leveraging their respective advantages. Subsequently, a Rep-Conv2Former method is designed to streamline attention and extract features from diverse receptive fields through a multi-scale convolution modulation block. Simultaneously, a reparameterization module is constructed to integrate the multiple branches generated during training into a unified branch during inference, thereby optimizing the model's inference speed. Finally, a residual learning module incorporating attention is devised to strengthen the network's modeling and feature extraction capabilities. Experimental results obtained from the GaoFen-2 and WorldView-3 datasets demonstrate the effectiveness of the proposed GF-CSTNet approach. It effectively extracts detailed information from images while avoiding the problem of spectral distortion.
Introduction
Pansharpening is a crucial technique in remote sensing image processing. Its objective is to merge low-resolution multispectral (MS) images with high-resolution panchromatic (PAN) images to generate multispectral images with higher spatial resolution. The fused high spatial resolution multispectral (HRMS) images contain more information than the individual source images, thereby effectively compensating for the limited information in a single image. In applications such as object identification and image classification, pansharpening can provide high-quality images. Therefore, obtaining higher quality fused images has become a prominent subject explored by numerous scholars.
Currently, there are two main types of representative sharpening approaches: deep learning algorithms and traditional algorithms. Traditional methods can be categorized into Component Substitution (CS), Multi-Resolution Analysis (MRA), and model-based approaches. The Gram-Schmidt transformation (GS)1, Principal Component Analysis (PCA)2, and Band-dependent Spatial Detail (BDSD)3, among others, are examples of common CS methods. These methods aim to substitute the spatial components of multispectral images with panchromatic images while retaining as much of the original spectral information as possible. While the aforementioned methods are easy to apply, the direct replacement may introduce a certain level of spectral distortion by destroying information in the image. The fundamental principle of Multi-Resolution Analysis (MRA) approaches is to subject the panchromatic (PAN) images to multi-scale transformations first and subsequently fuse them with the multispectral (MS) images. Commonly employed methods include the Discrete Wavelet Transform (DWT)4, Non-Subsampled Contourlet Transform (NSCT)5, Laplacian Pyramid (LP)6, and various others. While these methods partially mitigate spectral distortion, they are prone to artifacts, and the resulting fusion may not reach an optimal outcome. Model-based methods treat pansharpening as an image reconstruction process. They formulate and solve objective functions to reconstruct the fused image by utilizing features from previous iterations. Common methods encompass fusion techniques based on Variational Optimization (VO)7, Compressed Sensing8, and Bayesian fusion9, among others. Such methods exhibit fewer losses in spectral features and spatial details compared to CS and MRA methods. However, they require significant prior knowledge and involve complex algorithms, which can prevent them from achieving the desired fusion results.
Because of their superior feature representation capabilities, deep learning techniques have become increasingly popular in domains such as remote sensing image fusion, driven by the rapid advancements in artificial intelligence in recent years. Inspired by super-resolution neural networks, Masi et al.10 introduced an initial CNN-based sharpening algorithm called PNN. This approach sends both the PAN images and the upsampled MS images to the network for processing; valuable information is then extracted and combined from the images using a CNN. While this approach significantly improved performance, its simple 3-layer convolutional structure limited the model's ability to capture nonlinearity, resulting in some spectral distortion. Subsequently, numerous advanced algorithms have been proposed. For instance, Yang et al.11 introduced PanNet, a sharpening approach based on deep residual networks. The use of residual learning moves the network's training into the high-pass domain, which helps the network effectively learn high-frequency inputs. MSDCNN is a multi-scale, multi-depth CNN sharpening algorithm proposed by Yuan et al.12. It extracts features from receptive fields of diverse scales by employing convolutional kernels of varying sizes; combining these features enhances the accuracy of feature map extraction. A CNN was combined with conventional techniques to create the FusionNet network structure, as proposed by Deng et al.13. This method yields better results at both reduced resolution and full resolution by directly using a single PAN image and each MS band to obtain image feature information. Wang et al.14 proposed a detail injection-based two-branch network, which uses the detail injection technique of the CS and MRA algorithms to directly learn the detail information of up-sampled MS images, thereby increasing the spatial-spectral quality of the fused image. Fang et al.15 proposed SDRCNN, a lightweight single-branch, single-scale CNN architecture. This architecture is innovative because it incorporates a unique convolutional block and a dense residual connection structure. This design preserves the spatial and spectral information of the image while striking a balance between accuracy and efficiency. A multi-scale, multi-stream fusion network known as MMFN was proposed by Jian et al.16. The network first extracts a variety of features from the MS and PAN images through a multi-scale strategy, and then fuses the extracted information with multi-stream fusion blocks to retain the optimal spatial and spectral characteristics. Que et al.17 effectively integrated spatial-spectral information using a bilateral pyramid structure, achieving end-to-end fusion of multiple branches and inputs and significantly improving the sharpening effect. Lu et al.18 proposed a novel self-guided spatial-channel adaptive convolution technique called SSCAConv. This method adeptly considers a feature's spatial and channel differences while maintaining an outstanding sharpening outcome. Furthermore, it can be extended to the super-resolution of hyperspectral remote sensing images, demonstrating its robust adaptability. Despite the significant advancements in CNN-based fusion techniques within the realm of remote sensing image fusion, challenges persist, such as limited receptive fields, insufficient capture of contextual information, and inadequate feature extraction.
This paper introduces the Transformer structure to address these issues, which can be viewed as a complement to the CNN model. This structure mitigates model biases and strengthens the ability to model long-range dependencies19,20.
GF-CSTNet, a novel pansharpening approach that combines the Transformer and CSPNet networks, presents the following main contributions:
-
The input data is preprocessed using the guided filtering method to preserve detailed information in the image.
-
The Transformer architecture is combined with the CSPNet network, significantly enhancing the model's learning capacity. This integration leverages the multi-branch features of CSPNet and the global modeling advantages of the Transformer, leading to improved overall performance of the fused image.
-
The Rep-Conv2Former module is designed to simplify attention. Within this module, a reparameterization structure is constructed that not only reduces complexity but also extracts more feature information from different receptive fields.
-
The differencing scheme is enhanced by introducing a residual learning module with attention to extract details from the image, producing a fused image with clear boundaries.
Related works
FusionNet
FusionNet, a deep convolutional neural network that incorporates detail injection, was proposed by Deng et al.13. It was inspired by two classic sharpening methods, CS and MRA. The network adopts a differencing approach to gather meaningful information from the PAN and upsampled MS images. This information is then fed into multiple residual blocks to further extract the image's feature information. Finally, a fused image is produced by combining the result with the upsampled MS images. The implementation can be written as
\[
\widehat{MS} = \widetilde{MS} + f_{\Theta_{FS}}\bigl(P^D - \widetilde{MS}\bigr), \qquad (1)
\]
where \({\widehat{MS}}\) and \({\widetilde{MS}}\) are obtained by stacking the bands \({\widehat{MS}}_i, i = 1, 2,..., B\) and \({\widetilde{MS}}_i, i = 1, 2,..., B\), \({\widehat{MS}}_i\) represents the high spatial resolution MS image of the ith band, \({\widetilde{MS}}_i\) denotes the ith band of the low spatial resolution MS image after upsampling, \(P^D\) is the PAN image replicated across the spectral dimension, and \(f_{\Theta _{FS}}\) is the non-linear mapping function of FusionNet with parameters \(\Theta _{FS}\).
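A minimal PyTorch sketch of this detail-injection scheme (Eq. (1)) is given below. The channel width, number of residual blocks, and band count are illustrative assumptions rather than the exact FusionNet configuration.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Plain residual block used to process the detail map."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

class DetailInjectionNet(nn.Module):
    """FusionNet-style fusion: f(P^D - MS~) is added back to the upsampled MS."""
    def __init__(self, bands=4, ch=32, n_blocks=4):
        super().__init__()
        self.head = nn.Conv2d(bands, ch, 3, padding=1)
        self.blocks = nn.Sequential(*[ResBlock(ch) for _ in range(n_blocks)])
        self.tail = nn.Conv2d(ch, bands, 3, padding=1)
    def forward(self, pan, ms_up):
        pan_d = pan.repeat(1, ms_up.shape[1], 1, 1)   # replicate PAN over the spectral bands (P^D)
        detail = pan_d - ms_up                        # differencing: P^D - MS~
        detail = self.tail(self.blocks(self.head(detail)))
        return ms_up + detail                         # Eq. (1): MS^ = MS~ + f(P^D - MS~)

# usage: hrms = DetailInjectionNet()(pan, ms_up) with pan of shape (B, 1, H, W) and ms_up of shape (B, 4, H, W)
```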
The paper introduces a residual learning module that incorporates attention, based on the FusionNet model. It aims to extract edge information from images by employing a learn-then-subtract methodology, with detailed elaboration provided in section “Residual Learning”.
CSPNet
Wang et al.21 presented a CNN model called CSPNet with the aim of enhancing the learning capacity and efficiency of the model. The basic idea of this network is to divide the feature maps of the base layer into two parts. One part undergoes feature extraction through n ResNet blocks, while the other part is combined with the results extracted from the first part through cross-stage layers. Subsequently, tensor transformations and skip connections are used to generate significant differences in correlation within the backpropagated gradient flow. This design enables a significant reduction in the workload of processing feature maps while preserving some of the advanced characteristics of the feature maps, thereby improving the precision of the model.
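The cross-stage split-and-merge idea can be sketched in PyTorch as follows; the even channel split, the use of plain residual blocks for the second part, and the 1 × 1 transition and fusion layers are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

class CSPStage(nn.Module):
    """Cross-stage partial stage: split the base-layer features, process one part, merge across the stage."""
    def __init__(self, ch, n_blocks=2):
        super().__init__()
        half = ch // 2                                        # assumes an even channel count
        self.blocks = nn.Sequential(*[ResBlock(half) for _ in range(n_blocks)])  # Part 2 path
        self.transition = nn.Conv2d(half, half, 1)            # transition layer after the blocks
        self.fuse = nn.Conv2d(ch, ch, 1)                      # merges the two parts
    def forward(self, x):
        part1, part2 = torch.chunk(x, 2, dim=1)               # split into two parts along channels
        part2 = self.transition(self.blocks(part2))           # deep feature extraction on Part 2 only
        return self.fuse(torch.cat([part1, part2], dim=1))    # cross-stage concatenation and merge
```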
To strengthen the learning ability of the algorithm and enhance the accuracy of the fused images, this study integrates the CSPNet structure into the proposed remote sensing image fusion technique.
Attention mechanism
Presently, prevalent sequence transduction models mainly employ intricate RNN and CNN structures comprising an encoder and a decoder. However, in constructing the model, it is crucial to consider not only the current output but also the output from the previous state, so the network must perform its calculations step by step. Additionally, information fed into the network early may become incomplete or even be discarded as propagation continues; even when it is retained, it leads to memory redundancy. In response to this, Vaswani et al.22 introduced the Transformer architecture. This structure exclusively uses multi-head self-attention (MHSA) to compute implicit representations between the input and output of the model, significantly improving parallel processing capability and learning efficiency. However, this mechanism demands extensive computation when processing high-resolution images. To address this, Hou et al.23 proposed the Conv2Former method. This algorithm simplifies self-attention by taking the Hadamard product between the output of large-kernel convolutions and the values, reducing the computational workload while enhancing efficiency.
This paper utilizes the Transformer structure to develop an algorithm for remote sensing image fusion. Meanwhile, drawing inspiration from the Conv2Former method and the RepVGG24 method, the Rep-Conv2Former attention block and the Rep-DConv reparameterization block are designed. These blocks aim to simplify multi-head self-attention and improve the efficiency of the model's inference. Additionally, incorporating these structures enhances the model's expressiveness and effectively overcomes the limitations of traditional convolutional neural networks in terms of receptive field size.
Proposed method
Figure 1 depicts the overall design of the fusion network, and each module will be explained in more detail in the following sections.
Guided filtering
Guided filtering25 is an image filtering technique based on a local linear model. The fundamental idea is that, within each local window, the filtered output is a linear transform of a guidance image, so that an input image is filtered under the guidance of the guide image. In this way, the texture information of the guide image can be transferred while preserving the attributes of the input image. By employing this technique, it becomes possible to effectively remove noise from remote sensing image data and enhance the detailed information in the image. The detailed design procedure is outlined below: the upsampled MS image is taken as the filtered input image L, the PAN image is taken as the guide image P, and the filtered output image is denoted H. The principle of guided filtering can be expressed as
\[
H_i = a_k P_i + b_k, \quad \forall i \in \omega_k, \qquad (2)
\]
where i and k represent pixel indices, \(\omega _k\) is the kth local window, \(H_i\) is the value of the ith pixel in the filtered output image H, \(P_i\) is the value of the ith pixel in the guide image P, and \(a_k\) and \(b_k\) are the linear coefficients of the filtered output image within window \(\omega _k\). These coefficients are obtained by minimizing the difference between the filtered input image L and the output image H within the window:
\[
E(a_k, b_k) = \sum_{i \in \omega_k} \Bigl[ \bigl(a_k P_i + b_k - L_i\bigr)^2 + \varepsilon a_k^2 \Bigr]. \qquad (3)
\]
In the filtered input image L, \(L_i\) represents the value of the ith pixel, and \(\varepsilon\) is a regularization parameter that prevents excessively large \(a_k\) coefficients from degrading the filtering effect. In Eq. (3), the optimal values of the linear coefficients \(a_k\) and \(b_k\) are estimated by minimizing the cost \(E(a_k, b_k)\); they can be obtained through the least squares approach as
\[
a_k = \frac{\frac{1}{|\omega|}\sum_{i \in \omega_k} P_i L_i - \mu_k \overline{L}_k}{\sigma_k^2 + \varepsilon}, \qquad b_k = \overline{L}_k - a_k \mu_k, \qquad (4)
\]
where \(\left| \omega \right|\) denotes the number of pixels in the window \(\omega _k\), and \(\mu _{k}\) and \(\sigma _k^2\) denote the mean and variance of the guide image within the window, respectively. \({\overline{L}}_k\) represents the average pixel value of the filtered input image L within the window. However, during the calculation of the linear coefficients for each window, a pixel i may be included in multiple windows \(\omega _k\); in other words, each pixel is described by multiple distinct linear functions. This implies that the filtered output \(H_i\) in Eq. (2) varies with the choice of \(\omega _k\). For this reason, the coefficients of all windows covering pixel i are averaged, and the final output of the filter can be expressed as
\[
H_i = \overline{a}_i P_i + \overline{b}_i, \qquad (5)
\]
where \(\overline{a_i}=\frac{1}{\left| \omega \right| }\sum _{k\in \omega _i}a_k, \overline{b_i}=\frac{1}{\left| \omega \right| }\sum _{k\in \omega _i}b_k\) represent the mean coefficients, which are the average values of the coefficients across all windows that include pixel i.
From the above process, it can be observed that the GF algorithm exhibits a strong local correlation. By utilizing window averaging and the linear invariance of the model, specific details from the guide image P can be incorporated into the input image L while preserving the overall characteristics of L, in order to obtain an output image H with enhanced visual quality. In this manner, the difference between the output H and the input L is minimized to the greatest extent, leading to a higher-quality image for subsequent fusion.
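A minimal single-band implementation of Eqs. (2)-(5) using a mean (box) filter is sketched below; for multispectral data the filter would be applied band by band, and the window radius and ε are illustrative values.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def guided_filter(P, L, radius=8, eps=1e-3):
    """Guided filtering of the input L under the guide P (both 2-D float arrays)."""
    win = 2 * radius + 1
    mean = lambda x: uniform_filter(x, size=win)    # window mean, i.e. (1/|w|) * sum over w_k
    mu_P, mu_L = mean(P), mean(L)
    var_P = mean(P * P) - mu_P * mu_P               # sigma_k^2 of the guide within the window
    cov_PL = mean(P * L) - mu_P * mu_L              # (1/|w|) sum P_i L_i - mu_k * L_bar_k
    a = cov_PL / (var_P + eps)                      # Eq. (4), coefficient a_k
    b = mu_L - a * mu_P                             # Eq. (4), coefficient b_k
    return mean(a) * P + mean(b)                    # Eq. (5): H_i = a_bar_i * P_i + b_bar_i

# usage: H = guided_filter(pan_band.astype(float), ms_band_upsampled.astype(float))
```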
Backbone network enhancement
Feature maps at different levels can be concatenated using CSPNet, which optimizes network performance when integrated with various convolutional neural network architectures. Therefore, in this section, the backbone network is constructed by combining CSPNet and ResNet. Additionally, this network includes an Attention ResBlock based on the convolutional Transformer. This design leverages the high-performance characteristics of multi-branch networks and the global modeling advantages of Transformers, further augmenting the fusion capability of the model.
Figure 2 shows the backbone network's general architecture, with a more detailed depiction of the Attention ResBlock in Fig. 3. The main idea is as follows. First, the feature map \(P^{D^{\prime }} - \widetilde{MS^{\prime }}\), extracted by the residual learning module, is set as \((P^{D^{\prime }} - \widetilde{MS^{\prime }})_0\). It is then divided into two branches, Part 1: \((P^{D^{\prime }} - \widetilde{MS^{\prime }})_0^{\prime }\) and Part 2: \((P^{D^{\prime }}-\widetilde{MS^{\prime }})_0^{\prime \prime }\). Part 2 performs deep-level feature extraction using n Attention ResBlock blocks and then undergoes gradient boosting through the Transition layer. Finally, it is combined with Part 1 to achieve cross-stage splitting and merging. Employing this cross-stage structure not only enhances fusion performance but also enables the model to demonstrate greater adaptability and flexibility in complex network environments.
Based on this, the output of ResNet integrated into CSPNet is represented as follows:
where \(*\) represents convolution, \(\omega\) represents the weight, and \((P^{D^{\prime }}-\widetilde{MS^{\prime }})_K\) denotes the output of the Kth Attention ResBlock, which is then converted into \((P^{D^{\prime }}-\widetilde{MS^{\prime }})_T\) and \((P^{D^{\prime }}-\widetilde{MS^{\prime }})_Z\) . The equation for the inverse update weight can be represented as follows:
where \(g_K\) represents the gradient of the Kth Attention ResBlock, and f denotes the weight update function.
In conclusion, CSPNet is a simple and versatile network; combining ResNet with the convolution-based Transformer and designing the Attention ResBlock blocks yields the backbone network. This approach effectively avoids duplicated gradient information, resulting in improved overall performance of the model. Consequently, it successfully attains the goal of improving the spectral and spatial resolution of the fused image.
Attention simplification
As demonstrated in the previous text, incorporating a Transformer block into the network can improve the model’s learning capacity. However, the application of multi-head self-attention increases computational complexity. In this section, a Rep-Conv2Former module is designed to address this issue. The approach draws inspiration from the Conv2Former method and strives to optimize attention. In contrast to the reference method, this module extracts features from different receptive fields using multi-scale convolutions (Conv 3 \(\times\) 3, Conv 11 \(\times\) 11) to integrate and leverage global information at various scales. Simultaneously, the Rep-Conv2Former module integrates a Rep-DConv reparameterization block, drawing inspiration from the multi-branch merging concept of RepVGG and its associated transformation method. Employing the concept of structural reparameterization, the multi-way (Conv 3 \(\times\) 3, Conv 11 \(\times\) 11) structure of the training network is transformed into a single-way (Conv 11 \(\times\) 11) structure for the inference network. This transformation significantly enhances the network’s inference rate.
The structure is illustrated in Fig. 4, and detailed explanations will be provided below.
Rep-Conv2Former
This subsection outlines the convolutional modulation block used in Rep-Conv2Former. First, the self-attention block is compared with the convolutional modulation block. In Fig. 5, the Part 2 branch feature map \((P^{D^{\prime }}-\widetilde{MS^{\prime }})_0^{\prime \prime }\) of size H \(\times\) W \(\times\) C, extracted by differencing, is taken as the input. Self-attention first applies linear layers to obtain the query matrix Q, key matrix K, and value matrix V, where \({\textbf{K}},{\textbf{Q}},{\textbf{V}}\in {\mathbb {R}}^{H\times W\times C}\), C is the number of channels, and H and W are the input's spatial dimensions. The output is the attention matrix A. The use of multi-head attention allows information to be learned from different spatial locations, as demonstrated below.
It is further rewritten as:
In the above equation, the scale factor is omitted for simplicity. Although this encoding of spatial information is highly effective, its computational complexity grows with the size of the input feature map, which leads to higher computational requirements.
In Fig. 6, the convolution modulation block’s input \((P^{D^{\prime }}-\widetilde{MS^{\prime }})_0^{\prime \prime }\in {\mathbb {R}}^{H\times W\times C}\) is the Part 2 branch feature map extracted by the residual learning module. Convolution modulation, on the other hand, utilizes multi-scale convolution \({\text {Re}}\text {p-DConv}_{k\times k}\) of size k \(\times\) k and the Hadamard product to determine the value of the output feature map Z, as depicted below.
where \({\text {Re}}\text {p-DConv}_{k\times k}\) can be expressed in detail as follows:
In the equation, \(\odot\) represents the Hadamard product, \(W_1\) and \(W_2\) denote the parameters of the two linear layers, and k represents the reparameterized convolutional kernel size. Here, \(k_1 \times k_1\) and \(k_2 \times k_2\) correspond to Conv 3 \(\times\) 3 and Conv 11 \(\times\) 11, respectively. By modulating the values V with this multi-scale convolution, not only can features with receptive fields of different sizes be generated, but each spatial position (h, w) can also be related to the pixels in the k \(\times\) k square region centered on (h, w). Furthermore, channel-wise information exchange can be performed using linear layers.
In summary, self-attention employs matrix multiplication between K and Q to generate the attention matrix. Convolutional modulation utilizes multi-scale convolutions of size k \(\times\) k to generate weights for the input’s feature map. The Hadamard product is then used to establish dependency relationships among features. By doing so, the model not only enhances the weights of high-frequency spatial features and spectral distribution features, but also extracts more profound feature information, thereby further improving its effectiveness.
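A hedged sketch of such a convolutional modulation block, with the two depthwise kernel sizes used here, is shown below; the 1 × 1 projection layers and the absence of normalization are simplifying assumptions, not the exact Rep-Conv2Former design.

```python
import torch
import torch.nn as nn

class ConvModulation(nn.Module):
    """Convolutional modulation: multi-scale depthwise conv output modulates V via the Hadamard product."""
    def __init__(self, ch):
        super().__init__()
        self.w1 = nn.Conv2d(ch, ch, 1)                           # linear layer feeding the modulation branch
        self.w2 = nn.Conv2d(ch, ch, 1)                           # linear layer producing V
        self.dw3 = nn.Conv2d(ch, ch, 3, padding=1, groups=ch)    # 3x3 depthwise branch
        self.dw11 = nn.Conv2d(ch, ch, 11, padding=5, groups=ch)  # 11x11 depthwise branch
        self.out = nn.Conv2d(ch, ch, 1)                          # channel-wise information exchange
    def forward(self, x):
        y = self.w1(x)
        a = self.dw3(y) + self.dw11(y)                           # multi-scale weights (Rep-DConv, training form)
        v = self.w2(x)
        return self.out(a * v)                                   # Hadamard product replaces softmax(QK^T)V
```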
Rep-DConv
RepVGG is a simple and efficient CNN architecture that mainly consists of an identity branch, a 1 \(\times\) 1 convolution, and a 3 \(\times\) 3 convolution. This method employs different structures during training and inference. Specifically, it uses a 3 \(\times\) 3 convolution during training together with a parallel 1 \(\times\) 1 convolution branch and an identity-mapping branch. At inference time, re-parameterization is employed to convert the model into a single 3 \(\times\) 3 convolutional branch. In contrast, this section only borrows the concept of multi-branch merging and the corresponding transformation method, and a Rep-DConv block is designed to improve computational speed.
The training and inference structures are shown in Fig. 7. After convolutional modulation, the model prioritizes accuracy during the training phase. At this stage, the main components are a 3 \(\times\) 3 convolution and an 11 \(\times\) 11 convolution. Running the two branches in parallel introduces multiple gradient paths to the network and enhances the extraction and fusion of information at different scales in the feature map, which further improves the model's representational capacity. During the inference phase, the model prioritizes speed: the two training branches are merged into a single 11 \(\times\) 11 convolutional branch through reparameterization. This enables the network to infer efficiently while retaining the weights learned by the multi-branch training structure.
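Because the two depthwise branches are linear and act on the same input with size-preserving padding (1 for the 3 × 3 kernel, 5 for the 11 × 11 kernel), they can be merged at inference time by zero-padding the small kernel and summing the weights. A minimal sketch under these assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def merge_rep_dconv(dw3: nn.Conv2d, dw11: nn.Conv2d) -> nn.Conv2d:
    """Fold a parallel 3x3 + 11x11 depthwise pair into a single 11x11 depthwise conv."""
    merged = nn.Conv2d(dw11.in_channels, dw11.out_channels, 11,
                       padding=5, groups=dw11.groups, bias=True)
    pad = (11 - 3) // 2                                    # zero-pad the small kernel to 11x11
    merged.weight.copy_(dw11.weight + F.pad(dw3.weight, [pad] * 4))
    b3 = dw3.bias if dw3.bias is not None else 0.0
    b11 = dw11.bias if dw11.bias is not None else 0.0
    merged.bias.copy_(torch.as_tensor(b11 + b3))           # biases add because the branch outputs are summed
    return merged
```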
In essence, this module employs structural reparameterization to convert the trained multi-branch structure into a single-branch architecture during inference. By merging the benefits of multi-branch high performance with single-branch fast speed, the model retains robust feature extraction capabilities while achieving rapid recognition speed.
Residual learning
The FusionNet model extracts image characteristics through \(P^D-{\widetilde{MS}}\), as can be observed from Eq. (1). However, there are instances where employing a direct subtraction method might not effectively emphasize the boundary information in the image. Therefore, in this section, a residual learning module incorporating attention is designed to address this issue.
The dashed area in Fig. 3 shows the specific architecture. The implementation process can be described as follows: the filtered images are passed through residual blocks with attention. Every residual block is composed of two branches: the residual branch, which learns the difference between the input and the output, and the identity branch, which directly transfers the input to the output through an identity mapping. By combining these two branches, the residual learning results \(P^{D^{\prime }}\) and \(\widetilde{MS^{\prime }}\) are obtained. These results are then subtracted to sharpen the image details:
\[
P^{D^{\prime }} - \widetilde{MS^{\prime }} = f_1(P^D) - f_2({\widetilde{MS}}),
\]
where \(P^{D^{\prime }}=f_1(P^D)\), \(\widetilde{MS^{\prime }}=f_2({\widetilde{MS}})\), \(f_1\) and \(f_2\) represent the mapping layers of the two residuals.
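A minimal sketch of this learn-then-subtract scheme is given below, assuming f1 and f2 are small residual mappings with a simple channel-attention gate; the exact attention design follows Fig. 3 and is not reproduced here.

```python
import torch
import torch.nn as nn

class ResidualMapping(nn.Module):
    """Identity branch plus a learned residual branch with an illustrative channel-attention gate."""
    def __init__(self, ch):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
        self.att = nn.Sequential(                       # channel attention (assumed form)
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1), nn.Sigmoid())
    def forward(self, x):
        r = self.residual(x)
        return x + r * self.att(r)                      # identity branch + attended residual branch

def detail_map(f1, f2, pan_d, ms_up):
    """P^D' - MS~' = f1(P^D) - f2(MS~): learn first, then subtract."""
    return f1(pan_d) - f2(ms_up)
```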
Overall, the attention method is introduced in the residual learning module, where the process involves performing residual learning first and then subtracting the learned results. By leveraging the dual advantages of convolutional and Transformer models, the network becomes better at preserving the original input information while also learning complex features. This enhances the network’s ability to extract and represent image features, addressing the issue of vanishing gradients.
Improvement of the loss function
During the training of the model, the loss function, which compares predicted values with actual values to evaluate the model’s performance, is calculated for each sample. Initially, the model utilizes forward propagation to generate a prediction. The difference between the anticipated and real outcomes is then calculated using the loss function. After obtaining the deviation, the model parameters are adjusted through backpropagation. This process aims to align the model’s predicted outcomes with the actual circumstances, thereby improving the validity of the predictions.
The FusionNet model uses the L2 loss function for network optimization. However, during practical training, overfitting has been observed: the model's training error is significantly lower than its test error. While increasing the number of training samples may help reduce overfitting, obtaining additional samples is costly. Therefore, weight decay is used here to alleviate the overfitting problem. Weight decay penalizes parameters with large absolute values by adding a penalty term, which constrains the model being learned. In this way, the model's generalization ability is further improved while overfitting is reduced. The formula is as follows:
where n is the number of training samples, \(GT_{\{k\}}\) denotes the kth ground-truth image, \(\left\| \cdot \right\| _F\) is the Frobenius norm, and \(\lambda\) is a hyperparameter that controls the strength of the weight decay.
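A sketch of the resulting objective, i.e. the mean Frobenius-norm reconstruction error plus an explicit weight-decay penalty, is shown below; the function and argument names are assumptions, and in practice the same effect is often obtained through the optimizer's weight_decay argument.

```python
import torch

def training_loss(pred, gt, model, lam=1e-7):
    """Mean Frobenius-norm reconstruction error plus a weight-decay penalty on the parameters."""
    recon = torch.mean(torch.sum((pred - gt) ** 2, dim=(1, 2, 3)))   # (1/n) sum_k ||pred_k - GT_k||_F^2
    penalty = sum(p.pow(2).sum() for p in model.parameters())        # sum of squared weights
    return recon + lam * penalty
```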
Fusion algorithm
Based on the aforementioned research content, the specific fusion algorithm can be expressed as follows:
-
The LMS image is obtained by first duplicating the PAN image channel and then resizing the MS image to match the dimensions of the PAN image.
-
The guided filtering method is applied with the PAN image as the guide image and the LMS image as the input image for preprocessing.
-
The filtered images are processed by the residual learning module and then differenced to extract detail information.
-
The obtained information is fed into a two-branch CSPNet that consists of Attention ResBlock blocks based on the convolutional form of the Transformer. In this structure, Part 1 consists of only one convolutional layer, while Part 2 leverages the global modeling capability of the Transformer to extract more detailed information from the image, passing through n Attention ResBlock blocks to split the different gradient flows.
-
Then, Part 1 and Part 2 are concatenated, and the result is fed into a new convolutional layer to fuse the feature information.
-
The fusion result is then output by connecting it to the input feature map through a skip connection.
-
Finally, the output undergoes non-linear mapping and is merged with the LMS image through a skip connection to generate the final fused image. A high-level sketch tying these steps together is given below.
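The following sketch strings the steps above together; the module arguments (guided filter, residual mappings, backbone, head and tail convolutions) refer to the hypothetical components sketched in earlier sections and are assumptions about interfaces rather than the released code.

```python
import torch
import torch.nn.functional as F

def gf_cstnet_forward(pan, ms, gf, f1, f2, head, backbone, tail):
    """High-level sketch of the GF-CSTNet forward pass following steps (1)-(7)."""
    bands = ms.shape[1]
    lms = F.interpolate(ms, size=pan.shape[-2:], mode='bicubic')   # (1) upsample MS to PAN size (LMS)
    pan_d = pan.repeat(1, bands, 1, 1)                             # (1) replicate PAN across the bands (P^D)
    lms_f = gf(pan_d, lms)                                         # (2) guided-filter preprocessing, PAN as guide
    detail = f1(pan_d) - f2(lms_f)                                 # (3) residual learning, then differencing
    x0 = head(detail)                                              # (4) features entering the two-branch CSP backbone
    feats = backbone(x0)                                           # (4)-(5) Attention ResBlocks, cross-stage merge, fusion conv
    feats = feats + x0                                             # (6) skip connection to the input feature map
    return lms + tail(feats)                                       # (7) non-linear mapping and skip connection with LMS
```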
Experimental validation and analysis
Experimental preparation and dataset
Comparison methods and evaluation metric
This article selected several representative methods for comparative experiments, including EXP26 (multispectral images upsampled by 23-tap polynomial interpolation), BT-H27, BDSD-PC28, SR_D29, TV30, PNN10, BDPN31, MSDCNN12, DRPNN32, DiCNN133, and FusionNet13. The comparison experiments were conducted on the same equipment and in the same environment. The deep learning methods were implemented in the PyTorch framework and trained on a GeForce RTX 4060, and the tests used MATLAB (2019a). The proposed method was trained with the Adam optimizer, with an initial learning rate of 3e-4, weight decay of 1e-7, and a batch size of 32. Stable performance was achieved after 400 iterations.
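This configuration corresponds to an optimizer setup along the following lines (the model is a placeholder):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(4, 4, 3, padding=1)   # placeholder standing in for the full GF-CSTNet model
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=1e-7)  # stated training settings
# the batch size of 32 would be set on the training DataLoader, e.g.
# loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
```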
Both subjective and objective analyses of the fusion results are performed in the paper. Subjectively, fusion images are analyzed through direct observation of fusion results and local amplification. Objectively, eight commonly used evaluation metrics are employed to systematically analyze the fused images. These eight indicators are SAM34, ERGAS34, Q34, RMSE34, SID35, Q4/Q836,37, RASE34, and SCC34.
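For reference, two of the listed metrics can be computed as in the sketch below, with SAM as the mean spectral angle in degrees and ERGAS using the usual resolution ratio of 4; the implementations behind the reported numbers may differ in detail.

```python
import numpy as np

def sam(fused, ref, eps=1e-12):
    """Mean spectral angle (degrees) between fused and reference images of shape (H, W, B)."""
    dot = np.sum(fused * ref, axis=-1)
    cos = dot / (np.linalg.norm(fused, axis=-1) * np.linalg.norm(ref, axis=-1) + eps)
    return np.degrees(np.mean(np.arccos(np.clip(cos, -1.0, 1.0))))

def ergas(fused, ref, ratio=4):
    """Relative dimensionless global error in synthesis for images of shape (H, W, B)."""
    rmse = np.sqrt(np.mean((fused - ref) ** 2, axis=(0, 1)))   # per-band RMSE
    mean = np.mean(ref, axis=(0, 1))                            # per-band reference mean
    return 100.0 / ratio * np.sqrt(np.mean((rmse / mean) ** 2))
```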
Dataset preprocessing
The WorldView-3 (WV3) and GaoFen-2 (GF2) datasets were used in the experiments to demonstrate the effectiveness of the strategy suggested in this study. The GF2 and WV3 datasets are preprocessed using the GF technique to improve the fusion accuracy of the suggested approach. (Dataset download address: PanCollection for Survey Paper, liangjiandeng.github.io).
Fusion results analysis
GF2 dataset
Figures 8 and 9 show the fusion results of the two sets of remote sensing images tested on the GF2 dataset using different algorithms. Analyzed from a subjective standpoint, the fusion results of conventional algorithms such as BT-H (Fig. d), BDSD-PC (Fig. e), SR_D (Fig. f), and TV (Fig. g) exhibit pseudo-shadow phenomena in certain areas when compared to the Ground Truth (GT). The fusion outcomes of the PNN algorithm (Fig. h) show a significant improvement in spatial and spectral properties compared to those of the traditional methods. However, they still do not match the sharpness and fine detail of the GT images in terms of visual quality. The result produced by the BDPN algorithm (Fig. i) exhibits a significant spectral distortion issue, which leads to a loss of detail and an overall darkening of the image's color. The MSDCNN algorithm (Fig. j) suffers from spectral distortion and performs poorly in terms of spatial quality, resulting in a relatively blurry fusion outcome. The DRPNN method (Fig. k), DiCNN1 algorithm (Fig. l), and FusionNet algorithm (Fig. m) all produce fusion results that successfully retain the image's spectral characteristics, and the spatial quality is also improved significantly. However, from the perspective of local amplification, there is some loss of information. CSTNet (Fig. n) and GF-CSTNet (Fig. o) are slightly superior to the other comparison methods in terms of both spatial and spectral detail. Moreover, compared to CSTNet (without guided filter processing), the suggested GF-CSTNet method contains more spectral information and more distinct texture details. It achieves a higher spatial resolution and is more consistent with the GT image.
Sharpening results of test Fig. 1 on GF2 dataset.
Tables 1 and 2 present the objective evaluation metrics obtained by each algorithm on the GF2 dataset. The optimal values are indicated in bold. As the tables clearly indicate, the conventional BT-H, BDSD-PC, SR_D, and TV algorithms exhibit relatively high values for the ERGAS, RMSE, SID, and RASE metrics, while showing the smallest values for SCC. This suggests that the fused images generated by these algorithms do not effectively capture the information included in the original images, and that they have lower spatial quality and weaker spectral characteristics. The PNN, DRPNN, DiCNN1, and FusionNet algorithms achieve favorable results across various metrics, indicating that their fusion results are significantly improved in both the spectral and spatial domains. While the metrics of the CSTNet method outperform those of the other comparison approaches, they still fall short of the performance achieved by the GF-CSTNet method, which suggests that the fusion outcome can still be improved. The proposed GF-CSTNet method outperforms all other algorithms in every evaluation indicator. Its fusion results exhibit a higher level of spatial detail and spectral fidelity, making the overall effect ideal.
Sharpening results of test Fig. 2 on GF2 dataset.
WV3 dataset
The fusion outcomes of the various algorithms on the WV3 dataset are illustrated in Figs. 10 and 11. From a subjective perspective, the BT-H (Fig. d) and TV (Fig. g) algorithms exhibit some spectral distortion, and their fusion images lack clarity. The BDSD-PC algorithm (Fig. e) exhibits excessive sharpening and is prone to significant spectral distortion. Although the SR_D algorithm (Fig. f) can effectively preserve spectral features, it also causes significant spatial distortion, leading to overall blurring. The BDPN algorithm (Fig. i) and MSDCNN algorithm (Fig. j) suffer from information loss in either the spectral or spatial domain. The MSDCNN algorithm, however, exhibits more severe overall distortion, leading to a fusion output characterized by evident granularity that differs significantly from the GT image. The fusion results of the PNN algorithm (Fig. h), DRPNN algorithm (Fig. k), DiCNN1 algorithm (Fig. l), FusionNet algorithm (Fig. m), and CSTNet algorithm (Fig. n) more effectively retain the spectral features of the images, but the boundary texture features are blurred and do not reach the clarity of the GT image. In terms of both spatial and spectral quality, the GF-CSTNet (Fig. o) approach presented in this research aligns most closely with the GT image, whether viewed as a whole or in locally magnified areas. Moreover, its edge information is clearer than that of the other comparison methods, achieving better fusion results.
Sharpening results of test Fig. 3 on the WV3 dataset.
The objective assessment metrics for the WV3 dataset are displayed in Tables 3 and 4, with bold numbers indicating the optimal values. As the tables clearly show, the high SAM values of the BT-H, SR_D, and TV algorithms indicate that their fused images have poor spectral preservation. For the BDSD-PC algorithm, the metrics ERGAS, RMSE, RASE, SID, and SAM have the maximum values, while Q and Q8 have the minimum values. This indicates that its fusion results exhibit significant spectral distortion, lack a satisfactory sharpening effect, and bear the least resemblance to the GT image in terms of structural similarity. For the PNN algorithm, the larger SCC value signifies better spatial performance of the sharpened results. For the BDPN and MSDCNN algorithms, the higher ERGAS values indicate greater overall distortion in the fused images. The DRPNN, DiCNN1, and FusionNet methods perform better than the previously described methods across various metrics, indicating that their sharpened images exhibit notable improvements in both spatial and spectral quality. The GF-CSTNet method achieves the best results on all eight indexes, followed by the CSTNet method in second place.
Sharpening results of test Fig. 4 on the WV3 dataset.
Residual experiment
Test Fig. 3 was used as an example in the residual experiment to demonstrate the effectiveness of the suggested method. The residual was determined by computing the average difference between the fused image and the reference image. If the resulting image contains less detailed information, the algorithm has achieved a higher fusion quality. From Fig. 12, it can be observed that traditional algorithms such as BT-H (Fig. a), BDSD-PC (Fig. b), SR_D (Fig. c), and TV (Fig. d) leave the most detailed information, indicating that their fusion results are suboptimal and that information is lost in both the spectral and spatial domains. The residual images of the PNN algorithm (Fig. e), BDPN algorithm (Fig. f), MSDCNN algorithm (Fig. g), DRPNN algorithm (Fig. h), DiCNN1 algorithm (Fig. i), and FusionNet algorithm (Fig. j) show better results than the conventional methods. However, distinct texture features remain in their residual images, indicating that the sharpening results still need improvement. The residual image of the unfiltered CSTNet method (Fig. k) still contains slight texture details, while the filtered GF-CSTNet method (Fig. l) shows the least detail in its residual image. This suggests that enhancing the image through guided filtering first and then performing fusion allows the image to capture more information, resulting in an optimal fusion outcome.
Test Fig. 3 residual chart.
Ablation experiment
The ablation experiments in this section were conducted using the GF2 dataset to validate the feasibility of the suggested approach. The specific details are as follows:
-
Ablation of regularization (A). On the basis of the FusionNet model (Base), an additional regularization term is introduced into the loss function to mitigate overfitting during model training.
-
Ablation of residual learning (B). Based on (1), the modified backbone network and the residual learning module are ablated together as a whole to confirm the effectiveness of residual learning.
-
Ablation of attention (C). Building upon (1), the attention module is integrated with the modified backbone network for ablation to confirm the effectiveness of the attention mechanism.
-
Ablation of multiple kernels (D). On the basis of (1), the multi-scale kernels in the attention module are ablated to demonstrate the effectiveness of multi-scale convolution.
Figure 13 displays the results of the ablation experiments. From a visual standpoint, the sharpening results of A, A + B, A + C, A + C + D, CSTNet, and GF-CSTNet are all superior to the Base, and the sharpening effect of each added module surpasses that of the previous configuration. In particular, the GF-CSTNet method, after filtering, is closer to the GT image in terms of spectral fidelity and spatial information, achieving the most effective sharpening result. Table 5 displays the evaluation metrics. It can be observed that each additional module improves upon the Base. Among them, the GF-CSTNet method, which performs the guided filtering operation, achieves the optimal values of the evaluation metrics, demonstrating the feasibility of the suggested approach.
More discussion
This part provides further experimental discussion of the suggested approach, focusing mainly on computational complexity and running-time analysis.
-
Run time analysis. Running efficiency is also a crucial metric when assessing fusion algorithms. This section provides an in-depth analysis of the algorithms' efficiency by calculating the average running time of the different methods on the GF2 and WV3 datasets. Tables 6 and 7 show that the suggested method has the shortest running time, an obvious advantage. Based on comprehensive subjective and objective evaluations, as well as the time performance analysis, it can be inferred that the approach provided in this study performs superiorly.
-
Calculation complexity analysis. For feature maps of input length n, the complexity of each layer in the GF-CSTNet method is mainly determined by the self-attention and convolutional operations, since convolutional modulation blocks are used to simplify the attention. The overall complexity is \({\mathcal {O}}(n^2d+nkd^2)\), where d represents the dimension and k is the size of the convolution kernel. To further analyze the complexity of the proposed method, Floating Point Operations (Flops) and Params were compared with those of other methods on the WV3 and GF2 datasets. Flops evaluates the computational effort of the model, which is approximately equivalent to its time complexity, while Params corresponds to the space complexity of the model. (To facilitate comparison, the conversion \(1\,GFlops=10^9\,Flops\) is used; a short parameter-counting sketch is given after this list.) As shown in Tables 8 and 9, BDPN has the most parameters and DiCNN1 has the fewest GFlops. The proposed method uses a dual-branch network architecture, which slightly increases the computational cost, but this design also improves the fusion accuracy. To demonstrate the superiority of the proposed method more intuitively, taking the GF2 dataset as an example, the three indicators Q, RMSE, and SCC were selected to compare the correlation between parameter counts and the indicators. It is evident from Fig. 14 that the sharpening effect of the approach suggested in this research is substantial.
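The Params figures can be obtained directly from the model as in the short sketch below; FLOP counting usually relies on a profiling tool and is omitted here.

```python
import torch.nn as nn

def count_params(model: nn.Module) -> float:
    """Number of trainable parameters, reported in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```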
Conclusion
This paper introduces a novel pansharpening method named GF-CSTNet, which combines CSPNet with the Transformer to enhance remote sensing images. In this method, guided filtering is first applied to enhance the multispectral image. Subsequently, the Transformer structure is integrated into CSPNet, and a new multi-scale convolutional modulation block, along with a reparameterization block, is designed by drawing inspiration from the Conv2Former method and the RepVGG method. This design enables the model to extract more information from diverse receptive fields, enhancing the spatial and spectral resolution of the fused image. Moreover, the GF-CSTNet approach improves upon the direct difference method by introducing a residual learning block incorporating attention, further improving the overall quality of the fused image. Compared to alternative approaches, the proposed GF-CSTNet method significantly enhances fusion quality, as evidenced by its validation on the GF2 and WV3 datasets, where it achieves optimal results across multiple objective evaluation metrics. Additionally, residual experiments and ablation experiments further affirm the efficacy of the GF-CSTNet approach.
Data availability
The datasets used during the current study are available from the corresponding author on reasonable request. The code is available on GitHub (https://github.com/Liu-9911/GF-CSPNet).
References
Zhou, H., Liu, Q., Weng, D. & Wang, Y. Unsupervised cycle-consistent generative adversarial networks for pan sharpening. IEEE Trans. Geosci. Remote Sens. 60, 1–14. https://doi.org/10.1109/TGRS.2022.3166528 (2022).
Ghadjati, M., Moussaoui, A. & Boukharouba, A. A novel iterative PCA-based pansharpening method. Remote Sens. Lett. 10, 264–273. https://doi.org/10.1080/2150704X.2018.1547443 (2019).
Zhang, K., Niu, M. & Zhu, X. Nonlinear pansharpening for electric vehicle detection in remote sensing. SIViP 16, 2073–2081. https://doi.org/10.1007/s11760-022-02169-4 (2022).
Xiang, S., Liang, Q. & Fang, L. Discrete wavelet transform-based Gaussian mixture model for remote sensing image compression. IEEE Trans. Geosci. Remote Sens. 61, 1–12. https://doi.org/10.1109/TGRS.2023.3272588 (2023).
Ma, D. & Lai, H. Remote sensing image matching based improved orb in NSCT domain. J. Indian Soc. Remote Sens. 47, 801–807. https://doi.org/10.1007/s12524-019-00958-y (2019).
Liu, Y., Wang, L., Cheng, J., Li, C. & Chen, X. Multi-focus image fusion: A survey of the state of the art. Inf. Fusion 64, 71–91. https://doi.org/10.1016/j.inffus.2020.06.013 (2020).
Gao, Z., Wang, Q. & Zuo, C. A total variation global optimization framework and its application on infrared and visible image fusion. SIViP 16, 219–227. https://doi.org/10.1007/s11760-021-01963-w (2021).
Zhu, X. X. & Bamler, R. A sparse image fusion algorithm with application to pan-sharpening. IEEE Trans. Geosci. Remote Sens. 51, 2827–2836. https://doi.org/10.1109/TGRS.2012.2213604 (2013).
Khademi, G. & Ghassemian, H. Incorporating an adaptive image prior model into Bayesian fusion of multispectral and panchromatic images. IEEE Geosci. Remote Sens. Lett. 15, 917–921. https://doi.org/10.1109/LGRS.2018.2817561 (2018).
Masi, G., Cozzolino, D., Verdoliva, L. & Scarpa, G. Pansharpening by convolutional neural networks. Remote Sens. 8, 594. https://doi.org/10.3390/rs8070594 (2016).
Yang, J. et al. Pannet: A deep network architecture for pan-sharpening, in 2017 IEEE International Conference on Computer Vision (ICCV), 1753–1761. https://doi.org/10.1109/ICCV.2017.193 (2017).
Yuan, Q., Wei, Y., Meng, X., Shen, H. & Zhang, L. A multiscale and multidepth convolutional neural network for remote sensing imagery pan-sharpening. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 11, 978–989. https://doi.org/10.1109/JSTARS.2018.2794888 (2018).
Deng, L. J., Vivone, G., Jin, C. & Chanussot, J. Detail injection-based deep convolutional neural networks for pansharpening. IEEE Trans. Geosci. Remote Sens. 59, 6995–7010. https://doi.org/10.1109/TGRS.2020.3031366 (2021).
Wang, W. et al. Ditbn: Detail injection-based two-branch network for pansharpening of remote sensing images. Remote Sens. https://doi.org/10.3390/rs14236120 (2022).
Fang, Y., Cai, Y. & Fan, L. Sdrcnn: A single-scale dense residual connected convolutional neural network for pansharpening. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 16, 6325–6338. https://doi.org/10.1109/JSTARS.2023.3292320 (2023).
Jian, L. et al. Multi-scale and multi-stream fusion network for pansharpening. Remote Sens. 15, 1666. https://doi.org/10.3390/rs15061666 (2023).
Que, Y., Xiong, H., Xia, X., You, J. & Yang, Y. Integrating spectral and spatial bilateral pyramid networks for pansharpening. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 17, 3985–3998. https://doi.org/10.1109/JSTARS.2024.3356513 (2024).
Lu, X., Zhuo, Y.-W., Chen, H., Deng, L.-J. & Hou, J. Sscaconv: Self-guided spatial-channel adaptive convolution for image fusion. IEEE Geosci. Remote Sens. Lett. 21, 1–5. https://doi.org/10.1109/LGRS.2023.3344944 (2024).
Su, X., Li, J. & Hua, Z. Transformer-based regression network for pansharpening remote sensing images. IEEE Trans. Geosci. Remote Sens. 60, 1–23. https://doi.org/10.1109/TGRS.2022.3152425 (2022).
Qu, L., Liu, S., Wang, M. & Song, Z. Transmef: A transformer-based multi-exposure image fusion framework using self-supervised multi-task learning. arXiv:2112.01030. https://doi.org/10.48550/arXiv.2112.01030 (2021).
Wang, C. Y. et al. Cspnet: A new backbone that can enhance learning capability of cnn, in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 1571–1580, https://doi.org/10.1109/CVPRW50498.2020.00203 (2020).
Vaswani, A. et al. Attention is all you need. Neural Inf. Process. Syst. https://doi.org/10.48550/arXiv.1706.03762 (2017).
Hou, Q., Lu, C. Z., Cheng, M. M. & Feng, J. Conv2former: A simple transformer-style convnet for visual recognition. arXiv:2211.11943. https://doi.org/10.48550/arXiv.2211.11943 (2022).
Ding, X. et al. Repvgg: Making vgg-style convnets great again, in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 13728–13737, https://doi.org/10.1109/CVPR46437.2021.01352 (2021).
He, K., Sun, J. & Tang, X. Guided image filtering. IEEE Trans. Pattern Anal. Mach. Intell. 35, 1397–1409. https://doi.org/10.1109/TPAMI.2012.213 (2013).
Aiazzi, B., Alparone, L., Baronti, S. & Garzelli, A. Context-driven fusion of high spatial and spectral resolution images based on oversampled multiresolution analysis. IEEE Trans. Geosci. Remote Sens. 40, 2300–2312. https://doi.org/10.1109/TGRS.2002.803623 (2002).
Lolli, S., Alparone, L., Garzelli, A. & Vivone, G. Haze correction for contrast-based multispectral pansharpening. IEEE Geosci. Remote Sens. Lett. 14, 2255–2259. https://doi.org/10.1109/LGRS.2017.2761021 (2017).
Vivone, G. Robust band-dependent spatial-detail approaches for panchromatic sharpening. IEEE Trans. Geosci. Remote Sens. 57, 6421–6433. https://doi.org/10.1109/TGRS.2019.2906073 (2019).
Vicinanza, M. R., Restaino, R., Vivone, G., Dalla Mura, M. & Chanussot, J. A pansharpening method based on the sparse representation of injected details. IEEE Geosci. Remote Sens. Lett. 12, 180–184. https://doi.org/10.1109/LGRS.2014.2331291 (2015).
Palsson, F., Sveinsson, J. R. & Ulfarsson, M. O. A new pansharpening algorithm based on total variation. IEEE Geosci. Remote Sens. Lett. 11, 318–322. https://doi.org/10.1109/LGRS.2013.2257669 (2014).
Zhang, Y., Liu, C., Sun, M. & Ou, Y. Pan-sharpening using an efficient bidirectional pyramid network. IEEE Trans. Geosci. Remote Sens. 57, 5549–5563. https://doi.org/10.1109/TGRS.2019.2900419 (2019).
Wei, Y., Yuan, Q., Shen, H. & Zhang, L. Boosting the accuracy of multispectral image pansharpening by learning a deep residual network. IEEE Geosci. Remote Sens. Lett. 14, 1795–1799. https://doi.org/10.1109/LGRS.2017.2736020 (2017).
He, L. et al. Pansharpening via detail injection based convolutional neural networks. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 12, 1188–1204. https://doi.org/10.1109/JSTARS.2019.2898574 (2019).
Pushparaj, J. & Hegde, A. V. Evaluation of pan-sharpening methods for spatial and spectral quality. Appl. Geomat. 9, 1–12. https://doi.org/10.1007/s12518-016-0179-2 (2017).
Chang, C. I. Spectral information divergence for hyperspectral image analysis, in IEEE 1999 International Geoscience and Remote Sensing Symposium. IGARSS’99 (Cat. No.99CH36293), Vol. 1, 509–511. https://doi.org/10.1109/IGARSS.1999.773549 (1999).
Liu, X., Liu, Q. & Wang, Y. Remote sensing image fusion based on two-stream fusion network. Inf. Fusion 55, 1–15. https://doi.org/10.1016/j.inffus.2019.07.010 (2020).
Garzelli, A. & Nencini, F. Hypercomplex quality assessment of multi/hyperspectral images. IEEE Geosci. Remote Sens. Lett. 6, 662–665. https://doi.org/10.1109/LGRS.2009.2022650 (2009).
Acknowledgements
This work was supported by Shanghai Key Laboratory of Multidimensional Information Processing, East China Normal University under Grant MIP20222, in part by the Fundamental Research Funds for the Central Universities, in part by the China University Industry-University-Research Innovation Fund Project under Grant 2021RYC06002 and in part by the Scientific Research Program of Hubei Provincial Department of Education under Grant B2022040.
Author information
Contributions
Y.C. and H.L. designed the proposed model, implemented the experiments, and drafted the manuscript. F.F. provided overall guidance to the project, and reviewed and edited the manuscript. All authors contributed equally to this manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Chen, Y., Liu, H. & Fang, F. A novel pansharpening method based on cross stage partial network and transformer. Sci Rep 14, 12631 (2024). https://doi.org/10.1038/s41598-024-63336-w