Introduction

Semantic image segmentation is a computer vision task in which an algorithm automatically labels regions of an image according to people’s prior annotation. Its aim is to predict the class of every pixel in an image; because all pixels must be labeled, the task is generally referred to as dense prediction. Segmentation models are applied in a range of scenarios, including robotics1, autonomous vehicles2 and medical imaging3.

Since AlexNet4, a convolutional neural network (CNN), won the ImageNet challenge in 2012, CNN-based deep learning models have proliferated in the medical image field. U-Net, the most popular and influential framework in this area, uses skip connections between the encoder and decoder to process the full information of the image and has emerged as the cornerstone of medical image segmentation models5. Since then, the U-shaped encoder-decoder architecture has been the preferred choice when investigators design segmentation algorithms. In 2016, one work proposed 3D-UNet, which takes 2D slices from a volume scan as input into U-Net to achieve 3D segmentation6. However, 3D-UNet is still trained in a 2D manner. That same year, Milletari et al. introduced V-Net, a genuine 3D convolutional network that takes 3D volumes as direct input to perform 3D image segmentation7.

Although CNNs have achieved promising success in medical image analysis, their inherent locality considerably limits the performance of CNN-based models. The attention mechanism is one way to compensate for this shortcoming: it assigns importance weights to each part of the input features and relates the parts to one another so that the model focuses on relevant areas8. Attention mechanisms can be categorized into spatial attention, channel attention, temporal attention and multi-head attention. Spatial attention applies an attention mask across the spatial domain to filter significant spatial regions9,10 or to directly identify the most pertinent spatial positions11,12. Channel attention generates an attention mask across the channels and is responsible for selecting important channel features13. Multi-head attention is the most commonly used variant; it enhances the model’s expressiveness and stabilizes training by encouraging the model to jointly attend to details in distinct feature subspaces.

In the fields of machine learning and artificial intelligence, the attention mechanism is a technique that simulates human visual attention, allowing models to automatically seek out and focus on important parts of the information during processing. In 2017, the Transformer architecture was introduced with the primary aim of addressing the parallelization challenges faced by recurrent neural networks (RNNs), long short-term memory networks (LSTMs), and gated recurrent unit networks (GRUs) in natural language processing (NLP) tasks such as reading comprehension, abstractive summarization, textual entailment, and learning task-independent sentence representations14. At the core of the Transformer is the self-attention mechanism. Self-attention, also known as intra-attention, enables the model to compute global dependencies within a sequence by relating different positions of that sequence; the query, key, and value are therefore derived from the same set of inputs. In 2020, the Vision Transformer (ViT) was proposed, marking the first application of the self-attention-based Transformer architecture to image processing, including tasks such as image classification and semantic segmentation15. ViT divides an image into non-overlapping patches and feeds them, together with positional embeddings, into a standard Transformer, achieving superior performance compared to CNNs. Subsequently, Transformer-based models for medical image processing have garnered significant attention from the academic community.

In this study, we propose a novel and versatile medical segmentation model, named Dual ATTention Network (DATTNet), which integrates pretraining, spatial attention, channel attention and multi-head self-attention. A convolutional neural network pretrained on ImageNet is applied to extract the original sparse information from images. Specifically, we propose the Dual Attention Module, which combines spatial attention and channel attention to comprehensively refine the feature maps from the CNN backbone. Additionally, we present the Context Fusion Bridge, which mixes features of different resolutions on the basis of a Transformer-based multi-head self-attention mechanism. To extend DATTNet across different modalities, we train it on 2D slices and, for volumetric data such as CT and MRI, restore the volumes and test it in a 3D fashion.

In summary, our contributions are as follows:

  1. We propose DATTNet, a novel medical segmentation model that combines a pretrained CNN and the Dual Attention Module as the encoder to analyze the locality of the input and model long-range dependency in the spatial and channel dimensions.
  2. We develop the Context Fusion Bridge to fuse multi-scale features and yield discriminative representations.
  3. We embed a pure CNN-based attention mechanism, rather than a Transformer, into the encoder to reduce the computational burden.
  4. DATTNet generalizes to multi-modality data, such as CT, MRI and endoscopy.

Related works

CNN-based medical image segmentation models

Since the publication of U-Net5 in 2015, CNN-based U-shaped architectures have become the mainstream in the medical image segmentation community, and a wealth of variants have been developed on this framework. In the field of neuroscience, in 2016 Andermatt et al. utilized two convolutional gated recurrent units in different directions for tissue segmentation in each dimension16. Bao et al. presented a novel label consistency method to perform anatomical segmentation via a multi-scale late-fusion CNN with a random walker17.

Precise identification of brain tumors is critical for the formulation of treatment strategies. Chen et al. proposed Dense-Res-Inception Net (DRINet), an enhancement over the U-Net architecture, demonstrating superior performance in both brain segmentation and abdominal organ segmentation tasks18. Havaei et al. used a CNN to handle missing modalities with an abstraction layer that shifted feature maps to their statistics19. In 2017, Kamnitsas et al. designed a 3D multi-scale fully convolutional network with a conditional random field (CRF) for brain tumor segmentation20.

Retinal image analysis is another important task in medical imaging. By parsing the blood vessels in a retinal image, ophthalmologists can diagnose ocular diseases such as cataract, glaucoma and diabetic eye disease. However, analyzing the vessel structures manually is laborious and time-consuming, which can lead to false-negative diagnoses on tiny and thin vessels. Therefore, a series of deep learning works have been published to address this difficulty. Fu et al. combined a CNN and a CRF to capture global interactions for blood vessel segmentation21. Maninis et al. employed a VGG-19 network with specialized layers for blood vessel and optic disk segmentation, respectively22. Additionally, Wu et al. performed blood vessel segmentation via a patch-based CNN, projecting PCA solutions of the last-layer feature maps to the full segmentation23. Xiang et al. embedded a dual-stream segmentation network into a conditional generative adversarial network for more accurate and efficient retinal lesion segmentation. Notably, this model is trained in a semi-supervised adversarial manner to leverage both labeled images and unlabeled images with high-confidence pseudo labels, which addresses the scarcity of labeled medical images.

Segmentation of cardiac structures from magnetic resonance imaging (MRI) scans is an essential step for calculating clinical indices such as ventricular volume and ejection fraction. Numerous CNN-based studies have been conducted on the cardiac segmentation task. Avendi et al. applied stacked autoencoders to infer the left ventricle shape because of the small dataset24. Tran et al. used a 2D fully convolutional architecture to segment the left and right ventricles simultaneously and evaluated the model on several public MRI datasets25. Multiple views are commonly used in cardiac MRI, which enable physicians to evaluate the shape and function of the heart from different aspects. In 2016, Yang et al. proposed a deep fusion net that concatenates a feature collection block and a non-local patch-based label fusion (NL-PLF) module in a single network for left ventricle segmentation in multi-atlas segmentation26. Similarly, Zreik et al. designed an automatic multi-stream CNN (3 views) for segmentation of the left ventricle (LV) in cardiac CT angiography (CCTA) scans27. Ge et al. proposed a K-shaped Unified Network (K-Net), an end-to-end framework that concurrently segments the LV from apical 4-chamber and 2-chamber views and directly quantifies the LV from major- and minor-axis dimensions (1D), area (2D) and volume (3D), in sequence. Li et al. proposed a Spatial Dependence Multi-task Transformer (SDMT) network that simultaneously performs 3D knee MRI segmentation and landmark localization, two crucial tasks for the diagnosis and treatment of knee diseases28.

Briefly, previous investigations have commonly enhanced feature expressiveness and promoted the use of spatial information by employing CNNs, CRFs and other techniques. However, these works have rarely applied attention mechanisms in the architecture and therefore lack long-range representation.

CNN and transformer-based hybrid segmentation models

Although CNN-based deep learning models have achieved satisfying success in medical image segmentation, their performance is commonly limited by the shortage of global information. The Transformer, used in natural language processing (NLP) tasks for its excellent sequence-to-sequence prediction ability, was brought to vision tasks in 202015. Swin-UNet is an example that utilizes Transformer blocks exclusively in both the encoder and decoder29. However, the intrinsic locality of the convolution operation is essential for segmentation models, as CNNs capture specific information that Transformers do not. Thus, a hybrid architecture combining CNN and Transformer has become the preferred choice when investigators design segmentation models. TransUNet is a classic example: it extracts local features via a pretrained ResNet backbone and subsequently employs Transformer blocks to model long-range dependency30, configuring CNN and Transformer in a cascaded manner. Different from TransUNet, MISSFormer was introduced for 2D medical image segmentation with a hierarchical encoder composed of both CNN and Transformer31. Similarly, H2Former is a hierarchical network that integrates multi-scale channel attention within the encoder32. Xie et al. proposed SegFormer, which combines Transformers with lightweight multilayer perceptron (MLP) decoders33. Segmenting high-resolution 3D images is a task with high computational and spatial complexity. Self-distillation is a technique that costs computational resources only during training and is deactivated at inference for minimal overhead. In 2023, Wang et al. developed 3D Medical Image Segmentation via Self-Distilling TransUNet, termed MISSU, which achieved efficient 3D brain tumor segmentation on the BraTS 2019 dataset34.

Skin lesion segmentation from dermoscopy images is another challenging task, particularly because of the considerable variation in lesion size, shape and color and the ambiguity of lesion boundaries. Wang et al. proposed Xbound-Former, a purely attention-based network, to address these issues. Xbound-Former includes implicit boundary learners (im-Bound) to enhance local context modeling and explicit boundary learners (ex-Bound) to collect multi-scale boundary knowledge35. Another notable segmentation model is the pyramid scene parsing network (PSPNet), proposed by Zhao et al. in 201636, whose pyramid pooling module exploits global context information by aggregating distinct regions.

In summary, CNN and Transformer complement each other in extracting local and global information. Integrating them into a single encoder not only enables the model to handle short-range and long-range features simultaneously, but also fuses them to produce discriminative representations for precise segmentation.

Methods

In this section, we first describe the overall pipeline and its key modules. We then detail the Dual Attention Module, a crucial component with which DATTNet extracts useful information. Lastly, we introduce the Transformer-based Context Fusion Bridge, which models the local and global correlation of high-dimensional features.

Overall pipeline

Figure 1 illustrates the proposed DATTNet, which adopts a classic encoder-decoder architecture with a Context Fusion Bridge that connects the fourth and fifth stages of the encoder and decoder.

The encoder has six stages that progressively extract multi-scale features. Since this study focuses on the macro-design of the medical segmentation model rather than its micro-structure, we utilize off-the-shelf convolutional network structures. The convolutional backbones of the six stages are the sub-architectures Conv1, Conv2, Conv3, Conv4, Conv5 and Conv6 of the pretrained VGG1637. In addition, we propose the Dual Attention Module to analyze the global dependency of the features extracted by the VGG16 backbones. Given an image of shape \({H}_{in}\times {W}_{in}\times {C}_{in}\), in the \(i\)-th stage of the encoder the input first passes through the VGG16 blocks, whose output size is \(\frac{{H}_{in}}{{2}^{i-1}}\times \frac{{W}_{in}}{{2}^{i-1}}\times {C}_{i}\) (\(i=1, 2,\ldots, 6\); \({C}_{i}\) is the number of channels of the \(i\)-th stage). Afterward, the feature maps generated by the VGG16 blocks, which contain local information, are fed into the Dual Attention Module to model global information and local context features; the result is the output feature map of the \(i\)-th stage.

Subsequently, DATTNet passes the feature maps from the fourth and fifth stages through the Context Fusion Bridge to calculate the global and local correlation of multi-scale features and to fuse them. Because the multi-scale feature maps differ in size, we flatten each feature map in the spatial dimension and reshape it so that the number of channels equals \({C}_{x}\) (\({C}_{x}\) is the greatest common divisor of \({C}_{1}\) to \({C}_{6}\)). For example, a feature map of size \(\frac{{H}_{in}}{{2}^{i-1}}\times \frac{{W}_{in}}{{2}^{i-1}}\times {C}_{i}\) is rearranged into \(\frac{{H}_{in}\times {W}_{in}\times \lambda }{{2}^{2i-2}}\times {C}_{x}\) (where \({C}_{i}=\lambda {C}_{x}\)). The reshaped features are concatenated along the flattened spatial dimension and fed into the Context Fusion Bridges, with a residual addition between consecutive Context Fusion Bridges. Finally, we restore the features to their original sizes and obtain discriminative multi-scale features.
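The reshaping arithmetic above can be checked with a short sketch. It is an illustration only: the function name is hypothetical, the 224×224 input size is the one stated in the implementation details, and the channel values (512 channels in stages 4 and 5, \({C}_{x}=64\)) are assumptions taken from the standard VGG16 widths and the 8C notation of Fig. 2.

```python
def flattened_shape(h_in, w_in, stage, c_stage, c_x):
    """Return (tokens, channels) after flattening one stage's feature map.

    The stage-i map has size (h_in / 2^(i-1)) x (w_in / 2^(i-1)) x c_stage.
    It is rearranged to (H * W * lam / 2^(2i-2)) x c_x with c_stage = lam * c_x.
    """
    if c_stage % c_x != 0:
        raise ValueError("c_stage must be a multiple of c_x")
    lam = c_stage // c_x
    scale = 2 ** (stage - 1)
    h, w = h_in // scale, w_in // scale
    return h * w * lam, c_x


# Assumed values: 224 x 224 input, 512 channels in stages 4 and 5, C_x = 64.
tokens4, _ = flattened_shape(224, 224, 4, 512, 64)
tokens5, _ = flattened_shape(224, 224, 5, 512, 64)
total_tokens = tokens4 + tokens5  # length of the concatenated sequence
```

Under these assumptions the concatenated sequence has 5HW/32 tokens, which agrees with the sequence shape given in the caption of Fig. 2.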

As for the decoder, DATTNet passes the fused features from the fourth and fifth stages, together with the encoder features from the other stages, through the decoder. By upsampling the high-dimensional features and combining them with shallow-level features from the skip connections, the multi-scale features are refined and features of size \({H}_{in}\times {W}_{in}\) are generated. These are then projected onto the final predicted segmentation map by a convolutional layer with a 1\(\times\)1 kernel and a stride of 1.

Fig. 1

The overview of our proposed DATTNet. (a) The backbone of DATTNet: the VGG backbone is split into six sub-blocks (Conv1 to Conv6), which are used in the six encoder stages of DATTNet. (b) The streamline of DATTNet: there are six stages in the encoder and decoder of DATTNet. In the encoder, each stage includes VGG16 sub-blocks and a DAM to extract local features and model long-range dependency, respectively. The Context Fusion Bridge is used to remix multi-scale features. In the decoder, SA modules are responsible for global information integration. DAM: Dual Attention Module; SA: Spatial Attention. (c) The structure of the UpSample module: the input features, with dimensions (H, W, C), undergo a series of operations, including Conv2d, BN, ReLU activation and ConvTranspose2d, which double their spatial dimensions while halving the number of channels, resulting in upsampled features with dimensions (2H, 2W, C/2). BN: Batch Normalization; ReLU: Rectified Linear Unit. (d) The structure of DAM: the Dual Attention Module consists of two components, Efficient Channel Attention and Spatial Attention. The Efficient Channel Attention module applies an Adaptive Average Pool operation to each channel of the input feature of shape (H, W, C), resulting in channel weights of shape (1, 1, C). Conv1d is then used to obtain attention weights for each channel, which are restored to the original size (H, W, C), Hadamard multiplied with the original input feature and added to it element-wise to obtain the feature weighted by channel attention. This feature is then fed into the SA module, where it is mapped to the \({C}_{inter}\) dimension by two Conv2d and BN branches; after being summed, the results pass through ReLU, Conv2d, BN and the Sigmoid activation function to obtain spatial attention weights. These weights are then Hadamard multiplied with the input feature of the SA module to obtain the final output feature, which enters the next stage of DATTNet.
AAP: Adaptive Average Pool; DAM: Dual Attention Module; SA: Spatial Attention; ECA: Efficient Channel Attention.

Dual attention module

We propose the Dual Attention Module to perform global attention analysis on both the channel and spatial dimensions of the sparse spatial features extracted by the VGG sub-blocks. The module consists of two parts: the Efficient Channel Attention module and the Spatial Attention module. The former extracts global information in the channel dimension, while the latter analyzes global representations in the spatial dimension. In this section, we detail these two modules.

Efficient channel attention

A convolutional neural network extracts only local information from medical images within a fixed receptive field and does not model the relationships among channels well. To address this problem, inspired by refs. 38 and 32, we utilize the Efficient Channel Attention (ECA) module to generate channel weights that calibrate the convolutional features obtained from the VGG16 sub-blocks (Fig. 1d):

$$\begin{array}{c}{F}_{i}^{VC}=\sigma \left({C1D}_{k}\left(AAP\left({F}_{i}^{V}\right)\right)\right) \odot{F}_{i}^{V}+{F}_{i}^{V}\end{array}$$
(1)

Here, \({F}_{i}^{V}\) denotes the convolutional features from the VGG16 sub-blocks in the \(i\)-th stage, \(AAP\) is the adaptive average pooling operation, \({C1D}_{k}\) is a fast 1D convolutional layer with kernel size \(k\), and \(\sigma\) is the Sigmoid activation function, which yields the channel attention values that represent the significance of each channel. Additionally, \(\odot\) indicates the Hadamard product. Finally, the residual addition of \({F}_{i}^{V}\) to the attention-weighted features helps to avoid vanishing or exploding gradients.
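As an illustration of Eq. (1), the following dependency-free sketch applies the same sequence of operations to a small feature map stored as nested lists in channel-first order. The function name, the zero padding at the channel borders, the absence of a bias term and the kernel values are assumptions made for the example; the original implementation may differ.

```python
import math


def eca(feature, kernel):
    """Efficient channel attention on a [C][H][W] nested-list feature map.

    kernel: odd-length list of 1D convolution weights shared by all channels.
    """
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("kernel length must be odd")
    pad = k // 2
    # AAP: average every channel over its spatial positions -> C values.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
              for ch in feature]
    # C1D_k: 1D convolution across neighbouring channels (zero padding assumed).
    padded = [0.0] * pad + pooled + [0.0] * pad
    weights = []
    for i in range(len(feature)):
        z = sum(kernel[j] * padded[i + j] for j in range(k))
        weights.append(1.0 / (1.0 + math.exp(-z)))  # Sigmoid
    # Hadamard product with the input, followed by the residual addition.
    return [[[w * v + v for v in row] for row in ch]
            for ch, w in zip(feature, weights)]


# Hypothetical example: 2 channels of size 1 x 2 and a kernel of size 3.
calibrated = eca([[[1.0, 3.0]], [[2.0, 2.0]]], [0.2, 0.5, 0.2])
```

Because each attention value lies in (0, 1), the residual form scales every channel by a factor between 1 and 2, so no channel is suppressed to zero.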

Spatial attention

The ECA module aggregates features in the channel dimension to collect the attention value of each channel. In addition to channel attention, spatial attention is equally critical in medical image segmentation. For tiny or highly variable tissues and organs, such as the aorta and pancreas, features extracted by purely traditional CNNs may lead to many false-positive predictions, which is unacceptable in medical practice. Therefore, we introduce the Spatial Attention (SA) module to build global dependency in the spatial dimension39.

Figure 1d shows an overview of the SA module. Given the input \({F}_{in}\) from the output of the ECA module, the SA module first projects it to the \({C}_{inter}\) dimension via two separate convolutional layers with kernel size 1\(\times\)1 and stride 1, yielding \({F}_{1}\) and \({F}_{2}\). Then, \({F}_{1}\) and \({F}_{2}\) are added element-wise, and the sum passes through a ReLU function, Conv2d, Batch Normalization and a Sigmoid function, which map the feature of size \(H\times W\times {C}_{inter}\) to a grid attention map \(\alpha\) of size \(H\times W\times 1\). Last, \({F}_{in}\) is scaled by the Hadamard product with the grid attention map \(\alpha\). This process can be formulated as:

$$\begin{array}{c}{F}_{out}={\sigma}_{2}\left(\psi\left({\sigma}_{1}\left({W}_{{F}_{1}}^{T}\odot{F}_{in}+{W}_{{F}_{2}}^{T}\odot{F}_{in}\right)\right)\right)\odot{F}_{in}\end{array}$$
(2)

Here, \({W}_{{F}_{1}}^{T}\), \({W}_{{F}_{2}}^{T}\) and \(\psi\) are linear transformations (\({W}_{{F}_{1}}^{T}\in {\mathbb{R}}^{{C}_{in}\times {C}_{inter}}\), \({W}_{{F}_{2}}^{T}\in {\mathbb{R}}^{{C}_{in}\times {C}_{inter}}\), \(\psi \in {\mathbb{R}}^{{C}_{inter}\times 1}\)), \({\sigma }_{1}\) is the ReLU function and \({\sigma }_{2}\) is the Sigmoid function.
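The spatial gating of Eq. (2) can be sketched per pixel, since a 1×1 convolution is a linear map applied independently at each spatial position. This is a simplified illustration, not the original implementation: Batch Normalization is omitted, the function name is hypothetical, and the weight matrices are assumed to be stored with one row per intermediate channel.

```python
import math


def spatial_attention(feature, w1, w2, psi):
    """Gate a [C][H][W] feature map with a single-channel attention map.

    w1, w2: [C_inter][C] weights of the two 1x1 projections (F1 and F2).
    psi:    [C_inter] weights mapping the hidden feature to one value.
    Returns (gated feature, attention map alpha of shape [H][W]).
    """
    channels, height, width = len(feature), len(feature[0]), len(feature[0][0])
    gated = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    alpha = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            pixel = [feature[c][y][x] for c in range(channels)]
            hidden = []
            for row1, row2 in zip(w1, w2):
                f1 = sum(a * b for a, b in zip(row1, pixel))
                f2 = sum(a * b for a, b in zip(row2, pixel))
                hidden.append(max(0.0, f1 + f2))  # ReLU of F1 + F2
            score = sum(p * v for p, v in zip(psi, hidden))
            alpha[y][x] = 1.0 / (1.0 + math.exp(-score))  # Sigmoid
            for c in range(channels):
                gated[c][y][x] = alpha[y][x] * pixel[c]  # Hadamard product
    return gated, alpha


# Hypothetical example: C = 2, H = 1, W = 2, C_inter = 1, psi set to zero.
demo_gated, demo_alpha = spatial_attention(
    [[[2.0, 4.0]], [[6.0, 8.0]]], [[1.0, 0.0]], [[0.0, 1.0]], [0.0])
```

With \(\psi\) set to zero every position receives an attention value of 0.5, which makes the scaling effect of the gate easy to verify by hand.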

Context fusion bridge

The receptive fields differ across the stages of the encoder because of the downsampling and pooling operations, and fusing multi-scale features from distinct stages has proven effective in medical image segmentation31. We therefore propose the Context Fusion Bridge, a Transformer-based module composed of Spatial-Reduction Multi-head Attention and a Fusion feed-forward network (FFN), to remix the multi-scale features (Fig. 2). Transformer blocks are not applied in the encoder for two reasons. First, the complexity of the original Transformer is quadratic in the feature size, so passing low-level features through a Transformer incurs an enormous computational cost. Second, compared with a convolutional layer with a similar number of parameters, a Transformer block converges with more difficulty, which may harm model performance and prolong training. Nevertheless, the global modeling capability of the Transformer is hard to replace in semantic segmentation. DATTNet therefore utilizes it only in the Context Fusion Bridge, which reduces computational complexity and facilitates convergence. We also modify the original Transformer block to further decrease the computational complexity; the details are discussed in Sect. 6-C. In the original Transformer block, the multi-head attention mechanism models global dependency well but not the useful local context. To solve this problem, we integrate a convolutional layer into the FFN as a supplement and add skip connections to anchor feature weights and fuse representation information; we name the result Fusion FFN.

Spatial-reduction multi-head attention

In the original multi-head attention mechanism, given an \(H\times W\times C\) feature map \({F}_{in}\), \(Q\), \(K\) and \(V\) are reshaped from image to sequence form (\(H\times W\times C\to N\times C\), where \(N=H \times W\)), and the attention output is then calculated as:

$$\begin{array}{c}Atten\left({F}_{in}\right)=Softmax\left(\frac{Q{K}^{T}}{\sqrt{d}}\right)V\end{array}$$
(3)

Here, \(Softmax\) is the softmax function, which maps attention values to attention scores \(\alpha \in [0, 1]\), and \(d\) is the channel dimension of each attention head. Although the original multi-head attention models global dependency well, its computational complexity is quadratic in the feature size (\(\mathcal{O}\left({N}^{2}\right)\)), which may cause out-of-memory (OOM) errors when processing large feature maps. To overcome this shortcoming, we introduce a spatial reduction factor \(f\) to decrease the spatial resolution of the feature maps. The resulting attention can be defined as:

$$\begin{array}{c}Atten\left({Q}^{{\prime}},K,{V}^{{\prime}}\right)=Softmax\left(\frac{{Q}^{{\prime}}{K}^{T}}{\sqrt{d}}\right){V}^{{\prime}}\end{array}$$
(4)

Here, \({Q}^{{\prime}},{V}^{{\prime}}\in {\mathbb{R}}^{\frac{N}{f}\times (f\times C)}\) and \(K\in {\mathbb{R}}^{N\times C}\). In this way, the computational complexity is reduced to \(\mathcal{O}\left(\frac{{N}^{2}}{f}\right)\), which achieves a better trade-off between computational resources and model performance.
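To make the cost argument concrete, the sketch below implements plain scaled dot-product attention for a single head and counts the entries of the score matrix with and without a reduction factor. It is a generic illustration and does not attempt to reproduce the exact reshaping of Eq. (4); the function names and the value \(f=4\) are hypothetical, \(d\) is taken here as the per-head key dimension, and the sequence length 7840 corresponds to 5HW/32 from Fig. 2 for a 224×224 input.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Softmax(Q K^T / sqrt(d)) V for lists of row vectors (single head)."""
    d = len(k[0])  # key dimension, assumed to be what sqrt(d) refers to
    out = []
    for q_row in q:
        scores = softmax([sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d)
                          for k_row in k])
        out.append([sum(s * v_row[c] for s, v_row in zip(scores, v))
                    for c in range(len(v[0]))])
    return out


def score_matrix_entries(n_tokens, reduction=1):
    """Entries of the score matrix when one side is shortened by `reduction`."""
    if n_tokens % reduction != 0:
        raise ValueError("n_tokens must be divisible by reduction")
    return n_tokens * (n_tokens // reduction)


full_cost = score_matrix_entries(7840)        # N^2 entries
reduced_cost = score_matrix_entries(7840, 4)  # N^2 / f entries, f = 4 assumed
```

The score matrix shrinks linearly with \(f\), which is the source of the \(\mathcal{O}\left(\frac{{N}^{2}}{f}\right)\) complexity stated above.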

Fusion FFN

Spatial-Reduction Multi-head Attention models global dependency well but cannot fuse the local context of multi-scale features. Similarly, an FFN with only linear layers lacks the ability to remix local information. Therefore, we modify the original FFN by embedding a convolutional layer and a skip connection, which can be formulated as:

$$\:\begin{array}{c}{F}_{out}={LP}_{2}\left(GELU\left(LN\left(Conv2d\left({LP}_{1}\left({F}_{in}\right)\right)+{LP}_{1}\left({F}_{in}\right)\right)\right)\right)\end{array}$$
(5)

First, \({LP}_{1}\) projects \({F}_{in}\) to higher dimensional space, where the local context is collected. \(Conv2d\) is a convolution layer. \(LN\) and \(GELU\) are LayerNorm and Gaussian Error Linear Unit activation function, respectively. \({LP}_{2}\) restores the channel dimension of features with remixed local information.

Given feature maps \({F}_{i}\in {\mathbb{R}}^{{H}_{i}\times {W}_{i}\times {C}_{i}}\) (\(i=1, 2, \dots, 6\)), the Context Fusion Bridge first flattens and reshapes them to \({C}_{x}\) channels (\({C}_{x}\) is the greatest common divisor of \({C}_{1}\) to \({C}_{6}\)) so that features with distinct resolutions can be concatenated. The concatenated feature is fed into Spatial-Reduction Multi-head Attention to model global dependency. After that, the fused feature is restored to its original size and passed through the Fusion FFN to analyze local information. Several Context Fusion Bridges are stacked to build the global and local correlation of multi-scale features. After the Context Fusion Bridges, we obtain the discriminative representation, which is delivered to the decoder together with the other encoder features to generate the segmentation map.

Fig. 2

The structures of Context Fusion Bridge. (a)  The streamline of Context Fusion Bridge: The features of the fourth and fifth stages are input into the Context Fusion Bridge. First, they are reshaped into sequence features, then concatenated into a new sequence with the shape (5HW/32, C). Subsequently, they are sent to the AttU_BridgeLayer we proposed for context fusion. Finally, the fused features are restored to their original shapes (H/8, W/8, 8C) and (H/16, W/16, 8C) using the same method. (b) The structure of AttU_BridgeLayer: The AttU_BridgeLayer consists of two main components: the Spatial-Reduction Multi-head Attention (SRMA) and the Fusion feed-forward Network (Fusion FFN). The primary function of SRMA is to extract global features through multi-head attention while reducing computational complexity via linear projection; the role of Fusion FFN is to extract local features of different stages. (c) The structure of Spatial-Reduction Multi-head Attention: The features, characterized by (5HW/32, C), are initially transformed through a linear projection to obtain sequence features of different sizes, called Query, Key, and Value. This approach aids in reducing computational complexity and accelerating model training. Following a Conv2d operation on Key and Value, they engage in multi-head attention with Query. The output is subsequently subjected to another linear mapping to yield the final result. (d) The structure of Fusion FFN: For sequence features with shape of (HW/64, C), Fusion FFN initially applies a linear mapping and then reshapes them into image features. Subsequently, local features are extracted through a Conv2d operation. Finally, after passing through the GELU activation function, Layer Norm, and another linear mapping, the output features are obtained.

Experiments

Dataset

In this section, we introduce three open-access public datasets, which are used to evaluate the models. Table 1 summarizes the details of the three datasets.

Automated cardiac diagnosis challenge dataset

The Automated Cardiac Diagnosis Challenge (ACDC) is part of the MICCAI 2017 Challenges40. The dataset contains 100 short-axis T1-weighted 3D volume MRI scans, all acquired with 1.5T or 3.0T scanners. Three cardiac structures are annotated by experts: right ventricle (RV), myocardium (Myo) and left ventricle (LV). The dataset is split into 70 cases for training and 30 for testing. The average Dice Similarity Coefficient (DSC) and the 95% Hausdorff Distance (HD) are used as the evaluation metrics.

Synapse multi-organ segmentation dataset

The Synapse dataset was publicly released in the MICCAI 2015 Multi-Atlas Abdomen Labeling Challenge41 and includes 30 axial 3D abdominal CT scans from distinct patients. Each volume comprises 85 to 198 slices with a resolution of 512\(\times\)512, for a total of 3799 slices. Eight organs are labelled by clinical experts (aorta, gallbladder, left kidney, right kidney, liver, pancreas, spleen, stomach). The dataset is divided into two subsets: 18 cases for training and 12 for testing. The evaluation metric is the average DSC.

Kvasir-SEG dataset

Kvasir-SEG is a gastrointestinal polyp segmentation dataset42, consisting of 1000 RGB endoscopy images and corresponding masks that were outlined by an engineer and a medical doctor and then verified by a senior gastroenterologist. Because there is no official split, we randomly select 700 images as the training set and use the remaining 300 as the test set. DSC, Mean Absolute Error (MAE), Accuracy (ACC) and Intersection over Union (IoU) are used as the assessment metrics of model performance.
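For reference, the four overlap metrics named above can be computed from the confusion counts of a binary mask, as in the following sketch. The function name and the convention of returning 1.0 for DSC and IoU when both masks are empty are assumptions made for the example, and the MAE shown is the one for hard 0/1 predictions, where it equals the fraction of mislabeled pixels.

```python
def binary_metrics(pred, target):
    """DSC, IoU, ACC and MAE for two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    tn = len(pred) - tp - fp - fn
    overlap_denominator = tp + fp + fn
    # Assumed convention: two empty masks count as a perfect match.
    dsc = 2 * tp / (2 * tp + fp + fn) if overlap_denominator else 1.0
    iou = tp / overlap_denominator if overlap_denominator else 1.0
    acc = (tp + tn) / len(pred)
    mae = (fp + fn) / len(pred)
    return {"DSC": dsc, "IoU": iou, "ACC": acc, "MAE": mae}


# Hypothetical four-pixel example with one pixel of each outcome.
scores = binary_metrics([1, 1, 0, 0], [1, 0, 1, 0])
```

DSC and IoU ignore true negatives, so they are less affected than ACC by the large background regions typical of polyp images.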

Table 1 Summary of key information of the three datasets.

Implementation details

DATTNet is implemented in PyTorch 1.11.0 and trained on an NVIDIA GeForce RTX 4090 GPU with 24 GB of memory and an AMD EPYC 7T83 CPU with 64 cores. The VGG16 sub-blocks in the encoder are pretrained on ImageNet, whereas the other modules of DATTNet are initialized randomly and trained from scratch. Hence, to facilitate convergence, mitigate overfitting and strengthen the generalization of the model, appropriate data augmentation strategies are employed on the ACDC, Synapse and Kvasir-SEG datasets. In addition to the commonly used flipping and rotation, Gaussian noise and blur, linear contrast transformation, random shear, local twist and other transformations are adopted. Data augmentation is performed during the training phase only, not during testing.

We set the input size to 224\(\times\)224, because larger sizes may result in OOM errors, while smaller input resolutions discard key information of the original images and thereby reduce performance. The initial learning rate is 1e-3 and is adjusted according to the poly learning rate policy as follows:

$$\begin{array}{c}{LR}_{i+1}={LR}_{i}\times{\left(1-\frac{i}{Ma{x}_{iteration}}\right)}^{0.9}\end{array}$$
(6)

Here, \({LR}_{i+1}\) and \({LR}_{i}\) are the learning rates at the \((i+1)\)-th and \(i\)-th iterations, respectively, and \(Ma{x}_{iteration}\) is the maximum number of iterations. The maximum number of epochs is 200 and the batch size is 8. DATTNet is optimized with AdamW with \(\left({\beta }_{1}, {\beta }_{2}\right)=\left(0.9, 0.999\right)\) and a weight decay of 3e-5. Moreover, we employ a joint objective function to optimize the model, which can be formulated as:

$$\begin{array}{c}{\mathcal{L}}_{CE}\left(P,\:G\right)=-G\text{log}\left(P\right)-\left(1-G\right)\text{log}\left(1-P\right)\end{array}$$
(7)
$$\begin{array}{c}{\mathcal{L}}_{DSC}\left(P,G\right)=1-\frac{2\left|P\cap G\right|}{\left|P\right|+\left|G\right|}\end{array}$$
(8)
$$\:\begin{array}{c}{\mathcal{L}}_{total}=\lambda\:\cdot\:{\mathcal{L}}_{CE}\left(P,\:G\right)+\xi\:\cdot\:{\mathcal{L}}_{DSC}\left(P,\:G\right)\end{array}$$
(9)

\({\mathcal{L}}_{CE}\) is the cross-entropy loss and \({\mathcal{L}}_{DSC}\) denotes the Dice Similarity Coefficient loss. \(P\) is the prediction of the model and \(G\) is the ground truth. \(\lambda\) and \(\xi\) are weight coefficients that balance the cross-entropy and DSC losses. For a fair comparison with other SOTA methods, all experiments are performed under consistent hyperparameter settings without any pre-processing or post-processing.
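The training recipe of Eqs. (6) to (9) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the DATTNet training code: `poly_step` applies Eq. (6) literally as a recurrence, `poly_closed_form` shows a common alternative reading in which the decay factor scales the initial rate, the cross-entropy and Dice terms use the standard binary forms with the prediction inside the logarithm and a soft intersection \(\sum P\cdot G\), and the weights `lam=0.5` and `xi=0.5` are hypothetical because the values of \(\lambda\) and \(\xi\) are not stated here.

```python
import math


def poly_step(lr_i, i, max_iteration, power=0.9):
    """One literal application of Eq. (6): returns LR_{i+1} given LR_i."""
    return lr_i * (1.0 - i / max_iteration) ** power


def poly_closed_form(base_lr, i, max_iteration, power=0.9):
    """Alternative reading (assumption): the factor scales the initial rate."""
    return base_lr * (1.0 - i / max_iteration) ** power


def cross_entropy(pred, truth, eps=1e-7):
    """Mean binary cross-entropy; pred holds probabilities, truth holds 0/1."""
    total = 0.0
    for p, g in zip(pred, truth):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= g * math.log(p) + (1 - g) * math.log(1.0 - p)
    return total / len(truth)


def dice_loss(pred, truth, eps=1e-7):
    """Soft Dice loss: 1 - 2|P and G| / (|P| + |G|)."""
    intersection = sum(p * g for p, g in zip(pred, truth))
    denom = sum(pred) + sum(truth)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)


def joint_loss(pred, truth, lam=0.5, xi=0.5):
    """Weighted sum of the two terms, as in Eq. (9); weights are hypothetical."""
    return lam * cross_entropy(pred, truth) + xi * dice_loss(pred, truth)


good = joint_loss([0.9, 0.1, 0.8], [1, 0, 1])
bad = joint_loss([0.1, 0.9, 0.2], [1, 0, 1])
```

The two schedule readings differ in practice: the recurrence compounds the decay factor at every iteration and therefore shrinks the rate much faster than the closed form, so it is worth checking which one a given implementation uses.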

Comparison with other SOTA methods

Results on cardiac segmentation

The quantitative results on the cardiac segmentation task are reported in Table 2. DATTNet outperforms the other SOTA methods. Specifically, DATTNet attains an average DSC of 92.2%, surpassing AttentionUNet43, a fully convolutional network with attention modules, by 1.1%. Our method also achieves the lowest 95%HD among these SOTA models at 1.29, which is 0.33 lower than that of PSPNet36, a network that uses a pyramid pooling module to incorporate global and local information simultaneously. This result indicates that DATTNet also captures boundary features precisely. Beyond the best average performance, our model achieves favorable results on each sub-class: DATTNet obtains 93.9%, 86.7% and 95.9% DSC on RV, Myo and LV, outperforming the second-highest DSC by 0.9%, 0.8% and 0.8%, respectively. DATTNet performs worse on Myo than on RV and LV, which may result from cardiac motion across the cardiac cycle. The qualitative results are presented in Fig. 3. In summary, these results show that DATTNet performs cardiac segmentation well.

Furthermore, a survey of attention mechanisms reveals that integrating spatial and channel attention improves a model’s capacity to extract more discriminative representations compared with using a single attention mechanism53. Hence, medical image segmentation networks based on spatial and channel attention have attracted increasing interest from researchers. We substitute the dual attention modules in DATTNet with those proposed in the Dual Attention-guided Efficient Transformer (DAEFormer)47, termed DATTNet_{DAEFormer}, and in Dual Attention Vision Transformers (DaViT)48, termed DATTNet_{DaViT}. Comparative experiments are performed on the ACDC, Synapse and Kvasir-SEG datasets. The three distinct dual attention modules are illustrated in Supplementary Fig. 1, and the experimental results are presented in Tables 2, 3 and 4. We analyze and discuss these results in the “Discussion” section.

Table 2 The quantitative results of SOTA methods on ACDC.
Fig. 3

The qualitative analysis of DATTNet and other SOTA methods on the ACDC dataset. The figure shows segmentation results of DATTNet and four other SOTA models on the ACDC dataset. Red represents the right ventricle, green represents the myocardium, and yellow represents the left ventricle. DSC stands for Dice similarity coefficient.

Results on multi-organ segmentation

We also conduct experiments on the Synapse dataset, which officially contains 13 categories. Following ref. 31, we adopt the version with 8 labeled organs to evaluate our method. The quantitative results on Synapse are shown in Table 3. DATTNet achieves the best DSC among the compared methods at 84.5%, which is 1.7% higher than that of TransUNet. TransUNet, which combines a CNN and a Transformer, is a classic semantic segmentation network that is used as a baseline in many works29,30,31,32. Although TransUNet marginally surpasses DATTNet on the aorta by 0.12%, DATTNet outperforms all SOTA methods on the other organs. Specifically, the DSC of DATTNet on the right kidney is 2.8% higher than that of TransUNet. Furthermore, among the eight abdominal organs, DATTNet performs best on the liver and spleen, reaching 95.3% and 92.4% DSC, which is 0.2% and 0.8% better than TransUNet, respectively. This result demonstrates that our method is highly sensitive to large abdominal organs with fixed positions. However, the performance on the pancreas is unsatisfactory for all methods: DATTNet reaches only 68.4% DSC, TransUNet 66.4%, SegFormer 63.2% and DeepLabv3 54%. This poor performance may result from the anatomical complexity and shape variation of the pancreas. Additionally, we conduct qualitative experiments to visually demonstrate the segmentation quality of these models (Fig. 4). In brief, the extensive experiments on Synapse indicate that DATTNet achieves reasonable performance with an efficiency benefit on the multi-organ segmentation task.

Fig. 4

The qualitative analysis of DATTNet and other SOTA methods on Synapse dataset. We present the qualitative segmentation results of the top five models in DSC on six representative examples. As observed from the figures, our proposed DATTNet model consistently achieves the best segmentation results for Examples 1 to 5. Notably, the segmentation of the stomach (green) is most accurate, without misclassifying the small intestine as the stomach. In contrast, UNet, SegFormer, and H2Former all misclassified the small intestine as the stomach in Example 3. DSC: Dice Similarity Coefficient.

Table 3 The quantitative comparisons to previous SOTA methods on the Synapse dataset.

Results on polyp segmentation

To further verify the performance of our model, we perform polyp segmentation experiments on the Kvasir-SEG dataset. As shown in Table 4, although PSPNet36 performs slightly better than our model, DATTNet still outperforms the other SOTA models, reaching 89.1% DSC, which is higher than CE-Net, TransUNet, MISSFormer and H2Former by 0.1%, 2.1%, 9.4% and 8.6%, respectively. MISSFormer and H2Former are hierarchical hybrid networks based on CNNs and Transformers. The distinction is that MISSFormer fuses the encoder features in the skip connections, while H2Former incorporates them in each phase of the encoder via a hybrid Transformer block. In addition, DATTNet obtains an MAE of 0.036, comparable to that of AttentionUNet and lower than those of MISSFormer, TransUNet, DeepLabv3 and H2Former by 0.03, 0.12, 0.007 and 0.028, respectively. DATTNet therefore not only predicts the interior of the target area well but also captures the edges of gastrointestinal polyps competitively. DATTNet also attains an ACC of 96.5%, higher than that of Swin-UNet, a shifted-window Transformer network; the shifted-window mechanism may cause ambiguity between polyps and mucosa. Furthermore, DATTNet yields the best IoU at 81.1%, exceeding SegFormer and DeepLabv3 by 6.7% and 3.8%, respectively, which indicates superior localization ability. DATTNet thus achieves favorable performance on the polyp segmentation task. Finally, we visualize the prediction maps of some examples from the Kvasir-SEG dataset to illustrate the performance of DATTNet intuitively (Fig. 5).

Table 4 The quantitative results of SOTA methods and our DATTNet on the Kvasir-SEG dataset.
Fig. 5

The qualitative analysis of DATTNet and other SOTA methods on the Kvasir-SEG dataset. We show the qualitative segmentation results of intestinal polyps by DATTNet and other SOTA models. Only the top 5 models in terms of DSC are presented. As observed from the figures, our proposed DATTNet can effectively segment intestinal polyps and achieve relatively clear edges, whereas PSPNet and DeepLabv3 exhibit poorer segmentation capabilities for blurred lesion edges in Example 3. DSC, Dice similarity coefficient.

Ablation study

We conduct a series of thorough ablation experiments on a 3D segmentation dataset (ACDC) and a 2D dataset (Kvasir-SEG) to evaluate the effectiveness of the key modules in DATTNet and training settings.

Ablation study of convolution backbone

A convolutional neural network, serving as the feature extractor, captures discriminative information from the input image, which is the foundation of precise segmentation. In this section, we investigate different CNN backbones, variants of a backbone and the influence of Batch Normalization. To probe the impact of distinct CNN backbones on DATTNet, we take VGG16_Bn as the baseline for comparison, called “DATTNet_V16Bn”. ResNet, proposed by Kaiming He et al. in 201649, won first place in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2015 classification task. Since then, ResNet has been adopted by many SOTA methods in medical image segmentation30,36,38,50,51. Therefore, we perform experiments in which VGG16_Bn is replaced with ResNets. The results are reported in Table 5. On the cardiac segmentation task, DATTNet_R18 and DATTNet_R34 yield the best performance among the ResNet variants at 90.8% DSC, which is 1.4% lower than that of DATTNet_V16Bn. DATTNet_R34 obtains a 95%HD of 1.24, which is 0.05 lower than that of DATTNet_V16Bn. Similarly, DATTNet_V16Bn outperforms the DATTNet_R variants overall on the polyp segmentation task. For example, DATTNet_V16Bn achieves 89.1% DSC and 81.1% IoU on the Kvasir-SEG dataset, surpassing DATTNet_R18 by 0.7% and 1.4%, respectively. VGG, which achieved competitive results in ILSVRC 2014, was the first to strengthen the feature-capturing ability of deep neural networks by increasing model depth and using smaller (3 × 3) convolution filters52. ResNet is also a deep convolutional neural network; it introduced residual connections into model design to avoid vanishing gradients49. These results suggest that DATTNet_V16Bn, with its 3 × 3 kernels, tends to collect fine-grained features and therefore yields better segmentation predictions.

In addition to the backbone architecture, different variants of the same convolutional neural network also affect the performance of DATTNet. We therefore conduct experiments on four variants of VGG to explore the influence of model depth on prediction performance. As shown in Table 6, the performance of DATTNet rises slightly and then decreases from DATTNet_V13Bn to DATTNet_V19Bn as the backbone becomes deeper. DATTNet_V13Bn and DATTNet_V16Bn achieve the highest DSC, at 92.1% on the ACDC dataset. DATTNet_V16Bn yields a superior 95%HD of 1.29, whereas DATTNet_V13Bn attains 1.76. In summary, on the one hand, a shallow model has limited feature extraction ability, so the encoder captures insufficient information for segmentation. On the other hand, a large model may overfit, which harms segmentation accuracy.

Batch Normalization is a technique that normalizes the activations of the intermediate layers of deep neural networks. It allows the model to use a larger learning rate, which accelerates training and improves generalization. To verify whether Batch Normalization brings a performance benefit, we remove the Batch Normalization layers from DATTNet_VxBn. Table 7 reports the results. On the cardiac segmentation task, Batch Normalization improves the DSC of DATTNet_V11Bn, DATTNet_V13Bn and DATTNet_V16Bn by 0.3%, 0.7% and 0.2%, respectively, although it reduces the DSC of DATTNet_V19Bn. Similar performance gains are found on Kvasir-SEG. These results demonstrate that embedding Batch Normalization layers into DATTNet generally improves performance.
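For readers unfamiliar with the operation, the training-time computation of Batch Normalization for a single feature can be sketched as follows. This is a minimal sketch assuming one scalar activation per sample in the batch; `gamma`, `beta` and `eps` follow the usual formulation, and the running statistics used at inference time are omitted.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of activations to zero mean and unit variance,
    then apply the learnable scale (gamma) and shift (beta)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # biased batch variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


activations = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm(activations)
out_mean = sum(normalized) / len(normalized)
out_var = sum((v - out_mean) ** 2 for v in normalized) / len(normalized)
```

After normalization the batch has approximately zero mean and unit variance regardless of the input scale, which is what makes larger learning rates tolerable.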

Table 5 The influence of different CNN backbones on the performance of DATTNet.
Table 6 The impact of different variants of VGG on the performance of DATTNet.
Table 7 The comparison results of DATTNet with/without batch normalization.

Ablation study of encoder architecture

The encoder plays a crucial role in a U-shaped segmentation network. In this section, we examine the key modules of the DATTNet encoder, namely the Spatial Attention module and the ECA module, as well as the ways in which they are integrated. We take the pure CNN-based encoder with VGG16_Bn as the baseline. Table 8 reports the results. Comparing E.1 and E.2 shows that the Spatial Attention (SA) module brings a performance gain, improving DSC and 95%HD by 0.2% and 0.36, respectively. This suggests that multi-head attention in the spatial dimension can model the interplay between different organs and tissues and extract strong global representations, both of which are essential for accurate segmentation. Figure 6 illustrates that the CNN-based SA module in DATTNet attends to the target areas in medical images. In many SOTA methods, global dependency is built via Transformer-based attention mechanisms. To validate the effectiveness of the CNN-based Spatial Attention module in DATTNet, we compare it with the Efficient Multi-head Attention in MISSFormer31. E.2 and E.3 show that the multi-head attention mechanism of the conventional Transformer block achieves only 70.1% DSC and 7.84 95%HD, which is far inferior to the CNN-based one (91.8% and 1.44). Our method also has a pronounced advantage over the Transformer-based multi-head attention mechanism in model complexity and forward-propagation speed: the parameters and FLOPs of E.3, at 60.01 M and 90.95G, are higher than those of E.2 by 17.4% and 14.3%, respectively. These results demonstrate that the CNN-based attention mechanism delivers competitive performance while reducing computational overhead compared with the Transformer-based one.
We argue that there are two potential reasons. On the one hand, a Transformer block trained from scratch is difficult to converge15. On the other hand, we design the encoder hierarchically, with Transformer blocks embedded into each stage. Because a Transformer can only process sequence features, converting the high-resolution feature maps of the shallow stages into sequences greatly extends the training time.

Table 8 The ablation study of the encoder structure of DATTNet.
Fig. 6

The attention maps generated by the SA module. The SA module models global dependency and assigns higher attention weights to the regions of interest on the ACDC, Synapse and Kvasir-SEG datasets.

In addition to spatial attention, we also introduce a channel attention extractor, the ECA module, into the encoder of DATTNet. In a convolutional backbone, the number of channels increases progressively as the model goes deeper, which allows the backbone to collect sufficient discriminative features, such as edges, contrast and blobs in shallow layers, texture in intermediate layers and semantic information in deep layers. We assume that modeling long-range dependency in the channel dimension is essential for segmentation. Hence, the ECA module is added to the baseline model to assess the effect of channel attention. Comparing E.1 and E.4 shows that channel attention improves DSC by 0.5% and 95%HD by 0.16, reaching 92.1% and 1.64, respectively. The reason is that the ECA module builds interconnections among channel features and thereby fuses discriminative representations, which benefits performance. The above investigations verify that spatial or channel attention alone can improve segmentation. However, it remains uncertain whether incorporating both into one block, named the Dual Attention Module, brings a further improvement. We therefore perform experiments to test this hypothesis. Comparing E.6 with E.2 and E.4, we observe that the DSC of E.6 (Dual Attention) is higher than that of the single attention variants by 0.1% and 0.3%. Similarly, Dual Attention obtains a 95%HD of 1.29, which is lower than those of spatial and channel attention by 0.15 and 0.35, respectively. These results indicate that spatial and channel attention complement each other and together strengthen the feature extraction ability of the encoder.
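The channel attention described above can be illustrated with a sketch of the generic ECA computation: global average pooling per channel, a 1D convolution across neighboring channels, a sigmoid gate and channel-wise rescaling. The sketch assumes a feature map stored as a list of channels, each an H × W nested list, and uses a hand-picked odd-length kernel in place of learned weights; it follows the generic ECA formulation and is not the DATTNet implementation.

```python
import math


def global_avg_pool(feature):
    """Average each channel (an H x W nested list) to a single scalar."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]


def conv1d_same(signal, kernel):
    """1D cross-channel convolution with zero padding; kernel length must be odd."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k)) for i in range(len(signal))]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def eca(feature, kernel):
    """Rescale every channel by a gate computed from its channel neighborhood."""
    descriptor = global_avg_pool(feature)
    weights = [sigmoid(v) for v in conv1d_same(descriptor, kernel)]
    out = [[[w * px for px in row] for row in ch] for ch, w in zip(feature, weights)]
    return out, weights


# Three 2 x 2 channels with constant values 0, 1 and 2.
feature = [[[float(c)] * 2 for _ in range(2)] for c in range(3)]
kernel = [0.0, 1.0, 0.0]  # hypothetical weights; learned in a real model
out, weights = eca(feature, kernel)
```

Each channel receives a single scalar weight in (0, 1), so the module adds very few parameters (only the kernel) while still letting each channel's gate depend on its neighbors.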

We have analyzed the necessity of building the Dual Attention Module to capture global dependency in the spatial and channel dimensions. For the design of the Dual Attention Module, there are two options: embedding the two attention mechanisms in parallel or integrating them sequentially (Fig. 7). Comparing E.5 and E.6 shows that the parallel Dual Attention Module yields sub-optimal performance, with DSC dropping by 0.3% and 95%HD increasing by 0.58, which may result from attention deviation in the parallel Dual Attention Block.

Fig. 7

The two alternatives of the Dual Attention Module. (a) Sequential design: the extracted local features first pass through the ECA module. The features carrying channel attention information are then fed into the Spatial Attention module to model global dependency. (b) Parallel design: the local representation is delivered to the spatial and channel attention modules separately. The features produced by the two attention mechanisms are then fused via pixel-wise addition. Furthermore, a residual connection is applied to avoid vanishing or exploding gradients.
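The difference between the two designs is purely one of composition, which the following sketch makes explicit. It assumes features stored as flat lists and treats the two attention modules as arbitrary callables; the stand-in modules `toy_channel_attn` and `toy_spatial_attn` are hypothetical, and applying the residual connection to both designs is an assumption for illustration.

```python
def add(a, b):
    """Element-wise (pixel-wise) addition of two flat feature lists."""
    return [x + y for x, y in zip(a, b)]


def sequential_dual_attention(x, channel_attn, spatial_attn):
    """Channel attention first, then spatial attention, plus a residual path."""
    return add(x, spatial_attn(channel_attn(x)))


def parallel_dual_attention(x, channel_attn, spatial_attn):
    """Both attentions see the same input; outputs are summed, plus a residual path."""
    return add(x, add(spatial_attn(x), channel_attn(x)))


def toy_channel_attn(v):
    """Hypothetical stand-in: scale all features by a constant channel weight."""
    return [0.5 * e for e in v]


def toy_spatial_attn(v):
    """Hypothetical stand-in: keep the first position, suppress the second."""
    return [e * w for e, w in zip(v, [1.0, 0.0])]


x = [1.0, 2.0]
seq = sequential_dual_attention(x, toy_channel_attn, toy_spatial_attn)
par = parallel_dual_attention(x, toy_channel_attn, toy_spatial_attn)
```

In the sequential form the spatial module operates on features that are already reweighted by channel attention, whereas in the parallel form each module sees the raw input and the two outputs are only combined by addition.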

Ablation study of context fusion bridge

In a U-shaped encoder-decoder architecture, skip connections feed the original feature maps from each stage into the decoder to calibrate the upsampled features and preserve low-level details. Nevertheless, directly adding encoder and decoder feature maps of the same resolution is far from sufficient to integrate encoder-decoder representations and model multi-scale correlation. To solve this problem, the Context Fusion Bridge is proposed to mix feature maps of multiple scales. In this section, we investigate the impact of the number and position of Context Fusion Bridges and of the residual connection on model performance. The results are reported in Table 9. There are one, two and four Context Fusion Bridges in E.1, E.6 and E.2, respectively. E.6 attains 92.2% DSC, outperforming E.1 and E.2 by 0.2% and 0.1%. Similarly, E.6 surpasses E.1 by 0.1% DSC and 0.4% IoU, reaching 89.1% and 80.7% on the polyp segmentation task. The parameters and FLOPs of E.6 are 49.54 M and 77.91G, which are 4.27 M and 6.47G fewer than those of E.2 despite their comparable DSC (89.1%), indicating that E.6 is more efficient at the same level of performance. There are six stages in DATTNet. To explore the fusion power of different stages, we conduct further experiments. E.3 remixes the feature maps from the third and fourth stages. Because the sequence length grows quadratically with the feature resolution, transforming shallow features is demanding on GPU memory. As shown in Table 9, an out-of-memory (OOM) error is triggered when the third- and fourth-stage features are fused. For the feature maps from the fifth and sixth stages, E.5 obtains 91.9% DSC, which is inferior to E.6 (92.2%). These results demonstrate that fusing the fourth- and fifth-stage features brings a performance benefit. In addition, feature deviation generally has a negative impact on segmentation when multi-scale features are remixed.
To validate this hypothesis, we remove the residual connections in the Context Fusion Bridge of E.6. Comparing E.4 and E.6, we find that E.4 achieves sub-optimal performance on the ACDC dataset, with DSC dropping by 0.2%. Removing the residual connections also reduces DSC and IoU by 0.3% and 1.8% on the Kvasir-SEG dataset, respectively. These results indicate that it is important to constrain features via residual connections when fusing multi-scale representations.
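The memory pressure of fusing shallow stages can be made concrete with a short calculation. The sketch below assumes a 224 × 224 input whose side length is halved at every stage (224, 112, 56, 28, 14, 7 for stages 1 to 6), which is typical of a VGG-style backbone but is an assumption about DATTNet, and it counts the entries of one self-attention matrix per head and per image stored as 4-byte floats.

```python
def stage_side(input_side, stage):
    """Side length of the feature map at a given stage (halved per stage; assumption)."""
    return input_side // (2 ** (stage - 1))


def attention_entries(side):
    """Token count N = side^2 and the N x N size of one self-attention matrix."""
    tokens = side * side
    return tokens, tokens * tokens


rows = []
for stage in range(3, 7):
    side = stage_side(224, stage)
    tokens, entries = attention_entries(side)
    mib = entries * 4 / (1024 ** 2)  # float32 storage for one head, one image
    rows.append((stage, side, tokens, entries, mib))
    print(f"stage {stage}: {side}x{side} -> {tokens} tokens, {mib:.2f} MiB per attention map")
```

Under these assumptions, a stage-3 attention map holds 256 times as many entries as a stage-5 map, before multiplying by the number of heads, the batch size and the activations kept for backpropagation. This is consistent with the OOM observed when the third- and fourth-stage features are fused.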

Table 9 The ablation results of context fusion bridge in DATTNet.

Ablation study of optimizers

A wealth of knowledge about optimizers has been developed in the semantic segmentation community53. Stochastic gradient descent (SGD)54, adaptive moment estimation (Adam)55 and Adam with decoupled weight decay regularization (AdamW)56 are three commonly used optimizers. In this section, we conduct experiments to compare the three optimizers and select the most suitable one. Table 10 reports the results. The DATTNet models trained with the different optimizers are named “DATTNet_S”, “DATTNet_A” and “DATTNet_AW”. DATTNet_AW attains the best performance, outperforming DATTNet_S by 3.4% DSC and 1.91 95%HD on the ACDC dataset. Similarly, DATTNet_S yields 86.8% DSC and 77.2% IoU on the Kvasir-SEG dataset, which are 2.3% and 3.9% lower than those of DATTNet_AW. These results suggest that AdamW is the most suitable optimizer for DATTNet and improves performance.

Table 10 The comparison results of different optimizers on the performance of DATTNet.

Discussion

In the medical image segmentation community, there are three popular architectures: pure CNN, pure Transformer and hybrid architectures. AttentionUNet is a pure CNN method that uses CNN-based attention blocks only in the decoder, not in the encoder39. In addition, its encoder feature maps are fed directly into the decoder without any feature fusion, which limits its performance. AttentionUNet achieves 91.1%, 79.2% and 88.8% DSC on the cardiac, multi-organ and polyp segmentation tasks, which is inferior to DATTNet by 1.1%, 5.3% and 0.3%, respectively. Swin-UNet is a fully Transformer-based deep learning model whose encoder and decoder are composed of Swin Transformers29. Despite its promising global modeling ability, Swin-UNet lacks components that extract short-range information, which makes it less sensitive to local details than DATTNet. Both Swin-UNet and DATTNet are designed hierarchically; the difference is that Swin Transformers are embedded into each stage of Swin-UNet, leading to very high computational costs. Lastly, the relative positions and anatomical shapes of structures in the human body are relatively fixed in medical images. For example, the liver and spleen lie on the right and left of the abdominal cavity, respectively, the gallbladder is closely adjacent to the liver, and the pancreas is spoon-shaped. However, the shifted windows in Swin-UNet disrupt this prior knowledge and these cross-organ correlations, causing ambiguity for the model. These disadvantages adversely affect Swin-UNet’s ability to segment small and variable organs. For example, Swin-UNet yields 69.4% and 57.1% DSC on the aorta and gallbladder, respectively, which are 19.2% and 16.3% lower than those of DATTNet on the Synapse dataset. These results indicate that the shifted window is not well suited to medical image segmentation. Furthermore, among the hybrid methods that combine CNNs and Transformers, TransUNet30 and MISSFormer31 are representative works.
TransUNet is built in a cascaded manner, whereas DATTNet is built hierarchically. Unlike the cascaded approach, the hierarchical design allows the sparse features extracted by the CNN backbone to be aligned and fused in time, preventing weight deviation. Both MISSFormer and DATTNet remix multi-scale feature maps in the skip connections. The difference is that MISSFormer fuses features from four stages, whereas DATTNet fuses features from two, which saves computation while maintaining performance.

Attention mechanisms, which model global dependency among different parts of the features, have emerged as key components of medical segmentation models. Spatial, channel and multi-head attention are commonly employed. In this study, we integrate spatial and channel attention modules into the encoder and incorporate multi-head attention into the Context Fusion Bridge to comprehensively improve the global modeling ability of DATTNet. First, in contrast to Transformer-based attention, CNN-based spatial attention is performed in the encoder of DATTNet. On the one hand, the encoder needs to build long-range dependency for feature maps from six stages, so the CNN-based spatial attention mechanism is more reliable and effective than the Transformer-based one in terms of computational overhead and hardware memory. On the other hand, the intrinsic locality of convolution enables DATTNet to preserve valuable spatial information, whereas the patch partitioning and sequence conversion in a Transformer lose many useful and discriminative representations. Thus, the CNN-based spatial attention mechanism is more applicable to the hierarchical encoder of DATTNet. Figure 8 shows the areas to which the SA module attends in each stage of DATTNet. The SA modules assign increasingly high importance weights to lesions as the stages go deeper. With respect to channel attention, it uses a scalar as the importance factor of each channel and then emphasizes pertinent information or suppresses negligible features simply and effectively57. This advantage helps DATTNet segment organs and tissues with ambiguous boundaries more precisely.

Fig. 8

The areas to which the SA modules in each stage (1 ~ 6) pay attention. In the shallow encoder (stages 1 ~ 3), the SA modules generally focus on the target area within the full image. From stage 4 to 6, the SA modules gradually pay more attention to the gastrointestinal polyp, which localizes the polyp lesion more accurately.

The dual-attention structure was first proposed in the Residual Attention Network (RAN) in 201758. Subsequently, AI models based on dual attention mechanisms have been developed extensively. Both DAEFormer and DaViT use spatial and channel attention structures similar to those of our proposed DATTNet. To compare their performance, we conduct experiments on the ACDC, Synapse and Kvasir-SEG datasets under the same experimental conditions (Tables 2, 3 and 4). In terms of segmentation performance, DATTNet_{DaViT} does not achieve satisfactory results in cardiac MRI segmentation, abdominal CT multi-organ segmentation or colonoscopy lesion segmentation. There are two possible reasons. First, the attention design in DATTNet_{DaViT} was originally trained and validated only on the natural object dataset ADE20K and is not dedicated to medical imaging. Second, unlike DATTNet, DATTNet_{DaViT} uses Window Attention in the spatial attention module (Supplementary Fig. 1c). This spatial attention mechanism segments natural objects well but does not achieve good results on medical imaging data. Window Attention requires splitting the feature map into windows, reassembling these windows and then extracting attention. Medical imaging data are sensitive to direction: for example, the spleen and liver have similar shapes and CT values in CT images, but the spleen is on the left side of the upper abdomen and the liver is on the right side. After the feature map is split and reassembled by Window Attention, the model can easily confuse left and right, leading to segmentation errors. DAEFormer is a specialized segmentation network for medical applications and achieves superior performance compared with DaViT. However, DATTNet_{DAEFormer} achieves a DSC of 0.901 on the ACDC dataset, slightly lower than DATTNet’s 0.922.
In multi-organ segmentation of abdominal CT scans and polyp segmentation of colonoscopy images, DATTNet_{DAEFormer} achieves DSC scores of 0.765 and 0.803, respectively, which are 8% and 8.8% lower than those of DATTNet. In terms of computational resource consumption, DATTNet_{DaViT} has only 8.33 M parameters, far fewer than DATTNet_{DAEFormer} and DATTNet, but its FLOPs are as high as 149.41G. By contrast, DATTNet_{DAEFormer} has only 29.61 M parameters and 25.95G FLOPs, which may be related to its relatively shallow encoder-decoder layers. In conclusion, considering both segmentation performance and computational resource consumption, DATTNet_{DaViT} is not suitable as a medical image segmentation network, and the performance of DATTNet_{DAEFormer} on cardiac MRI segmentation, abdominal CT multi-organ segmentation and colonoscopy polyp segmentation is inferior to that of DATTNet.

Additionally, DATTNet fuses and models global dependency only on the feature maps from the fourth and fifth stages of the encoder via the multi-head self-attention module in the Context Fusion Bridge, which remixes the multi-scale features while avoiding wasted computation. Last but not least, an appropriate optimizer is essential for training speed and model performance. In Sect. 6-D, we probed the impact of three classic optimizers on DATTNet: SGD, Adam and AdamW. SGD updates the model parameters with a fixed, non-adaptive learning rate, which slows convergence considerably compared with adaptive gradient algorithms such as Adam and AdamW59. However, adaptive gradient algorithms rely on the curvature of the loss landscape, which considerably limits their generalization potential60,61. AdamW addresses this issue with a dynamically regularized loss59: the objective combines the vanilla loss with decoupled weight decay to regularize the model dynamically.
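The practical meaning of decoupled weight decay can be seen from a single scalar update. The sketch below contrasts Adam with an L2 penalty folded into the gradient against AdamW, following the standard published update rules; it uses the hyperparameters reported in the implementation details (learning rate 1e-3, betas 0.9 and 0.999, weight decay 3e-5), while `eps=1e-8` and the toy starting point are assumptions.

```python
import math


def adam_l2_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, weight_decay=3e-5):
    """Adam with L2 regularization: the penalty passes through the adaptive scaling."""
    g = grad + weight_decay * theta  # penalty folded into the gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)  # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=3e-5):
    """AdamW: the decay is applied directly to the weight, outside the adaptive step."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps) - lr * weight_decay * theta
    return theta, m, v


# First step from theta = 1 with a zero task gradient: only the decay acts.
theta_l2, _, _ = adam_l2_step(1.0, 0.0, 0.0, 0.0, t=1)
theta_w, _, _ = adamw_step(1.0, 0.0, 0.0, 0.0, t=1)
```

With a zero task gradient, AdamW shrinks the weight by exactly `lr * weight_decay * theta`, whereas in the L2 variant the tiny penalty gradient is renormalized by the adaptive denominator and produces a step of roughly the full learning rate. This is the coupling that the decoupled formulation removes.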

Currently, DATTNet serves as a versatile cross-modal framework. In the future, researchers can curate their own task-specific datasets and fine-tune DATTNet or apply transfer learning to it, for instance, by adjusting the number of layers in the AttU_BridgeLayer within the Context Fusion Bridge or substituting VGG with other pretrained networks. DATTNet has been trained on multiple modalities and thus possesses prior knowledge of CT, MRI and endoscopic imaging (stored in the model’s weights), such as the higher CT values of bone tissue and the high signal of liquid in T2-weighted MRI sequences. Consequently, fine-tuning based on DATTNet enables faster model adaptation and more accurate automatic segmentation of target regions. In addition to providing researchers with an open, modifiable multimodal framework, DATTNet can be fine-tuned from the existing weights on data from a particular institution. The resulting model can be integrated into hospital imaging systems for clinical applications, enhancing the diagnostic efficiency of doctors, especially radiologists. Moreover, for cancer patients, DATTNet can segment tumor lesions from multiple examinations during treatment, so that changes in tumor volume can be monitored to assess the clinical response to drug therapy. If the tumor volume does not decrease after multiple treatments, drug resistance is suggested and a change in anticancer medication may be necessary. Manually segmenting tumor lesions is time-consuming and labor-intensive, which limits its widespread clinical use. Our DATTNet model has preliminarily demonstrated that an AI segmentation model can significantly reduce the time and effort required for segmenting specific organs and lesions in medical images, making such monitoring feasible.

Artificial intelligence (AI) has the potential to revolutionize healthcare. However, there are several challenges that limit its widespread use. One major challenge is the lack of standardization in algorithms and software. Currently, there are no universal methods for AI-based data analysis or interpretation, and no consistent approaches to address missing data, which is a significant concern in large-scale datasets. Another challenge is the need for consensus guidelines in reporting data from machine learning (ML) studies. A group is working on defining an AI-specific version of the STARD checklist (Standards for Reporting of Diagnostic Accuracy Studies)62, which aims to improve the completeness and transparency of studies investigating diagnostic test accuracy. Additionally, recommendations will be needed for prognostic or theragnostic biomarkers. Their performance should be compared to existing diagnostic, staging, and predictive systems. The availability of large datasets is crucial for the development of research and its future impact on clinical care. To foster this, the deposition and sharing of large datasets should be encouraged, including the utilization and sharing of large-scale data from electronic health records across and between health systems. Moreover, sharing individual-participant data from clinical trials or purely academic research studies is becoming increasingly advocated by many scientists and organizations. This would assist in constructing datasets of sufficient size and detail to appropriately train and validate AI models. Finally, a further challenge is the lack of racial, ethnic, and socioeconomic diversity in the cohorts used to develop and train AI models. Future studies must ensure that promising AI-based tools are validated in diverse cohorts that include racial and ethnic minorities as well as patients across the complete socioeconomic spectrum.
In conclusion, while AI holds great promise for improving disease detection and patient stratification, several challenges must be addressed before its full potential can be realized. Addressing these challenges will require collaboration among researchers, clinicians, policymakers, and patients to create a more efficient and effective healthcare system through the integration of AI technology.

In terms of computational resource consumption, we compared DATTNet with other SOTA models. As shown in Table 4, our model has more parameters and higher FLOPs than traditional CNN-based segmentation models such as UNet and AttentionUNet. This is likely due to the deeper structure of DATTNet and the use of the Transformer-based AttU_BridgeLayer in the Context Fusion Bridge. However, compared to the classical TransUNet, DATTNet has a clear advantage in the number of parameters. In summary, DATTNet does not hold an advantage in parameter count over most SOTA models, but hardware performance is expected to keep improving in line with Moore’s Law. Moreover, the primary application scenarios for our medical image segmentation algorithm are large-scale medical imaging equipment manufacturers and healthcare institutions, where the higher parameter count should not hinder practical use. We will nevertheless address this limitation in future work by reducing the number of model parameters and shortening the execution time while maintaining performance.
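To make the parameter and FLOPs comparison more concrete, the sketch below counts the cost of two generic building blocks using the standard textbook formulas. It is an illustration only: the layer sizes are assumed values, the helper names are hypothetical, and neither function reproduces the actual DATTNet configuration or the figures in Table 4.

```python
def conv2d_params(c_in: int, c_out: int, k: int, bias: bool = True) -> int:
    """Learnable parameters of a k x k convolution."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def conv2d_macs(c_in: int, c_out: int, k: int, h_out: int, w_out: int) -> int:
    """Multiply-accumulate operations for one forward pass of the convolution."""
    return k * k * c_in * c_out * h_out * w_out


def self_attention_params(d_model: int, bias: bool = True) -> int:
    """Parameters of the Q, K, V and output projections of one attention layer."""
    return 4 * (d_model * d_model + (d_model if bias else 0))


if __name__ == "__main__":
    # Assumed sizes, chosen only to show the orders of magnitude involved.
    print("3x3 conv 64->128 params:", conv2d_params(64, 128, 3))
    print("3x3 conv 64->128 MACs at 56x56:", conv2d_macs(64, 128, 3, 56, 56))
    print("self-attention (d=512) params:", self_attention_params(512))
```

The convolution cost scales with the spatial size of the output feature map, whereas the attention projections add a fixed block of parameters per layer, which is one way a Transformer-based bridge can raise the parameter count of an otherwise convolutional model.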

There are still some limitations and directions for improvement. Although DATTNet has achieved competitive performance compared with SOTA methods, we will continue to improve it in other respects, in the hope that it can soon be deployed in clinical practice. First, VGG backbones slow the convergence of the model compared to ResNets. DATTNet_V16B and DATTNet_R34 have comparable parameter counts, but the former requires 77.91G FLOPs, significantly more than the latter. There are two possible solutions to this shortcoming: one is to choose a pretrained model that is simpler and more effective, and the other is to develop a new framework that not only extracts robust features but also saves computational power. Second, although the inference speed of DATTNet on the Kvasir-SEG dataset is faster than on the ACDC and Synapse datasets, it still falls short of real-time segmentation. The reason is that DATTNet contains many heterogeneous blocks and structural redundancies. In clinical scenarios, real-time inference is critical for physicians, as it helps to identify uncertain and early-phase polyps and thereby reduce false negatives. Therefore, we will explore ways to accelerate the inference of DATTNet toward real-time segmentation in future work.
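Inference speed of the kind discussed above is usually reported as mean latency or frames per second. The following is a minimal timing sketch under stated assumptions: `benchmark` is a hypothetical helper, `predict` stands in for any model's forward pass, and the warm-up and run counts are arbitrary. It does not reproduce the measurement protocol used in our experiments.

```python
import time
from typing import Any, Callable, Dict


def benchmark(predict: Callable[[Any], Any], sample: Any,
              warmup: int = 5, runs: int = 50) -> Dict[str, float]:
    """Return mean latency (ms) and throughput (frames per second)."""
    # Warm-up calls are excluded from timing (caches, lazy initialization).
    for _ in range(warmup):
        predict(sample)
    # Note: on a GPU, the device would need to be synchronized before each
    # clock read, because kernels are launched asynchronously.
    start = time.perf_counter()
    for _ in range(runs):
        predict(sample)
    mean_s = (time.perf_counter() - start) / runs
    fps = 1.0 / mean_s if mean_s > 0 else float("inf")
    return {"mean_latency_ms": mean_s * 1000.0, "fps": fps}


if __name__ == "__main__":
    # Dummy workload standing in for a segmentation model.
    def dummy_predict(image):
        return [[1 if px > 0.5 else 0 for px in row] for row in image]

    image = [[(r * c % 7) / 7.0 for c in range(128)] for r in range(128)]
    print(benchmark(dummy_predict, image))
```

Which frame rate counts as real time depends on the acquisition rate of the imaging device, so that threshold is left to the caller.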

Conclusion

In this study, we propose DATTNet, a deep learning model for medical image segmentation that mainly comprises a VGG backbone, ECA modules, Spatial Attention modules and Context Fusion Bridges. The VGG backbone is responsible for extracting local information from the images. The Spatial Attention and ECA modules model global dependencies in the spatial and channel dimensions, respectively. The Context Fusion Bridge remixes feature maps of multiple scales and constructs their correlations. To validate the performance of DATTNet, we conduct extensive experiments on two 3D volume datasets (ACDC and Synapse) and a 2D image dataset (Kvasir-SEG). DATTNet surpasses nearly all the SOTA models, yielding 92.2%, 84.5% and 89.1% DSC on the ACDC, Synapse and Kvasir-SEG datasets, respectively. The results demonstrate that DATTNet attains competitive cross-modality capability for MRI, CT, and endoscopy, and generalizes to different tasks such as cardiac, abdominal organ and gastrointestinal polyp segmentation. In the future, we will focus on reducing model complexity, speeding up inference and further improving performance, in order to advance DATTNet toward practical clinical application.
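For reference, the DSC values reported above follow the standard Dice similarity coefficient, DSC = 2|A ∩ B| / (|A| + |B|), between a predicted mask A and a ground-truth mask B. The sketch below is a minimal version for binary masks flattened into equal-length sequences. The function name and the convention of returning 1.0 when both masks are empty are assumptions of this example, not a description of our evaluation code.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice similarity coefficient between two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    pred = [1, 1, 0, 0]
    target = [1, 0, 0, 0]
    print(f"DSC = {dice_coefficient(pred, target):.4f}")
```

For multi-class datasets the coefficient is typically computed per class and then averaged.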