Deep learning-based Covid-19 diagnosis: a thorough assessment with a focus on generalization capabilities

Abstract

The Covid-19 pandemic has significantly spurred the development of deep learning (DL) models for the automatic diagnosis of the disease from CT scan images. However, the generalization of the proposed models is often assumed and remains to be assessed and demonstrated before any concrete clinical use. In this work, we have investigated the real value of widely used public datasets for the elaboration of DL models that are dedicated to the automatic diagnosis of Covid-19 using CT scans. We have collected various international public datasets from 13 countries. Different Convolutional Neural Networks (CNNs) have been trained and their performances carefully assessed. Two evaluations have been conducted: (1) an internal evaluation following a cross-validation procedure, and (2) an external evaluation on real patients coming from new and different sources. The objective is to assess the generalization capabilities under real-world conditions: different acquisition conditions, devices and configurations. Three families among the most effective CNN models have been selected (ResNet, DenseNet and EfficientNet). These have been fine-tuned, evaluated and used within a training methodology based on transfer learning. The most effective models have been further customized in order to create new models that are dedicated to the task at hand. These models have significantly improved the diagnosis performance.

1 Introduction

Covid-19 diagnosis methods enable the detection of newly infected individuals, which is essential for an effective control of the pandemic spread. In particular, Covid-19 can be detected through characteristic features on medical imaging techniques like chest X-ray and chest computed tomography (CT) scan. Here, ground-glass opacities (GGOs) and pulmonary consolidation are the most typical findings that do not change according to the different variants [1].

CT scan-based diagnosis allows an easier and more reliable detection of Covid-19 compared to X-ray radiography. It provides a better characterization of morphological features at different stages of the disease [2]. The CT scan is also considered more accurate for diagnosis than antibody testing [3]. In addition, a chest CT scan can reveal typical Covid-19 findings when the patient is at an early phase (e.g., asymptomatic or incubation stage) or at a late phase of the disease, precisely when RT-PCR is not effective [4].

Consequently, CT scan is valuable for Covid-19 diagnosis. It is accurate, easy and fast to acquire, at a cost comparable to that of other diagnosis methods. However, the interpretation of CT scans requires qualified radiologists, and this may penalize disadvantaged regions that suffer from medical desertification. CT scan interpretation (screening) is also time-consuming, which can be detrimental in times of pandemic.

In this context, an accurate automatic diagnosis tool based on CT scan images is very useful in practice to support the medical staff. It makes it possible to take advantage of the rapidity of image acquisition by making the diagnosis quick and effective as well. This is why numerous computer-aided diagnosis systems were proposed very soon after the beginning of the pandemic. Most (if not all) of the proposed techniques are based on Deep Learning [5,6,7,8,9,10,11,12].

While deep learning techniques like Convolutional Neural Networks (CNN) and Transformers excel in technical aspects like pattern recognition, their effectiveness is significantly challenged by issues of generalizability in the context of Covid-19 diagnosis [11]. The main question is whether models trained on datasets from diverse sources can yield robust algorithms. Given the varied and imbalanced nature of these datasets, it is crucial to assess if such models can consistently perform well across different clinical environments and patient populations. The evolving nature of Covid-19 and its regional differences further complicates this issue, amplifying the need for models that can adapt and scale rapidly [12]. Therefore, understanding the effectiveness of these algorithms when trained on heterogeneous datasets becomes essential for developing reliable and universally applicable AI diagnostic tools.

In this paper, we focus on the generalization capabilities of DL-based models. We investigate in particular whether DL models developed on one of the largest public datasets (Covidx CT-2B [13]) are able to generalize accurately. Covidx CT-2B contains CT scans of Covid-19 patients and healthy subjects from at least 13 countries.

Our contributions can be summarized as follows:

  • A very large dataset has been employed for experiments: a public dataset of multinational Covid-19 and healthy cases [13], a publicly available SARS-CoV-2 dataset [14], and a new CT scan dataset that we have collected and finely annotated. This dataset is made available to the research community.

  • In order to focus on specific parts of CT scan images that are of interest for diagnosis, a preprocessing step is used to perform lung segmentation. A DL-based classification is then applied using different models and architectures. We have considered architectures promoted in [5,6,7,8, 15].

  • The best models in terms of both classification accuracy and number of parameters have been selected, and a new classifier has then been proposed in order to further improve the performance.

  • The generalization capabilities of the models have been assessed using new datasets that come from completely different sources. The idea is to assess the generalization capabilities considering real-world conditions: different acquisition conditions, devices and configurations. Here, the evaluation has been achieved on a per image (considering only a single slice) and on a per patient basis (that is considering the whole CT scan).

  • The DL classification methodology has been finally investigated. The objective is to identify which specific parts of the model are activated during diagnosis. This has been done using data visualization through the Grad-CAM method [16].

As a result, the main novelty of the paper is twofold. First, a thorough evaluation of the generalization capabilities of DL models for Covid-19 diagnosis under real-world circumstances is conducted. Second, a considerable amount of work has gone into the datasets, enabling us to perform experiments in very realistic conditions and to produce a new labeled dataset that is made available to the scientific community.

The rest of the paper is organized as follows: Sect. 2 discusses related works. In Sect. 3, our technical methodology is described: CT scan images, datasets, CNN models, training procedure, etc. Experiments are then presented along with obtained results in Sect. 4. These are then discussed in Sect. 5. Finally, Sect. 6 concludes the paper.

2 Related works

The review given in [15] has assessed the maturity of machine learning-based techniques for Covid-19 diagnosis using CT scans and it reveals that only 2.1% of these studies have a high maturity score. Most of these techniques are based on transfer learning and on fine-tuning of well-known architectures (CNN models). Classification layers are generally added on top of the considered architecture.

Ko et al. [17] propose an architecture called FCONet. Four backbone architectures (Vgg16, ResNet50, Inception v3 and Xception) have been tested. A classifier is added on top of these backbones. It consists of a flatten layer and three fully connected layers. A set of 95,820 images is used. It is divided into 3 subsets: Covid-19, other pneumonia and healthy subjects. The study reveals that ResNet50 achieves the best average values of sensitivity (98.88%), specificity (99.36%), and accuracy (99.10%). Thus, the model has been included in the workflow for the diagnosis of Covid-19 in the two hospitals that were involved in this research.

The work by Bai et al. [18] aims to discern between Covid-19 and other pneumonia based on chest CT scans. It uses a dataset of 512 Covid-19 CT scans and 665 CT scans of other pneumonia. A 2D classification is conducted using EfficientNet [19] topped with a multi-layer perceptron with dropout and ReLU activation. The fully connected layers with batch normalization are then connected to a Sigmoid-based classifier. The model performs better than 6 radiologists, with 96% accuracy, 95% sensitivity, and 96% specificity.

Another common method is characterized by a pipeline that consists of a segmentation procedure followed by a classification. Gao et al. [20] propose a binary classification between healthy cases and Covid-19 patients. The dataset used includes a total of 704 Covid-19 and 498 healthy chest CT scans, containing in total 210,395 slices. The authors have suggested a 2D segmentation and classification with a Dual-branch Combination Network. A lung-lesion segmentation is performed using U-net in order to obtain feature maps from the decoder part. These output maps are the input to a ResNet50 CNN model that classifies the slices using a Lesion Attention module at every convolution stage. This network achieves 95.99% accuracy, 89.14% sensitivity, and 98.04% specificity.

Wang et al. [21] have proposed an approach that relies on lung segmentation followed by ternary classification of the CT scan slices. Here, target classes are Covid-19, other pathology and healthy patient. Proposed models have been trained with CT scans from 1,136 patients (of which 723 are Covid-19 positive) and tested on 282 other CT scans (of which 154 are Covid-19 positive). Among a dozen different architectures considered, the pipeline composed of U-net followed by ResNet-50 obtained the best results with a sensitivity of 97.4% and a specificity of 92.22%.

Song et al. [22] have considered a dataset composed of three CT clusters: 88 Covid-19 patients, 100 bacterial pneumonia cases and 86 healthy subjects. The study relies on a 2D diagnosis approach that, firstly, performs the lung region segmentation in 15 slices representing a CT scan, then extracts the relevant region of each slice with a pooled features map obtained by ResNet50 and Feature Pyramid Network, and lastly, predicts the diagnosis with a multi-layer perceptron that takes as input a one-dimensional findings vector. This network achieves a precision of 93% and a recall (sensitivity) of 96%.

The predominant aspect of these deep learning-based diagnosis approaches is the use of heterogeneous sources from multiple medical centers with homogeneous acquisition protocols specified through the lung windows [17, 18, 20, 22] and the slice thicknesses [20,21,22]. Nevertheless, it can be challenging for researchers to find CT scans of multiple labeled datasets, since access to health data is often hampered by the privacy of the data [23] or the lack of systems to aggregate them. The publication of several CT scan datasets has been the corollary of this bottleneck, thanks to the joint action of research teams during the Covid-19 pandemic. However, as it turns out, these public datasets have various shortcomings such as degraded quality of shared CT scans, heterogeneous labeling errors and inappropriate CT windows [9].

This raises the following important questions: What is really the generalization performance of these models? Are their predictions and diagnoses reliable on CT scans of completely unseen patients? Is it possible to develop a new deep learning-based model that takes into account generalization as the first priority? These are the main research questions our study aims to tackle.

3 Methodology

As stated before, the main question we are investigating in this work is whether DL-based models that have been developed from disparate datasets can generalize to new unseen patients whose CT scans have been acquired in different conditions and with varying configurations. In other words, to what extent do existing DL-based models perform well when addressing new patients in real-world situations?

To this end, we have employed the following methodology:

  • Creation of a new dataset of CT scans that is composed of most of the publicly available datasets.

  • Study of the most powerful existing CNN-based models on the classification of CT scan images.

  • Using fine-tuning and transfer learning, the considered CNN-based models are fitted to the task at hand.

  • Extensive experiments are conducted in order to perform an objective and an exhaustive assessment of the models. Cross-validation is in particular used here.

  • Generalization is assessed on two datasets that are completely unseen during the previous steps and that come from different sources: a Brazilian dataset and an Algerian one.

  • Using Grad-CAM, relevant and effective parts of the NN models (during the prediction phase) are investigated.

Our methodology and its main technical parts are described in the following subsections.

3.1 CT scan images

CT scans are widely used in medical imaging for diagnosis purposes. They can be used to diagnose injuries, stroke, tumors, heart disease, infection, inflammatory disease, etc. When automatically analyzing CT scan images for automatic pathology diagnosis, one essential challenge is related to the variability of the quality and the properties of CT scan images. These are highly sensitive to the acquisition parameters that condition the obtained CT scans, and this is why it is fundamental to deeply study the generalization capabilities of any automatic diagnosis method that makes use of CT scans.

The acquisition parameters utilized during a CT scan hence play a crucial role in determining the resulting image quality, features and pixel density. CT scanners employ various acquisition parameters that can be adjusted to optimize image quality for different clinical scenarios. The key acquisition parameters that significantly influence the quality and the properties of CT scan images are:

  1. Tube current (mA): tube current determines the amount of radiation emitted by the X-ray tube during the scan. Higher tube currents increase radiation exposure and reduce image noise. Conversely, lower tube currents reduce radiation exposure at the cost of noisier images [24].

  2. Tube voltage (kVp): tube voltage determines the energy level of the X-rays used during the scan. Higher tube voltages enable better penetration through dense anatomical structures, producing images with higher contrast and reduced noise. Lower tube voltages are typically employed for specialized applications, such as pediatric imaging, to minimize radiation dose [25]. Altering tube voltage influences image contrast. Higher tube voltages enhance the visualization of high-density structures (e.g., bones), while lower voltages improve the differentiation of low-density structures (e.g., soft tissues).

  3. Slice thickness: slice thickness refers to the thickness of each image slice acquired during the CT scan. Smaller slice thickness improves spatial resolution, allowing for better visualization of fine anatomical details. However, reducing slice thickness also increases the number of slices acquired and the overall radiation dose [26].

  4. Pitch value: the pitch is the distance that the X-ray tube moves along the patient’s body for each rotation of the gantry. A higher pitch shortens scan times, but it also produces images with lower spatial resolution [27].

  5. Reconstruction algorithm: the reconstruction algorithm employed during image reconstruction affects the final image quality. Different algorithms can be used to balance image sharpness, noise reduction, and artifact suppression. Iterative reconstruction algorithms have gained popularity due to their ability to enhance image quality while reducing noise levels [28].

Changing acquisition parameters can significantly modify the appearance and features of CT scan images and thus may drastically reduce the effectiveness of automatic diagnosis methods. In a study conducted by Berenguer et al. [29], the impact of changing CT acquisition parameters on the reproducibility of radiomics features, which are image features extracted from CT scans, has been investigated. The analysis relies on the evaluation of the effects of altering parameters on the same CT scan as well as comparing different CT scans using the same parameters. It shows that a significant number of features exhibited redundancy and were non-reproducible, indicating limitations in the robustness of certain radiomics features when acquisition parameters are varied.

In our study, and in order to properly address this point, we have conducted our experiments on a highly diversified dataset, as explained hereafter.

3.2 Datasets

Our main contribution consists in assessing DL models for Covid-19 diagnosis and in studying their generalization capabilities. For this purpose, special attention has been given to the dataset on which the evaluation is performed.

In order to have the largest possible dataset, we have created a new one as a collection of seven different datasets. These datasets have been selected from a multinational cohort of Covid-19 and healthy subjects with clinically verified findings. In addition, we have considered 50 cases from the Brazilian dataset [14] and we have collected a new dataset containing 55 Algerian CT scans.

One of the novel aspects of this study is the effort to gather and to compile all of these datasets and to create a new one towards carrying out experiments in a real-world setting.

Table 1 Description of the CT-scan datasets that have been used for training and testing

A detailed description of the dataset is given in Table 1. The total number of CT images in our datasets is 95,733 for the Covid-19 cases and 55,944 for the healthy subjects. We have however noticed that in 7.27% of these images (that is 11,024 images), the lung parenchyma is not visible. It corresponds to “non-relevant” slices that are outside the lung field or at the end of the lung. These images are not relevant for the Covid-19 diagnosis and have therefore to be identified and discarded. This way, three classes have been considered: “Covid-19”, “Healthy” and “Non-relevant”.

On the other hand, to achieve balanced classes, data augmentation using horizontal flips and positive and negative rotations (at 5\(^\circ\), 10\(^\circ\), 15\(^\circ\), 20\(^\circ\)) has been employed. The healthy and the non-relevant classes have been augmented with 40,000 and 80,000 additional images, respectively.

The final number of used images is as follows:

  • 95,733 images for the “Covid-19” class,

  • 95,944 images for the “Healthy” class, and,

  • 91,024 images for the “Non-relevant” class.
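The class totals listed above can be checked with a few lines of arithmetic. The sketch below assumes that the 40,000 and 80,000 figures are numbers of images added by augmentation, which is what makes the totals match. The way flips and rotations are combined into candidate transforms is purely illustrative, since the text does not say whether they are applied jointly.

```python
from itertools import product

# Image counts before augmentation, as reported in the text.
original = {"Covid-19": 95_733, "Healthy": 55_944, "Non-relevant": 11_024}

# Images added by augmentation (assumption: these are additions, not targets).
added = {"Covid-19": 0, "Healthy": 40_000, "Non-relevant": 80_000}

final = {name: original[name] + added[name] for name in original}

# Hypothetical transform set: an optional horizontal flip combined with a
# signed rotation; the identity (no flip, no rotation) is excluded.
angles = [0] + [sign * a for a in (5, 10, 15, 20) for sign in (1, -1)]
transforms = [
    (flip, angle)
    for flip, angle in product((False, True), angles)
    if flip or angle != 0
]

for name, count in final.items():
    print(f"{name}: {count} images")
print(f"{len(transforms)} candidate flip/rotation combinations")
```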

For the Covid-19 and healthy classes, right and left lungs have been segmented in order to keep only the parts where the lung parenchyma is visible. The observation of pneumonia is only apparent in these regions. This allows for the removal of non-discriminating parts such as the scanner table. Therefore, the deep learning model automatically focuses on abnormal areas that show consistent characteristics with the reported radiological findings.

Regarding DZ-CovidScan, a specialized physician has examined each CT scan in our dataset and has rendered a diagnosis. We were not able to precisely label every CT-Scan image, though. This is an extremely time-consuming and expensive task that should only be carried out by a qualified physician. In Table 1, the numbers of healthy (resp. Covid-19) images of DZ-CovidScan (mentioned in the last column) refer to the total number of images belonging to healthy (resp. Covid-19) CT-scans.

3.3 Automatic diagnosis using CNNs

The problem of Covid-19 diagnosis is expressed in our case as an image classification problem, where each image (a slice) of the CT scan has to be classified as “Covid-19”, “Healthy” or “Non-relevant”. For this kind of problem, CNNs are widely considered as the reference solution. Therefore, we have selected three families among the most accurate and the most effective CNN architectures [5,6,7,8, 15]. These have been used within a training methodology based on transfer learning and have been evaluated. The most effective models have been further customized. The selected CNNs and the proposed methodology are described hereafter.

3.3.1 Selected CNN architectures and considered variants

  1. ResNet: an architecture that provides skip connections between earlier and later layers, easing the back-propagation of the gradient during training. We have used in particular ResNet18 (built from residual blocks that are two layers deep) and ResNet50 (built from bottleneck blocks that are three layers deep).

  2. DenseNet: this architecture has the same design as ResNet, but it connects all previous and subsequent layers densely. The objective is to achieve better performance than ResNet with lower computational cost. We have tested DenseNet121, DenseNet161 and DenseNet201. Each architecture consists of four DenseBlocks with a different number of layers.

  3. EfficientNet: this architecture uses a compound scaling method to find the best combination of three dimensions: network width, network depth and image resolution. We have tried four versions: b1, b3, b5 and b7. These only differ in the number of feature maps.

3.3.2 The proposed training methodology and architecture

In order to make use of the selected CNN architectures and to customize them for our application, we have proposed a two-step training procedure, as depicted in Fig. 1:

  1. In the first step, we mainly rely on transfer learning and fine-tuning. We have considered CNN models starting with weights obtained for ImageNet. The architecture has also been modified: the output layer (after the fully connected part) has been adapted to our context by setting its number of units to 3 (instead of 1000, the number of ImageNet classes). Using our training data, these models have then been fine-tuned.

  2. In the second step, we first select the CNN models that achieve the best classification performance and that contain the fewest parameters. The selected CNN models are further adapted, optimized and trained using transfer learning. The classification part of the CNN architectures is in particular redesigned as shown in Fig. 1. The weights of the feature extraction layers are frozen and the learning focuses on the weights of the classification part. The SELU activation function has been used; it has been chosen in order to speed up the training process [44]. Moreover, dropout with a rate of 0.5 has been used along with batch normalization. Again, the classification (output) layer has been designed using 3 units corresponding to the three classes, and its activation function is SoftMax. This provides pseudo-probabilities for each input image with respect to the three classes.
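To illustrate the forward pass of such a classification part at inference time, here is a minimal pure-Python sketch. The single hidden layer, its size and the toy weights are hypothetical (the actual layout is the one given in Fig. 1), and dropout and batch normalization are left out for brevity. The SELU constants are the standard published ones and are not specific to this work.

```python
import math

# Standard SELU constants (general definition of the activation).
SELU_SCALE = 1.0507009873554805
SELU_ALPHA = 1.6732632423543772

CLASSES = ("Covid-19", "Healthy", "Non-relevant")


def selu(x):
    """Scaled exponential linear unit applied to one value."""
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))


def softmax(logits):
    """Turn raw scores into pseudo-probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def classify(features, hidden_w, hidden_b, out_w, out_b):
    """Hypothetical head: dense -> SELU -> dense (3 units) -> SoftMax."""
    hidden = [selu(v) for v in dense(features, hidden_w, hidden_b)]
    return dict(zip(CLASSES, softmax(dense(hidden, out_w, out_b))))


# Toy example with made-up weights and a 2-dimensional feature vector.
probabilities = classify(
    features=[1.0, -1.0],
    hidden_w=[[0.5, 0.5], [1.0, -1.0]], hidden_b=[0.0, 0.0],
    out_w=[[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], out_b=[0.0, 0.0, 0.0],
)
print(probabilities)
```

In the setting described above, only the weights of this head would be updated during the second step, the feature extraction layers being frozen.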

Fig. 1 Covid-19 diagnosis using CT scan images: training procedure making use of the selected CNN models

4 Experiments

In order to perform a thorough assessment, we have considered all of the datasets we have collected and that are described in Table 1. The first 7 datasets (CNCB, iCTCF, Negin Center, LIDC and IDRI, ITAC, MosMedData and Radiopedia) have been used for training (fine-tuning and transfer learning) and for testing. In order to take full advantage of this data and to avoid the potential bias of a single partitioning of the datasets, we have conducted this study using a fivefold cross-validation procedure. In each fold, the data of every target class are split into training and test sets using a ratio of 4:1.
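A fivefold partitioning with a 4:1 train/test ratio in every target class can be sketched as follows. This is only an illustration under stated assumptions: the function name `stratified_kfold` is hypothetical, and the split is done per image index, whereas the text does not specify whether folds are formed per image or per patient.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k folds.

    Each class is shuffled and dealt out evenly over the k folds, so every
    fold keeps roughly the class proportions of the full dataset.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for i in range(k):
        test = sorted(folds[i])
        train = sorted(index for j, fold in enumerate(folds) if j != i
                       for index in fold)
        yield train, test


# Toy example: 10 images per class, hence 24 training and 6 test images per fold.
labels = ["Covid-19"] * 10 + ["Healthy"] * 10 + ["Non-relevant"] * 10
splits = list(stratified_kfold(labels, k=5))
for fold_number, (train, test) in enumerate(splits, start=1):
    print(f"Fold{fold_number}: {len(train)} train / {len(test)} test")
```

When slices of the same CT scan are strongly correlated, dealing out patient identifiers instead of image indices keeps all slices of a patient in the same fold.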

The input layer size is 224\(\times\)224. Batch normalization has been used, with batch sizes ranging from 8 to 64. The learning rate starts at 0.00001 with the SGD optimizer. All of the CNNs have been fine-tuned for 20 epochs, whereas the second learning step has been carried out over 10 epochs.

In order to further assess generalization, we have performed two additional external evaluations using two datasets that were completely unseen during the training procedure: BR-SARS-CoV-2 and DZ-CovidScan. The quality and the effectiveness of the automatic diagnosis using CNN models have been measured per patient (for both datasets) and per image (for BR-SARS-CoV-2).

4.1 Results of the 1\(^\text {st}\) learning step (fine-tuning)

The results obtained after the first step of fine-tuning, using all of the considered CNN models, the seven datasets and the fivefold cross-validation procedure, are presented in Table 2.

For each fine-tuned model, we provide the number of parameters of the model, the performance with respect to the evaluation metrics (precision, recall, specificity and accuracy) for each fold and the average performance across the 5 folds. All values are given in percentages and the best results are highlighted in bold based on a color code related to the evaluation metrics.
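For a three-class problem, these four metrics can be derived from a confusion matrix. The sketch below uses a one-vs-rest view of each class followed by a plain average over classes. This averaging scheme is an assumption, as the text does not state how per-class values are combined, and the confusion matrix is a made-up toy example.

```python
def one_vs_rest_metrics(confusion):
    """Macro-averaged metrics (in %) from a square confusion matrix.

    confusion[i][j] is the number of samples of true class i that were
    predicted as class j. Every class is assumed to be predicted at least
    once and to have at least one sample, otherwise a division by zero occurs.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    per_class = []
    for c in range(n_classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(n_classes)) - tp
        tn = total - tp - fn - fp
        per_class.append({
            "precision": tp / (tp + fp),
            "recall": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "accuracy": (tp + tn) / total,
        })
    return {
        metric: 100.0 * sum(values[metric] for values in per_class) / n_classes
        for metric in per_class[0]
    }


# Toy confusion matrix; rows and columns: Covid-19, Healthy, Non-relevant.
toy_confusion = [
    [98, 1, 1],
    [2, 97, 1],
    [0, 1, 99],
]
metrics = one_vs_rest_metrics(toy_confusion)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}%")
```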

Table 2 Results of different fine-tuned CNN models for Covid-19 diagnosis

Among the fine-tuned CNN models, EfficientNet-b3 achieves the best results on Fold1 (96.14% precision, 96.04% recall) and on Fold2 (98.98% precision, 98.98% recall). EfficientNet-b5 performs best on Fold5, although this fold yields the lowest results with respect to all evaluation metrics: the precision and recall scores are 95.17% and 94.35%, respectively. DenseNet201 achieves the best result on Fold3 with a precision of 98.99% and a recall of 98.99%. On Fold4, the highest prediction performance, with a precision of 99.07% and a recall of 99.05%, is also given by DenseNet201. ResNet18 and ResNet50 do not stand out in the classification task compared to the DenseNet and EfficientNet variants on the five folds. Lastly, a difference in performance can be noticed for all networks on folds 1 and 5, with minima at 91% and maxima at 96% for all metrics combined, compared to folds 2, 3 and 4, which fluctuate between 97% and 99.5%.

4.2 Results of the 2\(^\text {nd}\) learning step (transfer learning)

Based on the results obtained during the first learning step and the number of parameters of the CNN models, we have selected 6 models to be used for the second learning step. As explained in Sect. 3, this step relies on transfer learning.

After applying transfer learning to the selected models and modifying the classification part (i.e., fully connected layers, as described in Sect. 3), the overall improved performances for ResNet, DenseNet and EfficientNet are summarized in Table 3. A symbol [*] is added to the names of models resulting from transfer learning to distinguish them from the original models fine-tuned for our task.

For Fold1, the best classification evaluation gains more than 0.3%. Fold5, despite being the least well predicted, has an improvement of about +1% for all metrics. Fold1 and Fold3 experience a slight reduction in classification performance with, respectively, \(-\)0.1% and \(-\)0.2% across all metrics, but the results remain at a similar level of accuracy with over 99%. The classification done by the modified DenseNet121 on Fold4 shows the best performance of all the tests, with precision, recall, specificity and accuracy values of 99.18%, 99.17%, 99.59% and 99.45%, respectively. Finally, the best average precision (97.75%), recall (97.73%), specificity (98.86%) and accuracy (98.48%) have been reached by the modified DenseNet201.

Table 3 Performance of the six selected fine-tuned CNN models for Covid-19 diagnosis after the second learning step (based on transfer learning)

4.3 Focus on generalization

Figs. 2 and 3 show the area under the curve (AUC) of the six selected CNN models on DZ-CovidScan and BR-SARS-CoV-2, respectively. Here, the evaluation has been carried out on a per patient basis, that is, all the images of a CT scan are classified and the objective is to find the correct diagnosis of the patient. Majority voting has been used in order to decide to which class the patient belongs (Covid-19 or not).

As for Fig. 4, it shows the AUC of the tests performed on BR-SARS-CoV-2 but on a per image basis, that is, each image of a CT scan is classified and the evaluation metrics are computed considering all CT scan images.

We recall that these two datasets have not been considered at all during training. The aim of this evaluation is to assess the effective generalization capabilities of CNN models.
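The per patient decision rule mentioned above (majority voting over the slices of a CT scan) can be sketched as follows. Two details are assumptions made for this illustration: slices predicted as “Non-relevant” are ignored, and a tie is resolved as “Not Covid-19”. The function name `diagnose_patient` is hypothetical.

```python
from collections import Counter


def diagnose_patient(slice_predictions, ignored_label="Non-relevant"):
    """Aggregate per-slice predictions into one per-patient decision.

    Returns "Covid-19" when strictly more than half of the relevant slices
    are predicted as Covid-19, "Not Covid-19" otherwise, and None when the
    scan contains no relevant slice at all.
    """
    votes = Counter(p for p in slice_predictions if p != ignored_label)
    relevant = sum(votes.values())
    if relevant == 0:
        return None
    return "Covid-19" if votes["Covid-19"] > relevant / 2 else "Not Covid-19"


# Toy example: 3 Covid-19 votes against 2 Healthy votes.
scan = ["Non-relevant", "Covid-19", "Covid-19", "Healthy", "Covid-19", "Healthy"]
print(diagnose_patient(scan))
```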

Fig. 2 All CNN models: AUC per patient on the DZ-CovidScan dataset

Fig. 3 All CNN models: AUC per patient on the BR-SARS-CoV-2 dataset

Fig. 4 All CNN models: AUC per image on the BR-SARS-CoV-2 dataset

5 Result analysis and discussion

The fine-tuning phase followed by transfer learning on 6 CNN models has shown an excellent ability to perform Covid-19 diagnosis on seven different datasets. In our experiments, we have even obtained a better accuracy than the one reported in the literature: 99.45% using DenseNet121* with respect to 99.0% that has been achieved by MobileNet and EfficientNet in [13].

Our experiments show however that the generalization capability in this context is debatable. On the BR-SARS-CoV-2 and DZ-CovidScan datasets, we have noticed that all Covid-19-positive patients have been correctly predicted by our CNN models, but 6 healthy patients out of the 28 in the DZ-CovidScan dataset and 14 healthy patients out of the 25 in the BR-SARS-CoV-2 dataset have been wrongly classified as Covid-19-positive cases. Here, CNN models have misclassified between 1 and 20 slices in each CT scan. This high ratio of false positives in Covid-19 diagnosis using CT scans has also been reported in [34], but our results using completely separate test datasets (BR-SARS-CoV-2 and DZ-CovidScan) highlight a larger gap in accuracy (with respect to the datasets used for training).
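To make this gap concrete, the patient-level specificity implied by the counts above can be worked out directly, taking the reported numbers of misclassified healthy patients at face value:

\[
\text{specificity}_{\text{DZ-CovidScan}} = \frac{28 - 6}{28} \approx 78.6\%, \qquad
\text{specificity}_{\text{BR-SARS-CoV-2}} = \frac{25 - 14}{25} = 44.0\%.
\]

Since all Covid-19-positive patients are correctly predicted, the corresponding patient-level sensitivity is 100% on both datasets, so the loss of performance is entirely due to false positives.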

On the other hand, we can also notice that the evaluation on a per patient basis is more conclusive in the case of BR-SARS-CoV-2 (with respect to DZ-CovidScan). Here, the main difference between the two datasets lies in the number of images per CT scan: it is about 20 for BR-SARS-CoV-2 compared to 300 for DZ-CovidScan. This suggests that it is not worth using a high number of images per CT scan. Very close images are likely to be redundant and do not necessarily improve the diagnosis.

We also notice that the best performance on a per image basis is obtained using EfficientNetb3* on the BR-SARS-CoV-2 dataset (AUC, recall and precision of, respectively, 84.6%, 72.4% and 95.83%). These results are better than the ones reported in [45], which followed a similar evaluation methodology of strict testing on a separate dataset. Moreover, the architecture of our CNN model has fewer parameters than the ResNet50 used in [45].

In order to try to explain how predictions are performed and which parts of CT-scan images are more relevant for the diagnosis, we have employed a Grad-CAM visualization, as advised by several reviews [5,6,7,8, 15]. The objective is to make the model predictions more interpretable and explainable. Grad-CAM begins by isolating a specific part in the input image, where pixels not belonging to this part are set to zero, and the others are set to one. This modified image is then processed by the CNN, which computes an output score for the class. The method calculates how much each feature map in the last convolutional layer of the network contributes to this score. The importance of each feature map is assessed by averaging its contributions across all pixels, reflecting its role in the prediction process. A heat-map is then generated using these averages. This heat-map, created by emphasizing only the positive influences on the class prediction, visually highlights the regions in the image that were most pivotal in identifying the class, be it COVID-19, healthy tissue, or closed lung.
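The weighting and heat-map steps described above can be illustrated on plain nested lists. This is a minimal sketch of the usual Grad-CAM formulation, not the exact code used in this study. It assumes that the feature maps of the last convolutional layer and the gradients of the class score with respect to them have already been obtained from the network, and it leaves out the final upsampling of the heat-map to the input image size.

```python
def grad_cam(feature_maps, gradients):
    """Compute a normalized Grad-CAM heat-map.

    feature_maps[k][i][j]: activation of feature map k at position (i, j).
    gradients[k][i][j]: derivative of the class score w.r.t. that activation.
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])

    # Importance of each feature map: its gradient averaged over all positions.
    weights = [sum(sum(row) for row in grad) / (height * width)
               for grad in gradients]

    # Weighted sum of the feature maps.
    heatmap = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(weights, feature_maps):
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += weight * fmap[i][j]

    # Keep only the positive influences on the class score (ReLU).
    heatmap = [[max(0.0, value) for value in row] for row in heatmap]

    # Scale to [0, 1] for display, unless the map is entirely zero.
    peak = max(max(row) for row in heatmap)
    if peak > 0:
        heatmap = [[value / peak for value in row] for row in heatmap]
    return heatmap


# Toy example: two 2x2 feature maps with made-up gradients.
toy_maps = [[[1.0, 0.0], [0.0, 0.0]],
            [[0.0, 0.0], [0.0, 2.0]]]
toy_grads = [[[0.4, 0.4], [0.4, 0.4]],
             [[-0.2, -0.2], [-0.2, -0.2]]]
toy_heatmap = grad_cam(toy_maps, toy_grads)
print(toy_heatmap)
```

In the toy example, the second feature map has a negative average gradient, so the region it activates is suppressed by the ReLU step and only the top-left position remains highlighted.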

In this experiment, we have considered DenseNet201* (one of the most effective models) and we have selected 8 true-positive Covid-19 images and 16 misclassified cases (8 false positives and 8 false negatives). Grad-CAM has then been applied to these images, and the results obtained are shown in Fig. 5. We can observe that, regardless of whether the lobe is superior, middle or inferior, and whether the lesions are located unilaterally or bilaterally, Grad-CAM highlights the affected areas to support the correct Covid-19 diagnosis of the true-positive cases.

Visualization of the misclassified healthy images revealed that the CNNs considered both pulmonary nodules and benign abnormalities as Covid-19 lesions. These elements have thus been used as false discriminating criteria in the prediction. This shows that it is essential to collect a variety of lung characteristics to teach the neural networks which lung lesions are valuable in the detection of Covid-19.

For the false-negative cases, we can observe that the heat-maps focus on the entire lung parenchyma without finding the Covid-19 features. Indeed, Covid-19 opacities are sparse in these images and are not revealed by the models. It is possible that the acquisition process of the datasets used during training has an impact on the computation of the CNN weights and on the predictions.

Fig. 5

Grad-CAM applied on CT-scan images of true-positive Covid-19 cases and misclassified cases (false positives or false negatives)

6 Conclusion

In this study, we have highlighted the true potential of Covid-19 diagnosis based on DL models and we have thoroughly assessed their generalization capabilities. To that end, we have used three families among the most effective CNN models (ResNet, DenseNet and EfficientNet). A training methodology has been proposed in order to create new models that are dedicated to Covid-19 diagnosis. A dataset of CT scans from at least 13 countries has also been used to train the CNN models, together with a cross-validation procedure to assess the classification quality of the different architectures considered. To assess the generalization capabilities, experiments have been conducted on two other datasets: BR-SARS-CoV-2 and DZ-CovidScan.

We have conducted our study with the objective of remaining very close to real-world settings. The results obtained show that trained CNN models are still relatively sensitive to irrelevant elements of the scan images (which are likely due to varying acquisition protocols). This is why we advocate the creation of a larger international public dataset in order to foster the improvement of Covid-19 diagnosis models. We will thus be able to concretely federate our efforts to face scientific challenges and the aftermath of possible new coronavirus pandemics.

Availability of data and materials

The new dataset DZ-CovidScan is available at the following URL: https://diag.cerist.dz/datasets/DZ-CovidScan.zip.

Notes

  1. A detailed description of the chosen public datasets taken into consideration for our research may be found in Section 3.2.

  2. It should be noted that, due to the lack of image-level annotation in DZ-CovidScan (as explained in Section 3.2), it is not possible to carry out the same experiment on this dataset (i.e., a per-image evaluation).

Abbreviations

AUC: Area under the curve

CNN: Convolutional neural network

CT: Computed tomography

DL: Deep learning

GGO: Ground-glass opacity

References

  1. N. Miyashita, Y. Nakamori, M. Ogata, N. Fukuda, A. Yamura, Y. Ishiura, S. Nomura, Early identification of novel coronavirus (covid-19) pneumonia using clinical and radiographic findings. J Infect Chemother 28(5), 718–721 (2022)

  2. D. Toussie, N. Voutsinas, M. Chung, A. Bernheim, Imaging of covid-19. Semin. Roentgenol. 57(1), 40–52 (2022)

  3. A. Anka, M. Tahir, S.D. Abubakar, M. Alsabbagh, Z. Zian, H. Hamedifar, A. Sabzevari, G. Azizi, Coronavirus disease 2019 (covid-19): An overview of the immunopathology, serological diagnosis and management. Scand. J. Immunol. 93, 12998 (2021)

  4. A. Kovács, P. Palásti, D. Veréb, B. Bozsik, A. Palkó, Z.T. Kincses, The sensitivity and specificity of chest ct in the diagnosis of covid-19. Eur. Radiol. 31, 2819–2824 (2021)

  5. M. Moezzi, K. Shirbandi, H. Shahvandi, B. Arjmand, F. Rahim, The diagnostic accuracy of artificial intelligence-assisted ct imaging in covid-19 disease: A systematic review and meta-analysis. Inform. Med. Unlocked. 24, 100591 (2021)

  6. H. Mohammad-Rahimi, M. Nadimi, A. Ghalyanchi-Langeroudi, M. Taheri, S. Ghafouri-Fard, Application of machine learning in diagnosis of covid-19 through x-ray and ct images: A scoping review. Front. Cardiovasc. Med. 8, 638011 (2021)

  7. ...J. Suri, S. Agarwal, S. Gupta, A. Puvvula, M. Biswas, L. Saba, A. Bit, G. Tandel, M. Agarwal, A. Patrick, G. Faa, I. Singh, R. Oberleitner, M. Turk, P. Chadha, A. Johri, J.M. Sanches, N. Khanna, K. Viskovic, S. Mavrogeni, J. Laird, G. Pareek, M. Miner, D. Sobel, A. Balestrieri, P. Sfikakis, G. Tsoulfas, A. Protogerou, D. Misra, V. Agarwal, G. Kitas, P. Ahluwalia, J. Teji, M. Al-Maini, S. Dhanjil, M. Sockalingam, A. Saxena, A. Nicolaides, A. Sharma, V. Rathore, J. Ajuluchukwu, M. Fatemi, A. Alizad, V. Viswanathan, P. Krishnan, S. Naidu, A narrative review on characterization of acute respiratory distress syndrome in covid-19-infected lungs using artificial intelligence. Comput. Biol. Med. 130, 104210 (2021)

  8. N. Benameur, R. Mahmoudi, S. Zaid, Y. Arous, B. Hmida, M. Bedoui, Sars-cov-2 diagnosis using medical imaging techniques and artificial intelligence: A review. Clin. Imaging. 76, 6–14 (2021)

  9. W. Hryniewska, P. Bombiński, P. Szatkowski, P. Tomaszewska, A. Przelaskowski, P. Biecek, Checklist for responsible deep learning modeling of medical images based on covid-19 detection studies. Pattern Recognition. 118, 108035 (2021)

  10. F. Mehboob, A. Rauf, R. Jiang, A.K.J. Saudagar, K.M. Malik, M.B. Khan, M.H.A. Hasnat, A. AlTameem, M. AlKhathami, Towards robust diagnosis of covid-19 using vision self-attention transformer. Sci. Rep. 12(1), 8922 (2022)

  11. J. Mozaffari, A. Amirkhani, S.B. Shokouhi, A survey on deep learning models for detection of covid-19. Neural Comput. Appl. 35(23), 16945–16973 (2023)

  12. A. Agnihotri, N. Kohli, Challenges, opportunities, and advances related to covid-19 classification based on deep learning. Data Sci. Manage. 6(2), 98–109 (2023)

  13. H. Gunraj, A. Sabri, D. Koff, A. Wong, Covid-net ct-2: Enhanced deep neural networks for detection of covid-19 from chest ct images through bigger, more diverse learning. Front. Med. 8, 729287 (2022)

  14. E. Soares, P. Angelov, S. Biaso, M. Froes, D. Abe, Sars-cov-2 ct-scan dataset: A large dataset of real patients ct scans for sars-cov-2 identification. MedRxiv. (2020). https://doi.org/10.1101/2020.04.24.20078584

  15. J. Born, D. Beymer, D. Rajan, A. Coy, V. Mukherjee, M. Manica, P. Prasanna, D. Ballah, M. Guindy, D. Shaham, P. Shah, E. Karteris, J. Robertus, M. Gabrani, M. Rosen-Zvi, On the role of artificial intelligence in medical imaging of covid-19. Patterns N. Y. N. 2(6), 100269 (2021)

  16. R.R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision. 618–626 (2017)

  17. H. Ko, H. Chung, W. Kang, K. Kim, Y. Shin, S. Kang, J. Lee, Y. Kim, N. Kim, H. Jung, J. Lee, Covid-19 pneumonia diagnosis using a simple 2d deep learning framework with a single chest ct image: Model development and validation. J. Med. Internet Res. 22, 19569 (2020)

  18. ...H. Bai, R. Wang, Z. Xiong, B. Hsieh, K. Chang, K. Halsey, T. Tran, J. Choi, D. Wang, L. Shi, J. Mei, X. Jiang, I. Pan, Q. Zeng, P. Hu, Y. Li, F. Fu, R. Huang, R. Sebro, Q. Yu, M. Atalay, W. Liao, Artificial intelligence augmentation of radiologist performance in distinguishing covid-19 from pneumonia of other origin at chest ct. Radiology. 299, 225 (2021)

  19. M. Tan, Q.V. Le, Efficientnet: Rethinking model scaling for convolutional neural networks. Int. Conf. Machine Learn. 97, 6105 (2019)

  20. K. Gao, J. Su, Z. Jiang, L. Zeng, Z. Feng, H. Shen, P. Rong, X. Xu, J. Qin, Y. Yang, W. Wang, D. Hu, Dual-branch combination network (dcn): Towards accurate diagnosis and lesion segmentation of covid-19 using ct images. Med. Image Anal. 67, 101836 (2021)

  21. ...B. Wang, S. Jin, Q. Yan, H. Xu, C. Luo, L. Wei, W. Zhao, X. Hou, W. Ma, Z. Xu, Z. Zheng, W. Sun, L. Lan, W. Zhang, X. Mu, C. Shi, Z. Wang, J. Lee, Z. Jin, M. Lin, H. Jin, L. Zhang, J. Guo, B. Zhao, Z. Ren, S. Wang, W. Xu, X. Wang, J. Wang, Z. You, J. Dong, Ai-assisted ct imaging analysis for covid-19 screening: Building and deploying a medical ai system. Appl. Soft Comput. 98, 106897 (2021)

  22. Y. Song, S. Zheng, L. Li, X. Zhang, X. Zhang, Z. Huang, J. Chen, R. Wang, H. Zhao, Y. Zha, J. Shen, Y. Chong, Y. Yang, Deep learning enables accurate diagnosis of novel coronavirus (covid-19) with ct images. IEEE/ACM Trans. Comput. Biol. Bioinform. 18(6), 2775–2780 (2021)

  23. W. Naudé, Artificial intelligence vs covid-19: limitations, constraints and pitfalls. AI Soc. 35(3), 761–765 (2020)

  24. K. Ledenius, M. Gustavsson, S. Johansson, F. Stalhammar, L. Wiklund, A. Thilander-Klang, Effect of tube current on diagnostic image quality in paediatric cerebral multidetector ct images. Br. J. Radiol. 82(976), 313–320 (2009)

  25. Y. Murakami, S. Kakeda, K. Kamada, N. Ohnari, J. Nishimura, M. Ogawa, K. Otsubo, Y. Morishita, Y. Korogi, Effect of tube voltage on image quality in 64-section multidetector 3d ct angiography: evaluation with a vascular phantom with superimposed bone skull structures. Am. J. Neuroradiol. 31(4), 620–625 (2010)

  26. M. Alshipli, N.A. Kabir, Effect of slice thickness on image noise and diagnostic content of single-source-dual energy computed tomography. J. Phys. Conf. Series. 851, 012005 (2017)

  27. M.M. Lell, M. May, P. Deak, S. Alibek, M. Kuefner, A. Kuettner, H. Köhler, S. Achenbach, M. Uder, T. Radkow, High-pitch spiral computed tomography: effect on image quality and radiation dose in pediatric chest computed tomography. Invest. Radiol. 46(2), 116–123 (2011)

  28. K. Jensen, A.C.T. Martinsen, A. Tingberg, T.M. Aaløkken, E. Fosse, Comparing five different iterative reconstruction algorithms for computed tomography in an roc study. Eur. Radiol. 24, 2989–3002 (2014)

  29. R. Berenguer, M.D.R. Pastor-Juan, J. Canales-Vázquez, M. Castro-García, M.V. Villas, F. Mansilla Legorburo, S. Sabater, Radiomics of ct features may be nonreproducible and redundant: Influence of ct acquisition parameters. Radiology 288(2), 407–415 (2018)

  30. ...K. Zhang, X. Liu, J. Shen, Z. Li, Y. Sang, X. Wu, Y. Zha, W. Liang, C. Wang, K. Wang, L. Ye, M. Gao, Z. Zhou, L. Li, J. Wang, Z. Yang, H. Cai, J. Xu, L. Yang, W. Cai, W. Xu, S. Wu, W. Zhang, S. Jiang, L. Zheng, X. Zhang, L. Wang, L. Lu, J. Li, H. Yin, W. Wang, O. Li, C. Zhang, L. Liang, T. Wu, R. Deng, K. Wei, Y. Zhou, T. Chen, J. Lau, M. Fok, J. He, T. Lin, W. Li, G. Wang, Clinically applicable ai system for accurate diagnosis, quantitative measurements, and prognosis of covid-19 pneumonia using computed tomography. Cell. 181, 1423–1433 (2020)

  31. CNCB. http://ncov-ai.big.ac.cn/download. Accessed 15 Dec 2021

  32. W. Ning, S. Lei, J. Yang, Y. Cao, P. Jiang, Q. Yang, J. Zhang, X. Wang, F. Chen, Z. Geng, L. Xiong, H. Zhou, Y. Guo, Y. Zeng, H. Shi, L. Wang, Y. Xue, Z. Wang, Open resource of clinical data from patients with pneumonia for the prediction of covid-19 outcomes via deep learning. Nat. Biomed. Eng. 4, 1197–1207 (2020)

  33. iCTCF. https://ngdc.cncb.ac.cn/ictcf. Accessed 22 Dec 2021

  34. M. Rahimzadeh, A. Attar, S. Sakhaei, A fully automated deep learning-based network for detecting covid-19 from a new and large lung ct scan dataset. Biomed. Signal Proc. Cont. 68, 102588–102588 (2021)

  35. Negin Center. https://github.com/mr7495/COVID-CTset. Accessed 3 Jan 2022

  36. ...S. Armato, G. McLennan, M.M.-G.L. Bidaut, C. Meyer, A. Reeves, B. Zhao, D. Aberle, C. Henschke, E. Hoffman, E. Kazerooni, H. MacMahon, E.V. Beeke, D. Yankelevitz, A. Biancardi, P. Bland, M. Brown, R. Engelmann, G. Laderach, D. Max, R. Pais, D. Qing, R. Roberts, A. Smith, A. Starkey, P. Batrah, P. Caligiuri, A. Farooqi, G. Gladish, C. Jude, R. Munden, I. Petkovska, L. Quint, L. Schwartz, B. Sundaram, L. Dodd, C. Fenimore, D. Gur, N. Petrick, J. Freymann, J. Kirby, B. Hughes, A. Casteele, S. Gupte, M. Sallamm, M. Heath, M. Kuhn, E. Dharaiya, R. Burns, D. Fryd, M. Salganicoff, V. Anand, U. Shreter, S. Vastagh, B. Croft, The lung image database consortium (lidc) and image database resource initiative (idri): a completed reference database of lung nodules on ct scans. Med. Phys. 38, 915–931 (2011)

  37. LIDC and IDRI. https://wiki.cancerimagingarchive.net/display/Public/LIDC-IDRI. Accessed 12 Jan 2022

  38. S. Harmon, T. Sanford, S. Xu, E. Turkbey, H. Roth, Z. Xu, D. Yang, A. Myronenko, V. Anderson, A. Amalou, M. Blain, M. Kassin, D. Long, N. Varble, S. Walker, U. Bagci, A. Ierardi, E. Stellato, G. Plensich, G. Franceschelli, C. Girlando, G. Irmici, D. Labella, D. Hammoud, A. Malayeri, E. Jones, R. Summers, P. Choyke, D. Xu, M. Flores, K. Tamura, H. Obinata, H. Mori, F. Patella, M. Cariati, G. Carrafiello, P. An, B. Wood, B. Turkbey, Artificial intelligence for the detection of covid-19 pneumonia on chest ct using multinational datasets. Nat. Commun. 11, 4080 (2020)

  39. ITAC. https://ngc.nvidia.com/catalog/containers/nvidia:clara:ai-covid-19. Accessed 18 Jan 2022

  40. S. Morozov, A. Andreychenko, N. Pavlov, A. Vladzymyrskyy, N. Ledikhova, V. Gombolevskiy, I. Blokhin, P. Gelezhe, A. Gonchar, V. Chernina, Mosmeddata: Chest ct scans with covid-19 related findings dataset. MedRxiv. 2005, 06465 (2020)

  41. MosMedData. http://www.kaggle.com/datasets/mathurinache/mosmeddata-chest-ct-scans-with-covid19?datasetId=863426. Accessed 25 Jan 2022

  42. Radiopaedia: COVID-19 pneumonia. https://radiopaedia.org/cases?lang=us. Accessed 26 Feb 2022

  43. BR-SARS-CoV-2 Data. https://www.kaggle.com/datasets/plameneduardo/sarscov2-ctscan-dataset. Accessed 30 Jan 2022

  44. S. Zhou, H. Greenspan, C. Davatzikos, J. Duncan, B.V. Ginneken, A. Madabhushi, J. Prince, D. Rueckert, R. Summers, A review of deep learning in medical imaging: Imaging traits, technology trends, case studies with progress highlights, and future promises. Proc. IEEE 109(5), 820–838 (2021)

  45. L. Aversano, M. Bernardi, M. Cimitile, R. Pecori, Deep neural networks ensemble to detect covid-19 from ct scans. Pattern Recogn. 120, 108135 (2021)

Acknowledgements

This work was partially supported by the AUF (grant agreement DRM-6862).

Author information

Corresponding author

Correspondence to Sid-Ahmed Berrani.

Ethics declarations

Competing interests

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Hadj Bouzid, A.I., Berrani, SA., Yahiaoui, S. et al. Deep learning-based Covid-19 diagnosis: a thorough assessment with a focus on generalization capabilities. J Image Video Proc. 2024, 40 (2024). https://doi.org/10.1186/s13640-024-00656-x

  • DOI: https://doi.org/10.1186/s13640-024-00656-x
