1 Introduction

Neoadjuvant chemotherapy (NAC) is the standard treatment for locally advanced breast cancer [1]. Favorable response to NAC is best measured by pathological complete response (pCR), in which the patient has no residual invasive disease in the breast or lymph nodes following treatment [2]. Only 10–50% of patients receiving NAC will achieve pCR, meaning that non-responsive patients endure the side effects and diminished quality of life of chemotherapy without its benefit [3]. However, a growing body of work suggests that computational medical image analysis can enable the prediction of response to NAC from dynamic contrast-enhanced (DCE) MRI. Radiomics, the high-throughput extraction and analysis of algorithmically defined features from radiological images, has been shown to predict response on pre-treatment DCE-MRI through attributes such as image texture [4]. Deep learning, in which a convolutional neural network is trained to identify novel predictive image patterns, has also been successfully applied to response prediction prior to treatment [5].

An emerging trend is that prediction of response to NAC is not only a matter of selecting the right computational tools, but also of where they are deployed. For instance, previous work has shown that supplementing radiomic features extracted from the tumor itself with texture features computed within the peri-tumoral region, the tumor's surrounding environment, enables improved prediction of pCR from pre-treatment DCE-MRI [4] and the identification of genotypes associated with response to targeted therapy [6]. A related study used deep learning on the tumor and peritumoral region of PET images to predict treatment response in esophageal cancer [7].

While several studies have investigated the fusion of radiomics and deep learning, a large portion employ naive ensembling strategies [8, 9] that weight each model equally, without consideration of their relative strengths and weaknesses. However, it is not clear that each type of representation is equally well suited to all parts of the region of interest (ROI). For instance, in the case of a lung nodule on CT, while CNNs might capture unique heterogeneity patterns pertaining to ground glass opacity within the nodule, radiomic edge operators might accentuate unique attributes relating to margin spiculation. In other words, spatially invoking specific representations might be a better mechanism for feature fusion than combining multiple representations within the same ROI.

In this work, we present Response Estimation through Spatially Oriented Neural Network and Texture Ensemble (RESONATE): a novel approach for fusing radiomic and deep learning data streams by invoking each in the spatial regions where it is most discriminative.

2 Previous Works and Novel Contributions

Despite the individual promise of radiomics and deep learning approaches, relatively few studies have explored how they can be combined. Antropova et al. [8] and Paul et al. [9] averaged the outputs and predictions of deep learning- and radiomics-based classifiers for diagnostic problems relating to breast and lung cancer, respectively. Others have explored more complex fusion strategies that attempt to account for differences in model performance by training a model across ensembled classifiers and data streams [10, 11]. A common theme of these approaches is that they leverage CNNs and radiomics to provide different types of representations from within the same region to boost classifier performance.

The approach presented in this work, RESONATE, differs from these previous approaches by using a spatial preference to invoke different types of representations, radiomics and deep learning, within different tumor subcompartments. The approach differentially weights representations based on their relative regional strengths within a fused regression model.

3 Methodology

Fig. 1. RESONATE workflow – Image segmentation (left) separates the patient's tumor into intratumoral (red) and peritumoral (blue) regions. Deep learning (bottom) and radiomic (top) classification is then performed separately within each region, and the predictions of each classifier are fused with a logistic regression classifier (right) to create a final prediction. (Color figure online)

3.1 Spatial Localization of Tumor Habitat

We define an image scene \(\mathcal {I}\) as a 3-dimensional spatial grid of voxels, corresponding to one phase of a DCE-MRI volume. Let \(\mathcal {T}\) represent a sub-volume of \(\mathcal {I}\) corresponding to a segmentation of a tumor. From \(\mathcal {T}\) we further define a peri-tumoral volume, \(\mathcal {P}\), which corresponds to the region surrounding the tumor, originating at the tumor border, and extending out to all voxels within some user-specified maximum physical distance.

3.2 Spatial-Specific Model Representation

The spatial regions, \(\mathcal {T}\) and \(\mathcal {P}\), were analyzed separately using deep learning and radiomic classification methods. For each \(\mathcal {I}\), let y denote an accompanying binary treatment outcome, where a value of 1 indicates successful response.

Deep Learning. Two models were created, one for each region \(\mathcal {S} \in \left\{ \mathcal {T}, \mathcal {P}\right\} \), using the following procedure. We define \(\mathcal {I}_\mathcal {S}\) as a volumetric box, a sub-volume of \(\mathcal {I}\) large enough to contain all points of \(\mathcal {S}\). To isolate the ROI, the intensity values of all voxels contained within \(\mathcal {I}_\mathcal {S}\) but not in \(\mathcal {S}\) are set equal to the mean intensity within \(\mathcal {S}\), preventing deep learning models from relying too heavily on annotation boundaries. A convolutional neural network, \(\mathcal {D}_\mathcal {S}\), is then trained using \(\mathcal {I}_\mathcal {S}\) from a number of different image volumes to predict a probability of response \(p_\mathcal {S}^\mathcal {D}\) as close as possible to y, as measured by some loss function \(\textit{L}(y,p_\mathcal {S}^\mathcal {D})\).
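A minimal sketch of this isolation step, assuming NumPy arrays for the bounding box \(\mathcal {I}_\mathcal {S}\) and the region mask (names are hypothetical):

```python
import numpy as np

def isolate_roi(box, roi_mask):
    """Fill voxels outside the ROI with the mean intensity inside it,
    so the network cannot key on the sharp annotation boundary."""
    out = box.copy()
    out[~roi_mask] = box[roi_mask].mean()
    return out
```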

Radiomics. Radiomic models were created for each region \(\mathcal {S} \in \left\{ \mathcal {T}, \mathcal {P}\right\} \). For every voxel within \(\mathcal {S}\), a set of unique radiomic descriptors is computed. The distribution of each radiomic descriptor across all voxels of \(\mathcal {S}\) is statistically summarized into a feature vector, and a feature selection algorithm is applied to this vector to create an optimal reduced set of features. A classifier \(\mathcal {R}_\mathcal {S}\) accepts this reduced feature set and outputs a response prediction \(p_\mathcal {S}^\mathcal {R}\).
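Using, for example, the five first-order statistics listed in Sect. 4.2, the summarization step might look as follows (a sketch under assumed array layouts; names are hypothetical):

```python
import numpy as np
from scipy import stats

def summarize_feature_maps(feature_maps, roi_mask):
    """Collapse voxelwise radiomic maps (one 3-D map per descriptor)
    into first-order statistics over the ROI."""
    vec = []
    for fmap in feature_maps:
        vals = fmap[roi_mask]
        vec += [vals.mean(), np.median(vals), vals.std(),
                stats.skew(vals), stats.kurtosis(vals)]
    return np.asarray(vec)
```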

3.3 Model Fusion

A logistic regression classifier, \(\mathcal {L}\), was designed to fuse the predictions of the classifiers \(\mathcal {C}_{\mathcal {S}}\), where \(\mathcal {C} \in \left\{ \mathcal {D}, \mathcal {R}\right\} \) and \(\mathcal {S} \in \left\{ \mathcal {T}, \mathcal {P}\right\} \). Each classifier \(\mathcal {C}_{\mathcal {S}}\) is first trained independently; \(\mathcal {L}\) is then trained using the predictions of each individual classifier as

$$\begin{aligned} \ln \left( \frac{p_{\mathcal {L}}}{1-p_{\mathcal {L}}}\right) = W_0 + \sum _{\mathcal {C} \in \left\{ \mathcal {D}, \mathcal {R}\right\} } \sum _{\mathcal {S} \in \left\{ \mathcal {T}, \mathcal {P}\right\} } W_{\mathcal {S}}^{\mathcal {C}}\, p_{\mathcal {S}}^{\mathcal {C}} \end{aligned}$$
(1)

where \(p_{\mathcal {S}}^{\mathcal {C}}\) represents the prediction output of classifier \(\mathcal {C}\) in region \(\mathcal {S}\) and \(W_{\mathcal {S}}^{\mathcal {C}}\) its learned weight. This model fusion approach allows \(\mathcal {L}\) to learn a weighted combination of per-patient predictions from each classifier based on the relative strengths of each representation and location, enabling a stronger ensemble prediction.
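In scikit-learn terms, Eq. (1) amounts to fitting a logistic regression on a matrix of per-patient classifier outputs. The sketch below uses random stand-in probabilities purely for illustration; it is not the authors' code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-ins for per-patient probabilities from D_T, D_P, R_T, R_P
# (one row per patient, one column per classifier-region pair).
train_probs = rng.uniform(size=(80, 4))
y_train = rng.integers(0, 2, size=80)

fuser = LogisticRegression().fit(train_probs, y_train)
# W_0 corresponds to fuser.intercept_; the four W weights to fuser.coef_.
p_fused = fuser.predict_proba(train_probs)[:, 1]
```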

4 Experimental Design

4.1 Data Description

The dataset consists of axial-plane breast DCE-MRIs of 114 patients with biopsy-proven breast cancer, acquired on a 1.5 or 3 T magnet prior to the administration of neoadjuvant chemotherapy [4, 12]. DCE-MRI acquisitions were collected across six separate contrast enhancement sequences for each patient, and three-frame intratumoral masks were annotated at the peak-enhancement sequence by a trained radiologist. Peritumoral masks were generated by expanding the intratumoral mask 3 mm outward. For all experiments, the patients were divided into a training (N = 80) and a held-out testing (N = 34) cohort. The training cohort was further stratified into three training (N = 53) and validation (N = 27) folds for cross-validation.

4.2 Implementation Details

Deep Learning: CNN inputs, of size \(150\times 110\times 3\times 3\) for the peritumoral network and \(146\times 104\times 3\times 3\) for the tumoral network, consist of 3-dimensional blocks centered at the region of interest with a fourth, temporal dimension corresponding to different phases of the DCE-MRI acquisition. The network (depicted in Fig. 1) consists of three convolutional blocks, each containing two convolutional layers with ReLU activation followed by batch normalization and max pooling layers. Each block sequentially widens the network by increasing the number of filters in its convolutional layers, from 8 to 16 to 32, all with kernels of size \(3\times 3\times 1\). After flattening of the final convolutional outputs, a small dense block follows, consisting of two dense layers (with 128 and 64 units, respectively), each followed by ReLU activation and 50% dropout. A final sigmoid layer computes the output probability. Data augmentation was performed by applying random rotations and spatial zooming, as well as by randomly sampling, in temporal order, from the 5 available DCE-MRI post-contrast acquisitions to combine with the pre-contrast scan as training input. Training used a binary cross-entropy loss and a stochastic gradient descent optimizer with Nesterov momentum of 0.9, a learning rate of 0.0005, and a learning rate decay of 0.01. Visual attention with guided back-propagation [13], via keras-vis [14], was applied post hoc to evaluate the image regions corresponding to a prediction of response.
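For concreteness, below is a minimal Keras sketch of the architecture described above, using the tumoral input size; padding, pooling strides, and the handling of learning-rate decay are assumptions, and the authors' exact configuration may differ:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(146, 104, 3, 3))  # 3-D block, temporal phases as channels
x = inputs
for filters in (8, 16, 32):                      # three widening convolutional blocks
    x = layers.Conv3D(filters, (3, 3, 1), activation="relu", padding="same")(x)
    x = layers.Conv3D(filters, (3, 3, 1), activation="relu", padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling3D(pool_size=(2, 2, 1))(x)
x = layers.Flatten()(x)
for units in (128, 64):                          # dense block with 50% dropout
    x = layers.Dense(units, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(
    # The paper's learning rate decay of 0.01 would be added via a
    # schedule in current Keras versions.
    optimizer=tf.keras.optimizers.SGD(learning_rate=5e-4, momentum=0.9,
                                      nesterov=True),
    loss="binary_crossentropy",
    metrics=[tf.keras.metrics.AUC()],
)
```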

Radiomics Classifier: Within the tumor and peri-tumoral region, a total of 99 radiomic texture features were extracted voxelwise on the DCE-MRI phase of peak contrast enhancement, including 25 Laws, 48 Gabor, 13 gray level co-occurrence matrix (GLCM), and 13 co-occurrence of local anisotropic gradient orientation (CoLlAGe) features. See [4] for a more detailed description of the radiomic feature set explored. Five first-order statistics - mean, median, standard deviation (SD), skewness, and kurtosis - were computed to describe the distribution of features within each region. A set of top features was chosen with a two-part feature selection scheme. First, the feature set was pruned to eliminate correlated features based on a maximum allowable Spearman correlation between features, with the retained feature of each correlated pair chosen by Wilcoxon rank-sum test. Second, two rounds of minimum redundancy maximum relevance (mRMR) feature selection were used to identify between 1 and 20 top features, optimized over 1000 iterations in cross-validation [4]. Top features within each fold were used to train several classifiers: naive Bayes, support vector machine (SVM), and diagonal linear discriminant analysis (DLDA). The optimal combination of correlation threshold for pruning, number of top features, and type of classifier was chosen based on cross-validated AUC within the training set, then applied to the testing set.
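A sketch of the correlation-based pruning step (Spearman correlation with Wilcoxon-guided retention); the threshold and tie-breaking details are assumptions:

```python
import numpy as np
from scipy import stats

def prune_correlated(X, y, max_rho=0.9):
    """Drop one feature from each pair with |Spearman rho| above max_rho,
    keeping the feature with the stronger Wilcoxon rank-sum separation."""
    rho, _ = stats.spearmanr(X)          # feature-feature correlation matrix
    rho = np.abs(rho)
    # Rank-sum statistic of each feature between responders and non-responders.
    score = np.array([abs(stats.ranksums(X[y == 1, j], X[y == 0, j]).statistic)
                      for j in range(X.shape[1])])
    keep = np.ones(X.shape[1], dtype=bool)
    for i in range(X.shape[1]):
        for j in range(i + 1, X.shape[1]):
            if keep[i] and keep[j] and rho[i, j] > max_rho:
                keep[j if score[i] >= score[j] else i] = False
    return np.flatnonzero(keep)
```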

Model Fusion: The fused model was developed in the following fashion:

1. Tuning individual models: Region-specific radiomic and deep learning classifiers were optimized via 3-fold cross-validation within the training set. Specific hyper-parameter details are provided in Sect. 5.1.

2. Evaluating model fusion by cross-validation: For each optimized model, validation-fold predictions were accumulated into a set of predicted response probabilities covering the full training set. Cross-validation was then repeated, this time training the logistic regression model weights on predictions from the training folds and evaluating on predictions from the validation fold.

3. Creating and testing final fusion models: The final regression model was trained on the accumulated cross-validation predictions from all 3 folds. A final version of each individual model was retrained on the full training set with the optimal hyper-parameters discovered in cross-validation. Fusion model predictions were then collected by passing the prediction outputs of each final individual model to the logistic regression classifier, as sketched below.
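A minimal sketch of steps 2 and 3, assuming base models exposing scikit-learn-style fit/predict_proba interfaces (all names are hypothetical):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

def out_of_fold_probs(model_factory, X, y, n_splits=3):
    """Accumulate validation-fold predictions covering the full training set."""
    oof = np.zeros(len(y))
    for tr, va in StratifiedKFold(n_splits=n_splits).split(X, y):
        model = model_factory().fit(X[tr], y[tr])
        oof[va] = model.predict_proba(X[va])[:, 1]
    return oof

# One column of out-of-fold probabilities per classifier-region pair
# (D_T, D_P, R_T, R_P); `base_models` pairs a factory with its feature matrix.
# oof_matrix = np.column_stack([out_of_fold_probs(f, X_f, y)
#                               for f, X_f in base_models])
# fuser = LogisticRegression().fit(oof_matrix, y)
```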

5 Results and Discussion

5.1 Experiment 1: Individual pCR Prediction Models

Deep Learning: Final deep learning models were trained on the full training set for 67 epochs, based on the average convergence time observed in cross-validation. Two variants of the deep learning classifier were trained, one using intratumoral segmentations and the other using peritumoral segmentations. The deep learning model focused on the tumor outperformed the peritumoral model in both the training and testing sets (Table 1).

Radiomics: Best performance was achieved in cross-validation of the training set when using an SVM classifier, initially pruning features with correlation higher than 90%, and choosing a final set of 11 features via mRMR feature selection. Two variants of the SVM classifier, \(\mathcal {R_\mathcal {T}}\) and \(\mathcal {R_\mathcal {P}}\), were trained, one within each region. Performance of the tumoral and peri-tumoral models was found to be comparable (Table 1).

Table 1. Classification results for single model and multi-model fusion

5.2 Experiment 2: Multi-region, Multi-representation Response Prediction via RESONATE

A fusion of predictions from all spatially-oriented classifiers, \(\mathcal {L}\)(\(\mathcal {D_\mathcal {T}}\), \(\mathcal {D_\mathcal {P}}\), \(\mathcal {R_\mathcal {T}}\), \(\mathcal {R_\mathcal {P}}\)), was found to best identify pCR, achieving an AUC of \(0.78 \pm 0.05\) in cross-validation and 0.79 in the testing set. Confidence intervals (CI) and p-values for the testing set AUC were computed via 50,000-iteration permutation testing [15], giving a 95% CI of 0.62–0.96, p = 0.003. Note that, for the full RESONATE model, as well as for some individual models, performance increased in the testing set relative to the training set, likely because final models were derived from the full training set, whereas models evaluated in cross-validation leveraged only a portion of the training data. Based on the weights of \(\mathcal {L}\), the ensemble prediction relied primarily on \(\mathcal {R_\mathcal {P}}\), \(\mathcal {R_\mathcal {T}}\), and \(\mathcal {D_\mathcal {T}}\), which had weights of 0.80, 0.99, and 0.77, respectively. The differences in these representations between patients identified as pCR and non-responders are depicted visually in Fig. 2. Meanwhile, the peritumoral deep learning model, \(\mathcal {D_\mathcal {P}}\), with a weight of −0.08, contributed the least to pCR prediction relative to the other models.
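A label-permutation p-value for the testing set AUC can be sketched as follows (the CI computation of [15] may use a different resampling scheme; this example covers the p-value only, and names are hypothetical):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def permutation_pvalue(y_true, y_score, n_iter=50_000, seed=0):
    """One-sided p-value for AUC under the null of no label-score association;
    n_iter matches the paper's 50,000 permutations."""
    rng = np.random.default_rng(seed)
    observed = roc_auc_score(y_true, y_score)
    null = np.array([roc_auc_score(rng.permutation(y_true), y_score)
                     for _ in range(n_iter)])
    return observed, (null >= observed).mean()
```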

5.3 Experiment 3: Comparative Strategies - Pairwise Fusion

Each pairwise combination of classifiers was similarly fused for comparison against the full RESONATE model (Table 1); none matched its performance. A radiomics model combining information from both the tumor and peritumoral region, previously shown to be an effective pCR prediction strategy [4], under-performed relative to the model that also incorporates multi-region deep learning, with an AUC of \(0.75 \pm 0.04\) in the training set and 0.77 in the testing set (95% CI: 0.58–0.97, p = 0.006). Likewise, fusion of the deep learning models from both regions achieved an AUC of \(0.73 \pm 0.07\) in the training set and 0.75 in the testing set (95% CI: 0.50–0.99, p = 0.026).

Similarly, fusion of representations within a single region predicted pCR less effectively: models fusing deep learning and radiomic representations only inside or only outside the tumor achieved testing set AUCs of 0.78 (95% CI: 0.55–1.00, p = 0.012) and 0.70 (95% CI: 0.45–0.94, p = 0.052), respectively. Of all pairwise combinations, the best performance was observed when combining the tumoral deep learning model and the peritumoral radiomics model (AUC = 0.78, 95% CI: 0.55–1.00, p = 0.014). This finding emphasizes the value of considering both multiple representations and multiple regions of analysis in pre-treatment response prediction.

Fig. 2. Middle, radiomics feature maps: increased expression of mid-frequency Gabor features within the peri-tumoral region and CoLlAGe entropy features within the tumor distinguished non-response (right). Visual attention maps for the intra-tumoral CNN emphasize the tumor border and core in patients who achieved pCR (right).

6 Conclusion

Our results show that an ensemble of classifiers oriented spatially within the tumor habitat is a viable method for predicting favorable response to NAC in patients with biopsy-proven breast cancer. We applied deep learning and radiomic classifiers with attention focused on either the intratumoral or peritumoral region of the breast, with the individual classifier predictions further boosted by a logistic regression ensemble model, achieving an AUC of 0.79 in a held-out test set. This work is the first to present a methodology for the fusion of radiomics and deep learning approaches across multiple regions of biological significance, and it emphasizes the importance of multi-region, multi-representation analysis in the pre-treatment determination of which patients will benefit from therapeutic intervention.