Explainable AI in Medical Imaging
Review
Keywords: Explainable AI, Medical imaging, Radiology, Black-Box, Explainability, Interpretability

Abstract
Driven by recent advances in Artificial Intelligence (AI) and Computer Vision (CV), the implementation of AI systems in the medical domain increased correspondingly. This is especially true for the domain of medical imaging, in which the incorporation of AI aids several imaging-based tasks such as classification, segmentation, and registration. Moreover, AI reshapes medical research and contributes to the development of personalized clinical care. Consequently, alongside its extended implementation arises the need for an extensive understanding of AI systems and their inner workings, potentials, and limitations, which the field of eXplainable AI (XAI) aims at. Because medical imaging is mainly associated with visual tasks, most explainability approaches incorporate saliency-based XAI methods. In contrast, in this article we investigate the full potential of XAI methods in the field of medical imaging by specifically focusing on XAI techniques that do not rely on saliency, and by providing diversified examples. We dedicate our investigation to a broad audience, but particularly to healthcare professionals. Moreover, this work aims at establishing a common ground for understanding and exchange between Deep Learning (DL) builders and healthcare professionals, which is why we aimed for a non-technical overview. The presented XAI methods are divided by a method’s output representation into the following categories: case-based explanations, textual explanations, and auxiliary explanations.
* Corresponding author at: Institute for Artificial Intelligence in Medicine, Institute of Diagnostic and Interventional Radiology and Neuroradiology, University
Hospital Essen, Hufelandstraße 55, 45147 Essen, Germany.
E-mail address: Katarzyna.Borys@uk-essen.de (K. Borys).
https://doi.org/10.1016/j.ejrad.2023.110786
Received 10 January 2023; Received in revised form 3 March 2023; Accepted 14 March 2023
Available online 20 March 2023
0720-048X/© 2023 Elsevier B.V. All rights reserved.
Fig. 1. Distinction of explanatory approaches by resulting output presentation form. Non-visual XAI approaches encompass auxiliary, case-based, and textual explanations. Visual explanations are not considered in this work but have been listed for completeness.
generate a corresponding output (e.g., a disease prediction). To capture the relation between a set of inputs (termed training data) and the desired outputs (termed labels), an elaborate and complex model training is required, during which a mapping function is approximated by iteratively estimating appropriate parameter values (parametrization). This estimation is guided by a loss function, which denotes how well a trained model fits a featured data set. With simplified input data like two-dimensional coordinates, the final mapping could be depicted as easily as a quadratic equation with few parameters. However, as medical imaging data is highly dimensional, mapping functions can become enormously complex, operating in the trillion-parameter range [4]. Such networks then impede a direct interpretation of their predictions, mainly because of the inherent unpredictability of mapping functions coupled with high structural complexity. Consequently, the demand for interpretability and explainability of AI has experienced a tremendous resurgence over recent years, promoting the formation of new research fields, mainly known as eXplainable AI (XAI) [5,6].
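To make the notions of parametrization and loss tangible for readers without a DL background, the following minimal sketch (plain Python with NumPy, using invented toy data rather than anything from this review) fits the three parameters of a quadratic mapping by iteratively reducing a mean-squared-error loss:

```python
import numpy as np

# Toy "training data": noisy samples from an unknown quadratic relationship.
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 100)
y = 1.5 * x**2 - 0.5 * x + 2.0 + rng.normal(scale=0.2, size=x.shape)

# Three trainable parameters (a, b, c) of the mapping y_hat = a*x^2 + b*x + c.
params = np.zeros(3)

def loss(p):  # mean squared error: how well the current parameters fit the data
    y_hat = p[0] * x**2 + p[1] * x + p[2]
    return np.mean((y_hat - y) ** 2)

# Iterative parameter estimation (gradient descent on the loss).
lr = 0.05
for step in range(2000):
    y_hat = params[0] * x**2 + params[1] * x + params[2]
    err = y_hat - y
    grad = np.array([np.mean(2 * err * x**2), np.mean(2 * err * x), np.mean(2 * err)])
    params -= lr * grad

print("estimated parameters:", params.round(2), "final loss:", round(loss(params), 4))
```

A deep network is trained by the same recipe, only with millions to trillions of parameters and a far more intricate mapping function, which is exactly what makes its behavior hard to inspect directly.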
Generally, XAI refers to all methods and approaches enabling human users to comprehend AI models. Since there is neither a mathematical nor a standardized definition of explainability and interpretability, in this paper a non-mathematical definition from the social sciences perspective will be used, as given by Miller [7]: “Interpretability is the degree to which a person can understand the cause of a decision”. Consequently, while interpretability focuses on the reasonings behind resulting outputs and helps uncover cause-and-effect relationships, explainability differs in the sense that it is associated with a system’s internal logic and procedures [5]. However, XAI is not a purely technological issue; instead, it relies on various medical, legal, ethical, and technological aspects that require thorough investigation. For instance, from the development point of view, explainability can be helpful to sanity-check DL models beyond mere performance and to identify severe errors before deploying tools into clinical validation or utilization. From the medical perspective, all systems – whether AI-powered or not – are subject to a rigorous validation process and medical certification [8]. However, random errors, systematic errors, and biases impede the development of DL tools with 100% diagnostic accuracy. And if bias is present, there will be prediction errors for inputs deviating from the training data. Conclusively, random and systematic errors will occur in the clinical setting, even with a fully validated high-performing DL model. Even though this cannot be avoided, XAI can help to uncover such cases by providing a global (whole model) or a local (single prediction) model explanation and simultaneously safe-checking predictions that might be out-of-distribution. Regarding legal aspects, sensitive data-related issues such as privacy and security, patient consent, and anonymization also play an important role. For example, a prominent regulation in the European Union called the General Data Protection Regulation enforces the right of patients to receive transparent information about a decision’s origin and requires the inclusion of XAI [9]. Consequently, the legal implications of establishing AI within healthcare are important, and the ongoing debate between innovation and regulation needs careful consideration [10].

To achieve a successful interplay between healthcare professionals and XAI, it is not only important to specifically tailor DL systems for the healthcare sector but also to introduce XAI as a powerful technique to healthcare professionals and provide a non-technical introduction to how XAI methods can help them handle novel DL systems. This review aims to fill that gap by providing a non-technical overview of common non-visual XAI methods that apply to medical imaging, along with their advantages, pitfalls, and limitations. While common saliency-based methods like GradCAM [11] project their explanations directly onto an input image, non-saliency-based methods usually provide more diversified explanations, e.g., by generating plots [12], textual descriptions [1], or confidence scores [13]. For ease of reading, we will label saliency-based methods as “visual” and non-saliency-based methods as “non-visual”. The contributions of this paper can be summarized as follows:

• Collection and categorization of non-visual XAI methods into distinctive categories defined by an XAI method’s expected output.
• For each category, a summarization of the methods’ functioning in a non-technical manner.
• Presentation of limitations, pitfalls, and potentials of the introduced XAI methods regarding implementation, evaluation, and interpretation (see Table 1 and Appendix A).
• Summarization of XAI’s current state of research in the field of medical imaging and outlining of future directions.

2. Non-Visual XAI methods in medical imaging

Since medical imaging is mainly associated with visual tasks, most explainability approaches incorporate visual XAI methods, including attribution and heat maps [9]. Even though visual XAI methods are considered easy to interpret and intuitive, some studies pointed out significant limitations. A major study by Adebayo et al. [14] investigated whether saliency methods are insensitive. This highly unwanted effect indicates that an explanation is unrelated to the model or data and does not explain anything. One example encompasses edge detectors, because they just highlight areas with strong color changes within images and do not have a relation to a model prediction. Moreover, Ghorbani et al. [15] demonstrated saliency maps’ vulnerability to image perturbations and fragility to adversarial attacks, leading to the question of how the robustness of visual XAI methods can be ensured. In contrast, Tomsett et al. [16] pointed out a lack of consistency concerning evaluation metrics in sanity-check studies, concluding that despite the increasing effort, it remains challenging to fully evaluate visual explanations. Moreover, visual XAI methods represent only a minor subset of all possible methods. Therefore, even though non-visual XAI techniques might seem more specific and, in some cases, rely on method-specific knowledge for a correct interpretation, it is desirable to investigate the whole range of XAI methods against the background of exploring the full potential of XAI in the medical imaging domain. Consequently, apart from visual explanations, three subgroups of non-visual XAI methods
Fig. 2. Depiction of top-activated images for two exemplary units of a CNN based on [26], using the MedNIST dataset [27,28]. The top row represents activations for hand images, whereas the row below contains images of breasts and the abdomen. The visualizations serve primarily to clarify the procedure. The focus is on evaluating individual concepts concerning specific units, which is why this approach is listed among case-based methods.
were identified according to the type of result a user can expect when applying an XAI technique [17]: auxiliary explanations, case-based explanations, and textual explanations. A thorough introduction to these groups will be provided in subsequent sections. Importantly, they can be used as a classification scheme for XAI methods by their outcome, as depicted in Fig. 1. Deciding which type of informational representation is suitable for a given application is strongly specific to the end-users, technical circumstances, the desired explainability outcome, and – most importantly – what is perceived as beneficial concerning interpretability in the context of the medical domain. In this paper, we primarily refer to end-users as radiologists, clinicians, and doctors whose main need presumably might be the linking of knowledge and confirmation of diagnosis. For example, a radiologist applying a tumor detection model could possibly be more interested in receiving visual explanations on the original image to analyze the model’s focus and how much trust can be put into its predictions. In contrast, a model that estimates a patient’s overall survival chances might benefit from using counterfactuals, which are a case-based XAI method enabling “What-if” assumptions and could, for example, yield how a diagnosis would change if specific clinical parameters were adjusted (a minimal sketch of this idea follows below). However, during the development of such models, synthetic visualizations and auxiliary values may be more insightful [17]. Conclusively, each method has a unique potential to reveal helpful explanatory insight, and there are no recommendations or restrictions for selecting appropriate methods. Because of that, selecting an appropriate XAI method for the task at hand remains a challenge, especially considering the subjective perceptions of end-users alongside their individual needs [18,19] and relatively sparse assessment guidelines.
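As an illustration of the counterfactual idea mentioned above, the following deliberately naive sketch searches for a single-feature change that flips the prediction of a simple tabular classifier (scikit-learn's built-in breast-cancer data stands in for clinical parameters; the model, the chosen feature, and the search loop are illustrative assumptions and do not correspond to any method cited in this review):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A simple tabular classifier as a stand-in for a clinical risk model.
data = load_breast_cancer()
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(data.data, data.target)

x = data.data[0].copy()
original_class = model.predict(x.reshape(1, -1))[0]

# Naive counterfactual search: nudge one feature until the predicted class flips.
feature = list(data.feature_names).index("mean radius")   # hypothetical feature to vary
step = 0.05 * data.data[:, feature].std()
counterfactual = x.copy()
for _ in range(500):
    counterfactual[feature] -= step
    if model.predict(counterfactual.reshape(1, -1))[0] != original_class:
        print(f"Prediction flips when 'mean radius' moves from "
              f"{x[feature]:.1f} to {counterfactual[feature]:.1f}")
        break
else:
    # Changing a single feature may not be enough to flip a confident prediction.
    print("No class flip found by varying this single feature.")
```

Dedicated counterfactual methods additionally enforce plausibility and minimality constraints so that the suggested change remains clinically meaningful.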
Recently, in [20], an extended investigation of real-world user needs for understanding AI was performed using an algorithm-informed XAI question bank. This work aims to clarify how end-user requirements should be understood, prioritized, and addressed to provide specific criteria. Miller [7] extensively examined the question of beneficial explanations, arguing that several XAI methods only consider the researcher’s intuition, not including social aspects of how humans define, generate, select, evaluate, and present explanations. Miller argues that a good explanation is contrastive, selective, and truthful. On the one hand, it is advisable to know why a particular prediction was made instead of another (contrastive), because humans tend to devalue explanations that contradict their prior beliefs. On the other hand, an explanation of unexpected results by including several attributes can become overwhelming (selective). Nevertheless, such explanations must be as adequate as possible (truthful), which settles a trade-off between the given requirements [7]. These challenges are commonly a question of finding the right balance for the corresponding circumstances, the task at hand, the selected model type, and the involved end-users, and they remain an essential aspect of ongoing research. Conclusively, several questions and challenges concerning XAI are most likely to be investigated in the context of a multi-disciplinary intersection of social and ethical sciences, end-users, and DL practitioners [7,21].

2.1. Case-based explanations

Case-based explanations may be highly diversified but share the same goal of providing insight based on specific examples, such as using similar input images, data samples, or counterfactuals to enable “What-if” assumptions [22]. Compared to visual and auxiliary explanations, this explanatory method is less explored. However, some research asserts that this approach is the most intuitive for human users to comprehend [7]. Furthermore, case-based explanations commonly aim at sample-based reasoning and uphold the potential of patient-individual explanation approaches within the medical domain. One prominent example of case-based methods is Testing with Concept Activation Vectors (TCAV) [23]. TCAV determines a model’s sensitivity to an underlying high-level concept for a given class by training a linear classifier to separate the images containing the defined concept from those that do not. On the resulting hyperplane, TCAV utilizes directional derivatives to estimate the degree to which a defined concept is vital to a classification result. An exemplary intuitive question could be how sensitive a prediction of a zebra is to the presence of the concept “stripes”. In this sense, a Concept Activation Vector (CAV) can be understood as a numerical representation generalizing a concept in the activation space of a neural network’s specific layer, also called the bottleneck. For calculating a concept’s CAV, two separate datasets must be generated: a concept dataset and a random dataset representing arbitrary data. For example, to define the concept “stripes”, images of striped objects can be collected, whereas the arbitrary dataset can be a group of random images without stripes. A final quantitative explanation, also called the TCAVQ measure, represents the relative importance of each concept across all prediction classes, allowing for a global interpretation.
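The TCAV recipe just described can be condensed into a few lines. The sketch below is a simplified, self-contained illustration (random arrays stand in for the bottleneck activations and for the gradients that would normally come from backpropagation through a trained network; it is not the official tcav implementation listed in Table A1):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-ins for layer activations of a trained network (hypothetical shapes):
# rows = images, columns = units of the chosen bottleneck layer.
rng = np.random.default_rng(0)
concept_acts = rng.normal(loc=1.0, size=(50, 128))   # activations for "concept" images
random_acts  = rng.normal(loc=0.0, size=(50, 128))   # activations for random images

# 1) Fit a linear classifier separating concept from random activations.
X = np.vstack([concept_acts, random_acts])
y = np.array([1] * len(concept_acts) + [0] * len(random_acts))
clf = LogisticRegression(max_iter=1000).fit(X, y)

# 2) The CAV is the normal vector of the separating hyperplane.
cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])

# 3) For each test image, the directional derivative is the dot product of the
#    gradient of the class logit w.r.t. the bottleneck activations with the CAV.
#    Here the gradients are random stand-ins; in practice they come from backprop.
grads = rng.normal(size=(200, 128))
directional_derivs = grads @ cav

# 4) The TCAV score is the fraction of class examples with a positive derivative.
tcav_score = float((directional_derivs > 0).mean())
print(f"TCAV score for the concept: {tcav_score:.2f}")
```

In practice, the procedure is repeated with several random datasets, and a significance test over the resulting scores guards against the meaningless CAVs discussed next.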
One pitfall to be aware of when using the TCAV technique is the possibility of obtaining meaningless CAVs. A randomly selected set of images still produces a CAV; hence a significance test is recommended. Additionally, assembling and acquiring appropriate
Fig. 4. Presentation of an exemplary AI system performing Image Captioning with visual explanations. The automated report generation is extended by visualization on the input image using bounding boxes to denote obtained findings (hyperexpansion and flattened diaphragms). The input image is taken from the IU X-Ray dataset [48].
explanations represented by natural language. Several methods have been proposed within studies to gain additional insight through a more specific and context-related quantification of images [1,36], with solutions ranging from the semantic annotation of relevant regions to the simulation of conversations based on AI agents. One fundamental mutuality is extending the available information basis (images) by additional sources of information concerning the given visual content. One desirable attribute of such systems is that questions can be free-form and open-ended, meaning that users can ask beyond binary yes-or-no questions. According to these requirements, the authors of [37] proposed a Visual Question Answering (VQA) task. Generally, VQA serves as a representative area of automatic image understanding. An exemplary depiction of such systems is shown in Fig. 3. VQA draws on research from several disciplines, such as Computer Vision, Natural Language Processing, and Knowledge Representation [37]. An important contribution of the VQA task is an accompanying dataset expanding another image dataset called Microsoft Common Objects in Context (COCO) [38]. The final VQA dataset contains ~0.25 million images, ~0.76 million questions, and ~10 million answers.

However, this leads to a challenging requirement of textual explanations, namely the dependence on sophisticated datasets including images alongside descriptive semantic information. This challenge is specifically present in medical imaging, where several imaging modalities, anatomical areas, and disease entities must be considered. Ren and Zhou [36] developed a model called CGMVQA by using the ImageCLEF 2019 VQA-Med dataset [39], which can answer corresponding questions related to medical images. The model is not restricted to a specific disease and can be used for various image modalities and body regions.
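To give a feeling for what interacting with a VQA system looks like in code, the following hedged sketch queries a publicly available general-purpose model through the Hugging Face transformers pipeline; the checkpoint, the blank stand-in image, and the question are illustrative choices and are not part of the CGMVQA work described above:

```python
# Requires: pip install transformers pillow torch (downloads a public checkpoint on first use).
from PIL import Image
from transformers import pipeline

# A blank stand-in image; in practice this would be a photograph or, with an
# appropriately trained and validated medical model, a radiograph.
image = Image.new("RGB", (384, 384), color="white")

vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")
answers = vqa(image=image, question="What color is the background?", top_k=3)
for a in answers:
    print(f"{a['answer']}: {a['score']:.2f}")
```

A clinically usable counterpart would require a model trained and validated on medical images and terminology, as in the VQA-Med setting referenced above.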
Building on the aim of not only using single questions but allowing for a context-related dialog, Das et al. introduced Visual Dialog [40], an AI agent attempting a conversation with humans about an image’s visual content. A human asks questions, e.g., what color an object is, and the AI agent tries to answer. The agent embeds the visual content and the dialog’s history to develop a subsequent answer. More specifically, given a natural image, a dialog history, and a question, the agent tries to deduce a context and answer the question accurately. The Common Objects in Context (COCO) dataset [38], including natural images of multiple everyday objects, served as the basis for generating another dataset called VisDial. This dataset was explicitly crafted for the Visual Dialog task and contained dialogs with ten question–answer pairs on ~140 k COCO images, resulting in a total of ~1.4 M question–answer pairs. Given one image, a dialog history encompassing question–answer pairs, and a follow-up question formulated in natural language, the AI agent’s goal is to provide a natural language answer to each question. For example, in [41], Visual Dialog was deployed in radiology specific to chest X-ray images, bringing forth RadVisDial, the first publicly available dataset for visual dialog in radiology. The authors outlined AI agents’ practical usefulness and clinical contribution to medical imaging regarding a radiologist’s workflow. However, they also pointed out that X-rays are only one of the many data points available (e.g., medications, lab data, clinical data) for a patient’s diagnosis and emphasized the importance of including these additional parameters in AI agents to overcome diagnostic limitations. Another radiologic dataset of interest that is not confined to X-ray images and allows for an interplay between visual components and semantic relations within radiologic images is the Radiology Objects in COntext (ROCO) [42]. Among several descriptive components, included images contain keywords and descriptions, which can, for example, be used for multimodal image representations in classification or natural sentence generation.

Another approach that does not require user questions but automatically generates an associated description of the image content based on an input image is the task of Image Captioning [43]. In this work, the authors utilized a modified deep recurrent architecture for sequence modeling [44] to employ a generative model for generating natural language sentences describing a given input image. Such approaches are especially meaningful in the context of automatic diagnosis report generation, which, in a wider sense, can be seen as a form of an image’s visual content explanation. In general, diagnostic report generation is a time-consuming and knowledge-intensive task [45]. Reports summarize findings observed in medical images and are different from image captioning in that they are paragraphs (e.g., indication, findings, and impression) rather than sentences. Moreover, high precision is mandatory, and generated reports must focus on normal and abnormal findings related to medical characteristics rather than general descriptions [45,46]. Several studies investigated the fusion of image captioning with visual explanations, as shown in Fig. 4, called Image Captioning with
Fig. 5. t-SNE clustering performed with clustimage [56] on radiologic images contained within the MedNIST dataset [27,28]. The contained data classes were ’BreastMRI’ (0), ’AbdomenCT’ (1), ’Hand’ (2), and ’CXR’ (3). For each cluster, an image was drawn from the embedding centroid. The detected clusters indicate a good separability between the classes.
Visual Explanations [47]. Zhang et al. [49] proposed an extensive framework named TandemNet, similar to Image Captioning [43], yielding visual attention maps corresponding to textual explanations on images. Lee et al. [50] used Image Captioning with visual explanations for breast mammograms by combining radiology reports with visual saliency maps. A similar approach called TieNet was shown by Wang et al. [51] on chest X-rays. Importantly, the authors pointed out how different parts of textual explanations resulted in different saliency maps on the input image related to radiological findings.
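For completeness, the basic captioning interface itself is compact. The following hedged sketch calls a publicly available general-purpose captioning model via the Hugging Face transformers pipeline; the checkpoint and the blank stand-in image are illustrative assumptions and are unrelated to the radiology-specific systems above, which are trained on paired reports and images and additionally couple the generated text with visual evidence:

```python
# Requires: pip install transformers pillow torch (downloads a public checkpoint on first use).
from PIL import Image
from transformers import pipeline

captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
image = Image.new("RGB", (384, 384), color="gray")  # stand-in; use a real photograph in practice
print(captioner(image)[0]["generated_text"])
```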
Overall, it is noteworthy that systems like VQA or Image Captioning are not XAI methods as such but rather serve as an example of how an interplay between semantic and visual elements could serve as a rectification within the diagnostic process. Moreover, applying these methods does not require an in-depth understanding or method-specific interpretation.

2.3. Auxiliary explanations

Auxiliary measures mainly provide additional information, such as a statistical indicator for single predictions or whole models, and can, for example, be illustrated in tabular or graphical form. Even though the specific interpretation strongly depends on the context and its implementation, auxiliary methods are powerful techniques to provide condensed information for a single prediction or a whole model. Potential applications include i) prediction intervals denoting a prediction’s variance [52], ii) plots illustrating uncertainty [53], or iii) importance scores [54]. A prominent scatter-plot-based method that can project high-dimensional data in a two- or three-dimensional space, called t-distributed stochastic neighbor embedding (t-SNE), was introduced by van der Maaten and Hinton [12]. It builds upon conditional probabilities to express the distances between data points and find similarities. To facilitate this process, the similarities are usually measured within an embedding space, a relatively low-dimensional space into which input data represented by high-dimensional vectors (e.g., images or texts) can be projected. In simpler terms, t-SNE provides a depiction or intuition of how the data is arranged in a high-dimensional space. Algorithms such as t-SNE are often used to cluster input images based on their activation of neurons in a network. In [55], Rauber et al. use t-SNE to depict activations of hidden neurons and the learned data representations. It could be shown that these projections provide valuable feedback about the relationships between neurons and classes. As stated by the authors: “This feedback may confirm the known, reveal the unknown, and prompt improvements along the classification pipeline, as we have shown through concrete examples.” [55].

Moreover, not only neural activations but also the data itself can be directly translated into the embedding space. This is especially helpful when exploring new datasets or analyzing relationships regarding clusters and outliers within the data, as shown in Fig. 5. One pitfall to be aware of is how the interpretation of obtained results is conducted. Altering tunable hyperparameters, such as the perplexity (balancing the attention t-SNE gives to local and global aspects of the data), can drastically change the plot. Secondly, t-SNE plots cannot always visualize the relative sizes of clusters appropriately; hence, cluster size and distance should not be over-interpreted. An interactive examination of such effects can be tested in [57]. Uniform Manifold Approximation and Projection (UMAP) [58] is a successor of t-SNE, superior in run-time performance, scalability regarding larger datasets, and preservation of data structure. Graziani et al. [59] used UMAP to visualize layer activations of a classification model for Retinopathy of Prematurity. More precisely, they investigated misclassification errors using a 2D UMAP compression of the activations of specific layers. A major disadvantage of UMAP is its lack of maturity, as it is a relatively new technique, and best practices are not yet established [60].
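Readers who want to try such embeddings on their own data can start from the minimal scikit-learn sketch below (the small built-in digits dataset stands in for medical images, and the chosen perplexity is an illustrative default; the umap-learn package exposes an analogous fit_transform interface for UMAP):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Small stand-in dataset (8x8 digit images) instead of MedNIST, purely for illustration.
digits = load_digits()
X = digits.data          # (n_samples, 64) flattened images
y = digits.target

# Perplexity balances attention to local vs. global structure; results change with it.
emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)

plt.scatter(emb[:, 0], emb[:, 1], c=y, s=5, cmap="tab10")
plt.title("t-SNE embedding of image data (illustrative)")
plt.show()
```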
Prototypes represent another set of auxiliary methods. Human classification of images or objects strongly relies on a subconscious comparison with prototypical parts or characteristics associated with certain objects. For example, when classifying cat breeds, the final decision is based on the existence or absence of common prototypical parts representing a specific breed, such as ear shape, fur length, or color. This
Fig. 6. Depiction of the interplay between epistemic and aleatoric uncertainty. The dashed line represents the function approximated by a model, whereas the blue dots represent samples from the training data. Figure partly inspired by [70].
reasoning is also common in complex identification tasks, e.g., when, during the diagnosis of cancer, radiologists compare suspected tumors in radiologic images with prototypical tumor depictions [61]. Consequently, a prototype can be understood as a data instance representative of a specific class and can be used to describe data or develop an interpretable model [62,63]. In this sense, ProtoTree (Neural Prototype Trees) [62] provides a local and global explanation utilizing a decision tree that is explainable by design. Its hierarchical structure is intended to increase interpretability and lead to more insights regarding clusters in the data by exploiting the positive and the negative reasoning process. Instead of similarity scores multiplied by weights, the local explanation shows the trajectory of the entered image through the decision trees. In this work, the number of prototypes is comparatively reduced and not selected according to the class. A significant advantage stems from the fact that each node represents a trainable prototype represented by a tensor containing a specific image area. Hence, no manual formulation of prototypes or concepts is required. However, one shortcoming to be aware of is the interpretation of explanations containing no positive matches. For example, a model could predict the class “sparrow” if a sample lacked red feathers, a long beak, and wide wings. Therefore, checking for unambiguity and awareness of exceptional cases is highly recommended [64].

A comparable method is based on feature-space partitioning, referred to as TreeView [65]. A complex model is represented via hierarchical clustering of features according to the activation values of hidden neurons, in such a way that each cluster comprises a set of neurons with a similar distribution of activations across the training set. Subsequently, the feature space is divided into subspaces by clustering similar neurons according to their activation distribution, with each cluster representing a specific factor. Given a prediction of which the actual label is known, a decision tree surrogate model is used to predict the label, and the decision tree’s relevant nodes are traced for its prediction. The results are depicted in a scatter plot, in which each column denotes whether a sample belongs to a specific class. In [66], Sattigeri et al. employed disentangled generative models enabling unsupervised learning of high-level concepts to visualize the surrogate’s decision path within the input space. A fine-grained analysis of the high-level concepts within the input space can yield a different understanding of the connection between inputs and label spaces, enabling the analyst to validate their mental map between the spaces [65]. Overall, a visualization using TreeView enables a convenient transition between factors, class labels, and an input data space. However, an interpretation may be challenging without a thorough understanding of how this method works.
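The core idea of explaining a complex model through a simpler surrogate fitted to grouped hidden activations can be sketched compactly. The following is a heavily simplified, hedged illustration of that idea (scikit-learn on the toy digits dataset; grouping neurons with k-means into "factors" and fitting a shallow tree are stand-ins for the full TreeView procedure of [65]):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

# 1) A small "black-box" model and its hidden-layer activations.
X, y = load_digits(return_X_y=True)
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0).fit(X, y)
hidden = np.maximum(0, X @ mlp.coefs_[0] + mlp.intercepts_[0])  # ReLU activations of layer 1

# 2) Group neurons with similar activation profiles into "factors" and summarize
#    each sample by the mean activation of every factor.
n_factors = 8
neuron_groups = KMeans(n_clusters=n_factors, n_init=10, random_state=0).fit_predict(hidden.T)
factors = np.column_stack([hidden[:, neuron_groups == k].mean(axis=1) for k in range(n_factors)])

# 3) Fit a shallow decision-tree surrogate that mimics the black-box predictions.
surrogate = DecisionTreeClassifier(max_depth=4, random_state=0).fit(factors, mlp.predict(X))
print("surrogate fidelity:", round(surrogate.score(factors, mlp.predict(X)), 2))
print(export_text(surrogate, feature_names=[f"factor_{k}" for k in range(n_factors)]))
```

The fidelity score indicates how faithfully the shallow tree mimics the original model, and the printed rules give the kind of factor-level narrative that TreeView aims at.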
Another set of techniques is derived from conventional DL tools not capturing a model’s uncertainty. Uncertainty may appear in two forms, aleatoric uncertainty and epistemic uncertainty [67], with the former being caused by data or label noise and the latter being caused by data sparsity or out-of-distribution inputs. While epistemic uncertainty is reducible by acquiring more data or optimizing the training process, aleatoric uncertainty is not, because of a task’s intrinsic randomness and the fact that a model is only an approximation of it [13,17]. An overview is presented in Fig. 6. Commonly, predictive probabilities obtained as model output (e.g., the softmax output) are falsely represented as model confidence [68]. However, a model can be uncertain in its predictions even with a high softmax output, since different distributions can yield the same probability estimate. Ultimately, the distribution form characterizes the range of a system’s behavior [69]. In response, uncertainty quantification seeks to identify and combine sources of variability to define and characterize the range of a system’s possible behavior. This can be performed directly on a model’s weights or by locating outliers within the data and adjusting the model to capture these [17]. Explanation methods involving this uncertainty category enable the identification of model limitations and provide information about possible optimization approaches.

Bayesian Neural Networks (BNNs) [68,71,72] are stochastic ANNs trained using Bayesian inference. Bayesian inference is a powerful framework that initially helps with overfitting in neural networks but also estimates how uncertain a model is by, for example, computing conditional probabilities of a model’s weights w.r.t. the training data. Bayesian networks offer a paradigm for interpretability based on probability theory [73] and are, therefore, intrinsically interpretable. BNNs can be designed in different ways regarding the selection of stochastic components, distribution approximation, and inference approaches [74]. Eaton-Rosen et al. [75] proposed a generalizable technique for quantifying uncertainty with Bayesian Neural Networks for semantic segmentation on the BraTS 2017 dataset [76], including 285 subjects with high- or low-grade gliomas. The networks predict a voxel’s probability of belonging to each segmentation class and generate calibrated confidence intervals of downstream biomarkers.
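One comparatively lightweight approximation of Bayesian uncertainty discussed in this literature is Monte Carlo dropout [68], in which dropout is simply kept active at prediction time and the spread of repeated stochastic forward passes is read as uncertainty. A minimal PyTorch sketch under that assumption, with a randomly initialized toy network standing in for a trained model, looks as follows:

```python
import torch
import torch.nn as nn

# Tiny classifier with dropout; its weights are random here, purely for illustration.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 2))

def mc_dropout_predict(model, x, n_samples=50):
    """Keep dropout active at inference and average softmax outputs over repeated passes."""
    model.train()  # .train() keeps dropout layers stochastic
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_samples)])
    return probs.mean(dim=0), probs.std(dim=0)  # predictive mean and spread

x = torch.randn(4, 16)  # four hypothetical feature vectors
mean, spread = mc_dropout_predict(model, x)
print(mean, spread, sep="\n")
```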
In contrast, there are also other methods for uncertainty estimation, justified by the fact that Bayesian approaches usually require complex modifications of the training process and can be computationally expensive. One prominent approach is to estimate uncertainty using Deep Ensembles [77], which quantify uncertainty by sampling from multiple models, where each model is trained separately on the same dataset. The models’ agreement represents the confidence in the overall model ensemble.
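A much-reduced, hedged sketch of this ensemble idea with scikit-learn (a synthetic tabular dataset stands in for imaging features, and the standard deviation across members serves as a simple disagreement measure):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy tabular stand-in for an imaging feature set.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train the same architecture several times with different random initializations.
ensemble = [
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=seed).fit(X_train, y_train)
    for seed in range(5)
]

# Stack per-member probabilities: shape (n_members, n_test, n_classes).
probs = np.stack([m.predict_proba(X_test) for m in ensemble])

mean_prob = probs.mean(axis=0)                 # ensemble prediction
disagreement = probs.std(axis=0).mean(axis=1)  # spread across members ~ epistemic uncertainty

print("Most uncertain test samples:", np.argsort(disagreement)[-5:])
```

Samples on which the members disagree most are natural candidates for closer review or for flagging as potentially out-of-distribution.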
Yang and Fevens [78] applied Monte Carlo dropout and ensembles to several tasks and modalities, including COVID-19
Table 1
Overview of introduced explanations alongside their strengths, weaknesses, explanation targets, and explanation perspectives, grouped by explanation output.

Case-based
• Influence functions [30]: Strength: direct identification of data samples that fall out of distribution. Weakness: only applicable to models with a 2nd-order differentiable loss; no agreed-upon threshold separating influential from non-influential instances. Explanation target: data. Perspective: global | local.
• Network Dissection [26]: Strength: inner workings; detection of concepts beyond classes in a classification task; communication of inner workings in a non-technical way. Weakness: dependence on datasets with labels on pixel level; only positive activations are considered. Explanation target: internals. Perspective: global.
• Testing with Concept Activation Vectors [23]: Strength: explorative explanations beyond feature attribution. Weakness: manual formulation of concepts required. Explanation target: predictions | internals. Perspective: global | local.
• Automated concept-based explanations [24]: Strength: no manual labeling required. Weakness: results depend on the selection of parameters. Explanation target: predictions | internals. Perspective: global | local.
• GANterfactuals [32]: Strength: intuitive explainability by contrastive setting, even for non-experts. Weakness: dependence on a binary classifier; lack of flexibility. Explanation target: predictions. Perspective: local.

Textual
• Image Captioning [43]: Strength: semantic relationship between images and textual descriptions is intuitive, even for non-experts. Explanation target: data | predictions. Perspective: local.
• Visual Question Answering [37]: Strength: free-form and open-ended. Explanation target: data | predictions. Perspective: local.
• Visual Dialog [40]: Strength: interactive and context-related dialog. Explanation target: data. Perspective: local.
• Image Captioning with Visual Explanations [47]: Strength: interplay between an image’s semantic content description and a visual explanation. Weakness: dependence on complex datasets including images and textual annotations. Explanation target: data | predictions. Perspective: local.

Auxiliary
• t-SNE [12]: Strength: allows for explorative interpretation of activations within a neural network but also data investigation. Weakness: sensitivity to tunable hyperparameters; cluster size and distance should not be over-interpreted. Explanation target: data | internals | predictions. Perspective: global | local.
• UMAP [58]: Strength: improved time performance; better preservation of data structure. Weakness: lack of maturity. Explanation target: data | internals | predictions. Perspective: global | local.
• Deep Ensembles [77]: Strength: allows for estimation of both epistemic and aleatoric uncertainty. Weakness: low diversity of ensemble members affects uncertainty estimates. Explanation target: data. Perspective: global.
• ProtoTree [62]: Strength: no manual formulation of prototypes/concepts is required. Weakness: possibility of ambiguous results. Explanation target: predictions. Perspective: global | local.
• BNNs [71]: Strength: intrinsically interpretable. Weakness: requires complex modifications to the training process. Explanation target: data. Perspective: global.
• TreeView [65]: Strength: transition between factors, class labels, and input data space. Weakness: interpretation requires a thorough understanding of the method. Explanation target: predictions. Perspective: local.
they enable a valuable linkage between medical images and semantic information. Also, their interpretation usually does not rely on a thorough understanding of the underlying internal functioning, as generated texts are easily interpretable. A significant advantage is that the interpretation of images, and thus the generation of diagnostic reports or documentation, can also be automated, which can be particularly helpful for inexperienced or in-training healthcare professionals who want to confirm their diagnostic findings with the help of a support system. Coupled with visualizations, systems encompassing image captioning can even pose as an educative tool, as findings described within generated captions can be projected directly onto the initial image, justifying which image area is responsible for the generation of a specific caption part. However, diagnostic processes often rely on the assessment of multiple data points [41], such as different X-ray views (lateral, oblique, or anteroposterior), comparisons between modalities, or evaluating changes over time (e.g., follow-up examinations). For these reasons, systems such as VQA or Visual Dialog are clinically limited and rely on further optimization research that takes the presented requirements into account.

Lastly, auxiliary approaches yield explanations in the form of additional information, such as statistical indicators for single predictions or whole models, with the final presentation form varying strongly, as results can, for example, be provided as importance scores, visualizations of embedding or activation spaces, or as surrogate white-box models. Great potential is attributed to the quantification of uncertainty. The differentiation between epistemic and aleatoric uncertainty can be used by DL engineers, in particular, to determine whether, and if so which, model or training adjustments are necessary. Regarding predictions, uncertainty estimations can uncover when a model does not have high confidence, indicating out-of-distribution samples. Among other things, these considerations are also important for differentiating whether a model acts according to an open-world or closed-world assumption. Differentiation of these assumptions can be crucial, as open-world systems have the ability to denote low confidence when answering questions to which the answer is unknown, whereas closed-world systems would simply answer with the most likely answer (class) of a given subset, as the opposite cannot be confirmed. Especially with medical questions, it would be incorrect to state that the patient does not suffer from a specific disease if there is no record reporting this, unless more information is given to confirm this assumption. Out-of-distribution identification and closed-world assumptions are closely related, as models are usually not equipped with the ability to reject an input during training or inference if it is not represented well by the underlying data or falls out of distribution. Therefore, uncertainty quantification significantly contributes to model integrity within the medical domain.

Conclusively, even though non-visual XAI methods may appear less intuitive compared to prominent saliency-based approaches such as GradCAM [11], they significantly contribute to the diversification of XAI, allowing for the systematic uncovering of model flaws, data outliers, distributional discrepancies, and biases. An aspect that has so far been little researched is the clinical integration and empirical assessment of the impact of such XAI methods. Therefore, investigating this question could be of significant interest for future research.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A

(See Table A1).

Table A1
Listing of presented XAI methods ordered by their citations/year ratio (according to Google Scholar, see Appendix B Supplementary data) alongside a corresponding open-source repository link (if available). If no open-source repository was available, unofficial implementations were provided. The number in parentheses is the citations/year ratio.

Case-based
• Influence functions [30] (333): https://github.com/kohpangwei/influence-release
• Testing with Concept Activation Vectors [23] (215): https://github.com/tensorflow/tcav
• Network Dissection [26] (203): https://github.com/CSAILVision/NetDissect
• Automated concept-based explanations [24] (81): https://github.com/amiratag/ACE
• GANterfactuals [32] (8): https://github.com/hcmlab/GANterfactual

Textual
• Image Captioning with Visual Explanations [47] (1257): https://github.com/zizhaozhang/tandemnet and https://github.com/zizhaozhang/distill2
• Image Captioning [43] (770): https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning
• Visual Question Answering [37] (524): https://github.com/GT-Vision-Lab/VQA_LSTM_CNN and https://github.com/jiasenlu/HieCoAttenVQA
• Visual Dialog [40] (146): https://github.com/batra-mlp-lab/visdial-challenge-starter-pytorch

Auxiliary
• t-SNE [12] (1816): https://lvdmaaten.github.io/tsne/
• UMAP [58] (1384): https://github.com/lmcinnes/umap
• BNNs [68] (962): https://github.com/yaringal/DropoutUncertaintyExps
• Deep Ensembles [77] (581): https://github.com/SamsungLabs/pytorch-ensembles
• ProtoTree [62] (33): https://github.com/M-Nauta/ProtoTree
• TreeView [65] (7): no open-source repository available

Appendix B. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.ejrad.2023.110786.

References

[1] Z. Zhang, Y. Xie, F. Xing, M. McGough, L. Yang, MDNet: A Semantically and Visually Interpretable Medical Image Diagnosis Network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6428–6436. Accessed: Apr. 07, 2022. [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2017/html/Zhang_MDNet_A_Semantically_CVPR_2017_paper.html.
[2] R. Hosch, L. Kroll, F. Nensa, S. Koitka, Differentiation Between Anteroposterior and Posteroanterior Chest X-Ray View Position With Convolutional Neural Networks, Rofo 193 (2) (Feb. 2021) 168–176, https://doi.org/10.1055/a-1183-5227.
[3] S. Koitka, M.S. Kim, M. Qu, A. Fischer, C.M. Friedrich, F. Nensa, Mimicking the radiologists’ workflow: Estimating pediatric hand bone age with stacked deep neural networks, Med. Image Anal. 64 (Aug. 2020) 101743, https://doi.org/10.1016/j.media.2020.101743.
[4] W. Fedus, B. Zoph, N. Shazeer, Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, J. Mach. Learn. Res. 23 (120) (2022) 1–39.
[5] F. Chollet, Deep Learning with Python, second ed., Simon and Schuster, 2021.
[6] F. Doshi-Velez, B. Kim, Towards A Rigorous Science of Interpretable Machine Learning, arXiv:1702.08608 [cs, stat], Mar. 2017. Accessed: Apr. 07, 2022. [Online]. Available: http://arxiv.org/abs/1702.08608.
[7] T. Miller, Explanation in artificial intelligence: Insights from the social sciences, Artif. Intell. 267 (Feb. 2019) 1–38, https://doi.org/10.1016/j.artint.2018.07.007.
[8] D. Higgins, V.I. Madai, From bit to bedside: a practical framework for artificial intelligence product development in healthcare, Adv. Intellig. Syst. 2 (10) (2020) 2000052, https://doi.org/10.1002/aisy.202000052.
[9] B.H.M. van der Velden, H.J. Kuijf, K.G.A. Gilhuijs, M.A. Viergever, Explainable artificial intelligence (XAI) in deep learning-based medical image analysis, Med. Image Anal. 79 (Jul. 2022) 102470, https://doi.org/10.1016/j.media.2022.102470.
[10] J. Amann, A. Blasimme, E. Vayena, D. Frey, V.I. Madai, and the Precise4Q consortium, Explainability for artificial intelligence in healthcare: a multidisciplinary perspective, BMC Medical Informatics and Decision Making 20 (1) (Nov. 2020) 310, https://doi.org/10.1186/s12911-020-01332-6.
[11] R.R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 618–626. Accessed: Apr. 07, 2022. [Online]. Available: https://openaccess.thecvf.com/content_iccv_2017/html/Selvaraju_Grad-CAM_Visual_Explanations_ICCV_2017_paper.html.
[12] L. van der Maaten, G. Hinton, Visualizing data using t-SNE, J. Mach. Learn. Res. 9 (86) (2008) 2579–2605.
[13] Y. Kwon, J.-H. Won, B.J. Kim, M.C. Paik, Uncertainty quantification using Bayesian neural networks in classification: Application to biomedical image segmentation, Comput. Stat. Data Anal. 142 (Feb. 2020) 106816, https://doi.org/10.1016/j.csda.2019.106816.
[14] J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, B. Kim, Sanity checks for saliency maps, in: Proceedings of the 32nd International Conference on Neural Information Processing Systems.
[15] A. Ghorbani, A. Abid, J. Zou, Interpretation of Neural Networks Is Fragile, in: Proceedings of the AAAI Conference on Artificial Intelligence 33 (01), Jul. 2019, https://doi.org/10.1609/aaai.v33i01.33013681.
[16] R. Tomsett, D. Harborne, S. Chakraborty, P. Gurram, A. Preece, Sanity checks for saliency metrics, presented at the AAAI Conference on Artificial Intelligence, Feb. 2020. Accessed: Nov. 14, 2022. [Online]. Available: https://research.ibm.com/publications/sanity-checks-for-saliency-metrics.
[17] M. Pocevičiūtė, G. Eilertsen, C. Lundström, Survey of XAI in Digital Pathology, in: A. Holzinger, R. Goebel, M. Mengel, H. Müller (Eds.), Artificial Intelligence and Machine Learning for Digital Pathology: State-of-the-Art and Future Challenges, Springer International Publishing, Cham, 2020, pp. 56–88, https://doi.org/10.1007/978-3-030-50402-1_4.
[18] A. Barredo Arrieta, et al., Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI, Information Fusion 58 (Jun. 2020) 82–115, https://doi.org/10.1016/j.inffus.2019.12.012.
[19] F. Hohman, M. Kahng, R. Pienta, D.H. Chau, Visual Analytics in Deep Learning: An Interrogative Survey for the Next Frontiers, IEEE Trans. Vis. Comput. Graph. 25 (8) (Aug. 2019) 2674–2693, https://doi.org/10.1109/TVCG.2018.2843369.
[20] Q.V. Liao, D. Gruen, S. Miller, Questioning the AI: Informing Design Practices for Explainable AI User Experiences, in: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, Association for Computing Machinery, New York, NY, USA, 2020, pp. 1–15. Accessed: Jun. 06, 2022. [Online]. Available: https://doi.org/10.1145/3313831.3376590.
[21] Z.C. Lipton, In machine learning, the concept of interpretability is both important and slippery, Machine learning, p. 28.
[22] M.T. Keane, B. Smyth, Good counterfactuals and where to find them: a case-based technique for generating counterfactuals for explainable AI (XAI), in: Case-Based Reasoning Research and Development, Cham, 2020, pp. 163–178, https://doi.org/10.1007/978-3-030-58342-2_11.
[23] B. Kim, et al., Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV), in: Proceedings of the 35th International Conference on Machine Learning, Jul. 2018, pp. 2668–2677. Accessed: Feb. 14, 2022. [Online]. Available: https://proceedings.mlr.press/v80/kim18d.html.
[24] A. Ghorbani, J. Wexler, J.Y. Zou, B. Kim, Towards Automatic Concept-based Explanations, in: Advances in Neural Information Processing Systems, vol. 32, 2019. Accessed: Jun. 03, 2022. [Online]. Available: https://proceedings.neurips.cc/paper/2019/hash/77d2afcb31f6493e350fca61764efb9a-Abstract.html.
[25] D. Sauter, G. Lodde, F. Nensa, D. Schadendorf, E. Livingstone, M. Kukuk, Validating Automatic Concept-Based Explanations for AI-Based Digital Histopathology, Sensors 22 (14) (Jan. 2022), https://doi.org/10.3390/s22145346.
[26] D. Bau, B. Zhou, A. Khosla, A. Oliva, A. Torralba, Network Dissection: Quantifying Interpretability of Deep Visual Representations, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6541–6549. Accessed: Apr. 07, 2022. [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2017/html/Bau_Network_Dissection_Quantifying_CVPR_2017_paper.html.
[27] J. Yang, R. Shi, B. Ni, MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis, in: 18th IEEE International Symposium on Biomedical Imaging, ISBI 2021, Nice, France, April 13–16, 2021, 2021, pp. 191–195, https://doi.org/10.1109/ISBI48211.2021.9434062.
[28] N. Kokhlikyan, et al., Captum: A unified and generic model interpretability library for PyTorch, arXiv [cs.LG], 2020. [Online]. Available: http://arxiv.org/abs/2009.07896.
[29] C. Molnar, Interpretable Machine Learning. Accessed: Apr. 12, 2022. [Online]. Available: https://christophm.github.io/interpretable-ml-book/.
[30] P.W. Koh, P. Liang, Understanding Black-box Predictions via Influence Functions, in: Proceedings of the 34th International Conference on Machine Learning, Jul. 2017, pp. 1885–1894. Accessed: Sep. 20, 2022. [Online]. Available: https://proceedings.mlr.press/v70/koh17a.html.
[31] C.J. Wang, et al., Deep learning for liver tumor diagnosis part II: convolutional neural network interpretation using radiologic imaging features, Eur. Radiol. 29 (7) (Jul. 2019) 3348–3357, https://doi.org/10.1007/s00330-019-06214-8.
[32] S. Mertes, T. Huber, K. Weitz, A. Heimerl, E. André, GANterfactual—Counterfactual Explanations for Medical Non-experts Using Generative Adversarial Learning, Front. Artificial Intelligence 5 (2022). Accessed: Jul. 12, 2022. [Online]. Available: https://www.frontiersin.org/articles/10.3389/frai.2022.825565.
[33] J.-Y. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired Image-To-Image Translation Using Cycle-Consistent Adversarial Networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232. Accessed: Apr. 07, 2022. [Online]. Available: https://openaccess.thecvf.com/content_iccv_2017/html/Zhu_Unpaired_Image-To-Image_Translation_ICCV_2017_paper.html.
[34] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, W. Samek, On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation, PLoS One 10 (7) (Oct. 2015) e0130140.
[35] M.T. Ribeiro, S. Singh, C. Guestrin, “Why Should I Trust You?”: Explaining the Predictions of Any Classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, Aug. 2016, pp. 1135–1144, https://doi.org/10.1145/2939672.2939778.
[36] F. Ren, Y. Zhou, CGMVQA: A New Classification and Generative Model for Medical Visual Question Answering, IEEE Access 8 (2020) 50626–50636, https://doi.org/10.1109/ACCESS.2020.2980024.
[37] A. Agrawal, et al., VQA: Visual Question Answering, Int. J. Comput. Vision 123 (1) (May 2017) 4–31, https://doi.org/10.1007/s11263-016-0966-6.
[38] T.-Y. Lin, et al., Microsoft COCO: Common Objects in Context, in: Computer Vision – ECCV, Cham, 2014, pp. 740–755, https://doi.org/10.1007/978-3-319-10602-1_48.
[39] A. Ben Abacha, S. Hasan, V. Datla, J. Liu, D. Demner-Fushman, H. Müller, VQA-Med: Overview of the Medical Visual Question Answering Task at ImageCLEF 2019, Lect. Notes Comput. Sci. (Sep. 2019).
[40] A. Das, et al., Visual Dialog, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 326–335. Accessed: Apr. 07, 2022. [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2017/html/Das_Visual_Dialog_CVPR_2017_paper.html.
[41] O. Kovaleva, et al., Towards Visual Dialog for Radiology, in: Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, Online, Jul. 2020, pp. 60–69, https://doi.org/10.18653/v1/2020.bionlp-1.6.
[42] O. Pelka, S. Koitka, J. Rückert, F. Nensa, C.M. Friedrich, Radiology Objects in COntext (ROCO): A Multimodal Image Dataset, in: Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis, Cham, 2018, pp. 180–189, https://doi.org/10.1007/978-3-030-01364-6_20.
[43] O. Vinyals, A. Toshev, S. Bengio, D. Erhan, Show and Tell: A Neural Image Caption Generator, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156–3164. Accessed: Dec. 12, 2022. [Online]. Available: https://www.cv-foundation.org/openaccess/content_cvpr_2015/html/Vinyals_Show_and_Tell_2015_CVPR_paper.html.
[44] S. Hochreiter, J. Schmidhuber, Long Short-Term Memory, Neural Comput. 9 (8) (Nov. 1997) 1735–1780, https://doi.org/10.1162/neco.1997.9.8.1735.
[45] S. Yang, J. Niu, J. Wu, X. Liu, Automatic Medical Image Report Generation with Multi-view and Multi-modal Attention Mechanism, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 12454 LNCS, 2020, pp. 687–699, https://doi.org/10.1007/978-3-030-60248-2_48.
[46] J. Yuan, H. Liao, R. Luo, J. Luo, Automatic Radiology Report Generation Based on Multi-view Image Fusion and Medical Concept Enrichment, in: Medical Image Computing and Computer Assisted Intervention – MICCAI, Cham, 2019, pp. 721–729, https://doi.org/10.1007/978-3-030-32226-7_80.
[47] K. Xu, et al., Show, attend and tell: neural image caption generation with visual attention, in: Proceedings of the 32nd International Conference on Machine Learning – Volume 37, Lille, France, Jul. 2015, pp. 2048–2057.
[48] D. Demner-Fushman, et al., Preparing a collection of radiology examinations for distribution and retrieval, J. Am. Med. Inform. Assoc. 23 (2) (Mar. 2016) 304–310, https://doi.org/10.1093/jamia/ocv080.
[49] Z. Zhang, P. Chen, M. Sapkota, L. Yang, TandemNet: Distilling Knowledge from Medical Images Using Diagnostic Reports as Optional Semantic References, in: Medical Image Computing and Computer Assisted Intervention – MICCAI, Cham, 2017, pp. 320–328, https://doi.org/10.1007/978-3-319-66179-7_37.
[50] H. Lee, S.T. Kim, Y.M. Ro, Generation of Multimodal Justification Using Visual Word Constraint Model for Explainable Computer-Aided Diagnosis, in: Interpretability of Machine Intelligence in Medical Image Computing and Multimodal Learning for Clinical Decision Support, Cham, 2019, pp. 21–29, https://doi.org/10.1007/978-3-030-33850-3_3.
[51] X. Wang, Y. Peng, L. Lu, Z. Lu, R.M. Summers, TieNet: Text-Image Embedding Network for Common Thorax Disease Classification and Reporting in Chest X-Rays, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2018, pp. 9049–9058, https://doi.org/10.1109/CVPR.2018.00943.
[52] T. Pearce, A. Brintrup, M. Zaki, A. Neely, High-Quality Prediction Intervals for Deep Learning: A Distribution-Free, Ensembled Approach, in: Proceedings of the 35th International Conference on Machine Learning, Jul. 2018, pp. 4075–4084. Accessed: Dec. 12, 2022. [Online]. Available: https://proceedings.mlr.press/v80/pearce18a.html.
[53] M.S. Ayhan, P. Berens, Test-time Data Augmentation for Estimation of Heteroscedastic Aleatoric Uncertainty in Deep Neural Networks, presented at Medical Imaging with Deep Learning, Apr. 2018. Accessed: Apr. 07, 2022. [Online]. Available: https://openreview.net/forum?id=rJZz-knjz.
[54] W. Jin, X. Li, G. Hamarneh, Evaluating Explainable AI on a Multi-Modal Medical Imaging Task: Can Existing Algorithms Fulfill Clinical Requirements? arXiv, Mar. 12, 2022, https://doi.org/10.48550/arXiv.2203.06487.
[55] P.E. Rauber, S.G. Fadel, A.X. Falcão, A.C. Telea, Visualizing the Hidden Activity of Artificial Neural Networks, IEEE Trans. Vis. Comput. Graph. 23 (1) (Jan. 2017) 101–110, https://doi.org/10.1109/TVCG.2016.2598838.
[56] E. Taskesen, clustimage: a Python package for unsupervised clustering of images, Nov. 2021. Accessed: Dec. 10, 2022. [Online]. Available: https://erdogant.github.io/clustimage.
[57] M. Wattenberg, F. Viégas, I. Johnson, How to Use t-SNE Effectively, Distill, Oct. 13, 2016. Accessed: Apr. 13, 2022. [Online]. Available: http://distill.pub/2016/misread-tsne.
[58] L. McInnes, J. Healy, N. Saul, L. Großberger, UMAP: Uniform Manifold Approximation and Projection, J. Open Source Software 3 (29) (Sep. 2018) 861, https://doi.org/10.21105/joss.00861.
[59] M. Graziani, et al., Improved interpretability for computer-aided severity assessment of retinopathy of prematurity, in: Medical Imaging 2019: Computer-Aided Diagnosis, vol. 10950, Mar. 2019, pp. 450–460, https://doi.org/10.1117/12.2512584.
[60] S. Nanga, et al., Review of dimension reduction methods, J. Data Anal. Informat. Process. 9 (3) (2021), https://doi.org/10.4236/jdaip.2021.93013.
[61] A. Holt, I. Bichindaritz, R. Schmidt, P. Perner, Medical applications in case-based reasoning, Knowl. Eng. Rev. 20 (3) (Sep. 2005) 289–292, https://doi.org/10.1017/S0269888906000622.
[62] M. Nauta, R. van Bree, C. Seifert, Neural Prototype Trees for Interpretable Fine-Grained Image Recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14933–14943. Accessed: Jul. 12, 2022. [Online]. Available: https://openaccess.thecvf.com/content/CVPR2021/html/Nauta_Neural_Prototype_Trees_for_Interpretable_Fine-Grained_Image_Recognition_CVPR_2021_paper.html.
[63] C. Chen, O. Li, C. Tao, A.J. Barnett, J. Su, C. Rudin, This looks like that: deep learning for interpretable image recognition, Curran Associates Inc., Red Hook, NY, USA, 2019, pp. 8930–8941.
[64] D. Rymarczyk, Ł. Struski, M. Górszczak, K. Lewandowska, J. Tabor, B. Zieliński, Interpretable Image Classification with Differentiable Prototypes Assignment, arXiv, Dec. 06, 2021. Accessed: Jun. 03, 2022. [Online]. Available: http://arxiv.org/abs/2112.02902.
[65] J.J. Thiagarajan, B. Kailkhura, P. Sattigeri, K.N. Ramamurthy, TreeView: Peeking into Deep Neural Networks Via Feature-Space Partitioning, arXiv:1611.07429 [cs, stat], Nov. 2016. Accessed: Apr. 07, 2022. [Online]. Available: http://arxiv.org/abs/1611.07429.
[66] P. Sattigeri, K.N. Ramamurthy, J.J. Thiagarajan, B. Kailkhura, Treeview and Disentangled Representations for Explaining Deep Neural Networks Decisions, in: 2020 54th Asilomar Conference on Signals, Systems, and Computers, Nov. 2020, pp. 284–288, https://doi.org/10.1109/IEEECONF51394.2020.9443487.
[67] A.D. Kiureghian, O. Ditlevsen, Aleatory or epistemic? Does it matter? Struct. Saf. 31 (2) (Mar. 2009) 105–112, https://doi.org/10.1016/j.strusafe.2008.06.020.
[68] Y. Gal, Z. Ghahramani, Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning, in: Proceedings of The 33rd International Conference on Machine Learning, Jun. 2016, pp. 1050–1059. Accessed: Jul. 08, 2022. [Online]. Available: https://proceedings.mlr.press/v48/gal16.html.
[69] M.C. Darling, D.J. Stracuzzi, Toward Uncertainty Quantification for Supervised Classification, SAND–2018-0032, 1527311, Jan. 2018, https://doi.org/10.2172/1527311.
[70] M. Abdar, et al., A review of uncertainty quantification in deep learning: Techniques, applications and challenges, Information Fusion 76 (Dec. 2021) 243–297, https://doi.org/10.1016/j.inffus.2021.05.008.
[71] J. Lampinen, A. Vehtari, Bayesian approach for neural networks—review and case studies, Neural Netw. 14 (3) (Apr. 2001) 257–274, https://doi.org/10.1016/S0893-6080(00)00098-8.
[72] D.M. Titterington, Bayesian Methods for Neural Networks and Related Models, Stat. Sci. 19 (1) (Feb. 2004) 128–139, https://doi.org/10.1214/088342304000000099.
[73] B. Mihaljević, C. Bielza, P. Larrañaga, Bayesian networks for interpretable machine learning and optimization, Neurocomputing 456 (Oct. 2021) 648–665, https://doi.org/10.1016/j.neucom.2021.01.138.
[74] L.V. Jospin, H. Laga, F. Boussaid, W. Buntine, M. Bennamoun, Hands-On Bayesian Neural Networks—A Tutorial for Deep Learning Users, IEEE Comput. Intell. Mag. 17 (2) (May 2022) 29–48, https://doi.org/10.1109/MCI.2022.3155327.
[75] Z. Eaton-Rosen, F. Bragman, S. Bisdas, S. Ourselin, M.J. Cardoso, Towards Safe Deep Learning: Accurately Quantifying Biomarker Uncertainty in Neural Network Predictions, in: Medical Image Computing and Computer Assisted Intervention – MICCAI, Cham, 2018, pp. 691–699, https://doi.org/10.1007/978-3-030-00928-1_78.
[76] B.H. Menze, et al., The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS), IEEE Trans. Med. Imaging 34 (10) (Oct. 2015) 1993–2024, https://doi.org/10.1109/TMI.2014.2377694.
[77] B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable predictive uncertainty estimation using deep ensembles, in: Advances in Neural Information Processing Systems, 2017.
[78] S. Yang, T. Fevens, Uncertainty Quantification and Estimation in Medical Image Classification, in: Artificial Neural Networks and Machine Learning – ICANN, Cham, 2021, pp. 671–683, https://doi.org/10.1007/978-3-030-86365-4_54.
[79] M.E.E. Khan, A. Immer, E. Abedi, M. Korzepa, Approximate Inference Turns Deep Networks into Gaussian Processes, in: Advances in Neural Information Processing Systems, vol. 32, 2019. Accessed: Jun. 08, 2022. [Online]. Available: https://proceedings.neurips.cc/paper/2019/hash/b3bbccd6c008e727785cb81b1aa08ac5-Abstract.html.
[80] F. D’Angelo, V. Fortuin, Repulsive Deep Ensembles are Bayesian, presented at Neural Information Processing Systems, Jun. 2021. Accessed: Dec. 18, 2022. [Online]. Available: https://www.semanticscholar.org/paper/Repulsive-Deep-Ensembles-are-Bayesian-D’Angelo-Fortuin/be5491660a61d60606aaec8dc0e7e046fb930110.
[81] A. Lucieri, M.N. Bajwa, S.A. Braun, M.I. Malik, A. Dengel, S. Ahmed, On Interpretability of Deep Learning based Skin Lesion Classifiers using Concept Activation Vectors, in: 2020 International Joint Conference on Neural Networks (IJCNN), Jul. 2020, pp. 1–10, https://doi.org/10.1109/IJCNN48605.2020.9206946.
[82] X. Ma, et al., Understanding adversarial attacks on deep learning based medical image analysis systems, Pattern Recogn. 110 (Feb. 2021) 107332, https://doi.org/10.1016/j.patcog.2020.107332.