Engineering Applications of Artificial Intelligence: Hajar Hakkoum, Ali Idri, Ibtissam Abnane
A R T I C L E I N F O

Keywords: Interpretability; XAI; Explainability; Black box; Numerical data; Medicine

A B S T R A C T

The most effective machine learning classification techniques, such as artificial neural networks, are not easily interpretable, which limits their usefulness in critical areas, such as medicine, where errors can have severe consequences. Researchers have been working to balance the trade-off between model performance and interpretability. In this study, seven interpretability techniques (global surrogate, accumulated local effects, local interpretable model-agnostic explanations (LIME), Shapley additive explanations (SHAP), model agnostic post hoc local explanations (MAPLE), local rule-based explanation (LORE), and Contextual Importance and Utility (CIU)) were evaluated to interpret five medical classifiers (multilayer perceptron, support vector machines, random forests, extreme gradient boosting, and naïve Bayes) using six model performance metrics and three interpretability technique metrics across six medical numerical datasets. The results confirmed the effectiveness of integrating global and local interpretability techniques, and highlighted the superior performance of the global SHAP explainer and of local CIU explanations. The quantitative evaluations of explanations emphasised the importance of assessing these interpretability techniques before employing them to interpret black box models.
* Corresponding author.
E-mail addresses: hajar.hakkoum@um5r.ac.ma (H. Hakkoum), ali.idri@um5.ac.ma (A. Idri), abnane.ibtissam@um5s.net.ma (I. Abnane).
https://doi.org/10.1016/j.engappai.2023.107829
Received 6 August 2022; Received in revised form 20 November 2023; Accepted 27 December 2023
particular data point.
Interpretability techniques can also differ, depending on the ML model. Model-specific techniques (for example, decompositional rule extractions) can be applied to a specific type of ML model; these techniques frequently use the model's internals to provide an explanation. Consequently, they are also known as decompositional or internal techniques. There are also model-agnostic techniques (e.g., global surrogates) that construct explanations based on input-output interactions, regardless of the model. These are also known as pedagogical, external, and post hoc techniques.
Errors in medicine are not tolerated because they directly affect the lives of patients. Our systematic literature review (SLR) of 179 articles investigating interpretability in medicine from 1994 to 2020 (Hakkoum et al., 2021a) revealed a strong interest in classification and diagnosis tasks, particularly in oncology. In terms of interpretability techniques, the majority of papers reviewed used global interpretability over local interpretability, with rule-based explanations being the most common global interpretability technique. ANNs and SVMs have received more attention from researchers than other ML models. It was also discovered that evaluating interpretability techniques when they were not rules or trees, which can be easily measured, posed a significant challenge.
The SLR also emphasised the lack of use and comparison of new local interpretability techniques in the medical field. Seventy-two papers studied local techniques, 28 of which performed comparisons. Furthermore, 5 out of the 72 assessed the technique of local interpretable model-agnostic explanations (LIME), and the same number of papers evaluated Shapley additive explanations (SHAP). Very few articles compared global and local techniques simultaneously (only 10 articles). Additionally, quantitative evaluations of new local interpretability techniques, such as LIME and SHAP, have never been performed, and researchers frequently opted for a qualitative description or comparison. For instance, in our previous empirical evaluation (Hakkoum et al., 2021b) of LIME and two global techniques, the partial dependence plot (PDP) and feature importance (FI), we used a descriptive rather than a quantitative evaluation. It was also an evaluation of a local technique using one dataset, which jeopardised the validity of the experiment and weakened its conclusions. Nonetheless, it showed how well LIME agreed with the global explanations for the original Wisconsin breast cancer dataset (Dua and Graff, 2017).
The motives for evaluating interpretability techniques stem from the need to enhance the transparency of complex models, particularly in the medical domain. There is a demand for explainable models that can elucidate the rationale behind decisions made by ML-based medical models. Local post hoc interpretability techniques, like SHAP and LIME, are valuable in deciphering the key factors and rules contributing to individual predictions. However, comparing such techniques necessitates both qualitative and quantitative evaluations, which should rely on objective metrics instead of subjective human judgment. Ahmed and Alpkoçak (Nizar Abdulaziz Mahyoub and Alpkoçak, 2022) introduced a novel strategy that employs the structure of a DT as a proxy. By mapping the output of interpretability techniques, specifically SHAP and LIME feature scores, onto a DT, two primary complexity metrics are proposed: the total depth of the tree and the average weighted class depth. Through this approach, they demonstrated that SHAP is superior to LIME in terms of complexity and scalability, providing insights into suitable interpretability techniques for varying document scales and identifying features to enhance the performance of the ANN they trained on the cardiovascular dataset OHSUMED.
As a result, the current study compares, qualitatively and quantitatively, three known global interpretability techniques (global surrogates using DTs, Accumulated Local Effects (ALE), and the global summary plot of SHAP) and five newly introduced local techniques (LIME, SHAP, Model Agnostic Post-hoc Local Explanations (MAPLE), Local Rule-Based Explanation (LORE), and Contextual Importance and Utility (CIU)). Five black box supervised learners (MLP, SVM, Random Forests (RFs), eXtreme Gradient Boosting (XGB), and naïve Bayes (NB)) were trained and optimised using particle swarm optimisation (PSO) on six numerical medical datasets available from the UCI online repository (Dua and Graff, 2017): Wisconsin breast cancer original (9 features, 699 instances), Wisconsin breast cancer diagnosis (32 features, 569 instances), diabetic retinopathy Debrecen (20 features, 1151 instances), Parkinson's disease (23 features, 197 instances), and heart (SPECT and SPECTF, with 22 and 44 features, respectively, and a size of 267 instances). The five ML models were compared over each dataset based on their accuracy values using the Scott-Knott (SK) statistical test (Jelihovschi et al., 2014) and the Borda count voting system. Then, the seven interpretability techniques were applied; the global surrogate was first assessed using its fidelity to the black box, its accuracy and comprehensibility, and thereafter by comparing its feature importance scores to those of SHAP as well as those generated from a white box (DT) using Kendall's rank correlation. At a local scale, LIME, SHAP, MAPLE, LORE, and CIU were evaluated and compared using three metrics: faithfulness, monotonicity, and execution time. Finally, the gap between the accuracy of the model and its interpretability offered by the techniques used (fidelity for the global surrogate and the average faithfulness for local techniques) was analysed in an attempt to determine the performance-interpretability trade-off.
This study addressed the following research questions (RQs).
RQ1: What is the overall accuracy of each constructed model?
RQ2: What is the global interpretability of each constructed model?
RQ3: What is the local interpretability of each constructed model?
RQ4: Is there a relationship between accuracy and interpretability for each model? Which model is most vulnerable to the gap?
The contributions of this empirical evaluation are summarised as follows.
• Quantitative comparison of global and local interpretability techniques (global surrogate, ALE, SHAP, LIME, MAPLE, LORE, and CIU) on the basis of fidelity, comprehensibility, faithfulness, monotonicity, and time.
• Discovering whether these interpretability techniques solve the accuracy-interpretability trade-off and whether one model is more vulnerable to the trade-off than the others.
The remainder of this paper is organised as follows: Section 2 provides an overview of the classifiers, namely SVM, MLP, RF, XGB, NB, and DTs, along with a presentation of the seven interpretability techniques: global surrogate, ALE, SHAP (with two functions), LIME, MAPLE, LORE, and CIU. In Section 3, we discuss related research on the application of black box interpretability techniques in the field of medicine. The specifics of the datasets, performance measures, and statistical tests employed to select the best performing models and techniques are outlined in Section 4. Section 5 presents the experimental design used for the empirical evaluation. Section 6 summarises and discusses the findings of our study. In Section 7, we thoroughly examine potential threats to the validity of our research. Finally, Section 8 concludes the paper by discussing the implications of the study's results and proposing avenues for future work.

2. Background

This section presents an overview of the constructed models, the optimisation algorithm used, and the interpretability techniques investigated.

2.1. ML models and optimisation algorithm

2.1.1. Multilayer perceptron (MLP)
Multilayer perceptrons (MLPs) are composed of multiple layers of interconnected nodes (neurones). MLPs are feedforward networks, in which the information flows in only one direction, from the input layer (data) to the output layer (prediction), through a series of intermediate layers called hidden layers that learn to represent higher-level features that are useful for predicting the output variables.
The layers are composed of a set of neurones connected to each other by weights to represent a nonlinear mapping (Gardner and Dorling, 1998). Each neuron in an MLP receives input signals from the neurones in the previous layer, applies an activation function to the sum of the weighted inputs, and passes its output to the neurones in the next layer. The weights of the connections between neurones are learned during the training process using a variant of the backpropagation algorithm (Idri et al., 2002).
The MLP takes the input data, propagates it through the network, and generates output predictions. During the training process, the weights of the connections between neurones in the network are adjusted to minimise the difference between the predicted and true outputs. This is typically accomplished using an optimisation algorithm, such as stochastic gradient descent.
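The paper does not include code, but as a rough illustration of the kind of MLP classifier described above, a scikit-learn sketch follows. MLPClassifier, the dataset, and the hyperparameter values are assumptions chosen for illustration; the authors tuned epochs, batch size, learning rate, and hidden neurones, which maps equally well onto a Keras-style model.

```python
# Illustrative only: a small feedforward MLP trained with backpropagation,
# as described in Section 2.1.1 (not the authors' code).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# One hidden layer; the connection weights are adjusted by a stochastic-gradient-style optimiser.
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64,), learning_rate_init=0.001,
                  batch_size=64, max_iter=300, random_state=0),
)
mlp.fit(X_train, y_train)
print("test accuracy:", mlp.score(X_test, y_test))
```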
2.1.2. Support vector machines (SVMs)
SVMs are a type of supervised classification algorithm based on statistical learning theory. Their primary objective is to separate classes using an optimal hyperplane that maximises the distance between them. The data points closest to this hyperplane are called support vectors and are considered the most critical elements of the training set.
SVMs can also be transformed into nonlinear classifiers by using nonlinear kernels. One such kernel is the radial basis function (RBF) kernel, which maps data into a higher-dimensional space to achieve better class separation. SVMs incorporate a penalty parameter that allows for some degree of misclassification by allowing training points to be on the wrong side of the hyperplane. Increasing the penalty parameter increases the cost of misclassifying points and pushes for the development of a more accurate model; however, this may result in a less generalisable model.

2.1.3. Decision trees (DTs)
DTs build a hierarchical structure of decision nodes and leaf nodes based on features and their thresholds (Quinlan, 1986). The decision nodes represent conditions or questions regarding the features, and the leaf nodes represent the predicted outcomes or target values. DTs recursively split the data based on the selected features to maximise the information gain (or minimise the impurity) at each node. This process creates a tree-like structure that can be used to make predictions or draw insights into the relationships between the features and the target variable. DTs are interpretable (unlike MLP, SVM, RFs, XGB, and NB) and can handle both numerical and categorical data, making them widely used in various domains.

2.1.4. Random forests (RFs)
RF (Breiman, 2001) is an ensemble learning method that combines multiple DTs to make predictions. It operates by training a set of DTs on different subsets of the data and features. Each tree makes independent predictions, and the final prediction is obtained by aggregating the individual tree predictions. RFs provide robust and accurate predictions by reducing overfitting and capturing the collective wisdom of an ensemble of trees.

2.1.5. Extreme gradient boosting (XGB)
XGBoost (Chen and Guestrin, 2016) is a popular gradient boosting algorithm designed to enhance the performance of predictive models. It sequentially trains an ensemble of weak prediction models such as DTs by focusing on instances that were incorrectly predicted in previous iterations. XGBoost optimises a specific loss function by leveraging gradient descent techniques to minimise the loss and improve the overall model performance. While RFs train individual DTs independently using random subsets of the data and features, and combine the predictions of the DTs through majority voting or averaging to make the final prediction, XGBoost trains DTs sequentially, focusing on instances with higher errors from previous iterations, and combines the predictions of the weak models by assigning weights to each model based on its performance, with a stronger emphasis on models that contribute more to reducing the loss. XGBoost further optimises the loss function and applies regularisation techniques to improve model performance.

2.1.6. Naïve Bayes (NB)
The naïve Bayes classifier is a fundamental probabilistic machine learning algorithm designed for classification tasks. Its core principle is rooted in Bayes' theorem, a fundamental concept in probability theory. The classifier calculates the probability of each possible class label for a given input by combining two key components: the prior probability of each class and the likelihood of observing the input's features given each class. What sets the naïve Bayes classifier apart is its assumption of feature independence, implying that the presence or absence of one feature is unrelated to the presence or absence of other features. This simplifying assumption allows the algorithm to calculate probabilities efficiently, but it might not always align with the real-world data-generating process. Additionally, some variants of naïve Bayes, such as Gaussian naïve Bayes, introduce a smoothing technique called "var_smoothing" to prevent zero probabilities in cases where certain feature-class combinations are absent in the training data. This technique adds a small constant to the variance of each feature, ensuring non-zero probabilities and stabilising the calculations.
While the naïve Bayes classifier is conceptually interpretable due to its reliance on probabilistic reasoning, the intricate internal calculations of probabilities, along with the underlying assumption of feature independence, can render it somewhat opaque and challenging to fully grasp, thus positioning it as a black box classifier to varying degrees depending on the complexity of the data and the specific use case.

2.1.7. Particle swarm optimisation (PSO)
Designing black box models with optimised hyperparameters remains a significant challenge. To address this, the biologically inspired approach of particle swarm optimisation (PSO) (Kennedy and Eberhart, 1995) was employed in this study. PSO operates on the premise that a bird's knowledge and experience can be shared with the entire group. By mimicking the movement of a flock of birds, in which each bird attempts to find an optimal solution within a solution space, the group's best solution becomes the PSO optimal solution in that space. Although it cannot be definitively proven that this solution is the true global optimum, it is often very close to the global optimal value (Tam, 2021).
Optunity, a Python library for hyperparameter optimisation, was used in this study. It provides a variety of optimisation methods, ranging from basic methods such as grid search and random search (Bergstra and Bengio, 2012) to evolutionary methods such as PSO, which is currently the method of choice owing to its high performance (Claesen et al., 2014).

2.2. Interpretability techniques

2.2.1. Global surrogates
Global surrogates are a class of ML models trained to approximate the behaviour of black box models across the entire input space. By training global surrogates on the same input-output pairs used to train a black box, it is possible to gain insights into the underlying logic of the black box model.
Global surrogates can take a variety of forms (any transparent ML model), including DTs, in which the labels predicted by the black box are used instead of the true labels of a dataset. The data on which a global surrogate is trained are often called oracle data. The oracle data reflect the behaviour of the black box model and not reality (Molnar, 2021).
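As a concrete illustration of the oracle-data idea (a sketch under assumed models and settings, not the authors' implementation), a DT surrogate can be fitted to the labels predicted by any black box, and its fidelity, depth, and number of leaves, the measures used later in this study, can be read off directly:

```python
# Sketch: global surrogate DT trained on "oracle" labels produced by a black box.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

black_box = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Oracle data: same inputs, but labelled by the black box instead of the ground truth.
y_train_bb, y_test_bb = black_box.predict(X_train), black_box.predict(X_test)

surrogate = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train_bb)

fidelity = accuracy_score(y_test_bb, surrogate.predict(X_test))   # agreement with the black box
accuracy = accuracy_score(y_test, surrogate.predict(X_test))      # agreement with the true labels
print(f"fidelity={fidelity:.3f}, accuracy={accuracy:.3f}, "
      f"depth={surrogate.get_depth()}, leaves={surrogate.get_n_leaves()}")
```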
Table 1
Related works with their findings.

Authors | Black box ML models | Interpretability technique | Global/local | Metrics | Medical datasets | Findings
Lakkaraju et al. (2020) | Deep ANNs, XGB, RFs, and SVM | LIME, SHAP, MUSE, ROPE | Global and local | Fidelity, robustness, stability (on synthetic data) | Electronic health records | ROPE explanations improve robustness and are more structurally similar compared to those generated by LIME, SHAP, or MUSE.
Adhikari et al. (2019) | RF, SVM | LIME, LEAFAGE | Local | Fidelity (area under the ROC) | Breast Cancer Original Wisconsin | LIME outperforms LEAFAGE on linear ML models, while LEAFAGE achieves better results on non-linear models.
El Shawi et al. (2019) | RF | LIME, Anchors, SHAP | Local | Identity, stability, separability, similarity, execution time, bias detection | Mortality, diabetes | LIME performs the worst in terms of the identity metric but the best in terms of the separability metric. SHAP has the shortest average time to output an explanation and was more effective in enabling bias detection. Techniques were ranked SHAP, Anchors and LIME for enabling correct bias detection.
Zhang et al. (2019) | ANN ensemble | Global surrogate | Global | Fidelity (accuracy) | Breast cancer, diabetes, hepatitis, heart, liver | The use of oracle data of the ensemble led to an increase in test set accuracy of the DT. The latter did better when compared to a default implementation of the J48 DT.
Zhou and Jiang (2004) | ANN ensemble | Global surrogate (NeC4.5) | Global | Accuracy, size of trees | Breast cancer, diabetes, heart, liver | NeC4.5 is time consuming but stronger than the C4.5 DT.
De Laet and Huysmans (2021) | - | LIME, LORE | Local | End users | Student success | LIME provided information about every feature, while LORE only focused on features present in the decision rule yet presents a simpler visualisation.
Knapič et al. (2021) | Convolutional ANNs (CNNs) | LIME, SHAP, CIU | Local | Human evaluation | Images from video capsule endoscopy | CIU outperformed LIME and SHAP in improving human decision-making, transparency, and understandability. CIU also generates explanations faster.
2.2.2. Accumulated local effects (ALE)
An ALE plot is a visual representation that provides insights into the relationship between a feature and a model's predictions (Apley and Zhu, 2016). It shows the accumulated impact of a specific feature on the predictions while considering interactions with other features. The plot displays how the average predictions change as the feature of interest varies within its observed range, while keeping other features fixed. The ALE plot helps users understand the nonlinear relationship between the feature and the model predictions, capturing both the main effect and potential interactions. By examining the ALE plot, one can gain insights into the impact of the feature on the model's output and identify any nonlinear patterns or dependencies between the feature and the predictions.

2.2.3. Shapley additive exPlanations (SHAP)
Shapley values (Shapley, 1952) are a concept from cooperative game theory in which the payout of a "game" is fairly distributed among its players. In ML, the players are the features, because they play together and interact with each other to produce an outcome, which is the prediction.
Lundberg and Lee (2017) proposed a framework (SHAP) that computes an approximation of the Shapley values and provides global and local explanations. The summary plot in SHAP is a visual representation that provides insights into the global feature importance and its impact on model predictions. It shows the overall contribution of each feature to the output of the model. The plot displays the Shapley values, which represent the average contribution of each feature across different instances. Features with larger Shapley values have a greater impact on the predictions, whereas those with smaller values have less influence. The summary plot helps users understand the relative importance of different features and their directionality (whether they contribute positively or negatively to the predictions). By examining the summary plot, one can gain a holistic understanding of how the features collectively contribute to the model's output and identify the most influential factors driving the predictions.
SHAP also represents the Shapley value explanation as an additive feature attribution method, which is a local linear model that connects LIME and SHAP. To calculate the coalitions, a feature entry of 1 is considered present and replaced with its original value. An entry of zero refers to an absent value; therefore, it can be replaced with a random value from the dataset. Consequently, the feature attributions, which are approximations of the Shapley values (linear model weights), are computed.

2.2.4. Local interpretable model-agnostic explanations (LIME)
Unlike global surrogates, which approximate black box behaviour across the entire input space, LIME (Ribeiro et al., 2016) interprets predictions at the local level. LIME generates a surrogate interpretable model in the local neighbourhood of a particular data point. The local surrogate is trained on a perturbation of the data point's features. This new dataset is weighted with respect to its proximity to the data point, and a local surrogate is trained on it. Ribeiro et al. (2016) also introduced a submodular pick algorithm, in which the user is shown the explanations of different relevant instances from the test set to give a sense of how the features affect the black box decision.

2.2.5. Model agnostic post-hoc local explanations (MAPLE)
MAPLE (Plumb et al., 2018) is another model-agnostic interpretability technique that can be applied to any ML model, regardless of its underlying architecture or algorithm. It operates in a post hoc manner, which means it analyses the model's output after it has made predictions. The main idea behind MAPLE is to combine RFs with feature selection and return feature importance explanations. This is achieved using two techniques called SILO and DStump (Plumb et al., 2018). SILO stands for subgroup instance level optimisation and helps identify relevant subgroups within the data that have similar predictions based on the RF leaves. A DStump, on the other hand, refers to a decision stump, which is a simple DT with one internal node and two leaf nodes. DStump ranks features to solve a weighted linear regression similar to LIME, providing interpretable insights into the decision-making process of the black box model.
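To make the SHAP and LIME workflows of Sections 2.2.3 and 2.2.4 concrete, a minimal sketch using the shap and lime packages is shown below; the model, dataset, and settings are placeholders, not those of the experiments.

```python
# Sketch: SHAP and LIME local explanations for one test instance (illustrative only).
import numpy as np
import shap
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# SHAP: Shapley-value attributions for the first test instance.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test[:1])      # per-class attributions for one instance
print("first SHAP attributions:", np.round(np.array(shap_values)[-1].ravel()[:5], 3))

# LIME: local surrogate fitted around the same instance.
lime_explainer = LimeTabularExplainer(X_train, feature_names=list(data.feature_names),
                                      class_names=list(data.target_names),
                                      discretize_continuous=True)
exp = lime_explainer.explain_instance(X_test[0], model.predict_proba, num_features=5)
print(exp.as_list())                                  # (feature condition, local weight) pairs
```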
2.2.6. Local rule-based explanation (LORE)
LORE (Guidotti et al., 2018) is a local model-agnostic technique for black box interpretability that employs a genetic algorithm to extract local rule-based explanations for specific instances or regions. LORE uses a decision tree classifier to generate a set of interpretable if-then rules that approximate the decision-making process of the black box model. These rules capture important features and their impact on the predictions. Additionally, LORE provides a pair of explanations consisting of logic rules that describe the decision boundaries and counterfactual rules that explain how changing the values of certain features would alter the prediction. This combination of logic and counterfactual rules helps users gain insights into the decision logic of the black box model in a local and interpretable manner.

2.2.7. Contextual Importance and Utility (CIU)
The CIU explanation (Främling, 1996; Anjomshoae et al., 2020) is another post hoc local interpretability technique. It builds on the observation that the importance of a feature in one context may be irrelevant in another, which is why contextual utility is introduced alongside importance to estimate the usefulness of the feature for the prediction. CIU analyses the context variables associated with the data and quantifies their impact on the model's decision-making process. By assessing the importance of these contextual factors, CIU provides insights into the aspects of the input that are most influential in driving the model's predictions.

3. Related work

Our SLR (Hakkoum et al., 2021a) revealed that medical interest in local interpretability increased in recent years, particularly in 2019. This can be attributed to the fact that research no longer believes that the workings of a complex model can be understood in a generalisation context (i.e., it is difficult to reveal the process of the nonlinear relationships learned by the black box), or to the fact that medicine encourages personalised solutions, as they can help in monitoring and treatment tasks. The SLR could not present an analysis of the comparisons between the individual techniques because they were only a few, especially local ones. Instead, the review settled on comparing clusters of techniques, based on the explanation type. This shows the importance of, and need for, more experiments to clarify the differences between these techniques in different domains and for different black box models, and to facilitate their adoption in real-life scenarios.
This section surveys similar or related works which investigate interpretability techniques. Table 1 summarises the use of interpretability techniques in the medical field with a degree of quantitative evaluation. Moreover, Table 2 reports some pros and cons of the interpretability techniques used.

Table 2
Strengths and limitations of interpretability techniques.

Technique | Strengths | Limitations
LIME | Good separability (El Shawi et al., 2019). | May be sensitive to local perturbations (Lundberg and Lee, 2017).
SHAP | Provides both local and global interpretability and is theoretically grounded in cooperative game theory (Lundberg and Lee, 2017). | Computationally expensive for high-dimensional spaces and large datasets (Lundberg and Lee, 2017).
MAPLE | Offers interpretable explanations using logic rules, which are very easy to follow (Plumb et al., 2018). | Requires pre-defined rule templates (Plumb et al., 2018). May struggle to capture complex and non-linear relationships.
LORE | Generates human-readable rules for interpretability (Guidotti et al., 2018). | Prone to overfitting because it only focuses on features present in the decision rule (De Laet and Huysmans, 2021).
CIU | Understandable and fast (Knapič et al., 2021). It considers the usefulness of features in specific contexts (Anjomshoae et al., 2020). | -
Global surrogate | Offers a simpler and more interpretable model as a substitute for the complex model (Molnar, 2021). | May not accurately represent reality since it is based on the complex model's behaviour. May be time consuming (Zhou and Jiang, 2004). Explains ML black boxes using more ML techniques (white boxes) (Molnar, 2021).
ALE | Captures the average effect of features on the model's predictions (Apley and Zhu, 2016). | Requires defining bins or intervals for continuous variables and may not fully capture complex interactions between features (Apley and Zhu, 2016).

The literature on interpretability in medicine suggests that rule- or tree-based explanations are most commonly used (Hakkoum et al., 2021a). Global surrogates, which are a type of rule- or tree-based explanation, are commonly used in various ways. For example, Zhou and Jiang (2004) developed a global surrogate called NeC4.5, which combines the strengths of ANNs and the C4.5 tree algorithm. Similarly, Zhang et al. (2019) conducted experiments with different datasets using both training datasets and ensemble training datasets (oracle data), either combined or separately. Both studies compared their global surrogate approaches to simple DTs trained using normal datasets, and found that involving the black box model resulted in improved performance.
Despite the small number of papers investigating and comparing local interpretability techniques, we were able to find some studies that strengthened their evaluations with quantitative results. Lakkaraju et al. (2020) compared LIME (Ribeiro et al., 2016), SHAP (Lundberg and Lee, 2017), and model understanding through subspace explanations (MUSE) (Lakkaraju et al., 2019) to their proposed method, robust post hoc explanations (ROPE), using fidelity. Experiments were conducted on an electronic health records dataset and two non-medical datasets (Lakkaraju et al., 2016) to analyse the explanations produced by several ML black boxes, including DNNs, XGB, RFs, and SVM. They adapted LIME and SHAP to generate global explanations using the submodular pick procedure, which selects a set of representative points from the dataset and combines their local models to form a global explanation.
Adhikari et al. (2019) compared LIME and LEAFAGE (local example and feature importance-based model agnostic explanations) using a similar approach to quantitatively evaluate explanations. They investigated the interpretability of RF and SVM on the original breast cancer dataset (Dua and Graff, 2017), as well as two other non-medical datasets. For each test instance, they chose a radius by expanding it until the corresponding hypersphere included a percentage of instances that did not have the same predicted label as the test instance. The scores given by the local explanations were then compared to the scores given by the black box classifier on all test instances that fell into this hypersphere, using the area under the ROC (AUC) as fidelity. As shown in Table 1, LIME performed better than LEAFAGE on linear ML models, whereas LEAFAGE performed better on nonlinear models.
El Shawi et al. (2019) systematically examined the effectiveness of three distinct interpretability techniques, LIME, SHAP, and Anchors, within the context of a trained RF model. The primary focus of this investigation encompassed diverse aspects, including the temporal demands of these techniques, their resilience to input perturbations, and their adeptness in identifying biases, particularly those discernible through visual inspection rather than quantitative measurement. The study's findings revealed detailed information about how these techniques perform: LIME exhibited the least favourable outcomes in terms of identity, which basically checks how well the interpretability technique keeps the original details and relationships present in the black box model's predictions. In contrast, LIME performed better when it comes to separability, a metric that essentially measures the technique's skill in telling apart the separate impacts of different features on the model's predictions. SHAP turned out to be the fastest technique, needing the least time on average to explain the model's results and managing to give understandable insights quickly compared to the other techniques.
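Returning to the CIU technique of Section 2.2.7, contextual importance (CI) and contextual utility (CU) can be illustrated with a small NumPy sketch that perturbs one feature over its observed range. This follows Främling's definitions and is only an illustration; the experiments presumably relied on an existing CIU implementation.

```python
# Sketch of CIU for a single feature j of one instance x (illustrative only).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def ciu_feature(model, X, x, j, n_samples=50, target_class=1):
    """Contextual importance/utility of feature j for instance x (Framling-style)."""
    grid = np.linspace(X[:, j].min(), X[:, j].max(), n_samples)
    variants = np.tile(x, (n_samples, 1))
    variants[:, j] = grid                        # vary only feature j over its observed range
    probs = model.predict_proba(variants)[:, target_class]
    cmin, cmax = probs.min(), probs.max()
    out = model.predict_proba(x.reshape(1, -1))[0, target_class]
    ci = cmax - cmin                             # output already lies in [0, 1]
    cu = 0.5 if cmax == cmin else (out - cmin) / (cmax - cmin)
    return ci, cu

ci, cu = ciu_feature(model, X, X[0], j=0)
print(f"CI={ci:.3f}, CU={cu:.3f}")
```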
Six publicly available datasets were used for the experiments conducted in this comparative study. Table 3 describes the datasets in terms of the number of instances, classes, and attributes, as well as the type of attributes (integer, real, or binary). The datasets were taken from the UCI repository and are all connected to the medical field; however, they present different levels of evaluation for this experiment because the number of instances and attributes varied: from 1151 (diabetes dataset) to 197 (Parkinson's dataset) for instances and from 10 (breast cancer …

4.2.2. Interpretability metrics
Measuring interpretability still represents a challenge for the ML community because interpretability can take different forms, and thus can be evaluated differently, which sometimes makes it difficult to compare many techniques with the same metric.
On a global scale, accuracy, comprehensibility, and fidelity to the black box model were used for the global surrogate.
Meanwhile, Kendall's rank correlation was used to compare the feature importance resulting from the global surrogate, the SHAP summary plot, and a white box model (DT) trained directly on the data. On a local scale, the average of the faithfulness metric over the test set instances, the percentage of monotonic instances in the test set, and the execution time were used to evaluate and compare LIME, SHAP, MAPLE, LORE, and CIU. These metrics are defined as follows.
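The formal definitions are not reproduced in this excerpt; the sketch below assumes the commonly used formulations, i.e. faithfulness as the correlation between attribution scores and the prediction drop observed when each feature is replaced by a baseline value, and monotonicity as a check that adding features in order of increasing importance never decreases the predicted probability. This is a hedged sketch, not the paper's exact formulas.

```python
# Sketch of faithfulness and monotonicity for one instance (assumed standard
# definitions, e.g. as implemented in common XAI toolkits; not the paper's code).
import numpy as np

def faithfulness(predict_proba, x, attributions, baseline, target_class):
    """Correlation between |attribution| and the probability drop when the
    corresponding feature is set to its baseline value."""
    base_prob = predict_proba(x.reshape(1, -1))[0, target_class]
    drops = []
    for j in range(len(x)):
        x_pert = x.copy()
        x_pert[j] = baseline[j]
        drops.append(base_prob - predict_proba(x_pert.reshape(1, -1))[0, target_class])
    return np.corrcoef(np.abs(attributions), np.array(drops))[0, 1]

def monotonicity(predict_proba, x, attributions, baseline, target_class):
    """True if adding features from least to most important never decreases
    the predicted probability of the target class."""
    order = np.argsort(np.abs(attributions))     # least important first
    x_build = baseline.copy()
    probs = []
    for j in order:
        x_build[j] = x[j]
        probs.append(predict_proba(x_build.reshape(1, -1))[0, target_class])
    return bool(np.all(np.diff(probs) >= 0))
```

In the experiments, these two scores are aggregated over the test set as the average faithfulness (Fμ) and the percentage of monotonic instances (M%) reported later in Table 14.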
RF, XGB, and NB) and six datasets. The hyperparameters of the models were selected based on their accuracy with 10-fold cross-validation using the PSO algorithm. The performance of each model was evaluated using the performance metrics on the test sets. The accuracy metric was used to conduct the SK test across the datasets, whereas the six metrics defined in Section 4 for model performance were employed by the Borda count.

5.2. Step 2: global interpretability

Fig. 2 illustrates how the global surrogate using a DT was trained for each experiment using the black box model predictions (Ytrain-BB and Ytest-BB) as the class labels instead of the original ground-truth labels. This new dataset is referred to as the Oracle dataset (Johansson and Niklasson, 2009). The DTs learned from the Oracle dataset reflect the behaviour of the black box model and not the ground-truth labels, because the DTs do not have access to them (Molnar, 2021). The fidelity of the global surrogate to the black box model was assessed using the accuracy, the depth of the resulting DT, and the number of leaves. The fidelity-based SK test and the Borda count (based on fidelity, accuracy, depth, and number of leaves) were conducted to identify the best-performing global surrogate model for each dataset.
On the other hand, a total of 30 SHAP summary plots were generated, one for each experiment, representing the combinations of models and datasets. These plots were initially compared with the feature importance derived from the global surrogate decision trees (DTs). Subsequently, both the SHAP and global surrogate rankings were compared individually against the rankings produced by the interpretable model, which consisted of a DT classifier constructed directly on the raw datasets. These comparisons were based on Kendall's rank correlation metric.
Finally, ALE plots were generated for the most consensually identified features. These plots provide insights into how these particular features influence the behaviour of the black box models. By examining these ALE plots, we can gain insights into the intricate relationships between these features and the model's predictions. This knowledge can potentially inform medical practitioners and researchers about the specific mechanisms through which certain important features influence prediction, thereby enhancing our understanding of the underlying factors driving the model's decision-making process in the medical domain.

5.3. Step 3: local interpretability

In this step, the five local interpretability techniques, LIME, SHAP, MAPLE, LORE, and CIU, were applied to each instance in the test set for each experiment. For each instance, the explanations generated by the five techniques were evaluated in terms of faithfulness and monotonicity, which were computed based on the local importance of the instance features generated by each technique. In addition, the time required by each technique to generate explanations for our six models (MLP, SVM, RF, XGB, and NB, plus the DT white box) was recorded to enable a later comparison of the techniques in terms of execution time. Furthermore, the SK significance test was conducted using the average faithfulness on the test set to identify the best performing local technique across the different datasets.

5.4. Step 4: accuracy-interpretability trade-off

An accuracy-interpretability analysis was conducted for each experiment to delve into the trade-off between these two crucial factors. The aim was to uncover the extent of the trade-off gap and identify the models that are particularly susceptible to interpretability constraints, especially when considering a white box model (DT). This step involved comparing the performance accuracy of the models 1) with the fidelity of the global surrogate and 2) with the average local faithfulness across all instances in the test set for each local interpretability technique. By exploring this interplay, we sought to gain a comprehensive understanding of the models' susceptibility to interpretability limitations and shed light on the models that strike an optimal balance between accuracy and interpretability.

6. Results and discussion

This section presents and discusses the findings of this empirical evaluation in order to answer the RQs defined in Section 1. The experiments were conducted on a Lenovo Legion laptop with a hexa-core Intel Core i7-9750H processor, 16 GB of RAM, and a base speed of 2.59 GHz, running Windows 10.

6.1. Model construction, performance and validation (step 1)

This step aimed to build and test five optimised models (MLP, SVM, RF, XGB, and NB) on six datasets.
Table 4 presents the accuracy scores of the five black box models without any parameter tuning (default) on the six datasets. Overall, these scores, ranging from 0.538 to 0.973, show that some of the models perform fairly well. The accuracy levels between the various datasets differ noticeably. For instance, the accuracy scores for all models on the BCD dataset, which range from 0.938 to 0.974, are consistently high. The accuracy scores for the Parkinson's dataset, on the other hand, range from 0.538 to 0.923. SVM, NB, and XGB appear to consistently achieve high accuracy scores across a variety of datasets when comparing the performance of the models. SVM, for example, excels on BC and SPECTF, whereas NB excels on datasets like BCD, SPECT, and Parkinson's. Meanwhile, XGB outperforms the other classifiers on SPECTF and Parkinson's.

Table 4
Models accuracy with default hyperparameters.

Dataset | BC | BCD | Diabetes | SPECT | SPECTF | Parkinson's
MLP | 0.948 | 0.938 | 0.766 | 0.753 | 0.740 | 0.538
SVM | 0.970 | 0.938 | 0.653 | 0.753 | 0.827 | 0.641
RF | 0.963 | 0.973 | 0.714 | 0.740 | 0.827 | 0.897
XGB | 0.956 | 0.956 | 0.709 | 0.740 | 0.827 | 0.923
NB | 0.963 | 0.974 | 0.571 | 0.790 | 0.728 | 0.743
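The hyperparameter search described in Step 1 relies on Optunity's particle swarm solver with 10-fold cross-validation. A minimal sketch of that pattern is shown below; the SVM, its parameter ranges, and the dataset are placeholders, and the snippet assumes Optunity's documented cross_validated/maximize API.

```python
# Sketch: PSO hyperparameter tuning with Optunity (illustrative ranges only).
import numpy as np
import optunity
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

@optunity.cross_validated(x=X, y=y, num_folds=10)
def svm_accuracy(x_train, y_train, x_test, y_test, C, gamma):
    # Objective returned to the particle swarm solver: mean fold accuracy.
    model = SVC(C=C, gamma=gamma).fit(x_train, y_train)
    return float(np.mean(model.predict(x_test) == y_test))

# 'particle swarm' is Optunity's PSO solver; the keyword ranges are box constraints.
best_params, info, _ = optunity.maximize(svm_accuracy, num_evals=100,
                                         solver_name='particle swarm',
                                         C=[0.1, 100], gamma=[1e-4, 1])
print(best_params, info.optimum)
```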
Subsequently, the models were optimised using the PSO algorithm on the basis of accuracy. Table 5 lists the optimal hyperparameters chosen by PSO. For MLP, the BC dataset required the highest number of epochs (386) and batch size (115), whereas the other datasets ranged from 64 to 47 for the batch size and between 373 and 96 for the optimal number of epochs. The SPECTF dataset had the highest learning rate (0.3272), whereas BCD had the highest number of hidden neurones (389). For the SVM, the penalty parameter C was highest for the diabetes dataset (93.4) and lowest for the BCD dataset (32.4). On the other hand, RF and XGB had the same optimised hyperparameter (the number of estimators), which allows comparison. It should be noted that XGB required a smaller number of estimators for BC, SPECT, SPECTF, and Parkinson's, with 50, 85, 139, and 148 estimators, respectively, compared to RF with 103, 92, 187, and 185 estimators, respectively. For NB, the optimal 'var_smoothing' values tend to be in the range of e-10 to e-07 across the different datasets. This suggests that a moderate amount of smoothing is effective for handling the probabilities and improving the classification accuracy of the black box model.

Table 5
Model optimised hyperparameters using PSO.
Model | Hyperparameter | BC | BCD | Diabetes | SPECT | SPECTF | Parkinson's

Tables 6–8 list the metric values for each experiment. Table 6 reports a comparison between SVM and MLP in terms of the different metrics across the datasets. The reported accuracy values indicate that SVM outperformed MLP on all datasets, except for BCD, where MLP exhibited a slight advantage with a difference of 0.9%. Similarly, Table 7 demonstrates that RF performed better than XGB on the BC and BCD datasets, whereas both models achieved similar accuracies on SPECT, SPECTF, and Parkinson's. Meanwhile, Table 8 reports the performance results of the trained NB models on the six datasets. NB showcased very low accuracy on the Diabetes, SPECTF and Parkinson's datasets (0.576, 0.728, and 0.795 respectively), similar to MLP (0.714, 0.679, and 0.794 respectively).

Table 6
Performance results for MLP and SVM (MLP / SVM per dataset).

Metric | BC | BCD | Diabetes | SPECT | SPECTF | Parkinson's
Accuracy | 0.956 / 0.978 | 0.947 / 0.938 | 0.714 / 0.735 | 0.691 / 0.753 | 0.679 / 0.802 | 0.794 / 0.820
Precision | 0.981 / 0.966 | 1.00 / 0.928 | 0.822 / 0.868 | 0.825 / 0.883 | 0.954 / 0.820 | 0.875 / 0.903
Recall | 0.9138 / 0.982 | 0.860 / 0.907 | 0.617 / 0.617 | 0.787 / 0.803 | 0.636 / 0.969 | 0.875 / 0.875
F1_Score | 0.946 / 0.974 | 0.925 / 0.917 | 0.705 / 0.721 | 0.806 / 0.841 | 0.763 / 0.888 | 0.875 / 0.888
Kappa | 0.909 / 0.955 | 0.884 / 0.868 | 0.438 / 0.483 | 0.050 / 0.291 | 0.321 / 0.052 | 0.303 / 0.422
AUC | 0.997 / 0.997 | 0.998 / 0.971 | 0.804 / 0.822 | 0.687 / 0.784 | 0.768 / 0.905 | 0.714 / 0.817

Table 7
Performance results for RF and XGB.
Dataset | BC | BCD | Diabetes | SPECT | SPECTF | Parkinson's

Table 8
Performance results for NB.

Metric | BC | BCD | Diabetes | SPECT | SPECTF | Parkinson's
Accuracy | 0.963 | 0.974 | 0.576 | 0.802 | 0.728 | 0.795
Precision | 0.949 | 1 | 0.826 | 0.903 | 0.978 | 0.900
Recall | 0.965 | 0.930 | 0.297 | 0.848 | 0.682 | 0.844
F1_Score | 0.957 | 0.964 | 0.437 | 0.875 | 0.804 | 0.871
Kappa | 0.925 | 0.943 | 0.203 | 0.407 | 0.406 | 0.373
AUC | 0.990 | 0.998 | 0.712 | 0.885 | 0.891 | 0.862

Comparing the findings from Tables 6–8, we observe that RF and XGB exhibit superior performance in terms of accuracy compared to SVM and MLP on the BC, BCD, SPECT, SPECTF, and Parkinson's datasets. Additionally, NB gave the best accuracy results on the SPECT dataset (0.802).
Overall, it can be seen that PSO optimisation improved the accuracy of the MLP and SVM models over several datasets. In the BC, BCD, SPECTF, and Parkinson's datasets, MLP's accuracy increased, whereas SVM's accuracy improved on the BC, Diabetes, and Parkinson's datasets. On the other hand, the impact of PSO on RF and XGB was comparatively minimal, with accuracy levels remaining constant with and without optimisation. For some datasets where the initial NB model performed well (BC and BCD), PSO did not lead to substantial changes. However, for datasets with lower initial accuracy (Diabetes: 0.571 vs. 0.576), PSO still managed to make marginal improvements. Notably, for SPECT, SPECTF, and Parkinson's, PSO optimisation significantly enhanced NB's accuracy, highlighting its effectiveness in optimising the hyperparameters for improved classification performance.
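The six performance metrics reported in Tables 6–8 are standard and can be computed with scikit-learn as follows (a sketch, not the authors' evaluation script):

```python
# Sketch: the six model performance metrics used in Tables 6-8.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, cohen_kappa_score, roc_auc_score)

def performance_report(model, X_test, y_test):
    """Return the six metrics for a fitted binary classifier."""
    y_pred = model.predict(X_test)
    y_score = model.predict_proba(X_test)[:, 1]   # probability of the positive class
    return {
        "Accuracy":  accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "Recall":    recall_score(y_test, y_pred),
        "F1_Score":  f1_score(y_test, y_pred),
        "Kappa":     cohen_kappa_score(y_test, y_pred),
        "AUC":       roc_auc_score(y_test, y_score),
    }
```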
Subsequently, the performances were evaluated and compared in terms of accuracy using the SK test to verify whether the difference between the five models is statistically significant across the datasets. The test's default assumption (i.e., the null hypothesis) was that the accuracy values of the five models were not significantly different. If the null hypothesis is rejected (when the p-value is less than 0.05), the models are considered significantly different. The performances were then evaluated in terms of the six model performance metrics (accuracy, precision, recall, F1 score, Kappa, and AUC). The SK test, as shown in Fig. 3, revealed that the black box models' accuracy assigns them to one cluster, even though RF had the best mean accuracy. Furthermore, Table 9 reports that two black box models, namely RF and SVM, consistently secured the top position based on the Borda count, by appearing first in BCD, SPECTF, and Parkinson's for RF, and first in BC and Diabetes for SVM. Meanwhile, XGB and NB also appeared first in SPECT and Parkinson's respectively.

Table 9
Borda count scores for the black box models on each dataset.

Dataset | MLP | SVM | RF | XGB | NB
BC | 17 | 28 | 22 | 8 | 15
BCD | 17 | 7 | 26 | 17 | 23
Diabetes | 19 | 27 | 14 | 21 | 9
SPECT | 8 | 23 | 14 | 15 | 30
SPECTF | 10 | 19 | 22 | 21 | 18
Parkinson's | 10 | 16 | 27 | 27 | 10

6.2. Global interpretability (step 2)

Global surrogates were generated using the process outlined in Fig. 2 to assess the global interpretability of the constructed models (RQ2). The resulting accuracy (on the original data), fidelity as accuracy (on the oracle data), depth, and number of leaves of these surrogates are presented in Table 10.
First, the SK test was used to compare the surrogates of the five black box models based on their fidelity values to determine whether any significant differences existed. Additionally, we used the Borda count scores, as shown in Table 11, to further evaluate the global interpretability of the models for each dataset using the metrics provided in Table 10.
Subsequently, we compared the feature importance according to these global surrogates to those generated by the SHAP summary plots. Then, each of these two sets of rankings was compared to the DT trained on the original data to check how close they were to the white-box perspective. Finally, ALE plots were generated for the most consensually identified important features of the BC dataset to verify how they influence the model's decision-making process. This analysis not only enables us to contribute valuable knowledge to the medical tasks associated with the investigated datasets, but also helps establish a level of trust in these black box models.
According to the fidelity values in Table 10, the RF and NB global surrogates demonstrate higher fidelity to the black box model, performing best in SPECTF and Parkinson's for RF and in BCD and Diabetes for NB. SVM was the best on the SPECT dataset, whereas XGB excelled in BC. Nevertheless, considering the depth of the trees, MLP consistently achieved the best (smallest) depth, except for the BC and Diabetes datasets, where NB surpassed it with depths of 6 and 10 respectively. Similarly, for the number of leaves, NB achieved the best values in five datasets, whereas MLP surpassed it in the sixth dataset (BC). Table 10 also provides the accuracy of the global surrogates based on the true labels (original dataset). SVM demonstrates higher accuracy in Diabetes (67.1%), XGB in SPECTF (81.48%), and RF in Parkinson's (87.18%). Meanwhile, NB excels on the BCD and SPECT datasets (95.6% and 79% respectively).
Fig. 4 demonstrates that the SK significance test assigned the mean of …
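The agreement between two feature-importance rankings (global surrogate vs. SHAP, or either vs. the DT), reported as Kendall's rank correlation in Table 13 below, can be computed with SciPy; the importance vectors in this sketch are hypothetical.

```python
# Sketch: Kendall's rank correlation between two feature-importance rankings.
import numpy as np
from scipy.stats import kendalltau

# Hypothetical importance scores for the same five features from two explainers.
importance_surrogate = np.array([0.40, 0.25, 0.15, 0.12, 0.08])
importance_shap      = np.array([0.35, 0.10, 0.30, 0.15, 0.10])

tau, p_value = kendalltau(importance_surrogate, importance_shap)
print(f"Kendall tau = {tau:.3f}, p-value = {p_value:.3f}")
```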
Table 10
Global interpretability results for each experiment.
Model | BC | BCD | Diabetes | SPECT | SPECTF | Parkinson's

Fig. 4. SK results of the black box models according to global surrogate fidelity.
Table 13
Global features ranks comparisons (Kendall's rank correlation coefficient and p-value per dataset).

Model | Comparison | BC (Corr., p) | BCD (Corr., p) | Diabetes (Corr., p) | SPECT (Corr., p) | SPECTF (Corr., p) | Parkinson's (Corr., p)
MLP | GS-SHAP | 0.592, 0.028 | 0.286, 0.041 | 0.317, 0.059 | 0.206, 0.19 | 0.304, 0.01 | -0.07, 0.681
MLP | GS-DT | 0.592, 0.028 | 0.068, 0.647 | 0.576, 0.001 | 0.613, 0 | -0.05, 0.703 | 0.201, 0.276
MLP | SHAP-DT | 0.167, 0.612 | 0.189, 0.174 | 0.152, 0.363 | 0.166, 0.283 | 0.063, 0.577 | 0.251, 0.13
SVM | GS-SHAP | 0.487, 0.08 | -0.02, 0.861 | 0.059, 0.726 | 0.328, 0.043 | 0.169, 0.139 | 0.311, 0.065
SVM | GS-DT | 0.548, 0.049 | 0.583, 0 | 0.627, 0 | 0.219, 0.181 | 0.81, 0 | 0.625, 0.001
SVM | SHAP-DT | 0.389, 0.18 | 0.185, 0.181 | 0.047, 0.779 | 0.41, 0.008 | 0.122, 0.278 | 0.271, 0.101
RF | GS-SHAP | 0.556, 0.045 | 0.251, 0.071 | 0.387, 0.021 | 0.331, 0.032 | 0.232, 0.038 | 0.101, 0.546
RF | GS-DT | 0.83, 0 | 0.689, 0 | 0.84, 0 | 0.338, 0.029 | 0.696, 0 | 0.371, 0.04
RF | SHAP-DT | 0.611, 0.025 | 0.259, 0.061 | 0.411, 0.014 | 0.401, 0.009 | 0.345, 0.002 | 0.241, 0.146
XGB | GS-SHAP | 0.611, 0.025 | 0.324, 0.021 | 0.661, 0 | 0.347, 0.025 | 0.269, 0.017 | 0.078, 0.645
XGB | GS-DT | 1, 0 | 0.63, 0 | 0.89, 0 | 0.574, 0 | 0.76, 0 | 0.91, 0
XGB | SHAP-DT | 0.611, 0.025 | 0.317, 0.022 | 0.669, 0 | 0.566, 0 | 0.382, 0.001 | 0.118, 0.477
NB | GS-SHAP | 0.667, 0.013 | 0.297, 0.038 | 0.502, 0.004 | 0.212, 0.187 | 0.257, 0.027 | 0.129, 0.45
NB | GS-DT | 0.278, 0.358 | 0.292, 0.058 | -0.07, 0.669 | 0.076, 0.639 | 0.021, 0.866 | 0.305, 0.099
NB | SHAP-DT | 0.167, 0.612 | 0.1, 0.467 | 0.012, 0.944 | -0.1, 0.534 | 0.157, 0.164 | 0.046, 0.781

Table 13 reports the Kendall's rank correlation between the rankings of features according to the global surrogate (GS), the global SHAP explainer, and the DTs: first between the global surrogate and the SHAP global explainer, and then between each of these two and the DT trained on the original dataset. Table 13 demonstrates that the strongest agreements (above 0.7) were between the global surrogate and the DT classifier for RF in BC and Diabetes (0.833 and 0.835 respectively), for SVM in SPECTF (0.806), and for XGB in BC, Diabetes, SPECTF, and Parkinson's (1, 0.891, 0.758, and 0.908 respectively), with very small p-values, indicating strong evidence of a statistically significant correlation. Meanwhile, the lowest correlation coefficients (below 0.1) were reported for MLP when comparing the global surrogate to SHAP on Parkinson's (-0.070), the global surrogate to the DT on BCD and SPECTF (0.068 and -0.048 respectively), and SHAP to the DT on SPECTF (0.063). For SVM, the lowest coefficient was reported when comparing GS to SHAP on BCD (-0.024). For XGB, the correlation between the global surrogate and SHAP on the Parkinson's dataset was the lowest (0.078). Lastly, for NB, the correlation between the global surrogate and the DT was the lowest on the SPECTF dataset (0.021).

Fig. 5. ALE plots of "bare nuclei" feature for SVM (left) and RF (right).
Fig. 6. ALE plots of "uniformity of cell size" for SVM (left) and RF (right).
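Figs. 5 and 6 show first-order ALE curves for single features. As an illustration of how such a curve can be computed, a minimal NumPy sketch of the Apley and Zhu estimator follows; this is a simplified version, not the plotting code behind the figures.

```python
# Sketch: first-order ALE curve for one feature (simplified estimator, illustrative).
import numpy as np

def ale_1d(predict, X, j, n_bins=10):
    """Accumulated local effects of feature j on predict(X), where predict returns
    a 1-D array, e.g. lambda A: model.predict_proba(A)[:, 1]."""
    edges = np.unique(np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)))
    effects = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (X[:, j] >= lo) & (X[:, j] <= hi)
        if not in_bin.any():
            effects.append(0.0)
            continue
        X_lo, X_hi = X[in_bin].copy(), X[in_bin].copy()
        X_lo[:, j], X_hi[:, j] = lo, hi
        effects.append(np.mean(predict(X_hi) - predict(X_lo)))  # local effect within the bin
    ale = np.cumsum(effects)     # accumulate the local effects
    ale -= ale.mean()            # centre the curve around zero
    return edges, ale
```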
Table 14
Local interpretability evaluation using monotonicity and faithfulness averages.
Dataset Technique LIME SHAP MAPLE LORE CIU
Metrics M% Fμ M% Fμ M% Fμ M% Fμ M% Fμ
BC MLP 3.65 0.016 3.65 − 3.01 7.30 − 2.23 2.19 − 13.73 7.30 3.29
SVM 0.73 − 0.234 15.33 18.8 5.11 − 0.74 4.38 − 26.13 5.11 2.80
RF 83.94 0.257 82.48 31.28 78.83 12.88 76.64 20.15 72.26 35.50
XGB 58.39 0.186 58.39 22.3 58.39 8.68 59.12 10.96 48.18 31.73
DT 99.27 0.103 99.27 21.8 97.08 11.88 96.35 12.6 83.21 22.29
NB 24.09 18.05 24.09 − 6.88 26.28 43.52 15.33 − 10.43 28.47 25.98
BCD MLP 0 0.125 1.75 25.6 0 − 2.65 0 39.14 0 47.19
SVM 9.65 0.610 7.89 79.57 8.77 1.09 7.89 64.86 7.89 70.06
RF 2.63 0.166 1.75 19.51 13.16 3.49 6.14 12.17 0 23.50
XGB 1.75 0.256 0.88 22.71 1.75 − 3.11 1.75 19.47 0.88 34.06
DT 80.7 0.165 64.91 29.39 51.75 2.34 76.32 27.15 60.53 43.59
NB 0 23.29 0 24.87 9.65 − 2.98 0.88 19.42 14.91 22.93
Diabetes MLP 0 0.101 0 22.08 0 − 2.89 0 12.75 0 9.91
SVM 17.32 0.03 18.18 16.45 0 − 1.21 0 21.27 0 30.83
RF 1.3 0.203 0 44.33 0.87 4.81 0 13.79 0 30.21
XGB 0 0.239 0 43.51 0 − 3.13 0 32.27 0 48.50
DT 49.78 0.063 43.72 47.1 39.83 12.37 22.51 22.27 31.60 48.50
NB 0.43 − 3.70 2.16 − 7.93 4.76 − 1.94 20.35 − 3.01 20.35 − 7.69
SPECT MLP 25.93 31.61 32.10 59.81 20.99 19.17 14.67 40.58 24.69 52.95
SVM 27.16 58.96 48.15 72.64 35.80 35.26 18.67 59.78 24.69 63.00
RF 44.44 43.6 40.74 71.26 25.93 40.36 24.00 42.89 30.86 50.75
XGB 22.22 40.92 29.63 66.42 19.75 41.29 9.33 47.87 17.28 66.13
DT 32.10 19.83 34.57 49.05 39.51 20.02 32.00 39.38 37.04 41.39
NB 87.65 33.90 87.65 48.22 83.95 4.86 82.67 27.21 87.65 44.97
SPECTF MLP 0 2.29 0 8.76 0 − 7.27 0 5.22 0 21.80
SVM 37.04 − 11.76 33.33 − 17.27 40.74 6.73 44.44 − 0.36 39.51 − 12.43
RF 0 32.15 1.23 30.79 1.23 2.16 0 2.20 0 − 1.20
XGB 0 30.79 0 27.78 0 2.24 0 1.20 0 17.57
DT 74.07 6.8 79.01 14.63 56.79 − 6.88 83.95 15.74 67.90 30.55
NB 76.54 1.56 76.54 5.57 76.54 − 1.02 76.54 0 76.54 11.39
Parkinson’s MLP 0 9.13 0 13.11 0 2.68 0 26.74 0 32.69
SVM 69.23 22.43 71.79 60.13 48.72 − 1.09 74.36 58.29 74.36 55.82
RF 0 53.24 0 64.76 2.56 4.04 0 56.41 0 64.62
XGB 0 59.66 0 76.55 0 17.08 0 66.19 0 86.89
DT 23.08 25.86 17.95 56.76 74.36 2.15 17.95 45.13 28.21 69.43
NB 5.13 30.36 5.13 38.06 7.69 − 0.78 5.13 22.53 2.56 47.08
global surrogate and SHAP in the Parkinson’s dataset was the lowest
Table 15
(0.078). Lastly, for NB, correlation between global surrogate and DT was
Execution time (in seconds) over the test set for the local interpretability
the lowest on the SPECTF dataset (0.021).
techniques.
For the SHAP summary plots, the “bare nuclei” feature was always
elected as the most important for the BC dataset, as opposed to the global Dataset Number of test LIME SHAP MAPLE LORE CIU
records
surrogate and the DT classifier importances. Interestingly, literature
shows that a low “bare nuclei” is one of the most important features to BC 137 3656 1457 3587 42626 116
BCD 114 6397 5818 8333 44262 1062
get a benign diagnosis (Reis-Filho et al., 2002). For further analysis of
Diabetes 231 7028 32566 4648 21173 699
feature importance, ALE plots were generated specifically for this SPECT 81 789 1644 511 9411 202
feature, as an example of ALE plots, to understand how this feature af SPECTF 81 9211 3173 1075 18673 672
fects the models. Parkinson’s 39 1781 741 120 13692 118
Fig. 5 represents the ALE plots of the “bare nuclei” feature for SVM
and RF as the best performing classifiers. ALE plots represent the change
To conclude, this analysis shows the alignment between the rankings
in the model’s prediction probability as the feature changes. The x axis
based on Shapley values, ALE plots and the medical knowledge by
in the in the plot represents the feature’s range of values while the y
ranking “bare nuclei” as the top influencer feature for the black box
represents the change in the model’s prediction. The plots in Fig. 5 show
models.
a monotonic increasing relationship, which suggests that increasing the
feature value tends to increase the model’s predicted probability.
The “uniformity of cell size” feature was ranked as the top feature by all the global surrogate trees as well as by the DT classifier trained on the raw dataset. The ALE plots in Fig. 6 show how this feature changes the prediction for SVM and RF. While increasing the feature’s value seems to increase the SVM’s predicted probability, this change is not as strong as for the “bare nuclei” feature shown in Fig. 5. On the other hand, “uniformity of cell size” does not seem to have any significant impact on the RF model’s predictions, since it does not contribute to the variability or change in the model’s output. Therefore, this feature alone does not have a meaningful relationship with the prediction, yet it may become more influential when combined with other features.
To conclude, this analysis shows the alignment between the rankings based on the Shapley values, the ALE plots, and the medical knowledge, all of which rank “bare nuclei” as the top influencing feature for the black box models.

6.3. Local interpretability (step 3)

Following the process described in Section 5 (step 3), the local explanations of each test set were evaluated and compared using faithfulness, monotonicity, and execution time. Thereafter, the SK test was performed on the basis of the faithfulness average on every dataset, while Borda was applied using the three aforementioned metrics.
Table 14 reports the percentage of test records for which the local technique’s explanation was monotonic (denoted as M%) and the average faithfulness over the test records (denoted as Fμ). Moreover, Table 15 presents the execution time of the explanations of all test records for every dataset on the five black box models.

Table 15
Execution time (in seconds) over the test set for the local interpretability techniques.

Dataset       Number of test records   LIME   SHAP    MAPLE   LORE    CIU
BC            137                      3656   1457    3587    42626   116
BCD           114                      6397   5818    8333    44262   1062
Diabetes      231                      7028   32566   4648    21173   699
SPECT         81                       789    1644    511     9411    202
SPECTF        81                       9211   3173    1075    18673   672
Parkinson’s   39                       1781   741     120     13692   118
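The two explanation-quality metrics summarised in Table 14 can be sketched from their commonly used definitions (for example, as popularised by the AIX360 toolkit); the snippet below is an illustration rather than this study’s exact implementation, with model (exposing predict_proba), x (one test record), attributions (its per-feature importance scores), and base (baseline feature values, e.g. training means) as placeholder names.

```python
import numpy as np

def faithfulness(model, x, attributions, base):
    """Correlation between feature importances and the probability drop observed
    when each feature is individually replaced by its baseline value."""
    cls = int(np.argmax(model.predict_proba(x.reshape(1, -1))[0]))
    p_full = model.predict_proba(x.reshape(1, -1))[0, cls]
    drops = []
    for i in range(len(x)):
        x_pert = x.copy()
        x_pert[i] = base[i]
        drops.append(p_full - model.predict_proba(x_pert.reshape(1, -1))[0, cls])
    return np.corrcoef(attributions, drops)[0, 1]  # value in [-1, 1]

def monotonicity(model, x, attributions, base):
    """True if restoring features in increasing order of importance never
    decreases the predicted probability of the explained class."""
    cls = int(np.argmax(model.predict_proba(x.reshape(1, -1))[0]))
    x_cur = base.copy()
    probs = []
    for i in np.argsort(attributions):              # least to most important
        x_cur[i] = x[i]
        probs.append(model.predict_proba(x_cur.reshape(1, -1))[0, cls])
    return bool(np.all(np.diff(probs) >= 0))
```

Under these definitions, averaging faithfulness over all test records gives Fμ, while the share of records for which monotonicity returns True corresponds to M% in Table 14.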
Fig. 7. SK results for the five classifiers based on faithfulness of local techniques across datasets.
It was observed that the explanations generated for the DT classifier exhibited the best monotonicity and average faithfulness. For the local techniques LIME, SHAP, MAPLE, LORE, and CIU, the proportions of test records with monotonic explanations were, for BC: 99.27%, 99.27%, 97.08%, 96.35%, and 83.21%, respectively. For BCD, the fractions were 80.7%, 64.91%, 51.75%, 76.32%, and 60.53%, respectively, and for Diabetes they were 49.78%, 43.72%, 39.83%, 22.51%, and 31.6%, respectively, while for SPECTF they were 74.07%, 79.01%, 56.79%, 83.95%, and 67.9%, respectively. This superiority on the DT classifier suggests that white box classifiers are easier to interpret for the local interpretability techniques used. Meanwhile, for SPECT, DT did not have the best monotonicity proportion across all explanations, with 39.51% for MAPLE and 32% for LORE as the highest values. Similarly, the best value for the DT explanations on the Parkinson’s dataset was 74.36%, obtained with MAPLE. SVM took the lead on this dataset for LIME, SHAP, LORE, and CIU (69.23, 71.79, 74.36, and 74.36, respectively).
From the interpretability techniques’ perspective, the highest monotonicity score was reported for LIME and SHAP on BC with the DT classifier (99.27). Considering black boxes only, the highest monotonicity values were obtained on BC with RF using LIME and SHAP (83.94 and 82.48, respectively). MLP and NB came in a separate second cluster according to the SK test when applied on the basis of faithfulness, as shown in Fig. 7. Fig. 8 also employs faithfulness to assess the significance of the differences between the local interpretability techniques using the SK test: CIU and SHAP were both classified within the top performing cluster, followed by LORE, with LIME and MAPLE last. Table 15 shows that CIU executions over the test set were the fastest compared with the other techniques, while LORE had the worst execution time except on the Diabetes dataset.
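The Borda aggregation over the three metrics can be illustrated with the short sketch below; the scores are invented placeholders (higher faithfulness and monotonicity are better, lower execution time is better), not values taken from Tables 14 and 15, and the implementation is illustrative rather than the one used in this study.

```python
from collections import defaultdict

# Invented per-technique scores on one dataset (placeholders, not study results).
scores = {
    "LIME":  {"faithfulness": 0.31, "monotonicity": 74.0, "time_s": 900},
    "SHAP":  {"faithfulness": 0.47, "monotonicity": 79.0, "time_s": 320},
    "MAPLE": {"faithfulness": 0.12, "monotonicity": 57.0, "time_s": 110},
    "LORE":  {"faithfulness": 0.35, "monotonicity": 84.0, "time_s": 1900},
    "CIU":   {"faithfulness": 0.52, "monotonicity": 68.0, "time_s": 70},
}
higher_is_better = {"faithfulness": True, "monotonicity": True, "time_s": False}

points = defaultdict(int)
n = len(scores)
for metric, better_high in higher_is_better.items():
    ranked = sorted(scores, key=lambda t: scores[t][metric], reverse=better_high)
    for position, technique in enumerate(ranked):  # best rank first
        points[technique] += n - 1 - position      # n-1 points down to 0

print(sorted(points.items(), key=lambda kv: -kv[1]))  # aggregated ranking
```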
6.4. Accuracy vs interpretability trade-off (step 4)

Many studies have been conducted to address the accuracy-interpretability trade-off (Wunsch and Saad, 2007; Huysmans et al., 2008).
In this study, global and local interpretability techniques were used to gain insight into the black box models. In particular, this step compares interpretability and accuracy in two respects: 1) from a global view (demonstrated in Fig. 9), where the global surrogate’s fidelity (from Table 10) is compared to the black box accuracy (from Tables 6–8), and 2) from a local view (demonstrated in Figs. 10 and 11), where that black box accuracy is compared to the average faithfulness of the local explanations on the test records reported in Table 14.
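As an illustration of the global view, fidelity can be measured as the share of test records on which a surrogate decision tree agrees with the black box it mimics; the helper below is a hedged sketch under that definition, not the exact training and evaluation protocol behind Table 10.

```python
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

def global_surrogate_fidelity(black_box, X_train, X_test, max_depth=4):
    """Train a shallow DT to mimic the black box and return it with its fidelity."""
    surrogate = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    surrogate.fit(X_train, black_box.predict(X_train))  # learn the black box's labels
    fidelity = accuracy_score(black_box.predict(X_test), surrogate.predict(X_test))
    return surrogate, fidelity
```

The black box’s own accuracy on the same test set then provides the second axis of the comparison shown in Fig. 9.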
Before applying interpretability techniques, MLP, SVM, RF, XGB, and NB have no built-in interpretability, since they give very little to no information about the reasons behind their predictions. All the models constructed in this experiment were able to defy the trade-off because their interpretability, according to the global surrogates, was almost as good as their performance.
It is important to highlight that fidelity should not necessarily be maximised; it should only be sufficiently high. Other aspects should also be considered; for instance, the high fidelity of a global surrogate would still be problematic if its underlying DT is too deep and difficult to comprehend.
Generally, according to Fig. 9, the global surrogates deliver interpretable models that can be used to understand black box models. Therefore, black box model evaluations can primarily rely on their performances once their interpretability is “sufficient”.
On the other hand, Fig. 10 illustrates the relationship between the black box models’ accuracy and the average faithfulness of every interpretability technique across the different models. There is a notable variation among points representing the same model, which made it hard to compare the strength of the faithfulness per model/technique. Faithfulness values higher than 50% indicate a positive correlation between the importance assigned by the interpretability technique to the features and the effect of those features on the model’s performance. Therefore, in Fig. 11, we focused on the values where the faithfulness is higher than 50%, which suggests that the interpretability technique is correctly identifying important features. The higher the faithfulness value, the stronger the correlation between the importance assigned by the interpretability technique and the effect of those attributes on the model’s performance.

Fig. 10. Trade-off between accuracy and faithfulness of the local interpretability techniques across datasets.

It can be observed that CIU and SHAP exhibited faithfulness scores above 50 in nine experiments, LORE in five, and LIME in three. Meanwhile, MAPLE did not appear in the plot. This confirms the findings of step 3, where LIME and MAPLE came in the last SK cluster in Fig. 8 according to faithfulness. Fig. 11 also reveals that SVM appeared ten times, RF and XGB six times each, while MLP and DT appeared twice.
In general, CIU and SHAP provide more faithful and meaningful insights into the model’s behaviour and capture the important features that contribute to its performance. This increases our confidence in the interpretability technique and thus in the black box model itself. The high performances of CIU and SHAP are further supported by their appearance in the top cluster identified by SK in Fig. 8. Meanwhile, execution time comes in favour of CIU when compared to SHAP.

7. Threats to validity

To guarantee the validity of this study, its limitations must be highlighted.
In the model construction (step 1), feature standardisation could have been considered so that the data became centred around zero with a standard deviation of one, which might have improved model performance. Since this study focused on interpretability, we did not give much attention to additional preprocessing tasks, so that the contribution of each original feature could be determined. Additionally, we did not explore and compare optimisation methods; we simply chose PSO given its reported performance in the literature (Idri et al., 2020; Saha et al., 2022). Nevertheless, these preprocessing tasks, as well as other optimisation techniques, such as Bayesian-based ones, have the potential to improve the model performance and potentially its interpretability; therefore, they can be explored and discussed in future work.
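For illustration, the standardisation mentioned above would typically be applied inside a pipeline so that the scaler is fitted on the training split only; the sketch below uses synthetic data and an untuned MLP as placeholders and is not part of this study’s actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a numerical medical dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardise features (zero mean, unit variance) before the MLP; fitting the
# scaler inside the pipeline avoids leaking test-set statistics.
model = make_pipeline(StandardScaler(), MLPClassifier(max_iter=1000, random_state=0))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out test set
```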
Experiments were performed on six datasets with different numbers of instances and features. Nevertheless, they all belong to the field of medicine. This ensured specificity but did not validate the evaluated techniques across different domains. Another limitation of this study is that all datasets were tabular. Therefore, the empirical evaluation did not compare the interpretability techniques on other data types, such as
categorical attributes or images, which could alter our findings. In addition, all experiments had to solve a binary classification task because interpretability can be difficult in multiclass classification (Zhang et al., 2019). Finally, our choice to abstain from delving into deep learning networks, especially within the context of binary classification tasks involving tabular data, stemmed from the non-image nature of the datasets; our decision was underpinned by a strong emphasis on model transparency and interpretability, with a focus on MLPs thanks to their widespread popularity and suitability as an initial step towards network interpretability (Hakkoum et al., 2021a).

8. Conclusion and future work

In this study, we performed an empirical evaluation of seven interpretability techniques, including three global and five local techniques (with SHAP used in both). The primary focus was to evaluate these techniques for MLP, SVM, RF, XGB, and NB black box models on six medical numerical datasets. Our quantitative evaluations showed that RF and SVM generally outperformed the other models in terms of performance, although the SK test deemed this difference insignificant.
Furthermore, our results indicated, for the BC dataset, that the SHAP results were more in line with the ALE plots, unlike the global surrogate. Meanwhile, the NB global surrogate surpassed the other global surrogate DTs as well as the white box DT, although the SK test deemed this superiority insignificant based on fidelity.
On the local scope, CIU and SHAP performed better than the other local techniques according to the SK test based on faithfulness, while CIU had a faster execution time. On the other hand, the best monotonicity values were given by LIME and SHAP for the DT and RF classifiers. Overall, the interpretability techniques helped to achieve a level of interpretability for the black box models, thereby overcoming the trade-off and making them useful in critical domains that require explanations for decision-making. However, to gain trust and be effectively utilised for decision-making, quantitative assessments of these explanations are essential.
For future work, we intend to evaluate interpretability techniques across different domains and with various data types and ML tasks for further analysis and comparisons. Moreover, we highlight the
Fig. 11. Accuracy vs. average of faithfulness scores higher than 50.
References

Guidotti, R., Monreale, A., Ruggieri, S., Pedreschi, D., Turini, F., Giannotti, F., 2018. Local Rule-Based Explanations of Black Box Decision Systems. ArXiv.
Hakkoum, H., Abnane, I., Idri, A., 2021a. Interpretability in the medical field: a systematic mapping and review study. Appl. Soft Comput., 108391. https://doi.org/10.1016/J.ASOC.2021.108391.
Hakkoum, H., Idri, A., Abnane, I., 2021b. Assessing and comparing interpretability techniques for artificial neural networks breast cancer classification. Comput. Methods Biomech. Biomed. Eng. Imaging Vis. 9. https://doi.org/10.1080/21681163.2021.1901784.
Huysmans, J., Setiono, R., Baesens, B., Vanthienen, J., 2008. Minerva: sequential covering for rule extraction. IEEE Trans. Syst. Man, Cybern. Part B 38, 299–309. https://doi.org/10.1109/TSMCB.2007.912079.
Idri, A., Bouchra, E.O., Hosni, M., Abnane, I., 2020. Assessing the impact of parameters tuning in ensemble based breast cancer classification. Health Technol. 10, 1239–1255. https://doi.org/10.1007/S12553-020-00453-2.
Idri, A., Khoshgoftaar, T.M., Abran, A., 2002. Can neural networks be easily interpreted in software cost estimation? In: 2002 IEEE World Congress on Computational Intelligence, 2002 IEEE International Conference on Fuzzy Systems, FUZZ-IEEE’02, Proceedings (Cat. No.02CH37291). IEEE, pp. 1162–1167. https://doi.org/10.1109/FUZZ.2002.1006668.
Jelihovschi, E., Faria, J.C., Allaman, I.B., 2014. ScottKnott: a package for performing the Scott-Knott clustering algorithm in R. Trends Comput. Appl. Math. 15, 3–17. https://doi.org/10.5540/TEMA.2014.015.01.0003.
Johansson, U., Niklasson, L., 2009. Evolving decision trees using oracle guides. In: 2009 IEEE Symp. Comput. Intell. Data Mining, CIDM 2009 - Proc., pp. 238–244. https://doi.org/10.1109/CIDM.2009.4938655.
Kennedy, J., Eberhart, R., 1995. Particle swarm optimization. Proc. ICNN’95 - Int. Conf. Neural Networks 4, 1942–1948. https://doi.org/10.1109/ICNN.1995.488968.
Knapič, S., Malhi, A., Saluja, R., Främling, K., 2021. Explainable artificial intelligence for human decision support system in the medical domain. Mach. Learn. Knowl. Extr. 3, 740–770. https://doi.org/10.3390/make3030037.
Lakkaraju, H., Arsov, N., Bastani, O., 2020. Robust and Stable Black Box Explanations. ArXiv.
Lakkaraju, H., Bach, S.H., Leskovec, J., 2016. Interpretable decision sets: a joint framework for description and prediction. Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., 13-17-August-2016, pp. 1675–1684. https://doi.org/10.1145/2939672.2939874.
Lakkaraju, H., Caruana, R., Kamar, E., Leskovec, J., 2019. Faithful and customizable explanations of black box models. In: AIES 2019 - Proc. 2019 AAAI/ACM Conf. AI, Ethics, Soc., pp. 131–138. https://doi.org/10.1145/3306618.3314229.
Lundberg, S.M., Lee, S.-I., 2017. A unified approach to interpreting model predictions. In: NIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, California, USA. Curran Associates Inc., Red Hook, NY, USA, pp. 4768–4777.
Luss, R., Chen, P.Y., Dhurandhar, A., Sattigeri, P., Zhang, Y., Shanmugam, K., Tu, C.C., 2019. Leveraging latent features for local explanations. Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., pp. 1139–1149. https://doi.org/10.1145/3447548.3467265.
Miller, T., 2019. Explanation in artificial intelligence: insights from the social sciences. Artif. Intell. 267, 1–38.
Molnar, C., 2021. Interpretable Machine Learning: A Guide for Making Black Box Models Interpretable.
Molnar, C., Casalicchio, G., Bischl, B., 2020. Quantifying model complexity via functional decomposition for better post-hoc interpretability. Commun. Comput. Inf. Sci. 1167, 193–204. https://doi.org/10.1007/978-3-030-43823-4_17/COVER.
Nassih, R., Berrado, A., 2020. State of the art of fairness, interpretability and explainability in machine learning: case of PRIM. In: Proceedings of the 13th International Conference on Intelligent Systems: Theories and Applications, SITA’20. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3419604.3419776.
Nicholson Price, W., 2018. Big data and black-box medical algorithms. Sci. Transl. Med. 10. https://doi.org/10.1126/SCITRANSLMED.AAO5333.
Nizar Abdulaziz Mahyoub, A., Alpkoçak, A., 2022. A quantitative evaluation of explainable AI methods using the depth of decision tree. Turk. J. Elec. Eng. Comput. Sci. 30 (6), 4. https://doi.org/10.55730/1300-0632.3924.
Pereira, S., Meier, R., McKinley, R., Wiest, R., Alves, V., Silva, C.A., Reyes, M., 2018. Enhancing interpretability of automatically extracted machine learning features: application to a RBM-Random Forest system on brain lesion segmentation. Med. Image Anal. 44, 228–244. https://doi.org/10.1016/j.media.2017.12.009.
Plumb, G., Al-Shedivat, M., Cabrera, Á.A., Perer, A., Xing, E., Talwalkar, A., 2020. Regularizing black-box models for improved interpretability. Adv. Neural Inf. Process. Syst. 33, 10526–10536.
Plumb, G., Molitor, D., Talwalkar, A., 2018. Model agnostic supervised local explanations. Adv. Neural Inf. Process. Syst. 2018-December, 2515–2524.
Quinlan, J.R., 1986. Induction of decision trees. Mach. Learn. 1 (1), 81–106. https://doi.org/10.1007/BF00116251.
Reis-Filho, J.S., Albergaria, A., Milanezi, F., Amendoeira, I., Schmitt, F.C., 2002. Naked nuclei revisited: p63 immunoexpression. Diagn. Cytopathol. 27, 135–138. https://doi.org/10.1002/DC.10164.
Ribeiro, M.T., Singh, S., Guestrin, C., 2016. “Why should I trust you?” Explaining the predictions of any classifier. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, pp. 1135–1144. https://doi.org/10.1145/2939672.2939778.
Risse, M., 2005. Why the count de Borda cannot beat the Marquis de Condorcet. Soc. Choice Welfare 25, 95–113. https://doi.org/10.1007/s00355-005-0045-3.
Saha, S., Saha, A., Roy, B., Sarkar, R., Bhardwaj, D., Kundu, B., 2022. Integrating the Particle Swarm Optimization (PSO) with machine learning methods for improving the accuracy of the landslide susceptibility model. Earth Sci. Informatics 15, 2637–2662. https://doi.org/10.1007/S12145-022-00878-5/METRICS.
Shapley, L.S., 1952. A Value for N-Person Games. RAND Corporation, Santa Monica, CA.
Shinde, P.P., Shah, S., 2018. A review of machine learning and deep learning applications. In: 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), Pune, India, pp. 1–6. https://doi.org/10.1109/ICCUBEA.2018.8697857.
Silva, W., Fernandes, K., Cardoso, M.J., Cardoso, J.S., 2018. Towards complementary explanations using deep neural networks. In: Stoyanov, D., et al. (Eds.), Understanding and Interpreting Machine Learning in Medical Image Computing Applications. MLCN DLF IMIMIC, Lecture Notes in Computer Science, vol. 11038. Springer, Cham. https://doi.org/10.1007/978-3-030-02628-8_15.
Tam, A., 2021. A gentle introduction to particle swarm optimization. MachineLearningMastery.com.
Vellido, A., 2019. Societal issues concerning the application of artificial intelligence in medicine. Kidney Dis. 5, 11–17. https://doi.org/10.1159/000492428.
Wunsch, D.C., Saad, E.W., 2007. Neural network explanation using inversion. Neural Netw. 20, 78–93. https://doi.org/10.1016/j.neunet.2006.07.005.
Zhang, X., Lou, Y., Tan, S., Chajewska, U., Koch, P., Caruana, R., 2019. Axiomatic interpretability for multiclass additive models. Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., pp. 226–234. https://doi.org/10.1145/3292500.3330898.
Zhou, Z., Jiang, Y., 2004. NeC4.5: neural ensemble based C4.5. IEEE Trans. Knowl. Data Eng. 16, 770–773. https://doi.org/10.1109/TKDE.2004.11.