0% found this document useful (0 votes)

28 views18 pages

Engineering Applications of Artificial Intelligence: Hajar Hakkoum, Ali Idri, Ibtissam Abnane

This study evaluates seven interpretability techniques for supervised machine learning models applied to numerical medical data, focusing on balancing model performance and interpretability. The results indicate that global SHAP and local CIU techniques outperform others in explaining model decisions, emphasizing the need for quantitative assessments of interpretability methods. The research highlights the importance of integrating interpretability techniques in critical fields like medicine to enhance transparency and trust in machine learning models.

Uploaded by

μ con

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

28 views18 pages

Engineering Applications of Artificial Intelligence: Hajar Hakkoum, Ali Idri, Ibtissam Abnane

Uploaded by

μ con

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 18

Engineering Applications of Artificial Intelligence 131 (2024) 107829

Contents lists available at ScienceDirect

Engineering Applications of Artificial Intelligence

journal homepage: www.elsevier.com/locate/engappai

Global and local interpretability techniques of supervised machine learning

black box models for numerical medical data
Hajar Hakkoum a, *, Ali Idri a, b, Ibtissam Abnane a
a
SPM Research Team, ENSIAS, Mohammed V University, Rabat, Morocco
b
Mohammed VI Polytechnic University, Ben Guerir, Morocco

A R T I C L E I N F O A B S T R A C T

Keywords: The most effective machine learning classification techniques, such as artificial neural networks, are not easily
Interpretability interpretable, which limits their usefulness in critical areas, such as medicine, where errors can have severe
XAI consequences. Researchers have been working to balance the trade-off between the model performance and
Explainability
interpretability. In this study, seven interpretability techniques (global surrogate, accumulated local effects, local
Black box
Numerical data
interpretable model-agnostic explanations (LIME), Shapley additive explanations (SHAP), model agnostic post
Medicine hoc local explanations (MAPLE), local rule-based explanation (LORE), and Contextual Importance and Utility
(CIU)) were evaluated to interpret five medical classifiers (multilayer perceptron, support vector machines,
random forests, extreme gradient boosting, and naïve bayes) using six model performance metrics and three
interpretability technique metrics across six medical numerical datasets. The results confirmed the effectiveness
of integrating global and local interpretability techniques, and highlighted the superior performance of global
SHAP explainer and local CIU explanations. The quantitative evaluations of explanations emphasised the
importance of assessing these interpretability techniques before employing them to interpret black box models.

1. Introduction (Miller, 2019). Interpretability is highly dependent on the target audi

ence’s willingness to accept the explanations (Silva et al., 2018). They
Machine learning (ML) techniques have been widely adopted in can be global or local, model-independent, or model-dependent (Mol
various domains (Shinde and Shah, 2018), but their use in critical do nar, 2021). It is critical to note that interpretability does not imply
mains, such as medicine, finance, and aviation, can pose challenges making the black box internals transparent, which is “explainability”
(Pereira et al., 2018). ML can provide quick predictive second opinions (Barredo Arrieta et al., 2020; Nassih and Berrado, 2020). Explainability
that reduce human error; however, in domains that require transparency reveals the inner workings of the black box by decomposing the system,
and understanding of decision systems, this second opinion can be whereas interpretability can be achieved by only analysing the input and
viewed with scepticism (Vellido, 2019). output relations without providing extensive information about the
Despite the outperformance of black box ML techniques such as model’s internals. As a result, explainability implies interpretability, but
artificial neural networks (ANNs) and support vector machines (SVMs) the converse is not always valid (Gall, 2018). While explainability is
over white-box models such as decision trees (DTs) and K-nearest more informative, some interpretability techniques can be used with
neighbours (KNN) (Florez-Lopez and Ramon-Jeronimo, 2015), owing to various ML techniques and can be useful when a user is looking for
the availability of large datasets and advanced computational power, simple input-output explanations to help them understand an ML
white-box models are still preferred in critical domains, such as medi technique.
cine, for their transparency and regulatory compliance (Nicholson Price, Global interpretability provides a generalisable explanation, such as
2018). Many efforts have been made to enable black box ML techniques an IF-THEN rule set that represents black box behaviour or feature
to interpret and explain decisions and ensure interpretability, particu importance, which calculates an importance value for each feature to
larly in critical domains. reflect the extent to which the model relies on that particular feature. On
The degree to which a human can predict the outcome of a model or the other hand, local interpretability focuses on explaining a single data
understand the reasons for its decisions is referred to as interpretability point, such as the contribution of input features to the output for that

* Corresponding author.
E-mail addresses: hajar.hakkoum@um5r.ac.ma (H. Hakkoum), ali.idri@um5.ac.ma (A. Idri), abnane.ibtissam@um5s.net.ma (I. Abnane).

https://doi.org/10.1016/j.engappai.2023.107829
Received 6 August 2022; Received in revised form 20 November 2023; Accepted 27 December 2023
0952-1976/© 2023 Elsevier Ltd. All rights reserved.
H. Hakkoum et al. Engineering Applications of Artificial Intelligence 131 (2024) 107829

particular data point. eXtreme Gradient Boosting (XGB), and naïve bayes (NB)) were trained
Interpretability techniques can also differ, depending on the ML and optimised using particle swarm optimisation (PSO) on six numerical
model. Model-specific techniques (for example, decompositional rule medical datasets available from the UCI online repository (Dua and
extractions) can be applied to a specific type of ML model; these tech Graff, 2017): Wisconsin breast cancer original (9 features, 699 in
niques frequently use the model’s internals to provide an explanation. stances), Wisconsin breast cancer diagnosis (32 features, 569 instances),
Consequently, they are also known as decompositional or internal diabetic retinopathy Debrecen (20 features, 1151 instances), Parkin
techniques. There are also model-agnostic techniques (e.g., global sur son’s disease (23 features, 197 instances), and heart (SPECT and SPECTF
rogates) that construct explanations based on input-output interactions, with 22 and 44 features, respectively, and a size of 267 instances). The
regardless of the model. These are also known as pedagogical, external, five ML models were compared over each dataset based on their accu
and post hoc techniques. racy values using the ScottKnott (SK) statistical test (Jelihovschi et al.,
Errors in medicine are not tolerated because they directly affect the 2014) and the Borda count voting system. Then, the seven interpret
lives of the patients. Our systematic literature review (SLR) of 179 ar ability techniques were applied; the global surrogate was first assessed
ticles investigating interpretability in medicine from 1994 to 2020 using its fidelity to the black box, its accuracy and comprehensibility,
(Hakkoum et al., 2021a) revealed a strong interest in classification and and thereafter by comparing its feature importance scores to those of
diagnosis tasks, particularly in oncology. In terms of interpretability SHAP as well as those generated from a white box (DT) using Kendal’s
techniques, the majority of papers reviewed used global interpretability rank correlation. At a local scale, LIME, SHAP, MAPLE, LORE, and CIU
over local interpretability, with rule-based explanations being the most were evaluated and compared using three metrics: faithfulness, mono
common global interpretability technique. ANNs and SVMs have tonicity, and execution time. Finally, the gap between the accuracy of
received more attention from researchers than other ML models. It was the model and its interpretability offered by the techniques used (fidelity
also discovered that evaluating interpretability techniques when they for the global surrogate and the average faithfulness for local tech
were not rules or trees, which can be easily measured, posed a signifi niques) was analysed in an attempt to determine the
cant challenge. performance-interpretability trade-off.
The SLR also emphasised the lack of use and comparison of new local This study addressed the following research questions (RQs).
interpretability techniques in the medical field. Seventy-two papers
studied local techniques, 28 of which performed comparisons. Further RQ1 : What is the overall accuracy of each constructed model?
more, 5 out of the 72 assessed the technique of local interpretable RQ2 : What is the global interpretability of each constructed model?
model-agnostic explanations (LIME), and the same number of papers RQ3 : What is the local interpretability of each constructed model?
evaluated Shapley additive explanations (SHAP). Very few articles RQ4 : Is there a relationship between accuracy and interpretability for
compared global and local techniques simultaneously (only 10 articles). each model? Which model is most vulnerable to the gap?
Additionally, quantitative evaluations of new local interpretability
techniques, such as LIME and SHAP, have never been performed, and The contributions of this empirical evaluation are summarised as
researchers frequently opted for a qualitative description or comparison. follows.
For instance, in our previous empirical evaluation (Hakkoum et al.,
2021b) of LIME and two global techniques, partial dependence plot • Quantitative comparison of global and local interpretability tech
(PDP) and feature importance (FI), we used a descriptive rather than a niques: global surrogate, ALE, SHAP, LIME, MAPLE, LORE, and CIU,
quantitative evaluation. It was also an evaluation of a local technique on the basis of fidelity, comprehensibility, faithfulness, mono
using one dataset which jeopardised the validity of the experiment and tonicity, and time.
weakened its conclusions. Nonetheless, it defined how well LIME agreed • Discovering whether these interpretability techniques solve the
with the global explanations for the original Wisconsin breast cancer accuracy-interpretability trade-off and whether one model is more
dataset (Dua and Graff, 2017). vulnerable to the trade-off than the other.
The motives for evaluating interpretability techniques stem from the
need to enhance the transparency of complex models, particularly in the The remainder of this paper is organised as follows: Section 2 pro
medical domain. There is a demand for explainable models that can vides an overview of the classifiers, namely SVM, MLP, RF, XGB, NB, and
elucidate the rationale behind decisions made by ML based medical DTs, along with a presentation of the seven interpretability techniques:
models. Local post-hoc interpretability techniques, like SHAP and LIME, global surrogate, ALE, SHAP (with two functions), LIME, MAPLE, LORE,
are valuable in deciphering the key factors and rules contributing to and CIU. In Section 3, we discuss related research on the application of
individual predictions. However, comparing such techniques necessi black box interpretability techniques in the field of medicine. The spe
tates both qualitative and quantitative evaluations, which should rely on cifics of the dataset, performance measures, and statistical tests
objective metrics instead of subjective human judgment. Ahmed and employed to select the best performing models and techniques are
ALPKOÇAK (Nizar Abdulaziz Mahyoub and Alpkoçak, 2022), intro outlined in Section 4. Section 5 presents the experimental design used
duced a novel strategy that employs the structure of a DT as a proxy. By for the empirical evaluation. Section 6 summarises and discusses the
mapping the output of interpretability techniques, specifically SHAP findings of our study. In Section 7, we thoroughly examine potential
and LIME features scores, onto a DT, two primary complexity metrics are threats to the validity of our research. Finally, Section 8 concludes the
proposed: total depth of the tree and average of the weighted class paper by discussing the implications of the study’s results and proposing
depth. Through this approach, they demonstrated that SHAP is superior avenues for future work.
to LIME in terms of complexity and scalability, providing insights into
suitable interpretability techniques for varying document scales and 2. Background
identifying features to enhance the performance of the ANN they trained
on the cardiovascular dataset OHSUMED. This section presents an overview of the constructed models, opti
As a result, the current study compares, qualitatively and quantita misation algorithm used, and interpretability techniques investigated.
tively, three known global interpretability techniques (global surrogates
using DTs, Accumulated Local Effects (ALE), and the global summary 2.1. ML models and optimisation algorithm
plot of SHAP) and five newly introduced local techniques (LIME, SHAP,
Model Agnostic Post-hoc Local Explanations (MAPLE), Local Rule-Based 2.1.1. Multilayer perceptron (MLP)
Explanation (LORE), and Contextual Importance and Utility (CIU)). Five Multilayer perceptrons (MLPs) are composed of multiple layers of
black box supervised learners (MLP, SVM, Random Forests (RFs), interconnected nodes (neurones). MLPs are feedforward networks, in

2
H. Hakkoum et al. Engineering Applications of Artificial Intelligence 131 (2024) 107829

which the information flows only in one direction from the input layer the DTs through majority voting or averaging to make the final pre
(data) and the output layer (prediction) through a series of intermediate diction, XGBoost trains DTs sequentially, focusing on instances with
layers called hidden layers that learn to represent higher-level features higher errors from previous iterations, and combines the predictions of
that are useful for predicting the output variables. weak models by assigning weights to each model based on their per
The layers are composed of a sum of neurones connected to each formance, with a stronger emphasis on models that contribute more to
other by weights to represent nonlinear mapping (Gardner and Dorling, reducing the loss. XGBoost further optimises the loss function and ap
1998). Each neuron in an MLP receives input signals from the neurones plies regularisation techniques to improve the model performance.
in the previous layer, applies an activation function to the sum of the
weighted inputs, and passes its output to the neurones in the next layer. 2.1.6. Naïve Bayes (NB)
The weights of the connections between neurones are learned during the The Naive Bayes classifier is a fundamental probabilistic machine
training process using a variant of the backpropagation algorithm (Idri learning algorithm designed for classification tasks. Its core principle is
et al., 2002). rooted in Bayes’ theorem, a fundamental concept in probability theory.
The MLP takes the input data, propagates it through the network, The classifier calculates the probability of each possible class label for a
and generates output predictions during training. During the training given input by combining two key components: the prior probability of
process, the weights of the connections between neurones in the each class and the likelihood of observing the input’s features given each
network were adjusted to minimise the difference between the predicted class. What sets the Naive Bayes classifier apart is its assumption of
and true outputs. This is typically accomplished using an optimisation feature independence, implying that the presence or absence of one
algorithm, such as stochastic gradient descent. feature is unrelated to the presence or absence of other features. This
simplifying assumption allows the algorithm to calculate probabilities
2.1.2. Support vector machines (SVMs) efficiently, but it might not always align with the real-world data
SVMs are a type of supervised classification algorithm based on the generating process. Additionally, some variants of Naive Bayes, such as
statistical learning theory. Their primary objective was to separate Gaussian Naive Bayes, introduce a smoothing technique called “var_
classes using an optimal hyperplane that maximises the distance be smoothing” to prevent zero probabilities in cases where certain feature-
tween them. The data points closest to this hyperplane are called support class combinations are absent in the training data. This technique adds a
vectors, and are considered the most critical elements of the training set. small constant to the variance of each feature, ensuring non-zero
SVMs can also be transformed into nonlinear classifiers by using probabilities and stabilizing calculations.
nonlinear kernels. One such kernel is the radial basis function (RBF) While the Naive Bayes classifier is conceptually interpretable due to
kernel, which maps data into a higher-dimensional space to achieve its reliance on probabilistic reasoning, the intricate internal calculations
better class separation. SVMs incorporate a penalty parameter that al of probabilities, along with the underlying assumption of feature inde
lows for some degree of misclassification by allowing training points to pendence, can render it somewhat opaque and challenging to fully
be on the wrong side of the hyperplane. Increasing the penalty param grasp, thus positioning it as a black box classifier to varying degrees
eter increases the cost of misclassifying points and pushes for the depending on the complexity of the data and the specific use case.
development of a more accurate model; however, this may result in a
less generalisable model. 2.1.7. Particle swarm optimisation (PSO)
Designing black box models with optimised hyperparameters re
2.1.3. Decision trees (DTs) mains a significant challenge. To address this, the biologically inspired
DTs build a hierarchical structure of decision nodes and leaf nodes approach of particle swarm optimisation (PSO) (Kennedy and Eberhart,
based on features and their thresholds (Quinlan, 1986). The decision 1995) was employed in this study. The PSO operates on the premise that
nodes represent conditions or questions regarding the features, and the a bird’s knowledge and experience can be shared with the entire group.
leaf nodes represent the predicted outcomes or target values. DTs By mimicking the movement of a flock of birds, in which each bird at
recursively split the data based on the selected features to maximise the tempts to find an optimal solution within a solution space, the group’s
information gain (or minimise the impurity) at each node. This process best solution becomes the PSO optimal solution in that space. Although
creates a tree-like structure that can be used to make predictions or draw it cannot be definitively proven that this solution is the true global
insights into the relationships between the features and target variable. optimal solution, it is often very close to the global optimal value (Tam,
DTs are interpretable (unlike MLP, SVM, RFs, XGB, and NB) and can 2021).
handle both numerical and categorical data, making them widely used Optunity, a Python library for hyperparameter optimisation, was
in various domains. used in this study. It provides a variety of optimisation methods, ranging
from basic methods such as grid search and random search (Bergstra and
2.1.4. Random forests (RFs) Bengio, 2012) to evolutionary methods such as PSO which is currently
RFs (Breiman, 2001) is an ensemble learning method that combines the method of choice owing to its high performance (Claesen et al.,
multiple DTs to make predictions. It operates by training a set of DTs on 2014).
different subsets of data and features. Each tree makes independent
predictions, and the final prediction is obtained by aggregating the in
dividual tree predictions. RFs provide robust and accurate predictions 2.2. Interpretability techniques
by reducing overfitting and capturing the collective wisdom of an
ensemble of trees. 2.2.1. Global surrogates
Global surrogates are a class of ML models trained to approximate
2.1.5. Extreme gradient boosting (XGB) the behaviour of black box models across the entire input space. By
XGBoost (Chen and Guestrin, 2016) is a popular gradient boosting training global surrogates on the same input-output pairs used to train a
algorithm designed to enhance the performance of predictive models. It black box, it is possible to gain insights into the underlying logic of the
sequentially trains an ensemble of weak prediction models such as DTs black box model.
by focusing on instances that were incorrectly predicted in previous it Global surrogates can take a variety of forms (any transparent ML
erations. XGBoost optimises a specific loss function by leveraging model), including DTs, in which the labels predicted by the black box are
gradient descent techniques to minimise loss and improve the overall used instead of the true labels of a dataset. The data on which a global
model performance. While RFs train individual DTs independently using surrogate is trained are often called oracle data. The oracle data reflect
random subsets of the data and features, and combine the predictions of the behaviour of the black box model and not reality (Molnar, 2021).

3
H. Hakkoum et al. Engineering Applications of Artificial Intelligence 131 (2024) 107829

Table 1
Related works with their findings.
Authors Black box ML Interpretability Global/ Metrics Medical datasets Findings
models technique local

Lakkaraju Deep ANNs, XGB, LIME Global Fidelity, Electronic health ROPE Explanations improve robustness and are more
et al. (2020) RFs, and SVM SHAP Local Robustness, stability (on records structurally similar compared to those generated by
MUSE synthetic data) LIME, SHAP, or MUSE.
ROPE
Adhikari et al. RF LIME Local Fidelity (Area Under the Breast Cancer LIME outperforms LEAFAGE on linear ML models,
(2019) SVM LEAFAGE ROC) Original while LEAFAGE achieves better results on non-linear
Wisconsin models.
El Shawi et al. RF LIME Local Identity, stability, Mortality, LIME performs the worst in terms of the identity
(2019) Anchors separability, similarity, diabetes. metric but the best in terms of the separability metric.
SHAP execution time, bias SHAP has the shortest average time to output
detection explanation. And was more effective in enabling bias
detection. Techniques were ranked SHAP, Anchors
and LIME, for enabling correct bias detection.
Zhang et al. ANN ensemble Global surrogate Global Fidelity (accuracy) Breast cancer, The use of oracle data of the ensemble led to an
(2019) diabetes, increase in test set accuracy of the DT. The latter did
hepatitis, heart, better when compared to a default implementation of
liver J48 DT.
Zhou and ANN ensemble Global surrogate Global Accuracy, size of trees Breast cancer, NeC4.5 is time consuming but stronger than C4.5 DT.
Jiang (2004) (NeC4.5) diabetes, heart,
liver.
De Laet and LIME, LORE Local End users Student success LIME provided information about every feature,
Huysmans while LORE only focused on features present in the
(2021) decision rule yet presents a simpler visualization.
Knapič et al. Convolutional LIME, SHAP, CIU Local Human evaluation Images from CIU outperformed LIME and SHAP in improving
(2021) ANNs (CNNs) video capsule human decision-making, transparency, and
endoscopy understandability. CIU also generates explanations
faster.

2.2.2. Accumulated local effects (ALE) refers to an absent value; therefore, it can be replaced with a random
An ALE plot is a visual representation that provides insights into the value from the dataset. Consequently, the feature attributions which are
relationship between a feature and a model’s predictions (Apley and approximations of the Shapley values (linear model weights), are
Zhu, 2016). It shows the accumulated impact of a specific feature on the computed.
predictions while considering interactions with other features. The plot
displays how the average predictions change as the feature of interest 2.2.4. Local interpretable model-agnostic explanations (LIME)
varies within its observed range, while keeping other features fixed. The Unlike global surrogates, which approximate black box behaviour
ALE plot helps users understand the nonlinear relationship between the across the entire input space, LIME (Ribeiro et al., 2016) interprets
feature and the model predictions, capturing both the main effect and predictions at the local level. LIME generates a surrogate interpretable
potential interactions. By examining the ALE plot, one can gain insights model in the local neighbourhood of a particular data point. The local
into the impact of the feature on the model’s output and identify any surrogate is trained on a perturbation of the data point features. This
nonlinear patterns or dependencies between the feature and predictions. new dataset is weighted with respect to its proximity to the data point,
and a local surrogate is trained. Ribeiro et al. (2016) also introduced a
2.2.3. Shapley additive exPlanations (SHAP) submodular pick algorithm, in which they showed the user different
Shapley values (Shapley, 1952) are a concept from cooperative game relevant instances explanations from the test set to have sense of how the
theory where the payout of a “game” is fairly distributed among its features affect black box decision.
players. In ML, players are the features because they play together and
interact with each other to produce an outcome which is the prediction. 2.2.5. Model agnostic post-hoc local explanations (MAPLE)
Lundberg and Lee (2017) proposed a framework (SHAP) that com MAPLE (Plumb et al., 2018) is another model-agnostic interpret
putes an approximation of the Shapley values and provides global and ability technique that can be applied to any ML model, regardless of its
local explanations. The summary plot in SHAP is a visual representation underlying architecture or algorithm. It operates in a post hoc manner,
that provides insights into the global feature importance and its impact which means it analyses the model’s output after it has made pre
on model predictions. This shows the overall contribution of each dictions. The main idea behind MAPLE is to combine RFs with feature
feature to the output of the model. The plot displays the Shapley values, selection and return feature importance explanations. This is achieved
which represent the average contribution of each feature across different using two techniques called SILO and Dstump (Plumb et al., 2018). SILO
instances. Features with larger Shapley values have a greater impact on stands for subgroup instance level optimisation and helps identify
the predictions, whereas those with smaller values have less influence. relevant subgroups within the data that have similar predictions based
The summary plot helps users understand the relative importance of on the RF leaves. A dstump, on the other hand, refers to a decision
different features and their directionality (whether they contribute stump, which is a simple DT with one internal node and two leaf nodes.
positively or negatively to the predictions). By examining the summary Dstump ranks features to solve a weighted linear regression similar to
plot, one can gain a holistic understanding of how the features collec LIME, providing interpretable insights into the decision-making process
tively contribute to the model’s output and identify the most influential of the black box model.
factors driving the predictions.
SHAP also represents the Shapley value explanation as an additive 2.2.6. Local rule-based explanation (LORE)
feature attribution method, which is a local linear model that connects LORE (Guidotti et al., 2018) is a local model-agnostic technique for
LIME and SHAP. To calculate the coalitions, a feature entry of 1 is black box interpretability that employs a genetic algorithm to extract
considered present and replaced with its original value. An entry of zero local rule-based explanations for specific instances or regions. LORE uses

4
H. Hakkoum et al. Engineering Applications of Artificial Intelligence 131 (2024) 107829

Table 2 the individual techniques because they were only a few, especially local.
Strengths and limitations of interpretability techniques. Despite this, the review settled on comparing clusters of techniques
Technique Strengths Limitations instead of individual ones, based on the explanation type. This shows the
importance and need for more experiments to clarify the differences
LIME Good separability (El Shawi May be sensitive to local
et al., 2019) perturbations (Lundberg and between these techniques in different domains and for different black
Lee, 2017) box models and to facilitate their adoption in real-life scenarios.
SHAP Provides both local and global Computationally expensive for This section surveys similar or related works which investigate
interpretability and is high-dimensional spaces and interpretability techniques. Table 1 summarises the use of interpret
theoretically grounded based on large datasets (Lundberg and
cooperative game theory ( Lee, 2017).
ability techniques in the medical field, with a degree of quantitative
Lundberg and Lee, 2017). evaluation. Moreover, Table 2 report some pros and cons of the used
MAPLE Offers interpretable Requires pre-defined rule interpretability techniques.
explanations using logic rules templates (Plumb et al., 2018). The literature on interpretability in medicine suggests that rule- or
which are very easy to follow ( May struggle to capture complex
tree-based explanations are most commonly used (Hakkoum et al.,
Plumb et al., 2018). and non-linear relationships.
LORE Generates human-readable rules Prone to overfitting because it 2021a). Global surrogates, which are a type of rule- or tree-based
for interpretability (Guidotti only focuses on features present explanation, are commonly used in various ways. For example, Zhou
et al., 2018) in the decision rule (De Laet and and Jiang (2004) developed a global surrogate called NeC4.5, which
Huysmans, 2021). combines the strengths of ANNs and C4.5 tree algorithm. Similarly,
CIU Understandable and fast (Knapič
Zhang et al. (2019) conducted experiments with different datasets using
–
et al., 2021). It considers the
usefulness of features in specific both training datasets and ensemble training datasets (oracle data),
contexts (Anjomshoae et al., either combined or separately. Both studies compared their global sur
2020). rogate approaches to simple DTs trained using normal datasets, and
Global Offers a simpler and more May not accurately represent
found that involving the black box model resulted in improved
Surrogate interpretable model as a reality since it is based on the
substitute for the complex complex model’s behaviour. May performance.
model (Molnar, 2021). be time consuming (Zhou and Despite the small number of papers investigating and comparing
Jiang, 2004). Explains ML black local interpretability techniques, we were able to find some studies that
boxes using more ML techniques strengthened their evaluations with quantitative results. Lakkaraju et al.
(white-boxes) (Molnar, 2021).
(2020) compared LIME (Ribeiro et al., 2016), SHAP (Lundberg and Lee,
ALE Captures the average effect of Requires defining bins or
features on the model’s intervals for continuous 2017), and model understanding through subspace explanations
predictions (Apley and Zhu, variables and may not fully (MUSE) (Lakkaraju et al., 2019) to their proposed method, robust post
2016). capture complex interactions hoc explanations (ROPE) using fidelity. Experiments were conducted on
between features (Apley and
an electronic health records dataset and two non-medical datasets
Zhu, 2016).
(Lakkaraju et al., 2016) to analyse the explanations produced by several
ML black boxes, including DNNs, XGB, RFs, and SVM. They adapted
a decision tree classifier to generate a set of interpretable if-then rules LIME and SHAP to generate global explanations using the submodular
that approximate the decision-making process of the black box model. pick procedure, which selects a set of representative points from the
These rules capture important features and their impact on the pre dataset and combines their local models to form a global explanation.
dictions. Additionally, LORE provides a pair of explanations consisting Adhikari et al. (2019) compared LIME and LEAFAGE (local example
of logic rules that describe the decision boundaries and counterfactual and feature importance-based model agnostic Explanations) using a
rules that explain how changing the values of certain features would similar approach to quantitatively evaluate explanations. They investi
alter the prediction. This combination of logic and counterfactual rules gated the interpretability of RF and SVM on the original dataset of breast
helps users gain insights into the decision logic of the black box model in cancer (Dua and Graff, 2017), as well as two other non-medical datasets.
a local and interpretable manner. For each test instance, they chose a radius by expanding it until the
corresponding hypersphere included a percentage of instances that did
2.2.7. Contextual Importance and Utility (CIU) not have the same predicted label as the test instance. The scores given
The CIU explanation (Främling, 1996; Anjomshoae et al., 2020) is by the local explanations were then compared to the scores given by the
another post hoc local interpretability technique. It focuses on under black box classifier on all test instances that fell into this hyper-sphere
standing that the importance of a feature in a context may be irrelevant using the area under the ROC (AUC) fidelity. As shown in Table 1,
in another, which is why they introduce contextual utility along with LIME performed better than LEAFAGE on linear ML models, whereas
importance to estimate the usefulness of the feature for the prediction. LEAFAGE performed better on nonlinear models.
CIU analyses the context variables associated with the data and quan El Shawi et al. (2019) systematically examined the effectiveness of
tifies their impact on the model’s decision-making process. By assessing three distinct interpretability techniques: LIME, SHAP, and Anchors,
the importance of these contextual factors, CIU provides insights into the within the context of a trained RF model. The primary focus of this
aspects of the input that are most influential in driving the model’s investigation encompassed diverse aspects, including the temporal de
predictions. mands of these techniques, their resilience to input perturbations, and
their adeptness in identifying biases, particularly those discernible
3. Related work through visual inspection rather than quantitative measurement. The
study’s findings revealed detailed information about how these tech
Our SLR (Hakkoum et al., 2021a) revealed that medical interest in niques perform; LIME exhibited the least favourable outcomes in terms
local interpretability increased in recent years, particularly in 2019. This of the identity which basically checks how well the interpretability
can be attributed to the fact that research no longer believes that the technique keeps the original details and relationships present in the
workings of a complex model can be understood in a generalisation black box model’s predictions. In contrast, LIME performed better when
context (i.e., it is difficult to reveal the process of the learned nonlinear it comes to separability. This metric essentially measures the technique’s
relationships by the black box) or to the fact that medicine encourages skill in telling apart the separate impacts of different features on the
personalised solutions as they can help in monitoring and treatment model’s predictions. SHAP turned out to be the fastest technique,
tasks. The SLR could not present an analysis of the comparisons between needing the least time on average to explain the model’s results and
managing to give understandable insights quickly compared to the other

5
H. Hakkoum et al. Engineering Applications of Artificial Intelligence 131 (2024) 107829

Table 3 original dataset) to 44 (SPECTF dataset) for attributes. Such differences

Datasets description. in the datasets (number of instances/attributes) can have a direct impact
Dataset Number of Number of Number of Attributes on the model performance, and it is important to check whether inter
instances classes attributes types pretability is affected in the same way.
Breast Cancer 699 2 10 Integer
Original (BC)
Breast cancer 569 2 32 Real
4.2. Performance and interpretability evaluation
Diagnosis (BCD)
Diabetic 1151 2 20 Integer, The first part of this subsection is model- and performance-related,
Retinopathy Real whereas the second part focuses on evaluating interpretability results.
Debrecen
In addition, we present the statistical testing method used in this study
(Diabetes)
Heart (SPECT) 267 2 22 binary to compare the different models and techniques used.
Heart (SPECTF) 267 2 44 Integer
Parkinson’s 197 2 23 Real 4.2.1. Model performance metrics
Multiple classification performance measures were used to select the
best performing model for each experiment.
techniques. Additionally, SHAP had a clear ability to help participants
find bias in the results. This shows that SHAP’s way of explaining can
• Accuracy is the proportion of correctly classified instances among all
help find cases where the model’s predictions might show biases, which
instances (true positives (TP) and true negatives (TN)), as given by
could show up as uneven effects on various groups or traits.
Equation (1).
Meanwhile, Molnar et al. (2020) defined a score based on three
• Precision is the proportion of TP among all instances classified as
factors: main effect complexity (MEC), interaction strengths (IAS), and
positive (TP and false positives (FP)), as shown in Equation (2).
number of features (NF), where a score of zero reflects low interpret
• Recall (sensitivity or true positive rate (TPR)): is the proportion of TP
ability and a score of three reflects high interpretability. The MEC rep
among all actual positive instances which are TPs and false negatives
resents the complexity of the input feature and the prediction
(FN)), as shown in Equation (3).
relationship. The IAS is the strength of interactions between features
• The F1-Score is the harmonic mean of precision and recall, and
which cannot be the sum of independent feature effects, because a
provides a balance between them, as described by Equation (4).
feature depends on the values of other features. Finally, NF represents
• Kappa coefficient is a statistic that measures the agreement between
the features that change the prediction and thus influence the outcome.
the model’s predictions and the true labels, taking into account the
Nevertheless, Molnar’s method does not determine confidence intervals
agreement that could occur by chance. It is calculated as the
for the estimated level of interpretability (Schwartzenberg et al., 2020).
observed agreement po minus the expected agreement pe divided by
Plumb et al. (2020) proposed the addition of an interpretability
one minus the expected agreement. The kappa coefficient ranges
regularizer based on fidelity or stability at training time to the loss
from − 1.0 to 1.0, with higher values indicating better agreement
function which was minimized for better performance. In addition to
than would be expected by chance alone. A kappa coefficient of
fidelity that was discussed earlier, which Plumb et al. referred to as
0 indicates agreement equivalent to chance, whereas a kappa coef
neighbourhood fidelity, they used stability which can be defined as the
ficient of 1 indicates perfect agreement.
degree to which the explanation changes between points and ensures
• The AUC is a measure of the model’s ability to distinguish between
that similar inputs yield similar explanations (Plumb et al., 2020). Their
positive and negative classes. It is calculated as the area under the
experiments improved the interpretability of different models trained on
Receiver Operating Characteristic (ROC) curve, which plots the TPR
the in-hospital mortality Suport2 dataset by more than 50%.
against the false-positive rate (FPR) (given by Equation (5)) at
Knapič et al. (2021) tried to increase trust in a CNN model trained on
various classification thresholds. The AUC score ranges from 0.0 to
gastral images of video capsule endoscopy (VCE) by applying LIME,
1.0, with higher values indicating better performance.
SHAP and CIU as post-hoc explanations. A survey of non-medical users
showed that CIU explanations outperformed LIME and SHAP in TP ± TN
Accuracy = (1)
improving decision-making support, transparency, and speed. These TP + FP + TN + FN
findings suggest a potential for effective decision support in medical
settings. Precision =
TP
(2)
Furthermore, Table 2 reports a pros and cons list in an intent to TP + FP
compare the seven interpretability techniques used in this empirical / /
TP
evaluation according to literature. Recall sensitivity TPR = (3)
TP + FN
4. Datasets and metrics 2 ∗ Precison ∗ Recall
F1 − score = (4)
Precision + Recall
This section presents the medical datasets, performance and inter
pretability metrics, K-fold cross-validation, Borda count, and the sta FP
FPR = = 1 − Specificity (5)
tistical test used in this study. TN + FP
TN
4.1. Datasets description where: Specificity/TrueNegativeRate = TN+FP .

Six publicly available datasets were used for the experiments con 4.2.2. Interpretability metrics
ducted in this comparative study. Table 3 describes the datasets in terms Measuring interpretability still represents a challenge for the ML
of the number of instances, classes, and attributes as well as the type of community because interpretability can take different forms, and thus
attributes (integer, real, or binary). The datasets were taken from the can be evaluated differently which sometimes makes it difficult to
UCI repository and are all connected to the medical field; however, they compare many techniques with the same metric.
present different levels of evaluation for this experiment because the On a global scale, accuracy, comprehensibility, and fidelity to the
number of instances and attributes varied from 1151 (diabetes dataset) black box model were used for the global surrogate. Meanwhile, Ken
to 197 (Parkinson’s dataset) for instances and from 10 (breast cancer dal’s rank correlation was used to compare the feature importance

6
H. Hakkoum et al. Engineering Applications of Artificial Intelligence 131 (2024) 107829

resulting from the global surrogate, SHAP summary plot, and a white
box model (DT) trained directly on the data. On a local scale, the average
of the faithfulness metric over the test set instances, percentage of
monotonic instances in the test set, and time of execution were used to
evaluate and compare LIME, SHAP, MAPLE, LORE, and CIU. These
metrics are defined as follows.

• The fidelity metric is a measure of how closely a surrogate model

approximates the behaviour of the original model. One common
approach for measuring fidelity is to compare the predictions of the
surrogate model to the predictions of the original model on a hold-
out set of test data. The fidelity metric can be computed as the dis
tance between the predictions of the two models (black box and the
global surrogate). The closer the predictions of the surrogate model
are to the predictions of the original model, the higher the fidelity
score. In this study, fidelity was computed using accuracy, as shown Fig. 1. Experimental design.
in Equation (1), to compare the predictions of the black box with
those of the global surrogate. performance. Specifically, with k = 10, the model is trained on 90% of
• The comprehensibility of the global surrogate DTs was measured the available data and evaluated on 10% of the data, with this process
using the depth and number of leaves to check how easily these repeated 10 times.
surrogate models can be understood and interpreted. Borda count is a voting method (Borda, 1784) that determines the
• The predictive performance of the global surrogate reflects how well outcome of an election by assigning points to each candidate based on
the global surrogate DT predicts the outcome of interest. This was their rank in the preferences of voters. In the Borda count, the candidate
defined using accuracy, as defined in Equation (1). with the most points is declared the winner, which is the best-
• Kendall’s rank correlation coefficient, also known as Kendall’s tau, performing model (Risse, 2005).
can be applied to two sets of ranks to measure the strength and di The SK test is a hierarchical clustering algorithm used to rank and
rection of the monotonic relationship between them. It evaluates compare the performance of various models and interpretability tech
agreement in the order of ranks rather than the actual values. It niques to determine whether they exhibit significant differences. The
ranges between − 1 and 1, where − 1 indicates perfect negative algorithm sorts the methods based on a specific metric: accuracy for
agreement, 1 indicates perfect positive agreement, and 0 indicates no black box models, fidelity for global surrogates, and the average of
agreement. faithfulness for local interpretability techniques. Once the methods are
• Faithfulness (Alvarez-Melis and Jaakkola, 2018) assesses the corre sorted, the SK test recursively splits them into groups based on the sta
lation between the importance assigned by a local interpretability tistical differences. This process involves testing the hypothesis that
algorithm to the features and the actual effect of these features on the there are no significant differences between the methods in each group.
performance of the predictive model. This metric examines whether During recursive splitting, the SK test calculates a p-value that quantifies
features deemed important by the interpretability algorithm truly the probability of observing the observed differences between the
have a significant impact on the model’s performance. This is done methods by chance alone, assuming the null hypothesis (no significant
by systematically removing each attribute considered important by difference) is true. The p-value represents the strength of the evidence
the interpretability metric and evaluating the resulting effect on the against the null hypothesis. A smaller p value suggests stronger evidence
model’s performance. The metric then calculates the correlation of significant differences in performance. Based on the calculated p-
between the importance weights assigned to the attributes and the values, the SK test identifies groups of methods that are statistically
corresponding changes in model performance. A higher correlation different from one another. The resulting groups are then ranked ac
indicates a stronger alignment between the assigned importance and cording to their mean performance, with the top-performing group
actual effect of the attributes on the performance of the model. identified as the winner or the best cluster.
• Monotonicity (Luss et al., 2019) examines the effect of individual
features on the performance of a model by evaluating the impact of 5. Experimental design
incrementally adding each feature in order of increasing importance.
The metric assumes that as each feature is added to the model, the This section describes the empirical evaluation process employed in
performance of the model should correspondingly increase, resulting this study. The process can be divided into four primary steps, as illus
in a monotonically increasing trend in model performance. By trated in Fig. 1. These steps include the following:1) Model construction.
assessing this monotonic relationship between feature importance 2) Global interpretation using global surrogate, SHAP bars summary
and model performance, the metric evaluates the consistency of the plots, ALE, and an interpretable by design model; 3) local interpret
interpretability algorithm in correctly identifying the impact of fea ability using LIME, SHAP, MAPLE, LORE, and CIU; and 4) Accuracy-
tures on model predictions. interpretability trade-off analysis.

4.2.3. Validations and statistical testing 5.1. Step 1: model construction

K-fold cross-validation is used to assess the performance of an ML
model by dividing the available data into k equal parts or folds. The After removing any missing values from the datasets, each dataset
model is trained on k folds of the data and evaluated on the remaining was divided into two distinct subsets: an 80% cross-validation set for
fold. This process is repeated k times, with each fold used as the test set model training and evaluation, and a 20% test set to comprehensively
once, and the average performance across all k folds is used as the es assess the final model performances and evaluating the interpretability
timate of the model’s performance. It helps reduce the bias of the model techniques. The Synthetic Minority Oversampling Technique (SMOTE)
and provides an idea of how it will behave or generalise over new/un algorithm (Chawla et al., 2002) was then used to balance the
seen data. K = 10 is often recommended as a standard value for k in k- cross-validation set.
fold cross-validation because it has been shown to provide a good bal Each experiment involved one model and one dataset configuration,
ance between the bias and variance in the estimate of the model’s and 30 experiments were conducted based on five models (MLP, SVM,

7
H. Hakkoum et al. Engineering Applications of Artificial Intelligence 131 (2024) 107829

Fig. 2. Global surrogate process.

RF, XGB, and NB) and six datasets. The hyperparameters of the models
Table 4
were selected based on their accuracy with 10-folds cross-validation
Models accuracy with default hyperparameters.
using the PSO algorithm. The performance of each model was evalu
ated using the performance metrics on the test sets. The accuracy metric Dataset BC BCD Diabetes SPECT SPECTF Parkinson’s

was used to conduct the SK test across the datasets, whereas the six MLP 0.948 0.938 0.766 0.753 0.740 0.538
metrics defined in Section 4 for model performance were employed by SVM 0.970 0.938 0.653 0.753 0.827 0.641
RF 0.963 0.973 0.714 0.740 0.827 0.897
the Borda count.
XGB 0.956 0.956 0.709 0.740 0.827 0.923
NB 0.963 0.974 0.571 0.790 0.728 0.743
5.2. Step 2: global interpretability

faithfulness on the test set to identify the best performing local tech
Fig. 2 illustrates how the global surrogate using a DT was trained for
nique across different datasets.
each experiment using the black box model prediction (Ytrain-BB and Ytest-
BB) as the class labels instead of the original ground-truth labels. This
new dataset is referred to as the Oracle dataset (Johansson and Niklas 5.4. Step 4: accuracy-interpretability trade-off
son, 2009). The DTs learned from the Oracle dataset reflect the behav
iour of the black box model and not the ground-truth labels, because the An insightful accuracy-interpretability analysis was conducted for
DTs do not have access to them (Molnar, 2021). The fidelity of the global each experiment to delve into the intricate trade-off between these two
surrogate to the black box model was assessed using the accuracy, depth crucial factors. The aim was to uncover the extent of the trade-off gap
of the resulting DT, and number of leaves. The fidelity-based SK test and and identify the models that are particularly susceptible to interpret
Borda count based on fidelity, accuracy, depth, and number of leaves, ability constraints, especially when considering a white box model (DT).
were conducted to identify the best-performing global surrogate model This meticulous step involved comparing the performance accuracy of
for each dataset. models 1) with the fidelity of the global surrogate and 2) with the
On the other hand, a total of 30 SHAP summary plots were generated average local faithfulness across all instances in the test set for each local
for each experiment, representing the combination of models and interpretability technique. By exploring this interplay, we sought to gain
datasets. These plots were initially compared with the feature impor a comprehensive understanding of the models’ susceptibility to inter
tance derived from global surrogate decision trees (DTs). Subsequently, pretability limitations and shed light on the models that strike an
both the SHAP and global surrogate rankings were compared individ optimal balance between accuracy and interpretability.
ually against the rankings produced by the interpretable model, which
consisted of a DT classifier constructed directly on the raw datasets. 6. Results and discussion
These comparisons were based on Kendal’s rank correlation metric.
Finally, ALE plots were generated for the most consensually identi This section presents and discusses the findings of this empirical
fied features. These plots provide insights into how these particular evaluation in order to answer the RQs defined in Section 1. The exper
features influence the behaviour of black box models. By examining iments were conducted on a Lenovo Legion laptop with hexa-core Intel
these ALE plots, we can gain insights into the intricate relationships Core i7-9750H processor with 16 GB of RAM and a base speed of 2.59
between these features and the model’s predictions. This knowledge can GHz using Windows 10.
potentially inform medical practitioners and researchers about the
specific mechanisms through which certain important features influence 6.1. Model construction, performance and validation (step 1)
prediction, thereby enhancing our understanding of the underlying
factors driving the model’s decision-making process in the medical This step aimed to build and test five optimised models (MLP, SVM,
domain. RF, XGB, and NB) on six datasets.
Table 4 represents the accuracy scores of the five black box models
5.3. Step3: local interpretability without any parameter tunning (default) on the six datasets. Overall,
these scores for all models, ranging from 0.538 to 0.973, show that some
In this step, five local interpretability techniques, LIME, SHAP, of them perform fairly well. The accuracy levels between the various
MAPLE, LORE, and CIU, were applied to each instance in the test set for datasets differ noticeably. For instance, the accuracy scores for all
each experiment. For each instance, the explanations generated by the models on the BCD dataset, which range from 0.938 to 0.974, are
five techniques were evaluated in terms of faithfulness and mono consistently high. The accuracy scores for the Parkinson’s dataset, on the
tonicity, which were computed based on the local importance of the other hand, range from 0.538 to 0.923. SVM, NB, and XGB appear to
instance features generated by each technique. In addition, the time consistently achieve high accuracy scores across a variety of datasets
required by each technique to generate explanations for our six models when comparing the performance of the models. SVM, for example,
(MLP, SVM, RF, XGB, and NB, and the DT white box) was recorded to excels on BC and SPECTF, whereas NB excels on datasets like BCD and
enable a later comparison of the techniques in terms of execution time. SPECT, and Parkinson’s. Meanwhile, XGB outperforms the other clas
Furthermore, the SK significance test was conducted using the average sifiers on SPECTF and Parkinson’s.

8
H. Hakkoum et al. Engineering Applications of Artificial Intelligence 131 (2024) 107829

Table 5
Model optimised hyperparameters using PSO.
Model Dataset BC BCD Diabetes SPECT SPECTF Parkinson’s
Hyperparameter

MLP Neurons 137 389 334 101 340 189

Learning rate 0.1265 0.0184 0.0187 0.3272 0.0207 0.0436
Batch Size 115 47 49 61 64 54
Epochs 386 373 96 286 250 279
SVM C 49.1715 32.3945 93.4324 70.2433 39.4157 74.9023
Log Gamma − 2.7521 − 3.7824 − 4.2382 − 2.9803 − 2.5181 − 2.4755
RF Number of estimators 103 83 53 92 187 185
XGB Number of estimators 50 176 114 85 139 148
NB Var_ smoothing 1.83 e− 07 9.73 e− 10 5.46 e− 10 7.61 e− 07 8.95 e− 07 1.88 e− 07

Table 6
Performance results for MLP and SVM.
Dataset BC BCD Diabetes SPECT SPECTF Parkinson’s

Model MLP SVM MLP SVM MLP SVM MLP SVM MLP SVM MLP SVM
Accuracy 0.956 0.978 0.947 0.938 0.714 0.735 0.691 0.753 0.679 0.802 0.794 0.820
Precision 0.981 0.966 1.00 0.928 0.822 0.868 0.825 0.883 0.954 0.820 0.875 0.903
Recall 0.9138 0.982 0.860 0.907 0.617 0.617 0.787 0.803 0.636 0.969 0.875 0.875
F1_Score 0.946 0.974 0.925 0.917 0.705 0.721 0.806 0.841 0.763 0.888 0.875 0.888
Kappa 0.909 0.955 0.884 0.868 0.438 0.483 0.050 0.291 0.321 0.052 0.303 0.422
AUC 0.997 0.997 0.998 0.971 0.804 0.822 0.687 0.784 0.768 0.905 0.714 0.817

Table 7
Performance results for RF and XGB.
Dataset BC BCD Diabetes SPECT SPECTF Parkinson’s

Model RF XGB RF XGB RF XGB RF XGB RF XGB RF XGB

Accuracy 0.971 0.949 0.974 0.956 0.701 0.714 0.740 0.740 0.827 0.827 0.923 0.923
Precision 0.965 0.947 0.954 0.952 0.770 0.772 0.857 0.868 0.948 0.919 0.939 0.939
Recall 0.965 0.931 0.977 0.930 0.656 0.687 0.818 0.803 0.833 0.863 0.968 0.968
F1_Score 0.965 0.939 0.965 0.941 0.709 0.727 0.837 0.834 0.887 0.890 0.953 0.953
Kappa 0.940 0.895 0.944 0.906 0.406 0.429 0.202 0.238 0.525 0.480 0.723 0.723
AUC 0.995 0.994 0.998 0.993 0.764 0.773 0.715 0.727 0.881 0.871 0.955 0.964

Tables 6–8 list the metric values for each experiment. Table 6 reports
Table 8
a comparison between SVM and MLP in terms of different metrics across
Performance results for NB.
various datasets. The reported accuracy values indicate that SVM out
Dataset BC BCD Diabetes SPECT SPECTF Parkinson’s performed MLP in all datasets, except for BCD, where MLP exhibited a
Accuracy 0.963 0.974 0.576 0.802 0.728 0.795 slight advantage with a difference of 0.9%. Similarly, Table 7 demon
Precision 0.949 1 0.826 0.903 0.978 0.900 strates that RF performed better than XGB on the BC and BCD datasets,
Recall 0.965 0.930 0.297 0.848 0.682 0.844
whereas both models achieved similar accuracies on SPECT, SPECTF,
F1_Score 0.957 0.964 0.437 0.875 0.804 0.871
Kappa 0.925 0.943 0.203 0.407 0.406 0.373 and Parkinson’s. Meanwhile, Table 8 reports the performance results of
AUC 0.990 0.998 0.712 0.885 0.891 0.862 the trained NB models on the six datasets. NB showcased very low ac
curacy on the Diabetes, SPECTF and Parkinson’s datasets (0.576, 0.728,
0.795 respectively), similar to MLP (0.714, 0.679, 0.794 respectively).
Subsequently, the models are optimised using the PSO algorithm on Comparing the findings from Tables 6–8, we observe that RF and
the basis of accuracy. Table 5 lists the optimal hyperparameters chosen XGB exhibit superior performance in terms of accuracy compared to
by PSO. For MLP, the BC dataset required the highest number of epochs SVM and MLP on the BC, DBC, SPECT, SPECTF, and Parkinson’s data
(386) and batch size (115), whereas the other datasets ranged from 64 to sets. Additionally, NB gave the best accuracy results on the SPECT
47 for the batch size and between 373 and 96 for the optimal number of dataset (0.802).
epochs. The SPECTF dataset had the highest learning rate (0.3272), Overall, it can be seen that PSO optimisation improved the accuracy
whereas BCD had the highest number of hidden neurones (389). For the of MLP and SVM models over several datasets. In the BC, BCD, SPECTF,
SVM, the penalty parameter C was the highest for the diabetes dataset and Parkinson’s datasets, MLP’s accuracy increased, whereas SVM’s
(93.4) and lowest for the BCD dataset (32.4). On the other hand, RF and accuracy in the BC, Diabetes, and Parkinson’s datasets improved. On the
XGB had the same optimised hyperparameter (the number of estimators) other hand, the impacts of PSO on RF and XGB, were comparatively
which allows comparison. It should be noted that XGB required a smaller minimal, with accuracy levels remaining constant with and without
number of estimators for BC, SPECT, SPECTF, and Parkinson’s with 50, optimisation. For some datasets where the initial NB model performed
85, 139, and 148 estimators, respectively, compared to RF with 103, 92, well (BC and DBC), PSO did not lead to substantial changes. However,
187, and 185 estimators, respectively. For NB, it is possible to observe for datasets with lower initial accuracy (Diabetes: 0.571 vs. 0.576), PSO
that the optimal ’var_smoothing’ values tend to be in the range of e− 10 to still managed to make marginal improvements. Notably, for SPECT,
e− 07 across the different datasets. This suggests that a moderate amount SPECTF, and Parkinson’s, PSO optimisation significantly enhanced NB’s
of smoothing is effective for handling the probabilities and improving accuracy, highlighting its effectiveness in optimizing the hyper
the classification accuracy of the black box model. parameters for improved classification performance.

9
H. Hakkoum et al. Engineering Applications of Artificial Intelligence 131 (2024) 107829

Fig. 3. Accuracy based SK results of the black box models.

secured the top position based on Borda count by appearing first in BCD,
Table 9
SPECTF, and Parkinson’s for RF, and first in BC and Diabetes for SVM.
Borda count scores for the black box models on each dataset.
Meanwhile, XGB and NB also appeared first in SPECT and Parkinson’s
Dataset Model Borda score respectively.
BC MLP 17
SVM 28
RF 22
6.2. Global interpretability (step 2)
XGB 8
NB 15 Global surrogates were generated using the process outlined in Fig. 2
BCD MLP 17 to assess the global interpretability of the constructed models (RQ2).
SVM 7
The resulting accuracy (on original data), fidelity using accuracy (on
RF 26
XGB 17 oracle data), depth, and number of leaves of these surrogates are pre
NB 23 sented in Table 10.
Diabetes MLP 19 First, the SK test was used to compare the surrogates of the five black
SVM 27 box models based on their fidelity values to determine whether any
RF 14
XGB 21
significant differences existed. Additionally, we used the Borda count
NB 9 scores, as shown in Table 11, to further evaluate the global interpret
SPECT MLP 8 ability of the models for each dataset using the metrics provided in
SVM 23 Table 10.
RF 14
Subsequently, we compared the feature importance according to
XGB 15
NB 30 these global surrogates to those generated by the SHAP summary plots.
SPECTF MLP 10 Then, each of these two sets of rankings was compared to the DT trained
SVM 19 on the original data to check how close they were to the white-box
RF 22 perspective. Finally, ALE plots were generated for the most consensu
XGB 21
ally identified important features for the BC dataset to verify how they
NB 18
Parkinson’s MLP 10 influence the model’s decision-making process. This analysis not only
SVM 16 enables us to contribute valuable knowledge to the medical tasks asso
RF 27 ciated with the investigated datasets, but also helps establish a level of
XGB 27
trust in these black box models.
NB 10
According to the fidelity values in Table 10, the RF and NB global
surrogates demonstrate higher fidelity to the black box model, per
Subsequently, the performances are evaluated and compared in forming best in SPECTF, and Parkinson’s for RF and in BCD and Diabetes
terms of accuracy using SK to verify whether the difference between the for NB. SVM was the best in the SPECT datasets, whereas XGB excelled in
five models is statistically significant across the datasets. The test’s BC. Nevertheless, considering the depth of the trees, MLP consistently
default assumption (i.e., the null hypothesis) was that the accuracy achieved the best (smallest) depth, except for the BC and Diabetes
values of the five models were not significantly different. If the null datasets, where NB surpassed with a depth of 6 and 10 respectively.
hypothesis is rejected (when the p-value is less than 0.05), the models Similarly, for the number of leaves, NB achieved the best values in five
are considered significantly different. The performances were then datasets, whereas MLP surpassed in the sixth dataset (BC). Table 10 also
evaluated in terms of six model performance metrics (accuracy, preci provides the accuracy of the global surrogates based on the true labels
sion, recall, F1 Score, Kappa, and AUC). The SK test, as shown in Fig. 3, (original dataset). SVM demonstrates higher accuracy in Diabetes
revealed that the black box model’s accuracy assigns them to one clus (67.1%), XGB in SPECTF (81.48%), and RF in Parkinson’s (87.18%).
ter, even though RF had the best mean accuracy. Furthermore, Table 9 Meanwhile NB excels on BCD and SPECT datasets (95.6% and 79%
reports that two black box models, namely RF and SVM, consistently respectively).
Fig. 4 demonstrates that the SK significance test assigned the mean of

10
H. Hakkoum et al. Engineering Applications of Artificial Intelligence 131 (2024) 107829

Table 10
Global interpretability results for each experiment.
Model BC BCD Diabetes SPECT SPECTF Parkinson’s

Accuracy MLP 93.4 92.1 62.3 65.4 61.7 84.6

SVM 93.4 94.7 67.1 69.1 80.2 82
RF 92.7 94.7 60.1 69.1 76.5 87.1
XGB 93.4 93.8 58.4 65.4 81.4 84.6
NB 91.9 95.6 61.9 79 69.1 82
Fidelity MLP 94.8 93.3 75.3 81.4 76.5 84.6
SVM 95.6 93.8 81.3 91.3 77.7 84.6
RF 94.1 97.3 71.8 90.1 83.9 89.7
XGB 97 96.4 73.1 86.4 81.4 87.1
NB 95.6 98.2 93 83.9 81.4 87.1
Depth MLP 8 6 15 8 6 5
SVM 7 8 20 8 15 5
RF 9 8 23 9 15 6
XGB 9 8 21 8 15 6
NB 6 8 10 9 9 5
Number of leaves MLP 13 19 109 37 19 11
SVM 15 22 114 18 27 11
RF 27 22 187 37 27 13
XGB 28 22 188 36 27 13
NB 15 16 20 17 17 10

white box classifier reported in Table 12 in terms of accuracy with the

Table 11 exception of the SPECTF dataset where the best black box models (RF
Global interpretability Borda count winners. and XGB) achieved a similar accuracy (82.72) to the DT classifier.
Dataset Borda winner Nevertheless, the comprehensibility of the global surrogates, as shown
BC SVM in Table 10 by the depth and number of leaves, consistently surpassed
BCD NB that of the DTs trained on the original labels.
Diabetes NB In continuation, SHAP summary plots were generated for every black
SPECT SVM box model in each dataset. This step resulted in 30 plots, whose feature
SPECTF NB
Parkinson’s RF and MLP
ranks were compared against the global surrogates (denoted as GS) in
Table 13 (GS-SHAP). Moreover, the ranks were also compared against
the DT classifiers feature importance (GS-DT and SHAP-DT in Table 13).
fidelity of the models across the datasets to the same cluster, although
NB had the best fidelity mean. On the other hand, the Borda elections on Table 12
the basis of the global interpretability metrics, as shown in Table 11, Performance of DTs on the original datasets.
show the outperformance of NB on BCD, Diabetes, and SPECTF, while
Metric Accuracy Depth Number of leaves
SVM excels on BC and RF and MLP share the top position on the Par
kinson’s dataset. BC 92.70 9 27
BCD 94.74 8 22
Additionally, DT classifiers were constructed on the original datasets Diabetes 55.84 24 191
to obtain an interpretable model perspective on feature importance. SPECT 61.73 10 43
Table 12 presents the accuracy and size of the DTs. We noticed that most SPECTF 82.72 15 27
of the black box models, shown in Tables 6–8, outperformed the DTs Parkinson’s 89.74 6 13

Fig. 4. SK results of the black box models according to global surrogate fidelity.

11
H. Hakkoum et al. Engineering Applications of Artificial Intelligence 131 (2024) 107829

Table 13
Global features ranks comparisons.
Kendal’s rank BC BCD Diabetes SPECT SPECTF Parkinson’s

Corr. p-val Corr. p-val Corr. p-val Corr. p-val Corr. p-val Corr. p-val

MLP GS-SHAP 0.592 0.028 0.286 0.041 0.317 0.059 0.206 0.19 0.304 0.01 − 0.07 0.681
GS-DT 0.592 0.028 0.068 0.647 0.576 0.001 0.613 0 − 0.05 0.703 0.201 0.276
SHAP-DT 0.167 0.612 0.189 0.174 0.152 0.363 0.166 0.283 0.063 0.577 0.251 0.13
SVM GS-SHAP 0.487 0.08 − 0.02 0.861 0.059 0.726 0.328 0.043 0.169 0.139 0.311 0.065
GS-DT 0.548 0.049 0.583 0 0.627 0 0.219 0.181 0.81 0 0.625 0.001
SHAP-DT 0.389 0.18 0.185 0.181 0.047 0.779 0.41 0.008 0.122 0.278 0.271 0.101
RF GS-SHAP 0.556 0.045 0.251 0.071 0.387 0.021 0.331 0.032 0.232 0.038 0.101 0.546
GS-DT 0.83 0 0.689 0 0.84 0 0.338 0.029 0.696 0 0.371 0.04
SHAP-DT 0.611 0.025 0.259 0.061 0.411 0.014 0.401 0.009 0.345 0.002 0.241 0.146
XGB GS-SHAP 0.611 0.025 0.324 0.021 0.661 0 0.347 0.025 0.269 0.017 0.078 0.645
GS-DT 1 0 0.63 0 0.89 0 0.574 0 0.76 0 0.91 0
SHAP-DT 0.611 0.025 0.317 0.022 0.669 0 0.566 0 0.382 0.001 0.118 0.477
NB GS-SHAP 0.667 0.013 0.297 0.038 0.502 0.004 0.212 0.187 0.257 0.027 0.129 0.45
GS-DT 0.278 0.358 0.292 0.058 − 0.07 0.669 0.076 0.639 0.021 0.866 0.305 0.099
SHAP-DT 0.167 0.612 0.1 0.467 0.012 0.944 − 0.1 0.534 0.157 0.164 0.046 0.781

Fig. 5. ALE plots of “bare nuclei” feature for SVM (left) and RF (right).

Table 13 indicates the Kendal’s rank correlation between the rankings of respectively), for SVM in SPECTF (0.806), and for XGB in BC, diabetes,
features according to the global surrogate, the global SHAP explainer, SPECTF, and Parkinson’s (1, 0.891, 0.758, and 0.908 respectively), with
and the DTs. a very small p-value, indicating strong evidence and a statistically sig
The Kendal’s rank correlation was reported in Table 13 to describe nificant correlation. Meanwhile, the lowest correlation coefficients
the relationship between the rankings of features according to the global (below 0.1) were reported for MLP when comparing: global surrogate to
surrogate and SHAP global explainer, then according to each of these SHAP on Parkinson’s (− 0.070), global surrogate to DT in DBC and
two and the DT trained on the original dataset. Table 13 demonstrates SPECTF (0.068, and − 0.048 respectively), and SHAP to DT in SPECTF
that the strongest agreements (above 0.7) were for the global surrogate (0.063). For SVM, the lowest coefficient was reported when comparing
and the DT classifier for RF in BC and Diabetes (0.833, 0.835 GS to SHAP in DBC (− 0.024). For XGB, the correlation between the

Fig. 6. ALE plots of “uniformity of cell size” for SVM (left) and RF (right).

12
H. Hakkoum et al. Engineering Applications of Artificial Intelligence 131 (2024) 107829

Table 14
Local interpretability evaluation using monotonicity and faithfulness averages.
Dataset Technique LIME SHAP MAPLE LORE CIU

Metrics M% Fμ M% Fμ M% Fμ M% Fμ M% Fμ

BC MLP 3.65 0.016 3.65 − 3.01 7.30 − 2.23 2.19 − 13.73 7.30 3.29
SVM 0.73 − 0.234 15.33 18.8 5.11 − 0.74 4.38 − 26.13 5.11 2.80
RF 83.94 0.257 82.48 31.28 78.83 12.88 76.64 20.15 72.26 35.50
XGB 58.39 0.186 58.39 22.3 58.39 8.68 59.12 10.96 48.18 31.73
DT 99.27 0.103 99.27 21.8 97.08 11.88 96.35 12.6 83.21 22.29
NB 24.09 18.05 24.09 − 6.88 26.28 43.52 15.33 − 10.43 28.47 25.98
BCD MLP 0 0.125 1.75 25.6 0 − 2.65 0 39.14 0 47.19
SVM 9.65 0.610 7.89 79.57 8.77 1.09 7.89 64.86 7.89 70.06
RF 2.63 0.166 1.75 19.51 13.16 3.49 6.14 12.17 0 23.50
XGB 1.75 0.256 0.88 22.71 1.75 − 3.11 1.75 19.47 0.88 34.06
DT 80.7 0.165 64.91 29.39 51.75 2.34 76.32 27.15 60.53 43.59
NB 0 23.29 0 24.87 9.65 − 2.98 0.88 19.42 14.91 22.93
Diabetes MLP 0 0.101 0 22.08 0 − 2.89 0 12.75 0 9.91
SVM 17.32 0.03 18.18 16.45 0 − 1.21 0 21.27 0 30.83
RF 1.3 0.203 0 44.33 0.87 4.81 0 13.79 0 30.21
XGB 0 0.239 0 43.51 0 − 3.13 0 32.27 0 48.50
DT 49.78 0.063 43.72 47.1 39.83 12.37 22.51 22.27 31.60 48.50
NB 0.43 − 3.70 2.16 − 7.93 4.76 − 1.94 20.35 − 3.01 20.35 − 7.69
SPECT MLP 25.93 31.61 32.10 59.81 20.99 19.17 14.67 40.58 24.69 52.95
SVM 27.16 58.96 48.15 72.64 35.80 35.26 18.67 59.78 24.69 63.00
RF 44.44 43.6 40.74 71.26 25.93 40.36 24.00 42.89 30.86 50.75
XGB 22.22 40.92 29.63 66.42 19.75 41.29 9.33 47.87 17.28 66.13
DT 32.10 19.83 34.57 49.05 39.51 20.02 32.00 39.38 37.04 41.39
NB 87.65 33.90 87.65 48.22 83.95 4.86 82.67 27.21 87.65 44.97
SPECTF MLP 0 2.29 0 8.76 0 − 7.27 0 5.22 0 21.80
SVM 37.04 − 11.76 33.33 − 17.27 40.74 6.73 44.44 − 0.36 39.51 − 12.43
RF 0 32.15 1.23 30.79 1.23 2.16 0 2.20 0 − 1.20
XGB 0 30.79 0 27.78 0 2.24 0 1.20 0 17.57
DT 74.07 6.8 79.01 14.63 56.79 − 6.88 83.95 15.74 67.90 30.55
NB 76.54 1.56 76.54 5.57 76.54 − 1.02 76.54 0 76.54 11.39
Parkinson’s MLP 0 9.13 0 13.11 0 2.68 0 26.74 0 32.69
SVM 69.23 22.43 71.79 60.13 48.72 − 1.09 74.36 58.29 74.36 55.82
RF 0 53.24 0 64.76 2.56 4.04 0 56.41 0 64.62
XGB 0 59.66 0 76.55 0 17.08 0 66.19 0 86.89
DT 23.08 25.86 17.95 56.76 74.36 2.15 17.95 45.13 28.21 69.43
NB 5.13 30.36 5.13 38.06 7.69 − 0.78 5.13 22.53 2.56 47.08

global surrogate and SHAP in the Parkinson’s dataset was the lowest
Table 15
(0.078). Lastly, for NB, correlation between global surrogate and DT was
Execution time (in seconds) over the test set for the local interpretability
the lowest on the SPECTF dataset (0.021).
techniques.
For the SHAP summary plots, the “bare nuclei” feature was always
elected as the most important for the BC dataset, as opposed to the global Dataset Number of test LIME SHAP MAPLE LORE CIU
records
surrogate and the DT classifier importances. Interestingly, literature
shows that a low “bare nuclei” is one of the most important features to BC 137 3656 1457 3587 42626 116
BCD 114 6397 5818 8333 44262 1062
get a benign diagnosis (Reis-Filho et al., 2002). For further analysis of
Diabetes 231 7028 32566 4648 21173 699
feature importance, ALE plots were generated specifically for this SPECT 81 789 1644 511 9411 202
feature, as an example of ALE plots, to understand how this feature af SPECTF 81 9211 3173 1075 18673 672
fects the models. Parkinson’s 39 1781 741 120 13692 118
Fig. 5 represents the ALE plots of the “bare nuclei” feature for SVM
and RF as the best performing classifiers. ALE plots represent the change
To conclude, this analysis shows the alignment between the rankings
in the model’s prediction probability as the feature changes. The x axis
based on Shapley values, ALE plots and the medical knowledge by
in the in the plot represents the feature’s range of values while the y
ranking “bare nuclei” as the top influencer feature for the black box
represents the change in the model’s prediction. The plots in Fig. 5 show
models.
a monotonic increasing relationship, which suggests that increasing the
feature value tends to increase the model’s predicted probability.
The “uniformity of cell size” feature was ranked as the top feature by 6.3. Local interpretability (step 3)
all the global surrogate trees as well as the DT classifier trained on the
raw dataset. ALE plots in Fig. 6 show how this feature changes the Following the process described in Section 5 (step 3), local expla
prediction for SVM and RF. While increasing the features value seems to nations of each test set were evaluated and compared using faithfulness,
increase the SVM’s predicted probability, we notice that this change is monotonicity, and execution time. Thereafter, SK test was performed on
not as strong as for the “bare nuclei” feature as shown in Fig. 5. On the the basis of the faithfulness average on every dataset, while Borda was
other hand, “uniformity of cell size” does not seem to have any signifi applied using the three aforementioned metrics.
cant impact of the RF model’s predictions since it does not contribute to Table 14 reports the percentage of test records where the local
the variability or change in the model’s output. Therefore, this feature technique explanation was monotonic (denoted as M%), and the average
alone does not have any meaningful relationship with the prediction yet of faithfulness over the test records (denoted as Fμ). Moreover, Table 15
it is possible for it becomes more influential when combined with other presents the execution time of the explanations of all test records for
features. every dataset on the five black box models.

13
H. Hakkoum et al. Engineering Applications of Artificial Intelligence 131 (2024) 107829

Fig. 7. SK results for the five classifiers based on faithfulness of local techniques across datasets.

Fig. 8. SK results of local techniques based on faithfulness average across experiments.

It was observed that the explanations generated for the DT classifier monotonicity score was reported for LIME and SHAP in BC with the DT
exhibited the best monotonicity and faithfulness’s average. For the local classifier (99.27). Considering black boxes only, the highest mono
techniques LIME, SHAP, MAPLE, LORE, and CIU, these proportions of tonicity values were in BC with RF using LIME and SHAP (83.94, 82.48
test records for the monotonic explanations were as follows for BC: respectively). MLP and NB came in a separate second cluster according
99.27%, 99.27%, 97.08%, 96.35%, and 83.21% respectively. For DBC, to the SK test when applied on the basis of faithfulness as shown in Fig. 7.
the fractions were 80.7%, 64.91%, 51.75%, 76.32, and 60.53% Fig. 8 also employs faithfulness to assess the significance of the differ
respectively. For Diabetes, they were 49.78%, 43.72%, 39.83%, 22.51%, ences between the local interpretability techniques employing the SK
and 31.6% respectively. While for SPECTF, they were 74.07%, 79.01%, test. CIU and SHAP were both classified within the top performing
56.79%, 83.95%, and 67.9% respectively. This superiority on the DT cluster, followed by LORE, then LIME and MAPLE last. Table 15 shows
classifier suggests that white classifiers are easier to interpret by the used that CIU executions over the test set were the fastest compared to the
local interpretability techniques. Meanwhile, for SPECT, DT did not other techniques while LORE had the worst execution time except on the
have the best monotonicity proportion in all explanations with 39.51% Diabetes dataset.
for MAPLE and 32% for LORE as the highest values. Similarly, the best
value for the DT explanations in the Parkinson’s dataset was with
74.36% for MAPLE. SVM took the lead in this dataset for LIME, SHAP, 6.4. Accuracy vs interpretability trade-off (step 4)
LORE, and CIU (69.23, 71.79, 74.36, and 74.36, respectively).
From the interpretability techniques perspective, the highest Many studies have been conducted to fix the accuracy-
interpretability trade-off (Wunsch and Saad, 2007; Huysmans et al.,

14
H. Hakkoum et al. Engineering Applications of Artificial Intelligence 131 (2024) 107829

Fig. 9. Accuracy vs. global surrogate fidelity for each experiment.

2008). In this study, global and local interpretability techniques were It can be observed that CIU and SHAP exhibited faithfulness scores
used to gain insight into the black box models used. In particular, this above 50 in nine experiments, LORE in five, and LIME in three. Mean
step compares interpretability and accuracy in two aspects: 1) from a while, MAPLE did not appear in the plot. This confirms the findings of
global view (demonstrated in Fig. 9), where the global surrogate’s fi step 3 where LIME and MAPLE came in the last SK cluster in Fig. 8 ac
delity (from Table 10) is compared to the black box accuracy (from cording to faithfulness. Fig. 11 also reveals that SVM appeared ten times,
Tables 6–8), and 2) from a local view (demonstrated in Figs. 10 and 11) RF and XGB six times each, while MLP and DT appeared twice.
where that black box accuracy is compared to the average faithfulness of In general, CIU and SHAP provide more faithful and meaningful
the local explanations on the test records reported in Table 14. insights into the model’s behaviour and capture important features that
Before applying interpretability techniques, MLP, SVM, RF, XGB, and contribute to its performance. This increases our confidence in the
NB have no built-in interpretability since they give very little to no in interpretability technique thus in the black box model itself. CIU and
formation about the reasons behind their predictions. All the con SHAP high performances are further supported by their appearance in
structed models in this experiment were able to defy the trade-off the top cluster identified by SK in Fig. 8. Meanwhile, execution time
because their interpretability according to the global surrogates was comes in favour of CIU when compared to SHAP.
almost as good as their performance.
It is important to highlight that the fidelity should not necessarily be 7. Threats to validity
maximized and should only be sufficiently high. Other aspects should
also be considered; for instance, the high fidelity of a global surrogate To guarantee the validity of this study, its limitations must be
would still be problematic if the underlying DT of the global surrogate is highlighted.
too deep and difficult to comprehend. In the model construction (step 1), feature standardisation could
Generally, according to Fig. 9, the global surrogates deliver inter have been considered such that the data became centred around zero
pretable models that can be used to understand black box models. and had a standard deviation of one such that the model performed
Therefore, black box models evaluations can primarily rely on their better. Since this study focused on interpretability, we did not give much
performances once their interpretability is “sufficient”. attention to additional preprocessing tasks to determine the contribu
On the other hand, Fig. 10 illustrates the relationship between the tion of each original feature. Additionally, we did not explore and
black box models accuracy and the faithfulness average of every inter compare optimisation methods and we simply chose to use PSO given its
pretability technique across the different models. There is a notable reported performance in literature (Idri et al., 2020; Saha et al., 2022).
variation among points representing the same model which made it hard Nevertheless, the preprocessing tasks as well as other different optimi
to compare the strength of the faithfulness per model/technique. sation techniques such Bayesian-based ones, have the potential to
Faithfulness values higher than 50% indicate a positive correlation improve the model performance and potentially its interpretability;
between the importance assigned by the interpretability technique to therefore, this can be used and discussed in future work.
the features. Therefore, in Fig. 11, we focused on these values where the Experiments were performed on six datasets with different numbers
faithfulness is higher than 50%, which suggests that the interpretability of instances and features. Nevertheless, they all belong to the field of
technique is correctly identifying important features. The higher the medicine. This ensured specificity but did not validate the evaluated
faithfulness value is, the stronger the correlation between the impor techniques across different domains. Another limitation of this study is
tance assigned by the interpretability technique and the effect of those that all datasets were tabular. Therefore, the empirical evaluation did
attributes on the model’s performance. not compare interpretability techniques for different data types, such as

15
H. Hakkoum et al. Engineering Applications of Artificial Intelligence 131 (2024) 107829

Fig. 10. Trade-off between accuracy and faithfulness of the local interpretability techniques across datasets.

categorical attributes or images which could alter our findings. In performance, although the SK test deemed this difference insignificant.
addition, all experiments had to solve a binary classification task Furthermore, our results indicated, for the BC dataset, that SHAP
because interpretability can be difficult in multiclass classification results were more in line with ALE plots, unlike the global surrogate.
(Zhang et al., 2019). Finally, our choice to abstain from delving into Meanwhile, the NB global surrogate surpassed that of to the other global
deep learning networks, especially within the context of binary classi surrogate DTs as well as the white boxDT, although the SK test deemed
fication tasks involving tabular data, stemmed from the non-image na this superiority insignificant based on fidelity.
ture of the dataset; Our decision was underpinned by a strong emphasis On the local scope, CIU and SHAP performed better than other local
on model transparency and interpretability, with a focus on MLPs thanks techniques according to the SK test based on faithfulness, although CIU
to their widespread popularity and suitability as an initial step towards had a faster execution time. On the other hand, the best monotonicity
network interpretability (Hakkoum et al., 2021a). values were given by LIME and SHAP for the DT and RF classifiers.
Overall, interpretability techniques helped to achieve a level of inter
8. Conclusion and future work pretability for black box models, thereby overcoming the trade-off and
making them useful in critical domains that require explanations for
In this study, we performed an empirical evaluation of seven inter decision-making. However, to gain trust and be effectively utilized for
pretability techniques, including three global and five local techniques decision-making, quantitative assessments of these explanations are
(with SHAP used in both). The primary focus was to evaluate these essential.
techniques for MLP, SVM, RF, XGB, and NB black box models on six For future work, we intend to evaluate interpretability techniques
medical numeric datasets. Our quantitative evaluations showed that RF across different domains and with various data types and ML tasks for
and SVM generally outperformed the other models in terms of further analysis and comparisons. Moreover, we highlight the

16
H. Hakkoum et al. Engineering Applications of Artificial Intelligence 131 (2024) 107829

Fig. 11. Accuracy vs. average of faithfulness scores higher than 50.

importance of collaborations with physicians in order to validate ML References

interpretability techniques, as well as the value of exploring reinforce
ment and unsupervised learning in the context of interpretability. Adhikari, A., Tax, D.M.J., Satta, R., Faeth, M., 2019. LEAFAGE: example-based and
feature importance-based explanations for black-box ML models. Fuzzy Syst. Conf.
https://doi.org/10.1109/FUZZ-IEEE.2019.8858846.
9. Declaration of generative AI Alvarez-Melis, D., Jaakkola, T., 2018. Towards robust interpretability with self-
explaining neural networks. Adv. Neural Inf. Process. Syst. 31.
Anjomshoae, S., Kampik, T., Främling, K., 2020. Py-CIU: a Python library for explaining
During the preparation of this work, we often used ChatGPT to machine learning predictions using contextual importance and utility. In: IJCAI-
improve language and readability. We reviewed and edited the gener PRICAI 2020 Workshop on Explainable Artificial Intelligence (XAI).
ated content as needed and we take full responsibility for the content of Apley, D.W., Zhu, J., 2016. Visualizing the effects of predictor variables in black box
supervised learning models. J. R. Stat. Soc. Ser. B Stat. Methodol. 82, 1059–1086.
the publication. https://doi.org/10.1111/rssb.12377.
Barredo Arrieta, A., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A.,
CRediT authorship contribution statement Garcia, S., Gil-Lopez, S., Molina, D., Benjamins, R., Chatila, R., Herrera, F., 2020.
Explainable explainable artificial intelligence (XAI): concepts, taxonomies,
opportunities and challenges toward responsible AI. Inf. Fusion 58, 82–115. https://
Hajar Hakkoum: Conceptualization, Methodology, Software, Vali doi.org/10.1016/j.inffus.2019.12.012.
dation, Formal analysis, Investigation, Resources, Visualization, Writing Bergstra, J., Bengio, Y., 2012. Random search for hyper-parameter optimization.
– original draft, Writing – review & editing. Ali Idri: Conceptualization, J. Mach. Learn. Res. 13, 281–305. https://doi.org/10.5555/2188385.2188395.
Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32. https://doi.org/10.1023/A:
Methodology, Validation, Formal analysis, Supervision, Writing – re 1010933404324/METRICS.
view & editing. Ibtissam Abnane: Conceptualization, Methodology, Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P., 2002. SMOTE: synthetic
Writing – review & editing. minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357. https://doi.org/
10.1613/JAIR.953.
Chen, T., Guestrin, C., 2016. XGBoost: a scalable tree boosting system. Proc. 22nd ACM
Declaration of competing interest SIGKDD Int. Conf. Knowl. Discov. Data Min. https://doi.org/10.1145/2939672.
Claesen, M., Simm, J., Popovic, D., Moreau, Y., De Moor, B., 2014. Easy Hyperparameter
Search Using Optunity.
The authors declare that they have no known competing financial De Laet, T., Huysmans, L., 2021. Do student advisors prefer explanations using local
interests or personal relationships that could have appeared to influence linear approximations (LIME) or rules (LORE) in the prediction of student success?.
the work reported in this paper. In: Companion Proceedings of the 11th International Learning Analytics and
Knowledge Conference (LAK’21). Society for Learning Analytics Research (SOLAR,
pp. 91–93.
Data availability Dua, D., Graff, C., 2017. UCI Machine Learning Repository.
El Shawi, R., Sherif, Y., Al-Mallah, M., Sakr, S., 2019. Interpretability in HealthCare A
Comparative Study of Local Machine Learning Interpretability Techniques. 2019
Data available online
IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS),
pp. 275–280. Cordoba, Spain doi:10.1109/CBMS.2019.00065.
Acknowledgments Florez-Lopez, R., Ramon-Jeronimo, J.M., 2015. Enhancing accuracy and interpretability
of ensemble strategies in credit risk assessment. A correlated-adjusted decision forest
proposal. Expert Syst. Appl. 42, 5737–5753. https://doi.org/10.1016/j.
This work was conducted under the research project “Machine eswa.2015.02.042.
Learning based Breast Cancer Diagnosis and Treatment”, 2020–2023. Främling, Kary, 1996. Explaining Results of Neural Networks by Contextual Importance
The authors would like to thank the Moroccan Ministry of Higher Edu and Utility. In: Robert Andrews, Diederich, Joachim (Eds.), Rules and networks:
Proceedings of the Rule Extraction from Trained Artificial Neural Networks
cation and Scientific Research, Digital Development Agency (ADD), Workshop, AISB’96 conference, 1-2 April 1996. Brighton, UK.
CNRST, and UM6P for their support. Gall, R., 2018. Machine Learning Explainability vs Interpretability: Two Concepts that
Could Help Restore Trust in AI.
Gardner, M.W., Dorling, S.R., 1998. Artificial neural networks (the multilayer
perceptron) - a review of applications in the atmospheric sciences. Atmos. Environ.
32, 2627–2636. https://doi.org/10.1016/S1352-2310(97)00447-0.

17
H. Hakkoum et al. Engineering Applications of Artificial Intelligence 131 (2024) 107829

Guidotti, R., Monreale, A., Ruggieri, S., Pedreschi, D., Turini, F., Giannotti, F., 2018. International Conference on Intelligent Systems: Theories and Applications, SITA’20.
Local Rule-Based Explanations of Black Box Decision Systems. ArXiv. Association for Computing Machinery, New York, NY, USA. https://doi.org/
Hakkoum, H., Abnane, I., Idri, A., 2021a. Interpretability in the medical field: a 10.1145/3419604.3419776.
systematic mapping and review study. Appl. Soft Comput., 108391 https://doi.org/ Nicholson Price, W., 2018. Big data and black-box medical algorithms. Sci. Transl. Med.
10.1016/J.ASOC.2021.108391. 10 https://doi.org/10.1126/SCITRANSLMED.AAO5333.
Hakkoum, H., Idri, A., Abnane, I., 2021b. Assessing and comparing interpretability Nizar Abdulaziz Mahyoub, Ahmed, Alpkoçak, Adil, 2022. A quantitative evaluation of
techniques for artificial neural networks breast cancer classification. Comput. explainable AI methods using the depth of decision tree. Turk. J. Elec. Eng. Comput.
Methods Biomech. Biomed. Eng. Imaging Vis. 9 https://doi.org/10.1080/ Sci. 30 (6), 4. https://doi.org/10.55730/1300-0632.3924.
21681163.2021.1901784. Pereira, S., Meier, R., McKinley, R., Wiest, R., Alves, V., Silva, C.A., Reyes, M., 2018.
Huysmans, J., Setiono, R., Baesens, B., Vanthienen, J., 2008. Minerva: sequential Enhancing interpretability of automatically extracted machine learning features:
covering for rule extraction. IEEE Trans. Syst. Man, Cybern. Part B 38, 299–309. application to a RBM-Random Forest system on brain lesion segmentation. Med.
https://doi.org/10.1109/TSMCB.2007.912079. Image Anal. 44, 228–244. https://doi.org/10.1016/j.media.2017.12.009. Epub
Idri, A., Bouchra, E.O., Hosni, M., Abnane, I., 2020. Assessing the impact of parameters 2017 Dec 20. PMID: 29289703.
tuning in ensemble based breast Cancer classification. Health Technol. 10, Plumb, G., Al-Shedivat, M., Cabrera, Á.A., Perer, A., Xing, E., Talwalkar, A., 2020.
1239–1255. https://doi.org/10.1007/S12553-020-00453-2. Regularizing black-box models for improved interpretability. Adv. Neural Inf.
Idri, A., Khoshgoftaar, T.M., Abran, A., 2002. Can neural networks be easily interpreted Process. Syst. 33, 10526–10536.
in software cost estimation?. In: 2002 IEEE World Congress on Computational Plumb, G., Molitor, D., Talwalkar, A., 2018. Model agnostic supervised local
Intelligence. 2002 IEEE International Conference on Fuzzy Systems. FUZZ-IEEE’02. explanations. Adv. Neural Inf. Process. Syst. 2018-December 2515–2524.
Proceedings (Cat. No.02CH37291). IEEE, pp. 1162–1167. https://doi.org/10.1109/ Quinlan, J.R., 1986. Induction of decision trees. Mach. Learn. 11 (1), 81–106. https://
FUZZ.2002.1006668. doi.org/10.1007/BF00116251.
Jelihovschi, E., Faria, J.C., Allaman, I.B., 2014. ScottKnott: a package for performing the Reis-Filho, J.S., Albergaria, A., Milanezi, F., Amendoeira, I., Schmitt, F.C., 2002. Naked
Scott-Knott clustering algorithm in R. Trends Comput. Appl. Math. 15, 3–17. https:// nuclei revisited: p63 immunoexpression. Diagn. Cytopathol. 27, 135–138. https://
doi.org/10.5540/TEMA.2014.015.01.0003. doi.org/10.1002/DC.10164.
Johansson, U., Niklasson, L., 2009. Evolving decision trees using oracle guides. In: 2009 Ribeiro, M.T., Singh, S., Guestrin, C., 2016. “Why should i trust you?” Explaining the
IEEE Symp. Comput. Intell. Data Mining, CIDM 2009 - Proc, pp. 238–244. https:// predictions of any classifier. In: Proceedings of the ACM SIGKDD International
doi.org/10.1109/CIDM.2009.4938655. Conference on Knowledge Discovery and Data Mining. Association for Computing
Kennedy, J., Eberhart, R., 1995. Particle swarm optimization. Proc. ICNN’95 - Int. Conf. Machinery, pp. 1135–1144. https://doi.org/10.1145/2939672.2939778.
Neural Networks 4, 1942–1948. https://doi.org/10.1109/ICNN.1995.488968. Risse, M., 2005. Why the count de Borda cannot beat the Marquis de Condorcet. Soc.
Knapič, S., Malhi, A., Saluja, R., Främling, K., 2021. Explainable artificial intelligence for Choice Welfare 25, 95–113. https://doi.org/10.1007/s00355-005-0045-3.
human decision support system in the medical domain. Mach. Learn. Knowl. Extr. 3, Saha, S., Saha, A., Roy, B., Sarkar, R., Bhardwaj, D., Kundu, B., 2022. Integrating the
740–770. https://doi.org/10.3390/make3030037. Particle Swarm Optimization (PSO) with machine learning methods for improving
Lakkaraju, H., Arsov, N., Bastani, O., 2020. Robust and Stable Black Box Explanations. the accuracy of the landslide susceptibility model. Earth Sci. Informatics 15,
ArXiv. 2637–2662. https://doi.org/10.1007/S12145-022-00878-5/METRICS.
Lakkaraju, H., Bach, S.H., Leskovec, J., 2016. Interpretable decision sets: a joint Shapley, L.S., 1952. A Value for N-Person Games. CA RAND Corp, St. Monica.
framework for description and prediction. Proc. ACM SIGKDD Int. Conf. Knowl. Shinde, P.P., Shah, S., 2018. A Review of Machine Learning and Deep Learning
Discov. Data Min. https://doi.org/10.1145/2939672.2939874, 13-17-August-2016, Applications. 2018 Fourth International Conference on Computing Communication
1675–1684. Control and Automation (ICCUBEA), pp. 1–6.. Pune, India doi: 10.1109/
Lakkaraju, H., Caruana, R., Kamar, E., Leskovec, J., 2019. Faithful and customizable ICCUBEA.2018.8697857.
explanations of black box models. In: AIES 2019 - Proc. 2019 AAAI/ACM Conf. AI, Silva, W., Fernandes, K., Cardoso, M.J., Cardoso, J.S., 2018. Towards Complementary
Ethics, Soc, pp. 131–138. https://doi.org/10.1145/3306618.3314229. Explanations Using Deep Neural Networks. In: Stoyanov, D., et al. (Eds.),
Lundberg, S.M., Lee, S.-I., 2017. A unified approach to interpreting model predictions. In: Understanding and Interpreting Machine Learning in Medical Image Computing
NIPS’17: Proceedings of the 31st International Conference on Neural Information Applications. MLCN DLF IMIMIC, Lecture Notes in Computer Science, 11038.
Processing Systems. Long Beach. California, USA. Curran Associates Inc., Red Hook, Springer, Cham. https://doi.org/10.1007/978-3-030-02628-8_15.
NY, USA, pp. 4768–4777. Tam, Adrian, 2021. WebPage: A Gentle Introduction to Particle Swarm Optimization -.
Luss, R., Chen, P.Y., Dhurandhar, A., Sattigeri, P., Zhang, Y., Shanmugam, K., Tu, C.C., MachineLearningMastery.com.
2019. Leveraging latent features for local explanations. Proc. ACM SIGKDD Int. Conf. Vellido, A., 2019. Societal issues concerning the application of artificial intelligence in
Knowl. Discov. Data Min. 1139–1149. https://doi.org/10.1145/3447548.3467265. medicine keywords artificial intelligence in medicine ⋅ machine learning ⋅ social
Miller, T., 2019. Explanation in artificial intelligence: insights from the social sciences. impact. Rev. Artic. Kidney Dis 5, 11–17. https://doi.org/10.1159/000492428.
Artif. Intell. 267, 1–38. Wunsch, D.C., Saad, E.W., 2007. Neural network explanation using inversion. Neural
Molnar, C., 2021. Interpretable Machine Learning : a Guide for Making Black Box Models Network. 20, 78–93. https://doi.org/10.1016/j.neunet.2006.07.005.
Interpretable. Zhang, X., Lou, Y., Tan, S., Chajewska, U., Koch, P., Caruana, R., 2019. Axiomatic
Molnar, C., Casalicchio, G., Bischl, B., 2020. Quantifying model complexity via interpretability for multiclass additive models. Proc. ACM SIGKDD Int. Conf. Knowl.
functional decomposition for better post-hoc interpretability. Commun. Comput. Inf. Discov. Data Min. 226–234. https://doi.org/10.1145/3292500.3330898.
Sci. 1167, 193–204. https://doi.org/10.1007/978-3-030-43823-4_17/COVER. Zhou, Z., Jiang, Y., 2004. NeC4.5: neural ensemble based C4.5. IEEE Trans. Knowl. Data
Nassih, R., Berrado, A., 2020. State of the art of fairness, interpretability and Eng. 16, 770–773. https://doi.org/10.1109/TKDE.2004.11.
explainability in machine learning: case of PRIM. In: Proceedings of the 13th

Explainable Artificial Intelligence (XAI) : Concepts, Taxonomies, Opportunities and Challenges Toward Responsible AI
No ratings yet
Explainable Artificial Intelligence (XAI) : Concepts, Taxonomies, Opportunities and Challenges Toward Responsible AI
72 pages
Machine Learning Interpretability
No ratings yet
Machine Learning Interpretability
10 pages
A Survey On Explainable Artificial Intelligence XAI Toward Medical XAI
No ratings yet
A Survey On Explainable Artificial Intelligence XAI Toward Medical XAI
21 pages
21 SS133
No ratings yet
21 SS133
85 pages
Open Problems in Mechanistic Interpretability
No ratings yet
Open Problems in Mechanistic Interpretability
82 pages
Different XAI Techniques
No ratings yet
Different XAI Techniques
52 pages
Glocalx - From Local To Global Explanations of Black Box Ai Models
No ratings yet
Glocalx - From Local To Global Explanations of Black Box Ai Models
27 pages
Interpretable Machine Learning
No ratings yet
Interpretable Machine Learning
80 pages
Chapter 5
No ratings yet
Chapter 5
29 pages
LOYOLA-GONZALES - Black-Box vs. White-Box. Understanding Their Advantages and Weaknesses From A Practical Point of View
No ratings yet
LOYOLA-GONZALES - Black-Box vs. White-Box. Understanding Their Advantages and Weaknesses From A Practical Point of View
18 pages
Module1 Lecture 1
No ratings yet
Module1 Lecture 1
39 pages
Interpretable Machine Learning Models For Healthcare Diagnostics: Addressing The Black-Box Problem
No ratings yet
Interpretable Machine Learning Models For Healthcare Diagnostics: Addressing The Black-Box Problem
27 pages
Interpretable Machine Learning - Fundamental Principles and 10 Grand Challenges
No ratings yet
Interpretable Machine Learning - Fundamental Principles and 10 Grand Challenges
74 pages
An Introduction To Machine Learning Interpretability 2e
100% (1)
An Introduction To Machine Learning Interpretability 2e
62 pages
Mechanistic Interpretability For AI Safety A Review: Leonard Bereska Efstratios Gavves
No ratings yet
Mechanistic Interpretability For AI Safety A Review: Leonard Bereska Efstratios Gavves
41 pages
A Comprehensive Guide To Explainable Ai: From Classical Models To Llms
No ratings yet
A Comprehensive Guide To Explainable Ai: From Classical Models To Llms
255 pages
Interpretability of Machine Learning: Recent Advances and Future Prospects
No ratings yet
Interpretability of Machine Learning: Recent Advances and Future Prospects
12 pages
Interpret Ability
No ratings yet
Interpret Ability
65 pages
Interpretability and Explainability A Ma
No ratings yet
Interpretability and Explainability A Ma
24 pages
Christoph Molnar - Interpretable Machine Learning-Lulu - Com (2020)
No ratings yet
Christoph Molnar - Interpretable Machine Learning-Lulu - Com (2020)
255 pages
XAI Basics
No ratings yet
XAI Basics
34 pages
Does Dataset Complexity Matters For Model Explainers?: 1 Jos e Ribeiro 2 Ra Issa Silva 3 Lucas Cardoso
No ratings yet
Does Dataset Complexity Matters For Model Explainers?: 1 Jos e Ribeiro 2 Ra Issa Silva 3 Lucas Cardoso
9 pages
Opening The Black Box of Large Language Models - Two Views On Holistic Interpretability
No ratings yet
Opening The Black Box of Large Language Models - Two Views On Holistic Interpretability
11 pages
Rudin - 2019 - Stop Explaining Black Box Machine Learning Models For High Stakes Decisions and
No ratings yet
Rudin - 2019 - Stop Explaining Black Box Machine Learning Models For High Stakes Decisions and
10 pages
Causal Interpretability For Machine Learning
No ratings yet
Causal Interpretability For Machine Learning
16 pages
Discover Artificial Intelligence
No ratings yet
Discover Artificial Intelligence
21 pages
The Cricket Winner Prediction With Applications of ML and Data Analytics
No ratings yet
The Cricket Winner Prediction With Applications of ML and Data Analytics
18 pages
Algorithms For Interpretable Machine Learning
No ratings yet
Algorithms For Interpretable Machine Learning
125 pages
Entropy 23 00018 v2 36
No ratings yet
Entropy 23 00018 v2 36
1 page
Explainable AI
No ratings yet
Explainable AI
41 pages
(BABIC) - Beware Explanations From AI in Health Care 2021
No ratings yet
(BABIC) - Beware Explanations From AI in Health Care 2021
4 pages
Module1 Lecture 2
No ratings yet
Module1 Lecture 2
19 pages
Explainable Artificial Intelligence: A Comprehensive Review: Dang Minh H. Xiang Wang Y. Fen Li Tan N. Nguyen
No ratings yet
Explainable Artificial Intelligence: A Comprehensive Review: Dang Minh H. Xiang Wang Y. Fen Li Tan N. Nguyen
66 pages
Machine Learnig
No ratings yet
Machine Learnig
93 pages
TDSC Choo 221
No ratings yet
TDSC Choo 221
12 pages
EthicalAI TT2
No ratings yet
EthicalAI TT2
14 pages
A Systematic Literature Review of Explainable Arti
No ratings yet
A Systematic Literature Review of Explainable Arti
30 pages
WIREs Data Min Knowl - 2021 - Angelov - Explainable Artificial Intelligence An Analytical Review
No ratings yet
WIREs Data Min Knowl - 2021 - Angelov - Explainable Artificial Intelligence An Analytical Review
13 pages
Entropy 23 00018 v2 4
No ratings yet
Entropy 23 00018 v2 4
1 page
Counterfactuals and Causability in Explainable Artificial Intelligence Theory, Algorithms, and Applications
No ratings yet
Counterfactuals and Causability in Explainable Artificial Intelligence Theory, Algorithms, and Applications
59 pages
Review IML 2020
No ratings yet
Review IML 2020
17 pages
A Survey On Explainable Artificial Intelligence
No ratings yet
A Survey On Explainable Artificial Intelligence
22 pages
Interpretable Machine Learning
No ratings yet
Interpretable Machine Learning
10 pages
Interpreting Black Box Models: A Review On Explainable Artificial Intelligence
No ratings yet
Interpreting Black Box Models: A Review On Explainable Artificial Intelligence
30 pages
ML Mcqs
No ratings yet
ML Mcqs
34 pages
WIREs Data Min Knowl - 2021 - Angelov - Explainable Artificial Intelligence An Analytical Review
No ratings yet
WIREs Data Min Knowl - 2021 - Angelov - Explainable Artificial Intelligence An Analytical Review
13 pages
Entropy 23 00018 v2 1
No ratings yet
Entropy 23 00018 v2 1
1 page
An Introduction To Machine Learning Interpretability Second Edition PDF
No ratings yet
An Introduction To Machine Learning Interpretability Second Edition PDF
62 pages
Entropy 23 00018 v2 2
No ratings yet
Entropy 23 00018 v2 2
1 page
Explaining Explanations - An Overview of Interpretability of Machine Learning
No ratings yet
Explaining Explanations - An Overview of Interpretability of Machine Learning
10 pages
A Comparative Study and Systematic Analysis of XAI Models and Their Applications in Healthcare
No ratings yet
A Comparative Study and Systematic Analysis of XAI Models and Their Applications in Healthcare
26 pages
Applying Genetic Programming To Improve Interpretability in Machine Learning Models
No ratings yet
Applying Genetic Programming To Improve Interpretability in Machine Learning Models
8 pages
XAI P T A Brief Review of Explainable
No ratings yet
XAI P T A Brief Review of Explainable
9 pages
Data Science Answers
No ratings yet
Data Science Answers
2 pages
Entropy: Explainable AI: A Review of Machine Learning Interpretability Methods
No ratings yet
Entropy: Explainable AI: A Review of Machine Learning Interpretability Methods
45 pages
Module 1 Xai
No ratings yet
Module 1 Xai
10 pages
Explainable AI
No ratings yet
Explainable AI
4 pages
Automated Plant Disease Analysis (APDA) Performance Comparison of Machine
No ratings yet
Automated Plant Disease Analysis (APDA) Performance Comparison of Machine
6 pages
Shap Lime
No ratings yet
Shap Lime
6 pages
Explainable Artificial Intelligence Approaches
No ratings yet
Explainable Artificial Intelligence Approaches
14 pages
Overview ML Interpretability
No ratings yet
Overview ML Interpretability
10 pages
A Comparative Study On Predicting The Probability of Liver Disease IJERTV8IS100314 PDF
No ratings yet
A Comparative Study On Predicting The Probability of Liver Disease IJERTV8IS100314 PDF
5 pages
d2k Tutorial
No ratings yet
d2k Tutorial
78 pages
Explainable Artificial Intelligence and Machine Learning: A Reality Rooted Perspective
No ratings yet
Explainable Artificial Intelligence and Machine Learning: A Reality Rooted Perspective
8 pages
Interpretable Machine Learning - Definitions, Methods, and Applications PDF
No ratings yet
Interpretable Machine Learning - Definitions, Methods, and Applications PDF
11 pages
Final Report of Mini Project
No ratings yet
Final Report of Mini Project
52 pages
Data Mining Classification: Alternative Techniques
No ratings yet
Data Mining Classification: Alternative Techniques
15 pages
Meeting 6 CE609-supervised-learning
No ratings yet
Meeting 6 CE609-supervised-learning
166 pages
Afreen Taha CE1 Finalizedsuman... Design.... D
No ratings yet
Afreen Taha CE1 Finalizedsuman... Design.... D
9 pages
ECE 368 Course Review: Probabilistic Reasoning 2023
No ratings yet
ECE 368 Course Review: Probabilistic Reasoning 2023
138 pages
Final Report
No ratings yet
Final Report
38 pages
Fake News Detection Using Machine Learni
100% (1)
Fake News Detection Using Machine Learni
13 pages
Exercises695Clas Solution
100% (2)
Exercises695Clas Solution
13 pages
L10-Naive Bayes Continuous
No ratings yet
L10-Naive Bayes Continuous
16 pages
An Efficient Machine Learning Approach For Diagnosing Parkinson's Disease by Utilizing Voice Features
No ratings yet
An Efficient Machine Learning Approach For Diagnosing Parkinson's Disease by Utilizing Voice Features
20 pages
Chronic Disease Prediction Using Machine Learning
No ratings yet
Chronic Disease Prediction Using Machine Learning
7 pages
ML Unit-3
No ratings yet
ML Unit-3
15 pages
Chapter2 (Classification)
No ratings yet
Chapter2 (Classification)
17 pages
A Novel Transfer Learning Based Approach For Plant Species
No ratings yet
A Novel Transfer Learning Based Approach For Plant Species
14 pages
Automatic Grading of Answer Sheets Using Machine L
No ratings yet
Automatic Grading of Answer Sheets Using Machine L
10 pages
Naive Bayes Classifier: K M M I I M
No ratings yet
Naive Bayes Classifier: K M M I I M
16 pages
Unit-Iv Data Classification: Data Warehousing and Data Mining
No ratings yet
Unit-Iv Data Classification: Data Warehousing and Data Mining
7 pages
Analysis of Machine Learning Algorithms On Cancer Dataset
No ratings yet
Analysis of Machine Learning Algorithms On Cancer Dataset
10 pages
Supervised Learning - A Systematic Literature Review
No ratings yet
Supervised Learning - A Systematic Literature Review
22 pages
Towards Automatically Extracting UML Class Diagram
No ratings yet
Towards Automatically Extracting UML Class Diagram
8 pages
ATM Service Analysis Using Predictive Data Mining
No ratings yet
ATM Service Analysis Using Predictive Data Mining
5 pages
Theory Assignment Information Retrieval (CSE 4053) : Institute of Technical Education and Research
No ratings yet
Theory Assignment Information Retrieval (CSE 4053) : Institute of Technical Education and Research
9 pages
Prediction of HIV Status in Addis Ababa Using Data Mining Technology
No ratings yet
Prediction of HIV Status in Addis Ababa Using Data Mining Technology
7 pages
Expert Systems With Applications: Aytu G Onan, Serdar Koruko Glu, Hasan Bulut
No ratings yet
Expert Systems With Applications: Aytu G Onan, Serdar Koruko Glu, Hasan Bulut
3 pages
Big-O Notation Demystified: Definitive Reference for Developers and Engineers
From Everand
Big-O Notation Demystified: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

Engineering Applications of Artificial Intelligence: Hajar Hakkoum, Ali Idri, Ibtissam Abnane

Uploaded by

Engineering Applications of Artificial Intelligence: Hajar Hakkoum, Ali Idri, Ibtissam Abnane

Uploaded by

Engineering Applications of Artificial Intelligence 131 (2024) 107829

Contents lists available at ScienceDirect

Engineering Applications of Artificial Intelligence

Global and local interpretability techniques of supervised machine learning

1. Introduction (Miller, 2019). Interpretability is highly dependent on the target audi

Table 3 original dataset) to 44 (SPECTF dataset) for attributes. Such differences

• The fidelity metric is a measure of how closely a surrogate model

4.2.3. Validations and statistical testing 5.1. Step 1: model construction

Fig. 2. Global surrogate process.

MLP Neurons 137 389 334 101 340 189

Model RF XGB RF XGB RF XGB RF XGB RF XGB RF XGB

Fig. 3. Accuracy based SK results of the black box models.

Accuracy MLP 93.4 92.1 62.3 65.4 61.7 84.6

white box classifier reported in Table 12 in terms of accuracy with the

Fig. 8. SK results of local techniques based on faithfulness average across experiments.

Fig. 9. Accuracy vs. global surrogate fidelity for each experiment.

importance of collaborations with physicians in order to validate ML References

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Engineering Applications of Artificial Intelligence: Hajar Hakkoum, Ali Idri, Ibtissam Abnane

Uploaded by

Engineering Applications of Artificial Intelligence: Hajar Hakkoum, Ali Idri, Ibtissam Abnane

Uploaded by

Engineering Applications of Artificial Intelligence 131 (2024) 107829

Contents lists available at ScienceDirect

Engineering Applications of Artificial Intelligence

Global and local interpretability techniques of supervised machine learning

1. Introduction (Miller, 2019). Interpretability is highly dependent on the target audi­

Table 3 original dataset) to 44 (SPECTF dataset) for attributes. Such differences

• The fidelity metric is a measure of how closely a surrogate model

4.2.3. Validations and statistical testing 5.1. Step 1: model construction

Fig. 2. Global surrogate process.

MLP Neurons 137 389 334 101 340 189

Model RF XGB RF XGB RF XGB RF XGB RF XGB RF XGB

Fig. 3. Accuracy based SK results of the black box models.

Accuracy MLP 93.4 92.1 62.3 65.4 61.7 84.6

white box classifier reported in Table 12 in terms of accuracy with the

Fig. 8. SK results of local techniques based on faithfulness average across experiments.

Fig. 9. Accuracy vs. global surrogate fidelity for each experiment.

importance of collaborations with physicians in order to validate ML References

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

1. Introduction (Miller, 2019). Interpretability is highly dependent on the target audi