0% found this document useful (0 votes)
79 views16 pages

Causal Interpretability For Machine Learning

The document discusses causal interpretability for machine learning models. It provides an overview of problems with non-interpretable models and methods for generating causal explanations. The survey covers traditional interpretability approaches as well as recent work on causal models that can explain decisions under different scenarios.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
79 views16 pages

Causal Interpretability For Machine Learning

The document discusses causal interpretability for machine learning models. It provides an overview of problems with non-interpretable models and methods for generating causal explanations. The survey covers traditional interpretability approaches as well as recent work on causal models that can explain decisions under different scenarios.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 16

Causal Interpretability for Machine Learning

- Problems, Methods and Evaluation

Raha Moraffah∗ , Mansooreh Karami∗ , Ruocheng Guo∗ , Adrienne Raglin† , Huan Liu∗

Computer Science & Engineering, Arizona State University, Tempe, AZ, USA

Army Research Lab, USA

{rmoraffa, mkarami, rguo12, huanliu}@asu.edu, † adrienne.raglin2.civ@mail.mil
arXiv:2003.03934v3 [cs.LG] 19 Mar 2020

ABSTRACT planation” [32] and AI call for diversity and inclusion [9],
interpretable models which are capable of explaining the
Machine learning models have had discernible achievements
decisions they made are necessary. Moreover, recent re-
in a myriad of applications. However, most of these models
search shows that machine learning models, especially deep
are black-boxes, and it is obscure how the decisions are made
neural networks, can be easily fooled into predicting a spe-
by them. This makes the models unreliable and untrustwor-
cific class label for an image when its pixel values are un-
thy. To provide insights into the decision making processes
der minimal perturbations [30; 74; 80]. Such results imply
of these models, a variety of traditional interpretable models
that machine learning models suffer from the risk of making
have been proposed. Moreover, to generate more human-
unexpected decisions. Understanding decisions of machine
friendly explanations, recent work on interpretability tries
learning models and the process leading to decision making
to answer questions related to causality such as “Why does
can help us understand the rules the models use to make
this model makes such decisions?” or “Was it a specific
their decisions and therefore, prevent potential unexpected
feature that caused the decision made by the model?”. In
situations from happening. More specifically, through in-
this work, models that aim to answer causal questions are
terpretable machine learning models, we aim to guarantee
referred to as causal interpretable models. The existing sur-
that (a) decisions made by machine learning models comply
veys have covered concepts and methodologies of traditional
with the rules toward social good; (b) the classifier does not
interpretability. In this work, we present a comprehensive
pick up the biases in the data and the decisions made are
survey on causal interpretable models from the aspects of
compatible with human understandings.
the problems and methods. In addition, this survey pro-
vides in-depth insights into the existing evaluation metrics Previously, various frameworks have been proposed to gen-
for measuring interpretability, which can help practitioners erate explanations for machine learning algorithms. These
understand for what scenarios each evaluation metric is suit- algorithms can be mainly divided into two categories, (1)
able. algorithms that are inherently interpretable, which includes
the models that generate explanations at training time [106];
(2) post-hoc interpretations that refer to the model that gen-
Keywords erate explanations for already made decisions [75; 85; 47].
Interpratablity, explainability, causal inference, counterfac- Henceforth, these models are referred to as traditional in-
tuals, machine learning terpretable models.
In this work, we focus on causal interpretable models that
can explain their decisions through what decisions would
1. INTRODUCTION have been made if they had been under alternative situa-
With the surge of machine learning in critical areas such tions (e.g., being trained with different inputs, model com-
as healthcare, law-making and autonomous cars, decisions ponents or hyperparameters). Note that traditional inter-
that had been previously made by humans are now made au- pretable models are unable to answer such questions about
tomatically using these algorithms. In order to ensure the decision making under alternative situations, although they
reliability of such decisions, humans need to understand how can explain how and why a decision is made by an existing
these decisions are made. However, machine learning mod- model on an observed instance. For instance, in the case of
els are usually inherently black-boxes and do not provide credit applications, to impose fairness on the decision mak-
explanations for how and why they make such decisions. ing process, we may need to answer questions such as Did
This has become especially problematic when recent work the protected features (e.g., race and gender etc.) cause the
shows that the decisions made by machine learning mod- system to reject the application of the i-th applicant?” and
els are sometimes biased and enforce inequality [69]. For “If the i-th applicant had different protected features, would
instance, Angwin et al. [4] demonstrates that predictions the system still make the same decision?” In other words,
made by Correctional Offender Management Profiling for in order to make the explanations more understandable and
Alternative Sanctions (COMPAS), which is a widely used useful for humans, we need to ask questions such as “Why
criminal risk assessment tool, shows racial biases. With re- did the classifier make this decision instead of another?”,
cent regulations such as European Unions “Right to Ex- “What would have happened to this decision of a classifier
had we had a different input to it?”, or “Was it feature
X that caused decision Y ?”. Traditional interpretability
frameworks which only consider correlations are not capa- tently predict the model’s decisions. Doshi-Velez et al. [17]
ble of generating such explanations. This is due to the fact define interpretability as the ability to explain in intelligi-
that these frameworks cannot estimate how altering a fea- ble ways to a human. Gilpin et al. [27] take a step further
ture or a component of a model would change the predictions and define interpretability as a part of explainability. They
made by the rest of the model or the predicted labels on state that explainable models are those that summarize the
the data samples. Therefore, in order to answer such ques- reasons for neural network behaviors, gain the trust of the
tions about both data samples and models, counterfactual users, or generate insights into the causes of their decisions
analysis needs to be leveraged. Counterfactual analysis is a while interpretable models may not be able to describe the
concept from the causal inference literature [25]. In coun- operation of a system in an accurate way 1 . Pearl [84] claims
terfactual analysis, we aim to infer the output of a model that tasks such as explainability require a causal model of
in imaginary scenarios that we have not observed or cannot the environment and cannot be handled at the level of as-
observe. Recently, counterfactual analysis and causal infer- sociation.
ence have gained a lot of attention from the interpretable
machine learning field. Research in this area has mainly fo- 2.1 Interpetability in Machine Learning
cused on generating counterfactual explanations from both Interpretable machine learning has been widely explored and
the data perspective [34; 76] as well as the components of a discussed in previous literature. However, to the best of our
model [77; 38]. knowledge, there is no comprehensive review on causal in-
Existing surveys on interpretable machine learning focus terpretability models. For instance, Lipton [59] discusses
on the traditional methods and do not discuss the exist- the motivation behind creating interpretable models and
ing methods from a causal perspective. In this survey, we categorizes interpretable models into two main categories:
present commonly used definitions for interpretability, dis- transparent models and post-hocs. Doshi-velez et al. [17]
cuss interpretable models from a causal perspective and pro- provide a definition of model interpretability and evalua-
vide guidelines for evaluating these methods. More specifi- tion criteria. However, this review only proposes definitions
cally, in Section 2, we first provide different definitions for in- and evaluations that are used for traditional interpretabil-
terpretability. We then briefly introduce the existing meth- ity of models and does not cover causal and counterfactual
ods on traditional interpretablity and present different types questions. Gilpin et al. [27], explain fundamental concepts
of interpretable models in this category (Section 2.2). Sec- of explainability and use them to classify the literature on
tion 3 discusses concepts from causal inference, which are interpretable models. Zhang and Zhu [111] review the exist-
used in this survey. In section 4, we provide an overview of ing interpretable models proposed for deep models used in
existing works on causal interpretability. We also compare visual domains. Du et al. [18] provide a comprehensive sur-
the proposed models for both traditional and causal models vey of existing interpretable methods and discuss issues that
from different perspectives to provide insights on advantages should be considered in future work. It is worth mention-
and disadvantages of each type of interpretability. Section ing that none of the existing work discussed interpretable
5 provides detailed guidelines on the experimental settings models from a causal perspective. In this work, we first
such as commonly used datasets and evaluation metrics for introduce the state-of-the-art research in traditional inter-
both traditional and causal approaches. We then discuss pretability (Sec. 2.2) and then give a detailed survey on
evaluation metrics specifically used for causal methods in causal interpretable models (Sec. 4). Figure 1 shows an
more detail and provide different scenarios for which these overview of intepretable models and their classification.
metrics can be used. Since the evaluation of causal inter-
pretable models is a challenging task, these guidelines can 2.2 Traditional Interpretablity
be helpful for future research in this area and can be used Before proceeding with the detailed review of the method-
to evaluate approaches with similar characteristics. In addi- ologies in causal interpretable models, we provide an overview
tion, they can also be used to create new evaluation metrics of existing state-of-the-art methods in traditional machine
for the approaches with different functionalities. learning. We categorize traditional models into two main
categories:

Interpretability algorithms • Inherently interpretable models: Models that generate


explanations in the process of decision making or while
being trained.
Traditional inter- Causal interpretabil-
pretability (Section 2.2) ity (Section 4) • Post-hoc interpretability: Generating explanations for
an already existing model using an auxiliary model.
Example-based interpretablity also falls into this cat-
Figure 1: Main categories for Interpretable frameworks egory. In example-based interpretablity, we are look-
ing for examples from the dataset which explain the
model’s behavior the best.
2. AN OVERVIEW OF INTERPRETABIL-
ITY 2.2.1 Interpretable Models
In this section, we present an overview of existing defini- A machine learning model can be designed to include expla-
tions for interpretability. Miller et al. [70] suggest that nations embedded as part of their architecture or output in-
interpretability is the degree to which a human can under- terpretable decisions as part of their training process. Most
stand the cause of a decision. Kim et al. [48] propose that 1
In this survey we use the words interpretable and explain-
interpretability is the degree to which a human can consis- able interchangeably.
of these models are created in application of the deep neural bedding [99] and machine translation [98; 6]. These models
network. In this section, we present common interpretable are widely known not only for their improved performance
models in the literature. over previous methods but also for their capability to show
Decision Trees. These methods make use of a tree-structured which input features or learned representations are more im-
framework in which each internal node checks whether a portant for making a specific prediction. Yang et al. [106]
condition on a feature is satisfied or not while the leaf nodes use a hierarchical attention network in document classifica-
show the final predictions (class labels). A decision infers the tion to capture the informative words as well as the sentences
label of an instance by starting from the root and tracing a that have a significant role in the decision. This is because
path till a leaf node is reached, which can be interpreted as the same word or sentence may be differentially important
an if..then.. rule. An example is illustrated in Figure 2. in different contexts. Attention networks also proved to be a
useful tool in visual question answering applications, which
Att1 require a joint image-text understanding to answer a ques-
tion about the image [103; 64; 102; 65]. Yang et al. [105]
Yes No propose a Stacked Attention Network (SAN) that uses two
attention layers to infer the answer progressively. While the
+1 Att2 first attention layer focuses on all referred concepts in the
question, the higher-level layer provides a sharper attention
Yes No distribution to highlight regions that are more relevant to
the answer.
Att3 -1 Disentangled Representation Learning. One goal of
No
representation learning is to break down the features into
Yes
the independent latent variables that are highly correlated
+1 with meaningful patterns [29]. In traditional machine learn-
-1
ing, approaches such as PCA [44], ICA [42] and spectrum
analysis [100] are proposed to discover disentangled compo-
Figure 2: An example of a decision tree with positive and nents of data. Recently, deep latent-variable models such as
negative class (binary) and three attributes. The red path VAE [50], InfoGAN [13] and β-VAE [40] were developed to
has a decision rule, if ¬Att1 ∧ Att2 ∧ ¬Att3 ⇒ +1 learn disentangled latent variables through variational infer-
ence. For example, in empirical studies, it is shown that β-
Rule-Based Models. Rule-based classifiers also create ex- VAE and InfoGAN can learn interpretable factorized latent
planations that are interpretable for humans. These classi- variables of human face images such as azimuth, hairstyle
fiers use a collection of if..then.. rules to infer the class and emotion [40].
labels. In a sense, rule-based classifiers are the text rep-
resentation of the decision trees. However, there are some 2.2.2 Post-hoc Interpretability
key differences. Rule-based models can have rules that are Post-hoc interpretable methods aim to explain the decision-
not mutually exclusive (i.e., two or more rules might trigger making process of the black-box models after they are trained.
by the same record), not exhausted (i.e., a record may not These methods map an abstract concept used in a trained
trigger any rules) and ordered (i.e., the rule set is ordered machine learning model into a domain that is understand-
based on their priority) [95]. able by humans such as a block of pixels or a sequence of
Linear Regression. Another common method known to words. Following are the widely known post-hoc methods.
be interpretable is Linear Regression. Linear Regression Local Explanations. Local Interpretable Model-Agnostic
models the linear relation between a dependent variable and Explanations (LIME) [89] is a representative and pioneer
a set of explanatory variables (features). The weight of each framework that generates local explanations of black-box
feature represents the mean change in the prediction given models. LIME approximates the prediction of any black-
a one unit increase of the feature. Accordingly, it is rea- box via local surrogate interpretable models. LIME selects
sonable to think that the features with larger weights has an instance to explain by perturbing it around its neigh-
more effect on the final result. However, different types of borhood (i.e., eliminating patches of pixels or zeroing out
variables (e.g., categorical data vs numerical features) have the values of some features). These samples are then fed to
different scales. This makes it difficult to interpret the effect the complex model for labeling and then it will be weighted
of each feature. Fortunately, there are several methods that based on their proximity to the original data. Finally, LIME
can be used to find the importance of a feature in a linear learns an interpretable model on the weighted perturbed
regression such as t-statistics and chi-square score [57]. data and their associated labels to create the explanations.
The aforementioned methods are restricted by users’ limita- It is worth noting that LIME is a fast approximation of a
tions (i.e., human understanding). With the increase in the broader approach named SHAP [66] that measures feature
number of features, these models become more and more importance.
complex; for example, decision trees become much deeper, Saliency Maps. Originally introduced by Simonyan et al.
and the number of the rules increase in the rule sets. This [93] as “image-specific class saliency maps”, saliency maps
makes comprehending the prediction of these models diffi- highlight pixels of a given input image that are mostly in-
cult for humans [89]. Below, we discuss recent inherently volved in deciding a particular class label for the image.
interpretable models which are designed for more sophisti- To extract those pixels, the derivative of the weight vector
cated scenarios. is found by a single backpropagation pass (deconvolution).
Attention Networks. Attention networks have been suc- The magnitude of the derivative shows the importance of
cessful in various highly-impactful tasks such as graph em- each pixel for the class score. Similar concepts were used
by other researchers to deconvolve the prediction and show trary layer of deep neural network. Let θ̂ be the learned
the locations of the input image that strongly impacts the fixed parameters after training and hij (θ̂, x) be the activa-
activation of the neurons [108; 94; 91]. While these methods tion of neuron i in layer j, the learned image for that neu-
belong to a popular class of tools for interpretability, Ade- ron can be calculated by solving the following optimization
bayo et al. [2] and Ghorbani et al. [26] suggest that relying problem,
on visual assessment is not adequate and can be misleading.
Example-Based Explanations. As proved in education x∗ = arg max hij (θ̂, x), subject to ||x||2 = 1 (2)
x
[87] and psychology domains [1], learning from experiences
and examples are promising tools to explain complex con- Despite this method being used as a tool in providing ex-
cepts. In these methods, a certain example is selected from planations for higher-layer features [55; 79; 75], it has been
the dataset to represent the model’s prediction (e.g., k- reported that due to the complexity of the input distribu-
nearest neighbor) or the distribution of the data. It is worth tion, some returned images might contain optical illusions
mentioning that example-based explanations should not be [78; 20].
confused with those explanations that perturb features in Explaining by Base Interpretable Models. In sec-
the dataset [85]. Although using prototypes as the represen- tion 2.2.1 we discussed base models such as decision tree,
tation of data has shown to be effective in the human learn- rule-based and linear regression, that are known to be in-
ing process [1], Kim et al. [47] use a method called Maximum terepretable. Following, we will introduce some works that
Mean Discrepancy (MMD) to capture a more complex dis- utilize these algorithms to explain a more sophisticated frame-
tribution of the data. This method uses some instances as work. Craven and Shavlik [16] are one of the first to use tree-
criticisms to explain which prototypes are not captured by structured representations to approximate neural networks.
the model to improve the interpretability of the black-boxes. Since their model is independent of the network architec-
Gurumoorthy et al. [37] extend this method and designed a ture and training algorithm, it can be generalized to a wide
fast prototype selection algorithm called ProtoDash to not variety of models. Their method, TREPAN, is similar to
only select the prototypes and criticism instances, but also CART and C4.5 and uses a gain ratio criterion to evaluate
output non-negative weights indicating their importance. the potential splits, but expands the tree based on a node
Influence Functions. To track the impact of a training that increases the fidelity of the extracted tree to the net-
sample on the prediction of a machine learning model, one work. Inspired by TREPAN, Boz [10] propose a method
can simply modify an example or delete it (leave-one-out), called DECTEXT to extract a decision tree that mimics
retrain the model, and observe the effect. However, this ap- the behavior of a trained Neural Network. In their method,
proach can be extremely expensive. To alleviate the issue, they propose a new splitting technique, a new discretiza-
influence functions, a classic method from the robust statis- tion method, and a novel pruning procedure. With these
tics literature, can be used. Koh and Liang [52] proposed modifications, the proposed method can handle continuous
a second-order optimization technique to approximate these features, optimize fidelity and minimize the tree size. A
influence functions. They verified their technique with dif- technique called distillation [41] can also be used to fully
ferent assumptions on the empirical risk ranging from being understand why a specific answer is returned for a partic-
strictly convex and twice-differentiable to non-convex and ular example. Frosst and Hinton [24] answer this question
non-differentiable losses. by creating a model in the form of soft decision tree and ex-
Suppose ŷ(xt , θ̂) is the model’s prediction for the sample xt amine all the learned filters from the root of the tree to the
with an optimal parameter θ̂. Lets ŷ(xt , θ̂−z ) be the predic- classification’s leaf node. Zhang et al. [110] adopt the same
tion on the sample xt when the training sample z was re- concept but explained the network knowledge at a human-
moved while the model’s optimal parameter is θ̂−z . The in- interpretable semantic level and also showed how much each
fluence function tries to approximate the difference between filter contributes to the prediction.
the two predictions, ŷ(xt , θ̂) − ŷ(xt , θ̂−z ), without retraining The MofN algorithm [96] is one of the well-known methods
the model with the following equation, that is used to extracts symbolic rules from trained neu-
1 ral networks. This method clusters the links based on the
ŷ(xt , θ̂) − ŷ(xt , θ̂−z ) = − ∇θ ŷ(xt , θ̂)T Hθ̂−1 ∇θ L(z, θ̂) (1) weights and eliminates those groups that unlikely to have
n
any impact on the consequent. It then forms rules that are
where L(z, θ̂) is the loss function and Hθ̂ = n1 ∇2θ L(zi , θ̂) is the sum of the weighted antecedents with regard to the bias.
the Hessian matrix. Authors also report experiments on the fidelity of the model
The same authors [51] also investigate the effect of remov- and the comprehensibility of the set rules and the individual
ing large groups of training points in large datasets on the rules.
accuracy of influence functions. They find out that the ap- Lou et al. [62] use a generalized version of linear regres-
proximation computed by the influence functions are corre- sion calledP generalized additive models (GAM) in the form
lated with the actual effect. Inspired by this work, Cheng of g(y) = fi (xi ) = f1 (x1 ) + ... + fn (xn ) to interpret the
et al. [14] propose an explanation method, Fast Influence contribution of each predictor for different classifiers or re-
Analysis, that employs influence functions on Latent Factor gression models. g(.) is a link function that controls whether
Models to resolve the lack of interpretability of the collabo- we want to describe the model as an additive model (re-
rative filtering approaches for recommender systems. gression by setting g(y) = y) or generalized additive model
Feature Visualization. Another way of describing what (classification by setting it to a logistic function). f (.) is
the model has learned is feature visualization. Most meth- a shape function that quantifies the impact of each indi-
ods in this category deal with image inputs. Erhan et al. [20] vidual feature. This gives the ability to interpret spline
present an optimization technique called activation maxi- models and tree-based shape functions such as single trees,
mization to visualize what a neuron computes in an arbi- bagged trees, boosted trees and boosted-bagged trees. Due
to the model not considering the interactions between the the way to achieve the highest level of interpretability. Be-
features, there is a significant gap in terms of accuracy be- low are those levels of interpretability and their definitions:
tween these models and complex models. To fill this gap, the
same authors propose a method named Generalized Additive • Statistical (associational) interpretability: Aims to un-
Models plusPInteractions (GA2 Ms) in the form of g(y) = cover statistical associations by asking questions such
P as “How would seeing x change my belief in y?”
fi (xi ) + fij (xi , xj ) which takes into account the two-
dimensional interactions that still can be interpretable as
• Causal interventional interpretability: Is designed to
heat maps [63]. Two case studies are conducted on real
answer “What if” questions.
healthcare problems on predicting pneumonia risks by us-
ing GA2 Ms. These studies uncover new patterns that are • Counterfactual interpretability: Is the highest level of
ignored by state-of-the-art complex models while still hitting interpretability, which aims to answer “Why” ques-
their accuracy [11]. tions.

3. CAUSAL INFERENCE Traditional interpretability mainly focuses on the statisti-


cal interpretability, whereas causal interpretability aims to
In this section, we briefly review the concepts from causal answer questions associated with the causal interventional
inference used in this paper for causal interpretable models. interpretability and counterfactual interpretability. In the
In their paper, Guo et al. [36] provide a comprehensive following, we provide an extensive review of existing work
review of existing causal inference methods and definitions. on causal interpretability. We classify the existing works in
Definition 1 (Structural Causal Models ). A 4- this field into four main categories:
tuple variable M (X, U, f, Pu ) where X is a finite set of en- 1. Causal interpretablity for model-based interpretations:
dogenous variables, usually the observable variables, U de- In this category, methods explain the causal effect of
notes a finite set of exogenous variables which usually ac- a model component on the final decision.
count for unobserved or noise variables, f is a set of function
{f1 , f2 , ..., fn } where each function represents a causal mech- 2. Counterfactual explanation generators: Methods in this
anism such that ∀xi ∈ X, xi = fi (P a(xi ), ui ) and P a(xi ) is category aim to generate counterfactual explanations
a subset of (X \ {xi }) ∪ U and Pu is a probability distribu- for alternate situations and scenarios.
tion over U is called An Structural Causal Model (SCM) or
Structural Equation Model (SEM)[82]. 3. Causal interpretability and fairness: Lipton [59] ex-
plains that interpretable models are often indispens-
Definition 2 (Causal Bayesian Network). To rep- able to guarantee fairness. Motivated by this, we pro-
resent an SCM M (X, U, f, Pu ), a directed graphical model vide an overview of the state-of-the-art methods on
G(V, E) is used. V is the set of endogenous variables X and causal fairness.
E denotes the causal mechanisms. This indicates for each
causal mechanism xi = fi (P a(xi ), ui ), there exists a directed 4. Causal interpretability and its role in verifying the
edge from each node in the parent set P a(xi ) to xi . The en- causal relationships discovered from data: In this cate-
tire graph representing this SCM is called a Causal Bayesian gory, we review methods which leverage interpretabil-
Network (CBN). ity as a tool to verify causal assumptions and rela-
tionships. We also discuss the scenarios, where causal
Definition 3 (Average Causal Effect). The Aver- inference can be used to guarantee the interpretability
age Causal Effect (ACE) of a binary random variable x (treat- of a machine learning model.
ment) on another random variable (outcome) is defined as:
In the following, we discuss each category in detail.
ACE = E[y|do(x = 1)] − E[y|do(x = 0)], (3)
4.1 Causal Inference and Model-based Inter-
Where do(.) operator denotes the corresponding interven- pretation
tional distribution defined by the SCM or CBN.
Recently, causality has gained increasing attention in ex-
plaining machine learning models [12; 38]. These approaches
4. CAUSAL INTERPRETABLITY are usually designed to explain the role and importance of
In this section, we discuss the state-of-the-art frameworks each component of a machine learning model on its deci-
on causal interpretability. These frameworks are particu- sions with concepts from the causality. For instance, one
larly needed since objective functions of machine learning way to explain the role of a neuron on the decision of a neu-
models only capture correlations and not real causes. There- ral network is to estimate the ACE of the neuron on the
fore, these models might cause problems in real-world deci- output [12; 81]. Traditional interpretable models cannot
sion making, such as making policies related to smoking and answer vital questions for understanding machine learning
cancer. Moreover, training data used to train these models models. For instance, traditional machine interpretability
might not perfectly represent the environment; and the train frameworks are not capable to answer causal questions such
and the test sets might also have different distributions. A as “What is the impact of the n-th filter of the m-th layer
causal interpretable model can help us understand the real of a deep neural network on the predictions of the model?”
causes of decisions made by machine learning algorithms, which are helpful and required for understanding a neural
improve their performance, and prevent them from failing network model. Furthermore, despite being simple and in-
in unexpected circumstances. tuitive, performing ablation testing (i.e., removing a com-
Pearl [83] introduces different levels of said interpretability ponent of the model and retraining it to measure the per-
and argues that generating counterfactual explanations is formance for a fixed dataset) is computationally expensive
and impractical. To address these problems, causal inter- Parafita and Vitria [81] introduce a causal attribution frame-
pretability frameworks have been proposed. These frame- work to explain decisions of a classifier based on the latent
works are mainly designed to explain the importance of each factors. The framework consists of three steps, (a) con-
component of a deep neural network on its predictions by structing Distributional Causal Graph which allows us to
answering counterfactual questions such as “What would sample and compute likelihoods of the samples; (b) gener-
have happened to the output of the model had we had a dif- ating a counterfactual image which is as similar as possible
ferent component in the model?”. These types of questions to the original image; and (c) estimating the effect of the
are answered by borrowing some concepts from the causal modified factor by estimating the causal effect.
inference literature. The main idea is to model the struc- Causal interpretation has also gained a lot of attention in
ture of the DNN as a SCM and estimate the causal effect of Generative Adversarial Networks (GANs) interpretability.
each component of the model on the output by performing Bau et al. [7] propose a causal framework to understand
causal reasoning. Narendra et al. [77] consider the DNN as ”How” and ”Why” images are generated by Deep Convo-
an SCM, apply a function on each filter of the model to ob- lutional GANs (DCGANs). This is achieved by a two-step
tain the targeted value such as variance or expected value of framework which finds units, objects or scenes that cause
each filter and reason on the obtained SCM. Harradon et al. specific classes in the data samples. In the first step, dis-
[38] further suggest that in order to have an effective inter- section is performed, where classes with explicit represen-
pretability, having a human-understandable causal model of tations in the units are obtained by measuring the spatial
DNN, which allows different kinds of causal interventions, agreement between individual units of the region we are ex-
is necessary. Based on this hypothesis, the authors pro- amining and classes using a dictionary of object classes. In
pose an interpretability framework, which extracts human- the second step, intervention is performed to estimate the
understandable concepts such as eyes and ears of a cat from causal effect of a set of units on the class. This framework
deep neural networks, learns the causal structure between is then used to find the units with the highest causal effect
the input, output and these concepts in an SCM and per- on the class. Following equation shows the objective of this
forms causal reasoning on it to gain more insights into the framework,
model. Chattopadhyay et al. [12] propose an attribution
method based on the first principle of causality, particularly α∗ = arg min(−δα→c + λ||α||2 ), (5)
α
SCMs and do(·) calculus. More concretely, similar to other
proposed methods in this category, the proposed framework where α indicates the units that have causal effect on the
models the structure of the machine learning algorithm as an outcome, δα→c measures the causal effect of units on the
SCM. It then proposes a scalable causal inference approach class by intervening on α and set it to the constant c and
to the estimate individual treatment effect of a desired com- λ||α||2 is a regularization term. Besserve et al. [8] propose
ponent on the decision made by the algorithm. to better understand the internal functionality of generative
Chattopadhyay et al. suggest to simplify the SCM defined models such as GANs or Variational Autoencoders (VAE)
on a multi-layer network M ([l1 , l2 , l3 ...., ln ], U, f, PU ) to an- and answer questions like ”For a face generator, is there an
other network as SCM M ′ ([l1 , ln ], U, f ′ , PU ) where l1 and ln internal encoding of the eyes, independent of the remain-
represent neurons in the input and output layers, li repre- ing facial features?”, by manipulating the internal variables
sents neurons in the i-th layer of the network, U denotes using counterfactual inference.
the set of unknown variables, f and f ′ correspond to the Madumal et al. [68] leverage causal inference to explain
SCM functions and PU defines distributions of the unknown the behavior of reinforcement learning agents by learning
variables. They then propose to calculate the ACE of any an SCM during reinforcement learning and generate coun-
neurons of the model on the output by performing causal terfactual examples using the learned SCM.
reasoning on M as follows,
y
ACEdo(x = E[y|do(xi = α)] − baselinexi , (4)
4.2 Causal Inference and Example-based In-
i =α) terpretation
where xi is i-th neuron of the network, y is the output of the As mentioned in Section 2.2, in example based explanations,
model and α is an arbitrary value the neuron is set to. They we are looking for data instances that are capable of explain-
also propose to calculate the baselinexi as Exi [Ey [y|do(xi = ing the model or the underlying distribution of the data. In
α)]] this subsection, we explain counterfactual explanations, a
In another research direction, Zhao and Hastie [112] state type of example-based explanations, which are one of the
that to extract the causal interpretations from black-box widely used explanations for interpreting a model’s deci-
models, one needs a model with good predictive perfor- sions. Counterfactual explanations aim to answer “Why”
mance, domain knowledge in the form of a causal graph, questions such as “Why the model’s decision is Y?” or “Was
and an appropriate visualization tool. They further explore it input X that caused the model to predict Y?”. Generally
partial dependence plot (PDP) [23] and Individual Condi- speaking, counterfactuals are designed to answer hypothet-
tional Expectation (ICE) [28] to extract causal interpreta- ical questions such as “What would have happened to Y,
tions from black-box models. Alvarez-Melis and Jaakkola had I not done X?”. They are designed based on a new
[3] generated causal explanations for structured input struc- type of conditional probability P (yx |x′ , y ′ ). This probabil-
tured output black-box models by (a) generating perturbed ity indicates how likely the outcome (label) of an observed
samples using a variational auroencoder; (b) generating a instance, i.e., y ′ , would change to yx if x′ is set to x. These
weighted bipartite graph G = (Vx ∪ Vy , E), where Vx and kinds of questions can be answered using SCMs [25].
Vy are elements in x and y and Eij represents the causal Counterfactual explanations are defined as examples that
influence of xi and yj ; and (c) generating explanation com- are obtained by performing minimal changes in the origi-
ponents using graph partitioning algorithms. nal instance’s features and have a predefined output. For
example, what minimal changes can be made in a credit permutation matrix used to align spatial cells of f (I ′ ) with
card applicant’s features such that their application gets ac- f (I), f (I) and f (I ′ ) correspond to spatial feature maps of
cepted. These explanations are human friendly because they I and I ′ , respectively. Function g(.) represents the classifier
are usually focused on a few number of features and there- and P is a set of all hw × hw permutation matrices. Goyal
fore are more understandable. However, they suffer from the et al. [33] propose to explain classifiers’ decisions by mea-
Roshomon effect [71] which means there could be multiple suring the Causal Concept Effect (CACE). CACE is defined
true versions of explanations for a predefined outcome. To as the causal effect of a concept (such as the brightness or
alleviate this problem, we could report all possible explana- an object in the image) on the prediction. In order to gen-
tions, or find a way to evaluate all explanations and report erate counterfactuals, authors leverage a VAE-based archi-
the best one. Recently, several works have been proposed to tecture. Hendricks et al. [45] propose a method to generate
generate counterfactual explanations. In order to generate counterfactual explanations using multimodal information
counterfacutal examples, Wachter et al. [101] propose to for video classification tasks. The proposed method in this
minimize the mean squared error between the model’s pre- work generates visual-linguistic explanations in two steps.
dictions and counterfactual outcomes as well as the distance First, it trains a classification model for which we would like
between the original instances and their corresponding coun- to generate explanations. Then, in the second step, it trains
terfactuals in the feature space. Eq. (6) shows the objective a post-hoc explanation model by leveraging the output and
function to achieve this goal, mid-level features of the trained model in first the step. The
explanation model predicts the counterfactuality score for all
arg min max L(x, xcf , y, ycf )
xcf λ the negative classes (classes that the instance does not be-
(6) long to according to the prediction model trained in the first
L(x, xcf , y, ycf ) = λ · (fˆ(xcf ) − ycf )2 + d(x, xcf ), step). The explanation model then generates explanations
where the first term indicates the distance between the model’s by maximizing the counterfactuality score between positive
prediction for the counterfactual input xcf and the desired and negative classes.
counterfactual output, while the second term indicates the Moore et al. [73] propose to leverage adversarial examples
distance between the actual instance features x and the to generate counterfactual explanations. In order to gener-
counterfactual features xcf . ate plausible explanations, the number of changed features
Liu et al. [60] propose a generative model to generate coun- should be small. Moreover, some features such as age can-
terfactual explanations for explaining a model’s decisions not be changed arbitrarily. For example, we cannot ask loan
using Eq.(6). Garth et al. [35] propose a method to gener- applicants to reduce their age. Therefore, to constrain the
ate counterfactual examples in a high dimensional setting. number of changed features and the direction of gradients
The method is proposed for credit application prediction in the generated adversarial examples, authors propose to
via off-the-shelf interchangeable black-box classifiers. In the mask the unwanted features and gradients in a way that
case of high dimensional feature space, the generated expla- only desired features change in the generated explanations.
nation might not be interpretable due to the existence of Kommiya et al. [76] propose to explain the decision of a ma-
too many features. To alleviate the problem, the authors chine learning framework by generating counterfactual ex-
propose to reweigh the distance between the features of an amples which satisfy the following two criteria, (1) generated
instance and its corresponding counterfactual with the in- examples must be feasible given users conditions and context
verse median absolute deviation (Eq.(7)). This metric is such as range for the features or features to be changed; (2)
robust to outliers and results in more sparse, and therefore, counterfactual examples generated for explanations should
more explainable solutions. be as diverse as possible. In order to impose the diversity
criterion, authors propose to either maximize the point-wise
M ADj = mediani∈{1,2,...,n} (|xi,j −medianl∈{1,2,...,n} (xl,j )|) distance between examples in feature-space or leverage the
(7) concept from Determinantal point processes to select a sub-
Goyal et al. [34] propose to generate counterfactual visual set of samples with the diversity constraint.
explanations for a query image I by using a distractor image Van Looveren and Klaise [61] propose to leverage class pro-
I ′ which belongs to the class c′ (a different class from the totypes to generate counterfactual explanations. They also
actual output of the classifier). To generate counterfactual claim that using class prototypes for counterfactual example
explanations, the authors propose to detect spatial regions generation accelerates the process. This work suggests that
in I and I ′ such that replacing those regions in I with regions the generated examples by traditional counterfactual gener-
in I ′ results in system classifying the generated image as c′ . ation frameworks [101; 35] do not satisfy two main criteria:
In order to avoid trivial solutions such as replacing the entire (1) they do not consider the training data manifold which
image I with I ′ , authors propose to minimize the number may result in out-of-distribution examples, and (2) the hy-
of edits to transform I to I ′ . The proposed framework is perparameters in the framework should be carefully tuned
shown in the following equation, in an appropriate range which could be time consuming. To
min ||a||1 solve the mentioned problems, the authors propose to add a
P,a reconstruction loss term (defined as L2 reocnstruction error
s.t. c′ = argmax g((1 − a) ◦ f (I) + a ◦ P f (I ′ )) (8) between counterfactuals and an autoencoder trained on the
training samples) as well as a prototype loss term, which
ai ∈ {0, 1} ∀i and P ∈ P,
is defined as L2 loss between the class prototype and the
where a ∈ Rhw (h and w represent height and width of counterfactual samples, to the original objective function of
an image, respectively) is a binary vector which indicates counterfactual generation (Eq (6)).
whether the feature in I needs to be changed with the fea- Rathi [86] generates counterfactual explanations using shapely
ture in I ′ (value 1) or not (value 0). P ∈ Rhw×hw is a additive explanations (SHAP).
Hendricks et al. [39] defined a method to generate natu- in order to find the efficacy of a drug on patient’s health,
ral language counterfactual explanations. The framework one needs to estimate the causal effect of the drug on pa-
checks for evidences of a counterfactual class in the text ex- tient’s health status. Moreover, in order for the results to be
planation generated for the original input. It then checks if reliable for doctors and experts, an explanation of how the
those factors exist in the counterfactual image and returns decision has been made is necessary. Despite recent achieve-
the existing ones. ments in these two fields separately, not so many works
have been done to cover both requirements simultaneously.
4.3 Causal Inference and Fairness Moreover, the state-of-the-art approaches in each field are
Nowadays, politicians, journalists and researchers are con- incompatible and therefore can not be combined and used
cerned regarding the interpretability of model’s decisions together. Kim and Bastani [49] propose a framework to
and whether they comply with ethical standards [31]. Algo- bridge the gap between causal and interpretable models by
rithmic decision making has been widely utilized to perform transforming any algorithm into an interpretable individual
different tasks such as approving credit lines, filtering job treatment effect estimation framework. To be more specific,
applicants and predicting the risk of recidivism [15]. Predic- this work leverages the algorithm proposed in [92] to learn
tion of recidivism is used to determine whether to detain or an oracle function f which estimates the causal effect of a
free a person and therefore, it needs to be guaranteed that treatment for any observed instance and then learn an in-
it does not discriminate against a group of people. Since terpretable function f ′ to estimate f . They further provide
conventional evaluation metrics such as accuracy does not a bound for the error produced by their framework.
take these into account, it is usually required to come up In another line of research, causal interpretability has been
with interpretable models in order to satisfy fairness crite- used to verify the causal relationships in the data. Caruana
ria. Recently, huge attention has been paid to incorporating et al. [11] perform two case studies to discover the rules
fairness into decision making methods and its connection which show cases where generalized additive models with
with causal inference. Kusner et al. [53] propose a new pairwise interactions (GA2 M s) learn rules based on only
metric for measuring how fair decisions are based on coun- correlations in the data and invade causal rules. They then
terfactuals. According to this paper, a decision is fair for propose to fix the learned rules based on domain experts
an individual if the outcome is the same both in the actual knowledge.
world and a counterfactual world in which the individual Bastani et al. [88] propose a decision tree based explana-
belonged to a different demographic group. Kilbertus et al. tion method to generate global explanations for a black-
[46] address the problem from a data generation perspective box model. Their proposed framework provides powerful
by going beyond observational data. The authors propose to insights into the data such as causal issues confirmed by the
utilize causal reasoning to address the fairness problem by physicians previously.
asking the question “What do we need to assume about the
causal data generating process?” instead of “What should 5. PERFORMANCE EVALUATION
be the fairness criterion?”.
In this section we provide a detailed review of evaluation
Madras et al. [67] propose a causal inference model in which methods and common datasets used to assess the inter-
the sensitive attribute confounds both the treatment and pretability of models for causal interpretablity. Evaluation
the outcome. It then leverages deep learning techniques to of interpretability is a challenging task due to the lack of con-
learn the parameters of the model. Zhang and Bareinboim sensus definition of interpretability and understanding of hu-
[109] propose a metric (i.e., causal explanations) to quanti- mans from the concept. Evaluation of causal interpretability
tatively measure the fairness of an algorithm. This measure is even more challenging due to the lack of groundtruth data
is based on three measures of transmission from cause to for causal explanations and verification of causal relation-
effect namely counterfactual direct (Ctf-DE), indirect (Ctf- ships. Therefore, it is important to have a unified guideline
IE), and spurious (Ctf-SE) effects as defined below. Given on how to evaluate the proposed models. Traditional inter-
an SCM M , the counterfactual indirect effect of intervention pretability of a model is usually measured with quantifiable
X = x1 on Y = y (relative to baseline X = x0 ) conditioned proxies such as if a model is approximated using sparse linear
on X = x with mediator W = Wx1 is defined as, models it can be considered interpretable. To evaluate the
IEx0 ,x1 (y|x) = P (yx0 ,Wx1 |x) − P (yx0 |x) (9) causal interpretability, researchers also came up with some
proxy metrics such as size and diversity of the counterfactual
the counterfactual direct effect of intervention X = x1 on Y explanation. In this section, we discuss all criteria defined
(with baseline x0 ) conditioned on X = x is defined as, for the “goodness” of both causal and traditional interpreta-
tions and proxy metrics to measure how good the proposed
DEx0 ,x1 (y|x) = P (yx1 ,Wx0 |x) − P (yx0 |x) (10)
framework can generate these explanations.
And finally, the spurious effect of event X = x1 on Y = y
(relative to baseline x0 ) is defined as,
5.1 Datasets
In this section, we briefly introduce benchmark datasets
SEx0 ,x1 (y|x) = P (yx0 |x1 ) − P (y|x0 ) (11) commonly used to evaluate interpretable models. Depend-
ing on the the type of the data (i.e., text, image or tabu-
4.4 Causal Inference as Guarantee for Inter- lar) different datasets are used to assess the interpretabil-
pretability ity . Some commonly used datasets for image are “Ima-
Machine learning has had great achievements in medical, le- geNet (ILSVRC)” [90], “MNIST” [56] and “PASCAL VOC
gal and economic decision making. Frameworks for these dataset” [21]. While for text they experimented on “20
applications must satisfy the following two criteria: 1) they Newsgroup Dataset” [54], “Yelp” [107], “IMDB” [43] and
must be causal 2) they must be interpretable. For example, “Amazon” [5] reviews. “UCI repository” [97] consists of
some tabular datasets that were used by the litreture such as Human Subject-Based Evaluation Metrics. Part of
“Spambase”, “Insurance”, “Magic”, “Letter”, and “Adult” the research in interpretability aims to let humans under-
datasets. In order to explain the outcome of the test sample, stand the reasons behind the outcome of a product. Ac-
the explanations are provided by the model. For instance, cordingly, experiments carried out by the researchers usually
in the case of image data, those patches of the images that answer the following questions:
are mostly responsible for the class label were selected. For
the text data, words involved in the final decision are made • By providing two different models, can the explana-
bold with different shades of color, which represent the degree of their involvement. In addition to the mentioned datasets, there are some datasets commonly used to evaluate causal interpretable frameworks. In the following, we list the common datasets used for the evaluation of causal interpretability.

• German loan dataset [19]. This dataset contains 1000 observations of loan applicants, described by numeric, categorical and ordinal attributes.

• LendingClub. This dataset (available at https://www.lendingclub.com/info/download-data.action) contains 5 years of loan records (2007-2011) issued by the LendingClub company. After preprocessing, it contains 8 features, namely, employment years, annual income, number of open credit accounts, credit history, loan grade as decided by LendingClub, home ownership, purpose, and the state of residence in the United States.

• COMPAS. Collected by ProPublica [22] for its analysis of recidivism decisions in the United States, this dataset, after preprocessing, contains 5 features, namely, the bail applicant's age, gender, race, prior count of offenses, and degree of criminal charge.

Unfortunately, the datasets used for this purpose are not specifically designed for causal interpretability and do not contain groundtruth that captures the causal aspect of the model, such as counterfactual explanations or the ACE of different components of the model on the final decision. On the other hand, there are existing benchmark datasets specifically designed for evaluating tasks in causal inference. Cheng et al. [58] provide a comprehensive survey on benchmark datasets for different causal tasks.

5.2 Evaluation Metrics
In order to assess the performance of a causal interpretable framework, authors are required to evaluate the interpretability of the generated explanations from two aspects: (1) the quality of the generated explanations, i.e., are the generated explanations interpretable to humans?; and (2) are the generated explanations causal? In the following two subsections, we provide comprehensive guidelines and metrics on how to answer these questions.

5.2.1 Interpretability Evaluation Metrics
Evaluating the interpretability of a machine learning model is usually a challenging task. Interpretable frameworks often evaluate their methods from two main perspectives: (1) how well the explanations generated by the method match human expectations from different aspects; and (2) how good the generated explanations are without using any human subjects. Thus, we categorize the different assessment methods based on these two perspectives and provide some examples of experiments conducted by researchers.

Human-based Evaluation Metrics. These metrics rely on human subjects to judge the quality of the generated explanations:

• Can the explanations help users choose the better classifier in terms of generalizability? This helps us investigate whether explanations can be used to decide which model is better. Ribeiro et al. [89] used human subjects from Amazon Mechanical Turk (AMT) to choose between two models, one that generalizes better than the other while having lower cross-validation accuracy. With the provided explanations, the subjects were able to choose the more generalizable model 89% of the time.

• With the explanations provided by an interpretable method for a particular sample, can a user correctly predict the outcome of that sample? This is also called "Forward Simulation/Prediction" by Doshi-Velez and Kim [17]. It verifies whether the explanations actually determine the output we are looking for.

• Based on the explanations, do users trust the classifier enough to use it in real-world applications? Selvaraju et al. [91] evaluated trust by asking 54 AMT workers to rate the reliability of the models via a 5-point scale questionnaire. A sample along with its explanations was shown to the subjects for two different models, AlexNet and VGG-16 (VGG-16 is known to be more reliable than AlexNet). Moreover, only those instances for which both models gave the same prediction, aligned with the ground-truth label, were considered. The results of the evaluation show that, with the proposed explanations, the subjects trust the model that generalizes better (VGG-16).

• Do the resulting explanations match human intuition? The model is described to human subjects in detail, and they are asked to provide insights about the outcome of the model (human-produced explanations). The test assumes that the explanations provided by humans should be aligned with the ones the model provides [66]. Moreover, experts in a specific field (e.g., doctors) can also be asked to provide the explanations (e.g., important factors/symptoms) for the task at hand (e.g., recognizing a disease).

• Given two different explanations from different algorithms, which one provides the better-quality explanation? This is also known as the "Binary Forced Choice" evaluation metric [17] and can be used to compare the explanations produced by different interpretable models.
Counterfactual Property | Description of Property | Evaluation Metrics
1. Sparsity/Size | The perturbation which transforms x to x_cf should be small. | Elastic net loss term EN(δ) = β·||δ||_1 + ||δ||_2^2 [61]; counting the number of altered features manually [35].
2. Interpretability | Counterfactual explanations should lie close to the data manifold. | Ratio of the reconstruction errors of a counterfactual generator trained only on the counterfactual class to that of a generator trained on the original class [61]; ratio of the reconstruction errors of a counterfactual generator trained only on the counterfactual class to that of a generator trained on all classes [61].
3. Proximity | Counterfactual explanations should be as similar as possible to the original instance. | Proximity = −(1/k) Σ_{i=1}^{k} dist(x_cf_i, x) [76].
4. Speed | Generating counterfactuals should be fast enough to be deployable in real-world applications. | Measure the time and the number of gradient updates [61].
5. Diversity | Counterfactual explanations generated for a data instance should be different from each other. | Diversity = (1/|C_k|^2) Σ_{i=1}^{k−1} Σ_{j=i+1}^{k} dist(x_cf_i, x_cf_j) [76].
6. Visual-Linguistic Counterfactuals | The visual explanation is the region which retains high positiveness or negativeness (i.e., on the model prediction for specific positive or negative classes), and the linguistic explanation is compatible with its visual counterpart. | Measure, via accuracy, how the output of the target classifier changes towards the negative class when a specific region is removed from the input [45].

Table 1: A summary of evaluation metrics for counterfactual explanations

Non-human Based Evaluation Metrics. Multiple factors such as human fatigue, improper practice sessions and incentive costs can affect experimental results when human-subject evaluation metrics are used. Hence, it is also important to use evaluation metrics that do not rely on human subjects.

• How much of the important features of the data does the proposed interpretable model recover for a certain prediction task? This requires the important features to be known beforehand, so that we can verify that the model picks up the important features of the data. One can simply use any base method introduced in section 2.2.1 as a proxy model to extract the important features. The fraction of these important features recovered by the interpretable method can then be used as an evaluation score [89].

• How locally faithful is the proposed method to the original model (fidelity)? Lack of fidelity results in limited insight into the original model [104]. For convolutional neural networks, one common approach is image occlusion: the pixels that the interpretable method marks as important are masked to see whether this is reflected in the classification score [91; 108] (see the sketch after this list).

• How consistent are the explanations for similar instances with the same class label? The explanations should not differ significantly for samples that share a label and have only slightly different features. Such instability can be the result of high variance as well as of non-deterministic components of the explanation method [72].
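As a concrete illustration of the occlusion-based fidelity check and the feature-recovery score above, here is a minimal sketch. It assumes a generic image classifier exposed as a callable that returns class probabilities and an explanation given as a boolean pixel mask; the function names are ours and are not taken from any of the cited works.

```python
import numpy as np

def occlusion_fidelity(model, image, importance_mask, target_class, fill_value=0.0):
    """Mask the pixels an explanation marks as important and measure how much the
    score of the target class drops; a faithful explanation causes a large drop.

    model           -- callable mapping a batch of images (N, H, W, C) to class probabilities
    image           -- array of shape (H, W, C)
    importance_mask -- boolean array of shape (H, W), True for pixels deemed important
    """
    original_score = model(image[None])[0, target_class]

    occluded = image.copy()
    occluded[importance_mask] = fill_value        # remove the "important" evidence
    occluded_score = model(occluded[None])[0, target_class]

    return float(original_score - occluded_score)

def feature_recovery_fraction(recovered_features, ground_truth_features):
    """Fraction of the known-important features that the interpretable method recovered."""
    truth = set(ground_truth_features)
    return len(set(recovered_features) & truth) / max(len(truth), 1)
```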

5.2.2 Causal Evaluation Metrics
Due to the lack of groundtruth for causal explanations, in order to verify the causal aspect of a proposed framework we need to quantify the desired characteristics of the model and measure their "goodness" via some predefined proxy metrics. In the following, we go over the existing metrics used to evaluate proposed causal interpretable frameworks for the different categories of causal interpretability.

Counterfactual Explanations Evaluation Metrics. Existing approaches for causal interpretability are mostly based on generating counterfactual explanations. For such approaches, causal interpretability is often measured through the goodness of the generated counterfactual explanations. As mentioned in section 4, a counterfactual explanation is the highest level of explanation and therefore, we can claim that if an explanation is a counterfactual explanation and is generated by considering causal relationships, it is indeed explainable. However, due to the lack of groundtruth for counterfactuals, we are unable to verify whether the generated explanations are based on causal relationships. Therefore, to measure the "goodness" of counterfactual explanations, we suggest conducting experiments that (1) measure the interpretability of the explanations using the metrics designed for interpretability; and (2) evaluate the counterfactuals themselves by measuring their different characteristics. An interpretable counterfactual explanation should have the following characteristics:

• The model prediction on the counterfactual sample (x_cf) needs to be close to the predefined output for the counterfactual explanation.

• The perturbation δ changing the original instance x into x_cf = x + δ should be sparse. In other words, the size of the counterfactual (i.e., the number of changed features) should be small.

• A counterfactual explanation x_cf is considered interpretable if it lies close to the model's training data distribution.
• The counterfactual instance x_cf needs to be found fast enough to ensure it can be used in a real-life setting.

• Counterfactual explanations generated for a data instance should be different from each other. In other words, counterfactual explanations should be diverse.

• Visual-linguistic counterfactual explanations must satisfy two criteria: (1) the visual explanation is the region which keeps high positiveness/negativeness on the model prediction for the specific positive/negative classes; (2) the linguistic explanation should be compatible with the visual counterpart in the generated visual explanations.
Below, we briefly discuss the evaluation metrics designed to assess the aforementioned characteristics of a counterfactual explanation.

To evaluate the sparsity of the generated counterfactual examples, Mc Grath et al. [35] measure the size of a generated example by counting the number of altered features it consists of. Van Looveren and Klaise [61] use the elastic net loss term EN(δ) = β·||δ||_1 + ||δ||_2^2, where δ is the distance between the original instance and its generated counterfactual example and β is a hyperparameter.

In order for counterfactual explanations to be interpretable, they need to be close to the data manifold. Van Looveren and Klaise refine this criterion by suggesting that counterfactuals are interpretable if they are close to the data manifold of the counterfactual class [61]. To measure the interpretability defined above, they propose to measure the ratio of the reconstruction errors when the model used for generating counterfactuals is trained only on the counterfactual class versus when it is trained on the original class [61]. The proposed metric is shown in the following equation:

IM1(AE_i, AE_t0, x_cf) = ||x_0 + δ − AE_i(x_0 + δ)||_2^2 / (||x_0 + δ − AE_t0(x_0 + δ)||_2^2 + ε)    (12)
where AE_i and AE_t0 represent the autoencoders used to generate the counterfactuals, trained on class i (the counterfactual class) and class t0 (the original class), respectively. We let x_cf and x_0 be the counterfactual explanation and the original sample, and δ denotes the distance between the original and counterfactual samples. A lower value of IM1 shows that the counterfactual example can be reconstructed better by the autoencoder trained on the counterfactual class than by the autoencoder trained on the original class. This implies that the generated counterfactuals are closer to the data manifold of the counterfactual class.
Another metric proposed by [61] measures how similar the generated counterfactuals are when reconstructed by the autoencoder trained only on the counterfactual class versus the autoencoder trained on all classes. The metric is shown in the following equation:

IM2(AE_i, AE, x_cf) = ||AE_i(x_0 + δ) − AE(x_0 + δ)||_2^2 / (||x_0 + δ||_1 + ε)    (13)

where AE denotes the autoencoder trained on all classes. A lower value of IM2 shows that the counterfactual is reconstructed similarly by the autoencoder trained on all classes and by the autoencoder trained only on the counterfactual class. This implies that the generated counterfactual distribution is as good as the distribution over all classes.
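To make these two reconstruction-error metrics concrete, the following minimal sketch computes IM1 (Eq. (12)) and IM2 (Eq. (13)) for a single counterfactual x_cf = x_0 + δ. The autoencoder objects and their reconstruct method are placeholders for models a practitioner has trained; this illustrates the formulas rather than reproducing the implementation of [61].

```python
import numpy as np

def im1(ae_cf_class, ae_orig_class, x_cf, eps=1e-8):
    """IM1 (Eq. 12): reconstruction error of x_cf under the autoencoder trained on
    the counterfactual class, relative to the error under the autoencoder trained
    on the original class. Lower means closer to the counterfactual manifold."""
    num = np.sum((x_cf - ae_cf_class.reconstruct(x_cf)) ** 2)
    den = np.sum((x_cf - ae_orig_class.reconstruct(x_cf)) ** 2) + eps
    return float(num / den)

def im2(ae_cf_class, ae_all_classes, x_cf, eps=1e-8):
    """IM2 (Eq. 13): squared difference between the reconstructions of x_cf under the
    class-specific autoencoder and the autoencoder trained on all classes, normalized
    by the L1 norm of x_cf. Lower is better."""
    diff = ae_cf_class.reconstruct(x_cf) - ae_all_classes.reconstruct(x_cf)
    return float(np.sum(diff ** 2) / (np.sum(np.abs(x_cf)) + eps))
```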
Generated counterfactual explanations can also be used to measure users' understanding of a machine learning model's local decision boundary. Mothilal et al. [76] propose to mimic users' understanding of a model's local decision boundaries by (a) constructing an auxiliary classifier on both the original inputs and the counterfactual examples, and (b) measuring how well it mimics the actual decision boundaries. More specifically, they train a 1-nearest neighbor (1-NN) classifier on both the original and the counterfactual samples to predict the class of new inputs. The accuracy of this auxiliary model is then compared with the accuracy of the original model.

The definition of counterfactual explanations implies that generated explanations should be as similar as possible to the original instance. In order to evaluate the proximity between original samples and counterfactual explanations, Mothilal et al. [76] define proximity as in Eq. (14):

Proximity = −(1/k) Σ_{i=1}^{k} dist(x_cf_i, x)    (14)

In order to calculate the proximity for both categorical and continuous features, the authors further propose two feature-type-specific distances. For continuous features, the distance is the mean of the feature-wise L1 distances between the original sample and the counterfactuals, divided by the median absolute deviation (MAD) of the feature values in the training set. For categorical features, the distance function assigns 1 to a feature if it differs from the original feature and 0 otherwise.

In order to gauge the speed of generating counterfactual explanations, Van Looveren and Klaise [61] measure the time and the number of gradient updates needed until the desired counterfactual explanation is generated.

Diversity of the generated counterfactuals is measured via the feature-wise distances between each pair of counterfactual examples, calculating diversity as the mean of the distances between each pair of examples [76]. Eq. (15) illustrates the measure used for diversity:

Diversity = (1/|C_k|^2) Σ_{i=1}^{k−1} Σ_{j=i+1}^{k} dist(x_cf_i, x_cf_j)    (15)

where C_k represents the set of k counterfactuals generated for the original input, and x_cf_i and x_cf_j are the i-th and j-th counterfactuals in the set C_k.
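The distance-based properties and the 1-NN boundary-mimicking check above reduce to a short script once the counterfactuals are available. The sketch below implements the MAD-scaled continuous distance and the 0/1 categorical distance, uses them for the proximity (Eq. (14)) and diversity (Eq. (15)) scores together with a simple sparsity count, and adds a 1-NN agreement check with scikit-learn; the interface (a model with a predict method, pre-generated counterfactuals with target labels) and all names are our own illustrative assumptions, not code from [76].

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def _distance(a, b, mad, cat_idx):
    """Mean feature-wise distance: |a_j - b_j| / MAD_j for continuous features,
    0/1 mismatch for categorical features (identified by column index)."""
    d = 0.0
    for j in range(len(a)):
        if j in cat_idx:
            d += float(a[j] != b[j])
        else:
            d += abs(a[j] - b[j]) / (mad[j] + 1e-8)
    return d / len(a)

def proximity(x, counterfactuals, X_train, cat_idx=()):
    """Negative mean distance between the original instance and its counterfactuals (Eq. 14)."""
    mad = np.median(np.abs(X_train - np.median(X_train, axis=0)), axis=0)
    return -float(np.mean([_distance(x, cf, mad, set(cat_idx)) for cf in counterfactuals]))

def sparsity(x, counterfactuals):
    """Average number of features changed per counterfactual (smaller is sparser)."""
    x = np.asarray(x)
    return float(np.mean([(np.asarray(cf) != x).sum() for cf in counterfactuals]))

def diversity(counterfactuals, X_train, cat_idx=()):
    """Mean pairwise distance between counterfactuals, with the 1/|C_k|^2 normalization of Eq. (15)."""
    mad = np.median(np.abs(X_train - np.median(X_train, axis=0)), axis=0)
    cfs = [np.asarray(cf) for cf in counterfactuals]
    k = len(cfs)
    total = sum(_distance(cfs[i], cfs[j], mad, set(cat_idx))
                for i in range(k - 1) for j in range(i + 1, k))
    return total / (k * k)

def local_boundary_agreement(model, X_orig, X_cf, y_cf, X_eval):
    """1-NN boundary-mimicking check: fit a 1-NN classifier on original samples
    (labeled by the model) plus counterfactuals (labeled with their target classes)
    and report agreement with the original model on evaluation inputs."""
    X_aux = np.vstack([X_orig, X_cf])
    y_aux = np.concatenate([model.predict(X_orig), np.asarray(y_cf)])
    knn = KNeighborsClassifier(n_neighbors=1).fit(X_aux, y_aux)
    return float(np.mean(knn.predict(X_eval) == model.predict(X_eval)))
```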
Overview of interpretable models and their categories

Traditional Interpretability
  - Interpretable Models: [106], [103], [64], [102], [65], [105], [50], [13], [40]
  - Post-hoc: [47], [89], [66], [93], [91], [108], [20], [16], [96], [62]
Causal Interpretability
  - Model-based: [77], [38], [12], [8], [112], [81], [7]
  - Example-based: [35], [45], [39], [101], [76], [86], [73], [61], [60], [33], [34]
  - Fairness: [53], [46], [67], [109]
  - Guarantee: [49], [17], [88]

Table 2: A summary of the state-of-the-art frameworks for each type of interpretability

Kanehira et al. [45] propose metrics to evaluate visual-linguistic counterfactual explanations that ensure (a) the visual explanations keep high positiveness/negativeness on the model predictions for the positive/negative classes, and (b) the linguistic explanations are compatible with their corresponding visual explanations. To check whether the generated examples meet these criteria, the authors in [45] propose two accuracy-based metrics. More specifically, to check the first condition, they investigate how the output of the target classifier changes towards the negative class when a specific region is removed from the input. To measure the second criterion, for each output pair (s, R) they examine how well the region R makes the concept s distinguishable by humans. To quantify this, they compute an accuracy by utilizing bounding boxes for each attribute in the test set: the IoU (intersection over union) between a given region R and all bounding boxes R' corresponding to an attribute s' is calculated, and the accuracy is then measured by selecting the attribute s' with the largest IoU score and checking its consistency with s, the counterpart of R.

Table 1 summarizes the evaluation metrics for counterfactual explanations based on the properties of the generated examples.
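For reference, the IoU computation underlying this accuracy check is only a few lines. The sketch below assumes axis-aligned boxes encoded as (x1, y1, x2, y2) tuples; it is an illustration, not code from [45].

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def best_matching_attribute(region, attribute_boxes):
    """Return the attribute whose bounding boxes overlap the explanation region the most.
    attribute_boxes: dict mapping attribute name -> list of boxes."""
    scores = {attr: max((iou(region, b) for b in boxes), default=0.0)
              for attr, boxes in attribute_boxes.items()}
    return max(scores, key=scores.get)
```

The accuracy is then the fraction of (s, R) pairs for which best_matching_attribute(R, ...) returns s.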
Model-based Evaluation Metrics. Due to the lack of evaluation groundtruth representing the actual effect of each component of a model on its final decisions, the evaluation of this type of model is still an open problem. One common way of evaluating such models is to report the most important components of a model by measuring their causal effects on the outcome of the model [38; 77]. Chattopadhyay et al. also use the causal attribution of each neuron on the output to visualize the local decisions of the model as a saliency map. Moreover, to further investigate how well the model estimates the ACE, they propose to run the model on datasets designed for causal effect estimation [12].
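As a rough illustration of this interventional style of evaluation, the sketch below estimates the causal attribution of a single input feature (the same idea applies to a neuron activation if the callable exposes an intermediate layer) by fixing it to different values across a background sample and recording how the expected output deviates from the average interventional expectation. It is a simplified Monte-Carlo variant of the idea, not the exact estimator of [12].

```python
import numpy as np

def interventional_expectation(model, X_background, feature_idx, alpha, target_class):
    """Monte-Carlo estimate of E[y | do(x_i = alpha)]: fix feature i to alpha across
    background samples and average the model's output for the target class."""
    X_int = X_background.copy()
    X_int[:, feature_idx] = alpha
    return float(model(X_int)[:, target_class].mean())

def average_causal_effect(model, X_background, feature_idx, alphas, target_class):
    """Deviation of each interventional expectation from the mean interventional
    expectation over the tried values of alpha (a simple causal-attribution curve)."""
    exps = np.array([interventional_expectation(model, X_background, feature_idx, a, target_class)
                     for a in alphas])
    return dict(zip(alphas, exps - exps.mean()))
```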
Causal Fairness Evaluation. Evaluating causal fairness models is a challenging task. Papers in this field usually assess the performance of the model at detecting discrimination. Zhang et al. leverage direct, indirect and spurious effect measures (defined in section 4.3) to detect and explain discrimination [109]. However, to the best of our knowledge, no quantitative measure of the causality of a fairness algorithm exists.

6. CONCLUSION
In this survey, we introduce the problem of interpretability in machine learning. We view the problem from two perspectives: (1) traditional interpretability algorithms and (2) causal interpretability algorithms, with the primary focus of the survey on causal frameworks. We first provide different definitions of interpretability, then review the state-of-the-art methods in both categories and point out the differences between them. Each type of interpretable model is further subdivided into sub-categories to give readers a better overview of the existing directions and approaches in the field. More concretely, for traditional methods, we divide existing work into inherently interpretable models and post-hoc interpretability. For causal models, we divide the existing work into the following four categories: counterfactual examples, model-based interpretability, causal models in fairness, and interpretability for verifying causal relationships. We also address the challenging problem of evaluating interpretable models, explain existing metrics in detail, and categorize them based on the scenarios they are designed for. Table 2 summarizes the state-of-the-art methods which belong to each category of interpretability.

ACKNOWLEDGEMENTS
We would like to thank Andre Harrison for helpful comments.

7. REFERENCES
[1] A. Aamodt and E. Plaza. Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Communications, 7(1):39–59, 1994.
[2] J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pages 9505–9515, 2018.
[3] D. Alvarez-Melis and T. Jaakkola. A causal framework for explaining the predictions of black-box sequence-to-sequence models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 412–421, Copenhagen, Denmark, Sept. 2017. Association for Computational Linguistics.
[4] J. Angwin, J. Larson, L. Kirchner, and S. Mattu. Machine bias. https://www.propublica.org/article/machine-bias-risk-asses, Mar 2019.
[5] AWS. Amazon customer reviews dataset. https://s3.amazonaws.com/amazon-reviews-pds/readme.html, 2020.
[6] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[7] D. Bau, J. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba. GAN dissection: Visualizing and understanding generative adversarial networks. CoRR, abs/1811.10597, 2018.
[8] M. Besserve, R. Sun, and B. Schölkopf. Counterfactuals uncover the modular structure of deep generative models. CoRR, abs/1812.03253, 2018.
[9] D. Boyd and K. Crawford. Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Information, Communication & Society, 15(5):662–679, 2012.
[10] O. Boz. Extracting decision trees from trained neural networks. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 456–461. ACM, 2002.
[11] R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15, pages 1721–1730, New York, NY, USA, 2015. ACM.
[12] A. Chattopadhyay, P. Manupriya, A. Sarkar, and V. N. Balasubramanian. Neural network attributions: A causal perspective. CoRR, abs/1902.02302, 2019.
[13] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
[14] W. Cheng, Y. Shen, L. Huang, and Y. Zhu. Incorporating interpretability into latent factor models via fast influence analysis. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 885–893. ACM, 2019.
[15] A. Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. CoRR, abs/1703.00056, 2017.
[16] M. Craven and J. W. Shavlik. Extracting tree-structured representations of trained networks. In Advances in Neural Information Processing Systems, pages 24–30, 1996.
[17] F. Doshi-Velez and B. Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
[18] M. Du, N. Liu, and X. Hu. Techniques for interpretable machine learning. arXiv preprint arXiv:1808.00033, 2018.
[19] D. Dua and C. Graff. UCI machine learning repository, 2017.
[20] D. Erhan, Y. Bengio, A. Courville, and P. Vincent. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.
[21] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.
[22] A. Flores, K. Bechtel, and C. Lowenkamp. False positives, false negatives, and false analyses: A rejoinder to "Machine bias: There's software used across the country to predict future criminals. And it's biased against blacks.". Federal Probation, 80, 09 2016.
[23] J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.
[24] N. Frosst and G. Hinton. Distilling a neural network into a soft decision tree. arXiv preprint arXiv:1711.09784, 2017.
[25] L. Gerson Neuberg. Causality: Models, reasoning, and inference, by Judea Pearl, Cambridge University Press, 2000. Econometric Theory, 19:675–685, 08 2003.
[26] A. Ghorbani, A. Abid, and J. Zou. Interpretation of neural networks is fragile. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3681–3688, 2019.
[27] L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, and L. Kagal. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 80–89. IEEE, 2018.
[28] A. Goldstein, A. Kapelner, J. Bleich, and E. Pitkin. Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. Journal of Computational and Graphical Statistics, 24(1):44–65, 2015.
[29] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[30] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
[31] B. Goodman and S. Flaxman. EU regulations on algorithmic decision-making and a "right to explanation". arXiv:1606.08813, 2016. Presented at the 2016 ICML Workshop on Human Interpretability in Machine Learning (WHI 2016), New York, NY.
[32] B. Goodman and S. Flaxman. European Union regulations on algorithmic decision-making and a "right to explanation". AI Magazine, 38(3):50–57, 2017.
[33] Y. Goyal, U. Shalit, and B. Kim. Explaining classifiers with causal concept effect (CaCE). CoRR, abs/1907.07165, 2019.
[34] Y. Goyal, Z. Wu, J. Ernst, D. Batra, D. Parikh, and S. Lee. Counterfactual visual explanations. CoRR, abs/1904.07451, 2019.
[35] R. M. Grath, L. Costabello, C. L. Van, P. Sweeney, F. Kamiab, Z. Shen, and F. Lécué. Interpretable credit application predictions with counterfactual explanations. CoRR, abs/1811.05245, 2018.
[36] R. Guo, L. Cheng, J. Li, P. R. Hahn, and H. Liu. A survey of learning causality with data: Problems and methods. arXiv preprint arXiv:1809.09337, 2018.
[37] K. S. Gurumoorthy, A. Dhurandhar, G. Cecchi, and C. Aggarwal. Efficient data representation by selecting prototypes with importance weights, 2017.
[38] M. Harradon, J. Druce, and B. E. Ruttenberg. Causal learning and explanation of deep neural networks via autoencoded activations. CoRR, abs/1802.00541, 2018.
[39] L. A. Hendricks, R. Hu, T. Darrell, and Z. Akata. Generating counterfactual explanations with natural language. CoRR, abs/1806.09809, 2018.
[40] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, volume 3, 2017.
[41] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[42] A. Hyvärinen and E. Oja. Independent component analysis: Algorithms and applications. Neural Networks, 13(4-5):411–430, 2000.
[43] IMDb. IMDb datasets. https://www.imdb.com/interfaces/, 2020.
[44] I. Jolliffe. Principal Component Analysis. Springer, 2011.
[45] A. Kanehira, K. Takemoto, S. Inayoshi, and T. Harada. Multimodal explanations by predicting counterfactuality in videos. CoRR, abs/1812.01263, 2018.
[46] N. Kilbertus, M. Rojas Carulla, G. Parascandolo, M. Hardt, D. Janzing, and B. Schölkopf. Avoiding discrimination through causal reasoning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 656–666. Curran Associates, Inc., 2017.
[47] B. Kim, R. Khanna, and O. O. Koyejo. Examples are not enough, learn to criticize! Criticism for interpretability. In Advances in Neural Information Processing Systems, pages 2280–2288, 2016.
[48] B. Kim, O. Koyejo, and R. Khanna. Examples are not enough, learn to criticize! Criticism for interpretability. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2280–2288, 2016.
[49] C. Kim and O. Bastani. Learning interpretable models with causal guarantees. CoRR, abs/1901.08576, 2019.
[50] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[51] P. W. Koh, K.-S. Ang, H. H. Teo, and P. Liang. On the accuracy of influence functions for measuring group effects. arXiv preprint arXiv:1905.13289, 2019.
[52] P. W. Koh and P. Liang. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1885–1894. JMLR.org, 2017.
[53] M. J. Kusner, J. R. Loftus, C. Russell, and R. Silva. Counterfactual fairness. In I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 4069–4079, 2017.
[54] K. Lang. 20 Newsgroups. http://qwone.com/~jason/20Newsgroups/, 2008.
[55] Q. V. Le. Building high-level features using large scale unsupervised learning. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8595–8598. IEEE, 2013.
[56] Y. LeCun, C. Cortes, and C. Burges. The MNIST database. http://yann.lecun.com/exdb/mnist/, Jan 2020.
[57] J. Li, K. Cheng, S. Wang, F. Morstatter, R. P. Trevino, J. Tang, and H. Liu. Feature selection: A data perspective. ACM Computing Surveys (CSUR), 50(6):94, 2018.
[58] Y. Li, R. Guo, W. Wang, and H. Liu. Causal learning in question quality improvement. In 2019 BenchCouncil International Symposium on Benchmarking, Measuring and Optimizing (Bench19), 2019.
[59] Z. C. Lipton. The mythos of model interpretability. CoRR, abs/1606.03490, 2016.
[60] S. Liu, B. Kailkhura, D. Loveland, and Y. Han. Generative counterfactual introspection for explainable deep learning. CoRR, abs/1907.03077, 2019.
[61] A. V. Looveren and J. Klaise. Interpretable counterfactual explanations guided by prototypes. CoRR, abs/1907.02584, 2019.
[62] Y. Lou, R. Caruana, and J. Gehrke. Intelligible models for classification and regression. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 150–158. ACM, 2012.
[63] Y. Lou, R. Caruana, J. Gehrke, and G. Hooker. Accurate intelligible models with pairwise interactions. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '13, pages 623–631, New York, NY, USA, 2013. ACM.
[64] J. Lu, C. Xiong, D. Parikh, and R. Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 375–383, 2017.
[65] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In Advances in Neural Information Processing Systems, pages 289–297, 2016.
[66] S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774, 2017.
[67] D. Madras, E. Creager, T. Pitassi, and R. S. Zemel. Fairness through causal awareness: Learning latent-variable models for biased data. CoRR, abs/1809.02519, 2018.
[68] P. Madumal, T. Miller, L. Sonenberg, and F. Vetere. Explainable reinforcement learning through a causal lens. CoRR, abs/1905.10958, 2019.
[69] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan. A survey on bias and fairness in machine learning. arXiv preprint arXiv:1908.09635, 2019.
[70] T. Miller. Explanation in artificial intelligence: Insights from the social sciences. CoRR, abs/1706.07269, 2017.
[71] C. Molnar. Interpretable Machine Learning. 2019. https://christophm.github.io/interpretable-ml-book/.
[72] C. Molnar. Interpretable Machine Learning. Lulu.com, 2019.
[73] J. Moore, N. Hammerla, and C. Watkins. Explaining deep learning models with constrained adversarial examples. CoRR, abs/1906.10671, 2019.
[74] S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. DeepFool: A simple and accurate method to fool deep neural networks. CoRR, abs/1511.04599, 2015.
[75] A. Mordvintsev, C. Olah, and M. Tyka. Inceptionism: Going deeper into neural networks, 2015.
[76] R. K. Mothilal, A. Sharma, and C. Tan. Explaining machine learning classifiers through diverse counterfactual explanations. CoRR, abs/1905.07697, 2019.
[77] T. Narendra, A. Sankaran, D. Vijaykeerthy, and S. Mani. Explaining deep learning models using causal inference. CoRR, abs/1811.04376, 2018.
[78] C. Olah, A. Mordvintsev, and L. Schubert. Feature visualization. Distill, 2017. https://distill.pub/2017/feature-visualization.
[79] C. Olah, A. Satyanarayan, I. Johnson, S. Carter, L. Schubert, K. Ye, and A. Mordvintsev. The building blocks of interpretability. Distill, 2018. https://distill.pub/2018/building-blocks.
[80] N. Papernot, P. D. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami. The limitations of deep learning in adversarial settings. CoRR, abs/1511.07528, 2015.
[81] Á. Parafita and J. Vitrià. Explaining visual models by causal attribution. arXiv preprint arXiv:1909.08891, 2019.
[82] J. Pearl. Causality. Cambridge University Press, 2009.
[83] J. Pearl. Theoretical impediments to machine learning with seven sparks from the causal revolution. CoRR, abs/1801.04016, 2018.
[84] J. Pearl. The seven tools of causal inference, with reflections on machine learning. Commun. ACM, 62(3):54–60, Feb. 2019.
[85] G. Plumb, D. Molitor, and A. S. Talwalkar. Model agnostic supervised local explanations. In Advances in Neural Information Processing Systems, pages 2515–2524, 2018.
[86] S. Rathi. Generating counterfactual and contrastive explanations using SHAP. CoRR, abs/1906.09293, 2019.
[87] A. Renkl. Toward an instructionally oriented theory of example-based learning. Cognitive Science, 38(1):1–37, 2014.
[88] M. T. Ribeiro, S. Singh, and C. Guestrin. Model-agnostic interpretability of machine learning. arXiv preprint arXiv:1606.05386, 2016.
[89] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.
[90] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[91] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618–626, 2017.
[92] U. Shalit, F. D. Johansson, and D. Sontag. Estimating individual treatment effect: Generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3076–3085. JMLR.org, 2017.
[93] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
[94] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
[95] P.-N. Tan. Introduction to Data Mining. Pearson Education India, 2018.
[96] G. G. Towell and J. W. Shavlik. Extracting refined rules from knowledge-based neural networks. Machine Learning, 13(1):71–101, 1993.
[97] UCI. UCI machine learning repository. https://archive.ics.uci.edu/ml/index.php, 2020.
[98] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[99] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[100] U. Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
[101] S. Wachter, B. D. Mittelstadt, and C. Russell. Coun-
terfactual explanations without opening the black
box: Automated decisions and the GDPR. CoRR,
abs/1711.00399, 2017.
[102] H. Xu and K. Saenko. Ask, attend and answer: Ex-
ploring question-guided spatial attention for visual
question answering. In European Conference on Com-
puter Vision, pages 451–466. Springer, 2016.
[103] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville,
R. Salakhudinov, R. Zemel, and Y. Bengio. Show, at-
tend and tell: Neural image caption generation with
visual attention. In International conference on ma-
chine learning, pages 2048–2057, 2015.
[104] F. Yang, M. Du, and X. Hu. Evaluating explanation
without ground truth in interpretable machine learn-
ing, 2019.
[105] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola.
Stacked attention networks for image question an-
swering. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 21–29,
2016.
[106] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and
E. Hovy. Hierarchical attention networks for document
classification. In Proceedings of the 2016 conference
of the North American chapter of the association for
computational linguistics: human language technolo-
gies, pages 1480–1489, 2016.
[107] YELP. Yelp dataset.
https://www.yelp.com/dataset, 2020.
[108] M. D. Zeiler and R. Fergus. Visualizing and under-
standing convolutional networks. In European confer-
ence on computer vision, pages 818–833. Springer,
2014.
[109] J. Zhang and E. Bareinboim. Fairness in decision-
making – the causal explanation formula. 02 2018.
[110] Q. Zhang, Y. Yang, H. Ma, and Y. N. Wu. Interpret-
ing cnns via decision trees. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recogni-
tion, pages 6261–6270, 2019.
[111] Q.-s. Zhang and S.-C. Zhu. Visual interpretability
for deep learning: a survey. Frontiers of Informa-
tion Technology & Electronic Engineering, 19(1):27–
39, 2018.
[112] Q. Zhao and T. Hastie. Causal interpretations of
black-box models. Journal of Business & Economic
Statistics, (just-accepted):1–19, 2019.
