Causal Interpretability For Machine Learning
Raha Moraffah∗, Mansooreh Karami∗, Ruocheng Guo∗, Adrienne Raglin†, Huan Liu∗
∗Computer Science & Engineering, Arizona State University, Tempe, AZ, USA
†Army Research Lab, USA
∗{rmoraffa, mkarami, rguo12, huanliu}@asu.edu, †adrienne.raglin2.civ@mail.mil
ABSTRACT

Machine learning models have had discernible achievements in a myriad of applications. However, most of these models are black-boxes, and it is obscure how they make their decisions. This makes the models unreliable and untrustworthy. To provide insights into the decision making processes of these models, a variety of traditional interpretable models have been proposed. Moreover, to generate more human-friendly explanations, recent work on interpretability tries to answer questions related to causality such as "Why does this model make such decisions?" or "Was it a specific feature that caused the decision made by the model?". In this work, models that aim to answer causal questions are referred to as causal interpretable models. The existing surveys have covered concepts and methodologies of traditional interpretability. In this work, we present a comprehensive survey on causal interpretable models from the aspects of the problems and methods. In addition, this survey provides in-depth insights into the existing evaluation metrics for measuring interpretability, which can help practitioners understand for what scenarios each evaluation metric is suitable.

Keywords

Interpretability, explainability, causal inference, counterfactuals, machine learning

1. INTRODUCTION

With the surge of machine learning in critical areas such as healthcare, law-making and autonomous cars, decisions that had previously been made by humans are now made automatically by these algorithms. In order to ensure the reliability of such decisions, humans need to understand how these decisions are made. However, machine learning models are usually inherently black-boxes and do not provide explanations for how and why they make their decisions. This has become especially problematic as recent work shows that the decisions made by machine learning models are sometimes biased and enforce inequality [69]. For instance, Angwin et al. [4] demonstrate that predictions made by Correctional Offender Management Profiling for Alternative Sanctions (COMPAS), a widely used criminal risk assessment tool, show racial biases. With recent regulations such as the European Union's "Right to Explanation" [32] and AI calls for diversity and inclusion [9], interpretable models which are capable of explaining the decisions they make are necessary. Moreover, recent research shows that machine learning models, especially deep neural networks, can be easily fooled into predicting a specific class label for an image when its pixel values are under minimal perturbations [30; 74; 80]. Such results imply that machine learning models suffer from the risk of making unexpected decisions. Understanding the decisions of machine learning models and the process leading to them can help us understand the rules the models use to make their decisions and therefore prevent potential unexpected situations from happening. More specifically, through interpretable machine learning models, we aim to guarantee that (a) decisions made by machine learning models comply with the rules toward social good; and (b) the classifier does not pick up the biases in the data and the decisions made are compatible with human understanding.

Previously, various frameworks have been proposed to generate explanations for machine learning algorithms. These algorithms can be mainly divided into two categories: (1) algorithms that are inherently interpretable, which include the models that generate explanations at training time [106]; and (2) post-hoc interpretations, which refer to models that generate explanations for already made decisions [75; 85; 47]. Henceforth, these models are referred to as traditional interpretable models.

In this work, we focus on causal interpretable models that can explain their decisions through what decisions would have been made if they had been under alternative situations (e.g., being trained with different inputs, model components or hyperparameters). Note that traditional interpretable models are unable to answer such questions about decision making under alternative situations, although they can explain how and why a decision is made by an existing model on an observed instance. For instance, in the case of credit applications, to impose fairness on the decision making process, we may need to answer questions such as "Did the protected features (e.g., race and gender) cause the system to reject the application of the i-th applicant?" and "If the i-th applicant had different protected features, would the system still make the same decision?" In other words, in order to make the explanations more understandable and useful for humans, we need to ask questions such as "Why did the classifier make this decision instead of another?", "What would have happened to this decision of a classifier had we had a different input to it?", or "Was it feature X that caused decision Y?". Traditional interpretability
frameworks which only consider correlations are not capable of generating such explanations. This is due to the fact that these frameworks cannot estimate how altering a feature or a component of a model would change the predictions made by the rest of the model or the predicted labels on the data samples. Therefore, in order to answer such questions about both data samples and models, counterfactual analysis needs to be leveraged. Counterfactual analysis is a concept from the causal inference literature [25]. In counterfactual analysis, we aim to infer the output of a model in imaginary scenarios that we have not observed or cannot observe. Recently, counterfactual analysis and causal inference have gained a lot of attention from the interpretable machine learning field. Research in this area has mainly focused on generating counterfactual explanations from both the data perspective [34; 76] as well as the components of a model [77; 38].

Existing surveys on interpretable machine learning focus on the traditional methods and do not discuss the existing methods from a causal perspective. In this survey, we present commonly used definitions for interpretability, discuss interpretable models from a causal perspective and provide guidelines for evaluating these methods. More specifically, in Section 2, we first provide different definitions of interpretability. We then briefly introduce the existing methods on traditional interpretability and present different types of interpretable models in this category (Section 2.2). Section 3 discusses concepts from causal inference which are used in this survey. In Section 4, we provide an overview of existing works on causal interpretability. We also compare the proposed models for both traditional and causal methods from different perspectives to provide insights on the advantages and disadvantages of each type of interpretability. Section 5 provides detailed guidelines on the experimental settings such as commonly used datasets and evaluation metrics for both traditional and causal approaches. We then discuss evaluation metrics specifically used for causal methods in more detail and provide different scenarios for which these metrics can be used. Since the evaluation of causal interpretable models is a challenging task, these guidelines can be helpful for future research in this area and can be used to evaluate approaches with similar characteristics. In addition, they can also be used to create new evaluation metrics for approaches with different functionalities.

... consistently predict the model's decisions. Doshi-Velez et al. [17] define interpretability as the ability to explain in intelligible ways to a human. Gilpin et al. [27] take a step further and define interpretability as a part of explainability. They state that explainable models are those that summarize the reasons for neural network behaviors, gain the trust of the users, or generate insights into the causes of their decisions, while interpretable models may not be able to describe the operation of a system in an accurate way. Pearl [84] claims that tasks such as explainability require a causal model of the environment and cannot be handled at the level of association.

2.1 Interpretability in Machine Learning

Interpretable machine learning has been widely explored and discussed in previous literature. However, to the best of our knowledge, there is no comprehensive review on causal interpretable models. For instance, Lipton [59] discusses the motivation behind creating interpretable models and categorizes interpretable models into two main categories: transparent models and post-hoc models. Doshi-Velez et al. [17] provide a definition of model interpretability and evaluation criteria. However, this review only proposes definitions and evaluations that are used for traditional interpretability of models and does not cover causal and counterfactual questions. Gilpin et al. [27] explain fundamental concepts of explainability and use them to classify the literature on interpretable models. Zhang and Zhu [111] review the existing interpretable models proposed for deep models used in visual domains. Du et al. [18] provide a comprehensive survey of existing interpretable methods and discuss issues that should be considered in future work. It is worth mentioning that none of the existing work discusses interpretable models from a causal perspective. In this work, we first introduce the state-of-the-art research in traditional interpretability (Sec. 2.2) and then give a detailed survey on causal interpretable models (Sec. 4). Figure 1 shows an overview of interpretable models and their classification.

2.2 Traditional Interpretability

Before proceeding with the detailed review of the methodologies in causal interpretable models, we provide an overview of existing state-of-the-art methods in traditional machine learning. We categorize traditional models into two main categories:
known beforehand. We should verify that the model will pick up the important features of the data. One can simply use any base method introduced in Section 2.2.1 as a proxy model to extract the important features. The fraction of these important features recovered by the interpretable method can be used as an evaluation score [89].

• How locally faithful is the proposed method compared to the original model (fidelity)? Lack of fidelity will result in limited insight into the original model [104]. In convolutional neural networks, one common approach is image occlusion: the pixels that the interpretable method defines as important are masked to see whether this is reflected in the classification score [91; 108].

• How consistent are the explanations for similar instances with the same class label? The explanations should not be significantly different for samples with the same label and slightly different features. This instability could be the result of high variance as well as the non-deterministic components of the explanation method [72].

5.2.2 Causal Evaluation Metrics

Due to the lack of groundtruth for causal explanations, to verify the causal aspect of the proposed framework, we need to quantify the desired characteristics of the model and measure their "goodness" via some predefined proxy metrics. In the following, we go over the existing metrics to evaluate the proposed causal interpretable frameworks for different categories of causal interpretability.

Counterfactual Explanations Evaluation Metrics. Existing approaches for causal interpretability are mostly based on generating counterfactual explanations. For such approaches, causal interpretability is often measured through the goodness of the generated counterfactual explanation. As mentioned in Section 4, a counterfactual explanation is the highest level of explanation and therefore, we can claim that if an explanation is a counterfactual explanation and is generated by considering causal relationships, it is indeed explainable. However, due to the lack of groundtruth for counterfactuals, we are unable to measure whether the generated explanations are generated based on causal relationships. Therefore, to measure the "goodness" of counterfactual explanations, we suggest conducting experiments to (1) measure the interpretability of the explanations using the metrics designed for interpretability; and (2) evaluate the counterfactuals themselves by measuring their different characteristics. An interpretable counterfactual explanation should have the following characteristics:

• The model prediction on the counterfactual sample (x_cf) needs to be close to the predefined output for the counterfactual explanation.

• The perturbation δ changing the original instance x into x_cf = x + δ should be sparse. In other words, the size of the counterfactual (i.e., the number of changed features) should be small.

• A counterfactual explanation x_cf is considered interpretable if it lies close to the model's training data distribution.
• The counterfactual instance x_cf needs to be found fast enough to ensure it can be used in a real-life setting.

• Counterfactual explanations generated for a data instance should be different from each other. In other words, counterfactual explanations should be diverse.

• Visual-linguistic counterfactual explanations must satisfy the following two criteria: (1) the visual explanation is the region which keeps high positiveness/negativeness on the model prediction for specific positive/negative classes; (2) the linguistic explanation should be compatible with the visual counterpart in the generated visual explanations.

Below, we briefly discuss the evaluation metrics designed to assess the aforementioned characteristics of a counterfactual explanation.

To evaluate the sparsity of the generated counterfactual examples, Mc Grath et al. [35] measure the size of a generated example by counting the number of features each example consists of. Van Looveren and Klaise [61] use the elastic net loss term EN(δ) = β||δ||_1 + ||δ||_2^2, where δ is the distance between the original instance and its generated counterfactual example and β is a hyperparameter.

In order for counterfactual explanations to be interpretable, they need to be close to the data manifold. Van Looveren and Klaise refine this criterion by suggesting that counterfactuals are interpretable if they are close to the data manifold of the counterfactual class [61]. To measure the interpretability defined above, they propose to measure the ratio of the reconstruction errors when the model used for generating counterfactuals is trained only on the counterfactual class vs. when it is trained on the original class [61]. The proposed metric is shown in the following equation,

IM1(AE_i, AE_{t_0}, x_cf) = ||x_0 + δ − AE_i(x_0 + δ)||_2^2 / (||x_0 + δ − AE_{t_0}(x_0 + δ)||_2^2 + ε)   (12)

where AE_i and AE_{t_0} represent the autoencoders used to generate the counterfactuals, trained on class i (the counterfactual class) and class t_0 (the original class), respectively. We let x_cf and x_0 be the counterfactual explanation and the original sample. In addition, δ denotes the distance between the original and counterfactual samples. A lower value of IM1 shows that counterfactual examples can be better reconstructed by the autoencoder trained on the counterfactual class in comparison to the autoencoder trained on the original class. This implies that the generated counterfactuals are closer to the counterfactual class data manifold.

Another metric proposed by [61] measures how similar the generated counterfactuals are when reconstructed by the autoencoder trained only on the counterfactual class vs. the autoencoder trained on all classes. The metric is shown in the following equation,

IM2(AE_i, AE, x_cf) = ||AE_i(x_0 + δ) − AE(x_0 + δ)||_2^2 / (||x_0 + δ||_1 + ε)   (13)

A lower value of IM2 shows that the counterfactuals reconstructed by the autoencoder trained on all classes and by the autoencoder trained on counterfactuals are more similar. This implies that the generated counterfactual distribution is as good as the distribution over all classes.

Generated counterfactual explanations can be used to measure users' understanding of a machine learning model's local decision boundary. Mothilal et al. [76] propose to mimic users' understanding of a model's local decision boundaries by (a) constructing an auxiliary classifier on both original inputs and counterfactual examples; and (b) measuring how well it mimics the actual decision boundaries. More specifically, they train a 1-nearest neighbor (1-NN) classifier on both the original and the counterfactual samples to predict the class of new inputs. The accuracy of this model is then compared with the accuracy of the original model.

The definition of counterfactual explanations implies that generated explanations should be as similar as possible to the original instance. In order to evaluate the proximity between original samples and counterfactual explanations, Mothilal et al. [76] define proximity as Eq. (14),

Proximity = −(1/k) Σ_{i=1}^{k} dist(x_{cf_i}, x)   (14)

In order to calculate the proximity for both categorical and continuous features, the authors further propose two metrics. For continuous features, the proximity is defined as the mean of feature-wise L1 distances between the original sample and the counterfactuals, divided by the median absolute deviation (MAD) of the feature's values in the training set. For categorical features, the distance function assigns 1 to each categorical feature that differs from the original feature and 0 otherwise.

In order to gauge the speed of generating counterfactual explanations, Van Looveren and Klaise [61] measure the time and the number of gradient updates until the desired counterfactual explanation is generated.

Diversity of generated counterfactuals is measured by computing feature-wise distances between each pair of counterfactual examples and taking the mean of the distances over all pairs [76]. Eq. (15) illustrates the measure used for diversity,

Diversity = (1/|C_k|^2) Σ_{i=1}^{k−1} Σ_{j=i+1}^{k} d(x_{cf_i}, x_{cf_j})   (15)

where C_k represents the set of k counterfactuals generated for the original input, and x_{cf_i} and x_{cf_j} are the i-th and j-th counterfactuals in the set C_k.

Kanehira et al. [45] propose metrics to evaluate visual-linguistic counterfactual explanations to ensure that (a) visual explanations keep high positiveness/negativeness on the model predictions for positive/negative classes; and (b) linguistic explanations are compatible with their corresponding visual explanations. To measure whether the generated examples meet these criteria, the authors in [45] propose two metrics based on accuracy. More specifically, to check the first condition, they investigate how the output of the target classifier changes towards the negative class when a specific region is removed from the input. To measure the second criterion, for each output pair (s, R) they examine how the region R makes the concept s distinguishable by humans. To measure this quantitatively, they compute the accuracy by utilizing bounding boxes for each attribute in the test set. More specifically, the IoU (intersection over union) between a given R and all bounding boxes R' corresponding to attribute s' is calculated. Then the accuracy is measured by selecting the attribute s' with the largest IoU score and checking its consistency with s, the counterpart of R.
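To make these counterfactual-quality metrics concrete, the sketch below (our own illustration, not code released with any of the cited papers) computes the elastic-net sparsity term EN(δ), the MAD-scaled continuous proximity and the categorical proximity described for Eq. (14), and the pairwise diversity of Eq. (15). The split of feature indices into continuous and categorical ones, the per-feature MAD values, and the synthetic counterfactuals are assumed placeholders.

```python
import numpy as np

def elastic_net_sparsity(x, x_cf, beta=1.0):
    # EN(delta) = beta * ||delta||_1 + ||delta||_2^2, with delta = x_cf - x
    delta = x_cf - x
    return beta * np.abs(delta).sum() + np.square(delta).sum()

def proximity_continuous(x, counterfactuals, cont_idx, mad):
    # Eq. (14) for continuous features: negative mean of feature-wise
    # L1 distances, scaled by the median absolute deviation (MAD)
    dists = [np.mean(np.abs(cf[cont_idx] - x[cont_idx]) / mad[cont_idx])
             for cf in counterfactuals]
    return -np.mean(dists)

def proximity_categorical(x, counterfactuals, cat_idx):
    # Eq. (14) for categorical features: negative mean of 0/1 mismatches
    dists = [np.mean(cf[cat_idx] != x[cat_idx]) for cf in counterfactuals]
    return -np.mean(dists)

def diversity(counterfactuals):
    # Eq. (15): feature-wise L1 distances over all pairs of counterfactuals,
    # scaled by the 1/|C_k|^2 factor of the equation
    k = len(counterfactuals)
    total = sum(np.mean(np.abs(counterfactuals[i] - counterfactuals[j]))
                for i in range(k - 1) for j in range(i + 1, k))
    return total / (k ** 2)

# toy usage with synthetic data standing in for real counterfactuals
rng = np.random.default_rng(0)
x = rng.normal(size=6)
cfs = [x + rng.normal(scale=0.3, size=6) for _ in range(4)]
mad = np.ones(6)                      # per-feature MAD from the training set
cont_idx, cat_idx = np.arange(4), np.arange(4, 6)
print(elastic_net_sparsity(x, cfs[0]))
print(proximity_continuous(x, cfs, cont_idx, mad))
print(proximity_categorical(x, cfs, cat_idx))
print(diversity(cfs))
```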
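The autoencoder-based scores IM1 and IM2 of Eqs. (12) and (13) can be computed in the same spirit. In the hedged sketch below, each autoencoder is simply a callable returning a reconstruction; the identity and shifted functions at the end are placeholders for actually trained models.

```python
import numpy as np

def im1(ae_cf_class, ae_orig_class, x_cf, eps=1e-8):
    # Eq. (12): reconstruction error of x_cf under the autoencoder trained on
    # the counterfactual class, relative to the one trained on the original
    # class; lower values mean x_cf lies closer to the counterfactual class
    num = np.sum((x_cf - ae_cf_class(x_cf)) ** 2)
    den = np.sum((x_cf - ae_orig_class(x_cf)) ** 2) + eps
    return num / den

def im2(ae_cf_class, ae_all_classes, x_cf, eps=1e-8):
    # Eq. (13): distance between the reconstructions from the
    # counterfactual-class autoencoder and one trained on all classes,
    # scaled by the L1 norm of x_cf; lower values are better
    num = np.sum((ae_cf_class(x_cf) - ae_all_classes(x_cf)) ** 2)
    den = np.sum(np.abs(x_cf)) + eps
    return num / den

# toy usage: stand-in "autoencoders" instead of trained networks
identity = lambda z: z
shifted = lambda z: z + 0.1
x_cf = np.array([0.2, -1.3, 0.7])          # x_cf = x_0 + delta
print(im1(identity, shifted, x_cf), im2(identity, shifted, x_cf))
```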
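Finally, the decision-boundary evaluation attributed to Mothilal et al. [76] can be sketched as follows; the logistic-regression "black box", the synthetic data, and the randomly perturbed stand-ins for generated counterfactuals are all placeholders, and the exact protocol in [76] may differ in its details.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# placeholder black-box model trained on synthetic data
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
black_box = LogisticRegression().fit(X_train, y_train)

# an original instance and stand-ins for its generated counterfactuals
x = np.array([[0.1, -0.2]])
counterfactuals = x + rng.normal(scale=0.8, size=(4, 2))

# (a) auxiliary 1-NN classifier fit on the original input and its
#     counterfactuals, labeled with the black-box model's own predictions
X_aux = np.vstack([x, counterfactuals])
aux = KNeighborsClassifier(n_neighbors=1).fit(X_aux, black_box.predict(X_aux))

# (b) agreement between the 1-NN surrogate and the black box on points
#     sampled around x, as a proxy for how well the counterfactuals
#     convey the local decision boundary
X_local = x + rng.normal(scale=0.8, size=(200, 2))
agreement = np.mean(aux.predict(X_local) == black_box.predict(X_local))
print(f"local agreement with the black-box model: {agreement:.2f}")
```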
Figure 1: Overview of interpretable models and their categories. Traditional Interpretability covers Interpretable Models ([106], [103], [64], [102], [65], [105], [50], [13], [40]) and Post-hoc methods ([47], [89], [66], [93], [91], [108], [20], [16], [96], [62]); Causal Interpretability covers Model-based ([77], [38], [12], [8], [112], [81], [7]), Example-based ([35], [45], [39], [101], [76], [86], [73], [61], [60], [33], [34]), Fairness ([53], [46], [67], [109]), and Guarantee ([49], [17], [88]) approaches.

Table 1 summarizes evaluation metrics for counterfactual explanations based on the properties of the generated examples.

Model-based Evaluation Metrics. Due to the lack of evaluation groundtruth representing the actual effect of each component of a model on its final decisions, evaluation for this type of models is still an open problem. One common way of evaluating such models is to report the most important components of a model by measuring their causal effects on the outcome of the model [38; 77]. Chattopadhyay et al. also use the causal attribution of each neuron on the output to visualize the local decisions of the model via a saliency map. Moreover, to further investigate how well the model estimates the ACE, they propose to run the model on datasets for causal effect estimation [12].

Causal Fairness Evaluation. Evaluation of causal fairness models is a challenging task. Papers in this field usually assess the performance of the model for detecting discrimination. Zhang et al. leverage direct, indirect and spurious effect measures (defined in Section 4.3) to detect and explain discrimination [109]. However, to the best of our knowledge, no quantitative measure of the causality of a fairness algorithm exists.

6. CONCLUSION

In this survey, we introduce the problem of interpretability in machine learning. We view the problem from two perspectives: (1) traditional interpretability algorithms; and (2) causal interpretability algorithms. However, the primary focus of the survey is on causal frameworks. We first provide different definitions of interpretability, then review the state-of-the-art methods in both categories and point out the differences between them. Each type of interpretable model is further subdivided into sub-categories to provide readers with a better overview of existing directions and approaches in the field. More concretely, for traditional methods, we divide existing work into inherently interpretable models and post-hoc interpretability. For causal models, we divide the existing works into the following four categories: counterfactual examples, model-based interpretability, causal models in fairness, and interpretability for verifying causal relationships. We also address the challenging problem of evaluating interpretable models, explain existing metrics in detail, and categorize them based on the scenarios they are designed for. Table 2 summarizes the state-of-the-art methods which belong to each category of interpretability.

ACKNOWLEDGEMENTS

We would like to thank Andre Harrison for helpful comments.

7. REFERENCES

[1] A. Aamodt and E. Plaza. Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI communications, 7(1):39–59, 1994.

[2] J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pages 9505–9515, 2018.

[3] D. Alvarez-Melis and T. Jaakkola. A causal framework for explaining the predictions of black-box sequence-to-sequence models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 412–421, Copenhagen, Denmark, Sept. 2017. Association for Computational Linguistics.

[4] J. Angwin, J. Larson, L. Kirchner, and S. Mattu. Machine bias. https://www.propublica.org/article/machine-bias-risk-asses, Mar 2019.

[5] AWS. Amazon customer reviews dataset. https://s3.amazonaws.com/amazon-reviews-pds/readme.html, 2020.

[6] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[7] D. Bau, J. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba. GAN dissection: Visualizing and understanding generative adversarial networks. CoRR, abs/1811.10597, 2018.

[8] M. Besserve, R. Sun, and B. Schölkopf. Counterfactuals uncover the modular structure of deep generative models. CoRR, abs/1812.03253, 2018.

[9] D. Boyd and K. Crawford. Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Information, communication & society, 15(5):662–679, 2012.

[10] O. Boz. Extracting decision trees from trained neural networks. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 456–461. ACM, 2002.
[11] R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15, pages 1721–1730, New York, NY, USA, 2015. ACM.

[12] A. Chattopadhyay, P. Manupriya, A. Sarkar, and V. N. Balasubramanian. Neural network attributions: A causal perspective. CoRR, abs/1902.02302, 2019.

[13] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180, 2016.

[14] W. Cheng, Y. Shen, L. Huang, and Y. Zhu. Incorporating interpretability into latent factor models via fast influence analysis. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 885–893. ACM, 2019.

[15] A. Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. CoRR, abs/1703.00056, 2017.

[16] M. Craven and J. W. Shavlik. Extracting tree-structured representations of trained networks. In Advances in neural information processing systems, pages 24–30, 1996.

[17] F. Doshi-Velez and B. Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.

[18] M. Du, N. Liu, and X. Hu. Techniques for interpretable machine learning. arXiv preprint arXiv:1808.00033, 2018.

[19] D. Dua and C. Graff. UCI machine learning repository, 2017.

[20] D. Erhan, Y. Bengio, A. Courville, and P. Vincent. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.

[21] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.

[22] A. Flores, K. Bechtel, and C. Lowenkamp. False positives, false negatives, and false analyses: A rejoinder to "Machine bias: There's software used across the country to predict future criminals. And it's biased against blacks.". Federal Probation, 80, 09 2016.

[23] J. H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232, 2001.

[24] N. Frosst and G. Hinton. Distilling a neural network into a soft decision tree. arXiv preprint arXiv:1711.09784, 2017.

[25] L. Gerson Neuberg. Causality: models, reasoning, and inference, by Judea Pearl, Cambridge University Press, 2000. Econometric Theory, 19:675–685, 08 2003.

[26] A. Ghorbani, A. Abid, and J. Zou. Interpretation of neural networks is fragile. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3681–3688, 2019.

[27] L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, and L. Kagal. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 80–89. IEEE, 2018.

[28] A. Goldstein, A. Kapelner, J. Bleich, and E. Pitkin. Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. Journal of Computational and Graphical Statistics, 24(1):44–65, 2015.

[29] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.

[30] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

[31] B. Goodman and S. Flaxman. EU regulations on algorithmic decision-making and a "right to explanation". arXiv:1606.08813, 2016. Presented at the 2016 ICML Workshop on Human Interpretability in Machine Learning (WHI 2016), New York, NY.

[32] B. Goodman and S. Flaxman. European union regulations on algorithmic decision-making and a right to explanation. AI Magazine, 38(3):50–57, 2017.

[33] Y. Goyal, U. Shalit, and B. Kim. Explaining classifiers with causal concept effect (cace). CoRR, abs/1907.07165, 2019.

[34] Y. Goyal, Z. Wu, J. Ernst, D. Batra, D. Parikh, and S. Lee. Counterfactual visual explanations. CoRR, abs/1904.07451, 2019.

[35] R. M. Grath, L. Costabello, C. L. Van, P. Sweeney, F. Kamiab, Z. Shen, and F. Lécué. Interpretable credit application predictions with counterfactual explanations. CoRR, abs/1811.05245, 2018.

[36] R. Guo, L. Cheng, J. Li, P. R. Hahn, and H. Liu. A survey of learning causality with data: Problems and methods. arXiv preprint arXiv:1809.09337, 2018.

[37] K. S. Gurumoorthy, A. Dhurandhar, G. Cecchi, and C. Aggarwal. Efficient data representation by selecting prototypes with importance weights, 2017.

[38] M. Harradon, J. Druce, and B. E. Ruttenberg. Causal learning and explanation of deep neural networks via autoencoded activations. CoRR, abs/1802.00541, 2018.

[39] L. A. Hendricks, R. Hu, T. Darrell, and Z. Akata. Generating counterfactual explanations with natural language. CoRR, abs/1806.09809, 2018.
[40] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, volume 3, 2017.

[41] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[42] A. Hyvärinen and E. Oja. Independent component analysis: algorithms and applications. Neural networks, 13(4-5):411–430, 2000.

[43] IMDb. Imdb datasets. https://www.imdb.com/interfaces/, 2020.

[44] I. Jolliffe. Principal component analysis. Springer, 2011.

[45] A. Kanehira, K. Takemoto, S. Inayoshi, and T. Harada. Multimodal explanations by predicting counterfactuality in videos. CoRR, abs/1812.01263, 2018.

[46] N. Kilbertus, M. Rojas Carulla, G. Parascandolo, M. Hardt, D. Janzing, and B. Schölkopf. Avoiding discrimination through causal reasoning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 656–666. Curran Associates, Inc., 2017.

[47] B. Kim, R. Khanna, and O. O. Koyejo. Examples are not enough, learn to criticize! criticism for interpretability. In Advances in Neural Information Processing Systems, pages 2280–2288, 2016.

[48] B. Kim, O. Koyejo, and R. Khanna. Examples are not enough, learn to criticize! criticism for interpretability. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2280–2288, 2016.

[49] C. Kim and O. Bastani. Learning interpretable models with causal guarantees. CoRR, abs/1901.08576, 2019.

[50] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

[51] P. W. Koh, K.-S. Ang, H. H. Teo, and P. Liang. On the accuracy of influence functions for measuring group effects. arXiv preprint arXiv:1905.13289, 2019.

[52] P. W. Koh and P. Liang. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1885–1894. JMLR.org, 2017.

[53] M. J. Kusner, J. R. Loftus, C. Russell, and R. Silva. Counterfactual fairness. In I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 4069–4079, 2017.

[54] K. Lang. 20 newsgroups. http://qwone.com/~jason/20Newsgroups/, 2008.

[55] Q. V. Le. Building high-level features using large scale unsupervised learning. In 2013 IEEE international conference on acoustics, speech and signal processing, pages 8595–8598. IEEE, 2013.

[56] Y. LeCun, C. Cortes, and C. Burges. The mnist database. http://yann.lecun.com/exdb/mnist/, Jan 2020.

[57] J. Li, K. Cheng, S. Wang, F. Morstatter, R. P. Trevino, J. Tang, and H. Liu. Feature selection: A data perspective. ACM Computing Surveys (CSUR), 50(6):94, 2018.

[58] Y. Li, R. Guo, W. Wang, and H. Liu. Causal learning in question quality improvement. In 2019 BenchCouncil International Symposium on Benchmarking, Measuring and Optimizing (Bench19), 2019.

[59] Z. C. Lipton. The mythos of model interpretability. CoRR, abs/1606.03490, 2016.

[60] S. Liu, B. Kailkhura, D. Loveland, and Y. Han. Generative counterfactual introspection for explainable deep learning. CoRR, abs/1907.03077, 2019.

[61] A. V. Looveren and J. Klaise. Interpretable counterfactual explanations guided by prototypes. CoRR, abs/1907.02584, 2019.

[62] Y. Lou, R. Caruana, and J. Gehrke. Intelligible models for classification and regression. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 150–158. ACM, 2012.

[63] Y. Lou, R. Caruana, J. Gehrke, and G. Hooker. Accurate intelligible models with pairwise interactions. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '13, pages 623–631, New York, NY, USA, 2013. ACM.

[64] J. Lu, C. Xiong, D. Parikh, and R. Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 375–383, 2017.

[65] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In Advances In Neural Information Processing Systems, pages 289–297, 2016.

[66] S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774, 2017.

[67] D. Madras, E. Creager, T. Pitassi, and R. S. Zemel. Fairness through causal awareness: Learning latent-variable models for biased data. CoRR, abs/1809.02519, 2018.
[68] P. Madumal, T. Miller, L. Sonenberg, and F. Vetere. Explainable reinforcement learning through a causal lens. CoRR, abs/1905.10958, 2019.

[69] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan. A survey on bias and fairness in machine learning. arXiv preprint arXiv:1908.09635, 2019.

[70] T. Miller. Explanation in artificial intelligence: Insights from the social sciences. CoRR, abs/1706.07269, 2017.

[71] C. Molnar. Interpretable Machine Learning. 2019. https://christophm.github.io/interpretable-ml-book/.

[72] C. Molnar. Interpretable machine learning. Lulu.com, 2019.

[73] J. Moore, N. Hammerla, and C. Watkins. Explaining deep learning models with constrained adversarial examples. CoRR, abs/1906.10671, 2019.

[74] S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. Deepfool: a simple and accurate method to fool deep neural networks. CoRR, abs/1511.04599, 2015.

[75] A. Mordvintsev, C. Olah, and M. Tyka. Inceptionism: Going deeper into neural networks, 2015.

[76] R. K. Mothilal, A. Sharma, and C. Tan. Explaining machine learning classifiers through diverse counterfactual explanations. CoRR, abs/1905.07697, 2019.

[77] T. Narendra, A. Sankaran, D. Vijaykeerthy, and S. Mani. Explaining deep learning models using causal inference. CoRR, abs/1811.04376, 2018.

[78] C. Olah, A. Mordvintsev, and L. Schubert. Feature visualization. Distill, 2017. https://distill.pub/2017/feature-visualization.

[79] C. Olah, A. Satyanarayan, I. Johnson, S. Carter, L. Schubert, K. Ye, and A. Mordvintsev. The building blocks of interpretability. Distill, 2018. https://distill.pub/2018/building-blocks.

[80] N. Papernot, P. D. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami. The limitations of deep learning in adversarial settings. CoRR, abs/1511.07528, 2015.

[81] Á. Parafita and J. Vitrià. Explaining visual models by causal attribution. arXiv preprint arXiv:1909.08891, 2019.

[82] J. Pearl. Causality. Cambridge university press, 2009.

[83] J. Pearl. Theoretical impediments to machine learning with seven sparks from the causal revolution. CoRR, abs/1801.04016, 2018.

[84] J. Pearl. The seven tools of causal inference, with reflections on machine learning. Commun. ACM, 62(3):54–60, Feb. 2019.

[85] G. Plumb, D. Molitor, and A. S. Talwalkar. Model agnostic supervised local explanations. In Advances in Neural Information Processing Systems, pages 2515–2524, 2018.

[86] S. Rathi. Generating counterfactual and contrastive explanations using SHAP. CoRR, abs/1906.09293, 2019.

[87] A. Renkl. Toward an instructionally oriented theory of example-based learning. Cognitive science, 38(1):1–37, 2014.

[88] M. T. Ribeiro, S. Singh, and C. Guestrin. Model-agnostic interpretability of machine learning. arXiv preprint arXiv:1606.05386, 2016.

[89] M. T. Ribeiro, S. Singh, and C. Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144. ACM, 2016.

[90] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

[91] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618–626, 2017.

[92] U. Shalit, F. D. Johansson, and D. Sontag. Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3076–3085. JMLR.org, 2017.

[93] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

[94] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.

[95] P.-N. Tan. Introduction to data mining. Pearson Education India, 2018.

[96] G. G. Towell and J. W. Shavlik. Extracting refined rules from knowledge-based neural networks. Machine learning, 13(1):71–101, 1993.

[97] UCI. Uci machine learning repository. https://archive.ics.uci.edu/ml/index.php, 2020.

[98] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.

[99] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.

[100] U. Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395–416, 2007.
[101] S. Wachter, B. D. Mittelstadt, and C. Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. CoRR, abs/1711.00399, 2017.

[102] H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In European Conference on Computer Vision, pages 451–466. Springer, 2016.

[103] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057, 2015.

[104] F. Yang, M. Du, and X. Hu. Evaluating explanation without ground truth in interpretable machine learning, 2019.

[105] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 21–29, 2016.

[106] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy. Hierarchical attention networks for document classification. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, pages 1480–1489, 2016.

[107] YELP. Yelp dataset. https://www.yelp.com/dataset, 2020.

[108] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.

[109] J. Zhang and E. Bareinboim. Fairness in decision-making – the causal explanation formula. 02 2018.

[110] Q. Zhang, Y. Yang, H. Ma, and Y. N. Wu. Interpreting cnns via decision trees. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6261–6270, 2019.

[111] Q.-s. Zhang and S.-C. Zhu. Visual interpretability for deep learning: a survey. Frontiers of Information Technology & Electronic Engineering, 19(1):27–39, 2018.

[112] Q. Zhao and T. Hastie. Causal interpretations of black-box models. Journal of Business & Economic Statistics, (just-accepted):1–19, 2019.