A Causality Inspired Framework For Model Interpretation
• RQ3: How can the causal model be improved to overcome these challenges?

In this paper, we aim to bridge the gap between causality and explainability by studying these issues.

We first provide a causal theoretical interpretation for explanation methods including LIME [40], Shapley values [30], and CXPlain [42] (RQ1). Our analysis shows that their explanation scores correspond, to some extent, to the (average) treatment effect [36] in causal inference, and that they share the same causal graph, with only small differences such as the choice of the treatment (i.e., the perturbed features). This provides a unified view for understanding the precise meaning of their explanations and offers theoretical evidence about their advantages and limitations.

These observations allow us to summarize the core challenge in applying causal inference for model interpretation (RQ2). While it is easy for explanation methods to compute individual causal effects, e.g., understanding how much the model prediction will change when one input feature changes, the core challenge is how to efficiently discover prominent common causes that can be generalized to different instances from a large number of features and data points. Addressing this issue requires ensuring that the explanations (1) are causally sufficient for understanding model predictions and (2) can generalize to different instances. These requirements become increasingly important as the black-box model grows larger and more data points need to be explained. In this case, it is vital that the explanations correspond to common causes that generalize across many data points, so that we can save users' cognitive effort.

To solve the above challenges (RQ3), we follow important causal principles and propose the Causality Inspired Model Interpreter (CIMI)¹. Specifically, we first discuss different choices of causal graphs for model interpretation and identify the one that can address the aforementioned challenges. Based on the selected causal graph, we devise training objectives and desirable properties of our neural interpreters following important causal principles. We then show how these training objectives and desirable properties can be achieved through our CIMI framework.

Finally, we conduct extensive experiments on four datasets. The results consistently show that CIMI significantly outperforms baselines on both causal sufficiency and generalizability metrics on all datasets. Notably, CIMI's sampling efficiency is also outstanding, which makes our method particularly suitable for analyzing large pretrained language models [57]: its generalizability allows users to save cognitive effort by checking fewer new inputs, and its sampling efficiency matters because each sample (or intervention) requires a forward pass through the model.

¹The source code of CIMI is available at https://github.com/Daftstone/CIMI.

2 REVISITING XAI FROM A CAUSAL PERSPECTIVE

2.1 Preliminary about Causal Inference
We follow the common terminologies in causal inference [36] to discuss existing and our explanation methods.

Causal graph is used to formally depict causal relations. In the graph, each node is a random variable, and each directed edge represents a causal relation, meaning that the target node (child) can change in response to a change of the source node (parent).

Do-operator is a mathematical operator for intervention. In general, applying a do-operator 𝑑𝑜(𝐸 = 𝑒) on a random variable 𝐸 means that we set the random variable to value 𝑒. For example,
• 𝑃(𝑌 = 𝑦 | 𝑑𝑜(𝐸 = 𝑒)) is the probability that 𝑌 is 𝑦 when, in every instance, 𝐸 is assigned to 𝑒. This is a global intervention that happens to the whole population. In comparison, 𝑃(𝑌 = 𝑦 | 𝐸 = 𝑒) denotes the probability that 𝑌 is 𝑦 on the subpopulation where 𝐸 is observed to be 𝑒.
• 𝑃(𝑌 = 𝑦 | 𝑑𝑜(𝐸 = 𝑒), 𝐶 = 𝑐) applies the do-operator to the subpopulation where the random variable 𝐶 has value 𝑐.

Treatment effect is an important method to quantify how much causal effect a random variable 𝐸 has on 𝑌. Suppose that 𝐸 is binary; the average treatment effect 𝑇 of 𝐸 on 𝑌 is

𝑇(𝑌 | 𝑑𝑜(𝐸)) = E𝑐 𝑇(𝑌 | 𝑑𝑜(𝐸), 𝐶 = 𝑐) = E𝑐 (𝑌(𝑑𝑜(𝐸 = 1), 𝐶 = 𝑐) − 𝑌(𝑑𝑜(𝐸 = 0), 𝐶 = 𝑐)),   (1)

where 𝑌(𝑑𝑜(𝐸 = 𝑒), 𝐶 = 𝑐) represents the value of 𝑌 when 𝐸 is set to 𝑒 and all other causes 𝐶 are fixed to 𝑐.

2.2 Causal Graph of Existing XAI Methods

[Figure 1: three causal graphs over the variables 𝐶, 𝑋, 𝐸, 𝑈, and 𝑌ˆ, panels (a)–(c).]
Figure 1: (a) The causal graph for existing methods, in which explanation 𝐸 is not the sole cause of model prediction 𝑌ˆ; (b) another causal graph, in which explanations are causally sufficient for prediction but not generalizable; (c) our proposed model, in which explanation 𝐸 is generalizable and modeled as the only cause of 𝑌ˆ. Observed variables are shaded in blue.

Revisiting existing methods from the causal perspective allows us to show that many well-known perturbation-based methods such as LIME [40], Shapley values [30], and CXPlain [42] actually compute or learn the treatment effect, and that their causal graph corresponds to the one shown in Fig. 1(a). Notably, here we only briefly summarize the commonalities and differences among these XAI methods by presenting the main intuition behind the mathematical analysis; the formal theoretical analysis can be found in Appendix A of our full-version paper².

²Since the supplementary material exceeds the space limit of two pages, we put all of the supplementary material into the full version of the paper, which can be found at https://github.com/Daftstone/CIMI/blob/master/paper.pdf.

In the causal graph shown in Fig. 1(a), 𝐸 corresponds to the specific treatment, characterized by one feature (or a set of features) to be perturbed. By 𝑑𝑜(𝐸 = 1), these methods include the feature in the input, while 𝑑𝑜(𝐸 = 0) does the opposite. Then, they obtain the model's outcome 𝑌ˆ when 𝐸 is changed and compute the treatment
effect 𝑇(𝑌ˆ | 𝑑𝑜(𝐸)) = 𝑌ˆ(𝑑𝑜(𝐸 = 1), 𝐶 = 𝑐) − 𝑌ˆ(𝑑𝑜(𝐸 = 0), 𝐶 = 𝑐), where 𝐶 denotes the context concerning 𝐸, or, more intuitively, the features that remain unchanged after changing 𝐸. The treatment effect then composes (or is equal to) the explanation weight, revealing the extent to which the feature can be considered in the explanation, or the "contribution" of each feature to the model prediction. It is worth noting that the do-operator here is directly applied to the data points to collect experimental outcomes, which is different from the traditional practice of modeling confounders and converting a causal estimand into a statistical estimand.

Although all three methods can be summarized using the framework in Fig. 1(a), they differ slightly in the following aspects. It is worth emphasizing that this unified view allows us to easily compare the pros and cons of each work.
• Intervened features 𝐸. CXPlain and Shapley values only consider one feature as 𝐸, while LIME uses a set of features as 𝐸 for testing. Thus, the former two methods cannot measure the causal contribution of a set of features without further extension or assumptions.
• Context 𝐶. Shapley values consider all subsets of features as possible contexts, while the other methods take the input instance 𝑥 as the major context. Accordingly, Shapley values compute the average treatment effect over all contexts (i.e., all possible subsets of features), while the others consider individual treatment effects. While individual treatment effects may be computed more efficiently and have a more precise meaning, their ability to generalize to similar inputs may be significantly reduced.
• Model output 𝑌ˆ. Most methods track changes in model predictions, while CXPlain observes how the input changes the error of the model prediction. Thus, CXPlain may be more useful for debugging, while the others may be more suitable for understanding model behavior.

3 METHODOLOGY

3.1 Causal Graph
Causal insufficiency of explanations in Fig. 1(a). From the previous section, we have seen that existing work adopts the causal graph in Fig. 1(a). The major issue of this framework is that the model prediction 𝑌ˆ is determined by both the explanation and the context; in other words, the explanation 𝐸 is not the sole cause of 𝑌ˆ. Thus, even if users have carefully checked the explanations, the problem remains as long as the specific context is a potential cause of the model prediction, so the real, complete reason for the model prediction cannot be seen.

Solving the causal insufficiency issue. The causal insufficiency of explanations may be addressed by removing the context as a cause of the model prediction. Fig. 1(b) and (c) show two possible causal graphs that solve this issue. Here, 𝑋 denotes the random variable for input instances. 𝐸 and 𝑈 are unknown random variables for explanations and non-explanations respectively, where 𝐸 = 𝑥𝑒 means that the explanation for 𝑋 = 𝑥 is 𝑥𝑒, and 𝑈 = 𝑥𝑢 means that the non-explanation for 𝑋 = 𝑥 is 𝑥𝑢. In both causal graphs, 𝑌ˆ has only one parent (cause), which is the explanation, making the explanation sufficient for the model prediction.

Issue of explanations' generalizability. While both causal graphs allow explanations to be the only cause of model predictions, Fig. 1(b) fails to model the explanation's generalizability: in this causal graph, the explanation may change in arbitrary ways when 𝑋 changes. Generalizability is very important for model interpretability because it helps to foster human trust and reduce human effort. Taking a pathological detector as an example, it would be quite disconcerting if entirely different crucial regions of the same patient were detected at different sectional planes. Prominent common causes that can be generalized to various instances help avoid the high cost of repetitive explanation generation and human investigation for similar instances.

Our choice. Considering the above, we choose the causal graph in Fig. 1(c), which resembles the Domain Generalization causal graph [31] and follows its common cause principle to build a shared parent node (in our case 𝐸) for two statistically dependent variables (in our case 𝑋 and 𝑌ˆ). In this causal graph, it is evident that alterations to the non-explanatory variable 𝑈 have no impact on the explanation 𝐸 or the prediction 𝑌ˆ, only resulting in slight variations in 𝑋. This demonstrates the stability of the explanation across different instances of 𝑋 and its sufficiency as a cause of the model prediction 𝑌ˆ, as 𝐸 is the only determining factor (parent) of 𝑌ˆ.

3.2 Causality Inspired Problem Formulation
Given the causal graph in Fig. 1(c), we aim to learn the unobserved causal factors 𝐸 and 𝑈, where 𝐸 denotes the generalizable causal explanation for the model prediction and 𝑈 denotes the non-explanation. Following the common assumption of existing feature-attribution-based explanations, we assume that 𝐸 and 𝑈 can be mapped into the input space of 𝑋. More specifically, we assume that 𝐸 is the set of features in 𝑋 that influence 𝑌ˆ, while 𝑈 = 𝑋\𝐸 is the set of other features in 𝑋 that are not included in 𝐸. Equivalently, 𝐸 and 𝑈 can be represented by learning masks 𝑀 over 𝑋:
• 𝐸 = 𝑀 ⊙ 𝑋, where ⊙ is element-wise multiplication and 𝑀𝑖 = 1 means that the 𝑖-th feature in 𝑋 is included in the explanation.
• 𝑈 = (1 − 𝑀) ⊙ 𝑋, where 𝑀𝑖 = 0 means that the 𝑖-th feature in 𝑋 is included in the non-explanation.

Our goal is to learn a function 𝑔 : 𝑋 → 𝑀 that takes an instance 𝑋 = 𝑥 as input and outputs the masks representing the causal factor 𝐸 and the non-causal factor 𝑈. The function 𝑔 is the interpreter in this paper³.

³Although 𝑔 : 𝑋 → 𝑀 and the flow in Fig. 1(c) appear to be reversed, this is reasonable because 𝑀 = 𝑔(𝑥) is a normal symmetric equation. Since the direction of flows in our framework does not imply causal direction, defining 𝑔 is valid as long as 𝑋 → 𝑀 is a many-to-one (or one-to-one) mapping, which is exactly our case.

In our work, we relax 𝑀 ∈ {0, 1} to [0, 1] (i.e., we do not discretize the probability vector of 𝑔), which not only enables end-to-end training of our neural interpreter but also distinguishes the different contributions of features to the output. We also try discretizing 𝑀 using a deep hashing technique [26]; see Section 4.6 for the comparison and discussion.

3.3 Optimization Principles and Modules
It is impractical to directly reconstruct the causal mechanism in the causal graph of Fig. 1(c), since important causal factors are unobservable and ill-defined [31]. However, causal factors in causal graphs need to follow clear principles. We use the following two main principles from causal inference to devise desirable properties.
Principle 1 (Humean's Causality Principle [12])⁴: There exists a causal link 𝑥𝑖 → 𝑦ˆ if the use of all available information results in a more precise prediction of 𝑦ˆ than using information excluding 𝑥𝑖, all causes for 𝑦ˆ are available, and 𝑥𝑖 occurs prior to 𝑦ˆ.

⁴Although the principle requires a clear occurrence order of variables, we follow it only to determine the relationship between 𝐸/𝑈 and 𝑌ˆ, which can be satisfied.

Principle 2 (Independent Causal Mechanisms Principle [38]): The conditional distribution of each variable given its causes (i.e., its mechanism) does not inform or influence the other mechanisms.

Accordingly, we design three modules to ensure that the extracted explanations (causal factors) satisfy the basic properties required by Principles 1 and 2.
• Causal Sufficiency Module. Following Principle 1, we desire to discover an 𝐸 that is causally sufficient for 𝑌ˆ by ensuring that 𝐸 contains all the information needed to predict 𝑌ˆ and explains the dependency between 𝑋 and 𝑌ˆ. Similarly, we also ensure that 𝑈 is causally insufficient for predicting 𝑌ˆ.
• Causal Intervention Module. Following Principle 2, we ensure that 𝑈 and 𝐸 are independent by intervening on 𝑈 and guaranteeing that the learned 𝑔(𝑋) = 𝐸 does not change accordingly. This also allows us to find explanations that generalize better to varying cases.
• Causal Prior Module. Following Principle 1, we facilitate the learning of explanations by using potential causal hints as inputs to the interpreter and weakly supervising its output causal masks 𝑀. These learning priors enable faster and easier learning.

3.3.1 Causal Sufficiency Module. According to Principle 1, to ensure that 𝐸 is a sufficient cause of 𝑌ˆ, it is necessary to guarantee that 𝐸, rather than the other features 𝑈, is the most suitable feature set for predicting 𝑌ˆ = 𝑓(𝑋). In other words, 𝑥𝑒 can always predict 𝑓(𝑥) through an optimal function 𝑓′ that maps the explanation 𝑥𝑒 to 𝑓(𝑥), while the non-explanation 𝑥𝑢 cannot give meaningful information for predicting 𝑓(𝑥). Accordingly, the causal sufficiency loss can be modeled as follows:

ℒ𝑠′ = min𝑓′ E𝑥 (ℓ(𝑓(𝑥), 𝑓′(𝑥𝑒)) − ℓ(𝑓(𝑥), 𝑓′(𝑥𝑢))),   (2)

where ℓ(·) is the mean squared error loss, 𝑥𝑒 = 𝑔(𝑥) ⊙ 𝑥, 𝑥𝑢 = (1 − 𝑔(𝑥)) ⊙ 𝑥, and 𝑥 is sampled from the entire model input space.

In practice, finding the optimal 𝑓′ directly is very difficult due to the vast and sometimes even continuous input space. The interaction between optimizing 𝑓′ and the interpreter 𝑔 may also easily lead to unstable training and difficulty in converging to an optimal solution [8]. To address this issue, we approximate the optimal 𝑓′ by 𝑓, under the assumption that the difference between 𝑓′ and 𝑓 is minimal, considering that the explanation 𝑥𝑒 lies in the same space as the original model inputs 𝑋. By setting 𝑓′ to 𝑓, we are actually minimizing each individual treatment effect, which has a precise causal meaning. Besides, since we do not have to optimize 𝑓′, we may need far fewer samples 𝑥′ to optimize ℒ𝑠′ and learn an interpreter 𝑔. In summary, the causal sufficiency loss ℒ𝑠 is rewritten as follows:

ℒ𝑠 = E𝑥 (ℓ(𝑓(𝑥), 𝑓(𝑥𝑒)) − ℓ(𝑓(𝑥), 𝑓(𝑥𝑢))),   (3)

where 𝑥𝑒 = 𝑔(𝑥) ⊙ 𝑥 and 𝑥𝑢 = (1 − 𝑔(𝑥)) ⊙ 𝑥.

3.3.2 Causal Intervention Module. Following Principle 2, we desire 𝑈 and 𝐸 to be independent, which makes it possible to find the invariant explanations of neighboring instances and improve the interpreter's generalizability. Despite the lack of true explanations for supervised training, we have the prior knowledge that the learned interpreter 𝑔 should be invariant to interventions on 𝑈, that is, 𝑑𝑜(𝑈) does not affect 𝐸. Based on this prior knowledge, we design a causal intervention loss to separate explanations.

First, we describe how to intervene on 𝑈. Following common practice [31] in causal inference, we perturb the non-explanation 𝑥𝑢 via a linear interpolation between the non-explanation positions of the original instance 𝑥 and another instance 𝑥′ sampled randomly from 𝑋. The intervention paradigm is as follows:

𝑥𝑖𝑛𝑡 = 𝑔(𝑥) ⊙ 𝑥 + (1 − 𝑔(𝑥)) ⊙ ((1 − 𝜆) · 𝑥 + 𝜆 · 𝑥′),   (4)

where the first term is the invariant explanation, the second term is the intervened non-explanation, 𝜆 ∼ 𝑈(0, 𝜖), and 𝜖 limits the magnitude of the perturbation. Furthermore, we can optimize the following causal intervention loss to ensure that 𝑈 and 𝐸 are independent:

ℒ𝑖 = E𝑥 ℓ(𝑔(𝑥), 𝑔(𝑥𝑖𝑛𝑡)).   (5)

This loss ensures that the generated explanations do not change before and after intervening on non-explanations. This invariance property guarantees local consistency of explanations, i.e., interpreters should generate consistent explanations for neighboring (or similar) data points. This coincides with the smooth-landscape assumption on the loss function of deep learning models [27], which may help to capture more generalizable features and improve the generalizability of the interpreter.

3.3.3 Causal Prior Module. To facilitate the learning of the interpreter, we 1) inject potential causal hints into the neural network of the interpreter, and 2) design a weakly supervising loss on the output causal masks 𝑀.

Interpreter neural network design. A core challenge in XAI is the lack of prior knowledge about which architecture should be used for the interpreter [42]. When we learn an interpreter with a neural network, it is difficult to decide which network structure should be used. If the architecture of 𝑔 is not as expressive and complex as the black-box model 𝑓, how can we be sure that 𝑔 has the ability to understand the original black box 𝑓? If 𝑔 is more complicated than 𝑓, it is prone to slow training and overfitting.

Our solution to this problem is inspired by Principle 1, which states that causes (the model 𝑓) are more effective in predicting the effects (the explanation 𝑥𝑒). Hence, we generate the explanation 𝑥𝑒 by directly utilizing the parameters of the black-box model 𝑓. To achieve this, we use the encoding part of the black-box model 𝑓 (denoted as 𝑓𝑒) as the encoder in our interpreter model 𝑔. The decoder of 𝑔 is a simple neural network, denoted as 𝜙. The ease of learning is supported by information bottleneck theory, which states that information in each layer decreases as we progress through the model [13]. Therefore, the input 𝑥 contains the most information, while 𝑓𝑒(𝑥) contains less information, as the information deemed unnecessary for prediction has been removed. The final prediction and the ground-truth explanation use the least amount of information.
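To make the two objectives above concrete, the following is a minimal PyTorch-style sketch of the losses introduced in Sections 3.3.1 and 3.3.2 (Eqs. (3)-(5)). It is an illustration under our own assumptions rather than the released implementation: the input 𝑥 is treated as a continuous feature tensor (e.g., token embeddings) so that the element-wise masking is directly expressible, ℓ(·) is the mean squared error as stated in the text, and all function and argument names are ours.

```python
import torch
import torch.nn.functional as F

def causal_sufficiency_loss(f, g, x):
    """Eq. (3): the explanation x_e should reproduce f(x) well; the non-explanation x_u should not.
    f: frozen black-box model applied to (masked) inputs; g: interpreter returning a soft mask in [0, 1]."""
    m = g(x).unsqueeze(-1)                     # per-token mask, broadcast over the feature dimension
    x_e, x_u = m * x, (1.0 - m) * x            # explanation and non-explanation parts of x
    with torch.no_grad():
        y = f(x)                               # original prediction, used as the target
    return F.mse_loss(f(x_e), y) - F.mse_loss(f(x_u), y)

def intervene_non_explanation(g, x, x_prime, eps=0.2):
    """Eq. (4): keep the explanation part of x and interpolate the rest toward another instance x'."""
    m = g(x).unsqueeze(-1)
    lam = torch.rand(1, device=x.device) * eps # lambda ~ U(0, eps)
    return m * x + (1.0 - m) * ((1.0 - lam) * x + lam * x_prime)

def causal_intervention_loss(g, x, x_prime, eps=0.2):
    """Eq. (5): the generated mask should not change when only the non-explanation is intervened on."""
    x_int = intervene_non_explanation(g, x, x_prime, eps)
    return F.mse_loss(g(x), g(x_int))
```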
[Figure 2 overview: the black-box model 𝑓 = 𝑓𝑑 ∘ 𝑓𝑒 and the interpreter 𝑔 = 𝜙 ∘ 𝑓𝑒, connected by the causal sufficiency loss ℒ𝑠 = ℓ(𝑓(𝑥), 𝑓(𝑥𝑒)) − ℓ(𝑓(𝑥), 𝑓(𝑥𝑢)), the causal intervention loss ℒ𝑖 = ℓ(𝑔(𝑥), 𝑔(𝑥𝑖𝑛𝑡)), and the weakly supervising loss ℒ𝑝 computed between 𝑔(𝑥) and 𝑔𝑥′(𝑥); 𝑓𝑒(·) is a fixed component of the black box and only 𝜙(·) is trainable.]
Figure 2: The framework of CIMI. The only trainable component is the decoder 𝜙, which is a simple neural network that can be
trained with a relatively small number of samples.
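As a concrete reading of the figure, below is a minimal sketch of the interpreter 𝑔 = 𝜙 ∘ 𝑓𝑒 anticipating Eq. (6): the frozen black-box encoder 𝑓𝑒 produces contextual embeddings, these are concatenated with the instance embeddings 𝑣𝑥, and the small trainable decoder 𝜙 (matching the 1-layer LSTM with hidden size 64 and the 64×16, 16×2 MLP described later in Section 3.3.4) outputs a per-token explanation probability. The class layout, the ReLU between the MLP layers, and the assumption that 𝑓𝑒 accepts embeddings and returns same-shaped contextual embeddings are our own simplifications.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """phi: the only trainable component -- a 1-layer LSTM (hidden size 64) and a 2-layer MLP (64x16, 16x2)."""
    def __init__(self, d_model, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(2 * d_model, hidden, num_layers=1, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(hidden, 16), nn.ReLU(), nn.Linear(16, 2))

    def forward(self, h):                                     # h: (batch, |x|, 2d)
        out, _ = self.lstm(h)                                 # (batch, |x|, hidden)
        return torch.softmax(self.mlp(out), dim=-1)[..., 1]   # P(token is an explanation), (batch, |x|)

class Interpreter(nn.Module):
    """g(x) = phi([f_e(x); v_x]_1) as in Eq. (6); f_e is the frozen encoder of the black box."""
    def __init__(self, f_e, d_model):
        super().__init__()
        self.f_e = f_e
        for p in self.f_e.parameters():                       # the encoder stays fixed; only phi is learned
            p.requires_grad_(False)
        self.phi = Decoder(d_model)

    def forward(self, v_x):                                   # v_x: instance embeddings, (batch, |x|, d)
        h = torch.cat([self.f_e(v_x), v_x], dim=-1)           # [f_e(x); v_x] -> (batch, |x|, 2d)
        return self.phi(h)                                    # soft explanation mask M in [0, 1]
```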
Consequently, compared with 𝑋, the last embedding-layer output 𝑓𝑒(𝑋) is a better indicator for finding the explanation.

Based on this observation, we design 𝜙 so that its input concatenates the encoded embedding 𝑓𝑒(𝑥) ∈ R^{|𝑥|×𝑑} and the original instance embedding 𝑣𝑥 ∈ R^{|𝑥|×𝑑} along axis 1, i.e., [𝑓𝑒(𝑥); 𝑣𝑥]₁ ∈ R^{|𝑥|×2𝑑}, where 𝑑 is the embedding dimension and the operator [𝑎; 𝑏]ᵢ denotes the axis 𝑖 along which matrices 𝑎 and 𝑏 are joined. Therefore, the decoder 𝜙 maps the input [𝑓𝑒(𝑥); 𝑣𝑥] ∈ R^{|𝑥|×2𝑑} to [0, 1]^{|𝑥|×1}, and the 𝑖-th dimension of the output represents the probability that token 𝑖 is used for the explanation. 𝜙 can be any neural network. In summary, the interpreter 𝑔 can be reformulated as

𝑔(𝑥) = 𝜙([𝑓𝑒(𝑥); 𝑣𝑥]₁).   (6)

By setting the encoder in 𝑔 to 𝑓𝑒, the architecture of 𝑔 can be as complicated as 𝑓, and such a complex structure helps to fully understand the model's decision-making mechanism. At the same time, 𝑔 can also be considered simple, because the parameters of 𝑓𝑒 in 𝑔 are fixed and only the decoder 𝜙 is learnable, requiring only a few additional parameters (a 1-layer LSTM + 2-layer MLP in our paper) and avoiding the issues of overfitting and high training cost.

Weakly supervising loss. Without a further regularization loss on the causal factors, there exists a trivial solution (i.e., all explanation masks set to 1) that makes the interpreter collapse. A common regularization in causal discovery is a sparsity loss, which requires the number of involved causal factors to be small [5]. However, such a sparsity loss may fail to adapt to the different requirements of different instances, as the constraints are the same for complicated sentences and simple sentences. This makes it difficult to tune the hyper-parameters for different datasets.

To tackle this issue, we leverage noisy ground-truth labels as a prior on the causal factor 𝐸 to guide the learning process. Our approach is based on the intuition that the explanation for 𝑥 should contain more information about 𝑥 itself than about another instance 𝑥′. Using this, we derive a weakly supervising loss by maximizing the probability that a token in instance 𝑥 is included in 𝑥𝑒, while minimizing the probability that a token not in 𝑥 (noise) is predicted to be the explanation:

ℒ𝑝 = E𝑥,𝑥′,𝑥≠𝑥′ log 𝜎(𝑔(𝑥) − 𝑔𝑥′(𝑥)),   (7)

where 𝑔𝑥′(𝑥) denotes pairing 𝑓𝑒(𝑥) with the embedding of 𝑥′ in 𝑔(𝑥) (cf. Eq. 6), that is, 𝑔𝑥′(𝑥) = 𝜙([𝑓𝑒(𝑥); 𝑣𝑥′]₁). Correspondingly, 𝑔𝑥(𝑥) = 𝜙([𝑓𝑒(𝑥); 𝑣𝑥]₁) = 𝑔(𝑥), and the subscript in 𝑔𝑥(𝑥) is omitted for simplicity. This weakly supervising loss prevents the interpreter from overly optimistically predicting all tokens as explanations, which helps avoid trivial solutions.

3.3.4 Overall Framework and Optimization. Overall loss function. Combining the above three modules, the overall optimization objective of CIMI is summarized as follows, and the framework is shown in Fig. 2:

min𝜙 ℒ𝑠 + 𝛼 ℒ𝑖 + ℒ𝑝,   (8)

where 𝛼 is the trade-off parameter. Notably, the weakly supervising loss is introduced to avoid the difficulty of tuning regularization parameters, so this term does not require a trade-off parameter.

Analysis of the framework. As shown in Fig. 2, the only trainable parameter in our framework is the simple decoder in the interpreter 𝑔, which uses a 1-layer LSTM (hidden size 64) and a 2-layer MLP (64 × 16 and 16 × 2). This enables us to learn the interpreter efficiently with a small number of forward propagations through 𝑓. The validity of our framework can be further verified by considering information bottleneck theory, which says that during forward propagation, a neural network gradually focuses on the most important parts of the input by filtering out, layer by layer, information that is not useful for prediction [13]. According to this theory, setting the first part of the interpreter to the encoder 𝑓𝑒 of the black-box model enables the interpreter to discard the large portion of noisy information that has already been filtered by the black-box encoder, thus allowing us to learn the explanations more efficiently and faithfully. A more formal description of the validity of our framework is given in Appendix C.
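Putting the three modules together, the sketch below shows one optimization step of Eq. (8). It reuses the hypothetical loss functions and Interpreter class from the earlier sketches; the weakly supervising loss follows Eq. (7), with 𝑔𝑥′(𝑥) obtained by pairing the frozen encoding 𝑓𝑒(𝑥) with the embeddings of another instance 𝑥′, and we minimize the negative log-sigmoid so that minimizing the overall objective maximizes the probability described in the text. Batch construction, the choice of 𝑥′, and the optimizer settings are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def weak_supervision_loss(interpreter, v_x, v_xp):
    """Eq. (7): a token of x should score higher with its own embeddings v_x than when the
    decoder is fed the embeddings v_x' of a different instance (the noisy negative)."""
    h = interpreter.f_e(v_x)                                  # frozen contextual encoding of x
    g_x  = interpreter.phi(torch.cat([h, v_x],  dim=-1))      # g_x(x)  = g(x)
    g_xp = interpreter.phi(torch.cat([h, v_xp], dim=-1))      # g_{x'}(x)
    return -F.logsigmoid(g_x - g_xp).mean()

def cimi_training_step(f, interpreter, optimizer, v_x, v_xp, alpha=1.0, eps=0.2):
    """One step of Eq. (8): min_phi  L_s + alpha * L_i + L_p.
    Only the decoder phi is updated; f and f_e stay frozen."""
    loss = (causal_sufficiency_loss(f, interpreter, v_x)
            + alpha * causal_intervention_loss(interpreter, v_x, v_xp, eps)
            + weak_supervision_loss(interpreter, v_x, v_xp))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative usage: optimizer = torch.optim.Adam(interpreter.phi.parameters(), lr=1e-5),
# where v_x, v_xp are the embeddings of a minibatch x and of randomly paired instances x'.
```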
Table 1: Faithfulness comparison when explaining BERT. *** indicates that our method's improvements over the best baseline results are statistically significant with 𝑝 < 0.001.
Relationship with disentanglement. Our method falls into the group of methods that disentangle the causal effects of variables [18], while most existing disentanglement methods focus on disentangling the latent representations [53]. The additional causal perspective is essential for ensuring the extraction of common causes of the model prediction, which improves both explanation faithfulness and generalizability. To better illustrate the effectiveness, we implement a variant of CIMI that uses the representation disentanglement loss [53], which can be found in Appendix E.7.

4 EXPERIMENTS

4.1 Experimental Setup
4.1.1 Datasets. We use four datasets from the natural language processing domain, including Clickbait [2], Hate [9], Yelp [56], and IMDB [32]. See Appendix D.1 for their details.

4.1.2 Black-box Models and Baselines. Although pre-trained models are brilliant in many fields, it is difficult to answer what information they encode [25]. For this reason, we choose two well-known pre-trained models, BERT [10] and RoBERTa [29], as the black-box models to be explained. Notably, the main body shows the experimental results on BERT; the results for RoBERTa can be found in Appendices E.2 and E.3.

We compare CIMI with Gradient [47], Attention [3], LIME [40], KernelSHAP [30], Rationale [24], Probing [1], CXPlain [42], AXAI [39], and Smask [27]. Their details can be found in Appendix D.3. Although there are some works [28, 37] that can be used to extract causal explanations, they often make strict assumptions about the underlying data format, so they cannot be compared fairly; we put that comparison in Appendix E.1.

4.1.3 Evaluation Metrics. First, we evaluate the causal sufficiency using three faithfulness metrics (see Appendix D.4 for more details).

Decision Flip - Fraction of Tokens (DFFOT) [43], which measures the minimum fraction of important tokens that need to be erased in order to change the model prediction.

Comprehensiveness (COMP) [11], which measures the faithfulness score by the change in the output probability of the originally predicted class after the important tokens are removed.

Sufficiency (SUFF) [11], which, in contrast to COMP, keeps only the important tokens and compares the changes in output probability for the originally predicted class. Notably, this metric is not equivalent to the causal sufficiency we focus on.

The number of important tokens is selected from {1, 5, 10, 20, 50} and the average performance is taken. Notably, SUFF and COMP have been proven to be more faithful to the model prediction than other metrics [6]. In addition to the above metrics, we use AvgSen to measure the explanation's generalizability.

Average Sensitivity (AvgSen) [52], which measures the average sensitivity of an explanation when the input is perturbed. In our experiments, we replace 5 tokens per instance and calculate the sensitivity of the top-10 important tokens.

In addition, we present generated explanation instances in Appendix E.9 for a more intuitive evaluation.

4.1.4 Parameter Settings. For the two pre-trained models used, BERT and RoBERTa, we add a two-layer MLP as the decoder for the downstream tasks. On all four datasets, the optimization is based on Adam with a learning rate of 1e−5, the number of training epochs is 20, and the batch size is 8. For the proposed CIMI, unless otherwise specified, we train for 100 epochs on Clickbait and Hate, and for 50 epochs on the other two larger datasets to improve efficiency. The trade-off parameter 𝛼 is set to 1, 1, 1, and 0.1 on Clickbait, Hate, Yelp, and IMDB, respectively. In addition, the perturbation magnitude 𝜖 in the causal intervention module is set to 0.2.

4.2 Faithfulness Comparison
In this section, we evaluate the causal sufficiency of the explanations using faithfulness metrics. Table 1 summarizes the average results of 10 independent repeated experiments.

First, it can be seen that the proposed method achieves the best or comparable results compared with the baselines on the various datasets. In particular, the improvement is more pronounced on more complex datasets (from Clickbait to IMDB). For example, the improvement over the best baseline reaches 119% on IMDB w.r.t. the DFFOT metric. Such an invaluable property can adapt to the increasing complexity of black-box models. This gratifying result verifies that CIMI can generate explanations that are more faithful to
[Figure 3: COMP under different explanation lengths for Gradient, Attention, AXAI, Probing, LIME, KernelSHAP, Rationale, CXPlain, Smask, and CIMI on (a) Clickbait, (b) Hate, (c) Yelp, and (d) IMDB.]
Table 2: Generalizability comparison under BERT. IMP indicates the improvement of our method compared to baselines.
the model. Second, Gradient has impressive performance in some cases, which indicates that its linear assumption can reflect the model's decision-making process to some extent. Third, among the perturbation-based methods, LIME, KernelSHAP, and CXPlain all show satisfactory performance, especially LIME, which is based on local linear approximation; this once again verifies the rationality of the first finding, the linear assumption about the model.

In addition, we also illustrate the performance w.r.t. COMP under different explanation lengths, as shown in Fig. 3 (similar findings are obtained for SUFF). The experimental results show that, regardless of explanation length, CIMI exhibits significant competitiveness. The above results demonstrate the power of the causal principle constraints in CIMI for understanding model predictions.

4.3 Generalizability Comparison
We use AvgSen to evaluate the generalizability of the explanations to neighboring (similar) samples. It is undeniable that, for AvgSen, some important tokens included in the explanation may be replaced, but the probability is low, especially on Yelp and IMDB, which have more tokens. The results are summarized in Table 2. It can be found that the explanations generated by CIMI are the most generalizable. Specifically, on the four datasets, at least 8 of the top-10 important tokens before and after perturbation are consistent, which no other method achieves. Besides, as the datasets become more complex, our performance remains stable while the baselines degrade significantly. These results demonstrate the outstanding ability of the proposed method to capture invariant, generalizable features. Additionally, we conduct a generalizability evaluation in attack settings [45, 50]; see Appendix E.8 for the comparison.

4.4 Effectiveness of Causal Modules
4.4.1 Effectiveness w.r.t. Faithfulness. In this section, we verify the effectiveness of the proposed three causal modules concerning faithfulness. We define the versions that remove the causal sufficiency loss, the causal intervention loss, the weakly supervising loss, and the interpreter's encoder 𝑓𝑒 as CIMI-s, CIMI-i, CIMI-p, and CIMI-f, respectively. Their effects on faithfulness are shown in Fig. 4. Overall, removing any module leads to performance degradation, which justifies the design of the three modules. Specifically, first, we find that removing the causal sufficiency module (CIMI-s) hurts faithfulness performance, which matches the original intention of this module: ensuring that the explanations 𝐸 are causally sufficient for the model predictions 𝑌ˆ. Second, the impact of the causal intervention module (CIMI-i) on faithfulness is marginal, since this module is designed primarily for the explanation's generalizability. Finally, both the weakly supervising loss and the interpreter's encoder design in the causal prior module help the model learn more easily.

4.4.2 Effectiveness w.r.t. Generalizability. In this section, we discuss the effect of the three causal modules on generalizability. Keeping the experimental settings consistent with Section 4.3, the results are illustrated in Fig. 5. First, we find that the causal sufficiency module helps improve generalizability against perturbations on Clickbait and Hate, but significantly degrades performance on Yelp and IMDB. We suspect that on the latter two datasets, CIMI-s explanations are not faithful to the model prediction (Fig. 4 (c)(d)); it is then difficult to capture the invariant explanations of model decisions from similar instances, resulting in a decrease in generalizability.
Figure 4: The faithfulness effect of the causal modules concerning COMP and SUFF, on (a) Clickbait, (b) Hate, (c) Yelp, and (d) IMDB.
[Figure 6: performance versus the number of samples (0–100) on (a) Clickbait, (b) Hate, (c) Yelp, and (d) IMDB.]
Figure 6: Performance comparison with different numbers of samplings (perturbations) w.r.t. SUFF.

[Figure: COMP, SUFF, and AvgSen on (a) Clickbait, (b) Hate, (c) Yelp, and (d) IMDB.]

[Figure 8: classification accuracy and loss over training epochs for Origin, LIME, and CIMI.]
Figure 8: Usefulness evaluation. Classification performance comparison before and after deleting shortcuts.

5 RELATED WORK
The self-explaining method focuses on building model architectures that are self-explainable and transparent [51], such as decision trees [22], rule-based models [49], and self-attention mechanisms [3, 55]. In order to provide rules that are easy for humans to understand, they are often too simple to enjoy both interpretability and predictive performance [23]. Recently, methods that integrate add-on modules have received increasing attention [7, 15, 20, 23]. However, the process of generating explanations remains opaque.

Post-hoc interpretation has received more attention as models have gradually evolved into incomprehensible, highly nonlinear forms [51]. Gradient-based methods [30, 44, 47, 48, 54] approximate the deep model as linear and accordingly use the gradient as the feature importance. Admittedly, the gradient is only an approximation of the decision sensitivity. The influence function [21] has also been introduced to understand models; it efficiently approximates the impact of perturbations of the training data through a second-order optimization strategy. Recently, causal interpretability has attracted increasing attention because it focuses on a fundamental question in XAI: whether existing explanations capture spurious correlations or remain faithful to the underlying causes of model behavior. First, many well-known perturbation-based methods such as Shapley values [30], LIME [40], and Smask [27] implicitly use causal inference, and their explanatory scores correspond exactly to the (average) treatment effect [36]. From a causal point of view, the slight differences between them lie only in the number of features selected, the contextual information considered, and the model output. CXPlain [42] explicitly considers that non-informative features should have no effect on model predictions.

6 CONCLUSION
We reinterpreted some classic methods from causal inference and analyzed their pros and cons from this unified view. Then, we revealed the major challenges in leveraging causal inference for interpretation: causal sufficiency and generalizability. Finally, based on a suitable causal graph and important causal principles, we devised training objectives and desirable properties for our neural interpreters and presented an efficient solution, CIMI. Through extensive experiments, we demonstrated the superiority of the proposed method in terms of the explanation's causal sufficiency and generalizability, and additionally explored the potential of explanation methods to help debug models.

ACKNOWLEDGMENTS
The work was supported by grants from the National Key R&D Program of China (No. 2021ZD0111801) and the National Natural Science Foundation of China (No. 62022077).
[56] Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. NeurIPS'15 (2015).
[57] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A Survey of Large Language Models. arXiv:2303.18223 [cs.CL].

A SUPPLEMENTARY MATERIAL
The source code of CIMI is available at https://github.com/Daftstone/CIMI, and the full version of the paper (including the main text and all appendices) is available at https://github.com/Daftstone/CIMI/blob/master/paper.pdf.