1 Introduction
The benefits of customer retention are well documented and studied, and have led to calls for
proactive churn management across a variety of industries (Ascarza et al., 2017). Gallo (2014)
documents that acquiring a new customer is usually 5-25 times more costly than retaining an
existing one. A case from financial services shows that a 5% increase in retention raised profits by
more than 25% (Reichheld and Detrick, 2003). Additionally, CAC (Customer Acquisition Cost) has
grown by nearly 50% over the past five years (Campbell, 2019).
Traditional survival models can accommodate the censoring of observations and a range of distributional forms (Wang et al., 2019). However, with
easier access to more complex data, the limitations of traditional survival models are also becoming
more apparent (Fader and Hardie, 2009; Ascarza et al., 2017). Some of these limitations include:
survival models can only take a few high signal variables that are engineered by experts and based
on existing theories. Though there have been efforts to use more data by incorporating cohort
effects (Schweidel et al., 2008) or customer heterogeneity (Fader and Hardie, 2010), current models
cannot natively handle sparse, high-dimensional, unstructured data (Witten and Tibshirani, 2010).
Time-varying covariates also present hurdles for existing models
(Ascarza et al., 2017). While a majority of traditional models assume constant covariates (Guillén
et al., 2012), realistically, latent customer characteristics that influence churn are likely to change
over time (e.g., changes in marketing trends, ongoing dynamic customer relationships, learning
effects in mobile games, etc.). Indeed, time-varying covariates often cause biases when estimating
the hazard function in the current churn models (Fader and Hardie, 2009; Ascarza et al., 2017).
Recent advances in predictive analytics allow us to address each of these limitations. Specifically:
1. Deep learning (DL) approaches show superior capability in dealing with sparse, high-
dimensional, and unstructured data (LeCun et al., 2015). Representation learning, a set of methods
that permit algorithms to automatically discover various levels of abstraction/information from raw
data, allows extremely flexible input data types (Bengio et al., 2013), such as images (Krizhevsky
et al., 2012) and text (Collobert et al., 2011). Additionally, the automated nonlinear feature learning
ability (Bengio et al., 2013) helps to extract more non-obvious signals from the data than
manual feature engineering.
2. Machine learning (ML) frameworks such as Hidden Markov Models can incorporate time-
varying covariates in survival analysis (Netzer et al., 2008). In addition, more recent DL techniques,
such as different varieties of Recurrent Neural Networks (RNNs), allow survival models to handle
complex multivariate time series data (Martinsson, 2016; Lee et al., 2018a). This line of work enables retention models in business contexts
to make use of far richer behavioral data.
Despite its superiority, one potential drawback of DL approaches is their lack of interpretabil-
ity. However, recent eXplainable Artificial Intelligence (XAI) techniques provide much-needed in-
terpretability in predictive analytics. XAI turns complex black-box AI models into interpretable
glass-box models while retaining a high level of model performance (Lu et al., 2020; Rai, 2020). In
other words, DL models capture complex nonlinear signals from real-world data, and XAI inter-
prets them in the form of human-understandable explanations. Notably, XAI can, along with deep
sequential models, provide time-varying dynamics regarding customer behavior and related factors.
We introduce Weibull Time To Event TCN (WTTE-TCN), an interpretable deep survival model
that achieves superior performance with lower computational costs and provides two different levels of
explanation (model- and individual-level) via the recent XAI methods. It can also incorporate tex-
tual data through Natural Language Processing (NLP) augmentation. To do so, we enhance Martinsson
(2016)’s Weibull Time To Event RNN (WTTE-RNN) in several ways. WTTE-RNN combines the
best of both the RNN model (efficient in processing multivariate time series data) and literature-
proven approaches to survival analyses via a Weibull-based time to event model. The details of our enhancements are as follows:
• We enable more efficient training while improving performance. Mainly, we enhance the
RNN layer of Martinsson (2016) by using Temporal Convolutional Network (TCN). TCN
outperforms canonical RNNs, such as Long Short Term Memory (LSTM) or Gated Recurrent
Unit (GRU), across a diverse range of tasks and datasets while demonstrating longer effective
memory (Bai et al., 2018). We also apply the Attention Mechanism (which acts like random
access memory across time and input data) (Yang et al., 2016) and Layer Normalization
(which adjusts the mean and the variance of the summated inputs within each layer) (Ba
et al., 2016). These methods have proven to allow more efficient training for a wide range
of DL applications (Raffel and Ellis, 2015; Zhang et al., 2019a). The test results show that
our TCN model reduces mean absolute error (MAE) by 17% while using about half of the
parameters of the original WTTE-RNN.
• We apply current XAI methods (for a survey, please see Guidotti et al., 2018) to provide human-understandable intuitions for our DL
model's predictions. As with other DL-based survival models (Ranganath et al., 2016; Lee
et al., 2018a; Kvamme et al., 2019), WTTE-RNN does not provide interpretability. With XAI
methods, model- and individual-level attribute importance and contribution can be computed
while maintaining high-level performance. The end result is a particularly flexible and accu-
rate model equipped to extract nonlinear and temporally diffuse signals for churn while still
remaining interpretable.
• We enhance the stability of the training process. Exploding gradients are a problem in that
accumulated gradients from large errors require huge updates to neural network model weights,
resulting in unstable or failed learning processes (Goodfellow et al., 2016). Notably, it occurs
more commonly with RNN models, and gets worse when handling sparse and messy real-world
data (Pascanu et al., 2013; Brownlee, 2018). WTTE-RNN is not an exception, and prior work
has argued that it sometimes fails in the training process (Maragall Cambra, 2018; Palau et al.,
2018). To allow more stable and efficient training, while also preventing exploding gradients, we
apply gradient clipping and improved optimizers (Rectified Adam and Lookahead), as detailed in Appendix C.
Our contribution is as follows. First, we propose a novel, interpretable deep survival method.
Second, we present a demonstration and application of the said method on a novel proprietary
dataset. Lastly, we present exploratory results that examine the model’s ability to capture churn
signals from complex (sparse, high-dimensional, and unstructured) data and provide domain-specific
marketing implications.
We evaluate our model on a real-world mobile gaming dataset that contains highly sparse and
complex customer behaviors. In this process, our TCN network handles the temporality and large
receptive fields of complex sequential data through powerful convolutions and dilations, while min-
imizing the loss of information (He and Zhao, 2019). As a result, our deep survival models reduced
MAE by 56-81% and 17-51% compared to traditional and deep survival models, respectively. Also,
through current XAI methods, we assess the individual- and model-level dynamics between service
characteristics, customer activities, and churn decisions as they evolve over time.
The remainder of the paper is organized as follows. Section 2 reviews related work, and Section 3
explicates the model. In Section 4, we evaluate our model with a real-world dataset. Section 5
discusses the implications of our work and suggests opportunities for future research.
2 Related Works
In the first part of this section, we explain the advances in survival models in the business context.
We mainly focus on what efforts prior work has made to overcome time-varying covariates and to
incorporate complex data. Second, we describe AI interpretability and XAI methods in business contexts.
Prior marketing research has built survival models based on theoretical backgrounds (Ascarza et al.,
2017). Briefly, these studies have used simple stochastic models such as BG (beta-geometric) (Pot-
ter and Parker, 1964) and sBG (shifted-beta-geometric) (Fader and Hardie, 2007) for contractual
settings that are aware of when customers become inactive, and Pareto/NBD (Pareto Type II and
negative binomial distribution) (Schmittlein et al., 1987), BG/NBD (beta-geometric and negative
binomial distribution) (Fader et al., 2005), and BG/BB (beta-geometric and beta-binomial) (Fader and
Hardie, 2010) for non-contractual settings that do not observe inactivity (Fader and Hardie, 2009).
There have also been efforts to incorporate features that had not been considered in prior work,
such as customer heterogeneity and cross-cohort effects in marketing mix activities (Schweidel et al.,
2008), the frequency and amount of direct marketing activities across individuals and over time
(Schweidel and Knox, 2013), customer’s complaints and recoveries (Knox and Van Oest, 2014), and
customer’s service experiences, such as frequency and recency of past purchases (Braun et al., 2015).
Despite these efforts, these studies focus on classical statistical methods that cannot directly handle
complicated (sparse, higher dimensional, and unstructured) data. As a result, existing survival
churn models in a business context can only take a few high signal variables that are engineered by
experts and based on existing theories (Witten and Tibshirani, 2010; Ascarza et al., 2017). With
easier access to various data sources, the limitations of traditional survival models are becoming
more apparent.
Another limitation is that traditional survival models struggle with time-varying covariates in
multivariate time series data (Ascarza et al., 2017). The simplicity of traditional models cannot capture
patterns in data that change over time (Žliobaitė et al., 2016). For example, the Cox Proportional
Hazard model assumes that explanatory variables remain constant over time (Cox, 1972). However,
in real-world business settings, customer behaviors, which contain critical clues to predicting cus-
tomers’ decisions, change over time (Fader et al., 2004). To handle these problems, prior studies
have developed various approaches, such as changepoint models and sequence models.
Changepoint models capture the underlying evolution points of features and split time-varying
covariates into multiple time-fixed covariates (Koop and Potter, 2004). In marketing literature,
Fader et al. (2004) build on a changepoint framework which nests simple, theory-based models of
customer buying behaviors. Despite the advantages, (parametric) changepoint models often rely on
strong assumptions and face difficulties in modeling complex temporal patterns that are hard to
specify manually. Also, (non-parametric) changepoint models show limitations in addressing
some patterns (e.g., changes that happen at arbitrary timescales or gradually over different durations).
In the sequence model, the current state is dependent on the previous stochastic input, and it
helps to capture signals from a non-stationary process (Kuznetsov, 2016). In a business context,
Autoregressive (AR) models have been widely used in prediction tasks related to finance (Pacurar,
2008), rather than churn prediction in marketing and IS. These AR models assume that the dynamics
in the market are gradual or smooth, not complex (Netzer et al., 2008). In other words, such
approaches are restrictive when it comes to handling the complexity of recent datasets.
Breakthroughs in machine learning technology have led to a wealth of data and superior perfor-
mance in predictive systems, ranging from recommendations on e-commerce and content filtering
on social networks to image processing and autonomous cars (LeCun et al., 2015). As machine
learning methods advance, we are starting to see more sophisticated applications that utilize more
data for better decision making (Hosanagar, 2019). With regards to the survival models, for exam-
ple, recent clinical research has suggested novel applications for churn prediction while addressing
newly discovered high-dimensional data through machine learning-based survival models (Pölsterl et al.).
It is well documented that the Hidden Markov Model (HMM) can handle nonlinearity and time-varying
covariates. HMM models a sequence of unobserved states that generate the
visible outputs, and it treats observations as a result of previous unobserved states (Ghahramani,
2001). In a business context, HMM has been used to provide more sophisticated rationales for
underlying dynamics in customer behavior (Montoya et al., 2010; Ascarza et al., 2017). For example,
Netzer et al. (2008) incorporate HMM to account for the evolution of relationship dynamics as
a result of interactions between the customer and the company. However, while HMM is an ML
method that assesses customer relationship dynamics over time, it has limitations in natively addressing
sparse, high-dimensional data.
Also, few HMM studies have examined time-varying dynamics while handling data censoring.
Recent advances in deep learning have dramatically enhanced the state-of-the-art in a wide
range of tasks including sequence process (LeCun et al., 2015). Thus, prior research has been
trying to combine advantages of both DL models (e.g., exceptional performance while handling
complex data, nonlinearity, and time-varying covariates) and survival models (e.g., addressing cen-
sored data). Katzman et al. (2018) demonstrate that a deep survival model (DeepSurv), a simple
combination of a multilayer perceptron and the Cox partial log-likelihood rather than a deep sequential
model, outperforms traditional survival models in multiple electronic health record (EHR)
datasets. Lee et al. (2018a) propose a multi-task learning based deep survival model (DeepHit)
to address competing risks (the competing nature of different causes for the same event) more
natively. Martinsson (2016) suggests a sequential deep learning-based survival approach (WTTE-
RNN) that effectively handles multivariate time series data, as well as the long term dynamics
among time-varying covariates. These deep survival approaches show superior performance compared with traditional survival models.
However, this line of work is heavily focused on medical settings, which is not appropriate for
business contexts, as survival analysis originated with, and has mainly been used by, medical researchers to
measure the lifetime of populations (Prinja et al., 2010). Since the temporality and irregularity
of data (due to patients’ irregular visits) is a significant challenge in the medical context (Xiao
et al., 2018), medical deep survival models have been focused on making good use of time-invariant
variables, such as electronic health records (EHR) (Ranganath et al., 2016; Katzman et al., 2018)
and genetic and protein expression features (Yousefi et al., 2017). On the other hand, in many real-
world business cases, firms periodically collect an incomparable amount of customer activity data in
the form of multivariate time series. To fill this gap, our work
suggests a TCN-based survival approach that effectively handles complex sequences of multivariate
time series data in business contexts.
As AI systems become widespread, increasing scholarly attention has been given to AI interpretabil-
ity, including marketing and IS literature (Lee et al., 2018b; Rai, 2020). However, interpretability,
while not yet well defined in the literature (Lipton, 2016) and varying across different contexts
(Rudin, 2019), broadly refers to the understandability of a model regarding how and why it made a certain prediction.
Prior work claims that AI interpretability provides the following benefits to inscrutable business
AI systems. First, providing an effective interpretation for AI systems can increase users’ trust in
the system (Ribeiro et al., 2016a; Carvalho et al., 2019). Second, interpretability is essential to
identify potential algorithmic bias (Doshi-Velez and Kim, 2017), which refers to those AI results
that discriminate on arbitrary grounds, such as race, gender, and ethnicity (Hajian et al., 2016) and
can be critical to a firm’s relationship with customers (Rai, 2020). Third, interpretability provides
business implications from complex AI systems in terms of human-understandable forms (Du et al., 2019).
Advances in interpretability techniques are allowing predictive analytics to pursue both perfor-
mance and explainability, and as such XAI has received enormous scholarly attention (Rai, 2020).
eXplainable AI (XAI) refers to a suite of methods that provide human-understandable explana-
tions for AI models (Gunning, 2017). Notably, the current model-agnostic (or post-hoc) techniques
derive approximated interpretations by analyzing the patterns between input and output features
in a trained model. By doing so, they can be applied to any machine learning model, regardless of
its structure and complexity, and allow for the conversion of black-box models to explainable
glass-box ones, while retaining a high level of model performance (Molnar, 2018; Rai, 2020). XAI
can aim to flexibly create better explanations that consider the context of task and type of data
(e.g., LIME (Ribeiro et al., 2016a), SHAP (Lundberg and Lee, 2017), and Grad-CAM (Selvaraju
et al., 2017)), compared to traditional, intrinsically interpretable approaches that are limited to a
certain form of explanation dependent on the selection of model (Ribeiro et al., 2016a).
However, existing deep survival models have focused on prediction performance rather than providing intuitive explanations for complex datasets (Martinsson, 2016;
Ranganath et al., 2016; Lee et al., 2018a; Kvamme et al., 2019). Some studies have tried to de-
rive explanations from deep survival models, but these are limited to traditional interpretability
perspectives that discuss cohort-level differences (Katzman et al., 2018) or variable importances
(Luck et al., 2017). Yousefi et al. (2017) is one of the few studies that applies current XAI to
deep survival models (though it is not based on sequential DL methods). The authors describe how
each feature contributes to predicted cancer risk while dealing with high-dimensional medical data.
However, Yousefi et al. (2017) only focuses on global feature importance. Local interpretability is
difficult to find via traditional methods (Du et al., 2019) and expected to provide new business
implications for marketing professionals and increased value to customers by enabling elaborate,
individual-level churn management.
Our approach flexibly consumes complex data and attempts to explain the churn in a bottom-up
(data-driven) fashion on two different levels: global and local interpretability. To the best of our
knowledge, we are unaware of studies that take advantage of accurate, yet highly complex, sequential
DL models in conjunction with current XAI algorithms to address survival churn management.
Notably, we demonstrate how recent explanation techniques such as additive feature attribution
methods (i.e., SHAP) and local interpretability can uncover different types of business implications.
3 Model
We propose an interpretable deep survival model that can provide human-understandable explana-
tions for nonlinear and time-varying dynamics from big data. To do so, we build on Martinsson
(2016)'s WTTE-RNN in several ways. We first provide an overview of the WTTE-RNN before detailing our enhancements.
Figure 1: WTTE-RNN. [Figure: a timeline showing the observation window and prediction period, the true duration T, the predicted duration, and the predicted event probability distribution.]
Based on sequential data up to the current state, WTTE-RNN predicts the probability of the nearest
future event in the form of a Weibull distribution. In this process, the mode of Weibull PDF has
the highest probability of churning and is considered as the predicted churning time.
WTTE-RNN (Martinsson, 2016) combines the best of both the RNN model (efficient in processing
time series data) and literature-proven approaches to survival analyses via a Weibull-based time to
event model. To simplify, the RNN part processes multivariate time series data and outputs two
parameters (α for scale and β for shape) to be used by the probability density function (PDF) of a Weibull distribution.
Traditional survival models (e.g., Cox Proportional Hazard model) often assume that the hazard
rate of a customer increases over time (Ascarza et al., 2017). However, in many real-world cases (e.g.,
for high loyalty customers and in the case of lock-ins), the longer the duration, the lower the hazard
rate. Some models do solve this “individual-level duration dependence” with flexible distributions
such as Weibull, Gamma, or Log-logistic distribution for the hazard function, incorporating both
increasing and decreasing patterns of the hazard rate (Braun and Schweidel, 2011; Fader et al., 2018).
WTTE-RNN combines flexible Weibull distribution and a sequential deep learning framework.
The probability density function (PDF) of the Weibull distribution can be expressed with two parameters (α for scale and β for shape):

$$PDF(x; \alpha, \beta) = \frac{\beta}{\alpha}\left(\frac{x}{\alpha}\right)^{\beta-1} e^{-\left(\frac{x}{\alpha}\right)^{\beta}}, \quad x \geq 0 \qquad (1)$$
[Figure 2: Weibull hazard (failure) rate patterns over time: decreasing, constant, and increasing failure rates.]
With this form, Weibull distribution can model various patterns of hazard rates (also called
failure rates) such as decreasing (β < 1), constant (β = 1), or increasing (β > 1) through
changing a shape parameter β. Many platform services show u-shaped hazard rates over time (see
Figure 2). In the early stage of a certain service, the firm performs a number of trials and errors,
and users churn easily. However, as optimization in the service progresses, more users become
satisfied, and the hazard rate also decreases. Then, the hazard rate stays stable during the maturity
stage. Later, in the decline stage, the service loses its competitive edge due to internal or external
factors, such as the emergence of new competitors, and the hazard rate increases. Additionally,
at the individual level, each user may have a different scale of u-shaped hazard rate depending
on their characteristics. Unlike traditional models (e.g., models based on the Cox Proportional Hazards
assumption that the hazard rate increases over time), Weibull-based
models can flexibly explain all of the decreasing, constant, and increasing phenomena (Lu et al.,
2016).
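For concreteness, the short sketch below (our own illustration, not code from the paper) evaluates the standard Weibull hazard rate h(x) = (β/α)(x/α)^(β−1) for the three regimes discussed above; variable names are ours.

```python
import numpy as np

def weibull_hazard(x, alpha, beta):
    """Weibull hazard (failure) rate: h(x) = (beta / alpha) * (x / alpha) ** (beta - 1)."""
    return (beta / alpha) * (x / alpha) ** (beta - 1)

x = np.linspace(0.5, 10.0, 5)           # avoid x = 0, where the hazard diverges for beta < 1
for beta in (0.5, 1.0, 2.0):            # decreasing, constant, and increasing failure rates
    print(beta, np.round(weibull_hazard(x, alpha=5.0, beta=beta), 3))
```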
The objective function of WTTE-RNN comes from a typical survival function (Tableman and
Kim, 2003), and the goal is to maximize the following log likelihood:
$$\sum_{n=1}^{N}\sum_{t=0}^{T_n} u_t^n \cdot \log\left[\Pr\left(Y_t^n = y_t^n \mid x_{0:t}^n\right)\right] + \left(1 - u_t^n\right) \cdot \log\left[\Pr\left(Y_t^n > y_t^n \mid x_{0:t}^n\right)\right] \qquad (2)$$
After manipulations to derive the two Weibull parameters (α, β), the final log-likelihood (objective
function) is shown in Table 1. We choose one of two different objective functions, depending on
whether the scale of time is continuous or discrete, and maximize it to learn the model (Martinsson,
2016).
Table 1: Weibull log-likelihood objective functions ($u = 1$ if uncensored, $u = 0$ if right censored)

Continuous Weibull distribution: $u \cdot \log\left(\frac{\beta}{\alpha}\left(\frac{y}{\alpha}\right)^{\beta-1}\right) - \left(\frac{y}{\alpha}\right)^{\beta}$

Discrete Weibull distribution: $u \cdot \log\left(e^{\left(\frac{y+1}{\alpha}\right)^{\beta} - \left(\frac{y}{\alpha}\right)^{\beta}} - 1\right) - \left(\frac{y+1}{\alpha}\right)^{\beta}$
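As a concrete reference, the following is a minimal NumPy sketch of the discrete-time objective in Table 1 (an illustration under our own naming; Martinsson (2016) provides the original implementation):

```python
import numpy as np

def discrete_weibull_loglik(y, u, a, b, eps=1e-9):
    """Discrete Weibull log-likelihood from Table 1.
    y: time to event, u: 1 if uncensored / 0 if right-censored, a: scale (alpha), b: shape (beta)."""
    hazard0 = (y / a) ** b               # cumulative hazard at y
    hazard1 = ((y + 1.0) / a) ** b       # cumulative hazard at y + 1
    return u * np.log(np.exp(hazard1 - hazard0) - 1.0 + eps) - hazard1

# Training maximizes this quantity, i.e., minimizes the mean negative log-likelihood:
loss = -np.mean(discrete_weibull_loglik(y=np.array([3.0, 7.0]), u=np.array([1.0, 0.0]),
                                        a=np.array([5.0, 5.0]), b=np.array([1.5, 1.5])))
```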
Figure 3 presents the overview structure of our model: WTTE-TCN. In this section, we explain each
module of the suggested model sequentially. WTTE-TCN consists of bidirectional TCNs, attention
layers, unstructured data processing layers (i.e., GloVe (Pennington et al., 2014) for text data), a
fully connected layer, a Weibull survival loss layer, and model-agnostic XAI methods. The types of
input data can be multi- and univariate time series, cross-sectional, and unstructured, such as text
and image data. The output is the Weibull probability distribution of each customer’s churn.
The TCN layer abstracts sequential events while considering their context. The attention layer
helps the TCN layer efficiently handle the data’s sparsity and long-term dependencies. For unstruc-
tured data, additional pre-trained layers are used for transfer learning, which embeds complex data
into the organized lower-dimensional vector space. For example, through pre-trained NLP layers,
similar words are embedded near each other, resulting in improved performance and efficiency. Ab-
stracted inputs are concatenated and provided to a fully connected layer, which distills information
one more time. The entire learning process is conducted to minimize the Weibull survival loss.
We replace the RNN layer of Martinsson (2016) with a TCN that retains the time series processing
ability of RNN, but adds the computational efficiency of convolutional networks (Lea et al., 2017).
Figure 3: WTTE-TCN architecture. [Figure: multivariate time-series features and unstructured inputs pass through bidirectional TCNs and an attention layer (Yang et al., 2016), are concatenated into a fully connected layer, and feed the Weibull survival loss; the output is a churn distribution with a prediction interval, and model-agnostic XAI methods produce global and local explanations.]
TCN was chosen since it outperforms advanced RNNs such as LSTM or GRU across various se-
quential tasks while handling longer input sequences. TCN is faster than RNN while requiring less
memory and computational power (Bai et al., 2018), critical advantages in the era of big data.
TCN has the following notable characteristics: 1) the convolutions in the model are time-stamp
aware, implying that no future information is leaked during processing; and 2) the model structure
can map any length input sequence to the same length output sequence, just as RNNs do (Bai
et al., 2018). TCN consists of stacked residual blocks, which in turn consist of convolutional layers,
activation layers, normalization layers, and regularization layers (see Figure 4 and Lea et al. (2016)).
Dilated convolutions allow the receptive field to grow exponentially with the depth of the stacked
layers. In detail, for an input sequence $x \in \mathbb{R}^n$, an output sequence $u \in \mathbb{R}^n$, and a convolution filter
$f : \{0, \dots, k-1\} \to \mathbb{R}$ of size $k$, the $r$th-level dilated convolution operation $F$ at time $t$ is defined as
$$F(t) = (x *_d f)(t) = \sum_{i=0}^{k-1} f(i) \cdot x_{t - d \cdot i} \qquad (3)$$

where $d$ is the dilation factor, which can be written as $(k-1)^{r-1}$, to cover an exponentially wide
receptive field.
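To make Equation (3) concrete, here is a direct NumPy transcription for a univariate sequence (illustrative only; in practice a framework's causal, dilated 1-D convolution layer would be used):

```python
import numpy as np

def dilated_causal_conv(x, f, d):
    """Eq. (3): F(t) = sum_{i=0}^{k-1} f(i) * x[t - d*i], treating out-of-range inputs as zero."""
    k = len(f)
    out = np.zeros(len(x))
    for t in range(len(x)):
        for i in range(k):
            j = t - d * i
            if j >= 0:                       # causal: only current and past inputs contribute
                out[t] += f[i] * x[j]
    return out

x = np.arange(8.0)                           # toy input sequence
print(dilated_causal_conv(x, f=[0.5, 0.5], d=2))
```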
The Rectified Linear Unit (ReLU) is used as an activation function to provide nonlinearity to the network.
One of the obstacles of DL is that the gradients for the weights in one layer are highly correlated
to the outputs of the previous layer, resulting in increased training time. Layer normalization is
designed to alleviate this “covariate shift” problem by adjusting the mean and variance of the sum-
mated inputs within each layer (Ba et al., 2016). Though the theoretical motivation for reducing
covariate shift is still debated, this family of normalization
methods, which allows for faster and more efficient training, has proven indispensable to a wide range
of DL applications (Zhang et al., 2019a). The statistics of layer normalization over each hidden unit
are computed as follows:
$$\mu^l = \frac{1}{H}\sum_{i=1}^{H} a_i^l \qquad (6)$$

$$\sigma^l = \sqrt{\frac{1}{H}\sum_{i=1}^{H}\left(a_i^l - \mu^l\right)^2} \qquad (7)$$

$$\bar{a}_i^l = \frac{a_i^l - \mu^l}{\sigma^l} \qquad (8)$$
where $\bar{a}_i^l$ is the normalized summed input to the $i$th hidden unit in the $l$th layer, and $H$ denotes
the number of hidden units in a layer. We also applied layer normalization to the fully connected
layers of our suggested model to improve learning efficiency and stability.
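A minimal NumPy sketch of Equations (6)-(8) (illustrative; production implementations typically also add a small epsilon and learnable gain/bias parameters):

```python
import numpy as np

def layer_norm(a, eps=1e-6):
    """Eqs. (6)-(8): normalize the summed inputs a over the H hidden units of one layer."""
    mu = a.mean()                                 # Eq. (6)
    sigma = np.sqrt(((a - mu) ** 2).mean())       # Eq. (7)
    return (a - mu) / (sigma + eps)               # Eq. (8)

print(layer_norm(np.array([1.0, 2.0, 4.0, 9.0])))
```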
Dropout is an essential regularization technique to prevent the over-fitting of the neural network.
The idea is to randomly drop (hidden and visible) units from the network during training, which
prevents units from co-adapting too often. By doing so, dropout improves the generalization of the network.
Complex and noisy real-world data make the training process extremely unstable and prone to converging
to poor local minima, especially when handling sequential tasks with RNN models (Pascanu et al.,
2013). Notably, an exploding gradients problem, in which accumulated gradients from
large errors force huge updates to neural network model weights, results in an unstable or failed
learning process (Goodfellow et al., 2016). WTTE-RNN is not an exception, and related literature
has argued that it sometimes fails in the training process (Maragall Cambra, 2018; Palau et al., 2018).
To solve these problems, we apply various techniques, such as 1) the Attention Mechanism, 2) Gradient
Clipping, 3) Rectified Adam, and 4) the Lookahead Optimizer. These methods enable more efficient
and stable training while achieving better performance. The details of each implementation are
described in Appendix C.
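As an illustration of how such pieces could be combined in Keras, the sketch below wires gradient-norm clipping, Rectified Adam, and the Lookahead wrapper together using the tensorflow_addons package; the hyperparameter values and the commented loss name are placeholders, not the paper's exact configuration.

```python
import tensorflow_addons as tfa

# Rectified Adam with gradient-norm clipping, wrapped by the Lookahead optimizer.
# Hyperparameter values here are illustrative placeholders.
base_opt = tfa.optimizers.RectifiedAdam(learning_rate=1e-3, clipnorm=1.0)
optimizer = tfa.optimizers.Lookahead(base_opt, sync_period=6, slow_step_size=0.5)

# model.compile(optimizer=optimizer, loss=weibull_negative_loglik)  # hypothetical model/loss names
```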
To incorporate text data into our survival model, we integrate GloVe (Pennington et al., 2014), a
DL-based NLP model that converts words to meaningful representation vectors. GloVe improves on
Word2Vec (Mikolov et al., 2013), a method for learning vector representations of words based on
the idea that similar words should be embedded near each other in a lower-dimensional space, by
also incorporating global word co-occurrence relationships. Figure 5 explains the text processing
with GloVe and TCN. We assign each word to unique numbers through the tokenization process, and
the GloVe layer changes these numbers to representation vectors. Then, the TCN layer distills (i.e.,
feature extracts) information from sequential numbers and passes it to our main survival model.
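A sketch of this text pathway with a frozen, GloVe-initialized embedding layer (the toy corpus, vocabulary size, and GloVe file path are placeholders, not the paper's settings):

```python
import numpy as np
import tensorflow as tf

corpus = ["great puzzle game", "too many ads"]        # toy stand-in for real text data
VOCAB_SIZE, EMBED_DIM, MAX_LEN = 20000, 100, 100      # placeholder sizes

tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=VOCAB_SIZE)
tokenizer.fit_on_texts(corpus)                        # assign each word a unique integer id
sequences = tf.keras.preprocessing.sequence.pad_sequences(
    tokenizer.texts_to_sequences(corpus), maxlen=MAX_LEN)

# Build an embedding matrix from pre-trained GloVe vectors (placeholder file path).
embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM))
with open("glove.6B.100d.txt", encoding="utf8") as f:
    for line in f:
        word, *vec = line.split()
        idx = tokenizer.word_index.get(word)
        if idx is not None and idx < VOCAB_SIZE:
            embedding_matrix[idx] = np.asarray(vec, dtype="float32")

# Frozen GloVe layer; its output would feed the text TCN before the main survival model.
glove_layer = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM,
                                        weights=[embedding_matrix], trainable=False)
```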
To interpret WTTE-TCN predictions, we provide two types of post-hoc model explanations: global and local.
Global interpretation captures and explains the overall importance of features at the model level.
Extracting the feature importance from a trained classifier or coefficients from a linear regression
model are typical examples of global interpretation. In contrast, local interpretation attempts to
explain “how the model behaves in the vicinity of the instances being predicted,” and therefore fo-
cuses on delivering instance-by-instance explanations (Ribeiro et al., 2016b). In our research setup,
this would entail identifying the main features driving the churn prediction for an individual user.
Further, to provide clarity on the uncertainty in prediction, we provide prediction intervals derived from the predicted Weibull distribution.
We provide global interpretation in two ways. The first is the Global Feature Importance,
which indicates how important each feature is for outcome prediction. The second is the Partial
Dependence Plots (PDP), which show the marginal effect of each feature on the predicted outcome
as the value of each feature changes from its minimum to its maximum (Friedman, 2001).
To implement global and local feature importance and PDP, we apply SHapley Additive ex-
Planations (SHAP), which 1) is a model-agnostic method, and 2) employs
cooperative game theory to provide a more accurate contribution of each feature to the outputs
(Molnar, 2018). Notably, for the efficiency in estimation, we implement Deep SHAP (Lundberg and
Lee, 2017), which greatly improves computational performance by decomposing the approximation
process for the whole network into smaller ones (see Shrikumar et al. (2017) for more details). As
such, SHAP has advantages over model-specific XAI techniques (e.g., coefficients of linear regression
models), which can naturally provide explainability and transparency in decision processes but only
for restricted model classes. In particular, model-agnostic methods offer the following flexibility:
1) Model flexibility: Model-agnostic methods can be implemented with any models regardless of
their structures and complexities. So, we can interpret complex black-box DL models via explainable glass-box approaches.
2) Explanation flexibility: The explanation is not limited to a certain form. In this paper,
we show various types of explanations, such as global and local feature importances and partial
dependence plots.
3) Representation flexibility: Model-agnostic XAI methods can interpret various types of feature representations.
Local feature importance is equal to the Shapley value $\phi_j^{(i)}$ (i.e., the contribution of feature $j$
for a given instance $i$) and can be estimated through Deep SHAP (please see Lundberg and Lee (2017)
for more details). Global feature importance is calculated by aggregating the absolute Shapley (or
SHAP) values over all instances:

$$I_j = \sum_{i=1}^{n} \left|\phi_j^{(i)}\right| \qquad (9)$$
In terms of the partial dependence plots (PDP), we simply draw a point plot with the feature value
on the x-axis and the matching Shapley value on the y-axis. Since the Shapley value $\phi_j^{(i)}$ means the
contribution of the given feature $j$ of instance $i$ while considering the local condition (or the effect of
other feature values), we do not need to calculate partial dependences separately. Mathematically,
the PDP for feature $j$ is the point set

$$\left\{\left(x_j^{(i)},\; \phi_j^{(i)}\right)\right\}_{i=1}^{n} \qquad (10)$$
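The sketch below shows how Deep SHAP values could be aggregated into the global importance of Equation (9) and the PDP point set of Equation (10) for a generic model with 2-D tabular input; `model`, `X_background`, and `X_test` are assumed to exist, and the usage follows the shap package.

```python
import numpy as np
import shap

# model, X_background, and X_test are assumed to exist (background data anchors the expectation).
explainer = shap.DeepExplainer(model, X_background)
phi = explainer.shap_values(X_test)                  # local Shapley values per instance and feature
phi = phi[0] if isinstance(phi, list) else phi       # multi-output models return a list of arrays

global_importance = np.abs(phi).sum(axis=0)          # Eq. (9): I_j = sum_i |phi_j^(i)|

j = 0                                                # index of the feature of interest
pdp_points = list(zip(X_test[:, j], phi[:, j]))      # Eq. (10): {(x_j^(i), phi_j^(i))}
```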
Weibull-distribution-based models provide two parameters (α, β) as the output. This means
that if we apply XAI methods to the Weibull-models, we can only obtain the contribution of each
instance i and feature j to (α, β), and not the duration of customers. To solve this problem, we
developed a mapping function to convert the Shapley value of (α, β) to that of customer duration.
We take the mode of the predicted Weibull distribution as the predicted churn time, since the mode has
the highest probability of churning in the distribution (see Figure 1). Specifically:

$$W_{Mode}(\alpha, \beta) = \begin{cases} \alpha\left(\dfrac{\beta-1}{\beta}\right)^{1/\beta} & \beta > 1 \\ 0 & \beta \leq 1 \end{cases} \qquad (11)$$
$$\phi_{D\alpha j}^{(i)} = W_{Mode}\!\left(\phi_{\alpha 0} + \phi_{\alpha j}^{(i)},\; \phi_{\beta 0}\right) - W_{Mode}\!\left(\phi_{\alpha 0}, \phi_{\beta 0}\right) \qquad (12)$$

$$\phi_{D\beta j}^{(i)} = W_{Mode}\!\left(\phi_{\alpha 0},\; \phi_{\beta 0} + \phi_{\beta j}^{(i)}\right) - W_{Mode}\!\left(\phi_{\alpha 0}, \phi_{\beta 0}\right) \qquad (13)$$

$$\phi_{Dj}^{(i)} = \phi_{D\alpha j}^{(i)} + \phi_{D\beta j}^{(i)} \qquad (14)$$

where $\phi_{D\alpha j}^{(i)}$ and $\phi_{D\beta j}^{(i)}$ are the marginal contributions to the customer duration caused by the predicted
parameters α and β of instance $i$ and feature $j$, respectively. $\phi_{\alpha 0}$ and $\phi_{\beta 0}$ are baseline Shapley values
for α and β, and $\phi_{\alpha j}^{(i)}$ and $\phi_{\beta j}^{(i)}$ are the Shapley values derived for α and β of observation $i$. The final
mapped Shapley value of observation $i$ is the sum of $\phi_{D\alpha j}^{(i)}$ and $\phi_{D\beta j}^{(i)}$, and it can be interpreted as the
contribution of feature $j$ to the predicted duration of customer $i$.
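A direct transcription of Equations (11)-(14) in Python (variable names are ours):

```python
def weibull_mode(alpha, beta):
    """Eq. (11): mode of the Weibull distribution, used as the predicted churn time."""
    return alpha * ((beta - 1.0) / beta) ** (1.0 / beta) if beta > 1.0 else 0.0

def duration_shap(phi_alpha_0, phi_beta_0, phi_alpha_ij, phi_beta_ij):
    """Eqs. (12)-(14): map Shapley values on (alpha, beta) to a contribution to predicted duration."""
    base = weibull_mode(phi_alpha_0, phi_beta_0)
    phi_d_alpha = weibull_mode(phi_alpha_0 + phi_alpha_ij, phi_beta_0) - base  # Eq. (12)
    phi_d_beta = weibull_mode(phi_alpha_0, phi_beta_0 + phi_beta_ij) - base    # Eq. (13)
    return phi_d_alpha + phi_d_beta                                            # Eq. (14)
```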
To demonstrate the efficacy of our algorithm, we evaluate its performance and explainability on
a proprietary dataset: game logs and churning outcomes obtained from a mobile game company.
With the dataset (which contains highly sparse and complex customer behaviors), we show the
ability of our model to capture nonlinear patterns that may not be possible with the traditional
model. Then, we figure out how firms can utilize explainability to build effective customer churn
management policies.
According to Wijman et al. (2019), the mobile game market is estimated to be worth $ 68.5 billion
globally. As competition intensifies, game companies are increasingly required to perform effective
customer churn management by utilizing massive amounts of play log data. Although data-driven
churn management has greater potential for game companies, the complexity and noise of the data
deter them from making accurate prediction models, which in turn restricts churn management's
applications. For example, NCSOFT, one of the biggest game companies in South Korea, hosted an
international competition for churn prediction using game log data and reported its difficulties: the
performance of models decreased drastically when predicting the churn of loyal customers or over
different periods (Lee et al., 2018c). Even worse, even if we accurately predict customers' churn,
there are limited ways to prevent it (e.g., push notifications, coupons). In particular, if users turn
push notifications off or do not access the game, there is little that game companies can do to retain
their customers. In contrast, our interpretability-driven approach provides a bottom-up explanation
of customer behaviors through post hoc analysis. Based on the explanations, our model can
suggest guidelines for churn management policies and adjust game designs.

[Figure: prediction-driven vs. interpretability-driven (eXplainable AI) churn management. The prediction-driven approach forecasts future events and intervenes before they happen; in other words, it focuses on "Prediction and Intervention." The interpretability-driven approach observes past events and seeks to discern why users churn at a certain time; based on the explanations, it focuses on modifying the current policy and service design ("Explain and Improve").]
We collected the game logs of 89,877 new international users who started playing the game between
June 2 and October 9, 2018. Our dataset is from a casual mobile puzzle game. The details of
the game, rules, and variable details are described in Appendix A. The dataset contains each user’s
behavioral logs from their initial interaction with the game up to the point where they either reached
level 150 or churned: data after level 150 is right-censored. We define “churning” as when a user
doesn’t visit the game for more than four weeks. The dataset has 3 million rows of individual-level
multivariate time series data. The variable definitions are shown in Table 3, and the game-specific details are described in Appendix A.
In Figure 7, the x- and y-axes refer to the game levels (1 to 150) and average feature val-
ues, respectively. We can observe significant nonstationary patterns in the variables. For example,
some variables show a sudden drift, in which a
new pattern suddenly replaces the old one. Regarding Payment, PlayCount, AdvertiseView, and
ContinueWithCoin, we can find a recurrent drift, which refers to temporary changes in the trend (Ren
et al., 2018). These intricate patterns make the model difficult to fit correctly.
In the analysis, we excluded users who had churned during the tutorial period, which allows users to
practice the game and learn its rules from levels 1 to 30. Also, we only observed
the 50 most recent levels before the churn of each user (or censoring at level 150). There are two
reasons for this. First, in the mobile game, half of the revenue comes from only 0.19% of loyal
users (Cifuentes, 2016). This means that discovering and managing loyal customers is the most
important goal of mobile game operations. However, it is common in gaming for a large proportion
of users to churn very early (i.e., before starting gameplay), and such users are of limited interest
to the gaming company. Second, the firm has 80 million annual active users in North America and
Europe. Given that just two months of data for 89,877 new users have 3 million rows, the company
is motivated to reduce the otherwise tremendous costs of data preprocessing and analysis. After
eliminating such users, we were left with data for a total of 27,004 users.
[Figure 7: average feature values (y-axis) across game levels 1-150 (x-axis) for each variable.]
Input Data
We set the game level as a time indicator rather than a day or week which is commonly used in
time series data analyses. The objective is to solve the sparsity problem that arises from the vast
majority of users not accessing the game every day. Using a day- or week-based time indicator not
only leads to data sparsity, it also unnecessarily lengthens the sequence of data with many empty
entries. In contrast, the level-based dataset is less sparse, and managers in gaming companies
often discuss user behavior as well as product and marketing strategies in terms of user level.
A total of 26 variables, such as play counts per stage, in-game purchases, number of
friends, and advertisement views, make up the explanatory variables. Two dependent variables
are used in the analysis: whether the customer has churned or not and the level at which the
churn occurred.
Multivariate time series data input has the following dimensions: [instance, level (50), feature
(26)]. If a player churned at level 120, they have an input data size of (50, 26) filled with feature
values of the most recent 50 levels (Figure 9, left). If a player churned earlier than level 50 (e.g., at level 34), we
applied zero-padding (adding zeros to fit different-sized data to the same shape) to fill the remaining
entries (Figure 9, right).
Input:
• Multivariate time series data related to customers' in-game behavior (e.g., play count, item
purchase count, login with Facebook ID, and the number of friends).

[Figure 9: zero-padded multivariate time series inputs of shape (level = 50, feature = 26). Left: a user who churned at level 120, represented by the most recent 50 levels (71-120). Right: a user who churned at level 34, with the unobserved levels zero-padded.]

Output:
• The predicted Weibull churn distribution (parameters α and β) for each customer.
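A sketch of how one user's history could be assembled into the fixed (50, 26) input described above (our own illustration; `history` is a hypothetical per-level feature array, and whether padding is placed before or after the observed levels is an implementation choice):

```python
import numpy as np

LOOKBACK, N_FEATURES = 50, 26

def build_input(history):
    """history: array of shape (levels_played, 26), ordered from the first level upward.
    Keeps at most the latest 50 levels and zero-pads the rest to a fixed (50, 26) shape."""
    recent = history[-LOOKBACK:]
    padded = np.zeros((LOOKBACK, N_FEATURES))
    if len(recent):
        padded[-len(recent):] = recent   # padding placement (front vs. back) is a design choice
    return padded

x = build_input(np.random.rand(34, N_FEATURES))   # e.g., a user who churned at level 34
print(x.shape)                                    # (50, 26)
```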
4.3 Performance
We compared the performance of our model (WTTE-TCN) to four traditional survival models (Cox
Proportional Hazard, Weibull Accelerated Failure Time (AFT), Log-Normal AFT, and Aalen's
Additive) as well as four deep survival models (DeepSurv (Katzman et al., 2018), CoxCC (Kvamme
et al., 2019), DeepHit (Lee et al., 2018a), and Martinsson (2016)'s WTTE-RNN). Regarding WTTE-
RNN and WTTE-TCN, we can directly use the 3d tensor (instance, time (= level), feature) of multivariate
time series data, since these models have sequential layers such as LSTM and TCN. On the other hand, since
the other benchmark models cannot consume a 3d tensor,
we converted it to a 2d matrix (instance, time × features). Notably, since traditional models did
not work with these flattened inputs due to the curse of dimensionality, we reduced the dimension
through Principal Components Analysis (PCA) (Witten and Tibshirani, 2010). We split the data
into three parts, training (50%), validation (25%), and test (25%) data. To evaluate the perfor-
mance of duration prediction models, we used mean absolute error (MAE) on the test data (Wang
et al., 2019). Specifically, MAE takes the difference between the predicted game level for user churn
(= time to event) and the actual churned level. With respect to the DL models, the performance varies
across training runs, so we conducted a total of 10 separate trainings by changing the random seed.
The mean, minimum, and maximum MAE were used as metrics for comparison. Root mean squared
error (RMSE) was considered as an alternative metric.
We choose MAE over another popular metric, concordance index (or C-index), which refers to
the accuracy of predicted rank (i.e. which observation survives longer) between pairs of randomly
chosen observations. This is because concordance cannot accurately evaluate survival models when
the data exhibit time-varying patterns that are uncommon in traditional settings. For example,
our mobile game is designed to have incremental difficulties, so users fail more as they reach further
stages. Also, as shown in Figure 7, there are more features with incremental/decremental patterns.
These time-varying patterns make it effortless to predict the order of user churn, and most of our
test models showed extremely high concordance scores. Thus, we focus on evaluating how accurately
models predict the timing of users' churn (i.e., minimizing prediction errors), rather than the order
in which they churn.
The test result shows that our WTTE-TCN model achieves superior performance over other tra-
ditional (reduced MAE by 56-81%) and deep survival models (reduced MAE by 17-51%). Notably,
we reduced MAE by 17% while using about half of the parameters compared to Martinsson (2016).
For managerial guidance, we provide global interpretations of model predictions in this section.
Figure 10 shows the global feature importance of the most critical variables across all users.
According to the result, two key features related to game design, PlayCount and TrophyColor, play
the most important role in predicting how long users will survive. PlayCount, one of the indicators
of game difficulty, is the total number of attempts to get past a level. TrophyColor refers to the three types
of trophies users are awarded when they clear stages depending on their game style. The more
skillful the users, the better the trophies they get. FacebookLogin and NumberOfFriends indicate
that social-related variables are also useful to predict the user’s duration. Regarding OS, simple
data exploration shows that IOS users stay longer than Android users, ceteris paribus. Items and
boosts help users clear the game more easily, and the pattern of using them is critical information
as well. Also, the in-app purchasing experience (IsPayer) and the number of advertising views are
informative for predicting a user's duration.

[Figure 11: number of churned users (y-axis) at each game level (x-axis).]
With respect to PDP, we provide a cohort-level global interpretation by grouping users who
churned at similar game levels into the same cohort. This is to control for the time-varying char-
acteristics of variables such as PlayCount, which is the average of total play counts per level, and
which is observed to increase as the game level increases. In other words, feature values at the high
and low levels have entirely different meanings, and the design of the game should be different, too.
We demonstrate how we can utilize cohort-level PDPs to derive managerial implications from
the game. As mentioned above, the firm focuses on potential loyal customers and cares about user
churn later in the observed period.

Figure 12: Partial Dependence Plots (PDP) of users who churned at levels 126-135. [Each panel plots the SHAP value (y-axis) against the average feature value (x-axis); the bottom row shows IsPayer, FacebookLogin, and NumberOfFriends.]

In Figure 11, we can observe a sharp increase in user churn
between levels 126 to 135 (red dotted box). We seek to find a way to reduce the churn rate in this
period, where more potential loyal customers have left than we expected. To do so, we use XAI to
analyze the recent (latest 50 levels) behavior of the churned users and explain how those behaviors
affected their decisions. Based on the explanation, we provide new churn management strategies.
Figure 12 presents the PDPs of users who churned at 126-135 level. The x- and y-axes indicate
the average feature value of the latest 50 levels and its SHAP value (feature importance), respec-
tively. Since this feature importance is an average value of the latest 50 levels, the cumulative
impact is 50 times greater (average SHAP value per level × the length of lookback window) than
SHAP values shown in the figure. If a user's average PlayCount is 3 and their SHAP value is 1
(Figure 12 left above), we can expect that this user will stay 50 levels (1 level x 50) more than the
baseline user, whose SHAP value is zero, and, whose average PlayCount is around two. Also, users
who have the same feature value can have different SHAP values. This is because they have other
different feature values. For example, even though users have the same PlayCount, its impact on
paid and free users may differ. The SHAP value can capture these differences because it calculates the
contribution of each feature conditional on the other feature values.
With respect to PlayCount, a user’s duration increases until PlayCount reaches 3 (Figure 12 left
above). However, after 3, users who play (fail) more tend to churn more easily. In other words, users
don’t like games that are too easy (low PlayCount) or too difficult (high PlayCount), and there is
an optimal level of difficulty (PlayCount is around 3). In this case, it is good for the company to
adjust the difficulty of levels 126-135 once users have played about three times per level on average.
In detail, since the current difficulty of this period is too low (many users' average PlayCount is
below three), the company could slightly increase the difficulty of these levels.
TrophyColor is related to a sense of accomplishment. When users clear the game with skillful
plays and high scores, they get better trophies from among the bronze, silver, and gold options.
Notably, as a reward, the company carefully designed special visual and sound effects to excite users
when they receive a better trophy. According to the result (Figure 12, middle above), users with better
trophies survived longer, and the company needs to provide more opportunities for users with bronze
and silver trophies to win gold trophies. In sum, at levels 126-135, the company should increase the
difficulty of the game so that users retry about three times per level (PlayCount), while allowing
them to feel a higher sense of accomplishment if they clear the game (TrophyColor).
Regarding ItemUseCount (Figure 12 right above), users who spent more items churned earlier
(BoosterUseCount showed a similar result). Items and boosters help users clear levels easily. To
get items and boosters, users should spend more time to complete additional quests or buy them
with money. In this sense, users perceive items and boosters as currency, and unused items act
as sunk costs of making users play the game for longer. Also, users who decide to leave the game
tend to use up items in a short period of time. This suggests that we can use ItemUseCount and
BoostUseCount as key metrics for prescriptive churn prevention. For example, the company can
offer additional items, boosters, or special promotions before users run out of their own items.
If a player has made cash payments before, “IsPayer” is 1, otherwise it is 0. The result (Figure
12 left below) shows that paying users stay longer than free users, as expected. However, concerning
social variables, the result turns out to be the reverse of what the company intended. For example, users
who login with their Facebook accounts (FacebookLogin = 1) showed a shorter duration (Figure
12 mid below). The company expected users to enjoy games and communicate with their social
media friends by linking their Facebook accounts, and that this would increase the user’s duration
(Cole and Griffiths, 2007). In the game, users can have various social interactions, such as sending
gifts to their Facebook friends or showing off their scores to each other. Therefore, users who have
many Facebook friends in the game should have more fun and stay longer. However, users with
more than three friends (Figure 12 right below) also have shorter survival periods. This means the
company needs to check if there are any problems with their design of the social interaction system.
Fortunately, interviews with game users show that they felt like they were losing interest in the
game faster if their friends were leaving. In other words, these social functions are only effective
while their friend groups are active; otherwise, they are bad for user retention. The company
could design a new social interaction mechanism to overcome these shortcomings (e.g., matching
users whose friends have left with other active players).
One of the advantages of a distribution-based predictive model is that we can address uncertainty in
prediction results. Figure 13 shows the outputs (the Weibull probability distributions of churn) for
two users. We can interpret the point where the probability (y-axis) is the highest as the predicted
churn of a user. User #776, whose predicted churn (77) shows a comparatively small deviation from
the true churn (76), has a narrow PI (the colored band in Figure 13), [37, 89], and is a more reliable
prediction. On the other hand, user #873, whose predicted churn (223) shows significant deviation
from the true churn (147), has wide PI, [96, 419], and is consequently less reliable.
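For reference, both the predicted churn time (the mode) and such an interval can be read off the predicted (α, β) via the Weibull quantile function; the sketch below assumes an equal-tailed interval and an illustrative coverage level, since the paper does not state which coverage its intervals use.

```python
import numpy as np

def weibull_quantile(p, alpha, beta):
    """Inverse CDF of the Weibull distribution: q(p) = alpha * (-ln(1 - p)) ** (1 / beta)."""
    return alpha * (-np.log(1.0 - p)) ** (1.0 / beta)

def prediction_interval(alpha, beta, coverage=0.9):
    """Equal-tailed interval at an illustrative coverage level."""
    half_tail = (1.0 - coverage) / 2.0
    return weibull_quantile(half_tail, alpha, beta), weibull_quantile(1.0 - half_tail, alpha, beta)

alpha, beta = 85.0, 2.0                                                       # illustrative parameters
mode = alpha * ((beta - 1.0) / beta) ** (1.0 / beta) if beta > 1.0 else 0.0   # predicted churn time
print(round(mode, 1), prediction_interval(alpha, beta))
```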
[Figure 14: individual-level explanation for user #1358 from level 52 to churn at level 102. Above: actual feature values (e.g., PlayCount, TrophyColor) across game levels; below: level-by-level SHAP values for each feature, with regions A-F highlighting notable patterns.]
We demonstrate how we can derive customized churn management strategies from the perspec-
tive of individual levels. Figure 14 (below) presents the behavioral feature importance (y-axis)
map of user #1358 through different game levels (x-axis) until the eventual churn at stage 102.
Blue indicates a negative contribution to retention and red indicates a positive contribution. Fig-
ure 14 (above) shows the actual feature values of different behavioral variables across game stages.
Examine, for example, region A, which shows the PlayCount (total attempts to get past a level)
contribution through different game levels. The PlayCount is usually 1 until stage 40 where it
bumps up to 4. This suggests that the user cleared each level without fail for the most part, which
in turn prompted the user to churn at level 102. In other words, the game was too easy for user
#1358 and contributed to his/her churning. On the other hand, the varying TrophyColor (orange
curve in the second column of Figure 14, above), which is awarded based on how skillfully the user
clears each stage, also shaped retention over time. Playing
bonus stages for extra rewards (PlayCountOtherLevels) was beneficial for retention, while replaying
cleared stages (ClearedGameRetryCounts) was harmful. The user spent items mostly in stages 30-
40 (region E), which was also harmful for retention. Lastly, the user was stuck at stage 42 (region
F) for a long time (about 17 days), and it was detrimental to retention; shortly after, the user
churned. This suggests that customized promotions, such as offering free items or readjusting difficulty at
such bottleneck stages, could help retain similar users.
5 Discussion

In this paper, we proposed WTTE-TCN, an interpretable deep survival model that captures
nonlinear patterns from complex (sparse, high-dimensional, and unstructured) data. When applied
to a unique and comprehensive mobile game churning dataset, WTTE-TCN shows superior per-
formance with less computational costs. Upon applying post-hoc XAI methods, we were able to
illustrate actionable insights and prescriptive strategies for real-world customer churn management.
We contribute in two ways. First, we improve existing DL survival models by incorporating TCN,
which outperforms canonical RNNs (e.g., LSTM, GRU) through powerful convolutions and dilations
while minimizing the loss of information. Second, we present exploratory results that examine the
model's ability to capture churn signals from complex data and provide domain-specific marketing implications.
We envision WTTE-TCN as “advanced knowledge capturing tools” that help to find novel hy-
potheses from empirical customer churning data (Molnar, 2018; Lee et al., 2018b). To this end,
DL models discover complex nonlinear patterns in data (beyond human cognitive abilities), and
interpretability can extract this knowledge from the model. Specifically, we hope managers and
researchers can use WTTE-TCN to incorporate novel unstructured data (e.g., text, image) and dis-
cover/test new hypotheses in churn management contexts. The potential of text data, notably,
remains untapped, in that this exponentially growing data captures the daily life of cus-
tomers, including their thoughts, emotions, tastes, routines, and even relationships. Despite these
great prospects, the potential value lost due to unused text data is estimated at $3 trillion globally
(Analytics, 2016).
This study has several limitations that open avenues for future research. First, while we successfully incorporated NLP layers into WTTE-TCN, this study focused on
demonstrating that our algorithm can capture more signals and provide useful marketing implica-
tions with complex multivariate sequential data, rather than illustrating good use of interpretability
with text data. Future studies with novel text datasets might provide new business implications for marketing research and practice.
Second, although there are explainable models, deriving an actionable strategy still requires
great effort and know-how. In the classical setting, good use of interpretability has already been well
documented and studied. For example, econometrics studies have solid foundations for interpreting,
validating, and utilizing coefficients of regression models. On the other hand, in the DL setting, good
use of interpretability is still in its infancy, and the complexity of DL makes the right interpretation
difficult. In this context, fruitful future research topics can include advantages and constraints of
various interpretability methods (Molnar, 2018), good use cases and new techniques (Yang et al.,
2016; Lee et al., 2018b; Mothilal et al., 2020), and guidelines for the robust use of interpretability
(Doshi-Velez and Kim, 2017; Melis and Jaakkola, 2018) in business contexts.
Third, applied ML/DL requires managing uncertainty, which refers to a model working with im-
perfect or unknown information; this will be an important future research topic, given that it
helps to manage biases in algorithms and minimize risks in data-driven enterprise decisions. For
example, a medical diagnosis system with 90% accuracy would still misdiagnose 10% of patients. In
typical predictive models, we cannot know which patient will be misdiagnosed because the model
simply gives the patient’s prediction results as a single value (point estimate). In contrast, proba-
bilistic ML/DL methods address uncertainty by quantitatively measuring how much we can trust
the prediction result of each observation through distributional results (interval estimate). Thus,
for observations suffering insufficient data or heavy noise, the model provides wider distributions or
prediction intervals as results (see Hüllermeier and Waegeman (2019) for more details). Although
our model deals with uncertainty through distributional results, studies about uncertainty are still
lacking. More theoretical and empirical studies regarding implementations, impacts, and novel methods are needed.
References
Ascarza, E., P. S. Fader, and B. G. Hardie: 2017, ‘Marketing models for the customer-centric firm’. In:
Handbook of marketing decision models. Springer, pp. 297–329.
Ba, J. L., J. R. Kiros, and G. E. Hinton: 2016, ‘Layer normalization’. arXiv preprint arXiv:1607.06450.
Bahdanau, D., K. Cho, and Y. Bengio: 2014, ‘Neural machine translation by jointly learning to align and
translate’. arXiv preprint arXiv:1409.0473.
Bai, S., J. Z. Kolter, and V. Koltun: 2018, ‘An empirical evaluation of generic convolutional and recurrent
networks for sequence modeling’. arXiv preprint arXiv:1803.01271.
Bengio, Y., A. Courville, and P. Vincent: 2013, ‘Representation learning: A review and new perspectives’.
IEEE transactions on pattern analysis and machine intelligence 35(8), 1798–1828.
Braun, M. and D. A. Schweidel: 2011, ‘Modeling customer lifetimes with multiple causes of churn’. Marketing
Science 30(5), 881–902.
Braun, M., D. A. Schweidel, and E. Stein: 2015, ‘Transaction attributes and customer valuation’. Journal
of Marketing Research 52(6), 848–864.
Brownlee, J.: 2018, Better Deep Learning: Train Faster, Reduce Overfitting, and Make Better Predictions.
Machine Learning Mastery.
Campbell, P.: 2019, ‘Is Content Marketing Dead? Here’s Some Data.’. ProfitWell (blog).
Carvalho, D. V., E. M. Pereira, and J. S. Cardoso: 2019, ‘Machine Learning Interpretability: A Survey on
Methods and Metrics’. Electronics 8(8), 832.
Cifuentes, J.: 2016, ‘Half of all mobile games revenue reportedly comes from only 0.19% of players’.
Cole, H. and M. D. Griffiths: 2007, ‘Social interactions in massively multiplayer online role-playing gamers’.
Cyberpsychology & behavior 10(4), 575–583.
Collobert, R., J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa: 2011, ‘Natural language
processing (almost) from scratch’. Journal of machine learning research 12(Aug), 2493–2537.
Cox, D. R.: 1972, ‘Regression models and life-tables’. Journal of the Royal Statistical Society: Series B
(Methodological) 34(2), 187–202.
Doshi-Velez, F. and B. Kim: 2017, ‘Towards a rigorous science of interpretable machine learning’. arXiv
preprint arXiv:1702.08608.
Du, M., N. Liu, and X. Hu: 2019, ‘Techniques for interpretable machine learning’. Communications of the
ACM 63(1), 68–77.
Ebrahimzadeh, Z., M. Zheng, S. Karakas, and S. Kleinberg: 2019, ‘Deep Learning for Multi-Scale Change-
point Detection in Multivariate Time Series’. arXiv preprint arXiv:1905.06913.
Fader, P. S. and B. G. Hardie: 2007, ‘How to project customer retention’. Journal of Interactive Marketing
21(1), 76–90.
Fader, P. S. and B. G. Hardie: 2009, ‘Probability models for customer-base analysis’. Journal of interactive
marketing 23(1), 61–69.
Fader, P. S. and B. G. Hardie: 2010, ‘Customer-base valuation in a contractual setting: The perils of ignoring
heterogeneity’. Marketing Science 29(1), 85–93.
Fader, P. S., B. G. Hardie, and K. L. Lee: 2005, ‘”Counting your customers” the easy way: An alternative
to the Pareto/NBD model’. Marketing science 24(2), 275–284.
Fader, P. S., B. G. Hardie, Y. Liu, J. Davin, and T. Steenburgh: 2018, ‘”How to Project Customer Retention”
Revisited: The Role of Duration Dependence’. Journal of Interactive Marketing 43, 1–16.
Friedman, J. H.: 2001, ‘Greedy function approximation: a gradient boosting machine’. Annals of statistics
pp. 1189–1232.
Gallo, A.: 2014, ‘The value of keeping the right customers’. Harvard business review 29.
Ghahramani, Z.: 2001, ‘An introduction to hidden Markov models and Bayesian networks’. In: Hidden
Markov models: applications in computer vision. World Scientific, pp. 9–41.
Goodfellow, I., Y. Bengio, and A. Courville: 2016, Deep learning. MIT press.
Guidotti, R., A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi: 2018, ‘A survey of methods
for explaining black box models’. ACM computing surveys (CSUR) 51(5), 93.
Guillén, M., J. P. Nielsen, T. H. Scheike, and A. M. Pérez-Marín: 2012, ‘Time-varying effects in the analysis
of customer loyalty: A case study in insurance’. Expert Systems with Applications 39(3), 3551–3558.
Gunning, D.: 2017, ‘Explainable artificial intelligence (xai)’. Defense Advanced Research Projects Agency
(DARPA), nd Web 2.
Hajian, S., F. Bonchi, and C. Castillo: 2016, ‘Algorithmic bias: From discrimination discovery to fairness-
aware data mining’. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge
discovery and data mining. pp. 2125–2126.
He, Y. and J. Zhao: 2019, ‘Temporal Convolutional Networks for Anomaly Detection in Time Series’. In:
Journal of Physics: Conference Series, Vol. 1213. p. 042050.
Hosanagar, K.: 2019, A Human’s Guide to Machine Intelligence: How Algorithms are Shaping Our Lives
and how We Can Stay in Control. Viking.
Hüllermeier, E. and W. Waegeman: 2019, ‘Aleatoric and epistemic uncertainty in machine learning: A
tutorial introduction’. arXiv preprint arXiv:1910.09457.
Katzman, J. L., U. Shaham, A. Cloninger, J. Bates, T. Jiang, and Y. Kluger: 2018, ‘DeepSurv: personalized
treatment recommender system using a Cox proportional hazards deep neural network’. BMC medical
research methodology 18(1), 24.
Knox, G. and R. Van Oest: 2014, ‘Customer complaints and recovery effectiveness: A customer base ap-
proach’. Journal of marketing 78(5), 42–57.
Koop, G. and S. Potter: 2004, ‘Forecasting in dynamic factor models using Bayesian model averaging’. The
Econometrics Journal 7(2), 550–565.
Krizhevsky, A., I. Sutskever, and G. Hinton: 2012, ‘Imagenet classification with deep convolutional neural
networks’. In: Advances in Neural Information Processing Systems. pp. 1097–1105.
Kuznetsov, V.: 2016, ‘Theory and Algorithms for Forecasting Non-Stationary Time Series’. Ph.D. thesis,
New York University.
Kvamme, H., Ø. Borgan, and I. Scheel: 2019, ‘Time-to-event prediction with neural networks and Cox
regression’. Journal of Machine Learning Research 20(129), 1–30.
Lea, C., R. Vidal, A. Reiter, and G. D. Hager: 2016, ‘Temporal convolutional networks: A unified approach
to action segmentation’. In: European Conference on Computer Vision. pp. 47–54, Springer.
LeCun, Y., Y. Bengio, and G. Hinton: 2015, ‘Deep learning’. nature 521(7553), 436.
Lee, C., W. R. Zame, J. Yoon, and M. van der Schaar: 2018a, ‘Deephit: A deep learning approach to survival
analysis with competing risks’. In: Thirty-Second AAAI Conference on Artificial Intelligence.
Lee, D., E. Manzoor, and Z. Cheng: 2018b, ‘Focused Concept Miner (FCM): Interpretable Deep Learning
for Text Exploration’. Working paper, November 20, 2018.
Lee, E., Y. Jang, D.-M. Yoon, J. Jeon, S.-i. Yang, S.-K. Lee, D.-W. Kim, P. P. Chen, A. Guitart, P. Bertens,
et al.: 2018c, ‘Game data mining competition on churn prediction and survival analysis using commercial
game log data’. IEEE Transactions on Games 11(3), 215–226.
Lipton, Z. C.: 2016, ‘The mythos of model interpretability’. arXiv preprint arXiv:1606.03490.
Liu, L., H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han: 2019, ‘On the variance of the adaptive
learning rate and beyond’. arXiv preprint arXiv:1908.03265.
Lu, J., D. Lee, T. W. Kim, and D. Danks: 2020, ‘Good Explanation for Algorithmic Transparency’. AIES
’20: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (February, 2020).
Lu, Y., A. A. Miller, R. Hoffmann, and C. W. Johnson: 2016, ‘Towards the Automated Verification of
Weibull Distributions for System Failure Rates’. In: Critical Systems: Formal Methods and Automated
Verification. Springer, pp. 81–96.
Luck, M., T. Sylvain, H. Cardinal, A. Lodi, and Y. Bengio: 2017, ‘Deep learning for patient-specific kidney
graft survival analysis’. arXiv preprint arXiv:1705.10245.
Lundberg, S. M. and S.-I. Lee: 2017, ‘A unified approach to interpreting model predictions’. In: Advances
in Neural Information Processing Systems. pp. 4765–4774.
Maragall Cambra, M.: 2018, ‘Using recurrent neural networks to predict the time for an event’.
Martinsson, E.: 2016, ‘Wtte-rnn: Weibull time to event recurrent neural network’. Master’s thesis, University
of Gothenburg, Sweden.
Melis, D. A. and T. Jaakkola: 2018, ‘Towards robust interpretability with self-explaining neural networks’.
In: Advances in Neural Information Processing Systems. pp. 7775–7784.
Mikolov, T., I. Sutskever, K. Chen, G. S. Corrado, and J. Dean: 2013, ‘Distributed representations of words
and phrases and their compositionality’. In: Advances in neural information processing systems. pp.
3111–3119.
Molnar, C.: 2018, ‘Interpretable machine learning: A guide for making black box models explainable’. E-book
at <https://christophm.github.io/interpretable-ml-book/>, version dated 10.
Montoya, R., O. Netzer, and K. Jedidi: 2010, ‘Dynamic allocation of pharmaceutical detailing and sampling
for long-term profitability’. Marketing Science 29(5), 909–924.
Moor, M., M. Horn, B. Rieck, D. Roqueiro, and K. Borgwardt: 2019, ‘Temporal convolutional net-
works and dynamic time warping can drastically improve the early prediction of sepsis’. arXiv preprint
arXiv:1902.01659.
Netzer, O., J. M. Lattin, and V. Srinivasan: 2008, ‘A hidden Markov model of customer relationship dy-
namics’. Marketing science 27(2), 185–204.
Pacurar, M.: 2008, ‘Autoregressive conditional duration models in finance: a survey of the theoretical and
empirical literature’. Journal of economic surveys 22(4), 711–751.
Palau, A. S., K. Bakliwal, M. H. Dhada, T. Pearce, and A. K. Parlikad: 2018, ‘Recurrent neural networks for
real-time distributed collaborative prognostics’. In: 2018 IEEE international conference on prognostics
and health management (ICPHM). pp. 1–8.
Pascanu, R., T. Mikolov, and Y. Bengio: 2013, ‘On the difficulty of training recurrent neural networks’. In:
International conference on machine learning. pp. 1310–1318.
Pennington, J., R. Socher, and C. D. Manning: 2014, ‘Glove: Global vectors for word representation’. In:
Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). pp.
1532–1543.
Pölsterl, S., S. Conjeti, N. Navab, and A. Katouzian: 2016, ‘Survival analysis for high-dimensional, het-
erogeneous medical data: Exploring feature extraction as an alternative to feature selection’. Artificial
intelligence in medicine 72, 1–11.
Potter, R. G. and M. P. Parker: 1964, ‘Predicting the time required to conceive’. Population studies 18(1),
99–116.
Prinja, S., N. Gupta, and R. Verma: 2010, ‘Censoring in clinical trials: review of survival analysis techniques’.
Indian journal of community medicine: official publication of Indian Association of Preventive & Social
Medicine 35(2), 217.
Raffel, C. and D. P. Ellis: 2015, ‘Feed-forward networks with attention can solve some long-term memory
problems’. arXiv preprint arXiv:1512.08756.
Rai, A.: 2020, ‘Explainable AI: from black box to glass box’. Journal of the Academy of Marketing Science
pp. 1–5.
Ranganath, R., A. Perotte, N. Elhadad, and D. Blei: 2016, ‘Deep survival analysis’. arXiv preprint
arXiv:1608.02158.
Reichheld, F. and C. Detrick: 2003, ‘Loyalty: A prescription for cutting costs’. Marketing Management
12(5), 24–24.
Ren, S., B. Liao, W. Zhu, and K. Li: 2018, ‘Knowledge-maximized ensemble algorithm for different types of
concept drift’. Information Sciences 430, 261–281.
Ribeiro, M. T., S. Singh, and C. Guestrin: 2016a, ‘"Why should I trust you?": Explaining the predictions of
any classifier’. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery
and data mining. pp. 1135–1144.
Ribeiro, M. T., S. Singh, and C. Guestrin: 2016b, ‘Model-agnostic interpretability of machine learning’.
arXiv preprint arXiv:1606.05386.
Rudin, C.: 2019, ‘Stop explaining black box machine learning models for high stakes decisions and use
interpretable models instead’. Nature Machine Intelligence 1(5), 206.
Schmittlein, D. C., D. G. Morrison, and R. Colombo: 1987, ‘Counting your customers: Who-are they and
what will they do next?’. Management science 33(1), 1–24.
Schweidel, D. A. and G. Knox: 2013, ‘Incorporating direct marketing activity into latent attrition models’.
Marketing Science 32(3), 471–487.
Selvaraju, R. R., M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra: 2017, ‘Grad-cam: Visual ex-
planations from deep networks via gradient-based localization’. In: Proceedings of the IEEE international
conference on computer vision. pp. 618–626.
Shrikumar, A., P. Greenside, and A. Kundaje: 2017, ‘Learning important features through propagating
activation differences’. In: Proceedings of the 34th International Conference on Machine Learning-Volume
70. pp. 3145–3153.
Srivastava, N., G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov: 2014, ‘Dropout: a simple way
to prevent neural networks from overfitting’. The journal of machine learning research 15(1), 1929–1958.
Tableman, M. and J. S. Kim: 2003, Survival analysis using S: analysis of time-to-event data. CRC press.
Wang, P., Y. Li, and C. K. Reddy: 2019, ‘Machine learning for survival analysis: A survey’. ACM Computing
Surveys (CSUR) 51(6), 110.
Wijman, T., O. Meehan, and B. de Heij: 2019, ‘Global games market report’.
Witten, D. M. and R. Tibshirani: 2010, ‘Survival analysis with high-dimensional covariates’. Statistical
methods in medical research 19(1), 29–51.
Xiao, C., E. Choi, and J. Sun: 2018, ‘Opportunities and challenges in developing deep learning models
using electronic health records data: a systematic review’. Journal of the American Medical Informatics
Association 25(10), 1419–1428.
Yang, Z., D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy: 2016, ‘Hierarchical attention networks for
document classification’. In: Proceedings of the 2016 conference of the North American chapter of the
association for computational linguistics: human language technologies. pp. 1480–1489.
Zhang, A., Z. C. Lipton, M. Li, and A. J. Smola: 2019a, ‘Dive into Deep Learning’. Unpublished draft.
Zhang, M., J. Lucas, J. Ba, and G. E. Hinton: 2019b, ‘Lookahead Optimizer: k steps forward, 1 step back’.
In: Advances in Neural Information Processing Systems. pp. 9593–9604.
Žliobaitė, I., M. Pechenizkiy, and J. Gama: 2016, ‘An overview of concept drift applications’. In: Big data
analysis: new algorithms for a new society. Springer, pp. 91–114.
10
No. / Component Name / Description
1. Heart (or Life): A player needs one heart to play a game, and five hearts are provided by default. Hearts increase by one every 30 minutes, up to a maximum of five. If the player uses up all the hearts, he/she can buy hearts using coins, earn them as rewards for completing missions, or be gifted one from friends in the game.
2. Level (or Stage): The level is the stage number of the game played by a user. Users are not allowed to play levels beyond a certain point until they have progressed through the lower ones. However, cleared stages can be replayed, and the highest scores can be updated.
3. Trophy Color: When the player beats a level, he/she receives a gold, silver, or bronze trophy depending on the achieved score. A gold trophy is awarded for clearing with a high score, and a bronze trophy for clearing with a low score.
4. Coin: Coins are a currency that can buy hearts, items, and boosters. Coins can be acquired through in-app purchases using cash, and they can also be obtained as compensation for completing various missions or events.
5. Booster: Boosters are items with special features that make it easier to clear certain levels of the game.
6. Facebook Login and Friends: If the player allows Facebook login, he/she can post records on Facebook or interact with Facebook friends in the game. Facebook logins are rewarded with items or coins, and Facebook friends can present hearts to each other.
7. Item: Items are special features that help the player beat the game more easily.
8. Advertisement: If the player fails, he/she can get additional opportunities by watching video advertisements. The player can also earn more coin or item rewards by watching the advertisements.
9. Crystal Trophy: When the player replays accomplished stages, he/she can update the previous trophy color (gold, silver, or bronze) or receive an additional crystal trophy as a clearing reward.
10. Gacha: The player sometimes gets items or boosters through a random picker.
$$\alpha_{it} = \frac{\exp(u_{it}^{\top} u_w)}{\sum_{t} \exp(u_{it}^{\top} u_w)} \qquad (2)$$

$$s_i = \sum_{t} \alpha_{it} h_{it} \qquad (3)$$
where $h_{it}$ is a word annotation obtained by concatenating $\overrightarrow{h}_{it}$ (the forward hidden state) and $\overleftarrow{h}_{it}$ (the backward hidden state) of the bidirectional RNN or TCN, i.e., $h_{it} = [\overrightarrow{h}_{it}, \overleftarrow{h}_{it}]$. $u_w$ is a word-level context vector that encodes “what is the informative word” across words. $s_i$ is the output sentence vector that summarizes all the information of the words in the $i$th sentence.
With respect to the behavioral data, we replace the input of the attention layer: instead of the sequence of (numeric) embedded words, we feed the sequence of feature values describing customer behavior. The attention mechanism not only allows for faster training and better performance, it also increases the stability of training by providing more direct pathways through the model structure (?).
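For readers who prefer code, the following NumPy sketch mirrors Eqs. (2)-(3); the projection that produces $u_{it}$ is stubbed with a tanh, and the shapes and names are illustrative rather than the configuration used in our model.

```python
# Minimal sketch of the attention pooling in Eqs. (2)-(3); shapes are illustrative.
import numpy as np

def attention_pool(h, u, u_w):
    """h: (T, d) annotations h_it; u: (T, d) projected annotations u_it;
    u_w: (d,) context vector. Returns the summary vector s_i."""
    scores = u @ u_w                       # u_it^T u_w for each time step t
    alpha = np.exp(scores - scores.max())  # numerically stable softmax, Eq. (2)
    alpha = alpha / alpha.sum()
    return alpha @ h                       # Eq. (3): sum_t alpha_it * h_it

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))                # 5 time steps, 8-dim bidirectional annotations
u = np.tanh(h)                             # stand-in for the projection producing u_it
s = attention_pool(h, u, u_w=rng.normal(size=8))
print(s.shape)                             # (8,)
```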
One challenge of handling sequential data is the exploding gradients problem: accumulated gradients from large errors make huge updates to the neural network weights, resulting in an unstable or ineffective learning process (?). It occurs more commonly with RNN models, and it gets worse when handling sparse and messy real-world data (?????). WTTE-RNN is no exception, and the related literature has reported that it sometimes fails during training (???). To solve this problem, we implement gradient clipping, which rescales the gradient when its norm exceeds a threshold value (?). Let $g$ be the gradient and $\eta$ a threshold; if $\|g\| > \eta$, the clipped gradient is defined as:
$$g \leftarrow \frac{\eta g}{\|g\|} \qquad (4)$$
Adapted from ?.
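A minimal sketch of this clipping rule follows; the threshold value is illustrative, and in practice frameworks expose equivalents (e.g., the clipnorm argument of Keras optimizers).

```python
# Illustrative norm-based gradient clipping, Eq. (4); the threshold is an assumption.
import numpy as np

def clip_by_norm(g, eta):
    """Rescale gradient g so its L2 norm never exceeds the threshold eta."""
    norm = np.linalg.norm(g)
    if norm > eta:
        g = eta * g / norm  # Eq. (4): g <- eta * g / ||g||
    return g

g = np.array([3.0, 4.0])           # ||g|| = 5
print(clip_by_norm(g, eta=1.0))    # [0.6, 0.8], norm rescaled to 1
```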