
Interpretable Deep Learning Approach to Churn Management

Daehwan Ahn¹, Dokyun "DK" Lee², Kartik Hosanagar³

{ahndh¹, kartikh³}@wharton.upenn.edu, dokyun@bu.edu²
¹,³University of Pennsylvania, ²Boston University

Abstract

We propose an interpretable deep survival model that can capture human-understandable nonlinear patterns from big data while handling censored observations and time-varying customer dynamics. To this end, we build WTTE-TCN (Weibull Time to Event Temporal Convolutional Networks) and apply post-hoc eXplainable Artificial Intelligence (XAI) methods to explain model predictions in a human-interpretable manner. When applied to mobile game churn data, WTTE-TCN demonstrates superior performance at lower computational cost while also addressing the limitations of traditional survival models, such as the handling of time-varying covariates.

We build the algorithm so that managers can easily interpret human-understandable explanations, draw actionable insights, and inform potential prescriptive strategies. For example, replaying already-cleared stages in the game is linked to early churn, whereas giving users a higher sense of accomplishment through appropriately calibrated difficulty is associated with decreased churn. We identify misdesigned in-game systems (e.g., difficulty and social interaction mechanisms) that increase the churn rate and provide suggestions to improve them.

1 Introduction

The benefits of customer retention are well documented and studied, and have led to calls for

proactive churn management across a variety of industries (Ascarza et al., 2017). Gallo (2014)

documents that acquiring a new customer is usually 5-25 times more costly than retaining an

existing one. A case from financial services shows that a 5% increase in retention raised profits by

more than 25% (Reichheld and Detrick, 2003). Additionally, CAC (Customer Acquisition Cost) has

grown by nearly 50% over the past five years (Campbell, 2019).



Survival models are widely used for churn modeling because of their ability to properly handle

the censoring of observations and a range of distributional forms (Wang et al., 2019). However, with

easier access to more complex data, the limitations of traditional survival models are also becoming

more apparent (Fader and Hardie, 2009; Ascarza et al., 2017). Some of these limitations include:

1. Processing of complex (sparse, high-dimensional, and unstructured) data: Most existing

survival models can only take a few high-signal variables that are engineered by experts and based

on existing theories. Though there have been efforts to use more data by incorporating cohort

effects (Schweidel et al., 2008) or customer heterogeneity (Fader and Hardie, 2010), current models

cannot natively handle sparse, high-dimensional, unstructured data (Witten and Tibshirani, 2010),

resulting in the loss of valuable signals hidden in customer databases.

2. The simplicity of traditional models creates limitations in reflecting complex real-world

problems. For example, time-varying covariates present hurdles for existing models

(Ascarza et al., 2017). While a majority of traditional models assume constant covariates (Guillén

et al., 2012), realistically, latent customer characteristics that influence churn are likely to change

over time (e.g., changes in marketing trends, ongoing dynamic customer relationships, learning

effects in mobile games, etc.). Indeed, time-varying covariates often cause biases when estimating

the hazard function in the current churn models (Fader and Hardie, 2009; Ascarza et al., 2017).

Recent advances in predictive analytics allow us to address each of these limitations. Specifically:

1. Deep learning (DL) approaches show superior capability in dealing with sparse, high-dimensional, and unstructured data (LeCun et al., 2015). Representation learning, a set of methods that permit algorithms to automatically discover various levels of abstraction/information from raw data, allows extremely flexible input data types (Bengio et al., 2013) such as images (Krizhevsky et al., 2012) and text (Collobert et al., 2011). Additionally, the automated nonlinear feature learning ability (Bengio et al., 2013) helps to extract more non-obvious signals from the data than human-engineered features can.

2. Machine learning (ML) frameworks such as Hidden Markov Models can incorporate time-

varying covariates in survival analysis (Netzer et al., 2008). In addition, more recent DL techniques,

such as different varieties of Recurrent Neural Network (RNN), allow survival models to handle

time-varying covariates effectively by considering long-term contexts among multivariate features

(Martinsson, 2016; Lee et al., 2018a). This line of work enables retention models in business contexts



to capture customer relationship dynamics, such as how the influence of marketing activities changes

over time (Netzer et al., 2008; Montoya et al., 2010).

Despite their superiority, one potential drawback of DL approaches is their lack of interpretability. However, recent eXplainable Artificial Intelligence (XAI) techniques provide much-needed interpretability in predictive analytics. XAI turns complex black-box AI models into interpretable

glass-box models while retaining a high level of model performance (Lu et al., 2020; Rai, 2020). In

other words, DL models capture complex nonlinear signals from real-world data, and XAI inter-

prets them in the form of human-understandable explanations. Notably, XAI can, along with deep

sequential models, provide time-varying dynamics regarding customer behavior and related factors

that traditional models may have missed.

We introduce Weibull Time To Event TCN (WTTE-TCN), an interpretable deep survival model that achieves superior performance at lower computational cost and provides two different levels of explanation (model- and individual-level) via recent XAI methods. It can also incorporate textual data through Natural Language Processing (NLP) augmentation. To do so, we enhance Martinsson

(2016)’s Weibull Time To Event RNN (WTTE-RNN) in several ways. WTTE-RNN combines the

best of both the RNN model (efficient in processing multivariate time series data) and literature-

proven approaches to survival analyses via a Weibull-based time to event model. The details of the

improvement are as follows:

• We enable more efficient training while improving performance. Mainly, we enhance the

RNN layer of Martinsson (2016) by using Temporal Convolutional Network (TCN). TCN

outperforms canonical RNNs, such as Long Short Term Memory (LSTM) or Gated Recurrent

Unit (GRU), across a diverse range of tasks and datasets while demonstrating longer effective

memory (Bai et al., 2018). We also apply the Attention Mechanism (which acts like random

access memory across time and input data) (Yang et al., 2016) and Layer Normalization

(which adjusts the mean and the variance of the summated inputs within each layer) (Ba

et al., 2016). These methods have proven to allow more efficient training for a wide range

of DL applications (Raffel and Ellis, 2015; Zhang et al., 2019a). Our test results show that our TCN model reduces mean absolute error (MAE) by 17% while using about half as many parameters as Martinsson (2016).



• We apply the recent Post-hoc eXplainable Artificial Intelligence (XAI) methods (for a sur-

vey, please see Guidotti et al., 2018) to provide human-understandable intuitions for our DL

model’s predictions. As with other DL-based survival models (Ranganath et al., 2016; Lee

et al., 2018a; Kvamme et al., 2019), WTTE-RNN does not provide interpretability. With XAI

methods, model- and individual-level attribute importance and contribution can be computed

while maintaining high-level performance. The end result is a particularly flexible and accu-

rate model equipped to extract nonlinear and temporally diffuse signals for churn that is still

able to justify its predictions to better guide managers.

• We enhance the stability of the training process. Exploding gradients are a problem in that

accumulated gradients from large errors require huge updates to neural network model weights,

resulting in unstable or failed learning processes (Goodfellow et al., 2016). Notably, it occurs

more commonly with RNN models, and gets worse when handling sparse and messy real-world

data (Pascanu et al., 2013; Brownlee, 2018). WTTE-RNN is not an exception, and prior work

has argued that it sometimes fails in the training process (Maragall Cambra, 2018; Palau et al.,

2018). To allow more stable and efficient training, while also preventing exploding gradients,

we implemented various optimization techniques (e.g., Gradient Clipping, Rectified Adam,

and Lookahead Optimizer).

Our contribution is as follows. First, we propose a novel, interpretable deep survival method.

Second, we present a demonstration and application of the said method on a novel proprietary

dataset. Lastly, we present exploratory results that examine the model’s ability to capture churn

signals from complex (sparse, high-dimensional, and unstructured) data and provide domain-specific

marketing implications.

We evaluate our model on a real-world mobile gaming dataset that contains highly sparse and

complex customer behaviors. In this process, our TCN network handles the temporality and large

receptive fields of complex sequential data through powerful convolutions and dilations, while min-

imizing the loss of information (He and Zhao, 2019). As a result, our model reduced MAE by 56-81% compared to traditional survival models and by 17-51% compared to prior deep survival models. Also,

through current XAI methods, we assess the individual- and model-level dynamics between service

characteristics, customer activities, and churn decisions as they evolve over time.



The rest of the paper is organized as follows. Section 2 reviews prior research. Section 3

explicates the model. In Section 4, we evaluate our model with a real-world dataset. Section 5

discusses the implications of our work and suggests opportunities for future research.

2 Related Works

In the first part of this section, we explain the advances in survival models in the business context.

We mainly focus on the efforts prior work has made to handle time-varying covariates and to

incorporate complex data. Second, we describe AI interpretability and XAI methods in business

and survival model contexts.

2.1 Survival Models in Business Context

Prior marketing research has built survival models based on theoretical backgrounds (Ascarza et al.,

2017). Briefly, these studies have used simple stochastic models such as BG (beta-geometric) (Pot-

ter and Parker, 1964) and sBG (shifted-beta-geometric) (Fader and Hardie, 2007) for contractual

settings that are aware of when customers become inactive, and Pareto/NBD (Pareto Type II and negative binomial distribution) (Schmittlein et al., 1987), BG/NBD (beta-geometric and negative binomial distribution) (Fader et al., 2005), and BG/BB (beta-geometric and beta-binomial) (Fader and Hardie, 2010) for non-contractual settings that do not observe inactivity (Fader and Hardie, 2009).

There have also been efforts to incorporate features not yet considered in prior work,

such as customer heterogeneity and cross-cohort effects in marketing mix activities (Schweidel et al.,

2008), the frequency and amount of direct marketing activities across individuals and over time

(Schweidel and Knox, 2013), customer’s complaints and recoveries (Knox and Van Oest, 2014), and

customer’s service experiences, such as frequency and recency of past purchases (Braun et al., 2015).

Despite these efforts, these studies focus on classical statistical methods that cannot directly handle

complicated (sparse, higher dimensional, and unstructured) data. As a result, existing survival

churn models in a business context can only take a few high signal variables that are engineered by

experts and based on existing theories (Witten and Tibshirani, 2010; Ascarza et al., 2017). With

easier access to various data sources, the limitations of traditional survival models are becoming

more apparent.



Time-varying covariates have been a critical problem for survival models handling uni- and multivariate time series data (Ascarza et al., 2017). Because of their simplicity, traditional models cannot capture patterns in data that change over time (Žliobaitė et al., 2016). For example, the Cox Proportional

Hazard model assumes the explanatory variable remains constant over time (Cox, 1972). However,

in real-world business settings, customer behaviors, which contain critical clues to predicting cus-

tomers’ decisions, change over time (Fader et al., 2004). To handle these problems, prior studies

have developed various approaches, such as changepoint models and sequence models.

Changepoint models capture the underlying evolution points of features and split time-varying

covariates into multiple time-fixed covariates (Koop and Potter, 2004). In marketing literature,

Fader et al. (2004) build on a changepoint framework which nests simple, theory-based models of

customer buying behaviors. Despite the advantages, (parametric) changepoint models often rely on

strong assumptions and face difficulties in modeling complex temporal patterns that are hard to specify manually. Also, (non-parametric) changepoint models have limited ability to address some patterns (e.g., changes that happen at arbitrary timescales or gradually over varying durations)

because they focus on detecting abrupt changes (Ebrahimzadeh et al., 2019).

In sequence models, the current state depends on the previous stochastic inputs, which helps to capture signals from a non-stationary process (Kuznetsov, 2016). In a business context,

Autoregressive (AR) models have been widely used in prediction tasks related to finance (Pacurar,

2008), rather than churn prediction in marketing and IS. These AR models assume that the dynamics

in the market are gradual or smooth, not complex (Netzer et al., 2008). In other words, such

approaches are restrictive when it comes to handling the complexity of recent datasets.

Breakthroughs in machine learning technology have led to a wealth of data and superior perfor-

mance in predictive systems, ranging from recommendations on e-commerce and content filtering

on social networks to image processing and autonomous cars (LeCun et al., 2015). As machine

learning methods advance, we are starting to see more sophisticated applications that utilize more

data for better decision making (Hosanagar, 2019). With regard to survival models, for example, recent clinical research has suggested novel applications for churn prediction while addressing newly discovered high-dimensional data through machine learning-based survival models (Pölsterl et al., 2016; Yousefi et al., 2017).

It is well-documented that the Hidden Markov Model (HMM) can handle nonlinearity and time-



varying covariates in sequential data. HMM represents probability distributions over sequences of

visible outputs, and it treats observations as a result of previous unobserved states (Ghahramani,

2001). In a business context, HMM has been used to provide more sophisticated rationales for

underlying dynamics in customer behavior (Montoya et al., 2010; Ascarza et al., 2017). For example,

Netzer et al. (2008) incorporate HMM to account for the evolution of relationship dynamics as

a result of interactions between the customer and the company. However, while HMM is an ML

method that assesses customer relationship dynamics over time, it has limitations in natively addressing

complex (sparse, high-dimensional, and unstructured) datasets compared to recent DL methods.

Also, few HMM studies have examined time-varying dynamics while handling data censoring.

Recent advances in deep learning have dramatically enhanced the state of the art in a wide range of tasks, including sequence processing (LeCun et al., 2015). Thus, prior research has tried to combine the advantages of both DL models (e.g., exceptional performance when handling complex data, nonlinearity, and time-varying covariates) and survival models (e.g., addressing censored data). Katzman et al. (2018) demonstrate that a deep survival model (DeepSurv), a simple combination of a multilayer perceptron and the Cox partial log-likelihood rather than a deep sequential model, outperforms traditional survival models on multiple electronic health record (EHR) datasets. Lee et al. (2018a) propose a multi-task learning based deep survival model (DeepHit) to address competing risks (the competing nature of different causes for the same event) more natively. Martinsson (2016) suggests a sequential deep learning-based survival approach (WTTE-RNN) that effectively handles multivariate time series data, as well as the long-term dynamics among time-varying covariates. These deep survival approaches show superior performance when handling complex datasets compared to traditional methods.

However, this line of work is heavily focused on medical settings, which do not map directly onto business contexts, as survival analysis originated with, and has mainly been used by, medical researchers to measure the lifetimes of populations (Prinja et al., 2010). Since the temporality and irregularity

of data (due to patients’ irregular visits) is a significant challenge in the medical context (Xiao

et al., 2018), medical deep survival models have been focused on making good use of time-invariant

variables, such as electronic health records (EHR) (Ranganath et al., 2016; Katzman et al., 2018)

and genetic and protein expression features (Yousefi et al., 2017). On the other hand, in many real-world business cases, firms periodically collect an incomparably larger amount of customer activity data in their databases. Notably, many of these data are time-varying features. In this context, our research

suggests a TCN-based survival approach that effectively handles complex sequences of multivariate

time-varying features (Lea et al., 2017; He and Zhao, 2019).

2.2 Interpretability in Survival Models

As AI systems become widespread, increasing scholarly attention has been given to AI interpretabil-

ity, including marketing and IS literature (Lee et al., 2018b; Rai, 2020). However, interpretability,

while not yet well defined in the literature (Lipton, 2016) and varying across different contexts

(Rudin, 2019), broadly refers to the understandability of a model regarding how and why it made

certain predictions (Molnar, 2018; Lu et al., 2020).

Prior work claims that AI interpretability provides the following benefits to inscrutable business

AI systems. First, providing an effective interpretation for AI systems can increase users’ trust in

the system (Ribeiro et al., 2016a; Carvalho et al., 2019). Second, interpretability is essential to

identify potential algorithmic bias (Doshi-Velez and Kim, 2017), which refers to those AI results

that discriminate on arbitrary grounds, such as race, gender, and ethnicity (Hajian et al., 2016) and

can be critical to a firm’s relationship with customers (Rai, 2020). Third, interpretability provides

business implications from complex AI systems in terms of human-understandable forms (Du et al.,

2019; Rai, 2020).

Advances in interpretability techniques are allowing predictive analytics to pursue both perfor-

mance and explainability, and as such XAI has received enormous scholarly attention (Rai, 2020).

eXplainable AI (XAI) refers to a suite of methods that provide human-understandable explanations for AI models (Gunning, 2017). Notably, current model-agnostic (or post-hoc) techniques derive approximate interpretations by analyzing the patterns between input and output features in a trained model. By doing so, they can be applied to any machine learning model, regardless of its structure and complexity, and allow for the conversion of black-box models to explainable glass-box ones, while retaining a high level of model performance (Molnar, 2018; Rai, 2020). XAI can flexibly create better explanations that consider the context of the task and the type of data

(e.g., LIME (Ribeiro et al., 2016a), SHAP (Lundberg and Lee, 2017), and Grad-CAM (Selvaraju

et al., 2017)), compared to traditional, intrinsically interpretable approaches that are limited to a

certain form of explanation dependent on the selection of model (Ribeiro et al., 2016a).



Despite the advantages of XAI, most DL survival models have focused on improving perfor-

mance rather than providing intuitive explanations from complex datasets (Martinsson, 2016;

Ranganath et al., 2016; Lee et al., 2018a; Kvamme et al., 2019). Some studies have tried to de-

rive explanations from deep survival models, but these are limited to traditional interpretability

perspectives that discuss cohort-level differences (Katzman et al., 2018) or variable importances

(Luck et al., 2017). Yousefi et al. (2017) is one of the few studies that applies current XAI to deep survival models (though it is not based on sequential DL methods). The authors describe how each feature contributes to predicted cancer risk while dealing with high-dimensional medical data. However, Yousefi et al. (2017) focuses only on global feature importance. Local interpretability is difficult to obtain via traditional methods (Du et al., 2019) and is expected to provide new business implications for marketing professionals and increased value to customers by enabling elaborate personalization services (Carvalho et al., 2019; Rai, 2020).

Our approach flexibly consumes complex data and attempts to explain the churn in a bottom-up

(data-driven) fashion on two different levels: global and local interpretability. To the best of our knowledge, no prior study takes advantage of accurate, yet highly complex, sequential DL models in conjunction with current XAI algorithms to address survival-based churn management.

Notably, we demonstrate how recent explanation techniques such as additive feature attribution

methods (i.e. SHAP) and local interpretability can uncover different types of business implications

in a complex (sparse, high-dimensional, and unstructured) real-world dataset.

3 Model

We propose an interpretable deep survival model that can provide human-understandable explana-

tions for nonlinear and time-varying dynamics from big data. To do so, we build on Martinsson

(2016)’s WTTE-RNN in several ways. We first provide an overview of the WTTE-RNN before

laying out our model in detail.



Figure 1: WTTE-RNN. Based on sequential data up to the current state, WTTE-RNN predicts the probability of the nearest future event in the form of a Weibull distribution over the prediction period. The mode of the Weibull PDF has the highest probability of churning and is taken as the predicted churning time. Figure recreated from Martinsson (2016).

3.1 Baseline: WTTE-RNN (Weibull Time To Event Recurrent Neural Network)

WTTE-RNN (Martinsson, 2016) combines the best of both the RNN model (efficient in processing

time series data) and literature-proven approaches to survival analyses via a Weibull-based time to

event model. To simplify, the RNN part processes multivariate time series data and outputs two

parameters (α for scale and β for shape) to be used by the probability density function (PDF) of

the Weibull distribution to fit survival data.

Traditional survival models (e.g., Cox Proportional Hazard model) often assume that the hazard

rate of a customer increases over time (Ascarza et al., 2017). However, in many real-world cases (e.g.,

for high loyalty customers and in the case of lock-ins), the longer the duration, the lower the hazard

rate. Some models do solve this “individual-level duration dependence” with flexible distributions

such as Weibull, Gamma, or Log-logistic distribution for the hazard function, incorporating both

increasing and decreasing patterns of the hazard rate (Braun and Schweidel, 2011; Fader et al., 2018).

WTTE-RNN combines flexible Weibull distribution and a sequential deep learning framework.

The probability density function (PDF) of Weibull distribution can be expressed with two pa-

rameters (α, β) as follows:

  
$PDF(x, \alpha, \beta) = \frac{\beta}{\alpha}\left(\frac{x}{\alpha}\right)^{\beta - 1} e^{-\left(\frac{x}{\alpha}\right)^{\beta}} \quad \text{where } x \geq 0 \qquad (1)$



Figure 2: U-Shaped Failure Rate and Shape Parameter β in the Weibull Distribution. Early stage: decreasing failure rate (β < 1); maturity stage: constant failure rate (β = 1); decline stage: increasing failure rate (β > 1). Figure recreated from Lu et al. (2016).

With this form, Weibull distribution can model various patterns of hazard rates (also called

failure rates) such as decreasing (β < 1), constant (β = 1), or increasing (β > 1) by changing the shape parameter β. Many platform services show U-shaped hazard rates over time (see

Figure 2). In the early stage of a certain service, the firm performs a number of trials and errors,

and users churn easily. However, as optimization in the service progresses, more users become

satisfied, and the hazard rate also decreases. Then, the hazard rate stays stable during the maturity

stage. Later, in the decline stage, the service loses its competitive edge due to internal or external

factors, such as the emergence of new competitors, and the hazard rate increases. Additionally,

at the individual level, each user may have a different scale of u-shaped hazard rate depending

on their characteristics. Unlike traditional models (e.g., Cox Proportional Hazards Model based

on Exponential distribution), which assume an increasing hazard rate, Weibull distribution-based

models can flexibly explain all of the decreasing, constant, and increasing phenomena (Lu et al.,

2016).
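To make the role of the shape parameter concrete, the short Python sketch below (our illustration, not code from the paper) plots the Weibull hazard rate $h(t) = \frac{\beta}{\alpha}(t/\alpha)^{\beta-1}$ for a decreasing (β < 1), constant (β = 1), and increasing (β > 1) case; the parameter values are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

def weibull_hazard(t, alpha, beta):
    """Weibull hazard (failure) rate: h(t) = (beta / alpha) * (t / alpha) ** (beta - 1)."""
    return (beta / alpha) * (t / alpha) ** (beta - 1)

t = np.linspace(0.1, 30, 300)        # start above 0: for beta < 1 the hazard diverges at t = 0
for beta in (0.7, 1.0, 1.5):         # decreasing, constant, and increasing failure rates
    plt.plot(t, weibull_hazard(t, alpha=10.0, beta=beta), label=f"beta = {beta}")
plt.xlabel("time")
plt.ylabel("hazard rate")
plt.legend()
plt.show()
```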

The objective function of WTTE-RNN comes from a typical survival function (Tableman and

Kim, 2003), and the goal is to maximize the following log likelihood:

$\sum_{n=1}^{N}\sum_{t=0}^{T_n} \left( u_t^n \cdot \log\!\left[\Pr(Y_t^n = y_t^n \mid x_{0:t}^n)\right] + (1 - u_t^n) \cdot \log\!\left[\Pr(Y_t^n > y_t^n \mid x_{0:t}^n)\right] \right) \qquad (2)$

$y_t^n$: time to event for user $n = 1, \ldots, N$ at timestep $t = 0, 1, \ldots, T_n$

$x_{0:t}^n$: data up to time $t$

$u_t^n$: $u_t^n = 0$ if the datapoint is right censored, $u_t^n = 1$ if not censored

After manipulations to derive the two Weibull parameters (α, β), the final log-likelihood (objective function) is shown in Table 1. We choose one of the two objective functions, depending on whether the scale of time is continuous or discrete, and maximize it to learn the model (Martinsson, 2016).

Continuous Weibull distribution: $u \cdot \log\!\left(\frac{\beta}{\alpha}\left(\frac{y}{\alpha}\right)^{\beta-1}\right) - \left(\frac{y}{\alpha}\right)^{\beta}$

Discrete Weibull distribution: $u \cdot \log\!\left(e^{\left(\frac{y+1}{\alpha}\right)^{\beta} - \left(\frac{y}{\alpha}\right)^{\beta}} - 1\right) - \left(\frac{y+1}{\alpha}\right)^{\beta}$

where $u = 0$ if right censored and $u = 1$ if uncensored.

Table 1: The Objective Functions of Weibull Survival Models
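As an illustration of how these objective functions translate into code, the sketch below implements the discrete-time Weibull log-likelihood from Table 1 as a training loss (negated for minimization). This is our own minimal NumPy sketch for exposition, not the authors' implementation; `alpha`, `beta`, `y`, and `u` follow the notation above, and a small epsilon is added for numerical stability.

```python
import numpy as np

def discrete_weibull_loglik(y, u, alpha, beta, eps=1e-9):
    """Discrete Weibull log-likelihood from Table 1.

    y     : observed time to event (here, number of levels until churn/censoring)
    u     : 1 if the event (churn) was observed, 0 if right censored
    alpha : Weibull scale parameter (> 0)
    beta  : Weibull shape parameter (> 0)
    """
    hazard0 = (y / alpha) ** beta            # cumulative hazard at y
    hazard1 = ((y + 1.0) / alpha) ** beta    # cumulative hazard at y + 1
    return u * np.log(np.exp(hazard1 - hazard0) - 1.0 + eps) - hazard1

def weibull_loss(y, u, alpha, beta):
    """Negative mean log-likelihood, suitable as a training loss."""
    return -np.mean(discrete_weibull_loglik(y, u, alpha, beta))

# Toy usage: two users, one churned at level 12, one censored at level 50.
y = np.array([12.0, 50.0])
u = np.array([1.0, 0.0])
print(weibull_loss(y, u, alpha=np.array([15.0, 60.0]), beta=np.array([1.2, 1.1])))
```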

3.2 Our Model: WTTE-TCN (Weibull Time To Event Temporal Convolutional Network)

3.2.1 Model Overview

Figure 3 presents the overall structure of our model, WTTE-TCN. In this section, we explain each

module of the suggested model sequentially. WTTE-TCN consists of bidirectional TCNs, attention

layers, unstructured data processing layers (i.e., GloVe (Pennington et al., 2014) for text data), a

fully connected layer, a Weibull survival loss layer, and model-agnostic XAI methods. The types of

input data can be multi- and univariate time series, cross-sectional, and unstructured, such as text

and image data. The output is the Weibull probability distribution of each customer’s churn.

The TCN layer abstracts sequential events while considering their context. The attention layer

helps the TCN layer efficiently handle the data’s sparsity and long-term dependencies. For unstruc-

tured data, additional pre-trained layers are used for transfer learning, which embeds complex data

into the organized lower-dimensional vector space. For example, through pre-trained NLP layers,

similar words are embedded near each other, resulting in improved performance and efficiency. Ab-

stracted inputs are concatenated and provided to a fully connected layer, which distills information

one more time. The entire learning process is conducted to minimize the Weibull survival loss.

3.2.2 TCN (Temporal Convolutional Network)

We replace the RNN layer of Martinsson (2016) with a TCN that retains the time series processing

ability of RNN, but adds the computational efficiency of convolutional networks (Lea et al., 2017).



Figure 3: WTTE-TCN. (Architecture overview: multivariate time series inputs pass through bidirectional TCN and attention layers (Yang et al., 2016); unstructured inputs (e.g., daily text messages or posts) pass through unstructured data processing layers (e.g., GloVe for text data, VGG16 for image data) followed by their own bidirectional TCN and attention layers; the abstracted representations are concatenated, fed to a fully connected layer, and trained against the Weibull survival loss; post-hoc XAI methods then produce global and local explanations.)


Figure 4: Temporal Convolutional Network (TCN). (Left: stacked residual blocks with dilation factors d = 1, 2, 4 mapping the input sequence x_0, ..., x_t to the output sequence p_0, ..., p_t. Right: each residual block stacks causal convolutions with layer normalization, ReLU activations, and dropout, plus a residual connection.) Figure recreated from Moor et al. (2019).

TCN was chosen since it outperforms advanced RNNs such as LSTM or GRU across various se-

quential tasks while handling longer input sequences. TCN is faster than RNN while requiring less

memory and computational power (Bai et al., 2018), critical advantages in the era of big data.

TCN has the following notable characteristics: 1) the convolutions in the model are time-stamp

aware, implying that no future information is leaked during processing; and 2) the model structure

can map any length input sequence to the same length output sequence, just as RNNs do (Bai

et al., 2018). TCN consists of stacked residual blocks, which in turn consist of convolutional layers,

activation layers, normalization layers, and regularization layers (see Figure 4 and Lea et al. (2016)

for more details).

Regarding convolution layers, a temporal block is constructed by stacking several convolutional layers. In detail, for an input sequence $x \in \mathbb{R}$, output sequence $u \in \mathbb{R}$, and a convolution filter of size $k$, $f : \{0, \ldots, k-1\} \rightarrow \mathbb{R}$, the $r$th-level dilated convolution operation $F$ at time $t$ is defined as

$F(t) = (x *_d f)(t) = \sum_{i=0}^{k-1} f(i) \cdot x_{t - d \cdot i} \qquad (3)$

$u = (F(x_k), F(x_{k+1}), \ldots, F(x_n)) \qquad (4)$

where $d$ is the dilation factor, which can be written as $(k-1)^{r-1}$ to cover an exponentially wide receptive field.
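To make the dilated causal convolution in Equation (3) concrete, the sketch below implements it directly in NumPy for a univariate sequence; this is an illustrative implementation we add here (not the authors' code), and a real TCN would stack such convolutions inside residual blocks as in Figure 4.

```python
import numpy as np

def causal_dilated_conv(x, f, d):
    """Equation (3): F(t) = sum_i f(i) * x[t - d*i], with zero padding for t - d*i < 0.

    x : 1-D input sequence
    f : filter weights of length k
    d : dilation factor
    """
    k = len(f)
    out = np.zeros_like(x, dtype=float)
    for t in range(len(x)):
        for i in range(k):
            j = t - d * i
            if j >= 0:                    # causal: only past and current timesteps contribute
                out[t] += f[i] * x[j]
    return out

x = np.arange(10, dtype=float)            # toy input sequence
print(causal_dilated_conv(x, f=[0.5, 0.3, 0.2], d=2))
```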

The Rectified Linear Unit (ReLU) is used as an activation function to provide nonlinearity to

the output of the convolutional layers and is defined as

ReLU (x) = max(0, x) (5)

One of the obstacles of DL is that the gradients for the weights in one layer are highly correlated

to the outputs of the previous layer, resulting in increased training time. Layer normalization is

designed to alleviate this “covariate shift” problem by adjusting the mean and variance of the sum-

mated inputs within each layer (Ba et al., 2016). Though the theoretical motivation for decreasing

covariate shift is controversial in technical ML literature, the practical advantage of normalization

methods, which allow for faster and more efficient training, has proven indispensable to a wide range

of DL applications (Zhang et al., 2019a). The statistics of layer normalization over each hidden unit

in the same layer are written as:

$\mu^l = \frac{1}{H} \sum_{i=1}^{H} a_i^l \qquad (6)$

$\sigma^l = \sqrt{\frac{1}{H} \sum_{i=1}^{H} \left(a_i^l - \mu^l\right)^2} \qquad (7)$

$\bar{a}_i^l = \frac{a_i^l - \mu^l}{\sigma^l} \qquad (8)$

where $\bar{a}_i^l$ is the normalized summed input to the $i$th hidden unit in the $l$th layer, and $H$ denotes the number of hidden units in a layer. We also apply layer normalization to the fully connected layers of our model to improve learning efficiency and stability.
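For illustration, the following NumPy sketch (ours, not the paper's code) applies Equations (6)-(8) to one layer's pre-activations.

```python
import numpy as np

def layer_norm(a, eps=1e-5):
    """Equations (6)-(8): normalize the summed inputs a (shape [H]) within a layer."""
    mu = a.mean()                              # Eq. (6): mean over the layer's hidden units
    sigma = np.sqrt(((a - mu) ** 2).mean())    # Eq. (7): standard deviation over the layer
    return (a - mu) / (sigma + eps)            # Eq. (8), with eps for numerical stability

print(layer_norm(np.array([2.0, -1.0, 0.5, 3.5])))
```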

Dropout is an essential regularization technique to prevent the over-fitting of the neural network.

The idea is to randomly drop (hidden and visible) units from the network during training, which

prevents units from co-adapting too often. By doing so, dropout improves the generalization of

neural networks by allowing the training process to be an efficient stochastic approximation of an

exponential ensemble of “thinned” networks (Srivastava et al., 2014).



3.2.3 Statistical Methods for Faster and More Stable Training

Complex and noisy real-world data make the training process extremely unstable and prone to converging to poor local minima, especially when handling sequential tasks with RNN models (Pascanu et al., 2013). Notably, the exploding gradients problem (in which accumulated gradients from large errors cause huge updates to neural network weights) results in an unstable or failed learning process (Goodfellow et al., 2016). WTTE-RNN is not an exception, and related literature has argued that it sometimes fails in the training process (Maragall Cambra, 2018; Palau et al., 2018). To solve these problems, we apply various techniques, such as 1) the Attention Mechanism,¹ 2) Gradient Clipping,² 3) Rectified Adam,³ and 4) the Lookahead Optimizer.⁴ These methods enable more efficient and stable training while achieving better performance. The details of each implementation are described in Appendix C.
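As a hedged sketch of how such stabilization techniques can be wired into a training loop (our illustration, not the authors' Appendix C implementation), the PyTorch snippet below combines a RAdam optimizer with gradient-norm clipping; a Lookahead wrapper, which is not part of core PyTorch, could additionally be wrapped around the optimizer using a third-party implementation. The model, data, and `weibull_loss` below are placeholders.

```python
import torch
import torch.nn as nn

# Placeholder model and data; in practice this would be the WTTE-TCN network and
# batches of (50, 26) game-level sequences.
model = nn.Sequential(nn.Linear(26, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.RAdam(model.parameters(), lr=1e-3)   # Rectified Adam

def weibull_loss(outputs, targets):
    # Placeholder: the discrete Weibull negative log-likelihood would go here.
    return ((outputs - targets) ** 2).mean()

for features, targets in [(torch.randn(8, 26), torch.randn(8, 2))]:  # toy batch
    optimizer.zero_grad()
    loss = weibull_loss(model(features), targets)
    loss.backward()
    # Gradient clipping: rescale gradients whose norm exceeds the threshold.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```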

3.2.4 Text Preprocessing with GloVe and TCN

To incorporate text data into our survival model, we integrate GloVe (Pennington et al., 2014), a DL-based NLP model that converts words into meaningful representation vectors. GloVe improves on Word2Vec (Mikolov et al., 2013), a method for learning vector representations of words based on the idea that similar words should be embedded near each other in a lower-dimensional space, by also incorporating global word co-occurrence relationships. Figure 5 explains the text processing with GloVe and TCN. We assign each word a unique number through the tokenization process, and the GloVe layer maps these numbers to representation vectors. Then, the TCN layer distills (i.e., feature-extracts) information from the resulting vector sequence and passes it to our main survival model.

The entire process is shown in Figure 3.


¹ The attention mechanism allows the sequential model to focus more on the relevant parts of the input data by acting like a random access memory across time and input data (Bahdanau et al., 2014). By doing so, the attention mechanism improves the training efficiency and performance of the model.
² Gradient clipping rescales the gradients when their norm exceeds a threshold value (Pascanu et al., 2013).
³ Despite their faster and more stable training, stochastic optimizers (e.g., Adam and RMSProp) suffer from a variance issue: in the early stage of training, problematically large variance creates a risk of converging to undesirable local optima (Liu et al., 2019). Rectified Adam (RAdam) solves this problem by incorporating a warmup (an initial training phase with a much smaller learning rate), which has been shown to reduce the variance.
⁴ Lookahead iteratively updates two sets of weights ("fast weights" and "slow weights") and then interpolates them. By doing so, Lookahead improves training stability and reduces the variance of optimization algorithms such as Adam, SGD, and RAdam (Zhang et al., 2019b).



Figure 5: Text Processing Layers (GloVe + TCN). (A sequence of words is tokenized into a sequence of numbers, encoded by the GloVe layer into a sequence of vectors, processed by a TCN layer, and concatenated into the main model.)
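A minimal sketch of this pipeline (our illustration, not the authors' code): words are mapped to integer ids, ids are mapped to pre-trained GloVe vectors, and the resulting vector sequence is what the text-side TCN consumes. `glove_vectors` stands in for a loaded GloVe file and is hypothetical here, as are the sequence length and embedding size.

```python
import numpy as np

texts = ["love this level", "too hard quitting soon"]   # toy daily messages
max_len, embed_dim = 20, 100
glove_vectors = {}   # placeholder: word -> 100-d vector, loaded from a GloVe file

# 1) Tokenize: assign each word a unique integer id (0 is reserved for padding).
vocab = {w: i + 1 for i, w in enumerate(sorted({w for t in texts for w in t.split()}))}

# 2) Encode and pad each message to a fixed length.
def encode(text):
    ids = [vocab[w] for w in text.split()][:max_len]
    return ids + [0] * (max_len - len(ids))

# 3) Build the embedding matrix; row i holds the GloVe vector of word id i.
embedding = np.zeros((len(vocab) + 1, embed_dim))
for word, idx in vocab.items():
    if word in glove_vectors:
        embedding[idx] = glove_vectors[word]

# A message becomes a (max_len, embed_dim) matrix fed to the text-side TCN.
vectors = embedding[np.array(encode(texts[0]))]
print(vectors.shape)   # (20, 100)
```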

3.3 XAI Methods (SHAP) For Interpretability

To interpret WTTE-TCN predictions, we provide two types of post-hoc model explanations: global

interpretability at the model level and local interpretability at the level of individual predictions.

Global interpretation captures and explains the overall importance of features at the model level.

Extracting the feature importance from a trained classifier or coefficients from a linear regression

model are typical examples of global interpretation. In contrast, local interpretation attempts to

explain “how the model behaves in the vicinity of the instances being predicted,” and therefore fo-

cuses on delivering instance-by-instance explanations (Ribeiro et al., 2016b). In our research setup,

this would entail identifying the main features driving the churn prediction for an individual user.

Further, to provide clarity on the uncertainty in prediction, we provide prediction intervals derived

from Weibull distribution output (Figure 13).
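Since the model outputs a full Weibull distribution per user, a prediction interval can be read off the Weibull quantile function $Q(p) = \alpha(-\ln(1-p))^{1/\beta}$; the sketch below (ours, with arbitrary parameter values) computes a central 90% interval this way.

```python
import numpy as np

def weibull_quantile(p, alpha, beta):
    """Inverse CDF of the Weibull distribution: Q(p) = alpha * (-ln(1 - p))**(1/beta)."""
    return alpha * (-np.log(1.0 - p)) ** (1.0 / beta)

alpha, beta = 40.0, 1.4          # example per-user parameters predicted by the model
lower = weibull_quantile(0.05, alpha, beta)
upper = weibull_quantile(0.95, alpha, beta)
print(f"90% prediction interval for churn level: [{lower:.1f}, {upper:.1f}]")
```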

We provide global interpretation in two ways. The first is Global Feature Importance, which indicates how important each feature is for outcome prediction. The second is the Partial Dependence Plot (PDP), which shows the marginal effect of each feature on the predicted outcome as the value of the feature changes from its minimum to its maximum (Friedman, 2001). For

local interpretation, we illustrate how instance-level feature importance varies in time.

To implement global and local feature importance and PDPs, we apply SHapley Additive exPlanations (SHAP), which 1) is model-agnostic, and 2) employs

cooperative game theory to provide a more accurate contribution of each feature to the outputs

(Molnar, 2018). Notably, for the efficiency in estimation, we implement Deep SHAP (Lundberg and

Lee, 2017), which greatly improves computational performance by decomposing the approximation

process for the whole network into smaller ones (see Shrikumar et al. (2017) for more details). As



a model-agnostic XAI technique, SHAP separates the explanations from the ML model. By doing so, SHAP has advantages over model-specific XAI techniques (e.g., coefficients of linear regression models), which can naturally provide explainability and transparency in the decision process but only under restrictive model complexity (Ribeiro et al., 2016b; Rai, 2020). Specifically:

1) Model flexibility: Model-agnostic methods can be implemented with any models regardless of

their structures and complexities. So, we can interpret complex black-box DL models via explainable

glass-boxes (Molnar, 2018; Rai, 2020).

2) Explanation flexibility: The explanation is not limited to a certain form. In this paper,

we show various types of explanations, such as global and local feature importances and partial

dependence plots.

3) Representation flexibility: Model-agnostic XAI methods can interpret various types of feature

representations (e.g., sequence, text, image).

Local feature importance is equal to the Shapley value $\phi_j^{(i)}$ (i.e., the contribution of feature $j$ for a given instance $i$) and can be estimated through Deep SHAP (please see Lundberg and Lee (2017) for more details). Global feature importance is calculated by aggregating the absolute Shapley (or SHAP) values by feature $j$:

$I_j = \sum_{i=1}^{n} \left|\phi_j^{(i)}\right| \qquad (9)$

In terms of the partial dependence plots (PDP), we simply draw a point plot with the feature value on the x-axis and the matching Shapley value on the y-axis. Since the Shapley value $\phi_j^{(i)}$ captures the contribution of feature $j$ for instance $i$ while accounting for the local condition (i.e., the effect of the other feature values), we do not need to calculate partial dependences separately. Mathematically, the SHAP dependence plot is defined as follows:

$\left\{\left(x_j^{(i)}, \phi_j^{(i)}\right)\right\}_{i=1}^{n} \qquad (10)$
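As a hedged sketch of how these quantities can be computed with the open-source `shap` package (our illustration; the paper's exact Deep SHAP setup is not shown), the snippet below uses a tiny stand-in network and random data in place of the trained WTTE-TCN and the real inputs.

```python
import numpy as np
import shap            # open-source SHAP package (Lundberg and Lee, 2017)
import torch
import torch.nn as nn

# Stand-in for a trained deep survival network: maps a flattened (50, 26) sequence to
# the two Weibull parameters (alpha, beta). The real WTTE-TCN and data would be used here.
model = nn.Sequential(nn.Linear(50 * 26, 16), nn.ReLU(), nn.Linear(16, 2))
X = torch.rand(300, 50 * 26)

explainer = shap.DeepExplainer(model, X[:100])      # background sample for Deep SHAP
shap_values = explainer.shap_values(X[100:200])     # local attributions (packaging varies by shap version)

# Global feature importance (Eq. 9): sum absolute attributions over instances and,
# after reshaping back to (instances, 50 levels, 26 features), over the 50 levels.
phi = np.abs(np.array(shap_values)).reshape(-1, 50, 26)
print(phi.sum(axis=(0, 1)).shape)   # one importance score per input feature -> (26,)
```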

Weibull-distribution-based models provide two parameters (α, β) as the output. This means

that if we apply XAI methods to Weibull models, we can only obtain the contribution of each

instance i and feature j to (α, β), and not the duration of customers. To solve this problem, we

developed a mapping function to convert the Shapley value of (α, β) to that of customer duration.



To derive the customer duration from (α, β), we use the mode of predicted Weibull PDF, which has

the highest probability of churning in the distribution (See Figure 1). Specifically:


$W_{Mode}(\alpha, \beta) = \begin{cases} \alpha\left(\frac{\beta - 1}{\beta}\right)^{\frac{1}{\beta}} & \beta > 1 \\ 0 & \beta \leq 1 \end{cases} \qquad (11)$

$\phi_{D\alpha j}^{(i)} = W_{Mode}(\phi_{\alpha 0} + \phi_{\alpha j}^{(i)}, \phi_{\beta 0}) - W_{Mode}(\phi_{\alpha 0}, \phi_{\beta 0}) \qquad (12)$

$\phi_{D\beta j}^{(i)} = W_{Mode}(\phi_{\alpha 0}, \phi_{\beta 0} + \phi_{\beta j}^{(i)}) - W_{Mode}(\phi_{\alpha 0}, \phi_{\beta 0}) \qquad (13)$

$\phi_{Dj}^{(i)} = \phi_{D\alpha j}^{(i)} + \phi_{D\beta j}^{(i)} \qquad (14)$

where $\phi_{D\alpha j}^{(i)}$ and $\phi_{D\beta j}^{(i)}$ are the marginal contributions to customer duration attributable to the predicted parameters $\alpha$ and $\beta$ for instance $i$ and feature $j$, respectively. $\phi_{\alpha 0}$ and $\phi_{\beta 0}$ are the baseline Shapley values for $\alpha$ and $\beta$, and $\phi_{\alpha j}^{(i)}$ and $\phi_{\beta j}^{(i)}$ are the Shapley values of feature $j$ for $\alpha$ and $\beta$ of observation $i$. The final mapped Shapley value of observation $i$ is the sum of $\phi_{D\alpha j}^{(i)}$ and $\phi_{D\beta j}^{(i)}$, and it can be interpreted as the contribution of feature $j$ to the duration of instance $i$.
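The mapping in Equations (11)-(14) is straightforward to implement; the NumPy sketch below is our illustration of it (not the authors' code), using made-up baseline and per-feature SHAP values.

```python
import numpy as np

def weibull_mode(alpha, beta):
    """Equation (11): mode of the Weibull PDF, used as the predicted churn time."""
    return alpha * ((beta - 1.0) / beta) ** (1.0 / beta) if beta > 1.0 else 0.0

def duration_shap(phi_a0, phi_b0, phi_aj, phi_bj):
    """Equations (12)-(14): map SHAP values on (alpha, beta) to a SHAP value on duration."""
    base = weibull_mode(phi_a0, phi_b0)
    phi_d_alpha = weibull_mode(phi_a0 + phi_aj, phi_b0) - base   # Eq. (12)
    phi_d_beta = weibull_mode(phi_a0, phi_b0 + phi_bj) - base    # Eq. (13)
    return phi_d_alpha + phi_d_beta                              # Eq. (14)

# Made-up values: baseline alpha/beta plus one feature's attributions to each.
print(duration_shap(phi_a0=40.0, phi_b0=1.4, phi_aj=-5.0, phi_bj=0.1))
```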

4 Demonstration of WTTE-TCN on Novel Data

To demonstrate the efficacy of our algorithm, we evaluate its performance and explainability on

a proprietary dataset: game logs and churning outcomes obtained from a mobile game company.

With the dataset (which contains highly sparse and complex customer behaviors), we show the

ability of our model to capture nonlinear patterns that may not be possible with the traditional

model. Then, we figure out how firms can utilize explainability to build effective customer churn

management policies.

According to Wijman et al. (2019), the mobile game market is estimated to be worth $68.5 billion

globally. As competition intensifies, game companies are increasingly required to perform effective

customer churn management by utilizing massive amounts of play log data. Although data-driven

churn management has greater potential for game companies, the complexity and noise of the data



Figure 6: Interpretability-Driven Approach. The prediction-driven approach forecasts future events and intervenes before they happen; in other words, it focuses on "Prediction and Intervention." The interpretability-driven approach observes past events and seeks to discern why users churn at a certain time; based on the explanations, it focuses on modifying the current policy and service design ("Explain and Improve").

deter them from making accurate prediction models, which in turn restricts churn management’s

applications. For example, NCSOFT, one of the biggest game companies in South Korea, hosted an international competition for churn prediction using game log data and highlighted its difficulties: the performance of models drastically decreased when predicting the churn of loyal customers or over different periods (Lee et al., 2018c). Even worse, even if we accurately predict customers' churn, there are limited

ways to prevent it (e.g., push-notification, coupon). In particular, if users leave push notifications

off or don’t access the game, there is little that game companies can do to retain their customers.

To tackle the problem of this prediction-driven approach, we focus on an interpretability-driven

approach (Figure 6). An interpretability-driven approach provides a managerially actionable expla-

nation of customer behaviors through post hoc analysis. Based on the explanations, our model can

suggest guidelines for churn management policies and adjust game designs.



4.1 Data

We collected the game logs of 89,877 new international users who started playing the game between

June 2 and October 9, 2018. Our dataset is from a casual mobile puzzle game. The details of

the game, rules, and variable details are described in Appendix A. The dataset contains each user’s

behavioral logs from their initial interaction with the game up to the point where they either reached

level 150 or churned: data after level 150 is right-censored. We define “churning” as when a user

doesn’t visit the game for more than four weeks. The dataset has 3 million rows of individual-level

multivariate time series data. The variable definitions are shown in Table 3, and the game-specific terminologies and descriptive statistics are provided in Appendix A.

In Figure 7, the x- and y-axes refer to the game levels (1 to 150) and average feature val-

ues, respectively. We can observe significant nonstationary patterns in variables. For example,

in ClearCountOtherLevel and PlayCountOtherLevel, we see a sudden drift, which means that a

new pattern suddenly replaces the old one. Regarding Payment, PlayCount, AdvertiseView, and

ClearedGameRetryCount, incremental patterns are observed. For CoinConsumedTotal and Con-

tinueWithCoin, we can find a recurrent drift, which refers to temporary changes in the trend (Ren

et al., 2018). These intricate patterns make it difficult to fit a model correctly.

4.2 Problem Statement and Setting

In the analysis, we excluded users who churned during the tutorial period (levels 1 to 30), which allows users to practice the game and learn its rules. Also, we only observed the 50 most recent levels before each user's churn (or censoring at level 150). There are two

reasons for this. First, in mobile games, half of the revenue comes from only 0.19% of loyal

users (Cifuentes, 2016). This means that discovering and managing loyal customers is the most

important goal of mobile game operations. However, it is common in gaming for a large proportion

of users to churn very early (i.e., before actually starting gameplay), and such users are of limited interest

to the gaming company. Second, the firm has 80 million annual active users in North America and

Europe. Given that just two months of data for 89,877 new users have 3 million rows, the company

is motivated to reduce the otherwise tremendous costs of data preprocessing and analysis. After

eliminating such users, we were left with data for a total of 27,004 users.



Variable: Description

Game Play
Time To Stay: The total minutes that the user stayed at a certain level.
Play Count: The number of game plays at a certain level.
Trophy Color: The color of the trophy received when the user accomplishes a certain level.
Play Count Other Levels: The number of plays at other levels while the user stays at a certain level.
Clear Count Other Levels: The number of accomplishments at other levels while the user stays at a certain level.
Continue with Coin: The number of additional-move purchases with coins in case of game failure while the user stays at a certain level.
Continue with Coin Other Levels: The number of additional-move purchases with coins in case of game failure at other levels.
Heart Zero Count: The number of times the number of hearts reached zero at a certain level.
Heart Zero Count Other Levels: The number of times the number of hearts reached zero at other levels.
Crystal Trophy Count: The total number of crystal trophies received at a special level.
Quest Completed: The total number of times a quest was completed at a certain level.
Capsule Gacha: The total number of times an item was acquired through Gacha.
OS: The OS of the device used by the user at a certain level.

Social Activities
Number of Friends: The total number of friends at a certain level.
Facebook Login: Whether the user is connected to Facebook at a certain level.

Item Use
Boost Use Count: The total number of boost usages at a certain level.
Boost Use Count Other Levels: The total number of boost usages at other levels.
Item Use Count: The total number of item usages at a certain level.
Item Use Count Other Levels: The total number of item usages at other levels.
Coin Consumed Total: The total number of coins used at a certain level.
Coin Left Count: The total number of coins remaining at a certain level.

Payment & Advertisement
Payment: The total amount of cash paid at a certain level.
Pay Count: The total number of cash payments at a certain level.
Is Payer: 1 if the player has made a cash payment before, 0 otherwise. For example, if a player first bought a cash item at level 50, the values of "Is Payer" from levels 1 to 49 are 0, and the values at level 50 and above are 1.
AdvertiseView: Whether the player watched advertisements at a certain level.
Life Purchase: The total number of heart purchases at a certain level.

Table 3: Description of Variables



Figure 7: Complex and Nonstationary Patterns in Time Series Variables. (One panel per variable, each plotting the average feature value against game level, 1 to 150.)


Figure 8: The Setting of the Mobile Game Analysis. (Timeline from join to the censoring point at level 150: the tutorial period through level 30 is not used; the observed period, the input data, covers up to the 50 most recent levels before churn; observations beyond level 150 are censored.)

We set the game level as the time indicator rather than the day or week that is commonly used in

time series data analyses. The objective is to solve the sparsity problem that arises from the vast

majority of users not accessing the game every day. Using a day- or week-based time indicator not

only leads to data sparsity, it also unnecessarily lengthens the sequence of data with many empty

entries. In contrast, the level-based dataset is less sparse, and managers in gaming companies

often discuss user behavior as well as product and marketing strategies in terms of user level (e.g.,

adjusting the game difficulty by levels).

A total of 26 variables, such as play counts per stage, in-game purchases, number of friends, and advertisement views, make up the explanatory variables. Two dependent variables

are used in the analysis: whether the customer has churned or not and the level at which the

customer has churned. Figure 8 shows this setting.

The multivariate time series input has the following dimensions: [instance, level (50), feature (26)]. If a player churned at level 120, their input of size (50, 26) is filled with feature values of the most recent 50 levels (Figure 9, left). If a player churned earlier than level 50 (e.g., at level 34), we applied zero-padding (adding zeros to fit differently sized data to the same shape) to set the size to (50, 26) (Figure 9, right).
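A minimal NumPy sketch of this padding step (ours, for illustration; function and variable names are hypothetical):

```python
import numpy as np

N_LEVELS, N_FEATURES = 50, 26

def to_fixed_window(user_history):
    """Turn a (n_levels_played, 26) history into a (50, 26) input.

    Keeps the most recent 50 levels; shorter histories are zero-padded at the front so
    that the most recent levels stay aligned at the end of the window.
    """
    recent = np.asarray(user_history)[-N_LEVELS:]
    padded = np.zeros((N_LEVELS, N_FEATURES))
    padded[-len(recent):] = recent
    return padded

# A player who churned at level 34 has only 34 rows; the rest are zero-padded.
print(to_fixed_window(np.ones((34, N_FEATURES))).shape)   # (50, 26)
```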

A summary of the setting’s inputs and outputs is as follows:

Input:

• Multi-variate time series data related to customers’ in-game behavior (e.g., play count, item

purchase count, login with Facebook ID, and the number of friends).

Figure 9: Examples of Multivariate Time Series Inputs in the Mobile Game Dataset. (Left: a player who churned at level 120; the (50, 26) input holds feature values for the most recent 50 levels. Right: a player who churned at level 34; the input is zero-padded to (50, 26).)

• The level at which the customer churned or was censored, together with an event indicator: churned (1) or right-censored (0).

Output:

• The probability distribution of each customer’s churn level.

• Prediction interval of each prediction result.

• Model- and individual-level interpretations of prediction results.

4.3 Performance

We compared the performance of our model (WTTE-TCN) to four traditional survival models (Cox Proportional Hazards, Weibull Accelerated Failure Time (AFT), Log-Normal AFT, and Aalen's Additive model) as well as four deep survival models (DeepSurv (Katzman et al., 2018), CoxCC (Kvamme et al., 2019), DeepHit (Lee et al., 2018a), and Martinsson (2016)'s WTTE-RNN). WTTE-RNN and WTTE-TCN can directly use the 3d-tensor (instance, time (= level), features) multivariate time series data, since they contain sequential layers such as LSTM and TCN. On the other hand, since



traditional and other deep survival models cannot deal with 3d-tensors (instance, time, features),

we converted it to a 2d-matrix (instance, time × features). Notably, since traditional models did

not work with these flattened inputs due to the curse of dimensionality, we reduced the dimension

through Principal Components Analysis (PCA) (Witten and Tibshirani, 2010). We split the data

into three parts, training (50%), validation (25%), and test (25%) data. To evaluate the perfor-

mance of duration prediction models, we used mean absolute error (MAE) on the test data (Wang

et al., 2019). Specifically, MAE takes the difference between the predicted game level for user churn

(= time to event) and the actual churned level. With respect to the DL models, performance varies across training runs, so we conducted a total of 10 separate trainings with different random seeds.

The mean, minimum, and maximum MAE were used as metrics for comparison. Root mean square

error (RMSE) was also used for a robustness check.
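
A minimal sketch of this evaluation protocol is shown below. The `train_and_predict` function is a hypothetical stand-in for fitting any of the deep models with a given seed; the 50/25/25 split and the MAE/RMSE definitions follow the text, while everything else (variable names, the number of PCA components) is illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

# 50% train / 25% validation / 25% test split on inputs X and churn levels y
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.50, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.50, random_state=0)

# Baselines that need a 2-d design matrix: flatten (time, features) and reduce with PCA
pca = PCA(n_components=20)                              # component count is an illustrative choice
Z_train = pca.fit_transform(X_train.reshape(len(X_train), -1))
Z_test = pca.transform(X_test.reshape(len(X_test), -1))  # Z_* would feed the Cox/AFT baselines

mae = lambda y_true, y_pred: np.mean(np.abs(y_true - y_pred))
rmse = lambda y_true, y_pred: np.sqrt(np.mean((y_true - y_pred) ** 2))

# Deep models: repeat training over 10 random seeds and report mean/min/max MAE
seed_scores = [mae(y_test, train_and_predict(seed=s)) for s in range(10)]
print(np.mean(seed_scores), np.min(seed_scores), np.max(seed_scores))
```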

We choose MAE over another popular metric, concordance index (or C-index), which refers to

the accuracy of predicted rank (i.e. which observation survives longer) between pairs of randomly

chosen observations. This is because concordance cannot accurately evaluate survival models when the data exhibit time-varying patterns that are uncommon in traditional settings. For example, our mobile game is designed with increasing difficulty, so users fail more often as they reach later stages. Also, as shown in Figure 7, many other features show increasing or decreasing patterns over time.

These time-varying patterns make it effortless to predict the order of user churn, and most of our

test models showed extremely high concordance scores. Thus, we focus on evaluating how accurately

models predict the timing of users’ churn (i.e., minimizing prediction errors), rather than the order

of users’ churn (i.e., maximizing C-index).
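
For reference, the concordance index mentioned above can be computed from the predicted and observed churn levels together with the censoring indicator; a sketch using the lifelines utility is given below, with illustrative variable names.

```python
from lifelines.utils import concordance_index

# event_observed: 1 = churned, 0 = right-censored; a pair is concordant when the user
# predicted to survive longer actually does survive longer.
c_index = concordance_index(event_times=observed_churn_level,
                            predicted_scores=predicted_churn_level,
                            event_observed=churned_indicator)
print(f"C-index: {c_index:.3f}")  # close to 1.0 in our setting, which is why we rely on MAE instead
```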



Model                                        MAE     MAE     MAE     RMSE    # of
                                             Mean    Min     Max     Mean    Parameters
Traditional Models
  Cox Proportional Hazard (Cox, 1972)        25.71   -       -       32.98   -
  Weibull AFT                                52.85   -       -       61.45   -
  Log-Normal AFT                             60.49   -       -       70.10   -
  Aalen's additive model                     33.53   -       -       38.09   -
Deep Learning Models
  DeepSurv (Katzman et al., 2018)            23.33   18.24   26.83   29.69   117,181
  CoxCC (Kvamme et al., 2019)                21.84   19.14   23.91   28.61   124,993
  DeepHit Single (Lee et al., 2018a)         17.61   15.78   21.24   24.41   140,617
  WTTE-RNN (Martinsson, 2016)                13.65   12.66   14.99   19.84   215,692
  WTTE-TCN (Our Model)                       11.34   10.34   12.22   17.43   101,270

Table 4: Prediction Results of Models

The test results show that our WTTE-TCN model achieves superior performance over both the traditional survival models (reducing MAE by 56-81%) and the deep survival models (reducing MAE by 17-51%). Notably, we reduced MAE by 17% while using about half as many parameters as Martinsson (2016).

4.4 Interpretability & Managerial Implications

4.4.1 Global Interpretation (Model-Level)

For managerial guidance, we provide global interpretations of model predictions in this section.

Figure 10 shows the global feature importance of the most critical variables across all users.

According to the result, two key features related to game design, PlayCount and TrophyColor, play

the most important role in predicting how long users will survive. PlayCount (one indicator of game difficulty) is the total number of attempts to get past a level. TrophyColor refers to the three types

of trophies users are awarded when they clear stages depending on their game style. The more

skillful the users, the better the trophies they get. FacebookLogin and NumberOfFriends indicate

that social-related variables are also useful to predict the user’s duration. Regarding OS, simple

data exploration shows that iOS users stay longer than Android users, ceteris paribus. Items and

boosts help users clear the game more easily, and the pattern of using them is critical information



[Figure 10 about here: bar chart of mean(|SHAP value|) per feature (average impact on model output magnitude), in decreasing order of importance: PlayCount, TrophyColor, FacebookLogin, BoostUseCount, OS, NumberOfFriends, IsPayer, ItemUseCount, AdvertiseView, QuestCompleted.]

Figure 10: Global Feature Importance

[Figure 11 about here: histogram of the number of churned users (y-axis) by game level (x-axis, levels 60-140).]

Figure 11: Number of Churn Users by Level

as well. Also, the in-app purchasing experience (IsPayer) and the number of advertising views are

useful variables in predicting the user’s duration.
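
The global ranking in Figure 10 is the mean absolute SHAP value per feature, averaged over users and levels. A sketch of how such a ranking could be produced is given below; the choice of `shap.DeepExplainer`, the background sample, and all variable names are assumptions for illustration rather than the paper's exact implementation.

```python
import numpy as np
import shap

# Explain the fitted network on a sample of users; inputs have shape (users, 50 levels, 26 features).
explainer = shap.DeepExplainer(model, background_batch)   # background_batch: a small reference sample
sv = explainer.shap_values(X_explain)
sv = sv[0] if isinstance(sv, list) else sv                 # multi-output models return one array per output

# Global importance: mean(|SHAP|) over users and levels, as plotted in Figure 10.
global_importance = np.abs(sv).mean(axis=(0, 1))
for name, score in sorted(zip(feature_names, global_importance), key=lambda t: -t[1])[:10]:
    print(f"{name:20s} {score:.3f}")
```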

With respect to PDPs, we provide a cohort-level global interpretation by grouping users who

churned at similar game levels into the same cohort. This is to control for the time-varying char-

acteristics of variables such as PlayCount, which is the average of total play counts per level, and

which is observed to increase as the game level increases. In other words, feature values at the high

and low levels have entirely different meanings, and the design of the game should be different, too.

We demonstrate how we can utilize cohort-level PDPs to derive managerial implications from

the game. As mentioned above, the firm focuses on potential loyal customers and cares about user churn later in the observed period.


[Figure 12 about here: six partial dependence panels plotting SHAP value against average feature value, for PlayCount, TrophyColor, and ItemUseCount (top row) and IsPayer, FacebookLogin, and NumberOfFriends (bottom row).]

Figure 12: Partial Dependence Plots (PDP) of Users Who Churned at 126-135 Level

In Figure 11, we can observe a sharp increase in user churn between levels 126 and 135 (red dotted box). We seek to find a way to reduce the churn rate in this

period, where more potential loyal customers have left than we expected. To do so, we use XAI to

analyze the recent (latest 50 levels) behavior of the churned users and explain how those behaviors

affected their decisions. Based on the explanation, we provide new churn management strategies.

Figure 12 presents the PDPs of users who churned at levels 126-135. The x- and y-axes indicate

the average feature value of the latest 50 levels and its SHAP value (feature importance), respec-

tively. Since this feature importance is an average value of the latest 50 levels, the cumulative

impact is 50 times greater (average SHAP value per level × the length of lookback window) than

SHAP values shown in the figure. If a user's average PlayCount is 3 and their SHAP value is 1 (Figure 12 left above), we can expect that this user will stay 50 levels (1 level × 50) longer than the baseline user, whose SHAP value is zero and whose average PlayCount is around two. Also, users who have the same feature value can have different SHAP values, because their other feature values differ. For example, even though users have the same PlayCount, its impact on

paid and free users may differ. SHAP value can capture these differences because it calculates the



comparative importance among features based on game theory.
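
A sketch of how one such cohort-level plot could be generated is given below: select the users who churned in the 126-135 band, average both their feature values and their SHAP values over the 50-level lookback window, and plot one against the other. The array names and the use of `shap.dependence_plot` are illustrative assumptions.

```python
import numpy as np
import shap

# Cohort: users who churned between levels 126 and 135
cohort = (churn_level >= 126) & (churn_level <= 135)

# Average behavior and SHAP values over the 50-level lookback window
mean_features = X[cohort].mean(axis=1)            # (n_cohort_users, 26)
mean_shap = shap_values[cohort].mean(axis=1)      # (n_cohort_users, 26)

# PDP-style scatter of SHAP value vs. average PlayCount, as in Figure 12 (left above)
j = feature_names.index("PlayCount")
shap.dependence_plot(j, mean_shap, mean_features,
                     feature_names=feature_names, interaction_index=None)
```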

With respect to PlayCount, a user's duration increases until PlayCount reaches 3 (Figure 12 left above). However, beyond 3, users who play (and fail) more tend to churn sooner. In other words, users don't like games that are too easy (low PlayCount) or too difficult (high PlayCount), and there is an optimal level of difficulty (PlayCount around 3). In this case, the company should adjust the difficulty of levels 126-135 so that users play about three times per level on average. In detail, since the current difficulty in this range is too low (many users are between PlayCount 1.5 and 2), the company needs to add more of a challenge.

TrophyColor is related to a sense of accomplishment. When users clear the game with skillful

plays and high scores, they get better trophies from among the bronze, silver, and gold options.

Notably, as a reward, the company carefully designed special visual and sound effects to excite users

when they receive a better trophy. According to the result (Figure 12 mid above), users with better trophies survived longer, and the company needs to provide more opportunities for users with bronze

and silver trophies to win gold trophies. In sum, at levels 126-135, the company should increase the

difficulty of the game so that users retry about three times per level (PlayCount), while allowing

them to feel a higher sense of accomplishment if they clear the game (TrophyColor).

Regarding ItemUseCount (Figure 12 right above), users who spent more items churned earlier

(BoosterUseCount showed a similar result). Items and boosters help users clear levels easily. To

get items and boosters, users should spend more time to complete additional quests or buy them

with money. In this sense, users perceive items and boosters as currency, and unused items act

as sunk costs that keep users playing the game longer. Also, users who decide to leave the game

tend to use up items in a short period of time. This suggests that we can use ItemUseCount and

BoostUseCount as key metrics for prescriptive churn prevention. For example, the company can

offer additional items, boosters, or special promotions before users run out of their own.

If a player has made cash payments before, “IsPayer” is 1, otherwise it is 0. The result (Figure

12 left below) shows that paying users stay longer than free users, as expected. However, concerning

social variables, the result turns out to be the reverse of what the company intended. For example, users

who login with their Facebook accounts (FacebookLogin = 1) showed a shorter duration (Figure

12 mid below). The company expected users to enjoy games and communicate with their social

media friends by linking their Facebook accounts, and that this would increase the user's duration (Cole and Griffiths, 2007).


[Figure 13 about here: predicted churn distributions with shaded 95% prediction intervals (PI) for users #776 and #873.]

Figure 13: Prediction Outputs and Prediction Intervals


The prediction output is the Weibull probability distribution of churn for each user; a 95% prediction interval (PI) can be derived from the distribution.

In the game, users can have various social interactions, such as sending

gifts to their Facebook friends or showing off their scores to each other. Therefore, users who have

many Facebook friends in the game should have more fun and stay longer. However, users with

more than three friends (Figure 12 right below) also have shorter survival periods. This means the

company needs to check whether there are any problems with its design of the social interaction system.

Fortunately, interviews with game users show that they felt like they were losing interest in the

game faster if their friends were leaving. In other words, these social functions are only effective

while their friend groups are active; otherwise, they are bad for user retention. The company

could design a new social interaction mechanism to overcome these shortcomings (e.g., matching

new friends outside the existing network).

4.4.2 Local Interpretation (User-Level)

One of the advantages of a distribution-based predictive model is that we can address uncertainty in

prediction results. Figure 13 shows the outputs (the Weibull probability distribution of churn) for two users. We can interpret the point where the probability (y-axis) is highest as the predicted churn of a user. User #776, whose predicted churn (77) shows a comparatively small deviation from the true churn (76), has a narrow PI (colored band in Figure 13), [37, 89], and is therefore a more reliable prediction. On the other hand, user #873, whose predicted churn (223) shows significant deviation from the true churn (147), has a wide PI, [96, 419], and is consequently less reliable.
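
The point prediction and interval in Figure 13 both follow from the per-user Weibull parameters. A sketch of that derivation is below, assuming the network outputs a scale α and shape β for each user; whether the paper constructs the 95% band from central quantiles or a highest-density region is not stated here, so central quantiles are used as an illustrative choice.

```python
import numpy as np

def weibull_prediction(alpha: float, beta: float, coverage: float = 0.95):
    """Return the mode (most likely churn level) and a central prediction interval
    of a Weibull(scale=alpha, shape=beta); the mode formula assumes beta > 1."""
    mode = alpha * ((beta - 1.0) / beta) ** (1.0 / beta) if beta > 1.0 else 0.0
    quantile = lambda p: alpha * (-np.log(1.0 - p)) ** (1.0 / beta)   # inverse Weibull CDF
    lo, hi = (1.0 - coverage) / 2.0, 1.0 - (1.0 - coverage) / 2.0
    return mode, (quantile(lo), quantile(hi))

# Hypothetical user with predicted scale 85 and shape 2.4
print(weibull_prediction(85.0, 2.4))
```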



[Figure 14 about here: the top panel plots feature values of PlayCount (A), TrophyColor (B), ItemUseCount (E), and TimeToStay (F) across the lookback window; the bottom panel shows the per-level SHAP value map from 52 LV until churn at 102 LV, with regions A-F highlighted.]

Figure 14: Local Feature Importance Map


We look back at the most recent 50 LV of behavior of non-paying user #1358, who churned at 102 LV. We use SHAP values; a deeper red indicates a stronger positive effect on the user's retention.

We demonstrate how we can derive customized churn management strategies at the individual user level. Figure 14 (below) presents the behavioral feature importance (y-axis)

map of user #1358 through different game levels (x-axis) until the eventual churn at stage 102.

Blue indicates a negative contribution to retention and red indicates a positive contribution. Fig-

ure 14 (above) shows the actual feature values of different behavioral variables across game stages.

Examine, for example, region A, which shows the PlayCount (total attempts to get past a level) contribution through different game levels. The PlayCount is usually 1 until stage 40, where it

bumps up to 4. This suggests that the user cleared each level without fail for the most part, which

in turn prompted the user to churn at level 102. In other words, the game was too easy for user

#1358 and contributed to his/her churning. On the other hand, the varying TrophyColor (orange line curve located in the second column of Figure 14 above), which is awarded based on stage



score, was beneficial for retention overall. When examining regions C and D, we found that playing

bonus stages for extra rewards (PlayCountOtherLevels) was beneficial for retention while replaying

cleared stages (ClearedGameRetryCounts) was harmful. The user spent items mostly in stages 30-

40 (region E), which was also harmful for retention. Lastly, the user was stuck at stage 42 (region

F) for a long time (about 17 days), which was detrimental to retention; shortly after, the user

churned. This suggests that customized promotions such as free items or readjusting difficulty at

stage 42 might have been beneficial for retention.
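
A map like Figure 14 can be drawn directly from one user's per-level SHAP values. The sketch below assumes `shap_user` holds the (50 levels × 26 features) SHAP matrix for a single user and uses a diverging colormap so that positive (retention-increasing) contributions appear red and negative ones blue; all variable names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# shap_user: (50, 26) SHAP values for the latest 50 levels of one user (rows = levels, cols = features)
limit = np.abs(shap_user).max()
fig, ax = plt.subplots(figsize=(10, 6))
im = ax.imshow(shap_user.T, aspect="auto", cmap="coolwarm", vmin=-limit, vmax=limit)
ax.set_yticks(range(len(feature_names)))
ax.set_yticklabels(feature_names)
ax.set_xlabel("Game level within the 50-level lookback window")
fig.colorbar(im, label="SHAP value (contribution to predicted retention)")
plt.show()
```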

5 Conclusion and Discussion

We introduced a novel interpretable deep survival model, WTTE-TCN, to capture human-understandable

nonlinear patterns from complex (sparse, high-dimensional, and unstructured) data. When applied

to a unique and comprehensive mobile game churning dataset, WTTE-TCN shows superior per-

formance with less computational costs. Upon applying post-hoc XAI methods, we were able to

illustrate actionable insights and prescriptive strategies for real-world customer churn management.

We contribute in two ways. First, we improve existing DL survival models by incorporating TCN,

which outperforms canonical RNNs (e.g., LSTM, GRU) through powerful convolutions and dilations

while minimizing the loss of information. Second, we present exploratory results that examine the

model’s ability to capture churn signals from complex data and provide domain-specific marketing

implications for customer churn management.

We envision WTTE-TCN as an “advanced knowledge-capturing tool” that helps to find novel hy-

potheses from empirical customer churning data (Molnar, 2018; Lee et al., 2018b). To this end,

DL models discover complex nonlinear patterns in data (beyond human cognitive abilities), and

interpretability can extract this knowledge from the model. Specifically, we hope managers and

researchers can use WTTE-TCN to incorporate novel unstructured data (e.g., text, image) and dis-

cover/test new hypotheses in churn management contexts. The potential of text data notably

remains untapped in that this exponentially growing data fully captures the daily life of cus-

tomers, including their thoughts, emotions, tastes, routines, and even relationships. Despite its

great prospects, the potential value lost due to unused text data is estimated at $3 trillion globally

(Analytics, 2016).



Our study branches into the following future research directions:

First, while we successfully incorporated NLP layers into WTTE-TCN, this study focused on

demonstrating that our algorithm can capture more signals and provide useful marketing implica-

tions with complex multivariate sequential data, rather than illustrating good use of interpretability

with text data. Future studies with novel text datasets might provide new business implications for

marketing professionals to increase value to customers.

Second, although there are explainable models, deriving an actionable strategy still requires

great effort and know-how. In the classical setting, good use of interpretability has already been well

documented and studied. For example, econometrics studies have solid foundations for interpreting,

validating, and utilizing coefficients of regression models. On the other hand, in the DL setting, good

use of interpretability is still in its infancy, and the complexity of DL makes the right interpretation

difficult. In this context, fruitful future research topics can include advantages and constraints of

various interpretability methods (Molnar, 2018), good use cases and new techniques (Yang et al.,

2016; Lee et al., 2018b; Mothilal et al., 2020), and guidelines for the robust use of interpretability

(Doshi-Velez and Kim, 2017; Melis and Jaakkola, 2018) in business contexts.

Third, applied ML/DL requires managing uncertainty, which refers to the model working with imperfect or unknown information; this will be an important future research topic, given that it

helps to manage biases in algorithms and minimize risks in data-driven enterprise decisions. For

example, a medical diagnosis system with 90% accuracy would still misdiagnose 10% of patients. In

typical predictive models, we cannot know which patient will be misdiagnosed because the model

simply gives the patient’s prediction results as a single value (point estimate). In contrast, proba-

bilistic ML/DL methods address uncertainty by quantitatively measuring how much we can trust

the prediction result of each observation through distributional results (interval estimate). Thus,

for observations suffering insufficient data or heavy noise, the model provides wider distributions or

prediction intervals as results (see Hüllermeier and Waegeman (2019) for more details). Although

our model deals with uncertainty through distributional results, studies about uncertainty are still

lacking. More theoretical/empirical studies regarding implementations, impacts, and novel methods

of uncertainty might be fruitful for future business research.



References
Analytics, M.: 2016, ‘The age of analytics: competing in a data-driven world’.

Ascarza, E., P. S. Fader, and B. G. Hardie: 2017, ‘Marketing models for the customer-centric firm’. In:
Handbook of marketing decision models. Springer, pp. 297–329.

Ba, J. L., J. R. Kiros, and G. E. Hinton: 2016, ‘Layer normalization’. arXiv preprint arXiv:1607.06450.

Bahdanau, D., K. Cho, and Y. Bengio: 2014, ‘Neural machine translation by jointly learning to align and
translate’. arXiv preprint arXiv:1409.0473.

Bai, S., J. Z. Kolter, and V. Koltun: 2018, ‘An empirical evaluation of generic convolutional and recurrent
networks for sequence modeling’. arXiv preprint arXiv:1803.01271.

Bengio, Y., A. Courville, and P. Vincent: 2013, ‘Representation learning: A review and new perspectives’.
IEEE transactions on pattern analysis and machine intelligence 35(8), 1798–1828.

Braun, M. and D. A. Schweidel: 2011, ‘Modeling customer lifetimes with multiple causes of churn’. Marketing
Science 30(5), 881–902.

Braun, M., D. A. Schweidel, and E. Stein: 2015, ‘Transaction attributes and customer valuation’. Journal
of Marketing Research 52(6), 848–864.

Brownlee, J.: 2018, Better Deep Learning: Train Faster, Reduce Overfitting, and Make Better Predictions.
Machine Learning Mastery.

Campbell, P.: 2019, ‘Is Content Marketing Dead? Here’s Some Data.’. ProfitWell (blog).

Carvalho, D. V., E. M. Pereira, and J. S. Cardoso: 2019, ‘Machine Learning Interpretability: A Survey on
Methods and Metrics’. Electronics 8(8), 832.

Cifuentes, J.: 2016, ‘Half of all mobile games revenue reportedly comes from only 0.19

Cole, H. and M. D. Griffiths: 2007, ‘Social interactions in massively multiplayer online role-playing gamers’.
Cyberpsychology & behavior 10(4), 575–583.

Collobert, R., J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa: 2011, ‘Natural language
processing (almost) from scratch’. Journal of machine learning research 12(Aug), 2493–2537.

Cox, D. R.: 1972, ‘Regression models and life-tables’. Journal of the Royal Statistical Society: Series B
(Methodological) 34(2), 187–202.

Doshi-Velez, F. and B. Kim: 2017, ‘Towards a rigorous science of interpretable machine learning’. arXiv
preprint arXiv:1702.08608.

Du, M., N. Liu, and X. Hu: 2019, ‘Techniques for interpretable machine learning’. Communications of the
ACM 63(1), 68–77.

Ebrahimzadeh, Z., M. Zheng, S. Karakas, and S. Kleinberg: 2019, ‘Deep Learning for Multi-Scale Change-
point Detection in Multivariate Time Series’. arXiv preprint arXiv:1905.06913.

Fader, P. S. and B. G. Hardie: 2007, ‘How to project customer retention’. Journal of Interactive Marketing
21(1), 76–90.

Fader, P. S. and B. G. Hardie: 2009, ‘Probability models for customer-base analysis’. Journal of interactive
marketing 23(1), 61–69.

Fader, P. S. and B. G. Hardie: 2010, ‘Customer-base valuation in a contractual setting: The perils of ignoring
heterogeneity’. Marketing Science 29(1), 85–93.



Fader, P. S., B. G. Hardie, and C.-Y. Huang: 2004, ‘A dynamic changepoint model for new product sales
forecasting’. Marketing Science 23(1), 50–65.

Fader, P. S., B. G. Hardie, and K. L. Lee: 2005, ‘”Counting your customers” the easy way: An alternative
to the Pareto/NBD model’. Marketing science 24(2), 275–284.

Fader, P. S., B. G. Hardie, Y. Liu, J. Davin, and T. Steenburgh: 2018, ‘”How to Project Customer Retention”
Revisited: The Role of Duration Dependence’. Journal of Interactive Marketing 43, 1–16.

Friedman, J. H.: 2001, ‘Greedy function approximation: a gradient boosting machine’. Annals of statistics
pp. 1189–1232.

Gallo, A.: 2014, ‘The value of keeping the right customers’. Harvard business review 29.

Ghahramani, Z.: 2001, ‘An introduction to hidden Markov models and Bayesian networks’. In: Hidden
Markov models: applications in computer vision. World Scientific, pp. 9–41.

Goodfellow, I., Y. Bengio, and A. Courville: 2016, Deep learning. MIT press.

Guidotti, R., A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi: 2018, ‘A survey of methods
for explaining black box models’. ACM computing surveys (CSUR) 51(5), 93.

Guillén, M., J. P. Nielsen, T. H. Scheike, and A. M. Pérez-Marín: 2012, ‘Time-varying effects in the analysis
of customer loyalty: A case study in insurance’. Expert Systems with Applications 39(3), 3551–3558.

Gunning, D.: 2017, ‘Explainable artificial intelligence (xai)’. Defense Advanced Research Projects Agency
(DARPA), nd Web 2.

Hajian, S., F. Bonchi, and C. Castillo: 2016, ‘Algorithmic bias: From discrimination discovery to fairness-
aware data mining’. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge
discovery and data mining. pp. 2125–2126.

He, Y. and J. Zhao: 2019, ‘Temporal Convolutional Networks for Anomaly Detection in Time Series’. In:
Journal of Physics: Conference Series, Vol. 1213. p. 042050.

Hosanagar, K.: 2019, A Human’s Guide to Machine Intelligence: How Algorithms are Shaping Our Lives
and how We Can Stay in Control. Viking.

Hüllermeier, E. and W. Waegeman: 2019, ‘Aleatoric and epistemic uncertainty in machine learning: A
tutorial introduction’. arXiv preprint arXiv:1910.09457.

Katzman, J. L., U. Shaham, A. Cloninger, J. Bates, T. Jiang, and Y. Kluger: 2018, ‘DeepSurv: personalized
treatment recommender system using a Cox proportional hazards deep neural network’. BMC medical
research methodology 18(1), 24.

Knox, G. and R. Van Oest: 2014, ‘Customer complaints and recovery effectiveness: A customer base ap-
proach’. Journal of marketing 78(5), 42–57.

Koop, G. and S. Potter: 2004, ‘Forecasting in dynamic factor models using Bayesian model averaging’. The
Econometrics Journal 7(2), 550–565.

Krizhevsky, A., I. Sutskever, and G. Hinton: 2012, ‘Imagenet classification with deep convolutional neural networks’. In: Advances in Neural Information Processing Systems. pp. 1097–1105.

Kuznetsov, V.: 2016, ‘Theory and Algorithms for Forecasting Non-Stationary Time Series’. Ph.D. thesis,
New York University.

Kvamme, H., Ø. Borgan, and I. Scheel: 2019, ‘Time-to-event prediction with neural networks and Cox
regression’. Journal of Machine Learning Research 20(129), 1–30.



Lea, C., M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager: 2017, ‘Temporal convolutional networks for
action segmentation and detection’. In: proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition. pp. 156–165.

Lea, C., R. Vidal, A. Reiter, and G. D. Hager: 2016, ‘Temporal convolutional networks: A unified approach
to action segmentation’. In: European Conference on Computer Vision. pp. 47–54, Springer.

LeCun, Y., Y. Bengio, and G. Hinton: 2015, ‘Deep learning’. nature 521(7553), 436.

Lee, C., W. R. Zame, J. Yoon, and M. van der Schaar: 2018a, ‘Deephit: A deep learning approach to survival
analysis with competing risks’. In: Thirty-Second AAAI Conference on Artificial Intelligence.

Lee, D., E. Manzoor, and Z. Cheng: 2018b, ‘Focused Concept Miner (FCM): Interpretable Deep Learning for Text Exploration’ (November 20, 2018).

Lee, E., Y. Jang, D.-M. Yoon, J. Jeon, S.-i. Yang, S.-K. Lee, D.-W. Kim, P. P. Chen, A. Guitart, P. Bertens,
et al.: 2018c, ‘Game data mining competition on churn prediction and survival analysis using commercial
game log data’. IEEE Transactions on Games 11(3), 215–226.

Lipton, Z. C.: 2016, ‘The mythos of model interpretability’. arXiv preprint arXiv:1606.03490.

Liu, L., H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han: 2019, ‘On the variance of the adaptive
learning rate and beyond’. arXiv preprint arXiv:1908.03265.

Lu, J., D. Lee, T. W. Kim, and D. Danks: 2020, ‘Good Explanation for Algorithmic Transparency’. AIES
’20: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (February, 2020).

Lu, Y., A. A. Miller, R. Hoffmann, and C. W. Johnson: 2016, ‘Towards the Automated Verification of
Weibull Distributions for System Failure Rates’. In: Critical Systems: Formal Methods and Automated
Verification. Springer, pp. 81–96.

Luck, M., T. Sylvain, H. Cardinal, A. Lodi, and Y. Bengio: 2017, ‘Deep learning for patient-specific kidney
graft survival analysis’. arXiv preprint arXiv:1705.10245.

Lundberg, S. M. and S.-I. Lee: 2017, ‘A unified approach to interpreting model predictions’. In: Advances
in Neural Information Processing Systems. pp. 4765–4774.

Maragall Cambra, M.: 2018, ‘Using recurrent neural networks to predict the time for an event’.

Martinsson, E.: 2016, ‘Wtte-rnn: Weibull time to event recurrent neural network’. Master’s thesis, University of Gothenburg, Sweden.

Melis, D. A. and T. Jaakkola: 2018, ‘Towards robust interpretability with self-explaining neural networks’.
In: Advances in Neural Information Processing Systems. pp. 7775–7784.

Mikolov, T., I. Sutskever, K. Chen, G. S. Corrado, and J. Dean: 2013, ‘Distributed representations of words
and phrases and their compositionality’. In: Advances in neural information processing systems. pp.
3111–3119.

Molnar, C.: 2018, ‘Interpretable machine learning: A guide for making black box models explainable’. E-book at <https://christophm.github.io/interpretable-ml-book/>, version dated 10.

Montoya, R., O. Netzer, and K. Jedidi: 2010, ‘Dynamic allocation of pharmaceutical detailing and sampling
for long-term profitability’. Marketing Science 29(5), 909–924.

Moor, M., M. Horn, B. Rieck, D. Roqueiro, and K. Borgwardt: 2019, ‘Temporal convolutional net-
works and dynamic time warping can drastically improve the early prediction of sepsis’. arXiv preprint
arXiv:1902.01659.



Mothilal, R. K., A. Sharma, and C. Tan: 2020, ‘Explaining machine learning classifiers through diverse
counterfactual explanations’. In: Proceedings of the 2020 Conference on Fairness, Accountability, and
Transparency. pp. 607–617.

Netzer, O., J. M. Lattin, and V. Srinivasan: 2008, ‘A hidden Markov model of customer relationship dy-
namics’. Marketing science 27(2), 185–204.

Pacurar, M.: 2008, ‘Autoregressive conditional duration models in finance: a survey of the theoretical and
empirical literature’. Journal of economic surveys 22(4), 711–751.

Palau, A. S., K. Bakliwal, M. H. Dhada, T. Pearce, and A. K. Parlikad: 2018, ‘Recurrent neural networks for
real-time distributed collaborative prognostics’. In: 2018 IEEE international conference on prognostics
and health management (ICPHM). pp. 1–8.

Pascanu, R., T. Mikolov, and Y. Bengio: 2013, ‘On the difficulty of training recurrent neural networks’. In:
International conference on machine learning. pp. 1310–1318.

Pennington, J., R. Socher, and C. D. Manning: 2014, ‘Glove: Global vectors for word representation’. In:
Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). pp.
1532–1543.

Pölsterl, S., S. Conjeti, N. Navab, and A. Katouzian: 2016, ‘Survival analysis for high-dimensional, het-
erogeneous medical data: Exploring feature extraction as an alternative to feature selection’. Artificial
intelligence in medicine 72, 1–11.

Potter, R. G. and M. P. Parker: 1964, ‘Predicting the time required to conceive’. Population studies 18(1),
99–116.

Prinja, S., N. Gupta, and R. Verma: 2010, ‘Censoring in clinical trials: review of survival analysis techniques’.
Indian journal of community medicine: official publication of Indian Association of Preventive & Social
Medicine 35(2), 217.

Raffel, C. and D. P. Ellis: 2015, ‘Feed-forward networks with attention can solve some long-term memory
problems’. arXiv preprint arXiv:1512.08756.

Rai, A.: 2020, ‘Explainable AI: from black box to glass box’. Journal of the Academy of Marketing Science
pp. 1–5.

Ranganath, R., A. Perotte, N. Elhadad, and D. Blei: 2016, ‘Deep survival analysis’. arXiv preprint
arXiv:1608.02158.

Reichheld, F. and C. Detrick: 2003, ‘Loyalty: A prescription for cutting costs’. Marketing Management
12(5), 24–24.

Ren, S., B. Liao, W. Zhu, and K. Li: 2018, ‘Knowledge-maximized ensemble algorithm for different types of
concept drift’. Information Sciences 430, 261–281.

Ribeiro, M. T., S. Singh, and C. Guestrin: 2016a, ‘"Why should I trust you?" Explaining the predictions of
any classifier’. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery
and data mining. pp. 1135–1144.

Ribeiro, M. T., S. Singh, and C. Guestrin: 2016b, ‘Model-agnostic interpretability of machine learning’.
arXiv preprint arXiv:1606.05386.

Rudin, C.: 2019, ‘Stop explaining black box machine learning models for high stakes decisions and use
interpretable models instead’. Nature Machine Intelligence 1(5), 206.

Schmittlein, D. C., D. G. Morrison, and R. Colombo: 1987, ‘Counting your customers: Who-are they and
what will they do next?’. Management science 33(1), 1–24.



Schweidel, D. A., P. S. Fader, and E. T. Bradlow: 2008, ‘Understanding service retention within and across
cohorts using limited information’. Journal of Marketing 72(1), 82–94.

Schweidel, D. A. and G. Knox: 2013, ‘Incorporating direct marketing activity into latent attrition models’.
Marketing Science 32(3), 471–487.

Selvaraju, R. R., M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra: 2017, ‘Grad-cam: Visual ex-
planations from deep networks via gradient-based localization’. In: Proceedings of the IEEE international
conference on computer vision. pp. 618–626.

Shrikumar, A., P. Greenside, and A. Kundaje: 2017, ‘Learning important features through propagating
activation differences’. In: Proceedings of the 34th International Conference on Machine Learning-Volume
70. pp. 3145–3153.

Srivastava, N., G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov: 2014, ‘Dropout: a simple way
to prevent neural networks from overfitting’. The journal of machine learning research 15(1), 1929–1958.

Tableman, M. and J. S. Kim: 2003, Survival analysis using S: analysis of time-to-event data. CRC press.

Wang, P., Y. Li, and C. K. Reddy: 2019, ‘Machine learning for survival analysis: A survey’. ACM Computing
Surveys (CSUR) 51(6), 110.

Wijman, T., O. Meehan, and B. de Heij: 2019, ‘Global games market report’.

Witten, D. M. and R. Tibshirani: 2010, ‘Survival analysis with high-dimensional covariates’. Statistical
methods in medical research 19(1), 29–51.

Xiao, C., E. Choi, and J. Sun: 2018, ‘Opportunities and challenges in developing deep learning models
using electronic health records data: a systematic review’. Journal of the American Medical Informatics
Association 25(10), 1419–1428.

Yang, Z., D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy: 2016, ‘Hierarchical attention networks for
document classification’. In: Proceedings of the 2016 conference of the North American chapter of the
association for computational linguistics: human language technologies. pp. 1480–1489.

Yousefi, S., F. Amrollahi, M. Amgad, C. Dong, J. E. Lewis, C. Song, D. A. Gutman, S. H. Halani, J. E. V. Vega, and D. J. Brat: 2017, ‘Predicting clinical outcomes from large scale cancer genomic profiles with deep survival models’. Scientific reports 7(1), 11707.

Zhang, A., Z. C. Lipton, M. Li, and A. J. Smola: 2019a, ‘Dive into Deep Learning’. Unpublished draft.
Retrieved 3, 319.

Zhang, M., J. Lucas, J. Ba, and G. E. Hinton: 2019b, ‘Lookahead Optimizer: k steps forward, 1 step back’.
In: Advances in Neural Information Processing Systems. pp. 9593–9604.

Žliobaitė, I., M. Pechenizkiy, and J. Gama: 2016, ‘An overview of concept drift applications’. In: Big data
analysis: new algorithms for a new society. Springer, pp. 91–114.



Appendix
A Description of the Game Rules
The goal of the game is to gather blocks of the same color in various formations to score points.


No.  Component Name     Description
1 Heart (or Life) A player needs one heart to play a game and five hearts are provided by default.
Hearts are increased by one every 30 minutes to a maximum of five. If the player
uses up all the hearts, he/she can buy hearts using coins, as rewards for
completing missions, or be gifted one from friends in the game.
2 Level (or Stage) Level means a stage number of the game played by a user. Users are not allowed
to play levels beyond a certain point until they have progressed through lower
ones. However, the cleared stages can be replayed and the highest scores can also
be updated.
3 Trophy Color When player beats the game, he/she will receive one of the gold, silver, or bronze
trophies, depending on his/her achieved score. A gold trophy will be awarded for
clearing with a high score, but a bronze trophy will be awarded for clearing with
a low score.
4 Coin Coins are a kind of currency that can buy hearts, items, and boosters. Coins can
be acquired through in-app purchases using cash, and they can also be obtained
as compensation for completing various missions or events.
5 Booster Booster refers to items with special features that make it easier to clear certain
levels of the game.
6 Facebook Login If the player allows Facebook login, he/she can post their records on Facebook or
and Friends interact with Facebook friends in the game. Facebook logins are rewarded with
items or coins, and Facebook friends can present hearts to each other.
7 Item Item refers to special features that help the player beat the game more easily.
8 Advertisement If they fail, the player can get additional opportunities by watching video
advertisements. The player can also get more coins or items rewards by looking
at the advertisements.
9 Crystal Trophy When the player replays accomplished stages, they can update their previous
trophy colors (gold, silver, or bronze) or get an additional crystal trophy as a
clearing reward.
10 Gacha The player sometimes gets items or boosters through a random picker.



B Descriptive Statistics of the Mobile Game Dataset (average values per user)

Variable Name Mean Std Min 25% 50% 75% Max


Advertise View 0.015 0.057 0.000 0.000 0.000 0.000 2.293
Boost Use Count 0.038 0.066 0.000 0.000 0.000 0.053 2.240
Boost Use Count Other Level 0.003 0.029 0.000 0.000 0.000 0.000 4.000
Capsule Gacha 0.000 0.001 0.000 0.000 0.000 0.000 0.060
Clear Count Other Level 0.032 0.104 0.000 0.000 0.000 0.011 6.933
Cleared Game Retry Count 0.031 0.122 0.000 0.000 0.000 0.000 7.580
Coin Consumed Total 0.446 11.578 0.000 0.000 0.000 0.500 3445.828
Coin Left Count 35.756 82.963 0.000 30.000 30.000 43.714 24176.966
Contiune With Coin 0.011 0.032 0.000 0.000 0.000 0.008 2.247
Contiune With Coin Other Level 0.001 0.011 0.000 0.000 0.000 0.000 1.951
Crystal Trophy Count 0.002 0.026 0.000 0.000 0.000 0.000 1.047
Facebook Login 0.294 0.443 0.000 0.000 0.000 0.993 1.000
Heart Zero Count 0.011 0.043 0.000 0.000 0.000 0.000 1.607
Heart Zero Count Other Level 0.001 0.014 0.000 0.000 0.000 0.000 1.048
Is Payer 0.010 0.076 0.000 0.000 0.000 0.000 0.901
Item Use Count 0.028 0.059 0.000 0.000 0.000 0.029 2.240
Item Use Count Other Level 0.003 0.031 0.000 0.000 0.000 0.000 6.000
Life Purchase 0.000 0.001 0.000 0.000 0.000 0.000 0.100
Number Of Friend 0.177 0.727 0.000 0.000 0.000 0.000 68.680
Pay Count 0.001 0.006 0.000 0.000 0.000 0.000 0.373
Payment 0.005 0.063 0.000 0.000 0.000 0.000 4.132
Play Count 1.138 0.307 0.000 1.000 1.000 1.189 11.333
Play Count Other Level 0.043 0.141 0.000 0.000 0.000 0.020 7.593
Quest Completed 0.053 0.068 0.000 0.000 0.000 0.105 0.615
Time To Stay 338.210 1367.419 0.000 0.714 33.753 234.846 64988.000
Trophy Color 2.865 0.145 2.392 2.725 2.923 3.000 3.000



C Statistical Methods for Faster and More Stable Training
Complex and noisy real-world data make the training process extremely unstable and prone to converging to poor local minima. Notably, handling sequential tasks makes the problem harder (??). Also, the volume of data used in the analysis has been expanding tremendously (?). To solve these problems, we apply several techniques: 1) the attention mechanism, 2) gradient clipping, 3) Rectified Adam, and 4) the Lookahead optimizer. These methods enable more efficient and stable training while achieving better performance.
The attention mechanism allows the sequential model to focus more on the relevant parts of the input data by acting like random-access memory across time and input data (???). By doing so, the attention mechanism improves the training efficiency and performance of the model. In this paper, we follow the work of ? and implement it on two different tasks: handling 1) text data and 2) customers' sequential behavioral data. For the text data, we apply Yang et al. (2016)'s word-level attention.
Specifically,

$$u_{it} = \tanh(W_w h_{it} + b_w) \qquad (1)$$

$$\alpha_{it} = \frac{\exp(u_{it}^{\top} u_w)}{\sum_t \exp(u_{it}^{\top} u_w)} \qquad (2)$$

$$s_i = \sum_t \alpha_{it} h_{it} \qquad (3)$$

where $h_{it}$ is a word annotation obtained by concatenating the forward hidden state $\overrightarrow{h}_{it}$ and the backward hidden state $\overleftarrow{h}_{it}$ of a bidirectional RNN or TCN, i.e., $h_{it} = [\overrightarrow{h}_{it}, \overleftarrow{h}_{it}]$. $u_w$ is a word-level context vector that encodes "which word is informative" across words. $s_i$ is the output sentence vector that summarizes the information of all words in the $i$th sentence.
With respect to the behavioral data, we replace the input of attention layer from the sequence
of (numeric) embedded words with the sequence of feature values about customer behavior. The
attention mechanism not only allows for faster training and better performance, it also increases
the stability of training by giving more direct pathways to the model structure (?).
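
A minimal PyTorch sketch of the attention layer defined by Eqs. (1)-(3) is given below; it takes the hidden states of a bidirectional RNN or TCN and returns the attention-weighted summary vector. The class name and dimensions are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class StepAttention(nn.Module):
    """Word/step-level attention following Eqs. (1)-(3)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)          # W_w and b_w in Eq. (1)
        self.context = nn.Parameter(torch.randn(hidden_dim))   # u_w in Eq. (2)

    def forward(self, h):                               # h: (batch, time, hidden_dim)
        u = torch.tanh(self.proj(h))                    # Eq. (1)
        alpha = torch.softmax(u @ self.context, dim=1)  # Eq. (2): (batch, time) attention weights
        s = (alpha.unsqueeze(-1) * h).sum(dim=1)        # Eq. (3): weighted summary vector
        return s, alpha
```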
One challenge of handling sequential data is the exploding gradients problem, in which accumulated gradients from large errors make huge updates to the neural network weights, resulting in an unstable or incapable learning process (?). It occurs more commonly with RNN models, and it gets worse when handling sparse and messy real-world data (?????). WTTE-RNN is no exception, and related literature has noted that it sometimes fails during training (???). To solve this problem, we implement gradient clipping, which rescales the gradients whenever their norm exceeds a threshold (Pascanu et al., 2013). Let $g$ be the gradient and $\eta$ the threshold; if $\|g\| > \eta$, the clipped gradient is defined as

$$g \leftarrow \frac{\eta g}{\|g\|} \qquad (4)$$
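
In practice this corresponds to clipping the global gradient norm before each optimizer step; a short PyTorch sketch follows, where `loss`, `model`, and `optimizer` are assumed to come from the surrounding training loop and the threshold of 1.0 is an illustrative value.

```python
import torch

loss.backward()                                                    # accumulate gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # rescale if the global norm exceeds the threshold
optimizer.step()
```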



Advances in optimization algorithms permit faster and more stable training. Adaptive stochastic optimizers such as Adam and RMSProp have been widely used in AI research and applications. However, these adaptive learning approaches have a variance issue: in the early stage of training, problematically large variance creates a risk of converging to undesirable local optima (Liu et al., 2019). Rectified Adam (RAdam) solves this problem by incorporating a warmup (an initial training phase with a much smaller learning rate), which has been shown to reduce the variance. RAdam is described in detail in Algorithm 1.
Another breakthrough in recent optimization techniques is Lookahead, which iteratively updates two sets of weights ("fast weights" and "slow weights") and then interpolates them. By doing so, Lookahead improves training stability and reduces the variance of optimization algorithms such as Adam, SGD, and RAdam (Zhang et al., 2019b). The combination of RAdam and Lookahead is highly synergistic and complementary (?). Lookahead is described in detail in Algorithm 2; a brief usage sketch of the combination follows Algorithm 2.

Algorithm 1 Rectified Adam. All operations are element-wise.

Input: step sizes {αt}t=1..T, decay rates {β1, β2} for the moving 1st and 2nd moments, initial parameters θ0, stochastic objective function ft(θ).
Output: θT: resulting parameters
  m0, v0 ← 0, 0                                   (Initialize moving 1st and 2nd moments)
  ρ∞ ← 2/(1 − β2) − 1                             (Compute the maximum length of the approximated SMA)
  for t = 1, . . . , T do
    gt ← ∇θ ft(θt−1)                              (Compute gradients w.r.t. the stochastic objective at step t)
    vt ← β2 vt−1 + (1 − β2) gt²                   (Update exponential moving 2nd moment)
    mt ← β1 mt−1 + (1 − β1) gt                    (Update exponential moving 1st moment)
    m̂t ← mt / (1 − β1^t)                          (Compute bias-corrected moving average)
    ρt ← ρ∞ − 2 t β2^t / (1 − β2^t)               (Compute the length of the approximated SMA)
    if the variance is tractable, i.e., ρt > 4 then
      lt ← √((1 − β2^t) / vt)                     (Compute the adaptive learning rate)
      rt ← √(((ρt − 4)(ρt − 2) ρ∞) / ((ρ∞ − 4)(ρ∞ − 2) ρt))   (Compute the variance rectification term)
      θt ← θt−1 − αt rt m̂t lt                     (Update parameters with adaptive momentum)
    else
      θt ← θt−1 − αt m̂t                           (Update parameters with un-adapted momentum)
  return θT

Adopted from Liu et al. (2019).



Algorithm 2 Lookahead Optimizer
Require: Initial parameters φ0, objective function L
Require: Synchronization period k, slow-weights step size α, inner optimizer A
  for t = 1, 2, . . . do
    Synchronize parameters θt,0 ← φt−1
    for i = 1, 2, . . . , k do
      Sample a minibatch of data d ∼ D
      θt,i ← θt,i−1 + A(L, θt,i−1, d)
    end for
    Perform the outer update φt ← φt−1 + α(θt,k − φt−1)
  end for
  return parameters φ

Note. Adopted from Zhang et al. (2019b).
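
The usage sketch below shows how the two algorithms combine: RAdam takes k fast steps (Algorithm 1), after which the slow weights are interpolated toward the fast weights (Algorithm 2). The values of k and α, the availability of `torch.optim.RAdam` (present in recent PyTorch releases), and the `model`, `loader`, and `loss_fn` objects are all assumptions standing in for the actual training pipeline.

```python
import torch

k, alpha = 5, 0.5                                      # synchronization period and slow-weights step size
inner = torch.optim.RAdam(model.parameters(), lr=1e-3)
slow_weights = [p.detach().clone() for p in model.parameters()]

for step, (x, y) in enumerate(loader, start=1):
    inner.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    inner.step()                                       # fast-weight update with RAdam (Algorithm 1)
    if step % k == 0:                                  # Lookahead outer update (Algorithm 2)
        with torch.no_grad():
            for p, slow in zip(model.parameters(), slow_weights):
                slow += alpha * (p - slow)             # move slow weights toward fast weights
                p.copy_(slow)                          # reset fast weights to the slow weights
```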

