GiaoHoThanh - RFM and CLV Paper - V2

The document summarizes a study that performed customer segmentation and predicted customer lifetime value (CLV) using machine learning algorithms. Specifically, it used the Recency-Frequency-Monetary (RFM) model to segment customers into groups and the Pareto/Negative Binomial Distribution and Gamma-Gamma models to predict CLV. The study experimented on transaction data from 121,317 customers and found the models achieved high accuracy according to evaluation metrics. The proposed models can help businesses better understand customers and implement effective marketing strategies tailored to each customer group.


Customer segmentation analysis and customer lifetime value prediction using Pareto/NBD and Gamma-Gamma model
Kim-Giao Tran1,2, Van-Ho Nguyen1,2, Thanh Ho1,2,*
1University of Economics and Law, Ho Chi Minh City, Vietnam
2Vietnam National University, Ho Chi Minh City, Vietnam
*Corresponding author, Email: thanhht@uel.edu.vn

ABSTRACT

Customer segmentation divides customers into groups with common characteristics such as demographics, interests, needs, or locations, helping an organization manage customer relationships expertly and gain a deep understanding of its customers. With the advancement of current technology and the proliferation of machine learning methods, this study applies data science algorithms to traditional marketing models: the Recency, Frequency, and Monetary (RFM) model for customer segmentation, and the Pareto/Negative Binomial Distribution (NBD) and Gamma-Gamma models for predicting customer lifetime value (CLV) to determine customer value. The study experiments with a customer segmentation model on a dataset of 121,317 historical transactions, covering both individual customers and retailers. Customers are then divided into different groups based on similar characteristics and behaviors to help managers make better decisions to retain customers. In addition, predicting CLV helps the business view customers more comprehensively.
Experimental results and model evaluation show high accuracy, with Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and F1 scores of 0.79, 0.89, 0.94, and 0.91, respectively. Based on the empirical results, the proposed research model can also be applied in other businesses, helping them form the right and effective business strategies for each customer group depending on their financial and human potential.
Keywords: RFM model, customer segmentation, clustering, CLV, Pareto/NBD, customer
retention
1. Introduction

In the intense competition and complexity of the business environment, customer segmentation helps marketing departments easily define the pivotal solutions to attract each group of customers. Based on data segmentation, customers are classified into different groups according to distinguishing similarities such as gender, age, income, products of interest, and purchasing behaviors (Anitha, P. & Patil, MM, 2019). These characteristics are analyzed and categorized based on the historical purchasing data of the business. The Recency, Frequency, and Monetary (RFM) model is well known in marketing as a tool to identify a company's best customers by calculating and analyzing their spending habits. RFM analysis weighs customers' importance by scoring them on three measures: how recently they have made a purchase (Recency), how often they have bought (Frequency), and how much they have spent (Monetary) (Thanh, HT, Son, NT, 2021).
Besides using RFM for customer segmentation, customer lifetime value (CLV), retention rate, and churn rate form a combination of robust metrics to measure customer satisfaction. While CLV is the discounted value of future profits that the customer spends on the company (Glady, N., Baesens, B. & Croux, C., 2007), the retention rate shows the ability of a company to keep its existing customers (Ismail, M.B.M. & Safrana, M.J., 2015).

In contrast, the churn rate is the percentage of customers moving out of a cohort over a
particular period.
As Kotler and Keller describe customer churn as a phenomenon that results in wasted money and effort (Kotler, P. & Keller, KL, 2006), choosing to focus on retaining existing customers and turning potential customers into loyal ones helps businesses reduce costs compared with building advertising campaigns to attract new customers.
However, when a business has many customers, each with many transactions, it is tough to know whether they are still attached to the business. Moreover, businesses cannot calculate exactly when a customer will leave; they can only predict it based on probabilities, which makes the problem even more difficult.
Realizing the importance of customers to businesses, this study identifies the goal of
analyzing customer segmentation and customer lifetime value with a combination of business,
marketing, and information technology knowledge bases. The final result is to give managers
a multi-dimensional view of their customers. That makes it easier for managers to decide
whether to implement appropriate marketing strategies for each customer group as well as to
assess whether existing customer care policies are still appropriate for retaining customers or
not.

The rest of the article is organized as follows. Section 2 covers the theoretical basis and related studies used to identify models and algorithms suitable for the set goals. Section 3 presents the methodology, describing relevant issues and the experimental process. Section 4 presents the results and discussion of the identified customer segments. The last section contains the conclusion and implications of the study.
2. Theoretical background and related work
This section provides an overview of the literature and related research relevant to the purpose of this study.
The RFM model is usually used to classify customers and define their behaviors.
RFM records the customers' transactions under three factors:

(1) Recency is the time between the last purchase date of that customer and the date of implementing the model;
(2) Frequency is the total number of transactions of that customer;
(3) Monetary is the actual amount of money the customer has spent on the business's products or services.

The most well-known clustering methods based on RFM are customer quintiles (Miglautsch, 2000) and clustering by K-Means.
Clustering using the K-Means algorithm is an unsupervised learning method used for data analysis (Anitha, P. & Patil, MM, 2019). It randomly generates k points as initial centroids, where k is chosen by the user. Each point is assigned to the cluster with the closest centroid, and the centroids are then updated by taking the mean of the points in each cluster (Anitha, P. & Patil, MM, 2019) (Ismail, M. & Dauda, U., 2013) (Yedla, M., Pathakota, S. R. & Srinivasa, T.M., 2010). Data points may move to different clusters after each iteration. The algorithm converges when no point changes clusters and the centroids remain unchanged. K-Means mainly uses Euclidean distance to measure the distance between data points and centroids (Dwivedi, S., Pandey, P., Tiwari, MS & Kalam, A., 2014). The Euclidean distance between two m-dimensional data points x = (x1, x2, x3, …, xm) and y = (y1, y2, y3, …, ym) is described as equation (1):

d(x, y) = √((x1 − y1)² + (x2 − y2)² + … + (xm − ym)²)    (1)
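To make the algorithm concrete, the assignment and update steps described above can be sketched in a few lines of Python. This is a minimal illustration only: the toy standardized RFM rows and the choice k = 2 are invented for the example, not taken from the paper's dataset.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal K-Means (Lloyd's algorithm) using Euclidean distance."""
    rng = np.random.default_rng(seed)
    # pick k distinct data points as the random initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: each point joins the cluster of its closest centroid (equation 1)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each centroid becomes the mean of the points in its cluster
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged: centroids unchanged
            break
        centroids = new_centroids
    return labels, centroids

# hypothetical standardized (R, F, M) rows, two obvious groups
rfm = np.array([[0.1, 0.2, 0.1], [0.2, 0.1, 0.3],
                [2.9, 3.1, 3.0], [3.1, 2.8, 3.2]])
labels, centroids = kmeans(rfm, k=2)
```

In practice a library implementation (e.g. scikit-learn's KMeans) would be used; the sketch only shows the two alternating steps the text describes.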

Although K-Means is the most common clustering algorithm, it still has some drawbacks. Because the centroids are first chosen randomly, the results can differ between runs. Besides, defining the right number of clusters is also a difficult problem. Thanh HT and Son ND (2021) used the Elbow method to find the optimal number of clusters and then used the Silhouette method to re-evaluate the results, while Anitha and Patil (2019) only used the Silhouette score to find the optimal k. These studies point out the efficiency of clustering methods in data science, demonstrate the clustering results in RFM analysis, and describe customers' different behaviors in specific clusters.
The Elbow method determines the number of clusters of a dataset using a visual technique. The graph plots the Sum of Squared Errors (SSE), which measures the differences between points within clusters. The larger the number of clusters k, the smaller the SSE value. When the SSE values of successive cluster counts form an angle, the cluster count at the elbow flexion point is chosen, i.e., the value of k with the biggest reduction in SSE compared with its predecessor (Thanh, HT, Son, NT, 2021) (Humaira, H. & Rasyidah, R., 2020) (Nainggolan, R., Perangin-angin, R., Simarmata, E. & Tarigan, F.A., 2015). The SSE calculation is described as equation (2):

SSE = Σ_{j=1}^{k} Σ_{x_i ∈ C_j} (x_i − m_j)²    (2)

where m_j is the centroid of cluster C_j and k is the number of clusters. The graph of the SSE values for different numbers of clusters looks like an elbow arm. The Elbow method is easy to implement and adequately fits complex, huge datasets, but its weakness is that the user must choose the number of clusters based on experience (Humaira, H. & Rasyidah, R., 2020).
Along with the Elbow method, the Silhouette score is an effective way to see how well each cluster is separated from the others. In the two studies (Anitha, P. & Patil, M. M., 2019) (Humaira, H. & Rasyidah, R., 2020), the authors give two different theories about the range of the Silhouette score. On closer inspection, the Silhouette score lies in the range [−1, +1]: a score near +1 indicates that the clustering performed well, a value around 0 means there is no clear distinction between the clusters, and a score near −1 means the clusters are not well separated (Ogbuabor, G. & Ugwoke, FN, 2018). The Silhouette score of a data point i is written as equation (3):

s_i = (b_i − a_i) / max(a_i, b_i)    (3)

where a_i is the average intra-cluster distance (the mean distance between i and the other data points in the same cluster), and b_i is the average inter-cluster distance (the mean distance between i and the data points in the nearest cluster that i does not belong to).
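Equations (2) and (3) are straightforward to compute directly. The sketch below implements both on a hypothetical two-cluster toy set (the points and labels are illustrative, not the paper's data); b_i is taken as the mean distance to the nearest other cluster, following the standard definition.

```python
import numpy as np

def sse(points, labels):
    """Sum of squared errors to the cluster centroids (equation 2)."""
    total = 0.0
    for c in set(labels):
        members = points[labels == c]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

def silhouette_scores(points, labels):
    """Per-point silhouette s_i = (b_i - a_i) / max(a_i, b_i) (equation 3)."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = np.empty(n)
    for i in range(n):
        same = labels == labels[i]
        same[i] = False
        # a_i: mean distance to the other points in i's own cluster
        a = dist[i, same].mean()
        # b_i: smallest mean distance to the points of any other cluster
        b = min(dist[i, labels == c].mean() for c in set(labels) - {labels[i]})
        scores[i] = (b - a) / max(a, b)
    return scores

# toy clustering: two well-separated clusters in 2-D
pts = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
lbl = np.array([0, 0, 1, 1])
mean_silhouette = silhouette_scores(pts, lbl).mean()
```

For this well-separated toy set the mean silhouette is close to +1 and the SSE is small, matching the interpretation given above. (The sketch assumes every cluster has at least two points.)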
The Pareto/Negative Binomial Distribution (NBD) model is one of the classic RFM-based models used to calculate CLV. The model mostly uses the recency, frequency, and length of the customer's observation period to predict the customer's future purchases (Qismat, T. & Feng, Y., 2020). The Pareto/NBD model was originally proposed by Schmittlein, Morrison, and Colombo and later popularized by Fader and Hardie, who describe the model as based on five assumptions (Peter, SF, Bruce, GSH & Ka, L.L., 2004):
(1) The transactions made by a customer in a period of length t follow a Poisson distribution with transaction rate λ. That is, customers can purchase randomly whenever they want in their active period, but the rate (per unit time) is constant.
(2) Heterogeneity in transaction rates across customers follows a gamma distribution with shape parameter r and scale parameter α.
(3) Each customer has an unobserved lifetime. In other words, the point at which the customer becomes inactive or churns is distributed exponentially with dropout rate μ.
(4) Heterogeneity in dropout rates across customers follows a gamma distribution with shape parameter s and scale parameter β.
(5) The transaction rate λ and the dropout rate μ vary independently across customers.
The Gamma-Gamma model is an extension of the Pareto/NBD model. While the Pareto/NBD model only focuses on the recency and frequency factors, the Gamma-Gamma model uses the monetary component to predict the average future purchase value (Aslekar, A., Piyali, S. & Arunima, P., 2019) (Qismat, T. & Feng, Y., 2020).
The Pareto/NBD and Gamma-Gamma models are a powerful combination for calculating CLV: while Pareto/NBD predicts future purchases, the Gamma-Gamma model assigns a monetary value to each of those future purchases. To ensure the best estimated CLV, these models can be evaluated on the holdout period before making forecasts.
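As a sketch of how the Gamma-Gamma model assigns a monetary value, its conditional expected average order value can be written as a weighted average of the population mean p·γ/(q−1) and the customer's observed average spend, with the weight shifting toward the observed average as the number of transactions x grows. The parameter values below are hypothetical, not fitted to the paper's data.

```python
def expected_avg_order_value(p, q, gamma, x, mbar_x):
    """Conditional expectation of a customer's average order value under the
    Gamma-Gamma model: a weighted average of the population mean p*gamma/(q-1)
    and the customer's observed average mbar_x over x transactions."""
    population_mean = p * gamma / (q - 1)
    weight = (q - 1) / (p * x + q - 1)  # shrinks toward mbar_x as x grows
    return weight * population_mean + (1 - weight) * mbar_x

# illustrative (not fitted) parameters and one hypothetical customer:
# 5 observed transactions averaging 40.0 per order
p, q, g = 6.25, 3.74, 15.44
val = expected_avg_order_value(p, q, g, x=5, mbar_x=40.0)
```

A customer with no repeat purchases (x = 0) is simply assigned the population mean; in practice libraries such as `lifetimes` fit p, q, and γ from the data and perform this computation internally.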
3. Methodology and proposed research model
Figure 1 describes the methodology and proposed research model, with three main stages:
(1) Stage 1 is customer segmentation analysis according to RFM. The input data is a dataset extracted from the sales records of Microsoft's Adventure Works sample database; the study performs data preprocessing and calculates the recency, frequency, and monetary values used in the RFM model. After preprocessing the data and observing the differences between the data points, the study standardizes the input data and then uses K-Means-related methods to find the optimal number of clusters for segmentation;

Figure 1. Overview of the proposed research model

(2) Stage 2 uses the Pareto/NBD and Gamma-Gamma models to predict the number of purchases and the revenue that customers will yield in the future; both models are exploited and developed from the RFM model. The two models are built on the training set, and the predictions are re-evaluated on the test set to assess their accuracy. This loop is repeated, adjusting the models' parameters until they give the most optimal results;
(3) Stage 3 uses the two optimal models above to perform customer lifetime value (CLV) prediction.
4. Experimental results and discussion
4.1. Customer segmentation using RFM
The first stage of the experimental process includes data preprocessing and standardization, RFM data construction, and K-Means customer segmentation (Figure 1).
4.1.1. Dataset and data preprocessing
The study uses a dataset of customer transactions extracted from the dataset of the
company Adventure Works Cycles. This is a multinational company that manufactures and
sells bicycles to the North American, European, and Asian markets. The extracted dataset
records 121,317 transactions of the company from 06/2011 to 07/2014. This includes both
individual customers and retailers. To analyze the optimal customer segment for each
different market, the study filters out the transactions made in the US (United States) market
for use in further analysis.
4.1.2. Customer retention analysis
Before clustering customers with the RFM model, the study briefly analyzes the company's customer retention situation to gain insight into its business status. Adventure Works manufactures and sells bicycles to both individuals and resellers; since bicycles are non-essential, durable goods, the proportion of customers with only one transaction over the three years is very high, at 74.31% (Figure 2). Meanwhile, customers with repeat transactions accounted for only 25.69% but brought even higher revenue than the others over time. In particular, there were sudden increases in the revenue that this group of customers brought to the business roughly every month.

Figure 2. Sales by customers over time

It can be seen in Figure 3 that, in the period from 05/2011 to 06/2013, approximately
two years, the number of regular customers was higher, almost all customers returned to
make transactions again with the company. However, starting from 07/2013, when the
business had a sudden growth in attracting more customers, the number of customers
leaving when they only transacted once with the business was very high.

Figure 3. Number of churn and repeated customers over time

Grouping customers by cohort means grouping customers according to the timeline of each customer's first transaction (Croll, A. & Yoskovitz, B., 2013). The retention rate of a cohort in period t is described as equation (4):

Retention rate_t = (number of cohort customers active in period t) / (initial cohort size)    (4)
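Equation (4) can be computed from raw transactions with plain Python; the (customer, month) records below are hypothetical and only illustrate the cohort bookkeeping.

```python
from collections import defaultdict

# hypothetical (customer_id, month_index) transaction records
transactions = [("c1", 0), ("c1", 1), ("c1", 3),
                ("c2", 0), ("c2", 3),
                ("c3", 1), ("c3", 2)]

# cohort = month of each customer's first transaction
first_month = {}
for cust, month in sorted(transactions, key=lambda t: t[1]):
    first_month.setdefault(cust, month)

# per cohort, record which customers are active n months after joining
cohort_members = defaultdict(set)
active = defaultdict(set)  # (cohort, offset) -> customers seen in that offset
for cust, month in transactions:
    cohort = first_month[cust]
    cohort_members[cohort].add(cust)
    active[(cohort, month - cohort)].add(cust)

def retention(cohort, offset):
    """Share of the cohort transacting `offset` months after their first purchase."""
    return len(active[(cohort, offset)]) / len(cohort_members[cohort])
```

For instance, `retention(0, 1)` here is 0.5: of the two customers who joined in month 0, only one transacted again in month 1. A heat map such as Figure 4 plots this value for every (cohort, offset) pair.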

The retention rate of each cohort is shown along the horizontal axis of Figure 4. From the heat-map analysis of customer retention rates, it can be seen that:

Figure 4. Retention rates in cohort analysis

Customers of the business did not transact regularly once a month, but on average, customers
came back every 2-3 months. With the group of customers having transactions from the beginning of the
observed period, from June 2011, only 4% of customers returned to transact in the next month. However,
with a cycle every 2-3 months, the customer
retention rate of the business at this time was very high, in the 34th month, it still maintained
67% of the total number of original customers.
In contrast, the retention rate for new customer groups decreased significantly.
Generally, the company's customer retention policy was appropriate for the period before 2012
and was able to retain this group of loyal customers until the end of the period.
However, it seemed to be no longer suitable for new customer groups, especially when the business in the later period promoted marketing and attracted more customers but could not keep them. The business should focus more on customer care policies as well as targeted marketing campaigns to attract returning customers.
4.1.3. Customer segmentation based on RFM scores
This is the traditional and simplest way to explain how the RFM model works. The RFM model is famous for transforming transactional data, which basically includes CustomerID (unique customer code), SalesOrderID (unique invoice code), ProductID (unique product code), InvoiceDate (date of the transaction), Quantity (quantity of purchased items), UnitPrice (price of one item), and Country (country of the transaction), into profitability scores (Zaki, M., Kandeil, D. & Neely, A., 2016). After calculating recency, frequency, and monetary for the RFM analysis, the statistical distribution of these factors (mean, minimum, maximum, and quartiles) is described in Table 1. On average, a customer's last purchase was 206 days ago, with nearly 1.5 purchases and a total spend of 1,473.8.

Table 1. Quartiles description in RFM table

              Recency       Frequency   Monetary
Mean          206.377101    1.466626    1473.809070
Min           0.000000      1.000000    1.374000
Max           1122.000000   12.000000   58662.190608
1st quartile  91.000000     1.000000    21.490000
2nd quartile  177.000000    1.000000    69.990000
3rd quartile  277.000000    2.000000    2294.990000

While the authors in (Zaki, M., Kandeil, D. & Neely, A., 2016) ranked customers in quintiles, this study ranks them based on quartiles. Following the related works, customers with the highest recency value receive an R score of 1, while those with the lowest recency receive an R score of 4, because customers with more recent transactions are considered more valuable to the business. This step is repeated for frequency and monetary but in reverse: the highest frequency and monetary values receive a score of 4 and the lowest a score of 1. Note that customers' value is proportional to their RFM scores. By combining the RFM scores, the least valuable customers have an overall RFM score of 111, while those with an RFM score of 444 are considered the company's top customers.
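The quartile scoring just described can be sketched with pandas' qcut. The per-customer RFM rows below are invented for illustration, and breaking frequency ties by rank is an assumption of this sketch that the paper does not spell out.

```python
import pandas as pd

# hypothetical per-customer RFM table (illustrative values only)
rfm = pd.DataFrame({
    "recency":   [5, 40, 120, 300, 15, 210, 90, 360],
    "frequency": [8, 5, 3, 1, 6, 2, 4, 1],
    "monetary":  [900.0, 450.0, 120.0, 20.0, 700.0, 60.0, 300.0, 15.0],
})

# recency is scored in reverse: the most recent buyers (smallest recency) get 4
rfm["R"] = pd.qcut(rfm["recency"], 4, labels=[4, 3, 2, 1]).astype(int)
# frequency and monetary: higher value -> higher score (ties broken by rank)
rfm["F"] = pd.qcut(rfm["frequency"].rank(method="first"), 4, labels=[1, 2, 3, 4]).astype(int)
rfm["M"] = pd.qcut(rfm["monetary"], 4, labels=[1, 2, 3, 4]).astype(int)

# combined score: 444 = top customers, 111 = least valuable
rfm["RFM"] = rfm["R"].astype(str) + rfm["F"].astype(str) + rfm["M"].astype(str)
```

In this toy table the customer who bought 5 days ago, 8 times, for 900.0 scores 444, while the one who last bought 360 days ago, once, for 15.0 scores 111.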
This study divided customers into segments based on distinguished combinations of the RFM scores, which Jasmin (2020) described as a graphic in her blog. This study uses her exemplary segments with the RFM scores reversed.
Figure 5. Customer segmentation distribution

Figure 5 shows the RFM segmentation labeling results in a treemap. Based on the different segments, the company needs specific strategies to develop its business. The company had a huge number of new customers in the Unsteady Customers segment (36.53%) with high monetary value; it is advisable to build long-term relationships with these customers through cross-selling strategies or specific promotions. Besides, the Customers At Risk segment, which accounted for 16.95%, is also a potential group the business can exploit: with a very high monetary value but no trading for a long time, finding a way to contact and pull these customers back will bring great benefit to the business. Top Customers and Active Customers accounted for a small percentage, but the profits they brought were considerable; the company cannot afford to lose them. Finally, the company had quite a few Inactive and Lost customers. A company whose main products are long-term usable bicycles can expect many churned customers, but managers can research these customer groups for deeper insight to find the exact churn reasons and re-engage these customers as much as possible.
4.1.4. Data standardization

After preprocessing the data and preparing the input data for the RFM model with the corresponding recency, frequency, and monetary values, it was found that there are huge differences between these three values, which can affect the model run time and the accuracy of the algorithms. The study standardized the data using the Standard Score (Z-Score) method (Ismail, M. & Dauda, U., 2013) to bring the data to a distribution where the mean of the observations is 0 and the standard deviation is 1. The formula for standardizing the data is described as equation (5):

z = (x − μ) / σ    (5)
where x is the initial value before standardization, μ is the mean of the observations, and σ is the standard deviation of the observations. After standardizing the data, the recency, frequency, and monetary values weigh equally when included in the K-Means clustering analysis.
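A minimal sketch of equation (5) applied to one RFM column; the sample monetary values are illustrative, not the paper's data.

```python
import numpy as np

def z_score(column):
    """Standard score (equation 5): z = (x - mu) / sigma."""
    return (column - column.mean()) / column.std()

# hypothetical monetary column with a wide spread of values
monetary = np.array([21.49, 69.99, 2294.99, 11299.87, 1473.81])
z = z_score(monetary)
# after standardization the column has mean ~0 and standard deviation ~1
```

Applying the same transformation to recency and frequency puts all three columns on a comparable scale before K-Means.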
4.1.5. The optimal number of clusters for K-Means
As described in the theoretical background, this study uses the Elbow method and the Silhouette score to find the optimal number of clusters for the dataset. The results for cluster counts from 2 to 9 are described in Figure 6. The graph has the shape of an elbow, and the SSE line shows that the elbow flexion point is around k = 3 or k = 4. The Silhouette score is then used to re-evaluate the Elbow result and obtain the final choice of k.

Figure 6. Elbow method result Figure 7. Silhouette score result


Figure 7 illustrates the Silhouette score for each number of clusters. It can be seen that k = 3, with a score of 0.74, is the highest among the candidates. This indicates that with k = 3, the distance from data points to their centroid in each cluster is optimized and cluster overlap barely occurs. Therefore, the study uses k = 3 to cluster customers into different levels based on the three factors of the RFM model (Recency, Frequency, and Monetary).
The number of customers and the average value of recency, frequency, and
monetary of each cluster after being divided into 3 clusters by K-Means are all described in Table 2.
It can be seen that the Gold cluster includes the least number of customers who have the
most transactions, relatively recent purchases and bring the highest revenue for the business.
The other two groups have quite similar mean frequency and monetary values, but the average recency of one group is nearly twice that of the other. The group with the highest recency value is labeled the Bronze group because of its lack of recent transactions with the business.
Table 2. Each cluster description

Cluster   Customers   Mean recency   Mean frequency   Mean monetary

Gold      216         149.527778     8.606481         11299.873848
Silver    4665        148.998928     1.231726         1162.314786
Bronze    3329        290.471012     1.332532         1272.754954

Figure 8 describes the clustering result with k = 3 in three-dimensional space. The Silver level is the group with the highest convergence, but there is still confusion between the data points in the Silver and Bronze levels. The Silver group contains customers who
have more stable recency and frequency indexes while the ones in the Bronze level have
even higher monetary value but stopped trading for a long time. Besides the Gold group
with its distinction from the others, Silver and Bronze were labeled mainly based on the
average recency value.

Figure 8. Clustering result by RFM level

4.2. Predicting CLV using Pareto/NBD and Gamma-Gamma model

The second stage of the experimental process includes constructing the data for the Pareto/NBD and Gamma-Gamma models (Figure 1), dividing the data into calibration and holdout datasets for evaluation, and then using the models to predict CLV.
4.2.1. Constructing input data for Pareto/NBD and Gamma-Gamma model
Because the Pareto/NBD and Gamma-Gamma models use the RFM basis, the data used for these models is quite similar to the previous data construction. The difference is that Pareto/NBD only considers customers with repeated transactions, which means customers with only one transaction have frequency = 0 and recency = 0 in this model. The Pareto/NBD model also uses another factor, the customer lifetime (T), calculated as the distance from the customer's first purchase date to the model implementation date.

The data for the Gamma-Gamma model is the same as the data for the Pareto/NBD model, but only the rows with frequency and monetary value greater than 0.
4.2.2. Calibration and holdout dataset

The calibration dataset starts at the beginning of the observed period from June 7,
2011 to July 7, 2013, and the holdout period spans from July 8, 2013 to July 7, 2014,
exactly 365 days. The percentage is approximately 70% in calibration and 30% in the
holdout dataset.

4.2.3. Predicting future purchases using Pareto/ NBD model


Figure 9 illustrates the number of purchases in the calibration dataset on the x-axis
and the corresponding average number of purchases in the holdout dataset on the y-axis.
As can be seen, the model predicts that customers with more purchases in the calibration period will also have more purchases in the holdout period, except for a slight reduction for customers with 5 purchases in the calibration. In contrast, the actual holdout data shows more unpredictable volatility.

Figure 9. Actual and predicted purchases of Pareto/NBD model in holdout dataset

Evaluating models can be the most important step in Data Science. This study uses
several indicators to evaluate the quality of the prediction model. The formulas of these
indicators are described as equations (6), (7), and (8) (Chicco, D., Warrens, M.J. & Jurman,
G., 2021):
MAE = (1/n) Σ_{i=1}^{n} |x_i − y_i|    (6)

MSE = (1/n) Σ_{i=1}^{n} (x_i − y_i)²    (7)

RMSE = √((1/n) Σ_{i=1}^{n} (x_i − y_i)²)    (8)

where x_i are the actual values and y_i the predicted values. MAE measures the errors between actual and predicted observations. MSE measures the average of the squares of the errors, i.e., the average squared difference between the actual and estimated values. RMSE is also a measure of the differences between the two sets of observations. The closer these indicators are to 0, the fewer errors the predictions have. Table 3 shows that the Pareto/NBD model predicts future purchases fairly well, as the evaluation values are all small.
Table 3. Pareto/NBD purchases prediction evaluation

Types Results

Mean Absolute Error (MAE) 0.7904071127295603

Mean Squared Error (MSE) 0.8850164704192321

Root Mean Squared Error (RMSE) 0.9407531399996665
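Equations (6), (7), and (8) can be checked with a few lines of NumPy; the actual and predicted vectors below are hypothetical, not the study's results.

```python
import numpy as np

def mae(x, y):
    """Mean Absolute Error (equation 6)."""
    return np.abs(x - y).mean()

def mse(x, y):
    """Mean Squared Error (equation 7)."""
    return ((x - y) ** 2).mean()

def rmse(x, y):
    """Root Mean Squared Error (equation 8)."""
    return np.sqrt(mse(x, y))

# hypothetical actual vs predicted purchase counts
actual = np.array([1.0, 2.0, 0.0, 3.0])
pred = np.array([1.5, 1.0, 0.0, 2.0])
# mae -> 0.625, mse -> 0.5625, rmse -> 0.75
```

Note that RMSE is simply the square root of MSE, which is why the two values in Table 3 are so close.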


4.2.4. Predicting the future average order value using the Gamma-Gamma model
The estimated results of the Gamma-Gamma model are shown in Figure 10. The histogram plots the monetary value distribution of the actual and estimated observations. It shows that the predicted results tend to be smaller than the actual ones, and that both predicted and actual monetary values are concentrated near zero.

Figure 10. Actual and predicted of the Gamma-Gamma model in the holdout set

Because the spread of the monetary values is much larger than that of the factors used in the previous model, instead of using metrics such as MAE, MSE, and RMSE, which are suited to normalized, standardized datasets or values close to zero, the chosen option is to divide the monetary values into 5 bins according to an ordinal variable and K-Means, and then use the confusion matrix and the F1 score to evaluate the accuracy of the model.
The confusion matrix in Figure 11 shows that the Gamma-Gamma model worked well on the holdout set, as the predictions were mostly assigned to the right bins. The F1 score is 0.9, which means the estimation had high accuracy.
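The binning-plus-F1 evaluation can be sketched as follows. The monetary values and bin edges below are invented for illustration (the paper derives its 5 bins from an ordinal variable and K-Means), and scikit-learn supplies the confusion matrix and weighted F1.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# hypothetical monetary values: actual vs predicted by the model
actual = np.array([10.0, 15.0, 80.0, 95.0, 400.0, 450.0, 2000.0, 9000.0])
pred = np.array([12.0, 14.0, 70.0, 210.0, 380.0, 500.0, 1800.0, 8500.0])

# shared ordinal bin edges (illustrative); digitize maps each value to a bin 1..5
edges = [0, 50, 200, 1000, 5000, np.inf]
actual_bin = np.digitize(actual, edges)
pred_bin = np.digitize(pred, edges)

# compare the binned predictions against the binned actual values
cm = confusion_matrix(actual_bin, pred_bin)
f1 = f1_score(actual_bin, pred_bin, average="weighted")
```

Here one prediction (210.0 vs an actual 95.0) lands in the wrong bin, so the diagonal of the confusion matrix holds 7 of the 8 customers and the weighted F1 drops below 1.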

Figure 11. Confusion matrix of actual and predicted monetary value

After training and evaluating the models to check their quality, the Gamma-Gamma model was implemented again on the initial dataset to check for overfitting. Fortunately, the model also performed well on the original dataset: Figure 12 shows that the actual and predicted monetary values have a linear correlation, and the histogram shows the actual and predicted values almost overlapping (Figure 13). After training and tuning the Pareto/NBD and Gamma-Gamma models and finding that both worked quite well with high evaluation scores, the study applies these two models to predict the CLV of the company's customers.

Figure 12. Scatter plot of actual and Figure 13. Histogram of actual and predicted
predicted monetary value in initial monetary value in initial dataset
dataset

4.2.5. Predicting CLV values

Figure 14. Average predicted CLV by customer segmentation

Among the 5 groups of repeat customers shown in Figure 14, the model predicts that the Top Customers group has the largest customer lifetime value. The Active Customers and Unsteady groups have fairly low values because they did not make many transactions with the business during the observed period. Although the Customers At Risk group had not traded with the business for a long time, the number of orders and the amount of revenue this customer group could bring is very large for the business, so its estimated CLV is fairly high.
4.3. Discussion

The study found a relationship between the original RFM customer segmentation and the RFM customer clustering by K-Means. Figure 15 describes the total number of customers in the 8 segments divided by RFM score and the 3 clusters classified by K-Means. As can be seen, customers at the Gold level mostly fell into the Top Customers, Emerging Customers, and Customers At Risk segments, which were also the three segments with the highest predicted CLV in the previous analysis. This shows that the models used in the study are closely related to each other.
Besides, as mentioned above, the Bronze and Silver groups had almost the same
Frequency and Monetary values, differing substantially only in Recency, so K-Means
clustering was not fully effective here. Moreover, since each run of K-Means can give
different clustering results, and the user must label the clusters for each customer group
based on those results, considerable domain expertise is required to cluster and label the
groups thoroughly. Nevertheless, the K-Means clusters and the Pareto/NBD and
Gamma-Gamma predictions were consistent: the three customer groups with the highest
CLV consisted mainly of Gold-level customers, and the next two highest-CLV groups
contained mostly Silver-level customers.
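The comparison in Figure 15 amounts to a cross-tabulation of RFM level against K-Means cluster. A minimal sketch of that counting step, with made-up labels and counts purely for illustration:

```python
from collections import Counter

# Toy per-customer labels: RFM level from scoring, segment from K-Means.
rfm_level = ["Gold", "Gold", "Silver", "Bronze", "Gold", "Silver"]
cluster   = ["Top Customers", "Customers At Risk", "Unsteady",
             "Unsteady", "Top Customers", "Active Customers"]

# Count customers per (level, cluster) cell, as plotted in Figure 15.
crosstab = Counter(zip(rfm_level, cluster))
for (level, segment), n in sorted(crosstab.items()):
    print(level, "|", segment, "->", n)
```

With real data the same table is typically built in one call (e.g. `pandas.crosstab`); inspecting which clusters each RFM level concentrates in is what reveals the Gold-to-top-CLV correspondence discussed above.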

Figure 15. The number of customers by segmentation and RFM level

5. Conclusions and implications


By combining marketing and business knowledge with information technology, a clearer
view of the Adventure Works company was obtained. RFM is an easy-to-apply and
flexible method for customer segmentation. However, as Mark Patron observed, RFM
alone does not reveal a customer's profitability or potential (Mark, P., 2004), so combining
the RFM and CLV results to find hidden potential customers can be very profitable.
Managers can build on these results to implement customer care policies such as discounts
and customer gratitude programs for Gold and Silver customers, or use cross-selling
strategies to maximize profits from existing customers as well as attract new ones. The
models in this study were tuned for this dataset; they can be developed further according
to a company's needs.

References

Anitha, P. & Patil, M. M. (2019). RFM model for customer purchase behavior using K-Means
algorithm. Journal of King Saud University – Computer and Information Sciences,
1-8. doi:10.1016/j.jksuci.2019.12.011
Aslekar, A., Piyali, S. & Arunima, P. (2019). Big Data Analytics for Customer Lifetime Value
Prediction. Telecom Business Review, 12(1), 46-49. Retrieved from http://
publishingindia.com/tbr/
Chicco, D., Warrens, M.J. & Jurman, G. (2021). The coefficient of determination R-squared
is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression
analysis evaluation. PeerJ Computer Science, 7(3), e623. doi:10.7717/peerj-cs.623
Croll, A. & Yoskovitz, B. (2013). Lean Analytics: Use Data to Build a Better Startup
Faster (1st ed.). O'Reilly Media.
Dwivedi, S., Pandey, P., Tiwari, M.S. & Kalam, A. (2014). Comparative Study of Clustering
Algorithms Used in Counter Terrorism. IOSR Journal of Computer Engineering (IOSR-
JCE), 16(6), 13-17.
Glady, N., Baesens, B. & Croux, C. (2007). A modified Pareto/NBD approach for
predicting customer lifetime value. Expert Systems with Applications. Elsevier Ltd.
doi:10.1016/j.eswa.2007.12.049
Humaira, H. & Rasyidah, R. (2020). Determining The Appropriate Cluster Number Using
Elbow Method for K-Means Algorithm. Proceedings of the 2nd Workshop on
Multidisciplinary and Applications (WMA) 2018 (pp. 24-25). Padang: EAI. doi:10.4108/
eai.24-1-2018.2292388

Ismail, M. & Dauda, U. (2013). Standardization and Its Effects on K-Means Clustering
Algorithm. Research Journal of Applied Sciences, Engineering and Technology,
6(17), 3299-3303. doi:10.19026/rjaset.6.3638
Ismail, M.B.M. & Safrana, M.J. (2015). Impact Of Marketing Strategy On Customer Retention
In Handloom Industry. Sri Lanka: 5th International Conference, SEUSL.
Jasmine. (2020, November 12). Machine Learning In Customer Segmentation With
RFM-Analysis. Retrieved from Nextlytics:
https://www.nextlytics.com/blog/machine-learning-in-customer-segmentation-with-rfm-analysis
Kotler, P. & Keller, K.L. (2006). Marketing Management (12th ed.). New Jersey: Pearson
Prentice Hall.

Mark, P. (2004). Applying RFM segmentation to the SilverMinds catalog. Journal of
Direct, Data and Digital Marketing Practice, 5(3), 269-275. doi:10.1057/palgrave.im.4340243

Miglautsch, J.R. (2000). Thoughts on RFM scoring. Journal of Database Marketing &
Customer Strategic Management, 8(1), 67-72. doi:10.1057/palgrave.jdm.3240019
Nainggolan, R., Perangin-angin, R., Simarmata, E. & Tarigan, F.A. (2015). Improved the
Performance of the K-Means Cluster Using the Sum of Squared Error (SSE)
optimized by using the Elbow Method. Journal of Physics: Conference Series, 1361.
doi:10.1088/1742-6596/1361/1/012015

Ogbuabor, G. & Ugwoke, F. N. (2018). Clustering Algorithm For A Healthcare Dataset
Using Silhouette Score Value. International Journal of Computer Science &
Information Technology (IJCSIT), 10(2), 27-37. doi:10.5121/ijcsit.2018.10203
Fader, P.S., Hardie, B.G.S. & Lee, K.L. (2005). "Counting Your Customers" the Easy
Way: An Alternative to the Pareto/NBD Model. Marketing Science, 24(2), 275-284.
doi:10.1287/mksc.1040.0098

Qismat, T. & Feng, Y. (2020). Comparison of classical RFM models and machine
learning (Master's thesis). Norway.
Thanh, H.T. & Son, N.D. (2021). An interdisciplinary research between analyzing
customer segmentation in marketing and machine learning method. Sci. Tech. Dev.
J. - Eco. Law Manag., 6(1), 2005-2015.
Yedla, M., Pathakota, S.R. & Srinivasa, T.M. (2010). Enhancing K-means Clustering
Algorithm with Improved Initial Center. International Journal of Computer Science
and Information Technologies, 1(2), 121-125.
Zaki, M., Kandeil, D. & Neely, A. (2016). The Fallacy of the Net Promoter Score: Customer
Loyalty Predictive Model. UK: Cambridge Service Alliance.
