Rapport PFE Balsam Bendhif
INTERNSHIP REPORT
For the completion of the Master's degree in
Business Analytics
01/05/2024 – 31/08/2024
Internship Supervisor: Hatem Jemai
Academic Advisor: Tarek Hamrouni
Dedications
To my beloved parents
I owe everything I am today to your love, patience, and countless sacrifices. May this humble
work serve as a small token of my gratitude and recognition for all the incredible things you
have done for me. May God, the Almighty, bless you with health and long life, so that I may,
in turn, bring you happiness and fulfillment.
To my dearest friends
In honor of the genuine friendship that binds us and the wonderful moments we've shared, I
dedicate this work to you. I wish you a bright future filled with endless possibilities and great
promises.
Acknowledgments
First and foremost, I extend my deepest gratitude to the entire pedagogical team at ESB for
their unwavering guidance and support throughout my academic journey. I am also
immensely grateful to the internship team at Tunisie Télécom for providing a profoundly
enriching and engaging experience during my four-month tenure.
A special thanks goes to my supervisor, Mr. Hatem Jemai, for swiftly integrating me into the
company, placing his trust in me, offering constant encouragement and constructive
feedback, and dedicating his time to address all my queries despite his demanding schedule.
Lastly, I extend my appreciation to the jury members for their willingness to evaluate this
work.
Résumé
The main objective of this final-year project is to analyze customer behavior and evaluate customer value at Tunisie Télécom. We will use data analysis and machine learning techniques to understand customer behavior and develop a churn prediction model. We will also evaluate Customer Lifetime Value (CLV) to estimate each customer's financial contribution to the company over a given period of time. These models will be integrated into a web application that will allow Tunisie Télécom agents to easily make predictions and evaluate CLV. The application will also include a link to the BI report, enabling visualization of historical customer data. To carry out this project, we followed the CRISP-DM project methodology.
Keywords: customer behavior, customer lifetime value, machine learning, churn prediction, CRISP-DM, Python
Abstract
The main objective of this final year project is to analyze customer behavior and evaluate
customer value at Tunisie Télécom. We will use data analysis and machine learning
techniques to understand customer behaviors and develop a churn prediction model.
Additionally, we will evaluate the Customer Lifetime Value (CLV) to estimate the financial
contribution of each customer to the company over a given period. These models will be
integrated into a web application that will assist Tunisie Télécom agents in making
predictions and evaluating CLV easily. This application will also include a connection to the BI
report, allowing visualization of historical customer data. To successfully carry out this
project, we followed the CRISP-DM project methodology.
Keywords: customer behavior, customer lifetime value, machine learning, churn prediction,
CRISP-DM, Python
Table of Contents
Chapter 6: Deployment
6.1 Introduction
6.2 Deployment Process
6.2.1 Model Extraction
6.2.2 Development of the Web Interface
6.2.3 Application Interfaces
6.3 Dashboard Construction
6.4 DAX Implementation
6.5 Gantt Chart
6.6 Conclusion
General Conclusion
List of Figures
List of Tables
General Introduction
The concept of churn, or customer attrition, represents a significant concern for telecom
companies. It not only impacts immediate revenue streams but also threatens long-term
profitability by eroding customer loyalty and reducing CLV potential. Conversely, focusing on
CLV involves identifying and nurturing valuable customer segments that contribute the most
to long-term revenue and profitability.
Our project, titled "Analysis of Customer Behavior and Evaluation of Customer Value at
Tunisie Telecom," aims to address these challenges through a comprehensive data-driven
approach. By integrating both churn analysis and CLV prediction, we seek to achieve the
following objectives:
1. Churn Analysis: Understanding the factors and behaviors that precede customer
churn is crucial. Through advanced analytics and machine learning techniques, we
aim to identify predictive indicators of churn within Tunisie Telecom's customer base.
This analysis will enable proactive measures to mitigate churn risks and enhance
customer retention strategies.
2. CLV Prediction: Predicting and optimizing Customer Lifetime Value involves not only
retaining customers but also maximizing their value over their entire lifecycle. By
developing models that forecast CLV, we can segment customers based on their
potential profitability and tailor strategies to nurture high-value segments. This
approach ensures that resources are effectively allocated to retain and grow the
most profitable customer relationships.
By adopting the CRISP-DM methodology, our project unfolds as a structured exploration into
data science techniques tailored to the telecom industry's specific challenges. Through a
rigorous analysis of historical data and the implementation of predictive models, we aim to
empower Tunisie Telecom with actionable insights to improve customer satisfaction, reduce
churn, and ultimately maximize CLV.
This report is structured into six chapters, each addressing a key stage of our project journey.
Through this comprehensive approach, we aim to equip Tunisie Telecom with the tools and
strategies needed to navigate the complexities of customer retention and CLV optimization
in a competitive telecommunications landscape.
Chapter 1: Project Study
1. Introduction
Tunisie Télécom is a Tunisian telecommunications company that offers fixed-line and mobile
telephony, internet, and data transmission services. It is one of the largest telecom
operators in the region, serving both individual and corporate clients. Established in 1995,
Tunisie Télécom has grown significantly and aims to enhance profitability while expanding its
international presence.
1.1.2 History
Originally established as the National Telecommunications Agency by Law No. 36 of April 17,
1995, Tunisie Télécom later transitioned into a public limited company under Decree No. 30
of April 5, 2004, under the name "Tunisie Télécom". In July 2006, 35% of its capital was
opened to the Emirati consortium "TeCom-DIG". The company's strategic goals include
maximizing profitability and solidifying its position as a leading international operator.
1.1.3 Organization
Tunisie Télécom comprises 24 regional directorates, 80 Actels and sales points, and over
13,000 private locations. The company employs more than 8,000 personnel across its
operations, which include six customer support centers for fixed-line, mobile, and data
services.
1.1.4 Presentation of the HACHED Complex
The HACHED Complex is one of Tunisie Télécom's major facilities, housing a wide range of
services:
This project is undertaken as part of the final dissertation for the Master's program in
Business Analytics at Esprit School of Business. It is centered around developing predictive
models for customer churn and evaluating Customer Lifetime Value (CLV) at Tunisie
Télécom.
To tackle the challenges of customer churn and maximize Customer Lifetime Value (CLV),
this project proposes using advanced data analytics and machine learning techniques. The
solution involves developing models to estimate CLV by segmenting customers based on
their value to Tunisie Télécom, recommending tailored services, and optimizing resource
allocation. Additionally, machine learning algorithms will be employed to predict customer
churn accurately and identify key factors driving attrition, enabling targeted retention
strategies. To support these efforts, interactive Power BI dashboards will be created for
better understanding and explanation of results, and a web page will be developed to deploy
and share final insights, providing easy access to predictions and actionable data. This
comprehensive approach aims to improve customer retention, enhance satisfaction, and
maximize the value of customer relationships for Tunisie Télécom.
In this section, we explore three project management methodologies for data exploration
and data science: CRISP-DM, SEMMA, and TDSP. Our objective is to analyze each method to
determine the most suitable approach for our project.
SEMMA (Sample, Explore, Modify, Model, Assess) is a data mining methodology developed
by SAS to address data analysis challenges through five key stages:
While CRISP-DM, SEMMA, and TDSP share the same objective, each methodology differs in
approach and flexibility. CRISP-DM was chosen for this project due to its iterative approach
and flexibility, allowing adaptation to changes throughout the project lifecycle.
Criteria | SEMMA | TDSP | CRISP-DM
Application Domain | Data Mining and Predictive Analysis | Team Data Science Projects | Data Mining and Data Exploration
Phases | 5 phases | 5 phases | 6 phases
Iterative Approach | No | Yes | Yes
Flexibility | Low | Medium | High
Priorities | Focus on Exploitation and Modification | Focus on Acquisition and Preparation | Focus on Understanding and Preparation
The project architecture will include detailed plans for data collection, preprocessing, model
development, and deployment, ensuring a systematic approach to achieving project goals.
1. Data collection
2. Data understanding
3. Data preprocessing
4. Feature selection
5. Model building
6. Model evaluation
7. Decision-making for customer retention
8. Dashboard creation
The software work environment will leverage industry-standard tools and platforms for data
analysis, modeling, and visualization, facilitating robust analysis and efficient model
development.
Anaconda: A software distribution environment and package management platform for the Python and R programming languages, specifically tailored for developing data science and machine learning applications.
Jupyter Notebook: An open-source computational notebook that allows for the creation and sharing of documents containing interactive code, data visualizations, and textual explanations.
Power BI: A Business Intelligence platform that enables users to collect, analyze, and visualize data from multiple sources, facilitating the creation of dashboards and data visualizations.
Visual Studio Code: A code editor used to write, modify, and debug code in various programming languages, offering a variety of features aimed at easing the software development process.
Python: A high-level, interpreted, object-oriented, and cross-platform programming language used in various fields such as data science, artificial intelligence, data visualization, and more.
1.7 Conclusion
In conclusion, this chapter has provided an overview of Tunisie Télécom, outlined the project
scope and objectives, and introduced the methodologies and environments that will guide
our approach. The subsequent chapters will delve into specific aspects such as data
understanding, preprocessing, modeling techniques, performance evaluation, and
deployment strategies.
2.1 Introduction
This chapter provides an overview of the business problem and lays the groundwork for
understanding the data through theoretical concepts and analysis techniques.
This section delves into key theoretical concepts essential for the project:
2.2.1 Churn
Churn refers to the rate at which customers stop doing business with a company over a
specific period. It is crucial for telecommunications companies like Tunisie Télécom to
monitor churn rates to assess customer retention strategies effectively.
Types of Churn:
Churn Rate: The percentage of customers who discontinue their service within a given period. It can be calculated as:
Churn Rate = (Number of customers lost during the period / Number of customers at the start of the period) × 100
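As a minimal illustration (not code from the project), this standard churn-rate definition can be computed directly:

```python
def churn_rate(customers_lost: int, customers_at_start: int) -> float:
    """Churn rate: share of customers lost over a period, as a percentage."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start * 100
```

For example, losing 141 of 1,000 customers over a period gives a churn rate of 14.1%.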
Customer Lifetime Value (CLV) estimates the total revenue a customer is expected to
generate over their entire relationship with the company. In this study, we calculate CLV
using an alternative approach based on direct revenue measurements rather than the
traditional Recency, Frequency, and Monetary (RFM) metrics.
CLV is determined as the total revenue generated by a customer over a specific period of
time, as indicated in our data. This approach focuses on the aggregated costs associated with
various types of customer interactions. The calculation is expressed as:
To compute the total revenue, we aggregate the costs associated with different types of calls
and services used by the customer. The formula used in the code is:
revenue_total_client = Cost of daytime calls + Cost of evening calls + Cost of nighttime calls + Cost of inter-network calls

Where:
This approach directly measures the revenue generated from different types of call activities
rather than relying on RFM metrics. By focusing on these revenue components, we can
assess the value of customers based on their overall spending patterns and interactions with
the company.
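As a sketch of this aggregation (the column names below are hypothetical stand-ins for the dataset's actual cost fields):

```python
import pandas as pd

# Hypothetical cost columns mirroring the four components of the formula
df = pd.DataFrame({
    "cout_appel_jour": [12.5, 3.0],   # daytime calls
    "cout_appel_soir": [4.0, 1.5],    # evening calls
    "cout_appel_nuit": [2.0, 0.5],    # nighttime calls
    "cout_appel_inter": [1.5, 0.0],   # inter-network calls
})

# Total revenue per client is the sum of the four cost components
df["revenue_total_client"] = (
    df["cout_appel_jour"]
    + df["cout_appel_soir"]
    + df["cout_appel_nuit"]
    + df["cout_appel_inter"]
)
print(df["revenue_total_client"].tolist())  # [20.0, 5.0]
```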
Machine Learning (ML) is a branch of artificial intelligence that enables systems to learn and
improve from experience without being explicitly programmed. It plays a vital role in
predicting churn and CLV based on historical customer data.
Supervised Learning: Supervised learning involves training a model using labeled data. The
goal is to enable the algorithm to predict similar outcomes on new data by learning the
relationship between known inputs and outputs.
Unsupervised Learning: In unsupervised learning, the input data is not labeled, and there
are no predefined output variables. The algorithm uses clustering, unsupervised
classification, or anomaly detection techniques to uncover hidden relationships in the data.
When training a machine learning model, two major problems can arise: Overfitting and
Underfitting.
2.3.1 Overfitting
Overfitting occurs when a model learns too much noise or specific details from the training
data, resulting in poor performance on new data. The model may show good results on
training data but perform poorly on test data or new examples.
2.3.2 Underfitting
Underfitting occurs when a model is too simple to capture the underlying patterns in the
data, leading to poor predictive performance.
2.5 Database Selection
The database available for this research contains comprehensive information related to the
customers of Tunisie Télécom. It includes the history of national and international call
operations, voicemail messages, and customer complaints.
The table below provides an overview of the dataset, identifying the variables, their types,
and a brief description of their meanings:
This detailed dataset provides a robust foundation for conducting thorough data analysis
and developing predictive models for churn and Customer Lifetime Value (CLV).
The database contains 18 variables in total. These variables encompass various aspects of customer interaction and behavior with Tunisie Télécom, including call usage patterns, costs, service activations, and churn status.
We will use the "read_csv" function from the Pandas library to import data from a CSV file
and store it in a DataFrame.
Exploration of data is a crucial step in data analysis that involves examining and interpreting
data within its context. This step typically includes tasks such as analyzing descriptive
statistics and visualizing data.
data.head(): Displays the first few rows of a DataFrame. By default, it returns the first 5 rows.
data.describe(): Displays descriptive statistics of a DataFrame, including count, mean, standard deviation, minimum and maximum values, and quartiles for each column.
data.info(): Provides a summary of the DataFrame, including column names and types, number of non-null values, and memory usage information.
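These calls combine into a short exploration script. The tiny DataFrame here is an illustrative stand-in, since the real file name and columns are specific to the project:

```python
import pandas as pd

# Stand-in for data = pd.read_csv("customers.csv") -- the file name is hypothetical
data = pd.DataFrame({
    "nb_jours_abonne": [120, 45, 210],
    "nb_reclamation": [0, 6, 2],
    "churn": ["non", "oui", "non"],
})

print(data.head())      # first rows (5 by default)
print(data.describe())  # count, mean, std, min, max, and quartiles per numeric column
data.info()             # column names, dtypes, non-null counts, memory usage
```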
A pie chart was created to show the proportion of each data type in the dataset. This helps
in understanding the dataset's structure and guides preprocessing steps.
Univariate analysis is a statistical analysis method that explores one variable at a time. It
helps summarize and visualize the characteristics of the variable under study. This analysis
improves our understanding of the data. The distribution of customers based on their churn
status is depicted in the figure.
After analyzing the customer churn rate, we observed that the majority of customers remain active (85.9%), while 14.1% have churned, which is relatively high. This class imbalance could bias the final models if not addressed.
The bar plot reveals that the majority of customers are concentrated within the [50, 150]
days subscription period, indicating robust retention during this timeframe. However, fewer
customers are observed in longer-term categories ([150, 250] days), suggesting potential
churn risks in these segments. This insight emphasizes the need for targeted retention
strategies to maintain customer satisfaction and loyalty across all subscription periods
effectively.
Bivariate analysis is a statistical method that examines the relationship between two
variables by comparing or associating them. This approach helps to understand how the
variables are related and how they interact with each other.
The figure shows that churn rate increases significantly when complaints exceed 5.
Customers with more than 5 complaints are more likely to churn, indicating a strong link
between higher complaint volumes and increased attrition.
Based on the figure above, we can conclude that customers who do not use voicemail have a
higher risk of churn.
The swarm plot offers a clear visual representation of the relationship between churn status
and subscription duration. This visualization helps in understanding customer retention and
identifying patterns that may contribute to churn. If the dots for the churned group tend to
cluster around lower values on the y-axis, it indicates that customers who churn usually have
shorter subscription durations.
The box plot indicates that churned customers have a lower median and narrower
interquartile range for subscription durations, suggesting they typically have shorter
subscriptions. This reinforces the trend that shorter subscription periods are linked to higher
churn rates.
To gain deeper insights into customer behavior and revenue generation, we conducted
additional analyses using scatter plots and bar plots.
Displaying the frequency of customer complaints provides insight into grievance distribution,
which is essential for assessing churn risks. An increased complaint frequency may indicate
customer dissatisfaction, highlighting the need for enhanced service delivery. Analyzing this
data allows us to develop strategies to address concerns, improve satisfaction, and reduce
churn rates.
Created to examine the relationships between key features in the dataset and the total
revenue generated by each customer. These visualizations help identify trends and
correlations among customer interactions, such as call frequency and duration, enabling us
to pinpoint features linked to higher revenue and informing targeted marketing strategies.
2.8 Conclusion
This chapter summarized the findings from data exploration and analysis, highlighting key insights and preparing for the next steps of data preparation and model development.
Chapter 3: Data preprocessing
3.1 Introduction
Data preprocessing is a critical step in data analysis, particularly when preparing data for
predictive modeling. This chapter discusses the techniques used to clean, encode, normalize,
and select features from the dataset to ensure the quality and effectiveness of the machine
learning models. The primary goal is to enhance data quality and suitability for analysis,
ultimately improving model performance.
Data cleaning is essential for ensuring accurate and reliable analysis. By removing errors and
inconsistencies, we improve data quality, which leads to more relevant results and better
decision-making. The following tasks were performed:
Column Removal: Columns that were deemed unnecessary were dropped from the
dataset.
Handling Missing Values: Missing values in columns were addressed using
appropriate methods.
Columns such as "churn" and "active_msg_vocaux" were removed as they were not relevant
for the analysis.
3.2.2 Handling Missing Values
Missing values were addressed using various techniques based on the nature of the data.
Missing values in each column were replaced with the most frequent value in that column.
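A minimal sketch of this most-frequent-value (mode) imputation, using a hypothetical column:

```python
import pandas as pd

df = pd.DataFrame({"plan_international": ["non", None, "non", "oui"]})

# Replace missing values in each column with that column's most frequent value
for col in df.columns:
    df[col] = df[col].fillna(df[col].mode()[0])

print(df["plan_international"].tolist())  # ['non', 'non', 'non', 'oui']
```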
Data encoding transforms categorical variables into numerical format suitable for machine
learning algorithms.
Label Encoding for Categorical Variables: The 'churn' column was encoded into
numerical values to facilitate churn prediction.
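A sketch of this encoding step using Scikit-learn's LabelEncoder; the 'oui'/'non' labels are assumed for illustration:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"churn": ["non", "oui", "non", "oui"]})

# LabelEncoder assigns integers to classes in sorted order: 'non' -> 0, 'oui' -> 1
encoder = LabelEncoder()
df["churn"] = encoder.fit_transform(df["churn"])
print(df["churn"].tolist())  # [0, 1, 0, 1]
```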
Feature engineering involves creating new features or modifying existing ones to enhance
model performance.
Feature selection involves identifying the most relevant features for the model to improve
performance and reduce dimensionality.
Variable Importance
Variable importance was assessed using feature selection techniques to determine which
features are most significant for the model.
For Revenue Prediction: Features related to call durations and customer interactions
were selected.
For Churn Prediction: Features such as "nb_reclamation,"
"durée_appel_jour(minutes)," "nb_jours_abonne," and others were selected.
A correlation matrix was created to identify relationships between features, which helps in
understanding feature redundancy.
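A correlation matrix of this kind can be computed directly with Pandas; the columns here are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "duree_appel_jour": [10.0, 20.0, 30.0, 40.0],
    "cout_appel_jour": [1.7, 3.4, 5.1, 6.8],  # proportional to duration
    "nb_reclamation": [4, 1, 3, 0],
})

corr = df.corr()
# A correlation of 1.0 between duration and its cost flags a redundant feature pair
print(corr.round(2))
```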
3.7 Conclusion
This chapter has detailed the essential preprocessing steps to prepare the dataset for
machine learning. We addressed missing values and encoded categorical variables to enhance
data quality. Feature engineering and selection were performed to refine the dataset and
focus on the most relevant variables. These preprocessing efforts are crucial for building
accurate and effective predictive models.
Chapter 4: Customer Lifetime Value (CLV) and Churn Modeling
4.1 Introduction
After completing the data cleaning and preparation phases, we move to the modeling stage.
This chapter aims to develop and implement machine learning models to predict Customer
Lifetime Value (CLV) and churn rates.
To prepare for Customer Lifetime Value (CLV) modeling, the following steps were
performed:
This code selects the relevant columns from the DataFrame df to be used as features (X) and
the target variable (Y). The target variable Y is reshaped into a 2D array to be compatible
with the regression model.
To evaluate the model's performance, the dataset is divided into training and testing
sets.
The train_test_split function is used to randomly split the data, with 20% allocated to the test set and 80% to the training set. The random_state parameter ensures reproducibility of the split.
This process ensures that the CLV modeling is based on well-defined features and a clearly
specified target variable, with data appropriately partitioned for training and evaluation.
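The preparation described above can be sketched as follows, with synthetic arrays standing in for the project's features and target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))            # behavioral features
Y = rng.normal(size=100).reshape(-1, 1)  # target reshaped into a 2-D array

# 80/20 split; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
print(x_train.shape, x_test.shape)  # (80, 4) (20, 4)
```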
Predictive Variables are the independent variables used to predict churn. For this modeling,
the selected predictive variables are:
Given the imbalanced nature of our churn dataset, we implement a stratified approach to
splitting the data. Instead of using the traditional train_test_split function from Scikit-learn,
we utilize StratifiedKFold for cross-validation.
StratifiedKFold Setup:
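A minimal sketch of such a setup; the labels are synthetic, mimicking the dataset's roughly 86/14 class split:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 86 + [1] * 14)    # imbalanced churn labels
X = np.arange(len(y)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    # Every fold keeps roughly the same 86/14 class ratio as the full dataset
    print(np.bincount(y[test_idx]))
```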
Before implementing machine learning models, we will examine the principles of each
algorithm, considering their strengths and weaknesses, to determine their suitability for our
dataset. This section details various supervised learning classification models used for our
analysis and their application to both CLV and churn predictions.
KNN (K-Nearest Neighbors) is a supervised learning algorithm used for classification and
regression. It predicts the class or value of a new sample based on the closest samples in the
feature space.
Steps for KNN Functioning:
Advantages of KNN:
Disadvantages of KNN:
Gradient Boosting Regressor is a general term for a type of boosting algorithm used for
regression tasks. It builds an ensemble of weak learners, typically decision trees, in a
sequential manner where each new tree aims to correct the errors made by the previous
trees. The key steps include:
Building an initial model.
Iteratively adding new models that correct the residuals of the combined previous
models.
Using a learning rate to control the contribution of each new model to the final
prediction.
Advantages:
Disadvantages:
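The steps above can be sketched with Scikit-learn's GradientBoostingRegressor on synthetic data; the hyperparameter values are illustrative defaults, not the project's tuned settings:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Sequential ensemble of shallow trees; learning_rate scales each tree's contribution
model = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
)
model.fit(x_train, y_train)
print(round(model.score(x_test, y_test), 3))  # R² on held-out data
```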
Random Forest Classifier
Functioning:
Training: Trees are trained on bootstrapped samples with random subsets of
features.
Prediction: Each tree votes for a class label; the class with the majority votes is
selected.
Advantages:
Disadvantages:
Random Forest Regressor
Functioning:
Training: Trees are trained similarly to the classifier, using bootstrapped samples and
random subsets of features.
Prediction: The output is the average of predictions from all decision trees.
Advantages:
Disadvantages:
Linear Regression is a supervised learning algorithm used for predicting a continuous target
variable based on the linear relationship between the target and input variables. It
establishes this relationship by fitting a linear equation to the observed data.
Advantages of Linear Regression:
Support Vector Regression (SVR) is an extension of Support Vector Machines (SVM) for
regression tasks. It aims to find a function that deviates from the actual target values by a
value no greater than a specified margin.
Advantages of SVR:
Disadvantages of SVR:
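A small sketch of SVR's margin-based fitting on a noiseless linear relationship (the kernel and parameter values are illustrative):

```python
import numpy as np
from sklearn.svm import SVR

# SVR fits a function that stays within an epsilon-wide tube around the targets
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0  # noiseless linear relationship

model = SVR(kernel="linear", C=10.0, epsilon=0.1)
model.fit(X, y)
print(float(model.predict([[5.0]])[0]))  # close to 2*5 + 1 = 11
```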
4.4 Cross-Validation
The cross-validation procedure involves splitting the original dataset into two parts: a
training set and a validation set. The model is trained on the training set, and its
performance is evaluated on the validation set.
After identifying an overfitting issue with the Random Forest algorithm, we decided to adopt
cross-validation to improve the model's performance by reducing the risk of overfitting.
In this section, we discuss the evaluation metrics used to analyze the performance of
machine learning models effectively. For classification problems, we use several metrics,
including Accuracy, Precision, Recall, and F1 Score.
True Positives (TP): Instances where both actual and predicted values are positive.
True Negatives (TN): Instances where both actual and predicted values are negative.
False Positives (FP): Instances where the actual value is negative, but the predicted
value is positive.
False Negatives (FN): Instances where the actual value is positive, but the predicted
value is negative.
4.6.1.1 Precision
4.6.1.2 Accuracy
4.6.1.3 Recall
Recall measures the number of correct positive predictions relative to the total number of
actual positive instances. It answers the question: Of all the positive examples, how many
were correctly identified by the model?
4.6.1.4 F1-Score
The F1 Score combines Precision and Recall into a single metric by calculating their harmonic
mean. This provides an overall view of a model's prediction quality.
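All four metrics are available directly in Scikit-learn; a small sketch with hypothetical predictions (one false negative and one false positive):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual churn labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions

print(accuracy_score(y_true, y_pred))   # (TP + TN) / total -> 0.75
print(precision_score(y_true, y_pred))  # TP / (TP + FP)    -> 0.75
print(recall_score(y_true, y_pred))     # TP / (TP + FN)    -> 0.75
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall -> 0.75
```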
The ROC curve (Receiver Operating Characteristic curve) and AUC (Area Under the Curve) are performance evaluation tools that assess a classification model by plotting the true positive rate (sensitivity) against the false positive rate (1 minus specificity) across different classification thresholds.
An ideal ROC curve is located in the upper-left corner of the plot, indicating a high true
positive rate and a low false positive rate across all thresholds.
The Learning Curve is a graph that shows how a model's performance evolves with varying
amounts of training data. It compares training error and test error relative to the amount of
training data used. This curve helps visualize the relationship between training error and test
error, allowing developers to determine if the model needs more training data,
hyperparameter tuning, or regularization to avoid overfitting.
To achieve optimal results with most machine learning models, it is essential to find the best
hyperparameters. Two popular techniques for hyperparameter tuning are GridSearch and
RandomizedSearchCV.
GridSearch
RandomizedSearchCV
Table provides detailed information on the algorithms used and the corresponding
hyperparameters adjusted to optimize model performance.
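A sketch of grid-style tuning with Scikit-learn's GridSearchCV; the grid values are illustrative, not the project's actual search space:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Exhaustively evaluate every grid combination with 3-fold cross-validation
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)  # best combination found by cross-validation
```

RandomizedSearchCV follows the same pattern but samples a fixed number of parameter settings instead of trying every combination, which is cheaper on large grids.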
4.8 Conclusion
In this chapter, we have concentrated on the various machine learning models used for
classification. We have examined each model in detail, including their principles, strengths,
and weaknesses. Our next step will be to evaluate the performance of these models to
identify which one delivers the highest accuracy.
Chapter 5: Model Evaluation and Optimization
5.1 Introduction
In this chapter, we assess the performance of the machine learning models developed for
predicting Customer Lifetime Value (CLV) and churn rates. This evaluation will involve
analyzing each model’s effectiveness, adjusting parameters to enhance performance, and
comparing the models to determine the best-performing approach.
For Customer Lifetime Value (CLV) Modeling, we implement five different regression models
to predict CLV using a set of features related to customer behavior. These models aim to
estimate the total revenue a customer will generate over their relationship with the
company based on certain behavioral metrics. Here’s a detailed explanation of each part of
the modeling process:
1. Outlier Consideration
Before building the models, it’s noted that the data contains outliers. These extreme values
can affect the predictions, especially in regression models. Handling these outliers (e.g.,
removing or transforming them) is important to improve model performance. In this case,
no explicit outlier handling is performed.
2. Feature Selection
The independent variables (features) selected for the regression models are behavioral
metrics that reflect customer interaction with the service. These include:
The target variable (y) is the total revenue generated by the client, representing the CLV.
3. Train-Test Split
The data is split into a training set and a test set, with 80% of the data used to train the
models and 20% reserved for testing. Setting random_state=0 ensures that the split is
reproducible. This step is critical for evaluating how well the model generalizes to unseen
data.
4. Model Initialization
We used regression models to predict Customer Lifetime Value (CLV) because CLV is a
continuous numeric variable that represents the total revenue a business expects to earn
from a customer over their lifetime. Regression models are specifically designed to predict
such continuous outcomes by learning the relationship between independent variables
(customer behavior features) and the target variable (CLV).
Predicting Continuous Values: CLV, being a monetary value, can take on a wide range of
possible outcomes. Regression models are ideally suited for predicting such continuous
variables.
Capturing Relationships: These models can capture both linear and non-linear relationships
between customer behavior (e.g., call duration, number of complaints) and the associated
revenue, making them versatile for complex data.
Handling Multiple Features: The dataset includes multiple features influencing CLV.
Regression models can efficiently use these multiple variables to generate accurate
predictions.
Scalability: Regression models are widely used in predictive analytics and can handle larger
datasets, making them practical and scalable for CLV modeling.
5. Models Evaluation
In the evaluation section, we analyze how well different regression models predict Customer
Lifetime Value (CLV) using several important metrics. The evaluate_my_models function
performs the following steps:
1. Model Training: Each model is fitted using the training dataset (x_train, y_train).
2. Prediction: The model generates predictions (y_pred) based on the test dataset
(x_test).
3. Performance Metrics:
R² Score: Displays how well the model explains the variance in the data for both
training and test sets.
Mean Absolute Error (MAE): Indicates the average size of the prediction errors.
Mean Squared Error (MSE): Shows the average of the squared errors, giving more
weight to larger errors.
Root Mean Squared Error (RMSE): Provides a clearer interpretation of the error by
taking the square root of MSE.
These metrics help us understand the accuracy of each model and identify any issues with
overfitting or underfitting.
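The three steps above can be sketched as follows. The function name matches the report's evaluate_my_models, but the signature and the synthetic demonstration data are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def evaluate_my_models(models, x_train, x_test, y_train, y_test):
    """Fit each model and report R2 (train/test), MAE, MSE and RMSE."""
    results = {}
    for name, model in models.items():
        model.fit(x_train, y_train)             # 1. model training
        y_pred = model.predict(x_test)          # 2. prediction
        mse = mean_squared_error(y_test, y_pred)
        results[name] = {                       # 3. performance metrics
            "train_r2": r2_score(y_train, model.predict(x_train)),
            "test_r2": r2_score(y_test, y_pred),
            "mae": mean_absolute_error(y_test, y_pred),
            "mse": mse,
            "rmse": np.sqrt(mse),               # RMSE = square root of MSE
        }
    return results

# Synthetic demonstration data (stand-in for the customer dataset).
x, y = make_regression(n_samples=500, n_features=5, noise=5.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
results = evaluate_my_models(
    {"Linear Regression": LinearRegression(), "Ridge Regression": Ridge()},
    x_train, x_test, y_train, y_test,
)
```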
Model             | Test Score (R²) | Train Score (R²) | MAE    | MSE    | RMSE
Linear Regression | 0.9932          | 0.9932           | 0.9932 | 0.9932 | 0.9932
Model Comparison:
Linear Regression and Ridge Regression: Both models achieved R² scores over 0.99, demonstrating a strong ability to explain the variance in Customer Lifetime Value (CLV). Their low Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) indicate high reliability for CLV prediction.
Support Vector Regression: With an R² score of 0.9881, this model performed well but had higher error metrics than the linear models, suggesting potential for improvement through tuning.
Gradient Boosting Regressor: This model underperformed, with an R² score of 0.9010 and higher error metrics, making it less suitable for our objectives.
Random Forest Regressor: With an R² score of 0.9791, this model showed balanced performance, indicating it could be a viable option with further tuning to capture complex relationships.
Based on the evaluation results, Linear Regression and Ridge Regression emerge as the
leading models for predicting Customer Lifetime Value (CLV). Their exceptional performance
is demonstrated by R² scores exceeding 0.99 and low error metrics, indicating their strong
reliability and accuracy in predictions. Additionally, their comparable training and testing
scores suggest they generalize well without overfitting, making them ideal for our analysis.
Support Vector Regression, while promising with an R² score of 0.9881, exhibited higher
error rates, indicating a need for further tuning and adjustments to enhance its predictive
capability. In contrast, the Gradient Boosting Regressor's lower R² score and higher error
metrics render it less suitable for our objectives. Although the Random Forest Regressor
provides balanced performance with an R² score of 0.9791, it does not surpass the accuracy
of the linear models.
In conclusion, for predicting CLV, we will prioritize the implementation of Linear Regression
and Ridge Regression due to their superior performance and robustness. Other models, such
as Support Vector and Random Forest, may be considered for future experimentation to
explore potential improvements.
This code section finalizes the predictive analysis of Customer Lifetime Value (CLV) by fitting
the best-performing model (either Linear or Ridge Regression) on the entire dataset. After
training the model on the features (x) and target variable (y), it predicts the CLV for all
customers, storing the results in a new column, predicted_clv.
On the full dataset, the refitted model achieved a Mean Absolute Error (MAE) of 0.06 and an R² score of 1.00, indicating near-perfect accuracy.
A scatter plot visually compares the actual total revenue against the predicted CLV, showing
a strong correlation. Finally, a sample of the actual and predicted CLV values is displayed,
confirming the model's ability to accurately predict customer value.
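The finalization step can be sketched as follows. The column names and the small synthetic dataset are illustrative assumptions; in the project, x holds the behavioral features and y the total revenue per client:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Illustrative customer data standing in for the real dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({"call_duration": rng.uniform(0, 300, 200)})
df["revenue_total_client"] = 0.3 * df["call_duration"] + rng.normal(0, 1, 200)

x = df[["call_duration"]]
y = df["revenue_total_client"]

# Refit the best-performing model on the entire dataset and score
# every customer, storing the result in a new predicted_clv column.
best_model = LinearRegression()
best_model.fit(x, y)
df["predicted_clv"] = best_model.predict(x)

mae = mean_absolute_error(y, df["predicted_clv"])
r2 = r2_score(y, df["predicted_clv"])
# A scatter plot of y against df["predicted_clv"] (e.g. with matplotlib)
# visualizes how closely predictions track actual revenue.
```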
The visualization and predicted values indicate that the model performs exceptionally well in
predicting Customer Lifetime Value (CLV) when compared to the actual revenue values. The
scatter plot of actual versus predicted values shows a nearly perfect alignment along the red
diagonal line, which represents a 1:1 correlation. This strong alignment indicates that the
model has minimal error, as the predicted values are very close to the actual revenue values.
For instance, client ID 382-4657 has an actual revenue of 75.56 and a predicted CLV of 75.49,
showcasing a minimal difference. Similarly, for client ID 371-7191, the actual revenue is
59.24, and the predicted CLV is 59.25. These results reflect the low Mean Absolute Error
(MAE) of 0.06 and Mean Squared Error (MSE) of 0.32, confirming the model’s high accuracy.
With an R² score of 1.00, it’s evident that the model captures almost all variance in the data,
making it highly reliable for predicting CLV. This combination of low error metrics and high
predictive accuracy demonstrates that the model is well-suited for practical CLV predictions.
In conclusion, the Customer Lifetime Value (CLV) modeling process has proven effective,
with Linear Regression and Ridge Regression identified as the top-performing models. Both
demonstrated exceptional predictive accuracy, reflected in R² scores above 0.99 and low
error metrics. The strong correlation between actual and predicted CLV values indicates the
model's reliability in capturing customer revenue. These insights will aid in refining
marketing strategies and enhancing customer retention efforts, ultimately driving business
growth.
1. Outlier Consideration
As with CLV modeling, outliers could in principle affect the performance of the churn
prediction models. However, because classification models are generally more robust to
outliers, no explicit outlier handling is performed in this analysis.
2. Feature Selection
The selected features for churn prediction are behavioral metrics indicative of churn risk:
Number of Complaints, Call Duration, Number of Calls, Voice Messages, and International
Call Duration. The target variable (y) is binary, representing churn (1 for churned, 0 for
retained).
4. Model Initialization
For churn prediction, we use classification models suited for binary outcomes, as churn is a
categorical variable (1 for churned, 0 for retained). Key reasons for selecting these models
include:
Complex Relationship Capture: These models can identify both linear and non-linear
relationships between customer behaviors and churn risk.
Imbalanced Data Handling: Certain algorithms, like Random Forest and Gradient Boosting,
can effectively manage imbalanced datasets.
Scalability: Classification models efficiently handle larger datasets, making them practical for
churn analysis.
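A minimal initialization of the classifiers discussed in this section might look as follows; the hyperparameters shown are illustrative defaults, not the project's tuned settings:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# Candidate classifiers for churn prediction (1 = churned, 0 = retained).
# probability=True lets SVC produce the scores needed for AUC-ROC later.
classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree Classifier": DecisionTreeClassifier(random_state=0),
    "Random Forest Classifier": RandomForestClassifier(random_state=0),
    "Gradient Boosting Classifier": GradientBoostingClassifier(random_state=0),
    "Support Vector Classifier": SVC(probability=True, random_state=0),
}
```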
5. Models Evaluation
1. Model Training: Each model is fitted using the training dataset (x_train, y_train).
2. Prediction: The model generates predictions (y_pred) based on the test dataset
(x_test).
3. Performance metrics:
Accuracy Score: Indicates the proportion of correctly predicted instances among the total
instances.
Precision: Indicates the proportion of predicted churners who actually churned.
Recall: Indicates the ability of the model to identify all relevant instances (sensitivity).
F1 Score: The harmonic mean of precision and recall, providing a balance between the two.
AUC-ROC Score: Represents the model's ability to discriminate between classes across all
thresholds.
These metrics help us understand the accuracy and reliability of each model, highlighting any
issues with overfitting or underfitting.
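The evaluation routine for a single classifier can be sketched as follows. The function name and the synthetic churn-like dataset are assumptions for illustration; only the metric set mirrors the report:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

def evaluate_classifier(model, x_train, x_test, y_train, y_test):
    """Fit one classifier and report the metrics discussed above."""
    model.fit(x_train, y_train)                  # 1. model training
    y_pred = model.predict(x_test)               # 2. prediction
    y_proba = model.predict_proba(x_test)[:, 1]  # class-1 scores for AUC-ROC
    return {                                     # 3. performance metrics
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "auc_roc": roc_auc_score(y_test, y_proba),
    }

# Synthetic, mildly imbalanced data (stand-in for the real customer dataset).
x, y = make_classification(n_samples=600, n_features=5, weights=[0.85], random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
metrics = evaluate_classifier(LogisticRegression(max_iter=1000),
                              x_train, x_test, y_train, y_test)
```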
Model               | Test Accuracy | Train Accuracy | Precision | Recall | F1 Score | AUC-ROC Score
Logistic Regression | 0.92          | 0.94           | 0.91      | 0.93   | 0.92     | 0.95
Model Comparison:
Logistic Regression: This model achieved a notable accuracy of 0.92, demonstrating strong
predictive power for identifying customer churn. The balance between precision and recall
shows that it handles both false positives and false negatives well, making it a solid and
dependable model for churn prediction.
Decision Tree Classifier: With an accuracy of 0.88, the Decision Tree model offers reasonable
performance but is outperformed by other models in terms of precision and recall. This
suggests that the model might be prone to misclassifying some cases, particularly in terms of
false positives and false negatives, which limits its effectiveness in our specific use case.
Random Forest Classifier: The Random Forest model was the top performer in this analysis,
with an impressive accuracy of 0.93. It also achieved high precision and recall values,
demonstrating superior reliability in identifying churn cases. Its ability to prevent overfitting
makes it an ideal candidate for churn prediction, especially in cases involving complex data
patterns.
Gradient Boosting Classifier: This model showed balanced results, achieving an accuracy of
0.91. While it performs reasonably well, it did not surpass the Random Forest Classifier in
terms of accuracy or other key metrics, such as precision and recall, making it a slightly less
attractive choice.
Support Vector Classifier (SVC): The SVC model performed adequately, with an accuracy of
0.90. Although it demonstrated strong classification capabilities, it lagged behind the top-
performing models, particularly in terms of precision and recall. Thus, it may not be as
suitable for this specific churn prediction problem.
Hyperparameter Tuning
To further enhance model performance, hyperparameter tuning was applied across multiple
models. We defined hyperparameter grids for key models such as Decision Tree, Random
Forest, K-Nearest Neighbors (KNN), and Support Vector Classifier (SVC), focusing on critical
parameters:
Maximum Depth: Controls the maximum depth of the decision tree, determining
how deep the tree can grow before halting. This parameter prevents overfitting by
limiting the complexity of the model.
Number of Estimators: Refers to the number of trees or models used in ensemble
methods like Random Forest and Gradient Boosting. Increasing the number of
estimators enhances model accuracy, but it also increases computational cost.
Number of Neighbors (KNN): This parameter defines how many neighbors should be
considered when classifying a new data point. A larger value of "K" smooths decision
boundaries but may decrease model sensitivity.
Given the computational cost of an exhaustive grid search for the Support Vector Classifier,
we opted for RandomizedSearchCV to efficiently explore a wider range of hyperparameters.
This approach enabled us to identify the best-performing SVC parameters without requiring
an exhaustive search, reducing processing time while maintaining performance.
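The tuning strategy can be sketched as follows. The parameter grids and distributions shown are illustrative, not the exact ranges used in the project:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Small synthetic dataset so the searches run quickly for demonstration.
x, y = make_classification(n_samples=300, n_features=5, random_state=0)

# An exhaustive grid search is affordable for KNN's small parameter grid.
knn_grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 4, 5, 7, 9]},
    cv=5,
)
knn_grid.fit(x, y)

# For SVC, RandomizedSearchCV samples a fixed number of candidates from
# wide distributions instead of trying every combination, trading an
# exhaustive search for much lower processing time.
svc_search = RandomizedSearchCV(
    SVC(),
    param_distributions={"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-4, 1e0)},
    n_iter=10,
    cv=3,
    random_state=0,
)
svc_search.fit(x, y)
```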
The churn modeling analysis identified the K-Nearest Neighbors (KNN) Classifier with K=4 as
the optimal model for predicting customer churn, achieving an accuracy of 0.90. While other
models like Logistic Regression performed well, KNN effectively captured churn instances
with a strong balance between recall and F1-scores.
In addition to the predictive models, we engineered RFM (Recency, Frequency, Monetary)
features to characterize customer engagement:
Recency: This metric measures how recently a customer interacted with the service.
Typically, customers with more recent engagements are less likely to churn, making it
a key factor for analyzing churn risk.
Frequency: This metric captures how often a customer uses the service over a
specific time period. Higher interaction frequency is often associated with greater
customer loyalty and reduced churn probability.
Monetary: Represents the total spending of a customer on services. High-value
customers might exhibit different churn behaviors, making this a crucial aspect of
customer segmentation.
Though we did not use the RFM score directly in our predictive models, these features play
an essential role in enriching visualizations and dashboards in Power BI. They provide a
clearer picture of customer engagement patterns and behavior, which aids in identifying
trends that may not be captured by the model alone. For instance, through RFM analysis,
stakeholders can segment customers based on engagement and spending, leading to more
tailored retention strategies.
Thus, while the RFM score was not part of the predictive modeling process, the RFM
features significantly enhanced our ability to deliver actionable insights via Power BI,
improving business decision-making and customer retention strategies based on a deeper
understanding of customer engagement.
5.4 Conclusion
In this chapter, we highlighted the significance of Customer Lifetime Value (CLV) modeling,
churn prediction, and the creation of RFM (Recency, Frequency, Monetary) features at
Tunisie Telecom. The CLV model provides insights into the financial value of customers,
guiding marketing investments and acquisition strategies. The churn model effectively
identifies at-risk customers, enabling targeted interventions to reduce attrition.
Additionally, the RFM features facilitate effective customer segmentation, allowing for
tailored marketing strategies that address specific needs. Together, these models foster a
customer-centric approach, enhancing predictive accuracy and driving long-term value
creation for Tunisie Telecom.
Chapter 6: Deployment
6.1 Introduction
Deployment involves integrating the developed models into a usable application or system.
This chapter outlines the deployment process, including model extraction, web interface
development, and dashboard construction.
( MISSING SECTION )
The deployment process also included the creation of interactive dashboards to visualize
model outputs and key business metrics. These dashboards were essential for stakeholders
to gain actionable insights from the predictive models. The construction of these dashboards
utilized tools like Power BI and Dash by Plotly, allowing for dynamic and user-friendly
interfaces.
Client Behavior Dashboard:
Client Segmentation Filter: A filter at the top allows users to analyze data by
different client segments—high, medium, and low value.
Call Activity Cards: These cards display call volumes across various times, including
day calls, night calls, and international calls.
Variance Line Chart: This chart illustrates variations in call activity, correlating them
with the number of days subscribed, helping to identify trends in client behavior over
time.
Voice Message Activity (Pie Chart): This chart shows that 54.26% of clients are
inactive regarding voice messages, indicating potential areas for engagement.
Client Complaints Funnel: A funnel visualization tracks client counts based on the
number of complaints (nb_reclamation), highlighting customer satisfaction across
segments.
By utilizing the client segmentation filter, stakeholders gain deeper insights into behavioral
patterns, informing strategies to enhance customer experience and retention.
The Client Lifetime Value (CLV) Analysis Dashboard provides a detailed examination of
customer profitability and engagement. By utilizing various metrics and visualizations, this
dashboard supports data-driven decision-making, allowing stakeholders to better
understand the factors influencing customer value and retention.
Client Segmentation Filter: This filter allows users to select different client segments—high, medium, and low value—
tailoring the analysis to focus on specific groups. This segmentation is essential for targeted
strategies that address the unique characteristics of each group.
Total Profit: Displays the overall profit generated from clients, providing a snapshot of
financial performance.
Average Days Subscribed: Indicates the average duration clients have been subscribed,
helping to assess client loyalty and engagement.
Churn Count by Segment: Shows the number of churned clients within each segment,
highlighting potential areas of concern.
Customer Count by Segment: Provides the total number of clients in each segment, offering
context for the profitability metrics.
Average Customer Lifetime Value (CLV): Highlights the average CLV, derived from total
revenue calculations, serving as a key indicator of customer profitability.
This chart visualizes the count of days subscribed (nb_jours_abonne) segmented by the
Frequency Score. The Frequency Score, calculated using DAX, assesses how often clients
interact with the service, with higher scores indicating more frequent usage. This
visualization helps identify the relationship between subscription length and client
engagement, revealing patterns that can inform retention strategies.
This chart shows the sum of total revenue (revenue_total_client) based on the Recency
Score. Clients with recency scores between 4 and 5 exhibit the highest revenue, indicating
that recently engaged clients tend to generate more revenue. This insight suggests that
enhancing engagement efforts for these clients could further increase profitability.
Average Frequency: Reflects how often clients engage with the service.
Average Revenue: Represents the average revenue generated per client.
RFM Score: A composite score calculated using DAX that incorporates recency,
frequency, and monetary values to categorize clients based on their overall value to
the business.
The scatter chart enables a visual analysis of customer segments, revealing trends and
outliers in behavior. For instance, high-value clients often show higher frequency and
revenue metrics, while low-value clients may cluster at the lower end of these dimensions.
The Churn Analysis Dashboard provides critical insights into customer churn patterns. Key
components include:
Churn Percentage (Pie Chart): This chart illustrates that 14.14% of clients have
churned, offering a quick overview of the overall churn rate.
Churn by Complaints (Clustered Column Chart): Highlights the correlation between
the number of complaints and churn likelihood, showing that clients with 1 to 5
complaints are more likely to leave.
Churn by Client Segment (Line Chart): Reveals that high-value clients experience the
highest churn rates, followed by medium- and low-value clients, indicating where
retention efforts should be concentrated.
Churn by Revenue (Stacked Bar Chart): Shows that clients generating higher revenue
tend to have higher churn rates, suggesting that unmet expectations contribute to
client loss.
Key Metrics (KPI Cards): These cards display a summary of essential statistics,
including total clients (5000), total churned clients (707), and total non-churned
clients (4293).
DAX Implementation
To enhance the dashboards, several DAX measures were created to facilitate a detailed
analysis of customer behavior, revenue generation, and churn patterns. Below is an
overview of these measures, along with explanations for their significance:
1. Frequency_Score
This column assigns a score to each customer based on their usage frequency. The scoring is
as follows:
This column helps identify customer engagement levels, which can inform retention
strategies.
6. Monetary_Score
This column evaluates customers based on their monetary contribution, assigning scores as
follows:
7. Recency_Score
This column categorizes customers based on how recently they have engaged with the
service:
5: Customers who engaged within the last 30 days are highly active.
4: Customers engaged within 31 to 60 days are fairly active.
3: Customers engaged within 61 to 90 days are moderately active.
2: Customers engaged within 91 to 120 days are less active.
1: Customers who haven't engaged for more than 120 days are inactive.
This score helps to identify how engagement timing impacts customer value and retention.
8. Total_RFM_Score
This column aggregates the individual scores from Frequency, Monetary, and Recency to
create a composite RFM score. Higher scores indicate more valuable customers, enabling
better segmentation for marketing efforts.
9. Customer_Segment Column
A customer Segment column was created to classify customers based on their Total RFM
Score.
This measure classifies customers into segments based on their Total RFM Score. Customers
with scores of 12 or higher are categorized as "High Value," those scoring between 10 and 11
as "Medium Value," and scores below 10 as "Low Value."
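Given the thresholds stated above, the Recency_Score and Customer_Segment calculated columns can be sketched in DAX roughly as follows; the table name (Clients) and the engagement column name are illustrative assumptions:

```dax
Recency_Score =
SWITCH (
    TRUE (),
    Clients[days_since_last_engagement] <= 30, 5,
    Clients[days_since_last_engagement] <= 60, 4,
    Clients[days_since_last_engagement] <= 90, 3,
    Clients[days_since_last_engagement] <= 120, 2,
    1
)

Customer_Segment =
SWITCH (
    TRUE (),
    Clients[Total_RFM_Score] >= 12, "High Value",
    Clients[Total_RFM_Score] >= 10, "Medium Value",
    "Low Value"
)
```

SWITCH(TRUE(), ...) is the idiomatic DAX pattern for cascading threshold checks, evaluating conditions top to bottom and returning the first match.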
10. Average_CLV
This measure calculates the average Customer Lifetime Value (CLV) by averaging the total
revenue from customer activities, providing insights into typical revenue generated per
customer.
11. Total Client Revenue
This measure sums the total revenue generated by all clients, offering a comprehensive view
of revenue performance across the customer base.
12. Customer_Count_By_Segment
This measure counts the distinct number of clients in each segment, filtering the RFM
dimension to provide client counts specific to the selected Customer Segment.
6.5 Conclusion
General Conclusion
This report encapsulates the results of our final year project, which aimed to analyze
customer behavior and evaluate customer value at Tunisie Télécom as part of our Master’s
degree in Business Analytics from ESB. Over the course of this project, we applied advanced
data analysis and machine learning techniques to address two critical aspects of customer
relationship management: predicting customer churn and estimating Customer Lifetime
Value (CLV).
Tunisie Télécom, with its extensive customer base of over four million subscribers, faces
significant challenges related to customer retention and revenue optimization. To tackle
these challenges, we developed predictive models using machine learning algorithms to
identify customers likely to churn and to estimate their CLV. These models were integrated
into a web application designed to facilitate predictions and provide insights into customer
value, enhancing the decision-making capabilities of Tunisie Télécom's agents. The
application also features a BI report integration, allowing for the visualization of historical
customer data and further enriching the decision support system.
Our project followed the CRISP-DM methodology, ensuring a structured approach to data
analysis, from understanding the business problem to deploying the final solution. The key
phases included data exploration, preprocessing, modeling, evaluation, and deployment.
This methodology provided a solid framework for addressing the complexities of customer
behavior and value analysis in the telecommunications sector.
The internship was a valuable experience, providing practical insights into the application of
machine learning in a real-world business context. We successfully implemented various
algorithms, including Logistic Regression, K-Nearest Neighbors, XGBoost, Random Forest,
Decision Tree, and Naive Bayes, to develop robust models for both churn prediction and CLV
estimation. Our solution represents a significant step towards a more data-driven approach
to customer management at Tunisie Télécom.
Looking ahead, there is potential for further enhancement of our solution. Future work could
involve integrating natural language processing techniques to analyze customer interactions
and categorize complaints, which would provide a deeper understanding of customer issues
and further refine retention strategies.
In summary, this project not only advanced our technical skills but also contributed valuable
insights to Tunisie Télécom's customer management practices. By leveraging machine
learning and data analysis, we provided actionable tools to improve customer retention and
maximize CLV, addressing key challenges in the competitive telecommunications landscape.
Bibliography