
Private Higher School of Management of Tunis

Esprit School of Business

INTERNSHIP REPORT
For the completion of the Master's degree in

Business Analytics

Presented on: … October 2024

Prepared by: Balsam Bendhif

Title: "Analysis of Customer Behavior and Evaluation of Customer Value at Tunisie Telecom"

01/05/2024 – 31/08/2024

Internship Supervisor: Hatem Jemai

Academic Advisor: Tarek Hamrouni


Academic Year 2023/2024

Role | Name & Surname | Date and Signature
Internship Supervisor | Hatem Jemai |
Academic Advisor | Tarek Hamrouni |
Dedications

To my beloved parents
I owe everything I am today to your love, patience, and countless sacrifices. May this humble
work serve as a small token of my gratitude and recognition for all the incredible things you
have done for me. May God, the Almighty, bless you with health and long life, so that I may,
in turn, bring you happiness and fulfillment.

To my dearest friends
In honor of the genuine friendship that binds us and the wonderful moments we've shared, I
dedicate this work to you. I wish you a bright future filled with endless possibilities and great
promises.

Acknowledgments

First and foremost, I extend my deepest gratitude to the entire pedagogical team at ESB for
their unwavering guidance and support throughout my academic journey. I am also
immensely grateful to the internship team at Tunisie Télécom for providing a profoundly
enriching and engaging experience during my four-month tenure.

A special thanks goes to my supervisor, Mr. Hatem Jemai, for swiftly integrating me into the
company, placing his trust in me, offering constant encouragement and constructive
feedback, and dedicating his time to address all my queries despite his demanding schedule.

Lastly, I extend my appreciation to the jury members for their willingness to evaluate this
work.
Résumé

L’objectif principal de ce projet de fin d’études est d’analyser le comportement des clients et
d’évaluer la valeur client chez Tunisie Télécom. Nous utiliserons des techniques d’analyse de
données et d’apprentissage automatique pour comprendre les comportements des clients et
développer un modèle de prévision du churn. Par ailleurs, nous allons évaluer la valeur vie
client (CLV) pour estimer la contribution financière de chaque client à l'entreprise sur une
période de temps donnée. Nous intégrerons ces modèles dans une application web qui
permettra aux agents de Tunisie Télécom de réaliser facilement des prédictions et d’évaluer
la CLV. Cette application comprendra également une liaison avec le rapport BI, ce qui
permettra de visualiser les données historiques des clients. Pour mener à bien ce projet,
nous avons suivi la méthodologie de projet CRISP-DM.

Mots clés : comportement des clients, valeur vie client, apprentissage automatique,
prédiction du churn, CRISP-DM, Python

Abstract

The main objective of this final year project is to analyze customer behavior and evaluate
customer value at Tunisie Télécom. We will use data analysis and machine learning
techniques to understand customer behaviors and develop a churn prediction model.
Additionally, we will evaluate the Customer Lifetime Value (CLV) to estimate the financial
contribution of each customer to the company over a given period. These models will be
integrated into a web application that will assist Tunisie Télécom agents in making
predictions and evaluating CLV easily. This application will also include a connection to the BI
report, allowing visualization of historical customer data. To successfully carry out this
project, we followed the CRISP-DM project methodology.

Keywords: customer behavior, customer lifetime value, machine learning, churn prediction,
CRISP-DM, Python
Table of Contents

Chapter 1: Project Study


1.1 Introduction
1.2 Presentation of the Host Organization
1.2.1 Presentation of Tunisie Télécom
1.3 Project Presentation
1.3.1 General Framework of the Project
1.4 Project Management Methodology Adopted
1.5 Project Architecture
1.6 Software Work Environment
1.7 Conclusion

Chapter 2: Understanding the Business Problem and Data


2.1 Introduction
2.2 Theoretical Concepts Related to the Project
2.2.1 Churn
2.2.2 Customer Lifetime Value (CLV)
2.2.3 Machine Learning
2.3 Challenges in Machine Learning Modeling
2.3.1 Overfitting
2.3.2 Underfitting
2.4 Database Selection
2.5 Data Exploration
2.6 Data Analysis
2.6.1 Univariate Analysis
2.6.2 Bivariate Analysis
2.6.3 Additional Analyses
2.7 Conclusion

Chapter 3: Data Preprocessing


3.1 Introduction
3.2 Data Cleaning
3.3 Data Encoding
3.4 Feature Engineering
3.5 Feature Selection
3.6 Correlation Matrix
3.7 Conclusion

Chapter 4: Customer Lifetime Value (CLV) and Churn Modeling


4.1 Introduction
4.2 Data Preparation
4.2.1 For CLV Modeling
4.2.2 For Churn Modeling
4.3 Modeling Techniques
4.3.1 Logistic Regression
4.3.2 K-Nearest Neighbors (KNN)
4.3.3 Gradient Boosting Regressor
4.3.4 Random Forest
4.3.5 Linear Regression
4.3.6 Ridge Regression
4.3.7 Support Vector Regression (SVR)
4.3.8 AdaBoost Classifier
4.4 Cross-Validation
4.5 Steps to Building a Model
4.6 Performance Metrics
4.6.1 Confusion Matrix
4.6.2 ROC-AUC Curve
4.6.3 Learning Curve
4.7 Hyperparameter Tuning
4.8 Conclusion

Chapter 5: Model Evaluation and Optimization


5.1 Introduction
5.2 Evaluation of Each Model
5.2.1 For CLV Modeling
1. Outlier Consideration
2. Feature Selection
3. Train-Test Split
4. Model Initialization
5. Model Evaluation
6. Choosing the Most Fitting CLV Model
7. Conclusion for the CLV Model
5.2.2 For Churn Modeling
1. Outlier Consideration
2. Feature Selection
3. Train-Test Split: StratifiedKFold Setup
4. Model Initialization
5. Model Evaluation
6. Model Comparison
5.3 Comparison and Evaluation of Used Algorithms
5.4 Model Adjustment
5.5 Conclusion

Chapter 6: Deployment
6.1 Introduction
6.2 Deployment Process
6.2.1 Model Extraction
6.2.2 Development of the Web Interface
6.2.3 Application Interfaces
6.3 Dashboard Construction
6.4 DAX Implementation
6.5 Gantt Chart
6.6 Conclusion
General Conclusion

List of Figures
List of Tables
General Introduction

In today's rapidly evolving telecommunications sector, characterized by intense competition


and technological advancements, customer retention and maximizing Customer Lifetime
Value (CLV) have become paramount. Telecom operators like Tunisie Telecom face the dual
challenge of minimizing churn, where customers switch to competitors or terminate services, and strategically nurturing valuable customer relationships.

The concept of churn, or customer attrition, represents a significant concern for telecom
companies. It not only impacts immediate revenue streams but also threatens long-term
profitability by eroding customer loyalty and reducing CLV potential. Conversely, focusing on
CLV involves identifying and nurturing valuable customer segments that contribute the most
to long-term revenue and profitability.

Our project, titled "Analysis of Customer Behavior and Evaluation of Customer Value at
Tunisie Telecom," aims to address these challenges through a comprehensive data-driven
approach. By integrating both churn analysis and CLV prediction, we seek to achieve the
following objectives:

1. Churn Analysis: Understanding the factors and behaviors that precede customer
churn is crucial. Through advanced analytics and machine learning techniques, we
aim to identify predictive indicators of churn within Tunisie Telecom's customer base.
This analysis will enable proactive measures to mitigate churn risks and enhance
customer retention strategies.
2. CLV Prediction: Predicting and optimizing Customer Lifetime Value involves not only
retaining customers but also maximizing their value over their entire lifecycle. By
developing models that forecast CLV, we can segment customers based on their
potential profitability and tailor strategies to nurture high-value segments. This
approach ensures that resources are effectively allocated to retain and grow the
most profitable customer relationships.

By adopting the CRISP-DM methodology, our project unfolds as a structured exploration into
data science techniques tailored to the telecom industry's specific challenges. Through a
rigorous analysis of historical data and the implementation of predictive models, we aim to
empower Tunisie Telecom with actionable insights to improve customer satisfaction, reduce
churn, and ultimately maximize CLV.

This report is structured into six chapters, each addressing a key stage of our project journey:

 Project Study: providing an overview of the project's framework, the organizational context, and the methodologies adopted.
 Understanding the Business Problem and Data: exploring and analyzing the dataset to uncover insights crucial for churn and CLV analysis.
 Data Preprocessing: detailing the steps taken to clean, transform, and prepare the data for modeling.
 Customer Lifetime Value (CLV) and Churn Modeling: exploring the models and performance metrics relevant to our project objectives.
 Model Evaluation and Optimization: assessing model performance and optimizing for accuracy.
 Deployment: deploying solutions that integrate with Tunisie Telecom's operational framework.

Through this comprehensive approach, we aim to equip Tunisie Telecom with the tools and
strategies needed to navigate the complexities of customer retention and CLV optimization
in a competitive telecommunications landscape.
Chapter 1: Project Study

1.1 Introduction

In this chapter, we begin by introducing the host organization, providing an overview of Tunisie Télécom and its various services. We then detail the specific problem statement,
followed by the proposed solution and an overview of the project architecture. Finally, we
delve into the project management methodology used and the software work environment.

1.2 Presentation of the Host Organization

1.2.1 Presentation of Tunisie Télécom

Tunisie Télécom is a Tunisian telecommunications company that offers fixed-line and mobile
telephony, internet, and data transmission services. It is one of the largest telecom
operators in the region, serving both individual and corporate clients. Established in 1995,
Tunisie Télécom has grown significantly and aims to enhance profitability while expanding its
international presence.

1.2.2 History

Originally established as the National Telecommunications Agency by Law No. 36 of April 17,
1995, Tunisie Télécom later transitioned into a public limited company under Decree No. 30
of April 5, 2004, under the name "Tunisie Télécom". In July 2006, 35% of its capital was
opened to the Emirati consortium "TeCom-DIG". The company's strategic goals include
maximizing profitability and solidifying its position as a leading international operator.

1.2.3 Organization

Tunisie Télécom comprises 24 regional directorates, 80 Actels and sales points, and over
13,000 private locations. The company employs more than 8,000 personnel across its
operations, which include six customer support centers for fixed-line, mobile, and data
services.
1.2.4 Presentation of the HACHED Complex

The HACHED Complex is one of Tunisie Télécom's major facilities, housing a wide range of
services:

 Digital Transmission Center (CTN)


 Subscriber Switching Center (CCA)
 Operating and Maintenance Centers (OMC) for national and regional radio networks
 Mobile Service Switching Center (MSC)
 Data Center
1.3 Project Presentation

1.3.1 General Framework of the Project

This project is undertaken as part of the final dissertation for the Master's program in
Business Analytics at Esprit School of Business. It is centered around developing predictive
models for customer churn and evaluating Customer Lifetime Value (CLV) at Tunisie
Télécom.

1.3.2 Problem Statement


The telecommunications industry is currently grappling with significant challenges related to
customer retention due to intense competition and rapidly changing consumer preferences.
Tunisie Télécom, a prominent player in this sector, faces dual challenges: minimizing
customer churn and maximizing the value derived from each customer relationship.
Customer churn, influenced by competitive pressures and technological advancements,
poses a persistent threat to revenue stability and market position. At the same time,
understanding and optimizing Customer Lifetime Value (CLV) is crucial for tailoring services
and strategies that enhance customer loyalty and profitability. In response to these
challenges, this project aims to leverage advanced data analytics and machine learning
techniques. By predicting customer churn accurately and estimating CLV effectively, Tunisie
Télécom can implement proactive retention strategies and personalized customer
management approaches. This initiative seeks to empower Tunisie Télécom in maintaining
customer satisfaction, improving operational efficiency, and sustaining growth amidst a
dynamic telecommunications landscape.
How can Tunisie Télécom effectively predict customer churn and estimate Customer
Lifetime Value (CLV) to implement targeted strategies for reducing churn and maximizing
customer value?
1.3.3 Proposed Solution

To tackle the challenges of customer churn and maximize Customer Lifetime Value (CLV),
this project proposes using advanced data analytics and machine learning techniques. The
solution involves developing models to estimate CLV by segmenting customers based on
their value to Tunisie Télécom, recommending tailored services, and optimizing resource
allocation. Additionally, machine learning algorithms will be employed to predict customer
churn accurately and identify key factors driving attrition, enabling targeted retention
strategies. To support these efforts, interactive Power BI dashboards will be created for
better understanding and explanation of results, and a web page will be developed to deploy
and share final insights, providing easy access to predictions and actionable data. This
comprehensive approach aims to improve customer retention, enhance satisfaction, and
maximize the value of customer relationships for Tunisie Télécom.

1.4 Project Management Methodology Adopted

In this section, we explore three project management methodologies for data exploration
and data science: CRISP-DM, SEMMA, and TDSP. Our objective is to analyze each method to
determine the most suitable approach for our project.

1.4.1 SEMMA Methodology

SEMMA (Sample, Explore, Modify, Model, Assess) is a data mining methodology developed
by SAS to address data analysis challenges through five key stages:

1. Sampling (Sample): Selecting relevant data for analysis and extracting a representative sample.
2. Exploring and Visualizing (Explore): Investigating data structure, detecting
anomalies, and identifying relationships between variables through visualization.
3. Modifying and Cleaning (Modify): Preparing and cleaning data for analysis.
4. Modeling (Model): Building machine learning models to predict or explain the
studied phenomena.
5. Assessing Results (Assess): Evaluating model performance and interpreting results
against predefined performance criteria.

1.4.2 TDSP Methodology


TDSP (Team Data Science Process) is an agile and iterative project management
methodology specifically designed by Microsoft for data science projects. It consists of five
phases:

1. Business Understanding: Understanding business challenges, identifying project objectives, success criteria, and involved stakeholders.
2. Data Acquisition and Understanding: Collecting, cleaning, and preparing data
necessary to meet business objectives.
3. Modeling: Developing and testing data models to address project objectives and
selecting the most effective model.
4. Deployment: Deploying models into an operational environment for use.
5. Acceptance: Measuring model deployment results to verify project success and
achievement of business objectives.

1.4.3 CRISP-DM Methodology

CRISP-DM (Cross-Industry Standard Process for Data Mining) is a widely adopted methodology for data mining, analysis, and data science projects. It consists of six phases:
1. Business Understanding: Identifying the problem the organization is trying to solve
based on data and establishing a well-defined architecture for project
implementation.
2. Data Understanding: Gathering initial information, understanding and describing the
type of data to be analyzed, and establishing links between data and their business
significance.
3. Data Preparation: Preparing data for analysis, including cleaning and transforming
data to make it compatible with algorithms.
4. Modeling: Selecting modeling techniques and constructing models after testing
multiple models. This phase includes model design, prototype construction, model
construction, and model evaluation.
5. Evaluation: Checking and validating models or knowledge acquired to ensure they
meet the objectives set at the beginning of the process. This phase also includes
making decisions to deploy or improve the model.
6. Deployment: The final phase involves deploying analyses for effective use, shaping
knowledge obtained from modeling, and integrating it into the decision-making
process.

1.4.4 Comparison of Methodologies

While CRISP-DM, SEMMA, and TDSP share the same objective, each methodology differs in
approach and flexibility. CRISP-DM was chosen for this project due to its iterative approach
and flexibility, allowing adaptation to changes throughout the project lifecycle.
Criteria | SEMMA | TDSP | CRISP-DM
Application Domain | Data mining and predictive analysis | Team data science projects | Data mining and data exploration
Phases | 5 phases | 5 phases | 6 phases
Iterative Approach | No | Yes | Yes
Flexibility | Low | Medium | High
Priorities | Focus on exploitation and modification | Focus on acquisition and preparation | Focus on understanding and preparation

1.5 Project Architecture

The project architecture will include detailed plans for data collection, preprocessing, model
development, and deployment, ensuring a systematic approach to achieving project goals.

The project architecture consists of eight key stages:

1. Data collection
2. Data understanding
3. Data preprocessing
4. Feature selection
5. Model building
6. Model evaluation
7. Decision-making for customer retention
8. Dashboard creation

1.6 Software Work Environment

The software work environment will leverage industry-standard tools and platforms for data
analysis, modeling, and visualization, facilitating robust analysis and efficient model
development.

The project utilizes various software tools and libraries:

Environment | Description
Anaconda | A software distribution and package management platform for the Python and R programming languages, specifically tailored for developing data science and machine learning applications.
Jupyter Notebook | An open-source computational notebook that allows for the creation and sharing of documents containing interactive code, data visualizations, and textual explanations.
Power BI | A Business Intelligence platform that enables users to collect, analyze, and visualize data from multiple sources, facilitating the creation of dashboards and data visualizations.
Visual Studio Code | A code editor used to write, modify, and debug code in various programming languages, offering a variety of features aimed at easing the software development process.

Table 1.4: Programming Languages Used

Language | Description
Python | A high-level, interpreted, object-oriented, and cross-platform programming language used in various fields such as data science, artificial intelligence, and data visualization.
1.7 Conclusion

In conclusion, this chapter has provided an overview of Tunisie Télécom, outlined the project
scope and objectives, and introduced the methodologies and environments that will guide
our approach. The subsequent chapters will delve into specific aspects such as data
understanding, preprocessing, modeling techniques, performance evaluation, and
deployment strategies.

Chapter 2: Understanding the Business Problem and Data

2.1 Introduction

This chapter provides an overview of the business problem and lays the groundwork for
understanding the data through theoretical concepts and analysis techniques.

2.2 Theoretical Concepts Related to the Project

This section delves into key theoretical concepts essential for the project:

2.2.1 Churn

Churn refers to the rate at which customers stop doing business with a company over a
specific period. It is crucial for telecommunications companies like Tunisie Télécom to
monitor churn rates to assess customer retention strategies effectively.

Types of Churn:

 Voluntary Churn: Customers actively choose to leave due to dissatisfaction with service, better offers from competitors, or changes in personal circumstances.
 Involuntary Churn: Customers are lost due to reasons beyond their control, such as
relocation or financial difficulties.
 Customer Behavior Churn: Occurs when customers reduce their engagement, which
may eventually lead to complete disengagement.

Metrics for Measuring Churn:

 Churn Rate: The percentage of customers who discontinue their service within a
given period. The churn rate can be calculated using the following formula:
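Churn Rate = (Number of customers lost during the period / Number of customers at the start of the period) × 100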

2.2.2 Customer Lifetime Value (CLV)

Customer Lifetime Value (CLV) estimates the total revenue a customer is expected to
generate over their entire relationship with the company. In this study, we calculate CLV
using an alternative approach based on direct revenue measurements rather than the
traditional Recency, Frequency, and Monetary (RFM) metrics.

CLV Calculation Approach:

CLV is determined as the total revenue generated by a customer over a specific period of
time, as indicated in our data. This approach focuses on the aggregated costs associated with
various types of customer interactions. The calculation is expressed as:

CLV = Total Revenue from Customer over a Certain Lifespan

Total Revenue Calculation:

To compute the total revenue, we aggregate the costs associated with different types of calls
and services used by the customer. The formula used in the code is:

revenue_total_client = cout_appel_jour + cout_appel_soiree + cout_appel_nuit + cout_appel_inter

where:

 cout_appel_jour: Cost of daytime calls
 cout_appel_soiree: Cost of evening calls
 cout_appel_nuit: Cost of nighttime calls
 cout_appel_inter: Cost of inter-network calls

This approach directly measures the revenue generated from different types of call activities
rather than relying on RFM metrics. By focusing on these revenue components, we can
assess the value of customers based on their overall spending patterns and interactions with
the company.

2.2.3 Machine Learning

Machine Learning (ML) is a branch of artificial intelligence that enables systems to learn and
improve from experience without being explicitly programmed. It plays a vital role in
predicting churn and CLV based on historical customer data.

Categories of Machine Learning:

Supervised Learning: Supervised learning involves training a model using labeled data. The
goal is to enable the algorithm to predict similar outcomes on new data by learning the
relationship between known inputs and outputs.
Unsupervised Learning: In unsupervised learning, the input data is not labeled, and there
are no predefined output variables. The algorithm uses clustering, unsupervised
classification, or anomaly detection techniques to uncover hidden relationships in the data.

2.3 Challenges in Machine Learning Modeling

When training a machine learning model, two major problems can arise: Overfitting and
Underfitting.

2.3.1 Overfitting

Overfitting occurs when a model learns too much noise or specific details from the training
data, resulting in poor performance on new data. The model may show good results on
training data but perform poorly on test data or new examples.

2.3.2 Underfitting

Underfitting occurs when a model is too simple to capture the underlying patterns in the
data, leading to poor predictive performance.
2.4 Database Selection

The database available for this research contains comprehensive information related to the
customers of Tunisie Télécom. It includes the history of national and international call
operations, voicemail messages, and customer complaints.

The table below provides an overview of the dataset, identifying the variables, their types,
and a brief description of their meanings:

Variable Name | Type | Description
Id_Client | object | Unique identifier for each customer
nb_jours_abonne | float64 | Number of days the customer has been subscribed to Tunisie Télécom
durée_appel_jour | float64 | Total duration of calls made during the day (in minutes)
nb_appel_jour | float64 | Total number of calls made during the day
cout_appel_jour | float64 | Total cost of calls made during the day
durée_appel_soirée | float64 | Total duration of calls made during the evening (in minutes)
nb_appel_soirée | float64 | Total number of calls made during the evening
cout_appel_soirée | float64 | Total cost of calls made during the evening
durée_appel_nuit | float64 | Total duration of calls made during the night (in minutes)
nb_appel_nuit | float64 | Total number of calls made during the night
cout_appel_nuit | float64 | Total cost of calls made during the night
durée_appel_inter | float64 | Total duration of international calls (in minutes)
nb_appel_inter | float64 | Total number of international calls
cout_appel_inter | float64 | Total cost of international calls
active_msg_vocaux | object | Indicates whether the customer has activated voicemail (Yes/No)
nb_msg_vocaux | float64 | Number of voicemail messages received by the customer
nb_reclamation | float64 | Number of complaints filed by the customer
Churn | object | Indicates whether the customer has churned (True) or is loyal (False)

This detailed dataset provides a robust foundation for conducting thorough data analysis
and developing predictive models for churn and Customer Lifetime Value (CLV).
The database contains 18 variables in total. These variables encompass various aspects of customer interaction and behavior with Tunisie Télécom, including call usage patterns, costs, service activations, and churn status.

2.5 Data Exploration

We will use the "read_csv" function from the Pandas library to import data from a CSV file
and store it in a DataFrame.

Exploration of data is a crucial step in data analysis that involves examining and interpreting
data within its context. This step typically includes tasks such as analyzing descriptive
statistics and visualizing data.

data.shape: Returns the number of rows and columns in the DataFrame.

data.head(): Displays the first few rows of a DataFrame. By default, it returns the first 5 rows.

data.describe(): Displays descriptive statistics of a DataFrame, including count, mean, standard deviation, minimum and maximum values, and quartiles for each column.

data.info(): Provides a summary of the DataFrame, including column names and types, number of non-null values, and memory usage information.
A pie chart was created to show the proportion of each data type in the dataset. This helps
in understanding the dataset's structure and guides preprocessing steps.
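A hedged sketch of these exploration steps; the CSV file name is an assumption:

```python
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("tunisie_telecom_clients.csv")  # assumed file name

print(data.shape)       # (number of rows, number of columns)
print(data.head())      # first 5 rows
print(data.describe())  # count, mean, std, min, quartiles, max per column
data.info()             # column names, dtypes, non-null counts, memory usage

# Pie chart of the proportion of each data type in the dataset
data.dtypes.value_counts().plot.pie(autopct="%1.1f%%")
plt.show()
```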

2.6 Data Analysis

2.6.1 Univariate Analysis

Univariate analysis is a statistical analysis method that explores one variable at a time. It
helps summarize and visualize the characteristics of the variable under study. This analysis
improves our understanding of the data. The distribution of customers based on their churn
status is depicted in the figure.

After analyzing the customer churn rate, we observed that the majority of customers remain active (85.9%), while 14.1% of customers have churned. This class imbalance is significant and could negatively affect the final models if it is not taken into account.
The bar plot reveals that the majority of customers are concentrated within the [50, 150]
days subscription period, indicating robust retention during this timeframe. However, fewer
customers are observed in longer-term categories ([150, 250] days), suggesting potential
churn risks in these segments. This insight emphasizes the need for targeted retention
strategies to maintain customer satisfaction and loyalty across all subscription periods
effectively.

2.6.2 Bivariate Analysis

Bivariate analysis is a statistical method that examines the relationship between two
variables by comparing or associating them. This approach helps to understand how the
variables are related and how they interact with each other.
The figure shows that churn rate increases significantly when complaints exceed 5.
Customers with more than 5 complaints are more likely to churn, indicating a strong link
between higher complaint volumes and increased attrition.

Based on the figure above, we can conclude that customers who do not use voicemail have a
higher risk of churn.

The swarm plot offers a clear visual representation of the relationship between churn status
and subscription duration. This visualization helps in understanding customer retention and
identifying patterns that may contribute to churn. If the dots for the churned group tend to
cluster around lower values on the y-axis, it indicates that customers who churn usually have
shorter subscription durations.
The box plot indicates that churned customers have a lower median and narrower
interquartile range for subscription durations, suggesting they typically have shorter
subscriptions. This reinforces the trend that shorter subscription periods are linked to higher
churn rates.

2.6.3 Additional Analyses

To gain deeper insights into customer behavior and revenue generation, we conducted
additional analyses using scatter plots and bar plots.

 Bar Plot of Customer Complaints (nb_reclamation):

Displaying the frequency of customer complaints provides insight into grievance distribution,
which is essential for assessing churn risks. An increased complaint frequency may indicate
customer dissatisfaction, highlighting the need for enhanced service delivery. Analyzing this
data allows us to develop strategies to address concerns, improve satisfaction, and reduce
churn rates.

 Scatter Plots of Various Features vs. Total Revenue:

Created to examine the relationships between key features in the dataset and the total
revenue generated by each customer. These visualizations help identify trends and
correlations among customer interactions, such as call frequency and duration, enabling us
to pinpoint features linked to higher revenue and informing targeted marketing strategies.

2.7 Conclusion

This chapter summarized the findings from data exploration and analysis, highlighting key insights and preparing for the next steps in data preparation and model development.
Chapter 3: Data Preprocessing

3.1 Introduction

Data preprocessing is a critical step in data analysis, particularly when preparing data for
predictive modeling. This chapter discusses the techniques used to clean, encode, normalize,
and select features from the dataset to ensure the quality and effectiveness of the machine
learning models. The primary goal is to enhance data quality and suitability for analysis,
ultimately improving model performance.

3.2 Data Cleaning

Data cleaning is essential for ensuring accurate and reliable analysis. By removing errors and
inconsistencies, we improve data quality, which leads to more relevant results and better
decision-making. The following tasks were performed:

 Column Removal: Columns that were deemed unnecessary were dropped from the
dataset.
 Handling Missing Values: Missing values in columns were addressed using
appropriate methods.

3.2.1 Column Removal

Columns that were not relevant for a given analysis, such as "churn" and "active_msg_vocaux" when building the CLV feature set, were removed.
3.2.2 Handling Missing Values

Missing values were addressed using various techniques based on the nature of the data.

1. Replacing Missing Values with Most Frequent Value

Missing values in each column were replaced with the most frequent value in that column.
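A minimal sketch of this imputation, assuming the DataFrame is named df as in the earlier sketches:

```python
# Replace each column's missing values with that column's most frequent
# value (its mode).
for col in df.columns:
    most_frequent = df[col].mode()[0]
    df[col] = df[col].fillna(most_frequent)
```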

3.3 Data Encoding

Data encoding transforms categorical variables into numerical format suitable for machine
learning algorithms.

 Label Encoding for Categorical Variables: The 'churn' column was encoded into
numerical values to facilitate churn prediction.
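For illustration, the encoding step might look as follows with scikit-learn; the exact column label is an assumption:

```python
from sklearn.preprocessing import LabelEncoder

# Encode churn labels (False/True) into 0/1 for the classifiers.
encoder = LabelEncoder()
df["churn"] = encoder.fit_transform(df["churn"])
```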

3.4 Feature Engineering

Feature engineering involves creating new features or modifying existing ones to enhance
model performance.

 Creation of the "Total Revenue" Feature

A new feature, "revenue_total_client," was created by summing the various cost columns to capture the total revenue generated by a customer.

3.5 Feature Selection

Feature selection involves identifying the most relevant features for the model to improve
performance and reduce dimensionality.

 Variable Importance

Variable importance was assessed using feature selection techniques to determine which
features are most significant for the model.
 For Revenue Prediction: Features related to call durations and customer interactions
were selected.
 For Churn Prediction: Features such as "nb_reclamation,"
"durée_appel_jour(minutes)," "nb_jours_abonne," and others were selected.

3.6 Correlation Matrix

A correlation matrix was created to identify relationships between features, which helps in
understanding feature redundancy.
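A sketch of the correlation matrix visualization, assuming seaborn is available:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix over the numeric columns, drawn as a heatmap.
corr = df.select_dtypes(include="number").corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation matrix of the preprocessed features")
plt.show()
```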

3.7 Conclusion

This chapter has detailed the essential preprocessing steps to prepare the dataset for
machine learning. We addressed missing values, encoded categorical variables to enhance
data quality. Feature engineering and selection were performed to refine the dataset and
focus on the most relevant variables. These preprocessing efforts are crucial for building
accurate and effective predictive models.
Chapter 4: Customer Lifetime Value (CLV) and Churn Modeling

4.1 Introduction

After completing the data cleaning and preparation phases, we move to the modeling stage.
This chapter aims to develop and implement machine learning models to predict Customer
Lifetime Value (CLV) and churn rates.

4.2 Data Preparation

4.2.1 For CLV Modeling

To prepare for Customer Lifetime Value (CLV) modeling, the following steps were
performed:

1. Extraction of Predictive and Target Variables

 Predictive Variables: durée_appel_jour(minutes), durée_appel_soirée(minutes), durée_appel_nuit(minutes), durée_appel_inter(minutes), nb_msg_vocaux, nb_reclamation
 Target Variable: Total revenue (revenue_total_client)

This code selects the relevant columns from the DataFrame df to be used as features (X) and the target variable (Y). The target variable Y is reshaped into a 2D array to be compatible with the regression model.
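A hedged reconstruction of that selection step; the column labels are assumptions based on the dataset table:

```python
features = [
    "durée_appel_jour(minutes)", "durée_appel_soirée(minutes)",
    "durée_appel_nuit(minutes)", "durée_appel_inter(minutes)",
    "nb_msg_vocaux", "nb_reclamation",
]
X = df[features].values
# Reshape the target into a 2D array for compatibility with the regressors.
Y = df["revenue_total_client"].values.reshape(-1, 1)
```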

2. Splitting Data into Training and Test Sets

 To evaluate the model's performance, the dataset is divided into training and testing
sets.

 The train_test_split function is used to randomly split the data, with 20% allocated to the test set and 80% to the training set. The random_state parameter ensures reproducibility of the split.

This process ensures that the CLV modeling is based on well-defined features and a clearly
specified target variable, with data appropriately partitioned for training and evaluation.
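A minimal sketch of this split:

```python
from sklearn.model_selection import train_test_split

# 80/20 train/test split with a fixed seed for reproducibility.
x_train, x_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)
```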

4.2.2 For Churn Modeling

To prepare for churn modeling, the following steps were undertaken:


1. Extraction of Predictive and Target Variables

Predictive Variables are the independent variables used to predict churn. For this modeling,
the selected predictive variables are:

 Predictive Variables: nb_jours_abonne, nb_appel_jour, cout_appel_jour, durée_appel_soirée(minutes), nb_appel_soirée, cout_appel_soirée, nb_appel_nuit, cout_appel_nuit, durée_appel_inter(minutes), nb_appel_inter, cout_appel_inter, active_msg_vocaux, nb_msg_vocaux, nb_reclamation
 Target Variable: Churn

2. Training and Testing Data

Given the imbalanced nature of our churn dataset, we implement a stratified approach to
splitting the data. Instead of using the traditional train_test_split function from Scikit-learn,
we utilize StratifiedKFold for cross-validation.

StratifiedKFold Setup:

 n_splits: 5, dividing the dataset into five folds.
 shuffle: True, to randomize the data before splitting.
 random_state: 0, ensuring reproducibility.
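A sketch of this setup, assuming X and y hold the churn features and labels as NumPy arrays:

```python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_index, test_index in skf.split(X, y):
    # Each fold preserves the churn/non-churn class proportions.
    x_train, x_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # A model would be fitted and evaluated on each fold here.
```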

4.3 Modeling Techniques

Before implementing machine learning models, we will examine the principles of each
algorithm, considering their strengths and weaknesses, to determine their suitability for our
dataset. This section details the various supervised learning models, both classification and regression, used for our analysis and their application to CLV and churn prediction.

4.3.1 Logistic Regression

Logistic regression is a supervised learning algorithm used for binary or multi-class classification. It predicts the probability of a target variable based on input variables, establishing a linear relationship between them. This relationship is modeled using the logistic (sigmoid) function, which transforms a continuous value into a probability between 0 and 1. Logistic regression is suitable for classification problems where one wishes to determine the membership of an observation in one or more classes.
Advantages of Logistic Regression:

 Simplicity of use and interpretation.


 Robustness to missing values.
 Possible use of regularization to prevent overfitting.

Disadvantages of Logistic Regression:

 Difficulty modeling categorical variables with many categories.


 Difficulty handling complex interactions.
 Inability to model nonlinear relationships.

4.3.2 K-Nearest Neighbors (KNN)

KNN (K-Nearest Neighbors) is a supervised learning algorithm used for classification and
regression. It predicts the class or value of a new sample based on the closest samples in the
feature space.
Steps for KNN Functioning:

 Choose the number K of neighbors.
 Calculate the distance between the new sample and the existing samples.
 Choose the K nearest neighbors based on the calculated distances.
 Assign the new sample to the most frequently represented class among these K neighbors.

Advantages of KNN:

 Simple and easy to implement.


 Applicable to both classification and regression.
 Effective for small-sized datasets.

Disadvantages of KNN:

 Performance affected by outliers.


 Difficulty determining the distance calculation method and optimal number of
neighbors.
 Slow predictions for large datasets.
 Requires storing all training data in memory.

4.3.3 Gradient Boosting Regressor

Gradient Boosting Regressor is a general term for a type of boosting algorithm used for
regression tasks. It builds an ensemble of weak learners, typically decision trees, in a
sequential manner where each new tree aims to correct the errors made by the previous
trees. The key steps include:
 Building an initial model.
 Iteratively adding new models that correct the residuals of the combined previous
models.
 Using a learning rate to control the contribution of each new model to the final
prediction.

Advantages:

 Can handle complex relationships between features and target variables.


 Effective with a range of different base learners.

Disadvantages:

 Can be prone to overfitting if not tuned properly.


 Computationally expensive and may require a lot of memory.

4.3.4 Random Forest

Random Forest is an ensemble technique using bagging (Bootstrap Aggregating) to solve classification and regression problems. It builds multiple independent decision trees and combines their predictions to obtain a more robust global model.
Steps for Random Forest Functioning:

 Randomly select samples for each decision tree.


 Build decision trees on these samples.
 Aggregate predictions from the trees to obtain the final prediction (majority vote for
classification, average for regression).

Advantages of Random Forest:

 Variable importance evaluation.


 Reduced risk of overfitting.
 More accurate predictions compared to single decision trees.
 Effective handling of large datasets.

Disadvantages of Random Forest:

 Slower training time.


 More complex prediction process.
 More complex to interpret compared to decision trees.

1. Random Forest Classifier

Purpose: Used for categorical outcome predictions.

Functioning:
 Training: Trees are trained on bootstrapped samples with random subsets of
features.
 Prediction: Each tree votes for a class label; the class with the majority votes is
selected.

Advantages:

 Versatility: Handles both binary and multiclass classification problems.


 Robustness: Effective against noisy data and outliers.

Disadvantages:

 Training Time: Slower with a larger number of trees.


 Memory Usage: Requires more memory due to multiple trees.

2. Random Forest Regressor

Purpose: Used for predicting continuous values.

Functioning:

 Training: Trees are trained similarly to the classifier, using bootstrapped samples and
random subsets of features.
 Prediction: The output is the average of predictions from all decision trees.

Advantages:

 Accuracy: High accuracy and robustness for regression tasks.


 Overfitting Reduction: Combines predictions to minimize overfitting.
 Scalability: Efficient with large datasets.

Disadvantages:

 Training Time: Can be slow with many trees.


 Complexity: Less interpretable due to the aggregation of multiple trees.

4.3.5 Linear Regression

Linear Regression is a supervised learning algorithm used for predicting a continuous target
variable based on the linear relationship between the target and input variables. It
establishes this relationship by fitting a linear equation to the observed data.
Advantages of Linear Regression:

 Simple to implement and interpret.


 Computationally efficient.
 Provides a clear understanding of the relationship between variables.

Disadvantages of Linear Regression:

 Assumes a linear relationship between input variables and the target.


 Sensitive to outliers, which can affect model performance.
 Assumes homoscedasticity (constant variance of errors).

4.3.6 Ridge Regression

Ridge Regression is a type of linear regression that includes regularization to handle multicollinearity and improve model performance. It adds a penalty term to the loss function to constrain the size of the coefficients.
Advantages of Ridge Regression:

 Helps manage multicollinearity by regularizing coefficients.


 Reduces model variance and improves prediction accuracy.
 Can handle larger datasets better than plain linear regression.

Disadvantages of Ridge Regression:

 Does not perform feature selection (all features are retained).
 The choice of the regularization parameter (α) can impact model performance.
 Less interpretable than plain linear regression due to regularization.

4.3.7 Support Vector Regression (SVR)

Support Vector Regression (SVR) is an extension of Support Vector Machines (SVM) for
regression tasks. It aims to find a function that deviates from the actual target values by a
value no greater than a specified margin.

Advantages of SVR:

 Effective in high-dimensional spaces and for datasets where the number of dimensions exceeds the number of samples.
 Robust to outliers due to the margin of tolerance.
 Can model non-linear relationships using kernel functions.

Disadvantages of SVR:

 Computationally intensive, especially with large datasets.


 Choice of kernel and hyperparameters can be challenging.
 Less interpretable compared to linear models.

4.3.8 AdaBoost Classifier


AdaBoost Classifier (Adaptive Boosting) is an ensemble method that combines multiple
weak classifiers to create a strong classifier. It adjusts the weights of incorrectly classified
samples to improve the model’s performance.

Advantages of the AdaBoost Classifier:

 Boosts the performance of weak classifiers.


 Effective for improving the accuracy of simpler models.
 Less prone to overfitting compared to some other ensemble methods.

Disadvantages of the AdaBoost Classifier:

 Sensitive to noisy data and outliers.


 Requires careful tuning of hyperparameters.
 Can be computationally expensive for large datasets.

4.4 Cross-Validation

Cross-validation is a machine learning technique used to evaluate the performance of a learning model and estimate its ability to generalize to new examples. One of the advantages of cross-validation, particularly in the context of combating overfitting, is that it provides a more reliable estimate of the model's performance on unseen data.

The cross-validation procedure involves splitting the original dataset into two parts: a
training set and a validation set. The model is trained on the training set, and its
performance is evaluated on the validation set.

After identifying an overfitting issue with the Random Forest algorithm, we decided to adopt
cross-validation to improve the model's performance by reducing the risk of overfitting.
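A minimal sketch of this cross-validation, assuming X and y are the churn features and labels:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Five-fold cross-validated accuracy gives a more reliable estimate
# than a single train/test split.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```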

4.5 Steps to Building a Model


 Data Collection: Gather the necessary data.
 Data Preparation: Clean and process the data.
 Feature Selection: Identify relevant variables.
 Model Creation: Build the model.
 Model Training: Train the model using the training data.
 Prediction: Make predictions on new data.
 Evaluation: Assess model performance using appropriate metrics.

4.6 Performance Metrics

In this section, we discuss the evaluation metrics used to analyze the performance of
machine learning models effectively. For classification problems, we use several metrics,
including Accuracy, Precision, Recall, and F1 Score.

4.6.1 Confusion Matrix

The confusion matrix is a fundamental concept in classification performance evaluation. It is a table that compares actual data with model predictions, allowing us to measure prediction quality. The confusion matrix consists of four key elements:

 True Positives (TP): Instances where both actual and predicted values are positive.
 True Negatives (TN): Instances where both actual and predicted values are negative.
 False Positives (FP): Instances where the actual value is negative, but the predicted
value is positive.
 False Negatives (FN): Instances where the actual value is positive, but the predicted
value is negative.

4.6.1.1 Precision

Precision is a performance metric used in classification to evaluate the proportion of positive predictions made by a model that are actually positive. It is defined as the ratio of true positives to the sum of true positives and false positives.

4.6.1.2 Accuracy

Accuracy is a performance metric that measures the proportion of correct predictions relative to the total number of predictions made. Accuracy is particularly informative when the dataset is balanced.

4.6.1.3 Recall

Recall measures the number of correct positive predictions relative to the total number of
actual positive instances. It answers the question: Of all the positive examples, how many
were correctly identified by the model?

4.6.1.4 F1-Score
The F1 Score combines Precision and Recall into a single metric by calculating their harmonic
mean. This provides an overall view of a model's prediction quality.
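For illustration, the four metrics can be computed with scikit-learn as follows, assuming y_test and y_pred come from one of the fitted classifiers:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

print("Accuracy :", accuracy_score(y_test, y_pred))   # (TP + TN) / all predictions
print("Precision:", precision_score(y_test, y_pred))  # TP / (TP + FP)
print("Recall   :", recall_score(y_test, y_pred))     # TP / (TP + FN)
print("F1-Score :", f1_score(y_test, y_pred))         # harmonic mean of precision and recall
```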

4.6.2 ROC-AUC Curve

The ROC curve (Receiver Operating Characteristic curve) and AUC (Area Under the Curve) are performance evaluation tools that assess a classification model by plotting the true positive rate (sensitivity) against the false positive rate (1 − specificity) across different classification thresholds. An ideal ROC curve hugs the upper-left corner of the plot, indicating a high true positive rate and a low false positive rate across all thresholds.
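A sketch of the ROC curve computation, assuming a fitted classifier named model that supports predict_proba:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

y_score = model.predict_proba(x_test)[:, 1]  # predicted churn probabilities
fpr, tpr, _ = roc_curve(y_test, y_score)

plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, y_score):.3f}")
plt.plot([0, 1], [0, 1], "k--")  # chance-level diagonal
plt.xlabel("False positive rate (1 - specificity)")
plt.ylabel("True positive rate (sensitivity)")
plt.legend()
plt.show()
```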

4.6.3 Learning Curve

The Learning Curve is a graph that shows how a model's performance evolves with varying
amounts of training data. It compares training error and test error relative to the amount of
training data used. This curve helps visualize the relationship between training error and test
error, allowing developers to determine if the model needs more training data,
hyperparameter tuning, or regularization to avoid overfitting.
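A sketch of a learning-curve plot with scikit-learn, assuming model, X, and y are defined as before:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

# Training and cross-validated scores at increasing training-set sizes.
sizes, train_scores, test_scores = learning_curve(
    model, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5)
)
plt.plot(sizes, train_scores.mean(axis=1), "o-", label="Training score")
plt.plot(sizes, test_scores.mean(axis=1), "o-", label="Cross-validation score")
plt.xlabel("Number of training examples")
plt.ylabel("Score")
plt.legend()
plt.show()
```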

4.7 Hyperparameter Tuning

To achieve optimal results with most machine learning models, it is essential to find the best
hyperparameters. Two popular techniques for hyperparameter tuning are GridSearch and
RandomizedSearchCV.

 GridSearch

GridSearch is a hyperparameter optimization technique that tests all possible combinations of predefined hyperparameters to find the set that maximizes model performance. When the hyperparameter grid is large, this exhaustive search can be computationally expensive.

 RandomizedSearchCV

RandomizedSearchCV evaluates a limited number of hyperparameter combinations using random sampling. This method is faster than GridSearch because it tests fewer combinations, making it more efficient for large hyperparameter spaces.
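An illustrative tuning sketch; the estimator and grid values below are assumptions, not the project's actual search space:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

param_grid = {"n_estimators": [100, 200, 500], "max_depth": [None, 5, 10]}

# Exhaustive search over all 9 combinations.
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
grid.fit(x_train, y_train)

# Random sampling of only 4 of the 9 combinations.
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                          param_grid, n_iter=4, cv=5, random_state=0)
rand.fit(x_train, y_train)

print(grid.best_params_, rand.best_params_)
```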

The accompanying table provides detailed information on the algorithms used and the corresponding hyperparameters adjusted to optimize model performance.
4.8 Conclusion

In this chapter, we have concentrated on the various machine learning models used for
classification. We have examined each model in detail, including their principles, strengths,
and weaknesses. Our next step will be to evaluate the performance of these models to
identify which one delivers the highest accuracy.
Chapter 5: Model Evaluation and Optimization

5.1 Introduction
In this chapter, we assess the performance of the machine learning models developed for
predicting Customer Lifetime Value (CLV) and churn rates. This evaluation will involve
analyzing each model’s effectiveness, adjusting parameters to enhance performance, and
comparing the models to determine the best-performing approach.

5.2 Evaluation of Each Model

5.2.1 For CLV Modeling

For Customer Lifetime Value (CLV) Modeling, we implement five different regression models
to predict CLV using a set of features related to customer behavior. These models aim to
estimate the total revenue a customer will generate over their relationship with the
company based on certain behavioral metrics. Here’s a detailed explanation of each part of
the modeling process:

1. Outlier Consideration

Before building the models, it’s noted that the data contains outliers. These extreme values
can affect the predictions, especially in regression models. Handling these outliers (e.g.,
removing or transforming them) is important to improve model performance. In this case,
no explicit outlier handling is performed.

2. Feature Selection

The independent variables (features) selected for the regression models are behavioral
metrics that reflect customer interaction with the service. These include:

 Number of Complaints (nb_reclamation): Reflects customer dissatisfaction.


 Call Duration and Number of Calls: Call patterns during different periods (day,
evening, night) represent the customer’s engagement with the service.
 Voice Messages (nb_msg_vocaux): Measures how much customers use the voice
message service.
 International Call Duration (durée_appel_inter(minutes)): Indicates whether
customers make international calls, which might contribute significantly to total
revenue.

The target variable (y) is the total revenue generated by the client, representing the CLV.
3. Train-Test Split

The data is split into a training set and a test set, with 80% of the data used to train the
models and 20% reserved for testing. Setting random_state=0 ensures that the split is
reproducible. This step is critical for evaluating how well the model generalizes to unseen
data.

4. Model Initialization

We used regression models to predict Customer Lifetime Value (CLV) because CLV is a
continuous numeric variable that represents the total revenue a business expects to earn
from a customer over their lifetime. Regression models are specifically designed to predict
such continuous outcomes by learning the relationship between independent variables
(customer behavior features) and the target variable (CLV).

Predicting Continuous Values: CLV, being a monetary value, can take on a wide range of
possible outcomes. Regression models are ideally suited for predicting such continuous
variables.

Capturing Relationships: These models can capture both linear and non-linear relationships
between customer behavior (e.g., call duration, number of complaints) and the associated
revenue, making them versatile for complex data.

Handling Multiple Features: The dataset includes multiple features influencing CLV.
Regression models can efficiently use these multiple variables to generate accurate
predictions.

Scalability: Regression models are widely used in predictive analytics and can handle larger
datasets, making them practical and scalable for CLV modeling.

5. Model Evaluation

In the evaluation section, we analyze how well the different regression models predict Customer Lifetime Value (CLV) using several important metrics. The evaluate_my_models function performs the following steps:
1. Model Training: Each model is fitted using the training dataset (x_train, y_train).
2. Prediction: The model generates predictions (y_pred) based on the test dataset
(x_test).
3. Performance Metrics:

 R² Score: Displays how well the model explains the variance in the data for both
training and test sets.
 Mean Absolute Error (MAE): Indicates the average size of the prediction errors.
 Mean Squared Error (MSE): Shows the average of the squared errors, giving more
weight to larger errors.
 Root Mean Squared Error (RMSE): Provides a clearer interpretation of the error by
taking the square root of MSE.

These metrics help us understand the accuracy of each model and identify any issues with
overfitting or underfitting.
The R² scores obtained for each model are summarized below:

Model | Test Score (R²) | Train Score (R²)
Linear Regression | 0.9932 | 0.9932
Ridge Regression | 0.9932 | 0.9932
Support Vector Regression | 0.9881 | 0.9881
Gradient Boosting Regressor | 0.9010 | 0.9010
Random Forest Regressor | 0.9791 | 0.9791

Model Comparison:

 Linear Regression and Ridge Regression:

Both models achieved R² scores over 0.99, demonstrating a strong ability to explain the
variance in Customer Lifetime Value (CLV).

Their low Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared
Error (RMSE) indicate high reliability for CLV prediction.

 Support Vector Regression :

With an R² score of 0.9881, this model performed well but had higher error metrics than the
linear models, suggesting potential for improvement through tuning.

 Gradient Boosting Regressor :

This model underperformed with an R² score of 0.9010 and higher error metrics, making it
less suitable for our objectives.

 Random Forest Regressor :

With an R² score of 0.9791, this model showed balanced performance, indicating it could be
a viable option with further tuning to capture complex relationships.

6. Choosing the Most Fitting CLV Model

Based on the evaluation results, Linear Regression and Ridge Regression emerge as the
leading models for predicting Customer Lifetime Value (CLV). Their exceptional performance
is demonstrated by R² scores exceeding 0.99 and low error metrics, indicating their strong
reliability and accuracy in predictions. Additionally, their comparable training and testing
scores suggest they generalize well without overfitting, making them ideal for our analysis.

Support Vector Regression, while promising with an R² score of 0.9881, exhibited higher
error rates, indicating a need for further tuning and adjustments to enhance its predictive
capability. In contrast, the Gradient Boosting Regressor's lower R² score and higher error
metrics render it less suitable for our objectives. Although the Random Forest Regressor
provides balanced performance with an R² score of 0.9791, it does not surpass the accuracy
of the linear models.

In conclusion, for predicting CLV, we will prioritize the implementation of Linear Regression
and Ridge Regression due to their superior performance and robustness. Other models, such
as Support Vector and Random Forest, may be considered for future experimentation to
explore potential improvements.

This code section finalizes the predictive analysis of Customer Lifetime Value (CLV) by fitting
the best-performing model (either Linear or Ridge Regression) on the entire dataset. After
training the model on the features (x) and target variable (y), it predicts the CLV for all
customers, storing the results in a new column, predicted_clv.
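A hedged sketch of this final step, assuming Linear Regression is retained and df, X, and the revenue target are defined as earlier:

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

y = df["revenue_total_client"]

# Fit the chosen model on the full dataset and store per-customer predictions.
final_model = LinearRegression().fit(X, y)
df["predicted_clv"] = final_model.predict(X)

print("MAE:", mean_absolute_error(y, df["predicted_clv"]))
print("R² :", r2_score(y, df["predicted_clv"]))

# Scatter of actual revenue vs. predicted CLV with a 1:1 reference line.
plt.scatter(y, df["predicted_clv"], alpha=0.3)
plt.plot([y.min(), y.max()], [y.min(), y.max()], "r--")
plt.xlabel("Actual total revenue")
plt.ylabel("Predicted CLV")
plt.show()
```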

The model's performance is then evaluated using key metrics: a Mean Absolute Error (MAE) of 0.06 and an R² score of 1.00, indicating near-perfect accuracy. A scatter plot visually compares the actual total revenue against the predicted CLV, showing a strong correlation. Finally, a sample of the actual and predicted CLV values is displayed, confirming the model's ability to accurately predict customer value.
The visualization and predicted values indicate that the model performs exceptionally well in
predicting Customer Lifetime Value (CLV) when compared to the actual revenue values. The
scatter plot of actual versus predicted values shows a nearly perfect alignment along the red
diagonal line, which represents a 1:1 correlation. This strong alignment indicates that the
model has minimal error, as the predicted values are very close to the actual revenue values.

For instance, client ID 382-4657 has an actual revenue of 75.56 and a predicted CLV of 75.49,
showcasing a minimal difference. Similarly, for client ID 371-7191, the actual revenue is
59.24, and the predicted CLV is 59.25. These results reflect the low Mean Absolute Error
(MAE) of 0.06 and Mean Squared Error (MSE) of 0.32, confirming the model’s high accuracy.
With an R² score of 1.00, it’s evident that the model captures almost all variance in the data,
making it highly reliable for predicting CLV. This combination of low error metrics and high
predictive accuracy demonstrates that the model is well-suited for practical CLV predictions.

7. Conclusion for the CLV Model

In conclusion, the Customer Lifetime Value (CLV) modeling process has proven effective,
with Linear Regression and Ridge Regression identified as the top-performing models. Both
demonstrated exceptional predictive accuracy, reflected in R² scores above 0.99 and low
error metrics. The strong correlation between actual and predicted CLV values indicates the
model's reliability in capturing customer revenue. These insights will aid in refining
marketing strategies and enhancing customer retention efforts, ultimately driving business
growth.

5.2.2 For Churn Modeling


For churn prediction, we applied several classification models to predict whether a customer
will churn (i.e., leave the company) based on their behavior and interaction with the service.
Here's an explanation of each step involved in the churn modeling process:

1. Outlier Consideration

As in CLV modeling, outliers could in principle affect the churn prediction models. However, because classification models are generally more robust to outliers, no explicit outlier handling is performed in this analysis.

2. Feature Selection

The selected features for churn prediction are behavioral metrics indicative of churn risk: number of complaints, call duration and number of calls, voice messages, and international call duration. The target variable (y) is binary, representing churn (1 for churned, 0 for retained).

3. Train-Test Split: StratifiedKFold Setup

We implement StratifiedKFold for cross-validation to address the imbalanced nature of our churn dataset. With n_splits=5, the data is divided into five folds, and shuffle=True ensures that the data is randomized before splitting. Setting random_state=0 guarantees reproducibility. This method preserves the proportion of churn and non-churn samples in each fold, allowing for a more reliable assessment of the model's performance and preventing bias towards the majority class.
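A minimal sketch of this setup, assuming x and y are the pandas feature matrix and target described above:

```python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

for train_idx, test_idx in skf.split(x, y):
    # Each fold keeps the churn / non-churn proportions of the full dataset
    x_train, x_test = x.iloc[train_idx], x.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
```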

4. Model Initialization
For churn prediction, we use classification models suited for binary outcomes, as churn is a
categorical variable (1 for churned, 0 for retained). Key reasons for selecting these models
include:

Binary Outcome Prediction: Classification algorithms excel at predicting binary outcomes, making them ideal for churn tasks.

Complex Relationship Capture: These models can identify both linear and non-linear relationships between customer behaviors and churn risk.

Imbalanced Data Handling: Certain algorithms, like Random Forest and Gradient Boosting, can effectively manage imbalanced datasets.

Scalability: Classification models efficiently handle larger datasets, making them practical for churn analysis.
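For illustration, the candidate classifiers could be initialized as follows; the hyperparameters shown are scikit-learn defaults and are an assumption, not the project's exact settings:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree Classifier": DecisionTreeClassifier(random_state=0),
    "Random Forest Classifier": RandomForestClassifier(random_state=0),
    "Gradient Boosting Classifier": GradientBoostingClassifier(random_state=0),
    "Support Vector Classifier": SVC(probability=True),  # probability=True enables AUC-ROC scoring
}
```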

5. Model Evaluation

In this section, we analyze the performance of different classification models in predicting customer churn. The evaluate_my_models function performs the following steps:

1. Model Training: Each model is fitted using the training dataset (x_train, y_train).
2. Prediction: The model generates predictions (y_pred) based on the test dataset (x_test).
3. Performance Metrics:

Accuracy Score: Indicates the proportion of correctly predicted instances among the total instances.

Precision: Measures the accuracy of positive predictions (the proportion of predicted churners who actually churned).

Recall: Indicates the ability of the model to identify all relevant instances (sensitivity, or true positive rate).

F1 Score: The harmonic mean of precision and recall, providing a balance between the two.

AUC-ROC Score: Represents the model's ability to discriminate between classes across all thresholds.

These metrics help us understand the accuracy and reliability of each model, highlighting any
issues with overfitting or underfitting.
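A hedged sketch of what the evaluate_my_models routine could look like; the function name matches the report, but its exact body is assumed:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def evaluate_my_models(models, x_train, x_test, y_train, y_test):
    for name, model in models.items():
        model.fit(x_train, y_train)                  # 1. Model training
        y_pred = model.predict(x_test)               # 2. Prediction
        y_proba = model.predict_proba(x_test)[:, 1]  # class-1 scores for AUC-ROC
        print(name)                                  # 3. Performance metrics
        print("  Train accuracy:", round(model.score(x_train, y_train), 2))
        print("  Test accuracy: ", round(accuracy_score(y_test, y_pred), 2))
        print("  Precision:", round(precision_score(y_test, y_pred), 2))
        print("  Recall:   ", round(recall_score(y_test, y_pred), 2))
        print("  F1 score: ", round(f1_score(y_test, y_pred), 2))
        print("  AUC-ROC:  ", round(roc_auc_score(y_test, y_proba), 2))
```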
Model                          Test Accuracy   Train Accuracy   Precision   Recall   F1 Score   AUC-ROC Score
Logistic Regression            0.92            0.94             0.91        0.93     0.92       0.95
Decision Tree Classifier       0.88            0.89             0.87        0.85     0.86       0.90
Random Forest Classifier       0.93            0.95             0.92        0.94     0.93       0.96
Gradient Boosting Classifier   0.91            0.92             0.90        0.92     0.91       0.94
Support Vector Classifier      0.90            0.91             0.89        0.88     0.88       0.92

Model Comparison:

Logistic Regression: This model achieved a notable accuracy of 0.92, demonstrating strong
predictive power for identifying customer churn. The balance between precision and recall
shows that it handles both false positives and false negatives well, making it a solid and
dependable model for churn prediction.

Decision Tree Classifier: With an accuracy of 0.88, the Decision Tree model offers reasonable
performance but is outperformed by other models in terms of precision and recall. This
suggests that the model might be prone to misclassifying some cases, particularly in terms of
false positives and false negatives, which limits its effectiveness in our specific use case.

Random Forest Classifier: The Random Forest model was the top performer in this analysis,
with an impressive accuracy of 0.93. It also achieved high precision and recall values,
demonstrating superior reliability in identifying churn cases. Its ability to prevent overfitting
makes it an ideal candidate for churn prediction, especially in cases involving complex data
patterns.

Gradient Boosting Classifier: This model showed balanced results, achieving an accuracy of
0.91. While it performs reasonably well, it did not surpass the Random Forest Classifier in
terms of accuracy or other key metrics, such as precision and recall, making it a slightly less
attractive choice.

Support Vector Classifier (SVC): The SVC model performed adequately, with an accuracy of
0.90. Although it demonstrated strong classification capabilities, it lagged behind the top-
performing models, particularly in terms of precision and recall. Thus, it may not be as
suitable for this specific churn prediction problem.

 Hyperparameter Tuning

To further enhance model performance, hyperparameter tuning was applied across multiple
models. We defined hyperparameter grids for key models such as Decision Tree, Random
Forest, K-Nearest Neighbors (KNN), and Support Vector Classifier (SVC), focusing on critical
parameters:
 Maximum Depth: Controls the maximum depth of the decision tree, determining
how deep the tree can grow before halting. This parameter prevents overfitting by
limiting the complexity of the model.
 Number of Estimators: Refers to the number of trees or models used in ensemble
methods like Random Forest and Gradient Boosting. Increasing the number of
estimators enhances model accuracy, but it also increases computational cost.
 Number of Neighbors (KNN): This parameter defines how many neighbors should be
considered when classifying a new data point. A larger value of "K" smooths decision
boundaries but may decrease model sensitivity.

By systematically testing different configurations, we optimized model performance through a combination of accuracy and recall scores.

 Grid Search for Model Tuning

To identify the optimal hyperparameters, we employed GridSearchCV for exhaustive testing across the specified grids. Each model was evaluated based on cross-validation scores, with recall as the primary metric to ensure effective identification of churn cases. The best parameters for each model were determined and applied, improving overall predictive performance.
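An illustrative sketch of this search, shown here for the Random Forest; the grid values are examples rather than the project's exact grids:

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

param_grid = {
    "max_depth": [5, 10, 20, None],     # caps tree depth to limit overfitting
    "n_estimators": [100, 200, 500],    # number of trees in the ensemble
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="recall",   # recall prioritized to catch churn cases
    cv=skf,
)
grid.fit(x_train, y_train)
print(grid.best_params_, grid.best_score_)   # best grid and its mean CV recall
```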

 Randomized Search for SVC

Given the computational cost of an exhaustive grid search for the Support Vector Classifier,
we opted for RandomizedSearchCV to efficiently explore a wider range of hyperparameters.
This approach enabled us to identify the best-performing SVC parameters without requiring
an exhaustive search, reducing processing time while maintaining performance.
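A sketch of the randomized search for the SVC; the parameter distributions and n_iter value are assumptions chosen for illustration:

```python
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.svm import SVC

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

param_dist = {
    "C": loguniform(1e-2, 1e2),       # regularization strength
    "gamma": loguniform(1e-4, 1e0),   # RBF kernel coefficient
}

rand_search = RandomizedSearchCV(
    SVC(),
    param_dist,
    n_iter=20,          # 20 sampled configurations instead of an exhaustive sweep
    scoring="recall",
    cv=skf,
    random_state=0,
)
rand_search.fit(x_train, y_train)
print(rand_search.best_params_)
```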

6. Choosing the Most Fitting Churn Model


After evaluating model performance following hyperparameter tuning, the K-Nearest Neighbors (KNN) Classifier with K=4 emerged as the most suitable model for churn prediction. With an accuracy of 0.90 and high precision for class 1 (churn cases), the model strikes a strong balance between recall and F1-score. It captured churn instances better than the Random Forest and Decision Tree models, making it a robust, well-suited choice for our objective without a significant drop in accuracy or precision.
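The selected model can be fitted in a few lines; only K=4 is taken from the report, the rest is a sketch:

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=4)   # K=4, as selected above
knn.fit(x_train, y_train)
churn_pred = knn.predict(x_test)            # 1 = predicted churn, 0 = retained
```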

7. Conclusion for Churn Modeling

The churn modeling analysis identified the K-Nearest Neighbors (KNN) Classifier with K=4 as the optimal model for predicting customer churn, achieving an accuracy of 0.90. While other models like Logistic Regression performed well, KNN effectively captured churn instances with a strong balance between recall and F1-score.

Hyperparameter tuning further enhanced performance, enabling accurate predictions that can inform targeted retention strategies to boost customer loyalty.

5.3 Feature Engineering: RFM

To further enhance our understanding of customer behavior, we incorporated RFM (Recency, Frequency, Monetary) metrics into the analysis. While these features were not directly used in the churn prediction models, they provide invaluable insights for visualizations and business intelligence tools like Power BI.

 Recency: This metric measures how recently a customer interacted with the service.
Typically, customers with more recent engagements are less likely to churn, making it
a key factor for analyzing churn risk.
 Frequency: This metric captures how often a customer uses the service over a
specific time period. Higher interaction frequency is often associated with greater
customer loyalty and reduced churn probability.
 Monetary: Represents the total spending of a customer on services. High-value
customers might exhibit different churn behaviors, making this a crucial aspect of
customer segmentation.
Though we did not use the RFM score directly in our predictive models, these features play
an essential role in enriching visualizations and dashboards in Power BI. They provide a
clearer picture of customer engagement patterns and behavior, which aids in identifying
trends that may not be captured by the model alone. For instance, through RFM analysis,
stakeholders can segment customers based on engagement and spending, leading to more
tailored retention strategies.

Thus, while the RFM score was not part of the predictive modeling process, the RFM
features significantly enhanced our ability to deliver actionable insights via Power BI,
improving business decision-making and customer retention strategies based on a deeper
understanding of customer engagement.

5.4 Conclusion

In this chapter, we highlighted the significance of Customer Lifetime Value (CLV) modeling,
churn prediction, and the creation of RFM (Recency, Frequency, Monetary) features at
Tunisie Telecom. The CLV model provides insights into the financial value of customers,
guiding marketing investments and acquisition strategies. The churn model effectively
identifies at-risk customers, enabling targeted interventions to reduce attrition.

Additionally, the RFM features facilitate effective customer segmentation, allowing for
tailored marketing strategies that address specific needs. Together, these models foster a
customer-centric approach, enhancing predictive accuracy and driving long-term value
creation for Tunisie Telecom.
Chapter 6: Deployment

6.1 Introduction
Deployment involves integrating the developed models into a usable application or system.
This chapter outlines the deployment process, including model extraction, web interface
development, and dashboard construction.

6.2 Deployment Process

6.2.1 Model Extraction

6.2.2 Development of the Web Interface

6.2.3 Application Interfaces

(MISSING SECTION)

6.3 Dashboard Construction

The deployment process also included the creation of interactive dashboards to visualize
model outputs and key business metrics. These dashboards were essential for stakeholders
to gain actionable insights from the predictive models. The construction of these dashboards
utilized tools like Power BI and Dash by Plotly, allowing for dynamic and user-friendly
interfaces.
Client Behavior Dashboard:

The Client Behavior Dashboard provides a comprehensive view of customer interactions, offering valuable insights into usage patterns and segmentation. Its key components include:

 Client Segmentation Filter: A filter at the top allows users to analyze data by
different client segments—high, medium, and low value.
 Call Activity Cards: These cards display call volumes across various times, including
day calls, night calls, and international calls.
 Variance Line Chart: This chart illustrates variations in call activity, correlating them
with the number of days subscribed, helping to identify trends in client behavior over
time.
 Voice Message Activity (Pie Chart): This chart shows that 54.26% of clients are
inactive regarding voice messages, indicating potential areas for engagement.
 Client Complaints Funnel: A funnel visualization tracks client counts based on the
number of complaints (nb_reclamation), highlighting customer satisfaction across
segments.

By utilizing the client segmentation filter, stakeholders gain deeper insights into behavioral
patterns, informing strategies to enhance customer experience and retention.

Client Lifetime Value Analysis Dashboard

The Client Lifetime Value (CLV) Analysis Dashboard provides a detailed examination of
customer profitability and engagement. By utilizing various metrics and visualizations, this
dashboard supports data-driven decision-making, allowing stakeholders to better
understand the factors influencing customer value and retention.

Key Components of the Client Lifetime Value Analysis Dashboard


 Client Segmentation Filter:

This filter allows users to select different client segments—high, medium, and low value—
tailoring the analysis to focus on specific groups. This segmentation is essential for targeted
strategies that address the unique characteristics of each group.

 Key Performance Cards:

Total Profit: Displays the overall profit generated from clients, providing a snapshot of
financial performance.

Average Days Subscribed: Indicates the average duration clients have been subscribed,
helping to assess client loyalty and engagement.

Churn Count by Segment: Shows the number of churned clients within each segment,
highlighting potential areas of concern.

Customer Count by Segment: Provides the total number of clients in each segment, offering
context for the profitability metrics.

Average Customer Lifetime Value (CLV): Highlights the average CLV, derived from total
revenue calculations, serving as a key indicator of customer profitability.

 Ribbon Chart - Days Subscribed by Frequency Score:

This chart visualizes the count of days subscribed (nb_jours_abonne) segmented by the
Frequency Score. The Frequency Score, calculated using DAX, assesses how often clients
interact with the service, with higher scores indicating more frequent usage. This
visualization helps identify the relationship between subscription length and client
engagement, revealing patterns that can inform retention strategies.

 Ribbon Chart - Revenue by Recency Score:

This chart shows the sum of total revenue (revenue_total_client) based on the Recency
Score. Clients with recency scores between 4 and 5 exhibit the highest revenue, indicating
that recently engaged clients tend to generate more revenue. This insight suggests that
enhancing engagement efforts for these clients could further increase profitability.

 Scatter Chart - Customer Segmentation:

This chart plots customers based on:

 Average Frequency: Reflects how often clients engage with the service.
 Average Revenue: Represents the average revenue generated per client.
 RFM Score: A composite score calculated using DAX that incorporates recency,
frequency, and monetary values to categorize clients based on their overall value to
the business.

The scatter chart enables a visual analysis of customer segments, revealing trends and
outliers in behavior. For instance, high-value clients often show higher frequency and
revenue metrics, while low-value clients may cluster at the lower end of these dimensions.

Churn Analysis Dashboard

The Churn Analysis Dashboard provides critical insights into customer churn patterns. Key components include:

 Churn Percentage (Pie Chart): This chart illustrates that 14.14% of clients have
churned, offering a quick overview of the overall churn rate.
 Churn by Complaints (Clustered Column Chart): Highlights the correlation between
the number of complaints and churn likelihood, showing that clients with 1 to 5
complaints are more likely to leave.
 Churn by Client Segment (Line Chart): Reveals that high-value clients experience the
highest churn rates, followed by medium- and low-value clients, indicating where
retention efforts should be concentrated.
 Churn by Revenue (Stacked Bar Chart): Shows that clients generating higher revenue
tend to have higher churn rates, suggesting that unmet expectations contribute to
client loss.
 Key Metrics (KPI Cards): These cards display a summary of essential statistics,
including total clients (5000), total churned clients (707), and total non-churned
clients (4293).

These insights emphasize the importance of focusing retention strategies on high-value clients and those who have voiced dissatisfaction.

DAX Implementation

To enhance the dashboards, several DAX measures were created to facilitate a detailed
analysis of customer behavior, revenue generation, and churn patterns. Below is an
overview of these measures, along with explanations for their significance:

1. Frequency_Score

This column assigns a score to each customer based on their usage frequency. The scoring is
as follows:

 5: Customers with 300 or more interactions are considered very active.
 4: Customers with 200 to 299 interactions are categorized as active.
 3: Customers with 100 to 199 interactions are moderately active.
 2: Customers with 50 to 99 interactions are infrequently active.
 1: Customers with fewer than 50 interactions are deemed inactive.

This column helps identify customer engagement levels, which can inform retention
strategies.
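For illustration, such a calculated column could be expressed in DAX as below; the table name (Clients) and interaction-count column (nb_appels) are assumptions, while the thresholds match the scoring above. The same SWITCH pattern carries over to the Monetary and Recency scores that follow:

```dax
Frequency_Score =
SWITCH (
    TRUE (),
    Clients[nb_appels] >= 300, 5,    -- very active
    Clients[nb_appels] >= 200, 4,    -- active
    Clients[nb_appels] >= 100, 3,    -- moderately active
    Clients[nb_appels] >= 50, 2,     -- infrequently active
    1                                -- inactive
)
```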

2. Monetary_Score
This column evaluates customers based on their monetary contribution, assigning scores as
follows:

 5: Customers generating $100 or more are classified as high-value contributors.
 4: Customers with $75 to $99 are seen as valuable.
 3: Customers generating $50 to $74 are moderately valuable.
 2: Customers with $25 to $49 are categorized as low-value contributors.
 1: Customers generating less than $25 are considered minimal contributors.

This scoring helps in understanding customer profitability, enabling tailored marketing strategies based on their monetary contributions.

3. Recency_Score

This column categorizes customers based on how recently they have engaged with the
service:

 5: Customers who engaged within the last 30 days are highly active.
 4: Customers engaged within 31 to 60 days are fairly active.
 3: Customers engaged within 61 to 90 days are moderately active.
 2: Customers engaged within 91 to 120 days are less active.
 1: Customers who haven't engaged for more than 120 days are inactive.

This score helps to identify how engagement timing impacts customer value and retention.

4. Total_RFM_Score

This column aggregates the individual scores from Frequency, Monetary, and Recency to
create a composite RFM score. Higher scores indicate more valuable customers, enabling
better segmentation for marketing efforts.

5. Customer_Segment Column

A Customer_Segment column was created to classify customers based on their Total RFM Score. Customers with scores of 12 or higher are categorized as "High Value," those scoring between 10 and 11 as "Medium Value," and those below 10 as "Low Value."

6. Average_CLV

This measure calculates the average Customer Lifetime Value (CLV) by averaging the total
revenue from customer activities, providing insights into typical revenue generated per
customer.
7. Total Client Revenue

This measure sums the total revenue generated by all clients, offering a comprehensive view
of revenue performance across the customer base.

8. Customer_Count_By_Segment

This measure counts the distinct number of clients in each segment, filtering the RFM
dimension to provide client counts specific to the selected Customer Segment.
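Hedged DAX sketches of these three measures, again assuming a Clients table with revenue_total_client and client_id columns; for the client count, the segment filtering comes from the dashboard's segmentation slicer (filter context):

```dax
Average_CLV = AVERAGE ( Clients[revenue_total_client] )

Total Client Revenue = SUM ( Clients[revenue_total_client] )

-- Distinct clients under the current filter context (e.g., the selected segment)
Customer_Count_By_Segment = DISTINCTCOUNT ( Clients[client_id] )
```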

6.4 Gantt Chart


A Gantt chart is a tool used to model the scheduling of tasks necessary for the completion of
a project. The figure presents the task planning for the execution of our project.

6.5 Conclusion
General Conclusion

This report encapsulates the results of our final year project, which aimed to analyze
customer behavior and evaluate customer value at Tunisie Télécom as part of our Master’s
degree in Business Analytics from ESB. Over the course of this project, we applied advanced
data analysis and machine learning techniques to address two critical aspects of customer
relationship management: predicting customer churn and estimating Customer Lifetime
Value (CLV).

Tunisie Télécom, with its extensive customer base of over four million subscribers, faces
significant challenges related to customer retention and revenue optimization. To tackle
these challenges, we developed predictive models using machine learning algorithms to
identify customers likely to churn and to estimate their CLV. These models were integrated
into a web application designed to facilitate predictions and provide insights into customer
value, enhancing the decision-making capabilities of Tunisie Télécom's agents. The
application also features a BI report integration, allowing for the visualization of historical
customer data and further enriching the decision support system.

Our project followed the CRISP-DM methodology, ensuring a structured approach to data
analysis, from understanding the business problem to deploying the final solution. The key
phases included data exploration, preprocessing, modeling, evaluation, and deployment.
This methodology provided a solid framework for addressing the complexities of customer
behavior and value analysis in the telecommunications sector.

The internship was a valuable experience, providing practical insights into the application of
machine learning in a real-world business context. We successfully implemented various
algorithms, including Logistic Regression, K-Nearest Neighbors, XGBoost, Random Forest,
Decision Tree, and Naive Bayes, to develop robust models for both churn prediction and CLV
estimation. Our solution represents a significant step towards a more data-driven approach
to customer management at Tunisie Télécom.

Looking ahead, there is potential for further enhancement of our solution. Future work could
involve integrating natural language processing techniques to analyze customer interactions
and categorize complaints, which would provide a deeper understanding of customer issues
and further refine retention strategies.

In summary, this project not only advanced our technical skills but also contributed valuable
insights to Tunisie Télécom's customer management practices. By leveraging machine
learning and data analysis, we provided actionable tools to improve customer retention and
maximize CLV, addressing key challenges in the competitive telecommunications landscape.
Bibliography
