ML Customer Segmentation
A PROJECT REPORT
Submitted by
HONEY VISHWAKARMA(RA2211003020188)
VIRAT SINGH PUNDIR (RA2211003020175)
of
BACHELOR OF TECHNOLOGY
in
of
OCT 2024
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
(Deemed to be University U/S 3 of UGC Act, 1956)
BONAFIDE CERTIFICATE
SIGNATURE SIGNATURE
Submitted for the project viva-voce held on ___________ at SRM Institute of Science and
Technology, Ramapuram, Chennai -600089.
ABSTRACT
In this modern era, everyone is innovative and competes to be better than others. The emergence of many entrepreneurs, competitors, and business-minded people has created a great deal of insecurity and tension among competing businesses trying to find new customers and retain old ones. Because of this, exceptional customer service becomes essential, irrespective of the scale of the business. It is equally important to understand the needs of customers specifically, in order to provide better customer support and to advertise the most appropriate products to them. In the pool of online products, customers are confused about what to buy and what not to buy, and companies are equally unsure about which section of customers to target for a particular type of product. This confusion can be resolved by a process called CUSTOMER SEGMENTATION: segmenting customers with similar interests and similar shopping behavior into the same segment, and those with different interests and different shopping patterns into different segments. Customer segmentation and pattern extraction are major aspects of a business decision support system. Each segment contains customers who most probably have the same kind of interests and shopping patterns. In this paper, we perform customer segmentation using three different clustering algorithms, namely the K-means clustering algorithm, Mini-Batch K-means, and hierarchical clustering, and compare these algorithms based on their efficiency and cluster-quality scores.
TABLE OF CONTENTS
ABSTRACT
LIST OF FIGURES
LIST OF ABBREVIATIONS
1 INTRODUCTION
1.1 OVERVIEW
2 LITERATURE SURVEY
3 METHODOLOGY
3.4 SOFTWARE AND HARDWARE REQUIREMENTS
3.4.1 SOFTWARE REQUIREMENTS
3.4.2 HARDWARE REQUIREMENTS
3.4.3 LIBRARIES
3.5 PROGRAMMING LANGUAGES
3.5.1 PYTHON
3.5.2 DOMAIN
3.6 SYSTEM ARCHITECTURE
3.7 ALGORITHMS USED
3.7.1 K-MEANS CLUSTERING
3.7.2 HIERARCHICAL CLUSTERING
3.7.3 MINIBATCH K-MEANS
3.7.4 ELBOW METHOD
3.8 MODULES
5 REFERENCES
6 APPENDICES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABBREVIATIONS EXPANSION
ML Machine Learning
AI Artificial Intelligence
CHAPTER 1
INTRODUCTION
1.1 OVERVIEW
Data is very precious in today's ever-competitive world. Every day, organizations and people encounter large amounts of data. The most efficient way to handle this data is to classify or categorize it into clusters, groups, or partitions.
Usually, classification methods are either supervised or unsupervised, depending on whether or not they have labeled datasets. Unsupervised classification is exploratory data analysis in which there is no training data set, and hidden patterns must be extracted from data with no labeled responses; supervised classification, in contrast, is the machine learning task of deducing a function from a training data set. The main focus is to enhance the propinquity, or closeness, of data points belonging to the same group and to increase the variance among different groups, all of which is achieved through some measure of similarity. Exploratory data analysis deals with a wide range of applications, such as engineering, text mining, pattern recognition, bioinformatics, spatial data analysis, mechanical engineering, voice mining, textual document collection, artificial intelligence, and image segmentation. This diversity explains the importance of clustering in scientific research, but it can also lead to contradictions due to differing purposes and nomenclature.
Maintaining and managing customer relationships has always played a key role in providing business intelligence that helps companies build, develop, and manage important long-term relationships with customers. The importance of treating customers as a main asset of the organization is increasing in the present-day era. Using clustering techniques such as k-means, mini-batch k-means, and hierarchical clustering, customers with the same habits are grouped into one cluster. Customer segmentation helps the marketing team recognize customer segments that think differently and follow different purchasing techniques and strategies. It identifies customers who vary in terms of purchasing habits, expectations, desires, preferences, and attributes. The main purpose of customer segmentation is to group customers who have the same interests, so that the marketing or business team can converge on an effective marketing plan. Clustering techniques consider data tuples as objects and partition the data objects into clusters or groups. Customer segmentation is the process of dividing customers into groups, called customer segments, so that each segment comprises customers with similar interests and patterns. The segmentation process is mostly based on similarity in ways that are relevant to marketing, such as age, gender, interests, and spending habits.
Customer segmentation is important because it enables: tailoring marketing programs to each segment, supporting business decisions, identifying the products associated with each customer segment and managing the demand and supply of those products, predicting customer defection, identifying and targeting the potential customer base, and providing direction in finding solutions. Clustering is an iterative process of knowledge discovery from huge amounts of raw, unorganized data. It is a form of exploratory data mining used in several applications, such as classification, machine learning, and pattern recognition.
Machine learning (ML) is the study of computer algorithms that can improve automatically
through experience and by the use of data. It is seen as a part of artificial intelligence.
Machine learning algorithms build a model based on sample data, known as training data, to
make predictions or decisions without being explicitly programmed to do so. Machine
learning algorithms are used in a wide variety of applications, such as in medicine, email
filtering, speech recognition, and computer vision, where it is difficult or unfeasible to
develop conventional algorithms to perform the needed tasks.
Machine learning is a subfield of artificial intelligence (AI). The goal of machine learning is typically to understand the structure of data and fit that data into models that can be understood and used by people. Although machine learning is a field within computer science, it differs from traditional computational approaches.
The term machine learning was coined in 1959 by Arthur Samuel, an American
IBMer, and pioneer in the field of computer gaming and artificial intelligence. Also,
the synonym self-teaching computers was used in this period. A representative book
of machine learning research during the 1960s was Nilsson's book on Learning
Machines, dealing mostly with machine learning for pattern classification. Interest
related to pattern recognition continued into the 1970s, as described by Duda and Hart
in 1973. In 1981 a report was given on using teaching strategies so that a neural
network learns to recognize 40 characters (26 letters, 10 digits, and 4 special symbols)
from a computer terminal.
Modern-day machine learning has two objectives, one is to classify data based on
models which have been developed, the other purpose is to make predictions for
future outcomes based on these models. A hypothetical algorithm specific to
classifying data may use computer vision of moles coupled with supervised learning
to train it to classify the cancerous moles. Whereas, a machine learning algorithm for
stock trading may inform the trader of future potential predictions.
In machine learning, tasks are typically classified into broad categories. These categories are based on how learning is received or how feedback on the learning is given to the system being developed. Two of the most widely adopted machine learning methods are supervised learning, which trains algorithms on example input and output data labeled by humans, and unsupervised learning, which provides the algorithm with no labeled data, allowing it to find structure within its input data.
Machine learning approaches are traditionally divided into three broad categories, depending
on the nature of the "signal" or "feedback" available to the learning system:
Supervised learning: The computer is presented with example inputs and their desired
outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to
outputs.
Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own
to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden
patterns in data) or a means towards an end (feature learning).
In supervised learning, the computer is given example inputs that are labeled with their desired outputs. The aim of this technique is for the algorithm to "learn" by comparing its actual output with the "taught" outputs to find errors and modify the model accordingly. Supervised learning thus uses patterns to predict label values on additional unlabeled data. For example, a supervised learning algorithm may be fed data with images of sharks labeled as fish and images of oceans labeled as water. By being trained on this data, the algorithm should later be able to identify unlabeled shark images as fish and unlabeled ocean images as water.
In unsupervised learning, data is unlabeled, and the learning algorithm is left to find commonalities among its input data. The goal of unsupervised learning may be as simple as discovering hidden patterns within a dataset, but it may also have a goal of feature learning, which allows the machine to automatically discover the representations needed to classify data.
Unsupervised learning is often used for transactional data. You may have a large dataset of customers and their purchases, but as a person you are unlikely to be able to work out what similar attributes can be drawn from customer profiles and their styles of purchases.
With this data fed into an unsupervised learning algorithm, it may be determined that women of a certain age range who buy unscented soaps are likely to be pregnant, and a marketing campaign related to pregnancy and baby products can then be targeted to them.
1.4 CLUSTERING
Clustering is the task of dividing the data points into definite groups such that the data points
in the same group have similar characteristics or similar behavior. In short, segregating the
data points into different clusters based on their similar traits.
The type of algorithm we use decides how the clusters will be created. The inferences to be drawn from the data sets also depend upon the user, as there is no universal criterion for good clustering.
Clustering itself can be categorized into two types, viz. hard clustering and soft clustering. In hard clustering, one data point can belong to one cluster only. In soft clustering, the output is a probability of the data point belonging to each of a pre-defined number of clusters.
The task of clustering is subjective, which means there are many ways of achieving the goal of clustering. Each methodology has its own set of rules to segregate data points into different clusters. There are many clustering algorithms; among the most used are the K-means clustering algorithm, hierarchical clustering algorithms, and the Mini-Batch K-means clustering algorithm.
1.4.1 Density-based
In this method, clusters are created based upon the density of the data points in the data space. Regions that become dense due to a huge number of data points residing in them are considered clusters. Data points in sparse regions (regions where the data points are very few) are considered noise or outliers. The clusters created by these methods can be of arbitrary shape.
1.4.3 Hierarchical Clustering
1.4.4 Centroid-based
Centroid-based clustering is the one you probably hear about the most. It's a little
sensitive to the initial parameters you give it, but it's fast and efficient.
These types of algorithms separate data points based on multiple centroids in the data.
Each data point is assigned to a cluster based on its squared distance from the centroid.
This is the most commonly used type of clustering.
K-means clustering is one of the most widely used algorithms. It partitions the data points into k clusters based upon the distance metric used for the clustering. The value of k is to be defined by the user. The distance is calculated between the data points and the centroids of the clusters; each data point is assigned to the cluster whose centroid is closest. After an iteration, the centroids of the clusters are computed again, and the process continues until a pre-defined number of iterations is completed or the centroids of the clusters no longer change between iterations.
It is a computationally expensive algorithm, as it computes the distance of every data point to the centroids of all the clusters at each iteration. This makes it difficult to apply to huge data sets.
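As an illustration (a sketch, not the project's own code), the assign-and-recompute loop described above can be run with scikit-learn's KMeans, here on synthetic data standing in for the customer dataset:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for customer data: 300 points around 3 true centres
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# k (n_clusters) must be chosen by the user; fit_predict runs the
# assign-to-nearest-centroid / recompute-centroids loop until the
# centroids stop moving or max_iter iterations are reached
km = KMeans(n_clusters=3, n_init=10, max_iter=300, random_state=42)
labels = km.fit_predict(X)
```

`km.cluster_centers_` then holds one centroid per cluster, and `labels` holds the cluster assignment of each point.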
Clustering is used in our daily lives, such as in data mining, academics, web cluster engines, bioinformatics, image processing, and many more. A few common applications where clustering is used as a tool are recommendation engines, market segmentation, customer segmentation, social network analysis (SNA), search result clustering, identification of cancer cells, biological data analysis, and medical imaging analysis.
• Scalability − Some clustering algorithms work well on small data sets containing fewer than 200 data objects; however, a huge database can contain millions of objects, and clustering on a sample of such a data set can lead to biased results. Highly scalable clustering algorithms are therefore required.
• Ability to deal with different types of attributes − Some algorithms are designed
to cluster interval-based (numerical) records. However, applications can require
clustering several types of data, including binary, categorical (nominal), and ordinal
data, or a combination of these data types.
• Discovery of clusters with arbitrary shape − Some clustering algorithms determine
clusters depending on Euclidean or Manhattan distance measures. Algorithms based
on such distance measures tend to discover spherical clusters with the same size and
density. However, a cluster can be of any shape. It is essential to develop algorithms
that can identify clusters of arbitrary shapes.
• Ability to deal with noisy data − Some real-world databases include outliers or
missing, unknown, or erroneous records. Some clustering algorithms are sensitive to
such data and may lead to clusters of poor quality.
• High dimensionality − A clustering algorithm should be able to cluster data objects in high-dimensional space, especially considering that data in high-dimensional space can be very sparse and highly skewed.
CHAPTER 2
LITERATURE SURVEY
➢ Aman Banduni and Prof. Ilavedhan A, in [1], study customer segmentation using machine learning and explain the concept of customer segmentation.
➢ Kamalpreet Bindra and Anuranjan Mishra, in [2], study different clustering algorithms in detail, compare them, and based on the results decide which algorithms to use for the project.
➢ Kai Peng (Member, IEEE), Victor C. M. Leung (Fellow, IEEE), and Qinghai Huang, in [3], examine the mini-batch K-means clustering algorithm in detail, covering its advantages, disadvantages, and implementation.
➢ Fionn Murtagh and Pedro Contreras, in [4], study hierarchical clustering algorithms, describe how the clusters are formed, discuss the advantages and disadvantages, and compare the approach with other clustering algorithms.
➢ D. P. Yash Kushwaha and Deepak Prajapati, in [5], study customer segmentation and the k-means clustering algorithm in detail, perform customer segmentation using K-means, observe the clusters formed, and compare the results with other clustering algorithms.
➢ Manju Kaushik and Bhawana Mathur, in [6], examine two clustering algorithms, K-means and hierarchical clustering, perform customer segmentation using both, compare the results, and decide which of the two is better for customer segmentation.
➢ Ali Feizollah, Nor Badrul Anuar, Rosli Salleh, and Fairuz Amalina, in [7], study in detail the K-means and mini-batch K-means clustering algorithms, perform customer segmentation using both, compare the results, and decide which of the two is better for customer segmentation.
➢ Asith Ishantha, in [8], studies different clustering algorithms in detail, including K-means, mini-batch K-means, hierarchical clustering, and more, performs customer segmentation using all of them, compares the results, and decides the best algorithm among them.
➢ Onur Dogan, Ejder Aycin, and Zeki Atil Bulut, in [9], study customer segmentation in detail using the RFM model and several clustering algorithms.
➢ Juni Norma Sari, Ride Dedriana, Lukito Nugroho, and Paulus Insap Santosa, in [10], review customer segmentation techniques.
➢ Shi Na, Liu Xumin, and Guan Yong, in [11], study the k-means clustering algorithm in detail and observe its pros and cons.
➢ Francesco Musumeci, Cristina Rottondi, Avishek Nag, et al., in [12], give an overall overview of the application of machine learning techniques and their implementation.
➢ Şükrü Ozan et al., in [13], present a case study on customer segmentation using machine learning methods.
➢ Tushar Kansal, Suraj Bahuguna, Vishal Singh, and Tanupriya Choudhury, in [14], study customer segmentation mainly using the K-means clustering algorithm.
➢ Ina Maryani, Dwiza Riana, Rachmawati Darma Astuti, Ahmad Ishaq, Sutrisno, and Eva Argarini Pratama, in [15], study different clustering techniques.
CHAPTER 3
METHODOLOGY
The existing model for customer segmentation is based on the K-means clustering algorithm, which comes under centroid-based clustering. A suitable K value for the given dataset is selected appropriately, representing the predefined number of clusters. Raw, unlabeled data is taken as input and divided into clusters until the best clusters are found. The centroid-based algorithm used in this model is efficient but sensitive to initial conditions and outliers.
In the proposed system, the customer segmentation model includes not only centroid-based but also hierarchical clustering.
• The three clustering algorithms, K-means, Mini-Batch K-means, and hierarchical clustering, have been selected from the literature survey.
• By deploying the three different algorithms, the clusters are formed and analyzed respectively.
• The most effective and efficient algorithm is determined by comparing and evaluating the precision rate among the three algorithms.
Customer segmentation is the practice of dividing a company's customers into groups that reflect similarities among the customers in each group. The main objective of segmenting customers is to decide how to relate to the customers in each segment in order to maximize the value of each customer to the business.
The emergence of many competitors and entrepreneurs has caused a lot of tension among competing businesses trying to find new buyers and keep the old ones. As a result, the need for exceptional customer service becomes important regardless of the size of the business. Furthermore, a business's ability to understand the needs of each of its customers provides greater customer support, targeted customer services, and customized customer service plans. This understanding is made possible through structured customer segmentation.
3.4 SOFTWARE AND HARDWARE REQUIREMENTS
3.4.1 Software Requirements:
✓ Python
✓ Anaconda
✓ Jupyter Notebook
3.4.2 Hardware Requirements:
✓ RAM: 8GB
✓ OS: Windows
3.4.3 Libraries:
✓ Numpy − A Python library for array computation, offering a large number of functions. We have used this module to change a 2-dimensional array into a contiguous flattened array using the ravel function.
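A minimal sketch of the ravel usage described above (the actual arrays used in the project are an assumption here):

```python
import numpy as np

# A small 2-dimensional array standing in for the project's data
a = np.array([[1, 2], [3, 4]])

# ravel() returns a contiguous flattened 1-D array
flat = a.ravel()
print(flat)  # [1 2 3 4]
```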
3.5. PROGRAMMING LANGUAGES
3.5.1 Python
Python is one of the programming languages best suited to machine learning. According to studies and surveys, Python is among the five most significant languages and is the preferred language for machine learning and data science. This is owing to the following strengths that Python has:
✓ Easy to learn and understand − The syntax of Python is simple; hence it is relatively straightforward, even for beginners, to learn and understand the language.
3.5.2 Domain
Machine learning is a subfield of artificial intelligence (AI). The goal of machine learning is typically to understand the structure of data and fit that data into models that can be understood and used by people. Although machine learning is a field within computer science, it differs from traditional computational approaches. In traditional computing, algorithms are sets of explicitly programmed instructions used by computers to calculate or solve problems. Machine learning algorithms instead allow computers to train on data inputs and use statistical analysis to output values that fall within a particular range. Because of this, machine learning enables computers to build models from sample data and automate decision-making processes based on data inputs.
3.6. SYSTEM ARCHITECTURE
Fig 3.1 System Architecture
A. Collect data
This is the data preparation phase. Feature scaling usually helps to bring all data items to a standard range to improve the performance of clustering algorithms [12], so that each data point varies from about −2 to +2. Normalization techniques, which include min-max, decimal scaling, and the standard z-score strategy, are used to standardize the data before clustering. While you'll be occupied with analyzing the dataset, you should also start the process of collecting your data in the right shape and format. It could be the same format as in the reference dataset (if that fits your purpose), or, if the difference is quite substantial, some other format.
Data are usually divided into two types: structured and unstructured. The simplest example of structured data would be a .xls or .csv file where every column stands for an attribute of the data. Unstructured data could be represented by a set of text files, photos, or video files. Often, business dictates how to organize the collection and storage of data.
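As a sketch of loading such structured data, the snippet below reads a tiny in-memory CSV with pandas; the column names are hypothetical, chosen only to resemble a typical customer file:

```python
import io
import pandas as pd

# In-memory stand-in for a structured .csv file; these column names
# are an assumption, not the project's actual dataset schema
csv_text = """CustomerID,Gender,Age,AnnualIncome,SpendingScore
1,Male,19,15,39
2,Female,21,15,81
3,Female,20,16,6
"""

# Every column is one attribute of the data, as described above
df = pd.read_csv(io.StringIO(csv_text))
```

With a real file, `pd.read_csv("customers.csv")` would take its place.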
B. Data Exploration
Data exploration refers to the initial step in data analysis, in which data analysts use data visualization and statistical techniques to describe dataset characteristics, such as size, quantity, and accuracy, in order to better understand the nature of the data.
Data exploration, also known as exploratory data analysis (EDA), is a process where users
look at and understand their data with statistical and visualization methods. This step helps
identify patterns and problems in the dataset, as well as decide which model or algorithm to
use in subsequent steps. Although sometimes researchers tend to spend more time on model
architecture design and parameter tuning, the importance of data exploration should not be
ignored.
C. Algorithm Selection
Considering the knowledge gained from the literature survey, the three most used and efficient algorithms are taken into account for clustering the customers: the K-means clustering algorithm, the mini-batch K-means clustering algorithm, and the hierarchical clustering algorithm. The three algorithms are deployed on the dataset respectively.
D. Cluster Results
By deploying the three selected algorithms on the dataset, the customer data has been clustered and clusters are formed. Further analysis of the clusters formed by the different algorithms yields the cluster results for each of the three deployed algorithms. Because clustering is unsupervised, no ground "truth" is available to verify the results, and this absence of truth complicates assessing quality.
E. Comparison and Determination of Precise Algorithm
Checking the quality of clustering is not a rigorous process, because clustering lacks a "truth". When implementing a clustering model with no target to aim at, it is not possible to calculate an accuracy score. Hence, the aim is to create clusters with distinct or unique characteristics. The two most common metrics to measure the distinctness of clusters are the Silhouette Coefficient and the Davies-Bouldin Index. By comparing the metric scores produced by the three algorithms, the most precise algorithm is determined.
3.7 ALGORITHMS USED
3.7.1 K-Means Clustering
The algorithm takes the unlabelled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.
o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. Those data points which are near to the
particular k-center, create a cluster.
The working of the K-means algorithm is explained in the steps below:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids. (They can be points other than those in the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid in each cluster.
Step-5: Repeat the third step, reassigning each data point to the new closest centroid of its cluster.
Step-6: If any reassignment occurred, go to Step-4; otherwise, continue.
Step-7: The model is ready.
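The steps above can be sketched directly in NumPy; this is an illustrative toy implementation on synthetic data, not the project's code:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated synthetic groups standing in for customer data
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
               rng.normal(5.0, 0.5, (50, 2))])

k = 2                                                     # Step 1: choose K
centroids = X[rng.choice(len(X), size=k, replace=False)]  # Step 2: random centroids

for _ in range(100):
    # Step 3: assign each point to its closest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 4: recompute each centroid as the mean of its cluster
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # Steps 5-6: repeat until no centroid moves; then the model is ready (Step 7)
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids
```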
3.7.2 Hierarchical Clustering
The distance between two clusters is crucial for hierarchical clustering. There are various ways to calculate the distance between two clusters, and these decide the rule for merging clusters. These measures are called linkage methods.
Single Linkage: It is the Shortest Distance between the closest points of the clusters.
Complete Linkage: It is the farthest distance between the two points of two different
clusters. It is one of the popular linkage methods as it forms tighter clusters than
single-linkage.
Average Linkage: The linkage method in which the distances between all pairs of points from the two clusters are added up and then divided by the number of pairs, giving the average distance between the two clusters. It is also one of the most popular linkage methods.
Centroid Linkage: The linkage method in which the distance between the centroids of the two clusters is calculated.
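The linkage methods listed above can be tried with SciPy's hierarchical clustering; a sketch on synthetic two-group data:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)
# Two compact synthetic groups standing in for customer data
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)),
               rng.normal(4.0, 0.3, (20, 2))])

# Each method implements one of the linkage rules described above
results = {}
for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method)                 # build the merge tree
    results[method] = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
```

On well-separated data, all four linkages recover the same two groups; on noisy data, they can differ considerably.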
3.7.3 MiniBatch K-Means
There is no doubt that k-means is one of the most popular clustering algorithms because of its performance and low time cost, but as the size of the datasets being analyzed increases, so does the computation time of k-means. To overcome this, a different approach called the Mini-Batch K-means algorithm was introduced. Its main idea is to draw small, fixed-size random mini-batches from the dataset and use each new mini-batch to update the clusters, repeating this iteration until convergence.
The Mini-Batch K-means algorithm's main idea is to use small random batches of data of a fixed size, so that they can be stored in memory. In each iteration, a new random sample from the dataset is obtained and used to update the clusters, and this is repeated until convergence. Each mini-batch updates the clusters using a convex combination of the values of the prototypes and the data, applying a learning rate that decreases with the number of iterations. This learning rate is the inverse of the number of data points assigned to a cluster during the process. As the number of iterations increases, the effect of new data is reduced, so convergence can be detected when no changes in the clusters occur over several consecutive iterations.
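A sketch of the same idea with scikit-learn's MiniBatchKMeans, on a larger synthetic dataset where full k-means starts to get costly (the data and parameter values here are illustrative assumptions):

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# A larger synthetic dataset standing in for a big customer table
X, _ = make_blobs(n_samples=10_000, centers=4, random_state=7)

# batch_size is the fixed size of the random mini-batches held in memory;
# each update uses one new random batch until convergence is detected
mbk = MiniBatchKMeans(n_clusters=4, batch_size=256, n_init=3, random_state=7)
labels = mbk.fit_predict(X)
```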
3.7.4 Elbow Method
Determining the optimal number of clusters for a given dataset is the most fundamental step for any unsupervised algorithm. The Elbow method helps us determine the best value of k: plotting the sum of squared distances between the data points and their assigned cluster centroids against k, the k value is selected at the point where the curve starts to flatten out, forming an "elbow". The optimal number of clusters is thereby determined.
The Elbow method is one of the most popular ways to find the optimal number of
clusters. This method uses the concept of WCSS value. WCSS stands for Within
Cluster Sum of Squares, which defines the total variations within a cluster. The formula
to calculate the value of WCSS (for 3 clusters) is given below:
WCSS = ∑Pi in Cluster1 distance(Pi, C1)² + ∑Pi in Cluster2 distance(Pi, C2)² + ∑Pi in Cluster3 distance(Pi, C3)²
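In scikit-learn, the fitted model's `inertia_` attribute is exactly this WCSS, so the elbow curve can be computed in a short loop (a sketch on synthetic data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the customer dataset
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# inertia_ is the WCSS: sum of squared distances of the points
# to their assigned cluster centroid
wcss = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)

# WCSS shrinks as k grows; the "elbow", where the curve flattens,
# marks the chosen number of clusters (one would plot wcss against k)
```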
3.8 MODULES
The project contains three parts:
❑ Dataset Collection − We collected the dataset from Kaggle. It contains customer records with attributes relevant to segmentation, in 200 rows and 5 columns.
❑ Train and test the model − We used three clustering algorithms, the K-means clustering algorithm, hierarchical clustering, and the mini-batch K-means algorithm, on the dataset. After training, we tested the models and found their clusters, silhouette scores, and Davies-Bouldin scores.
❑ Deploy the models − We deployed the models to obtain the clusters formed. The clusters show the different segments of customers based on many attributes. From this, we obtain the silhouette score and Davies-Bouldin score of each model as output.
D) Train the dataset using the K-means clustering algorithm, hierarchical clustering, and the mini-batch K-means algorithm.
E) Test the models and find the clusters, silhouette score, and Davies-Bouldin score.
G) Based on the scores, determine which algorithm is best for customer segmentation and go ahead with that clustering algorithm.
CHAPTER 4
Unlike supervised algorithms, such as a linear regression model, where there is a target to predict and accuracy can be measured using metrics such as RMSE, MAPE, MAE, etc., a clustering model has no target to aim at, so it is not possible to calculate an accuracy score. Hence, the aim is to create clusters with distinct or unique characteristics. The two most common metrics to measure the distinctness of clusters are:
Silhouette Coefficient:
The silhouette score measures, for each object, how similar it is to the other objects in its own cluster compared with the objects in the nearest other cluster. For a point i, let a(i) be the mean distance from i to the other points of its own cluster, and b(i) the mean distance from i to the points of the nearest other cluster. Then:

s(i) = (b(i) − a(i)) / max(a(i), b(i))

This score ranges between -1 and 1, where higher scores indicate more well-defined and distinct clusters.
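A minimal sketch of the per-point computation, using the standard definition s(i) = (b(i) − a(i)) / max(a(i), b(i)); the toy points and labels below are illustrative:

```python
import numpy as np

def silhouette_point(i, points, labels):
    """s(i) = (b - a) / max(a, b), where a is the mean distance from point i
    to the other members of its own cluster and b is the mean distance from i
    to the members of the nearest other cluster."""
    d = np.linalg.norm(points - points[i], axis=1)  # distances from point i
    own = labels == labels[i]
    a = d[own & (np.arange(len(points)) != i)].mean()
    b = min(d[labels == k].mean() for k in set(labels) if k != labels[i])
    return (b - a) / max(a, b)

# Two tight, well-separated 1-D clusters: score should be close to 1.
pts = np.array([[0.0], [1.0], [10.0], [11.0]])
lab = np.array([0, 0, 1, 1])
print(silhouette_point(0, pts, lab))
```

In practice scikit-learn's `silhouette_score` averages these per-point values over the whole dataset.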
Davies-Bouldin Index:
The Davies–Bouldin (DB) criterion is based on a ratio between "within-cluster" and "between-cluster" distances:

Dij = (d̄i + d̄j) / dij

where Dij is the "within-to-between cluster distance ratio" for the i-th and j-th clusters, d̄i is the average distance between every data point in cluster i and its centroid (similarly for d̄j), and dij is the Euclidean distance between the centroids of the two clusters. The index is the average, over all clusters, of the largest such ratio.
Contrary to the Silhouette score, this metric measures the similarity among the clusters, so the lower the score, the better the clusters formed.
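The ratio above can be sketched directly. This is an illustrative implementation of the standard DB index, not the report's code; the toy points are assumptions:

```python
import numpy as np

def davies_bouldin(points, labels):
    """DB index: mean over clusters i of max over j != i of
    (d_i + d_j) / dist(c_i, c_j), where d_i is the average distance of
    cluster i's points to its centroid c_i."""
    ks = sorted(set(labels))
    cents = np.array([points[labels == k].mean(axis=0) for k in ks])
    spread = np.array([np.linalg.norm(points[labels == k] - cents[i], axis=1).mean()
                       for i, k in enumerate(ks)])
    worst = []
    for i in range(len(ks)):
        ratios = [(spread[i] + spread[j]) / np.linalg.norm(cents[i] - cents[j])
                  for j in range(len(ks)) if j != i]
        worst.append(max(ratios))  # worst-case similarity for cluster i
    return float(np.mean(worst))

# Compact, well-separated clusters give a small DB value.
pts = np.array([[0.0], [1.0], [10.0], [11.0]])
lab = np.array([0, 0, 1, 1])
print(davies_bouldin(pts, lab))  # 0.1
```

scikit-learn exposes the same quantity as `sklearn.metrics.davies_bouldin_score`.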
CHAPTER 5
Customer segmentation plays a significant role in attracting customers to products, which in turn helps scale the business up in the market. Segmenting customers into different groups according to the similarities they possess helps marketers, on the one hand, to provide customized ads, products, and offers; on the other hand, it helps customers by sparing them the confusion of deciding which products to buy.
Comparing the clusters obtained by deploying the three clustering algorithms on the customers' data, using metrics that measure the distinctness and uniqueness of the clusters, it is observed that the K-means algorithm produces the best clusters, obtaining the highest Silhouette score and the lowest Davies-Bouldin score, followed by hierarchical clustering and mini-batch K-means clustering.
That said, K-means cannot be called the most effective clustering algorithm in every case; the outcome depends on various factors such as the size of the data, the attributes of the data, and so on. This project can be further enhanced by including other clustering algorithms that may prove more proficient, and by considering larger datasets, which in turn increases the efficiency.
APPENDICES
A. SOURCE CODE
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

df = pd.read_csv("Mall_Customers.csv")
df.head()
df.shape
df.describe()
df.dtypes
df.isnull().sum()
df.drop(["CustomerID"], axis=1, inplace=True)
df.head()
# Age bins (definitions reconstructed to match the filtering pattern of the
# spending-score bins below).
age_18_25 = df["Age"][(df["Age"] >= 18) & (df["Age"] <= 25)]
age_26_35 = df["Age"][(df["Age"] >= 26) & (df["Age"] <= 35)]
age_36_45 = df["Age"][(df["Age"] >= 36) & (df["Age"] <= 45)]
age_46_55 = df["Age"][(df["Age"] >= 46) & (df["Age"] <= 55)]
age_55above = df["Age"][df["Age"] > 55]

agex = ["18-25", "26-35", "36-45", "46-55", "55+"]
agey = [len(age_18_25.values), len(age_26_35.values), len(age_36_45.values),
        len(age_46_55.values), len(age_55above.values)]

plt.figure(figsize=(15, 6))
sns.barplot(x=agex, y=agey, palette="mako")
plt.title("Number of Customer and Ages")
plt.xlabel("Age")
plt.ylabel("Number of Customer")
plt.show()

sns.relplot(x="No. of Purchases", y="Spending Score (1-100)", data=df)
ss_1_20 = df["Spending Score (1-100)"][(df["Spending Score (1-100)"] >= 1) & (df["Spending Score (1-100)"] <= 20)]
ss_21_40 = df["Spending Score (1-100)"][(df["Spending Score (1-100)"] >= 21) & (df["Spending Score (1-100)"] <= 40)]
ss_41_60 = df["Spending Score (1-100)"][(df["Spending Score (1-100)"] >= 41) & (df["Spending Score (1-100)"] <= 60)]
ss_61_80 = df["Spending Score (1-100)"][(df["Spending Score (1-100)"] >= 61) & (df["Spending Score (1-100)"] <= 80)]
ss_81_100 = df["Spending Score (1-100)"][(df["Spending Score (1-100)"] >= 81) & (df["Spending Score (1-100)"] <= 100)]
# First two purchase bins reconstructed to match the pattern of the others.
ai0_30 = df["No. of Purchases"][(df["No. of Purchases"] >= 0) & (df["No. of Purchases"] <= 30)]
ai31_60 = df["No. of Purchases"][(df["No. of Purchases"] >= 31) & (df["No. of Purchases"] <= 60)]
ai61_90 = df["No. of Purchases"][(df["No. of Purchases"] >= 61) & (df["No. of Purchases"] <= 90)]
ai91_120 = df["No. of Purchases"][(df["No. of Purchases"] >= 91) & (df["No. of Purchases"] <= 120)]
ai121_150 = df["No. of Purchases"][(df["No. of Purchases"] >= 121) & (df["No. of Purchases"] <= 150)]

aix = ["$ 0 - 30,000", "$ 30,001 - 60,000", "$ 60,000 - 90,000", "$ 90,001 - 120,000",
       "120,001 - 150,000"]
aiy = [len(ai0_30.values), len(ai31_60.values), len(ai61_90.values),
       len(ai91_120.values), len(ai121_150.values)]
# X1 selects Age and Spending Score (per the axis labels of the plot below).
X1 = df.loc[:, ["Age", "Spending Score (1-100)"]].values
kmeans = KMeans(n_clusters=4)  # cluster count assumed; choose it from the elbow plot
label = kmeans.fit_predict(X1)
print(label)
print(kmeans.cluster_centers_)
plt.scatter(X1[:, 0], X1[:, 1], c=kmeans.labels_, cmap='rainbow')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], color='black')
plt.title('Clusters of Customers')
plt.xlabel('Age')
plt.ylabel('Spending Score(1-100)')
plt.show()
X2 = df.loc[:, ["No. of Purchases", "Spending Score (1-100)"]].values

# Elbow method for X2 (loop reconstructed to match the X3 block below).
wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, init="k-means++")
    kmeans.fit(X2)
    wcss.append(kmeans.inertia_)

plt.figure(figsize=(12, 6))
plt.grid()
plt.plot(range(1, 11), wcss, linewidth=2, color="red", marker="8")
plt.xlabel("K Value")
plt.ylabel("WCSS")
plt.show()

kmeans = KMeans(n_clusters=5)
label = kmeans.fit_predict(X2)
print(label)
print(kmeans.cluster_centers_)
plt.scatter(X2[:, 0], X2[:, 1], c=kmeans.labels_, cmap='rainbow')  # was X1[:,1]: typo fixed
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], color='black')
plt.title('Clusters of Customers')
plt.xlabel('No. of Purchases')
plt.ylabel('Spending Score(1-100)')
plt.show()
X3 = df.iloc[:, 1:]
wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, init="k-means++")
    kmeans.fit(X3)
    wcss.append(kmeans.inertia_)
plt.figure(figsize=(12, 6))
plt.grid()
plt.plot(range(1, 11), wcss, linewidth=2, color="red", marker="8")
plt.xlabel("K Value")
plt.ylabel("WCSS")
plt.show()
kmeans = KMeans(n_clusters=5)
label = kmeans.fit_predict(X3)
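The listing above does not show the metric computation described in Chapter 4. A minimal sketch of that step follows, assuming scikit-learn; synthetic data stands in for X3 so the snippet runs without Mall_Customers.csv:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Stand-in for X3: three well-separated synthetic customer groups.
rng = np.random.default_rng(0)
X3 = np.vstack([rng.normal(loc=c, scale=0.5, size=(40, 3)) for c in (0, 5, 10)])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
label = kmeans.fit_predict(X3)

# Higher silhouette and lower Davies-Bouldin mean better-formed clusters.
print("Silhouette:", round(silhouette_score(X3, label), 3))
print("Davies-Bouldin:", round(davies_bouldin_score(X3, label), 3))
```

Running the same two calls on the fitted labels for each of the three algorithms yields the scores compared in Chapter 5.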
B. SCREENSHOTS
B.1. Dataset
B.3. Plotting of Gender
B.5. Elbow method for determining No. of Clusters