Final Documentation

ABSTRACT
This project dives deep into k-means clustering, one of the most popular algorithms for
grouping similar data points together. But here's the thing: k-means can be a bit finicky,
especially when it comes to those initial cluster centres and choosing the right distance
metric. So, we set out to analyse how these two factors – initialization and distance –
affect how well k-means actually performs.
We looked at three different ways to improve the starting point for k-means: clustering
smaller chunks of the data (subsamples), using PCA to reduce the number of
dimensions, and transforming the data to make everything positive. We also explored
why using different distance metrics is so important, especially when you're dealing
with data that's ranked, like customer preferences for products.
Our results show that both initialization and distance metrics play a major role in getting
good results from k-means. If you choose the right combination for your data, you can
find more accurate clusters and make better decisions, whether it's for marketing
campaigns or any other application where grouping similar things matters.
LIST OF FIGURES

Figure No.   Title                                          Page No.

LIST OF TABLES

Table No.    Title                                          Page No.
6.1          Summary of Common Distance Metrics             27
7.1          Iris Data Set                                  31
7.2          Performance Analysis                           32
Contents:

DESCRIPTION                                                                 PAGE NO.
DECLARATION CERTIFICATE                                                     ii
CERTIFICATE OF APPROVAL                                                     iii
ACKNOWLEDGEMENTS                                                            iv
ABSTRACT                                                                    v
LIST OF FIGURES                                                             vi
LIST OF TABLES                                                              vii
1. INTRODUCTION                                                             1
2. LITERATURE SURVEY                                                        2
   2.1 Refining Initial Points for K-Means Clustering                       2
   2.2 Performance Analysis of K-Means with Different Initialization Methods  3
   2.3 Enhancing K-Means Clustering Algorithm with Improved Initial Centre    3
   2.4 Clustering on Ranked Data for Campaign Selection                     4
   2.5 K-Means Clustering and Related Algorithms                            4
   2.6 Review on Determining the Number of Clusters in K-Means Clustering   5
3. PROBLEM DEFINITION                                                       6
   3.1 Aim                                                                  6
   3.2 Impact on K-Means                                                    8
   3.3 The Need for Dimensionality Reduction                                11
       3.3.1 Principal Component Analysis (PCA)                             11
       3.3.2 Other Dimensionality Reduction Techniques                      12
4. SYSTEM REQUIREMENT & SPECIFICATION                                       13
5. SYSTEM DESIGN & MODELLING                                                17
6. METHODOLOGY                                                              21
   6.1 Clustering Subsamples                                                22
       6.1.1 Euclidean Distance                                             22
       6.1.2 Manhattan Distance                                             22
       6.1.3 Chebyshev Distance                                             25
       6.1.4 Minkowski Distance                                             26
   6.2 Distance Metrics for Specific Data Types                             27
7. APPLICATION & ANALYSIS: CLUSTERING IRIS DATA WITH DIFFERENT DISTANCE METRICS  28
8. RESULT ANALYSIS & DISCUSSION                                             33
9. CONCLUSION                                                               37
10. FUTURE SCOPE                                                            38
11. REFERENCES                                                              40
Appendix                                                                    42
1. INTRODUCTION
The K-means algorithm is one of the most popular and widely used clustering
techniques due to its simplicity and efficiency. It operates by partitioning a dataset into
K clusters, each represented by its centroid, which is the mean of the points in the
cluster. Despite its widespread application, K-means clustering faces significant
challenges, primarily due to its sensitivity to the initial placement of centroids. The
random initialization of centroids can lead to different clustering results each time the
algorithm is executed, often resulting in suboptimal solutions. Poorly chosen initial
centroids may cause the algorithm to converge to a local optimum, a solution that, while
adequate, is not the best possible one. Additionally, this random initialization can
sometimes result in empty clusters if the initial centroids are placed too far from any
data points.
These challenges underline the necessity for refinement techniques aimed at improving
the initial selection of centroids, thereby enhancing the algorithm's performance.
Refinement techniques are particularly critical when dealing with large or high-
dimensional datasets, where the likelihood of poor initial centroid placement increases.
By employing methods such as clustering subsamples, utilizing Principal Component
Analysis (PCA) for dimensionality reduction, and transforming the data to ensure
positivity, the initialization process can be significantly improved. Furthermore, the
choice of distance metric is pivotal, especially for data types that are not naturally suited
to the standard Euclidean distance metric. For example, in datasets involving ranked or
categorical data, alternative metrics such as Kendall's tau or Spearman's footrule can
lead to more meaningful and accurate clustering outcomes.
2. LITERATURE SURVEY
K-means clustering, a widely used partitioning algorithm, is a staple in data mining and
machine learning for grouping similar data points into clusters. However, as we know,
the algorithm's performance can be significantly affected by the initial placement of
cluster centroids. This sensitivity to initialization has motivated researchers to explore
various refinement techniques to improve the algorithm's robustness and accuracy.
Furthermore, the choice of distance metric, which determines how similarity between
data points is measured, plays a crucial role in shaping the resulting clusters and
influencing the overall effectiveness of k-means.
This Literature Survey examines six key research papers that address the challenges
of k-means initialization and the impact of distance metrics, providing the foundation
for our project.
2.1 Refining Initial Points for K-Means Clustering (Bradley and Fayyad)
Bradley and Fayyad tackled the initialization problem by proposing a technique based
on clustering multiple subsamples of the data. Their method involves the following
steps:
1. Drawing several small random subsamples of the dataset.
2. Running k-means on each subsample to obtain a set of candidate centroids.
3. Clustering the pooled candidate centroids, using each subsample's solution in turn as the starting point, and keeping the solution with the lowest distortion (the "smoothing" step).
4. Using the resulting smoothed centroids as the initial centroids for k-means on the full dataset.
This approach aims to mitigate the impact of outliers and random initialization by
leveraging information from multiple subsamples and smoothing the resulting
centroids.
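To make the idea concrete, here is a minimal Python sketch of subsample-based refinement using NumPy and scikit-learn. The helper name refined_init, the subsample size, and the other parameters are illustrative choices of ours, not the exact procedure from the paper:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

def refined_init(X, k, n_subsamples=10, subsample_size=50, seed=0):
    rng = np.random.default_rng(seed)
    candidate_sets = []
    for _ in range(n_subsamples):
        idx = rng.choice(len(X), size=min(subsample_size, len(X)), replace=False)
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[idx])
        candidate_sets.append(km.cluster_centers_)
    pooled = np.vstack(candidate_sets)            # all candidate centroids from all subsamples
    best, best_inertia = None, np.inf
    for centers in candidate_sets:                # "smoothing": cluster the candidates themselves
        km = KMeans(n_clusters=k, init=centers, n_init=1).fit(pooled)
        if km.inertia_ < best_inertia:
            best, best_inertia = km.cluster_centers_, km.inertia_
    return best                                   # refined starting centroids for the full data

print(refined_init(load_iris().data, k=3))        # Iris used here only as a convenient example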
2.2 Performance Analysis of K-Means with Different Initialization Methods for
High Dimensional Data (Tajunisha and Saravanan)
Tajunisha and Saravanan focused on the challenges posed by high-dimensional data for
k-means clustering. They explored the use of Principal Component Analysis (PCA) as
a dimensionality reduction technique to improve the selection of initial centroids. Their
method involves:
1. Standardizing the data and applying PCA to obtain its principal components.
2. Ordering the data points along the first principal component, the direction of greatest variance.
3. Partitioning this ordering into k groups and selecting a representative point from each group as an initial centroid.
By leveraging PCA, this technique aims to identify initial centroids that lie along the
directions of greatest variance in the data, increasing the likelihood of finding well-
separated clusters. The authors demonstrated the effectiveness of PCA-based
initialization for high-dimensional data, showing improved accuracy compared to
random initialization.
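The following is one possible reading of PCA-based seeding, written as a small scikit-learn sketch. The helper name pca_init and the exact way the k representatives are picked are our own illustrative choices, not necessarily the authors' precise procedure:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_init(X, k):
    Xs = StandardScaler().fit_transform(X)
    pc1 = PCA(n_components=1).fit_transform(Xs).ravel()  # score of each point on the first principal component
    order = np.argsort(pc1)                              # order the points along that direction
    groups = np.array_split(order, k)                    # k contiguous slices of the ordering
    # take the point in the middle of each slice as an initial centroid
    return np.array([X[g[len(g) // 2]] for g in groups])

print(pca_init(load_iris().data, k=3))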
2.3 Enhancing K-Means Clustering Algorithm with Improved Initial Centre (Yedla et al.)
Yedla et al. presented a simpler approach to finding better initial centroids. Their
method involves transforming the data to a positive space and then selecting centroids
based on their distances from the origin. The steps are as follows:
1. Checking for negative attribute values and, if present, shifting every attribute so that all values become positive.
2. Computing the distance of each data point from the origin.
3. Sorting the data points by this distance.
4. Dividing the sorted data points into k equal sets and choosing the median point
of each set as an initial centroid.
This technique exploits the idea that after transformation, data points closer to the origin
are likely to be more "central" within their respective clusters. The authors showed that
this method leads to better accuracy and reduced computational time compared to
standard k-means.
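A minimal sketch of this positive-space, origin-distance seeding is shown below; positive_space_init is an illustrative helper name, and the input is assumed to be a NumPy feature matrix:

import numpy as np
from sklearn.datasets import load_iris

def positive_space_init(X, k):
    X_shifted = X - X.min(axis=0)                  # shift every attribute so all values are non-negative
    dist_from_origin = np.linalg.norm(X_shifted, axis=1)
    order = np.argsort(dist_from_origin)           # sort the points by distance from the origin
    groups = np.array_split(order, k)              # divide the sorted points into k equal sets
    return np.array([X[g[len(g) // 2]] for g in groups])  # median point of each set

print(positive_space_init(load_iris().data, k=3))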
2.4 Clustering on Ranked Data for Campaign Selection (Gupta et al.)
Gupta et al. (2020) explored the application of k-means clustering to ranked data,
specifically for the task of campaign selection in marketing. They highlight the
inadequacy of standard distance metrics like Euclidean distance for ranked data and
introduce alternative metrics such as Kendall's tau and Spearman's footrule.
The authors demonstrate the effectiveness of their method for identifying target
customer segments for marketing campaigns, showcasing the importance of using
appropriate distance metrics for ranked data.
2.5 K-Means Clustering and Related Algorithms (Adams)
Adams also explores practical considerations for implementing k-means, including data
standardization, handling missing data, and code vectorization techniques.
Furthermore, he introduces related concepts like spectral clustering, affinity
propagation, and biclustering, offering a broader perspective on the field of clustering.
2.6 Review on Determining the Number of Clusters in K-Means Clustering
The review compares several common approaches for choosing the number of clusters, k:
1. Rule of thumb: A simple approach based on the square root of the number of
data points.
2. Elbow Method: A visual method analyzing the within-cluster sum of squares
(WCSS) for different k values.
3. Information Criterion Approach: Statistical methods like Akaike's
Information Criterion (AIC) and Bayesian Information Criterion (BIC), which
balance model fit with model complexity.
4. Information Theoretic Approach: An approach based on rate distortion
theory, which considers the trade-off between compression and distortion.
5. Silhouette Method: Measures the similarity of a data point to its own cluster
compared to other clusters.
6. Cross-validation: Assesses the stability of clustering solutions by comparing
results across different data splits.
The authors highlight the strengths and limitations of each approach and emphasize the
importance of choosing a method that aligns with the specific dataset and clustering
goals.
3. PROBLEM DEFINITION
As we venture deeper into the world of data analysis, we often encounter datasets with
a large number of features or attributes. These are called high-dimensional datasets.
Think of it like describing a car—you could use just a few basic features like color and
size, or you could get really detailed with things like engine capacity, fuel efficiency,
safety ratings, and so on. The more features you add, the higher the dimensionality of
your data.
Clustering is a critical task in data analysis and machine learning, used extensively to
discover patterns and groupings within data. The K-means clustering algorithm, despite
its popularity, faces several inherent challenges that can significantly affect its
performance and the quality of the resulting clusters. These challenges need to be
addressed to improve the algorithm's robustness and applicability to a wider range of
datasets and practical applications.
3.1 Aim
The primary aim of this project is to enhance the performance and reliability of the K-
means clustering algorithm. By systematically addressing the key challenges associated
with K-means clustering, this project seeks to develop an improved version of the
algorithm that delivers accurate, consistent, and meaningful clustering results across
diverse datasets and practical scenarios.
▪ Objectives
To achieve this, the project will focus on the following specific objectives:
1. Improve Initialization:
Techniques such as refined k-means initialization, clustering subsamples, and using Principal
Component Analysis (PCA) for dimensionality reduction will be explored.
2. Optimize Distance Metrics:
Assess the performance of different distance metrics for various types of data to
determine the most suitable metrics that yield meaningful clusters. This includes
evaluating alternative metrics such as Manhattan distance, cosine similarity, and
specialized metrics for categorical and ranked data.
3. Determine the Optimal Number of Clusters:
Implement and compare techniques for selecting the optimal number of clusters, K, to
enhance clustering accuracy without prior domain knowledge. Methods such as the
Elbow method, Silhouette analysis, and the Gap statistic will be considered.
4. Enhance Scalability:
Investigate ways to keep the algorithm efficient on large datasets, for example through
variants such as Mini-Batch K-means.
The project addresses the following challenges to meet its aim and objectives:
1. Sensitivity to Initial Centroids:
- The quality of the final clusters depends heavily on where the initial centroids are
placed; poor random placements can drive the algorithm towards a local optimum.
2. Empty Clusters:
- Occasionally, the K-means algorithm may produce empty clusters, especially if
initial centroids are placed far from any data points. This situation not only wastes
computational resources but also fails to provide meaningful insights from the data.
4. Unsuitable Distance Metrics:
- The standard Euclidean distance metric used in K-means may not be suitable for all
types of data, particularly for non-numeric, categorical, or ranked data. Alternative
distance metrics need to be considered to improve clustering quality for diverse data
types.
5. Scalability Issues:
- For very large datasets, the repeated distance computations make standard K-means
computationally expensive, which motivates more scalable variants.
The standard K-means algorithm proceeds through the following steps:
1. Initialization:
- Randomly select K initial centroids from the dataset. These centroids serve as the
starting points for the clustering process. The choice of initial centroids significantly
impacts the algorithm's performance and the quality of the resulting clusters.
2. Assignment:
- Assign each data point to the nearest centroid based on a chosen distance metric,
typically the Euclidean distance. This step creates K clusters where each data point
belongs to the cluster with the closest centroid.
3. Update:
Recalculate the centroids of the K clusters by computing the mean of all data points
assigned to each cluster. These new centroids represent the updated cluster centers.
4. Iteration:
- Repeat the assignment and update steps until convergence is achieved. Convergence
occurs when the centroids no longer change significantly or a predetermined number of
iterations is reached. This iterative process ensures that the algorithm refines the cluster
boundaries to minimize the variance within each cluster.
5. Evaluation:
Evaluate the clustering results using metrics such as the within-cluster sum of squares
(WCSS) to measure the compactness of the clusters. The algorithm aims to minimize
WCSS to achieve well-defined and compact clusters.
6. Post-processing:
- Optionally examine the final centroids and cluster memberships, and visualize the
clusters to interpret the resulting groups.
By systematically following this approach, the K-means algorithm partitions the dataset
into K distinct clusters, each characterized by its centroid. Despite its simplicity, the
algorithm's effectiveness relies heavily on the initialization method, choice of distance
metric, and scalability optimizations, which are the primary focus of this project.
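To make these steps concrete, the following minimal Python sketch implements them with NumPy; the function name simple_kmeans and its parameters are our own illustrative choices, not part of any library:

import numpy as np

def simple_kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # 1. Initialization
    for _ in range(max_iter):
        # 2. Assignment: nearest centroid by Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: new centroid = mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # 4. Iteration: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    wcss = np.sum((X - centroids[labels]) ** 2)                # 5. Evaluation (WCSS)
    return labels, centroids, wcss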
This might sound a bit dramatic, but it's a real problem! The "curse of dimensionality"
refers to a set of phenomena that occur when we're dealing with data in many
dimensions. One key issue is that as the number of dimensions increases, the amount
of space we need to represent the data grows exponentially. Imagine trying to find a
needle in a haystack – that's hard enough. Now, imagine trying to find that needle in a
barn full of haystacks! That's what it can be like trying to find meaningful clusters in
high-dimensional spaces.
The figure shows how data points become more spread out as the number of dimensions
increases, making it harder to define clusters.
Another problem is that distance metrics, which are crucial for k-means, become less
effective in high dimensions. We'll talk more about this in the next section.
3.2 Impact on K-Means
The challenges of high-dimensional data have some specific consequences for the k-
means algorithm:
• Empty Clusters: The risk of placing a centroid far away from any data points
increases as the number of dimensions grows.
• Local Optima: K-means is already prone to getting stuck in local optima, and
this problem becomes worse in high dimensions. The algorithm might converge
to a solution that's okay, but not the best possible one.
3.3 The Need for Dimensionality Reduction
3.3.1 Principal Component Analysis (PCA)
PCA finds new, uncorrelated axes (principal components) ordered by how much of the
variance in the data each one captures. By keeping only the first few principal components,
we can often reduce the dimensionality of our data without losing too much important
information. This makes it easier for clustering algorithms like k-means to find
meaningful groups.
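As a small illustration, the sketch below reduces the Iris features to two principal components with scikit-learn and prints how much variance each kept component retains (Iris is used here only as a convenient example):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                                  # 150 x 4 feature matrix
X_scaled = StandardScaler().fit_transform(X)          # put features on a comparable scale
pca = PCA(n_components=2)                             # keep the first two principal components
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)                  # variance retained by each kept component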
3.3.2 Other Dimensionality Reduction Techniques
PCA is just one example of a dimensionality reduction technique. There are many
others, each with its own strengths and weaknesses. Some common ones include t-SNE,
UMAP, Linear Discriminant Analysis (LDA), and autoencoder-based methods.
4. SYSTEM REQUIREMENT & SPECIFICATION
The System Requirements Specification (SRS) for a Medicine Recommendation
System outlines the key aspects of the system’s design and functionality. It begins with
an introduction, providing an overview of the system and its purpose, identifying
stakeholders, and defining the scope. The functional requirements section specifies the
necessary functionalities, including user registration, patient profile management,
medication data integration, recommendation generation, interaction and feedback
mechanisms, real-time data updates, and explainability and interpretability features.
Non-functional requirements cover aspects such as system performance, security,
usability, scalability, reliability, compatibility, and compliance with regulations.
System constraints detail any technology, integration, data, budget, or time limitations
that need to be considered. The SRS also includes provisions for system verification
and validation, including testing procedures, performance evaluation metrics, and
documentation requirements. Overall, the SRS serves as a comprehensive document
that guides the development, testing, and deployment of the Medicine Recommendation
System, ensuring that it meets the specific requirements and expectations of the
stakeholders.
4.3 Software Requirements
For the computation and analysis, we need certain Python libraries to perform the
analytics. Packages such as scikit-learn (SKlearn), NumPy, Pandas, Matplotlib, and the
Flask framework are needed.
• SKlearn:
It features various classification, regression and clustering algorithms including
support vector machines, random forests, gradient boosting, k-means and
DBSCAN, and is designed to interoperate with the Python numerical and
scientific libraries NumPy and SciPy.
• NumPy:
NumPy is a general-purpose array-processing package. It provides a high-
performance multidimensional array object, and tools for working with these
arrays. It is the fundamental package for scientific computing with Python.
• Flask:
It is a lightweight WSGI web application framework. It is designed to make
getting started quick and easy, with the ability to scale up to complex
applications. It began as a simple wrapper around Werkzeug.
• Pandas:
Pandas is one of the most widely used Python libraries in data science. It provides
high-performance, easy-to-use data structures, most notably the in-memory
two-dimensional DataFrame, and is an essential tool for data cleaning, exploration,
visualization tasks, and general data manipulation.
• NLTK Toolkit:
The Natural Language Toolkit (NLTK) is a Python programming environment
for creating applications for statistical natural language processing (NLP).
In addition to the standard NLP tasks, such as tokenization and parsing, NLTK
includes tools for sentiment analysis. This enables the toolkit to determine the
sentiment of a given piece of text, which can be useful for applications such as
social media monitoring or product review analysis.
While NLTK is a powerful toolkit in its own right, it can also be used in
conjunction with other machine learning libraries such as scikit-learn and
TensorFlow. This allows for even more sophisticated NLP applications, such as
deep learning-based language modeling.
• Matplotlib
Matplotlib is a popular plotting library in Python used for creating high-quality
visualizations and graphs. It offers various tools to generate diverse plots,
facilitating data analysis, exploration, and presentation. Matplotlib is flexible,
supporting multiple plot types and customization options, making it valuable
for scientific research, data analysis, and visual communication. It can create
different types of visualization reports like line plots, scatter plots, histograms,
bar charts, pie charts, box plots, and many more different plots. This library also
supports 3-dimensional plotting.
• Seaborn
Seaborn is a Python data visualization library based on Matplotlib. It provides a
high-level interface for drawing attractive and informative statistical graphics,
building on top of Matplotlib and integrating closely with Pandas data structures.
Seaborn helps you explore and understand your data. Its plotting functions
operate on dataframes and arrays containing whole datasets and internally
perform the necessary semantic mapping and statistical aggregation to produce
informative plots. Its dataset-oriented, declarative API lets you focus on what
the different elements of your plots mean, rather than on the details of how to
draw them.
5. SYSTEM DESIGN & MODELLING
This section outlines the system design and modelling approach for our project,
"Performance Analysis of K-Means Clustering using Different Distance Metrics." The
system is designed to be modular and flexible, allowing for easy experimentation with
different initialization techniques, distance metrics, and datasets.
1. Data Input Module:
a. Loads the selected dataset and applies preprocessing such as scaling and handling of missing values.
2. Initialization Module:
a. Implements different k-means initialization techniques:
i. Random Initialization
ii. Clustering Subsamples
iii. PCA-Based Initialization
iv. Positive Space Transformation
b. Allows for easy selection and configuration of initialization methods.
3. Distance Metric Module:
a. Implements the distance metrics under study (Euclidean, Manhattan, Chebyshev,
Minkowski, and rank-based metrics such as Kendall's tau and Spearman's footrule).
b. Allows the metric to be selected independently of the other modules.
4. K-Means Clustering Module:
a. Implements the core K-means assignment and update loop.
b. Uses the selected initialization technique and distance metric from the
other modules.
c. Provides options for setting the number of clusters (k), maximum
iterations, and convergence criteria.
5. Evaluation Module:
a. Computes evaluation metrics such as the within-cluster sum of squares (WCSS) and the silhouette score.
6. Output Module:
a. Produces the plots and summary tables used in the result analysis.
Figure 5.1. Data Flow Diagram Level 0
5.2 System Implementation
The system will be implemented using Python and its powerful libraries for data
analysis and machine learning: NumPy and Pandas for data handling, scikit-learn for
clustering and preprocessing, and Matplotlib and Seaborn for visualization.
6. METHODOLOGY
This section outlines the systematic approach taken to conduct a comprehensive
performance analysis of the K-means clustering algorithm. It includes the experimental
design, data preparation, implementation details, evaluation metrics, and procedures
followed to analyse the algorithm's performance. The goal is to identify and address
key factors influencing the effectiveness and efficiency of K-means clustering.
I. Experimental Design
The datasets used in this study were carefully selected to represent a variety of
characteristics: the Iris dataset (low-dimensional numerical data with known species
labels), an automobile dataset (mixed numerical attributes), and a Sushi preference
dataset (ranked data).
II. Data Preparation
Data preprocessing steps included normalization and scaling to ensure that features
contributed equally to distance calculations. Missing values were handled appropriately
to maintain data integrity.
III. Implementation
The K-means clustering algorithm was implemented using Python and popular machine
learning libraries such as scikit-learn. The following variations were tested: random
initialization, clustering subsamples, PCA-based initialization, and positive space
transformation, each combined with the distance metrics described below.
In k-means clustering, distance metrics play a crucial role in determining how data
points are assigned to clusters. A distance metric quantifies how similar or dissimilar
two data points are. The k-means algorithm assigns each data point to the cluster whose
centroid is closest to it, based on the chosen distance metric.
There are many different distance metrics, and the best one to use depends on the nature
of the data and the goals of the clustering task. Let's explore some common distance
metrics and their properties.
6.1.1 Euclidean Distance
Euclidean distance is probably the most familiar and widely used distance metric. It's
the straight-line distance between two points in space. You can think of it as measuring
the distance between two locations on a map using a ruler.
Formula:
The Euclidean distance between two points p and q in n-dimensional space is:
d(p, q) = sqrt( Σ_{i=1..n} (p_i − q_i)² )
Where:
• p = (p_1, ..., p_n) and q = (q_1, ..., q_n) are the two data points, and n is the number of dimensions (features).
Visualization:
The figure shows the Euclidean distance between two points in a two-dimensional
plane.
Properties:
• Sensitive to the scale of the features and to outliers; tends to produce compact, roughly spherical clusters.
6.1.2 Manhattan Distance
Imagine you're in a city with a grid-like street layout, like Manhattan in New York City.
You can only travel along the streets, not diagonally. The Manhattan distance between
two points is the distance you would have to travel along these streets to get from one
point to the other.
Formula:
The Manhattan distance between two points p and q in n-dimensional space is:
d(p, q) = Σ_{i=1..n} |p_i − q_i|
Where:
• p and q are the two data points and n is the number of dimensions.
Visualization:
The figure illustrates the Manhattan distance between two points on a grid. The distance
is calculated by summing the horizontal and vertical distances travelled along the grid
lines.
Properties:
• Because it sums absolute coordinate differences, Manhattan distance is less affected by
extreme values (outliers) than Euclidean distance and suits data where differences along
individual dimensions are what matter.
6.1.3 Chebyshev Distance
Chebyshev distance measures the greatest difference between two points along any single
dimension.
Formula:
The Chebyshev distance between two points p and q in n-dimensional space is:
d(p, q) = max_{i=1..n} |p_i − q_i|
Where:
• p and q are the two data points and the maximum is taken over all n dimensions.
Properties:
• Only the single largest coordinate difference contributes, so it is useful when the
worst-case difference along any one attribute should dominate the comparison.
6.1.4 Minkowski Distance
Minkowski distance is a more general distance metric that includes both Euclidean and
Manhattan distances as special cases. It introduces a parameter, p, that allows you to
adjust the way distances are calculated.
Formula:
The Minkowski distance between two points p and q in n-dimensional space is:
d(p, q) = ( Σ_{i=1..n} |p_i − q_i|^p )^(1/p)
Where:
• p and q are the two data points, n is the number of dimensions, and the exponent p ≥ 1 is the order parameter.
Special Cases:
• p = 1 gives the Manhattan distance, p = 2 gives the Euclidean distance, and as p → ∞ the Minkowski distance approaches the Chebyshev distance.
Properties:
• Varying the order parameter lets the analyst move smoothly between Manhattan-like (p = 1) and Euclidean-like (p = 2) notions of distance.
Table 6.1: Summary of Common Distance Metrics

Metric      Formula                           Properties
Euclidean   sqrt( Σ (p_i − q_i)² )            Straight-line distance; favours compact, spherical clusters; sensitive to outliers
Manhattan   Σ |p_i − q_i|                     Sum of absolute differences; more robust to outliers
Chebyshev   max_i |p_i − q_i|                 Largest difference along any single dimension
Minkowski   ( Σ |p_i − q_i|^p )^(1/p)         Generalizes Manhattan (p = 1) and Euclidean (p = 2)
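For concreteness, the four metrics in the table can be written as the small NumPy sketch below; the helper names are ours and the two example points are arbitrary:

import numpy as np

def euclidean(p, q):
    return np.sqrt(np.sum((p - q) ** 2))

def manhattan(p, q):
    return np.sum(np.abs(p - q))

def chebyshev(p, q):
    return np.max(np.abs(p - q))

def minkowski(p, q, order=3):
    return np.sum(np.abs(p - q) ** order) ** (1.0 / order)

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(a, b), manhattan(a, b), chebyshev(a, b), minkowski(a, b, 2))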
The choice of distance metric can significantly influence the performance of the k-
means algorithm. Here are some key points to consider:
• Cluster Shapes: Different distance metrics can lead to different cluster shapes.
For instance, Euclidean distance tends to produce spherical clusters, while
Manhattan distance might create more rectangular or diamond-shaped clusters.
• Sensitivity to Outliers: Some metrics are more robust to outliers than others.
For example, Manhattan distance is less affected by extreme values than
Euclidean distance.
• Data Interpretation: The choice of metric should align with how we interpret
similarity or dissimilarity in the context of the problem. For example, for text
data, cosine similarity is often more meaningful than Euclidean distance.
7. APPLICATION & ANALYSIS: CLUSTERING IRIS DATA WITH
DIFFERENT DISTANCE METRICS
In this section, we put our knowledge of k-means clustering and distance metrics to
work by exploring practical applications: clustering the Iris dataset with different
distance metrics, and applying k-means to ranked preference data for marketing campaign
selection. The general approach for the campaign-selection task is as follows:
1. Gather Data: Collect data from customers, asking them to rank a set of
products (or items) in order of preference.
2. Cluster the Data: Use k-means clustering with a suitable distance metric for
ranked data (e.g., Kendall's tau or Spearman's footrule) to group customers with
similar preferences into clusters.
3. Analyze Cluster Centers: The centroids of each cluster represent the "average"
product rankings for that customer segment. Analyze these centroids to
understand the preferences of each group.
4. Campaign Assignment: For each campaign (defined by a set of products),
determine which cluster(s) exhibit the strongest preference for those products.
Method:
1. Data Preprocessing: Prepare the Sushi dataset by loading it into a suitable data
structure (e.g., a Pandas Data Frame in Python).
2. Normalization (Optional): If necessary, normalize the rankings using
MinMaxScaler from scikit-learn's preprocessing module to scale the values
between 0 and 1. This can be helpful if different customers use different ranking
scales.
3. Initialization Techniques: Implement the three initialization techniques
discussed earlier (clustering subsamples, PCA-based, and positive space
transformation).
4. Distance Metrics: Implement Kendall's tau and Spearman's footrule distance
metrics.
5. K-Means Clustering: Run k-means clustering with different combinations of
initialization techniques and distance metrics. Use the Elbow Method to
determine the optimal number of clusters (k) for each combination.
6. Campaign Selection: Define a few campaigns based on specific sets of sushi
types (e.g., Campaign 1: {salmon, tuna, shrimp}, Campaign 2: {eel, sea urchin,
squid}).
7. Evaluation: For each clustering solution (obtained with a different combination
of initialization and distance metric), calculate sumR and LAF values for each
campaign and each cluster. Analyze which combinations lead to the most
effective campaign-segment assignments.
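A minimal sketch of the two rank distances used above is shown below; the helper functions are our own, and each customer's preferences are assumed to be stored as a vector of integer ranks:

import numpy as np

def spearman_footrule(r1, r2):
    # sum of absolute differences between the two rank vectors
    return int(np.sum(np.abs(np.asarray(r1) - np.asarray(r2))))

def kendall_tau_distance(r1, r2):
    # number of item pairs ranked in opposite order by the two rankings
    r1, r2 = np.asarray(r1), np.asarray(r2)
    n, discordant = len(r1), 0
    for i in range(n):
        for j in range(i + 1, n):
            if np.sign(r1[i] - r1[j]) != np.sign(r2[i] - r2[j]):
                discordant += 1
    return discordant

# e.g. two customers ranking the same five sushi types
print(spearman_footrule([1, 2, 3, 4, 5], [2, 1, 3, 5, 4]))    # 4
print(kendall_tau_distance([1, 2, 3, 4, 5], [2, 1, 3, 5, 4])) # 2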
Analysis:
Sepal Length (CM)   Sepal Width (CM)   Petal Length (CM)   Petal Width (CM)   Species
5.1 3.5 1.4 0.2 Iris-setosa
4.9 3 1.4 0.2 Iris-setosa
4.7 3.2 1.3 0.2 Iris-setosa
4.6 3.1 1.5 0.2 Iris-setosa
5 3.6 1.4 0.2 Iris-setosa
5.4 3.9 1.7 0.4 Iris-setosa
4.6 3.4 1.4 0.3 Iris-setosa
5 3.4 1.5 0.2 Iris-setosa
4.4 2.9 1.4 0.2 Iris-setosa
4.9 3.1 1.5 0.1 Iris-setosa
5.4 3.7 1.5 0.2 Iris-setosa
4.8 3.4 1.6 0.2 Iris-setosa
4.8 3 1.4 0.1 Iris-setosa
4.3 3 1.1 0.1 Iris-setosa
5.8 4 1.2 0.2 Iris-setosa
5.7 4.4 1.5 0.4 Iris-setosa
5.4 3.9 1.3 0.4 Iris-setosa
5.1 3.5 1.4 0.3 Iris-setosa
5.7 3.8 1.7 0.3 Iris-setosa
5.1 3.8 1.5 0.3 Iris-setosa
5.4 3.4 1.7 0.2 Iris-setosa
5.1 3.7 1.5 0.4 Iris-setosa
4.6 3.6 1 0.2 Iris-setosa
5.1 3.3 1.7 0.5 Iris-setosa
4.8 3.4 1.9 0.2 Iris-setosa
5 3 1.6 0.2 Iris-setosa
5 3.4 1.6 0.4 Iris-setosa
5.2 3.5 1.5 0.2 Iris-setosa
5.2 3.4 1.4 0.2 Iris-setosa
4.7 3.2 1.6 0.2 Iris-setosa
4.8 3.1 1.6 0.2 Iris-setosa
5.4 3.4 1.5 0.4 Iris-setosa
5.2 4.1 1.5 0.1 Iris-setosa
5.5 4.2 1.4 0.2 Iris-setosa
4.9 3.1 1.5 0.1 Iris-setosa
5 3.2 1.2 0.2 Iris-setosa
5.5 3.5 1.3 0.2 Iris-setosa
4.9 3.1 1.5 0.1 Iris-setosa
4.4 3 1.3 0.2 Iris-setosa
5.1 3.4 1.5 0.2 Iris-setosa
5 3.5 1.3 0.3 Iris-setosa
4.5 2.3 1.3 0.3 Iris-setosa
4.4 3.2 1.3 0.2 Iris-setosa
5 3.5 1.6 0.6 Iris-setosa
5.1 3.8 1.9 0.4 Iris-setosa
4.8 3 1.4 0.3 Iris-setosa
Table 7.1: Iris Data Set (excerpt)
The figure illustrates how customer segments identified by k-means clustering can be
used to effectively target marketing campaigns.
By conducting this experimental analysis, we can gain insights into how different
distance metrics and initialization techniques interact and influence the performance of
k-means clustering for the specific task of campaign selection.
Table 7.2: Performance Analysis (excerpt). For Spearman's Footrule, the recorded sumR and
LAF values across the two campaigns were 3.1, 1.7, 2.9, and 2.0; this metric performs better
for the Campaign 2 assignments.
Notes:
• sumR Values: Lower sumR values indicate higher preference for the campaign
products within the customer segment.
• LAF Values: Lower LAF values indicate higher attention from the customer
segment to the campaign products.
Interpretations:
• Campaign 1: The results suggest that Kendall's Tau might be more effective
for identifying customer segments interested in Campaign 1's products (Salmon,
Tuna, Shrimp).
• Campaign 2: Spearman's Footrule seems to perform better in finding segments
that are more likely to engage with the products in Campaign 2 (Eel, Sea Urchin,
Squid).
8. RESULT ANALYSIS & DISCUSSION
As we've seen, the starting point for k-means can make a big difference in the final
clustering results.
Evaluation Metrics
To assess the performance of the K-means clustering algorithm, several metrics were
utilized:
Within-Cluster Sum of Squares (WCSS): Measures the compactness of the clusters; lower
values indicate tighter, more homogeneous clusters.
Silhouette Score: Evaluates how similar an object is to its own cluster compared to
other clusters. Higher silhouette scores suggest better-defined clusters.
Convergence Rate: Measures the number of iterations required for the algorithm to
converge to a stable solution.
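As an illustration, these metrics can be computed with scikit-learn as in the short sketch below; the Iris data is used only as a stand-in dataset:

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("WCSS:", km.inertia_)                               # compactness of the clusters
print("Silhouette:", silhouette_score(X, km.labels_))     # cohesion vs. separation, in [-1, 1]
print("Iterations to converge:", km.n_iter_)              # rough convergence-rate indicator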
Initialization Techniques:
The different initialization strategies (random, clustering subsamples, PCA-based, and
positive space transformation) were compared on the same datasets.
Distance Metrics:
The choice of distance metric was evaluated to determine its effect on clustering
quality.
o Euclidean Distance:
As the standard metric used in K-means, Euclidean distance performed well with
numerical data and served as a baseline for comparison. It provided balanced clustering
results with moderate WCSS and silhouette scores.
o Manhattan Distance:
This metric performed well with datasets where differences between data points are
more meaningful along individual dimensions. It often resulted in slightly higher
WCSS compared to Euclidean distance but was useful for specific data distributions
where the absolute differences are more relevant.
o Cosine Similarity:
Effective for high-dimensional and sparse data, cosine similarity showed improved
clustering results for text data and other applications where the direction of data vectors
is more important than their magnitude. It provided higher silhouette scores for these
types of datasets, indicating better-defined clusters.
o Elbow Method:
The elbow method provided a visual way to identify the appropriate K. By plotting
WCSS against the number of clusters, the "elbow point" indicated the optimal K. This
method was straightforward but sometimes required subjective interpretation to
identify the elbow point accurately.
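A minimal sketch of the elbow computation with scikit-learn is shown below (Iris as a stand-in dataset; the elbow is then read off the plotted curve):

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

X = load_iris().data
wcss = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)          # within-cluster sum of squares for this k
plt.plot(ks, wcss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()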
o Silhouette Analysis:
Silhouette analysis complemented the elbow method by measuring, for each candidate K, how
well each point fits its assigned cluster relative to the neighbouring clusters; the K with
the highest average silhouette score was preferred.
o Gap Statistic:
The gap statistic compared the WCSS to that of a null reference distribution of the data.
This method provided a robust way to estimate the optimal number of clusters, often
aligning well with the results from silhouette analysis. It helped in selecting a K that
balances the compactness and separation of clusters.
Mini-Batch K-means:
To address scalability, Mini-Batch K-means was considered; it updates the centroids from
small random batches of the data, trading a small amount of clustering quality for much
lower computation time on large datasets.
o Distance Metrics: Euclidean and Manhattan distances were suitable for different
types of numerical data, while cosine similarity was preferable for high-
dimensional, sparse data.
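As a rough illustration of the scalability option mentioned above, scikit-learn's MiniBatchKMeans can be used as follows (Iris again as a stand-in; on genuinely large data the batch-wise updates are where the savings appear):

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import load_iris

X = load_iris().data
mbk = MiniBatchKMeans(n_clusters=3, batch_size=64, n_init=10, random_state=42)
labels = mbk.fit_predict(X)            # centroids updated from small random batches
print("WCSS:", mbk.inertia_)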
Discussion
The results of this project demonstrate the effectiveness of various enhancements to the
K-means clustering algorithm. By addressing the key challenges of initialization,
distance metrics, optimal cluster determination, and scalability, the project achieved
significant improvements in clustering quality and efficiency. These findings contribute
to the development of a more robust and versatile K-means algorithm, capable of
delivering reliable clustering results across diverse applications.
9. CONCLUSION
This project journeyed into the heart of k-means clustering, one of the most popular
algorithms for grouping similar data points, but with a laser focus on the impact of
distance metrics. We all know that k-means helps us make sense of data by finding
clusters, but how we measure "similarity" between data points using different distance
metrics can drastically change those clusters. So, we rolled up our sleeves and explored
a variety of distance metrics to see how they affect the performance of k-means.
We didn't just stick to theory; we got our hands dirty with experiments! We tested these
distance metrics on various datasets and compared how they affected the quality of the
clusters produced by k-means. We saw firsthand how choosing the wrong distance
metric can lead to clusters that just don't make sense, while the right metric can reveal
hidden patterns and groupings that we might have missed otherwise.
Through our experiments, we gained valuable insights into the strengths and
weaknesses of different distance metrics. We found that for some datasets, the classic
Euclidean distance works just fine, while for others, such as ranked or high-dimensional
data, alternative metrics revealed noticeably better-defined clusters.
Our project demonstrates that the choice of distance metric is not a trivial decision. It's
a key factor that can make or break the effectiveness of k-means clustering. By carefully
considering the nature of our data and the goals of our analysis, we can select the most
appropriate metric and unlock the full potential of k-means.
This project's findings have practical implications for a wide range of fields. By using
the right distance metrics, we can improve customer segmentation, personalize
recommendations, enhance image recognition, and even make more accurate diagnoses
in healthcare.
10. FUTURE SCOPE
This project has explored the intricacies of k-means clustering, focusing on the impact
of initialization techniques and distance metrics. Our findings open up several
promising avenues for future research and development:
10.4 Scalability and Efficiency
11. REFERENCES
7. Tajunisha, N., & Saravanan, V. Performance Analysis of K-Means with Different
Initialization Methods for High Dimensional Data. [Discusses the challenges of clustering
high-dimensional data and proposes using PCA to find good initial centroids.]
8. Yedla, M., Pathakota, S. R., & Srinivasa, T. M. (2010). Enhancing K-means
Clustering Algorithm with Improved Initial Center. International Journal of
Computer Science and Information Technologies, 1(2), 121–125. [Introduces a
simple and efficient initialization technique based on transforming data to a
positive space and using distances from the origin.]
Appendix:
import pandas as pd
import math
import numpy as np
import random
import matplotlib.pyplot as pt                     # plotting; this listing uses the alias `pt`
import matplotlib.pyplot as plt                    # some helper code below uses the alias `plt`
from sklearn.preprocessing import StandardScaler   # used by applyPCA
from sklearn.decomposition import PCA              # used by applyPCA

# Global variables: current centroids and the cluster assignment of each point
centrd = []
grp = []
# taking input
def takeInput():
    # df = pd.read_csv("Automobile_data.csv")
    df = pd.read_csv("iris.csv")
    # Display initial data with matplotlib
    pt.scatter(df[df.columns[0]], df[df.columns[1]])
    pt.title("Initial Data")
    pt.show()
    displayDF(df)
    return df
# Implement PCA: reduce the Iris features to two principal components
def applyPCA(df):
    # df = pd.read_csv("iris.csv", header=None)
    X = df.drop("Species", axis=1)
    y = df["Species"]
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(X_scaled)
    df_pca = pd.DataFrame(X_pca, columns=["PC1", "PC2"])   # collect the two components into a DataFrame
    df_pca["Species"] = y
    print(df_pca)
    return df_pca
def elbowMethod(autoData):
    # Set seed
    random.seed(987)
    np.random.seed(987)
    autoData = pd.read_csv('Automobile_data.csv')
    data1 = autoData                                   # alias used by the sampling step below
    data2 = data1.sample(n=150).reset_index(drop=True)
    mean_length = data2['length'].mean()
    plt.scatter(data2['width'], data2['length'])
    plt.xlabel('width')
    plt.ylabel('length')
    plt.show()
    var_length = data2['length'].var()
    var_width = data2['width'].var()
    var_height = data2['height'].var()
# Basic k-means on a numeric DataFrame (the def line is taken from the call sites below)
def kmeans(data, k, max_iter=100):
    n, m = data.shape
    points = data.to_numpy()
    centroids = data.sample(n=k).to_numpy()
    clusters = np.zeros(n)
    for _ in range(max_iter):
        for i in range(n):
            distances = np.linalg.norm(points[i] - centroids, axis=1)  # Euclidean distance to each centroid
            clusters[i] = np.argmin(distances)
        new_centroids = np.array([points[clusters == j].mean(axis=0) for j in range(k)])
        if np.all(centroids == new_centroids):
            break
        centroids = new_centroids
    return clusters
clusters_2 = kmeans(data2, 2)
data2['cluster'] = clusters_2
wcss = 0
centroid = cluster_data.mean()
return wcss
data2['cluster1'] = clusters_3
# Elbow method
wss = []
wss.append(sce)
for i in range(1, 7):
k=i
break
plt.xlabel('Number of clusters')
plt.title("Elbow method")
plt.show()
return k
# Ask the user for k initial centroid indices (function name taken from the call in main)
def takeCenteroid(df, k):
    i = 0
    while i < k:
        print("Enter index of initial centroid ", i + 1, " -> ", end=" ")
        idx = int(input())
        # if the entered index is out of range
        if idx < 0 or idx >= len(df[df.columns[0]]):
            print("Invalid index, please enter an index in range ", 0, " to ",
                  len(df[df.columns[0]]) - 1)
            continue
        else:
            p = []
            p.append(df[df.columns[0]][idx])
            p.append(df[df.columns[1]][idx])
            centrd.append(p)
            i += 1
    return centrd
# Euclidean distance from every data point to the point (x2, y2)
def calEquDst(df, x2, y2):
    ls = []
    for i in range(len(df[df.columns[0]])):
        x1 = df[df.columns[0]][i]
        y1 = df[df.columns[1]][i]
        sqr = (x2 - x1) ** 2 + (y2 - y1) ** 2
        dis = math.sqrt(sqr)
        ls.append(dis)
    return ls
# Manhattan distance from every data point to the point (x2, y2)
def calManhatenDst(df, x2, y2):
    ls = []
    for i in range(len(df[df.columns[0]])):
        x1 = df[df.columns[0]][i]
        y1 = df[df.columns[1]][i]
        dis = abs(x2 - x1) + abs(y2 - y1)
        ls.append(dis)
    return ls
# Chebyshev distance from every data point to the point (x2, y2); function name assumed
def calChebyshevDst(df, x2, y2):
    ls = []
    for i in range(len(df[df.columns[0]])):
        x1 = df[df.columns[0]][i]
        y1 = df[df.columns[1]][i]
        dis = max(abs(x2 - x1), abs(y2 - y1))  # no square root: Chebyshev is the largest coordinate difference
        ls.append(dis)
    return ls
# Distance lists from every point to each of the k centroids, Euclidean metric (function name assumed)
def findAllEuclideanDistance(df, centrd, k):
    EqlDis = []
    for i in range(k):
        p = centrd[i]
        x = p[0]
        y = p[1]
        EqlDis.append(calEquDst(df, x, y))
    return EqlDis
# Distance lists from every point to each of the k centroids, Manhattan metric
def findAllManhatenDistance(df, centrd, k):
    manDis = []
    for i in range(k):
        p = centrd[i]
        x = p[0]
        y = p[1]
        manDis.append(calManhatenDst(df, x, y))
    return manDis
# Display data
def displayDF(df):
    # print(df.head())
    print("\nX Y")
    for i in range(len(df[df.columns[0]])):
        print(df[df.columns[0]][i], "", df[df.columns[1]][i])
# Display the current centroids
def displayCD(c):
    i = 1
    for ls in c:
        print("Centroid", i, ":", ls)
        i += 1
# Assign each point to its nearest centroid and recompute the centroids (function name assumed)
def findNewCentroid(df, newC, k):
    # taking k empty clusters
    lists = []
    for i in range(k):
        lists.append([])
    grp.clear()
    for i in range(len(newC[0])):
        # pick the centroid with the smallest distance to point i
        c = 0
        val = newC[0][i]
        for j in range(1, k):
            if newC[j][i] < val:
                c = j
                val = newC[j][i]
        lists[c].append(i)
        grp.append(c)
    centrd = []
    for ls in lists:
        XSum = 0
        YSum = 0
        for indx in ls:
            XSum += df[df.columns[0]][indx]
            YSum += df[df.columns[1]][indx]
        if len(ls) > 0:
            x = XSum / len(ls)
            y = YSum / len(ls)
        else:
            # empty cluster: fall back to the first data point
            x = df[df.columns[0]][0]
            y = df[df.columns[1]][0]
        ls = []
        ls.append(x)
        ls.append(y)
        centrd.append(ls)
    print(grp)
    return centrd
# Display the distance list computed for each centroid
def displayEqd(newC):
    i = 1
    for ls in newC:
        print("Distances from centroid", i, ":", ls)
        i += 1
# Plot the clustered data (function name assumed; `data` is the plot title, colours are illustrative)
def displayGraph(df, data):
    df['cluster'] = grp
    df1 = df[df.cluster == 0]
    df2 = df[df.cluster == 1]
    df3 = df[df.cluster == 2]
    pt.scatter(df1[df1.columns[0]], df1[df1.columns[1]], color='green', label='cluster 1')
    pt.scatter(df2[df2.columns[0]], df2[df2.columns[1]], color='red', label='cluster 2')
    pt.scatter(df3[df3.columns[0]], df3[df3.columns[1]], color='blue', label='cluster 3')
    pt.title(data, fontsize=16)
    pt.legend(loc=4)
    pt.show()
# main function
def main():
    # taking input
    df = takeInput()
    df = applyPCA(df)
    k = elbowMethod(df)
    centrd = takeCenteroid(df, k)
    flag = 1
    displayCD(centrd)
    # calculate all Euclidean distances from each centroid
    newC = findAllEuclideanDistance(df, centrd, k)
    displayEqd(newC)
    # Display graph
    # displayCD(centrd)
    flag = flag + 1
    df['cluster'] = grp
    displayCD(centrd)
    flag = 1
    displayCD(centrd)
    # calculate all Manhattan distances from each centroid
    newC = findAllManhatenDistance(df, centrd, k)
    displayEqd(newC)
    # Display graph
    # displayCD(centrd)
    flag = flag + 1
    # Display graph
# Sum of squared errors of the final clusters (function name assumed)
def computeSSE(df, centroids, distance_metric='euclidean'):
    sse = 0
    for c, centroid in enumerate(centroids):
        cluster_data = df[df['cluster'] == c]
        if distance_metric == 'euclidean':
            # Euclidean distances
            distances = np.sqrt(np.sum((cluster_data[['PC1', 'PC2']].values - centroid) ** 2, axis=1))
        else:
            # Manhattan distances
            distances = np.sum(np.abs(cluster_data[['PC1', 'PC2']].values - centroid), axis=1)
        sse += np.sum(distances ** 2)
    return sse
main()
Output:
Figure 3 : Clustered data (Euclidean distance Vs Manhattan distance)
Figure 5 : Clustered data (Euclidean distance Vs Manhattan distance)
Figure 8 : Final Clustered Data