
Abstract

This project dives deep into k-means clustering, one of the most popular algorithms for
grouping similar data points together. But here's the thing: k-means can be a bit finicky,
especially when it comes to those initial cluster centres and choosing the right distance
metric. So, we set out to analyse how these two factors – initialization and distance –
affect how well k-means actually performs.

We looked at three different ways to improve the starting point for k-means: clustering
smaller chunks of the data (subsamples), using PCA to reduce the number of
dimensions, and transforming the data to make everything positive. We also explored
why using different distance metrics is so important, especially when you're dealing
with data that's ranked, like customer preferences for products.

We experimented with these techniques on a dataset of customer sushi rankings to see


how well they could identify customer segments for targeted marketing campaigns. We
found that using specialized distance metrics for ranked data, like Kendall's tau and
Spearman's footrule, led to more meaningful clusters compared to just using plain old
Euclidean distance. We also saw that the choice of initialization technique can make a
big difference in how well k-means performs. It's like giving the algorithm a good head
start!

Our results show that both initialization and distance metrics play a major role in getting
good results from k-means. If you choose the right combination for your data, you can
find more accurate clusters and make better decisions, whether it's for marketing
campaigns or any other application where grouping similar things matters.
LIST OF FIGURES

3.1 Illustration of the Curse of Dimensionality
5.1 Data Flow Diagram Level 0
5.2 Data Flow Diagram Level 1
6.1 Euclidean Distance in 2D Space
6.2 Manhattan Distance in 2D Space
7.1 Example of Clustered Iris Segments
8.1 K-Means Clustering
LIST OF TABLES

6.1 Summary of Common Distance Metrics
7.1 Iris Data Set
7.2 Performance Analysis
Contents:

DECLARATION CERTIFICATE
CERTIFICATE OF APPROVAL
ACKNOWLEDGEMENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES

1. INTRODUCTION

2. LITERATURE SURVEY
2.1 Refining Initial Points for K-Means Clustering
2.2 Performance Analysis of K-Means with Different Initialization Methods for High Dimensional Data
2.3 Enhancing K-Means Clustering Algorithm with Improved Initial Center
2.4 Clustering on Ranked Data for Campaign Selection
2.5 K-Means Clustering and Related Algorithms
2.6 Review on Determining the Number of Clusters in K-Means Clustering

3. PROBLEM DEFINITION
3.1 Aim
3.2 Overall Approach in K-Means Clustering
3.3 The Need for Dimensionality Reduction
3.3.1 Principal Component Analysis (PCA)
3.3.2 Other Dimensionality Reduction Techniques

4. SYSTEM REQUIREMENT & SPECIFICATION
4.1 Hardware Requirements
4.2 Software Specification
4.3 Software Requirements
4.3.1 Anaconda Distribution
4.3.2 Python Libraries

5. SYSTEM DESIGN & MODELLING
5.1 System Architecture
5.2 System Implementation
5.3 Modular Design Benefits

6. METHODOLOGY
6.1 Common Distance Metrics
6.1.1 Euclidean Distance
6.1.2 Manhattan Distance
6.1.3 Chebyshev Distance
6.1.4 Minkowski Distance
6.2 Impact of Distance Metrics on K-Means

7. APPLICATION & ANALYSIS: CLUSTERING IRIS DATA WITH DIFFERENT DISTANCE METRICS
7.1 The Importance of the Iris Dataset

8. RESULT ANALYSIS & DISCUSSION

9. CONCLUSION

10. FUTURE SCOPE
10.1 Hybrid Initialization Techniques
10.2 Advanced Distance Metrics
10.3 Evaluation and Validation
10.4 Scalability and Efficiency
10.5 Applications and Extensions

11. REFERENCES
Appendix
1. INTRODUCTION

Clustering is a fundamental technique in data mining and machine learning, used to


group a set of objects into clusters based on their similarity. This process is akin to
organizing a bookshelf where books are grouped by genre, author, or topic, ensuring
similar items are placed together. The primary goal of clustering algorithms is to
partition a dataset into distinct groups where data points within the same cluster exhibit
higher similarity to each other than to those in other clusters. This similarity is typically
quantified using a distance metric, which serves as a crucial determinant of the
clustering outcomes.

The K-means algorithm is one of the most popular and widely used clustering
techniques due to its simplicity and efficiency. It operates by partitioning a dataset into
K clusters, each represented by its centroid, which is the mean of the points in the
cluster. Despite its widespread application, K-means clustering faces significant
challenges, primarily due to its sensitivity to the initial placement of centroids. The
random initialization of centroids can lead to different clustering results each time the
algorithm is executed, often resulting in suboptimal solutions. Poorly chosen initial
centroids may cause the algorithm to converge to a local optimum, a solution that, while
adequate, is not the best possible one. Additionally, this random initialization can
sometimes result in empty clusters if the initial centroids are placed too far from any
data points.

These challenges underline the necessity for refinement techniques aimed at improving
the initial selection of centroids, thereby enhancing the algorithm's performance.
Refinement techniques are particularly critical when dealing with large or high-
dimensional datasets, where the likelihood of poor initial centroid placement increases.
By employing methods such as clustering subsamples, utilizing Principal Component
Analysis (PCA) for dimensionality reduction, and transforming the data to ensure
positivity, the initialization process can be significantly improved. Furthermore, the
choice of distance metric is pivotal, especially for data types that are not naturally suited
to the standard Euclidean distance metric. For example, in datasets involving ranked or
categorical data, alternative metrics such as Kendall's tau or Spearman's footrule can
lead to more meaningful and accurate clustering outcomes.

2. LITERATURE SURVEY

K-means clustering, a widely used partitioning algorithm, is a staple in data mining and
machine learning for grouping similar data points into clusters. However, as we know,
the algorithm's performance can be significantly affected by the initial placement of
cluster centroids. This sensitivity to initialization has motivated researchers to explore
various refinement techniques to improve the algorithm's robustness and accuracy.
Furthermore, the choice of distance metric, which determines how similarity between
data points is measured, plays a crucial role in shaping the resulting clusters and
influencing the overall effectiveness of k-means.

This Literature Survey examines six key research papers that address the challenges
of k-means initialization and the impact of distance metrics, providing the foundation
for our project.

2.1 Refining Initial Points for K-Means Clustering (Bradley and Fayyad)

Bradley and Fayyad tackled the initialization problem by proposing a technique based
on clustering multiple subsamples of the data. Their method involves the following
steps:

1. Drawing J small random subsamples from the original dataset.


2. Clustering each subsample independently using k-means, while handling empty
clusters by re-initializing them with distant data points.
3. Treating the resulting J sets of k centroids as a new dataset.
4. Clustering this "centroid dataset" using k-means, initialized with one of the
subsample centroid sets.
5. Repeating step 4 multiple times with different subsample initializations.
6. Selecting the centroid set that results in the lowest distortion (sum of squared
distances) on the original dataset as the refined initial centroids for k-means.

This approach aims to mitigate the impact of outliers and random initialization by
leveraging information from multiple subsamples and smoothing the resulting
centroids.
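To make the procedure concrete, here is a minimal sketch of this subsample-refinement idea, assuming scikit-learn's KMeans for the inner clustering runs and a SciPy-based distortion check; the function name, the number of subsamples J, and the sample fraction are illustrative choices of ours, not the authors' code.

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def refine_initial_centroids(X, k, J=10, sample_frac=0.1, random_state=0):
    rng = np.random.default_rng(random_state)
    n = X.shape[0]
    sample_size = max(k, int(sample_frac * n))

    # Steps 1-2: cluster J small random subsamples independently
    # (scikit-learn re-seeds empty clusters internally, echoing the paper's fix).
    subsample_centroids = []
    for _ in range(J):
        idx = rng.choice(n, size=sample_size, replace=False)
        km = KMeans(n_clusters=k, n_init=1, random_state=random_state).fit(X[idx])
        subsample_centroids.append(km.cluster_centers_)

    # Step 3: treat the J*k centroids as a new, much smaller dataset.
    centroid_pool = np.vstack(subsample_centroids)

    # Steps 4-6: cluster the centroid pool seeded with each subsample's centroids,
    # and keep the candidate with the lowest distortion on the original data.
    best_centers, best_distortion = None, np.inf
    for seed in subsample_centroids:
        candidate = KMeans(n_clusters=k, init=seed, n_init=1).fit(centroid_pool).cluster_centers_
        distortion = (cdist(X, candidate).min(axis=1) ** 2).sum()
        if distortion < best_distortion:
            best_centers, best_distortion = candidate, distortion
    return best_centers

The returned centroids can then be passed directly as the init argument of a final k-means run on the full dataset.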

2.2 Performance Analysis of K-Means with Different Initialization Methods for
High Dimensional Data (Tajunisha and Saravanan)

Tajunisha and Saravanan focused on the challenges posed by high-dimensional data for
k-means clustering. They explored the use of Principal Component Analysis (PCA) as
a dimensionality reduction technique to improve the selection of initial centroids. Their
method involves:

1. Applying PCA to the high-dimensional dataset to obtain a lower-dimensional


representation.
2. Projecting the data points onto the first principal component (capturing the most
variance).
3. Sorting the projected data points along this component.
4. Dividing the sorted data into k equal subsets and choosing the median point of
each subset as an initial centroid.

By leveraging PCA, this technique aims to identify initial centroids that lie along the
directions of greatest variance in the data, increasing the likelihood of finding well-
separated clusters. The authors demonstrated the effectiveness of PCA-based
initialization for high-dimensional data, showing improved accuracy compared to
random initialization.
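A small sketch of this PCA-guided seeding, assuming scikit-learn's PCA; the helper name and the choice of the median-scored point within each slice follow our reading of the steps above.

import numpy as np
from sklearn.decomposition import PCA

def pca_initial_centroids(X, k):
    # Steps 1-2: project every point onto the first principal component.
    scores = PCA(n_components=1).fit_transform(X).ravel()

    # Step 3: sort the points by their projection along that component.
    order = np.argsort(scores)

    # Step 4: split the sorted points into k equal groups and take each group's
    # median-scored point as an initial centroid.
    centroids = []
    for group in np.array_split(order, k):
        median_idx = group[len(group) // 2]
        centroids.append(X[median_idx])
    return np.array(centroids)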

2.3 Enhancing K-Means Clustering Algorithm with Improved Initial Center


(Yedla, Pathakota, and Srinivasa)

Yedla et al. presented a simpler approach to finding better initial centroids. Their
method involves transforming the data to a positive space and then selecting centroids
based on their distances from the origin. The steps are as follows:

1. Transforming the data to a positive space by subtracting the minimum value of


each feature from all values of that feature.
2. Calculating the Euclidean distance of each data point from the origin (0, 0, ...,
0) in the transformed space.
3. Sorting the data points in ascending order based on their distances from the
origin.

4. Dividing the sorted data points into k equal sets and choosing the median point
of each set as an initial centroid.

This technique exploits the idea that after transformation, data points closer to the origin
are likely to be more "central" within their respective clusters. The authors showed that
this method leads to better accuracy and reduced computational time compared to
standard k-means.
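A rough sketch of this origin-distance seeding under the same reading; the function name is ours, and the centroids returned are simply the median points of each slice of the sorted data.

import numpy as np

def origin_distance_centroids(X, k):
    # Step 1: shift every feature so its minimum becomes zero (all values >= 0).
    X_pos = X - X.min(axis=0)

    # Steps 2-3: sort points by their Euclidean distance from the origin
    # in the transformed space.
    order = np.argsort(np.linalg.norm(X_pos, axis=1))

    # Step 4: split into k equal sets and take the median point of each set.
    return np.array([X[g[len(g) // 2]] for g in np.array_split(order, k)])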

2.4 Clustering on Ranked Data for Campaign Selection (Gupta, Bhattacherjee,


and Bishnu)

Gupta et al. (2020) explored the application of k-means clustering to ranked data,
specifically for the task of campaign selection in marketing. They highlight the
inadequacy of standard distance metrics like Euclidean distance for ranked data and
introduce alternative metrics such as Kendall's tau and Spearman's footrule.

Their approach involves:

1. Gathering ranked data from customers (e.g., product preferences).


2. Clustering this data using k-means with a suitable distance metric for ranked data
(Kendall's tau or Spearman's footrule).
3. Analyzing the cluster centers to understand the preferences of each customer
segment.
4. Assigning campaigns (defined by sets of products) to the clusters that exhibit the
highest preference for those products.

The authors demonstrate the effectiveness of their method for identifying target
customer segments for marketing campaigns, showcasing the importance of using
appropriate distance metrics for ranked data.
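As a hedged illustration, the two rank-aware measures can be written as small distance functions: Kendall's tau is converted into a distance here as (1 - tau) / 2, which is one common convention and may differ from the paper's exact scaling, while Spearman's footrule is the sum of absolute rank differences.

import numpy as np
from scipy.stats import kendalltau

def kendall_tau_distance(r1, r2):
    tau, _ = kendalltau(r1, r2)      # tau in [-1, 1]; 1 means identical ordering
    return (1.0 - tau) / 2.0         # 0 = identical ranking, 1 = fully reversed

def spearman_footrule(r1, r2):
    return np.abs(np.asarray(r1) - np.asarray(r2)).sum()

# Example: two customers ranking the same five products (1 = most preferred).
a = [1, 2, 3, 4, 5]
b = [2, 1, 3, 5, 4]
print(kendall_tau_distance(a, b), spearman_footrule(a, b))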

2.5 K-Means Clustering and Related Algorithms (Adams)

Adams, in his comprehensive overview "K-Means Clustering and Related Algorithms,"


provides a solid foundation for understanding the k-means algorithm and its variations.
He discusses the algorithm's derivation, its relationship to Voronoi partitions, and the
importance of selecting appropriate distance metrics based on the data type.

Adams also explores practical considerations for implementing k-means, including data
standardization, handling missing data, and code vectorization techniques.
Furthermore, he introduces related concepts like spectral clustering, affinity
propagation, and biclustering, offering a broader perspective on the field of clustering.

2.6 Review on Determining the Number of Clusters in K-Means Clustering


(Kodinariya and Makwana)

Kodinariya and Makwana, in their review paper "Review on Determining the Number of
Clusters in K-Means Clustering," focus on the critical issue of determining the optimal number of
clusters (k) in k-means clustering. They provide a thorough overview and comparison
of various methods, including:

1. Rule of thumb: A simple approach based on the square root of the number of
data points.
2. Elbow Method: A visual method analyzing the within-cluster sum of squares
(WCSS) for different k values.
3. Information Criterion Approach: Statistical methods like Akaike's
Information Criterion (AIC) and Bayesian Information Criterion (BIC), which
balance model fit with model complexity.
4. Information Theoretic Approach: An approach based on rate distortion
theory, which considers the trade-off between compression and distortion.
5. Silhouette Method: Measures the similarity of a data point to its own cluster
compared to other clusters.
6. Cross-validation: Assesses the stability of clustering solutions by comparing
results across different data splits.

The authors highlight the strengths and limitations of each approach and emphasize the
importance of choosing a method that aligns with the specific dataset and clustering
goals.
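As a quick illustration of the elbow method on synthetic data (the dataset and the range of k values are arbitrary), the within-cluster sum of squares can be plotted against k and the "bend" read off visually:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow method")
plt.show()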

3. PROBLEM DEFINITION

As we venture deeper into the world of data analysis, we often encounter datasets with
a large number of features or attributes. These are called high-dimensional datasets.
Think of it like describing a car—you could use just a few basic features like color and
size, or you could get really detailed with things like engine capacity, fuel efficiency,
safety ratings, and so on. The more features you add, the higher the dimensionality of
your data.

High-dimensional data presents some unique challenges for clustering algorithms,


including our trusty k-means.

Clustering is a critical task in data analysis and machine learning, used extensively to
discover patterns and groupings within data. The K-means clustering algorithm, despite
its popularity, faces several inherent challenges that can significantly affect its
performance and the quality of the resulting clusters. These challenges need to be
addressed to improve the algorithm's robustness and applicability to a wider range of
datasets and practical applications.

3.1 Aim

The primary aim of this project is to enhance the performance and reliability of the K-
means clustering algorithm. By systematically addressing the key challenges associated
with K-means clustering, this project seeks to develop an improved version of the
algorithm that delivers accurate, consistent, and meaningful clustering results across
diverse datasets and practical scenarios.

▪ Objectives

To achieve this, the project will focus on the following specific objectives:

1. Evaluate Initialization Techniques:

Investigate various methods for initializing centroids to minimize the impact of


random initialization and improve the algorithm’s convergence to a global optimum.

Techniques such as k-means++ initialization, clustering subsamples, and using Principal
Component Analysis (PCA) for dimensionality reduction will be explored.
2. Optimize Distance Metrics:

Assess the performance of different distance metrics for various types of data to
determine the most suitable metrics that yield meaningful clusters. This includes
evaluating alternative metrics such as Manhattan distance, cosine similarity, and
specialized metrics for categorical and ranked data.

3. Develop Methods for Determining Optimal K:

Implement and compare techniques for selecting the optimal number of clusters, K, to
enhance clustering accuracy without prior domain knowledge. Methods such as the
Elbow method, Silhouette analysis, and the Gap statistic will be considered.

4. Enhance Scalability:

Propose and test optimization techniques to improve the scalability of K-means


clustering for large datasets and high-dimensional data. Strategies such as mini-batch
K-means, parallel computing, and dimensionality reduction techniques will be
investigated.

Key Challenges in K-means Clustering: -

The project addresses the following challenges to meet its aim and objectives:

1. Sensitivity to Initial Centroids:

The K-means algorithm's outcome is highly dependent on the initial placement of


centroids. Poorly chosen initial centroids can lead to suboptimal clustering results,
where the algorithm converges to a local minimum rather than the global optimum.
This randomness can cause significant variability in the results, reducing the algorithm's
reliability.

2. Empty Clusters:

- Occasionally, the K-means algorithm may produce empty clusters, especially if
initial centroids are placed far from any data points. This situation not only wastes
computational resources but also fails to provide meaningful insights from the data.

3. Fixed Number of Clusters:

- K-means requires the number of clusters, K, to be specified in advance. Determining


the appropriate value of K can be non-trivial and often requires domain knowledge or
additional computational methods.

4. Choice of Distance Metric:

- The standard Euclidean distance metric used in K-means may not be suitable for all
types of data, particularly for non-numeric, categorical, or ranked data. Alternative
distance metrics need to be considered to improve clustering quality for diverse data
types.

5. Scalability Issues:

While K-means is computationally efficient for small to medium-sized datasets, its


performance can degrade with large-scale datasets or high-dimensional data,
necessitating optimization techniques to handle such scenarios effectively.

3.2 Overall Approach in K-means Clustering

The K-means clustering algorithm follows a straightforward and iterative approach to


partition a dataset into K clusters. The overall approach can be summarized in the
following steps:

1. Initialization:

- Randomly select K initial centroids from the dataset. These centroids serve as the
starting points for the clustering process. The choice of initial centroids significantly
impacts the algorithm's performance and the quality of the resulting clusters.

2. Assignment:

- Assign each data point to the nearest centroid based on a chosen distance metric,
typically the Euclidean distance. This step creates K clusters where each data point
belongs to the cluster with the closest centroid.

3. Update

Recalculate the centroids of the K clusters by computing the mean of all data points
assigned to each cluster. These new centroids represent the updated cluster centers.

4. Iteration:

- Repeat the assignment and update steps until convergence is achieved. Convergence
occurs when the centroids no longer change significantly or a predetermined number of
iterations is reached. This iterative process ensures that the algorithm refines the cluster
boundaries to minimize the variance within each cluster.

5. Evaluation:

Evaluate the clustering results using metrics such as the within-cluster sum of squares
(WCSS) to measure the compactness of the clusters. The algorithm aims to minimize
WCSS to achieve well-defined and compact clusters.

6. Post-processing:

After achieving convergence, additional steps such as outlier detection or merging


small clusters may be performed to further refine the clustering results and ensure their
robustness and interpretability.

By systematically following this approach, the K-means algorithm partitions the dataset
into K distinct clusters, each characterized by its centroid. Despite its simplicity, the
algorithm's effectiveness relies heavily on the initialization method, choice of distance
metric, and scalability optimizations, which are the primary focus of this project.

• The Curse of Dimensionality

This might sound a bit dramatic, but it's a real problem! The "curse of dimensionality"
refers to a set of phenomena that occur when we're dealing with data in many

dimensions. One key issue is that as the number of dimensions increases, the amount
of space we need to represent the data grows exponentially. Imagine trying to find a
needle in a haystack – that's hard enough. Now, imagine trying to find that needle in a
barn full of haystacks! That's what it can be like trying to find meaningful clusters in
high-dimensional spaces.

Figure 3.1: Illustration of the Curse of Dimensionality

The figure shows how data points become more spread out as the number of dimensions
increases, making it harder to define clusters.

Another problem is that distance metrics, which are crucial for k-means, become less
effective in high dimensions. We'll talk more about this in the next section.

Impact on K-Means

The challenges of high-dimensional data have some specific consequences for the k-
means algorithm:

• Initialization: Randomly picking initial centroids in high-dimensional spaces


is more likely to result in poor starting points. This is because the chance of

placing a centroid far away from any data points increases as the number of
dimensions grows.
• Local Optima: K-means is already prone to getting stuck in local optima, and
this problem becomes worse in high dimensions. The algorithm might converge
to a solution that's okay, but not the best possible one.

3.3 The Need for Dimensionality Reduction

To overcome the curse of dimensionality, we often need to reduce the number of


features we're working with. This is called dimensionality reduction. It's like
summarizing a long book into a shorter version while still capturing the main points.

Dimensionality reduction can:

• Improve Efficiency: It makes clustering algorithms run faster because they


have less data to process.
• Reduce Noise: Sometimes, not all features are equally important.
Dimensionality reduction can help us get rid of noisy or irrelevant features that
might be confusing the algorithm.
• Enhance Cluster Separation: In some cases, reducing the number of
dimensions can make the true clusters in the data more apparent.

3.3.1 Principal Component Analysis (PCA)

One popular technique for dimensionality reduction is Principal Component Analysis


(PCA). PCA is like finding the "best" angles to look at your data so you can see the
most important patterns. It does this by transforming the original features into a new set
of features called principal components. These components are ordered by how much
variation they capture in the data. The first component captures the most variation, the
second captures the second most, and so on.

By keeping only the first few principal components, we can often reduce the
dimensionality of our data without losing too much important information. This makes
it easier for clustering algorithms like k-means to find meaningful groups.
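A minimal usage sketch, assuming scikit-learn's PCA with an illustrative 95% explained-variance threshold on synthetic high-dimensional data:

from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=1000, centers=5, n_features=50, random_state=0)

pca = PCA(n_components=0.95)              # keep enough components for ~95% of the variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape[1], "components kept")
print(pca.explained_variance_ratio_.round(3))

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_reduced)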

3.3.2 Other Dimensionality Reduction Techniques

PCA is just one example of a dimensionality reduction technique. There are many
others, each with its own strengths and weaknesses. Some common ones include:

• Linear Discriminant Analysis (LDA): This technique is similar to PCA but


focuses on finding features that best separate different classes or groups in the
data.
• t-Distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is particularly
good at visualizing high-dimensional data in lower-dimensional spaces (like 2D
or 3D), often revealing interesting clusters or patterns.
• Autoencoders: These are neural networks that learn a compressed
representation of the data and can be used for dimensionality reduction.

4. SYSTEM REQUIREMENT & SPECIFICATION
The System Requirements Specification (SRS) for the K-means performance-analysis
system outlines the key aspects of the system's design and functionality. It begins with
an introduction, providing an overview of the system and its purpose, identifying
stakeholders, and defining the scope. The functional requirements section specifies the
necessary functionalities, including dataset loading and pre-processing, selection of
initialization techniques and distance metrics, execution of k-means clustering,
evaluation of clustering quality, and reporting and visualization of results.
Non-functional requirements cover aspects such as system performance, usability,
scalability, reliability, and compatibility. System constraints detail any technology,
data, budget, or time limitations that need to be considered. The SRS also includes
provisions for system verification and validation, including testing procedures,
performance evaluation metrics, and documentation requirements. Overall, the SRS
serves as a comprehensive document that guides the development, testing, and
deployment of the system, ensuring that it meets the specific requirements and
expectations of the stakeholders.

4.1 Hardware Requirements

• A PC with Windows 10, macOS, or Linux.
• Processor with a 2.3–5 GHz clock speed and at least 4 cores.
• Minimum of 8 GB RAM.
• 2 GB graphics card.

4.2 Software Specification

• Text editor / IDE (VS Code or PyCharm).
• Anaconda distribution package.
• Python libraries (see Section 4.3).
• Bootstrap, HTML, CSS and JavaScript.
• Latest version of the Chrome browser.

4.3 Software Requirements

4.3.1 Anaconda distribution:

Anaconda is a free and open-source distribution of the Python programming language
for scientific computing (data science, machine learning applications, large-scale data
processing, predictive analytics, etc.) that aims to simplify package management and
deployment. Package versions are managed by the conda package manager. The
Anaconda distribution includes data-science packages suitable for Windows, Linux and
macOS.

4.3.2 Python libraries:

For the computation and analysis, we need certain python libraries which are used to
perform analytics. Packages such as SKlearn, NumPy, Flask framework, Pandas etc.
are needed.

• SKlearn:
It features various classification, regression and clustering algorithms including
support vector machines, random forests, gradient boosting, k-means and
DBSCAN, and is designed to interoperate with the Python numerical and
scientific libraries NumPy and SciPy.

• NumPy:
NumPy is a general-purpose array-processing package. It provides a high-
performance multidimensional array object and tools for working with these
arrays. It is the fundamental package for scientific computing with Python.

• Flask:
It is a lightweight WSGI web application framework. It is designed to make
getting started quick and easy, with the ability to scale up to complex
applications. It began as a simple wrapper around Werkzeug and Jinja.

• Pandas:
Pandas is one of the most widely used Python libraries in data science. It provides
high-performance, easy-to-use data structures and data analysis tools for cleaning,
exploring, visualizing and manipulating data. Unlike NumPy, which provides
objects for multi-dimensional arrays, Pandas provides an in-memory 2-D table
object called a DataFrame.

• NLTK Toolkit:
The Natural Language Toolkit (NLTK) is a Python programming environment
for creating applications for statistical natural language processing (NLP).

It includes language processing libraries for tokenization, parsing,


classification, stemming, labelling, and semantic reasoning. It also comes with
extensive documentation and a companion book describing the language processing
tasks NLTK supports, together with graphical demos and sample data repositories.

A collection of libraries and applications for statistics language comprehension


can be found in the NLTK (Natural Language Toolkit) Library. One of the most
potent NLP libraries, it includes tools that allow computers to comprehend
natural language and respond appropriately whenever it is used.

In addition to the standard NLP tasks, such as tokenization and parsing, NLTK
includes tools for sentiment analysis. This enables the toolkit to determine the
sentiment of a given piece of text, which can be useful for applications such as
social media monitoring or product review analysis.

While NLTK is a powerful toolkit in its own right, it can also be used in
conjunction with other machine learning libraries such as scikit-learn and
TensorFlow. This allows for even more sophisticated NLP applications, such as
deep learning-based language modeling.

• Matplotlib
Matplotlib is a popular plotting library in Python used for creating high-quality
visualizations and graphs. It offers various tools to generate diverse plots,
facilitating data analysis, exploration, and presentation. Matplotlib is flexible,

supporting multiple plot types and customization options, making it valuable
for scientific research, data analysis, and visual communication. It can create
different types of visualization reports like line plots, scatter plots, histograms,
bar charts, pie charts, box plots, and many more different plots. This library also
supports 3-dimensional plotting.

• Seaborn
Seaborn is a Python data visualization library based on Matplotlib. It provides a
high-level interface for drawing attractive and informative statistical graphics,
building on top of Matplotlib and integrating closely with Pandas data structures.

Seaborn helps you explore and understand your data. Its plotting functions
operate on dataframes and arrays containing whole datasets and internally
perform the necessary semantic mapping and statistical aggregation to produce
informative plots. Its dataset-oriented, declarative API lets you focus on what
the different elements of your plots mean, rather than on the details of how to
draw them.

5. SYSTEM DESIGN & MODELLING

This section outlines the system design and modelling approach for our project,
"Performance Analysis of K-Means Clustering using Different Distance Metrics." The
system is designed to be modular and flexible, allowing for easy experimentation with
different initialization techniques, distance metrics, and datasets.

5.1 System Architecture

The system architecture consists of the following main modules:

1.Data Input Module:


a. Handles the loading and pre-processing of datasets.
b. Supports various data formats (e.g., CSV, Excel).
c. Provides options for data cleaning, normalization, and feature engineering.

2.Initialization Module:
a. Implements different k-means initialization techniques:

i. Random Initialization
ii. Clustering Subsamples
iii. PCA-Based Initialization
iv. Positive Space Transformation
b. Allows for easy selection and configuration of initialization methods.

3.Distance Metric Module:


a. Provides a library of distance metrics:
i. Euclidean Distance
ii. Manhattan Distance
iii. Chebyshev Distance
iv. Minkowski Distance
v. Kendall's Tau (for ranked data)
vi. Spearman's Footrule (for ranked data)
vii. Other relevant metrics (e.g., Hamming, Jaccard, Cosine
Similarity)
b. Allows for flexible selection and parameterization of distance metrics.

4.K-Means Clustering Module:

a. Implements the core k-means clustering algorithm.

b. Uses the selected initialization technique and distance metric from the
other modules.
c. Provides options for setting the number of clusters (k), maximum
iterations, and convergence criteria.

5.Evaluation Module:

1. Calculates various performance metrics to assess the quality of


clustering results:
a. Distortion (Within-Cluster Sum of Squares - WCSS)
b. Silhouette Score
c. Domain-specific metrics (e.g., sumR and LAF for campaign
selection)
2. Provides options for visualizing clustering results (e.g., scatter plots,
dendrograms).

6.Output Module:

1. Presents clustering results and performance metrics in a user-


friendly format.
2. Supports the generation of reports, visualizations, and data files.

Figure 5.1. Data Flow Diagram Level 0

Figure 5.2. Data Flow Diagram Level 1

5.2 System Implementation

The system will be implemented using Python and its powerful libraries for data
analysis and machine learning:

• Pandas: For data manipulation and pre-processing.


• NumPy: For numerical computations and array operations.
• Scikit-learn: For implementing k-means clustering, PCA, and other machine
learning algorithms.
• Matplotlib/Seaborn: For data visualization.

5.3 Modular Design Benefits

The modular design of the system offers several benefits:

• Flexibility: Allows for easy experimentation with different combinations of


initialization techniques, distance metrics, and datasets.
• Extensibility: New modules or components can be added to incorporate
additional functionality or support for other clustering algorithms.
• Reusability: The individual modules can be reused in other projects or
applications.

6. METHODOLOGY
This section outlines the systematic approach taken to conduct a comprehensive
performance analysis of the K-means clustering algorithm. It includes the experimental
design, data preparation, implementation details, evaluation metrics, and procedures
followed to analyse the algorithm's performance. The goal is to identify and address
key factors influencing the effectiveness and efficiency of K-means clustering.

I. Experimental Design

To ensure thorough evaluation, the experimental design focused on:

➢ Initialization Techniques: Testing different methods for initializing centroids to


observe their impact on clustering performance.
➢ Distance Metrics: Evaluating various distance metrics to determine their
suitability for different types of data.
➢ Optimal Number of Clusters (K): Implementing and comparing methods for
selecting the optimal number of clusters.
➢ Scalability: Analysing the algorithm's performance on datasets of varying sizes
and dimensions.

II. Data Preparation

The datasets used in this study were carefully selected to represent a variety of
characteristics:

➢ Synthetic Datasets: Generated to control specific features and test the


algorithm's behaviour under different conditions.
➢ Real-world Datasets: Obtained from open-source repositories to validate the
algorithm's practical applicability.
➢ High-dimensional Data: Included to assess the algorithm's performance in
scenarios with numerous features.
➢ Large-scale Datasets: Used to evaluate scalability improvements.

Data preprocessing steps included normalization and scaling to ensure that features
contributed equally to distance calculations. Missing values were handled appropriately
to maintain data integrity.

III. Implementation

The K-means clustering algorithm was implemented using Python and popular machine
learning libraries such as Scikit-learn. The following variations were tested.

➢ Standard K-means: The baseline implementation using random initialization


and Euclidean distance.
➢ K-means++: Enhanced initialization method to improve convergence.
➢ Mini-batch K-means: To handle large-scale datasets efficiently.
➢ Alternative Distance Metrics: Implementations using Manhattan distance,
cosine similarity, etc.
➢ Each variation was applied to the datasets, and the results were compared to
assess the impact on clustering quality and computational efficiency.
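A sketch of how these variations can be set up with scikit-learn (dataset and parameters are illustrative). Note that scikit-learn's KMeans is tied to Euclidean distance, so the alternative-metric variant below is approximated by re-assigning points to fixed centroids under a different metric rather than by a fully custom k-means loop.

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=2000, centers=4, n_features=10, random_state=0)

baseline   = KMeans(n_clusters=4, init="random",    n_init=10, random_state=0).fit(X)
plus_plus  = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)
mini_batch = MiniBatchKMeans(n_clusters=4, batch_size=256, random_state=0).fit(X)

print(baseline.inertia_, plus_plus.inertia_, mini_batch.inertia_)

# Alternative metric (here: Manhattan) — assign points to the k-means++ centroids
# under cityblock distance, to compare the resulting assignments.
manhattan_labels = cdist(X, plus_plus.cluster_centers_, metric="cityblock").argmin(axis=1)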

IV. Distance Metrics

In k-means clustering, distance metrics play a crucial role in determining how data
points are assigned to clusters. A distance metric quantifies how similar or dissimilar
two data points are. The k-means algorithm assigns each data point to the cluster whose
centroid is closest to it, based on the chosen distance metric.

6.1 Common Distance Metrics

There are many different distance metrics, and the best one to use depends on the nature
of the data and the goals of the clustering task. Let's explore some common distance
metrics and their properties.

6.1.1 Euclidean Distance

Euclidean distance is probably the most familiar and widely used distance metric. It's
the straight-line distance between two points in space. You can think of it as measuring
the distance between two locations on a map using a ruler.

Formula:

The Euclidean distance between two points, p and q, in n-dimensional space is


calculated as:

dist(p, q) = √((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)

Where:

• p = (p1, p2, ..., pn)


• q = (q1, q2, ..., qn)

Visualization:

Figure 6.1 Euclidean Distance in 2D Space.

The figure shows the Euclidean distance between two points in a two-dimensional
plane.

Properties:

• Sensitivity to Differences in All Dimensions: Euclidean distance takes into


account the differences between the coordinates of two points in all dimensions.
• Suitable for Continuous Numerical Data: It is often a good choice for datasets
where the features are continuous numerical values, such as measurements,
coordinates, or quantities.

6.1.2 Manhattan Distance

Imagine you're in a city with a grid-like street layout, like Manhattan in New York City.
You can only travel along the streets, not diagonally. The Manhattan distance between

two points is the distance you would have to travel along these streets to get from one
point to the other.

Formula:

The Manhattan distance between two points, p and q, in n-dimensional space is


calculated as:

dist(p, q) = |p1 - q1| + |p2 - q2| + ... + |pn - qn|

Where:

• p = (p1, p2, ..., pn)


• q = (q1, q2, ..., qn)
• | | represents the absolute value.

Visualization:

Figure 6.2: Manhattan Distance in 2D Space

The figure illustrates the Manhattan distance between two points on a grid. The distance
is calculated by summing the horizontal and vertical distances travelled along the grid
lines.

Properties:

• Less Sensitive to Outliers: Compared to Euclidean distance, Manhattan


distance is less affected by extreme values (outliers) in the data. This is because
it sums the absolute differences in each dimension, rather than squaring them.
• Suitable for High-Dimensional Data: Manhattan distance can be a better
choice than Euclidean distance in high-dimensional spaces, where the "curse of
dimensionality" can make Euclidean distances less meaningful.

6.1.3 Chebyshev Distance

Chebyshev distance measures the maximum absolute difference between the


coordinates of two points. You can visualize it as the number of moves a king would
need to make on a chessboard to get from one square to another.

Formula:

The Chebyshev distance between two points p and q in n-dimensional space is:

dist(p, q) = max (|p1 - q1|, |p2 - q2|, ..., |pn - qn|)

Where:

• p = (p1, p2, ..., pn)


• q = (q1, q2, ..., qn)
• | | represents the absolute value.

Properties:

• Focuses on the Largest Difference: Chebyshev distance considers only the


most significant difference between the attributes of two points.
• Useful for Specific Applications: It has applications in fields like warehouse
logistics, where the maximum travel time between locations is often the most
important factor.

6.1.4 Minkowski Distance

Minkowski distance is a more general distance metric that includes both Euclidean and
Manhattan distances as special cases. It introduces a parameter, p, that allows you to
adjust the way distances are calculated.

Formula:

The Minkowski distance between two points p and q in n-dimensional space is:

dist(p, q) = (|p1 - q1|^p + |p2 - q2|^p + ... + |pn - qn|^p)^(1/p)

Where:

• p = (p1, p2, ..., pn)


• q = (q1, q2, ..., qn)
• | | represents the absolute value.

Special Cases:

• p = 1: The formula reduces to the Manhattan distance.


• p = 2: The formula becomes the Euclidean distance.


Properties:

• Flexibility: The parameter p provides flexibility in how distances are measured.


o Lower values of p (closer to 1) weight the per-dimension differences more evenly.
o Higher values of p increasingly emphasize the largest per-dimension differences;
as p grows very large, Minkowski distance approaches the Chebyshev distance.
• Generalization: Minkowski distance generalizes Euclidean and Manhattan
distances, allowing you to fine-tune the distance calculation based on the
characteristics of your data.

Metric      Formula                             Properties
Euclidean   sqrt(Σ (pi - qi)^2)                 Sensitive to differences in all dimensions;
                                                suitable for continuous numerical data.
Manhattan   Σ |pi - qi|                         Less sensitive to outliers; useful in
                                                high-dimensional spaces.
Chebyshev   max(|p1 - q1|, ..., |pn - qn|)      Considers only the largest single-dimension
                                                difference.
Minkowski   (Σ |pi - qi|^p)^(1/p)               Generalizes Manhattan (p = 1) and
                                                Euclidean (p = 2).

Table 6.1: Summary of Common Distance Metrics
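For a single pair of points, all four metrics can be computed directly with SciPy (the Minkowski order of 3 below is just an example value of the parameter):

from scipy.spatial import distance

p = (1.0, 2.0, 3.0)
q = (4.0, 0.0, 3.0)

print(distance.euclidean(p, q))          # sqrt(3^2 + 2^2 + 0^2) ≈ 3.606
print(distance.cityblock(p, q))          # |3| + |2| + |0| = 5 (Manhattan)
print(distance.chebyshev(p, q))          # max(3, 2, 0) = 3
print(distance.minkowski(p, q, p=3))     # reduces to the two above for p = 1 and p = 2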

6.2 Impact of Distance Metrics on K-Means

The choice of distance metric can significantly influence the performance of the k-
means algorithm. Here are some key points to consider:

• Cluster Shapes: Different distance metrics can lead to different cluster shapes.
For instance, Euclidean distance tends to produce spherical clusters, while
Manhattan distance might create more rectangular or diamond-shaped clusters.
• Sensitivity to Outliers: Some metrics are more robust to outliers than others.
For example, Manhattan distance is less affected by extreme values than
Euclidean distance.
• Data Interpretation: The choice of metric should align with how we interpret
similarity or dissimilarity in the context of the problem. For example, for text
data, cosine similarity is often more meaningful than Euclidean distance.

7. APPLICATION & ANALYSIS: CLUSTERING IRIS DATA WITH
DIFFERENT DISTANCE METRICS

In this section, we put our knowledge of k-means clustering and distance metrics to
work by exploring two practical applications: clustering the Iris dataset, and clustering
ranked customer-preference (sushi) data for campaign selection.

7.1 The Importance of the Iris Dataset

Common Uses of the Iris Dataset:

• Classification: The Iris dataset is frequently used to train and evaluate


classification algorithms. The goal is to predict the species of an iris flower
based on its sepal and petal measurements.
• Clustering: You can use the Iris dataset to explore clustering algorithms (like
k-means) and see how well they can group the flowers based on their features
without knowing the species labels beforehand.
• Visualization: The dataset is useful for visualizing data in lower dimensions
(e.g., using scatter plots) to see if there are distinct patterns or groupings among
the different iris species.
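A minimal sketch of the clustering use-case: running k-means with k = 3 on the standardized Iris measurements and cross-tabulating the cluster labels against the known species purely for inspection (the species labels are never shown to the algorithm):

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)          # sepal/petal measurements
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Cross-tabulate cluster labels against species (used only for inspection).
print(pd.crosstab(labels, iris.target_names[iris.target]))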

For the campaign-selection application on ranked (sushi preference) data, the overall
workflow is as follows:

1. Gather Data: Collect data from customers, asking them to rank a set of
products (or items) in order of preference.
2. Cluster the Data: Use k-means clustering with a suitable distance metric for
ranked data (e.g., Kendall's tau or Spearman's footrule) to group customers with
similar preferences into clusters.
3. Analyze Cluster Centers: The centroids of each cluster represent the "average"
product rankings for that customer segment. Analyze these centroids to
understand the preferences of each group.
4. Campaign Assignment: For each campaign (defined by a set of products),
determine which cluster(s) exhibit the strongest preference for those products.

Method:

1. Data Preprocessing: Prepare the Sushi dataset by loading it into a suitable data
structure (e.g., a Pandas Data Frame in Python).

2. Normalization (Optional): If necessary, normalize the rankings using
MinMaxScaler from scikit-learn's preprocessing module to scale the values
between 0 and 1. This can be helpful if different customers use different ranking
scales.
3. Initialization Techniques: Implement the three initialization techniques
discussed earlier (clustering subsamples, PCA-based, and positive space
transformation).
4. Distance Metrics: Implement Kendall's tau and Spearman's footrule distance
metrics.
5. K-Means Clustering: Run k-means clustering with different combinations of
initialization techniques and distance metrics. Use the Elbow Method to
determine the optimal number of clusters (k) for each combination.
6. Campaign Selection: Define a few campaigns based on specific sets of sushi
types (e.g., Campaign 1: {salmon, tuna, shrimp}, Campaign 2: {eel, sea urchin,
squid}).
7. Evaluation: For each clustering solution (obtained with a different combination
of initialization and distance metric), calculate sumR and LAF values for each
campaign and each cluster. Analyze which combinations lead to the most
effective campaign-segment assignments.
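The scoring step can be sketched as follows, under our working reading that sumR for a campaign is the sum of a cluster centroid's (average) ranks over the campaign's products, with lower values meaning stronger preference; the centroid numbers and column names below are made up for illustration, and LAF is not computed here.

import pandas as pd

# Each centroid row: the average rank this segment gives each sushi type (illustrative).
centroids = pd.DataFrame(
    [[1.8, 2.4, 3.0, 5.5, 6.1, 4.9],
     [5.2, 4.8, 5.9, 1.7, 2.2, 2.9]],
    columns=["salmon", "tuna", "shrimp", "eel", "sea_urchin", "squid"],
)

campaigns = {"Campaign 1": ["salmon", "tuna", "shrimp"],
             "Campaign 2": ["eel", "sea_urchin", "squid"]}

for name, products in campaigns.items():
    sum_r = centroids[products].sum(axis=1)        # lower = stronger preference
    print(name, "-> cluster", int(sum_r.idxmin()),
          "| sumR per cluster:", list(sum_r.round(1)))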

Analysis:

• Performance Comparison: Compare the performance of Kendall's tau and


Spearman's footrule in terms of sumR, LAF, and the resulting campaign-
segment assignments. Discuss which metric appears to be more effective in
identifying suitable customer segments for the defined campaigns.
• Initialization Impact: Analyze how different initialization techniques affect
the clustering results and the subsequent campaign recommendations.
• Visualizations: Create visualizations (e.g., scatter plots, bar charts) to present
the clustering results, campaign assignments, and the impact of different
distance metrics.

Sepal Length (cm)   Sepal Width (cm)   Petal Length (cm)   Petal Width (cm)   Species
5.1 3.5 1.4 0.2 Iris-setosa
4.9 3 1.4 0.2 Iris-setosa
4.7 3.2 1.3 0.2 Iris-setosa
4.6 3.1 1.5 0.2 Iris-setosa
5 3.6 1.4 0.2 Iris-setosa
5.4 3.9 1.7 0.4 Iris-setosa
4.6 3.4 1.4 0.3 Iris-setosa
5 3.4 1.5 0.2 Iris-setosa
4.4 2.9 1.4 0.2 Iris-setosa
4.9 3.1 1.5 0.1 Iris-setosa
5.4 3.7 1.5 0.2 Iris-setosa
4.8 3.4 1.6 0.2 Iris-setosa
4.8 3 1.4 0.1 Iris-setosa
4.3 3 1.1 0.1 Iris-setosa
5.8 4 1.2 0.2 Iris-setosa
5.7 4.4 1.5 0.4 Iris-setosa
5.4 3.9 1.3 0.4 Iris-setosa
5.1 3.5 1.4 0.3 Iris-setosa
5.7 3.8 1.7 0.3 Iris-setosa
5.1 3.8 1.5 0.3 Iris-setosa
5.4 3.4 1.7 0.2 Iris-setosa
5.1 3.7 1.5 0.4 Iris-setosa
4.6 3.6 1 0.2 Iris-setosa
5.1 3.3 1.7 0.5 Iris-setosa
4.8 3.4 1.9 0.2 Iris-setosa
5 3 1.6 0.2 Iris-setosa
5 3.4 1.6 0.4 Iris-setosa
5.2 3.5 1.5 0.2 Iris-setosa
5.2 3.4 1.4 0.2 Iris-setosa
4.7 3.2 1.6 0.2 Iris-setosa
4.8 3.1 1.6 0.2 Iris-setosa
5.4 3.4 1.5 0.4 Iris-setosa

5.2 4.1 1.5 0.1 Iris-setosa
5.5 4.2 1.4 0.2 Iris-setosa
4.9 3.1 1.5 0.1 Iris-setosa
5 3.2 1.2 0.2 Iris-setosa
5.5 3.5 1.3 0.2 Iris-setosa
4.9 3.1 1.5 0.1 Iris-setosa
4.4 3 1.3 0.2 Iris-setosa
5.1 3.4 1.5 0.2 Iris-setosa
5 3.5 1.3 0.3 Iris-setosa
4.5 2.3 1.3 0.3 Iris-setosa
4.4 3.2 1.3 0.2 Iris-setosa
5 3.5 1.6 0.6 Iris-setosa
5.1 3.8 1.9 0.4 Iris-setosa
4.8 3 1.4 0.3 Iris-setosa
Table 7.1: Iris Data Set (excerpt)

Applying Cluster Analysis in Iris Segmentation

Figure 7.1: Example of Clustered Iris Segments.

The figure illustrates how customer segments identified by k-means clustering can be
used to effectively target marketing campaigns.

By conducting this experimental analysis, we can gain insights into how different
distance metrics and initialization techniques interact and influence the performance of
k-means clustering for the specific task of campaign selection.

Distance Metric       Campaign 1 (sumR)   Campaign 1 (LAF)   Campaign 2 (sumR)   Campaign 2 (LAF)
Kendall's Tau                2.8                 1.5                3.2                 2.1
Spearman's Footrule          3.1                 1.7                2.9                 2.0

Table 7.2: Performance Analysis

Kendall's tau appears slightly better for Campaign 1, resulting in a lower average rank
and higher attention, while Spearman's footrule performs better for Campaign 2
assignments.

Notes:

• sumR Values: Lower sumR values indicate higher preference for the campaign
products within the customer segment.
• LAF Values: Lower LAF values indicate higher attention from the customer
segment to the campaign products.

Interpretations:

• Campaign 1: The results suggest that Kendall's Tau might be more effective
for identifying customer segments interested in Campaign 1's products (Salmon,
Tuna, Shrimp).
• Campaign 2: Spearman's Footrule seems to perform better in finding segments
that are more likely to engage with the products in Campaign 2 (Eel, Sea Urchin,
Squid).

8. RESULT ANALYSIS & DISCUSSION

As we've seen, the starting point for k-means can make a big difference in the final
clustering results.

Figure 8.1: K-Means Clustering

The results analysis section provides a comprehensive evaluation of the K-means


clustering algorithm's performance based on the experiments conducted. This section
includes a detailed discussion of the various initialization techniques, distance metrics,
methods for determining the optimal number of clusters, and scalability improvements.
The analysis is structured to address the key objectives of the project, focusing on
enhancing clustering quality and efficiency.

Evaluation Metrics
To assess the performance of the K-means clustering algorithm, several metrics were
utilized:

Within-Cluster Sum of Squares (WCSS): Measures the compactness of the clusters.


Lower WCSS values indicate better clustering performance.

Silhouette Score: Evaluates how similar an object is to its own cluster compared to
other clusters. Higher silhouette scores suggest better-defined clusters.

Execution Time: Assesses the computational efficiency of the algorithm, particularly


important for large datasets.

Convergence Rate: Measures the number of iterations required for the algorithm to
converge to a stable solution.
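These metrics can be collected for a single run roughly as follows (illustrative dataset; WCSS is scikit-learn's inertia_, and the iteration count is read from the fitted model):

import time
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=4, random_state=1)

start = time.perf_counter()
km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=1).fit(X)
elapsed = time.perf_counter() - start

print("WCSS:", km.inertia_)
print("Silhouette:", silhouette_score(X, km.labels_))
print("Execution time (s):", round(elapsed, 3))
print("Iterations to converge:", km.n_iter_)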
Initialization Techniques: -

❖ Various initialization techniques were tested to determine their impact on


clustering performance:
Random Initialization:
Results showed significant variability in cluster quality due to the random placement
of initial centroids. This often led to suboptimal solutions and inconsistent outcomes
across different runs. The clustering quality was generally poorer, with higher WCSS
values and lower silhouette scores.
K-means++ Initialization:

The k-means++ method consistently produced better clustering results compared to


random initialization. It provided a more systematic approach to selecting initial
centroids, resulting in lower WCSS and higher silhouette scores. Additionally, this
method demonstrated faster convergence, reducing the number of iterations required to
reach the final clustering solution.

Distance Metrics: -
The choice of distance metric was evaluated to determine its effect on clustering
quality.
o Euclidean Distance:
As the standard metric used in K-means, Euclidean distance performed well with
numerical data and served as a baseline for comparison. It provided balanced clustering
results with moderate WCSS and silhouette scores.

o Manhattan Distance:
This metric performed well with datasets where differences between data points are
more meaningful along individual dimensions. It often resulted in slightly higher
WCSS compared to Euclidean distance but was useful for specific data distributions
where the absolute differences are more relevant.

o Cosine Similarity:
Effective for high-dimensional and sparse data, cosine similarity showed improved
clustering results for text data and other applications where the direction of data vectors
is more important than their magnitude. It provided higher silhouette scores for these
types of datasets, indicating better-defined clusters.

Determining the Optimal Number of Clusters (K): -

o Elbow Method:
The elbow method provided a visual way to identify the appropriate K. By plotting
WCSS against the number of clusters, the "elbow point" indicated the optimal K. This
method was straightforward but sometimes required subjective interpretation to
identify the elbow point accurately.

o Silhouette Analysis:
Average silhouette scores were computed for a range of K values, with the highest-
scoring K taken as a candidate; in most experiments this agreed closely with the elbow
method and the gap statistic.

o Gap Statistic:
The gap statistic compared the WCSS to that of a null reference distribution of the data.
This method provided a robust way to estimate the optimal number of clusters, often
aligning well with the results from silhouette analysis. It helped in selecting a K that
balances the compactness and separation of clusters.

Scalability Improvements: -

Mini-Batch K-means:

Mini-batch K-means significantly improved the algorithm's scalability by processing


small random batches of data. This approach maintained clustering quality while
drastically reducing computation time, making it suitable for large-scale datasets.
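A small comparison sketch, assuming scikit-learn's MiniBatchKMeans on an illustrative synthetic dataset; the exact timings will vary, but the pattern of comparable inertia at a fraction of the runtime is the point being made above.

import time
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200_000, centers=8, n_features=20, random_state=0)

for Model in (KMeans, MiniBatchKMeans):
    start = time.perf_counter()
    model = Model(n_clusters=8, n_init=10, random_state=0).fit(X)
    print(Model.__name__, "inertia:", round(model.inertia_),
          "time:", round(time.perf_counter() - start, 2), "s")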
Parallel Computing:

Implementing parallel computing techniques, such as distributed computing


frameworks (e.g., Apache Spark), further enhanced scalability. This allowed for
efficient handling of high-dimensional data and reduced the overall execution time,
making the algorithm feasible for real-time applications.
Experimental Results: -

The experimental results highlighted several key findings:

o Initialization Techniques: K-means++ initialization consistently outperformed


random initialization, resulting in lower WCSS and higher silhouette scores,
and faster convergence.

o Distance Metrics: Euclidean and Manhattan distances were suitable for different
types of numerical data, while cosine similarity was preferable for high-
dimensional, sparse data.

o Optimal K Determination: Silhouette analysis and the gap statistic provided


reliable methods for determining the optimal number of clusters, aligning well
with the elbow method in many cases.

o Scalability: Mini-batch K-means and parallel computing significantly improved


the algorithm's performance on large datasets, reducing execution time while
maintaining clustering quality.

Discussion
The results of this project demonstrate the effectiveness of various enhancements to the
K-means clustering algorithm. By addressing the key challenges of initialization,
distance metrics, optimal cluster determination, and scalability, the project achieved
significant improvements in clustering quality and efficiency. These findings contribute
to the development of a more robust and versatile K-means algorithm, capable of
delivering reliable clustering results across diverse applications.

9. CONCLUSION

This project journeyed into the heart of k-means clustering, one of the most popular
algorithms for grouping similar data points, but with a laser focus on the impact of
distance metrics. We all know that k-means helps us make sense of data by finding
clusters, but how we measure "similarity" between data points using different distance
metrics can drastically change those clusters. So, we rolled up our sleeves and explored
a variety of distance metrics to see how they affect the performance of k-means.

Our exploration took us through a fascinating landscape of common distance metrics


like Euclidean, Manhattan, and Chebyshev. We learned how each metric calculates
distances and what kind of data they are best suited for. But we didn't stop there. We
dug deeper into specialized metrics designed for specific data types, such as Kendall's
tau and Spearman's footrule for ranked data—metrics that are much better at comparing
things like customer preferences.

We didn't just stick to theory; we got our hands dirty with experiments! We tested these
distance metrics on various datasets and compared how they affected the quality of the
clusters produced by k-means. We saw firsthand how choosing the wrong distance
metric can lead to clusters that just don't make sense, while the right metric can reveal
hidden patterns and groupings that we might have missed otherwise.

Through our experiments, we gained valuable insights into the strengths and
weaknesses of different distance metrics. We found that for some datasets the classic
Euclidean distance works just fine, while for others, such as ranked preference data,
specialized metrics produced noticeably more meaningful clusters.

Our project demonstrates that the choice of distance metric is not a trivial decision. It's
a key factor that can make or break the effectiveness of k-means clustering. By carefully
considering the nature of our data and the goals of our analysis, we can select the most
appropriate metric and unlock the full potential of k-means.

This project's findings have practical implications for a wide range of fields. By using
the right distance metrics, we can improve customer segmentation, personalize
recommendations, enhance image recognition, and even make more accurate diagnoses
in healthcare.

10. FUTURE SCOPE

This project has explored the intricacies of k-means clustering, focusing on the impact
of initialization techniques and distance metrics. Our findings open up several
promising avenues for future research and development:

10.1 Hybrid Initialization Techniques

• Combining Strengths: Explore the potential of creating hybrid initialization
methods that combine the benefits of different techniques. For example, one
could use PCA to reduce dimensionality and then apply the subsample
clustering approach on the reduced-dimensional data for more robust
initialization (a brief illustrative sketch follows at the end of this subsection).
• Adaptive Initialization: Investigate methods for adaptively selecting the most
suitable initialization technique based on the characteristics of the dataset, such
as its dimensionality, the number of clusters, and the presence of outliers.

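To make the first bullet above concrete, here is a minimal sketch of one possible hybrid
initialization, assuming scikit-learn and the Iris data purely for illustration; the
subsample size and count are arbitrary, and this is not the project's implementation.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import numpy as np

X = PCA(n_components=2).fit_transform(load_iris().data)  # step 1: reduce dimensionality
k = 3
rng = np.random.default_rng(0)

# Step 2: cluster a few random subsamples and pool their centroids
pooled = []
for _ in range(5):
    sample = X[rng.choice(len(X), size=50, replace=False)]
    pooled.append(KMeans(n_clusters=k, n_init=10, random_state=0).fit(sample).cluster_centers_)
pooled = np.vstack(pooled)

# Step 3: cluster the pooled centroids and use the result as the initial centres
init_centres = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pooled).cluster_centers_
final_model = KMeans(n_clusters=k, init=init_centres, n_init=1).fit(X)
print(final_model.inertia_)
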
10.2 Advanced Distance Metrics

• Context-Aware Metrics: Explore the development of distance metrics that are
tailored to specific data types and application domains.
• Adaptive Distance Learning: Investigate techniques that learn optimal
distance metrics directly from the data during the clustering process. This could
lead to more accurate and meaningful clusters, especially for complex datasets.

10.3 Evaluation and Validation

• Robust Evaluation Metrics: Explore new evaluation metrics that better
capture the quality of clustering results, particularly in high-dimensional spaces
or when dealing with complex data distributions.
• Comparative Analysis: Conduct more extensive comparative analyses of
different initialization techniques and distance metrics across a wider range of
datasets and application scenarios to establish best practices and guidelines for
selecting the most appropriate methods.

10.4 Scalability and Efficiency

• Parallel and Distributed Clustering: Research and implement parallel or
distributed versions of the k-means algorithm to handle increasingly large
datasets that are common in real-world applications.
• Approximate K-Means: Explore approximate k-means algorithms that trade
off some accuracy for significant gains in computational efficiency, making it
possible to cluster massive datasets in a reasonable time frame.

10.5 Applications and Extensions

• Domain-Specific Adaptations: Apply the insights gained from this project to
specific problem domains, such as image segmentation, anomaly detection, or
recommender systems. Adapt the k-means algorithm and its initialization and
distance metric choices to the unique characteristics of each domain.
• Integration with Other Techniques: Explore integrating k-means clustering
with other data mining and machine learning techniques, such as classification,
regression, or association rule mining, to develop more powerful and
comprehensive data analysis solutions.

By pursuing these future directions, we can further enhance the capabilities of
k-means clustering, making it a more robust, versatile, and effective tool for
extracting valuable insights from complex and high-dimensional datasets.

11. REFERENCES

1. Aggarwal, C. C., & Reddy, C. K. (Eds.). (2013). Data Clustering: Algorithms
and Applications. CRC Press. [A comprehensive textbook covering various
aspects of data clustering, including algorithms, evaluation metrics, and
applications.]
2. Arthur, D., & Vassilvitskii, S. (2007). k-means++: The advantages of careful
seeding. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on
Discrete Algorithms, 1027–1035. Society for Industrial and Applied
Mathematics. [Presents the k-means++ initialization algorithm, a popular and
effective method for selecting initial centroids.]
3. Bradley, P. S., & Fayyad, U. M. (1998). Refining Initial Points for K-Means
Clustering. Proceedings of the 15th International Conference on Machine
Learning (ICML 1998), 91–99. Morgan Kaufmann Publishers Inc. [Focuses on
the general problem of k-means initialization and introduces the technique of
clustering subsamples.]
4. Gupta, U., Bhattacherjee, V., & Bishnu, P. S. (2020). Clustering on Ranked
Data for Campaign Selection. IEEE Access, 8, 169857–169867.
https://doi.org/10.1109/ACCESS.2020.3019394 [Presents a method for
clustering ranked data using k-means and applies it to the task of campaign
selection. Introduces distance metrics for ranked data and evaluation parameters
like sumR and LAF.]
5. Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern
Recognition Letters, 31(8), 651–666. https://doi.org/10.1016/j.patrec.2009.09.011
[A comprehensive review of clustering algorithms, including a discussion of the
history and evolution of k-means.]
6. Kamishima, T. (2003). Nantonac collaborative filtering: Recommendation
based on order responses. Proceedings of the 9th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD), 583–588.
7. Tajunisha, N., & Saravanan, V. (2010). Performance analysis of k-means with
different initialization methods for high dimensional data. International Journal
of Artificial Intelligence & Applications (IJAIA), 1(4), 44–52. [Emphasizes the
challenges of clustering high-dimensional data and proposes using PCA to find
good initial centroids.]
8. Yedla, M., Pathakota, S. R., & Srinivasa, T. M. (2010). Enhancing K-means
Clustering Algorithm with Improved Initial Center. International Journal of
Computer Science and Information Technologies, 1(2), 121–125. [Introduces a
simple and efficient initialization technique based on transforming data to a
positive space and using distances from the origin.]

Appendix:

# Import all required modules
from matplotlib import pyplot as pt
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import math
import random
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Global variables
centrd = []   # current list of centroids
grp = []      # cluster assignment (index) of each data point

# Take input: read the dataset and show the raw data points
def takeInput():
    # df = pd.read_csv("Automobile_data.csv")
    df = pd.read_csv("iris.csv")

    # Display the initial data with matplotlib
    pt.scatter(df[df.columns[0]], df[df.columns[1]])
    pt.title("Initial Data")
    pt.show()

    displayDF(df)
    return df

# Apply PCA to reduce the Iris data to two principal components
def applyPCA(df):
    # Load the Iris dataset
    # df = pd.read_csv("iris.csv", header=None)
    df.columns = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm",
                  "PetalWidthCm", "Species"]

    # Separate features and target
    X = df.drop("Species", axis=1)
    y = df["Species"]

    # Scale the features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Apply PCA with 2 components
    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(X_scaled)

    # Convert the PCA results into a DataFrame
    df_pca = pd.DataFrame(data=X_pca, columns=["PC1", "PC2"])
    df_pca["Species"] = y

    print(df_pca)
    return df_pca

# Find a suitable value of K using the elbow method
def elbowMethod(autoData):
    # Set seeds for reproducibility
    random.seed(987)
    np.random.seed(987)

    # Read the automobile data used for the elbow analysis
    autoData = pd.read_csv('Automobile_data.csv')

    # Select the desired columns and draw a random sample
    data1 = autoData[['length', 'width', 'height']]
    data2 = data1.sample(n=150).reset_index(drop=True)

    # Compute mean of the 'length' column
    mean_length = data2['length'].mean()
    print(f"Mean of length: {mean_length}")

    # Plot width vs length
    plt.scatter(data2['width'], data2['length'])
    plt.xlabel('width')
    plt.ylabel('length')
    plt.axhline(y=172.24, color='r', linestyle='-')
    plt.show()

    # Calculate variance of the columns
    var_length = data2['length'].var()
    var_width = data2['width'].var()
    var_height = data2['height'].var()
    total_variance = (var_length + var_width + var_height) * 19
    print(f"Total variance: {total_variance}")

    # K-means clustering implementation from scratch
    def kmeans(data, k, max_iter=100):
        n, m = data.shape
        centroids = data.sample(n=k).to_numpy()
        clusters = np.zeros(n)
        for _ in range(max_iter):
            # Assign each point to its nearest centroid
            for i in range(n):
                distances = np.linalg.norm(data.to_numpy()[i] - centroids, axis=1)
                clusters[i] = np.argmin(distances)
            # Recompute centroids and stop when they no longer change
            new_centroids = np.array([data[clusters == j].mean() for j in range(k)])
            if np.all(centroids == new_centroids):
                break
            centroids = new_centroids
        return clusters

    # Within-cluster sum of squares
    def wcss(data, clusters):
        total = 0
        for cluster in np.unique(clusters):
            cluster_data = data[clusters == cluster]
            centroid = cluster_data.mean()
            total += ((cluster_data - centroid) ** 2).sum().sum()
        return total

    # Clustering with 2 clusters
    clusters_2 = kmeans(data2, 2)
    data2['cluster'] = clusters_2
    wcss_2 = wcss(data2[['length', 'width', 'height']], clusters_2)
    print(f"WCSS for 2 clusters: {wcss_2}")

    # Clustering with 3 clusters
    clusters_3 = kmeans(data2[['length', 'width', 'height']], 3)
    data2['cluster1'] = clusters_3
    wcss_3 = wcss(data2[['length', 'width', 'height']], clusters_3)
    print(f"WCSS for 3 clusters: {wcss_3}")

    # Elbow method: compute WCSS for K = 1..6
    wss = []
    for i in range(1, 7):
        clusters_i = kmeans(data2[['length', 'width', 'height']], i)
        sce = wcss(data2[['length', 'width', 'height']], clusters_i)
        print(f"WCSS for {i} clusters: {sce}")
        wss.append(sce)

    # Pick K where the WCSS curve stops dropping sharply
    k = len(wss)
    for i in range(1, len(wss)):
        if wss[i-1] // 2 < wss[i]:
            k = i
            break

    print("Perfect value of K", k)

    plt.plot(range(1, 7), wss, 'bo-')
    plt.xlabel('Number of clusters')
    plt.ylabel('Total within cluster sum of squares')
    plt.title("Elbow method")
    plt.show()

    return k

# Take the initial centroids from the user (by row index)
def takeCenteroid(df, k):
    i = 0
    while i < k:
        print("Enter index of initial centroid ", i+1, " -> ", end=" ")
        idx = int(input())

        # If the entered index is out of range
        if idx >= len(df[df.columns[0]]):
            print("Invalid index, please enter an index in range ", 0, " to ",
                  len(df[df.columns[0]]) - 1)
            continue
        else:
            p = []
            p.append(df[df.columns[0]][idx])
            p.append(df[df.columns[1]][idx])
            centrd.append(p)
            i += 1

    return centrd

# Calculate the Euclidean distance from a centroid to each data point
def calEquDst(df, x, y):
    ls = []
    for i in range(len(df[df.columns[0]])):
        x1 = df[df.columns[0]][i]
        y1 = df[df.columns[1]][i]
        sqr = (x - x1)**2 + (y - y1)**2
        dis = math.sqrt(sqr)
        ls.append(dis)
    return ls

# Calculate the Manhattan distance from a centroid to each data point
def calManhatenDst(df, x2, y2):
    ls = []
    for i in range(len(df[df.columns[0]])):
        x1 = df[df.columns[0]][i]
        y1 = df[df.columns[1]][i]
        dis = abs(x2 - x1) + abs(y2 - y1)
        ls.append(dis)
    return ls

# Calculate the supremum (Chebyshev) distance from a centroid to each data point
def calSuprmeDst(df, x2, y2):
    ls = []
    for i in range(len(df[df.columns[0]])):
        x1 = df[df.columns[0]][i]
        y1 = df[df.columns[1]][i]
        # Chebyshev distance is the maximum coordinate difference (no square root)
        dis = max(abs(x2 - x1), abs(y2 - y1))
        ls.append(dis)
    return ls

# Find the Euclidean distances from every centroid to all data points
def findAllEqulidenDistance(df, centrd, k):
    EqlDis = []
    for i in range(k):
        p = centrd[i]
        x = p[0]
        y = p[1]
        EqlDis.append(calEquDst(df, x, y))
    return EqlDis

# Find the Manhattan distances from every centroid to all data points
def findAllManhatenDistance(df, centrd, k):
    manDis = []
    for i in range(k):
        p = centrd[i]
        x = p[0]
        y = p[1]
        manDis.append(calManhatenDst(df, x, y))
    return manDis

# Display the data points
def displayDF(df):
    # print(df.head())
    print("\nX Y")
    for i in range(len(df[df.columns[0]])):
        print(df[df.columns[0]][i], "", df[df.columns[1]][i])

# Display the centroid coordinates
def displayCD(c):
    i = 1
    for ls in c:
        print("Centroid ", i, " : ", ls[0], ", ", ls[1])
        i += 1

# Find new centroids: assign each point to its nearest centroid, then recompute means
def newCenteroid(df, newC, k):
    # Start with k empty clusters
    lists = []
    for i in range(k):
        lists.append([])

    grp.clear()

    # Assign each data point to the centroid with the smallest distance
    for i in range(len(newC[0])):
        c = 0
        val = newC[0][i]
        for j in range(1, k):
            if newC[j][i] < val:
                c = j
                val = newC[j][i]
        lists[c].append(i)
        grp.append(c)

    # Recompute each centroid as the mean of its assigned points
    centrd = []
    for ls in lists:
        XSum = 0
        YSum = 0
        for indx in ls:
            XSum += df[df.columns[0]][indx]
            YSum += df[df.columns[1]][indx]
        if len(ls) > 0:
            x = XSum / len(ls)
            y = YSum / len(ls)
        else:
            # Empty cluster: fall back to the first data point
            x = df[df.columns[0]][0]
            y = df[df.columns[1]][0]
        ls = []
        ls.append(x)
        ls.append(y)
        centrd.append(ls)

    print(grp)
    return centrd

# Display the calculated distances from each centroid
def displayEqd(newC):
    i = 1
    for ls in newC:
        print("\n\nDistance of all data points from centroid ", i, " :")
        for value in ls:
            print(value, " ", end=" ")
        i += 1

# Plot the clustered data (assumes up to three clusters)
def plotGraph(df, data):
    colors = ["black", "red", "blue", "orange", "yellow", "pink"]
    df['cluster'] = grp
    # df.drop('cluster', axis='columns', inplace=True)
    print("Data after adding cluster column \n", df)

    df1 = df[df.cluster == 0]
    df2 = df[df.cluster == 1]
    df3 = df[df.cluster == 2]

    pt.scatter(df1[df1.columns[0]], df1[df1.columns[1]], color=colors[0], label="Cluster 1")
    pt.scatter(df2[df2.columns[0]], df2[df2.columns[1]], color=colors[1], label="Cluster 2")
    pt.scatter(df3[df3.columns[0]], df3[df3.columns[1]], color=colors[2], label="Cluster 3")

    pt.title(data, fontsize=16)
    pt.legend(loc=4)
    pt.show()

# Main function
def main():
    # Take input
    df = takeInput()
    df = applyPCA(df)

    # Find a suitable value of K
    k = elbowMethod(df)
    print("Number of clusters for this data set is ", k)

    # Take the initial centroids
    centrd = takeCenteroid(df, k)

    # Iterations using Euclidean distance
    flag = 1
    while flag <= 5:
        print("Iteration ", flag)

        # Display centroid data
        displayCD(centrd)

        # Calculate all Euclidean distances
        newC = findAllEqulidenDistance(df, centrd, k)
        displayEqd(newC)

        # Find the new centroids
        centrd = newCenteroid(df, newC, k)

        # Display graph
        plotGraph(df, "Clustered Data of Iteration : " + str(flag))
        flag = flag + 1

    df['cluster'] = grp
    print("Final Centroid Data ")
    displayCD(centrd)

    # Iterations using Manhattan distance
    flag = 1
    while flag <= 2:
        print("Iteration ", flag)

        # Display centroid data
        displayCD(centrd)

        # Calculate all Manhattan distances
        newC = findAllManhatenDistance(df, centrd, k)
        displayEqd(newC)

        # Find the new centroids
        centrd = newCenteroid(df, newC, k)

        # Display graph
        plotGraph(df, "Clustered Data of Iteration : " + str(flag))
        flag = flag + 1

    # Display the final graph
    plotGraph(df, "Final Clustered Data")

    # Sum of squared errors for the final clustering
    def calculate_sse(df, centroids, distance_metric='euclidean'):
        sse = 0
        for i, centroid in enumerate(centroids):
            # Select data points belonging to the current cluster
            cluster_data = df[df['cluster'] == i]
            if distance_metric == 'euclidean':
                # Euclidean distances
                distances = np.linalg.norm(cluster_data[['PC1', 'PC2']].values - centroid, axis=1)
            elif distance_metric == 'manhattan':
                # Manhattan distances
                distances = np.sum(np.abs(cluster_data[['PC1', 'PC2']].values - centroid), axis=1)
            else:
                raise ValueError("Invalid distance metric. Choose 'euclidean' or 'manhattan'.")
            sse += np.sum(distances**2)
        return sse

    # Calculate and print SSE for Euclidean distance
    sse_euclidean = calculate_sse(df, centrd, distance_metric='euclidean')
    print(f"SSE (Euclidean Distance): {sse_euclidean}")

    # Calculate and print SSE for Manhattan distance
    sse_manhattan = calculate_sse(df, centrd, distance_metric='manhattan')
    print(f"SSE (Manhattan Distance): {sse_manhattan}")

# Call the main function
main()

Output:

Figure 1 : Initial data

Figure 2 : Elbow Method

Figure 3 : Clustered data (Euclidean distance Vs Manhattan distance)

Figure 4 : Clustered data (Euclidean distance Vs Manhattan distance)

Figure 5 : Clustered data (Euclidean distance Vs Manhattan distance)

Figure 6 : Clustered data (Euclidean distance Vs Manhattan distance)

Figure 7 : Clustered data (Euclidean distance Vs Manhattan distance)

Figure 8 : Final Clustered Data
