
Cluster Analysis

Subhajit Chattopadhyay

Types of Machine Learning techniques

• Supervised => methods having a dependent variable or labelled data

– regression
– time series
– multiple discriminant analysis
– support vector machine
– decision tree

Figure 1: ML Methods

• Unsupervised - such methods do not have any dependent variable or labelled data

– clustering

• Semi-supervised - such methods use a combination of labelled and unlabelled data

Figure 2: Cluster Analysis

Types of Clustering Techniques

• Connectivity models: distance connectivity between observations is the measure, e.g. hierarchical clustering
• Centroid models: distance from the mean value (centroid) of each cluster is the measure, e.g. k-means clustering
• Distribution models: significance of the statistical distribution of variables in the dataset is the measure, e.g. expectation-maximization algorithms
• Density models: density in the data space is the measure, e.g. DBSCAN
• Another way to categorize clustering models -

– Hard clustering - each object belongs to exactly one cluster

– Soft clustering - each object has some likelihood of belonging to each cluster

Performance Criteria of clustering algorithms:

• High intra-class similarity

• Low inter-class similarity

K-means clustering: this can be done using 3 algorithms

• Lloyd-Forgy algorithm => (a code sketch of this procedure follows the list)

– select the number of clusters (k)
– randomly initiate the k centers in the feature space
– assign each case to the cluster whose center is nearest
– place each center at the mean of its cluster
– then, for each data point -
∗ calculate the distance between the case and each cluster center
∗ assign the case to the nearest centroid
∗ with every new addition or deletion of cases, recalculate the centroids
– place each center at the mean of its cluster and repeat until the assignments stabilize

• McQueen algorithm => the McQueen and Lloyd algorithms are very similar. The only difference is -

– the Lloyd algorithm updates the centroids once after each full pass over the data, so it is called a batch or offline algorithm.
– the McQueen algorithm updates the centroids every time a case changes cluster as it passes through the data set. The McQueen algorithm converges more quickly than the Lloyd algorithm.

• Hartigan-Wong algorithm => this algorithm, for each case of the dataset, calculates the sum of squared errors $SS = \sum_{i \in k} (x_i - c_k)^2$ (where $c_k$ is the centroid of cluster $k$) of that case's current cluster excluding the case, and also the sum of squared errors of the other clusters where it may be assigned; the case is placed in the cluster that yields the smallest error.
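A minimal NumPy sketch of the batch (Lloyd-Forgy) procedure above; the function and its parameters are illustrative rather than taken from these notes, and the empty-cluster edge case is ignored for brevity:

import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    # X: (n, p) data matrix; k: number of clusters
    rng = np.random.default_rng(seed)
    # randomly initiate the k centers by picking k distinct cases
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # assign each case to the cluster whose center is nearest
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # batch update: place each center at the mean of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # converged: no center moved
        centers = new_centers
    return labels, centers

A McQueen-style variant would instead update the two affected centroids immediately each time a case changes cluster, rather than once per pass.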

Measuring the distance between two data points

The distance between two cases reflects their similarity or dissimilarity: the higher the distance between two cases, the lower the similarity, and vice versa.

• Euclidean distance => $d(x, y) = \left[ \sum_{i=1}^{n} (x_i - y_i)^2 \right]^{1/2}$

• Manhattan distance => $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$

• Pearson correlation distance => $d(x, y) = 1 - \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\left[ \sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2 \right]^{1/2}}$

• Eisen cosine correlation distance => $d(x, y) = 1 - \frac{\left| \sum_{i=1}^{n} x_i y_i \right|}{\left[ \sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i^2 \right]^{1/2}}$

• Minkowski distance => $d(x_i, x_j) = \left[ \sum_{k=1}^{p} |x_{ik} - x_{jk}|^r \right]^{1/r}$
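These distance measures translate directly into code; a short NumPy sketch (function names are illustrative):

import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def pearson_distance(x, y):
    # 1 minus the Pearson correlation coefficient
    xc, yc = x - x.mean(), y - y.mean()
    return 1 - np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))

def eisen_cosine_distance(x, y):
    # 1 minus the absolute cosine similarity
    return 1 - np.abs(np.sum(x * y)) / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2))

def minkowski(x, y, r=2):
    # r = 1 gives Manhattan, r = 2 gives Euclidean
    return np.sum(np.abs(x - y) ** r) ** (1 / r)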

Choosing the number of clusters:

To decide the number of clusters, the following internal cluster metrics are examined -

• Davies-Bouldin Index => it looks at the ratio of the within-cluster variance (scatter) to the distance between the centroids of the clusters. This ratio is computed for every possible pair of clusters.

$$\mathrm{scatter}_k = \left[ \frac{1}{n_k} \sum_{i \in k} (x_i - c_k)^2 \right]^{1/2}$$

$$\mathrm{separation}_{j,k} = \left[ \sum_{i=1}^{p} (c_{ji} - c_{ki})^2 \right]^{1/2}$$

$$\mathrm{ratio}_{j,k} = \frac{\mathrm{scatter}_j + \mathrm{scatter}_k}{\mathrm{separation}_{j,k}}$$

For each cluster $k$, the largest ratio against any other cluster $j$ is termed $R_k$, and $\mathrm{DBIndex} = \frac{1}{N} \sum_{k=1}^{N} R_k$, where $N$ is the number of clusters; a lower value indicates better clustering.
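A sketch of the index computed from these definitions, assuming labels and centers (as NumPy arrays) produced by a clustering run; the function name is illustrative:

import numpy as np

def davies_bouldin(X, labels, centers):
    K = len(centers)
    # scatter_k: root mean squared distance of cluster members from their centroid
    scatter = np.array([
        np.sqrt(np.mean(np.sum((X[labels == k] - centers[k]) ** 2, axis=1)))
        for k in range(K)
    ])
    R = np.zeros(K)
    for k in range(K):
        for j in range(K):
            if j != k:
                sep = np.linalg.norm(centers[j] - centers[k])  # separation_{j,k}
                R[k] = max(R[k], (scatter[j] + scatter[k]) / sep)  # largest ratio R_k
    return R.mean()  # DB index: lower is better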

• Dunn's Index => it aims to identify dense and well separated clusters. It is defined as the ratio of the minimal inter-cluster distance to the maximal intra-cluster distance (cluster diameter); a higher value indicates better clustering.
• Pseudo F-statistic => it is the ratio of the between-cluster sum of squares to the within-cluster sum of squares, each divided by its degrees of freedom; a higher value suggests better-separated clusters.

$$\text{Pseudo-}F = \frac{SS_{between}/(k-1)}{SS_{within}/(n-k)}$$

$$SS_{between} = \sum_{j=1}^{k} n_j (c_j - c_g)^2$$

$$SS_{within} = \sum_{j=1}^{k} \sum_{i \in j} (x_i - c_j)^2$$

where $k$ is the number of clusters, $n$ is the number of cases, $n_j$ is the size of cluster $j$, $c_j$ is its centroid, and $c_g$ is the grand centroid of all cases.
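A corresponding sketch of the pseudo F-statistic (equivalently, the Calinski-Harabasz index), under the same assumptions as the sketch above:

import numpy as np

def pseudo_f(X, labels, centers):
    n, k = len(X), len(centers)
    grand = X.mean(axis=0)                    # grand centroid c_g
    sizes = np.bincount(labels, minlength=k)  # cluster sizes n_j
    ss_between = np.sum(sizes * np.sum((centers - grand) ** 2, axis=1))
    ss_within = sum(np.sum((X[labels == j] - centers[j]) ** 2) for j in range(k))
    return (ss_between / (k - 1)) / (ss_within / (n - k))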
