DS143 Group 13 Presentation-1

The document discusses several density-based clustering methods: 1. DBSCAN grows clusters based on density connectivity and discovers clusters of arbitrary shapes with noise. 2. OPTICS extends DBSCAN to produce cluster orderings across different parameter settings. 3. DENCLUE clusters objects based on density distribution functions. It then provides details on the DBSCAN, OPTICS, and grid-based clustering algorithms STING and WaveCluster.

Density Clustering Methods

Group 13 Members:

Kasumba Munashe J R2113946F
Moyo Takudzwa N R219451X
Mtiti Tendai R2112045C
Moyo Locious R2110357W
Kaseke Kudakwashe R218270V
Raphiel Shirichena R2116960A
Stellah Mhlanga R2115150M
Divine Sveto R2110454C
Arther Nyamayaro R214145N
Ngwena Mpho Jerone Takudzwa R212143J
Density-based clustering methods

Density-based clustering methods have been developed to discover clusters of arbitrary shape. They typically regard clusters as dense regions of objects in the data space that are separated by regions of low density (representing noise).

Density-based clustering algorithms:
DBSCAN - grows clusters according to a density-based connectivity analysis.
OPTICS - extends DBSCAN to produce a cluster ordering obtained from a wide range of parameter settings.
DENCLUE - clusters objects based on a set of density distribution functions.
DBSCAN Algorithm

- DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
- It is a density-based clustering algorithm.
- The algorithm grows regions with sufficiently high density into clusters and discovers clusters of arbitrary shape in spatial databases with noise.
- It defines a cluster as a maximal set of density-connected points.
Definitions:
ε-neighborhood of an object – The neighborhood within a radius ε of a given object is called the ε-neighborhood of the object.

Core object – If the ε-neighborhood of an object contains at least a minimum number, MinPts, of objects, then the object is called a core object.

Directly density-reachable objects – Given a set of objects D, an object p is directly density-reachable from object q if p is within the ε-neighborhood of q and q is a core object.

Indirectly density-reachable objects – An object p is indirectly density-reachable (often simply called density-reachable) from object q if there is a chain of objects p1, ..., pn, where p1 = q and pn = p, such that pi+1 is directly density-reachable from pi for 1 ≤ i < n.

Indirectly density-connected objects – An object p is indirectly density-connected (often simply called density-connected) to object q if there is an object o such that both p and q are density-reachable from o.

Example: Density-reachability and density connectivity
Consider a given ε, represented by the radius of the circles, and let MinPts = 3.
Example: Density-reachability and density
connectivity:
Core objects – m, p, o, and r are core objects because the ε-neighborhood of each contains at least three points.
Directly density-reachable objects – q is directly density-reachable from m.
- m is directly density-reachable from p and vice versa.
Example: Density-reachability and density
connectivity.
Indirectly density-reachable objects – q is indirectly density-reachable from p because q is directly density-reachable from m and m is directly density-reachable from p.
- However, p is not indirectly density-reachable from q because q is not a core object.
- Similarly, r and s are indirectly density-reachable from o, and o is indirectly density-reachable from r.

Indirectly density-connected objects – o, r, and s are all indirectly density-connected.

A density-based cluster – A density-based cluster is a set of density-connected objects that is maximal with respect to density-reachability. Every object not contained in any cluster is considered to be noise.
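
The definitions above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the presentation: the function names and the sample points are made up, distances are assumed to be Euclidean, and a point is assumed to count as a member of its own ε-neighborhood.

```python
import math
from collections import deque

def neighborhood(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def is_core(points, i, eps, min_pts):
    """A core object has at least min_pts objects in its eps-neighborhood."""
    return len(neighborhood(points, i, eps)) >= min_pts

def directly_reachable(points, p, q, eps, min_pts):
    """True if p is directly density-reachable from q."""
    return is_core(points, q, eps, min_pts) and p in neighborhood(points, q, eps)

def reachable(points, p, q, eps, min_pts):
    """True if p is density-reachable from q through a chain of core objects."""
    seen, queue = {q}, deque([q])
    while queue:
        cur = queue.popleft()
        if not is_core(points, cur, eps, min_pts):
            continue  # only core objects can extend the chain
        for nxt in neighborhood(points, cur, eps):
            if nxt == p:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def connected(points, p, q, eps, min_pts):
    """True if some object o exists from which both p and q are density-reachable."""
    return any(reachable(points, p, o, eps, min_pts) and
               reachable(points, q, o, eps, min_pts)
               for o in range(len(points)))

# Hypothetical data: four points on a line plus one far-away point.
pts = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
print(is_core(pts, 1, 1.0, 3))        # True: neighborhood is {0, 1, 2}
print(reachable(pts, 3, 1, 1.0, 3))   # True: chain 1 -> 2 -> 3
print(reachable(pts, 1, 3, 1.0, 3))   # False: point 3 is not a core object
```

As in the example with q and p above, reachability is not symmetric: a border point can be reached from a core object, but nothing can be reached from a border point.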
DBSCAN
DBSCAN searches for clusters by checking the ε-neighborhood of each point in the database.
- If the ε-neighborhood of a point p contains at least MinPts points, a new cluster with p as a core object is created.
- DBSCAN then iteratively collects directly density-reachable objects from these core objects, which may involve merging a few density-reachable clusters.
- The process terminates when no new point can be added to any cluster.

DBSCAN Algorithm: The computational complexity of DBSCAN is O(n²) when no spatial index is used, where n is the number of database objects. With appropriate settings of the user-defined parameters ε and MinPts, the algorithm is effective at finding arbitrary-shaped clusters.
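
The procedure described above can be sketched as a brute-force Python function. This is a minimal sketch under stated assumptions, not the presenters' implementation: the name dbscan, the NOISE label of -1, and the sample points are illustrative, and every neighborhood query scans all points, which is where the O(n²) cost comes from.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Brute-force DBSCAN; returns one cluster label per point (NOISE = -1)."""
    n = len(points)
    labels = [None] * n                    # None means "not yet visited"

    def region(i):
        # eps-neighborhood of point i, found by scanning every point
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neigh = region(i)
        if len(neigh) < min_pts:
            labels[i] = NOISE              # may later become a border point
            continue
        cluster += 1                       # i is a core object: start a cluster
        labels[i] = cluster
        seeds = list(neigh)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster        # border point joins the cluster
            if labels[j] is not None:
                continue                   # already assigned
            labels[j] = cluster
            j_neigh = region(j)
            if len(j_neigh) >= min_pts:    # j is also a core object: keep growing
                seeds.extend(j_neigh)
    return labels

# Hypothetical data: two small squares and one isolated point.
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (50, 50)]
print(dbscan(pts, eps=1.5, min_pts=3))     # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Replacing region with a lookup in a spatial index is the usual way to reduce the cost of the neighborhood queries.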
OPTICS Algorithm
OPTICS stands for Ordering Points To Identify the Clustering Structure.
- OPTICS produces a set, or ordering, of density-based clusters.
- It constructs the different clusterings simultaneously.
- The objects should be processed in a specific order.
- This order selects an object that is density-reachable with respect to the lowest ε value, so that clusters with higher density (lower ε) are finished first.
- Based on this idea, two values need to be stored for each object: core-distance and reachability-distance.

Core-distance of an object
- The core-distance of an object p is the smallest ε΄ value that makes p a core object. If p is not a core object, the core-distance of p is undefined.

Reachability-distance of an object
- The reachability-distance of an object q with respect to another object p is the greater of the core-distance of p and the Euclidean distance between p and q. If p is not a core object, the reachability-distance between p and q is undefined.
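
The two distances can be computed directly from their definitions. The following is a small sketch with made-up function names and sample points; it assumes that a point counts itself among its neighbors and uses None to stand for "undefined".

```python
import math

def core_distance(points, p, eps, min_pts):
    """Smallest radius that makes points[p] a core object, or None if undefined."""
    # Sorted distances from p to every point; the first entry is 0.0 for p itself.
    dists = sorted(math.dist(points[p], q) for q in points)
    if len(dists) < min_pts or dists[min_pts - 1] > eps:
        return None                        # p is not a core object within eps
    return dists[min_pts - 1]

def reachability_distance(points, q, p, eps, min_pts):
    """Reachability-distance of q with respect to p, or None if p is not core."""
    cd = core_distance(points, p, eps, min_pts)
    if cd is None:
        return None
    return max(cd, math.dist(points[p], points[q]))

# Hypothetical data set.
pts = [(0, 0), (1, 0), (3, 0), (0, 2), (20, 0)]
print(core_distance(pts, 0, eps=5, min_pts=3))             # 2.0
print(reachability_distance(pts, 1, 0, eps=5, min_pts=3))  # 2.0 (core-distance dominates)
print(reachability_distance(pts, 2, 0, eps=5, min_pts=3))  # 3.0 (actual distance dominates)
print(core_distance(pts, 4, eps=5, min_pts=3))             # None (not a core object)
```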
OPTICS Algorithm
The OPTICS algorithm creates an ordering of the objects in a database.

- OPTICS additionally stores the core-distance and a suitable reachability-distance for each object.

- An algorithm was proposed to extract clusters based on the ordering information produced by OPTICS.

- Such information is sufficient for the extraction of all density-based clusterings with respect to any distance ε΄ that is smaller than the distance ε used in generating the order.
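
To illustrate how a clustering for a smaller ε΄ can be read off the stored values, here is a hedged sketch of a single pass over the ordering. The presentation does not spell out the extraction algorithm, so the logic below is an assumption based on the commonly described approach: an object whose reachability-distance exceeds ε΄ either starts a new cluster (if its core-distance is at most ε΄) or is noise. The ordering itself is hand-made for illustration, not produced by a real OPTICS run.

```python
NOISE = -1

def extract_clusters(ordering, eps_prime):
    """ordering: list of (object_id, reachability, core_distance) in OPTICS order,
    with None standing for an undefined distance.
    Returns a dict mapping object_id to a cluster label (NOISE = -1)."""
    labels = {}
    cluster = -1
    for obj, reach, core in ordering:
        if reach is None or reach > eps_prime:
            # obj cannot be reached from the objects before it at radius eps_prime
            if core is not None and core <= eps_prime:
                cluster += 1
                labels[obj] = cluster      # obj starts a new cluster
            else:
                labels[obj] = NOISE
        else:
            labels[obj] = cluster          # obj continues the current cluster
    return labels

# Hypothetical ordering: (id, reachability-distance, core-distance)
ordering = [("a", None, 1.0), ("b", 1.0, 1.0), ("c", 1.2, 1.5),
            ("d", 4.0, 1.0), ("e", 1.0, 1.1), ("f", 6.0, None)]

print(extract_clusters(ordering, 2.0))
# {'a': 0, 'b': 0, 'c': 0, 'd': 1, 'e': 1, 'f': -1}
print(extract_clusters(ordering, 5.0))
# {'a': 0, 'b': 0, 'c': 0, 'd': 0, 'e': 0, 'f': -1}
```

The same ordering yields two clusters at ε΄ = 2.0 and one merged cluster at ε΄ = 5.0, which is the sense in which one OPTICS run covers a range of parameter settings.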
DENCLUE Algorithm
The grid based clustering

•Grid-based clustering methods use a multi-resolution grid data structure. They quantize the object space into a finite number of cells that form a grid structure on which all of the clustering operations are performed. The main benefit of this approach is its fast processing time, which is generally independent of the number of data objects and depends only on the number of cells in each dimension of the quantized space.
•Examples of the grid-based approach include STING, which explores statistical information stored in the grid cells; WaveCluster, which clusters objects using a wavelet transform; and CLIQUE, which is a grid- and density-based approach for clustering in high-dimensional data space.
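
The quantization step common to these methods can be sketched as follows. The function name, the bounding box arguments, and the sample points are illustrative assumptions; the idea is only that each point is mapped to a cell index and later processing works on the per-cell counts rather than on the individual points.

```python
from collections import Counter

def quantize(points, lower, upper, cells_per_dim):
    """Map each point to the index tuple of its grid cell and count points per cell."""
    counts = Counter()
    for p in points:
        cell = []
        for x, lo, hi in zip(p, lower, upper):
            k = int((x - lo) / (hi - lo) * cells_per_dim)
            cell.append(min(max(k, 0), cells_per_dim - 1))  # clamp boundary values
        counts[tuple(cell)] += 1
    return counts

# Hypothetical 2-D data in the unit square, quantized into a 4 x 4 grid.
pts = [(0.1, 0.1), (0.2, 0.15), (0.9, 0.9), (1.0, 1.0)]
grid = quantize(pts, lower=(0, 0), upper=(1, 1), cells_per_dim=4)
print(grid[(0, 0)], grid[(3, 3)])   # 2 2
```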
Grid-Based Clustering
The grid-based clustering method uses a multi-resolution grid data structure.
- Partitional Clustering Methods
- Hierarchical Clustering Methods
- Density-Based Clustering Methods
a.) STING - A Statistical Information Grid Approach

•STING was proposed by Wang, Yang, and Muntz (VLDB’97).

In this method, the spatial area is divided into rectangular cells.

There are several levels of cells corresponding to different levels of resolution.
Each cell at a high level is partitioned into several smaller cells at the next lower level.

The statistical information of each cell is calculated and stored beforehand and is used to answer queries.

The parameters of higher-level cells can be easily calculated from the parameters of lower-level cells:
- Count, mean, standard deviation (s), min, max
- Type of distribution: normal, uniform, etc.
Spatial data queries are answered using a top-down approach.

Start from a pre-selected layer, typically one with a small number of cells.

For each cell in the current level, compute the confidence interval that the cell is relevant to the query.

Remove the irrelevant cells from further consideration.

After examining the current layer, proceed to the next lower level.

Repeat this process until the bottom layer is reached.


Advantages:
It is query-independent, easy to parallelize, and supports incremental updates.

Query processing time is O(K), where K is the number of grid cells at the lowest level.


Disadvantages:
All cluster boundaries are either horizontal or vertical; no diagonal boundary is detected.
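
As an illustration of how the parameters of a higher-level cell can be derived from its lower-level cells without revisiting the data, the sketch below combines count, mean, standard deviation, min, and max using the standard identities for pooled statistics. The dictionary layout and the function names are assumptions made for this example, the standard deviation is taken to be the population standard deviation, and at least one child cell is assumed to be non-empty.

```python
import math
import statistics

def cell_stats(values):
    """Statistics stored for a bottom-level cell (hypothetical layout)."""
    return {"n": len(values), "mean": statistics.fmean(values),
            "std": statistics.pstdev(values),
            "min": min(values), "max": max(values)}

def merge_cells(children):
    """Combine child-cell statistics into the parent cell's statistics."""
    children = [c for c in children if c["n"] > 0]   # ignore empty cells
    n = sum(c["n"] for c in children)
    mean = sum(c["mean"] * c["n"] for c in children) / n
    # E[x^2] of the parent, rebuilt from each child's variance and mean
    second_moment = sum((c["std"] ** 2 + c["mean"] ** 2) * c["n"]
                        for c in children) / n
    return {
        "n": n,
        "mean": mean,
        "std": math.sqrt(max(second_moment - mean ** 2, 0.0)),
        "min": min(c["min"] for c in children),
        "max": max(c["max"] for c in children),
    }

# The parent built from two children matches statistics computed on all values.
parent = merge_cells([cell_stats([1, 2, 3]), cell_stats([10, 20])])
direct = cell_stats([1, 2, 3, 10, 20])
print(parent["n"], round(parent["mean"], 2), parent["min"], parent["max"])  # 5 7.2 1 20
print(math.isclose(parent["std"], direct["std"]))                           # True
```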

b.) WaveCluster
It was proposed by Sheikholeslami, Chatterjee, and Zhang (VLDB’98).

It is a multi-resolution clustering approach that applies a wavelet transform to the feature space.
- A wavelet transform is a signal processing technique that decomposes a signal into different frequency sub-bands.
It is both a grid-based and a density-based method.

Input parameters:
- Number of grid cells for each dimension
- The wavelet, and the number of applications of the wavelet transform
How to apply the wavelet transform to find clusters
- Summarize the data by imposing a multidimensional grid structure onto the data space.
- These multidimensional spatial data objects are represented in an n-dimensional feature space.
- Apply the wavelet transform to the feature space to find the dense regions in the feature space.
- Apply the wavelet transform multiple times, which results in clusters at different scales from fine to coarse.
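
The steps above can be illustrated with a deliberately simplified sketch. It assumes a 2-D grid of point counts that has already been built, uses the Haar wavelet only because it is the simplest choice (the wavelet is an input parameter of the method), keeps just the low-frequency sub-band, and treats connected cells above a threshold as clusters. The function names, the threshold, and the grid values are made up for illustration.

```python
def haar_lowpass(grid):
    """One level of a 2-D Haar transform, keeping only the approximation
    (low-frequency) sub-band: each 2x2 block becomes (a + b + c + d) / 2."""
    rows, cols = len(grid) // 2, len(grid[0]) // 2
    return [[(grid[2 * r][2 * c] + grid[2 * r][2 * c + 1] +
              grid[2 * r + 1][2 * c] + grid[2 * r + 1][2 * c + 1]) / 2
             for c in range(cols)] for r in range(rows)]

def connected_components(grid, threshold):
    """Label 4-connected groups of cells whose value exceeds threshold.
    Returns (labels, number_of_clusters); unlabeled cells keep -1."""
    rows, cols = len(grid), len(grid[0])
    labels = [[-1] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] <= threshold or labels[r][c] != -1:
                continue
            labels[r][c] = count
            stack = [(r, c)]
            while stack:                      # flood fill from (r, c)
                y, x = stack.pop()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] > threshold
                            and labels[ny][nx] == -1):
                        labels[ny][nx] = count
                        stack.append((ny, nx))
            count += 1
    return labels, count

# Hypothetical 8 x 8 grid of point counts: two dense regions and one stray point.
counts = [
    [5, 6, 0, 0, 0, 0, 0, 0],
    [6, 5, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 4, 5],
    [0, 0, 0, 0, 0, 0, 5, 4],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
coarse = haar_lowpass(counts)                 # 4 x 4 grid at a coarser scale
labels, k = connected_components(coarse, threshold=2)
print(k)                                      # 2
```

The stray point at row 3, column 3 contributes only 0.5 to its coarse cell and falls below the threshold, which mirrors the outlier-suppression property. Calling haar_lowpass again on coarse would give the next, coarser scale.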
Why is wavelet transformation useful for clustering?
- It uses hat-shaped filters to emphasize regions where points cluster while simultaneously suppressing weaker information at their boundaries.
- It removes outliers effectively.
- It is a multi-resolution method.
- It is cost-efficient.

Major features:
- The time complexity of this method is O(N).
- It detects arbitrarily shaped clusters at different scales.
- It is not sensitive to noise or to input order.
- It is only applicable to low-dimensional data.
c.) CLIQUE - Clustering In QUEst 

•It was proposed by Agrawal, Gehrke, Gunopulos, and Raghavan (SIGMOD’98).

It automatically identifies the subspaces of a high-dimensional data space that allow better clustering than the original space.

CLIQUE can be considered both density-based and grid-based:
- It partitions each dimension into the same number of equal-length intervals.
- It partitions an m-dimensional data space into non-overlapping rectangular units.
- A unit is dense if the fraction of the total data points contained in the unit exceeds the input model parameter.
- A cluster is a maximal set of connected dense units within a subspace.
Partition the data space and find the number of points that lie inside each cell of the partition.

Identify the subspaces that contain clusters using the Apriori principle.

Identify clusters:
- Determine dense units in all subspaces of interest.
- Determine connected dense units in all subspaces of interest.

Generate a minimal description for the clusters:
- Determine maximal regions that cover a cluster of connected dense units for each cluster.
- Determine a minimal cover for each cluster.
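
The first two steps can be sketched for 1-D and 2-D subspaces. This is a simplified illustration, not the full CLIQUE algorithm: the names xi (intervals per dimension) and tau (density threshold as a fraction of all points) are labels chosen here, the data is assumed to lie in [0, 1) in every dimension, and the search stops at 2-D units instead of continuing to higher dimensions.

```python
from collections import Counter
from itertools import combinations

def dense_units(points, xi, tau, lower=0.0, upper=1.0):
    """Return the dense units in every 1-D and 2-D subspace.
    A 1-D unit is (dimension, interval); a 2-D unit is a pair of 1-D units."""
    n, dims = len(points), len(points[0])

    def interval(x):
        return min(int((x - lower) / (upper - lower) * xi), xi - 1)

    cells = [tuple(interval(x) for x in p) for p in points]

    # Step 1: count points per 1-D unit and keep the dense ones.
    counts1 = Counter((d, c[d]) for c in cells for d in range(dims))
    dense1 = {u for u, k in counts1.items() if k / n > tau}

    # Step 2 (Apriori principle): a 2-D unit can only be dense if both of
    # its 1-D projections are dense, so only those candidates are counted.
    counts2 = Counter()
    for c in cells:
        for d1, d2 in combinations(range(dims), 2):
            if (d1, c[d1]) in dense1 and (d2, c[d2]) in dense1:
                counts2[((d1, c[d1]), (d2, c[d2]))] += 1
    dense2 = {u for u, k in counts2.items() if k / n > tau}
    return dense1, dense2

# Hypothetical 3-D data: two groups agree in dimensions 0 and 1 but not in 2.
pts = [(0.1, 0.1, 0.1), (0.15, 0.12, 0.9), (0.12, 0.18, 0.5),
       (0.9, 0.9, 0.3), (0.85, 0.95, 0.7), (0.5, 0.5, 0.2)]
dense1, dense2 = dense_units(pts, xi=4, tau=0.3)
print(sorted(dense2))
# [((0, 0), (1, 0)), ((0, 3), (1, 3))]
```

Both dense 2-D units lie in the subspace spanned by dimensions 0 and 1, so that subspace is the one in which this toy data clusters.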
Advantages
It automatically finds subspaces of the highest dimensionality such that high-density
clusters exist in those subspaces.

It is insensitive to the order of records in input and does not presume some canonical
data distribution.

It scales linearly with the size of input and has good scalability as the number of
dimensions in the data increases.

Disadvantages
The accuracy of the clustering result may be degraded at the expense of the simplicity
of the method.

Summary
Grid-Based Clustering -> It is one of the methods of cluster analysis which uses a
multi-resolution grid data structure.


The End !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
