CSE4014 - High Performance Computing (EPJ) : Submitted by Project Guide

This document describes a project to accelerate the Density Peak Clustering (DPC) algorithm for large datasets. The DPC algorithm spends most time calculating local density and separation distance for each point. The project aims to speed this up by scanning only a point's neighbors to calculate separation distance, and identifying non-peak points early. The objectives are to accelerate DPC calculation and yield the same clusters as the original algorithm. Preliminary details on DPC clustering and limitations like dimensionality are also provided.

Uploaded by

Ashish Paudel

CSE4014 - High Performance Computing

(EPJ)

Submitted by
Raj Prakash Shrivastav - 18BCE2463
Ashutosh Devkota - 18BCE2465
Ashish Paudel - 18BCE2494

Project Guide
MANJULA V
Associate Professor
School of Information Technology and Engineering
ABSTRACT

• The Density Peak Clustering (DPC) algorithm is a new density-based clustering method.
• It spends most of its execution time on calculating the local density and the separation
distance for each data point in a dataset.
• The purpose of this study is to accelerate its computation.
• On average, the DPC algorithm scans half of the dataset to calculate the separation distance of
each data point.
MODIFICATIONS…!!

• We propose an approach to calculate the separation distance of a data point
by scanning only the neighbors of the data point.
• Additionally, the purpose of the separation distance is to assist in choosing the
density peaks, which are the data points with both high local density and high
separation distance.
• We propose an approach to identify non-peak data points at an early stage to
avoid calculating their separation distances.
• Our experimental results show that most of the data points in a dataset can
benefit from the proposed approaches to accelerate the DPC algorithm.
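
The neighbor-scanning idea can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes the local density is the count of points closer than d_c, breaks density ties by the lower index, and gives the densest point the largest distance to any other point. The function names are hypothetical. If a point has a denser neighbor within d_c, every denser point outside the neighborhood is at least d_c away, so the neighbor scan already yields the exact separation distance; only points without a denser neighbor fall back to a full scan.

```python
import math

def density_and_neighbors(points, d_c):
    """Return (rho, neighbors); neighbors[i] lists the points closer than d_c to i."""
    n = len(points)
    neighbors = [[j for j in range(n)
                  if j != i and math.dist(points[i], points[j]) < d_c]
                 for i in range(n)]
    rho = [len(nb) for nb in neighbors]  # ASSUMED cutoff-kernel density
    return rho, neighbors

def separation(points, d_c, neighbors_only=True):
    """Return (delta, full_scans): separation distances and number of full scans."""
    rho, neighbors = density_and_neighbors(points, d_c)
    n = len(points)

    def denser(j, i):
        # ASSUMED tie-break: equal densities are ordered by the lower index.
        return (rho[j], -j) > (rho[i], -i)

    delta, full_scans = [], 0
    for i in range(n):
        cand = [j for j in neighbors[i] if denser(j, i)] if neighbors_only else []
        if not cand:
            # No denser neighbor within d_c: fall back to scanning all points.
            full_scans += 1
            cand = [j for j in range(n) if denser(j, i)]
        dists = [math.dist(points[i], points[j]) for j in cand]
        if dists:
            delta.append(min(dists))
        else:
            # Densest point: ASSUMED convention is the largest distance to any point.
            delta.append(max(math.dist(points[i], p) for p in points))
    return delta, full_scans

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
       (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
fast, scans_fast = separation(pts, 1.0, neighbors_only=True)
slow, scans_slow = separation(pts, 1.0, neighbors_only=False)
print(fast == slow, scans_fast, scans_slow)  # True 2 10
```

In this toy data set only the two cluster centers need a full scan; the eight remaining points are resolved from their neighborhoods alone.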
PROBLEM STATEMENT AND OBJECTIVES

Accelerating DPC by Scanning Neighbors Only

The objectives of the project are:

• To speed up the density peak clustering algorithm for large data sets.
• To accelerate the calculation of separation distances and yield the same clustering
results as the original DPC algorithm.
• To accelerate the DPC algorithm by identifying a significant portion of the
non-peak data points and avoiding calculating their separation distances.

• Input: the set of data points X ∈ ℝ^{N×M}, the parameter d_c for defining the
neighborhood, and the parameter d_r for selecting density peaks
• Output: the label vector of cluster indices y ∈ ℝ^{N×1}
• Algorithm:
1. Calculate the local density ρ(x_i) for each x_i ∈ X using either (1) or (3).
2. Sort all data points in X by their local densities in descending order.
3. Calculate the separation distance δ(x_i) and the nearest higher-density point σ(x_i) for each x_i ∈ X using (4) and (5), respectively.
4. Select data points with ρ(x_i)δ(x_i) > d_r as density peaks.
5. For each density peak x_i, set y_i = i. // starting point of each cluster
6. For each non-peak data point x_i, set y_i = y_{σ(x_i)}. // cluster assignment
7. Return y.
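
The seven steps above can be sketched in Python as follows. Since equations (1) and (3) to (5) are not reproduced in these slides, the sketch makes several assumptions that are marked in the comments: a cutoff-kernel density (count of points closer than d_c), index-based tie-breaking, the largest distance as the separation distance of the densest point, and the densest point always being treated as a peak so that every chain of assignments ends at a peak. The name `dpc` is hypothetical.

```python
import math

def dpc(points, d_c, d_r):
    """Baseline DPC sketch; returns a cluster label (peak index) per point."""
    n = len(points)

    def dist(a, b):
        return math.dist(points[a], points[b])

    # Step 1: local density, ASSUMED to be the cutoff-kernel count.
    rho = [sum(1 for j in range(n) if j != i and dist(i, j) < d_c)
           for i in range(n)]

    # Step 2: indices by descending density (ASSUMED tie-break: lower index first).
    order = sorted(range(n), key=lambda i: (-rho[i], i))

    # Step 3: separation distance delta and nearest denser point sigma.
    delta = [0.0] * n
    sigma = [None] * n
    for rank, i in enumerate(order):
        if rank == 0:
            # Densest point has no denser point; ASSUMED convention: max distance.
            delta[i] = max((dist(i, j) for j in range(n) if j != i), default=0.0)
            continue
        # Only points earlier in the sorted order count as denser.
        sigma[i] = min(order[:rank], key=lambda j: dist(i, j))
        delta[i] = dist(i, sigma[i])

    # Step 4: density peaks (ASSUMPTION: the densest point is always a peak).
    peaks = {i for i in range(n) if rho[i] * delta[i] > d_r}
    if n:
        peaks.add(order[0])

    # Steps 5-6: visit points in descending density so y[sigma[i]] is already set.
    y = [None] * n
    for i in order:
        y[i] = i if i in peaks else y[sigma[i]]

    # Step 7.
    return y

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
       (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
print(dpc(pts, d_c=1.0, d_r=5.0))  # [4, 4, 4, 4, 4, 9, 9, 9, 9, 9]
```

Step 3 is the part the project targets: in this baseline, each point scans every denser point to find its separation distance.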
Preliminary

– Consider a set of points in some space to be clustered. Let ε be a parameter specifying the radius of a neighborhood
with respect to some point.
– For the purpose of density-based clustering (as in DBSCAN), the points are classified as core points, (density-)reachable points, and outliers, as
follows:
– A point p is a core point if at least minPts points are within distance ε of it (including p).
– A point q is directly reachable from p if q is within distance ε of core point p.
– Points are only said to be directly reachable from core points.
– A point q is reachable from p if there is a path p_1, ..., p_n with p_1 = p and p_n = q, where each p_{i+1} is directly reachable
from p_i.
– All points not reachable from any other point are outliers or noise points.

– Now if p is a core point, then it forms a cluster together with all points (core or non-core) that are reachable from it.
Each cluster contains at least one core point; non-core points can be part of a cluster, but they form its "edge", since
they cannot be used to reach more points.
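
The classification above can be illustrated with a short sketch. It is a brute-force illustration under the stated definitions, not an optimized implementation; the function name `classify_points` and the label strings are hypothetical. A non-core point is reachable exactly when it lies within ε of some core point, because the last step of any path must start from a core point.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'reachable' (non-core), or 'outlier'."""
    n = len(points)
    # The neighborhood of i includes i itself, matching "(including p)".
    hood = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
            for i in range(n)]
    core = {i for i in range(n) if len(hood[i]) >= min_pts}
    labels = []
    for i in range(n):
        if i in core:
            labels.append("core")
        elif any(j in core for j in hood[i]):
            labels.append("reachable")  # directly reachable from a core point
        else:
            labels.append("outlier")
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
print(classify_points(pts, eps=1.0, min_pts=3))
# ['core', 'core', 'core', 'core', 'reachable', 'outlier']
```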
Contd..

■ Reachability is not a symmetric relation since, by definition, no point may be reachable
from a non-core point, regardless of distance (so a non-core point may be reachable, but
nothing can be reached from it).
■ Therefore, a further notion of connectedness is needed to formally define the extent of
the clusters found by DBSCAN. Two points p and q are density-connected if there is a
point o such that both p and q are reachable from o. Density-connectedness is symmetric.
■ A cluster then satisfies two properties:
■ All points within the cluster are mutually density-connected.
■ If a point is density-reachable from any point of the cluster, it is part of the cluster as
well.

• The quality of DPC depends on the distance measure used in the function regionQuery.
• The most common distance metric used is Euclidean distance. Especially for
high-dimensional data, this metric can be rendered almost useless due to the so-called
"curse of dimensionality", making it difficult to find an appropriate value for ε.
• This effect, however, is also present in any other algorithm based on Euclidean
distance.
• DPC cannot cluster data sets well with large differences in densities.
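
The effect of dimensionality on Euclidean distance can be seen in a small experiment. This is only an illustrative sketch on synthetic uniform random data; the function name `relative_contrast`, the sample size, and the seed are arbitrary choices. It measures how much farther the farthest point is than the nearest point from a query, relative to the nearest distance. As the dimension grows, this contrast shrinks, so a fixed neighborhood radius separates near and far points less and less.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, [rng.random() for _ in range(dim)])
             for _ in range(n_points)]
    return (max(dists) - min(dists)) / min(dists)

low_dim = relative_contrast(2)
high_dim = relative_contrast(500)
print(f"dim=2: {low_dim:.2f}   dim=500: {high_dim:.2f}")
```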
Conclusion

• The proposed methods focus on accelerating the calculation of the
separation distance.
• However, it is also possible to improve the DPC algorithm by
accelerating the calculation of the local density.
• Conceptually, the DPC algorithm builds a directed acyclic graph of
all data points with an out-degree ≤ 1. Then, it selects several
data points from the graph as the density peaks.
• Finally, it removes the outgoing links of the density peaks and
breaks the graph into several subgraphs, each of which represents
a cluster.
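
The graph view above can be sketched as follows, assuming each point's single outgoing link is stored as the index of the point it links to (`None` for a point with no link) and that the links contain no cycles. The names `clusters_from_links`, `links`, and `peaks` are hypothetical. Cutting the outgoing links of the peaks turns each peak into a root, and every other point takes the label of the root its chain of links leads to.

```python
def clusters_from_links(links, peaks):
    """Return, for each point, the index of the root (peak) of its subgraph."""
    n = len(links)
    # Remove the outgoing link of every density peak.
    parent = [None if i in peaks else links[i] for i in range(n)]
    label = [None] * n
    for i in range(n):
        path = []
        j = i
        # Follow links until reaching a root or an already-labelled point.
        while label[j] is None and parent[j] is not None:
            path.append(j)
            j = parent[j]
        root = label[j] if label[j] is not None else j
        label[j] = root
        for k in path:
            label[k] = root
    return label

# Point 9 originally links to point 4; selecting 9 as a peak cuts that link.
links = [4, 4, 4, 4, None, 9, 9, 9, 9, 4]
print(clusters_from_links(links, peaks={4, 9}))
# [4, 4, 4, 4, 4, 9, 9, 9, 9, 9]
```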
