A Spatial Data Mining Method by Delaunay Triangulation: In-Soo Kang, Tae-Wan Kim, and Ki-Joune Li
section, we introduce the three best-known methods based on cluster analysis. What they have in common is that they use distance as the measure for clustering.

PAM [6] was developed to find k medoids that represent k clusters. A medoid is the representative object that is most centrally located in its cluster. PAM arbitrarily selects k objects as medoids and repeatedly swaps them with other objects until all k objects qualify as medoids. The major disadvantage of PAM is that it compares an object with the entire data set to find a medoid, which results in a slow processing time of O(k(n-k)^2).

CLARANS [8] was developed to overcome the disadvantages of PAM. It uses a sample of the data set to find medoids, so it needs less processing time at each step when it clusters objects around k medoids. CLARANS is based on a randomized search algorithm. It selects k arbitrary objects as the current objects, compares them with sampled neighbor objects, and swaps a current object with a neighbor when the neighbor satisfies certain conditions. CLARANS repeats the swapping until it finds k medoids. CLARANS also exhibits several disadvantages. First, it is still slow because it uses a randomized search algorithm, although it is faster than PAM. Second, it cannot guarantee an optimal clustering, due to its randomized approach. Third, the k-medoids approach does not present enough spatial information when the patterns and the distribution of a data set are complex. In particular, it does not present the hierarchical structure of a data set.

BIRCH [11, 12] improves on CLARANS in several respects. First, it supports localized clustering: at each clustering step it neither scans all spatial objects nor refers to all currently existing clusters. Second, it treats outliers in an efficient way. Third, it minimizes both CPU processing time and disk I/O time. Fourth, it dynamically inserts new spatial objects into existing clusters without modifying all the clusters. BIRCH clusters data by using the CF-Vector (Clustering Feature) and the CF-Tree. A CF-Vector is a triple <N, LS, SS>, where N is the number of objects in a cluster, LS is the linear sum of the N data objects in the cluster, and SS is the square sum of the data objects in the cluster. The elements of the CF-Tree are CF-Vectors. While a CF-Vector contains distance relationships among spatial objects, the CF-Tree contains information about clusters. Since the size of each node is small, the CF-Tree can be loaded in main memory. BIRCH outperforms CLARANS in that it constructs the CF-Tree in a single scan and accesses the disk less often than CLARANS. The deficiencies of BIRCH are as follows. First, BIRCH concentrates on clustering spatial objects instead of finding patterns of distribution, so it cannot provide enough information about complex and hierarchical patterns. Second, BIRCH generates different clusters for the same input data set depending on the input order and the selection of seed points. In other words, this method is order-sensitive and seed-sensitive. Finally, we must have a priori knowledge about the nature of the distribution, for example the number of clusters and other input parameters, which is often unrealistic.

in BIRCH, to other spatial objects, SMTIN clusters spatial objects as it traverses from any object to qualified neighboring spatial objects. The traversal property of SMTIN mainly comes from the Delaunay Triangulation method. By the property of Delaunay Triangulation, SMTIN presents a cluster as an encompassing polygon, which makes it possible to discover the shape and the hierarchical structure of clusters. We call an encompassing polygon a contour. The motivation of our research lies in the facts that previous distance-based approaches such as CLARANS and BIRCH cannot cluster spatial objects as they are distributed and cannot present conceptually reasonable clusters of a sophisticated and hierarchical data set. SMTIN is also motivated by the need to generate clusters that are independent of the input data sequence and of seeds.

3. Clustering Spatial Objects by Delaunay Triangulation

According to [5], a clustering of a set is a partition of its elements that is chosen to minimize some measure of dissimilarity, and the dissimilarity has been defined as the diameter of a cluster. Conventional clustering methods such as PAM, CLARANS and BIRCH therefore partition N spatial objects into k clusters with k medoids so that the diameter is minimized. Since they do not fully consider the geometric properties of spatial objects and rely instead on geo-statistical methods, they fail to discover geometric information such as the shape of clusters. The information that spatial data mining discovers, however, should include not only statistical grouping but also geometric characteristics. This is the basic motivation of our approach.

Figure 1. A Sample TIN

One efficient way to investigate the geometric properties of spatial objects is Delaunay Triangulation [7], which is the dual graph of the Voronoi Diagram [7]. The Delaunay triangles, that is, the dual graph of the Voronoi diagram, are represented by a TIN (Triangulated Irregular Network), where nodes represent spatial objects and edges link nearest pairs of spatial objects, as shown in Figure 1. A TIN has the important property that the nearest objects to a given spatial object are always linked to it by edges, which allows us to analyze the proximity relationships between spatial objects. By Euler's formula [7], a TIN has at most 3n-6 edges and 2n-4 triangles for n input data points. We can see that it takes O(n) time to find the contour of a TIN with n data points.
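As an illustration of the TIN properties described in this section, the following minimal Python sketch builds a Delaunay triangulation with a brute-force empty-circumcircle test and checks the edge and triangle bounds and the nearest-neighbor property. It is not the authors' implementation: the function names `circumcircle` and `delaunay_triangles` and the sample points are assumptions made for the example, the O(n^4) construction is only suitable for small inputs, and the points are assumed to be in general position (no four points on a common circle).

```python
from itertools import combinations
from math import dist


def circumcircle(a, b, c):
    """Return (center_x, center_y, squared_radius) of the circle through
    a, b, c, or None when the three points are collinear."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    sa, sb, sc = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (sa * (by - cy) + sb * (cy - ay) + sc * (ay - by)) / d
    uy = (sa * (cx - bx) + sb * (ax - cx) + sc * (bx - ax)) / d
    return ux, uy, (ax - ux) ** 2 + (ay - uy) ** 2


def delaunay_triangles(points):
    """Brute force: a triple is a Delaunay triangle when no other input
    point lies strictly inside its circumcircle."""
    triangles = []
    for tri in combinations(range(len(points)), 3):
        circle = circumcircle(*(points[i] for i in tri))
        if circle is None:
            continue
        ux, uy, r2 = circle
        if all((x - ux) ** 2 + (y - uy) ** 2 >= r2 - 1e-9
               for i, (x, y) in enumerate(points) if i not in tri):
            triangles.append(tri)
    return triangles


# Hypothetical sample objects (not data from the paper).
points = [(0.13, 0.27), (4.01, 0.52), (1.07, 3.11), (5.03, 3.94),
          (2.21, 1.43), (3.02, 5.49), (6.05, 2.08)]
n = len(points)
triangles = delaunay_triangles(points)
# Each triangle contributes its three sides; the set removes shared sides.
edges = {e for t in triangles for e in combinations(t, 2)}

# Nearest neighbor of every object, to be compared against the TIN edges.
nearest = {i: min((j for j in range(n) if j != i),
                  key=lambda j: dist(points[i], points[j]))
           for i in range(n)}

print(len(edges), "edges, bound", 3 * n - 6)
print(len(triangles), "triangles, bound", 2 * n - 4)
print(all(tuple(sorted(p)) in edges for p in nearest.items()))
```

For real data sets a dedicated triangulator such as the one described in [9] would replace the brute-force construction; the checks on the resulting edge set stay the same.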
given threshold, T. Figure 2 shows the procedure of SMTIN.

Algorithm SMTIN
1. Input spatial objects and construct their TIN.
2. Remove edges whose length is greater than threshold T, and find connected components. Each connected component becomes a cluster.
3. Remove clusters whose number of objects is less than a given number n0.
4. Find the contour lines of the remaining clusters.

Figure 3 shows an example of a sophisticated distribution with four clusters, which also implies a hierarchical structure. With different thresholds, we obtain the clusterings shown in Figure 4. SMTIN clearly separates the four clusters and discovers their original shapes, as in Figure 4(a). If we need a finer clustering, we can obtain the result in Figure 4(b) with a smaller threshold.
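The four steps of Algorithm SMTIN can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the TIN is built by a brute-force empty-circumcircle test (fine only for small inputs in general position), connected components are found with union-find, and the contour in step 4 is assumed to consist of the edges that bound exactly one triangle whose three sides all survive the threshold. The names `smtin`, `threshold`, `min_size` and the sample points are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import dist


def circumcircle(a, b, c):
    """Center and squared radius of the circle through a, b, c (None if collinear)."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    sa, sb, sc = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (sa * (by - cy) + sb * (cy - ay) + sc * (ay - by)) / d
    uy = (sa * (cx - bx) + sb * (ax - cx) + sc * (bx - ax)) / d
    return ux, uy, (ax - ux) ** 2 + (ay - uy) ** 2


def delaunay_triangles(points):
    """Step 1 (brute force): keep triples whose circumcircle holds no other point."""
    triangles = []
    for tri in combinations(range(len(points)), 3):
        circle = circumcircle(*(points[i] for i in tri))
        if circle is None:
            continue
        ux, uy, r2 = circle
        if all((x - ux) ** 2 + (y - uy) ** 2 >= r2 - 1e-9
               for i, (x, y) in enumerate(points) if i not in tri):
            triangles.append(tri)
    return triangles


def smtin(points, threshold, min_size):
    triangles = delaunay_triangles(points)
    edges = {e for t in triangles for e in combinations(t, 2)}
    # Step 2: remove edges longer than the threshold, then union the rest.
    short = {(i, j) for i, j in edges
             if dist(points[i], points[j]) <= threshold}
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in short:
        parent[find(i)] = find(j)
    groups = defaultdict(list)
    for i in range(len(points)):
        groups[find(i)].append(i)
    # Step 3: drop clusters with fewer than min_size objects.
    clusters = sorted(g for g in groups.values() if len(g) >= min_size)
    # Step 4 (assumption): contour edges bound exactly one surviving triangle.
    kept = [t for t in triangles
            if all(e in short for e in combinations(t, 2))]
    uses = Counter(e for t in kept for e in combinations(t, 2))
    contours = [sorted(e for e, c in uses.items()
                       if c == 1 and e[0] in set(cluster))
                for cluster in clusters]
    return clusters, contours


# Hypothetical data: two groups of four objects and one isolated object.
points = [(0.0, 0.0), (1.0, 0.1), (0.2, 1.0), (1.1, 1.2),
          (10.0, 0.0), (11.0, 0.2), (10.3, 1.1), (11.2, 1.3),
          (5.0, 8.0)]
clusters, contours = smtin(points, threshold=2.0, min_size=3)
print(clusters)
print(contours)
```

Note that neither the number of clusters nor any seed is passed in; the only inputs besides the objects are the threshold T and the minimum cluster size.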
Figure 7. Initial DS3 dataset: random distribution

Table 1. Execution time of CLARANS, BIRCH, and SMTIN

In this table, we excluded the disk writing time required by BIRCH in order to compare the CPU processing time fairly. By comparing the execution times, we observe that: (1) SMTIN is faster than the other methods, CLARANS and BIRCH. (2) The execution time of SMTIN is independent of the distribution of spatial objects, so SMTIN is a stable clustering method. The reason is that its execution time is mainly determined by the Delaunay triangulation process, which is almost independent of the nature of the distribution.

In Figure 8, we find that: (1) The presentation of the clustering results by SMTIN differs from those of CLARANS and BIRCH. The contour line of a cluster in Figure 8c is the actual shape of that cluster, whereas the ellipses in Figures 8a and 8b do not exactly correspond to the contour lines of the clusters. The numbers represent the number of points in each cluster. (2) BIRCH and SMTIN cluster the points nearly the same as the actual clusters, while CLARANS is far from them. (3) We defined the number of clusters for CLARANS and BIRCH a priori as 100, which is the actual number of clusters. This means that we must know the number of clusters a priori for CLARANS and BIRCH, which is often unrealistic. If we give an incorrect number of clusters, the result may be totally different from the actual clusters. In contrast, we do not need to give the number of clusters to run SMTIN, and it still finds the correct clusters.

We find a great difference between SMTIN and the other two methods in Figure 9. (1) While BIRCH and CLARANS discover only a set of local clusters, as shown by Figures 9a and 9b, SMTIN discovers the shape of the clusters given by a sine curve in addition to the set of local clusters, by applying different thresholds. For two thresholds T1 and T2, the clusters discovered by SMTIN with T1 and T2 respectively are hierarchical, as explained in the previous section. This means that SMTIN can find the shapes of the distribution at several scales in a hierarchical way. (2) We observe that the local clusters with T2 exactly correspond to the actual clusters. Obviously, it is important to find a proper threshold value. In order to find a proper threshold, we start with a relatively large threshold and decrease it gradually until we get a good clustering.

Figure 9a. CLARANS Clusters of DS2
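The hierarchical behavior under decreasing thresholds can be illustrated with a short Python sketch. It assumes the TIN edges and their lengths are already available as `(i, j, length)` triples; the edge list, the threshold values and the helper names `components` and `is_refinement` are hypothetical, and the stopping rule for a "good clustering" is left to the user because the text does not define one.

```python
def components(n, edges, threshold):
    """Connected components of the graph that keeps only the edges
    whose length does not exceed the threshold (union-find)."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j, length in edges:
        if length <= threshold:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def is_refinement(fine, coarse):
    """True if every cluster in `fine` lies inside one cluster of `coarse`."""
    return all(any(set(f) <= set(c) for c in coarse) for f in fine)


# Hypothetical TIN edges (i, j, length) for six objects; not data from the paper.
edges = [(0, 1, 1.0), (1, 2, 1.2), (0, 2, 1.5), (2, 3, 3.0),
         (3, 4, 1.1), (4, 5, 0.9), (3, 5, 1.4)]

# Start with a large threshold and decrease it step by step.
levels = {t: components(6, edges, t) for t in (4.0, 2.0, 1.0)}
for t, clusters in levels.items():
    print(t, clusters)
```

Because lowering the threshold can only remove edges, each clustering is a refinement of the one obtained with the larger threshold, which is why the clusters at several scales form a hierarchy.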
Figure 9c. SMTIN Clusters of DS2

We finally compare BIRCH and SMTIN with dataset DS4 given by Figure 3. The distribution of DS4 is complicated and its shape is like nested doughnuts. Figures 10(a) and 10(b) respectively show the clusters discovered by BIRCH and SMTIN. In Figure 10(b), we see that SMTIN correctly carries out clustering with DS4, and each cluster is clearly separated from the others. BIRCH, however, fails to discover the actual clusters, as shown by Figure 10(a).

(a) BIRCH Cluster
Figure 10. Clusters of DS4

6. Conclusion

In this paper, we presented a spatial data mining method, SMTIN, which is based on Delaunay Triangulation. In comparison with other spatial data mining methods, such as CLARANS or BIRCH, it has the following advantages:

• It discovers rich information about the distribution of

Acknowledgements

This work was partially supported by the Ministry of Science and Technology of South Korea under the contract of the National GIS Development Project.

References

[1] Beckmann, N., Kriegel, H-P., Schneider, R., and Seeger, B., "The R*-tree: An Efficient and Robust Access Method for Points and Rectangles," Proceedings of SIGMOD, 1990, pp. 322-331
[2] Brinkhoff, T. and Kriegel, H-P., "The Impact of Global Clustering on Spatial Database Systems," Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994, pp. 168-179
[3] Faloutsos, C. and Kamel, I., "Beyond Uniformity and Independence: Analysis of R-trees Using the Concept of Fractal Dimension," Proceedings of ACM Conference Principles on Database Systems (PODS), 1995, pp. 4-13
[4] Guenther, O. and Buchmann, A., "Research Issues in Spatial Database," ACM SIGMOD Record, vol. 19, no. 4, 1990, pp. 61-68
[5] Hartigan, J.A., Clustering Algorithms, John Wiley & Sons, 1975
[6] Kaufman, L. and Rousseeuw, P.J., Finding Groups in Data: an Introduction to Cluster Analysis, John Wiley & Sons, 1990
[7] Preparata, F.P. and Shamos, M.I., Computational Geometry: An Introduction, Springer-Verlag, 1985
[8] Ng, R.T. and Han, J., "Efficient and Effective Clustering Methods for Spatial Data Mining," Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994, pp. 144-155
[9] Shewchuk, J.R., "Triangle: Engineering a 2D Quality Mesh Generator and Delaunay Triangulator," Proceedings First Workshop on Applied Computational Geometry, ACM, 1996
[10] Seber, A.J., Multivariate Observations, John Wiley & Sons, 1984
[11] Zhang, T., Ramakrishnan, R., and Livny, M., "BIRCH: An Efficient Data Clustering Method for Very Large Databases," Proceedings of ACM SIGMOD International Conference on Management of Data, 1996, pp. 103-114
[12] Knorr, E.M. and Ng, R.T., "Finding Aggregate Proximity Relationships and Commonalities in Spatial Data Mining," IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, 1996, pp. 884-897