
A Spatial Data Mining Method by Delaunay Triangulation

In-Soo Kang, Tae-Wan Kim, and Ki-Joune Li

Department of Computer Science, Pusan National University
Kumjeong-Gu, Jangjeon-Dong, Pusan, Korea 609-735
{iskang,twkim,lik}@chronos.cs.pusan.ac.kr
Tel: +82 51 582 1182
Fax: +82 51 515 2208

Abstract

Discovering significant patterns or characteristics that may exist implicitly in huge spatial databases, such as geographical or medical databases, has become an important task. In this paper, we present a spatial data mining method named SMTIN (Spatial data Mining by Triangulated Irregular Network), which is based on Delaunay Triangulation. SMTIN demonstrates important advantages over previous work. First, it discovers even sophisticated patterns, such as nested doughnuts, and the hierarchical structure of the cluster distribution. Second, to execute SMTIN we do not need to know the nature of the distribution a priori, for example the number of clusters, which is indispensable to other methods. Third, experiments show that SMTIN requires less CPU processing time than other methods such as BIRCH and CLARANS. Finally, it is not order-sensitive and handles outliers efficiently.

1. Introduction

The database research community has paid considerable attention to GIS (Geographic Information Systems) because of the huge amount of spatial data [4]. Research on spatial databases has concentrated on storing and retrieving spatial objects efficiently rather than on analyzing the patterns and distribution of spatial data. Recently, spatial database researchers have turned their attention to mining spatial objects*. Spatial data mining is the analysis of geometric or statistical characteristics and relationships of spatial data that may exist implicitly. The major approach of spatial data mining is to cluster spatial data in order to discover implicit information. In spatial data mining, a cluster means a grouping of relevant spatial objects.

Several requirements have been proposed for spatial data mining techniques. First, they should be fast, since the amount of data they process is huge. Second, they should provide rich information. Such information includes the sets of spatial objects in a cluster, their shape and distribution, and statistical results such as density and diameter. Third, outliers should be treated properly. Outliers are spatial objects that are not contained in any cluster and should be discarded during the mining process. However, when new spatial objects are inserted, these outliers must be considered, since they may form a cluster with the newly inserted objects. Previous research has followed distance-based or probability-based approaches. Both approaches satisfy the first and third requirements to some degree, but the second requirement has not been sufficiently satisfied so far. In this paper, we propose a spatial data mining method, SMTIN, which has the following objectives.

• It concentrates on discovering the pattern of distribution of spatial objects and provides rich information. For example, it shows whether the pattern contains holes and nested clusters.
• It should be stable. In other words, the insertion order of spatial objects does not influence the results, and it does not require any a priori knowledge.
• It should be fast.

This paper is organized as follows. Section 1 introduces this paper. We briefly review previous clustering methods in section 2. In section 3, the SMTIN clustering algorithm is introduced. We show the characteristics of the SMTIN method in section 4. In section 5, we compare SMTIN by experiments with two important previous methods, CLARANS and BIRCH. Finally, we conclude the paper and propose our future research.

* In this paper, we assume that the shape of a spatial object is a point.

Permission to make digital/hard copies of all or part of this material for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copyright is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires specific permission and/or fee.
GIS 97 Las Vegas Nevada USA
Copyright 1997 ACM 1-58113-017-1/97/11...$3.50

2. Related Works

Partitioning N objects into k clusters is one of the major issues in statistics and is called cluster analysis. It has been applied to many areas, such as medicine, psychology, and archeology. Clustering is defined as the partitioning or grouping of relevant objects based on their attributes or geometric properties. Recently, this technique has been adopted in the field of spatial data mining. In this
section, we introduce three best-known methods that are based on cluster analysis. What they have in common is that they use distance as the measure for clustering.

PAM [6] was developed to find k medoids that represent k clusters. A medoid is a representative object that is the most centrally located in its cluster. PAM selects k objects arbitrarily as medoids and repeatedly swaps them with other objects until all k objects qualify as medoids. The major disadvantage of PAM is that it compares an object with the entire data set to find a medoid. This results in a slow processing time of O(k(n-k)^2).

CLARANS [8] was developed to overcome the disadvantages of PAM. It uses a sample of the data set to find medoids, and thus needs less processing time at each step when it clusters objects around k medoids. CLARANS is based on a randomized search algorithm. It selects k arbitrary objects as the current objects. These are compared with sampled neighbor objects, and a current object is swapped with a neighbor object when the neighbor satisfies certain conditions. CLARANS swaps repeatedly until it finds k medoids. CLARANS also exhibits several disadvantages. First, it is still slow because of its randomized search, although it is faster than PAM. Second, it cannot guarantee optimal clustering, due to its randomized approach. Third, the k-medoids approach does not provide enough spatial information when the patterns and the distribution of a data set are complex. In particular, it does not present the hierarchical structure of a data set.

BIRCH [11, 12] remedies the shortcomings of CLARANS in several ways. First, it supports localized clustering: at each clustering decision, it does not scan all spatial objects or refer to all currently existing clusters. Second, it treats outliers in an efficient way. Third, it minimizes both CPU processing time and disk I/O time. Fourth, it dynamically inserts new spatial objects into existing clusters without modifying all the clusters. BIRCH clusters data by using the CF-Vector (Clustering Feature) and the CF-Tree. A CF-Vector is a triple <N, LS, SS>, where N is the number of objects in a cluster, LS is the linear sum of the N data objects in the cluster, and SS is the square sum of the data objects in the cluster. The elements of a CF-Tree are CF-Vectors. While a CF-Vector contains distance relationships among spatial objects, the CF-Tree contains information about clusters. Since the size of each node is small, the CF-Tree can be loaded in main memory. BIRCH outperforms CLARANS in that it constructs the CF-Tree in a single scan and accesses the disk less often than CLARANS. The deficiencies of BIRCH are as follows. First, BIRCH concentrates on clustering spatial objects instead of finding patterns of distribution. Thus, it cannot provide enough information about complex and hierarchical patterns. Second, BIRCH generates different clusters for the same input data set depending on the input order and the selection of seed points. In other words, this method is order-sensitive and also seed-sensitive. Finally, we must have a priori knowledge about the nature of the distribution, for example the number of clusters and other input parameters, which is often unrealistic.

In order to overcome these shortcomings, we propose a spatial data mining method named SMTIN that is efficient and effective. It is efficient in terms of processing time, and effective in the sense that it not only clusters spatial objects according to the patterns of distribution but also presents complex and hierarchical patterns of distribution. While previous clustering methods use a radius as the distance measure and calculate the distance from a datum point, such as the k medoids in CLARANS and the centroid X0 in BIRCH, to other spatial objects, SMTIN clusters spatial objects as it traverses from any object to qualified neighboring spatial objects. The traversal property of SMTIN comes mainly from the Delaunay Triangulation method. By the property of Delaunay Triangulation, SMTIN presents a cluster as an encompassing polygon, which makes it possible to discover the shape and the hierarchical structure of clusters. We call an encompassing polygon a contour. The motivation of our research lies in the fact that previous distance-based approaches such as CLARANS and BIRCH cannot cluster spatial objects as they are distributed and cannot present conceptually reasonable clusters for sophisticated and hierarchical data sets. SMTIN is also motivated by the goal of generating clusters that are independent of the input data sequence and of seeds.

3. Clustering Spatial Objects by Delaunay Triangulation

According to [5], a clustering of a set is a partition of its elements that is chosen to minimize some measure of dissimilarity, and the dissimilarity has been defined as the diameter of a cluster. Conventional clustering methods such as PAM, CLARANS and BIRCH therefore partition N spatial objects into k clusters with k medoids so that the diameter is minimized. Since they do not fully consider the geometric properties of spatial objects and rely instead on geo-statistical methods, they fail to discover geometric information such as the shape of clusters. But the information that spatial data mining discovers should include not only statistical grouping but also geometric characteristics. This is the basic motivation of our approach.

Figure 1. A Sample TIN

One efficient way to investigate the geometric properties of spatial objects is Delaunay Triangulation [7], which is the dual graph of the Voronoi Diagram [7]. The Delaunay triangles, or the dual graph of the Voronoi diagram, are represented by a TIN (Triangulated Irregular Network), where nodes represent spatial objects and edges represent nearest pairs of spatial objects, as shown in Figure 1. It has the important property that the nearest objects to a given spatial object are always linked to it by edges, which allows us to analyze proximity relationships between spatial objects. By Euler's formula [7], a TIN has at most 3n-6 edges and 2n-4 triangles for n input data points. It follows that it takes O(n) time to find the contour of a TIN with n data points.

Figure 2. SMTIN Procedure (panels: Initial Dataset, Phase 1, Phase 2)

Now, let us explain SMTIN. It consists of two phases. The first phase builds a TIN from the spatial objects. In the second phase, we eliminate all edges whose distance is greater than a
given threshold, T. Figure 2 shows the procedure of SMTIN.

Algorithm SMTIN
1. Input spatial objects and construct their TIN.
2. Remove edges whose length is greater than threshold T, and find connected components. Each connected component becomes a cluster.
3. Remove clusters whose number of objects is less than a given number n0.
4. Find the contour lines of remaining clusters.

As a result, SMTIN generates clusters and their contour lines. If one or a small number of spatial objects are isolated from the other clusters, we consider them outliers and exclude them from the clustering. We can control outliers with n0. If n0 = 1, we do not exclude any outliers. We can also control the granularity of clusters. By setting threshold T to a large value, we obtain a coarse clustering, and when T is small, clusters contain a small number of elements. Normally, we start with a relatively large threshold to get an overview of the clustering and, by decreasing it, we obtain a finer clustering.

Suppose that the number of input spatial objects is n. Then step 1 of SMTIN requires O(n log n) time to generate the Delaunay triangulation [9]. Since the number of edges is at most 3n-6, the time complexity of step 2 is O(n). Step 3 takes O(n) time, because the number of clusters does not exceed the number of objects. As explained previously, it takes O(n_i) time to find the contour line of the i-th cluster with n_i nodes, where k is the number of clusters and n_1 + n_2 + ... + n_k ≤ n. Therefore, step 4 needs O(n_1) + O(n_2) + ... + O(n_k) = O(n) time. Consequently, every step except step 1 takes time linear in the input size, and the total time complexity of SMTIN is O(n log n).

4. Clustering Sophisticated and Hierarchical Pattern

Spatial data mining methods that assume the shape of clusters are not adequate for analyzing the geometric characteristics of spatial objects [3]. Since there is no general rule about the shape of clusters [10], a cluster may be spherical, linear, or of another shape. In contrast to previous spatial data mining methods, SMTIN does not assume any a priori shape of distribution and clusters points as it visits neighboring points in arbitrary directions. Therefore, it generates clusters as the data points are distributed.

When spatial objects are distributed in a very complex pattern, SMTIN works properly by virtue of the above property. For example, when houses are distributed along a lake, SMTIN traverses along the lake and clusters the houses in a doughnut shape while leaving the inside of the doughnut empty. SMTIN demonstrates its strength when the shape of the distribution is sophisticated and the distribution has a hierarchical structure.

Figure 3. A sophisticated distribution.

Figure 3 shows an example of a sophisticated distribution with four clusters, and it also implies a hierarchical structure. With different thresholds, we obtain the clusterings shown in Figure 4. SMTIN clearly separates the four clusters and discovers the original shape of the clusters, as in Figure 4(a). If we need a finer clustering, we can get the result in Figure 4(b) with a smaller threshold.

Figure 4. SMTIN Clusters: (a) T = 0.02, (b) T = 0.002

We can also find a hierarchical relationship between the two outputs. For example, cluster A in Figure 4(a) consists of several sub-clusters in Figure 4(b), which forms a hierarchical relationship. In the real world, we often find such relations, and SMTIN is very helpful for analyzing them. For example, the distribution of buildings in a city may contain the distributions of residential houses, commercial buildings, factories, and so on. In this case, SMTIN first generates a cluster of buildings. Then, with a smaller threshold value, it generates several clusters of residential houses, commercial buildings and factories. A tree of clusters may be constructed by iterating over threshold values ranging from T1 to T2 (T1 > T2).

5. Comparisons with BIRCH and CLARANS

In this section, we compare the performance and functions of BIRCH and CLARANS with those of SMTIN by experiments. We use the Delaunay Triangulator [9] to implement SMTIN. For the experiments, we have used the same data sets as [11, 12], which are DS1, DS2, and DS3, shown in Figures 5, 6, and 7. Each dataset consists of 100,000 points. The points in DS3 are randomly distributed, while those in DS1 and DS2 are distributed in grid and sine curve patterns, respectively. Their ordering is random, which favors BIRCH and CLARANS over SMTIN, since SMTIN is not order-sensitive.

We assume that the main memory is large enough to load the whole dataset. This assumption evidently favors SMTIN and CLARANS over BIRCH, since one of the advantages of BIRCH is the reduction of the number of disk I/Os. However, by using a spatial access method such as the R*-tree [1], we expect that the number of disk I/Os could be considerably reduced, and we plan to combine the R*-tree with SMTIN to improve its disk I/O performance.

Figure 5. Initial DS1 dataset: grid pattern
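As a rough illustration of steps 1 to 3 of Algorithm SMTIN, the following Python sketch builds the Delaunay edges of a small point set, drops the edges longer than T, and reports the connected components that contain at least n0 points. It is not the authors' implementation: a brute-force empty-circumcircle test (O(n^4), points assumed to be in general position) stands in for the Delaunay Triangulator [9], contour extraction (step 4) is omitted, and the names delaunay_edges and smtin as well as the sample points are hypothetical.

```python
from itertools import combinations
from math import dist


def delaunay_edges(pts):
    """Edges of the Delaunay triangulation, found by brute force.

    A triple of points forms a Delaunay triangle when no other point lies
    strictly inside its circumcircle. O(n^4): for tiny examples only.
    Assumes general position (no four cocircular points).
    """
    edges = set()
    for a, b, c in combinations(range(len(pts)), 3):
        (ax, ay), (bx, by), (cx, cy) = pts[a], pts[b], pts[c]
        d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
        if d == 0:  # collinear triple, no circumcircle
            continue
        sa, sb, sc = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
        ux = (sa * (by - cy) + sb * (cy - ay) + sc * (ay - by)) / d
        uy = (sa * (cx - bx) + sb * (ax - cx) + sc * (bx - ax)) / d
        r2 = (ax - ux) ** 2 + (ay - uy) ** 2
        if all((px - ux) ** 2 + (py - uy) ** 2 >= r2 - 1e-12
               for i, (px, py) in enumerate(pts) if i not in (a, b, c)):
            edges |= {(a, b), (b, c), (a, c)}
    return edges


def smtin(pts, T, n0=1):
    """Steps 1-3: build the TIN, keep edges of length <= T, return the
    connected components (lists of point indices) with at least n0 points."""
    parent = list(range(len(pts)))

    def find(i):  # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in delaunay_edges(pts):
        if dist(pts[a], pts[b]) <= T:
            parent[find(a)] = find(b)

    groups = {}
    for i in range(len(pts)):
        groups.setdefault(find(i), []).append(i)
    return sorted(g for g in groups.values() if len(g) >= n0)


# Hypothetical sample: two groups of points and one isolated point (index 7).
points = [(0, 0), (1, 0.1), (0.2, 1), (1.1, 1.2),
          (10, 0), (11, 0.3), (10.1, 1.1),
          (5, 6)]

print(smtin(points, T=2.0, n0=2))  # [[0, 1, 2, 3], [4, 5, 6]]
print(smtin(points, T=2.0, n0=1))  # the isolated point is kept as [7]
```

With n0 = 2 the isolated point is discarded as an outlier, and with n0 = 1 it is reported as a cluster of its own, which mirrors the role of n0 described for step 3. A larger T (for example 20.0 on this sample) merges everything into one coarse cluster.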


Figure 6. Initial DS2 dataset: sine curve pattern

Figure 7. Initial DS3 dataset: random distribution

Table 1 shows the execution time of the three methods, CLARANS, BIRCH, and SMTIN, on a Sun UltraSPARC 2 with 128 Mbytes of main memory for DS1, DS2, and DS3, respectively.

Table 1. Execution time of CLARANS, BIRCH, and SMTIN

In this table, we excluded the disk writing time required by BIRCH in order to compare the CPU processing time fairly. By comparing the results, we observe that: (1) SMTIN is faster than the other methods, CLARANS and BIRCH. (2) The execution time of SMTIN is independent of the distribution of spatial objects, and SMTIN is consequently a stable clustering method. The reason is that the Delaunay triangulation process mainly determines its execution time, which is almost independent of the nature of the distribution.

In Figure 8, we find that: (1) The way SMTIN presents its clustering results differs from that of CLARANS and BIRCH. The contour line of a cluster in Figure 8c is the actual shape of the cluster, whereas the ellipses in Figures 8a and 8b do not exactly correspond to the contour lines of the clusters. The numbers represent the number of points in each cluster. (2) BIRCH and SMTIN cluster the points nearly the same as the actual clusters, while CLARANS is far from them. (3) We defined the number of clusters for CLARANS and BIRCH a priori as 100, which is the actual number of clusters. This means that we must know the number of clusters a priori for CLARANS and BIRCH, which is often unrealistic. If we give an incorrect number of clusters, the result may be totally different from the actual clusters. In contrast, we do not need to give the number of clusters to run SMTIN, and it still finds the correct clusters.

We find a great difference between SMTIN and the other two methods in Figure 9. (1) While BIRCH and CLARANS discover only a set of local clusters, as shown in Figures 9a and 9b, SMTIN discovers the shape of the clusters given by the sine curve in addition to the set of local clusters by applying different thresholds. For two thresholds T1 and T2, the clusters discovered by SMTIN with T1 and with T2 are hierarchical, as explained in the previous section. This means that SMTIN can find the shapes of the distribution at several scales in a hierarchical way. (2) We observe that the local clusters obtained with T2 correspond exactly to the actual clusters. Obviously, it is important to find a proper threshold value. In order to find a proper threshold, we start with a relatively large threshold and decrease it gradually until we get a good clustering.

Figure 9a. CLARANS Clusters of DS2
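The nesting of the clusters obtained with two thresholds T1 > T2 can be illustrated with a small Python sketch. It rests on stated assumptions rather than on the authors' code: the names components and hierarchy and the sample points are hypothetical, and for brevity the sketch links every pair of points within the threshold instead of thresholding TIN edges. This gives the same connected components, because the Euclidean minimum spanning tree is a subgraph of the Delaunay triangulation, but it costs O(n^2) and is meant for illustration only.

```python
from itertools import combinations
from math import dist


def components(points, T):
    """Clusters at threshold T, as frozensets of point indices: the connected
    components of the graph linking every pair of points at distance <= T."""
    parent = list(range(len(points)))

    def find(i):  # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in combinations(range(len(points)), 2):
        if dist(points[a], points[b]) <= T:
            parent[find(a)] = find(b)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), set()).add(i)
    return [frozenset(g) for g in groups.values()]


def hierarchy(points, T1, T2):
    """Map each coarse cluster (threshold T1) to the finer clusters
    (threshold T2 < T1) that are nested inside it."""
    coarse = components(points, T1)
    fine = components(points, T2)
    return {c: [f for f in fine if f <= c] for c in coarse}


# Hypothetical sample: two tight pairs that lie near each other, and a
# third pair far away.
points = [(0, 0), (0.1, 0), (1, 0), (1.1, 0), (10, 0), (10.1, 0)]

tree = hierarchy(points, T1=2.0, T2=0.5)
for parent_cluster, children in tree.items():
    print(sorted(parent_cluster), "->", [sorted(c) for c in children])
```

On this sample, the coarse cluster {0, 1, 2, 3} splits into the sub-clusters {0, 1} and {2, 3} at the smaller threshold, while {4, 5} stays as it is. Repeating the step over a decreasing sequence of thresholds would yield a tree of clusters of the kind described in Section 4.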


Figure 9c. SMTIN Clusters of DS2

We finally compare BIRCH and SMTIN on dataset DS4, given in Figure 3. The distribution of DS4 is complicated, and its shape is like nested doughnuts. Figures 10(a) and 10(b) show the clusters discovered by BIRCH and SMTIN, respectively. In Figure 10(b), we see that SMTIN correctly clusters DS4 and that each cluster is clearly separated from the others. BIRCH, however, fails to discover the actual clusters, as shown in Figure 10(a).

Figure 10. Clusters of DS4: (a) BIRCH clusters, (b) SMTIN clusters

6. Conclusion

In this paper, we presented a spatial data mining method, SMTIN, which is based on Delaunay Triangulation. In comparison with other spatial data mining methods, such as CLARANS or BIRCH, it has the following advantages:

• It discovers rich information about the distribution of spatial objects, such as the shape of clusters and the hierarchical structure of the cluster distribution, even when the distribution has a sophisticated shape, like nested doughnuts.
• We do not need to know the nature of the cluster distribution a priori, such as the number of clusters.
• It requires less CPU processing time than CLARANS and BIRCH.
• It handles outliers efficiently and is not an order-sensitive method.

An important drawback of our method is that it requires more disk I/O than BIRCH, which needs only one scan of the spatial objects. We expect, however, that the number of disk I/Os could be considerably reduced by using a spatial access method [2]. Future work therefore includes rebuilding SMTIN on top of the R*-tree. We will also extend our method to deal with non-point spatial objects such as lines and regions, as well as points.

Acknowledgements

This work was partially supported by the Ministry of Science and Technology of South Korea under the contract of the National GIS Development Project.

References

[1] Beckmann, N., Kriegel, H.-P., Schneider, R., and Seeger, B., "The R*-tree: An Efficient and Robust Access Method for Points and Rectangles," Proceedings of SIGMOD, 1990, pp. 322-331.
[2] Brinkhoff, T. and Kriegel, H.-P., "The Impact of Global Clustering on Spatial Database Systems," Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994, pp. 168-179.
[3] Faloutsos, C. and Kamel, I., "Beyond Uniformity and Independence: Analysis of R-trees Using the Concept of Fractal Dimension," Proceedings of ACM Conference on Principles of Database Systems (PODS), 1995, pp. 4-13.
[4] Guenther, O. and Buchmann, A., "Research Issues in Spatial Database," ACM SIGMOD Record, vol. 19, no. 4, 1990, pp. 61-68.
[5] Hartigan, J.A., Clustering Algorithms, John Wiley & Sons, 1975.
[6] Kaufman, L. and Rousseeuw, P.J., Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.
[7] Preparata, F.P. and Shamos, M.I., Computational Geometry: An Introduction, Springer-Verlag, 1985.
[8] Ng, R.T. and Han, J., "Efficient and Effective Clustering Methods for Spatial Data Mining," Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994, pp. 144-155.
[9] Shewchuk, J.R., "Triangle: Engineering a 2D Quality Mesh Generator and Delaunay Triangulator," Proceedings First Workshop on Applied Computational Geometry, ACM, 1996.
[10] Seber, A.J., Multivariate Observations, John Wiley & Sons, 1984.
[11] Zhang, T., Ramakrishnan, R., and Livny, M., "BIRCH: An Efficient Data Clustering Method for Very Large Databases," Proceedings of ACM SIGMOD International Conference on Management of Data, 1996, pp. 103-114.
[12] Knorr, E.M. and Ng, R.T., "Finding Aggregate Proximity Relationships and Commonalities in Spatial Data Mining," IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, 1996, pp. 884-897.