0% found this document useful (0 votes)
30 views28 pages

Aiml 5th Module Part2

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF or read online on Scribd
0% found this document useful (0 votes)
30 views28 pages

Aiml 5th Module Part2

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF or read online on Scribd
You are on page 1/ 28
fl Ye) oy cam] Clustering Algorithms “Wherever you see a successful business, someone once made a courageous decision.” — Veter Drucker Cluster analysis is a technique of partitioning a collection of unlabelled objects, with many attributes, into meaningful disjoint groups or clusters. This chapter aims to provide the basic concepts of clustering algorithms. Lear * Introduce the concepts of clustering + Highlight the role of distanwe measures in clustering prowess * Provide a taxonomy of clustering algorithms * Explain hierarchical clustering algorithms ‘+ Explain partitional clustering algorithms + Briefly explain density-based, grid-hased, and prohabilistie madel-hased clustering techniques Discuss the validation techniques for clustering algorithms 13.1 INTRODUCTION TO CLUSTERING APPROACHES Cluster analysis is the fundamental task of unsupervised learning. Unsupervised learning involves exploring the given dataset. Cluster analysis is a technique of partitioning a collection, of unlabelled objects that have many attributes into meaningful disjoint groups or clusters. This 1s done using a trial and error approach as there are no supervisors available as in classification. ‘The characteristic of clustering is that the objects in the clusters or groups are similar to each other within the clusters while differ from the objects in other clusters significantly. The input for cluster analysis is examples or samples. These are known as objects, data points or data instances. All these terms are same and used interchangeably in this chapter. All the samples or objects with no labels associated with them are called unlabelled. The output is = ———____ 362. + Machine Learning = eT the set of clusters (or groups) of similar data if it exists in the input. For example, the following, pul ple, Figure 13.1(a) shows data points or samples with two features shown in difterent shaded samples and Figure 13.1(b) shows the manually drawn ellipse to indicate the clusters formed. ‘cluster visualisation 45, 2 10, 5 2 75 3 5. 23] ox o « 2 3 «@ 5 6 ? Samoles @) Cluster visualisation value samples «) Figure 13.1: (a) Data Samples (b) Clusters’ Description Visual identification of cluctere in this case is easy as the evamples have only two features. But, when examples have more features, say 100, then clustering cannot be done manually and automatic clustering algorithms are required. Also, automating the clustering process is desirable as these tasks are considered difficult by humans and almost impossible. All clusters are repre- sented by centroids, For example, if the input examples or data is (3, 3), (2, 6) and (7, 9), then 7.34649 the centroid is given as @ oe - # ) = (4, 6). The clusters should not overlap and every luster should represent only one class. Therefore, clustering algorithms use trial and error method to form clusters that can be converted to labels. Thus, the important differences between classifi- cation and clustering are given in Table 13.1. [ —._?A>€_—_______— ctstering Algorithms . 363 able 13.1: Differences between Classification and Clustering ro Ct Classification Unsupervised learning and cluster formation are | Supervised learning with the presence of a stone by ial and error as there is no supervisor_| supervisor to provide training and testing data Unlabeled data Labelled data No prior knowledge in dustering Knowledge of the domain ls must t label the samples ofthe dataset Cluster results are dynamic Once a label is assigned, it does not change Applications of Clustering 1. Grouping based on customer buying patterns 2, Profiling of customers based on lifestyle . In information retrieval applications (like retrieval of a document from a collection of documents) . Identifying the groups of genes that influence a disease Identification of organs that are similar in physiology functions ‘Taxonomy of animals, plants in Biology Clustering based on purchasing behaviour and demography Document indexing 9, Data compression by grouping similar objects and finding duplicate objects, euawe Challenges of Clustering Algorithms A huge collection of data with higher dimensions (Le., features or attributes) can pose a problem for clustering algorithms. With the arrival of the internet, billions of data are available for clustering algorithms. This is a difficult task, as scaling is always an issue with clustering algorithms. Scaling is an issue where some algorithms work with lower dimension data but do not perform well for higher dimension data. Also, units of data can post a problem, like same weights in kg and come in poundo can pose a problem in clustering. Designing a proximity measure is also a big challenge. The advantages and disadvantages of the cluster analysis algorithms are given in Table 13.2. Table 13.2; Advantages and Disadvantages of Clustering Algorithms es Eaters 1. | Cluster analysis algorithms can handle missing | Cluster analysis algorithms are sensitive to data and outliers. initialization and order of the input data. 2 | Can help classifiers un labelling the unlabelled | Often, the number of clusters present in the data. Semi-supervised algorithms use cluster | data have to be specified by the user. analysis algorithms to label the unlabelled data and then use classifiers to classify them. (Continued) gee © Machine Learning Advantages Itiseasy toexplainthe clusteranalysisalgorithms and to implement them, Clustering is the oldest technique in statistics | Designing a proximity measure for the given and it is easy to explain. It is also rel vaiiaeti to implement iso relatively easy | data is an issue Peed Scaling is a problem. 13.2 PROXIMITY MEASURES Sean for ‘Additonal Information on Proximity Measures! Clustering algorithms need a measure to find the similarity or dissimilarity among the objects to group them. Similarity and dissimilarity are collectively known as proximity measures. Often, the distance measures are used to find similarity between two objects, say i and j. Distance measures are known as dissimilarity measures, as these indicate how one object is different from another. Measures like cosine similarity indicate the similarity among objects. Distarwe measures and similaity measures are two sides of the same coin, a more distance indicates more similarity and vice versa. Distance between two objects, say i and j, is denoted by the symbol D,. ‘The properties of the distance measures are: 1. D, is always positive or zero. 2. D,=0, ie, the distance between the object to itselfis 0. 3. D,= Dy, Thus property is called symmetry. 4. D, low and low < medium. Quantitative variables are real or Integer numbers ot binary data. In binary data, the attributes of the object can take a Boolean value. Objects whose attributes take binary data are called binary objects. Tot us review some of the proximity measures. —— _—atastering algorithms 6 365 Quantitative Variables Some of the qualitative variables are discussed below. guclidean Distance it is one of the most important and common distance measures. It is also called as L, norm. It can be defined as the square root of squared differences between the coordi- .3 of a pair of objects. The Euclidean distance between objects x, and x, with k features is given as follows: xy (13.1) ‘The advantage of Euclidean distance is that the distance does not change with the addition of new objects. But the disadvantage is that if the units change, the resulting Euclidean or squared Euclidean changes drastically. Another disadvantage is that as the Euclidean distance involves a square root and a square, the computational complexity is high for implementing the distance for millions or billions of operations involved. City Block Distance City block distance is known as Manhattan distance. This is also known as boxcar, absolute value distance, Manhattan distance, Taxicab or L, norm. The formula for finding the distance is given as follows: Distance (x, x) (132) Chebyshev Distance Chebyshev distance is known as maximum value distance. This is the absolute magnitude of the differences between the coordinates of a pair of objects. This distance is called supremum distance or L,,, or L_ norm. he formula for computing Chebyshev distance is given as follows: Distance (x, x) (13.3) Suppose, it the coordinates of the objects are (J, 3) and (5, 8), then what is the Chebyshev distance? Solution: The Euclidean distance usin By, (13.1) fs given as follows: Distance (x,, x,) = (0-5? + (8-8) 50 = 7.07 ‘The Manhattan distance using Eq. (13.2) is given as follows: Distance (x, x) = (0-5) + (3 -8)] ‘The Chebyshev distance using Eq. (13.3) is given as follows: Max {0-5}, [3-8|} = Max (5,5}=5 —_____ Minkowski Distance In general, all the above distance measures can be generalized as: [y (134) x, Distance (x, x,) = ( fs 366 © Machine Leaming——— This is called Minkowski distance, Here, ris a parameter. When the value of ris 1, the distance measure is called city block distance. When the value of r is 2, the distance measure is called Euclidean distance. When, ris «, then this is Chebyshev distance. Binary Attributes Binary attributes have only two values. Distance measures discussed above cannot be applied to find distance between objects that have binary attributes. For finding the distance among objects with binary objects, the contingency Table 13.3 can be used. Let x and y be the objects consisting cof N-binary objects. Then, the contingency table can be constructed by counting the number of matching of transitions, 0-0, 0-1, 1-0 and 1-1. Table 13.3: Contingency Table Ser ae) 0 a_|b 1 cia In other words, ‘a’ is the number of attributes where x attribute is 0 and y attribute is 0. ‘bis the number of attributes where x attribute is 0 and y attribute is 1, ‘c’ is the number of attributes where x attribute is 1 and y attribute is 0 and ‘a’ is the number of attributes where x attribute is 1 and y attribute is 1. Simple Matching Coefficient (SMC) SMC is a simple distance measure and is defined as the ratio of number of matching attributes aid the suwuiber uf allsibutes. The formula is given as: ee (3.5) atb+c+d The values of a, b, ¢, and d can be observed from the Table 13.4. If the given vectors are x = (1, 0, 0) and y= (1, 1,1) then find the SMC and Jaccard coefficient? Solution: It can be seen from Table 13.2 that, a=0,b=2, ¢~Oandd=1. . 5 ad _oape The SMC using Eq, (13.5) is given as ———" = 041/3 = 0.38 = 1/3 = 0.33 a Jaccard coefficient using Eq. (13.6) is given as J= -—— —" Hamming Distance Hamming distance is another useful measure that can be used for knowing the sequence of characters or binary values. It indicates the number of positions at which the characters or binary bits are different. For example, the hamming distance between x = (1 0 1) and y = (1 1 0) is 2 as x and y differ in two positions. The distance between two words, say wood and hood is 1, as they differ in only one character. Sometimes, more complex distance measures like edit distance can also be used. Clustering Algorktims «367 categorical Variables In many cases, categorical values are used. It is just a code or symbol to represent the values. For example, for the attribute Gender, a code 1 can be given to female and 0 can be given to male. To calculate the distance between two objects represented by variables, we need to find only whether they are equal or not. This is given as: fH ifxey Distance (x, ¥) = 19 ig y (37) ordinal Variables Ordinal variables are like categorical values but with an inherent order. For example, designation is an ordinal variable. If job designation is 1 or 2 or 3, it means code 1 is higher than 2 and code 2 ishigher than 3. It is ranked as 1>2>3, Let us assume the designations of oltice employees are clerk, supervisor, manager and general manager. These can be designated as numbers as clerk = 1, supervisor = 2, manager = 3 and general manager = 4. Then, the distance between employee X who is a clerk and Y whois a manager can be obtained as: Distance (x, 1) = Peston (X) — pesitin (| (38) . n-1 Here, position (X) and position(Y) indicate the designated numerical value. Thus, the distance between X (Clerk = 1) and ¥ (Manager ~3) uoing Eq, (13.8) ie given as: Distance x, p= esti (= postion 09] _ N=. _2 65 Vector Type Distance Measures For text classification, vectors are normally used. Cosine similarity is a metric used to measure how similar the documents are irrespective of their size. Cosine similarity measures the cosine of the angle between two vectors projected int a utulti-diuweusiual space. The similarity function for vector objects can be defined as: EX, xy sim (XY) = es PIP Bee x Eve The numeration is the dot product of the vectors A and B; The denominator is the product of the norm of vectors A and D. (13.9) Eanes similarity? Seaton: The dot product of the vector is 1x 0+1 x 1+0x1=1. The norm of the vectors A and Bis V2 If the given vectors are A = (1. 1, 0) and B= (0. 1, 1}, then what is the cosine So, the cosine similarity using Eq, (13.9) is given as oe 360. + Machine Learning —— ooo Now, let us discuss the types of clustering algorithms, which include hierarchical, partitional, density-based and grid-based algorithms, ‘Scan for information on “Taxonomy of Clustering Algorithms’ 13.3 HIERARCHICAL CLUSTERING ALGORITHMS Hierarchical methods produce a nested partition of objects with hierarchical relationships among objects. Often, the hierarchy relationship is shown in the form of a dendrogram, Hierarchical methods include categories, agglomerative methods and divisive methods, In agglomerative methods, initially all individual samples are considered as a cluster, that is, a cluster with a single element, Then, they are merged and the process is continued to get a single cluster. Divisive methods use another kind of philosophy, where a single cluster of all samples of the dataset taken initially io chosen and then partitioned. This partition process is continued until the cluster is split into smaller clusters Agglomerative methods merge clusters to reduce the number of clusters. This is repeated each time while merging two closest clusters to get a single cluster. The procedure of agglomerative clustering is given as follows: RES ee One Roce 1. Place cach N sample or dala instaue into a separate cluster. Su, initially N clusters are available. 2. Repeat the following steps until a single cluster is formed: (a) Determine two most similar clusters. (b) Merge the two clusters into a single cluster reducing the number of clusters as N-1. 3. Choose resultant clusters of step 2 as result: All the clusters that are produced by hierarchical algorithms have equal diameters. The main disadvantage of this approach is that once the cluster is formed, it is an irreversible decision. 13.3.1 Single Linkage or MIN Algorithm Hierarchical clustering algorithms producenested clusters, which can be visualized as a hierarchical tree or dendrogram. The idea behind this approach is proximity among clusters. In single linkage algorithm, the smallest distaste (2, y), where x is from one cluster anul y is fruit another duster, is the distance between all possible pairs of the two groups or clusters (or simply the smallest distance of two points where points are in different clusters) and is used for merging the clusters. This corresponds to finding of minimum spanning tree (MST) of a graph. ‘The distance measures between individual samples or data points is already demonstrated in the previous section 13.2. To understand the single linkage algorithm, go through the following numerical problem that involves finding the distance between clusters. Clustering Algorithms + 369 ooo? pee Consider the array of points as shown in the following Table. 13.4. Table 13.4: Sample Data Cn 0 1{4 1 2| 8 2 5 | 10 3 2 | 16 4 4 | 28 Solution: The similarity table among the variables is computed as shown in Table 13.4. Euclidean distance is computed as shown in the following Table 13.5. ; tae 3 - 10.198 2 4 The minimum distance is 3.606. Therefore, the items 1 and 2 are clustered together. The resultant is shown in Table 13.6, Table 13.6: After Iteration 1 {1,2} Ga23)| 10.630 | 20.124 ° eo eee 27.295 3 = _ [10198 4 = ‘The distance between the group (1, 2} and items (0}, (3) and (4 is computed using the formula: D(C, C,) = minimum, 4. d(@, 2) (13.10) Here, D,, is the single linkage distance, C, C, are clusters and d (a, b) is the distance between the elements @ and b. ‘Thus, the distance between (1, 2} and {0} is: ‘Minimum ((1, 0}, (2, 0}} = Mintmum (4.123, 7211} = 4.123, ‘The distance between {1, 2} and (3} is given as: Minimum (1, 3}, (2, 3)) = Minimum (14.412, 10.630) = 10.630 The distance between {1, 2} and (4) is given as: Minimum (1, 4], (2, 4]} ~ Minimum (23.324, 20.124) - 20.124 370 = Machine Learning This is shown in Table 13.6. The minimum distance of Table 13.6 is 4.123. Therefore, (0, 1, 2) is clustered together. This result is shown in Table 13.7. Table 13,7: After Iteration 2 10.630 | 20.124 z 4 Thus, the distance between {U, 1, 2} and (3) using Eq. (13.10) is given as Minimum ((0, 3}, (1, 3}, (2, 3}} = Minimum (17.804, 14.142, 10.630} = 10.630. ‘Thus, the distance between (0, 1, 2) and (4) is: ‘Minimum {{0, 4}, (1, 4), (2, 4}} = Minimum (27.295, 23.324, 20.124} = 20.124. This is shown in Table 13,7, The minimum is 10.198, ‘Therefore, the items {3, 4) are merged. And, the items (1, 2, 3} and (4, 5] are merged. The resultant is shown in Table 13.8. Table 13.8: After Iteration 3 Cee ‘The computation of ? in Table 13.8 is not needed as no other items are left. Therefore, the clusters (0, 1, 2} and (3, 4) are merged. Dendrograms are used to plot the hierarchy of clusters. Dendrogram for the above clustering is shown in the following Figure 13.2. oS 10 8 6 4 1 2 ° ° 1 2 3 4 Dendrogram for Table 13.4 es es et Clustering Algorithms + 374 13.3.2 Complete Linkage or MAX or Clique In complete linkage algorithm, the distance (x, y), (where x is from one cluster and y is from another cluster), is the largest distance between all possible pairs of the two groups or clusters (or simply the largest distance of two points where points are in different clusters) as given below. It is used for merging the clusters, De (C,,C)) = maximum,.< ec, (as B) as) Dendrogram for the above clustering is shown in Figure 13.3. 25. 20. 5 10 ° | 0 1 2 3 4 Figure 13.3: Dendrogram for Table 13.4 13.3.3 Average Linkage In case of an average linkage algorithm, the average distance of all pairs of points across the | dusters is used to form clusters. The average value computed between clusters c, ¢, is given as follows: | DyCC)=—— E ata,d) (132) OT tty wcycy Here, m, and m, are sizes of the clusters. The dendrogram for Table 13.4 is given in Figure 13.4. 00 - 0 1 2 3 4 Figure 13.4: Dendrogram for Table 13.4 572 + Machine Learning ——— 13.3.4 Mean-Shift Clustering Algorithm Mean-shift is a non-parametric and hierarchical clustering algorithm. This algorithm is also known as mode seeking algorithm or a sliding window algorithm. It has many applications in image processing and computer vision, There is no need for any prior knowledge of clusters or shape of the clusters present in the dataset. The algorithm slowly moves from its initial position towards the dense regions. The algorithm uses a window, which is basically a weighting function. Gaussian window is 2 good example of a window. The radius of the kernel ie called bandwidth. The entire window is called a kernel. The window is based on the concept of kernel density function and its aim is to find the underlying data distribution. The method of calculation of mean is dependent on the choice of windows. If a Gaussian window is chosen, then every point is wsigned a weight that decreases as the distance from the kernel center increases. The algorithm is given below. Vober 4 Step 1: Design a window. Step 2: Place the window ona set of data points. Step 3 Compnte the monn for all the points that come under the window. Step 4: Move the center of the window to the mean computed in step 3. Thus, the window moves towards the dense regions, The movement to the dense region is controlled by a mean shift vector. The mean shift vector is given as: 1 i (x, — x) (13.13) Ree) (13.13) Here, K is the number of points and S, is the data points where the distance from data points x, and centroid of the kernel x is within the radius ot the sphere. Ihen, the centroid is updated as x= x +0, Step 5: Repeat the steps 3-4 for convergence. Once convergence is achieved, no further points can be accommodated. Advantages 1, No model assumptions 2, Suitable for all non-convex shapes 3, Only one parameter of the window, thal 4. Robust to noise 5. No issues of local minima or premature termination s handwidth is required Disadvantages 1. Selecting the bandwidth is a challenging task. If it is larger, then many clusters are missed. it is small, then many points are missed and convergence occurs as the problem. 2. The number of clusters cannot be specified and user has no control over this parameter. — ee taistoring aigoritns 6 373 ‘scan for “Additional Examples’ | Gi ie 13.4 PARTITIONAL CLUSTERING ALGORITHM ‘kemeans’ algorithm is a straightforward iterative partitional algorithm. Here, k stands for the user specified requested clusters as users are not aware of the clusters that are present in the dataset. The k-means algorithm assumes that the clusters do not overlap. Therefore, a sample or data point can belong to only one cluster in the end. Also, this algorithm can detect clusters of shapes like circular or spherical, Initially, the algorithm needs to be initialized. The algorithm can select k data points randomly or use the prior knowledge of the data. In most cases, in k-means algorithm setup, prior knowledge is absent. The composition of the cluster is based on the initial condition, therefore, initialization is am important task. The sample or data points need to be normalized for better, performance. The concepts of normalization are covered in Chapter 3. The core process of the k-mean algorithm is assigning a sample to a cluster, that is, assigning each sample or data point to the k cluster centers based on its distance and the centroid of the dusters. This distance should be minimum. As a new sample is added, new computation of mean veetora of the points for tual luster to which sample is assigned is required. Therefore, this iterative process is continued until no change of instances to clusters is noticed. This algorithm then terminates and the termination is guaranteed. Step 1: Determine the number of clusters before the algorithm is started. This is called k. Step 2: Choose k instances randomly. These are initial cluster centers. Step 3: Compute the mean of the initial clusters and assign the remaining sample to the closest cluster based on Euclidean distance or any other distance measure between. the instances and the centroid of the chisters. Step 4: Compute new centroid again considering the newly added samples. Step 5: Perform the steps 3-4 till the algorithm becomes stable with no more changes in assignment of instances and clusters. kemeans can also be viewed as greedy algorithm as it involves partitioning n samples to k clusters to minimize Sum of Squared Error (SSE). SSE is a metric that is a measure of error that gives the sum of the squared Kuclidean distances of each data to its closest centroid. It is given as: & SSE = Eedist(c,, x? 13.14) Here, ¢, is the centroid of the i cluster, x is the sample or data point and dist is the Euclidean distance. The aim of the k-means algorithm is to minimize SSE. 374 + Machine Learming Advantages 1. Simple 2, Easy to implement Disadvantages 1. Itis sensitive to initialization process ae change of initial pointe leads to different clusters. 2, Ifthe samples are large, then the algorithm takes a lot of time. How to Choose the Value of k? Itis obvious that k is the user specified value specifying the number of clusters that are present, Obviously, there are no standard rules available to pick the value of k. Normally, the k-means algorithm is run with multiple values of k and within group variance (sum of squares of samples with its centroid) and plotted as a line graph. This plot is called Flhow curve. The optimal or best value of k can be determined from the graph. The optimal value of k is identified by the flat or horizontal part of the Elbow curve. Complexity The complexity of k-means algorithm is dependent on the parameters like 1», the number of samples, k, the number of clusters, @(nkId). is the number of iterations and d is the number of attributes. The complexity of k-means algorithm is O (7) Consider the following set of data given in Table 13.9. Cluster it using k-means algorithm with the initial value of objects 2 and 5 with the coordinate values (4, 6) and (12, 4) as initial seeds. Table 13.9: Sample Data Solution: As per the problem, choose the objects 2 and 5 with the coordinate values. Hereafter, the objects’ id is not important. The samples or data points (4, 6) and (12, 4) are started as two dlusters as shown in Table 13.10. Initially, centroid and data points are same as only one sample is involved. Table 13.10: initial Cluster Table 4,6) (12,4) Centroid 1 (4, 6) Centroid 2 (12, 4) Iteration 1: Compare all the data points or samples with the centroid and assign to the nearest sample. Take the sample object 1 (2, 4) from Table 13.9 and compare with the centroid of storing aigoritn ms «375, the clusters in Table 13.10. The distance is 0. Therefore, it remains in the same cluster. Similarly, consider the remaining samples. For the object 1 (2, 4), the Kuclidean distance between it and the centroid is given as: Dist (1, centroid 1) = @—4? + a-6p = V8 Dist (1, centroid 2)= (2—12)' + (4— 4? = V0 = 10 Object 1 is clover to the centroid of cluster 1 and hence assign it to cluster 1. Ihis is shown in able 13.11. Object 2 is taken as centroid point. For the object 3 (6, 8), the Euclidean distance between it and the centroid points is given ac: Dist (3, centroid 1)= (6-4)? + (6-6) = V8 Dist (3, centroid 2) = (6 = 12)" + (8-4) = /52 Object 3 is closer to the centroid of cluster 1 and hence remains in the same cluster 1. Proceed with the next point object 4(10, 4) and again compare it with the centroids in Table 13.10. Dict (4, centroid 1) = Jaap sO = VES Dist (4, centroid 2)= (10 - 127° + (4-47 = V4 =2 Object 4 is closer to the centroid of cluster 2 and hence assign it to the cluster table. Object 4 is in the same cluster. The final cluster table is shown in Table 13.11. Obviously, Object 5 is in Cluster 3. Recompute the new centroids of cluster 1 and cluster 2. ‘They are (4, 6) and (11, 4), respectively. Table 13.11: Cluster Tahle After Iteration 1 Cluster2 (20,4) (2,4) 8) Centroid 1 (4,6) | Centroid 2 (11,4) ‘The second iteration is ctarted again with the Table 13.11. Obviously, the point (4, 6) remains in cluster 1, as the distance of it with itself is 0. The remaining objects can be checked. Take the sample object 1 (2, 4) and compare with the centroid of the clusters in Table 13.12. Dist (1, centroid 1) = (2-4) +@-6) =v8 Dist (1, centroid 2) = (2-11)? + (4-4) = 81 =9 Object 1 is closer to centroid of cluster 1 and hence remains in the same cluster. Take the sample object 3 (6, 8) and compare with the centroid values of clusters 1 (4, 6) and cluster 2(11, 4) of the Table 13.12. Dist (3, centroid 1)= y(6—4)* + (= 6)" = 6-11? + (8-4 = Van Dist (3, centroid 2) 376 + Machine Learning — i Object 3 is closer to centroid of cluster 1 and hence remains in the same cluster. Take the sample object 4 (10, 4) and compare with the centroid values of clusters 1 (4, 6) and cluster 2 (11, 4) of the Table 13.12: Dist (4 controid 1) 0 ay a=) = V0 Dist (3, centroid 2) = (10-11? +(@—4y = Ji =1 Object 5 Is closer to centroid of cluster 2 and hence remains in the same cluster. Obviously, the sample (12, 4) is closer to its centroid as shown below: Dist (5, centroid 1) = (12-4) + (46) = Vea Dist (5, centroid 2) = (12-11)? + (4-4)? = Vi = 1. Therefore, it remains in the same cluster. ‘Object 5 is taken as centroid point. The final cluster Table 13.12 is given below: ‘able 13.12: Cluster Table After Iteration 2 69) (0,4) (24) (12,4) an) Centroid (4, 6) Centroid (11, 4) ‘There is no change in the cluster Table 13.12. It is exactly the same; therefore, the k-means algorithm terminates with two clusters with data points as shown in the Table 13.12. ng “ ec a | o 13.5 DENSITY-BASED METHODS Density-based spatial clustering of applications with noise (DBSCAN) is one of the density-based algorithms. Density of a region represents the region where many points above the specified threshold are present. In a density-based approach, the clusters are regarded as dense regions of objects that are separated by regions of low density such as noise. This 1s same as a human's intuitive way of observing clusters. ‘The concept of density and connectivity is hased on the local distance of neighbours ‘The functioning of this algorithm is based on two parameters, the size of the neighbourhood (¢) and the minimum number of points (m). 1. Core point ~ A point is called a core point if it has more than specified number of points (m) within e-neighbourhood. 2. Border point - A point is called a border point if it has fewer than ‘m’ points but is @ neighbour of a core point. 3. Noise point ~ A point that is neither a core point nor border point ——— Clustering Algorithms » 377 The main idea is that every data point or sample should have atleast a minimum number ‘of neighbours in a neighbourhood. The neighbourhood of radius € should have atleast m points, ‘The notion of density connectedness determines the quality of the algorithm. The following connectednesa measures are used for thie algorithm. 1, Direct density reachable ~The point X is directly reachable from ¥, if: (a) X is the eneighborhood of Y (b) Yisa core point 2. Densely reachable ~ The point X is densely reachable from Y. if there is a set of core points that leads from Y to X. 3. Density connected ~ X and ¥ are densely connected if Z is a core point and thus points Xand Y are densely reachable from Z. Step 1: Randomly select a point p. Compute distance between p and all other points. Step 2: Find all points trom p with respect to its neighbourhood and check whether it has minimum number of points m. Ifso, it is marked as a core point. Stop 3: IF itis a core point, then a new cluster is formed, or existing cluster is enlarged. Step 4: If it is a border point, then the algorithm moves to the next point and marks it as visited. Step 5: If it is a noise point, they are removed, Step 6: Merge the clusters ifit is mergeable, dist (¢,, ¢,) <. Step 7: Repeat the proccss 3-6 till all points are processed, Advantages 1. No need for specifying the number of clusters beforehand 2. The algorithm can detect clusters of any shapes, 3. Robust to noise 4. Few parameters are needed The complexity of this algorithm is O(wlogn). 13.6 GRID-BASED APPROACH Grid-based approach is a space-based approach. It partitions space into cells, the given data is fitted on the cells for cluster formation. There are three important concepts that need to be mastered for understanding the grid-based schemes. They are: 1. Subspace clustering 2. Concept of denise cells 3. Monotonicity property Let us discuss about them. 378 + Machine Learning — Subspace Clustering Grid-based algorithms are useful for clustering high-dimensional data, that is, data with many attributes. Some data like gene data may have millions of attributes. Every attribute is cailed a dimension. Dut all the attributes are not needed, as in many applications one may not require all the attributes. For example, an employee's address may not be required for profiling his diseases. ‘Age may be required in that case. So, one can conclude that only a subset of features is required. For example, one may be interested in grouping gene data with similar characteristics or organs that have similar functions, Finding subspaces is difficult. For example, N dimensions may have 2“ subspaces. Exploring all the subspaces is a difficult task. Here, only the CLIQUE algorithms are useful for exploring, the subspaces. CLIQUE (Clustering in Quest) is a grid-based method for finding clustering in subspaces. CLIQUE uses a multiresolution grid data structure. Concept of Dense Cells CLIQUE partitions each dimension into several overlapping intervals and intervals it into cells. Then, the algorithm determines whither the cell is dense or sparse. The cell is eunsidered dense if it exceeds a threshold value, say . Density is defined as the ratio of number of points and volume of the region. In one pass, the algorithm finds the number of ces, number of points, etc. and then combines the dense cells. For that, the algorithm uses the | ‘contiguous intervals and a set of dense cells. Step 1: Define a set of grid points and assign the given data points on the grid. Step 2: Determine the dense and sparse cells. If the number of points in a cell exceeds the threshold value 7; the cells categorized as dense cell. Sparse cells are removed from thelist. Step 3: Merge the dence cells if they ate adjacent, Step 4: Form a list of grid cells for every subspace as output. Monotonicity Property CLIQUE uses anti-monotonicity property or apriori property of the famous apriori algorithm. It means that all the subsets of a frequent item should be frequent. Similarly, if the subset is infrequent. then all its supersets are infrequent as well. Based on the apriari property, ane can conclude that a k-dimensional cell has r points if and only if every (k= 1) dimensional projections of this cell have atleast r points. So like association rule mining that uses apriori rule, the candidate 1s. The algorithm works in two stages as shown Step 1: Identify the dense cells. Step 2: Merge dense cells c, and c, if they share the same interval. (Continued) —_,_,_, _____—tustering algorithms. 379 (Step Generate Apriori rule tu generate (k + 1)" cell for higher dimension. Then, check ‘whether the number of points cross the threshold. This is repeated till there are no dense cells or new generation of dense cells Stage 2: Step 1: Merging of dense cells into a cluster is carried out in each subspace using maximal rior to cover dense cells, The maximal region is an hyperrectangle where all cells fall into, ‘Step 2: Maximal region tries to cover all dense cells to form clusters. In stage two, CLIQUE starts from dimension 2 and starts merging. This process is. continued till the n-dimension Advantages of CLIQUE 1. Insensitive to input order of objects 2. No assumptions of underlying data distributions 3. Finds subspace of higher dimensions such that high-density clusters exist in those subspaces Disadvantage The disadvantage of CLIQUE is that tuning of grid parameters, such as grid size, and finding optimal threshold for finding whether the cell is dense or not is a challenge. 13.7 PROBABILITY MODEL-BASED METHODS In the earlier clustering algorithms, the ample wore assigned to the clusters permanently. Also, the samples were not allowed to be present in two clusters. In short, the clusters were non-overlapping. In model-based schemes, the sample is associated with a probability for membership. This assignment based on probability 1s called soft assignment. Soft assignments are dynamic as compared to hard assignments which are static. Also, the sample can belong to more than one clusters. This is acceptable as person X can be a father, a manager as well as a member of a prestigious club. In short, person X has different roles in life. A Model’ means a statistic method like probability distributions associated with its param- eters. In the EM algorithm, the model assumes data 18 generated by a process and the focus 1s to find the distribution that observes that data. There are two probability based soft assignment schemes that are discussed here. One is fuzzy-C means (FCM) clustering algorithm and another is EM algorithm. EM algorithm is discussed in detail in Chapter 2. 13.7.1 Fuzzy Clustering Fuzzy C-Means is one of the most widely used algorithms for implementing the fuzzy clustering concept. In fuzzy clustering, an object can belong to more than one cluster. Let us assume two clusters ¢,and c, then an element, say x, can belong to both the clusters. The strength of association of an object with the cluster is given as w,, The value of w, lies between zero and one. The sum of weights of an object, if added, gives 1. wo + Ma ine Learning — ee ee - Like in k-means algorithm, the centroid of the cluster, c, s computed. The membership weight ss inversely proportional to the distance between abject and centroid computed in earlier pass. EI | % Step 1: Choose the clusters, , randomly Step 2: Assign weights w, of objects to clusters randomly. | Step 3: Compute the centroid: Bes, (03.15) | Step 4: The Sum of Squared Error (SSE) is computed as: SSE = 3 E waist (x, €,? (13.16) Step 5: Minimize SSE to update the membership weights. Here, p is a fuzzifier whose value ranges from 1 to», This parameter determines the influence of the weights. If p is 1, then fuzzy-c acts like k-means algorithm. A large weight results in a smaller value of the membership and hence more fuzziness. Typically, p value is 2. Step 6: Repeat steps 3-5 till convergence is reached, which mean when there is no change in weights exceeding the threshold valuc. Advantages and Disadvantages of FCM ‘The advantages and disadvantages of the FCM algorithm are given in Table 13.13. Table 13.1 idvantages of FCM Algorithm Advantages and No, ‘Advantages Disad Te [Minimum intra cls variance | The qual depends on he inital choke of weights 2 [Robust to noise Local nica rater than global nnimar 13.7.2 Expectation-Maximization (EM) Algorithm Like FCM algorithm, in EM algorithm too there are no hard assignments but there are overlapping clusters. {In this scheme, clustering 1s done by statistical models. What ts a model? A statastical model is described in terms of a distribution and a set of parameters. The data is assumed to be generated by a process and the focus is to describe the data by finding a model that fits the data. In fact. data is assumed to be generated by multiple distributions ~ a mixture model. As mostly gaussian distri- butions are used, it also called as GMM (Gaussian mixture model). Given a mix of distributions, data can be generated by randomly picking a distribution and generating the point. The basics of Gaussian distribution are given in Chapter 3. One can recollect - Clustering Algorithms + 302 that Gaussian distribution is a bell-shaped curve, The function of gaussian distribution is given as follows: Novipo?)= hee Vanco? vize this function mean and standard deviation. Sometimes, variance ean also be used! as it is the square of standard deviation. When the mean is zero, the peak of the bell-shaped curve occurs, Standard deviation is the spread of the shape. The above function is called probability distribution function that tells how to find the probability of the observed point x, Two parameters charact Ihe same gaussian tunction can be extended for multivariate too. In 2D, the mean is also a vector and variance takes the form of covariance matrix. Chapter 3 discusses these important concepts Let us assume that: k= Number of distributions n= Number of samples 8 = {0,,0,,0,,---,0,1, a set of parameters that are associated with the distributions. 6, is the parameter of the j" probability distribution. Then, p(x, |8,) is the probability of i object coming from the j* distribution. The probability that j* distribution to be chosen is given by the weight w, 1

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy