Unit III
Introduction:
In unsupervised learning, the learning algorithm is shown only the input data and asked to extract knowledge from it. We often have no prior knowledge of the data set we are working with, but we still want to discover interesting relationships among its attributes or group the data into logical segments for easy analysis. The task of the machine is to identify this knowledge without any prior training, and that is the space of unsupervised learning.
Clustering algorithms help in grouping data sets into logical segments, while association analysis helps in identifying patterns or relationships among the attributes within the data set. An interesting application of association analysis is Market Basket Analysis, which is used widely by retailers and advertisers across the globe.
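To make the idea concrete, here is a tiny, self-contained sketch of market basket analysis on hypothetical baskets; the item names, numbers, and helper functions are made up for illustration, and real systems use dedicated algorithms such as Apriori.
# Tiny illustrative sketch of market basket analysis on hypothetical baskets.
baskets = [
    {"bread", "milk", "eggs"},
    {"bread", "milk", "butter"},
    {"milk", "butter", "eggs"},
    {"bread", "milk"},
]

def support(itemset):
    # fraction of baskets that contain every item in itemset
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    # how often the consequent appears in baskets that contain the antecedent
    return support(antecedent | consequent) / support(antecedent)

# Rule "bread -> milk": customers who buy bread also tend to buy milk
print("support({bread, milk}) =", support({"bread", "milk"}))          # 0.75
print("confidence(bread -> milk) =", confidence({"bread"}, {"milk"}))  # 1.0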
Unsupervised learning helps in pushing movie promotions to the correct group of people. With the advent of smart devices and apps, there is now a huge database available for understanding which types of movies are liked by which segments of the demography. Machine learning helps to find the patterns or repeated behaviour of the smaller groups/clusters within this database, providing intelligence about which types of movies different demographic groups like or dislike. Using this intelligence, smart apps can push only the relevant movie promotions or trailers to the selected groups, which significantly increases the chance of reaching the right interested person for the movie.
Clustering is a broad class of methods for discovering unknown subgroups in data and is the most important concept in unsupervised learning.
Clustering is the task of partitioning the dataset into groups, called clusters. The goal is to split up the data in such a way that points within a single cluster are very similar and points in different clusters are different. Similarly to classification algorithms, clustering algorithms assign (or predict) a number to each data point, indicating which cluster a particular point belongs to.
Clustering refers to a broad set of techniques for finding subgroups, or clusters, in a data set on the basis of the characteristics of the objects within that data set, in such a manner that the objects within a group are similar (or related) to each other but different from (or unrelated to) the objects in other groups. The effectiveness of clustering depends on how similar or related the objects within a group are, and on how different or unrelated the objects in different groups are from each other. Cluster analysis supports this by exploring different ways of grouping the objects and arriving at different types of clusters.
There are many different fields where cluster analysis is used effectively, such as:
● Text data mining: tasks such as text categorization, text clustering, document summarization, concept extraction, sentiment analysis, and entity relation modelling.
● Customer segmentation: creating clusters of customers on the basis of parameters such as demographics, financial conditions, buying habits, etc., which retailers and advertisers can use to promote their products to the correct segment.
● Anomaly checking: detecting anomalous behaviours such as fraudulent bank transactions, unauthorized computer intrusions, suspicious movements on a radar scanner, etc.
● Data mining: simplifying the data mining task by grouping a large number of features from an extremely large data set to make the analysis manageable.
k-means Clustering:
k-means is one of the simplest and most commonly used clustering algorithms. It tries to find cluster centers that are representative of certain regions of the data by alternating between two steps: assigning each data point to the closest cluster center, and then setting each cluster center as the mean of the data points assigned to it.
Figure 3-23. Input data and three steps of the k-means algorithm
Cluster centers are shown as triangles, while data points are shown as circles. Colors indicate cluster
membership. We specified that we are looking for three clusters, so the algorithm was initialized by
declaring three data points randomly as cluster centers (see “Initialization”). Then the iterative
algorithm starts. First, each data point is assigned to the cluster center it is closest to (see “Assign
Points (1)”). Next, the cluster centers are updated to be the mean of the assigned points (see
“Recompute Centers (1)”). Then the process is repeated two more times. After the third iteration, the assignment of points to cluster centers remains unchanged, so the algorithm stops.
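The two alternating steps described above can be written down directly; the following NumPy sketch is only an illustration of the idea (the function name and stopping rule are ours), not scikit-learn's implementation.
import numpy as np

def simple_kmeans(X, n_clusters=3, n_iter=10, random_state=0):
    rng = np.random.RandomState(random_state)
    # Initialization: declare random data points as the initial cluster centers
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assign Points: each point goes to the closest cluster center
        distances = np.linalg.norm(X[:, np.newaxis] - centers, axis=2)
        labels = distances.argmin(axis=1)
        # Recompute Centers: each center becomes the mean of its assigned points
        # (assumes no cluster ends up empty)
        new_centers = np.array([X[labels == k].mean(axis=0)
                                for k in range(n_clusters)])
        if np.allclose(new_centers, centers):  # assignments stopped changing
            break
        centers = new_centers
    return centers, labels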
Given new data points, k-means will assign each to the closest cluster center. The next example
(Figure 3-24) shows the boundaries of the cluster centers that were learned in Figure 3-23:
Figure 3-24. Cluster centers and cluster boundaries found by the k-means algorithm
mglearn.plots.plot_kmeans_algorithm()
mglearn.plots.plot_kmeans_boundaries()
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
# generate synthetic two-dimensional data and fit a three-cluster k-means model
X, y = make_blobs(random_state=1)
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
During the algorithm, each training data point in X is assigned a cluster label. You can find these labels
in the kmeans.labels_ attribute:
In[50]:
print("Cluster memberships:\n{}".format(kmeans.labels_))
Out[50]:
Cluster memberships:
[1 2 2 2 0 0 0 2 1 1 2 2 0 1 0 0 0 1 2 2 0 2 0 1 2 0 0 1 1 0 1 1 0 1 2 0 2 2 2 0 0 2 1 2 2 0 1 1 1 1 2 0 0 0 1 0
2 2 1 1 2 0 0 2 2 0 1 0 1 2 2 2 0 1 1 2 0 0 1 2 1 2 2 0 1 1 1 1 2 1 0 1 1 2 2 0 0 1 0 1]
As we asked for three clusters, the clusters are numbered 0 to 2.
You can also assign cluster labels to new points, using the predict method. Each new point is assigned
to the closest cluster center when predicting, but the existing model is not changed. Running predict
on the training set returns the same result as labels_:
In[51]:
print(kmeans.predict(X))
Out[51]:
[1 2 2 2 0 0 0 2 1 1 2 2 0 1 0 0 0 1 2 2 0 2 0 1 2 0 0 1 1 0 1 1 0 1 2 0 2 2 2 0 0 2 1 2 2 0 1 1 1 1 2 0 0 0 1 0
2 2 1 1 2 0 0 2 2 0 1 0 1 2 2 2 0 1 1 2 0 0 1 2 1 2 2 0 1 1 1 1 2 1 0 1 1 2 2 0 0 1 0 1]
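For example, brand-new points (the coordinates below are made up for illustration) can be assigned to the nearest of the three learned centers without refitting the model:
import numpy as np
# hypothetical new observations in the same two-dimensional feature space
new_points = np.array([[-2.0, 3.0], [4.0, 1.0]])
print(kmeans.predict(new_points))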
Each cluster is defined solely by its center, which means that each cluster is a convex shape. As a result, k-means can only capture relatively simple shapes. k-means also assumes that all directions are equally important for each cluster, so it fails to identify nonspherical clusters: if the groups are stretched toward the diagonal, for example, k-means cannot separate them, because it only considers the distance to the nearest cluster center. k-means also performs poorly if the clusters have more complex shapes, like the two_moons data.
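As a quick illustration of this failure mode, the sketch below fits a two-cluster k-means model to the two_moons data; the resulting labels cut each moon roughly in half instead of separating the two moons (parameter values are illustrative).
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans

X_moons, y_moons = make_moons(n_samples=200, noise=0.05, random_state=0)
labels_moons = KMeans(n_clusters=2, random_state=0).fit_predict(X_moons)
# the cluster labels do not line up with the true half-moon membership
print(labels_moons[:20])
print(y_moons[:20])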
⮚ k-means is a very popular algorithm for clustering, not only because it is relatively easy to understand and implement, but also because it runs relatively quickly. k-means scales easily to large datasets.
⮚ One of the drawbacks of k-means is that it relies on a random initialization, which means the outcome of the algorithm depends on a random seed; a common mitigation is sketched below.
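One common mitigation is to run the algorithm several times with different random initializations and keep the best run; scikit-learn's n_init parameter does this internally. A minimal sketch (the parameter values are illustrative):
from sklearn.cluster import KMeans

# run 10 different random initializations and keep the best result
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)
print(kmeans.inertia_)  # sum of squared distances of points to their closest center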
Agglomerative Clustering:
Agglomerative clustering refers to a collection of clustering algorithms that all build upon the same
principles: the algorithm starts by declaring each point its own cluster, and then merges the two most
similar clusters until some stopping criterion is satisfied.
The stopping criterion implemented in scikit-learn is the number of clusters, so similar clusters are
merged until only the specified number of clusters are left.
There are several linkage criteria that specify how exactly the “most similar cluster” is measured.
This measure is always defined between two existing clusters.
The following three choices are implemented in scikit-learn:
ward
The default choice, ward picks the two clusters to merge such that the variance within all clusters
increases the least. This often leads to clusters that are relatively equally sized.
average
average linkage merges the two clusters that have the smallest average distance between all their
points.
complete
complete linkage (also known as maximum linkage) merges the two clusters that have the smallest
maximum distance between their points.
ward works on most datasets, and we will use it in our examples. If the clusters have very dissimilar
numbers of members (if one is much bigger than all the others, for example), average or complete
might work better.
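In scikit-learn, the linkage criterion is selected with the linkage parameter of AgglomerativeClustering; a minimal sketch:
from sklearn.cluster import AgglomerativeClustering

agg_ward = AgglomerativeClustering(n_clusters=3, linkage="ward")       # the default
agg_average = AgglomerativeClustering(n_clusters=3, linkage="average")
agg_complete = AgglomerativeClustering(n_clusters=3, linkage="complete")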
Example of agglomerative clustering on a two-dimensional dataset, looking for three clusters:
mglearn.plots.plot_agglomerative_algorithm()
Initially, each point is its own cluster. Then, in each step, the two clusters that are closest are merged.
In the first four steps, two single-point clusters are picked and these are joined into two-point
clusters. In step 5, one of the two-point clusters is extended to a third point, and so on. In step 9,
there are only three clusters remaining. As we specified that we are looking for three clusters, the
algorithm then stops.
Figure 3-33. Agglomerative clustering iteratively joins the two closest clusters
Agglomerative clustering cannot make predictions for new data points. Therefore, AgglomerativeClustering has no predict method. To build the model and get the cluster memberships on the training set, use the fit_predict method instead, as in the sketch below.
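A minimal sketch of this workflow on a synthetic blob dataset (the random_state value is illustrative):
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, y = make_blobs(random_state=1)
agg = AgglomerativeClustering(n_clusters=3)
assignment = agg.fit_predict(X)  # fit the model and return cluster memberships
print(assignment[:10])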
Figure 3-35. Hierarchical cluster assignment (shown as lines) generated with agglomerative clustering, with numbered data points
⮚ This visualization of the hierarchical clustering (Figure 3-35) relies on the two-dimensional nature of the data and therefore cannot be used on datasets that have more than two features. There is, however, another tool to visualize hierarchical clustering, called a dendrogram, that can handle multidimensional datasets.
⮚ SciPy provides a function that takes a data array X and computes a linkage array, which
encodes hierarchical cluster similarities. We can then feed this linkage array into the scipy
dendrogram function to plot the dendrogram. (Fig3-36)
# Import the dendrogram function and the ward clustering function from SciPy
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, ward
X, y = make_blobs(random_state=0, n_samples=12)
# Apply the ward clustering to the data array X
# The SciPy ward function returns an array that specifies the distances
# bridged when performing agglomerative clustering
linkage_array = ward(X)
# Now we plot the dendrogram for the linkage_array containing the distances
# between clusters
dendrogram(linkage_array)
# Mark the cuts in the tree that signify two or three clusters
ax = plt.gca()
bounds = ax.get_xbound()
ax.plot(bounds, [7.25, 7.25], '--', c='k')
ax.plot(bounds, [4, 4], '--', c='k')
ax.text(bounds[1], 7.25, ' two clusters', va='center', fontdict={'size': 15})
ax.text(bounds[1], 4, ' three clusters', va='center', fontdict={'size': 15})
plt.xlabel("Sample index")
plt.ylabel("Cluster distance")
Figure 3-36. Dendrogram of the clustering shown in Figure 3-35, with lines indicating splits into two and three clusters
The dendrogram shows data points as points on the bottom (numbered from 0 to 11). Then, a tree is
plotted with these points (representing single-point clusters) as the leaves, and a new node parent is
added for each two clusters that are joined.
Reading from bottom to top, the data points 1 and 4 are joined first (as you could see in Figure 3-33).
Next, points 6 and 9 are joined into a cluster, and so on. At the top level, there are two branches, one
consisting of points 11, 0, 5, 10, 7, 6, and 9, and the other consisting of points 1, 4, 3, 2, and 8. These
correspond to the two largest clusters in the left-hand side of the plot.
The longest branches in this dendrogram are the three lines that are marked by the dashed line labeled “three clusters.” That these are the longest branches indicates that going from three to two clusters meant merging some very far-apart points. We see this again at the top of the chart, where merging the two remaining clusters into a single cluster again bridges a relatively large distance.
Evaluating and comparing clustering algorithms:
One of the challenges in applying clustering algorithms is that it is very hard to assess how well an algorithm worked, and to compare outcomes between different algorithms.
Comparing random assignment, k-means, agglomerative clustering, and DBSCAN on the two_moons dataset using
the supervised ARI score
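A hedged sketch of this comparison, printing only the ARI of each assignment against the true two_moons labels (the plotting in the comparison described above is omitted here for brevity):
import numpy as np
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics.cluster import adjusted_rand_score

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# random cluster assignment as a baseline
rng = np.random.RandomState(seed=0)
random_clusters = rng.randint(low=0, high=2, size=len(X))
print("Random assignment ARI: {:.2f}".format(adjusted_rand_score(y, random_clusters)))

for algorithm in [KMeans(n_clusters=2), AgglomerativeClustering(n_clusters=2), DBSCAN()]:
    clusters = algorithm.fit_predict(X_scaled)
    print("{} ARI: {:.2f}".format(algorithm.__class__.__name__,
                                  adjusted_rand_score(y, clusters)))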
A common mistake when evaluating clustering in this way is to use accuracy_score instead of
adjusted_rand_score, normalized_mutual_info_score, or some other clustering metric. The problem
in using accuracy is that it requires the assigned cluster labels to exactly match the ground truth.
However, the cluster labels themselves are meaningless—the only thing that matters is which points
are in the same cluster:
In[69]:
from sklearn.metrics import accuracy_score
from sklearn.metrics.cluster import adjusted_rand_score
# these two labelings of points correspond to the same clustering
clusters1 = [0, 0, 1, 1, 0]
clusters2 = [1, 1, 0, 0, 1]
# accuracy is zero, as none of the labels are the same
print("Accuracy: {:.2f}".format(accuracy_score(clusters1, clusters2)))
# adjusted rand score is 1, as the clustering is exactly the same
print("ARI: {:.2f}".format(adjusted_rand_score(clusters1, clusters2)))
Out[69]:
Accuracy: 0.00
ARI: 1.00
– Even according to these metrics and scores, we still don’t know whether there is any semantic meaning in the clustering.
– The only way to know whether the clustering corresponds to anything we are interested in is to analyze the clusters manually.
There are scoring metrics for clustering that don’t require ground truth, such as the silhouette coefficient. The silhouette coefficient measures how well each sample is clustered together with samples that are similar to it, and the silhouette score is used to judge the goodness of a clustering technique. Its value ranges from -1 to 1, where 1 means the clusters are well apart from each other and clearly distinguished. However, such metrics often don’t work well in practice: the silhouette score computes the compactness of a cluster, where higher is better, with a perfect score of 1. While compact clusters are good, compactness doesn’t allow for complex shapes.
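For reference, the silhouette coefficient of a single sample is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster; the dataset's silhouette score is the mean of these per-sample values. A small sketch of computing both with scikit-learn (the dataset and cluster count are illustrative):
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

X_demo, _ = make_blobs(random_state=1)
labels_demo = KMeans(n_clusters=3, random_state=0).fit_predict(X_demo)
print("mean silhouette score: {:.2f}".format(silhouette_score(X_demo, labels_demo)))
print("first five per-sample values:", silhouette_samples(X_demo, labels_demo)[:5])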
An example comparing the outcome of k-means, agglomerative clustering, and DBSCAN on the two_moons dataset using the silhouette score:
import numpy as np
import matplotlib.pyplot as plt
import mglearn
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics.cluster import silhouette_score

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
# rescale the data to zero mean and unit variance
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
fig, axes = plt.subplots(1, 4, figsize=(15, 3),
                         subplot_kw={'xticks': (), 'yticks': ()})
# create a random cluster assignment for reference
random_state = np.random.RandomState(seed=0)
random_clusters = random_state.randint(low=0, high=2, size=len(X))
# plot the random assignment and its silhouette score
axes[0].scatter(X_scaled[:, 0], X_scaled[:, 1], c=random_clusters, cmap=mglearn.cm3, s=60)
axes[0].set_title("Random assignment: {:.2f}".format(
    silhouette_score(X_scaled, random_clusters)))
algorithms = [KMeans(n_clusters=2), AgglomerativeClustering(n_clusters=2), DBSCAN()]
for ax, algorithm in zip(axes[1:], algorithms):
    clusters = algorithm.fit_predict(X_scaled)
    # plot the cluster assignments and the silhouette score
    ax.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap=mglearn.cm3, s=60)
    ax.set_title("{} : {:.2f}".format(algorithm.__class__.__name__,
                                      silhouette_score(X_scaled, clusters)))
Comparing random assignment, k-means, agglomerative clustering, and DBSCAN on the two_moons dataset
using the unsupervised silhouette score—the more intuitive result of DBSCAN has a lower silhouette score than
the assignments found by k-means
As you can see, k-means gets the highest silhouette score, even though we might prefer the result
produced by DBSCAN. A slightly better strategy for evaluating clusters is using robustness-based
clustering metrics. These run an algorithm after adding some noise to the data, or using different
parameter settings, and compare the outcomes.
The idea is that if many algorithm parameters and many perturbations of the data return the same
result, it is likely to be trustworthy. Unfortunately, this strategy is not implemented in scikit-learn at
the time of writing.
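A hedged sketch of such a robustness check, reusing X_scaled from the example above: cluster several noise-perturbed copies of the data with different seeds and compare each labeling to a baseline with ARI (the noise scale and number of repetitions are arbitrary choices, not a standard recipe).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.cluster import adjusted_rand_score

rng = np.random.RandomState(0)
base_labels = KMeans(n_clusters=2, random_state=0).fit_predict(X_scaled)
scores = []
for seed in range(5):
    # perturb the data slightly and recluster with a different random seed
    X_noisy = X_scaled + rng.normal(scale=0.05, size=X_scaled.shape)
    labels = KMeans(n_clusters=2, random_state=seed).fit_predict(X_noisy)
    scores.append(adjusted_rand_score(base_labels, labels))
print("mean ARI against the baseline clustering: {:.2f}".format(np.mean(scores)))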
Even if we get a very robust clustering, or a very high silhouette score, we still don’t know if there is
any semantic meaning in the clustering, or whether the clustering reflects an aspect of the data that
we are interested in.
Feature Scaling:
Scaling of features is an essential step in modeling algorithms with datasets. The data that is usually used for the purpose of modeling is derived through various means such as:
● Questionnaire
● Surveys
● Research
● Scraping, etc.
So, the data obtained contains features of various dimensions and scales. Different scales of the data features can affect the modeling of a dataset adversely, leading to biased predictions in terms of misclassification error and accuracy rates. Thus, it is necessary to scale the data prior to modeling.
Standardization is a scaling technique that makes the data scale-free by converting the statistical distribution of each feature into the below format:
● mean - 0 (zero)
● standard deviation - 1
Each value is transformed as z = (x - mean) / standard deviation, so the entire data set is scaled to zero mean and unit variance.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

dataset = load_iris()
i_data = dataset.data  # the iris feature matrix
scaler = StandardScaler()
# standardization: rescale each feature to zero mean and unit variance
scale = scaler.fit_transform(i_data)
print(scale)
– k-means allows for a characterization of clusters using the cluster means; it can be considered a decomposition method
– DBSCAN allows for the detection of “noise points” and allows for complex cluster shapes.
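Since DBSCAN is only used for comparison above, here is a minimal, hedged sketch of running it with its default parameters on the scaled two_moons data; at this noise level it recovers the two half-moon clusters.
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
X_scaled = StandardScaler().fit_transform(X)
clusters = DBSCAN().fit_predict(X_scaled)
# cluster labels; noise points, if any, are labeled -1
print(set(clusters))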