
Unit-3 Unsupervised Learning

Introduction:
In unsupervised learning, the learning algorithm is just shown the input data and asked to extract
knowledge from this data.
Sometimes we do not have any prior knowledge of the data set we are working with, but we still want to
discover interesting relationships among the attributes of the data or group the data into logical
segments for easy analysis. The task of the machine is then to identify this knowledge without any
prior training, and that is the space of unsupervised learning.

Unsupervised learning is a machine learning concept in which unlabelled and unclassified information is
analysed to discover hidden knowledge. The algorithms work on the data without any prior training, but
they are constructed in such a way that they can identify patterns, groupings, sorting orders, and other
interesting knowledge in the data set.

Clustering algorithms help in grouping data sets into logical segments, while association
analysis enables us to identify patterns or relationships among the attributes within a data set. An
interesting application of association analysis is Market Basket Analysis, which is used
widely by retailers and advertisers across the globe.

APPLICATIONS OF UNSUPERVISED LEARNING


● Segmentation of target consumer populations by an advertisement consulting agency on the
basis of a few dimensions such as demography, financial data, purchasing habits, etc., so that
advertisers can reach their target consumers efficiently.
● Anomaly or fraud detection in the banking sector by identifying the pattern of loan defaulters.
● Image processing and image segmentation such as face recognition, expression identification,
etc.
● Grouping of important characteristics in genes to identify important influencers in new areas of
genetics.
● Utilization by data scientists to reduce the dimensionality of sample data to simplify modelling
● Document clustering and identification of potential labelling options
Today, unsupervised learning is used in many areas involving Artificial Intelligence (AI) and
Machine Learning (ML). Chatbots, self-driving cars, and many other recent innovations are results
of combining unsupervised and supervised learning.

Types of Unsupervised Learning


● Unsupervised transformations of a dataset are algorithms that create a new representation of
the data which might be easier to understand compared to the original representation of the
data.
● A common application of unsupervised transformations is dimensionality reduction, which
takes a high-dimensional representation of the data and reduces it, for example to two dimensions,
for visualization purposes (a short sketch follows after this list).
● Another application for unsupervised transformations is finding the parts or components
that “make up” the data. This can be useful for tracking the discussion of themes like elections, gun
control, or pop stars on social media.
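
As a brief illustration of the dimensionality-reduction idea (a minimal sketch added to these notes, assuming scikit-learn's PCA and the iris data purely as examples; not part of the original text):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# load a small example dataset (dataset choice is an assumption for illustration)
X = load_iris().data                  # 150 samples, 4 features

# reduce the four features to two components for visualization
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)  # fraction of variance captured by each component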

There are two major aspects of unsupervised learning, namely:


Clustering 🡪 helps in the segmentation of a set of objects into groups of similar objects, i.e. it
partitions data into distinct groups of similar items.
For example, consider uploading photos to a social media site. To allow you to organize your pictures,
the site might want to group together pictures that show the same person.
Association Analysis 🡪 is related to the identification of relationships among objects in a data
set.
Challenges in Unsupervised Learning:
Unsupervised learning algorithms are usually applied to data that does not contain any label
information, so we don’t know what the right output should be. Therefore, it is very hard to say
whether a model “did well.”
Another common application for unsupervised algorithms is as a preprocessing step for supervised
algorithms.
Learning a new representation of the data can sometimes improve the accuracy of supervised
algorithms, or can lead to reduced memory and time consumption.

UNSUPERVISED VS SUPERVISED LEARNING:


In supervised learning, the aim is to predict the outcome variable Y on the basis of the feature set
X1, X2, …, Xn, using methods such as regression and classification.
In unsupervised learning, the objective is to observe only the features X1, X2, …, Xn; we are not going to
predict any outcome variable. Rather, our intention is to find the associations between the
features or their groupings in order to understand the nature of the data. This analysis may reveal an
interesting correlation between the features or a common behaviour within a subgroup of the data,
which provides a better understanding of the data.
The main difference between supervised vs unsupervised learning is the need for
labelled training data. Supervised machine learning relies on labelled input and
output training data, whereas unsupervised learning processes unlabelled or raw
data.
Examples of supervised machine learning include:

● Classification, identifying input data as part of a learned group.


● Regression, predicting outcomes from continuously changing data.

Examples of unsupervised machine learning include:

● Clustering, grouping together data points that are similar to each other.


● Association, understanding how certain data features connect with other
features.
In terms of statistics,
🡪 A supervised learning algorithm will try to learn the probability of the outcome Y for a particular input X,
which is called the posterior probability.
🡪 Unsupervised learning is closely related to density estimation in statistics. If every input and its
corresponding target are concatenated to create a new set of inputs such as {(X1, Y1), (X2, Y2), …,
(Xn, Yn)}, the probability learned over this combined set describes the correlation of X and Y; this
probability is called the joint probability.
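
To make this notation concrete (a brief addition to these notes, not part of the original text):
Posterior probability (supervised learning): P(Y | X)
Joint probability of features and outcome: P(X, Y) = P(Y | X) · P(X)
Density estimation (unsupervised learning) models only the marginal distribution of the features: P(X)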

Unsupervised learning helps in pushing movie promotions to the correct group of people. With the
advent of smart devices and apps, there is now a huge database available to understand what type of
movie is liked by which segment of the demography. Machine learning helps to find the patterns or
repeated behaviour of the smaller groups/clusters within this database and thus provides intelligence
about which types of movies are liked or disliked by different groups within the demography. Using
this intelligence, smart apps can push only the relevant movie promotions or trailers to the
selected groups, which significantly increases the chance of reaching the right interested audience for
the movie.

The principles underlying unsupervised learning are Clustering and Association Analysis.

Clustering is a broad class of methods used for discovering unknown subgroups in data, which is the
most important concept in unsupervised learning.

Clustering is the task of partitioning the dataset into groups, called clusters. The goal is to split up
the data in such a way that points within a single cluster are very similar and points in different
clusters are different. Similarly to classification algorithms, clustering algorithms assign (or predict) a
number to each data point, indicating which cluster a particular point belongs to.

Another technique is Association Analysis, which identifies association rules that describe the
relationships among the attributes of the observations in the data set.

Clustering refers to a broad set of techniques for finding subgroups, or clusters, in a data set on the
basis of the characteristics of the objects within that data set in such a manner that the objects within
the group are similar (or related to each other) but are different from (or unrelated to) the objects
from the other groups. The effectiveness of clustering depends on how similar or related the objects
within a group are or how different or unrelated the objects in different groups are from each other.
Cluster analysis can help in such tasks by analysing different ways to group a set of people and
arriving at different types of clusters.

There are many different fields where cluster analysis is used effectively, such as
Text data mining: it includes tasks such as text categorization, text clustering, document
summarization, concept extraction, sentiment analysis, and entity relation modelling
Customer segmentation: creating clusters of customers on the basis of parameters such as
demographics, financial conditions, buying habits, etc., which can be used by retailers and advertisers
to promote their products in the correct segment.
Anomaly checking: detection of anomalous behaviours such as fraudulent bank transactions,
unauthorized computer intrusions, suspicious movements on a radar scanner, etc.
Data mining: simplifying the data mining task by grouping a large number of features from an extremely
large data set to make the analysis manageable.

The focus will be on


⮚ how clustering tasks differ from classification tasks and how clustering defines groups
⮚ a classic and easy-to-understand clustering algorithm, namely k-means, which is used for
clustering along with the k-medoids algorithm
⮚ application of clustering in real-life scenarios
k-Means Clustering:
k-means clustering is one of the simplest and most commonly used clustering algorithms.
It tries to find cluster centers that are representative of certain regions of the data. The algorithm
alternates between two steps: assigning each data point to the closest cluster center, and then setting
each cluster center as the mean of the data points that are assigned to it. The algorithm is finished
when the assignment of instances to clusters no longer changes. The following example (Figure 3-23)
illustrates the algorithm on a synthetic dataset:

Figure 3-23. Input data and three steps of the k-means algorithm
Cluster centers are shown as triangles, while data points are shown as circles. Colors indicate cluster
membership. We specified that we are looking for three clusters, so the algorithm was initialized by
declaring three data points randomly as cluster centers (see “Initialization”). Then the iterative
algorithm starts. First, each data point is assigned to the cluster center it is closest to (see “Assign
Points (1)”). Next, the cluster centers are updated to be the mean of the assigned points (see
“Recompute Centers (1)”). Then the process is repeated two more times. After the third iteration,
the assignment of points to cluster centers remained unchanged, so the algorithm stops.
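
The two alternating steps described above can be written down directly. Below is a minimal NumPy sketch of the k-means loop, added here only for illustration; it is not the scikit-learn implementation, and it does not handle edge cases such as empty clusters:

import numpy as np

def simple_kmeans(X, k, n_iter=10, seed=0):
    # initialize by declaring k randomly chosen data points as cluster centers
    rng = np.random.RandomState(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # step 1: assign each data point to the closest cluster center
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # step 2: set each cluster center to the mean of the points assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # stop when the centers (and hence the assignments) no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers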
Given new data points, k-means will assign each to the closest cluster center. The next example
(Figure 3-24) shows the boundaries of the cluster centers that were learned in Figure 3-23:

Figure 3-24. Cluster centers and cluster boundaries found by the k-means algorithm
import mglearn
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# plot the illustrations shown in Figures 3-23 and 3-24
mglearn.plots.plot_kmeans_algorithm()
mglearn.plots.plot_kmeans_boundaries()

# generate synthetic two-dimensional data
X, y = make_blobs(random_state=1)

# build the clustering model with three clusters
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)

During the algorithm, each training data point in X is assigned a cluster label. You can find these labels
in the kmeans.labels_ attribute:

In[50]:
print("Cluster memberships:\n{}".format(kmeans.labels_))
Out[50]:
Cluster memberships:
[1 2 2 2 0 0 0 2 1 1 2 2 0 1 0 0 0 1 2 2 0 2 0 1 2 0 0 1 1 0 1 1 0 1 2 0 2 2 2 0 0 2 1 2 2 0 1 1 1 1 2 0 0 0 1 0
2 2 1 1 2 0 0 2 2 0 1 0 1 2 2 2 0 1 1 2 0 0 1 2 1 2 2 0 1 1 1 1 2 1 0 1 1 2 2 0 0 1 0 1]
As we asked for three clusters, the clusters are numbered 0 to 2.

You can also assign cluster labels to new points, using the predict method. Each new point is assigned
to the closest cluster center when predicting, but the existing model is not changed. Running predict
on the training set returns the same result as labels_:
In[51]:
print(kmeans.predict(X))
Out[51]:
[1 2 2 2 0 0 0 2 1 1 2 2 0 1 0 0 0 1 2 2 0 2 0 1 2 0 0 1 1 0 1 1 0 1 2 0 2 2 2 0 0 2 1 2 2 0 1 1 1 1 2 0 0 0 1 0
2 2 1 1 2 0 0 2 2 0 1 0 1 2 2 2 0 1 1 2 0 0 1 2 1 2 2 0 1 1 1 1 2 1 0 1 1 2 2 0 0 1 0 1]

Failure cases of k-means:

Each cluster is defined solely by its center, which means that each cluster is a convex shape.
As a result of this, k-means can only capture relatively simple shapes.
k-means also assumes that all directions are equally important for each cluster, so it fails to
identify nonspherical clusters. If the groups in the data are stretched toward a diagonal, for example,
k-means only considers the distance to the nearest cluster center and therefore can’t handle this kind of data.
k-means also performs poorly if the clusters have more complex shapes, like the two_moons data.

Fig 3-29. k-means fails to identify clusters with complex shapes
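
The failure illustrated in Fig 3-29 can be reproduced with a few lines (a short sketch using the same scikit-learn and mglearn calls as above); k-means can only draw convex clusters, so its two-cluster assignment cuts across both moons instead of following them:

import mglearn
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# k-means splits the two moons with a straight boundary instead of following their shape
kmeans = KMeans(n_clusters=2)
labels = kmeans.fit_predict(X)

mglearn.discrete_scatter(X[:, 0], X[:, 1], labels)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")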

⮚ k-means is a very popular algorithm for clustering, not only because it is relatively easy to
understand and implement, but also because it runs relatively quickly. k-means scales easily to
large datasets.
⮚ One of the drawbacks of k-means is that it relies on a random initialization, which means the
outcome of the algorithm depends on a random seed (see the short sketch after this list).
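
As a small illustration of the initialization issue (added to these notes), scikit-learn's KMeans reruns the algorithm several times from different random initializations via its n_init parameter and keeps the best solution; fixing random_state makes a particular run reproducible:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, y = make_blobs(random_state=1)

# run k-means 10 times from different random initializations and keep the
# run with the lowest within-cluster sum of squared distances (inertia)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)
print(kmeans.inertia_)   # objective value of the best of the 10 runs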
Agglomerative Clustering:
Agglomerative clustering refers to a collection of clustering algorithms that all build upon the same
principles: the algorithm starts by declaring each point its own cluster, and then merges the two most
similar clusters until some stopping criterion is satisfied.
The stopping criterion implemented in scikit-learn is the number of clusters, so similar clusters are
merged until only the specified number of clusters are left.
There are several linkage criteria that specify how exactly the “most similar cluster” is measured.
This measure is always defined between two existing clusters.
The following three choices are implemented in scikit-learn:
ward
The default choice, ward picks the two clusters to merge such that the variance within all clusters
increases the least. This often leads to clusters that are relatively equally sized.
average
average linkage merges the two clusters that have the smallest average distance between all their
points.
complete
complete linkage (also known as maximum linkage) merges the two clusters that have the smallest
maximum distance between their points.

ward works on most datasets, and we will use it in our examples. If the clusters have very dissimilar
numbers of members (if one is much bigger than all the others, for example), average or complete
might work better.
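
The linkage criterion is selected through the linkage parameter of scikit-learn's AgglomerativeClustering; the following short sketch (added for illustration) simply fits the same data with each of the three criteria described above:

from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, y = make_blobs(random_state=1)

# "ward" is the default; "average" and "complete" are the alternative criteria
for linkage in ["ward", "average", "complete"]:
    agg = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    labels = agg.fit_predict(X)
    print(linkage, labels[:10])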

Example of agglomerative clustering on a two-dimensional dataset, looking for three clusters:

mglearn.plots.plot_agglomerative_algorithm()

Initially, each point is its own cluster. Then, in each step, the two clusters that are closest are merged.
In the first four steps, two single-point clusters are picked and these are joined into two-point
clusters. In step 5, one of the two-point clusters is extended to a third point, and so on. In step 9,
there are only three clusters remaining. As we specified that we are looking for three clusters, the
algorithm then stops.
Figure 3-33. Agglomerative clustering iteratively joins the two closest clusters

Agglomerative clustering cannot make predictions for new data points. Therefore, Agglomerative
Clustering has no predict method. To build the model and get the cluster memberships on the training
set, use the fit_predict method instead.

import matplotlib.pyplot as plt
import mglearn
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, y = make_blobs(random_state=1)

agg = AgglomerativeClustering(n_clusters=3)
assignment = agg.fit_predict(X)

mglearn.discrete_scatter(X[:, 0], X[:, 1], assignment)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")

Cluster assignment using agglomerative clustering with three clusters


Hierarchical clustering and dendrograms:
Agglomerative clustering produces what is known as a hierarchical clustering. The clustering proceeds
iteratively, and every point makes a journey from being a single point cluster to belonging to some
final cluster. Each intermediate step provides a clustering of the data (with a different number of
clusters). It is sometimes helpful to look at all possible clusterings jointly.

Fig3-35: Hierarchical cluster assignment (shown as lines) generated with agglomerative clustering, with
numbered data points
⮚ This visualization of hierarchical clustering relies on the two-dimensional nature of the data and
therefore cannot be used on datasets that have more than two features. There is, however, another tool
to visualize hierarchical clustering, called a dendrogram, that can handle multidimensional
datasets.
⮚ SciPy provides a function that takes a data array X and computes a linkage array, which
encodes hierarchical cluster similarities. We can then feed this linkage array into the scipy
dendrogram function to plot the dendrogram. (Fig3-36)

# Import the dendrogram function and the ward clustering function from SciPy
from scipy.cluster.hierarchy import dendrogram, ward
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

X, y = make_blobs(random_state=0, n_samples=12)
# Apply the ward clustering to the data array X
# The SciPy ward function returns an array that specifies the distances
# bridged when performing agglomerative clustering
linkage_array = ward(X)
# Now we plot the dendrogram for the linkage_array containing the distances
# between clusters
dendrogram(linkage_array)
# Mark the cuts in the tree that signify two or three clusters
ax = plt.gca()
bounds = ax.get_xbound()
ax.plot(bounds, [7.25, 7.25], '--', c='k')
ax.plot(bounds, [4, 4], '--', c='k')
ax.text(bounds[1], 7.25, ' two clusters', va='center', fontdict={'size': 15})
ax.text(bounds[1], 4, ' three clusters', va='center', fontdict={'size': 15})
plt.xlabel("Sample index")
plt.ylabel("Cluster distance")

Fig:3-36:Dendrogram of the clustering shown in Figure 3-35 with lines indicating splits into two and three clusters

The dendrogram shows the data points as points on the bottom (numbered from 0 to 11). Then, a tree is
plotted with these points (representing single-point clusters) as the leaves, and a new parent node is
added for each pair of clusters that are joined.
Reading from bottom to top, the data points 1 and 4 are joined first (as you could see in Figure 3-33).
Next, points 6 and 9 are joined into a cluster, and so on. At the top level, there are two branches, one
consisting of points 11, 0, 5, 10, 7, 6, and 9, and the other consisting of points 1, 4, 3, 2, and 8. These
correspond to the two largest clusters in the left-hand side of the plot.

The longest branches in this dendrogram are the three lines that are marked by the dashed line
labeled “three clusters.” That these are the longest branches indicates that going from three to two
clusters means merging some very far-apart points. We see this again at the top of the chart, where
merging the two remaining clusters into a single cluster again bridges a relatively large distance.
Limitation of agglomerative clustering methods

– Still fails at separating complex shapes like the two_moons dataset


Comparing and Evaluating Clustering Algorithms:

One of the challenges in applying clustering algorithms is that it is very hard to assess how well an
algorithm worked, and to compare outcomes between different algorithms.

Evaluating clustering with ground truth:


There are metrics that can be used to assess the outcome of a clustering algorithm relative to a
ground truth clustering:
The adjusted Rand index (ARI) and normalized mutual information (NMI), which both provide a
quantitative measure between 0 and 1 (these measures assess the quality of a partition by comparing
it with the ground-truth partition).
Let N be the number of data points in a given data set, let n_ij be the number of data points with class
label i that are assigned to cluster j, let a_i be the number of data points in class i, and let b_j be the
number of data points in cluster j. The Rand Index (RI) is the proportion of point pairs on which the two
partitions agree, i.e. pairs that are placed together in both partitions or placed apart in both. The ARI
corrects the RI for chance agreement and is defined by

ARI = [ Σ_ij C(n_ij, 2) − ( Σ_i C(a_i, 2) · Σ_j C(b_j, 2) ) / C(N, 2) ]
      / [ ½ ( Σ_i C(a_i, 2) + Σ_j C(b_j, 2) ) − ( Σ_i C(a_i, 2) · Σ_j C(b_j, 2) ) / C(N, 2) ]

where C(n, 2) = n(n − 1)/2 is the number of pairs that can be formed from n items.
NMI tells us how much the uncertainty about the class labels is reduced when we know the cluster
labels; it is similar to the information gain used in decision trees, normalized so that the measure lies
between 0 and 1.
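
As a tiny illustration of how these scores are obtained in practice (added to these notes), both metrics are available in scikit-learn; two partitions that group the points identically score 1 even though the label names differ:

from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

truth    = [0, 0, 0, 1, 1, 1]
clusters = [1, 1, 1, 0, 0, 0]   # same grouping, different label names

print(adjusted_rand_score(truth, clusters))           # 1.0
print(normalized_mutual_info_score(truth, clusters))  # 1.0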
Here, we compare the k-means, agglomerative clustering, and DBSCAN algorithms using ARI.
First randomly assign points to two clusters for comparison.

import numpy as np
import matplotlib.pyplot as plt
import mglearn
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics.cluster import adjusted_rand_score

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# rescale the data to zero mean and unit variance
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

fig, axes = plt.subplots(1, 4, figsize=(15, 3),
                         subplot_kw={'xticks': (), 'yticks': ()})

# make a list of algorithms to use
algorithms = [KMeans(n_clusters=2), AgglomerativeClustering(n_clusters=2), DBSCAN()]

# create a random cluster assignment for reference
random_state = np.random.RandomState(seed=0)
random_clusters = random_state.randint(low=0, high=2, size=len(X))

# plot the random assignment
axes[0].scatter(X_scaled[:, 0], X_scaled[:, 1], c=random_clusters, cmap=mglearn.cm3, s=60)
axes[0].set_title("Random assignment - ARI: {:.2f}".format(
    adjusted_rand_score(y, random_clusters)))

for ax, algorithm in zip(axes[1:], algorithms):
    # plot the cluster assignments found by each algorithm
    clusters = algorithm.fit_predict(X_scaled)
    ax.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap=mglearn.cm3, s=60)
    ax.set_title("{} - ARI: {:.2f}".format(algorithm.__class__.__name__,
                                           adjusted_rand_score(y, clusters)))

Comparing random assignment, k-means, agglomerative clustering, and DBSCAN on the two_moons dataset using
the supervised ARI score

A common mistake when evaluating clustering in this way is to use accuracy_score instead of
adjusted_rand_score, normalized_mutual_info_score, or some other clustering metric. The problem
in using accuracy is that it requires the assigned cluster labels to exactly match the ground truth.
However, the cluster labels themselves are meaningless—the only thing that matters is which points
are in the same cluster:
In[69]:
from sklearn.metrics import accuracy_score
# these two labelings of points correspond to the same clustering
clusters1 = [0, 0, 1, 1, 0]
clusters2 = [1, 1, 0, 0, 1]
# accuracy is zero, as none of the labels are the same
print("Accuracy: {:.2f}".format(accuracy_score(clusters1, clusters2)))
# adjusted rand score is 1, as the clustering is exactly the same
print("ARI: {:.2f}".format(adjusted_rand_score(clusters1, clusters2)))
Out[69]:
Accuracy: 0.00
ARI: 1.00

Evaluating clustering without ground truth:


Although we have just shown one way to evaluate clustering algorithms, in practice, there is a big
problem with using measures like ARI. When applying clustering algorithms, there is usually no ground
truth to which to compare the results. If we knew the right clustering of the data, we could use this
information to build a supervised model like a classifier. Therefore, using metrics like ARI and NMI
usually only helps in developing algorithms, not in assessing success in an application.
Using metrics like ARI and NMI usually only helps in developing algorithms

– Not in assessing success in an application

– Even according to these metrics and scores, we still don’t know if there is any semantic meaning
in the clustering

– The only way to know whether the clustering corresponds to anything we are interested in is to
analyze the clusters manually

There are scoring metrics for clustering that don’t require ground truth, like the silhouette
coefficient.
The silhouette coefficient is a measure of how well samples are clustered with samples that are similar to
themselves.
The silhouette score is a metric used to calculate the goodness of a clustering technique.
Its value ranges from -1 to 1.
A value of 1 means the clusters are well apart from each other and clearly distinguished.
However, these often don’t work well in practice. The silhouette score computes the compactness of
a cluster, where higher is better, with a perfect score of 1. While compact clusters are good,
compactness doesn’t allow for complex shapes.
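
For reference (an addition to these notes), the silhouette coefficient of a single sample i is

s(i) = (b(i) − a(i)) / max(a(i), b(i))

where a(i) is the mean distance from i to the other points in its own cluster and b(i) is the mean distance from i to the points in the nearest other cluster; the silhouette score of a clustering is the mean of s(i) over all samples, which is what sklearn's silhouette_score computes.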

An example comparing the outcome of k-means, agglomerative clustering, and DBSCAN on the two-
moons dataset using the silhouette score.
# (uses the same imports as the ARI example above: numpy, matplotlib, mglearn,
# make_moons, StandardScaler, KMeans, AgglomerativeClustering, DBSCAN)
from sklearn.metrics.cluster import silhouette_score

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# rescale the data to zero mean and unit variance
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

fig, axes = plt.subplots(1, 4, figsize=(15, 3),
                         subplot_kw={'xticks': (), 'yticks': ()})

# create a random cluster assignment for reference
random_state = np.random.RandomState(seed=0)
random_clusters = random_state.randint(low=0, high=2, size=len(X))

# plot the random assignment
axes[0].scatter(X_scaled[:, 0], X_scaled[:, 1], c=random_clusters, cmap=mglearn.cm3, s=60)
axes[0].set_title("Random assignment: {:.2f}".format(silhouette_score(X_scaled, random_clusters)))

algorithms = [KMeans(n_clusters=2), AgglomerativeClustering(n_clusters=2), DBSCAN()]
for ax, algorithm in zip(axes[1:], algorithms):
    clusters = algorithm.fit_predict(X_scaled)
    # plot the cluster assignments found by each algorithm
    ax.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap=mglearn.cm3, s=60)
    ax.set_title("{} : {:.2f}".format(algorithm.__class__.__name__,
                                      silhouette_score(X_scaled, clusters)))

Comparing random assignment, k-means, agglomerative clustering, and DBSCAN on the two_moons dataset
using the unsupervised silhouette score—the more intuitive result of DBSCAN has a lower silhouette score than
the assignments found by k-means

As you can see, k-means gets the highest silhouette score, even though we might prefer the result
produced by DBSCAN. A slightly better strategy for evaluating clusters is using robustness-based
clustering metrics. These run an algorithm after adding some noise to the data, or using different
parameter settings, and compare the outcomes.
The idea is that if many algorithm parameters and many perturbations of the data return the same
result, it is likely to be trustworthy. Unfortunately, this strategy is not implemented in scikit-learn at
the time of writing.
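
A do-it-yourself sketch of this robustness idea (an illustration added to these notes, not a scikit-learn feature) is to perturb the data slightly, re-run the clustering with different seeds, and compare the resulting label sets to a baseline run with ARI; consistently high agreement suggests a stable clustering:

import numpy as np
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

rng = np.random.RandomState(0)
base = KMeans(n_clusters=2, random_state=0).fit_predict(X)

# re-cluster several noisy copies of the data and compare with the baseline labels
for i in range(3):
    X_noisy = X + rng.normal(scale=0.05, size=X.shape)
    labels = KMeans(n_clusters=2, random_state=i).fit_predict(X_noisy)
    print("ARI vs. baseline clustering:", adjusted_rand_score(base, labels))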
Even if we get a very robust clustering, or a very high silhouette score, we still don’t know if there is
any semantic meaning in the clustering, or whether the clustering reflects an aspect of the data that
we are interested in.

NOTE: Extra Content

Need for Standardization


Before getting into Standardization, let us first understand the concept of Scaling.

Scaling of features is an essential step in modeling algorithms with datasets. The data that is
usually used for the purpose of modeling is obtained through various means such as:

● Questionnaire
● Surveys
● Research
● Scraping, etc.

So, the data obtained contains features of various dimensions and scales altogether. Different scales
of the data features affect the modeling of a dataset adversely.

This leads to biased outcomes in terms of misclassification errors and accuracy rates. Thus,
it is necessary to scale the data prior to modeling.

This is where standardization comes into the picture.

Standardization is a scaling technique that makes the data scale-free by transforming the
statistical distribution of each feature into the following form:

● mean = 0 (zero)
● standard deviation = 1

With this, the entire data set is rescaled to zero mean and unit variance.
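
In other words (a one-line formula added here for clarity), each feature value x is transformed as

z = (x − mean) / standard deviation

where the mean and standard deviation are computed per feature over the training data.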

Standardizing data with StandardScaler() function


from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

dataset = load_iris()
scaler = StandardScaler()

# Splitting the independent (feature) and dependent (target) variables
i_data = dataset.data
response = dataset.target

# standardization: fit the scaler on the features and transform them
scale = scaler.fit_transform(i_data)
print(scale)
Reference: “Using StandardScaler() Function to Standardize Python Data” (DigitalOcean).

Each algorithm has somewhat different strengths

– k-means allows for a characterization of clusters using the cluster means; it can be considered a decomposition method

– DBSCAN allows for the detection of “noise points” and allows for complex cluster shapes.

– Agglomerative clustering can provide a whole hierarchy of possible partitions.
