ARM and Clustering
• Support
• Support is one of the measures of interestingness. It tells us about the usefulness and certainty of a rule.
• It can be calculated as the number of transactions containing a particular item divided by the total number of transactions. Suppose we want to find the support for item B. This can be calculated as:
• Support(B) = (Transactions containing B) / (Total transactions)
• For instance, if 100 out of 1000 transactions contain Ketchup, then the support for the item Ketchup can be calculated as:
• Support(Ketchup) = 100 / 1000 = 10%
• Confidence
• Confidence refers to the likelihood that item B is also bought if item A is bought. It can be calculated as the number of transactions where A and B are bought together, divided by the total number of transactions where A is bought. Mathematically, it can be represented as:
• Confidence(A -> B) = (Transactions containing both A and B) / (Transactions containing A)
• Coming back to our problem, we had 50 transactions where Burger and Ketchup were bought together, while Burgers were bought in 150 transactions. The likelihood of buying Ketchup when a Burger is bought is the confidence of Burger -> Ketchup, and can be written as:
• Confidence(Burger -> Ketchup) = 50 / 150 = 33.3%
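These two measures are easy to compute directly. Below is a minimal Python sketch; the transaction list and item names are illustrative only, not the dataset used in these slides.

# Support and confidence over a list of transactions (each transaction is a set).
transactions = [
    {"Burger", "Ketchup", "Cola"},   # illustrative data only
    {"Burger", "Fries"},
    {"Burger", "Ketchup"},
    {"Milk", "Bread"},
]

def support(item_set, transactions):
    # Fraction of transactions that contain every item in item_set
    return sum(1 for t in transactions if item_set <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # P(consequent | antecedent): joint support divided by the antecedent's support
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"Ketchup"}, transactions))                 # 0.5 for the toy data
print(confidence({"Burger"}, {"Ketchup"}, transactions))   # 0.666... for the toy data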
• Step 1: Create a frequency table of all the items that occur in all the transactions.
• For our case:
• Step 2: We know that only those items are significant whose support is greater than or equal to the support threshold. Here, the support threshold is 50%, so only items occurring in more than three transactions are significant; these are Onion (O), Potato (P), Burger (B), and Milk (M). Therefore, we are left with:
• Step 3: The next step is to make all possible pairs of the significant items, keeping in mind that order doesn't matter, i.e., AB is the same as BA. To do this, take the first item and pair it with all the others: OP, OB, OM. Similarly, take the second item and pair it with the items that follow it: PB, PM. We skip PO because it is the same as OP, which already exists. So, all the pairs in our example are OP, OB, OM, PB, PM, and BM.
• Step 4: We will now count the occurrences of each pair in all the transactions.
• Step 5: Again, only those itemsets which cross the support threshold are significant, and those are OP, OB, PB, and PM.
• Step 6: Now let's say we would like to look for a set of three items that are purchased together. We will use the itemsets found in Step 5 to create sets of 3 items.
• To create a set of 3 items, another rule, called self-join, is required. It says that from the item pairs OP, OB, PB, and PM we look for two pairs with an identical first letter, and so we get:
• OP and OB, which gives OPB
• PB and PM, which gives PBM
• Next, we find the frequency for these two itemsets.
• Applying the threshold rule again, we find that OPB is the only significant itemset.
• Therefore, the set of 3 items that was purchased most frequently is OPB.
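The counting in Steps 1-6 can be sketched directly in Python. The transaction table below is illustrative (the slides' actual table is shown as an image), chosen so that it reproduces the frequent itemsets of the example; the 50% threshold is the one used above.

from itertools import combinations

# Illustrative transactions (O = Onion, P = Potato, B = Burger, M = Milk);
# replace with the real transaction table from the slides.
transactions = [
    {"O", "P", "B"},
    {"O", "P", "B", "M"},
    {"O", "P", "B"},
    {"P", "M"},
    {"P", "M"},
    {"P", "B"},
]
min_count = 0.5 * len(transactions)   # 50% support threshold, as a transaction count

def frequent(candidates):
    # Count each candidate itemset and keep those meeting the threshold (Steps 2 and 5)
    counts = {c: sum(1 for t in transactions if set(c) <= t) for c in candidates}
    return {c: n for c, n in counts.items() if n >= min_count}

# Steps 1-2: frequency table of single items, filtered by the threshold
items = sorted({item for t in transactions for item in t})
L1 = frequent([(item,) for item in items])

# Steps 3-5: candidate pairs from the surviving items, counted and filtered
L2 = frequent(list(combinations(sorted(item for (item,) in L1), 2)))

# Step 6: self-join pairs that agree on their first element into 3-item candidates
pairs = sorted(L2)
candidates3 = {tuple(sorted(set(a) | set(b)))
               for a, b in combinations(pairs, 2) if a[0] == b[0]}
L3 = frequent(sorted(candidates3))

print(L1)   # {('B',): 4, ('M',): 3, ('O',): 3, ('P',): 6}
print(L2)   # the frequent pairs: OB, PB, PM, OP
print(L3)   # the frequent triple: OPB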
General Process of the Apriori Algorithm
• The same idea is repeated level by level: generate candidate itemsets of length k from the frequent itemsets of length k - 1 (self-join), count their support over the transactions, and discard candidates below the threshold, until no new frequent itemsets are found.
Pros and Cons of the Apriori Algorithm
• Pros:
• It is an easy-to-implement and easy-to-understand algorithm.
• It can be used on large itemsets.
• Cons:
• Sometimes, it may need to generate a large number of candidate rules, which can be computationally expensive.
• Calculating support is also expensive because it requires scanning the entire database.
Implementation in Python
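The code on these slides is shown as images. A minimal sketch of an end-to-end Apriori run in Python, assuming the mlxtend library is available (the slides' actual library and dataset are not specified here), could look like this:

# A sketch using mlxtend's apriori and association_rules (assumed library choice).
# The transactions below are illustrative; load the real basket data instead.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["Onion", "Potato", "Burger"],
    ["Onion", "Potato", "Burger", "Milk"],
    ["Onion", "Potato", "Burger"],
    ["Potato", "Milk"],
    ["Potato", "Milk"],
    ["Potato", "Burger"],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Frequent itemsets with at least 50% support, then rules with at least 60% confidence
frequent_itemsets = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)

print(frequent_itemsets)
print(rules[["antecedents", "consequents", "support", "confidence"]])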
Assignment
When dealing with real-world problems, most of the time data will not come with predefined labels.
So, we want to group data points that are similar in nature (feature values) into clusters.
Introduction to Clustering
In supervised learning, we have a labelled set of examples.
Introduction to Unsupervised Learning
In unsupervised learning, we have an unlabeled set of examples.
In unsupervised learning, we form clusters in the dataset.
Clustering
Applications of Unsupervised Learning
K-Means Algorithm
In K-means, assume we have the dataset shown in the figure and we need to find 2 clusters.
The first step is to randomly initialize two points, called cluster centroids; two because we need 2 clusters.
K-means is an iterative algorithm that repeats two steps:
First, the cluster assignment step: assign each example to the closest cluster centroid.
Second, the move centroid step: move each centroid to the mean of the points assigned to it.
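A minimal NumPy sketch of these two steps (illustrative only; it assumes no cluster ever becomes empty, and in practice the scikit-learn KMeans class used later in these slides handles all of this):

# Minimal K-means: alternate the cluster assignment and move centroid steps.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Random initialization: pick k training examples as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Cluster assignment step: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move centroid step: move each centroid to the mean of its assigned points
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# Tiny illustrative dataset with two obvious groups
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)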
K-Means Algorithm Optimization Objective (Cost Function)
• K-means can be viewed as minimizing a cost (distortion) function: the average squared distance between each example and the centroid of the cluster it is currently assigned to:
• J(c(1), …, c(m), μ1, …, μK) = (1/m) Σ_{i=1}^{m} ||x(i) − μ_{c(i)}||²
• The cluster assignment step minimizes J with respect to the assignments c(i) while holding the centroids fixed, and the move centroid step minimizes J with respect to the centroids μk while holding the assignments fixed.
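This cost is straightforward to evaluate for a given clustering. A small NumPy sketch follows; note that scikit-learn's KMeans, used later in these slides, exposes the closely related sum of squared distances as inertia_.

# Sketch: the K-means distortion J for a given assignment and set of centroids.
import numpy as np

def distortion(X, labels, centroids):
    # Average squared distance of each example to its assigned centroid
    diffs = X - centroids[labels]
    return np.mean(np.sum(diffs ** 2, axis=1))

# Tiny illustrative example (not the slides' data)
X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.1, 0.9], [8.1, 7.95]])
print(distortion(X, labels, centroids))
# Note: sklearn's KMeans.inertia_ is the sum (not the mean) of these squared distances.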
K-Means Random Initialization
• A common way to initialize K-means is to randomly pick K training examples and set the cluster centroids equal to those examples.
• Depending on the random initialization, K-means can converge to different local optima, i.e., different final clusterings with different costs J.
• A standard remedy is to run K-means many times with different random initializations and keep the clustering with the lowest cost J (this is what the n_init parameter does in the scikit-learn code later in these slides).
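A sketch of that multiple-restart idea using scikit-learn (its n_init parameter already does this internally; the loop below only makes the idea explicit, and the dataset is illustrative):

# Run K-means several times with different random initializations and keep the
# run with the lowest cost (inertia_ plays the role of the cost J).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])   # illustrative data only

best_model, best_cost = None, np.inf
for seed in range(10):
    model = KMeans(n_clusters=2, init='random', n_init=1, random_state=seed).fit(X)
    if model.inertia_ < best_cost:
        best_model, best_cost = model, model.inertia_

print(best_cost)
print(best_model.cluster_centers_)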
Conclusion:
1. To conclude, for the most part, the number of clusters K is still chosen by hand, based on human input or insight.
2. One way to try to choose it is the Elbow Method, but we shouldn't expect it to always work well.
3. Thus, a better way to think about choosing the number of clusters is to ask: for what purpose are you running K-means?
Implementation of K-Means Clustering Algorithm in Python (Live Demo)
K-Means Implementation
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the mall dataset with pandas
dataset = pd.read_csv('Mall_Customers.csv')
X = dataset.iloc[:, [3, 4]].values

# Using the elbow method to find the optimal number of clusters
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

# Plot the graph to visualise the Elbow Method and find the optimal number of clusters
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# Applying KMeans to the dataset with the optimal number of clusters
kmeans = KMeans(n_clusters=5, init='k-means++', max_iter=300, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(X)
# Visualising the clusters
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s=100, c='green', label='Cluster 3')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s=100, c='cyan', label='Cluster 4')
plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s=100, c='magenta', label='Cluster 5')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
plt.title('Clusters of clients')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
K-Means (Disadvantages)
1. Choosing k manually.
2. Being dependent on initial values.
3. Clustering data of varying sizes and density.
4. Clustering outliers.
5. Scaling with number of dimensions.