ARM and Clustering

Association Rule Mining is a machine learning technique used to discover interesting relationships between variables in large datasets, focusing on frequent patterns and associations. The Apriori algorithm is a classical method for mining frequent itemsets and association rules, utilizing prior knowledge of itemset properties. Clustering, particularly through the K-Means algorithm, is an unsupervised learning method aimed at grouping similar data points, with the choice of the number of clusters often determined by methods like the Elbow Method.

Association Rule Mining

• Association Rule Mining is one of the ways to find patterns in data.
• It is a rule-based machine learning method for discovering interesting relations between variables in large databases.
• It aims to extract interesting correlations, frequent patterns, associations or causal structures among sets of items in transaction databases or other data repositories.
• In short, frequent itemset mining shows which items appear together in a transaction or relation.
Important Definitions

• Support
• It is one of the measures of interestingness. It tells us about the usefulness and certainty of rules.
• The support of an item can be calculated as the number of transactions containing that item divided by the total number of transactions. For an item B:

Support(B) = (Transactions containing B)/(Total Transactions)

• For instance, if 100 out of 1000 transactions contain Ketchup, then the support for Ketchup can be calculated as:

Support(Ketchup) = (Transactions containing Ketchup)/(Total Transactions)

Support(Ketchup) = 100/1000 = 10%
Important Definitions

• Confidence
• Confidence refers to the likelihood that item B is also bought when item A is bought. It is calculated as the number of transactions where A and B are bought together, divided by the total number of transactions where A is bought. Mathematically:

Confidence(A→B) = (Transactions containing both A and B)/(Transactions containing A)

• Coming back to our example, suppose there were 50 transactions where Burger and Ketchup were bought together, while Burger was bought in 150 transactions. Then the likelihood of buying Ketchup when a Burger is bought is the confidence of Burger→Ketchup:

Confidence(Burger→Ketchup) = (Transactions containing both (Burger and Ketchup))/(Transactions containing Burger)

Confidence(Burger→Ketchup) = 50/150 = 33.3%
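As a quick check of these definitions, here is a minimal Python sketch that computes support and confidence directly from a list of transactions (the five transactions below are hypothetical example data, not the slides' dataset):

# Minimal sketch: computing support and confidence from a list of transactions.
# The transactions below are hypothetical example data, not the slides' dataset.
transactions = [
    {'Burger', 'Ketchup', 'Milk'},
    {'Burger', 'Potato'},
    {'Burger', 'Ketchup'},
    {'Milk', 'Potato'},
    {'Burger'},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in `itemset`.
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Support of (antecedent AND consequent) divided by support of antecedent.
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

print(support({'Ketchup'}, transactions))                 # 2/5 = 0.4
print(confidence({'Burger'}, {'Ketchup'}, transactions))  # (2/5)/(4/5) = 0.5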
Apriori Algorithm

• It is a classical algorithm in data mining.
• Proposed by R. Agrawal and R. Srikant in 1994.
• It is used for mining frequent itemsets and relevant association rules.
• The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties.
Steps in Apriori Algorithm

• Let’s take an example from the supermarket sphere.
• I = {Onion, Burger, Potato, Milk, Beer} and a database consisting of six transactions.
• Each transaction is a tuple of 0’s and 1’s, where 0 represents the absence of an item and 1 its presence.
Cont.

• Step 1: Create a frequency table of all the items that occur in all the transactions.
• For our case:
Cont.

• Step 2: We know that only those elements are significant for which the support is greater than or
equal to the threshold support. Here, the support threshold is 50%, so only those items are significant
which occur in at least three of the six transactions; these are Onion(O), Potato(P), Burger(B), and
Milk(M). Therefore, we are left with:
Cont.

• Step 3: The next step is to make all possible pairs of the significant items, keeping in mind that the
order doesn’t matter, i.e., AB is the same as BA. To do this, take the first item and pair it with all the others,
such as OP, OB, OM. Similarly, take the second item and pair it with the succeeding items, i.e., PB, PM.
We only consider the succeeding items because PO (the same as OP) already exists. So, all the pairs
in our example are OP, OB, OM, PB, PM, BM.
Cont.

• Step 4: We will now count the occurrences of each pair in all the transactions.
Cont.

• Step 5: Again, only those itemsets which cross the support threshold are significant, and those are OP,
OB, PB, and PM.
Cont.

• Step 6: Now, suppose we would like to look for sets of three items that are purchased together. We will
use the itemsets found in step 5 and create sets of 3 items.
• To create a set of 3 items, another rule, called self-join, is required. It says that from the item pairs OP, OB,
PB and PM we look for two pairs with an identical first letter, and so we get:
• OP and OB, this gives OPB
• PB and PM, this gives PBM
Cont.
• Next, we find the frequency for these two itemsets.

• Applying the threshold rule again, we find that OPB is the only significant itemset.
• Therefore, the set of 3 items that was purchased most frequently is OPB.
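For concreteness, the counting in Steps 1-6 can be sketched in a few lines of Python. The transaction table below is hypothetical (the slides' table is not reproduced here), but it is chosen so that the same itemsets survive each step; for simplicity the 3-item candidates here are all triples of the surviving items, a superset of the self-join candidates.

from itertools import combinations

# Hypothetical transactions over I = {Onion, Burger, Potato, Milk, Beer};
# not the slides' table, but consistent with the walkthrough above.
transactions = [
    {'Onion', 'Potato', 'Burger'},
    {'Onion', 'Potato', 'Burger', 'Milk'},
    {'Onion', 'Potato', 'Burger', 'Beer'},
    {'Potato', 'Milk'},
    {'Onion', 'Milk'},
    {'Potato', 'Milk', 'Beer'},
]
min_count = 3  # 50% support threshold over 6 transactions

def frequent(candidates):
    # Count how many transactions contain each candidate itemset and keep
    # only those that meet the support threshold (Steps 2, 5 and the final check).
    counts = {c: sum(1 for t in transactions if set(c) <= t) for c in candidates}
    return {c: n for c, n in counts.items() if n >= min_count}

items = sorted(set().union(*transactions))
f1 = frequent([(i,) for i in items])              # Steps 1-2: frequent single items
survivors = sorted(i for (i,) in f1)
f2 = frequent(list(combinations(survivors, 2)))   # Steps 3-5: frequent pairs
f3 = frequent(list(combinations(survivors, 3)))   # Step 6: only (Burger, Onion, Potato) survives
print(f1, f2, f3, sep='\n')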
General Process of the Apriori
algorithm
Pros and Cons of the Apriori
algorithm

• Pros:
• It is an easy-to-implement and easy-to-understand algorithm.
• It can be used on large itemsets.

• Cons:
• Sometimes, it may need to find a large number of candidate
rules which can be computationally expensive.
• Calculating support is also expensive because it has to go
through the entire database.
Implementation in Python
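The original implementation slides are not reproduced here. As one possible sketch, the mlxtend library (an assumption; the course may use a different package such as apyori) provides an Apriori implementation that works on a one-hot encoded transaction table:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Toy transactions; for the assignment below, replace these with the rows of store_data.
transactions = [
    ['Burger', 'Ketchup', 'Milk'],
    ['Burger', 'Ketchup'],
    ['Milk', 'Potato'],
    ['Burger', 'Potato', 'Ketchup'],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Mine frequent itemsets, then derive association rules from them.
frequent_itemsets = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.6)
print(frequent_itemsets)
print(rules[['antecedents', 'consequents', 'support', 'confidence']])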
Assignment

 Find useful patterns from store_data using the Apriori algorithm.


Unsupervised Learning

 When dealing with real-world problems, most of the time, data will not come with predefined labels.

 So, we want to group the data into clusters whose members are similar in nature (feature values).
Introduction to Clustering

• Clustering is a type of unsupervised learning method.

• In basic terms, the objective of clustering is to find different groups within the elements of the data.

• To do so, clustering algorithms find structure in the data so that elements of the same cluster (or group) are more similar to each other than to those from different clusters.
Unsupervised Learning Process
Introduction to Unsupervised Learning

In supervised learning, we have a labelled set of examples.
Introduction to Unsupervised Learning

In unsupervised learning, we have an unlabeled set of examples.
Introduction to Unsupervised Learning

In unsupervised learning, we form clusters in the dataset.
Clustering
Applications of Unsupervised Learning
K-Means Algorithm

In K-means, assume we have a dataset as shown in the figure, and we need to find 2 clusters.
K-Means Algorithm

First step: randomly initialize two points, called cluster centroids; two, because we need 2 clusters.
K-Means Algorithm

K-means is an iterative algorithm that repeats two steps: first, the cluster assignment step; second, the move centroid step.
K-Means Algorithm

Repeat the cluster assignment step.
K-Means Algorithm

Perform the move centroid step.
K-Means Algorithm

Repeat the cluster assignment step.
K-Means Algorithm

If you keep running additional iterations of K-means from here, the cluster centroids will not change any further, and the cluster assignments (colors) of the points will not change any further. At this point, K-means has converged, and it has done a pretty good job of finding the 2 clusters in the data.
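The two steps described above fit in a few lines of NumPy. This is a minimal sketch on toy data to make the loop explicit; in practice one would use scikit-learn's KMeans, as in the implementation later in these slides.

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Random initialization: pick k training examples as the initial cluster centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Cluster assignment step: assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move centroid step: move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged: centroids stop moving
            break
        centroids = new_centroids
    return centroids, labels

# Toy data: two well-separated blobs, so K = 2 should recover them.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)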
K-Means Algorithm Optimization Objective (Cost Function)
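The equations and figures on these slides are not reproduced here. For reference, the standard distortion (cost) function that K-means minimizes is

J(c(1), …, c(m), μ1, …, μK) = (1/m) Σ_{i=1..m} ‖ x(i) − μ_{c(i)} ‖²

where c(i) is the index of the cluster to which example x(i) is currently assigned and μk is the k-th cluster centroid. The cluster assignment step minimizes J with respect to the assignments c(i) while holding the centroids fixed, and the move centroid step minimizes J with respect to the centroids μk while holding the assignments fixed. The WCSS used in the elbow-method code later in these slides is this same sum of squared distances, without the 1/m averaging.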
K-Means Random Initialization

Hence, depending upon the random initialization, k-means may converge to different solutions (local optima).
K-Means Random Initialization

So, if k-means does not give you good clusters, the solution is to try multiple random initializations, i.e., run k-means many times (see the sketch after the note below).
K-Means Random Initialization

Note: multiple random initialization works best when the number of clusters is small (roughly K = 2 to 10). When K is very large (say 100 to 1000), multiple random initializations are less likely to make a big difference, and the first random initialization will often already give a reasonably good solution.
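A minimal sketch of running k-means with multiple random initializations in scikit-learn (toy data; KMeans also supports repeated initializations directly through its n_init parameter, the manual loop below just makes the repeated runs explicit):

import numpy as np
from sklearn.cluster import KMeans

# Toy data: three blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(5, 1, (50, 2)),
               rng.normal(10, 1, (50, 2))])

# Run k-means many times with different random initializations and keep the
# solution with the lowest distortion (inertia_ is scikit-learn's name for the cost).
best = None
for seed in range(50):
    km = KMeans(n_clusters=3, init='random', n_init=1, random_state=seed).fit(X)
    if best is None or km.inertia_ < best.inertia_:
        best = km
print(best.inertia_)

# Equivalently, KMeans(n_clusters=3, n_init=50) performs the repeated random
# initializations internally and returns the best run.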
K-Means Choosing K Number of Clusters

How to choose the value of parameter K?

Solution No. 1: Data visualization
Solution No. 2: By hand
K-Means Choosing K Number of Clusters

A useful method is to use the Elbow Method to determine the best value of K.
K-Means Choosing K Number of Clusters

Distortion goes down rapidly until K = 3 and then decreases very slowly. Thus, K = 3 may be the best choice of K. It is called the elbow method because the curve is analogous to a human arm, with the "elbow" at the point where the decrease slows down.
K-Means Choosing K Number of Clusters

One reason people do not always use the elbow method is that it sometimes generates a curve with no clear elbow. In such a graph it is very difficult to determine the best value of K, because K = 3, 4, 5, or 6 all look like they could give a good number of clusters. Hence, the elbow method is useful mainly when the curve actually looks like an elbow.
K-Means Choosing K Number of Clusters

Sometimes the particular business purpose gives you a good way to decide the number of clusters. For instance, a particular brand might sell S, M, L T-shirts, or XS, S, M, L, XL T-shirts; hence, either K = 3 or K = 5.
K-Means Choosing K Number of Clusters

Conclusion:

1. For the most part, the number of clusters K is still chosen by hand, based on human input or human insight.

2. One way to do so is to use the Elbow Method, but we shouldn’t expect that it will always work well.

3. Thus, a better way to think about how to choose the number of clusters is to ask: for what purpose are you running K-means?
Implementation of K-Means Clustering Algorithm in Python (Live Demo)

K-Means Implementation
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the mall dataset with pandas
dataset = pd.read_csv('Mall_Customers.csv')
X = dataset.iloc[:, [3, 4]].values

# Using the elbow method to find the optimal number of clusters
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

# Plot the graph to visualise the Elbow Method to find the optimal number of clusters
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# Applying KMeans to the dataset with the optimal number of clusters
kmeans = KMeans(n_clusters=5, init='k-means++', max_iter=300, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(X)

# Visualising the clusters
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s=100, c='green', label='Cluster 3')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s=100, c='cyan', label='Cluster 4')
plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s=100, c='magenta', label='Cluster 5')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
plt.title('Clusters of clients')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending score (1-100)')
plt.legend()
plt.show()
K-Means (Advantages)

1. Relatively simple to implement.
2. Scales to large data sets.
3. Guarantees convergence.
4. Can warm-start the positions of centroids.
5. Easily adapts to new examples.
6. Generalizes to clusters of different shapes and sizes, such as elliptical clusters.
K-Means (Disadvantages)

1. Choosing K manually.
2. Being dependent on initial values.
3. Clustering data of varying sizes and densities.
4. Clustering outliers.
5. Scaling with number of dimensions.
