
21CSE355T- Data Mining and Analytics

Unit-2
School of Computing - SRMIST Kattankulathur Campus
Association Rules

● Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases.

● Analyses and predicts customer behaviour.


● If/then statements

Example
bread => butter
buys{onions, potatoes} => buys{tomatoes}

This information forms the basis for marketing activities such as product promotion and product pricing.

2
Association Rules (continued)

Understanding the buying patterns can help to increase sales in several ways.
Example:
● If there is a pair of items, X and Y, that are frequently bought together:

● Both X and Y can be placed on the same shelf, so that buyers of one item
would be prompted to buy the other.

● Promotional discounts could be applied to just one out of the two items.

● Advertisements on X could be targeted at buyers who purchase Y.

● X and Y could be combined into a new product, such as having Y in flavours of X.
3
Parts of Association Rule

bread => butter[20%,45%]


● Bread: Antecedent
● Butter: Consequent
● 20%: Support
● 45%: Confidence

● Support: denotes the probability that a transaction contains both bread and butter.
● Confidence: denotes the probability that a transaction containing bread also
contains butter.

4
Examples to calculate the Support and confidence

● Consider, in a supermarket:

Total transactions: 100
Transactions containing bread: 20
So 20/100 * 100 = 20% [Support].

Of those 20 transactions, butter occurs in 9.
So 9/20 * 100 = 45% [Confidence].

5
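The same calculation can be scripted. Below is a minimal Python sketch; the transaction list is hypothetical and chosen only to illustrate the two formulas, not taken from the slides.

# Minimal sketch: support and confidence of the rule bread => butter.
# The transactions below are made-up illustrative data.
transactions = [
    {"bread", "butter"}, {"bread", "milk"}, {"bread", "butter", "jam"},
    {"milk", "eggs"}, {"bread"}, {"butter", "milk"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in `itemset`.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Fraction of transactions containing the antecedent that also contain the consequent.
    return support(set(antecedent) | set(consequent), transactions) / support(antecedent, transactions)

print(support({"bread", "butter"}, transactions))       # support of the rule
print(confidence({"bread"}, {"butter"}, transactions))  # confidence of the rule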
Examples to calculate the Support and confidence
● Support: This says how popular an itemset is, as measured by the proportion of transactions in which the itemset appears. In the table, the support of {apple} is 4 out of 8, or 50%. Itemsets can also contain multiple items. For instance, the support of {apple, beer, rice} is 2 out of 8, or 25%.

● Confidence: This says how likely item Y is purchased when item X is purchased, expressed as {X -> Y}. This is measured by the proportion of transactions with item X in which item Y also appears. In the table, the confidence of {apple -> beer} is 3 out of 4, or 75%.

6
Classification of Association Rules

● Single Dimensional Association Rule
Eg: Bread => butter
Dimension: buying (one dimension)
● Multidimensional Association Rule
With 2 or more predicates or dimensions.
Eg: Occupation(IT), age(>22) => buys(laptop)
Dimensions must be unique; they should not repeat.
● Hybrid Dimensional Association Rule
With repeated predicates or dimensions.
Eg: Time(5 o'clock), buys(tea) => buys(biscuits)

7
Association Mining – Fields & Algorithms

● Web Usage Mining


● Banking
● Bioinformatics
● Market Basket Analysis
● Credit/Debit Card Analysis
● Product Clustering
● Catalog Design
Algorithms
● Apriori Algorithm
● Eclat Algorithm
● FP Growth Algorithm

8
● Frequent patterns are patterns (such as itemsets,
subsequences, or substructures) that appear in a data set
frequently.
● Eg: milk and bread.

● A subsequence, such as buying first a PC, then a digital camera, and then a memory card, if it occurs frequently in a shopping history database, is a (frequent) sequential pattern.

● If a substructure occurs frequently, it is called a (frequent) structured pattern.
● Eg: subgraphs, subtrees.
9
Frequent Itemset

10
Market basket Analysis

11
Market Basket Analysis

● Frequent item set mining leads to the discovery of


associations and correlations among items in large
transactional or relational data sets.
● This process analyses customer buying habits by finding
associations between the different items that customers
place in their “shopping baskets”.

12
Market Basket Analysis

13
Frequent Item sets, Closed Item sets, and Association Rules

• Rules that satisfy both a minimum support threshold (min sup) and a minimum confidence
threshold (min conf ) are called strong.
• A set of items is referred to as an item set.
• The occurrence frequency of an item set is the number of transactions that contain the itemset.
• If the relative support (the proportion of transactions in the dataset that contain a specific itemset) of an item set I satisfies a pre-specified minimum support threshold (i.e., the absolute support of I satisfies the corresponding minimum support count threshold), then I is a frequent item set.

14
Frequent Item sets, Closed Item sets, and Association Rules

In general, association rule mining can be viewed as a two-step process:

1. Find all frequent item sets: By definition, each of these itemsets will
occur at least as frequently as a predetermined minimum support
count, min sup.
2. Generate strong association rules from the frequent itemsets: By
definition, these rules must satisfy minimum support and minimum
confidence

15
Frequent Item sets, Closed Item sets, and Association Rules

● Maximal Itemset: An itemset is maximal frequent if none of its supersets are frequent.
● Closed Itemset: An itemset is closed if none of its immediate supersets has the same support count as the itemset.
● K-Itemset: An itemset that contains K items is a K-itemset. An itemset is frequent if its support count is greater than or equal to the minimum support count.

16
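The two definitions can be checked mechanically. The following Python sketch uses a tiny made-up transaction set and brute-force enumeration (the min_sup count of 2 is an assumption); it is only meant to illustrate the closed/maximal distinction, not to be an efficient miner.

from itertools import combinations

transactions = [{"A","B","C"}, {"A","B"}, {"A","C"}, {"B","C"}, {"A","B","C"}]
min_sup = 2
items = sorted(set().union(*transactions))

def count(itemset):
    # Number of transactions containing every item of `itemset`.
    return sum(set(itemset) <= t for t in transactions)

frequent = {fs: count(fs)
            for k in range(1, len(items) + 1)
            for fs in map(frozenset, combinations(items, k))
            if count(fs) >= min_sup}

for fs, sup in frequent.items():
    supersets = [g for g in frequent if fs < g]
    is_maximal = not supersets                              # no frequent proper superset
    is_closed = all(frequent[g] != sup for g in supersets)  # no superset with the same support
    print(sorted(fs), sup, "closed" if is_closed else "", "maximal" if is_maximal else "")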
Maximal & Closed Frequent Item set

17
Maximal & Closed Frequent Item set

All maximal frequent itemsets are closed frequent itemsets, but not all closed frequent itemsets are maximal frequent itemsets.

18
Apriori Algorithm

19
Frequent Pattern Mining

● Based on the completeness of patterns to be mined.
Eg: closed frequent itemsets, maximal frequent itemsets, constrained frequent itemsets, etc.
● Based on the levels of abstraction involved in the rule set.

● Based on the number of data dimensions involved in the rule.
Eg: single-dimensional association rule, multidimensional association rule.

20
● Based on the types of values handled in the rule.
Eg: Boolean association rule, quantitative association
rule

● Based on the kinds of rules to be mined.
Eg: Association rules, correlation rules.

● Based on the kinds of patterns to be mined.
Eg: Sequential pattern mining, structured pattern mining, etc.

21
Frequent Itemset Mining Methods – Apriori Alg

It is used for mining frequent itemsets for Boolean association rules.

Apriori property: All nonempty subsets of a frequent itemset must also be frequent.

Two steps are involved:
1. Join step
2. Prune step

22
23
Example -2

24
Steps of Apriori Algorithm

1. Generate the candidate 1-itemset C1.
2. Check each candidate against the required minimum support count.
3. The set of frequent 1-itemsets L1 is generated from C1 by keeping only the itemsets that satisfy the minimum support count (pruning).
4. Discover the candidate 2-itemsets by joining L1 with itself (L1 * L1) and generate C2.
5. L2 is generated by pruning the candidates that do not satisfy the minimum support.
6. Discover the candidate 3-itemsets by joining L2 with itself (L2 * L2) and generate C3.
7. L3 is generated by pruning the candidates that do not satisfy the minimum support.
8. Discover the candidate 4-itemsets by joining L3 with itself (L3 * L3) and generate C4.
9. The algorithm ends the frequent pattern mining when no frequent itemset can be generated at the next level (here, when no frequent 4-itemset is available).
A short Python sketch of these steps is given below.
25
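A compact Python sketch of this level-wise join-and-prune procedure follows; the function name and data are illustrative, and the code favours clarity over efficiency.

from itertools import combinations

def apriori(transactions, min_sup_count):
    # Returns {frozenset(itemset): support_count} for all frequent itemsets.
    transactions = [set(t) for t in transactions]

    def counts(candidates):
        return {c: sum(c <= t for t in transactions) for c in candidates}

    # C1 -> L1
    c1 = {frozenset([i]) for t in transactions for i in t}
    Lk = {c: s for c, s in counts(c1).items() if s >= min_sup_count}
    frequent = dict(Lk)
    k = 2
    while Lk:
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        prev = list(Lk)
        Ck = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        Lk = {c: s for c, s in counts(Ck).items() if s >= min_sup_count}
        frequent.update(Lk)
        k += 1
    return frequent

# The example that follows: min support 50% of 4 transactions = support count 2.
db = [{"A","B","C"}, {"A","C"}, {"A","D"}, {"B","E","F"}]
print(apriori(db, 2))   # expected to contain {A}, {B}, {C} and {A,C}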
Apriori Algorithm

26
● Find the frequent item sets in the following database
with min support 50% & min confidence 50%.
Transaction id Items Bought
2000 A,B,C
1000 A,C
4000 A,D
5000 B,E,F

● 50/100 * 4 = 2
● So the minimum support count is 2.

27
● Step 1: Find C1
Items Support count
[A] 3
[B] 2
[C] 2
[D] 1
[E] 1
[F] 1

● The minimum support count is 2, so eliminate the itemsets whose count is less than that.

28
● Step 2: Compare the candidate support counts with the minimum support count; L1 will be
Items Support
[A] 3
[B] 2
[C] 2

● Step 3: Generate candidate C2 from L1.

Items

[A,B]
[A,C]
[B,C]

29
● Step 4: Scan D for count of each candidate in C2 and find
support.
Items Support
[A,B] 1
[A,C] 2
[B,C] 1

● Step 5: Compare the candidate C2 support counts with the minimum support count; L2 will be

Items Support
[A,C] 2

● Step 6: So the data contains the frequent itemset [A,C].


30
Association Rule Support Confidence Confidence %
A->C 2 2/3=0.66 66%
C->A 2 2/2=1 100%

Min Confidence - 50%

So final rules are

Rule 1: A -> C
Rule 2: C -> A

31
Generating Association Rules for Frequent Item sets

● Association rules can be generated as follows:

32
Generating Association Rules for Frequent Item sets

Minimum Confidence : 70%


33
Generating Association Rules for Frequent Item sets

● R1 I1^I2 -> I5
Confidence = SC(I1,I2,I5)/SC(I1,I2) = 2/4=50%. (Rejected)
● R2 I1^I5 -> I2
Confidence = SC(I1,I5,I2)/SC(I1,I5) = 2/2=100%. (Accepted)
● R3 I2^I5 -> I1
Confidence = SC(I2,I5,I1)/SC(I2,I5) = 2/2=100%. (Accepted)
● R4 I1->I2^I5
Confidence = SC(I1,I2,I5)/SC(I1) = 2/6=33%. (Rejected)
● R5 I2->I1^I5
Confidence = SC(I2,I1,I5)/SC(I2) = 2/7=29%. (Rejected)
● R6 I5->I1^I2
Confidence = SC(I5,I1,I2)/SC(I5) = 2/2=100%. (Accepted)
34
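The rule-generation step can be sketched in the same way: for every frequent itemset with at least two items, each non-empty proper subset is tried as an antecedent, and the rule is kept if its confidence meets the threshold. The support counts below are the ones used in the I1/I2/I5 walkthrough above; the function name is illustrative.

from itertools import combinations

def generate_rules(frequent, min_conf):
    # frequent: {frozenset(itemset): support_count}. Returns (antecedent, consequent, confidence) triples.
    rules = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                conf = sup / frequent[antecedent]   # SC(itemset) / SC(antecedent)
                if conf >= min_conf:
                    rules.append((set(antecedent), set(consequent), conf))
    return rules

support_counts = {frozenset(["I1"]): 6, frozenset(["I2"]): 7, frozenset(["I5"]): 2,
                  frozenset(["I1","I2"]): 4, frozenset(["I1","I5"]): 2, frozenset(["I2","I5"]): 2,
                  frozenset(["I1","I2","I5"]): 2}
for a, c, conf in generate_rules(support_counts, 0.70):
    print(a, "->", c, f"{conf:.0%}")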
Problem - 1
● A database has five transactions. Let the minimum support and confidence be min_sup = 60%, min_conf = 100%.
● Find the frequent itemsets and generate the association rules using the Apriori algorithm.

TID   ITEMS
T1    {M,O,N,K,E,Y}
T2    {D,O,N,K,E,Y}
T3    {M,A,K,E}
T4    {M,U,C,K,Y}
T5    {C,O,O,K,I,E}

35
Problem - 2
● A database has five transactions. Let the minimum support and confidence be min_sup = 3, min_conf = 80%.
TID ITEMS
T1 {1,2,3,4,5,6}
T2 {7,2,3,4,5,6}
T3 {1,8,4,5}
T4 {1,9,0,4,6}
T5 {0,2,2,4,5}

● Find the frequent item sets and generate the


association rules using Apriori algorithm.

36
Improving the Efficiency of Apriori

● Transaction Reduction (reducing the number of transactions scanned in future iterations): A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets.

● Partitioning (partitioning the data to find candidate itemsets): A partitioning technique can be used that requires just two database scans to mine the frequent itemsets.
● In Phase I, the algorithm subdivides the transactions of D into n non-overlapping partitions. If the minimum support threshold for transactions in D is min_sup, then the minimum support count for a partition is min_sup × the number of transactions in that partition.
● All frequent itemsets within a partition are found. These are referred to as local frequent itemsets.

37
Improving the Efficiency of Apriori

● In Phase II, any itemset that is potentially frequent with respect to D must occur as a frequent itemset in at least one of the partitions. Therefore, all local frequent itemsets are candidate itemsets with respect to D.
● The collection of frequent itemsets from all partitions forms the global candidate itemsets.

38
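A rough sketch of this two-phase idea is shown below. It assumes some frequent-itemset miner is available for the local passes (for example the apriori() sketch given earlier); the partitioning scheme and parameter names are illustrative.

def partition_mine(transactions, min_sup_fraction, n_partitions, find_local_frequent):
    size = -(-len(transactions) // n_partitions)          # ceiling division
    partitions = [transactions[i:i + size] for i in range(0, len(transactions), size)]

    # Phase I: local frequent itemsets per partition (local minimum support count).
    global_candidates = set()
    for part in partitions:
        local_min_count = max(1, int(min_sup_fraction * len(part)))
        global_candidates |= set(find_local_frequent(part, local_min_count))

    # Phase II: one more full scan of D to count the global candidates.
    global_min_count = min_sup_fraction * len(transactions)
    counts = {c: sum(c <= set(t) for t in transactions) for c in global_candidates}
    return {c: n for c, n in counts.items() if n >= global_min_count}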
Improving the Efficiency of Apriori

Sampling (mining on a subset of the given data):
● Pick a random sample S of the given data D, and then search for frequent itemsets in S instead of D.

Dynamic itemset counting (adding candidate itemsets at different points during a scan):
● The database is partitioned into blocks marked by start points.
● New candidate itemsets can be added at any start point.

39
Hash Based techniques

● Hash-based technique (hashing itemsets into corresponding buckets):
● A hash-based technique can be used to reduce the size of the candidate k-itemsets.

40
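A minimal sketch of the hashing idea for 2-itemsets (in the spirit of the DHP technique): while the first scan counts 1-itemsets, every 2-itemset of every transaction is hashed into a bucket; any candidate 2-itemset whose bucket count is below the minimum support count can be discarded. The bucket count and hash function here are illustrative assumptions.

from itertools import combinations

def bucket_counts(transactions, n_buckets=7):
    # Hash every 2-itemset of every transaction into one of n_buckets buckets.
    counts = [0] * n_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            counts[hash(pair) % n_buckets] += 1
    return counts

def prune_candidates(candidates, counts, min_sup_count, n_buckets=7):
    # Keep only candidate 2-itemsets whose bucket count can reach min_sup_count.
    return [c for c in candidates
            if counts[hash(tuple(sorted(c))) % n_buckets] >= min_sup_count]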
Hash Based techniques

41
Frequent Pattern Growth Algorithm

42
Mining Frequent Item sets without Candidate
Generation
Disadvantages of the Apriori algorithm:
● It may need to generate a huge number of candidate sets.
● It may need to repeatedly scan the database and check a large set of
candidates by pattern matching.
FP Growth Algorithm

43
Mining Frequent Item sets without Candidate Generation

● We start from the node that has the minimum support count, i.e., I5.
● We exclude the node with the maximum support count, i.e., I2, when preparing the table.

44
Mining Frequent Item sets without Candidate Generation

The Conditional FP-Tree associated with the Conditional node I3.

45
FP Growth Algorithm

46
FP Growth Algorithm Vs Apriori Algorithm

FP Growth Algorithm | Apriori Algorithm

1. The FP growth algorithm is faster than the Apriori algorithm. | It is slower than the FP growth algorithm.

2. It is a tree-based algorithm (it builds an FP-tree). | It is a candidate-generation (join-and-prune) algorithm.

3. It requires only 2 database scans. | It requires multiple database scans to generate the candidate sets.

4. It uses depth-first search. | It uses breadth-first search.

5. Less accurate. | More accurate.

47
FP GROWTH ALGORITHM Vs APRIORI ALGORITHM

48
Problems 1 – FP Growth Tree

● A database has five transactions. Let the minimum support min_sup = 60%.
● Find the frequent itemsets using the FP growth algorithm.

TID   ITEMS
T1    {M,O,N,K,E,Y}
T2    {D,O,N,K,E,Y}
T3    {M,A,K,E}
T4    {M,U,C,K,Y}
T5    {C,O,O,K,I,E}

49
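Answers to problems like this one can be cross-checked against a library implementation, for example mlxtend (assuming the package is installed; the result should still be verified against a manual FP-tree construction).

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [list("MONKEY"), list("DONKEY"), list("MAKE"), list("MUCKY"), list("COOKIE")]

te = TransactionEncoder()
onehot = te.fit(transactions).transform(transactions)
df = pd.DataFrame(onehot, columns=te.columns_)

# min_sup = 60% of 5 transactions, i.e. a support count of 3.
print(fpgrowth(df, min_support=0.6, use_colnames=True))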
Problems 2 – FP Growth Tree

● A database has eight transactions. Let the minimum support min_sup = 30%.

TID ITEMS
1 {E,A,D,B}
2 {D,A,C,E,B}
3 {C,A,B,E}
4 {B,A,D}
5 {D}
6 {D,B}
7 {A,D,E}
8 {B,C}

● Find the frequent item sets using FP growth Algorithm.

50
Mining Frequent Item sets Using Vertical Data Format
The horizontal data format is converted to the vertical data format.

51
Mining Frequent Item sets Using Vertical Data Format

52
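Converting the horizontal format (TID -> items) into the vertical format (item -> TID set) is a simple transformation, and the support of a larger itemset is then the size of the intersection of its TID sets. A brief sketch with made-up data:

from collections import defaultdict

# Horizontal format: TID -> items (toy data, for illustration only).
horizontal = {"T1": {"A","B","E"}, "T2": {"B","D"}, "T3": {"B","C"}, "T4": {"A","B","D"}}

# Vertical format: item -> set of TIDs that contain it.
vertical = defaultdict(set)
for tid, items in horizontal.items():
    for item in items:
        vertical[item].add(tid)

# Support count of {A,B} = size of the intersection of the TID sets of A and B.
tids_ab = vertical["A"] & vertical["B"]
print(dict(vertical))
print("support count of {A,B}:", len(tids_ab))   # T1 and T4 -> 2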
Mining Closed Frequent Item sets

● A closed frequent itemset is an itemset that is both closed and frequent (its support is greater than or equal to min_sup).
● An itemset is closed in a data set if there exists no superset that has the same support count as the original itemset.

● Frequent itemset mining may generate a huge number of frequent itemsets when the min_sup threshold is set low or when long patterns exist in the data set.

53
Mining Closed Frequent Item sets

“How can we mine closed frequent itemsets?”


● First mine the complete set of frequent itemsets.
● Then remove every frequent itemset that is a proper subset of, and carries the same support as, an
existing frequent itemset.

● A better approach is to search for closed frequent itemsets directly during the mining process.
● This requires us to prune the search space as soon as we can identify the case of closed itemsets during
mining.

54
Pruning strategies

● Item merging: If every transaction containing a frequent itemset X also contains an itemset Y but no proper superset of Y, then X ∪ Y forms a frequent closed itemset, and there is no need to search for any itemset containing X but not Y.

55
Pruning strategies

● Sub-itemset pruning: If a frequent itemset X is a proper subset of an already found frequent closed itemset Y and support count(X) = support count(Y), then X and all of X's descendants in the set enumeration tree cannot be frequent closed itemsets and thus can be pruned.

56
Pruning strategies
Item skipping: In the depth-first mining of closed itemsets, at each level, there will
be a prefix itemset X associated with a header table and a projected database.
● If a local frequent item p has the same support in several header tables at
different levels, we can safely prune p from the header tables at higher levels.

57
Pruning strategies
● An important optimization is to perform efficient closure checking.

Perform two kinds of closure checking:


● superset checking: checks if this new frequent itemset is a superset of some
already found closed itemsets with the same support.

● subset checking: checks whether the newly found itemset is a subset of an already
found closed itemset with the same support.

● For efficient subset checking, we can use the following property:

● If the current itemset Sc can be subsumed by another already found closed itemset Sa,
then
(1) Sc and Sa have the same support.
(2) the length of Sc is smaller than that of Sa.
(3) all of the items in Sc are contained in Sa.

58
Which Patterns Are Interesting?—Pattern
Evaluation Method
● Most association rule mining algorithms employ a support-confidence
framework.
● Many interesting rules can be found using low support thresholds.
● Strong Rules Are Not Necessarily Interesting.
● Whether or not a rule is interesting can be assessed either subjectively
or objectively.
● Only the user can judge if a given rule is interesting, and this judgment, being subjective, may differ from one user to another.
● Objective interestingness measures are based on the statistics “behind” the data.

59
Association Mining to Correlation Analysis

A misleading “strong” association rule.


● Let game refer to the transactions containing computer games,
and video refer to those containing videos. Of the 10,000
transactions analyzed, the data show that 6,000 of the customer
transactions included computer games, while 7,500 included
videos, and 4,000 included both computer games and videos.
● Minimum support: 30%; minimum confidence: 60%.
● Support of the rule buys(game) => buys(video): 4000/10000 = 40%
● Confidence of the rule: 4000/6000 = 66%

60
Association Mining to Correlation Analysis

● The probability of purchasing videos is 75%, which is even larger than 66%.
● In fact, computer games and videos are negatively associated because
the purchase of one of these items actually decreases the likelihood of
purchasing the other.

61
From Association Analysis to Correlation Analysis

● The support and confidence measures are insufficient for filtering out uninteresting association rules.
● This leads to correlation rules of the form
A => B [support, confidence, correlation].
● A correlation rule is measured not only by its support and confidence
but also by the correlation between item sets A and B.

62
Correlation Measures
● Lift is a simple correlation measure.
● The occurrence of itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A)P(B); otherwise, itemsets A and B are dependent and correlated as events.
● Lift(A,B) = P(A ∪ B) / (P(A)P(B)).

• Lift(A,B)<1 – A & B are negatively correlated.


• Lift(A,B)>1 – A & B are positively correlated.
• Lift(A,B)=1 – A & B are not correlated, they are independent.

• It assesses the degree to which the occurrence of one “lifts” the occurrence of
the other.

63
Correlation analysis using lift

Lift = 0.40 / (0.60 × 0.75) = 0.89 < 1, so game and video are negatively correlated.

64
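The lift above follows directly from the figures in the example (6,000 game, 7,500 video, 4,000 both, out of 10,000 transactions):

total = 10_000
games, videos, both = 6_000, 7_500, 4_000

p_game, p_video, p_both = games / total, videos / total, both / total
lift = p_both / (p_game * p_video)   # P(game and video) / (P(game) * P(video))
print(round(lift, 2))                # 0.89 < 1 -> negatively correlated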
Correlation analysis using Chi square

65
Correlation analysis using Chi square

66
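The same game/video data can also be tested with a chi-square test of independence, using the 2x2 contingency table implied by the counts (4,000 game and video, 2,000 game only, 3,500 video only, 500 neither). A sketch using scipy, assuming it is available:

from scipy.stats import chi2_contingency

#                  video  no video
observed = [[4000, 2000],    # game
            [3500,  500]]    # no game

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 1), p_value)   # a large chi-square with a tiny p-value: game and video are not independent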
67
Thank You

68
