Unit 2
Association Rules: Basic Structure
The goal of association rule mining is to find interesting relationships or patterns between items in large datasets.
Often used to answer:
➔ "When X happens, how likely is Y to happen?"
An association rule is usually written in the form:
X→Y
Where:
X is called the Antecedent (the "if" part)
Y is called the Consequent (the "then" part)
This rule reads as: "If X occurs, then Y is likely to occur."
Applications
1. Market Basket Analysis: Retail stores use association rules to find products that are frequently
bought together. Example: Milk and Bread, Chips and Soft Drinks.
2. E-commerce Recommendations: Websites like Amazon or Flipkart recommend products based
on association rules derived from previous user behavior.
3. Medical Diagnosis: Helps doctors find patterns in symptoms and diseases.
Example: Patients with symptom A and B are likely to have disease C.
4. Web Usage Mining: Predicts which webpages a user is likely to visit next based on browsing
history.
5. Fraud Detection: Detects unusual patterns of transactions that might indicate fraudulent
activity.
Association Rule Parameters
Key Terms Every Student Should Know
Term | Meaning
Item | A product or object (e.g., Milk, Bread)
Itemset | A group of items bought together
Transaction | A record of the items purchased together in one basket
Rule | An if-then statement (e.g., X → Y)
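These terms map directly onto simple data structures. A minimal illustration in Python (the item names here are just placeholders):

# A transaction is one customer's basket, stored as a set of items.
transaction = {"Milk", "Bread", "Butter"}
# An itemset is any group of items we want to reason about.
itemset = {"Milk", "Bread"}
# A rule X -> Y can be represented as a pair of itemsets (antecedent, consequent).
rule = ({"Milk"}, {"Bread"})
# "Does this transaction contain the itemset?" is just a subset test:
print(itemset.issubset(transaction))  # True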
Association Rule Mining is not just about finding rules; it's about finding strong and useful rules.
To measure the quality and strength of these rules, we use certain parameters:
1. Support
2. Confidence
3. Lift
These are the three key metrics used to measure the strength and usefulness of association rules and to filter out weak ones.
1. Support
Definition:
Support tells us how frequently an itemset appears in the dataset. It helps in identifying rules that
are relevant to a large number of transactions.
Formula:
Support(X) = (Number of transactions containing X) / (Total number of transactions)
Purpose:
Helps to filter out infrequent and insignificant rules.
High support = Rule is applicable to many transactions (more useful).
Example:
If 100 transactions are recorded and 20 include both Milk and Bread,
Support=20/100=0.20 (or 20%)
2. Confidence
Definition:
Confidence measures how often items in Y appear in transactions that contain X.
In simple terms: When X occurs, how likely is Y to also occur?
Formula:
Confidence(X → Y) = Support(X ∪ Y) / Support(X)
Example:
If 30 transactions contain Milk and 20 of those also contain Bread,
Confidence(Milk → Bread) = 20/30 ≈ 0.66 (or 66%)
3. Lift
Definition:
Lift measures how much more likely the antecedent and consequent are to occur together than if they were independent.
Formula:
Lift(X → Y) = Confidence(X → Y) / Support(Y)
Purpose:
Shows the strength and significance of a rule beyond random chance.
Helps to identify whether the rule is truly useful.
Interpretation:
Lift > 1 ➔ Positive correlation (X and Y occur together more than by chance)
Lift = 1 ➔ X and Y are independent (no real association)
Lift < 1 ➔ Negative correlation (X and Y occur together less often than by chance)
Example:
If Confidence is 0.66 and Support for Bread is 0.40:
Lift=0.66/0.40=1.65
This means Milk and Bread are bought together 1.65 times more often than would be expected by chance.
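To make the three metrics concrete, here is a minimal Python sketch that computes them from raw transactions (the five-transaction dataset below is made up for illustration):

transactions = [
    {"Milk", "Bread"}, {"Milk", "Bread"}, {"Milk"}, {"Bread"}, {"Milk", "Bread", "Butter"},
]

def support(itemset, transactions):
    # Fraction of all transactions that contain every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    # Of the transactions containing X, the fraction that also contain Y
    return support(X | Y, transactions) / support(X, transactions)

def lift(X, Y, transactions):
    # Co-occurrence of X and Y relative to what independence would predict
    return confidence(X, Y, transactions) / support(Y, transactions)

X, Y = {"Milk"}, {"Bread"}
print(support(X | Y, transactions))    # 0.6
print(confidence(X, Y, transactions))  # 0.75
print(lift(X, Y, transactions))        # 0.9375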
Algorithms Used in Association Rule Mining
1. Apriori Algorithm
The Apriori algorithm is one of the foundational and most popular algorithms for mining frequent itemsets and association rules in large datasets. It is widely used in market basket analysis, where it helps to find patterns in transaction data, such as which products are frequently bought together. It follows a bottom-up approach: frequent itemsets of length 1 are found first, and progressively longer itemsets are then generated from them.
Working:
Apriori relies on the Apriori property: if an itemset is frequent, then all of its subsets must also be frequent. Conversely, if an itemset is infrequent, all of its supersets must also be infrequent. This property lets the algorithm discard large parts of the search space early. The main phases are:
1. Generate frequent itemsets: Start with itemsets of size 1 and progressively generate
larger itemsets by combining frequent itemsets of smaller sizes.
2. Prune itemsets: After generating the candidate itemsets, the algorithm uses a "pruning"
step to eliminate itemsets that do not meet the minimum support threshold.
3. Rule generation: From the frequent itemsets, the association rules are generated based
on the specified confidence threshold.
Algorithm
Step 1: Generate Candidate Itemsets
Step 2: Count the Support for Each Candidate Itemset
Step 3: Prune Infrequent Itemsets
Step 4: Generate New Candidate Itemsets
Step 5: Repeat Steps 2 to 4 for Larger Itemsets
Step 6: Generate Association Rules
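A minimal Python sketch of Steps 1 to 5, assuming transactions are given as a list of sets (rule generation is omitted for brevity); this is a teaching sketch rather than an optimized implementation:

from itertools import combinations

def apriori(transactions, min_support):
    # Returns every frequent itemset (as a frozenset) mapped to its support.
    n = len(transactions)
    # Step 1: candidate 1-itemsets
    items = {item for t in transactions for item in t}
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Steps 2-3: count support and prune infrequent candidates
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        survivors = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(survivors)
        # Steps 4-5: join surviving k-itemsets into (k+1)-itemset candidates,
        # keeping only those whose k-subsets are all frequent (Apriori property)
        k += 1
        joined = {a | b for a, b in combinations(survivors, 2) if len(a | b) == k}
        candidates = [c for c in joined
                      if all(frozenset(s) in survivors for s in combinations(c, k - 1))]
    return frequent

Each pass of the while loop corresponds to one full scan of the database, which is exactly why Apriori becomes expensive on large datasets (see the disadvantages below).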
Advantages:
Easy to understand and implement.
Finds all frequent itemsets systematically, pruning the search space with the Apriori property.
Disadvantages:
Computationally expensive for large datasets.
Requires multiple scans of the database, which is inefficient for large data.
Generates a large number of candidate itemsets.
2. FP-Growth Algorithm (Frequent Pattern Growth)
The FP-Growth Algorithm is an improvement over the Apriori algorithm.
It uses a tree structure called the FP-tree (Frequent Pattern Tree) to store compressed information about frequent itemsets.
Unlike Apriori, FP-Growth does not generate candidate itemsets explicitly and typically needs only two scans of the database.
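In practice one usually calls a library rather than hand-coding the tree. A sketch assuming the open-source mlxtend library is installed (function names as documented for mlxtend.frequent_patterns; check your installed version):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

transactions = [["Milk", "Bread", "Butter"],
                ["Milk", "Bread"],
                ["Bread", "Butter"],
                ["Milk", "Bread", "Butter"],
                ["Milk", "Bread", "Butter", "Ice Cream"]]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Mine frequent itemsets without explicit candidate generation
frequent = fpgrowth(df, min_support=0.6, use_colnames=True)

# Derive association rules with a confidence threshold
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])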
Example:
Apply the Apriori algorithm to the following five transactions, using a minimum support threshold of 60% (0.60).
Transaction ID | Items Purchased
T1 | {Milk, Bread, Butter}
T2 | {Milk, Bread}
T3 | {Bread, Butter}
T4 | {Milk, Bread, Butter}
T5 | {Milk, Bread, Butter, Ice Cream}
Solution:
Calculate the Support for individual items (1-itemsets). Support is defined as:
Support(X) = (Number of transactions containing X) / (Total number of transactions)
Item | Transactions it appears in | Count | Support
Milk | T1, T2, T4, T5 | 4 | 4/5 = 0.8
Bread | T1, T2, T3, T4, T5 | 5 | 5/5 = 1.0
Butter | T1, T3, T4, T5 | 4 | 4/5 = 0.8
Ice Cream | T5 | 1 | 1/5 = 0.2
Milk, Bread, and Butter meet the 60% threshold; Ice Cream (0.20) does not, so it is pruned.
Candidate 2-itemsets
The candidate 2-itemsets formed from the frequent 1-itemsets are: {Milk, Bread}, {Milk, Butter}, {Bread, Butter}
Calculate Support for 2-itemsets
The supports of all pairs are shown below (pairs containing the pruned item Ice Cream are included for reference):
Pair | Transactions it appears in | Count | Support
Milk, Bread | T1, T2, T4, T5 | 4 | 4/5 = 0.8
Milk, Butter | T1, T4, T5 | 3 | 3/5 = 0.6
Milk, Ice Cream | T5 | 1 | 1/5 = 0.2
Bread, Butter | T1, T3, T4, T5 | 4 | 4/5 = 0.8
Bread, Ice Cream | T5 | 1 | 1/5 = 0.2
Butter, Ice Cream | T5 | 1 | 1/5 = 0.2
Applying the 60% threshold, the frequent 2-itemsets are {Milk, Bread} (0.80), {Milk, Butter} (0.60), and {Bread, Butter} (0.80); every pair containing Ice Cream falls below it.
Candidate 3-itemsets
The only candidate 3-itemset whose 2-item subsets are all frequent is {Milk, Bread, Butter}; the remaining triplets all contain Ice Cream:
Triplet | Transactions it appears in | Count | Support
Milk, Bread, Butter | T1, T4, T5 | 3 | 3/5 = 0.6
Milk, Bread, Ice Cream | T5 | 1 | 1/5 = 0.2
Milk, Butter, Ice Cream | T5 | 1 | 1/5 = 0.2
Bread, Butter, Ice Cream | T5 | 1 | 1/5 = 0.2
With the minimum support threshold of 60% (0.60), {Milk, Bread, Butter} (support 0.60) is the only frequent 3-itemset. Since only one frequent 3-itemset remains, no 4-itemset candidates can be generated and the algorithm stops.
Generate Association Rules from Frequent Itemsets
From 2-itemsets
(using the 1-itemset supports: Support(Milk) = 0.8, Support(Bread) = 1.0, Support(Butter) = 0.8)
{Milk, Bread}
Rule 1: Milk → Bread
Confidence=0.8/0.8=1.0
Lift=1.0/1.0=1.0
Rule 2: Bread → Milk
Confidence=0.8/1.0=0.8
Lift=0.8/0.8=1.0
{Milk, Butter}
Rule 3: Milk → Butter
Confidence=0.6/0.8=0.75
Lift=0.75/0.8=0.9375
Rule 4: Butter → Milk
Confidence=0.6/0.8=0.75
Lift=0.75/0.8=0.9375
{Bread, Butter}
Rule 5: Bread → Butter
Confidence=0.8/1.0=0.8
Lift=0.8/0.8=1.0
Rule 6: Butter → Bread
Confidence=0.8/0.8=1.0
Lift=1.0/1.0=1.0
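These six rules can be verified mechanically. A short self-contained Python check against the five transactions, reusing the standard definitions of confidence and lift:

transactions = [{"Milk", "Bread", "Butter"},
                {"Milk", "Bread"},
                {"Bread", "Butter"},
                {"Milk", "Bread", "Butter"},
                {"Milk", "Bread", "Butter", "Ice Cream"}]

def sup(s):
    # Support of an itemset: fraction of transactions containing it
    return sum(s <= t for t in transactions) / len(transactions)

for X, Y in [({"Milk"}, {"Bread"}), ({"Bread"}, {"Milk"}),
             ({"Milk"}, {"Butter"}), ({"Butter"}, {"Milk"}),
             ({"Bread"}, {"Butter"}), ({"Butter"}, {"Bread"})]:
    conf = sup(X | Y) / sup(X)       # Confidence(X -> Y)
    lift = conf / sup(Y)             # Lift(X -> Y)
    print(f"{X} -> {Y}: confidence={conf:.4f}, lift={lift:.4f}")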
Recommendation Engines
A recommendation engine is a system designed to suggest items or content to users based on their preferences, behaviours, or past interactions. These engines are used across a variety of platforms and industries, including e-commerce, media streaming, and social networks. The core goal of a recommendation engine is to personalize the user experience by providing relevant suggestions that enhance user engagement and satisfaction.
These systems, also known as recommenders, give customers recommendations based on their behaviour patterns and on similarities to people with shared preferences. They use statistical modelling, machine learning, and behavioural and predictive analytics algorithms to personalize the web experience.
E-commerce companies, social media platforms, and content-based websites frequently use recommendation engines to generate product recommendations and relevant content matching the characteristics of a particular web visitor. They are also used to suggest products that complement what a shopper has ordered. Search engines are a popular type of recommender, using a searcher's query and personal data, such as their location and browsing history, to generate relevant results.
Challenges:
o Cold start for new users/items.
o Sparsity problem (few interactions).
o Scalability for large datasets.
Matrix Factorization, a widely used collaborative-filtering technique, breaks the large user-item matrix into two smaller matrices:
o One represents user preferences.
o One represents item features.
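A minimal NumPy sketch of this factorization, trained by stochastic gradient descent on the observed entries only (the ratings, learning rate, and latent dimension here are illustrative assumptions):

import numpy as np

# Illustrative user-item rating matrix (0 = unrated)
R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [0.0, 2.0, 5.0]])
n_users, n_items = R.shape
k = 2  # number of latent features

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, k))  # user-preference factors
Q = rng.normal(scale=0.1, size=(n_items, k))  # item-feature factors

lr, reg = 0.01, 0.02
for _ in range(2000):
    for u, i in zip(*R.nonzero()):            # update only on observed ratings
        err = R[u, i] - P[u] @ Q[i]
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * P[u] - reg * Q[i])

print(np.round(P @ Q.T, 2))  # predicted ratings, including the missing cells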
Example of CBF (Content-Based Filtering):
Recommend items similar to an item the user already bought, based on item-item similarity computed from past transaction patterns.
Example:
Consider the following dataset with 5 transactions:
Transaction ID | Items Purchased
T1 | {Milk, Bread, Butter}
T2 | {Milk, Bread}
T3 | {Bread, Butter}
T4 | {Milk, Bread, Butter}
T5 | {Milk, Bread, Butter, Ice Cream}
Content-Based Recommendations:
If a customer buys Milk, recommend:
o Bread (Similarity 0.894)
o Butter (Similarity 0.75)
Avoid recommending Ice Cream (low similarity 0.5).
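The similarity values quoted above are consistent with item-item cosine similarity computed over the five transactions; a short Python verification:

import numpy as np

# Item vectors over transactions T1..T5 (1 = item appears in the transaction)
vectors = {
    "Milk":      np.array([1, 1, 0, 1, 1]),
    "Bread":     np.array([1, 1, 1, 1, 1]),
    "Butter":    np.array([1, 0, 1, 1, 1]),
    "Ice Cream": np.array([0, 0, 0, 0, 1]),
}

def cosine(a, b):
    # Cosine of the angle between two item vectors
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

milk = vectors["Milk"]
for name in ("Bread", "Butter", "Ice Cream"):
    print(name, round(cosine(milk, vectors[name]), 3))
# Bread 0.894, Butter 0.75, Ice Cream 0.5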
Evaluating Recommendation Engines
o Coverage: the fraction of all available items that the system ever recommends.
Offline vs. Online Evaluation:
o Offline: Testing on historical data (train-test split).
o Online: A/B testing on live users.
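A minimal sketch of the offline approach: hold out some of each user's interactions, recommend from the rest, and measure precision@k against the held-out items (all names and data below are hypothetical):

# Hypothetical held-out test data: items each user actually interacted with
test_items = {"U1": {"Butter", "Jam"}, "U2": {"Milk"}}
# Top-2 recommendations produced by some model on the training split
recommended = {"U1": ["Butter", "Bread"], "U2": ["Milk", "Butter"]}

def precision_at_k(recs, relevant, k=2):
    # Fraction of the top-k recommendations the user actually interacted with
    hits = sum(item in relevant for item in recs[:k])
    return hits / k

for user in test_items:
    print(user, precision_at_k(recommended[user], test_items[user]))
# U1 0.5, U2 0.5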
Question Bank
1. Define Association Rules. Give an example using items from a shopping basket.
2. What are the three main parameters used to evaluate Association Rules?
3. A rule says: {Milk} ⇒ {Bread}
Given: Support(Milk) = 60% , Support(Bread) = 70% , Support(Milk ∩ Bread) = 50%
Calculate the Confidence of the rule.
4. If the Confidence of {Milk} ⇒ {Bread} is 83%, and the Support(Bread) is 70%,
Calculate the Lift of the rule.
5. Explain the meaning of Support, Confidence, and Lift with an example in e-commerce.
6. What is a Recommendation Engine? List three industries where it's used.
7. Differentiate between Content-Based Filtering and Collaborative Filtering in tabular form
(write at least 6 points).
8. In Recommendation Engines, what does it mean when we say “personalization”?
9. Numerical:
Suppose a user rated the following movies:
Action Movie A: 5 stars
Action Movie B: 4.5 stars
Sci-fi Movie C: 1 star
According to Content-Based Filtering, should we recommend another action movie or a sci-fi
movie? Why?
10. Explain how Collaborative Filtering uses "users who are similar" to make
recommendations.
11. Given the following purchase data:
User | Milk | Bread | Butter
U1 | 1 | 1 | 0
U2 | 1 | 0 | 1
U3 | 0 | 1 | 1
Which product would Collaborative Filtering recommend to U1?
12. Describe user-based Collaborative Filtering and item-based Collaborative Filtering.
13. Given the following user similarity scores:
User A and User B: 0.9
User A and User C: 0.4
If User B likes Item X, should we recommend Item X to User A? Explain.
14. A user watches 5 Sci-Fi movies and rates them highly but gives low ratings to Drama
movies. According to Content-Based Filtering, what genre should be recommended next?
15. If the cosine similarity between a user profile and Movie A is 0.85 and with Movie B is
0.65, which movie should be recommended? Why?