MODULE - 4 Advance AIML Part 1
Datasets:
Two publicly available datasets are used to build the recommendations:
1. groceries.csv: This dataset contains transactions of a grocery store and can be downloaded from
http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml13/groceries.csv/.
2. MovieLens: This dataset contains 20000263 ratings and 465564 tag applications across 27278 movies. As per
the source of the data, the ratings were created by 138493 users between January 09, 1995 and March 31, 2015. The
dataset was generated on October 17, 2016. Users were selected at random for inclusion; all selected users had
rated at least 20 movies. The dataset can be downloaded from the link https://grouplens.org/datasets/movielens/.
I. ASSOCIATION RULES:
Association rule mining finds combinations of items that frequently occur together in orders or baskets. The items
that frequently occur together are called itemsets. Itemsets help discover relationships between items that people
buy together, which can be used as a basis for strategies such as combining products into a combo offer or placing
products next to each other on retail shelves to attract customer attention.
An application of association rule mining is in Market Basket Analysis (MBA). MBA is a technique used mostly
by retailers to find associations between items purchased by customers.
To illustrate the association rule mining concept, let us consider a set of baskets and the items in those baskets
purchased by customers
Items purchased in different baskets are:
1. Basket 1: egg, beer, sugar, bread, diaper
2. Basket 2: egg, beer, cereal, bread, diaper
3. Basket 3: milk, beer, bread
4. Basket 4: cereal, diaper, bread
The primary objective of a recommender system is to predict items that a customer may purchase in the future
based on his/her purchases so far. If a customer buys beer in the future, can we predict what he/she is most likely to
buy along with it? To predict this, we need to find out which items have shown a strong association with beer
in previously purchased baskets.
Association rule mining considers all possible combinations of items in the previous baskets and computes
measures such as support, confidence, and lift to identify rules with stronger associations. One of the challenges
in association rule mining is the number of item combinations that need to be considered: as the number of
unique items sold by the seller increases, the number of associations can increase exponentially.
One solution to this problem is to eliminate items that cannot possibly be part of any frequent itemset. One such
algorithm is the Apriori algorithm. The rules generated are represented as
{diaper} → {beer}
which means that customers who purchased diapers also purchased beer in the same basket. {diaper, beer}
together is called an itemset. {diaper} is called the antecedent and {beer} is called the consequent.
1. METRICS:
Concepts such as support, confidence, and lift are used to generate association rules.
A) SUPPORT:
Support indicates the frequency with which items appear together in baskets, with respect to all the baskets being
considered.
For example, the support for (beer, diaper) will be 2/4, that is, 50% as it appears together in 2 baskets out of 4
baskets.
Assume that X and Y are items being considered. Let
1. N be the total number of baskets.
2. 𝑁𝑋𝑌 represent the number of baskets in which X and Y appear together.
3. 𝑁𝑋 represent the number of baskets in which X appears.
4. 𝑁𝑌 represent the number of baskets in which Y appears.
Then the support between X and Y, Support(X, Y), is given by
Support(X, Y) = 𝑁𝑋𝑌 / N
To filter out stronger associations, we can set a minimum support (for example, a minimum support of 0.01),
which means the itemset must be present in at least 1% of baskets. The Apriori algorithm uses the minimum
support criterion to reduce the number of possible itemset combinations, which in turn reduces the computational
requirements.
B) CONFIDENCE:
Confidence measures the proportion of the transactions containing X that also contain Y. X is called the
antecedent and Y is called the consequent. Confidence can be calculated using the following formula:
Confidence(X → Y) = Support(X, Y) / Support(X) = 𝑁𝑋𝑌 / 𝑁𝑋
C) LIFT:
Lift measures how much more often X and Y occur together than expected if they were statistically independent.
It can be calculated using the following formula:
Lift(X → Y) = Support(X, Y) / (Support(X) × Support(Y)) = Confidence(X → Y) / Support(Y)
A lift value greater than 1 indicates that X and Y appear together more often than expected by chance, that is, the
association is stronger than random co-occurrence.
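To make the three metrics concrete, here is a minimal sketch (not from the original text) that computes them for
the four example baskets using plain Python:

# The four example baskets listed earlier.
baskets = [
    {"egg", "beer", "sugar", "bread", "diaper"},
    {"egg", "beer", "cereal", "bread", "diaper"},
    {"milk", "beer", "bread"},
    {"cereal", "diaper", "bread"},
]

def support(items):
    """Fraction of baskets containing all the given items."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

s_xy = support({"diaper", "beer"})                       # 2/4 = 0.50
conf = s_xy / support({"diaper"})                        # 0.50 / 0.75 = 0.67
lift = s_xy / (support({"diaper"}) * support({"beer"}))  # 0.50 / 0.5625 = 0.89

print(s_xy, conf, lift)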
To apply these metrics to the groceries dataset, the transactions are first encoded as a transaction matrix with one
row per transaction and one column per item. Such transactional matrices are likely to be sparse, since each
customer buys very few items in comparison to the total number of items sold by the seller. The following code
can be used for finding the size (shape or dimension) of the matrix:
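A possible way to build this matrix, assuming pandas and the mlxtend library (the original code is not shown):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

# Each line of groceries.csv is one transaction: a comma-separated list of items.
with open("groceries.csv") as f:
    transactions = [line.strip().split(",") for line in f]

# One-hot encode the transactions into a binary transaction matrix.
encoder = TransactionEncoder()
onehot = encoder.fit(transactions).transform(transactions)
basket_df = pd.DataFrame(onehot, columns=encoder.columns_)

print(basket_df.shape)   # (9835, 171)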
The sparse matrix has a dimension of 9835 × 171. So, a total of 9835 transactions and 171 items are available.
2. GENERATING ASSOCIATION RULES:
The Apriori algorithm is used to generate itemsets. The total number of itemsets depends on the number of items
that exist across all transactions. The following code can be used to obtain the number of items in the data and to
print 10 randomly sampled itemsets with their corresponding support:
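A minimal sketch using mlxtend's apriori function (an assumption; the original code is not shown), continuing
from the basket_df built above:

from mlxtend.frequent_patterns import apriori

# Number of unique items across all transactions.
print(len(basket_df.columns))   # 171

# Generate frequent itemsets with a minimum support of 1%.
itemsets = apriori(basket_df, min_support=0.01, use_colnames=True)

# Print 10 randomly sampled itemsets and their corresponding support.
print(itemsets.sample(10))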
II. COLLABORATIVE FILTERING:
Collaborative filtering recommends items to a user based on the preferences of other users with similar tastes.
For example, the figure shows that Rahul's preferences are similar to Purvi's rather than to Gaurav's. So the other
book, Into the Wild, which Rahul has bought and rated high, can now be recommended to Purvi.
Collaborative filtering comes in two variations:
1. User-Based Similarity: Finds K similar users based on common items they have bought.
2. Item-Based Similarity: Finds K similar items based on common users who have bought those items.
The total dimension of the similarity matrix is available in the shape attribute of the user_sim_df DataFrame.
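The original code is not shown; one possible construction of user_sim_df, assuming the MovieLens ratings.csv
(or a manageable subset of it) and scikit-learn's cosine similarity:

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# MovieLens ratings file with columns userId, movieId, rating, timestamp.
ratings = pd.read_csv("ratings.csv")

# Pivot into a user x movie rating matrix; unrated movies become 0.
rating_matrix = ratings.pivot(index="userId", columns="movieId", values="rating").fillna(0)

# Pairwise cosine similarity between all users.
user_sim = cosine_similarity(rating_matrix)
user_sim_df = pd.DataFrame(user_sim, index=rating_matrix.index, columns=rating_matrix.index)

print(user_sim_df.shape)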
iii. FILTERING SIMILAR USERS:
To find the most similar users, the maximum value in each column of the similarity matrix can be filtered. For
example, the user most similar to each of the first 5 users (userid 1 to 5) can be obtained using the following code:
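A sketch of this step, reusing the user_sim_df built above:

import numpy as np

# Self-similarity is always 1.0, so zero out the diagonal before
# searching for each user's nearest neighbour.
np.fill_diagonal(user_sim_df.values, 0)

# For the first 5 columns (userid 1 to 5), the row label of the
# maximum value gives the most similar user.
print(user_sim_df.iloc[:, 0:5].idxmax())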
To dig a little deeper into the similarity, let us print the similarity values between user 2 and users ranging from
331 to 340.
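For example (a sketch, again reusing user_sim_df):

# Similarity values between user 2 and users 331 to 340.
print(user_sim_df.loc[2, 331:340])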
To find out which movies user 2 and user 338 have watched in common, and how they have rated each one of
them, we will filter out movies that both have rated at least 4 to limit the number of movies to print.
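A possible implementation with a pandas merge (the original code is not shown), reusing the ratings DataFrame
from above:

# Ratings of user 2 and user 338, joined on the movies both have rated.
common = ratings[ratings.userId == 2].merge(
    ratings[ratings.userId == 338],
    on="movieId",
    suffixes=("_user2", "_user338"),
)

# Keep only the movies that both users have rated at least 4.
print(common[(common.rating_user2 >= 4.0) & (common.rating_user338 >= 4.0)])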
vi. CHALLENGES WITH USER-BASED SIMILARITY:
• Finding user similarity does not work for new users: we need to wait until the new user has bought and rated a
few items. Only then can users with similar preferences be found and recommendations be made based on them.
This is called the cold start problem in recommender systems.
• The cold start problem can be overcome by using item-based similarity.
The surprise.Dataset class is used to load the datasets and has a method load_from_df to convert a DataFrame
into a Dataset. The Reader class can be used to specify the range of the rating scale that is being used.
The surprise.model_selection module provides the cross_validate method, which splits the dataset into multiple
folds, runs the algorithm, and reports the accuracy measures. It takes the following parameters (a usage sketch
follows the list):
1. algo: The algorithm to evaluate.
2. data (Dataset): The dataset on which to evaluate the algorithm.
3. measures: The performance measures to compute. Allowed names are function names as defined in the
accuracy module. Default is ['rmse', 'mae'].
4. cv: The number of folds for the K-fold cross validation strategy.
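A minimal sketch (the exact DataFrame and the choice of algorithm are assumptions):

import pandas as pd
from surprise import Dataset, Reader, KNNBasic
from surprise.model_selection import cross_validate

# MovieLens ratings with columns userId, movieId, rating.
ratings = pd.read_csv("ratings.csv")

# Reader declares the rating scale used in the DataFrame.
reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(ratings[["userId", "movieId", "rating"]], reader)

# Evaluate a user-based KNN algorithm with 5-fold cross validation.
algo = KNNBasic(sim_options={"user_based": True})
cross_validate(algo, data, measures=["RMSE", "MAE"], cv=5, verbose=True)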
B) FINDING THE BEST MODEL:
The surprise.model_selection.search.GridSearchCV class evaluates an algorithm over all combinations of a
parameter grid using cross validation, and reports the best score and parameters. It takes the following
parameters (a usage sketch follows the list):
1. algo_class: The class of the algorithm to evaluate.
2. param_grid: A dictionary with parameter names as keys and lists of values to try.
3. measures: The performance measures to compute.
4. cv: The number of folds for K-fold cross validation.
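A sketch of a grid search over SVD hyperparameters (the parameter grid is an assumption), reusing the data
object built above:

from surprise import SVD
from surprise.model_selection import GridSearchCV

# Candidate values for the number of latent factors and training epochs.
param_grid = {"n_factors": [5, 10, 15], "n_epochs": [10, 20]}

gs = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], cv=3)
gs.fit(data)

# Best RMSE score and the parameter combination that achieved it.
print(gs.best_score["rmse"])
print(gs.best_params["rmse"])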
C) MAKING PREDICTIONS:
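Once a model has been selected, it can be trained on the full dataset and used to predict the rating a user would
give to an item. A minimal sketch, reusing the data object from above (the user and movie ids are arbitrary):

from surprise import SVD

# Train on the full dataset rather than on cross validation folds.
trainset = data.build_full_trainset()
algo = SVD(n_factors=5)
algo.fit(trainset)

# Predict the rating user 1 would give movie 100; est is the estimate.
pred = algo.predict(uid=1, iid=100)
print(pred.est)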
III. MATRIX FACTORIZATION:
Matrix factorization is a matrix decomposition technique. Matrix decomposition is an approach for reducing
a matrix into its constituent parts. Matrix factorization algorithms decompose the user-item matrix into the
product of two lower dimensional rectangular matrices.
The Users–Movies matrix contains the ratings of 3 users (U1, U2, U3) for 5 movies (M1 through M5). This
Users–Movies matrix is factorized into a (3, 3) Users–Factors matrix and a (3, 5) Factors–Movies matrix.
Multiplying the Users–Factors and Factors–Movies matrices will result in the original Users–Movies matrix.
The idea behind matrix factorization is that there are latent factors that determine why a user rates a movie,
and the way he/she rates. The factors could be the story or actors or any other specific attributes of the movies.
But we may never know what these factors actually represent. That is why they are called latent factors. A
matrix with size (n, m), where n is the number of users and m is the number of movies, can be factorized into
(n, k) and (k, m) matrices, where k is the number of factors.
The Users–Factors matrix represents that there are three factors and how each user has preferences towards
these factors. Factors–Movies matrix represents the attributes the movies possess.
One of the popular techniques for matrix factorization is Singular Value Decomposition (SVD). The Surprise
library provides an SVD algorithm, which takes the number of factors (n_factors) as a parameter. We will use 5
latent factors for our example.
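A sketch of this, reusing the data object built earlier:

from surprise import SVD
from surprise.model_selection import cross_validate

# SVD with 5 latent factors, evaluated with 5-fold cross validation.
svd = SVD(n_factors=5)
cross_validate(svd, data, measures=["RMSE", "MAE"], cv=5, verbose=True)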