MODULE - 4 Advanced AIML Part 1


Department of Artificial Intelligence and Machine Learning

Course Name: Advanced AI and ML


Course code: 21AI71
Module 4 – Recommender System
RECOMMENDER SYSTEMS:
Recommender systems are algorithms that recommend the most relevant items to users based on preferences predicted by those algorithms. They act on behavioral data, such as a customer's previous purchases, ratings, or reviews, to predict the customer's likelihood of buying a new product or service.
Recommender systems are very popular for recommending products such as movies, music, news, books, articles, and groceries, and act as a backbone for cross-selling across industries.
Example: Amazon's "Customers who buy this item also bought" and Netflix's "shows and movies you may want to watch".
There are three algorithms that are widely used for building recommendation systems:
1. Association Rules
2. Collaborative Filtering
3. Matrix Factorization

Datasets:
Two publicly available datasets are used to build recommendations:
1. groceries.csv: This dataset contains transactions of a grocery store and can be downloaded from
http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml13/groceries.csv/.
2. MovieLens: This dataset contains 20,000,263 ratings and 465,564 tag applications across 27,278 movies. As per the source of the data, these data were created by 138,493 users between January 09, 1995 and March 31, 2015, and the dataset itself was generated on October 17, 2016. Users were selected at random, and all selected users had rated at least 20 movies. The dataset can be downloaded from https://grouplens.org/datasets/movielens/.

I. ASSOCIATION RULES:
Association rule mining finds combinations of items that frequently occur together in orders or baskets. The items that frequently occur together are called itemsets. Itemsets help discover relationships between items that people buy together, which can be used as a basis for strategies such as combining products into a combo offer or placing products next to each other on retail shelves to attract customer attention.
An application of association rule mining is Market Basket Analysis (MBA). MBA is a technique used mostly by retailers to find associations between items purchased by customers.
To illustrate the association rule mining concept, let us consider a set of baskets and the items in those baskets purchased by customers.
Items purchased in different baskets are:
1. Basket 1: egg, beer, sugar, bread, diaper
2. Basket 2: egg, beer, cereal, bread, diaper
3. Basket 3: milk, beer, bread
4. Basket 4: cereal, diaper, bread
The primary objective of a recommender system is to predict items that a customer may purchase in the future based on his/her purchases so far. If a customer buys beer in the future, can we predict what he/she is most likely to buy along with it? To predict this, we need to find out which items have shown a strong association with beer in previously purchased baskets.
Association rule mining considers all possible combinations of items in the previous baskets and computes measures such as support, confidence, and lift to identify rules with stronger associations. One of the challenges in association rule mining is the number of item combinations that need to be considered: as the number of unique items sold by the seller increases, the number of associations can increase exponentially.
One solution to this problem is to eliminate items that possibly cannot be part of any itemset. The Apriori algorithm is one such algorithm widely used for generating association rules. The rules generated are represented as

{diaper} → {beer}

which means that customers who purchased diapers also purchased beer in the same basket. {diaper, beer} together is called the itemset, {diaper} is called the antecedent, and {beer} is called the consequent.

1. METRICS:
Concepts such as support, confidence, and lift are used to generate association rules.
A) SUPPORT:
Support indicates how frequently items appear together in baskets, relative to all the baskets being considered.
For example, the support for (beer, diaper) is 2/4, that is, 50%, as the pair appears together in 2 out of 4 baskets.
Assume that X and Y are items being considered. Let
1. N be the total number of baskets.
2. N_XY represent the number of baskets in which X and Y appear together.
3. N_X represent the number of baskets in which X appears.
4. N_Y represent the number of baskets in which Y appears.
Then the support between X and Y, Support(X, Y), is given by

Support(X, Y) = N_XY / N
To filter out stronger associations, we can set a minimum support (for example, a minimum support of 0.01). This means the itemset must be present in at least 1% of baskets. The Apriori algorithm uses the minimum support criterion to reduce the number of possible itemset combinations, which in turn reduces the computational requirements.
B) CONFIDENCE:
Confidence measures the proportion of the transactions containing X that also contain Y, where X is called the antecedent and Y the consequent. Confidence can be calculated using the following formula:

Confidence(X → Y) = Support(X, Y) / Support(X) = N_XY / N_X
C) LIFT:
Lift can be interpreted as the degree of association between two items. It is computed as

Lift(X, Y) = Support(X, Y) / (Support(X) × Support(Y))

• A lift value of 1 indicates that the items are independent (no association).
• A lift value of less than 1 implies that the products are substitutes (purchasing one product decreases the probability of purchasing the other).
• A lift value of greater than 1 indicates that the purchase of product X increases the probability of purchase of product Y.
• A lift value greater than 1 is a necessary condition for generating association rules.
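To make these measures concrete, here is the arithmetic for the rule {diaper} → {beer} over the four example baskets listed earlier:

Support(diaper, beer) = 2/4 = 0.50 (baskets 1 and 2)
Support(diaper) = 3/4 = 0.75 (baskets 1, 2, and 4)
Support(beer) = 3/4 = 0.75 (baskets 1, 2, and 3)
Confidence(diaper → beer) = 0.50 / 0.75 ≈ 0.67
Lift(diaper, beer) = 0.50 / (0.75 × 0.75) ≈ 0.89

In this tiny sample the lift is slightly below 1, so {diaper} → {beer} would not satisfy the lift > 1 criterion; on a realistically sized dataset, rules are filtered in exactly this way using the three measures.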

2. APPLYING ASSOCIATION RULES:


Association rules can be created using the transactions data available in the groceries.csv dataset. Each line in the
dataset is an order and contains a variable number of items. Each item in each order is separated by a comma in
the dataset.
A) LOADING THE DATASET:
Python’s open() method can be used to open the file and readlines() to read each line.

The steps in this code block are explained as follows:


1. The code opens the file groceries.csv.
2. Reads all the lines from the file.
3. Removes leading or trailing white spaces from each line.
4. Splits each line by a comma to extract items.
5. Stores the items of each line in a list.
To print the first five transactions, we can use the following code:
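A minimal sketch of the loading and printing steps described above, assuming groceries.csv is in the current working directory:

all_txns = []
with open('groceries.csv') as f:          # 1. open the file
    for line in f.readlines():            # 2. read all the lines
        line = line.strip()               # 3. remove leading/trailing whitespace
        items = line.split(',')           # 4. split the line by commas to extract items
        all_txns.append(items)            # 5. store the items of each line in a list

print(all_txns[0:5])                      # first five transactions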

B) ENCODING THE TRANSACTIONS:


The Python library mlxtend provides methods to generate association rules from a list of transactions, but these methods require the data to be fed in a specific format. The transactions and items need to be converted into a tabular or matrix format: each row represents a transaction and each column represents an item, so the matrix will be of size M × N, where M is the total number of transactions and N is the number of unique items available across all transactions.
The items available in each transaction are represented in one-hot-encoded format, that is, an item is encoded as 1 if it exists in the transaction and 0 otherwise. The mlxtend library has a feature pre-processing class called OnehotTransactions that takes all_txns as input and converts the transactions and items into one-hot-encoded format. The code for converting the transactional data using one-hot encoding is as follows:
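A sketch of this step. Note that in recent mlxtend releases the class is named TransactionEncoder (OnehotTransactions is the older name for the same preprocessor):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

encoder = TransactionEncoder()
one_hot_txns = encoder.fit(all_txns).transform(all_txns)
# each row is a transaction, each column an item; cells mark item presence
one_hot_df = pd.DataFrame(one_hot_txns, columns=encoder.columns_)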

The transactional matrices are likely to be sparse, since each customer buys very few items in comparison to the total number of items sold by the seller. The following code can be used for finding the size (shape or dimension) of the matrix:
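Assuming the one_hot_df DataFrame built above:

one_hot_df.shape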

The sparse matrix has a dimension of 9835 × 171. So, a total of 9835 transactions and 171 items are available.
C) GENERATING ASSOCIATION RULES:
The Apriori algorithm is used to generate itemsets. The total number of itemsets depends on the number of items that exist across all transactions. The number of items in the data can be obtained using the following code:
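For instance, counting the columns of the one-hot DataFrame:

len(one_hot_df.columns)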

Apriori algorithm takes the following parameters:


1. df: pandas − DataFrame in a one-hot-encoded format.
2. min_support: float − A float between 0 and 1 for minimum support of the itemsets returned. Default is 0.5.
3. use_colnames: boolean − If true, uses the DataFrame's column names in the returned DataFrame instead of column indices.
The following command can be used for generating frequent itemsets with a minimum support threshold:
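A sketch, where min_support=0.02 is an illustrative threshold (itemsets must appear in at least 2% of baskets):

from mlxtend.frequent_patterns import apriori

frequent_itemsets = apriori(one_hot_df, min_support=0.02, use_colnames=True)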

The following command can be used for printing 10 randomly sampled itemsets and their corresponding support.
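Assuming the frequent_itemsets DataFrame from the previous step:

frequent_itemsets.sample(10)    # 10 random itemsets with their support values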

The corresponding association rules can be generated using the association_rules() method, which takes the following parameters:


1. df: pandas − DataFrame of frequent itemsets with columns ['support', 'itemsets'].
2. metric − The metric used to evaluate whether a rule is of interest; here 'confidence' and 'lift' are used. Default is 'confidence'.
3. min_threshold − Minimal threshold for the evaluation metric to decide whether a candidate rule is of interest.
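A sketch that keeps only rules with lift above 1:

from mlxtend.frequent_patterns import association_rules

rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1.0)
rules.head(10)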

II. COLLABORATIVE FILTERING:


Collaborative filtering is based on the notion of similarity (or distance).
For example, if two users A and B have purchased the same products and have rated them similarly on a common rating scale, then A and B can be considered similar in their buying and preference behavior. Hence, if A buys a new product and rates it highly, then that product can be recommended to B. Alternatively, the products that A has already bought and rated highly can be recommended to B, if B has not already bought them.
A) HOW TO FIND SIMILARITY BETWEEN USERS?
Similarity or distance between users can be computed using the ratings the users have given to the common items they have purchased. If the users are similar, then similarity measures such as the Jaccard coefficient and cosine similarity will have values closer to 1, and distance measures such as Euclidean distance will have low values.
The picture depicts three users, Rahul, Purvi, and Gaurav, and the books they have bought and rated. The users are represented by their ratings in Euclidean space. Here the dimensions are represented by the two books Into Thin Air and Missoula, the two books commonly bought by Rahul, Purvi, and Gaurav.

The figure shows that Rahul's preferences are similar to Purvi's rather than to Gaurav's. So the other book, Into the Wild, which Rahul has bought and rated highly, can now be recommended to Purvi.
Collaborative filtering comes in two variations:
1. User-Based Similarity: Finds K similar users based on common items they have bought.
2. Item-Based Similarity: Finds K similar items based on common users who have bought those items.

B) USER BASED SIMILARITY:


The MovieLens dataset (see https://grouplens.org/datasets/movielens/) is used for finding similar users based on the common movies the users have watched and how they have rated those movies. The file ratings.csv in the dataset contains the ratings given by users; each line in this file represents one rating given by a user to a movie, on a scale of 1 to 5.
The dataset has the following features:
1. userId
2. movieId
3. rating
4. timestamp
i. LOADING THE DATASET:
The following code loads the file into a DataFrame using pandas' read_csv() method:
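A sketch, assuming ratings.csv from the MovieLens dataset is in the working directory:

import pandas as pd

rating_df = pd.read_csv('ratings.csv')    # columns: userId, movieId, rating, timestamp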
A pandas DataFrame has a pivot() method, which takes the following three parameters:
1. index: Column value to be used as DataFrame’s index. So, it will be userId column of rating_df.
2. columns: Column values to be used as DataFrame’s columns. So, it will be movieId column of rating_df.
3. values: Column to use for populating DataFrame’s values. So, it will be rating column of rating_df.
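Putting the three parameters together:

user_movies_df = rating_df.pivot(index='userId', columns='movieId', values='rating')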

To print the first 5 rows and first 15 columns.
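For example:

user_movies_df.iloc[0:5, 0:15]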

ii. CALCULATING COSINE SIMILARITY BETWEEN USERS:
A cosine similarity closer to 1 means the users are very similar, and closer to 0 means they are very dissimilar. The following code can be used for calculating the similarity:
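A sketch using scikit-learn's cosine_similarity; unrated cells are NaN in the pivot, so they are treated as 0 here:

from sklearn.metrics.pairwise import cosine_similarity

user_sim = cosine_similarity(user_movies_df.fillna(0))
user_sim_df = pd.DataFrame(user_sim,
                           index=user_movies_df.index,
                           columns=user_movies_df.index)
user_sim_df.shape    # (number of users, number of users)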

The total dimension of the matrix is available in the shape attribute of the user_sim_df DataFrame.
iii. FILTERING SIMILAR USERS:
To find the most similar users, the maximum value in each column can be taken. For example, the most similar user for each of the first 5 users (userId 1 to 5) can be obtained using the following code:
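A sketch; the diagonal is zeroed first, since every user's self-similarity is 1 and would otherwise dominate:

import numpy as np

np.fill_diagonal(user_sim_df.values, 0)   # remove self-similarity
user_sim_df.idxmax(axis=1).loc[1:5]       # most similar user for userIds 1 to 5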

To dive a little deeper into the similarity, let us print the similarity values between user 2 and users 331 to 340.
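For example:

user_sim_df.loc[2, 331:340]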

iv. LOADING THE MOVIES DATASET:


Movie information is contained in the file movies.csv. Each line of this file contains the movieId, the movie title, and the movie genres.
Movie titles are entered manually or imported from https://www.themoviedb.org/ and include the year of release in parentheses; errors and inconsistencies may exist in these titles. The movies can be loaded using the following code:

To print the first 5 movie details:
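A sketch, assuming movies.csv is in the working directory:

movies_df = pd.read_csv('movies.csv')     # columns: movieId, title, genres
movies_df.head(5)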

v. FINDING COMMON MOVIES OF SIMILAR USERS:


The following method takes the userIds of two users and returns the common movies they have watched, along with their ratings:
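A sketch of such a method; the name get_common_movies and the column suffixes are illustrative choices:

def get_common_movies(user1, user2):
    # keep only movies rated by both users by joining their ratings on movieId
    common = rating_df[rating_df.userId == user1].merge(
        rating_df[rating_df.userId == user2],
        on='movieId', suffixes=('_u1', '_u2'))
    # attach movie titles
    return common.merge(movies_df, on='movieId')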

To find the movies that user 2 and user 338 have watched in common and how each of them has rated those movies, we filter out the movies that both have rated at least 4, to limit the number of movies printed:
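Using the helper sketched above:

common = get_common_movies(2, 338)
common[(common.rating_u1 >= 4) & (common.rating_u2 >= 4)][['title', 'rating_u1', 'rating_u2']]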
vi. CHALLENGES WITH USER BASED SIMILARITY:
• Finding user similarity does not work for new users: we need to wait until a new user has bought a few items and rated them.
• Only then can users with similar preferences be found and recommendations be made based on that. This is called the cold start problem in recommender systems.
• It can be overcome by using item-based similarity.

C). ITEM BASED SIMILARITY:


Item-based similarity is based on the notion that if two items have been bought by many users and rated similarly, then there must be some inherent relationship between the two items.
If two movies, movie A and movie B, have been watched by several users and rated very similarly, then movie A and movie B are similar in taste. In other words, if a user watches movie A, then he or she is very likely to watch movie B, and vice versa.
i. CALCULATING COSINE SIMILARITY BETWEEN MOVIES:
In this approach, we need to create a pivot table where the rows represent movies, the columns represent users, and the cells contain the ratings the users have given to the movies. So the pivot() method is called with movieId as the index and userId as the columns, as described below:
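A sketch, reusing rating_df and cosine_similarity from the previous sections:

movie_user_df = rating_df.pivot(index='movieId', columns='userId', values='rating')
movie_sim = cosine_similarity(movie_user_df.fillna(0))
movie_sim_df = pd.DataFrame(movie_sim)    # rows/columns follow the pivot's row order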

ii. FINDING MOST SIMILAR MOVIES:


In the following code, we write a method get_similar_movies() which takes a movieid as a parameter and returns the similar movies based on cosine similarity. Note that the movieid and the index of the movie record in movies_df are not the same: we need to find the index corresponding to the movieid and use that index to look up similarities in movie_sim_df. The method takes another parameter, topN, to specify how many similar movies should be returned.
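A sketch of the method; it maps the movieid to a positional index via the pivot's row order, which is also the row order of movie_sim_df built above:

def get_similar_movies(movieid, topN=5):
    # positional index of this movieId in the pivot (and hence in movie_sim_df)
    movie_idx = movie_user_df.index.get_loc(movieid)
    sim_scores = movie_sim_df.iloc[movie_idx]
    # sort by similarity; position 0 is the movie itself (similarity 1.0), so skip it
    top_positions = sim_scores.sort_values(ascending=False).iloc[1:topN + 1].index
    top_movieids = movie_user_df.index[top_positions]
    return movies_df[movies_df.movieId.isin(top_movieids)]

get_similar_movies(1, topN=5)    # movies most similar to movieId 1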
4. USING SURPRISE LIBRARY:
For real-world implementations, we need a more extensive library which hides all the implementation details and
provides abstract Application Programming Interfaces (APIs) to build recommender systems. Surprise is a Python
library for accomplishing this.
Import the required modules and classes from the surprise library:
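For example:

from surprise import Dataset, Reader
from surprise.prediction_algorithms.knns import KNNBasic
from surprise.model_selection import cross_validate, GridSearchCV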

The surprise.Dataset module is used to load datasets and has a method load_from_df() to convert a DataFrame into a Dataset. The Reader class is used to specify the rating scale being used:
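A sketch, reusing the rating_df loaded earlier; only the userId, movieId, and rating columns are passed:

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(rating_df[['userId', 'movieId', 'rating']], reader)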

A). USER BASED SIMILARITY ALGORITHM:


The surprise.prediction_algorithms.knns.KNNBasic provides the collaborative filtering algorithm and takes the
following parameters:
1. k: The (max) number of neighbors to take into account for aggregation.
2. min_k: The minimum number of neighbors to take into account for aggregation, if there are not enough neighbors.
3. sim_options (dict): A dictionary of options for the similarity measure.
(a) name: Name of the similarity measure to use, e.g., cosine, msd, or pearson.
(b) user_based: True for user-based similarity and False for item-based similarity.
The following code implements movie recommendation based on Pearson correlation and the 20 nearest similar users:
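A sketch of this configuration:

sim_options = {'name': 'pearson',    # Pearson correlation as the similarity measure
               'user_based': True}   # user-based rather than item-based similarity
knn = KNNBasic(k=20, sim_options=sim_options)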

The surprise.model_selection module provides the cross_validate() method, which splits the dataset into multiple folds, runs the algorithm, and reports the accuracy measures. It takes the following parameters:
1. algo: The algorithm to evaluate.
2. data (Dataset): The dataset on which to evaluate the algorithm.
3. measures: The performance measures to compute. Allowed names are function names as defined in the accuracy module. Default is ['rmse', 'mae'].
4. cv: The number of folds for the K-fold cross-validation strategy.
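A sketch, where cv=5 is an illustrative fold count:

cross_validate(knn, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)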
B). FINDING THE BEST MODEL:
The surprise.model_selection.search.GridSearchCV class performs a cross-validated search over a grid of algorithm parameters and reports the combination that gives the best results.
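A sketch; the parameter grid values below are illustrative choices, not prescribed ones:

param_grid = {'k': [10, 20, 30],
              'sim_options': {'name': ['cosine', 'pearson'],
                              'user_based': [True]}}
gs = GridSearchCV(KNNBasic, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)
print(gs.best_score['rmse'])     # best RMSE found
print(gs.best_params['rmse'])    # parameter combination that achieved it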

C). MAKING PREDICTIONS:
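A sketch of training on the full dataset and predicting a single rating; the user and movie ids are illustrative:

trainset = data.build_full_trainset()
knn.fit(trainset)
pred = knn.predict(uid=2, iid=5)     # predicted rating of user 2 for movieId 5
print(pred.est)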
III. MATRIX FACTORIZATION:
Matrix factorization is a matrix decomposition technique. Matrix decomposition is an approach for reducing a matrix into its constituent parts. Matrix factorization algorithms decompose the user–item matrix into the product of two lower-dimensional rectangular matrices.

The Users–Movies matrix contains the ratings of 3 users (U1, U2, U3) for 5 movies (M1 through M5). This Users–Movies matrix is factorized into a (3, 3) Users–Factors matrix and a (3, 5) Factors–Movies matrix. Multiplying the Users–Factors and Factors–Movies matrices results in the original Users–Movies matrix.
The idea behind matrix factorization is that there are latent factors that determine why a user rates a movie, and the way he/she rates it. The factors could be the story, the actors, or any other specific attributes of the movies. But we may never know what these factors actually represent; that is why they are called latent factors. A matrix of size (n, m), where n is the number of users and m is the number of movies, can be factorized into (n, k) and (k, m) matrices, where k is the number of factors.
The Users–Factors matrix represents how each user has preferences towards the three factors, while the Factors–Movies matrix represents the attributes the movies possess.
One of the popular techniques for matrix factorization is Singular Value Decomposition (SVD). The Surprise library provides an SVD algorithm, which takes the number of factors (n_factors) as a parameter. We will use 5 latent factors for our example:
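A sketch using Surprise's SVD with 5 latent factors, reusing data and cross_validate from the earlier sections:

from surprise import SVD

svd = SVD(n_factors=5)
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5)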
