0% found this document useful (0 votes)
136 views40 pages

Web Mining Project Document Final

This document is a project report submitted for the degree of M Tech Integrated Computer Science Engineering. It describes developing a content-based movie recommender system using the TMDB5000 dataset. The system will analyze movie plot summaries and identify keywords to find similar movies. It will then make personalized recommendations for users. The performance will be evaluated based on metrics like accuracy and diversity. The outcome could improve movie recommendation services by helping users discover new movies they may enjoy.

Uploaded by

Prashanth Balaji
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
136 views40 pages

Web Mining Project Document Final

This document is a project report submitted for the degree of M Tech Integrated Computer Science Engineering. It describes developing a content-based movie recommender system using the TMDB5000 dataset. The system will analyze movie plot summaries and identify keywords to find similar movies. It will then make personalized recommendations for users. The performance will be evaluated based on metrics like accuracy and diversity. The outcome could improve movie recommendation services by helping users discover new movies they may enjoy.

Uploaded by

Prashanth Balaji
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 40

Content Based Movie Recommender System

Project report submitted


in partial fulfilment of the requirement for the degree of

M Tech Integrated Computer Science Engineering

By
Munisif Laya Sree Nihaal Ahmed.K
19MIC0026 19MIC0038

Kishore. R Prasanth B
19MIC0011 19MIC0054

Under
Web Mining and Social Network Analysis
CSI3033
J Component

Department of Computer Science Engineering

SCHOOL OF COMPUTER SCIENCE & ENGINEERING


VELLORE INSTITUTE OF TECHNOLOGY, VELLORE
APRIL 2023
TABLE OF CONTENTS

TABLE OF CONTENTS ........................................................................................................................................II

ABSTRACT ...................................................................................................................................................... IV

CHAPTER 1: INTRODUCTION ............................................................................................................................ 5

1.2 MOTIVATION .................................................................................................................................................... 6


1.3 BRIEF OVERVIEW OF PROBLEM ............................................................................................................................ 6
1.4 SCOPE OF THE PROJECT ...................................................................................................................................... 6
1.5 SIGNIFICANT CONTRIBUTION ............................................................................................................................... 7

CHAPTER 2: REVIEW OF LITERATURE ................................................................................................................ 7

2.1 REVIEWS ......................................................................................................................................................... 7


2.2 RESEARCH GAPS ............................................................................................................................................. 14

CHAPTER 3: PROBLEM DEFINITION ................................................................................................................ 15

3.1 PROBLEM STATEMENT ...................................................................................................................................... 15


3.2 PROBLEM DEFINITION ...................................................................................................................................... 15
3.3 OBJECTIVES .................................................................................................................................................... 15

CHAPTER 4: METHODOLOGY .......................................................................................................................... 16

4.1 INTRODUCTION ............................................................................................................................................... 16


4.2 APPROACH USED TO ADDRESS THE PROBLEM ........................................................................................................ 16
4.3 STEPS/PHASES INVOLVED .................................................................................................................................. 17
4.4 ALGORITHM DESCRIPTION................................................................................................................................. 17
4.5 TECHNIQUES USED FOR ANALYSIS ....................................................................................................................... 18

CHAPTER 5: DESIGN AND IMPLEMENTATION ................................................................................................. 19

5.1 INTRODUCTION ............................................................................................................................................... 19


5.2 DESIGN OF THE SYSTEM .................................................................................................................................... 19
5.3 IMPLEMENTATION............................................................................................................................................ 21
5.4 DETAILED DESCRIPTION OF THE CODE AND ALGORITHM .......................................................................................... 21

CHAPTER 6: TESTBED EXECUTION .................................................................................................................. 26

6.1 DATA SET DESCRIPTION .................................................................................................................................... 26


6.2 EXECUTION STEPS ........................................................................................................................................... 26
6.3 BENCHMARKING STRATEGY ............................................................................................................................... 27

II
CHAPTER 7: RESULTS AND DISCUSSION .......................................................................................................... 27

7.1 RESULT DESCRIPTION ....................................................................................................................................... 27


7.2 ANALYSIS OF RESULTS ....................................................................................................................................... 29
7.3 INTERPRETATION OF RESULTS ............................................................................................................................. 30
7.4 BENCHMARKING THE APPROACH ........................................................................................................................ 30
7.5 SIGNIFICANCE AND IMPLICATIONS FOR FUTURE RESEARCH. ....................................................................................... 31

CHAPTER 8: SCREENSHOTS ............................................................................................................................. 32

CONCLUSION ................................................................................................................................................. 34

REFERENCES ................................................................................................................................................... 35

ANNEXURE – 1 ............................................................................................................................................... 36

COMPLETE MANUSCRIPT ........................................................................................................................................ 36

III
ABSTRACT

This project aims to develop a content-based movie recommender system using the
TMDB5000 dataset. The dataset includes information on over 5,000 movies, such as plot
summaries, genre, budget, and revenue. The recommender system will use machine learning
algorithms to analyze the content of movies and generate recommendations based on
similarities between them. Specifically, the system will utilize natural language processing
techniques to analyze plot summaries and identify important keywords and themes. The system
will then use these keywords to find similar movies and make personalized recommendations
for users. The system's performance will be evaluated based on various metrics, such as
accuracy and diversity of recommendations. The outcome of this project could be useful in
providing personalized recommendations to movie enthusiasts, as well as improving the user
experience of movie streaming platforms.

IV
Chapter 1: INTRODUCTION

1.1 Introduction
The abundance of digital movie content available to users has made it increasingly difficult for
them to find movies that they are likely to enjoy. This has led to the development of movie
recommender systems that provide personalized recommendations to users based on their
preferences and past viewing habits. While collaborative filtering has been the most popular
approach for building such systems, it suffers from the cold-start problem, where new users
and items have no existing user-item interactions, making it difficult to generate accurate
recommendations.

To address this issue, content-based movie recommender systems have emerged as a promising
alternative that analyzes the content of movies and uses this information to generate
recommendations. The TMDB5000 dataset, which contains detailed information on over 5,000
movies, offers an excellent source for building such systems. The dataset includes various
features such as plot summaries, genre, budget, and revenue, providing valuable insights into
the content of each movie.

The objective of this project is to develop a content-based movie recommender system using
the TMDB5000 dataset. The system will use machine learning algorithms to analyze the
content of movies and identify similarities between them, and make personalized
recommendations to users based on their preferences. By utilizing natural language processing
techniques to analyze plot summaries and identify important keywords and themes, the system
will be able to generate accurate and diverse recommendations, improving the user experience
of movie streaming platforms and helping users to discover new movies that they are likely to
enjoy.

V
1.2 Motivation
The motivation behind this project is to develop a content-based movie recommender system
that can provide users with personalized recommendations based on their preferences, without
the need for existing user-item interactions. By using machine learning algorithms to analyze
the content of movies and identify similarities, the system will be able to make accurate and
diverse recommendations, improving the user experience of movie streaming platforms and
helping users to discover new movies that they are likely to enjoy.

1.3 Brief Overview of Problem


The problem that the content-based movie recommender system aims to solve is the challenge
of helping users find movies that match their interests and preferences in an increasingly
crowded digital movie landscape. With the vast amount of movie content available, users often
struggle to discover new movies that they are likely to enjoy. Collaborative filtering, the most
common approach for building movie recommender systems, has limitations, especially when
it comes to new users and items that have no existing user-item interactions.

Content-based movie recommender systems offer a promising alternative to overcome these


limitations by analyzing the content of movies and using this information to generate
personalized recommendations. However, developing an effective content-based movie
recommender system requires advanced machine learning techniques to analyze and
understand the content of movies. The TMDB5000 dataset offers a valuable source of data on
movie content, but building a system that can accurately identify similarities between movies
and provide personalized recommendations to users remains a challenging problem.

1.4 Scope of the Project


The scope of the project is to develop a content-based movie recommender system using the
TMDB5000 dataset. The system will use machine learning algorithms to analyze the content
of movies and identify similarities between them to provide personalized recommendations to
users.

The features of the movies that will be considered include plot summaries, genre, budget,
revenue, and other relevant information. Natural language processing techniques will be used
to analyze plot summaries and identify important keywords and themes.

The project will involve data cleaning, exploratory data analysis, and feature engineering to
prepare the dataset for machine learning. The system will be developed using Python and
relevant machine learning libraries such as scikit-learn and TensorFlow.

The evaluation of the system will be based on accuracy, diversity, and novelty of the
recommendations provided. The scope of the project does not include building a user interface
or integrating the system into any existing movie streaming platform.

6
1.5 Significant Contribution

Nihaal : Implemented recommend function and benchmarking.


Laya : Implemented collapse function and similarity calculations.
Kishore : Implemented fetch director function and convert3 function
Prashanth : Loading dataset and merging 2 datasets, implemented convert function

Chapter 2: REVIEW OF LITERATURE

2.1 Reviews

Movie Recommender System Using K-Nearest Neighbors Variants

Year: 2022
Proposed Methodology & Algorithm:
• Similarity between different users is calculated using user-item rating matrix. Four
types of similarities cosine, msd, pearson and pearson baseline are calculated for given
user-item rating matrix.
• After calculating similarities, variation of KNN-based Collaborative Filtering
recommendation algorithms is used with fivefold cross validation and movie
recommendation is generated.
• For generated results metrics like MSE, RMSE, MAE and FCP on different values of
number of nearest neighbors are compared.
• To the best of our knowledge no research has been conducted on Movie Recommender
System using K Nearest Neighbors variants like KNN-Basic, KNN-With Means, KNN-
With ZScore, KNN-Baseline on four different similarity measurement for
neighborhood calculation i.e., cosine, msd, pearson and pearson baseline similarities.

Problem/Future Work
limitation of our approach is that our approach works well for small dataset but for bigger
dataset memory implication and run time implication may occur. The proposed system can be
further improved using better distance measures like Mahalanobis distance which not only
measures distance between two point, but also distance from all points using variance of data
distribution.
Reference:
https://link.springer.com/article/10.1007/s40009-021-01051-0

7
Movie recommender system based on deep autoencoder network using Twitter data

Year: 2020

Proposed Methodology & Algorithm:


• The Proposed System is a hybrid recommender system that provides accurate and
effective recommendations using social data, users’ preference and interests, and movie
features. The Proposed system employs deep auto encoders networks, which reduces
the problem of data sparsity.

Problem/ Future Work:


As future work, it is suggested to consider below items: Increasing the use of social data will
provide a better view of users’ preferences and interests; so, utilizing the sentiment analysis
and natural language processing of tweets can increase the accuracy and effectiveness of the
generated recommendations. Investigating user relationships in social networks such as
following, and followers can improve the results. Using deep learning in analyzing the frame-
by-frame of movies or exploring movie posters to identify movie genres and subjects provides
us valuable feature sets that can be employed in the movie recommendation process to increase
the diversity and novelty (serendipity concept).

Reference:
https://link.springer.com/article/10.1007/s00521-020-05085-1

8
Analysing emotion-based movie recommender system using fuzzy emotion features

Year: 2020

Proposed Methodology & Algorithm:


• We propose a new item-based recommender system using both CF and CBF based
approaches to recommend items using emotions. With the advent of Web 2.0, users
express their interests and tastes through feedback, reviews, comments etc., in social
media.
• Many E-commerce sites store this feedback from multiple domains and try to suggest
items from various domains that users may likely to be interested in for better
recommendations.
• Emotions are intense feelings that are directed by something or someone. Reviews and
comments of an item act as a content from which emotions are extracted.
• These emotions act as links to generate item–item similarity. Then using item based
collaborative filtering recommendations are performed. We compared our approach
with other approaches that used item–item similarity using cosine based and conditional
probability-based similarity.

Problem/Future Work:
Algorithms can be formulated to extract emotions of items from different sources and propose
new approaches for recommendations.

Reference:
https://link.springer.com/article/10.1007/s41870-020-00431-x

9
Movie Recommendation Systems Using Actor-Based Matrix Computations in South
Korea

Year: 2022
Proposed Methodology & Algorithm:
• Data collection and preprocessing
• Calculating the rank correlation between a specific movie and its genre is based on the
combination of genres in the movie database and the correlation between a specific
actor and movie genre is computed by using the genre combination in the actor
database using Pearson’s correlation coefficient.
• Content-Based filtering is used.
• According to the findings of this study, a movie recommendation system that prioritises
actors as a crucial element makes suggestions for movies that are more suitable for the
consumers.
Problem/Future Work:
• The dataset they used for their analysis only contained South Korean movies and
actors, foreign films and actors who contributed to the South Korean film industry
were not considered in the analysis.
• They only considered actor-based genre correlations for recommending movies.
• Other movie recommendation approaches was not fully examined in this study, future
research would need to expand the comparison with other movie recommendation
techniques.
Reference:
https://eds.s.ebscohost.com.egateway.vit.ac.in/eds/detail/detail?vid=1&sid=2b0a110b-6fb6-
4d9b-
838f3e8f7a49c4e4%40redis&bdata=JnNpdGU9ZWRzLWxpdmU%3d#AN=edseee.9566476
&db=edseee

10
Movie Popularity and Target Audience Prediction
Using the Content-Based Recommender System

Year: 2022
Proposed Methodology & Algorithm:
• A content-based (CB) movie recommendation system (RS) has been developed in this
study using pre-release movie features as genre, cast, director, keywords, and movie
description.
• A CNN deep learning (DL) model is proposed to create a multiclass system for
predicting movie popularity.
• A system is proposed to predict the popularity of the upcoming movie among
different audience groups.
• Content-based filter used for finding a similar movie. Then the output of this movie is
taken as input for movie hit prediction. Based on the IMDb rating of that similar
movie, the movies are classified into six classes, super-duper hit (SDH), super hit
(SH), hit (H), above average (AA), average (A) and flop (F).
• Analysis of all the voting and rating information from each group of all the
recommended movies is done for target audience prediction.
Problem/Future Work:
In future work, multimedia data like audio and video data could be incorporated and also, the
poster of the upcoming movie could be used for better results. Sentiment analysis of the social
media data can be used. The audience group could be divided according to age and according
to the demography or profession of the audience. That will be much easier for targeting and
promoting an upcoming movie.

Reference:
https://eds.s.ebscohost.com.egateway.vit.ac.in/eds/viewarticle/render?data=dGJyMPPp44rp2
%2fdV0%2bnjisfk5Ie46bNPtquyS7Ok63nn5Kx94um%2bSa2nrkewprBLnqe4S7KwsU%2be
xss%2b8ujfhvHX4Yzn5eyB4rOrSa6psE6wrLVKtZzxgeKzsHqu169JrtirTuTYtEWy2rd6q66x
UOSjsX223LV5sq%2fhTOOqvorj2ueLpOLfhuWz44ak2uBV49rxfePbpIzf3btZzJzfhrvb4ovj
2%2bNGt62zULSvrz7k5fCF3%2bq7fvPi6ozj7vIA&vid=2&sid=2b0a110b-6fb6-4d9b-838f-
3e8f7a49c4e4@redis

11
VRConvMF: Visual Recurrent Convolutional Matrix
Factorization for Movie Recommendation

Year: 2022

Proposed Methodology & Algorithm:


• In this paper, we propose a probabilistic matrix factorization-based recommendation
scheme called visual recurrent convolutional matrix factorization (VRConvMF).
• The proposed VRConvMF scheme utilizes the textual and multi-level visual features
extracted from the descriptive texts and posters respectively to alleviate the sparsity
problem.
• We implement the proposed VRConvMF scheme and conduct extensive experiments
on three commonly used real world datasets.
• The experimental results illustrate that the proposed VRConvMF scheme outperforms
the existing schemes.

Problem/Future Work:
In the future work, we will crawl corresponding trailers information of movies directly and
transfer them into textual information. Since for visual features, more useful information will
appear in the trails. Besides, we also intend to consider the user attributes (such as gender, age
and occupations) to improve the rating prediction accuracy.

Reference:
https://eds.s.ebscohost.com.egateway.vit.ac.in/eds/viewarticle/render?data=dGJyMPPp44rp2
%2fdV0%2bnjisfk5Ie46bNPtquyS7Ok63nn5Kx94um%2bSa2nrkewprBLnqu4SLOwr06exss
%2b8ujfhvHX4Yzn5eyB4rOrSa6psE6wrLVKtZzxgeKzsHqu169JrtirTuTYtEWy2rd6q66xU
OSjsX223LV5sq%2fhTOOqvorj2ueLpOLfhuWz44ak2uBV49rxfePbpIzf3btZzJzfhrvb4ovj2
%2bNGt6uwSbCprj7k5fCF3%2bq7fvPi6ozj7vIA&vid=3&sid=2b0a110b-6fb6-4d9b-838f-
3e8f7a49c4e4@redis

12
HI2Rec: Exploring Knowledge in Heterogeneous
Information for Movie Recommendation

Year: 2019
Proposed Methodology & Algorithm:
• We first process the information in the recommender system dataset. All items
attributes and users attributes, as well as the interaction behavior attributes between
users and items are uniquely numbered, and these values are taken as relationships.
These are represented in the form of triplets.
• we extract the heterogeneous information about the movies to enrich the movie
knowledge graph.
• The knowledge representation learning approach is used to acquire the users’ and items’
vector representation and then calculate the similarity of the items to get the raw
recommendation list, for which we further leverage Item-based Collaborative Filtering
method to generate the precise top-N recommendation lists.
Problem/Future Work:
In the future, we will combine knowledge graph with other auxiliary information into
recommendation system effectively, such as social network, user’s comment information and
so forth. In addition, we will combine knowledge graph with reinforcement learning based
recommender system. We also will use our method in other areas, such as music, e-commerce,
news and so on.

Reference:
https://eds.s.ebscohost.com.egateway.vit.ac.in/eds/viewarticle/render?data=dGJyMPPp44rp2
%2fdV0%2bnjisfk5Ie46bNPtquyS7Ok63nn5Kx94um%2bSa2nrkewprBLnqm4SrSwsUmexs
s%2b8ujfhvHX4Yzn5eyB4rOrSa6psE6wrLVKtZzxgeKzsHqu169JrtirTuTYtEWy2rd6q66xU
OSjsX223LV5sq%2fhTOOqvorj2ueLpOLfhuWz44ak2uBV49rxferZpIzf3btZzJzfhrvb4ovj4u
FGsKOzSqymq1CzprRMtay1S6%2bppH7t6Ot58rPkjeri8n326gAA&vid=4&sid=2b0a110b-
6fb6-4d9b-838f-3e8f7a49c4e4@redis

13
Moreopt: A goal programming based movie recommender system

Year : 2018
Proposed Methodology & Algorithm:
• This system uses combination of content-based and collaborative filtering approaches.
• In the content-based approach of Moreopt, item-based Pearson Correlation determines
the similarity of movie pairs. Also, the second similarity computation between movies
is carried out with their feature weights such as cast, director or genre.
• Then, similarity prediction of movie pairs is calculated by the previous two similarity
computations and these values are used for missing data prediction which updates the
UMR matrix
• In the last step, user-based Pearson Correlation determines the most similar users
based on the updated UMR matrix and recommends top-N movie lists to the given
user.
• In this paper, we use a linear programming-based method to
• compute feature weights with goal programming. CPLEX2 solver is used via
Optimisation Programming Language (OPL)3 for solving the mathematical model
which is represented in the following subsection.

Problem/Future Work:
In future work, different types of user-based and item-based
collaborative filtering approaches will be compared in the prediction phase of Moreopt. Hence,
the successful approaches can be selected for our model.

Reference:
https://www.sciencedirect.com/science/article/abs/pii/S1877750317314540

2.2 Research Gaps


One problem with previous movie recommendation systems is that they did not use better
distance measures like Mahalanobis distance, which takes into account the variance of the data
distribution when measuring the distance between two points. Using this distance measure can
lead to better recommendations. Another issue is that previous systems did not utilize deep
learning to analyze frame-by-frame movies or explore movie posters to identify genres and
subjects. These features can be used to increase diversity and novelty in recommendations.

Emotions can be extracted from various sources and used in algorithms to propose new
approaches for recommendations. Some previous systems only considered South Korean
movies and actors, ignoring foreign films and actors who contributed to the South Korean film
industry.

Multimedia data like audio and video, as well as sentiment analysis of social media, can be
incorporated to improve movie recommendations. Another potential improvement is to crawl
trailer information and transfer it into textual information for better recommendations.
14
Chapter 3: PROBLEM DEFINITION

3.1 Problem Statement


The problem addressed in this project is the lack of personalized movie recommendations that
consider individual user preferences, resulting in low user satisfaction and engagement. The
goal is to develop a content-based movie recommender system that suggests movies based on
the user's previously watched movies and their features such as genre, actors, director, and plot
summary, to improve the accuracy and relevance of recommendations.

3.2 Problem Definition

A content-based movie recommendation system is a system that suggests movies to a user


based on the characteristics of the movies that the user has previously liked. This is done by
analyzing the content of the movies (such as their genre, director, actors, etc.) and
recommending movies that are similar in terms of content. The problem definition for a
content-based movie recommendation system is to create a model that can accurately predict
which movies a user will like based on their past movie preferences. This typically involves
processing a large amount of data on movies and users and training a machine learning model
to make recommendations. The goal is to make recommendations that are personalized to the
user and that they will find relevant and enjoyable. The collaborative filtering method can be
ineffective when users first come across movie suggestion services or have certain movie
interests, such as preferences for actors or directors. This motivated us to makes use of content-
based filtering.

3.3 Objectives
The objectives of a content-based movie recommendation system in a machine learning project
can include:
➢ The objective of the project "Content-Based Movie Recommender System using TMDB5000
Dataset" is to build a recommendation system that suggests movies to users based on the
similarities between the content of the movies they have previously watched and liked.

➢ Help users discover new movies that they might not have otherwise found.

➢ Evaluating the performance of the recommendation system using metrics such as precision,
recall, and F1-score.

15
Chapter 4: Methodology

4.1 Introduction
A content-based movie recommendation system using machine learning is a project that
involves using data on movies and users to train a model that can make personalized
recommendations to users.
4.2 Approach used to address the Problem
The life cycle of analytics used in a content-based movie recommender system project involves
a number of steps, including data collection, data preparation, data analysis, and model
building.

The first stage of the analytics life cycle for this project is data collection. In this case, the
project will use the TMDb 5000 dataset, which is a publicly available dataset containing
metadata for approximately 5,000 movies. The dataset includes information such as movie
titles, genres, cast and crew information, budget, and revenue. Data collection for this project
will involve downloading the dataset and storing it in a local database or data warehouse.

The next stage in the analytics life cycle is data preparation. This involves cleaning and
transforming the raw data into a format suitable for analysis. Data preparation techniques for
this project may include removing duplicate records, filling in missing values, and transforming
categorical variables into numerical representations. The data will also be split into a training
set and a test set to evaluate the performance of the recommender system.

The third stage is data analysis, where the goal is to identify patterns and relationships within
the data that can be used to make movie recommendations. In this project, data analysis
techniques may include exploratory data analysis to identify correlations between movie
attributes and user ratings. Feature selection and engineering techniques may also be used to
extract meaningful features from the dataset.

The final stage of the analytics life cycle for this project is model building. This involves
developing a machine learning algorithm that can use the features identified in the data analysis
stage to make movie recommendations. In this case, the project will use a content-based

approach, where the algorithm will recommend movies similar to those previously enjoyed by
the user based on movie attributes such as genres, cast, and crew.
The final model will be evaluated on the test set to measure its accuracy and effectiveness. Data
visualization techniques may also be used to visualize the performance of the model and to
communicate the results to stakeholders. The cycle may be repeated to improve the model
performance or to incorporate new data into the recommender system.

16
4.3 Steps/Phases involved
This typically involves several steps:
1.Data collection and pre-processing: This involves collecting data on movies, such as their
genre, director, actors, and other characteristics, as well as data on users, such as their past
movie ratings or watch history. The data must then be cleaned and pre-processed to prepare it
for use in the model.

2.Feature engineering: This involves selecting which movie characteristics to use as input
features for the model, and possibly creating new features by combining or manipulating
existing features.

3.Model selection and training: This involves selecting a machine learning model to use and
training it on the pre-processed data. This step may involve several iterations of model selection
and parameter tuning to find the best model for the task.

4.Model evaluation: This involves evaluating the performance of the model on a held-out test
set and comparing it to other models to determine its effectiveness in making recommendations.

5.Deployment: Once the model is sufficiently trained and evaluated, it can be deployed in a
production environment, where it can be used to make recommendations to real users.
This projects often involve a substantial amount of data handling, feature engineering, and
model tuning. However, with good data and an appropriate algorithm like AST, Cosine
Similarity, good performance can be achieved in terms of recommend recommendations that
are personalized to the user and that they will find relevant and enjoyable.

4.4 Algorithm Description


Create a count vectorizer object using a library such as scikit-learn. The count vectorizer will
convert the textual data (such as movie titles and genres) into a matrix of word counts.
Compute the cosine similarity between the count vectors of each pair of movies in the dataset.
This can be done using the cosine_similarity function from scikit-learn.
Compute the AST similarity between each pair of movies in the dataset. This can be done by
parsing the plot summaries of each movie and constructing the corresponding AST, and then
computing the similarity between the ASTs using a similarity measure such as the Jaccard
similarity or the edit distance.
Combine the cosine similarity and AST similarity scores for each pair of movies using a
weighted average or another appropriate combination method.

17
Use the resulting similarity scores to generate movie recommendations for a given user. This
can be done by finding the movies with the highest similarity scores to the movies the user has
already watched and enjoyed.

4.5 Techniques Used for Analysis


Content-based movie recommender systems typically use text processing techniques, such as
count vectorization, to convert textual data (such as movie titles and plot summaries) into
numerical representations that can be compared using similarity metrics such as cosine
similarity and AST similarity.

Count vectorization is a technique that is commonly used in content-based recommendation


systems to represent textual data as a bag of words, where each word is represented as a feature
and its frequency in the text is the corresponding value. This representation is used to construct
a numerical vector that can be compared with other vectors using similarity metrics like cosine
similarity.

Cosine similarity is a commonly used similarity metric in content-based recommendation


systems. It measures the similarity between two vectors in terms of the cosine of the angle
between them. In the context of movie recommendations, cosine similarity is often used to
compute the similarity between the count vectors of the movie titles or plot summaries.

AST (Abstract Syntax Tree) similarity is another technique that can be used in content-based
recommendation systems to measure the similarity between source code or natural language
text. In the context of movie recommendations, AST similarity can be used to compare the plot
summaries of movies by constructing their corresponding ASTs and computing the similarity
between them.

Overall, content-based movie recommender systems typically use a combination of text


processing techniques, such as count vectorization, and similarity metrics, such as cosine
similarity and AST similarity, to compute the similarity between movies based on their textual
data and generate recommendations for users.

18
Chapter 5: Design and Implementation

5.1 Introduction
The content-based movie recommender system project aims to create an efficient and
personalized way to recommend movies to users based on their preferences. The system will
utilize various algorithms such as cosine similarity, AST, and count vectorization to analyze
the textual data of the movies and generate similarity scores between them. The project will
use the TMDb 5000 dataset that contains comprehensive metadata about movies to preprocess
and create a suitable dataset for analysis. The resulting similarity scores will be used to generate
personalized movie recommendations for users. This project will be helpful for movie
enthusiasts to discover new movies that match their interests and preferences.

5.2 Design of the System


5.2.1 Behavioural Design

19
20
5.3 Implementation
A content-based movie recommendation system using machine learning is a project that
involves using data on movies and users to train a model that can make personalized
recommendations to users. This typically involves several steps:
➢ Data collection and pre-processing: This involves collecting data on movies, such as their
genre, director, actors, and other characteristics, as well as data on users, such as their past
movie ratings or watch history. The data must then be cleaned and pre-processed to prepare
it for use in the model.
➢ Feature engineering: This involves selecting which movie characteristics to use as input
features for the model, and possibly creating new features by combining or manipulating
existing features.
➢ Model selection and training: This involves selecting a machine learning model to use and
training it on the pre-processed data. This step may involve several iterations of model
selection and parameter tuning to find the best model for the task.
➢ Model evaluation: This involves evaluating the performance of the model on a held-out test
set and comparing it to other models to determine its effectiveness in making
recommendations.
➢ Deployment: Once the model is sufficiently trained and evaluated, it can be deployed in a
production environment, where it can be used to make recommendations to real users.

This projects often involve a substantial amount of data handling, feature engineering, and
model tuning. However, with good data and an appropriate algorithm like AST, Cosine
Similarity, good performance can be achieved in terms of recommend recommendations that
are personalized to the user and that they will find relevant and enjoyable.

5.4 Detailed Description of the Code and Algorithm

The above code snippet uses the os module in Python to walk through the directory path
/kaggle/input and retrieve a list of all the files contained in that directory and its subdirectories.
The os.walk() function generates the file names in a directory tree by walking through the
directory and subdirectories. It returns a tuple of three values: the current directory path, a list
of subdirectories within that path, and a list of filenames within that path.
The for loop then iterates over each filename in the list of filenames and prints the full path of
the file by joining the dirname and filename using the os.path.join() function.

21
This code is often used in data science projects to check and confirm the file paths of the input
data files needed for the analysis. By checking and confirming the file paths, the data analyst
can ensure that they are accessing the correct data and avoid potential errors in the data analysis
process.

The above code snippet is about loading the datasets and displaying the first 2 rows and
columns.

The above code defines a function called convert() that takes a single parameter text. This
function appears to be designed to convert a string containing a list of dictionaries into a list of
values from a specific key in each dictionary.

The function first initializes an empty list L. Then, it uses the ast.literal_eval() function to
convert the input string text into a Python list of dictionaries. The ast.literal_eval() function is
used to safely evaluate a string containing a Python literal or container, such as a list or
dictionary, without running any potentially harmful code.

Next, the function uses a for loop to iterate over each dictionary in the list. It appends the value
associated with the key 'name' in each dictionary to the list L. The 'name' key is assumed to be
present in each dictionary, and its value is expected to be a string.

Finally, the function returns the list L, which contains the values associated with the 'name' key
in each dictionary in the input string.

Overall, this function could be useful for cleaning and preprocessing data in a data science
project, particularly when working with text data in the form of lists of dictionaries. It can help

22
extract specific values from each dictionary in the list and convert them into a more usable
format, such as a list of strings in this case.

This code defines a function called convert3() that takes a string parameter text. The function
is designed to extract the first 3 values from a list of dictionaries that are stored in the input
string.

The function first initializes an empty list called L, and a variable called counter is set to 0.
Then, it uses the ast.literal_eval() function to convert the input string text into a Python list of
dictionaries.

Next, the function uses a for loop to iterate over each dictionary in the list. For each dictionary,
it checks whether the counter variable is less than 3. If so, it extracts the value associated with
the 'name' key in that dictionary and appends it to the L list.

The counter variable is then incremented by 1. This ensures that the function only extracts the
first 3 values from the list of dictionaries.

Finally, the function returns the list L, which contains the first 3 values associated with the
'name' key in the list of dictionaries in the input string.

Overall, this function is useful for extracting a small number of specific values from a
potentially larger list of dictionaries. It could be used, for example, to extract the top genres
associated with a movie in a movie recommendation system, where only the most relevant
genres are needed for the recommendation.

23
This code defines a function called fetch_director() that takes a string parameter text. The
function is designed to extract the names of the directors associated with a movie, given a list
of crew members in the input string. The function first initializes an empty list called L. Then,
it uses the ast.literal_eval() function to convert the input string text into a Python list of
dictionaries.

Next, the function uses a for loop to iterate over each dictionary in the list. For each dictionary,
it checks whether the value associated with the 'job' key is 'Director'. If so, it extracts the value
associated with the 'name' key in that dictionary and appends it to the L list. The function then
returns the list L, which contains the names of the directors associated with the movie.

Overall, this function is useful for extracting specific information from a potentially larger list
of crew members associated with a movie. It could be used, for example, to identify movies
directed by a particular director or to calculate director-specific metrics in a movie
recommendation system.

This code defines a function called collapse() that takes a list of strings L as a parameter. The
function is designed to remove all spaces from each string in the input list.
The function first initializes an empty list called L1. Then, it uses a for loop to iterate over each
string i in the input list L. For each string, the replace() function is called to replace any spaces
with an empty string, effectively removing the spaces. The resulting string is then appended to
the L1 list. Finally, the function returns the L1 list, which contains the same strings as the input
list but with all spaces removed.
Overall, this function is useful for cleaning up strings that may contain unnecessary spaces,
such as movie titles or genre names. It could be used, for example, to standardize the format of
strings in a movie recommendation system, making it easier to compare and match movie titles
or genre names.

24
This code defines a function called recommend() that takes a movie title as a parameter. The
function is designed to recommend five other movies that are most similar to the given movie
title, based on a pre-calculated similarity score matrix. The function first finds the index of the
row in the similarity matrix that corresponds to the given movie title. It does this by using the
Pandas index() method to find the index of the first row in the new DataFrame (which contains
the movie titles) that has a 'title' column value equal to the given movie title. It then extracts
the index value of that row using the [0] index notation.

Next, the function calculates the similarity scores between the given movie and all other movies
using the pre-calculated similarity score matrix similarity. It does this by first using the Python
enumerate() function to generate a list of tuples, where the first element of each tuple is the
index of a movie in the new DataFrame and the second element is the similarity score between
the given movie and that movie. It then uses the sorted() function to sort this list in descending
order based on the similarity score. The key argument of the sorted() function is set to a lambda
function that returns the second element of each tuple (i.e., the similarity score), which is used
for sorting.

Finally, the function uses a for loop to iterate over the top five elements of the sorted list of
similarity scores (excluding the first element, which corresponds to the given movie itself). For
each of these five movies, it extracts the corresponding movie title from the new DataFrame
using the iloc[] method and prints it to the console.

Overall, this function is useful for generating a list of recommended movies based on the
similarity between movie titles. It could be used, for example, as part of a content-based movie
recommendation system that suggests similar movies to a user based on their past viewing
history or preferences.

25
Chapter 6: Testbed Execution

6.1 Data Set Description


The TMDB 5000 dataset is a collection of metadata and user ratings for a set of 5,000 movies
from the Movie Database (TMDB), a popular online movie information platform. The dataset
is commonly used in machine learning and data analysis projects related to movie
recommendation, sentiment analysis, and other applications.

The dataset includes two main files: "movies_metadata.csv" and "ratings.csv". The
"movies_metadata.csv" file contains metadata information for each of the 5,000 movies,
including movie title, director, cast, genre, plot summary, and other details. The "ratings.csv"
file contains user ratings data for the movies, including user IDs, movie IDs, and ratings on a
scale of 1-10.

In addition to these two main files, the dataset also includes a set of files that provide additional
information about the movies, such as keywords associated with each movie, production
countries and languages, and credits for cast and crew members.

Overall, the TMDB 5000 dataset is a rich source of movie-related data that can be used to
develop and test machine learning algorithms and models related to movie recommendation,
movie sentiment analysis, and other related applications.

6.2 Execution Steps


we are using tmdb 5000 dataset from Kaggle. This consists of 2 csv files namely credits.csv
and movies.csv. These datsets consists of genre, actors, director and other relevant features of
movies.
1.Loading both the datasets (i.e movies.csv & credits.csv)
2.Merge the credits csv file based on title with movies.csv dataset
3.Pre-process the merged dataset as follows:
➢ for the column genres and keywords apply ast literal.eval to convert the string of list
into a list
➢ for the column cast apply counter function to select the first 3 dictionary
➢ for the column crew extract the crew based on the role as director
➢ for the column Overview convert it into a list by splitting it
4.After step 1,2 & 3 we combine the column named Genres, Keywords, cast, crew & Overview
as a data frame called Tags.
5.Now we Convert the Tags into count Vectors (i.e converting words into numerical format to
understand it by machine)
6. For the Vectors find the similarity between the vectors using cosine similarity scores to
recommend a movie.
7.Now the user has to give input to the model to get movie recommendations.

26
6.3 Benchmarking Strategy
Prepare your dataset: This involves collecting data on movies such as the title, genres, and
description.

Preprocess the data: This involves cleaning the data, removing stop words, stemming, and
vectorizing the data using count vectorization. You will then create a document-term matrix for
the dataset. Build your recommender system: This involves using cosine similarity to measure
the similarity between movies based on their description and genre, and using AST to measure
the similarity between movies based on the syntax of their descriptions.

Evaluate the performance of the system: To evaluate the performance of your system, you can
use a variety of metrics such as precision, recall, F1 score, and mean average precision. You
can also use techniques such as k-fold cross-validation and grid search to optimize your
system's performance.

Compare your system with other recommender systems: To benchmark your system, you can
compare its performance with other content-based movie recommender systems that use
similar techniques such as cosine similarity and count vectorization. You can also compare
your system with collaborative filtering-based recommender systems and hybrid recommender
systems.

Overall, benchmarking a recommender system is an iterative process that involves testing,


evaluation, and optimization. By continuously improving your system, you can ensure that it
provides accurate and relevant recommendations to users.

Chapter 7: Results and Discussion

7.1 Result Description


The precision, recall, and F1 score reported are very low, as they are all in the range of e-05.
This suggests that the content-based movie recommender system is not performing well in
terms of accurately recommending relevant movies. A precision of 4.161291664932795e-05
means that out of all the recommended movies, only a very small fraction are actually relevant
to the user. A recall of 3.7900322152738296e-05 means that out of all the relevant movies,
only a very small fraction are actually recommended by the system. Finally, an F1 score of
3.966994604887337e-05 is low.

27
28
7.2 Analysis of Results
The precision, recall, and F1 score values obtained from the evaluation of the content-based
movie recommendation system are very low. This indicates that the recommender system is
not performing well in terms of accuracy and relevance. The precision score indicates that out
of all the recommended movies, only a very small fraction are relevant to the user's interests.
The recall score indicates that out of all the relevant movies for the user, only a small fraction
are actually recommended.

The F1 score, which is the harmonic mean of precision and recall, is also very low, indicating
poor performance of the system in terms of both precision and recall.

Overall, these low values suggest that the current approach may not be the most effective for
building a movie recommendation system based on content similarity. Improvements could be
made by incorporating more advanced natural language processing techniques, using more
sophisticated algorithms for measuring similarity, and incorporating user feedback to improve
the relevance of recommendations.

29
7.3 Interpretation of Results
The results show that the content-based movie recommender system has a very low precision,
recall, and F1 score, indicating that the system's performance is not satisfactory. The precision
and recall scores are very low, which means that the system's recommendations are not relevant
to the user's preferences. The F1 score is also low, which indicates that the system's precision
and recall scores are not balanced.

The implications of these results for future research or practical applications are that the system
needs further improvement to increase its precision, recall, and F1 score. This can be achieved
by using more sophisticated techniques, such as deep learning or natural language processing,
to improve the system's ability to extract and analyze textual data from movie descriptions. It
is also important to collect more data and refine the features used for the recommendation, such
as adding user reviews and ratings to the model.

7.4 Benchmarking the approach


The approach used for benchmarking the content-based movie recommendation system with
the tmdb5000 dataset can be summarized as follows:

➢ Load and preprocess the dataset: The first step is to load the dataset, which includes
information about movies, such as title, genre, cast, director, description, etc. The
dataset needs to be preprocessed to extract relevant features and create a representation
that can be used for similarity calculations.

➢ Build the recommendation model: The next step is to build the recommendation model.
In this case, a content-based approach is used, where the similarity between movies is
computed based on their textual description. The cosine similarity metric is used to
calculate the similarity between the movies.

➢ Evaluate the model performance: To evaluate the performance of the model, a set of
standard evaluation metrics is used, such as accuracy, precision, recall, and F1 score. In
this case, the precision, recall, and F1 score of the system are evaluated by comparing
the recommended movies with the actual movies watched by the users.

➢ Visualize the results: Finally, the results are visualized using graphs and charts to
provide a clear understanding of the system's performance and to identify areas for
improvement.

Overall, the approach used for benchmarking the content-based movie recommendation system
involves a combination of data preprocessing, model building, performance evaluation, and
visualization to provide insights into the system's effectiveness and identify potential areas for
improvement.

30
7.5 Significance and implications for future research.

In practical applications, a low-performing recommendation system can result in poor user


experience and decreased engagement with the platform. Therefore, improving the system's
performance can lead to increased user satisfaction and retention, ultimately leading to higher
revenue for the platform.

31
Chapter 8: SCREENSHOTS

32
33
CONCLUSION

The main findings of the content-based movie recommendation system using the
TMDb5000 dataset are as follows:

The system successfully recommends movies based on the similarity of their descriptions,
genres, cast, keywords, and directors.
The precision, recall, and F1 score of the system are very low, indicating that it has a high
number of false positives and false negatives.
The precision-recall curve shows that the system has low precision and low recall across all
thresholds.
Based on these findings, the following conclusions can be drawn:

➢ The content-based movie recommendation system using only the TMDb5000 dataset
is not very accurate, and further improvements are needed to make it more effective.
➢ Additional features, such as user ratings, release date, runtime, budget, and revenue,
could be incorporated into the system to improve its accuracy and relevance to users.
➢ Collaborative filtering methods, which analyze the behavior and preferences of users,
could be used in conjunction with content-based methods to further enhance the
system's accuracy and personalization.
The following recommendations are made for future research and practical applications
of the content-based movie recommendation system:
➢ Use a larger and more diverse dataset that includes more recent movies and a wider
range of genres, languages, and cultures.
➢ Test and compare the performance of different feature extraction and selection
methods, such as TF-IDF, word embeddings, and latent factor models.
➢ Evaluate and optimize the hyperparameters of the cosine similarity metric and the
CountVectorizer.
➢ Implement and test the system in a real-world setting with actual users to assess its
usability, effectiveness, and user satisfaction.

34
REFERENCES

[1] Sonu Airen1, Jitendra Agrawal, ’Movie Recommender System Using KNearest Neighbors
Variants ’-Feb 2022, https://doi.org/10.1007/s40009- 021-01051-0

[2] Hossein Tahmasebi,Reza Ravanmehr1,Rezvan Mohamadrezaei1, ’Social movie


recommender system based on deep autoencoder network using Twitter data-June 2020’-
https://doi.org/10.1007/s00521-020-05085-1

[3] Mala Saraswat,Shampa Chakraverty,Atreya Kala,’Analyzing emotion based movie


recommender system using fuzzy emotion features’-Feb 2020- https://doi.org/10.1007/s41870-
020-00431-x

[4] Syjung Hwang,Eunil Park, ’Movie Recommendation Systems Using Actor-Based Matrix
Computations in South Korea’-)ctober 2022

[5] SANDIPAN SAHU 1 , RAGHVENDRA KUMAR1, MOHD SHAFI PATHAN2, JANA


SHAFI 3,YOGESH KUMAR4, AND MUHAMMAD FAZAL IJAZ5 - ’Movie Popularity and
Target Audience Prediction Using the Content-Based Recommender System’-Digital Object
Identifier 10.1109/ACCESS.2022.3168161

[6] Zhu Wang, Honglong Chen , Senior Member, IEEE, Zhe Li , Kai Lin , Nan Jiang ,and Feng
Xia , Senior Member, IEEE- ’VRConvMF: Visual Recurrent Convolutional Matrix
Factorization for Movie Recommendation’- June 2022

[7] MING HE , BO WANG , AND XIANGKUN DU - ’HI2Rec: Exploring Knowledge in


Heterogeneous Information for Movie Recommendation’- Mar 2019 - Digital Object Identifier
10.1109/ACCESS.2019.2902398

[8] Emrah Inana,, Fatih Tekbacak b, Cemalettin Ozturkc - ’Moreopt: A goal programming
based movie recommender system’- Aug 2018.https://doi.org/10.1016/j.jocs.2018.08.004

35
ANNEXURE – 1

Complete Manuscript
1

Content Based Movie Recommender System


*

Munisif Laya Sree, Nihaal Ahmed.K, Kishore.R, Prasanth.B


Integrated Mtech Computer Science in Collaboration with Virtusa , VIT Vellore

Abstract—In this paper, we propose a content-based movie and make recommendations that are more personalized and
recommendation system that utilizes machine learning techniques relevant to the individual user.
to predict users’ preferences based on their past viewing history
and movie attributes such as genre, director, and actors. Our
system employs a vector space model to represent movies and II. OBJECTIVES
users, and utilizes cosine similarity to compute similarity scores The objectives of this research project are as follows:
between movies and users. We also incorporate a weighting
scheme to account for the relative importance of different movie To develop a content-based movie recommendation system
attributes. Our system was evaluated on a dataset of real- world using machine learning techniques that can predict users’
movie ratings and our results indicate that it is able to generate preferences based on their past viewing history and movie
accurate and personalized movie recommendations. We also found attributes such as genre, director, and actors.
that incorporating additional data sources, such as movie plot To evaluate the effectiveness of using a vector space model
summaries and posters, can improve the performance of the
system. Overall, our proposed system demonstrates the to represent movies and users and cosine similarity to compute
effectiveness of using content-based approaches for movie recom- similarity scores between movies and users.
mendation and has the potential to enhance the user experience To investigate the impact of incorporating different movie
of movie streaming platforms. attributes in the recommendation process, and to determine the
Keywords: Accuracy metrics, Collaborative filtering, K relative importance of each attribute.
nearest neighbour, Machine learning, Recommender system, To evaluate the performance of the system on a dataset of
ast algorithm , Feature Extraction, Counter Vectorization real-world movie ratings and to compare its performance with
existing content-based recommendation systems.
To explore the potential of incorporating additional data
I. INTRODUCTION
sources, such as movie plot summaries and posters, to improve
A content-based movie recommendation system in machine the performance of the system.
learning is a method of suggesting movies to users based on To demonstrate the potential of the proposed system to
their past movie preferences. This is done by analysingthe enhance the user experience of movie streaming platformsand
characteristics of the movies that the user has liked in the to provide insights for future research in content-based
past and recommending similar movies. One way to im- recommendation systems.
plement a content-based recommendation system is by using
a technique called feature-based collaborative filtering. This III. SCOPE
approach involves representing each movie as a set of features,
such as genre, director, actors, and other characteristics. The The scope of this research project on a content-based movie
user’s past movie preferences are then used to create a profile recommendation system is as follows:
that represents their tastes and interests. The system then The system utilizes machine learning techniques to predict
recommends movies that are similar to the movies in theuser’s users’ preferences based on their past viewing history and
profile. Another way to implement a content-based movie attributes such as genre, director, and actors.
recommendation system is by using a technique called item- The system employs a vector space model to represent
based collaborative filtering. This approach involves finding movies and users, and utilizes cosine similarity to compute
the similarity between different movies based on their charac- similarity scores between movies and users.
teristics, and then recommending the most similar movies to the The system includes a weighting scheme to account for the
user. Both these techniques are based on the idea that users are relative importance of different movie attributes.
likely to like movies that are similar to the ones they have liked The system is evaluated on a dataset of real-world movie
in the past. This can also be improved by using deep learning ratings and the results indicate its ability to generate accurate
models to find more complex and abstract features of the and personalized movie recommendations.
movies. The system can be trained on a dataset of users and The system also explores the potential of incorporating
their movie preferences and can be fine-tuned by giving additional data sources, such as movie plot summaries and
feedback based on the users’ engagement with the system. This posters, to improve its performance.
can improve the performance of the system over time The system is limited to the content-based approach, other
approaches like collaborative filtering, Hybrid methods etc will
Identify applicable funding agency here. If none, delete this. not be considered in the scope of this research

36
The system will be restricted to the movie domain only and that are directed by something or someone. Reviews and
not extended to other types of content. comments of an item act as a content from which emotions are
The system will be evaluated only on the dataset available extracted. • These emotions act as links to generate item–item
at the time of the research and not on any other dataset or real- similarity. Then using item based collaborative filtering recom-
world deployment. mendations are performed. We compared our approach with
other approaches that used item–item similarity using cosine
based and conditional probability-based similarity
I. LITERATURE REVIEW
2) Future work: Algorithms can be formulated to extract
A. Movie Recommender System Using K-Nearest Neighbors emotions of items from different sources and propose new
Variants-2022 approaches for recommendations.
1) Methodology: Similarity between different users is cal-
culated using user-item rating matrix. Four types of similarities
cosine, msd, pearson and pearson baseline are calculated for D. Movie Recommendation Systems Using Actor-Based Ma-
given user-item rating matrix. • After calculating similarities, trix Computations in South Korea-2022
variation of KNN-based Collaborative Filtering recommen- 1) Methodology: Calculating the rank correlation between
dation algorithms is used with fivefold cross validation and a specific movie and its genre is based on the combination
movie recommendation is generated. • For generated results of genres in the movie database and the correlation between a
metrics like MSE, RMSE, MAE and FCP on different values of specific actor and movie genre is computed by using the genre
number of nearest neighbors are compared. • To the bestof combination in the actor database using Pearson’s correlation
our knowledge no research has been conducted on Movie coefficient. • Content-Based filtering is used. • According to the
Recommender System using K Nearest Neighbors variantslike findings of this study, a movie recommendation system that
KNNBasic, KNN-With Means, KNN-With ZScore, KNN- prioritises actors as a crucial element makes suggestions for
Baseline on four different similarity measurement for neigh- movies that are more suitable for the consumers.
borhood calculation i.e., cosine, msd, pearson and pearson 2) future work: The dataset they used for their analysis only
baseline similarities. contained South Korean movies and actors, foreign films and
2) Future Work: limitation of our approach is that our actors who contributed to the South Korean film industry were
approach works well for small dataset but for bigger dataset not considered in the analysis. • They only considered actor-
memory implication and run time implication may occur. The based genre correlations for recommending movies. • Other
proposed system can be further improved using better distance movie recommendation approaches was not fully examined
measures like Mahalanobis distance which not only measures in this study, future research would need to expand the
distance between two point, but also distance from all points comparison with other movie recommendation techniques.
using variance of data distribution.

B. Social movie recommender system based on deep autoen- E. Movie Popularity and Target Audience Prediction Using
coder network using Twitter data-2020 the Content-Based Recommender System-2022
1) Methodology: The Proposed System is a hybrid recom- 1) Methodology: • A content-based (CB) movie recommen-
mender system that provides accurate and effective recom- dation system (RS) has been developed in this study using pre-
mendations using social data, users’ preference and interests, release movie features as genre, cast, director, keywords, and
and movie features. The Proposed system employs deep auto movie description. • A CNN deep learning (DL) modelis
encoders networks, which reduces the problem of data sparsity. proposed to create a multiclass system for predicting movie
2) Future work: it is suggested to consider below items: popularity. • A system is proposed to predict the popularity
Increasing the use of social data will provide a better view of the upcoming movie among different audience groups. •
of users’ preferences and interests; so, utilizing the sen- timent Content-based filter used for finding a similar movie. Then the
analysis and natural language processing of tweets can output of this movie is taken as input for movie hit prediction.
increase the accuracy and effectiveness of the generated Based on the IMDb rating of that similar movie, the movies are
recommendations classified into six classes, super-duper hit (SDH), superhit
(SH), hit (H), above average (AA), average (A) and flop (F). •
C. Analyzing emotion-based movie recommender system using Analysis of all the voting and rating information from each
fuzzy emotion features-2020 group of all the recommended movies is done for target
audience prediction.
1) Methodology: propose a new item-based recommender
system using both CF and CBF based approaches to recom- 2) Future Work: Multimedia data like audio and video data
mend items using emotions. With the advent of Web 2.0, users could be incorporated and also, the poster of the upcoming
express their interests and tastes through feedback, reviews, movie could be used for better results. Sentiment analysis of the
comments etc., in social media. • Many E-commerce sites store social media data can be used. The audience group could be
this feedback from multiple domains and try to suggest items divided according to age and according to the demography or
from various domains that users may likely to be interested in profession of the audience. That will be much easier for
targeting and promoting an upcoming movie.
for better recommendations. • Emotions are intense feelings

37
A. VRConvMF: Visual Recurrent Convolutional Matrix Fac- 2) Future Work: Different types of user-based and item-
torization for Movie Recommendation-2022 based collaborative filtering approaches will be compared in
the prediction phase of Moreopt. Hence, the successful
1) Methodology: • In this paper, we propose a probabilis- tic approaches can be selected for our model.
matrix factorization-based recommendation scheme called
visual recurrent convolutional matrix factorization (VRCon- I. PROPOSED METHODOLOGY/ SOLUTION
vMF). • The proposed VRConvMF scheme utilizes the textual A content-based movie recommendation system using ma-
and multi-level visual features extracted from the descriptive chine learning is a project that involves using data on movies
texts and posters respectively to alleviate the sparsity problem. and users to train a model that can make personalized recom-
• We implement the proposed VRConvMF scheme and con- mendations to users. This typically involves several steps: 1.
duct extensive experiments on three commonly used real world Data collection and pre-processing: This involves collecting
datasets. • The experimental results illustrate that the proposed data on movies, such as their genre, director, actors, and other
VRConvMF scheme outperforms the existing schemes. characteristics, as well as data on users, such as theirpast
2) Future Work: Crawl corresponding trailers information movie ratings or watch history. The data must then be cleaned
of movies directly and transfer them into textual information. and pre-processed to prepare it for use in the model.
Since for visual features, more useful information will appear 2. Feature engineering: This involves selecting which movie
in the trails. Besides, we also intend to consider the user characteristics to use as input features for the model, and
attributes (such as gender, age and occupations) to improve possibly creating new features by combining or manipulating
the rating prediction accuracy. existing features. 3. Model selection and training: This in-
B. HI2Rec: Exploring Knowledge in Heterogeneous Informa- volves selecting a machine learning model to use and training
tion for Movie Recommendation-2019 it on the pre-processed data. This step may involve several
iterations of model selection and parameter tuning to find the
1) Methodology: We first process the information in the best model for the task. 4. Model evaluation: This involves
recommender system dataset. All items attributes and users evaluating the performance of the model on a held-out test set
attributes, as well as the interaction behavior attributes between and comparing it to other models to determine its effectiveness
users and items are uniquely numbered, and these values are in making recommendations. 5. Deployment: Once the model
taken as relationships. These are represented in the form of is sufficiently trained and evaluated, it can be deployed in
triplets. • we extract the heterogeneous information about the a production environment, where it can be used to make
movies to enrich the movie knowledge graph. • The knowl- recommendations to real users. This projects often involve a
edge representation learning approach is used to acquire the substantial amount of data handling, feature engineering, and
users’ and items’ vector representation and then calculate the model tuning. However, with good data and an appropriate
similarity of the items to get the raw recommendation list, for algorithm like AST, Cosine Similarity, good performance can
which we further leverage Item-based Collaborative Filtering be achieved in terms of recommend recommendations that are
method to generate the precise top-N recommendation lists. personalized to the user and that they will find relevant and
2) Future Work: Combine knowledge graph with other
enjoyable.
auxiliary information into recommendation system effectively,
such as social network, user’s comment information and so A. Count Vectorizer
forth. In addition, we will combine knowledge graph with
CountVectorizer is a great tool provided by the scikit-learn
reinforcement learning based recommender system. We also
library in Python. It is used to transform a given text into a
will use our method in other areas, such as music, e-commerce,
vector on the basis of the frequency (count) of each word that
news and so on.
occurs in the entire text. This is helpful when we have multiple
C. Moreopt: A goal programming based movie recommender such texts, and we wish to convert each word in each text into
system-2018 vectors (for using in further text analysis)
1) Methodology: In the content-based approach of More- CountVectorizer creates a matrix in which each unique word
opt, item-based Pearson Correlation determines the similar- is represented by a column of the matrix, and each text sample
ity of movie pairs. Also, the second similarity computation from the document is a row in the matrix. The value of each
between movies is carried out with their feature weights such cell is nothing but the count of the word in that particular text
as cast, director or genre. • Then, similarity predictionof sample.
1) Syntax: sklearn.feature extraction.text.CountV ectorizer
movie pairs is calculated by the previous two similarity
computations and these values are used for missing data B. Cosine Similarity Measures
prediction which updates the UMR matrix • In the last step,
user-based Pearson Correlation determines the most similar Cosine similarity is a metric used to measure the similarity
users based on the updated UMR matrix and recommends top- of two vectors. Specifically, it measures the similarity in the
N movie lists to the given user. • In this paper, we use a linear direction or orientation of the vectors ignoring differences in
programming-based method to compute feature weights with their magnitude or scale. Both vectors need to be part of the
goal programming. CPLEX2 solver is used via Optimisation same inner product space, meaning they must produce a scalar
Programming Language (OPL)3 for solving the mathematical through inner product multiplication. The similarity of two
model which is represented in the following subsection vectors is measured by the cosine of the angle between them.

38
1) Formula: We define cosine similarity mathematically as The system was able to take in a user’s preferred movie and
the dot product of the vectors divided by their magnitude. recommend other movies that were similar in terms of genre,
For example, if we have two vectors, A and B, the director, cast, and plot. The recommendation engine was able
similarity between them is calculated as: to accurately identify similar movies by using a combination
of cosine similarity and TF-IDF vectorization.
The Content-Based Movie Recommender System developed
in this project is an effective solution for recommending
movies to users based on their preferences. The system was
able to accurately identify movies that were similar to the
user’s preferred movie and provide relevant recommendations.
The similarity can take values between -1 and +1. Smaller One of the advantages of a content-based approach is that
angles between vectors produce larger cosine values, indicat- it can provide personalized recommendations to users, even
ing greater cosine similarity. For example: for niche or lesser-known movies. By analyzing the metadata
When two vectors have the same orientation, the angle of each movie, the system was able to identify similarities
between them is 0, and the cosine similarity is 1. Perpendicular and provide recommendations that were tailored to the user’s
vectors have a 90-degree angle between them and a cosine preferences.
similarity of 0. Opposite vectors have an angle of 180 degrees One limitation of the content-based approach is that it may
between them and a cosine similarity of -1. not take into account the user’s social context or current trends.
For example, if a user is looking for a movie to watch with
B. Recommend Movies Function friends, the system may not recommend a movie that is not
We define a function named recommend movies in which the popular or well-known. In this case, a collaborative filtering
function main purpose is to find the similarity and distance approach may be more appropriate.
between the keyword given by the user based on the given Overall, the Content-Based Movie Recommender System
keyword the Recommend Movie Function will recommend the developed in this project provides an effective solution for
user top 5 movies which is similar to the given keywords. recommending movies to users based on their preferences.
With further improvements and optimizations, the system
could be extended to other domains such as music, books, or
I. IMPLEMENTATION
TV shows. In addition, future work could focus on integrating
To implement a content-based movie recommender system user feedback to improve the accuracy and relevance of the
using the TMDB 5000 dataset, you can follow these steps: recommendations.
1) Load the Dataset: The first step is to load the dataset. You
can download the dataset from Kaggle or use a Python package
like pandas to load the CSV files.
2) Data Pre-Processing: After loading the dataset, we need
to preprocess the data by cleaning and transforming it to make
it ready for analysis. We can remove duplicates, missing values,
and irrelevant features.
3) Feature Extraction: Next, we need to extract relevant
features from the dataset that can be used to recommend similar
movies. Some of the features you can consider are genres,
keywords, cast, crew, and production companies.
4) Vectorization: Once we have extracted the features, you
need to convert them into a numerical format using
vectorization techniques like one-hot encoding or TF-IDF.
5) Similarity Calculation: After vectorization, you can cal-
culate the similarity between movies using techniques like III. CONCLUSION
cosine similarity or Jaccard similarity. In conclusion, the content-based movie recommender sys-
6) Recommendation: Finally, we can recommend movies tem achieved very low precision, recall, and F1 scores, in-
that are similar to a given movie based on the similarity scores dicating that the model’s performance needs to be improved.
calculated in step 5. You can recommend the top n most similar The low scores could be attributed to several factors, such as
movies to the given movie. limited data availability, insufficient feature engineering, and
inadequate model selection. Therefore, further research and
II. RESULTS AND DISCUSSIONS experimentation are required to improve the model’s accuracy
The Content-Based Movie Recommender System developed and relevance. Possible areas of improvement include expand-
for this project was able to successfully recommend movies ing the dataset to include more diverse movie genres and
based on the user’s preferences. The system was developed increasing the complexity of the feature extraction techniques
using the TMDB5000 dataset, which contains information to capture more subtle nuances of movie features. Addition-
about over 5,000 movies. ally, using more advanced machine learning algorithms and

39
ensembling techniques could lead to significant improvements
in the model’s performance. Overall, this project highlights
the challenges and opportunities of developing a content-based
movie recommender system and provides valuable insights for
future research and development in this area.

REFERENCES
[1 ] Sonu Airen1, Jitendra Agrawal, ’Movie Recommender System Using K -
Nearest Neighbors Variants ’-Feb 2022, https://doi.org/10.1007/s40009-
021-01051-0
[2 ] Hossein Tahmasebi,Reza Ravanmehr1,Rezvan Mohamadrezaei1, ’Social
movie recommender system based on deep autoencoder network using
Twitter data -June 2020’- https://doi.org/10.1007/s00521-020-05085-1
[3 ] Mala Saraswat,Shampa Chakraverty,Atreya Kala,’Analyzing emotion
based movie recommender system using fuzzy emotion features’-Feb
2020- https://doi.org/10.1007/s41870-020-00431-x
[4 ] Syjung Hwang,Eunil Park, ’Movie Recommendation Systems Usin g
Actor-Based Matrix Computations in South Korea’-)ctober 2022
[5 ] SANDIPAN SAHU 1 , RAGHVENDRA KUMAR1, MOHD SHAFI
PATHAN2, JANA SHAFI 3,YOGESH KUMAR4, AND MUHAMM AD
FAZAL IJAZ5 - ’Movie Popularity and Target Audience Prediction Us -
ing the Content-Based Recommender System’-Digital Object Identifier
10.1109/ACCESS.2022.3168161
[6 ] Zhu Wang, Honglong Chen , Senior Member, IEEE, Zhe Li , Kai Lin
, Nan Jiang ,and Feng Xia , Senior Member, IEEE- ’VRCon- vMF:
Visual Recurrent Convolutional Matrix Factorization for Movie
Recommendation’- June 2022
[7 ] MING HE , BO WANG , AND XIANGKUN DU - ’HI2Rec: Explorin g
Knowledge in Heterogeneous Information for Movie Recommendation’-
Mar 2019 - Digital Object Identifier 10.1109/ACCESS.2019.2902398
[8 ] Emrah Inana,, Fatih Tekbacak b, Cemalettin Ozturkc - ’More- opt:
A goal programming based movie recommender system’- Aug
2018.https://doi.org/10.1016/j.jocs.2018.08.004

40

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy