Fake Review Detection Using Machine Learning
SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE AWARD OF DEGREE OF
Bachelor of Technology
(COMPUTER SCIENCE AND ENGINEERING)
SUBMITTED BY
DASAM SRI SAI SASANK(230699)
DEV GUPTA(230685)
SAUMIL GUPTA(230706)
Candidate's Declaration

The project was carried out during the period from January 2025 to May 2025 under the
supervision of Dr. Anantha Rao and Dr. Satyender Singh, faculty members of BML Munjal
University.
The work presented in this project is an authentic record of the students' efforts.
Abstract
In today's digital era, online reviews significantly influence consumer choices across
e-commerce platforms, food delivery services, and travel sites. However, the growing
presence of fake reviews undermines the reliability of such feedback systems. This project
focuses on detecting fake reviews using machine learning techniques to help maintain the
credibility of online platforms.
The approach involves preprocessing a labeled dataset of reviews using Natural Language
Processing (NLP) methods, including text cleaning, tokenization, and vectorization. Two
classification models were developed and evaluated: Support Vector Machine (SVM) and
Extreme Gradient Boosting (XGBoost). The SVM model achieved an accuracy of 91%,
outperforming XGBoost, which achieved 89%. These results highlight the effectiveness of
traditional machine learning models when combined with well-engineered textual
features.
Acknowledgement
I express my heartfelt gratitude to Dr. Anantha Rao and Dr. Satyender Singh, faculty
members at BML Munjal University, Gurugram, for their invaluable guidance, support,
and encouragement throughout the course of our project titled “Fake Review Detection
Using Machine Learning”. Their expert insights, timely feedback, and constant motivation
greatly contributed to the successful completion of this work, carried out from January to
May 2025.
I would also like to thank the Department of Computer Science and Engineering at BML Munjal University for providing the resources and environment needed to carry out this work.
Finally, I extend my sincere thanks to all those who supported and encouraged us during
the project, including friends and peers who contributed in various ways.
List of Figures
Figure No. Figure Description Page No.
4.2.1 Fake vs. Genuine Review Distribution 16
4.2.2 Review Text Length Distribution 17
4.2.3 Review Counts by Categories 18
4.2.4 Rating Distribution by Label 19
5.7 ER Diagram 23
6.2 Confusion Matrix for SVM 25
List of Tables
Table No. Table Description Page No.
List of Abbreviations
Abbreviation Full Form
ML Machine Learning
NLP Natural Language Processing
SVM Support Vector Machine
XGBoost Extreme Gradient Boosting
TF-IDF Term Frequency-Inverse Document Frequency
BERT Bidirectional Encoder Representations from Transformers
CNN Convolutional Neural Network
EDA Exploratory Data Analysis
DFD Data Flow Diagram
ER Entity-Relationship
TABLE OF CONTENTS
Candidate’s Declaration 1
Abstract 2
Acknowledgement 3
List of Figures 4
List of Tables 5
List of Abbreviations 6
1 Introduction to Organisation 7
2 Introduction to Project 8
2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Existing System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 User Requirement Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Feasibility Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3 Literature Review 12
3.1 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 Objectives of Project . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4 Exploratory Data Analysis 15
4.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.2 Exploratory Data Analysis and Visualisations . . . . . . . . . . . . . . 16
5 Methodology 21
5.1 Introduction to Languages (Front End and Back End) . . . . . . . . . . . 21
5.2 User characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.3 Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.4 Use Case Model/Flow Chart/DFDs . . . . . . . . . . . . . . . . . . . . 22
5.5 ER Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.6 Assumptions and Dependencies . . . . . . . . . . . . . . . . . . . . . . . 23
5.7 ML algorithm discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
6 Results 24
7 Conclusion and Future Scope 26
References 27
Introduction to Organisation

The School of Engineering and Technology at BMU, under which this project was carried
out, is known for its strong focus on hands-on learning, collaborative projects, and
integration of technology in solving contemporary challenges. With state-of-the-art
infrastructure, expert faculty, and active industry collaboration, BML Munjal University
provides an enriching environment for students to pursue impactful academic and
research-oriented endeavors.
Introduction to Project
2.1 Overview
This project, titled “Fake Review Detection Using Machine Learning,” aims to develop an
automated system that can distinguish between genuine and fake reviews using machine
learning techniques. The approach involves collecting a labeled dataset of reviews,
applying Natural Language Processing (NLP) for text preprocessing, and training
classification models to identify review authenticity.
2.2 Existing System

In the current digital landscape, many online platforms rely on basic filtering techniques
or user reports to identify fake reviews. The existing systems used by e-commerce
websites or review aggregators often depend on manual moderation, rule-based
algorithms, or simple keyword matching to flag suspicious content. While these methods
offer some level of control, they are limited in their ability to accurately detect
sophisticated or intentionally deceptive reviews.
Some platforms incorporate basic Natural Language Processing (NLP) to scan for
spam-like behavior, such as repetitive words, excessive use of promotional language, or
unnatural patterns. However, these static rule-based approaches struggle to adapt to
evolving tactics used by fake reviewers, such as varying sentence structure or using
AI-generated text.
Another major limitation of existing systems is the lack of contextual understanding and
the inability to learn from past data. Many fake reviews are written to closely mimic
genuine ones, making it difficult for traditional systems to identify them without
advanced analysis.
Overall, the existing systems are not scalable, lack automation, and are often unable to
deliver high accuracy in classification. This creates a strong need for machine
learning-based solutions that can learn patterns from data, generalize across different
types of reviews, and improve over time through training.
2.3 User Requirement Analysis

The goal of this project is to develop a system that can automatically detect fake reviews
using machine learning techniques. To ensure the system meets the expectations and
needs of its intended users—such as administrators, businesses, platform moderators,
and end consumers—a detailed analysis of user requirements is essential.
1. Functional Requirements
● Input Interface: Users should be able to input or upload reviews in text format.
● Classification Output: The system should display whether the review is likely to be
fake or genuine.
● Model Integration: The system should integrate trained machine learning models
(e.g., SVM, XGBoost) to perform classification in real time or batch mode.
● Review Storage (Optional): Reviews and their predicted labels should be stored for
future analysis or improvement of the model.
2. Non-Functional Requirements
● Accuracy: The system should maintain a high level of accuracy (above 85%), with
consistent results across datasets.
● Scalability: It should be capable of handling large volumes of review data if
integrated into a live platform.
● Usability: The system should be simple and user-friendly for non-technical users,
particularly moderators or business owners.
3. Stakeholder Expectations
Stakeholders, from platform moderators and business owners to end consumers, expect a detection system that is practical, efficient, and effective in real-world usage. This requirement analysis helps align the system's features with those expectations.
2.4 Feasibility Study

The development of a Fake Review Detection System using machine learning is both
technically and operationally feasible. Leveraging widely-used libraries like scikit-learn
and XGBoost for text classification, the project can efficiently process large datasets of
online reviews with high accuracy. The system will utilize Natural Language Processing
(NLP) techniques to extract features from the reviews, and the chosen machine learning
models, SVM and XGBoost, have demonstrated strong performance in similar tasks. The
system is easy to integrate with existing review platforms, offering an intuitive interface
for moderators to detect fake reviews, and can scale to handle large volumes of data as
required.
Once implemented, the system can automate review verification, reducing operational costs
and improving platform credibility. Additionally, it can deliver a significant return on
investment by enhancing user trust and platform reliability. Thus, the project is feasible
within budgetary constraints, with a high potential for impact in real-world applications.
Literature Review
Fake review detection has gained significant attention with the growth of online
platforms. Early research by Jindal and Liu (2008) focused on identifying inconsistencies in
review content, such as promotional language and unnatural patterns. Mihalcea et al.
(2009) used Support Vector Machines (SVM) to detect fake reviews by analyzing review
features like length and word frequency, setting the stage for SVM's use in review
classification.
Recent studies have integrated Natural Language Processing (NLP) and ensemble
methods. Zhang et al. (2017) combined classifiers like SVM and Logistic Regression with
sentiment analysis, achieving higher accuracy. Deep learning approaches, such as
Convolutional Neural Networks (CNNs), have also been explored. Chen et al. (2020)
demonstrated that deep learning outperformed traditional methods, though it requires
large datasets and computational resources.
XGBoost has emerged as a powerful tool for fake review detection, with Zhao et al. (2020)
showing it delivers high accuracy when combined with text-based features like TF-IDF.
Despite the rise of deep learning, traditional machine learning algorithms like SVM and
XGBoost remain effective, particularly in scenarios with limited data.
3.1 Comparison
3.2 Objectives of Project
To develop an innovative hybrid model for fake review detection that surpasses current approaches.

Specific Objectives
● Create a balanced approach that maintains high accuracy while reducing computational cost.
● Develop a framework that evaluates reviews within their broader context, including product metadata and marketplace dynamics, rather than analyzing text in isolation.
● Build a model with adaptive parameter adjustment capabilities to minimize manual tuning.
● Test and refine the detection system across multiple product categories and service sectors to ensure the model generalizes well.
● Integrate reviewer behavior metrics and temporal analysis to identify coordinated review campaigns.
Exploratory Data Analysis
4.1 Dataset
This project utilizes a balanced dataset of approximately 20,000 online product reviews spanning ten retail categories, including Home and Kitchen, Electronics, and Kindle Store. Each category contains roughly 2,000 reviews, equally distributed between genuine (CG) and potentially deceptive (OR) labels.

Data collection employed a custom-built, ethically compliant web crawler implementing request randomization and proxy rotation to gather diverse review samples while respecting platform usage policies. The extracted data underwent a comprehensive preprocessing pipeline, including text normalization, HTML artifact removal, and standardization of product references, followed by rigorous quality-assurance procedures to eliminate duplicates and verify label consistency.

As evident from the sample Home and Kitchen reviews (Image 1), the dataset captures varied linguistic patterns across different rating levels, with observable correlations between 5-star ratings and genuine labels. Image 2 confirms the balanced distribution of labels across all product categories, with Kindle Store containing the highest volume at approximately 2,300 reviews and Movies and TV the lowest at roughly 1,800 reviews. This carefully curated cross-domain dataset provides a robust foundation for developing and evaluating our fake review detection methodology.
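The loading and quality-assurance steps described above can be sketched in a few lines of pandas. This is a minimal illustration rather than the project's actual code: the file name reviews.csv and the column names category, label, and text_ are assumptions, not details confirmed by the report.

import pandas as pd

# Hypothetical file and column names; adjust to the actual dataset layout.
df = pd.read_csv("reviews.csv")
df = df.drop_duplicates(subset="text_")          # eliminate duplicate reviews
df["text_"] = df["text_"].astype(str).str.strip()  # basic text normalization

print(df["label"].value_counts())                # verify CG/OR balance overall
print(df.groupby("category")["label"].value_counts())  # per-category balance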
4.2 Exploratory Data Analysis and Visualisations
Fig 4.2.1 - Fake vs. Genuine Review Distribution

This bar graph depicts the distribution between fake and genuine reviews in the dataset,
showing approximately 20,000 reviews in each category. The blue bar represents genuine
reviews (labeled as CG) while the peach/orange bar represents fake or deceptive reviews
(labeled as OR). The y-axis measures the count of reviews, reaching just over 20,000 for
each category, and the x-axis identifies the review types. The near-perfect balance
between genuine and fake reviews indicates a deliberately curated dataset designed for
optimal machine learning model training, where class imbalance would not skew the
results.
Fig 4.2.2 - Review Text Length Distribution
The image illustrates the distribution of review text lengths by showing a histogram of
word counts in user reviews. The x-axis represents the number of words per review, while
the y-axis indicates the count of reviews within each range. The plot reveals a highly
right-skewed distribution, with most reviews containing between 10 and 30 words. As the
word count increases, the frequency of reviews declines significantly. A smooth density
curve overlays the histogram, further highlighting that shorter reviews are far more
common than longer ones. This visualization is useful for understanding review behavior
and guiding preprocessing steps in NLP tasks.
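A plot like Fig 4.2.2 can be reproduced with the matplotlib and seaborn libraries named in Chapter 5. The snippet below is a sketch under the same assumptions as the loading sketch in Section 4.1 (a DataFrame df with a text_ column), not the report's original code.

import matplotlib.pyplot as plt
import seaborn as sns

word_counts = df["text_"].str.split().str.len()  # number of words per review
sns.histplot(word_counts, bins=50, kde=True)     # histogram plus density curve
plt.xlabel("Words per review")
plt.ylabel("Count of reviews")
plt.title("Review Text Length Distribution")
plt.show()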
Fig 4.2.3 - Review Counts by Categories
The bar chart visualizes the distribution of review counts across different product
categories. Each horizontal bar represents a distinct category, with its length indicating
the number of reviews received. Among the categories, Kindle_Store_5 has the
highest review count, followed closely by Books_5 and Pet_Supplies_5. The remaining
categories, including Home_and_Kitchen_5, Electronics_5, and Movies_and_TV_5,
have slightly fewer reviews but still show substantial engagement. This visualization
highlights which categories receive the most customer feedback, useful for analyzing
consumer interest and product popularity.
Fig 4.2.4 - Rating Distribution by Label
The stacked bar chart illustrates the distribution of product ratings across two labels:
CG and OR. The x-axis represents the rating scale from 1 to 5, while the y-axis shows
the frequency of each rating. Each bar is segmented by label, with CG in teal and OR in
light yellow. It is evident that the majority of ratings are 5-star, with both labels
contributing significantly but OR having a slight edge. Lower ratings (1 to 3) are
comparatively rare for both categories. This visualization helps compare the sentiment
trends between the two labels, indicating a strong skew toward positive feedback
overall.
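A chart like Fig 4.2.4 can be built from a rating-by-label crosstab. The following is a sketch assuming the df from Section 4.1 has rating and label columns; the report does not specify its exact plotting code.

import matplotlib.pyplot as plt
import pandas as pd

counts = pd.crosstab(df["rating"], df["label"])  # rows: ratings 1-5, cols: CG/OR
counts.plot(kind="bar", stacked=True)            # stacked bars, one per rating
plt.xlabel("Rating")
plt.ylabel("Frequency")
plt.title("Rating Distribution by Label")
plt.show()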
The word clouds for CG and OR reviews reveal both overlap and distinct differences in
language usage. Common prominent words like “book,” “love,” “great,” and “read”
suggest shared themes across labels, particularly related to books and storytelling.
However, CG reviews tend to include more emotionally expressive and qualitative terms
such as “comfortable,” “well written,” and “series,” indicating a focus on experience and
enjoyment. In contrast, OR reviews feature more practical and functional words like
“work,” “fit,” “time,” and “need,” suggesting a utilitarian tone aimed at assessing product
performance. This indicates that CG reviews are generally more subjective, while OR
reviews lean toward objective evaluation.
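Word clouds of this kind are commonly generated with the third-party wordcloud package. The report does not name the tool it used, so the following is only an illustrative sketch under the same column-name assumptions as before.

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# One word cloud per label, built from the concatenated review text.
for lbl in ["CG", "OR"]:
    text = " ".join(df.loc[df["label"] == lbl, "text_"])
    wc = WordCloud(width=800, height=400, background_color="white").generate(text)
    plt.figure()
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Word cloud: {lbl} reviews")
plt.show()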
Methodology
5.1 Introduction to Languages (Front End and Back End)
Front End:
While this project focuses on the back-end machine learning implementation, it does include user-facing output in the form of result visualizations and charts. The technologies below are used for displaying results and could support interactive components in a web-based application, if applicable.
● Python Libraries: Within Google Colab, we use matplotlib and seaborn for generating static visualizations such as confusion matrices, accuracy graphs, and feature importance plots.
Back End:
The machine learning model is developed and trained within Google Colab, a cloud-based platform that supports Python. Colab provides an interactive environment for running Python code without the need to set up local computing resources. This project relies heavily on the following Python libraries for data processing and model development (a minimal end-to-end sketch follows the list):
● Google Colab: The primary platform for running all machine learning experiments
and model training.
● Data Preprocessing: pandas for data manipulation, numpy for numerical
operations, re for text cleaning, and NLP libraries like nltk and spaCy for text
tokenization and other language processing tasks.
● Modeling and Classification: scikit-learn for developing the Support Vector
Machine (SVM) model, and XGBoost for implementing the Extreme Gradient
Boosting (XGBoost) model.
● Text Vectorization: TfidfVectorizer and CountVectorizer are used to
convert raw text data into numerical feature vectors that can be fed into the
machine learning models.
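Putting these libraries together, the described pipeline can be sketched end to end as follows. This is a minimal illustration reusing the df from the Section 4.1 sketch; the column names (text_, label), the 80/20 split, and all hyperparameters are assumptions, not the project's exact configuration.

import re
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

def clean(text: str) -> str:
    # Lowercase, strip non-letter characters, and drop stopwords.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(w for w in text.split() if w not in stop_words)

X = df["text_"].map(clean)
y = (df["label"] == "OR").astype(int)  # 1 = fake (OR), 0 = genuine (CG)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

vec = TfidfVectorizer(max_features=5000)   # vocabulary size is an assumption
X_train_vec = vec.fit_transform(X_train)
X_test_vec = vec.transform(X_test)         # reuse vocabulary fitted on train

svm = LinearSVC().fit(X_train_vec, y_train)
xgb = XGBClassifier(eval_metric="logloss").fit(X_train_vec, y_train)

for name, model in [("SVM", svm), ("XGBoost", xgb)]:
    print(name, accuracy_score(y_test, model.predict(X_test_vec)))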
5.2 User characteristics
The system targets business owners, platform moderators, and consumers on
e-commerce, food delivery, and travel platforms. It helps ensure the authenticity of
online reviews, enhancing trust and transparency for users relying on review systems to
make informed decisions.
5.3 Constraints
● Data Quality: The model’s accuracy depends on clean and representative training
data.
● Feature Selection: The effectiveness of the model is influenced by the chosen
textual features.
● Imbalanced Dataset: Fake reviews are less frequent, which may cause bias toward predicting genuine reviews (a mitigation sketch follows this list).
● Real-Time Processing: The model may need further optimization for large-scale,
real-time review filtering.
● Computational Resources: Scaling the model for high-volume platforms may
require more powerful infrastructure.
● Model Generalization: The model may need adjustments for diverse platforms or
languages beyond the training dataset.
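For the imbalanced-dataset constraint, one common mitigation, shown here as a hypothetical sketch rather than something the report implemented, is to reweight classes inversely to their frequency:

from sklearn.svm import LinearSVC

# class_weight="balanced" penalizes errors on the rarer (fake) class more
# heavily, counteracting bias toward the majority (genuine) class.
svm_balanced = LinearSVC(class_weight="balanced").fit(X_train_vec, y_train)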
5.5 ER Diagram

Results

The SVM, XGBoost, and BERT models were compared on four evaluation metrics:
● Accuracy: SVM achieved the highest accuracy (0.91), followed by XGBoost (0.89)
and BERT (0.81).
● Precision: SVM scored 0.90, slightly ahead of XGBoost (0.89), while BERT lagged at
0.79.
● Recall: SVM again outperformed with a recall of 0.91, XGBoost recorded 0.88, and
BERT had a relatively low recall of 0.72.
● F1-Score: This balanced metric showed SVM at 0.91, XGBoost at 0.89, and BERT at
0.70.
These results indicate that the SVM model consistently performed better across all
metrics, making it the most effective model for fake review detection in this project.
Although BERT is a deep learning model, its performance was comparatively lower,
possibly due to limited data, resource constraints, or lack of fine-tuning.
Fig 6.2 - Confusion matrix for SVM
The confusion matrix for the SVM model shows that it correctly classified 3528 genuine
(CG) and 3634 fake (OR) reviews. Misclassifications include 488 genuine reviews
predicted as fake and 437 fake reviews predicted as genuine.
The XGBoost model produced classification results very close to those of the SVM model, indicating strong performance as well. However, its overall scores were slightly lower than SVM's in accuracy and recall.
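The metrics above and the confusion matrix in Fig 6.2 can be computed with scikit-learn. The sketch below reuses the fitted SVM and test split from the pipeline sketch in Chapter 5 and is illustrative rather than the report's exact code.

import matplotlib.pyplot as plt
from sklearn.metrics import (ConfusionMatrixDisplay, classification_report,
                             confusion_matrix)

y_pred = svm.predict(X_test_vec)
# Per-class precision, recall, and F1 for CG (genuine) and OR (fake).
print(classification_report(y_test, y_pred, target_names=["CG", "OR"]))

cm = confusion_matrix(y_test, y_pred)  # rows: true labels, cols: predictions
ConfusionMatrixDisplay(cm, display_labels=["CG", "OR"]).plot()
plt.title("Confusion matrix for SVM")
plt.show()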
Conclusion and Future Scope
7.1 Conclusion
In this project, two machine learning models, Support Vector Machine (SVM) and XGBoost, were applied to detect fake reviews using natural language processing techniques. The performance of both models was evaluated using accuracy, precision, recall, and F1-score metrics.
The SVM model achieved an overall accuracy of 91%, outperforming XGBoost, which
reached 89% accuracy. SVM also showed balanced performance across both classes
("CG" and "OR"), with precision and recall values consistently around 0.90–0.91. In
comparison, XGBoost, while slightly lower in all metrics, still demonstrated reliable
detection capability.
Based on these results, SVM is the more effective model for identifying fake reviews in
this context, offering better generalization and robustness across different review types.
However, XGBoost remains a strong alternative and may perform better with further
hyperparameter tuning or in ensemble approaches.
References

[1] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.
[2] Lakshmi Kalyani, M., Jacintha, M., Kar, D., Roy, N., & Sharma, V.K. (2020). A Comparative
[3] Wael, A., et al. (2024). Fake Reviews Detection in E-Commerce Using Machine Learning
[4] Rathore, A.A., Bhadane, G.L., Jadhav, A.D., Dhale, K.H., & Muley, J.D. (2023). Fake Reviews
Detection Using NLP Model and Neural Network Model. International Journal of
[6] IIETA. (2023). Fake Review Detection Using Machine Learning. Revue d’Intelligence
Artificielle, 37(5).
[7] Banerjee, S., et al. (2022). Detecting Fake Reviews Using Linguistic Clues and Machine
Learning. Proceedings of the International Conference on Data Mining and Big Data.