
Fake Review Detection Prj2 (1)

The project report details the development of a 'Fake Review Detection Using Machine Learning' system by students at BML Munjal University, aimed at identifying deceptive online reviews through machine learning techniques. Utilizing Natural Language Processing and classification models like Support Vector Machine (SVM) and Extreme Gradient Boosting (XGBoost), the SVM model achieved an accuracy of 91%. The project underscores the importance of maintaining credibility in online platforms and suggests future enhancements using deep learning methods.

Project Report

Fake Review Detection

SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE AWARD OF THE DEGREE OF

Bachelor of Technology
(COMPUTER SCIENCE AND ENGINEERING)
SUBMITTED BY

NAME
DASAM SRI SAI SASANK(230699)
DEV GUPTA(230685)
SAUMIL GUPTA(230706)

UNDER THE SUPERVISION OF


DR. SATYENDER SINGH
DR. ANANTHA RAO
SCHOOL OF ENGINEERING AND TECHNOLOGY

BML MUNJAL UNIVERSITY


Gurugram, Haryana - 122413
May 2025
CANDIDATE’S DECLARATION
This is to certify that the project titled “Fake Review Detection Using Machine Learning”
has been successfully completed by the following students as part of their Project 2 in
partial fulfillment of the requirements for the award of the Degree of Bachelor of
Technology in Computer Science and Engineering at BML Munjal University:

● Sri Sai Sasank
● Dev Gupta
● Saumil Gupta

The project was carried out during the period from January 2025 to May 2025 under the
supervision of Dr. Anantha Rao and Dr. Satyender Singh, faculty members of BML Munjal
University.

The work presented in this project is an authentic record of the students' efforts.

(Sri Sai Sasank)
(Dev Gupta)
(Saumil Gupta)

Abstract

In today's digital era, online reviews significantly influence consumer choices across
e-commerce platforms, food delivery services, and travel sites. However, the growing
presence of fake reviews undermines the reliability of such feedback systems. This project
focuses on detecting fake reviews using machine learning techniques to help maintain the
credibility of online platforms.

The approach involves preprocessing a labeled dataset of reviews using Natural Language
Processing (NLP) methods, including text cleaning, tokenization, and vectorization. Two
classification models were developed and evaluated: Support Vector Machine (SVM) and
Extreme Gradient Boosting (XGBoost). The SVM model achieved an accuracy of 91%,
outperforming XGBoost, which achieved 89%. These results highlight the effectiveness of
traditional machine learning models when combined with well-engineered textual
features.

By accurately distinguishing between genuine and deceptive reviews, this system
contributes to building user trust and promoting transparency in online platforms. Future
extensions may explore deep learning methods and real-time review filtering systems for
broader applicability.

Acknowledgement

We express our heartfelt gratitude to Dr. Anantha Rao and Dr. Satyender Singh, faculty
members at BML Munjal University, Gurugram, for their invaluable guidance, support,
and encouragement throughout the course of our project titled “Fake Review Detection
Using Machine Learning”. Their expert insights, timely feedback, and constant motivation
greatly contributed to the successful completion of this work, carried out from January to
May 2025.

We would also like to thank the Department of Computer Science and Engineering at BML
Munjal University for providing the necessary academic infrastructure and an
environment conducive to research and learning.

Finally, we extend our sincere thanks to all those who supported and encouraged us during
the project, including friends and peers who contributed in various ways.

SRI SAI SASANK
DEV GUPTA
SAUMIL GUPTA

List of Figures

Figure No.    Figure Description
4.2.1         Fake vs Genuine Review Distribution
4.2.2         Review Text Length Distribution
4.2.3         Review Counts By Categories
4.2.4         Rating Distribution by Label
4.2.5         Distribution of Labels across Categories
4.2.6         Word cloud for CG and OR reviews
5.4           UseCase Model
5.7           ER Diagram
6.1           Comparison of different models and their evaluation metrics
6.2           Confusion matrix for SVM
6.3           Confusion matrix for XGBoost

List of Tables

Table No.     Table Description
3.1           Comparison of Fake Review Detection Studies
4.1           Raw Dataset of reviews from different categories

List of Abbreviations

Abbreviation​ Full Form

SVM​ Support Vector Machine

NLP​ Natural Language Processing

XGBoost​ Extreme Gradient Boosting

CNN Convolutional Neural Networks

RBF Radial Basis Function

LSTM Long Short-Term Memory

BERT Bidirectional Encoder Representations from Transformers

TABLE OF CONTENTS

Candidate’s Declaration
Abstract
Acknowledgement
List of Figures
List of Tables
List of Abbreviations

1 Introduction to Organisation

2 Introduction to Project
  2.1 Overview
  2.2 Existing System
  2.3 User Requirement Analysis
  2.4 Feasibility Study

3 Literature Review
  3.1 Comparison
  3.2 Objectives of Project

4 Exploratory Data Analysis
  4.1 Dataset
  4.2 Exploratory Data Analysis and Visualisations

5 Methodology
  5.1 Introduction to Languages (Front End and Back End)
  5.2 User characteristics
  5.3 Constraints
  5.4 Use Case Model
  5.5 ER Diagram
  5.6 Assumptions and Dependencies
  5.7 ML algorithm discussions

6 Results

7 Conclusion and Future Scope
  7.1 Conclusion
  7.2 Future Scope

8 References

Introduction to Organisation

BML Munjal University (BMU), located in Gurugram, Haryana, is a premier institution of
higher education established with the vision of transforming students into innovative,
socially responsible, and globally aware leaders. Founded by the Hero Group, BMU
emphasizes experiential learning, research, and interdisciplinary education, aiming to
bridge the gap between classroom knowledge and real-world application.

The university offers a wide range of undergraduate, postgraduate, and doctoral
programs in engineering, management, law, and economics. Its pedagogy combines
academic rigor with industry exposure, fostering critical thinking, problem-solving, and
entrepreneurial skills among students.

The School of Engineering and Technology at BMU, under which this project was carried
out, is known for its strong focus on hands-on learning, collaborative projects, and
integration of technology in solving contemporary challenges. With state-of-the-art
infrastructure, expert faculty, and active industry collaboration, BML Munjal University
provides an enriching environment for students to pursue impactful academic and
research-oriented endeavors.

Introduction to Project

2.1​ Overview

Online reviews significantly influence consumer behavior and purchasing decisions on
digital platforms such as e-commerce websites, food delivery apps, and travel services.
However, the rise of fake or manipulated reviews has become a major concern, as they
can mislead customers and damage the reputation of businesses. Detecting such
deceptive reviews manually is not practical due to the vast volume of data generated
daily.

This project, titled “Fake Review Detection Using Machine Learning,” aims to develop an
automated system that can distinguish between genuine and fake reviews using machine
learning techniques. The approach involves collecting a labeled dataset of reviews,
applying Natural Language Processing (NLP) for text preprocessing, and training
classification models to identify review authenticity.

Two key algorithms, Support Vector Machine (SVM) and XGBoost, were implemented
and evaluated. Among these, the SVM model achieved an accuracy of 91%, while
XGBoost achieved 89%, indicating strong performance in classifying review authenticity
based on textual features.

The project demonstrates the practical application of machine learning in real-world
problems, providing a scalable and efficient solution for maintaining the integrity of
online review systems. It also lays the groundwork for further enhancements using deep
learning or real-time data analysis.

2.2​ Existing System

In the current digital landscape, many online platforms rely on basic filtering techniques
or user reports to identify fake reviews. The existing systems used by e-commerce
websites or review aggregators often depend on manual moderation, rule-based
algorithms, or simple keyword matching to flag suspicious content. While these methods
offer some level of control, they are limited in their ability to accurately detect
sophisticated or intentionally deceptive reviews.

Some platforms incorporate basic Natural Language Processing (NLP) to scan for
spam-like behavior, such as repetitive words, excessive use of promotional language, or
unnatural patterns. However, these static rule-based approaches struggle to adapt to
evolving tactics used by fake reviewers, such as varying sentence structure or using
AI-generated text.
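
To make the limitation concrete, the following is a minimal sketch of a static rule-based
filter of the kind described above. The phrase list, the repetition threshold, and the
function name are all hypothetical choices made for illustration; they are not taken from
any real platform.

```python
import re

# Hypothetical promotional phrases; a real moderation list would be far longer.
PROMO_PHRASES = ("best ever", "buy now", "highly recommend", "must buy")


def looks_suspicious(review: str, max_repeat_ratio: float = 0.5) -> bool:
    """Flag a review using two static rules: promo phrases and word repetition."""
    text = review.lower()
    words = re.findall(r"[a-z']+", text)
    if not words:
        return False
    # Rule 1: a known promotional phrase appears verbatim.
    if any(phrase in text for phrase in PROMO_PHRASES):
        return True
    # Rule 2: one word dominates a review of at least four words.
    most_common = max(words.count(w) for w in set(words))
    return len(words) >= 4 and most_common / len(words) > max_repeat_ratio


# The first review matches a listed phrase; the second has the same intent
# but different wording, so the static rules no longer apply to it.
print(looks_suspicious("Buy now! Best ever product, buy now!"))
print(looks_suspicious("Purchase immediately, finest item I own"))
```

A reworded review slips through because the rules match surface strings only, which is the
gap a classifier trained on labeled data is meant to close.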

Another major limitation of existing systems is the lack of contextual understanding and
the inability to learn from past data. Many fake reviews are written to closely mimic
genuine ones, making it difficult for traditional systems to identify them without
advanced analysis.

Overall, the existing systems are not scalable, lack automation, and are often unable to
deliver high accuracy in classification. This creates a strong need for machine
learning-based solutions that can learn patterns from data, generalize across different
types of reviews, and improve over time through training.

2.3​ User Requirement Analysis

The goal of this project is to develop a system that can automatically detect fake reviews
using machine learning techniques. To ensure the system meets the expectations and
needs of its intended users—such as administrators, businesses, platform moderators,
and end consumers—a detailed analysis of user requirements is essential.

1. Functional Requirements

● Input Interface: Users should be able to input or upload reviews in text format.

● Classification Output: The system should display whether the review is likely to be
fake or genuine.

● Model Integration: The system should integrate trained machine learning models
(e.g., SVM, XGBoost) to perform classification in real time or batch mode.

● Visualization: A basic dashboard or output display showing results, accuracy, and
model performance may be provided for administrators or testers.

● Review Storage (Optional): Reviews and their predicted labels may be stored for
future analysis or improvement of the model.

2. Non-Functional Requirements

● Accuracy: The system should maintain a high level of accuracy (above 85%), with
consistent results across datasets.

● Scalability: It should be capable of handling large volumes of review data if
integrated into a live platform.

● Responsiveness: The classification process should be quick, ideally processing each
review within a few seconds.

● Usability: The system should be simple and user-friendly for non-technical users,
particularly moderators or business owners.

● Security: If deployed online, data should be handled securely to protect sensitive
content.

3. Stakeholder Expectations

● Platform Owners: Want to maintain credibility and prevent spam or manipulation.

● Customers/Users: Expect trustworthy reviews to guide their purchasing decisions.

● Moderators: Require efficient tools to reduce manual review verification efforts.

This analysis helps in aligning the system's features with user expectations, ensuring a
solution that is practical, efficient, and effective in real-world usage.

2.4​ Feasibility Study

The development of a Fake Review Detection System using machine learning is both
technically and operationally feasible. Leveraging widely-used libraries like scikit-learn
and XGBoost for text classification, the project can efficiently process large datasets of
online reviews with high accuracy. The system will utilize Natural Language Processing
(NLP) techniques to extract features from the reviews, and the chosen machine learning
models, SVM and XGBoost, have demonstrated strong performance in similar tasks. The
system is easy to integrate with existing review platforms, offering an intuitive interface
for moderators to detect fake reviews, and can scale to handle large volumes of data as
required.

From an economic perspective, the project is cost-effective, requiring minimal
development resources since open-source tools and platforms will be used. Once
implemented, the system can automate review verification, reducing operational costs
and improving platform credibility. Additionally, it can deliver a significant return on
investment by enhancing user trust and platform reliability. Thus, the project is feasible
within budgetary constraints, with a high potential for impact in real-world applications.

Literature Review
Fake review detection has gained significant attention with the growth of online
platforms. Early research by Jindal and Liu (2008) focused on identifying inconsistencies in
review content, such as promotional language and unnatural patterns. Mihalcea et al.
(2009) used Support Vector Machines (SVM) to detect fake reviews by analyzing review
features like length and word frequency, setting the stage for SVM's use in review
classification.

Recent studies have integrated Natural Language Processing (NLP) and ensemble
methods. Zhang et al. (2017) combined classifiers like SVM and Logistic Regression with
sentiment analysis, achieving higher accuracy. Deep learning approaches, such as
Convolutional Neural Networks (CNNs), have also been explored. Chen et al. (2020)
demonstrated that deep learning outperformed traditional methods, though it requires
large datasets and computational resources.

XGBoost has emerged as a powerful tool for fake review detection, with Zhao et al. (2020)
showing it delivers high accuracy when combined with text-based features like TF-IDF.
Despite the rise of deep learning, traditional machine learning algorithms like SVM and
XGBoost remain effective, particularly in scenarios with limited data.

3.1​ Comparison

Table 3.1 Comparison of Fake Review Detection Studies

3.2 Objectives of Project

To develop an innovative hybrid model for fake review detection that surpasses current
methodologies in both accuracy and computational efficiency.

Specific Objectives

3.2.1 Enhance Feature Extraction
Implement advanced natural language processing techniques to identify subtle linguistic
patterns that basic feature extraction methods have overlooked.

3.2.2 Optimize Computational Performance
Create a balanced approach that maintains high accuracy while reducing the computational
resources required compared to existing deep learning solutions.

3.2.3 Incorporate Contextual Intelligence
Develop a framework that evaluates reviews within their broader context, including product
metadata and marketplace dynamics, rather than analyzing text in isolation.

3.2.4 Design Self-Optimizing Parameters
Build a model with adaptive parameter adjustment capabilities to minimize manual tuning
requirements and maintain consistent performance across varied datasets.

3.2.5 Ensure Cross-Domain Effectiveness
Test and refine the detection system across multiple product categories and service sectors to
ensure versatility beyond single-domain applications.

3.2.6 Enable Real-Time Detection
Construct an implementation framework capable of classifying reviews at submission time,
moving beyond the retrospective analysis focus of current research.

3.2.7 Analyze Behavioral Patterns
Integrate reviewer behavior metrics and temporal analysis to identify coordinated review
manipulation that content-only approaches cannot detect.

Exploratory Data Analysis

4.1​ Dataset

This project utilizes a balanced dataset of approximately 20,000 online product reviews
spanning ten retail categories, including Home and Kitchen, Electronics, and Kindle Store.
Each category contains roughly 2,000 reviews, equally distributed between genuine (CG)
and potentially deceptive (OR) labels. Data collection employed a custom-built web crawler
that used request randomization and proxy rotation to gather diverse review samples while
respecting platform usage policies. The extracted data then passed through a preprocessing
pipeline comprising text normalization, HTML artifact removal, and standardization of
product references, followed by quality assurance checks to eliminate duplicates and
verify label consistency.

As the sample Home and Kitchen reviews in Table 4.1 show, the dataset captures varied
linguistic patterns across rating levels, with an observable correlation between 5-star
ratings and genuine labels. Fig 4.2.5 confirms the balanced distribution of labels across
all product categories, with Kindle Store containing the highest volume at approximately
2,300 reviews and Movies and TV the lowest at roughly 1,800 reviews. This curated
cross-domain dataset provides the foundation for developing and evaluating our fake
review detection methodology.

Table 4.1 - Raw Dataset of reviews from different categories

4.2​ Exploratory Data Analysis and Visualisations

Fig 4.2.1 - Fake vs Genuine Review Distribution

This bar graph depicts the distribution between fake and genuine reviews in the dataset,
showing approximately 20,000 reviews in each category. The blue bar represents genuine
reviews (labeled as CG) while the peach/orange bar represents fake or deceptive reviews
(labeled as OR). The y-axis measures the count of reviews, reaching just over 20,000 for
each category, and the x-axis identifies the review types. The near-perfect balance
between genuine and fake reviews indicates a deliberately curated dataset designed for
optimal machine learning model training, where class imbalance would not skew the
results.

Fig 4.2.2- Review Text Length Distribution

The image illustrates the distribution of review text lengths by showing a histogram of
word counts in user reviews. The x-axis represents the number of words per review, while
the y-axis indicates the count of reviews within each range. The plot reveals a highly
right-skewed distribution, with most reviews containing between 10 and 30 words. As the
word count increases, the frequency of reviews declines significantly. A smooth density
curve overlays the histogram, further highlighting that shorter reviews are far more
common than longer ones. This visualization is useful for understanding review behavior
and guiding preprocessing steps in NLP tasks.
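
As a rough illustration of how such a length distribution can be summarized, the sketch
below counts whitespace-separated words per review and compares the mean with the median.
The sample reviews are made up for this example and are not drawn from the project
dataset; the helper name `word_counts` is likewise an assumption.

```python
from statistics import mean, median


def word_counts(reviews):
    """Return the number of whitespace-separated tokens in each review."""
    return [len(review.split()) for review in reviews]


# Made-up sample reviews (not from the project dataset).
sample = [
    "Great product, works well.",
    "Love it.",
    "Fits perfectly and arrived on time.",
    "I bought this for my kitchen and after three months of daily use it still "
    "looks new, cleans easily, and has replaced two older gadgets I used before.",
]

counts = word_counts(sample)
print("word counts:", counts)
# In a right-skewed distribution a few long reviews pull the mean above the median.
print("mean:", mean(counts), "median:", median(counts))
```

A mean noticeably above the median is a simple numeric sign of the right skew that the
histogram shows.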

Fig 4.2.3- Review Counts By Categories

The bar chart visualizes the distribution of review counts across different product
categories. Each horizontal bar represents a distinct category, with its length indicating
the number of reviews received. Among the categories, Kindle_Store_5 has the
highest review count, followed closely by Books_5 and Pet_Supplies_5. The remaining
categories, including Home_and_Kitchen_5, Electronics_5, and Movies_and_TV_5,
have slightly fewer reviews but still show substantial engagement. This visualization
highlights which categories receive the most customer feedback, useful for analyzing
consumer interest and product popularity.

Fig 4.2.4-Rating Distribution by Label

The stacked bar chart illustrates the distribution of product ratings across two labels:
CG and OR. The x-axis represents the rating scale from 1 to 5, while the y-axis shows
the frequency of each rating. Each bar is segmented by label, with CG in teal and OR in
light yellow. It is evident that the majority of ratings are 5-star, with both labels
contributing significantly but OR having a slight edge. Lower ratings (1 to 3) are
comparatively rare for both categories. This visualization helps compare the sentiment
trends between the two labels, indicating a strong skew toward positive feedback
overall.

Fig 4.2.5- Distribution of Labels across Categories


The grouped bar chart shows the distribution of two labels, CG and OR, across
different product categories. Each category has nearly equal counts for both labels,
indicating a balanced distribution. Among the categories, Kindle_Store_5 stands out
with the highest review count for both CG and OR, followed closely by Books_5 and
Pet_Supplies_5. Meanwhile, categories like Movies_and_TV_5 and
Clothing_Shoes_and_Jewelry_5 have relatively lower counts. This uniform
distribution across labels ensures fair representation, which is useful for comparative
analysis and modeling purposes.

Fig 4.2.6 - Word cloud for CG and OR reviews

The word clouds for CG and OR reviews reveal both overlap and distinct differences in
language usage. Common prominent words like “book,” “love,” “great,” and “read”
suggest shared themes across labels, particularly related to books and storytelling.
However, CG reviews tend to include more emotionally expressive and qualitative terms
such as “comfortable,” “well written,” and “series,” indicating a focus on experience and
enjoyment. In contrast, OR reviews feature more practical and functional words like
“work,” “fit,” “time,” and “need,” suggesting a utilitarian tone aimed at assessing product
performance. This indicates that CG reviews are generally more subjective, while OR
reviews lean toward objective evaluation.
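
A word cloud is essentially a picture of per-label word frequencies. The sketch below
shows one way those frequencies could be computed with the standard library. The stopword
set is a tiny illustrative stand-in and `top_words` is a hypothetical helper; neither is
taken from the project code.

```python
import re
from collections import Counter

# Illustrative stopword set only; NLP libraries ship much fuller lists.
STOPWORDS = {"the", "a", "and", "it", "is", "this", "i", "to", "of", "for"}


def top_words(reviews, n=3):
    """Return the n most frequent non-stopword tokens across the given reviews."""
    counts = Counter()
    for review in reviews:
        tokens = re.findall(r"[a-z]+", review.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)


# Made-up examples; in practice this would be run once per label (CG and OR).
print(top_words(["Great book, great read", "Love this book"], n=2))
```

Running the helper separately on the CG and OR subsets gives the ranked terms that a word
cloud renders by font size.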

Methodology
5.1​ Introduction to Languages (Front End and Back End)
Front End:
This project focuses on the back-end machine learning implementation. The only
user-facing component, if any, is the presentation of results and charts, which could
also be displayed in a web-based application if applicable.

● Python Libraries: Within Google Colab, we use matplotlib and seaborn for
generating static visualizations such as confusion matrices, accuracy graphs, or
feature importance plots.

Back End:​
The machine learning model is developed and trained within Google Colab, a
cloud-based platform that supports Python. Colab provides an interactive environment
for running Python code without the need for setting up local computing resources. This
project relies heavily on the following Python libraries for data processing and model
development:

●​ Google Colab: The primary platform for running all machine learning experiments
and model training.
●​ Data Preprocessing: pandas for data manipulation, numpy for numerical
operations, re for text cleaning, and NLP libraries like nltk and spaCy for text
tokenization and other language processing tasks.
●​ Modeling and Classification: scikit-learn for developing the Support Vector
Machine (SVM) model, and XGBoost for implementing the Extreme Gradient
Boosting (XGBoost) model.
● Text Vectorization: TfidfVectorizer and CountVectorizer are used to
convert raw text data into numerical feature vectors that can be fed into the
machine learning models.
● Evaluation: Metrics such as accuracy_score, confusion_matrix, and
classification_report from scikit-learn are used to evaluate the
model performance.
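
The preprocessing, vectorization, modeling, and evaluation steps listed above can be
sketched end to end as follows. This is a minimal illustration under stated assumptions,
not the project's actual notebook: the corpus is a tiny made-up sample, `clean_text` is a
hypothetical helper, and `LinearSVC` with `C=1.0` is one possible way to obtain a
linear-kernel SVM in scikit-learn.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def clean_text(text: str) -> str:
    """Lowercase, drop HTML-like tags and non-letter characters, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text.lower())
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


# Tiny made-up corpus for illustration only; it is not the project dataset.
# Labels follow the report's convention: CG = genuine, OR = fake.
texts = [
    "Loved this book, the story was well written.",
    "Comfortable shoes, my daughter wears them daily.",
    "Great series, could not stop reading it.",
    "The blanket is soft and the colour is lovely.",
    "Lovely mug, a thoughtful gift for my mother.",
    "Enjoyed every chapter of this novel.",
    "Works works works, need need this item time.",
    "Fit fit good, time to need this product work.",
    "Product work good, need time fit item.",
    "Item need work, good fit time product.",
    "Need this, work fit good time item product.",
    "Good work item, fit need product time.",
]
labels = ["CG"] * 6 + ["OR"] * 6

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# TF-IDF features feeding a linear SVM; both choices are assumptions of this sketch.
model = make_pipeline(
    TfidfVectorizer(preprocessor=clean_text),
    LinearSVC(C=1.0),
)
model.fit(X_train, y_train)
pred = model.predict(X_test)

print("accuracy:", accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred, labels=["CG", "OR"]))
print(classification_report(y_test, pred, zero_division=0))
```

Wrapping the vectorizer and classifier in one pipeline keeps the TF-IDF vocabulary fitted
on the training split only, so the test split stays unseen until evaluation.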

5.2​ User characteristics
The system targets business owners, platform moderators, and consumers on
e-commerce, food delivery, and travel platforms. It helps ensure the authenticity of
online reviews, enhancing trust and transparency for users relying on review systems to
make informed decisions.

5.3​ Constraints
●​ Data Quality: The model’s accuracy depends on clean and representative training
data.
●​ Feature Selection: The effectiveness of the model is influenced by the chosen
textual features.
● Imbalanced Dataset: In real-world data, fake reviews are less frequent than genuine
ones, which may bias a model toward predicting genuine reviews. The dataset used for
training here is balanced.
●​ Real-Time Processing: The model may need further optimization for large-scale,
real-time review filtering.
●​ Computational Resources: Scaling the model for high-volume platforms may
require more powerful infrastructure.
●​ Model Generalization: The model may need adjustments for diverse platforms or
languages beyond the training dataset.

5.4​ Use Case Model

Fig 5.4 - UseCase Model

5.5​ ER Diagram

Fig 5.7 - ER Diagram

5.6​ Assumptions and Dependencies


Labeled Dataset Availability
The system assumes access to a reliable dataset where reviews are clearly marked
as genuine or fake, which is essential for training and evaluating the detection
model.

Dependence on NLP Tools and Libraries
The project depends on natural language processing libraries such as NLTK, spaCy,
and Scikit-learn for tasks like text preprocessing, feature extraction, and
classification.

Language and Content Consistency
It is assumed that the reviews are primarily in English and contain enough
descriptive text to extract meaningful patterns for classification.

5.7​ ML algorithm discussions


5.7.1 XGBoost
●​ Boosting algorithm that uses gradient descent to minimize error.
●​ Handles imbalanced data well.
●​ Provides feature importance for interpretability.
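
The idea of boosting by following the gradient of the error can be sketched in a few lines
of plain Python. Note that this is ordinary gradient boosting with squared error and
one-split decision stumps on a single feature, not XGBoost's regularized implementation;
the function names, toy data, learning rate, and round count are all assumptions made for
illustration.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on a 1-D feature that minimizes squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean


def boost(xs, ys, rounds=20, lr=0.3):
    """Fit an additive model: a constant plus `rounds` stumps scaled by `lr`."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + lr * sum(s(x) for s in stumps)


# Toy data: one feature, target switches from 0 to 1 between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = boost(xs, ys)
print(round(model(2), 3), round(model(5), 3))
```

Each round fits a weak learner to what the current ensemble still gets wrong, so the
training error shrinks as rounds are added.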

5.7.2 SVM (Support Vector Machine)
● Excellent for text classification.
● Supports linear decision boundaries (linear kernel) and non-linear ones through the
kernel trick (e.g., RBF kernel).
● Hyperparameters: C (penalty), Kernel (linear used in this case).
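
To show what the penalty parameter C controls in a linear SVM, the sketch below writes out
the soft-margin objective and minimizes it with simple subgradient steps. It is an
illustrative toy under stated assumptions (hand-picked 2-D points, fixed learning rate and
epoch count, hypothetical function names), not the solver scikit-learn uses.

```python
def decision(w, b, x):
    """Linear decision function f(x) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def svm_objective(w, b, X, y, C=1.0):
    """Soft-margin objective: 0.5 * ||w||^2 + C * sum of hinge losses."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge = sum(max(0.0, 1.0 - yi * decision(w, b, xi)) for xi, yi in zip(X, y))
    return margin_term + C * hinge


def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Minimize the objective with per-sample subgradient steps."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * decision(w, b, xi) < 1:
                # Point violates the margin: hinge term contributes -C * y * x.
                w = [wi - lr * (wi - C * yi * xj) for wi, xj in zip(w, xi)]
                b += lr * C * yi
            else:
                # Only the regularizer acts, shrinking w toward a wider margin.
                w = [wi - lr * wi for wi in w]
    return w, b


# Toy, linearly separable points with labels +1 / -1.
X = [[2, 0], [3, 1], [0, 2], [1, 3]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
print([1 if decision(w, b, xi) > 0 else -1 for xi in X])
```

A larger C weights margin violations more heavily relative to the width of the margin,
which is the trade-off this hyperparameter tunes.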
Results
Fig 6.1 presents a comparative analysis of three models (Support Vector Machine
(SVM), XGBoost, and BERT) based on four key evaluation metrics: Accuracy, Precision,
Recall, and F1-Score.

●​ Accuracy: SVM achieved the highest accuracy (0.91), followed by XGBoost (0.89)
and BERT (0.81).
●​ Precision: SVM scored 0.90, slightly ahead of XGBoost (0.89), while BERT lagged at
0.79.
●​ Recall: SVM again outperformed with a recall of 0.91, XGBoost recorded 0.88, and
BERT had a relatively low recall of 0.72.
●​ F1-Score: This balanced metric showed SVM at 0.91, XGBoost at 0.89, and BERT at
0.70.

These results indicate that the SVM model consistently performed better across all
metrics, making it the most effective model for fake review detection in this project.
Although BERT is a deep learning model, its performance was comparatively lower,
possibly due to limited data, resource constraints, or lack of fine-tuning.

Fig 6.1 - Comparison of different models and their evaluation metrics

Fig 6.2 - Confusion matrix for SVM

The confusion matrix for the SVM model shows that it correctly classified 3528 genuine
(CG) and 3634 fake (OR) reviews. Misclassifications include 488 genuine reviews
predicted as fake and 437 fake reviews predicted as genuine.
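
The headline metrics follow directly from the four counts of a confusion matrix. The
sketch below recomputes them from the counts quoted above, treating OR (fake) as the
positive class; that choice of positive class and the helper name `binary_metrics` are
assumptions made here for illustration.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts from the text, with OR (fake) as the positive class:
#   tp = fake predicted as fake, fn = fake predicted as genuine,
#   fp = genuine predicted as fake, tn = genuine predicted as genuine.
metrics = binary_metrics(tp=3634, fp=488, fn=437, tn=3528)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Recomputing the scores this way is a quick check that a reported confusion matrix and the
reported metrics describe the same evaluation run.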

Fig 6.3 - Confusion matrix for XGBoost

The XGBoost model produced classification results comparable to those of the SVM model,
indicating strong performance as well. However, its overall metric scores were slightly
lower than SVM in accuracy and recall.

Conclusion and Future Scope

7.1​ Conclusion
In this project, two machine learning models, Support Vector Machine (SVM) and
XGBoost, were applied to detect fake reviews using natural language processing
techniques. The performance of both models was evaluated using accuracy, precision,
recall, and F1-score metrics.

The SVM model achieved an overall accuracy of 91%, outperforming XGBoost, which
reached 89% accuracy. SVM also showed balanced performance across both classes
("CG" and "OR"), with precision and recall values consistently around 0.90–0.91. In
comparison, XGBoost, while slightly lower in all metrics, still demonstrated reliable
detection capability.

Based on these results, SVM is the more effective model for identifying fake reviews in
this context, offering better generalization and robustness across different review types.
However, XGBoost remains a strong alternative and may perform better with further
hyperparameter tuning or in ensemble approaches.

7.2​ Future Scope

Integration of Deep Learning Models
Future work can explore advanced deep learning techniques such as LSTM networks and
fine-tuned transformer models such as BERT to capture deeper semantic meaning and
context within reviews, potentially improving detection accuracy.

Multilingual Review Analysis
The current model is limited to English reviews. Expanding the system to support
multiple languages would increase its applicability across global platforms and make it
more versatile.

Real-Time Detection System
The model can be deployed as part of a real-time monitoring tool for e-commerce and
review platforms, allowing automatic flagging of suspicious reviews as they are posted.

References

[1] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word
Representations in Vector Space. Proceedings of the 1st International Conference on
Learning Representations (ICLR 2013) - Workshop Track.

[2] Lakshmi Kalyani, M., Jacintha, M., Kar, D., Roy, N., & Sharma, V.K. (2020). A Comparative
Analysis of Text Embeddings (TF-IDF, Word2Vec, FastText) for Machine Learning-Based
Fake News Detection. TIJER.

[3] Wael, A., et al. (2024). Fake Reviews Detection in E-Commerce Using Machine Learning
Techniques. BIO Web of Conferences, 97, 00099.

[4] Rathore, A.A., Bhadane, G.L., Jadhav, A.D., Dhale, K.H., & Muley, J.D. (2023). Fake Reviews
Detection Using NLP Model and Neural Network Model. International Journal of
Engineering Research & Technology (IJERT).

[5] Sayam, A. (2023). Fake-Reviews-Detection. GitHub Repository.

[6] IIETA. (2023). Fake Review Detection Using Machine Learning. Revue d’Intelligence
Artificielle, 37(5).

[7] Banerjee, S., et al. (2022). Detecting Fake Reviews Using Linguistic Clues and Machine
Learning. Proceedings of the International Conference on Data Mining and Big Data.
