Fake Review Detection Using Machine Learning
SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE AWARD OF DEGREE OF
Bachelor of Technology
(COMPUTER SCIENCE AND ENGINEERING)
SUBMITTED BY
DASAM SRI SAI SASANK(230699)
DEV GUPTA(230685)
SAUMIL GUPTA(230706)
Candidate's Declaration

The project was carried out during the period from January 2025 to May 2025 under the
supervision of Dr. Anantha Rao and Dr. Satyender Singh, faculty members of BML Munjal
University.
The work presented in this project is an authentic record of the students' efforts.
Abstract
In today's digital era, online reviews significantly influence consumer choices across
e-commerce platforms, food delivery services, and travel sites. However, the growing
presence of fake reviews undermines the reliability of such feedback systems. This project
focuses on detecting fake reviews using machine learning techniques to help maintain the
credibility of online platforms.
The approach involves preprocessing a labeled dataset of reviews using Natural Language
Processing (NLP) methods, including text cleaning, tokenization, and vectorization. Two
classification models were developed and evaluated: Support Vector Machine (SVM) and
Extreme Gradient Boosting (XGBoost). The SVM model achieved an accuracy of 91%,
outperforming XGBoost, which achieved 89%. These results highlight the effectiveness of
traditional machine learning models when combined with well-engineered textual
features.
Acknowledgement
I express my heartfelt gratitude to Dr. Anantha Rao and Dr. Satyender Singh, faculty
members at BML Munjal University, Gurugram, for their invaluable guidance, support,
and encouragement throughout the course of our project titled “Fake Review Detection
Using Machine Learning”. Their expert insights, timely feedback, and constant motivation
greatly contributed to the successful completion of this work, carried out from January to
May 2025.
I would also like to thank the Department of Computer Science and Engineering at BML Munjal University for providing the resources and environment needed to carry out this work.
Finally, I extend my sincere thanks to all those who supported and encouraged us during
the project, including friends and peers who contributed in various ways.
List of Figures
Figure No. Figure Description Page No.
4.2.1 Fake vs. Genuine Review Distribution 16
4.2.2 Review Text Length Distribution 17
4.2.3 Review Counts by Categories 18
4.2.4 Rating Distribution by Label 19
5.7 ER Diagram 23
6.2 Confusion Matrix for SVM 25
List of Tables
Table No. Table Description Page No.
List of Abbreviations
Abbreviation Full Form
ML Machine Learning
NLP Natural Language Processing
SVM Support Vector Machine
XGBoost Extreme Gradient Boosting
TF-IDF Term Frequency-Inverse Document Frequency
BERT Bidirectional Encoder Representations from Transformers
CNN Convolutional Neural Network
EDA Exploratory Data Analysis
DFD Data Flow Diagram
ER Entity-Relationship
TABLE OF CONTENTS
Candidate’s Declaration 1
Abstract 2
Acknowledgement 3
List of Figures 4
List of Tables 5
List of Abbreviations 6
1 Introduction to Organisation 7
2 Introduction to Project 8
2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Existing System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 User Requirement Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Feasibility Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3 Literature Review 12
3.1 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 Objectives of Project . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4 Exploratory Data Analysis 15
4.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.2 Exploratory Data Analysis and Visualisations . . . . . . . . . . . . . . 16
5 Methodology 21
5.1 Introduction to Languages (Front End and Back End) . . . . . . . . . . . 21
5.2 User characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.3 Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.4 Use Case Model/Flow Chart/DFDs . . . . . . . . . . . . . . . . . . . . 22
5.5 ER Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.6 Assumptions and Dependencies . . . . . . . . . . . . . . . . . . . . . . . 23
5.7 ML algorithm discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
6 Results 24
7 Conclusion and Future Scope 26
References 27
Introduction to Organisation

The School of Engineering and Technology at BMU, under which this project was carried
out, is known for its strong focus on hands-on learning, collaborative projects, and
integration of technology in solving contemporary challenges. With state-of-the-art
infrastructure, expert faculty, and active industry collaboration, BML Munjal University
provides an enriching environment for students to pursue impactful academic and
research-oriented endeavors.
Introduction to Project
2.1 Overview
This project, titled “Fake Review Detection Using Machine Learning,” aims to develop an
automated system that can distinguish between genuine and fake reviews using machine
learning techniques. The approach involves collecting a labeled dataset of reviews,
applying Natural Language Processing (NLP) for text preprocessing, and training
classification models to identify review authenticity.
2.2 Existing System

In the current digital landscape, many online platforms rely on basic filtering techniques
or user reports to identify fake reviews. The existing systems used by e-commerce
websites or review aggregators often depend on manual moderation, rule-based
algorithms, or simple keyword matching to flag suspicious content. While these methods
offer some level of control, they are limited in their ability to accurately detect
sophisticated or intentionally deceptive reviews.
Some platforms incorporate basic Natural Language Processing (NLP) to scan for
spam-like behavior, such as repetitive words, excessive use of promotional language, or
unnatural patterns. However, these static rule-based approaches struggle to adapt to
evolving tactics used by fake reviewers, such as varying sentence structure or using
AI-generated text.
Another major limitation of existing systems is the lack of contextual understanding and
the inability to learn from past data. Many fake reviews are written to closely mimic
genuine ones, making it difficult for traditional systems to identify them without
advanced analysis.
Overall, the existing systems are not scalable, lack automation, and are often unable to
deliver high accuracy in classification. This creates a strong need for machine
learning-based solutions that can learn patterns from data, generalize across different
types of reviews, and improve over time through training.
2.3 User Requirement Analysis

The goal of this project is to develop a system that can automatically detect fake reviews
using machine learning techniques. To ensure the system meets the expectations and
needs of its intended users—such as administrators, businesses, platform moderators,
and end consumers—a detailed analysis of user requirements is essential.
1. Functional Requirements
● Input Interface: Users should be able to input or upload reviews in text format.
● Classification Output: The system should display whether the review is likely to be
fake or genuine.
● Model Integration: The system should integrate trained machine learning models
(e.g., SVM, XGBoost) to perform classification in real time or batch mode.
● Review Storage (Optional): Reviews and their predicted labels should be stored for
future analysis or improvement of the model.
2. Non-Functional Requirements
● Accuracy: The system should maintain a high level of accuracy (above 85%), with
consistent results across datasets.
● Scalability: It should be capable of handling large volumes of review data if
integrated into a live platform.
● Usability: The system should be simple and user-friendly for non-technical users,
particularly moderators or business owners.
3. Stakeholder Expectations
Stakeholders, from platform moderators and business owners to end consumers, expect a detection system that is practical, efficient, and effective in real-world usage. This requirement analysis helps align the system's features with those expectations.
2.4 Feasibility Study

The development of a Fake Review Detection System using machine learning is both
technically and operationally feasible. Leveraging widely-used libraries like scikit-learn
and XGBoost for text classification, the project can efficiently process large datasets of
online reviews with high accuracy. The system will utilize Natural Language Processing
(NLP) techniques to extract features from the reviews, and the chosen machine learning
models, SVM and XGBoost, have demonstrated strong performance in similar tasks. The
system is easy to integrate with existing review platforms, offering an intuitive interface
for moderators to detect fake reviews, and can scale to handle large volumes of data as
required.
Once implemented, the system can automate review verification, reducing operational costs
and improving platform credibility. Additionally, it can deliver a significant return on
investment by enhancing user trust and platform reliability. Thus, the project is feasible
within budgetary constraints, with a high potential for impact in real-world applications.
Literature Review
Fake review detection has gained significant attention with the growth of online
platforms. Early research by Jindal and Liu (2008) focused on identifying inconsistencies in
review content, such as promotional language and unnatural patterns. Mihalcea et al.
(2009) used Support Vector Machines (SVM) to detect fake reviews by analyzing review
features like length and word frequency, setting the stage for SVM's use in review
classification.
Recent studies have integrated Natural Language Processing (NLP) and ensemble
methods. Zhang et al. (2017) combined classifiers like SVM and Logistic Regression with
sentiment analysis, achieving higher accuracy. Deep learning approaches, such as
Convolutional Neural Networks (CNNs), have also been explored. Chen et al. (2020)
demonstrated that deep learning outperformed traditional methods, though it requires
large datasets and computational resources.
XGBoost has emerged as a powerful tool for fake review detection, with Zhao et al. (2020)
showing it delivers high accuracy when combined with text-based features like TF-IDF.
Despite the rise of deep learning, traditional machine learning algorithms like SVM and
XGBoost remain effective, particularly in scenarios with limited data.
3.1 Comparison
3.2 Objectives of Project
To develop an innovative hybrid model for fake review detection that surpasses current approaches.

Specific Objectives
● Create a balanced approach that maintains high accuracy while reducing computational cost.
● Develop a framework that evaluates reviews within their broader context, including product metadata and marketplace dynamics, rather than analyzing text in isolation.
● Build a model with adaptive parameter adjustment capabilities to minimize manual tuning.
● Test and refine the detection system across multiple product categories and service sectors to ensure the model generalizes well.
● Integrate reviewer behavior metrics and temporal analysis to identify coordinated review campaigns.
Exploratory Data Analysis
4.1 Dataset
This project utilizes a balanced dataset of approximately 20,000 online product reviews spanning ten retail categories, including Home and Kitchen, Electronics, and Kindle Store. Each category contains roughly 2,000 reviews, equally distributed between genuine (CG) and potentially deceptive (OR) labels.

Data collection employed a custom-built, ethically compliant web crawler implementing request randomization and proxy rotation to gather diverse review samples while respecting platform usage policies. The extracted data underwent a comprehensive preprocessing pipeline, including text normalization, HTML artifact removal, and standardization of product references, followed by rigorous quality-assurance procedures to eliminate duplicates and verify label consistency.

As evident from the sample Home and Kitchen reviews (Image 1), the dataset captures varied linguistic patterns across different rating levels, with observable correlations between 5-star ratings and genuine labels. Image 2 confirms the balanced distribution of labels across all product categories, with Kindle Store containing the highest volume at approximately 2,300 reviews and Movies and TV the lowest at roughly 1,800 reviews. This carefully curated cross-domain dataset provides a robust foundation for developing and evaluating our fake review detection methodology.
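The loading and quality-assurance steps described above can be sketched in a few lines of pandas. This is a minimal illustration rather than the project's actual code: the file name reviews.csv and the column names category, label, and text_ are assumptions, not details confirmed by the report.

import pandas as pd

# Hypothetical file and column names; adjust to the actual dataset layout.
df = pd.read_csv("reviews.csv")
df = df.drop_duplicates(subset="text_")          # eliminate duplicate reviews
df["text_"] = df["text_"].astype(str).str.strip()  # basic text normalization

print(df["label"].value_counts())                # verify CG/OR balance overall
print(df.groupby("category")["label"].value_counts())  # per-category balance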
4.2 Exploratory Data Analysis and Visualisations
Fig 4.2.1 - Fake vs. Genuine Review Distribution

This bar graph depicts the distribution between fake and genuine reviews in the dataset,
showing approximately 20,000 reviews in each category. The blue bar represents genuine
reviews (labeled as CG) while the peach/orange bar represents fake or deceptive reviews
(labeled as OR). The y-axis measures the count of reviews, reaching just over 20,000 for
each category, and the x-axis identifies the review types. The near-perfect balance
between genuine and fake reviews indicates a deliberately curated dataset designed for
optimal machine learning model training, where class imbalance would not skew the
results.
Fig 4.2.2 - Review Text Length Distribution
The image illustrates the distribution of review text lengths by showing a histogram of
word counts in user reviews. The x-axis represents the number of words per review, while
the y-axis indicates the count of reviews within each range. The plot reveals a highly
right-skewed distribution, with most reviews containing between 10 and 30 words. As the
word count increases, the frequency of reviews declines significantly. A smooth density
curve overlays the histogram, further highlighting that shorter reviews are far more
common than longer ones. This visualization is useful for understanding review behavior
and guiding preprocessing steps in NLP tasks.
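A plot like Fig 4.2.2 can be reproduced with the matplotlib and seaborn libraries named in Chapter 5. The snippet below is a sketch under the same assumptions as the loading sketch in Section 4.1 (a DataFrame df with a text_ column), not the report's original code.

import matplotlib.pyplot as plt
import seaborn as sns

word_counts = df["text_"].str.split().str.len()  # number of words per review
sns.histplot(word_counts, bins=50, kde=True)     # histogram plus density curve
plt.xlabel("Words per review")
plt.ylabel("Count of reviews")
plt.title("Review Text Length Distribution")
plt.show()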
Fig 4.2.3 - Review Counts by Categories
The bar chart visualizes the distribution of review counts across different product
categories. Each horizontal bar represents a distinct category, with its length indicating
the number of reviews received. Among the categories, Kindle_Store_5 has the
highest review count, followed closely by Books_5 and Pet_Supplies_5. The remaining
categories, including Home_and_Kitchen_5, Electronics_5, and Movies_and_TV_5,
have slightly fewer reviews but still show substantial engagement. This visualization
highlights which categories receive the most customer feedback, useful for analyzing
consumer interest and product popularity.
Fig 4.2.4 - Rating Distribution by Label
The stacked bar chart illustrates the distribution of product ratings across two labels:
CG and OR. The x-axis represents the rating scale from 1 to 5, while the y-axis shows
the frequency of each rating. Each bar is segmented by label, with CG in teal and OR in
light yellow. It is evident that the majority of ratings are 5-star, with both labels
contributing significantly but OR having a slight edge. Lower ratings (1 to 3) are
comparatively rare for both categories. This visualization helps compare the sentiment
trends between the two labels, indicating a strong skew toward positive feedback
overall.
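A chart like Fig 4.2.4 can be built from a rating-by-label crosstab. The following is a sketch assuming the df from Section 4.1 has rating and label columns; the report does not specify its exact plotting code.

import matplotlib.pyplot as plt
import pandas as pd

counts = pd.crosstab(df["rating"], df["label"])  # rows: ratings 1-5, cols: CG/OR
counts.plot(kind="bar", stacked=True)            # stacked bars, one per rating
plt.xlabel("Rating")
plt.ylabel("Frequency")
plt.title("Rating Distribution by Label")
plt.show()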
The word clouds for CG and OR reviews reveal both overlap and distinct differences in
language usage. Common prominent words like “book,” “love,” “great,” and “read”
suggest shared themes across labels, particularly related to books and storytelling.
However, CG reviews tend to include more emotionally expressive and qualitative terms
such as “comfortable,” “well written,” and “series,” indicating a focus on experience and
enjoyment. In contrast, OR reviews feature more practical and functional words like
“work,” “fit,” “time,” and “need,” suggesting a utilitarian tone aimed at assessing product
performance. This indicates that CG reviews are generally more subjective, while OR
reviews lean toward objective evaluation.
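Word clouds of this kind are commonly generated with the third-party wordcloud package. The report does not name the tool it used, so the following is only an illustrative sketch under the same column-name assumptions as before.

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# One word cloud per label, built from the concatenated review text.
for lbl in ["CG", "OR"]:
    text = " ".join(df.loc[df["label"] == lbl, "text_"])
    wc = WordCloud(width=800, height=400, background_color="white").generate(text)
    plt.figure()
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Word cloud: {lbl} reviews")
plt.show()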
Methodology
5.1 Introduction to Languages (Front End and Back End)
Front End:
While this project focuses on the back-end machine learning implementation, it does include user-facing output in the form of result visualizations and charts. The technologies below are used for displaying results and could support interactive components in a web-based application, if applicable.
● Python Libraries: Within Google Colab, we use matplotlib and seaborn for generating static visualizations such as confusion matrices, accuracy graphs, and feature importance plots.
Back End:
The machine learning model is developed and trained within Google Colab, a cloud-based platform that supports Python. Colab provides an interactive environment for running Python code without the need to set up local computing resources. This project relies heavily on the following Python libraries for data processing and model development (a minimal end-to-end sketch follows the list):
● Google Colab: The primary platform for running all machine learning experiments
and model training.
● Data Preprocessing: pandas for data manipulation, numpy for numerical
operations, re for text cleaning, and NLP libraries like nltk and spaCy for text
tokenization and other language processing tasks.
● Modeling and Classification: scikit-learn for developing the Support Vector
Machine (SVM) model, and XGBoost for implementing the Extreme Gradient
Boosting (XGBoost) model.
● Text Vectorization: TfidfVectorizer and CountVectorizer are used to
convert raw text data into numerical feature vectors that can be fed into the
machine learning models.
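Putting these libraries together, the described pipeline can be sketched end to end as follows. This is a minimal illustration reusing the df from the Section 4.1 sketch; the column names (text_, label), the 80/20 split, and all hyperparameters are assumptions, not the project's exact configuration.

import re
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

def clean(text: str) -> str:
    # Lowercase, strip non-letter characters, and drop stopwords.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(w for w in text.split() if w not in stop_words)

X = df["text_"].map(clean)
y = (df["label"] == "OR").astype(int)  # 1 = fake (OR), 0 = genuine (CG)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

vec = TfidfVectorizer(max_features=5000)   # vocabulary size is an assumption
X_train_vec = vec.fit_transform(X_train)
X_test_vec = vec.transform(X_test)         # reuse vocabulary fitted on train

svm = LinearSVC().fit(X_train_vec, y_train)
xgb = XGBClassifier(eval_metric="logloss").fit(X_train_vec, y_train)

for name, model in [("SVM", svm), ("XGBoost", xgb)]:
    print(name, accuracy_score(y_test, model.predict(X_test_vec)))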
5.2 User characteristics
The system targets business owners, platform moderators, and consumers on
e-commerce, food delivery, and travel platforms. It helps ensure the authenticity of
online reviews, enhancing trust and transparency for users relying on review systems to
make informed decisions.
5.3 Constraints
● Data Quality: The model’s accuracy depends on clean and representative training
data.
● Feature Selection: The effectiveness of the model is influenced by the chosen
textual features.
● Imbalanced Dataset: Fake reviews are less frequent, which may cause bias toward predicting genuine reviews (a mitigation sketch follows this list).
● Real-Time Processing: The model may need further optimization for large-scale,
real-time review filtering.
● Computational Resources: Scaling the model for high-volume platforms may
require more powerful infrastructure.
● Model Generalization: The model may need adjustments for diverse platforms or
languages beyond the training dataset.
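For the imbalanced-dataset constraint, one common mitigation, shown here as a hypothetical sketch rather than something the report implemented, is to reweight classes inversely to their frequency:

from sklearn.svm import LinearSVC

# class_weight="balanced" penalizes errors on the rarer (fake) class more
# heavily, counteracting bias toward the majority (genuine) class.
svm_balanced = LinearSVC(class_weight="balanced").fit(X_train_vec, y_train)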
5.5 ER Diagram

Results

The SVM, XGBoost, and BERT models were compared on four evaluation metrics:
● Accuracy: SVM achieved the highest accuracy (0.91), followed by XGBoost (0.89)
and BERT (0.81).
● Precision: SVM scored 0.90, slightly ahead of XGBoost (0.89), while BERT lagged at
0.79.
● Recall: SVM again outperformed with a recall of 0.91, XGBoost recorded 0.88, and
BERT had a relatively low recall of 0.72.
● F1-Score: This balanced metric showed SVM at 0.91, XGBoost at 0.89, and BERT at
0.70.
These results indicate that the SVM model consistently performed better across all
metrics, making it the most effective model for fake review detection in this project.
Although BERT is a deep learning model, its performance was comparatively lower,
possibly due to limited data, resource constraints, or lack of fine-tuning.
Fig 6.2 - Confusion matrix for SVM
The confusion matrix for the SVM model shows that it correctly classified 3528 genuine
(CG) and 3634 fake (OR) reviews. Misclassifications include 488 genuine reviews
predicted as fake and 437 fake reviews predicted as genuine.
The XGBoost model produced classification results very close to those of the SVM model, indicating strong performance as well. However, its overall scores were slightly lower than SVM's in accuracy and recall.
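The metrics above and the confusion matrix in Fig 6.2 can be computed with scikit-learn. The sketch below reuses the fitted SVM and test split from the pipeline sketch in Chapter 5 and is illustrative rather than the report's exact code.

import matplotlib.pyplot as plt
from sklearn.metrics import (ConfusionMatrixDisplay, classification_report,
                             confusion_matrix)

y_pred = svm.predict(X_test_vec)
# Per-class precision, recall, and F1 for CG (genuine) and OR (fake).
print(classification_report(y_test, y_pred, target_names=["CG", "OR"]))

cm = confusion_matrix(y_test, y_pred)  # rows: true labels, cols: predictions
ConfusionMatrixDisplay(cm, display_labels=["CG", "OR"]).plot()
plt.title("Confusion matrix for SVM")
plt.show()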
Conclusion and Future Scope
7.1 Conclusion
In this project, two machine learning models, Support Vector Machine (SVM) and XGBoost, were applied to detect fake reviews using natural language processing techniques. The performance of both models was evaluated using accuracy, precision, recall, and F1-score metrics.
The SVM model achieved an overall accuracy of 91%, outperforming XGBoost, which
reached 89% accuracy. SVM also showed balanced performance across both classes
("CG" and "OR"), with precision and recall values consistently around 0.90–0.91. In
comparison, XGBoost, while slightly lower in all metrics, still demonstrated reliable
detection capability.
Based on these results, SVM is the more effective model for identifying fake reviews in
this context, offering better generalization and robustness across different review types.
However, XGBoost remains a strong alternative and may perform better with further
hyperparameter tuning or in ensemble approaches.
References

[1] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.
[2] Lakshmi Kalyani, M., Jacintha, M., Kar, D., Roy, N., & Sharma, V.K. (2020). A Comparative
[3] Wael, A., et al. (2024). Fake Reviews Detection in E-Commerce Using Machine Learning
[4] Rathore, A.A., Bhadane, G.L., Jadhav, A.D., Dhale, K.H., & Muley, J.D. (2023). Fake Reviews
Detection Using NLP Model and Neural Network Model. International Journal of
[6] IIETA. (2023). Fake Review Detection Using Machine Learning. Revue d’Intelligence
Artificielle, 37(5).
[7] Banerjee, S., et al. (2022). Detecting Fake Reviews Using Linguistic Clues and Machine
Learning. Proceedings of the International Conference on Data Mining and Big Data.