Fake News Report
RV COLLEGE OF ENGINEERING
BENGALURU-59
(Autonomous Institution affiliated to VTU, Belagavi)
Experiential Learning
On
“Fake News Detection Using Machine Learning”
Dr. K. Nagamani
Head Of Department
Electronics and Telecommunication
R. V. College of Engineering
NAME USN
PRIYANKA N 1RV22ET035
MANOJ S H 1RV22ET026
2024-25
Table of Contents
1. Introduction
1.1 Overview
1.2 Objective
1.3 Scope of the Project
2. Literature Review
2.1 Existing Approaches to Fake News Detection
2.2 Related Works
3. Methodology
3.1 Dataset Description
3.2 Data Preprocessing
3.3 Machine Learning Models Used
3.4 Model Training and Testing
4. Implementation
4.1 Development Environment
4.2 Model Training and Evaluation
4.3 Manual Testing Function
7. References
1. Introduction
1.1 Overview
Fake news has become a critical issue in the digital era, where information spreads
rapidly through social media, online news platforms, and messaging applications.
The widespread dissemination of false or misleading news can have significant
social, political, and economic consequences. Traditional verification relies on
human fact-checkers, a process that is time-consuming and difficult to scale.
As a result, there is a growing need for automated solutions to detect
and classify fake news accurately. Machine learning and Natural Language
Processing (NLP) provide powerful techniques to analyze and distinguish between
real and fake news by learning patterns from large datasets. This project explores
various machine learning algorithms to develop an effective fake news detection
model.
1.3 Scope of the Project
This project focuses on detecting fake news articles based on textual content rather
than images or videos. The dataset used consists of labeled news articles
categorized as real or fake, ensuring supervised learning can be applied. The scope
includes data preprocessing, feature extraction using TF-IDF vectorization,
model training, evaluation, and performance comparison. The study does not
cover deep learning techniques but lays the foundation for future improvements
using LSTMs and Transformers. Furthermore, while the project evaluates
machine learning models, it does not address ethical concerns or the legal
implications of fake news detection. The long-term vision includes integrating the
system into a real-time web application to assist journalists, researchers, and the
general public in verifying news authenticity efficiently.
2. Literature Review
2.1 Existing Approaches to Fake News Detection
Fake news detection has been a growing area of research, with various approaches
developed to address the problem. Traditional methods involve manual fact-
checking by journalists and organizations such as PolitiFact and Snopes, but these
methods are time-consuming and unable to scale effectively. Automated detection
techniques can be broadly classified into linguistic-based, network-based, and
machine learning-based approaches.
2.2 Related Works
Several studies have explored the use of machine learning for fake news detection.
Zhou et al. (2019) proposed a hybrid model combining TF-IDF features and deep
learning classifiers, achieving high accuracy in text classification tasks. Similarly,
Shu et al. (2020) introduced a fake news detection framework integrating textual
analysis with social network features, demonstrating improved performance by
leveraging propagation patterns.
3. Methodology
3.1 Dataset Description
The dataset used for this project consists of labeled news articles categorized as
fake or real to facilitate supervised learning. The data is obtained from publicly
available repositories such as Kaggle and open-source fake news datasets, which
contain verified instances of misleading and authentic news. The dataset includes
various attributes such as:
For model training, we combine separate datasets of fake and real news articles to
ensure class balance and prevent model bias. After merging, the dataset is shuffled
and split into training and testing sets to assess model performance.
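The merge-and-shuffle step described above can be sketched with pandas. This is a minimal illustration, not the project's actual code: the two small DataFrames, the column names `text` and `label`, and the 0 = fake / 1 = real encoding are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical stand-ins for the separate fake and real news files;
# in practice these would be loaded from disk, e.g. with pd.read_csv(...).
fake_df = pd.DataFrame({"text": ["Aliens endorse candidate", "Miracle pill cures all"]})
real_df = pd.DataFrame({"text": ["Council approves budget", "Rainfall expected on Friday"]})

# Attach the class label before merging (0 = fake, 1 = real is an assumed encoding).
fake_df["label"] = 0
real_df["label"] = 1

# Merge the two sources, then shuffle the rows so the classes are interleaved.
# random_state makes the shuffle repeatable.
news_df = pd.concat([fake_df, real_df], ignore_index=True)
news_df = news_df.sample(frac=1, random_state=42).reset_index(drop=True)

print(news_df["label"].value_counts().to_dict())
```

Shuffling before the split matters because the concatenated frame is otherwise ordered by class.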
Data Sources
• Kaggle Fake News Dataset: A well-known dataset containing labeled fake and
real news articles.
• LIAR Dataset: A dataset of short political statements, each assigned one of
six labels: true, mostly true, half true, barely true, false, or pants on fire.
• Fake News Corpus: A large-scale dataset that includes fake news articles
sourced from various unreliable websites.
3.2 Data Preprocessing
3.2.2. Tokenization
Tokenization involves splitting the text into individual words or phrases (tokens) to
facilitate further processing. This step helps in analyzing word frequency and
extracting linguistic features.
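As a rough illustration of tokenization, the sketch below uses a simple regular expression from the Python standard library. The report does not state which tokenizer was used, so the `tokenize` helper and its pattern are assumptions; a library tokenizer would handle punctuation and contractions more carefully.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase the text and return runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("Breaking: Scientists CONFIRM the moon isn't cheese!")
print(tokens)
# ['breaking', 'scientists', 'confirm', 'the', 'moon', "isn't", 'cheese']
```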
3.2.3. Stopword Removal
Commonly used words such as "the," "is," "and," and "in" do not contribute to the
classification of news as real or fake. We remove such stopwords using the Natural
Language Toolkit (NLTK) to enhance model efficiency.
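Stopword filtering amounts to a set-membership test on each token. The sketch below uses a tiny hand-written stopword set so that it runs without downloading any corpus; this set is an assumption for illustration only, and NLTK's English stopword list is considerably longer.

```python
# A small illustrative stopword set (assumed); NLTK's English list is much longer.
STOPWORDS = {"the", "is", "and", "in", "a", "of", "to"}

def remove_stopwords(tokens: list[str]) -> list[str]:
    """Drop tokens found in the stopword set, preserving the original order."""
    return [tok for tok in tokens if tok not in STOPWORDS]

filtered = remove_stopwords(["the", "senator", "is", "in", "talks", "and", "negotiations"])
print(filtered)  # ['senator', 'talks', 'negotiations']
```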
3.2.4. Lemmatization
Lemmatization converts words into their dictionary base forms (lemmas) to reduce
dimensionality while preserving meaning. For example, given the appropriate
part-of-speech tag, the WordNet Lemmatizer converts "running" to "run" (verb) and
"better" to "good" (adjective).
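3.3 Machine Learning Models Used
Four classifiers are compared: Logistic Regression, Decision Tree, Random Forest,
and Gradient Boosting. Each model is trained using the preprocessed dataset, and
its performance is evaluated based on accuracy, precision, recall, and F1-score.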
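The sketch below shows one way the four classifiers discussed in this report (Logistic Regression, Decision Tree, Random Forest, and Gradient Boosting) could be trained on TF-IDF features with scikit-learn. The toy corpus, the 0 = fake / 1 = real encoding, and all hyperparameters are assumptions for illustration; the scores it prints are training accuracies on eight sentences and say nothing about real performance.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy corpus invented for illustration; 0 = fake, 1 = real (assumed encoding).
texts = [
    "shocking secret cure revealed by insider",
    "shocking celebrity scandal revealed",
    "miracle diet melts fat overnight",
    "aliens secretly control the senate",
    "official budget report released today",
    "election results certified by commission",
    "health ministry publishes vaccination data",
    "central bank holds interest rates steady",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

fitted = {}
for name, model in models.items():
    # Each pipeline learns its own TF-IDF vocabulary, then fits the classifier on it.
    pipeline = make_pipeline(TfidfVectorizer(), model)
    pipeline.fit(texts, labels)
    fitted[name] = pipeline
    train_acc = accuracy_score(labels, pipeline.predict(texts))
    print(f"{name}: training accuracy = {train_acc:.2f}")
```

Wrapping the vectorizer and the classifier in one pipeline keeps the vocabulary tied to the training data, so the same transformation is applied automatically at prediction time.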
3.4 Model Training and Testing
The dataset is split into training (75%) and testing (25%) subsets to evaluate model
performance. The following steps are carried out during training and testing:
3.4.1. Data Splitting
Each model is trained using the TF-IDF vectorized text data and labeled classes.
The training process involves:
After training, the models are tested on unseen data. We use classification metrics
such as accuracy, precision, recall, and F1-score.
4. Implementation
4.1 Development Environment
The development and implementation of the fake news detection system were
carried out in Jupyter Notebook, which provides an interactive computing
environment suitable for Python-based data science and machine learning tasks.
The primary software and tools used for implementing this project include the
following:
The dataset was divided into training and testing sets using the train_test_split
function from the scikit-learn library. The data was split in a ratio of 75% for
training and 25% for testing, so that the model has enough data to learn from
while a held-out test set remains for performance evaluation.
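A 75/25 split with `train_test_split` might look like the following sketch. The placeholder documents and labels are invented, and the use of `stratify` and `random_state` is an assumption; the report only specifies the split ratio.

```python
from sklearn.model_selection import train_test_split

# Placeholder documents and labels (invented); 0 = fake, 1 = real.
texts = [f"article {i}" for i in range(20)]
labels = [0] * 10 + [1] * 10

# test_size=0.25 gives the 75/25 split; stratify keeps the class ratio similar
# in both subsets, and random_state makes the split repeatable (both assumed).
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

print(len(X_train), len(X_test))  # 15 5
```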
4.3 Model Training
5. Results and Discussion
5.1 Model Performance
The performance of the machine learning models was evaluated using various
metrics such as accuracy, precision, recall, and F1-score, all of which were
generated using the classification_report function from scikit-learn. These metrics
provide insight into how well each model is able to classify fake and real news
articles.
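To make the metrics concrete, the sketch below runs `classification_report` on a small hand-made set of predictions. The label vectors and class names are invented for illustration and are not results from this project.

```python
from sklearn.metrics import classification_report, precision_recall_fscore_support

# Invented ground truth and predictions; 0 = fake, 1 = real (assumed encoding).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# Text summary with per-class precision, recall, F1-score, and support.
print(classification_report(y_true, y_pred, target_names=["fake", "real"]))

# The same per-class numbers as arrays, indexed by class label.
precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred)
```

For the "fake" class, 5 articles are predicted fake and 3 of them are correct, so precision is 3/5 = 0.60; 3 of the 4 truly fake articles are found, so recall is 3/4 = 0.75.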
Logistic Regression:
Decision Tree Classifier:
The Decision Tree model demonstrated good accuracy, but it showed a tendency to
overfit, especially when the depth of the tree was large. This led to high accuracy
on the training data but lower performance on the test data.
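The effect of tree depth on overfitting can be checked by comparing training and test accuracy at several depths. The sketch below uses synthetic numeric data from `make_classification` as a stand-in for TF-IDF features; the dataset parameters and the chosen depths are assumptions, not the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for TF-IDF features; flip_y adds label noise,
# which gives a deep tree something to memorize.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

results = {}
for depth in (2, 5, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    results[depth] = (train_acc, test_acc)
    print(f"max_depth={depth}: train={train_acc:.2f}, test={test_acc:.2f}")
```

A widening gap between the training and test columns as the depth grows is the usual sign of overfitting; limiting `max_depth` is one common way to reduce it.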
Random Forest Classifier:
The Random Forest model, being an ensemble of decision trees, performed better
than the individual Decision Tree model. It achieved higher accuracy and a better
balance between precision and recall, indicating its ability to generalize better for
unseen data.
Gradient Boosting Classifier:
Gradient Boosting performed excellently, achieving the highest accuracy among all
models tested. It provided balanced precision and recall scores, making it the most
reliable model for this task.
Each model was assessed on its ability to handle the imbalanced nature of the
dataset, where the number of fake news articles was lower than real ones. The
models' performance in this regard varied, with Gradient Boosting proving the
most robust to class imbalance.
5.2 Key Findings
Gradient Boosting outperforms other models: Among all the models tested,
Gradient Boosting Classifier achieved the highest overall accuracy and balanced
precision-recall scores. This suggests that ensemble methods such as Gradient
Boosting are highly effective in handling text classification problems.
Logistic Regression is quick but less accurate: Logistic Regression, being a simpler
model, was faster to train but had lower performance compared to the ensemble
methods. It was less effective at distinguishing fake news, which indicates the need
for more complex models in such tasks.
Text data pre-processing is crucial: The text pre-processing steps, including the use
of TF-IDF vectorization, were critical in ensuring the models received clean and
informative features for learning. Removing stopwords and normalizing the text
helped reduce noise and improved the overall performance.
5.3 Limitations
• Data Quality: The quality of the dataset is a significant factor. Although the
dataset was sourced from reliable repositories, there could still be errors in
the labeling of news articles. Mislabeling can lead to inaccurate model
predictions.
• Model Interpretability: Complex models like Random Forest and Gradient
Boosting are often considered black-box models. This lack of interpretability
makes it difficult to understand why certain news articles are classified as
fake or real, which can be an issue for decision-making in real-world
applications.
• Limited Dataset: The size and diversity of the dataset used in this study may
not be representative of the vast range of news articles available globally.
This limitation can reduce the generalizability of the models to different
types of news sources and domains.
• Textual Features Only: The models were trained using only textual features
(the content of the news article). Future models could benefit from
incorporating additional features such as the source of the article, author
information, or social media signals, which may provide more context and
improve classification accuracy.
6. Conclusion and Future Work
• Random Forest Classifier also performed well, showing good accuracy and
providing a strong alternative to Gradient Boosting.
• Logistic Regression was the least effective model: it was faster to train but
had lower accuracy than the ensemble models.
• Text pre-processing, including the use of TF-IDF for feature extraction,
played a crucial role in improving model performance by ensuring that the
models received relevant information from the text data.
Model Interpretability
Given that the models used in this project, particularly Random Forest and
Gradient Boosting, are often considered "black-box" models, future work could
focus on improving model interpretability. Implementing techniques like LIME
(Local Interpretable Model-agnostic Explanations) or SHAP (SHapley Additive
exPlanations) could provide insights into how the models make predictions, which
is important in practical applications, especially in sensitive fields like news
classification.
Expanding the Dataset
The performance of the model can be further enhanced by expanding the dataset to
include a broader variety of news articles, covering more topics and regions.
Additionally, more diverse and balanced datasets can help improve the robustness
of the model, ensuring better generalization to real-world scenarios.
Real-time Detection
Incorporating real-time detection capabilities could enhance the practical
application of this system. Integrating the model with news aggregators or social
media platforms would allow for the identification of fake news as it is being
published, enabling quicker interventions.
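Any real-time integration would need, at minimum, a function that takes raw article text and returns a label. The sketch below is one hypothetical shape for such a function: the name `classify_article`, the toy training corpus, the choice of Logistic Regression, and the 0 = fake / 1 = real encoding are all assumptions for illustration, not the project's implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training corpus invented for illustration; 0 = fake, 1 = real (assumed).
train_texts = [
    "shocking secret cure revealed by insider",
    "shocking celebrity scandal revealed",
    "miracle diet melts fat overnight",
    "aliens secretly control the senate",
    "official budget report released today",
    "election results certified by commission",
    "health ministry publishes vaccination data",
    "central bank holds interest rates steady",
]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# One pipeline holds both the TF-IDF vocabulary and the trained classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

def classify_article(text: str) -> str:
    """Return a human-readable label for a single piece of article text."""
    prediction = model.predict([text])[0]
    return "Real News" if prediction == 1 else "Fake News"

print(classify_article("shocking secret cure revealed"))
print(classify_article("official budget report released"))
```

In a deployed system the fitted pipeline would be trained once on the full dataset and loaded at start-up, so that each incoming article only incurs the cost of a single prediction.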