
Fake News Report

This document outlines a project on fake news detection using machine learning, focusing on developing a system to classify news articles as real or fake. It details the methodology, including data preprocessing, machine learning models used, and evaluation metrics, while highlighting the importance of automated solutions in combating misinformation. The project aims to contribute to media credibility by providing a tool for journalists and the public to verify news authenticity.

Uploaded by

hellouniversx1


RV COLLEGE OF ENGINEERING
BENGALURU-59
(Autonomous Institution affiliated to VTU, Belagavi)

DEPARTMENT OF ELECTRONICS AND TELECOMMUNICATION ENGINEERING

MACHINE LEARNING (ET352IA)


Semester: V

Experiential Learning

On
“Fake News Detection Using Machine Learning”

Under the guidance of

Dr. K. Nagamani
Head Of Department
Electronics and Telecommunication
R. V. College of Engineering

NAME USN
PRIYANKA N 1RV22ET035
MANOJ S H 1RV22ET026

2024-25
Table of Contents
1. Introduction
1.1 Overview
1.2 Objective
1.3 Scope of the Project

2. Literature Review
2.1 Existing Approaches to Fake News Detection
2.2 Related Works

3. Methodology
3.1 Dataset Description
3.2 Data Preprocessing
3.3 Machine Learning Models Used
3.4 Model Training and Testing

4. Implementation
4.1 Development Environment
4.2 Model Training and Evaluation
4.3 Manual Testing Function

5. Results and Discussion
5.1 Model Performance
5.2 Key Findings
5.3 Limitations

6. Conclusion and Future Work
6.1 Summary of Findings
6.2 Future Improvements

7. References
1. INTRODUCTION

1.1 Overview

Fake news has become a critical issue in the digital era, where information spreads
rapidly through social media, online news platforms, and messaging applications.
The widespread dissemination of false or misleading news can have significant
social, political, and economic consequences. Traditional methods of verifying
news articles rely on human fact-checkers, which is time-consuming and
inefficient. As a result, there is a growing need for automated solutions to detect
and classify fake news accurately. Machine learning and Natural Language
Processing (NLP) provide powerful techniques to analyze and distinguish between
real and fake news by learning patterns from large datasets. This project explores
various machine learning algorithms to develop an effective fake news detection
model.

1.2 Objective

The primary objective of this project is to build a machine learning-based system
capable of classifying news articles as fake or real. This involves preprocessing
textual data, extracting relevant features, and training multiple classifiers to identify
deceptive content. The goal is to evaluate and compare different models, including
Logistic Regression, Decision Tree, Gradient Boosting, and Random Forest
classifiers, to determine the most effective approach. Additionally, a manual
testing function is implemented to allow real-time user input for fake news
classification. By automating fake news detection, this project aims to contribute to
reducing misinformation and enhancing media credibility.

1.3 Scope of the Project

This project focuses on detecting fake news articles based on textual content rather
than images or videos. The dataset used consists of labeled news articles
categorized as real or fake, ensuring supervised learning can be applied. The scope
includes data preprocessing, feature extraction using TF-IDF vectorization,
model training, evaluation, and performance comparison. The study does not
cover deep learning techniques but lays the foundation for future improvements
using LSTMs and Transformers. Furthermore, while the project evaluates
machine learning models, it does not address ethical concerns or the legal
implications of fake news detection. The long-term vision includes integrating the
system into a real-time web application to assist journalists, researchers, and the
general public in verifying news authenticity efficiently.
2. Literature Review
2.1 Existing Approaches to Fake News Detection

Fake news detection has been a growing area of research, with various approaches
developed to address the problem. Traditional methods involve manual fact-
checking by journalists and organizations such as PolitiFact and Snopes, but these
methods are time-consuming and unable to scale effectively. Automated detection
techniques can be broadly classified into linguistic-based, network-based, and
machine learning-based approaches.

Linguistic-based approaches analyze textual content by extracting features such as
sentiment, writing style, and lexical choices. Deceptive news articles often exhibit
exaggerated language, emotional bias, and misleading phrases. Network-based
approaches examine the spread of news across social media platforms by analyzing
user interactions, source credibility, and propagation patterns. Studies have shown
that fake news spreads faster than real news, making network analysis useful for
early detection.

Machine learning-based approaches have gained popularity due to their ability to
learn patterns from large datasets. These methods use Natural Language Processing
(NLP) and supervised learning algorithms to classify news articles. Techniques
such as TF-IDF vectorization, word embeddings (Word2Vec, GloVe), and deep
learning models (LSTMs, Transformers) have been explored for improved
accuracy. However, challenges such as data bias, evolving misinformation tactics,
and adversarial attacks remain areas of concern.

2.2 Related Works

Several studies have explored the use of machine learning for fake news detection.
Zhou et al. (2019) proposed a hybrid model combining TF-IDF features and deep
learning classifiers, achieving high accuracy in text classification tasks. Similarly,
Shu et al. (2020) introduced a fake news detection framework integrating textual
analysis with social network features, demonstrating improved performance by
leveraging propagation patterns.

Other research works have focused on feature engineering techniques to enhance
classification accuracy. Wang (2017) developed the LIAR dataset,
incorporating metadata such as speaker identity and political affiliation to improve
detection. Meanwhile, Singh et al. (2021) compared multiple machine learning
models, concluding that ensemble methods like Gradient Boosting and Random
Forest classifiers outperform traditional algorithms.

Despite advancements, existing models face challenges in generalization and
real-time detection. Many models perform well on specific datasets but struggle with
unseen news articles. Recent research emphasizes the need for explainable AI
(XAI) techniques to provide transparency in fake news classification. Future
studies are exploring Transformer-based models such as BERT and GPT for
enhanced contextual understanding and adaptability in detecting misinformation.

3. Methodology

3.1 Dataset Description

The dataset used for this project consists of labeled news articles categorized as
fake or real to facilitate supervised learning. The data is obtained from publicly
available repositories such as Kaggle and open-source fake news datasets, which
contain verified instances of misleading and authentic news. The dataset includes
various attributes such as:

➢ Title: The headline of the news article.
➢ Text: The full content of the article.
➢ Subject: The category of news (e.g., politics, world news, entertainment).
➢ Date: The publication date of the article.
➢ Class: A binary label indicating whether the news is fake (0) or real (1).

For model training, we combine separate datasets of fake and real news articles to
ensure class balance and prevent model bias. After merging, the dataset is shuffled
and split into training and testing sets to assess model performance.
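
The merge-and-shuffle step described above can be sketched with Pandas as follows. This is a minimal illustration, not the project's actual code: the two tiny DataFrames, their contents, and the `random_state` value are made-up assumptions standing in for the real fake and real news files.

```python
import pandas as pd

# Hypothetical stand-ins for the separate fake and real news files.
fake = pd.DataFrame({
    "title": ["Miracle cure found", "Aliens run the senate"],
    "text": ["doctors hate this trick", "secret sources confirm"],
})
real = pd.DataFrame({
    "title": ["Budget bill passes", "Rates held steady"],
    "text": ["parliament approved the bill", "the central bank kept rates"],
})

# Attach the binary label used in this report: 0 = fake, 1 = real.
fake["class"] = 0
real["class"] = 1

# Merge both sources, then shuffle the rows so the classes are interleaved.
data = pd.concat([fake, real], ignore_index=True)
data = data.sample(frac=1, random_state=42).reset_index(drop=True)

print(data[["title", "class"]])
```

Shuffling before the split keeps the training and testing sets from inheriting the original file order, where all fake articles would otherwise precede all real ones.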

Data Sources

➢ Kaggle Fake News Dataset: A well-known dataset containing labeled fake and
real news articles.
➢ LIAR Dataset: A dataset of short political statements, each labeled on a
six-point scale: true, mostly true, half true, barely true, false, or pants on fire.
➢ Fake News Corpus: A large-scale dataset that includes fake news articles
sourced from various unreliable websites.
3.2 Data Preprocessing

To ensure high-quality input data for machine learning models, several
preprocessing steps are performed on the text data:

3.2.1. Text Cleaning

Raw text often contains unnecessary elements such as punctuation, special
characters, and stopwords that do not contribute to classification accuracy. We
apply the following transformations:

a. Convert text to lowercase for uniformity.
b. Remove punctuation and special characters using regular expressions
(RegEx).
c. Eliminate numerical values that may not provide meaningful insights.
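
The three cleaning steps above can be sketched as a single function. This is a minimal illustration that assumes English-only text; the function name `clean_text` and the exact regular expressions are assumptions, not taken from the project code.

```python
import re


def clean_text(text: str) -> str:
    """Lowercase the text, then drop punctuation, special characters and digits."""
    text = text.lower()
    # Keep only ASCII letters and whitespace; everything else becomes a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse the runs of whitespace left behind by the removals.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(clean_text("BREAKING: 5 new reports!!"))  # breaking new reports
```

Note that this pattern also removes accented and non-Latin letters, so it would need adjusting for multilingual data.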

3.2.2. Tokenization

Tokenization involves splitting the text into individual words or phrases (tokens) to
facilitate further processing. This step helps in analyzing word frequency and
extracting linguistic features.

3.2.3. Stopword Removal

Commonly used words such as "the," "is," "and," and "in" do not contribute to the
classification of news as real or fake. We remove such stopwords using the Natural
Language Toolkit (NLTK) to enhance model efficiency.
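
Tokenization and stopword removal can be sketched as below. To keep the example runnable without downloading NLTK data, it uses a small hand-written stopword set as a stand-in for NLTK's English list; that set and the helper names are assumptions for illustration only.

```python
from typing import List

# Hypothetical stand-in for NLTK's English stopword list (the real list is much longer).
STOPWORDS = {"the", "is", "and", "in", "a", "an", "of", "to"}


def tokenize(text: str) -> List[str]:
    """Split on whitespace; assumes the text was already lowercased and cleaned."""
    return text.split()


def remove_stopwords(tokens: List[str]) -> List[str]:
    """Keep only tokens that are not in the stopword set."""
    return [token for token in tokens if token not in STOPWORDS]


tokens = tokenize("the senator is in the capital and voted")
filtered = remove_stopwords(tokens)
print(filtered)  # ['senator', 'capital', 'voted']
```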

3.2.4. Lemmatization

Lemmatization converts words into their dictionary base forms (lemmas) to reduce
dimensionality while preserving meaning. For example, when the appropriate
part-of-speech tag is supplied, "running" is converted to "run" and "better" is
reduced to "good" using the WordNet Lemmatizer.

3.2.5. Feature Extraction using TF-IDF

To transform text into numerical features, we use Term Frequency-Inverse
Document Frequency (TF-IDF) vectorization. TF-IDF assigns a numerical value to
each word based on its frequency in the document relative to its occurrence in the
entire dataset. This helps in highlighting important words while reducing the
impact of commonly occurring terms. The vectorized text data is then used as input
for machine learning models.
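
To make the weighting concrete, the sketch below computes the textbook form of TF-IDF by hand, with tf = count / document length and idf = ln(N / df). This is an illustrative assumption about the formula: library implementations such as Scikit-learn's TfidfVectorizer apply smoothing and normalization, so their numbers differ. The function name and the toy documents are made up.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document (textbook TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [["election", "fraud", "claim"], ["election", "results", "official"]]
w = tf_idf(docs)
print(w[0])
```

Here "election" appears in both documents, so its weight is zero, while "fraud" appears in only one and receives a positive weight. That is the sense in which TF-IDF reduces the impact of commonly occurring terms.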
3.3 Machine Learning Models Used

We employ multiple machine learning algorithms to classify news articles and
compare their effectiveness.

1. Logistic Regression (LR)

Logistic Regression is a binary classification algorithm that predicts the probability
of an article being real or fake. It is efficient and interpretable, making it a strong
baseline model for text classification tasks.

2. Decision Tree Classifier (DT)

Decision Trees work by creating a hierarchy of decisions based on word
occurrences and relationships. They are effective for capturing non-linear patterns
but may suffer from overfitting if not pruned properly.

3. Gradient Boosting Classifier (GB)

Gradient Boosting is an ensemble learning technique that builds weak models
sequentially, each one trained to correct the errors of those before it, and
combines their predictions to improve accuracy. This mainly reduces bias, and with
suitable regularization it is often more robust than a standalone classifier.

4. Random Forest Classifier (RF)

Random Forest is an ensemble of multiple Decision Trees, which reduces
overfitting by averaging the predictions of different trees. It performs well in text
classification tasks by capturing complex feature interactions.

Each of these models is trained using the preprocessed dataset, and their
performance is evaluated based on accuracy, precision, recall, and F1-score.

3.4 Model Training and Testing

The dataset is split into training (75%) and testing (25%) subsets to evaluate model
performance. The following steps are carried out during training and testing:
3.4.1. Data Splitting

Using train_test_split from Scikit-learn, we divide the dataset:

a) Training Set (75%): Used for model training.
b) Testing Set (25%): Used for evaluating model generalization.
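
A 75/25 split with train_test_split might look like the sketch below. The toy texts and labels are made up, and the `random_state` and `stratify` arguments are assumptions added for reproducibility and class balance; the report does not say whether they were used.

```python
from sklearn.model_selection import train_test_split

# Made-up placeholder data: 8 articles, alternating fake (0) and real (1).
texts = [f"article {i}" for i in range(8)]
labels = [0, 1] * 4

x_train, x_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.25,     # 25% of the rows go to the testing set
    random_state=42,    # assumption: fixed seed so the split is repeatable
    stratify=labels,    # assumption: keep the class ratio equal in both subsets
)

print(len(x_train), len(x_test))  # 6 2
```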

3.4.2. Model Training

Each model is trained using the TF-IDF vectorized text data and labeled classes.
The training process involves:

a) Fitting the model to the training data.
b) Adjusting hyperparameters to optimize performance.
c) Evaluating training accuracy to detect potential overfitting.
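
The training loop and the train-versus-test comparison from step (c) can be sketched as follows. The corpus is a made-up toy set that is far too small to give meaningful scores, and the hyperparameter values shown (`max_iter`, `random_state`) are assumptions rather than the project's settings.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Toy, made-up corpus (0 = fake, 1 = real).
train_texts = [
    "shocking miracle cure doctors hate",
    "aliens secretly control the senate",
    "celebrity clone spotted at secret base",
    "parliament passes annual budget bill",
    "central bank holds interest rates steady",
    "city council approves new bus routes",
]
train_labels = [0, 0, 0, 1, 1, 1]
test_texts = ["secret miracle cure spotted", "council passes budget"]
test_labels = [0, 1]

# Fit the vocabulary on the training texts only, then reuse it for the test texts.
vectorizer = TfidfVectorizer()
xv_train = vectorizer.fit_transform(train_texts)
xv_test = vectorizer.transform(test_texts)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=0),
    "GB": GradientBoostingClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(xv_train, train_labels)
    scores[name] = (
        model.score(xv_train, train_labels),  # training accuracy
        model.score(xv_test, test_labels),    # test accuracy
    )

# A large gap between the two numbers is a sign of possible overfitting.
for name, (train_acc, test_acc) in scores.items():
    print(f"{name}: train={train_acc:.2f} test={test_acc:.2f} gap={train_acc - test_acc:.2f}")
```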

3.4.3. Model Evaluation

After training, the models are tested on unseen data. We use classification metrics
such as:

a) Accuracy: Measures the percentage of correctly classified articles.
b) Precision & Recall: Evaluate how well the model identifies fake vs. real
news.
c) F1-Score: Provides a balance between precision and recall.
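
The metrics above follow directly from counts of correct and incorrect predictions. The sketch below computes them by hand for the fake class (label 0); the function name and the example labels are made up for illustration, and in practice Scikit-learn's classification_report produces the same quantities.

```python
def classification_metrics(y_true, y_pred, positive=0):
    """Accuracy, precision, recall and F1, treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 0 = fake, 1 = real.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]
m = classification_metrics(y_true, y_pred)
print(m)
```

In this example three of the four fake articles are caught (recall 0.75) and three of the four articles flagged as fake really are fake (precision 0.75), so the F1-score is also 0.75.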

3.4.4. Manual Testing Function

To allow real-time classification, a manual testing function is implemented. The
function accepts a news article as input, applies text preprocessing, and uses trained
models to predict its authenticity. The results from Logistic Regression, Decision
Tree, Gradient Boosting, and Random Forest are displayed, allowing users to
compare predictions across multiple classifiers.
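
A manual testing function along these lines might look like the sketch below. The names `manual_testing`, `clean_text`, and `LABELS`, the toy training data, and the model settings are all assumptions for illustration; the project's own function may differ.

```python
import re

from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

LABELS = {0: "Fake News", 1: "Real News"}  # label convention used in this report


def clean_text(text):
    """Same cleaning as at training time: lowercase, letters only, single spaces."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def manual_testing(news, vectorizer, models):
    """Return each fitted model's verdict on a single news text."""
    features = vectorizer.transform([clean_text(news)])
    return {name: LABELS[int(model.predict(features)[0])] for name, model in models.items()}


# Toy, made-up training data so the example runs end to end.
texts = [
    "shocking miracle cure doctors hate",
    "aliens secretly control the senate",
    "parliament passes annual budget bill",
    "central bank holds interest rates steady",
]
labels = [0, 0, 1, 1]

vectorizer = TfidfVectorizer()
features_train = vectorizer.fit_transform([clean_text(t) for t in texts])
models = {
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=0),
    "GB": GradientBoostingClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
}
for model in models.values():
    model.fit(features_train, labels)

result = manual_testing("Parliament passes the budget bill!", vectorizer, models)
for name, verdict in result.items():
    print(f"{name} prediction: {verdict}")
```

The input must pass through the same cleaning and the same fitted vectorizer as the training data; fitting a new vectorizer on the input text would produce features the models have never seen.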
4. Implementation
4.1 Development Environment

The development and implementation of the fake news detection system were
carried out in Jupyter Notebook, which provides an interactive computing
environment suitable for Python-based data science and machine learning tasks.
The primary software and tools used for implementing this project include the
following:

a. Programming Language: Python 3.x
b. Integrated Development Environment (IDE): Jupyter Notebook (running in
Anaconda)
c. Libraries and Frameworks:
   • Pandas: Used for data manipulation and cleaning.
   • NumPy: Employed for numerical computations and handling arrays.
   • Matplotlib and Seaborn: Utilized for data visualization.
   • Scikit-learn: Provides the machine learning algorithms, evaluation metrics, and
     utilities.
   • TfidfVectorizer: Used for converting text data into a numerical format suitable
     for machine learning models.
   • Logistic Regression, Decision Tree Classifier, Random Forest Classifier, and
     Gradient Boosting Classifier: The machine learning models used for
     classification.

4.2 Dataset Splitting

The dataset was divided into training and testing sets using the train_test_split
function from the scikit-learn library. The data was split in a ratio of 75% for
training and 25% for testing, so that the models have enough data to learn from
while a held-out test set remains for performance evaluation.
4.3 Model Training
5. Results and Discussion
5.1 Model Performance

The performance of the machine learning models was evaluated using various
metrics such as accuracy, precision, recall, and F1-score, all of which were
generated using the classification_report function from scikit-learn. These metrics
provide insight into how well each model is able to classify fake and real news
articles.

Logistic Regression:

Logistic Regression showed reasonable performance with a moderate accuracy
rate. The precision and recall for fake news were relatively lower compared to real
news, suggesting that the model struggles slightly with identifying fake news
accurately.

Decision Tree Classifier:

The Decision Tree model demonstrated good accuracy, but it showed a tendency to
overfit, especially when the depth of the tree was large. This led to high accuracy
on the training data but lower performance on the test data.

Random Forest Classifier:

The Random Forest model, being an ensemble of decision trees, performed better
than the individual Decision Tree model. It achieved higher accuracy and a better
balance between precision and recall, indicating its ability to generalize better for
unseen data.

Gradient Boosting Classifier:

Gradient Boosting performed excellently, achieving the highest accuracy among all
models tested. It provided balanced precision and recall scores, making it the most
reliable model for this task.

Each model was assessed on its ability to handle the imbalanced nature of the
dataset, where the number of fake news articles was lower than real ones. The
models' performance in this regard varied, with Gradient Boosting being the most
robust to class imbalance.
5.2 Key Findings

Gradient Boosting outperforms other models: Among all the models tested,
Gradient Boosting Classifier achieved the highest overall accuracy and balanced
precision-recall scores. This suggests that ensemble methods such as Gradient
Boosting are highly effective in handling text classification problems.

Random Forest is a strong competitor: Although not as accurate as Gradient
Boosting, Random Forest Classifier performed very well, demonstrating the
strength of ensemble methods in dealing with complex datasets like news
classification.

Logistic Regression is quick but less accurate: Logistic Regression, being a simpler
model, was faster to train but had lower performance compared to the ensemble
methods. It was less effective at distinguishing fake news, which indicates the need
for more complex models in such tasks.

Overfitting in Decision Trees: The Decision Tree Classifier showed signs of
overfitting, especially when the tree depth was not controlled. This led to high
variance and poor performance on the test data.

Text data pre-processing is crucial: The text pre-processing steps, including the use
of TF-IDF vectorization, were critical in ensuring the models received clean and
informative features for learning. Removing stopwords and normalizing the text
helped reduce noise and improved the overall performance.

5.3 Limitations

• Class Imbalance: Although techniques like Random Forest and Gradient
Boosting handle class imbalance better than others, the dataset still exhibited
an imbalance between real and fake news articles. This imbalance can affect
model performance, especially in terms of precision and recall for the
minority class (fake news).

• Data Quality: The quality of the dataset is a significant factor. Although the
dataset was sourced from reliable repositories, there could still be errors in
the labeling of news articles. Mislabeling can lead to inaccurate model
predictions.

• Model Interpretability: Complex models like Random Forest and Gradient
Boosting are often considered black-box models. This lack of interpretability
makes it difficult to understand why certain news articles are classified as
fake or real, which can be an issue for decision-making in real-world
applications.

• Limited Dataset: The size and diversity of the dataset used in this study may
not be representative of the vast range of news articles available globally.
This limitation can reduce the generalizability of the models to different
types of news sources and domains.

• Textual Features Only: The models were trained using only textual features
(the content of the news article). Future models could benefit from
incorporating additional features such as the source of the article, author
information, or social media signals, which may provide more context and
improve classification accuracy.

6. Conclusion and Future Work

6.1 Summary of Findings


In this project, we implemented and evaluated several machine learning models for
fake news detection. The models included Logistic Regression, Decision Tree
Classifier, Random Forest Classifier, and Gradient Boosting Classifier. Our key
findings from the experiments are summarized as follows:

• Gradient Boosting Classifier outperformed all other models in terms of
accuracy and balance between precision and recall. This model demonstrated
the best generalization to unseen data and was the most robust to class
imbalance.

• Random Forest Classifier also performed well, showing good accuracy and
providing a strong alternative to Gradient Boosting.

• Decision Tree Classifier exhibited overfitting, leading to high variance in
performance between the training and testing sets.

• Logistic Regression was the least effective model, though it was faster to
train. However, it had lower accuracy compared to the ensemble models.

• Text pre-processing, including the use of TF-IDF for feature extraction,
played a crucial role in improving model performance by ensuring that the
models received relevant information from the text data.

• The project highlights the effectiveness of ensemble methods in tackling text
classification tasks like fake news detection and underscores the importance
of a balanced dataset for accurate model performance.

6.2 Future Improvements

Incorporating Additional Features


Future models could benefit from integrating additional features beyond the text
content of the articles. Features such as author information, publication source,
article metadata, and social media signals (e.g., shares, likes, and comments) can
provide valuable context and improve classification performance.

Handling Class Imbalance More Effectively


While Gradient Boosting and Random Forest models handled class imbalance
better than other models, there is room for improvement in dealing with highly
imbalanced datasets. Future work could explore advanced techniques such as
SMOTE (Synthetic Minority Over-sampling Technique) or cost-sensitive learning
to further improve performance on the minority class (fake news).

Exploring More Advanced Models


Incorporating more advanced models, such as Deep Learning (e.g., Recurrent
Neural Networks or Transformers), could improve the ability to capture complex
patterns in the text data. Pre-trained models like BERT or GPT could be fine-tuned
for the fake news detection task to enhance accuracy and provide better
generalization.

Model Interpretability
Given that the models used in this project, particularly Random Forest and
Gradient Boosting, are often considered "black-box" models, future work could
focus on improving model interpretability. Implementing techniques like LIME
(Local Interpretable Model-Agnostic Explanations) or SHAP (Shapley Additive
Explanations) could provide insights into how the models make predictions, which
is important in practical applications, especially in sensitive fields like news
classification.

Expanding the Dataset
The performance of the model can be further enhanced by expanding the dataset to
include a broader variety of news articles, covering more topics and regions.
Additionally, more diverse and balanced datasets can help improve the robustness
of the model, ensuring better generalization to real-world scenarios.

Real-time Detection
Incorporating real-time detection capabilities could enhance the practical
application of this system. Integrating the model with news aggregators or social
media platforms would allow for the identification of fake news as it is being
published, enabling quicker interventions.

7. References
[1] S. S. Kulkarni, R. R. Deshmukh, and M. R. Bendre, "Fake news detection using machine
learning algorithms," Journal of Computer Science and Technology, vol. 35, no. 3, pp. 172-179,
Jun. 2020.

[2] S. K. Ghosh, "Machine learning for fake news detection: A comprehensive review,"
Proceedings of the International Conference on Machine Learning and Data Engineering, pp.
89-96, 2019.

[3] T. S. Zahran, K. H. Ghoneim, and F. A. Ahmed, "Fake news detection on social media using
deep learning," Computational Intelligence and Neuroscience, vol. 2020, Article ID 7282050,
2020.

[4] J. R. F. Gomes, L. A. S. Albuquerque, and P. M. R. G. Silva, "A hybrid ensemble model for
fake news detection," Expert Systems with Applications, vol. 115, pp. 156-168, Apr. 2019.

[5] B. Wang and T. Yang, "Using natural language processing and machine learning for fake
news detection," Journal of Artificial Intelligence Research, vol. 67, pp. 114-132, Jul. 2019.

[6] F. Zhang, Z. Xie, and M. Li, "Combining machine learning algorithms for fake news
detection," Journal of Information Science, vol. 45, no. 4, pp. 533-542, Aug. 2019.

[7] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep
bidirectional transformers for language understanding," Proceedings of the 2019 Conference of
the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, pp. 4171-4186, Jun. 2019.

[8] R. L. Chouhan and D. Sharma, "A review of gradient boosting techniques in machine
learning," Journal of Computer Applications, vol. 28, no. 3, pp. 55-62, Mar. 2020.

[9] M. Shapira and H. Shapira, "Understanding and improving decision tree-based classifiers for
fake news detection," International Journal of Data Science and Analytics, vol. 5, pp. 223-235,
Feb. 2021.
