
Fake News Detection Using Machine Learning Algorithms
Uma Sharma, Sidarth Saran, Shankar M. Patil
Department of Information Technology
Bharati Vidyapeeth College of Engineering
Navi Mumbai, India
eqtasharma@gmail.com , siddharthsaran00@gmail.com , smpatil2k@gmail.com

Abstract: In our modern era where the internet is ubiquitous, everyone relies on various online resources for news. Along with the increase in the use of social media platforms like Facebook and Twitter, news spreads rapidly among millions of users within a very short span of time. The spread of fake news has far-reaching consequences, from the creation of biased opinions to the swaying of election outcomes for the benefit of certain candidates. Moreover, spammers use appealing news headlines to generate advertisement revenue via click-baits. In this paper, we aim to perform binary classification of various news articles available online with the help of concepts pertaining to Artificial Intelligence, Natural Language Processing and Machine Learning. We aim to provide the user with the ability to classify news as fake or real and also to check the authenticity of the website publishing the news.

Keywords: Internet, Social Media, Fake News, Classification, Artificial Intelligence, Machine Learning, Websites, Authenticity.

I. INTRODUCTION
As an increasing amount of our lives is spent interacting online through social media platforms, more and more people tend to seek out and consume news from social media instead of traditional news organizations [1]. The reasons for this change in consumption behaviour are inherent in the nature of these social media platforms: (i) it is often more timely and less expensive to consume news on social media than through traditional journalism such as newspapers or television; and (ii) it is easier to share, comment on and discuss the news with friends or other readers on social media. For instance, 62 percent of U.S. adults got news on social media in 2016, while in 2012 only 49 percent reported seeing news on social media [1]. It has also been found that social media now outperforms television as the major news source. Despite the benefits provided by social media, the quality of stories on social media is lower than that of traditional news organizations. However, because it is inexpensive to produce news online and far faster and easier to propagate it through social media, large volumes of fake news, i.e., news articles with intentionally false information, are produced online for a variety of purposes, such as financial and political gain. It was estimated that over 1 million tweets were related to the fake news story "Pizzagate" by the end of the presidential election. Given the prevalence of this new phenomenon, "fake news" was even named the word of the year by the Macquarie Dictionary in 2016 [2].
The extensive spread of fake news can have a significant negative impact on individuals and society. First, fake news can shatter the authenticity equilibrium of the news ecosystem; for instance, it is evident that the most popular fake news spread even more widely on Facebook than the most accepted genuine mainstream news during the U.S. 2016 presidential election. Second, fake news intentionally persuades consumers to accept biased or false beliefs; fake news is typically manipulated by propagandists to convey political messages or influence. For instance, some reports show that Russia has created fake accounts and social bots to spread false stories. Third, fake news changes the way people interpret and respond to real news; for instance, some fake news is created simply to trigger people's distrust and confuse them, impeding their ability to differentiate what is true from what is not. To help mitigate the negative effects caused by fake news, both to benefit the public and the news ecosystem, it is crucial that we develop methods to automatically detect fake news broadcast on social media [3].
The Internet and social media have made access to news information much easier and more convenient [2]. Internet users can often follow events of interest online, and the growing number of mobile devices makes this process even easier.
But with great possibilities come great challenges. Mass media have an enormous influence on society, and, as often happens, there is someone who wants to take advantage of this fact. Sometimes, to achieve certain goals, mass media may manipulate information in different ways. This results in news articles that are not completely true or are even completely false. There even exist many websites that produce fake news almost exclusively. They intentionally publish hoaxes, half-truths, propaganda and disinformation asserting to be real news, often using social media to drive web traffic and amplify their effect. The main goal of fake news websites is to affect public opinion on certain matters (mostly political). Examples of such websites can be found in Ukraine, the United States of America, Germany, China and many other countries [4]. Thus, fake news is a global issue as well as a global challenge. Many scientists believe that the fake news issue can be addressed by means of machine learning and AI [5]. There is a reason for that: recently AI algorithms have begun to work much better on many classification problems (image recognition, voice detection and so on) because hardware is cheaper and larger datasets are available.
There are several influential articles about automatic deception detection. In [6] the authors provide a general overview of the available techniques for the problem. In [7] the authors describe their method for fake news detection based on the feedback for specific news items in microblogs. In [8] the authors develop two systems for deception detection based on support vector machines and a Naive Bayes classifier (the latter is also employed in the system described in this paper), respectively. They collect the data by asking people to directly provide true or false information on several topics: abortion, the death penalty and friendship. The detection accuracy achieved by their system is around 70%. This paper describes a simple fake news detection method based on artificial intelligence algorithms: the naïve Bayes classifier, Random Forest and Logistic Regression. The goal of the research is to examine how these particular methods work for this particular problem, given a manually labelled news dataset, and to support (or not) the idea of using AI for fake news detection. The difference between this article and articles on similar topics is that in this paper Logistic Regression was specifically used for fake news detection; also, the developed system was tested on a comparatively new dataset, which gave a chance to evaluate its performance on recent data.

A. Characteristics of Fake News:
Fake news articles often have grammatical mistakes. They are often emotionally coloured. They often try to affect readers' opinion on some topics. Their content is not always true. They often use attention-seeking words, news-like formats and click-baits. They are too good to be true. Their sources are not genuine most of the time [9].

II. LITERATURE REVIEW
Mykhailo Granik et al. in their paper [3] show a simple approach for fake news detection using a naive Bayes classifier. This approach was implemented as a software system and tested against a dataset of Facebook news posts. The posts were collected from three large Facebook pages each from the right and from the left, as well as three large mainstream political news pages (Politico, CNN, ABC News). They achieved a classification accuracy of approximately 74%. Classification accuracy for fake news is slightly worse, which may be caused by the skewness of the dataset: only 4.9% of it is fake news.
Himank Gupta et al. [10] gave a framework based on several machine learning approaches that deals with various problems including accuracy shortage, time lag (BotMaker) and the high processing time needed to handle thousands of tweets per second. Firstly, they collected 400,000 tweets from the HSpam14 dataset. They then further characterised 150,000 spam tweets and 250,000 non-spam tweets. They also derived some lightweight features along with the Top-30 words providing the highest information gain from a Bag-of-Words model. They were able to achieve an accuracy of 91.65% and surpassed the existing solution by approximately 18%.
Marco L. Della Vedova et al. [11] first proposed a novel ML fake news detection method which, by combining news content and social context features, outperforms existing methods in the literature, increasing its accuracy up to 78.8%. Second, they implemented their method within a Facebook Messenger chatbot and validated it with a real-world application, obtaining a fake news detection accuracy of 81.7%. Their goal was to classify a news item as reliable or fake; they first described the datasets they used for their test, then presented the content-based approach they implemented and the method they proposed to combine it with a social-based approach available in the literature. The resulting dataset is composed of 15,500 posts, coming from 32 pages (14 conspiracy pages, 18 scientific pages), with more than 2,300,000 likes by 900,000+ users. 8,923 (57.6%) posts are hoaxes and 6,577 (42.4%) are non-hoaxes.
Cody Buntain et al. [12] develop a method for automating fake news detection on Twitter by learning to predict accuracy assessments in two credibility-focused Twitter datasets: CREDBANK, a crowdsourced dataset of accuracy assessments for events in Twitter, and PHEME, a dataset of potential rumours in Twitter and journalistic assessments of their accuracies. They apply this method to Twitter content sourced from BuzzFeed's fake news dataset. A feature
analysis identifies features that are most predictive for
crowd sourced and journalistic accuracy assessments,
results of which are consistent with prior work. They
rely on identifying highly retweeted threads of
conversation and use the features of these threads to
classify stories, limiting this work‘s applicability only
to the set of popular tweets. Since the majority of
tweets are rarely retweeted, this method therefore is
only usable on a minority of Twitter conversation
threads.
In his paper, Shivam B. Parikh et al. [13] aim to present an insight into the characterisation of news stories in the modern diaspora combined with the differential content types of news stories and their impact on readers. Subsequently, they dive into existing fake news detection approaches that are heavily based on text-based analysis, and also describe popular fake news datasets. They conclude the paper by identifying 4 key open research challenges that can guide future research. It is a theoretical approach which gives illustrations of fake news detection by analysing psychological factors.

III. METHODOLOGY
This paper explains the system, which is developed in three parts. The first part is static and works on a machine learning classifier. We studied and trained the model with 4 different classifiers and chose the best classifier for final execution. The second part is dynamic and takes a keyword or piece of text from the user and searches online for the truth probability of the news. The third part checks the authenticity of a URL input by the user.
In this paper, we have used Python and its Sci-kit libraries [14]. Python has a huge set of libraries and extensions which can easily be used in Machine Learning. The Sci-Kit Learn library provides nearly all types of machine learning algorithms readily available for Python, so easy and quick evaluation of ML algorithms is possible. We have used Django for the web-based deployment of the model, which provides client-side implementation using HTML, CSS and JavaScript. We have also used Beautiful Soup (bs4) and requests for online scraping.

A. System Design-
Figure 1: System Design

B. System Architecture-
i) Static Search-
The architecture of the static part of the fake news detection system is quite simple and is designed keeping in mind the basic machine learning process flow. The system design is shown below and is self-explanatory. The main processes in the design are shown in Figure 2.

Figure 2: System Architecture

ii) Dynamic Search-
The second search field of the site asks for specific keywords to be searched on the net, upon which it provides a suitable output for the percentage probability of that term actually being present in an article, or a similar article with those keyword references in it.
iii) URL Search-
The third search field of the site accepts a specific website domain name, upon which the implementation looks for the site in our true-sites database or the blacklisted-sites database. The true-sites database holds the domain names which regularly provide proper and authentic news, and vice versa. If the site is not found in either of the databases, the implementation does not classify the domain; it simply states that the news aggregator does not exist.
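A minimal sketch of this lookup logic, assuming the two databases are represented as simple sets of domain names; the domains shown and the storage format are hypothetical, since the paper does not specify how the databases are kept:

```python
# Hypothetical contents; the real true-sites and blacklisted-sites
# databases are maintained separately and kept up to date.
TRUE_SITES = {"bbc.com", "reuters.com"}
BLACKLISTED_SITES = {"totally-real-news.example"}

def check_source(domain: str) -> str:
    """Classify a domain as authentic, blacklisted, or unknown."""
    domain = domain.lower().strip()
    if domain in TRUE_SITES:
        return "Authentic news source"
    if domain in BLACKLISTED_SITES:
        return "Blacklisted news source"
    return "News aggregator does not exist in our database"

print(check_source("bbc.com"))
```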
IV. IMPLEMENTATION
4.1 DATA COLLECTION AND ANALYSIS
We can get online news from different sources like
social media websites, search engine, homepage of
news agency websites or the fact-checking websites.
On the Internet, there are a few publicly available datasets for fake news classification, such as BuzzFeed News, LIAR [15] and BS Detector. These datasets have been widely used in different research papers for determining the veracity of news. In the following sections, we briefly discuss the sources of the datasets used in this work.
Online news can be collected from different sources, such as news agency homepages, search engines, and social media websites. However, manually determining the veracity of news is a challenging task, usually requiring annotators with domain expertise who perform careful analysis of claims and of additional evidence, context, and reports from authoritative sources. Generally, news data with annotations can be gathered in the following ways: expert journalists, fact-checking websites, industry detectors, and crowdsourced workers. However, there are no agreed-upon benchmark datasets for the fake news detection problem. Data gathered must be pre-processed, that is, cleaned, transformed and integrated, before it can undergo the training process [16]. The datasets that we used are explained below.

LIAR: This dataset is collected from the fact-checking website PolitiFact through its API [15]. It includes 12,836 human-labelled short statements, which are sampled from various contexts, such as news releases, TV or radio interviews, campaign speeches, etc. The labels for news truthfulness are fine-grained multiple classes: pants-fire, false, barely-true, half-true, mostly-true, and true.
The data source used for this project is the LIAR dataset, which contains 3 files in .csv format for test, train and validation. Below is a short description of the data files used for this project.

1. LIAR: A Benchmark Dataset for Fake News Detection
William Yang Wang, "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), short paper, Vancouver, BC, Canada, July 30-August 4, ACL.
Below are the columns used to create the 3 datasets that have been used in this project:
● Column 1: Statement (news headline or text).
● Column 2: Label (label class contains: True, False).
The dataset files used for this project were in csv format, named train.csv, test.csv and valid.csv.
2. REAL_OR_FAKE.CSV: we used this dataset for the passive aggressive classifier. It contains 3 columns, viz. 1- Text/keyword, 2- Statement, 3- Label (Fake/True).

4.2 DEFINITIONS AND DETAILS
A. Pre-processing Data
Social media data is highly unstructured: the majority of it is informal communication with typos, slang, bad grammar, etc. [17]. The quest for increased performance and reliability has made it imperative to develop techniques for the utilisation of resources to make informed decisions [18]. To achieve better insights, it is necessary to clean the data before it can be used for predictive modelling. For this purpose, basic pre-processing was done on the news training data. This step comprised the following:
Data Cleaning:
While reading data, we get data in a structured or unstructured format. A structured format has a well-defined pattern, whereas unstructured data has no proper structure. In between the two, there is a semi-structured format, which is comparably better structured than the unstructured format.
Cleaning up the text data is necessary to highlight the attributes that we want our machine learning system to pick up on. Cleaning (or pre-processing) the data typically consists of a number of steps:
a) Remove punctuation
Punctuation can provide grammatical context to a sentence, which supports our understanding. But for our vectorizer, which counts the number of words and not the context, it does not add value, so we remove all special characters. e.g.: How are you? -> How are you
b) Tokenization
Tokenizing separates text into units such as sentences or words. It gives structure to previously unstructured text. e.g.: Plata o Plomo -> 'Plata', 'o', 'Plomo'.
c) Remove stopwords
Stopwords are common words that will likely appear in any text. They don't tell us much about our data, so we remove them. e.g.: silver or lead is fine for me -> silver, lead, fine.
d) Stemming
Stemming helps reduce a word to its stem form. It often makes sense to treat related words in the same way. It removes suffixes like "ing", "ly", "s", etc. by a simple rule-based approach. It reduces the corpus of words, but often the actual words get neglected. e.g.: Entitling, Entitled -> Entitle. Note: some search engines treat words with the same stem as synonyms [18].
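A minimal sketch of how these four steps could be chained in Python, assuming NLTK's English stopword list and Porter stemmer; the paper does not state which tokenizer, stopword list or stemmer was actually used:

```python
import re
import string

from nltk.corpus import stopwords        # requires the NLTK stopwords corpus to be downloaded
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize  # requires the NLTK 'punkt' tokenizer models

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean_text(text):
    # a) remove punctuation / special characters
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())
    # b) tokenization
    tokens = word_tokenize(text)
    # c) remove stopwords
    tokens = [t for t in tokens if t not in stop_words]
    # d) stemming
    return [stemmer.stem(t) for t in tokens]

print(clean_text("Silver or lead is fine for me!"))  # e.g. ['silver', 'lead', 'fine']
```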
B. Feature Generation
We can use text data to generate a number of features such as word count, frequency of large words, frequency of unique words, n-grams, etc.
By creating a representation of words that captures their meanings, semantic relationships and the numerous types of context they are used in, we can enable a computer to understand text and perform clustering, classification, etc. [19].

Vectorizing Data:
Vectorizing is the process of encoding text as integers, i.e. in numeric form, to create feature vectors so that machine learning algorithms can understand our data.

1. Vectorizing Data: Bag-Of-Words
Bag of Words (BoW), or CountVectorizer, describes the presence of words within the text data. It gives a result of 1 if a word is present in the sentence and 0 if it is not. It therefore creates a bag of words with a document-matrix count for each text document.

2. Vectorizing Data: N-Grams
N-grams are simply all combinations of adjacent words or letters of length n that we can find in our source text. N-grams with n=1 are called unigrams; similarly, bigrams (n=2), trigrams (n=3) and so on can also be used. Unigrams usually don't contain as much information as bigrams and trigrams. The basic principle behind n-grams is that they capture which letter or word is likely to follow a given word. The longer the n-gram (the higher n), the more context you have to work with [20].

3. Vectorizing Data: TF-IDF
It computes the "relative frequency" with which a word appears in a document compared to its frequency across all documents. The TF-IDF weight represents the relative importance of a term in the document and the entire corpus [17].
TF stands for Term Frequency: it calculates how frequently a term appears in a document. Since every document's size varies, a term may appear more often in a long document than in a short one; thus, the term frequency is often divided by the document length.
Note: used for search engine scoring, text summarization, document clustering.

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

IDF stands for Inverse Document Frequency: a word is not of much use if it is present in all the documents. Certain terms like "a", "an", "the", "on", "of", etc. appear many times in a document but are of little importance. IDF weighs down the importance of these terms and increases the importance of rare ones. The higher the value of IDF, the more unique the word is [17].

IDF(t) = log_e(Total number of documents / Number of documents containing term t)

TF-IDF is applied on the body text, so the relative count of each word in the sentences is stored in the document matrix.

TF-IDF(t) = TF(t) * IDF(t)

Note: vectorizers output sparse matrices. A sparse matrix is a matrix in which most entries are 0 [21].
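A small illustration of these three representations with scikit-learn on a toy corpus; the corpus and parameters are placeholders, since the paper does not give the exact vectorizer settings used:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "Fake news spreads fast on social media",
    "Real news is verified by journalists",
]

# Bag-of-Words: raw document-term counts
bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)            # sparse matrix, mostly zeros

# N-grams: unigrams and bigrams together
ngrams = CountVectorizer(ngram_range=(1, 2))
X_ngrams = ngrams.fit_transform(corpus)

# TF-IDF: term frequency weighted by inverse document frequency
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)

print(X_bow.shape, X_ngrams.shape, X_tfidf.shape)
```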
C. Algorithms used for Classification
This section deals with training the classifier. Different classifiers were investigated to predict the class of the text. We specifically explored four different machine learning algorithms: Multinomial Naïve Bayes, Random Forest, the Passive Aggressive Classifier and Logistic Regression. The implementations of these classifiers were done using the Python library Sci-Kit Learn. A brief introduction to the algorithms follows.

1. Naïve Bayes Classifier:
This classification technique is based on Bayes' theorem, which assumes that the presence of a particular feature in a class is independent of the presence of any other feature. It provides a way of calculating the posterior probability:

P(c|x) = (P(x|c) * P(c)) / P(x)

P(c|x) = posterior probability of the class given the predictor
P(c) = prior probability of the class
P(x|c) = likelihood (probability of the predictor given the class)
P(x) = prior probability of the predictor
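A minimal, self-contained example of fitting a multinomial naive Bayes model on TF-IDF features with scikit-learn; the tiny corpus and labels below are invented placeholders, not the project's data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# placeholder corpus; 1 = true news, 0 = fake news
texts = [
    "government announces new budget for schools",
    "scientists publish peer reviewed climate study",
    "miracle cure discovered doctors hate this trick",
    "celebrity secretly replaced by alien clone",
]
labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

model = MultinomialNB()
model.fit(X, labels)

test = vectorizer.transform(["new study on school budget published"])
print(model.predict(test), model.predict_proba(test))
```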
2. Random Forest:
Random Forest is a trademark term for an ensemble of decision trees. In a Random Forest, we have a collection of decision trees (known as a "forest"). To classify a new object based on its attributes, each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest). Random forest is thus a classification algorithm consisting of many decision trees. It uses bagging and feature randomness when building each individual tree, to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree. Each individual tree in the random forest outputs a class prediction, and the class with the most votes becomes our model's prediction. The reason the random forest model works so well is that a large number of relatively uncorrelated models (trees) operating as a committee will outperform any of
the individual constituent models. So how does random forest ensure that the behaviour of each individual tree is not too correlated with the behaviour of any of the other trees in the model? It uses the following two methods:
2.1 Bagging (Bootstrap Aggregation): decision trees are very sensitive to the data they are trained on; small changes to the training set can result in significantly different tree structures. Random forest takes advantage of this by allowing each individual tree to randomly sample from the dataset with replacement, resulting in different trees. This process is known as bagging or bootstrapping.
2.2 Feature Randomness: in a normal decision tree, when it is time to split a node, we consider every possible feature and pick the one that produces the most separation between the observations in the left node versus those in the right node. In contrast, each tree in a random forest can pick only from a random subset of features. This forces even more variation amongst the trees in the model and ultimately results in lower correlation across trees and more diversification [22].
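These two ideas map directly onto parameters of scikit-learn's RandomForestClassifier; a brief sketch on synthetic data (the parameter values are illustrative, not the settings used in the paper):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in for vectorized news articles and fake/true labels
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# bootstrap=True  -> bagging: each tree is trained on a sample drawn with replacement
# max_features    -> feature randomness: each split considers only a random subset of features
rf = RandomForestClassifier(n_estimators=100, bootstrap=True,
                            max_features="sqrt", random_state=0)
rf.fit(X, y)
print(rf.predict(X[:5]))   # the class with the most votes across the trees
```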
3. Logistic Regression:
Logistic regression is a classification algorithm, not a regression algorithm. It is used to estimate discrete values (binary values like 0/1, yes/no, true/false) based on a given set of independent variables. In simple words, it predicts the probability of occurrence of an event by fitting the data to a logit function; hence it is also known as logit regression. Since it predicts a probability, its output values lie between 0 and 1 (as expected). Mathematically, the log odds of the outcome are modelled as a linear combination of the predictor variables [23]:

Odds = p / (1 - p) = probability of event occurrence / probability of no event occurrence
ln(odds) = ln(p / (1 - p))
logit(p) = ln(p / (1 - p)) = b0 + b1*X1 + b2*X2 + b3*X3 + ... + bk*Xk
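To connect the logit formula to code, a short sketch with scikit-learn's LogisticRegression on synthetic data, showing that the fitted coefficients reproduce predict_proba through the sigmoid of the linear combination; this is an illustration, not the paper's experiment:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=1)

lr = LogisticRegression(max_iter=1000)
lr.fit(X, y)

# logit(p) = b0 + b1*x1 + ... + bk*xk, then p = 1 / (1 + e^(-logit))
logit = lr.intercept_ + X[:3] @ lr.coef_.T
p = 1 / (1 + np.exp(-logit))
print(p.ravel())
print(lr.predict_proba(X[:3])[:, 1])   # same probabilities computed by sklearn
```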
4. Passive Aggressive Classifier:
The Passive Aggressive algorithm is an online learning algorithm, ideal for classifying massive streams of data (e.g. Twitter). It is easy to implement and very fast. It works by taking an example, learning from it and then discarding it [24]. Such an algorithm remains passive for a correct classification outcome and turns aggressive in the event of a misclassification, updating and adjusting. Unlike most other algorithms, it does not converge. Its purpose is to make updates that correct the loss, causing very little change in the norm of the weight vector [25].
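A minimal sketch of this online behaviour using scikit-learn's PassiveAggressiveClassifier with partial_fit on a simulated stream; the data here is random and purely illustrative:

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

rng = np.random.RandomState(0)
classes = np.array([0, 1])                  # 0 = fake, 1 = true
pac = PassiveAggressiveClassifier(max_iter=1000, random_state=0)

# simulate a stream: passive on well-classified examples, aggressive update on mistakes
for _ in range(10):
    X_batch = rng.rand(32, 20)              # stand-in for vectorized tweets/articles
    y_batch = (X_batch[:, 0] > 0.5).astype(int)
    pac.partial_fit(X_batch, y_batch, classes=classes)

print(pac.predict(rng.rand(3, 20)))
```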
4.3 IMPLEMENTATION STEPS
A. Static Search Implementation-
In the static part, we have trained and used 3 out of the 4 algorithms for classification. They are Naïve Bayes, Random Forest and Logistic Regression.
Step 1: First, we extracted features from the already pre-processed dataset. These features are Bag-of-Words, TF-IDF features and n-grams.
Step 2: Here, we built all the classifiers for fake news detection. The extracted features are fed into the different classifiers. We used the Naive Bayes, Logistic Regression and Random Forest classifiers from sklearn. Each of the extracted features was used in all of the classifiers.
Step 3: After fitting the models, we compared the F1 scores and checked the confusion matrices.
Step 4: After fitting all the classifiers, the 2 best performing models were selected as candidate models for fake news classification.
Step 5: We performed parameter tuning by applying GridSearchCV to these candidate models and chose the best performing parameters for these classifiers (see the sketch after this list).
Step 6: The finally selected model was used for fake news detection, with the probability of truth.
Step 7: Our finally selected and best performing classifier was Logistic Regression, which was then saved to disk. It is used to classify fake news: it takes a news article as input from the user, and the model produces the final classification output, which is shown to the user along with the probability of truth.
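A sketch of the Step 5 parameter tuning, assuming a TF-IDF + Logistic Regression pipeline; the parameter grid and toy corpus are illustrative, since the paper does not list the exact parameters that were searched:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],   # unigrams vs. unigrams + bigrams
    "clf__C": [0.1, 1.0, 10.0],               # inverse regularisation strength
}

# toy labelled corpus standing in for the pre-processed news data (1 = true, 0 = fake)
texts = [
    "government announces new education budget",
    "city council approves transit plan",
    "researchers publish vaccine trial results",
    "shocking miracle cure doctors hate",
    "celebrity secretly cloned in hidden lab",
    "aliens endorse presidential candidate",
]
labels = [1, 1, 1, 0, 0, 0]

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=3)
search.fit(texts, labels)
print(search.best_params_, round(search.best_score_, 3))
```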
B. Dynamic Search Implementation-
Our dynamic implementation contains 3 search fields, which are:
1) Search by article content.
2) Search using key terms.
3) Search for a website in the database.
For the first search field we have used Natural Language Processing to come up with a proper solution for the problem, and hence we have attempted to create a model which can classify fake news according to the terms used in newspaper articles. Our application uses NLP techniques like CountVectorization and TF-IDF vectorization before passing the text through a Passive Aggressive Classifier to output the authenticity of an article as a percentage probability.
The second search field of the site asks for specific keywords to be searched on the net, upon which it provides a suitable output for the percentage probability of that term actually being present in an
article, or a similar article with those keyword references in it.
The third search field of the site accepts a specific website domain name, upon which the implementation looks for the site in our true-sites database or the blacklisted-sites database. The true-sites database holds the domain names which regularly provide proper and authentic news, and vice versa. If the site is not found in either of the databases, the implementation does not classify the domain; it simply states that the news aggregator does not exist.
Working-
The problem can be broken down into 3 statements:
1) Use NLP to check the authenticity of a news article.
2) If the user has a query about the authenticity of a search query, then he/she can directly search on our platform, and using our custom algorithm we output a confidence score.
3) Check the authenticity of a news source.
These sections have been produced as search fields to take inputs in 3 different forms in our implementation of the problem statement.
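A rough sketch of how the keyword search (statement 2) could be implemented with the requests and Beautiful Soup libraries mentioned in the methodology; the URL and the scoring function are hypothetical placeholders, not the authors' custom algorithm:

```python
import requests
from bs4 import BeautifulSoup

def keyword_hit_rate(url, keywords):
    """Fetch a page and report what fraction of the query keywords appear in its text."""
    html = requests.get(url, timeout=10).text
    text = BeautifulSoup(html, "html.parser").get_text().lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords)

# score = keyword_hit_rate("https://example.com/some-article", ["election", "results"])
# print(f"{score:.0%} of the keywords were found in the article")
```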
4.4 EVALUATION METRICS
To evaluate the performance of algorithms for the fake news detection problem, various evaluation metrics have been used. In this subsection, we review the most widely used metrics for fake news detection. Most existing approaches consider the fake news problem as a classification problem that predicts whether a news article is fake or not:
True Positive (TP): when predicted fake news pieces are actually fake news;
True Negative (TN): when predicted true news pieces are actually true news;
False Negative (FN): when predicted true news pieces are actually fake news;
False Positive (FP): when predicted fake news pieces are actually true news.
Confusion Matrix:
A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. It allows the visualisation of the performance of an algorithm. A confusion matrix is a summary of the prediction results on a classification problem. The numbers of correct and incorrect predictions are summarised with count values and broken down by each class. This is the key to the confusion matrix: it shows the ways in which a classification model is confused when it makes predictions. It gives us insight not only into the errors being made by a classifier but, more importantly, the types of errors that are being made [26].

Table 1: Confusion Matrix
Total            | Class 1 (Predicted) | Class 2 (Predicted)
Class 1 (Actual) | TP                  | FN
Class 2 (Actual) | FP                  | TN

By formulating this as a classification problem, we can define the following metrics:
1. Precision = TP / (TP + FP)
2. Recall = TP / (TP + FN)
3. F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
4. Accuracy = (TP + TN) / (TP + TN + FP + FN)
These metrics are commonly used in the machine learning community and enable us to evaluate the performance of a classifier from different perspectives. Specifically, accuracy measures the overall agreement between the predicted labels and the true labels.
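These quantities can be computed directly with scikit-learn; a small sketch on made-up predictions, treating label 1 ("fake") as the positive class:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

# placeholder ground truth and predictions (1 = fake, 0 = true)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))   # [[TN, FP], [FN, TP]] with labels ordered 0, 1
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1 score: ", f1_score(y_true, y_pred))
print("Accuracy: ", accuracy_score(y_true, y_pred))
```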
4.5 SNAPSHOTS OF SYSTEM WORKING
A. Static System-
Figure 3: Static output (True)
Figure 4: Static Output (False)
B. Dynamic System-
Figure 5: Fake News Detector (Home Screen)
Figure 6: Fake News Detector (Output page)

V. RESULTS
Implementation was done using the above algorithms with vector features: count vectors and TF-IDF vectors at word level and n-gram level. Accuracy was noted for all models. We used the k-fold cross-validation technique to improve the effectiveness of the models.

A. Dataset split using k-fold cross-validation
This cross-validation technique was used for splitting the dataset randomly into k folds. (k-1) folds were used for building the model, while the k-th fold was used to check the effectiveness of the model. This was repeated until each of the k folds had served as the test set. We used 3-fold cross-validation for this experiment, where 67% of the data is used for training the model and the remaining 33% for testing.
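A brief sketch of this split with scikit-learn, using k = 3 as in the experiment; the data below is synthetic, standing in for the vectorized news features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

kf = KFold(n_splits=3, shuffle=True, random_state=0)   # each fold serves once as the test set
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf, scoring="accuracy")
print(scores, scores.mean())
```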
B. Confusion Matrices for Static System
After applying the various extracted features (Bag-of-Words, TF-IDF, n-grams) to three different classifiers (Naïve Bayes, Logistic Regression and Random Forest), their confusion matrices, showing the actual and predicted sets, are given below:

Table 2: Confusion Matrix for Naïve Bayes Classifier using TF-IDF features-
Naïve Bayes Classifier (Total = 10240)
              | Fake (Predicted) | True (Predicted)
Fake (Actual) | 841              | 3647
True (Actual) | 427              | 5325

Table 3: Confusion Matrix for Logistic Regression using TF-IDF features-
Logistic Regression (Total = 10240)
              | Fake (Predicted) | True (Predicted)
Fake (Actual) | 1617             | 2871
True (Actual) | 1097             | 4655

Table 4: Confusion Matrix for Random Forest Classifier using TF-IDF features-
Random Forest (Total = 10240)
              | Fake (Predicted) | True (Predicted)
Fake (Actual) | 1979             | 2509
True (Actual) | 1630             | 4122

Table 5: Comparison of Precision, Recall, F1-score and Accuracy for all three classifiers-
Classifier          | Precision | Recall | F1-Score | Accuracy
Naïve Bayes         | 0.59      | 0.92   | 0.72     | 0.60
Random Forest       | 0.62      | 0.71   | 0.67     | 0.59
Logistic Regression | 0.69      | 0.83   | 0.75     | 0.65

As evident above, our best model came out to be Logistic Regression, with an accuracy of 65%. Hence we then used grid search parameter optimisation to increase the performance of logistic regression, which then gave us an accuracy of 80%. Hence we can say that if a user feeds a particular news article or its headline into our model, there is an 80% chance that it will be classified according to its true nature.

C. Confusion Matrix for Dynamic System
We used real_or_fake.csv with the passive aggressive classifier and obtained the following confusion matrix:

Table 6: Confusion Matrix for Passive Aggressive Classifier-
Passive Aggressive Classifier (Total = 1267)
              | Fake (Predicted) | True (Predicted)
Fake (Actual) | 588              | 50
True (Actual) | 42               | 587

Table 7: Performance measures-
Classifier | Precision | Recall | F1-Score | Accuracy
PAC        | 0.93      | 0.9216 | 0.9257   | 0.9273

VI. CONCLUSION
In the 21st century, the majority of tasks are done online. Newspapers that were earlier preferred as hard copies are now being substituted by applications like Facebook and Twitter, and news articles are read online. WhatsApp forwards are also a major source. The
growing problem of fake news only makes things more complicated, and tries to change or hamper people's opinion of and attitude towards the use of digital technology. When a person is deceived by fake news, two possible things happen: people start believing that their perceptions about a particular topic are true as assumed. Thus, in order to curb this phenomenon, we have developed our Fake News Detection system, which takes input from the user and classifies it as true or fake. To implement this, various NLP and Machine Learning techniques have to be used. The model is trained using an appropriate dataset, and performance evaluation is also done using various performance measures. The best model, i.e. the model with the highest accuracy, is used to classify the news headlines or articles. As evident above, for the static search our best model came out to be Logistic Regression, with an accuracy of 65%. Hence we then used grid search parameter optimisation to increase the performance of logistic regression, which then gave us an accuracy of 75%. Hence we can say that if a user feeds a particular news article or its headline into our model, there is a 75% chance that it will be classified according to its true nature.
The user can check the news article or keywords online; he can also check the authenticity of the website. The accuracy for the dynamic system is 93%, and it increases with every iteration.
We intend to build our own dataset, which will be kept up to date according to the latest news. All the live news and latest data will be kept in a database using a web crawler and an online database.

VII. REFERENCES
[1] Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu, "Fake News Detection on Social Media: A Data Mining Perspective", arXiv:1708.01967v3 [cs.SI], 3 Sep 2017.
[2] Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu, "Fake News Detection on Social Media: A Data Mining Perspective", arXiv:1708.01967v3 [cs.SI], 3 Sep 2017.
[3] M. Granik and V. Mesyura, "Fake news detection using naive Bayes classifier," 2017 IEEE First Ukraine Conference on Electrical and Computer Engineering (UKRCON), Kiev, 2017, pp. 900-903.
[4] Fake news websites. (n.d.) Wikipedia. [Online]. Available: https://en.wikipedia.org/wiki/Fake_news_website. Accessed Feb. 6, 2017.
[5] Cade Metz. (2016, Dec. 16). The bittersweet sweepstakes to build an AI that destroys fake news.
[6] Conroy, N., Rubin, V. and Chen, Y. (2015). "Automatic deception detection: Methods for finding fake news", Proceedings of the Association for Information Science and Technology, 52(1), pp. 1-4.
[7] Markines, B., Cattuto, C., & Menczer, F. (2009, April). "Social spam detection". In Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web (pp. 41-48).
[8] Rada Mihalcea, Carlo Strapparava, "The lie detector: explorations in the automatic recognition of deceptive language", Proceedings of the ACL-IJCNLP.
[9] Kushal Agarwalla, Shubham Nandan, Varun Anil Nair, D. Deva Hema, "Fake News Detection using Machine Learning and Natural Language Processing", International Journal of Recent Technology and Engineering (IJRTE), ISSN: 2277-3878, Volume-7, Issue-6, March 2019.
[10] H. Gupta, M. S. Jamal, S. Madisetty and M. S. Desarkar, "A framework for real-time spam detection in Twitter," 2018 10th International Conference on Communication Systems & Networks (COMSNETS), Bengaluru, 2018, pp. 380-383.
[11] M. L. Della Vedova, E. Tacchini, S. Moret, G. Ballarin, M. DiPierro and L. de Alfaro, "Automatic Online Fake News Detection Combining Content and Social Signals," 2018 22nd Conference of Open Innovations Association (FRUCT), Jyvaskyla, 2018, pp. 272-279.
[12] C. Buntain and J. Golbeck, "Automatically Identifying Fake News in Popular Twitter Threads," 2017 IEEE International Conference on Smart Cloud (SmartCloud), New York, NY, 2017, pp. 208-215.
[13] S. B. Parikh and P. K. Atrey, "Media-Rich Fake News Detection: A Survey," 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), Miami, FL, 2018, pp. 436-441.
[14] Scikit-Learn: Machine Learning in Python.
[15] William Yang Wang, "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection, arXiv preprint arXiv:1705.00648, 2017.
[16] Shankar M. Patil, Dr. Praveen Kumar, "Data mining model for effective data analysis of higher education students using MapReduce", IJERMT, April 2017 (Volume-6, Issue-4).
[17] Aayush Ranjan, "Fake News Detection Using Machine Learning", Department of Computer Science & Engineering, Delhi Technological University, July 2018.
[18] Patil S.M., Malik A.K. (2019) Correlation Based Real-Time Data Analysis of Graduate Students Behaviour. In: Santosh K., Hegadi R. (eds) Recent Trends in Image Processing and Pattern Recognition. RTIP2R 2018. Communications in Computer and Information Science, vol 1037. Springer, Singapore.
[19] Badreesh Shetty, "Natural Language Processing (NLP) for machine learning", towardsdatascience, Medium.
[20] NLTK 3.5b1 documentation, NLTK generate n-gram.
[21] Shubham Jain, "Ultimate guide to deal with Text Data (using Python) – for Data Scientists and Engineers", February 27, 2018.
[22] Anirudh Palaparthi, "Understanding the random forest", Jan 28, Analytics Vidhya.
[23] Anirudh Palaparthi, "Understanding the random forest", Jan 28, Analytics Vidhya.
[24] Shailesh Dhama, "Detecting-Fake-News-with-Python", GitHub, 2019.
[25] Aayush Ranjan, "Fake News Detection Using Machine Learning", Department of Computer Science & Engineering, Delhi Technological University, July 2018.
[26] Jason Brownlee, "What is a Confusion Matrix in Machine Learning", November 18, 2016, in Code Algorithms From Scratch.
