Fake News Detection Using Machine Learning Algorithms
Uma Sharma, Sidarth Saran, Shankar M. Patil
Department of Information Technology
Bharati Vidyapeeth College of Engineering
Navi Mumbai, India
eqtasharma@gmail.com, siddharthsaran00@gmail.com, smpatil2k@gmail.com
Abstract— In our modern era, where the internet is ubiquitous, everyone relies on various online resources for news. With the increased use of social media platforms like Facebook and Twitter, news spreads rapidly among millions of users within a very short span of time. The spread of fake news has far-reaching consequences, from the creation of biased opinions to swaying election outcomes for the benefit of certain candidates. Moreover, spammers use appealing news headlines to generate revenue from advertisements via click-baits. In this paper, we aim to perform binary classification of various news articles available online using concepts from Artificial Intelligence, Natural Language Processing and Machine Learning. We aim to provide the user with the ability to classify news as fake or real, and also to check the authenticity of the website publishing the news.

Keywords—Internet, Social Media, Fake News, Classification, Artificial Intelligence, Machine Learning, Websites, Authenticity.

I. INTRODUCTION

As an increasing amount of our lives is spent interacting online through social media platforms, more and more people tend to seek out and consume news from social media instead of traditional news organizations [1]. The reasons for this change in consumption behaviour are inherent in the nature of these platforms: (i) it is often more timely and less expensive to consume news on social media than through traditional journalism, such as newspapers or television; and (ii) it is easier to share and discuss the news with friends or other readers on social media. For instance, 62 percent of U.S. adults got news on social media in 2016, while in 2012 only 49 percent reported seeing news there [1]. It was also found that social media now outperforms television as the major news source. Despite the benefits provided by social media, the quality of stories on social media is lower than that of traditional news organizations. However, because it is inexpensive to provide news online, and much faster and easier to propagate it through social media, large volumes of fake news, i.e., news articles with intentionally false information, are produced online for a variety of purposes, such as financial and political gain. It was estimated that over 1 million tweets were related to the fake news story "Pizzagate" by the end of the presidential election. Given the prevalence of this new phenomenon, "fake news" was even named word of the year by the Macquarie Dictionary in 2016 [2].

The extensive spread of fake news can have a significant negative impact on individuals and society. First, fake news can shatter the authenticity equilibrium of the news ecosystem; for instance, it is evident that the most popular fake news spread even more widely on Facebook than the most popular genuine mainstream news during the U.S. 2016 presidential election. Second, fake news intentionally persuades consumers to accept biased or false beliefs. Fake news is typically manipulated by propagandists to convey political messages or influence; for instance, some reports show that Russia has created fake accounts and social bots to spread false stories. Third, fake news changes the way people interpret and respond to real news; for instance, some fake news is created purely to trigger people's distrust and confuse them, impeding their ability to differentiate what is true from what is not. To help mitigate the negative effects caused by fake news (both to benefit the public and the news ecosystem), it is crucial that we develop methods to automatically detect fake news broadcast on social media [3].

The Internet and social media have made access to news information much easier and more comfortable [2]. Internet users can often follow the events of their concern online, and the increased number of mobile devices makes this process even easier. But
with great possibilities come great challenges. Mass media have an enormous influence on society, and, as often happens, there is someone who wants to take advantage of this fact. Sometimes, to achieve certain goals, mass media may manipulate information in different ways. This leads to the production of news articles that are not completely true, or are even completely false. There even exist many websites that produce fake news almost exclusively. They intentionally publish hoaxes, half-truths, propaganda and disinformation asserting to be real news, often using social media to drive web traffic and magnify their effect. The main goals of fake news websites are to affect public opinion on certain matters (mostly political). Examples of such websites can be found in Ukraine, the United States of America, Germany, China and many other countries [4]. Thus, fake news is a global issue as well as a global challenge. Many scientists believe that the fake news issue may be addressed by means of machine learning and artificial intelligence [5]. There is a reason for that: recently AI algorithms have begun to work much better on many classification problems (image recognition, voice detection and so on) because hardware is cheaper and larger datasets are available.

There are several influential articles about automatic deception detection. In [6] the authors provide a general overview of the available techniques for the problem. In [7] the authors describe their method for fake news detection based on the feedback for specific news items in micro-blogs. In [8] the authors develop two systems for deception detection based on support vector machines and a Naive Bayes classifier (the latter is also employed in the system described in this paper), respectively. They collected the data by asking people to directly provide true or false information on several topics – abortion, the death penalty and friendship. The accuracy of detection achieved by their system is around 70%. This paper describes a simple fake news detection method based on artificial intelligence algorithms – the naive Bayes classifier, Random Forest and Logistic Regression. The goal of the research is to examine how these particular methods work for this particular problem, given a manually labelled news dataset, and to support (or not) the idea of using AI for fake news detection. The difference between this article and articles on similar topics is that in this paper Logistic Regression was specifically used for fake news detection; also, the developed system was tested on a comparatively new dataset, which gave a chance to evaluate its performance on recent data.

A. Characteristics of Fake News:
They often have grammatical mistakes. They are often emotionally coloured. They often try to affect readers' opinions on some topics. Their content is not always true. They often use attention-seeking words, news-like formats and click-baits. They are too good to be true. Their sources are not genuine most of the time [9].

II. LITERATURE REVIEW

Mykhailo Granik et al. in their paper [3] show a simple approach for fake news detection using a naive Bayes classifier. This approach was implemented as a software system and tested against a data set of Facebook news posts. The posts were collected from three large Facebook pages each from the right and from the left, as well as three large mainstream political news pages (Politico, CNN, ABC News). They achieved a classification accuracy of approximately 74%. Classification accuracy for fake news is slightly worse. This may be caused by the skewness of the dataset: only 4.9% of it is fake news.

Himank Gupta et al. [10] gave a framework based on different machine learning approaches that deals with various problems, including accuracy shortage, time lag (BotMaker) and the high processing time needed to handle thousands of tweets in 1 second. First, they collected 400,000 tweets from the HSpam14 dataset. They then further characterized the 150,000 spam tweets and 250,000 non-spam tweets. They also derived some lightweight features along with the Top-30 words providing the highest information gain from a Bag-of-Words model. They were able to achieve an accuracy of 91.65% and surpassed the existing solution by approximately 18%.

Marco L. Della Vedova et al. [11] first proposed a novel ML fake news detection method which, by combining news content and social context features, outperforms existing methods in the literature, increasing its accuracy up to 78.8%. Second, they implemented their method within a Facebook Messenger chatbot and validated it with a real-world application, obtaining a fake news detection accuracy of 81.7%. Their goal was to classify a news item as reliable or fake; they first described the datasets they used for their test, then presented the content-based approach they implemented and the method they proposed to combine it with a social-based approach available in the literature. The resulting dataset is composed of 15,500 posts, coming from 32 pages (14 conspiracy pages, 18 scientific pages), with more than 2,300,000 likes by 900,000+ users. 8,923 (57.6%) posts are hoaxes and 6,577 (42.4%) are non-hoaxes.

Cody Buntain et al. [12] develop a method for automating fake news detection on Twitter by learning to predict accuracy assessments in two credibility-focused Twitter datasets: CREDBANK, a crowd-sourced dataset of accuracy assessments for events in Twitter, and PHEME, a dataset of potential rumours in Twitter and journalistic assessments of their accuracies. They apply this method to Twitter content sourced from BuzzFeed's fake news dataset. A feature analysis identifies the features that are most predictive for crowd-sourced and journalistic accuracy assessments, the results of which are consistent with prior work. They rely on identifying highly retweeted threads of conversation and use the features of these threads to classify stories, limiting this work's applicability only to the set of popular tweets. Since the majority of tweets are rarely retweeted, this method is therefore only usable on a minority of Twitter conversation threads.
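Several of the surveyed systems, like the one in [3], build on word-frequency classifiers such as naive Bayes. As a rough, self-contained illustration of that technique (not any author's exact setup, and using an invented toy corpus rather than the datasets above), a multinomial naive Bayes text classifier can be sketched as:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Count class priors and per-class word frequencies from tokenized docs."""
    label_counts = Counter(labels)
    word_counts = {lab: Counter() for lab in label_counts}
    vocab = set()
    for doc, lab in zip(docs, labels):
        word_counts[lab].update(doc)
        vocab.update(doc)
    return label_counts, word_counts, vocab

def predict_nb(model, doc):
    """Pick the label maximizing log P(label) + sum log P(word|label), Laplace-smoothed."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for lab, lab_count in label_counts.items():
        score = math.log(lab_count / total_docs)              # log prior
        denom = sum(word_counts[lab].values()) + len(vocab)   # smoothed denominator
        for word in doc:
            score += math.log((word_counts[lab][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = lab, score
    return best_label

# Toy usage with an invented mini-corpus (purely illustrative)
train_docs = [
    "shocking miracle cure doctors hate".split(),
    "you won a free prize click now".split(),
    "government publishes annual budget report".split(),
    "city council approves new transit plan".split(),
]
train_labels = ["fake", "fake", "real", "real"]
model = train_nb(train_docs, train_labels)
print(predict_nb(model, "free miracle prize".split()))  # prints: fake
```

Real systems would of course train on thousands of labelled articles and add features beyond raw word counts, but the scoring logic is the same.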
In his paper, Shivam B. Parikh et al. [13] aim to present an insight into the characterization of news stories in the modern diaspora, combined with the differential content types of news stories and their impact on readers. The paper then dives into existing fake news detection approaches that are heavily based on text analysis, and also describes popular fake news datasets. It concludes by identifying 4 key open research challenges that can guide future research. It is a theoretical approach which gives illustrations of fake news detection by analysing psychological factors.

III. METHODOLOGY

This paper explains the system, which is developed in three parts. The first part is static and works on a machine learning classifier. We studied and trained the model with 4 different classifiers and chose the best classifier for final execution. The second part is dynamic: it takes the keyword/text from the user and searches online for the truth probability of the news. The third part checks the authenticity of the URL input by the user.

In this paper, we have used Python and its Sci-kit libraries [14]. Python has a huge set of libraries and extensions, which can easily be used in machine learning. The Sci-Kit Learn library is the best source for machine learning algorithms, where nearly all types of machine learning algorithms are readily available for Python; thus, easy and quick evaluation of ML algorithms is possible. We have used Django for the web-based deployment of the model; it provides client-side implementation using HTML, CSS and Javascript. We have also used Beautiful Soup (bs4) and requests for online scraping.

A. System Design-

Figure 1: System Design

B. System Architecture-

i) Static Search-
The architecture of the static part of the fake news detection system is quite simple and is built keeping in mind the basic machine learning process flow. The system design is shown below and is self-explanatory. The main processes in the design are-

Figure 2: System Architecture

ii) Dynamic Search-
The second search field of the site asks for specific keywords to be searched on the net, upon which it provides a suitable output for the percentage probability of that term actually being present in an article, or in a similar article with those keyword references in it.

iii) URL Search-
The third search field of the site accepts a specific website domain name, upon which the implementation looks for the site in our true-sites database or the blacklisted-sites database. The true-sites database holds the domain names which regularly provide proper and authentic news, and vice versa. If the site isn't found in either of the databases, the implementation doesn't classify the domain; it simply states that the news aggregator does not exist.

IV. IMPLEMENTATION

4.1 DATA COLLECTION AND ANALYSIS

We can get online news from different sources like social media websites, search engines, homepages of news agency websites or fact-checking websites.
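The URL search described above amounts to a set lookup against the two domain databases. A minimal sketch, where the whitelist and blacklist entries are hypothetical placeholders (the paper does not list the actual database contents):

```python
from urllib.parse import urlparse

# Hypothetical stand-ins for the true-sites and blacklisted-sites databases
TRUE_SITES = {"bbc.com", "reuters.com"}
BLACKLISTED_SITES = {"fakenewsdaily.example", "hoaxhub.example"}

def check_url(url):
    """Classify a URL's domain as 'real', 'fake', or report it as unknown."""
    domain = urlparse(url).netloc.lower()
    if domain.startswith("www."):
        domain = domain[4:]  # normalise a leading "www."
    if domain in TRUE_SITES:
        return "real"
    if domain in BLACKLISTED_SITES:
        return "fake"
    return "news aggregator does not exist"

print(check_url("https://www.bbc.com/news/article"))  # prints: real
print(check_url("http://hoaxhub.example/story"))      # prints: fake
```

In a deployed system the two sets would be backed by database tables and updated as new outlets are vetted or flagged.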
On the Internet, there are a few publicly available datasets for fake news classification, like BuzzFeed News, LIAR [15], BS Detector, etc. These datasets have been widely used in different research papers for determining the veracity of news. In the following sections, the sources of the datasets used in this work are discussed in brief.

Online news can be collected from different sources, such as news agency homepages, search engines, and social media websites. However, manually determining the veracity of news is a challenging task, usually requiring annotators with domain expertise who perform careful analysis of claims and of additional evidence, context, and reports from authoritative sources. Generally, news data with annotations can be gathered in the following ways: expert journalists, fact-checking websites, industry detectors, and crowd-sourced workers. However, there are no agreed-upon benchmark datasets for the fake news detection problem. Data gathered must be pre-processed – that is, cleaned, transformed and integrated – before it can undergo the training process [16]. The datasets that we used are explained below:

LIAR: This dataset is collected from the fact-checking website PolitiFact through its API [15]. It includes 12,836 human-labelled short statements, which are sampled from various contexts, such as news releases, TV or radio interviews, campaign speeches, etc. The labels for news truthfulness are fine-grained multiple classes: pants-fire, false, barely-true, half-true, mostly-true, and true.

The data source used for this project is the LIAR dataset, which contains 3 files in .csv format for test, train and validation. Below is some description of the data files used for this project.

1. LIAR: A Benchmark Dataset for Fake News Detection
William Yang Wang, "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection, in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), short paper, Vancouver, BC, Canada, July 30-August 4, ACL.

Below are the columns used to create the 3 datasets that have been used in this project-
● Column 1: Statement (news headline or text).
● Column 2: Label (label class contains: True, False)
The dataset files used for this project were in csv format, named train.csv, test.csv and valid.csv.

2. REAL_OR_FAKE.CSV: we used this dataset for the Passive Aggressive classifier. It contains 3 columns, viz. 1- Text/keyword, 2- Statement, 3- Label (Fake/True).

4.2 DEFINITIONS AND DETAILS

A. Pre-processing Data

Social media data is highly unstructured – the majority of it is informal communication with typos, slang, bad grammar, etc. [17]. The quest for increased performance and reliability has made it imperative to develop techniques for the utilization of resources to make informed decisions [18]. To achieve better insights, it is necessary to clean the data before it can be used for predictive modelling. For this purpose, basic pre-processing was done on the news training data. This step comprised of-

Data Cleaning:
While reading data, we get it in a structured or unstructured format. A structured format has a well-defined pattern, whereas unstructured data has no proper structure. In between the two, we have a semi-structured format, which is comparably better structured than the unstructured format.

Cleaning up the text data is necessary to highlight the attributes that we want our machine learning system to pick up on. Cleaning (or pre-processing) the data typically consists of a number of steps:

a) Remove punctuation
Punctuation can provide grammatical context to a sentence, which supports our understanding. But for our vectorizer, which counts the number of words and not the context, it does not add value, so we remove all special characters. e.g.: How are you? -> How are you

b) Tokenization
Tokenizing separates text into units such as sentences or words. It gives structure to previously unstructured text. e.g.: Plata o Plomo -> 'Plata', 'o', 'Plomo'.

c) Remove stopwords
Stopwords are common words that will likely appear in any text. They don't tell us much about our data, so we remove them. e.g.: silver or lead is fine for me -> silver, lead, fine.

d) Stemming
Stemming helps reduce a word to its stem form. It often makes sense to treat related words in the same way. It removes suffixes like "ing", "ly", "s", etc. by a simple rule-based approach. It reduces the corpus of words, but often the actual words get neglected. e.g.: Entitling, Entitled -> Entitle. Note: some search engines treat words with the same stem as synonyms [18].

B. Feature Generation

We can use text data to generate a number of features like word count, frequency of large words, frequency of unique words, n-grams, etc. By creating a representation of words that captures their meanings, semantic relationships, and the numerous types of context they are used in, we can enable a computer to understand text and perform clustering, classification, etc. [19].

TF-IDF is applied on the body text, so the relative count of each word in the sentences is stored in the document matrix:

tf-idf(t, d) = tf(t, d) × idf(t), where idf(t) = log(N / df(t)),

with tf(t, d) the frequency of term t in document d, N the total number of documents, and df(t) the number of documents containing t.
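The cleaning steps a)–d) and the TF-IDF weighting above can be sketched in pure Python; the stopword list and suffix rules here are toy stand-ins for the NLTK resources the paper relies on:

```python
import math
import re

STOPWORDS = {"is", "a", "the", "for", "me", "or", "are", "you", "how"}  # toy list
SUFFIXES = ("ing", "ly", "ed", "s")  # crude rule-based stemmer, not Porter

def clean(text):
    """a) remove punctuation, b) tokenize, c) drop stopwords, d) suffix stemming."""
    text = re.sub(r"[^\w\s]", "", text.lower())          # a) strip special characters
    tokens = text.split()                                 # b) whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]    # c) remove stopwords
    stemmed = []
    for t in tokens:                                      # d) strip one known suffix
        for suf in SUFFIXES:
            if t.endswith(suf) and len(t) > len(suf) + 2:
                t = t[: -len(suf)]
                break
        stemmed.append(t)
    return stemmed

def tfidf(docs):
    """Return one {term: tf-idf} dict per document: tf(t, d) * log(N / df(t))."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        counts = {t: doc.count(t) for t in set(doc)}
        weights.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in counts.items()})
    return weights

docs = [clean("Entitling the readers!"), clean("The readers are reading news.")]
print(docs)  # [['entitl', 'reader'], ['reader', 'read', 'new']]
```

Terms that appear in every document (like "reader" here) get an idf of log(1) = 0, which is exactly the down-weighting of uninformative common words that TF-IDF is meant to provide.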
Table 4: Confusion Matrix for Random Forest Classifier using Tf-Idf features-

Random Forest (Total = 10240)   Fake (Predicted)   True (Predicted)
Fake (Actual)                   1979               2509
True (Actual)                   1630               4122
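From the four cells of the confusion matrix, the overall accuracy and the per-class precision and recall for the "fake" class follow directly:

```python
# Cell values taken from Table 4 (Random Forest with Tf-Idf features)
fake_fake = 1979  # actual fake, predicted fake
fake_true = 2509  # actual fake, predicted true
true_fake = 1630  # actual true, predicted fake
true_true = 4122  # actual true, predicted true

total = fake_fake + fake_true + true_fake + true_true
accuracy = (fake_fake + true_true) / total
precision_fake = fake_fake / (fake_fake + true_fake)  # of predicted fake, share truly fake
recall_fake = fake_fake / (fake_fake + fake_true)     # of actual fake, share caught

print(total)                     # 10240
print(round(accuracy, 4))        # 0.5958
print(round(precision_fake, 3))  # 0.548
print(round(recall_fake, 3))     # 0.441
```

The low recall on the fake class (more fake articles misclassified as true than caught) mirrors the class skew noted for other detectors in the literature review.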
[19] Badreesh Shetty, "Natural Language Processing (NLP) for Machine Learning", Towards Data Science, Medium.
[20] NLTK 3.5b1 documentation, generating n-grams with NLTK.
[21] Shubham Jain, "Ultimate Guide to Deal with Text Data (using Python) – for Data Scientists and Engineers", February 27, 2018.
[22] Anirudh Palaparthi, "Understanding the Random Forest", Analytics Vidhya, Jan 28.
[23] Anirudh Palaparthi, "Understanding the Random Forest", Analytics Vidhya, Jan 28.
[24] Shailesh Dhama, "Detecting-Fake-News-with-Python", GitHub, 2019.
[25] Aayush Ranjan, "Fake News Detection Using Machine Learning", Department of Computer Science & Engineering, Delhi Technological University, July 2018.
[26] Jason Brownlee, "What is a Confusion Matrix in Machine Learning", November 18, 2016.