
Fake News Detection Using Machine Learning

P. Yogendra Prasad¹, Dr. G. Nagalakshmi², P. Siva Kumar³

¹ Assistant Professor, Dept. of CSSE, Sree Vidyanikethan Engineering College, Tirupati.
² Assistant Professor, Dept. of Computer Science, National Sanskrit University, Tirupati.
³ Applications Lead, Oracle Corporation, Bangalore.

The phenomenon of fake news is expanding rapidly with the development of communication tools and social media. Fake news detection is an emerging research area that is attracting considerable interest. However, it faces challenges due to limited resources, such as datasets and processing and analysis techniques.
In this work, we propose a fake news detection system using machine learning techniques. We used Term Frequency-Inverse Document Frequency (TF-IDF) over bag of words and N-grams as the feature extraction method, and Support Vector Machine (SVM) as the classifier. We also propose fake news and true news datasets to train the proposed system. The results obtained demonstrate the efficiency of the system.

Index Terms: Fake News, Social Media, Web Mining, Machine Learning, Support Vector Machine, TF-IDF.

DOI: 10.47750/pnr.2022.13.04.140

Journal of Pharmaceutical Negative Results ¦ Volume 13 ¦ Issue 4 ¦ 2022

INTRODUCTION
Over the past decade, the phenomenon of fake news has become very widespread, fueled by social networks. Fake news can be spread for various purposes. Some items are created solely to increase clicks and visitors to a website. Others influence political decisions and public opinion about financial markets, for example by damaging the reputation of a company or institution on the web. Fake health news on social media poses a global health risk. It is difficult for people to find reliable sources and reliable information when they need it. The result of disinformation overload is the spread of anxiety, fear, insecurity and racism on a scale not seen in previous epidemics [11].
In this paper, we present a novel method and tool for detecting fake news that uses:
• Text preprocessing: stemming and analyzing the text after removing stop words and special characters.
• Encoding of the text: using bag of words and N-grams, then TF-IDF.
• Extraction of the characteristics: this allows a precise identification of false information. We use the source of a news item, its author, its date and the sentiment conveyed by the text as features.
• Support vector machine: a supervised machine learning algorithm that allows the classification of new information.
This paper is structured as follows: Section 2 presents some existing proposals for fake news detection. Section 3 details our proposal and its different components. Section 4 presents the implementation of our proposal as well as some of the obtained results. Section 5 concludes the paper and presents some perspectives.

RELATED WORKS
In the literature, many works address fake news detection.
Authors of [3] propose a typology of several methods of truth assessment for detecting fake news, emerging from two main categories: linguistic cue approaches with machine learning, and network analysis approaches.
In [5], authors present a simple approach to fake news detection using a naive Bayesian classifier. This approach is tested on a set of data extracted from Facebook news posts. They claim to be able to achieve an accuracy of 74%. The rate of this model is good but not the best, as many other works have achieved a better rate using other classifiers. We discuss these works in the following.
Authors of [1] propose a fake news detection model that uses n-gram analysis and machine learning techniques, comparing two different feature extraction techniques and six different classification techniques. The experiments carried out show that the best performance is obtained with the TF-IDF feature extraction method. They used the Linear Support Vector Machine (LSVM) classifier, which gives an accuracy of 92%. However, LSVM is limited to the case of two linearly separable classes.


Authors of [14] describe how users of social networks can ensure the truth of information. They also describe the mechanisms that allow its validation, the role of journalists, and what to expect from researchers and official institutions. This work helps people see some of the truth behind the news on social media and not believe everything they read.
Authors of [9] propose several strategies and types of cues relating to different modalities (text, image, social information). They also explore the value of combining and merging these approaches to assess and verify shared information.
In [8], the authors present an overall performance analysis of different approaches on three different datasets. This work focuses on the text of the information and the sentiment it conveys, and ignores features such as the source, the author or the date of publication, which can have a dramatic impact on the result. Besides, in our work, we will show that integrating the sentiment in the detection process does not bring any valuable information.
Authors of [16] created a new public dataset of valid news articles and proposed a text-processing based machine learning approach for automatic identification of fake news with 87% accuracy. It appears that this work focuses on the sentiment emerging from the text and not on the content of the text itself.
Authors of [17] introduced LIAR, a new dataset for automatic fake news detection. This corpus can also be used for stance classification, argument mining, topic modeling, rumor detection, and political NLP research. Most of the works in this area have used this benchmark. However, it is well known that this dataset is restricted to political information, while others have integrated information from various fields.
The overall drawback of these approaches is that the categorical data encoding may not be valid in reality. Besides, the usual fake news classification is limited to two values (namely, Real or Fake), while in reality we cannot say that a news item is 100% real or fake, but only to a certain degree of confidence. We consider this point very important for classifying news in social media.

PROPOSED SYSTEM
The system we propose uses a news dataset to build a decision model based on the support vector machine method. The model is then used to classify new news items as fake or real.

A. General architecture of the proposed system
The proposed system takes as input a dataset of comments and their related information, such as date, source and author. It then transforms them into a features dataset that can be used in the learning phase. This transformation is called preprocessing; it performs a series of operations such as cleansing, filtering and encoding. The preprocessed dataset is divided into two parts: the first for training and the second for testing. The training module uses the training dataset and the support vector machine algorithm to build a decision model that can be applied to the test dataset. If the model is accepted (i.e., it is able to achieve an acceptable accuracy rate), it is kept and used, and training ends. Otherwise, the parameters of the learning algorithm are revised in order to improve the accuracy rate. Figure 1 illustrates the general scheme of the proposed system.

Figure 1. The architecture of the proposed fake news detection system

B. Preprocessing
In the news dataset, news characteristics are classified into three categories: textual data, categorical data and numerical data. Each category is preprocessed through a set of operations, as illustrated in Figure 2:

Figure 2. Preprocessing of different categories of news characteristics

• Textual data. Represent the text written by the author of a news item. They are preprocessed by the following operations:
1) Cleaning: eliminating stop words and special characters.
2) Stemming: transforming the useful words into roots.
3) Encoding: transforming all the words of the comment into a numerical vector. This needs two steps: the combination of two techniques, namely bag of words [13] and N-grams [4], then the application of the TF-IDF method [12] on the result:

$\mathrm{tfidf}(t, n) = \mathrm{tf}(t, n) \times \mathrm{idf}(t)$

Where:
$\mathrm{tf}(t, n)$: the number of appearances of term t in the document n divided by the total number k of terms in the document, keeping the multiplicity of each term.
$\mathrm{idf}(t)$: the total number of documents D divided by the number $D_t$ of documents citing this term (the standard formulation [12] takes the logarithm of this ratio).
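The textual preprocessing steps above (cleaning, stemming, bag of words plus N-grams, then TF-IDF) can be sketched in a few lines of standard-library Python. This is a minimal illustration and not the authors' implementation: the stop-word list, the crude suffix-stripping `stem` function and every function name are assumptions made for the example, and the logarithmic form of idf is the textbook convention rather than something stated in the paper.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption: the paper does not give its list).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "on"}


def clean(text):
    """Cleaning: lowercase, drop special characters and stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def stem(word):
    """Stemming: crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def terms(text, n=2):
    """Bag-of-words tokens followed by word n-grams (n=2 gives 2-grams)."""
    tokens = [stem(w) for w in clean(text)]
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return tokens + grams


def tf_idf(documents, n=2):
    """Return one {term: weight} dictionary per document."""
    term_lists = [terms(doc, n) for doc in documents]
    doc_count = len(term_lists)  # D: total number of documents
    # D_t: number of documents that contain term t at least once.
    doc_freq = Counter(t for lst in term_lists for t in set(lst))
    weights = []
    for lst in term_lists:
        counts = Counter(lst)
        total = len(lst)  # k: number of terms in the document, with multiplicity
        weights.append({
            t: (c / total) * math.log(doc_count / doc_freq[t])
            for t, c in counts.items()
        })
    return weights


if __name__ == "__main__":
    docs = ["Fake news spreads fast", "Real news spreads slowly"]
    for vector in tf_idf(docs):
        print(vector)
```

Terms that occur in every document (here the stems of "news" and "spreads") receive a weight of zero, which is why TF-IDF favours words that discriminate between documents.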
• Categorical data. Represent the source of the news, such as a TV channel, newspaper or magazine, and its author. The preprocessing of these data is performed in two steps:
– Cleaning: eliminating special characters and transforming letters into lowercase.
– Encoding: for sources we used a label encoding. For authors, we created our own encoding to convert the authors' names into numbers, so that authors from the same source are close to each other compared to authors from other sources. We created a list containing two fields, the first for the source and the second for its authors, then we replaced each author by its index number, obtained by adding the sum of the sizes of the previous sources plus one. Figure 3 shows an example, where:
∗ T is the number of authors of the source (size).
∗ $i_k$ is the index number of author k.

Figure 3. Calculation of authors indices

• Numerical data. Represent the date of posting the comment and the sentiment conveyed by the text. Since the date is already represented by a numerical value, we only split it into three values: day, month and year. For the sentiment conveyed by the text, we calculate the sum of the sentiment degrees of the words. According to the experts, each word has a degree of sentiment, which allows the text to be classified into three classes:
– If the sum is less than 0, the sentiment is negative.
– If the sum is greater than 0, the sentiment is positive.
– If the sum is 0, the sentiment is neutral.
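The categorical and numerical encodings described above can be illustrated as follows. This is a sketch under stated assumptions: the exact formula of Figure 3 is not available in the text, so `encode_authors` implements one reading of it (an author's index is the total size of the preceding sources plus the author's 1-based position within its own source). The sentiment lexicon and all function names are hypothetical.

```python
import re
from datetime import date


def clean_label(value):
    """Cleaning step: lowercase and drop special characters."""
    return re.sub(r"[^a-z0-9 ]", "", value.lower()).strip()


def encode_authors(authors_by_source):
    """Map (source, author) to an integer index.

    Assumed reading of the paper's encoding: authors of the same source get
    consecutive indices, offset by the sizes of all previous sources.
    """
    index = {}
    offset = 0  # sum of the sizes of the previous sources
    for source, authors in authors_by_source.items():
        for position, author in enumerate(authors, start=1):
            index[(clean_label(source), clean_label(author))] = offset + position
        offset += len(authors)
    return index


def split_date(posted):
    """Split a date into the three numerical features: day, month, year."""
    return posted.day, posted.month, posted.year


def sentiment_class(text, lexicon):
    """Sum per-word sentiment degrees and map the sign to a class."""
    total = sum(lexicon.get(word, 0) for word in clean_label(text).split())
    if total < 0:
        return "negative"
    if total > 0:
        return "positive"
    return "neutral"


if __name__ == "__main__":
    print(encode_authors({"CNN": ["Alice", "Bob"], "BBC": ["Carol"]}))
    print(split_date(date(2022, 4, 13)))
    print(sentiment_class("Bad, bad but good!", {"good": 1, "bad": -1}))
```

With this encoding, two authors of the same source differ by less than the size of that source, while authors of different sources fall in disjoint index ranges.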
C. Learning
It brings together two modules, namely training and validation.
1) Training: to train our model, we have chosen the support vector machine algorithm [15]. This allows us to use the value of the decision function given for a news item as a confidence level of its classification: a positive value of the decision function designates a true news item as well as its degree of truth and, vice versa, a negative value of the decision function designates a fake news item as well as its degree of fakeness. Figure 4 illustrates this idea.

Figure 4. Degree of confidence for news classification using the support vector machine decision function

The maximum and minimum of the decision function are therefore calculated during the training phase and used to compute the degree of truth or falsity by the following function:

[equation not recoverable from the source]

Where:
• Dec is the decision function value;
• Maxdec and Mindec are the maximum and minimum values of the decision function;
• p is the percentage of truth or fakeness.
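Since the equation itself is missing from the text, the following sketch shows only one plausible way to turn a decision value into a percentage using the quantities defined above. It assumes that a positive Dec is scaled by Maxdec and a negative Dec by Mindec, and it caps the result at 100% for values outside the range seen during training. Both the scaling and the capping are assumptions, not the authors' stated formula.

```python
def confidence(dec, max_dec, min_dec):
    """Return (label, percentage) for a decision function value.

    Assumed scaling (not confirmed by the paper):
      dec >= 0 -> "real" with p = 100 * dec / max_dec
      dec <  0 -> "fake" with p = 100 * dec / min_dec
    max_dec and min_dec are the extreme decision values seen in training.
    """
    if max_dec <= 0 or min_dec >= 0:
        raise ValueError("expected max_dec > 0 and min_dec < 0")
    if dec >= 0:
        return "real", min(100.0, 100.0 * dec / max_dec)
    return "fake", min(100.0, 100.0 * dec / min_dec)


if __name__ == "__main__":
    print(confidence(1.0, 2.0, -4.0))
    print(confidence(-1.0, 2.0, -4.0))
```

Under this reading, a news item lying on the separating hyperplane (Dec = 0) gets a confidence of 0%, and the training example farthest from the hyperplane on each side gets 100%.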
2) Validation: to measure the capacity of the model to recognize new examples, we set aside some of the examples to be used as test examples. The features dataset is then subdivided into two parts, a training part and a test part. This avoids over-fitting, i.e., testing the model on the same dataset used for training. The subdivision is not done at random but according to a particular sampling scheme, using the method of cross-validation [10].

D. Revision of parameters
This operation aims to improve the model's accuracy by tuning the parameters of the support vector machine algorithm, namely Cost, γ and ϵ, and by changing the cross-validation variant [2].

Use
This is the last and most important phase in our system. After reaching the best recognition rate, i.e., after building the best model, we can use it on new unlabeled news, and the model allows us to predict their classes, fake or real, with a confidence degree.
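The validation and parameter-revision steps can be sketched as a k-fold split combined with a search over candidate parameter values. This is a generic illustration, not the WEKA procedure used in the paper: the strided fold assignment is one simple deterministic scheme, and `evaluate` is a hypothetical callback that would train an SVM with the given parameters on the training indices and return its accuracy on the test indices.

```python
from itertools import product


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    if k < 2 or k > n_samples:
        raise ValueError("need 2 <= k <= n_samples")
    indices = list(range(n_samples))
    for fold in range(k):
        test_idx = indices[fold::k]  # deterministic, non-random assignment
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx


def grid_search(evaluate, n_samples, grid, k=10):
    """Return (best_params, best_mean_score) over all parameter combinations.

    `evaluate(params, train_idx, test_idx)` is a hypothetical callback that
    trains a model on train_idx and returns its accuracy on test_idx.
    """
    best_params, best_score = None, float("-inf")
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [evaluate(params, train, test)
                  for train, test in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


if __name__ == "__main__":
    # Dummy scorer standing in for "train an SVM and measure accuracy".
    def dummy(params, train, test):
        return -abs(params["C"] - 300) - abs(params["gamma"] - 0.001)

    print(grid_search(dummy, 25, {"C": [1, 150, 300], "gamma": [0.001, 0.01]}))
```

Keeping the test fold disjoint from the training folds is what lets the mean score estimate performance on unseen news rather than on memorized examples.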

EXPERIMENTS AND RESULTS
The performance of the proposed system was tested using a dataset that we built by merging a true news dataset with a fake news one.

Used Dataset
We have merged two existing datasets: "Getting Real about Fake News" [6], containing fake news, and "All the news" [7], containing real news. These datasets were obtained from the Kaggle site. The first contains text and metadata extracted from 244 websites marked as false by Daniel Sieradski's BS Detector Chrome extension, extracted using the webhose.io API. This dataset contains approximately 12,999 social media posts, divided into 20 columns of different types: categorical, numeric and textual. The second dataset contains texts and metadata taken from the New York Times, Breitbart, CNN, Business Insider, the Atlantic, Fox News, Talking Points Memo, Buzzfeed News, National Review, New York Post, The Guardian, NPR, Reuters, Vox and the Washington Post, retrieved using Beautiful Soup and stored in SQLite, split into three separate CSV files. This last dataset contains texts and metadata subdivided into 10 columns of different types: categorical, numeric and textual.
After preprocessing the two datasets and testing the features one by one until reaching the best accuracy rate, we obtained a dataset that contains the following features:
• 5 words obtained by the bag of words method,
• 3 compound words obtained by the N-gram method,
• date: day, month and year,
• feeling,
• source,
• author,
• class: fake or real.

RESULTS AND DISCUSSION
To get the best decision model with the highest accuracy, we tuned many parameters. First we looked for the parameters of the bag of words and n-gram techniques that give the best recognition rate on our dataset.
For the bag of words technique, we varied the number of most frequent words taken from each comment. This operation was repeated several times, increasing the number of most frequent words each time, until the best rate was reached. Using the SMO implementation in Weka with 10-fold cross-validation, we obtained the following results:

Figure 5. Evolution of the rate according to the number of frequent words for the bag of words

As shown in Figure 5, the recognition rate increases with the number of most frequent words up to 25 words, then begins to decrease, which we explain by the phenomenon of over-fitting.
For the n-gram technique, we varied the number of grams. This operation was also repeated several times, increasing the value of n each time. We got these results:

Figure 6. Evolution of the rate according to the n-gram

In Figure 6, we observe that after 2-grams the recognition rate started to decrease. This is logical given the small size of the news text: a block of more than 2 words will not be repeated several times in the same piece of news, which does not exceed 5 lines at most.
We stopped at 2-grams and proceeded to vary the number k of most frequent word blocks. We obtained the following results:

Figure 7. Evolution of the rate according to the k * (2-grams)

In Figure 7 we observe that the rate continued to increase without exceeding the rate obtained by the bag of words technique. This is due, in our opinion, to one of two causes: either the small size of the information text, or the incompatibility of n-grams with the TF-IDF method. We then decided to combine the two techniques. We started by combining 5 frequent words with the 3 most frequent 2-grams, which gave us a rate of 52.30%. Then, we added the other characteristics to measure the influence of each one on the recognition rate. Figure 8 shows the evolution of the accuracy rate depending on the different features, testing on the training data and using the RBF kernel of LIBSVM [2] in WEKA [18].

Figure 8. Influence of different features on accuracy

In Figure 8, we notice that the influence of the feature "Sentiment" on the accuracy is almost negligible, which seems logical: a negative sentiment in a comment does not mean that the comment is fake. However, the characteristic "source" increased the accuracy to 89.27%, and "date" to 96%. The author feature pushed it to 100%, which shows the effectiveness of the encoding we have proposed [19].
Figure 9 shows the results obtained with the different LIBSVM kernels and their tuning in WEKA.

Figure 9. Accuracy according to the kernel type

It is clear that the linear and polynomial kernels give the best results. The linear kernel is parameterless and faster; however, theoretically it cannot model cases of complicated overlap between the two classes. On the other hand, the Gaussian kernel makes it possible to model any type of overlap, but its accuracy depends on the parameters C (Cost), ϵ and γ [20]. We have studied the influence of these parameters on the accuracy of the model.

• Influence of Cost C: Figure 10 shows the evolution of accuracy according to the Cost C, testing on the training dataset and using the RBF kernel of LIBSVM in WEKA:

Figure 10. Evolution of the accuracy according to the Cost C

At the start, the cost is equal to 0 and the rate is 52%. As the cost increases, we observe a rapid increase in the rate up to the value 150, then a stabilization of the rate around 82%, even though we continued to increase the cost to high values.
In our opinion, this is explained as follows: it is known that for high values of C, the optimization will choose a hyperplane with a smaller margin; conversely, a very small value of C will cause the optimization to seek a separating hyperplane with a larger margin. In this case the two classes are very close to each other, so the separation margin is small; it is found at the value 150, and beyond this value there is no data in the margin.
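The margin argument above can be read directly from the standard soft-margin SVM formulation, given here in its textbook form (see, e.g., [15]) rather than quoted from this paper. The symbols $w$, $b$, $\xi_i$, $\phi$ and $N$ are the usual ones and are introduced only for this illustration.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C \sum_{i=1}^{N} \xi_i
\quad \text{subject to} \quad
y_i\bigl(w^{\top}\phi(x_i) + b\bigr) \ge 1 - \xi_i,\qquad \xi_i \ge 0,

K(x, x') = \phi(x)^{\top}\phi(x') = \exp\bigl(-\gamma\,\lVert x - x'\rVert^{2}\bigr)
\quad \text{(RBF kernel)}.
```

The margin width is $2/\lVert w\rVert$. A large C makes each slack $\xi_i$ (a point inside the margin or misclassified) expensive, so the optimizer accepts a larger $\lVert w\rVert$, i.e. a narrower margin. A small C tolerates slack and favours a wider margin. Once C is large enough that no training point violates the margin, increasing it further no longer changes the solution, which is consistent with the plateau reported after C = 150. The parameter γ sets how quickly the RBF kernel decays with distance.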

• Influence of ϵ: Figure 11 shows the evolution of accuracy depending on ϵ, testing on the training data and using the RBF kernel of LIBSVM in WEKA:

Figure 11. Evolution of the accuracy rate according to ϵ

We observe a stabilization of the rate around 82% up to the value 0.1, then a slight, negligible drop in the rate up to the value 1. This shows that the ϵ parameter does not have a great influence on the recognition rate, which is logical because this parameter determines the tolerance of the termination criterion: it is simply the allowed error rate [21].
• Influence of γ: Figure 12 shows the evolution of accuracy depending on γ, testing on the training data and using the RBF kernel of LIBSVM in WEKA:

Figure 12. Evolution of the accuracy rate according to Gamma

With C = 300 and ϵ = 0.0001, the recognition rate increases up to γ = 0.001, then stabilizes around 82%, and then decreases rapidly from γ = 0.01.
In the end, we obtained the best model accuracy with the following parameters: Cost C = 300, ϵ = 0.0001 and γ = 0.001.

CONCLUSION
This paper presents a method of detecting fake news using the support vector machine, trying to determine the best features and techniques to detect fake news. We started by studying the field of fake news, its impact and its detection methods. We then designed and implemented a solution that uses a dataset of news preprocessed using cleaning techniques, stemming, N-gram encoding, bag of words and TF-IDF to extract a set of features that allow fake news to be detected. We then applied the Support Vector Machine algorithm on our features dataset to build a model for classifying new information.
Through the research carried out during this study, we obtained the following results:
• The best features to detect fake news are, in order: text, author, source, date and sentiment.
• The process followed resulted in a recognition rate of 100%.
• The analysis of the sentiment conveyed by the text is interesting; however, it would be more influential in the case of opinion mining.
• The N-gram method gives a better result than the bag of words with bulky datasets and with large texts.
• The support vector machine seems the best algorithm to detect fake news, because it gave a better recognition rate and made it possible to give each piece of information a degree of confidence for its classification.
• The parameters influencing the support vector machine are, in order: Cost C, gamma γ and epsilon ϵ.
The work we have done could be completed and continued in different directions. It would be relevant to extend this study with a larger dataset, and to replace its supervised learning with online learning, for continuous updating and automatic integration of new fake news.
REFERENCES
[1] Hadeer Ahmed, Issa Traore, and Sherif Saad. Detection of online fake news using n-gram analysis and machine learning techniques. In International Conference on Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments, pages 127–138. Springer, 2017.
[2] P. Yogendra Prasad. Implementation of Machine Learning Based Google Teachable Machine. International Journal of Early Childhood Special Education (INT-JECSE), ISSN: 1308-5581, 14(3), 2022.
[3] Niall J Conroy, Victoria L Rubin, and Yimin Chen. Automatic deception detection: Methods for finding fake news. Proceedings of the Association for Information Science and Technology, 52(1):1–4, 2015.
[4] Chris Faloutsos. Access methods for text. ACM Computing Surveys (CSUR), 17(1):49–74, 1985.
[5] Mykhailo Granik and Volodymyr Mesyura. Fake news detection using naive bayes classifier. In 2017 IEEE First Ukraine Conference on Electrical and Computer Engineering (UKRCON), pages 900–903. IEEE, 2017.
[6] Kaggle. Getting Real about Fake News, 2016.
[7] Kaggle. All the news, 2017.
[8] Junaed Younus Khan, Md Khondaker, Tawkat Islam, Anindya Iqbal, and Sadia Afroz. A benchmark study on machine learning methods for fake news detection. arXiv preprint arXiv:1905.04749, 2019.
[9] Cédric Maigrot, Ewa Kijak, and Vincent Claveau. Fusion par apprentissage pour la détection de fausses informations dans les réseaux sociaux. Document numérique, 21(3):55–80, 2018.
[10] Refaeilzadeh Payam, Tang Lei, and Liu Huan. Cross-validation. Encyclopedia of database systems, pages 532–538, 2009.
[11] Cristina M Pulido, Laura Ruiz-Eugenio, Gisela Redondo-Sama, and Beatriz Villarejo-Carballido. A new application of social impact in social media for overcoming fake news in health. International journal of environmental research and public health, 17(7):2430, 2020.
[12] Juan Ramos et al. Using tf-idf to determine word relevance in document queries. In Proceedings of the first instructional conference on machine learning, volume 242, pages 133–142. New Jersey, USA, 2003.
[13] Gerard Salton and Michael J McGill. Introduction to modern information retrieval, 1983.
[14] Florian Sauvageau. Les fausses nouvelles, nouveaux visages, nouveaux défis. Comment déterminer la valeur de l'information dans les sociétés démocratiques? Presses de l'Université Laval, 2018.
[15] Bernhard Schölkopf and Alexander J Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. Adaptive Computation and Machine Learning series, 2018.
[16] DSKR Vivek Singh and Rupanjal Dasgupta. Automated fake news detection using linguistic analysis and machine learning.
[17] William Yang Wang. "Liar, liar pants on fire": A new benchmark dataset for fake news detection. arXiv preprint arXiv:1705.00648, 2017.
[18] Lechevallier Y. WEKA, un logiciel libre d'apprentissage et de data mining. INRIA-Rocquencourt.
[19] Silpa C, RamPrakash Reddy Arava, and K.K. Baseer. Agri Farm: Crop and Fertilizer Recommendation System for High Yield Farming Using Machine Learning Algorithms. International Journal of Early Childhood Special Education, 14(5), 2022.
[20] CH Prathima, R. Anusuya, and M. Ram Kumar Prabhu. Comprehensive Design Analysis of Digital Marketing in Agriculture Sector. International Journal of Early Childhood Special Education, 14(5), 2022.
[21] Kamalraj R and Sakthivel M. A hybrid model on child security and activities monitoring system using IoT. In 2018 International Conference on Inventive Research in Computing Applications (ICIRCA), pages 996–999. IEEE, 2018.
