JPNR 2022 04 140
The phenomenon of fake news is expanding rapidly with the development of communication tools and social media. Fake news detection is
an emerging research area that is garnering a lot of interest. However, it faces some challenges due to limited resources such as datasets and
processing and analysis techniques.
In this work, we propose a fake news detection system using machine learning techniques. We used Term Frequency-Inverse Document
Frequency (TF-IDF) over bag of words and N-grams as the feature extraction method, and Support Vector Machine (SVM) as the
classifier. We also propose fake news and true news datasets to train the proposed system. The results obtained demonstrate the efficiency of
the system.
Index Terms: Fake News, Social Media, Web Mining, Machine Learning, Support Vector Machine, TF-IDF.
DOI: 10.47750/pnr.2022.13.04.140
• Dec is the decision function value;
• Maxdec and Mindec are the maximum and minimum values of the decision function;
• p is the percentage of truth or fake.

2) Validation: to measure the capacity of the model to recognize new examples, we set aside some of the examples to be used as test examples. The features dataset is thus subdivided into two parts, a training part and a test part. The purpose is to avoid over-fitting, i.e., testing the model on the same dataset used for training. The subdivision is not done at random but according to a particular sampling scheme, using the cross-validation method [10].

D. Revision of parameters
This operation aims to improve the model's accuracy by tuning the parameters of the support vector machine algorithm, namely cost C, γ and ϵ, and by changing the cross-validation variant [2].

Use
This is the last and most important phase in our system. After reaching the best recognition rate, i.e., after building the best model, we can use it on new unlabeled news items; the model predicts their class, fake or true, with a degree of confidence.

• 3 compound words obtained by the N-gram method,
• date: day, month and year,
• sentiment,
• source,
• author,
• class: fake or real.

RESULTS AND DISCUSSION
To get the best decision model with the highest accuracy, we tuned many parameters. First, we looked for the parameters of both the bag-of-words and N-gram techniques that give the best recognition rate on our dataset.
For the bag-of-words technique, we varied the number of most frequent words taken from each comment. This operation was repeated several times, increasing the number of most frequent words each time, until the best rate was reached. Using the SMO library in Weka with 10-fold cross-validation, we obtained the following results:
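As an illustration only, the feature extraction used in these experiments (keeping the k most frequent words or N-grams and weighting them by TF-IDF) can be sketched in plain Python. This is a minimal sketch, not the Weka pipeline used in the paper: it assumes whitespace tokenization and the plain tf × log(N/df) weighting, which is only one common TF-IDF variant, and the function names and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def ngrams(text, n):
    """Return the word n-grams of a text as space-joined strings."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def top_k_vocabulary(docs, n, k):
    """Keep the k most frequent n-grams over the whole corpus."""
    counts = Counter(g for doc in docs for g in ngrams(doc, n))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:k]]


def tfidf_matrix(docs, n, k):
    """Return (vocabulary, rows); rows[d][j] is the TF-IDF of term j in doc d."""
    vocab = top_k_vocabulary(docs, n, k)
    doc_grams = [Counter(ngrams(doc, n)) for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = {t: sum(1 for grams in doc_grams if t in grams) for t in vocab}
    rows = []
    for grams in doc_grams:
        total = sum(grams.values()) or 1  # guard against documents with no n-gram
        rows.append([(grams[t] / total) * math.log(n_docs / df[t]) for t in vocab])
    return vocab, rows


# Hypothetical toy corpus (assumption: not taken from the paper's dataset).
docs = [
    "breaking news aliens land in paris",
    "breaking news aliens sign trade deal",
    "city council approves new budget",
]
vocab, rows = tfidf_matrix(docs, n=2, k=3)
```

With n=1 the same functions give the bag-of-words variant; increasing k corresponds to the repeated experiment in which more of the most frequent terms are kept at each run.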
... rate started to decrease, which is logical given the small size of the news texts: a block of more than 2 words is unlikely to be repeated several times in the same piece of news, which does not exceed 5 lines at most.
We stopped at 2-grams and proceeded to vary the number k of most frequent word blocks. We obtained the following results:

Figure 8. Influence of different features on accuracy

In Figure 8, we notice that the influence of the feature "Sentiment" on the accuracy is almost negligible, which seems logical: the fact that a comment expresses a negative feeling does not mean that it is fake. However, the feature "source" increased the accuracy up to 89.27%, and "date" up to 96%, while the "author" feature pushed it to 100%, which shows the effectiveness of the encoding we have proposed [19].

Figure 9 shows the results obtained with the different LIBSVM kernels and their tuning in WEKA.

At the start, the cost is equal to 0 and the rate is 52%. As the cost increases, we observe a rapid increase in the rate up to the value 150, then a stabilization of the rate around 82%, even though we continued to increase the cost to high values.
In our opinion, this is explained as follows: it is known that for high values of C, the optimization will choose a hyperplane with a smaller margin; conversely, a very small value of C will cause the optimization to seek a separating hyperplane with a larger margin. In this case, the two classes are very close to each other, so the separation margin is small, and it is reached at the value 150; beyond this value there is no data in the margin.
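The role of C in this trade-off can be read off the textbook soft-margin SVM problem, recalled here as a reminder and not as a statement about the exact configuration used in these experiments. The symbols w, b, ξ_i, x_i, y_i and n do not appear in the text and are introduced only for illustration (w and b define the hyperplane, ξ_i are the slack variables, and (x_i, y_i) are the n training examples with labels y_i in {-1, +1}).

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_{i} \\
\text{subject to}\quad & y_{i}\,\bigl(w^{\top}x_{i} + b\bigr) \ge 1 - \xi_{i},
\qquad \xi_{i} \ge 0, \qquad i = 1,\dots,n .
\end{aligned}
```

The margin width is 2/‖w‖. A large C makes every slack ξ_i expensive, so the optimizer accepts a larger ‖w‖, i.e., a narrower margin, in order to reduce violations; a small C tolerates more violations in exchange for a wider margin. Once C is large enough that no training point remains inside the margin, increasing it further no longer changes the solution, which is consistent with the plateau described above.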
• Influence of ϵ: Figure 11 represents the evolution of accuracy depending on ϵ, testing on the training data and using the RBF kernel of the LIBSVM method in WEKA:

Figure 11. Evolution of the accuracy rate according to ϵ

We observe a stabilization of the rate around 82% up to the value 0.1, then a slight, negligible drop in the rate up to the value 1. This shows that the ϵ parameter does not have a great influence on the recognition rate, which is logical because this parameter only determines the tolerance of the termination criterion, i.e., the allowed error [21].

• Influence of γ: Figure 12 represents the evolution of accuracy depending on γ, testing on the training data and using the RBF kernel of the LIBSVM method in WEKA:

Figure 12. Evolution of the accuracy rate according to Gamma

With C = 300 and ϵ = 0.0001, the recognition rate increases up to γ = 0.001, then stabilizes around 82%, then decreases rapidly from γ = 0.01.
In the end, we obtained the best model accuracy with the following parameters: cost C = 300, ϵ = 0.0001 and γ = 0.001.

CONCLUSION
This paper presents a method of detecting fake news using a support vector machine, and tries to determine the best features and techniques for this task. We started by studying the field of fake news, its impact and its detection methods. We then designed and implemented a solution that uses a dataset of news preprocessed using cleaning techniques, stemming, N-gram encoding, bag of words and TF-IDF to extract a set of features for detecting fake news. We then applied the Support Vector Machine algorithm to our features dataset to build a model that classifies new information.
Through the research carried out during this study, we obtained the following results:
• The best features to detect fake news are, in order: text, author, source, date and sentiment.
• The process followed resulted in a recognition rate of 100%.
• The analysis of the sentiment conveyed by the text is interesting; however, it would be more influential in the case of opinion mining.
• The N-gram method gives a better result than the bag of words with large datasets and long texts.
• The support vector machine seems to be the best algorithm to detect fake news, because it gave a better recognition rate and provides, for each piece of information, a degree of confidence in its classification.
• The parameters influencing the support vector machine are, in order: cost C, gamma γ and epsilon ϵ.
The work we have done could be completed and continued in different respects. It would be relevant to extend this study with a larger dataset, and to evolve its supervised learning into online learning, for continuous updating and automatic integration of new fake news.

REFERENCES
Hadeer Ahmed, Issa Traore, and Sherif Saad. Detection of online fake news using n-gram analysis and machine learning techniques. In International Conference on Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments, pages 127–138. Springer, 2017.
Mr. P. Yogendra Prasad. Implementation of Machine Learning Based Google Teachable Machine. International Journal of Early Childhood Special Education (INT-JECSE). ISSN: 1308-5581, Vol 14, Issue 03, 2022.
Niall J Conroy, Victoria L Rubin, and Yimin Chen. Automatic deception detection: Methods for finding fake news. Proceedings of the Association for Information Science and Technology, 52(1):1–4, 2015.
Chris Faloutsos. Access methods for text. ACM Computing Surveys (CSUR), 17(1):49–74, 1985.
Mykhailo Granik and Volodymyr Mesyura. Fake news detection using naive bayes classifier. In 2017 IEEE First Ukraine Conference on Electrical and Computer Engineering (UKRCON), pages 900–903. IEEE, 2017.
Kaggle. Getting Real about Fake News, 2016.
Kaggle. All the news, 2017.
Junaed Younus Khan, Md Khondaker, Tawkat Islam, Anindya Iqbal, and Sadia Afroz. A benchmark study on machine learning methods for fake news detection. arXiv preprint arXiv:1905.04749, 2019.
Cédric Maigrot, Ewa Kijak, and Vincent Claveau. Fusion par apprentissage pour la détection de fausses informations dans les réseaux