Detecting Fake News Using NLP Methods
Czech Technical University in Prague
Denis Řeháček
Guidelines:
The objectives of the thesis are:
1) Study state-of-the-art NLP methods. Focus on text classification and more
specifically on algorithms based on Deep Neural Networks (DNNs).
2) Select an existing Fake News dataset.
3) Design, implement, and compare multiple types of DNN classifiers. Experiment
with different architectures and optimize parameters.
4) Evaluate the models using the selected dataset. Discuss whether fake news is
detectable based solely on the specifics of the language used.
5) Collect a similar dataset for the Czech language.
Bibliography / sources:
[1] Hanselowski, Andreas, et al. "A retrospective analysis of the fake news challenge stance detection task." arXiv preprint arXiv:1806.05180 (2018).
[2] Singhania, Sneha, Nigel Fernandez, and Shrisha Rao. "3HAN: A deep neural network for fake news detection." International Conference on Neural Information Processing. Springer, Cham, 2017.
[3] Bai, Shaojie, J. Zico Kolter, and Vladlen Koltun. "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling." arXiv preprint arXiv:1803.01271 (2018).
[4] Davis, Richard, and Chris Proctor. "Fake news, real consequences: Recruiting neural networks for the fight against fake news." (2017).
[5] Kaggle Fake News dataset: https://www.kaggle.com/c/fake-news (2019).
[6] Infobanka ČTK: https://ib.ctk.cz/ (2019).
Date of master’s thesis assignment: 12.08.2019 Deadline for master's thesis submission: 07.01.2020
Date of assignment receipt Student’s signature
Abstract

This thesis introduces the problem of disinformation in an information-rich world. Fake News detection was addressed as a text classification problem. More than a hundred experiments were done to find an appropriate combination of pre-processing and an efficient Neural Network architecture, revealing some specifics and limitations of the Fake News detection problem compared to other text classification tasks. An existing Fake News dataset was used, as well as several combinations of self-obtained data. The work is unique in processing news articles in numerous European languages, covering the same topics in both categories - reliable and disinformation news. The best accuracy was achieved by a convolution-based Neural Network, with up to 99.9% correct predictions on the existing dataset and over 98% in most experiments on the smaller self-obtained data, outperforming the Self-attention mechanism. Better results were achieved when using the original texts instead of their human-written summaries (even though the second option was trained on a larger dataset). Considering the dataset properties (same topics in both classes), the results suggest there are probably language patterns distinctive for each of the two categories that were lost in the human-written summaries.

Keywords: NLP, Natural Language Processing, AI, Artificial Intelligence, Fake News, Disinformation, Multilingual Text Classification

Title in Czech: Detekce fake news metodami zpracování přirozeného jazyka

Supervisor: Ing. Jan Drchal, Ph.D.
Contents

1 Introduction 1
1.1 Facebook–Cambridge Analytica data scandal 2
1.2 East StratCom Task Force 3
1.3 Definitions 4
1.4 Problem statement 4

2 Fake News detection 5
2.1 Linguistic characteristics and text classification 5
2.2 Other possible methods 6
2.2.1 Stance Detection 6
2.2.2 Collective user intelligence 7
2.2.3 Fake image detection 7
2.2.4 Source checking 8
2.2.5 Deep fake 8
2.2.6 Neural Fake News 8

3 Datasets 11
3.1 Kaggle 11
3.2 Unused datasets 11
3.2.1 Kaggle 11
3.2.2 BuzzFeed 12
3.2.3 LIAR 12
3.3 Self-obtained dataset 12
3.3.1 Crawling news 13
3.3.2 Czech news 15
3.3.3 Dataset-A 16
3.3.4 Dataset-EN 16
3.3.5 Dataset-RU 16
3.3.6 Dataset-M1 and Dataset-M2 16
3.3.7 Dataset-CZ-X 16
3.3.8 Datasets summary 17

4 Machine Learning with Sequential Data 19
4.1 Sequential data 19
4.2 Natural Language Processing 20
4.2.1 Bags-of-words 20
4.2.2 TF-IDF Vectors 22
4.2.3 Word Embeddings 22
4.2.4 Multilingual Word Embeddings (MWEs) 27
4.3 Artificial Neural Network - NN 30
4.4 Recursive vs Recurrent Neural Network 31
4.5 Recurrent Neural Network - RNN 31
4.5.1 Long Short-Term Memory - LSTM 32
4.5.2 Gated Recurrent Unit - GRU 32
4.6 Convolutional Neural Network - CNN 32
4.6.1 Attention mechanism 33

5 Experiments 35
5.1 Dataset and hardware 35
5.2 Neural Network models 35
5.2.1 Simple Sequential 36
5.2.2 Embedding layer 36
5.2.3 SimpleLSTM 37
5.2.4 SimpleCNN 37
5.2.5 TextCNN 39
5.2.6 2biLSTM with Attention 41
5.3 The effect of pre-processing 43
5.3.1 No restriction on vocabulary 44
5.3.2 Maximum number of features 44
5.3.3 Stopwords and maximum number of features 44
5.3.4 Stopwords only 45
5.3.5 Vocabulary restriction conclusion 45
5.3.6 Input length 45
5.3.7 Word embeddings 47
5.4 Initial experiments and pre-processing conclusion 48
5.4.1 Keras vs PyTorch 49
5.4.2 Summary 50

6 Experiments on self-obtained datasets 53
6.1 Word embeddings comparison 53
6.1.1 Pre-trained word embedding 53
6.1.2 Embedding comparison 54
6.2 Limitations 55
6.2.1 Small datasets 55
6.2.2 The length of input sequences 56
6.3 Dataset-A 57
6.4 Dataset-EN 57
6.4.1 Pre-train model on Dataset-A 57
6.5 Models comparison 58
6.6 Handling multilingual data 58
6.6.1 Dataset-M1 58
6.6.2 Dataset-M2 59
6.6.3 Unknown language test 59
6.7 Results 59
6.8 Czech datasets 60
6.8.1 Dataset-CZ-1 60
6.8.2 Dataset-CZ-2 61
6.8.3 Dataset-CZ-3 61
6.8.4 Dataset-CZ-23 62
6.8.5 Brief analyses of Czech disinformation websites 62
6.9 Should we trust the Neural Network prediction? 64
6.10 Summary 65

7 Outline 69
7.0.1 Libraries used in this project 70

8 Conclusion 73

A Grover example 75
B Attachments 77
C Bibliography 79
Figures

1.1 ’Fake News’ in Google Trends 2
1.2 Average daily media consumption worldwide [fEPSoE17] 3
4.1 Linear substructures: man and woman 25
4.2 Linear substructures: comparative superlative 26
4.3 Linear substructures: comparative superlative 27
4.4 An example of multilingual word embeddings in a single vector space 30
4.5 Mathematical model for a neuron 31
5.1 SimpleLSTM performance, input length of 300 words 38
5.2 SimpleCNN performance, input length of 300 words 39
5.3 TextCNN Architecture 40
5.4 TextCNN performance, input length of 300 words 41
5.5 2biLSTM with Attention performance, input length of 300 words 43
6.1 Dataset-CZ-2: confusion matrix (average values over 10 runs) 61
6.2 Dataset-CZ-3: confusion matrix (average values over 10 runs) 62
6.3 Dataset-CZ-23: confusion matrix (average values over 10 runs) 63
6.4 LIME explanation of an AZ247 (satire) article with highlighted words that make the article satire 66
6.5 LIME explanation of an AZ247 (satire) article with highlighted words that make the article disinformation 67

Tables

3.1 The most visited disinformation websites in the Czech Republic 15
3.2 Self-obtained datasets size and labels comparison 17
5.1 NN performance with the comparison of Count and TF-IDF Vectorizer on the Simple Sequential model 42
5.2 NN performance with no additional restriction on the vocabulary, input length of 300 words 44
5.3 NN performance with restriction on the vocabulary size only, input length of 300 words 44
5.4 NN performance with both stopwords and restriction on the vocabulary size used, input length of 300 words 45
5.5 NN performance with only stopwords, no restriction on the vocabulary size, input length of 300 words 45
5.6 The impact of pre-processing on the vocabulary size and length of the longest sequence in the corpus 46
5.7 NN performance with input length of 1000 words, pre-processed with stopwords, vocabulary restricted to the 5000 most frequent words 46
5.8 NN performance with GloVe embedding, input length of 300 words 47
5.9 NN performance with GloVe embedding, input length of 1000 words (* - this test was run on 4 CPUs) 48
5.10 A brief summary of the main pre-processing configurations tested 49
5.11 Keras (left part) vs PyTorch (right part) LSTM model; the first run is without pre-trained embedding (-), the second with GloVe 50
5.12 TextCNN accuracy with different embeddings; in the last two rows (with + embedding training), the embedding layer weights were also adjusted during the training 51
5.13 Training time per epoch (s) with pre-trained GloVe embedding 51
Chapter 1
Introduction
The New York Times defines Fake News as "a made-up story with an intention
to deceive." [Tim16]
The term "Fake News" began to be frequently used after the 2016 US
election, see Figure 1.1. Two separate investigations by the Guardian and
BuzzFeed identified at least 140 websites producing fake news aimed at US
citizens, all of them run from a small town in Macedonia. However, the
phenomenon itself is much older. A study of fake news on social media
in the 2016 election [AG17] mentions the Great Moon Hoax - a series of
six articles about a developed civilization on the Moon, beginning on
August 25, 1835. Regardless, we can find examples of what would be called
"Fake News" today even in ancient history; a famous one dates to the 13th
century BC, when the Egyptian pharaoh Ramesses II, also known as Ramesses
the Great, spread propaganda about a fabulous victory in the Battle of
Kadesh, although the Egyptian–Hittite peace treaty proves it was a stalemate.
In any case, this work focuses on the modern aspect of the phenomenon.
In the digital age, spreading disinformation became cheaper and easier.
Before, only media with sufficient resources were able to reach a large
audience. Today, anyone can be a producer for the masses. [fEPSoE18]
Last but not least, 2019 is likely to be the year when the internet
surpasses TV in average daily media consumption worldwide (see Figure 1.2).
Figure 1.1: ’Fake News’ in Google Trends (relative search interest, 2005–2019)
Figure 1.2: Average daily media consumption worldwide, in minutes per day, Internet vs. TV, 2011–2020 [fEPSoE17]
1.2 East StratCom Task Force

The East StratCom Task Force is the European Union team set up to counter
ongoing disinformation campaigns, established in March 2015. They published an
analysis called The legal framework to address "fake news": possible
policy actions at the EU level [fEPSoE18], which was one of the main
inspirations for this research. The paper describes the spreading of disinforma-
tion and summarizes key findings, including decreasing trust in the internet,
the low ability of its users to single out reliable news worthy of their attention,
and the problem of content bubbles. Additionally, they discovered that although
much disinformation is ignored, a part of it spreads quickly, affects public
opinion, and creates "noise" by giving many contradicting versions of a single
event, ultimately causing confusion and trouble for the users.
They also warned against addressing Fake News with a strict solution
that would amount to censorship.
In the long term, they predict that online news will be permeated by artificial
intelligence on both sides: one creating and sharing harmful content, the
other trying to keep internet journalism safe and reliable. Most
importantly, they claimed that AI-powered literacy is the most effective
response to disinformation.
Last but not least, the East StratCom Task Force kindly provided their
disinformation dataset for this research.
1.3 Definitions
By the New York Times definition, Fake News is a story with the intention
to deceive. The EU framework [fEPSoE18] works with the term disinformation
for news with a clear intention to manipulate public opinion. A similar term
- misinformation - stands for information that is not based on the truth
but was spread with no intention to manipulate public opinion. This
thesis does not investigate the intention of the news articles.
This work addresses the Fake News detection task as a text classification
problem with the aim to discover whether fake news is detectable based solely
on the specifics of the language used. It focuses on algorithms based on Deep
Neural Networks.
Chapter 2
Fake News detection
The Fake News detection problem can be divided into several sub-problems.
I will discuss the possible automation of some of them and attempts
that have already been made. Considering the success of Neural Networks in
similar tasks, it might be possible to achieve satisfying accuracy in many of
the sub-problems, or in all of them. The final score of an article could
then be established as a combination of these sub-problems. This approach also
minimises the risk of false-positive classification.
There are methods based on Natural Language Processing (NLP) such as
Stance Detection - estimating if two pieces of text (in this case, the headline
and article body) are related or not. Text classification can provide insight
into the article or the comment section.
By image processing, one can uncover photoshopped images or deep fake
videos (see 2.2.5). Having this information about article reliability, we can
further track the spreading of the article across social media and try to find
any differences between the spreading of fake and reliable news.
Paper [GBEF+18] examines whether there are any differences in the language of
fake news and satire. The work did not aim to do a deep linguistic analysis.
Primarily, they tried to find any linguistic characteristic, such as word usage
patterns, that would be specific to one of the two types of text. Therefore,
they did not use any advanced machine learning models and used bag-of-words
encoding. They achieved 79.1% accuracy with only a Multinomial Naive Bayes
algorithm and a PRC (Precision-Recall Curve) area of 0.882 on the Fake
or Satire task, and based on the preliminary results they suggested that it
seems possible to determine the theme (such as conspiracy theory) of an article
based on its word vector only. On the other hand, they used a dataset of
only 283 fake news articles and 203 satirical stories, which makes it hard to
tell if the model generalises enough or if it only learns a few specific words
for each category. For later research, it would be interesting to obtain
additional information about what the model bases its predictions on.
Text classification can provide fast, reliable information about the content
of an article. By visualising the features that led to the given output, we
can verify the correctness of the decision, or at least give more insight into
the problem.
This work addresses the Fake News detection task as a text classification
problem with the aim to discover whether fake news is detectable based solely
on the specifics of the language used. However, there are more approaches to
the detection. Those that have already been examined are briefly summarised in
this section. An efficient real-world Fake News detection system should then
be a combination of, if possible, all of them.
• Agrees: The body text agrees with the headline
• Disagrees: The body text disagrees with the headline
• Discusses: The body text discusses the same topic as the headline, but
does not take a position
• Unrelated: The body text discusses a different topic than the headline
The authors of [RD17] ran over 100 experiments to improve their models, but
only the BoW MLP achieved a better score than the baseline solution. This
model got 0.97 accuracy on the first level of scoring (related or unrelated
headline), 0.93 on the second level (classification of the related pairs) and
0.89 on the overall FNC-1 score - a 10% improvement compared to the baseline
solution. Surprisingly, none of their RNN architectures were able to beat the
baseline. They gave three possible explanations. First, since they were able
to achieve high accuracy on training data but never on validation data, they
assume RNNs would need a much larger dataset to generalise. [RD17] Second,
even advanced RNN models - Bi-dir LSTM and Dual GRU - struggle with forgetting
long-term information that may be crucial in an article classification problem.
[SHS01, RD17] Third, they plan to enhance their RNN models with attention
layers. [RD17]
Besides RNNs with attention, they also announced more development on the NLP
branch. Mainly, they plan to improve the dependency tree by weighting word
tokens by their inverse depth in the tree, hoping this helps to distinguish
words that are central to the meaning from those that are not. However, they
do not provide any proof for this hypothesis. [RD17]
The spreading of Fake News can be captured by three main characteristics:
• The text of the article
• The user response
• The source users promoting it
Although these patterns are only observable after the Fake News has spread and
been exposed to a considerable number of users, there are papers suggesting to
consider them, especially in combination with a User Response Generator
(URG) - a model that tries to predict user responses even before they are
available and uses them for early detection. [QGSL18] Furthermore, it seems
to be important to study URG since it has been shown that automatically
generated comments under news articles massively affect users’ acceptance
and trust. [Lee12]
On the other hand, this model relies deeply on collective user intelligence.
In the case of closed groups (i.e. groups in which users cannot view the group’s
content until they become members), this assumption can be untrustworthy.
Admins of such a group could filter its members and so also filter the discussed
topics.
A wrongly chosen picture can manipulate in the same way as a misleading
headline. Made-up stories can also use a picture from a completely different
occasion and use it as "proof", or photoshop it in such a way that it suits
their story and can be impossible to distinguish from an authentic picture.
This problem is even mentioned in The legal framework to address
"fake news": possible policy actions at the EU level published by the
European Parliament. [fEPSoE18]
There are several models that achieve good accuracy in the object detection
problem, mostly based on Convolutional Neural Networks (CNNs), such as
YOLO, R-CNN, Fast R-CNN, Mask R-CNN and others [ZZXW18]. Their
variations can be used to detect whether an image is relevant to the article
itself. A popular approach is to use pre-trained models in order to cut
training time and cost.
Moreover, there are other models capable of detecting whether a picture was
edited. Besides these NN tasks, we can also check online whether the same
image was used previously, check the source claimed in the article and compare
its content with the original post. The article could be considered unreliable
if it turns out that the picture was actually taken in a different place and
circumstances or at a different time than claimed.
2.2.4 Source checking

A proper news article should always contain its author and sources. Some of
the Fake News media have already learned this and create a chain of many
articles published on different websites to make it look like the story is
adequately covered by sources, while the original source is actually missing.
Therefore, automatic checking of sources could be an exciting feature to
include in the evaluation. This also applies to sources of images.
2.2.6 Neural Fake News

Neural Fake News is a relatively new term used for automatically generated
Fake News, typically produced by Neural Networks. A study [ZHR+19] presents
a model named Grover1 for controllable news article generation. It can generate
an entire article based on a small piece of input, such as a headline.

1 https://grover.allenai.org/
Chapter 3
Datasets
The Fake News Challenge - Stage 1 dataset only labels the relationship
between article headline and body, not considering whether it is fake or
credible. Also, the four classes are not equally distributed, with 73% in
the unrelated group, 18% labelled as discuss, and only 7% and 2% for the last
two categories, agree and disagree, respectively.
Fortunately, there are more datasets available.
3.1 Kaggle
This dataset consists of 20.8k entries in the training and 5200 entries in the
test group. It is available online at Kaggle as part of a public classification
competition that took place in 2018. It contains id, title, author, text, and
label:
• 1 - unreliable
• 0 - reliable
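As a sketch of how this dataset is typically prepared for classification (this is an illustration, not the thesis's actual code), the relevant columns can be pulled out with pandas; the column names match the list above, and `train.csv` is the competition's training file:

```python
import pandas as pd

def prepare(df):
    """Keep only the text and label columns of the Kaggle Fake News data
    (columns: id, title, author, text, label) and drop incomplete rows."""
    kept = df[["text", "label"]].dropna()
    # Label semantics from the competition: 1 = unreliable, 0 = reliable.
    return kept["text"].tolist(), kept["label"].tolist()

# Typical usage with the competition file:
# texts, labels = prepare(pd.read_csv("train.csv"))
```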
3.2 Unused datasets

This section contains a summary of other Fake News related datasets available
online which were considered for this thesis. None of them was used, but they
could be interesting for further research.

3.2.1 Kaggle
3.2.2 BuzzFeed
3.2.3 LIAR
This dataset was obtained from the fact-checking site PolitiFact. The site
collects public figures’ statements and labels them with one of the following
classes: pants-fire, false, barely-true, half-true, mostly-true, and true. It
also includes meta-data about the speaker’s party affiliation, current job,
home state, and credit history. In [Wan17], significantly better results were
reached when meta-data like this was taken into consideration. However, the
dataset contains only individual statements that are significantly shorter
than a news article. This fact makes the dataset inappropriate, because the
length of a text sequence is one of the main aspects to deal with in text
classification tasks; it is discussed in the following chapter.
The previous datasets are oriented on US politics news only. The topics
included can vary compared to European Fake News, and so any Neural Network
model trained on this data may not be accurate enough. Therefore, another
dataset was collected for this work. The EU vs Disinformation campaign6,
run by the European External Action Service East Stratcom Task Force,
publishes a database of disinformation cases; each case contains, among others:
• Issue
• Date
• Summary of the Disinformation

4 https://github.com/KaiDMML/FakeNewsNet/tree/master/Data/BuzzFeed
5 https://github.com/BuzzFeedNews/2016-10-facebook-fact-check/blob/master/data/facebook-fact-check.csv
6 https://euvsdisinfo.eu/about/
• title
• text
• description
• language
• URL
• publish date
• modify date
• filename
These texts are in a variety of languages. One of the limitations of the
news-please library is the fact that it detected the language in only 2790 of
5189 cases. The most frequent language was Russian, followed by English.
The Czech language was not detected a single time. The language of the
disinformation is also recorded in the "EU vs Disinformation campaign" data,
but in some cases there are several languages for one disinformation case, and
it is not clear which of the links leads to which language. However, for the
Czech language, the dataset contains 716 entries with 746 valid links. Only 491
lead to news articles, and another 11 of them lead to duplicated texts.
Therefore, only 480 original disinformation texts were found.
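The language bookkeeping described above can be sketched as follows (the `language` field name is hypothetical; news-please stores a detected language per article, or nothing when detection fails):

```python
from collections import Counter

def language_counts(articles):
    """Count detected languages across crawled articles.

    Articles where language detection failed carry no language field
    and are grouped under None.
    """
    return Counter(article.get("language") for article in articles)
```

On the crawl above, a tally like this would show that only 2790 of 5189 articles carry a language tag, with Russian and English the most frequent.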
Samples of reliable news were crawled from public service broadcasters: the
British BBC, one of the many German public service broadcasters - DW
(Deutsche Welle) - and the Czech CTK. Both BBC and DW provide their
services in a number of languages. Besides English, Russian articles were
downloaded as well, because Russian is the most common language in the EU vs
Disinformation campaign’s dataset.
Historical versions of the websites were visited using WebArchive7. The
first snapshot of every day in the interval between 01.01.2016 and
31.12.2018 was used to get the articles published in the main section of the
homepage of every broadcaster used (and its Russian language version).
Some news articles remained in the main section until the following day; these
duplicates were removed. Note that the interval corresponds to the first and
the last publish_date in the EU vs Disinformation campaign’s dataset. The
dataset also provides keywords for each disinformation case. Reliable news
that did not match any of these keywords was not used either.
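The snapshot-per-day crawl can be sketched as below. The Wayback Machine resolves a URL of the form `web.archive.org/web/<timestamp>/<url>` to the snapshot nearest to that timestamp, so a `YYYYMMDD` prefix lands on the first snapshot of that day; the deduplication and keyword-filtering helpers mirror the steps described above (function names are illustrative, not the thesis's code):

```python
from datetime import date, timedelta

WAYBACK = "http://web.archive.org/web/{ts}/{url}"

def snapshot_urls(site_url, start, end):
    """Yield one Wayback Machine snapshot URL per day in [start, end]."""
    day = start
    while day <= end:
        # A YYYYMMDD timestamp prefix resolves to that day's first snapshot.
        yield WAYBACK.format(ts=day.strftime("%Y%m%d"), url=site_url)
        day += timedelta(days=1)

def deduplicate(articles):
    """Drop articles that stayed on the front page into the next day's snapshot."""
    seen, unique = set(), []
    for article in articles:
        if article["title"] not in seen:
            seen.add(article["title"])
            unique.append(article)
    return unique

def matches_keywords(text, keywords):
    """Keep only articles sharing at least one EUvsDisinfo case keyword."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)
```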
In total, 11 981 unique articles were obtained from the English version of BBC
and 21 052 from DW. Due to the large number of articles on these websites,
only those that were in the popular section of BBC8 and in the main list of
articles on the English version of DW9 were downloaded. Most of the articles
contained at least one of the keywords (32 666 of 33 033), hence the final
dataset of reliable news in English consists of 32 666 articles.
In Russian, 5 510 unique articles were downloaded from BBC (from the main
grid of its Russian website10) and 1 382 from the Russian version of DW11,
giving a total of 6 892 reliable news articles in Russian, of which only 2 548
matched at least one of the keywords.
7 http://web.archive.org/
8 https://www.bbc.com/news/popular/read
9 https://www.dw.com/en/top-stories/s-9097
10 https://www.bbc.com/russian
11 https://www.dw.com/ru/s-9119
One of the current members of the CTK council - Petr Zantovsky - has tight
connections to the Czech disinformation scene. According to [Jan18], he
founded the so-called Independent Media Association (Asociace nezávislých
médií) together with Ondrej Gersl, Jan Koral and Stanislav Novotný. Ondrej
Gersl is the founder of the disinformation website AC2416 (see Table 3.1
containing a list of the most visited disinformation websites in the Czech
Republic) and LajkIt17. Jan Koral is the founder of another disinformation
website, NWOO18.

12 https://domaci.ihned.cz/c1-66477660
13 https://similarweb.com
14 https://www.ctk.eu/
15 https://www.ctk.eu/about_ctk/
16 https://ac24.cz/
3.3.3 Dataset-A

The first dataset treats the texts in the ’Disproof’ column as reliable samples
and ’Summary of the Disinformation’ as unreliable. Only rows having more
than 32 characters in both of these columns were used; shorter cells usually
contained meaningless information such as "No proof for this." or "No evidence
given."
The total length of this dataset is then 10 570 with a 50:50 ratio, and
everything in it is in English.
This dataset is probably harder to classify because it does not contain the
disinformation text itself, only a summary of it. Moreover, the two columns
(’Disproof’ and ’Summary of the Disinformation’) describe the very same
issue and are shuffled randomly, therefore a deep text understanding is needed.
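Under the column names quoted above ('Disproof', 'Summary of the Disinformation'), the construction of Dataset-A can be sketched with pandas; this illustrates the filtering rule, not the thesis's actual script:

```python
import pandas as pd

MIN_CHARS = 32  # shorter cells are boilerplate like "No proof for this."

def build_dataset_a(df):
    """Pair 'Disproof' (reliable, label 0) with
    'Summary of the Disinformation' (unreliable, label 1),
    keeping only rows where both cells exceed MIN_CHARS characters."""
    long_enough = (
        (df["Disproof"].str.len() > MIN_CHARS)
        & (df["Summary of the Disinformation"].str.len() > MIN_CHARS)
    )
    kept = df[long_enough]
    texts = kept["Disproof"].tolist() + kept["Summary of the Disinformation"].tolist()
    labels = [0] * len(kept) + [1] * len(kept)  # balanced 50:50 by construction
    return texts, labels
```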
3.3.4 Dataset-EN

3.3.5 Dataset-RU

3.3.6 Dataset-M1 and Dataset-M2

Two multilingual datasets were created from 3 066 disinformation texts. The
first one (Dataset-M1) was completed with 3 066 random articles from the
English reliable news corpus, the second (Dataset-M2) with both the English
and Russian reliable news corpora in a 1/3 ratio.
3.3.7 Dataset-CZ-X

Reliable articles in the Czech language were obtained from CTK within the
TAČR TL02000288 project. There were 461 unreliable articles in the Czech
language in the EU vs Disinformation dataset. Only 461 CTK articles were
used, filtered by the EU vs Disinformation keywords and the same publication
date period, to keep the dataset balanced. Dataset-CZ-1, of size 922, contains
these two groups.
Dataset-CZ-2 is enriched with a discuss category - articles crawled from the
disinformation websites ParlamentniListy.cz and AC24.cz (see Table 3.1 of the
most visited disinformation websites in the Czech Republic).
17 https://www.lajkit.cz/
18 http://www.nwoo.org/
19 https://www.mvcr.cz/cthh/clanek/dezinformacni-kampane-dokumenty-a-odkazy-dokumenty-a-odkazy.aspx
20 https://az247.cz/
Chapter 4
Machine Learning with Sequential Data
This chapter introduces Machine Learning and its techniques capable of
sequential data classification. Furthermore, Natural Language Processing
and Neural Networks are discussed.
Machine Learning stands for the ability of systems to automatically learn
and improve based on previous observations and experience instead of relying
on specific instructions. There are three main types of learning - supervised
learning, unsupervised learning and reinforcement learning.
Supervised learning uses a labelled dataset; the system observes the
input-label pairs and tries to find a mapping function between them.
With more input data, the function becomes more and more general and
later is able to give the right prediction for inputs that have not been observed
before.
Unsupervised learning systems can learn even without labelled input.
The so-called "AI bible" [RN09] gives a taxi agent as a good example: the
agent can gradually learn a concept of a "good traffic day" and a "bad traffic
day" without being explicitly told (by an input label) which day had "good"
traffic and which day "bad".
Reinforcement learning is based on reinforcement by rewards or punish-
ments. In the taxi-driver example, the agent can observe that it did something
good when it gets an unusually large tip and something bad if it does not
get a tip at all. After more samples, it can determine which actions led to
large-tip journeys and no-tip journeys respectively.
The news classification problem is clearly an example of supervised learning.
On the other hand, some of the pre-trained word embedding models discussed
later were trained using unsupervised learning.
4.2 Natural Language Processing
The next step is encoding words into numerical values or vectors. Several
algorithms exist for this.
4.2.1 Bags-of-words
Bag-of-words is a simple representation in which the input text is transformed
into a multiset of words, called a bag. A multiset keeps information
about multiplicity: it stores tuples of words and the number of their occurrences
in the sequence. Given a map storing which word corresponds to which
index, the multiset can be used as a simple vector. The value is 0 for words of
the dataset’s vocabulary that are not in the sequence. However, this also means
that all information about grammar and word order is lost, and for a more
extensive dataset with a rich vocabulary, most values of short sequences will be 0.
A document can be converted into the Bag-of-Words encoding using
CountVectorizer from the sklearn library. Its method fit_transform(self,
raw_documents[, y]) takes a list of documents as a parameter, returns a
term-document matrix and learns the vocabulary, which can be obtained by
calling get_feature_names(self).
Each row of the matrix stands for one document, and each element of the
row represents the count of the word at the corresponding index of the vocabulary.
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the',
 'third', 'this']
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
Vectorizer options
The vectorizer can also be set to a binary mode; if so, all non-zero values are
set to 1. The vocabulary can be restricted to only the most frequent
words by the max_features parameter. With max_df and min_df, terms
having a document frequency strictly higher or lower, respectively, are ignored.
On the tiny example above, min_df can be set to 3, meaning only words
appearing in at least three documents will be considered. It would lead to
keeping only the words ’document’, ’is’, ’the’ and ’this’, which is definitely not
representative enough. However, in a larger corpus, this can efficiently filter
the vocabulary from words with insignificant impact.
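These options can be sketched on the same toy corpus (a minimal illustration; note that min_df and max_df count document frequency, i.e. the number of documents a term appears in):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Binary mode: every non-zero count becomes 1.
binary = CountVectorizer(binary=True)
Xb = binary.fit_transform(corpus)
print(Xb.toarray().max())  # 1, although 'document' occurs twice in one text

# min_df=3: keep only terms that appear in at least 3 documents.
filtered = CountVectorizer(min_df=3)
filtered.fit(corpus)
print(sorted(filtered.vocabulary_))  # ['document', 'is', 'the', 'this']

# max_features=2: keep only the 2 most frequent terms of the corpus.
top = CountVectorizer(max_features=2)
top.fit(corpus)
print(len(top.vocabulary_))  # 2
```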
4.2.2 TF-IDF Vectors
TF-IDF Vectors are a more complex representation that does not rely on word
counts only. A TF-IDF Vector consists of two parts: TF stands for Term
Frequency and IDF for Inverse Document Frequency, defined by the following
formulas:
$$\mathrm{TF}(t) = \frac{\text{number of occurrences of term } t \text{ in the document}}{\text{total number of terms in the document}}$$

$$\mathrm{IDF}(t) = \ln\left(\frac{\text{total number of documents}}{\text{number of documents containing term } t}\right)$$
Matrix representation in different levels
. Words level - represents TF-IDF scores of terms;
. N-gram level - represents TF-IDF scores of N-Grams;
. Character level - represents TF-IDF scores of character level n-grams;
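As a sketch, the same toy corpus can be transformed into word-level TF-IDF vectors with sklearn's TfidfVectorizer (note that sklearn uses a smoothed IDF and L2-normalizes each row, so the values differ slightly from the plain formulas above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Word level; ngram_range=(2, 2) would give bigram-level scores instead,
# and analyzer='char' with an ngram_range gives character-level n-grams.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print(X.shape)  # (4, 9): one row per document, one column per term
# The rare term 'second' scores higher than the common term 'is' in document 2:
print(X[1, vectorizer.vocabulary_['second']] > X[1, vectorizer.vocabulary_['is']])
```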
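The integer-sequence encoding discussed below was likely produced along these lines; here is a minimal sketch that reuses CountVectorizer's alphabetically ordered vocabulary as the word_index dictionary (an assumption consistent with the indices quoted in the text):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
vectorizer.fit(corpus)
word_index = vectorizer.vocabulary_     # word -> integer index
tokenize = vectorizer.build_analyzer()  # same tokenization as the vectorizer

# Encode every document as the sequence of its word indices.
sequences = [[word_index[w] for w in tokenize(doc)] for doc in corpus]
print(word_index)
print(sequences[0])  # [8, 3, 6, 2, 1] -> 'this is the first document'
```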
The code above shows the transformation on the same corpus as was used
for the Bag-of-Words example before. First, each word of the vocabulary is
stored in the word_index dictionary. Again, each line stands for one document
from the corpus. The elements of the line, on the other hand, do not give the
count of the word at the same position in the vocabulary (as in Bag of Words)
but the index of the word itself. In the example above, the first element of the
first line is 8, which stands for ’this’, as can be seen in the word_index
dictionary. In the same way, the original sentence ’this is the first document’
can be recovered.
If needed, a special number without any meaning (not contained in the
dictionary) can be added to the sequences (usually at the beginning) to make
them all the same length.
The benefit of this encoding compared to Bag of Words is the fact that it
maintains the order of the words instead of only counting their occurrences.
Finally, everything is prepared to load the pre-trained embedding. It is
stored in a matrix in which every row corresponds to a word and every column
to a dimension. Typically, 300 dimensions are used to save computation cost,
although some pre-trained embeddings are available in higher dimensions.
The matrix is very large, and for many applications it is not
necessary to load it fully.
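A minimal sketch of loading only the required rows (assuming the standard GloVe text format of one word followed by its values per line; word_index is the dataset's word-to-index mapping):

```python
import numpy as np

def load_embedding_matrix(path, word_index, dim=300):
    """Build an embedding matrix with one row per word index; rows of
    words missing from the embedding file stay zero."""
    matrix = np.zeros((len(word_index) + 1, dim), dtype='float32')
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            word, values = parts[0], parts[1:]
            # Keep only words that actually occur in the dataset's vocabulary.
            if word in word_index and len(values) == dim:
                matrix[word_index[word]] = np.asarray(values, dtype='float32')
    return matrix
```

Because only the rows for words present in the dataset's vocabulary are kept, the full multi-million-row matrix never has to be held in memory.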
...
between a pair of word vectors corresponds to their linguistic or semantic
similarity. For example, in the pre-trained GloVe embedding, the nearest
neighbours of the word "frog" are:
1. frogs
2. toad
3. litoria
4. leptodactylidae
5. rana
6. lizard
7. eleutherodactylus
Note that the closest one, in this case, is the plural form, followed by frog
species, genera, etc.
The second highlight, linear substructures, follows from the previous
metric property. The vector difference of a word pair is roughly equal to the
difference of other pairs related in the same way. It is easy
to understand these substructures on the example of the word pair "man"
and "woman" in figure 4.1. Not only are the vector differences
similar for other word pairs, but gender-specific terms for family
members are also grouped in one place (sister - brother, niece - nephew, aunt - uncle) and
terms describing members of royal families (queen - king, duchess - duke, and
more) in another. Another interesting pattern is the superlative-comparative
relation in figure 4.2. Thanks to these properties, much more comprehensive
information about the words and their meanings is captured.
Pre-trained GloVe word vectors glove.840B.300d by the authors [PSM14b]
are available online1 . They were trained on Common Crawl data and contain
840B tokens and a cased vocabulary of 2.2M words, with 300-dimensional vectors.
1 http://nlp.stanford.edu/data/glove.840B.300d.zip
Figure 4.1: Linear substructures: man and woman [PSM14b]
FastText
Figure 4.2: Linear substructures: comparative superlative [PSM14b]
FastText represents each word as a bag of character n-grams, plus the word
itself as one additional special sequence. To distinguish the beginning and end
of words, special boundary symbols < and > are used. For example, the word
"where" with 3-grams on the character level is represented as <wh, whe, her,
ere, re>, together with the special word-level sequence <where>.
On the character level, there are all 3-grams of the word,
including the sequence "her", which would be misleading without the
boundary symbols < >. As a result, FastText computes a vector even for words
that are not in the training dataset and keeps information about the morphology
of words, which is especially important for languages with large vocabularies
containing many rare words and for languages that frequently use prefixes and
suffixes to change the meaning or the emotional colouring of words.
Paper [GBG+ 18] introduces word embeddings for 157 languages trained
using FastText on Wikipedia and roughly 24 terabytes of raw text data from
Common Crawl - a non-profit organization that crawls the web regularly and
provides fresh data to the public every month. The FastText embeddings are
the current state of the art for word embeddings [GBG+ 18, MGB+ 18, BGJM16].
Figure 4.3: Linear substructures: comparative superlative [PSM14b]
For the word ’frog’, the nearest neighbours in English are similar to those in
the GloVe embedding. An interesting detail is the close connection to ’tadpoles’ -
the larval stage in the life cycle of frogs and toads.
. 1.0000 - frog
. 0.7798 - toad
. 0.7620 - frogs
. 0.6847 - toads
. 0.6367 - tadpoles
. 0.6254 - salamander
. 0.5999 - salamanders
. 0.5786 - boulenger
. 0.5627 - amphibian
. 0.5606 - snake
In Spanish, the nearest neighbours of ’frog’ (by the position of its word
vector in the common space) are the following:
. 0.5785 - rana
. 0.5471 - ranas
. 0.5440 - sapo
. 0.5124 - bufonidae
. 0.5121 - tortuga
. 0.5081 - frog
. 0.5063 - lagartija
Note that the distance 0.5785 between ’frog’ (English subset) and ’rana’
(Spanish subset) is smaller than the distance from ’frog’ to ’amphibian’,
which are both in the English subset. The connection is also visible in
visualization 4.4: ’frog’, ’rana’, ’toad’ and ’sapo’ lie in one part of the projection,
the plural forms ’frogs’ and ’ranas’ in another but not far away, and the words
’love’ and ’amor’ in the opposite corner.
Finally, in the Czech language, the closest words for ’frog’ are:
. 0.4707 - hadů (genitive plural) snakes
. 0.4706 - žába frog
. 0.4559 - klokan kangaroo
In the Spanish subset:
. 0.5074 - nutria
. 0.4527 - rana
. 0.4515 - gato
. 0.4488 - sapo
. 0.4457 - ardilla
. 0.4422 - ranas
. 0.4418 - comadreja
Optimally, the terms ’rana’ and ’ranas’ would be nearer than ’nutria’, but
there is still a connection between the words, which supports the training
hypothesis that words used in the same context tend to have a similar meaning.
Figure 4.4: Visualization of the multilingual word embedding space (’frog’,
’rana’, ’toad’ and ’sapo’ are projected close together, the plural forms ’frogs’
and ’ranas’ nearby, and ’love’ and ’amor’ in the opposite corner)
$$a_j = g\left(\sum_{i=0}^{n} w_{i,j}\, a_i\right)$$
Learning algorithms adjust the weights, including the bias weight, in order to
change the behaviour of the unit; this is how the neuron learns. Finally,
connecting these neurons and the other types of units discussed in this section
forms the (artificial) neural network. [RN09]
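As a sketch, the unit's computation for a single neuron with a sigmoid activation g (the bias is modelled as a fixed input a_0 = 1 with its own weight, matching the formula above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unit_output(weights, activations):
    # a_j = g(sum_i w_ij * a_i); activations[0] = 1 is the fixed bias input
    return sigmoid(np.dot(weights, activations))

a = np.array([1.0, 0.5, -0.2])  # bias input a_0 = 1 and two real inputs
w = np.array([0.1, 0.4, 0.3])   # bias weight w_0 and two input weights
print(unit_output(w, a))        # sigmoid(0.24), roughly 0.56
```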
Figure 4.5: Mathematical model for a neuron [RN09]
An RNN has directed cycles in its connection graph, from the old to the new
inner state. Unlike feed-forward networks, this feature allows an RNN to store
an inner state, making it able to pass information across sequential steps.
Therefore, RNNs are suitable for tasks where data points are related, such as
video processing (each frame depends on a series of previous and following
frames), audio processing or, in this case, text processing. In these applications,
data frames are not independent. It is necessary to process the text as a sequence
and carry the inner state through the entire document to get the
desired level of understanding.
On the other hand, it turns out to be hard to write an efficient learning
algorithm for RNNs, mainly due to the vanishing/exploding gradient problem
and oscillating weights.
The problem is challenging since it is the main feature of RNNs that causes
the vanishing gradient. In order to preserve the inner state, long-term
information has to pass through all the cells over and over again through the
loop in the directed graph forming the RNN. That also means it can be multiplied
by a number close to zero many times during learning, and so the
gradient can vanish.
Another problem is the frequent occurrence of over-fitting in RNN applications.
Back Propagation Through Time (BPTT) is an extension of Back
Propagation capable of modelling time - the algorithm most often used to train RNNs.
Thanks to the development of new learning algorithms and advanced architectures,
especially the Long Short-Term Memory (LSTM) introduced by Hochreiter and
Schmidhuber, many studies developing more and more complex LSTM-based
networks for natural language processing have been published.
They proposed additional memory gate units responsible for the
inner state. The input gate unit protects the inner state from perturbation
by irrelevant input; the output gate unit protects other units in the network
from irrelevant memory content. [SH97]
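For reference, the gate computations can be written as follows (this is the commonly used modern formulation, which adds a forget gate to the original 1997 design; the notation is standard rather than taken from the thesis):

```latex
\begin{align*}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{align*}
```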
Although CNNs are widely used in image processing tasks, their application
is not limited to it. In the last few years, several new methods using CNNs
for sequential data processing have been published. Before diving deeper,
let us recall how a simple CNN for image classification works. Consider
an image as a matrix, where each value is a three-dimensional vector with
components in the interval from 0 to 255, representing the RGB colour model.
For simplicity, consider just grey-scale images with one channel only instead
of three RGB channels. Then each value of the matrix is a single number,
usually normalized to the interval from 0 to 1.
The convolution layer helps to understand a word within its context. The
pooling layer extracts essential words or phrases and is usually followed by a
fully-connected layer.
state-of-the-art results on several datasets, including the News Articles dataset
built from the CNN and Daily Mail websites. The article body was used as the
context, and the question was formed from the headline. This suggests that the
Attention Sum Reader model could also be efficient for text classification, since
it was able to extract the main "topic" from the text and compare it with the
headline. A disadvantage in question-answering applications is that this model
cannot give an answer that is not included in the text; however, this limitation
is irrelevant for text classification tasks.
Chapter 5
Experiments
All experiments in this chapter were done on the Kaggle dataset (3.1) of 20.8k
entries. The NN models and many preprocessing approaches were compared on
this single dataset to choose which of them to use in further research. The dataset
was divided into a train and a test part in a 0.33 ratio, and 100 samples from the
train part were used for validation (and for early stopping). Every
model in these initial experiments was designed in Keras and trained using
one GeForce GTX 1080 GPU, 2 CPUs and 12 GB of RAM (if not specified
differently).
The input documents were shortened to a length of 300 encoded words
(in some tests to 1 000 or 3 000 words, or even full sequences were processed) to
improve the training speed. The cut was always done as the last step of
preprocessing. The effects of different preprocessing steps are part of the
discussion later.
Each experiment ran for 15 training epochs, with the cross-entropy loss function
and the Adam optimizer. The validation set was used for early stopping, but
when time was measured, the early-stopping function was turned off.
In this section, the models used for the initial experiments are presented. The
input sequences were pre-processed as described in 4.2.1 with the min_df=3 and
max_df=0.9 options. In order to speed up the training and save computing cost,
the input length is cut to 300 words (if not specified differently). The cut was
always done as the last step of preprocessing.
5.2.1 Simple Sequential
The input text is reduced by removing stopwords from the nltk library and
transformed into the Bag-of-Words representation by CountVectorizer from the
sklearn library, as described in 4.2.1.
The network is a classical NN with only one fully-connected hidden layer
(called Dense in the Keras summary 5.2.1) consisting of 256 neurons with the
ReLU activation function, followed by an output layer with one neuron and a
sigmoid activation function giving the true/false classification. The same
output layer is used in every other model tested in this chapter.
Layer (type) Output Shape Param #
==============================================================
dense_1 (Dense) (None, 256) 13671936
______________________________________________________________
dense_2 (Dense) (None, 1) 257
==============================================================
Total params: 13,672,193
Trainable params: 13,672,193
Non-trainable params: 0
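The summary above can be reproduced with a short sketch (the input width of 53 405 Bag-of-Words features is an assumption inferred from the first layer's parameter count: 256 x (53 405 + 1) = 13 671 936):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense

VOCAB_SIZE = 53405  # assumed Bag-of-Words dimension after filtering

model = Sequential([
    Input(shape=(VOCAB_SIZE,)),
    Dense(256, activation='relu'),     # single fully-connected hidden layer
    Dense(1, activation='sigmoid'),    # true/false output
])
model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])
```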
This simple model was very fast to train even with the full input length (the
longest sequence after preprocessing is 11 985 words long). It performed with
an unexpectedly high accuracy of 0.97 on the test set, even though it never passed
0.96 accuracy on the validation set (which might be caused by the validation
set's small size). The training loss kept decreasing down to 9.6642e-05 after 15
epochs; not so the validation loss, which got slightly worse during training,
ending at 0.2. Neither more training epochs nor hyperparameter tuning brought
any significant improvement.
These results were obtained on a Bag-of-Words matrix, which was only
restricted by NLTK stopwords, min_df=3, and max_df=0.9, as explained in
4.2.
On the other hand, if the input length is cut to 300 words only, the accuracy
drops to 0.88.
The Bag-of-Words representation obtained by CountVectorizer is based on
word frequencies only, without any other information. That means the
classification is made solely on the basis of the vocabulary, showing that, at
least in this dataset, the vocabulary of Fake News differs a lot from the
vocabulary of reliable news.
5.2.3 SimpleLSTM
The model was extended with a 300-dimensional Embedding layer, and the
fully-connected layer was replaced by an LSTM (see 4.5.1) with the same
number of units (256).
In the Keras summary 5.2.3, the output shape of the input layer equals the
length of the input. The following embedding layer turns each (integer-encoded)
word of the input sequence into a 300-dimensional dense vector, shaping its
output into a 300 (input length) x 300 (embedding dimension) matrix.
Layer (type) Output Shape Param #
==============================================================
input_1 (InputLayer) (None, 300) 0
______________________________________________________________
embedding_1 (Embedding) (None, 300, 300) 16021500
______________________________________________________________
lstm_1 (LSTM) (None, 256) 570368
______________________________________________________________
dense_1 (Dense) (None, 1) 257
==============================================================
Total params: 16,592,125
Trainable params: 16,592,125
Non-trainable params: 0
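This summary can be reproduced with the following sketch (the vocabulary size of 53 405 is an assumption inferred from the embedding parameter count, 16 021 500 / 300):

```python
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

MAX_LEN = 300       # input length in encoded words
VOCAB_SIZE = 53405  # assumed: 53405 * 300 = 16 021 500 embedding params

inputs = Input(shape=(MAX_LEN,))
x = Embedding(VOCAB_SIZE, 300)(inputs)  # word index -> 300-dim dense vector
x = LSTM(256)(x)                        # 4 * ((300 + 256 + 1) * 256) params
outputs = Dense(1, activation='sigmoid')(x)
model = Model(inputs, outputs)
model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])
```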
With the input sequence restricted to 300 words, this model ends up with 0.927
accuracy on the test set (higher than the 0.88 of the Simple Sequential model
with CountVectorizer and the same input length) and 0.92 on the train set. As
can be seen in figure 5.1, after eight training epochs it managed to perform
with 1.0 accuracy on the train set. However, it never broke 0.95 accuracy on
the validation set. The network probably was not able to generalize. [RD17]
Moreover, training was significantly slower. It took about 1 minute
per epoch on the short input (cut to 300 words), whereas the Simple
Sequential model needed just a few seconds for full-size data.
5.2.4 SimpleCNN
Now, there are the same input and embedding layers as in the previous model,
followed by a one-dimensional CNN as described in 4.6, ending with a
fully-connected layer and an output layer with the same parameters as in the
first model of this chapter.
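A sketch of such a model follows (the filter count, kernel size and hidden width are assumptions; the thesis does not list them):

```python
from tensorflow.keras.models import Model
from tensorflow.keras.layers import (Input, Embedding, Conv1D,
                                     GlobalMaxPooling1D, Dense)

MAX_LEN, VOCAB_SIZE, EMB_DIM = 300, 53405, 300  # VOCAB_SIZE assumed

inputs = Input(shape=(MAX_LEN,))
x = Embedding(VOCAB_SIZE, EMB_DIM)(inputs)
x = Conv1D(128, 5, activation='relu')(x)  # 1-D convolution over word windows
x = GlobalMaxPooling1D()(x)               # keep the strongest feature per filter
x = Dense(256, activation='relu')(x)      # fully-connected layer
outputs = Dense(1, activation='sigmoid')(x)
model = Model(inputs, outputs)
```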
Figure 5.1: SimpleLSTM performance (train/validation accuracy and loss over
15 epochs), input length of 300 words
The input was again cut to 300 vectors in the same way as for the LSTM
model, which this model outperformed with a final accuracy of 0.954 on the
test set and a loss of 0.24. As figure 5.2 shows, it also learned faster, as the
Figure 5.2: SimpleCNN performance (train/validation accuracy and loss over
15 epochs), input length of 300 words
training accuracy rose to 1.00 right at the end of the fourth epoch, with a
loss of 4.4330e-04 that kept decreasing down to 9.6733e-06 at the end of the
15th and last epoch. Moreover, the training time was significantly lower, as each
epoch took only 2-8 seconds. The validation loss, on the other hand, got
slightly worse over the course of training.
Next, the model was trained on full-size input sequences with a vocabulary
of the 5000 most common words. The number of trainable parameters rose
to 28,794,017 (including the embedding layer), and the test accuracy rose
to 0.96.
5.2.5 TextCNN
This model was inspired by the paper [Kim14]. The input and embedding
layers are the same as before, but instead of just one Convolution and
MaxPooling layer, there are four of them, connected in parallel as shown in
the Keras summary below (note the ’Connected to’ column) and figure 5.3.
The input sentence matrix of dimensions n x k is a representation of the input
sequence (article), where n is the length of the input sequence and k is the
dimension of the word embedding. In this example, the first row of the matrix
is the loaded 300-dimensional embedding vector of the word "wait". The
convolutions are computed as described in 4.6, but several convolutional layers
are connected in parallel.
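The parallel branches can be sketched with the Keras functional API as follows (the four kernel sizes and the filter count are assumptions, not values from the thesis):

```python
from tensorflow.keras.models import Model
from tensorflow.keras.layers import (Input, Embedding, Conv1D,
                                     GlobalMaxPooling1D, Concatenate, Dense)

MAX_LEN, VOCAB_SIZE, EMB_DIM = 300, 53405, 300  # VOCAB_SIZE assumed

inputs = Input(shape=(MAX_LEN,))
embedded = Embedding(VOCAB_SIZE, EMB_DIM)(inputs)

# Four convolution + max-pooling branches connected in parallel.
branches = []
for kernel_size in (2, 3, 4, 5):  # assumed kernel sizes
    conv = Conv1D(128, kernel_size, activation='relu')(embedded)
    branches.append(GlobalMaxPooling1D()(conv))

merged = Concatenate()(branches)
outputs = Dense(1, activation='sigmoid')(merged)
model = Model(inputs, outputs)
```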
Figure 5.3: Model architecture with two channels for an example sentence
("wait for the video and do n't rent it") [Kim14]
Figure 5.4: TextCNN performance (train/validation accuracy and loss over
15 epochs), input length of 300 words
Once again, the input and embedding layers are the same. Two bi-directional
LSTM layers are followed by an Attention layer. There is no implementation of
an attention layer in Keras yet; therefore, a third-party open-source
implementation was used [ker19].
Vectorizer comparison
CountVectorizer TFIDFVectorizer
acc 0.8842 0.9620
time per epoch (s) 6.0 6.1
Table 5.1: NN performance with the comparison of Count and TFIDF Vectorizers
on the Simple Sequential model
Figure 5.5: 2biLSTM with Attention performance (train/validation accuracy
and loss over 15 epochs), input length of 300 words
SimpleLSTM SimpleCNN TextCNN Attention
acc 0.9269 0.9543 0.9677 0.9268
loss 0.4791 0.2399 0.1032 0.4810
training loss 4.2e-5 9.7e-6 1.8e-4 6.3e-4
validation loss 0.3250 0.2136 0.2195 0.2187
time per epoch (s) 53.5 2.4 4.3 230.9
Table 5.3: NN performance with restriction on the vocabulary size only, input
length of 300 words
The first, baseline run uses no restrictions besides the settings mentioned
above; the results are in table 5.2.
Using both stopwords and 5000 as the maximum number of features gives
the results in table 5.4.
Table 5.4: NN performance with both stopwords and the restriction on the
vocabulary size used, input length of 300 words
Another test was done to distinguish between the effect of stopwords and the
restriction on the maximum number of features, this time with stopwords
only; the results can be seen in table 5.5.
The original corpus consists of 146 211 terms (words and numerical values).
Table 5.6 shows the effect of preprocessing on vocabulary size and the length
of the longest sequence.
All the experiments described in section 5.3 show that cutting the vocabulary
size from 146 211 to as low as 5 000 has a negligible effect on the final
accuracy. This preprocessing cuts off a large number of numerical values from
the corpus. It is common practice to do so in text classification tasks, and it
works well on this dataset. The results show there is no need to use the
entire vocabulary for every experiment, but it might be beneficial for the final
model even at the price of higher computation cost - especially when using
an evidence-aware model, which might uncover even small lies in numbers.
The experiments on vocabulary size, cutting it from 146 211 to 5 000, have
shown little or no impact on the final accuracy on the test dataset and
on the loss. There was no noticeable impact on the training time. However,
vocabulary size max sequence length
No pre-processing 146 211 23 374
+ min_df=3 53 608 23 077
+ max_df=0.9 53 603 19 741
+ stopwords 53 464 11 985
+ max_features=5000 5 000 9 830
Table 5.6: The impact of pre-processing on the vocabulary size and length of
the longest sequence in the corpus
Table 5.7: NN performance with input length of 1000 words, pre-processed with
stopwords, vocabulary restricted to 5000 most frequent words
Table 5.8: NN performance with GloVe embedding, input length of 300 words
The problem of long sequences could be worked around by the following steps:
. use reasonably long sequences for training
. split sequences for classification into smaller sub-sequences
. if any of them is predicted to be unreliable, mark the entire sequence as
unreliable.
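The steps above can be sketched as follows (predict_proba is a hypothetical function returning P(unreliable) for one encoded sub-sequence; it is not part of the thesis code):

```python
def classify_long_sequence(sequence, predict_proba, window=300, threshold=0.5):
    """Split an encoded article into fixed-size sub-sequences and mark the
    whole article unreliable if any sub-sequence is predicted unreliable."""
    chunks = [sequence[i:i + window] for i in range(0, len(sequence), window)]
    return any(predict_proba(chunk) > threshold for chunk in chunks)
```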
So far, the parameters of the Embedding layer were trained together with the
succeeding layers. In the following tests, a pre-trained embedding is used and is
not trainable.
Both stopwords and maximum features restriction were used with the same
vectorizer options as before, but pre-trained word embeddings were loaded
and used as weights of the Embedding layer.
GloVe
First, the GloVe (4.2.3) glove.840B.300d embedding was used with an input
length of 300. This was the first test in which the LSTM-based models (LSTM
and Attention) outperformed SimpleCNN and were competitive with TextCNN,
as can be seen in table 5.8. However, there is still a huge gap between 1 s per
epoch of training for TextCNN and 231.9 s for the Attention network.
In any case, this test shows that RNN/LSTM layers are still powerful tools,
and they are used in the next experiments.
The same test was run once again with an input length of 1000; see
the results in table 5.9.
*The Attention model was trained using 4 CPUs; the other hardware
specifications stayed unchanged. Even with double the CPU power, it took
781.4 s per epoch to train, and the final accuracy was still lower compared to
the TextCNN model, which learned almost ten times faster.
The longer input length brought almost no improvement to the LSTM
model, possibly due to losing long-term information in the LSTM layer. On
the other hand, the CNN-based and Attention models ended up with higher accuracy
SimpleLSTM SimpleCNN TextCNN Attention
acc 0.9627 0.9538 0.9815 0.9778*
loss 0.1053 0.2299 0.0686 0.0653*
training loss 0.0982 1.4e-4 0.0021 0.0485*
validation loss 0.0917 0.1019 0.0134 0.0226*
time per epoch (s) 185.1 2.3 8.3 781.4*
Table 5.9: NN performance with GloVe embedding, input length of 1000 words.
* - this test was run on 4 CPUs
compared to the 300-word run. The training of the TextCNN model is still
extremely fast - 8.3 s.
The experiments showed that even a strict restriction on the vocabulary does
not influence the results significantly. The length of the longest article is
less than half of the original when encoded using a filter on the least and
the most frequent words (min_df, max_df) together with stopwords. The
original length was 23 374; after the preprocessing, it is only 11 985. With
the additional filter of a maximum of 5000 words, the length is 9 830, which
does not bring big savings, and the classification performance was slightly
worse compared to using stopwords and min_df, max_df only. Therefore,
the maximum number of features is not restricted in future experiments.
A noteworthy test was training with a longer input length: mainly the CNN-based
networks gave significantly better results and kept their fast training performance.
On the other hand, the training time of the LSTM-based models rose to
hundreds of seconds per epoch (compared to less than 10 s needed for the CNNs),
and their accuracy did not improve (in the LSTM case) or did not improve as much
as expected (in the Attention case). Both LSTM models (LSTM, Attention) needed
more hardware resources for inputs longer than 1000 vectors.
The most important test was comparing a pre-trained embedding with one
learned on the dataset. Using an embedding pre-trained on an enormous amount
of data brought a more considerable improvement than any other change tested.
Another interesting observation was the success of the Simple Sequential
model, showing that there are notable differences between the vocabularies of
Fake News and reliable news. However, this is a low-level understanding and
can be dataset-specific. The authors of the dataset used in this chapter do
not give any information about which types of news were used as "reliable"
samples. Later, this model is used on a self-obtained dataset containing news
on similar topics in both the reliable and unreliable parts. It might give better
insight into the problem, showing whether the vocabulary used in disinformation
is really different or whether only the topics in unreliable media vary.
Table 5.10 summarizes the accuracies and training times of the preprocessing
configurations tested in this chapter.
The models and the training flow in this chapter were developed in Keras
(version 2.2.4) with TensorFlow (version 1.13.1) backend and Binary Cross-
Entropy loss function. The two libraries, Keras (with TensorFlow backend)
and PyTorch, were compared on a simple LSTM model.
Surprisingly, a network corresponding to 5.2.3 designed in PyTorch gave
significantly better results, especially when running with the BCEWithLogitsLoss
(a combination of a sigmoid layer and binary cross-entropy) loss function and
pre-trained word embeddings. Moreover, the training of the LSTM layer
was much faster compared to Keras, especially for longer input sequences (5.13).
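A sketch of how the corresponding PyTorch model might look; the final sigmoid is folded into BCEWithLogitsLoss, so the network returns raw logits (the layer sizes follow the Keras SimpleLSTM; everything else, including the vocabulary size, is an assumption):

```python
import torch
from torch import nn

class SimpleLSTM(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden_size=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)  # no sigmoid: handled by the loss

    def forward(self, x):                    # x: (batch, seq_len) word indices
        embedded = self.embedding(x)
        _, (hidden, _) = self.lstm(embedded)
        return self.fc(hidden[-1]).squeeze(-1)

model = SimpleLSTM(vocab_size=5000)          # vocabulary size assumed
criterion = nn.BCEWithLogitsLoss()           # sigmoid + binary cross-entropy
logits = model(torch.randint(0, 5000, (4, 300)))
loss = criterion(logits, torch.ones(4))
```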
The data were prepared using TorchText's built-in preprocessing functions.
It contains everything needed for text processing in a single library: data
loading, building the vocabulary, loading embedding vectors and even an iterator
preparing training tensors of a given batch size. The minimum word frequency was
set to 3, and NLTK's word_tokenize was used. Table 5.11 compares the
performance of Keras and PyTorch, respectively, on the Simple LSTM model.
Simple LSTM TextCNN
(-) with GloVe (-) with GloVe
acc 0.9269 0.9617 0.9324 0.9945
loss 0.3225 0.1426 0.2640 0.0200
training loss 0.0188 0.0173 0.0186 0.0082
validation loss 0.2628 0.0923 0.1918 0.0898
time per epoch (s) 53.5 51.3 9.37 8.2
Table 5.11: Keras (left part) vs PyTorch (right part) LSTM model. The first
run is without pre-trained embedding (-), the second with GloVe.
5.4.2 Summary
This chapter introduced five different neural models written in Keras, two
of which were rewritten in PyTorch for comparison. A big effort was put into
the comparison of typical preprocessing steps frequently used in text classification.
It was shown that the vocabulary of the input data can be restricted
significantly without a relevant loss in the final test accuracy.
Later the importance of pre-trained word embedding was proven to bring
a larger impact on the final accuracy than any other experiment.
A comparison of Keras (with the TensorFlow backend) and PyTorch was also
crucial. The same architecture achieved higher accuracy when written and
trained in PyTorch; moreover, the training time dropped by more than 80%.
When using an optimized model and pre-trained embeddings, there is no need
for strict preprocessing; skipping words that occur fewer than three times is
enough. The GloVe embedding provides a strong understanding of natural
language and even captures relationships between some numerical values and
words that are commonly used in the same context, for example, the relation
between a city and its zip code, as can be seen in figure 4.3. If the
vocabulary is reduced too strictly, these numerical values, and the
relationships they carry in the embeddings, might be lost.
Lastly, as the summary in table 5.10 shows, using only 300 (encoded) words
from each article is not enough, and at least 1 000 words should be used.
Longer input brings higher accuracy. On the other hand, it also increases the
computational cost and the training time of LSTM layers.
                                    Accuracy
With input length of 300 words
  no pre-trained embedding          0.9977
  GloVe                             0.9986
  FastText                          0.9968
With input length of 1000 words
  no pre-trained embedding          0.9936
  GloVe                             0.9984
  FastText                          0.9980
  GloVe + embedding training        0.9991
  FastText + embedding training     0.9985
Table 5.12: TextCNN accuracy with different embeddings. In the last two rows
(+ embedding training), the embedding layer weights were also adjusted during
the training.
Chapter 6
Experiments on self-obtained datasets
GloVe
157
Wiki-News
Multilingual
                                    Self-Attention   TextCNN
No pre-trained embedding            0.9510           0.9533
GloVe                               0.9513           0.9642
GloVe + embedding training          0.9584           0.9655
157                                 0.9475           0.9550
157 + embedding training            0.9584           0.9579
Wiki-News                           0.9490           0.9596
Wiki-News + embedding training      0.9584           0.9527
Multilingual                        0.9541           0.9581
Multilingual + embedding training   0.9536           0.9544
Table 6.1: Test set accuracies of the pre-trained embeddings on Dataset-A
(3.3.3), with input sequence length 300.
achieved the highest accuracy again, especially with the weights of the
embedding layer included in the training.
However, the excellent performance of the multilingual embedding is promising,
showing that aligning vectors from several languages does not cause a loss of
accuracy.
6.2 Limitations
This section discusses problems that occurred or might have occurred during
the experiments.
Training run    SelfAttention   TextCNN
1.              0.9388          0.9825
2.              0.9417          0.9767
3.              0.9563          0.9825
4.              0.9592          0.9854
5.              0.9475          0.9854
6.              0.9446          0.9854
7.              0.9534          0.9825
8.              0.9592          0.9854
9.              0.9534          0.9854
10.             0.9475          0.9767
Mean            0.9501          0.9827
Table 6.2: Test accuracies over 10 runs on the same data, showing the
inconsistency of training on a small dataset. Trained with pre-trained GloVe
on Dataset-EN; input sequence length: 1000 words.
remained in the training set, it would cause noise. Especially in the
automatically crawled self-obtained dataset, no one can say anymore which
parts of a disinformation article put it into that group. Therefore, using
longer sequences is beneficial in this case, as can be seen by comparing
tables 6.3 and 6.2. The first of them contains accuracies of models trained
with an input sequence length of 300 encoded words, the second with sequences
of 1 000 encoded words. The same observation can be made in most of the
experiments summarized in 5.10.
6.3 Dataset-A
This is the largest of the self-obtained datasets. However, it does not contain
full news articles but only their summaries; see details in section 3.3.
At first, it was trained using short input sequences of 300 (encoded) words.
As table 6.1 shows, the best accuracy achieved was 0.9655, in combination
with the pre-trained GloVe embedding. Training on longer input did not bring
any improvement: when trained on sequences of 5 000 (encoded) words with the
multilingual embedding, the final accuracy stayed at 0.9584.
Nevertheless, the weights of the hidden layers trained on this data could be
used as initial weights for training on a dataset that is too small to be
trained efficiently. This hypothesis is tested later on Dataset-EN.
6.4 Dataset-EN
With only 1 038 articles, this is the smallest of the self-obtained datasets.
It consists of disinformation from EU vs Disinfo and reliable news from public
service broadcasters, all in English (for more information about the dataset,
see section 3.3.4). The limitation of such a small dataset is the inconsistency
of training, meaning the final results can vary even when every parameter of
the model remains unchanged, as can be seen in table 6.2. However, considering
the mean value over ten separate training runs, the final accuracy of 0.9827
exceeds the expectations. When an input length of 10 000 (encoded) words was
used, the mean accuracy was 0.9834.
Later, the TextCNN model trained on Dataset-A was used to evaluate this
dataset. The accuracy was just a little over 51%, meaning the classification
cannot be done by a network trained on article summaries only. It shows that
the language of reliable and unreliable news is an essential aspect of the
classification.
As can be seen in table 6.4, with additional training of the pre-trained
model, the accuracy rose to 0.92 but never got close to the results of a
freshly initialized model.
Model Accuracy
Pre-trained 0.88
+ embedding training 0.92
Not pre-trained 0.96
+ embedding training 0.96
Table 6.4: A comparison of the TextCNN model pre-trained on Dataset-A vs a
freshly initialized model.
6.6.1 Dataset-M1
Accuracy
Input sequences of 1 000 words 0.9867
+ embedding training 0.9906
Input sequences of 5 000 words 0.9901
+ embedding training 0.9911
Table 6.5: The classification accuracy on multilingual Dataset-M1
Accuracy
Input sequences of 1 000 words 0.9832
+ embedding training 0.9867
Table 6.6: The classification accuracy on multilingual Dataset-M2
6.6.2 Dataset-M2
One might expect that the model trained on Dataset-M1 has learned a simple
rule - if the language is not English, it is disinformation - since only the
disinformation entries were in other languages. That is the reason for another
test on Dataset-M2.
This multilingual dataset is similar to the previous one and also consists of
6 132 articles. However, the positive samples are now not only in English but
also in Russian - the most common language of disinformation in the dataset.
Comparing the results with Dataset-M1 (tables 6.5 and 6.6), the accuracies
are similar in the run without embedding training and slightly worse with the
embedding layer included in the training.
6.7 Results
Based on the results in table 6.6, the performance is almost the same as it was
on Dataset-M1 containing only English reliable news. However, the Russian
test set might include some disinformation samples that were also included
in the multilingual dataset. Neural Networks, in general, perform better
on samples that were seen during training. Therefore, the results cannot
be directly compared with the other experiments, and additional experiments
would be needed to determine whether using one multilingual dataset is a
better approach than creating a separate dataset for each language.
The results suggest that, with multilingual embeddings, data in different
languages can be processed by a single model. With the pre-trained cross-
lingual relations, it might be enough to include a disinformation case in a
few languages only (or maybe just in one) for the training and, ideally, the
model could be able to classify variations of the same disinformation written
in a different language. However, there were not enough data to confirm this
hypothesis. For a clear conclusion, a larger dataset would be needed, optimally
with a 50:50 ratio of reliable vs disinformation samples and a balanced
distribution of languages. Regrettably, the multilingual datasets used in this
work are far from this ideal state.
6.8 Czech datasets
In this section, similar experiments are done on the Czech datasets (3.3.7).
Some of the articles are actually in Slovak even though they were published
on Czech websites, and some English terms or phrases also appear. Therefore,
a Czech, Slovak, and English multilingual embedding was used. Since the
datasets are small, all accuracies provided in this section are mean values
over ten runs. All experiments are done with the TextCNN model and its
parameters remain unchanged, except for the number of output neurons: in
previous tests, there was just one, while now there is one for each category
of the given dataset. The maximum length of articles is 3 000 words.
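The switch from a single sigmoid output to one neuron per category can be sketched as follows. This is a hypothetical miniature of a TextCNN, not the exact architecture used in the thesis; with multiple output neurons the raw class logits would typically be trained with a cross-entropy loss instead of BCEWithLogitsLoss:

```python
import torch
import torch.nn as nn

class TinyTextCNN(nn.Module):
    """Convolutional text classifier with num_classes output neurons
    instead of the single sigmoid neuron of the binary experiments."""
    def __init__(self, vocab_size, num_classes, embed_dim=50,
                 n_filters=16, kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, n_filters, k) for k in kernel_sizes)
        self.fc = nn.Linear(n_filters * len(kernel_sizes), num_classes)

    def forward(self, x):                        # x: (batch, seq_len)
        emb = self.embedding(x).transpose(1, 2)  # (batch, embed_dim, seq_len)
        # max-over-time pooling of each filter bank
        pooled = [conv(emb).relu().max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))  # raw class logits

model = TinyTextCNN(vocab_size=500, num_classes=3)  # reliable/disinfo/satire
logits = model(torch.randint(0, 500, (2, 30)))
```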
6.8.1 Dataset-CZ-1
To review, this dataset (3.3.7) contains 461 samples of each of the two
categories - reliable and disinformation. The reliable articles were obtained
from CTK and discuss the same topics as those in the disinformation group
(based on the keywords provided within the articles).
The final accuracy on this data was 0.9874, slightly higher than on the
comparable English Dataset-EN of similar size (6.4), showing that, with a
corresponding embedding, two very different languages can be processed by the
same NN architecture. Out of 922 articles, there were only 3.8 (on average)
false-negative predictions (disinformation articles mistakenly marked as
reliable). The network showed excellent performance in detecting reliable
articles, with no false-positive predictions.
6.8.2 Dataset-CZ-2
This dataset is the same as the previous one but enriched with a 'discuss'
category - articles from Parlamentni Listy and AC24, i.e. websites that often
publish disinformation (see 3.3.2). However, this category was not
human-labelled; hence it is not clear whether a given article is disinformation,
reliable news, tabloid, or another type of text.
The accuracy was lower compared to the single-output experiments, only
0.9094. Examining the results as a confusion matrix (see 6.1), we can see that
the performance on reliable news and disinformation is the same as before.
There were also a few errors between the reliable and discuss groups, but most
mistakes were discuss articles predicted as disinformation and vice versa.
Since the third category is a mixture of disinformation, reliable news, and
other articles, the lower accuracy in this case is not a mistake but a
property of the dataset.
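A confusion matrix such as the one referred to above can be computed with scikit-learn; the labels below are illustrative only, not the thesis's results:

```python
from sklearn.metrics import confusion_matrix

# 0 = reliable, 1 = disinformation, 2 = discuss (illustrative predictions)
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 1]   # two 'discuss' articles mistaken for disinfo
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
# rows correspond to the true class, columns to the predicted class
```

Off-diagonal cells immediately show which pairs of categories the classifier confuses, which plain accuracy hides.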
6.8.3 Dataset-CZ-3
This dataset contains reliable and disinformation articles (the same as
Dataset-CZ-1) and satire articles obtained from AZ247. This classification was
expected to be problematic, since satire imitates disinformation in vocabulary
and writing style. Even the name of the satire website, AZ247.cz, refers to
the disinformation website AC24.cz.
However, the accuracy is 0.9651, which is remarkable considering there are
only 1 104 articles in total and the satire group is smaller than the others
(182 articles compared to 461).
Accuracy
Dataset-CZ-1 0.9874
Dataset-CZ-2 0.9094
Dataset-CZ-3 0.9651
Dataset-CZ-23 0.8593
Table 6.7: The classification accuracies on Czech datasets
6.8.4 Dataset-CZ-23
In this section, neural networks trained on the Czech news datasets are used
to evaluate articles on the main disinformation websites.
Parlamentni Listy
In total, 3 261 articles were crawled from the news (zprávy) section of
Parlamentni Listy[5], but only 231 randomly chosen articles were used to form
the 'discuss' label (together with 230 articles from AC24) of Dataset-CZ-2
above. The trained TextCNN model then evaluated the entire corpus.
The Neural Network trained on Dataset-CZ-1 is familiar with only two
categories of news - reliable and disinformation. From that point of view,
81.84% of the articles on Parlamentni Listy were labelled as disinformation.
On the other hand, classifying the same data with a model trained on the
three-category Dataset-CZ-3 (containing satire) gives different results. The
percentage of disinformation drops from 81.84% to 68.32%, and 19.93% of the
articles were classified as satire. Considering that the percentage of
reliable articles also decreased, from 18.16% to 11.74%, it is evident that
more categories of journalism need to be separated - for example, tabloid.
Sadly, these are not present in the training data, so the network is unable
to detect them.
Most importantly, the NN gives the probability of the input article belonging
to each of the known groups, and the group with the highest probability is
taken as the prediction. For example, on the test set of Dataset-CZ-3, the
probability of the most probable label was more than 0.80 in 87% of cases,
and the other two probabilities were usually close to zero. In this case, the
probability
[5] https://www.parlamentnilisty.cz/zpravy
was higher than 0.80 in only 57% of the articles, proving that more categories
are needed in the training set to classify the entire corpus of a website's
articles correctly.
AC24
From 7 540 articles crawled from AC24[6], only 230 were used to form
Dataset-CZ-2 above. All of them were later evaluated by the trained TextCNN.
Classification by the model trained on Dataset-CZ-1 (reliable news and
disinformation) came out very similar to Parlamentni Listy: 82.25% of the
content was classified as disinformation. Nevertheless, with the model trained
on Dataset-CZ-3 (enriched with satire), only 49.19% of the AC24 content was
classified as disinformation and 33.53% as satire. We could see from the
evaluation of Dataset-CZ-3 that the network was able to distinguish satire
from disinformation. The high percentage of articles marked as satire is
likely caused by the fact that the NN is not acquainted with any other type
of journalism, and of the three known categories, satire was the closest one.
Analyses summary
Well-trained Neural Networks can give correct predictions with very high
accuracy but do not provide any explanation. Due to their complexity and the
possibly vast number of neurons/parameters, they are often termed
"black boxes". In some applications this is not an issue, but in others it is
a crucial disadvantage. A topic as sensitive as the Fake News problem sadly
belongs to the second group.
There are model-specific methods to obtain additional insight into a
prediction by inspecting the model weights, but there is also a universal one,
able to explain the prediction of any classifier, presented in the
"Why Should I Trust You?" paper [RSG16].
The explainer creates a synthetic dataset from a single input. In the
Fake News problem, it takes a single article as the input and generates
thousands of variations of the text, each omitting a different part. The
"black box" model is used to give predictions for every generated variation.
These predictions are used to train a new, simple interpretable model whose
weights indicate which parts of the text drive the prediction.
[6] https://ac24.cz
Example
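The perturbation-and-surrogate idea can be sketched from scratch in a few lines. This is a deliberately simplified LIME (uniform word masking, a least-squares linear surrogate, and a toy black-box classifier); the thesis itself used the LIME library:

```python
import numpy as np

def explain_text(text, predict_proba, n_samples=500, seed=0):
    """Simplified LIME for text: mask random subsets of words, query the
    black-box model, and fit a weighted linear surrogate whose coefficients
    rank the words by their influence on the prediction."""
    rng = np.random.default_rng(seed)
    words = text.split()
    masks = rng.integers(0, 2, size=(n_samples, len(words)))  # 1 = keep word
    masks[0] = 1                                              # full text
    preds = np.array([
        predict_proba(" ".join(w for w, keep in zip(words, m) if keep))
        for m in masks
    ])
    # weight perturbed samples by similarity to the original text
    w = np.sqrt(masks.mean(axis=1))
    X = np.hstack([masks, np.ones((n_samples, 1))])  # bias column
    coef, *_ = np.linalg.lstsq(X * w[:, None], preds * w, rcond=None)
    return sorted(zip(words, coef[:-1]), key=lambda p: -abs(p[1]))

# toy black box: flags any text containing the word "hoax"
black_box = lambda text: 1.0 if "hoax" in text.split() else 0.0
ranking = explain_text("official sources call this story a hoax", black_box)
```

The highest-weighted words play the role of the highlighted words in figures 6.4 and 6.5.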
6.10 Summary
Figure 6.4: LIME explanation of an AZ247 (satire) article, with highlighted
words pushing the article towards the satire class
Nevertheless, the model was able to give precise predictions even when
trained on such tiny datasets. For comparison, Grover [ZHR+19] used 120
gigabytes of data for training. However, the vast amount of data in that
case was needed because it is primarily a text generator, and generators
need larger training data in general.
Another limitation to mention is the length of the input sequences (the number
of encoded words in a sample after pre-processing). It was shown that in the
Fake News detection task, using a length of at least 1 000 words is crucial.
A small improvement was observed when using 5 000 words. Sequences longer
than 5 000 words brought no or only limited enhancement and were not worth
the higher computational cost. Learning long sequences is costly with LSTM
layers and, as discussed in 4.3, they can also suffer from long-term memory
loss. These problems do not occur with convolutional layers, which might
explain the better performance of the TextCNN model compared to SelfAttention.
Two different Neural Network architectures were compared: the convolutional-
based TextCNN and SelfAttention (LSTM-based with an attention layer). TextCNN
unexpectedly outperformed SelfAttention in almost every experiment, in
training time, computational cost and, chiefly, in the final accuracy.
Most importantly, impressive results were observed on the multilingual
datasets. In the first case (Dataset-M1), the samples of reliable news were in
English only, but the disinformation samples were in up to 18 different
languages. In the second case (Dataset-M2), two-thirds of the reliable news
was in Russian - the most common language of disinformation, at least in the
dataset used. The final accuracy was unexpectedly high, scoring over 0.98 in
all tests and even over 0.99 on Dataset-M1.
Figure 6.5: LIME explanation of an AZ247 (satire) article, with highlighted
words pushing the article towards the disinformation class
The trained model was later used to evaluate Dataset-RU and reliable
Ukrainian news. Despite excellent performance on the Russian dataset, it was
unable to classify the Ukrainian articles. This shows that, for this
application, the training dataset would have to be balanced in terms of
languages and of positive vs negative samples in each language.
Also noteworthy is the observed low chance of reliable news being
misclassified as disinformation or satire.
Last but not least, better results were achieved when using the original
texts instead of human-written summaries. The reliable news articles were
filtered based on the keywords found in the disinformation dataset, so both
groups contained the same (or similar) topics. This suggests there are
language patterns specific to each of the categories that were lost in the
human-written summaries.
Chapter 7
Outline
convolutional-based model performed better than the state-of-the-art [AV17]
attention-based model. However, only the simplest, original version of the
SelfAttention model was tested. Recent studies use much deeper versions of
the architecture with up to 48 layers, trained on hardware with hundreds of
GPUs or TPUs [ZHR+19]. The TextCNN model is effortless to train even on
modest hardware. It is fast, reliable and outperforms the attention-based
model in almost every experiment, achieving an excellent accuracy of 99.9% on
the Kaggle dataset and at least 98% accuracy on the self-obtained datasets
when trained with long input sequences and pre-trained embeddings.
Lastly, it was shown that a multilingual dataset containing news articles in
several languages can be correctly classified by a single model thanks to
pre-trained multilingual word embeddings, opening another direction in the
Fake News detection research. However, such a model is likely to make more
wrong predictions on languages that were covered less frequently in the
training dataset.
The multilingual self-obtained datasets of real-world news articles covering
the same topics in both classes - reliable and disinformation - make this
work unique. In such a setting, the classification is harder, since it cannot
rely on keywords alone (they are the same in both classes), and is therefore
more relevant.
This section contains a quick summary of the most important libraries used.
NLTK
NLTK [LB02] is one of the most used Natural Language Processing libraries.
In this project, it was used for tokenization of the input texts and for word
stemming (which is implemented specifically for several languages), namely
for filtering reliable news based on the keywords of the disinformation
samples (3.3). Last but not least, it provides the stopword lists used in
pre-processing.
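For illustration, the stemming step might look like this. The Snowball stemmer works without extra downloads, while word_tokenize and the stopword lists additionally require nltk.download('punkt') and nltk.download('stopwords'); the keyword list below is hypothetical:

```python
from nltk.stem import SnowballStemmer

# Snowball stemmers exist for several languages, e.g. English and Russian
stemmer = SnowballStemmer("english")
keywords = ["elections", "vaccinations", "running"]  # illustrative keywords
stems = [stemmer.stem(w) for w in keywords]
# stemmed keywords can then be matched against stemmed article texts,
# so that inflected forms of the same word still count as a hit
```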
news-please
Keras
Keras is a high-level Neural Network API in which the initial models of this
work were developed (version 2.2.4); a compiled model can be trained there
with a single function call.
TensorFlow
TensorFlow [ABC+16] is a machine learning framework developed by Google. In
this work, version 1.13.1 served as the backend for Keras.
PyTorch
PyTorch [PGC+17] is another Neural Network library. With the torchtext
package, it provides the full tool-stack needed for NLP. Creating the training
and testing flow is more laborious than in Keras (where a single function
trains a compiled model), but in return, the training was faster, especially
for RNN-based layers, and even the final accuracy can be higher, as seen in
table 5.11.
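The manual flow amounts to a few lines per batch; a generic sketch, where the model, optimizer and data are stand-ins rather than the thesis's actual components:

```python
import torch
import torch.nn as nn

# minimal stand-ins for the real model and data
model = nn.Sequential(nn.Linear(10, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()
batches = [(torch.randn(8, 10), torch.randint(0, 2, (8,)).float())]

model.train()
for inputs, targets in batches:          # what Keras' fit() hides
    optimizer.zero_grad()                # reset accumulated gradients
    logits = model(inputs).squeeze(1)    # forward pass
    loss = criterion(logits, targets)
    loss.backward()                      # back-propagation
    optimizer.step()                     # weight update
```

The upside of writing this out is full control over logging, gradient handling and evaluation at every step.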
Scikit-learn
Scikit-learn [PVG+11] is a general machine learning library. No model in this
work was designed in it, but a few of its functions were used, such as
tokenizers and scoring functions.
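For example, the scoring helpers replace hand-written metric computations; the label vectors below are purely illustrative:

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0, 0]   # illustrative gold labels
y_pred = [1, 0, 1, 0, 0, 1]   # illustrative predictions (4 of 6 correct)
acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)  # harmonic mean of precision and recall
```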
Matplotlib
LIME
LIME is a library implementing the model-agnostic explanation method of
[RSG16]; it was used to produce the explanations shown in figures 6.4 and 6.5.
ELI5
Chapter 8
Conclusion
used in this work are too far from the perfect state.
A real-world Fake News detection system based on the work discussed in
this thesis could be used by the EU and its member states' agencies to detect
harmful content, as well as in the private sector. Leading internet companies
provide news aggregators (or feed readers) serving content from a vast
number of websites, personalised for each user. The reliability of a source
and of a particular article should be taken into consideration by the
personalisation algorithms. Some may reject this idea with a warning of
censorship. However, studies have shown that many disinformation stories are
ignored by the public (see 1.2); hence filtering disinformation articles from
the feeds can improve the personalised experience. If providing reliable,
relevant content were flagged as censorship, then - by the same logic - any
search engine or even encyclopedia should be flagged the same way, as it also
provides only results relevant to a searched term, which is clearly nonsense.
I would also like to express a recommendation for government agencies,
non-profit organisations studying the effects of disinformation and, mainly,
fact-checking teams to cooperate on a cross-country level, as it has been
shown that a single Neural Network can process even multilingual data. There
is a need to create a large dataset to train the classifiers properly. Lastly,
it was found essential to store the original texts instead of summaries and
links only, because a link might no longer be active, might redirect to
different content, or the article itself can be edited. Also, the Neural
Network classifier performs better on the original texts compared to
human-written summaries. Moreover, even with state-of-the-art technologies
for obtaining the article from a link, there can still be errors during the
parsing, causing noise in the final data and leading to mistakes in learning
and wrong classifications.
Appendix A
Grover example
data is that it, along with others, have been unhappy with Facebook and
Google over the past few years as they run the platforms and create bots that
inject their content into social media feeds around the world. By analysing
the social media content, AI companies are finding more accurate ways to
sift through this content.
Reputation Defender
Reputation Defender, which also operates out of Israel, just released AI-
based authentication. This AI can determine if users are fraudulently using
news updates on the company’s platform and can prevent them from pub-
lishing fresh news stories that could be fake. The company claims that its
algorithm can scan more than 100,000 links per second, tens of thousands of
hours of video, and is incredibly fast.
The AI scans all links, including those in 100,000 headlines, to identify if
they are fraudulent. Once it has identified the material as such, the human
algorithm assesses whether or not it has tried to make a traffic link through
Autocorrect, picked up a scammy link on Hacker News, and even follows a
link to a site where stolen or fraudulently generated IP addresses have been
spotted.
If the company spot the problematic content, the AI can immediately
block it. It can also scour Twitter to look for news stories that were spammy
because the users have used bots and key words.
The company, which just raised $5 million last month, claims that its tech
uses an approach to AI that is very different from what others are using. "It
works by applying machine learning inference to structured data in a much
more real-time way", Gil Duvdevani, the co-founder and chief scientist of the
company, told the Times of Israel.
These technologies allow journalists to better track down and curb hate
speech and fake news, while still maintaining full faith in the ability of humans
to scrutinise content.
Neither AI can stop fake news completely. However, AI does have a role to
play in improving the quality of user reporting and brand trust. Not only
does this result in being faster, but the technology can also lead to more
robust protection against these problems. More progress is likely to be made
in the coming years, as AI and machine learning are becoming more advanced,
and data journalism becomes more complex. AI is changing everything —
except our relationships with each other.
Click HERE to read more
Appendix B
Attachments
. src.zip - source codes
. thesis.pdf - the thesis itself in PDF
. thesis.zip - the thesis in LaTeX
. assignment.pdf - official assignment of the thesis
Appendix C
Bibliography
[ABC+16] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis,
Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard,
Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray,
Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke,
Yuan Yu, and Xiaoqiang Zheng, Tensorflow: A system for large-scale machine
learning, Proceedings of the 12th USENIX Conference on Operating Systems
Design and Implementation (Berkeley, CA, USA), OSDI'16, USENIX Association,
2016, pp. 265-283.
[AG17] Hunt Allcott and Matthew Gentzkow, Social media and fake news in the
2016 election, http://www.aeaweb.org/articles?id=10.1257/jep.31.2.211,
May 2017, Accessed on 2019-01-01.
[CLR+17] Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic
Denoyer, and Hervé Jégou, Word translation without parallel data, arXiv
preprint arXiv:1710.04087 (2017).
[fEPSoE17] Andrea Renda (CEPS Centre for European Policy Studies and
College of Europe), Media consumption forecasts 2018.
[GD18] David Güera and Edward J Delp, Deepfake video detection using
recurrent neural networks, 2018, pp. 1-6.
[LB02] Edward Loper and Steven Bird, Nltk: The natural language toolkit,
Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for
Teaching Natural Language Processing and Computational Linguistics - Volume 1
(Stroudsburg, PA, USA), ETMTNLP '02, Association for Computational
Linguistics, 2002, pp. 63-70.
[Lee12] Eun-Ju Lee, That’s not the way it is: How user-generated
comments on the news affect perceived media bias, Journal of
Computer-Mediated Communication 18 (2012), no. 1, 32–45.
[LFdS+17] Zhouhan Lin, Minwei Feng, Cícero Nogueira dos Santos, Mo Yu,
Bing Xiang, Bowen Zhou, and Yoshua Bengio, A structured self-attentive
sentence embedding, CoRR abs/1703.03130 (2017).
[MSC+ 13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and
Jeffrey Dean, Distributed representations of words and phrases
and their compositionality, CoRR abs/1310.4546 (2013).
[PGC+ 17] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan,
Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison,
Luca Antiga, and Adam Lerer, Automatic differentiation in
pytorch, NIPS-W, 2017.
[QGSL18] Feng Qian, Chengyue Gong, Karishma Sharma, and Yan Liu, Neural user
response generator: Fake news detection with collective user intelligence,
2018, pp. 3834-3840.
[RD17] Richard Davis and Chris Proctor, Fake news, real consequences:
Recruiting neural networks for the fight against fake news, 2017.
[RSG16] Marco Túlio Ribeiro, Sameer Singh, and Carlos Guestrin, "why
should I trust you?": Explaining the predictions of any classifier,
CoRR abs/1602.04938 (2016).
[RSL17] Natali Ruchansky, Sungyong Seo, and Yan Liu, Csi: A hybrid deep
model for fake news detection, 2017.
[Tim16] New York Times, As fake news spreads lies, more readers shrug at
the truth,
https://www.nytimes.com/2016/12/06/us/fake-news-partisan-republican-democrat.html,
Dec 2016, Accessed on 2019-01-01.
[Wan17] William Yang Wang, "liar, liar pants on fire": A new benchmark
dataset for fake news detection, CoRR abs/1705.00648 (2017).
[ZHR+ 19] Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk,
Ali Farhadi, Franziska Roesner, and Yejin Choi, Defending
against neural fake news, arXiv preprint arXiv:1905.12616 (2019).
[ZT18] Zhixing Tan, Mingxuan Wang, Jun Xie, Yidong Chen, and Xiaodong Shi,
Deep semantic role labeling with self-attention, 2018.