
Received 14 July 2022, accepted 25 July 2022, date of publication 29 July 2022, date of current version 10 August 2022.

Digital Object Identifier 10.1109/ACCESS.2022.3195030

Adversarial Training for Fake News Classification


ABDULLAH TARIQ1, ABID MEHMOOD2 (Member, IEEE), MOURAD ELHADEF2 (Member, IEEE), AND MUHAMMAD USMAN GHANI KHAN1
1 National Center for Artificial Intelligence, University of Engineering and Technology Lahore, Lahore 54890, Pakistan
2 Computer Science and Information Technology Department, College of Engineering, Abu Dhabi University, Abu Dhabi, United Arab Emirates
Corresponding author: Abdullah Tariq (ab.sheikh909@gmail.com)
This work was supported by the Office of Research and Sponsored Program, Abu Dhabi University, United Arab Emirates.
The associate editor coordinating the review of this manuscript and approving it for publication was Mehul S. Raval.

ABSTRACT News is a source of information about progress in various areas of life across the globe. However, the volume of this information is high, which makes it difficult to benefit from what is available. Moreover, fake news is becoming significantly more frequent and is used to advance particular agendas. This has led to research on the classification of news to prevent the spread of disinformation. In this work, we use Adversarial Training as a means of regularization for fake news classification. We train two transformer-based encoder models using adversarial examples that help the models learn noise-invariant representations. We generate these examples by perturbing the model's word embedding matrix, and then we fine-tune the model on clean and adversarial examples simultaneously. We train and evaluate the models on the Buzzfeed Political News and Random Political News datasets. Results show consistent improvements over the baseline models when we train using adversarial examples. In terms of F1-score, experiments show that Adversarial Training improves performance by 1.25% over the BERT baseline and 2.05% over the Longformer baseline for the Random Political News dataset, and by 1.25% over the BERT baseline and 0.9% over the Longformer baseline for the Buzzfeed Political News dataset.

INDEX TERMS Fake news classification, political news mining, adversarial training, transformers, Longformer, BERT.

I. INTRODUCTION
Nowadays, the internet has become the most common medium for seeking information. The spread of false news and misleading information is triggering severe problems in the world, partially because most of us only focus on the headlines of the news rather than carefully paying heed to the details. Viewers are misled by intentionally distributed false information because it may contain fabricated or fake content [1]. Increasing internet penetration has made digital media networking a hub for distributing misleading propaganda, inaccurate facts, fraudulent evaluations, rumors, and parodies [2]. Furthermore, massively deceptive communication chains have increasingly harmful implications in various industries, including the stock market. In 2013, for example, the stock market lost 130 billion dollars when false claims spread on Facebook that the US president had been injured in an event [3]. False reporting is accused of having contributed significantly to increasing political polarization during the recent presidential election in the United States.

The impressions that news headlines leave on readers are persistent and contribute meaningfully to a news story going viral on social media platforms. Therefore, detecting incongruent news is vital to fight social media misinformation. Researchers have exploited different methods for detecting fake news, ranging from simple n-gram feature based methods [4], hierarchical encoding based models [5], and summarization based models [6] to artificially intelligent systems [7]-[9]. Normally, a system based on artificial intelligence encounters a bottleneck when optimization and tuning of different parameters [10] are essential.

In 2017, Vaswani et al. [11] introduced a new neural network architecture known as the transformer.


The authors stated that, due to a fundamental design property of Recurrent Neural Networks (RNNs), they do not perform parallel computing of the learning process. Transformers can tackle this limitation, as they consist of multiple sequential attention layers, catch long-range correlations in a sentence, and are computationally efficient [11]. BERT takes the encoder part of the original transformer model and learns the representations of the given input text by randomly masking 15% of it. Besides predicting the masked words, BERT also uses the next sentence prediction (NSP) objective function for learning word representations. For predicting the following sentence, BERT jointly takes two sentences, X and Y, as input and learns to classify whether Y genuinely follows X or is only random text, thereby learning the relationship between sentences.

There are several transformer models, namely BERT [12], RoBERTa, ALBERT, XLNet, DistilBERT, and Reformer. Other models such as RoBERTa and DeBERTa are built on top of the BERT model. Nevertheless, the major drawback of these models is that they cannot perform well on longer sequences, i.e., BERT addresses at most 512 tokens at a time. To tackle the difficulties mentioned above, many approaches have emerged. An example of such a model is the Longformer [13], a pre-trained model based on the transformer that simplifies the self-attention computation and, thus, reduces model complexity. Longformer can operate on long documents, thereby eliminating this shortcoming of transformer models. Longformer can also be used to perform several tasks other than language modeling. This is possible because the Longformer comprises the following attention patterns:
• Dilated Sliding Window: The first layers of an attention block are regular sliding window layers, and the following layers are dilated sliding window layers. Consider it this way: shallow layers better capture local attention information, whereas top layers aim to move from local to global representations faster, minimizing the overall number of layers required.
• Global Attention: On a few pre-selected input positions, we apply "global attention." We also make this attention operation symmetric: a token with global attention pays attention to all tokens in the sequence, and all tokens in the sequence pay attention to it.
• Sliding Window: The Longformer mimics the convolution process in specific ways. The issue that made the memory needs of traditional attention layers quadratic is that each query node pays attention to all of the key nodes, resulting in n attention weights per query node.

Self-attention is used in both a "local" and a "global" setting with Longformer self-attention. The majority of tokens interact with each other "locally," which means that each token only interacts with half of its v preceding tokens and half of its v following tokens, where v is the window length.

It is worth noting that the query, key, and value matrices for "locally" and "globally" attending tokens are different. Additionally, each "locally" attending token does not just attend to tokens within its window v, but also to the globally attending tokens, ensuring symmetric global attention. Longformer self-attention can reduce the memory and computation of the query-key matmul operation, which is usually the memory and time bottleneck, from O(m_l × m_l) to O(m_l × v), where m_l is the sequence length and v is the average window size. Compared with the number of "locally" attending tokens, the number of "globally" attending tokens is considered negligible.

Adversarial Training (AT) helps increase a model's robustness to adversarial attacks by acting as a regularizer. The key concept is to train on both clean data and data disturbed using a gradient-based perturbation procedure. Unlike images, text data is not directly suitable for this Adversarial Training technique. For the goal of text classification, Miyato et al. [14] therefore used perturbations on word embeddings. Chen et al. [15] explained how contrastive loss can be used to learn features in computer vision. When this model is trained, the input picture is perturbed with the help of augmentation, and the contrastive loss draws clean and augmented samples closer together during training while pushing the other examples apart from them. In model learning, contrastive loss facilitates the representation of noise-invariant visual features. Pan et al. [16] offered a contrastive adversarial strategy for text categorization that outperforms existing algorithms, showing that contrastive Adversarial Training improves text classification performance far beyond the baseline approaches.

In this paper, we target the task of fake news classification using two transformer models, BERT and Longformer, and compare their performance. The fake news dataset we use for experimentation consists of the title and the body, where the title of the news is much shorter than the body. Hence, it is worth evaluating the performance of models like BERT and Longformer on shorter and more extended versions of the text. Further, we analyze the impact of Adversarial Training (AT) on these two models by training them on both clean and perturbed data. The main contributions of this research are:
• We compare the performance of two transformer models, namely BERT and Longformer, for classifying fake news.
• We analyze noise as a means of regularization for fake news classification.

The remaining sections of the paper are organized in the following manner: Section II covers the related work, Section III presents the paper's methodology, Section IV provides details about the experimental settings, Section V elaborates the results, and Section VI concludes the paper.

II. RELATED WORK
In this section, we first describe work related to Adversarial Training. Then, we go over the work that has already been done in the field of identifying fake news.

A. ADVERSARIAL TRAINING
A number of supervised classification tasks in Computer Vision (CV), such as object identification [17]-[19], object segmentation [19], [20], and image classification [21]-[23], have studied the impact of Adversarial Training (AT) on task performance. AT builds on the concept of adversarial attacks, in which input (clean) samples are manipulated so that the system predicts the incorrect class label [24]. As Goodfellow et al. suggested in [23], the Fast Gradient Sign Method (FGSM) can be used to produce adversarial examples of images. However, due to the nature of text, FGSM cannot be applied directly to the input text to generate adversarial samples. Therefore, Miyato et al. [14] applied FGSM to NLP tasks by perturbing word embeddings rather than the real text input, which is applicable in both supervised and semi-supervised scenarios, as it uses Virtual Adversarial Training (VAT) [25] in the latter. Other works [26]-[28] proposed adding the perturbation to the attention mechanism of transformer-based methods instead of to the word embeddings. To generate adversarial examples, Madry et al. [24] adopted a multi-step approach in contrast to the single-step FGSM. Hiriyannaiah et al. [29] used a GAN to better categorize fake news from online platforms. Zhu et al. [28] also achieved a larger effective batch by adding gradient accumulation to the free AT algorithm. Shafahi et al. [30] proposed free Adversarial Training, in which the inner loop is responsible both for calculating the perturbation and for updating the model parameters. Chandra et al. [31] presented a methodology for the detection of offensive text, utilizing adversarial training to make their model more robust.

B. FAKE NEWS CLASSIFICATION
Various AI researchers have proposed different methods to classify news as real or fake using deep learning (DL) and machine learning (ML) tools and approaches. In this section, we present some of the conventional or touchstone techniques for classifying fake news. Social media platforms remain a significant source of news creation and propagation. Facebook, Twitter, WhatsApp, and many other platforms have declared that they are developing algorithms for the classification of fake news in real time in order to limit its spread [32].

Bhatt et al. [33] presented a state-of-the-art solution for stance classification of news. The proposed model is a hybrid of a deep recurrent model's neural encoding, a weighted n-gram bag-of-words model's features, and hand-crafted external features obtained through feature engineering. The method is evaluated on real-world data from the fake news detection challenge FNC-1. It gained a weighted accuracy score of 83.08%, beating the baseline score of 82.05%. Kaliyar et al. [34] utilized different ML classifiers, namely Gradient Boosting, Multinomial Naive Bayes, Decision Tree, Random Forest, Logistic Regression, and Linear SVM, for fake news classification problems. The study suggested that Gradient Boosting produced cutting-edge results, with 86% accuracy on the fake news classification dataset known as the FNC challenge. Similarly, Jain et al. [35] proposed a methodology for fake news detection consisting of three modules: an aggregator to extract news from websites, a news authenticator that predicts whether the extracted news is real or fake, and a recommender system. The fake news detection module employed a combination of Support Vector Machines, semantic analysis, and Naive Bayes. The proposed system gained an accuracy of 93.6% on the evaluation dataset.

Umer et al. [9] presented a hybrid algorithm that merges the properties of CNN and LSTM. Two dimensionality reduction techniques, chi-square and Principal Component Analysis (PCA), were used to reduce the features of text articles, which helped increase the processing speed. It helped in determining whether the features of news stories were consistent with the content of the article. The proposed algorithm was evaluated on the FNC dataset and attained an accuracy of 97.8% using the PCA technique. Ahmad et al. [36] explored different textual properties of a fake news corpus to classify fake and real news. In that study, the datasets used were the ISOT Fake News Dataset and two publicly available datasets on Kaggle. The main contribution of the paper was the introduction of an ensemble of ML algorithms using Logistic Regression, K-Nearest Neighbors, Support Vector Machine, and Multi-layer Perceptron. The ensemble techniques used were Random Forest and Decision Trees as bagging classifiers, AdaBoost and XGBoost as boosting classifiers, and voting classifiers over the above-mentioned algorithms. The highest accuracy on the ISOT dataset was 99%, and on the three combined datasets it was 99% using a Random Forest classifier. The main point of this research was to identify the key elements in classifying fake news. Hardalov et al. [37] came up with an end-to-end framework named Mixture-of-Experts with Label Embeddings (MoLE). Based on unsupervised domain adaptation for the pre-trained RoBERTa model and label embeddings, it was able to learn different labels. The proposed model was evaluated on 16 different stance detection datasets with an average F1-score of 65.55%, while on the FNC-1 dataset it obtained a score of 75.82% on the same metric.

Vaibhav et al. [38] came up with a graph neural network to classify fake news by looking at the sentence-level relations of the text. Slovikovskaya et al. [39] reported refined stance detection results for the FNC-1 dataset. First, the author tested the power of embeddings from the Facebook InferSent encoder and BERT-based features separately on the featMLP classifier. These findings prompted further research into fine-tuning the BERT, XLNet, and RoBERTa transformers on the FNC-1 extended dataset. Experiments revealed that transformer-based models produced better results, with accuracies of 91.32%, 92.10%, and 93.19% produced by BERT, XLNet, and RoBERTa, respectively.


Briskilal et al. [40] proposed an ensemble method that combines the weights generated by two transformer models, BERT and RoBERTa. The proposed architecture was fine-tuned on idiom and literal expression data, namely TroFiIn. In addition, the authors presented a new dataset containing 1470 idioms and literal expressions. The overall accuracy of the ensemble model was 90.4%, compared to 85% and 88% for the standalone BERT and RoBERTa models, respectively.

So far, we have seen how transformer-based models can capture semantic and long-range correlations in sentences. However, Kaliyar et al. [41] developed FakeBERT, a BERT-based DL model for detecting fake news. Input vectors generated by BERT after word embedding were passed into three convolutional layers, each followed by a pooling layer and stacked in parallel blocks. The model was evaluated using a fake news dataset provided by Kaggle, with true and false labels only, and achieved an accuracy of 98.90%. Furthermore, that paper provided a detailed overview of BERT embeddings used with various state-of-the-art methods. However, the model only detected true or false labels and could be extended to classify multi-class real-world datasets.

In order to classify long textual news, in this paper we fine-tune BERT and Longformer with Adversarial Training. Adversarial Training helps the models classify fake news with higher accuracy.

III. METHODOLOGY
In this section, we first discuss the fundamentals of transformers for text classification, followed by fine-tuning BERT and Longformer on our datasets with and without noise for the classification of fake news. Then, we describe the Adversarial Training. At last, we merge these ideas to improve the overall score.

A. TRANSFORMER ENCODERS
To get the statistical representations of the input text, we utilize two transformer models, BERT and Longformer. Let x = {[CLS], x_1, x_2, . . . , x_n, [SEP]} be the input token sequence, where [CLS] and [SEP] are the unique tokens representing the beginning and the end of the sequence. The [CLS] token contains the hidden representation of the whole input sequence x. Let encoder be a transformer encoder representing either BERT or Longformer at a time. Given a sequence x as input, the encoder produces a hidden representation H for each input token:

H = encoder(x)   (1)

where H ∈ R^(n×d), d represents the number of hidden units, and n is the maximum sequence length. We fine-tune the encoder and add a softmax classifier that takes the hidden representation of the [CLS] token. The aim of training is to reduce the cross-entropy loss:

L = −(1/N) Σ_{i=1}^{N} Σ_{c=1}^{C} y_{i,c} log p(y_i = c | h_i^[CLS])   (2)

where L represents the training loss, N is the number of training samples in a batch, and C is the number of classes in the dataset.

B. ADVERSARIAL TRAINING
In Adversarial Training, inputs are perturbed so that the model misclassifies them. The Fast Gradient Sign Method (FGSM) is one of the methods proposed in [23] to generate such perturbed examples. These perturbed samples are called adversarial samples. Once adversarial examples are generated, the model is trained on pairs of perturbed and clean examples.


Let p represent the perturbation; the amount of perturbation generated using the FGSM method can then be represented as follows:

p = ε · sign(∇_{x_i} L(f_θ(x_i), y_i))   (3)

where L(f_θ(x_i), y_i) is the cross-entropy loss function used in our work, f_θ is the neural network parameterized by θ, and ε is the hyperparameter that controls the amount of perturbation. To generate adversarial examples, instead of adding the perturbation to the input itself, we add the perturbation p to the embedding matrix for every input text. Then, we train the model jointly on adversarial and clean examples. The total loss is the sum of both losses, i.e., one for the clean example and one for the perturbed example:

L_total = L_clean + L_adv   (4)

where L_clean and L_adv represent the losses for the clean and adversarial examples, respectively.

C. ADVERSARIAL TRAINING FOR FAKE NEWS CLASSIFICATION
This study employs the transformer models coupled with Adversarial Training for classifying news text as fake or real. The input news text is passed to the transformer model, which comprises an embedding layer and a series of hidden layers. The hidden state of the [CLS] token, representing the whole input sequence, is then passed to the classification layer, and the loss is computed as given in Equation 2. In order to estimate the perturbation, we compute the gradient of the loss with respect to the embedding matrix, as shown in Equation 3. The adversarial example is created by adding this perturbation to the transformer model's embedding matrix. This adversarial example then also goes through the series of hidden layers and the classification layer, and its loss is computed. The total loss becomes the sum of the losses of the clean and adversarial examples, as shown in Equation 4. Then the backward step is taken, and the model's parameters are updated. In this way, the fake news classification model is trained. The overall workflow of the training procedure is shown in Figure 1.

FIGURE 1. Workflow of Adversarial Training for fake news classification. We add perturbation to the embedding matrix to generate adversarial samples and then train the model using clean and adversarial samples simultaneously.

IV. EXPERIMENTAL SETTING
A. DATASET
The details of the dataset [42] we use for the experiments are given in Table 2. It contains news datasets from two independent sources, i.e., "Buzzfeed News Data" and "Random News Data". "Buzzfeed News Data" contains 48 samples of fake news and 53 samples of real news. "Random News Data" contains 75 samples each of fake news, real news, and satire. In this work, we only use the fake news and real news data. Both datasets contain the title as well as the body of the news. We classify the dataset using the title and the body separately. Some samples of these datasets are shown in Table 1.

TABLE 1. Samples from the Buzzfeed and Random Political News datasets.

TABLE 2. Dataset statistics.

B. BASELINE METHODS
As baseline methods, we use two transformer-based models. The details of both models are given as follows:

1) BERT-LARGE
We fine-tune the BERT-large model, which consists of 24 encoder blocks, for the purpose of fake news classification. We consider this model a baseline. For adversarial training, we use the same architecture; however, instead of training on clean examples only, we use both clean and perturbed samples to train the model. This makes one forward pass of model training consist of two examples, i.e., a clean and an adversarial example.

2) LONGFORMER-4096
The other model we fine-tune for fake news classification is Longformer-4096, which can process input sequences of up to 4096 tokens. Like BERT, we also use this model for classification based on both the title and the news body and treat it as the baseline for its Adversarial Training counterpart. We use the same Adversarial Training strategy for Longformer as we used for BERT; however, the value of the noise parameter differs from that of BERT.
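To make Sections III-B and III-C concrete, the following is a minimal sketch of one joint clean-plus-adversarial fine-tuning step, assuming PyTorch and a Hugging Face sequence-classification model; the model name, epsilon value, and data handling are illustrative assumptions rather than the authors' released implementation.

    # Minimal sketch of one clean + adversarial fine-tuning step (cf. Eqs. (2)-(4)).
    # Assumes PyTorch + Hugging Face transformers; names and values are illustrative.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    MODEL_NAME = "bert-large-uncased"   # or "allenai/longformer-base-4096"
    EPSILON = 1e-3                      # noise hyperparameter (the paper uses 1e-3 / 1e-4)

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

    def train_step(texts, labels):
        model.train()
        enc = tokenizer(texts, padding=True, truncation=True,
                        max_length=512, return_tensors="pt")
        labels = torch.tensor(labels)
        emb = model.get_input_embeddings()           # word-embedding matrix

        # 1) Clean pass: L_clean and its gradient w.r.t. the embedding matrix.
        loss_clean = model(**enc, labels=labels).loss
        loss_clean.backward()                         # grads now hold d(L_clean)
        grad = emb.weight.grad.detach()

        # 2) FGSM-style perturbation of the embedding matrix (cf. Eq. (3)).
        p = EPSILON * grad.sign()

        # 3) Adversarial pass with the perturbed embeddings, then restore them.
        with torch.no_grad():
            emb.weight.add_(p)
        loss_adv = model(**enc, labels=labels).loss
        loss_adv.backward()                           # grads now hold d(L_clean + L_adv)
        with torch.no_grad():
            emb.weight.sub_(p)

        # 4) L_total = L_clean + L_adv (Eq. (4)): one step on the summed gradients.
        optimizer.step()
        optimizer.zero_grad()
        return loss_clean.item(), loss_adv.item()

Looping this step over batches and epochs, with the hyperparameters and early stopping described in the next section, corresponds to the training workflow summarized in Figure 1.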


C. EVALUATION MEASURE
We employed accuracy, precision, recall, and the F1-score as evaluation measures to assess our models' performance. The definitions of accuracy, precision, recall, and F1-score are given in Equations 5, 6, 7, and 8, respectively.

Accuracy = (TP + TN) / (TP + TN + FP + FN)   (5)
Precision = TP / (TP + FP)   (6)
Recall = TP / (TP + FN)   (7)
F1-score = 2 · (Precision · Recall) / (Precision + Recall)   (8)

where TP (true positive) is news that is real and that the model also predicts as real; TN (true negative) is actual fake news predicted as fake by the model; FP (false positive) is actual fake news predicted as real by the model; and FN (false negative) is actual real news predicted as fake by the model.

D. HYPERPARAMETERS
For both BERT and Longformer, we use a fixed learning rate of 1e-5. For the classification of news titles, we use a maximum sequence length of 50 for both BERT and Longformer. For classification based on the body of the news, we use a maximum sequence length of 512 for BERT and 1000 for Longformer. For the value of the noise parameter ε, we use 0.001 for BERT and 0.0001 for Longformer. Due to computational constraints, we use a fixed batch size of 1 for both baseline and Adversarial Training on the news body for BERT and Longformer. For classification of the title of the news, as the maximum sequence length is small, we use a batch size of 8 for both baseline and Adversarial Training. However, we still use a batch size of 1 for Longformer when classifying based on the title, because the Longformer model does not fit in GPU memory even though the title length is small. We use Adam [43] as the optimization algorithm for both models, with linear weight decay. Models are trained for 10 epochs, and early stopping is used to prevent overfitting.

V. RESULTS AND DISCUSSION
For the classification of news as fake or real, we use two transformer models, BERT and Longformer. We use these two models as baselines. Moreover, we employ noise as a means of regularization for fake news classification using Adversarial Training. The results on the body and title of the news for both the Buzzfeed and Random Political News datasets are presented in Table 3.

A. DATASET 1: BUZZFEED POLITICAL NEWS DATASET
As shown in Table 3, we attained a precision, recall, and F1-score of 86.85%, 85.75%, and 84.95%, respectively, on the "Buzzfeed Political News" dataset using BERT without adversarial training for the title of the news. However, when we include adversarial training for the BERT model on the title, we get precision, recall, and F1-score values of 84.85%, 83.85%, and 83.2%, respectively. This means the performance of the model degrades. For Longformer without Adversarial Training on the title of the news, we get a precision, recall, and F1-score of 87.75%, 87%, and 86.15%, respectively. On the other hand, for Longformer with Adversarial Training on the news title, we see precision, recall, and F1-score values of 87.15%, 86.4%, and 86.0%, respectively. We again notice performance degradation for classification of news on the basis of the title when using Adversarial Training. On the other hand, in the case of classifying fake news on the basis of the body of the news, we get a precision, recall, and F1-score of 81.35%, 80%, and 79.0% for the BERT baseline model. However, with Adversarial Training of the BERT model on the news body, the precision, recall, and F1-score increase by 3.25%, 0.75%, and 1.25%, respectively. Similarly, with Adversarial Training of the Longformer model on the body of the news, the precision, recall, and F1-score increase by 2.15%, 0.65%, and 0.9%, respectively.

B. DATASET 2: RANDOM POLITICAL NEWS DATASET
We discuss the performance of the baseline and adversarial training models on the "Random Political News" dataset in this subsection. Results show that for the title of the news, the BERT baseline gives a precision, recall, and F1-score of 83.2%, 81.95%, and 81.1%, respectively. On the other hand, the BERT adversarial model performs with precision, recall, and F1-score of 81.6%, 80.85%, and 80.55%, respectively. In the case of the Longformer baseline model, we get a precision, recall, and F1-score of 89.05%, 87.55%, and 87.25%, respectively. For Longformer with Adversarial Training, we had a precision of 89%, a recall of 87.55%, and an F1-score of 87.1%.

The BERT baseline method performs with precision, recall, and F1-score of 89.05%, 88.35%, and 88.05% when classifying fake news based on the body of the news, whereas BERT with Adversarial Training performs with precision, recall, and F1-score of 90.6%, 89.85%, and 89.3%, respectively. The Longformer baseline performs with precision, recall, and F1-score of 94.25%, 94.15%, and 93.95%. However, Longformer with Adversarial Training outperforms the baseline with precision, recall, and F1-score of 96.25%, 96.1%, and 96%, respectively. BERT with Adversarial Training for the news body gains a performance improvement of 1.55%, 1.5%, and 1.25% in precision, recall, and F1-score, respectively, over the BERT baseline method. Similarly, Longformer with Adversarial Training for the news body gains a performance improvement of 2.0%, 1.95%, and 2.05% in terms of precision, recall, and F1-score, respectively.

As Longformer performs best on the body of the news for both datasets, we plot its accuracy and loss in Figures 2 and 3. Although the validation loss increases after certain epochs, we save the model with the highest F1-score on the validation fold. Similarly, we plot the confusion matrices for both training methods in Figure 4.
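For reference, the sketch below shows one way to obtain the averaged scores reported in Table 3, assuming scikit-learn; run_fold is a hypothetical placeholder for fine-tuning and predicting on a single fold, and macro averaging over the two classes is an assumption rather than a detail stated in the paper.

    # Sketch of the metric computation: accuracy, precision, recall and F1-score
    # (Eqs. (5)-(8)) averaged over 10-fold cross-validation. Assumes scikit-learn.
    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import precision_recall_fscore_support, accuracy_score

    def evaluate_cv(texts, labels, run_fold, n_splits=10, seed=42):
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        scores = []
        for train_idx, test_idx in skf.split(texts, labels):
            # run_fold is a placeholder: fine-tune on train_idx, predict on test_idx.
            y_true, y_pred = run_fold(train_idx, test_idx)
            prec, rec, f1, _ = precision_recall_fscore_support(
                y_true, y_pred, average="macro", zero_division=0)
            scores.append((accuracy_score(y_true, y_pred), prec, rec, f1))
        return np.mean(scores, axis=0)  # averaged accuracy, precision, recall, F1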


FIGURE 2. Averaged train and validation accuracy plots of both datasets for the baseline and adversarial methods with 10-fold cross-validation.

FIGURE 3. Averaged train and validation loss plots of both datasets for the baseline and adversarial methods with 10-fold cross-validation.

According to the experimental results, the Longformer model outperforms BERT on both the title and the body of the news on both datasets. Moreover, the models' performance degrades when adding noise to the title of the news. This makes sense because the news title already contains short text, and adding noise further reduces the useful information.


On the other hand, model performance significantly improves if we employ Adversarial Training for the longer text, such as the body of the news.

FIGURE 4. Confusion matrices of 10-fold cross-validation for the baseline and adversarial methods.

TABLE 3. Averaged precision, recall, and F1-score for the Buzzfeed Political News and Random Political News datasets over 10-fold cross-validation. The table shows results for the BERT and Longformer models with and without FGSM noise for both the title and the body of the news. Adding FGSM noise for the body of the news consistently improves the F1-score for both models.

VI. CONCLUSION
In this paper, we analyzed the impact of Adversarial Training as a means of regularization for the fake news classification task. To this end, we utilized two transformer models, BERT and Longformer. We measured the performance of the models on two publicly available datasets, namely Random Political News and Buzzfeed Political News. Evaluation results show that Adversarial Training for the classification of news on the basis of long text, such as the body of the news, increases the models' performance significantly over the baseline in terms of precision, recall, and F1-score on both datasets. However, Adversarial Training for short text, such as classification using the title of the news, degrades the models' performance. Moreover, the Longformer model performs better than BERT for fake news classification using both the title and the body of the news. For future work, we would consider adding algorithms other than FGSM for generating perturbed samples and exploring how to choose the required noise value so as not to add so much noise that accuracy falls rather than increases.


Moreover, we would try to explore how effective our technique is for the detection of offensive text and present the outcomes in graphical form.

REFERENCES
[1] X. Jose, S. D. M. Kumar, and P. Chandran, "Characterization, classification and detection of fake news in online social media networks," in Proc. IEEE Mysore Sub Sect. Int. Conf. (MysuruCon), Oct. 2021, pp. 759–765.
[2] A. E. Fard and T. Verma, "A comprehensive review on countering rumours in the age of online social media platforms," in Causes Symptoms Socio-Cultural Polarization. Springer, 2022, pp. 253–284.
[3] S. Vosoughi, D. Roy, and S. Aral, "The spread of true and false news online," Science, vol. 359, pp. 1146–1151, May 2018.
[4] B. Riedel, I. Augenstein, G. P. Spithourakis, and S. Riedel, "A simple but tough-to-beat baseline for the fake news challenge stance detection task," 2017, arXiv:1707.03264.
[5] R. Mishra, P. Yadav, R. Calizzano, and M. Leippold, "MuSeM: Detecting incongruent news headlines using mutual attentive semantic matching," in Proc. 19th IEEE Int. Conf. Mach. Learn. Appl. (ICMLA), Dec. 2020, pp. 709–716.
[6] S. Yoon, K. Park, M. Lee, T. Kim, M. Cha, and K. Jung, "Learning to detect incongruence in news headline and body text via a graph neural network," IEEE Access, vol. 9, pp. 36195–36206, 2021.
[7] Y. Liu and Y.-F. Wu, "Early detection of fake news on social media through propagation path classification with recurrent and convolutional networks," in Proc. AAAI Conf. Artif. Intell., vol. 32, 2018, pp. 1–8.
[8] S. Girgis, E. Amer, and M. Gadallah, "Deep learning algorithms for detecting fake news in online text," in Proc. 13th Int. Conf. Comput. Eng. Syst. (ICCES), Dec. 2018, pp. 93–97.
[9] M. Umer, Z. Imtiaz, S. Ullah, A. Mehmood, G. S. Choi, and B.-W. On, "Fake news stance detection using deep learning architecture (CNN-LSTM)," IEEE Access, vol. 8, pp. 156695–156706, 2020.
[10] L. Abualigah, A. Diabat, S. Mirjalili, M. A. Elaziz, and A. H. Gandomi, "The arithmetic optimization algorithm," Comput. Methods Appl. Mech. Eng., vol. 376, Apr. 2021, Art. no. 113609.
[11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. U. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–11.
[12] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," 2018, arXiv:1810.04805.
[13] I. Beltagy, M. E. Peters, and A. Cohan, "Longformer: The long-document transformer," 2020, arXiv:2004.05150.
[14] T. Miyato, A. M. Dai, and I. Goodfellow, "Adversarial training methods for semi-supervised text classification," 2016, arXiv:1605.07725.
[15] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, "A simple framework for contrastive learning of visual representations," in Proc. Int. Conf. Mach. Learn. (PMLR), 2020, pp. 1597–1607.
[16] L. Pan, C.-W. Hang, A. Sil, and S. Potdar, "Improved text classification via contrastive adversarial training," 2021, arXiv:2107.10137.
[17] S.-T. Chen, C. Cornelius, J. Martin, and D. H. P. Chau, "ShapeShifter: Robust physical adversarial attack on faster R-CNN object detector," in Proc. Joint Eur. Conf. Mach. Learn. Knowl. Discovery Databases. Springer, 2018, pp. 52–68.
[18] D. Song, K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, F. Tramer, A. Prakash, and T. Kohno, "Physical adversarial examples for object detectors," in Proc. 12th USENIX Workshop Offensive Technol. (WOOT), 2018, pp. 1–10.
[19] C. Xie, J. Wang, Z. Zhang, Y. Zhou, L. Xie, and A. Yuille, "Adversarial examples for semantic segmentation and object detection," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 1369–1378.
[20] A. Arnab, O. Miksik, and P. H. S. Torr, "On the robustness of semantic segmentation models to adversarial attacks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 888–897.
[21] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, "The limitations of deep learning in adversarial settings," in Proc. IEEE Eur. Symp. Secur. Privacy (EuroSP), Mar. 2016, pp. 372–387.
[22] J. Su, D. Vargas, and K. Sakurai, "One pixel attack for fooling deep neural networks," IEEE Trans. Evol. Comput., vol. 23, no. 5, pp. 828–841, Oct. 2019.
[23] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," 2014, arXiv:1412.6572.
[24] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, "Towards deep learning models resistant to adversarial attacks," 2017, arXiv:1706.06083.
[25] T. Miyato, S.-I. Maeda, M. Koyama, K. Nakae, and S. Ishii, "Distributional smoothing with virtual adversarial training," 2015, arXiv:1507.00677.
[26] S. Kitada and H. Iyatomi, "Attention meets perturbations: Robust and interpretable attention with adversarial training," IEEE Access, vol. 9, pp. 92974–92985, 2021.
[27] S. Kitada and H. Iyatomi, "Making attention mechanisms more robust and interpretable with virtual adversarial training for semi-supervised text classification," 2021, arXiv:2104.08763.
[28] C. Zhu, Y. Cheng, Z. Gan, S. Sun, T. Goldstein, and J. Liu, "FreeLB: Enhanced adversarial training for natural language understanding," 2019, arXiv:1909.11764.
[29] S. Hiriyannaiah, A. Srinivas, G. K. Shetty, G. Siddesh, and K. Srinivasa, "A computationally intelligent agent for detecting fake news using generative adversarial networks," in Hybrid Computational Intelligence. Amsterdam, The Netherlands: Elsevier, 2020, pp. 69–96.
[30] A. Shafahi, M. Najibi, M. A. Ghiasi, Z. Xu, J. Dickerson, C. Studer, L. S. Davis, G. Taylor, and T. Goldstein, "Adversarial training for free!" in Proc. Adv. Neural Inf. Process. Syst., vol. 32, 2019, pp. 1–12.
[31] S. Chandra and B. Das, "An approach framework of transfer learning, adversarial training and hierarchical multi-task learning—A case study of disinformation detection with offensive text," J. Phys.: Conf. Ser., vol. 2161, no. 1, Jan. 2022, Art. no. 012049.
[32] Z. Chu, S. Gianvecchio, H. Wang, and S. Jajodia, "Detecting automation of Twitter accounts: Are you a human, bot, or cyborg?" IEEE Trans. Dependable Secure Comput., vol. 9, no. 6, pp. 811–824, Nov. 2012.
[33] G. Bhatt, A. Sharma, S. Sharma, A. Nagpal, B. Raman, and A. Mittal, "On the benefit of combining neural, statistical and external features for fake news identification," 2017, arXiv:1712.03935.
[34] R. K. Kaliyar, A. Goswami, and P. Narang, "Multiclass fake news detection using ensemble machine learning," in Proc. IEEE 9th Int. Conf. Adv. Comput. (IACC), Dec. 2019, pp. 103–107.
[35] A. Jain, A. Shakya, H. Khatter, and A. K. Gupta, "A smart system for fake news detection using machine learning," in Proc. Int. Conf. Issues Challenges Intell. Comput. Techn. (ICICT), vol. 1, Sep. 2019, pp. 1–4.
[36] I. Ahmad, M. Yousaf, S. Yousaf, and M. O. Ahmad, "Fake news detection using machine learning ensemble methods," Complexity, vol. 2020, pp. 1–11, Oct. 2020.
[37] M. Hardalov, A. Arora, P. Nakov, and I. Augenstein, "Cross-domain label-adaptive stance detection," 2021, arXiv:2104.07467.
[38] V. Vaibhav, R. M. Annasamy, and E. Hovy, "Do sentence interactions matter? Leveraging sentence level representations for fake news classification," 2019, arXiv:1910.12203.
[39] V. Slovikovskaya, "Transfer learning from transformers to fake news challenge stance detection (FNC-1) task," 2019, arXiv:1910.14353.
[40] J. Briskilal and C. N. Subalalitha, "An ensemble model for classifying idioms and literal texts using BERT and RoBERTa," Inf. Process. Manage., vol. 59, no. 1, Jan. 2022, Art. no. 102756.
[41] R. K. Kaliyar, A. Goswami, and P. Narang, "FakeBERT: Fake news detection in social media with a BERT-based deep learning approach," Multimedia Tools Appl., vol. 80, no. 8, pp. 11765–11788, Mar. 2021.
[42] B. Horne and S. Adali, "This just in: Fake news packs a lot in title, uses simpler, repetitive content in text body, more similar to satire than real news," in Proc. Int. AAAI Conf. Web Social Media, vol. 11, 2017, pp. 759–766.
[43] N. Landro, I. Gallo, and R. La Grassa, "Mixing Adam and SGD: A combined optimization method," 2020, arXiv:2011.08042.

ABDULLAH TARIQ is currently pursuing the M.S. degree in computer science with the University of Engineering and Technology Lahore. He is also working as a Research Officer with the Intelligent Criminology Research Laboratory, National Center of Artificial Intelligence. His research interests include computer vision, ML, and DL.


ABID MEHMOOD (Member, IEEE) received the Ph.D. degree in computer science from Deakin University, Australia. He is currently an Assistant Professor with Abu Dhabi University. His research interests include ML, privacy, information security, data mining, and cloud computing.

MOURAD ELHADEF (Member, IEEE) received the B.Sc., M.Sc., and Ph.D. degrees in computer science from the Institut Supérieur de Gestion, Tunis, Tunisia, and the Ph.D. degree in computer science from the University of Sherbrooke, Sherbrooke, QC, Canada. He is currently a Computer Science Professor at the College of Engineering, Abu Dhabi University, United Arab Emirates. He has over 50 peer-reviewed articles and conference proceedings to his credit. His current research interests include failure tolerance and fault diagnosis in distributed, wireless, and ad-hoc networks, cloud computing, artificial intelligence, and security. He is on the editorial boards of several major conferences and journals, including IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS and the Journal of Parallel and Distributed Computing.

MUHAMMAD USMAN GHANI KHAN is currently the Director of the Intelligent Criminology Laboratory under the Center of Artificial Intelligence. He is also the Director and Founder of five research laboratories, including the Computer Vision and ML Laboratory, the Bioinformatics Laboratory, the Virtual Reality and Gaming Laboratory, the Data Science Laboratory, and the Software Systems Research Laboratory. He has over 18 years of research experience, specifically in the areas of image processing, computer vision, bioinformatics, medical imaging, computational linguistics, and ML. He is a teacher and mentor for subjects related to artificial intelligence, ML, and DL, and has recorded freely available video lectures on YouTube for courses on bioinformatics, image processing, data mining and data science, and computer programming.
