
Neural Computing and Applications (2024) 36:8417–8453

https://doi.org/10.1007/s00521-024-09536-x

ORIGINAL ARTICLE

Video deepfake detection using Particle Swarm Optimization improved deep neural networks

Leandro Cunha1 • Li Zhang1 • Bilal Sowan2 • Chee Peng Lim3 • Yinghui Kong4

Received: 13 February 2023 / Accepted: 22 January 2024 / Published online: 22 February 2024
© The Author(s) 2024

Abstract
As the complexity and capabilities of Artificial Intelligence technologies increase, so does their potential for misuse. Deepfake
videos are an example. They are created with generative models which produce media that replicates the voices and faces
of real people. Deepfake videos may be entertaining, but they may also put privacy and security at risk. A criminal may
forge a video of a politician or another notable person in order to affect public opinions or deceive others. Approaches for
detecting and protecting against these types of forgery must evolve alongside the methods of generation to ensure that
proper information is supplied and to mitigate the risks associated with the fast evolution of deepfakes. This research
exploits the effectiveness of deepfake detection algorithms with the application of a Particle Swarm Optimization (PSO)
variant for hyperparameter selection. Since Convolutional Neural Networks excel in recognizing objects and patterns in
visual data while Recurrent Neural Networks are proficient at handling sequential data, in this research, we propose a
hybrid EfficientNet-Gated Recurrent Unit (GRU) network as well as EfficientNet-B0-based transfer learning for video
forgery classification. A new PSO algorithm is proposed for hyperparameter search, which incorporates composite leaders
and reinforcement learning-based search strategy allocation to mitigate premature convergence. To assess whether an
image or a video is manipulated, both models are trained on datasets containing deepfake and genuine photographs and
videos. The empirical results indicate that the proposed PSO-based EfficientNet-GRU and EfficientNet-B0 networks
outperform the counterparts with manual and optimal learning configurations yielded by other search methods for several
deepfake datasets.

Keywords Video deepfake detection · EfficientNet · EfficientNet-Gated Recurrent Unit · Hyperparameter selection · Particle Swarm Optimization

Extended author information available on the last page of the article

1 Introduction

Generative models are increasingly used. They have demonstrated a great deal of success in generating high-quality fake photographs, videos and audio, which are frequently impossible to distinguish from the genuine ones. The Internet makes it possible for anybody to create this type of media. The use of deep learning algorithms in the production of this form of content is a significant contributor to the development of the notion known as "deepfake". A person with malicious intentions is able to create real-time video deepfakes using the tools that are currently available. Such video manipulations involve the replacement of a source individual with a target individual using a series of techniques such as face swapping or lip synchronization to generate an entirely new video.

With the continuous advancement of deep generative models, it is getting increasingly difficult to distinguish between authentic and fraudulent photographs and videos. In a study, Nightingale et al. [1] showed, in two different trials, that people's capacity to discern edited photographs of real-world settings is severely restricted. Their results raised concern about the degree to which individuals may be misled in their day-to-day lives. According to the authors, this was supported by the fact that manipulated images already command a significant amount of attention in the media, e.g. social networking sites. Additionally, the researchers were unable to find any convincing evidence to support the idea that personal characteristics, such as skills

in photography or opinions regarding the degree to which image manipulation is pervasive in society, are linked to a better ability to spot or locate manipulations. This was one of our primary motivations when conducting this study.

Video and image forgeries are primarily generated by training generative models based on variational autoencoders (VAEs), Generative Adversarial Networks (GANs) or various blends of these two types of models with other image processing techniques. The majority of the models base their networks on Convolutional Neural Networks (CNNs); since 2022, more recent models are also able to use Vision Transformers (ViT).

Some of the newest generation of models are able to produce images with a very subtle level of artefacts. As a result, as pointed out by Sabir et al. [2], the only way to determine whether a face is real or fake is by looking for features such as an unnaturally asymmetric face, weird teeth and other more obvious inconsistencies not localized on the face but in the background. We introduce different types of attacks as follows.

Face Swapping: The face of a source individual in a video is changed to match the form and characteristics of that of a target individual [3]. After the face of the target individual is extracted from an image, it is transferred to a newly generated image or video. In order to produce the manipulated image, the process typically involves training two autoencoders, one on the source images and one on the target images, and then switching the decoders in order to rebuild the face from image A onto image B. Some applications have gained popularity due to the ease with which they can be deployed and the results they produce. These applications enable even people with little knowledge to create fake images. Natsume et al. [4] proposed a region-separative GAN (RSGAN) model for the generation of synthetic images independently for faces and hair, which led to improved outcomes on face swapping.

Facial Reenactment: The facial reenactment techniques change or reconstruct particular aspects of a face, such as one's head position, expression, eye gaze or lip movement. GAN is the most adopted facial reenactment image generator. In 2016, Thies et al. [5] developed one of the first tools, namely Face2Face, for facial reenactment. It was a real-time system that created a 3D face model based on the input image and used its 3D geometry to render the fake face. Reenactment can also be carried out using a single video input with the method proposed in [6]. Specifically, the head movement, facial expression, eye gaze and blinking of the eyes were collected initially and then transferred to a target actor who was also using a 3D head model. The detection and classification of face swapping and facial reenactment are the primary focuses of this research.

Due to the gravity of the problem and the possible danger that deepfakes pose to social stability, there has been an increase in research aimed at finding a solution to the challenge of identifying deepfakes. Constructing a CNN specifically tailored to the problem at hand, in this instance, detecting deepfakes, is one approach that may be used. Even so, there are a variety of channels that might be explored. For instance, some may choose from a variety of network designs, while others may customize a number of hyperparameters in accordance with specific tasks. In addition, there are several studies adopting algorithms to handle the processing of an image or a video. An algorithm may, for instance, take into account individual video frames and attempt to locate instances of spatial inconsistency. Alternatively, the algorithm may compare successive video frames in an effort to identify instances of temporal inconsistency.

In this research, we propose transfer learning of CNNs and hybrid CNN-Recurrent Neural Network (RNN) models with Particle Swarm Optimization (PSO)-based hyperparameter selection for deepfake detection. Our system comprises three key steps. (1) Firstly, a data preprocessing procedure is applied to crop facial regions to eliminate background distraction. The cropped facial regions are then used as inputs to deep networks for video classification. (2) Specifically, an ImageNet pretrained EfficientNet is fine-tuned using the deepfake datasets with video frames as inputs, while EfficientNet serialized with a Gated Recurrent Unit (GRU) network is used with videos as inputs directly for synthetic video classification. During the training stage, a new PSO algorithm is proposed to conduct optimal hyperparameter search for EfficientNet and EfficientNet-GRU, which integrates composite leader signal generation and reinforcement learning-based search operation deployment to increase search flexibility. (3) Finally, the yielded optimized settings are used to establish the final transfer learning and hybrid networks for fake/real video classification. The research novelties are elaborated as follows.

• To reduce background distraction, a face cropping procedure using a multi-task cascaded deep learning model is used for facial region extraction from video frames.
• A hybrid EfficientNet-GRU network and transfer learning using EfficientNet are proposed for identifying fake from real videos, owing to their great efficiency in extracting spatial–temporal cues and capturing inter/intra-frame inconsistencies. Automated hyperparameter search using the proposed PSO algorithm is also conducted for both networks to further boost


performance. The new PSO algorithm combines adaptive nonlinear functions for composite leader generation as well as the Q-learning algorithm for optimal dispatch of different search operations, to overcome local optima traps. Evaluated using several well-known deepfake datasets, the proposed PSO-based EfficientNet-B0 and EfficientNet-GRU networks achieve superior performance over those of existing state-of-the-art methods for video authenticity identification. The proposed optimizer also shows statistical superiority over other search methods in solving a variety of unimodal and multimodal benchmark functions.

2 Related work

In this section, we discuss state-of-the-art deep neural networks for deepfake detection and swarm intelligence algorithms for optimal hyperparameter fine-tuning.

2.1 Deepfake detection

One of the first end-to-end trainable architectures for video classification using CNN and RNN was proposed in 2015 by Liang and Hu [7]. Their work exploited recurrent CNN (RCNN) for undertaking object recognition. In 2016, Donahue et al. [8] studied the Long-Term Recurrent Convolutional Network (LRCN), where a CNN processed raw visual input and fed it to a stack of recurrent sequential models for spatial–temporal feature extraction. Specifically, the LRCN model adopted CNNs to learn visual features from video frames and passed a sequence of image embeddings through Long Short-Term Memory (LSTM) networks for video classification.

In 2018, after the first deepfakes appeared, the idea of using such a hybrid architecture for deepfake detection was first researched and published, combining the advantageous characteristics of the RNN to enhance the performance of the CNN. According to Sabir et al. [2], the body of literature that has been most explored to gain insight about video classification for deepfake detection was video action recognition [9–12] because of the extensive development in the field in recent years and its similar spatial–temporal processing nature to that of deepfake detection. One of the main methods of human action recognition is a "two-stream" network methodology, which processes video frames and optical flow in two different branches before fusing them for video classification. A deepfake detection method using this two-stream technique was presented in [13]. In addition, an RCNN model was employed by [2] for deepfake detection. It processed each frame using a CNN, and the extracted spatial features were further processed using an RNN for video forgery identification. Other strategies utilized biological signals and attention layers [14, 15] to enhance the efficiency of manipulated video detection. In particular, these techniques paid special attention to lip movements and eye gaze to check for inconsistencies.

Moreover, Sabir et al. [2] focused on using a CNN followed by a recurrent model with the input as the query video frame sequences. Their model exploited frame-to-frame temporal differences. Their work claimed that since image manipulations were conducted frame-by-frame and temporal discrepancies were expected, low-level face manipulation techniques should show temporal artefacts with inconsistent features across frames. Their work thus aimed to identify such temporal inconsistencies. Specifically, they used a DenseNet to extract features like discontinuous jawlines and blurred eyes and then retrieved the RNN's final output rather than averaging recurrent features across all time steps as in traditional video classification pipelines.

2.2 Hyperparameter search

Hyperparameter configurations of deep neural networks have significant effects in reducing or preventing oscillations in gradient descent as well as correcting gradient directions for weight adjustment towards global optima. The local optima, plateaus and saddle points in the loss space are the major challenges that deep networks encounter. If hyperparameters of deep neural networks are not appropriately optimized, the networks' performance will be affected significantly by the above factors. Although methods such as grid and random search work well for hyperparameter search with discrete values in a small search space, there are other more effective methods, like employing swarm-based metaheuristic methods, to determine optimal hyperparameter configurations, especially in a continuous large search space like loss spaces in deep networks. We explore such an option through an evolutionary algorithm called PSO, which is simple to implement and has been shown in the literature to produce great robustness for learning configuration selection in neural networks.

The evolutionary algorithms are optimization methods that take inspiration from biological processes. The PSO algorithm was proposed by [16] in 1995, and it takes inspiration from fish or bird swarm movement.

Because PSO does not rely on gradient descent, one can use an objective function to optimize deep network parameters without relying on its derivatives [9]. In this work, hyperparameters of deep learning models such as the learning rate, dropout rate, image input size and number of frames extracted from videos will be optimized. The objective function will be associated with loss function of


the deep learning model to advise search of optimal learning settings.

PSO works by randomly initializing a group of particles in the search space of the function, and at each iteration it checks which particle achieves the lowest value (i.e. the most optimal loss) on the objective function. At each following step, each particle uses the information of the best solutions found by the swarm and itself, along with random exploration factors, to guide the particle's movement [16].

Taking $k$ as the iteration number, the velocity of a given particle $i$ is given by:

$$v_i(k+1) = w\,v_i(k) + c_1 r_1 \left(x_i^{pbest}(k) - x_i(k)\right) + c_2 r_2 \left(gbest(k) - x_i(k)\right) \tag{1}$$

where

– $v_i(k)$ is the velocity of particle $i$ at iteration $k$;
– $x_i(k)$ is the position of particle $i$ at iteration $k$;
– $w$ is an inertia weight;
– $c_1$ and $c_2$ are parameters called the "cognitive" and "social" coefficients, respectively;
– $r_1$ and $r_2$ are randomly generated numbers between 0 and 1;
– $x_i^{pbest}(k)$ (pbest) is the "personal best" position of the particle (i.e. the best position it has achieved so far);
– $gbest(k)$ is the "global best" position among all particles in the swarm (i.e. the best position achieved by the swarm).

The position of the particle $i$ at iteration $k+1$ will be updated with the velocity as follows:

$$x_i(k+1) = x_i(k) + v_i(k+1) \tag{2}$$

The inertia weight controls how much of the particle's previous velocity is kept in the update. A greater $w$ setting indicates that the particle's previous velocity has a strong effect on the new velocity. The cognitive and social weights define how much the particle is impacted by its own best past experiences (pbest) and the best experiences of the other particles in the swarm (gbest), respectively [17]. In general, the values of $c_1$ and $c_2$ should be greater than 0, and less than or equal to 2.5. Setting these values too low may lead the particles to fail to sufficiently explore the search space, whereas setting them too high may cause the particles to become extremely sensitive to changes in the swarm and exhibit suboptimal behaviour.

In short, PSO is a powerful optimization technique that has been used to solve a wide range of optimization problems. The velocity update formula is critical in establishing how the swarm particles move and update their positions in search of a satisfactory optimal solution. Variant methods have also been proposed to tackle local optima traps of the original PSO algorithm, which were widely adopted in hyperparameter and architecture search in deep networks [18–20].

There are inspiring related studies for hyperparameter and architecture search using multi-task learning. For example, automatic generation of multi-task learning models was conducted by Zhang et al. [21] for solving a variety of semantic segmentation problems. The automation process utilized a randomly assigned backbone network in conjunction with a set of tasks as inputs in an attempt to generate a multi-task model with a reasonable trade-off between performance and cost. A gradient-based search method was used for architecture search. The optimization process determined the assignment of different network nodes for each task and how these selected nodes were shared with other tasks. A unique characteristic of their work was the adoption of parameter sharing at the operator (neuron) level via a joint optimization of shared policies and network weights. Their yielded multi-task model showed great capabilities in tackling diverse multi-class semantic segmentation problems. In addition, automated production of search parameters and search mechanisms for metaheuristic algorithms was exploited by Stützle and López-Ibáñez [22]. Such techniques were capable of developing optimizers with effective new search strategies. They also showed great efficiency in enhancing existing search methods' performance via optimal parameter selection. Furthermore, an adaptive hybridized multi-task learning framework was developed by Lialestani et al. [23] pertaining to temperature prediction at different depth levels. Their work performed architecture generation of a multi-task multilayer perceptron neural network using a Firefly Algorithm (FA) variant developed by Shahri et al. [24]. The FA variant conducted multi-task network architecture generation, where the absorption and randomization parameters were fine-tuned by the population brightness variance.

Cheng et al. [25] developed a multi-task learning model with a hybrid CNN-transformer encoder for simultaneous image segmentation and classification using multimodal MRI image inputs. A U-Net-like encoder-decoder architecture was proposed with an additional transformer unit embedded in the bottom of the CNN-based encoder. The hybrid CNN-transformer encoder fused high-level spatial and global features extracted by a CNN-stream and a transformer-based operation, respectively. The joint learning of both segmentation and classification tasks was conducted via a compound loss function integrating segmentation and classification losses with uncertain weights. To tackle data sparsity and unlabelled data, a semi-supervised joint learning mechanism was deployed to enhance classification performance by integrating with uncertainty-based label selection.
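The canonical PSO update of Eqs. (1) and (2) in this subsection can be illustrated with a short sketch. The code below is a minimal NumPy implementation of the original PSO only, not the variant proposed in this paper. The function name `pso_minimize`, the toy objective `toy_loss`, the search bounds and the coefficient values (w = 0.7, c1 = c2 = 1.5) are all illustrative assumptions; the two search dimensions merely stand in for hyperparameters such as the learning rate and the dropout rate.

```python
import numpy as np


def pso_minimize(objective, lower, upper, n_particles=20, n_iters=50,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimise `objective` over a box using the canonical PSO update."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    dim = lower.size

    # Random initial positions inside the box, zero initial velocities.
    x = rng.uniform(lower, upper, size=(n_particles, dim))
    v = np.zeros((n_particles, dim))

    # Personal bests (pbest) and the global best (gbest).
    pbest = x.copy()
    pbest_val = np.array([objective(p) for p in x])
    best = int(np.argmin(pbest_val))
    gbest, gbest_val = pbest[best].copy(), float(pbest_val[best])

    for _ in range(n_iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # Eq. (1): inertia term + cognitive pull + social pull.
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        # Eq. (2): move, then clip to the bounds (clipping is an added
        # safeguard, not part of the equations in the text).
        x = np.clip(x + v, lower, upper)

        # Update personal bests where the new position is better.
        vals = np.array([objective(p) for p in x])
        improved = vals < pbest_val
        pbest[improved] = x[improved]
        pbest_val[improved] = vals[improved]

        # Update the global best if any personal best beats it.
        best = int(np.argmin(pbest_val))
        if pbest_val[best] < gbest_val:
            gbest, gbest_val = pbest[best].copy(), float(pbest_val[best])

    return gbest, gbest_val


def toy_loss(params):
    """Hypothetical stand-in for a validation loss; minimum at (-3, 0.3)."""
    log10_lr, dropout = params
    return (log10_lr + 3.0) ** 2 + (dropout - 0.3) ** 2


best_params, best_loss = pso_minimize(
    toy_loss, lower=[-5.0, 0.0], upper=[-1.0, 0.8])
print(best_params, best_loss)
```

In an actual hyperparameter search, `toy_loss` would be replaced by a function that trains the network with the candidate settings and returns its validation loss, so each objective evaluation becomes expensive and the swarm size and iteration count have to be chosen with that cost in mind.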


2.3 Other swarm intelligence algorithms

Besides PSO, in recent years, a number of new state-of-the-art swarm intelligence algorithms have been proposed, including Spotted Hyena Optimizer, Symbiotic Organisms Search, Tree Seed Algorithm, Sparrow Search Algorithm and Tunicate Swarm Algorithm, for tackling engineering, mathematics and image processing optimization problems. Proposed by Cheng and Prayogo [26], Symbiotic Organisms Search employs mutualism, commensalism and parasitism processes to simulate mutual interaction of two organisms to lead the search of global optimality. Firstly, during mutualism, the mean position vector of the current search agent and another randomly selected organism is calculated, which is used in conjunction with the global best solution to update the positions of both the current and randomly selected individuals. Their offspring solutions are used to replace them if the new solutions are fitter. Secondly, for the commensalism stage, the difference between the global best solution and a randomly selected individual is used to guide the movement of the current search agent. Subsequently, the parasitism operation randomly mutates the dimensions of the current search agent, which is used to substitute a randomly selected organism if this mutated solution is fitter. The effectiveness of Symbiotic Organisms Search was evidenced by its capabilities in handling diverse engineering and benchmark optimization problems, as indicated in a related study [27]. Modified Symbiotic Organisms Search algorithms and their hybridization with other search methods were also extensively studied in [27] to guide future development.

Motivated by the cluster hunting behaviours of the spotted hyenas, the search process of Spotted Hyena Optimizer [28] comprises encircling/hunting and attack mechanisms. The encircling/hunting operation performs local exploitation and intensifies the search around the global best solution. Specifically, the leader spotted hyena with the best fitness score is used to re-allocate the remaining spotted hyenas to its optimal neighbouring regions. Those spotted hyenas with high correlations to the leader form a cluster where their mean position is used to generate a new swarm leader. Adaptive search coefficients are exploited to balance local and global search operations. Diverse Spotted Hyena Optimizer variant methods, including the integration with other search algorithms such as PSO and Simulated Annealing (SA) as well as other local/global search strategies, were extensively studied in Ghafori and Gharehchopogh [29]. Their flexibilities were further demonstrated in solving a variety of complex single and multi-objective optimization problems [29].

A Sparrow Search Algorithm was developed by Xue and Shen [30] where the population was composed of top ranking producer and lower ranking scrounger subswarms. In addition, 10%–20% of the sparrows are capable of perceiving danger. The top ranking producers perform global exploration when a randomly generated alarm coefficient is lower than the pre-defined safety threshold, otherwise the producer subswarm conducts local exploitation using a Gaussian distribution. The lower ranking scrounger subswarm is guided by the producers to exploit optimal local regions of the respective producers. The sparrows with the capabilities of sensing danger follow the swarm leader while staying away from the global worst solution. The algorithm outperformed other classical search methods for solving a number of numerical optimization problems. To overcome slow convergence of the model, a number of variant methods of the Sparrow Search Algorithm were studied by Gharehchopogh et al. [31]. These include the incorporation of PSO, Firefly Algorithm (FA) [32], Differential Evolution (DE) and Sine Cosine Algorithm (SCA) [33] with Sparrow Search Algorithm, respectively. Other enhancement mechanisms such as random walk based on Lévy flights and chaotic map-based swarm initialization are also exploited to increase search robustness. Related studies of neural architecture and hyperparameter search using the Sparrow Search Algorithm were also investigated in [31].

A Tree Seed Algorithm was exploited by Kiran [34]. A swarm of tree solutions is randomly initialized. For each tree, a number of seed solutions are generated. Specifically, each new seed solution is generated using two sub-dimension-based search operations. One is guided by the best tree solution and a randomly selected tree position while the other is led by the current tree position and a randomly selected tree location. For the generation of a specific dimension of a seed solution, the selection of these two search strategies is controlled by a randomly generated threshold parameter. The number of new seed solutions that can be produced for each tree varies dynamically between 10% and 25% of the population size, in order to increase search exploitation. If the best offspring seed solution is fitter than the tree solution, it is used to substitute the tree solution. The algorithm obtained competitive performance in comparison with other search methods such as PSO and FA for solving 24 numerical test functions. A comprehensive survey of the Tree Seed Algorithm was conducted by [35] where a variety of variants of Tree Seed Algorithm were analysed. The variant methods included the combination of Tree Seed Algorithm with other swarm intelligence algorithms such as Artificial Bee Colony (ABC) [36] and SCA. Improvement strategies such as Lévy and Gaussian distributions were also utilized to enhance flexibility of Tree Seed Algorithm. Moreover, the effectiveness of Tree Seed Algorithm was also ascertained by handling a variety of real-world optimization problems such as feature selection


and image compression. A variant method of Tunicate Swarm Algorithm was studied by Gharehchopogh [37], which included Quantum Rotation Gate (QRG) and mutation operators based on Cauchy, Gaussian and Lévy distributions to increase search robustness of the original method. In particular, besides using QRG, their work explored the effectiveness of the combinations of any two out of the three mutation operators as well as the integration of all three random walk strategies. The superiority of the full model integrating all mutation operators along with QRG was ascertained by solving a set of 52 unimodal, multimodal, composition and hybrid test functions, as well as several other engineering optimization problems.

A new FA variant was developed by Shahri et al. [24] by incorporating a brightness expectation value and a generalized weighted average of a random brightness. It exploited an adaptive absorption coefficient and an adaptive randomization search step to better balance the search between intensification and diversification. The population fitness variance was used to adjust these adaptive search parameters after a number of iterations, which was calculated using the difference between the fitness of each firefly and the mean fitness of the overall swarm, divided by a dynamic normalization factor. Owing to the adaptive adjustment of the search parameters based on the fitness variations during the search process, their method showed better capabilities in overcoming local optima traps in comparison with FA for solving several benchmark functions as well as multi-objective blasting engineering problems.

Motivated by foraging behaviours of social spiders, Social Spider Optimizer (SSO) [38] first generates a vibration intensity of each spider whereby a better fitness score aligns with a stronger vibration. The strongest vibration intensity generated by other spiders and sensed by the current spider is extracted. A randomly generated binary mask is used to select either this new best vibration intensity or another vibration intensity generated by a random individual in each dimension for the construction of a new personal leader signal. This new elite leader signal is used to guide a random walk operation for position updating. Boundary checks are also performed after position updates. SSO shows competitive performance as compared with a number of state-of-the-art search algorithms for tackling diverse numerical optimization problems. Besides the above, there are also other swarm intelligence algorithms developed for handling feature selection, hyperparameter search, deep neural architecture generation with respect to image segmentation/classification [19, 39–41], human action recognition [42] and environmental sound classification [18], as well as solving other engineering and mathematical optimization problems [43–49].

3 The proposed methods for deepfake detection

The proposed deepfake detection system consists of three key steps, i.e. (1) data preprocessing for the extraction of cropped facial regions, (2) the proposed PSO-based hyperparameter optimization during network training stage and (3) model establishment using the selected optimal settings and subsequent evaluation using unseen test samples. In particular, transfer learning with EfficientNet as the backbone as well as a hybrid EfficientNet-GRU model is studied in conjunction with PSO-based hyperparameter search for synthetic video classification. We introduce each key stage below.

3.1 Data preprocessing

The initial stage of the training pipeline involves extracting and pre-processing the first 150 frames of each video. The Python OpenCV library was used to extract the image frames, and then the faces on each frame were processed through the Multi-task cascaded CNN (MTCNN) face detector [50] for cropping. After that, the face crops were organized into folders and saved as image files within the file system. In particular, the cropped facial regions from the real videos are augmented during training by flipping them horizontally to increase real sample sizes. Figure 1 shows the detailed preprocessing pipeline for face cropping.

Fig. 1 Preprocessing pipeline for face cropping

Proposed by Zhang et al. [50], the MTCNN model is used for face detection. Specifically, the model is able to perform face classification, facial region bounding box generation and facial alignment. MTCNN firstly deploys a proposal CNN to perform binary (face and non-face) classification and generate a number of candidate bounding box regression vectors. The nonmaximum suppression (NMS) method is used to merge highly overlapped bounding boxes. A second CNN model is subsequently utilized to further refine the bounding box regression results by rejecting remaining candidate false positives. A third comparatively deeper CNN is used in the final stage to determine the final bounding box output as well as generate a set of facial landmarks indicating positions of both eye centres, left and right mouth corners and the nose tip. The MTCNN model outperformed other face detection benchmarks while maintaining efficient computational cost.

In this research, we employ MTCNN to perform real-time facial bounding box regression for all sampled frames extracted from the video, without using associated facial landmark outputs. The detected facial regions determined by the bounding box regression vectors are cropped out for subsequent classification. Owing to the fact that a region of

interest containing the entire face is tracked through the overall video using bounding boxes, the false positives of face classification and localization are greatly reduced. This in turn significantly improves real and manipulated video classification performance. Figure 2 shows the results for face detection and cropping for a sample video.

Fig. 2 Example outputs for face detection and cropping for a sample video clip

We use two well-known video deepfake datasets, i.e. Celeb-DFv2 and DFDC, for model evaluation. The Celeb-DFv2 dataset [51] consists of 590 genuine and 5,639 fake videos. The official split of the dataset provides 5,711 and 518 videos for training/validation and test, respectively. We adopt this official split in our experiment. The DFDC dataset [52] has a total of 23,654 real and 104,500 synthetic videos. We extract a subset of 1,016 original and 8,425 tampered videos in our experiments. A test set of 206 real and 1,636 fake videos is used for testing, with the remaining videos used for training and validation. We further split the training and validation sets using a ratio of 80-20.

Besides the evaluation of each of the above datasets, we also generate a customized dataset by combining the above two datasets with a video face recognition database, i.e. the YouTube Faces Database [53], for model evaluation. The YouTube Faces Database is designed for video face recognition and consists of 3,425 videos from 1,595 subjects, with an average of 181.3 frames per video. It is used to increase the genuine video sample sizes to balance the large numbers of fake instances provided by Celeb-DFv2 and DFDC. To be specific, at the training stage, all real videos from the official training set in Celeb-DFv2, a comparatively larger number of real videos from DFDC, as well as 1,618 videos from the YouTube Faces Database with more than 50 frames, were combined to construct the customized genuine training video set. In addition, a balanced number of fake videos are drawn from the official training set of Celeb-DFv2 and our DFDC subset in order to obtain a ratio of approximately 50%-50% between fake and real videos in the constructed training set. We further split the combined training set by a ratio of 80-20 for training and validation, respectively. The real and fake samples from the official test set of Celeb-DFv2 and a comparatively larger DFDC unseen test set are used for model evaluation in this experiment.

Table 1 shows the detailed training/validation and test sample sizes for each dataset.

Table 1 Data split of each dataset

              Training/validation        Test
Dataset       Real    Fake    Total      Real    Fake    Total
Celeb-DFv2     412    5299     5711       178     340      518
DFDC           810    6789     7599       206    1636     1842
Combined      3683    3739     7422       605    2770     3375

The final step was to send the data to the PyTorch DataLoader so that the models could be trained and validated for each experimental setting. During the training process, the transformations serve to supplement the data by changing each frame from real videos at each epoch with a


random component. This is accomplished through the use of augmentation. For the training dataset, we initially used random rotation of 20 degrees and Gaussian blur. The images were first resized and then normalized before being used in either the training/validation sets or the test set. For the oversampling of frames from real videos, the RandomHorizontalFlip() function is utilized.

3.2 Model 1—transfer learning using CNN

We first employ transfer learning using a CNN model with the EfficientNet architecture for deepfake detection. Figure 3 shows the overall dataflow using transfer learning for synthetic video classification.

EfficientNet was designed with the goal of scaling CNNs more efficiently than other deep networks proposed previously [54]. Since its inception, this CNN architecture has been shown to be among those that achieve the highest performance when tested against various image classification benchmarks.

EfficientNet makes use of a compound scaling strategy, which involves scaling the network's width, resolution and depth uniformly to whatever degree is required to make optimal use of the computational resources available. A grid search is usually used to find the scaling constants [54].

The network design of EfficientNet makes use of the mobile inverted bottleneck convolution (MBConv), which is analogous to the MobileNetV2 convolutional block but slightly larger. A neural architecture search that jointly optimizes accuracy and FLOPS was employed in the construction of the baseline model. After that, a family of EfficientNet models was obtained by scaling it up using the compound scaling strategy. Within the context of this research, the version known as EfficientNet-B0 was employed [54]. The overall architecture specifically used in this study, including the fully connected layers, is shown in Table 2. The pure EfficientNet was also used by the winning solution of the Deepfake Detection Challenge hosted by the DFDC dataset authors in 2019 [52].

The MBConv blocks consist of residual blocks, like those of ResNet, that connect the beginning of the block with the end using a skip connection. The difference from the original ResNet block is that, regarding the number of channels, it follows a narrow-wide-narrow approach instead of the traditional wide-narrow-wide strategy [54].

The EfficientNet-B0 model was initially trained using ImageNet. We further fine-tune the model using the training/validation sets of the frames of each deepfake dataset in our experiments. The fine-tuned model is used for the identification of fake/real videos. In addition, a new PSO variant is used to fine-tune the network hyperparameters, i.e. learning rate, dropout rate, image size and number of frames, in an attempt to further enhance performance. Specifically, a random swarm is first initialized in a search space of [0, 1]. Each particle has four dimensions to represent the four optimized hyperparameters. The proposed PSO search operations are used to guide the particle movement in the search space for hyperparameter search. We evaluate each particle's fitness by converting its position into a valid network learning configuration, which is used to set up the transfer learning process. The network performance on the validation set is used as the fitness measure of each particle. The best solution identified by the proposed optimizer is used as the recommended learning configuration of EfficientNet-B0. The optimized EfficientNet-B0 model is then trained using the combined training set with a larger number of epochs and tested with the respective test sets for deepfake detection.
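The conversion of a particle position in [0, 1] into a valid learning configuration can be illustrated with a minimal sketch. It assumes the search ranges reported in Table 9; the log-scale mapping of the learning rate, the rounding of image size and frame count, and the name `decode_position` are illustrative assumptions and are not taken from the authors' implementation.

```python
import math

# Search ranges as listed in Table 9 of this article.
RANGES = {
    "learning_rate": (1e-5, 1e-3),
    "dropout": (0.1, 0.9),
    "image_size": (100, 128),
    "frames": (10, 50),
}


def _lerp(lo, hi, t):
    """Linear interpolation between lo and hi for t in [0, 1]."""
    return lo + t * (hi - lo)


def decode_position(position):
    """Map a 4-dimensional particle position in [0, 1] to hyperparameters.

    Hypothetical helper: the learning rate is interpolated on a log10 scale
    (an assumption), the remaining dimensions linearly.
    """
    if len(position) != 4:
        raise ValueError("expected a 4-dimensional particle position")
    # Boundary check: clip every dimension back into [0, 1].
    p = [min(1.0, max(0.0, v)) for v in position]
    lr_lo, lr_hi = RANGES["learning_rate"]
    log_lr = _lerp(math.log10(lr_lo), math.log10(lr_hi), p[0])
    return {
        "learning_rate": 10.0 ** log_lr,
        "dropout": _lerp(*RANGES["dropout"], p[1]),
        "image_size": round(_lerp(*RANGES["image_size"], p[2])),
        "frames": round(_lerp(*RANGES["frames"], p[3])),
    }


config = decode_position([0.5, 0.5, 0.5, 0.5])
```

The resulting dictionary would then be used to configure one training run, whose validation loss serves as the particle's fitness.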

Fig. 3 Classification of real and deepfake videos using EfficientNet-B0


Table 2 EfficientNet-B0 model architecture [54]

Stage  Operator             Resolution  Channels  Layers
1      Conv3x3              224 × 224   32        1
2      MBConv1 k3x3         112 × 112   16        1
3      MBConv6 k3x3         112 × 112   24        2
4      MBConv6 k5x5         56 × 56     40        2
5      MBConv6 k3x3         28 × 28     80        3
6      MBConv6 k5x5         14 × 14     112       3
7      MBConv6 k5x5         14 × 14     192       4
8      MBConv6 k3x3         7 × 7       320       1
9      Conv1x1 & Pooling    7 × 7       1280      1
10     FC                   1           256       1
11     FC with dropout      1           128       1
12     FC                   1           2         1

3.3 Model 2—hybrid CNN-RNN

Besides using transfer learning for video forgery detection, motivated by [2, 55–57], a hybrid CNN-RNN architecture is proposed in this research for distinguishing fake from real videos. Specifically, this hybrid model uses EfficientNet serialized with a GRU layer for spatial–temporal feature extraction to inform video classification. The proposed PSO algorithm is used for identifying optimal hyperparameters. Figure 4 shows the system dataflow. The detailed network architecture is shown in Table 3.

Table 3 Hybrid EfficientNet-GRU model architecture

Stage  Operator             Resolution  Channels  Layers
1      Conv3x3              224 × 224   32        1
2      MBConv1 k3x3         112 × 112   16        1
3      MBConv6 k3x3         112 × 112   24        2
4      MBConv6 k5x5         56 × 56     40        2
5      MBConv6 k3x3         28 × 28     80        3
6      MBConv6 k5x5         14 × 14     112       3
7      MBConv6 k5x5         14 × 14     192       4
8      MBConv6 k3x3         7 × 7       320       1
9      Conv1x1 & Pooling    7 × 7       1280      1
10     GRU                  1           1280      2
11     FC with dropout      1           256       1
12     FC                   1           128       1
13     FC                   1           2         1

Fig. 4 Classification of real and deepfake videos using EfficientNet-GRU

As shown in Table 3, a 1,280-dimensional latent representation extracted from the last convolutional layer of EfficientNet is taken as input for each frame, and the features of all frames are concatenated and passed on to the GRU layer. The GRU layer is used to take advantage of the spatial–temporal features from the sequence of frames to verify whether the video is a deepfake or a real video. The process is different from the pure CNN (i.e. the aforementioned EfficientNet), which takes the average over all frames for deepfake detection, as in the transfer learning process. The GRU component thus has 1,280 latent dimensions and 1,280 hidden layers as the layer configurations in this study.

The optimizer used is ADAM, which combines features of optimization algorithms such as RMSProp and ADAGrad. It adjusts the learning rate for each weight on the fly, using exponentially weighted moving averages to obtain the first and second moments of the gradient estimates [56]. The loss function adopted is the cross-entropy loss. It gives two likelihoods for the real and fake labels using a softmax function [56]. In addition, to improve discriminative feature learning, we fine-tune the weights of ImageNet pre-trained EfficientNet-B0 embedded in the proposed EfficientNet-


GRU model using the combined training set with a small number of epochs (i.e. 5 epochs), before passing on features to the GRU layer. Moreover, the proposed PSO model is used to fine-tune hyperparameters of this hybrid network during the training stage, similar to the process discussed earlier for parameter search using transfer learning. Specifically, we optimize the learning rate, dropout rate, image size and number of video frames, owing to their significance to network performance.

3.4 The proposed PSO model for hyperparameter optimization

A new PSO variant is proposed for hyperparameter search for both EfficientNet-GRU and EfficientNet-B0 in this research. In order to tackle limitations of the original PSO algorithm, it incorporates nonlinear functions for composite leader generation and a reinforcement learning strategy for dynamically adjusting the search process. As such, different search actions led by different hybrid leaders and the global best solution are dynamically dispatched based on the reward schemes of the reinforcement learning algorithm. Figure 5 shows the overall proposed algorithm. The detailed search strategies are presented below.

Fig. 5 Data flow of the proposed PSO algorithm

3.4.1 Composite leader generation

As indicated in existing studies, the original PSO model is likely to be trapped in local optima because of the adoption of a single swarm leader to lead the search process. Therefore, composite leaders are produced by incorporating the global best solution and a distant second leader based on the adaptive weighting factors generated using nonlinear formulae. Equation 3 shows the operation for composite leader generation, where the remote second leader, sbest, is obtained by selecting the most distant particle to the swarm leader among the top 5 ranking solutions.

\[ \mathrm{composite}(k) = w_a \times \mathrm{gbest}(k) + w_b \times \mathrm{sbest}(k) \tag{3} \]

where w_a and w_b are the adaptive weighting factors which are used to weigh the effects of the swarm leader and the second leader for composite signal generation. Two sets of nonlinear functions are introduced for weighting coefficient generation.

Equations 4–6 define the first set of formulae for adaptive weighting coefficient production.

\[ r = \left( \left( \frac{|\cos(0.5\varphi)|}{2} \right)^{\frac{1}{2}} + \left( \frac{|\sin(0.5\varphi)|}{2} \right)^{\frac{1}{2}} \right)^{-2} \tag{4} \]

\[ x = r\cos(\varphi) \tag{5} \]

\[ y = r\sin(\varphi) \tag{6} \]

where φ = [0 : 0.001 : π], with x and y denoting the coordinates of the produced 2D points. The above equations generate increasing and decreasing subgraphs, as shown in blue and orange lines, respectively, in Fig. 6. Each comprises 1571 unique 2D points. We subsequently extract maximum iteration number of values from the 1571 unique y-axis values in the increasing branch with an interval of i/(maximum iteration). These extracted increasing values are used as the weighting factor w_a for the swarm leader. Similarly, maximum iteration number of values are also extracted from the 1571 unique y-axis values in the decreasing subgraph with an interval of i/(maximum iteration). They are subsequently


assigned to the weighting coefficient w_b for the second leader. Each pair of increasing w_a and decreasing w_b parameters is utilized for producing a composite leader in each iteration. The adoption of such increasing and decreasing coefficients strengthens the effects of the swarm leader and reduces the influence of the second leader as the iteration number increases. As such, the algorithm encourages global exploration and intensifies local exploitation at the beginning and end of the search process, respectively.

Fig. 6 Resulting increasing and decreasing subgraphs as defined in Eqs. 4–6

Besides the above, another set of adaptive increasing and decreasing coefficients is also generated using Eq. 7 and Eqs. 5–6, for composite leader generation to increase search flexibility. The resulting increasing and decreasing sub-contours defined by Eq. 7 are illustrated in Figure 7, with each containing 1571 unique 2D points. We also extract maximum iteration number of values from the 1571 unique y-axis values in the increasing branch with an interval of i/(maximum iteration) and assign them as the increasing weight coefficient w_a for the swarm leader. The same process is also applied to the decreasing sub-contour for the generation of the weight factor w_b for the second leader. These new sets of w_a and w_b are then utilized for producing composite leaders.

The difference between these new sub-contours defined in Eq. 7 and the subgraphs defined by Eq. 4 is that these new sub-contours generate larger weighting factors in comparison with those yielded by the previous subgraphs, therefore diversifying the production of the combined leaders.

\[ r = \left( \left( \frac{|\cos(0.5\varphi)|}{2} \right)^{2} + \left( \frac{|\sin(0.5\varphi)|}{2} \right)^{2} \right)^{-\frac{1}{2}} \tag{7} \]

Each composite leader is then used to replace the global best solution in Eq. 1 for velocity production with respect to hyperparameter search, as shown in Eq. 8. Such composite leaders are able to explore the search space more thoroughly and show enhanced capabilities in tackling stagnation.

\[ v_i(k+1) = w\,v_i(k) + c_1 r_1 \left( x_i^{pbest}(k) - x_i(k) \right) + c_2 r_2 \left( \mathrm{composite}(k) - x_i(k) \right) \tag{8} \]

3.4.2 Reinforcement learning-based optimal search action selection

Owing to the employment of the composite leader generation process, a total of three search operations are constructed, led by the swarm leader and the two composite leaders produced above. A reinforcement learning algorithm is subsequently used to identify the optimal selection of different leader signals for hyperparameter search. Specifically, in each iteration, each particle is guided by either a composite leader or the global best solution, as recommended by the Q-learning algorithm [58].

The Q-learning algorithm [58] employs the Bellman equation defined in Eq. 9 to identify a sequence of optimal search actions. In reinforcement learning, an agent perceives the environment by learning from punishment and reward signals through trial and error. The ultimate goal of the reinforcement learning scheme is to yield a set of optimal search operations that maximize the cumulative reward. The expected cumulative reward score for a state–action combination, denoted as the Q-value, is updated using Eq. 9 in the Q-learning algorithm. These Q-values are stored in a Q-table pertaining to each state–action pair.
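A minimal sketch of the tabular update in Eq. 9, with one 3-by-3 Q-table per particle, is given below. The source specifies the update rule, the table size and a reward of +1 or -1 depending on whether the fitness improved; the state definition (here assumed to be the index of the previously executed search action), the epsilon-greedy selection rule, the default values of `theta` and `beta`, and all function names are assumptions made for illustration.

```python
import random

N_STATES = 3   # assumption: state = index of the previously executed action
N_ACTIONS = 3  # gbest-led search plus the two composite-leader-led searches


def new_q_table():
    """Create a 3-by-3 Q-table (rows: states, columns: actions)."""
    return [[0.0] * N_ACTIONS for _ in range(N_STATES)]


def select_action(q_table, state, epsilon=0.1, rng=random):
    """Epsilon-greedy choice of a search action (exploration rule assumed)."""
    if rng.random() < epsilon:
        return rng.randrange(N_ACTIONS)
    row = q_table[state]
    return row.index(max(row))


def reward(new_fitness, old_fitness):
    """+1 if the validation loss decreased, otherwise -1."""
    return 1.0 if new_fitness < old_fitness else -1.0


def q_update(q_table, state, action, r, next_state, theta=0.1, beta=0.9):
    """Eq. 9: Q_new = (1 - theta) * Q + theta * (r + beta * max_a Q(s', a))."""
    target = r + beta * max(q_table[next_state])
    q_table[state][action] = (
        (1.0 - theta) * q_table[state][action] + theta * target
    )
    return q_table[state][action]


# One illustrative step for a single particle.
q = new_q_table()
state = 0
action = select_action(q, state, epsilon=0.0)
r = reward(new_fitness=0.41, old_fitness=0.45)
q_update(q, state, action, r, next_state=action)
```

In a full optimizer each particle would keep its own table and call `q_update` once per iteration, after the network decoded from its new position has been evaluated.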

Fig. 7 Resulting increasing and decreasing subgraphs as defined in Eqs. 7 and 5–6
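The weighting curves of Figs. 6 and 7, the composite leader of Eq. 3 and the velocity update of Eq. 8 can be sketched in plain Python as follows. The inner and outer exponents of the radius function are left as parameters; the values used in the usage lines are one reading of Eqs. 4 and 7 and should be checked against the typeset equations. The evenly spaced sampling of one weight per iteration, the use of Euclidean distance for the second leader, and all function names are illustrative assumptions.

```python
import math


def radius(phi, inner_exp, outer_exp):
    """r = ((|cos(0.5*phi)|/2)**inner + (|sin(0.5*phi)|/2)**inner) ** outer."""
    c = abs(math.cos(0.5 * phi)) / 2.0
    s = abs(math.sin(0.5 * phi)) / 2.0
    return (c ** inner_exp + s ** inner_exp) ** outer_exp


def weight_branches(inner_exp, outer_exp, step=0.001):
    """Sample y = r * sin(phi) for phi in [0, pi]; split into two halves."""
    n = int(math.pi / step) + 1  # 3142 samples, i.e. 1571 per branch
    ys = [radius(k * step, inner_exp, outer_exp) * math.sin(k * step)
          for k in range(n)]
    half = n // 2
    return ys[:half], ys[half:]  # increasing branch, decreasing branch


def sample_schedule(branch, max_iter):
    """Pick max_iter evenly spaced values (assumed sampling rule)."""
    last = len(branch) - 1
    return [branch[round(i * last / (max_iter - 1))] for i in range(max_iter)]


def leaders(swarm, fitness, top_k=5):
    """Return gbest and sbest: the top-k particle farthest from gbest."""
    order = sorted(range(len(swarm)), key=lambda i: fitness[i])  # lower loss first
    gbest = swarm[order[0]]
    far = max(order[1:top_k], key=lambda i: math.dist(swarm[i], gbest))
    return gbest, swarm[far]


def composite_leader(gbest, sbest, wa, wb):
    """Eq. 3: composite = wa * gbest + wb * sbest, per dimension."""
    return [wa * g + wb * s for g, s in zip(gbest, sbest)]


def velocity_update(v, x, pbest, leader, w, c1, c2, r1, r2):
    """Eq. 8 with the composite leader in place of the global best."""
    return [w * vi + c1 * r1 * (pi - xi) + c2 * r2 * (li - xi)
            for vi, xi, pi, li in zip(v, x, pbest, leader)]


# Usage with one reading of Eq. 7 (inner exponent 2, outer exponent -1/2).
inc, dec = weight_branches(inner_exp=2.0, outer_exp=-0.5)
wa = sample_schedule(inc, max_iter=20)  # increasing weights for gbest
wb = sample_schedule(dec, max_iter=20)  # decreasing weights for sbest
```

Passing `inner_exp=0.5, outer_exp=-2.0` instead yields the corresponding curves under the same reading of Eq. 4.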


\[ Q_{new}(s_t, a_t) = (1 - \theta) \times Q(s_t, a_t) + \theta \times \left( r_t + \beta \times \max_{a} Q(s_{t+1}, a) \right) \tag{9} \]

where θ is the learning rate and β is the discount coefficient. At each time t, the agent performs an action a_t in state s_t, resulting in a new state s_{t+1}. Besides the current Q-value Q(s_t, a_t), the new Q-value, Q_new(s_t, a_t), is generated based on two additional components, i.e. an immediate reward r_t and a future reward max_a Q(s_{t+1}, a). After performing a selected search action a_t, the network with the new configuration decoded from the new position is evaluated on the sampled validation set of the combined dataset, whereby the cross-entropy loss on the sampled validation set is used as the fitness score. If this new fitness score is better than the previous fitness of the particle, an immediate reward of '1' is used for r_t; otherwise '-1' is dispatched. The future reward max_a Q(s_{t+1}, a) is produced by identifying the action that leads to the maximum reward in the new state s_{t+1}.

Each particle constructs a 3-by-3 Q-table with the rows and columns denoting the states and actions, respectively. Such a Q-table is used to determine the selection of optimal search actions led by either of the composite leaders or the global best solution. Therefore, in each iteration, each particle is assigned different leader signals to increase search robustness. In comparison with the random selection of search actions used in most existing PSO variants, the Q-learning algorithm produces a sequence of optimal search actions based on the reward principles imposed by the Bellman equation.

The proposed PSO model equipped with composite leader generation and Q-learning-based search operation dispatch shows enhanced search capabilities in tackling stagnation in our empirical studies. The hyperparameter search is conducted as follows. Because of the large training and validation sample sizes of the combined dataset, subsets of the training and validation sets are employed for optimal hyperparameter selection. Each element of the particle represents a hyperparameter to be optimized. The hyperparameters recommended by each particle are used to set up a customized deep network. It is subsequently trained and evaluated using the sampled training and validation sets of the combined dataset, respectively. The cross-entropy loss on the sampled validation set is used as the fitness score of each particle. The final optimized network is constructed using the configurations extracted from the global best solution. This final optimized network is trained with a much larger number of training epochs (i.e. 30 epochs) using the overall training set of the combined dataset and tested using the Celeb-DFv2, DFDC and combined datasets, respectively. We introduce evaluation details in the following section.

4 Evaluation and results

We evaluate the transfer learning and hybrid networks with manual and automatic hyperparameter optimization using the Celeb-DFv2, DFDC and combined datasets, respectively. Firstly, for the manual and PSO-based parameter selection, we use the training and validation sets of the combined dataset, since the combined dataset has a mixed data source which may lead to better representative capabilities. The optimized learning configurations are subsequently used to set up each model. Each optimized network is then trained using the combined training set and evaluated using the test sets of the Celeb-DFv2, DFDC and combined datasets, respectively. The experimental studies are elaborated in detail below.

4.1 Manual hyperparameter selection

In the initial experiments, the models were trained with hyperparameters searched manually. The process entails individually experimenting with a range of values for each selected hyperparameter. The following hyperparameters are optimized:

• Learning rate
• Dropout rate
• Image size: the size of the input image, measured as height × width, influences the result because of the number of pixels processed by the CNN. The ranges evaluated are from 100 to 130 pixels because of the trade-off between performance and cost.
• Number of frames per video: this setting influences the result because of the size of the sequence of frames extracted. The maximum limit considered is 50 frames because of comparatively smaller or similar maximum frame settings adopted in existing studies [59–61].

Table 4 Hyperparameters searched manually

Hyperparameter   Ranges
Learning rate    1 × 10^-5, 1 × 10^-4, 1 × 10^-3
Dropout rate     0.2, 0.3, 0.4
Image size       100, 112, 130
Frames           30, 40, 50

The training and validation sets of the combined dataset are used for hyperparameter search. For each of the aforementioned hyperparameters, a set of three values was chosen in order to manually fine-tune the model and identify the configuration with the lowest loss. The process


was repeated for each one of the four hyperparameters. The values that composed the set are illustrated in Table 4.

A total of 5 epochs were run to obtain the loss value for each hyperparameter set, as this was the number of epochs that demonstrated satisfactory stability in the preliminary results. To choose the optimal value of each hyperparameter, the algorithm was run through its possible values while the other hyperparameters remained constant. The parameter setting with the smallest loss was then selected. These manually selected settings are used to set up each network, which is further tested with the test sets of the Celeb-DFv2, DFDC and combined datasets, respectively.

4.1.1 Manual parameter search for EfficientNet-B0

As mentioned above, the training and validation sets of the combined dataset are used for hyperparameter search. For the EfficientNet-B0 architecture using transfer learning, the best hyperparameters obtained through this manual process are shown in Table 5.

Table 5 Best hyperparameters identified using manual selection for transfer learning using EfficientNet-B0

Hyperparameter   Values
Learning rate    1 × 10^-4
Dropout rate     0.3
Image size       112 × 112 px
Frames           30

4.1.2 Manual parameter search for EfficientNet-GRU

Similarly, Table 6 lists the hyperparameters that were determined to be optimal for the EfficientNet-GRU architecture on the combined dataset.

Table 6 Best hyperparameters identified using manual selection for the EfficientNet-GRU network

Hyperparameter   Values
Learning rate    1 × 10^-4
Dropout rate     0.3
Image size       112 × 112 px
Frames           40

Both the EfficientNet and EfficientNet-GRU networks equipped with the manually selected best model configurations are subsequently evaluated using the test sets of the Celeb-DFv2, DFDC and combined datasets, respectively. The detailed evaluation results for both networks with manually identified optimal settings are provided in Section 4.2.

We have also carried out experiments to determine the best search range of the number of frames for automated hyperparameter search. Existing studies such as Wang et al. [62] and Zhao et al. [15] employed 30 frames, while Zheng et al. [59], Shiohara and Yamasaki [60] and Zhao et al. [61] adopted 32 frames for video inference, for evaluating several video deepfake datasets, such as Celeb-DFv2, DFDC and FaceForensics++ (FF++). These studies resized the cropped facial images to larger image resolutions, such as 224 × 224, 256 × 256 and 380 × 380, in their experiments. We identify the optimal search range of the frame settings using the combined training set for model training and the official Celeb-DFv2 test set for model testing. We manually set up the frame settings in the range of [10, 100], with the following fixed learning configurations, i.e. learning rate = 0.0001, dropout rate = 0.3 and image size = 112, for both EfficientNet-B0 and EfficientNet-GRU. For both long and short videos, the target number of frames is randomly sampled from each video. The detailed evaluation results, i.e. accuracy rates and Area Under the Curve (AUC) scores, are shown in Tables 7 and 8.

Table 7 Experiments using EfficientNet-B0 with different numbers of frames for the Celeb-DFv2 test set

No. of frames   Accuracy   AUC
10              0.7703     0.7086
20              0.7915     0.7421
30              0.8263     0.7780
40              0.8127     0.7610
50              0.7896     0.7393
60              0.7761     0.7264
70              0.7413     0.6825
80              0.7568     0.6929
100             0.7413     0.6745

As indicated in Tables 7 and 8, experimental results for the EfficientNet-B0 model show improvements when the frame setting increases from 10 to 30 using the Celeb-DFv2 test set. When the number of frames is further increased to 50 and above, the training cost increases significantly and the network increasingly overfits, owing to the capture of irrelevant noise between frames, lowering its performance as indicated in both accuracy rates and AUC scores. A similar case is also observed for the evaluation of EfficientNet-GRU using the Celeb-DFv2 test set. The model shows enhanced performance when using the frame settings ranging from 10 to


40. When the frame settings are further increased to 50 and above, both accuracy rates and AUC scores are reduced, because of the extraction of noisy redundant details from video frames. A similar observation is also obtained when using the DFDC and combined test sets. Therefore, in order to generate robust networks and balance computational cost against performance, we employ the frame setting range of [10, 50] for automated hyperparameter search for both networks.

Table 8 Experiments using EfficientNet-GRU with different numbers of frames for the Celeb-DFv2 test set

No. of frames   Accuracy   AUC
10              0.7741     0.7142
20              0.8089     0.7621
30              0.8224     0.7750
40              0.8417     0.7938
50              0.7992     0.7534
60              0.7896     0.7380
70              0.7703     0.7139
80              0.7780     0.7078
100             0.7625     0.6893

4.2 Automatic hyperparameter search using the proposed PSO model

Besides manual hyperparameter selection, automatic hyperparameter search is also performed. We employ the proposed PSO model, as well as 8 classical search methods and 4 PSO variant algorithms, for hyperparameter search, including PSO, ABC [36], Salp Swarm Algorithm (SSA) [44], SSO [38], Bare-bones PSO (BBPSO) [63], Flower Pollination Algorithm (FPA) [64], FA [32], Dragonfly Algorithm (DA) [65], Genetic PSO (GPSO) [40], PSO with sine coefficients (SPSO) [20], a BBPSO variant with attractiveness and evade actions (BBPSOV) [49] and PSO with adaptive sine, circle and spiral coefficients (ACPSO) [66]. We adopt the variable settings of the above algorithms from their original studies in our experiments.

The following experimental settings are adopted. For each search method, a total of 10 search agents are created and a maximum of 20 generations is used for hyperparameter search. All the search methods perform the same number (i.e. 200) of function evaluations. A set of 5 runs was conducted for each search method. For both EfficientNet-B0 and EfficientNet-GRU, Table 9 shows the search ranges of the different hyperparameters. These search ranges are obtained via trial and error as discussed in Section 4.1. A set of five evaluation metrics, i.e. the mean precision, recall, accuracy, AUC scores and the Wilcoxon rank sum (RS) test, is used for performance comparison. The details of the experimental studies are presented as follows.

Table 9 Ranges of hyperparameters optimized by each search method

Hyperparameter   Ranges
Learning rate    1 × 10^-5 – 1 × 10^-3
Dropout rate     0.1–0.9
Image size       100–128
Frames           10–50

Again, the training and validation sets of the combined dataset are used for the proposed PSO-based hyperparameter search, owing to their representative capabilities. Because of the large sample sizes of this combined video deepfake dataset, in order to reduce the high computational cost of hyperparameter search, the training and validation sets of the combined dataset were sampled at 5% and 25%, respectively. The validation cross-entropy loss was used as the objective function for evaluating each particle. The experiments ranged from 10 to 20 hours for each of the two models using the sampled training and validation sets for each hyperparameter search.

4.2.1 Automated hyperparameter search for EfficientNet-B0

We conduct automated hyperparameter search for EfficientNet-B0 using different search methods based on the combined training set. The variable configurations of the different search methods are taken from existing studies. Additionally, for each search method, we adopt a swarm size of 10 and a maximum number of generations of 20. A set of 5 runs was used for hyperparameter search using transfer learning based on EfficientNet-B0. The final networks established with the optimal settings identified by each search method are trained for 30 epochs using the combined training set and tested using the three test sets, respectively.

The detailed evaluation results, i.e. the mean precision, recall, accuracy and AUC scores, as well as the Wilcoxon rank sum test results, for the Celeb-DFv2, DFDC and combined datasets are shown in Tables 10, 11 and 12. The AUC score is used as the summary measure of the overall model performance. Specifically, a higher AUC score correlates with a better classifier. The network with a higher AUC score typically has better capabilities in distinguishing between fake and real instances. In addition, the Wilcoxon rank sum test is also performed based on the


AUC scores over 5 runs to indicate the statistical significance of the proposed model over baseline search methods.

Table 10 Performance comparison for optimized EfficientNet-B0 models using the Celeb-DFv2 test set

Model       Acc.     Prec.    Recall   AUC      RS
Prop. PSO   0.9247   0.9101   0.9824   0.8985   n/a
PSO         0.8996   0.8830   0.9765   0.8646   9.74E-03
ABC         0.9015   0.8874   0.9735   0.8688   9.74E-03
BBPSO       0.8629   0.8549   0.9529   0.8220   9.74E-03
FPA         0.8900   0.8753   0.9706   0.8533   9.74E-03
SSA         0.8687   0.8579   0.9588   0.8277   9.74E-03
SSO         0.8919   0.8797   0.9677   0.8574   9.74E-03
FA          0.8668   0.8575   0.9559   0.8263   9.74E-03
DA          0.8687   0.8523   0.9677   0.8237   9.74E-03
SPSO        0.8880   0.8770   0.9647   0.8531   9.74E-03
GPSO        0.9131   0.8976   0.9794   0.8830   9.74E-03
BBPSOV      0.8726   0.8605   0.9618   0.8320   9.74E-03
ACPSO       0.8803   0.8639   0.9706   0.8392   9.74E-03
Manual      0.8263   0.8255   0.9324   0.7780   9.74E-03

Table 11 Performance comparison for optimized EfficientNet-B0 models using the DFDC test set

Model       Acc.     Prec.    Recall   AUC      RS
Prop. PSO   0.9414   0.9848   0.9487   0.9161   n/a
PSO         0.8865   0.9741   0.8961   0.8534   2.16E-03
ABC         0.8941   0.9768   0.9022   0.8661   2.16E-03
BBPSO       0.8686   0.9610   0.8881   0.8009   2.16E-03
FPA         0.8833   0.9721   0.8943   0.8452   2.16E-03
SSA         0.8778   0.9682   0.8918   0.8294   2.16E-03
SSO         0.8844   0.9734   0.8943   0.8500   2.16E-03
FA          0.8757   0.9668   0.8906   0.8239   2.16E-03
DA          0.8724   0.9642   0.8894   0.8136   2.16E-03
SPSO        0.8822   0.9714   0.8936   0.8425   2.16E-03
GPSO        0.9050   0.9784   0.9132   0.8765   2.16E-03
BBPSOV      0.8800   0.9689   0.8936   0.8327   2.16E-03
ACPSO       0.8817   0.9702   0.8943   0.8379   2.16E-03
Manual      0.8475   0.9484   0.8759   0.7486   2.16E-03

Table 12 Performance comparison for optimized EfficientNet-B0 models using the combined test set

Model       Acc.     Prec.    Recall   AUC      RS
Prop. PSO   0.9576   0.9852   0.9628   0.9484   n/a
PSO         0.9262   0.9758   0.9332   0.9137   2.16E-03
ABC         0.9292   0.9774   0.9354   0.9181   2.16E-03
BBPSO       0.9218   0.9569   0.9473   0.8761   2.16E-03
FPA         0.9224   0.9736   0.9307   0.9075   2.16E-03
SSA         0.9156   0.9708   0.9249   0.8988   2.16E-03
SSO         0.9233   0.9743   0.9311   0.9093   2.16E-03
FA          0.9289   0.9634   0.9495   0.8921   2.16E-03
DA          0.9239   0.9591   0.9477   0.8813   2.16E-03
SPSO        0.9197   0.9731   0.9278   0.9052   2.16E-03
GPSO        0.9319   0.9793   0.9368   0.9230   2.16E-03
BBPSOV      0.9164   0.9716   0.9253   0.9007   2.16E-03
ACPSO       0.9176   0.9723   0.9260   0.9027   2.16E-03
Manual      0.9049   0.9464   0.9372   0.8471   2.16E-03

As depicted in Tables 10, 11 and 12, the proposed PSO-based EfficientNet-B0 models achieve better results than those with optimal learning parameters obtained using other classical and advanced search methods for all three datasets, in terms of mean precision, recall, accuracy and AUC scores as well as statistical test results. In addition, models with settings yielded by GPSO and ABC show better results than those of the networks optimized by all other baseline methods across the datasets.

Fig. 8 ROC curves for Celeb-DFv2 using EfficientNet-B0 models with manual and optimal hyperparameters identified by all search methods

Figures 8, 9 and 10 illustrate the Receiver Operating Characteristic (ROC) curves of the devised models by all search methods for the three datasets, respectively. The discriminative capabilities of the proposed PSO-optimized EfficientNet-B0 models are also indicated by the AUC score comparison, as depicted in Figs. 8, 9 and 10. Our optimized EfficientNet-B0 models obtain the best AUC scores in all test cases. The superiority of the proposed PSO-based EfficientNet-B0 models is further ascertained by the Wilcoxon rank sum test results based on the AUC


scores (see the last columns in Tables 10, 11 and 12), which are all lower than 0.05 for all three test sets. This indicates that our optimized EfficientNet-B0 models outperform those devised by other search methods with statistical significance.

Fig. 9 ROC curves for DFDC using EfficientNet-B0 models with manual and optimal hyperparameters identified by all search methods

Fig. 10 ROC curves for the combined dataset using EfficientNet-B0 models with manual and optimal hyperparameters identified by all search methods

The mean optimized hyperparameters over 5 runs for EfficientNet-B0 using the sampled combined dataset are provided in Table 13. The hyperparameters yielded by different search methods are analysed to explain performance variations in the optimized networks. In addition, we visualize the effects of different optimized dropout rates in Figure 11, which shows accuracy rates of the three test sets (on the y-axis) along with the dropout hyperparameters (on the x-axis) identified by each search method using the sampled combined training set. We use a specific shape and colour symbol to represent each search method.

Table 13 Mean results of the optimal hyperparameters identified using each search method for EfficientNet-B0
Model        Learning rate  Dropout  Size  No. of frames
Prop. PSO    0.0001810      0.5633   119   36
PSO          0.0003550      0.79     125   40
ABC          0.0001203      0.6512   121   35
BBPSO        0.0002273      0.1026   119   28
FPA          0.0002795      0.3566   117   35
SSA          0.0004570      0.2692   116   29
SSO          0.0003918      0.5266   118   38
FA           0.0004500      0.2454   114   32
DA           0.0005000      0.3635   115   32
SPSO         0.0002420      0.3326   116   33
GPSO         0.0001538      0.4671   126   37
BBPSOV       0.0002951      0.2985   115   34
ACPSO        0.0003626      0.3563   117   32
Manual       0.0001         0.3      112   30

Fig. 11 Accuracy rates of the three test sets (on the y-axis) for optimized EfficientNet-B0 along with the dropout hyperparameters (on the x-axis) identified by each search method based on the sampled combined dataset (The results of the three test sets, i.e. the Celeb-DFv2, DFDC and combined test sets, for each search method are represented by a unique shape and colour symbol.)

As indicated in Table 13, the proposed PSO-, GPSO- and ABC-based EfficientNet-B0 models outperform those with learning configurations obtained by other search methods for the three test sets. As shown in Table 13 and Figure 11, these models are equipped with moderate mean learning rates and moderate or slightly higher mean dropout rates. Such settings are able to deploy steady magnitude updates to the learning mechanism and produce efficient sparse network representations with effective discriminative feature learning capabilities to minimize redundancy. Significantly large or small settings of dropout rates are identified by PSO and BBPSO, respectively, as indicated in Figure 11. These may lead to the switching off of too many or too few neurons, which may in turn result in the extraction of inadequate or noisy spatial features.
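As a rough illustration of how the dropout rate controls the share of neurons that is switched off, the sketch below samples a dropout mask for three of the mean dropout rates listed in Table 13. The layer width, the random seed and the helper name `dropout_mask` are assumptions made only for this example; the sketch does not reproduce the networks used in this research.

```python
import random

def dropout_mask(num_units, rate, rng):
    """Return a 0/1 mask where each unit is switched off with probability `rate`."""
    return [0 if rng.random() < rate else 1 for _ in range(num_units)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
num_units = 10000       # hypothetical layer width, chosen only for illustration

# Mean dropout rates taken from Table 13 for PSO, the proposed PSO and BBPSO.
rates = {"PSO": 0.79, "Prop. PSO": 0.5633, "BBPSO": 0.1026}

dropped_fraction = {}
for name, rate in rates.items():
    mask = dropout_mask(num_units, rate, rng)
    dropped_fraction[name] = 1 - sum(mask) / num_units
    print(f"{name}: {dropped_fraction[name]:.3f} of units switched off")
```

With a rate near 0.79 roughly four in five units are removed at each training step, whereas a rate near 0.10 leaves the layer almost intact, which is the contrast the paragraph above draws between PSO and BBPSO.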


Moreover, as illustrated in Table 13, large mean learning rates are produced by PSO, SSA, SSO, FA, DA and ACPSO, which may generate large learning magnitudes to cause fluctuations in loss space. A small learning rate is used in combination with a small dropout rate for the manual setting, resulting in inadequate gradient descent updates with redundant network topologies, limiting its performance.
As discussed in Section 4.1.2, moderate settings of image resolutions and number of frames may lead to effective representations of video inputs while having an optimal computational cost. In contrast, significantly large numbers of frames and image resolutions may result in high computational cost as well as network overfitting by capturing noisy irrelevant details. In our experiments, 30-40 frames are recommended in most cases, which are able to capture sufficient RGB and motion details as ascertained by the empirical results. Image resolutions of 115-126 are also mostly selected to balance between performance and computational cost.

4.2.2 Automated hyperparameter search for EfficientNet-GRU

We also employ the same experimental settings for hyperparameter search using the EfficientNet-GRU model. A set of 5 runs is performed for hyperparameter search for each search algorithm. The optimized settings obtained by each search method are used to establish the final models, which are trained using the training set of the combined dataset with a larger number of epochs. These optimized networks are then evaluated using test sets of Celeb-DFv2, DFDC and the combined datasets, respectively. Moreover, the Wilcoxon rank sum test is also performed based on the AUC scores over 5 runs to indicate the statistical significance of our optimized model over those with optimal settings yielded by other search methods. The detailed evaluation and statistical test results for our optimized EfficientNet-GRU models against other devised networks are provided in Tables 14, 15 and 16.

Table 14 Performance comparison for optimized EfficientNet-GRU models using the Celeb-DFv2 test set
Model        Acc.    Prec.   Recall  AUC     RS
Prop. PSO    0.9382  0.9208  0.9918  0.9141  n/a
PSO          0.9054  0.8839  0.9853  0.8691  7.94E-03
ABC          0.8784  0.8635  0.9676  0.8378  7.94E-03
BBPSO        0.8822  0.8681  0.9677  0.8434  7.94E-03
FPA          0.8938  0.8740  0.9794  0.8549  7.94E-03
SSA          0.8861  0.8707  0.9706  0.8477  7.94E-03
SSO          0.9131  0.9019  0.9735  0.8856  7.94E-03
FA           0.8977  0.8806  0.9765  0.8618  7.94E-03
DA           0.8996  0.8750  0.9882  0.8593  7.94E-03
SPSO         0.9116  0.8927  0.9853  0.8775  7.94E-03
GPSO         0.9189  0.9049  0.9794  0.8914  7.94E-03
BBPSOV       0.9093  0.9081  0.9588  0.8867  7.94E-03
ACPSO        0.9209  0.9052  0.9824  0.8929  7.94E-03
Manual       0.8417  0.8342  0.9471  0.7938  7.94E-03

Table 15 Performance comparison for optimized EfficientNet-GRU models using the DFDC test set
Model        Acc.    Prec.   Recall  AUC     RS
Prop. PSO    0.9517  0.9886  0.9566  0.9346  n/a
PSO          0.8849  0.9804  0.8881  0.8737  7.94E-03
ABC          0.8806  0.9784  0.8851  0.8649  7.94E-03
BBPSO        0.8833  0.9804  0.8863  0.8728  7.94E-03
FPA          0.8936  0.9800  0.8985  0.8765  7.94E-03
SSA          0.8931  0.9806  0.8973  0.8783  7.94E-03
SSO          0.9034  0.9841  0.9059  0.8947  7.94E-03
FA           0.8860  0.9741  0.8955  0.8531  7.94E-03
DA           0.8952  0.9676  0.9126  0.8349  7.94E-03
SPSO         0.9023  0.9834  0.9053  0.8919  7.94E-03
GPSO         0.9083  0.9854  0.9102  0.9017  7.94E-03
BBPSOV       0.8947  0.9788  0.9010  0.8728  7.94E-03
ACPSO        0.9370  0.9792  0.9493  0.8945  7.94E-03
Manual       0.8675  0.9703  0.8778  0.8321  7.94E-03

As shown in Tables 14, 15 and 16, the proposed PSO-based EfficientNet-GRU models obtain better performances than those of the counterparts generated by all the baseline search methods in terms of all the evaluation metrics (i.e. precision, recall, accuracy, AUC and Wilcoxon rank sum test results) for all three test sets. In addition, models with hyperparameters produced by ACPSO, GPSO and SSO obtain better mean accuracy rates and AUC scores than the results of those with learning configurations selected by other baselines in most test cases. The ROC curves of the optimized EfficientNet-GRU models derived by different search methods for the three test sets are illustrated in Figs. 12, 13 and 14, where our optimized models show better AUC scores than those of the networks yielded by other search methods for all test scenarios. The significance of our optimized EfficientNet-GRU models is further ascertained by the Wilcoxon rank sum statistical test results. As illustrated in the last columns in Tables 14, 15 and 16, all the statistical test results are lower than 0.05 for the three test sets, which indicates the statistical superiority of our optimized networks against those yielded by other search methods.
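As a hedged illustration of where an RS value such as 7.94E-03 can come from, the sketch below computes an exact two-sided Wilcoxon rank sum p-value by enumerating all possible rank assignments. The AUC values are hypothetical placeholders rather than results from this study, the function name `rank_sum_exact_p` is introduced only for this example, and the use of the exact two-sided variant is an assumption, since the text does not state which variant of the test was applied. With 5 runs per method and every AUC score of one method above every AUC score of the other, the p-value equals 2/C(10, 5) = 2/252, which is the smallest value this test can return for 5 versus 5 samples.

```python
from itertools import combinations
from math import comb

def rank_sum_exact_p(x, y):
    """Exact two-sided Wilcoxon rank sum p-value (assumes no tied values)."""
    pooled = sorted(x + y)
    rank = {v: i + 1 for i, v in enumerate(pooled)}  # ranks 1..n+m
    n, total = len(x), len(pooled)
    observed = sum(rank[v] for v in x)
    expected = n * (total + 1) / 2  # mean rank sum of x under the null hypothesis
    deviation = abs(observed - expected)
    # Count the rank subsets of size n that deviate at least as much as observed.
    extreme = sum(
        1
        for subset in combinations(range(1, total + 1), n)
        if abs(sum(subset) - expected) >= deviation
    )
    return extreme / comb(total, n)

# Hypothetical AUC scores over 5 runs (illustrative values only).
proposed = [0.914, 0.916, 0.912, 0.918, 0.915]
baseline = [0.869, 0.871, 0.866, 0.873, 0.870]

p_value = rank_sum_exact_p(proposed, baseline)
print(f"p = {p_value:.2E}")
```

Because the enumeration is exact, the sketch is only practical for small numbers of runs; for larger samples a normal approximation would normally be used instead.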


Table 16 Performance comparison for optimized EfficientNet-GRU models using the combined test set
Model        Acc.    Prec.   Recall  AUC     RS
Prop. PSO    0.9695  0.9890  0.9737  0.9620  n/a
PSO          0.9307  0.9818  0.9329  0.9268  7.94E-03
ABC          0.9239  0.9846  0.9217  0.9278  7.94E-03
BBPSO        0.9253  0.9839  0.9242  0.9274  7.94E-03
FPA          0.9304  0.9818  0.9325  0.9266  7.94E-03
SSA          0.9265  0.9835  0.9260  0.9275  7.94E-03
SSO          0.9381  0.9860  0.9383  0.9377  7.94E-03
FA           0.9253  0.9730  0.9350  0.9080  7.94E-03
DA           0.9342  0.9684  0.9509  0.9044  7.94E-03
SPSO         0.9348  0.9826  0.9372  0.9306  7.94E-03
GPSO         0.9363  0.9845  0.9372  0.9347  7.94E-03
BBPSOV       0.9310  0.9753  0.9397  0.9153  7.94E-03
ACPSO        0.9352  0.9793  0.9408  0.9249  7.94E-03
Manual       0.9126  0.9714  0.9206  0.8983  7.94E-03

Fig. 12 ROC curves for Celeb-DFv2 using EfficientNet-GRU models with manual and optimal hyperparameters identified by all the search methods

Fig. 13 ROC curves for DFDC using EfficientNet-GRU models with manual and optimal hyperparameters identified by all the search methods

Fig. 14 ROC curves for the combined dataset using EfficientNet-GRU models with manual and optimal hyperparameters identified by all the search methods

The mean hyperparameters obtained by each search method over 5 runs using the sampled training and validation sets of the combined dataset are shown in Table 17. Some further analysis of these optimized hyperparameters is provided below to explain model performance variations. As indicated in Table 17 and Tables 14, 15 and 16, pertaining to selected hyperparameter settings and detailed evaluation results, respectively, the proposed PSO algorithm, ACPSO, GPSO and SSO have obtained comparatively moderate mean learning rate and dropout rate configurations over a set of 5 runs, in comparison with those obtained by other search methods. Such moderate mean learning rates show great efficiency in extracting knowledge in a new domain by deploying reasonable magnitudes of learning updates. The identified mean dropout rate settings reduce redundancy by switching off reasonable numbers of neurons, while generating effective discriminative video representations. Among the baseline methods, ABC, BBPSO, SSA, FA, DA and BBPSOV-based networks are equipped with large learning rates, resulting in the employment of large magnitudes for the learning updates to cause instability. ABC, BBPSO and BBPSOV also produce large mean dropout rates which lead to the elimination of large numbers of neurons, thus resulting in discarding important spatial–temporal cues. Moreover, smaller mean learning rates are identified by PSO and FPA, as well as adopted in the manual setting, which may yield insufficient momentum for network upgrading towards global optima, lowering network performance.
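The qualitative effect of the step size described above can be illustrated with a toy gradient descent on f(w) = w^2. This is only a sketch under strong simplifying assumptions: the objective, the function name `gradient_descent` and the three step sizes are hypothetical and sit on a completely different scale from the learning rates in Table 17, so the numbers say nothing about the actual networks.

```python
def gradient_descent(lr, steps=20, w0=1.0):
    """Minimise f(w) = w**2 (gradient 2*w) and return the weight after each step."""
    w, history = w0, [w0]
    for _ in range(steps):
        w -= lr * 2 * w
        history.append(w)
    return history

# Hypothetical step sizes for the toy problem only.
too_small = gradient_descent(lr=0.001)  # barely moves away from the start
moderate = gradient_descent(lr=0.3)     # converges quickly towards the minimum
too_large = gradient_descent(lr=1.1)    # overshoots and oscillates with growing size

print(abs(too_small[-1]), abs(moderate[-1]), abs(too_large[-1]))
```

In this toy setting the overly small step leaves the weight close to its starting value after 20 updates, while the overly large step makes each update overshoot the minimum, mirroring the insufficient progress and instability discussed for the small and large learning rates.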


Table 17 Mean results of the optimal hyperparameters identified using each search method for EfficientNet-GRU
Model        Learning rate  Dropout  Size  No. of frames
Prop. PSO    0.0002333      0.5237   118   37
PSO          0.0001080      0.3832   117   37
ABC          0.0003520      0.6224   105   27
BBPSO        0.0003312      0.6897   113   30
FPA          0.0001108      0.3237   116   36
SSA          0.0003292      0.3816   106   29
SSO          0.0001848      0.4641   121   39
FA           0.0004300      0.3341   116   35
DA           0.0005000      0.3272   115   32
SPSO         0.0002499      0.3687   119   38
GPSO         0.0002258      0.4335   125   34
BBPSOV       0.0003010      0.6057   117   39
ACPSO        0.0001766      0.4478   123   35
Manual       0.0001         0.3      112   40

Fig. 15 Accuracy rates of the three test sets (on the y-axis) for optimized EfficientNet-GRU along with the dropout hyperparameters (on the x-axis) identified by each search method based on the sampled combined dataset (The results of the three test sets, i.e. the Celeb-DFv2, DFDC and combined test sets, for each search method are represented by a unique shape and colour symbol.)

In particular, Figure 15 shows the accuracy rates of the EfficientNet-GRU models for three test sets (on the y-axis) with manual and optimal dropout rates identified by each search method (on the x-axis). As mentioned above, high accuracy rates are correlated with the moderate settings of dropout rates, which are preferred by the proposed PSO algorithm, ACPSO, GPSO and SSO. Significantly larger dropout rate configurations, such as those obtained by ABC, BBPSO and BBPSOV, reduce the model performance by constraining network representations considerably, while excessively small settings of the dropout rates, recommended by FPA, FA and DA, may lead to redundant network structures with limited flexibility in tackling overfitting.
Similar to the findings of the previous experiments using EfficientNet-B0, most search methods select 30-39 frames, leading to the capture of sufficient spatial–temporal patterns using EfficientNet-GRU, while avoiding overfitting. Image resolutions of 113-125 are mostly identified to achieve reliable classification performance while maintaining efficient computational cost.
In short, when the hyperparameters obtained using the proposed PSO method are used in network evaluation, both optimized EfficientNet-B0 and EfficientNet-GRU models show improvement over those with manual and optimal settings yielded by other search methods. This is owing to the efficient search capabilities of the proposed PSO algorithm by integrating composite leaders and reinforcement learning-based search strategy selection in identifying optimal hyperparameters in a multi-dimensional complex search space with challenging high intra-class and low inter-class variations.
We analyse the search behaviours of each algorithm below. ABC explores the search space by following randomly selected leader individuals and therefore shows a slow convergence rate to reach global optimality. SSA simulates the salp chain behaviours by adopting the mean position of a neighbouring follower salp and the current individual for movement update. Owing to the adoption of neighbouring solutions for search exploration, SSA is more likely to converge prematurely and requires a significant number of iterations to obtain competitive performance. Also, a random threshold is used in SSA to determine the respective random walk action for updating the leading salp, instead of using a more informative selection scheme, therefore limiting its performance.
SSO generates a dynamic leader signal for the position update of each search agent by using the strongest and randomly selected vibration intensities, but the model only employs a single random walk mechanism for position update. FA employs neighbouring fitter solutions for search exploration while DA adopts separation, alignment, cohesion, attraction and evading actions for movement update. But both algorithms use single search strategies for search space exploration. Similarly, PSO, BBPSO, SPSO with sine coefficients and GPSO with genetic operators also mainly employ monotonous search operations guided by the swarm leader for position update. When the single search actions in the aforementioned models become stagnant, there are no substitute search operations available to reactivate sudden movements of the search agents to mitigate premature stagnation.
To overcome the above limitations, BBPSOV and ACPSO adopt two or multiple search mechanisms to better manage local optima traps. For example, attractiveness and evading action are exploited in BBPSOV, while ACPSO employs three subswarms guided by the PSO operations


with adaptive sine, circle and spiral coefficients, respectively. FPA uses search actions led by either the randomly selected individuals or the swarm leader with Levy search coefficients. The multiple search actions in the above algorithms are mostly randomly selected or performed in sequential orders without the guidance of more informative selection principles.
In comparison with the above search algorithms, a reinforcement learning algorithm is employed in the proposed PSO variant to generate a more informed strategy to identify the optimal selection of different search actions for each particle. Such Q-learning-based search deployment governed by the Bellman equation empowers the search process with bespoke particle behaviours to explore the search space effectively while accelerating convergence. On top of this, search operations guided by multiple composite leaders yielded by distinctive nonlinear functions are also used to divert the search process if the action led by the swarm leader becomes stagnant. The above analysis has been further evidenced by the evaluation and statistical test results in our empirical studies.

4.3 Comparison with other hybrid networks and 3D CNNs

We conduct performance comparison between our optimized EfficientNet-B0 and EfficientNet-GRU models and other networks including ResNet50-GRU, ResNet101-GRU, GoogLeNet-GRU, as well as 3D CNNs such as Inflated-3D (I3D), a Mixed Convolution Network (MC3), 3D ResNeXt101 and 3D ResNeXt50, using the Celeb-DFv2, DFDC and combined datasets, respectively. These baseline state-of-the-art networks are selected because of their significant discriminative capabilities in video classification [9, 42, 60, 67]. Similar to our work, for the hybrid networks, i.e. ResNet50-GRU, ResNet101-GRU and GoogLeNet-GRU, the respective CNN (ResNet50, ResNet101 and GoogLeNet) models are pretrained using ImageNet and their successive GRU models are trained from scratch using our combined deepfake training set.
In addition, for 3D CNNs, 3D convolutions instead of 2D convolutions are used for feature learning, except for MC3 where mixed convolutions are used. Precisely, MC3 employs 3D convolutions in the first two groups and 2D convolutions from group 3 onwards [68]. 3D ResNeXt101 and 3D ResNeXt50 are variants of 3D ResNet, which introduce a cardinality parameter to control the number of parallel paths within each residual block [69]. All the above 3D CNNs are pre-trained using a large human action dataset, i.e. Kinetics, consisting of 306,245 videos from 400 classes, then fine-tuned using the combined deepfake training set for real/manipulated video classification.
All the selected hybrid and 3D CNN baseline networks are equipped with the following learning settings, i.e. learning rate = 0.0001, dropout rate = 0.3, image size = 112 and number of frames = 40. Tables 18, 19 and 20 show the detailed evaluation results, while Figs. 16, 17 and 18 illustrate respective ROC curves, for the Celeb-DFv2, DFDC and combined test sets, respectively.
As indicated in Tables 18, 19 and 20 and Figs. 16, 17 and 18, the proposed PSO-optimized EfficientNet-B0 and EfficientNet-GRU models show competitive performances as compared with those of the aforementioned three hybrid and four 3D CNN models, across datasets. The proposed optimizer employs multiple composite elite signals and Q-learning-based search operation allocation with bespoke search behaviours of each particle, to boost model capabilities in tackling stagnation. Our optimized networks are thus equipped with better learning settings and show enhanced capabilities in spatial–temporal feature learning for video forgery classification. In addition, ResNet101-GRU shows better results than those of ResNet50-GRU and GoogLeNet-GRU, because of its more effective feature learning capabilities using deeper residual blocks. Among the 3D CNNs, I3D shows greater discriminative capabilities by inflating all the filters and pooling kernels in a 2D architecture through the insertion of an additional temporal dimension and thus achieves the most reliable performance. It outperforms MC3, 3D ResNeXt101 and 3D ResNeXt50 for most test scenarios.

Table 18 Performance comparison with hybrid networks and 3D CNNs for the Celeb-DFv2 test set
Model                        Acc.    Prec.   Recall  AUC
Prop. PSO-based Effnet-GRU   0.9382  0.9208  0.9912  0.9141
Prop. PSO-based Effnet       0.9247  0.9101  0.9824  0.8985
ResNet50-GRU                 0.7876  0.7556  1.0000  0.6910
ResNet101-GRU                0.8494  0.8359  0.9588  0.7996
GoogLeNet-GRU                0.8012  0.7699  0.9941  0.7134
I3D                          0.8494  0.8639  0.9147  0.8197
MC3                          0.8514  0.8433  0.9500  0.8065
3D ResNeXt101                0.8378  0.8249  0.9559  0.7841
3D ResNeXt50                 0.7703  0.7826  0.9000  0.7112
Manual Effnet-GRU            0.8417  0.8342  0.9471  0.7938
Manual Effnet                0.8263  0.8255  0.9324  0.7780

The confusion matrices of the proposed PSO-optimized EfficientNet-B0 and EfficientNet-GRU models with respect to the Celeb-DFv2, DFDC and combined test sets are provided in Figs. 19 and 20, respectively. The scikit-learn library in Python is used to


generate these confusion matrix results, along with those for other evaluation metrics (i.e. accuracy, precision, recall and AUC scores) in this research.

Table 19 Performance comparison with hybrid networks and 3D CNNs for the DFDC test set
Model                        Acc.    Prec.   Recall  AUC
Prop. PSO-based Effnet-GRU   0.9517  0.9886  0.9566  0.9346
Prop. PSO-based Effnet       0.9414  0.9848  0.9487  0.9161
ResNet50-GRU                 0.9289  0.9547  0.9658  0.8008
ResNet101-GRU                0.8127  0.9806  0.8050  0.8394
GoogLeNet-GRU                0.9164  0.9603  0.9450  0.8172
I3D                          0.8969  0.9726  0.9095  0.8528
MC3                          0.8865  0.9672  0.9028  0.8300
3D ResNeXt101                0.8882  0.9643  0.9077  0.8204
3D ResNeXt50                 0.8806  0.9603  0.9028  0.8033
Manual Effnet-GRU            0.8675  0.9703  0.8778  0.8321
Manual Effnet                0.8475  0.9484  0.8759  0.7486

Table 20 Performance comparison with hybrid networks and 3D CNNs for the combined test set
Model                        Acc.    Prec.   Recall  AUC
Prop. PSO-based Effnet-GRU   0.9695  0.9890  0.9737  0.9620
Prop. PSO-based Effnet       0.9576  0.9852  0.9628  0.9484
ResNet50-GRU                 0.9209  0.9606  0.9422  0.8827
ResNet101-GRU                0.8865  0.9846  0.8755  0.9063
GoogLeNet-GRU                0.9339  0.9626  0.9567  0.8932
I3D                          0.9369  0.9765  0.9459  0.9209
MC3                          0.9393  0.9655  0.9603  0.9016
3D ResNeXt101                0.9470  0.9592  0.9769  0.8934
3D ResNeXt50                 0.9120  0.9541  0.9379  0.8656
Manual Effnet-GRU            0.9126  0.9714  0.9206  0.8983
Manual Effnet                0.9049  0.9464  0.9372  0.8471

Fig. 16 ROC curve comparison between our optimized models and hybrid networks and 3D CNNs for the Celeb-DFv2 test set

Fig. 17 ROC curve comparison between our optimized models and hybrid networks and 3D CNNs for the DFDC test set

Since all the search methods employ the same number of function evaluations for hyperparameter search, with deep learning-based fitness function evaluation as the most costly process, these methods have a similar overall cost for optimal parameter selection at the training stage. We provide the average cost of each algorithm with one function evaluation over 5 runs for computational efficiency comparison. Such a mean cost for each trial is calculated by dividing the cost for hyperparameter search by the number of function evaluations performed. This includes the cost of dedicated search operations embedded in each algorithm along with one fitness evaluation of the recommended hyperparameters using either EfficientNet-B0 or EfficientNet-GRU. The cost variations are mainly caused by different search principles operated in the search methods. Table 21 depicts the detailed cost comparison.
We conduct the computational cost comparison using an NVIDIA RTX 3090 consumer GPU. As indicated in Table 21, the proposed model shows moderate mean computational costs per function evaluation over 5 runs for both networks. ABC, SSA, FA, FPA and BBPSO have lower mean computational costs owing to their comparatively simpler search strategies by following randomly selected (ABC), neighbouring (SSA and FA) and global best (BBPSO and FPA) solutions, respectively. DA employs a search action combining separation, alignment, cohesion, attraction and evading mechanisms, also showing relatively light computational costs. SSO performs vibration intensity-based actions with dynamic leader


generation, resulting in slightly higher but reasonable costs. In contrast, BBPSOV and ACPSO have the highest computational costs due to diverse embedded search actions, e.g. evading/attraction-inspired mechanisms in BBPSOV, and three subswarms with adaptive sine, circle and spiral search coefficients in ACPSO. The proposed model, GPSO and SPSO show moderate costs because of the deployment of reinforcement learning-based action selection and hybrid leader generation in our model, crossover and mutation operations in GPSO, and adaptive sine-based search coefficients in SPSO, respectively. The optimal configurations identified by each method in the training process are used to construct the final network at the test stage for performance comparison.

Fig. 18 ROC curve comparison between our optimized models and hybrid networks and 3D CNNs for the combined test set

We also compare our optimized networks with other existing studies for Celeb-DFv2 and DFDC datasets. The selected existing studies were evaluated using the official Celeb-DFv2 test set and the DFDC subset, respectively, as performed in our experiments. But since these related studies were trained using a variety of deepfake training databases for evaluating both datasets, they are used for loose performance comparison. Tables 22 and 23 illustrate the performance comparison with state-of-the-art existing studies using the Celeb-DFv2 and DFDC datasets, respectively.
We used the official split to evaluate the Celeb-DFv2 dataset. Specifically, for the official Celeb-DFv2 test set, the selected existing studies shown in Table 22 obtained accuracy rates ranging from 0.8074 to 0.8989 and AUC scores ranging from 0.696 to 0.9003. Our transfer learning using EfficientNet-B0 with the proposed PSO-based hyperparameter fine-tuning obtained a competitive benchmark with an accuracy rate of 0.9247 and an AUC score of 0.8985, while EfficientNet-GRU with the proposed PSO-based hyperparameter selection achieved a better accuracy rate of 0.9382 and a better AUC result of 0.9141. Both of our models outperform most of these related studies by a sufficient margin.
Table 23 shows the comparison between our optimized networks and existing studies for the DFDC test set. Again, our optimized EfficientNet-GRU and EfficientNet-B0 models with the proposed PSO-based hyperparameter fine-tuning obtain more reliable performance in comparison with those of existing studies. Specifically, the EfficientNet-B0 model with the proposed PSO-based hyperparameter selection achieves a competitive mean accuracy rate of 0.9414 and AUC score of 0.9161, and our optimized EfficientNet-GRU obtains a better mean accuracy rate of 0.9517 and a better AUC score of 0.9346.
In short, owing to the efficient search capabilities of the proposed PSO model guided by composite leaders and optimized search strategies governed by the fitness evaluations, our optimized networks outperform existing studies for both Celeb-DFv2 and DFDC datasets and show great

Fig. 19 Confusion matrices of the proposed PSO-optimized EfficientNet-B0 for the Celeb-DFv2 (left), DFDC (middle) and combined (right) test
sets, respectively

Fig. 20 Confusion matrices of the proposed PSO-optimized EfficientNet-GRU for the Celeb-DFv2 (left), DFDC (middle) and combined (right)
test sets, respectively
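The link between confusion matrices such as those in Figs. 19 and 20 and the accuracy, precision, recall and AUC scores reported in this section can be sketched as follows. This is a pure-Python illustration rather than the scikit-learn calls used in this research; the labels and scores are hypothetical, the helper names are introduced only for this example, and treating the manipulated class as the positive class (label 1) with a 0.5 decision threshold is an assumption.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def auc_from_scores(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = manipulated, 0 = real) and predicted probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.6, 0.2, 0.95]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
auc = auc_from_scores(y_true, scores)
print(tp, fp, fn, tn, accuracy, precision, recall, auc)
```

Accuracy, precision and recall depend on the chosen threshold, whereas the AUC is computed from the ranking of the scores alone, which is why the two kinds of metric can move independently across the tables.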


efficiency in tackling challenging manipulated video classification tasks. They can be deployed as effective substitute methods for deepfake content identification. In addition, our work also highlights the importance of hyperparameter selection in deep learning networks and the potential use of evolutionary algorithms in such tasks for video deepfake classification.

Table 21 Mean computational costs (in seconds) of each search method with one function evaluation for hyperparameter search
Model        EfficientNet-B0  EfficientNet-GRU
Prop. PSO    268.1408         275.4205
PSO          184.0033         205.9014
ABC          80.0875          83.0909
BBPSO        132.3337         160.4870
FPA          147.4523         180.2467
SSA          77.3659          136.9845
SSO          184.8069         188.9568
FA           142.7953         153.1096
DA           149.0886         173.9974
SPSO         238.7975         265.3025
GPSO         243.6370         257.8900
BBPSOV       269.7608         279.9356
ACPSO        360.4934         378.6980

5 Visualization using gradient-weighted class activation mapping

Various evolutionary and sensitivity-based methods are proposed for feature selection [19, 83–85]. Owing to the large feature dimensions extracted from multiple video frames, in this research, we visualize the contributions of different convolutional and spatial features using class-discriminative heatmaps, to indicate the effectiveness of our optimized networks for discriminative feature extraction. Specifically, we generate heatmaps using gradient-weighted class activation mapping (Grad-CAM) [86] for both the proposed EfficientNet serialized with GRU and the proposed PSO-optimized EfficientNet to indicate their effectiveness in feature learning. In particular, as indicated earlier, in the proposed EfficientNet-GRU model, all the convolutional layers of the ImageNet pre-trained EfficientNet are slightly fine-tuned using the combined training set with a small number of epochs (e.g. 5 epochs) in an attempt to improve their discriminative feature learning capabilities in the target domain.
First of all, since this research focuses on detection and classification of face swapping and facial re-enactment, MTCNN-based facial cropping is performed to extract the facial regions and eliminate background noise. Besides that, for both EfficientNet-B0 and EfficientNet-GRU models, as discussed in Sections 4.2.1 and 4.2.2, the proposed PSO and other search methods are used to optimize the number of video frames and image resolution sizes, along with learning and dropout rates, to maintain optimal cost

Table 22 Performance comparison for the Celeb-DFv2 test set
Model                       Methodology                             Accuracy  AUC
Hu [13]                     Two stream                              0.8074    –
Demir and Ciftci [14]       Biological signals (sequence-based)     0.8576    –
Demir and Ciftci [14]       Biological signals (video-based)        0.8835    –
Kandasamy et al. [70]       VGG19                                   0.8843    –
Kandasamy et al. [70]       ResNet                                  0.8932    –
Rossler et al. [3]          XceptionNet-Max                         0.8989    –
Haliassos et al. [71]       LipForensics                            –         0.824
Liu et al. [72]             SPSL (Xception as the backbone)         –         0.7688
Wang et al. [62]            MC-LCR                                  –         0.7161
Zheng et al. [59]           Temporal Coherence                      –         0.869
Wang et al. [73]            CNN-aug                                 –         0.756
Nguyen et al. [74]          Multi-task                              –         0.757
Chai et al. [75]            PatchForensics                          –         0.696
Masi et al. [76]            Two-branch LSTM                         –         0.7665
Tolosana et al. [77]        Facial element extraction               –         0.836
Zhao et al. [61]            PCL + I2G (ResNet-34 as the backbone)   –         0.9003
This research               Prop. PSO-based EfficientNet-GRU        0.9382    0.9141
This research               Prop. PSO-based EfficientNet            0.9247    0.8985


Table 23 Performance comparison for the DFDC test set
Model                       Methodology                             Accuracy  AUC
Li [78]                     XceptionNet + MIL                       0.8378    –
Li [78]                     XceptionNet + S-MIL-T                   0.8511    –
Zhang et al. [79]           TD-3DCNN                                0.8264    –
Wang et al. [62]            MC-LCR                                  0.702     0.7134
Güera and Delp [80]         RNN                                     0.6242    0.669
Hu et al. [81]              FInfer                                  0.6945    0.7039
Li et al. [82]              Face X-ray                              –         0.655
Zheng et al. [59]           Temporal Coherence                      –         0.74
Wang et al. [73]            CNN-aug                                 –         0.721
Shiohara and Yamasaki [60]  Self-blended data synthesis             –         0.7242
Song et al. [67]            CD-Net (Xception as the backbone)       –         0.783
Zhao et al. [61]            PCL + I2G (ResNet-34 as the backbone)   –         0.6752
This research               Prop. PSO-based EfficientNet-GRU        0.9517    0.9346
This research               Prop. PSO-based EfficientNet            0.9414    0.9161

while eliminating irrelevant noisy features to avoid overfitting.

To indicate the effectiveness of the proposed PSO-based EfficientNet model and the EfficientNet embedded in EfficientNet-GRU for feature learning, Grad-CAM [86] heatmaps with different colour schemes are used to visualize the importance of the extracted features from the cropped facial regions. Grad-CAM first calculates the gradient of the prediction score for a target class with respect to the extracted feature maps in the last convolutional layer. Global average pooling is then applied to these gradients to generate the weightings of the respective feature maps for the target class. The resulting importance weightings are subsequently multiplied with the respective feature maps. A summation operation followed by a ReLU activation function is performed on these weighted results to produce the Grad-CAM heatmaps. These localization maps indicate feature significance to class prediction using different colour schemes, with deep red indicating the most significant/relevant characteristics and deep blue the least influential factors for categorization. Therefore, such class-discriminative heatmaps are used in this research to visualize which image regions have the most influence on synthetic/original video classification. In comparison with class activation mapping (CAM), the Grad-CAM method can be applied to any CNN architecture even without re-training [86]. In addition, as mentioned above, in this research, to improve discriminative feature learning, we fine-tune all the convolutional layers of the ImageNet pre-trained EfficientNet-B0 embedded in the proposed EfficientNet-GRU model using the combined training set with a small number of epochs (i.e. 5 epochs), before feature extraction. Therefore, we generate Grad-CAM heatmaps for this EfficientNet-B0 model with light-weight fine-tuning embedded in the EfficientNet-GRU, to indicate its effectiveness in feature learning. Besides that, we also generate Grad-CAM heatmaps for the proposed PSO-optimized EfficientNet-B0 to demonstrate its capabilities in discriminative feature representation.

Figure 21 shows example original video frames (the first row), the respective manipulated video frames (the second row) and Grad-CAM heatmaps extracted from the lightly-tuned EfficientNet-B0 embedded in the EfficientNet-GRU (the third row), as well as the heatmaps generated by the proposed PSO-optimized EfficientNet-B0 (the fourth row), with respect to the manipulated image frames in the second row. Since these example video frames are taken from Celeb-DFv2, the face swap attack is performed in these deepfake examples. The inspection of the synthetic image frames in the second row in Figure 21 against the real image frames in the first row reveals the presence of vagueness and blurriness in the eye, nose, mouth or overall facial regions, as well as shape alterations of eye, nose and mouth elements. As indicated in existing studies, inconsistent shadowing/lighting/colour tone over faces, unnatural/inconsistent teeth/mouth/eye movements, misaligned teeth, distortions in eyebrows, facial hair and facial borders, double chins, non-circular pupils and other spatial inconsistencies are key factors for distinguishing manipulated images from real ones.

As visualized by the Grad-CAM heatmaps shown in the third and fourth rows in Figure 21, the most significant factors extracted by both the lightly-tuned EfficientNet (embedded in EfficientNet-GRU) and the proposed PSO-optimized EfficientNet model are mostly derived from these facial abnormality regions such as eye and mouth/teeth regions. These dominating features are represented by deep red heatmaps, emphasizing their importance to manipulated class prediction. As an example, the heatmaps extracted using the lightly-tuned EfficientNet model


Fig. 21 Example Grad-CAM heatmaps generated for manipulated samples (row 1: the original video frames, row 2: the respective manipulated frames, row 3: Grad-CAM heatmaps generated by the EfficientNet model with light-weight fine-tuning for the manipulated frames (in row 2) and row 4: Grad-CAM heatmaps generated using the proposed PSO-optimized EfficientNet model for the manipulated frames (in row 2))
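The Grad-CAM steps described in the text (gradients of the class score with respect to the last convolutional feature maps, global average pooling of those gradients, a weighted sum of the feature maps and a final ReLU) can be sketched in plain Python. This is only an illustrative sketch, not the authors' implementation: it assumes the feature maps and gradients have already been extracted from a network, and the function name `grad_cam` and the toy 2 x 2 inputs are hypothetical.

```python
def grad_cam(feature_maps, gradients):
    """Combine K feature maps (each H x W) into one Grad-CAM heatmap.

    feature_maps[k][i][j]: activation of map k at position (i, j).
    gradients[k][i][j]: gradient of the target class score w.r.t. that activation.
    """
    heatmap = None
    for fmap, grad in zip(feature_maps, gradients):
        height, width = len(grad), len(grad[0])
        # Global average pooling of the gradients gives the weight of this map.
        weight = sum(sum(row) for row in grad) / (height * width)
        if heatmap is None:
            heatmap = [[0.0] * width for _ in range(height)]
        # Weighted sum of the feature maps.
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += weight * fmap[i][j]
    # ReLU keeps only the regions with a positive influence on the target class.
    return [[max(0.0, value) for value in row] for row in heatmap]


# Toy input: two 2 x 2 feature maps with constant gradients (hypothetical values).
feature_maps = [[[1.0, 2.0], [3.0, 4.0]], [[4.0, 3.0], [2.0, 1.0]]]
gradients = [[[0.5, 0.5], [0.5, 0.5]], [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(feature_maps, gradients)
print(heatmap)
```

In this toy case only the bottom-right position keeps a positive value after the ReLU, which corresponds to the "deep red" end of the colour scheme once the map is upsampled and overlaid on the face crop.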

(embedded in EfficientNet-GRU) in the third row show high correlations to those manipulated facial regions such as blurred eye regions, and inconsistent shadowing/lighting/colour tone over faces. Built upon this, as shown in the last row in Figure 21, the proposed PSO-optimized EfficientNet-B0 method with substantial re-training is able to capture such abnormalities even better and strengthen the extraction of such important characteristics. For instance, in most cases, significant factors with respect to shadowy eye regions, unnatural pupils and iris borders are extracted by our optimized network. In addition, distortions in facial borders, misaligned teeth and double chins, which also have a vital influence on authenticity classification, are identified as well.

As indicated in the results in Figure 21, both the EfficientNet with light-weight fine-tuning and the proposed PSO-optimized EfficientNet-B0 model show great efficiency in capturing important discriminative features playing significant roles in synthetic video identification. In addition, the proposed PSO-optimized EfficientNet-B0 network adopts such extracted heatmaps for frame-level fake/real image classification. A mean average ensemble scheme is used to combine the frame-level predictions over a sequence of frames to determine the final video classification outcome. Moreover, in the proposed EfficientNet-GRU model, the most important spatial features in the heatmaps extracted using EfficientNet-B0 with light-weight fine-tuning are further strengthened by combining a sequence of such class-discriminative Grad-CAM maps from multiple frames. Such a sequence of discriminative heatmaps is then passed on to the GRU model for temporal feature learning to inform final video authenticity identification.

Besides the above, as indicated in Sections 4.2.1 and 4.2.2, the proposed PSO model is also used to optimize the number of frames and image resolution settings for both EfficientNet-GRU and EfficientNet-B0 networks, in order to extract salient features without capturing too many noisy, irrelevant or contradictory details, and thus to avoid overfitting. As an example, for EfficientNet-B0, a selection of 30–40 frames and image resolutions of 115–126 is recommended by the proposed PSO and other search methods owing to a good balance between performance and computational cost. Similarly, for EfficientNet-GRU, the proposed PSO model and other search methods recommend 30–39 frames with image resolutions of 113–125 for model training and testing to better tackle overfitting.

These optimized frame and image resolution settings, along with the effectiveness of the spatial–temporal features extracted by the EfficientNet-B0 with light-weight fine-tuning and the proposed PSO-optimized EfficientNet-B0 models, as evidenced in the Grad-CAM maps, lead to the capture of discriminative RGB and motion cues to achieve reliable video classification. Our optimized EfficientNet-B0 networks also outperform those generated by other search methods, as indicated by the empirical and statistical test results in Tables 10, 11 and 12. The efficiency of our optimized EfficientNet-GRU models is also ascertained by the experimental and statistical test results shown in Tables 14, 15 and 16.
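The mean average ensemble used to turn frame-level predictions into a video-level decision can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions: the per-frame fake probabilities are assumed to come from the frame-level classifier, and the 0.5 decision threshold and the name `classify_video` are assumptions made for illustration, not values given in the paper.

```python
def classify_video(frame_fake_probs, threshold=0.5):
    """Average the per-frame fake probabilities and threshold the mean.

    frame_fake_probs: one probability of the 'fake' class per sampled frame.
    threshold: assumed decision boundary (not specified in the source text).
    """
    mean_prob = sum(frame_fake_probs) / len(frame_fake_probs)
    label = "fake" if mean_prob >= threshold else "real"
    return label, mean_prob


# Three sampled frames of one video (hypothetical scores).
label, mean_prob = classify_video([0.9, 0.8, 0.4])
print(label, round(mean_prob, 3))
```

Averaging makes the video-level decision less sensitive to a single badly classified frame, which is one reason the number of sampled frames is itself treated as a hyperparameter.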


Table 24 Mean predictive entropy and mutual information scores for the proposed PSO-optimized networks
Model                       MPE_fake  MPE_real  MPE_all  Mutual Info
Prop. PSO-based Effnet-GRU  0.06335   0.10060   0.16400  0.05633
Prop. PSO-based Effnet      0.07149   0.11009   0.18159  0.06355

Fig. 22 Uncertainty estimation distributions in terms of the predictive entropy for both manipulated and real classes for the proposed PSO-optimized EfficientNet-GRU model using the combined test set

Fig. 23 Uncertainty estimation distributions in terms of the predictive entropy for both manipulated and real classes for the proposed PSO-optimized EfficientNet-B0 model using the combined test set

6 Uncertainty analysis

To indicate model effectiveness, we also conduct uncertainty analysis. We employ the Monte Carlo dropout (MCD) method for uncertainty quantification in this research. Specifically, we employ the MCD method to measure epistemic uncertainty, which is usually caused by the lack of training data. In other words, with more training data, such model uncertainty can be reduced. Before introducing MCD, we briefly discuss the traditional dropout method, which is only applied in the training stage by switching off some randomly selected neurons. No dropout operation is applied during testing, where all neurons are enabled. The dropout operation provides flexibility in model training and thus helps tackle overfitting. In contrast, in the MCD method, dropout is enabled during testing. This results in different dropout masks being deployed during the forward passes for result calculation. The resulting architectures can be regarded as Monte Carlo samples. As such, by using dropout during testing, each test sample is evaluated using different model architectures, and the result distributions of these test samples are used in this research for computing different uncertainty metrics.

Since we focus on a classification problem for video authenticity identification, as suggested by existing studies [87–89], we employ the predictive entropy and mutual information for uncertainty analysis. Equations 10–11 define the formulae of the predictive entropy [87].

$$\text{Entropy} = -\sum_{c=1}^{C} u_c \log(u_c) \tag{10}$$

$$u_c = \frac{1}{T}\sum_{i=1}^{T} P_c^{i} \tag{11}$$

where C denotes the number of predicted classes, with u_c representing the class-wise mean prediction probability. In addition, T denotes the number of MCD forward passes employed in our experiments.

The mutual information is formulated in Eq. 12.

$$\text{MI} = -\sum_{c=1}^{C} u_c \log(u_c) + \frac{1}{T}\sum_{c=1}^{C}\sum_{t=1}^{T} P((y \in c) \mid (x, w_t)) \log(P((y \in c) \mid (x, w_t))) \tag{12}$$

where MI indicates mutual information, while P((y ∈ c) | (x, w_t)) represents the softmax score for the input sample x belonging to class c with model parameters w_t.

We employ all test video samples, i.e. 3375 videos, from the combined test set for uncertainty analysis. Each sample is tested T = 20 times using the MCD method. Table 24 shows the mean predictive entropy (MPE) and mutual information scores over all test videos with respect to the EfficientNet and EfficientNet-GRU models with optimized settings obtained using the proposed PSO model, respectively. Specifically, MPE_fake, MPE_real and MPE_all in Table 24 denote the mean predictive entropy scores for the fake, real and both classes, with Mutual Info indicating the mean mutual information score for both classes, over all test samples.

As indicated in Table 24, both optimized networks obtain low entropy and mutual information scores, which
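The predictive entropy and mutual information of Eqs. 10–12 can be computed from the softmax outputs of the T stochastic forward passes as sketched below. This is an illustration under stated assumptions rather than the authors' code: the per-pass class probabilities are taken as given (how dropout is kept active at test time depends on the deep learning framework), and the function name `mc_dropout_uncertainty` is hypothetical.

```python
import math


def mc_dropout_uncertainty(pass_probs):
    """Return (predictive entropy, mutual information) for one test sample.

    pass_probs: list of T softmax vectors (each of length C), one per
    Monte Carlo dropout forward pass.
    """
    num_passes = len(pass_probs)
    num_classes = len(pass_probs[0])

    def p_log_p(p):
        # Convention: 0 * log(0) is treated as 0.
        return p * math.log(p) if p > 0.0 else 0.0

    # Eq. 11: class-wise mean prediction probability u_c.
    mean_probs = [sum(p[c] for p in pass_probs) / num_passes
                  for c in range(num_classes)]
    # Eq. 10: entropy of the mean prediction.
    entropy = -sum(p_log_p(u) for u in mean_probs)
    # Second term of Eq. 12: average of sum_c P log P over the T passes.
    expected_term = sum(p_log_p(p[c]) for p in pass_probs
                        for c in range(num_classes)) / num_passes
    mutual_info = entropy + expected_term
    return entropy, mutual_info


# Passes that agree: some entropy, (near) zero mutual information.
print(mc_dropout_uncertainty([[0.9, 0.1]] * 20))
# Passes that disagree completely: both scores equal log(2).
print(mc_dropout_uncertainty([[1.0, 0.0], [0.0, 1.0]]))
```

The two calls illustrate why both metrics are reported: entropy is non-zero whenever the averaged prediction is not one-hot, whereas mutual information only grows when the individual dropout passes disagree with each other.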

Table 25 Evaluation results for the benchmark functions with dimension = 30
Function Statistic Prop. PSO PSO ABC SSA SSO FA DA BBPSO FPA BBPSOV SPSO GPSO ACPSO

Ackley mean 7.43E-15 1.09E+01 5.91E-02 2.74E+00 2.13E+01 3.02E-02 1.93E+01 1.17E+01 1.22E+01 1.95E+01 1.30E+01 1.84E+01 1.12E+01
Ackley min 4.00E-15 9.22E+00 5.91E-02 2.74E+00 2.13E+01 3.02E-02 1.93E+01 6.62E-01 1.09E+01 1.95E+01 1.30E+01 1.84E+01 4.21E+00
Ackley max 7.55E-15 1.10E+01 5.91E-02 2.74E+00 2.13E+01 3.02E-02 1.93E+01 1.77E+01 1.42E+01 1.95E+01 1.30E+01 1.84E+01 1.79E+01
Ackley std 6.49E-16 3.22E-01 0.00E+00 4.52E-16 3.61E-15 0.00E+00 0.00E+00 3.62E+00 7.78E-01 3.61E-15 7.23E-15 1.45E-14 5.21E+00
Dixon mean 1.29E-01 1.26E+05 6.67E-01 7.29E-01 6.78E+06 1.41E+00 1.18E+03 4.30E+04 4.92E+03 2.09E+06 1.89E+04 3.58E+05 1.54E+05
Dixon min 1.29E-01 1.09E+05 6.67E-01 7.29E-01 6.78E+06 1.41E+00 1.18E+03 6.67E-01 1.27E+03 2.09E+06 1.89E+04 3.58E+05 1.21E+00
Dixon max 1.29E-01 1.27E+05 6.67E-01 7.29E-01 6.78E+06 1.41E+00 1.18E+03 3.52E+05 8.57E+03 2.09E+06 1.89E+04 3.58E+05 6.86E+05
Dixon std 5.65E-17 3.24E+03 2.00E-11 5.65E-16 1.89E-09 2.26E-16 0.00E+00 1.05E+05 1.83E+03 9.47E-10 1.11E-11 5.92E-11 2.28E+05
Griewank mean 2.28E-03 3.47E+00 2.94E-03 6.25E-05 1.30E+03 5.08E-03 9.40E+00 1.81E+01 2.91E+01 5.02E+02 6.12E+01 2.80E+02 4.66E+01
Griewank min 2.28E-03 3.20E+00 2.94E-03 6.25E-05 1.30E+03 5.08E-03 9.40E+00 2.22E-03 1.84E+01 5.02E+02 6.12E+01 2.80E+02 2.53E-02
Griewank max 2.28E-03 3.48E+00 2.94E-03 6.25E-05 1.30E+03 5.08E-03 9.40E+00 9.05E+01 3.70E+01 5.02E+02 6.12E+01 2.80E+02 3.18E+02
Griewank std 1.76E-18 5.09E-02 0.00E+00 0.00E+00 0.00E+00 0.00E+00 5.42E-15 3.67E+01 5.29E+00 2.31E-13 4.34E-14 0.00E+00 1.04E+02

Rastrigin mean 8.21E-04 1.19E+02 6.07E+00 5.77E+01 6.91E+02 3.69E+01 9.24E+01 1.31E+02 2.00E+02 4.07E+02 2.21E+02 3.54E+02 2.86E+02
Rastrigin min 8.21E-04 7.14E+01 6.07E+00 5.77E+01 6.91E+02 3.69E+01 9.24E+01 3.19E+01 1.66E+02 4.07E+02 2.21E+02 3.54E+02 1.31E+02
Rastrigin max 8.21E-04 1.21E+02 6.07E+00 5.77E+01 6.91E+02 3.69E+01 9.24E+01 2.21E+02 2.27E+02 4.07E+02 2.21E+02 3.54E+02 3.58E+02
Rastrigin std 3.31E-19 9.01E+00 9.03E-16 3.61E-14 3.47E-13 2.17E-14 4.34E-14 5.15E+01 1.11E+01 5.78E-14 1.45E-13 1.73E-13 5.69E+01
Rothyp mean 1.05E-273 1.18E+04 1.29E-04 1.09E+01 9.80E+05 1.64E+00 1.62E+04 2.42E+04 1.40E+04 3.88E+05 2.86E+04 1.98E+05 1.75E+04
Rothyp min 1.01E-273 6.31E+02 1.29E-04 1.09E+01 9.80E+05 1.64E+00 1.62E+04 2.34E-04 8.45E+03 3.88E+05 2.86E+04 1.98E+05 3.02E-05
Rothyp max 2.16E-273 1.22E+04 1.29E-04 1.09E+01 9.80E+05 1.64E+00 1.62E+04 1.14E+05 1.76E+04 3.88E+05 2.86E+04 1.98E+05 1.96E+05
Rothyp std 0.00E+00 2.11E+03 5.51E-20 3.61E-15 1.18E-10 0.00E+00 9.25E-12 3.36E+04 2.13E+03 1.18E-10 1.85E-11 8.88E-11 5.34E+04
Rosenbrock mean 2.36E+01 1.13E+04 7.35E-01 2.23E+02 6.21E+06 2.87E+01 5.37E+03 5.93E+04 6.13E+03 1.65E+06 2.86E+04 3.12E+05 2.48E+04
Rosenbrock min 2.33E+01 2.32E+03 7.35E-01 2.23E+02 6.21E+06 2.87E+01 5.37E+03 9.59E+00 2.20E+03 1.65E+06 2.86E+04 3.12E+05 6.00E+00
Rosenbrock max 2.36E+01 1.16E+04 7.35E-01 2.23E+02 6.21E+06 2.87E+01 5.37E+03 2.23E+05 1.04E+04 1.65E+06 2.86E+04 3.12E+05 2.66E+05
Rosenbrock std 5.07E-02 1.69E+03 0.00E+00 5.78E-14 9.47E-10 1.08E-14 0.00E+00 6.23E+04 2.07E+03 0.00E+00 1.48E-11 0.00E+00 6.94E+04
Sphere mean 1.10E-275 1.28E+00 3.97E-07 9.52E-11 3.78E+02 2.02E-03 3.38E+00 8.74E+00 8.55E+00 1.83E+02 1.89E+01 1.22E+02 1.32E+01
Sphere min 4.63E-277 7.95E-01 3.97E-07 9.52E-11 3.78E+02 2.02E-03 3.38E+00 7.45E-08 6.04E+00 1.83E+02 1.89E+01 1.22E+02 2.83E-07
Sphere max 3.16E-274 1.30E+00 3.97E-07 9.52E-11 3.78E+02 2.02E-03 3.38E+00 5.24E+01 1.14E+01 1.83E+02 1.89E+01 1.22E+02 1.00E+02
Sphere std 0.00E+00 9.21E-02 1.62E-22 0.00E+00 5.78E-14 0.00E+00 1.36E-15 1.59E+01 1.23E+00 5.78E-14 0.00E+00 0.00E+00 2.96E+01
Sumpow mean 0.00E+00 2.84E-07 6.04E-14 4.61E-07 5.32E+00 2.24E-07 6.52E-05 1.85E-19 8.88E-05 4.43E-01 3.61E-04 2.08E-02 3.02E-02
Sumpow min 0.00E+00 1.70E-07 6.04E-14 4.61E-07 5.32E+00 2.24E-07 6.52E-05 4.77E-30 7.65E-06 4.43E-01 3.61E-04 2.08E-02 6.53E-22
Sumpow max 0.00E+00 2.88E-07 6.04E-14 4.61E-07 5.32E+00 2.24E-07 6.52E-05 3.96E-18 2.70E-04 4.43E-01 3.61E-04 2.08E-02 9.86E-02
Sumpow std 0.00E+00 2.15E-08 0.00E+00 1.62E-22 0.00E+00 0.00E+00 1.38E-20 7.28E-19 6.93E-05 0.00E+00 1.65E-19 1.06E-17 3.27E-02
Zakharov mean 1.72E-03 1.90E+02 4.52E+00 1.52E+02 1.40E+03 2.86E+01 2.66E+02 1.83E+02 2.65E+02 7.08E+02 3.12E+02 4.87E+02 4.05E+02
Zakharov min 1.72E-03 1.51E+02 4.52E+00 1.52E+02 1.40E+03 2.86E+01 2.66E+02 9.29E+01 2.36E+02 7.08E+02 3.12E+02 4.87E+02 3.01E+02
Zakharov max 1.72E-03 1.91E+02 4.52E+00 1.52E+02 1.40E+03 2.86E+01 2.66E+02 2.67E+02 3.01E+02 7.08E+02 3.12E+02 4.87E+02 4.95E+02
Zakharov std 1.10E-18 7.40E+00 0.00E+00 2.89E-14 6.94E-13 0.00E+00 1.16E-13 4.85E+01 1.59E+01 3.47E-13 0.00E+00 5.78E-14 4.76E+01

indicate that the models have high certainty about their predictions. In addition, both the mean predictive entropy and mutual information scores of the proposed PSO-based EfficientNet-GRU method are lower than those of the proposed PSO-based EfficientNet model. This shows that the optimized EfficientNet-GRU model has better discriminative capabilities for distinguishing synthetic and real videos with lower uncertainty.

For each optimized network, the mean predictive entropy scores for both manipulated and real classes are also provided in Table 24. The mean predictive entropy scores for the manipulated class are lower than those of the genuine video class for both networks. Figures 22 and 23 also show the detailed uncertainty estimation distributions in terms of the predictive entropy for both fake and real classes for the PSO-devised EfficientNet-GRU and EfficientNet models, respectively. As indicated in Table 24 and Figs. 22 and 23, the manipulated videos are classified with higher certainty than the original videos by both optimized networks. The combined dataset construction may help explain the above observations. Since the employed deepfake datasets (i.e. Celeb-DFv2 and DFDC) usually contain much larger numbers of synthetic videos than genuine ones, real video samples from the YouTube Faces Database are also borrowed to increase the real class sample sizes and balance class distributions when constructing the combined dataset. Therefore, the trained networks have encountered a variety of manipulated instances and show comparatively lower uncertainty in recognizing fake videos in comparison with the original ones.

Overall, the mean predictive entropy and mutual information scores indicate the effectiveness of both of the proposed PSO-optimized networks for classifying manipulated and real videos with reasonably high certainty.

7 Evaluation using benchmark functions

To further indicate the effectiveness of the proposed PSO algorithm, we employ unimodal and multimodal benchmark functions with varied search spaces and artificial landscapes for evaluation. Multimodal functions such as Rastrigin, Griewank, Ackley and Powell, as well as unimodal functions including Rotated Hyper-Ellipsoid (Rothyp), Dixon-Price (Dixon), Sphere, Rosenbrock, Zakharov, Sum of Different Powers (Sumpow) and Sum Squares (Sumsqu), are evaluated in our experiments. The experiments are conducted using a maximum number of function evaluations of 25,000 (population = 50 and iterations = 500) and a dimension of 30. This maximum number of function evaluations (i.e. 25,000) is conducted

Table 25 (continued)
Function Statistic Prop. PSO PSO ABC SSA SSO FA DA BBPSO FPA BBPSOV SPSO GPSO ACPSO

Sumsqu mean 1.15E-278 1.83E+01 7.64E-06 3.84E+00 6.06E+03 1.53E+00 1.08E+01 1.69E+02 1.08E+02 2.23E+03 2.29E+02 1.15E+03 2.28E+02
Sumsqu min 3.42E-279 2.70E+00 7.64E-06 3.84E+00 6.06E+03 1.53E+00 1.08E+01 1.10E-06 7.43E+01 2.23E+03 2.29E+02 1.15E+03 6.51E-06
Sumsqu max 2.46E-277 1.88E+01 7.64E-06 3.84E+00 6.06E+03 1.53E+00 1.08E+01 8.91E+02 1.67E+02 2.23E+03 2.29E+02 1.15E+03 1.39E+03
Sumsqu std 0.00E+00 2.95E+00 6.89E-21 0.00E+00 2.78E-12 0.00E+00 1.81E-15 1.82E+02 1.93E+01 9.25E-13 0.00E+00 4.63E-13 4.70E+02
Powell mean 2.00E-07 1.37E+02 5.26E-02 3.44E-03 1.12E+05 3.86E-01 8.65E+01 1.08E+03 8.39E+01 9.96E+03 3.02E+02 5.50E+03 5.61E+02
Powell min 7.61E-11 1.34E+02 5.26E-02 3.44E-03 1.12E+05 3.86E-01 8.65E+01 8.45E-02 3.35E+01 9.96E+03 3.02E+02 5.50E+03 6.86E-04
Powell max 2.07E-07 2.13E+02 5.26E-02 3.44E-03 1.12E+05 3.86E-01 8.65E+01 6.39E+03 1.22E+02 9.96E+03 3.02E+02 5.50E+03 4.75E+03
Powell std 3.78E-08 1.45E+01 3.53E-17 1.32E-18 5.92E-11 1.13E-16 0.00E+00 1.53E+03 2.05E+01 0.00E+00 1.16E-13 4.63E-12 1.18E+03
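For readers unfamiliar with the test functions named in this section, three of them can be written down in a few lines. The sketch below uses the standard textbook definitions of Sphere, Rastrigin and Ackley (each with a global minimum of 0 at the origin); the search ranges and any shifts or rotations used in the paper's experiments are not stated in this section and are therefore not modelled, and the constant names are illustrative.

```python
import math


def sphere(x):
    """Unimodal: sum of squares."""
    return sum(v * v for v in x)


def rastrigin(x):
    """Multimodal: quadratic bowl with a cosine ripple."""
    return 10.0 * len(x) + sum(v * v - 10.0 * math.cos(2.0 * math.pi * v) for v in x)


def ackley(x):
    """Multimodal: nearly flat outer region with a deep central basin."""
    d = len(x)
    mean_sq = sum(v * v for v in x) / d
    mean_cos = sum(math.cos(2.0 * math.pi * v) for v in x) / d
    return (-20.0 * math.exp(-0.2 * math.sqrt(mean_sq))
            - math.exp(mean_cos) + 20.0 + math.e)


# Evaluation budget quoted in the text: 50 particles x 500 iterations.
DIMENSION = 30
POPULATION = 50
ITERATIONS = 500
MAX_EVALUATIONS = POPULATION * ITERATIONS

origin = [0.0] * DIMENSION
for name, fn in [("Sphere", sphere), ("Rastrigin", rastrigin), ("Ackley", ackley)]:
    print(name, fn(origin))
```

In double precision the Ackley value at the origin is typically a tiny positive number rather than exactly zero, which is why results of the order of 1E-15 in the tables are effectively at the global minimum.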


Table 26 Wilcoxon rank sum test results over 30 runs for dimension = 30
Function PSO ABC SSA SSO FA DA BBPSO FPA BBPSOV SPSO GPSO ACPSO

Ackley 4.29E-14 2.71E-14 2.71E-14 2.71E-14 2.71E-14 2.71E-14 1.72E-12 1.72E-12 2.71E-14 2.71E-14 2.71E-14 1.72E-12
Dixon 2.71E-14 2.71E-14 1.69E-14 1.69E-14 1.69E-14 1.69E-14 1.21E-12 1.21E-12 1.69E-14 1.69E-14 1.69E-14 1.21E-12
Griewank 2.71E-14 1.69E-14 1.69E-14 1.69E-14 1.69E-14 1.69E-14 3.36E-11 1.21E-12 1.69E-14 1.69E-14 1.69E-14 1.21E-12
Rastrigin 2.71E-14 1.69E-14 1.69E-14 1.69E-14 1.69E-14 1.69E-14 1.21E-12 1.21E-12 1.69E-14 1.69E-14 1.69E-14 1.21E-12
Rothyp 4.29E-14 2.71E-14 2.71E-14 2.71E-14 2.71E-14 2.71E-14 1.72E-12 1.72E-12 2.71E-14 2.71E-14 2.71E-14 1.72E-12
Rosenbrock 4.29E-14 2.71E-14 2.71E-14 2.71E-14 2.71E-14 2.71E-14 4.56E-11 1.72E-12 2.71E-14 2.71E-14 2.71E-14 2.59E-06
Sphere 4.29E-14 2.71E-14 2.71E-14 2.71E-14 2.71E-14 2.71E-14 1.72E-12 1.72E-12 2.71E-14 2.71E-14 2.71E-14 1.72E-12
Sumpow 2.71E-14 1.69E-14 1.69E-14 1.69E-14 1.69E-14 1.69E-14 1.21E-12 1.21E-12 1.69E-14 1.69E-14 1.69E-14 1.21E-12
Zakharov 2.71E-14 1.69E-14 1.69E-14 1.69E-14 1.69E-14 1.69E-14 1.21E-12 1.21E-12 1.69E-14 1.69E-14 1.69E-14 1.21E-12
Sumsqu 4.29E-14 2.71E-14 2.71E-14 2.71E-14 2.71E-14 2.71E-14 1.72E-12 1.72E-12 2.71E-14 2.71E-14 2.71E-14 1.72E-12
Powell 4.29E-14 2.71E-14 2.71E-14 2.71E-14 2.71E-14 2.71E-14 1.72E-12 1.72E-12 2.71E-14 2.71E-14 2.71E-14 1.72E-12

Table 27 Mean ranking results of all search methods based on the Friedman test for benchmark functions with dimension = 30
Algorithms  Mean Ranking
Prop. PSO   1.18
PSO         6
ABC         2.27
SSA         3.55
SSO         13
FA          3.36
DA          6.45
BBPSO       7.36
FPA         6.91
BBPSOV      12
SPSO        9.09
GPSO        10.82
ACPSO       9

Table 28 Statistical results of the Friedman test for benchmark functions with dimension = 30
Chi-square  p Value  Hypothesis
120.94      <0.001   Rejected

Fig. 24 Mean convergence rate comparison in the log scale over 30 runs for the Powell function with dimension = 30

by all search methods to ensure a fair comparison. The mean results along with the maximum, minimum and standard deviation performances over a set of 30 runs for solving these benchmark functions are presented in Table 25. The Wilcoxon rank sum test results shown in Table 26 are used to indicate the significance of our results against those of the baseline methods.

As shown in Table 25, our model outperforms all the baseline methods for 9 out of 11 benchmark functions, while SSA and ABC obtain the best results for Griewank


and Rosenbrock, respectively, with the proposed model achieving the second best results for these two test functions. The statistical superiority of the proposed model is further evidenced in the rank sum test results shown in Table 26. Specifically, our model obtains statistically better results than those of the baseline methods for most test functions, except that SSA and ABC obtain statistically better results for Griewank and Rosenbrock than those of the proposed model, respectively.

Besides the Wilcoxon rank sum test, the nonparametric Friedman test is also conducted. It tests the null hypothesis that the results of all the treatment methods have identical distributions, based on a Chi-square approximation. Table 27 shows the mean rankings of all the search methods over 30 runs based on the mean results of all test functions shown in Table 25 using the Friedman test. The mean ranking result of each algorithm is obtained by averaging the rankings of the mean results of all benchmark functions. As indicated in Table 27, the proposed model has the best mean ranking (i.e. the lowest value, 1.18) in comparison with those of all the baseline methods. ABC and FA also show competitive rankings against other baseline methods. The p-value obtained using the Friedman test illustrated in Table 28 is lower than 0.001. It further ascertains that our results are better than those of all other search methods with statistical significance.

The empirical results indicate that the proposed model has a fast convergence rate in most test cases. Moreover, it shows better capabilities in tackling local optima traps by locating the global minima in most test cases. Example mean convergence curves over 30 runs generated using the logarithm scale with a base of 10 during the course of 500 iterations with respect to the Powell, Ackley and Sphere functions are provided in Figs. 24, 25 and 26, respectively. The visualization results indicate that the proposed model navigates various search spaces with a fast convergence speed. Owing to the adoption of diverse composite leaders and optimal search action selection reinforced by Q-learning, as demonstrated in the visualized convergence curves, the proposed optimizer also shows better capabilities in tackling local optima traps as compared with those of other baseline methods. A similar trend is also obtained for other benchmark functions.

To further test model effectiveness, we have also evaluated the proposed model using the benchmark functions with a dimension of 50. The experiments are performed using the following settings, i.e. population = 50, iteration = 1000 and trial = 30. A maximum number of 50,000 function evaluations is used by all search methods. The detailed evaluation and Wilcoxon rank sum statistical test results are provided in Tables 29 and 30.

As shown in Tables 29 and 30, the proposed model achieves statistically better results than those of all the baseline methods for most numerical optimization problems. In particular, it achieves the global minimum of '0' for the Rotated Hyper-Ellipsoid, Sphere, Sum of Different Powers and Sum Squares functions. The exceptions are Rosenbrock, where ABC obtains the best global minimum and outperforms the proposed model with statistical significance, and Griewank, where ABC and SSA obtain statistically better performances than those of the proposed model. For these two test functions, i.e. Rosenbrock and Griewank, the proposed model achieves the second and third best results, respectively. These numerical function results also indicate that the proposed model with composite leaders and Q-learning-based search action optimization possesses better capabilities in tackling local optima traps and achieves the best global minima in most test cases.

Fig. 25 Mean convergence rate comparison in the log scale over 30 runs for the Ackley function with dimension = 30

Fig. 26 Mean convergence rate comparison in the log scale over 30 runs for the Sphere function with dimension = 30
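The mean-ranking step of the Friedman test described above (rank the algorithms on each benchmark function by their mean result, then average the ranks per algorithm) can be sketched as follows. This is an illustrative sketch, not the authors' analysis script: the helper names are hypothetical, only three algorithms and three functions from Table 25 are used in the first example (so the resulting ranks differ from Table 27), and the chi-square call assumes that the 11 benchmark functions act as the blocks of the test.

```python
def average_ranks(values):
    """Rank values in ascending order (1 = smallest); ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def mean_rankings(results):
    """results: {function name: [mean result per algorithm]} -> mean rank per algorithm."""
    n_algorithms = len(next(iter(results.values())))
    totals = [0.0] * n_algorithms
    for values in results.values():
        for i, rank in enumerate(average_ranks(values)):
            totals[i] += rank
    return [total / len(results) for total in totals]


def friedman_chi_square(mean_ranks, n_blocks):
    """Friedman statistic computed from the mean ranks of k treatments over n_blocks blocks."""
    k = len(mean_ranks)
    sum_sq = sum(r * r for r in mean_ranks)
    return 12.0 * n_blocks / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)


# Mean results of Prop. PSO, ABC and SSA taken from Table 25 (dimension = 30).
subset = {
    "Ackley": [7.43e-15, 5.91e-02, 2.74e+00],
    "Griewank": [2.28e-03, 2.94e-03, 6.25e-05],
    "Rosenbrock": [2.36e+01, 7.35e-01, 2.23e+02],
}
subset_ranks = mean_rankings(subset)
print(subset_ranks)

# Mean rankings of all 13 algorithms as listed in Table 27.
table27 = [1.18, 6, 2.27, 3.55, 13, 3.36, 6.45, 7.36, 6.91, 12, 9.09, 10.82, 9]
chi_square = friedman_chi_square(table27, n_blocks=11)
print(round(chi_square, 2))
```

Feeding the rounded mean rankings of Table 27 into the formula gives a statistic of roughly 120.9, in line with the 120.94 reported in Table 28; the small remaining gap is plausibly due to the two-decimal rounding of the published ranks.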

Table 29 Evaluation results for the benchmark functions with dimension = 50
Function Statistic Prop. PSO PSO ABC SSA SSO FA DA BBPSO FPA BBPSOV SPSO GPSO ACPSO

Ackley mean 7.55E-15 1.39E+01 4.07E-02 2.32E+00 2.11E+01 3.04E-02 1.96E+01 1.56E+01 1.31E+01 2.04E+01 1.40E+01 1.86E+01 1.39E+01
Ackley min 7.55E-15 1.25E+01 4.07E-02 2.32E+00 2.11E+01 3.04E-02 1.96E+01 1.28E+01 1.17E+01 2.04E+01 1.40E+01 1.86E+01 1.23E+01
Ackley max 7.55E-15 1.39E+01 4.07E-02 2.32E+00 2.11E+01 3.04E-02 1.96E+01 1.77E+01 1.42E+01 2.04E+01 1.40E+01 1.86E+01 1.48E+01
Ackley std 0.00E+00 2.62E-01 0.00E+00 4.52E-16 7.23E-15 0.00E+00 1.45E-14 1.32E+00 7.26E-01 7.23E-15 0.00E+00 1.08E-14 4.87E-01
Dixon mean 6.67E-01 7.19E+05 1.33E+00 7.36E-01 1.67E+07 2.96E+00 4.19E+04 1.54E+05 2.35E+03 5.79E+06 1.58E+05 3.10E+06 1.26E+05
Dixon min 6.67E-01 6.21E+05 1.33E+00 7.36E-01 1.67E+07 2.96E+00 4.19E+04 8.77E+00 1.21E+03 5.79E+06 1.58E+05 3.10E+06 8.84E+04
Dixon max 6.67E-01 7.23E+05 1.33E+00 7.36E-01 1.67E+07 2.96E+00 4.19E+04 6.27E+05 4.24E+03 5.79E+06 1.58E+05 3.10E+06 2.00E+05
Dixon std 9.62E-12 1.86E+04 4.52E-16 1.13E-16 3.79E-09 1.36E-15 7.40E-12 2.00E+05 7.14E+02 1.89E-09 5.92E-11 4.74E-10 2.62E+04
Griewank mean 1.93E-03 1.17E+02 1.72E-05 2.24E-05 1.95E+03 5.69E-03 1.73E+01 8.16E+01 1.77E+01 1.14E+03 1.68E+02 6.71E+02 1.48E+02
Griewank min 1.93E-03 1.07E+02 1.72E-05 2.24E-05 1.95E+03 5.69E-03 1.73E+01 5.88E-03 1.33E+01 1.14E+03 1.68E+02 6.71E+02 1.18E+02
Griewank max 1.93E-03 1.18E+02 1.72E-05 2.24E-05 1.95E+03 5.69E-03 1.73E+01 3.61E+02 2.42E+01 1.14E+03 1.68E+02 6.71E+02 1.89E+02
Griewank std 0.00E+00 1.91E+00 0.00E+00 0.00E+00 9.25E-13 0.00E+00 1.08E-14 9.57E+01 2.27E+00 9.25E-13 5.78E-14 1.16E-13 1.56E+01

Rastrigin mean 1.87E-03 2.49E+02 8.42E+00 4.88E+01 1.07E+03 2.89E+01 1.92E+02 2.65E+02 2.95E+02 7.95E+02 4.41E+02 6.07E+02 4.47E+02
Rastrigin min 1.87E-03 2.49E+02 8.42E+00 4.88E+01 1.07E+03 2.89E+01 1.92E+02 1.24E+02 2.50E+02 7.95E+02 4.41E+02 6.07E+02 3.93E+02
Rastrigin max 1.87E-03 2.52E+02 8.42E+00 4.88E+01 1.07E+03 2.89E+01 1.92E+02 3.52E+02 3.34E+02 7.95E+02 4.41E+02 6.07E+02 5.07E+02
Rastrigin std 0.00E+00 6.16E-01 3.61E-15 7.23E-15 4.63E-13 3.61E-15 2.89E-14 5.21E+01 1.82E+01 1.16E-13 2.89E-13 4.63E-13 2.73E+01
Rothyp mean 0.00E+00 3.45E+05 7.21E-06 8.40E+01 2.36E+06 1.81E+01 2.56E+04 8.89E+04 1.66E+04 1.36E+06 1.41E+05 6.61E+05 6.60E+05
Rothyp min 0.00E+00 3.44E+05 7.21E-06 8.40E+01 2.36E+06 1.81E+01 2.56E+04 1.65E-01 1.33E+04 1.36E+06 1.41E+05 6.61E+05 3.84E+05
Rothyp max 0.00E+00 3.76E+05 7.21E-06 8.40E+01 2.36E+06 1.81E+01 2.56E+04 2.75E+05 2.35E+04 1.36E+06 1.41E+05 6.61E+05 7.95E+05
Rothyp std 0.00E+00 5.90E+03 3.45E-21 1.45E-14 9.47E-10 0.00E+00 1.11E-11 7.96E+04 2.54E+03 4.74E-10 8.88E-11 1.18E-10 9.64E+04
Rosenbrock mean 4.48E+01 3.42E+05 2.33E-01 1.34E+02 9.16E+06 4.71E+01 9.06E+04 2.79E+05 2.46E+04 2.96E+06 1.02E+05 1.37E+06 8.63E+04
Rosenbrock min 4.48E+01 3.39E+05 2.33E-01 1.34E+02 9.16E+06 4.71E+01 9.06E+04 5.72E+02 1.56E+04 2.96E+06 1.02E+05 1.37E+06 4.92E+04
Rosenbrock max 4.48E+01 4.29E+05 2.33E-01 1.34E+02 9.16E+06 4.71E+01 9.06E+04 1.11E+06 3.71E+04 2.96E+06 1.02E+05 1.37E+06 1.51E+05
Rosenbrock std 4.24E-03 1.65E+04 8.47E-17 8.67E-14 3.79E-09 2.89E-14 5.92E-11 1.98E+05 5.72E+03 9.47E-10 0.00E+00 0.00E+00 2.30E+04
Sphere mean 0.00E+00 1.74E+01 1.30E-07 1.52E-10 5.68E+02 7.32E-04 4.22E+00 1.58E+01 5.50E+00 3.43E+02 4.85E+01 1.24E+02 4.19E+01
Sphere min 0.00E+00 1.64E+01 1.30E-07 1.52E-10 5.68E+02 7.32E-04 4.22E+00 4.16E-06 4.46E+00 3.43E+02 4.85E+01 1.24E+02 3.21E+01
Sphere max 0.00E+00 4.81E+01 1.30E-07 1.52E-10 5.68E+02 7.32E-04 4.22E+00 7.86E+01 6.67E+00 3.43E+02 4.85E+01 1.24E+02 5.34E+01
Sphere std 0.00E+00 5.80E+00 1.08E-22 2.63E-26 1.16E-13 2.21E-19 1.81E-15 2.34E+01 6.36E-01 5.78E-14 7.23E-15 2.89E-14 5.56E+00
Sumpow mean 0.00E+00 7.49E-08 4.29E-16 3.16E-08 6.54E+00 1.18E-07 1.16E-04 4.76E-18 8.20E-07 5.75E-01 3.08E-04 2.42E-01 3.12E-04
Sumpow min 0.00E+00 1.77E-08 4.29E-16 3.16E-08 6.54E+00 1.18E-07 1.16E-04 1.23E-26 7.11E-08 5.75E-01 3.08E-04 2.42E-01 3.97E-05
Sumpow max 0.00E+00 1.73E-06 4.29E-16 3.16E-08 6.54E+00 1.18E-07 1.16E-04 6.33E-17 4.01E-06 5.75E-01 3.08E-04 2.42E-01 1.16E-03
Sumpow std 0.00E+00 3.13E-07 1.50E-31 1.35E-23 0.00E+00 4.04E-23 6.89E-20 1.57E-17 7.92E-07 1.13E-16 0.00E+00 8.47E-17 2.58E-04
Zakharov mean 3.62E-03 5.47E+02 9.33E+00 3.28E+02 2.23E+03 5.19E+01 3.83E+02 3.79E+02 4.46E+02 1.36E+03 5.76E+02 8.92E+02 5.92E+02
Zakharov min 3.62E-03 3.59E+02 9.33E+00 3.28E+02 2.23E+03 5.19E+01 3.83E+02 2.27E+02 3.79E+02 1.36E+03 5.76E+02 8.92E+02 5.11E+02
Zakharov max 3.62E-03 5.53E+02 9.33E+00 3.28E+02 2.23E+03 5.19E+01 3.83E+02 5.23E+02 5.11E+02 1.36E+03 5.76E+02 8.92E+02 6.88E+02
Zakharov std 0.00E+00 3.55E+01 0.00E+00 1.73E-13 0.00E+00 0.00E+00 1.73E-13 8.30E+01 2.60E+01 0.00E+00 3.47E-13 1.16E-13 4.32E+01

The significance of the proposed model is further ascertained by the Friedman test. As indicated in Tables 31 and 32, the proposed model attains the best mean ranking (1.27) for solving benchmark functions with dimension = 50 as compared with those of the other search methods. The p-value from the Friedman test (< 0.001) is also lower than 0.05, so the hypothesis that all search methods perform equally is rejected. Together with the mean rankings, this indicates that the advantage of the proposed model over the baseline methods is statistically significant.

Figures 27 and 28 depict the mean convergence curves of all search methods over 30 runs during the course of 1000 iterations with respect to the Powell and Rotated Hyper-Ellipsoid functions. As mentioned earlier, these mean convergence graphs are generated using the logarithm scale with a base of 10 for these example test functions. As indicated in Figs. 27 and 28, the proposed model navigates complex search spaces effectively and shows great competence in tackling local optima traps. For the Rotated Hyper-Ellipsoid function, the proposed model achieves the global minimum of '0' from iteration 657 onwards based on the results over 30 runs. Because log10 0 = −∞ cannot be plotted, the convergence curve of our optimizer is shown only until iteration 656. These convergence graphs again illustrate that the proposed model converges fastest and reaches better minima than all the baseline search methods. A similar trend is also observed for the proposed optimizer for other benchmark functions.

Table 29 (continued)
Prop. PSO PSO ABC SSA SSO FA DA BBPSO FPA BBPSOV SPSO GPSO ACPSO
Sumsqu mean 0.00E+00 7.72E+02 3.93E-07 1.81E+00 1.56E+04 1.96E+00 5.82E+00 7.59E+02 1.15E+02 8.27E+03 9.65E+02 3.52E+03 9.49E+02
min 0.00E+00 3.08E+02 3.93E-07 1.81E+00 1.56E+04 1.96E+00 5.82E+00 9.96E-02 7.68E+01 8.27E+03 9.65E+02 3.52E+03 7.54E+02
max 0.00E+00 7.88E+02 3.93E-07 1.81E+00 1.56E+04 1.96E+00 5.82E+00 2.12E+03 1.68E+02 8.27E+03 9.65E+02 3.52E+03 1.14E+03
std 0.00E+00 8.78E+01 1.08E-22 9.03E-16 1.11E-11 1.13E-15 1.81E-15 5.57E+02 2.36E+01 1.85E-12 6.94E-13 2.31E-12 9.74E+01
Powell mean 6.82E-08 4.52E+02 8.26E-02 5.10E+00 1.62E+05 3.92E+00 3.18E+02 2.78E+03 8.84E+01 2.45E+04 1.25E+03 1.34E+04 1.31E+03
min 9.76E-09 3.15E+02 8.26E-02 5.10E+00 1.62E+05 3.92E+00 3.18E+02 7.03E-01 5.37E+01 2.45E+04 1.25E+03 1.34E+04 8.98E+02
max 7.02E-08 4.57E+02 8.26E-02 5.10E+00 1.62E+05 3.92E+00 3.18E+02 9.90E+03 1.40E+02 2.45E+04 1.25E+03 1.34E+04 1.84E+03
std 1.10E-08 2.59E+01 1.41E-17 1.81E-15 0.00E+00 1.81E-15 5.78E-14 2.15E+03 2.12E+01 0.00E+00 9.25E-13 3.70E-12 2.22E+02

Table 30 Wilcoxon rank sum test results over 30 runs with dimension = 50
PSO ABC SSA SSO FA DA BBPSO FPA BBPSOV SPSO GPSO ACPSO
Ackley 2.71E-14 1.69E-14 1.69E-14 1.69E-14 1.69E-14 1.69E-14 1.21E-12 1.21E-12 1.69E-14 1.69E-14 1.69E-14 1.69E-14
Dixon 4.29E-14 2.71E-14 2.71E-14 2.71E-14 2.71E-14 2.71E-14 1.72E-12 1.72E-12 2.71E-14 2.71E-14 2.71E-14 2.71E-14
Griewank 2.71E-14 1.69E-14 1.69E-14 1.69E-14 1.69E-14 1.69E-14 1.21E-12 1.21E-12 1.69E-14 1.69E-14 1.69E-14 1.69E-14
Rastrigin 2.71E-14 1.69E-14 1.69E-14 1.69E-14 1.69E-14 1.69E-14 1.21E-12 1.21E-12 1.69E-14 1.69E-14 1.69E-14 1.69E-14
Rothyp 2.71E-14 1.69E-14 1.69E-14 1.69E-14 1.69E-14 1.69E-14 1.21E-12 1.21E-12 1.69E-14 1.69E-14 1.69E-14 1.69E-14
Rosenbrock 4.29E-14 2.71E-14 2.71E-14 2.71E-14 2.71E-14 2.71E-14 1.72E-12 1.72E-12 2.71E-14 2.71E-14 2.71E-14 3.02E-11
Sphere 2.71E-14 1.69E-14 1.69E-14 1.69E-14 1.69E-14 1.69E-14 1.21E-12 1.21E-12 1.69E-14 1.69E-14 1.69E-14 1.69E-14
Sumpow 2.71E-14 1.69E-14 1.69E-14 1.69E-14 1.69E-14 1.69E-14 1.21E-12 1.21E-12 1.69E-14 1.69E-14 1.69E-14 1.69E-14
Zakharov 2.71E-14 1.69E-14 1.69E-14 1.69E-14 1.69E-14 1.69E-14 1.21E-12 1.21E-12 1.69E-14 1.69E-14 1.69E-14 1.69E-14
Sumsqu 2.71E-14 1.69E-14 1.69E-14 1.69E-14 1.69E-14 1.69E-14 1.21E-12 1.21E-12 1.69E-14 1.69E-14 1.69E-14 1.69E-14
Powell 4.29E-14 2.71E-14 2.71E-14 2.71E-14 2.71E-14 2.71E-14 1.72E-12 1.72E-12 2.71E-14 2.71E-14 2.71E-14 2.71E-14

Table 31 Mean ranking results of all search methods based on the Friedman test for benchmark functions with dimension = 50
Algorithms Mean Ranking
Prop. PSO 1.27
PSO 7.77
ABC 2.18
SSA 3.36
SSO 13
FA 3.55
DA 6.36
BBPSO 7.09
FPA 5.91
BBPSOV 12
SPSO 8.91
GPSO 10.91
ACPSO 8.68

Table 32 Statistical results of the Friedman test for benchmark functions with dimension = 50
Chi-square p-Value Hypothesis
120.52 <0.001 Rejected

Fig. 27 Mean convergence rate comparison in the log scale over 30 runs for the Powell function with dimension = 50

8 Conclusion

In this research, we have proposed transfer learning and hybrid deep networks with PSO-based optimal hyperparameter selection for undertaking deepfake classification. A new PSO model is proposed for optimal hyperparameter search by integrating composite leader generation and reinforcement learning-based search operation adjustment. Face cropping is also applied as a preprocessing step to extract the facial regions and eliminate background noise. Evaluated using several challenging deepfake datasets, the proposed PSO-optimized EfficientNet-B0 and EfficientNet-GRU models show enhanced performance. In particular, EfficientNet-GRU with optimal settings yielded by our proposed optimizer achieves the best results and outperforms existing studies significantly in different experimental settings. The proposed optimizer also achieves statistically better performance than the other search methods in solving diverse unimodal and multimodal mathematical landscapes.

The next steps for this research could include further exploration of different loss functions and the integration with other optimization algorithms to further enhance performance. Additionally, incorporating more diverse and larger datasets for model training could also help improve deepfake detection accuracy. Another potential direction would be to investigate the use of other state-of-the-art models, such as transformer-based models and contrastive learning, for deepfake detection. Such explorations could provide valuable insights into the effectiveness of different approaches for handling the deepfake detection problem. Finally, reinforcement learning algorithms with continuous

action space will also be studied to further enhance the proposed optimizer pertaining to search coefficient generation, in order to further increase search diversity.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Fig. 28 Mean convergence rate comparison in the log scale over 30 runs for the Rotated Hyper-Ellipsoid function with dimension = 50
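The truncation of the proposed optimizer's curve at iteration 656 on the Rotated Hyper-Ellipsoid function can be illustrated with a short Python sketch. It assumes a hypothetical list of mean best-fitness values, one per iteration; the numbers below are made up for illustration and are not taken from the experiments, and the helper name log10_curve is likewise illustrative. The sketch keeps only the points that can be drawn on a base-10 logarithmic axis and stops at the first iteration where the mean fitness reaches 0.

```python
import math


def log10_curve(mean_fitness):
    """Return (iteration, log10(value)) pairs up to, but excluding, the first
    iteration whose value is not strictly positive."""
    points = []
    for iteration, value in enumerate(mean_fitness, start=1):
        if value <= 0.0:
            # log10(0) is undefined, so the curve ends at the previous iteration.
            break
        points.append((iteration, math.log10(value)))
    return points


# Hypothetical mean fitness per iteration: the optimum 0 is reached at iteration 4.
mean_fitness = [1e2, 1e0, 1e-3, 0.0, 0.0]
points = log10_curve(mean_fitness)
last_plotted_iteration = points[-1][0]
```

With this toy input the last plotted iteration is 3, one before the iteration at which the value 0 is first reached, mirroring the relation between iterations 656 and 657 described for Fig. 28.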
Author Contributions Leandro Cunha was involved in conceptualization, data curation, formal analysis, investigation, methodology, resources, software, validation, visualization, writing (original draft), and writing (review and editing). Li Zhang took part in conceptualization, formal analysis, investigation, methodology, resources, supervision, validation, writing (original draft), and writing (review and editing). Bilal Sowan took part in supervision and writing (review and editing). Chee Peng Lim was involved in supervision, validation, and writing (review and editing). Yinghui Kong was involved in writing (review and editing).

Funding This research was supported by the StoryFutures project funded by the Arts and Humanities Research Council (AHRC).

Data availability This research employs publicly available deepfake datasets for experimental studies.

Code availability The authors will publish the code for the proposed work on a dedicated website after the acceptance of the paper.

Declarations

Conflict of interest The authors have no relevant financial or non-financial interests to disclose.

Ethics approval The proposed work has gained organizational ethical approval.

Consent to participate The consent to participate has been obtained from all the co-authors for the proposed studies.

Consent for publication The consent for publication has been obtained from all the co-authors for the proposed studies.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Authors and Affiliations

Leandro Cunha1 • Li Zhang1 • Bilal Sowan2 • Chee Peng Lim3 • Yinghui Kong4

✉ Li Zhang
li.zhang@rhul.ac.uk

Leandro Cunha
leandro.cunha.2021@live.rhul.ac.uk

Bilal Sowan
bilal.sowan@uop.edu.jo

Chee Peng Lim
chee.lim@deakin.edu.au

Yinghui Kong
kongyhbd2015@ncepu.edu.cn

1 Department of Computer Science, Royal Holloway, University of London, Surrey TW20 0EX, UK
2 Department of Business Intelligence and Data Analytics, University of Petra, Amman 11196, Jordan
3 Institute for Intelligent Systems Research and Innovation, Deakin University, Waurn Ponds, VIC 3216, Australia
4 Department of Electronics and Communication Engineering, North China Electric Power University, Beijing 102206, Hebei, China
