Article
UFCC: A Unified Forensic Approach to Locating Tampered Areas
in Still Images and Detecting Deepfake Videos by Evaluating
Content Consistency
Po-Chyi Su 1, * , Bo-Hong Huang 1 and Tien-Ying Kuo 2, *
Abstract: Image inpainting and Deepfake techniques have the potential to drastically alter the mean-
ing of visual content, posing a serious threat to the integrity of both images and videos. Addressing
this challenge requires the development of effective methods to verify the authenticity of investigated
visual data. This research introduces UFCC (Unified Forensic Scheme by Content Consistency), a
novel forensic approach based on deep learning. UFCC can identify tampered areas in images and
detect Deepfake videos by examining content consistency, assuming that manipulations can create
dissimilarity between tampered and intact portions of visual data. The term “Unified” signifies that
the same methodology is applicable to both still images and videos. Recognizing the challenge of
collecting a diverse dataset for supervised learning due to various tampering methods, we overcome
this limitation by incorporating information from original or unaltered content in the training process
rather than relying solely on tampered data. A neural network for feature extraction is trained to
classify imagery patches, and a Siamese network measures the similarity between pairs of patches.
For still images, tampered areas are identified as patches that deviate from the majority of the investi-
gated image. In the case of Deepfake video detection, the proposed scheme involves locating facial
regions and determining authenticity by comparing facial region similarity across consecutive frames.
Extensive testing is conducted on publicly available image forensic datasets and Deepfake datasets with various manipulation operations. The experimental results highlight the superior accuracy and stability of the UFCC scheme compared to existing methods.

Keywords: media forensics; image tampering; deepfake; Siamese network; deep learning

Citation: Su, P.-C.; Huang, B.-H.; Kuo, T.-Y. UFCC: A Unified Forensic Approach to Locating Tampered Areas in Still Images and Detecting Deepfake Videos by Evaluating Content Consistency. Electronics 2024, 13, 804. https://doi.org/10.3390/electronics13040804
verify the authenticity of digital images and videos, safeguarding against the potential
misuse and harmful consequences of manipulated content.
The goals of identifying image inpainting and Deepfake videos differ slightly. Image
inpainting typically involves altering a smaller section of a static image to conceal errors, or
to maliciously alter its meaning. In this research, the focus is on the latter aspect, aiming to
spot potential tampering in a picture. Consequently, the detection not only highlights the
presence of tampering but also locates the affected regions within the image. For instance,
if the detection process reveals the addition of a person or the removal of a subject, we can
gain insights into the image editor’s intentions by examining these manipulated areas. On
the other hand, Deepfake techniques primarily focus on modifying a person’s face to adopt
their identity in deceptive videos for conveying misinformation. The objective in detecting
Deepfake is to ascertain whether the video is artificially generated rather than authentic. If
an examined video is identified as the result of Deepfake manipulation, its content can then be disregarded.
It is worth mentioning that numerous current forgery detection methods resort to
supervised learning and adopt deep learning strategies. To train robust deep learning
models, a substantial amount of both unaltered and forged data is usually required. How-
ever, obtaining a comprehensive dataset encompassing both forged data and their original
counterparts is impractical. The constant evolution of forgery techniques makes it more
difficult to anticipate the specific manipulations applied to an image or video. In other
words, it is challenging to ensure the generality of trained deep learning models to deal
with various content manipulations by assuming them in advance.
This research introduces UFCC (Unified Forensic Scheme by Content Consistency), a
unified forensic scheme designed to detect both image inpainting and Deepfake videos.
The core concept revolves around scrutinizing images or video frames for the presence of
“abnormal” content discontinuity. A convolutional neural network (CNN) is trained to
extract features from patches within images. Subsequently, a Siamese network is employed
to evaluate the consistency by assessing the similarity between pairs of image patches.
While this proposed method utilizes deep learning techniques, different from many existing
methods, it uniquely requires only original or unaltered image patches during the model
training phase. This innovation allows for the identification of forged regions without a
dependency on specific forgery datasets, offering a more adaptable and practical solution
for detecting forgery in imagery data.
The term “Unified” in UFCC indicates that the identical methodology can be utilized
for detecting both image manipulation and Deepfake videos. In the realm of manipulated
images, the initial step involves identifying potential regions, succeeded by a more detailed
localization of forged areas using a deep segmentation network. This process enables a
more precise delineation of the regions affected by image manipulation operations. In
the cases of detecting Deepfake videos, we first extract facial regions within video frames
and subsequently assess the similarity of these facial regions across adjacent frames. The
proposed UFCC can thus prove equally adept at detecting both image forgery and Deepfake
videos. The contributions of this research are summarized below:
1. The camera-type classification models are improved and extended to apply to digital
image forensics and Deepfake detection.
2. A suitable dataset is formed for classifying camera types.
3. A unified method is proposed for effectively evaluating image and video content consistency.
4. A unified method is developed to deal with various Deepfake approaches.
The rest of the paper is organized as follows. Section 2 introduces related work, includ-
ing traditional signal processing methods and modern deep learning methods. Section 3
details the research methodology, including the adopted deep learning network architecture
and related data preparation. Section 4 presents the details of model training, showcases
the results, and provides comparisons with existing work. Section 5 provides conclusive
remarks and outlines expectations for future developments.
2. Related Work
2.1. Digital Image Forensics
The widespread availability of image editing software, equipped with increasingly
potent capabilities, amplifies the challenges from image forgery, prompting an urgent
demand for image forensic techniques. Two primary approaches, active and passive,
address the detection of image content forgery or manipulation. Active methods involve
embedding imperceptible digital watermarks into images, where the extracted watermarks
or their specific conditions may reveal potential manipulations applied to an investigated
image. However, drawbacks include the protection of only watermarked images, and
the content being somewhat affected by the watermark embedding process. Moreover,
controversies surround the responsibility for detecting or extracting watermarks to establish
image authenticity.
In contrast, passive methods operate on the premise that even seemingly imperceptible
image manipulations alter the statistical characteristics of imagery data. Rather than
introducing additional signals into images, passive methods seek to uncover underlying
inconsistencies to detect image manipulations or forgery [1,2]. Among passive methods,
camera model recognition [3] is a promising direction in image forensics. This involves
determining the type of camera used to capture the images. Various recognition methods
utilizing features such as CFA demosaic effects [4], sensor noise patterns [5], local binary
patterns [6], noise models [7], chromatic aberrations [8], and illumination direction [9] have
been proposed. Classification based on these features helps to determine the camera model
types. Recent work has utilized deep learning [10] to pursue generality.
Despite the effectiveness demonstrated by camera model identification methods, a
prevalent limitation is evident in many of these techniques. They operate under the
assumption that the camera models to be determined must be part of a predefined dataset,
requiring a foundational understanding of these models included in the training data to
precisely recognize images captured by specific cameras. However, selecting a suitable set of
camera models is not a trivial issue. Expanding the training dataset to encompass all camera
types is also impractical, given the fact that the number of camera types continues to grow
over time. In image forensics, identifying the exact camera model that captured an image is often unnecessary; the primary objective is to confirm whether a set of examined images come from the same camera model, in order to expose intellectual property infringement or to reveal image splicing fraud [11]. Bondi et al. [12] discovered that CNN-based camera model identification can be
employed for classifying and localizing image splicing operations. Ref. [13] fed features
extracted by the network into a Siamese network for further comparison. Building upon
these findings, the proposed UFCC scheme aims to extend the ideas presented by [13] to
develop a unified approach for detecting manipulated image regions and determining the
authenticity of videos.
2.2. Deepfake
Deepfake operations are frequently utilized for facial manipulation, changing the
visual representation of individuals in videos to diverge from the original content or
substitute their identities. The spectrum of Deepfake operations has expanded over time
and elicited numerous concerns. Next, we analyze the strengths and weaknesses of each
Deepfake technique and track the evolution of these approaches.
1. Identity swapping
Identity swapping entails substituting faces in a source image with the face of another
individual, effectively replacing the facial features of the target person while retaining the
original facial expressions. The use of deep learning methods for identity replacement
can be traced back to the emergence of Deepfakes in 2017 [14]. Deepfakes employed an
autoencoder architecture comprising an encoder–decoder pair, where the encoder extracts
latent facial features and the decoder reconstructs the target face. Korshunova et al. [15]
utilized a fully convolutional network along with style transfer techniques, and adopted
multiple loss functions with variation regularization to generate realistic images. However,
these approaches necessitate a substantial amount of data for both the source and target
individuals for paired training, making the training process time-consuming.
In an effort to enhance the efficiency of Deepfakes, Zakharov et al. [16] proposed GAN-
based few-shot or one-shot learning to generate realistic talking-head videos from images.
Various studies have focused on extensive meta-learning using large-scale video datasets
over extended periods. Additionally, self-supervised learning methods and the generation
of forged identities based on independently encoded facial features and annotations have
been explored. Zhu et al. [17] extended the latent spaces to preserve more facial details and
employed StyleGAN2 [18] to generate high-resolution swapped facial images.
2. Expression reenactment
Expression reenactment involves transferring the facial expressions, gestures, and head
movements of the source person onto the target person while preserving the identity of the
target individual. These operations aim to modify facial expressions while synchronizing lip
movements to create fictional content. Techniques such as 3D face reconstruction and GAN
architectures were employed to capture head geometry and motions. Thies et al. [19] introduced
3D facial modeling combined with image rendering, allowing for the real-time transfer of
spoken expressions captured by a regular web camera to the face of the target person.
While GAN-based methods can generate realistic images, achieving highly convincing
reenactment for unknown identities requires substantial training data. Kim et al. [20]
proposed fusing spatial–temporal encoding and conditional GANs (cGANs) in static im-
ages to synthesize target video avatars, incorporating head poses, facial expressions, and
eye movements, resulting in highly realistic scenes. Other research has explored fully
unsupervised methods utilizing dual cGANs to train emotion–action units for generating
facial animations from single images.
Recent advancements include few-shot or one-shot facial expression reenactment
techniques, alleviating the training burden on large-scale datasets. These approaches
adopt strategies such as image attention analysis, target feature alignment, and landmark
transformation to prevent quality degradation due to limited or mismatched data. Such
methods eliminate the need for additional identity-adaptive fine-tuning, making them
suitable for the practical applications of Deepfake. Fried et al. [21] devised content-based
editing methods to fabricate forged speech videos, modifying speakers’ head movements
to match the dialogue content.
3. Face synthesis
Facial synthesis primarily revolves around the creation of entirely new facial images
and finds applications in diverse domains such as video games and 3D modeling. Many
facial synthesis methods leverage GAN models to enhance resolution, image quality, and
realism. StyleGAN [22] elevated the image resolution initially introduced by ProGAN [23],
and subsequent improvements were made with StyleGAN2 [18], which effectively elimi-
nated artifacts to further enhance image quality.
The applications of GAN-based facial synthesis methods are varied, encompassing
facial attribute translation [22,24], the combination of identity and attributes, and the
removal of specific features. Some synthesis techniques extend to virtual makeup trials,
enabling consumers to virtually test cosmetics [25] without the need for physical samples
or in-person visits. Additionally, these methods can be applied to synthesis operations
involving the entire body; for instance, DeepNude utilized the Pix2PixHD GAN model [26]
to patch clothing areas and generate fabricated nude images.
4. Facial attribute manipulation
Facial attribute manipulation, also known as facial editing or modification, entails
modifying facial attributes like hair color, hairstyle, skin tone, gender, age, smile, glasses,
and makeup. These operations can be considered as a form of conditional partial facial
synthesis. GAN methods, commonly used for facial synthesis, are also employed for facial
attribute manipulation. Choi et al. [24] introduced a unified model that simultaneously
trains multiple datasets with distinct regions of interest, allowing for the transfer of various
facial attributes and expressions. This approach eliminates the need for additional cross-
domain models for each attribute. Extracted facial features can be analyzed in different
latent spaces, providing more precise control over attribute manipulation within facial
editing. However, it is worth noting that the performance may degrade when dealing with
occluded faces or when the face lies outside the expected range.
5. Hybrid approaches
A potential trend is emerging wherein different Deepfake techniques are amalgamated
to form hybrid approaches, rendering them more challenging to identify. Nirkin et al. [27]
introduced a GAN model for real-time face swapping, incorporating a fusion of reenact-
ment and synthesis. Some approaches may involve the use of two separate Variational
Autoencoders (VAEs) to convert facial features into latent vectors. These vectors are then
conditionally adjusted for the target identity. These methods facilitate the application of
multiple operations to any combination of two faces without requiring retraining. Users
have the flexibility to freely swap faces and modify facial parameters, including age, gen-
der, smile, hairstyle, and more. More advanced and sophisticated methods also exist; for example, Ref. [28] adopted a multimodal fusion approach to detect fake news, providing further protection mechanisms for social media platforms.
3. Proposed Scheme
This section offers in-depth insights into the UFCC scheme. Section 3.1 presents the
system and architecture diagram, while Sections 3.2–3.4 delve into the network design and
the mechanism for addressing image tampering. Section 3.5 elaborates on how the same
methodology is employed in assessing the authenticity of Deepfake videos.
The distinction between static image manipulation detection and Deepfake video detection stems from the nature of the tampered areas identified. Image manipulation detection identifies tampered regions that can be anywhere within an image. On the other hand, Deepfake video detection specifically targets facial areas, given that most Deepfake methods alter only faces within frames. In videos, we utilize frame-by-frame patch comparison to assess whether the faces appearing in target frames are fabricated. This approach takes into account the continuity of visual content, as Deepfake operations tend to disrupt this continuity across frames.

3.2. Feature Extractor
The goal of the feature extractor depicted in Figure 1 diverges from that of typical CNNs, which often focus on learning to recognize content such as people, cars, and animals in images. The scheme is designed to identify subtle signals specific to camera types, signals unrelated to imagery content. To achieve this, the model focuses on high-frequency features through filtering. Bayar [39] introduced the concept of constrained convolutions for extracting high-frequency features. During training, the center of the prediction error filter is consistently set to −1, while the surrounding points are normalized to ensure their sum equals 1. Figure 2 illustrates the design of the feature extraction networks, including detailed parameter settings. The input comprises patches of the investigated image with dimensions set as 128 × 128 × 3 (width × height × channels). The constrained convolution layer serves as the initial convolution layer, followed by four convolution blocks with similar structures.
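For concreteness, a minimal PyTorch sketch of such a constrained convolution layer is given below; the class name, kernel size, and channel counts are illustrative assumptions rather than the exact configuration shown in Figure 2.

```python
import torch
import torch.nn as nn

class ConstrainedConv2d(nn.Conv2d):
    """Bayar-style constrained convolution (sketch): before each forward pass,
    the centre tap of every kernel is forced to -1 and the remaining taps are
    rescaled so that they sum to 1, yielding a prediction-error (high-frequency) filter."""

    def _apply_constraint(self):
        with torch.no_grad():
            w = self.weight                      # shape: (out_ch, in_ch, k, k)
            c = w.shape[-1] // 2
            w[:, :, c, c] = 0.0                  # exclude the centre tap from the sum
            s = w.sum(dim=(2, 3), keepdim=True)  # sum of the surrounding taps
            w /= (s + 1e-8)                      # normalise surroundings to sum to 1
            w[:, :, c, c] = -1.0                 # centre fixed to -1

    def forward(self, x):
        self._apply_constraint()
        return super().forward(x)

# Illustrative use on a single 128 x 128 x 3 patch
layer = ConstrainedConv2d(in_channels=3, out_channels=3, kernel_size=5, padding=2, bias=False)
residual = layer(torch.randn(1, 3, 128, 128))    # high-frequency prediction residual
print(residual.shape)                            # torch.Size([1, 3, 128, 128])
```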
3.4. Similarity Network
The feature extraction classifier we have designed exhibits the capability to discern the known camera types or categories considered during the training process. However, the outcome may be inconsistent when confronted with camera types absent from the original training dataset. Deriving a generalized classification approach would necessitate additional camera-type data for retraining. Unfortunately, acquiring an extensive set of camera types is impractical since we can expect many new ones to appear. In contrast to relying solely on camera type classification, it is more suitable for image forgery detection to determine whether two investigated image patches come from cameras of the same type. Drawing inspiration from the work of Huh et al. [40] and Mayer and Stamm [13], the Siamese network, as illustrated in Figure 3, is well-suited for such similarity comparisons between two inputs. A brief description of the Siamese network concept is provided below.

The Siamese network consists of two identical subnetworks running in parallel, sharing weights and parameters. These subnetworks take two inputs and map them into a lower-dimensional feature space. The feature vectors of the inputs are compared by calculating their distance or similarity. This concept, introduced by Bromley et al. [41], was initially used for verifying signature matches on checks against a bank's reference signatures. If the similarity score exceeds a threshold, the signatures are considered from the same person; otherwise, a forgery case is suspected. Nair et al. [42] extended this idea to face verification. Unlike typical classifier networks, Siamese networks focus on comparing differences between extracted feature vectors, leading to widespread use in natural language processing and object tracking. For example, Ref. [43] incorporated a Bidirectional Long Short-Term Memory Network (Bi-LSTM) as the core of the parallel networks, facilitating the evaluation of semantic similarity between strings of different lengths. Ref. [44] utilized a Siamese network for feature extraction, followed by distinct classification and regression tasks in both the foreground and background, employing two parallel Region Proposal Networks (RPNs). This arrangement is commonly referred to as Siamese-RPN.
We employ a Siamese network to compare features from different patches, assessing their similarity and determining whether they belong to the same camera type. This approach enhances the generalizability of the extracted features. The design, depicted in Figure 4, involves training a robust feature extractor and freezing its parameters. The classification output layer is removed, and the second-to-last fully connected layer serves as the feature output, serving as the backbone output of the parallel networks. A pair of fully connected layers with shared weights processes the features separately on each side. The resulting features from both sides are multiplied element-wise and concatenated with the multiplied results. Finally, this concatenated feature undergoes two additional fully connected layers and a ReLU activation function to produce the output. The weights are updated using the contrastive loss [45] as the loss function, and the calculation of this loss function is shown in (1).

L(W, Y, X_1, X_2) = (1/2)(1 − Y)(D_W)^2 + (1/2)(Y){max(0, m − D_W)}^2    (1)
where D_W denotes the distance between an input sample and a positive sample. The binary label Y is set to 1 if an input sample and a positive sample belong to the same class, and 0 otherwise. The parameter m represents a predefined margin, regulating the minimum distance between negative and positive samples. The dual purpose of this loss function is as follows: when an input sample and a positive sample share the same class, we aim for their distance to be minimal, resulting in a smaller loss value. Conversely, when an input sample and a negative sample belong to different classes, their distance must be at least the predefined margin m, ensuring the loss is 0. This enables the clustering of similar samples in the embedding space, preventing them from being scattered apart.
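As a reference for the behaviour described above (same-class pairs pulled together, different-class pairs pushed at least a margin m apart), a minimal PyTorch sketch follows; the margin value of 1.0 and the use of Euclidean distance for D_W are assumptions, and the two terms are assigned according to the label convention stated in the text (y = 1 for a same-class pair).

```python
import torch
import torch.nn.functional as F

def contrastive_loss(feat_a, feat_b, y, margin: float = 1.0):
    """Sketch of the contrastive loss [45].

    y = 1: pair from the same camera type  -> pull the features together.
    y = 0: pair from different camera types -> push them apart by at least `margin`.
    The margin value and the Euclidean distance are illustrative assumptions.
    """
    d = F.pairwise_distance(feat_a, feat_b)                        # D_W
    same = 0.5 * y * d.pow(2)                                      # penalise distance for same-class pairs
    diff = 0.5 * (1 - y) * torch.clamp(margin - d, min=0).pow(2)   # margin penalty for different-class pairs
    return (same + diff).mean()

# Illustrative call on a batch of 8 feature pairs
loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256),
                        torch.randint(0, 2, (8,)).float())
```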
Figure 4. The similarity evaluation network.
As previously stated, incorporating a Siamese network in this context not only im-
proves the model’s ability to generalize but also extends its functionality beyond mere
classification to patch-level comparisons. The dataset containing camera types utilized for
this purpose is distinct from the one employed in training the feature extractor, broadening
the detection capability to include unknown cameras.
Figure 5. An image forgery detection example.
Additionally, we noticed that patches with similar variations in color or brightness tend to receive higher similarity values, as the adopted network may emphasize high-dimensional features distinct from the image content. For example, when comparing sky and ground patches extracted from the same image, the similarity score between sky patches tends to be higher than that between sky and ground patches. This similarity bias could result in false positives due to content resemblance. To address this issue, we propose the following two strategies.
Selecting Target Patches
Using a single target patch for comparison is a feasible solution, but it may overlook potentially shared characteristics among other patches, leading to possible detection errors. Randomly selecting target patches for comparison can result in unstable sampling and inconsistent results. To address these limitations, we leverage the Structural Similarity Index (SSIM) [46] as the patch selection criterion to choose candidate patches for comparison. The SSIM measures the product of three similarity components: luminance similarity, contrast similarity, and structure similarity, calculated separately. SSIM values range from 0 to 1, with higher values indicating greater similarity and lower values indicating dissimilarity. The patch selection process is outlined as follows:
1. Randomly select a patch as the initial target patch.
2. Execute a scanning-based detection on the entire image using the target patch.
3. Select candidate patches that exhibit similarity to the target patch but have relatively low SSIM values.
4. From the candidate patches, randomly pick one as the new target patch for the next iteration.
5. Repeat Steps 2 to 4 until the preset number of iterations is achieved.

In Step 3, selecting patches judged similar to the previous target ensures each detection round is performed under a more consistent setting, while low SSIM values indicate that candidate patches have content or structures different from the previous target patch, allowing us to select patches with larger content disparities. This approach facilitates multiple detection rounds by avoiding situations where newly selected target patches closely resemble previously chosen ones, which could otherwise lead to repeated errors in the Siamese network's decisions.
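A rough sketch of this selection loop is shown below. It assumes scikit-image's structural_similarity for the SSIM, treats 0.5 as the "relatively low" SSIM cut-off, and replaces the Siamese scanning step with a placeholder function, since those details are not fully specified here.

```python
import random
import numpy as np
from skimage.metrics import structural_similarity

def select_next_target(patches, current_target, ssim_threshold=0.5):
    """Steps 3-4 (sketch): keep candidate patches whose SSIM against the current
    target is relatively low (i.e. different content), then randomly pick one as
    the next target. The 0.5 threshold is an assumed value for illustration."""
    candidates = [
        p for p in patches
        if structural_similarity(p, current_target, channel_axis=-1, data_range=1.0) < ssim_threshold
    ]
    return random.choice(candidates) if candidates else random.choice(patches)

# Minimal illustration with random stand-ins for image patches and the detector
patches = [np.random.rand(128, 128, 3).astype(np.float32) for _ in range(16)]

def detect_with_target(ps, target):           # placeholder for the Siamese scanning step
    return np.random.rand(len(ps))

target = random.choice(patches)               # Step 1: random initial target
for _ in range(3):                            # Steps 2-5 over a preset number of rounds
    round_prediction = detect_with_target(patches, target)   # Step 2
    target = select_next_target(patches, target)              # Steps 3-4
```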
In addition to using the SSIM to select target patches, the accumulation of similarity measurements from each comparison, combined with the new score, also requires adjustments based on the SSIM. The adjustment refines the weighting of accumulated predictions and the current prediction. Specifically, if the SSIM is higher than a threshold (set as 0.5), the accumulated prediction results from previous rounds contribute M% (M = 40), and the current prediction result contributes (1 − M)%, which is larger than M%. Conversely, if the SSIM is lower than or equal to the threshold, the accumulated predictions contribute more than the current prediction. This strategy helps to mitigate the impact of content variations, and it has been observed that these adjustments achieve more consistent outcomes. Figure 6 illustrates an example. The process involves three rounds of comparative detection. The initial target patch in Figure 6a is chosen randomly, while the target patches in (b,c) are chosen based on the previous round's results and the SSIM indicator. We can see that this method effectively corrects biases caused by single-round target patch detection.
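In code, the SSIM-gated weighting could be expressed as in the following sketch; M = 40 and the 0.5 threshold follow the values stated above, while everything else is illustrative.

```python
def update_accumulated_prediction(accumulated, current, ssim_value,
                                  m_percent=40, ssim_threshold=0.5):
    """Blend the accumulated predictions of earlier rounds with the current round.
    High SSIM (content close to the previous target): trust the current round more.
    Low SSIM (content differs): keep more weight on the accumulated history."""
    m = m_percent / 100.0
    if ssim_value > ssim_threshold:
        return m * accumulated + (1.0 - m) * current   # 40% history, 60% current
    return (1.0 - m) * accumulated + m * current       # 60% history, 40% current
```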
While the results can roughly identify tampered areas, there is no clear segmentation between tampered and unaltered regions. To address this, patch-level thresholding based on Otsu's algorithm is employed, drawing segmentation lines to ensure no ambiguous zones for inspection and assessment. Further refinement through foreground extraction is then applied to enhance boundary details and avoid a blocky effect. Foreground extraction is achieved through the use of FBA Matting (Foreground–Background-Aware Matting) [47], a technique that accurately extracts the contours of foreground objects from images. FBA Matting employs ResNet-50 and U-Net models to classify pixels, utilizing previously acquired masks to categorize pixels into three classes: background, foreground, and unknown. This is performed by constructing a "Tri-Map", a ternary graph, based on the original mask. The Tri-Map generation involves expanding the original mask to encompass potentially unknown areas, or neutral regions, through dilation. Simultaneously, erosion is applied to draw the center region of the mask with high confidence. These steps are illustrated in Figure 7. Once the completed ternary diagram and the original image are provided to FBA Matting, it applies predictions in the neutral regions of the ternary diagram based on the correlation between foreground and background areas, resulting in refined and accurate segmentation.
Figure 7. Tri-map generation.
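The Tri-Map construction can be sketched with OpenCV morphology as follows; the kernel size and iteration counts are assumptions, as the paper does not state them.

```python
import cv2
import numpy as np

def build_trimap(mask, kernel_size=15, iterations=3):
    """Sketch of Tri-Map generation from a binary tamper mask (values 0/255).
    Dilation marks a band of 'unknown' pixels around the mask, while erosion keeps
    a high-confidence foreground core. Kernel size and iterations are assumed values."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    dilated = cv2.dilate(mask, kernel, iterations=iterations)   # possible foreground + unknown band
    eroded = cv2.erode(mask, kernel, iterations=iterations)     # confident foreground core
    trimap = np.zeros_like(mask)        # 0   -> background
    trimap[dilated > 0] = 128           # 128 -> unknown / neutral region
    trimap[eroded > 0] = 255            # 255 -> confident foreground
    return trimap

# Example: a square tamper mask
mask = np.zeros((256, 256), np.uint8)
mask[80:180, 80:180] = 255
trimap = build_trimap(mask)
```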
Figure 8. Patch selection: (a) overlapped patches, (b) non-overlapped patches, (c) patches around the center point, inward two-thirds of the side of the face block, to select suitable areas in the region of interest for checking similarity, and (d) central and peripheral parts of a face.
Considering the precise positions of the patches, as illustrated in Figure 9, we observed that focusing solely on faces may not be sufficient. Certain portions of the peripheral areas around the faces can retain useful features. To address this, we designated five anchor points for each face: the upper, lower, left, and right, as well as the center.
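One possible reading of this anchor-point design is sketched below (the exact offsets are not given in the excerpt above, so the shift ratio is a hypothetical value): five patch centres are derived from a face bounding box, namely the box centre plus four points shifted toward its upper, lower, left, and right sides.

```python
def face_anchor_points(x, y, w, h, shift_ratio=1/3):
    """Sketch: five anchor points (center, upper, lower, left, right) for a face
    bounding box given as (x, y, w, h). The shift ratio is an assumption."""
    cx, cy = x + w / 2, y + h / 2
    dx, dy = w * shift_ratio, h * shift_ratio
    return {
        "center": (cx, cy),
        "upper":  (cx, cy - dy),
        "lower":  (cx, cy + dy),
        "left":   (cx - dx, cy),
        "right":  (cx + dx, cy),
    }

anchors = face_anchor_points(200, 150, 120, 140)
```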
4. Experimental Results

4.1. Experiment Settings
This study was conducted on Ubuntu 18.04 LTS using Python as the primary programming language. The environment was built with PyTorch 1.12.1, Torchvision 0.13.1, and OpenCV 4.6.0. The hardware setup includes an Intel® Core(TM) i7-8700K CPU at 3.70 GHz, 64 GB RAM, and a GeForce RTX 2080 Ti GPU. To enhance GPU computational performance, the CUDA-11.2 framework is employed, along with cuDNN 8.0 for accelerating deep learning tasks.

For the feature extractor network, the optimizer used is AdamW, with an initial learning rate of 0.001 and a batch size of 128. The learning rate is reduced by a factor of 1/5 every 20 iterations, and a total of 200 iterations are performed during training. In the Siamese network, the same optimizer (AdamW) is employed, with a batch size of 64 and an initial learning rate of 5 × 10^−4. The new learning rate is calculated as the old learning rate divided by the factor 1.2^(epoch/10) in each iteration. The training process executes for a total of 200 iterations.
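These settings translate roughly into the PyTorch configuration sketched below; the model objects are placeholders, and the Siamese schedule encodes the decay factor 1.2^(epoch/10) as reconstructed above (one reading of the stated rule).

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR, LambdaLR

# Feature extractor: AdamW, lr 0.001, lr multiplied by 1/5 every 20 steps
extractor = torch.nn.Linear(512, 40)                   # placeholder for the real network
opt_ext = AdamW(extractor.parameters(), lr=1e-3)
sched_ext = StepLR(opt_ext, step_size=20, gamma=0.2)

# Siamese network: AdamW, lr 5e-4, lr(step) = initial_lr / 1.2**(step / 10)
siamese = torch.nn.Linear(512, 1)                      # placeholder for the real network
opt_sia = AdamW(siamese.parameters(), lr=5e-4)
sched_sia = LambdaLR(opt_sia, lr_lambda=lambda step: 1.0 / (1.2 ** (step / 10)))

for step in range(200):
    # ... one training pass over the data would go here ...
    sched_ext.step()
    sched_sia.step()
```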
4.2. Training Data
When training the feature extractor as described in Section 3.2, we compiled various camera types from datasets including the VISION dataset [48], the Camera Model Identification Challenge (CMIC) dataset [49], and our collected dataset. After removing duplicates and conducting thorough tests and comparisons, we identified the 40 most suitable classes (24 from VISION, 8 from CMIC, and 8 from our own collection) for our training dataset. For the similarity network, outlined in Section 3.3, we utilized 25 camera-type classes as our training data, which are from the Dresden dataset [50]. The difference in the selected datasets compared to those used for the feature extractor aims to enable the model to learn from the 40 camera-type classes first. Subsequently, the model can be fine-tuned using the 25 unknown classes during the similarity network learning, further enhancing its capability to deal with unknown camera types. It should be noted that all the images in these datasets are used for training, since the investigated images in the experiments will not be restricted to datasets containing camera-type information. Moreover, it is not easy to collect additional images for specific camera types.
In terms of patch sizes, we experimented with three sizes: 256 × 256, 128 × 128, and
64 × 64. Patches of size 256 × 256 were deemed too large for making precise predictions
on smaller objects, while blocks of size 64 × 64 were considered a bit small, resulting
in significantly increased error rates. Therefore, we settled on a block size of 128 × 128.
During the training of the feature extractor, images of each category were partitioned into
overlapping or non-overlapping blocks of size 128 × 128 pixels. We randomly selected
20,000 blocks from each category for both overlapping and non-overlapping cases, resulting
in a total of 1,560,000 patches for training the feature extractor’s classification network. An
additional set of 579,973 patches was reserved for validation. For the similarity network,
patches of size 128 × 128 were extracted in a non-overlapping manner. We collected
40,000 patches for each category, resulting in a total of 1,000,000 patches for training the
similarity network. Another set of 600,000 patches was reserved for validation. Table 1 shows the accuracy rates of the feature extractor and the similarity network; both exceed 0.9, indicating that the model training is quite successful.
Table 1. Accuracy rates of the feature extractor and the similarity network.

Model               Accuracy
Feature extractor   0.95
Siamese network     0.90
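For reference, non-overlapping 128 × 128 patches can be cut from an image as in the short sketch below; overlapping extraction would simply use a smaller stride.

```python
import numpy as np

def extract_patches(image, patch_size=128, stride=128):
    """Cut patch_size x patch_size patches from an H x W x C image.
    stride == patch_size gives non-overlapping patches; a smaller stride overlaps them."""
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - patch_size + 1, stride):
        for x in range(0, w - patch_size + 1, stride):
            patches.append(image[y:y + patch_size, x:x + patch_size])
    return patches

patches = extract_patches(np.zeros((720, 1280, 3), dtype=np.uint8))
print(len(patches))   # 5 rows x 10 columns = 50 patches
```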
We assessed the accuracy by comparing the camera-type model predictions from the
feature extractor against the ground truth, forming a multi-class confusion matrix. Most of
the accuracy rates in this matrix exceeded 94%, with the lowest accuracy within a single
class at 88.6% and the highest at 99.9%. The original training dataset comprised 85 classes,
and due to images from various sources, instances of similar or identical cameras led to
misclassifications in the confusion matrix. To mitigate this, we employed k-fold cross-
validation to ensure no errors were introduced while removing converged but effective
classes. As a result, 65 classes remained, with 40 selected for the feature extractor training
and 25 reserved for the training of the Siamese or similarity network. This approach aims
to maintain robust performance while addressing potential misclassifications arising from
similar cameras.
Figure 10. Examples of image forgery detection.
4.3.2. Evaluation Metrics
Next, we assess the model's performance using the following metrics and present the results in a table for comparison with existing studies. The four adopted metrics are mAP [52], F1 measure [53], MCC [54], and cIoU [55]. We categorize all pixels into the four classes according to Table 2.

Table 2. Pixel Detection Classification.

Prediction\Truth   Forgery               Not Forgery
Forgery            True Positive (TP)    False Positive (FP)
Not Forgery        False Negative (FN)   True Negative (TN)
• mAP (mean Average Precision)
The metric mAP (mean Average Precision) is calculated based on the Precision–Recall
(PR) curve, which illustrates the relationship between the model’s precision and recall at
different thresholds. The precision and recall are computed as follows:
precision = TP / (TP + FP)    (2)

recall = TP / (TP + FN)    (3)
The PR curve assists in determining the Average Precision (AP) value for each class.
AP represents the average precision across various recall levels. The AP values for all
classes are averaged to obtain the mAP, following [52]. It is important to note that
these calculations are based on each pixel and are subsequently averaged to derive a single
detection value for an image.
• F1-measure
The F-score [53], as shown in (4), is a metric that considers both the precision and recall
rates. It incorporates a custom parameter denoted as β. When β is set to 1, the resulting
F-score is the F1-score, where the precision and recall are given equal weights.
F-score = ((1 + β^2) × precision × recall) / (β^2 × precision + recall)    (4)
• MCC (Matthews Correlation Coefficient)

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))    (5)
The MCC has a range of values between −1 and 1, where 1 indicates a perfect pre-
diction, 0 signifies a random prediction, and −1 suggests an opposite prediction. The
MCC can handle the imbalance between correct and incorrect predictions on imbalanced
datasets. We found that the MCC is the most effective indicator, capable of dealing with
extreme scenarios.
• cIoU
cIoU is calculated using a weighted Jaccard score. IoU is computed separately for
tampered and non-tampered regions, and then the average value is calculated based
on the proportions within their respective frames [55]. All the pixels are involved in
this calculation.
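A compact pixel-level implementation of these metrics could look as follows; the cIoU here is a direct weighted-Jaccard reading of the description above and is not necessarily the exact formulation of [55].

```python
import numpy as np

def forensic_metrics(pred, truth, beta=1.0):
    """pred, truth: boolean masks where True marks pixels predicted/labelled as forged."""
    tp = float(np.sum(pred & truth))
    fp = float(np.sum(pred & ~truth))
    fn = float(np.sum(~pred & truth))
    tn = float(np.sum(~pred & ~truth))

    precision = tp / (tp + fp + 1e-12)
    recall = tp / (tp + fn + 1e-12)
    f_score = (1 + beta**2) * precision * recall / (beta**2 * precision + recall + 1e-12)
    mcc = (tp * tn - fp * fn) / (np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) + 1e-12)

    # Weighted Jaccard: IoU of forged and untouched regions, weighted by their areas
    iou_forged = tp / (tp + fp + fn + 1e-12)
    iou_clean = tn / (tn + fp + fn + 1e-12)
    w = truth.mean()
    ciou = w * iou_forged + (1 - w) * iou_clean

    return {"precision": precision, "recall": recall, "f1": f_score, "mcc": mcc, "ciou": ciou}
```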
We compare the results with different models using FF++ [32] and the Celeb-DF [68]
(CDF) dataset. It is worth mentioning that Celeb-DF comprises celebrity videos downloaded
from YouTube and processed by more advanced Deepfake operations. Table 6 shows that
the proposed scheme still yields the best results. Furthermore, some existing work even
used the test dataset’s videos for training. The proposed UFCC scheme made use of
original/unaltered imagery patches for training, without targeting a specific dataset.
Table 6. FF++ [32] and Celeb-DF [68] Deepfake video detection results.
Matting. In Deepfake video detection, facial regions are initially extracted, and similar
detection methods are applied to patches in frames to determine whether the investigated
video is a Deepfake forgery.
The UFCC scheme, when compared to existing methods, not only demonstrates superior
performance but also distinguishes itself by presenting a comprehensive detection methodol-
ogy capable of handling both image and video manipulation scenarios. In detecting image
inpainting, UFCC surpasses existing work on the DSO-1 dataset [51] with higher mAP [52], cIoU [55], MCC [54], and F1 [53] values of 0.58, 0.83, 0.56, and 0.63, respectively. For the
detection of Deepfake videos, UFCC excels on the DF [14], F2F [19], and NT [65] tests with
accuracy rates of 0.982, 0.984, and 0.973, respectively. Although the accuracy of the FS [64]
testing is slightly lower at 0.928, it is worth noting that FS contains less realistic images and
is considered a milder threat. Notably, our training data excludes manipulated images or
Deepfake datasets, enhancing the proposed scheme’s generalization capability. This absence
of specific target tampering operations makes the UFCC scheme more flexible and adaptive,
enabling it to handle a wide range of content manipulations, including mixed scenarios.
Author Contributions: Conceptualization, P.-C.S. and T.-Y.K.; methodology, P.-C.S. and T.-Y.K.;
software, B.-H.H.; validation, B.-H.H. and P.-C.S.; formal analysis, P.-C.S. and T.-Y.K.; investigation,
P.-C.S. and T.-Y.K.; resources, P.-C.S.; data curation, B.-H.H.; writing—original draft preparation,
P.-C.S. and B.-H.H.; writing—review and editing, P.-C.S. and T.-Y.K.; visualization, B.-H.H. and P.-C.S.;
supervision, P.-C.S. and T.-Y.K.; project administration, P.-C.S. and T.-Y.K.; funding acquisition, P.-C.S.
All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the National Science and Technology Council, Taiwan, grant
number NSTC 111-2221-E-008-098 and grant number 112-2221-E-008-077.
Data Availability Statement:
Model training datasets:
1. VISION [48]: https://lesc.dinfo.unifi.it/VISION/
2. CMIC [49]: https://www.kaggle.com/competitions/sp-society-camera-model-identification
3. Dresden [50]: http://forensics.inf.tu-dresden.de/ddimgdb/
Image manipulation test dataset:
DSO-1 [51]: https://recodbr.wordpress.com/code-n-data/#dso1_dsi1
Deepfake video test dataset:
1. Deepfakes [14]: https://github.com/deepfakes/faceswap
2. Face2Face [19]: https://www.kaggle.com/datasets/mdhadiuzzaman/face2face
3. Faceswap [64]: https://github.com/MarekKowalski/FaceSwap
4. NeuralTextures [65]: https://github.com/SSRSGJYD/NeuralTexture
5. FF++ [32]: https://github.com/ondyari/FaceForensics
6. Celeb-DF [68]: https://github.com/yuezunli/celeb-deepfakeforensics
Conflicts of Interest: The authors declare no conflicts of interest. The funders had no role in the
design of the study; in the collection, analyses, or interpretation of the data; in the writing of the
manuscript; or in the decision to publish the results.
References
1. Kuo, T.Y.; Lo, Y.C.; Huang, S.N. Image forgery detection for region duplication tampering. In Proceedings of the 2013 IEEE
International Conference on Multimedia and Expo, San Jose, CA, USA, 15–19 July 2013; pp. 1–6. [CrossRef]
2. Muhammad, G.; Hussain, M.; Bebis, G. Passive copy move image forgery detection using undecimated dyadic wavelet transform.
Digit. Investig. 2012, 9, 49–57. [CrossRef]
3. Kirchner, M.; Gloe, T. Forensic camera model identification. In Handbook of Digital Forensics of Multimedia Data and Devices; Wiley:
Hoboken, NJ, USA, 2015; pp. 329–374.
4. Swaminathan, A.; Wu, M.; Liu, K.R. Nonintrusive component forensics of visual sensors using output images. IEEE Trans. Inf.
Forensics Secur. 2007, 2, 91–106. [CrossRef]
5. Filler, T.; Fridrich, J.; Goljan, M. Using sensor pattern noise for camera model identification. In Proceedings of the 15th IEEE
International Conference on Image Processing, San Diego, CA, USA, 12–15 October 2008; IEEE: New York, NY, USA, 2008;
pp. 1296–1299.
6. Xu, G.; Shi, Y.Q. Camera model identification using local binary patterns. In Proceedings of the IEEE International Conference on
Multimedia and Expo, Melbourne, Australia, 9–13 July 2012; IEEE: New York, NY, USA, 2012; pp. 392–397.
7. Thai, T.H.; Cogranne, R.; Retraint, F. Camera model identification based on the heteroscedastic noise model. IEEE Trans. Image
Process. 2013, 23, 250–263. [CrossRef] [PubMed]
8. Van, L.T.; Emmanuel, S.; Kankanhalli, M.S. Identifying source cell phone using chromatic aberration. In Proceedings of the IEEE
International Conference on Multimedia and Expo, Beijing, China, 2–5 July 2007; IEEE: New York, NY, USA, 2007; pp. 883–886.
9. Farid, H. Image forgery detection. IEEE Signal Process. Mag. 2009, 26, 16–25. [CrossRef]
10. Wang, H.-T.; Su, P.-C. Deep-learning-based block similarity evaluation for image forensics. In Proceedings of the IEEE International
Conference on Consumer Electronics-Taiwan (ICCE-TW), Taoyuan, Taiwan, 28–30 September 2020; IEEE: New York, NY, USA, 2020.
11. Dirik, A.E.; Memon, N. Image tamper detection based on demosaicing artifacts. In Proceedings of the 16th IEEE International
Conference on Image Processing (ICIP), Cairo, Egypt, 7–10 November 2009; IEEE: New York, NY, USA, 2009; pp. 1497–1500.
12. Bondi, L.; Lameri, S.; Guera, D.; Bestagini, P.; Delp, E.J.; Tubaro, S. Tampering detection and localization through clustering of
camera-based CNN features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops
(CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 1855–1864.
13. Mayer, O.; Stamm, M.C. Learned forensic source similarity for unknown camera models. In Proceedings of the IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; IEEE: New York, NY, USA,
2018; pp. 2012–2016.
14. Deepfakes. Available online: https://github.com/deepfakes/faceswap (accessed on 21 November 2021).
15. Korshunova, I.; Shi, W.; Dambre, J.; Theis, L. Fast face-swap using convolutional neural networks. In Proceedings of the IEEE
International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3677–3685.
16. Zakharov, E.; Shysheya, A.; Burkov, E.; Lempitsky, V. Few-shot adversarial learning of realistic neural talking head models. In
Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November
2019; pp. 9459–9468.
17. Zhu, Y.; Li, Q.; Wang, J.; Xu, C.-Z.; Sun, Z. One shot face swapping on megapixels. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 4834–4844.
18. Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and improving the image quality of styleGAN. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020;
pp. 8110–8119.
19. Thies, J.; Zollhofer, M.; Stamminger, M.; Theobalt, C.; Nießner, M. Face2face: Real-time face capture and reenactment of RGB
videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June
2016; pp. 2387–2395.
20. Kim, H.; Garrido, P.; Xu, W.; Thies, J.; Nießner, M.; Pérez, P.; Richardt, C.; Zollhöfer, M.; Theobalt, C. Deep video portraits. ACM
Trans. Graph. 2018, 37, 1–14. [CrossRef]
21. Fried, O.; Tewari, A.; Zollhöfer, M.; Finkelstein, A.; Shechtman, E.; Goldman, D.B.; Genova, K.; Jin, Z.; Theobalt, C.; Agrawala, M.
Text-based editing of talking-head video. ACM Trans. Graph. 2019, 38, 1–14. [CrossRef]
22. Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 4401–4410.
23. Karras, T.; Aila, T.; Laine, S.; Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. arXiv 2017,
arXiv:1710.10196.
24. Choi, Y.; Choi, M.; Kim, M.; Ha, J.-W.; Kim, S.; Choo, J. StarGAN: Unified generative adversarial networks for multi-domain
image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City,
UT, USA, 18–22 June 2018; pp. 8789–8797.
25. Nguyen, T.; Tran, A.T.; Hoai, M. Lipstick ain’t enough: Beyond color matching for in-the-wild makeup transfer. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 13305–13314.
26. Wang, T.-C.; Liu, M.-Y.; Zhu, J.-Y.; Tao, A.; Kautz, J.; Catanzaro, B. High-resolution image synthesis and semantic manipulation with
conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA,
18–22 June 2018; pp. 8798–8807.
Electronics 2024, 13, 804 21 of 22
27. Nirkin, Y.; Keller, Y.; Hassner, T. FSGAN: Subject agnostic face swapping and reenactment. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7184–7193.
28. Qu, Z.; Meng, Y.; Muhammad, G.; Tiwari, P. QMFND: A quantum multimodal fusion-based fake news detection model for social
media. Inf. Fusion 2024, 104, 102172. [CrossRef]
29. Zhou, P.; Han, X.; Morariu, V.I.; Davis, L.S. Two-stream neural networks for tampered face detection. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; IEEE: New
York, NY, USA, 2017; pp. 1831–1839.
30. Afchar, D.; Nozick, V.; Yamagishi, J.; Echizen, I. MesoNet: A compact facial video forgery detection network. In Proceedings of
the IEEE International Workshop on Information Forensics and Security (WIFS), Hong Kong, China, 11–13 December 2018; IEEE:
New York, NY, USA, 2018; pp. 1–7.
31. Nguyen, H.H.; Yamagishi, J.; Echizen, I. Capsule-forensics: Using capsule networks to detect forged images and videos. In
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17
May 2019; IEEE: New York, NY, USA, 2019; pp. 2307–2311.
32. Rossler, A.; Cozzolino, D.; Verdoliva, L.; Riess, C.; Thies, J.; Nießner, M. Faceforensics++: Learning to detect manipulated facial
images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2
November 2019; pp. 1–11.
33. Li, J.; Xie, H.; Li, J.; Wang, Z.; Zhang, Y. Frequency-aware discriminative feature learning supervised by single-center loss for face
forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA,
19–25 June 2021; pp. 6458–6467.
34. Liu, H.; Li, X.; Zhou, W.; Chen, Y.; He, Y.; Xue, H.; Zhang, W.; Yu, N. Spatial-phase shallow learning: Rethinking face forgery
detection in frequency domain. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
Nashville, TN, USA, 19–25 June 2021; pp. 772–781.
35. Nguyen, H.H.; Fang, F.; Yamagishi, J.; Echizen, I. Multi-task learning for detecting and segmenting manipulated facial images
and videos. In Proceedings of the IEEE International Conference on Biometrics Theory, Applications and Systems (BTAS), Tampa,
FL, USA, 23–26 September 2019; IEEE: New York, NY, USA, 2019; pp. 1–8.
36. Güera, D.; Delp, E.J. Deepfake video detection using recurrent neural networks. In Proceedings of the IEEE International
Conference on Advanced Video and Signal Based Surveillance (AVSS), Auckland, New Zealand, 27–30 November 2018; IEEE:
New York, NY, USA, 2018; pp. 1–6.
37. Li, Y.; Chang, M.-C.; Lyu, S. In ictu oculi: Exposing AI created fake videos by detecting eye blinking. In Proceedings of the IEEE
International Workshop on Information Forensics and Security (WIFS), Hong Kong, China, 11–13 December 2018; IEEE: New
York, NY, USA, 2018; pp. 1–7.
38. Yang, X.; Li, Y.; Lyu, S. Exposing deep fakes using inconsistent head poses. In Proceedings of the IEEE International Confer-
ence on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; IEEE: New York, NY, USA, 2019;
pp. 8261–8265.
39. Bayar, B.; Stamm, M.C. A deep learning approach to universal image manipulation detection using a new convolutional layer. In
Proceedings of the 4th ACM Workshop on Information Hiding and Multimedia Security, Vigo, Spain, 20–22 June 2016; pp. 5–10.
40. Huh, M.; Liu, A.; Owens, A.; Efros, A.A. Fighting fake news: Image splice detection via learned self-consistency. In Proceedings
of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 101–117.
41. Bromley, J.; Guyon, I.; LeCun, Y.; Säckinger, E.; Shah, R. Signature verification using a Siamese time delay neural network. Adv.
Neural Inf. Process. Syst. 1993, 6, 737–744. [CrossRef]
42. Nair, V.; Hinton, G.E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International
Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010; pp. 807–814.
43. Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High performance visual tracking with Siamese region proposal network. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8971–8980.
44. Neculoiu, P.; Versteegh, M.; Rotaru, M. Learning text similarity with Siamese recurrent networks. In Proceedings of the 1st
Workshop on Representation Learning for NLP, Berlin, Germany, 11 August 2016; pp. 148–157.
45. Hadsell, R.; Chopra, S.; LeCun, Y. Dimensionality reduction by learning an invariant mapping. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; IEEE: New York, NY, USA, 2006; Volume 2, pp. 1735–1742.
46. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
47. Forte, M.; Pitié, F. F, B, Alpha Matting. arXiv 2020, arXiv:2003.07711.
48. Shullani, D.; Fontani, M.; Iuliani, M.; Shaya, O.A.; Piva, A. Vision: A video and image dataset for source identification. EURASIP
J. Inf. Secur. 2017, 2017, 1–16. [CrossRef]
49. Stamm, M.; Bestagini, P.; Marcenaro, L.; Campisi, P. Forensic camera model identification: Highlights from the IEEE Signal
Processing Cup 2018 Student Competition. IEEE Signal Process. Mag. 2018, 35, 168–174. [CrossRef]
50. Gloe, T.; Böhme, R. The Dresden image database for benchmarking digital image forensics. In Proceedings of the ACM Symposium
on Applied Computing, Sierre, Switzerland, 21–26 March 2010; pp. 1584–1590.
51. De Carvalho, T.J.; Riess, C.; Angelopoulou, E.; Pedrini, H.; de Rezende Rocha, A. Exposing digital image forgeries by illumination
color classification. IEEE Trans. Inf. Forensics Secur. 2013, 8, 1182–1194. [CrossRef]
52. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J.
Comput. Vis. 2010, 88, 303–338. [CrossRef]
53. Van Rijsbergen, C.J. Information Retrieval; Butterworth-Heinemann: Oxford, UK, 1979.
54. Matthews, B.W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta
(BBA)-Protein Struct. 1975, 405, 442–451. [CrossRef]
55. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020.
56. Gu, A.-R.; Nam, J.-H.; Lee, S.-C. FBI-Net: Frequency-based image forgery localization via multitask learning with self-attention.
IEEE Access 2022, 10, 62751–62762. [CrossRef]
57. Ferrara, P.; Bianchi, T.; De Rosa, A.; Piva, A. Image forgery localization via fine-grained analysis of CFA artifacts. IEEE Trans. Inf.
Forensics Secur. 2012, 7, 1566–1577. [CrossRef]
58. Ye, S.; Sun, Q.; Chang, E.-C. Detecting digital image forgeries by measuring inconsistencies of blocking artifact. In Proceedings
of the IEEE International Conference on Multimedia and Expo, Beijing, China, 2–5 July 2007; IEEE: New York, NY, USA, 2007;
pp. 12–15.
59. Mahdian, B.; Saic, S. Using noise inconsistencies for blind image forensics. Image Vis. Comput. 2009, 27, 1497–1503. [CrossRef]
60. Salloum, R.; Ren, Y.; Kuo, C.-C.J. Image splicing localization using a multi-task fully convolutional network (MFCN). J. Vis.
Commun. Image Represent. 2018, 51, 201–209. [CrossRef]
61. Wu, Y.; AbdAlmageed, W.; Natarajan, P. Mantra-net: Manipulation tracing network for detection and localization of image
forgeries with anomalous features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
Long Beach, CA, USA, 16–20 June 2019; pp. 9543–9552.
62. Ding, H.; Chen, L.; Tao, Q.; Fu, Z.; Dong, L.; Cui, X. DCU-Net: A dual-channel U-shaped network for image splicing forgery
detection. Neural Comput. Appl. 2023, 35, 5015–5031. [CrossRef]
63. Zhang, Y.; Zhu, G.; Wu, L.; Kwong, S.; Zhang, H.; Zhou, Y. Multi-task SE-network for image splicing localization. IEEE Trans.
Circuits Syst. Video Technol. 2022, 32, 4828–4840. [CrossRef]
64. Faceswap. Available online: https://github.com/MarekKowalski/FaceSwap (accessed on 13 November 2021).
65. Thies, J.; Zollhöfer, M.; Nießner, M. Deferred neural rendering: Image synthesis using neural textures. ACM Trans. Graph. 2019,
38, 1–12. [CrossRef]
66. Fridrich, J.; Kodovsky, J. Rich models for steganalysis of digital images. IEEE Trans. Inf. Forensics Secur. 2012, 7, 868–882.
[CrossRef]
67. Cozzolino, D.; Poggi, G.; Verdoliva, L. Recasting residual-based local descriptors as convolutional neural networks: An application
to image forgery detection. In Proceedings of the 5th ACM Workshop on Information Hiding and Multimedia Security,
Philadelphia, PA, USA, 20–21 June 2017; pp. 159–164.
68. Li, Y.; Yang, X.; Sun, P.; Qi, H.; Lyu, S. Celeb-DF: A large-scale challenging dataset for deepfake forensics. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 3207–3216.
69. Li, Y.; Lyu, S. Exposing deepfake videos by detecting face warping artifacts. arXiv 2018, arXiv:1811.00656.
70. Matern, F.; Riess, C.; Stamminger, M. Exploiting visual artifacts to expose deepfakes and face manipulations. In Proceedings of
the IEEE Winter Applications of Computer Vision Workshops (WACVW), Waikoloa Village, HI, USA, 7–11 January 2019; IEEE:
New York, NY, USA, 2019; pp. 83–92.
71. Li, X.; Lang, Y.; Chen, Y.; Mao, X.; He, Y.; Wang, S.; Xue, H.; Lu, Q. Sharp multiple instance learning for deepfake video detection.
In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1864–1872.
72. Masi, I.; Killekar, A.; Mascarenhas, R.M.; Gurudatt, S.P.; AbdAlmageed, W. Two-branch recurrent network for isolating deepfakes
in videos. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Proceedings,
Part VII 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 667–684.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.