

Unsupervised Generative Fake Image Detector


Tong Qiao, Hang Shao, Shichuang Xie, and Ran Shi

Abstract— Recently, the rapid advancement of generative models has led to their exploitation by malicious actors who employ them to fabricate fake synthetic images. Meanwhile, such deceptive images are often disseminated on social network platforms, thereby undermining public trust. Although reliable forensic tools have emerged to detect generative fake images, the existing supervised detectors rely excessively on correctly-labeled training samples, leading to overwhelming outsourced annotation costs and the potential risk of label flipping attacks. In light of these limitations, we propose an unsupervised detector against generative fake images. In particular, we assign noisy labels to the training samples. Then, based on the pre-clustered samples with noisy labels, a pre-training and re-training strategy helps us train the feature extractor used to obtain discriminative features. Last, the extracted features guide us to cluster pristine and fake images respectively; the fake images are effectively filtered out by employing cosine similarity. Extensive experimental results highlight that our unsupervised detector rivals the baseline supervised methods; moreover, it is better able to defend against label flipping attacks.

Index Terms— Image forensics, unsupervised learning, generative fake images.

Manuscript received 16 December 2023; revised 7 March 2024; accepted 24 March 2024. Date of publication 1 April 2024; date of current version 30 September 2024. This work was supported by the Zhejiang Provincial Natural Science Foundation of China under Grant LZ23F020006. This article was recommended by Associate Editor T. Zhang. (Corresponding author: Ran Shi.)

Tong Qiao is with the School of Cyberspace, Hangzhou Dianzi University, Hangzhou 310018, China, and also with the Sino-France Joint Laboratory for Digital Media Forensics of Zhejiang Province, Hangzhou 310018, China.

Hang Shao and Shichuang Xie are with the School of Cyberspace, Hangzhou Dianzi University, Hangzhou 310018, China.

Ran Shi is with the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China (e-mail: rshi@njust.edu.cn).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCSVT.2024.3383833.

Digital Object Identifier 10.1109/TCSVT.2024.3383833

I. INTRODUCTION

THE widespread proliferation of fake facial images and videos [1], [2], [3], [4], [5], [6], [7], [8], [9] synthesized by generative models on the Internet has raised security-related concerns in fields such as politics, justice, criminal investigation, and reputation protection. In addition, with the continuous development of generative models, generated fake images are getting ever closer to authentic images, so that the naked eye can hardly distinguish real from fake (see Fig. 1). Moreover, it is worth noting that the representative StyleGAN family has undergone remarkable expansion in recent years. StyleGAN [10] was a pioneering study on image synthesis models. Next, StyleGAN2 [11] removed the water-droplet artifacts present in its predecessor. Last but not least, StyleGAN3 [12] tackled the fundamental problem of image coordinates adhering to features, an obvious limitation of its two prior versions; this breakthrough provided translation and rotation equivariance and thereby greatly enhanced the quality of generative fake images. On the one hand, we need to actively strengthen the intellectual-property protection of generative models [13]; on the other hand, to counter the malicious use of increasingly realistic images synthesized by constantly updated generative networks, it is urgent to establish a passive detection scheme.

Fortunately, the multimedia forensics community has been exploring various techniques to detect generative fake images. Some methods rely on handcrafted features, while others adopt a data-driven approach through end-to-end training. However, current methods rely primarily on supervised learning, which requires a large number of accurately labeled samples. In fact, because annotation is often crowdsourced to untrustworthy third parties, the training samples may be incorrectly annotated, whether through malicious label flipping attacks or innocent careless mislabelling, so that the efficacy of supervised classifiers for generative forgery detection diminishes remarkably. Furthermore, because new image generation techniques keep emerging, it is impractical to train a detection model that fully comprehends the feature distribution of all possible generated images, which leads to poor generalization. Moreover, the robustness of detectors tends to degrade when post-processing operations are applied. Although prior studies such as [1], [14], [15], [16], [17], [18], [19], [20], [21], and [22] have addressed generalization or robustness, few detectors retain strong performance when the training samples are incorrectly annotated, a limitation inherent to the supervised mechanism. Meanwhile, the development of effective unsupervised detectors has severely lagged behind.

To this end, we propose an effective unsupervised detector to address the problem of defending against label flipping attacks. In particular, we design an unsupervised detection method in which the true labels of the samples are completely unknown during model training. More importantly, our approach exhibits strong performance in defending against label flipping attacks. Moreover, our detector retains a degree of generalization capability and remains effective under various post-processing attacks, such as compression, blurring, filtering, gamma correction, and noise addition. Specifically, the main contributions of this paper are as follows:

• On the premise of not using any label information of the training samples, we design an unsupervised detector resisting generative fake images.

• We develop a noisy label generator that assigns noisy labels to two clusters, which are obtained from robust artifacts in both the spatial and the frequency domain.

• We establish a highly efficient feature extractor through the proposed pre-training and re-training strategy, so that discriminative features are extracted and continually refined by iteration under the guidance of a well-designed loss function. Moreover, bi-classification is carried out based on a cosine similarity measurement.

• For generative fake images, the performance of our unsupervised detector is on par with supervised state-of-the-art methods; moreover, it has a stronger capability of defending against label flipping attack.

The remainder of this paper is organized as follows. Section II reviews the prior-art methods and Section III elaborates the necessity and feasibility of unsupervised generative image detection. Section IV gives an overview of the proposed unsupervised framework. Section V elaborates the specific procedure of noisy label generation, Section VI describes how the feature extractor is trained, and Section VII designs the bi-classifier. Section VIII presents and analyzes extensive experimental results. Section IX further discusses and analyzes the proposed unsupervised detection framework. Finally, Section X draws the conclusion.

Fig. 1. Illustration of pristine and generative fake images synthesized by generative models. Upper: the conditional generative models StarGAN [23], StarGAN2 [24], and AttGAN [25]. Middle: the unconditional generative models StyleGAN [10], StyleGAN2 [11], and StyleGAN3 [12]. Bottom: the diffusion generative models Midjourney [26], Stable Diffusion [27], and DALL·E 2 [28].

II. RELATED WORK

In this section, we first review some representative generative models. According to the manner in which the model generates images, generative models are mainly divided into two categories, namely unconditional and conditional generative models. Next, we present some generative image detection methods and their limitations. Generative image detectors aim to determine whether the image under test comes from a generative model, which is a binary classification task. Based on the artifacts used for feature extraction, detection methods can be broadly categorized into two types: those based on the spatial domain and those based on the frequency domain.

A. Generative Model

1) Unconditional Generative Model: In the framework of an unconditional generative model, the input of the generator is random noise and the input of the discriminator is the generated image. For instance, BigGAN [29] applies orthogonal regularization to the generator, increases the batch size and the network width (the number of image channels), and significantly improves the fidelity metrics of the generated images. PGGAN [30] generates high-resolution images by using a progressive growing scheme: the authors first feed images of size 4 × 4 into the generator and discriminator, and then use smoothing layers to gradually increase the number of network layers and the image resolution, finally generating images of size 1024 × 1024. StyleGAN [10] uses Adaptive Instance Normalization (AdaIN) for style control, which makes the generator parameters take style information into account. StyleGAN2 [11] improves the StyleGAN framework by optimizing the upsampling method and re-designing the generator architecture and instance normalization. Furthermore, StyleGAN3 [12] makes each layer of the generator equivariant for continuous signals, so that all changes of fine features in a local neighborhood are tied to changes of coarse features in the previous layer; this fundamentally solves the problem of image coordinates adhering to features in StyleGAN2.

2) Conditional Generative Model: Compared with the unconditional generative model, the conditional generative model introduces constraints at the input terminal. Specifically, the random noise is concatenated with a semantic label, such as the attribute category of the target image, as the input of the generator; the generated image and the semantic label are jointly fed into the discriminator. Pix2Pix [31] introduces reconstruction errors to improve style transfer performance. Although it achieves good results in the field of style transfer, its training process requires paired data. To address that limitation, CycleGAN [32] implements style transfer on unpaired data using cycle consistency. Subsequently, StarGAN [23] solves the problem of mutual conversion among multi-style images: not only is label information added to the input of the generator, but the model can also be trained on multiple datasets by introducing a mask vector. Furthermore, StarGAN2 [24] improves the diversity of generated images and the scalability of the model through components such as a mapping network and a style encoder, while improving the visual quality of the generated images. Recently, diffusion models such as Midjourney [26], Stable Diffusion [27], Kandinsky3 [33], Imagen2 [34], and SDXL Turbo [35] have attracted more and more attention due to their unprecedented image quality, which indeed brings new challenges to generative image detectors.


B. Detection Model

1) Detection Based on Spatial Domain Artifacts: Generative fake images exhibit noticeable distinctions from real images in terms of imaging principles. As a consequence of the inherent constraints of generative models, generative fake images carry distinct abnormal traces in the spatial domain. Consequently, researchers have proposed diverse algorithms focusing on spatial-domain traces to address the challenge of generative fake image detection.

Convolutional neural networks (CNNs) have not only found widespread application in computer vision but have also demonstrated unparalleled advantages in multimedia forensics. Prior studies have validated that CNNs can effectively capture subtle signal-level variations of generative images during supervised learning. To our knowledge, [36] pioneered the use of CNNs for generative image detection by constructing a shallow CNN specifically designed for this task. Reference [37] proposes a generative image detector combining co-occurrence matrices with CNNs. In subsequent studies, many high-efficiency generative image detectors [22], [38], [39], [40], [41], [42] have been proposed one after another.

To detect fake images synthesized by an unknown generative model, [17] develops an AutoGAN model with a generalized generator network structure. Then [18] employs incremental learning to improve the generalization ability of detection models. Meanwhile, [16] introduces a self-attention mechanism for generative image detection. Next, [14] brings a data augmentation strategy to generative image detection, utilizing ResNet50 as the backbone network. Reference [19] further investigates the influence of JPEG compression on generative image detection performance. Next, [1] enhances the feature representation capability by integrating a convolutional block attention module and a multi-layer feature aggregation module into the Xception model [43]. References [15] and [20] delve into the generalization performance of detection, addressing the generalization and robustness of the detector. Recently, [21] devises an image patch-based orthogonal CNN training method. Reference [44] introduces a comprehensive algorithm combining global and local features. An attention-based approach in [45] identifies generative faces by analyzing eye inconsistencies. Reference [46] presents a semantic-based method for distinguishing generative facial images from real ones.

2) Detection Based on Frequency Domain Artifacts: Without loss of generality, generative images follow a distinct synthesis process compared to natural imaging, wherein repeated up-sampling operations [47] introduce periodic artifacts in the generated images. These artifacts can be examined not only in the spatial domain but also manifest abnormal characteristics in the frequency domain. Reference [48] observes that different up-sampling processes result in specific correlations between adjacent pixels in generative images, leading to abnormal middle-to-high frequency components. Reference [49] introduces a capsule-network-based detector built on the color channel spectrum. Reference [50] characterizes the anomalous artifacts of generative images by analyzing the distribution of the first digits of the Discrete Cosine Transform (DCT) coefficients. To address the limitation that existing generative models fail to accurately reproduce the spectral distribution of natural images during generation, [51] proposes a spectral regularization method.

III. MOTIVATION OF UNSUPERVISED GENERATIVE IMAGE DETECTION

To our knowledge, although the aforementioned methods solve the generalization and robustness issues to a certain extent, all current generative image detectors rely on a supervised learning mechanism, which requires a large number of correctly-labeled samples for training. Recently, [52] and [53] proposed generalizable generative image detection based on unsupervised domain adaptation; these methods only require a small number of unlabeled images in the target domain. However, the samples used for training in the source domain are still labeled, meaning that the detectors [52], [53] are not fully unsupervised.

Supervised learning relies on large-scale data with correct labels, and in the real world collecting sufficient labeled data is usually expensive and laborious. In addition, supervised learning possibly suffers from over-fitting or label flipping attacks; when faced with incorrectly-labeled data, the detection accuracy may drop drastically. While traditional self-supervised learning can significantly reduce the cost of labeling pre-training data and achieves good scalability, it still requires correctly-labeled samples to fine-tune the model on its downstream tasks. Unsupervised learning, on the other hand, does not need labeled data at all. Moreover, features learned in an unsupervised manner are more adaptive and richer, since an image itself contains far more information than the single label that usually constrains learning; that is to say, labeling limits learning. Therefore, unsupervised generative image detectors that do not require prior knowledge of the correct labels are urgently needed.

IV. OVERALL FRAMEWORK

In this section, we describe an unsupervised framework for detecting generative images. As shown in Fig. 2, the overall framework consists of three main stages.

A. Noisy Label Generation

First of all, we need to assign noisy labels to the clustered images, where the images are coarsely clustered based on features extracted from the frequency and spatial domains. Specifically, inspired by [47] and [54], the re-sampling operation of image synthesis leaves unrealistic artifacts that can be effectively exposed in the frequency domain; inspired by [55], the eye-highlight traces can also be captured by evaluating the Intersection over Union (IoU). It is important to emphasize that our noisy label assignment is conducted without any prior knowledge of the true label of each training sample.


Fig. 2. Illustration of our proposed unsupervised framework: in the initial stage, a noisy label generator is devised to assign the noisy labels to the training
samples; next, in the second stage, under the guidance of these labels, a generalizable feature extractor is well-trained via the strategy of pre-training and
re-training; last, in the third stage, the cosine similarity among images from each cluster facilitates the classification of pristine and generative fake images.

B. Feature Extractor Training

This stage consists of two key sub-stages: pre-training and re-training. In the pre-training sub-stage, contrastive learning is employed, guided by the noisy labels assigned in the prior stage. This facilitates the acquisition of effective feature representations. Next, in the re-training sub-stage, contrastive learning is utilized once again, now guided by high-confidence samples, i.e., model fine-tuning. In this manner, both the discrimination and the generalization of the extracted features are further optimized. During the contrastive learning process, our aim is to maximize the similarity between samples carrying the same noisy label, while simultaneously minimizing the similarity between samples with different noisy labels. By leveraging this approach, we optimize the model's ability to extract features that discriminate between pristine and generative fake images.

C. Bi-Classification

In this stage, we first employ the feature extractor trained in the prior stage to extract features from the inquiry testing images. Subsequently, based on these features, the images are segmented into two clusters. For each cluster, we isolate the corresponding images and evaluate the cosine similarity among them.



It is emphasized that this particular step leverages the original pixel intensities of the images. The cluster with the higher similarity is then classified as the generative one, owing to the common procedure of image generation. This final stage enables us to effectively differentiate between pristine and generative fake images.

V. NOISY LABEL GENERATION

In the training phase of our proposed feature extractor, labels are necessary to guide the learning procedure. To this end, we carefully design a noisy label generator so that it provides reliable guidance to the feature extractor during training. Importantly, the assigned noisy labels are only intended to distinguish between two types of samples, without specifying which class corresponds to the positive or negative one. That allows us to maintain an unsupervised learning setting, where the feature extractor learns to differentiate between sample categories without knowing the true labels.

A. Artifact Exposing

Different from the natural imaging mechanism, a generative model produces images through operations such as convolution. The repeated upsampling operations leave periodic artifacts in the resulting image, and these traces show abnormal characteristics in the frequency domain [17], [48]. Moreover, for the newly emerging diffusion models, abnormal spectrum peaks can also be exposed in the frequency domain [56], [57]. Besides, since generative models lack modeling of the geometric relationships of the face, some artifacts are left in the spatial domain [55], [58], [59], [60], [61]. Thus, inspired by these prior studies, we adopt features from both the frequency and the spatial domain for noisy label assignment.

1) Artifact in the Frequency Domain: The correlation between adjacent pixels caused by re-sampling unavoidably leads to anomalous peak values in the frequency domain. To this end, we first transform an image from the spatial into the frequency domain. Typically, let us consider an image x = \{x_{m,n}\}, m \in \{1, \ldots, M\}, n \in \{1, \ldots, N\}. In each 8 × 8 block of each color channel, the DCT operation converts the pixel values in the spatial domain into the corresponding coefficients in the DCT domain:

I(u, v) = T_u T_v \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} x(m, n) \cos\left[\frac{(2m+1)u\pi}{2M}\right] \cos\left[\frac{(2n+1)v\pi}{2N}\right]  (1)

where x from a single color space denotes a pixel value of the M × N matrix, I(u, v) represents the coefficient in the DCT domain, and T_u, T_v denote the normalized weights:

T_u = \begin{cases} \frac{1}{\sqrt{M}}, & u = 0 \\ \sqrt{\frac{2}{M}}, & 1 \le u \le M-1 \end{cases} \qquad T_v = \begin{cases} \frac{1}{\sqrt{N}}, & v = 0 \\ \sqrt{\frac{2}{N}}, & 1 \le v \le N-1 \end{cases}  (2)

It should be noted that the DCT operation works independently in each color space. In this context, for efficiency, only the red color channel is adopted, since the discrepancies between pristine and generative fake images appear more remarkably there [22].
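As a concrete illustration of this frequency-domain feature, the following minimal Python sketch computes the 8 × 8 block DCT of the red channel and flattens the coefficients into one vector. It is not the authors' released code: the function name dct_feature, the cropping to a block multiple, and the use of SciPy's orthonormal DCT-II (which implements the T_u, T_v weights of Eq. (2)) are our assumptions.

```python
import numpy as np
from scipy.fftpack import dct

def dct_feature(image_rgb: np.ndarray, block: int = 8) -> np.ndarray:
    """Flatten the 8x8 block-DCT coefficients of the red channel (Eqs. (1)-(2))."""
    red = image_rgb[..., 0].astype(np.float64)
    h, w = red.shape
    h, w = h - h % block, w - w % block          # crop to a multiple of the block size
    red = red[:h, :w]
    coeffs = []
    for y in range(0, h, block):
        for x in range(0, w, block):
            patch = red[y:y + block, x:x + block]
            # separable 2-D DCT-II with orthonormal weights
            c = dct(dct(patch, axis=0, norm="ortho"), axis=1, norm="ortho")
            coeffs.append(c.ravel())
    return np.concatenate(coeffs)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
    print(dct_feature(demo).shape)               # (4096,) for a 64x64 input
```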
2) Artifact in the Spatial Domain: Besides the artifact in the frequency domain, we also dig out an artifact in the spatial domain. Similar corneal specular highlights exist in both eyes of real face images because both eyes share the same lighting environment, while in generative images this consistency is easily broken. To this end, the benchmark Dlib tool is used to locate the facial region in the image, and 68 key facial landmarks are then accurately extracted. The regions corresponding to the two eyes are cropped. Next, we use the Canny descriptor, coupled with the Hough transform, to localize the corneal edges. Last, the intersections between the corneal edges and the eye regions are defined as the corneal regions. After obtaining the corneal regions, the adaptive image thresholding method proposed by [62] is used to extract the specular highlights. To evaluate the similarity between the specular highlights of the two symmetrical corneal regions, the IoU score ranging from 0 to 1 is used as a quantitative metric, formulated as:

IoU = \frac{|L \cap R|}{|L \cup R|}  (3)

where L represents the specular highlight of the left-eye cornea and R represents the specular highlight of the right-eye cornea after alignment. Generally, a lower IoU score indicates more dissimilarity between the corneal specular highlights, thus implying a generative facial image. In a pristine image captured with a real imaging device such as a digital camera or a smartphone, the corneal specular highlights of the two eyes exhibit a striking symmetric similarity, manifesting in higher IoU scores. In contrast, the corneal specular highlights of eyes in generative images exhibit inconsistencies, resulting in significantly lower IoU scores.
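The IoU of Eq. (3) reduces to a ratio of set sizes once the two specular-highlight masks are available. The sketch below assumes the masks have already been produced by the Dlib/Canny/Hough/thresholding pipeline described above and are given as binary arrays; the function name and the zero-union convention are ours.

```python
import numpy as np

def highlight_iou(left_mask: np.ndarray, right_mask: np.ndarray) -> float:
    """IoU (Eq. (3)) between the left and the aligned right corneal highlight masks."""
    l = left_mask.astype(bool)
    r = right_mask.astype(bool)
    union = np.logical_or(l, r).sum()
    if union == 0:
        return 0.0                      # no highlight detected in either eye
    return float(np.logical_and(l, r).sum() / union)

if __name__ == "__main__":
    a = np.zeros((32, 32), bool); a[10:20, 10:20] = True
    b = np.zeros((32, 32), bool); b[12:22, 12:22] = True
    print(round(highlight_iou(a, b), 3))   # partially overlapping highlights -> IoU ~ 0.47
```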


B. Noisy Label Assignment

In this subsection, we carry out the procedure of noisy label assignment based on the extracted artifacts from both the frequency and the spatial domain. For the frequency-domain artifact, we extract the features in the R color channel, in which the DCT coefficients I are flattened to form a feature vector; for the spatial-domain artifact, the IoU score directly serves as a feature. Straightforwardly, we append the obtained IoU score to the flattened DCT coefficients. Without loss of generality, we adopt the k-means clustering algorithm for image clustering. It partitions the data into 2 groups, since only 2 types of images are to be identified. Initially, 2 objects are randomly selected as the initial cluster centroids. Subsequently, the algorithm calculates the distance between each object and each cluster centroid; each object is then assigned to the closest centroid, resulting in updated clusters. The iterative clustering procedure continues until convergence. By iteratively refining the cluster centroids and the assignment of objects to clusters, we aim to form high-purity clusters of the most similar objects while maximizing the distance between different clusters. Last but not least, noisy labels from the set {0, 1} are randomly assigned to the two clusters of images, in which we do not intend to distinguish between positives and negatives.
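A minimal sketch of this noisy label assignment follows, assuming scikit-learn's k-means and pre-computed per-image DCT features and IoU scores; the helper name assign_noisy_labels and the random {0, 1} mapping of the two clusters are illustrative choices, not the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans

def assign_noisy_labels(dct_feats: np.ndarray, iou_scores: np.ndarray, seed: int = 0) -> np.ndarray:
    """Cluster [DCT || IoU] features into 2 groups and map them to random {0, 1} noisy labels."""
    feats = np.hstack([dct_feats, iou_scores.reshape(-1, 1)])    # append IoU to the flattened DCT
    cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(feats)
    rng = np.random.default_rng(seed)
    mapping = rng.permutation([0, 1])            # which cluster receives label 0 is arbitrary
    return mapping[cluster_ids]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    dct_feats = rng.normal(size=(6, 4096))
    iou_scores = rng.uniform(size=6)
    print(assign_noisy_labels(dct_feats, iou_scores))
```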
VI. FEATURE EXTRACTOR TRAINING

In this section, our goal is to train a reliable feature extractor on the noisy-labeled data via contrastive learning. Specifically, we divide the training procedure into two phases: pre-training and re-training. To improve the efficiency of our proposed framework, we employ data augmentation during training. Furthermore, considering the limited accuracy of the noisy label assignment generated in the prior stage, we introduce a module of high-confidence sample selection in the re-training stage. This module aims to enhance the effectiveness of the feature extractor and its overall generalization performance.

A. Pre-Training

Contrastive learning obtains sample representations by comparing positive and negative samples in the embedding space. In our approach, we leverage the data with noisy labels generated in the prior stage to guide our contrastive learning framework. In particular, in each batch, our goal is to minimize the distance between samples sharing the same noisy label, while maximizing the distance between samples with different noisy labels. The specific procedure is briefly elaborated as follows.

For each input sample x from a batch of N original data with noisy labels, we generate two random augmentations, x̃ = f_aug(x), each of which represents a specific view of the data. In this paper, our augmentation approach comprises a comprehensive set of editing operations: crop, horizontal flip, vertical flip, 90-degree rotation, erasing, color jitter, histogram equalization, Gaussian blur, and JPEG compression (see Fig. 3). It is worth noting that an augmented view carries the same noisy label as its original version.

Fig. 3. Illustration of augmentation manners.
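A possible torchvision realization of the two-view augmentation is sketched below. The exact crop size, jitter strengths and application probabilities are not specified in the paper, so the values here are placeholders; the 90-degree rotation and JPEG compression listed above would require custom transforms and are omitted for brevity.

```python
from torchvision import transforms as T

# Two independently augmented views of the same input share its noisy label.
augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.6, 1.0)),
    T.RandomHorizontalFlip(),
    T.RandomVerticalFlip(),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    T.RandomEqualize(p=0.3),                      # histogram equalization (PIL input)
    T.RandomApply([T.GaussianBlur(kernel_size=5)], p=0.5),
    T.ToTensor(),
    T.RandomErasing(p=0.25),                      # erasing operates on the tensor view
])

def two_views(pil_image):
    """Return the pair (x1, x2) that forms one entry of the multi-view batch."""
    return augment(pil_image), augment(pil_image)
```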
By referring to [63], let f_enc(·) map x̃ to a representation vector r = f_enc(x̃). We use the Xception [43] model as the feature extraction network; compared with other models, this network structure extracts high-quality features for distinguishing between pristine and generative fake images. We first feed the augmented images into the backbone network to obtain the normalized high-dimensional features r, which are next projected into a low-dimensional space through a projection head. Straightforwardly, let f_proj(·) map r to a vector h = f_proj(r); specifically, f_proj is a single linear layer of output size 256. The projection head is discarded at the end of contrastive training; in other words, it is not used at the testing stage.

In the multi-view batching, let i, j, k ∈ T = {1, . . . , 2N} be the indices of the augmented views; the noisy labels l guide us to train the feature extractor with the following loss function:

L = \sum_{i \in T} \frac{-1}{|J(i)|} \sum_{j \in J(i)} \log \frac{\exp(h_i \cdot h_j / \tau)}{\sum_{k \in K(i)} \exp(h_i \cdot h_k / \tau)}  (4)

where the symbol · denotes the inner product, τ is a scalar temperature parameter, and K(i) = T \ {i}. The index i is called the anchor. J(i) = {j ∈ K(i) : l_j = l_i} is the set of all views in the multi-view batch, distinct from i, that share the same noisy label, and |J(i)| denotes its cardinality.
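Eq. (4) follows the supervised contrastive formulation of [63], with noisy labels taking the place of true labels. A minimal PyTorch sketch is given below; averaging over anchors instead of summing, and the function name, are our own choices.

```python
import torch
import torch.nn.functional as F

def noisy_label_contrastive_loss(h: torch.Tensor, labels: torch.Tensor, tau: float = 0.75) -> torch.Tensor:
    """Eq. (4): h is the (2N, d) matrix of projected view embeddings, labels their noisy labels."""
    h = F.normalize(h, dim=1)
    logits = h @ h.t() / tau                                  # pairwise h_i . h_j / tau
    n = h.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=h.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask   # the sets J(i)
    logits = logits.masked_fill(self_mask, float("-inf"))     # the anchor is excluded from K(i)
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # -1/|J(i)| * sum over positives, then averaged over anchors (Eq. (4) up to a constant factor)
    per_anchor = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(1) / pos_mask.sum(1).clamp(min=1)
    return per_anchor.mean()

# usage sketch: h = proj_head(encoder(torch.cat([view1, view2])))
#               loss = noisy_label_contrastive_loss(h, noisy_labels.repeat(2))
```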
B. Re-Training

In the pre-training phase, our goal is to learn an initial, efficient feature extractor guided by the noisy labels generated in the prior stage. However, the inaccurate annotations contained in the noisy labels may mislead the training procedure away from the target domain. To address that issue, we propose to perform a re-training operation in which new unlabeled samples are introduced to fine-tune the model. Through this iterative re-training procedure, the confidence of the generated noisy labels gradually increases and the feature embeddings become increasingly discriminative.

As a matter of fact, pre-training provides an initial feature extractor, but its performance is limited because some training samples carry incorrect noisy labels produced by the noisy label generator. In order to further refine the feature extractor, we adopt an iterative re-training strategy that gradually adds relatively reliable unlabeled samples. In particular, the feature distance from an unlabeled sample to the cluster centroid produced during training is used to measure the reliability of that sample. However, due to the limited performance of the initial extractor, this measure may not always be accurate: the more samples we select, the lower their reliability becomes. Therefore, we strictly control the number of new unlabeled samples fed into the network in each iteration to alleviate this reliability problem. In this way, our strategy gradually improves the performance of the feature extractor over multiple iterations.

In each iteration, we extract the embeddings of the training samples and employ a clustering method to identify reliable samples for fine-tuning the model. More specifically, the samples close to the cluster centroid are assigned a high confidence score, measured by:

f_{hcs} = -\| z - z_c \|_2  (5)


where z denotes the feature vector of a sample in the embedding space and z_c denotes its cluster centroid.

In the re-training phase, we perform a total of n iterations. In each iteration, we select the samples with the highest confidence scores in each cluster to fine-tune the model; each iteration of the fine-tuning process consists of 1000 epochs.
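A sketch of the high-confidence sample selection based on Eq. (5) follows. The fraction of samples kept per cluster in each iteration is not given numerically in the text, so keep_ratio here is a placeholder.

```python
import numpy as np

def select_high_confidence(z: np.ndarray, cluster_ids: np.ndarray, keep_ratio: float = 0.2) -> np.ndarray:
    """Return indices of samples with the highest f_hcs = -||z - z_c||_2 in each cluster (Eq. (5))."""
    keep = []
    for c in np.unique(cluster_ids):
        idx = np.where(cluster_ids == c)[0]
        centroid = z[idx].mean(axis=0)
        score = -np.linalg.norm(z[idx] - centroid, axis=1)     # f_hcs
        k = max(1, int(keep_ratio * len(idx)))
        keep.append(idx[np.argsort(score)[-k:]])               # largest score = closest to the centroid
    return np.concatenate(keep)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.normal(size=(100, 16))
    ids = rng.integers(0, 2, size=100)
    print(select_high_confidence(z, ids, keep_ratio=0.1))
```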
VII. BI-CLASSIFICATION

Relying on the well-trained feature extractor designed in the prior section, we are capable of classifying the inquiry images without acquiring any true label information. In this section, we build a simple but effective bi-classifier for distinguishing pristine images from generative fake ones. In fact, after extracting effective discriminative features, we can only partition the inquiry images into two clusters through the k-means algorithm, but cannot tell which cluster belongs to the pristine or the fake category.

Our assumption is that generative images are synthesized through a more unified, standardized pipeline, which to some extent makes them more similar to each other than pristine images are. Straightforwardly, the cosine similarity between two samples is formulated as:

f_{sim} = \frac{x \cdot x'}{\|x\|_2 \cdot \|x'\|_2}  (6)

where (x, x') represents a pair of different images within the same cluster. Finally, by comparing the average similarity of each cluster, the generative fake images are effectively identified: the cluster of images with the higher similarity score is considered fake, while the cluster with the lower similarity score is considered pristine. Next, we validate our assumption by empirically analyzing statistical data. More specifically, different types of testing samples are deployed: pristine images from FFHQ [10], CelebA [64] and COCO [65]; fake images generated by StyleGANs [11], [12] and diffusion models [26], [27]. For each type, 500 samples are randomly selected and the cosine similarities are calculated. The results are shown in Fig. 4. By observation, the distributions of the different types of images can be remarkably distinguished; the results empirically support our assumption.

Fig. 4. Similarity comparison by violin plots. The center white point represents the median, the lower bound of the box represents the 25th percentile, the upper bound of the box represents the 75th percentile, and the width of the violin indicates the frequency.
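The bi-classification step can be prototyped as below: the average pairwise cosine similarity of Eq. (6) is computed on raw pixel intensities within each cluster, and the more self-similar cluster is declared fake. Function names are illustrative, and each cluster is assumed to contain at least two images.

```python
import numpy as np

def mean_cosine_similarity(images: np.ndarray) -> float:
    """Average pairwise f_sim (Eq. (6)) over the flattened pixel intensities of one cluster."""
    x = images.reshape(len(images), -1).astype(np.float64)
    x /= np.linalg.norm(x, axis=1, keepdims=True) + 1e-12
    sim = x @ x.T
    iu = np.triu_indices(len(images), k=1)            # distinct pairs only
    return float(sim[iu].mean())

def label_clusters(cluster_a: np.ndarray, cluster_b: np.ndarray):
    """The cluster with the higher average similarity is declared generative (fake)."""
    fake_is_a = mean_cosine_similarity(cluster_a) > mean_cosine_similarity(cluster_b)
    return ("fake", "pristine") if fake_is_a else ("pristine", "fake")
```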
VIII. EXPERIMENTS

In order to verify the effectiveness of our proposed unsupervised detector, we conduct extensive experiments in this section. First, the datasets used in the evaluation, as well as the experimental settings, training parameters and evaluation metrics, are described. Second, we evaluate the effectiveness of the label generation and re-training strategies. Next, we comprehensively evaluate the detection performance of our unsupervised detector, as well as its robustness. Additionally, we compare the proposed detector with state-of-the-art baselines, especially considering the scenario of label flipping attack, on which few current detectors focus. Finally, we conduct ablation experiments to verify the necessity of the three core steps of the overall framework and visualize the features extracted by our high-efficiency extractor.

A. Settings

In this context, we intend to comprehensively evaluate the performance of the proposed unsupervised detector. For the original real images, we use three baseline datasets, namely FFHQ [10], CelebA [64], and COCO [65]. Based on the real images, various forgery models are used to synthesize fake images, including the family of unconditional StyleGANs [10], [11], [12], the conditional StarGAN [23], and the recent popular diffusion models Midjourney [26] and Stable Diffusion [27].

Since one main contribution is the re-training strategy, we specify the numbers of pre-training and re-training samples. For the StarGAN forgery model, the pre-training subset contains real images (∼3K) from CelebA [64] and fake images (∼3K) generated using StarGAN [23]; the re-training stage then uses real images (∼10K) from the same source and fake images (∼10K) generated by the same forgery model. The remaining subsets are constructed following the same steps. For clarity, the detailed description of the dataset is given in Table I. Last but not least, in constructing our dataset, we divide the subsets according to the category of the forgery model; thus, for the testing subsets, the real and fake images remain aligned with their training subsets. Besides, for each testing subset, the ratio of real to fake images is kept the same.

The backbone network is trained using the Adam optimizer. The default number of pre-training epochs is 2000, and the default number of epochs for the re-training phase is 1000. The learning rate is initialized to 0.0003 and dynamically adjusted using the cosine annealing strategy. Our framework is implemented in PyTorch on a GeForce RTX 2080Ti GPU (11 GB memory, CUDA 10.0). The batch size, which serves as a key factor, is 48, and the temperature parameter is set to 0.75 in all experiments.
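The reported training configuration can be reproduced roughly as follows. ResNet-50 is used here purely as a runnable stand-in for the Xception backbone [43], which is not shipped with torchvision; everything else mirrors the stated hyper-parameters.

```python
import torch
from torchvision import models

# Reported settings: Adam, lr 3e-4 with cosine annealing, batch size 48,
# temperature 0.75, 2000 pre-training epochs.
backbone = models.resnet50(weights=None)
backbone.fc = torch.nn.Identity()                  # keep the 2048-d pooled features
proj_head = torch.nn.Linear(2048, 256)             # single linear projection head

optimizer = torch.optim.Adam(
    list(backbone.parameters()) + list(proj_head.parameters()), lr=3e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=2000)

BATCH_SIZE = 48
TEMPERATURE = 0.75
```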


Next, the widely-adopted evaluation metrics in the community are considered: Precision, Recall, F1 Score, Accuracy (ACC), and Average Precision (AP). For clarity, we also introduce four indicators: True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN). Based on these indicators, ACC can be formulated as:

ACC = \frac{TP + TN}{TP + FP + FN + TN}  (7)

TABLE I
DATASET DESCRIPTION

Fig. 5. Performance comparison by manually assigning the noisy label with different error rates.

Fig. 6. Performance comparison by different re-training iteration times.

B. Evaluation of Label Generation and Re-Training

In our designed unsupervised detector, the noisy label generation assigns labels to the training samples, which plays an important role in the subsequent feature extractor training. In this subsection, we carry out experiments to address the importance of noisy label generation. It should be noted that we manually assign noisy labels to the training samples with different error rates. As Fig. 5 illustrates, as long as the error rate is below 45%, the detection accuracy can be well guaranteed. In addition, for comparison, we report the detection accuracy in the case "45% (Ours)", obtained by adopting our label generation. By observation, the detection accuracy based on the proposed label generation is slightly better than that based on the manually-assigned labels, implying the high efficiency of the downstream pre-training and re-training strategy. Nevertheless, in the label generation stage, we need to ensure that the error rate of noisy label assignment stays below this threshold. In that case, in the subsequent feature extractor training stage, both the pre-training and the re-training steps further refine the discriminative features and enhance the purity of the clustering.

Besides, in the feature extractor training stage, the proposed re-training strategy also further refines the effectiveness of the discriminative features. Furthermore, we investigate the relationship between the number of re-training iterations and the final classification performance. As Fig. 6 illustrates, the detection accuracy gradually improves as the number of re-training iterations increases. In fact, the re-training strategy further refines the purity of the training samples of each category, from which the well-trained backbone network, namely the feature extractor, benefits considerably. Thus, the detection performance is remarkably improved thanks to the refined feature extractor.

C. Performance of Our Unsupervised Detector

In this subsection, we intend to comprehensively evaluate the performance of the proposed unsupervised detector. Meanwhile, by mismatching the training and testing subsets, i.e., adopting different forgery models, we validate the generalization of our proposed method. The specific comparison results are given in Table II. In particular, the bold face represents the best detection accuracy among all the testing subsets, and the underlined data are the acceptable results for generalization evaluation. By observation, as expected, when the training and testing data are aligned, the detection performance is better than in the other cases. Moreover, regardless of the forgery model, our proposed unsupervised detector remains effective.


TABLE II
PERFORMANCE COMPARISON OF THE PROPOSED UNSUPERVISED DETECTOR AS THE DIFFERENT FORGERY MODELS ATTACK

However, when the training and testing subsets are mismatched, the detection results are not fully satisfying, though some results are still acceptable. For instance, within the StyleGAN family, the detectors can basically generalize to each other and to the newly emerging diffusion models. When facing the conditional generative model StarGAN, however, the detection accuracy drops sharply. For diffusion models, the unsupervised detectors can also generalize to each other, but fail when tested on forgery images from the conditional or unconditional generative models. Nevertheless, our proposed unsupervised detector demonstrates a certain degree of generalization capability.

D. Robustness of Unsupervised Detectors

In order to verify the robustness of the model, we attack the samples with several image post-processing operations: JPEG compression with quality factor 95, median filtering with a 3 × 3 window, average blurring with a 3 × 3 kernel, rescaling with a 0.5 resolution reduction, Gaussian noise addition with zero mean and σ = 0.2, and gamma correction with γ = 0.8. Moreover, six forgery models, namely StyleGAN, StyleGAN2, StyleGAN3, StarGAN, Midjourney, and Stable Diffusion, are used for the analysis, with 2000 images for each type. In Table III, it can be observed that regardless of the forgery model, our proposed unsupervised detector largely retains its performance under the various post-processing operations. It should be noted that the well-trained detector knows nothing about the type of post-processing. The extensive experimental results thus empirically verify the robustness of the proposed unsupervised detector.

TABLE III
ROBUSTNESS PERFORMANCE AFTER POST-PROCESSING

Fig. 7. Average Euclidean distance between original and post-processed image features.

In order to further analyze why the model successfully resists post-processing attacks, we introduce a feature similarity analysis. In general, the stronger the similarity between the features extracted from the original image and those extracted from the corresponding post-processed image, the stronger the robustness. Specifically, we extract the original image features and their corresponding post-processed image features, and the average Euclidean distance between them is reported in Fig. 7.

By observation, regardless of the post-processing attack, the Euclidean distance between the features of the original images and those of the post-processed images remains small, indicating a high degree of feature similarity. This further explains why the proposed unsupervised detector can successfully resist post-processing attacks. In addition, it should be noted that under the noise adding attack, a remarkable increase in the average Euclidean distance represents a decrease in feature similarity, which explains the drop of detection accuracy in Table III.
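The feature-similarity analysis of Fig. 7 amounts to an average Euclidean distance between paired feature vectors; a short helper is sketched below, with the feature matrices assumed to come from the trained extractor.

```python
import numpy as np

def mean_feature_distance(orig_feats: np.ndarray, processed_feats: np.ndarray) -> float:
    """Average ||f(x) - f(post(x))||_2 over a set of images; smaller means more robust features."""
    return float(np.linalg.norm(orig_feats - processed_feats, axis=1).mean())

# usage sketch (the feature arrays would come from the trained extractor):
# d_jpeg  = mean_feature_distance(feats_clean, feats_jpeg95)
# d_noise = mean_feature_distance(feats_clean, feats_gaussian_noise)
```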
E. Comparison With SOTAs

To verify the effectiveness of the proposed unsupervised detector, we compare it with deep learning based detectors, involving supervised ones [1], [15], [20], [37], [39], [41], [48], and [56] and an unsupervised one [66]. The comparison results are shown in Table IV. Five key evaluation metrics are adopted: Precision, Recall, F1 Score, ACC, and AP.

First of all, all the subsets are used for evaluation (see Table IV). By observation, for detecting fake images generated by the various forgery models, our proposed unsupervised detector is capable of completing the forensic task, and its performance rivals that of the supervised detectors. To our knowledge, few studies address generative fake image detection with an unsupervised scheme; thus, we have to compare our method with mostly supervised SOTAs and with the unsupervised method [66] from outside this community. It is worth noting that [66] does not require labeled data when training the feature extractor.

Authorized licensed use limited to: Kirikkale Univ. Downloaded on December 18,2024 at 23:13:37 UTC from IEEE Xplore. Restrictions apply.
QIAO et al.: UNSUPERVISED GENERATIVE FAKE IMAGE DETECTOR 8451

TABLE IV
PERFORMANCE COMPARISON WITH SOTAS WITH CORRECT LABELS, IN WHICH THE COLORED AND BOLD RESULTS ARE FROM THE BEST PERFORMANCES OF THE SUPERVISED DETECTORS WITH ITALIC AND THE UNSUPERVISED DETECTORS RESPECTIVELY

TABLE V
PERFORMANCE COMPARISON WITH SOTAS UNDER LABEL FLIPPING ATTACK, IN WHICH THE COLORED AND BOLD RESULTS ARE FROM THE BEST PERFORMANCES OF THE SUPERVISED DETECTORS WITH ITALIC AND THE UNSUPERVISED DETECTORS RESPECTIVELY

However, in its downstream classification tasks, it is still necessary to use labeled data for fine-tuning, meaning that the detector [66] is not a fully unsupervised detector. By comparison, the performances of the compared detectors other than [66] are basically on a par. In fact, the comparison is somewhat unfair, since the unsupervised manner lacks any label information while the supervised methods own accurate labels for highly efficient training. Nevertheless, our proposed unsupervised detector still achieves satisfying accuracy against images from the various forgery models.

More importantly, we address the superiority of our proposed unsupervised detector compared to the representative detectors [15], [56], [66]. In a practical scenario, when the labels of the training samples are maliciously attacked, namely under a label flipping attack, supervised detectors may become invalid. In Table V, we comprehensively evaluate the detection performance under different flipping rates. By observation, as the flipping rate gradually increases, our proposed unsupervised detector remains nearly immune to the label flipping attack. On the contrary, the compared supervised detectors cannot resist the attack, and their performance degrades sharply. It should be noted that the baseline [15] only marginally outperforms ours when detecting fake images from StyleGAN3 at a 30% flipping rate, while it is worse than ours in all remaining scenarios.

F. Ablation Study

In this subsection, we mainly discuss the effectiveness of the three core modules of the proposed unsupervised framework. To this end, we conduct the corresponding ablation studies and give a specific analysis. In particular, we must first assign noisy labels to the training samples, otherwise the feature extractor cannot be trained at all. The pre-training phase then helps us minimize the distance between samples with the same noisy labels and maximize the distance between samples with different noisy labels, which is the foundation of the unsupervised learning in this context. Moreover, the re-training phase further refines the purity of the training samples, leading to an improvement of the feature extractor. Thus, in the following ablation experiments, we evaluate these three modules. Besides, to enlarge the scope of application, the experiments are carried out for various forgery models.

As Table VI illustrates, regardless of the forgery model, our proposed unsupervised detector equipped with all three modules achieves the best results compared to the other variants. It should be noted that for both the traditional representative forgery models [10], [11], [12], [23] and the recent popular diffusion models [26], [27], our proposed scheme is comprehensively effective. Besides, by observation, if only the pre-training module is deployed, the feature extractor degrades to a vanilla contrastive learner, leading to very low accuracy for all forgery models; in such a scenario, the detector fails. That directly verifies the indispensability of the proposed modules, from which our unsupervised detector benefits considerably.

G. Performance Visualization of Unsupervised Detectors

In our proposed unsupervised detector, the discrimination between features from different categories of samples is of great importance. For clarity and simplicity, we visualize the discrimination between features in Fig. 8, in order to address the effectiveness of our proposed pre-training and re-training strategy. By observation, when the noisy labels are assigned only from the initial artifacts in the frequency and spatial domains, well-separated discriminative features cannot easily be obtained.


TABLE VI
ABLATION STUDY ON EACH MODULE FOR VARIOUS FORGERY MODELS

Fig. 8. T-SNE visualization comparison by equipping with different modules.

TABLE VII
PERFORMANCE COMPARISON OF BACKBONE NETWORKS

Fig. 9. Visualization analysis of the manipulated facial region by GradCAM [67], where the performance is compared by flipping labels with 50%.

However, by introducing the pre-training and re-training strategy, the features are further refined, leading to better discrimination. In fact, the T-SNE visualization results in Fig. 8 also verify the correctness of the ablation results in Table VI.

In addition, we visualize the salient feature regions with GradCAM [67]. As shown in Fig. 9, due to the high dependence of supervised methods on true labels, when the labels of the training samples suffer from flipping attacks, supervised methods often fail to correctly capture the forgery traces, which is reflected in the heatmap; the focus of Gragnaniello et al. [15] shifts from the face region to the background region, leading to a drastic degradation of the detection performance. Besides, although the unsupervised method of Wu et al. [66] can to some degree resist the label flipping attack, its model is not trained as well as ours, leading to unsatisfying detection accuracy (see Table V). In contrast, our method does not rely on true labels at any point and still captures the forgery traces correctly through the strategies of our proposed unsupervised framework.

IX. DISCUSSION AND ANALYSIS OF UNSUPERVISED DETECTOR

In this section, we further discuss and analyze the proposed unsupervised detection framework. In fact, in our proposed method, the backbone network plays a very important role in feature extraction. Thus, we discuss the baseline backbone networks [43], [68], [69], [70], [71] to verify the validity of our method for detecting generative fake images. Specifically, six forgery models, namely StyleGAN, StyleGAN2, StyleGAN3, StarGAN, Midjourney, and Stable Diffusion, are used for the comparison experiments. As Table VII illustrates, the more advanced PaCa-ViT [71] brings higher detection accuracy than the others. In addition, it is worth noting that within the proposed unsupervised framework, the detection accuracy can probably be further improved along with the continuous development of backbone networks, which further proves that our framework has strong scalability.

In Section IV, we describe an unsupervised framework for detecting generative images, and we demonstrate the effectiveness of the proposed unsupervised detector through extensive experiments in Section VIII. As forgery models are constantly updated, a generative fake image detector should remain


TABLE VIII on expensive manual labeling and are probably fragile to


DATASET D ESCRIPTION OF N EW-E MERGING F ORGERY M ODELS label flipping attack. In order to address these problems,
we propose to design an unsupervised framework, generate
noisy labels, then pre-train and re-train the well-designed
feature extractor, and finally use cosine similarity to complete
the bi-classification task. Regardless of training samples with
or without correct labels, our method successfully detects
TABLE IX generative fake images synthesized by various forgery models,
P URITY F ORM N OISY L ABEL G ENERATION even including the new-emerging diffusion models. Besides,
our method also shows to some degree generalization capa-
bility and good robustness performance for post-processing
attacks.

After undergoing the noisy label generation process, the initial purity of the noisy labels is reported in Table IX. Subsequently, after training the feature extractor with the pre-training and re-training strategies, the outcomes are depicted in Table X. As we expect, when the training and testing data are aligned, the detection performance is better than in the mismatched settings. Moreover, regardless of the forgery model, our proposed unsupervised detector remains effective, and it can also generalize between diffusion models, which further proves that it possesses a certain degree of generalization ability.

TABLE IX: PURITY FROM NOISY LABEL GENERATION

TABLE X: PERFORMANCE COMPARISON OF THE PROPOSED UNSUPERVISED DETECTOR UNDER NEW-EMERGING FORGERY MODEL ATTACKS
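The purity values in Table IX follow the standard cluster-purity definition; a minimal sketch of the computation is shown below, where the ground-truth labels are used only for evaluation and never during training.

import numpy as np

def label_purity(noisy_labels, true_labels):
    # For each noisy-label group, count its most frequent true class, then
    # divide the sum of these counts by the total number of samples.
    noisy_labels = np.asarray(noisy_labels)
    true_labels = np.asarray(true_labels)
    correct = 0
    for g in np.unique(noisy_labels):
        members = true_labels[noisy_labels == g]
        correct += np.bincount(members).max()
    return correct / len(true_labels)

# Example: two noisy groups over six samples, five of which are grouped consistently.
print(label_purity([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))  # 0.833...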
X. CONCLUSION

In current generative fake image detection, detectors are usually trained in a supervised manner, which relies heavily on expensive manual labeling and is probably fragile to the label flipping attack. To address these problems, we propose an unsupervised framework: we generate noisy labels, then pre-train and re-train the well-designed feature extractor, and finally use cosine similarity to complete the bi-classification task. Regardless of whether the training samples carry correct labels, our method successfully detects generative fake images synthesized by various forgery models, even including the new-emerging diffusion models. Besides, our method also shows a certain degree of generalization capability and good robustness against post-processing attacks.
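As a final illustration of the last step of the pipeline, the sketch below performs cosine-similarity bi-classification given two cluster centres obtained from the extracted features; the centre construction, variable names, and decision rule shown here are illustrative assumptions rather than the exact configuration of our detector.

import numpy as np

def cosine_assign(test_feats, centre_real, centre_fake):
    # Label each test feature as fake (1) when it is more similar, in the
    # cosine sense, to the fake cluster centre than to the real one.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b) + 1e-12)
    return (cos(test_feats, centre_fake) > cos(test_feats, centre_real)).astype(int)

# Example with random 128-D embeddings and two illustrative centres.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 128))
pred = cosine_assign(feats, rng.normal(size=128), rng.normal(size=128))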
REFERENCES

[1] B. Chen, X. Liu, Y. Zheng, G. Zhao, and Y.-Q. Shi, “A robust GAN-generated face detection method based on dual-color spaces and an improved Xception,” IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 6, pp. 3527–3538, Jun. 2022.
[2] X. Li, R. Ni, P. Yang, Z. Fu, and Y. Zhao, “Artifacts-disentangled adversarial learning for deepfake detection,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 4, pp. 1658–1670, Apr. 2023.
[3] H. Zhang, B. Chen, J. Wang, and G. Zhao, “A local perturbation generation method for GAN-generated face anti-forensics,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 2, pp. 661–676, Feb. 2023.
[4] Y. Yu, X. Liu, R. Ni, S. Yang, Y. Zhao, and A. C. Kot, “PVASS-MDD: Predictive visual-audio alignment self-supervision for multimodal deepfake detection,” IEEE Trans. Circuits Syst. Video Technol., early access, Aug. 29, 2023, doi: 10.1109/TCSVT.2023.3309899.
[5] Z. Guo, L. Wang, W. Yang, G. Yang, and K. Li, “LDFnet: Lightweight dynamic fusion network for face forgery detection by integrating local artifacts and global texture information,” IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 2, pp. 1255–1265, Feb. 2024.
[6] Y. Wang, C. Peng, D. Liu, N. Wang, and X. Gao, “Spatial-temporal frequency forgery clue for video forgery detection in VIS and NIR scenario,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 12, pp. 7943–7956, Dec. 2023.
[7] T. Qiao, J. Wu, N. Zheng, M. Xu, and X. Luo, “FGDNet: Fine-grained detection network towards face anti-spoofing,” IEEE Trans. Multimedia, vol. 25, pp. 7350–7363, 2023.
[8] L. Zhang, T. Qiao, M. Xu, N. Zheng, and S. Xie, “Unsupervised learning-based framework for deepfake video detection,” IEEE Trans. Multimedia, vol. 25, pp. 4785–4799, 2023.
[9] Z. Xia, T. Qiao, M. Xu, N. Zheng, and S. Xie, “Towards DeepFake video forensics based on facial textural disparities in multi-color channels,” Inf. Sci., vol. 607, pp. 654–669, Aug. 2022.
[10] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 4401–4410.
[11] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, “Analyzing and improving the image quality of StyleGAN,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 8110–8119.
[12] T. Karras et al., “Alias-free generative adversarial networks,” in Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021, pp. 852–863.
[13] T. Qiao et al., “A novel model watermarking for protecting generative adversarial network,” Comput. Secur., vol. 127, Apr. 2023, Art. no. 103102.
[14] S.-Y. Wang, O. Wang, R. Zhang, A. Owens, and A. A. Efros, “CNN-generated images are surprisingly easy to spot... for now,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 8692–8701.
[15] D. Gragnaniello, D. Cozzolino, F. Marra, G. Poggi, and L. Verdoliva, “Are GAN generated images easy to detect? A critical analysis of the state-of-the-art,” in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Mar. 2021, pp. 1–6.
[16] Z. Mi, X. Jiang, T. Sun, and K. Xu, “GAN-generated image detection with self-attention mechanism against GAN generator defect,” IEEE J. Sel. Topics Signal Process., vol. 14, no. 5, pp. 969–981, Aug. 2020.
[17] X. Zhang, S. Karaman, and S.-F. Chang, “Detecting and simulating artifacts in GAN fake images,” in Proc. IEEE Int. Workshop Inf. Forensics Secur. (WIFS), Aug. 2019, pp. 1–6.
[18] F. Marra, C. Saltori, G. Boato, and L. Verdoliva, “Incremental learning for the detection and classification of GAN-generated images,” in Proc. IEEE Int. Workshop Inf. Forensics Secur. (WIFS), Oct. 2019, pp. 1–6.
[19] S. Mandelli, N. Bonettini, P. Bestagini, and S. Tubaro, “Training CNNs in presence of JPEG compression: Multimedia forensics vs computer vision,” in Proc. IEEE Int. Workshop Inf. Forensics Secur. (WIFS), Dec. 2020, pp. 1–6.
[20] W. Li, P. He, H. Li, H. Wang, and R. Zhang, “Detection of GAN-generated images by estimating artifact similarity,” IEEE Signal Process. Lett., vol. 29, pp. 862–866, 2022.
[21] S. Mandelli, N. Bonettini, P. Bestagini, and S. Tubaro, “Detecting GAN-generated images by orthogonal training of multiple CNNs,” in Proc. IEEE Int. Conf. Image Process. (ICIP), Oct. 2022, pp. 3091–3095.
[22] T. Qiao et al., “CSC-Net: Cross-color spatial co-occurrence matrix network for detecting synthesized fake images,” IEEE Trans. Cognit. Develop. Syst., vol. 16, no. 1, pp. 369–379, Nov. 2024.
[23] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo, “StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8789–8797.
[24] Y. Choi, Y. Uh, J. Yoo, and J.-W. Ha, “StarGAN v2: Diverse image synthesis for multiple domains,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 8188–8197.
[25] Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen, “AttGAN: Facial attribute editing by only changing what you want,” IEEE Trans. Image Process., vol. 28, no. 11, pp. 5464–5478, Nov. 2019.
[26] (2022). Midjourney. [Online]. Available: https://www.midjourney.com
[27] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. (2022). Stable Diffusion. [Online]. Available: https://github.com/CompVis/stable-diffusion
[28] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with CLIP latents,” 2022, arXiv:2204.06125.
[29] A. Brock, J. Donahue, and K. Simonyan, “Large scale GAN training for high fidelity natural image synthesis,” 2018, arXiv:1809.11096.
[30] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of GANs for improved quality, stability, and variation,” 2017, arXiv:1710.10196.
[31] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 1125–1134.
[32] J. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2223–2232.
[33] V. Arkhipkin et al., “Kandinsky 3.0 technical report,” 2023, arXiv:2312.03511.
[34] (2023). Imagen2. [Online]. Available: https://deepmind.google/technologies/imagen-2
[35] A. Sauer, D. Lorenz, A. Blattmann, and R. Rombach, “Adversarial diffusion distillation,” 2023, arXiv:2311.17042.
[36] H. Mo, B. Chen, and W. Luo, “Fake faces identification via convolutional neural network,” in Proc. 6th ACM Workshop Inf. Hiding Multimedia Secur., 2018, pp. 43–47.
[37] L. Nataraj et al., “Detecting GAN generated fake images using co-occurrence matrices,” 2019, arXiv:1903.06836.
[38] Y. Fu, T. Sun, X. Jiang, K. Xu, and P. He, “Robust GAN-face detection based on dual-channel CNN network,” in Proc. 12th Int. Congr. Image Signal Process., Biomed. Eng. Informat. (CISP-BMEI), Oct. 2019, pp. 1–5.
[39] N. Yu, L. S. Davis, and M. Fritz, “Attributing fake images to GANs: Learning and analyzing GAN fingerprints,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., Oct. 2019, pp. 7556–7566.
[40] M. Goebel, L. Nataraj, T. Nanjundaswamy, T. Manhar Mohammed, S. Chandrasekaran, and B. S. Manjunath, “Detection, attribution and localization of GAN generated images,” 2020, arXiv:2007.10466.
[41] M. Barni, K. Kallas, E. Nowroozi, and B. Tondi, “CNN detection of GAN-generated face images based on cross-band co-occurrences analysis,” in Proc. IEEE Int. Workshop Inf. Forensics Secur. (WIFS), Jan. 2020, pp. 1–6.
[42] C. Dong, A. Kumar, and E. Liu, “Think twice before detecting GAN-generated fake images from their spectral domain imprints,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 7855–7864.
[43] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1251–1258.
[44] B. Chen, W. Tan, Y. Wang, and G. Zhao, “Distinguishing between natural and GAN-generated face images by combining global and local features,” Chin. J. Electron., vol. 31, no. 1, pp. 59–67, Jan. 2022.
[45] H. Guo, S. Hu, X. Wang, M.-C. Chang, and S. Lyu, “Robust attentive deep neural network for detecting GAN-generated faces,” IEEE Access, vol. 10, pp. 32574–32583, 2022.
[46] J. Wang, B. Tondi, and M. Barni, “An eyes-based Siamese neural network for the detection of GAN-generated face images,” Frontiers Signal Process., vol. 2, Jul. 2022, Art. no. 918725.
[47] T. Qiao, R. Shi, X. Luo, M. Xu, N. Zheng, and Y. Wu, “Statistical model-based detector via texture weight map: Application in re-sampling authentication,” IEEE Trans. Multimedia, vol. 21, no. 5, pp. 1077–1092, May 2019.
[48] J. Frank, T. Eisenhofer, L. Schönherr, A. Fischer, D. Kolossa, and T. Holz, “Leveraging frequency analysis for deep fake image recognition,” in Proc. Int. Conf. Mach. Learn., 2020, pp. 3247–3258.
[49] S. Agarwal, N. Girdhar, and H. Raghav, “A novel neural model based framework for detection of GAN generated fake images,” in Proc. 11th Int. Conf. Cloud Comput., Data Sci. Eng. (Confluence), Jan. 2021, pp. 46–51.
[50] N. Bonettini, P. Bestagini, S. Milani, and S. Tubaro, “On the use of Benford’s law to detect GAN-generated images,” in Proc. 25th Int. Conf. Pattern Recognit. (ICPR), Jan. 2021, pp. 5495–5502.
[51] R. Durall, M. Keuper, and J. Keuper, “Watch your up-convolution: CNN based generative deep neural networks are failing to reproduce spectral distributions,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 7887–7896.
[52] M. Zhang, H. Wang, P. He, A. Malik, and H. Liu, “Exposing unseen GAN-generated image using unsupervised domain adaptation,” Knowl.-Based Syst., vol. 257, Dec. 2022, Art. no. 109905.
[53] M. Zhang, H. Wang, P. He, A. Malik, and H. Liu, “Improving GAN-generated image detection generalization using unsupervised domain adaptation,” in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2022, pp. 1–6.
[54] T. Qiao, X. Luo, H. Yao, and R. Shi, “Classifying between computer generated and natural images: An empirical study from RAW to JPEG format,” J. Vis. Commun. Image Represent., vol. 85, May 2022, Art. no. 103506.
[55] S. Hu, Y. Li, and S. Lyu, “Exposing GAN-generated faces using inconsistent corneal specular highlights,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Jun. 2021, pp. 2500–2504.
[56] R. Corvi, D. Cozzolino, G. Zingarini, G. Poggi, K. Nagano, and L. Verdoliva, “On the detection of synthetic images generated by diffusion models,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Jun. 2023, pp. 1–5.
[57] R. Corvi, D. Cozzolino, G. Poggi, K. Nagano, and L. Verdoliva, “Intriguing properties of synthetic images: From generative adversarial networks to diffusion models,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2023, pp. 973–982.
[58] S. McCloskey and M. Albright, “Detecting GAN-generated imagery using color cues,” 2018, arXiv:1812.08247.
[59] F. Matern, C. Riess, and M. Stamminger, “Exploiting visual artifacts to expose deepfakes and face manipulations,” in Proc. IEEE Winter Appl. Comput. Vis. Workshops (WACVW), Jan. 2019, pp. 83–92.
[60] X. Yang, Y. Li, H. Qi, and S. Lyu, “Exposing GAN-synthesized faces using landmark locations,” in Proc. ACM Workshop Inf. Hiding Multimedia Secur., Jul. 2019, pp. 113–118.
[61] H. Guo, S. Hu, X. Wang, M.-C. Chang, and S. Lyu, “Eyes tell all: Irregular pupil shapes reveal GAN-generated faces,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2022, pp. 2904–2908.
[62] J.-C. Yen, F.-J. Chang, and S. Chang, “A new criterion for automatic multilevel thresholding,” IEEE Trans. Image Process., vol. 4, no. 3, pp. 370–378, Mar. 1995.
[63] P. Khosla et al., “Supervised contrastive learning,” in Proc. NIPS, 2020, pp. 18661–18673.
[64] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 3730–3738.
[65] T. Lin et al., “Microsoft COCO: Common objects in context,” in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2014, pp. 740–755.
[66] W. Wu, C. Gao, J. DiPalma, S. Vosoughi, and S. Hassanpour, “Improving representation learning for histopathologic images with cluster constraints,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2023, pp. 21404–21414.
[67] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 618–626.
[68] M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in Proc. 36th Int. Conf. Mach. Learn., 2019, pp. 6105–6114.
[69] A. Dosovitskiy et al., “An image is worth 16×16 words: Transformers for image recognition at scale,” 2020, arXiv:2010.11929.
[70] Z. Tu et al., “MaxViT: Multi-axis vision transformer,” in Proc. 17th Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2022, pp. 459–479.
[71] R. Grainger, T. Paniagua, X. Song, N. Cuntoor, M. W. Lee, and T. Wu, “PaCa-ViT: Learning patch-to-cluster attention in vision transformers,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 18568–18578.
Tong Qiao received the B.S. degree in electronic and information engineering from Information Engineering University, Zhengzhou, China, in 2009, the M.S. degree in communication and information systems from Shanghai University, Shanghai, China, in 2012, and the Ph.D. degree from the Laboratory of Systems Modelling and Dependability, University of Technology of Troyes, Troyes, France, in 2016. He is currently an Associate Professor with the School of Cyberspace, Hangzhou Dianzi University. He has published over 70 peer-reviewed papers in journals and conferences. His current research interests include media forensics, AI security, and data hiding.

Hang Shao received the B.S. degree in information security from Hebei University, Baoding, China, in 2019. Since 2021, he has been with the School of Cyberspace, Hangzhou Dianzi University. His current research interests include multimedia forensics and AI security.

Shichuang Xie received the B.S. degree in computer science and technology from Henan University, Kaifeng, China, in 2020, and the M.S. degree from the School of Cyberspace, Hangzhou Dianzi University, Hangzhou, Zhejiang, China, in 2023. His current research interests include multimedia forensics and AI security.

Ran Shi received the B.S. degree in electronic science and technology from Changshu Institute of Technology, Suzhou, China, in 2009, the M.S. degree in signal and information processing from Shanghai University, Shanghai, China, in 2012, and the Ph.D. degree from the Department of Electronic Engineering, The Chinese University of Hong Kong. He is currently an Associate Professor with the School of Computer Science and Engineering, Nanjing University of Science and Technology. His research interests include object segmentation, visual quality evaluation, interactive segmentation, and salient object detection.