
Improving Synthetic Image Detection Towards Generalization:

An Image Transformation Perspective


Ouxiang Li∗ (University of Science and Technology of China, Hefei, China; lioox@mail.ustc.edu.cn)
Jiayin Cai (Xiaohongshu Inc., Beijing, China; jiayin1@xiaohongshu.com)
Yanbin Hao† (University of Science and Technology of China, Hefei, China; haoyanbin@hotmail.com)
Xiaolong Jiang (Xiaohongshu Inc., Beijing, China; laige@xiaohongshu.com)
Yao Hu (Xiaohongshu Inc., Beijing, China; xiahou@xiaohongshu.com)
Fuli Feng† (University of Science and Technology of China, Hefei, China; fulifeng93@gmail.com)

arXiv:2408.06741v1 [cs.CV] 13 Aug 2024

Abstract

With recent generative models facilitating photo-realistic image synthesis, the proliferation of synthetic images has also engendered certain negative impacts on social platforms, thereby raising an urgent imperative to develop effective detectors. Current synthetic image detection (SID) pipelines are primarily dedicated to crafting universal artifact features, accompanied by an oversight of the SID training paradigm. In this paper, we re-examine the SID problem and identify two prevalent biases in current training paradigms, i.e., weakened artifact features and overfitted artifact features. Meanwhile, we discover that the imaging mechanism of synthetic images contributes to heightened local correlations among pixels, suggesting that detectors should be equipped with local awareness. In this light, we propose SAFE, a lightweight and effective detector with three simple image transformations. Firstly, for weakened artifact features, we substitute the down-sampling operator with the crop operator in image pre-processing to help circumvent artifact distortion. Secondly, for overfitted artifact features, we include ColorJitter and RandomRotation as additional data augmentations, to help alleviate irrelevant biases from color discrepancies and semantic differences in limited training samples. Thirdly, for local awareness, we propose a patch-based random masking strategy tailored for SID, forcing the detector to focus on local regions at training. Comparative experiments are conducted on an open-world dataset, comprising synthetic images generated by 26 distinct generative models. Our pipeline achieves a new state-of-the-art performance, with remarkable improvements of 4.5% in accuracy and 2.9% in average precision against existing methods. Our code is available at: https://github.com/Ouxiang-Li/SAFE.

CCS Concepts

• Security and privacy → Social aspects of security and privacy.

Keywords

Synthetic Image Detection, AIGC Detection, Security and Privacy

∗This work was done during an internship at Xiaohongshu Inc.
†Corresponding author.

1 Introduction

Recent advancements in generative models have substantially simplified the synthesis of photo-realistic images, such as Generative Adversarial Networks (GANs) [12] and diffusion models (DMs) [16]. While these technologies are thriving in low-cost image synthesis with unprecedented realism, they simultaneously pose potential societal risks, including disinformation [3], deepfakes [56], and privacy concerns [11, 32]. In this context, numerous Synthetic Image Detection (SID) pipelines have been developed, aiming to distinguish between natural (real) and synthetic (fake) images.

SID aims to identify synthetic images from various generative models, termed "generalization performance". As novel generative models are persistently developed and deployed, ensuring generalization performance on new generators is particularly critical in industry. Accordingly, current SID pipelines are dedicated to exploring universal differences between natural and synthetic images, termed "universal artifact features", to improve their generalization performance, and can be categorized into two branches. The first, image-aware branch leverages image inputs to autonomously extract universal artifact features via detection-aware modules, operating from multiple perspectives, e.g., frequency [31, 35, 50], semantics [19, 26, 38], and text [24, 58]. The second, feature-aware branch simplifies the detection pipeline into two stages, with stage 1 manually extracting universal artifact features by means of off-the-shelf models [34, 41, 48, 63, 66] or image processing operators [18, 61, 62, 70, 73] in advance, and stage 2 training the classifier on these pre-processed features.

In addition to crafting universal artifact features, we argue that generalizable SID also necessitates reasonable training strategies. For instance, current pipelines typically adopt the down-sampling operator during image pre-processing to standardize images into a uniform shape. However, this common operator is not suitable for SID, as it can distort the subtle artifacts, leading to weakened artifact features (cf. Fig. 1 (a)). Meanwhile, these pipelines also confront overfitted artifact features (cf. Fig. 1 (b)) due to monotonous data augmentation (e.g., HorizontalFlip), which is insufficient to bridge the distributional disparity between training and testing data.

[Figure 1: (a) a real and a fake image with their down-sampling, inversion, subtraction, and difference maps; (b) logit distributions for the baseline and ours.]

Figure 1: Prevalent biases observed in current SID training paradigms. (a) We reconstruct the real image using null-text inversion [45] with Stable Diffusion v1.4 [60], ensuring they are semantically consistent. We then calculate their local correlation maps¹ before and after down-sampling and subtract them for explicit comparison. It can be noticed that the fake image exhibits stronger local correlations and that the down-sampling operator indeed weakens such subtle artifacts. (b) We compare the logit distributions between the baseline (w/ HorizontalFlip only, left) and ours (right) for both seen and unseen generators. The monotonous application of HorizontalFlip is not sufficient to alleviate overfitting to training samples, resulting in extreme logit distributions for in-domain samples (e.g., ProGAN) and inferior generalization for out-of-domain samples (e.g., Midjourney).
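The local correlation maps referenced above are computed with an algorithm given in the paper's Appendix A.1, which is not part of this excerpt. Purely to illustrate the idea, the following is a minimal sketch of one plausible way to measure correlation between adjacent pixels; the window size and the use of Pearson correlation over right/bottom shifts are assumptions, not the authors' implementation.

```python
import numpy as np

def local_correlation_map(img: np.ndarray, window: int = 8) -> np.ndarray:
    """img: (H, W) grayscale array; returns a coarse map of the mean Pearson
    correlation between each pixel and its right/bottom neighbours, computed
    inside non-overlapping window x window blocks (illustrative assumption)."""
    h, w = img.shape
    h, w = h - h % window, w - w % window
    img = img[:h, :w].astype(np.float64)
    out = np.zeros((h // window, w // window))
    for i in range(0, h, window):
        for j in range(0, w, window):
            p = img[i:i + window, j:j + window]
            right = np.corrcoef(p[:, :-1].ravel(), p[:, 1:].ravel())[0, 1]
            down = np.corrcoef(p[:-1, :].ravel(), p[1:, :].ravel())[0, 1]
            out[i // window, j // window] = np.nan_to_num(0.5 * (right + down))
    return out

# Usage: comparing the map of an image before and after bilinear down-sampling
# reproduces the qualitative effect in Fig. 1(a).
```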

Afflicted by these feature-agnostic biases, such biased training paradigms can inherently prohibit current pipelines from achieving superior generalization.

In this light, we attribute the inferior generalization of existing pipelines to their oversight of SID-specific image transformations at training. Hence, we re-examine the SID problem and identify the following critical factors in effective detection:

• Artifact preservation: By analyzing the imaging mechanisms of synthetic images, we find that the widespread use of up-sampling and convolution operators in generative models contributes to heightened local correlations in synthetic images. Consequently, the down-sampling operator can inevitably weaken such local correlations by re-weighting adjacent pixel values.
• Invariant augmentation: Aiming to generalize to unseen synthetic images with limited training data, the detector has to learn the specified artifact mode without being affected by other irrelevant features. Therefore, artifact-invariant data augmentations will improve the robustness of SID detectors.
• Local awareness: Considering the inherent difference in imaging between natural and synthetic images, local awareness can help the detector focus more on local correlations in adjacent pixels, which offers a crucial clue for generalizable detection.

Building upon the above insights, we propose SAFE (Simple Preserved and Augmented FEatures), a simple and lightweight SID pipeline that integrates three effective image transformations with simple artifact features. First, w.r.t. artifact preservation, we substitute the conventional down-sampling operator in image pre-processing with the crop operator for both training and inference, which helps circumvent artifact distortion. Second, w.r.t. invariant augmentation, we introduce ColorJitter and RandomRotation as additional data augmentations, which are effective in alleviating irrelevant biases regarding color mode and semantics against limited training samples. Third, w.r.t. local awareness, we propose a novel patch-based random masking strategy tailored for SID in data augmentation, forcing the detector to focus on local regions at training. Lastly, w.r.t. artifact selection, considering the existing detectable differences between natural and synthetic images in high-frequency components [53], we introduce the Discrete Wavelet Transform (DWT) [42] to extract high-frequency features. To comprehensively evaluate the generalization performance, we benchmark our proposed pipeline across 26 generators² with access only to ProGAN [20] generations and corresponding real images at training. Extensive experiments demonstrate the effectiveness of our image transformations in SID even with simple artifact features.

¹ The detailed algorithm is presented in Appx. A.1.
² ProGAN, StyleGAN, StyleGAN2, BigGAN, CycleGAN, StarGAN, GauGAN, Deepfake, AttGAN, BEGAN, CramerGAN, InfoMaxGAN, MMDGAN, RelGAN, S3GAN, SNGAN, STGAN, DALLE, Glide, ADM, LDM, Midjourney, SDv1.4, SDv1.5, Wukong, VQDM.

Our contributions can be summarized as follows:

• We attribute the inferior generalization observed in current pipelines to their biased training paradigms without comprehensive analyses of SID-specific image transformations. This observation prompts us to re-examine the SID problem and identify crucial factors that contribute to simple yet effective detection.
• We propose SAFE, a simple and lightweight SID pipeline that integrates three effective image transformations with simple artifact features. These techniques are effective in generalizable SID by alleviating training biases and enhancing local awareness.
• Extensive experiments demonstrate the effectiveness of our proposed pipeline, exhibiting remarkable generalization performance across 26 generators, with improvements of 4.5% in accuracy and 2.9% in average precision compared against state-of-the-art (SOTA) baselines.

[Figure 2: GAN and DM synthesis pipelines composed of FC, Up-Sampling, Conv2d, Pooling, and Skip Connection blocks.]

Figure 2: The synthesis mechanisms of two common generative pipelines, i.e., GANs and DMs. Up-sampling and convolution operators are both widely used in these pipelines.

2 Related Works

In this section, we comprehensively summarize current SID pipelines, categorizing them into two branches as introduced in Sec. 1, i.e., the image-aware branch and the feature-aware branch.

2.1 Image-aware Branch

The image-aware branch aims to autonomously extract universal artifact features with meticulously crafted detection-aware modules from diverse perspectives. In the frequency domain, F3Net [50] introduces FAD, LFS, and MixBlock for feature extraction and interaction. Later, FDFL [31] and FatFormer [35] propose AFFGM and FAA blocks for adaptive frequency forgery extraction, respectively. In the semantics domain, GramNet [38] inserts Gram Blocks into different semantic levels of ResNet [14] to extract global image texture features. Fusing [19] combines global and local embeddings, with PSM extracting local embeddings and AFFM fusing features. RINE [26] leverages the image representations extracted by intermediate Transformer blocks from CLIP-ViT [9, 51] and employs a TIE module to map them into learnable forgery-aware vectors. In the text domain, DE-FAKE [58] introduces the CLIP text encoder [64] as an additional cue for detection on text-to-image models. Bi-LORA [24] leverages BLIP2 [30] combined with LoRA [17] tuning techniques to enhance the detection performance.

2.2 Feature-aware Branch

The feature-aware branch simplifies the detection pipeline into two stages with pre-processed self-crafted features. One principal paradigm introduces off-the-shelf models, attempting to extract artifacts through their pretrained knowledge. LNP [34] extracts learned noise patterns using pretrained denoising networks [72]. LGrad [63] employs pretrained CNN models as transform functions to convert images into gradients and leverages these gradients to present universal artifacts. UniFD [48] directly utilizes CLIP features for classification from the semantic perspective. DIRE [66] proposes to invert images into Gaussian noise via DDIM inversion [59] and reconstruct the noise into images using pretrained ADM [8], where the differences between source and reconstructed images are termed DIRE for detection. Similarly, [5, 41, 67] follow the same reconstruction perspective with pretrained VAEs [25] and DMs. To meet real-time requirements, the other paradigm introduces instant image processing operators, exhibiting superior promise in practical applications. FreDect [10] is the first pipeline to extract frequency artifacts via the Discrete Cosine Transform (DCT) in detecting GAN generations. Subsequently, BiHPF [18] adopts bilateral high-pass filters to amplify the effect of frequency-level artifacts. More recently, PatchCraft [73] compares rich-texture and poor-texture regions via pre-defined filter operators. FreqNet [61] extracts high-frequency representations via the Fast Fourier Transform (FFT) and its inverse transform. NPR [62] analyzes the architectures of CNN-based generators and proposes to capture the generalizable structural artifacts stemming from up-sampling operations in image synthesis.

3 Method

In this section, we elaborate on the details of our SAFE pipeline to demonstrate how simple image transformations can improve the generalization performance in SID.

3.1 Problem Formulation

The primary objective of SID is to design a universal classifier for generalizable synthetic image detection. In real-world scenarios, the detector is required to distinguish test samples from n unknown plausible sources. That is

    D_test = {S_1, S_2, ..., S_n},   S_i = {(X_j^i, y_j^i)}_{j=1}^{N_i},   (1)

where N_i is the number of images in the i-th source S_i, comprising both natural (y_j^i = 0) and synthetic (y_j^i = 1) images from a specific generative model. To achieve this goal, one can resort to training a neural network M(I; θ) parameterized by θ with constrained training data from D_train, which typically includes synthetic images generated by limited generators. Because generators are continuously iterating, it is impossible to exhaustively include them in practical training. Herein, according to different detection branches, the model input I can be either the source images X ∈ {D_train, D_test} or the pre-processed artifact features f(X) with a self-crafted processor f.
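To make the formulation concrete, the following is a minimal PyTorch sketch of how the binary classifier M(I; θ) above would typically be optimized on D_train; the backbone and the feature processor f are placeholders rather than the paper's actual components.

```python
# Sketch of the SID training objective: a binary classifier M(I; theta) trained
# on (image, label) pairs from D_train, where the input I is either the raw
# image X or a pre-processed artifact feature f(X). Backbone and feature_fn are
# placeholders, not the paper's code.
import torch
import torch.nn as nn

def train_step(model: nn.Module, feature_fn, batch, optimizer) -> float:
    """batch: (images, labels) with y = 0 for natural and y = 1 for synthetic images."""
    images, labels = batch
    inputs = feature_fn(images) if feature_fn is not None else images  # I = X or f(X)
    logits = model(inputs).squeeze(1)                                  # M(I; theta), shape (B,)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```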

3.2 Artifact Preservation

We first analyze the sources of synthetic artifacts from the perspective of imaging mechanisms for both GANs and DMs in Fig. 2. In the GAN pipeline, latent features z ∈ N(0, I) are initially transformed from low resolutions into high resolutions through a fully-connected layer. Subsequently, up-sampling (up) and convolution (conv) operators are simultaneously utilized to proceed with image synthesis. Similarly, in the DM pipeline with the UNet architecture [55], noised images X_t at timestep t are first downscaled through pooling and conv operators into low dimensions and then upscaled through up and conv operators until the noise ε_t is predicted. It can be observed that synthetic images inevitably undergo both up and conv operators during image synthesis. In numerical computation, both operators can be regarded as a weighted average of pixel values within a neighborhood. This limited receptive field naturally enhances the local correlations among adjacent pixels in synthetic images, inevitably leaving discriminative artifact features for generalizable detection.

However, we notice that current SID pipelines typically adopt the down-sampling operator with bilinear interpolation in image pre-processing to standardize images into a uniform shape. Although this straightforward operation is prevalent in conventional classification tasks, it can inadvertently smooth out the noticeable local correlations in synthetic images and thereby weaken those subtle discriminative artifacts in low-level space (see Fig. 1 (a)). To tackle this problem, we propose to replace the down-sampling operator with the crop operator, using RandomCrop at training and CenterCrop at inference. This straightforward adjustment helps retain intricate details and subtle local correlations in synthetic images, thereby improving the detector's ability to capture nuanced artifacts from input samples.

3.3 Artifact Augmentation

With the refinement of artifact preservation, we observe a noticeable improvement in intra-architecture scenarios, where training and testing images are derived from the same generative architecture, e.g., D_train → D_test: ProGAN → GANs. However, the detector still suffers from inferior generalization performance in cross-architecture scenarios, e.g., D_train → D_test: ProGAN → DMs (see Fig. 1 (b)). Given the inherent differences in architectures between GANs and DMs, we attribute this to the detector potentially overfitting certain GAN-specific features. In light of this assertion, we delve into the data augmentation setups and find that current pipelines monotonously adopt HorizontalFlip at training. We argue that this extent of data augmentation is insufficient to bridge such distributional disparity in architectures and propose the following techniques to facilitate artifact-invariant augmentation and improve local awareness, as shown in Fig. 3.

[Figure 3: CJ examples varying brightness, contrast, and saturation; RR examples at 45°, 75°, and 135°; RM examples at 25%, 50%, and 75% mask ratios.]

Figure 3: Examples of our proposed transformations in data augmentation, i.e., ColorJitter (CJ), RandomRotation (RR), and RandomMask (RM). In practice, these three augmentations are applied simultaneously along with HorizontalFlip.

ColorJitter. The training data is constrained by the limited categories (e.g., car, cat, chair) and generative models, resulting in natural distributional discrepancies between training and testing samples in color mode [29, 69]. To enhance cross-architecture generalization, we incorporate ColorJitter into data augmentation, adjusting the training distribution by jittering images in color space. The factor controlling how much to jitter images is uniformly sampled from [max(0, 1 − α), 1 + α], where α ∈ [0, 1] specifies the allowed perturbation for the brightness, contrast, and saturation channels.

RandomRotation. Considering that the local correlations among pixels are robust to the rotation operator, we introduce random rotation as an additional data augmentation. This encourages the detector to focus more on local correlations across different rotation angles, rather than on irrelevant rotation-related features (e.g., semantics). Meanwhile, this simple operation can also enhance robustness against image rotation. In implementation, we uniformly sample the rotation angle from [−β, +β], where β ∈ [0°, 180°], and fill the area outside the rotated image with zero values.

RandomMask. To improve the local awareness of the detector, we propose a random masking technique applied to training samples with a certain probability p. In this process, we set the masking patch size to d × d and the maximum mask ratio to R ∈ (0, 1). Given an input image of size H × W, we first uniformly sample the actual mask ratio r ∈ [0, R]. Next, we calculate the required number of masking patches as n = ⌊(H × W × r)/d²⌋. These n patches, filled with zero-value pixels, are then randomly applied to the image, ensuring that there is no overlap among them. We find that masking a high proportion of the input images (e.g., 75%) still achieves accurate detection during training, demonstrating that the detector is capable of distinguishing images based on the remaining unmasked regions.

3.4 Artifact Selection

Inspired by recent works on frequency analysis [10, 53, 61], which show that synthetic images still exhibit a remarkable discrepancy with natural images in high-frequency components, we simply introduce the DWT as the frequency transform function to extract high-frequency components, which is capable of preserving the spatial structure of images compared with FFT and DCT. Specifically, DWT decomposes the input image X ∈ R^{C×H×W} into 4 distinct frequency sub-bands X_LL, X_LH, X_HL, X_HH ∈ R^{C×(H/2)×(W/2)}, where "L" and "H" stand for low- and high-pass filters, respectively. Herein, we simply extract the HH-band component X_HH as the input artifact feature for training the detector, which is generalizable enough to distinguish both GAN- and DM-generated images along with our proposed image transformations above.
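As a concrete illustration of the crop-based pre-processing in Sec. 3.2 and the ColorJitter/RandomRotation/HorizontalFlip augmentations in Sec. 3.3, the sketch below uses standard torchvision transforms with the hyperparameters reported later in Sec. 4.1 (α = 0.5, β = 180°, 256² crops); the exact composition order is an assumption.

```python
# Sketch of crop-based pre-processing (Sec. 3.2) and the artifact-invariant
# augmentations (Sec. 3.3) with torchvision; ordering is an illustrative choice.
import torchvision.transforms as T

alpha, beta, crop = 0.5, 180, 256

train_transform = T.Compose([
    T.RandomCrop(crop, pad_if_needed=True),       # RandomCrop instead of down-sampling
    T.RandomHorizontalFlip(p=0.5),                # HorizontalFlip
    T.ColorJitter(brightness=alpha, contrast=alpha, saturation=alpha),  # factor ~ U[max(0,1-a), 1+a]
    T.RandomRotation(degrees=beta, fill=0),       # angle ~ U[-beta, +beta], zero-filled corners
    T.ToTensor(),
])

test_transform = T.Compose([
    T.CenterCrop(crop),                           # CenterCrop at inference
    T.ToTensor(),
])
```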
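The patch-based RandomMask augmentation of Sec. 3.3 can be sketched as a custom transform as follows; sampling patch positions on a d-aligned grid without replacement is one simple way to guarantee non-overlapping patches and is an implementation assumption.

```python
# Sketch of RandomMask (Sec. 3.3): with probability p, mask n = floor(H*W*r / d^2)
# non-overlapping d x d patches with zeros, where r ~ U[0, R].
import random
import torch

class RandomMask:
    def __init__(self, p: float = 0.5, d: int = 16, R: float = 0.75):
        self.p, self.d, self.R = p, d, R

    def __call__(self, img: torch.Tensor) -> torch.Tensor:
        """img: (C, H, W) tensor, e.g. the output of ToTensor()."""
        if random.random() > self.p:
            return img
        img = img.clone()
        _, H, W = img.shape
        d = self.d
        r = random.uniform(0.0, self.R)
        n = int(H * W * r) // (d * d)
        # d-aligned grid positions; sampling without replacement avoids overlap
        grid = [(i, j) for i in range(0, H - d + 1, d) for j in range(0, W - d + 1, d)]
        for i, j in random.sample(grid, min(n, len(grid))):
            img[:, i:i + d, j:j + d] = 0.0
        return img

# Usage: append RandomMask(p=0.5, d=16, R=0.75) after ToTensor() in the training
# transform sketched above.
```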

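For the artifact-selection step of Sec. 3.4, a minimal sketch of extracting the HH sub-band with a single-level DWT (bior1.3 wavelet, symmetric mode, as reported in Sec. 4.1) is given below; PyWavelets is assumed as the backend since the paper does not name its wavelet implementation.

```python
# Sketch of Sec. 3.4: single-level 2D DWT, keeping only the diagonal (HH) detail
# sub-band as the detector input. PyWavelets is an assumed backend.
import numpy as np
import pywt

def extract_hh(img: np.ndarray) -> np.ndarray:
    """img: (C, H, W) float array; returns the HH sub-band of shape roughly (C, H/2, W/2)."""
    _, (_, _, hh) = pywt.dwt2(img, wavelet="bior1.3", mode="symmetric", axes=(-2, -1))
    return hh.astype(np.float32)

# Usage: x_hh = extract_hh(np.asarray(tensor_img))  # fed to the classifier in place of X
```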
Table 1: Intra-architecture evaluation on GAN-based synthetic images from ForenSynths [65]. Models here are all trained on 4-class ProGAN, except for †, which is trained on the whole training set from ForenSynths, namely 20-class ProGAN. We report results in the form of ACC (%) / AP (%) and average them into ACCM (%) / APM (%) in the last column. The best and second-best results are marked in bold and underline, respectively.

Method Ref ProGAN StyleGAN StyleGAN2 BigGAN CycleGAN StarGAN GauGAN Deepfake Mean

CNNDect CVPR 2020 91.4 / 99.4 63.8 / 91.4 76.4 / 97.5 52.9 / 73.3 72.7 / 88.6 63.8 / 90.8 63.9 / 92.2 51.7 / 62.3 67.1 / 86.9
FreDect ICML 2020 90.3 / 85.2 74.5 / 72.0 73.1 / 71.4 88.7 / 86.0 75.5 / 71.2 99.5 / 99.5 69.2 / 77.4 60.7 / 49.1 78.9 / 76.5
F3Net ECCV 2020 99.4 / 100.0 92.6 / 99.7 88.0 / 99.8 65.3 / 69.9 76.4 / 84.3 100.0 / 100.0 58.1 / 56.7 63.5 / 78.8 80.4 / 86.2
BiHPF WACV 2022 90.7 / 86.2 76.9 / 75.1 76.2 / 74.7 84.9 / 81.7 81.9 / 78.9 94.4 / 94.4 69.5 / 78.1 54.4 / 54.6 78.6 / 78.0
LGrad CVPR 2023 99.9 / 100.0 94.8 / 99.9 96.0 / 99.9 82.9 / 90.7 85.3 / 94.0 99.6 / 100.0 72.4 / 79.3 58.0 / 67.9 86.1 / 91.5
UniFD CVPR 2023 99.7 / 100.0 89.0 / 98.7 83.9 / 98.4 90.5 / 99.1 87.9 / 99.8 91.4 / 100.0 89.9 / 100.0 80.2 / 90.2 89.1 / 98.3
PatchCraft† Arxiv 2023 100.0 / 100.0 93.0 / 98.9 89.7 / 97.8 95.7 / 99.3 70.0 / 85.1 100.0 / 100.0 71.9 / 81.8 58.6 / 79.6 84.9 / 92.8
FreqNet AAAI 2024 99.6 / 100.0 90.2 / 99.7 88.0 / 99.5 90.5 / 96.0 95.8 / 99.6 85.7 / 99.8 93.4 / 98.6 88.9 / 94.4 91.5 / 98.5
NPR CVPR 2024 99.8 / 100.0 96.3 / 99.8 97.3 / 100.0 87.5 / 94.5 95.0 / 99.5 99.7 / 100.0 86.6 / 88.8 77.4 / 86.2 92.5 / 96.1
FatFormer CVPR 2024 99.9 / 100.0 97.2 / 99.8 98.8 / 100.0 99.5 / 100.0 99.3 / 100.0 99.8 / 100.0 99.4 / 100.0 93.2 / 98.0 98.4 / 99.7
Ours - 99.9 / 100.0 98.0 / 99.9 98.6 / 100.0 89.7 / 95.9 98.9 / 99.8 99.9 / 100.0 91.5 / 97.2 93.1 / 97.5 96.2 / 98.8

Table 2: Intra-architecture evaluation on GAN-based synthetic images from Self-Synthesis [62].

Method Ref AttGAN BEGAN CramerGAN InfoMaxGAN MMDGAN RelGAN S3GAN SNGAN STGAN Mean

CNNDect CVPR 2020 51.1 / 83.7 50.2 / 44.9 81.5 / 97.5 71.1 / 94.7 72.9 / 94.4 53.3 / 82.1 55.2 / 66.1 62.7 / 90.4 63.0 / 92.7 62.3 / 82.9
FreDect ICML 2020 65.0 / 74.4 39.4 / 39.9 31.0 / 36.0 41.1 / 41.0 38.4 / 40.5 69.2 / 96.2 69.7 / 81.9 48.4 / 47.9 25.4 / 34.0 47.5 / 54.6
F3Net ECCV 2020 85.2 / 94.8 87.1 / 97.5 89.5 / 99.8 67.1 / 83.1 73.7 / 99.6 98.8 / 100.0 65.4 / 70.0 51.6 / 93.6 60.3 / 99.9 75.4 / 93.1
LGrad CVPR 2023 68.6 / 93.8 69.9 / 89.2 50.3 / 54.0 71.1 / 82.0 57.5 / 67.3 89.1 / 99.1 78.5 / 86.0 78.0 / 87.4 54.8 / 68.0 68.6 / 80.8
UniFD CVPR 2023 78.5 / 98.3 72.0 / 98.9 77.6 / 99.8 77.6 / 98.9 77.6 / 99.7 78.2 / 98.7 85.2 / 98.1 77.6 / 98.7 74.2 / 97.8 77.6 / 98.8
PatchCraft† Arxiv 2023 99.7 / 99.99 61.1 / 86.0 72.4 / 78.4 87.5 / 93.3 79.8 / 84.3 99.5 / 99.9 94.0 / 97.9 85.1 / 93.2 68.6 / 91.3 83.1 / 91.6
FreqNet AAAI 2024 89.8 / 98.8 98.8 / 100.0 95.2 / 98.2 94.5 / 97.3 95.2 / 98.2 100.0 / 100.0 88.3 / 94.3 85.4 / 90.5 98.8 / 100.0 94.0 / 97.5
NPR CVPR 2024 83.0 / 96.2 99.0 / 99.8 98.7 / 99.0 94.5 / 98.3 98.6 / 99.0 99.6 / 100.0 79.0 / 80.0 88.8 / 97.4 98.0 / 100.0 93.2 / 96.6
FatFormer CVPR 2024 99.3 / 99.9 99.8 / 100.0 98.3 / 100.0 98.3 / 99.9 98.3 / 100.0 99.4 / 100.0 99.0 / 99.9 98.3 / 99.9 98.7 / 99.7 98.8 / 99.9
Ours - 99.4 / 100.0 99.8 / 100.0 99.7 / 100.0 99.6 / 100.0 99.7 / 100.0 99.6 / 100.0 94.5 / 100.0 98.8 / 100.0 99.9 / 100.0 99.0 / 99.8

4 Experiments

4.1 Experimental Setups

Training datasets. Owing to the rapid advancements in generative models, we adhere to the standard protocol derived from ForenSynths [65], where training data is restricted to only one generative model and the corresponding real images. The training set consists of 20 distinct classes, each comprising 18,000 synthetic images generated by ProGAN [20] and an equal number of real images from the LSUN dataset [71]. In line with previous pipelines [35, 62, 63], we adopt the specific 4-class training setting (i.e., car, cat, chair, horse), termed 4-class ProGAN.

Testing datasets. To evaluate the generalization performance of different SID pipelines in real-world scenarios, we introduce various natural images from different sources and synthetic images generated by diverse GANs and DMs. Generally, the testing dataset comprises 4 widely-used datasets with 26 generative models:

• 8 models from ForenSynths [65]: This testset includes real images sampled from 6 datasets (i.e., LSUN [71], ImageNet [7], CelebA [37], CelebA-HQ [21], COCO [33], and FaceForensics++ [56]) and fake images derived from 8 generators³ with the same categories, where Deepfake images are partially forged from real face images.
• 9 GANs from Self-Synthesis [62]: To further enrich the existing GAN-based test scenes, an additional 9 GANs⁴ have been introduced, each generating 4,000 synthetic images. These are accompanied by an equal number of real images, providing a robust dataset for evaluating the generalization performance of SID detectors across a diverse range of GAN variants.
• 4 DMs from Ojha [48]: This testset sources real images from the LAION dataset [57] and incorporates 4 recent text-to-image DMs⁵ to generate fake images based on text descriptions. For the Glide series, the authors introduce Glide generations with varying denoising and up-sampling steps, including Glide_100_10, Glide_100_27, and Glide_50_27. In the LDM series, in addition to LDM_200 which uses 200 denoising steps, generations with classifier-free guidance (LDM_200_cfg) and with fewer denoising steps (LDM_100) are also included.
• 7 DMs and 1 GAN from GenImage [75]: This testset collects real images of 1,000 classes from the ImageNet dataset [7] and generates fake images conditioned on the same 1,000 classes with 8 SOTA generators⁶. Each test subset consists of 6,000 to 8,000 synthetic images and an equivalent number of real images. Additionally, this dataset includes synthetic images of varying dimensions, ranging from 128² to 1024², which poses an additional challenge for existing SID pipelines.

³ ProGAN [20], StyleGAN [22], StyleGAN2 [23], BigGAN [4], CycleGAN [74], StarGAN [6], GauGAN [49], Deepfake [56].
⁴ AttGAN [15], BEGAN [2], CramerGAN [1], InfoMaxGAN [27], MMDGAN [28], RelGAN [47], S3GAN [40], SNGAN [44], and STGAN [36].
⁵ DALLE [52], Glide [46], ADM [8], LDM [54].
⁶ Midjourney [43], SDv1.4 [60], SDv1.5 [60], ADM [8], Glide [46], Wukong [68], VQDM [13], BigGAN [4].

Table 3: Cross-architecture evaluation on DM-based synthetic images from Ojha [48].

Method Ref DALLE Glide_100_10 Glide_100_27 Glide_50_27 ADM LDM_100 LDM_200 LDM_200_cfg Mean

CNNDect CVPR 2020 51.8 / 61.3 53.3 / 72.9 53.0 / 71.3 54.2 / 76.0 54.9 / 66.6 51.9 / 63.7 52.0 / 64.5 51.6 / 63.1 52.8 / 67.4
FreDect ICML 2020 57.0 / 62.5 53.6 / 44.3 50.4 / 40.8 52.0 / 42.3 53.4 / 52.5 56.6 / 51.3 56.4 / 50.9 56.5 / 52.1 54.5 / 49.6
F3Net ECCV 2020 71.6 / 79.9 88.3 / 95.4 87.0 / 94.5 88.5 / 95.4 69.2 / 70.8 74.1 / 84.0 73.4 / 83.3 80.7 / 89.1 79.1 / 86.5
LGrad CVPR 2023 88.5 / 97.3 89.4 / 94.9 87.4 / 93.2 90.7 / 95.1 86.6 / 100.0 94.8 / 99.2 94.2 / 99.1 95.9 / 99.2 90.9 / 97.3
UniFD CVPR 2023 89.5 / 96.8 90.1 / 97.0 90.7 / 97.2 91.1 / 97.4 75.7 / 85.1 90.5 / 97.0 90.2 / 97.1 77.3 / 88.6 86.9 / 94.5
PatchCraft† Arxiv 2023 83.3 / 93.0 80.1 / 92.0 83.4 / 93.9 77.6 / 88.7 80.9 / 90.5 88.9 / 97.7 89.3 / 97.9 88.1 / 96.9 84.0 / 93.8
FreqNet AAAI 2024 97.2 / 99.7 87.8 / 96.0 84.4 / 96.6 86.6 / 95.8 67.2 / 75.4 97.8 / 99.9 97.4 / 99.9 97.2 / 99.9 89.5 / 95.4
NPR CVPR 2024 94.5 / 99.5 98.2 / 99.8 97.8 / 99.7 98.2 / 99.8 75.8 / 81.0 99.3 / 99.9 99.1 / 99.9 99.0 / 99.9 95.2 / 97.4
FatFormer CVPR 2024 98.7 / 99.8 94.6 / 99.5 94.1 / 99.3 94.3 / 99.2 75.9 / 91.9 98.6 / 99.8 98.5 / 99.8 94.8 / 99.2 93.7 / 98.6
Ours - 97.5 / 99.7 97.3 / 99.4 95.8 / 98.9 96.6 / 99.2 82.4 / 95.8 98.8 / 100.0 98.8 / 100.0 98.7 / 99.9 95.7 / 99.1

Table 4: Cross-architecture evaluation on DM and GAN-based synthetic images from GenImage [75].

Method Ref Midjourney SDv1.4 SDv1.5 ADM Glide Wukong VQDM BigGAN Mean

CNNDect CVPR 2020 50.1 / 53.4 50.2 / 55.8 50.3 / 56.3 53.0 / 69.2 51.7 / 66.9 51.4 / 62.4 50.1 / 53.6 69.7 / 91.8 53.3 / 63.7
FreDect ICML 2020 32.1 / 36.0 28.8 / 34.7 28.9 / 34.6 62.9 / 70.2 42.9 / 42.2 35.9 / 38.0 72.1 / 84.2 26.1 / 34.7 41.2 / 46.8
LGrad CVPR 2023 73.7 / 77.6 76.3 / 79.1 77.1 / 80.3 51.9 / 51.4 49.9 / 50.5 73.2 / 75.6 52.8 / 52.0 40.6 / 39.3 61.9 / 63.2
UniFD CVPR 2023 57.5 / 69.8 65.1 / 81.6 64.7 / 81.1 69.3 / 84.6 60.3 / 74.3 73.5 / 88.5 86.3 / 95.5 89.8 / 97.1 70.8 / 84.1
PatchCraft† Arxiv 2023 89.7 / 96.2 95.0 / 98.9 95.0 / 98.9 81.6 / 93.3 83.5 / 93.9 90.9 / 97.4 88.2 / 95.9 91.5 / 97.8 89.4 / 96.5
FreqNet AAAI 2024 69.8 / 78.9 64.2 / 74.3 64.9 / 75.6 83.3 / 91.4 81.6 / 88.8 57.7 / 66.9 81.7 / 89.6 90.5 / 94.9 74.2 / 82.6
NPR CVPR 2024 77.8 / 85.4 78.6 / 84.0 78.9 / 84.6 69.7 / 74.6 78.4 / 85.7 76.1 / 80.5 78.1 / 81.2 80.1 / 88.2 77.2 / 83.0
FatFormer CVPR 2024 56.0 / 62.7 67.7 / 81.1 68.0 / 81.0 78.4 / 91.7 87.9 / 95.9 73.0 / 85.8 86.8 / 96.9 96.7 / 99.5 76.9 / 86.8
Ours - 95.3 / 99.5 99.4 / 99.9 99.3 / 99.9 82.1 / 96.7 96.3 / 99.3 98.2 / 99.8 96.3 / 99.6 97.8 / 99.8 95.6 / 99.3

Table 5: We also compare the number of model parameters and FLOPs along with the detection performance (ACCM / APM) averaged over 33 test subsets from 26 generative models. Extra computational overheads from instant image processing operators (e.g., FFT, DCT, DWT) are too small to be counted in FLOPs.

Method Ref #Parameters #FLOPs Mean
CNNDect CVPR 2020 25.56M 5.41B 59.0 / 75.5
FreDect ICML 2020 25.56M 5.41B 55.3 / 56.8
LGrad CVPR 2023 48.61M 50.95B 76.6 / 83.1
UniFD CVPR 2023 427.62M 77.83B 81.0 / 94.1
PatchCraft† Arxiv 2023 0.12M 6.57B 85.3 / 93.6
FreqNet AAAI 2024 1.85M 3.00B 87.5 / 93.6
NPR CVPR 2024 1.44M 2.30B 89.6 / 93.4
FatFormer CVPR 2024 577.25M 127.95B 92.2 / 96.4
Ours - 1.44M 2.30B 96.7 / 99.3

Baselines. To demonstrate that simple transformations can still improve SID performance without complicated designs, we introduce 10 representative baselines for comparison, including CNNDect [65], FreDect [10], F3Net [50], BiHPF [18], LGrad [63], UniFD [48], PatchCraft [73], FreqNet [61], NPR [62], and FatFormer [35].

Evaluation metrics. The classification accuracy (ACC) and average precision (AP) are introduced as the main metrics in evaluating SID performance across various generative models. To intuitively evaluate the detection performance on GANs and DMs, we also report the averaged metrics for each test dataset, termed ACCM and APM.

Implementation details. We introduce a lightweight ResNet [14] from [62] with only 1.44M parameters to meet real-time requirements. For image pre-processing, we apply random cropping of 256² at training and center cropping of 256² at testing. In terms of data augmentations, we set α = 0.5 for ColorJitter, β = 180° for RandomRotation, and p = 0.5, d = 16, R = 75% for RandomMask. The DWT is configured with symmetric mode and the bior1.3 wavelet. The detector is trained using the AdamW optimizer [39] with a batch size of 32, learning rate of 5 × 10⁻³, and weight decay of 0.01, for 20 epochs on 4 Nvidia H800 GPUs. Besides, the warmup epoch is set to 1 and a cosine annealing scheduler is adopted for the remaining epochs.
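The optimization setup described above can be sketched as follows; the optimizer, learning rate, weight decay, and epoch counts follow the reported values, while the linear warmup shape and the SequentialLR composition are assumptions, and the 1.44M-parameter backbone is a placeholder taken from prior work [62].

```python
# Sketch of the reported training configuration: AdamW (lr 5e-3, weight decay 0.01),
# one warmup epoch followed by cosine annealing for the remaining epochs.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

def build_optimizer(model: torch.nn.Module, epochs: int = 20, warmup_epochs: int = 1):
    optimizer = AdamW(model.parameters(), lr=5e-3, weight_decay=0.01)
    warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_epochs)
    cosine = CosineAnnealingLR(optimizer, T_max=epochs - warmup_epochs)
    scheduler = SequentialLR(optimizer, [warmup, cosine], milestones=[warmup_epochs])
    return optimizer, scheduler  # call scheduler.step() once per epoch
```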

[Figure 4: ACCM and APM on GenImage for training operators BR/NR/RC versus testing operators BR/NR/RC/CC/SI.]

Figure 4: Image pre-processing ablation. We ablate different data pre-processing operators at both training and testing on GenImage. The training includes Bilinear-based Resize (BR), Nearest-based Resize (NR), and RandomCrop (RC). The testing includes BR, NR, RC, CenterCrop (CC), and Source Image (SI), where SI indicates inference w/o any pre-processing. Our pipeline with RC (training) and CC (testing) is marked with white color.

[Figure 5: ACCM and APM when combining HF and RR with RM and CJ.]

Figure 5: Image augmentation ablation. We ablate the introduced data augmentation techniques, including HorizontalFlip (HF), RandomRotation (RR), RandomMask (RM), and ColorJitter (CJ), where "+" and "−" indicate w/ and w/o a specific augmentation, respectively. Our pipeline combined with all augmentations is marked with white color.

[Figure 6: ACCM and APM on ForenSynths, Self-Synthesis, Ojha, and GenImage for the Naïve baseline and the DCT, FFT, Sobel, Laplace, and DWT-LL/LH/HL/HH extractors.]

Figure 6: Feature selection ablation. We compare the Naïve baseline (i.e., trained with source images) and different low-level feature extractors mainly adopted in frequency transforms (i.e., FFT, DCT, DWT) and edge detection (i.e., Sobel, Laplace), where LL, LH, HL, HH represent 4 distinct frequency bands in DWT. Details can be found in Appx. A.2.

Table 6: Daily average recall volume (×10³) under different precisions and the corresponding improvement (Imp.), where the model is deployed online to detect two categories (i.e., human portraits and animals).

Method P95 P98 P99
Base 92.7 12.3 9.6
Ours 135.2 19.2 14.9
Imp. +45.85% +56.10% +55.21%

4.2 Generalization Comparisons

Generalization on GAN-based testsets. We first evaluate the intra-architecture scenario (i.e., ProGAN → GANs) on two GAN-based testsets in Table 1 and Table 2. Our SAFE pipeline demonstrates competitive performance against the SOTA method FatFormer with simple image transformations and artifact features. In contrast, FatFormer ensembles CLIP semantics, frequency artifacts, and text-modality guidance together within its pipeline, which brings burdensome computations and latencies in real-world applications. Simultaneously, our results are superior to the other latest pipelines (e.g., FreqNet and NPR). Notably, our detection on Deepfake achieves considerable results with 93.1% ACC, while most pipelines generally struggle with this testset since fake images in Deepfake are partially forged from real images. This challenges the detector's ability to differentiate based on local modifications, emphasizing the necessity of local awareness in SID.

Generalization on DM-based testsets. To further evaluate the generalization performance in the cross-architecture scenario (i.e., ProGAN → DMs), we introduce another two DM-based testsets in Table 3 and Table 4. It can be observed that our SAFE pipeline achieves superior performance against all baselines under this scenario. FatFormer, which achieves SOTA performance in GAN-based detection, exhibits inferior performance in DM-based detection. This disparity suggests that its detection pipeline is more precisely tailored to GAN architectures, rendering it suboptimal for DM-based detection. Additionally, we notice a significant improvement on the GenImage testset, which incorporates the latest generative models across diverse image categories. Compared with UniFD, which uses semantic features for detection, such improvement indicates that our detector is robust to semantic variations, highlighting the significance of using low-level features rather than semantics, because iterative advancements in generative models render synthetic images increasingly indistinguishable in semantics, hindering generalizable detection from the semantic perspective.

Overall evaluation. In real-world scenarios, computational complexity and generalization performance are typically the two most crucial metrics for SID. Therefore, we compare all baselines regarding model parameters, FLOPs, and averaged detection results in Table 5. Our SAFE pipeline reaches superior performance with the fewest FLOPs, with improvements of 4.5% in ACC and 2.9% in AP. In contrast, FatFormer achieves the second-best detection performance at the cost of burdensome computational complexity, with 577.25M parameters and 127.95B FLOPs. Our improvements in both computational complexity and detection performance signify the effectiveness of the proposed image transformations even with simple artifact features in SID. Meanwhile, this also prompts us to rethink the functioning of self-crafted features in existing pipelines, questioning whether they are genuinely more generalizable for SID or merely mitigate potential biases in certain aspects.

4.3 Online Experiment

We conduct an online experiment on our production platform, with around 9 hundred pieces of user-generated content per day. Compared with the previous online detector (Base), which uses the FFT amplitude and phase spectrum, our pipeline achieves superior recall volume under the same precision in Table 6. This comparison demonstrates that our pipeline can achieve more generalizable detection on business data by mitigating training biases and enhancing local awareness.

[Figure 7: ACC and AP versus the jitter factor α, rotation angle β, patch size d, and mask ratio R.]

Figure 7: Ablation study of hyperparameters in artifact augmentations, including jitter factor α in ColorJitter (CJ), rotation angle β in RandomRotation (RR), patch size d and mask ratio R in RandomMask (RM).

[Figure 8: ACC and AP on the GenImage subsets (Midjourney, SDv1.4, SDv1.5, ADM, Glide, Wukong, VQDM, BigGAN, Mean) for Naïve, FreqNet, and NPR, each with and without our transformations.]

Figure 8: Plug & Play application with existing pipelines. We compare the Naïve baseline, FreqNet, and NPR on the GenImage testset due to its diverse range of image dimensions, generator types, and data categories.

4.4 Ablation Studies

To thoroughly comprehend the effects of our proposed image transformations along with the selected artifact features in SID, we conduct extensive ablations. Unless specified, we report ACCM and APM on all 33 test subsets.

Image pre-processing. In real-world scenarios, images inevitably undergo various operations, with resizing being the most common. In Fig. 4, we compare the detection performance trained with different image pre-processing operators (i.e., BR, NR, and RC) and report ACCM and APM on GenImage, which includes images with various dimensions of 128², 256², 512², 1024², etc. The operated image size is set to 256², except for SI, which retains the original image size. We can draw three conclusions: (1) Even though the testset undergoes resize operators, our training strategy (RC) still outperforms BR and NR, which can be attributed to the artifact-preserved training with crop operators. (2) The resize operator indeed diminishes the subtle artifact features and degrades the detection performance, necessitating the crop operator during training. (3) The similar performance between CC, RC, and SI during testing indicates that our pipeline can achieve accurate detection through center-cropped regions only, suggesting our detector is translation-robust to cropped regions of test images.

Image augmentation. We then ablate the proposed data augmentations along with HorizontalFlip (HF) in Fig. 5. It can be observed that HF exerts almost no effect on improving generalization performance, which is insufficient to mitigate overfitting in previous pipelines. In contrast, the proposed three techniques (i.e., CJ, RR, and RM) can each independently enhance the generalization performance by augmenting training images, thereby bridging the distributional disparity among synthetic images from different architectures to alleviate overfitting biases and enhance local awareness. Moreover, their benefits to generalization performance are cumulative, achieving optimal results of 96.7% ACCM and 99.3% APM when adopted in combination.

Feature selection. We also compare various artifact feature extractors commonly used in digital image processing, as shown in Fig. 6. Our findings are as follows: (1) Low-level features demonstrate superior generalization compared to the Naïve baseline trained with source images, particularly in cross-architecture scenarios. This intuitive comparison aligns with our motivation for incorporating frequency features. (2) Regarding frequency component selection, high-frequency components exhibit better generalization than low-frequency ones. The inferior ability of generative models to capture high-frequency details makes these details a useful discriminative clue. (3) For high-frequency extractors, methods such as FFT, DCT, and DWT all show favorable generalization performance, with DWT performing the best. We have integrated DWT into our pipeline because it allows direct extraction of high-frequency features in the spatial domain without additional inverse transformations and manual frequency filtering.

Hyperparameters. We empirically ablate the essential hyperparameters of our proposed artifact augmentations in Fig. 7. The results reveal that moderate levels of augmentation factors can significantly improve detection performance. Specifically, a jitter factor α around 0.4−0.6, rotation angle β up to 180°, patch size d around 16, and mask ratio R around 75% yield the best performance. These findings provide insights for optimizing augmentation strategies to improve the generalization performance. More detailed analyses can be found in Appx. B.1.

Plug & Play application. Since our method is model-agnostic, we apply our proposed image transformations to existing SID pipelines as a plug-and-play module. The comparison results, illustrated in Fig. 8, demonstrate a consistent improvement in detecting synthetic images from various generative models. This indicates that these image transformations can help the detector learn more preserved and generalizable artifact features, thereby improving its ability to capture nuanced artifacts from input samples with enhanced generalization.
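As a sketch of this plug-and-play usage, the transformations can simply be prepended to the data pipeline of an existing detector without modifying its model; the host pipeline's own transforms below are placeholders, and the RandomMask class is the one sketched after Sec. 3.4.

```python
# Sketch of plugging the SAFE transformations into an existing SID pipeline's
# data loading; existing_post_transforms stands in for whatever the host
# pipeline already applies (e.g., its normalization or feature extraction).
import torchvision.transforms as T

def plug_in_safe(existing_post_transforms: list) -> T.Compose:
    return T.Compose([
        T.RandomCrop(256, pad_if_needed=True),
        T.RandomHorizontalFlip(p=0.5),
        T.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5),
        T.RandomRotation(degrees=180, fill=0),
        T.ToTensor(),
        RandomMask(p=0.5, d=16, R=0.75),   # from the earlier sketch
        *existing_post_transforms,
    ])
```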

5 Conclusion

In this paper, we have re-examined current SID pipelines and discovered that these pipelines are inherently prohibited by biased training paradigms from achieving superior generalization. In this light, we propose a simple yet effective pipeline, SAFE, to alleviate these existing biases with three image transformations. Our pipeline integrates crop operators in image pre-processing and combines ColorJitter, RandomRotation, and RandomMask in image augmentation, with the DWT extracting high-frequency artifact features. Extensive experiments demonstrate the effectiveness of our pipeline in both computational efficiency and detection performance even with simple artifacts. This prompts us to rethink the rationale behind current pipelines dedicated to various self-crafted features, wondering whether these features are genuinely more generalizable in SID or merely mitigate potential biases in certain aspects. We hope our findings will facilitate further endeavors regarding mitigating biases in SID training paradigms before exploring self-crafted artifacts.

References

[1] Marc G Bellemare et al. 2017. The Cramer distance as a solution to biased Wasserstein gradients. arXiv preprint arXiv:1705.10743 (2017).
[2] David Berthelot et al. 2017. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717 (2017).
[3] Noémi Bontridder and Yves Poullet. 2021. The role of artificial intelligence in disinformation. Data & Policy 3 (2021), e32.
[4] Andrew Brock, Jeff Donahue, and Karen Simonyan. 2018. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096 (2018).
[5] George Cazenavette, Avneesh Sud, Thomas Leung, and Ben Usman. 2024. FakeInversion: Learning to detect images from unseen text-to-image models by inverting Stable Diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10759–10769.
[6] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. 2018. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8789–8797.
[7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255.
[8] Prafulla Dhariwal and Alexander Nichol. 2021. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems 34 (2021), 8780–8794.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
[10] Joel Frank, Thorsten Eisenhofer, Lea Schönherr, Asja Fischer, Dorothea Kolossa, and Thorsten Holz. 2020. Leveraging frequency analysis for deep fake image recognition. In International Conference on Machine Learning. PMLR, 3247–3258.
[11] Abenezer Golda, Kidus Mekonen, Amit Pandey, Anushka Singh, Vikas Hassija, Vinay Chamola, and Biplab Sikdar. 2024. Privacy and security concerns in generative AI: A comprehensive survey. IEEE Access (2024).
[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. Advances in Neural Information Processing Systems 27 (2014).
[13] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. 2022. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10696–10706.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[15] Zhenliang He et al. 2019. AttGAN: Facial attribute editing by only changing what you want. IEEE Transactions on Image Processing 28, 11 (2019), 5464–5478. https://doi.org/10.1109/TIP.2019.2916751
[16] Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33 (2020), 6840–6851.
[17] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021).
[18] Yonghyun Jeong, Doyeon Kim, Seungjai Min, Seongho Joe, Youngjune Gwon, and Jongwon Choi. 2022. BiHPF: Bilateral high-pass filters for robust deepfake detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 48–57.
[19] Yan Ju, Shan Jia, Lipeng Ke, Hongfei Xue, Koki Nagano, and Siwei Lyu. 2022. Fusing global and local features for generalized AI-synthesized image detection. In 2022 IEEE International Conference on Image Processing (ICIP). IEEE, 3465–3469.
[20] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2017. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196 (2017).
[21] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2018. Progressive growing of GANs for improved quality, stability, and variation. arXiv 2017. arXiv preprint arXiv:1710.10196 (2018), 1–26.
[22] Tero Karras, Samuli Laine, and Timo Aila. 2019. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4401–4410.
[23] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2020. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8110–8119.
[24] Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene, Abdenour Hadid, and Abdelmalik Taleb-Ahmed. 2024. Bi-LORA: A vision-language approach for synthetic image detection. arXiv preprint arXiv:2404.01959 (2024).
[25] Diederik P Kingma and Max Welling. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013).
[26] Christos Koutlis and Symeon Papadopoulos. 2024. Leveraging representations from intermediate encoder-blocks for synthetic image detection. arXiv preprint arXiv:2402.19091 (2024).
[27] Kwot Sin Lee et al. 2021. InfoMax-GAN: Improved adversarial image generation via information maximization and contrastive learning. In WACV. 3942–3952.
[28] Chun-Liang Li et al. 2017. MMD GAN: Towards deeper understanding of moment matching network. Advances in Neural Information Processing Systems 30 (2017).
[29] Haodong Li, Bin Li, Shunquan Tan, and Jiwu Huang. 2020. Identification of deep network generated images using disparities in color components. Signal Processing 174 (2020), 107616.
[30] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning. PMLR, 19730–19742.
[31] Jiaming Li, Hongtao Xie, Jiahong Li, Zhongyuan Wang, and Yongdong Zhang. 2021. Frequency-aware discriminative feature learning supervised by single-center loss for face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6458–6467.
[32] Ouxiang Li, Yanbin Hao, Zhicai Wang, Bin Zhu, Shuo Wang, Zaixi Zhang, and Fuli Feng. 2024. Model inversion attacks through target-specific conditional diffusion models. arXiv preprint arXiv:2407.11424 (2024).
[33] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 740–755.
[34] Bo Liu, Fan Yang, Xiuli Bi, Bin Xiao, Weisheng Li, and Xinbo Gao. 2022. Detecting generated images by real images. In European Conference on Computer Vision. Springer, 95–110.
[35] Huan Liu, Zichang Tan, Chuangchuang Tan, Yunchao Wei, Jingdong Wang, and Yao Zhao. 2024. Forgery-aware adaptive transformer for generalizable synthetic image detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10770–10780.
[36] Ming Liu et al. 2019. STGAN: A unified selective transfer network for arbitrary image attribute editing. In CVPR. 3673–3682.
[37] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision. 3730–3738.
[38] Zhengzhe Liu, Xiaojuan Qi, and Philip HS Torr. 2020. Global texture enhancement for fake face detection in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8060–8069.
[39] Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017).
[40] Mario Lučić et al. 2019. High-fidelity image generation with fewer labels. In ICML. PMLR, 4183–4192.
[41] Yunpeng Luo, Junlong Du, Ke Yan, and Shouhong Ding. 2024. LaRE^2: Latent reconstruction error based method for diffusion-generated image detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 17006–17015.
[42] Stephane G Mallat. 1989. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11, 7 (1989), 674–693.
[43] Midjourney. 2022. https://www.midjourney.com/home.
[44] Takeru Miyato et al. 2018. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957 (2018).
[45] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. 2022. Null-text inversion for editing real images using guided diffusion models. arXiv:2211.09794 [cs.CV] https://arxiv.org/abs/2211.09794
[46] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. 2021. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021).
[47] Weili Nie et al. 2019. RelGAN: Relational generative adversarial networks for text generation. In ICLR.
[48] Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. 2023. Towards universal fake image detectors that generalize across generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 24480–24489.
[49] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. 2019. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2337–2346.
[50] Yuyang Qian, Guojun Yin, Lu Sheng, Zixuan Chen, and Jing Shao. 2020. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In European Conference on Computer Vision. Springer, 86–103.
[51] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748–8763.
[52] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 1, 2 (2022), 3.
[53] Jonas Ricker, Simon Damm, Thorsten Holz, and Asja Fischer. 2022. Towards the detection of diffusion model deepfakes. arXiv preprint arXiv:2210.14571 (2022).
[54] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10684–10695.
[55] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer, 234–241.
[56] Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. 2019. FaceForensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1–11.
[57] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. 2021. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114 (2021).
[58] Zeyang Sha, Zheng Li, Ning Yu, and Yang Zhang. 2023. DE-FAKE: Detection and attribution of fake images generated by text-to-image generation models. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security. 3418–3432.
[59] Jiaming Song, Chenlin Meng, and Stefano Ermon. 2020. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020).
[60] Stability-AI. 2022. Stable Diffusion. https://github.com/Stability-AI/StableDiffusion.
[61] Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. 2024. Frequency-aware deepfake detection: Improving generalizability through frequency space learning. arXiv preprint arXiv:2403.07240 (2024).
[62] Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. 2024. Rethinking the up-sampling operations in CNN-based generative network for generalizable deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 28130–28139.
[63] Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, and Yunchao Wei. 2023. Learning on gradients: Generalized artifacts representation for GAN-generated images detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12105–12114.
[64] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
[65] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. 2020. CNN-generated images are surprisingly easy to spot... for now. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8695–8704.
[66] Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong
[70] Shilin Yan, Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Weidi Xie. 2024. A sanity check for AI-generated image detection. arXiv preprint arXiv:2406.19435 (2024).
[71] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. 2015. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365 (2015).
[72] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. 2020. CycleISP: Real image restoration via improved data synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2696–2705.
[73] Nan Zhong, Yiran Xu, Zhenxing Qian, and Xinpeng Zhang. 2023. Rich and poor texture contrast: A simple yet effective approach for AI-generated image detection. arXiv preprint arXiv:2311.12397 (2023).
[74] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision. 2223–2232.
[75] Mingjian Zhu, Hanting Chen, Qiangyu Yan, Xudong Huang, Guanyu Lin, Wei Li, Zhijun Tu, Hailin Hu, Jie Hu, and Yunhe Wang. 2024. GenImage: A million-scale benchmark for detecting AI-generated image. Advances in Neural Information Processing Systems 36 (2024).
Chen, and Houqiang Li. 2023. Dire for diffusion-generated image detection. In
Proceedings of the IEEE/CVF International Conference on Computer Vision. 22445–
22455.
[67] Zhenting Wang, Vikash Sehwag, Chen Chen, Lingjuan Lyu, Dimitris N Metaxas,
and Shiqing Ma. 2024. How to Trace Latent Generative Model Generated Images
without Artificial Watermark? arXiv preprint arXiv:2405.13360 (2024).
[68] Wukong. 2022. https://xihe.mindspore.cn/modelzoo/wukong.
[69] Qiang Xu, Dongmei Xu, Hao Wang, Jianye Yuan, and Zhe Wang. 2024. Color
Patterns And Enhanced Texture Learning For Detecting Computer-Generated
Images. Comput. J. (2024), bxae007.
Supplementary Material

This Appendix is organized as follows:
• In Sec. A, we elaborate on additional experimental details omitted from the main paper due to page limits.
• In Sec. B, we conduct additional experiments on hyperparameter sensitivity and detection robustness for a more comprehensive evaluation.
• In Sec. C, we summarize the plausible limitations of our pipeline, which we expect to address in future work.

A Experimental Details

A.1 Local Correlation Map
To visualize the differences in local correlations between natural and synthetic images, we introduce sliding windows to traverse the image and calculate the correlation coefficient for the pixels in each window. In our implementation, we use the Pearson correlation coefficient and set 𝑤 = 2 to meet the locality requirement. The full procedure is given in Algorithm 1, and a minimal implementation sketch follows it.

Algorithm 1: Calculate Local Correlation Map
Input: Image 𝐼 of size 𝐻 × 𝑊, window size 𝑤
Output: Correlation map 𝐶
1: Initialize 𝐶 as a zero matrix of size (𝐻 − 𝑤 + 1) × (𝑊 − 𝑤 + 1)
2: for 𝑖 ← 0 to 𝐻 − 𝑤 do
3:   for 𝑗 ← 0 to 𝑊 − 𝑤 do
4:     Extract window 𝑊 = 𝐼[𝑖 : 𝑖 + 𝑤, 𝑗 : 𝑗 + 𝑤]
5:     Compute the row-mean vector r with r_𝑘 = (1/𝑤) Σ_{𝑚=1..𝑤} 𝑊_{𝑚𝑘} for 𝑘 = 1, 2, . . . , 𝑤
6:     Compute the column-mean vector c with c_𝑘 = (1/𝑤) Σ_{𝑛=1..𝑤} 𝑊_{𝑘𝑛} for 𝑘 = 1, 2, . . . , 𝑤
7:     Compute the correlation coefficient 𝜌 between r and c
8:     𝐶[𝑖, 𝑗] ← 𝜌
9: return 𝐶
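For concreteness, the following NumPy sketch mirrors Algorithm 1 under the settings stated above (Pearson correlation, 𝑤 = 2, single-channel input). The function name, the zero-variance guard, and the use of np.corrcoef are our own illustrative choices rather than the released implementation.

import numpy as np

def local_correlation_map(image, w=2):
    """Slide a w x w window over a 2D image and correlate its per-column and per-row means.

    Returns an (H - w + 1) x (W - w + 1) map of Pearson correlation coefficients.
    """
    H, W = image.shape
    corr_map = np.zeros((H - w + 1, W - w + 1), dtype=np.float64)
    for i in range(H - w + 1):
        for j in range(W - w + 1):
            window = image[i:i + w, j:j + w].astype(np.float64)
            r = window.mean(axis=0)  # r_k: mean over rows m of W_{mk}
            c = window.mean(axis=1)  # c_k: mean over columns n of W_{kn}
            if r.std() < 1e-8 or c.std() < 1e-8:
                corr_map[i, j] = 0.0  # constant window: Pearson correlation is undefined
            else:
                corr_map[i, j] = np.corrcoef(r, c)[0, 1]
    return corr_map

With 𝑤 = 2, each window contributes a single coefficient comparing its horizontal and vertical averages, which is the locality constraint described above; larger 𝑤 would trade locality for smoother maps.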
A.2 Details of Figure 6
This figure provides a horizontal comparison of various artifact features, with a particular emphasis on high-frequency components. In addition to the DWT involved in our pipeline, we also include other commonly used frequency transforms such as FFT and DCT, along with edge-detection operators such as Sobel and Laplace, to facilitate a comprehensive comparison and analysis. Specifically, given an input image 𝑿 ∈ R^{𝐶×𝐻×𝑊}, the details of these operators are as follows:

Fast Fourier Transform (FFT). FFT transforms an image from the spatial domain to the frequency domain by decomposing it into sinusoidal waves of different frequencies, and these frequencies can be used to analyze and process the frequency characteristics of the image. In our implementation, where FFT is adopted as a high-frequency extractor, the extraction process can be formulated as

𝒇_FFT(𝑿) = IFFT(B_ℎ^FFT(FFT(𝑿))),  (2)

where the FFT operator includes both a 2D FFT of the source image and a zero-frequency shift that moves the zero-frequency component to the center of the spectrum. The high-pass filter B_ℎ^FFT is then adopted to extract high-frequency components in the form of

B_ℎ^FFT(𝑓_{𝑖,𝑗}) = 0 if |𝑖| < 𝐻/4 and |𝑗| < 𝑊/4, and 𝑓_{𝑖,𝑗} otherwise,  (3)

where (𝑖, 𝑗) indexes frequencies relative to the centered spectrum. Subsequently, the corresponding inverse transform IFFT maps the filtered frequency information back to image space, and we obtain 𝒇_FFT(𝑿) ∈ R^{𝐶×𝐻×𝑊} as the input artifact feature for FFT.

Discrete Cosine Transform (DCT). DCT decomposes the image into a combination of cosine functions to analyze its frequency components, and it is particularly suited to processing block-based image data. In our implementation, the DCT extractor with the high-frequency filter can be formulated as

𝒇_DCT(𝑿) = IDCT(B_ℎ^DCT(DCT(𝑿))),  (4)

where B_ℎ^DCT is the high-frequency filter for the transformed DCT features with a pre-defined threshold 𝛿:

B_ℎ^DCT(𝑓_{𝑖,𝑗}) = 0 if 𝑖 + 𝑗 < 𝛿, and 𝑓_{𝑖,𝑗} otherwise.  (5)

Subsequently, the inverse transform IDCT maps the high-frequency component back to image space, and 𝒇_DCT(𝑿) ∈ R^{𝐶×𝐻×𝑊} is regarded as the input artifact feature.

Sobel and Laplace. The Sobel and Laplace operators are both widely used for edge detection, identifying regions of the image with significant intensity changes, which are strongly correlated with high-frequency components. In practice, both operators are implemented through convolution with specific kernels.
The Sobel operator uses two 3 × 3 convolution kernels, one for detecting horizontal changes (𝐺_𝑥) and the other for detecting vertical changes (𝐺_𝑦):

𝐺_𝑥 = [−1 0 1; −2 0 2; −1 0 1],  𝐺_𝑦 = [−1 −2 −1; 0 0 0; 1 2 1].

The Laplace operator is a second-order differential operator used to detect edges and fine details; it locates regions of rapid intensity change by computing the second derivatives of pixel values. It commonly uses one of two 3 × 3 convolution kernels, each with distinct characteristics and applications:

[0 1 0; 1 −4 1; 0 1 0]  or  [1 1 1; 1 −8 1; 1 1 1].

These kernels are applied to the image through convolution, producing 𝒇_Sobel(𝑿), 𝒇_Laplace(𝑿) ∈ R^{𝐶×𝐻×𝑊}, which we consider as the input artifact features in our comparison. A code sketch of these extractors follows.
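For reference, the following NumPy/SciPy sketch implements the extractors above on a single-channel array (colour images can be processed channel by channel). The thresholds follow Eqs. (2)–(5); the function names, the use of SciPy and PyWavelets, the gradient-magnitude combination for Sobel, and the placeholder value of 𝛿 are our own assumptions rather than the released code.

import numpy as np
import pywt                                   # PyWavelets, assumed for the DWT variant
from scipy.fft import dctn, idctn
from scipy.ndimage import convolve

def fft_highpass(x):
    """Eqs. (2)-(3): zero out the centered low-frequency block |i| < H/4, |j| < W/4."""
    H, W = x.shape
    F = np.fft.fftshift(np.fft.fft2(x))       # move the zero frequency to the spectrum center
    ci, cj = H // 2, W // 2
    F[ci - H // 4:ci + H // 4, cj - W // 4:cj + W // 4] = 0
    return np.real(np.fft.ifft2(np.fft.ifftshift(F)))

def dct_highpass(x, delta=64):
    """Eqs. (4)-(5): suppress DCT coefficients with i + j < delta (delta is a placeholder value)."""
    D = dctn(x, norm="ortho")
    i, j = np.indices(D.shape)
    D[i + j < delta] = 0
    return idctn(D, norm="ortho")

SOBEL_GX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_GY = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)
LAPLACE_4 = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)

def sobel_edges(x):
    """Gradient magnitude from the horizontal and vertical Sobel kernels."""
    gx = convolve(x.astype(float), SOBEL_GX)
    gy = convolve(x.astype(float), SOBEL_GY)
    return np.hypot(gx, gy)

def laplace_edges(x):
    """Second-derivative response from the 4-neighbour Laplacian kernel."""
    return convolve(x.astype(float), LAPLACE_4)

def dwt_highfreq(x, wavelet="haar"):
    """High-frequency DWT feature: keep the LH/HL/HH sub-bands and zero the approximation."""
    cA, (cH, cV, cD) = pywt.dwt2(x.astype(float), wavelet)
    return pywt.idwt2((np.zeros_like(cA), (cH, cV, cD)), wavelet)

Each function maps an 𝐻 × 𝑊 array to an array of the same spatial size, matching the 𝒇(𝑿) ∈ R^{𝐶×𝐻×𝑊} features compared in Figure 6; the DWT variant is a simplification of the feature used in our pipeline.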
B Additional Experiments
In this section, we report ACC_M and AP_M on all 33 test subsets unless otherwise specified, ensuring a comprehensive comparison on both GAN-based and DM-based benchmarks.
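The aggregation behind these two numbers can be sketched as follows, assuming scikit-learn and a list of per-subset (labels, scores) pairs; the helper name and the 0.5 decision threshold are illustrative assumptions.

import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

def evaluate_subsets(subsets):
    """subsets: iterable of (labels, scores) pairs, one pair per test subset.

    Returns (ACC_M, AP_M): accuracy and average precision, each averaged over subsets.
    """
    accs, aps = [], []
    for labels, scores in subsets:
        labels, scores = np.asarray(labels), np.asarray(scores)
        accs.append(accuracy_score(labels, scores > 0.5))   # threshold fake-probability at 0.5
        aps.append(average_precision_score(labels, scores))
    return float(np.mean(accs)), float(np.mean(aps))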
B.1 Hyperparameters
We empirically ablate the essential hyperparameters of our proposed data augmentations, as shown in Fig. 7, and draw the following observations (a configuration sketch using the best-performing settings follows the list):
• Jitter factor 𝛼. The jitter factor 𝛼, employed in ColorJitter (CJ), shows a clear impact on both ACC and AP. As 𝛼 increases from 0 to 0.8, both metrics initially improve, peaking around 𝛼 = 0.5. Beyond this point, a sharp decline is observed, indicating an optimal range for 𝛼 of roughly 0.4 to 0.6. Excessive jittering deteriorates the detector's performance, highlighting the importance of moderate augmentation.
• Rotation angle 𝛽. In RandomRotation (RR), the rotation angle 𝛽 demonstrates a significant influence on detection performance. Both ACC and AP increase as the rotation angle progresses from 0° to 60°, stabilizing at high values until 150°. Notably, performance continues to improve when 𝛽 exceeds 150°, reaching its peak at 180°. This suggests that larger rotation angles, including extreme rotations, can enhance generalization performance by providing diverse training samples.
• Patch size 𝑑. For the patch size 𝑑, the ablation indicates that smaller patch sizes (𝑑 = 1, 2, 4) result in lower performance. As 𝑑 increases to 8, 16, and 32, a marked improvement in both ACC and AP is observed, with optimal performance occurring at 𝑑 = 16; further increasing the patch size to 𝑑 = 64 results in a slight decline. This finding suggests that intermediate patch sizes (𝑑 = 16) provide a balance between detail preservation and context abstraction, thereby enhancing local awareness.
• Mask ratio 𝑅. The mask ratio 𝑅 in RandomMask (RM) shows an intriguing trend. Both ACC and AP initially fluctuate with increasing 𝑅, reaching optimal performance around 75%. Beyond this point, a gradual decline is noted. This implies that a moderate masking ratio, which likely introduces sufficient variation without overwhelming the detector, is beneficial, whereas over-masking (close to 100%) can obscure critical information and negatively impact performance.
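To make these settings concrete, here is a hedged torchvision-style sketch that instantiates the three augmentations at the best-performing values above (𝛼 = 0.5, 𝛽 = 180°, 𝑑 = 16, 𝑅 = 75%). ColorJitter and RandomRotation are standard torchvision transforms; RandomMask is not, so a minimal tensor-level version is written out. The mapping of 𝛼 to the ColorJitter arguments, the crop size, and the application probabilities are assumptions rather than the released training recipe.

import torch
import torchvision.transforms as T

class RandomMask:
    """Zero out roughly a fraction R of d x d patches with probability p (illustrative)."""
    def __init__(self, patch_size=16, ratio=0.75, p=0.5):
        self.d, self.ratio, self.p = patch_size, ratio, p

    def __call__(self, img):  # img: float tensor of shape (C, H, W)
        if torch.rand(1).item() > self.p:
            return img
        c, h, w = img.shape
        gh, gw = h // self.d, w // self.d
        keep = (torch.rand(gh, gw) >= self.ratio).float()        # 0 -> masked patch
        mask = keep.repeat_interleave(self.d, 0).repeat_interleave(self.d, 1)
        out = img.clone()
        out[:, :gh * self.d, :gw * self.d] *= mask                # borders smaller than d are left intact
        return out

# Assumed correspondence between the ablated hyperparameters and the transform arguments.
alpha, beta = 0.5, 180
train_transform = T.Compose([
    T.RandomCrop(224),                                            # crop instead of resize; 224 is an assumed size
    T.ColorJitter(brightness=alpha, contrast=alpha, saturation=alpha),
    T.RandomRotation(degrees=beta),
    T.ToTensor(),
    RandomMask(patch_size=16, ratio=0.75, p=0.5),
])

Masking is applied after ToTensor so that whole 𝑑 × 𝑑 patches are zeroed on the normalized tensor rather than on the PIL image.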
B.2 Robustness to Unknown Perturbations
In real-world scenarios, images shared on public platforms are susceptible to various unknown perturbations, necessitating an evaluation of SID detectors in terms of robustness. Besides the resizing operator discussed in Sec. 4.4, other plausible operators, such as Gaussian blur, JPEG compression, and random masking, should also be taken into account for a comprehensive assessment.

Gaussian blur. Gaussian blur is ubiquitous during image transmission on the Internet. Inspired by this, we simulate the scenario where images are perturbed with Gaussian blur of varying standard deviation 𝜎, and evaluate the robustness of the detector in Fig. 9. Specifically, we additionally introduce another data augmentation from CNNDect [65], i.e., RandomGaussianBlur with standard deviation 𝜎 ∼ [0.1, 2.0] and probability 𝑝 = 0.5 at training. This technique is useful in enhancing the robustness to Gaussian blur. It can be observed that the detection performance initially remains relatively stable at low blur levels (𝜎 ≤ 0.5), indicating that our detector is resilient to minimal blurring. As the blur level increases (0.5 ≤ 𝜎 ≤ 1.0), there is a gradual decline in performance. However, when the blur level reaches 𝜎 ≥ 1.0, the detection performance levels off, with results around 85% ACC_M and 90% AP_M. This indicates that our detector maintains considerable robustness against Gaussian blur across varying intensities. Overall, the consistency of our detection pipeline under different levels of Gaussian blur underscores its reliability and suitability for practical applications.

Figure 9: Robustness to Gaussian blur perturbation with different blur sigma 𝜎 on different test datasets (panels: (a) ACC_M, (b) AP_M; curves for ForenSynths, Self-Synthesis, Ojha, GenImage, and their mean). "Ref" corresponds to the detection performance w/o Gaussian blur.

JPEG compression. Similar to Gaussian blur, we introduce RandomJPEG as an additional data augmentation from CNNDect [65], with JPEG quality factor 𝑄 ∼ [70, 100) and probability 𝑝 = 0.2 at training. As shown in Fig. 10, we observe a significant performance drop in ACC_M and AP_M. This is because JPEG compression substantially diminishes the high-frequency components of images, which contradicts the high-frequency artifacts we adopt from DWT. We will consider addressing this limitation by exploring more robust artifacts in future work.

Figure 10: Robustness to JPEG compression perturbation with different quality factors 𝑄 on different test datasets (panels: (a) ACC, (b) AP; curves for ProGAN, StyleGAN, StyleGAN2, BigGAN, StarGAN, and their mean). "Mean" refers to the average results on ForenSynths.

A sketch of the two training-time augmentations used here is given below.
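The two robustness-oriented augmentations borrowed from CNNDect [65] can be sketched as follows, assuming PIL-based implementations; the class names are our own, while the parameter ranges match the text (𝜎 ∼ [0.1, 2.0] with 𝑝 = 0.5, 𝑄 ∼ [70, 100) with 𝑝 = 0.2).

import io
import random
from PIL import Image, ImageFilter

class RandomGaussianBlur:
    """With probability p, blur with a sigma drawn uniformly from [sigma_min, sigma_max]."""
    def __init__(self, sigma=(0.1, 2.0), p=0.5):
        self.sigma, self.p = sigma, p

    def __call__(self, img):
        if random.random() < self.p:
            s = random.uniform(*self.sigma)
            img = img.filter(ImageFilter.GaussianBlur(radius=s))  # PIL's radius is treated as the blur sigma
        return img

class RandomJPEG:
    """With probability p, re-encode as JPEG with a quality factor drawn from [q_min, q_max)."""
    def __init__(self, quality=(70, 100), p=0.2):
        self.quality, self.p = quality, p

    def __call__(self, img):
        if random.random() < self.p:
            q = random.randrange(self.quality[0], self.quality[1])
            buf = io.BytesIO()
            img.convert("RGB").save(buf, format="JPEG", quality=q)
            buf.seek(0)
            img = Image.open(buf).copy()
        return img

At evaluation time, the corresponding perturbations are presumably applied deterministically at each 𝜎 or 𝑄 value along the x-axes of Figs. 9 and 10.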

Random mask. We also simulate the scenario where images are randomly masked with patches, which could visually and semantically degrade the detection performance. Fig. 11 examines the robustness of our detector trained with different patch sizes (𝑑_1 = 2, 4, 8, 16) in RandomMask under various mask ratios (𝑟 = 10%, 25%, 50%, 75%, 90%) and patch sizes (𝑑_2 = 2, 4, 8, 16, 32, 64) at inference; a sketch of this evaluation grid follows the figure.

Figure 11: Robustness to random mask perturbation with different mask ratios 𝑟 and patch sizes 𝑑_2, where 𝑑_1 and 𝑑_2 refer to the masked patch size at training and testing, respectively (panels (a)-(d) report ACC_M and (e)-(h) report AP_M for 𝑑_1 = 2, 4, 8, 16).
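A minimal sketch of the perturbation grid behind Fig. 11, reusing the illustrative RandomMask and evaluate_subsets helpers sketched earlier with the mask applied to every test image (𝑝 = 1); the detector interface is a placeholder.

mask_ratios = [0.10, 0.25, 0.50, 0.75, 0.90]
patch_sizes = [2, 4, 8, 16, 32, 64]            # d2 at inference

def masked_robustness(score_fn, subsets):
    """score_fn(img_tensor) -> fake probability; subsets: [(labels, list_of_img_tensors), ...].

    Returns {(r, d2): (ACC_M, AP_M)} over the full perturbation grid (sketch only).
    """
    results = {}
    for r in mask_ratios:
        for d2 in patch_sizes:
            perturb = RandomMask(patch_size=d2, ratio=r, p=1.0)   # applied to every test image
            scored = [
                (labels, [score_fn(perturb(img)) for img in images])
                for labels, images in subsets
            ]
            results[(r, d2)] = evaluate_subsets(scored)
    return results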
We can draw the following conclusions from this figure:
• Across all subplots, there is a consistent trend where both ACC_M and AP_M decrease as the mask ratio increases. This pattern holds regardless of the training patch size 𝑑_1. Higher mask ratios (e.g., 75% and 90%) result in more significant performance degradation, indicating that the detector's robustness diminishes as the masked region grows.
• The detector consistently demonstrates robust performance when perturbed with larger patch sizes (𝑑_2 = 16, 32, 64) across all 𝑑_1. This suggests that the detector effectively focuses on the remaining unmasked patches and draws inferences from these local features. The ability to maintain high performance despite significant masking highlights the importance of local awareness in handling such perturbations.
• In all cases, detection performance deteriorates more significantly as the mask ratio increases, particularly for smaller patches (𝑑_2 = 2, 4). This can be attributed to the fact that smaller patches are more prone to disrupting the local correlation among pixels, as discussed in Sec. 3.2. These artifacts are inherently introduced by the imaging paradigm of synthetic images, making robust detection challenging at higher mask ratios.
• Among the four detectors trained with different patch sizes 𝑑_1 = 2, 4, 8, 16, the detector with 𝑑_1 = 4 shows the most robustness under random mask perturbation. This comes at the cost of slight performance degradation on unperturbed images compared to 𝑑_1 = 16, as shown in Fig. 7 (c). The choice of 𝑑_1 at training should therefore be grounded in the specific application scenario, balancing detection accuracy and robustness.

C Limitation
Our pipeline demonstrates that simple image transformations can significantly improve generalization performance by mitigating training biases in SID. However, the adoption of simple low-level artifacts (e.g., DWT features) yields inferior robustness against plausible unknown perturbations, as these perturbations can undermine discriminative artifacts in low-level space and thereby degrade detection performance. This performance degradation is prevalent among existing SID pipelines. In addition to investigating other plausible biases during SID training, we hope to explore more robust artifacts and construct an efficient, robust detector with unbiased training paradigms in future work.
