Improving Synthetic Image Detection Towards Generalization: An Image Transformation Perspective
Ouxiang Li et al.
Abstract
With recent generative models facilitating photo-realistic image synthesis, the proliferation of synthetic images has also engendered certain negative impacts on social platforms, thereby raising an urgent imperative to develop effective detectors. Current synthetic image detection (SID) pipelines are primarily dedicated to crafting universal artifact features, accompanied by an oversight of the SID training paradigm. In this paper, we re-examine the SID problem and identify two prevalent biases in current training paradigms, i.e., weakened artifact features and overfitted artifact features. Meanwhile, we discover that the imaging mechanism of synthetic images contributes to heightened local correlations among pixels, suggesting that detectors should be equipped with local awareness. In this light, we propose SAFE, a lightweight and effective detector with three simple image transformations. Firstly, for weakened artifact features, we substitute the down-sampling operator with the crop operator in image pre-processing to help circumvent artifact distortion. Secondly, for overfitted artifact features, we include ColorJitter and RandomRotation as additional data augmentations, to help alleviate irrelevant biases from color discrepancies and semantic differences in limited training samples. Thirdly, for local awareness, we propose a patch-based random masking strategy tailored for SID, forcing the detector to focus on local regions at training. Comparative experiments are conducted on an open-world dataset, comprising synthetic images generated by 26 distinct generative models. Our pipeline achieves a new state-of-the-art performance, with remarkable improvements of 4.5% in accuracy and 2.9% in average precision against existing methods. Our code is available at: https://github.com/Ouxiang-Li/SAFE.

CCS Concepts
• Security and privacy → Social aspects of security and privacy.

Keywords
Synthetic Image Detection, AIGC Detection, Security and Privacy

1 Introduction
Recent advancements in generative models, such as Generative Adversarial Networks (GANs) [12] and diffusion models (DMs) [16], have substantially simplified the synthesis of photo-realistic images. While these technologies are thriving in low-cost image synthesis with unprecedented realism, they simultaneously pose potential societal risks, including disinformation [3], deepfakes [56], and privacy concerns [11, 32]. In this context, numerous Synthetic Image Detection (SID) pipelines have been developed, aiming to distinguish between natural (real) and synthetic (fake) images.

SID aims to identify synthetic images from various generative models, termed "generalization performance". As novel generative models are persistently developed and deployed, ensuring generalization performance to new generators is particularly critical in industry. Accordingly, current SID pipelines are dedicated to exploring universal differences between natural and synthetic images, termed "universal artifact features", to improve their generalization performance; these pipelines can be categorized into two branches. The first, image-aware branch leverages image inputs to autonomously extract universal artifact features via detection-aware modules, operating from multiple perspectives, e.g., frequency [31, 35, 50], semantics [19, 26, 38], and text [24, 58]. The second, feature-aware branch simplifies the detection pipeline into two stages, with stage 1 manually extracting universal artifact features by means of off-the-shelf models [34, 41, 48, 63, 66] or image processing operators [18, 61, 62, 70, 73] in advance, and stage 2 training the classifier with these pre-processed features.

In addition to crafting universal artifact features, we argue that generalizable SID also necessitates reasonable training strategies. For instance, current pipelines typically adopt the down-sampling operator during image pre-processing to standardize images into a uniform shape. However, this common operator is not suitable for SID, as it could potentially distort the subtle artifacts, leading to weakened artifact features (cf. Fig. 1 (a)). Meanwhile, these pipelines also confront overfitted artifact features (cf. Fig. 1 (b)) due to monotonous data augmentation (e.g., HorizontalFlip), which is insufficient to bridge the distributional disparity between training and testing data. Afflicted by these feature-agnostic biases,
∗ This work was done during an internship at Xiaohongshu Inc.
† Corresponding author.
Figure 1: Prevalent biases observed in current SID training paradigms. (a) We reconstruct the real image using null-text inversion [45] with Stable Diffusion v1.4 [60], ensuring they are semantically consistent. We then calculate their local correlation maps¹ before and after down-sampling and subtract them for explicit comparison. It can be noticed that the fake image exhibits stronger local correlations and that the down-sampling operator indeed weakens such subtle artifacts. (b) We compare the logit distributions between the baseline (w/ HorizontalFlip only, left) and ours (right) for both seen and unseen generators. The monotonous application of HorizontalFlip is not sufficient to alleviate overfitting to training samples, resulting in extreme logit distributions for in-domain samples (e.g., ProGAN) and inferior generalization for out-of-domain samples (e.g., Midjourney).
such biased training paradigms can inherently prohibit current pipelines from achieving superior generalization.

In this light, we attribute the inferior generalization of existing pipelines to their oversight of SID-specific image transformations at training. Hence, we re-examine the SID problem and identify the following critical factors in effective detection:
• Artifact preservation: By analyzing the imaging mechanisms of synthetic images, we find the widespread use of up-sampling and convolution operators in generative models contributes to heightened local correlations in synthetic images. Consequently, the down-sampling operator can inevitably weaken such local correlations via re-weighting adjacent pixel values.
• Invariant augmentation: Aiming to generalize to unseen synthetic images with limited training data, the detector has to learn the specified artifact mode without being affected by other irrelevant features. Therefore, artifact-invariant data augmentations will improve the robustness of SID detectors.
• Local awareness: Considering the inherent difference in imaging between natural and synthetic images, local awareness can facilitate the detector to focus more on local correlations in adjacent pixels, which offers a crucial clue for generalizable detection.

Building upon the above insights, we propose SAFE (Simple Preserved and Augmented FEatures), a simple and lightweight SID pipeline that integrates three effective image transformations with simple artifact features. First, w.r.t. artifact preservation, we substitute the conventional down-sampling operator in image pre-processing with the crop operator for both training and inference, which helps circumvent artifact distortion. Second, w.r.t. invariant augmentation, we introduce ColorJitter and RandomRotation as additional data augmentations, which are effective in alleviating irrelevant biases regarding color mode and semantics against limited training samples. Third, w.r.t. local awareness, we propose a novel patch-based random masking strategy tailored for SID in data augmentation, forcing the detector to focus on local regions at training. Lastly, w.r.t. artifact selection, considering the existing detectable differences between natural and synthetic images in high-frequency components [53], we introduce the Discrete Wavelet Transform (DWT) [42] to extract high-frequency features. To comprehensively evaluate the generalization performance, we benchmark our proposed pipeline across 26 generators² with only access to ProGAN [20] generations and corresponding real images at training. Extensive experiments demonstrate the effectiveness of our image transformations in SID even with simple artifact features. Our contributions can be summarized as follows:

¹ The detailed algorithm is presented in Appx. A.1.
² ProGAN, StyleGAN, StyleGAN2, BigGAN, CycleGAN, StarGAN, GauGAN, Deepfake, AttGAN, BEGAN, CramerGAN, InfoMaxGAN, MMDGAN, RelGAN, S3GAN, SNGAN, STGAN, DALLE, Glide, ADM, LDM, Midjourney, SDv1.4, SDv1.5, Wukong, VQDM.
• We attribute the inferior generalization observed in current
Table 1: Intra-architecture evaluation on GAN-based synthetic images from ForenSynths [65]. Models here are all trained on 4-class ProGAN, except for † trained on the whole training set from ForenSynths, namely 20-class ProGAN. We report the results in the form of ACC (%) / AP (%) and average them into ACCM (%) / APM (%) in the last column. The best result and the second-best result are marked in bold and underlined, respectively.
Method Ref ProGAN StyleGAN StyleGAN2 BigGAN CycleGAN StarGAN GauGAN Deepfake Mean
CNNDect CVPR 2020 91.4 / 99.4 63.8 / 91.4 76.4 / 97.5 52.9 / 73.3 72.7 / 88.6 63.8 / 90.8 63.9 / 92.2 51.7 / 62.3 67.1 / 86.9
FreDect ICML 2020 90.3 / 85.2 74.5 / 72.0 73.1 / 71.4 88.7 / 86.0 75.5 / 71.2 99.5 / 99.5 69.2 / 77.4 60.7 / 49.1 78.9 / 76.5
F3Net ECCV 2020 99.4 / 100.0 92.6 / 99.7 88.0 / 99.8 65.3 / 69.9 76.4 / 84.3 100.0 / 100.0 58.1 / 56.7 63.5 / 78.8 80.4 / 86.2
BiHPF WACV 2022 90.7 / 86.2 76.9 / 75.1 76.2 / 74.7 84.9 / 81.7 81.9 / 78.9 94.4 / 94.4 69.5 / 78.1 54.4 / 54.6 78.6 / 78.0
LGrad CVPR 2023 99.9 / 100.0 94.8 / 99.9 96.0 / 99.9 82.9 / 90.7 85.3 / 94.0 99.6 / 100.0 72.4 / 79.3 58.0 / 67.9 86.1 / 91.5
UniFD CVPR 2023 99.7 / 100.0 89.0 / 98.7 83.9 / 98.4 90.5 / 99.1 87.9 / 99.8 91.4 / 100.0 89.9 / 100.0 80.2 / 90.2 89.1 / 98.3
PatchCraft† Arxiv 2023 100.0 / 100.0 93.0 / 98.9 89.7 / 97.8 95.7 / 99.3 70.0 / 85.1 100.0 / 100.0 71.9 / 81.8 58.6 / 79.6 84.9 / 92.8
FreqNet AAAI 2024 99.6 / 100.0 90.2 / 99.7 88.0 / 99.5 90.5 / 96.0 95.8 / 99.6 85.7 / 99.8 93.4 / 98.6 88.9 / 94.4 91.5 / 98.5
NPR CVPR 2024 99.8 / 100.0 96.3 / 99.8 97.3 / 100.0 87.5 / 94.5 95.0 / 99.5 99.7 / 100.0 86.6 / 88.8 77.4 / 86.2 92.5 / 96.1
FatFormer CVPR 2024 99.9 / 100.0 97.2 / 99.8 98.8 / 100.0 99.5 / 100.0 99.3 / 100.0 99.8 / 100.0 99.4 / 100.0 93.2 / 98.0 98.4 / 99.7
Ours - 99.9 / 100.0 98.0 / 99.9 98.6 / 100.0 89.7 / 95.9 98.9 / 99.8 99.9 / 100.0 91.5 / 97.2 93.1 / 97.5 96.2 / 98.8
Table 2: Intra-architecture evaluation on GAN-based synthetic images from Self-Synthesis [62], reported as ACC (%) / AP (%).
Method Ref AttGAN BEGAN CramerGAN InfoMaxGAN MMDGAN RelGAN S3GAN SNGAN STGAN Mean
CNNDect CVPR 2020 51.1 / 83.7 50.2 / 44.9 81.5 / 97.5 71.1 / 94.7 72.9 / 94.4 53.3 / 82.1 55.2 / 66.1 62.7 / 90.4 63.0 / 92.7 62.3 / 82.9
FreDect ICML 2020 65.0 / 74.4 39.4 / 39.9 31.0 / 36.0 41.1 / 41.0 38.4 / 40.5 69.2 / 96.2 69.7 / 81.9 48.4 / 47.9 25.4 / 34.0 47.5 / 54.6
F3Net ECCV 2020 85.2 / 94.8 87.1 / 97.5 89.5 / 99.8 67.1 / 83.1 73.7 / 99.6 98.8 / 100.0 65.4 / 70.0 51.6 / 93.6 60.3 / 99.9 75.4 / 93.1
LGrad CVPR 2023 68.6 / 93.8 69.9 / 89.2 50.3 / 54.0 71.1 / 82.0 57.5 / 67.3 89.1 / 99.1 78.5 / 86.0 78.0 / 87.4 54.8 / 68.0 68.6 / 80.8
UniFD CVPR 2023 78.5 / 98.3 72.0 / 98.9 77.6 / 99.8 77.6 / 98.9 77.6 / 99.7 78.2 / 98.7 85.2 / 98.1 77.6 / 98.7 74.2 / 97.8 77.6 / 98.8
PatchCraft† Arxiv 2023 99.7 / 99.99 61.1 / 86.0 72.4 / 78.4 87.5 / 93.3 79.8 / 84.3 99.5 / 99.9 94.0 / 97.9 85.1 / 93.2 68.6 / 91.3 83.1 / 91.6
FreqNet AAAI 2024 89.8 / 98.8 98.8 / 100.0 95.2 / 98.2 94.5 / 97.3 95.2 / 98.2 100.0 / 100.0 88.3 / 94.3 85.4 / 90.5 98.8 / 100.0 94.0 / 97.5
NPR CVPR 2024 83.0 / 96.2 99.0 / 99.8 98.7 / 99.0 94.5 / 98.3 98.6 / 99.0 99.6 / 100.0 79.0 / 80.0 88.8 / 97.4 98.0 / 100.0 93.2 / 96.6
FatFormer CVPR 2024 99.3 / 99.9 99.8 / 100.0 98.3 / 100.0 98.3 / 99.9 98.3 / 100.0 99.4 / 100.0 99.0 / 99.9 98.3 / 99.9 98.7 / 99.7 98.8 / 99.9
Ours - 99.4 / 100.0 99.8 / 100.0 99.7 / 100.0 99.6 / 100.0 99.7 / 100.0 99.6 / 100.0 94.5 / 100.0 98.8 / 100.0 99.9 / 100.0 99.0 / 99.8
model and corresponding real images. The training set consists of 20 distinct classes, each comprising 18,000 synthetic images generated by ProGAN [20] and an equal number of real images from the LSUN dataset [71]. In line with previous pipelines [35, 62, 63], we adopt the specific 4-class training setting (i.e., car, cat, chair, horse), termed 4-class ProGAN.

Testing datasets. To evaluate the generalization performance of different SID pipelines in real-world scenarios, we introduce various natural images from different sources and synthetic images generated by diverse GANs and DMs. Generally, the testing dataset comprises 4 widely-used datasets with 26 generative models:
• 8 models from ForenSynths [65]: This testset includes real images sampled from 6 datasets (i.e., LSUN [71], ImageNet [7], CelebA [37], CelebA-HQ [21], COCO [33], and FaceForensics++ [56]) and fake images derived from 8 generators³ with the same categories, where Deepfake images are partially forged from real face images.
• 9 GANs from Self-Synthesis [62]: To further enrich the existing GAN-based test scenes, an additional 9 GANs⁴ have been introduced, each generating 4,000 synthetic images. These are accompanied by an equal number of real images, providing a robust dataset for evaluating the generalization performance of SID detectors across a diverse range of GAN variants.
• 4 DMs from Ojha [48]: This testset sources real images from the LAION dataset [57] and incorporates 4 recent text-to-image DMs⁵ to generate fake images based on text descriptions. For the Glide series, the authors introduce Glide generations with varying denoising and up-sampling steps, including Glide_100_10, Glide_100_27, and Glide_50_27. In the LDM series, in addition to LDM_200, which uses 200 denoising steps, generations with classifier-free guidance (LDM_200_cfg) and with fewer denoising steps (LDM_100) are also included.
• 7 DMs and 1 GAN from GenImage [75]: This testset collects real images of 1,000 classes from the ImageNet dataset [7] and generates fake images conditioned on the same 1,000 classes with 8 SOTA generators⁶. Each test subset consists of 6,000 to 8,000 synthetic images and an equivalent number of real images. Additionally, this dataset includes synthetic images of varying dimensions, ranging from 128² to 1024², which poses an additional challenge for existing SID pipelines.

³ ProGAN [20], StyleGAN [22], StyleGAN2 [23], BigGAN [4], CycleGAN [74], StarGAN [6], GauGAN [49], Deepfake [56].
⁴ AttGAN [15], BEGAN [2], CramerGAN [1], InfoMaxGAN [27], MMDGAN [28], RelGAN [47], S3GAN [40], SNGAN [44], and STGAN [36].
⁵ DALLE [52], Glide [46], ADM [8], LDM [54].
⁶ Midjourney [43], SDv1.4 [60], SDv1.5 [60], ADM [8], Glide [46], Wukong [68], VQDM [13], BigGAN [4].
Table 3: Cross-architecture evaluation on DM-based synthetic images from Ojha [48], reported as ACC (%) / AP (%).
Method Ref DALLE Glide_100_10 Glide_100_27 Glide_50_27 ADM LDM_100 LDM_200 LDM_200_cfg Mean
CNNDect CVPR 2020 51.8 / 61.3 53.3 / 72.9 53.0 / 71.3 54.2 / 76.0 54.9 / 66.6 51.9 / 63.7 52.0 / 64.5 51.6 / 63.1 52.8 / 67.4
FreDect ICML 2020 57.0 / 62.5 53.6 / 44.3 50.4 / 40.8 52.0 / 42.3 53.4 / 52.5 56.6 / 51.3 56.4 / 50.9 56.5 / 52.1 54.5 / 49.6
F3Net ECCV 2020 71.6 / 79.9 88.3 / 95.4 87.0 / 94.5 88.5 / 95.4 69.2 / 70.8 74.1 / 84.0 73.4 / 83.3 80.7 / 89.1 79.1 / 86.5
LGrad CVPR 2023 88.5 / 97.3 89.4 / 94.9 87.4 / 93.2 90.7 / 95.1 86.6 / 100.0 94.8 / 99.2 94.2 / 99.1 95.9 / 99.2 90.9 / 97.3
UniFD CVPR 2023 89.5 / 96.8 90.1 / 97.0 90.7 / 97.2 91.1 / 97.4 75.7 / 85.1 90.5 / 97.0 90.2 / 97.1 77.3 / 88.6 86.9 / 94.5
PatchCraft† Arxiv 2023 83.3 / 93.0 80.1 / 92.0 83.4 / 93.9 77.6 / 88.7 80.9 / 90.5 88.9 / 97.7 89.3 / 97.9 88.1 / 96.9 84.0 / 93.8
FreqNet AAAI 2024 97.2 / 99.7 87.8 / 96.0 84.4 / 96.6 86.6 / 95.8 67.2 / 75.4 97.8 / 99.9 97.4 / 99.9 97.2 / 99.9 89.5 / 95.4
NPR CVPR 2024 94.5 / 99.5 98.2 / 99.8 97.8 / 99.7 98.2 / 99.8 75.8 / 81.0 99.3 / 99.9 99.1 / 99.9 99.0 / 99.9 95.2 / 97.4
FatFormer CVPR 2024 98.7 / 99.8 94.6 / 99.5 94.1 / 99.3 94.3 / 99.2 75.9 / 91.9 98.6 / 99.8 98.5 / 99.8 94.8 / 99.2 93.7 / 98.6
Ours - 97.5 / 99.7 97.3 / 99.4 95.8 / 98.9 96.6 / 99.2 82.4 / 95.8 98.8 / 100.0 98.8 / 100.0 98.7 / 99.9 95.7 / 99.1
Table 4: Cross-architecture evaluation on DM and GAN-based synthetic images from GenImage [75].
Method Ref Midjourney SDv1.4 SDv1.5 ADM Glide Wukong VQDM BigGAN Mean
CNNDect CVPR 2020 50.1 / 53.4 50.2 / 55.8 50.3 / 56.3 53.0 / 69.2 51.7 / 66.9 51.4 / 62.4 50.1 / 53.6 69.7 / 91.8 53.3 / 63.7
FreDect ICML 2020 32.1 / 36.0 28.8 / 34.7 28.9 / 34.6 62.9 / 70.2 42.9 / 42.2 35.9 / 38.0 72.1 / 84.2 26.1 / 34.7 41.2 / 46.8
LGrad CVPR 2023 73.7 / 77.6 76.3 / 79.1 77.1 / 80.3 51.9 / 51.4 49.9 / 50.5 73.2 / 75.6 52.8 / 52.0 40.6 / 39.3 61.9 / 63.2
UniFD CVPR 2023 57.5 / 69.8 65.1 / 81.6 64.7 / 81.1 69.3 / 84.6 60.3 / 74.3 73.5 / 88.5 86.3 / 95.5 89.8 / 97.1 70.8 / 84.1
PatchCraft† Arxiv 2023 89.7 / 96.2 95.0 / 98.9 95.0 / 98.9 81.6 / 93.3 83.5 / 93.9 90.9 / 97.4 88.2 / 95.9 91.5 / 97.8 89.4 / 96.5
FreqNet AAAI 2024 69.8 / 78.9 64.2 / 74.3 64.9 / 75.6 83.3 / 91.4 81.6 / 88.8 57.7 / 66.9 81.7 / 89.6 90.5 / 94.9 74.2 / 82.6
NPR CVPR 2024 77.8 / 85.4 78.6 / 84.0 78.9 / 84.6 69.7 / 74.6 78.4 / 85.7 76.1 / 80.5 78.1 / 81.2 80.1 / 88.2 77.2 / 83.0
FatFormer CVPR 2024 56.0 / 62.7 67.7 / 81.1 68.0 / 81.0 78.4 / 91.7 87.9 / 95.9 73.0 / 85.8 86.8 / 96.9 96.7 / 99.5 76.9 / 86.8
Ours - 95.3 / 99.5 99.4 / 99.9 99.3 / 99.9 82.1 / 96.7 96.3 / 99.3 98.2 / 99.8 96.3 / 99.6 97.8 / 99.8 95.6 / 99.3
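The tables above report ACC / AP per generator and their per-dataset means (ACCM / APM). For reference only, a minimal sketch of how such metrics could be computed is given below; the use of scikit-learn and a 0.5 threshold on the predicted fake probability are assumptions of this sketch, not details stated in the paper.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

def subset_metrics(y_true, y_score, threshold=0.5):
    """ACC / AP for one test subset.
    y_true: 0 = real, 1 = fake; y_score: predicted probability of being fake."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    acc = accuracy_score(y_true, (y_score >= threshold).astype(int))
    ap = average_precision_score(y_true, y_score)
    return acc, ap

def dataset_means(per_subset):
    """ACCM / APM: unweighted mean of per-subset (ACC, AP) pairs within one test dataset."""
    accs, aps = zip(*per_subset)
    return float(np.mean(accs)), float(np.mean(aps))
```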
Table 5: We also compare the number of model parameters and FLOPs along with the detection performance (ACCM / APM) averaged over 33 test subsets from 26 generative models. Extra computational overheads from instant image processing operators (e.g., FFT, DCT, DWT) are too small to be counted in FLOPs.

Method Ref #Parameters #FLOPs Mean
CNNDect CVPR 2020 25.56M 5.41B 59.0 / 75.5
FreDect ICML 2020 25.56M 5.41B 55.3 / 56.8
LGrad CVPR 2023 48.61M 50.95B 76.6 / 83.1
UniFD CVPR 2023 427.62M 77.83B 81.0 / 94.1
PatchCraft† Arxiv 2023 0.12M 6.57B 85.3 / 93.6
FreqNet AAAI 2024 1.85M 3.00B 87.5 / 93.6
NPR CVPR 2024 1.44M 2.30B 89.6 / 93.4
FatFormer CVPR 2024 577.25M 127.95B 92.2 / 96.4
Ours - 1.44M 2.30B 96.7 / 99.3

Baselines. To demonstrate that simple transformations can still improve SID performance without complicated designs, we introduce 10 representative baselines for comparison, including CNNDect [65], FreDect [10], F3Net [50], BiHPF [18], LGrad [63], UniFD [48], PatchCraft [73], FreqNet [61], NPR [62], and FatFormer [35].

Evaluation metrics. The classification accuracy (ACC) and average precision (AP) are introduced as the main metrics in evaluating the SID performance across various generative models. To intuitively evaluate the detection performance on GANs and DMs, we also report the averaged metrics for each test dataset, termed ACCM and APM.

Implementation details. We introduce a lightweight ResNet [14] from [62] with only 1.44M parameters to meet real-time requirements. For image pre-processing, we apply random cropping of 256² at training and center cropping of 256² at testing. In terms of data augmentations, we set α = 0.5 for ColorJitter, β = 180° for RandomRotation, and p = 0.5, d = 16, R = 75% for RandomMask. The DWT is configured with symmetric mode and the bior1.3 wavelet. The detector is trained using the AdamW optimizer [39] with a batch size of 32, a learning rate of 5 × 10⁻³, and a weight decay of 0.01, for 20 epochs on 4 Nvidia H800 GPUs. Besides, the warmup epoch is set to 1 and a cosine annealing scheduler is adopted for the remaining epochs.
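To make the pre-processing and augmentation settings above concrete, the following is a minimal sketch of how they could be composed with torchvision. The mapping of the jitter factor α onto ColorJitter's brightness/contrast/saturation arguments and the PatchRandomMask implementation are illustrative assumptions rather than the authors' released code.

```python
import random
import torch
from torchvision import transforms

class PatchRandomMask:
    """Illustrative patch-based random masking (assumed implementation): with
    probability p, split a (C, H, W) tensor into d x d patches and zero out a
    fraction `ratio` of them, forcing the detector to rely on local regions."""
    def __init__(self, p=0.5, d=16, ratio=0.75):
        self.p, self.d, self.ratio = p, d, ratio

    def __call__(self, x):
        if random.random() > self.p:
            return x
        _, h, w = x.shape                      # assumes h and w are multiples of d
        keep = torch.rand(h // self.d, w // self.d) >= self.ratio
        mask = keep.repeat_interleave(self.d, 0).repeat_interleave(self.d, 1)
        return x * mask.to(x.dtype)

# Training-time pipeline: crop instead of resize, plus the three augmentations.
train_tf = transforms.Compose([
    transforms.RandomCrop(256, pad_if_needed=True),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5),  # alpha = 0.5 (assumed mapping)
    transforms.RandomRotation(degrees=180),                                # beta = 180 degrees
    transforms.ToTensor(),
    PatchRandomMask(p=0.5, d=16, ratio=0.75),
])

# Test-time pipeline: center crop only, no resizing.
test_tf = transforms.Compose([transforms.CenterCrop(256), transforms.ToTensor()])
```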
4.2 Generalization Comparisons
Generalization on GAN-based testsets. We first evaluate the intra-architecture scenario (i.e., ProGAN → GANs) on two GAN-based testsets in Table 1 and Table 2. Our SAFE pipeline demonstrates competitive performance against the SOTA method FatFormer with simple image transformations and artifact features. In contrast, FatFormer ensembles CLIP semantics, frequency artifacts, and text-modality guidance together within its pipeline, which brings burdensome computations and latencies in real-world applications. Simultaneously, our results are superior to the other latest pipelines (e.g., FreqNet and NPR). Notably, our detection on Deepfake achieves considerable results with 93.1% ACC, while most pipelines are generally struggling with this testset since fake images in Deepfake are partially forged from real images. This challenges
Figure 4: Image pre-processing ablation. We ablate different data pre-processing operators at both training and testing on GenImage. The training includes Bilinear-based Resize (BR), Nearest-based Resize (NR), and RandomCrop (RC). The testing includes BR, NR, RC, CenterCrop (CC), and Source Image (SI), where SI indicates inference w/o any pre-processing. Our pipeline with RC (training) and CC (testing) is marked with white color.

Figure 6: Feature selection ablation. We compare the Naïve baseline (i.e., trained with source images) and different low-level feature extractors mainly adopted in frequency transforms (i.e., FFT, DCT, DWT) and edge detection (i.e., Sobel, Laplace), where LL, LH, HL, HH represent 4 distinct frequency bands in DWT. Details can be found in Appx. A.2.

Table 6: Daily average recall volume (×10³) under different

Figure 7: Ablation study of hyperparameters in artifact augmentations, including jitter factor α in ColorJitter (CJ), rotation angle β in RandomRotation (RR), patch size d and mask ratio R in RandomMask (RM).
Figure 8: Plug & Play application with existing pipelines. We compare the Naïve baseline, FreqNet, and NPR on the GenImage testset due to its diverse range of image dimensions, generator types, and data categories.

4.4 Ablation Studies
To thoroughly comprehend the effects of our proposed image transformations along with selected artifact features in SID, we conduct extensive ablations. Unless specified, we report ACCM and APM on all 33 test subsets.

Image pre-processing. In real-world scenarios, images inevitably undergo various operations, with resizing being the most common. In Fig. 4, we compare the detection performance trained with different image pre-processing operators (i.e., BR, NR, and RC) and report ACCM and APM on GenImage, which includes images with various dimensions of 128², 256², 512², and 1024², etc. The operation image size is set to 256² except for SI, which retains the original image size. We can draw three conclusions: (1) Even though the testset undergoes resize operators, our training strategy (RC) still achieves superior performance over BR and NR, which can be attributed to the artifact-preserved training with crop operators. (2) The resize operator indeed diminishes the subtle artifact features and degrades the detection performance, necessitating the crop operator during training. (3) The similar performance between CC, RC, and SI during testing indicates that our pipeline can achieve accurate detection through center-cropped regions only, suggesting our detector is translation-robust to cropped regions of test images.

Image augmentation. We then ablate the proposed data augmentations along with HorizontalFlip (HF) in Fig. 5. It can be observed that HF exerts almost no effect on improving generalization performance, which is insufficient to mitigate overfitting in previous pipelines. In contrast, the proposed three techniques (i.e., CJ, RR, and RM) can each independently enhance the generalization performance by augmenting training images, thereby bridging the distributional disparity among synthetic images from different architectures to alleviate overfitting biases and enhance local awareness. Moreover, their benefits to generalization performance are cumulative, achieving optimal results of 96.7% ACCM and 99.3% APM when adopted in combination.

Feature selection. We also compare various artifact feature extractors commonly used in digital image processing, as shown in Fig. 6. Our findings are as follows: (1) Low-level features demonstrate superior generalization compared to the Naïve baseline trained with source images, particularly in cross-architecture scenarios. This intuitive comparison aligns with our motivation for incorporating frequency features. (2) Regarding frequency component selection, high-frequency components exhibit better generalization than low-frequency ones. The inferior capability of generative models in capturing high-frequency details makes these details a useful discriminative clue. (3) For high-frequency extractors, methods such as FFT, DCT, and DWT all show favorable generalization performance, with DWT performing the best. We have integrated DWT into our pipeline because it allows direct extraction of high-frequency features in the spatial domain without additional inverse transformations and manual frequency filtering.

Hyperparameters. We empirically ablate the essential hyperparameters of our proposed artifact augmentations in Fig. 7. The results reveal that moderate levels of augmentation factors can significantly improve detection performance. Specifically, a jitter factor α around 0.4–0.6, a rotation angle β up to 180°, a patch size d around 16, and a mask ratio R around 75% yield the best performance. These findings provide insights for optimizing augmentation strategies to improve the generalization performance. More detailed analyses can be found in Appx. B.1.

Plug & Play application. Since our method is model-agnostic, we apply our proposed image transformations to existing SID pipelines as a plug-and-play module. The comparison results, illustrated in Fig. 8, demonstrate a consistent improvement in detecting synthetic images from various generative models. This indicates that these image transformations can help the detector learn more preserved and generalizable artifact features, thereby improving its ability to capture nuanced artifacts from input samples with enhanced generalization.
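As a concrete illustration of the DWT high-frequency feature adopted in our pipeline (bior1.3 wavelet, symmetric padding, HH band, cf. the implementation details and the feature selection ablation), a minimal per-channel extractor could look as follows, assuming the PyWavelets library; whether the detector consumes the raw half-resolution HH band or some further processing of it is an assumption of this sketch.

```python
import numpy as np
import pywt

def dwt_hh(image):
    """Per-channel single-level 2D DWT (bior1.3, symmetric padding), keeping only
    the HH (diagonal high-frequency) sub-band as the low-level artifact feature."""
    # image: float array of shape (C, H, W)
    bands = []
    for channel in image:
        _, (_, _, hh) = pywt.dwt2(channel, wavelet='bior1.3', mode='symmetric')
        bands.append(hh)
    return np.stack(bands)  # shape (C, ~H/2, ~W/2)
```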
5 Conclusion
In this paper, we have re-examined current SID pipelines and discovered that these pipelines are inherently prohibited by biased training paradigms from superior generalization. In this light, we propose a simple yet effective pipeline, SAFE, to alleviate these existing biases with three image transformations. Our pipeline integrates crop operators in image pre-processing and combines ColorJitter, RandomRotation, and RandomMask in image augmentation, with the DWT extracting high-frequency artifact features. Extensive experiments demonstrate the effectiveness of our pipeline in both computational efficiency and detection performance even with simple artifacts. This prompts us to rethink the rationale behind current pipelines dedicated to various self-crafted features, wondering whether these features are genuinely more generalizable in SID or merely mitigate potential biases in certain aspects. We hope our findings will facilitate further endeavors regarding mitigating biases in SID training paradigms before exploring self-crafted artifacts.

References
[1] Marc G Bellemare et al. 2017. The cramer distance as a solution to biased wasserstein gradients. arXiv preprint arXiv:1705.10743 (2017).
[2] David Berthelot et al. 2017. Began: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717 (2017).
[3] Noémi Bontridder and Yves Poullet. 2021. The role of artificial intelligence in disinformation. Data & Policy 3 (2021), e32.
[4] Andrew Brock, Jeff Donahue, and Karen Simonyan. 2018. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096 (2018).
[5] George Cazenavette, Avneesh Sud, Thomas Leung, and Ben Usman. 2024. FakeInversion: Learning to Detect Images from Unseen Text-to-Image Models by Inverting Stable Diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10759–10769.
[6] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. 2018. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 8789–8797.
[7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition. IEEE, 248–255.
[8] Prafulla Dhariwal and Alexander Nichol. 2021. Diffusion models beat gans on image synthesis. Advances in neural information processing systems 34 (2021), 8780–8794.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
[10] Joel Frank, Thorsten Eisenhofer, Lea Schönherr, Asja Fischer, Dorothea Kolossa, and Thorsten Holz. 2020. Leveraging frequency analysis for deep fake image recognition. In International conference on machine learning. PMLR, 3247–3258.
[11] Abenezer Golda, Kidus Mekonen, Amit Pandey, Anushka Singh, Vikas Hassija, Vinay Chamola, and Biplab Sikdar. 2024. Privacy and Security Concerns in Generative AI: A Comprehensive Survey. IEEE Access (2024).
[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. Advances in neural information processing systems 27 (2014).
[13] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. 2022. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10696–10706.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
[15] Zhenliang He et al. 2019. AttGAN: Facial Attribute Editing by Only Changing What You Want. IEEE Transactions on Image Processing 28, 11 (2019), 5464–5478. https://doi.org/10.1109/TIP.2019.2916751
[16] Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems 33 (2020), 6840–6851.
[17] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021).
[18] Yonghyun Jeong, Doyeon Kim, Seungjai Min, Seongho Joe, Youngjune Gwon, and Jongwon Choi. 2022. Bihpf: Bilateral high-pass filters for robust deepfake detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 48–57.
[19] Yan Ju, Shan Jia, Lipeng Ke, Hongfei Xue, Koki Nagano, and Siwei Lyu. 2022. Fusing global and local features for generalized ai-synthesized image detection. In 2022 IEEE International Conference on Image Processing (ICIP). IEEE, 3465–3469.
[20] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2017. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196 (2017).
[21] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2018. Progressive growing of gans for improved quality, stability, and variation. arXiv 2017. arXiv preprint arXiv:1710.10196 (2018), 1–26.
[22] Tero Karras, Samuli Laine, and Timo Aila. 2019. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 4401–4410.
[23] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2020. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 8110–8119.
[24] Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene, Abdenour Hadid, and Abdelmalik Taleb-Ahmed. 2024. Bi-LORA: A Vision-Language Approach for Synthetic Image Detection. arXiv preprint arXiv:2404.01959 (2024).
[25] Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).
[26] Christos Koutlis and Symeon Papadopoulos. 2024. Leveraging Representations from Intermediate Encoder-blocks for Synthetic Image Detection. arXiv preprint arXiv:2402.19091 (2024).
[27] Kwot Sin Lee et al. 2021. Infomax-gan: Improved adversarial image generation via information maximization and contrastive learning. In WACV. 3942–3952.
[28] Chun-Liang Li et al. 2017. Mmd gan: Towards deeper understanding of moment matching network. Advances in neural information processing systems 30 (2017).
[29] Haodong Li, Bin Li, Shunquan Tan, and Jiwu Huang. 2020. Identification of deep network generated images using disparities in color components. Signal Processing 174 (2020), 107616.
[30] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning. PMLR, 19730–19742.
[31] Jiaming Li, Hongtao Xie, Jiahong Li, Zhongyuan Wang, and Yongdong Zhang. 2021. Frequency-aware discriminative feature learning supervised by single-center loss for face forgery detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 6458–6467.
[32] Ouxiang Li, Yanbin Hao, Zhicai Wang, Bin Zhu, Shuo Wang, Zaixi Zhang, and Fuli Feng. 2024. Model Inversion Attacks Through Target-Specific Conditional Diffusion Models. arXiv preprint arXiv:2407.11424 (2024).
[33] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 740–755.
[34] Bo Liu, Fan Yang, Xiuli Bi, Bin Xiao, Weisheng Li, and Xinbo Gao. 2022. Detecting generated images by real images. In European Conference on Computer Vision. Springer, 95–110.
[35] Huan Liu, Zichang Tan, Chuangchuang Tan, Yunchao Wei, Jingdong Wang, and Yao Zhao. 2024. Forgery-aware adaptive transformer for generalizable synthetic image detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10770–10780.
[36] Ming Liu et al. 2019. Stgan: A unified selective transfer network for arbitrary image attribute editing. In CVPR. 3673–3682.
[37] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision. 3730–3738.
[38] Zhengzhe Liu, Xiaojuan Qi, and Philip HS Torr. 2020. Global texture enhancement for fake face detection in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 8060–8069.
[39] Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017).
[40] Mario Lučić et al. 2019. High-fidelity image generation with fewer labels. In ICML. PMLR, 4183–4192.
[41] Yunpeng Luo, Junlong Du, Ke Yan, and Shouhong Ding. 2024. LaRE^2: Latent Reconstruction Error Based Method for Diffusion-Generated Image Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 17006–17015.
[42] Stephane G Mallat. 1989. A theory for multiresolution signal decomposition: the wavelet representation. IEEE transactions on pattern analysis and machine intelligence 11, 7 (1989), 674–693.
[43] Midjourney. 2022. https://www.midjourney.com/home.
[44] Takeru Miyato et al. 2018. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957 (2018).
[45] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. 2022. Null-text Inversion for Editing Real Images using Guided Diffusion Models. arXiv:2211.09794 [cs.CV] https://arxiv.org/abs/2211.09794
[46] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. 2021. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021).
[47] Weili Nie et al. 2019. Relgan: Relational generative adversarial networks for text generation. In ICLR.
[48] Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. 2023. Towards universal fake image detectors that generalize across generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 24480–24489.
[49] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. 2019. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2337–2346.
[50] Yuyang Qian, Guojun Yin, Lu Sheng, Zixuan Chen, and Jing Shao. 2020. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In European conference on computer vision. Springer, 86–103.
[51] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763.
[52] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 1, 2 (2022), 3.
[53] Jonas Ricker, Simon Damm, Thorsten Holz, and Asja Fischer. 2022. Towards the detection of diffusion model deepfakes. arXiv preprint arXiv:2210.14571 (2022).
[54] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10684–10695.
[55] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18. Springer, 234–241.
[56] Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. 2019. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF international conference on computer vision. 1–11.
[57] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. 2021. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114 (2021).
[58] Zeyang Sha, Zheng Li, Ning Yu, and Yang Zhang. 2023. De-fake: Detection and attribution of fake images generated by text-to-image generation models. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security. 3418–3432.
[59] Jiaming Song, Chenlin Meng, and Stefano Ermon. 2020. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020).
[60] Stability-AI. 2022. Stable Diffusion. https://github.com/Stability-AI/StableDiffusion.
[61] Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. 2024. Frequency-aware deepfake detection: Improving generalizability through frequency space learning. arXiv preprint arXiv:2403.07240 (2024).
[62] Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. 2024. Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 28130–28139.
[63] Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, and Yunchao Wei. 2023. Learning on gradients: Generalized artifacts representation for gan-generated images detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12105–12114.
[64] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
[65] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. 2020. CNN-generated images are surprisingly easy to spot... for now. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 8695–8704.
[66] Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, and Houqiang Li. 2023. Dire for diffusion-generated image detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 22445–22455.
[67] Zhenting Wang, Vikash Sehwag, Chen Chen, Lingjuan Lyu, Dimitris N Metaxas, and Shiqing Ma. 2024. How to Trace Latent Generative Model Generated Images without Artificial Watermark? arXiv preprint arXiv:2405.13360 (2024).
[68] Wukong. 2022. https://xihe.mindspore.cn/modelzoo/wukong.
[69] Qiang Xu, Dongmei Xu, Hao Wang, Jianye Yuan, and Zhe Wang. 2024. Color Patterns And Enhanced Texture Learning For Detecting Computer-Generated Images. Comput. J. (2024), bxae007.
[70] Shilin Yan, Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Weidi Xie. 2024. A Sanity Check for AI-generated Image Detection. arXiv preprint arXiv:2406.19435 (2024).
[71] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. 2015. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365 (2015).
[72] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. 2020. Cycleisp: Real image restoration via improved data synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2696–2705.
[73] Nan Zhong, Yiran Xu, Zhenxing Qian, and Xinpeng Zhang. 2023. Rich and poor texture contrast: A simple yet effective approach for ai-generated image detection. arXiv preprint arXiv:2311.12397 (2023).
[74] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision. 2223–2232.
[75] Mingjian Zhu, Hanting Chen, Qiangyu Yan, Xudong Huang, Guanyu Lin, Wei Li, Zhijun Tu, Hailin Hu, Jie Hu, and Yunhe Wang. 2024. Genimage: A million-scale benchmark for detecting ai-generated image. Advances in Neural Information Processing Systems 36 (2024).
Supplementary Material
This Appendix is organized as follows:
• In Sec. A, we elaborate on more experimental details omitted in our main paper because of page limits.
• In Sec. B, we conduct additional experiments about hyperparameter sensitivity and detection robustness for a more comprehensive evaluation.
• In Sec. C, we summarize the plausible limitations of our pipelines and expect to address them in future work.

A Experimental Details

A.1 Local Correlation Map
To visualize the differences in local correlations between natural and synthetic images, we introduce sliding windows to traverse the image and calculate the correlation coefficient for pixels in each window. In implementation, we use the Pearson correlation coefficient and set $w$ to 2 to meet the locality requirement.

Algorithm 1: Calculate Local Correlation Map
Input: Image $I$ of size $H \times W$, window size $w$
Output: Correlation map $C$
1: Initialize $C$ as a zero matrix of size $(H - w + 1) \times (W - w + 1)$
2: for $i \leftarrow 0$ to $H - w$ do
3:   for $j \leftarrow 0$ to $W - w$ do
4:     Extract window $W = I[i : i + w, j : j + w]$
5:     Compute row vector $\mathbf{r}$ as $r_k = \frac{1}{w} \sum_{m=1}^{w} W_{mk}$ for $k = 1, 2, \ldots, w$
6:     Compute column vector $\mathbf{c}$ as $c_k = \frac{1}{w} \sum_{n=1}^{w} W_{kn}$ for $k = 1, 2, \ldots, w$
7:     Compute correlation coefficient $\rho$ between $\mathbf{r}$ and $\mathbf{c}$
8:     $C[i, j] \leftarrow \rho$
9: return $C$
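A direct NumPy translation of Algorithm 1 is sketched below for a single-channel image; treating an undefined correlation (constant row or column means) as zero is an assumption made only for this illustration, not a detail stated in the paper.

```python
import numpy as np

def local_correlation_map(image, w=2):
    """Slide a w x w window over the image and record the Pearson correlation
    between the window's column means (r in Algorithm 1) and row means (c)."""
    h, width = image.shape
    corr = np.zeros((h - w + 1, width - w + 1))
    for i in range(h - w + 1):
        for j in range(width - w + 1):
            win = image[i:i + w, j:j + w].astype(np.float64)
            r = win.mean(axis=0)   # r_k = (1/w) * sum_m W[m, k]
            c = win.mean(axis=1)   # c_k = (1/w) * sum_n W[k, n]
            if r.std() == 0 or c.std() == 0:
                corr[i, j] = 0.0   # correlation undefined for constant vectors
            else:
                corr[i, j] = np.corrcoef(r, c)[0, 1]
    return corr
```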
A.2 Details of Figure 6
This figure provides a horizontal comparison of various artifact features, with a particular emphasis on high-frequency components. In addition to the DWT involved in our pipeline, we also include other commonly-used frequency transforms such as FFT and DCT, along with edge detection operators such as Sobel and Laplace, to facilitate a comprehensive comparison and analysis. Specifically, given an input image $\boldsymbol{X} \in \mathbb{R}^{C \times H \times W}$, the details of these operators are as follows:

Fast Fourier Transform (FFT). FFT is a method for transforming an image from the spatial domain to the frequency domain. It decomposes the image into sinusoidal waves of different frequencies, and these frequencies can be used to analyze and process the frequency characteristics of the image. In our implementation, where FFT is adopted as a high-frequency extractor, the extraction process can be formulated as
$$\boldsymbol{f}^{\mathrm{FFT}}(\boldsymbol{X}) = \mathrm{IFFT}\big(\mathcal{B}_h^{\mathrm{FFT}}(\mathrm{FFT}(\boldsymbol{X}))\big), \tag{2}$$
where the FFT operator includes both a 2D FFT function on the source image and a zero-frequency shift function to move the zero-frequency component to the center of the spectrum. Then the high-pass filter $\mathcal{B}_h^{\mathrm{FFT}}$ is adopted to extract high-frequency components in the form of
$$\mathcal{B}_h^{\mathrm{FFT}}(f_{i,j}) = \begin{cases} 0, & \text{if } |i| < \frac{H}{4} \text{ and } |j| < \frac{W}{4}, \\ f_{i,j}, & \text{otherwise.} \end{cases} \tag{3}$$
Subsequently, the corresponding inverse function IFFT is introduced to transform the filtered frequency information back to image space, and we can obtain $\boldsymbol{f}^{\mathrm{FFT}}(\boldsymbol{X}) \in \mathbb{R}^{C \times H \times W}$ as the input artifact feature with FFT.

Discrete Cosine Transform (DCT). DCT decomposes the image into a combination of cosine functions to analyze the frequency components of the image, which is particularly suited for processing block-based image data. In our implementation, the DCT extractor with the high-frequency filter can be formulated as
$$\boldsymbol{f}^{\mathrm{DCT}}(\boldsymbol{X}) = \mathrm{IDCT}\big(\mathcal{B}_h^{\mathrm{DCT}}(\mathrm{DCT}(\boldsymbol{X}))\big), \tag{4}$$
where $\mathcal{B}_h^{\mathrm{DCT}}$ is the high-frequency filter for the transformed DCT features with a pre-defined threshold $\delta$, which can be formulated as
$$\mathcal{B}_h^{\mathrm{DCT}}(f_{i,j}) = \begin{cases} 0, & \text{if } i + j < \delta, \\ f_{i,j}, & \text{otherwise.} \end{cases} \tag{5}$$
Subsequently, the inverse function IDCT is used to transform the high-frequency component back to image space, and $\boldsymbol{f}^{\mathrm{DCT}}(\boldsymbol{X}) \in \mathbb{R}^{C \times H \times W}$ is regarded as the input artifact feature.
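A minimal NumPy sketch of the FFT-based extractor in Eqs. (2)–(3) is given below; the exact shape and placement of the zeroed low-frequency region are assumptions of this sketch, chosen only to be consistent with the stated |i| < H/4, |j| < W/4 rule.

```python
import numpy as np

def fft_highpass(image):
    """Eqs. (2)-(3): per-channel 2D FFT, shift the zero frequency to the center,
    zero out the central low-frequency block, then transform back to image space."""
    # image: float array of shape (C, H, W)
    out = np.empty(image.shape, dtype=np.float64)
    c, h, w = image.shape
    ci, cj = h // 2, w // 2
    for k in range(c):
        spectrum = np.fft.fftshift(np.fft.fft2(image[k]))
        spectrum[ci - h // 4:ci + h // 4, cj - w // 4:cj + w // 4] = 0  # high-pass filter B_h
        out[k] = np.real(np.fft.ifft2(np.fft.ifftshift(spectrum)))
    return out
```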
Sobel and Laplace. Sobel and Laplace operators are both widely used for edge detection to identify regions in the image with significant intensity changes, which are strongly correlated with the high-frequency components. In practice, both operators are implemented through convolution operations, using specific convolution kernels for edge detection.

The Sobel operator uses two 3 × 3 convolution kernels, one for detecting horizontal changes ($G_x$) and the other for detecting vertical changes ($G_y$), that is
$$G_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \quad G_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}.$$

The Laplace operator is a second-order differential operator used to detect edges and details in images. It determines the regions of rapid intensity change by computing the second derivatives of pixel values. The Laplace operator commonly uses two forms of 3 × 3 convolution kernels, each with distinct characteristics and applications, that is
$$\begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} 1 & 1 & 1 \\ 1 & -8 & 1 \\ 1 & 1 & 1 \end{bmatrix}.$$

These operators are applied to the image through convolution operations $\boldsymbol{f}^{\mathrm{Sobel}}(\boldsymbol{X}), \boldsymbol{f}^{\mathrm{Laplace}}(\boldsymbol{X}) \in \mathbb{R}^{C \times H \times W}$ for edge detection, which we consider as the input artifact features in our comparison.

B Additional Experiments
In this section, we report ACCM and APM on all 33 test subsets unless specified, ensuring a comprehensive comparison on both GAN-based and DM-based benchmarks.

B.1 Hyperparameters
We empirically ablate the essential hyperparameters of our proposed data augmentations as shown in Fig. 7, from which we can draw the following observations:
• Jitter factor α. The jitter factor α, employed in ColorJitter (CJ), shows a clear impact on both ACC and AP. As α increases from 0 to 0.8, ACC and AP initially improve, peaking around α = 0.5 for both ACC and AP. Beyond these points, a sharp decline is observed in both metrics, indicating an optimal range for α around 0.4 to 0.6. Excessive jittering deteriorates the detector's performance, highlighting the importance of moderate augmentation.
• Rotation angle β. In RandomRotation (RR), the rotation angle β demonstrates a significant influence on the detection performance. Both ACC and AP increase as the rotation angle progresses from 0° to 60°, stabilizing at high values until 150°. Notably, performance metrics continue to improve when β exceeds 150°, reaching their peak at 180°. This suggests that larger rotation angles, including extreme rotations, can enhance the generalization performance by providing diverse training samples.
• Patch size d. For patch size d, the ablation study indicates that smaller patch sizes (d = 1, 2, 4) result in lower performance metrics. As d increases to 8, 16, and 32, a marked improvement in both ACC and AP is observed, with optimal performance occurring at d = 16. Further increasing the patch size to d = 64 results in a slight decline in metrics. This finding suggests that intermediate patch sizes (d = 16) provide a balance between detail preservation and context abstraction, thereby enhancing local awareness.
• Mask ratio R. The mask ratio R in RandomMask (RM) shows an intriguing trend. Both ACC and AP initially fluctuate with increasing R, reaching optimal performance around 75%. Beyond this point, a gradual decline is noted. This implies that a moderate masking ratio, which likely introduces sufficient variation without overwhelming the detector, is beneficial. Over-masking (close to 100%) can obscure critical information, negatively impacting performance.

Figure 9: Robustness to Gaussian blur perturbation with different blur sigma σ on different test datasets. "Ref" corresponds to the detection performance w/o Gaussian blur.

Figure 10: Robustness to JPEG compression perturbation with different quality factors Q on different test datasets. "Mean" refers to the average results on ForenSynths.

B.2 Robustness to Unknown Perturbations
In real-world scenarios, images shared on public platforms are susceptible to various unknown perturbations, necessitating the evaluation of SID detectors in terms of robustness. Besides the resizing operator discussed in Sec. 4.4, other plausible operators, such as Gaussian blur, JPEG compression, and random mask, should also be taken into account for a comprehensive assessment.

Gaussian blur. Gaussian blur can be ubiquitous during image transmission on the Internet. Inspired by this, we simulate the scenario where images are perturbed with varying deviation degrees σ of Gaussian blur to evaluate the robustness of the detector in Fig. 9. Specifically, we additionally introduce another data augmentation from CNNDect [65], i.e., RandomGaussianBlur with the standard deviation σ ∼ [0.1, 2.0] and probability p = 0.5 at training. This technique is useful in enhancing the robustness to Gaussian blur. It can be observed that the detection performance initially remains relatively stable at low blur levels (σ ≤ 0.5), indicating that our detector is resilient to minimal blurring. As the blur level increases (0.5 ≤ σ ≤ 1.0), there is a gradual decline in performance. However, when the blur level reaches σ ≥ 1.0, the detection performance levels off, with results around 85% ACCM and 90% APM. This indicates that our detector maintains considerable robustness against Gaussian blur across varying intensities. Overall, the consistency of our detection pipeline under different levels of Gaussian blur underscores its reliability and suitability for practical applications.
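The RandomGaussianBlur augmentation described above (and the RandomJPEG augmentation discussed next) could be implemented roughly as below, operating on PIL images; treating Pillow's blur radius as the Gaussian standard deviation and the exact sampling of the quality factor are assumptions of this sketch rather than details from the released code.

```python
import io
import random
from PIL import Image, ImageFilter

class RandomGaussianBlur:
    """With probability p, blur the image with sigma drawn uniformly from [0.1, 2.0]."""
    def __init__(self, p=0.5, sigma=(0.1, 2.0)):
        self.p, self.sigma = p, sigma

    def __call__(self, img):
        if random.random() < self.p:
            img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(*self.sigma)))
        return img

class RandomJPEG:
    """With probability p, re-encode the image as JPEG with a quality factor drawn from [70, 100)."""
    def __init__(self, p=0.2, quality=(70, 100)):
        self.p, self.quality = p, quality

    def __call__(self, img):
        if random.random() < self.p:
            buffer = io.BytesIO()
            img.save(buffer, format='JPEG', quality=random.randrange(*self.quality))
            buffer.seek(0)
            img = Image.open(buffer).convert('RGB')
        return img
```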
JPEG compression. Similar to Gaussian blur, we introduce RandomJPEG as an additional data augmentation from CNNDect [65], with JPEG quality factor Q ∼ [70, 100) and probability p = 0.2 at training. As shown in Fig. 10, we observe a significant performance drop in ACCM and APM. This is because JPEG compression can substantially diminish the high-frequency components of images, which contradicts our adopted high-frequency artifacts from DWT. We will consider addressing this limitation by exploring more robust artifacts in future work.

Figure 11: Robustness to random mask perturbation with different mask ratio r and patch size d₂, where d₁ and d₂ refer to the masked patch size at training and testing, respectively.

Random mask. We also simulate the scenario where images are randomly masked with patches, which could visually and semantically degrade the detection performance. Fig. 11 examines the robustness of our detector trained with different patch sizes (d₁ = 2, 4, 8, 16) in RandomMask under various mask ratios (r = 10%, 25%, 50%, 75%, 90%) and patch sizes (d₂ = 2, 4, 8, 16, 32, 64) at inference. We can draw the following conclusions from this figure:
• Across all subplots, there is a consistent trend where both ACCM and APM decrease as the mask ratio increases. This pattern holds true regardless of the training patch size d₁. Higher mask ratios (e.g., 75% and 90%) result in more significant performance degradation, indicating the detector's robustness diminishes with increased masked regions.
• The detector consistently demonstrates robust performance when perturbed with larger patch sizes (d₂ = 16, 32, 64) across all d₁. This suggests that the detector effectively focuses on the remaining unmasked patches and draws inferences from these local features. The ability to maintain high performance despite significant masking highlights the importance of local awareness in handling such perturbations.
• In all cases, detection performance deteriorates more significantly as the mask ratio increases, particularly for smaller patches (d₂ = 2, 4). This can be attributed to the fact that smaller patches are more prone to disrupting the local correlation among pixels, as discussed in Sec. 3.2. These artifacts are inherently introduced by the imaging paradigm of synthetic images, making robust detection challenging with higher mask ratios.
• Among the four detectors trained with different patch sizes d₁ = 2, 4, 8, 16, the detector with d₁ = 4 shows the most robustness under random mask perturbation. This comes at the cost of slight performance degradation on unperturbed images compared to d₁ = 16, as shown in Fig. 7 (c). The choice of d₁ at training should be grounded in specific application scenarios, balancing between detection accuracy and robustness.

C Limitation
Our pipeline demonstrates that simple image transformations can significantly improve the generalization performance by mitigating training biases in SID. However, the adoption of simple low-level artifacts (e.g., DWT features) yields inferior robustness against plausible unknown perturbations, as these perturbations could undermine discriminative artifacts in low-level space, thereby degrading the detection performance. This performance degradation is prevalent among existing SID pipelines. In addition to investigating other plausible biases during SID training, we hope to explore more robust artifacts to construct an efficient and robust detector with unbiased training paradigms in future work.