Detecting Multimedia Generated by Large AI Models: A Survey
Abstract—The rapid advancement of Large AI Models (LAIMs), particularly diffusion models and large language models, has marked
a new era where AI-generated multimedia is increasingly integrated into various aspects of daily life. Although beneficial in numerous
fields, this content presents significant risks, including potential misuse, societal disruptions, and ethical concerns. Consequently,
detecting multimedia generated by LAIMs has become crucial, with a marked rise in related research. Despite this, there remains a
notable gap in systematic surveys that focus specifically on detecting LAIM-generated multimedia. Addressing this, we provide the first
survey to comprehensively cover existing research on detecting multimedia (such as text, images, videos, audio, and multimodal
content) created by LAIMs. Specifically, we introduce a novel taxonomy for detection methods, categorized by media modality, and
aligned with two perspectives: pure detection (aiming to enhance detection performance) and beyond detection (adding attributes like
generalizability, robustness, and interpretability to detectors). Additionally, we present a brief overview of generation
mechanisms, public datasets, and online detection tools to provide a valuable resource for researchers and practitioners in this field.
Furthermore, we identify current challenges in detection and propose directions for future research that address unexplored, ongoing,
and emerging issues in detecting multimedia generated by LAIMs. Our aim for this survey is to fill an academic gap and contribute to
global AI security efforts, helping to ensure the integrity of information in the digital realm. The project link is
https://github.com/Purdue-M2/Detect-LAIM-generated-Multimedia-Survey.
Index Terms—Media Forensics, Deepfake, Detection, Large AI Models, Diffusion Models, Large Language Models, Generation
Fig. 1: A cat-and-mouse game between generating and detecting multimedia (text, image, video, audio, and multimodal content) using LAIMs, showcasing only representative works. Q1 represents Jan-Mar; Q2: Apr-Jun; Q3: Jul-Sep; Q4: Oct-Dec.
tive advertising, exploiting its ability to create persuasive and tailored content. There are also ethical dilemmas regarding the use of LAIMs to produce art or content that imitates human creativity, sparking debates over originality, intellectual property rights, and the intrinsic value of human artistic expression [18]. Furthermore, LAIMs' ability to generate multimedia could potentially impact employment in creative fields, fueling concerns about the displacement of human workers in journalism, the arts, and entertainment. A notable example of this tension was the months-long strike in the summer of 2023, where writers and performers fought against major Hollywood studios [15].
Such challenges have highlighted the urgent need for effective detection methods for multimedia produced by LAIMs. In recent years, there has been a significant increase in research focused on this area. However, to the best of our knowledge, there is a notable lack of systematic surveys specifically addressing the detection of LAIM-generated multimedia, in contrast to the numerous surveys focusing on multimedia generation using LAIMs. To bridge this gap, we present this first comprehensive survey, which not only fills a critical academic void but also aligns with the AI security initiatives of various governments worldwide, such as the AI Safety Summit 2023 [19] and the United States government's "AI Executive Order" [20].
In this survey, we provide a detailed and thorough review of the existing research on identifying multimedia generated by LAIMs, particularly emphasizing Diffusion Models and Large Language Models. Our goal is to guide researchers toward understanding the current challenges and exploring potential future directions in this field, aiming to reinstate trust in digital content among users. Furthermore, we endeavor to show that, despite the high degree of realism in LAIM-generated multimedia, it can still be identified, which is crucial for its ethical use and for maintaining the integrity of information in the digital world. The dynamic and ongoing interplay between generating and detecting multimedia using LAIMs is shown in Fig. 1.

1.1 Related Works
While there are a few surveys [21]–[25] addressing the detection of multimedia generated by LAIMs, like Diffusion Models (DMs) and Large Language Models (LLMs), their scopes are very limited. Specifically, surveys like [21]–[24] mainly concentrate on the detection of LAIM-generated text, overlooking other multimedia forms such as images, videos, and audio. Additionally, while these surveys provide insights into detection techniques, they tend to focus solely on detection mechanisms without delving into broader aspects or applications of these technologies beyond detection, which are discussed in our survey.
The survey by Cardenuto et al. [25] falls short of being comprehensive. They present a high-level overview of challenges and opportunities in generation and detection using a broad range of AI models. However, their focus is not specifically on large AI models, leading to an oversight of numerous existing detection works. Additionally, their survey lacks in-depth discussion about datasets in the context of detection. Crucially, the detection methods discussed in [25] are confined to forensic types and do not delve into the specifics of detection models and methodologies, which is one of the central focuses of our survey. They also overlook detection approaches that utilize multimodal data, an area our survey encompasses.
In summary, there is currently no comprehensive survey covering the detection of fake media in text, images, videos, audio, and multimodal content, especially from the perspective of large AI models. Our survey attempts to fill this gap by making the following contributions.

1.2 Contributions
1) This is the first comprehensive survey on detecting multimedia generated by LAIMs, covering text, images, videos, audio, and multimodal content. Through this organized and systematic survey, we hope to help push research in this field forward.
2) We provide a brief overview of LAIMs, their generation mechanisms, and the content they generate, shedding light on the current status of the cat (generation)-and-mouse (detection) game in this field. Additionally, we present public datasets tailored for detection tasks.
3) This survey reviews the most up-to-date detection methodologies for LAIM-generated multimedia. We innovatively categorize them for each modality into two main groups: pure detection and beyond detection. This taxonomy offers a unique perspective that has not been previously explored. Within these two categories, we further classify the methods into more specific subcategories based on their common and distinct characteristics.
4) We have conducted thorough examinations of online detection tools. This provides a valuable resource for both researchers and practitioners in the field. Additionally, we pinpoint current challenges faced by recent detection methods and provide directions for future work.
The remainder of this survey is organized as follows: In Section 2, we briefly introduce LAIMs, including generation
[Fig. 2 graphics omitted. Panels illustrate a) text, b) image, c) video, d) audio (acoustic modeling plus waveform vocoder vs. end-to-end), and e) multimodal generation with foundation models as encoders/decoders and an LLM as controller.]
Fig. 2: Illustrations of different types of multimedia generation processes based on LAIMs.
mechanisms and datasets. In Section 3, we classify detection methods by their functionality and organize them into our newly defined taxonomy. We summarize online detection tools in Section 4. Critical challenges faced by detectors and potential future directions are discussed in Section 5. Finally, we conclude the survey in Section 6.

2 GENERATION
In this section, we provide an overview of large generative AI models, their generation mechanisms, and the type of content they generate.
♣ Text. Machine-generated text, primarily driven by the advent of LLMs, is increasingly permeating many aspects of our daily lives. The exceptional proficiency of LLMs in understanding, following, and complex reasoning [118] has established their dominance in text creation. Recently, we have witnessed a surge of LLMs, including OpenAI GPT-4 [119], Google Gemini [120], and Meta LLaMA2 [60], as well as the remarkable performance they have achieved in various tasks, such as News [34], [39], Question Answering [48], Biomedicine [121], Code Generation [122], Tweets [44], and Scientific Writing [45], see Fig. 2 a). More details can be found in [1], [123].
Datasets. The prevalent datasets for LLM-generated text detection are listed in Table 1. For example, HC3 [53] stands as one of the initial open-source efforts to compare ChatGPT-generated text with human-written text. Due to its pioneering contributions in this field, the HC3 corpus has been utilized as a valuable resource in numerous subsequent studies. Moreover, the Authorship Attribution [32] and TuringBench [39] datasets are proposed to evaluate the attribution capability of detectors. HPPT [42] serves as a benchmark for detecting ChatGPT-polished texts.
♣ Image. In the task of image synthesis, diffusion models (DMs) [71], [124], [125] have emerged as the new state-of-the-art family of deep generative models. The image generation process in DMs usually contains two processes [71]: a forward process that progressively destroys data by adding noise and a reverse process that learns to generate new data by denoising. More details can be found in [4], [126]. Current research on diffusion models is mostly based on three predominant formulations: denoising diffusion probabilistic models (DDPMs) [71], score-based generative models (SGMs) [124], and stochastic differential equations (Score SDEs) [125]. Building upon them, more advanced models have emerged in image generation, including OpenAI DALL·E 3 [127], Stable Diffusion V2 [88], Google Imagen 2 [128], Midjourney [129], Amazon Titan Image Generator [130], and Meta Emu Edit [131], see Fig. 2 b).
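For reference, the DDPM forward and reverse processes [71] can be written in their standard, widely used notation (reproduced here for convenience; this is the textbook formulation, not an equation taken from this survey):

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big),
\qquad
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\quad \epsilon \sim \mathcal{N}(0,\mathbf{I}),
```

where \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s) and \beta_t is the noise schedule. The learned reverse process p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)) undoes this corruption step by step; SGMs and Score SDEs cast the same idea in score-based and continuous-time form.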
Datasets. There are several million-scale generated-image detection datasets: GenImage [79], DiffusionDB [84], ArtiFact [86], HiFi-IFDL [90], and Fake2M [104]. DiffusionDB stands out as the first large-scale image-prompt pairs dataset, though its images are generated only by Stable Diffusion. GenImage, ArtiFact, HiFi-IFDL, and Fake2M contain images generated by various DMs and GANs. Unique to HiFi-IFDL is its inclusion of high-resolution forgery masks, making it particularly useful for both detection and localization tasks. Refer to Table 1 for more details.
♣ Video. In the pursuit of high-quality video generation, recent research has also turned to diffusion models. Early works [132], [133] are based on DDPM [71] for video generation. Current research extends text-to-image (T2I) diffusion models to text-to-video (T2V) generation. Meta proposes Make-A-Video [134], extending a diffusion-based T2I model to T2V through a spatiotemporally factorized diffusion model. A more advanced model, Emu Video [135], is their latest video generation milestone. Unlike Make-A-Video, which requires five models, Emu Video is a simple unified architecture for video generation and uses just two diffusion models.
Modality | Dataset | Content | Link | I2O | #Real | #Generated | Source of Real Media | Generation Method | Year
---------|---------|---------|------|-----|-------|------------|----------------------|-------------------|-----
Text | Stu.Essays [26] | Essays | Link | T2T | 1,000 | 6,000 | IvyPanda [27] | ChatGPT | 2023
Text | Writing [26] | Essays | Link | T2T | 1,000 | 6,000 | Reddit WritingPrompts [28] | ChatGPT | 2023
Text | News [26] | Essays | Link | T2T | 1,000 | 6,000 | Reuters 50-50 [29] | ChatGPT | 2023
Text | Paraphrase [30] | Essays | Link | T2T | 98,280 | 163,710 | Arxiv, Wikipedia, Theses | GPT-3, T5 [31] | 2022
Text | AA [32] | Essays | Link | T2T | 1,064 | 8,512 | News Media | GPT-1&2, CTRL [33], GROVER [34] | 2020
Text | OUTFOX [35] | Essays | Link | T2T | 15,400 | 15,400 | Feedback Prize [36] | ChatGPT, GPT-3.5, T5 [31] | 2023
Text | MULTITuDE [37] | News | Link | T2T | 7,992 | 66,089 | MassiveSumm [38] | GPT-3&4, ChatGPT | 2023
Text | TuringBench [39] | News | Link | T2T | 8,854 | 159,758 | News Media | GPT-1&2&3, CTRL [33], GROVER [34] | 2021
Text | GPTUnmixed [40] | News | Link | T2T | 5,454 | 5,454 | News Media | GPT-3.5 | 2023
Text | GPTMixed [40] | News | Link | T2T | 5,032 | 5,032 | News Media | GPT-3.5 | 2023
Text | GPABench [41] | Writing | Link | T2T | 150,000 | 450,000 | Arxiv | GPT-3.5 | 2023
Text | HPPT [42] | Abstracts | Link | T2T | 6,050 | 6,050 | ACL Anthology [43] | ChatGPT | 2023
Text | TweepFake [44] | Tweets | Link | T2T | 12,786 | 12,786 | GitHub, Twitter | GPT-2, RNN, LSTM | 2021
Text | SynSciPass [45] | Passages | Link | T2T | 99,989 | 10,485 | Scientific papers | GPT-2, BLOOM [46] | 2022
Text | DFText [47] | General | Link | T2T | 154,078 | 294,381 | Reddit, ELI5 [48], Yelp, XSum [49] | GPT, GLM [50], LLaMA, T5 [31] | 2022
Text | HC-Var [51] | General | Link | T2T | 90,096 | 45,000 | XSum [49], IMDb, Yelp, FiQA [52] | ChatGPT | 2023
Text | HC3 [53] | General | Link | T2T | 26,903 | 58,546 | FiQA [52], ELI5 [48], Meddialog [54] | ChatGPT | 2023
Text | M4 [55] | General | Link | T2T | 32,798 | 89,683 | WikiHow [56], Arxiv, Reddit | ChatGPT, LLaMA, T5 [31], BLOOM [46] | 2023
Text | MixSet [57] | General | Link | T2T | 300 | 3,600 | Email [58], BBC News [59], ArXiv | GPT-4, LLaMA2 [60] | 2024
Text | InternVid [61] | Captions | Link | V2T | 7,000,000 | 234,000,000 | YouTube | ViCLIP [61] | 2023
Image | DFF [62] | Face | Link | T/I2I | 30,000 | 90,000 | IMDB-WIKI [63] | SDMs, InsightFace [64] | 2023
Image | RealFaces [65] | Face | Link | T2I | 258 | 25,800 | Prompts | SDMs | 2023
Image | DCFace [66] | Face | Link | I2I | - | 1,200,000 | FFHQ, CASIA-WebFace [67] | DDPM | 2023
Image | OHImg [68] | Overhead | Link | T/I2I | 6,475 | 6,675 | MapBox [69], Google Maps | GLIDE [70], DDPM [71] | 2023
Image | Western Blot [72] | Biology | Link | I2I | ~14,000 | ~24,000 | Western Blot | DDPM, Pix2pix [73], CycleGAN [74] | 2022
Image | M3Dsynth [75] | Biology | Link | I2I | 1,018 | 8,577 | LIDC-IDRI [76] | DDPM, Pix2pix [73], CycleGAN [74] | 2023
Image | Synthbuster [77] | General | Link | T2I | - | 9,000 | Raise-1k [78] | DALL·E 2&3, Midjourney, SDMs, GLIDE [70] | 2023
Image | GenImage [79] | General | Link | T/I2I | 1,331,167 | 1,350,000 | ImageNet | SDMs, Midjourney, BigGAN [80] | 2023
Image | CIFAKE [81] | General | Link | T2I | 60,000 | 60,000 | CIFAR-10 | SD-V1.4 | 2023
Image | AutoSplice [82] | General | Link | T2I | 2,273 | 3,621 | Visual News [83] | DALL·E 2 | 2023
Image | DiffusionDB [84] | General | Link | T2I | 3,300,000 | 16,000,000 | DiscordChatExporter [85] | SD | 2023
Image | ArtiFact [86] | General | Link | T/I2I | 964,989 | 1,531,749 | COCO, FFHQ [87], LSUN | SDMs, DDPM [71], LDM [88], CIPS [89] | 2023
Image | HiFi-IFDL [90] | General | Link | T/I2I | ~600,000 | 1,300,000 | FFHQ [87], COCO, LSUN | DDPM [71], GLIDE [70], LDM [88], GANs | 2023
Image | DiffForensics [91] | General | Link | T/I2I | 232,000 | 232,000 | LSUN, ImageNet | LDM [88], DDPM [71], VQDM [92], ADM [93] | 2023
Image | CocoGlide [94] | General | Link | T2I | 512 | 512 | COCO | GLIDE [70] | 2023
Image | LSUNDB [95] | General | Link | T/I2I | 250,000 | 250,000 | LSUN | DDPM [71], LDM [88], StyleGAN [87] | 2023
Image | UniFake [96] | General | Link | T2I | 8,000 | 8,000 | LAION-400M [97] | LDM [88], GLIDE [70] | 2023
Image | REGM [98] | General | Link | T/I2I | - | 116,000 | CelebA [99], LSUN | 116 publicly available GMs | 2023
Image | DMimage [100] | General | Link | T2I | 200,000 | 200,000 | COCO, LSUN | LDM [88] | 2022
Image | AIGCD [101] | General | Link | T/I2I | 360,000 | 508,500 | LSUN, COCO, FFHQ [87] | SDMs, GANs, ADM [93], DALL·E 2, GLIDE [70] | 2023
Image | DIF [102] | General | Link | T/I2I | 84,300 | 84,300 | LAION-5B [103] | SDMs, DALL·E 2, GLIDE [70], GANs | 2023
Image | Fake2M [104] | General | Link | T/I2I | - | 2,300,000 | CC3M [105] | SD-V1.5 [106], IF [107], StyleGAN3 | 2023
Video | DiffHead [108] | Face | Link | I.A2V | - | 820 | CREMA [109] | Diffused Heads (built on DDPM) | 2023
Audio | LibriSeVoc [110] | Speech | Link | T2A | 13,201 | 79,206 | LibriTTS [111] | DiffWave [112], WaveNet [113] | 2023
Multimodal | DGM4 [17] | News | Link | T/I2T | 77,426 | 152,574 | Visual News [83] | B-GST [114], StyleCLIP [115], HFGI [116] | 2023
Multimodal | COCOFake [117] | General | Link | T/I2T | 113,287 | 566,435 | COCO | SDMs | 2023

TABLE 1: Summary of public datasets that are generated by LAIMs. I2O: Input-to-Output, T2T: Text-to-Text, V2T: Video-to-Text, T/I2I: Text-to-Image / Image-to-Image, T2A: Text-to-Audio, I.A2V: (Image conditioned with Audio)-to-Video. Only representative works are listed in "Source of Real Media" and "Generation Method".
Given a text prompt, Google's Imagen Video [136] generates high-resolution videos using a base video generation model and a sequence of interleaved spatial and temporal video super-resolution models. More recently, Stability AI's Stable Video Diffusion [137] and Runway's GEN-2 [138] offer latent video diffusion models for video generation, while Pika [139] introduces a platform aimed at broadening creative possibilities in video creation, as shown in Fig. 2 c).
Datasets. Though building upon the success of T2I generation, T2V generation requires temporally smooth and realistic motion that matches the text, besides high-quality visual content, which keeps it in a nascent stage compared with image generation. Furthermore, video generation is much more computationally costly than image synthesis. To the best of our knowledge, only one work [108] contributes generated talking-head videos via LAIMs, see Table 1.
♣ Audio. Most audio synthesis by diffusion models focuses on text-to-speech (TTS) tasks. One line of work [112], [140], [141] first generates acoustic features, e.g., a mel-spectrogram, and then outputs the waveform with a vocoder. Another branch of work [142], [143] attempts to solve the TTS task in an end-to-end manner, as shown in Fig. 2 d). Diff-TTS [140] is the first work that applies DDPM [71] to mel-spectrogram generation; it transforms the noise signal into a mel-spectrogram via diffusion time steps. [112], [141] apply diffusion models to vocoders for generating the waveform based on the mel-spectrogram. Instead of treating acoustic modeling and vocoder modeling as independent processes, [142], [143] generate audio from text without acoustic features as an explicit representation.
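To make the two-stage pipeline concrete, the sketch below computes the mel-spectrogram interface between the two stages with librosa; the cited systems (e.g., Diff-TTS, DiffWave) are named only in the comments, and the file path and parameter choices are generic assumptions rather than any paper's exact configuration:

```python
import librosa
import numpy as np

# Load a reference waveform (the path is a placeholder).
wav, sr = librosa.load("speech.wav", sr=22050)

# The mel-spectrogram is the intermediate "acoustic feature":
#   stage 1 (acoustic model, e.g., Diff-TTS): text -> mel
#   stage 2 (vocoder,        e.g., DiffWave): mel  -> waveform
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
log_mel = np.log(np.clip(mel, 1e-5, None))  # log compression, as is typical

print(log_mel.shape)  # (80 mel bins, n_frames)
```

End-to-end systems [142], [143] skip this explicit intermediate representation entirely and map text directly to the waveform.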
Datasets. Due to comparatively less research attention on the detection of fake audio, LibriSeVoc [110] in Table 1 is so far the only public dataset that includes diffusion model-based vocoders (WaveGrad [141] and DiffWave [112]).
♣ Multimodal. Multimodal learning refers to an embodied learning situation that includes learning multiple modalities such as text, image, video, and audio [144]. From a generation perspective, visual generation tasks, such as text-to-image and text-to-video, are regarded as multimodal generation; the generative models for these tasks are trained to learn visual representations and language understanding for visual generation. From the detection aspect, detectors that learn multiple modalities for forgery detection are categorized as multimodal. In this context, we define "multimodal generation" from a detection perspective, referring to frameworks capable of creating multimodal output. A multimodal generation process normally contains foundation models as encoders (e.g., CLIP [145], ViT [146]) and decoders (e.g., Stable Diffusion [88]), and an LLM that takes language-like representations from the encoders for semantic understanding and produces modality signal tokens to guide content output, see Fig. 2 e). Most recent works, such as HuggingGPT [147], AudioGPT [148], and NExT-GPT [149], are all based on the language understanding and complex reasoning capabilities of LLMs and utilize existing off-the-shelf multimodal encoders and decoders as tools to execute multimodal input and output.
Datasets. The multimodal datasets in Table 1 contain multiple types of fake media. For example, DGM4 [17] and COCOFake [117] include synthesized text and images.
Excerpts of the taxonomy tree recoverable from Fig. 3:
• ♣ Easy-explainable Methods — ➀ Watermarking: Distillation-Resistant [222], Fernandez et al. [223], Robust Multi-bit [224], Christ et al. [225], Kuditipudi et al. [226], Unigram Watermark [227], Private Watermark [228]; ➁ Artifacts: Pu et al. [221]
• ➀ Deep-learning Based: TuringBench [208], Contra-X [209], XLNet-FT [210], TopRoBERTa [211]
• ➀ Adversarial Data Augmentation: Yang et al. [42], Shi et al. [198], MGTBench [199]
• ➀ Generalization/Robustness: ChatLog [192], Zero-Shot [193], Training-based [51], Sarvazyan et al. [194], In The Wild [47]
• ♣ Frequency-based Methods: Wavelet-Packets [179], AUSOME [180], Xi et al. [181], Synthbuster [77]
• ♣ Localization (§3.2.2) — ➀ Fully-supervised: IFDL [90], PAL [169], TruFor [94]; ➁ Weakly-supervised: Tantaru et al. [168]
• ♣ Empirical Study: Corvi et al. [100], [160], Ricker et al. [95], Cocchi et al. [117], Papa et al. [65], Porcile et al. [161], Mandelli et al. [72], Carriere et al. [162], Ha et al. [163]
Fig. 3: Taxonomy of the literature on detecting multimedia generated by Large AI Models (LAIMs).
[Fig. 4 graphics omitted. Panel a) Easy-explainable: ➀ Watermarking, ➁ Artifacts, ➂ Stylometry/Coherence; panel b) Hard-explainable: perplexity and log-probability-curvature detectors.]
Fig. 4: Illustrations of pure detection methodologies for LAIM-generated text.
[23] divides the existing literature into two parts: methods designed to detect LLM-generated text (e.g., watermarking-based, fine-tune-based, and zero-shot) and methods designed to evade detection (e.g., paraphrasing attacks, spoofing attacks). Although these organizational strategies apply to recent detection methods, their taxonomy may not be sufficient for adapting to new or evolving detection techniques. To this end, we provide a novel taxonomy based on Pure Detection and Beyond Detection.

3.1.1 Pure Detection
We categorize the pure detection methodologies based on their comprehensibility to the general populace, which can be identified as Easy-explainable and Hard-explainable methods, as shown in Fig. 4.
♣ Easy-explainable Methods. These approaches ensure that humans can straightforwardly understand the principles behind the detection technology. Such methods prioritize clarity and accessibility, making the technology approachable for non-specialist users.
➀ Watermarking. Detecting watermarks in text would be the most straightforward way for humans to understand the detection technique. Text watermarking injects algorithmically detectable patterns into the generated text while ideally preserving the quality and diversity of language model outputs. Most watermark-based detection methods require the watermark key for detection [222]–[227], which makes them susceptible to security breaches and counterfeiting. [228] proposes the first method that does not require the key during detection, alleviating this issue. They use the shared token embedding between the watermark generation and detection networks to improve the efficiency of training the detector.
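Many of the key-based schemes above share a common statistical core: the generator softly biases sampling toward a keyed pseudo-random "green" subset of the vocabulary, and the detector counts green tokens and applies a one-sided z-test. The following is a minimal sketch of that detection step; the hash construction, the GAMMA fraction, and the threshold are illustrative assumptions, not the exact design of any cited work:

```python
import hashlib
import math

GAMMA = 0.5  # assumed fraction of the vocabulary colored "green" per step

def is_green(prev_token_id: int, token_id: int, key: int) -> bool:
    # Pseudo-randomly color token_id green or red, seeded by the secret
    # key and the preceding token (a simplified keyed vocabulary partition).
    digest = hashlib.sha256(f"{key}:{prev_token_id}:{token_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GAMMA

def watermark_z_score(token_ids: list[int], key: int) -> float:
    n = len(token_ids) - 1
    assert n > 0, "need at least two tokens"
    hits = sum(is_green(p, t, key) for p, t in zip(token_ids, token_ids[1:]))
    # Under the no-watermark null hypothesis, hits ~ Binomial(n, GAMMA).
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))

# A z-score above roughly 4 corresponds to a vanishing false-positive rate,
# so long unwatermarked text is very unlikely to be flagged.
```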
➁ Artifacts. While artifacts in text may not be as visually obvious as in images, this detection approach is simple and comprehensible to humans. Yet, intriguing artifacts or characteristics of LLM-generated texts can be elusive. [221] reveals those characteristics by perturbing these generated texts and offers the following empirical findings:
• Artifacts exist mainly in the form of token co-occurrence and at the head of the vocabulary;
• Content words contain more artifacts than stopwords;
• High-level semantic/syntactic features contain much fewer artifacts than shallow features;
• Some artifacts are present in higher-order N-grams, appearing as elements with unclear or vague meanings.
➂ Stylometry/Coherence. Other works directly identify the stylistic changes and coherence inconsistencies between LLM-generated texts and human-written ones. [219] employs stylometric analysis to distinguish between human and LLM-generated tweets by quantifying stylistic differences. [220] examines the linguistic structure of texts and uses the difference in coherence, traced by entity consistency, between LLM-generated text and human-written text.
♣ Hard-explainable Methods. These methods involve detection techniques and analytical processes that may not be as easily comprehensible to most people but are quite accessible to researchers in the field.
➀ Perplexity. This is a statistical metric used in language models. Several works employ perplexity to distinguish LLM-generated text from human-written text. For example, [217] discerns LLM-generated text, specifically homework assignments, by computing perplexity scores for student-authored and ChatGPT-generated responses. These scores then assist in establishing a threshold for discerning the origin of a submitted assignment. Moreover, the widely recognized tool GPTZero [218] examines a text's perplexity and burstiness metrics to estimate the likelihood of a review text being generated by LLMs.
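As a concrete illustration of this signal, a perplexity scorer can be sketched as below; GPT-2 is used only as a convenient public scoring model (not any cited tool's actual backend), and the decision threshold is something each detector tunes on held-out data:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss  # mean negative log-likelihood per token
    return torch.exp(loss).item()

# Lower perplexity (the model finds the text unsurprising) is weak evidence
# of machine authorship; human text tends to score higher, with burstier
# variation from sentence to sentence.
```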
➁ Log Probabilities Curvature. This refers to the logarithmic transformation of the probabilities that a model assigns to sequences of words or tokens, which is used in certain research to detect LLM-generated text. Some works statistically utilize log probabilities to detect LLM-generated text, which requires a certain level of knowledge among the general public regarding the generative mechanisms of LLMs to understand these detection methods. For example, the pioneering work DetectGPT [216] observes that LLM-generated text exhibits a more negative log probability curvature and leverages a curvature-based criterion built on random perturbations of the text, yielding promising results.
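The curvature test can be sketched in a few lines. To keep the sketch self-contained, `score_log_prob` and `mask_and_fill` are hypothetical callables passed in by the caller, standing in for the scoring LLM and a T5-style span-infilling perturbation model:

```python
import statistics
from typing import Callable

def detectgpt_score(text: str,
                    score_log_prob: Callable[[str], float],
                    mask_and_fill: Callable[[str], str],
                    n_perturbations: int = 20) -> float:
    lp_original = score_log_prob(text)
    lp_perturbed = [score_log_prob(mask_and_fill(text))
                    for _ in range(n_perturbations)]
    mu = statistics.mean(lp_perturbed)
    sigma = statistics.stdev(lp_perturbed) or 1.0  # guard against zero spread
    # Machine text sits near a local maximum of the model's log-probability,
    # so rephrasings drop its score sharply; human text lies in flatter
    # regions. A large normalized gap therefore suggests machine origin.
    return (lp_original - mu) / sigma
```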
➂ Efficient Perturbations. Recognizing that the purely random perturbations in DetectGPT incur intensive computational cost, a series of works adopt various techniques to improve computational efficiency and enhance detection accuracy simultaneously. Specifically, [213] achieves similar performance with up to two times fewer queries than DetectGPT by using a Bayesian surrogate model, selecting typical samples based on Bayesian uncertainty and interpolating scores from typical samples to other ones, making the perturbation process more focused and less resource-intensive. Bao et al. [214] increase detection speed by 340 times by substituting DetectGPT's perturbation step with a more efficient sampling step via conditional probability curvature. DetectLLM [215], another recent contribution, employs normalized perturbed log-rank for detecting text generated by LLMs, asserting a lower susceptibility to the perturbation model and the number of perturbations compared to DetectGPT.
[Fig. 5 graphics omitted. Panels illustrate beyond-detection methodologies for LAIM-generated text: a) attribution, b) generalization, c) interpretability, d) robustness, e) empirical study.]
➃ Positive-Unlabeled. Tian et al. [212] observe that LLM-generated short texts are highly similar to human-written ones, so it is not suitable to assign these simple generated texts either fake or real labels; rather, they are in an "Unlabeled" state. To this end, they model the task of LLM-generated text detection as a partial Positive-Unlabeled problem and formulate the Multiscale Positive-Unlabeled training framework to address the challenging task of short text detection without sacrificing long texts.

3.1.2 Beyond Detection
Detection methods that go beyond distinguishing between human and machine-generated content can be organized (see Fig. 5) as follows:
♣ Attribution. Determining which specific model may have generated the test content, see Fig. 5 a).
➀ Deep-learning Based. Most attribution models are deep-learning based [208]–[211]. [210] addresses attribution by noting that synthetic texts carry subtle distinguishing marks inherited from their source models and that these marks can be leveraged for attribution. TopRoBERTa [211] improves existing attribution solutions by capturing more linguistic patterns in LLM-generated texts through the addition of a Topological Data Analysis (TDA) layer to the RoBERTa model. In this approach, RoBERTa captures contextual semantic and syntactic features, while the TDA layer analyzes the shape and structure of linguistic data.
➁ Stylometric/Statistical. Besides the above deep learning-based classifiers, some works solve the attribution task in a stylometric or statistical way. Stylometry is the statistical analysis of writing styles in texts. Uchendu et al. [32] propose a model named the Linguistic model. They train a Random Forest classifier with their proposed Authorship Attribution (AA) dataset and extract an ensemble of stylometric features, such as entropy and readability score. GPT-who [207] is a psycholinguistically-aware, statistics-based detector. It utilizes Uniform Information Density (UID) based features to model the unique statistical signature of each LLM and human author for accurate authorship attribution.
➂ Perplexity. Inspired by the observation that perplexity serves as a reliable signal for distinguishing the source of generated text, [206] calculates a proxy perplexity to identify the source from which a text is generated, such as Human, LLaMA, OPT [229], or others.
➃ Style Representation. In pursuit of an approach that does not rely on samples from language models of concern at training time, Soto et al. [205] propose a few-shot strategy to detect LLM-generated text by leveraging representations of writing style estimated from human-written text.
➄ Origin Tracing. This differs from the above model-wise attribution. It refers to tracing back the larger original model on which a smaller generative model is based, for example, tracing text generated by Alpaca [230] back to ChatGPT and LLaMA. [204] introduces the first origin tracing tool, Sniffer, which utilizes the discrepancies between LLMs as features and then trains a simple linear classifier to help trace the origins.
♣ Generalization. Developing detectors with generalizability that can detect texts produced by generators never seen before, as shown in Fig. 5 b).
➀ Structured Search. This method involves passing documents through weaker language models and then conducting a systematic search over possible combinations of their features. Verma et al. present Ghostbuster [26], a method for detection with generalizability based on structured search and linear classification. Ghostbuster runs the structured search to get the possible text features and then trains a classifier on the selected features to predict whether documents are AI-generated.
➁ Contrastive Learning. [203] develops a contrastive domain adaptation framework that blends standard domain adaptation techniques with the representation power of contrastive learning to learn domain-invariant representations for better generalizability.
♣ Interpretability. Exploring interpretable detectors that can provide explanations for detection results, see Fig. 5 c).
➀ N-gram Overlaps. DNA-GPT [202] identifies GPT-generated text by exploiting the distinct text continuation patterns between human and AI-generated content. It provides evidence based on nontrivial N-gram overlaps to support explainable detection results.
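The evidence itself is easy to picture: truncate the document, ask the suspect model to continue it, and measure how much of the original ending the continuations re-create. A minimal overlap measure might look like the following (the truncation/regeneration loop around it, and any weighting of longer matches, are assumed and not shown):

```python
def ngram_set(text: str, n: int = 4) -> set[tuple[str, ...]]:
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(original_ending: str, model_continuation: str, n: int = 4) -> float:
    ref = ngram_set(original_ending, n)
    cont = ngram_set(model_continuation, n)
    return len(ref & cont) / max(len(cont), 1)

# Repeated, nontrivial 4-gram matches between the model's continuations and
# the original ending serve as human-readable evidence of machine origin.
```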
➁ P-values. [201] proposes a statistical test for detecting watermarks in text with interpretable P-values. P-values provide a statistically interpretable way to quantify the certainty that the detected pattern (i.e., the watermark) in the text is not due to random chance.
➂ Shapley Additive Explanations. Some researchers work toward gaining insight into the reasoning behind the model. For example, [200] fine-tunes a transformer-based model and uses it to make predictions, which are then explained using Shapley Additive Explanations (SHAP) [231], an eXplainable Artificial Intelligence (XAI) framework, to extract explanations of the model's decisions, aiming at uncovering the model's reasoning. Similarly, [41] also employs Shapley values to compare the interpretations derived from word-level and sentence-level results.
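In practice, this kind of analysis can be set up with the shap library's support for text-classification pipelines. The sketch below assumes a fine-tuned detector checkpoint (the model path is a placeholder) and may differ in detail from the cited works' exact setups:

```python
import shap
from transformers import pipeline

# Hypothetical fine-tuned real/fake text classifier.
detector = pipeline("text-classification",
                    model="path/to/finetuned-detector",
                    top_k=None)

explainer = shap.Explainer(detector)
shap_values = explainer(["This essay was assembled with suspicious fluency ..."])
shap.plots.text(shap_values)  # per-token contributions to the fake/real verdict
```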
➃ Polish Ratio. Though the above studies provide interpretability, they do not address text that has been refined on a more granular level, such as ChatGPT-polished texts.
To bridge this gap, [42] introduces a novel dataset termed HPPT (ChatGPT-polished academic abstracts) and proposes the "Polish Ratio" method, a measure of ChatGPT's involvement in text based on editing distance. It calculates a similarity score between the original human-written texts and their polished versions, providing a more reasonable explanation for the detection outcome.
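An illustrative similarity score of this kind can be computed with a standard sequence matcher; this is only a stand-in for the paper's exact edit-distance formulation:

```python
from difflib import SequenceMatcher

def polish_similarity(original: str, polished: str) -> float:
    # 1.0 = identical wording; lower values = heavier ChatGPT involvement.
    return SequenceMatcher(None, original.split(), polished.split()).ratio()

# A regression model can then map such similarity/edit-distance features to a
# continuous "polish ratio" rather than a binary real/fake label.
```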
♣ Robustness. Developing detectors that can handle different attacks, see Fig. 5 d). In light of the vulnerability of detectors to different attacks and robustness issues, a significant body of research has been dedicated to utilizing adversarial learning as a mitigation strategy.
➀ Adversarial Data Augmentation. [42], [198], [199] conduct adversarial data augmentation on LLM-generated text; their findings indicate that models trained on meticulously augmented data exhibit commendable robustness against potential attacks. The key technique in these methods involves employing adversarial attacks by creating misleading inputs, thereby enhancing the model's competency to handle a wider range of situations where deception might be a factor.
➁ Adversarial Learning. Besides these pre-processing methods, some works cast the spotlight on adversarial learning-based frameworks. Hu et al. [197] introduce RADAR, a framework for training robust detectors by simultaneously interacting with a paraphrasing model, which generates content aimed at evading detection. While effective against paraphrase attacks, its defense against other attacks is unexplored. Diverging from RADAR, Koike et al. [35] propose OUTFOX, which enhances robustness by interlinking detector and attacker outputs. The attacker learns from the detector's predictions to create hard-to-detect essays, while the detector improves by learning from these adversarial essays. OUTFOX is particularly effective against paraphrase-based attacks.
➂ Stylistic/Consistency. [195] studies and quantifies stylistic cues from the latent journalism process in real-world news organizations toward discriminating AI-generated news. Their J-Guard framework steers existing supervised AI text detectors while boosting adversarial robustness. The research by Tulchinskii et al. [196] explores consistent properties of human-written texts across different domains and skill levels. They estimate the geometry of each text sample as an individual object and discover that real texts have a higher intrinsic dimension than artificial ones. This insight is used to estimate intrinsic dimensionality for detecting LLM-generated texts.
♣ Empirical Study. Empirical studies are crucial for advancing our understanding of, and capabilities in, detecting LLM-generated texts, as shown in Fig. 5 e).
➀ Generalization/Robustness. Tu et al. [192] observe a decline in RoBERTa-based detectors' effectiveness over time through a month-long study of ChatGPT responses to long-form questions. Complementing this, Pu et al. [193] find that detectors trained on one generator can zero-shot generalize to another, especially when trained on a medium-size LLM for detecting content from a larger version.
A more comprehensive investigation of detectors' generalization is conducted by Xu et al. [51]. They find that trained models tend to overfit to some "irrelevant features" that are not principal for ChatGPT detection. This overfitting issue can originate from the "incautious and insufficient" data collection process. Besides this, they provide an optimistic insight: the trained models are also capable of extracting "transferable features", which are shared features that can help detect ChatGPT-generated texts from various topics and language tasks.
Additionally, [194] specifically investigates the generalization of in-domain fine-tuned detectors. They use the multi-domain, multilingual AuTexTification corpus to fine-tune various supervised detectors and discover that in-domain fine-tuned detectors struggle against data from different models. Moreover, detection in the wild [47] conducts a more comprehensive experiment by building a wild testbed, involving 10 datasets covering diverse writing tasks and sources and using 27 LLMs to create texts. A key finding is the challenge of out-of-distribution data in real-world scenarios, where the performance of detectors significantly declines, often barely better than random classification.
➁ Human Evaluation. [53] conducts thorough human evaluations and linguistic analyses to compare the content generated by ChatGPT with that produced by humans. Key findings include: ChatGPT's responses are more objective and formal, less likely to contain biased or harmful information, and generally longer and more detailed. Chen et al. [191] conduct an extensive empirical investigation of LLM-generated misinformation detection, involving human evaluation and LLMs (e.g., GPT-4 and LLaMA2) as detectors. They discover that LLM-generated misinformation can be harder to detect for both humans and detectors compared to human-written misinformation with the same semantics.
➂ Attribution. An extensive investigation of model attribution, encompassing source model identification, model family classification, and model size classification, is conducted by Antoun et al. [190]. Their source model identification task involves classifying a text into one of 50 LLMs, spanning various families and sizes. The results reveal several key findings, including a clear inverse relationship between classifier effectiveness and model size, with larger LLMs being more challenging to detect, especially when the classifier is trained on data from smaller models.
➃ Paraphrase Detection. Beyond detecting texts purely generated by LLMs or written by humans, another line of empirical studies [30], [189] conducts extensive experiments on paraphrase detection, concerning the severe threat it poses to academic integrity. Specifically, [30] evaluates the performance of various detectors and performs a human study with 105 participants regarding their detection performance and the quality of generated examples. Their results suggest that humans struggle to identify large-model-paraphrased text (53% mean accuracy). Human experts rate the quality of paraphrases generated by GPT-3 to be as high as that of original texts (clarity 4.0/5, fluency 4.2/5, coherence 3.8/5). [189] evaluates various detection methods on several common paraphrase datasets, finding that human-authored paraphrases exceed machine-generated ones in terms of difficulty, diversity, and similarity.
[Fig. 6 graphics omitted. Panels: a) Physical/Physiological-based Methods, b) Diffuser Fingerprints-based Methods (noising/denoising), c) Spatial-based Methods (texture, gradient), d) Frequency-based Methods.]
➄ Sample Complexity. Chakraborty et al. [188] establish precise sample complexity bounds for detecting LLM-generated text, a first-of-its-kind result tailored to detecting AI-generated text in both IID (Independent and Identically Distributed) and non-IID settings. They conduct experiments with existing detectors (OpenAI's RoBERTa, GPTZero [218]) and further remark that text detection is indeed possible under most of the settings and that these detectors could help mitigate the misuse of LLMs.

3.2 Image
While DMs are evolving rapidly and producing increasingly realistic images, they still often make certain mistakes and leave identifiable fingerprints.

3.2.1 Pure Detection
Research summarized here aims to identify DM-generated images by examining physical and physiological cues, as well as by focusing on enhancing detection accuracy.
♣ Physical/Physiological-based Methods. Physical-based methods detect DM-generated images by examining inconsistencies with real-world physics, such as lighting and reflections. Physiologically-based methods, on the other hand, investigate the semantic aspects of human faces [232], including cues such as symmetry, iris color, pupil shapes, skin, etc., see Fig. 6 a). These methods have stronger interpretability than data-driven methods, which have been widely adopted to detect GAN-generated images [233], [234].
Borji [185] outlines key cues for detecting DM-generated images that violate physical rules, as shown in Fig. 6 a) top. These cues include: i) Reflection. Generated images can exhibit artificial reflections that are inconsistent with the natural lighting and environment, such as those in glasses, mirrors, or pupils. ii) Shadow. Generated images might not include shadows, or might have inconsistent shadows. iii) Objects without Support. When an object or material in a generated image appears to be floating in mid-air without any visible means of support, it gives the impression that the object is defying gravity or the laws of physics.
The above cues provided by Borji are based on observations of the failure cases of DM-generated images, while Farid analyzes the perspective [186] and lighting [187] consistency in images synthesized by DALL·E 2, noting that DALL·E 2 sometimes fails to maintain geometric consistency; for example, parallel lines may not converge at a common vanishing point. Additionally, while DALL·E 2 generally creates realistic lighting, there is a tendency for the lighting direction to be more frontal or rearward relative to the camera compared with natural photographs.
As for physiological-based forensics, Borji [185] shows various physiological cues relative to eyes, teeth, ears, hair, skin, limbs, and fingers, etc., as shown in Fig. 6 a) bottom. These artifacts suggest that generative models often fall short when accurately depicting the intricate details of human extremities.
♣ Diffuser Fingerprints-based Methods. Every generation mechanism leaves its own unique fingerprints, which can be explored for detection, as shown in Fig. 6 b). In [102], the authors elucidate that CNNs inherently manifest certain image artifacts. These artifacts are harnessed in their Deep Image Fingerprint (DIF) methodology to extract distinct fingerprints from images generated by these networks (GANs and DMs). This approach facilitates the identification of images originating from a specific model or its fine-tuned variants. Complementarily, the studies by Wang et al. [91] and Ma et al. [184] delve into the realm of diffusion models. They have laid the groundwork for detecting DM-generated images by leveraging the deterministic reverse and denoising processes inherent in diffusion models.
Wang et al. [91] propose a novel image representation called DIffusion Reconstruction Error (DIRE), based on their hypothesis that images produced by diffusion processes can be reconstructed more accurately by a pre-trained diffusion model than real images can. DIRE measures the error between an input image and its reconstruction by a pre-trained diffusion model. The computation of DIRE can be summarized as follows: the input image x_0 is first gradually inverted into a noise image x_T by DDIM inversion [235] and then denoised step by step until a reconstruction x_0' is obtained. DIRE is the residual image obtained from x_0 and x_0', which can be used to differentiate real from generated images.
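A schematic of that computation is given below; `ddim_invert` and `ddim_reconstruct` are hypothetical callables wrapping DDIM inversion and step-by-step denoising with a pre-trained diffusion model (e.g., built on the diffusers library), not the authors' released code:

```python
import torch
from typing import Callable

def dire(x0: torch.Tensor,
         ddim_invert: Callable[[torch.Tensor], torch.Tensor],
         ddim_reconstruct: Callable[[torch.Tensor], torch.Tensor]) -> torch.Tensor:
    xT = ddim_invert(x0)           # image -> noise via deterministic DDIM inversion
    x0_rec = ddim_reconstruct(xT)  # noise -> reconstructed image, step by step
    # The residual tends to be small for DM-generated inputs (they lie on the
    # model's manifold) and larger for real photographs.
    return (x0 - x0_rec).abs()

# A binary classifier is then trained on DIRE maps instead of raw pixels.
```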
Though Wang et al. [91] indeed leverage some deterministic aspects, their approach (i.e., DIRE) primarily targets the reconstruction at the initial time step x_0. This method potentially overlooks the rich information present in the intermediate stages of the diffusion and reverse diffusion processes. Ma et al. [184] exploit these intermediate steps. They design the Stepwise Error for Diffusion-generated Image Detection (SeDID), particularly focusing on the errors between reverse and denoised samples at specific time steps during the generation process.
♣ Spatial-based Methods. This research collection focuses on mining spatial characteristics and features within images to detect DM-generated content. Each study utilizes different spatial aspects of images, such as texture, gradients, and local intrinsic dimensionality, for detection, see Fig. 6 c). The guiding principle is that pixels in rich texture regions exhibit more significant fluctuations than those in poor texture regions; consequently, synthesizing realistic rich texture regions proves more challenging for existing generative models. Zhang et al. [101] leverage such texture features and exploit the contrast in inter-pixel correlation between rich
and poor texture regions within an image for DM-generated image forensics. Nguyen et al. [182] use gradient-based features to differentiate DM-generated and human-made artwork. Lorenz et al. [183] propose using the lightweight multi Local Intrinsic Dimensionality (multiLID) for effective detection. This approach stems from the principle that neural networks heavily rely on low-dimensional textures [236] and that natural images can be represented as mixtures of textures residing on a low-dimensional manifold [237]. Leveraging this concept, multiLID scores are calculated on the lower-dimensional feature-map representations derived from ResNet18. Then a classifier (random forest) is trained on these multiLID scores to distinguish between synthetic and real images. Their proposed multiLID approach exhibits superiority in diffusion detection. However, this solution does not offer good transferability.
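For reference, the maximum-likelihood local intrinsic dimensionality estimator that multiLID-style scores build on can be written in a few lines; this is the textbook Levina-Bickel estimator, not the authors' exact pipeline:

```python
import numpy as np

def lid_mle(sorted_neighbor_dists: np.ndarray) -> float:
    """MLE of local intrinsic dimensionality at a query point.

    `sorted_neighbor_dists`: positive distances from the query point to its
    k nearest neighbors in feature space, sorted ascending; the last entry
    (the k-th neighbor distance) is the normalizer.
    """
    ratios = sorted_neighbor_dists / sorted_neighbor_dists[-1]
    return -1.0 / np.mean(np.log(ratios))  # logs are <= 0, so the result is > 0

# multiLID aggregates such estimates over the channels of early ResNet18
# feature maps and feeds them to a random forest classifier.
```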
♣ Frequency-based Methods. Frequency-based methods analyze the frequency components of an image to extract information that is not readily apparent in the spatial domain, see Fig. 6 d). The initial study by Wolter et al. [179] highlights the limitations of traditional detection methods that primarily utilize spatial-domain convolutional neural networks or Fourier transform techniques. In response, they propose an innovative approach using multi-scale wavelet representation, which incorporates elements from both the spatial and frequency domains. However, the study notes limited benefits from higher-order wavelets on guided diffusion data, pinpointing this as a potential area for future research. In a recent study, Xi et al. [181] develop a dual-stream network that combines texture information and low-frequency analysis. This approach identifies artificial images by focusing on low-frequency forgeries and detailed texture information, and it has shown to be efficient across various image resolutions and databases. [180] introduces AUSOME to detect DALL·E 2 images by performing a spectral comparison of the Discrete Fourier Transform (DFT) and the Discrete Cosine Transform (DCT). Bammey [77] uses a cross-difference filter on the image to highlight the frequency artifacts of DM-generated images. It shows some generalization ability, as well as robustness to JPEG compression.
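A common diagnostic underlying these studies is to average the magnitude spectra of noise residuals over many images: grid-like spectral peaks that appear for generated images but not for camera images expose the frequency artifacts discussed above. A generic sketch (grayscale float images and a crude median-filter denoiser, both purely illustrative):

```python
import numpy as np
from scipy.ndimage import median_filter

def mean_residual_spectrum(images: list[np.ndarray]) -> np.ndarray:
    """Average FFT magnitude of denoising residuals over a set of images."""
    spectrum = np.zeros(images[0].shape, dtype=np.float64)
    for img in images:
        residual = img - median_filter(img, size=3)  # crude content removal
        spectrum += np.abs(np.fft.fftshift(np.fft.fft2(residual)))
    return spectrum / len(images)

# Comparing this map for real photos vs. DM outputs makes the (often subtle)
# periodic fingerprints of generators visible.
```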
3.2.2 Beyond Detection
Research on detecting DM-generated images aims not only to enhance accuracy but also to add additional functionalities to detectors. It seeks to develop a more nuanced understanding and application of these detectors, unlocking new avenues and potential in the realm of DM-generated image analysis and detection. We summarize the existing work into the following categories:
♣ Attribution and Model Parsing. Attribution is to recognize the specific diffusion model that generates images. Model parsing refers to the process of analyzing and interpreting the structure and characteristics of generative models, with the purpose of gaining insight into how a particular model generates images, which can be crucial for tasks like identifying the source of a synthetic image, see Fig. 7 a).
➀ Attribution. Guarnera et al. [178] focus on attributing DM-generated images using a multi-level hierarchical approach. Each level solves a specific task: level 1 classifies images as real or AI-generated; level 2 determines whether the input images were created by GAN or DM technologies; level 3 solves the model attribution task.
➁ Model Parsing. The model parsing task is first defined by Asnani et al. [98] as estimating Generative Model (GM) network architectures (e.g., convolutional layers, fully connected layers, and the number of layers) and their loss function types (e.g., cross-entropy, Kullback–Leibler divergence) by examining their generated images. To tackle this problem, they compiled a dataset comprising images from 116 GMs, which includes detailed information about the network architecture types, structures, and loss functions utilized in each GM. They introduce a framework with two components: a Fingerprint Estimation Network, which estimates a GM fingerprint from a generated image, and a Parsing Network, which predicts network architectures and loss functions from the estimated fingerprints.
♣ Generalization. A generalizable detector can successfully detect unseen images generated by newly released generators, as shown in Fig. 7 b).
Epstein et al. [171] collect a dataset generated by 14 well-known diffusion models and simulate a real-world learning setting with incremental data from new DMs. They find that the classifiers generalize to unseen models, although performance drops substantially when there are major architectural changes. Different from [171], [96] and [172] utilize pre-trained models, ViT [146] and CLIP [145] respectively, to achieve high generalization and robustness in various scenarios.
In addition, [173] uses data augmentation approaches to improve detection generalization. To avoid data dependency on particular generative models and improve generalizability, Jeong et al. [174] propose a framework that can generate various fingerprints and train the detector by adjusting the amplitude of these fingerprints or perturbations. Their method outperforms prior state-of-the-art detectors, even though only real images are used for training. Reiss et al. [175] introduce FACTOR, particularly for fact-checking tasks; although their method is training-free, it achieves promising performance. Dogoulis et al. [176] address the generalization of detectors in cross-concept scenarios (e.g., training a detector on human faces and testing on synthetic animal images) and propose a sampling strategy that considers image quality scoring when sampling training data. It shows better performance than existing methods that randomly sample training data. DNF [177] seeks to address generalization challenges through an ensemble representation that estimates the noise generated during the inverse diffusion process, achieving state-of-the-art detection performance.
♣ Interpretability. "This picture looks like someone I know, and if the AI algorithm tells me it is fake or real, then what is the reasoning, and should I trust it?" Detection with interpretability works toward answering such questions, see Fig. 7 c). In pursuit of creating interpretable detectors, Aghasanli et al. [170] propose a method for DM-generated image detection that takes advantage of features from fine-tuned ViTs combined with existing classifiers such as Support Vector Machines.
[Fig. 7 graphics omitted. Panels illustrate beyond-detection methodologies for DM-generated images: a) attribution and model parsing, b) generalization, c) interpretability, d) localization, e) robustness, f) empirical study.]
map indicating which input regions have been manipulated, feature space of DM-generated images to generate disjoint
as shown in Fig. 7 d). ensembles for adversarial DM-generated image detection.
➀ Fully-supervised. Localization in a fully-supervised ♣ Empirical Study. The empirical study serves as a crucial
setting requires detectors to be explicitly trained for localiza-
foundation for devising methods to detect images generated
tion as a segmentation problem with localization mask label.
by rapidly advancing DMs. Despite DMs’ rapid progression,
Guo et al. [90] propose a hierarchical fine-grained Image
detection methods have not evolved at the same pace.
Forgery Detection and Localization (IFDL) framework with
Therefore, thorough experiments and insights from these
three components: a multi-branch feature extractor, local-
empirical studies are vital for advancing detection technolo-
ization, and classification modules. The localization module
gies, as illustrated in Fig. 7 f).
segments pixel-level forgery regions, while the classification
module detects image-level forgery. [169] centers on detect- Recent studies by Corvi et al. [100] and Ricker et al. [95]
ing and segmenting artifact areas that are only noticeable to investigate the challenge of differentiating photorealistic
human perception, not full manipulation region. In addition synthetic images from real ones. They find that detectors
to providing a pixel-level localization map, the study by trained only on GAN images work poorly on DM-generated
Guillaro et al. [94] also offers an integrity score to assist images. Additionally, DMs exhibit much subtler fingerprints
users in their analysis. in the frequency domain than GANs.
➁ Weakly-supervised. In contrast to the above work, Realizing it is challenging for existing methods to detect
which addresses localization in a fully-supervised setting. DM-generated images, Corvi et al. [160] extend their pre-
Tantaru et al. [168] consider a weakly-supervised scenario vious work [100] to gain insight into which image features
motivated by the fact that ground truth manipulation masks better discriminate fake images from real ones. Their exper-
are usually unavailable, especially for newly developed imental results shed light on some intriguing properties of
local manipulation methods. They conclude that localization synthetic images:
of manipulations for latent diffusion models [88] is very • To date, no generator appears to be artifact-free. Unnatu-
challenging in the weakly-supervised scenario. ral regular patterns are still observed in the autocorrela-
♣ Robustness. A robust detector is strategically developed tion of noise residuals;
to counteract different attacks consisting of intentionally • When a model is trained on a limited-variety dataset, its
designed perturbations or noise, as shown in Fig. 7 e). It is biases can transfer to the generated images;
also designed to maintain consistent detection performance • Synthetic and real images differ notably in their mid-high
when exposed to real-world conditions, such as dealing frequency signal content.
with image compression and blurring.
These traces exploited in this empirical study can be instru-
➀ Spatial-based. Ju et al. [165] propose a synthesized
image dataset that can contain diverse post-processing tech- mental in developing forensic detectors.
niques to simulate real-world applications. They further Cocchi et al. [117] investigate the robustness of different
introduce the Global and Local Feature Fusion Framework pre-trained model backbones against image augmentations
to enhance image detection by integrating multi-scale global and transformations. Their findings reveal that these trans-
features with detailed local features. Another work pre- formations significantly affect classifier performance.
sented by Xu et al. [166] also leverages multi-level feature Other works [65], [161] empirically study human face
representation for enhancing robustness. Different from the detection. [161] focuses on detecting profile photos of online
above works, [167] highlights the effectiveness of computing accounts, finding that a detector (EfficientNet-B1 [238]) only
local statistics, as opposed to global statistics, in distinguish- trained on face images completely fails when detecting non-
ing digital camera images from DM-generated images. This face images, and the detector can learn the semantic-level
method produces promising robustness to image resizing artifacts because it remains highly accurate even when the
and JPEG compression. image is compressed to a low resolution. [65] investigates
♣ Empirical Study. Recent studies by Corvi et al. [100] and Ricker et al. [95] investigate the challenge of differentiating photorealistic synthetic images from real ones. They find that detectors trained only on GAN images work poorly on DM-generated images; additionally, DMs exhibit much subtler fingerprints in the frequency domain than GANs.
Realizing that it is challenging for existing methods to detect DM-generated images, Corvi et al. [160] extend their previous work [100] to gain insight into which image features better discriminate fake images from real ones. Their experimental results shed light on some intriguing properties of synthetic images:
• To date, no generator appears to be artifact-free: unnatural regular patterns are still observed in the autocorrelation of noise residuals;
• When a model is trained on a limited-variety dataset, its biases can transfer to the generated images;
• Synthetic and real images differ notably in their mid-high frequency signal content.
These traces, exploited in this empirical study, can be instrumental in developing forensic detectors; the sketch below probes two of them.
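Both the residual-autocorrelation and mid-high-frequency cues above can be probed with a few lines of numpy. This is a hedged sketch for grayscale arrays: the median-filter residual and the 0.25-of-Nyquist cutoff are illustrative choices of ours, not parameters taken from [160].

    import numpy as np
    from scipy.ndimage import median_filter

    def noise_residual(gray: np.ndarray) -> np.ndarray:
        """High-pass residual: image minus a median-filtered (denoised) copy."""
        g = gray.astype(np.float64)
        return g - median_filter(g, size=3)

    def residual_autocorrelation(gray: np.ndarray) -> np.ndarray:
        """2-D autocorrelation of the noise residual via the Wiener-Khinchin
        theorem. Periodic off-origin peaks hint at the unnatural regular
        patterns reported for generated images."""
        r = noise_residual(gray)
        f = np.fft.fft2(r - r.mean())
        ac = np.fft.ifft2(np.abs(f) ** 2).real
        return np.fft.fftshift(ac / ac[0, 0])   # zero lag normalized to 1

    def midhigh_freq_fraction(gray: np.ndarray, cutoff: float = 0.25) -> float:
        """Fraction of spectral energy above cutoff * Nyquist, where real
        and synthetic images were reported to differ."""
        f = np.fft.fftshift(np.fft.fft2(gray.astype(np.float64)))
        h, w = f.shape
        yy, xx = np.mgrid[0:h, 0:w]
        rad = np.hypot((yy - h / 2) / (h / 2), (xx - w / 2) / (w / 2))
        power = np.abs(f) ** 2
        return float(power[rad > cutoff].sum() / power.sum())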
Cocchi et al. [117] investigate the robustness of different pre-trained model backbones against image augmentations and transformations. Their findings reveal that these transformations significantly affect classifier performance.
Other works [65], [161] empirically study human face detection. [161] focuses on detecting the profile photos of online accounts and finds that a detector (EfficientNet-B1 [238]) trained only on face images fails completely on non-face images; the detector appears to learn semantic-level artifacts, as it remains highly accurate even when the image is compressed to a low resolution. [65] investigates the difficulty non-expert humans have in recognizing fake faces and concludes that trained models outperform non-expert human users, underscoring the need for solutions to counter the spread of disinformation.
Furthermore, [72] centers on biological images, narrowing its focus to the detection of synthetic western blot images. They explore a range of detection strategies, such as binary classification and one-class detectors, and, notably, do so without utilizing synthetic images in training. Their findings demonstrate the successful identification of synthetic biological images. [163] seeks to understand how well existing detectors and human groups can distinguish human art from DM-generated images. They find that supervised classifiers can be quite accurate but are weaker at detecting images generated by newer models and adversarially altered images. In contrast, human artists (especially experienced experts) do well across all inputs. This highlights that a mixed team of human and automated detectors will likely provide the best combination of detection accuracy and robustness. [162] is the first work to detect authentic-looking handwriting generated by DMs. Their experiments reveal that the strongest discriminative features come from real-world artifacts in genuine handwriting that are not reproduced by generative methods.

3.3 Video

Recently, there has been limited research focused on detecting videos generated by LAIMs. This is primarily because video generation techniques are more complex than those for images. Additionally, as outlined in Section 2, video generation technology is still in its nascent stages compared with image generation. To the best of our knowledge, only one work [159] relates to this area (see Fig. 8), and we classify it into the Beyond Detection category.

3.3.1 Beyond Detection

♣ Generalization. Many works address generalization in deepfake video detection. However, they may overlook one challenge: the standard out-of-domain evaluation datasets are very similar in form to the training data, failing to keep up with the advancements in DM video generation. [159] addresses this issue by introducing the Simulated Generalizability Evaluation (SGE) method, which simulates spatial and temporal deepfake artifacts in videos of human faces using a Markov process. SGE aims to improve the generalizability of video detection by creating artifacts that reflect implausibilities in facial structures which could accompany unseen manipulation types.

3.4 Audio

The research on detecting LAIM-generated audio is also very limited. The reason is that visual and text content are more dominant and widespread in media and online platforms. This prevalence makes fake text/images more common and thus a higher priority for detection. However, as voice synthesis and manipulation technologies improve and become more accessible, the importance of, and focus on, detecting fake audio are likely to increase.

3.4.1 Pure Detection

♣ Vocoder-based. Sun et al. [110] focus on detecting unique artifacts generated by neural vocoders in audio signals. Visible artifacts introduced by different neural vocoder models can be observed in Fig. 9. The study employs a multi-task learning framework incorporating a binary-class RawNet2 [239] model, which shares a feature extractor with a vocoder identification module. This shared structure guides the feature extractor to concentrate on vocoder-specific artifacts. However, their detector may become overly specialized in specific forgery technologies, hindering generalization to unseen vocoder artifacts.
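A minimal PyTorch sketch of this shared-extractor, multi-task design is given below. The small convolutional encoder merely stands in for RawNet2 [239], and the number of vocoder classes and the loss weighting are assumptions of ours, not values from [110].

    import torch
    import torch.nn as nn

    class SharedEncoderDetector(nn.Module):
        """Multi-task sketch: one feature extractor feeds (i) a binary
        real/fake head and (ii) a vocoder-identification head, pushing the
        shared features toward vocoder-specific artifacts."""
        def __init__(self, n_vocoders: int = 6, feat_dim: int = 128):
            super().__init__()
            # Stand-in encoder over raw waveforms; the original work uses RawNet2.
            self.encoder = nn.Sequential(
                nn.Conv1d(1, 32, kernel_size=251, stride=16), nn.ReLU(),
                nn.Conv1d(32, feat_dim, kernel_size=5, stride=4), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            )
            self.binary_head = nn.Linear(feat_dim, 2)            # real vs fake
            self.vocoder_head = nn.Linear(feat_dim, n_vocoders)  # which vocoder

        def forward(self, wave):                  # wave: (batch, 1, samples)
            z = self.encoder(wave)
            return self.binary_head(z), self.vocoder_head(z)

    model = SharedEncoderDetector()
    wave = torch.randn(4, 1, 64600)               # ~4 s at 16 kHz, dummy batch
    logits_rf, logits_voc = model(wave)
    # Joint objective: binary loss plus auxiliary vocoder-ID loss
    # (the 0.5 weight is an assumption).
    loss = nn.functional.cross_entropy(logits_rf, torch.tensor([0, 1, 1, 0])) \
         + 0.5 * nn.functional.cross_entropy(logits_voc, torch.tensor([0, 2, 3, 1]))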
3.5 Multimodal

Here, we categorize literature utilizing multimodal learning, which takes multiple modalities as input for detecting forgeries in single or multiple modalities. In particular, multimodal learning [144] refers to an embodied learning situation that includes learning multiple data modalities such as text, image, video, and audio.

3.5.1 Pure Detection

Fig. 10: Illustrations of pure detection methodologies for LAIM-generated multimodal media: a) text-assisted detection, b) text-image inconsistency.

♣ Text-assisted. This type of method utilizes a text-image or prompt-image pair as input and focuses on extracting and learning features from both text and images to improve DM-generated image detection, see Fig. 10 a). Amoroso et al. [158] leverage the semantic content of textual descriptions alongside visual data. They introduce a contrastive-based disentangling strategy to analyze the role of the semantics of textual descriptions and of low-level perceptual cues. The experiments are conducted on their proposed COCOFake dataset, with CLIP [145] employed as the backbone for feature extraction. The extracted features are then used to train a logistic regression model. They find that the visual features extracted from a generated image still retain the semantic information of the original caption used to create it, which allows natural and generated images to be distinguished using only semantic cues, while neglecting the perceptual ones.
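The CLIP-features-plus-linear-classifier recipe used here (and in several works below) is compact enough to sketch. The checkpoint name follows the public openai/clip-vit-base-patch32 release; the file lists are placeholders, and the pipeline is a generic linear probe rather than the exact setup of [158].

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor
    from sklearn.linear_model import LogisticRegression

    # Frozen CLIP backbone used purely as a feature extractor.
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    @torch.no_grad()
    def image_features(paths):
        imgs = [Image.open(p).convert("RGB") for p in paths]
        inputs = processor(images=imgs, return_tensors="pt")
        feats = model.get_image_features(**inputs)
        return torch.nn.functional.normalize(feats, dim=-1).cpu().numpy()

    # Placeholder file lists; substitute your own data.
    real_paths = ["real_0001.jpg", "real_0002.jpg"]
    fake_paths = ["fake_0001.jpg", "fake_0002.jpg"]

    X = image_features(real_paths + fake_paths)
    y = [0] * len(real_paths) + [1] * len(fake_paths)
    clf = LogisticRegression(max_iter=1000).fit(X, y)  # linear probe on CLIP features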
Fig. 11: Illustrations of beyond detection methodologies for LAIM-generated multimodal media: a) Attribution, b) Generalization, c) Interpretability, d) Localization, e) Empirical Study.

♣ Text-image Inconsistency. This refers to a type of misinformation in which the textual content does not align with, or accurately represent, the original meaning or intent of the associated image. Such misinformation is commonly presented in the guise of news [240] to mislead the public, see Fig. 10 b). Tan et al. [16] create the NeuralNews dataset, which contains articles with LLM-manipulated text and authentic images tailored for this task. They further propose DIDAN (Detecting Cross-Modal Inconsistency to Defend Against Neural Fake News), a framework that effectively leverages visual-semantic inconsistencies between article text, images, and captions, offering a novel approach to counteracting neural fake news.

3.5.2 Beyond Detection

♣ Attribution. This refers to research that utilizes a multimodal learning approach, combining prompt information with image features, to enhance the accuracy of attributing DM-generated images to their source model, shown in Fig. 11 a). Sha et al. [157] build an image-only detector (ResNet18) and a hybrid detector (taking advantage of CLIP's [145] image and text encoders for feature extraction). Empirical results show that learning images together with prompts achieves better attribution performance than image-only attribution, regardless of the dataset.

♣ Generalization. This area of research focuses on developing a fake image detector that is guided by language, enhancing its ability to detect a wider range of new, unseen DM-generated images, shown in Fig. 11 b).
➀ Prompt Tuning. In this vein, Chang et al. [156] employ soft prompt tuning on a language-image model to treat image detection as a visual question-answering problem. Prompt questions (e.g., "Is this photo real?") are converted into vector representations and fed, together with the image features extracted by the image encoder, into a Q-former and an LLM. The DM-generated image detection problem is thus formulated as visual question answering, with the LLM giving the detection output "Yes" for real images and "No" for fake images.
➁ Contrastive Learning. [155] adopt a language-guided contrastive learning approach, augmenting training images with textual labels (e.g., "Real/Synthetic Photo" and "Real/Synthetic Painting") for forensic feature extraction.
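As a rough sketch of the prompt-tuning idea in ➀, the module below learns only a handful of prompt vectors and a small head over tokens produced by a frozen vision-language backbone (not shown). The token dimension, the mean-pooling fusion, and the two-way head are simplifications of ours; the actual pipeline routes the tokens through a Q-former and an LLM.

    import torch
    import torch.nn as nn

    class SoftPromptHead(nn.Module):
        """Soft prompt tuning sketch for fake-image detection: only the
        prompt vectors and the classification head are trainable; image
        tokens are assumed to come from a frozen backbone."""
        def __init__(self, n_prompt: int = 8, d_model: int = 512):
            super().__init__()
            self.prompt = nn.Parameter(torch.randn(n_prompt, d_model) * 0.02)
            self.cls = nn.Linear(d_model, 2)      # "Yes" (real) vs "No" (fake)

        def forward(self, image_tokens):           # (batch, n_tokens, d_model)
            b = image_tokens.size(0)
            prompts = self.prompt.unsqueeze(0).expand(b, -1, -1)
            fused = torch.cat([prompts, image_tokens], dim=1)
            # Placeholder fusion: mean-pooling keeps the sketch short.
            return self.cls(fused.mean(dim=1))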
♣ Interpretability. This provides evidence alongside detection results by leveraging multimodal media, shown in Fig. 11 c). To combat misinformation and emphasize the significance of scalability and explainability in misinformation detection, Xu et al. [154] propose a conceptual multi-modal framework composed of four levels: signal, perceptual, semantic, and human, for explainable multimodal misinformation detection.

♣ Localization. Detection methods with localization aim not only to determine the authenticity of media content but also to locate the manipulated content (i.e., image bounding boxes and text tokens). Unlike methods that leverage multimodal learning to enhance single-modal forgery detection, the methods summarized here can detect and locate manipulations in multi-modal media, shown in Fig. 11 d).
➀ Spatial-based. Shao et al. [17] initiate a significant development in multimodal media manipulation detection with their dataset "Detecting and Grounding MultiModal Media Manipulation" (DGM4), in which image-text pairs are manipulated by various approaches, to cope with the threat that enormous amounts of fake news text are generated or manipulated by LLMs. They further use contrastive learning to align the semantic content across different modalities (image and text) for detection and localization. Subsequent research [152], [153], [241] has built upon this dataset. Specifically, [241] extends [17] by integrating a Manipulation-Aware Contrastive Loss with Local View and constructing a more advanced model, HAMMER++ [241], which further improves detection and localization performance. Wang et al. [153] construct a simple and novel transformer-based framework for the same task. Their dual-branch cross-attention and decoupled fine-grained classifiers can effectively model cross-modal correlations and exploit modality-specific features, demonstrating superior performance compared to HAMMER [17] and UFAFormer [152] on DGM4.
➁ Frequency-based. UFAFormer (a Unified Frequency-Assisted transFormer framework) [152] incorporates the frequency domain to address the detection and localization problem. It simultaneously handles detection and localization for manipulated faces, text, and image-text pairs, simplifying the architecture of [17] and facilitating the optimization process.
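The cross-modal alignment step shared by these DGM4 detectors can be summarized by a symmetric InfoNCE-style objective. The sketch below is a generic version of such a loss, not the exact Manipulation-Aware formulation of HAMMER [17]; the temperature is an assumed default.

    import torch
    import torch.nn.functional as F

    def cross_modal_contrastive_loss(img_emb, txt_emb, temperature: float = 0.07):
        """Symmetric InfoNCE alignment: matched image-text pairs are pulled
        together, mismatched pairs pushed apart, so manipulated pairs
        surface as poorly aligned."""
        img = F.normalize(img_emb, dim=-1)
        txt = F.normalize(txt_emb, dim=-1)
        logits = img @ txt.t() / temperature            # (batch, batch) similarities
        targets = torch.arange(img.size(0), device=img.device)
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2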
♣ Empirical Study. This provides insights into the feasibility of detecting generated images by leveraging multimodal media, shown in Fig. 11 e). [150] shows that it is possible to detect generated images using simple Multi-Layer Perceptrons (MLPs), starting from features extracted by CLIP [145] or traditional CNNs. They find that incorporating the associated textual information with the images rarely leads to a significant improvement in detection results, but that the type of subject depicted in the image can significantly impact performance. Papadopoulos et al. [151] highlight that specific patterns and biases in multimodal misinformation detection benchmarks can result in biased or unimodal models outperforming their multimodal counterparts on a multimodal task. They therefore create the "VERification of Image-TExt pairs" (VERITE) benchmark to address unimodal bias in multimodal misinformation detection.

4 ONLINE DETECTION TOOLS

TABLE 2: Existing popular detection tools for exposing LAIM-generated multimedia (links are provided in the corresponding references).

Modality | Tool | Company | Reference
Text | AI Content Detector | Copyleaks | [242]
Text | AI Content Detector, ChatGPT detector | ZeroGPT | [243]
Text | AI Content Detector | Winston AI | [244]
Text | AI Content Detector | Crossplag | [245]
Text | Giant Language model Test Room | GLTR | [246]
Text | The AI Detector | Content at Scale | [247]
Text | Advanced AI Detector and Humanizer | Undetectable ai | [248]
Text | Illuminarty Text | Illuminarty | [249]
Text | AI-Generated Text Detector | Is it AI | [250]
Text | AI Checker | Originality ai | [251]
Text | AI Content Detector | Writer | [252]
Text | AI Content Detector | Conch | [253]
Image | AI or Not image | AI or Not | [254]
Image | AI-Generated Image Detector | Is it AI | [255]
Image | Illuminarty Image | Illuminarty | [256]
Image | SynthID | Google | [257]
Image | Advanced AI Image Detector | Content at Scale | [258]
Image | AI Image Detector | Huggingface | [259]
Audio | AI or Not audio | AI or Not | [254]

♣ Image. In evaluating tools for detecting images generated by DMs, AI or Not [254] demonstrated fast execution while identifying fake images and supported diverse formats; however, it cannot provide stable results. Is it AI [255] and Content at Scale [258] offer user-friendly interfaces but have limitations in detecting certain content types. Illuminarty [256] provides in-depth analysis but faces usability complexities, and Huggingface [259] excelled in accessibility with limitations in advanced analysis. Google has developed SynthID [257], specifically designed for images generated by Imagen [260].

♣ Audio. AI or Not [254] also provides an AI audio detection tool, which sets it apart from other AI detection tool product lineups. Currently, it supports WAV, MP3, and FLAC audio files with a maximum size of 100 MB and a duration of more than 5 seconds.

5 DISCUSSION

In this section, we discuss the main challenges and limitations encountered in the era of detection, based on our comprehensive review of existing works. We then propose potential directions for future research aimed at developing more efficient and practically effective detection systems.

5.1 Challenges

♣ Limited Reliability of Detection Models. A reliable detector should retain promising generalizability to newly emerging models or data sources, good interpretability for building trust in prediction results, and robust resistance against different real-life application conditions [261]. We therefore pinpoint the reliability challenges in detection models from three aspects: generalizability, interpretability, and robustness.
• Generalizability. We can see from Fig. 1 that LAIMs and their generated multimedia are evolving rapidly, so detectors trained on data from existing generators struggle to keep pace with, and generalize to, content from newly emerging models.
• Interpretability. Current explanation techniques can themselves be unreliable (e.g., subject to confounding bias) [263], and the above methods still do not accurately reflect a detector's decision-making process. Besides the model architecture, training data is another challenging aspect to consider, particularly in determining which specific features or subsets of the vast training data contribute to a detector's predictions.
• Robustness. Potential attacks are an important factor in the continued unreliability of current LAIM-generated multimedia detectors. These detectors are tasked with not only identifying content produced by the latest LAIMs but also demonstrating resilience against sophisticated attacks created by the most advanced generative models. [264], [265] illustrate this vulnerability: they use diffusion-based models to craft attacks that successfully exploit weaknesses in existing detectors. Furthermore, [117] has shown that detector performance is significantly influenced by image transformations. This finding emphasizes the challenge of ensuring robustness, especially in real-world scenarios where images often undergo compression or blurring. Additionally, LAIM-generated text detection faces paraphrase attacks and prompt attacks performed by LAIMs, besides adversarial attacks [21], [22]. In a nutshell, LAIM-generated attacks and real-life detection scenarios challenge the robustness of detectors more than ever before.

♣ Constrained Size, Accessibility, Range, and Variety of Datasets. As illustrated in Table 1, there is a significant gap in the availability of large-scale datasets for video, audio, and multimodal detection. While there are a few million-scale image detection datasets, many have limited sample sizes (some fewer than ten thousand). Additionally, there is a noticeable lack of diversity: current public multimodal datasets focus primarily on image and text modalities, overlooking valuable information that could be derived from other modalities like video and audio. Furthermore, the majority of text detection datasets are predominantly in English and limited to specific domains rather than encompassing a broad range of topics. The available fake audio dataset, LibriSeVoc [110], contains only two diffusion-based vocoders. This lack of diversity and narrow range lead to significant out-of-distribution challenges, causing detectors to underperform markedly when faced with data in real-world settings.

♣ Lack of an Evaluation Benchmark. The field of LAIM-generated multimedia detection faces the challenge of lacking a standardized benchmark, leading to inconsistent data processing, evaluation strategies, and metrics. This results in unfair comparisons and potentially misleading outcomes. For instance, Kamat et al. [159] note that detectors are often tested on datasets similar to their training data, while Cocchi et al. [117] demonstrate that different data augmentations have notable impacts on detector performance. Furthermore, the capability of LAIMs to partially manipulate multimedia content, such as through image inpainting, polishing human-written text, and synthesizing specific video frames, adds to the evaluation complexity. How best to evaluate such intricately manipulated content with a unified benchmark therefore remains challenging.

5.2 Future Directions

♣ Building Foundation Models. The term "foundation models" was first introduced by Bommasani et al. [266], who defined a foundation model as "the base models trained on broad data that can be adapted to a wide range of downstream tasks" (e.g., CLIP [145], DALL-E 3 [127], ViT [146], and the GPT family). Multimedia generation has recently witnessed remarkable progress fueled by foundation models [4]. In LAIM-generated text detection, foundation models are widely used. Most recently, in fake image detection [150], [156]–[158], [170], [172], the paradigm is also shifting towards foundation models, because they allow rapid model development and provide better performance in real-world scenarios, owing to their extensive pre-training on diverse datasets.
However, the application of foundation models in detection tasks remains quite limited. For instance, the above efforts in synthesized image detection are primarily limited to CLIP [145] and ViT [146]. This limitation is even more pronounced in areas like audio, video, and multimodal forgery detection, where there is a notable scarcity of research utilizing or developing foundation models. This gap indicates a significant opportunity for advancement, particularly in integrating and exploring existing foundation models to bolster detection capabilities across a wide range of digital forgeries. More importantly, there is a pressing need to develop multimodal, detection-oriented foundation models. Such models would offer more robust and versatile detection systems than task-specific models, enabling researchers to rapidly adapt them to various downstream detection tasks, streamlining the detection process, and enhancing its efficacy.

♣ Harnessing Generative Discrepancies. Recent studies reveal clear differences between LAIM-generated multimedia and human-created content. LAIM-generated text, as shown by [53], [267], [268], often exhibits longer, more deterministic, and more complex syntax than human writing. In image generation, current DMs struggle to produce high-frequency information [95], [179] and rich textures [101], tend to leave repetitive patterns [185], and often violate physical principles such as lighting [187]. Effective detection strategies might involve a mix of these indicators, with potential in ensemble learning [269], foundation models [266], multi-modal/task learning, and cross-modal feature embedding. A toy late-fusion combination of such indicators is sketched below.
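As a minimal illustration of mixing such indicators, the sketch below averages the scores of several single-cue detectors; both the detector list and the weights are placeholders rather than a validated design.

    import numpy as np

    def ensemble_score(image, detectors, weights=None):
        """Late-fusion ensemble over complementary forgery cues (e.g., a
        spectral-energy test, a texture-contrast test, a CLIP-based probe).
        Each detector maps an image to a fake probability in [0, 1]."""
        scores = np.array([d(image) for d in detectors], dtype=float)
        w = np.ones_like(scores) if weights is None else np.asarray(weights, dtype=float)
        return float(np.dot(w, scores) / w.sum())

    # Usage (hypothetical single-cue detectors):
    # ensemble_score(img, [midhigh_freq_detector, texture_detector, clip_probe])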
♣ Ensuring Reliability of Detectors. Given the reliability challenges discussed above, we urge forthcoming detection studies to prioritize detector reliability. The development of a reliable detector should aim for a unified framework that ensures generalizability, robustness, and interpretability concurrently. This could be a promising direction for detectors to combat LAIM-generated multimedia in real-world scenarios and to increase the trust of the social community in detection results. Training on diverse and large-scale datasets, building detectors robust to attacks, incorporating explainable AI (XAI) techniques [262], and adopting strategies like online learning [171] or transfer learning with new data to keep the model updated would all contribute to building a reliable detector.

♣ Establishing Fair Detectors. It has been revealed that current detectors, though achieving high detection accuracy, result in unfair performance disparities among demographic groups, such as race and gender [270], [271].
This can lead to particular groups facing unfair targeting or exclusion from detection, potentially allowing misclassified LAIM-generated multimedia to manipulate public opinion and undermine trust in the model. Despite its criticality, the issue of fairness in LAIM-generated multimedia detection, even in detecting deepfakes generated by traditional methods, has not received adequate attention from the research community. We advocate for a proliferation of studies that address the bias problem in detectors. This involves not only developing a fair detector, but also ensuring that its fairness generalizes across various scenarios without showing bias against certain groups.

♣ Accelerating the Development of Multimodal Detectors and Datasets. Existing research on detecting LAIM-generated multimedia mainly focuses on a single data modality. However, in real-world scenarios, it is very hard to know in advance which modality is generated. For example, generated videos may have both visual and audio components generated, or only one of them. Hence, the creation of multimodal detectors is crucial. To effectively develop such models, assembling extensive multimodal datasets for their training is a fundamental requirement. This can be explored in the future.

♣ Towards Interpretable, User-friendly Detection Tools. Our analysis of popular online detection tools, discussed in Section 4, reveals a common shortcoming: a lack of interpretability. However, humans generally prefer tools that are directly interpretable, tractable, and trustworthy [262]. Developing interpretable detection tools can gain more social attention and popularity. Enhancing interpretability is therefore not just a technical upgrade but a significant step in making these tools accessible to broader user groups, and it is especially beneficial for underrepresented groups such as non-technical persons and elderly adults.

6 CONCLUSION

This paper provides the first systematic and comprehensive survey covering existing research on detecting multimedia, from text, image, audio, and video to multimodal content, generated by large AI models. We introduce a novel taxonomy of detection methods within each modality, categorizing them under two primary frameworks: pure detection (focusing on improving detection accuracy) and beyond detection (integrating attributes like generalizability, robustness, and interpretability into detectors). Additionally, we have outlined the sources contributing to detection, such as the generation mechanisms of LAIMs, public datasets, and online tools. Finally, we pinpoint current challenges in this field and propose potential directions for future research. We believe that this survey serves as an initial contribution to addressing a notable academic gap in this field, aligning with global AI security initiatives and upholding the authenticity and integrity of digital information.

REFERENCES

[1] W. X. Zhao et al., "A survey of large language models," arXiv, 2023.
[2] H.-Y. Lin, "Large-scale artificial intelligence models," Computer, vol. 55, no. 05, pp. 76–80, 2022.
[3] J. Qiu et al., "Large ai models in health informatics: Applications, challenges, and the future," IEEE J. Biomed. Health Inform., 2023.
[4] C. Li et al., "Multimodal foundation models: From specialists to general-purpose assistants," arXiv, vol. 1, no. 2, p. 2, 2023.
[5] K. Malinka et al., "On the educational impact of chatgpt: Is artificial intelligence ready to obtain a university degree?" in Proc. Innov. Technol. Comput. Sci. Educ., 2023, pp. 47–53.
[6] T. Susnjak, "Chatgpt: The end of online exam integrity?" arXiv, 2022.
[7] Z. Zhao et al., "Chatcad+: Towards a universal and reliable interactive cad using llms," arXiv, 2023.
[8] O. Thawkar et al., "Xraygpt: Chest radiographs summarization using medical vision-language models," arXiv, 2023.
[9] M. Chui et al., "Generative ai is here: How tools like chatgpt could change your business," Quantum Black AI by McKinsey, 2022.
[10] L. Henneborn, "Designing generative ai to work for people with disabilities," http://tinyurl.com/mrfcrs94, 2023.
[11] K. S. Glazko et al., "An autoethnographic case study of generative artificial intelligence's utility for accessibility," in Proc. 25th Int. ACM SIGACCESS Conf. Comput. Accessibility, 2023, pp. 1–8.
[12] H. S. Sætra, "Generative ai: Here to stay, but for good?" Technology in Society, vol. 75, p. 102372, 2023.
[13] P. A. Napitupulu et al., "The implication of generative artificial intelligence towards intellectual property rights (examining the multifaceted implications of generative artificial intelligence on intellectual property rights)," West Science Law and Human Rights, vol. 1, no. 04, pp. 274–284, 2023.
[14] M. Masood et al., "Deepfakes generation and detection: State-of-the-art, open challenges, countermeasures, and way forward," Applied Intelligence, vol. 53, no. 4, pp. 3974–4026, 2023.
[15] M. Bohacek et al., "The making of an ai news anchor—and its implications," Proc. Natl. Acad. Sci., vol. 121, no. 1, p. e2315678121, 2024.
[16] R. Tan et al., "Detecting cross-modal inconsistency to defend against neural fake news," in Proc. 2020 Conf. Empirical Methods Nat. Lang. Process., 2020.
[17] R. Shao et al., "Detecting and grounding multi-modal media manipulation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2023, pp. 6904–6913.
[18] H. H. Jiang et al., "Ai art and its impact on artists," in Proc. AAAI/ACM Conf. AI Ethics Soc., 2023, pp. 363–374.
[19] "Ai safety summit," https://www.aisafetysummit.gov.uk/, 2023, hosted by the UK.
[20] F. Register, "Safe, secure, and trustworthy development and use of artificial intelligence," http://tinyurl.com/25t76x4d, 2023.
[21] X. Yang et al., "A survey on detection of llms-generated content," arXiv, 2023.
[22] J. Wu et al., "A survey on llm-generated text detection: Necessity, methods, and future directions," arXiv, 2023.
[23] S. S. Ghosal et al., "Towards possibilities & impossibilities of ai-generated text detection: A survey," arXiv, 2023.
[24] E. Crothers et al., "Machine-generated text: A comprehensive survey of threat models and detection methods," IEEE Access, 2023.
[25] J. P. Cardenuto et al., "The age of synthetic realities: Challenges and opportunities," arXiv, 2023.
[26] V. Verma et al., "Ghostbuster: Detecting text ghostwritten by large language models," arXiv, 2023.
[27] "Ivypanda essay dataset," http://tinyurl.com/bd62jsc2.
[28] "Reddit writing prompts," http://tinyurl.com/yzxxkd82.
[29] J. Houvardas et al., "N-gram feature selection for authorship identification," in Int. Conf. Artif. Intell. Methodol. Syst. Appl., 2006, pp. 77–86.
[30] J. P. Wahle et al., "How large language models are transforming machine-paraphrased plagiarism," in Proc. 2022 Conf. Empirical Methods Nat. Lang. Process., 2022.
[31] C. Raffel et al., "Exploring the limits of transfer learning with a unified text-to-text transformer," J. Mach. Learn. Res., vol. 21, no. 1, pp. 5485–5551, 2020.
[32] A. Uchendu et al., "Authorship attribution for neural text generation," in Proc. 2020 Conf. Empirical Methods Nat. Lang. Process., 2020, pp. 8384–8395.
[33] N. S. Keskar et al., "Ctrl: A conditional transformer language model for controllable generation," arXiv, 2019.
[34] R. Zellers et al., "Defending against neural fake news," Adv. Neural Inf. Process. Syst., vol. 32, 2019.
[35] R. Koike et al., "Outfox: Llm-generated essay detection through in-context learning with adversarially generated examples," arXiv, 2023.
[36] A. F. Maggie et al., "Feedback prize - predicting effective arguments," 2022, http://tinyurl.com/48w9jb8k, accessed: 2023-05-10.
[37] D. Macko et al., "Multitude: Large-scale multilingual machine-generated text detection benchmark," arXiv, 2023.
[38] D. Varab et al., "Massivesumm: a very large-scale, very multilingual, news summarisation dataset," in Proc. 2021 Conf. Empirical Methods Nat. Lang. Process., 2021, pp. 10150–10161.
[39] A. Uchendu et al., "Turingbench: A benchmark environment for turing test in the age of neural text generation," arXiv, 2021.
[40] X. Liu et al., "Coco: Coherence-enhanced machine-generated text detection under low resource with contrastive learning," in Proc. 2023 Conf. Empirical Methods Nat. Lang. Process., 2023, pp. 16167–16188.
[41] Z. Liu et al., "Check me if you can: Detecting chatgpt-generated academic writing using checkgpt," arXiv, 2023.
[42] L. Yang et al., "Is chatgpt involved in texts? measure the polish ratio to detect chatgpt-generated text," arXiv, 2023.
[43] "Acl anthology," https://aclanthology.org/.
[44] T. Fagni et al., "Tweepfake: About detecting deepfake tweets," Plos one, vol. 16, no. 5, p. e0251415, 2021.
[45] D. Rosati, "Synscipass: detecting appropriate uses of scientific text generation," in Proc. Third Workshop Scholarly Doc. Process., 2022.
[46] B. Workshop et al., "Bloom: A 176b-parameter open-access multilingual language model," arXiv, 2022.
[47] Y. Li et al., "Deepfake text detection in the wild," arXiv, 2023.
[48] A. Fan et al., "Eli5: Long form question answering," in Proc. 57th Conf. Assoc. Comput. Linguist., 2019, pp. 3558–3567.
[49] S. Narayan et al., "Don't give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization," arXiv, 2018.
[50] Z. Du et al., "Glm: General language model pretraining with autoregressive blank infilling," in Proc. 60th Annu. Meet. Assoc. Comput. Linguist., 2022, pp. 320–335.
[51] H. Xu et al., "On the generalization of training-based chatgpt detection methods," arXiv, 2023.
[52] N. Thakur et al., "Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models," in Neural Inf. Process. Syst. Datasets and Benchmarks Track, 2021.
[53] B. Guo et al., "How close is chatgpt to human experts? comparison corpus, evaluation, and detection," arXiv, 2023.
[54] G. Zeng et al., "Meddialog: Large-scale medical dialogue datasets," in Proc. 2020 Conf. Empirical Methods Nat. Lang. Process., 2020, pp. 9241–9250.
[55] Y. Wang et al., "M4: Multi-generator, multi-domain, and multi-lingual black-box machine-generated text detection," arXiv, 2023.
[56] M. Koupaee et al., "Wikihow: A large scale text summarization dataset," arXiv, 2018.
[57] C. Gao et al., "Llm-as-a-coauthor: The challenges of detecting llm-human mixcase," arXiv, 2024.
[58] CMU, "Enron email dataset," https://www.cs.cmu.edu/~enron/, 2015.
[59] D. Greene and P. Cunningham, "Practical solutions to the problem of diagonal dominance in kernel document clustering," in Proc. 23rd Int. Conf. Mach. Learn., 2006, pp. 377–384.
[60] H. Touvron et al., "Llama 2: Open foundation and fine-tuned chat models," arXiv, 2023.
[61] Y. Wang et al., "Internvid: A large-scale video-text dataset for multimodal understanding and generation," arXiv, 2023.
[62] H. Song et al., "Robustness and generalizability of deepfake detection: A study with diffusion models," arXiv, 2023.
[63] R. Rothe et al., "Dex: Deep expectation of apparent age from a single image," in Proc. IEEE Int. Conf. Comput. Vis. Workshops, 2015, pp. 10–15.
[64] "Insightface," http://tinyurl.com/mwebuxdv.
[65] L. Papa et al., "On the use of stable diffusion for creating realistic faces: from generation to detection," in 2023 11th Int. Workshop Biometrics Forensics. IEEE, 2023, pp. 1–6.
[66] M. Kim, F. Liu, A. Jain, and X. Liu, "Dcface: Synthetic face generation with dual condition diffusion model," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2023, pp. 12715–12725.
[67] G. Huang, M. Mattar, H. Lee, and E. Learned-Miller, "Learning to align from scratch," Adv. Neural Inf. Process. Syst., vol. 25, 2012.
[68] B. B. May et al., "Comprehensive dataset of synthetic and manipulated overhead imagery for development and evaluation of forensic tools," in Proc. 2023 ACM Workshop Inf. Hiding Multimedia Secur., 2023, pp. 145–150.
[69] "Mapbox," https://www.mapbox.com/.
[70] A. Nichol et al., "Glide: Towards photorealistic image generation and editing with text-guided diffusion models," in Int. Conf. Mach. Learn. PMLR, 2022, pp. 16784–16804.
[71] J. Ho et al., "Denoising diffusion probabilistic models," Adv. Neural Inf. Process. Syst., vol. 33, pp. 6840–6851, 2020.
[72] S. Mandelli et al., "Forensic analysis of synthetically generated western blot images," IEEE Access, vol. 10, pp. 59919–59932, 2022.
[73] P. Isola et al., "Image-to-image translation with conditional adversarial networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1125–1134.
[74] J.-Y. Zhu et al., "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proc. IEEE Int. Conf. Comput. Vis., Oct 2017.
[75] G. Zingarini, D. Cozzolino, R. Corvi, G. Poggi, and L. Verdoliva, "M3Dsynth: A dataset of medical 3d images with ai-generated local manipulations," arXiv preprint arXiv:2309.07973, 2023.
[76] S. G. Armato III, G. McLennan, L. Bidaut, M. F. McNitt-Gray, C. R. Meyer, A. P. Reeves, B. Zhao, D. R. Aberle, C. I. Henschke, E. A. Hoffman et al., "The lung image database consortium (lidc) and image database resource initiative (idri): a completed reference database of lung nodules on ct scans," Medical Physics, vol. 38, no. 2, pp. 915–931, 2011.
[77] Q. Bammey, "Synthbuster: Towards detection of diffusion model generated images," IEEE Open J. Signal Process., 2023.
[78] D.-T. Dang-Nguyen et al., "Raise: A raw images dataset for digital image forensics," in Proc. 6th ACM Multimedia Syst. Conf., 2015, pp. 219–224.
[79] M. Zhu et al., "Genimage: A million-scale benchmark for detecting ai-generated image," arXiv, 2023.
[80] J. Deng et al., "Large scale gan training for high fidelity natural image synthesis," in 7th Int. Conf. Learn. Represent. OpenReview.net, 2019.
[81] J. J. Bird et al., "Cifake: Image classification and explainable identification of ai-generated synthetic images," IEEE Access, 2024.
[82] S. Jia et al., "Autosplice: A text-prompt manipulated image dataset for media forensics," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2023, pp. 893–903.
[83] F. Liu et al., "Visual news: Benchmark and challenges in news image captioning," in Proc. 2021 Conf. Empirical Methods Nat. Lang. Process., 2021, pp. 6761–6771.
[84] Z. J. Wang et al., "Diffusiondb: A large-scale prompt gallery dataset for text-to-image generative models," arXiv, 2022.
[85] O. Holub, "Discordchatexporter: Exports discord chat logs to a file," http://tinyurl.com/53jbxwxr.
[86] M. A. Rahman et al., "Artifact: A large-scale dataset with artificial and factual images for generalizable and robust synthetic image detection," arXiv, 2023.
[87] T. Karras et al., "A style-based generator architecture for generative adversarial networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 4401–4410.
[88] R. Rombach et al., "High-resolution image synthesis with latent diffusion models," in Conf. Comput. Vis. Pattern Recognit., 2022, pp. 10674–10685.
[89] I. Anokhin et al., "Image generators with conditionally-independent pixel synthesis," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 14278–14287.
[90] X. Guo et al., "Hierarchical fine-grained image forgery detection and localization," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2023, pp. 3155–3165.
[91] Z. Wang et al., "Dire for diffusion-generated image detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2023.
[92] S. Gu et al., "Vector quantized diffusion model for text-to-image synthesis," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 10696–10706.
[93] P. Dhariwal et al., "Diffusion models beat gans on image synthesis," Adv. Neural Inf. Process. Syst., vol. 34, pp. 8780–8794, 2021.
[94] F. Guillaro et al., "Trufor: Leveraging all-round clues for trustworthy image forgery detection and localization," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2023, pp. 20606–20615.
[95] J. Ricker et al., "Towards the detection of diffusion model deepfakes," arXiv, 2022.
[96] U. Ojha et al., "Towards universal fake image detectors that generalize across generative models," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshop, 2023, pp. 24480–24489.
[97] C. Schuhmann et al., "Laion-400m: Open dataset of clip-filtered 400 million image-text pairs," in Data Centric AI NeurIPS Workshop, 2021.
[98] V. Asnani et al., "Reverse engineering of generative models: Inferring model hyperparameters from generated images," IEEE Trans. Pattern Anal. Mach. Intell., 2023.
[99] Z. Liu et al., "Deep learning face attributes in the wild," in Proc. Int. Conf. Comput. Vis., December 2015.
[100] R. Corvi et al., "On the detection of synthetic images generated by diffusion models," in 2023 IEEE Int. Conf. Acoustics, Speech Signal Process. IEEE, 2023, pp. 1–5.
[101] N. Zhong et al., "Rich and poor texture contrast: A simple yet effective approach for ai-generated image detection," arXiv, 2023.
[102] S. Sinitsa et al., "Deep image fingerprint: Accurate and low budget synthetic image detector," in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., 2024.
[103] C. Schuhmann et al., "Laion-5b: An open large-scale dataset for training next generation image-text models," Adv. Neural Inf. Process. Syst., vol. 35, pp. 25278–25294, 2022.
[104] Z. Lu et al., "Seeing is not always believing: Benchmarking human and model perception of ai-generated images," in 37th Conf. Neural Inf. Process. Syst., 2023.
[105] P. Sharma et al., "Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning," in Proc. 56th Annu. Meet. Assoc. Comput. Linguist., 2018, pp. 2556–2565.
[106] R. Rombach et al., "High-resolution image synthesis with latent diffusion models," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 10684–10695.
[107] D. Lab, "If," https://github.com/deep-floyd/IF, 2023, accessed: 2023-06-07.
[108] M. Stypułkowski et al., "Diffused heads: Diffusion models beat gans on talking-face generation," arXiv, 2023.
[109] H. Cao et al., "Crema-d: Crowd-sourced emotional multimodal actors dataset," IEEE Trans. Affect. Comput., vol. 5, no. 4, pp. 377–390, 2014.
[110] C. Sun et al., "Ai-synthesized voice detection using neural vocoder artifacts," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshop, 2023, pp. 904–912.
[111] H. Zen et al., "Libritts: A corpus derived from librispeech for text-to-speech," arXiv, 2019.
[112] Z. Kong et al., "Diffwave: A versatile diffusion model for audio synthesis," in Int. Conf. Learn. Represent., 2020.
[113] A. v. d. Oord et al., "Wavenet: A generative model for raw audio," in 9th ISCA Speech Synthesis Workshop. ISCA, 2016, p. 125.
[114] A. Sudhakar et al., "Transforming delete, retrieve, generate approach for controlled text style transfer," in EMNLP-IJCNLP, 2019, pp. 3267–3277.
[115] O. Patashnik et al., "Styleclip: Text-driven manipulation of stylegan imagery," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 2085–2094.
[116] T. Wang et al., "High-fidelity gan inversion for image attribute editing," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 11379–11388.
[117] F. Cocchi et al., "Unveiling the impact of image transformations on deepfake detection: An experimental analysis," in Int. Conf. Image Anal. Process. Springer, 2023, pp. 345–356.
[118] J. Yang et al., "Harnessing the power of llms in practice: A survey on chatgpt and beyond," arXiv, 2023.
[119] OpenAI, "Gpt-4 technical report," 2023. [Online]. Available: https://cdn.openai.com/papers/gpt-4.pdf
[120] G. Team et al., "Gemini: A family of highly capable multimodal models," 2023. [Online]. Available: http://tinyurl.com/5fxxfzu2
[121] Z. Wan et al., "Med-unic: Unifying cross-lingual medical vision-language pre-training by diminishing bias," arXiv, 2023.
[122] Q. Zheng et al., "Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x," in Proc. 29th ACM SIGKDD Conf. Knowl. Discov. Data Min., 2023, pp. 5673–5684.
[123] M. U. Hadi et al., "Large language models: a comprehensive survey of its applications, challenges, limitations, and future prospects," Authorea Preprints, 2023.
[124] Y. Song et al., "Generative modeling by estimating gradients of the data distribution," Adv. Neural Inf. Process. Syst., vol. 32, 2019.
[125] ——, "Score-based generative modeling through stochastic differential equations," arXiv, 2020.
[126] L. Yang et al., "Diffusion models: A comprehensive survey of methods and applications," ACM Comput. Surv., vol. 56, no. 4, pp. 1–39, 2023.
[127] J. Betker et al., "Improving image generation with better captions," Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2023.
[128] G. DeepMind, "Imagen 2," http://tinyurl.com/3pakj3mk, 2023.
[129] MidJourney, "Midjourney," https://mid-journey.ai/.
[130] Amazon, "Amazon titan," https://aws.amazon.com/bedrock/titan/.
[131] S. Sheynin et al., "Emu edit: Precise image editing via recognition and generation tasks," arXiv, 2023.
[132] W. Harvey et al., "Flexible diffusion modeling of long videos," Adv. Neural Inf. Process. Syst., vol. 35, pp. 27953–27965, 2022.
[133] R. Yang et al., "Diffusion probabilistic modeling for video generation," Entropy, vol. 25, no. 10, p. 1469, 2023.
[134] U. Singer et al., "Make-a-video: Text-to-video generation without text-video data," in The Eleventh Int. Conf. Learn. Represent., 2023.
[135] R. Girdhar et al., "Emu video: Factorizing text-to-video generation by explicit image conditioning," arXiv, 2023.
[136] J. Ho et al., "Imagen video: High definition video generation with diffusion models," arXiv, 2022.
[137] A. Blattmann et al., "Stable video diffusion: Scaling latent video diffusion models to large datasets," arXiv, 2023.
[138] "Runway gen-2," https://research.runwayml.com/gen2, 2023.
[139] "Pika," https://pika.art/launch, 2023.
[140] M. Jeong et al., "Diff-tts: A denoising diffusion model for text-to-speech," in Interspeech Ann. Conf. Int. Speech Commun. Assoc. ISCA, 2021, pp. 3605–3609.
[141] N. Chen et al., "Wavegrad: Estimating gradients for waveform generation," in Int. Conf. Learn. Represent., 2020.
[142] ——, "Wavegrad 2: Iterative refinement for text-to-speech synthesis." ISCA, 2021, pp. 3765–3769.
[143] Z. Shi et al., "Itôn: End-to-end audio generation with itô stochastic differential equations," Dig. Signal Process., vol. 132, p. 103781, 2023.
[144] H. Khalid et al., "Evaluation of an audio-video multimodal deepfake dataset using unimodal and multimodal detectors," in Proc. 1st Workshop Synth. Multimedia-Audiovisual Deepfake Gen. Detect., 2021, pp. 7–15.
[145] A. Radford et al., "Learning transferable visual models from natural language supervision," in Int. Conf. Mach. Learn. PMLR, 2021, pp. 8748–8763.
[146] A. Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale," in 9th Int. Conf. Learn. Represent., 2021.
[147] Y. Shen et al., "Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface," in Proc. Neural Inf. Process. Syst., 2023.
[148] R. Huang et al., "Audiogpt: Understanding and generating speech, music, sound, and talking head," arXiv, 2023.
[149] S. Wu et al., "Next-gpt: Any-to-any multimodal llm," arXiv, 2023.
[150] D. A. Coccomini et al., "Detecting images generated by diffusers," arXiv, 2023.
[151] S.-I. Papadopoulos et al., "Verite: a robust benchmark for multimodal misinformation detection accounting for unimodal bias," Int. J. Multimed. Inf. Retr., vol. 13, no. 1, p. 4, 2024.
[152] H. Liu et al., "Unified frequency-assisted transformer framework for detecting and grounding multi-modal manipulation," arXiv, 2023.
[153] J. Wang et al., "Exploiting modality-specific features for multi-modal manipulation detection and grounding," arXiv, 2023.
[154] D. Xu et al., "Combating misinformation in the era of generative ai models," in Proc. 31st ACM Int. Conf. Multimedia, 2023, pp. 9291–9298.
[155] H. Wu et al., "Generalizable synthetic image detection via language-guided contrastive learning," arXiv, 2023.
[156] Y.-M. Chang et al., "Antifakeprompt: Prompt-tuned vision-language models are fake image detectors," arXiv, 2023.
[157] Z. Sha et al., "De-fake: Detection and attribution of fake images generated by text-to-image generation models," in Proc. 2023 ACM SIGSAC Conf. Comput. Commun. Secur., 2023, pp. 3418–3432.
[158] R. Amoroso et al., "Parents and children: Distinguishing multimodal deepfakes from natural images," arXiv, 2023.
[159] S. Kamat et al., "Revisiting generalizability in deepfake detection: Improving metrics and stabilizing transfer," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2023, pp. 426–435.
[160] R. Corvi et al., "Intriguing properties of synthetic images: from generative adversarial networks to diffusion models," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshop, 2023, pp. 973–982.
[161] G. J. A. Porcile et al., "Finding ai-generated faces in the wild," arXiv, 2023.
[162] G. Carrière et al., "Beyond human forgeries: An investigation into detecting diffusion-generated handwriting," in Proc. Int. Conf. Document Analysis and Recognition. Springer, 2023, pp. 5–19.
[163] A. Y. J. Ha and J. Passananti, "Organic or diffused: Can we distinguish human art from ai-generated images?" arXiv preprint arXiv:2402.03214v1, 2024.
[164] A. Hooda et al., "D4: Detection of adversarial diffusion deepfakes using disjoint ensembles," in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., 2024.
[165] Y. Ju et al., "Glff: Global and local feature fusion for ai-synthesized image detection," IEEE Trans. Multimedia, 2023.
[166] Q. Xu et al., "Exposing fake images generated by text-to-image diffusion models," Pattern Recognit. Lett., 2023.
[167] Y. Jer Wong and T. K. Ng, "Local statistics for generative image detection," arXiv, 2023.
[168] D. Tantaru et al., "Weakly-supervised deepfake localization in diffusion-generated images," in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., 2024.
[169] L. Zhang et al., "Perceptual artifacts localization for image synthesis tasks," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2023, pp. 7579–7590.
[170] A. Aghasanli et al., "Interpretable-through-prototypes deepfake detection for diffusion models," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshop, 2023, pp. 467–474.
[171] D. C. Epstein et al., "Online detection of ai-generated images," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshop, 2023, pp. 382–392.
[172] D. Cozzolino et al., "Raising the bar of ai-generated image detection with clip," arXiv, 2023.
[173] Z. Yan et al., "Transcending forgery specificity with latent space augmentation for generalizable deepfake detection," arXiv, 2023.
[174] Y. Jeong et al., "Fingerprintnet: Synthesized fingerprints for generated image detection," in Eur. Conf. Comput. Vis. Springer, 2022, pp. 76–94.
[175] T. Reiss et al., "Detecting deepfakes without seeing any," arXiv, 2023.
[176] P. Dogoulis et al., "Improving synthetically generated image detection in cross-concept settings," in Proc. 2nd ACM Int. Workshop Multimedia AI against Disinformation, 2023, pp. 28–35.
[177] Y. Zhang and X. Xu, "Diffusion noise feature: Accurate and fast generated image detection," arXiv, 2023.
[178] L. Guarnera et al., "Level up the deepfake detection: a method to effectively discriminate images generated by gan architectures and diffusion models," arXiv, 2023.
[179] M. Wolter et al., "Wavelet-packets for deepfake image analysis and detection," Mach. Learn., vol. 111, no. 11, pp. 4295–4327, 2022.
[180] N. Poredi et al., "Ausome: authenticating social media images using frequency analysis," in Disruptive Technologies in Information Sciences VII, vol. 12542. SPIE, 2023, pp. 44–56.
[181] Z. Xi et al., "Ai-generated image detection using a cross-attention enhanced dual-stream network," Asia Pacific Signal Inf. Process. Assoc. Annu. Summit Conf., 2023.
[182] M.-Q. Nguyen et al., "Unmasking the artist: Discriminating human-drawn and ai-generated human face art through facial feature analysis," in Int. Conf. Multimedia Anal. Pattern Recognit. IEEE, 2023, pp. 1–6.
[183] P. Lorenz et al., "Detecting images generated by deep diffusion models using their local intrinsic dimensionality," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshop, 2023, pp. 448–459.
[184] R. Ma et al., "Exposing the fake: Effective diffusion-generated images detection," in Second Workshop New Frontiers Advers. Mach. Learn., 2023.
[185] A. Borji, "Qualitative failures of image generation models and their application in detecting deepfakes," Image Vis. Comput., 2023.
[186] H. Farid, "Perspective (in) consistency of paint by text," arXiv, 2022.
[187] ——, "Lighting (in) consistency of paint by text," arXiv, 2022.
[188] S. Chakraborty, A. S. Bedi, S. Zhu, B. An, D. Manocha, and F. Huang, "On the possibilities of ai-generated text detection," arXiv preprint arXiv:2304.04736, 2023.
[189] J. Becker et al., "Paraphrase detection: Human vs. machine content," arXiv, 2023.
[190] W. Antoun et al., "From text to source: Results in detecting large language model-generated content," arXiv, 2023.
[191] C. Chen and K. Shu, "Can llm-generated misinformation be detected?" in Int. Conf. Learn. Represent. Poster, 2024.
[192] S. Tu et al., "Chatlog: Recording and analyzing chatgpt across time," arXiv, 2023.
[193] X. Pu et al., "On the zero-shot generalization of machine-generated text detectors," arXiv, 2023.
[194] A. M. Sarvazyan et al., "Supervised machine-generated text detectors: Family and scale matters," in Int. Conf. Cross-Language Eval. Forum Eur. Lang. Springer, 2023, pp. 121–132.
[195] T. Kumarage et al., "J-guard: Journalism guided adversarially robust detection of ai-generated news," 2023.
[196] E. Tulchinskii et al., "Intrinsic dimension estimation for robust detection of ai-generated texts," arXiv, 2023.
[197] X. Hu et al., "Radar: Robust ai-text detection via adversarial learning," arXiv, 2023.
[198] Z. Shi et al., "Red teaming language model detectors with language models," arXiv, 2023.
[199] X. He et al., "Mgtbench: Benchmarking machine-generated text detection," CoRR, 2023.
[200] S. Mitrović et al., "Chatgpt or human? detect and explain. explaining decisions of machine learning model for detecting short chatgpt-generated text," arXiv, 2023.
[201] J. Kirchenbauer et al., "A watermark for large language models," in Int. Conf. Mach. Learn., 2023.
[202] X. Yang et al., "Dna-gpt: Divergent n-gram analysis for training-free detection of gpt-generated text," arXiv, 2023.
[203] A. Bhattacharjee et al., "Conda: Contrastive domain adaptation for ai-generated text detection," arXiv, 2023.
[204] L. Li et al., "Origin tracing and detecting of llms," arXiv, 2023.
[205] R. R. Soto and K. Koch, "Few-shot detection of machine-generated text using style representations," arXiv, 2024.
[206] K. Wu et al., "Llmdet: A third party large language models generated text detection tool," in 2023 Conf. Empirical Methods Nat. Lang. Process., 2023.
[207] S. Venkatraman et al., "Gpt-who: An information density-based machine-generated text detector," arXiv, 2023.
[208] A. Uchendu et al., "TURINGBENCH: A benchmark environment for Turing test in the age of neural text generation," in Findings Assoc. Comput. Linguist., Nov. 2021, pp. 2001–2016.
[209] B. Ai et al., "Whodunit? learning to contrast for authorship attribution," in Proc. 2nd Conf. Asia-Pacific Chapter Assoc. Comput. Linguist. 12th Int. Joint Conf. Nat. Lang. Process., vol. 1, Nov. 2022, pp. 1142–1157.
[210] S. Munir et al., "Through the looking glass: Learning to attribute synthetic text generated by language models," in Proc. 16th Conf. Eur. Chapter Assoc. Comput. Linguist.: Main Volume, 2021, pp. 1811–1822.
[211] A. Uchendu et al., "Toproberta: Topology-aware authorship attribution of deepfake texts," arXiv, 2023.
[212] Y. Tian et al., "Multiscale positive-unlabeled detection of ai-generated texts," arXiv, 2023.
[213] Z. Deng et al., "Efficient detection of llm-generated texts with a bayesian surrogate model," arXiv, 2023.
[214] G. Bao et al., "Fast-detectgpt: Efficient zero-shot detection of machine-generated text via conditional probability curvature," arXiv, 2023.
[215] J. Su et al., "Detectllm: Leveraging log rank information for zero-shot detection of machine-generated text," arXiv, 2023.
[216] E. Mitchell et al., "Detectgpt: Zero-shot machine-generated text detection using probability curvature," ICML, 2023.
[217] C. Vasilatos et al., "Howkgpt: Investigating the detection of chatgpt-generated university student homework through context-aware perplexity analysis," arXiv, 2023.
20
[218] "Gptzero," https://gptzero.me/, 2022.
[219] T. Kumarage et al., "Stylometric detection of ai-generated text in twitter timelines," arXiv, 2023.
[220] X. Liu et al., "Coco: Coherence-enhanced machine-generated text detection under data limitation with contrastive learning," in Proc. 2023 Conf. Empirical Methods Nat. Lang. Process., 2023.
[221] J. Pu et al., "Unraveling the mystery of artifacts in machine generated text," in Proc. Thirteenth Lang. Resour. Eval. Conf., 2022, pp. 6889–6898.
[222] X. Zhao et al., "Distillation-resistant watermarking for model protection in NLP," in Findings Assoc. Comput. Linguist. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, Dec. 2022, pp. 5044–5055.
[223] P. Fernandez et al., "Three bricks to consolidate watermarks for large language models," arXiv, 2023.
[224] K. Yoo et al., "Robust multi-bit natural language watermarking through invariant features," in Proc. 61st Annu. Meet. Assoc. Comput. Linguist., vol. 1, 2023, pp. 2092–2115.
[225] M. Christ et al., "Undetectable watermarks for language models," arXiv, 2023.
[226] R. Kuditipudi et al., "Robust distortion-free watermarks for language models," arXiv, 2023.
[227] X. Zhao et al., "Provable robust watermarking for ai-generated text," arXiv, 2023.
[228] A. Liu et al., "A private watermark for large language models," arXiv, 2023.
[229] J. Liu et al., "Opt: Omni-perception pre-trainer for cross-modal understanding and generation," arXiv, 2021.
[230] R. Taori et al., "Stanford alpaca: An instruction-following llama model," http://tinyurl.com/bdfnwfju, 2023.
[231] S. M. Lundberg et al., "A unified approach to interpreting model predictions," Adv. Neural Inf. Process. Syst., vol. 30, 2017.
[232] U. A. Ciftci et al., "Fakecatcher: Detection of synthetic portrait videos using biological signals," IEEE Trans. Pattern Anal. Mach. Intell., 2020.
[233] X. Wang et al., "GAN-generated faces detection: A survey and new perspectives," in Eur. Conf. Artif. Intell., Kraków, Poland, 2023.
[234] M. Kang et al., "Scaling up gans for text-to-image synthesis," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2023, pp. 10124–10134.
[235] J. Song et al., "Denoising diffusion implicit models," in 9th Int. Conf. Learn. Represent., 2021.
[236] R. Geirhos et al., "Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness," in 7th Int. Conf. Learn. Represent. OpenReview.net, 2019.
[237] J. Vacher et al., "Texture interpolation for probing visual perception," Adv. Neural Inf. Process. Syst., vol. 33, pp. 22146–22157, 2020.
[238] M. Tan and Q. Le, "Efficientnet: Rethinking model scaling for convolutional neural networks," in Int. Conf. Mach. Learn. PMLR, 2019, pp. 6105–6114.
[239] H. Tak et al., "End-to-end anti-spoofing with RawNet2," in IEEE Int. Conf. Acoust. Speech Signal Process. IEEE, 2021, pp. 6369–6373.
[240] G. Luo et al., "Newsclippings: Automatic generation of out-of-context multimodal media," in Proc. 2021 Conf. Empirical Methods Nat. Lang. Process., EMNLP 2021, 2021, pp. 6801–6817.
[241] R. Shao, T. Wu, J. Wu, L. Nie, and Z. Liu, "Detecting and grounding multi-modal media manipulation and beyond," arXiv, 2023.
[242] Copyleaks, "Ai content detector," https://copyleaks.com/ai-content-detector.
[243] ZeroGPT, "Ai content detector, chatgpt detector," https://zerogpt.net/zerogpt-results.
[244] Winston AI, "Ai content detector," https://gowinston.ai/.
[245] Crossplag, "Ai content detector," https://crossplag.com/ai-content-detector/.
[246] GLTR, "Giant language model test room," http://gltr.io/.
[247] Content at Scale, "The ai detector," http://tinyurl.com/yc8ackha.
[248] Undetectable AI, "Advanced ai detector and humanizer," https://undetectable.ai/.
[249] Illuminarty, "Illuminarty text," https://app.illuminarty.ai/#/text.
[250] Is It AI, "Ai-generated text detector," https://isitai.com/ai-text-detector/.
[251] Originality.ai, "Ai checker," https://originality.ai/.
[252] Writer, "Ai content detector," https://writer.com/ai-content-detector/.
[253] Conch, "Ai content detector," https://www.getconch.ai/.
[254] "Ai or not," https://www.aiornot.com/.
[255] Is It AI, "Ai-generated image detector," https://isitai.com/ai-image-detector/.
[256] Illuminarty, "Illuminarty image," https://app.illuminarty.ai/#/image.
[257] Google, "Synthid," http://tinyurl.com/22789nyc.
[258] Content at Scale, "Advanced ai image detector," http://tinyurl.com/mr34huak.
[259] Huggingface, "Ai image detector," http://tinyurl.com/mr2pc958.
[260] Google, "Imagen," https://imagen.research.google/editor/.
[261] T. Wang et al., "Deepfake detection: A comprehensive study from the reliability perspective," arXiv, 2022.
[262] A. B. Arrieta et al., "Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and challenges toward responsible ai," Information Fusion, vol. 58, pp. 82–115, 2020.
[263] P. Angelov et al., "Towards interpretable-by-design deep learning algorithms," arXiv, 2023.
[264] M. Ivanovska et al., "On the vulnerability of deepfake detectors to attacks generated by denoising diffusion models," in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., 2024, pp. 1051–1060.
[265] M. Saberi et al., "Robustness of ai-image detectors: Fundamental limits and practical attacks," arXiv, 2023.
[266] R. Bommasani et al., "On the opportunities and risks of foundation models," arXiv, 2021.
[267] Y. Ma et al., "Is this abstract generated by ai? a research for the gap between ai-generated scientific text and human-written scientific text," arXiv, 2023.
[268] A. Muñoz-Ortiz et al., "Contrasting linguistic patterns in human and llm-generated text," arXiv, 2023.
[269] O. Sagi and L. Rokach, "Ensemble learning: A survey," Wiley Interdiscip. Rev. Data Min. Knowl. Discov., vol. 8, no. 4, p. e1249, 2018.
[270] Y. Xu et al., "A comprehensive analysis of ai biases in deepfake detection with massively annotated databases," arXiv, 2022.
[271] Y. Ju et al., "Improving fairness in deepfake detection," in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., 2024, pp. 4655–4665.

Li Lin received the B.S. degree in communication engineering from Chongqing University, China, in 2020. She is a master's student in the Department of Computer and Information Technology and a Ph.D. student in the Department of Computer Science at Purdue University in Indianapolis. Her research interests include computer vision, digital media forensics, and deep learning.

Neeraj Gupta received the Bachelor of Technology degree from the Indian Institute of Technology Kanpur in 2019. He is a master's student in the Department of Computer and Information Technology, Purdue University in Indianapolis. His research interests include machine learning, computer vision, and digital media forensics.

Yue Zhang received the B.S. degree in network engineering from Anhui Jianzhu University, Hefei, China, in 2023. She is a master's student in the School of Software, Nanchang University, Nanchang, China. Her research interests mainly include digital forensics, machine learning, and digital image processing.
Hainan Ren received the B.S. degree in automation engineering from Hebei University of Technology in 2012, and the M.S. degree in control engineering from Hebei University of Technology in 2015. He is a senior engineer in Algorithm Research, Aibee Inc. His research interests include face recognition, person re-identification, multi-modal learning, and generative models.

Dr. Shu Hu received the M.Eng. degree in software engineering from the University of Science and Technology of China in 2016, the M.A. degree in mathematics from University at Albany, SUNY, in 2020, and the Ph.D. degree in computer science and engineering from University at Buffalo, SUNY, in 2022. He is an assistant professor in the Department of Computer and Information Technology, Purdue University. He was a postdoc at Carnegie Mellon University. His research interests include machine learning, multimedia forensics, and computer vision. He is a member of IEEE.
Dr. Chun-Hao Liu received the B.S. degree in electronics engineering from National Chiao Tung University in 2007, the M.S. degree in electronics engineering from National Taiwan University in 2009, and the Ph.D. degree in electrical engineering from the University of California, Los Angeles, in 2015. He is currently a senior applied scientist at Amazon Prime Video. His research interests are computer vision, deep learning, and signal processing.