
Membership Inference Attacks against Vision Transformers:
Mosaic MixUp Training to the Defense

Qiankun Zhang* (qiankun@hust.edu.cn), Di Yuan (diyuan@hust.edu.cn), Boyu Zhang (boyu@hust.edu.cn)
School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan, Hubei, China
Qiankun Zhang is also with the Key Laboratory of Cyberspace Security, Ministry of Education, Zhengzhou, Henan, China.

Bin Yuan (yuanbin@hust.edu.cn)
School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan, Hubei, China

Bingqian Du (bqdu@hust.edu.cn)
School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China

Abstract

Vision transformers (ViTs) have demonstrated great success in various fundamental CV tasks, mainly benefiting from their self-attention-based transformer architectures and the paradigm of pre-training followed by fine-tuning. However, such advantages may lead to significant data privacy risks, such as membership inference attacks (MIAs), which remain unclear. This paper presents the first comprehensive study on MIAs and corresponding defenses against ViTs. Our first contribution is a rollout-attention-based MIA method (RAMIA), based on an experimental observation that the attention, more precisely the rollout attention, behaves disproportionately for members and non-members. We evaluate RAMIA on the standard ViT architecture proposed by Google (ICLR 2021), achieving high accuracy, precision, and recall. Further, inspired by another experimental observation of a strong connection between positional embeddings (PEs) and attentions, we propose a novel framework for training ViTs, named Mosaic MixUp Training (MMUT), as a defense against RAMIA. Intuitively, MMUT mixes up private images and public ones at the patch level, and mosaics the corresponding PEs with a global learnable mosaic embedding. Our empirical results show that MMUT achieves a much better accuracy-privacy trade-off than some common defense mechanisms. Extensive experiments are conducted to rigorously evaluate both RAMIA and MMUT.

CCS Concepts
• Security and privacy; • Computing methodologies → Machine learning;

Keywords
Membership inference attacks, Vision transformers

ACM Reference Format:
Qiankun Zhang, Di Yuan, Boyu Zhang, Bin Yuan, and Bingqian Du. 2024. Membership Inference Attacks against Vision Transformers: Mosaic MixUp Training to the Defense. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security (CCS '24), October 14–18, 2024, Salt Lake City, UT, USA. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3658644.3690268

* Qiankun Zhang is the corresponding author.

CCS '24, October 14–18, 2024, Salt Lake City, UT, USA
© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-0636-3/24/10. https://doi.org/10.1145/3658644.3690268

1 Introduction

Large Vision Models are on their way!

Large Language Models (LLMs) such as GPT [4] and LLaMA [53] have emerged as a game-changing force in artificial intelligence. The Large Vision Models (LVMs) revolution is arriving later but surely: a very recent series of work [22, 55] since Bai et al. [3] makes breakthroughs in training LVMs without any linguistic data. Besides the power of data, the transformer-based architecture plays a vital role in their successes. In this paper, we study vision transformers [19] (ViTs), the skeletons of LVMs, from a security perspective.


ViTs are gaining remarkable popularity as an alternative to convolutional neural networks (CNNs) across various computer vision tasks, such as image classification [19, 52], segmentation [5, 59], and object detection [6, 65]. The key ideas behind these breakthroughs can be characterized as two main aspects: (a) the transformer architectures based on the self-attention mechanism, which enable the model to learn global features of images; (b) initially pre-training on an upstream (un)labeled dataset through (self-)supervised learning, followed by fine-tuning on a local labeled dataset for a downstream task [19]. Most previous literature on ViTs targets developing various pre-training objectives [9, 24, 60] or variants of self-attention mechanisms [20, 56, 63, 64]. However, such efforts in both aspects can lead to serious privacy risks for training data, which have not been carefully studied yet. For example, ViTs may suffer membership inference attacks (MIAs) [47], which aim to infer whether an input image is in the pre-training dataset.

[Figure 1: Histograms for the number of member vs. non-member images across different cosine similarity scores. Panels: (a) rollout attention, (b) last attention, (c) encoder output.]

The principal component of a ViT is a transformer encoder, a multi-blocked neural network that processes input images to encode them as representative features containing global contextual information. The output of the ViT encoder can be used for various downstream tasks, such as image classification, where an extra classifier is added to the final block of the encoder as a decoder. To achieve high accuracy and wide scalability to various downstream tasks, the ViT encoder is typically pre-trained by organizations with substantial computational resources, such as Google and OpenAI. Subsequently, the entire ViT encoder (with or without a decoder), plus a model card that describes its architecture, training dataset, training algorithm, etc., is released publicly on websites such as Hugging Face¹. Such applications inspire our first research question: given an incomplete ViT, say, a ViT encoder, can and how can an attacker proceed with MIAs?

¹ A popular open-source hub for machine learning models, https://huggingface.co/models

To answer the above, the only existing successful attempt is EncoderMI, proposed by Liu et al. [35]. EncoderMI attacks a ViT encoder, CLIP [42], based on the encoder output in a black-box manner, meaning that only the encoder output is known to the attacker. Their results are exemplary but preliminary in attacking ViTs: (a) they target a restricted contrastive learning setting, which pre-trains an encoder using data augmentation in a self-supervised fashion on unlabeled datasets; and (b) they rely on a strict black-box assumption, which may not reflect the aforementioned practical applications where downstream tasks can access full information on ViT encoders, in other words, in a white-box manner.

In contrast, we study a more typical supervised pre-training for ViTs on labeled datasets, and focus on the core of a ViT encoder, the self-attention mechanism. In general, a ViT encoder starts by dividing an input image into fixed-size patches, and then linearly projects them into a sequence of vectors named patch embeddings. To capture the spatial arrangement of image patches, positional embeddings (PEs) are added to the patch embeddings. The self-attention mechanism allows the model to integrate information across all patches, capturing global relationships in the image. That is why ViTs may surpass CNNs, which primarily focus on local features. However, we observe through an experiment that the attention, an intermediate feature representation matrix weighing the importance of different patches relative to each other, can lead to membership leakage. We pre-train a standard ViT model provided by Google [19] from scratch on a member dataset, add independent Gaussian noise to each training image, and forward both images into the encoder. We compute a cosine similarity between their corresponding attention maps, as well as encoder outputs. Figures 1(b) and 1(c) present histograms of the number of images across different similarity scores, respectively. We observe that distribution differences between members and non-members do exist, albeit not pronounced. The coming challenge is: can and how can we enhance such differences and further apply them to MIAs?

We adopt the attention rollout technique [2], a method to quantify a transformer encoder's attention flow. In a nutshell, attention rollout aggregates attention maps from the first to the final encoder block, producing a rollout attention (RA) map. The disparity in the RA maps' similarity distribution is noticeably enhanced, as shown in Figure 1(a). Based on this observation, we propose a rollout-attention-based MIA (RAMIA) against ViTs. RAMIA follows the framework of shadow training [47], and uses the cosine similarities between the RA maps of images before and after adding noise as a feature vector to distinguish members from non-members. We compare two variants of RAMIA, depending on whether it trains a binary classifier [47] or determines a carefully selected threshold [43] for the inference. To evaluate RAMIA, we conduct experiments on the CIFAR10, CIFAR100, ImageNet100, and ISIC2018 datasets², achieving higher accuracy, precision, and recall than baseline models, including EncoderMI. See Section 3 for details.

² We consider four datasets in experiments for diversity. CIFAR10, CIFAR100, and ImageNet100 are standard image classification tasks in CV. ISIC2018 contains dermoscopy images used for disease classification.

Subsequently, we ask how to defend against RAMIA in Section 4. We first observe a strong connection between PEs and RAs through an intuitive experiment. In particular, if we fix all PEs to identical learnable parameters, the difference in RA similarity before and after noise addition is significantly reduced compared to the standard ViT training method, where each PE is trained separately. Inspired by such findings, we design a novel defense method for training ViTs, Mosaic MixUp Training (MMUT). MMUT mixes up a public dataset and a private dataset by replacing a certain percentage of patches in each training image. PEs corresponding to replaced patches are themselves replaced by a global learnable mosaic embedding when forwarding the image. By doing so, in the backward process, only the mosaic embedding and the non-replaced PEs (but not the replaced PEs) are updated. We compare our MMUT to various typical defense mechanisms in the literature, including label smoothing [51], DP-SGD [1], RelaxLoss [10], and adversarial regularization [40].


Besides, the experiments consider whether the attacker is adaptive, that is, whether it knows the defense methods. Empirical results show the remarkable effectiveness of our MMUT. We highlight that MMUT borrows the idea of mixing up images for data augmentation [48]; however, a greater contributor to MMUT's success might be our special treatment of PEs, i.e., the mosaic embedding. To justify this, we compare MMUT with several standard image augmentation methods in the literature, including merely mixing up images without mosaic embeddings. Empirical results show that they are less effective in defending against RAMIAs. Nevertheless, coupling PE mosaics with image MixUp contributes to maintaining (or even surprisingly enhancing) the prediction accuracy of ViTs despite a high level of privacy. See Section 4 for details.

Practical Applications of Our Methods. Research on MIAs against ViTs is necessary and meaningful. One of the most significant applications of ViTs is medical image classification [45]. For instance, telemedicine platforms, such as Teladoc Health, use ViTs to identify patterns in chest X-rays that are indicative of conditions like pneumonia. In this scenario, a hospital might utilize MIAs to ensure that patients' sensitive medical images used in pre-trained ViTs are handled in compliance with HIPAA and GDPR regulations, thereby maintaining patient confidentiality and data security. Another possible application is content moderation on social media platforms [57]. For example, Facebook uses ViTs to moderate content by analyzing user-uploaded images. However, there is a risk that these models may inadvertently learn sensitive or private user data. Using MIAs, an independent watchdog organization could audit whether personal images uploaded by users have been used without consent to train these models. In addition, to ensure a more realistic fit, the ViT architecture studied in this paper is an official model³ released by Google [19]. We pre-train it from scratch on CIFAR10, CIFAR100, ImageNet100, and ISIC2018 in our experiments. Moreover, to see how our RAMIA performs on a ViT encoder directly downloaded from the website, we also attack the original model with the provided pre-trained parameters in Section 3.5.

³ https://huggingface.co/google/vit-base-patch16-224-in21k

Summary of Our Contributions:
• We provide the first comprehensive study on MIAs against ViTs.
• We propose and rigorously evaluate RAMIA, a white-box MIA method against ViT encoders, focusing on the self-attention mechanism.
• We propose MMUT, a training framework for ViTs. Extensive experiments show its effectiveness in preserving the accuracy-privacy trade-off.

2 Preliminaries and Related Work

2.1 Vision Transformer

ViT [23, 30] is first proposed by Dosovitskiy et al. [19], whose core idea is to segment an image into a series of patches (typically 16×16 pixels) and then process these patches as a sequence of 'tokens' like words in NLP. The transformer architecture is trained to capture the long-range dependencies between these tokens. The self-attention mechanism helps ViTs effectively focus on the interactions between any parts of an image and achieve excellent performance in many CV tasks.

Figure 2 demonstrates a complete ViT model for a classification task. The most crucial part is the transformer encoder. The transformer encoder initially divides the input image into multiple small patches and flattens each patch into a one-dimensional vector, named a patch embedding, through a linear projection. Note that for a classification task, a special vector, known as the class token (CT), is added to the sequence of patch embedding vectors. The CT is typically used to capture the global information of the entire image; after the final block of the ViT encoder, the CT is the only input to the classifier, because it has aggregated full information (both patch-wise and position-wise) from the entire image through the series of transformer layers. Subsequently, a learnable positional embedding (PE) vector is added to each patch embedding, including the CT. This combination, now imbued with positional information, is then trained through a multi-head self-attention (MSA) mechanism, and is dynamically refined through learning and optimization with the training data.

Multi-head Self-attention. MSA is the core component of the transformer encoder, which contains N transformer blocks. In each block, MSA aggregates the sequential tokens as

$$ t_j = \sum_i \mathbf{A}_{ji}\,\mathbf{V}_i = \sum_i \mathrm{SoftMax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d}}\right)_{ji}\mathbf{V}_i, \qquad (1) $$

where Q, K, and V are the query, key, and value matrices, respectively, d is the dimension of the query and key, and t_j is the j-th output token. We highlight that A is called the attention map, a matrix of scores indicating the relevance of each input embedding to the others. Our attacks are based on such attention maps, more precisely, on the rollout attention (RA). Multiple attention heads separately run the attention mechanism in parallel, making the model focus on different parts of the input simultaneously and capture various aspects of the information. The outputs from each head are concatenated back together, and the concatenated output is passed through another linear transformation to produce the final output of the multi-head attention mechanism.

Classifier. In general, ViTs do not include a 'decoder' part. For a classification task, a classifier is added at the end of the final encoder block. The classifier generally contains a multi-layer perceptron (MLP) head, which usually consists of multiple layers of fully connected neural networks. The output is a prediction vector over every class label.

Rollout Attention. Attention rollout, proposed by Abnar and Zuidema [2], is a methodology for quantifying the flow of attention of a transformer encoder by tracing the progression of attention from the initial to the final block. Specifically, in each transformer block, an attention map A is computed. Each A_{ij} indicates how much attention flows from token j in the previous block to token i in the next block. An identity matrix I is then added to the layer attention, A + I, to symbolize the unchanging attention mapping that results from the residual connections between blocks. Consequently, the RA matrix at block ℓ, denoted RA_ℓ, can be computed recursively by matrix multiplication as

$$ \mathrm{RA}_\ell = (\mathbf{A}_\ell + \mathbf{I})\,\mathrm{RA}_{\ell-1}. \qquad (2) $$
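To make the recursion in Eqn. (2) concrete, the following sketch computes a rollout attention map from a list of per-block attention maps. It is an illustrative PyTorch implementation rather than the paper's exact code; the row re-normalization after adding the identity follows the common practice of Abnar and Zuidema [2], and the head_fusion argument corresponds to the Max/Min/Mean options evaluated in Section 3.4.

    # Minimal attention-rollout sketch (assumptions: attn_maps is a list with one
    # tensor per encoder block, each of shape (num_heads, num_tokens, num_tokens)).
    import torch

    def attention_rollout(attn_maps, head_fusion="mean"):
        rollout = None
        for attn in attn_maps:
            # Fuse the attention heads entry-wise (Mean / Max / Min).
            if head_fusion == "mean":
                fused = attn.mean(dim=0)
            elif head_fusion == "max":
                fused = attn.max(dim=0).values
            else:
                fused = attn.min(dim=0).values
            # Add the identity for residual connections, then re-normalize rows.
            a = fused + torch.eye(fused.size(-1), device=fused.device)
            a = a / a.sum(dim=-1, keepdim=True)
            # RA_l = (A_l + I) @ RA_{l-1}
            rollout = a if rollout is None else a @ rollout
        return rollout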


[Figure 2: Overview of the vision transformer architecture. Left: the full pipeline, in which an input image is split into patches, passed through a linear projection of flattened patches, prepended with a class token, combined with positional embeddings, and fed through N encoder blocks (Norm, multi-head attention, MLP) before an MLP classifier produces the class prediction (e.g., Bird, Ball). Right: the multi-head attention module (parallel linear projections of Q, K, V, scaled dot-product attention, concatenation, linear output) and the scaled dot-product attention itself (MatMul, Scale, SoftMax, MatMul), whose SoftMax output is the attention map A.]
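As a concrete illustration of the input pipeline just described (patchify, linear projection, class token, positional embeddings), the following minimal PyTorch module is a sketch; the module name, default dimensions, and the convolution-based projection are illustrative assumptions rather than the paper's implementation.

    # Toy sketch of the ViT input pipeline of Figure 2.
    import torch
    import torch.nn as nn

    class PatchEmbedding(nn.Module):
        def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
            super().__init__()
            self.num_patches = (img_size // patch_size) ** 2
            # A strided convolution is equivalent to flattening each patch
            # and applying a shared linear projection.
            self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
            self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                   # class token (CT)
            self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))  # learnable PEs

        def forward(self, x):                                   # x: (B, 3, H, W)
            tokens = self.proj(x).flatten(2).transpose(1, 2)    # (B, N, dim) patch embeddings
            cls = self.cls_token.expand(x.size(0), -1, -1)
            tokens = torch.cat([cls, tokens], dim=1)            # prepend the CT
            return tokens + self.pos_embed                      # add positional embeddings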

Further, recall that transformer encoders often have multiple attention heads, which separately compute self-attention in parallel. In this case, several different methods can be used to combine the heads, namely computing the entry-wise maximum (Max), minimum (Min), or average (Mean) over the heads. Abnar and Zuidema [2] suggest that the best choice may vary among tasks, and thus we evaluate all of them in Section 3.4.

Remarks. In our RAMIA, instead of using attention maps, we use the global rollout attention, meaning that the attention maps of every block are multiplied cumulatively according to Eqn. (2) to construct the feature vectors for training shadow encoders.

Most existing work on ViTs aims to enhance performance on various vision tasks, primarily focusing on three means: boosting the concept of locality within images [9, 24, 60], refining the self-attention mechanism [20, 56, 63, 64], and innovating in architectural design [14, 18, 21, 24, 28, 34, 37, 52]. Besides EncoderMI, privacy risks in ViTs have been reported by Lu et al. [38], who point out a gradient leakage risk of the self-attention-based mechanism. To the best of our knowledge, we provide the first comprehensive study on both membership inference attacks and defenses against ViTs.

2.2 Membership Inference Attack

MIAs on machine learning (ML) models aim to determine whether a specific sample is included in the training set of the target model. Depending on how the attack model is built, there are two main types of MIA strategies: those that train a binary classifier for members and non-members (classifier-based), and those that rely on a carefully designed metric to separate members and non-members directly (threshold-based). For example, Shokri et al. [47] first propose a shadow training technique, which trains a shadow model to imitate the behavior of the target model using a shadow dataset; a classifier is then trained using the output of the shadow models as a feature. Such output-based classification often appears in black-box attacks. In contrast, white-box MIAs can access all the information, including the prediction vector, the intermediate computation (e.g., the feature map) at each hidden layer, the loss, and the gradient of the loss with respect to the parameters of each layer for the input image. An example is from Salem et al. [44], who use either the maximum prediction confidence or the prediction loss of the target model as a metric; a threshold is then determined to classify members and non-members, named threshold-based MIAs. Recently, Pang et al. [41] employ model gradients in MIAs and argue that they provide a more profound feature representation. Our RAMIA basically follows the idea of shadow training. We design both classifier-based and threshold-based variants of RAMIA, and compare them with the aforementioned attack methods. Numerous studies have applied MIAs to various models, including classification models [7, 36, 58], generative adversarial networks (GANs) [11, 25, 26], and diffusion models [8, 31, 41]. Readers may refer to [27] for a survey.

The existing work most relevant to ours is EncoderMI [35], a black-box MIA towards encoders in contrastive learning, which assumes the encoder is sufficiently overfitted. EncoderMI relies heavily on this assumption and observes a disparity in the encoder's outputs for members and non-members. Although most of their experiments are conducted on CNNs, they also report results on CLIP with a ViT encoder. Compared to EncoderMI, our work differs in: (a) not relying on the contrastive learning assumption, which may not be dominant in realistic scenarios where ViT encoders are pre-trained on labeled datasets; (b) assuming a more applicable white-box access, where full information on ViT encoders can be used for attacks; and (c) a new defense method via a unified training framework.


2.3 Defenses against MIAs

Various defenses [13, 29, 33, 46, 47, 49, 62] against MIAs have been well studied. One of the most widely used privacy-preserving techniques is differential privacy (DP) [1], which provides a promising defense against MIAs by adding noise to the gradients (DP-SGD) or parameters during model training. DP faces a significant trade-off between accuracy and privacy: a high degree of security may lead to poor prediction accuracy. Another defense more specific to MIAs is regularization [25, 26, 33, 50], a more efficient way to ensure the privacy of overfitted models. It has been reported that most of the methods used to enhance the generalizability of an NN model can improve its privacy at the same time. For example, label smoothing [51] is proved to be effective in mitigating MIAs by minimizing the behavioral discrepancy between training and test data; adversarial regularization [40] incorporates the membership inference gain into the target model's objective function, balancing classification loss with attack model accuracy; RelaxLoss [10] designs a more achievable learning objective, achieving easy implementation and negligible overhead. Details of these methods as applied to ViTs are discussed in Section 4.3. Our MMUT uses them as baselines for comparison in Section 4.4.

3 Rollout Attention MIA

3.1 Attacker Assumptions

Recall that the objective of the attacker is to infer whether a given image x belongs to the pre-training dataset of the transformer encoder of a ViT (which we call the target encoder of the target ViT), in other words, to infer whether x is a member or a non-member. In this paper, we assume that the attacker has white-box access to the encoder of the target ViT, with full information of (a) the distribution of images used for training, meaning that the attacker can set up a shadow dataset with the same distribution as the target dataset; (b) the architecture and the learned parameters of the encoder, meaning that the attacker can train a shadow model with the same architecture as the target encoder (without any information on the remaining parts of the target ViT, say, the ViT classifier); and (c) how the target model is trained, for example, using either supervised or self-supervised learning on a labeled or an unlabeled dataset, respectively. Note that such a threat model accurately captures the practical applications on Hugging Face, where the information of a ViT encoder is recorded as a model card.

3.2 RAMIA Design

Overview. Figure 3 presents our RAMIA method, which follows the shadow training technique proposed by Shokri et al. [47]. Recall that for a given image, we forward it to the target encoder and extract its rollout attention. Further recall that in Section 1, our experimental observation in Figure 1 indicates that the rollout attention maps are more susceptible to the influence of added noise for a member of the target ViT encoder than for a non-member. Therefore, inspired by this observation, our RAMIA classifies an input image as a member when the rollout attention maps generated by the target encoder are significantly different for an image x and its noise-added counterpart x̃. Specifically, RAMIA follows three steps. (a) Shadow encoder pre-training: we partition the attacker's shadow dataset, which has the same distribution as the target dataset, into two disjoint subsets, named shadow members and shadow non-members, respectively; the attacker then pre-trains a shadow encoder using the shadow members. (b) Feature vector construction: we construct a feature vector for each image to be inferred using the similarity scores between the rollout attention maps of the original input image and those of some of its neighbor images obtained by adding independent random noise. (c) Membership inference: we apply two common attack methods from the literature to show the effectiveness of RAMIA, namely a binary-classifier-based MIA (RAMIA-Classifier) and a threshold-based MIA (RAMIA-Threshold). Specifically, RAMIA-Classifier trains a binary inference classifier to predict member/non-member for an input based on the feature vectors obtained in (b), while RAMIA-Threshold directly determines a proper threshold θ: only if the mean value of the components of the feature vector is smaller than θ is the input inferred as a member, and vice versa. Algorithm 1 formally describes our RAMIA, which we explain in detail in the rest of this subsection.

(a) Shadow Encoder Pre-training. Initially, the attacker splits a shadow dataset D_s, which is assumed to be independently and identically distributed to the target dataset D_t, into two disjoint subsets of shadow members D_s^mem and non-members D_s^non-mem. D_s^mem is then used to train a shadow ViT encoder E_s, which has an identical architecture to the target ViT encoder E_t.

(b) Feature Vector Construction. For each input image x in the shadow dataset D_s, we compute its rollout attention, denoted as RA(E_s, x), through the shadow encoder E_s. Subsequently, we construct n neighbor images of x, denoted as x̃_1, x̃_2, ..., x̃_n, by adding independent Gaussian noise as stated in Appendix A to x, and compute each RA(E_s, x̃_i) for all i ∈ [n]. Next, a cosine similarity, denoted as sim(x, x̃_i), between RA(E_s, x) and RA(E_s, x̃_i) is computed for each neighbor image x̃_i. In total, the n neighbor images produce a similarity feature vector V^s(x) for input x, where V^s(x) = [sim(x, x̃_1), sim(x, x̃_2), ..., sim(x, x̃_n)].

(c) Membership Inference. We adopt both binary-classifier-based and threshold-based techniques for the attack.

• RAMIA-Classifier: we train the RAMIA-Classifier using the feature vectors V^s of the shadow dataset as inputs, with a binary label indicating whether they are shadow members or non-members. During the inference process, the inferred image is forwarded through the target encoder E_t and produces a feature vector V^t. The inference classifier takes V^t as input and predicts the membership of the inferred image.

• RAMIA-Threshold: the key question is how to choose a proper threshold θ. We choose the θ that maximizes the accuracy of predictions over all images in D_s. Subsequently, for an inferred image x, we forward it through the target encoder E_t and compute the corresponding feature vector V^t(x). x is inferred as a member when the mean value of the components of V^t(x) is smaller than θ, that is, when $\|V^t(x)\|_1 / |V^t(x)| < \theta$, where ||·||_1 denotes the ℓ1-norm and |·| denotes the number of components of the vector.
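To make steps (b) and (c) concrete, the sketch below constructs the similarity feature vector and applies the threshold rule. It is an illustrative sketch: rollout_attention is a hypothetical helper (e.g., built on the rollout sketch of Section 2.1), and the fixed noise scale merely stands in for the Gaussian noise parameters of Appendix A.

    # Sketch of RAMIA feature construction and the threshold decision.
    import torch
    import torch.nn.functional as F

    def feature_vector(encoder, x, n_neighbors=8, noise_std=0.05):
        ra = rollout_attention(encoder, x).flatten()
        sims = []
        for _ in range(n_neighbors):
            x_noisy = x + noise_std * torch.randn_like(x)            # neighbor image
            ra_noisy = rollout_attention(encoder, x_noisy).flatten()
            sims.append(F.cosine_similarity(ra, ra_noisy, dim=0).item())
        return torch.tensor(sims)                                    # V(x)

    def ramia_threshold_decision(v, theta):
        # Member iff the mean component of V(x) falls below theta, i.e. ||V||_1 / |V| < theta.
        return (v.abs().sum() / v.numel()).item() < theta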


[Figure 3: Overview of the rollout-attention-based membership inference attack (RAMIA). The pipeline has three stages: shadow encoder pre-training on the shadow dataset (members and non-members), feature vector construction from rollout-attention similarities of noisy neighbors, and membership inference via either a classifier or a threshold θ.]

Algorithm 1: RAMIA
Require: Target encoder E_t; shadow encoder E_s; shadow datasets D_s^mem and D_s^non-mem; image X to be inferred.
Ensure: Indicator I: True (member); False (non-member).
 1: for each x in D_s^mem ∪ D_s^non-mem do
 2:   RA(x) ← AttentionRollout(x, E_s)
 3:   for each i in 1, ..., n do
 4:     x̃_i ← AddNoise(x)
 5:     RA(x̃_i) ← AttentionRollout(x̃_i, E_s)
 6:     V^s(x)[i] ← Similarity(RA(x), RA(x̃_i))
 7:   end for
 8: end for
 9: (a) RAMIA-Classifier:
10:   Classifier ← Train(V^s)
11:   I ← Classifier(V^t(X))
12: (b) RAMIA-Threshold:
13:   θ ← GetBestThreshold(V^s)
14:   if ||V^t(X)||_1 / |V^t(X)| < θ then
15:     I ← True
16:   else
17:     I ← False
18:   end if
19: return I

Table 1: Dataset details, including the number of images, number of class labels, and image size.

Dataset           | Size   | Categories | Image Size
CIFAR10 [32]      | 60000  | 10         | 32 × 32 × 3
CIFAR100 [32]     | 60000  | 100        | 32 × 32 × 3
ImageNet100 [17]  | 130000 | 100        | 224 × 224 × 3
ISIC2018 [15, 54] | 12180  | 7          | 224 × 224 × 3

3.3 RAMIA Evaluation: Experiments Setup

Model. We evaluate the effectiveness of our RAMIA using a ViT encoder that comprises 12 transformer blocks with 6 attention heads each. Unless stated otherwise, we default to using the average (Mean) over the heads when computing the rollout attention for multiple heads, as introduced in Section 2.1, but we compare it with the other two methods, Max and Min, in Section 3.4.

Datasets. We consider four datasets in our experiments for diversity, as shown in Table 1. CIFAR10, CIFAR100, and ImageNet100 are standard image classification tasks in CV. ISIC2018 contains dermoscopy images used for disease classification.

Baselines. There are seven baseline attacks (a-g) in our experiments. First, to show the effectiveness of using rollout attention as a feature instead of a single attention map, we adapt our RAMIA to an attention-based MIA, which simply uses the attention map of the last transformer block for feature vector construction and keeps the other details the same as RAMIA. We call these baselines (a) AMIA-Classifier and (b) AMIA-Threshold, respectively. We next compare our RAMIA to (c) EncoderMI [35] directly, which also targets attacking encoders. Note that EncoderMI uses the similarity of the encoder's output as a feature vector in CNN models (ResNet18); in the encoder of a ViT, we compute the similarity of the class token, which serves as the input for the decoder of the ViT, as a feature vector for training the inference classifier. Besides, to evaluate the effectiveness of our idea of using the cosine similarity between the rollout attention of an input image and its neighbor images as a feature vector, we compare RAMIA to an encoder attack that uses the rollout attention directly as a feature vector for training the inference classifier. We call it (d) Baseline-Classifier. In particular, Baseline-Classifier differs from RAMIA in the feature vector construction: instead of using V^s(x) to train a classifier, it trains the classifier using the rollout attention maps RA(x) of each image x through the shadow encoder as input. In addition, other MIA techniques in the literature utilize the prediction vectors of complete models for their attacks, so direct comparisons are not straightforward. However, we adapt or extend these prevailing MIA techniques to encoder attacks. In particular, we complete the target encoder into a full ViT by adding a randomly initialized classifier and fine-tuning the ViT on the shadow dataset; the pre-trained target encoder is kept frozen throughout fine-tuning. We call such baselines FullViT. By doing so, we can obtain a prediction vector P(y|x), the loss L(y, P(y|x)), and the gradient of the loss with respect to the parameters ∂L/∂θ for each input image x. FullViT trains a binary classifier using each of the three metrics as a feature vector for members and non-members, named (e) FullViT-Prediction, (f) FullViT-LIRA [7], which uses a principled likelihood ratio test with Gaussian likelihood estimates and per-example difficulty scores, and (g) FullViT-Gradient, respectively.
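For completeness, the following sketch corresponds to GetBestThreshold in Algorithm 1 (line 13): it sweeps candidate thresholds over the shadow images' mean similarities and keeps the one with the highest shadow accuracy. The input arrays are assumed to hold the per-image means of V^s(x) for shadow members and non-members; this is an illustrative sketch, not the paper's exact procedure.

    # Threshold sweep used by RAMIA-Threshold (sketch).
    import numpy as np

    def get_best_threshold(member_means, nonmember_means):
        scores = np.concatenate([member_means, nonmember_means])
        labels = np.concatenate([np.ones_like(member_means), np.zeros_like(nonmember_means)])
        best_theta, best_acc = None, 0.0
        for theta in np.unique(scores):
            preds = (scores < theta).astype(float)   # below threshold => predicted member
            acc = (preds == labels).mean()
            if acc > best_acc:
                best_theta, best_acc = theta, acc
        return best_theta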


Table 2: Average accuracy, precision, and recall (%) of our methods for the target encoder pre-trained on four datasets (CIFAR10, CIFAR100, ImageNet100, and ISIC2018). C is for classifier, T is for threshold, P is for prediction, L is for LIRA, and G is for gradient. In this experiment, the multi-head fusion method is average (Mean). Each cell is Accuracy / Precision / Recall.

Attack method | CIFAR10               | CIFAR100              | ImageNet100           | ISIC2018
RAMIA-T       | 88.43 / 86.25 / 91.43 | 87.94 / 85.08 / 88.61 | 91.89 / 89.86 / 94.44 | 92.33 / 87.64 / 94.43
RAMIA-C       | 91.42 / 93.83 / 89.52 | 90.24 / 92.44 / 86.98 | 91.76 / 95.96 / 87.18 | 91.84 / 92.23 / 89.98
AMIA-T        | 75.16 / 69.33 / 89.49 | 74.90 / 69.34 / 77.25 | 83.96 / 77.86 / 84.43 | 79.30 / 67.22 / 81.91
AMIA-C        | 75.77 / 62.97 / 82.35 | 80.91 / 88.54 / 78.90 | 84.36 / 87.34 / 80.73 | 80.94 / 77.19 / 84.50
EncoderMI     | 70.37 / 71.84 / 67.73 | 83.89 / 89.34 / 65.50 | 84.11 / 90.33 / 81.50 | 82.87 / 84.61 / 81.46
Baseline-C    | 63.86 / 68.05 / 62.79 | 52.63 / 31.51 / 54.55 | 55.22 / 50.40 / 55.78 | 52.31 / 50.81 / 53.70
FullViT-P     | 54.05 / 54.09 / 53.58 | 50.55 / 49.76 / 73.19 | 51.02 / 49.88 / 65.67 | 51.28 / 50.13 / 51.97
FullViT-L     | 55.29 / 52.21 / 58.54 | 53.06 / 50.06 / 55.92 | 51.78 / 50.64 / 53.65 | 53.36 / 54.47 / 51.61
FullViT-G     | 51.33 / 54.17 / 57.44 | 50.66 / 52.43 / 51.52 | 50.28 / 50.12 / 52.99 | 51.85 / 50.70 / 52.88

3.4 RAMIA Evaluation: Experimental Results

Effectiveness of RAMIA. Table 2 shows the attack performance of our RAMIAs and the seven comparison methods (a) to (g) on four different datasets: CIFAR10, CIFAR100, ImageNet100, and ISIC2018. The key observations are:

• The FullViT attack methods behave as random guessing, since their accuracies are close to 50%. The results are not surprising, because the classifier is trained on the shadow dataset, whose output features fail to capture whether the encoder is overfitted to images in the target dataset.
• EncoderMI, as well as the Baseline-Classifier, is also effective, but behaves worse than our RAMIA, which validates the effectiveness of using the rollout attention instead of the output of the encoder (the class token). Readers may wonder: EncoderMI assumes black-box access to the encoder, which is stricter than RAMIA, so are the results comparable? Existing literature [39] reports and validates that a white-box attack may not be easier than a black-box attack, as one might imagine. Besides, in practical scenarios, the encoders of ViTs are released on websites such as Hugging Face, so a white-box assumption is reasonable.
• The attention-based MIAs (AMIA-Threshold and AMIA-Classifier) behave worse than our RAMIAs, yet beat EncoderMI in some cases. The reason is that, compared to the attention of the last transformer block, rollout attention captures cumulative attention through the whole forward pass, making it more sensitive to noise.

Besides, Figure 4 presents the ROC curves and AUC values of the different attacks, indicating that our RAMIA-Threshold significantly outperforms the other attacks on all four datasets.

The impact of attention rollout methods. Recall that to handle multiple attention heads, the attention rollout technique aggregates the RA matrix of each head by computing the maximum (Max), minimum (Min), or average (Mean) of them; all three computations are entry-wise. Figure 5 shows their impact on inference accuracy across the four datasets. Among the three methods, the highest accuracy is achieved by Max on ImageNet100, because Max may enhance the disparity of feature maps between members and non-members. On the contrary, for a similar reason, Min behaves the worst, because vanishing values in the RA produce indistinguishable features. As a compromise, Mean gives relatively stable accuracy across the four datasets.

The impact of model structures. Two important structural parameters of a ViT encoder are the number of encoder blocks and the number of attention heads in each block. We examine their impact by fixing one and changing the other. Figures 6(a) and 6(b) show the impact of encoder blocks and attention heads, respectively. Our main findings are:

• For different numbers of encoder blocks, accuracy does not seem to vary much, meaning that more blocks do not weaken the performance of our RAMIA. That is precisely why we use the RA maps, which aggregate the attention information of every block, for feature construction.
• We notice a remarkable correlation between the number of attention heads and the inference accuracy of RAMIA. In particular, as the number of attention heads increases, accuracy first rises and then falls. More heads do provide more information for constructing features; however, when there are too many, the aggregation policy may fail to capture the most salient features from the RA maps. That is perhaps why the ViTs used in practice (with typically 6 heads) do not have too many attention heads in their encoders.

Other impacts. We also evaluate other factors that may influence the performance of RAMIA, including the number of neighbor images (denoted as n) with independent noise added, and different choices of the similarity metric for RA maps, e.g., the Pearson correlation coefficient (PCC). Details are reported in Appendix C. Our main findings are: (1) in general, a larger n induces a higher attack accuracy; (2) both the cosine similarity and PCC are effective in RAMIAs.
[Figure 4: The ROC curves (true positive rate vs. false positive rate) of the nine attack methods on four datasets: (a) CIFAR10, (b) CIFAR100, (c) ImageNet100, (d) ISIC2018. AUC values per dataset: RAMIA-T 0.95 / 0.89 / 0.95 / 0.94, RAMIA-C 0.93 / 0.92 / 0.94 / 0.91, AMIA-T 0.82 / 0.67 / 0.78 / 0.76, AMIA-C 0.80 / 0.70 / 0.82 / 0.78, EncoderMI 0.77 / 0.73 / 0.85 / 0.84, Baseline-C 0.57 / 0.55 / 0.57 / 0.58, FullViT-P 0.51 / 0.52 / 0.51 / 0.53, FullViT-L 0.53 / 0.53 / 0.51 / 0.52, FullViT-G 0.52 / 0.52 / 0.53 / 0.51.]

[Figure 5: The impact of different attention rollout methods (Max, Min, and Mean) on the accuracy (%) of our RAMIA on CIFAR10, CIFAR100, ImageNet100, and ISIC2018, based on either (a) a threshold (RAMIA-T) or (b) a classifier (RAMIA-C).]

Table 3: Attacking results (%) for Google's ViT-base.

Method  | Dataset     | Accuracy | Precision | Recall
RAMIA-T | CIFAR10     | 78.56    | 62.21     | 90.86
RAMIA-T | CIFAR100    | 80.87    | 66.04     | 87.73
RAMIA-T | ImageNet100 | 81.35    | 75.91     | 84.36
RAMIA-T | ISIC2018    | 77.21    | 77.91     | 78.36
RAMIA-C | CIFAR10     | 77.57    | 91.39     | 71.60
RAMIA-C | CIFAR100    | 82.47    | 67.66     | 93.45
RAMIA-C | ImageNet100 | 85.87    | 83.78     | 88.97
RAMIA-C | ISIC2018    | 77.95    | 70.91     | 89.36
4 Mosaic MixUp Training against RAMIA
(a) (b) This section introduces our defense method against RAMIA in this
85
Attack Accuracy(%)

section, which we call Mosaic MixUp Training (MMUT) for ViTs.


80
MMUT is a novel unified framework for pre-training ViTs from
75
scratch. Intuitively, MMUT enhances the robustness of the model to
70
image noise by integrating private datasets with public datasets in a
65 RAMIA-T RAMIA-T
RAMIA-C RAMIA-C patch-level manner. Concurrently, the positional embeddings (PEs)
60 8 10 12 14 16 4 6 8 10 12 of the corresponding replaced patches, through a shared training
Number of Encoder Blocks Number of Heads in Each Block
parameter scheme in forwarding this image, serve a dual purpose: it
differentiates between private training data and integrated patches,
Figure 6: The impact of RAMIA on different structural mod-
and enhances the model’s resilience against inference attacks. This
els: (a) set the number of encoder blocks as 8, 10, 12, 14, 16,
approach not only fortifies the model against adversarial RAMIAs
fixing the number of heads as 6; (b) set the number of heads
but also ensures that the prediction accuracy does not significantly
in each block as 4, 6, 8, 12, fixing the number of blocks as 12.
diminish (or even increases). Before a thorough explanation of
MMUT, we first introduce an intuitive experiment on the relation-
the power of RAMIA, we conduct a RAMIA against the complete ship between PEs and rollout attention (RA).
encoder with pre-trained parameters released by Google, named
ViT-base. ViT-base is pre-trained on ImageNet-21k (14 million im- 4.1 An Intuitive Experiment
ages, 21,843 classes) at resolution 224x224. We use ImageNet100 To see how PEs affect RA maps for images in the training dataset
as the shadow dataset, which is i.i.d. to ImageNet21k. In addition, (members), we compare two different ways to train a ViT from
we also experiment on the non-i.i.d. shadow dataset to capture the scratch using a member dataset: (a) updating parameters for each
scenarios where the target dataset is not released to the public. image patch separately as those standard training methods in ViT
Table 3 shows the corresponding attack accuracy, precision, and literature; (b) fixing the PE for all patches to identical learnable
recall. Results show the effectiveness of RAMIA, although accu- parameters. Compared to (a), (b) nullifies the spatial positional
racies are slightly lower than those presented in Table 2. This is information typically conveyed by PEs, which is always viewed to
because ImageNet100 only contains images of 100 classes, while be crucial in ViTs’ success. We visualize images used for training
ImageNet21k is considerably larger. We also point out that even and their respective RA maps in the form of heat maps. We refer to
if pre-trained on non-i.i.d. shadow datasets, our RAMIAs are still Appendix B for a detailed construction of heat maps. Figure 7(b)
effective, with the lowest accuracy of 77.21% on ISIC2018. and 7(c) provide heat maps using the above two training methods,


[Figure 7: Rollout attention visualized as a heat map. An input image (a) is forwarded into a pre-trained ViT encoder, and (b) is its corresponding RA heat map; blue marks vanishing entries in the RA matrix, yellow medium, and red large. (c) visualizes the RA generated by feeding (a) into a ViT encoder trained with identical PEs. (d) is obtained by adding random noise to (a). (e) and (f) are generated in the same way as (b) and (c), respectively, but from the noisy input.]

[Figure 8: Overview of Mosaic MixUp Training (MMUT). A private (original) image is mixed with a public image at the patch level (Image MixUp); after the linear projection of flattened patches, a class token is prepended and positional embeddings are added, where the positions of replaced patches receive a shared learnable mosaic embedding (Position Mosaic) before the sequence enters the transformer encoder.]
Algorithm 2 formally describes how MMUT works. In the be-
Besides, we add a small amount of noise to the input image so that the heat maps become Figures 7(e) and 7(f). The key observation is that the heat maps change much more dramatically from Figure 7(b) to 7(e) than from Figure 7(c) to 7(f), meaning that the PE updating policy (b) produces RA maps that are more robust against noise added to the input images. Such a strong connection between PEs and RA inspires our MMUT.

4.2 MMUT Design

Overview. The design principle of our defense covers two aspects, privacy preservation and performance guarantee for ViTs. On the one hand, the defense should significantly reduce the accuracy of MIAs, especially the RAMIA proposed in Section 3; on the other hand, since the target ViT is pre-trained, and its encoder will be released online and applied to downstream tasks, for example a classification task, a high accuracy of the target ViT is also crucial. Our MMUT is based on the key idea of RAMIA, which uses the different behaviors of members and non-members in rollout attention before and after noise addition as the attack criterion, and on the observation in Section 4.1 that PEs may help weaken the sensitivity of rollout attention to noise. Figure 8 provides an overview. To illustrate our method, recall that ViTs divide an input image into a certain number of patches, and each patch embedding is subsequently added to a PE. In general, MMUT first replaces an α fraction of the patches of a given private training image with patches of an image from a public image dataset, for example, ImageNet-21k. We call the replaced patches mosaic patches. In the forward step, we replace all PEs corresponding to mosaic patches with an extra learnable mosaic embedding. By doing so, in the backward step, the loss due to this image only updates the non-replaced PEs and the mosaic embedding (instead of the replaced PEs).

Algorithm 2 formally describes how MMUT works. In the beginning, MMUT prepares a training dataset, denoted as D, and a public dataset, denoted as D_pub. When an image x ∈ D is forwarded to the ViT, MMUT mosaics x into image patches x̂^Pat as follows: (a) divide x into x^Pat with N image patches; (b) randomly sample an image y ∈ D_pub, and divide y into y^Pat with N image patches⁴; (c) randomly pick an α fraction of patches in x^Pat and replace them with the patches of y^Pat at the same positions. In particular, MMUT defines an indicator vector I ∈ {0, 1}^N, initialized as a zero vector; an α fraction of the components of I are randomly set to 1, and MMUT then replaces x_i^Pat by y_i^Pat for all i with I_i = 1 to get x̂^Pat; (d) when forwarding x̂^Pat, replace the PEs corresponding to all mosaic patches (i.e., all PE_i with I_i = 1) with a learnable mosaic embedding ω; (e) in the backward step, the loss due to x̂^Pat is computed to update the PE_i with I_i = 0, ω, and the other learnable parameters; the PE_i with I_i = 1 are not updated in this round; (f) train the ViT on the x̂^Pat until convergence. MMUT is thoroughly evaluated in Section 4.4.

⁴ We may, without loss of generality, assume x and y have the same size; otherwise, we can resize all images in D_pub at the beginning.

Remarks on MMUT. We highlight several details of the above training process.
• A WarmUp before MMUT proceeds can be greatly helpful in speeding up model convergence. WarmUp means training the ViT from scratch in a standard way on the private dataset for several epochs.
• The public dataset used in MMUT can be either i.i.d. or non-i.i.d. to the private dataset, and can even be random noise.
• Although MMUT borrows the idea of mixing up images for data augmentation, our novel mosaic embeddings are more crucial to MMUT's success than the data augmentation techniques themselves.


We set up experiments in Section 4.4 to justify the latter two arguments above.

Algorithm 2: MMUT
Require: Training dataset D; public dataset D_pub; mosaic ratio α; learnable mosaic embedding ω; positional embeddings PE; parameters other than PEs θ.
Ensure: Trained θ* and PE*.
1: for each x in D do
2:   I ← InitializeIndicator()
3:   Ĩ ← RandomFlip(I, α)
4:   P̃E[Ĩ] ← ω;  P̃E[Ī] ← PE[Ī]   (Ī denotes the complement of Ĩ)
5:   y ← RandomSelect(D_pub)
6:   x[Ĩ] ← y[Ĩ]
7:   θ*, PE*, ω ← GradientDescent(x, P̃E)
8: end for
9: return θ*, PE*
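The following sketch illustrates one MMUT forward step (Algorithm 2, steps (a)-(d)) at the patch-embedding level; tensor shapes and function names are illustrative assumptions, not the paper's implementation. Step (e) is implicit: because the replaced PEs never enter the forward pass, they receive no gradient, while the mosaic embedding and the non-replaced PEs do.

    # Sketch of one MMUT mosaic-mixup step over patch embeddings.
    import torch

    def mosaic_mixup(x_patches, y_patches, pos_embed, mosaic_embed, alpha=0.4):
        # x_patches, y_patches: (N, dim) patch embeddings of a private and a public image.
        n = x_patches.size(0)
        k = int(alpha * n)
        idx = torch.randperm(n)[:k]             # indicator I: positions to mosaic
        mask = torch.zeros(n, dtype=torch.bool)
        mask[idx] = True

        mixed = x_patches.clone()
        mixed[mask] = y_patches[mask]           # patch-level MixUp with the public image

        pe = pos_embed.clone()
        pe[mask] = mosaic_embed                 # mosaic positions share one learnable embedding
        return mixed + pe, mask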

4.3 MMUT Evaluation: Experiments Setup

Models and Datasets. The same ViT structures and datasets (CIFAR10, CIFAR100, ImageNet100, and ISIC2018) as in Section 3.4 are used in the experiments. The defenses are evaluated against RAMIAs. For the RAMIAs, we set the number of image neighbors to n = 8 and use the cosine similarity and the Mean fusion of the RAs over multiple heads.

Baselines. We compare to the four baseline defense methods (a-d) mentioned in Section 2.3.

• (a) Label Smoothing [51]. The key idea is to soften the one-hot encoded target vector into a smooth one. In particular, we reduce the entry corresponding to the labeled class from 1 to 1 − ε, where ε is a smoothing hyper-parameter; ε is then distributed equally across all non-target classes. Label smoothing diminishes model overconfidence by preventing a 100% probability assignment to any single class, thereby promoting generalization and reducing overfitting to noisy or incorrect labels in the training data.
• (b) Differentially Private Stochastic Gradient Descent (DP-SGD) [1]. DP-SGD enforces privacy protection by altering the optimization routine. It involves two main actions: (i) clipping the gradients to ensure that their ℓ2-norm is capped at a threshold value C during each training iteration; (ii) adding random noise to the gradients before the update step is applied. We adjust the noise scale β as a hyper-parameter while keeping the clipping threshold C fixed, which is consistent with previous work [13]. We adopt a small noise scale to keep the target model's utility at a decent level, which leads to meaninglessly large ε values.
• (c) RelaxLoss [10]. RelaxLoss defends against MIAs by alternating gradient ascent and descent. It first adjusts the mean of the target loss, setting a target loss mean value γ that is more easily achievable by non-member data, thereby decreasing the distinguishability between member and non-member data and reducing the success rate of attacks. Second, to maintain the utility of the model, RelaxLoss does not maximize the predicted posterior score of the true class to 1; instead, it flattens the posterior scores of the non-true classes, ensuring a significant margin between the true class score and the others, thus preventing incorrect predictions, especially for challenging samples near the decision boundary.
• (d) Adversarial Regularization (Adv-Reg) [40]. Adv-Reg trains target models by blending the traditional cross-entropy loss with an adversarial loss. It minimizes a composite loss, a weighted sum of the cross-entropy and adversarial losses; the weight of the adversarial loss δ is adjusted throughout training to balance the contributions of both losses. The adversarial loss in Adv-Reg is generated by surrogate attack models, which are specifically trained on two types of data: the target model's training dataset and an additional, separate hold-out dataset. This training approach ensures that these models are well prepared to effectively challenge the target model, thus improving its robustness and overall performance.

Note that each of the aforementioned defense methods is governed by a hyper-parameter: the smoothing parameter ε in Label Smoothing, the noise scale β in DP-SGD, the target loss γ in RelaxLoss, and the adversarial loss weight δ in Adv-Reg. Our MMUT is also governed by a hyper-parameter, the mosaic ratio α, i.e., the fraction of image patches that are replaced.

Attacker's information for defenses. We consider two typical assumptions on the attacker's information for defense evaluation: non-adaptive and adaptive. In the non-adaptive case, the attacker lacks knowledge of the specific defense method employed by the model, which represents a more lenient setting for the defense. Conversely, under the adaptive assumption, the attacker possesses full awareness of both the defense method employed by the model and the specific parameters involved; in this case, defending against such attacks becomes even more challenging.

4.4 MMUT Evaluation: Experiments Result

Effectiveness of MMUT. We compare MMUT to the four aforementioned baselines, evaluating the trade-off between prediction accuracy and defense effectiveness. Figures 9 (against RAMIA-Threshold) and 10 (against RAMIA-Classifier) show these trade-offs for the different defense methods. We consider both adaptive and non-adaptive attackers. For each defense method, we report the best four choices of its governing hyper-parameter. A lower attack accuracy with a higher prediction accuracy (closer to the bottom right) is better. The key observations are as follows:

• In both the adaptive and non-adaptive settings, against both RAMIAs, the MMUT (blue square) points are roughly distributed closer to the bottom right, showing its effectiveness in preserving the trade-off.
• In the stricter adaptive setting, the four baselines show no significant effect. In the weaker non-adaptive setting, the baselines other than DP-SGD have only a minor effect in defending against RAMIAs, with the corresponding attack accuracies dropping by no more than 10%. The only exception is DP-SGD, which behaves better than the other baselines in defense; however, it causes a dramatic drop in prediction accuracy.


[Figure 9: Comparison of the effectiveness of defenses against RAMIA-Threshold. Eight panels plot attack accuracy (%) vs. prediction accuracy (%) on CIFAR10, CIFAR100, ImageNet100, and ISIC2018, in the non-adaptive (top row) and adaptive (bottom row) settings, for Basic (no defense), MMUT, Label Smoothing, DP-SGD, RelaxLoss, and Adv-Reg.]

[Figure 10: Comparison of the effectiveness of defenses against RAMIA-Classifier, with the same layout as Figure 9: attack accuracy (%) vs. prediction accuracy (%) on the four datasets under non-adaptive and adaptive attackers.]

4.4 MMUT Evaluation: Experimental Results
Effectiveness of MMUT. We compare MMUT to the four baseline models mentioned above, evaluating the trade-off between prediction accuracy and defense effectiveness. Figures 9 (against RAMIA-Threshold) and 10 (against RAMIA-Classifier) show these trade-offs under the different defense methods. We consider both adaptive and non-adaptive attackers. For each defense method, we report the best four choices of its governing hyper-parameters. A lower attack accuracy combined with a higher prediction accuracy (closer to the bottom right) is better. Key observations are as follows:
• In both adaptive and non-adaptive settings against both RAMIAs, the MMUT (blue square) points are roughly distributed closer to the bottom right, showing its effectiveness in preserving the trade-off.
• In the stricter adaptive setting, the four baselines show no significant effect. In the weaker non-adaptive setting, baselines other than DP-SGD have only a minor effect in defending against RAMIAs: the corresponding attack accuracies drop by no more than 10%. The only exception is DP-SGD, which defends better than the other baselines but causes a dramatic drop in prediction accuracy.
• Remarkably, on CIFAR10 (the smallest dataset), our MMUT even improves the prediction accuracy while preserving a strong defense. The reason is that the MixUp component enhances the generalization ability of a ViT model.

The impact of mosaic ratio. The mosaic ratio 𝛼 is a critical parameter that governs the privacy-performance trade-off: a larger 𝛼 replaces more image patches with public data and achieves higher security at the cost of lower prediction accuracy, and vice versa. We evaluate MMUT under different values of 𝛼. Figure 11 shows the trends in attack accuracy for varying mosaic ratios in the non-adaptive and adaptive scenarios, and Figure 12 shows how the prediction accuracy changes with the mosaic ratio. Our defense method shows commendable performance across most attack scenarios.

[Figure 11: Performance of MMUT in defending against adaptive and non-adaptive RAMIA-Threshold on three datasets. A is for adaptive and N is for non-adaptive. Mosaic ratio 𝛼 varies from 0 to 0.969.]
[Figure 12: Performance of MMUT in prediction accuracy on three datasets. Mosaic ratio 𝛼 varies from 0 to 0.969. The dotted lines show the prediction accuracy without any defense on the three datasets.]

The key observations are:
• As the mosaic proportion increases, the attack accuracy shows a downward trend in both settings. The trend becomes more explicit once 𝛼 exceeds roughly 0.5 for ImageNet100 and 0.4 for CIFAR10 and CIFAR100.
• For non-adaptive RAMIA, MMUT reduces the attack accuracy to a random-guessing level, close to 50%. In the adaptive setting, even if the attacker knows the specific defense method and its parameters, MMUT still reduces the attack accuracy remarkably.
• As 𝛼 increases, the prediction accuracy first rises and later drops. More specifically, MMUT can improve the prediction accuracy of the model by up to 4.39% on CIFAR10, 2.15% on CIFAR100, and 2.02% on ImageNet100. A good choice of 𝛼 is therefore crucial in designing MMUT.
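To illustrate how the mosaic ratio enters training, the following sketch mixes the patch embeddings of a private image with those of a public image at ratio 𝛼 and overwrites the positional embeddings of the replaced patches with a shared learnable mosaic embedding. It is a simplified, hypothetical rendering of the idea (the function name, the sampling scheme, and the shapes are ours, and label mixing plus the surrounding training loop are omitted), not the paper's Algorithm 2 verbatim.

```python
import torch

def mosaic_mixup(private_patches, public_patches, pos_embed, mosaic_embed, alpha=0.4):
    """Patch-level Mosaic MixUp sketch for one image.

    private_patches, public_patches: (N, D) patch embeddings of a private and
    a public image; pos_embed: (N, D) positional embeddings; mosaic_embed:
    (D,) shared learnable mosaic embedding. A fraction `alpha` of the patches
    is replaced by public patches, and their PEs are overwritten ("mosaicked")
    by the shared embedding.
    """
    n, _ = private_patches.shape
    k = int(round(alpha * n))                       # number of patches to replace
    idx = torch.randperm(n)[:k]                     # randomly chosen positions
    mixed = private_patches.clone()
    mixed[idx] = public_patches[idx]                # patch-level mix-up
    pes = pos_embed.clone()
    pes[idx] = mosaic_embed                         # mosaic the replaced PEs
    return mixed + pes                              # tokens fed to the encoder

# toy usage with ViT-Base-like shapes (196 patches, 768-dim embeddings)
priv = torch.randn(196, 768)
pub = torch.randn(196, 768)
pe = torch.randn(196, 768)
omega = torch.nn.Parameter(torch.randn(768) * 0.02)  # learnable mosaic embedding
tokens = mosaic_mixup(priv, pub, pe, omega, alpha=0.4)
```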
The impact of public data distribution. We examine whether the public dataset used in MMUT must be i.i.d. with the private dataset. To verify its impact, we set the private dataset to CINIC10 [16], which has the same classification categories and the same number of class labels as CIFAR10, meaning that their distributions are similar but not identical. We consider six different public datasets: CINIC10 (half for the target model and the rest for the shadow model), CIFAR10, CIFAR100, ImageNet10 (a subset of ImageNet100 with 10 classes selected), ImageNet100, and random Gaussian noise. Figure 14 reports our experimental results. Our first finding is that MMUT performs well even when replacing image patches with random noise. This is because the transformer-based model needs to learn robust attention allocations to distinguish the noise patches from the original data patches, which is exactly what we need to defend against RAMIA. Nevertheless, using other datasets for MixUp can be remarkably more effective. We observe that all datasets can effectively defend against RAMIAs. Much more interestingly, however, the best-performing public dataset is not CINIC10 itself but CIFAR10, whose distribution is similar but not identical to CINIC10. A possible reason is that such a dataset maximizes the generalization ability of the model, while i.i.d. data provides a weaker contribution. On the other hand, ImageNet100 does not behave as well as expected, meaning that public data that differs too much does not help confuse the feature vectors constructed from RA maps.

[Figure 14: Attack accuracy (%) against MMUT on five different public datasets. IN10, IN100 and RN denote ImageNet10, ImageNet100 and random noise, respectively. Dotted lines show the attack accuracy without any defense. Bars: RAMIA-T and RAMIA-C; x-axis: public data (CINIC10, CIFAR10, CIFAR100, IN10, IN100, RN).]

MMUT vs. data augmentation (DA). We argue that simply adopting DA techniques without PE mosaics has much less effect in defending against MIAs. We consider four well-known DA methods: MixUp, PixelMix [61], GridMask [12], and Flipping. MixUp here is identical to MMUT except that the positional embeddings of the mixed patches are not mosaicked. PixelMix linearly combines multiple training images at the pixel level by blending their features and labels. GridMask occludes image regions using a random grid-like mask. Table 4 shows the attack accuracies of RAMIA-Threshold under these techniques. MixUp, PixelMix, and GridMask have extremely limited effect in defending against RAMIAs, while the simple geometric transformation of flipping images helps slightly. These findings in fact echo our insights on using mosaic positional embeddings: flipping perturbs the patches of the original image as well as the corresponding PEs, allowing the model to focus less on positional information and to learn a dispersion of attention similar to Figure 7(c), leading to more robust behavior. Nevertheless, our MMUT is far more effective than merely using DA as a defense.

Table 4: Attack accuracy (%) of RAMIA-T against MMUT vs. several DA techniques.
Method      CIFAR10   CIFAR100   ImageNet100   ISIC2018
MMUT        55.63     61.23      63.99         62.73
MixUp       80.69     81.64      83.62         83.29
PixelMix    88.38     87.45      89.64         87.21
GridMask    86.13     86.98      84.04         85.1
Flipping    76.27     77.57      78.28         75.05

Similarity of RA under MMUT. Finally, as a complement and another corroboration of our results, we explore how the RA maps change when noise is added to an input image on CIFAR10. Similar to Figure 1, Figure 13 presents the distribution of cosine similarities after each defense is adopted; it shows that MMUT is more effective than the other methods in bridging the gap between members and non-members. Compared with the original model, MMUT significantly improves the RA similarity level, whereas DP-SGD does exactly the opposite.

[Figure 13: Histograms of the number of member vs. non-member images in CIFAR10 across different cosine similarity scores under defenses on CIFAR10. Panels: (a) MMUT, (b) Label Smoothing, (c) DP-SGD, (d) RelaxLoss, (e) Adv-Reg.]

5 Conclusion
This work presents the first comprehensive study of membership inference attacks and defenses against a powerful deep learning model, the vision transformer. We use the information provided by rollout attention maps to design a white-box attack against real-world ViT encoders. Based on an observation about the significance of positional embeddings, we design a unified framework for training ViTs as a defense method. Our defense achieves a promising privacy-performance trade-off. Possible future work includes: (a) extending our methods to language models; (b) attacking ViTs using other, more expressive features; and (c) designing a training-free defense method.

Acknowledgments
This work is supported by the National Natural Science Foundation of China (NSFC) (No. 62302183, No. 62372191, No. 62302187 and No. 62202197) and the Open Foundation of Key Laboratory of Cyberspace Security, Ministry of Education (No. KLCS20240401).

References
[1] Martín Abadi, Andy Chu, Ian J. Goodfellow, H. B. McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep Learning with Differential Privacy. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (2016).
[2] Samira Abnar and Willem Zuidema. 2020. Quantifying Attention Flow in Transformers. In Annual Meeting of the Association for Computational Linguistics.
[3] Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, and Alexei A. Efros. 2023. Sequential Modeling Enables Scalable Learning for Large Vision Models.
[4] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. ArXiv abs/2005.14165 (2020).
[5] Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Manning Wang. 2021. Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation. In ECCV Workshops.
[6] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In European Conference on Computer Vision. Springer, 213–229.
[7] Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, A. Terzis, and Florian Tramèr. 2022. Membership Inference Attacks From First Principles. 2022 IEEE Symposium on Security and Privacy (SP) (2022), 1897–1914.
[8] Nicholas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramèr, Borja Balle, Daphne Ippolito, and Eric Wallace. 2023. Extracting Training Data from Diffusion Models. ArXiv abs/2301.13188 (2023).
[9] Chun-Fu Chen, Rameswar Panda, and Quanfu Fan. 2021. RegionViT: Regional-to-Local Attention for Vision Transformers. ArXiv abs/2106.02689 (2021).
[10] Dingfan Chen, Ning Yu, and Mario Fritz. 2022. RelaxLoss: Defending membership inference attacks without losing utility. arXiv preprint arXiv:2207.05801 (2022).
[11] Dingfan Chen, Ning Yu, Yang Zhang, and Mario Fritz. 2019. GAN-Leaks: A Taxonomy of Membership Inference Attacks against Generative Models. Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security (2019).
[12] Pengguang Chen, Shu Liu, Hengshuang Zhao, and Jiaya Jia. 2020. GridMask data augmentation. arXiv preprint arXiv:2001.04086 (2020).
[13] Christopher A. Choquette-Choo, Florian Tramèr, Nicholas Carlini, and Nicolas Papernot. 2020. Label-Only Membership Inference Attacks. ArXiv abs/2007.14321 (2020).
[14] Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. 2021. Twins: Revisiting the Design of Spatial Attention in Vision Transformers. In Neural Information Processing Systems.
[15] Noel C. F. Codella, Veronica M Rotemberg, Philipp Tschandl, M. E. Celebi, Stephen W. Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Armando Marchetti, Harald Kittler, and Allan C. Halpern. 2019. Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC). ArXiv abs/1902.03368 (2019). https://api.semanticscholar.org/CorpusID:60440592
[16] Luke N Darlow, Elliot J Crowley, Antreas Antoniou, and Amos J Storkey. 2018. CINIC-10 is not ImageNet or CIFAR-10. arXiv preprint arXiv:1810.03505 (2018).
[17] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, K. Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition (2009), 248–255.
[18] Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo. 2021. CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021), 12114–12124.
[19] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations.
[20] Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, and Hervé Jégou. 2021. XCiT: Cross-Covariance Image Transformers. In Neural Information Processing Systems.
[21] Jiemin Fang, Lingxi Xie, Xinggang Wang, Xiaopeng Zhang, Wenyu Liu, and Qi Tian. 2021. MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021), 12053–12062.
[22] Jianyuan Guo, Zhiwei Hao, Chengcheng Wang, Yehui Tang, Han Wu, Han Hu, Kai Han, and Chang Xu. 2024. Data-efficient Large Vision Models through Sequential Autoregression. arXiv preprint arXiv:2402.04841 (2024).
[23] Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, et al. 2022. A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 1 (2022), 87–110.
[24] Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, and Yunhe Wang. 2021. Transformer in Transformer. In Neural Information Processing Systems.
[25] Jamie Hayes, Luca Melis, George Danezis, and Emiliano De Cristofaro. 2017. LOGAN: Membership Inference Attacks Against Generative Models. Proceedings on Privacy Enhancing Technologies 2019 (2017), 133–152.
[26] Benjamin Hilprecht, Martin Härterich, and Daniel Bernau. 2019. Monte Carlo and Reconstruction Membership Inference Attacks against Generative Models. Proceedings on Privacy Enhancing Technologies 2019 (2019), 232–249.
[27] Hongsheng Hu, Zoran A. Salcic, Lichao Sun, Gillian Dobbie, P. Yu, and Xuyun Zhang. 2021. Membership Inference Attacks on Machine Learning: A Survey. ACM Computing Surveys (CSUR) 54 (2021), 1–37.
[28] Zilong Huang, Youcheng Ben, Guozhong Luo, Pei Cheng, Gang Yu, and Bin Fu. 2021. Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer. ArXiv abs/2106.03650 (2021).
[29] Jinyuan Jia, Ahmed Salem, Michael Backes, Yang Zhang, and Neil Zhenqiang Gong. 2019. MemGuard: Defending against Black-Box Membership Inference Attacks via Adversarial Examples. Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security (2019).
[30] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. 2022. Transformers in vision: A survey. ACM Computing Surveys (CSUR) 54, 10s (2022), 1–41.
[31] Fei Kong, Jinhao Duan, Ruipeng Ma, Hengtao Shen, Xiao lan Zhu, Xiaoshuang Shi, and Kaidi Xu. 2023. An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization. ArXiv abs/2305.18355 (2023).
[32] Alex Krizhevsky. 2009. Learning Multiple Layers of Features from Tiny Images.
[33] Zheng Li and Yang Zhang. 2020. Membership Leakage in Label-Only Exposures. Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security (2020).
[34] Hezheng Lin, Xingyi Cheng, Xiangyu Wu, Fan Yang, Dong Shen, Zhongyuan Wang, Qing Song, and Wei Yuan. 2021. CAT: Cross Attention in Vision Transformer. 2022 IEEE International Conference on Multimedia and Expo (ICME) (2021), 1–6.
[35] Hongbin Liu, Jinyuan Jia, Wenjie Qu, and Neil Zhenqiang Gong. 2021. EncoderMI: Membership Inference against Pre-trained Encoders in Contrastive Learning. Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security (2021).
[36] Yiyong Liu, Zhengyu Zhao, Michael Backes, and Yang Zhang. 2022. Membership Inference Attacks by Exploiting Loss Trajectory. Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security (2022).
[37] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021), 9992–10002.
[38] Jiahao Lu, Xi Sheryl Zhang, Tianli Zhao, Xiangyu He, and Jian Cheng. 2021. APRIL: Finding the Achilles' Heel on Privacy for Vision Transformers. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021), 10041–10050.
[39] Milad Nasr, R. Shokri, and Amir Houmansadr. 2018. Comprehensive Privacy Analysis of Deep Learning: Passive and Active White-box Inference Attacks against Centralized and Federated Learning. 2019 IEEE Symposium on Security and Privacy (SP) (2018), 739–753.
[40] Milad Nasr, R. Shokri, and Amir Houmansadr. 2018. Machine Learning with Membership Privacy using Adversarial Regularization. Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (2018).
[41] Yan Pang, Tianhao Wang, Xu Kang, Mengdi Huai, and Yang Zhang. 2023. White-box Membership Inference Attacks against Diffusion Models. ArXiv abs/2308.06405 (2023).
[42] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In International Conference on Machine Learning.
[43] Ahmed Salem, Yang Zhang, Mathias Humbert, Pascal Berrang, Mario Fritz, and Michael Backes. 2019. ML-Leaks: Model and Data Independent Membership Inference Attacks and Defenses on Machine Learning Models. In Proceedings of the 26th Annual Network and Distributed System Security Symposium (NDSS).
[44] A. Salem, Yang Zhang, Mathias Humbert, Mario Fritz, and Michael Backes. 2018. ML-Leaks: Model and Data Independent Membership Inference Attacks and Defenses on Machine Learning Models. ArXiv abs/1806.01246 (2018).
[45] Fahad Shamshad, Salman Khan, Syed Waqas Zamir, Muhammad Haris Khan, Munawar Hayat, Fahad Shahbaz Khan, and Huazhu Fu. 2023. Transformers in medical imaging: A survey. Medical Image Analysis (2023).
[46] Virat Shejwalkar and Amir Houmansadr. 2021. Membership Privacy for Machine Learning Models Through Knowledge Transfer. In AAAI Conference on Artificial Intelligence.
[47] R. Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2016. Membership Inference Attacks Against Machine Learning Models. 2017 IEEE Symposium on Security and Privacy (SP) (2016), 3–18.
[48] Connor Shorten and Taghi M Khoshgoftaar. 2019. A survey on image data augmentation for deep learning. Journal of Big Data 6, 1 (2019), 1–48.
[49] Liwei Song and Prateek Mittal. 2020. Systematic Evaluation of Privacy Risks of Machine Learning Models. In USENIX Security Symposium.
[50] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15 (2014), 1929–1958.
[51] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2015. Rethinking the Inception Architecture for Computer Vision. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015), 2818–2826.
[52] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. 2020. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning.
[53] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. ArXiv abs/2302.13971 (2023).
[54] Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. 2018. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific Data 5 (2018). https://api.semanticscholar.org/CorpusID:263789934
[55] Jiahao Wang, Wenqi Shao, Mengzhao Chen, Chengyue Wu, Yong Liu, Kaipeng Zhang, Songyang Zhang, Kai Chen, and Ping Luo. 2024. Adapting LLaMA Decoder to Vision Transformer. arXiv preprint arXiv:2404.06773 (2024).
[56] Pichao Wang, Xue Wang, F. Wang, Ming Lin, Shuning Chang, Wen Xie, Hao Li, and Rong Jin. 2021. KVT: k-NN Attention for Boosting Vision Transformers. ArXiv abs/2106.00515 (2021).
[57] Wenxuan Wang, Jingyuan Huang, Chang Chen, Jiazhen Gu, Jianping Zhang, Weibin Wu, Pinjia He, and Michael R. Lyu. 2023. Validating Multimedia Content Moderation Software via Semantic Fusion. Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (2023).
[58] Jiayuan Ye, Aadyaa Maddi, Sasi Kumar Murakonda, and R. Shokri. 2021. Enhanced Membership Inference Attacks against Machine Learning Models. Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security (2021).
[59] Linwei Ye, Mrigank Rochan, Zhi Liu, and Yang Wang. 2019. Cross-Modal Self-Attention Network for Referring Image Segmentation. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019), 10494–10503.
[60] Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Francis E. H. Tay, Jiashi Feng, and Shuicheng Yan. 2021. Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021), 538–547.
[61] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. 2018. mixup: Beyond Empirical Risk Minimization. In International Conference on Learning Representations.
[62] Junxiang Zheng, Yongzhi Cao, and Hanpin Wang. 2021. Resisting membership inference attacks through knowledge distillation. Neurocomputing 452 (2021), 114–126.
[63] Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Qibin Hou, and Jiashi Feng. 2021. DeepViT: Towards Deeper Vision Transformer. ArXiv abs/2103.11886 (2021).
[64] Daquan Zhou, Yujun Shi, Bingyi Kang, Weihao Yu, Zihang Jiang, Yuan Li, Xiaojie Jin, Qibin Hou, and Jiashi Feng. 2021. Refiner: Refining Self-attention for Vision Transformers. ArXiv abs/2106.03714 (2021).
[65] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. 2021. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In International Conference on Learning Representations.

A Gaussian Noises
In this paper, noise is added to images as follows: for a given image, we linearly map [0, 255] ↦→ [0, 1] and add a Gaussian with 𝜇 = 0 and 𝜎 = 0.2 to each pixel value. Values smaller than 0 or larger than 1 are clipped to 0 or 1, respectively. Finally, we remap [0, 1] ↦→ [0, 255] with nearest rounding.

B Rollout Attention Visualization
Heat map construction in Figure 7. The RA map obtained is an (𝑁 + 1) × (𝑁 + 1) matrix, which represents the attention allocation of each of the 𝑁 + 1 patches (𝑁 image patches plus one class token (CT)) to itself and to the other patches. We only use the one-dimensional vector of size 𝑁 + 1 corresponding to the CT, which records the importance of each image patch; of these, only the 𝑁 components for image patches are kept, dropping the one that represents the CT's attention on itself. Subsequently, we rearrange the 𝑁 attention values into a matrix of size √𝑁 × √𝑁, where each value corresponds to a patch in the original image, and adopt interpolation to upsample this two-dimensional attention to a matrix of size (√𝑁 × 𝑝) × (√𝑁 × 𝑝) (𝑝 is the size of each patch), so that it has the same height and width as the original image; this yields a single-channel heat map matrix ℎ. We convert the single-channel matrix into a color image using JET color mapping to obtain the final heat map 𝐻, a tensor of size (√𝑁 × 𝑝) × (√𝑁 × 𝑝) × 3. In JET, low values may be mapped to blue, intermediate values to green, and high values to red. Finally, we normalize the heat map, overlay it with the original image, and rescale it back to the 0–255 range as displayed in Figure 7.
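Both procedures above are simple enough to sketch in a few lines. The snippet below is an illustrative NumPy/OpenCV rendering, not the authors' code: add_gaussian_noise follows Appendix A, and ct_rollout_heatmap follows Appendix B, assuming the class token occupies index 0 of the rollout attention matrix and using OpenCV's JET colormap with an equal-weight overlay as one possible choice of blending.

```python
import numpy as np
import cv2

def add_gaussian_noise(img_uint8, sigma=0.2):
    """Appendix A: map [0,255] to [0,1], add N(0, sigma) noise per pixel,
    clip to [0,1], and map back to [0,255] with nearest rounding."""
    x = img_uint8.astype(np.float32) / 255.0
    x = np.clip(x + np.random.normal(0.0, sigma, x.shape), 0.0, 1.0)
    return np.rint(x * 255.0).astype(np.uint8)

def ct_rollout_heatmap(rollout, image_bgr, patch_size=16):
    """Appendix B sketch: take the class-token row of an (N+1)x(N+1) rollout
    attention matrix, drop the CT-to-CT entry, reshape the N patch scores to
    sqrt(N) x sqrt(N), upsample to image resolution, JET-colorize, and overlay."""
    ct_row = rollout[0, 1:]                      # attention of the CT on the N patches
    side = int(round(np.sqrt(ct_row.shape[0])))
    h = ct_row.reshape(side, side)
    h = (h - h.min()) / (h.max() - h.min() + 1e-8)        # normalize to [0, 1]
    h = cv2.resize(h.astype(np.float32), (side * patch_size, side * patch_size),
                   interpolation=cv2.INTER_LINEAR)        # interpolate to image size
    color = cv2.applyColorMap(np.uint8(255 * h), cv2.COLORMAP_JET)
    return cv2.addWeighted(image_bgr, 0.5, color, 0.5, 0) # overlay, back in 0-255

# toy usage for a 224x224 image with 14x14 = 196 patches
img = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
ra = np.random.rand(197, 197)
overlay = ct_rollout_heatmap(ra, add_gaussian_noise(img))
```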
C Missing Empirical Results in Section 3
The impact of the number of neighbor images. Recall that we use 𝑛 neighbor images, generated by adding independent noise to an original image from the dataset, to construct the feature vector. Figure 15 shows the impact of 𝑛 on the attack accuracy on CIFAR10, CIFAR100, and ImageNet100. Both RAMIA-Threshold and RAMIA-Classifier are evaluated. For the threshold-based model, there is a roughly consistent trend that the attack accuracy increases as 𝑛 grows, which is not surprising because we compute an average value for every component in the feature vector before determining a proper threshold, and more neighbor images sharpen the disparity between members and non-members. However, for the classifier-based model, except for CIFAR10, the highest accuracy arises when 𝑛 = 8. A possible reason is that higher-dimensional feature vectors make the inference classifier overfit, reducing its generalization ability. These observations indicate that increasing the number of neighbor images indefinitely is unreliable: a larger 𝑛 may not improve accuracy but requires a longer computation time, so choosing a proper 𝑛 is crucial.

[Figure 15: The impact of the number of image neighbors 𝑛 on attack accuracy. RAMIA-Threshold and RAMIA-Classifier are examined when 𝑛 is set to 1, 2, 4, 8, 16 on CIFAR10, CIFAR100, and ImageNet100.]

The impact of similarity metrics. We compute a cosine similarity of RA maps as the feature vector in our RAMIA. Besides, we consider another similarity metric, the Pearson correlation coefficient (PCC), and evaluate its effect. Accuracy, precision, and recall of both RAMIA-Threshold and RAMIA-Classifier are presented in Table 5. Compared to the results in Table 2, there is no major difference in attack performance between the two metrics, providing further support for the effectiveness of our RAMIA.

Table 5: Attack performance (%) using the Pearson correlation coefficient for similarity computation.
Method    Dataset        Accuracy   Precision   Recall
RAMIA-T   CIFAR10        87.66      83.99       90.43
RAMIA-T   CIFAR100       88.33      80.49       93.45
RAMIA-T   ImageNet100    90.25      85.66       95.73
RAMIA-C   CIFAR10        89.69      90.87       87.51
RAMIA-C   CIFAR100       90.88      97.68       86.37
RAMIA-C   ImageNet100    88.63      93.35       83.77

D Missing Empirical Results in Section 4
The impact of mosaic embedding methods. In the process of position mosaic, in addition to using a global learnable mosaic embedding 𝜔 as presented in Algorithm 2, we propose two alternative solutions: MMUT-Avg and MMUT-Zero. Neither of them learns a mosaic embedding during training. Instead, MMUT-Avg uses the average value of the non-mosaicked PEs as the mosaic, while MMUT-Zero mosaics PEs with a zero matrix. Table 6 presents their performance in terms of both privacy preservation and prediction accuracy. Both MMUT-Avg and MMUT-Zero also behave well on the smallest CIFAR10, and MMUT-Zero even achieves effectiveness similar to MMUT; however, on the largest ImageNet100, MMUT with the learnable mosaic embedding demonstrates clear advantages over the other two methods.

Table 6: Prediction and attack accuracies (%) using different mosaic methods. T for threshold and C for classifier.
Method    Dataset        MMUT    MMUT-Avg   MMUT-Zero
RAMIA-T   CIFAR10        60.63   67.19      61.79
RAMIA-T   CIFAR100       55.23   79.4       56.51
RAMIA-T   ImageNet100    71.99   89.25      89.05
RAMIA-C   CIFAR10        57.67   77.46      60.38
RAMIA-C   CIFAR100       51.77   84.21      52.91
RAMIA-C   ImageNet100    66.18   94.24      91.08

Defense against semi-adaptive RAMIA. Besides the adaptive and non-adaptive assumptions on the attacker's information, we examine defenses under a semi-adaptive assumption, where the attacker is aware of the defense method but lacks knowledge about the specific parameters associated with it. In this case, the attacker can only guess such parameters. Figure 16 shows the attack accuracy of RAMIAs using different parameters in the target models versus the shadow models on CIFAR100. Note that blocks closer to the top left (closer to the non-adaptive setting) are darker, with attack accuracies approaching the random-guess level of 50%, whereas blocks closer to the diagonal (closer to the adaptive setting) are lighter. Such assumptions constrain the attacker more than the fully adaptive setting, and MMUT remarkably reduces the attack accuracy from 87.94% for RAMIA-Threshold and 90.24% for RAMIA-Classifier. In any case, the experimental results illustrate the effectiveness of our MMUT in defending against RAMIAs.

[Figure 16: The attack accuracy (%) of semi-adaptive attackers attempting shadow models with different parameters on CIFAR100. T is for threshold and C is for classifier. Panels: (a) defense against RAMIA-T and (b) defense against RAMIA-C; axes: target privacy parameters (0, 0.51, 0.61, 0.71, 0.82) vs. shadow privacy parameters.]
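Returning to the feature construction discussed in Appendix C, the closing sketch below illustrates how a per-sample feature vector could be formed from the RA maps of an image and its 𝑛 noisy neighbors, with either cosine similarity or the Pearson correlation coefficient as the metric. It is an illustrative rendering under our own assumptions about shapes and names; the actual attack pipeline (e.g., averaging the components for the threshold variant, or feeding them to the inference classifier) is as described in the main text.

```python
import numpy as np

def ra_similarity_features(ra_original, ra_neighbors, metric="cosine"):
    """Flatten the rollout attention map of the original image and of each of
    its n noisy neighbors, then record one similarity score per neighbor.
    `metric` may be "cosine" or "pearson"."""
    v = ra_original.reshape(-1)
    feats = []
    for ra_n in ra_neighbors:                       # n neighbor RA maps
        u = ra_n.reshape(-1)
        if metric == "cosine":
            s = float(np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u) + 1e-12))
        else:                                       # Pearson correlation coefficient
            s = float(np.corrcoef(v, u)[0, 1])
        feats.append(s)
    return np.array(feats)                          # length-n feature vector

# toy usage: a 197x197 rollout attention map and n = 8 noisy neighbors
rng = np.random.default_rng(0)
ra = rng.random((197, 197))
neighbors = [ra + 0.01 * rng.standard_normal(ra.shape) for _ in range(8)]
print(ra_similarity_features(ra, neighbors, metric="pearson"))
```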