ViTs are gaining remarkable popularity as an alternative to convolutional neural networks (CNNs) across various computer vision tasks, such as image classification [19, 52], segmentation [5, 59], and object detection [6, 65]. The key ideas behind these breakthroughs can be characterized as two main aspects: (a) the transformer architectures based on the self-attention mechanism, which enable the model to learn global features of images; (b) initially pre-training on an upstream (un)labeled dataset through (self-)supervised learning, followed by fine-tuning on a local labeled dataset for a downstream task [19]. Most previous literature on ViTs targets developing various pre-training objectives [9, 24, 60] or variants of self-attention mechanisms [20, 56, 63, 64]. However, such efforts in both aspects can lead to serious privacy risks for the training data, which have not been carefully studied yet. For example, ViTs may suffer from membership inference attacks (MIAs) [47], which aim to infer whether an input image is in the pre-training dataset.

In contrast, we study a more typical supervised pre-training for ViTs on labeled datasets, and focus on the core of a ViT encoder, the self-attention mechanism. In general, a ViT encoder starts by dividing an input image into fixed-size patches, and then linearly projects them into a sequence of vectors named patch embeddings. To capture the spatial arrangement of image patches, positional embeddings (PEs) are added to the patch embeddings. The self-attention mechanism allows the model to integrate information across all patches, capturing global relationships in the image. That is why ViTs may surpass CNNs, which primarily focus on local features. However, we observe that the attention, an intermediate feature representation matrix that weighs the importance of different patches relative to each other, can leak membership, as the following experiment shows. We pre-train a standard ViT model provided by Google [19] from scratch on a member dataset. We then add independent Gaussian noise to each training image and forward both images into the encoder. We compute a cosine similarity between their corresponding attention maps, as well as between their encoder outputs. Figure 1(b) and Figure 1(c) present histograms of the number of images across these similarity scores.

[Figure 1: histograms of members vs. non-members by frequency (1k) across similarity scores, with panels (a) Rollout Attention, (b) Last Attention, and (c) Encoder Output.]
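To make this observation concrete, the following is a minimal sketch of the probing experiment, not the authors' code. It assumes the Hugging Face ViTModel interface for the Google checkpoint named later in the paper, and the noise level of Appendix A; image preprocessing and normalization are omitted for brevity.

```python
# Sketch: compare last-block attention maps of an image and its noisy copy.
# Assumptions: Hugging Face `transformers` ViTModel interface; pixel values
# already scaled to [0, 1]; sigma = 0.2 follows Appendix A.
import torch
import torch.nn.functional as F
from transformers import ViTModel

model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
model.eval()

def attention_similarity(pixel_values: torch.Tensor, sigma: float = 0.2) -> float:
    """Cosine similarity between the last-block attention maps of a batch of
    images and their Gaussian-noised copies."""
    noisy = (pixel_values + sigma * torch.randn_like(pixel_values)).clamp(0.0, 1.0)
    with torch.no_grad():
        a_clean = model(pixel_values=pixel_values, output_attentions=True).attentions[-1]
        a_noisy = model(pixel_values=noisy, output_attentions=True).attentions[-1]
    # Flatten each attention tensor and compare directions only; the same can be
    # done with `outputs.last_hidden_state` to compare encoder outputs.
    return F.cosine_similarity(a_clean.flatten(1), a_noisy.flatten(1), dim=1).mean().item()

# Members (training images) are expected to score lower than non-members,
# i.e., their attention maps are more disturbed by the added noise.
```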
[51], DP-SGD [1], RelaxLoss [10], and adversarial regularization [40]. Besides, the experiments consider whether an attacker is adaptive, meaning whether it knows the defense methods. Empirical results show the remarkable effectiveness of our MMUT. We highlight that our MMUT borrows the idea of mixing up images for data augmentation [48]; however, a greater contributor to MMUT's success might be our special treatment of PEs, i.e., the mosaic embedding. To justify this, we compare MMUT with several standard image augmentation methods in the literature, including merely mixing up images without mosaic embeddings. Empirical results show that they are less effective in defending against RAMIAs. Nevertheless, coupling PE mosaics with image MixUp contributes to maintaining (or even surprisingly enhancing) the prediction accuracy of ViTs despite a high level of privacy. See Section 4 for details.

Practical Applications of Our Methods. Research on MIAs against ViTs is necessary and meaningful. One of the most significant applications of ViTs is medical image classification [45]. For instance, telemedicine platforms, such as Teladoc Health, use ViTs to identify patterns in chest X-rays that are indicative of conditions like pneumonia. In this scenario, a hospital might utilize MIAs to ensure that the patients' sensitive medical images used in pre-trained ViTs are handled in compliance with HIPAA and GDPR regulations, thereby maintaining patient confidentiality and data security. Another possible application is content moderation on social media platforms [57]. For example, Facebook uses ViTs to moderate content by analyzing user-uploaded images. However, there is a risk that these models may inadvertently learn sensitive or private user data. Using MIAs, an independent watchdog organization could audit whether personal images uploaded by users have been used without consent to train these models. In addition, to ensure a more realistic fit, the ViT architecture studied in this paper is an official model³ released by Google [19]. We pre-train it from scratch on CIFAR10, CIFAR100, ImageNet100, and ISIC2018 in our experiments. Moreover, to see how our RAMIA performs on a ViT encoder directly downloaded from the web, we also attack its original model with pre-trained parameters in Section 3.5.

³ https://huggingface.co/google/vit-base-patch16-224-in21k

Summary of Our Contributions:

• We provide the first comprehensive study on MIAs against ViTs.
• We propose and rigorously evaluate RAMIA, a white-box MIA method against ViT encoders, focusing on the self-attention mechanism.
• We propose MMUT, a training framework for ViTs. Extensive experiments show its effectiveness in preserving the accuracy-privacy trade-off.

2 Preliminaries and Related Work

2.1 Vision Transformer

ViT [23, 30] is first proposed by Dosovitskiy et al. [19], whose core idea is to segment an image into a series of patches (typically 16x16 pixels) and then process these patches as a sequence of 'tokens' like words in NLP. The transformer architecture is trained to capture the long-range dependencies between these tokens. The self-attention mechanism helps ViTs effectively focus on the interactions between any parts of an image and achieve excellent performance in many CV tasks.

Figure 2 demonstrates a complete ViT model for a classification task. The most crucial part is the transformer encoder. The encoder initially divides the input image into multiple small patches and flattens each patch into a one-dimensional vector, named a patch embedding, through a linear projection. Note that for a classification task, a special vector, known as a class token (CT), is added to the sequence of these patch embedding vectors. The CT is typically used to capture the global information of an entire image, and after the final block of the ViT encoder, the CT is the only input to the classifier because it has aggregated full information (both patch-wise and position-wise) from the entire image through the series of transformer layers. Subsequently, a learnable positional embedding (PE) vector is added to each patch embedding, including the CT. This combination, now imbued with positional information, is then processed by a multi-head self-attention (MSA) mechanism, and will be dynamically refined through learning and optimization with training data.

Multi-head Self-attention. MSA is the core component of the transformer encoder, which contains 𝑁 transformer blocks. In each block, MSA aggregates sequential tokens as

t_j = \sum_i A_{ji} V_i = \sum_i \mathrm{SoftMax}\big(QK^\top / \sqrt{d}\big)_{ji} V_i,   (1)

where Q, K and V are the query, key and value matrices, respectively, 𝑑 is the dimension of the query and key, and 𝑡_𝑗 is the 𝑗-th output token. We highlight that A is called the attention map, a matrix of scores indicating the relevance of each input embedding to the others. Our attacks are based on such attention maps, more precisely, on the rollout attention (RA). Multiple attention heads separately run the attention mechanism in parallel, making the model focus on different parts of the input simultaneously and capture various aspects of the information. The outputs from each head are concatenated back together. The concatenated output is then passed through another linear transformation to produce the final output of the multi-head attention mechanism.

Classifier. In general, ViTs do not include a 'decoder' part. For a classification task, a classifier is added after the final encoder block. The classifier generally contains a multi-layer perceptron (MLP) head, which usually consists of multiple layers of fully connected neural networks. The output is a prediction vector over every class label.

Rollout Attention. Attention rollout, proposed by Abnar and Zuidema [2], is a methodology for quantifying the flow of attention through a transformer encoder by tracing the progression of attention from the initial to the final block. Specifically, in each transformer block, an attention map A is computed. Each A_{ij} indicates how much attention flows from token 𝑗 in the previous block to token 𝑖 in the next block. An identity matrix I is then added to the layer attention, A + I, to represent the unchanged attention that results from the residual connections between blocks. Consequently, the RA matrix, denoted as RA_ℓ at block ℓ, can be computed recursively by matrix multiplication as

RA_\ell = (A_\ell + I) \, RA_{\ell-1}.   (2)
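To make Eqns. (1) and (2) concrete, below is a small self-contained NumPy sketch of single-head scaled dot-product attention and of attention rollout with entry-wise head fusion. Variable names are illustrative, and the row re-normalization is a common implementation detail rather than part of Eqn. (2).

```python
# Sketch of Eqn. (1) and Eqn. (2); not the paper's code.
import numpy as np

def self_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray):
    """Single-head scaled dot-product attention, Eqn. (1).
    q, k, v: (tokens, d). Returns (outputs, attention map A)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                 # (tokens, tokens)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # row-wise SoftMax -> A
    return attn @ v, attn

def attention_rollout(per_block_attn, head_fusion: str = "mean") -> np.ndarray:
    """Rollout attention, Eqn. (2). per_block_attn: list of (heads, tokens, tokens)
    attention maps, one per transformer block, ordered from first to last."""
    fuse = {"mean": np.mean, "max": np.max, "min": np.min}[head_fusion]
    tokens = per_block_attn[0].shape[-1]
    rollout = np.eye(tokens)
    for a in per_block_attn:
        a = fuse(a, axis=0)                        # entry-wise head fusion
        a = a + np.eye(tokens)                     # residual connection: A + I
        a = a / a.sum(axis=-1, keepdims=True)      # re-normalize rows (implementation detail)
        rollout = a @ rollout                      # RA_l = (A_l + I) RA_{l-1}
    return rollout
```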
[Figure 2: A complete ViT model for classification: flattened image patches are linearly projected into patch embeddings, combined with positional embeddings and a class token, processed by the transformer encoder (multi-head self-attention built from scaled dot-product attention over Q, K, V, followed by Norm and MLP blocks), and classified by an MLP head.]
Further, recall that transformer encoders often have multiple attention heads which separately compute self-attention in parallel. In this case, several different methods are utilized to combine the heads: computing the maximum (Max), minimum (Min), or average (Mean) over the heads. Abnar and Zuidema [2] suggest the best choice may vary among different tasks, and thus we evaluate all of them in Section 3.4.

Remarks. In our RAMIA, instead of using per-block attention maps, we use the global rollout attention, meaning that all attention maps in every block are multiplied cumulatively according to Eqn. (2), to construct the feature vectors for training shadow encoders.

Most existing work on ViTs aims to enhance performance in various vision tasks, primarily focusing on three means: boosting the concept of locality within images [9, 24, 60], refining the self-attention mechanism [20, 56, 63, 64], and innovating in architectural design [14, 18, 21, 24, 28, 34, 37, 52]. Besides EncoderMI, the privacy risks in ViTs have been reported by Lu et al. [38], who point out a gradient leakage risk of the self-attention mechanism. To the best of our knowledge, we provide the first comprehensive study on both membership inference attacks and defenses against ViTs.

2.2 Membership Inference Attack

MIAs on machine learning (ML) models aim to determine whether a specific sample is included in the training set of the target model. Depending on how the attack model is built, there are two main types of MIA strategies: those that train a binary classifier for members and non-members (classifier-based), and those that rely on a carefully-designed metric to separate members and non-members directly (threshold-based). For example, Shokri et al. [47] first propose a shadow training technique, which trains a shadow model to imitate the behavior of the target model using a shadow dataset. A classifier is then trained using the output of the shadow models as a feature. Such output-based classification often arises in black-box attacks. In contrast, white-box MIAs can access all the information, including the prediction vector, the intermediate computation (e.g., the feature map) at each hidden layer, the loss, and the gradient of the loss with respect to the parameters of each layer for the input image. An example is from Salem et al. [44], who use either the maximum prediction confidence or the prediction loss of the target model as a metric. A threshold is then determined to classify members and non-members; these are called threshold-based MIAs. Recently, Pang et al. [41] employ model gradients in MIAs, and argue that they provide a more profound feature representation. Our RAMIA basically follows the idea of shadow training. We design both classifier-based and threshold-based variants of RAMIA, and compare them with the aforementioned attack methods. Numerous studies have applied MIAs to various models, including classification models [7, 36, 58], generative adversarial networks (GANs) [11, 25, 26], and diffusion models [8, 31, 41]. Readers may refer to [27] for a survey.

The existing work most relevant to ours is EncoderMI [35], a black-box MIA against encoders in contrastive learning, which assumes the encoder is sufficiently overfitted. EncoderMI relies heavily on this assumption and observes a disparity in the outputs of the encoder for members and non-members. Although most of their experiments are conducted on CNNs, they also report results on CLIP with a ViT encoder. Compared to EncoderMI, our work differs in: (a) not relying on the contrastive learning assumption, which may not be dominant in realistic scenarios where ViT encoders are pre-trained on labeled datasets; (b) assuming a more applicable white-box access, where full information on ViT encoders can be used for attacks; and (c) a new defense method through a unified training framework.
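As an illustration of the threshold-based strategy described above, here is a toy sketch that uses the maximum prediction confidence as the metric, in the spirit of [44]. The threshold-selection rule shown is an assumption for illustration, not the exact procedure of any cited attack.

```python
# Toy sketch of a threshold-based MIA using maximum prediction confidence.
import numpy as np

def max_confidence(probs: np.ndarray) -> np.ndarray:
    """probs: (num_images, num_classes) softmax outputs of the target model."""
    return probs.max(axis=1)

def pick_threshold(member_scores: np.ndarray, nonmember_scores: np.ndarray) -> float:
    """Pick the threshold that best separates shadow members from non-members
    (an assumed selection rule for this sketch)."""
    best_t, best_acc = 0.0, 0.0
    for t in np.unique(np.concatenate([member_scores, nonmember_scores])):
        acc = 0.5 * ((member_scores >= t).mean() + (nonmember_scores < t).mean())
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def infer_membership(scores: np.ndarray, threshold: float) -> np.ndarray:
    # Higher confidence than the threshold -> predicted member.
    return scores >= threshold
```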
2.3 Defenses against MIAs

Various defenses [13, 29, 33, 46, 47, 49, 62] against MIAs have been well studied. One of the most widely used privacy-preserving techniques is differential privacy (DP) [1], which provides a promising defense against MIAs by adding noise to the gradients (DP-SGD) or parameters during model training. DP faces a significant trade-off between accuracy and privacy: a high degree of security may lead to poor accuracy. Another defense more specific to MIAs is regularization [25, 26, 33, 50], a more efficient way to ensure the privacy of overfitted models. It has been reported that most methods used to enhance the generalizability of an NN model can improve its privacy at the same time. For example, label smoothing [51] is proved to be effective in mitigating MIAs by minimizing the behavioral discrepancy between training and test data; adversarial regularization [40] incorporates membership inference gain into the target model's objective function, balancing classification loss with attack model accuracy; RelaxLoss [10] designs a more achievable learning objective, achieving easy implementation and negligible overhead. Details of how these methods apply to ViTs will be discussed in Section 4.3. Our MMUT uses them as baselines for comparison in Section 4.4.

3 Rollout Attention MIA

3.1 Attacker Assumptions

Recall that the objective of an attacker is to infer whether a given image 𝑥 belongs to the pre-training dataset of the transformer encoder of a ViT (which we call the target encoder of a target ViT); in other words, to infer whether 𝑥 is a member or non-member. In this paper, we assume that the attacker has white-box access to the encoder of the target ViT, with full information of (a) the distribution of images for training, meaning that the attacker can set up a shadow dataset with the same distribution as the target dataset; (b) the architecture and the learned parameters of the encoder, meaning that the attacker can train a shadow model with the same architecture as the target encoder (without any information on the remaining parts of the target ViT, say, the ViT classifier); and (c) how the target model is trained, for example, using either supervised or self-supervised learning on a labeled or unlabeled dataset, respectively. Note that such a threat model accurately captures the practical situation on Hugging Face, where the information of a ViT encoder is recorded as a model card.

3.2 RAMIA Design

Overview. Figure 3 presents our RAMIA method, which follows the shadow training technique proposed by Shokri et al. [47]. Recall that for a given image 𝑥, we forward it to the target encoder and extract its rollout attention. Further recall that in Section 1, our experimental observation in Figure 1 indicates that the rollout attention maps are more susceptible to the influence of added noise for a member of the target ViT encoder than for a non-member. Therefore, inspired by this observation, our RAMIA classifies an input image 𝑥 as a member when the rollout attention maps generated by the target encoder are significantly different for the image 𝑥 and its noise-added counterpart 𝑥̃. Specifically, RAMIA follows three steps: (a) shadow encoder pre-training. We partition the attacker's shadow dataset, which has the same distribution as the target dataset, into two disjoint subsets, named shadow members and shadow non-members, respectively. The attacker then pre-trains a shadow encoder using the shadow members; (b) feature vector construction. We construct a feature vector for each image to be inferred using the similarity scores between the rollout attention map of the original input image and those of its neighbor images obtained by adding independent random noise; (c) membership inference. We apply two common attack methods from the literature to show the effectiveness of our RAMIA, that is, a binary-classifier-based MIA (named RAMIA-Classifier) and a threshold-based MIA (named RAMIA-Threshold). Specifically, RAMIA-Classifier trains a binary inference classifier to predict member/non-member for an input based on the feature vectors obtained in (b), while RAMIA-Threshold directly determines a proper threshold 𝜃: only if the mean value of the components of the feature vector is smaller than 𝜃 is the input inferred as a member, and vice versa. Algorithm 1 formally describes our RAMIA, which we explain in detail in the rest of this subsection.

(a) Shadow Encoder Pre-training. Initially, the attacker splits a shadow dataset D_s, which is assumed to be independently and identically distributed to the target dataset D_t, into two disjoint subsets of shadow members D_s^mem and non-members D_s^non-mem. D_s^mem is then used to train a shadow ViT encoder E_s, which has the identical architecture to the target ViT encoder E_t.

(b) Feature Vector Construction. For each input image 𝑥 in the shadow dataset D_s, we compute its rollout attention, denoted as RA(E_s, 𝑥), through the shadow encoder E_s. Subsequently, we construct 𝑛 neighbor images of 𝑥, denoted as 𝑥̃_1, 𝑥̃_2, ..., 𝑥̃_n, by adding independent Gaussian noise as stated in Appendix A to 𝑥, and compute each RA(E_s, 𝑥̃_i) for all i ∈ [n]. Next, a cosine similarity, denoted as sim(𝑥, 𝑥̃_i), between RA(E_s, 𝑥) and RA(E_s, 𝑥̃_i) is computed for each neighbor image 𝑥̃_i. In total, the 𝑛 neighbor images produce a similarity feature vector V_s(𝑥) for input 𝑥, where V_s(𝑥) = [sim(𝑥, 𝑥̃_1), sim(𝑥, 𝑥̃_2), ..., sim(𝑥, 𝑥̃_n)].

(c) Membership Inference. We adopt both binary-classifier-based and threshold-based techniques for the attack.

• RAMIA-Classifier: we train the RAMIA-Classifier using the feature vectors V_s of the shadow dataset as inputs, with a binary label indicating whether they are shadow members or non-members. During the inference process, an inferred image is forwarded to the target encoder E_t and produces a feature vector V_t. The inference classifier takes V_t as input and predicts the membership of the inferred image.
• RAMIA-Threshold: the key insight is how to choose a proper threshold 𝜃. We choose the 𝜃 that maximizes the prediction accuracy over all images in D_s. Subsequently, for an inferred image 𝑥, we forward it to the target encoder E_t and compute the corresponding feature vector V_t(𝑥). 𝑥 is inferred as a member when the mean value of the components of V_t(𝑥) is smaller than 𝜃, that is, when ||V_t(𝑥)||_1 / |V_t(𝑥)| < 𝜃, where ||·||_1 denotes the ℓ1-norm and |·| denotes the cardinality (number of components) of a vector.
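As a rough illustration of steps (b) and (c), the following sketch constructs the similarity feature vector and applies the threshold rule. Here `rollout_attention` is an assumed callable (for example, built from the attention-extraction and rollout sketches earlier) that maps an image to its RA map; this is not the authors' implementation.

```python
# Sketch of RAMIA feature construction and the threshold decision rule.
import numpy as np

def add_gaussian_noise(image: np.ndarray, sigma: float = 0.2) -> np.ndarray:
    """image: H x W x C array scaled to [0, 1]; noise level follows Appendix A."""
    return np.clip(image + np.random.normal(0.0, sigma, image.shape), 0.0, 1.0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def feature_vector(rollout_attention, image: np.ndarray, n: int = 8) -> np.ndarray:
    """Step (b): V(x) = [sim(x, x~_1), ..., sim(x, x~_n)], where `rollout_attention`
    is an assumed callable returning the RA map of Eqn. (2) for one image."""
    ra = rollout_attention(image)
    sims = [cosine(ra, rollout_attention(add_gaussian_noise(image))) for _ in range(n)]
    return np.asarray(sims)

def ramia_threshold(rollout_attention, image: np.ndarray, theta: float, n: int = 8) -> bool:
    """Step (c), RAMIA-Threshold: member iff the mean similarity is below theta,
    since members' RA maps are more disturbed by the added noise."""
    return feature_vector(rollout_attention, image, n).mean() < theta
```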
[Figure 3: Overview of RAMIA: shadow encoder pre-training (splitting the shadow dataset into shadow members and shadow non-members), feature vector construction from rollout attention, and membership inference via a classifier or a threshold.]
Table 2: Average accuracy, precision, and recall (%) of our methods for the target encoder pre-trained on four datasets: CIFAR10, CIFAR100, ImageNet100 and ISIC2018. C is for classifier, T is for threshold, P is for prediction, L is for LIRA, and G is for gradient. In this experiment, our multi-head fusion method is average (Mean).

… non-member, named (e) FullViT-Prediction, (f) FullViT-LIRA [7], which uses a principled likelihood ratio test with Gaussian likelihood estimates and per-example difficulty scores, and (g) FullViT-Gradient, respectively.

3.4 RAMIA Evaluation: Experimental Results

Effectiveness of RAMIA. Table 2 shows the attack performance of our RAMIAs and seven comparison methods, (a) to (g), on four different datasets: CIFAR10, CIFAR100, ImageNet100 and ISIC2018. The key observations are:

• The FullViT attack methods behave as random guessing because their accuracies are close to 50%. The results are not surprising because the classifier is trained on the shadow dataset, whose output features fail to capture whether the encoder is overfitted for images in the target dataset.
• EncoderMI, as well as the Baseline-Classifier, is also effective, but behaves worse than our RAMIA, which validates the effectiveness of using the rollout attention instead of the output of the encoder (the class token). Readers may wonder whether the results are comparable, since EncoderMI assumes black-box access to the encoder, which is stricter than RAMIA. Existing literature [39] reports and validates that a white-box attack may not be easier than a black-box attack, contrary to intuition. Besides, in practical scenarios, the encoders of ViTs are released on websites such as Hugging Face, so a white-box assumption is reasonable.
• Attention-based MIAs (AMIA-Threshold and AMIA-Classifier) behave worse than our RAMIAs, but still beat EncoderMI in some cases. The reason is that, compared to the attention in the last transformer block, rollout attention captures cumulative attention through the whole forward pass, making it more sensitive to noise.

Besides, Figure 4 presents the ROC curves and AUC values of the different attacks, indicating that our RAMIA-Threshold outperforms the other attacks significantly on all four datasets.

The impact of attention rollout methods. Recall that to handle multiple attention heads, the attention rollout technique aggregates the RA matrices of the heads by computing their maximum (Max), minimum (Min), or average (Mean). All three computations are entry-wise. Figure 5 shows their impact on inference accuracy on the four datasets. Among the three methods, the highest accuracy is achieved by Max on ImageNet100, because Max may enhance the disparity of feature maps between members and non-members. On the contrary, for a similar reason, Min behaves the worst because vanishing values in RA produce indistinguishable features. As a compromise, Mean gives relatively stable accuracy across the four datasets.

The impact of model structures. Two important structural parameters of a ViT encoder are the number of encoder blocks and the number of attention heads in each block. We examine their impacts by fixing one and changing the other. Figures 6(a) and 6(b) show the impacts of encoder blocks and attention heads, respectively. Our main findings are:

• For different numbers of encoder blocks, accuracy does not seem to vary much, meaning that more blocks do not weaken the performance of our RAMIA. This is exactly why we use the RA maps, which aggregate the attention information of every block, for feature construction.
• We notice a remarkable correlation between the number of attention heads and the inference accuracy of RAMIA. In particular, as the number of attention heads increases, accuracy rises first and falls later. More heads do provide more information for constructing features. However, when there are too many, the aggregation policy may fail to capture the most salient features from the RA maps. That is perhaps why the ViTs used in practice (with typically 6 heads) do not have too many attention heads in their encoders.

Other impacts. We also evaluate other factors that may influence the performance of RAMIA, including the number of neighbor images (denoted as 𝑛) with independent noise added and different choices of the metric when computing the similarity of RA maps, e.g., the Pearson correlation coefficient (PCC). Details are reported in Appendix C. Our main findings are: (1) in general, a larger 𝑛 induces a higher attack accuracy; (2) both the cosine similarity and the PCC are effective in RAMIAs.

3.5 Applying RAMIA to Real-world ViT Encoders

The ViT encoders used in the previous experiments mostly use an official model architecture released by Google. To further strengthen
Figure 4: The ROC curves of nine attack methods on four datasets, CIFAR10, CIFAR100, ImageNet100 and ISIC2018.
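For completeness, this is one way the ROC curves and AUC values in Figure 4 can be computed from per-image attack scores using scikit-learn; it is a sketch, and the sign convention (lower RA-map similarity indicates a member) follows RAMIA-Threshold.

```python
# Sketch: turn per-image mean RA-map similarities into an ROC curve and AUC.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def evaluate_attack(member_scores: np.ndarray, nonmember_scores: np.ndarray):
    """Scores are mean RA-map similarities; negate them so that larger values
    indicate 'member', as expected by scikit-learn."""
    y_true = np.concatenate([np.ones_like(member_scores), np.zeros_like(nonmember_scores)])
    y_score = -np.concatenate([member_scores, nonmember_scores])
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return fpr, tpr, roc_auc_score(y_true, y_score)
```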
[Figure 5: Attack accuracy (%) of (a) RAMIA-T and (b) RAMIA-C under the Mean, Max, and Min head-fusion methods.]

Table 3: Attacking results (%) for Google's ViT-base.
… an extra learnable mosaic embedding. By doing so, in the backward step, the loss due to this image will only update those non-replaced PEs and the mosaic embedding (instead of the replaced PEs). … we can resize all images in D_Pub at the beginning.

[Figure 7: rollout attention heat maps for (a) the input image, (b) full PEs, and (c) uniform PEs.]

[Figure: the MMUT pipeline combines Image MixUp with Position Mosaic.]
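A minimal sketch of the positional-embedding mosaic and the backward-pass behavior described in the fragment above; tensor shapes and names are illustrative assumptions.

```python
# Sketch: mosaicked positions read from omega, so the backward pass updates
# omega and only the non-replaced PEs (replaced PEs receive zero gradient).
import torch

num_patches, dim = 196, 768
pe = torch.nn.Parameter(torch.randn(num_patches, dim))   # ordinary learnable PEs
omega = torch.nn.Parameter(torch.zeros(dim))              # learnable mosaic embedding

def mosaic_positional_embeddings(mask: torch.Tensor) -> torch.Tensor:
    """mask: boolean vector of length num_patches, True where the patch (and its
    PE) is replaced. Returns the PE tensor actually fed to the encoder."""
    return torch.where(mask.unsqueeze(-1), omega.expand(num_patches, dim), pe)
```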
… more crucial to MMUT's success than data augmentation techniques. We shall set up experiments in Section 4.4 to justify the latter two arguments above.

Algorithm 2 MMUT
Require: Training dataset 𝐷; public dataset 𝐷_pub; mosaic ratio 𝛼; learnable positional embedding 𝜔; positional embeddings PE; parameters other than PEs 𝜃
Ensure: Trained 𝜃* and PE*
1: for each 𝑥 in 𝐷 do
2:    I ← InitializeIndicator()
3:    Ĩ ← RandomFlip(I, 𝛼)
4:    P̃E[Ĩ] = 𝜔,  P̃E[Ī] = PE[Ī]
5:    𝑦 ← RandomSelect(𝐷_pub)
6:    𝑥[Ĩ] = 𝑦[Ĩ]
7:    𝜃*, PE*, 𝜔 ← GradientDescent(𝑥, P̃E)
8: end for
9: return 𝜃*, PE*

… decreasing the distinguishability between member and non-member data and reducing the success rate of attacks. Secondly, to maintain the utility of the model, RelaxLoss does not maximize the predicted posterior score of the true class to 1. Instead, it flattens the posterior scores of the non-true classes, ensuring a significant margin between the true-class score and the others, thus preventing incorrect predictions, especially for challenging samples near decision boundaries.

• (d) Adversarial Regularization (Adv-Reg) [40]. Adv-Reg trains target models by blending the traditional cross-entropy loss with an adversarial loss. This method minimizes a composite loss, which is a weighted sum of the cross-entropy and adversarial losses. The weight of the adversarial loss, 𝛿, is adjusted throughout training to balance the contributions of both losses. The adversarial loss in Adv-Reg is generated by surrogate attack models, which are specifically trained on two types of data: the target model's training dataset and an additional, separate hold-out dataset. This training approach ensures these models are well-prepared to effectively challenge the target model, thus improving its robustness and overall performance.
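To make Algorithm 2 concrete, below is a PyTorch-style sketch of one MMUT training step under simplifying assumptions: `model(patches, pos_embed)` is a hypothetical ViT interface that accepts externally supplied positional embeddings and returns per-class logits for one image, and the optimizer is assumed to cover the model parameters, PE, and 𝜔. It is a sketch of the procedure, not the authors' implementation.

```python
# Sketch of one MMUT training step (Algorithm 2); names are illustrative.
import torch
import torch.nn.functional as F

def patchify(img: torch.Tensor, p: int) -> torch.Tensor:
    """Split a (C, H, W) image into flattened non-overlapping p x p patches."""
    c, h, w = img.shape
    return (img.reshape(c, h // p, p, w // p, p)
               .permute(1, 3, 0, 2, 4)
               .reshape((h // p) * (w // p), c * p * p))

def mmut_step(model, optimizer, x, label, public_image, pe, omega, alpha, patch_size=16):
    """x, public_image: (C, H, W); pe: (num_patches, dim); omega: (dim,); label: scalar long."""
    # Lines 2-3: indicator of which patches to mosaic, flipped with probability alpha.
    num_patches = (x.shape[-2] // patch_size) * (x.shape[-1] // patch_size)
    mask = torch.rand(num_patches) < alpha

    # Line 4: mosaic the positional embeddings (replaced positions read omega,
    # so their original PEs receive no gradient).
    pe_tilde = torch.where(mask.unsqueeze(-1), omega.expand_as(pe), pe)

    # Lines 5-6: mix patches of the public image into the masked positions.
    x_patches = patchify(x, patch_size)
    y_patches = patchify(public_image, patch_size)
    x_patches[mask] = y_patches[mask]

    # Line 7: one gradient-descent step on the mixed image with mosaicked PEs.
    optimizer.zero_grad()
    logits = model(x_patches, pe_tilde)          # assumed interface: (num_classes,)
    loss = F.cross_entropy(logits.unsqueeze(0), label.unsqueeze(0))
    loss.backward()
    optimizer.step()
    return loss.item()
```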
[Figures: attack accuracy (%) vs. prediction accuracy (%) trade-off curves on CIFAR10, CIFAR100, ImageNet100, and ISIC2018 in the adaptive setting, comparing Basic, MMUT, Label Smoothing, DP-SGD, RelaxLoss, and Adv-Reg.]
[Figure 12: Performance of MMUT in prediction accuracy (%) on three datasets. The mosaic ratio 𝛼 varies from 0 to 0.969. The dotted lines show the prediction accuracy without any defense on the three datasets.]

• … more explicit when 𝛼 exceeds roughly 0.5 for ImageNet100 and 0.4 for CIFAR10 and CIFAR100.
• For non-adaptive RAMIA, MMUT reduces the attack accuracy to a random-guessing level, close to 50%. In the adaptive setting, even if the attacker knows the specific defense method and parameters, MMUT still reduces the attack accuracy remarkably.
• As 𝛼 increases, the prediction accuracy first rises and then drops. More specifically, MMUT can improve the prediction accuracy of the model, by up to 4.39% on CIFAR10, 2.15% on CIFAR100, and 2.02% on ImageNet100. Therefore, a good choice of 𝛼 is crucial in designing MMUT.

The impact of public data distribution. We question whether the public dataset used in MMUT needs to be i.i.d. with the private dataset. To verify its impact, we set the private dataset to CINIC10 [16], which has the same classification categories and number of class labels as CIFAR10, meaning that their distributions are similar but not identical. We consider six different public datasets: CINIC10 (half for the target model and the rest for the shadow model), CIFAR10, CIFAR100, ImageNet10 (a subset of ImageNet100 with 10 classes selected), ImageNet100, and random Gaussian noise. Figure 14 reports our experimental results. Our first finding is that MMUT performs well even when image patches are replaced with random noise. This is because the transformer-based model needs to learn robust attention allocations to distinguish the noise patches from the original data patches, which is exactly what we need to defend against RAMIA. Nevertheless, using other datasets for MixUp can be remarkably more effective. We observe that all datasets can effectively defend against RAMIAs. However, much more interestingly, the best-performing public dataset is not CINIC10 itself, but CIFAR10, whose distribution is similar but not identical to CINIC10. A possible reason is that such a dataset can maximize the generalization ability of the model, while i.i.d. data provides a weaker contribution. On the other hand, ImageNet100 may not behave as well as expected, meaning that public data that differs too much does not help confuse the feature vectors constructed from RA maps.

[Figure 14: Attack accuracy (%) against MMUT on five different public datasets (x-axis: CINIC10, CIFAR10, CIFAR100, IN10, IN100, RN). IN10, IN100 and RN denote ImageNet10, ImageNet100 and random noise, respectively. Dotted lines show the attack accuracy without any defense.]

MMUT vs. data augmentation (DA). We argue that simply adopting DA techniques without PE mosaics is much less effective in defending against MIAs. We consider four well-known DA methods: MixUp, PixelMix [61], GridMask [12], and Flipping. MixUp is identical to MMUT except that it omits the position mosaic. PixelMix linearly combines multiple training images at the pixel level by blending their features and labels. GridMask occludes image regions using a random grid-like mask. Table 4 shows our experimental results on the attack accuracy of RAMIA-Threshold. MixUp, PixelMix, and GridMask have extremely limited effects on defending against RAMIAs, while the simple geometric transformation of flipping images has a slight effect. Such findings in fact echo our insights on using mosaic position embeddings. Flipping perturbs the patches of the original image as well as the corresponding PEs, allowing the model to focus less on positional information and learn a similar dispersion of attention as in Figure 7(c), leading to more robust performance. Nevertheless, our MMUT is much more effective than merely using DA for defense.

Table 4: Attack accuracy (%) of RAMIA-T against MMUT vs. several DA techniques.

Similarity of RA under MMUT. Finally, as a complement and another corroboration of our results, we explore how the RA maps change when noise is added to an input image on CIFAR10. Similar to Figure 1, Figure 13, which presents the cosine similarities after MMUT is adopted, shows that MMUT is more effective than the other methods in bridging the gap between members and non-members. Compared with the original model, MMUT significantly improves the RA's similarity level, whereas DP-SGD is exactly the opposite.

[Figure 13: Histograms of the number of member vs. non-member images across different cosine similarity scores under defenses on CIFAR10. Panels: (a) MMUT, (b) Label Smoothing, (c) DP-SGD, (d) RelaxLoss, (e) Adv-Reg.]

5 Conclusion

This work presents the first comprehensive study of membership inference attacks and defenses against a powerful deep learning model, the vision transformer. We use the information provided by the rollout attention maps and design a white-box attack against real-world ViT encoders. Based on an observation of the significance of positional embeddings, we design a unified framework for training ViTs as a defense method. Our defense achieves a promising privacy-performance trade-off. Possible future work includes: (a) extending our methods to language models; (b) attacking ViTs using other, more expressive features; (c) designing a training-free defense method.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (NSFC) (No. 62302183, No. 62372191, No. 62302187 and No. 62202197) and the Open Foundation of Key Laboratory of Cyberspace Security, Ministry of Education (No. KLCS20240401).

References

[1] Martín Abadi, Andy Chu, Ian J. Goodfellow, H. B. McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep Learning with Differential Privacy. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (2016).
[2] Samira Abnar and Willem Zuidema. 2020. Quantifying Attention Flow in Transformers. In Annual Meeting of the Association for Computational Linguistics.
[3] Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, and Alexei A. Efros. 2023. Sequential Modeling Enables Scalable Learning for Large Vision Models.
[4] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. ArXiv abs/2005.14165 (2020).
[5] Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Manning Wang. 2021. Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation. In ECCV Workshops.
[6] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In European Conference on Computer Vision. Springer, 213–229.
[7] Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, A. Terzis, and Florian Tramèr. 2022. Membership Inference Attacks From First Principles. 2022 IEEE Symposium on Security and Privacy (SP) (2022), 1897–1914.
[8] Nicholas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramèr, Borja Balle, Daphne Ippolito, and Eric Wallace. 2023. Extracting Training Data from Diffusion Models. ArXiv abs/2301.13188 (2023).
[9] Chun-Fu Chen, Rameswar Panda, and Quanfu Fan. 2021. RegionViT: Regional-to-Local Attention for Vision Transformers. ArXiv abs/2106.02689 (2021).
[10] Dingfan Chen, Ning Yu, and Mario Fritz. 2022. RelaxLoss: Defending membership inference attacks without losing utility. arXiv preprint arXiv:2207.05801 (2022).
[11] Dingfan Chen, Ning Yu, Yang Zhang, and Mario Fritz. 2019. GAN-Leaks: A Taxonomy of Membership Inference Attacks against Generative Models. Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security (2019).
[12] Pengguang Chen, Shu Liu, Hengshuang Zhao, and Jiaya Jia. 2020. GridMask data augmentation. arXiv preprint arXiv:2001.04086 (2020).
[13] Christopher A. Choquette-Choo, Florian Tramèr, Nicholas Carlini, and Nicolas Papernot. 2020. Label-Only Membership Inference Attacks. ArXiv abs/2007.14321 (2020).
[14] Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. 2021. Twins: Revisiting the Design of Spatial Attention in Vision Transformers. In Neural Information Processing Systems.
[15] Noel C. F. Codella, Veronica M Rotemberg, Philipp Tschandl, M. E. Celebi, Stephen W. Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Armando Marchetti, Harald Kittler, and Allan C. Halpern. 2019. Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC). ArXiv abs/1902.03368 (2019). https://api.semanticscholar.org/CorpusID:60440592
[16] Luke N Darlow, Elliot J Crowley, Antreas Antoniou, and Amos J Storkey. 2018. CINIC-10 is not ImageNet or CIFAR-10. arXiv preprint arXiv:1810.03505 (2018).
[17] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, K. Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition (2009), 248–255.
[18] Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo. 2021. CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021), 12114–12124.
[19] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations.
[20] Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, and Hervé Jégou. 2021. XCiT: Cross-Covariance Image Transformers. In Neural Information Processing Systems.
[21] Jiemin Fang, Lingxi Xie, Xinggang Wang, Xiaopeng Zhang, Wenyu Liu, and Qi Tian. 2021. MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021), 12053–12062.
[22] Jianyuan Guo, Zhiwei Hao, Chengcheng Wang, Yehui Tang, Han Wu, Han Hu, Kai Han, and Chang Xu. 2024. Data-efficient Large Vision Models through Sequential Autoregression. arXiv preprint arXiv:2402.04841 (2024).
[23] Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, et al. 2022. A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 1 (2022), 87–110.
[24] Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, and Yunhe Wang. 2021. Transformer in Transformer. In Neural Information Processing Systems.
[25] Jamie Hayes, Luca Melis, George Danezis, and Emiliano De Cristofaro. 2017. LOGAN: Membership Inference Attacks Against Generative Models. Proceedings on Privacy Enhancing Technologies 2019 (2017), 133–152.
[26] Benjamin Hilprecht, Martin Härterich, and Daniel Bernau. 2019. Monte Carlo and Reconstruction Membership Inference Attacks against Generative Models. Proceedings on Privacy Enhancing Technologies 2019 (2019), 232–249.
[27] Hongsheng Hu, Zoran A. Salcic, Lichao Sun, Gillian Dobbie, P. Yu, and Xuyun Zhang. 2021. Membership Inference Attacks on Machine Learning: A Survey. ACM Computing Surveys (CSUR) 54 (2021), 1–37.
[28] Zilong Huang, Youcheng Ben, Guozhong Luo, Pei Cheng, Gang Yu, and Bin Fu. 2021. Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer. ArXiv abs/2106.03650 (2021).
[29] Jinyuan Jia, Ahmed Salem, Michael Backes, Yang Zhang, and Neil Zhenqiang Gong. 2019. MemGuard: Defending against Black-Box Membership Inference Attacks via Adversarial Examples. Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security (2019).
[30] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. 2022. Transformers in vision: A survey. ACM Computing Surveys (CSUR) 54, 10s (2022), 1–41.
[31] Fei Kong, Jinhao Duan, Ruipeng Ma, Hengtao Shen, Xiaolan Zhu, Xiaoshuang Shi, and Kaidi Xu. 2023. An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization. ArXiv abs/2305.18355 (2023).
[32] Alex Krizhevsky. 2009. Learning Multiple Layers of Features from Tiny Images.
[33] Zheng Li and Yang Zhang. 2020. Membership Leakage in Label-Only Exposures. Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security (2020).
[34] Hezheng Lin, Xingyi Cheng, Xiangyu Wu, Fan Yang, Dong Shen, Zhongyuan Wang, Qing Song, and Wei Yuan. 2021. CAT: Cross Attention in Vision Transformer. 2022 IEEE International Conference on Multimedia and Expo (ICME) (2021), 1–6.
[35] Hongbin Liu, Jinyuan Jia, Wenjie Qu, and Neil Zhenqiang Gong. 2021. EncoderMI: Membership Inference against Pre-trained Encoders in Contrastive Learning. Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security (2021).
[36] Yiyong Liu, Zhengyu Zhao, Michael Backes, and Yang Zhang. 2022. Membership Inference Attacks by Exploiting Loss Trajectory. Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security (2022).
[37] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021), 9992–10002.
[38] Jiahao Lu, Xi Sheryl Zhang, Tianli Zhao, Xiangyu He, and Jian Cheng. 2021. APRIL: Finding the Achilles' Heel on Privacy for Vision Transformers. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021), 10041–10050.
[39] Milad Nasr, R. Shokri, and Amir Houmansadr. 2018. Comprehensive Privacy Analysis of Deep Learning: Passive and Active White-box Inference Attacks against Centralized and Federated Learning. 2019 IEEE Symposium on Security and Privacy (SP) (2018), 739–753.
[40] Milad Nasr, R. Shokri, and Amir Houmansadr. 2018. Machine Learning with Membership Privacy using Adversarial Regularization. Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (2018).
[41] Yan Pang, Tianhao Wang, Xu Kang, Mengdi Huai, and Yang Zhang. 2023. White-box Membership Inference Attacks against Diffusion Models. ArXiv abs/2308.06405 (2023).
[42] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In International Conference on Machine Learning.
[43] Ahmed Salem, Yang Zhang, Mathias Humbert, Pascal Berrang, Mario Fritz, and Michael Backes. 2019. ML-Leaks: Model and Data Independent Membership Inference Attacks and Defenses on Machine Learning Models. In Proceedings of the 26th Annual Network and Distributed System Security Symposium (NDSS).
[44] A. Salem, Yang Zhang, Mathias Humbert, Mario Fritz, and Michael Backes. 2018. ML-Leaks: Model and Data Independent Membership Inference Attacks and Defenses on Machine Learning Models. ArXiv abs/1806.01246 (2018).
[45] Fahad Shamshad, Salman Khan, Syed Waqas Zamir, Muhammad Haris Khan, Munawar Hayat, Fahad Shahbaz Khan, and Huazhu Fu. 2023. Transformers in medical imaging: A survey. Medical Image Analysis (2023).
[46] Virat Shejwalkar and Amir Houmansadr. 2021. Membership Privacy for Machine Learning Models Through Knowledge Transfer. In AAAI Conference on Artificial Intelligence.
[47] R. Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2016. Membership Inference Attacks Against Machine Learning Models. 2017 IEEE Symposium on Security and Privacy (SP) (2016), 3–18.
[48] Connor Shorten and Taghi M Khoshgoftaar. 2019. A survey on image data augmentation for deep learning. Journal of Big Data 6, 1 (2019), 1–48.
[49] Liwei Song and Prateek Mittal. 2020. Systematic Evaluation of Privacy Risks of Machine Learning Models. In USENIX Security Symposium.
[50] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15 (2014), 1929–1958.
[51] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2015. Rethinking the Inception Architecture for Computer Vision. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015), 2818–2826.
[52] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. 2020. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning.
[53] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. ArXiv abs/2302.13971 (2023).
[54] Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. 2018. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific Data 5 (2018). https://api.semanticscholar.org/CorpusID:263789934
[55] Jiahao Wang, Wenqi Shao, Mengzhao Chen, Chengyue Wu, Yong Liu, Kaipeng Zhang, Songyang Zhang, Kai Chen, and Ping Luo. 2024. Adapting LLaMA Decoder to Vision Transformer. arXiv preprint arXiv:2404.06773 (2024).
[56] Pichao Wang, Xue Wang, F. Wang, Ming Lin, Shuning Chang, Wen Xie, Hao Li, and Rong Jin. 2021. KVT: k-NN Attention for Boosting Vision Transformers. ArXiv abs/2106.00515 (2021).
[57] Wenxuan Wang, Jingyuan Huang, Chang Chen, Jiazhen Gu, Jianping Zhang, Weibin Wu, Pinjia He, and Michael R. Lyu. 2023. Validating Multimedia Content Moderation Software via Semantic Fusion. Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (2023).
[58] Jiayuan Ye, Aadyaa Maddi, Sasi Kumar Murakonda, and R. Shokri. 2021. Enhanced Membership Inference Attacks against Machine Learning Models. Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security (2021).
[59] Linwei Ye, Mrigank Rochan, Zhi Liu, and Yang Wang. 2019. Cross-Modal Self-Attention Network for Referring Image Segmentation. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019), 10494–10503.
[60] Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Francis E. H. Tay, Jiashi Feng, and Shuicheng Yan. 2021. Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021), 538–547.
[61] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. 2018. mixup: Beyond Empirical Risk Minimization. In International Conference on Learning Representations.
[62] Junxiang Zheng, Yongzhi Cao, and Hanpin Wang. 2021. Resisting membership inference attacks through knowledge distillation. Neurocomputing 452 (2021), 114–126.
[63] Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Qibin Hou, and Jiashi Feng. 2021. DeepViT: Towards Deeper Vision Transformer. ArXiv abs/2103.11886 (2021).
[64] Daquan Zhou, Yujun Shi, Bingyi Kang, Weihao Yu, Zihang Jiang, Yuan Li, Xiaojie Jin, Qibin Hou, and Jiashi Feng. 2021. Refiner: Refining Self-attention for Vision Transformers. ArXiv abs/2106.03714 (2021).
[65] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. 2021. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In International Conference on Learning Representations.

A Gaussian Noises

In this paper, noise is added to images as follows: for a given image, we linearly map [0, 255] → [0, 1] and add Gaussian noise with 𝜇 = 0 and 𝜎 = 0.2 to each pixel value. Values smaller than 0 or larger than 1 are clipped to 0 or 1, respectively. Finally, we remap [0, 1] → [0, 255] with nearest rounding.

B Rollout Attention Visualization

Heat map construction in Figure 7. The RA map obtained is an (𝑁 + 1) × (𝑁 + 1) matrix, which represents the attention allocation of each of the 𝑁 + 1 tokens (𝑁 patches plus one class token (CT)) to itself and the other tokens. We only use the one-dimensional vector of size 𝑁 + 1 corresponding to the CT, which records the importance of each image patch. Then only 𝑁 components are kept, dropping the one that represents the attention of the CT on itself. Subsequently, we rearrange the 𝑁 attention values into a matrix of size √𝑁 × √𝑁, where each value corresponds to a patch in the original image. We then use interpolation to upsample this two-dimensional attention arrangement to a matrix of size (√𝑁 · 𝑝) × (√𝑁 · 𝑝) (𝑝 is the size of each patch), such that it has the same height and width as the original image, obtaining a single-channel heat map matrix ℎ. We convert the single-channel matrix into a color image using JET color mapping to obtain the final heat map 𝐻, a tensor of size (√𝑁 · 𝑝) × (√𝑁 · 𝑝) × 3. In JET, low values may be mapped to blue, intermediate values to green, and high values to red. Finally, we normalize the heat map, overlay it on the original image, and restore it to the 0-255 range, as displayed in Figure 7.
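The procedures in Appendices A and B map directly to a few lines of code. Below is a minimal sketch of both the Gaussian-noise addition and the heat-map construction, using OpenCV as one convenient choice for the JET color map, interpolation, and overlay. The class token is assumed to be token index 0, and the blending weights are illustrative assumptions rather than the authors' settings.

```python
# Sketch of Appendix A (noise addition) and Appendix B (RA heat map overlay).
import numpy as np
import cv2

def add_gaussian_noise(img_uint8: np.ndarray, sigma: float = 0.2) -> np.ndarray:
    """Appendix A: map [0, 255] -> [0, 1], add N(0, sigma) per pixel, clip to [0, 1],
    then map back to [0, 255] with nearest rounding."""
    x = img_uint8.astype(np.float32) / 255.0
    x = np.clip(x + np.random.normal(0.0, sigma, x.shape), 0.0, 1.0)
    return np.rint(x * 255.0).astype(np.uint8)

def rollout_heat_map(ra: np.ndarray, image_bgr: np.ndarray, patch: int = 16) -> np.ndarray:
    """Appendix B: take the class-token row of the (N+1) x (N+1) RA map (CT assumed
    at index 0), drop the CT-on-CT entry, reshape to sqrt(N) x sqrt(N), upsample to
    image size, color it with JET, and overlay it on the original image (BGR uint8,
    same size as the upsampled map)."""
    ct_row = ra[0, 1:]                                   # attention of CT on the N patches
    n = int(np.sqrt(ct_row.size))
    h = ct_row.reshape(n, n).astype(np.float32)
    h = (h - h.min()) / (h.max() - h.min() + 1e-12)      # normalize to [0, 1]
    h = cv2.resize(h, (n * patch, n * patch), interpolation=cv2.INTER_LINEAR)
    heat = cv2.applyColorMap((h * 255).astype(np.uint8), cv2.COLORMAP_JET)
    return cv2.addWeighted(image_bgr, 0.5, heat, 0.5, 0)
```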
C Missing Empirical Results in Section 3

The impact of the number of neighbor images. Recall that we use 𝑛 neighbor images, generated by adding independent noise to an original image from the dataset, to construct the feature vector. Figure 15 shows the impact of 𝑛 on the attack accuracy on CIFAR10, CIFAR100, and ImageNet100. Both RAMIA-Threshold and RAMIA-Classifier are evaluated. There is a roughly consistent trend that the attack accuracy increases as 𝑛 grows for the threshold-based model, which is not surprising because we compute an average value over the components of the feature vector before determining a proper threshold: more neighbor images sharpen the disparity between members and non-members. However, for the classifier-based model, except on CIFAR10, the highest accuracy arises when 𝑛 = 8. A possible reason is that higher-dimensional feature vectors cause the inference classifier to overfit, reducing its generalization ability. Such observations suggest that infinitely increasing the number of neighbor images is unreliable: a larger 𝑛 may not improve accuracy but requires a longer computation time. Choosing a proper 𝑛 is crucial.

[Figure 15: The impact of the number of image neighbors 𝑛 on attack accuracy (%). RAMIA-Threshold and RAMIA-Classifier are examined with 𝑛 set to 1, 2, 4, 8, 16 on CIFAR10, CIFAR100, and ImageNet100.]

The impact of similarity metrics. We compute a cosine similarity of RA maps as the feature vector in our RAMIA. Besides, we consider another similarity metric, the Pearson correlation coefficient (PCC), and evaluate its effect. Accuracy, precision, and recall of both RAMIA-Threshold and RAMIA-Classifier are presented in Table 5. Compared to the results in Table 2, there is no major difference in attack performance between the two metrics, providing further support for the effectiveness of our RAMIA.

Table 5: Attack performance (%) using the Pearson correlation coefficient for the similarity computation.

Method | Dataset | Accuracy | Precision | Recall
RAMIA-T | CIFAR10 | 87.66 | 83.99 | 90.43
RAMIA-T | CIFAR100 | 88.33 | 80.49 | 93.45
RAMIA-T | ImageNet100 | 90.25 | 85.66 | 95.73
RAMIA-C | CIFAR10 | 89.69 | 90.87 | 87.51
RAMIA-C | CIFAR100 | 90.88 | 97.68 | 86.37
RAMIA-C | ImageNet100 | 88.63 | 93.35 | 83.77

D Missing Empirical Results in Section 4

The impact of mosaic embedding methods. In the process of position mosaic, in addition to using a global learnable mosaic embedding 𝜔 as presented in Algorithm 2, we propose two other alternative solutions: MMUT-Avg and MMUT-Zero. Neither of them learns a mosaic embedding during training. Instead, MMUT-Avg computes the average value of the non-mosaicked PEs as the mosaic, while MMUT-Zero mosaics PEs with a zero matrix. Table 6 presents their performance in both privacy preservation and prediction accuracy. Both MMUT-Avg and MMUT-Zero instead …

Table 6: Prediction and attack accuracies (%) using different mosaic methods. T is for threshold and C is for classifier.

Method | Dataset | MMUT | MMUT-Avg | MMUT-Zero
RAMIA-T | CIFAR10 | 60.63 | 67.19 | 61.79
RAMIA-T | CIFAR100 | 55.23 | 79.4 | 56.51
RAMIA-T | ImageNet100 | 71.99 | 89.25 | 89.05
RAMIA-C | CIFAR10 | 57.67 | 77.46 | 60.38
RAMIA-C | CIFAR100 | 51.77 | 84.21 | 52.91
RAMIA-C | ImageNet100 | 66.18 | 94.24 | 91.08

Defense against semi-adaptive RAMIA. Besides the adaptive and non-adaptive assumptions on the attacker's information, we examine defenses under a semi-adaptive assumption, where the attacker is aware of the defense method but lacks knowledge about the specific parameters associated with it. In this case, the attacker can only guess such parameters. Figure 16 shows the attack accuracy of RAMIAs using different parameters in target models versus shadow models on CIFAR100. Note that blocks closer to the top left (closer to non-adaptive) are darker in color, with attack accuracies close to a random guess of 50%. Likewise, blocks closer to the diagonal (closer to adaptive) have lighter colors. Such assumptions are stricter for attackers, but MMUT can still remarkably reduce the attack accuracy from 87.94% for RAMIA-Threshold and 90.24% for RAMIA-Classifier. In any case, the experimental results illustrate the effectiveness of our MMUT in defending against RAMIAs.

[Figure 16: The attack accuracy (%) of semi-adaptive attackers attempting shadow models with different privacy parameters on CIFAR100; panels (a) defense against RAMIA-T and (b) defense against RAMIA-C. T is for threshold and C is for classifier.]
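As a complement to the comparison above, the following small sketch contrasts the three mosaic choices compared in Table 6 (the learnable embedding 𝜔 of MMUT, MMUT-Avg, and MMUT-Zero); tensor shapes and names are illustrative assumptions.

```python
# Sketch of the three mosaic variants; `pe` is (num_patches, dim), `mask` marks the
# mosaicked positions, and `omega` is the learnable mosaic embedding of Algorithm 2.
import torch

def mosaic_pe(pe: torch.Tensor, mask: torch.Tensor, omega: torch.Tensor,
              variant: str = "mmut") -> torch.Tensor:
    if variant == "mmut":          # learnable mosaic embedding omega
        fill = omega.expand_as(pe)
    elif variant == "avg":         # MMUT-Avg: mean of the non-mosaicked PEs
        fill = pe[~mask].mean(dim=0, keepdim=True).expand_as(pe)
    elif variant == "zero":        # MMUT-Zero: a zero matrix
        fill = torch.zeros_like(pe)
    else:
        raise ValueError(variant)
    return torch.where(mask.unsqueeze(-1), fill, pe)
```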
1270