Stealing Part of a Production Language Model
Nicholas Carlini 1 Daniel Paleka 2 Krishnamurthy (Dj) Dvijotham 1 Thomas Steinke 1 Jonathan Hayase 3
A. Feder Cooper 1 Katherine Lee 1 Matthew Jagielski 1 Milad Nasr 1 Arthur Conmy 1 Itay Yona 1
Eric Wallace 4 David Rolnick 5 Florian Tramèr 2
Abstract

We introduce the first model-stealing attack that extracts precise, nontrivial information from black-box production language models like OpenAI's ChatGPT or Google's PaLM-2. Specifically, our attack recovers the embedding projection layer (up to symmetries) of a transformer model, given typical API access. For under $20 USD, our attack extracts the entire projection matrix of OpenAI's ada and babbage language models. We thereby confirm, for the first time, that these black-box models have a hidden dimension of 1024 and 2048, respectively. We also recover the exact hidden dimension size of the gpt-3.5-turbo model, and estimate it would cost under $2,000 in queries to recover the entire projection matrix. We conclude with potential defenses and mitigations, and discuss the implications of possible future work that could extend our attack.

1. Introduction

Little is publicly known about the inner workings of today's most popular large language models, such as GPT-4, Claude 2, or Gemini. The GPT-4 technical report states it "contains no [...] details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar" (OpenAI et al., 2023). Similarly, the PaLM-2 paper states that "details of [the] model size and architecture are withheld from external publication" (Anil et al., 2023). This secrecy is often ascribed to "the competitive landscape" (because these models are expensive to train) and the "safety implications of large-scale models" (OpenAI et al., 2023) (because it is easier to attack models when more information is available). Nevertheless, while these models' weights and internal details are not publicly accessible, the models themselves are exposed via APIs. But how much can an adversary learn about a production model by making queries to its API? This is the question studied by the field of model stealing (Tramèr et al., 2016): the ability of an adversary to extract model weights by making queries to its API.

Contributions. We introduce an attack that can be applied to black-box language models, and allows us to recover the complete embedding projection layer of a transformer language model. Our attack departs from prior approaches that reconstruct a model in a bottom-up fashion, starting from the input layer. Instead, our attack operates top-down and directly extracts the model's last layer. Specifically, we exploit the fact that the final layer of a language model projects from the hidden dimension to a (higher-dimensional) logit vector. This final layer is thus low-rank, and by making targeted queries to a model's API, we can extract its embedding dimension or its final weight matrix.

Stealing this layer is useful for several reasons. First, it reveals the width of the transformer model, which is often correlated with its total parameter count. Second, it slightly reduces the degree to which the model is a complete "black box", which might be useful for future attacks. Third, while our attack recovers only a (relatively small) part of the entire model, the fact that it is at all possible to steal any parameters of a production model is surprising, and raises concerns that extensions of this attack might be able to recover more information. Finally, recovering the model's last layer (and thus hidden dimension) may reveal more global information about the model, such as relative size differences between different models.

Our attack is effective and efficient, and is applicable to production models whose APIs expose full logprobs, or a "logit bias". This included Google's PaLM-2 and OpenAI's GPT-4 (Anil et al., 2023; OpenAI et al., 2023); after responsible disclosure, both APIs have implemented defenses to prevent our attack or make it more expensive. We extract the embedding layer of several OpenAI models with a mean squared error of 10^-4 (up to unavoidable symmetries). We apply a limited form of our attack to gpt-3.5 at a cost of under $200 USD and, instead of recovering the full embedding layer, recover just the size of the embedding dimension.

1 Google DeepMind, 2 ETH Zurich, 3 University of Washington, 4 OpenAI, 5 McGill University.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).
Responsible disclosure. We shared our attack with all services we are aware of that are vulnerable to this attack. We also shared our attack with several other popular services, even if they were not vulnerable to our specific attack, because variants of our attack may be possible in other settings. We received approval from OpenAI prior to extracting the parameters of the last layers of their models, worked with OpenAI to confirm our approach's efficacy, and then deleted all data associated with the attack. In response to our attack, OpenAI and Google have both modified their APIs to introduce mitigations and defenses (like those that we suggest in Section 8) to make it more difficult for adversaries to perform this attack.

2. Fidelity: the stolen model f̂ should be functionally equivalent to the target model f on all inputs. That is, for any valid input p, we want f̂(p) ≈ f(p).

In this paper, we focus on high-fidelity attacks. Most prior high-fidelity attacks exploit specific properties of deep neural networks with ReLU activations. Milli et al. (2019) first showed that if an attacker can compute gradients of a target two-layer ReLU model, then they can steal a nearly bit-for-bit equivalent model. Jagielski et al. (2020) observed that if the attacker only has query access to model outputs, they can approximate gradients with finite differences. Subsequent work extended these attacks to efficiently extract deeper ReLU models (Carlini et al., 2020; Rolnick & Kording, 2020; Shamir et al., 2023). Unfortunately, none of these approaches scale to production language models, because they (1) accept tokens as inputs (and so performing finite differences is intractable); (2) use activations other than ReLUs; (3) contain architectural components such as attention, layer normalization, residual connections, etc. that current attacks cannot handle; (4) are orders-of-magnitude larger than prior extracted models; and (5) expose only limited-precision outputs.

Other attacks aim to recover more limited information, or assume a stronger adversary. Wei et al. (2020) show that an adversary co-located on the same server as the LLM can recover the sizes of all hidden layers. Zanella-Beguelin et al. (2021) assume a model with a public pretrained encoder and a private final layer, and extract the final layer; our Section 4.2 is quite similar to their method. Others have attempted to recover model sizes by correlating performance on published benchmarks with model sizes in academic papers (Gao, 2021).

3. Problem Formulation

We study models that take a sequence of tokens drawn from a vocabulary X as input. Let P(X) denote the space of probability distributions over X. We study parameterized models fθ : X^N → P(X) that produce a probability distribution over the next output token, given an input sequence of N tokens. The model has the following structure:

    fθ(p) = softmax(W · gθ(p)),

where gθ : X^N → R^h maps the input sequence to a final hidden state of dimension h, and W ∈ R^{l×h} is the embedding projection matrix that maps this hidden state to logits over the vocabulary of size l = |X|.

Note that the hidden dimension size is much smaller than the size of the token dictionary, i.e., h ≪ l. For example, LLaMA (Touvron et al., 2023) chooses h ∈ {4096, 5120, 6656, 8192} and l = 32,000, and there is a recent trend towards increasingly large token sizes; GPT-4, for example, has a ≈100,000 token vocabulary.

Threat model. Throughout the paper, we assume that the adversary does not have any additional knowledge about the model parameters. We assume access to a model fθ, hosted by a service provider and made available to users through a query interface (API) O. We assume that O is a perfect oracle: given an input sequence p, it produces y = O(p) without leaking any other information about fθ than what can be inferred from (p, y). For example, the adversary cannot infer anything about fθ via timing side-channels or other details of the implementation of the query interface.

Different open-source and proprietary LLMs offer APIs with varying capabilities, which impact the ability to perform model extraction attacks and the choice of attack algorithm. A summary of the different APIs we study, and our motivation for doing so, is presented in Table 1. The logits API is a strawman threat model where the API provides logits for all tokens in the response to a given prompt. We begin with this toy setting, as the attack techniques we develop here can be reused in subsequent sections, where we will first reconstruct the logits from more limited information (e.g., log-probabilities for only the top few tokens) and then run the attack.
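To make this setup concrete, the following numpy sketch simulates the model structure fθ(p) = softmax(W · gθ(p)) and the strawman logits API. The dimensions, the stubbed-out gθ, and all names here are illustrative assumptions for a toy simulation, not any production API:

```python
import numpy as np
import zlib

rng = np.random.default_rng(0)

# Illustrative toy dimensions: the hidden size h is much smaller than the
# vocabulary size l (h << l), as in the problem formulation above.
h, l = 16, 200

# W is the secret l x h embedding projection layer; g_theta is stubbed out as
# a deterministic map from a prompt to an h-dimensional hidden state.
W = rng.standard_normal((l, h))

def g_theta(prompt: str) -> np.ndarray:
    seed = zlib.crc32(prompt.encode())
    return np.random.default_rng(seed).standard_normal(h)

def logits_oracle(prompt: str) -> np.ndarray:
    """Strawman 'logits API' O(p): the full logit vector W . g_theta(p)."""
    return W @ g_theta(prompt)

# Each response is an l-dimensional vector, but responses to many different
# prompts only span an h-dimensional subspace of R^l.
Q = np.stack([logits_oracle(f"prompt {i}") for i in range(3 * h)])
print(Q.shape, np.linalg.matrix_rank(Q))  # (48, 200), but rank is only h = 16
```

The low-rank structure visible in the final line is exactly what the attacks in the next section exploit.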
In this section, we assume the adversary can directly view the logits that feed into the softmax function for every token in the vocabulary (we will later relax this assumption), i.e.,

    O(p) ← W · gθ(p).

We develop new attack techniques that allow us to perform high-fidelity extraction of (a small part of) a transformer. Section 4.1 demonstrates how we can identify the hidden dimension h using the logits API and Section 4.2 presents an algorithm that can recover the matrix W.

4.1. Warm-up: Recovering Hidden Dimensionality

We begin with a simple attack that allows an adversary to recover the size of the hidden dimension of a language model by making queries to the oracle O (Algorithm 1). The techniques we use to perform this attack will be the foundation for attacks that we further develop to perform complete extraction of the final embedding projection matrix.

Algorithm 1 Hidden-Dimension Extraction Attack
Require: Oracle LLM O returning logits
1: Initialize n to an appropriate value greater than h
2: Initialize an empty n × l matrix Q
3: for i = 1 to n do
4:   pi ← RandPrefix()   ▷ Choose a random prompt
5:   Qi ← O(pi)
6: end for
7: λ1 ≥ λ2 ≥ · · · ≥ λn ← SingularValues(Q)
8: count ← arg max_i (log λi − log λi+1)
9: return count

Intuition. Suppose we query a language model on a large number of different random prefixes. Even though each output logit vector is an l-dimensional vector, they all actually lie in an h-dimensional subspace because the embedding projection layer up-projects from h dimensions. Therefore, by querying the model "enough" (more than h times) we will eventually observe that new queries are linearly dependent on past queries. We can then compute the dimensionality of this subspace (e.g., with SVD) and report this as the hidden dimensionality of the model.

Formalization. The attack is based on the following straightforward mathematical result:

Lemma 4.1. Let Q(p1, . . . , pn) be the matrix of logit vectors O(p1), . . . , O(pn) for n > h queries. Assume that the matrix with columns gθ(pi) and the matrix W both have rank h. Then

    h = rank(Q(p1, . . . , pn)).

Proof. We have Q = W · H, where H is an h × n matrix whose columns are gθ(pi) (i = 1, . . . , n). Thus, h ≥ rank(Q). Further, if H has rank h (with the second assumption), then h = rank(Q).

Assumptions. In Lemma 4.1, we assume that both the matrix with columns gθ(pi) and the matrix W have rank h. These matrices have either h rows or h columns, so both have rank at most h. Moreover, it is very unlikely that they have rank < h: this would require the distribution of gθ(p) to be fully supported on a subspace of dimension < h across all pi we query, or all h ≪ l columns of W to lie in the same (h − 1)-dimensional subspace of R^l (the output space of logits). In practice we find this assumption holds for all larger models (Table 2) and when different normalization layers are used (Appendix B.1).

Practical considerations. Since the matrix Q is not computed over the reals, but over floating-point numbers (possibly with precision as low as 16 bits or 8 bits for production neural networks), we cannot naively take the rank to be the number of linearly independent rows. Instead, we use a practical numerical rank of Q, where we order the singular values λ1 ≥ λ2 ≥ · · · ≥ λn, and identify the largest multiplicative gap λi/λi+1 between consecutive singular values. A large multiplicative gap arises when we switch from large "actual" singular values to small singular values that arise from numerical imprecision. Figure 2 shows these gaps. Algorithm 1 describes this attack.

Experiments. In order to visualize the intuition behind this attack, Figure 1 illustrates an attack against the Pythia-1.4B LLM. Here, we plot the magnitude of the singular values of Q as we send an increasing number n of queries to the model. When we send fewer than 2048 queries it is impossible to identify the dimensionality of the hidden space. This is because n < h, and so the n × l dimensional matrix Q has full rank and n nontrivial singular values. But once we make more than 2048 queries to the model, and thus n > h, the number of numerically significant singular values does not increase further; it is capped at exactly 2048.

Figure 1. SVD can recover the hidden dimensionality of a model when the final output layer dimension is greater than the hidden dimension. Here we extract the hidden dimension (2048) of the Pythia 1.4B model. We can precisely identify the size by obtaining slightly over 2048 full logit vectors.

In Figure 2 we plot the difference (in log-space) between subsequent singular values. As we can see, the largest difference occurs at (almost exactly) the 2048th singular value, the true hidden dimensionality of this model.

Figure 2. Our extraction attack recovers the hidden dimension by identifying a sharp drop in singular values, visualized as a spike in the difference between consecutive singular values. On Pythia-1.4B, a 2048-dimensional model, the spike occurs at 2047 values.

We now analyze the efficacy of this attack across a wider range of models: GPT-2 (Radford et al., 2019) Small and XL, Pythia (Biderman et al., 2023) 1.4B and 6.9B, and LLaMA (Touvron et al., 2023) 7B and 65B. The results are in Table 2: our attack recovers the embedding size nearly perfectly, with an error of 0 or 1 in five out of six cases.

Our near-perfect extraction has one exception: GPT-2 Small. On this 768-dimensional model, our attack reports a hidden dimension of 757. In Appendix A we show that this "failure" is caused by GPT-2 actually having an effective hidden dimensionality of 757 despite having 768 dimensions.

4.2. Full Layer Extraction (up to Symmetries)

Method: Let Q be as defined in Algorithm 1. Now rewrite Q = U · Σ · V⊤ with SVD. Previously we saw that the number of large enough singular values corresponded to the dimension of the model. But it turns out that the matrix U actually directly represents (a linear transformation of) the final layer! Specifically, we can show that U · Σ = W · G for some h × h matrix G in the following lemma.

Lemma 4.2. U · Σ = W · G for some h × h matrix G.

Proof. See Appendix C. □

Note that we could also use Q = W · G for n = l. The SVD construction above gains numerical precision if n > l.

Experiments. For the six models considered previously, we evaluate the attack success rate by comparing the root mean square (RMS) between our extracted matrix W̃ = U · Σ and the actual weight matrix, after allowing for an h × h affine transformation. Concretely, we solve the least squares system W̃ · G ≈ W for G, which reduces to h linear least squares problems, each with l equations and h unknowns. Then, we report the RMS of W and W̃ · G.

The results are in Table 2. As a point of reference, the RMS between a randomly initialized model and the actual weights is 2 · 10^-2, over 100–500× higher than the error of our reconstruction.

In Appendices C and H, we show that reconstruction is possible up to an orthogonal transformation (approximately h²/2 missing parameters, as opposed to h² for reconstruction up to an affine transformation), and that this is tight under some formal assumptions. However, we only have an efficient algorithm for reconstruction up to affine transformations.
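Both steps of this section can be reproduced end-to-end on a simulated model. The sketch below (numpy only; the dimensions and noise level are illustrative assumptions) runs Algorithm 1's largest-log-gap test and then forms W̃ = U · Σ, checking the affine alignment by least squares as in the experiments above:

```python
import numpy as np

rng = np.random.default_rng(0)
h, l, n = 32, 512, 100        # hidden dim, vocab size, number of queries (n > h)

W = rng.standard_normal((l, h))            # secret embedding projection layer
H = rng.standard_normal((h, n))            # hidden states g_theta(p_i)
Q = W @ H                                  # l x n; column i is the logits for p_i
Q += 1e-6 * rng.standard_normal(Q.shape)   # simulate limited-precision outputs

# --- Algorithm 1: numerical rank via the largest multiplicative gap ---
s = np.linalg.svd(Q, compute_uv=False)     # singular values, sorted descending
gaps = np.log(s[:-1]) - np.log(s[1:])
h_hat = int(np.argmax(gaps)) + 1           # number of "actual" singular values
print("recovered hidden dimension:", h_hat)

# --- Section 4.2: W_tilde = U . Sigma equals W up to an h x h transformation ---
U, S, _ = np.linalg.svd(Q, full_matrices=False)
W_tilde = U[:, :h_hat] * S[:h_hat]         # l x h extracted layer

# Evaluate as in the experiments: solve the least squares system
# W_tilde . G ~ W for G, then report the RMS between W and W_tilde . G.
G, *_ = np.linalg.lstsq(W_tilde, W, rcond=None)
rms = np.sqrt(np.mean((W_tilde @ G - W) ** 2))
print("alignment RMS:", rms)               # tiny, vs order-1 for a random guess
```

On this toy instance the log-gap lands exactly at the true hidden dimension, and the aligned RMS is at the scale of the injected noise.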
Table 2. Our attack succeeds across a range of open-source models, at both stealing the model size, and also at reconstructing the output projection matrix (up to invariances; we show the root MSE).

5. Extraction Attack for Logit-Bias APIs

The above attack makes a significant assumption: that the adversary can directly observe the complete logit vector for each input. In practice, this is not true: no production model we are aware of provides such an API. Instead, for example, they provide a way for users to get the top-K (by logit) token log probabilities. In this section we address this challenge.

5.1. Description of the API

In this section we develop attacks for APIs that return log probabilities for the top K tokens (sorted by logits), and where the user can specify a real-valued bias b ∈ R^|X| (the "logit bias") to be added to the logits for specified tokens before the softmax, i.e.,

    O(p, b) ← TopK(logsoftmax(W gθ(p) + b))
            = TopK( W gθ(p) + b − log( Σi exp(W gθ(p) + b)i ) · 1 ),

where TopK(z) returns the K highest entries of z ∈ R^l and their indices. Many APIs (prior to this paper) provided such an option for their state-of-the-art models (OpenAI, 2024; Google, 2024). In particular, the OpenAI API supports modifying logits for at most 300 tokens, and the logit bias for each token is restricted to the range [−100, 100] (OpenAI, 2023).

All that remains is to show that we can uncover the full logit vector for distinct prompt queries through this API. In this section, we develop techniques for this purpose. Once we have recovered multiple complete logit vectors, we can run the attack from Section 4.2 without modification.

5.2. Evaluation Methodology

Practical attacks must be efficient, both to keep the cost of extraction manageable and to bypass any rate limiters or other filters in the APIs. We thus begin with two cost definitions that we use to measure the efficacy of our attack.

Token cost: the number of tokens the adversary sends to (or receives from) the model during the attack. Most APIs charge users per-token, so this metric represents the monetary cost of an attack (after scaling by the token cost).

5.3. Extraction Attack for Top-5 Logit Bias APIs

We develop a technique to compute the logit vector for any prefix p via a sequence of queries with varying logit biases. To begin, suppose that the API returned the top K logits. Then we could recover the complete logit vector for an arbitrary prompt p by cycling through different choices for the logit bias and measuring the top-K logits each time. In particular, for an API with top-5 logits we can send a sequence of queries

    O(p, bk = bk+1 = . . . = bk+4 = B),   for k ∈ {0, 5, 10, . . . , |X|},

with a large enough B. Each query thus promotes five different tokens {k, k + 1, . . . , k + 4} into the top 5, which allows us to observe their logits. By subtracting the bias B and merging answers from all of these queries, we recover the entire logit vector.

Unfortunately, we cannot use this attack directly because all production APIs we are aware of return logprobs (the log of the softmax output of the model) instead of the logits zi. The problem now is that when we apply a logit bias B to the i-th token and observe that token's logprob, we get the value

    yi^B = zi + B − log( Σ_{j≠i} exp(zj) + exp(zi + B) ),

where zi are the original logits. We thus get an additional bias-dependent term which we need to deal with. We propose two approaches.

Our first approach relies on a common "reference" token that lets us learn the relative difference between all logits (this is the best we can hope for, since the softmax is invariant under additive shifts to the logits). Suppose the top token for a prompt is R, and we want to learn the relative difference between the logits of tokens i and R. We add a large bias B to token i to push it into the top 5, and then observe the logprobs of both token i and R. We have:

    yR^B − (yi^B − B) = zR − zi.

Since we can observe 5 logprobs, we can compare the reference token R to four tokens per query, by adding a large bias that pushes all four tokens into the top 5 (along with the reference token). We thus issue a sequence of queries

    O(p, bi = bi+1 = bi+2 = bi+3 = B)

for i ∈ {0, 4, 8, . . . , |X|}. This recovers the logits up to the free parameter zR that we set to 0.

Query cost. This attack reveals the value of K − 1 logits with each query to the model (the K-th being used as a reference point), for a cost of 1/(K − 1) queries per logit.

In Appendix E we present a second, more sophisticated method that allows us to recover K logits per query, i.e., a cost of 1/K, by viewing each logprob we receive as a linear constraint on the original logits.

Token cost. Recall that our attack requires that we learn the logits for several distinct prompts; and so each prompt must be at least one token long. Therefore, this attack costs at least two tokens per query (one input and one output), or a cost of 1/2 for each token of output. But, in practice, many models (like gpt-3.5-turbo) include a few tokens of overhead along with every single query. This increases the token cost per logit to (2 + ∆)/4 where ∆ is the number of overhead tokens; for gpt-3.5-turbo we report ∆ = 7.

An improved cost-optimal attack. It is possible to generalize the above attack to improve both the query cost and token cost. Instead of issuing queries to the model that reveal 4 or 5 logit values for a single generated token, we might instead hope to be able to send a multi-token query [p0 p1 p2 . . . pn] and then ask for the logprob vector for each prefix of the prompt [p0], [p0 p1], [p0 p1 p2], etc. OpenAI's API did allow for queries of this form in the past, by providing logprobs for prompt tokens as well as generated tokens by combining the logprob and echo parameters; this option has since been removed.

Now, it is only possible to view logprobs of generated tokens. And since only the very last token is generated, we can only view four logprobs for this single longer query. This, however, presents a potential approach to reduce the query and token cost: if there were some way to cause the model to emit a specific sequence of tokens [pn+1 pn+2 . . . pn+m], then we could inspect the logprob vector of each generated token.

We achieve this as follows: we fix a token x and four other tokens, and force the model to emit [x x . . . x]. Instead of supplying a logit bias of B for each of the five tokens, we supply a logit bias of B for token x, and B′ < B for the other four tokens. If B′ is large enough so that the other tokens will be brought into the top-5 outputs, we will still be able to learn the logits for those tokens. As long as B′ is small enough so that the model will always complete the initial prompt p0 with token x (and not any other), then we will be able to collect the logits on several prompts of the form [p0 x x . . . x].

Analysis. It is easy to see that the query cost of this attack is 1/(4m), where m is the expansion factor. Further, since each query requires 1 + m tokens, the token cost is (1 + m)/(4m). (Or, (1 + m + ∆)/(4m) if the API has an overhead of ∆ tokens.) Note that if m = 1, i.e., there is no expansion, this attack reduces to our first attack and the analysis similarly gives a query cost of 1/4 and a token cost of 1/2.

5.4. Extraction Attack for Top-1 Binary Logit Bias APIs

In light of our attacks, it is conceivable that model providers introduce restrictions on the above API. We now demonstrate that an attack is possible even if the API only returns the top logprob (K = 1 in the API from Section 5.1), and the logit bias is constrained to only take one of two values.

API. We place the following two further restrictions on the logit bias API (Section 5.1): first, we set K = 1, and only see the most likely token's logprob; and second, each logit bias entry b is constrained to be in {−1, 0}. These constraints would completely prevent the attacks from the prior section. We believe this constraint is significantly tighter than any practical implementation would define.

Method. At first it may seem impossible to learn any information about a token t if it is not already the most likely token. However, note that if we query the model twice, once without any logit bias, and once with a logit bias of −1 for token t, then the top token will be slightly more likely with a bias of −1, with exactly how slight depending on the value of token t's logprob. Specifically, in Appendix D we show that token t's probability equals (1/e − 1)^-1 (exp(ytop − y′top) − 1), where ytop and y′top are the logprobs of the most likely token when querying with logit bias of 0 and −1, respectively.

Analysis. This attack requires 1 query and token per logprob extracted. However, as we will show in the evaluation, this attack is much less numerically stable than the previously discussed attacks, and so may require more queries to reach the same level of accuracy.

6. Logprob-Free Attacks

Due to space constraints, in Appendix F, we show we can still extract logits without logprob access, although with a higher cost.

Intuitively, even without logprobs (as long as we still have logit bias) it is possible to perform binary search to increase and decrease the logits for every token until increasing any token by epsilon would make it the most likely. At this point, the logit bias vector corresponds directly to the (relative) logits of each token relative to every other.
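The reference-token attack of Section 5.3 is simple enough to verify in simulation. The sketch below (numpy; toy vocabulary size, and a simulated top-5 logprob API rather than any real provider's) recovers every logit relative to the reference token R:

```python
import numpy as np

rng = np.random.default_rng(0)
l, B = 50, 40.0                      # toy vocabulary size and a large bias value
z = rng.standard_normal(l)           # the hidden logits we want to recover

def top5_logprob_api(bias: np.ndarray) -> dict:
    """Simulated API: {token: logprob} for the 5 highest biased logits."""
    biased = z + bias
    logprobs = biased - np.log(np.sum(np.exp(biased)))
    top = np.argsort(logprobs)[-5:]
    return {int(t): float(logprobs[t]) for t in top}

R = int(np.argmax(z))                # reference token: the top token with no bias
recovered = np.zeros(l)
for i in range(0, l, 4):
    idx = [t for t in range(i, min(i + 4, l)) if t != R]
    bias = np.zeros(l)
    bias[idx] = B                    # promote four tokens into the top 5, next to R
    out = top5_logprob_api(bias)
    for t in idx:
        # y_R^B - (y_t^B - B) = z_R - z_t, so each query yields four differences.
        recovered[t] = (out[t] - B) - out[R]      # equals z_t - z_R
recovered[R] = 0.0                   # free parameter z_R fixed to 0

print("max error:", float(np.abs(recovered - (z - z[R])).max()))
```

Note that because the softmax normalizer cancels when differencing two logprobs from the same query, the bias-dependent term drops out exactly, which is why the recovery is accurate to floating-point precision in this simulation.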
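The binary-search idea above can be sketched as follows. This is a simplified per-token variant against a simulated API that reveals only the greedily sampled token; the toy sizes and per-token search order are assumptions of this sketch, not the exact procedure of Appendix F:

```python
import numpy as np

rng = np.random.default_rng(2)
l = 20
z = rng.standard_normal(l)              # toy logits of a simulated model

def greedy_token(bias: np.ndarray) -> int:
    """Logprob-free API: reveals only the id of the most likely token."""
    return int(np.argmax(z + bias))

top = greedy_token(np.zeros(l))
recovered = np.zeros(l)
for t in range(l):
    if t == top:
        continue                        # z_top serves as the free reference point
    lo, hi = 0.0, 20.0                  # bias window assumed wide enough to flip t
    for _ in range(50):                 # 50 halvings: ~2e-14 precision
        mid = (lo + hi) / 2
        bias = np.zeros(l)
        bias[t] = mid
        if greedy_token(bias) == t:     # token t overtook the top token
            hi = mid
        else:
            lo = mid
    recovered[t] = -hi                  # threshold bias equals z_top - z_t

print("max error:", float(np.abs(recovered - (z - z[top])).max()))
```

Each token costs one binary search of queries rather than a single query, which is why the logprob-free setting is strictly more expensive than the attacks of Section 5.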
bit-for-bit copy of the matrix W: while there exists an infinite number of matrices W · G, only one will be discretized properly. Unfortunately, this integer-constrained problem is NP-hard in general (similar problems are the foundation for an entire class of public-key cryptosystems). But this need not imply that the problem is hard on all instances.

Extending this attack beyond a single layer. Our attack recovers a single layer of a transformer. We see no obvious methodology to extend it beyond just a single layer, due to the non-linearity of the models. But we invite further research in this area.

Removing the logit bias assumption. All our attacks require the ability to pass a logit bias. Model providers including Google and OpenAI provided this capability when we began the writing of this paper, but this could change. (Indeed, it already has, as model providers begin implementing defenses to prevent this attack.) Other API parameters could give alternative avenues for learning logit information. For example, unconstrained temperature and top-k parameters could also leak logit values through a series of queries. In the long run, completely hiding the logit information might be challenging due both to public demand for the feature, and the ability of adversaries to infer this information through other means.

Exploiting the stolen weights. Recovering a model's embedding projection layer might improve other attacks against that model. Alternatively, an attacker could infer details about a provider's finetuning API by observing changes (or the absence thereof) in the last layer. In this paper, we focus primarily on the model extraction problem and leave exploring downstream attacks to future work.

Practical stealing of other model information. Existing high-fidelity model stealing attacks are "all-or-nothing" attacks that recover entire models, but only apply to small ReLU networks. We show that stealing partial information can be much more practical, even for state-of-the-art models. Future work may find that practical attacks can steal many more bits of information about current proprietary models.

10. Conclusion

As the field of machine learning matures, and models transition from research artifacts to production tools used by millions, the field of adversarial machine learning must also adapt. While it is certainly useful to understand the potential applicability of model stealing to three-layer 100-neuron ReLU-only fully-connected networks, at some point it becomes important to understand to what extent attacks can actually be applied to the largest production models.

This paper takes one step in that direction. We give an existence proof that it is possible to steal one layer of a production language model. While there appear to be no immediate practical consequences of learning this layer, it represents the first time that any precise information about a deployed transformer model has been stolen. Two immediate open questions are (1) how hazardous these practical stealing attacks are and (2) whether they pose a greater threat to developers and the security of their models than black-box access already does via distillation or other approximate stealing attacks.

Our attack also highlights how small design decisions influence the overall security of a system. Our attack works because of the seemingly innocuous logit-bias and logprobs parameters made available by the largest machine learning service providers, including OpenAI and Google, although both have now implemented mitigations to prevent this attack or make it more expensive. Practitioners should strive to understand how system-level design decisions impact the safety and security of the full product.

Overall, we hope our paper serves to further motivate the study of practical attacks on machine learning models, in order to ultimately develop safer and more reliable systems.

Impact Statement

This paper is the most recent in a line of work that demonstrates successful attacks on production models. As such, we take several steps to mitigate the near-term potential harms of this research. As discussed throughout the paper, we have worked closely with all affected products to ensure that mitigations are in place before disclosing this work. We have additionally sent advance copies of this paper to all potentially affected parties, even if we were unable to precisely verify our attack.

Long-term, we believe that openly discussing vulnerabilities that have practical impact is an important strategy for ensuring safe machine learning. This vulnerability exists whether or not we report on it. Especially for attacks that are simple to identify (as evidenced by the concurrent work of Finlayson et al. (2024) that discovered this same vulnerability), malicious actors are also likely to discover the same vulnerability whether or not we report on it. By documenting it early, we can ensure future systems remain secure.

Acknowledgements

We are grateful to Andreas Terzis and the anonymous reviewers for comments on early drafts of this paper. We are grateful to Joshua Achiam for helping to write the code for post-hoc modifying the model architecture. We are also grateful to OpenAI for allowing us to attempt our extraction attack on their production models.
define-token-probability. Accessed February 1, 2024.

OpenAI. Create chat completion, 2024. URL https://platform.openai.com/docs/api-reference/chat/create. Accessed January 30, 2024.

OpenAI et al. GPT-4 Technical Report, 2023.

Pal, S., Gupta, Y., Kanade, A., and Shevade, S. Stateful detection of model extraction attacks. arXiv preprint arXiv:2107.05166, 2021.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language Models are Unsupervised Multitask Learners. Technical report, OpenAI, 2019. URL https://rb.gy/tm8qh.

Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., Rutherford, E., Hennigan, T., Menick, J., Cassirer, A., Powell, R., van den Driessche, G., Hendricks, L. A., Rauh, M., Huang, P.-S., Glaese, A., Welbl, J., Dathathri, S., Huang, S., Uesato, J., Mellor, J., Higgins, I., Creswell, A., McAleese, N., Wu, A., Elsen, E., Jayakumar, S., Buchatskaya, E., Budden, D., Sutherland, E., Simonyan, K., Paganini, M., Sifre, L., Martens, L., Li, X. L., Kuncoro, A., Nematzadeh, A., Gribovskaya, E., Donato, D., Lazaridou, A., Mensch, A., Lespiau, J.-B., Tsimpoukelli, M., Grigorev, N., Fritz, D., Sottiaux, T., Pajarskas, M., Pohlen, T., Gong, Z., Toyama, D., de Masson d'Autume, C., Li, Y., Terzi, T., Mikulik, V., Babuschkin, I., Clark, A., de Las Casas, D., Guy, A., Jones, C., Bradbury, J., Johnson, M., Hechtman, B., Weidinger, L., Gabriel, I., Isaac, W., Lockhart, E., Osindero, S., Rimell, L., Dyer, C., Vinyals, O., Ayoub, K., Stanway, J., Bennett, L., Hassabis, D., Kavukcuoglu, K., and Irving, G. Scaling language models: Methods, analysis and insights from training Gopher, 2022.

Ren, J., Zhao, Y., Vu, T., Liu, P. J., and Lakshminarayanan, B. Self-evaluation improves selective generation in large language models. arXiv preprint arXiv:2312.09300, 2023.

Azhar, F., et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

Tramèr, F., Zhang, F., Juels, A., Reiter, M. K., and Ristenpart, T. Stealing machine learning models via prediction APIs. In USENIX Security Symposium, 2016.

Veit, A., Wilber, M. J., and Belongie, S. J. Residual networks behave like ensembles of relatively shallow networks. In Advances in Neural Information Processing Systems, pp. 550–558, 2016.

Wei, J., Zhang, Y., Zhou, Z., Li, Z., and Al Faruque, M. A. Leaky DNN: Stealing deep-learning model secret with GPU context-switching side-channel. In IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2020.

Yang, K. and Klein, D. FUDGE: Controlled text generation with future discriminators. In Toutanova, K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tur, D., Beltagy, I., Bethard, S., Cotterell, R., Chakraborty, T., and Zhou, Y. (eds.), ACL, 2021.

Zanella-Beguelin, S., Tople, S., Paverd, A., and Köpf, B. Grey-box extraction of natural language models. In International Conference on Machine Learning, pp. 12278–12286. PMLR, 2021.

Zhang, B. and Sennrich, R. Root mean square layer normalization. NeurIPS, 2019.
[Figure: sorted singular values vs. magnitude (log scale) for GPT-2 Small. (a) Default bfloat16 precision. (b) Higher float64 precision (singular values computed in float64).]
so cancels the LayerNorm bias term. Hence, if we apply the Lemma 4.1 attack with this subtraction modification to a model using LayerNorm, then the recovered hidden dimension h will be smaller by 1 (due to Appendix B.1). This would imply the model used LayerNorm rather than RMSNorm, because RMSNorm does not project onto a smaller subspace and so would not show a decrease in h if we were to use this subtraction trick.
B.2.2. Results

To confirm that the method from Appendix B.2.1 works, we test whether we can detect whether the GPT-2, Pythia, and LLAMA architectures use LayerNorm or RMSNorm from their logit outputs alone. We found that the technique required two adjustments before it worked on models with lower than 32-bit precision (it always worked with 32-bit precision). (i) We do not subtract O(p0) from logit queries, but instead subtract the mean logits over all queries, i.e., (1/n) Σ_{i=1}^n O(p_i). Since the average of several points in a common affine subspace still lies on that affine subspace, this does not change the conclusions from Appendix B.2.1. (ii) We additionally found it helped to calculate this mean in lower precision, before casting to 64-bit precision to calculate the compact SVD.
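The rank gap that this procedure detects can be reproduced in a small simulation (our illustrative sketch, not the released code; the vocabulary size, hidden dimension, and query count below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
l, h, n = 200, 32, 100   # vocab size, hidden dim, number of queries (illustrative)
W = rng.standard_normal((l, h))

def final_hidden(n_points, layernorm):
    # Final hidden states after the last normalization layer: RMSNorm puts
    # them on a sphere in R^h; LayerNorm first subtracts the mean, confining
    # them to a sphere inside an (h-1)-dimensional subspace.
    x = rng.standard_normal((h, n_points))
    if layernorm:
        x -= x.mean(axis=0, keepdims=True)
    return x / np.linalg.norm(x, axis=0, keepdims=True)

def numerical_rank(layernorm):
    Q = W @ final_hidden(n, layernorm)   # observed logits, one column per query
    Q -= Q.mean(axis=1, keepdims=True)   # adjustment (i): subtract the mean logits
    s = np.linalg.svd(Q, compute_uv=False)
    return int((s > 1e-8 * s[0]).sum())

assert numerical_rank(layernorm=False) == h       # RMSNorm: no drop
assert numerical_rank(layernorm=True) == h - 1    # LayerNorm: drop at the h-th value
```

In this idealized setting the h-th singular value vanishes entirely under LayerNorm; on real models the drop is finite but clearly visible, as in Figure 4.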
The results are in Figure 4. We plot the singular value magnitudes (as in Figure 1) and show that there is a drop in the h-th singular value for the architectures using LayerNorm, but not for the architecture using RMSNorm:
[Figure 4 panels, each plotting log10 ∥Σ_i∥ against the sorted index i near i = h: (a) LLAMA-7B (RMSNorm). (b) GPT-2 XL (LayerNorm). (c) Pythia-12B (LayerNorm).]
Figure 4. Detecting whether models use LayerNorm or RMSNorm by singular value magnitudes.
Is this attack practical for real models? We perform the same attack on the logprobs we obtained for ada and babbage.² We see in Figure 5a-b that indeed the drop in the h-th singular values occurs for these two models that use LayerNorm (GPT-3's architecture was almost entirely inherited from GPT-2):

Figure 5. Stress-testing the LayerNorm extraction attack on models behind an API (a-b), and models using both RMSNorm and biases (c).

As a final stress test, we found that all open language models that use RMSNorm do not use any bias terms (Biderman, 2024). Therefore, we checked that our attack would not give a false positive when applied to a model with RMSNorm but with biases. We chose Gopher-7B (Rae et al., 2022), a model with public architectural details but no public weight access, that uses RMSNorm but also biases (e.g. on the output logits). In Figure 5c we show that indeed the h-th singular value does not decrease for this model that uses RMSNorm.

² Unfortunately, we deleted the logprobs for GPT-3.5 models before we created this attack due to security constraints.
Lemma 4.2 In the logit-API threat model, under the assumptions of Lemma 4.1: (i) The method from Section 4.2 recovers
W̃ = W · G for some G ∈ Rh×h ; (ii) With the additional assumption that gθ (p) is a transformer with residual connections,
it is impossible to extract W exactly.
We first give a short proof of (i):
Proof. (i) To show we can recover W̃ = W · G, recall from Lemma 4.1 that we have access to Q⊤ = W · H for some H ∈ R^{h×n}. Using the compact SVD of Q from the method in Section 4.2, we get W · H · V = U · Σ. Setting G := H · V ∈ R^{h×h} and W̃ := U · Σ, we have W̃ = W · G.
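This identity can be checked numerically (our sketch; the dimensions below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
l, h, n = 50, 8, 30                 # vocab size, hidden dim, queries (illustrative)
W = rng.standard_normal((l, h))     # true embedding projection
H = rng.standard_normal((h, n))     # unknown final hidden states, one per query

Qt = W @ H                          # Q^T: the observed logit matrix
U, S, Vt = np.linalg.svd(Qt, full_matrices=False)
U, S, V = U[:, :h], S[:h], Vt[:h].T # compact (rank-h) SVD

W_tilde = U * S                     # = U @ diag(S), the extracted matrix
G = H @ V                           # the h-by-h matrix from the proof
assert np.allclose(W_tilde, W @ G)  # W~ = W · G, as claimed
```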
Proving Lemma 4.2(ii) requires several steps due to the complexity of the transformer architecture: we progressively
strengthen the proof to apply to models with no residual connections (C.1), models with residual connections (C.2), models
with RMSNorm (C.4), LayerNorm (C.5) and normalization with an ε term (C.6).
14
Stealing Part of a Production Language Model
where w is multiplied elementwise by normalized x. Clearly this can be written as a diagonal matrix. Further, we can multiply this diagonal matrix by √h to cancel that factor in the denominator of Equation (3). Since n(x) = x/∥x∥ = x / √(Σ_i x_i²), we get the result.
Intuitively, the proof in Appendix C.2 relied on pre-multiplying the input projection weight of each layer by a matrix S⁻¹, so that this cancelled the rotation S applied to the model's hidden state (called the 'residual stream' in the mechanistic interpretability literature (Elhage et al., 2021)). Formally, if we let the input projection layer be M, we were using the fact that MS⁻¹(Sx) = Mx. However, since models with normalization layers apply these before the linear input projection, applying the rotation S = U to the hidden state together with the same procedure on the weights produces the activation MU⊤n(Ux) = MU⊤Un(x) = Mn(x), which is identical to the original model. Applying this procedure to all layers added to the hidden state (using the different diagonal matrices each time) gives us a model gθ′(p) such that gθ′(p) = Ugθ(p), so a different embedding projection matrix WU⊤ will give identical outputs to the original model gθ(p) (with embedding projection W).
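The key step, that pure RMS normalization commutes with an orthogonal U so that replacing W by WU⊤ leaves the logits unchanged, can be checked numerically (our sketch; the elementwise weight is folded into W for simplicity):

```python
import numpy as np

rng = np.random.default_rng(2)
h, l = 16, 64
W = rng.standard_normal((l, h))                    # original embedding projection
U, _ = np.linalg.qr(rng.standard_normal((h, h)))   # an arbitrary orthogonal matrix

def n(x):
    # plain RMS normalization; the elementwise weight is folded into W here
    return x * np.sqrt(h) / np.linalg.norm(x)

x = rng.standard_normal(h)                         # a final hidden state
# ||Ux|| = ||x|| implies n(Ux) = U n(x), so rotating the hidden state by U
# while replacing W with W U^T yields the exact same logits:
assert np.allclose(W @ n(x), (W @ U.T) @ n(U @ x))
```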
Note that we ignore what happens to b in the above arguments, since any sequence of affine maps applied to a constant
b ∈ Rh yields a constant b′ ∈ Rh , and we can just use b′ instead of b in gθ′ .
since ||x|| = ||Ux||, n′ commutes with U, and so the proofs in Appendix C.4 and Appendix C.5 still work when using n′
instead of n.
Therefore finally, we have proven the impossibility result Lemma 4.2(ii) in all common model architectures (all non-residual
networks that end with dense layers, and all transformers from Biderman (2024)).
Let N = Σ_i exp(logit_i) and p = exp(logit_t)/N. Then, we can rewrite

y_top = logit_top − log N,
y′_top = logit_top − log(N + (1/e − 1)pN).

Subtracting the two, we get

y_top − y′_top = log(1 + (1/e − 1)p)

⟹ p = (exp(y_top − y′_top) − 1) / (1/e − 1).
Related work. Concurrent work (Morris et al., 2023) discusses a similar but weaker two-query logprob extraction. Their attack requires a logit bias larger than logit_top − logit_i and top-2 logprob access; our attack works as soon as the logit bias is allowed to be nonzero, and with top-1 logprob access.
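The two-query recovery can be checked against a simulated logprob API (our sketch; the `top_logprob` oracle and all constants are illustrative, and we assume the biased token is not the top token):

```python
import numpy as np

rng = np.random.default_rng(3)
logits = rng.standard_normal(50)
t = int(np.argsort(logits)[0])          # any token other than the top token

def top_logprob(bias=None):
    # simulated API: returns the logprob of the (unchanged) top token
    z = logits.copy()
    if bias is not None:
        i, b = bias
        z[i] += b
    top = int(np.argmax(logits))
    return z[top] - np.log(np.exp(z).sum())

y_top = top_logprob()
y_top_prime = top_logprob(bias=(t, -1.0))   # logit bias of -1 on token t
p_hat = (np.exp(y_top - y_top_prime) - 1) / (1 / np.e - 1)

p_true = np.exp(logits[t]) / np.exp(logits).sum()
assert np.allclose(p_hat, p_true)
```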
Each query may receive multiple answers (namely, the K largest ai (z, b) values). For notational simplicity, we denote
multiple answers to one query the same way as multiple queries each returning one answer. Suppose queries b1 , · · · , bm
were asked and we received m answers (i1 , ai1 (z, b1 )) ← O (p, b1 ), · · · , (im , aim (z, bm )) ← O (p, bm ).
Our goal is to compute z from the answers ai (z, b).
Fix a token index i and let bi = B and bj = 0 for all j ̸= i. We query the API with this logit bias and assume that B is
large enough that token i is returned:
(i, ai (z, b)) ← O (p, b).
From Equation 6,

a_i(z, b) = z_i + b_i − log Σ_{j=1}^ℓ exp(z_j + b_j)

= z_i + B − log( exp(z_i + B) + Σ_{j≠i} exp(z_j) )

= z_i + B − log( exp(z_i + B) − exp(z_i) + Σ_{j=1}^ℓ exp(z_j) ),

⟹ z_i + B − a_i(z, b) = log( exp(z_i + B) − exp(z_i) + Σ_{j=1}^ℓ exp(z_j) ),

⟹ exp(z_i + B − a_i(z, b)) = exp(z_i + B) − exp(z_i) + Σ_{j=1}^ℓ exp(z_j),

⟹ exp(z_i + B − a_i(z, b)) − exp(z_i + B) + exp(z_i) = Σ_{j=1}^ℓ exp(z_j),

⟹ exp(z_i) · (exp(B − a_i(z, b)) − exp(B) + 1) = Σ_{j=1}^ℓ exp(z_j),

⟹ exp(z_i) = Σ_{j=1}^ℓ exp(z_j) / (exp(B − a_i(z, b)) − exp(B) + 1),

⟹ z_i = log Σ_{j=1}^ℓ exp(z_j) − log( exp(B − a_i(z, b)) − exp(B) + 1 ).
Thus if we normalize Σ_{j=1}^ℓ exp(z_j) = 1, we have z_i = −log( exp(B − a_i(z, b)) − exp(B) + 1 ).
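This recovery rule can be exercised end-to-end against a simulated logprob API (our sketch; the vocabulary size and B are illustrative, and the logits are pre-normalized so that Σ_j exp(z_j) = 1):

```python
import numpy as np

rng = np.random.default_rng(4)
z = rng.standard_normal(20)
z -= np.log(np.exp(z).sum())   # normalize so that sum_j exp(z_j) = 1
B = 10.0                       # bias large enough to promote any token to the top

def api_logprob(i):
    # simulated logprob API: add bias B to token i, read off its logprob a_i(z, b)
    zb = z.copy()
    zb[i] += B
    assert np.argmax(zb) == i  # B is large enough that token i is returned
    return zb[i] - np.log(np.exp(zb).sum())

z_hat = np.array([-np.log(np.exp(B - api_logprob(i)) - np.exp(B) + 1)
                  for i in range(len(z))])
assert np.allclose(z_hat, z, atol=1e-6)   # one logit recovered per query
```

Note that exp(B − a_i) − exp(B) + 1 involves cancellation of large terms, which is why the numerical variants discussed in the related work below matter at large B.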
For each k ∈ {1, …, K}, we have

a_{i_k}(z, b) = z_{i_k} + B − log( Σ_{i ∈ {i_1,…,i_K}} exp(z_i + B) + Σ_{i ∉ {i_1,…,i_K}} exp(z_i) )

= z_{i_k} + B − log( (e^B − 1) · Σ_{i ∈ {i_1,…,i_K}} exp(z_i) + Σ_{i=1}^ℓ exp(z_i) )

= z_{i_k} + B − log( (e^B − 1) · Σ_{i ∈ {i_1,…,i_K}} exp(z_i) + N ),

⟹ z_{i_k} + B − a_{i_k}(z, b) = log( (e^B − 1) · Σ_{i ∈ {i_1,…,i_K}} exp(z_i) + N ),

⟹ exp(z_{i_k} + B − a_{i_k}(z, b)) = (e^B − 1) · Σ_{i ∈ {i_1,…,i_K}} exp(z_i) + N.
Note that A is a rank-one perturbation of a diagonal matrix: if 1 is the all-ones vector, the K equations above can be written as

A · (exp(z_{i_1}), …, exp(z_{i_K}))⊤ = N · 1,  with  A = diag_{1≤k≤K}( exp(B − a_{i_k}(z, b)) ) − (e^B − 1) · 1 1⊤,

where diag_{1≤k≤K}(exp(B − a_{i_k}(z, b))) denotes a diagonal matrix with the k-th diagonal entry being exp(B − a_{i_k}(z, b)). Inverting a diagonal matrix is easy and thus we can use the Sherman–Morrison formula to compute the inverse of A:
Here v denotes the vector with entries v_k = exp(a_{i_k}(z, b) − B), i.e. the inverse of the diagonal part applied to 1. The Sherman–Morrison formula gives

A⁻¹ · (N · 1) = ( v + (e^B − 1) v (1⊤ v) / (1 − (e^B − 1) 1⊤ v) ) · N

= ( 1 + (e^B − 1) 1⊤ v / (1 − (e^B − 1) 1⊤ v) ) · N · v

= N / (1 − (e^B − 1) Σ_j v_j) · v,

⟹ z_{i_k} = log [A⁻¹ · (N · 1)]_k

= log( N v_k / (1 − (e^B − 1) Σ_{j=1}^K v_j) )

= log( N exp(a_{i_k}(z, b) − B) / (1 − (e^B − 1) Σ_{j=1}^K exp(a_{i_j}(z, b) − B)) )

= log N + a_{i_k}(z, b) − B − log( 1 − (e^B − 1) Σ_{j=1}^K exp(a_{i_j}(z, b) − B) )

= log N + a_{i_k}(z, b) − B − log( 1 − (1 − e^{−B}) Σ_{j=1}^K exp(a_{i_j}(z, b)) ).
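The closed form can be checked with K tokens biased in a single simulated query (our sketch; here we treat the normalizer N as known purely to verify the algebra — in the attack, log N is the unknown common shift across all logits):

```python
import numpy as np

rng = np.random.default_rng(5)
z = rng.standard_normal(30)
N = np.exp(z).sum()                  # normalizer; its log is the common unknown shift
B = 12.0
idx = np.array([3, 11, 19, 27])      # K = 4 tokens biased in a single query

zb = z.copy()
zb[idx] += B                         # logit bias B on all K tokens at once
a = zb[idx] - np.log(np.exp(zb).sum())   # their K logprobs from one answer

# closed-form recovery from the final equation above:
z_hat = np.log(N) + a - B - np.log(1 - (1 - np.exp(-B)) * np.exp(a).sum())
assert np.allclose(z_hat, z[idx], atol=1e-6)   # K logits recovered per query
```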
Related work. Two works published during the responsible disclosure period use a similar procedure, and deal with numerical issues in different ways. Chiu (2024) starts with a low B for the whole vocabulary, then increases B and asks for all tokens that have not appeared before, repeating until all tokens are covered. Hayase et al. (2024) use the method in Appendix E.1, and set B = −ẑ_i, where ẑ_i is an estimate of z_i inherent to their application. It is possible that variants of this method have been discussed before ours or these works, but we are not aware of further references.
Suppose queries b1 , · · · , bm were asked and we received m answers (i1 , ai1 (z, b1 )) ← O (p, b1 ), . . . , (im , aim (z, bm )) ←
O (p, bm ). (If a query returns multiple answers, we can treat this the same as multiple queries each returning one answer.)
As before, rearranging Equation 6 gives the following equations.
∀k ∈ [m]:  exp(a_{i_k}(z, b^k)) = exp(z_{i_k} + b^k_{i_k}) / Σ_{j=1}^ℓ exp(z_j + b^k_j),

∀k ∈ [m]:  Σ_{j=1}^ℓ exp(z_j + b^k_j) = exp(z_{i_k} + b^k_{i_k} − a_{i_k}(z, b^k)),

∀k ∈ [m]:  Σ_{j=1}^ℓ exp(z_j) · exp(b^k_j) = exp(z_{i_k}) · exp(b^k_{i_k} − a_{i_k}(z, b^k)),

∀k ∈ [m]:  Σ_{j=1}^ℓ ( exp(b^k_j) − I[j = i_k] · exp(b^k_{i_k} − a_{i_k}(z, b^k)) ) · exp(z_j) = 0.

In matrix form,

A · (exp(z_1), exp(z_2), …, exp(z_ℓ))⊤ = 0,

where ∀k ∈ [m], ∀j ∈ [ℓ]:  A_{k,j} = exp(b^k_j) · (1 − I[j = i_k] · exp(−a_{i_k}(z, b^k))).
Here I[j = ik ] is 1 if j = ik and 0 otherwise. If A is invertible, then this linear system can be solved to recover the logits z.
Unfortunately, A is not invertible: Indeed, we know that the solution cannot be unique because shifting all the logits by the
same amount yields the exact same answers ai (z, b) = ai (z + 1, b). That is, we expect a one-dimensional space of valid
solutions to A · exp(z ) = 0. To deal with this we simply add the constraint that z1 = 0 or, equivalently, exp(z1 ) = 1. This
corresponds to the system
Â · (exp(z_1), exp(z_2), …, exp(z_ℓ))⊤ = (1, 0, …, 0)⊤,

where Â is A with the row (1, 0, …, 0) prepended. (We could also normalize Σ_{i=1}^ℓ exp(z_i) = 1; this corresponds to the first row of Â being all 1s instead of a single 1.) This is solvable as long as the augmented matrix has a nonzero determinant:

det Â = det(A_{1:m, 2:ℓ}).  (9)
Here A_{1:m, 2:ℓ} denotes A with the first column removed. Note that we are setting m = ℓ − 1. This is the minimum number of query-answer pairs that we need. If we have more (i.e., m ≥ ℓ), then the system is overdetermined. Having the system be overdetermined is a good thing; the extra answers can help us recover the logprobs with more precision. The least squares solution to the overdetermined system is given by
Â⊤ Â · (exp(z_1), exp(z_2), …, exp(z_ℓ))⊤ = Â⊤ · (1, 0, …, 0)⊤.  (10)
This provides a general method for recovering the (normalized) logits from the logprobs API.
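A minimal simulation of this least-squares recovery (our sketch; the simulated oracle returns the argmax token and its logprob, and all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
ell, m = 6, 10                        # vocab size; number of queries (m >= ell - 1)
z = rng.standard_normal(ell)

rows, rhs = [np.eye(ell)[0]], [1.0]   # constraint exp(z_1) = 1, i.e. z_1 = 0
for _ in range(m):
    b = rng.uniform(-2, 2, ell)       # a random logit-bias query
    i = int(np.argmax(z + b))         # token returned by the API ...
    a = (z + b)[i] - np.log(np.exp(z + b).sum())   # ... and its logprob
    row = np.exp(b)
    row[i] *= 1 - np.exp(-a)          # the row A_{k,j} derived above
    rows.append(row)
    rhs.append(0.0)

x, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
z_hat = np.log(x)                     # recovered logits, normalized so z_1 = 0
assert np.allclose(z_hat, z - z[0], atol=1e-6)
```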
Related work. (Zanella-Beguelin et al., 2021) have an almost identical method, although they operate in the setting of a
publicly known encoder and reconstructing the last layer.
API: Some APIs provide access to a logit bias term, but do not provide any information about the logprobs. Thus, we have

O(p, b) = ArgMax(logsoftmax(W · gθ(p) + b)),

where ArgMax(z) returns the index of the highest coordinate in the vector z ∈ Rℓ. In this section, we will use the notation b = {i : z} to denote that the bias is set to z for token i and 0 for every other token. We also use b = {} to denote that no logit bias is used. Finally, we assume that the bias is restricted to fall within the range [−B, B].

What can be extracted? The attacks developed in this section reconstruct the logit vector up to an additive (∞-norm) error of ε.
else
    αi ← (αi + βi)/2
end if
end while
Return (αi + βi)/2
Analysis. This attack, while inefficient, correctly extracts the logit vector.
Lemma F.1. For every token i such that logit_i − logit_0 ≥ −B, Algorithm 2 outputs a value that is at most ε away from logit_i − logit_0 in at most log2(B/ε) API queries.

Proof. The API returns the (re-ordered) token 0 as long as the logit bias added is smaller than logit_i − logit_0. By the assumption, we know that logit_i − logit_0 ∈ [−B, 0]. The algorithm ensures that βi ≥ logit_i − logit_0 ≥ αi at each iteration, as can be seen easily by an inductive argument. Further, βi − αi decreases by a factor of 2 in each iteration, and hence at termination the true value of logit_i − logit_0 is sandwiched in an interval of length ε. Furthermore, it is clear that the number of iterations is at most log2(B/ε) and hence so is the query cost of this algorithm.
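A sketch of this binary search against an argmax-only API (ours, not Algorithm 2 verbatim; the bias sign convention follows our simulation, and all constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
logits = rng.uniform(-8.0, 0.0, 40)
logits[0] = 0.0                       # token 0 is the most likely token
B, eps = 10.0, 1e-3

def api_argmax(i, bias):
    # argmax-only API: apply a logit bias to token i, return the top token id
    z = logits.copy()
    z[i] += bias
    return int(np.argmax(z))

def extract(i):
    # binary search for d = logit_i - logit_0, assumed to lie in [-B, 0]
    lo, hi = -B, 0.0
    while hi - lo > eps:
        mid = (lo + hi) / 2
        # a bias of -mid (>= 0) promotes token i exactly when d > mid
        if api_argmax(i, -mid) == i:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for i in range(1, len(logits)):
    assert abs(extract(i) - (logits[i] - logits[0])) <= eps
```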
Limitations of the approach. If logit_i − logit_0 < −2B, it is easy to see there is no efficient way to sample the token i, hence no way to find information about logit_i without logprob access. There is a way to slightly increase the range to cover −2B ≤ logit_i − logit_0 ≤ −B by adding negative logit biases to the tokens with the largest logit values, but we skip the details since for most models, for the prompts we use, every token satisfies logit_i − logit_0 > −B.

Related work. Concurrent work (Morris et al., 2023) has discussed this method of extracting logits.

³ We release supplementary code that deals with testing these attacks without direct API queries at https://github.com/dpaleka/stealing-part-lm-supplementary.
API: We use the same API as in the previous section, with the additional constraint that O accepts at most N + 1 tokens in the logit bias dictionary. We again first run a query O(p, b = {}) to identify the most likely token and set its index to 0. Our goal is to approximate logit_i − logit_0 for N different tokens. If N < ℓ − 1, we simply repeat the same algorithm ⌈(ℓ − 1)/N⌉ times for different batches of N tokens.
Method. Our approach queries the API with the logit bias set for several tokens in parallel. The algorithm proceeds in
rounds, where each round involves querying the API with the logit bias set for several tokens.
Suppose that the query returns token k as output when the logit bias was set to {i : bi } for i = 1, . . . , l and the prompt is p.
Then, we know that logitk + bk ≥ logitj + bj for all j ̸= k by the definition of the API.
This imposes a system of linear constraints on the logits. By querying the model many times, and accumulating many such
systems of equations, we can recover the logit values more efficiently. To do this, we accumulate all such linear constraints
in the set C, and at the end of each round, compute the smallest and largest possible values for logiti − logit0 by solving
a linear program that maximizes/minimizes this value over the constraint set C. Thus, at each round, we can maintain an
interval that encloses logiti − logit0 , and refine the interval at each round given additional information from that round’s
query. After T rounds (where T is chosen based on the total query budget for the attack), we return the tightest known
bounds on each logit.
Lemma F.2. Suppose that logit_i − logit_0 ∈ [−B, 0] for all i = 1, . . . , ℓ. Then, Algorithm 3 returns an interval [αi, βi] such that logit_i − logit_0 ∈ [αi, βi] for each such i. Furthermore, each round in the algorithm can be implemented in computation time O(N³) (excluding the computation required for the API call).
Proof. Algorithm 3 maintains the invariant that logiti − logit0 ∈ [αi , βi ] in each round. We will prove by induction that
this is true and that the true vector of logits always lies in C. Note that by the assumption stated in the Lemma, this is
clearly true at the beginning of the first round. Suppose that this is true after K < T rounds. Then, in the (K + 1)-th round,
the constraints added are all valid constraints for the true logit vector, since the API returning token k guarantees that
logitk + bk ≥ logitj + bj for all j ̸= k. Hence, by induction, the algorithm always ensures that logiti − logit0 ∈ [αi , βi ].
In Appendix F.2.1, we show the LP to compute αi, βi for all i can be seen as an all-pairs shortest paths problem on a graph with edge weights c_{jk} = min_{rounds} (b_j − b_k), where the minimum is taken over all rounds where the token returned was k. This ensures the computational complexity of maintaining the logit difference intervals is O(N³).
Proof. Let e_{0j_1}, e_{j_1j_2}, . . . , e_{j_{m−1}i} be the edges of the minimum distance path from 0 to i in G. Summing the corresponding inequalities (recalling logit_0 = 0 after normalization), we have

logit_i = (logit_i − logit_{j_{m−1}}) + · · · + (logit_{j_1} − logit_0) ≤ c_{j_{m−1}i} + · · · + c_{0j_1},

hence the shortest path is an upper bound on logit_i. To prove feasibility, we claim that setting logit_i to be the distance from 0 to i satisfies all the inequalities. Assume some inequality logit_i − logit_j ≤ c_{ji} is violated. Then we can go from 0 → j → i in G with a total weight of logit_j + c_{ji} < logit_i, which contradicts the assumption that logit_i is the distance from 0 to i.
To apply this to our setting, note that (1) all constraints, even the initial αi ≤ logit_i ≤ βi, are of the required form; (2) the graph has no negative cycles because the true logits give a feasible solution; and (3) we can get the lower bounds by applying the same procedure to the graph induced by the inequalities on −logit_i.
We can find the distances from 0 to all other vertices using the Bellman-Ford algorithm in O(N³) time. If N = 300, this is at most comparable to the latency of O. Since only N edges of the graph update at each step, we note that the heuristic of just updating and doing a few incremental iterations of Bellman-Ford gets [αi, βi] to high precision in practice. The number of API queries and the token cost, of course, remains the same.
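The constraint-graph computation can be sketched as follows (ours, not Algorithm 3; the edge-index convention here follows the inequality logit_j − logit_k ≤ b_k − b_j and may be transposed relative to the text):

```python
import numpy as np

rng = np.random.default_rng(8)
ell, rounds = 8, 200
logits = rng.uniform(-5.0, 0.0, ell)
logits[0] = 0.0                          # normalize so that logit_0 = 0

# Each round: a random logit bias; the API reveals only the argmax token k,
# which implies logit_j - logit_k <= b_k - b_j for every j, i.e. an edge
# k -> j of weight b_k - b_j.
INF = float("inf")
c = np.full((ell, ell), INF)
np.fill_diagonal(c, 0.0)
for _ in range(rounds):
    b = rng.uniform(0.0, 5.0, ell)
    k = int(np.argmax(logits + b))
    c[k] = np.minimum(c[k], b[k] - b)    # tighten all edges out of k

# Bellman-Ford from node 0: dist[i] is a valid upper bound on logit_i
dist = np.full(ell, INF)
dist[0] = 0.0
for _ in range(ell - 1):                 # at most ell - 1 rounds of relaxation
    for k in range(ell):
        if dist[k] < INF:
            dist = np.minimum(dist, dist[k] + c[k])

assert np.all(logits <= dist + 1e-9)     # the true logits satisfy every bound
```

Lower bounds follow symmetrically by running the same procedure on the graph induced by the inequalities on −logit_i.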
The goal of balanced sampling of all output tokens can be approached in many ways. For example, we could tune c in
the above expression; bias tokens which O hasn’t returned previously to be more likely; or solve for the exact logit bias
that separates C (or some relaxation) into equal parts. However, we show in Appendix G that, under some simplifying
assumptions, the queries/logit metric of this method in Table 4 is surprisingly close to optimal.
l log2(B/ε) / log2(l).
Proof. The information content of a single logit value in [−B, 0] up to ∞-norm error ε is log2(B/ε), assuming a uniform prior over ε-spaced points in the interval. Since the logits are independent, the information encoded in l logit values up to ∞-norm error ε is l log2(B/ε).
Any single query to O, no matter how well-crafted, yields at most log2 (l ) bits, because the output is one of l distinct values.
The minimum number of queries required is at least the total information content divided by the information per query,
yielding the lower bound l log2 (B/ε)/ log2 (l ).
With at most N tokens in the logit bias dictionary, each query yields at most log2(N) bits, so we need at least

l log2(B/ε) / log2(N)

queries, which is a factor of log2(l)/log2(N) worse. For N = 300 and l ≈ 100,000, this is only a factor of 2. For B = 100 and N = 300, we thus need at least

log2(B/ε) / log2(N) ≈ 0.81 + 0.12 log2(1/ε)

queries per logit. If we want between 6 and 23 bits of precision, the lower bound corresponds to 1.53 to 3.57 queries per logit. We see that the best logprob-free attack in Table 4 is only about 1 query per logit worse than the lower bound.
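These constants can be reproduced directly (the quoted 1.53 and 3.57 use the rounded coefficients 0.81 and 0.12):

```python
import math

B, N, ell = 100, 300, 100_000

def queries_per_logit(bits):
    # the lower bound log2(B/eps) / log2(N), with bits = log2(1/eps)
    return (math.log2(B) + bits) / math.log2(N)

# restricting the bias dictionary to N tokens costs a factor log2(ell)/log2(N) ~ 2
assert 1.9 < math.log2(ell) / math.log2(N) < 2.1
# 6 to 23 bits of precision: roughly 1.5 to 3.6 queries per logit
assert 1.5 < queries_per_logit(6) < 1.6
assert 3.5 < queries_per_logit(23) < 3.7
```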
The main unrealistic assumption in Lemma G.1 is that the prior over the logit values is i.i.d. uniform over an interval. A
better assumption might be that most of the logit values come from a light-tailed unimodal distribution. We leave more
realistic lower bounds and attacks that make use of this better prior to future work.
H.1. Assumptions
We make a few simplifying assumptions:
⁴ See here: Recovering W up to an orthogonal matrix (Colab notebook).
2. We assume that the output matrix always outputs 0-centered logits (as in Appendix B.2.1). This isn’t a restrictive
assumption since model logprobs are invariant to a bias added to the logit outputs.
3. We assume the numerical precision is high enough that during the final normalization layer, the hidden states are on a
sphere.
4. There is no degenerate lower-dimensional subspace containing all gθ (p) for all our queries p.
H.2. Methodology
Intuition. The main idea is to exploit the structure imposed by the last two layers of the model: LayerNorm/RMSNorm,
which projects model internal activations to a sphere, and the unembedding layer. The model’s logits lie on an ellipsoid of
rank h due to Lemma H.1:
Lemma H.1. The image of an ellipsoid under a linear transformation (such as the unembedding layer) is itself an ellipsoid.
Thus, we can project the model’s logits outputs to that subspace using U the truncated SVD matrix; X = U⊤ Q. An
ellipsoid is defined with the quadratic form as in Lemma H.2:
Lemma H.2. An ellipsoid is defined by a single semipositive-definite symmetric matrix A and a center c by the following equation: (x − c)⊤A(x − c) = 1.
By expanding this formula, x⊤Ax − 2d⊤x + c⊤Ac = 1 with d := Ac, we end up with a system that all (projected) model outputs should satisfy. This system is linear in the (h+1 choose 2) = h(h+1)/2 entries of A (as A is symmetric) and the h entries of d. The system does not constrain the free term c⊤Ac, and we can get rid of it by subtracting x0 from the other outputs of the model, effectively forcing the ellipsoid to pass through the origin and thus satisfy c⊤Ac = 1. Using a Cholesky decomposition of the PSD symmetric matrix A = MM⊤, we end up with M, a linear mapping from the unit sphere to the (origin-passing) ellipsoid, which is equivalent to W up to an orthogonal matrix (see Lemma H.3). The ellipsoid is centered at Wb = c − x0.
Procedure. The procedure we took in order to recover W up to an orthogonal matrix, as implemented in the Colab notebook linked in footnote 4, is described as follows at a high level (we expand in detail on several of the steps below):

1. Collect (h+1 choose 2) + (h + 1) logits and store them in a matrix Q.

4. Find the model's hidden dimension h (e.g. using Section 4.2) and truncate U
In steps 3–4, we use the compact SVD on the query output matrix Q = U · Σ · V⊤. Here Q ∈ R^{l×n}, U ∈ R^{l×h}, Σ ∈ R^{h×h}, and V⊤ ∈ R^{h×n}. Note that the points gθ(p) lie on a sphere in R^h and U⊤ · W ∈ R^{h×h}, hence the points U⊤ · W · gθ(p) lie on an ellipsoid in R^h. From now on, it is convenient to work with the points X = U⊤ · Q. As centered ellipsoids are equivalently defined by x⊤Ax = 1 for some positive semidefinite (symmetric) matrix A ∈ R^{h×h}, we can write A = M⊤ · M for some M (step 7 of the procedure); this works because A is positive semidefinite and symmetric, and the system is non-degenerate because the ellipsoid has full rank h. Importantly, to motivate step 8 of the procedure, we use Lemma H.3.
Lemma H.3. W = U · M−1 · O for some orthogonal matrix O.
Proof. We know that gθ (pi ) lie on a sphere. The equation (xi − c)⊤ A(xi − c) = 1 is equivalent to (xi − c)⊤ M⊤ M(xi −
c) = 1, which is equivalent to ∥M(xi − c)∥ = 1. This means that M(xi − c) lie on a sphere. Because M(xi − c) =
M · U⊤ · W · gθ (pi ), we have that M · U⊤ · W is a norm-preserving transformation on the points gθ (pi ). By the assumption
that gθ (pi ) are not in a degenerate lower-dimensional subspace, we have that M · U⊤ · W =: O is a norm-preserving
endomorphism of Rh , hence an orthogonal matrix. This directly implies W = U · M−1 · O as claimed.
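For the centered case (b = 0, so c = x0 = 0, skipping the center-recovery steps), the pipeline and Lemma H.3 can be checked numerically (our sketch; all dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(9)
l, h, n = 30, 4, 60
W = rng.standard_normal((l, h))

# hidden states on the unit sphere (centered case: no bias, c = x0 = 0)
G = rng.standard_normal((h, n))
G /= np.linalg.norm(G, axis=0, keepdims=True)
Q = W @ G                                    # observed logits

U = np.linalg.svd(Q, full_matrices=False)[0][:, :h]   # truncated U
X = U.T @ Q                                  # projected logits, on an ellipsoid

# fit the quadratic form x^T A x = 1 over symmetric A by least squares
iu = np.triu_indices(h)
scale = np.where(iu[0] == iu[1], 1.0, 2.0)   # off-diagonal entries appear twice
rows = np.array([np.outer(x, x)[iu] * scale for x in X.T])
coef = np.linalg.lstsq(rows, np.ones(n), rcond=None)[0]
A = np.zeros((h, h))
A[iu] = coef
A = A + A.T - np.diag(np.diag(A))

M = np.linalg.cholesky(A).T                  # A = M^T M
O = M @ U.T @ W                              # Lemma H.3: this is orthogonal
assert np.allclose(O @ O.T, np.eye(h), atol=1e-6)
assert np.allclose(U @ np.linalg.inv(M) @ O, W, atol=1e-6)  # W recovered up to O
```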
I.2. Noise
One natural defense to our attacks is to obfuscate the logits by adding noise. This will naturally induce a tradeoff between
utility and vulnerability: more noise will result in less useful outputs, but increase extraction difficulty. We empirically measure this tradeoff in Figure 6(c). We consider noise added directly to the logits, which is consistent between different
queries of the same prompt. To simulate this, we directly add noise to our recovered logits, and recompute the extracted
embedding matrix. For GPT-2, we measure the RMSE between the true embedding matrix and the embedding matrix
extracted with a specific noise level; for ada and babbage, we measure the RMSE between the noisy extracted weights
and the weights we extracted in the absence of noise. We normalize all embedding matrices (to have ℓ2 norm 1) before
measuring RMSE.
[Figure 6 panels: (a) sorted singular values for {1024, 2048, 4096, 8192} queries (llama-7B, llama-7B-8bit, llama-7B-4bit); (b) differences between consecutive sorted singular values; (c) RMSE of extracted embeddings (llama-7B, ada, GPT-2) at various noise scales.]
Figure 6. In (a, b), recovering the embedding matrix dimension h for Llama-7B at different levels of precision: 16-bit (default), 8-bit, and
4-bit. We observe no meaningful differences, with respect to our attack, at different levels of quantization. In (c), the RMSE between
extracted embeddings as a function of the standard deviation of Gaussian noise added to the logits.
[Figure 7: two panels plotting singular value against sorted singular value index.]
Figure 7. On the left, we plot the singular values that are extracted using our attack on GPT-2 small—the estimated hidden dimension is
near 768. On the right, we post-hoc extend the dimensionality of the weight matrix to 1024, as described in Section 8. This misleads the
adversary into thinking the model is wider than it actually is.