
On Differentially Private String Distances

Jerry Yao-Chieh Hu∗ Erzhi Liu† Han Liu‡ Zhao Song§ Lichen Zhang¶
arXiv:2411.05750v1 [cs.DS] 8 Nov 2024

Abstract
Given a database of bit strings A_1, . . . , A_m ∈ {0,1}^n, a fundamental data structure task is to estimate the distances between a given query B ∈ {0,1}^n and all the strings in the database. In addition, one might further want to ensure the integrity of the database by releasing these distance statistics in a secure manner. In this work, we propose differentially private (DP) data structures for tasks of this type, with a focus on Hamming and edit distance. On top of the strong privacy guarantees, our data structures are also time- and space-efficient. In particular, our data structure is ε-DP against any sequence of queries of arbitrary length, and for any query B such that the maximum distance to any string in the database is at most k, we output m distance estimates. Moreover,
• For Hamming distance, our data structure answers any query in Õ(mk + n) time, and each estimate deviates from the true distance by at most Õ(k/e^{ε/log k});

• For edit distance, our data structure answers any query in Õ(mk² + n) time, and each estimate deviates from the true distance by at most Õ(k/e^{ε/(log k log n)}).
For moderate k, both data structures support sublinear query operations. We obtain these
results via a novel adaptation of the randomized response technique as a bit flipping procedure,
applied to the sketched strings.


∗ jhu@ensemblecore.ai; jhu@u.northwestern.edu. Ensemble AI, San Francisco, CA, USA; Center for Foundation Models and Generative AI & Department of Computer Science, Northwestern University, Evanston, IL, USA. Work done during JH's internship at Ensemble AI.
† erzhiliu@u.northwestern.edu. Center for Foundation Models and Generative AI & Department of Computer Science, Northwestern University, Evanston, IL, USA.
‡ hanliu@northwestern.edu. Center for Foundation Models and Generative AI & Department of Computer Science & Department of Statistics and Data Science, Northwestern University, Evanston, IL, USA. Supported in part by NIH R01LM1372201, AbbVie and Dolby.
§ magic.linuxkde@gmail.com. Simons Institute for the Theory of Computing, UC Berkeley, Berkeley, CA, USA.
¶ lichenz@mit.edu. Department of Mathematics & Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA, USA. Supported in part by NSF CCF-1955217 and DMS-2022448.
Contents

1 Introduction
2 Related Work
3 Preliminary
  3.1 Concentration Bounds
  3.2 Differential Privacy
4 Differentially Private Hamming Distance Data Structure
  4.1 Time Complexity
  4.2 Privacy Guarantee
  4.3 Utility Guarantee
5 Differentially Private Edit Distance Data Structure
  5.1 Time Complexity
  5.2 Privacy Guarantee
  5.3 Utility Guarantee
6 Conclusion
A Proofs for Hamming Distance Data Structure
  A.1 Proof of Lemma 4.3
  A.2 Proof of Lemma 4.4
  A.3 Proof of Lemma 4.5
  A.4 Proof of Lemma 4.6
B Differentially Private Longest Common Prefix
  B.1 Time Complexity
  B.2 Privacy Guarantee
  B.3 Utility Guarantee

1 Introduction
Estimating string distances is one of the most fundamental problems in computer science and information theory, with rich applications in high-dimensional geometry, computational biology and machine learning. The problem can be generically formulated as follows: given a collection of strings A_1, . . . , A_m ∈ Σ^n where Σ is the alphabet, the goal is to design a data structure that preprocesses these strings such that, when a query B ∈ Σ^n is given, the data structure quickly outputs estimates of ‖A_i − B‖ for all i ∈ [m], where ‖·‖ is the distance of interest. Assuming the symbols in Σ admit constant-time access and operations, a naïve implementation would simply compute all the distances between the A_i's and B, which requires O(mn) time. Designing data structures with o(mn) query time has been the driving research direction in string distance estimation. To make the discussion concrete, in this work we focus on the binary alphabet (Σ = {0, 1}), and for distance, we study Hamming and edit distance. Hamming distance [Ham50] is one of the most natural distance measurements for binary strings, with deep roots in error detection and correction for codes. It finds a large array of applications in database similarity search [IM98, Cha02, NPF12] and clustering algorithms [Hua97, HN99].
Compared to Hamming distance, edit distance, or the Levenshtein distance [Lev66], can be viewed as a more robust distance measurement for strings: it counts the minimum number of operations (insertion, deletion and substitution) needed to transform A_i into B. To see the robustness compared to Hamming distance, consider A_i = (01)^{n/2} and B = (10)^{n/2}: the Hamming distance between these two strings is n, but A_i can easily be transformed into B by deleting the first bit and appending a 0, yielding an edit distance of 2. Due to its flexibility, edit distance is particularly useful for sequence alignment in computational biology [WHZ+15, YFA21, BWY21], measuring text similarity [Nav01, SGAM+15], natural language processing, speech recognition [FARL06, DA10] and time series analysis [Mar09, GS18].
In addition to data structures with fast query times, another important consideration is ensuring that the database is secure. Consider the scenario where the database consists of private medical data of m patients, where each A_i is the characteristic vector of n different symptoms. A malicious adversary might attempt to count the number of symptoms each patient has by querying 0^n, or to detect whether patient i has symptom j by querying e_j and 0^n, where e_j is the j-th standard basis vector in R^n. It is hence crucial to curate a private scheme so that the adversary cannot distinguish whether the patient has symptom j or not. This notion of privacy is precisely captured by differential privacy [Dwo06, DKM+06], which states that for neighboring databases¹, the output distributions of the data structure queries should be close with high probability, so that no adversary can distinguish between the two cases.
Motivated by both privacy and efficiency concerns, we ask the following natural question:

Is it possible to design data structures for estimating Hamming and edit distance that are both differentially private and time- and space-efficient?

We provide an affirmative answer to the above question, with our main results summarized in the following two theorems. We use D_ham(A, B) to denote the Hamming distance between A and B, and D_edit(A, B) to denote the edit distance between A and B. We also say a data structure is ε-DP if it provides ε-DP outputs against any sequence of queries of arbitrary length.

Theorem 1.1. Let A_1, . . . , A_m ∈ {0,1}^n be a database, k ∈ [n], ε > 0 and β ∈ (0,1). Then there exists a randomized algorithm with the following guarantees:

• The data structure is ε-DP;

• It preprocesses A_1, . . . , A_m in Õ(mn) time²;

• It consumes Õ(mk) space;

• Given any query B ∈ {0,1}^n such that max_{i∈[m]} D_ham(A_i, B) ≤ k, it outputs m estimates z_1, . . . , z_m with |z_i − D_ham(A_i, B)| ≤ Õ(k/e^{ε/log k}) for all i ∈ [m] in Õ(mk + n) time, and it succeeds with probability at least 1 − β.

¹In our case, we say two databases D_1 and D_2 are neighboring if there exists one i ∈ [m] such that A_i in D_1 and A_i in D_2 differ by one bit.

Theorem 1.2. Let A_1, . . . , A_m ∈ {0,1}^n be a database, k ∈ [n], ε > 0 and β ∈ (0,1). Then there exists a randomized algorithm with the following guarantees:

• The data structure is ε-DP;

• It preprocesses A_1, . . . , A_m in Õ(mn) time;

• It consumes Õ(mn) space;

• Given any query B ∈ {0,1}^n such that max_{i∈[m]} D_edit(A_i, B) ≤ k, it outputs m estimates z_1, . . . , z_m with |z_i − D_edit(A_i, B)| ≤ Õ(k/e^{ε/(log k log n)}) for all i ∈ [m] in Õ(mk² + n) time, and it succeeds with probability at least 1 − β.

Before diving into the details, we would like to make several remarks regarding our data structure results. Note that instead of solving the exact Hamming and edit distance problem, we impose the assumption that the query B satisfies ‖A_i − B‖ ≤ k for every i ∈ [m]. Such an assumption might seem restrictive at first glance, but under the standard complexity assumption of the Strong Exponential Time Hypothesis (SETH) [IP01, IPZ01], it is known that no O(n^{2−o(1)})-time algorithm exists for exact or even approximate edit distance [BZ16, CGK16b, CGK16a, NSS17, RSSS19, RS20, GRS20, JNW21, BEG+21, KPS21, BK23, KS24]. It is therefore natural to impose the assumption that the query is "near" the database in pursuit of faster algorithms [Ukk85, Mye86, LV88, GKS19, KS20, GKKS23]. In fact, assuming SETH, an O(n + k²) runtime for edit distance when m = 1 is optimal up to sub-polynomial factors [GKKS23]. Thus, in this paper, we consider the setting where max_{i∈[m]} ‖A_i − B‖ ≤ k for both Hamming and edit distance, and show how to craft private and efficient mechanisms for this class of distance problems.
Regarding privacy guarantees, one might consider the following simple augmentation to any fast data structure for Hamming distance: compute the distance estimate via the data structure, and add Laplace noise to it. Since changing one coordinate of the database changes the Hamming distance by at most 1, the Laplace mechanism would properly handle this case. However, our goal is to release a differentially private data structure that is robust against potentially infinitely many queries, and a simple output perturbation is not sufficient: an adversary could repeatedly issue the same query B and average the outputs to reduce the variance, obtaining a relatively accurate estimate of the de-noised output. To address this issue, we consider the differentially private function release communication model [HRW13], where the curator releases an ε-DP description of a function b̃(·) without seeing any query in advance. The client can then use b̃(·) to compute b̃(B) for any query B. This strong guarantee ensures that the client can feed infinitely many queries to b̃(·) without compromising the privacy of the database.
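To make the averaging attack concrete, here is a small, self-contained Python simulation (our own illustration, not part of the paper's construction; the value and parameters below are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
true_distance = 42          # hypothetical exact value D_ham(A, B)
eps = 1.0                   # Laplace mechanism with sensitivity 1 uses scale 1/eps

def noisy_query():
    # Fresh Laplace noise on every query: each single answer is eps-DP,
    # but the privacy loss accumulates across repeated queries.
    return true_distance + rng.laplace(scale=1.0 / eps)

for q in (1, 100, 10_000):
    avg = np.mean([noisy_query() for _ in range(q)])
    print(q, abs(avg - true_distance))   # error shrinks roughly like 1/sqrt(q)

In the function release model, by contrast, the randomness is fixed once when b̃(·) is released, so repeating the same query yields the same answer and reveals nothing new.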
²Throughout the paper, we will use Õ(·) to suppress polylogarithmic factors in m, n, k and 1/β.
2 Related Work
Differential Privacy. Differential privacy is a ubiquitous notion for protecting the privacy of databases. [DKM+06] first introduced this concept, which characterizes a class of algorithms such that when the inputs are two neighboring datasets, the output distributions are similar with high probability. Differential privacy has a wide range of applications in general machine learning [CM08, WM10, JE19, TF20], training deep neural networks [ACG+16, BPS19], computer vision [ZYCW20, LWAL21, TKP19], natural language processing [YDW+21, WK18], large language models [GSY23, YNB+22], label protection [YSY+22], multi-party data release [WYY+22], federated learning [SYY+23, SWYZ23] and peer review [DKWS22]. In recent years, differential privacy has played an important role in data structure design, both in making data structures robust against adaptive adversaries [BKM+22, HKM+22, SYYZ23, CSW+23] and in the function release communication model [HRW13, HR14, WJF+16, AR17, CS21, WNM23, BLM+24, LHR+24].

Hamming Distance and Edit Distance. Given bit strings A and B, many distance measurements have been proposed that capture various characteristics of bit strings. Hamming distance was first studied by Hamming [Ham50] in the context of error correction for codes. From an algorithmic perspective, Hamming distance is mostly studied in the context of approximate nearest-neighbor search and locality-sensitive hashing [IM98, Cha02]. When it is known that the query B has the property D_ham(A, B) ≤ k, [PL07] shows how to construct a sketch of size Õ(k) in Õ(n) time such that, with high probability, these sketches preserve the Hamming distance. Edit distance, proposed by Levenshtein [Lev66], is a more robust notion of distance between bit strings. It has applications in computational biology [WHZ+15, YFA21, BWY21], text similarity [Nav01, SGAM+15] and speech recognition [FARL06, DA10]. From a computational perspective, it is known that under the Strong Exponential Time Hypothesis (SETH), no algorithm can solve edit distance in O(n^{2−o(1)}) time, even for its approximate variants [BZ16, CGK16b, CGK16a, NSS17, RSSS19, RS20, GRS20, JNW21, BEG+21, KPS21, BK23, KS24]. Hence, various assumptions have been imposed to enable more efficient algorithm design. The assumption most related to ours is that D_edit(A, B) ≤ k, and in this regime various algorithms have been proposed [Ukk85, Mye86, LV88, GKS19, KS20, GKKS23]. Under SETH, it has been shown that the optimal dependence on n and k is O(n + k²), up to sub-polynomial factors [GKKS23].

3 Preliminary
Let E be an event; we use 1[E] to denote the indicator variable that E is true. Given two length-n bit strings A and B, we use D_ham(A, B) to denote Σ_{i=1}^n 1[A_i ≠ B_i]. We use D_edit(A, B) to denote the edit distance between A and B, i.e., the minimum number of operations to transform A into B, where the allowed operations are insertion, deletion and substitution. We use ⊕ to denote the XOR operation. For any positive integer n, we use [n] to denote the set {1, 2, . . . , n}. We use Pr[·], E[·] and Var[·] to denote probability, expectation and variance respectively.

3.1 Concentration Bounds

We will mainly use two concentration inequalities in this paper.

Lemma 3.1 (Chebyshev's Inequality). Let X be a random variable with 0 < Var[X] < ∞. For any real number t > 0,

    Pr[|X − E[X]| > t] ≤ Var[X]/t².

Lemma 3.2 (Hoeffding's Inequality). Let X_1, . . . , X_n be independent random variables with a_i ≤ X_i ≤ b_i almost surely, and let S_n = Σ_{i=1}^n X_i. Then for any real number t > 0,

    Pr[|S_n − E[S_n]| > t] ≤ 2 exp(−2t²/Σ_{i=1}^n (b_i − a_i)²).
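As a quick empirical sanity check of Lemma 3.2 (an illustrative snippet of ours, not from the paper), one can compare the tail of a sum of independent Bernoulli variables against the Hoeffding bound:

import numpy as np

rng = np.random.default_rng(1)
n, t, trials = 1000, 50, 20_000

# X_i ~ Bernoulli(1/2) with a_i = 0, b_i = 1, so sum_i (b_i - a_i)^2 = n.
S = rng.integers(0, 2, size=(trials, n)).sum(axis=1)
empirical_tail = np.mean(np.abs(S - n / 2) > t)
hoeffding_bound = 2 * np.exp(-2 * t**2 / n)
print(empirical_tail, hoeffding_bound)  # the empirical tail sits below the bound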

3.2 Differential Privacy

Differential privacy (DP) is the key privacy notion our algorithms are designed to satisfy. In this paper, we focus solely on pure DP (ε-DP).

Definition 3.3 (ε-Differential Privacy). We say an algorithm A is ε-differentially private (ε-DP) if for any two neighboring databases D_1 and D_2 and any subset S of possible outputs, we have

    Pr[A(D_1) ∈ S] ≤ e^ε · Pr[A(D_2) ∈ S],

where the probability is taken over the randomness of A.

Since we will be designing data structures, we work in the function release communication model [HRW13], where the goal is to release a function that is ε-DP against any sequence of queries of arbitrary length.

Definition 3.4 (ε-DP Data Structure). We say a data structure A is ε-DP if A is ε-DP against any sequence of queries of arbitrary length. In other words, the curator releases an ε-DP description of a function b̃(·) without seeing any query in advance.

Finally, we will use the post-processing property of ε-DP.

Lemma 3.5 (Post-Processing). Let A be ε-DP. Then for any deterministic or randomized function g that only depends on the output of A, g ◦ A is also ε-DP.
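The bit-flipping noise used throughout this paper is an instance of randomized response. The following minimal check (our own snippet) verifies Definition 3.3 for a single bit flipped with probability 1/(1 + e^ε); Algorithm 1 later applies this rule with parameter ε/(2M1):

import math

def output_prob(input_bit: int, output_bit: int, eps: float) -> float:
    # Flip the input bit with probability 1 / (1 + e^eps), keep it otherwise.
    p_flip = 1.0 / (1.0 + math.exp(eps))
    return p_flip if input_bit != output_bit else 1.0 - p_flip

eps = 0.5
for out in (0, 1):
    # The two neighboring "databases" here are the two values of the input bit.
    ratio = output_prob(0, out, eps) / output_prob(1, out, eps)
    assert ratio <= math.exp(eps) + 1e-12  # Definition 3.3, tight at e^{±eps}
    print(out, ratio)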

4 Differentially Private Hamming Distance Data Structure

To start off, we introduce our data structure for differentially private Hamming distance. In particular, we adapt a data structure due to [PL07]: this data structure computes a sketch, a bit string of length Õ(k), for both the database string and the query, and with high probability one can retrieve the Hamming distance from these sketches. Since the resulting sketch is also a bit string, a natural idea is to inject Laplace noise into each coordinate of the sketch. Since only one coordinate of the sketch changes between two neighboring databases, adding Laplace noise of scale 1/ε achieves ε-DP. However, this approach has a critical issue: one can show that with high probability, the magnitude of each noise is roughly O(ε^{−1} log k), and aggregated over the roughly k coordinates of the sketch, this leads to a total error of O(ε^{−1} k log k). To decrease this error to O(1), one would have to choose ε = k log k, which is too large for most applications.

Instead of Laplace noise, we present a novel scheme that flips each bit of the sketch with a certain probability. Our main contribution is to show that this simple scheme, while producing a biased estimator, incurs an error of only O(e^{−ε/log k} · k). Letting t = ε/log k, the Laplace mechanism has an error of O(t^{−1} k) whereas ours is only O(e^{−t} k), which is exponentially smaller! In what follows, we describe the data structure for the case where the database is a single string A and the success probability is a constant; we then discuss how to extend it to m bit strings and how to boost the success probability to 1 − β for any β > 0. We summarize the main result below.

Theorem 4.1. Given a string A of length n, there exists an ε-DP data structure DPHammingDistance (Algorithm 1) with the following operations:

• Init(A ∈ {0,1}^n): It takes a string A as input. This procedure takes O(n log k + k log³ k) time.

• Query(B ∈ {0,1}^n): For any B with z := D_ham(A, B) ≤ k, Query(B) outputs a value z̃ such that |z̃ − z| = Õ(k/e^{ε/log k}) with probability 0.99, and the result is ε-DP. This procedure takes O(n log k + k log³ k) time.

Algorithm 1 Differentially Private Hamming Distance Query

1: data structure DPHammingDistance  ⊲ Theorem 4.1
2: members
3:     M1, M2, M3 ∈ N+
4:     h(x) : [2n] → [M2]  ⊲ h and g are public random hash functions
5:     g(x, i) : [2n] × [M1] → [M3]
6:     S ∈ {0,1}^{M1×M2×M3}  ⊲ S represents the sketch
7: end members
8:
9: procedure Encode(A ∈ {0,1}^n, n)  ⊲ Lemma 4.2
10:     S*_{i,j,c} ← 0 for all i, j, c
11:     for p ∈ [n] do
12:         for i ∈ [M1] do
13:             j ← h(2(p − 1) + A_p)
14:             c ← g(2(p − 1) + A_p, i)
15:             S*_{i,j,c} ← S*_{i,j,c} ⊕ 1
16:         end for
17:     end for
18:     return S*
19: end procedure
20:
21: procedure Init(A ∈ {0,1}^n, n ∈ N+, k ∈ N+, ε ∈ R+)  ⊲ Lemma 4.3
22:     M1 ← 10 log k
23:     M2 ← 2k
24:     M3 ← 400 log² k
25:     S ← Encode(A, n)
26:     Flip each S_{i,j,c} with independent probability 1/(1 + e^{ε/(2M1)})
27: end procedure
28:
29: procedure Query(B ∈ {0,1}^n)  ⊲ Lemma 4.7
30:     S^B ← Encode(B, n)
31:     return 0.5 · Σ_{j=1}^{M2} max_{i∈[M1]} (Σ_{c=1}^{M3} |S^B_{i,j,c} − S_{i,j,c}|)
32: end procedure
33: end data structure

To achieve the results above, we set the parameters M1 = O(log k), M2 = O(k) and M3 = O(log² k) in Algorithm 1.
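To make the pipeline concrete, the following is a minimal, illustrative Python rendering of Algorithm 1 (our own sketch, not the authors' code): the hash functions h and g are simulated with shared randomness, the constants follow the pseudocode, and no attempt is made at the Õ(·) runtime.

import math
import random

class DPHammingSketch:
    def __init__(self, n: int, k: int, eps: float, seed: int = 0):
        self.M1 = 10 * max(1, math.ceil(math.log2(k + 1)))        # ~10 log k
        self.M2 = 2 * k
        self.M3 = 400 * max(1, math.ceil(math.log2(k + 1)) ** 2)  # ~400 log^2 k
        rng = random.Random(seed)  # h, g are public shared randomness
        self.h = [rng.randrange(self.M2) for _ in range(2 * n)]
        self.g = [[rng.randrange(self.M3) for _ in range(self.M1)]
                  for _ in range(2 * n)]
        self.eps = eps

    def encode(self, A: str):
        # Encode: XOR each (position, bit) pair into M1 independent cells.
        S = [[[0] * self.M3 for _ in range(self.M2)] for _ in range(self.M1)]
        for p, ch in enumerate(A):
            x = 2 * p + int(ch)
            j = self.h[x]
            for i in range(self.M1):
                S[i][j][self.g[x][i]] ^= 1
        return S

    def privatize(self, S, rng: random.Random):
        # Randomized response: flip every bit w.p. 1 / (1 + e^{eps/(2 M1)}).
        q = 1.0 / (1.0 + math.exp(self.eps / (2 * self.M1)))
        for i in range(self.M1):
            for j in range(self.M2):
                for c in range(self.M3):
                    if rng.random() < q:
                        S[i][j][c] ^= 1
        return S

    def estimate(self, SA, SB) -> float:
        # Query: 0.5 * sum_j max_i sum_c |SA - SB|, as in Algorithm 1.
        return 0.5 * sum(
            max(sum(SA[i][j][c] ^ SB[i][j][c] for c in range(self.M3))
                for i in range(self.M1))
            for j in range(self.M2))

For instance, with ds = DPHammingSketch(n, k, eps), the curator would release noisy = ds.privatize(ds.encode(A), random.Random()) once, and a client would estimate distances via ds.estimate(noisy, ds.encode(B)).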
We divide the proof of Theorem 4.1 into the following subsections:

4.1 Time Complexity

Note that both Init and Query run Encode (Algorithm 1) exactly once; we show that the running time of Encode is O(n log k).

Lemma 4.2. Given M1 = O(log k), the running time of Encode (Algorithm 1) is O(n log k).

Proof. In Encode, for each character of the input string, the algorithm iterates M1 times. Therefore the total time complexity is O(n · M1) = O(n log k).

4.2 Privacy Guarantee

Next we prove that our data structure is ε-DP.

Lemma 4.3. Let A and A′ be two strings that differ in only one position. Let A(A) and A(A′) be the outputs of Init (Algorithm 1) given A and A′. For any output S, we have

    Pr[A(A) = S] ≤ e^ε · Pr[A(A′) = S].

We defer the proof to Appendix A.

4.3 Utility Guarantee

The utility analysis is much more involved than the privacy and runtime analyses. We state the key lemmas here and defer the proofs to the appendix.

We first consider the distance between the sketches of A and B without the random flipping process. Let E(A), E(B) be Encode(A) and Encode(B). We prove that with probability 0.99,

    D_ham(A, B) = 0.5 · Σ_{j=1}^{M2} max_{i∈[M1]} (Σ_{c=1}^{M3} |E(A)_{i,j,c} − E(B)_{i,j,c}|).

Before we present the error guarantee, we first introduce two technical lemmas. If we let T = {p ∈ [n] | A_p ≠ B_p} denote the set of "bad" coordinates, then each bucket of the sketch contains only a few bad coordinates.

Lemma 4.4. Define the set T := {p ∈ [n] | A_p ≠ B_p} and the sets T_j := {p ∈ T | h(p) = j}. When M2 = 2k, with probability 0.99 we have |T_j| ≤ 10 log k for all j ∈ [M2], i.e.,

    Pr[∀j ∈ [M2], |T_j| ≤ 10 log k] ≥ 0.99.

The next lemma shows that with high probability, the second-level hash g maps the bad coordinates to distinct buckets.

Lemma 4.5. When M1 = 10 log k, M2 = 2k and M3 = 400 log² k, with probability 0.98, for all j ∈ [M2] there is at least one i ∈ [M1] such that all values in {g(2(p − 1) + A_p, i) | p ∈ T_j} ∪ {g(2(p − 1) + B_p, i) | p ∈ T_j} are distinct.

With these two lemmas in hand, we are in a position to prove the error bound before the random bit flipping process.

Lemma 4.6. Let E(A), E(B) be the outputs of Encode(A) and Encode(B). With probability 0.98,

    D_ham(A, B) = 0.5 · Σ_{j=1}^{M2} max_{i∈[M1]} (Σ_{c=1}^{M3} |E(A)_{i,j,c} − E(B)_{i,j,c}|).

Our final result provides the utility guarantee for Algorithm 1.

Lemma 4.7. Let z be D_ham(A, B) and z̃ be the output of Query(B) (Algorithm 1). With probability 0.98, |z − z̃| ≤ O(k log³ k/e^{ε/log k}).
Proof. From Lemma 4.6, with probability 0.98, when ε → ∞ (i.e., without the random flip process), the output of Query(B) (Algorithm 1) equals the exact Hamming distance.

We view the random flip process as random variables: let R_{i,j,c} be 1 with probability 1/(1 + e^{ε/(2M1)}) (the flip probability in Init) and 0 otherwise. So we have

    |z̃ − z| ≤ Σ_{j=1}^{M2} max_{i∈[M1]} (Σ_{c=1}^{M3} R_{i,j,c}) ≤ Σ_{j=1}^{M2} Σ_{i=1}^{M1} Σ_{c=1}^{M3} R_{i,j,c},

where the second step follows from max_i ≤ Σ_i when all the summands are non-negative. Therefore, the expectation of |z̃ − z| is bounded by (suppressing constant factors in the exponent)

    E[|z̃ − z|] ≤ M1 M2 M3 · E[R_{i,j,c}] = O(k log³ k · 1/(1 + e^{ε/log k})) ≤ O(k log³ k/e^{ε/log k}),

where the last step follows from simple algebra. Similarly, the variance of |z̃ − z| satisfies

    Var[|z̃ − z|] ≤ M1 M2 M3 · Var[R_{i,j,c}] = O(k log³ k · 1/(1 + e^{ε/log k}) · (1 − 1/(1 + e^{ε/log k}))).

Using Chebyshev's inequality (Lemma 3.1), we have

    Pr[|z̃ − z| ≥ O(k log³ k/e^{ε/log k})] ≤ 0.01.

Thus we complete the proof.

Remark 4.8. We now describe how to generalize Theorem 4.1 to m bit strings, and how to boost the success probability to 1 − β. For the latter, note that an individual data structure succeeds with probability 0.99; we can take log(1/β) independent copies of the data structure and query all of them. By a standard Chernoff bound argument, with probability at least 1 − β, at least a 3/4 fraction of these data structures output the correct answer, so we can take the median of their answers. These operations blow up the runtimes of both Init and Query by a factor of log(1/β); a generic rendering of this wrapper is sketched below. Generalizing to a database of m strings is relatively straightforward: we run the Init procedure on A_1, . . . , A_m, which takes O(mn log k + mk log³ k) time. For each query, note that we only need to Encode the query once, and we can subsequently compute the Hamming distance from the sketch for each of the m sketched database strings; therefore the total query time is O(n log k + mk log³ k). It is important to note that as long as k log³ k < n, the query time is sublinear. Finally, we can use the success probability boosting technique described above with log(m/β) copies of the data structure, to account for a union bound over the success of all m distance estimates.
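The boosting wrapper can be rendered generically as follows (our own sketch; init_one and the .query interface are assumed names, not from the paper):

import math
from statistics import median

def boosted(init_one, beta: float):
    # init_one() returns an independently initialized data structure whose
    # .query(B) is correct with probability 0.99; by a standard Chernoff
    # bound, the median over O(log(1/beta)) copies is then correct with
    # probability at least 1 - beta.
    copies = [init_one() for _ in range(max(1, math.ceil(8 * math.log(1 / beta))))]
    return lambda B: median(ds.query(B) for ds in copies)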

5 Differentially Private Edit Distance Data Structure

Our algorithm for edit distance follows the dynamic programming method introduced by [Ukk85, LMS98, LV88, Mye86]. A key procedure in these algorithms is a subroutine that computes the longest common prefix (LCP) between two strings A and B and their substrings. We design an ε-DP data structure for LCP based on our ε-DP Hamming distance data structure. Due to space limitations, we defer the details of the DP-LCP data structure to Appendix B. In the following discussion, we assume access to a DP-LCP data structure with the following guarantees:

Theorem 5.1. Given a string A of length n, there exists an ε-DP data structure DPLCP (Algorithm 3 and Algorithm 4) supporting the following operations:

• Init(A ∈ {0,1}^n): It preprocesses an input string A. This procedure takes O(n(log k + log log n)) time.

• QueryInit(B ∈ {0,1}^n): It preprocesses an input query string B. This procedure takes O(n(log k + log log n)) time.

• Query(i, j): Let w be the longest common prefix of A[i : n] and B[j : n], and let w̃ be the output of Query(i, j). With probability 1 − 1/(300k²), we have: 1) w̃ ≥ w; 2) E[D_ham(A[i : i + w̃], B[j : j + w̃])] ≤ O((log k + log log n)/e^{ε/(log k log n)}). This procedure takes O(log² n(log k + log log n)) time.

We base our edit distance data structure on the following result, which achieves the optimal dependence on n and k assuming SETH:

Lemma 5.2 ([LMS98]). Given two strings A and B of length n whose edit distance is at most k, there is an algorithm that computes the edit distance between A and B in O(k² + n) time.
We start from a naïve dynamic programming approach. Define D(i, j) to be the edit distance between A[1 : i] and B[1 : j]. We can try to match A[i] and B[j] by inserting, deleting or substituting, which yields the following recurrence:

    D(i, j) = min { D(i − 1, j) + 1,                      if i > 0;
                    D(i, j − 1) + 1,                      if j > 0;
                    D(i − 1, j − 1) + 1[A[i] ≠ B[j]],     if i, j > 0 }.

The edit distance between A and B is then D(n, n). When k < n, for all D(i, j) with |i − j| > k, the length difference between A[1 : i] and B[1 : j] is greater than k, so D(i, j) > k. Since the final answer satisfies D(n, n) ≤ k, the entries with |i − j| > k cannot affect D(n, n). Therefore, we only need to consider the set {D(i, j) : |i − j| ≤ k}.

For d ∈ [−k, k] and r ∈ [0, k], we define F(r, d) = max_i {i : D(i, i + d) = r}, and we let LCP(i, j) denote the length of the longest common prefix of A[i : n] and B[j : n]. The algorithm of [LMS98] defines Extend(r, d) := F(r, d) + LCP(F(r, d), F(r, d) + d). We have

    F(r, d) = max { Extend(r − 1, d) + 1,      if r − 1 ≥ 0;
                    Extend(r − 1, d − 1),      if d − 1 ≥ −k, r − 1 ≥ 0;
                    Extend(r − 1, d + 1) + 1,  if d + 1 ≤ k, r − 1 ≥ 0 }.

The edit distance between A and B then equals min_r {r : F(r, 0) = n}.

To implement LCP, [LMS98] uses a suffix tree data structure with O(n) initialization time and O(1) query time, so the total time complexity is O(k² + n). In place of their suffix tree, we use our DP-LCP data structure (Theorem 5.1). This leads to Algorithm 2.
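For intuition, here is a minimal non-private reference implementation of this recurrence in Python (our own sketch): it uses a naive O(n)-time LCP in place of a suffix tree or the DP-LCP structure, so it runs in O(k²n) time rather than O(k² + n), but it exercises exactly the F/Extend logic that Algorithm 2 privatizes.

def edit_distance_banded(A: str, B: str, k: int):
    # Returns D_edit(A, B) if it is at most k, otherwise None.
    n = len(A)
    assert len(B) == n

    def lcp(i: int, j: int) -> int:
        # Longest common prefix of A[i:] and B[j:] (0-indexed), naively.
        l = 0
        while i + l < n and j + l < n and A[i + l] == B[j + l]:
            l += 1
        return l

    NEG = -1
    # F[d] = furthest i with D(i, i + d) = r, for diagonals d in [-k, k].
    F = {d: NEG for d in range(-k, k + 1)}
    F[0] = lcp(0, 0)                       # r = 0: free matches on diagonal 0
    if F[0] >= n:
        return 0
    for r in range(1, k + 1):
        newF = {}
        for d in range(-k, k + 1):
            best = NEG
            # Substitution stays on diagonal d; insertion/deletion move
            # between adjacent diagonals, mirroring the F(r, d) recurrence.
            for d2, step in ((d, 1), (d - 1, 0), (d + 1, 1)):
                if -k <= d2 <= k and F[d2] != NEG:
                    v = F[d2] + step
                    if v <= n and 0 <= v + d <= n:
                        best = max(best, v)
            if best != NEG:
                best += lcp(best, best + d)
            newF[d] = best
        F = newF
        if F[0] >= n:
            return r
    return None

For example, edit_distance_banded("0101", "1010", 2) returns 2, matching the delete-and-append transformation described in Section 1.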

Theorem 5.3. Given a string A of length n, there exists an ε-DP data structure DPEditDistance (Algorithm 2) supporting the following operations:

• Init(A ∈ {0,1}^n): It preprocesses an input string A. This procedure takes O(n(log k + log log n)) time.

• Query(B ∈ {0,1}^n): For any query string B with w := D_edit(A, B) ≤ k, Query outputs a value w̃ such that |w − w̃| ≤ Õ(k/e^{ε/(log k log n)}) with probability 0.99. This procedure takes O(n(log k + log log n) + k² log² n(log k + log log n)) = Õ(k² + n) time.

Algorithm 2 Differentially Private Edit Distance

1: data structure DPEditDistance  ⊲ Theorem 5.3
2: procedure Init(A ∈ {0,1}^n, n ∈ N+, k ∈ N+, ε ∈ R)  ⊲ Lemma 5.4
3:     DPLCP.Init(A, n, k, ε)  ⊲ Algorithm 3
4: end procedure
5:
6: procedure Extend(F, i, j)
7:     return F(i, j) + DPLCP.Query(F(i, j), F(i, j) + j)  ⊲ Algorithm 4
8: end procedure
9:
10: procedure Query(B, n, k)  ⊲ Lemmas 5.5 and 5.8
11:     DPLCP.QueryInit(B, n, k)  ⊲ Algorithm 3
12:     F_{0,0} ← 0
13:     for i from 1 to k do
14:         for j ∈ [−k, k] do
15:             F_{i,j} ← max(F_{i,j}, Extend(i − 1, j))
16:             if j − 1 ≥ −k then
17:                 F_{i,j} ← max(F_{i,j}, Extend(i − 1, j − 1))
18:             end if
19:             if j + 1 ≤ k then
20:                 F_{i,j} ← max(F_{i,j}, Extend(i − 1, j + 1))
21:             end if
22:         end for
23:         if F_{i,0} = n then
24:             return i
25:         end if
26:     end for
27: end procedure
28: end data structure

Again, we divide the proof into runtime, privacy and utility.

5.1 Time Complexity

We bound the running times of Init and Query respectively.

Lemma 5.4. The running time of Init (Algorithm 2) is O(n(log k + log log n)).

Proof. Init runs DPLCP.Init. From Theorem 5.1, the initialization time is O(n(log k + log log n)).

Lemma 5.5. Query (Algorithm 2) runs in O((n + k² log² n)(log k + log log n)) time.

Proof. Query runs DPLCP.QueryInit once and DPLCP.Query O(k²) times. From Theorem 5.1, the query time is O(n(log k + log log n) + k² log² n(log k + log log n)).

5.2 Privacy Guarantee

Lemma 5.6. The data structure DPEditDistance (Algorithm 2) is ε-DP.

Proof. The data structure only stores a DPLCP (Algorithms 3 and 4). From Theorem 5.1 and the post-processing property (Lemma 3.5), it is ε-DP.

5.3 Utility Guarantee

Before analyzing the error of the output of Query (Algorithm 2), we first introduce a lemma:

Lemma 5.7. Let A, B be two strings, and let LCP(i, d) be the length of the true longest common prefix of A[i : n] and B[i + d : n]. For i1 ≤ i2 and d ∈ [−k, k], we have i1 + LCP(i1, d) ≤ i2 + LCP(i2, d).

Proof. Let w1 = LCP(i1, d) and w2 = LCP(i2, d). For j ∈ [i1, i1 + w1 − 1], we have A[j] = B[j + d]. On the other hand, w2 is the length of the longest common prefix of A[i2 : n] and B[i2 + d : n], so A[i2 + w2] ≠ B[i2 + w2 + d]. Therefore, (i2 + w2) ∉ [i1, i1 + w1 − 1]. Since i2 + w2 ≥ i2 ≥ i1, we have i2 + w2 ≥ i1 + w1.

Lemma 5.8. Let r̃ be the output of Query (Algorithm 2), and let r be the true edit distance D_edit(A, B). With probability 0.99, we have |r − r̃| ≤ O(k(log k + log log n)/(1 + e^{ε/(log k log n)})).

Proof. We divide the proof into two parts. In part one, we prove that with probability 0.99, r̃ ≤ r. In part two, we prove that with probability 0.99, r̃ ≥ r − O(k(log k + log log n)/(1 + e^{ε/(log k log n)})). By Theorem 5.1, each call to DPLCP.Query satisfies the two stated conditions with probability 1 − 1/(300k²). The following discussion conditions on all DPLCP.Query calls satisfying the two conditions; since there are 3k² LCP queries, by a union bound this happens with probability at least 0.99.

Part I. Without the differential privacy guarantee (i.e., using the original LCP function instead of our DPLCP data structure), the dynamic programming method outputs the true edit distance. We define F′_{i,j} as the dynamic programming array F without the privacy guarantee, and Extend′(i, j) as the result of Extend(i, j) without the privacy guarantee. We then prove that F_{i,j} ≥ F′_{i,j} holds for all i ∈ [0, k], j ∈ [−k, k].

We prove the statement by induction on i. For i = 0, F(0, 0) = F′(0, 0) = 0. Suppose F(i − 1, j) ≥ F′(i − 1, j) for all j; then for i,

    F(i, j) = max { Extend(i − 1, j) + 1,      if i − 1 ≥ 0;
                    Extend(i − 1, j − 1),      if j − 1 ≥ −k, i − 1 ≥ 0;
                    Extend(i − 1, j + 1) + 1,  if j + 1 ≤ k, i − 1 ≥ 0 }.

For Extend(i − 1, j), we have

    Extend(i − 1, j) = F(i − 1, j) + DPLCP.Query(F(i − 1, j), F(i − 1, j) + j)
                     ≥ F(i − 1, j) + LCP(F(i − 1, j), F(i − 1, j) + j)
                     ≥ F′(i − 1, j) + LCP(F′(i − 1, j), F′(i − 1, j) + j)
                     = Extend′(i − 1, j).

The second step holds because w̃ ≥ w in Query (Theorem 5.1), and the third step follows from F(i − 1, j) ≥ F′(i − 1, j) together with Lemma 5.7. Thus, F(i, j) = max_{j2 ∈ {j, j−1, j+1}} Extend(i − 1, j2) ≥ max_{j2 ∈ {j, j−1, j+1}} Extend′(i − 1, j2) = F′(i, j). Since r̃ = min{r̃ : F(r̃, 0) = n} and r = min{r : F′(r, 0) = n}, we have F(r, 0) ≥ F′(r, 0) = n. Therefore r̃ ≤ r.

Part II. Let G(L, R, j) := D_edit(A[L : R], B[L + j : R + j]). In this part, we prove by induction on i that G(1, F_{i,j}, j) ≤ i · (1 + O((log k + log log n)/(1 + e^{ε/(log k log n)}))).

For i = 0, F_{0,0} = 0 and the statement holds. Suppose the claim holds for i − 1, i.e., G(1, F_{i−1,j}, j) ≤ (i − 1) · (1 + O((log k + log log n)/(1 + e^{ε/(log k log n)}))); we prove it for i.

Because F(i, j) = max_{j2 ∈ {j, j−1, j+1}} Extend(i − 1, j2), there is some j2 ∈ {j, j − 1, j + 1} such that F_{i,j} = F_{i−1,j2} + DPLCP.Query(F_{i−1,j2}, F_{i−1,j2} + j2). Let Q := DPLCP.Query(F_{i−1,j2}, F_{i−1,j2} + j2). Therefore,

    G(1, F_{i,j}, j) ≤ G(1, F_{i−1,j2} + Q, j2) + 1
                     ≤ G(1, F_{i−1,j2}, j2) + G(F_{i−1,j2}, F_{i−1,j2} + Q, j2) + 1
                     ≤ G(1, F_{i−1,j2}, j2) + 1 + O((log k + log log n)/(1 + e^{ε/(log k log n)}))
                     ≤ i · (1 + O((log k + log log n)/(1 + e^{ε/(log k log n)}))).

The third step follows from Theorem 5.1, and the fourth step follows from the induction hypothesis. Therefore, r = G(1, F_{r̃,0}, 0) ≤ r̃ · (1 + O((log k + log log n)/(1 + e^{ε/(log k log n)}))), and the proof is complete.

Remark 5.9. To the best of our knowledge, this is the first edit distance algorithm based on noisy LCP implementations. In particular, we prove a structural result: if the LCP has additive query error δ, then we can implement an edit distance data structure with additive error O(kδ). Compared to standard relative error approximation, additive error approximation for edit distance is relatively unexplored (see, e.g., [BCFN22] for using additive approximation to solve the gap edit distance problem). We hope this structural result sheds light on additive error edit distance algorithms.

6 Conclusion

We study the differentially private Hamming distance and edit distance data structure problem in the function release communication model. Data structures of this type are ε-DP against any sequence of queries of arbitrary length. For Hamming distance, our data structure has query time Õ(mk + n) and error Õ(k/e^{ε/log k}). For edit distance, our data structure has query time Õ(mk² + n) and error Õ(k/e^{ε/(log k log n)}). While the runtime of our data structures (especially for edit distance) is nearly optimal, it remains interesting to design data structures with better utility in this model.

References
[ACG+ 16] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal
Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the
2016 ACM SIGSAC conference on computer and communications security, pages 308–
318, 2016.

[AR17] Francesco Aldà and Benjamin I.P. Rubinstein. The bernstein mechanism: function
release under differential privacy. In Proceedings of the Thirty-First AAAI Conference
on Artificial Intelligence, AAAI’17, page 1705–1711. AAAI Press, 2017.

[BCFN22] Karl Bringmann, Alejandro Cassis, Nick Fischer, and Vasileios Nakos. Improved
Sublinear-Time Edit Distance for Preprocessed Strings. In Mikolaj Bojańczyk,
Emanuela Merelli, and David P. Woodruff, editors, 49th International Colloquium on
Automata, Languages, and Programming (ICALP 2022), Leibniz International Pro-
ceedings in Informatics (LIPIcs), pages 32:1–32:20, Dagstuhl, Germany, 2022. Schloss
Dagstuhl – Leibniz-Zentrum für Informatik.

[BEG+ 21] Mahdi Boroujeni, Soheil Ehsani, Mohammad Ghodsi, MohammadTaghi HajiAghayi,
and Saeed Seddighin. Approximating edit distance in truly subquadratic time: Quan-
tum and mapreduce. Journal of the ACM (JACM), 68(3):1–41, 2021.

[BK23] Sudatta Bhattacharya and Michal Kouckỳ. Locally consistent decomposition of strings
with applications to edit distance sketching. In Proceedings of the 55th Annual ACM
Symposium on Theory of Computing, pages 219–232, 2023.

[BKM+ 22] Amos Beimel, Haim Kaplan, Yishay Mansour, Kobbi Nissim, Thatchaphol Saranu-
rak, and Uri Stemmer. Dynamic algorithms against an adaptive adversary: Generic
constructions and lower bounds. In Proceedings of the 54th Annual ACM SIGACT
Symposium on Theory of Computing, pages 1671–1684, 2022.

[BLM+ 24] Arturs Backurs, Zinan Lin, Sepideh Mahabadi, Sandeep Silwal, and Jakub Tarnawski.
Efficiently computing similarities to private datasets. In The Twelfth International
Conference on Learning Representations, 2024.

[BPS19] Eugene Bagdasaryan, Omid Poursaeed, and Vitaly Shmatikov. Differential privacy
has disparate impact on model accuracy. Advances in Neural Information Processing
Systems (NeurIPS), 32:15479–15488, 2019.

[BWY21] Bonnie Berger, Michael S. Waterman, and Yun William Yu. Levenshtein distance, se-
quence comparison and biological database search. IEEE Transactions on Information
Theory, 2021.

[BZ16] Djamal Belazzougui and Qin Zhang. Edit distance: Sketching, streaming, and docu-
ment exchange. In 2016 IEEE 57th Annual Symposium on Foundations of Computer
Science (FOCS), pages 51–60. IEEE, 2016.

[CGK16a] Diptarka Chakraborty, Elazar Goldenberg, and Michal Kouckỳ. Streaming algo-
rithms for computing edit distance without exploiting suffix trees. arXiv preprint
arXiv:1607.03718, 2016.

[CGK16b] Diptarka Chakraborty, Elazar Goldenberg, and Michal Kouckỳ. Streaming algorithms
for embedding and computing edit distance in the low distance regime. In Proceedings
of the forty-eighth annual ACM symposium on Theory of Computing, pages 712–725,
2016.
[Cha02] Moses Charikar. Similarity estimation techniques from rounding algorithms. In Pro-
ceedings of the thiry-fourth annual ACM symposium on Theory of computing, pages
380–388, 2002.
[CM08] Kamalika Chaudhuri and Claire Monteleoni. Privacy-preserving logistic regression. In
NIPS, volume 8, pages 289–296. Citeseer, 2008.
[CS21] Benjamin Coleman and Anshumali Shrivastava. A one-pass distributed and private
sketch for kernel sums with applications to machine learning at scale. In Proceedings
of the 2021 ACM SIGSAC Conference on Computer and Communications Security,
CCS ’21, page 3252–3265, New York, NY, USA, 2021. Association for Computing
Machinery.
[CSW+ 23] Yeshwanth Cherapanamjeri, Sandeep Silwal, David Woodruff, Fred Zhang, Qiuyi
Zhang, and Samson Zhou. Robust algorithms on adaptive inputs from bounded ad-
versaries. In The Eleventh International Conference on Learning Representations,
2023.
[DA10] Jasha Droppo and Alex Acero. Context dependent phonetic string edit distance for
automatic speech recognition. In 2010 IEEE International Conference on Acoustics,
Speech and Signal Processing, 2010.
[DKM+ 06] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni
Naor. Our data, ourselves: Privacy via distributed noise generation. In Annual
International Conference on the Theory and Applications of Cryptographic Techniques,
pages 486–503. Springer, 2006.
[DKWS22] Wenxin Ding, Gautam Kamath, Weina Wang, and Nihar B. Shah. Calibration with
privacy in peer review. In 2022 IEEE International Symposium on Information Theory
(ISIT), 2022.
[Dwo06] Cynthia Dwork. Differential privacy. In International Colloquium on Automata, Lan-
guages, and Programming (ICALP), pages 1–12, 2006.
[FARL06] Jonathan G. Fiscus, Jerome Ajot, Nicolas Radde, and Christophe Laprun. Multiple di-
mension Levenshtein edit distance calculations for evaluating automatic speech recog-
nition systems during simultaneous speech. In Nicoletta Calzolari, Khalid Choukri,
Aldo Gangemi, Bente Maegaard, Joseph Mariani, Jan Odijk, and Daniel Tapias, ed-
itors, Proceedings of the Fifth International Conference on Language Resources and
Evaluation (LREC’06), Genoa, Italy, 2006. European Language Resources Association
(ELRA).
[GKKS23] Elazar Goldenberg, Tomasz Kociumaka, Robert Krauthgamer, and Barna Saha. An
Algorithmic Bridge Between Hamming and Levenshtein Distances. In Yael Tau-
man Kalai, editor, 14th Innovations in Theoretical Computer Science Conference
(ITCS 2023), Leibniz International Proceedings in Informatics (LIPIcs), pages 58:1–
58:23, Dagstuhl, Germany, 2023. Schloss Dagstuhl – Leibniz-Zentrum für Informatik.

[GKS19] Elazar Goldenberg, Robert Krauthgamer, and Barna Saha. Sublinear algorithms for
gap edit distance. In 2019 IEEE 60th Annual Symposium on Foundations of Computer
Science (FOCS), 2019.

[GRS20] Elazar Goldenberg, Aviad Rubinstein, and Barna Saha. Does preprocessing help in fast
sequence comparisons? In Proceedings of the 52nd Annual ACM SIGACT Symposium
on Theory of Computing (STOC), pages 657–670, 2020.

[GS18] Omer Gold and Micha Sharir. Dynamic time warping and geometric edit distance:
Breaking the quadratic barrier. ACM Trans. Algorithms, 2018.

[GSY23] Yeqi Gao, Zhao Song, and Xin Yang. Differentially private attention computation.
arXiv preprint arXiv:2305.04701, 2023.

[Ham50] Richard W Hamming. Error detecting and error correcting codes. The Bell System
Technical Journal, 29(2):147–160, 1950.

[HKM+ 22] Avinatan Hassidim, Haim Kaplan, Yishay Mansour, Yossi Matias, and Uri Stemmer.
Adversarially robust streaming algorithms via differential privacy. J. ACM, 69(6),
2022.

[HN99] Zhexue Huang and Mingkui Ng. A fuzzy k-modes algorithm for clustering categorical
data. IEEE Transactions on Fuzzy Systems, 7(4):446–452, 1999.

[HR14] Zhiyi Huang and Aaron Roth. Exploiting metric structure for efficient private query
release. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Dis-
crete Algorithms, SODA ’14, page 523–534, 2014.

[HRW13] Rob Hall, Alessandro Rinaldo, and Larry Wasserman. Differential privacy for functions
and functional data. J. Mach. Learn. Res., 2013.

[Hua97] Zhexue Huang. Extensions to the k-means algorithm for clustering large data sets
with categorical values. Data Mining and Knowledge Discovery, 2(3):283–304, 1997.

[IM98] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing
the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium
on Theory of Computing, pages 604–613, 1998.

[IP01] Russell Impagliazzo and Ramamohan Paturi. On the complexity of k-sat. Journal of
Computer and System Sciences, 62(2):367–375, 2001.

[IPZ01] Russell Impagliazzo, Ramamohan Paturi, and Francis Zane. Which problems have
strongly exponential complexity? Journal of Computer and System Sciences,
63(4):512–530, 2001.

[JE19] Bargav Jayaraman and David Evans. Evaluating differentially private machine learn-
ing in practice. In 28th USENIX Security Symposium (USENIX Security 19), pages
1895–1912, 2019.

[JNW21] Ce Jin, Jelani Nelson, and Kewen Wu. An Improved Sketching Algorithm for Edit Dis-
tance. In 38th International Symposium on Theoretical Aspects of Computer Science
(STACS), pages 45:1–45:16, 2021.

[KPS21] Tomasz Kociumaka, Ely Porat, and Tatiana Starikovskaya. Small-space and streaming
pattern matching with k edits. In 2021 IEEE 62nd Annual Symposium on Foundations
of Computer Science (FOCS), pages 885–896. IEEE, 2021.

[KS20] Tomasz Kociumaka and Barna Saha. Sublinear-time algorithms for computing &
embedding gap edit distance. In 2020 IEEE 61st Annual Symposium on Foundations
of Computer Science (FOCS), 2020.

[KS24] Michal Kouckỳ and Michael E Saks. Almost linear size edit distance sketch. In
Proceedings of the 56th Annual ACM Symposium on Theory of Computing, pages
956–967, 2024.

[Lev66] Vladimir I Levenshtein. Binary codes capable of correcting deletions, insertions and
reversals. Soviet Physics Doklady, 10:707–710, 1966.

[LHR+ 24] Erzhi Liu, Jerry Yao-Chieh Hu, Alex Reneau, Zhao Song, and Han Liu. Differentially
private kernel density estimation. arXiv preprint arXiv:2409.01688, 2024.

[LMS98] Gad M. Landau, Eugene Wimberly Myers, and Jeanette P. Schmidt. Incremental
string comparison. SIAM J. Comput., 27:557–582, 1998.

[LV88] Gad M. Landau and Uzi Vishkin. Fast string matching with k differences. Journal of
Computer and System Sciences, 37, 1988.

[LWAL21] Zelun Luo, Daniel J Wu, Ehsan Adeli, and Fei-Fei Li. Scalable differential privacy with
sparse network finetuning. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), pages 5059–5068, 2021.

[Mar09] Pierre-François Marteau. Time warp edit distance with stiffness adjustment for time
series matching. IEEE Transactions on Pattern Analysis and Machine Intelligence,
2009.

[Mye86] Eugene W. Myers. An O(N D) difference algorithm and its variations. Algorithmica,
1986.

[Nav01] Gonzalo Navarro. A guided tour to approximate string matching. ACM Comput.
Surv., 2001.

[NPF12] Mohammad Norouzi, Ali Punjani, and David J Fleet. Fast search in hamming space
with multi-index hashing. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pages 3108–3115. IEEE, 2012.

[NSS17] Timothy Naumovitz, Michael Saks, and C Seshadhri. Accurate and nearly optimal
sublinear approximations to ulam distance. In Proceedings of the Twenty-Eighth An-
nual ACM-SIAM Symposium on Discrete Algorithms, pages 2012–2031. SIAM, 2017.

[PL07] Ely Porat and Ohad Lipsky. Improved sketching of hamming distance with error
correcting. In Combinatorial Pattern Matching, pages 173–182, Berlin, Heidelberg,
2007. Springer Berlin Heidelberg.

[RS20] Aviad Rubinstein and Zhao Song. Reducing approximate longest common subsequence
to approximate edit distance. In Proceedings of the Fourteenth Annual ACM-SIAM
Symposium on Discrete Algorithms, pages 1591–1600. SIAM, 2020.

[RSSS19] Aviad Rubinstein, Saeed Seddighin, Zhao Song, and Xiaorui Sun. Approximation
algorithms for lcs and lis with truly improved running times. FOCS, 2019.

[SGAM+ 15] Grigori Sidorov, Helena Gómez-Adorno, Ilia Markov, David Pinto, and Nahun Loya.
Computing text similarity using tree edit distance. In 2015 Annual Conference of the
North American Fuzzy Information Processing Society (NAFIPS) held jointly with
2015 5th World Conference on Soft Computing (WConSC), 2015.

[SWYZ23] Zhao Song, Yitan Wang, Zheng Yu, and Lichen Zhang. Sketching for first order
method: efficient algorithm for low-bandwidth channel and vulnerability. In Interna-
tional Conference on Machine Learning (ICML), pages 32365–32417. PMLR, 2023.

[SYY+ 23] Jiankai Sun, Xin Yang, Yuanshun Yao, Junyuan Xie, Di Wu, and Chong Wang. Dpauc:
differentially private auc computation in federated learning. In Proceedings of the
Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Confer-
ence on Innovative Applications of Artificial Intelligence and Thirteenth Symposium
on Educational Advances in Artificial Intelligence, AAAI’23/IAAI’23/EAAI’23. AAAI
Press, 2023.

[SYYZ23] Zhao Song, Xin Yang, Yuanyuan Yang, and Lichen Zhang. Sketching meets differential
privacy: fast algorithm for dynamic kronecker projection maintenance. In Interna-
tional Conference on Machine Learning (ICML), pages 32418–32462. PMLR, 2023.

[TF20] Aleksei Triastcyn and Boi Faltings. Bayesian differential privacy for machine learning.
In International Conference on Machine Learning, pages 9583–9592. PMLR, 2020.

[TKP19] Reihaneh Torkzadehmahani, Peter Kairouz, and Benedict Paten. Dp-cgan: Differen-
tially private synthetic data and label generation. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition Workshops (CVPR Work-
shop), 2019.

[Ukk85] Esko Ukkonen. Finding approximate patterns in strings. Journal of Algorithms,
6(1):132–137, 1985.

[WHZ+ 15] Xiao Shaun Wang, Yan Huang, Yongan Zhao, Haixu Tang, XiaoFeng Wang, and Diyue
Bu. Efficient genome-wide, privacy-preserving similar patient query based on private
edit distance. In Proceedings of the 22nd ACM SIGSAC Conference on Computer
and Communications Security, CCS ’15, page 492–503, New York, NY, USA, 2015.
Association for Computing Machinery.

[WJF+ 16] Ziteng Wang, Chi Jin, Kai Fan, Jiaqi Zhang, Junliang Huang, Yiqiao Zhong, and Liwei
Wang. Differentially private data releasing for smooth queries. Journal of Machine
Learning Research, 2016.

[WK18] Benjamin Weggenmann and Florian Kerschbaum. Syntf: Synthetic and differentially
private term frequency vectors for privacy-preserving text mining. In The 41st Interna-
tional ACM SIGIR Conference on Research & Development in Information Retrieval,
pages 305–314, 2018.

[WM10] Oliver Williams and Frank McSherry. Probabilistic inference and differential privacy.
Advances in Neural Information Processing Systems (NeurIPS), 23:2451–2459, 2010.

[WNM23] Tal Wagner, Yonatan Naamad, and Nina Mishra. Fast private kernel density esti-
mation via locality sensitive quantization. In Proceedings of the 40th International
Conference on Machine Learning, ICML’23. JMLR.org, 2023.

[WYY+ 22] Ruihan Wu, Xin Yang, Yuanshun Yao, Jiankai Sun, Tianyi Liu, Kilian Q Weinberger,
and Chong Wang. Differentially private multi-party data release for linear regression.
In The 38th Conference on Uncertainty in Artificial Intelligence, 2022.

[YDW+ 21] Xiang Yue, Minxin Du, Tianhao Wang, Yaliang Li, Huan Sun, and Sherman S. M.
Chow. Differential privacy for text analytics via natural text sanitization. In Findings,
ACL-IJCNLP 2021, 2021.

[YFA21] Brian Young, Tom Faris, and Luigi Armogida. Levenshtein distance as a measure of
accuracy and precision in forensic pcr-mps methods. Forensic Science International:
Genetics, 55:102594, 2021.

[YNB+ 22] Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A. Inan, Gautam
Kamath, Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, Sergey
Yekhanin, and Huishuai Zhang. Differentially private fine-tuning of language mod-
els. In The Tenth International Conference on Learning Representations, ICLR 2022,
2022.

[YSY+ 22] Xin Yang, Jiankai Sun, Yuanshun Yao, Junyuan Xie, and Chong Wang. Differentially
private label protection in split learning. arXiv preprint arXiv:2203.02073, 2022.

[ZYCW20] Yuqing Zhu, Xiang Yu, Manmohan Chandraker, and Yu-Xiang Wang. Private-knn:
Practical differential privacy for computer vision. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), pages 11854–
11862, 2020.

A Proofs for Hamming Distance Data Structure
In this section, we include all proof details for Section 4.

A.1 Proof of Lemma 4.3


Proof of Lemma 4.3. Let E(A), E(A′) be Encode(A) and Encode(A′). Let #(E(A) = S) be the number of bits on which E(A) and S agree, and #(E(A) ≠ S) the number of bits on which they differ. Then the probability that the random flip process transforms E(A) into S is

    Pr[A(A) = S] = (1/(1 + e^{ε/(2M1)}))^{#(E(A)≠S)} · (e^{ε/(2M1)}/(1 + e^{ε/(2M1)}))^{#(E(A)=S)}
                 = (e^{ε/(2M1)})^{#(E(A)=S)} / (1 + e^{ε/(2M1)})^{M1 M2 M3},

where M1 M2 M3 = #(E(A) = S) + #(E(A) ≠ S) is the total number of bits in the sketch. For each position of the input string, Encode changes at most M1 bits of the sketch, and A and A′ differ in only one position. Therefore there are at most 2M1 different bits between E(A) and E(A′), and hence |#(E(A) = S) − #(E(A′) = S)| ≤ 2M1. So we have

    Pr[A(A) = S] / Pr[A(A′) = S] ≤ (e^{ε/(2M1)})^{|#(E(A)=S) − #(E(A′)=S)|}
                                 ≤ (e^{ε/(2M1)})^{2M1}
                                 = e^ε.

Thus we complete the proof.

A.2 Proof of Lemma 4.4


Proof of Lemma 4.4. h is a hash function drawn uniformly at random from all functions [2n] → [M2]. For a fixed j, the indicators 1[h(p) = j] over p ∈ T are independent random variables, each equal to 1 with probability 1/M2 and 0 with probability 1 − 1/M2. So we have

    Pr[|T_j| ≥ 10 log k] = Pr[Σ_{p∈T} 1[h(p) = j] ≥ 10 log k]
                         = Σ_{d=10 log k}^{|T|} (|T| choose d) (1/M2)^d (1 − 1/M2)^{|T|−d}
                         ≤ Σ_{d=10 log k}^{|T|} (|T|!/(d! (|T| − d)!)) (1/M2)^d
                         ≤ Σ_{d=10 log k}^{|T|} (|T|^d/d!) (1/M2)^d
                         ≤ Σ_{d=10 log k}^{|T|} (1/d!) (1/2)^d
                         ≤ (1/(10 log k)!) Σ_{d=10 log k}^{|T|} (1/2)^d
                         ≤ 1/(200k).

The fifth step follows from the facts that |T| ≤ k and M2 = 2k. Therefore, by a union bound over all j ∈ [M2], we can show

    Pr[∀j ∈ [M2], |T_j| < 10 log k] ≥ 1 − 2k · (1/(200k)) = 0.99.

Thus, we complete the proof.

A.3 Proof of Lemma 4.5


Proof of Lemma 4.5. g is a hash function drawn uniformly at random from all functions [2n] × [M1] → [M3]. For every single i ∈ [M1], define the event E_i that the 2|T_j| values in {g(2(p−1) + A_p, i) | p ∈ T_j} ∪ {g(2(p−1) + B_p, i) | p ∈ T_j} are mapped to distinct positions. Then

    Pr[E_i] = Π_{c=1}^{2|T_j|} (1 − c/M3)
            ≥ 1 − Σ_{c=1}^{2|T_j|} c/M3
            = 1 − |T_j|(2|T_j| + 1)/M3
            ≥ 1 − (10 log k)(20 log k + 1)/(400 log² k)
            ≥ 1/2 − o(1).

The fourth step follows from Lemma 4.4, which holds with probability 0.99.

For different i ∈ [M1], the events E_i are independent. Therefore, the probability that all E_i fail is at most (1/2 + o(1))^{M1} < 1/(1000k). By a union bound, the probability that for every j ∈ [M2] there exists at least one i ∈ [M1] such that E_i holds is at least

    1 − M2 · (1/2 + o(1))^{M1} ≥ 1 − M2/(1000k) ≥ 0.98.

A.4 Proof of Lemma 4.6


Proof of Lemma 4.6. From Lemma 4.5, for all j there is at least one i such that the set {g(2(p−1) + A_p, i) | p ∈ T_j} ∪ {g(2(p−1) + B_p, i) | p ∈ T_j} contains 2|T_j| distinct values. Therefore, for that i, E(A)_{i,j,1:M3} and E(B)_{i,j,1:M3} differ in exactly 2|T_j| bits. For the remaining i, E(A)_{i,j,1:M3} and E(B)_{i,j,1:M3} differ in at most 2|T_j| bits. So we have 0.5 · Σ_{j=1}^{M2} max_{i∈[M1]} (Σ_{c=1}^{M3} |E(A)_{i,j,c} − E(B)_{i,j,c}|) = 0.5 · Σ_{j=1}^{M2} 2|T_j| = |T| = D_ham(A, B).

B Differentially Private Longest Common Prefix

We design an efficient, ε-DP longest common prefix (LCP) data structure in this section. Specifically, given positions i and j in A and B respectively, we need to compute the maximum l such that A[i : i + l] = B[j : j + l]. For this problem, we build a differentially private data structure (Algorithm 3 and Algorithm 4). The main contribution is a novel utility analysis that accounts for the error incurred by the differentially private bit flipping.

Algorithm 3 Differentially Private Longest Common Prefix, Part 1

1: data structure DPLCP  ⊲ Theorem 5.1
2: members
3:     T^A_{i,j}, T^B_{i,j} for all i ∈ [log n], j ∈ [2^i]
4:         ⊲ T_{i,j} stores the Hamming sketch (Algorithm 1) of the interval [j · n/2^i, (j + 1) · n/2^i]
5: end members
6:
7: procedure BuildTree(A ∈ {0,1}^n, n ∈ N+, k ∈ N+, ε ∈ R)  ⊲ Lemma B.3
8:     M1 ← log k + log log n + 10, M2 ← 1, M3 ← 10, ε′ ← ε/log n
9:     for i from 0 to log n do
10:         for j from 0 to 2^i − 1 do
11:             T*_{i,j} ← DPHammingDistance.Init(A[j · n/2^i : (j + 1) · n/2^i], M1, M2, M3, ε′)  ⊲ Algorithm 1
12:         end for
13:     end for
14:     return T*
15: end procedure
16:
17: procedure Init(A ∈ {0,1}^n, n ∈ N+, k ∈ N+, ε ∈ R)  ⊲ Lemma B.1
18:     T^A ← BuildTree(A, n, k, ε)
19: end procedure
20:
21: procedure QueryInit(B ∈ {0,1}^n, n ∈ N+, k ∈ N+)  ⊲ Lemma B.1
22:     T^B ← BuildTree(B, n, k, ∞)  ⊲ the query sketch is not flipped
23: end procedure
24:
25: procedure IntervalSketch(T, pl ∈ [n], pr ∈ [n])
26:     Divide the interval [pl, pr] into O(log n) intervals, each stored at a node of the tree T.
27:     Retrieve the Hamming distance sketches of these nodes as S1, S2, . . . , St.
28:     Initialize a new sketch S ← 0 of the same size as the sketches above.
29:     for every position w in the sketch S do
30:         S[w] ← S1[w] ⊕ S2[w] ⊕ · · · ⊕ St[w]
31:     end for
32:     return S
33: end procedure
34:
35: procedure SketchHammingDistance(S^A, S^B ∈ {0,1}^{M1×M2×M3})  ⊲ Lemmas B.4 and B.5
36:     Let M1, M2, M3 be the dimensions of the sketches S^A and S^B
37:     return 0.5 · Σ_{j=1}^{M2} max_{i∈[M1]} (Σ_{c=1}^{M3} |S^A_{i,j,c} − S^B_{i,j,c}|)
38: end procedure
39: end data structure
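For intuition, IntervalSketch relies on two facts: any interval decomposes into O(log n) canonical nodes of the tree, and Encode is linear over GF(2), so XOR-ing node sketches yields the sketch of the union of their intervals. A minimal Python sketch of both steps (ours; it assumes half-open intervals, a power-of-two n, and flattened bit-list sketches for simplicity):

def canonical_nodes(n: int, pl: int, pr: int):
    # Decompose [pl, pr) into canonical nodes (depth, index) of a segment
    # tree over [0, n), as IntervalSketch does.
    out = []
    def walk(depth, idx, lo, hi):
        if pr <= lo or hi <= pl:
            return
        if pl <= lo and hi <= pr:
            out.append((depth, idx))    # T[depth][idx] covers [lo, hi) entirely
            return
        mid = (lo + hi) // 2
        walk(depth + 1, 2 * idx, lo, mid)
        walk(depth + 1, 2 * idx + 1, mid, hi)
    walk(0, 0, 0, n)
    return out

def xor_merge(sketches):
    # Since every position XORs a fixed pattern into the sketch, the sketch
    # of a union of disjoint intervals is the XOR of the interval sketches.
    merged = list(sketches[0])
    for S in sketches[1:]:
        merged = [a ^ b for a, b in zip(merged, S)]
    return merged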

B.1 Time Complexity


We prove the running time of the three operations above.

Algorithm 4 Differentially Private Longest Common Prefix, Part 2

1: data structure DPLCP  ⊲ Theorem 5.1
2: procedure Query(i ∈ [n], j ∈ [n])  ⊲ Lemmas B.2 and B.6
3:     L ← 0, R ← n
4:     while L ≠ R do
5:         mid ← ⌈(L + R)/2⌉
6:         S^A ← IntervalSketch(T^A, i, i + mid)  ⊲ Algorithm 3
7:         S^B ← IntervalSketch(T^B, j, j + mid)  ⊲ Algorithm 3
8:         threshold ← 1.5 M1 M3/(1 + e^{ε/(log k log n)})
9:         if SketchHammingDistance(S^A, S^B) ≤ threshold then  ⊲ Algorithm 3
10:             L ← mid
11:         else
12:             R ← mid − 1
13:         end if
14:     end while
15:     return L
16: end procedure
17: end data structure
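The control flow of Query is a plain binary search against a noisy oracle. A minimal Python sketch (ours, with ham_est standing in for the IntervalSketch/SketchHammingDistance composition):

def dp_lcp_query(ham_est, n: int, threshold: float) -> int:
    # ham_est(mid) returns the sketched (noisy) Hamming distance between
    # A[i : i + mid] and B[j : j + mid]; we keep the largest prefix length
    # whose noisy distance stays below the threshold, as in Algorithm 4.
    L, R = 0, n
    while L != R:
        mid = (L + R + 1) // 2          # ceil((L + R) / 2)
        if ham_est(mid) <= threshold:
            L = mid                      # prefix still looks (nearly) identical
        else:
            R = mid - 1
    return L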

Lemma B.1. The running time of each of Init and QueryInit (Algorithm 3) is O(n log n(log k + log log n)).

Proof. From Lemma 4.2, the running time of building node T_{i,j} is O((n/2^i) · M1). Therefore the total building time over all nodes is

    Σ_{i=0}^{log n} Σ_{j=0}^{2^i − 1} (n/2^i) · M1 = Σ_{i=0}^{log n} 2^i · (n/2^i) · M1 = O(n log n(log k + log log n)).

Thus, we complete the proof.

Lemma B.2. The running time of Query (Algorithm 4) is O(log² n(log k + log log n)).

Proof. Query (Algorithm 4) performs a binary search with log n iterations in total. In each iteration, we divide the interval into O(log n) subintervals and merge their sketches of size M1 M2 M3. So the time complexity is O(log² n(log k + log log n)).

B.2 Privacy Guarantee

Lemma B.3. The data structure DPLCP (Algorithm 3 and Algorithm 4) is ε-DP.

Proof. On each node, we build a Hamming distance data structure DPHammingDistance that is (ε/log n)-DP. For two strings A and A′ that differ in only one bit, since every position appears in at most log n nodes of the tree, for any output S,

    Pr[BuildTree(A) = S] / Pr[BuildTree(A′) = S] ≤ (e^{ε/log n})^{log n} = e^ε.

Thus we complete the proof.

B.3 Utility Guarantee

Before analyzing the error of the query, we first bound the error of SketchHammingDistance (Algorithm 3).

Lemma B.4. Select M1 = log k + log log n + 10, M2 = 1, M3 = 10 for the DPHammingDistance data structures in BuildTree (Algorithm 3). Let z be the true Hamming distance between the two strings A[i : i + mid] and B[j : j + mid], and let z̃ be the output of SketchHammingDistance (Algorithm 3). When ε = +∞ (i.e., without the random flip process), we have:

• if z = 0, then with probability 1, z̃ = 0;

• if z ≠ 0, then with probability 1 − 1/(300k² log n), z̃ ≠ 0.

Proof. Our proof follows the proof of Lemma 4.6. We consider the cases z = 0 and z ≠ 0 respectively.

When z = 0, the strings A[i : i + mid] and B[j : j + mid] are identical. Therefore, the outputs of the hash functions are also identical, and the sketches S^A and S^B returned by IntervalSketch (Algorithm 3) are identical. Then z̃ = 0.

When z ≠ 0, define the set Q := {p ∈ [mid] | A[i + p − 1] ≠ B[j + p − 1]} of positions where the strings differ, so |Q| = z. Note that M1 = log k + log log n + 10, M2 = 1, M3 = 10 and S^A, S^B ∈ {0,1}^{M1×M2×M3}. For every i′ ∈ [M1], the probability that S^A_{i′} and S^B_{i′} are identical is the probability that every c ∈ [M3] is hit an even number of times by the positions in Q. Formally, define the event E as [∀j′, |{p ∈ Q | g(p) = j′}| mod 2 = 0], and define the event E′ as the event that exactly one position is hit an odd number of times by Q_1, . . . , Q_{z−1}. Then

    Pr[E] = Pr[E′] · Pr[E | E′] ≤ Pr[E | E′] = 1/M3.

The last step holds because Pr[E | E′] is the probability that g(Q_z) equals the unique position hit an odd number of times; there are M3 positions in total and the hash function g is uniform, so this probability is 1/M3.

For different i′ ∈ [M1], the events E are independent. So the probability that z̃ ≠ 0 is the probability that E fails for at least one i′, which is at least 1 − (1/M3)^{M1} = 1 − (1/10)^{log k + log log n + 10} > 1 − 1/(300k² log n).

Lemma B.5. Let M1 = log k + log log n + 10, M2 = 1, M3 = 10. Let z be the true Hamming distance between the two strings A[i : i + mid] and B[j : j + mid], and let z̃ be the output of SketchHammingDistance (Algorithm 3). With the random flip process with DP parameter ε, we have:

• when z = 0, with probability 1 − 1/(300k² log n), z̃ < (1 + o(1)) M1 M3/(1 + e^{ε/(log k log n)});

• when z > 3 M1 M3/(1 + e^{ε/(log k log n)}), with probability 1 − 1/(300k² log n), z̃ > (2 − o(1)) M1 M3/(1 + e^{ε/(log k log n)}).

Proof. In the random flip process in DPHammingDistance (Algorithms 1 and 3), the privacy parameter is ε′ = ε/log n, so each bit of the sketch is flipped with independent probability 1/(1 + e^{ε/(log k log n)}) (suppressing constant factors in the exponent). We prove the cases z = 0 and z > 3 M1 M3/(1 + e^{ε/(log k log n)}) respectively.
When z = 0, similar to the proof of Lemma 4.7, we view the flipping operations as random variables: let R_{i,j,c} be 1 if the sketch bit S^A_{i,j,c} is flipped, and 0 otherwise. From Lemma B.4, S^A and S^B are identical before the flipping. Then we have

    |z − z̃| ≤ max_{i∈[M1]} Σ_{c=1}^{M3} R_{i,j,c} ≤ Σ_{i=1}^{M1} Σ_{c=1}^{M3} R_{i,j,c}.

Since the R_{i,j,c} are independent Bernoulli random variables, Hoeffding's inequality (Lemma 3.2) yields

    Pr[|Σ_{i,c} R_{i,j,c} − M1 M3 · E[R_{i,j,c}]| > L] ≤ 2e^{−2L²/(M1 M3)}.

When L = M1 M3,

    Pr[|Σ_{i,c} R_{i,j,c} − M1 M3 · E[R_{i,j,c}]| > L] ≤ 2e^{−2 M1 M3} ≤ e^{−2(log k + log log n)} ≤ 1/(300k² log n).

Thus we complete the z = 0 case.

When z > 3 M1 M3/(1 + e^{ε/(log k log n)}), the proof is similar to the z = 0 case: with probability 1 − 1/(300k² log n), we have |z − z̃| < (1 + o(1)) M1 M3/(1 + e^{ε/(log k log n)}), and thus z̃ > (2 − o(1)) M1 M3/(1 + e^{ε/(log k log n)}).
Lemma B.6. Let w̃ be the output of Query(i, j) (Algorithm 4), and let w be the longest common prefix of A[i : n] and B[j : n]. With probability 1 − 1/(300k²), we have: 1) w̃ ≥ w; 2) D_ham(A[i : i + w̃], B[j : j + w̃]) ≤ 3 M1 M3/(1 + e^{ε/(log k log n)}).

Proof. In Query(i, j) (Algorithm 4), we use a binary search to find the optimal w. The binary search makes log n calls to SketchHammingDistance in total. Define threshold := 1.5 M1 M3/(1 + e^{ε/(log k log n)}). Call a return value of SketchHammingDistance good if: 1) when z = 0, z̃ < threshold; and 2) when z > 2 · threshold, z̃ > threshold, where z and z̃ are as defined in Lemma B.5.

By Lemma B.5, each SketchHammingDistance call is good with probability at least 1 − 1/(300k² log n). There are log n SketchHammingDistance calls in the binary search, so by a union bound, the probability that all of them are good is at least 1 − 1/(300k²).

When all answers of SketchHammingDistance are good, by the definition of the binary search, for any two positions L, R such that D_ham(A[i : i + L], B[j : j + L]) = 0 and D_ham(A[i : i + R], B[j : j + R]) ≥ 2 · threshold, we have L ≤ w̃ ≤ R. Next, we prove w ≤ w̃ and D_ham(A[i : i + w̃], B[j : j + w̃]) ≤ 3 M1 M3/(1 + e^{ε/(log k log n)}) respectively.

Since w is the true longest common prefix of A[i : n] and B[j : n], we have D_ham(A[i : i + w], B[j : j + w]) = 0. Taking L = w, we get w = L ≤ w̃.

Let R be the minimum value such that D_ham(A[i : i + R], B[j : j + R]) ≥ 2 · threshold. Because D_ham(A[i : i + R], B[j : j + R]) is monotone in R and w̃ ≤ R, we have D_ham(A[i : i + w̃], B[j : j + w̃]) ≤ D_ham(A[i : i + R], B[j : j + R]) = 2 · threshold = 3 M1 M3/(1 + e^{ε/(log k log n)}).

Thus we complete the proof.

