An Approximate Algorithm For Maximum Inner Product Search Over Streaming Sparse Vectors
1 INTRODUCTION
Many applications of information retrieval, as the name of the discipline suggests, reduce to or
involve the fundamental and familiar question of retrieval. In its most general form, it aims to solve
the following problem:
\[
\operatorname*{arg\,max}^{(k)}_{x \in \mathcal{X}} \; f(q, x), \tag{1}
\]
to find, from a collection X, a subset of 𝑘 objects that are the most relevant to a query object
𝑞 ∈ Q according to a similarity function 𝑓 : Q × X → R. In many instances, this manifests as the
Maximum Inner Product Search (MIPS) problem where X, Q ⊂ R𝑛 and 𝑓 (·, ·) is the inner product
of its arguments:
\[
\operatorname*{arg\,max}^{(k)}_{x \in \mathcal{X}} \; \langle q, x \rangle. \tag{2}
\]
As a prominent example, consider a multi-stage ranking system [4, 6, 60] in the context of text
retrieval. The cascade of ranking functions often begins with lexical or semantic similarity search
which can be formalized using Equation (2).
When similarity is based on Term Frequency-Inverse Document Frequency (TF-IDF), for example,
X is made up of high-dimensional vectors, one per document. Each document vector contains non-
zero entries that correspond to terms and their frequencies in that document. Here, there is a one-to-one mapping between dimensions in sparse document vectors and terms in the vocabulary. Each
non-zero entry of a query vector 𝑞 records the corresponding term’s inverse document frequency.
BM25 [48, 49] and many other popular lexical similarity measures can similarly be expressed in the
form above.
When similarity is based on the semantic closeness of pieces of text, then vectors can be produced
by embedding models (e.g., [22, 42, 43, 47]). This formulation trivially extends to joint lexical-
semantic search [12, 14, 32, 57] too.
This deceptively simple problem is difficult to solve efficiently and effectively in practice. When
the coordinates of each vector are almost surely non-zero—a case we refer to as dense vectors—then
there are volumes of algorithms such as graph-based methods [30, 37, 50, 64], product quantiza-
tion [27, 29, 31], and random projections [1–3] that may be used to quickly find an approximate
solution to Equation (2). But when vectors are sparse in a high-dimensional space (i.e., have thou-
sands to millions of dimensions) with very few non-zero entries, then no general efficient solution
exists: Because of the near-orthogonality of most vectors in a sparse high-dimensional regime,
algorithms for MIPS do not port over successfully.
It is only by imposing additional constraints on the vectors that the literature approaches this problem at scale and offers solutions that meet certain memory, time, and accuracy requirements.
Algorithms that rely on sketching cover only binary or categorical-valued vectors [46, 55]. Inverted
index-based algorithms that are more commonly used in information retrieval, such as WAND and
its descendants [11, 19, 20, 39, 40] and JASS [36], as well as signature-based algorithms [7, 15, 26]
all make a number of crucial assumptions: that vectors are non-negative and integer-valued; that
their non-zero entries follow a Zipfian distribution; that the share of the contribution of entries to
the final score is non-uniform (i.e., some entries contribute more heavily to the final score than
others); and that query vectors have very few non-zero entries.
Many of these constraints have historically held given the nature of text data and keyword
queries in search engines. But when the sparse vectors are the output of embedding models [9,
17, 22, 23, 25, 38, 61, 65], many of these assumptions need not hold. For example, query vectors
produced by the SPLADE model [23] have, on average, about 43 non-zero real-valued entries on the
MS MARCO Passage v1 dataset [41]—far too many for algorithms such as WAND to operate
efficiently [33] and, without discretization into integers, incompatible with existing algorithms.
While there are efforts to make such model-generated representations more sparse by way of
regularization or pooling [33, 59], the underlying Sparse MIPS problem (SMIPS) for unconstrained
real vectors remains mostly unexplored.
We address that gap in this work because we believe SMIPS to be of increasing importance, as evidenced by the examples above. Efficiently solving the SMIPS problem enables further
innovation in text retrieval and other related areas. In our search for an algorithm, we pay particu-
lar attention to the more difficult online SMIPS problem, where we assume no knowledge of the
streaming collection X and require that the algorithm supports online insertions and deletions.
We introduce this particular challenge to support real-world use-cases where collections change
rapidly, as well as emerging research on Retrieval-Enhanced Machine Learning [62] where a learn-
ing algorithm interacts with a retrieval system during the training process, thereby needing to
search for, insert, and delete objects in and from a dynamic collection.
Furthermore, we explore the online SMIPS problem in the context of a vector database depicted
in Figure 1. In particular, we assume that the system has (and is required to have) an efficient storage
system that contains all active vectors. In this setup, the (exact or approximate) top-𝑘 retrieval
engine may access the vector storage during query execution.
Fig. 1. A vector database system consisting of a storage system and an exact or approximate top-𝑘 retrieval engine that solves the MIPS/SMIPS problem of Equation (2).

Given the setup above, we establish a baseline by revisiting a naïve, exact algorithm, which we call LinScan, to solve Equation (2). Here, 𝑞, 𝑥 ∈ R𝑛 and their numbers of non-zero entries are far smaller than 𝑛. That is, 𝜓𝑞, 𝜓𝑑 ≪ 𝑛, where 𝜓𝑞 and 𝜓𝑑 denote the number of non-zero entries in a query vector and a document vector, respectively. LinScan simply stores pairs of
vector identifiers and coordinate values in an inverted index that is optionally compressed. During
retrieval, it traverses the index one coordinate at a time to accumulate the inner product scores.
As we show in this work, LinScan proves surprisingly competitive because it takes advantage of
instruction-level parallelism and efficient cache utilization available on modern CPUs.
We then build on LinScan and propose an online, approximate algorithm called Sinnamon.
It approximately solves Equation (2) for sparse vectors, with the implication that some of the
candidates in the top-𝑘 set may be there erroneously. As we will show, this tolerance for error in
the top-𝑘 set makes it possible to tailor the approximate retrieval stage to meet a given set of time,
space, and accuracy constraints.
Sinnamon makes use of two data structures. One, which is more familiar to the reader, is a
lean, dynamic inverted index. In Sinnamon, this index is simply a mapping from a coordinate to
the identifier of vectors in X that have a non-zero value in that coordinate. In other words, we
maintain inverted lists that contain just vector identifiers. This structure allows us to quickly verify
if the 𝑗th coordinate of a vector 𝑥 is non-zero (i.e., whether 𝑥 [ 𝑗] ≠ 0) and obtain the set of vectors whose 𝑗th coordinate is non-zero: {𝑖 | 𝑥𝑖 [ 𝑗] ≠ 0}.
Coupled with the inverted index is a novel probabilistic sketch data structure.1 A high-dimensional sparse vector 𝑥 ∈ R𝑛 is sketched as 𝑥˜ ∈ R2𝑚 (2𝑚 ≪ 𝑛) using a lossy transformation 𝑒 (·) : R𝑛 → R2𝑚 . Together with the inverted index, this sketch offers an inverse transformation 𝑒 −1 (·) such that for an arbitrary query vector 𝑞, we have that ⟨𝑞, 𝑒 −1 (𝑥˜)⟩ ≥ ⟨𝑞, 𝑥⟩, and the difference can be tightened by the parameters of the algorithm. Another crucial property of our data structure is that, much like the machinery of a Counting Bloom filter [21], obtaining the value of the 𝑗th coordinate of 𝑒 −1 (𝑥˜) can be done efficiently and with access to the same coordinates of the sketch regardless of the input 𝑥˜.
As a document vector2 𝑥𝑖 is inserted into X, we record its identifier in the inverted lists of its non-zero coordinates and subsequently insert its sketch 𝑥˜𝑖 into the 𝑖th column of a sketch matrix
X̃ ∈ R2𝑚× | X | . When we receive a query 𝑞, we use a coordinate-at-a-time algorithm that efficiently
computes the inner product scores by accessing only a single or a fixed group of rows in X̃ per
coordinate. When deleting 𝑥𝑖 , we simply remove its identifier 𝑖 from the inverted index and mark
the 𝑖th column in X̃ as vacant.
In addition to a theoretical analysis of our data structure, we extensively evaluate LinScan and
Sinnamon on benchmark retrieval datasets and demonstrate their many interesting properties
empirically. We show that due to predictable and regular memory access patterns, both algorithms are fast on modern CPUs. We discuss further how we may control the memory usage of Sinnamon
by adjusting the sketch dimensionality 2𝑚, and tune a knob within the sketching algorithm to
control its approximation error and retrieval accuracy. Moreover, due to their coordinate-at-a-time
query processing logic, LinScan and Sinnamon can be trivially turned into anytime algorithms,
terminating retrieval once an allotted time budget is exhausted. Finally, as we will demonstrate
in this work, it is straightforward to parallelize computations in LinScan and Sinnamon. These
properties make these methods the first SMIPS algorithms for real vectors that allow one to explore
the Pareto frontier of effectiveness and time- and space-efficiency, to trivially scale indexes vertically
through parallelization, and to tailor them to the needs of resource-constrained environments and applications.

1 We use “sketch” to describe a compressed data structure that approximates a high-dimensional vector, and “to sketch” to describe the act of compressing a vector into a sketch.
2 We refer to vectors that are expected to be indexed as “document vectors” or simply “documents,” and call the input to the retrieval algorithm a “query vector” or simply a “query.”
We begin this work with a review of the relevant literature on this topic in Section 2. We then
describe the LinScan and Sinnamon algorithms in Sections 3 and 4 respectively. That presentation
is followed by a detailed error analysis of the data structure and retrieval algorithm in Section 5, and
a comprehensive empirical evaluation on two hardware platforms and a variety of sparse vector
collections in Section 6. We conclude this work with a discussion in Section 7.
2 RELATED WORK
The information retrieval literature offers numerous algorithms to solve a constrained variant of
the SMIPS problem that is specifically tailored to text retrieval and keyword search. The research
on that topic has advanced the field considerably over the past few decades, making retrieval one
of the most efficient components in a modern search engine. We do not review this vast literature
here and refer the reader to existing surveys [54, 66] for details. Instead, we briefly review key
algorithms and explain what makes them less suitable to operate in the setup we consider in this
work.
Among the many algorithms in existence, WAND [11] and its intellectual descendants and
incremental optimizations [19, 20, 39, 40] have become the de facto top-𝑘 retrieval solution. The
core logic in WAND and other related algorithms centers around a document-at-a-time traversal
of the inverted index. By maintaining an upper-bound on the partial score contribution of each
coordinate to the final inner product, we can quickly tell if a document may possibly end up in the
top 𝑘 set: if it appears in enough inverted lists whose collective score upper-bound exceeds the
current threshold, then it is a candidate to be fully evaluated; otherwise, it has no prospect of ever
making it to the top-𝑘 set and can therefore be safely rejected without further computation.
The excellent performance of this logic rests on a number of important assumptions, however.
Like all other existing algorithms, it is designed primarily for non-negative vectors. Due to its
irregular memory access pattern, the algorithm operates better when the query has only a few
non-zero coordinates. But perhaps its key assumption is the fact that word frequencies in natural
languages often follow a Zipfian distribution. Given the role that word frequencies play in relevance
measures such as BM25 [49], the Zipfian shape implies that some words (i.e., coordinates) are
inherently far more important than others. That, in turn, boosts or attenuates the contribution
of the coordinate to the final inner product score, making the distribution of upper-bounds over
coordinates quite skewed. Such skewness contributes heavily to the success of WAND and other
dynamic pruning algorithms [54].
While non-negativity and high query sparsity can be relaxed, the algorithm duly redesigned,
and its implementation optimized for a more general regime, the reliance on Zipfian data is less
forgiving. When the distribution of non-zero coordinates deviates from the Zipfian curve, the
coordinate upper-bounds become more uniform, leading to less effective pruning of the inverted
lists, and therefore a less efficient top-𝑘 retrieval. That, among other problems [16], renders this
particular idea of pruning less suitable for a general purpose top-𝑘 retrieval for sparse vectors
where coordinates take on a non-zero value (nearly) uniformly at random.
Other competing index traversal techniques process a query in a coordinate-at-a-time or score-
at-a-time manner. Both of these approaches rely on sorting inverted lists by term frequencies or
their “impact scores” (i.e., precomputed partial scores). The machinery within these algorithms,
however, has the added disadvantage that it relies on the stationarity of the dataset to compute
impact scores or sort postings, making it undesirable for streaming collections that require fast
updates to the index [54].
In contrast to the multitude of data structures and algorithms for stationary datasets, the literature
on retrieval in streaming collections is rather slim and limited to a few works [4, 5, 7]. Notably,
Asadi and Lin [5, 7] used Bloom filters [10] to speed up postings list intersection in conjunctive and
disjunctive queries at the expense of accuracy and memory footprint. These approximate methods
proved instrumental in creating an end-to-end algorithm for retrieval and ranking of streaming
documents [4]. While these works are related to the question we investigate in this work, the
proposed methods are not directly applicable: We are not interested in set membership tests for
which Bloom filters are a natural choice, but rather in approximating real-valued vectors in such a
way that leads to arbitrarily accurate inner product with a query vector.
Another relevant topic is the use of signatures for retrieval and inner product approximation [26,
46, 55]. Pratap et al. propose a simple algorithm [46] to sketch sparse binary vectors in such a way
that the inner product of sketches approximates the inner product of original vectors. The core
idea is to randomly map coordinates in the original space to coordinates in the sketch. When two
or more entries collide, the sketch records the OR of the colliding values. A later work extends this
idea to categorical-valued vectors [55]. Nonetheless, it is not obvious how the proposed sketching
mechanisms may be extended to real-valued vectors.
Deviating from the standard inverted index solution to top-𝑘 retrieval is the work of Goodwin et
al. [26]. As part of what is referred to as the BitFunnel indexing machinery, the authors propose to
record and store a bit signature for every document vector in the index using Bloom filters. These
signatures are scanned during retrieval to deduce if a document contains the terms of a conjunctive
query. While it is encouraging that a signature-based replacement to inverted indexes appears not
only viable but very much practical, the query logic BitFunnel supports is limited to ANDs and
does not generalize to the setup we are considering in this work. Despite that, we note that the
“bit-sliced signatures” in BitFunnel inspired the particular transposed layout of the sketch matrix in
Sinnamon.
For completeness, we also briefly note the literature on sparse-sparse matrix multiplication [24,
34, 44, 51–53] and sparse matrix-sparse vector multiplication [8]. The main challenge in operations
concerning sparse matrices is that the computation involved is often highly memory-bound. As
such, much of this literature focuses on developing sparse storage formats with hardware- and
cache-aware designs that lead to a more effective utilization of memory bandwidth. We believe,
however, that the research on compact storage and memory-efficient structures is orthogonal to
the topic of our work and offers solutions that could lead to improvements across all algorithms
considered in this work.
3 LINSCAN

LinScan maintains two parallel arrays per inverted list (also known as non-interleaved inverted lists), one that stores vector identifiers and another that holds values. For completeness, we show this indexing logic in Algorithm 1.
In its most basic variant, we store the inverted index without using any form of compression:
That is, the document identifiers are stored as 32-bit integers and values as 32-bit floats. This allows
us to quantify the latency of the logic within the algorithm itself and remove other factors related
to compression. It also enables the algorithm to take advantage of instruction-level parallelism
and efficient caching that come for free (using default compiler optimization techniques) with a
coordinate-at-a-time retrieval strategy. To make the algorithm more practical, we also consider a
variant where the list of vector identifiers in each inverted list is compressed using the Roaring [13]
dynamic data structure and the values are stored using the bfloat16 format (16-bit floating-point numbers). The loss of precision due to the conversion from 32-bit to 16-bit floats is negligible in practice.
We denote this variant by LinScan-Roaring.
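The following is a minimal Rust sketch of the uncompressed variant described above; it is an illustration under our own naming and is not the implementation evaluated in this work:

```rust
use std::collections::HashMap;

#[derive(Default)]
struct InvertedList {
    ids: Vec<u32>,    // vector identifiers, in insertion order
    values: Vec<f32>, // this coordinate's value in each vector, aligned with `ids`
}

#[derive(Default)]
struct LinScanIndex {
    lists: HashMap<u32, InvertedList>, // coordinate -> non-interleaved inverted list
    num_docs: u32,
}

impl LinScanIndex {
    // Algorithm 1, sketched: append (id, value) to the inverted list of every
    // non-zero coordinate of the incoming vector.
    fn insert(&mut self, doc: &[(u32, f32)]) -> u32 {
        let id = self.num_docs;
        self.num_docs += 1;
        for &(coord, value) in doc {
            let list = self.lists.entry(coord).or_default();
            list.ids.push(id);
            list.values.push(value);
        }
        id
    }
}
```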
During retrieval, LinScan follows a simple two-step procedure shown in Algorithm 2. In the
scoring step (lines 1 through 7), it traverses the inverted index one coordinate at a time for every
non-zero coordinate in the query vector and accumulates partial scores for all documents. At the
end of this step, the algorithm will have computed the exact inner product scores for every vector in
the collection—document vectors that are not visited in the scoring loop on line 3 will have a score
of 0. In the ranking step (line 8), it finds the top 𝑘 vectors with the largest inner product scores;
in our implementation of FindLargest, we use a heap to efficiently identify the top 𝑘 vectors as
shown in Algorithm 3.
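To make the two steps concrete, the sketch below builds on the LinScanIndex above: a coordinate-at-a-time scoring loop followed by a heap-based FindLargest. The Scored wrapper exists only to give f32 scores a total order; none of these names come from the evaluated implementation.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Wrapper giving f32 scores a total order (via total_cmp) so they can live in
// a BinaryHeap.
#[derive(PartialEq)]
struct Scored(f32, u32);
impl Eq for Scored {}
impl PartialOrd for Scored {
    fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for Scored {
    fn cmp(&self, other: &Self) -> std::cmp::Ordering {
        self.0.total_cmp(&other.0).then(self.1.cmp(&other.1))
    }
}

impl LinScanIndex {
    // Algorithm 2, sketched: exact coordinate-at-a-time scoring followed by
    // FindLargest with a size-k min-heap (Algorithm 3).
    fn retrieve(&self, query: &[(u32, f32)], k: usize) -> Vec<(u32, f32)> {
        let mut scores = vec![0.0f32; self.num_docs as usize];
        for &(coord, q_value) in query {
            if let Some(list) = self.lists.get(&coord) {
                // A tight, branch-free loop: this is where instruction-level
                // parallelism and cache-friendly access pay off.
                for (&id, &value) in list.ids.iter().zip(&list.values) {
                    scores[id as usize] += q_value * value;
                }
            }
        }
        let mut heap: BinaryHeap<Reverse<Scored>> = BinaryHeap::with_capacity(k + 1);
        for (id, &score) in scores.iter().enumerate() {
            heap.push(Reverse(Scored(score, id as u32)));
            if heap.len() > k {
                heap.pop(); // evict the current minimum
            }
        }
        let mut top: Vec<(u32, f32)> = heap
            .into_iter()
            .map(|Reverse(Scored(score, id))| (id, score))
            .collect();
        top.sort_by(|a, b| b.1.total_cmp(&a.1));
        top
    }
}
```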
An interesting property of LinScan’s retrieval algorithm is that it is trivial to execute its logic
in parallel in a dynamic manner. While most existing algorithms would require some form of
sharding of the index by document ids (i.e., keeping separate index structures for different ranges
of document ids), LinScan can execute retrieval with as many threads as are available using the
very same monolithic data structure. It is the combination of the coordinate-at-a-time nature of
LinScan and the layout of its data structure that lend the algorithm to such a dynamically adjustable
level of concurrency. By breaking up an inverted list into contiguous segments on the fly, we can
accumulate partial scores for each segment concurrently. Similarly, it is just as trivial to execute
FindLargest(·) in parallel. We consider this parallel variant of the algorithm in this work and
refer to it as LinScan ∥ .
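The sketch below illustrates this dynamic parallelism with scoped threads: an inverted list is chunked on the fly and each thread accumulates into a local buffer that is merged at the end. A production implementation could avoid the per-thread buffers (for instance, because a document appears at most once per inverted list, the segments touch disjoint score entries), but local buffers keep this illustration safe and simple.

```rust
use std::thread;

// Accumulate one query coordinate's partial scores over an inverted list that
// is split into contiguous segments on the fly, one thread per segment.
fn parallel_accumulate(
    ids: &[u32],
    values: &[f32],
    q_value: f32,
    num_docs: usize,
    num_threads: usize,
) -> Vec<f32> {
    let chunk = (ids.len() / num_threads.max(1)).max(1);
    let mut partials: Vec<Vec<f32>> = Vec::new();
    thread::scope(|s| {
        let handles: Vec<_> = ids
            .chunks(chunk)
            .zip(values.chunks(chunk))
            .map(|(id_seg, val_seg)| {
                s.spawn(move || {
                    let mut local = vec![0.0f32; num_docs];
                    for (&id, &v) in id_seg.iter().zip(val_seg) {
                        local[id as usize] += q_value * v;
                    }
                    local
                })
            })
            .collect();
        partials = handles.into_iter().map(|h| h.join().unwrap()).collect();
    });
    // Merge the thread-local accumulators into the final score array.
    let mut scores = vec![0.0f32; num_docs];
    for local in partials {
        for (s, l) in scores.iter_mut().zip(local) {
            *s += l;
        }
    }
    scores
}
```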
Finally, let us introduce an anytime but necessarily approximate version of LinScan by way of a
simple modification to the retrieval logic. In this variant, we visit the non-zero query coordinates
in order from the coordinate with the largest absolute value to the smallest one. As soon as a given
time-budget 𝑇 is exhausted, we terminate the scoring step of the retrieval; setting 𝑇 = ∞ to give
the algorithm an unlimited time budget reduces the logic to the vanilla exact LinScan. But because
the scores may no longer represent the exact inner products, we find the top 𝑘 ′ document vectors
according to these possibly inexact scores for some 𝑘 ′ ≥ 𝑘, and subsequently fetch those vectors
from storage to compute their exact scores and finally return the top 𝑘 elements. This procedure is
shown in Algorithm 4.
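A sketch of this anytime variant follows, again building on the LinScanIndex above; the exact_score callback is our own abstraction standing in for a lookup against the vector storage of Figure 1.

```rust
use std::time::{Duration, Instant};

impl LinScanIndex {
    // Algorithm 4, sketched: score coordinates in decreasing order of |value|
    // until the budget expires, select k' >= k candidates by the possibly
    // inexact scores, then re-rank them with exact inner products.
    fn retrieve_anytime(
        &self,
        query: &[(u32, f32)],
        k: usize,
        k_prime: usize,
        budget: Duration,
        exact_score: impl Fn(u32) -> f32,
    ) -> Vec<(u32, f32)> {
        let start = Instant::now();
        let mut sorted = query.to_vec();
        sorted.sort_by(|a, b| b.1.abs().total_cmp(&a.1.abs()));
        let mut scores = vec![0.0f32; self.num_docs as usize];
        for &(coord, q_value) in &sorted {
            if start.elapsed() >= budget {
                break; // T exhausted; remaining coordinates are never scored
            }
            if let Some(list) = self.lists.get(&coord) {
                for (&id, &value) in list.ids.iter().zip(&list.values) {
                    scores[id as usize] += q_value * value;
                }
            }
        }
        // Select k' candidates by approximate score and re-rank them exactly.
        let mut ids: Vec<u32> = (0..self.num_docs).collect();
        ids.sort_by(|&a, &b| scores[b as usize].total_cmp(&scores[a as usize]));
        let mut top: Vec<(u32, f32)> = ids
            .into_iter()
            .take(k_prime.max(k))
            .map(|id| (id, exact_score(id)))
            .collect();
        top.sort_by(|a, b| b.1.total_cmp(&a.1));
        top.truncate(k);
        top
    }
}
```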
3.1 Deletions
We have so far described the indexing and retrieval algorithms in LinScan. In this section, we
briefly touch on deletion strategies. We preface this discussion with the note that insertion, deletion,
and retrieval procedures are not entirely independent: A particular deletion algorithm may pair
better with a particular set of insertion and retrieval algorithms. While we are careful to incorporate
this fact into making a choice between different deletion strategies, we acknowledge that more
can and should be done to optimize joint insertion-deletion-retrieval efficiency. But as much of
the optimization involves heavy engineering (e.g., applying deletions in batch, running a separate
background process to reclaim deleted space, etc.), we do not dwell on this point in this work and
leave an empirical exploration of this detail to future work.
Throughout this work, when we delete a document vector 𝑥𝑖 from a LinScan index, we invoke a
process that is best described as “full deletion.” This strategy simply wipes all postings associated
with 𝑥𝑖 from the inverted index and frees up the space they occupied. That involves removing a
posting from an inverted list, noting that the posting may reside anywhere within the list.
An obvious advantage of this protocol is that it does not produce any waste in memory in the
form of zombie postings—space allocated in memory that lingers on after the document has been
deleted. Moreover, it is compatible with the insertion and retrieval logic described earlier as it
maintains the contiguity of the inverted lists and the alignment between the identifier and value
arrays.
An obvious disadvantage of this approach, on the other hand, is that fully removing a posting
from an inverted list often leads to a reorganization of the underlying array of data in memory,
which is itself a potentially expensive procedure. But we believe that the benefits of the full deletion
approach outweigh its pitfalls, especially considering the ramifications of alternative algorithms
for the insertion and retrieval procedures.
Consider, for instance, a different method which simply designates a posting as “deleted” using a
special value without immediately reclaiming its space. Perhaps the space is reclaimed periodically
by a background process or recycled when a new posting is inserted into the inverted list. Regardless,
it is clear that the insertion and retrieval procedures have to be modified so as to safely handle
postings that are designated for deletion. In retrieval, for example, this involves conditioning on
the content of each posting, leading to branches in the execution.
Given the discussion above, we believe our choice of full deletion is appropriate for LinScan
and helps to reduce the overall complexity in insertion and retrieval.
4 SINNAMON

4.1 Indexing
When a new document vector 𝑥𝑖 arrives into the collection X, Sinnamon executes an efficient two-
step algorithm to index it and make it available for retrieval. The first stage is the familiar procedure
of inserting the vector identifier 𝑖 into an inverted index I. I is a mapping from coordinates to a
list of vectors in which that coordinate is non-zero: I [ 𝑗] ≜ {𝑖 | 𝑥𝑖 [ 𝑗] ≠ 0}. When processing 𝑥𝑖 , for
every non-zero coordinate in the set 𝑛𝑧 (𝑥𝑖 ) ≜ { 𝑗 | 𝑥𝑖 [ 𝑗] ≠ 0}, we insert 𝑖 into I [ 𝑗].
The second, novel step involves populating the column 𝑖 in the sketch matrix X̃ ∈ R2𝑚×|X| , which has 2𝑚 rows (the sketch size) and |X| columns (the collection size). For notational convenience and to simplify prose, we regard X̃ as a block matrix X̃ = [U; L] with the top half of the matrix denoted by U ∈ R𝑚×|X| and the bottom half by L ∈ R𝑚×|X| .
Intuitively, what the sketch of a vector in Sinnamon captures is an upper-bound and a lower-
bound on the entries of 𝑥𝑖 in such a way that its inner product with any query vector can be
approximated arbitrarily accurately. This sketching step, in effect, can be thought of as a lossy
compression of a sparse vector such that the error incurred from losing the original values does not
severely degrade the solution set of Equation (2). We will revisit the effect of this approximation on
the final inner product later in this work.
Algorithm 5 presents this indexing procedure. The algorithm makes use of ℎ independent random
mappings 𝜋𝑜 : [𝑛] → [𝑚], where each 𝜋𝑜 (·) projects coordinates in the original space to an integer
in [𝑚]. In the notation of the algorithm, we construct an upper-bound vector 𝑢 ∈ R𝑚 and a lower-
bound vector 𝑙 ∈ R𝑚 , and insert 𝑢 and 𝑙 into the 𝑖th column of U and L respectively. In words, the
𝑘th coordinate of 𝑢, 𝑢 [𝑘] (𝑙 [𝑘]) records the largest (smallest) value from the set of all entries in 𝑥𝑖
that map into 𝑘 according to at least one 𝜋𝑜 . Figure 2(a) illustrates the algorithm for an example
vector using a single mapping 𝜋.
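A minimal Rust sketch of this indexing step follows. The random mappings 𝜋𝑜 are realized here with an arbitrary hash of the coordinate and the mapping index; the hashing scheme, the row-major layout (one growable array per sketch row, matching the layout discussed in Section 4.1.1), and all names are our illustrative choices rather than the paper's implementation.

```rust
use std::collections::HashMap;

struct SinnamonIndex {
    m: usize,                         // half the sketch size; the matrix has 2m rows
    h: u64,                           // number of random mappings
    inverted: HashMap<u32, Vec<u32>>, // coordinate j -> ids of vectors with x[j] != 0
    upper: Vec<Vec<f32>>,             // the m rows of U, one entry per indexed vector
    lower: Vec<Vec<f32>>,             // the m rows of L, aligned with `upper`
    free_ids: Vec<u32>,               // recyclable columns; see the deletion sketch in Section 4.3
    num_docs: u32,
}

impl SinnamonIndex {
    fn new(m: usize, h: u64) -> Self {
        SinnamonIndex {
            m,
            h,
            inverted: HashMap::new(),
            upper: vec![Vec::new(); m],
            lower: vec![Vec::new(); m],
            free_ids: Vec::new(),
            num_docs: 0,
        }
    }

    // Stand-in for the o-th random mapping pi_o : [n] -> [m].
    fn pi(&self, o: u64, coord: u32) -> usize {
        let x = (coord as u64 ^ o.wrapping_mul(0x9E37_79B9_7F4A_7C15))
            .wrapping_mul(0xD6E8_FEB8_6659_FD93);
        (x % self.m as u64) as usize
    }

    // Algorithm 5, sketched: record the id in the inverted index, then write
    // the upper- and lower-bounds into column `id` of the sketch matrix.
    fn insert(&mut self, doc: &[(u32, f32)]) -> u32 {
        let id = self.num_docs;
        self.num_docs += 1;
        // Grow every row by one column; untouched cells keep sentinel values.
        for row in self.upper.iter_mut() { row.push(f32::NEG_INFINITY); }
        for row in self.lower.iter_mut() { row.push(f32::INFINITY); }
        for &(coord, value) in doc {
            self.inverted.entry(coord).or_default().push(id);
            for o in 0..self.h {
                let k = self.pi(o, coord);
                // u[k] keeps the largest and l[k] the smallest colliding value.
                self.upper[k][id as usize] = self.upper[k][id as usize].max(value);
                self.lower[k][id as usize] = self.lower[k][id as usize].min(value);
            }
        }
        id
    }
}
```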
It is instructive to contrast the index structure in Sinnamon with the one in LinScan. The main
material difference between the two indexes is that Sinnamon allocates a constant amount of
memory to store the sketch of each document, whereas LinScan stores all non-zero values of
a document within the index. As a result, the amount of memory required to store values with
LinScan grows linearly in 𝜓𝑑 .
Finally, let us consider the time complexity of the insertion procedure in Sinnamon. Let us assume
that raw vectors are represented in a “sparse” format: rather than a vector being implemented as an
array of size 𝑛 with most entries 0, we assume a vector is a mapping from a coordinate to a value.
Assuming further that inserting a value into an inverted list using Insert(·) can be carried out in
constant time, the overall time complexity of Algorithm 5 is then O (𝜓𝑑 ℎ) for a vector 𝑥 with 𝜓𝑑 non-zero entries. The algorithm stores 𝜓𝑑 integers in the inverted index and 2𝑚 real values per vector.

Fig. 2. Example of (a) indexing and (b) score computation in Sinnamon. When inserting the vector 𝑥12 ∈ R𝑛 consisting of 3 non-zero coordinates {10, 27, 113}, we first populate the inverted index and then insert a sketch of 𝑥12 into the 12th column of the sketch matrix. The top half of the matrix, U, records the upper-bounds and the bottom half, L, the lower-bounds, with the help of a single random mapping 𝜋 from [𝑛] to [𝑚]. When computing the approximate inner product of a query vector 𝑞 with the documents in the collection, we look up the inverted list for one coordinate and traverse its corresponding row in the sketch matrix to accumulate partial scores in a coordinate-at-a-time algorithm.
As an important note we add that, when operating in the non-negative orthant where vectors are in R𝑛+, we do not record the lower-bounds (i.e., the sketch L) and only maintain a sketch matrix X̃ = U ∈ R𝑚×|X| . We call this variant of the algorithm Sinnamon+ .
4.1.1 Notes on Implementation. In practice, the inverted index can be compressed using any of the
existing bitmap or list compression codes such as Roaring [13], Simple16 [63], PForDelta [67] or
others [45, 56]. We use the Roaring codec throughout this work to compress inverted lists because
it achieves a reasonable compression ratio and, at the same time, supports very fast insertion and
deletion operations.
As with LinScan-Roaring, we store sketches using bfloat16. Finally, we note that the matrix X̃ is
in practice implemented as an array of 2𝑚 rows, with each row growing as needed to accommodate
the sketch of the 𝑖th vector. This particular data layout is more efficient during retrieval because
Sinnamon needs access to the same row in order to compute partial scores for documents within a
single inverted list. As such, by representing a row as a contiguous region of memory, we improve
the overall memory access pattern and make the data structure more cache-friendly.
4.1.2 Extensions and Future Considerations. One of the desired properties in our design is that the
inverted index I must offer compression as well as efficient insertion and deletion operations. This
is required because the efficiency of the inverted index directly affects the overall efficiency of the
indexing, update, and retrieval algorithms in Sinnamon. For this study, we settled on Roaring [13]
bitmaps as it suits the needs of the algorithm. However, we note that studying inverted indexes
with the outlined properties is an orthogonal area of research with many existing studies on the
required operations.
That said, existing inverted indexes are deterministic and exact. Sinnamon, on the other hand, is an approximate algorithm that tolerates error in exchange for gains in space and speed. Sinnamon also offers levers, as we will argue, to compensate for the incurred error. This enables us to explore
approximate inverted indexes where each inverted list may be a superset rather than an exact set
and where multiple entries may share an inverted list. In other words, the research question to
investigate is whether there exists an approximate inverted index Ĩ such that I [ 𝑗] ⊂ Ĩ [ 𝑗] ∀ 𝑗 and
where | Ĩ| ≪ |I| (with | · | denoting the overall size of the index) with a quantifiable effect on the
final inner product with arbitrary query vectors. We wish to investigate this question in a future
study.
Another research question in this context is the representation of the values. While Sinnamon
offers a fixed number of dimensions in the sketch, how those entries are represented affects the
overall memory usage. We believe that a number of quantization methods can be used to reduce the
capacity requirement while maintaining an approximately accurate inner product. This is another
area we wish to explore in the future.
4.2 Retrieval
Assuming we have an inverted index I and a sketch matrix X̃ as described in Section 4.1, as well as ℎ random mappings 𝜋𝑜 (·), 1 ≤ 𝑜 ≤ ℎ, we now discuss the question of retrieval: Given a query vector 𝑞 ∈ R𝑛 , find the top-𝑘 closest vectors in the collection.
4.2.1 Scoring. Similar to LinScan, Sinnamon approaches retrieval in two steps. In the first and
most critical step, Sinnamon computes an upper-bound on the inner product of the query vector
and every vector in the collection. It does so by traversing the inverted list of every non-zero
coordinate in the query, one coordinate at a time, and computing and accumulating partial scores
(i.e., the product of query value at that coordinate and the document value as encoded in the
sketch). Importantly, Sinnamon visits non-zero query coordinates in order from the coordinate
with the largest absolute value to the smallest one to facilitate an anytime variant. As soon as a given
time-budget 𝑇 is exhausted, it terminates the scoring phase. At this point, all partial scores are upper-
bounds on the exact inner product of the processed coordinates. This means, for example, when
𝑇 = ∞ (i.e., when time is unlimited) the computed score of a document vector is an upper-bound
on its inner product with the query.
Algorithm 6 presents the scoring procedure in Sinnamon. Intuitively, when the sign of a query
entry at coordinate 𝑗 is positive, we find the least upper-bound on the value of 𝑥𝑖 [ 𝑗] for a document
𝑥𝑖 ∈ I [ 𝑗] (line 8 in Algorithm 6). When 𝑞 [ 𝑗] < 0, we find the greatest lower-bound on 𝑥𝑖 [ 𝑗]
(line 10). In this way, Sinnamon guarantees that the partial score is always an upper-bound on the
actual partial score. This is illustrated for an example query vector in Figure 2(b).
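The scoring step, sketched below against the SinnamonIndex illustrated earlier: for a positive query entry we take the least upper-bound across the ℎ rows, and for a negative entry the greatest lower-bound. The time-budget check of the anytime variant is omitted for brevity.

```rust
impl SinnamonIndex {
    // Algorithm 6, sketched: a coordinate-at-a-time upper-bound on every
    // document's inner product with the query.
    fn score(&self, query: &[(u32, f32)]) -> Vec<f32> {
        let mut scores = vec![0.0f32; self.num_docs as usize];
        // Visit coordinates by decreasing |value| so that an anytime variant
        // can cut this loop short when a time budget expires.
        let mut sorted = query.to_vec();
        sorted.sort_by(|a, b| b.1.abs().total_cmp(&a.1.abs()));
        for &(coord, q_value) in &sorted {
            let Some(ids) = self.inverted.get(&coord) else { continue };
            for &id in ids {
                let col = id as usize;
                let mut bound = if q_value > 0.0 { f32::INFINITY } else { f32::NEG_INFINITY };
                for o in 0..self.h {
                    let k = self.pi(o, coord);
                    bound = if q_value > 0.0 {
                        bound.min(self.upper[k][col]) // least upper-bound (line 8)
                    } else {
                        bound.max(self.lower[k][col]) // greatest lower-bound (line 10)
                    };
                }
                scores[col] += q_value * bound;
            }
        }
        scores
    }
}
```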
We note that the expected time complexity of the scoring algorithm is O (𝜓𝑞 log𝜓𝑞 + 𝜓𝑞 ℎ|X|𝜓𝑑 /𝑛),
typically dominated by the second term, where the term |X|𝜓𝑑 /𝑛 represents the expected number
of vectors that have a non-zero value in a particular coordinate (with the assumption that non-zero
coordinates are uniformly distributed).
4.2.2 Ranking. At the end of the scoring stage, Sinnamon gives us approximate scores for every
document in the collection. In the second stage, which we refer to as “ranking,” we must find the
top-𝑘 vectors that make up the (approximate) solution to Equation (2).
To do that, we first find the 𝑘 ′ ≥ 𝑘 vectors with the largest approximate score using a heap with
time complexity O (|X| log 𝑘 ′). The reason the initial pass selects a set that has more than 𝑘 vectors
is that by doing so we compensate for the approximation error of the scoring algorithm. We later
review the relationship between 𝑘 and 𝑘 ′ empirically. We subsequently execute Algorithm 7 to
compute the exact inner product between the query and the set of 𝑘 ′ vectors, and eventually return
the top-𝑘 subset according to the exact scores.
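A sketch of this ranking stage follows; where Algorithm 7's first pass uses a heap, we use Rust's built-in partial selection for brevity, and exact_score again stands in for a lookup against the vector storage.

```rust
// Take the k' highest approximate scores, re-score those candidates exactly,
// and return the best k by exact score.
fn rank(approx: &[f32], k: usize, k_prime: usize, exact_score: impl Fn(u32) -> f32) -> Vec<(u32, f32)> {
    let mut ids: Vec<u32> = (0..approx.len() as u32).collect();
    let kp = k_prime.max(k).min(ids.len());
    if kp > 0 && kp < ids.len() {
        // Partition so that the kp largest approximate scores come first.
        ids.select_nth_unstable_by(kp - 1, |&a, &b| {
            approx[b as usize].total_cmp(&approx[a as usize])
        });
    }
    let mut top: Vec<(u32, f32)> = ids[..kp].iter().map(|&id| (id, exact_score(id))).collect();
    top.sort_by(|a, b| b.1.total_cmp(&a.1));
    top.truncate(k);
    top
}
```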
Much like LinScan, Sinnamon’s retrieval algorithm (i.e., Algorithms 6 and 7) is trivially amenable
to dynamic parallelism. By breaking up an inverted list into non-overlapping segments in line 6 of
Algorithm 6, we can accumulate partial scores concurrently without the need to break up or shard
the sketch matrix. In the parallel version of the algorithm, we also execute FindLargest(·) and
the exact computation of inner products in Algorithm 7 using multiple threads. We refer to the
parallel variant of the algorithm as Sinnamon ∥ .
We note that, while a mono-CPU variant of Sinnamon offers a consistent analysis of the trade-
offs within the algorithm and sheds light onto its behavior in comparison with other algorithms,
we believe that the ease by which Sinnamon (and LinScan) can be run concurrently on a per-
query basis renders the algorithm suitable for production systems that operate on large (dynamic)
collections. In particular, because the index structure remains monolithic within each machine, it
need not be rebuilt or re-assembled when porting the index to another machine with a different
configuration.
4.2.3 Notes on Implementation. As is clear from line 5 in Algorithm 6, for a query coordinate 𝑗,
we need only probe a fixed set of ℎ rows in the sketch matrix. When ℎ = 1, as a typical example,
this implies that we need only visit a single row which is stored as a contiguous array in memory.
Due to this property and the predictability of the memory access pattern, it is often possible to
cache a few upcoming memory locations in advance of the computation so as to speed up scoring.
We observe the effect of the cache-friendliness of the sketch matrix in practice by enabling default
compiler optimizations. We also note that it is possible to further optimize the implementation
through explicit instruction-level parallelism where the compiler fails to do so itself, though we do
not explicitly use this technique in this work.
In our implementation, we further optimize the efficiency of the algorithm by re-arranging the
logic so as to remove the branching on line 7 in Algorithm 6. This is possible if the programming language offers “function pointers” that can point to the min or max operator depending on the sign of the query entry.
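As an illustration of this trick in Rust, where f32::min and f32::max coerce to the same function-pointer type, the operator can be selected once per query coordinate, outside the hot loop; the function below is a minimal sketch of the idea, not the actual implementation.

```rust
// Pick min or max once, based on the sign of the query entry, then run a
// branch-free inner loop over, e.g., the h = 2 sketch rows of a coordinate.
fn decode_bounds(row_a: &[f32], row_b: &[f32], q_value: f32) -> Vec<f32> {
    let op: fn(f32, f32) -> f32 = if q_value > 0.0 { f32::min } else { f32::max };
    row_a.iter().zip(row_b).map(|(&a, &b)| op(a, b)).collect()
}
```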
4.2.4 Extensions and Future Considerations. A requirement that is often faced in practice is for a
retrieval algorithm to support constrained search. For example, a user may only ask for the top-𝑘
set of songs whose genre matches a set of desired genres. One way to formalize this is to require
that the retrieval algorithm enforce arbitrarily many binary constraints on the solution space. In
other words, the vectors in the solution set of Equation (2) must pass an arbitrary set of functions
G = {𝑔𝑖 : X → {0, 1}}. This transforms the problem to the following constrained retrieval problem:
\[
\operatorname*{arg\,max}^{(k)}_{x \in \mathcal{X}} \; \langle q, x \rangle
\quad \text{s.t.} \quad \bigwedge_{g \in \mathcal{G}} g(x) = 1, \tag{3}
\]
where ∧ denotes the And operator.
Sinnamon naturally supports this mode of search because, by default, it computes the scores of
all documents in the collection in its scoring stage—the same is true of LinScan. It is therefore
possible to enforce arbitrary constraints by masking out those columns in X̃ that do not satisfy the
given conditions. We defer an examination of this setup to future studies.
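As a sketch of how such masking might look, with the passes predicate standing in for the conjunction of the functions in G:

```rust
// Mask out documents that fail the binary constraints before ranking.
fn mask_scores(scores: &mut [f32], passes: impl Fn(u32) -> bool) {
    for (id, score) in scores.iter_mut().enumerate() {
        if !passes(id as u32) {
            *score = f32::NEG_INFINITY; // masked columns can never enter the top-k set
        }
    }
}
```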
4.3 Deletions
When deleting a vector 𝑥𝑖 from LinScan’s index, we committed to a “full deletion,” wiping all
postings associated with 𝑥𝑖 from inverted lists. That strategy, we argued, fits LinScan well as it
simplifies its insertion and retrieval logic. Sinnamon, in contrast, provides us with an alternative
deletion mechanism.
When deleting 𝑥𝑖 from the collection, we simply remove all instances of 𝑖 from inverted lists in
I, much like in LinScan. However, we do not clean up the sketch of 𝑥𝑖 from the sketch matrix X̃.
Instead, we add 𝑖 to the set of available document identifiers so that the next vector that is inserted
into the index may reuse 𝑖 as its identifier and recycle column 𝑖 in X̃ to store its sketch.
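A sketch of this deletion protocol against the SinnamonIndex illustrated earlier, using its free_ids list; the document's non-zero coordinates would in practice be fetched from the vector storage, and a full implementation would also pop from free_ids on insertion to recycle columns.

```rust
impl SinnamonIndex {
    // Remove the id from every inverted list it appears in, leave column `id`
    // of the sketch matrix untouched, and make the identifier available for
    // recycling by a future insertion.
    fn delete(&mut self, id: u32, doc_coords: &[u32]) {
        for &coord in doc_coords {
            if let Some(list) = self.inverted.get_mut(&coord) {
                list.retain(|&i| i != id); // cheap on compressed bitmaps such as Roaring
            }
        }
        self.free_ids.push(id); // column `id` in the sketch matrix is now recyclable
    }
}
```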
This protocol is efficient because it only involves the removal of an integer from compressed
inverted lists—an instruction that is often very fast to execute. Contrast this with LinScan where
the two parallel arrays within an inverted list must remain aligned at all times. Because the values
array must be cut at the same spot as the array that holds document identifiers, we have the
overhead of having to find the position of the posting that holds 𝑖, then proceed to delete the corresponding entry in the values array.
We do note that, for applications that receive far more delete requests than new insertions,
Sinnamon’s deletion logic may prove suboptimal. This is because, by virtue of not freeing up a
column upon deletion, the sketch matrix could grow to be as large as the maximal number of
vectors that exist in the dataset at the same time. For such applications, a different deletion strategy
may be required that may involve defragmenting the matrix and reclaiming the underlying space.
In practice, however, we find this particular scenario to be a mere hypothetical; in reality, deletions
are dwarfed by insertion and update requests, where Sinnamon’s default deletion strategy leads to
minimal to no waste.
5 ANALYSIS
Recall that Sinnamon uses a sketch of size 2𝑚 to record upper- and lower-bounds on the values of
active3 coordinates in a vector. Consider now Line 5 of Algorithm 5 where, given a vector 𝑥 ∈ R𝑛
and ℎ independent random mappings 𝜋𝑜 : [𝑛] → [𝑚] (1 ≤ 𝑜 ≤ ℎ), we construct the upper-bound
sketch 𝑢 ∈ R𝑚 where the 𝑘th dimension is assigned the following value:

\[
u[k] = \max \,\{\, x[j] \;\mid\; j \in nz(x),\; \pi_o(j) = k \text{ for some } 1 \le o \le h \,\}.
\]

The lower-bound sketch is filled in a symmetric manner, in the sense that the algorithmic procedure is the same but the operator changes from max(·) to min(·).
When a query coordinate is positive, to reconstruct the value 𝑥 [𝑖] of a document vector, we take the least upper-bound from 𝑢, as captured on Line 8 of Algorithm 6, restated below for convenience:

\[
\bar{x}[i] = \min_{1 \le o \le h} u[\pi_o(i)].
\]

When the query coordinate is negative, on the other hand, it is the greatest lower-bound that is returned instead.
Given the above, it is easy to see that the following proposition is always true:
Theorem 5.1. The score returned by Algorithm 6 of Sinnamon𝑇 =∞ is an upper-bound on the inner
product of query and document vectors.
The fact above implies that Sinnamon’s approximation error is always non-negative. But what
can we say about the probability that such an error occurs when approximating a document value?
How large is the overestimation error of a single value? How does that error affect the final score
of a query-document pair? These are some of the questions we examine in the remainder of this
section.
3 In the rest of this work we refer to coordinates in a sparse vector as either zero or non-zero. In this section, to make the exposition more accurate, we adopt a more formal terminology and say a coordinate is inactive when it is not present in the sparse vector, and active when it is. Note that the value of an active coordinate is almost surely non-zero; that leaves room for the unlikely event that it may draw the value 0 from its value distribution.
Theorem 5.2. For an active coordinate 𝑋𝑖 of a random sparse vector 𝑋 ∈ R𝑛 , where coordinate 𝑗 is active with probability 𝑝 𝑗 and active values are drawn from a distribution with PDF 𝜙 and CDF Φ, the probability that the decoded value 𝑋̄𝑖 overestimates 𝑋𝑖 is approximately:

\[
\mathbb{P}\big[\bar{X}_i > X_i\big] \approx \int \Big(1 - e^{-\frac{h}{m}(1-\Phi(\alpha))\sum_{j \neq i} p_j}\Big)^h \phi(\alpha)\, d\alpha. \tag{6}
\]

Proof. Consider the random value 𝑋𝑖 . Suppose 𝑘 = 𝜋𝑜 (𝑖) for some 1 ≤ 𝑜 ≤ ℎ. We thus need to probe 𝑈𝑘 as part of producing 𝑋̄𝑖 . The event that 𝑈𝑘 > 𝑋𝑖 happens only when there exists another active coordinate 𝑋 𝑗 such that 𝑋 𝑗 > 𝑋𝑖 and 𝜋𝑜′ ( 𝑗) = 𝑘 for some 1 ≤ 𝑜 ′ ≤ ℎ; call this event A. To derive its probability, it is easier to think in terms of the complementary event: 𝑈𝑘 = 𝑋𝑖 if every other active coordinate whose value is larger than 𝑋𝑖 maps only to sketch coordinates other than 𝑘. Clearly the probability that an arbitrary 𝑋 𝑗 maps to a sketch coordinate other than 𝑘 is 1 − 1/𝑚 under a single mapping, and (1 − 1/𝑚)ℎ under all ℎ mappings. Therefore, given a vector 𝑋 :

\[
\mathbb{P}\big[\underbrace{\exists\, j \neq i \; \text{s.t.}\; \pi_{o'}(j) = \pi_o(i) = k \text{ for some } 1 \le o' \le h,\; X_j > X_i}_{\text{Event A}} \;\big|\; X\big]
= 1 - \Big(1 - \frac{1}{m}\Big)^{h \sum_{j \neq i} \mathbb{1}_{X_j \,\text{active}}\, \mathbb{1}_{X_j > X_i}}. \tag{7}
\]

Because 𝑚 is large by assumption, we can approximate 𝑒 −1 ≈ (1 − 1/𝑚)𝑚 and rewrite the expression above as follows:

\[
\mathbb{P}\big[\text{Event A} \mid X\big] \approx 1 - e^{-\frac{h}{m} \sum_{j \neq i} \mathbb{1}_{X_j \,\text{active}}\, \mathbb{1}_{X_j > X_i}}. \tag{8}
\]
Finally, we marginalize the expression above over the 𝑋 𝑗 s for 𝑗 ≠ 𝑖 to remove the dependence on all but the 𝑖th coordinate of 𝑋 . To simplify the expression, however, we take the expectation over the first-order Taylor expansion of the right-hand side around 0. Writing 𝛼 for the value of 𝑋𝑖 , this results in the following approximation:

\[
\mathbb{P}\big[\text{Event A} \mid X_i = \alpha\big] \approx 1 - e^{-\frac{h}{m}(1-\Phi(\alpha)) \sum_{j \neq i} p_j}. \tag{9}
\]
For 𝑋̄𝑖 to be larger than 𝑋𝑖 , event A must take place for all ℎ sketch coordinates corresponding to 𝑖. That probability, by the independence of random mappings, is:

\[
\mathbb{P}\big[\bar{X}_i > X_i \mid X_i = \alpha\big] \approx \Big(1 - e^{-\frac{h}{m}(1-\Phi(\alpha)) \sum_{j \neq i} p_j}\Big)^h. \tag{10}
\]
In deriving the expression above, we conditioned the event on the value of 𝑋𝑖 . Taking the marginal probability leads us to the following expression for the event that 𝑋̄𝑖 > 𝑋𝑖 for any 𝑖:

\[
\mathbb{P}\big[\bar{X}_i > X_i\big] \approx \int \Big(1 - e^{-\frac{h}{m}(1-\Phi(\alpha)) \sum_{j \neq i} p_j}\Big)^h d\mathbb{P}(\alpha)
= \int \Big(1 - e^{-\frac{h}{m}(1-\Phi(\alpha)) \sum_{j \neq i} p_j}\Big)^h \phi(\alpha)\, d\alpha, \tag{11}
\]

which is Equation (6), completing the proof. □
Equation (6) is unwieldy in an analytical sense, but can be computed numerically. Nonetheless, it offers insights into the behavior of the upper-bound sketch. Our first observation is that the sketching mechanism presented here is more suitable for distributions where larger values occur with a smaller probability, such as sub-Gaussian variables. In such cases, the larger a value is, the smaller its chance of being overestimated by the upper-bound sketch. Regardless of the underlying distribution, the largest value in a vector is, by construction, always recovered exactly.
The second insight is that there is a sweet spot for ℎ given a particular value of 𝑚: using more
random mappings helps lower the probability of error until the sketch starts to saturate, at which
point the error rate increases. This particular property is similar to the behavior of a Bloom filter.
In addition to these general observations, it is often possible to derive a closed form expression
for special distributions and obtain further insights. For example, when active 𝑋𝑖 s are drawn from
a zero-mean, unit-variance Gaussian distribution, the probability of error can be expressed as in
the following corollary.
Corollary 5.3. Suppose the probability that a coordinate is active, 𝑝𝑖 , is equal to 𝑝 for all coordinates of vector 𝑋 ∈ R𝑛 . When active 𝑋𝑖 s are drawn from 𝐺𝑎𝑢𝑠𝑠𝑖𝑎𝑛(𝜇 = 0, 𝜎 = 1), the probability of error is:

\[
\mathbb{P}\big[\bar{X}_i > X_i\big] \approx 1 + \sum_{k=1}^{h} (-1)^k \binom{h}{k} \frac{m\left(1 - e^{-\frac{kh(n-1)p}{m}}\right)}{kh(n-1)p}. \tag{12}
\]
Table 1. Probability of error as expressed in Equation (6) for sample distributions. In all cases, 𝜓𝑑 = 𝑛𝑝𝑖 = 𝑛𝑝 = 120 and the distribution over the support is as indicated in the first column. We select the support arbitrarily for the purposes of this demonstration. For the Zeta distributions, we quantize the interval [−1, 1] into 2^10 discrete values and define the distribution over the discrete support.

                              |     𝑚 = 𝜓𝑑 /2      |       𝑚 = 𝜓𝑑        |       𝑚 = 2𝜓𝑑
𝜙                             | ℎ=1    ℎ=2    ℎ=3  | ℎ=1    ℎ=2    ℎ=3   | ℎ=1    ℎ=2    ℎ=3
Uniform [−1, 1]               | 0.57   0.63   0.69 | 0.37   0.38   0.43  | 0.21   0.17   0.17
Zeta(𝑠 = 2.5)                 | 0.34   0.33   0.38 | 0.19   0.12   0.11  | 0.10   0.04   0.02
Zeta(𝑠 = 4.0)                 | 0.13   0.03   0.04 | 0.07   0.02   0.008 | 0.03   0.005  0.001
Gaussian(𝜇 = 0, 𝜎 = 1)        | 0.57   0.63   0.69 | 0.37   0.38   0.43  | 0.21   0.17   0.17
Gaussian(𝜇 = 0, 𝜎 = 0.1)      | 0.57   0.63   0.69 | 0.37   0.38   0.43  | 0.21   0.17   0.17
Table 2. Expected value of error for sample distributions. The setup is the same as in Table 1.

                              |     𝑚 = 𝜓𝑑 /2      |       𝑚 = 𝜓𝑑        |       𝑚 = 2𝜓𝑑
𝜙                             | ℎ=1    ℎ=2    ℎ=3  | ℎ=1    ℎ=2    ℎ=3   | ℎ=1    ℎ=2    ℎ=3
Uniform [−1, 1]               | 0.43   0.46   0.52 | 0.26   0.24   0.27  | 0.15   0.09   0.09
Zeta(𝑠 = 2.5)                 | 0.001  5𝑒-4   5𝑒-4 | 7𝑒-4   2𝑒-4   2𝑒-4  | 4𝑒-4   6𝑒-5   3𝑒-5
Gaussian(𝜇 = 0, 𝜎 = 1)        | 0.40   0.43   0.48 | 0.24   0.22   0.25  | 0.14   0.09   0.08
Gaussian(𝜇 = 0, 𝜎 = 0.1)      | 0.07   0.07   0.08 | 0.05   0.04   0.04  | 0.02   0.01   0.01
variables. But to understand the differences between these distributions, we must contextualize the probability of error with the distribution of error and understand its concentration around zero. To that end, we state the following result for the upper-bound sketch. We denote by 𝑍̄𝑖 the decoding error 𝑋̄𝑖 − 𝑋𝑖 .
Theorem 5.4 (CDF of Error in the Upper-Bound Sketch). For an active 𝑋𝑖 whose values are drawn from a distribution with PDF and CDF 𝜙 and Φ, the probability that 𝑍̄𝑖 = 𝑋̄𝑖 − 𝑋𝑖 is less than 𝛿 is:

\[
\mathbb{P}\big[\bar{Z}_i \le \delta;\; X_i \text{ is active}\big] \approx 1 - \int \Big(1 - e^{-\frac{h}{m}(1-\Phi(\alpha+\delta)) \sum_{j \neq i} p_j}\Big)^h \phi(\alpha)\, d\alpha. \tag{13}
\]
Given the CDF of 𝑍̄𝑖 and the fact that 𝑍̄𝑖 ≥ 0, it follows that its expected value conditioned on 𝑋𝑖 being active is:

Lemma 5.5 (Expected Value of Error in the Upper-Bound Sketch).

\[
\mathbb{E}\big[\bar{Z}_i;\; X_i \text{ is active}\big] = \int_0^\infty \mathbb{P}\big[\bar{Z}_i \ge \delta\big]\, d\delta \approx \int_0^\infty \int \Big(1 - e^{-\frac{h}{m}(1-\Phi(\alpha+\delta)) \sum_{j \neq i} p_j}\Big)^h \phi(\alpha)\, d\alpha\, d\delta. \tag{16}
\]
We present the CDF of 𝑍̄𝑖 in Figure 4 for uniform and Gaussian-distributed vectors, and report
the expected value for other distributions in Table 2. Examining Tables 1 and 2 together shows
that, while some distributions can result in a similar probability of error given the same sketch
configuration, they differ greatly in terms of the expected magnitude of error.
We also find it interesting to study Theorem 5.4 for special distributions. As we show in Appendix B, for example, when vectors are drawn from a zero-mean Gaussian distribution with standard deviation 𝜎, the CDF of the variable 𝑍̄𝑖 can be derived as follows.

Corollary 5.6. Suppose the probability that a coordinate is active, 𝑝𝑖 , is equal to 𝑝 for all coordinates of vector 𝑋 ∈ R𝑛 . When active 𝑋𝑖 s are drawn from 𝐺𝑎𝑢𝑠𝑠𝑖𝑎𝑛(𝜇 = 0, 𝜎), the CDF of error is:

\[
\mathbb{P}\big[\bar{Z}_i \le \delta;\; X_i \text{ is active}\big] \approx 1 - \Big(1 - e^{-\frac{h}{m}(n-1)p\,(1-\Phi'(\delta))}\Big)^h, \tag{17}
\]

where Φ′ (·) is the CDF of a zero-mean Gaussian with standard deviation 𝜎√2.
This expression enables us to find a particular sketch configuration given a desired bound on the
probability of error. It is straightforward, for instance, to show the following result.
Lemma 5.7. Suppose the probability that a coordinate is active, 𝑝𝑖 , is equal to 𝑝 for all coordinates of vector 𝑋 ∈ R𝑛 . Suppose active 𝑋𝑖 ∼ 𝐺𝑎𝑢𝑠𝑠𝑖𝑎𝑛(0, 𝜎). Given a choice of 0 < 𝛿, 𝜖 < 1 and the number of random mappings ℎ, we have that P[𝑍̄𝑖 > 𝛿] < 𝜖 when:

\[
m > -\frac{h(n-1)p\,(1-\Phi'(\delta))}{\log(1 - \epsilon^{1/h})}. \tag{18}
\]
To get a sense of what this expression entails, we have plotted 𝑚/((𝑛 − 1)𝑝) as a function of ℎ given particular configurations of 𝜎, 𝛿, and 𝜖 in Figure 3. As one may expect, when we utilize more random mappings to form the sketch, we require a smaller sketch size to maintain the same guarantee on the concentration of error. That is true only up to a certain point: using more than three mappings, for example, translates to an increase in 𝑚 to keep the magnitude of error within a given bound.
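The bound of Lemma 5.7 is easy to evaluate numerically. The sketch below does so in Rust, approximating the Gaussian CDF with the Abramowitz–Stegun erf formula; the specific parameter values in main are arbitrary illustrative choices (with (𝑛 − 1)𝑝 ≈ 120 to mirror the tables above), not configurations from our experiments.

```rust
// Abramowitz & Stegun 7.1.26 approximation of erf, accurate to roughly 1e-7.
fn erf(x: f64) -> f64 {
    let sign = if x < 0.0 { -1.0 } else { 1.0 };
    let x = x.abs();
    let t = 1.0 / (1.0 + 0.3275911 * x);
    let y = 1.0
        - (((((1.061405429 * t - 1.453152027) * t) + 1.421413741) * t - 0.284496736) * t
            + 0.254829592)
            * t
            * (-x * x).exp();
    sign * y
}

// Smallest sketch size m guaranteeing P[Z > delta] < epsilon per Equation (18).
fn min_sketch_size(n: usize, p: f64, sigma: f64, delta: f64, epsilon: f64, h: u32) -> f64 {
    // CDF of Gaussian(0, sigma * sqrt(2)) evaluated at delta.
    let phi_prime = 0.5 * (1.0 + erf(delta / (sigma * 2.0)));
    -(h as f64) * ((n - 1) as f64) * p * (1.0 - phi_prime)
        / (1.0 - epsilon.powf(1.0 / h as f64)).ln()
}

fn main() {
    // Illustrative configuration: (n - 1) p ~= 120 non-zero entries per vector.
    for h in 1..=8 {
        let m = min_sketch_size(30_000, 0.004, 0.1, 0.2, 0.1, h);
        println!("h = {h}: m > {m:.1}");
    }
}
```

Running this reproduces the qualitative shape of Figure 3: the required 𝑚 first drops as ℎ grows and then rises again once the sketch begins to saturate.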
Before we move on, let us consider the general form of 𝑍̄𝑖 without the assumption that 𝑋𝑖 is active. Denote by 𝜇𝑖 the expected value of 𝑍̄𝑖 conditioned on 𝑋𝑖 being active: 𝜇𝑖 = E[𝑍̄𝑖 ; 𝑋𝑖 is active]. Similarly denote by 𝜎𝑖² its variance when 𝑋𝑖 is active. Given that 𝑋𝑖 is active with probability 𝑝𝑖 and inactive with probability 1 − 𝑝𝑖 , it is easy to show that E[𝑍̄𝑖 ] = 𝑝𝑖 𝜇𝑖 and that its variance is Var(𝑍̄𝑖 ) = 𝑝𝑖 𝜎𝑖² + 𝑝𝑖 (1 − 𝑝𝑖 )𝜇𝑖² .
Fig. 3. Visualization of Lemma 5.7 in terms of 𝑚/((𝑛 − 1)𝑝) as a function of ℎ for different values of 𝜎, 𝛿, and 𝜖.
Fig. 4. Cumulative distribution of the overestimation error, 𝑍̄ , of the upper-bound sketch for vectors that draw values from the uniform distribution over [−1, 1] and the Gaussian distribution. These curves represent different settings of 𝑚 with ℎ = 1 and 𝑛𝑝𝑖 = 𝑛𝑝 = 120.
Theorem 5.8. Suppose that 𝑞 ∈ R𝑛 is a sparse vector with 𝑛𝑧 (𝑞) = {𝑖 | 𝑞 [𝑖] ≠ 0} denoting its set of active coordinates. Suppose in a random sparse vector 𝑋 ∈ R𝑛 , a coordinate 𝑋𝑖 is active with probability 𝑝𝑖 and, when active, draws its value from some well-behaved distribution (i.e., expectation, variance, and third moment exist). Let 𝑋̃ be the reconstruction of 𝑋 by Sinnamon: when 𝑋𝑖 is active, 𝑋̃𝑖 = 𝑋̄𝑖 if 𝑞 [𝑖] > 0 and 𝑋̃𝑖 = 𝑋̲𝑖 if 𝑞 [𝑖] < 0; otherwise 𝑋̃𝑖 is 0. If 𝑍𝑖 = 𝑋̃𝑖 − 𝑋𝑖 , then the random variable 𝑍 defined as follows tends to the standard normal distribution:

\[
Z \triangleq \frac{1}{\sqrt{\sum_{i \in nz(q)} \mathrm{Var}[Z_i]\, q[i]^2}} \Big( \langle q, \tilde{X} - X \rangle - \sum_{i \in nz(q)} p_i\, \mathbb{E}[Z_i]\, q[i] \Big). \tag{19}
\]
Fig. 5. Distribution of the transformed inner product error 𝑍 for vectors that draw values from the uniform distribution over [−1, 1] and the Gaussian distribution. These curves represent different settings of the number of non-zero coordinates in the query (𝜓𝑞 ) for 𝑚 = 60 with ℎ = 1 and 𝜓𝑑 = 𝑛𝑝𝑖 = 120. We also plotted the standard normal distribution for reference.
We have already studied the variable 𝑍̄𝑖 = 𝑋̄𝑖 − 𝑋𝑖 in the previous section and noted that the analysis for 𝑍̲𝑖 is symmetric. Therefore, we already know the properties of 𝑍𝑖 , as it is a well-defined random variable that is 𝑍̄𝑖 when 𝑞 [𝑖] > 0 and 𝑍̲𝑖 otherwise.

As we only care about the order induced by scores, we are permitted to translate and scale the inner product error ⟨𝑞, 𝑋̃ − 𝑋⟩ by a constant as follows to arrive at 𝑍 :

\[
\langle q, \tilde{X} - X \rangle \;\overset{rank}{=}\; \sum_{i \in nz(q)} q[i] Z_i - q[i]\,\mathbb{E}[Z_i]
= \sum_{i \in nz(q)} \underbrace{q[i] Z_i}_{Z_i'} - \underbrace{q[i]\,\mathbb{E}[Z_i]}_{\mathbb{E}[Z_i']} \tag{21}
\]
\[
\overset{rank}{=}\; \frac{1}{\sqrt{\sum_{i \in nz(q)} \mathrm{Var}[Z_i]\, q[i]^2}} \sum_{i \in nz(q)} \big( Z_i' - \mathbb{E}[Z_i'] \big) \tag{22}
\]
\[
= Z. \tag{23}
\]

𝑍 is thus the sum of centered random variables 𝑍𝑖′ . Because we assumed that the distribution of 𝑋𝑖 is well-behaved, we can conclude that Var[𝑍𝑖 ] > 0 and that E[|𝑍𝑖 |³] < ∞. If we operate on the assumption that the 𝑍𝑖 s are independent—in reality, they are weakly dependent—albeit not identically distributed, we can appeal to the Berry–Esseen theorem to complete our proof. □
We note that the conditions required by the result above are trivially satisfied when the random variables 𝑋𝑖 are drawn from a distribution with bounded support.
We verify the result above by simulating the following experiment. We draw a query with a given
number of non-zero coordinates (𝜓𝑞 ) from the standard normal distribution. We then draw a vector
of errors 𝑍𝑖 from a distribution defined by its CDF per Equation (13), and compute the transformed
inner product error 𝑍 , and repeat this procedure 10,000 times. We then plot the distribution of
𝑍 in Figure 5 for two families of random vectors, one where we assume vectors are drawn from
a Uniform distribution over [−1, 1] and another from the Gaussian distribution with standard
deviation 𝜎 = 0.1. As the figures illustrate, 𝑍 tends to a standard normal distribution even when
the query has just a handful of active coordinates.
Table 3. The number of non-zero entries in documents (𝜓𝑑 ) and queries (𝜓𝑞 ) for various vector datasets.
We conclude this section with a remark on what Theorem 5.8 enables us to do. Let us assume that
we know the distribution of the random variables 𝑋𝑖 , or that we can estimate 𝑝𝑖 , E[𝑍𝑖 ] and Var[𝑍𝑖 ]
from data. Armed with these statistics, we have all the ingredients to form 𝑍 for a given query
𝑞 and any document vector. As 𝑍 tends to a standard normal distribution, we can thus compute
confidence bounds on the accuracy of the approximate inner product score returned by Sinnamon.
This information can in theory be used to dynamically adjust 𝑘 ′ in Algorithm 7 on a per-query
basis. We leave an exploration of this particular result to future work.
6 EVALUATION
This section presents our empirical evaluation of Sinnamon and its properties on real-world data.
We begin with a description of our empirical setup. We then verify the theoretical results of Section 5
on real datasets. That is followed by a discussion of the retrieval performance of Sinnamon where
we examine the trade-offs the algorithm offers to configure memory, time, and accuracy. We finally
turn to a review of insertions and deletions in Sinnamon and showcase its stable, online behavior.
6.1 Setup
6.1.1 Sparse Vector Datasets. We conduct our experiments on MS MARCO Passage Retrieval
v1 [41]. This question-answering dataset is a collection of 8.8 million short passages in English with
about 56 terms (39 unique terms) per passage on average. We use the smaller “dev” set of queries for retrieval, consisting of 6,980 questions with an average of about 5.8 unique terms per query.
We demonstrate the utility of the algorithms with several different methods of encoding text into
sparse vectors. In particular, we process MS MARCO passages and queries using BM25 [48, 49],
SPLADE [22], efficient SPLADE [33], and uniCOIL [35].
It is a well-known fact that BM25 can be translated into an inner product of two vectors, where
each non-zero entry in the query vector has the IDF of the corresponding term in the vocabulary,
and the document vectors encode BM25’s term importance weight. This requires that the average
document length and the hyperparameters 𝑘 1 and 𝑏 be fixed, but that is a reasonable assumption for
the purposes of our experiments. We set 𝑘 1 = 0.82 and 𝑏 = 0.68, tuned with a grid search. We
drop the (𝑘 1 + 1) factor from the numerator of BM25’s term importance weight so that document
entries are bounded to [0, 1]; this is a rank-preserving change. We pre-process the text of the
collection and queries using the default word tokenizer and Snowball stemmer of the open-source
NLTK 4 library. Note that we include BM25 simply to provide a reference point and emphasize
that LinScan and Sinnamon are designed as general-purpose solutions suitable for real-valued
sparse vectors of which BM25 is a special case.
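For concreteness, the sketch below shows the per-entry weights in this construction; the exact IDF variant shown is an assumption on our part, not necessarily the one used in our experiments.

```rust
// Document-side weight: tf / (tf + k1 * (1 - b + b * dl / avgdl)), i.e., BM25's
// term-importance weight with the (k1 + 1) numerator factor dropped so that
// entries lie in [0, 1].
fn bm25_doc_weight(tf: f32, doc_len: f32, avg_doc_len: f32, k1: f32, b: f32) -> f32 {
    tf / (tf + k1 * (1.0 - b + b * doc_len / avg_doc_len))
}

// Query-side weight: the term's IDF. This is one common IDF form; it is an
// illustrative choice here.
fn idf(num_docs: f32, doc_freq: f32) -> f32 {
    ((num_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0).ln()
}
```

With these weights, the BM25 score of a query-document pair is exactly the inner product of the two sparse vectors, as Equation (2) requires.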
As the second model, we use SPLADE5 [22], a deep learning model that produces sparse repre-
sentations for a given piece of text, where each non-zero entry is the importance weight of a term in the BERT [18] WordPiece [58] vocabulary comprising 30,000 terms. When encoded with this
4 Available at https://github.com/nltk/nltk
5 Pre-trained checkpoint from HuggingFace available at https://huggingface.co/naver/splade-cocondenser-ensembledistil
version of SPLADE, the MS MARCO passage vectors contain an average of 119 non-zero entries and the query vectors 43 non-zero entries.

Fig. 6. Distributions of values and non-zero coordinates in various vector datasets. For each dataset, the figure on the left shows the likelihood of a non-zero coordinate taking on a particular value. The figure on the right shows the likelihood of a particular coordinate being non-zero.
We include SPLADE because it enables us to test the behavior of retrieval algorithms on query
vectors with a large number of non-zero entries. However, we also create another vector dataset
from MS MARCO using a more efficient variant of SPLADE, called Efficient SPLADE6 [33]. This
model produces queries that have far fewer non-zero entries than the original SPLADE model
but documents that have a larger number of non-zero entries. More concretely, the mean 𝜓𝑑 of
document vectors is 181 and that of query vectors is about 5.9. Given the larger size of the document vectors, this dataset helps us examine the memory footprint of Sinnamon in a relatively more extreme scenario.
Similar to SPLADE, uniCOIL [35] produces impact scores for terms in the vocabulary, resulting
in sparse representations for text documents. We use the checkpoint provided by the authors7
to obtain vectors for the MS MARCO collection. This results in document vectors that have on
average 68 non-zero entries, and query vectors with 𝜓𝑞 ≈ 6.
We would be remiss if we did not note that all vector datasets produced from the MS MARCO
dataset are non-negative. This is a limitation of BM25 and existing embedding models that generate
sparse vectors for text. However, the results presented in this section are generalizable to real
vectors. To support that claim, we discuss this topic further and present evidence on synthetic
data at the end of this section. We further hope that our algorithmic contribution inspires sparse
embedding models that can leverage the whole real line including negative weights—an area that
is hitherto unexplored due to a lack of efficient SMIPS algorithms.
As a way to compare and contrast the various vector datasets above, we examine some of their
pertinent characteristics in Table 3 and Figure 6. In Figure 6(a), we plot a histogram of the coordinate
values, showing the likelihood that a non-zero coordinate takes on a particular value. One notable
difference between the datasets is that SPLADE and its other variant, Efficient SPLADE, appear to
have a very different value distribution from BM25 and uniCOIL: in the former, smaller values are more likely to occur.
6 Pre-trained checkpoints for document and query encoders were obtained from https://huggingface.co/naver/efficient-splade-V-large-doc and https://huggingface.co/naver/efficient-splade-V-large-query, respectively
7 Available at https://github.com/castorini/pyserini/blob/master/docs/experiments-unicoil.md.
An Approximate Algorithm for Maximum Inner Product Search over Streaming Sparse Vectors 23
The datasets are different in one other way: the likelihood of a coordinate being non-zero. We plot
this distribution in Figure 6(b) (in log-log scale, with smoothing to reduce noise in our visualization).
We notice that the distributions for (Efficient) SPLADE have a heavier tail than the shape of BM25
and uniCOIL distributions.
6.1.2 Evaluation Metrics. There are four metrics that will be our focus in the remainder of this
work. First is the index size measured in GB. We rely on the programming language—which, in
this work, is Rust8 —to calculate the amount of space held by a data structure and estimate the
memory footprint of the overall index. In Sinnamon, this measurement includes the size of the
inverted index as well as the sketch matrix.
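As an illustration of this kind of accounting, consider the following sketch (our illustration, not the exact instrumentation used in the experiments): the footprint of one uncompressed postings list is its container header plus the allocated capacity of its entries.

```rust
// Rough in-memory footprint of one uncompressed postings list: the Vec
// header plus the allocated capacity of its (document id, value) pairs.
fn postings_size_bytes(postings: &Vec<(u32, f32)>) -> usize {
    std::mem::size_of::<Vec<(u32, f32)>>()
        + postings.capacity() * std::mem::size_of::<(u32, f32)>()
}

fn main() {
    let postings: Vec<(u32, f32)> = vec![(0, 1.5), (7, 0.25), (42, 3.0)];
    println!("{} bytes", postings_size_bytes(&postings));
}
```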
We also report latency in milliseconds. When reporting the latency of retrieval, this figure
includes the time elapsed from the moment a query vector is presented to the algorithm to the
moment the algorithm returns the requested top 𝑘 document vectors. In Sinnamon, for example,
this includes the scoring time of Algorithm 6 as well as the ranking time of Algorithm 7. We note
that, because this work is a study of retrieval of generic vectors, we do not report the latency
incurred to vectorize a given piece of text.
As another metric of interest, we report the accuracy of approximate algorithms in terms of
their recall with respect to exact retrieval. By measuring the recall of an approximate set with
respect to an exact top-𝑘 set, we can study the impact of the different levers in the algorithm on its
overall accuracy as a retrieval engine.
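The computation itself is straightforward; a minimal sketch follows, where document identifiers are assumed to be integers:

```rust
use std::collections::HashSet;

// Recall of an approximate top-k set with respect to the exact top-k set:
// the fraction of exact results recovered by the approximate algorithm.
fn recall(approx: &[u32], exact: &[u32]) -> f64 {
    let exact_set: HashSet<&u32> = exact.iter().collect();
    let hits = approx.iter().filter(|d| exact_set.contains(d)).count();
    hits as f64 / exact.len() as f64
}

fn main() {
    let exact: [u32; 4] = [1, 2, 3, 4];
    let approx: [u32; 4] = [2, 3, 9, 1];
    println!("recall = {}", recall(&approx, &exact)); // 0.75
}
```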
Finally, we also evaluate the algorithms according to task-specific metrics. Because the task in
MS MARCO is to rank passages according to a given query, we measure Normalized Discounted
Cumulative Gain (NDCG) [28] at a deep rank cutoff (1000) and Mean Reciprocal Rank (MRR)
at rank cutoff 10. In this way, we examine the impact of Sinnamon’s levers on the quality of its
solution from the perspective of the end task.
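As an example of the task-specific measurements, here is a sketch of the per-query reciprocal-rank computation; MRR@10 averages this quantity over all queries, and NDCG follows the standard definition of [28].

```rust
use std::collections::HashSet;

// Reciprocal rank at cutoff 10 for a single query: the inverse rank of the
// first relevant document among the top 10 results, or 0 if none appears.
fn rr_at_10(ranked: &[u32], relevant: &HashSet<u32>) -> f64 {
    ranked
        .iter()
        .take(10)
        .position(|d| relevant.contains(d))
        .map_or(0.0, |rank| 1.0 / (rank as f64 + 1.0))
}

fn main() {
    let relevant: HashSet<u32> = [42].into_iter().collect();
    let ranked: Vec<u32> = vec![7, 42, 3, 11];
    println!("RR@10 = {}", rr_at_10(&ranked, &relevant)); // 0.5
}
```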
6.1.3 Hardware. We conduct experiments on two different commercially available platforms. One
is an Intel Xeon (Ice Lake) processor with a clock rate of 2.60GHz with 8 CPU cores and 64GB
of main memory. Another is an Apple M1 Pro processor with the same core count (8) and main
memory capacity (64GB). Not only do these processors have different characteristics and, as such,
shed light on Sinnamon’s behavior in the context of different architectures, but they also represent
two different use-cases. The first of these platforms represents a typical server in a production
environment—in fact, we rented this machine from the Google Cloud Platform—while the second
represents a vast number of end-user devices such as laptops, tablets, and phones. Because Sinnamon can be tailored to different memory and latency configurations, we believe it is important to understand its performance both in a production setting and on edge devices.
6.1.4 Algorithms. We begin, of course, with LinScan and Sinnamon. We implement the two
algorithms and their variants in Rust. This implementation includes support for Sinnamon+ ,
Sinnamon ∥ , LinScan-Roaring, LinScan ∥ , and their anytime variants. We note that, because the
vectors produced by existing models are all non-negative, we only experiment with Sinnamon+
and its parameter 𝑚 (sketch size) on the MS MARCO vector datasets.
To facilitate a clearer discussion of Sinnamon and manage the size of the experimental
configurations, we remove one variable from the equation in a subset of our experiments. In
particular, as we note later, we fix Sinnamon’s parameter ℎ to 1 and study ℎ > 1 only in a limited
number of experiments. We believe, however, that in totality, our experiments along with theoretical
results sufficiently explain the role that ℎ plays in the algorithm. Despite that, we do encourage
practitioners to explore the trade-offs Sinnamon offers for their own use case and application, including tuning ℎ to reach a desired balance between latency, accuracy, and space.

Fig. 7. Sketching and inner product error distributions for the SPLADE dataset predicted by theory and inferred empirically. (a) CDF of the sketching error for 𝑚 ∈ {30, 60, 90}: the theoretical and empirical curves are almost indistinguishable—curves for 𝑚 = 30 overlap, as do curves for other values of 𝑚. (b) Translated and scaled inner product error for ℎ = 1 (left) and ℎ = 2 (right); the solid dark curve plots a standard normal distribution for reference.
As we noted earlier, to the best of our knowledge, Sinnamon is the first approximate algorithm
for MIPS over general sparse vectors. In the absence of other general-purpose systems designed for sparse vectors, we resort to modifying and implementing in Rust existing algorithms from the top-𝑘
retrieval literature so they may operate on real-valued vectors. In particular, we take the popular
Wand [11] algorithm as a representative example. In our examination of Wand, we emphasize
the logic of the algorithm itself—its document-at-a-time process with a pruning mechanism that
skips over the less-promising documents. To that end, our implementation of Wand makes use
of an uncompressed inverted index (where a posting is a pair consisting of a document identifier
and a 32-bit floating point value). In this way, we isolate the latency of the algorithm itself without
factoring in costs incurred by decompression or other operations unrelated to the retrieval logic
itself. As such, we advise against a comparison of the index size in Wand with that in Sinnamon.
6.2 Analysis
In this section, we examine the theoretical results of Section 5 on the vector datasets generated
from MS MARCO. In particular, we infer from the data the empirical probability that coordinate 𝑖
is non-zero (𝑝𝑖 ) and the empirical distribution of non-zero values (i.e., the random variables 𝑋𝑖 ).
We then use these statistics to theoretically predict the distribution of sketching error using
Theorem 5.4 and the inner product error from Theorem 5.8. We refer to statistics predicted by the
theorems as “theoretical” errors in the remainder of this section and in accompanying figures.
In addition to theoretical predictions, we compute the distribution of sketching and inner product
errors directly from the data, which we refer to as “empirical” errors. To infer the sketching error, for
example, we first sketch a document vector, and then decode the value of its non-zero coordinates
from the sketch. We then compute the error between the actual value and the decoded value. By
repeating this process for every document in the collection, we can empirically construct the CDF
of sketching error.
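For intuition, the following is a highly simplified stand-in for this procedure with ℎ = 1, where the random mapping is a hash of the coordinate and each bucket keeps the maximum value mapped to it; decoding an active coordinate then returns its bucket's value, which can only overestimate the truth. This is a sketch of the idea, not our actual implementation of Algorithm 5.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Map coordinate i to one of m buckets (stand-in for the random mapping).
fn bucket(i: usize, m: usize) -> usize {
    let mut h = DefaultHasher::new();
    i.hash(&mut h);
    (h.finish() as usize) % m
}

// Upper-bound sketch with h = 1: each bucket keeps the max value mapped to it.
fn sketch(doc: &[(usize, f32)], m: usize) -> Vec<f32> {
    let mut sk = vec![f32::NEG_INFINITY; m];
    for &(i, v) in doc {
        let j = bucket(i, m);
        sk[j] = sk[j].max(v);
    }
    sk
}

fn main() {
    let m = 8;
    let doc = vec![(3usize, 0.5f32), (17, 1.2), (42, 0.8)];
    let sk = sketch(&doc, m);
    // Empirical sketching error per coordinate: decoded value minus truth.
    for &(i, v) in &doc {
        println!("coordinate {}: error = {}", i, sk[bucket(i, m)] - v);
    }
}
```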
We find the distribution of the random variable 𝑍 from Theorem 5.8 by taking a query-document
pair, computing their inner product using an exact algorithm as well as Sinnamon, and recording the
difference between the two values. We then use the query 𝑞, the non-zero probability of document
coordinates 𝑝𝑖 , and statistics from 𝑋𝑖 to compute and plot 𝑍 .
Now that we have the theoretical and empirical predictions of the distribution of error, we
compare the two to verify that the theory holds in practice for arbitrary data distributions. We
show this comparison in Figure 7 for the SPLADE dataset. We included a similar comparison for
the remaining datasets in Appendix C.
As a general observation, the predictions from theory are accurate. We observe, for example, that
𝑍 takes on a Gaussian shape. We further observe that the predicted CDF of sketching error reflects
the empirical error. It is also worth noting that increasing the number of random mappings from 1 to 2 results in an increase in the probability of sketching error but a decrease in the expected value
of error—the sketching error concentrates closer to 0. For example, when 𝑚 = 90, the probability of
sketching error changes from 0.45 to 0.48, but the expected value of error improves from 28 to 21.
6.3 Retrieval
We begin with an examination of index size, accuracy, and latency as a function of the sketch
size 𝑚 and time budget 𝑇 in Sinnamon on the different vector collections described previously. In
conjunction with the mono-CPU results, we present the parallel version of the algorithm denoted
by Sinnamon∥ run on 8 threads. In all the experiments in this section, we set 𝑘 (in top-𝑘 retrieval) to 1,000.
6.3.1 Latency, Memory, and Retrieval Accuracy. Our objective is to illustrate the Pareto frontier that
Sinnamon explores. However, because there are three objectives involved, instead of rendering a
three-dimensional image, we show in one plot a pictorial presentation of the trade-off between
latency and index size, and couple it with another plot that depicts the interplay between latency
and accuracy (i.e., recall with respect to exact retrieval). In each figure, we distinguish between
different configurations of 𝑚 using shapes and colors, and between the different time budgets 𝑇 for
a given 𝑚 by labeling the points in the figure. We also show results from baseline algorithms for
reference, including LinScan and its compressed, parallel, and anytime flavors.
Let us now turn to Figure 8 where we plot the trade-offs between latency, memory, and accuracy
on the Intel platform. As we observe similar trends on the M1 platform, we do not include those
figures here and refer the reader to Appendix D for results; we note briefly, however, that all
algorithms run substantially faster on M1 due to architectural differences—as memory is mounted
on the processing chip in M1, we observe a higher memory throughput, leading to a significant
speed-up. Finally, we note that, to generate these figures, we run Sinnamon with 𝑘′ = 5,000 and study the impact of 𝑘′ on latency and accuracy later in this section. All runs of the anytime version of LinScan also use 𝑘′ = 5,000.
Turning our attention to Wand first—and ignoring its memory footprint as discussed previously—
we note its excellent performance on the collection of BM25 vectors. This is, after all, as expected: because term frequencies—the main ingredient of BM25 encoding—follow a Zipfian distribution, because queries are short, and because the importance of query terms is non-uniform, Wand's pruning mechanism is able to narrow the search space substantially, leading to a very low latency.
As we move to other collections, however, we lose some of the properties that make Wand fast.
For example, on the Efficient SPLADE collection, queries have few non-zero terms and we observe
that Wand demonstrates a reasonable latency. But queries are an order of magnitude longer in the
SPLADE collection, leading to a dramatic increase in the latency of the algorithm.
Besides the evidence above, we also believe Wand is not suitable for the general setting because
its core pruning idea is designed specifically for Zipfian-distributed values. If each coordinate
instead has more or less the same likelihood of being non-zero in any given vector or when non-
zero entries have a Gaussian distribution, then Wand fails to prune documents effectively and its
logic of finding a pivot suddenly becomes the dominant term in its computational complexity.
Interestingly, LinScan, which traverses the postings list exhaustively but in a coordinate-at-a-
time manner, is often much faster than Wand. As we intuited before, this is due to better cache
utilization and instruction-level parallelism that takes advantage of wide CPU registers, both made
possible by the algorithm’s predictable, sequential scan of arrays in memory. In effect, LinScan
represents a lower-bound of sorts on Sinnamon’s mono-CPU performance, because they share the
same index traversal logic but where Sinnamon’s retrieval logic involves heavier computation.
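For intuition, here is a bare-bones sketch of this coordinate-at-a-time traversal pattern (simplified; the actual implementation differs in its details):

```rust
// Coordinate-at-a-time scoring over an uncompressed inverted index: for each
// non-zero query coordinate, scan its postings sequentially and accumulate
// partial inner products into a dense score array. The predictable, linear
// sweep over memory is what makes this pattern cache- and SIMD-friendly.
fn linscan_score(
    index: &[Vec<(u32, f32)>], // postings per coordinate: (document id, value)
    query: &[(usize, f32)],    // non-zero (coordinate, value) pairs
    num_docs: usize,
) -> Vec<f32> {
    let mut scores = vec![0.0f32; num_docs];
    for &(coord, qv) in query {
        for &(doc, dv) in &index[coord] {
            scores[doc as usize] += qv * dv;
        }
    }
    scores
}

fn main() {
    let index = vec![vec![(0u32, 1.0f32), (2, 0.5)], vec![(1u32, 2.0f32)]];
    let query = vec![(0usize, 0.3f32), (1, 1.0)];
    println!("{:?}", linscan_score(&index, &query, 3)); // [0.3, 2.0, 0.15]
}
```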
Things change when the inverted index in LinScan is compressed, turning the algorithm into LinScan-Roaring. As a general observation, latency tends to increase substantially. This rise can
be attributed to the cost of decompression, which inevitably makes the data structure and index
traversal logic less friendly to the CPU’s caches and registers. Nonetheless, assuming vector values
cannot be quantized and must at least occupy 16 bits, the index size in LinScan-Roaring represents
a ceiling of sorts for Sinnamon; that is, if a configuration of Sinnamon leads to a higher memory
usage than LinScan-Roaring, then there is no real advantage to utilizing Sinnamon in that setup.
Now consider the curves for Sinnamon. Naturally, by reducing the scoring time budget 𝑇 , we
observe a decrease in overall latency—which includes ranking with unlimited time budget. We also
observe, as one would expect, a decrease in retrieval accuracy. This effect is milder in the parallel
version of Sinnamon. We observe the same trend as we tighten the scoring time budget in the
anytime version of LinScan-Roaring.
In our experiments, we set 𝑚 to be roughly 25%, 50%, and 75% of the average 𝜓𝑑 of the document
vector collection (see Table 3); this, for example, translates to 10, 20, and 30 for the BM25-encoded
dataset. Again, as anticipated, by reducing the sketch size 𝑚, Sinnamon allocates less and less
memory to store document sketches. The gap between the different configurations of 𝑚 and
LinScan-Roaring is not so large where document vectors have few non-zero entries (e.g., BM25)
but it widens on collections with a larger 𝜓𝑑 .
(a) BM25 (b) SPLADE (c) Efficient SPLADE (d) uniCOIL
Fig. 8. Trade-offs on the Intel processor between latency and memory (left column), and latency and accu-
racy (right column) for various vector collections (rows). Shapes (and colors) distinguish between different
configurations of Sinnamon, and points on a line represent different time budgets 𝑇 (in milliseconds).
(a) BM25 (b) SPLADE (c) Efficient SPLADE (d) uniCOIL
Fig. 9. Trade-offs on the Intel processor between latency and NDCG@1000 (left column), and latency and
MRR@10 (right column) for various vector collections (rows). As before, shapes (and colors) distinguish
between different configurations of Sinnamon, and points on a line represent different time budgets 𝑇 (in
milliseconds).
6.3.3 Effect of 𝑘 ′. The experiments so far used a fixed value for the intermediate number of
candidates 𝑘 ′. We ask now if and to what degree changing this hyperparameter affects the trade-
offs and shapes the interplay between the three factors. To study this effect, we choose a single value of 𝑚 for each dataset and measure retrieval accuracy as 𝑘′ grows from 1,000 to 20,000. We
expect retrieval accuracy to increase as more candidates are re-ranked by Algorithm 7.
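To make the role of 𝑘′ concrete, here is a simplified sketch of the re-ranking step, where exact_score stands in for a lookup into the raw vector storage (a hypothetical helper):

```rust
// Keep the top k' candidates by approximate score, re-rank them with exact
// inner products, and return the final top k. `exact_score` is a hypothetical
// accessor over the raw vector storage.
fn rerank(
    approx: &mut Vec<(u32, f32)>, // (document id, approximate score)
    k_prime: usize,
    k: usize,
    exact_score: impl Fn(u32) -> f32,
) -> Vec<(u32, f32)> {
    approx.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    approx.truncate(k_prime);
    let mut exact: Vec<(u32, f32)> =
        approx.iter().map(|&(id, _)| (id, exact_score(id))).collect();
    exact.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    exact.truncate(k);
    exact
}

fn main() {
    let mut approx = vec![(1u32, 0.9f32), (2, 0.8), (3, 0.7), (4, 0.6)];
    let top = rerank(&mut approx, 3, 2, |id| 1.0 / id as f32);
    println!("{:?}", top); // documents 1 and 2 by exact score
}
```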
As we show in Figure 10, this is indeed the effect we observe on our vector collections. In this
figure, we retrieve 𝑘 ′ vectors for two values of 𝑚 that are 25% and 75% of the 𝜓𝑑 of a dataset. As
𝑘 ′ increases, so does retrieval accuracy. We must note, however, that a larger 𝑘 ′ adversely affects
overall latency as more documents must be re-ranked using exact scores, which results in a larger
number of fetches from the raw vector storage. However, this effect can be amortized over multiple
processors in the Sinnamon ∥ variant.
Having made the observations above, we repeat the experiments presented earlier in this section for 𝑘′ = 20,000 and visualize the trade-offs between latency and retrieval accuracy once more—
index size remains the same as in Figure 8, so we leave it out. This time, we limit the experiments
to Sinnamon ∥ as it is a more practical choice than a mono-CPU variant, following the note above.
These results are presented in Figure 11 for the Intel processor with M1 results presented in
Appendix F.
While retrieval accuracy appears to improve across the board, including for the anytime variants, with an increase in 𝑘′, Sinnamon's latency remains the same or degrades compared to the 𝑘′ = 5,000 setup. This increase in cost is primarily driven by the exact score computation, as
expected. We observe that Sinnamon’s anytime variants perform faster than LinScan-Roaring ∥
and with a high accuracy on the SPLADE collection. This is an interesting result if one notes that the differentiating factor between SPLADE and the other collections is SPLADE's relatively larger 𝜓𝑞, hinting that Sinnamon may indeed be a competitive algorithm when query vectors have a large number of non-zero entries. Finally, we note that the increase in quality as a result of using a stricter time budget in the BM25 plot is statistically insignificant and can be attributed to noise.

Fig. 10. The effect of retrieving 𝑘′ vectors in Algorithm 7 on retrieval accuracy in terms of recall with respect to exact retrieval. In the experiments leading to the above figure, we use a value of 𝑚 that is roughly 25% and 75% of the average 𝜓𝑑 in the datasets (10/30 for BM25, 30/90 for SPLADE, 45/135 for Efficient SPLADE, and 15/45 for uniCOIL).
Fig. 11. Trade-offs on the Intel processor between latency and retrieval accuracy for various vector collections
by retrieving 𝑘 ′ = 20, 000. Shapes (and colors) distinguish between different configurations of Sinnamon, and
points on a line represent different time budgets 𝑇 (in milliseconds).
In the experiment summarized in Figure 12, we index each vector collection and subsequently delete 1 million randomly selected vectors from the index. We measure the latency of each deletion in milliseconds and report the means of 10 trials in this figure. Both
algorithms show a stable latency across datasets, with a negligible speed-up as more vectors are
deleted. But we observe that Sinnamon deletes vectors with a response time that is about an order
of magnitude faster than that in LinScan-Roaring.
Fig. 12. Indexing throughput (vectors per second) and deletion latency (milliseconds per vector) as a function
of the size of the index on the Intel processor.
Table 4. Comparison of different retrieval algorithms on a dataset of 5 million vectors with 𝜓 non-zero coordinates, drawn from the standard normal distribution. In 𝐺100, 𝜓𝑞 = 𝜓𝑑 = 𝜓 = 100 and dimensionality 𝑛 = 10,000. In 𝐺200, 𝜓 = 200 and dimensionality 𝑛 = 32,000.

       Algorithm          Index Size (GB)   Query Latency (ms)   Recall w.r.t. Exact
𝐺100   WAND                     5.2               2,437                  1
       LinScan-Roaring          2.0                 151                  1
1,000 vectors to form the query set. We then run a limited set of experiments with a modified variant of WAND that can handle real values, LinScan-Roaring, and Sinnamon to retrieve the top 𝑘 = 1,000 documents, where 𝑘′ = 20,000 in Sinnamon. We experiment with two configurations: a) 𝜓 = 100 and 𝑛 = 10,000, a dataset we call 𝐺100; and b) 𝜓 = 200 and 𝑛 = 32,000, a dataset we call 𝐺200.
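For reference, here is a sketch of how one such synthetic vector can be drawn, assuming the rand and rand_distr crates (our actual generator may differ in details):

```rust
use rand::seq::index::sample;
use rand::Rng;
use rand_distr::StandardNormal;

// Draw one n-dimensional sparse vector with exactly psi non-zero coordinates,
// chosen uniformly at random, with values from the standard normal.
fn synthetic_vector(n: usize, psi: usize, rng: &mut impl Rng) -> Vec<(usize, f32)> {
    sample(rng, n, psi)
        .into_iter()
        .map(|i| (i, rng.sample::<f32, _>(StandardNormal)))
        .collect()
}

fn main() {
    let mut rng = rand::thread_rng();
    let v = synthetic_vector(10_000, 100, &mut rng); // one G_100-style vector
    println!("{} non-zero entries", v.len());
}
```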
Table 4 summarizes the results of our experiments. We observe trends that are consistent with
the findings in the preceding sections. Notably, WAND scales poorly when 𝜓𝑞 is large. LinScan-
Roaring achieves much better latency with a reasonable memory footprint. Finally, Sinnamon
with 2𝑚 = 75%𝜓𝑑 reduces memory usage, with anytime variants achieving better latency at the
cost of accuracy. Overall, these results confirm that Sinnamon on real-valued vector datasets offers
similar trade-offs as Sinnamon+ does on non-negative vectors.
Table 5. Query latency (in milliseconds) with speed-up in parentheses (i.e., the ratio between the latency of single-threaded Sinnamon𝑇=∞ and that of Sinnamon∥𝑇=∞) on the Intel platform with the indicated number of cores. Experiments are conducted on the synthetic datasets 𝐺100 and 𝐺200 of Table 4.

Dataset    1 core    2 cores        4 cores        8 cores
𝐺100       174       95 (1.83×)     59 (2.95×)     40 (4.35×)
𝐺200       253       139 (1.82×)    86 (2.94×)     54 (4.68×)
study if noisy inverted lists can reduce the overall memory footprint, possibly at the expense of accuracy.
Staying with memory usage, we also deem it worthy of a thorough empirical investigation to
understand the effect of other forms of compression on the inverted lists as well as the sketch
matrix, especially in the case of an offline algorithm where it is safe to assume that the vector
dataset remains almost stationary. Does knowing the data distribution a priori help us quantize
vectors or document sketches to make the overall structure more compact? Can we design a sketch
with a smaller error probability if we relaxed the streaming requirement?
As an example of what is possible, we take the synthetic datasets of Section 6.5 and assume that
the dataset is stationary. We then use a different compression scheme to encode inverted lists in an
offline manner. On the 𝐺 100 dataset, switching from Roaring to PForDelta [67] reduces the overall
size of the index from 1.7 GB to 1.3 GB—a reduction of approximately 23%. On the 𝐺 200 dataset, we
observe a similar reduction in index size.
We believe another effort that may lead to lower memory consumption is to take advantage of ℎ, the number of random mappings in Sinnamon—a hyperparameter that we left largely unexplored
in our empirical evaluation of the algorithm. In particular, we showed that by using more random
mappings, we may see an increase in the probability of error, but where the probability mass shifts
closer to 0. This suggests that, in applications or environments where consuming less memory is
preferred over achieving the lowest latency, it makes more sense to use 2 or more random mappings.
What number of random mappings is appropriate can be determined by the theoretical results
presented in this work for a particular data distribution.
Memory and latency aside, we believe our theoretical analysis of the sketch and the retrieval algorithm raises a few interesting questions that we wish to investigate. For example, we developed
a good understanding of the magnitude of error and its impact on the final retrieval quality. Can we
now use this knowledge to decide what value of 𝑘 ′ can give us a particular retrieval accuracy? Can
that prediction be done on a per-query basis? Our initial thought is that this is indeed possible,
but materializing it and quantifying its impact on latency needs further experiments.
We can even take a step beyond sparse vectors and, in what may be unexpected, generalize the
top-𝑘 retrieval problem to, simply, vectors. More specifically, vectors that may have a “dense” part
and a “sparse” part. When is it sufficient, theoretically and practically, to break up such vectors into
distinct collections of their dense part and their sparse part, and solve the two problems separately
before re-ranking the top candidates from the two solution sets? When does it make sense to jointly,
with a single index, solve the top-𝑘 retrieval problem over such hybrid vectors? And how? Can
these results over hybrid vectors help us design better matrix multiplication algorithms for sparse,
or dense-sparse matrices? These and similar questions will be the focus of our future research.
REFERENCES
[1] Nir Ailon and Bernard Chazelle. 2009. The Fast Johnson–Lindenstrauss Transform and Approximate Nearest Neighbors.
SIAM J. Comput. 39, 1 (2009), 302–322.
[2] Nir Ailon and Edo Liberty. 2011. An Almost Optimal Unrestricted Fast Johnson-Lindenstrauss Transform. In Proceedings
of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms (San Francisco, California). 185–191.
[3] Nir Ailon and Edo Liberty. 2013. An Almost Optimal Unrestricted Fast Johnson-Lindenstrauss Transform. ACM Trans.
Algorithms 9, 3, Article 21 (jun 2013), 12 pages.
[4] Nima Asadi. 2013. Multi-Stage Search Architectures for Streaming Documents. University of Maryland.
[5] Nima Asadi and Jimmy Lin. 2012. Fast Candidate Generation for Two-Phase Document Ranking: Postings List
Intersection with Bloom Filters. In Proceedings of the 21st ACM International Conference on Information and Knowledge
Management (Maui, Hawaii, USA). 2419–2422.
[6] Nima Asadi and Jimmy Lin. 2013. Effectiveness/Efficiency Tradeoffs for Candidate Generation in Multi-Stage Retrieval
Architectures. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information
Retrieval (Dublin, Ireland). 997–1000.
[7] Nima Asadi and Jimmy Lin. 2013. Fast Candidate Generation for Real-Time Tweet Search with Bloom Filter Chains.
ACM Trans. Inf. Syst. 31, 3, Article 13 (aug 2013), 36 pages.
[8] Ariful Azad and Aydın Buluç. 2017. A Work-Efficient Parallel Sparse Matrix-Sparse Vector Multiplication Algorithm.
2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (2017), 688–697.
[9] Yang Bai, Xiaoguang Li, Gang Wang, Chaoliang Zhang, Lifeng Shang, Jun Xu, Zhaowei Wang, Fangshan Wang, and
Qun Liu. 2020. SparTerm: Learning Term-based Sparse Representation for Fast Text Retrieval.
[10] Burton H. Bloom. 1970. Space/Time Trade-Offs in Hash Coding with Allowable Errors. Commun. ACM 13, 7 (jul 1970),
422–426.
[11] Andrei Z. Broder, David Carmel, Michael Herscovici, Aya Soffer, and Jason Zien. 2003. Efficient Query Evaluation Using
a Two-Level Retrieval Process. In Proceedings of the Twelfth International Conference on Information and Knowledge
Management (New Orleans, LA, USA). 426–434.
[12] Sebastian Bruch, Siyu Gai, and Amir Ingber. 2022. An Analysis of Fusion Functions for Hybrid Retrieval.
arXiv:2210.11934 [cs.IR]
[13] Samy Chambi, Daniel Lemire, Owen Kaser, and Robert Godin. 2016. Better Bitmap Performance with Roaring Bitmaps.
Softw. Pract. Exper. 46, 5 (may 2016), 709–719.
[14] Tao Chen, Mingyang Zhang, Jing Lu, Michael Bendersky, and Marc Najork. 2022. Out-of-Domain Semantics to the
Rescue! Zero-Shot Hybrid Retrieval Models. In Advances in Information Retrieval: 44th European Conference on IR
Research, ECIR 2022, Stavanger, Norway, April 10–14, 2022, Proceedings, Part I (Stavanger, Norway). 95–110.
[15] Graham Cormode and S. Muthukrishnan. 2005. An Improved Data Stream Summary: The Count-Min Sketch and Its
Applications. J. Algorithms 55, 1 (apr 2005), 58–75.
[16] Matt Crane, J. Shane Culpepper, Jimmy Lin, Joel Mackenzie, and Andrew Trotman. 2017. A Comparison of Document-
at-a-Time and Score-at-a-Time Query Evaluation. In Proceedings of the Tenth ACM International Conference on Web
Search and Data Mining (Cambridge, United Kingdom). 201–210.
[17] Zhuyun Dai and Jamie Callan. 2020. Context-Aware Term Weighting For First Stage Passage Retrieval. In Proceedings
of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (Virtual Event,
China). 1533–1536.
[18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association
for Computational Linguistics, Minneapolis, Minnesota, 4171–4186.
[19] Constantinos Dimopoulos, Sergey Nepomnyachiy, and Torsten Suel. 2013. Optimizing Top-k Document Retrieval
Strategies for Block-Max Indexes. In Proceedings of the Sixth ACM International Conference on Web Search and Data
Mining (Rome, Italy). 113–122.
[20] Shuai Ding and Torsten Suel. 2011. Faster Top-k Document Retrieval Using Block-Max Indexes. In Proceedings of
the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (Beijing, China).
993–1002.
[21] Li Fan, Pei Cao, Jussara Almeida, and Andrei Z. Broder. 2000. Summary Cache: A Scalable Wide-Area Web Cache
Sharing Protocol. IEEE/ACM Trans. Netw. 8, 3 (jun 2000), 281–293.
[22] Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant. 2022. From Distillation to Hard
Negative Sampling: Making Sparse Neural IR Models More Effective. In Proceedings of the 45th International ACM
SIGIR Conference on Research and Development in Information Retrieval (Madrid, Spain). 2353–2359.
[23] Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. 2021. SPLADE: Sparse Lexical and Expansion Model
for First Stage Ranking. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in
Information Retrieval (Virtual Event, Canada). 2288–2292.
[24] Jeremy Fowers, Kalin Ovtcharov, Karin Strauss, Eric S Chung, and Greg Stitt. 2014. A high memory bandwidth
fpga accelerator for sparse matrix-vector multiplication. In 2014 IEEE 22nd Annual International Symposium on Field-
Programmable Custom Computing Machines. IEEE, 36–43.
[25] Luyu Gao, Zhuyun Dai, and Jamie Callan. 2021. COIL: Revisit Exact Lexical Match in Information Retrieval with
Contextualized Inverted List. In Proceedings of the 2021 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021. 3030–3042.
[26] Bob Goodwin, Michael Hopcroft, Dan Luu, Alex Clemmer, Mihaela Curmei, Sameh Elnikety, and Yuxiong He. 2017.
BitFunnel: Revisiting Signatures for Search. In Proceedings of the 40th International ACM SIGIR Conference on Research
and Development in Information Retrieval (Shinjuku, Tokyo, Japan). 605–614.
[27] Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. 2020. Accelerating
Large-Scale Inference with Anisotropic Vector Quantization. In Proceedings of the 37th International Conference on
Machine Learning (Proceedings of Machine Learning Research). 3887–3896.
[28] Kalervo Järvelin and Jaana Kekäläinen. 2000. IR evaluation methods for retrieving highly relevant documents. In
Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval.
ACM, 41–48.
[29] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. 2011. Product Quantization for Nearest Neighbor Search. IEEE
Trans. Pattern Anal. Mach. Intell. 33, 1 (2011), 117–128.
[30] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2021. Billion-Scale Similarity Search with GPUs. IEEE Transactions on
Big Data 7 (2021), 535–547.
[31] Aditya Krishnan and Edo Liberty. 2021. Projective Clustering Product Quantization. arXiv:2112.02179 [cs.DS]
[32] Saar Kuzi, Mingyang Zhang, Cheng Li, Michael Bendersky, and Marc Najork. 2020. Leveraging Semantic and Lexical
Matching to Improve the Recall of Document Retrieval Systems: A Hybrid Approach. (2020). arXiv:2010.01195 [cs.IR]
[33] Carlos Lassance and Stéphane Clinchant. 2022. An Efficiency Study for SPLADE Models. In Proceedings of the 45th
International ACM SIGIR Conference on Research and Development in Information Retrieval (Madrid, Spain). 2220–2226.
[34] Jiajia Li, Jimeng Sun, and Richard Vuduc. 2018. HiCOO: Hierarchical storage of sparse tensors. In SC18: International
Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 238–252.
[35] Jimmy Lin and Xueguang Ma. 2021. A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for
Information Retrieval Techniques. arXiv:2106.14807 [cs.IR]
[36] Jimmy Lin and Andrew Trotman. 2015. Anytime Ranking for Impact-Ordered Indexes. In Proceedings of the 2015
International Conference on The Theory of Information Retrieval (Northampton, Massachusetts, USA). 301–304.
[37] Yu. A. Malkov and D. A. Yashunin. 2016. Efficient and robust approximate nearest neighbor search using Hierarchical
Navigable Small World graphs. arXiv:1603.09320 [cs.DS]
[38] Antonio Mallia, Omar Khattab, Torsten Suel, and Nicola Tonellotto. 2021. Learning Passage Impacts for Inverted
Indexes. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information
Retrieval (Virtual Event, Canada). 1723–1727.
[39] Antonio Mallia, Giuseppe Ottaviano, Elia Porciani, Nicola Tonellotto, and Rossano Venturini. 2017. Faster BlockMax
WAND with Variable-Sized Blocks. In Proceedings of the 40th International ACM SIGIR Conference on Research and
Development in Information Retrieval (Shinjuku, Tokyo, Japan). 625–634.
[40] Antonio Mallia and Elia Porciani. 2019. Faster BlockMax WAND with Longer Skipping. In Advances in Information
Retrieval. 771–778.
[41] Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS
MARCO: A Human Generated MAchine Reading COmprehension Dataset. (November 2016).
[42] Rodrigo Nogueira and Kyunghyun Cho. 2020. Passage Re-ranking with BERT. arXiv:1901.04085 [cs.IR]
[43] Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020. Document Ranking with a Pretrained
Sequence-to-Sequence Model. In Findings of the Association for Computational Linguistics: EMNLP 2020. 708–718.
[44] Subhankar Pal, Jonathan Beaumont, Dong-Hyeon Park, Aporva Amarnath, Siying Feng, Chaitali Chakrabarti, Hun-
Seok Kim, David Blaauw, Trevor Mudge, and Ronald Dreslinski. 2018. OuterSPACE: An Outer Product Based Sparse
Matrix Multiplication Accelerator. In 2018 IEEE International Symposium on High Performance Computer Architecture
(HPCA). 724–736.
[45] Giulio Ermanno Pibiri and Rossano Venturini. 2020. Techniques for Inverted Index Compression. ACM Comput. Surv.
53, 6, Article 125 (dec 2020), 36 pages.
[46] Rameshwar Pratap, Debajyoti Bera, and Karthik Revanuru. 2019. Efficient Sketching Algorithm for Sparse Binary
Data. In 2019 IEEE International Conference on Data Mining (ICDM). 508–517.
[47] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational
Linguistics.
[48] Stephen Robertson and Hugo Zaragoza. 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations
and Trends in Information Retrieval 3, 4 (April 2009), 333–389.
[49] Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. 1994. Okapi at
TREC-3.. In TREC (NIST Special Publication, Vol. 500-225), Donna K. Harman (Ed.). National Institute of Standards and
Technology (NIST), 109–126.
[50] Aditi Singh, Suhas Jayaram Subramanya, Ravishankar Krishnaswamy, and Harsha Vardhan Simhadri. 2021.
FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search. arXiv:2105.09613 [cs.IR]
[51] Shaden Smith, Niranjay Ravindran, Nicholas D Sidiropoulos, and George Karypis. 2015. SPLATT: Efficient and parallel
sparse tensor-matrix multiplication. In 2015 IEEE International Parallel and Distributed Processing Symposium. IEEE,
61–70.
[52] Nitish Srivastava, Hanchen Jin, Jie Liu, David Albonesi, and Zhiru Zhang. 2020. MatRaptor: A Sparse-Sparse Matrix
Multiplication Accelerator Based on Row-Wise Product. In 2020 53rd Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO). 766–780.
[53] Nitish Srivastava, Hanchen Jin, Shaden Smith, Hongbo Rong, David Albonesi, and Zhiru Zhang. 2020. Tensaurus: A
Versatile Accelerator for Mixed Sparse-Dense Tensor Computations. In 2020 IEEE International Symposium on High
Performance Computer Architecture (HPCA). 689–702.
[54] Nicola Tonellotto, Craig Macdonald, and Iadh Ounis. 2018. Efficient Query Processing for Scalable Web Search. Found.
Trends Inf. Retr. 12, 4–5 (dec 2018), 319–500.
[55] Bhisham Dev Verma, Rameshwar Pratap, and Debajyoti Bera. 2022. Efficient Binary Embedding of Categorical Data
using BinSketch. Data Mining and Knowledge Discovery 36 (2022), 537–565.
[56] Jianguo Wang, Chunbin Lin, Yannis Papakonstantinou, and Steven Swanson. 2017. An Experimental Study of Bitmap
Compression vs. Inverted List Compression. In Proceedings of the 2017 ACM International Conference on Management of
Data (Chicago, Illinois, USA). 993–1008.
[57] Shuai Wang, Shengyao Zhuang, and Guido Zuccon. 2021. BERT-Based Dense Retrievers Require Interpolation with
BM25 for Effective Passage Retrieval. In Proceedings of the 2021 ACM SIGIR International Conference on Theory of
Information Retrieval (Virtual Event, Canada). 317–324.
[58] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun,
Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan
Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff
Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean.
2016. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation.
arXiv:1609.08144 [cs.CL]
[59] Jheng-Hong Yang, Xueguang Ma, and Jimmy Lin. 2021. Sparsifying Sparse Representations for Passage Retrieval by
Top-𝑘 Masking. arXiv:2112.09628 [cs.IR]
[60] Dawei Yin, Yuening Hu, Jiliang Tang, Tim Daly, Mianwei Zhou, Hua Ouyang, Jianhui Chen, Changsung Kang, Hongbo
Deng, Chikashi Nobata, et al. 2016. Ranking relevance in yahoo search. In Proceedings of the 22nd ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining. ACM, 323–332.
[61] Hamed Zamani, Mostafa Dehghani, W. Bruce Croft, Erik Learned-Miller, and Jaap Kamps. 2018. From Neural Re-
Ranking to Neural Ranking: Learning a Sparse Representation for Inverted Indexing. In Proceedings of the 27th ACM
International Conference on Information and Knowledge Management (Torino, Italy). 497–506.
[62] Hamed Zamani, Fernando Diaz, Mostafa Dehghani, Donald Metzler, and Michael Bendersky. 2022. Retrieval-Enhanced
Machine Learning. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in
Information Retrieval (Madrid, Spain). 2875–2886.
[63] Jiangong Zhang, Xiaohui Long, and Torsten Suel. 2008. Performance of Compressed Inverted List Caching in Search
Engines. In Proceedings of the 17th International Conference on World Wide Web (Beijing, China). 387–396.
[64] Zhixin Zhou, Shulong Tan, Zhaozhuo Xu, and Ping Li. 2019. Möbius Transformation for Fast Inner Product Search on
Graph. In Advances in Neural Information Processing Systems, Vol. 32.
[65] Shengyao Zhuang and Guido Zuccon. 2022. Fast Passage Re-ranking with Contextualized Exact Term Matching and
Efficient Passage Expansion. In Workshop on Reaching Efficiency in Neural Information Retrieval, the 45th International
ACM SIGIR Conference on Research and Development in Information Retrieval.
[66] Justin Zobel and Alistair Moffat. 2006. Inverted Files for Text Search Engines. ACM Comput. Surv. 38, 2 (jul 2006),
56 pages.
[67] Marcin Zukowski, Sandor Heman, Niels Nes, and Peter Boncz. 2006. Super-Scalar RAM-CPU Cache Compression. In
Proceedings of the 22nd International Conference on Data Engineering. USA, 59.
Lemma A.1. Suppose a sparse random vector 𝑋 ∈ R𝑛 is encoded into an 𝑚-dimensional sketch with Sinnamon using Algorithm 5 with a single random mapping (ℎ = 1). Suppose the probability that a coordinate is active, 𝑝𝑖, is equal to 𝑝 for all coordinates. Suppose further that the values of active coordinates 𝑋𝑖 are drawn from Gaussian(𝜇 = 0, 𝜎 = 1). The probability that the upper-bound sketch overestimates the value of 𝑋𝑖 upon decoding it as $\bar{X}_i$ is the following quantity:

$$\mathbb{P}\big[\bar{X}_i > X_i\big] \approx 1 - \frac{m}{(n-1)p}\Big(1 - e^{-\frac{(n-1)p}{m}}\Big).$$
Given that the 𝑋𝑖 are drawn from a Gaussian distribution, and using the approximation above, we can rewrite the probability of error as:

$$\mathbb{P}\big[\bar{X}_i > X_i\big] \approx \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \Big(1 - e^{-\frac{(n-1)p}{m}(1-\Phi(\alpha))}\Big)\, e^{-\frac{\alpha^2}{2}}\, d\alpha.$$
We now break up the right-hand side into the following three terms, replacing $(n-1)p/m$ with $\beta$ for brevity:

$$\mathbb{P}\big[\bar{X}_i > X_i\big] \approx \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{\alpha^2}{2}}\, d\alpha \qquad (24)$$

$$- \int_{-\infty}^{0} \frac{1}{\sqrt{2\pi}}\, e^{-\beta(1-\Phi(\alpha))}\, e^{-\frac{\alpha^2}{2}}\, d\alpha \qquad (25)$$

$$- \int_{0}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\beta(1-\Phi(\alpha))}\, e^{-\frac{\alpha^2}{2}}\, d\alpha. \qquad (26)$$
The term in (24) is equal to 1. Let us turn to (26) first. We have that:
$$1 - \Phi(\alpha) \overset{\alpha > 0}{=} \frac{1}{2} - \underbrace{\frac{1}{\sqrt{2\pi}} \int_{0}^{\alpha} e^{-\frac{t^2}{2}}\, dt}_{\lambda(\alpha)}. \qquad (27)$$
we arrive at:

$$\int_{-\infty}^{0} \frac{1}{\sqrt{2\pi}}\, e^{-\beta(1-\Phi(\alpha))}\, e^{-\frac{\alpha^2}{2}}\, d\alpha = \frac{1}{\beta}\, e^{-\frac{\beta}{2}}\Big(1 - e^{-\frac{\beta}{2}}\Big). \qquad (30)$$
Plugging the results above into Equations (24), (25), and (26) results in:

$$\begin{aligned}
\mathbb{P}\big[\bar{X}_i > X_i\big] &\approx 1 - \frac{1}{\beta}\Big(1 - e^{-\frac{\beta}{2}}\Big) - \frac{1}{\beta}\, e^{-\frac{\beta}{2}}\Big(1 - e^{-\frac{\beta}{2}}\Big) \\
&= 1 - \frac{1}{\beta}\Big(1 - e^{-\frac{\beta}{2}}\Big)\Big(1 + e^{-\frac{\beta}{2}}\Big) \\
&= 1 - \frac{m}{(n-1)p}\Big(1 - e^{-\frac{(n-1)p}{m}}\Big),
\end{aligned}$$
which completes the proof. □
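As an aside, the closed form above is easy to validate numerically. The following Monte Carlo sketch (assuming the rand and rand_distr crates; the single random mapping is drawn uniformly per coordinate) draws sparse Gaussian vectors, builds the upper-bound sketch with ℎ = 1, and compares the empirical overestimation rate against the lemma:

```rust
use rand::Rng;
use rand_distr::StandardNormal;

fn main() {
    // n coordinates, m buckets, activation probability p, number of vectors.
    let (n, m, p, trials) = (10_000usize, 90usize, 0.012f64, 200usize);
    let mut rng = rand::thread_rng();
    let (mut over, mut total) = (0usize, 0usize);
    for _ in 0..trials {
        let mut sketch = vec![f64::NEG_INFINITY; m];
        let mut actives: Vec<(usize, f64)> = Vec::new();
        for _ in 0..n {
            if rng.gen::<f64>() < p {
                let v: f64 = rng.sample(StandardNormal);
                let j = rng.gen_range(0..m); // the single random mapping
                sketch[j] = sketch[j].max(v);
                actives.push((j, v));
            }
        }
        for &(j, v) in &actives {
            total += 1;
            if sketch[j] > v {
                over += 1; // decoded upper bound strictly overestimates
            }
        }
    }
    let beta = (n as f64 - 1.0) * p / m as f64;
    let theory = 1.0 - (1.0 - (-beta).exp()) / beta;
    println!(
        "empirical = {:.3}, theory = {:.3}",
        over as f64 / total as f64,
        theory
    );
}
```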
Given the result above, the solution for the general case of ℎ > 0 is straightforward to obtain.
We show this result in the lemma below.
Lemma A.2. Under similar conditions as Lemma A.1 but assuming ℎ ≥ 1, the probability that the upper-bound sketch overestimates an active coordinate 𝑋𝑖 upon decoding it as $\bar{X}_i$ is the following quantity:

$$\mathbb{P}\big[\bar{X}_i > X_i\big] \approx 1 + \sum_{k=1}^{h} (-1)^k \binom{h}{k}\, \frac{m}{kh(n-1)p}\Big(1 - e^{-\frac{kh(n-1)p}{m}}\Big).$$
When the active values of a vector are drawn from a Gaussian distribution, the pairwise difference between any two coordinates itself has a Gaussian distribution with standard deviation $\sqrt{\sigma^2 + \sigma^2} = \sigma\sqrt{2}$. As such, we may estimate $1 - \Phi(\alpha + \delta)$ by considering the probability that a pair of coordinates (one of which has value 𝛼) has a difference greater than 𝛿: $\mathbb{P}[X_i - X_j > \delta]$. With that idea, we may thus write:

$$1 - \Phi(\alpha + \delta) = 1 - \Phi'(\delta),$$

where $\Phi'(\cdot)$ is the CDF of a zero-mean Gaussian distribution with standard deviation $\sigma\sqrt{2}$. Putting everything together, we can write:

$$\mathbb{P}\big[Z_i \le \delta\big] \approx 1 - \int \Big(1 - e^{-\frac{h(n-1)p}{m}(1-\Phi'(\delta))}\Big)^h\, d\mathbb{P}(\alpha) = 1 - \Big(1 - e^{-\frac{h(n-1)p}{m}(1-\Phi'(\delta))}\Big)^h.$$
Fig. 13. Sketching and inner product error distributions for the BM25 dataset.
Fig. 14. Sketching and inner product error distributions for the Efficient SPLADE dataset.
Fig. 15. Sketching and inner product error distributions for the uniCOIL dataset.
(a) BM25 (b) SPLADE (c) Efficient SPLADE (d) uniCOIL
Fig. 16. Trade-offs on the Apple M1 processor between latency and memory (left column), and latency and
accuracy (right column) for various vector collections (rows). Shapes (and colors) distinguish between different
configurations of Sinnamon, and points on a line represent different time budgets 𝑇 (in milliseconds).
(a) BM25 (b) SPLADE (c) Efficient SPLADE (d) uniCOIL
Fig. 17. Trade-offs on the M1 processor between latency and NDCG@1000 (left column), and latency and
MRR@10 (right column) for various vector collections (rows). As before, shapes (and colors) distinguish
between different configurations of Sinnamon, and points on a line represent different time budgets 𝑇 (in
milliseconds).
Fig. 18. Trade-offs on the M1 processor between latency and retrieval accuracy for various vector collections
by retrieving 𝑘 ′ = 20, 000. Shapes (and colors) distinguish between different configurations of Sinnamon, and
points on a line represent different time budgets 𝑇 (in milliseconds).
Fig. 19. Indexing throughput (vectors per second) and deletion latency (milliseconds per vector) as a function
of the size of the index on the M1 processor.