quantization (LOPQ). Following a quite common option of [12], a coarse quantizer is used to index data by inverted lists, and residuals between data points and centroids are PQ-encoded. But within-cell distributions are largely unimodal; hence, as in Fig. 1(d), we locally optimize an individual product quantizer per cell. Under no assumptions on the distribution, practically all centroids are supported by data, contributing to a lower distortion.

LOPQ requires reasonable space and time overhead compared to PQ, both for offline training/indexing and online queries; but all overhead is constant in the data size. It is embarrassingly simple to apply and boosts performance on several public datasets. A multi-index is essential for large-scale datasets and combining it with LOPQ is less trivial, but we provide a scalable solution nevertheless.

2. Related work and contribution

Focusing on large datasets where index space is the bottleneck, we exclude e.g. tree-based methods like [14] that require all data uncompressed in memory. Binary encoding is the most compact representation, approaching ANN search via exhaustive search in Hamming space. Methods like spectral hashing [18], ITQ [10] or k-means hashing [11] focus on learning codes optimized for the underlying data distribution. Search in Hamming space is really fast but, despite the learning, performance suffers.

Significant benefit is to be gained via multiple quantizers or hash tables as in LSH [6], at the cost of storing each point index multiple times. For instance, [17, 19] gain performance by multiple k-means quantizers via random re-initialization or by partitioning jointly trained centroids. Similarly, multi-index hashing [16] gains speed via multiple hash tables on binary code substrings. We still outperform all such approaches at only a fraction of the index space.

PQ [12] provides efficient vector quantization with less distortion than binary encoding. Transform coding [4] is a special case of scalar quantization that additionally allocates bits according to variance per dimension. OPQ [9] and Ck-means [15] generalize PQ by jointly optimizing rotation, subspace decomposition and sub-quantizers. Interestingly, the parametric solution of OPQ aims at the exact opposite of [4]: balancing variance given a uniform bit allocation over subspaces.

Although [12] provides the non-exhaustive variant IVFADC, based on a coarse quantizer and PQ-encoded residuals, [9, 15] are exhaustive. The inverted multi-index [3] achieves very fine space partitioning via one quantizer per subspace and is compatible with PQ-encoding, gaining performance at query times comparable to Hamming-space search. On the other hand, the idea of space decomposition can be applied recursively to provide extremely fast codebook training and vector quantization [2].

The recent extension of OPQ [8] combines optimization with multi-index search and is the current state-of-the-art on a billion-scale dataset, but all optimizations are still global. We observe that OPQ performs significantly better when the underlying distribution is unimodal, while residuals are much more unimodal than the original data. Hence we independently optimize per cell to distribute centroids mostly over the underlying data, despite the constraints of a product quantizer. In particular, we make the following contributions:

1. Partitioning data in cells, we locally optimize one product quantizer per cell on the residual distribution.
2. We show that training is practical, since local distributions are easier to optimize via a simple OPQ variant.
3. We provide solutions for either a single or a multi-index, fitting naturally to existing search frameworks for state-of-the-art performance with little overhead.

A recent related work is [7], applying a local PCA rotation per centroid prior to VLAD aggregation. However, both our experiments and [9, 8] show that PCA without subspace allocation actually damages ANN performance.

3. Background

Vector quantization. A quantizer is a function q that maps a d-dimensional vector x ∈ R^d to vector q(x) ∈ C, where C is a finite subset of R^d, of cardinality k. Each vector c ∈ C is called a centroid, and C a codebook. Given a finite set X of data points in R^d, q induces distortion

E = Σ_{x ∈ X} ‖x − q(x)‖².  (1)

According to Lloyd's first condition, regardless of the chosen codebook, a quantizer that minimizes distortion should map vector x to its nearest centroid, or

x ↦ q(x) = arg min_{c ∈ C} ‖x − c‖,  (2)

for x ∈ R^d. Hence, an optimal quantizer should minimize distortion E as a function of codebook C alone.

Product quantization. Assuming that dimension d is a multiple of m, write any vector x ∈ R^d as a concatenation (x^1, ..., x^m) of m sub-vectors, each of dimension d/m. If C^1, ..., C^m are m sub-codebooks in subspace R^{d/m}, each of k sub-centroids, a product quantizer [12] constrains C to the Cartesian product

C = C^1 × ··· × C^m,  (3)

i.e., a codebook of k^m centroids of the form c = (c^1, ..., c^m), with each sub-centroid c^j ∈ C^j for j ∈ M = {1, ..., m}. An optimal product quantizer q should minimize distortion E (1) as a function of C, subject to C being of the form (3) [9]. In this case, for each x ∈ R^d, the nearest centroid in C is

q(x) = (q^1(x^1), ..., q^m(x^m)),  (4)

where q^j(x^j) is the nearest sub-centroid of sub-vector x^j in C^j, for j ∈ M [9]. Hence an optimal product quantizer q in d dimensions incurs m subproblems of finding m optimal sub-quantizers q^j, j ∈ M, each in d/m dimensions. We write q = (q^1, ..., q^m) in this case.

Optimized product quantization [9, 15] refers to optimizing the subspace decomposition apart from the centroids. Constraint (3) on the codebook is now relaxed to

C = {Rĉ : ĉ ∈ C^1 × ··· × C^m, R^T R = I},  (5)

where the orthogonal d × d matrix R allows for arbitrary rotation and permutation of the vector components. Hence E should be minimized as a function of C, subject to C being of the form (5). Optimization with respect to R and C^1, ..., C^m can be either joint, as in Ck-means [15] and in the non-parametric solution OPQ_NP of [9], or decoupled, as in the parametric solution OPQ_P of [9].

Exhaustive search. Given a product quantizer q = (q^1, ..., q^m), assume that each data point x ∈ X is represented by q(x) and encoded as a tuple (i^1, ..., i^m) of m sub-centroid indices (4), each in the index set K = {1, ..., k}. This PQ-encoding requires m log2 k bits per point.

Given a new query vector y, the (squared) Euclidean distance to every point x ∈ X may be approximated by

δ_q(y, x) = ‖y − q(x)‖² = Σ_{j=1}^{m} ‖y^j − q^j(x^j)‖²,  (6)

where q^j(x^j) ∈ C^j = {c^j_1, ..., c^j_k} for j ∈ M. Distances ‖y^j − c^j_i‖² are precomputed for i ∈ K and j ∈ M, so (6) amounts to only O(m) lookup and add operations. This is the asymmetric distance computation (ADC) of [12].
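To make the lookup-add mechanics of (6) concrete, here is a minimal NumPy sketch (for illustration only, not the Matlab/C++ implementation used in our experiments; the helper names train_pq, pq_encode and adc are hypothetical). It trains m sub-codebooks with k-means, encodes each point as m sub-centroid indices, and evaluates ADC via m precomputed distance tables per query.

```python
import numpy as np
from scipy.cluster.vq import kmeans2  # any k-means routine would do

def train_pq(X, m=8, k=256):
    """Train m sub-codebooks of k sub-centroids each (plain PQ, natural dimension order)."""
    d = X.shape[1]
    assert d % m == 0
    ds = d // m
    return [kmeans2(X[:, j * ds:(j + 1) * ds].astype(np.float64), k, minit='points')[0]
            for j in range(m)]

def pq_encode(codebooks, X):
    """Encode each row of X as a tuple of m sub-centroid indices (one byte each for k = 256)."""
    m, ds = len(codebooks), codebooks[0].shape[1]
    codes = np.empty((X.shape[0], m), dtype=np.uint8)
    for j, C in enumerate(codebooks):
        sub = X[:, j * ds:(j + 1) * ds]
        codes[:, j] = ((sub[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
    return codes

def adc(codebooks, codes, y):
    """Asymmetric distances (6): precompute ||y^j - c^j_i||^2 once, then lookup-add per point."""
    m, ds = len(codebooks), codebooks[0].shape[1]
    tables = np.stack([((y[j * ds:(j + 1) * ds] - C) ** 2).sum(1)
                       for j, C in enumerate(codebooks)])        # shape (m, k)
    return tables[np.arange(m), codes].sum(1)                    # one lookup per sub-quantizer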
Indexing. When quantizing point x ∈ R^d by quantizer q, its residual vector is defined as

r_q(x) = x − q(x).  (7)

Non-exhaustive search involves a coarse quantizer Q of K centroids, or cells. Each point x ∈ X is quantized to Q(x), and its residual vector r_Q(x) is quantized by a product quantizer q. For each cell, an inverted list of data points is maintained, along with the PQ-encoded residuals.

A query point y is first quantized to its w nearest cells, and approximate distances between residuals are then found according to (6) only within the corresponding w inverted lists. This is referred to as IVFADC search in [12].
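A corresponding sketch of IVFADC indexing and search, reusing the hypothetical train_pq, pq_encode and adc helpers (and the kmeans2 import) from the previous block; this is a schematic illustration, not the reference implementation of [12].

```python
import numpy as np

def build_ivfadc(X, K=1024, m=8, k=256):
    """Coarse-quantize the points and store PQ codes of their residuals (7) in K inverted lists."""
    E, assign = kmeans2(X.astype(np.float64), K, minit='points')   # coarse codebook Q
    pq = train_pq(X - E[assign], m, k)                             # one global PQ on residuals
    codes = pq_encode(pq, X - E[assign])
    lists = [np.flatnonzero(assign == i) for i in range(K)]        # inverted lists per cell
    return E, pq, codes, lists

def search_ivfadc(index, y, w=8, topk=100):
    """Visit the w nearest cells and rank their points by ADC on the query residual."""
    E, pq, codes, lists = index
    near = np.argsort(((y - E) ** 2).sum(1))[:w]                   # w nearest cells
    ids = np.concatenate([lists[i] for i in near])
    dists = np.concatenate([adc(pq, codes[lists[i]], y - E[i]) for i in near])
    order = np.argsort(dists)[:topk]
    return ids[order], dists[order]
```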
Re-ranking. Second-order residuals may be employed along with ADC or IVFADC, again PQ-encoded by m′ sub-quantizers. However, this requires full vector reconstruction, so is only used for re-ranking [13].

Multi-indexing applies the idea of PQ to the coarse quantizer used for indexing. A second-order inverted multi-index [3] comprises two subspace quantizers Q1, Q2 over R^{d/2}, each of K sub-centroids. A cell is now a pair of sub-centroids. There are K² cells, which can be structured on a 2-dimensional grid, inducing a fine partition over R^d. For each point x = (x1, x2) ∈ X, sub-vectors x1, x2 ∈ R^{d/2} are separately (and exhaustively) quantized to Q1(x1), Q2(x2), respectively. For each cell, an inverted list of data points is again maintained.

Given a query vector y = (y1, y2), the (squared) Euclidean distances of each of the sub-vectors y1, y2 to all sub-centroids of Q1, Q2 respectively are found first. The distance of y to a cell may then be found by a lookup-add operation, similarly to (6) for m = 2. Cells are traversed in increasing order of distance to y by the multi-sequence algorithm [3], a form of distance propagation on the grid, until a target number T of points is collected. Different options exist for encoding residuals and re-ranking.
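The traversal can be sketched with a binary heap over the grid of cells: starting from the pair of nearest sub-centroids, grid neighbours are pushed lazily so that cells are popped in non-decreasing lookup-add distance. This is an illustration of the idea, not the implementation of [3]; the layout of `lists` (one id list per pair of sub-centroids) is assumed.

```python
import heapq
import numpy as np

def multi_sequence(y1_dists, y2_dists, T, lists):
    """Traverse cells (i1, i2) in increasing y1_dists[i1] + y2_dists[i2] until at least
    T point ids are collected. lists[i1][i2] holds the ids stored in cell (i1, i2)."""
    o1, o2 = np.argsort(y1_dists), np.argsort(y2_dists)     # per-subspace centroid orderings
    heap = [(y1_dists[o1[0]] + y2_dists[o2[0]], 0, 0)]      # (distance, rank1, rank2)
    seen, out = {(0, 0)}, []
    while heap and len(out) < T:
        _, r1, r2 = heapq.heappop(heap)
        out.extend(lists[o1[r1]][o2[r2]])
        for n1, n2 in ((r1 + 1, r2), (r1, r2 + 1)):         # push the two grid neighbours
            if n1 < len(o1) and n2 < len(o2) and (n1, n2) not in seen:
                seen.add((n1, n2))
                heapq.heappush(heap, (y1_dists[o1[n1]] + y2_dists[o2[n2]], n1, n2))
    return out
```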
4. Locally optimized product quantization

We investigate two solutions: ordinary inverted lists, and a second-order multi-index. Section 4.1 discusses LOPQ in the former case, which simply allocates data to cells and locally optimizes a product quantizer per cell to encode residuals. Optimization per cell is discussed in section 4.2, mostly following [9, 15]; the same process is used in section 4.4, discussing LOPQ in the multi-index case.

4.1. Searching on a single index

Given a set X = {x_1, ..., x_n} of n data points in R^d, we optimize a coarse quantizer Q, with associated codebook E = {e_1, ..., e_K} of K centroids, or cells. For i ∈ K = {1, ..., K}, we construct an inverted list L_i containing the indices of points quantized to cell e_i,

L_i = {j ∈ N : Q(x_j) = e_i},  (8)

where N = {1, ..., n}, and collect their residuals in

Z_i = {x − e_i : x ∈ X, Q(x) = e_i}.  (9)

For each cell i ∈ K, we locally optimize the PQ encoding of the residuals in set Z_i, as discussed in section 4.2, yielding an orthogonal matrix R_i and a product quantizer q_i. Residuals are then locally rotated by ẑ ← R_i^T z for z ∈ Z_i and PQ-encoded as q_i(ẑ) = q_i(R_i^T z).

At query time, the query point y is soft-assigned to its w nearest cells A in E. For each cell e_i ∈ A, the residual y_i = y − e_i is individually rotated by ŷ_i ← R_i^T y_i. Asymmetric distances δ_{q_i}(ŷ_i, ẑ_p) to residuals ẑ_p for p ∈ L_i are then computed according to (6), using the underlying local product quantizer q_i. The computation is exhaustive within list L_i, but is performed in the compressed domain.
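A schematic version of these indexing and query steps follows, reusing the PQ helpers sketched in section 3 and the eigenvalue_allocation function sketched in section 4.2 below as the local optimizer. It is an illustration under simplifying assumptions: in particular, it assumes every cell receives enough residuals to train k sub-centroids, and data are handled as row vectors so that Z @ R implements ẑ = R^T z.

```python
import numpy as np

def build_lopq(X, K=1024, m=8, k=256):
    """One locally optimized product quantizer (R_i, q_i) per coarse cell."""
    E, assign = kmeans2(X.astype(np.float64), K, minit='points')
    index = []
    for i in range(K):
        ids = np.flatnonzero(assign == i)
        Z = X[ids] - E[i]                          # residuals of cell i, eq. (9)
        R = eigenvalue_allocation(Z, m)            # local rotation (section 4.2 sketch)
        pq = train_pq(Z @ R, m, k)                 # local sub-quantizers on rotated residuals
        index.append((ids, R, pq, pq_encode(pq, Z @ R)))
    return E, index

def search_lopq(E, index, y, w=8, topk=100):
    """Soft-assign y to w cells; per cell, rotate the query residual and run local ADC."""
    near = np.argsort(((y - E) ** 2).sum(1))[:w]
    ids, dists = [], []
    for i in near:
        cell_ids, R, pq, codes = index[i]
        ids.append(cell_ids)
        dists.append(adc(pq, codes, (y - E[i]) @ R))   # rotated query residual R_i^T (y - e_i)
    ids, dists = np.concatenate(ids), np.concatenate(dists)
    order = np.argsort(dists)[:topk]
    return ids[order], dists[order]
```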
Analysis. To illustrate the individual gain from the two optimized quantities, we investigate optimizing the rotation alone with fixed sub-quantizers, as well as both rotation and sub-quantizers, referred to as LOR+PQ and LOPQ, respectively. In the latter case, there is an O(K(d² + dk)) space overhead, compared e.g. to IVFADC [12]. Similarly, local rotation of the query residual imposes an O(wd²) time overhead.

4.2. Local optimization

Let Z ∈ {Z_1, ..., Z_K} be the set of residuals of data points quantized to some cell in E. Contrary to [12], we PQ-encode these residuals by locally optimizing both the space decomposition and the sub-quantizers per cell. Given m and k as parameters, this problem is expressed as minimizing distortion as a function of an orthogonal matrix R ∈ R^{d×d} and sub-codebooks C^1, ..., C^m ⊂ R^{d/m} per cell,

minimize  Σ_{z ∈ Z} min_{ĉ ∈ Ĉ} ‖z − Rĉ‖²
subject to  Ĉ = C^1 × ··· × C^m,  R^T R = I,  (10)

where |C^j| = k for j ∈ M = {1, ..., m}. Given a solution R, C^1, ..., C^m, codebook C is found by (5). For j ∈ M, sub-codebook C^j determines a sub-quantizer q^j by

x ↦ q^j(x) = arg min_{ĉ^j ∈ C^j} ‖x − ĉ^j‖  (11)

for x ∈ R^{d/m}, as in (2); collectively, the sub-quantizers determine a product quantizer q = (q^1, ..., q^m) by (4). Local optimization can then be seen as a mapping Z ↦ (R, q). Following [9, 15], there are two solutions, which we briefly describe here, focusing more on OPQ_P.

Parametric solution (OPQ_P [9]) is the outcome of assuming a d-dimensional, zero-mean normal distribution N(0, Σ) of the residual data Z and minimizing the theoretical lower distortion bound as a function of R alone [9]. That is, R is optimized independently prior to codebook optimization, which can follow by independent k-means per subspace, exactly as in PQ.

Given the d × d positive definite covariance matrix Σ, empirically measured on Z, the solution for R is found in closed form, in two steps. First, rotating the data by ẑ ← R^T z for z ∈ Z should yield a block-diagonal covariance matrix Σ̂, with the j-th diagonal block being the sub-matrix Σ̂_jj of the j-th subspace, for j ∈ M. That is, subspace distributions should be pairwise independent. This is accomplished e.g. by diagonalizing Σ as UΛU^T.

Second, determinants |Σ̂_jj| should be equal for j ∈ M, i.e., variance should be balanced across subspaces. This is achieved by eigenvalue allocation [9]. In particular, a set B of m buckets B^j is initialized with B^j = ∅, j ∈ M, each of capacity d* = d/m. Eigenvalues in Λ are then traversed in descending order, λ_1 ≥ ··· ≥ λ_d. Each eigenvalue λ_s, s = 1, ..., d, is greedily allocated to the non-full bucket B* of minimal variance, i.e., B* ← B* ∪ {s} with

B* = arg min_{B ∈ B, |B| < d*} Π_{s ∈ B} λ_s,  (12)

until all buckets are full. Then, the buckets determine a re-ordering of the dimensions: if vector b^j ∈ R^{d*} contains the elements of bucket B^j (in any order) for j ∈ M and b = (b^1, ..., b^m), then vector b is read off as a permutation π of the set {1, ..., d}. If P_π is the permutation matrix of π, then matrix U P_π^T represents a re-ordering of the eigenvectors of Σ and is the final solution for R. In other words, Z is first PCA-aligned and then dimensions are grouped into subspaces exactly as eigenvalues are allocated to buckets.
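This closed-form procedure is short enough to sketch directly (an illustration consistent with the description above, not our actual code; a sum of logarithms replaces the product in (12) for numerical stability):

```python
import numpy as np

def eigenvalue_allocation(Z, m):
    """OPQ_P rotation for residuals Z (n x d): PCA-align, then group dimensions into m
    buckets of equal size so that the products of eigenvalues are balanced, as in (12)."""
    d = Z.shape[1]
    cap = d // m                                    # bucket capacity d* = d/m
    lam, U = np.linalg.eigh(np.cov(Z, rowvar=False))
    order = np.argsort(lam)[::-1]                   # traverse eigenvalues in descending order
    lam, U = lam[order], U[:, order]
    buckets = [[] for _ in range(m)]
    log_prod = np.zeros(m)                          # log of the eigenvalue product per bucket
    for s in range(d):
        free = [j for j in range(m) if len(buckets[j]) < cap]
        b = min(free, key=lambda j: log_prod[j])    # non-full bucket of minimal variance
        buckets[b].append(s)
        log_prod[b] += np.log(max(lam[s], 1e-12))   # guard against (near-)zero eigenvalues
    perm = [s for bucket in buckets for s in bucket]
    return U[:, perm]                               # re-ordered eigenvectors of the covariance
```

Rotating the data by this R and cutting the result into m consecutive blocks of d/m dimensions reproduces the bucket grouping described above.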
Non-parametric solution (OPQ_NP [9] or Ck-means [15]) is a variant of k-means, carried out in all m subspaces in parallel, interlacing in each iteration its two traditional steps, assign and update, with steps to rotate the data and optimize R, i.e., align centroids to data. OPQ_P is much faster than OPQ_NP in practice. Because we locally optimize thousands of quantizers, OPQ_NP training is impractical, so we only use it in one small experiment in section 5.2 and otherwise focus on OPQ_P, which we refer to as I-OPQ in the sequel.

4.3. Example

To illustrate the benefit of local optimization, we experiment on our synthetic dataset SYNTH1M, containing 1M 128-dimensional data points and 10K queries, generated by taking 1000 samples from each of 1000 components of an anisotropic Gaussian mixture distribution. All methods are non-exhaustive as in section 4.1, i.e. using a coarse quantizer, inverted lists and PQ-encoded residuals; however, all optimization variants are global except for LOPQ. For fair comparison here and in section 5, I-OPQ is our own non-exhaustive adaptation of [9]. IVFADC (PQ) [12] uses the natural order of dimensions and no optimization.

Figure 2 shows results on ANN search. On this extremely multi-modal distribution, I-OPQ fails to improve over IVFADC. PCA-aligning all data and allocating dimensions in decreasing order of eigenvalues is referred to as I-PCA. This is even worse than natural order, because e.g. the largest d/m eigenvalues are allocated to a single subspace, contrary to the balancing objective of I-OPQ. Randomly permuting dimensions after global PCA-alignment, referred to as I-PCA+RP, alleviates this problem. LOPQ outperforms all methods by up to 30%.

[Figure 2. Recall@R performance on SYNTH1M; recall@R is defined in section 5.1. We use K = 1024 and w = 8 for all methods; for all product quantizers, we use m = 8 and k = 256. Curves for IVFADC, I-OPQ and I-PCA+RP coincide everywhere.]

4.4. Searching on a multi-index

The case of a second-order multi-index is less trivial, as the space overhead of locally optimizing per cell as in section 4.1 would be prohibitive. Hence, we separately optimize per cell of the two subspace quantizers and encode two sub-residuals. We call this product optimization, or Multi-LOPQ.

Product optimization. Two subspace quantizers Q1, Q2 of K centroids each are built as in [3], with associated codebooks E^j = {e^j_1, ..., e^j_K} for j = 1, 2. Each data point x = (x1, x2) ∈ X is quantized to cell Q(x) = (Q1(x1), Q2(x2)). An inverted list L_{i1 i2} is kept for each cell (e^1_{i1}, e^2_{i2}) on the grid E = E^1 × E^2, for i1, i2 ∈ K.

At the same time, Q1, Q2 are employed for residuals as well, as in Multi-D-ADC [3]. That is, for each data point x = (x1, x2) ∈ X, the residuals x^j − Q^j(x^j) for j = 1, 2 are PQ-encoded. However, because the codebook induced on R^d by Q1, Q2 is extremely fine (K² cells on the grid), locally optimizing per cell is not an option: the total space overhead e.g. would be O((d² + dk)K²). What we do instead is separately optimize per subspace: similarly to (9), let

Z^j_i = {x^j − e^j_i : x ∈ X, Q^j(x^j) = e^j_i}  (13)

contain the residuals of points x ∈ X whose j-th sub-vector is quantized to cell e^j_i, for i ∈ K and j = 1, 2. We then locally optimize each set Z^j_i as discussed in section 4.2, yielding a rotation matrix R^j_i and a product quantizer q^j_i.

Now, given a point x = (x1, x2) ∈ X quantized to cell (e^1_{i1}, e^2_{i2}) ∈ E, its sub-residuals z^j = x^j − e^j_{ij} are rotated and PQ-encoded as q^j_{ij}(ẑ^j) = q^j_{ij}((R^j_{ij})^T z^j) for j = 1, 2. That is, encoding is separately adjusted per sub-centroid i1 (resp., i2) in the first (resp., second) subspace.

Given a query y, rotations ŷ^j_{ij} = (R^j_{ij})^T (y^j − e^j_{ij}) are lazily evaluated for i_j = 1, ..., K and j = 1, 2, i.e. computed on demand during the multi-sequence traversal and stored for re-use. For each point index p fetched in cell (e^1_{i1}, e^2_{i2}) ∈ E with associated residuals ẑ^j_p for j = 1, 2, the asymmetric distance

‖ŷ^1_{i1} − q^1_{i1}(ẑ^1_p)‖² + ‖ŷ^2_{i2} − q^2_{i2}(ẑ^2_p)‖²  (14)

is computed. Points are ranked according to this distance.
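A sketch of this query-side bookkeeping (illustrative only, with an assumed data layout): local1[i] and local2[i] hold the per-cell pair (R^j_i, q^j_i), `fetched` yields the cells visited by multi-sequence together with their point ids and codes, and the adc helper from the section 3 sketch is reused.

```python
import numpy as np

def multi_lopq_rank(y, E1, E2, local1, local2, fetched):
    """Rank the points fetched by multi-sequence using distance (14).
    local1[i], local2[i] = (R, pq) optimized on Z^1_i, Z^2_i; `fetched` yields tuples
    (i1, i2, point_ids, codes1, codes2), one per visited cell."""
    half = len(y) // 2
    subq = ((y[:half], E1, local1), (y[half:], E2, local2))
    cache = {}                                       # lazily rotated query sub-residuals

    def rotated(j, i):                               # (R^j_i)^T (y^j - e^j_i), computed once
        if (j, i) not in cache:
            yj, E, local = subq[j]
            cache[(j, i)] = (yj - E[i]) @ local[i][0]
        return cache[(j, i)]

    ids, dists = [], []
    for i1, i2, pids, codes1, codes2 in fetched:
        d = adc(local1[i1][1], codes1, rotated(0, i1)) \
          + adc(local2[i2][1], codes2, rotated(1, i2))   # eq. (14)
        ids.append(pids)
        dists.append(d)
    ids, dists = np.concatenate(ids), np.concatenate(dists)
    order = np.argsort(dists)
    return ids[order], dists[order]
```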
When considering the entire space R^d, this kind of optimization is indeed local per cell, but more constrained: the rotation applied to the residuals in cell (e^1_{i1}, e^2_{i2}) ∈ E is constrained to be block-diagonal with blocks R^1_{i1}, R^2_{i2}, keeping rotations within-subspace. By contrast, OMulti-D-OADC [8] employs an arbitrary rotation matrix that is, however, fixed for all cells.

Analysis. Compared to Multi-D-ADC [3], the space overhead remains (asymptotically) the same as in section 4.1, i.e., O(K(d² + dk)). The query time overhead is O(Kd²) in the worst case, but much lower in practice.

5. Experiments

5.1. Experimental setup

Datasets. We conduct experiments on four publicly available datasets. Three of them are popular in state-of-the-art ANN methods: SIFT1M, GIST1M [12] and SIFT1B [13]¹. SIFT1M contains 1 million 128-dimensional SIFT vectors and 10K query vectors; GIST1M contains 1 million 960-dimensional GIST vectors and 1000 query vectors; SIFT1B contains 1 billion SIFT vectors and 10K queries.

Given that LOPQ is effective on multi-modal distributions, we further experiment on MNIST, apart from our synthetic dataset SYNTH1M discussed in section 4.3. MNIST contains 70K images of handwritten digits, each represented as a 784-dimensional vector of raw pixel intensities. As in [9, 8], we randomly sample 1000 vectors as queries and use the remaining as the data.

¹ http://corpus-texmex.irisa.fr/

Evaluation. As in the related literature [12, 9, 13, 3, 16, 15], we measure search performance via the recall@R measure, i.e. the proportion of queries having their nearest neighbor ranked in the first R positions. Alternatively, recall@R is the fraction of queries for which the nearest neighbor would be correctly found if we verified the R top-ranking vectors using exact Euclidean distances. Recall@1 is the most important, and is equivalent to the precision of [14].
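This measure is straightforward to compute; the sketch below assumes `ranked` holds, per query, the returned point ids in ranked order and `gt` the id of each query's true nearest neighbor.

```python
import numpy as np

def recall_at_R(ranked, gt, R):
    """Fraction of queries whose true nearest neighbor appears among the top R results."""
    return float(np.mean([g in r[:R] for r, g in zip(ranked, gt)]))

# e.g. report recall@1, recall@10, recall@100 for a batch of queries:
# print([recall_at_R(ranked, gt, R) for R in (1, 10, 100)])
```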
Re-ranking. Following [13], second-order residuals can be used for re-ranking along with the LOPQ variants, but for fair comparison we only apply it with a single index. This new variant, LOPQ+R, locally optimizes second-order sub-quantizers per cell. However, the rotation of second-order residuals is only optimized globally; otherwise there would be an additional query-time overhead on top of [13].

Settings. We always perform search in a non-exhaustive manner, either with a single or a multi-index. In all cases, we use k = 256, i.e. 8 bits per sub-quantizer. Unless otherwise stated, we use 64-bit codes produced with m = 8. On SIFT1B we also use 128-bit codes produced with m = 16, except when re-ranking, where m = m′ = 8 is used instead, as in [13]. For all multi-index methods, T refers to the target number of points fetched by multi-sequence.

Compared methods (MNIST, SIFT1M, GIST1M). On the smaller datasets we compare against methods fitting our single-index, non-exhaustive framework: in particular, IVFADC [12], our I-PCA+RP, and our non-exhaustive adaptation of OPQ [9], using either OPQ_P or OPQ_NP global optimization. These non-exhaustive variants are not only faster, but also superior. OPQ_NP is too slow to train, so it is only shown for MNIST; otherwise I-OPQ refers to OPQ_P. We do not consider transform coding [4] or ITQ [10], since they are outperformed by I-OPQ in [9].

Compared methods (SIFT1B). After some experiments on a single index comparing mainly to IVFADC and I-OPQ, we focus on using a multi-index, comparing against Multi-D-ADC [3] and its recent variant OMulti-D-OADC [8], currently the state-of-the-art. Both methods PQ-encode the residuals of the subspace quantizers. Additionally, OMulti-D-OADC uses OPQ_NP to globally optimize both the initial data prior to multi-index construction and the residuals. We also report results for IVFADC with re-ranking (IVFADC+R) [13], Ck-means [15], KLSH-ADC [17], multi-index hashing (Multi-I-Hashing) [16], and the very recent joint inverted indexing (Joint-ADC) [19].

Implementation. Results followed by a citation are reproduced from the corresponding publication. For the rest, we use our own implementations in Matlab and C++ on an 8-core machine with 64GB RAM. For k-means and exhaustive nearest neighbor assignment we use yael.

[Figure 3. Recall@R on MNIST with K = 64, found to be optimal, and w = 8. Ē = E/n: average distortion per point. Inset (method, Ē): IVFADC 70.1; I-PCA+RP 13.3; I-OPQ_P 12.6; I-OPQ_NP 11.4; LOPQ 8.13.]

[Figure 4. Recall@R on SIFT1M with K = 1024, w = 8.]

5.2. Results on MNIST, SIFT1M, GIST1M

[Figure: recall@R of IVFADC, I-PCA+RP, I-OPQ and LOPQ as a function of w.]

[Table 2. Recall@{1, 10, 100} on SIFT1B with 128-bit codes and K = 2^13 = 8192 (resp. K = 2^14) for single index (resp. multi-index). For IVFADC+R and LOPQ+R, m′ = 8, w = 64. Results for Joint-ADC and KLSH-ADC are taken from [19]. Rows including citations reproduce authors' results.]