{"title": "Non-metric Similarity Graphs for Maximum Inner Product Search", "book": "Advances in Neural Information Processing Systems", "page_first": 4721, "page_last": 4730, "abstract": "In this paper we address the problem of Maximum Inner Product Search (MIPS) that is currently the computational bottleneck in a large number of machine learning applications. \nWhile being similar to the nearest neighbor search (NNS), the MIPS problem was shown to be more challenging, as the inner product is not a proper metric function. We propose to solve the MIPS problem with the usage of similarity graphs, i.e., graphs where each vertex is connected to the vertices that are the most similar in terms of some similarity function. Originally, the framework of similarity graphs was proposed for metric spaces and in this paper we naturally extend it to the non-metric MIPS scenario. We demonstrate that, unlike existing approaches, similarity graphs do not require any data transformation to reduce MIPS to the NNS problem and should be used for the original data. Moreover, we explain why such a reduction is detrimental for similarity graphs. By an extensive comparison to the existing approaches, we show that the proposed method is a game-changer in terms of the runtime/accuracy trade-off for the MIPS problem.", "full_text": "Non-metric Similarity Graphs for\nMaximum Inner Product Search\n\nStanislav Morozov\n\nYandex,\n\nLomonosov Moscow State University\n\nstanis-morozov@yandex.ru\n\nArtem Babenko\n\nYandex,\n\nNational Research University\nHigher School of Economics\n\nartem.babenko@phystech.edu\n\nAbstract\n\nIn this paper we address the problem of Maximum Inner Product Search (MIPS)\nthat is currently the computational bottleneck in a large number of machine learning\napplications. While being similar to the nearest neighbor search (NNS), the MIPS\nproblem was shown to be more challenging, as the inner product is not a proper\nmetric function. 
We propose to solve the MIPS problem with similarity graphs, i.e., graphs where each vertex is connected to the vertices that are the most similar in terms of some similarity function. Originally, the framework of similarity graphs was proposed for metric spaces, and in this paper we naturally extend it to the non-metric MIPS scenario. We demonstrate that, unlike existing approaches, similarity graphs do not require any data transformation to reduce MIPS to the NNS problem and should be used for the original data. Moreover, we explain why such a reduction is detrimental for similarity graphs. By an extensive comparison to the existing approaches, we show that the proposed method is a game-changer in terms of the runtime/accuracy trade-off for the MIPS problem.\n\n1 Introduction\n\nThe Maximum Inner Product Search (MIPS) problem has recently received increased attention from different research communities. The machine learning community has been especially active on this subject, as MIPS arises in a number of important machine learning tasks such as efficient Bayesian inference[1, 2], memory networks training[3], dialog agents[4], and reinforcement learning[5]. The MIPS problem is formulated as follows. Given a large database of vectors $X = \\{x_i \\in \\mathbb{R}^d \\mid i = 1, \\dots, n\\}$ and a query vector $q \\in \\mathbb{R}^d$, we need to find an index $j$ such that\n\n$$\\langle x_j, q \\rangle \\geq \\langle x_i, q \\rangle = x_i^T q, \\quad i \\neq j \\qquad (1)$$\n\nIn practice we often need K > 1 vectors that provide the largest inner products, and the top-K MIPS problem is considered.\n\nFor large-scale databases the sequential scan with the O(nd) complexity is not feasible, and efficient approximate methods are required. The current studies on efficient MIPS can be roughly divided into two groups. The methods from the first group [6, 7, 8], which are probably the more popular in the machine learning community, reduce MIPS to the NNS problem. 
They typically transform the database and query vectors and then search the neighbors via traditional NNS structures, e.g., LSH[7] or partition trees[6]. The second group includes the methods that filter out the unpromising database vectors based on inner product upper bounds, like the Cauchy-Schwarz inequality[9, 10].\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nIn this work we introduce a new research direction for the MIPS problem. We propose to employ the similarity graphs framework that was recently shown to provide exceptional performance for the nearest neighbor search. In this framework the database is represented as a graph, where each vertex corresponds to a database vector. If two vertices $i$ and $j$ are connected by an edge, the corresponding database vectors $x_i$ and $x_j$ are close in terms of some metric function. The neighbor search for a query $q$ is performed via graph exploration: on each search step, the query moves from the current vertex to the adjacent vertex whose vector is the closest to the query. The search terminates when the query reaches a local minimum. To the best of our knowledge, we are the first to extend similarity graphs to the MIPS territory with a non-metric similarity function. We summarize the main contributions of this paper below:\n\n1. We provide the theoretical analysis to justify the use of similarity graphs for the inner product similarity function.\n\n2. We demonstrate both theoretically and experimentally that typical MIPS-to-NNS reductions are detrimental for similarity graphs.\n\n3. We introduce a new large-scale dataset for the MIPS algorithms evaluation to facilitate research in this direction. 
The dataset and the C++ implementation of our method are available online (see footnote 1).\n\nThe rest of the paper is organized as follows: in Section 2, we briefly review the existing MIPS methods and the similarity graphs framework. In Section 3 we advocate the usage of similarity graphs for MIPS and describe the efficient algorithm as well. In addition, we demonstrate that one should not reduce MIPS to NNS when using similarity graphs. In Section 4, we compare the proposed approach to the current state-of-the-art and demonstrate its substantial advantage over existing methods. Finally, in Section 5 we conclude the paper and summarize the results.\n\n2 Related work\n\nNow we describe several methods and ideas from the previous research that are essential for the description of our method. Hereafter we denote the database by $X = \\{x_i \\in \\mathbb{R}^d \\mid i = 1, \\dots, n\\}$ and a query vector by $q \\in \\mathbb{R}^d$.\n\n2.1 The existing approaches to the MIPS problem\n\nReduction to NNS. The first group of methods[6, 7, 8] reformulates the MIPS problem as an NNS problem. Such reformulation becomes possible via mapping the original data to a higher dimensional space. For example, [7] maps a database vector $x$ to\n\n$$\\hat{x} = (x, \\sqrt{1 - \\|x\\|^2})^T \\qquad (2)$$\n\nand a query vector $q$ is mapped to\n\n$$\\hat{q} = (q, 0)^T \\qquad (3)$$\n\nThe transformation from [7] assumes without loss of generality that $\\|x\\| \\leq 1$ for all $x \\in X$ and $\\|q\\| = 1$ for a query vector $q$. After the mapping the transformed vectors $\\hat{x}$ and $\\hat{q}$ have unit norms and\n\n$$\\|\\hat{x} - \\hat{q}\\|^2 = \\|\\hat{x}\\|^2 + \\|\\hat{q}\\|^2 - 2\\langle \\hat{x}, \\hat{q} \\rangle = -2\\langle x, q \\rangle + 2 \\qquad (4)$$\n\nso the minimization of $\\|\\hat{x} - \\hat{q}\\|$ is equivalent to the maximization of $\\langle x, q \\rangle$. 
Other MIPS-to-NNS transformations are also possible, as shown in [11] and [12], and the empirical comparison of different transformations was recently performed in [6]. After transforming the original data, the MIPS problem becomes equivalent to metric neighbor search and can be solved with standard NNS techniques, like LSH[7], Randomized Partitioning Tree[6] or clustering[8].\n\nUpper-bounding. Another family of methods uses inner product upper bounds to construct a small set of promising candidates, which are then checked exhaustively. For example, the LEMP framework[9] filters out the unpromising database vectors based on the Cauchy-Schwarz inequality. Furthermore, [9] proposes an incremental pruning technique that refines the upper bound by computing the partial inner product over the first several dimensions. The FEXIPRO method[10] goes further and performs SVD over the database vectors to make the first dimensions more meaningful. These steps typically improve the upper bounds, and the incremental pruning becomes more efficient. The very recent Greedy-MIPS method[13] uses another upper bound $\\langle q, x \\rangle = \\sum_{i=1}^{d} q_i x_i \\leq d \\max_i \\{q_i x_i\\}$ to construct the candidate set efficiently. The efficiency of upper-bound methods is confirmed by experimental comparison[9, 13], and their source code is available online.\n\nDespite a large number of existing MIPS methods, the problem is far from being solved, especially given the rapidly growing databases in today's applications. In this work we propose to solve MIPS with the similarity graphs framework that does not fall into either of the two groups above.\n\nFootnote 1: https://github.com/stanis-morozov/ip-nsw\n\n2.2 NNS via similarity graph exploration\n\nHere we briefly describe the similarity graphs that are currently used for NNS in metric spaces. For a\ndatabase X = {xi \u2208 Rd|i = 1, . . . 
, n} the similarity (or knn-)graph is a graph where each vertex corresponds to one of the database vectors $x$. The vertices $i$ and $j$ are connected by an edge if $x_j$ belongs to the set of $k$ nearest neighbors of $x_i$, $x_j \\in NN_k(x_i)$, in terms of some metric similarity function $s(x, y)$. The usage of knn-graphs for NNS was initially proposed in the seminal work[14]. The approach[14] constructs the database knn-graph and then performs the search by greedy walk on this graph. The search process starts from a random vertex, and on each step the query moves from the current vertex to the neighbor that is the closest to the query. The process terminates when the query reaches a local minimum of the distance (equivalently, a local maximum of the similarity $s$), i.e., when no neighbor is more similar to the query than the current vertex. The pseudocode of the greedy walk procedure is presented in Algorithm 1.\n\nAlgorithm 1 Greedy walk\nInput: similarity graph $G_s$, similarity function $s(x, y)$, query $q$, entry vertex $v_0$\nInitialize $v_{curr} = v_0$\nrepeat\n    for $v_{adj}$ adjacent to $v_{curr}$ in $G_s$ do\n        if $s(v_{adj}, q) > s(v_{curr}, q)$ then\n            $v_{curr} = v_{adj}$\nuntil $v_{curr}$ does not change\nreturn $v_{curr}$\n\nSince [14] gave rise to research on NNS with similarity graphs, a plethora of methods that elaborate the idea have been proposed. The current state-of-the-art graph-based NNS implementations[15, 16, 17] develop additional heuristics that increase the efficiency of both graph construction and search process. Here we describe in detail the recent Navigable Small World (NSW) approach[15], as it is shown to provide the state-of-the-art for NNS[18] and its code is available online. Our approach for MIPS will be based on the NSW algorithm, although the other graph-based NNS methods[16, 17] could also be used.\n\nAlgorithm 2 NSW graph construction\nInput: database $X$, similarity function $s(x, y)$, maximum vertex degree $M$\nInitialize graph $G_s = \\emptyset$\nfor $x$ in $X$ do\n    $S$ = {$M$ vertices from $G_s$, s.t. the corresponding vectors $y$ give the largest values of $s(x, y)$}\n    Add $x$ to the graph $G_s$ and connect it by directed edges with the vertices in $S$\nreturn $G_s$\n\nThe key to the practical success of NSW lies in the efficiency of both knn-graph construction and neighbor search. NSW constructs the knn-graph by adding vertices to the graph sequentially, one by one. On each step NSW adds the next vertex $v$, corresponding to a database vector $x$, to the current graph. $v$ is connected by directed edges to $M$ vertices, corresponding to the closest database vectors that are already added to the graph. The construction algorithm is presented in Algorithm 2. The primary parameter of NSW is the maximum vertex degree $M$, which determines the balance between the search efficiency and the probability that the search stops in a suboptimal local minimum. When searching via greedy walk, NSW maintains a priority queue of size $L$ with the knn-graph vertices whose neighbors should be visited by the search process. With L = 1 the search in NSW is equivalent to Algorithm 1, while with L > 1 it can be considered as a variant of Beam Search[19], which makes the search process less greedy. In practice, varying $L$ allows one to balance between the runtime and search accuracy in NSW.\n\nPrior work on non-metric similarity search on graphs. After the publication, we became aware of a body of previous work that explored the use of proximity graphs with general non-metric similarity functions[20, 21, 22, 23, 24]. In these works, the MIPS problem is investigated as a special case and the effectiveness of proximity graph based methods for the MIPS problem has been confirmed.\n\n3 Similarity graphs for MIPS\n\nNow we extend the similarity graphs framework to applications with a non-metric similarity function $s(x, y)$. Assume that we have a database X = {xi \u2208 Rd|i = 1, . . . 
, n} and aim to solve the problem\n\n$$\\arg\\max_{x_i \\in X} s(q, x_i), \\quad q \\in \\mathbb{R}^d \\qquad (5)$$\n\n3.1 Exact solution\n\nFirst, let us construct a graph $G_s$ such that the greedy walk procedure (Algorithm 1) provides the exact answer to the problem (5). [14] has shown that for Euclidean distance $s(x, y) = -\\|x - y\\|$, the minimal $G_s$ with this property is the Delaunay graph of $X$. Now we generalize this result for a broader range of similarity functions.\n\nDefinition. The s-Voronoi cell $R_k$, associated with the element $x_k \\in X$, is a set\n\n$$R_k = \\{x \\in \\mathbb{R}^d \\mid s(x, x_k) > s(x, x_j) \\; \\forall j \\neq k\\} \\qquad (6)$$\n\nThe diagrams of s-Voronoi cells for $s(x, y) = -\\|x - y\\|$ and $s(x, y) = \\langle x, y \\rangle$ are shown in Figure 1.\n\nFigure 1: s-Voronoi diagram examples on the plane (panels: $s(x, y) = -\\|x - y\\|$ and $s(x, y) = \\langle x, y \\rangle$).\n\nNote that in the case of inner product, the s-Voronoi cells for some points are empty. It implies that these points cannot be answers in MIPS.\n\nNow we can define the s-Delaunay graph for a similarity function $s(x, y)$.\n\nDefinition. The s-Delaunay graph for the database $X$ and the similarity function $s(x, y)$ is a graph $G_s(V, E)$ where the set of vertices $V$ corresponds to the set $X$ and two vertices $i$ and $j$ are connected by an edge if the corresponding s-Voronoi cells $R_i$ and $R_j$ are adjacent in $\\mathbb{R}^d$.\n\nTheorem 1. Suppose that the similarity function $s(x, y)$ is such that for every finite database $X$ the corresponding s-Voronoi cells are path-connected sets. Then the greedy walk (Algorithm 1) stops at the exact solution for problem (5) if the similarity graph $G_s$ contains the s-Delaunay graph as a subgraph.\n\nProof. Assume that the greedy walk with a query $q$ stops at the point $x$, i.e., $s(x, q) > s(y, q)$ for all $y \\in N(x)$, where $N(x)$ is the set of vertices that are adjacent to $x$. Suppose that there is a point $z \\notin N(x)$ such that $s(z, q) > s(x, q)$. 
It means that the point $q$ does not belong to the s-Voronoi cell $R_x$ corresponding to the point $x$. Note that if we remove all the points from $G_s$ except $\\{x\\} \\cup N(x)$, the set of points covered by $R_x$ does not change, as all adjacent s-Voronoi regions correspond to vertices from $N(x)$ and they are not removed. Hence, the query $q$ still does not belong to $R_x$. Since the s-Voronoi cells cover the whole space, the point $q$ belongs to some $R_{x'}$, $x' \\in N(x)$. This means that $s(x', q) > s(x, q)$. This contradiction proves the theorem.\n\nNow we show that $s(x, y) = \\langle x, y \\rangle$ satisfies the assumptions of Theorem 1.\n\nLemma 1. Suppose $X$ is a finite database and the similarity function $s(x, y)$ is linear; then the s-Voronoi cells are convex.\n\nProof. Consider an s-Voronoi cell $R_x$, corresponding to a point $x \\in X$. Let us take two arbitrary vectors $u$ and $v$ from the s-Voronoi cell $R_x$. It means that\n\n$$s(x, u) > s(w, u) \\quad \\forall w \\in X \\setminus \\{x\\} \\qquad (7)$$\n$$s(x, v) > s(w, v) \\quad \\forall w \\in X \\setminus \\{x\\} \\qquad (8)$$\n\nHence, due to linearity (multiply (7) by $t$ and (8) by $1 - t$ and sum),\n\n$$s(x, tu + (1 - t)v) > s(w, tu + (1 - t)v), \\quad t \\in [0, 1] \\qquad (9)$$\n\nTherefore, the vector $tu + (1 - t)v \\in R_x$ for every $t \\in [0, 1]$.\n\nCorollary 1. If the graph $G(V, E)$ contains the s-Delaunay graph for the similarity function $s(x, y) = \\langle x, y \\rangle$, then the greedy walk always gives the exact answer for MIPS.\n\nProof. Due to Lemma 1, all s-Voronoi cells for $s(x, y) = \\langle x, y \\rangle$ are convex and, therefore, path-connected.\n\n3.2 s-Delaunay graph approximation for MIPS\n\nIn practice, the computation and usage of the exact s-Delaunay graph in high-dimensional spaces are infeasible due to the exponentially growing number of edges[25]. Instead, we approximate the s-Delaunay graph as was previously proposed for the Euclidean distance case in [16, 17, 15]. 
In particular, we adopt the approximation proposed in [15] by simply extending Algorithm 2 to the inner product similarity function $s(x, y) = \\langle x, y \\rangle$. As in [15], we also restrict the vertex degree to a constant $M$, which determines the s-Delaunay graph approximation quality. We refer to the proposed MIPS method as ip-NSW. The search process in ip-NSW remains the same as in [15] except that the inner product similarity function guides the similarity graph exploration.\n\nLet us provide some intuition behind the proposed s-Delaunay graph approximation. Each vertex $x$ is connected to $M$ vertices that provide the highest inner product values $\\langle x, \\cdot \\rangle$. The heuristic geometrical argument in favor of such approximation is that for $s(x, y) = \\langle x, y \\rangle$ the s-Voronoi cells are polyhedral angles, and the \u00abdirection vectors\u00bb of adjacent s-Voronoi cells are likely to have large inner product values. While lacking a strict mathematical justification, the proposed approach performs strongly in practice, as confirmed in the experimental section.\n\n3.3 Similarity graphs after reduction to NNS\n\nThe natural question is: why should we develop an additional theory for non-metric similarity graphs? Perhaps one should just reduce the MIPS problem to NNS[6, 7, 8] and apply the state-of-the-art graph implementation for Euclidean similarity. In fact, such a solution is detrimental for the runtime-accuracy trade-off, as will be demonstrated in the experimental section. In this section, we provide an intuitive explanation of the inferior performance using the example of the transformation from [7]:\n\n$$\\hat{x} = (x, \\sqrt{1 - \\|x\\|^2})^T; \\quad \\hat{q} = (q, 0)^T = (q, \\sqrt{1 - \\|q\\|^2})^T \\qquad (10)$$\n\nassuming that $\\|x\\| \\leq 1$ for all $x \\in X$ and $\\|q\\| = 1$. Other transformations could be considered in a similar way. 
Now we construct the Euclidean similarity graph for the transformed database $\\hat{X} = \\{(x, \\sqrt{1 - \\|x\\|^2})^T \\mid x \\in X\\}$ via Algorithm 2. In terms of the original database $X$, the squared Euclidean distance between the transformed elements equals\n\n$$\\|\\hat{x} - \\hat{y}\\|^2 = -2\\langle x, y \\rangle + 2 - 2\\sqrt{1 - \\|x\\|^2}\\sqrt{1 - \\|y\\|^2} \\qquad (11)$$\n\nNote that the Euclidean similarity graph, constructed for the transformed database $\\hat{X}$, is equivalent to a graph constructed for the original $X$ with the similarity function $s(x, y) = \\langle x, y \\rangle + \\sqrt{1 - \\|x\\|^2}\\sqrt{1 - \\|y\\|^2}$, or equivalently\n\n$$s(x, y) = \\|x\\| \\|y\\| \\cos \\alpha + \\sqrt{1 - \\|x\\|^2}\\sqrt{1 - \\|y\\|^2}, \\qquad (12)$$\n\nwhere $\\alpha$ is the angle between $x$ and $y$. The first term in this sum encourages large norms, while the second term penalizes large norms. In high-dimensional spaces the typical values of $\\cos \\alpha$ tend to be small even for close vectors, which results in the dominance of the second term. Thus, when a new vertex is added to a graph, it prefers to be connected to the vertices corresponding to vectors with smaller norms. As a result, the edges in the Euclidean graph constructed for the transformed data typically lead in the direction of decreasing norm, which is counterproductive to MIPS, which prefers the vectors of larger norms. On the other hand, the non-metric similarity graph, constructed with $s(x, y) = \\langle x, y \\rangle$, is more likely to contain edges directed towards increasing norms. To verify this explanation, we measure the rate of edges that lead to vectors of larger norms for ip-NSW and the Euclidean NSW on the transformed data. 
The numbers for three datasets, presented in Table 2, fully confirm our intuition.\n\n4 Experiments\n\nIn this section we present the experimental evaluation of non-metric similarity graphs for the top-K MIPS problem. All the experiments were performed on an Intel Xeon E5-2650 machine in single-thread mode. For evaluation, we used the common Recall measure, defined as the rate of successfully found neighbors, averaged over a set of queries. We performed the experiments with K = 1 and K = 10.\n\nDatasets. We summarize the information on the benchmark datasets in Table 1. The Netflix, MovieLens and Yahoo! Music datasets are the established benchmarks for the MIPS problem. Music-100 is a new dataset that we introduce to the community (see footnote 2). This dataset was obtained by IALS-factorization[26] of the user-item ranking matrix, with dimensionality 100. The matrix contains the ratings from 3,897,789 users on one million popular songs from a proprietary music recommendation service. To the best of our knowledge, there is no publicly available dataset of such a large scale and high dimensionality. The Normal-64 dataset was generated as a sample from a standard normal distribution with the dimension 64. For all the datasets, the groundtruth neighbors were computed by sequential scan. The recall values were averaged over 10,000 randomly sampled queries.\n\nTable 1: The datasets used in the evaluation.\n\nDATASET        |X|          |Q|          DIM\nNETFLIX        17,770       480,189      200\nMOVIELENS      33,670       247,753      150\nYAHOO! MUSIC   624,961      1,000,990    50\nMUSIC-100      1,000,000    3,897,789    100\nNORMAL-64      1,048,576    20,000       64\n\n4.1 Non-metric graphs or reduction to NNS?\n\nHere we experimentally investigate the optimal way to use the similarity graphs for the MIPS problem. We argue that the straightforward solution by reduction to NNS and then using the standard Euclidean similarity graph is suboptimal. 
To confirm this claim, we compare the performance of the non-metric similarity graph (denoted by ip-NSW) to the performance of the Euclidean similarity graph combined with the transformation from [7] (denoted by NSW+reduction). The runtime-accuracy plots on three datasets are presented in Figure 2. The plots confirm the advantage of non-metric similarity graphs, especially in the high recall regime. For instance, ip-NSW reaches the recall level 0.9 five times faster on Music-100. We believe that the reason for the inferior performance of the NSW+reduction approach is the edge distribution bias, described in Section 3.3. Overall, we conclude that similarity graphs do not require any MIPS-to-NNS transformation, which makes them favorable over other similarity search frameworks. In the subsequent experiments, we evaluate only the ip-NSW approach as our main contribution.\n\nFootnote 2: https://github.com/stanis-morozov/ip-nsw\n\nTable 2: The rate of similarity graph edges that lead to vectors of larger norms for ip-NSW and NSW+reduction. This rate is much higher in the non-metric similarity graph in ip-NSW, which results in higher MIPS performance.\n\nDATASET        NSW+REDUCTION    IP-NSW\nMUSIC-100      0.349335         0.75347\nYAHOO! MUSIC   0.398541         0.92353\nNORMAL-64      0.362722         0.703605\n\nFigure 2: The performance of non-metric ip-NSW and the Euclidean NSW for transformed data on three million-scale datasets (panels: Music-100, Yahoo! Music, Normal-64). The combination of metric similarity graphs with MIPS-to-NNS reduction results in inferior performance.\n\n4.2 Comparison to the state-of-the-art\n\nAs our main experiment, we extensively compare the proposed ip-NSW method to the existing approaches. We compared the following algorithms:\n\nNaive-MKL The sequential scan implementation that uses the Intel MKL library (see footnote 3) for efficient vector-matrix multiplication.\n\nLSH+reduction[7] We used the implementation available in [13]. 
We tuned the parameter B in the range {20, 40, 80, 160} and the parameter R in the range {5, 8, 11, 14, 17, 20}.\n\nClustering+reduction[8] We used our own reimplementation with $\\sqrt{N}$ clusters of size $\\sqrt{N}$, where $N$ is the size of the database. When searching, the number of considered clusters was varied from 1 to 50.\n\nFEXIPRO[10] We used the authors\u2019 implementation of the FEXIPRO framework (see footnote 4) with the algorithm FEXIPRO-SIR and the parameters scalingValue = 127 and SIGMA = 0.8, since this was recommended in [10] as the best combination. Note that FEXIPRO is an exact method.\n\nLEMP[9] We used the authors\u2019 implementation of the LEMP framework (see footnote 5) with the algorithm LEMP-HYB-REL. We varied the parameter R from 0.1 to 0.9 with step 0.1 and $\\varepsilon$ from 0.2 to 0.6 with step 0.05 to obtain the runtime-accuracy plots.\n\nGreedy-MIPS[13] We used the authors\u2019 implementation (see footnote 6) with a budget parameter tuned for each dataset.\n\nip-NSW is the proposed algorithm based on the non-metric similarity graph, described in Section 3.2.\n\nFootnote 3: https://software.intel.com/mkl\nFootnote 4: https://github.com/stanford-futuredata/FEXIPRO-orig\n\nWhile Netflix and MovieLens are the established datasets in previous works, we do not consider them as interesting benchmarks these days. They both contain only tens of thousands of vectors, and the exact Naive-MKL is efficient enough on them. E.g. 
Naive-MKL takes only 0.56 ms on Netflix and 1.42 ms on MovieLens, which is fast enough for most applications. Thus, we perform the extensive comparison on three million-scale datasets only. Figure 3 presents the runtime-accuracy plots for the compared approaches. The timings for Naive-MKL and FEXIPRO are presented under the corresponding plots. Overall, the proposed ip-NSW method outperforms the existing approaches by a substantial margin. For example, ip-NSW reaches the 0.9 recall level ten times faster than the fastest baseline. Note that for top-10 MIPS the advantage of ip-NSW is even more pronounced on all datasets. To verify that the speedup is due to the proposed algorithm and not to implementation differences (such as libraries, cache locality, register-level optimizations and so on), we also compare the number of inner products needed to achieve certain recall levels for different methods. The plots for three datasets and top-10 MIPS are presented in Figure 4.\n\nTimings of the exact methods shown under the plots of Figure 3 (ms, as on the plot axes):\n\nSETTING   METHOD      MUSIC-100   YAHOO! MUSIC   NORMAL-64\nTOP-1     FEXIPRO     56.071      0.895          76.922\nTOP-1     NAIVE-MKL   74.239      19.551         36.377\nTOP-10    FEXIPRO     76.506      2.673          77.227\nTOP-10    NAIVE-MKL   74.760      19.649         37.129\n\nFigure 3: The runtime-recall plots on three datasets for top-1 MIPS (top) and top-10 MIPS (bottom). The timings for the exact FEXIPRO and Naive-MKL methods are presented under the corresponding plots.\n\nAdditional memory consumption. The performance advantage of the similarity graphs comes at the price of additional memory to maintain the graph structure. In our experiments we use M = 32 edges per vertex, which results in 32 \u00d7 n \u00d7 sizeof(int) bytes for edge lists. 
Note that the size of the database equals d \u00d7 n \u00d7 sizeof(float) bytes, hence for high-dimensional datasets with $d \\gg 32$ the additional memory consumption is negligible.\n\nFootnote 5: https://github.com/uma-pi1/LEMP\nFootnote 6: https://github.com/rofuyu/exp-gmips-nips17\n\nFigure 4: The number of inner product computations needed to achieve certain recall levels on three datasets for top-10 MIPS (panels: Music-100, Yahoo! Music, Normal-64).\n\n5 Conclusion\n\nIn this work, we have proposed and evaluated a new framework for inner product similarity search. We extend the framework of similarity graphs to non-metric similarity search problems and demonstrate that the practically important case of inner product can be solved effectively by these graphs. We also investigate the optimal way to use this framework for MIPS and demonstrate that the popular MIPS-to-NNS reductions are harmful to similarity graphs. The optimized implementation of the proposed method will be available upon publication to support further research in this direction.\n\nReferences\n\n[1] Stephen Mussmann and Stefano Ermon. Learning and inference via maximum inner product search. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 2587\u20132596, 2016.\n\n[2] Stephen Mussmann, Daniel Levy, and Stefano Ermon. 
Fast amortized inference and learning in log-linear models with randomly perturbed nearest neighbor search. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI 2017, Sydney, Australia, August 11-15, 2017, 2017.\n\n[3] Sarath Chandar, Sungjin Ahn, Hugo Larochelle, Pascal Vincent, Gerald Tesauro, and Yoshua Bengio. Hierarchical memory networks. CoRR, abs/1605.07427, 2016.\n\n[4] Matthew Henderson, Rami Al-Rfou, Brian Strope, Yun-Hsuan Sung, L\u00e1szl\u00f3 Luk\u00e1cs, Ruiqi Guo, Sanjiv Kumar, Balint Miklos, and Ray Kurzweil. Efficient natural language response suggestion for smart reply. CoRR, abs/1705.00652, 2017.\n\n[5] Kwang-Sung Jun, Aniruddha Bhargava, Robert D. Nowak, and Rebecca Willett. Scalable generalized linear bandits: Online computation and hashing. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 98\u2013108, 2017.\n\n[6] O. Keivani, K. Sinha, and P. Ram. Improved maximum inner product search with better theoretical guarantees. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 2927\u20132934, May 2017.\n\n[7] Behnam Neyshabur and Nathan Srebro. On symmetric and asymmetric LSHs for inner product search. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML\u201915, pages 1926\u20131934. JMLR.org, 2015.\n\n[8] Alex Auvolat and Pascal Vincent. Clustering is efficient for approximate maximum inner product search. CoRR, abs/1507.05910, 2015.\n\n[9] Christina Teflioudi and Rainer Gemulla. Exact and approximate maximum inner product search with LEMP. ACM Trans. 
Database Syst., 42(1):5:1\u20135:49, December 2016.\n\n[10] Hui Li, Tsz Nam Chan, Man Lung Yiu, and Nikos Mamoulis. Fexipro: Fast and exact inner\nproduct retrieval in recommender systems. In SIGMOD Conference, 2017.\n\n[11] Anshumali Shrivastava and Ping Li. Asymmetric LSH (ALSH) for sublinear time maximum\ninner product search (MIPS). In Advances in Neural Information Processing Systems, pages\n2321\u20132329, 2014.\n\n[12] Yoram Bachrach, Yehuda Finkelstein, Ran Gilad-Bachrach, Liran Katzir, Noam Koenigstein,\nNir Nice, and Ulrich Paquet. Speeding up the Xbox recommender system using a Euclidean\ntransformation for inner-product spaces. October 2014.\n\n[13] Hsiang-Fu Yu, Cho-Jui Hsieh, Qi Lei, and Inderjit S. Dhillon. A greedy approach for budgeted\nmaximum inner product search. In NIPS, 2017.\n\n[14] Gonzalo Navarro. Searching in metric spaces by spatial approximation. The VLDB Journal,\n11(1):28\u201346, Aug 2002.\n\n[15] Yury A. Malkov and D. A. Yashunin. Efficient and robust approximate nearest neighbor search\nusing hierarchical navigable small world graphs. CoRR, abs/1603.09320, 2016.\n\n[16] Cong Fu and Deng Cai. Efanna: An extremely fast approximate nearest neighbor search\nalgorithm based on kNN graph. CoRR, abs/1609.07228, 2016.\n\n[17] B. Harwood and T. Drummond. Fanng: Fast approximate nearest neighbour graphs. In 2016\nIEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5713\u20135722, June\n2016.\n\n[18] Wen Li, Ying Zhang, Yifang Sun, Wei Wang, Wenjie Zhang, and Xuemin Lin. 
Approximate\nnearest neighbor search on high dimensional data - experiments, analyses, and improvement\n(v1.0). CoRR, abs/1610.02455, 2016.\n\n[19] Stuart C. Shapiro. Encyclopedia of Artificial Intelligence. 1987.\n\n[20] Leonid Boytsov, David Novak, Yury Malkov, and Eric Nyberg. Off the beaten path: Let\u2019s\nreplace term-based retrieval with k-NN search. In CIKM, 2016.\n\n[21] Bilegsaikhan Naidan, Leonid Boytsov, and Eric Nyberg. Permutation search methods are\nefficient, yet faster search is possible. VLDB, 2015.\n\n[22] Leonid Boytsov. Efficient and accurate non-metric k-NN search with applications to text\nmatching. Technical report, 2017.\n\n[23] Alexander Ponomarenko, Nikita Avrelin, Bilegsaikhan Naidan, and Leonid Boytsov. Comparative\nanalysis of data structures for approximate nearest neighbor search. Data Analytics, pages\n125\u2013130, 2014.\n\n[24] Wei Dong, Moses Charikar, and Kai Li. Efficient k-nearest neighbor graph construction for\ngeneric similarity measures. In Proceedings of the 20th International Conference on World Wide\nWeb, pages 577\u2013586. ACM, 2011.\n\n[25] Jean-Daniel Boissonnat and Mariette Yvinec. Algorithmic Geometry. Cambridge University\nPress, New York, NY, USA, 1998.\n\n[26] Y. Hu, Y. Koren, and C. Volinsky. Collaborative filtering for implicit feedback datasets. In 2008\nEighth IEEE International Conference on Data Mining, pages 263\u2013272, Dec 2008.\n", "award": [], "sourceid": 2294, "authors": [{"given_name": "Stanislav", "family_name": "Morozov", "institution": "Yandex"}, {"given_name": "Artem", "family_name": "Babenko", "institution": "MIPT/Yandex"}]}