{"title": "Locality-Sensitive Hashing for f-Divergences: Mutual Information Loss and Beyond", "book": "Advances in Neural Information Processing Systems", "page_first": 10044, "page_last": 10054, "abstract": "Computing approximate nearest neighbors in high dimensional spaces is a central problem in large-scale data mining with a wide range of applications in machine learning and data science. A popular and effective technique in computing nearest neighbors approximately is the locality-sensitive hashing (LSH) scheme. In this paper, we aim to develop LSH schemes for distance functions that measure the distance between two probability distributions, particularly for f-divergences as well as a generalization to capture mutual information loss. First, we provide a general framework to design LSH schemes for f-divergence distance functions and develop LSH schemes for the generalized Jensen-Shannon divergence and triangular discrimination in this framework. We show a two-sided approximation result for approximation of the generalized Jensen-Shannon divergence by the Hellinger distance, which may be of independent interest. Next, we show a general method of reducing the problem of designing an LSH scheme for a Kreĭn kernel (which can be expressed as the difference of two positive definite kernels) to the problem of maximum inner product search. We exemplify this method by applying it to the mutual information loss, due to its several important applications such as model compression.", "full_text": "Locality-Sensitive Hashing for f-Divergences and Kreĭn Kernels: Mutual Information Loss and Beyond

Lin Chen1,2  Hossein Esfandiari2  Thomas Fu2  Vahab S.
Mirrokni2

lin.chen@yale.edu, {esfandiari,thomasfu,mirrokni}@google.com

1Yale University  2Google Research

Abstract

Computing approximate nearest neighbors in high dimensional spaces is a central problem in large-scale data mining with a wide range of applications in machine learning and data science. A popular and effective technique in computing nearest neighbors approximately is the locality-sensitive hashing (LSH) scheme. In this paper, we aim to develop LSH schemes for distance functions that measure the distance between two probability distributions, particularly for f-divergences as well as a generalization to capture mutual information loss. First, we provide a general framework to design LSH schemes for f-divergence distance functions and develop LSH schemes for the generalized Jensen-Shannon divergence and triangular discrimination in this framework. We show a two-sided approximation result for approximation of the generalized Jensen-Shannon divergence by the Hellinger distance, which may be of independent interest. Next, we show a general method of reducing the problem of designing an LSH scheme for a Kreĭn kernel (which can be expressed as the difference of two positive definite kernels) to the problem of maximum inner product search. We exemplify this method by applying it to the mutual information loss, due to its several important applications such as model compression.

1 Introduction

A central problem in machine learning and data mining is to find the top-k most similar items to each item in a dataset. Such problems, referred to as approximate nearest neighbor problems, are especially challenging in high dimensional spaces and are an important part of a wide range of data mining tasks such as finding near-duplicate pages in a corpus of images or web pages, or clustering items in a high-dimensional metric space. 
A popular technique for solving these problems is the locality-sensitive hashing (LSH) technique [19]. In this method, items in a high-dimensional metric space are first mapped into buckets (via a hashing scheme) with the property that closer items have a higher chance of being assigned to the same bucket. LSH-based nearest neighbor methods limit their scope of search to the items that fall into the same bucket in which the target item resides.¹

Locality-sensitive hashing was first introduced and studied by [19]. They provide a family of basic locality-sensitive hash functions for the Hamming distance in a d-dimensional space and for the L1 distance in a d-dimensional Euclidean space. They also show that such a family of hash functions provides a randomized (1 + ε)-approximation algorithm for the nearest neighbor search problem with sublinear space and sublinear query time. Following [19], several families of locality-sensitive hash functions have been designed and implemented for different metrics, each serving a certain application. We summarize further results in this area in Section 1.1.

In several applications, data points can be represented as probability distributions. One example is the space of users' browsed web pages, read articles, or watched videos. In order to represent such data, one can represent each user by a distribution over the documents they read, and the documents by the topics included in those documents. 

¹We note that LSH is a popular data-independent technique for nearest neighbor search. Another category of nearest neighbor search algorithms, referred to as data-dependent techniques, are learning-to-hash methods [37], which learn a hash function that maps each item to a compact code. However, this line of work is out of the scope of this paper.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
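As a concrete illustration of the locality-sensitive property, the bit-sampling family of [19] for the Hamming distance on {0, 1}^d can be sketched in a few lines. This is our own sketch, not code from the paper: each hash function samples one coordinate uniformly at random, so two points collide with probability exactly 1 − d_H(p, q)/d, which is higher for closer points.

```python
import random

def sample_bit_hash(d, rng):
    """One hash function from the bit-sampling family of [19]:
    h(p) = p[i] for a coordinate i drawn uniformly from {0, ..., d-1}."""
    i = rng.randrange(d)
    return lambda p: p[i]

def collision_rate(p, q, trials, seed=0):
    """Empirical Pr[h(p) == h(q)]; approaches 1 - hamming(p, q) / len(p)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = sample_bit_hash(len(p), rng)
        hits += (h(p) == h(q))
    return hits / trials
```

For instance, two 10-bit vectors differing in 2 coordinates collide on roughly 80% of the sampled hash functions, matching the 1 − 2/10 prediction.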
Other examples are time series distributions, the content of documents, or images that can be represented as histograms. In particular, analysis of similarities in time series distributions or documents can be used in attack and spam detection, and analysis of user similarities can be used in recommendation systems and online advertisement.

Many of the aforementioned applications deal with huge datasets and require highly time-efficient algorithms to find similar data points. These applications motivated us to study LSH functions for distributions, especially for distance measures with information-theoretic justifications. In fact, in addition to k-nearest neighbor search, LSH functions can be used to implement very fast distributed algorithms for traditional clustering methods such as k-means [7].

Recently, Mao et al. [26] noticed the importance and lack of LSH functions for distances between distributions, especially for information-theoretic measures. They attempted to design an LSH family to capture the famous Jensen-Shannon (JS) divergence. However, instead of directly providing locality-sensitive hash functions for the Jensen-Shannon divergence, they took two steps to turn this distance function into a new distance function that is easier to hash. They first looked at a less common divergence measure, S2JSD, which is the square root of two times the JS divergence. They then defined a related distance function, S2JSD_new^approx, obtained by keeping only the linear terms in the Taylor expansion of the logarithm in the expression of S2JSD, and designed a locality-sensitive hash function for the new measure S2JSD_new^approx. While interesting, this work unfortunately provides no bound on the actual JS divergence for the LSH family designed for S2JSD_new^approx. 
Our results resolve this issue by providing LSH schemes with provable guarantees for information-theoretic distance measures including the JS divergence and its generalizations.

Mu and Yan [27] proposed an LSH algorithm for non-metric data by embedding the data into a reproducing kernel Kreĭn space. However, their method is data-dependent. Given a finite set of data points M, they compute the distance matrix D whose (i, j)-entry is the distance between data points i and j in M. The data is embedded into a reproducing kernel Kreĭn space by performing singular value decomposition on a transform of the distance matrix D, and the embedding changes if we are given another dataset.

Our Contributions. In this paper, we first study LSH schemes for f-divergences² between two probability distributions. In Proposition 1, we provide a simple reduction tool for designing LSH schemes for the family of f-divergence distance functions. This proposition is not hard to prove but might be of independent interest. We then use this tool to provide LSH schemes for two examples of f-divergence distance functions: the Jensen-Shannon divergence and triangular discrimination. Interestingly, our result holds for a generalized version of the Jensen-Shannon divergence. We apply this tool to design and analyze an LSH scheme for the generalized Jensen-Shannon (GJS) divergence through approximation by the squared Hellinger distance, and we use a similar technique to provide an LSH for triangular discrimination. Our approximation is provably lower bounded by a factor of 0.69 for the Jensen-Shannon divergence and by a factor of 0.5 for triangular discrimination. The approximation result for the generalized Jensen-Shannon divergence by the squared Hellinger distance requires a more involved analysis, and the lower and upper bounds depend on the weight parameter. 
This approximation result may be of independent interest for other machine learning tasks such as approximate information-theoretic clustering [12]. Our technique may be useful for designing LSH schemes for other f-divergences.

²The formal definition of f-divergence is presented in Section 2.2.

Next, we propose a general approach to designing an LSH for Kreĭn kernels. A Kreĭn kernel is a kernel function that can be expressed as the difference of two positive definite kernels. Our approach is built upon a reduction to the problem of maximum inner product search (MIPS) [33, 28, 41]. In contrast to our LSH schemes for f-divergence functions via approximation, our approach for Kreĭn kernels involves no approximation and is theoretically lossless. Contrary to [27], this approach is data-independent. We exemplify our approach by designing an LSH function specifically for mutual information loss, which is of particular interest to us due to its several important applications such as model compression [6, 17] and compression in discrete memoryless channels [20, 30, 42].

1.1 Other Related Work

Datar et al. [16] designed an LSH for Lp distances using p-stable distributions. Broder [10] designed MinHash for the Jaccard similarity. LSH families for other distances and similarity measures were proposed later, for example, angle similarity [11], spherical LSH on a unit hypersphere [34], rank similarity [40], and non-metric LSH [27]. Li et al. [24] demonstrated that uniform quantization outperforms the standard method of [16], which uses a random offset. Gorisse et al. [18] proposed an LSH family for the χ² distance by relating it to the L2 distance via an algebraic transform. Interested readers are referred to a more comprehensive survey of existing LSH methods [38]. Another related problem is the construction of feature maps of positive definite kernels. 
A feature map maps a data point into a (usually higher-dimensional) space such that the inner product in that space agrees with the kernel in the original space. Explicit feature maps for additive kernels were introduced in [35]. Bregman divergences are another broad class of distances that arise naturally in practical applications; the nearest neighbor search problem for Bregman divergences was studied in [3, 2, 1].

2 Preliminaries

2.1 Locality-Sensitive Hashing

Let M be the universal set of items (the database), endowed with a distance function D. Ideally, we would like to have a family of hash functions such that for any two items p and q in M that are close to each other, their hash values collide with a higher probability, and if they reside far apart, their hash values collide with a lower probability. A family of hash functions with the above property is said to be locality-sensitive. A hash value is also known as a bucket elsewhere in the literature. Using this metaphor, hash functions are imagined as sorters that place items into buckets. If hash functions are locality-sensitive, it suffices to search the bucket into which an item falls if one wants to know its nearest neighbors. The (r1, r2, p1, p2)-sensitive LSH family formulates the intuition of locality sensitivity and is formally defined in Definition 1.

Definition 1 ([19]). Let H = {h : M → U} be a family of hash functions, where U is the set of possible hash values. Assume that there is a distribution h ∼ H over the family of functions. 
This family H is called (r1, r2, p1, p2)-sensitive (with r1 < r2 and p1 > p2) for D if for all p, q ∈ M the following statements hold: (1) if D(p, q) ≤ r1, then Pr_{h∼H}[h(p) = h(q)] ≥ p1; (2) if D(p, q) > r2, then Pr_{h∼H}[h(p) = h(q)] ≤ p2.

We note that the gap between the probabilities p1 and p2 can be amplified by constructing a compound hash function that concatenates multiple functions from an LSH family. For example, one can construct g : M → U^K such that g(p) ≜ (h1(p), . . . , hK(p)) for all p ∈ M, where h1, . . . , hK are chosen from the LSH family H. This conjunctive construction reduces the number of items in one bucket. To improve the recall, an additional disjunction is introduced. To be precise, if g1, . . . , gL are L such compound hash functions, we search all of the buckets g1(p), . . . , gL(p) in order to find the nearest neighbors of p.

2.2 f-Divergence

Let P and Q be two probability measures associated with a common sample space Ω. We write P ≪ Q if P is absolutely continuous with respect to Q, which requires that for every subset A of Ω, Q(A) = 0 implies P(A) = 0.

Let f : (0, ∞) → R be a convex function that satisfies f(1) = 0. If P ≪ Q, the f-divergence from P to Q [14] is defined by

Df(P ‖ Q) = ∫_Ω f(dP/dQ) dQ,   (1)

provided that the right-hand side exists, where dP/dQ is the Radon-Nikodym derivative of P with respect to Q. In general, an f-divergence is not symmetric: Df(P ‖ Q) ≠ Df(Q ‖ P). If fKL(t) = t ln t + (1 − t), the fKL-divergence yields the KL divergence DKL(P ‖ Q) = ∫_Ω ln(dP/dQ) dP [13]. If hel(t) = (1/2)(√t − 1)², the hel-divergence is the squared Hellinger distance H²(P, Q) = (1/2) ∫ (√dP − √dQ)² [15]. If δ(t) = (t − 1)²/(t + 1), the δ-divergence is the triangular discrimination (also known as the Vincze-Le Cam distance) [22, 36]. If the sample space is finite, the triangular discrimination between P and Q is given by ∆(P ‖ Q) = Σ_{i∈Ω} (P(i) − Q(i))²/(P(i) + Q(i)).

The Jensen-Shannon (JS) divergence is a symmetrized version of the KL divergence. If P ≪ Q, Q ≪ P, and M = (P + Q)/2, the JS divergence is defined by

DJS(P ‖ Q) = (1/2) DKL(P ‖ M) + (1/2) DKL(Q ‖ M).   (2)

2.3 Mutual Information Loss and Generalized Jensen-Shannon Divergence

The mutual information loss arises naturally in many machine learning tasks, such as information-theoretic clustering [17] and categorical feature compression [6].

Suppose that two random variables X and C obey a joint distribution p(X, C). This joint distribution can model a dataset where X denotes the feature value of a data point and C denotes its label [6]. Let X and C denote the supports of X and C (i.e., the universal sets of all possible feature values and labels), respectively. Consider clustering two feature values into a new combined value. This operation can be represented by the map

π_{x,y} : X → (X \ {x, y}) ∪ {z}   such that   π_{x,y}(t) = t if t ∈ X \ {x, y}, and π_{x,y}(t) = z if t ∈ {x, y},

where x and y are the two feature values to be clustered and z ∉ X is the new combined feature value. 
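As a small illustration (our own, with hypothetical feature values), the merge map π_{x,y} can be implemented directly as a relabeling of the feature column:

```python
def make_merge_map(x, y, z):
    """Return pi_{x,y}: X -> (X \\ {x, y}) | {z}; identity except x, y -> z."""
    def pi(t):
        return z if t in (x, y) else t
    return pi

# hypothetical feature column with values "a", "b", "c"; merge "a" and "b" into "ab"
pi = make_merge_map("a", "b", z="ab")
relabeled = [pi(t) for t in ["a", "b", "c", "a"]]
```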
To make the dataset after applying the map π_{x,y} preserve as much information of the original dataset as possible, one has to select two feature values x and y such that the mutual information loss incurred by the clustering operation, mil(x, y) = I(X; C) − I(π_{x,y}(X); C), is minimized, where I(·; ·) is the mutual information between two random variables [13]. Note that the mutual information loss (MIL) divergence mil : X × X → R is symmetric in both arguments and always non-negative due to the data processing inequality [13].

Next, we motivate the generalized Jensen-Shannon divergence. If we let P and Q be the conditional distributions of C given X = x and X = y, respectively, so that P(c) = p(C = c | X = x) and Q(c) = p(C = c | X = y), the mutual information loss can be re-written as

mil(x, y) = (p(x) + p(y)) [λ DKL(P ‖ Mλ) + (1 − λ) DKL(Q ‖ Mλ)],   (3)

where λ = p(x)/(p(x) + p(y)) and Mλ = λP + (1 − λ)Q. Note that the bracketed expression in (3) is a generalized version of (2). Therefore, we define the generalized Jensen-Shannon (GJS) divergence between P and Q [25, 5, 17] by D^λ_GJS(P ‖ Q) = λ DKL(P ‖ Mλ) + (1 − λ) DKL(Q ‖ Mλ), where λ ∈ [0, 1] and Mλ = λP + (1 − λ)Q. We immediately have D^{1/2}_GJS(P ‖ Q) = DJS(P ‖ Q); that is, the JS divergence is a special case of the GJS divergence with λ = 1/2. The GJS divergence has another equivalent definition, D^λ_GJS(P ‖ Q) = H(Mλ) − λ H(P) − (1 − λ) H(Q), where H(·) denotes the Shannon entropy [13]. In contrast to the MIL divergence, the GJS divergence D^λ_GJS(· ‖ ·) is not symmetric in general, as the weight λ ∈ [0, 1] is fixed and not necessarily equal to 1/2. 
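The two definitions of the GJS divergence are easy to check numerically. The sketch below (our own, not from the paper) computes D^λ_GJS from its KL form and from the entropy form H(Mλ) − λH(P) − (1 − λ)H(Q), which agree, and recovers the JS divergence at λ = 1/2:

```python
import math

def kl(p, q):
    """KL divergence for finite distributions (assumes q[i] > 0 wherever p[i] > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def gjs(p, q, lam):
    """Generalized Jensen-Shannon divergence D^lam_GJS(P || Q), KL form."""
    m = [lam * pi + (1 - lam) * qi for pi, qi in zip(p, q)]
    return lam * kl(p, m) + (1 - lam) * kl(q, m)

def gjs_entropy_form(p, q, lam):
    """Equivalent form H(M_lam) - lam*H(P) - (1 - lam)*H(Q)."""
    m = [lam * pi + (1 - lam) * qi for pi, qi in zip(p, q)]
    return entropy(m) - lam * entropy(p) - (1 - lam) * entropy(q)
```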
We will show in Lemma 1 that the GJS divergence is an f-divergence.

2.4 Positive Definite Kernel and Kreĭn Kernel

We first review the definition of a positive definite kernel.

Definition 2 (Positive definite kernel [32]). Let X be a non-empty set. A symmetric, real-valued map k : X × X → R is a positive definite kernel on X if for every positive integer n, all real numbers a1, . . . , an ∈ R, and all x1, . . . , xn ∈ X, it holds that Σ_{i=1}^{n} Σ_{j=1}^{n} a_i a_j k(x_i, x_j) ≥ 0.

A kernel is said to be a Kreĭn kernel if it can be represented as the difference of two positive definite kernels. The formal definition is presented below.

Definition 3 (Kreĭn kernel [29]). Let X be a non-empty set. A symmetric, real-valued map k : X × X → R is a Kreĭn kernel on X if there exist two positive definite kernels k1 and k2 on X such that k(x, y) = k1(x, y) − k2(x, y) holds for all x, y ∈ X.

3 LSH Schemes for f-Divergences

We build LSH schemes for f-divergences based on approximation via another f-divergence, provided the latter admits an LSH family. If Df and Dg are two divergences associated with convex functions f and g as defined by (1), the approximation ratio of Df(P ‖ Q) to Dg(P ‖ Q) is determined by the ratio of the functions f and g, as well as the ratio of P to Q (to be precise, inf_{i∈Ω} P(i)/Q(i)) [31].

Proposition 1 (Proof in Appendix A.4). Let β0 ∈ (0, 1), L, U > 0, and let f and g be two convex functions (0, ∞) → R that obey f(1) = 0, g(1) = 0, and f(t), g(t) > 0 for every t ≠ 1. Let P be a set of probability measures on a finite sample space Ω such that for every i ∈ Ω and P, Q ∈ P, 0 < β0 ≤ P(i)/Q(i) ≤ 1/β0. Assume that for every β ∈ (β0, 1) ∪ (1, 1/β0), it holds that 0 < L ≤ f(β)/g(β) ≤ U < ∞. If H forms an (r1, r2, p1, p2)-sensitive family for the g-divergence on P, then it is also an (Lr1, Ur2, p1, p2)-sensitive family for the f-divergence on P.

Proposition 1 provides a general strategy for constructing LSH families for f-divergences. The performance of such LSH families depends on the tightness of the approximation. In Sections 3.1 and 3.2, as instances of the general strategy, we derive concrete results for the generalized Jensen-Shannon divergence and triangular discrimination, respectively.

3.1 Generalized Jensen-Shannon Divergence

First, Lemma 1 shows that the GJS divergence is indeed an instance of f-divergence.

Lemma 1 (Proof in Appendix A.3). Define m_λ(t) = λt ln t − (λt + 1 − λ) ln(λt + 1 − λ). For any λ ∈ [0, 1], m_λ(t) is convex on (0, ∞) and m_λ(1) = 0. Furthermore, the m_λ-divergence yields the GJS divergence with parameter λ.

We choose to approximate the GJS divergence via the squared Hellinger distance, which plays a central role in the construction of the hash family with the desired properties. The approximation guarantee is established in Theorem 1. We show that the ratio of D^λ_GJS(P ‖ Q) to H²(P, Q) is upper bounded by the function U(λ) and lower bounded by the function L(λ). Furthermore, Theorem 1 shows that U(λ) ≤ 1, which implies that the squared Hellinger distance is an upper bound on the GJS divergence.

Theorem 1 (Proof in Appendix A.2). We assume that the sample space Ω is finite. Let P and Q be two different distributions on Ω. 
For every λ ∈ (0, 1), we have

L(λ) H²(P, Q) ≤ D^λ_GJS(P ‖ Q) ≤ U(λ) H²(P, Q) ≤ H²(P, Q),

where L(λ) = 2 min{η(λ), η(1 − λ)}, η(λ) = −λ ln λ, and U(λ) = (2λ(1 − λ)/(1 − 2λ)) ln((1 − λ)/λ).

We show Theorem 1 by establishing a two-sided approximation result regarding m_λ and hel. This result might be of independent interest for other machine learning tasks, say, approximate information-theoretic clustering [12].

Lemma 2 (Proof in Appendix A.1). Define κ_λ(t) = m_λ(t)/hel(t). For every t > 0 and λ ∈ (0, 1), we have κ_λ(t) = κ_{1−λ}(1/t) and κ_λ(t) ∈ [L(λ), U(λ)].

We illustrate the upper and lower bound functions U(λ) and L(λ) in Appendix B. Recall that if λ = 1/2, the generalized Jensen-Shannon divergence reduces to the usual Jensen-Shannon divergence. Theorem 1 yields the approximation guarantee 0.69 < ln 2 ≤ DJS(P ‖ Q)/H²(P, Q) ≤ 1.

If the common sample space Ω with which the two distributions P and Q are associated is finite, one can identify P and Q with the |Ω|-dimensional vectors [P(i)]_{i∈Ω} and [Q(i)]_{i∈Ω}, respectively. In this case, H²(P, Q) = (1/2)‖√P − √Q‖₂², which is exactly half of the squared L2 distance between the two vectors √P ≜ [√P(i)]_{i∈Ω} and √Q ≜ [√Q(i)]_{i∈Ω}. Therefore, the squared Hellinger distance can be endowed with the L2-LSH family [16] applied to the square root of the vector. In light of this, the locality-sensitive hash function that we propose for the generalized Jensen-Shannon divergence is

h_{a,b}(P) = ⌊(a · √P + b)/r⌋,   (4)

where a ∼ N(0, I) is a |Ω|-dimensional standard normal random vector, · denotes the inner product, b is uniform at random on [0, r], and r is a positive real number.

Theorem 2 (Proof in Appendix A.5). Let c = ‖√P − √Q‖₂ and let f₂ be the probability density function of the absolute value of the standard normal distribution. The hash functions {h_{a,b}} defined in (4) form an (R, c²(U(λ)/L(λ))R, p1, p2)-sensitive family for the generalized Jensen-Shannon divergence with parameter λ, where R > 0, p1 = p(1), p2 = p(c), and p(u) = ∫₀^r (1/u) f₂(t/u)(1 − t/r) dt.

3.2 Triangular Discrimination

Recall that triangular discrimination is the δ-divergence, where δ(t) = (t − 1)²/(t + 1). As shown in the proof of Theorem 3 (Appendix A.6), the function δ can be approximated by the function hel(t) that defines the squared Hellinger distance: 1 ≤ δ(t)/hel(t) ≤ 2. The squared Hellinger distance can be sketched via L2-LSH after taking the square root, as exemplified in Section 3.1. By Proposition 1, the LSH family for the squared Hellinger distance also forms an LSH family for the triangular discrimination. Theorem 3 shows that the LSH family defined in (4) forms an (R, 2c²R, p1, p2)-sensitive family for triangular discrimination.

Theorem 3 (Proof in Appendix A.6). Let c = ‖√P − √Q‖₂ and let f₂ be the probability density function of the absolute value of the standard normal distribution. The hash functions {h_{a,b}} defined in (4) form an (R, 2c²R, p1, p2)-sensitive family for triangular discrimination, where R > 0, p1 = p(1), p2 = p(c), and p(u) = ∫₀^r (1/u) f₂(t/u)(1 − t/r) dt.

4 Kreĭn-LSH for Mutual Information Loss

In this section, we first show that the mutual information loss is a Kreĭn kernel. 
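This Kreĭn decomposition (made explicit in Theorem 4 below) can be checked numerically. The sketch below is our own illustration on a toy joint distribution; it writes the positive definite kernel as k(a, b) = a ln((a + b)/a) + b ln((a + b)/b), the sign convention under which K1 and K2 are positive definite, and compares K1 − K2 against the mutual information loss computed from its definition:

```python
import math

def mutual_information(joint):
    """I(X; C) for a joint distribution given as a dict {(c, t): p(c, t)}."""
    pc, pt = {}, {}
    for (c, t), p in joint.items():
        pc[c] = pc.get(c, 0.0) + p
        pt[t] = pt.get(t, 0.0) + p
    return sum(p * math.log(p / (pc[c] * pt[t]))
               for (c, t), p in joint.items() if p > 0)

def merge(joint, x, y, z):
    """Apply pi_{x,y}: replace feature values x and y by the combined value z."""
    out = {}
    for (c, t), p in joint.items():
        key = (c, z if t in (x, y) else t)
        out[key] = out.get(key, 0.0) + p
    return out

def k(a, b):
    """Positive definite kernel of Theorem 4 (assumes a, b > 0)."""
    return a * math.log((a + b) / a) + b * math.log((a + b) / b)

def mil_via_krein(joint, x, y):
    """K1 - K2: K1 on the marginals p(x), p(y), K2 summed over classes."""
    px = sum(p for (c, t), p in joint.items() if t == x)
    py = sum(p for (c, t), p in joint.items() if t == y)
    classes = {c for (c, t) in joint}
    k2 = sum(k(joint.get((c, x), 0.0), joint.get((c, y), 0.0)) for c in classes)
    return k(px, py) - k2
```

On a toy joint distribution over two classes and three feature values, K1 − K2 agrees with I(X; C) − I(π_{x,y}(X); C) to machine precision.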
Then we propose Kreĭn-LSH, an asymmetric LSH method [33] for mutual information loss. We remark that this method can be easily extended to other Kreĭn kernels, provided that the associated positive definite kernels admit an explicit feature map.

4.1 Mutual Information Loss is a Kreĭn Kernel

Recall that in Section 2.3 we assume a joint distribution p(X, C) whose support is X × C. Let x, y ∈ X be represented by x = [p(c, x) : c ∈ C] ∈ [0, 1]^{|C|} and y = [p(c, y) : c ∈ C] ∈ [0, 1]^{|C|}, respectively. We consider the mutual information loss of merging x and y, which is given by I(X; C) − I(π_{x,y}(X); C).

Theorem 4 (Proof in Appendix A.8). The mutual information loss mil(x, y) is a Kreĭn kernel on [0, 1]^{|C|}. In other words, there exist two positive definite kernels K1 and K2 on [0, 1]^{|C|} such that mil(x, y) = K1(x, y) − K2(x, y). To be explicit, we set K1(x, y) = k(Σ_{c∈C} p(c, x), Σ_{c∈C} p(c, y)) and K2(x, y) = Σ_{c∈C} k(p(c, x), p(c, y)), where k(a, b) = a ln((a + b)/a) + b ln((a + b)/b).

To prove Theorem 4 and construct explicit feature maps for K1 and K2, we need the following lemma.

Lemma 3 (Proof in Appendix A.7). The kernel k is a positive definite kernel on [0, 1]. Moreover, it is endowed with the following explicit feature map x ↦ Φ_w(x) such that k(x, y) = ∫_R Φ_w(x)* Φ_w(y) dw, where Φ_w(x) ≜ e^{−iw ln(x)} √(2x sech(πw)/(1 + 4w²)) and Φ_w(x)* denotes the complex conjugate of Φ_w(x).

The map Φ(x) : w ↦ Φ_w(x) is called the feature map of x. The integral ∫_R Φ_w(x)* Φ_w(y) dw is also denoted by the Hermitian inner product ⟨Φ(x), Φ(y)⟩.

4.2 Kreĭn-LSH for Mutual Information Loss

Now we are ready to present an asymmetric LSH scheme [33] for mutual information loss. This method can be easily extended to other Kreĭn kernels, provided that the associated positive definite kernels admit an explicit feature map. In fact, we reduce the problem of designing the LSH for a Kreĭn kernel to the problem of designing the LSH for maximum inner product search (MIPS) [33, 28, 41]. We call this general reduction Kreĭn-LSH.

4.2.1 Reduction to Maximum Inner Product Search

Our reduction is based on the following observation. Suppose that K is a Kreĭn kernel on X such that K = K1 − K2, where K1 and K2 are positive definite kernels on X. Assume that K1 and K2 admit feature maps Φ1 and Φ2 such that K1(x, y) = ⟨Φ1(x), Φ1(y)⟩ and K2(x, y) = ⟨Φ2(x), Φ2(y)⟩. Then the Kreĭn kernel K can also be represented as an inner product

K(x, y) = ⟨Φ1(x) ⊕ Φ2(x), Φ1(y) ⊕ −Φ2(y)⟩,   (5)

where ⊕ denotes the direct sum. If we define the pair of transforms T1(x) ≜ Φ1(x) ⊕ Φ2(x) and T2(x) ≜ Φ1(x) ⊕ −Φ2(x), then we have K(x, y) = ⟨T1(x), T2(y)⟩. We call this pair the left and right Kreĭn transforms.

Algorithm 1 Kreĭn-LSH
Input: Discretization parameters J ∈ N and ∆ > 0.
Output: The left and right Kreĭn transforms η1 and η2.
1: w_j ← (j − 1/2)∆ for j = 1, . . . , J
2: Construct the atomic transform
τ(x, w, j) ≜ [cos(w ln(x)) √(2x ∫_{(j−1)∆}^{j∆} ρ(w′) dw′),  sin(w ln(x)) √(2x ∫_{(j−1)∆}^{j∆} ρ(w′) dw′)].
3: Construct the left and right basic transforms
η1(x) ≜ ⊕_{j=1}^{J} τ(p(x), w_j, j) ⊕ ⊕_{j=1}^{J} ⊕_{c∈C} τ(p(c, x), w_j, j),
η2(x) ≜ ⊕_{j=1}^{J} τ(p(x), w_j, j) ⊕ ⊕_{j=1}^{J} ⊕_{c∈C} (−τ(p(c, x), w_j, j)).
4: Construct the left and right Kreĭn transforms
T1(x, M) ≜ [η1(x), √(M − ‖η1(x)‖₂²), 0],  T2(y, M) ≜ [η2(y), 0, √(M − ‖η2(y)‖₂²)],
where M is a constant such that M ≥ ‖η1(x)‖₂² (note that ‖η1(x)‖₂ = ‖η2(x)‖₂).
5: Sample a ∼ N(0, I) and construct the hash function h(x; M) ≜ sign(aᵀ T(x, M)), where T is either the left or the right transform.

We exemplify this technique by applying it to the MIL divergence. For ease of exposition, we define ρ(w) ≜ 2 sech(πw)/(1 + 4w²). The proposed approach, Kreĭn-LSH, is presented in Algorithm 1.

To make the intuition of (5) applicable in a practical implementation, we have to truncate and discretize the integral k(x, y) = ∫_R Φ_w(x)* Φ_w(y) dw. First we analyze the truncation; the analysis is similar to Lemma 10 of [4].

Lemma 4 (Truncation error bound, proof in Appendix A.9). If t > 0 and x, y ∈ [0, 1], the truncation error can be bounded as follows: |k(x, y) − ∫_{−t}^{t} Φ_w(x)* Φ_w(y) dw| ≤ 4e^{−t}.

To discretize the finite integral ∫_{−t}^{t} Φ_w(x)* Φ_w(y) dw, we divide the interval into 2J sub-intervals of length ∆. 
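To see the truncation-and-discretization machinery concretely, the sketch below (ours; it approximates the ∫ρ masses with a simple midpoint rule) builds the 2J-dimensional features of Algorithm 1 for scalar inputs and checks that their inner product recovers the kernel k of Lemma 3, written here as k(x, y) = x ln((x + y)/x) + y ln((x + y)/y), up to the error bounds of Lemmas 4 and 5:

```python
import math

def rho(w):
    # spectral density from Lemma 3: rho(w) = 2*sech(pi*w) / (1 + 4*w^2)
    return 2.0 / (math.cosh(math.pi * w) * (1.0 + 4.0 * w * w))

def rho_mass(lo, hi, steps=64):
    # midpoint rule for the mass of rho over [lo, hi]
    h = (hi - lo) / steps
    return h * sum(rho(lo + (s + 0.5) * h) for s in range(steps))

def features(x, J, delta):
    """Truncated/discretized feature map of Algorithm 1 for a scalar x in (0, 1]:
    2J real coordinates, one (cos, sin) pair per sub-interval [(j-1)delta, j*delta]."""
    out = []
    for j in range(1, J + 1):
        wj = (j - 0.5) * delta
        amp = math.sqrt(2.0 * x * rho_mass((j - 1) * delta, j * delta))
        out.append(amp * math.cos(wj * math.log(x)))
        out.append(amp * math.sin(wj * math.log(x)))
    return out

def k_exact(x, y):
    # closed form of the positive definite kernel k
    return x * math.log((x + y) / x) + y * math.log((x + y) / y)

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))
```

With J∆ = 8 the truncation error is below 4e⁻⁸ by Lemma 4, and ∆ = 0.01 keeps the discretization error small, so the feature inner product matches k to within about 10⁻².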
The following lemma bounds the discretization error.

Lemma 5 (Discretization error bound, proof in Appendix A.10). If J is a positive integer, ∆ > 0, and w_j = (j − 1/2)∆, the discretization error is bounded as follows: |∫_{−∆J}^{∆J} Φ_w(x)* Φ_w(y) dw − ⟨⊕_{j=1}^{J} τ(x, w_j, j), ⊕_{j=1}^{J} τ(y, w_j, j)⟩| ≤ 2∆, where τ(x, w, j) = [cos(w ln(x)) √(2x ∫_{(j−1)∆}^{j∆} ρ(w′) dw′), sin(w ln(x)) √(2x ∫_{(j−1)∆}^{j∆} ρ(w′) dw′)] ∈ R².

By Lemmas 4 and 5, to guarantee that the total approximation error (including both truncation and discretization errors) is at most ε, it suffices to set ∆ = ε/(4(1 + |C|)) and J ≥ (4(1 + |C|)/ε) ln(8(1 + |C|)/ε).

4.2.2 LSH for Maximum Inner Product Search

The second stage of our proposed method is to apply LSH to the MIPS problem. As an example, in Line 5, we use the Simple-LSH introduced by [28]. Let us have a quick review of Simple-LSH. Assume that M ⊆ R^d is a finite set of vectors and that for all x ∈ M, there is a universal bound on the squared 2-norm, i.e., ‖x‖₂² ≤ M. Neyshabur and Srebro [28] assume that M = 1 without loss of generality; we allow M to be any positive real number. For two vectors x, y ∈ M, Simple-LSH performs the transforms L1(x) ≜ [x, √(M − ‖x‖₂²), 0] and L2(y) ≜ [y, 0, √(M − ‖y‖₂²)]. Note that the norm of L1(x) and L2(y) is √M and that therefore their cosine similarity equals their inner product up to scaling. In fact, Simple-LSH is a reduction from MIPS to LSH for the cosine similarity. 
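The Simple-LSH transforms are easy to verify directly. The sketch below (our own) checks that the augmented vectors both have squared norm exactly M and that their inner product equals the original inner product, so a sign-random-projection hash on the transformed vectors yields an LSH for MIPS:

```python
import math
import random

def left(x, M):
    """L1(x) = [x, sqrt(M - ||x||^2), 0]; requires ||x||^2 <= M."""
    n2 = sum(t * t for t in x)
    return list(x) + [math.sqrt(M - n2), 0.0]

def right(y, M):
    """L2(y) = [y, 0, sqrt(M - ||y||^2)]; requires ||y||^2 <= M."""
    n2 = sum(t * t for t in y)
    return list(y) + [0.0, math.sqrt(M - n2)]

def sign_hash(dim, seed):
    """Random-projection hash h(v) = sign(a . v) with a ~ N(0, I)."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return lambda v: 1 if sum(ai * vi for ai, vi in zip(a, v)) >= 0 else -1
```

The two extra coordinates never interact (one side is always zero), which is why the inner product is preserved while both norms are forced to √M.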
Then a random-projection-based LSH for the cosine similarity [11, 38],

$h(x) \triangleq \mathrm{sign}(a^\top L_i(x)), \quad a \sim \mathcal{N}(0, I), \quad i = 1, 2,$

can be used for MIPS, and thereby for the MIL divergence via our reduction.

Discussion. We have some important remarks on the practical implementation of Kreĭn-LSH. Although [28] provides a theoretical guarantee for LSH for MIPS, as noted in [41], the additional term $\sqrt{M - \|x\|_2^2}$ may dominate the 2-norm and significantly degrade the performance of LSH. To circumvent this issue, we recommend a method that partitions the dataset according to the 2-norm, e.g., the norm-ranging method [41].

5 Experiment Results

Figure 1: The empirical performance of the Hellinger approximation. (a) λ = 1/2, (b) λ = 1/3, (c) λ = 1/10.

Approximation Guarantee. In the first part, we verify the theoretical bounds derived in Theorem 1 on real data. We used latent Dirichlet allocation to extract the topic distributions of Reuters-21578, Distribution 1.0. The number of topics is set to 10. We sampled 100 documents uniformly at random and computed the GJS divergence and Hellinger distance between each pair of topic distributions. Each dot in Fig. 1 corresponds to a pair of topic distributions: the horizontal axis denotes the Hellinger distance while the vertical axis denotes the GJS divergence. We chose different parameter values (λ = 1/2, 1/3, 1/10) for the GJS divergence. From the three subfigures, we observe that both the upper and lower bounds are tight for the data.

Figure 2: Precision vs. speed-up factor for different λ's. (a) Fashion MNIST, (b) MNIST, (c) CIFAR-10.

Nearest Neighbor Search.
In the second part, we apply the proposed LSH scheme for the GJS divergence to the nearest neighbor search problem on Fashion MNIST [39], MNIST [23], and CIFAR-10 [21]. Each image in these datasets is flattened into a vector and L1-normalized, thereby summing to 1. As described in Section 2.1, a concatenation of hash functions is used. We denote the number of concatenated hash functions by K and the number of compound hash functions by L. In the first set of experiments, we set K = 3 and vary L from 20 to 40. We measure the execution time of the LSH-based k-nearest neighbor search and of the exact (brute-force) algorithm, where k is set to 20. Both algorithms were run on a 2.2 GHz Intel Core i7 processor. The speed-up factor is the ratio of the execution time of the exact algorithm to that of the LSH-based method. The quality of the result returned by the LSH-based method is quantified by its precision, i.e., the fraction of correct nearest neighbors among the retrieved items. We remark that precision and recall are equal in our case since both algorithms return k items. We also vary the parameter of the GJS divergence, choosing λ from {1/2, 1/3, 1/10}. The results are illustrated in Figs. 2a to 2c. We observe a trade-off between the quality of the output (precision) and computational efficiency (speed-up factor). The performance appears to be robust to the parameter of the GJS divergence. In the second set of experiments, we fix the parameter of the GJS divergence to 1/2, i.e., the JS divergence is used. The number of concatenated hash functions K ranges from 3 to 5 or from 4 to 6. The results are presented in Appendix C.
In addition to the aforementioned quality-efficiency trade-off, we observe that a larger K results in a more efficient algorithm at the same target precision.

6 Conclusion

In this paper, we propose a general strategy for designing an LSH family for f-divergences. We exemplify this strategy by developing LSH schemes for the generalized Jensen-Shannon divergence and triangular discrimination in this framework; both are endowed with an LSH family via the Hellinger approximation. In particular, we show a two-sided approximation of the generalized Jensen-Shannon divergence by the Hellinger distance, which may be of independent interest. Next, we propose a general approach to designing an LSH scheme for Kreĭn kernels via a reduction to the problem of maximum inner product search. In contrast to our strategy for f-divergences, this approach involves no approximation and is theoretically lossless. We exemplify this approach by applying it to the mutual information loss.

Acknowledgments

LC was supported by the Google PhD Fellowship.

References

[1] Ahmed Abdelkader, Sunil Arya, Guilherme D da Fonseca, and David M Mount. "Approximate nearest neighbor searching with non-Euclidean and weighted distances". In: SODA. SIAM. 2019, pp. 355–372.

[2] Amirali Abdullah and Suresh Venkatasubramanian. "A directed isoperimetric inequality with application to Bregman near neighbor lower bounds". In: STOC. ACM. 2015, pp. 509–518.

[3] Amirali Abdullah, John Moeller, and Suresh Venkatasubramanian. "Approximate Bregman near neighbors in sublinear time: Beyond the triangle inequality". In: SoCG. ACM. 2012, pp. 31–40.

[4] Amirali Abdullah, Ravi Kumar, Andrew McGregor, Sergei Vassilvitskii, and Suresh Venkatasubramanian. "Sketching, Embedding and Dimensionality Reduction in Information Theoretic Spaces". In: AISTATS. 2016, pp.
948–956.

[5] Syed Mumtaz Ali and Samuel D Silvey. "A general class of coefficients of divergence of one distribution from another". In: Journal of the Royal Statistical Society. Series B (Methodological) (1966), pp. 131–142.

[6] MohammadHossein Bateni, Lin Chen, Hossein Esfandiari, Thomas Fu, Vahab S Mirrokni, and Afshin Rostamizadeh. "Categorical Feature Compression via Submodular Optimization". In: ICML. 2019.

[7] Aditya Bhaskara and Maheshakya Wijewardena. "Distributed Clustering via LSH Based Data Partitioning". In: ICML. 2018, pp. 569–578.

[8] Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.

[9] S. Bochner, M. Tenenbaum, and H. Pollard. Lectures on Fourier Integrals. Annals of Mathematics Studies. Princeton University Press, 1959.

[10] Andrei Z Broder. "On the resemblance and containment of documents". In: SEQUENCES. IEEE. 1997, pp. 21–29.

[11] Moses S Charikar. "Similarity estimation techniques from rounding algorithms". In: STOC. ACM. 2002, pp. 380–388.

[12] Kamalika Chaudhuri and Andrew McGregor. "Finding Metric Structure in Information Theoretic Clustering." In: COLT. Vol. 8. 2008, p. 10.

[13] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.

[14] Imre Csiszár. "Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten". In: Publ. Math. Inst. Hungar. Acad. 8 (1963), pp. 95–108.

[15] Constantinos Daskalakis and Qinxuan Pan. "Square Hellinger Subadditivity for Bayesian Networks and its Applications to Identity Testing". In: COLT. 2017, pp. 697–703.

[16] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. "Locality-sensitive hashing scheme based on p-stable distributions". In: SoCG. ACM. 2004, pp.
253–262.

[17] Inderjit S Dhillon, Subramanyam Mallela, and Rahul Kumar. "A divisive information-theoretic feature clustering algorithm for text classification". In: JMLR 3.Mar (2003), pp. 1265–1287.

[18] David Gorisse, Matthieu Cord, and Frederic Precioso. "Locality-sensitive hashing for chi2 distance". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 34.2 (2012), pp. 402–409.

[19] Piotr Indyk and Rajeev Motwani. "Approximate nearest neighbors: towards removing the curse of dimensionality". In: STOC. ACM. 1998, pp. 604–613.

[20] Assaf Kartowsky and Ido Tal. "Greedy-Merge Degrading has Optimal Power-Law". In: IEEE Transactions on Information Theory 65.2 (2018), pp. 917–934.

[21] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Tech. rep. Citeseer, 2009.

[22] Lucien Le Cam. Asymptotic methods in statistical decision theory. Springer Science & Business Media, 2012.

[23] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. "Gradient-based learning applied to document recognition". In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324.

[24] Ping Li, Michael Mitzenmacher, and Anshumali Shrivastava. "Coding for random projections". In: ICML. 2014, pp. 676–684.

[25] Jianhua Lin. "Divergence measures based on the Shannon entropy". In: IEEE Transactions on Information Theory 37.1 (1991), pp. 145–151.

[26] Xianling Mao, Bo-Si Feng, Yi-Jing Hao, Liqiang Nie, Heyan Huang, and Guihua Wen. "S2JSD-LSH: A Locality-Sensitive Hashing Schema for Probability Distributions." In: AAAI. 2017, pp. 3244–3251.

[27] Yadong Mu and Shuicheng Yan. "Non-Metric Locality-Sensitive Hashing." In: AAAI. 2010, pp. 539–544.

[28] Behnam Neyshabur and Nathan Srebro.
\u201cOn Symmetric and Asymmetric LSHs for\n\nInner Product Search\u201d. In: ICML. 2015, pp. 1926\u20131934.\n\n[29] Cheng Soon Ong, Xavier Mary, St\u00e9phane Canu, and Alexander J Smola. \u201cLearning\n\nwith non-positive kernels\u201d. In: ICML. ACM. 2004, p. 81.\n\n[31]\n\n[30] Yuta Sakai and Ken-ichi Iwata. \u201cSuboptimal quantizer design for outputs of discrete\nmemoryless channels with a \ufb01nite-input alphabet\u201d. In: ISIT. IEEE. 2014, pp. 120\u2013124.\nIgal Sason and Sergio Verd\u00fa. \u201cf-divergence Inequalities\u201d. In: IEEE Transactions on\nInformation Theory 62.11 (2016), pp. 5973\u20136006.\n\n[32] Bernhard Sch\u00f6lkopf. \u201cThe kernel trick for distances\u201d. In: NeurIPS. 2001, pp. 301\u2013307.\n[33] Anshumali Shrivastava and Ping Li. \u201cAsymmetric LSH (ALSH) for sublinear time\n\nmaximum inner product search (MIPS)\u201d. In: NeurIPS. 2014, pp. 2321\u20132329.\n\n[34] Kengo Terasawa and Yuzuru Tanaka. \u201cSpherical lsh for approximate nearest neighbor\nsearch on unit hypersphere\u201d. In: Workshop on Algorithms and Data Structures. Springer.\n2007, pp. 27\u201338.\n\n[35] Andrea Vedaldi and Andrew Zisserman. \u201cE\ufb03cient additive kernels via explicit feature\nmaps\u201d. In: IEEE transactions on pattern analysis and machine intelligence 34.3 (2012),\npp. 480\u2013492.\nIstv\u00e1n Vincze. \u201cOn the concept and measure of information contained in an observation\u201d.\nIn: Contributions to Probability. Elsevier, 1981, pp. 207\u2013214.\n\n[37] Jingdong Wang, Ting Zhang, Nicu Sebe, Heng Tao Shen, et al. \u201cA survey on learning\nto hash\u201d. In: IEEE transactions on pattern analysis and machine intelligence 40.4\n(2018), pp. 769\u2013790.\n\n[38] Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji. \u201cHashing for similarity\n\n[36]\n\nsearch: A survey\u201d. In: arXiv preprint arXiv:1408.2927 (2014).\n\n[39] Han Xiao, Kashif Rasul, and Roland Vollgraf. 
\u201cFashion-mnist: a novel image dataset\nfor benchmarking machine learning algorithms\u201d. In: arXiv preprint arXiv:1708.07747\n(2017).\n\n[40] Jay Yagnik, Dennis Strelow, David A Ross, and Ruei-sung Lin. \u201cThe power of compar-\n\native reasoning\u201d. In: ICCV. IEEE. 2011, pp. 2431\u20132438.\n\n[41] Xiao Yan, Jinfeng Li, Xinyan Dai, Hongzhi Chen, and James Cheng. \u201cNorm-Ranging\n\nLSH for Maximum Inner Product Search\u201d. In: NeurIPS. 2018, pp. 2952\u20132961.\n\n[42] Jiuyang Alan Zhang and Brian M Kurkoski. \u201cLow-complexity quantization of discrete\n\nmemoryless channels\u201d. In: ISITA. IEEE. 2016, pp. 448\u2013452.\n\n11\n\n\f", "award": [], "sourceid": 5303, "authors": [{"given_name": "Lin", "family_name": "Chen", "institution": "Yale University"}, {"given_name": "Hossein", "family_name": "Esfandiari", "institution": "Google Research"}, {"given_name": "Gang", "family_name": "Fu", "institution": "Google Research"}, {"given_name": "Vahab", "family_name": "Mirrokni", "institution": "Google Research NYC"}]}