{"title": "Fast Similarity Search via Optimal Sparse Lifting", "book": "Advances in Neural Information Processing Systems", "page_first": 176, "page_last": 184, "abstract": "Similarity search is a fundamental problem in computing science with various applications and has attracted significant research attention, especially in large-scale search with high dimensions. Motivated by the evidence in biological science, our work develops a novel approach for similarity search. Fundamentally different from existing methods that typically reduce the dimension of the data to lessen the computational complexity and speed up the search, our approach projects the data into an even higher-dimensional space while ensuring the sparsity of the data in the output space, with the objective of further improving precision and speed. Specifically, our approach has two key steps. Firstly, it computes the optimal sparse lifting for given input samples and increases the dimension of the data while approximately preserving their pairwise similarity. Secondly, it seeks the optimal lifting operator that best maps input samples to the optimal sparse lifting. Computationally, both steps are modeled as optimization problems that can be efficiently and effectively solved by the Frank-Wolfe algorithm. Simple as it is, our approach has reported significantly improved results in empirical evaluations, and exhibited its high potentials in solving practical problems.", "full_text": "Fast Similarity Search via Optimal Sparse Lifting\n\nWenye Li1,2,\u2217, Jingwei Mao1, Yin Zhang1, Shuguang Cui1,2\n1 The Chinese University of Hong Kong, Shenzhen, China\n2 Shenzhen Research Institute of Big Data, Shenzhen, China\n\n{wyli,yinzhang,shuguangcui}@cuhk.edu.cn, 216019005@link.cuhk.edu.cn\n\nAbstract\n\nSimilarity search is a fundamental problem in computing science with various\napplications and has attracted signi\ufb01cant research attention, especially in large-\nscale search with high dimensions. 
Motivated by the evidence in biological science, our work develops a novel approach for similarity search. Fundamentally different from existing methods that typically reduce the dimension of the data to lessen the computational complexity and speed up the search, our approach projects the data into an even higher-dimensional space while ensuring the sparsity of the data in the output space, with the objective of further improving precision and speed. Specifically, our approach has two key steps. Firstly, it computes the optimal sparse lifting for given input samples and increases the dimension of the data while approximately preserving their pairwise similarity. Secondly, it seeks the optimal lifting operator that best maps input samples to the optimal sparse lifting. Computationally, both steps are modeled as optimization problems that can be efficiently and effectively solved by the Frank-Wolfe algorithm. Simple as it is, our approach reports significantly improved results in empirical evaluations and exhibits high potential for solving practical problems.

1 Introduction

Similarity search refers to the problem of finding a subset of objects that are similar to a given query from a specific dataset. As a fundamental problem in computing science, it has various applications in information retrieval, pattern classification, data clustering, etc., and has attracted significant research attention in the literature [21, 9].
More specifically, and of particular research interest, recent work in similarity search focuses on large-scale, high-dimensional problems. To lessen the computational complexity, a popular approach is to first reduce the dimension of the data, and then apply nearest neighbor search or space partitioning methods on the reduced data.
To efficiently reduce the dimension of large-volume datasets, the locality sensitive hashing method is widely used [11, 3, 7], with quite successful results.
Very recently, with biological evidence from the fruit fly's olfactory circuit, it has been shown that increasing the data dimension, instead of reducing it, is also viable as a general hashing scheme. For example, the fly algorithm projects each input data sample to a higher-dimensional output space with a random sparse binary matrix. Then, after competitive learning, the algorithm returns a sparse binary vector in the output space. Compared with the locality sensitive hashing method, similarity search based on the sparse binary vectors has reported improved precision and speed in a series of empirical studies [8].
Motivated by the biological evidence and the idea of dimension expansion, our work proposes a unified framework for dimension expansion and applies it to similarity search. The framework has two key components. The optimal sparse lifting is a sparse binary vector representation of the input samples in a higher-dimensional space, such that the pairwise similarity between the samples can be roughly preserved. The sparse lifting operator is a sparse binary matrix that best maps the input samples to the optimal sparse lifting. Computationally, both components can be efficiently and effectively obtained by solving optimization problems with the Frank-Wolfe method.
To verify the effectiveness of the proposed work, we carried out a series of experiments. It was found that, for given data, our approach could produce the optimal sparse lifting and the sparse lifting operator with high quality. It reported consistently and significantly improved precision and speed in similarity search applications on various datasets.

*Corresponding author.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
And hence our work provides a solution for\npractical applications.\nThe paper is organized as follows. Section 2 reviews the related work. Section 3 introduces our\nmodel and the algorithm. Section 4 reports the empirical experiments and results, followed by the\ndiscussion and conclusion in Section 5.\n\n2 Related work\n\n2.1 Similarity search and locality sensitive hashing\n\nSimilarity search aims to \ufb01nd similar objects to a given query among potential candidate objects,\naccording to certain pairwise similarity or distance measures [5, 21]. The complexity of accurately\ndetermining the similar objects depends heavily on both the number of candidates to evaluate and\nthe dimension of the data [17]. Computing the similarities or distances seems straightforward, but\nunfortunately could often become prohibitive if the number of candidate objects is large or the\ndimension of the data is high.\nTo ensure the tractability of calculating pairwise distances for large-scale problems in high-\ndimensional spaces, approximate methods have to be sought, among which the locality sensitive\nhashing (LSH) method is routinely applied [11, 7, 3, 4]. The LSH method provides an approximate\ndistance-preserving mapping of points from the input space to the output space. The output space\nusually has a much lower dimension than the input space, so that the speed of nearest neighbors\nsearch can be signi\ufb01cantly improved.\nTo realize an LSH mapping, one common way is to compute random projections of the data samples by\nmultiplying the input vectors with a random dense matrix of various types [3, 11]. Strong theoretical\nbounds exist and guarantee that the good locality can be preserved through such random projections\n[15, 1, 2].\n\n2.2 Biological evidence of dimension expansion and the \ufb02y algorithm\n\nBiological discovery in animals\u2019 neural systems keeps motivating new studies in the design of\ncomputer algorithms [16, 8, 24]. 
Take the fruit fly's olfactory circuit as an example. It has d = 50 Olfactory Receptor Neuron (ORN) types, each of which has different sensitivity and selectivity for different odors. The ORNs are connected to 50 Projection Neurons (PNs). The distribution of firing rates across the PN types has roughly the same mean for all odors and concentrations, and therefore the dependence on the concentration disappears. The PNs are projected to d′ = 2,000 Kenyon Cells (KCs) through sparse connections. One KC receives the firing rates from about six PNs and then sums them up [6]. With the strong feedback from a single inhibitory neuron, most KCs become inactive except for the highest-firing 5%. In this way a sparse tag composed of active and inactive KCs is generated for each odor [28, 19, 23].
The fly algorithm was designed by simulating the odor detection procedure of the fruit fly, and achieved quite successful results in practice [8]. Denote by X ∈ R^{d×m} the m input samples of d-dimensional zero-centered vectors. The inputs are mapped into hashes of d′ (usually ≫ d) dimensions by multiplying X with a randomly generated sparse binary matrix W. Then a winner-take-all strategy is applied on the output: for each vector in WX, the elements with the highest k = 100 values are set to one, and all others are zeroed out. In this way, a sparse binary representation (denoted by Y ∈ R^{d′×m}) in a space with a higher dimension is obtained. In short, in contrast to the LSH method, which reduces the data dimension, the fly algorithm increases it, while ensuring the sparsity of the data in the higher-dimensional output space.

3 Models

3.1 The optimal sparse lifting framework

We are interested in the problem of seeking sparse binary output vectors for given input data samples, where the output dimension is larger or much larger than the input dimension.
We expect that the pairwise similarity relationship of the data in the input space can be kept as much as possible by the new vectors in the output space. Moreover, if the optimal output vectors are available for a small portion of the input samples, we are also interested in the problem of approximately obtaining such a representation for the other samples, but in a computationally economical way.
Mathematically, we cast the two problems in a unified optimal sparse lifting framework as follows. Let X ∈ R^{d×m} be a matrix of m input data samples in the d-dimensional space. We consider minimizing the objective function

f(W, Y) = (1/2) ‖WX − Y‖_F^2 + (α/2) ‖X^T X − Y^T Y‖_F^2,    (1)

where W ∈ R^{d′×d} and Y ∈ R^{d′×m} are subject to some constraints; in particular, both are required to be sparse. Here, the first term aims to ensure Y ≈ WX (see footnote 2), and the second term seeks to approximately preserve pairwise similarities between the input samples. In the function, α > 0 is a balance parameter.
In general, we expect d′ ≫ d. Therefore, the output Y is called the sparse lifting of the input X, and the matrix W is called the sparse lifting operator. For simplicity of discussion, the adjective "sparse" may be dropped from time to time in the sequel.
In addition to the sparsity constraint on W, we would like W to be binary with exactly c ones in each row. If we relax the binary constraint into the unit interval constraint, then W should satisfy, component-wise,

W 1_d = c 1_{d′},    0 ≤ W ≤ 1,    (2)

Similar constraints can be imposed on Y as well, for example,

Y^T 1_{d′} = k 1_m,    0 ≤ Y ≤ 1,    (3)

with the hope that each column of Y has exactly k ones.
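As a quick sanity check, the objective in Eq. (1) can be evaluated directly. The following is a minimal NumPy sketch (the function name is illustrative, not from the paper):

```python
import numpy as np

def lifting_objective(W, X, Y, alpha):
    """Value of the objective f(W, Y) in Eq. (1):
    0.5 * ||W X - Y||_F^2 + (alpha / 2) * ||X^T X - Y^T Y||_F^2.
    The first term ties Y to the lifted inputs W X; the second keeps the
    pairwise inner products of the columns of Y close to those of X."""
    fit = 0.5 * np.linalg.norm(W @ X - Y, "fro") ** 2
    sim = 0.5 * alpha * np.linalg.norm(X.T @ X - Y.T @ Y, "fro") ** 2
    return fit + sim
```

When Y exactly equals WX and reproduces the Gram matrix X^T X, both terms vanish; the balance parameter alpha trades fitting the lifted inputs against preserving pairwise similarities.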
But if the primary goal is to obtain a good W using the training dataset, fewer constraints on Y could be preferable.
Computationally, the problem formulated in Eq. (1) can be naturally solved via alternating minimization: fix W and solve for Y; then fix Y and solve for W; and repeat the process. A simplified approach that performs well in practice does just one round of alternating minimization, using the ℓ_p (0 < p < 1) pseudo-norm to promote sparsity and binarization. Denote by 𝒲 and 𝒴 the feasible regions of W and Y defined in Eq. (2) and Eq. (3) respectively. We solve

min_{Y ∈ 𝒴} (1/2) ‖X^T X − Y^T Y‖_F^2 + γ ‖Y‖_p    (4)

to get the optimal sparse lifting Y∗; then we solve

min_{W ∈ 𝒲} (1/2) ‖WX − Y∗‖_F^2 + β ‖W‖_p    (5)

to get the optimal lifting operator W∗. Here the term "optimal" is used loosely.
We call the first step of solving Eq. (4) the (sparse) lifting step, and the second step of solving Eq.
(5) the (sparse) lifting operator step.
Given the optimal lifting operator W∗, the optimal lifting of an input vector x can be estimated by y = (y_1, · · · , y_{d′}) ∈ {0, 1}^{d′} with

y_i = 1 if (W∗x)_i is among the largest k entries in W∗x, and y_i = 0 otherwise.    (6)

2 We may also consider enforcing Y ≈ µWX instead, where µ > 0 is a scaling parameter.

Algorithm 1  min_Y (1/2) ‖X^T X − Y^T Y‖_F^2 + γ ‖Y‖_p  s.t. Y ∈ 𝒴 ∩ {0, 1}^{d′×m}
1: Given X, Y^0 ∈ 𝒴, γ_0 > 0
2: Let L(Y, γ) = (1/2) ‖X^T X − Y^T Y‖_F^2 + γ ‖Y‖_p
3: for k = 0, 1, 2, · · · , K do
4:   Compute S^{k+1} := argmin_{S ∈ 𝒴} ⟨S, ∇_Y L(Y^k, γ_k)⟩
5:   Update Y^{k+1} := (1 − 2/(k+2)) Y^k + (2/(k+2)) S^{k+1}
6:   Choose γ_{k+1} ≥ γ_k
7: end for
8: return Y^{K+1}

3.2 Algorithm

A number of optimization algorithms are applicable to the two minimization problems formulated in Eq. (4) and Eq. (5). Our current study resorts to the Frank-Wolfe algorithm, an iterative first-order method for constrained optimization [10, 13]. In each iteration, the algorithm considers a linear approximation of the objective function and moves towards a minimizer of that linear function. An important advantage of the algorithm is that, for constrained optimization problems, it only needs the solution of a linear program over the feasible region in each iteration, thereby eliminating the need to project back onto the feasible region, which can often be computationally expensive. Simple as it is, the algorithm provides quite good empirical results in our applications (ref. Section 4).
Based on the Frank-Wolfe algorithm, a simple iterative solution for minimizing Eq. (4) is shown in Algorithm 1.
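For the particular constraint set of Eq. (3), the linear program in line 4 separates over the columns of S and has a closed-form solution (place a 1 at the k smallest-gradient entries of each column), so one can sketch Algorithm 1 without a general LP solver as used in the paper's experiments. The following NumPy sketch drops the nonsmooth ℓ_p penalty for simplicity; the function names are illustrative:

```python
import numpy as np

def lmo(grad, k):
    """Linear minimization over {S : 0 <= S <= 1, each column sums to k}:
    the minimum of <S, grad> puts a 1 at the k smallest-gradient entries
    of every column and 0 elsewhere."""
    S = np.zeros_like(grad)
    idx = np.argpartition(grad, k - 1, axis=0)[:k]
    np.put_along_axis(S, idx, 1.0, axis=0)
    return S

def frank_wolfe_lifting(X, d_out, k, n_iter=50):
    """Sketch of Algorithm 1 for min_Y 0.5 * ||X^T X - Y^T Y||_F^2.
    The gradient of the smooth part is -2 Y (X^T X - Y^T Y)."""
    G = X.T @ X
    # Feasible starting point: k ones per column.
    Y = lmo(np.random.default_rng(0).standard_normal((d_out, X.shape[1])), k)
    for t in range(n_iter):
        grad = -2.0 * Y @ (G - Y.T @ Y)
        S = lmo(grad, k)
        step = 2.0 / (t + 2.0)
        Y = (1.0 - step) * Y + step * S  # line 5 update
    return Y
```

Because every iterate is a convex combination of feasible points, each column of Y keeps summing to k with entries in [0, 1] throughout.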
If we do not consider the increase of the balance parameter γ in line 6, it becomes the standard Frank-Wolfe algorithm. In each iteration of the algorithm, the major computation comes from solving the linear program in line 4. Although in our study the linear program may involve a million or more variables, it can be solved very efficiently with modern optimization techniques [27]. The value of the balance parameter γ increases monotonically with each iteration (e.g. γ_{K+1} = 1.1 × γ_K), which pushes the solution for the output matrix Y towards being sparse and binary.
From the computational complexity point of view, minimizing Eq. (5) for the optimal lifting operator W∗ is much simpler than minimizing Eq. (4). The problem can be tackled in almost the same way as in Algorithm 1, so we omit the detailed discussion.

3.3 Optimal lifting vs. random lifting

The fly algorithm uses a randomly generated data transform matrix W to map the dense input X to WX in a higher-dimensional space, followed by a sparsification and binarization process. Similar to the LSH algorithm, there is a theoretical guarantee that the projection WX preserves the ℓ2 distances of input vectors in expectation [15, 8]. However, when the sparsification and binarization process is taken into consideration, no strong theoretical results are known anymore.
Although motivated by the same biological characteristics of the fly olfactory circuit, our work studies the problem from a very different viewpoint, with two key novelties. First, our work formalizes the process of the fly algorithm into a data-transform paradigm of sparse lifting.
The input vectors are lifted to sparse binary vectors in a higher-dimensional space, and the feature values are replaced by their high energy concentration locations, which are further encoded in the sparse binary representation.
A more significant novelty lies in the principle of projecting from the input space to the output space. The fly algorithm randomly generates the projection matrix W and can be regarded as a random lifting method: randomness enters when deciding the concentration locations, due to the random generation mechanism of W. At the same time, although the biological connection from Projection Neurons to Kenyon Cells is still not completely clear, very recent electron microscopy images of the animal's brain have reported evidence that the connection is not random [29]. Comparatively, our proposed framework in Section 3.1 models the projection as an optimization problem, which actually reduces such randomness. Along this optimal lifting viewpoint, many modeling and algorithmic issues could potentially arise.

Figure 1: The quality of the optimal lifting on the MNIST dataset. Left: The relative deviation from the input similarities. The optimal lifting (denoted by LIFTING) preserves the pairwise similarity even better than the ground truth (denoted by RESIZE), which was generated with industry-standard techniques. Right: Visualization of the lifting results as images. The first and the third rows are the ground truth images with 80 × 80 pixels.
The second and the fourth rows are the lifting results\nre-ordered by a permutation matrix.\n\n4 Evaluation\n\n4.1 Experimental objectives and general settings\n\nTo evaluate the effectiveness of the proposed approach, we carried out a series of experiments.\nSpeci\ufb01cally, we had an experiment to illustrate the effectiveness of the optimal sparse lifting (ref.\nSection 4.2), an experiment with the same scenario of similarity search as in [8] to demonstrate the\nempirical superiority of the proposed optimal lifting operator (ref. Section 4.3), and an experiment\nto show the running speed comparison in a query application (ref. Section 4.4). The following\nbenchmarked datasets were used in the experiments.\n\n\u2022 SIFT: SIFT descriptors of images used for similarity search (d = 128) [14].\n\u2022 GLOVE: Pre-trained word vectors based on the GloVe algorithm (d = 300) [25].\n\u2022 MNIST: 28 \u00d7 28 images of handwritten digits in 256-grayscale pixels (d = 784) [18].\n\nBesides, we also used a much larger WIKI dataset in a query application, which includes word vectors\ngenerated on the May 2017 dump of wikipedia 3 by the GloVe algorithm. There are 400, 000 vectors\nin the WIKI dataset and each vector has 500 dimensions.\nThe evaluation included the empirical comparison of our work against the \ufb02y algorithm and the LSH\nalgorithm (by random dense projection). Besides, the autoencoder algorithm [12] is also included in\nour study. An autoencoder is an arti\ufb01cial neural network used for unsupervised learning of codings.\nIt is implemented as one hidden layer connecting one input layer and one output layer. The output\nlayer has the same number of nodes as the input layer. An autoencoder is trained to reconstruct its\nown inputs. Usually the hidden layer has a much lower dimension than the input layer. 
Therefore the feature vector learned in the hidden layer can be regarded as a compressed representation of the input samples.
We implemented and tested all the algorithms on the MATLAB platform. Our approach used the IBM ILOG CPLEX Optimizer as the underlying linear program solver.

3 https://dumps.wikimedia.org/

4.2 Optimal lifting

The first experiment was carried out to evaluate the performance of the optimal lifting step. We hope to know whether the model and the matrix factorization algorithm (ref. Algorithm 1) can well preserve the pairwise similarity between the input data samples. In the experiment, we randomly chose 5,000 grayscale images (denoted by X, with each column vector X_i being an image) from the MNIST dataset as the input data, resized each image to 80 × 80 pixels using the cubic interpolation method, and then binarized each resized image into light and dark pixels only by Otsu's method [22]. These 80 × 80 binary images, generated with industry-standard techniques, were regarded as the ground truth in this experiment, denoted by a matrix G with each column G_i being a binary image vector.
With the same set of input images, we normalized each vector X_i to be of length √k_i, where k_i is the number of light pixels in G_i. After obtaining the optimal lifting (denoted by Y∗) of these images in an 80 × 80-dimensional output space by Algorithm 1, we recorded the relative deviation ‖X^T X − Y∗^T Y∗‖_F / ‖X^T X‖_F, and compared it with the deviation of the ground truth, ‖X^T X − G^T G‖_F / ‖X^T X‖_F. Obviously, a smaller deviation value indicates a higher quality of preserving pairwise similarities between input samples.
The results are shown in Fig. 1 (left).
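The relative deviation used here is a one-liner in NumPy (a minimal sketch; the function name is illustrative):

```python
import numpy as np

def relative_deviation(X, Y):
    """||X^T X - Y^T Y||_F / ||X^T X||_F: the relative deviation of
    Section 4.2. Smaller values mean the pairwise similarities of the
    columns of X are better preserved by the columns of Y."""
    G = X.T @ X
    return np.linalg.norm(G - Y.T @ Y, "fro") / np.linalg.norm(G, "fro")
```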
From the results, we can see that our algorithm produced high-quality factorization results for X^T X. The relative deviation of the optimal lifting from the input is even significantly (about 20%) smaller than that of the ground truth.
Besides 80 × 80-dimensional images, we also tested the performance of the proposed approach with dimensions of 10 × 10, 20 × 20 and 40 × 40, respectively (see footnote 4). On 40 × 40 images, the improvement of the relative deviation from the optimal lifting is very similar to that of 80 × 80. On 20 × 20 images, the optimal lifting is roughly similar to the ground truth. On 10 × 10 images, the improvement again becomes quite evident: the optimal lifting produced a relative deviation that is only half that of the ground truth. All these results verified the effectiveness of Algorithm 1, and hence the effectiveness of the optimal lifting step in keeping pairwise similarities of the data.
The results of the optimal lifting can be visualized in an intuitive way. To do this, we computed a permutation matrix P∗ by minimizing ‖P Y∗ − G‖_F^2 with respect to P by the Frank-Wolfe algorithm, and then depicted each vector in P∗Y∗ as a binary image. Part of the results are shown in Fig. 1 (right). In the figure, the first and the third rows are the 80 × 80 binary images of the ground truth, and the second and the fourth rows are the corresponding images from P∗Y∗.
From the results, we can see that the lifting results mostly keep the shape of the images and can be recognized easily by humans, while preserving the pairwise similarity with higher quality.

4.3 Similarity search

The second experiment aimed to evaluate the performance of the proposed optimal lifting framework in similarity search applications by comparing its accuracy against the fly and related algorithms. In the experiment, a subset of 10,000 samples from each dataset was used as the testing set. All samples were normalized to have zero mean. In one run, all samples were used as a query in turn. For each query, we computed its 100 nearest neighbors among all other samples in the input space as the ground truth. Then we computed its 100 nearest neighbors in the output space and compared the results with the ground truth. The ratio of common neighbors was recorded, and averaged over all samples as the precision of each run.
For our proposed approach, we randomly selected 5,000 different samples from each dataset as the training set. Sparse binary vectors (i.e. the optimal lifting) of these training samples were first generated with Algorithm 1 and then used to train the optimal lifting operator W∗.
For the fly and the LSH algorithms, 100 runs were carried out with randomly generated projection matrices. The mean average precision over the 100 runs and the standard deviation were recorded [20]. For the optimal lifting approach, only one run was executed and recorded. As a comparison, we also collected the results of the autoencoder algorithm (denoted by AUTOENC) [12], for which the hidden representation size was set equal to the hash length (i.e., the k). The autoencoder algorithm was trained with the same samples as our optimal lifting approach.
The results are depicted in Fig. 2. In all sub-figures, the horizontal axis shows different hash lengths (k = 2, 4, 8, 16, 32 respectively).
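The evaluation protocol above (the overlap ratio between nearest neighbors in the input and output spaces) can be sketched as follows; this is an illustrative NumPy version, not the paper's MATLAB code:

```python
import numpy as np

def knn_precision(X, Y, n_neighbors=100):
    """For each query column, compare its nearest neighbors in the input
    space X with those in the output space Y, and average the overlap
    ratio over all queries (the precision of Section 4.3)."""
    def knn(Z):
        # Pairwise squared Euclidean distances between columns of Z.
        sq = (Z * Z).sum(axis=0)
        D = sq[:, None] + sq[None, :] - 2.0 * (Z.T @ Z)
        np.fill_diagonal(D, np.inf)  # exclude the query itself
        return np.argpartition(D, n_neighbors - 1, axis=1)[:, :n_neighbors]
    true_nn, found_nn = knn(X), knn(Y)
    overlaps = [len(set(t) & set(f)) / n_neighbors
                for t, f in zip(true_nn, found_nn)]
    return float(np.mean(overlaps))
```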
For the fly and the optimal lifting algorithms, the output dimensions are set to d′ = 20 × k and d′ = 2,000 respectively. The vertical axis shows the one-run precisions of the optimal lifting and the autoencoder algorithms, and the mean average precisions and the standard deviations of the fly and the LSH algorithms over 100 runs. From the results it can be seen that, consistent with the results shown in [8, 26], the output vectors from the fly algorithm outperformed the vectors from the LSH algorithm in most experiments, while our optimal lifting approach reported further and significantly improved results in all experiments. The improvement on the GLOVE dataset is especially evident. All these results confirmed the benefits brought by seeking the optimal projection matrix W∗ instead of randomizing one.

4 For the 10 × 10 and 20 × 20 experiments, the dimension is actually reduced, so it cannot strictly be called "lifting". However, this does not prevent us from testing the algorithm's performance under these settings.

Figure 2: Empirical comparison of similarity search precisions on different datasets. The horizontal axis is the hash length (k). The vertical axis is the (mean) average precision on 10,000 testing samples. Error bars of fly/LSH indicate standard deviation over 100 trials. The embedding dimensions (not applicable to the autoencoder algorithm) are d′ = 20 × k in the first row and d′ = 2,000 in the second row.

The dense vectors generated by the autoencoder algorithm also improved the search precision over the vectors from the fly and the LSH algorithms in most experiments. Comparing the results of the optimal lifting with those of the autoencoder: on the SIFT and MNIST datasets, when the hash length is small (k = 2, 4, 8), the superiority of the optimal lifting is quite evident.
When the hash length increases to k = 16 and 32, the precisions of the autoencoder catch up and become quite similar to those of the optimal lifting. On the GLOVE dataset, however, the improvement of our approach remains consistently significant.

4.4 Running speed

As a practical concern, we also measured the running time of the proposed approach, including both the training time and the query time, and compared it with the other algorithms. The running time was recorded on the WIKI dataset with 400,000 word vectors in d = 500 dimensions.
The training time of our approach includes the optimization time for both matrices Y∗ and W∗. To reduce the influence of parallel execution, only one CPU core was allowed in the experiment. The results are shown in Fig. 3 (left), and compared with the training time of the autoencoder algorithm. We can see that, with 5,000 training samples and 2,000 output dimensions, our training time is around 15 minutes for different hash lengths (k), which is slower than the autoencoder algorithm on hash lengths of 2 and 4 but faster on hash lengths of 16 and 32. With 20 × k output dimensions, our approach runs an order of magnitude faster than the autoencoder algorithm on all hash lengths.
The query time was measured by searching for 100 nearest neighbors out of the 400,000 words for 10,000 query words with one CPU core. We reported the total query time on the output vectors of the LSH, autoencoder and optimal lifting algorithms respectively. As a baseline, the query time in the original input space is also shown (denoted by NO_HASH). From the results in Fig.
3 (right), we can see that the vectors from the optimal lifting approach reported significantly better speed than the others. They are orders of magnitude faster than searching in the original input space, and 4 to 9 times faster than the vectors from the LSH and the autoencoder methods.

Figure 3: Comparison of training and query time of the algorithms on the WIKI dataset with 5,000 training samples and 10,000 query samples on a single CPU core. Left: training time; right: query time. The horizontal axis is the hash length (k). The vertical axis is the time in seconds. The embedding dimension is set to d′ = 20 × k and d′ = 2,000 respectively.

Considering the benefits of improved query precision and speed, the cost of computing the optimal lifting and training the optimal lifting operator in our framework should be an acceptable overhead in practical applications.

5 Conclusion

Fundamentally different from classical approaches that seek to reduce the data dimension for analysis, our work promotes a general method for dimension expansion by a type of data transform called optimal sparse lifting.
In this transform, feature vectors of a dataset are lifted to sparse binary vectors in a higher-dimensional space, and feature values are replaced by their "high energy concentration" locations that are encoded in the sparse binary vectors. Our proof-of-concept experiments in similarity search indicate that the proposed approach can significantly outperform, in terms of accuracy, the random sparse lifting and the locality sensitive hashing methods.
Promising as it appears to be, the proposed framework still leaves many modeling and algorithmic issues to be studied. In addition, there is strong potential to extend sparse lifting transforms to other tasks in unsupervised learning and pattern recognition, in particular to clustering analysis and data classification. To deepen understanding, further work will be necessary to study and compare the proposed approach with existing methodologies.

Acknowledgments

This work was supported by Shenzhen Fundamental Research Fund (JCYJ20170306141038939, KQJSCX20170728162302784, ZDSYS201707251409055), Shenzhen Development and Reform Commission Fund, and Guangdong Introducing Innovative and Entrepreneurial Teams Fund (2017ZT07X152), China.

References
[1] D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671-687, 2003.
[2] Z. Allen-Zhu, R. Gelashvili, S. Micali, and N. Shavit. Sparse sign-consistent Johnson-Lindenstrauss matrices: Compression with neuroscience-based constraints. Proceedings of the National Academy of Sciences, 111(47):16872-16876, 2014.
[3] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In 47th Annual IEEE Symposium on Foundations of Computer Science, pages 459-468.
IEEE, 2006.

[4] A. Andoni and I. Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. In Proceedings of the 47th Annual ACM Symposium on Theory of Computing, pages 793–801. ACM, 2015.

[5] R. Baeza-Yates and B. Ribeiro-Neto. Modern information retrieval, volume 463. ACM Press, New York, 1999.

[6] S. Caron, V. Ruta, L. Abbott, and R. Axel. Random convergence of olfactory inputs in the Drosophila mushroom body. Nature, 497(7447):113, 2013.

[7] M. Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing, pages 380–388. ACM, 2002.

[8] S. Dasgupta, C. Stevens, and S. Navlakha. A neural algorithm for a fundamental computing problem. Science, 358(6364):793–796, 2017.

[9] R. Duda, P. Hart, and D. Stork. Pattern classification. John Wiley & Sons, 2012.

[10] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics (NRL), 3(1-2):95–110, 1956.

[11] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, volume 99, pages 518–529, 1999.

[12] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[13] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, pages 427–435, 2013.

[14] H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, 2011.

[15] W. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26(189-206):1, 1984.

[16] R. Jortner, S. Farivar, and G. Laurent. A simple connectivity scheme for sparse coding in an olfactory system. Journal of Neuroscience, 27(7):1659–1669, 2007.

[17] J. Kleinberg. Two algorithms for nearest-neighbor search in high dimensions. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing, pages 599–608. ACM, 1997.

[18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[19] A. Lin, A. Bygrave, A. De Calignon, T. Lee, and G. Miesenböck. Sparse, decorrelated odor coding in the mushroom body enhances learned odor discrimination. Nature Neuroscience, 17(4):559, 2014.

[20] Y. Lin, R. Jin, D. Cai, S. Yan, and X. Li. Compressed hashing. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 446–451. IEEE, 2013.

[21] C. Manning, P. Raghavan, and H. Schütze. Introduction to information retrieval. Cambridge University Press, 2008.

[22] N. Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1):62–66, 1979.

[23] M. Papadopoulou, S. Cassenaer, T. Nowotny, and G. Laurent. Normalization for sparse encoding of odors by a wide-field interneuron. Science, 332(6030):721–725, 2011.

[24] C. Pehlevan, A. Sengupta, and D. Chklovskii. Why do similarity matching objectives lead to Hebbian/anti-Hebbian networks? Neural Computation, 30(1):84–124, 2018.

[25] J. Pennington, R. Socher, and C. Manning. GloVe: Global vectors for word representation.
In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543, 2014.

[26] Jaiyam S. Efficient nearest neighbors inspired by the fruit fly brain. https://medium.com/@jaiyamsharma/.

[27] A. Schrijver. Theory of linear and integer programming. John Wiley & Sons, 1998.

[28] G. Turner, M. Bazhenov, and G. Laurent. Olfactory representations by Drosophila mushroom body neurons. Journal of Neurophysiology, 99(2):734–746, 2008.

[29] Z. Zheng, S. Lauritzen, E. Perlman, C. Robinson, et al. A complete electron microscopy volume of the brain of adult Drosophila melanogaster. Cell, 174(3):730–743, 2018.