{"title": "Sign Cauchy Projections and Chi-Square Kernel", "book": "Advances in Neural Information Processing Systems", "page_first": 2571, "page_last": 2579, "abstract": "The method of Cauchy random projections is popular for computing the $l_1$ distance in high dimension. In this paper, we propose to use only the signs of the projected data and show that the probability of collision (i.e., when the two signs differ) can be accurately approximated as a function of the chi-square ($\chi^2$) similarity, which is a popular measure for nonnegative data (e.g., when features are generated from histograms, as common in text and vision applications). Our experiments confirm that this method of sign Cauchy random projections is promising for large-scale learning applications. Furthermore, we extend the idea to sign $\alpha$-stable random projections and derive a bound of the collision probability.", "full_text": "Sign Cauchy Projections and Chi-Square Kernel

Ping Li
Dept of Statistics & Biostat., Dept of Computer Science
Rutgers University
pingli@stat.rutgers.edu

Gennady Samorodnitsky
ORIE and Dept of Stat. Science
Cornell University, Ithaca, NY 14853
gs18@cornell.edu

John Hopcroft
Dept of Computer Science
Cornell University, Ithaca, NY 14853
jeh@cs.cornell.edu

Abstract

The method of stable random projections is useful for efficiently approximating the $l_\alpha$ distance ($0 < \alpha \le 2$) in high dimension and it is naturally suitable for data streams. In this paper, we propose to use only the signs of the projected data and we analyze the probability of collision (i.e., when the two signs differ). Interestingly, when $\alpha = 1$ (i.e., Cauchy random projections), we show that the probability of collision can be accurately approximated as a function of the chi-square ($\chi^2$) similarity. In text and vision applications, the $\chi^2$ similarity is a popular measure when the features are generated from histograms (which are a typical example of data streams). Experiments confirm that the proposed method is promising for large-scale learning applications. The full paper is available at arXiv:1308.1009.

There are many future research problems. For example, when $\alpha \to 0$, the collision probability is a function of the resemblance (of the binary-quantized data). This provides an effective mechanism for resemblance estimation in data streams.

1 Introduction

High-dimensional representations have become very popular in modern applications of machine learning, computer vision, and information retrieval. For example, the winner of the 2009 PASCAL image classification challenge used millions of features [29], and [1, 30] described applications with billions or even trillions of features. The use of high-dimensional data often achieves good accuracies at the cost of a significant increase in computation, storage, and energy consumption.

Consider two data vectors (e.g., two images) $u, v \in \mathbb{R}^D$. A basic task is to compute their distance or similarity.
For example, the correlation ($\rho_2$) and the $l_\alpha$ distance ($d_\alpha$) are commonly used:

$$\rho_2(u, v) = \frac{\sum_{i=1}^D u_i v_i}{\sqrt{\sum_{i=1}^D u_i^2}\sqrt{\sum_{i=1}^D v_i^2}}, \qquad d_\alpha(u, v) = \sum_{i=1}^D |u_i - v_i|^\alpha \quad (1)$$

In this study, we are particularly interested in the $\chi^2$ similarity, denoted by $\rho_{\chi^2}$:

$$\rho_{\chi^2} = \sum_{i=1}^D \frac{2 u_i v_i}{u_i + v_i}, \qquad \text{where } u_i \ge 0,\ v_i \ge 0,\ \sum_{i=1}^D u_i = \sum_{i=1}^D v_i = 1 \quad (2)$$

The chi-square similarity is closely related to the chi-square distance $d_{\chi^2}$:

$$d_{\chi^2} = \sum_{i=1}^D \frac{(u_i - v_i)^2}{u_i + v_i} = \sum_{i=1}^D (u_i + v_i) - \sum_{i=1}^D \frac{4 u_i v_i}{u_i + v_i} = 2 - 2\rho_{\chi^2} \quad (3)$$

The chi-square similarity is an instance of the Hilbertian metrics, which are defined over probability space [10] and are suitable for data generated from histograms. Histogram-based features (e.g., bag-of-word or bag-of-visual-word models) are extremely popular in computer vision, natural language processing (NLP), and information retrieval. Empirical studies have demonstrated the superiority of the $\chi^2$ distance over the $l_2$ or $l_1$ distances for image and text classification tasks [4, 10, 13, 2, 28, 27, 26].

The method of normal random projections (i.e., $\alpha$-stable projections with $\alpha = 2$) has become popular in machine learning (e.g., [7]) for reducing the data dimensions and data sizes, to facilitate efficient computation of the $l_2$ distances and correlations. More generally, the method of stable random projections [11, 17] provides an efficient algorithm to compute the $l_\alpha$ distances ($0 < \alpha \le 2$). In this paper, we propose to use only the signs of the projected data after applying stable projections.

1.1 Stable Random Projections and Sign (1-Bit) Stable Random Projections

Consider two high-dimensional data vectors $u, v \in \mathbb{R}^D$. The basic idea of stable random projections is to multiply $u$ and $v$ by a random matrix $R \in \mathbb{R}^{D\times k}$: $x = uR \in \mathbb{R}^k$, $y = vR \in \mathbb{R}^k$, where the entries of $R$ are i.i.d. samples from a symmetric $\alpha$-stable distribution with unit scale. By properties of stable distributions, $x_j - y_j$ follows a symmetric $\alpha$-stable distribution with scale $d_\alpha$. Hence, the task of computing $d_\alpha$ boils down to estimating the scale $d_\alpha$ from $k$ i.i.d. samples. In this paper, we propose to store only the signs of the projected data and we study the probability of collision:

$$P_\alpha = \Pr\left(\mathrm{sign}(x_j) \neq \mathrm{sign}(y_j)\right) \quad (4)$$

Using only the signs (i.e., 1 bit) has significant advantages for applications in search and learning. When $\alpha = 2$, this probability can be analytically evaluated [9] (or derived via a simple geometric argument):

$$P_2 = \Pr\left(\mathrm{sign}(x_j) \neq \mathrm{sign}(y_j)\right) = \frac{1}{\pi}\cos^{-1}\rho_2 \quad (5)$$

which is an important result known as sim-hash [5]. For $\alpha < 2$, the collision probability is an open problem. When the data are nonnegative, this paper (Theorem 1) will prove a bound of $P_\alpha$ for general $0 < \alpha \le 2$. The bound is exact at $\alpha = 2$ and becomes less sharp as $\alpha$ moves away from 2. Furthermore, for $\alpha = 1$ and nonnegative data, we make the interesting observation that the probability $P_1$ can be well approximated as a function of the $\chi^2$ similarity $\rho_{\chi^2}$.
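To make the setup concrete, below is a minimal sketch (ours, not part of the paper) of sign $\alpha$-stable random projections in Python/NumPy. It draws the projection matrix via the Chambers-Mallows-Stuck recipe for symmetric stable variables, keeps only the signs of the projected data, and, for $\alpha = 2$, checks the empirical collision rate against the exact sim-hash formula (5). All variable names and parameter choices here are illustrative assumptions.

```python
import numpy as np

def sym_stable(alpha, size, rng):
    """Symmetric alpha-stable samples (unit scale), Chambers-Mallows-Stuck method.
    alpha = 1 gives the standard Cauchy; alpha = 2 gives a Gaussian (scale sqrt(2))."""
    U = rng.uniform(-np.pi / 2, np.pi / 2, size)
    W = rng.exponential(1.0, size)
    if alpha == 1.0:
        return np.tan(U)
    return (np.sin(alpha * U) / np.cos(U) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * U) / W) ** ((1.0 - alpha) / alpha))

def sign_projections(u, v, alpha, k, rng):
    """Project u and v with a shared alpha-stable matrix R and keep only the signs."""
    R = sym_stable(alpha, (len(u), k), rng)
    return np.sign(u @ R), np.sign(v @ R)

rng = np.random.default_rng(0)
D, k = 100, 20000
u = rng.random(D)
v = 0.8 * u + 0.2 * rng.random(D)          # two correlated nonnegative vectors

xs, ys = sign_projections(u, v, 2.0, k, rng)
p_emp = np.mean(xs != ys)                   # empirical collision probability P_2
rho2 = (u @ v) / np.sqrt((u @ u) * (v @ v))
print(p_emp, np.arccos(rho2) / np.pi)       # (5): the two numbers should be close
```

Because the signs are invariant to the scale of the stable entries, the particular scale convention of the generator does not affect the collision rate.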
1.2 The Advantages of Sign Stable Random Projections

1. There is a significant saving in storage space by using only 1 bit instead of (e.g.) 64 bits.

2. This scheme leads to an efficient linear algorithm (e.g., linear SVM). For example, a negative sign can be coded as "01" and a positive sign as "10" (i.e., a vector of length 2). With $k$ projections, we concatenate the $k$ short vectors to form a vector of length $2k$. This idea is inspired by b-bit minwise hashing [20], which was designed for binary sparse data.

3. This scheme also leads to an efficient near neighbor search algorithm [8, 12]. We can code a negative sign by "0" and a positive sign by "1" and concatenate $k$ such bits to form a hash table of $2^k$ buckets. In the query phase, one only searches for similar vectors in one bucket.

1.3 Data Stream Computations

Stable random projections are naturally suitable for data streams. In modern applications, massive datasets are often generated in a streaming fashion; they are difficult to transmit and store [22], and the processing is done on the fly in one pass over the data. In the standard turnstile model [22], a data stream can be viewed as a high-dimensional vector with the entry values changing over time.

Here, we denote a stream at time $t$ by $u_i^{(t)}$, $i = 1$ to $D$. At time $t$, a stream element $(i_t, I_t)$ arrives and updates the $i_t$-th coordinate as $u_{i_t}^{(t)} = u_{i_t}^{(t-1)} + I_t$. Clearly, the turnstile data stream model is particularly suitable for describing histograms and it is also a standard model for network traffic summarization and monitoring [31]. Because this stream model is linear, methods based on linear projections (i.e., matrix-vector multiplications) can naturally handle streaming data of this sort. Basically, entries of the projection matrix $R \in \mathbb{R}^{D\times k}$ are (re)generated as needed using pseudo-random number techniques [23]. As $(i_t, I_t)$ arrives, only the entries in the $i_t$-th row, i.e., $r_{i_t, j}$, $j = 1$ to $k$, are (re)generated and the projected data are updated as $x_j^{(t)} = x_j^{(t-1)} + I_t \times r_{i_t, j}$ (a small sketch of this update is given at the end of this subsection).

Recall that, in the definition of the $\chi^2$ similarity, the data are assumed to be normalized (summing to 1). For nonnegative streams, the sum can be computed error-free by using merely one counter: $\sum_{i=1}^D u_i^{(t)} = \sum_{s=1}^t I_s$. Thus we can still use, without loss of generality, the sum-to-one assumption, even in the streaming environment. This fact was recently exploited by another data stream algorithm named Compressed Counting (CC) [18] for estimating the Shannon entropy of streams.

Because the $\chi^2$ similarity is popular in (e.g.) computer vision, there are other recent proposals for estimating it. For example, [15] proposed a nice technique to approximate $\rho_{\chi^2}$ by first expanding the data from $D$ dimensions to (e.g.) $5\sim 10 \times D$ dimensions through a nonlinear transformation and then applying normal random projections on the expanded data. The nonlinear transformation makes their method not applicable to data streams, unlike our proposal.

For notational simplicity, we will drop the superscript $(t)$ for the rest of the paper.
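As a hedged illustration of the turnstile updates just described (not the authors' implementation), the sketch below maintains the projected vector $x$ online for $\alpha = 1$: each time a stream element $(i_t, I_t)$ arrives, the $i_t$-th Cauchy row of $R$ is re-generated on demand from a fixed seed rather than stored, and one extra counter tracks the stream sum used for the sum-to-one normalization. The per-coordinate seeding scheme and all names are our own assumptions.

```python
import numpy as np

k = 64                 # number of projections
x = np.zeros(k)        # projected data, maintained online
total = 0.0            # single counter for the stream sum, used later for normalization

def cauchy_row(i, k, master_seed=12345):
    """Re-generate the k standard Cauchy entries of row i of R from a seed,
    so the identical row is reproduced whenever coordinate i is updated."""
    rng = np.random.default_rng([master_seed, i])
    return rng.standard_cauchy(k)

def update(i_t, I_t):
    """Process a turnstile element (i_t, I_t), i.e., u[i_t] += I_t."""
    global total
    x[:] += I_t * cauchy_row(i_t, k)
    total += I_t

for i_t, I_t in [(3, 2.0), (17, 1.0), (3, 1.0), (42, 5.0)]:   # a toy stream
    update(i_t, I_t)

signs = np.sign(x)     # the 1-bit summaries that are actually stored
```

In a production system one would use a dedicated space-bounded pseudorandom generator [23]; the seeded re-generation above is only one convenient way to mimic that behavior.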
2 An Experimental Study of Chi-Square Kernels

We provide an experimental study to validate the use of the $\chi^2$ similarity. Here, the "$\chi^2$-kernel" is defined as $K(u, v) = \rho_{\chi^2}$ and the "acos-$\chi^2$-kernel" as $K(u, v) = 1 - \frac{1}{\pi}\cos^{-1}\rho_{\chi^2}$. With a slight abuse of terminology, we call both "$\chi^2$ kernels" when it is clear from the context.

We use the "precomputed kernel" functionality in LIBSVM on two datasets: (i) UCI-PEMS, with 267 training examples and 173 testing examples in 138,672 dimensions; (ii) MNIST-small, a subset of the popular MNIST dataset, with 10,000 training examples and 10,000 testing examples.

The results are shown in Figure 1. To compare these two types of $\chi^2$ kernels with the "linear" kernel, we also test the same data using LIBLINEAR [6] after normalizing the data to have unit Euclidean norm, i.e., we basically use $\rho_2$. For both LIBSVM and LIBLINEAR, we use $l_2$-regularization with a regularization parameter $C$ and we report the test errors for a wide range of $C$ values.

Figure 1: Classification accuracies. $C$ is the $l_2$-regularization parameter. We use LIBLINEAR for the "linear" (i.e., $\rho_2$) kernel and the LIBSVM "precomputed kernel" functionality for the two types of $\chi^2$ kernels ("$\chi^2$-kernel" and "acos-$\chi^2$-kernel"). For UCI-PEMS, the $\chi^2$-kernel has better performance than the linear kernel and the acos-$\chi^2$-kernel. For MNIST-small, both $\chi^2$ kernels noticeably outperform the linear kernel. Note that MNIST-small used the original MNIST test set and merely 1/6 of the original training set.

Here, we should state that it is not the intention of this paper to use these two small examples to conclude the advantage of $\chi^2$ kernels over the linear kernel. We simply use them to validate our proposed method, which is general-purpose and is not limited to data generated from histograms.

3 Sign Stable Random Projections and the Collision Probability Bound

We apply stable random projections on two vectors $u, v \in \mathbb{R}^D$: $x = \sum_{i=1}^D u_i r_i$, $y = \sum_{i=1}^D v_i r_i$, where $r_i \sim S(\alpha, 1)$, i.i.d. Here $Z \sim S(\alpha, \gamma)$ denotes a symmetric $\alpha$-stable distribution with scale $\gamma$, whose characteristic function [24] is $E\left(e^{\sqrt{-1}\,Z t}\right) = e^{-\gamma |t|^\alpha}$. By properties of stable distributions, we know $x - y \sim S\left(\alpha, \sum_{i=1}^D |u_i - v_i|^\alpha\right)$. Applications including linear learning and near neighbor search will benefit from sign $\alpha$-stable random projections. When $\alpha = 2$ (i.e., normal), the collision probability $\Pr(\mathrm{sign}(x) \neq \mathrm{sign}(y))$ is known [5, 9]. For $\alpha < 2$, it is a difficult probability problem. This section provides a bound of $\Pr(\mathrm{sign}(x) \neq \mathrm{sign}(y))$, which is fairly accurate for $\alpha$ close to 2.

3.1 Collision Probability Bound

In this paper, we focus on nonnegative data (as common in practice). We present our first theorem.

Theorem 1  When the data are nonnegative, i.e., $u_i \ge 0$, $v_i \ge 0$, we have

$$\Pr\left(\mathrm{sign}(x) \neq \mathrm{sign}(y)\right) \le \frac{1}{\pi}\cos^{-1}\rho_\alpha, \qquad \text{where } \rho_\alpha = \left(\frac{\sum_{i=1}^D u_i^{\alpha/2} v_i^{\alpha/2}}{\sqrt{\sum_{i=1}^D u_i^\alpha}\sqrt{\sum_{i=1}^D v_i^\alpha}}\right)^{2/\alpha} \quad \Box \quad (6)$$

For $\alpha = 2$, this bound is exact [5, 9]. In fact, the result for $\alpha = 2$ leads to the following lemma:

Lemma 1  The kernel defined as $K(u, v) = 1 - \frac{1}{\pi}\cos^{-1}\rho_2$ is positive definite (PD).

Proof: The indicator function $1\{\mathrm{sign}(x) = \mathrm{sign}(y)\}$ can be written as an inner product (hence PD), and $\Pr(\mathrm{sign}(x) = \mathrm{sign}(y)) = E\left(1\{\mathrm{sign}(x) = \mathrm{sign}(y)\}\right) = 1 - \frac{1}{\pi}\cos^{-1}\rho_2$. $\Box$
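Before moving to the simulations, here is a small numerical sketch (ours) of the quantity $\rho_\alpha$ in (6) and the resulting upper bound; at $\alpha = 2$ the quantity reduces to $\rho_2$ and the bound coincides with the exact probability (5). The data and loop values are arbitrary illustrations.

```python
import numpy as np

def rho_alpha(u, v, alpha):
    """rho_alpha defined in (6); u and v are assumed nonnegative."""
    num = np.sum(u ** (alpha / 2.0) * v ** (alpha / 2.0))
    den = np.sqrt(np.sum(u ** alpha) * np.sum(v ** alpha))
    return (num / den) ** (2.0 / alpha)

def collision_bound(u, v, alpha):
    """Theorem 1 upper bound (1/pi) * arccos(rho_alpha)."""
    return np.arccos(rho_alpha(u, v, alpha)) / np.pi

rng = np.random.default_rng(0)
u, v = rng.random(100), rng.random(100)
for a in (2.0, 1.5, 1.0, 0.5):
    print(a, collision_bound(u, v, a))   # exact at alpha = 2; less sharp as alpha moves away from 2
```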
3.2 A Simulation Study to Verify the Bound of the Collision Probability

We generate the original data $u$ and $v$ by sampling from a bivariate t-distribution, which has two parameters: the correlation and the number of degrees of freedom (taken to be 1 in our experiments). We use a full range of the correlation parameter from 0 to 1 (spaced at 0.01). To generate positive data, we simply take the absolute values of the generated data. We then fix the data as our original data ($u$ and $v$), apply sign stable random projections, and report the empirical collision probabilities (after $10^5$ repetitions); a small sketch of this procedure is given at the end of this subsection.

Figure 2 presents the simulated collision probability $\Pr(\mathrm{sign}(x) \neq \mathrm{sign}(y))$ for $D = 100$ and $\alpha \in \{1.5, 1.2, 1.0, 0.5\}$. In each panel, the dashed curve is the theoretical upper bound $\frac{1}{\pi}\cos^{-1}\rho_\alpha$ and the solid curve is the simulated collision probability. Note that, as expected, the simulated data cannot cover the entire range of $\rho_\alpha$ values, especially as $\alpha \to 0$.

Figure 2: Dense data and $D = 100$. Simulated collision probability $\Pr(\mathrm{sign}(x) \neq \mathrm{sign}(y))$ for sign stable random projections. In each panel, the dashed curve is the upper bound $\frac{1}{\pi}\cos^{-1}\rho_\alpha$.

Figure 2 verifies the theoretical upper bound $\frac{1}{\pi}\cos^{-1}\rho_\alpha$. When $\alpha \ge 1.5$, this upper bound is fairly sharp. However, when $\alpha \le 1$, the bound is not tight, especially for small $\alpha$. Also, the curves of the empirical collision probabilities are not smooth (in terms of $\rho_\alpha$).

Real-world high-dimensional datasets are often sparse. To verify the theoretical upper bound of the collision probability on sparse data, we also simulate sparse data by randomly setting 50% of the data used in Figure 2 to zero. With sparse data, it is even more obvious that the theoretical upper bound $\frac{1}{\pi}\cos^{-1}\rho_\alpha$ is not sharp when $\alpha \le 1$, as shown in Figure 3.

Figure 3: Sparse data and $D = 100$. Simulated collision probability $\Pr(\mathrm{sign}(x) \neq \mathrm{sign}(y))$ for sign stable random projections. The upper bound is not tight, especially when $\alpha \le 1$.

In summary, the collision probability bound $\Pr(\mathrm{sign}(x) \neq \mathrm{sign}(y)) \le \frac{1}{\pi}\cos^{-1}\rho_\alpha$ is fairly sharp when $\alpha$ is close to 2 (e.g., $\alpha \ge 1.5$). However, for $\alpha \le 1$, a better approximation is needed.
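The following is a rough sketch of the simulation just described, under our own modeling choices (the paper does not spell out its exact bivariate-t generator): sample correlated normals, divide by a shared chi-square draw with one degree of freedom, take absolute values, and compare the empirical collision rate at $\alpha = 1$ with the Theorem 1 bound.

```python
import numpy as np

def bivariate_abs_t(D, corr, rng):
    """|t| samples with 1 degree of freedom and a given latent normal correlation."""
    cov = np.array([[1.0, corr], [corr, 1.0]])
    z = rng.multivariate_normal(np.zeros(2), cov, size=D)   # correlated normals
    chi = rng.chisquare(1, size=(D, 1))                     # one shared chi-square per pair
    t = z / np.sqrt(chi)
    return np.abs(t[:, 0]), np.abs(t[:, 1])

rng = np.random.default_rng(1)
D, k = 100, 20000
u, v = bivariate_abs_t(D, corr=0.5, rng=rng)

R = rng.standard_cauchy((D, k))                             # alpha = 1 projections
p_emp = np.mean(np.sign(u @ R) != np.sign(v @ R))           # empirical collision probability

rho1 = (np.sum(np.sqrt(u * v)) / np.sqrt(u.sum() * v.sum())) ** 2   # rho_alpha in (6) at alpha = 1
p_bound = np.arccos(rho1) / np.pi
print(p_emp, p_bound)                                       # p_emp <= p_bound, but the bound is not tight
```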
4  $\alpha = 1$ and Chi-Square ($\chi^2$) Similarity

In this section, we focus on nonnegative data ($u_i \ge 0$, $v_i \ge 0$) and $\alpha = 1$. This case is important in practice. For example, we can view the data $(u_i, v_i)$ as empirical probabilities, which are common when data are generated from histograms (as popular in NLP and vision) [4, 10, 13, 2, 28, 27, 26]. In this context, we always normalize the data, i.e., $\sum_{i=1}^D u_i = \sum_{i=1}^D v_i = 1$. Theorem 1 implies

$$\Pr\left(\mathrm{sign}(x) \neq \mathrm{sign}(y)\right) \le \frac{1}{\pi}\cos^{-1}\rho_1, \qquad \text{where } \rho_1 = \left(\sum_{i=1}^D u_i^{1/2} v_i^{1/2}\right)^2 \quad (7)$$

While the bound is not tight, interestingly, the collision probability can be related to the $\chi^2$ similarity. Recall the definitions of the chi-square distance $d_{\chi^2} = \sum_{i=1}^D \frac{(u_i - v_i)^2}{u_i + v_i}$ and the chi-square similarity $\rho_{\chi^2} = 1 - \frac{1}{2} d_{\chi^2} = \sum_{i=1}^D \frac{2 u_i v_i}{u_i + v_i}$. In this context, we adopt the convention $\frac{0}{0} = 0$.

Lemma 2  Assume $u_i \ge 0$, $v_i \ge 0$, $\sum_{i=1}^D u_i = 1$, $\sum_{i=1}^D v_i = 1$. Then

$$\rho_{\chi^2} = \sum_{i=1}^D \frac{2 u_i v_i}{u_i + v_i} \ \ge\ \rho_1 = \left(\sum_{i=1}^D u_i^{1/2} v_i^{1/2}\right)^2 \quad \Box \quad (8)$$

It is known that the $\chi^2$-kernel is PD [10]. Consequently, the acos-$\chi^2$-kernel is also PD.

Lemma 3  The kernel defined as $K(u, v) = 1 - \frac{1}{\pi}\cos^{-1}\rho_{\chi^2}$ is positive definite (PD). $\Box$

The remaining question is how to connect Cauchy random projections with the $\chi^2$ similarity.

5 Two Approximations of Collision Probability for Sign Cauchy Projections

It is a difficult problem to derive the collision probability of sign Cauchy projections if we would like to express the probability only in terms of certain summary statistics (e.g., some distance). Our first observation is that the collision probability can be well approximated using the $\chi^2$ similarity:

$$\Pr\left(\mathrm{sign}(x) \neq \mathrm{sign}(y)\right) \approx P_{\chi^2}(1) = \frac{1}{\pi}\cos^{-1}\left(\rho_{\chi^2}\right) \quad (9)$$

Figure 4 shows that this approximation is better than $\frac{1}{\pi}\cos^{-1}(\rho_1)$. In particular, in sparse data, the approximation $\frac{1}{\pi}\cos^{-1}(\rho_{\chi^2})$ is very accurate (except when $\rho_{\chi^2}$ is close to 1), while the bound $\frac{1}{\pi}\cos^{-1}(\rho_1)$ is not sharp (and the curve is not smooth in $\rho_1$).

Figure 4: The dashed curve is $\frac{1}{\pi}\cos^{-1}(\rho)$, where $\rho$ can be $\rho_1$ or $\rho_{\chi^2}$ depending on the context. In each panel, the two solid curves are the empirical collision probabilities in terms of $\rho_1$ (labeled by "1") or $\rho_{\chi^2}$ (labeled by "$\chi^2$"). It is clear that the proposed approximation $\frac{1}{\pi}\cos^{-1}\rho_{\chi^2}$ in (9) is tighter than the upper bound $\frac{1}{\pi}\cos^{-1}\rho_1$, especially so in sparse data.
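A short sketch (ours) of the quantities in this section: it computes $\rho_{\chi^2}$, $\rho_1$, and the first approximation $P_{\chi^2}(1)$ from (9) for normalized nonnegative vectors, using the $0/0 = 0$ convention, and checks the inequality of Lemma 2 on random data. The data are synthetic placeholders.

```python
import numpy as np

def chi2_similarity(u, v):
    """rho_chi2 = sum_i 2 u_i v_i / (u_i + v_i), with 0/0 treated as 0."""
    s = u + v
    m = s > 0
    return np.sum(2.0 * u[m] * v[m] / s[m])

def rho_1(u, v):
    return np.sum(np.sqrt(u * v)) ** 2

def P_chi2_1(u, v):
    """First approximation (9) of the sign-Cauchy collision probability."""
    return np.arccos(chi2_similarity(u, v)) / np.pi

rng = np.random.default_rng(0)
u = rng.random(1000); u /= u.sum()
v = rng.random(1000); v /= v.sum()
assert chi2_similarity(u, v) >= rho_1(u, v) - 1e-12    # Lemma 2
print(P_chi2_1(u, v), np.arccos(rho_1(u, v)) / np.pi)  # approximation (9) vs. the looser bound (7)
```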
Our second (and less obvious) approximation is the following integral:

$$\Pr\left(\mathrm{sign}(x) \neq \mathrm{sign}(y)\right) \approx P_{\chi^2}(2) = \frac{1}{2} - \frac{2}{\pi^2}\int_0^{\pi/2} \tan^{-1}\left(\frac{\rho_{\chi^2}}{2 - 2\rho_{\chi^2}}\tan t\right) dt \quad (10)$$

Figure 5 illustrates that, for dense data, the second approximation (10) is more accurate than the first (9). The second approximation (10) is also accurate for sparse data. Both approximations, $P_{\chi^2}(1)$ and $P_{\chi^2}(2)$, are monotone functions of $\rho_{\chi^2}$. In practice, we often do not need the $\rho_{\chi^2}$ values explicitly, because it often suffices if the collision probability is a monotone function of the similarity.

5.1 Binary Data

Interestingly, when the data are binary (before normalization), we can compute the collision probability exactly, which allows us to analytically assess the accuracy of the approximations. In fact, this case inspired us to propose the second approximation (10), which is otherwise not intuitive. For convenience, we define $a = |I_a|$, $b = |I_b|$, $c = |I_c|$, where

$$I_a = \{i \mid u_i > 0,\ v_i = 0\}, \qquad I_b = \{i \mid v_i > 0,\ u_i = 0\}, \qquad I_c = \{i \mid u_i > 0,\ v_i > 0\} \quad (11)$$

Assume the data are binary before normalization (i.e., before being scaled to sum to one). That is,

$$u_i = \frac{1}{|I_a| + |I_c|} = \frac{1}{a + c},\ \forall i \in I_a \cup I_c; \qquad v_i = \frac{1}{|I_b| + |I_c|} = \frac{1}{b + c},\ \forall i \in I_b \cup I_c \quad (12)$$

The chi-square similarity becomes $\rho_{\chi^2} = \sum_{i=1}^D \frac{2 u_i v_i}{u_i + v_i} = \frac{2c}{a + b + 2c}$ and hence $\frac{\rho_{\chi^2}}{2 - 2\rho_{\chi^2}} = \frac{c}{a + b}$.

Theorem 2  Assume binary data. When $\alpha = 1$, the exact collision probability is

$$\Pr\left(\mathrm{sign}(x) \neq \mathrm{sign}(y)\right) = \frac{1}{2} - \frac{2}{\pi^2}\, E\left\{\tan^{-1}\left(\frac{c}{a}|R|\right)\tan^{-1}\left(\frac{c}{b}|R|\right)\right\}$$

where $R$ is a standard Cauchy random variable. $\Box$

When $a = 0$ or $b = 0$, we have $E\left\{\tan^{-1}\left(\frac{c}{a}|R|\right)\tan^{-1}\left(\frac{c}{b}|R|\right)\right\} = \frac{\pi}{2}\, E\left\{\tan^{-1}\left(\frac{c}{a+b}|R|\right)\right\}$. This observation inspires us to propose the approximation (10):

$$P_{\chi^2}(2) = \frac{1}{2} - \frac{2}{\pi^2}\cdot\frac{\pi}{2}\, E\left\{\tan^{-1}\left(\frac{c}{a+b}|R|\right)\right\} = \frac{1}{2} - \frac{2}{\pi^2}\int_0^{\pi/2}\tan^{-1}\left(\frac{c}{a+b}\tan t\right) dt \quad (13)$$

Figure 5: Comparison of the two approximations: $\chi^2(1)$ based on (9) and $\chi^2(2)$ based on (10). The solid curves (empirical probabilities expressed in terms of $\rho_{\chi^2}$) are the same solid curves labeled "$\chi^2$" in Figure 4. The left panel shows that the second approximation (10) is more accurate in dense data. The right panel illustrates that both approximations are accurate in sparse data; (9) is slightly more accurate at small $\rho_{\chi^2}$ and (10) is more accurate at $\rho_{\chi^2}$ close to 1.
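The sketch below (ours) evaluates the second approximation (10) by one-dimensional quadrature and, for binary data, compares it with the exact probability of Theorem 2 estimated by Monte Carlo over a standard Cauchy variable $R$; the sample size and the example values of $a, b, c$ are arbitrary.

```python
import numpy as np
from scipy.integrate import quad

def P_chi2_2(rho):
    """Second approximation (10)."""
    lam = rho / (2.0 - 2.0 * rho)
    val, _ = quad(lambda t: np.arctan(lam * np.tan(t)), 0.0, np.pi / 2)
    return 0.5 - (2.0 / np.pi ** 2) * val

def exact_binary(a, b, c, n=10**6, seed=0):
    """Theorem 2: 1/2 - (2/pi^2) E[ arctan(c|R|/a) * arctan(c|R|/b) ], R standard Cauchy."""
    R = np.abs(np.random.default_rng(seed).standard_cauchy(n))
    return 0.5 - (2.0 / np.pi ** 2) * np.mean(np.arctan(c * R / a) * np.arctan(c * R / b))

a, b, c = 5.0, 8.0, 12.0
rho = 2.0 * c / (a + b + 2.0 * c)            # chi-square similarity of the two binary vectors
print(P_chi2_2(rho), exact_binary(a, b, c))  # the gap is the small positive error analyzed next
```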
To validate this approximation for binary data, we study the difference between (13) and (10), i.e.,

$$Z(a/c, b/c) = \mathrm{Err} = \Pr\left(\mathrm{sign}(x) \neq \mathrm{sign}(y)\right) - P_{\chi^2}(2) = -\frac{2}{\pi^2}\, E\left\{\tan^{-1}\left(\frac{|R|}{a/c}\right)\tan^{-1}\left(\frac{|R|}{b/c}\right)\right\} + \frac{1}{\pi}\, E\left\{\tan^{-1}\left(\frac{|R|}{a/c + b/c}\right)\right\} \quad (14)$$

(14) can be easily computed by simulation. Figure 6 confirms that the errors are larger than zero and very small. The maximum error is smaller than 0.0192, as proved in Lemma 4.

Figure 6: Left panel: contour plot of the error $Z(a/c, b/c)$ in (14). The maximum error (which is $< 0.0192$) occurs along the diagonal line. Right panel: the diagonal curve of $Z(a/c, b/c)$.

Lemma 4  The error defined in (14) ranges between 0 and $Z(t^*, t^*)$:

$$0 \le Z(a/c, b/c) \le Z(t^*, t^*) = -\frac{2}{\pi^2}\, E\left\{\left(\tan^{-1}\left(\frac{|R|}{t^*}\right)\right)^2\right\} + \frac{1}{\pi}\, E\left\{\tan^{-1}\left(\frac{|R|}{2 t^*}\right)\right\} \quad (15)$$

where $t^* = 2.77935$ is the solution to $\frac{1}{t^2 - 1}\log\frac{2t}{1+t} = \frac{\log(2t)}{(2t)^2 - 1}$. Numerically, $Z(t^*, t^*) = 0.01919$. $\Box$

5.2 An Experiment Based on 3.6 Million English Word Pairs

To further validate the two $\chi^2$ approximations (on non-binary data), we experiment with a word-occurrence dataset (an example of histogram data) built from a chunk of $D = 2^{16}$ web crawl documents. There are in total 2,702 words, i.e., 2,702 vectors and 3,649,051 word pairs. The entries of a vector are the occurrences of the word. This is a typical sparse, non-binary dataset. Interestingly, the errors of the collision probabilities based on the two $\chi^2$ approximations are still very small. To report the results, we apply sign Cauchy random projections $10^7$ times to evaluate the approximation errors of (9) and (10). The results, as presented in Figure 7, again confirm that the upper bound $\frac{1}{\pi}\cos^{-1}\rho_1$ is not tight and that both $\chi^2$ approximations, $P_{\chi^2}(1)$ and $P_{\chi^2}(2)$, are accurate.

Figure 7: Empirical collision probabilities for 3.6 million English word pairs. In the left panel, we plot the empirical collision probabilities against $\rho_1$ (lower, green if color is available) and $\rho_{\chi^2}$ (higher, red). The curves confirm that the bound $\frac{1}{\pi}\cos^{-1}\rho_1$ is not tight (and the curve is not smooth). We plot the two $\chi^2$ approximations as dashed curves, which largely match the empirical probabilities plotted against $\rho_{\chi^2}$, confirming that the $\chi^2$ approximations are good. For smaller $\rho_{\chi^2}$ values, the first approximation $P_{\chi^2}(1)$ is slightly more accurate; for larger $\rho_{\chi^2}$ values, the second approximation $P_{\chi^2}(2)$ is more accurate. In the right panel, we plot the errors for both $P_{\chi^2}(1)$ and $P_{\chi^2}(2)$.
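Because both approximations are monotone in $\rho_{\chi^2}$, the observed fraction of sign disagreements can be inverted to give a plug-in estimate of $\rho_{\chi^2}$ itself, which is how the $\chi^2$-kernel experiments in the next section are obtained. A hedged sketch of the simplest version, inverting (9) (the clipping at 1/2 is our own choice to keep the estimate nonnegative):

```python
import numpy as np

def estimate_rho_chi2(x_sign, y_sign):
    """Plug-in estimate of rho_chi2 from k sign Cauchy projections, inverting (9):
    if Pr(collision) ~ (1/pi) arccos(rho_chi2), then rho_chi2 ~ cos(pi * p_hat)."""
    p_hat = np.mean(np.asarray(x_sign) != np.asarray(y_sign))
    return np.cos(np.pi * min(p_hat, 0.5))   # clip at 1/2 so the estimate stays nonnegative

# usage with Cauchy (alpha = 1) sign projections from the Section 1.1 sketch:
# xs, ys = sign_projections(u, v, 1.0, k, rng)
# rho_hat = estimate_rho_chi2(xs, ys)
```

Inverting the more accurate approximation (10) can be done the same way via a one-dimensional numerical root-find, since both approximations are monotone in $\rho_{\chi^2}$.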
6 Sign Cauchy Random Projections for Classification

Our method provides an effective strategy for classification. For each (high-dimensional) data vector, using $k$ sign Cauchy projections, we encode a negative sign as "01" and a positive sign as "10" (i.e., a vector of length 2) and concatenate the $k$ short vectors to form a new feature vector of length $2k$. We then feed the new data into a linear classifier (e.g., LIBLINEAR). Interestingly, this linear classifier approximates a nonlinear kernel classifier based on the acos-$\chi^2$-kernel $K(u, v) = 1 - \frac{1}{\pi}\cos^{-1}\rho_{\chi^2}$. See Figure 8 for the experiments on the same two datasets as in Figure 1: UCI-PEMS and MNIST-small. A small sketch of this feature construction is given at the end of this section.

Figure 8: The two dashed (red if color is available) curves are the classification results obtained using the "acos-$\chi^2$-kernel" via the "precomputed kernel" functionality in LIBSVM. The solid (black) curves are the accuracies using $k$ sign Cauchy projections and LIBLINEAR. The results confirm that the linear kernel from sign Cauchy projections can approximate the nonlinear acos-$\chi^2$-kernel.

Figure 1 has already shown that, for the UCI-PEMS dataset, the $\chi^2$-kernel ($\rho_{\chi^2}$) can produce noticeably better classification results than the acos-$\chi^2$-kernel ($1 - \frac{1}{\pi}\cos^{-1}\rho_{\chi^2}$). Although our method does not directly approximate $\rho_{\chi^2}$, we can still estimate $\rho_{\chi^2}$ by assuming the collision probability is exactly $\Pr(\mathrm{sign}(x) \neq \mathrm{sign}(y)) = \frac{1}{\pi}\cos^{-1}\rho_{\chi^2}$ and then feed the estimated $\rho_{\chi^2}$ values into the LIBSVM "precomputed kernel" for classification. Figure 9 verifies that this method can also approximate the $\chi^2$ kernel with enough projections.

Figure 9: Nonlinear kernels. The dashed curves are the classification results obtained using the $\chi^2$-kernel and the LIBSVM "precomputed kernel" functionality. We apply $k$ sign Cauchy projections, estimate $\rho_{\chi^2}$ assuming the collision probability is exactly $\frac{1}{\pi}\cos^{-1}\rho_{\chi^2}$, and then feed the estimated $\rho_{\chi^2}$ into LIBSVM, again using the "precomputed kernel" functionality.
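A minimal sketch (ours) of the feature construction in this section: each of the $k$ signs is expanded into the two-bit code described above, and the concatenated length-$2k$ vectors can then be fed to any linear solver; the scikit-learn call in the comment is only an illustrative stand-in for LIBLINEAR.

```python
import numpy as np

def sign_cauchy_features(X, k, seed=0):
    """Map each row of X (nonnegative, D-dimensional) to a length-2k binary vector:
    a negative sign becomes (0, 1) and a positive sign becomes (1, 0), as in Section 6."""
    rng = np.random.default_rng(seed)
    R = rng.standard_cauchy((X.shape[1], k))   # one shared Cauchy projection matrix
    S = X @ R                                  # n x k projected data
    F = np.zeros((X.shape[0], 2 * k), dtype=np.float32)
    F[:, 0::2] = S >= 0                        # first bit of the code ("10" for a positive sign)
    F[:, 1::2] = S < 0                         # second bit of the code ("01" for a negative sign)
    return F

# illustrative usage: a linear SVM on the expanded features approximates
# the nonlinear acos-chi2-kernel classifier
# from sklearn.svm import LinearSVC
# clf = LinearSVC(C=1.0).fit(sign_cauchy_features(X_train, k=1024), y_train)
```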
7 Conclusion

The use of the $\chi^2$ similarity is widespread in machine learning, especially when features are generated from histograms, as is common in natural language processing and computer vision. Many prior studies [4, 10, 13, 2, 28, 27, 26] have shown the advantage of using the $\chi^2$ similarity compared to other measures such as the $l_2$ distance. However, for large-scale applications with ultra-high-dimensional datasets, using the $\chi^2$ similarity becomes challenging for practical reasons. Simply storing (and maneuvering) all the high-dimensional features would be difficult if there are a large number of observations. Computing all pairwise $\chi^2$ similarities can be time-consuming, and in fact we usually cannot materialize an all-pairwise similarity matrix even if there are merely $10^6$ data points. Furthermore, the $\chi^2$ similarity is nonlinear, making it difficult to take advantage of modern linear algorithms, which are known to be very efficient, e.g., [14, 25, 6, 3]. When data are generated in a streaming fashion, computing $\chi^2$ similarities without storing the original data is even more challenging.

The method of $\alpha$-stable random projections ($0 < \alpha \le 2$) [11, 17] is popular for efficiently computing the $l_\alpha$ distances in massive (streaming) data. We propose sign stable random projections, which store only the signs (i.e., 1 bit) of the projected data. Obviously, the saving in storage is a significant advantage. Also, these bits offer an indexing capability which allows efficient search. For example, we can build hash tables using the bits to achieve sublinear-time near neighbor search (although this paper does not focus on near neighbor search). We can also build efficient linear classifiers using these bits, for large-scale high-dimensional machine learning applications.

A crucial task in analyzing sign stable random projections is to study the probability of collision (i.e., when the two signs differ). We derive a theoretical bound of the collision probability which is exact when $\alpha = 2$. The bound is fairly sharp for $\alpha$ close to 2. For $\alpha = 1$ (i.e., Cauchy random projections), we find the $\chi^2$ approximation is significantly more accurate. In addition, for binary data, we analytically show that the errors from using the $\chi^2$ approximation are less than 0.0192. Experiments on real and simulated data confirm that our proposed $\chi^2$ approximations are very accurate.

We are enthusiastic about the practicality of sign stable projections in learning and search applications. The previous idea of using the signs from normal random projections has been widely adopted in practice for approximating correlations. Given the widespread use of the $\chi^2$ similarity and the simplicity of our method, we expect the proposed method will be adopted by practitioners.

Future research  Many interesting future research topics can be studied. (i) The processing cost of conducting stable random projections can be dramatically reduced by very sparse stable random projections [16]. This will make our proposed method even more practical. (ii) We can try to utilize more than just 1 bit of the projected data, i.e., we can study the general coding problem [19]. (iii) Another interesting direction is to study the use of sign stable projections for sparse signal recovery (compressed sensing) with stable distributions [21]. (iv) When $\alpha \to 0$, the collision probability becomes $\Pr(\mathrm{sign}(x) \neq \mathrm{sign}(y)) = \frac{1}{2} - \frac{1}{2}\,\mathrm{Resemblance}$, which provides an elegant mechanism for computing the resemblance (of the binary-quantized data) in sparse data streams.

Acknowledgement  The work of Ping Li is supported by NSF-III-1360971, NSF-Bigdata-1419210, ONR-N00014-13-1-0764, and AFOSR-FA9550-13-1-0137.
The work of Gennady Samorodnitsky is supported by ARO-W911NF-12-10385.

References

[1] http://googleresearch.blogspot.com/2010/04/lessons-learned-developing-practical.html.
[2] Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari. What is an object? In CVPR, pages 73-80, 2010.
[3] Leon Bottou. http://leon.bottou.org/projects/sgd.
[4] Olivier Chapelle, Patrick Haffner, and Vladimir N. Vapnik. Support vector machines for histogram-based image classification. IEEE Trans. Neural Networks, 10(5):1055-1064, 1999.
[5] Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, 2002.
[6] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874, 2008.
[7] Yoav Freund, Sanjoy Dasgupta, Mayank Kabra, and Nakul Verma. Learning the structure of manifolds using random projections. In NIPS, Vancouver, BC, Canada, 2008.
[8] Jerome H. Friedman, F. Baskett, and L. Shustek. An algorithm for finding nearest neighbors. IEEE Transactions on Computers, 24:1000-1006, 1975.
[9] Michel X. Goemans and David P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of ACM, 42(6):1115-1145, 1995.
[10] Matthias Hein and Olivier Bousquet. Hilbertian metrics and positive definite kernels on probability measures. In AISTATS, pages 136-143, Barbados, 2005.
[11] Piotr Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of ACM, 53(3):307-323, 2006.
[12] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604-613, Dallas, TX, 1998.
[13] Yugang Jiang, Chongwah Ngo, and Jun Yang. Towards optimal bag-of-features for object categorization and semantic video retrieval. In CIVR, pages 494-501, Amsterdam, Netherlands, 2007.
[14] Thorsten Joachims. Training linear SVMs in linear time. In KDD, pages 217-226, Pittsburgh, PA, 2006.
[15] Fuxin Li, Guy Lebanon, and Cristian Sminchisescu. A linear approximation to the χ² kernel with geometric convergence. Technical report, arXiv:1206.4074, 2013.
[16] Ping Li. Very sparse stable random projections for dimension reduction in lα (0 < α ≤ 2) norm. In KDD, San Jose, CA, 2007.
[17] Ping Li. Estimators and tail bounds for dimension reduction in lα (0 < α ≤ 2) using stable random projections. In SODA, pages 10-19, San Francisco, CA, 2008.
[18] Ping Li. Improving compressed counting. In UAI, Montreal, CA, 2009.
[19] Ping Li, Michael Mitzenmacher, and Anshumali Shrivastava. Coding for random projections. 2013.
[20] Ping Li, Art B. Owen, and Cun-Hui Zhang. One permutation hashing. In NIPS, Lake Tahoe, NV, 2012.
[21] Ping Li, Cun-Hui Zhang, and Tong Zhang. Compressed counting meets compressed sensing. 2013.
[22] S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends in Theoretical Computer Science, 1(2):117-236, 2005.
[23] Noam Nisan. Pseudorandom generators for space-bounded computations. In STOC, 1990.
[24] Gennady Samorodnitsky and Murad S. Taqqu. Stable Non-Gaussian Random Processes. Chapman & Hall, New York, 1994.
[25] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML, pages 807-814, Corvallis, Oregon, 2007.
[26] Andrea Vedaldi and Andrew Zisserman. Efficient additive kernels via explicit feature maps. IEEE Trans. Pattern Anal. Mach. Intell., 34(3):480-492, 2012.
[27] Sreekanth Vempati, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. Generalized RBF feature maps for efficient detection. In BMVC, pages 1-11, Aberystwyth, UK, 2010.
[28] Gang Wang, Derek Hoiem, and David A. Forsyth. Building text features for object image classification. In CVPR, pages 1367-1374, Miami, Florida, 2009.
[29] Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas S. Huang, and Yihong Gong. Locality-constrained linear coding for image classification. In CVPR, pages 3360-3367, San Francisco, CA, 2010.
[30] Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. Feature hashing for large scale multitask learning. In ICML, pages 1113-1120, 2009.
[31] Haiquan (Chuck) Zhao, Nan Hua, Ashwin Lall, Ping Li, Jia Wang, and Jun Xu. Towards a universal sketch for origin-destination network measurements. In Network and Parallel Computing, pages 201-213, 2011.
", "award": [], "sourceid": 1226, "authors": [{"given_name": "Ping", "family_name": "Li", "institution": "Cornell University"}, {"given_name": "Gennady", "family_name": "Samorodnitsky", "institution": "Cornell University"}, {"given_name": "John", "family_name": "Hopcroft", "institution": "Cornell University"}]}