{"title": "Hashing Algorithms for Large-Scale Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2672, "page_last": 2680, "abstract": "Minwise hashing is a standard technique in the context of search for efficiently computing set similarities. The recent development of b-bit minwise hashing provides a  substantial improvement by storing only the lowest b bits of each hashed value. In this paper, we demonstrate that  b-bit minwise hashing can be naturally integrated with linear learning algorithms such as linear SVM and logistic regression, to solve large-scale and high-dimensional statistical learning tasks, especially when the data do not fit in memory.   We  compare $b$-bit minwise hashing with  the Count-Min (CM)  and  Vowpal Wabbit (VW) algorithms, which have essentially the same variances as random projections. Our theoretical and empirical comparisons illustrate that b-bit minwise hashing is significantly more accurate (at the same storage cost) than VW (and random projections) for binary data.", "full_text": "Hashing Algorithms for Large-Scale Learning\n\nPing Li\n\nCornell University\n\npingli@cornell.edu\n\nAnshumali Shrivastava\n\nCornell University\n\nanshu@cs.cornell.edu\n\nJoshua Moore\n\nCornell University\n\nArnd Christian K\u00a8onig\n\nMicrosoft Research\n\njlmo@cs.cornell.edu\n\nchrisko@microsoft.com\n\nAbstract\n\nMinwise hashing is a standard technique in the context of search for ef\ufb01ciently\ncomputing set similarities. The recent development of b-bit minwise hashing pro-\nvides a substantial improvement by storing only the lowest b bits of each hashed\nvalue.\nIn this paper, we demonstrate that b-bit minwise hashing can be natu-\nrally integrated with linear learning algorithms such as linear SVM and logistic\nregression, to solve large-scale and high-dimensional statistical learning tasks, es-\npecially when the data do not \ufb01t in memory. 
We compare b-bit minwise hashing with the Count-Min (CM) and Vowpal Wabbit (VW) algorithms, which have essentially the same variances as random projections. Our theoretical and empirical comparisons illustrate that b-bit minwise hashing is significantly more accurate (at the same storage cost) than VW (and random projections) for binary data.\n\n1 Introduction\n\nWith the advent of the Internet, many machine learning applications are faced with very large and inherently high-dimensional datasets, resulting in challenges in scaling up training algorithms and storing the data. Especially in the context of search and machine translation, corpus sizes used in industrial practice have long exceeded the main memory capacity of a single machine. For example, [33] discusses training sets with $10^{11}$ items and $10^9$ distinct features, requiring novel algorithmic approaches and architectures. As a consequence, there has been a renewed emphasis on scaling up machine learning techniques by using massively parallel architectures; however, methods relying solely on parallelism can be expensive (both with regard to hardware requirements and energy costs) and often induce significant additional communication and data distribution overhead.\n\nThis work approaches the challenges posed by large datasets by leveraging techniques from the area of similarity search [2], where similar increases in data sizes have made the storage and computational requirements for computing exact distances prohibitive, thus making data representations that allow compact storage and efficient approximate similarity computation necessary.\n\nThe method of b-bit minwise hashing [26\u201328] is a recent advance for efficiently (in both time and space) computing resemblances among extremely high-dimensional (e.g., $2^{64}$) binary vectors. 
In this paper, we show that b-bit minwise hashing can be seamlessly integrated with linear Support Vector Machine (SVM) [13, 18, 20, 31, 35] and logistic regression solvers.\n\n1.1 Ultra High-Dimensional Large Datasets and Memory Bottlenecks\n\nIn the context of search, a standard procedure to represent documents (e.g., Web pages) is to use w-shingles (i.e., w contiguous words), where w \u2265 5 in several studies [6, 7, 14]. This procedure can generate datasets of extremely high dimensions. For example, suppose we only consider $10^5$ common English words. Using w = 5 may require the size of dictionary $\\Omega$ to be $D = |\\Omega| = 10^{25} \\approx 2^{83}$. In practice, $D = 2^{64}$ often suffices, as the number of available documents may not be large enough to exhaust the dictionary. For w-shingle data, normally only absence/presence (0/1) information is used, as it is known that word frequency distributions within documents approximately follow a power law [3], meaning that most single terms occur rarely, which makes a w-shingle unlikely to occur more than once in a document. Interestingly, even when the data are not too high-dimensional, empirical studies [8, 17, 19] achieved good performance with binary-quantized data.\n\nWhen the data can fit in memory, linear SVM training is often extremely efficient after the data are loaded into memory. It is however often the case that, for very large datasets, the data loading time dominates the computing time for solving the SVM problem [35]. A more severe problem arises when the data cannot fit in memory. This situation can be common in practice. The publicly available webspam dataset (in LIBSVM format) needs about 24GB disk space, which exceeds the memory capacity of many desktop PCs. 
Note that webspam, which contains only 350,000 documents represented by 3-shingles, is still very small compared to industry applications [33].\n\n1.2 Our Proposal\n\nWe propose a solution which leverages b-bit minwise hashing. Our approach assumes the data vectors are binary, high-dimensional, and relatively sparse, which is generally true of text documents represented via shingles. We apply b-bit minwise hashing to obtain a compact representation of the original data. In order to use the technique for efficient learning, we have to address several issues:\n\n\u2022 We need to prove that the matrices generated by b-bit minwise hashing are positive definite, which will provide the solid foundation for our proposed solution.\n\u2022 If we use b-bit minwise hashing to estimate the resemblance, which is nonlinear, how can we effectively convert this nonlinear problem into a linear problem?\n\u2022 Compared to other hashing techniques such as random projections, Count-Min (CM) sketch [11], or Vowpal Wabbit (VW) [32, 34], does our approach exhibit advantages?\n\nIt turns out that our proof in the next section that b-bit hashing matrices are positive definite naturally provides the construction for converting the otherwise nonlinear SVM problem into linear SVM.\n\n2 Review of Minwise Hashing and b-Bit Minwise Hashing\n\nMinwise hashing [6, 7] has been successfully applied to a wide range of real-world problems [4, 6, 7, 9, 10, 12, 15, 16, 30], for efficiently computing set similarities. Minwise hashing mainly works well with binary data, which can be viewed either as 0/1 vectors or as sets. Given two sets, $S_1, S_2 \\subseteq \\Omega = \\{0, 1, 2, \\ldots, D - 1\\}$, a widely used measure of similarity is the resemblance R:\n\n$$R = \\frac{|S_1 \\cap S_2|}{|S_1 \\cup S_2|} = \\frac{a}{f_1 + f_2 - a}, \\quad \\text{where } f_1 = |S_1|, \\quad f_2 = |S_2|, \\quad a = |S_1 \\cap S_2|. \\tag{1}$$\n\nApplying a random permutation $\\pi : \\Omega \\to \\Omega$ on $S_1$ and $S_2$, the collision probability is simply\n\n$$\\Pr\\left(\\min(\\pi(S_1)) = \\min(\\pi(S_2))\\right) = \\frac{|S_1 \\cap S_2|}{|S_1 \\cup S_2|} = R. \\tag{2}$$\n\nOne can repeat the permutation k times: $\\pi_1, \\pi_2, \\ldots, \\pi_k$ to estimate R without bias. The common practice is to store each hashed value, e.g., $\\min(\\pi(S_1))$ and $\\min(\\pi(S_2))$, using 64 bits [14]. The storage (and computational) cost will be prohibitive in truly large-scale (industry) applications [29]. b-bit minwise hashing [27] provides a strikingly simple solution to this (storage and computational) problem by storing only the lowest b bits (instead of 64 bits) of each hashed value.\n\nFor convenience, denote $z_1 = \\min(\\pi(S_1))$ and $z_2 = \\min(\\pi(S_2))$, and denote by $z_1^{(b)}$ ($z_2^{(b)}$) the integer value corresponding to the lowest b bits of $z_1$ ($z_2$). 
For example, if $z_1 = 7$, then $z_1^{(2)} = 3$.\n\nTheorem 1 [27] Assume D is large.\n\n$$P_b = \\Pr\\left(z_1^{(b)} = z_2^{(b)}\\right) = C_{1,b} + (1 - C_{2,b}) R, \\tag{3}$$\n\nwhere\n\n$$r_1 = \\frac{f_1}{D}, \\quad r_2 = \\frac{f_2}{D}, \\quad f_1 = |S_1|, \\quad f_2 = |S_2|,$$\n\n$$C_{1,b} = A_{1,b} \\frac{r_2}{r_1 + r_2} + A_{2,b} \\frac{r_1}{r_1 + r_2}, \\quad C_{2,b} = A_{1,b} \\frac{r_1}{r_1 + r_2} + A_{2,b} \\frac{r_2}{r_1 + r_2},$$\n\n$$A_{1,b} = \\frac{r_1 [1 - r_1]^{2^b - 1}}{1 - [1 - r_1]^{2^b}}, \\quad A_{2,b} = \\frac{r_2 [1 - r_2]^{2^b - 1}}{1 - [1 - r_2]^{2^b}}. \\quad \\Box$$\n\nThis (approximate) formula (3) is remarkably accurate, even for very small D; see Figure 1 in [25]. We can then estimate $P_b$ (and R) from k independent permutations:\n\n$$\\hat{R}_b = \\frac{\\hat{P}_b - C_{1,b}}{1 - C_{2,b}}, \\quad \\mathrm{Var}\\left(\\hat{R}_b\\right) = \\frac{\\mathrm{Var}\\left(\\hat{P}_b\\right)}{[1 - C_{2,b}]^2} = \\frac{1}{k} \\frac{[C_{1,b} + (1 - C_{2,b}) R] [1 - C_{1,b} - (1 - C_{2,b}) R]}{[1 - C_{2,b}]^2}. \\tag{4}$$\n\nIt turns out that our method only needs $\\hat{P}_b$ for linear learning, i.e., there is no need to explicitly estimate R.\n\n3 Kernels from Minwise Hashing and b-Bit Minwise Hashing\n\nDefinition: A symmetric $n \\times n$ matrix K satisfying $\\sum_{ij} c_i c_j K_{ij} \\ge 0$ for all real vectors c is called positive definite (PD). Note that here we do not differentiate PD from nonnegative definite.\n\nTheorem 2 Consider n sets $S_1, \\ldots, S_n \\subseteq \\Omega = \\{0, 1, \\ldots, D - 1\\}$. Apply one permutation $\\pi$ to each set. Define $z_i = \\min\\{\\pi(S_i)\\}$ and $z_i^{(b)}$ the lowest b bits of $z_i$. The following three matrices are PD.\n\n1. The resemblance matrix $R \\in \\mathbb{R}^{n \\times n}$, whose (i, j)-th entry is the resemblance between set $S_i$ and set $S_j$: $R_{ij} = \\frac{|S_i \\cap S_j|}{|S_i \\cup S_j|} = \\frac{|S_i \\cap S_j|}{|S_i| + |S_j| - |S_i \\cap S_j|}$.\n2. The minwise hashing matrix $M \\in \\mathbb{R}^{n \\times n}$: $M_{ij} = 1\\{z_i = z_j\\}$.\n3. The b-bit minwise hashing matrix $M^{(b)} \\in \\mathbb{R}^{n \\times n}$: $M^{(b)}_{ij} = 1\\{z_i^{(b)} = z_j^{(b)}\\}$.\n\nConsequently, consider k independent permutations and denote by $M^{(b)}_{(s)}$ the b-bit minwise hashing matrix generated by the s-th permutation. Then the summation $\\sum_{s=1}^{k} M^{(b)}_{(s)}$ is also PD.\n\nProof: A matrix A is PD if it can be written as an inner product $B^T B$. Because\n\n$$M_{ij} = 1\\{z_i = z_j\\} = \\sum_{t=0}^{D-1} 1\\{z_i = t\\} \\times 1\\{z_j = t\\}, \\tag{5}$$\n\n$M_{ij}$ is the inner product of two D-dim vectors. Thus, M is PD. Similarly, $M^{(b)}$ is PD because $M^{(b)}_{ij} = \\sum_{t=0}^{2^b - 1} 1\\{z_i^{(b)} = t\\} \\times 1\\{z_j^{(b)} = t\\}$. R is PD because $R_{ij} = \\Pr\\{M_{ij} = 1\\} = E(M_{ij})$ and $M_{ij}$ is the (i, j)-th element of the PD matrix M. Note that the expectation is a linear operation. $\\Box$\n\n4 Integrating b-Bit Minwise Hashing with (Linear) Learning Algorithms\n\nLinear algorithms such as linear SVM and logistic regression have become very powerful and extremely popular. Representative software packages include SVMperf [20], Pegasos [31], Bottou\u2019s SGD SVM [5], and LIBLINEAR [13]. Given a dataset $\\{(x_i, y_i)\\}_{i=1}^{n}$, $x_i \\in \\mathbb{R}^D$, $y_i \\in \\{-1, 1\\}$, the L2-regularized linear SVM solves the following optimization problem:\n\n$$\\min_{w} \\frac{1}{2} w^T w + C \\sum_{i=1}^{n} \\max\\left\\{1 - y_i w^T x_i, 0\\right\\}, \\tag{6}$$\n\nand the L2-regularized logistic regression solves a similar problem:\n\n$$\\min_{w} \\frac{1}{2} w^T w + C \\sum_{i=1}^{n} \\log\\left(1 + e^{-y_i w^T x_i}\\right). \\tag{7}$$\n\nHere C > 0 is a regularization parameter. Since our purpose is to demonstrate the effectiveness of our proposed scheme using b-bit hashing, we simply provide results for a wide range of C values and assume that the best performance is achievable if we conduct cross-validations.\n\nIn our approach, we apply k random permutations on each feature vector $x_i$ and store the lowest b bits of each hashed value. This way, we obtain a new dataset which can be stored using merely nbk bits. 
At run-time, we expand each new data point into a $2^b \\times k$-length vector with exactly k 1\u2019s.\n\nFor example, suppose k = 3 and the hashed values are originally {12013, 25964, 20191}, whose binary digits are {010111011101101, 110010101101100, 100111011011111}. Consider b = 2. Then the binary digits are stored as {01, 00, 11} (which corresponds to {1, 0, 3} in decimals). At run-time, we need to expand them into a vector of length $2^b k = 12$, to be {0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0}, which will be the new feature vector fed to a solver such as LIBLINEAR. In this example, each stored value v sets a single 1 at position $2^b - 1 - v$ (counting from 0) within its own block of $2^b$ entries. Clearly, this expansion is directly inspired by the proof that the b-bit minwise hashing matrix is PD in Theorem 2.\n\n5 Experimental Results on Webspam Dataset\n\nOur experiment settings closely follow the work in [35]. They conducted experiments on three datasets, of which only the webspam dataset is public and reasonably high-dimensional (n = 350000, D = 16609143). Therefore, our experiments focus on webspam. Following [35], we randomly selected 20% of samples for testing and used the remaining 80% of samples for training.\n\nWe chose LIBLINEAR as the workhorse to demonstrate the effectiveness of our algorithm. All experiments were conducted on workstations with Xeon(R) CPU (W5590@3.33GHz) and 48GB RAM, under Windows 7 System. Thus, in our case, the original data (about 24GB in LIBSVM format) fit in memory. In applications where the data do not fit in memory, we expect that b-bit hashing will be even more advantageous, because the hashed data are very small by comparison. In fact, our experimental results will show that for this dataset, using k = 200 and b = 8 can achieve testing accuracies similar to those obtained with the original data. 
The effective storage for the reduced dataset (with 350K examples, using k = 200 and b = 8) would be merely about 70MB (350,000 \u00d7 200 \u00d7 8 bits).\n\n5.1 Experimental Results on Nonlinear (Kernel) SVM\n\nWe implemented a new resemblance kernel function and tried to use LIBSVM to train an SVM using the webspam dataset. The training time well exceeded 24 hours. Fortunately, using b-bit minwise hashing to estimate the resemblance kernels provides a substantial improvement. For example, with k = 150, b = 4, and C = 1, the training time is about 5185 seconds and the testing accuracy is quite close to the best results given by LIBLINEAR on the original webspam data.\n\n5.2 Experimental Results on Linear SVM\n\nThere is an important tuning parameter C. To capture the best performance and ensure repeatability, we experimented with a wide range of C values (from $10^{-3}$ to $10^2$) with fine spacings in [0.1, 10]. We experimented with k = 10 to k = 500, and b = 1, 2, 4, 6, 8, 10, and 16. Figure 1 (average) and Figure 2 (std, standard deviation) provide the test accuracies. Figure 1 demonstrates that using b \u2265 8 and k \u2265 200 achieves test accuracies similar to those obtained with the original data. Since our method is randomized, we repeated every experiment 50 times. We report both the mean and std values. Figure 2 illustrates that the stds are very small, especially with b \u2265 4. In other words, our algorithm produces stable predictions. 
For this dataset, the best performances were usually achieved at C \u2265 1.\n\n[Figure 1 panels: SVM test accuracy (%) versus C for k = 30, 50, 100, 150, 200, 300, 400, 500, with one curve per value of b.]\n\nFigure 1: SVM test accuracy (averaged over 50 repetitions). With k \u2265 200 and b \u2265 8, b-bit hashing achieves accuracies very similar to those obtained with the original data (dashed, red if color is available).\n\n[Figure 2 panels: standard deviation of SVM test accuracy (%) versus C for k = 50, 100, 200, 500, with one curve per value of b.]\n\nFigure 2: SVM test accuracy (std). The standard deviations are computed from 50 repetitions. When b \u2265 8, the standard deviations become extremely small (e.g., 0.02%).\n\nCompared with the original training time (about 100 seconds), Figure 3 (upper panels) shows that our method only needs about 3 seconds (near C = 1). Note that our reported training time did not include data loading (about 12 minutes for the original data and 10 seconds for the hashed data).\n\nCompared with the original testing time (about 150 seconds), Figure 3 (bottom panels) shows that our method needs merely about 2 seconds. Note that the testing time includes both the data loading time and the computing time, as designed by LIBLINEAR. 
The efficiency of testing may be very important in practice, for example, when the classifier is deployed in a user-facing application (such as search), while the cost of training or preprocessing may be less critical and can be conducted off-line.\n\n[Figure 3 panels: SVM training time (sec) and testing time (sec) versus C for k = 50, 100, 200, 500.]\n\nFigure 3: SVM training time (upper panels) and testing time (bottom panels). The original costs are plotted using dashed (red, if color is available) curves.\n\n5.3 Experimental Results on Logistic Regression\n\nFigure 4 presents the test accuracies and training time using logistic regression. Again, with k \u2265 200 and b \u2265 8, b-bit minwise hashing can achieve test accuracies similar to those obtained with the original data. The training time is substantially reduced, from about 1000 seconds to only about 30 seconds.\n\n[Figure 4 panels: logistic regression test accuracy (%) and training time (sec) versus C for k = 50, 100, 200, 500.]\n\nFigure 4: Logistic regression test accuracy (upper panels) and training time (bottom panels).\n\nIn summary, it appears b-bit hashing is highly effective in reducing the data size and speeding up the training (and testing), for both SVM and logistic regression. We notice that when using b = 16, the training time can be much larger than when using b \u2264 8. Interestingly, we find that b-bit hashing can be easily combined with Vowpal Wabbit (VW) [34] to further reduce the training time when b is large.\n\n6 Random Projections, Count-Min (CM) Sketch, and Vowpal Wabbit (VW)\n\nRandom projections [1, 24], Count-Min (CM) sketch [11], and Vowpal Wabbit (VW) [32, 34], as popular hashing algorithms for estimating inner products for high-dimensional datasets, are naturally applicable in large-scale learning. In fact, those methods are not limited to binary data. Interestingly, the three methods all have essentially the same variances. Note that in this paper, we use \u201cVW\u201d particularly for the hashing algorithm in [34], not the influential \u201cVW\u201d online learning platform.\n\n6.1 Random Projections\n\nDenote the first two rows of a data matrix by $u_1, u_2 \\in \\mathbb{R}^D$. The task is to estimate the inner product $a = \\sum_{i=1}^{D} u_{1,i} u_{2,i}$. The general idea is to multiply the data vectors by a random matrix $\\{r_{ij}\\} \\in \\mathbb{R}^{D \\times k}$, where $r_{ij}$ is sampled i.i.d. from the following generic distribution [24]:\n\n$$E(r_{ij}) = 0, \\quad \\mathrm{Var}(r_{ij}) = 1, \\quad E(r_{ij}^3) = 0, \\quad E(r_{ij}^4) = s, \\quad s \\ge 1. \\tag{8}$$\n\nNote that $\\mathrm{Var}(r_{ij}^2) = E(r_{ij}^4) - E^2(r_{ij}^2) = s - 1 \\ge 0$. This generates two k-dim vectors, $v_1$ and $v_2$:\n\n$$v_{1,j} = \\sum_{i=1}^{D} u_{1,i} r_{ij}, \\quad v_{2,j} = \\sum_{i=1}^{D} u_{2,i} r_{ij}, \\quad j = 1, 2, \\ldots, k. \\tag{9}$$\n\nThe general family of distributions (8) includes the standard normal distribution (in this case, s = 3) and the \u201csparse projection\u201d distribution specified as\n\n$$r_{ij} = \\sqrt{s} \\times \\begin{cases} 1 & \\text{with prob. } \\frac{1}{2s} \\\\ 0 & \\text{with prob. } 1 - \\frac{1}{s} \\\\ -1 & \\text{with prob. } \\frac{1}{2s} \\end{cases}$$\n\n[24] provided the following unbiased estimator $\\hat{a}_{rp,s}$ of a and the general variance formula:\n\n$$\\hat{a}_{rp,s} = \\frac{1}{k} \\sum_{j=1}^{k} v_{1,j} v_{2,j}, \\quad E(\\hat{a}_{rp,s}) = a = \\sum_{i=1}^{D} u_{1,i} u_{2,i}, \\tag{10}$$\n\n$$\\mathrm{Var}(\\hat{a}_{rp,s}) = \\frac{1}{k} \\left[ \\sum_{i=1}^{D} u_{1,i}^2 \\sum_{i=1}^{D} u_{2,i}^2 + a^2 + (s - 3) \\sum_{i=1}^{D} u_{1,i}^2 u_{2,i}^2 \\right], \\tag{11}$$\n\nwhich means s = 1 achieves the smallest variance. The only elementary distribution we know that satisfies (8) with s = 1 is the two-point distribution in {\u22121, 1} with equal probabilities.\n\n[23] proposed an improved estimator for random projections as the solution to a cubic equation. Because it cannot be written as an inner product, that estimator cannot be used for linear learning.\n\n6.2 Count-Min (CM) Sketch and Vowpal Wabbit (VW)\n\nAgain, in this paper, \u201cVW\u201d always refers to the hashing algorithm in [34]. VW may be viewed as a \u201cbias-corrected\u201d version of the Count-Min (CM) sketch [11]. In the original CM algorithm, the key step is to independently and uniformly hash elements of the data vectors to k buckets, and the hashed value is the sum of the elements in the bucket. That is, h(i) = j with probability $\\frac{1}{k}$, where $j \\in \\{1, 2, \\ldots, k\\}$. 
By writing $I_{ij} = \\begin{cases} 1 & \\text{if } h(i) = j \\\\ 0 & \\text{otherwise} \\end{cases}$, we can write the hashed data as\n\n$$w_{1,j} = \\sum_{i=1}^{D} u_{1,i} I_{ij}, \\quad w_{2,j} = \\sum_{i=1}^{D} u_{2,i} I_{ij}. \\tag{12}$$\n\nThe estimate $\\hat{a}_{cm} = \\sum_{j=1}^{k} w_{1,j} w_{2,j}$ is (severely) biased for estimating inner products. The original paper [11] suggested a \u201ccount-min\u201d step for positive data, by generating multiple independent estimates $\\hat{a}_{cm}$ and taking the minimum as the final estimate. That step can reduce but cannot remove the bias. Note that the bias can be easily removed by using $\\frac{k}{k-1} \\left( \\hat{a}_{cm} - \\frac{1}{k} \\sum_{i=1}^{D} u_{1,i} \\sum_{i=1}^{D} u_{2,i} \\right)$.\n\n[34] proposed a creative method for bias-correction, which consists of pre-multiplying (element-wise) the original data vectors with a random vector whose entries are sampled i.i.d. from the two-point distribution in {\u22121, 1} with equal probabilities. Here, we consider the general distribution (8). After applying multiplication and hashing on $u_1$ and $u_2$, the resultant vectors $g_1$ and $g_2$ are\n\n$$g_{1,j} = \\sum_{i=1}^{D} u_{1,i} r_i I_{ij}, \\quad g_{2,j} = \\sum_{i=1}^{D} u_{2,i} r_i I_{ij}, \\quad j = 1, 2, \\ldots, k, \\tag{13}$$\n\nwhere $E(r_i) = 0$, $E(r_i^2) = 1$, $E(r_i^3) = 0$, $E(r_i^4) = s$. We have the following theorem.\n\nTheorem 3\n\n$$\\hat{a}_{vw,s} = \\sum_{j=1}^{k} g_{1,j} g_{2,j}, \\quad E(\\hat{a}_{vw,s}) = \\sum_{i=1}^{D} u_{1,i} u_{2,i} = a, \\tag{14}$$\n\n$$\\mathrm{Var}(\\hat{a}_{vw,s}) = (s - 1) \\sum_{i=1}^{D} u_{1,i}^2 u_{2,i}^2 + \\frac{1}{k} \\left[ \\sum_{i=1}^{D} u_{1,i}^2 \\sum_{i=1}^{D} u_{2,i}^2 + a^2 - 2 \\sum_{i=1}^{D} u_{1,i}^2 u_{2,i}^2 \\right]. \\quad \\Box \\tag{15}$$\n\nInterestingly, the variance (15) says we do need s = 1, otherwise the additional term $(s - 1) \\sum_{i=1}^{D} u_{1,i}^2 u_{2,i}^2$ will not vanish even as the sample size $k \\to \\infty$. In other words, the choice of random distribution in VW is essentially the only option if we want to remove the bias by pre-multiplying the data vectors (element-wise) with a vector of random variables. 
Of course, once we let s = 1, the variance (15) becomes identical to the variance of random projections (11) with s = 1.\n\n7 Comparing b-Bit Minwise Hashing with VW (and Random Projections)\n\nWe implemented VW and experimented with it on the same webspam dataset. Figure 5 shows that b-bit minwise hashing is substantially more accurate (at the same sample size k) and requires significantly less training time (to achieve the same accuracy). Basically, 8-bit minwise hashing with k = 200 achieves test accuracies similar to VW with $k = 10^4 \\sim 10^6$ (note that we only stored the non-zeros).\n\n[Figure 5 panels: test accuracy (%) and training time (sec) versus k for SVM and logistic regression, VW vs b = 8 hashing.]\n\nFigure 5: The dashed (red if color is available) curves represent b-bit minwise hashing results (only for k \u2264 500) while solid curves represent VW. We display results for C = 0.01, 0.1, 1, 10, 100.\n\nThis empirical finding is not surprising, because the variance of b-bit hashing is usually substantially smaller than the variance of VW (and random projections). 
In the technical report (arXiv:1106.0967, which also includes the complete proofs of the theorems presented in this paper), we show that, at the same storage cost, b-bit hashing usually improves VW by 10- to 100-fold, by assuming each sample of VW needs 32 bits to store. Of course, even if VW only stores each sample using 16 bits, an improvement of 5- to 50-fold would still be very substantial.\n\nThere is one interesting issue here. Unlike random projections (and minwise hashing), VW is a sparsity-preserving algorithm, meaning that in the resultant sample vector of length k, the number of non-zeros will not exceed the number of non-zeros in the original vector. In fact, it is easy to see that the fraction of zeros in the resultant vector would be (at least) $\\left(1 - \\frac{1}{k}\\right)^c \\approx \\exp\\left(-\\frac{c}{k}\\right)$, where c is the number of non-zeros in the original data vector. In this paper, we mainly focus on the scenario in which c \u226b k, i.e., we use b-bit minwise hashing or VW for the purpose of data reduction. However, in some cases, we care about c \u226a k, because VW is also an excellent tool for compact indexing. In fact, our b-bit minwise hashing scheme for linear learning may face such an issue.\n\n8 Combining b-Bit Minwise Hashing with VW\n\nIn Figures 3 and 4, when b = 16, the training time becomes substantially larger than when b \u2264 8. Recall that at run-time, we expand the b-bit minwise hashed data to sparse binary vectors of length $2^b k$ with exactly k 1\u2019s. When b = 16, the vectors are very sparse. On the other hand, once we have expanded the vectors, the task is merely computing inner products, for which we can use VW.\n\nTherefore, at run-time, after we have generated the sparse binary vectors of length $2^b k$, we hash them using VW with sample size m (to differentiate from k). How large should m be? Theorem 4 may provide an insight. 
Recall that Section 2 provides the estimator, denoted by $\\hat{R}_b$, of the resemblance\nR, using b-bit minwise hashing. Now, suppose we first apply VW hashing with size m on the binary\nvector of length 2^b k before estimating R, which will introduce some additional randomness. We\ndenote the new estimator by $\\hat{R}_{b,vw}$. Theorem 4 provides its theoretical variance.\n\n[Figure 6 panels (Spam dataset, k = 200): SVM and Logit, 16-bit hashing + VW; accuracy (%) and training time (sec) versus C.]\n\nFigure 6: We apply VW hashing on top of the binary vectors (of length 2^b k) generated by b-bit\nhashing, with size m = 2^0 k, 2^1 k, 2^2 k, 2^3 k, 2^8 k, for k = 200 and b = 16. The numbers on the solid\ncurves (0, 1, 2, 3, 8) are the exponents. The dashed (red if color is available) curves are the results\nfrom only using b-bit hashing. 
When m = 2^8 k, this method achieves similar test accuracies (left\npanels) while substantially reducing the training time (right panels).\n\nTheorem 4\n\n$$\\mathrm{Var}\\left(\\hat{R}_{b,vw}\\right) = \\mathrm{Var}\\left(\\hat{R}_b\\right) + \\frac{1}{m} \\frac{1}{[1 - C_{2,b}]^2} \\left(1 + P_b^2 - \\frac{P_b(1 + P_b)}{k}\\right), \\qquad (16)$$\n\nwhere $\\mathrm{Var}\\left(\\hat{R}_b\\right) = \\frac{1}{k} \\frac{P_b(1 - P_b)}{[1 - C_{2,b}]^2}$ is given by (4) and $C_{2,b}$ is the constant defined in Theorem 1. \u25a1\n\nCompared to the original variance $\\mathrm{Var}\\left(\\hat{R}_b\\right)$, the additional term in (16) can be relatively large if\nm is small. Therefore, we should choose m \u226b k and m \u226a 2^b k. If b = 16, then m = 2^8 k may be a\ngood trade-off. Figure 6 provides an empirical study to verify this intuition.\n\n9 Limitations\nWhile using b-bit minwise hashing for training linear algorithms is successful on the webspam\ndataset, it is important to understand the following three major limitations of the algorithm:\n\n(A): Our method is designed for binary (0/1) sparse data.\n\n(B): Our method requires an expensive\npreprocessing step for generating k permutations of the data. For most applications, we expect that the\npreprocessing cost is not a major issue because the preprocessing can be conducted off-line (or combined\nwith the data-collection step) and is easily parallelizable. However, even if the speed is not a\nconcern, the energy consumption might be an issue, especially considering that (b-bit) minwise hashing\nis mainly used for industry applications. In addition, testing a new unprocessed data vector (e.g.,\na new document) will be expensive.\n\n(C): Our method performs only reasonably well in terms of\ndimension reduction. The processed data need to be mapped into binary vectors in 2^b \u00d7 k dimensions,\nwhich is usually not small. (Note that the storage cost is just bk bits.) 
For example, for the\nwebspam dataset, using b = 8 and k = 200 seems to suffice and 2^8 \u00d7 200 = 51200 is quite large,\nalthough it is much smaller than the original dimension of 16 million. It would be desirable if we\ncould further reduce the dimension, because the dimension determines the storage cost of the model\nand (moderately) increases the training time for batch learning algorithms such as LIBLINEAR.\n\nIn hopes of fixing the above limitations, we experimented with an implementation using another\nhashing technique named Conditional Random Sampling (CRS) [21, 22], which is not limited to\nbinary data and requires only one permutation of the original data (i.e., no expensive preprocessing).\nWe achieved some limited success. For example, CRS compares favorably to VW in terms of storage\n(to achieve the same accuracy) on the webspam dataset. However, so far CRS cannot compete\nwith b-bit minwise hashing for linear learning (in terms of training speed, storage cost, and model\nsize). The reason is that even though the estimator of CRS is an inner product, the normalization\nfactors (i.e., the effective sample size of CRS) to ensure unbiased estimates substantially differ pairwise\n(which is a significant advantage in other applications). In our implementation, we could not\nuse fully correct normalization factors, which led to severe bias of the inner product estimates\nand less than satisfactory performance of linear learning compared to b-bit minwise hashing.\n\n10 Conclusion\nAs data sizes continue to grow faster than memory and computational power, statistical learning\ntasks in industrial practice are increasingly faced with training datasets that exceed the resources on\na single server. 
A number of approaches have been proposed that address this by either scaling out\nthe training process or partitioning the data, but both solutions can be expensive.\n\nIn this paper, we propose a compact representation of sparse, binary data sets based on b-bit minwise\nhashing, which can be naturally integrated with linear learning algorithms such as linear SVM and\nlogistic regression, leading to dramatic improvements in training time and/or resource requirements.\nWe also compare b-bit minwise hashing with the Count-Min (CM) sketch and Vowpal Wabbit (VW)\nalgorithms, which, according to our analysis, all have (essentially) the same variances as random\nprojections [24]. Our theoretical and empirical comparisons illustrate that b-bit minwise hashing is\nsignificantly more accurate (at the same storage) for binary data. There are various limitations (e.g.,\nexpensive preprocessing) in our proposed method, leaving ample room for future research.\n\nAcknowledgement\nThis work is supported by NSF (DMS-0808864), ONR (YIP-N000140910911), and a grant from\nMicrosoft. We thank John Langford and Tong Zhang for helping us better understand the VW hashing\nalgorithm, and Chih-Jen Lin for his patient explanation of the LIBLINEAR package and datasets.\n\nReferences\n[1] Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671\u2013687, 2003.\n[2] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Commun. ACM, volume 51, pages 117\u2013122, 2008.\n[3] Harald Baayen. Word Frequency Distributions, volume 18 of Text, Speech and Language Technology. Kluwer Academic Publishers, 2001.\n[4] Michael Bendersky and W. Bruce Croft. Finding text reuse on the web. In WSDM, pages 262\u2013271, Barcelona, Spain, 2009.\n[5] Leon Bottou. http://leon.bottou.org/projects/sgd.\n[6] Andrei Z. Broder. 
On the resemblance and containment of documents. In the Compression and Complexity of Sequences, pages 21\u201329, Positano, Italy, 1997.\n[7] Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic clustering of the web. In WWW, pages 1157\u20131166, Santa Clara, CA, 1997.\n[8] Olivier Chapelle, Patrick Haffner, and Vladimir N. Vapnik. Support vector machines for histogram-based image classification. IEEE Trans. Neural Networks, 10(5):1055\u20131064, 1999.\n[9] Ludmila Cherkasova, Kave Eshghi, Charles B. Morrey III, Joseph Tucek, and Alistair C. Veitch. Applying syntactic similarity algorithms for enterprise information management. In KDD, pages 1087\u20131096, Paris, France, 2009.\n[10] Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, Michael Mitzenmacher, Alessandro Panconesi, and Prabhakar Raghavan. On compressing social networks. In KDD, pages 219\u2013228, Paris, France, 2009.\n[11] Graham Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58\u201375, 2005.\n[12] Yon Dourisboure, Filippo Geraci, and Marco Pellegrini. Extraction and classification of dense implicit communities in the web graph. ACM Trans. Web, 3(2):1\u201336, 2009.\n[13] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. Liblinear: A library for large linear classification. Journal of Machine Learning Research, 9:1871\u20131874, 2008.\n[14] Dennis Fetterly, Mark Manasse, Marc Najork, and Janet L. Wiener. A large-scale study of the evolution of web pages. In WWW, pages 669\u2013678, Budapest, Hungary, 2003.\n[15] George Forman, Kave Eshghi, and Jaap Suermondt. Efficient detection of large-scale redundancy in enterprise file systems. SIGOPS Oper. Syst. Rev., 43(1):84\u201391, 2009.\n[16] Sreenivas Gollapudi and Aneesh Sharma. An axiomatic approach for result diversification. 
In WWW, pages 381\u2013390, Madrid, Spain, 2009.\n[17] Matthias Hein and Olivier Bousquet. Hilbertian metrics and positive definite kernels on probability measures. In AISTATS, pages 136\u2013143, Barbados, 2005.\n[18] Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear svm. In Proceedings of the 25th international conference on Machine learning, ICML, pages 408\u2013415, 2008.\n[19] Yugang Jiang, Chongwah Ngo, and Jun Yang. Towards optimal bag-of-features for object categorization and semantic video retrieval. In CIVR, pages 494\u2013501, Amsterdam, Netherlands, 2007.\n[20] Thorsten Joachims. Training linear svms in linear time. In KDD, pages 217\u2013226, Pittsburgh, PA, 2006.\n[21] Ping Li and Kenneth W. Church. Using sketches to estimate associations. In HLT/EMNLP, pages 708\u2013715, Vancouver, BC, Canada, 2005 (The full paper appeared in Computational Linguistics in 2007).\n[22] Ping Li, Kenneth W. Church, and Trevor J. Hastie. Conditional random sampling: A sketch-based sampling technique for sparse data. In NIPS, pages 873\u2013880, Vancouver, BC, Canada, 2006 (Newer results appeared in NIPS 2008).\n[23] Ping Li, Trevor J. Hastie, and Kenneth W. Church. Improving random projections using marginal information. In COLT, pages 635\u2013649, Pittsburgh, PA, 2006.\n[24] Ping Li, Trevor J. Hastie, and Kenneth W. Church. Very sparse random projections. In KDD, pages 287\u2013296, Philadelphia, PA, 2006.\n[25] Ping Li and Arnd Christian K\u00f6nig. Theory and applications of b-bit minwise hashing. In Commun. ACM, 2011.\n[26] Ping Li and Arnd Christian K\u00f6nig. Accurate estimators for improving minwise hashing and b-bit minwise hashing. Technical report, 2011 (arXiv:1108.0895).\n[27] Ping Li and Arnd Christian K\u00f6nig. b-bit minwise hashing. 
In WWW, pages 671\u2013680, Raleigh, NC, 2010.\n[28] Ping Li, Arnd Christian K\u00f6nig, and Wenhao Gui. b-bit minwise hashing for estimating three-way similarities. In NIPS, Vancouver, BC, 2010.\n[29] Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma. Detecting Near-Duplicates for Web-Crawling. In WWW, Banff, Alberta, Canada, 2007.\n[30] Marc Najork, Sreenivas Gollapudi, and Rina Panigrahy. Less is more: sampling the neighborhood graph makes salsa better and faster. In WSDM, pages 242\u2013251, Barcelona, Spain, 2009.\n[31] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for svm. In ICML, pages 807\u2013814, Corvallis, Oregon, 2007.\n[32] Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, and S.V.N. Vishwanathan. Hash kernels for structured data. Journal of Machine Learning Research, 10:2615\u20132637, 2009.\n[33] Simon Tong. Lessons learned developing a practical large scale machine learning system. http://googleresearch.blogspot.com/2010/04/lessons-learned-developing-practical.html, 2008.\n[34] Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. Feature hashing for large scale multitask learning. In ICML, pages 1113\u20131120, 2009.\n[35] Hsiang-Fu Yu, Cho-Jui Hsieh, Kai-Wei Chang, and Chih-Jen Lin. Large linear classification when data cannot fit in memory. In KDD, pages 833\u2013842, 2010.", "award": [], "sourceid": 1455, "authors": [{"given_name": "Ping", "family_name": "Li", "institution": null}, {"given_name": "Anshumali", "family_name": "Shrivastava", "institution": null}, {"given_name": "Joshua", "family_name": "Moore", "institution": null}, {"given_name": "Arnd", "family_name": "K\u00f6nig", "institution": null}]}