{"title": "Re-randomized Densification for One Permutation Hashing and Bin-wise Consistent Weighted Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 15926, "page_last": 15936, "abstract": "Jaccard similarity is widely used as a distance measure in many machine learning\nand search applications. Typically, hashing methods are essential for the use of\nJaccard similarity to be practical in large-scale settings. For hashing binary (0/1)\ndata, the idea of one permutation hashing (OPH) with densification significantly\naccelerates traditional minwise hashing algorithms while providing unbiased and\naccurate estimates. In this paper, we propose a strategy named \u201cre-randomization\u201d\nin the process of densification that could achieve the smallest variance among all\ndensification schemes. The success of this idea naturally inspires us to generalize\none permutation hashing to weighted (non-binary) data, which results in the socalled \u201cbin-wise consistent weighted sampling (BCWS)\u201d algorithm. We analyze the\nbehavior of BCWS and compare it with a recent alternative. Extensive experiments\non various datasets illustrates the effectiveness of our proposed methods.", "full_text": "Re-randomized Densi\ufb01cation for One Permutation\nHashing and Bin-wise Consistent Weighted Sampling\n\nPing Li\n\nCognitive Computing Lab\n\nBaidu Research\n\nBellevue, WA 98004, USA\n\nliping11@baidu.com\n\nXiaoyun Li\u2217\n\nDepartment of Statistics\n\nRutgers University\n\nPiscataway, NJ 08854, USA\nxiaoyun.li@rutgers.edu\n\nCun-Hui Zhang\u2020\n\nDepartment of Statistics\n\nRutgers University\n\nPiscataway, NJ 08854, USA\ncunhui@stat.rutgers.edu\n\nAbstract\n\nJaccard similarity is widely used as a distance measure in many machine learning\nand search applications. Typically, hashing methods are essential for the use of\nJaccard similarity to be practical in large-scale settings. 
For hashing binary (0/1) data, the idea of one permutation hashing (OPH) with densification significantly accelerates traditional minwise hashing algorithms while providing unbiased and accurate estimates. In this paper, we propose a “re-randomization” strategy in the process of densification and we show that it achieves the smallest variance among existing densification schemes. The success of this idea inspires us to generalize one permutation hashing to weighted (non-binary) data, resulting in the so-called “bin-wise consistent weighted sampling (BCWS)” algorithm. We analyze the behavior of BCWS and compare it with a recent alternative. Experiments on a range of datasets and tasks confirm the effectiveness of the proposed methods. We expect that BCWS will be adopted in practice for training kernel machines and fast similarity search.

1 Introduction

In recent years, there has been a surge of interest in studying the following measure of similarity for nonnegative data [17, 6, 12, 26, 14, 20, 29, 21, 22]:

J(S, T) = (Σ_{i=1}^D min(S_i, T_i)) / (Σ_{i=1}^D max(S_i, T_i)),   (1)

where S, T ∈ R^D are two D-dimensional data vectors with only nonnegative entries. This “min-max” measure is a generalization of the “Jaccard similarity” for binary (0/1) data. For simplicity, in this paper, we will use “Jaccard” regardless of whether the data are binary or non-binary. We should also mention that J(S, T) has been successfully extended to include data with negative entries [21, 22]. In fact, under a fairly general distributional assumption, J → (1 − √((1 − ρ)/2)) / (1 + √((1 − ρ)/2)), where ρ is the correlation [25].

While J(S, T) in Eq.
(1) appears deceivingly simple, the work of [20, 21, 22] demonstrated, through extensive empirical studies, that this measure of similarity is surprisingly effective when it is used as a kernel for classification (e.g., SVM and logistic regression). In many public datasets, using this (tuning-free) kernel resulted in a substantial increase in classification accuracy, compared to the (best-tuned) radial basis function (RBF) kernel. Furthermore, the “tunable” version [22] of J(S, T) is even able to achieve classification accuracy comparable to boosted trees (and deep nets) [18, 19]. Since J(S, T) is a type of nonlinear kernel, in order to use it for even medium-scale datasets, we must be able to “linearize” this kernel. Scaling nonlinear kernel machines is a known non-trivial task [2]. For example, we cannot even store a kernel matrix in memory for a dataset with only 1,000,000 training samples, which has 10^12 ≈ 2^40 entries and would need multiple terabytes of storage.

*The work of Xiaoyun Li was conducted during the internship at Baidu Research.
†The work of Cun-Hui Zhang was conducted as a consulting researcher at Baidu Research.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1.1 Consistent Weighted Sampling (CWS)

The method of consistent weighted sampling (CWS) [12, 26, 14] is the standard strategy for efficiently computing the Jaccard similarity in Eq. (1). This algorithm is summarized in Algorithm 1, for hashing the vector S as an example. For all other data vectors (e.g., T), we apply the same randomization, i.e., using the same random numbers (r_i, c_i, β_i) in Algorithm 1. For vectors S and T, we denote the outputs as (i*
_S, t*_S) and (i*_T, t*_T), respectively. Then the following interesting probability result holds:

Pr(i*_S = i*_T and t*_S = t*_T) = J(S, T).   (2)

Algorithm 1: Consistent Weighted Sampling (CWS).
1 Input: (Non-negative) data vector S_i, i = 1 to D
2 Output: Consistent uniform sample (i*, t*)
3 For every nonzero S_i
4     r_i ∼ Gamma(2, 1), c_i ∼ Gamma(2, 1), β_i ∼ Uniform(0, 1)
5     t_i ← ⌊(log S_i)/r_i + β_i⌋,  a_i ← log(c_i) − r_i (t_i + 1 − β_i)
6 End For
7 i* ← argmin_i a_i,  t* ← t_{i*}

After repeating the randomization M times, one can then estimate the similarity as

Ĵ = (1/M) Σ_{j=1}^M 1{i*_{S,j} = i*_{T,j} and t*_{S,j} = t*_{T,j}},   (3)

E(Ĵ) = J,   Var(Ĵ) = (1/M) J(1 − J).   (4)

Note that this estimate Ĵ is actually a linear (inner product) kernel in a (sparse) high-dimensional space: there will be exactly M 1’s if we expand the samples into one sparse vector. The prior work [20, 21, 22] already demonstrated the effectiveness of CWS for training kernel SVMs.

1.2 The Computational Bottleneck of CWS

CWS as presented in Algorithm 1 is fairly complex. Also, it needs O(f̄ M) computations to process one data vector, where f̄ is the average number of nonzero entries per data vector. For many important applications, f̄ ≪ D, especially when D is large (i.e., high-dimensional data). This cost is actually very expensive and can be the bottleneck if engineers hope to use CWS in practice. It would be highly desirable if the computational cost (per data vector) could be reduced to O(f̄).
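To make Algorithm 1 and the estimator in Eq. (3) concrete, the following is a minimal, self-contained Python sketch. The seeding scheme (one seed per repetition) is our own illustrative choice; any scheme that feeds every vector the same (r_i, c_i, β_i) preserves consistency.

```python
import math
import random

def cws_hash(S, seed):
    # One CWS sample (i*, t*) for a nonnegative vector S (Algorithm 1).
    # The randomness (r_i, c_i, beta_i) depends only on `seed` and the
    # coordinate order, so every vector hashed with the same seed uses the
    # same random numbers -- this is what makes the samples "consistent".
    rng = random.Random(seed)
    best, best_a = None, math.inf
    for i, s in enumerate(S):
        r = rng.gammavariate(2.0, 1.0)       # drawn for every i, even zeros,
        c = rng.gammavariate(2.0, 1.0)       # to keep the random streams
        beta = rng.uniform(0.0, 1.0)         # aligned across vectors
        if s <= 0:
            continue
        t = math.floor(math.log(s) / r + beta)
        a = math.log(c) - r * (t + 1.0 - beta)
        if a < best_a:                       # line 7: i* = argmin_i a_i
            best_a, best = a, (i, t)
    return best

def estimate_jaccard(S, T, M=500):
    # Eq. (3): the fraction of matching samples over M repetitions.
    hits = sum(cws_hash(S, m) == cws_hash(T, m) for m in range(M))
    return hits / M

S = [0.0, 2.0, 1.0, 0.0, 3.0]
T = [1.0, 2.0, 0.5, 0.0, 3.0]
J = sum(min(s, t) for s, t in zip(S, T)) / sum(max(s, t) for s, t in zip(S, T))
print(J, estimate_jaccard(S, T))   # the estimate should be close to J = 5.5/7
```

By Eq. (4) the estimator has variance J(1 − J)/M, so with M = 500 the two printed numbers should agree to within a few hundredths.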
Note that since we anyway have to touch each data entry (at least) once, O(f̄) is the minimal possible cost.

1.3 Bin-wise CWS (BCWS)

Bin-wise CWS (or BCWS) appears to be a natural idea, although it has taken us a fairly long journey to get it into the current form as presented in this paper. We will elaborate on the algorithm in detail and explain other attempts we tried which did not lead to satisfactory results.

With BCWS, we first conduct a random permutation on the columns of the data matrix, then we group the columns into equal-sized bins and perform CWS on each bin separately. Suppose we break the columns into K = M bins and each bin contributes one sample; then the processing cost would be only O(f̄) (where f̄ is the average number of nonzero entries). Intuitively, this strategy should work well if D is large and the data entries are “well-behaved” (e.g., entries follow a Gaussian or, more generally, a non-heavy-tailed distribution). This is because, when D is large and K is not too large, each bin with D/K entries would still be a good representative sample of the original data.

The real world is typically not this ideal. When D is large, real data often tend to be (highly) sparse, which means some bins may have only a small number of nonzero entries or may even be empty. Therefore, we must be able to deal with empty bins. Algorithm 2 is a generic description of BCWS.

Algorithm 2: A generic description of BCWS.
1 Randomly group a dataset of D columns evenly into K bins.
2 For each data vector:
3     For j = 1 to M
4         Pick one non-empty bin.
5         Generate one CWS sample.
6     End For
7 End For
/* In fact, we can replace CWS with other methods, e.g., [6, 29]. */

This is just the beginning of the story. Next, we need to develop strategies to implement lines 4 and 5.

1. How to pick a non-empty bin? 
We present results for two strategies.

(a) The first strategy is to treat, for each data vector, the bins as a K-dimensional binary data vector and apply classical min-wise hashing [4, 3] to choose a non-empty bin. We denote this strategy as Rs (random select).

(b) A better strategy is to apply the idea of “one permutation hashing” [24] and densification [30]. Instead of always (randomly) selecting a non-empty bin, we focus only on the empty bins. If a bin is empty, then we select a bin from the non-empty ones according to some strategy. It was shown in [30] that a good strategy is to select fully at random from all bins and stop once a non-empty bin is reached. We denote it as Den (densification) and will describe this strategy in more detail.

2. How to generate a CWS sample from a non-empty bin? We also present two strategies.

(a) For each non-empty bin, we generate one CWS sample and always output this sample whenever this bin is picked.

(b) We always generate a new CWS sample whenever this bin is picked.

Therefore, we will present in total four variants of BCWS. Quite a few years back, when we started to study this problem, we tried various other proposals for picking the non-empty bins. One intuitive strategy, which initially appeared very reasonable, is to pick the bins with probabilities proportional to the sums of the elements in all bins. However, after many unsuccessful attempts, we eventually realized that we should just focus on whether bins are empty or non-empty, i.e., our Algorithm 2.

Note that, for binary data, the CWS algorithm generates statistically equivalent samples as classical minwise hashing. Thus, to use BCWS on binary data, we just need to apply the standard minwise hashing method (instead of CWS) whenever a non-empty bin is picked. In other words, for binary data, our study of BCWS actually leads to a new densification scheme. 
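Since, on binary data, BCWS amounts to minwise hashing within bins plus densification, the flow of the densify-and-re-randomize variant can be sketched as follows. This is only an illustrative sketch: the per-element scoring function and the integer seeding scheme are stand-ins for the permutations and universal hashing used in actual implementations, and we assume the vector is non-empty and K divides D.

```python
import random

def element_hash(e, salt):
    # Consistent per-element score: depends only on (salt, e), so every vector
    # sees the same value -- an illustrative stand-in for a random permutation.
    return random.Random(1_000_003 * salt + e).random()

def bcws_binary_denre(nonzeros, D, K, seed=0):
    # Sketch of BCWS on binary data (DenRe flavor, M = K samples):
    # one shared permutation, K equal-sized bins, a minwise hash inside each
    # non-empty bin, and, for each empty bin, a shared random walk over bins
    # plus a fresh ("re-randomized") minwise hash inside the donor bin.
    d = D // K
    rng = random.Random(seed)
    perm = list(range(D))
    rng.shuffle(perm)                      # one permutation, shared by all vectors
    bins = [[] for _ in range(K)]
    for i in nonzeros:
        bins[perm[i] // d].append(perm[i])

    samples = []
    for k in range(K):
        if bins[k]:                        # non-empty bin: use its own minhash
            donor, salt = k, 2 * k
        else:                              # empty bin: densify via a shared walk
            walk = random.Random(seed + 7_919 * (k + 1))
            donor = walk.randrange(K)
            while not bins[donor]:
                donor = walk.randrange(K)
            salt = 2 * k + 1               # re-randomize: fresh hash for this slot
        samples.append(min(bins[donor], key=lambda e: element_hash(e, salt)))
    return samples

# Two sets with true Jaccard similarity 200/400 = 0.5
S = set(range(0, 300)); T = set(range(100, 400))
hS = bcws_binary_denre(S, D=2048, K=256)
hT = bcws_binary_denre(T, D=2048, K=256)
print(sum(a == b for a, b in zip(hS, hT)) / 256)   # close to 0.5
```

Hashing two sets with the same seed and counting matching positions estimates their Jaccard similarity; with K = 256 bins the printed estimate for this example should land near the true value 0.5.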
In the next two sections, after we first review minwise hashing, one permutation hashing, and densification, we will illustrate how our work improves over the existing densification schemes for binary data.

2 Minwise Hashing, One Permutation Hashing, and Densification

Minwise hashing (minhash) [4, 3] was initially developed for efficiently computing Jaccard similarities (a.k.a. resemblances) for the task of duplicate web page removal. Since then, the technique has been widely applied to numerous practical tasks, e.g., [10, 28, 11, 7, 8, 13, 9, 27, 16, 32, 15, 5, 1]. Consider two sets S, T ⊆ Ω = {1, 2, 3, ..., D}. Suppose a random permutation π is performed on Ω, i.e., π : Ω → Ω. An elementary probability argument shows that

Pr(min(π(S)) = min(π(T))) = |S ∩ T| / |S ∪ T| = J(S, T).   (5)

For the sake of simplicity, with a slight abuse of notation, we use J(S, T) to denote the Jaccard similarity between two sets S and T, and J(S, T) to denote the generalized Jaccard similarity between two vectors S and T. In order to estimate J, one will need to repeat the permutations M times. This computational burden has been resolved by the idea of “one permutation hashing” [24]. As illustrated in Figure 1, after applying one permutation on the columns, we break the column space evenly into K bins (where K = M in [24]) and then use the locations of the first nonzero entries in all bins as the hashed data. This substantially reduces the processing time and makes minwise hashing truly practical. However, new problems arise as there will inevitably be empty bins in sparse data.

To deal with empty bins, [31] proposed a strategy of directly borrowing hashed data from neighboring bins. The work [30] proposed an improvement by selecting non-empty bins randomly from all bins. 
Basically, we can understand the strategy in [30] as using a random permutation π′ : {1, 2, ..., K} → {1, 2, ..., K}. If one bin is empty, one then follows the permutation π′ until a non-empty bin is found. [30] also used an argument based on “universal hashing” to claim that cycling can be avoided.

Through an elegant analysis, [30] claimed that it achieved the smallest (optimal) variance. Interestingly, the performance of the densification scheme in [30] can still be improved, in the sense that, given a fixed budget of storage (i.e., sample size M), the variance can be further reduced via re-randomization on each selected non-empty bin. This finding is useful, practically and theoretically.

Figure 1: Demonstration of one permutation hashing. For example, for S1, the hashed values are [∗, 5, 11, 14], with the first bin being empty.

3 Binary Data: Densification with Re-randomization

We continue to use the example in Figure 1. After one permutation and binning, the non-empty bins for each set constitute a set: I_S1 = {1, 2, 3} for S1, I_S2 = {1, 2, 4} for S2, and I_S3 = {1, 3, 4} for S3, respectively. As briefly mentioned in the Introduction, there are four strategies to generate samples. Suppose we first group the data matrix columns into K bins and we hope to obtain M samples.

1. Rs: For the original set S_i, we randomly select a non-empty bin using minwise hashing on the set I_Si. For each non-empty bin, we have already generated a hashed value from minhash within the bin. Once a non-empty bin is selected, we return its hashed value.

2. RsRe: For the original set S_i, we randomly select a non-empty bin using minwise hashing on the set I_Si. Once a non-empty bin is selected, we perform a minwise hashing within the bin and return the hashed value.

For Rs and RsRe, we repeat the procedure M times to obtain M samples, regardless of K, the number of bins. 
It is often the case that we use M = K, but we do not have to.

3. Den: After we group the data matrix columns into K bins, we “densify” (fill in) the empty bins from the beginning of the bins, using the “optimal” densification strategy described in [30]. For each non-empty bin, we have already generated a sample by a minwise hashing within the bin. If M < K, then we stop once we have collected M samples.

4. DenRe (Densification with Re-randomization): For each empty bin, after we fill it in with one non-empty bin, we re-do a minwise hashing within that bin and output a sample.

Here we recap the procedure for “optimal densification” as described in the nice work of [30]. Conceptually, we have another random permutation π′ : {1, 2, ..., K} → {1, 2, ..., K}. If one bin is empty, one then follows the permutation π′ until a non-empty bin is found. Based on the “universal hashing” argument, [30] claimed that cycling can be avoided. The actual implementation is slightly more sophisticated than just using a random permutation. We refer readers to [30] for more details.

Note that, for binary data, Den is basically the scheme in [30]. Even though it is called “optimal densification” in [30], in this paper we will show that the variance can actually be further reduced by a re-randomization (Re) step, through a careful analysis. Also, note that while [24, 31, 30] always let M = K, in this study we have relaxed this constraint by providing more general theoretical results.

3.1 Theoretical Results

In the original minwise hashing, we have an (unbiased) estimator Ĵ of the Jaccard similarity (analogous to Eq.
(3)) with a variance (1/M) J(1 − J), in order to generate M samples. With the four densification schemes described above, we now have four more (unbiased) estimators, which are respectively denoted by Ĵ^M_Rs, Ĵ^M_RsRe, Ĵ^M_Den, and Ĵ^M_DenRe. Here, we use the superscript M to emphasize the sample size. The following lemma gives an important quantity in the re-randomization process.

Lemma 1. Let d = D/K be an integer. Assume B is the index set of a simultaneously non-empty bin. Let f = |S ∪ T|. Denote by f̃ = Σ_{i∈I_B} max(S_i, T_i) the number of nonzeros in bin B. Then we have

E_0 ≜ E[1/f̃ | I_emp,B = 0] = Σ_{j=max(1, d+f−D)}^{min(d, f)} (1/j) · C(f, j) C(D−f, d−j) / [C(D, d) − 1{d+f−D ≤ 0} C(D−f, d)],   (6)

where I_emp,B denotes the indicator of the event that bin B is empty. Conditional on the event that there are m simultaneously non-empty bins, we have

Ẽ_0(m) ≜ E[1/f̃ | I_emp,B = 0, m] = Σ_{j=1∨[f−(m−1)d]}^{d∧(f−m+1)} (1/j) · C(d, j) H(m−1, f−j | d) / H(m, f | d),   (7)

where the following recursion holds for all k ≤ K:

H(k, n | d) = Σ_{j=max{1, n−(k−1)d}}^{min{d, n−k+1}} C(d, j) H(k−1, n−j | d),   H(1, n | d) = C(d, n).

A careful analysis derives the following theory for the variances of the proposed four estimators.

Theorem 1. Let Ẽ_0(m) be defined as in Lemma 1. Let f_1 = |S|, f_2 = |T|, a = |S ∩ T|, f = |S ∪ T| = f_1 + f_2 − a, J = a/f, and J̃ = (a−1)/(f−1). N^M_emp is the number of empty bins out of the first M ≤ K bins; if M > K, then N^M_emp = N^K_emp. We have

Var(Ĵ^M_Rs) = J/M + ((M−1)/M) E_1 − J²,
Var(Ĵ^M_RsRe) = J/M + ((M−1)/M) E_2 − J².

If M ≤ K, then

Var(Ĵ^M_Den) = J/M + (1/M²) E[(M − N^M_emp)(M − N^M_emp − 1)] J J̃ + (1/M²) E[N^M_emp (2M − N^M_emp − 1)] E_1 − J²,
Var(Ĵ^M_DenRe) = J/M + (1/M²) E[(M − N^M_emp)(M − N^M_emp − 1)] J J̃ + (1/M²) E[N^M_emp (2M − N^M_emp − 1)] E_2 − J².

If M > K, then

Var(Ĵ^M_Den) = (1/M²) [K² (Var(Ĵ^K_Den) + J²) + (M − K)(M + K − 1) E_1 + (M − K) J] − J²,
Var(Ĵ^M_DenRe) = (1/M²) [K² (Var(Ĵ^K_DenRe) + J²) + (M − K)(M + K − 1) E_2 + (M − K) J] − J²,

where E_1 = E[J/(K − N^K_emp) + (1 − 1/(K − N^K_emp)) J J̃] and E_2 = E[(Ẽ_0(K − N^K_emp)/(K − N^K_emp)) J + (1 − Ẽ_0(K − N^K_emp)/(K − N^K_emp)) J J̃].

Note that the above theorem involves the probability distribution of N^M_emp, where M ≤ K. [24] derived this probability for a simpler case. Here, we provide the general result.

Theorem 2. The distribution of N^M_emp, where M ≤ K, is given by

Pr{N^M_emp = j} = C(M, j) Σ_{ℓ=0}^{M−j} (−1)^ℓ C(M−j, ℓ) C(D(1 − (j+ℓ)/K), f) / C(D, f).

We also have the asymptotic convergence given in the following theorem.

Theorem 3. Suppose K is finite and fixed. Then, as M → ∞,

lim_{M→∞} Var(Ĵ^M_Rs) = lim_{M→∞} Var(Ĵ^M_Den) = E_1 − J²,
lim_{M→∞} Var(Ĵ^M_RsRe) = lim_{M→∞} Var(Ĵ^M_DenRe) = E_2 − J².

Proof. The results simply follow by taking M → ∞ in Theorem 1.

Figure 2: Verification of theoretical results in Theorem 1 for estimating Jaccard similarity between two binary vectors (“HONG” and “KONG”). We report the empirical MSEs (solid curves), which overlap the theoretical variances (dashed curves). Note that it is possible for DenRe to have smaller variance than the original minwise hashing. (Bottom panels are zoomed-in versions of the upper panels.)

3.2 Sanity Check: An Empirical Study to Verify the Theoretical Results

Figure 2 presents a sanity check of our theoretical results, from 10^5 simulations, for estimating the Jaccard similarity between two word vectors: “HONG” and “KONG”. 
Basically, “HONG” denotes the vector of occurrence (0/1) of the word “HONG” in a repository of D = 2^16 documents. As expected, the data are highly sparse and reflect the real-world situation. This dataset (named “Words”) has been used in a few previous papers on hashing and sketching, as early as 2005 [23]. From Figure 2, we can see that the empirical MSEs (mean square error = Var + Bias², solid curves) overlap the (dashed) theoretical curves very well, confirming the theoretical results in Theorem 1.

In Theorem 1, since E_0 ≤ 1 and J̃ < J, we always have E_2 ≤ E_1, and thus Var(Ĵ^M_DenRe) ≤ Var(Ĵ^M_Den) and Var(Ĵ^M_RsRe) ≤ Var(Ĵ^M_Rs) hold for all positive M. We can see that the equality is attained only when |S ∪ T| = 1, i.e., each data vector contains at most 1 nonzero entry. Hence, in essentially all cases, the variance of the re-randomized approaches (RsRe, DenRe) is smaller than that of the corresponding counterparts (Rs, Den), and the improvement can be substantial in some cases. In fact, this re-randomized densification procedure achieves the smallest variance among all existing OPH variants, since maximum randomness is introduced in both the bin selection and hash reassignment steps. Thus, our proposed DenRe approach is able to generate the most accurate estimator under the efficient OPH scheme. We also remark that, under the setting of [30], where the asymptotic analysis is in the sense of K = M → ∞, the variance of Ĵ^M_DenRe is always the smallest and converges to zero.

Running time. Let f̄ be the average number of nonzero entries of all sets. To generate M samples for each set, the vanilla minwise hashing algorithm has a running time of O(M f̄). Clearly, one permutation hashing with densification is able to dramatically reduce the processing time. For simplicity, consider M = K. The running time of the Den scheme is O(f̄ + 2K + K/(K − N^K_emp) · N^K_emp). The re-randomized approach DenRe takes O(f̄ + 2K + K/(K − N^K_emp) · N^K_emp + (f̄/K) · N^K_emp), where the additional cost comes from re-doing minwise hashing for each empty bin. This additional cost (f̄/K) · N^K_emp is in general minimal. One can also see that, as the number of empty bins N^K_emp becomes larger, the variance reduction effect due to re-randomization becomes even more substantial.

Next, we will generalize our densification scheme to non-binary data, i.e., the BCWS algorithm.

4 Weighted Data: Bin-wise Consistent Weighted Sampling (BCWS)

We have provided a comprehensive analysis of densification schemes for one permutation hashing. However, these methods are constrained to binary data. Given the significant acceleration of bin-wise type algorithms, one may ask: can we extend one permutation hashing, which is designed specifically for binary sets, to weighted sets? The main concern is that, unlike in the binary case, different weights are assigned to entries, which intrinsically gives bins different amounts of information. Using the same strategy as in one permutation hashing could no longer provide unbiasedness. 
In what follows, we show that applying bin-wise CWS is also theoretically plausible with a moderate number of bins K, and that it provides a significant speedup with very similar empirical performance compared to the original CWS procedure.

4.1 Concentration of BCWS Estimates

Consider two non-negative real-valued data vectors S, T ∈ R^D. For simplicity, we assume M = K. Intuitively, one may expect that the BCWS estimator is “nearly unbiased” if the data are “reasonably behaved”, in the sense that information is approximately uniformly distributed among the entries.

Theorem 4. Consider two non-negative real-valued vectors S, T ∈ R^D. Denote by μ_1, μ_2 and μ_3 the averages of S, T and S ∨ T, and let σ_1, σ_2 and σ_3 be the corresponding standard deviations. Further assume that D/K is an integer. After applying a random permutation π, h_k(·) is the hash tuple generated in bin k. 
Denote by Ĵ_BCWS(π) = (1/K) Σ_{k=1}^K 1{h_k(S) = h_k(T) | π} the estimator given by BCWS. For any t > 0, when K ≤ min_i (μ_i/σ_i) √(D/t), we have

P{ [(1 − K(δ_1(t) ∨ δ_2(t))) J − K(δ_3(t) + (δ_1(t) ∨ δ_2(t)))] / (1 + Kδ_3(t)) ≤ E[Ĵ_BCWS(π)] ≤ [(1 + K(δ_1(t) ∨ δ_2(t))) J + K(δ_3(t) + (δ_1(t) ∨ δ_2(t)))] / (1 − Kδ_3(t)) } ≥ 1 − (6K/p_1) e^{−t} − 3p_0 K/p_1,

with δ_i(t) = (σ_i/μ_i) √(t/D) for i = 1, 2, 3, p_0 = C(D−f, D/K)/C(D, D/K), p_1 = 1 − p_0, and f = |{i : S_i > 0 or T_i > 0}|.

4.2 Experiment: MSEs for Estimating Jaccard in Real-valued Word-Vectors

We again use the word occurrence data (such as “HONG” and “KONG”) from the “Words” dataset. This time, we record the actual numbers of occurrences instead of just the presence/absence information. We have conducted a large number of simulations for estimating the Jaccard similarity between two vectors. The patterns are essentially similar and hence we only present the results for three word-vector-pairs, in Figure 3. More details about the datasets and the experiments are available in the supplementary material, which also contains the proofs of the theorems presented in this paper.

As expected, as shown in Figure 3, DenRe outperforms Den (and the two other estimators) in terms of MSEs. 
We remark that when K is smaller, the curves of CWS, Den and DenRe are usually indistinguishable (if M < K), and hence we only report results with relatively larger K values.

4.3 The “R-G” Algorithm for Estimating Jaccard

To demonstrate the advantage of our proposed method for estimating Jaccard similarity with nonnegative real-valued data, here we introduce another interesting algorithm [17, 6, 29], which in this paper we refer to as the “R-G” method. They showed that when data are dense, “R-G” speeds up CWS, typically by a substantial factor.

Figure 4 is an example with two vectors S = [S1, S2, S3, S4] and T = [T1, T2, 0, T4]. The algorithm needs prior fixed feature-wise upper bounds m_i, i = 1, ..., D. The green region represents the data entries. Denote m̃_j = Σ_{i=1}^j m_i. They repeatedly choose a point at random on [0, m̃_D] until it falls into the green region. The hash values are set to be the number of tries before success.

Figure 4: Illustration of the R-G algorithm in [29].

This simple strategy also yields an unbiased estimator of the Jaccard similarity J with the same variance as CWS. The running time of the R-G method is O(f̄ + K · (1/s)) for a set S with f̄ nonzero entries, where s = (Σ_{i=1}^D S_i) / m̃_D is the effective sparsity. Note that s = f/D when the data are binary. Therefore, it should be obvious that the R-G algorithm would perform poorly on binary data and also poorly on sparse data.

Figure 3: Empirical MSEs of four BCWS schemes for estimating Jaccard similarity on weighted datasets, for three word-vector-pairs. The bottom panels are zoomed-in versions of the upper panels.

Recall that, for BCWS using DenRe (consider only M = K), the running time is O(f̄ + 2K + K/(K − N^K_emp) · N^K_emp + (f̄/K) · N^K_emp). 
Roughly speaking, in a typical situation, we can say that the cost for generating K samples using R-G is O(K/s) and for BCWS is O(f̄). Therefore, we can use K/(s f̄) as an indicator of the improvement of BCWS compared to R-G.

Through a careful study of the literature, the history of the “R-G” algorithm can be traced to [17, 6]. The recent work [29] developed an effective column-wise preprocessing scheme which made the algorithm practical (in dense data). [29] also provided an elegant theoretical analysis to clearly reveal the advantage of the method in dense data (and its disadvantage in sparse data).

In Figure 5, we present the empirical comparisons between R-G and BCWS on two datasets: (i) “Words” (1/(s f̄) ≈ 1); (ii) “20 NewsGroup” (1/(s f̄) ≈ 14 ∼ 150). There is an important (hidden) detail in the R-G algorithm: its performance largely depends on a properly chosen scaling factor. For the original 20 NewsGroup dataset, 1/(s f̄) ≈ 150, but if we scale the data properly, this value can be reduced substantially to 14. 
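To make the dart-throwing description concrete, here is a minimal Python sketch of the R-G sampling step. The seeding scheme and the toy vectors are illustrative, and a real implementation would also include the column-wise preprocessing/scaling of [29]; the sketch assumes fixed upper bounds m_i with S_i ≤ m_i.

```python
import random
from bisect import bisect_right

def rg_hashes(S, m, n_hashes, seed=0):
    # Sketch of the R-G dart-throwing step. prefix[j] is the left edge of
    # feature j on the line [0, m~_D); a dart x "hits" vector S if it falls
    # inside S's green region, i.e., x - prefix[j] < S[j] for the feature j
    # containing x. The hash value is the number of failed tries, and the
    # dart sequence for repetition h depends only on (seed, h), so it is
    # shared by all vectors -- this makes the counts comparable.
    prefix = [0.0]
    for mi in m:
        prefix.append(prefix[-1] + mi)
    total = prefix[-1]
    out = []
    for h in range(n_hashes):
        rng = random.Random(seed + 1_000_003 * h)
        tries = 0
        while True:
            x = rng.uniform(0.0, total)
            j = min(bisect_right(prefix, x) - 1, len(m) - 1)  # feature of x
            if x - prefix[j] < S[j]:
                out.append(tries)
                break
            tries += 1
    return out

m = [1.0, 2.0, 1.0, 1.0]     # fixed feature-wise upper bounds
S = [0.5, 2.0, 1.0, 0.0]
T = [1.0, 2.0, 0.0, 0.5]
hS, hT = rg_hashes(S, m, 2000), rg_hashes(T, m, 2000)
J = sum(min(s, t) for s, t in zip(S, T)) / sum(max(s, t) for s, t in zip(S, T))
print(J, sum(a == b for a, b in zip(hS, hT)) / 2000)   # estimate close to J
```

Two vectors collide on a repetition exactly when the first dart landing in either green region lands in both, which happens with probability Σ min / Σ max = J; hence matching the counts gives an unbiased estimate.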
We will explain the R-G algorithm in more detail in the supplementary material.

Figure 5: Average absolute estimation error of Jaccard similarity and running time comparison.

The “Words” dataset [23] consists of 2,702 word-vectors (e.g., “HONG” and “KONG”) from a repository of D = 2^16 documents, for a total of 3,649,051 word-pairs. The left panel of Figure 5 reports the averaged absolute errors (among over 3 million pairs). Similarly, we present the errors for 20 NewsGroup, for both R-G and BCWS-DenRe. We also include the time comparisons in the right panel of Figure 5. Basically, for K = 256, R-G needs 200 times more time than BCWS. This is largely consistent with what the theoretical results would predict. For 20 NewsGroup, we observe that the improvement in efficiency by using BCWS would be even much more substantial.

Finally, we should add that one can combine the ideas of BCWS and R-G to significantly speed up the R-G algorithm, as suggested in Line 5 of Algorithm 2. We can perhaps name this new method “B-R-G”. 
Its computational cost would be merely O( \bar{f} + 1/s ), as opposed to O( \bar{f} + K/s ) for R-G. This new method would be very useful for hashing Jaccard similarity in dense high-dimensional data.

4.4 Classification Experiment

[20, 21, 22] already conducted extensive experiments on many classification tasks using the min-max kernel (and other kernels) and linearized min-max kernels via CWS hashing. Here, we report additional experiments on the UCI-Dailysports dataset. When a linear SVM classifier is used on the original data, the test accuracy is only 77%. However, with the min-max kernel, the accuracy becomes 99%. This is a good example showing that the min-max kernel (and its linearization by CWS hashing) can be very useful in practice. Figure 6 shows that for BCWS with K ∈ {16, 32, 64, 128}, using linear SVM on the data hashed by BCWS achieves good classification accuracy. Compared to CWS, the accuracy of BCWS is similar or even slightly better. Note that the dimension of the dataset is only 5,625. For example, when K = 128 and the desired number of (nonzero) features is 2^10 = 1024 (x-axis), we have to repeat BCWS 1024/128 = 8 times. Nevertheless, we can still achieve a cost reduction by a factor of 128 without losing accuracy.

This is a significant part of the contribution of this paper for machine learning. That is, we are able to achieve the accuracy of nonlinear kernels at a cost similar to that of linear classifiers, and the preprocessing (hashing) cost is no longer the bottleneck, unlike with the original CWS hashing method.

Figure 6: Using linear SVM on the original data achieves only a 77% accuracy. After we hash the data via CWS and use linear SVM on top of the hashed data (dashed curve), the accuracy reaches 99%.
Using BCWS with K ranging from 16 to 128 (solid curves) still attains similar accuracies.

5 Conclusion

We expect that BCWS will be adopted in practice for large-scale similarity search and machine learning tasks, given its simplicity and effectiveness. The prior work [20] showed that the min-max kernel, even though it appears simple, can be a good choice of nonlinear kernel for many classification tasks. The more recent work [21] extended the min-max kernel to data vectors with negative entries. In addition, the min-max kernel can be modified to admit tuning parameters [22] for potentially achieving even better performance. The work [22] compared “tunable” min-max kernels with boosted trees and deep nets and presented surprising results. Nevertheless, the preprocessing (hashing) cost of the original CWS algorithm makes it difficult for the min-max kernel (and its variants) and CWS to be adopted in practice. This study fills this gap by developing the bin-wise CWS (BCWS) algorithm and providing the corresponding theoretical analysis. For binary (0/1) data, our results are also interesting and practically useful, in that we provide a scheme that, under the same storage budget, achieves the provably smallest variance among all existing densification methods for one permutation hashing (OPH).

References

[1] Michael Bendersky and W. Bruce Croft. Finding text reuse on the web. In Proceedings of the Second International Conference on Web Search and Web Data Mining (WSDM), pages 262–271, Barcelona, Spain, 2009.

[2] Léon Bottou, Olivier Chapelle, Dennis DeCoste, and Jason Weston, editors. Large-Scale Kernel Machines. The MIT Press, Cambridge, MA, 2007.

[3] Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise independent permutations.
In Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing (STOC), pages 327–336, Dallas, TX, 1998.

[4] Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic clustering of the web. Computer Networks, 29(8-13):1157–1166, 1997.

[5] Gregory Buehrer and Kumar Chellapilla. A scalable pattern mining approach to web graph compression with communities. In Proceedings of the International Conference on Web Search and Web Data Mining (WSDM), pages 95–106, Stanford, CA, 2008.

[6] Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC), pages 380–388, Montreal, Canada, 2002.

[7] Ludmila Cherkasova, Kave Eshghi, Charles B. Morrey III, Joseph Tucek, and Alistair C. Veitch. Applying syntactic similarity algorithms for enterprise information management. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 1087–1096, Paris, France, 2009.

[8] Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, Michael Mitzenmacher, Alessandro Panconesi, and Prabhakar Raghavan. On compressing social networks. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 219–228, Paris, France, 2009.

[9] Yon Dourisboure, Filippo Geraci, and Marco Pellegrini. Extraction and classification of dense implicit communities in the web graph. ACM Trans. Web, 3(2):1–36, 2009.

[10] Dennis Fetterly, Mark Manasse, Marc Najork, and Janet L. Wiener. A large-scale study of the evolution of web pages. In Proceedings of the Twelfth International World Wide Web Conference (WWW), pages 669–678, Budapest, Hungary, 2003.

[11] George Forman, Kave Eshghi, and Jaap Suermondt. Efficient detection of large-scale redundancy in enterprise file systems. SIGOPS Oper. Syst.
Rev., 43(1):84–91, 2009.

[12] Sreenivas Gollapudi and Rina Panigrahy. Exploiting asymmetry in hierarchical topic extraction. In Proceedings of the 2006 ACM CIKM International Conference on Information and Knowledge Management (CIKM), pages 475–482, Arlington, VA, 2006.

[13] Sreenivas Gollapudi and Aneesh Sharma. An axiomatic approach for result diversification. In Proceedings of the 18th International Conference on World Wide Web (WWW), pages 381–390, Madrid, Spain, 2009.

[14] Sergey Ioffe. Improved consistent sampling, weighted minhash and L1 sketching. In The 10th IEEE International Conference on Data Mining (ICDM), pages 246–255, Sydney, AU, 2010.

[15] Nitin Jindal and Bing Liu. Opinion spam and analysis. In Proceedings of the International Conference on Web Search and Web Data Mining (WSDM), pages 219–230, Palo Alto, CA, 2008.

[16] Konstantinos Kalpakis and Shilang Tang. Collaborative data gathering in wireless sensor networks using measurement co-occurrence. Computer Communications, 31(10):1979–1992, 2008.

[17] Jon Kleinberg and Eva Tardos. Approximation algorithms for classification problems with pairwise relationships: Metric labeling and Markov random fields. In 40th Annual Symposium on Foundations of Computer Science (FOCS), pages 14–23, New York, NY, 1999.

[18] Hugo Larochelle, Dumitru Erhan, Aaron C. Courville, James Bergstra, and Yoshua Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML), pages 473–480, Corvallis, Oregon, 2007.

[19] Ping Li. Robust logitboost and adaptive base class (abc) logitboost. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI), pages 302–311, Catalina Island, CA, 2010.

[20] Ping Li.
0-bit consistent weighted sampling. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 665–674, Sydney, Australia, 2015.

[21] Ping Li. Linearized GMM kernels and normalized random Fourier features. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 315–324, 2017.

[22] Ping Li. Several tunable GMM kernels. arXiv:1805.02830, 2018.

[23] Ping Li and Kenneth W. Church. Using sketches to estimate associations. In Proceedings of the 2005 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 708–715, Vancouver, Canada, 2005.

[24] Ping Li, Art B. Owen, and Cun-Hui Zhang. One permutation hashing. In Advances in Neural Information Processing Systems (NIPS), pages 3122–3130, Lake Tahoe, NV, 2012.

[25] Ping Li and Cun-Hui Zhang. Theory of the GMM kernel. In Proceedings of the 26th International Conference on World Wide Web (WWW), pages 1053–1062, Perth, Australia, 2017.

[26] Mark Manasse, Frank McSherry, and Kunal Talwar. Consistent weighted sampling. Technical Report MSR-TR-2010-73, Microsoft Research, 2010.

[27] Marc Najork, Sreenivas Gollapudi, and Rina Panigrahy. Less is more: sampling the neighborhood graph makes salsa better and faster. In Proceedings of the Second International Conference on Web Search and Web Data Mining (WSDM), pages 242–251, Barcelona, Spain, 2009.

[28] Sandeep Pandey, Andrei Broder, Flavio Chierichetti, Vanja Josifovski, Ravi Kumar, and Sergei Vassilvitskii. Nearest-neighbor caching for content-match applications. In Proceedings of the 18th International Conference on World Wide Web (WWW), pages 441–450, Madrid, Spain, 2009.

[29] Anshumali Shrivastava. Simple and efficient weighted minwise hashing.
In Advances in Neural Information Processing Systems (NIPS), pages 1498–1506, Barcelona, Spain, 2016.

[30] Anshumali Shrivastava. Optimal densification for fast and accurate minwise hashing. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3154–3163, Sydney, Australia, 2017.

[31] Anshumali Shrivastava and Ping Li. Densifying one permutation hashing via rotation for fast near neighbor search. In Proceedings of the 31st International Conference on Machine Learning (ICML), Beijing, China, 2014.

[32] Tanguy Urvoy, Emmanuel Chauveau, Pascal Filoche, and Thomas Lavergne. Tracking web spam with html style similarities. ACM Trans. Web, 2(1):1–28, 2008.