{"title": "Bipartite Stochastic Block Models with Tiny Clusters", "book": "Advances in Neural Information Processing Systems", "page_first": 3867, "page_last": 3877, "abstract": "We study the problem of finding clusters in random bipartite graphs. We present a simple two-step algorithm which provably finds even tiny clusters of size $O(n^\\epsilon)$, where $n$ is the number of vertices in the graph and $\\epsilon > 0$. Previous algorithms were only able to identify clusters of size $\\Omega(\\sqrt{n})$. We evaluate the algorithm on synthetic and on real-world data; the experiments show that the algorithm can find extremely small clusters even in presence of high destructive noise.", "full_text": "Bipartite Stochastic Block Models with Tiny Clusters\n\nStefan Neumann\n\nUniversity of Vienna\n\nFaculty of Computer Science\n\nVienna, Austria\n\nstefan.neumann@univie.ac.at\n\nAbstract\n\nWe study the problem of \ufb01nding clusters in random bipartite graphs. We present a\nsimple two-step algorithm which provably \ufb01nds even tiny clusters of size O(n\u03b5),\nwhere n is the number of vertices in the graph and \u03b5 > 0. Previous algorithms\nwere only able to identify clusters of size \u2126(\nn). We evaluate the algorithm on\nsynthetic and on real-world data; the experiments show that the algorithm can \ufb01nd\nextremely small clusters even in presence of high destructive noise.\n\n\u221a\n\n1\n\nIntroduction\n\nFinding clusters in bipartite graphs is a fundamental problem and has many applications. In practice,\nthe two parts of the bipartite graph usually correspond to objects from different domains and an edge\ncorresponds to an interaction between the objects. 
For example, paleontologists use biclustering to\n\ufb01nd co-occurrences of localities (left side of the graph) and mammals (right side of the graph) [13];\nbioinformaticians want to relate biological samples and gene expression levels [10]; in an online shop\nsetting, one wants to \ufb01nd clusters of customers and products.\nDiscovering clusters in bipartite graphs has been researched in many different settings. However,\nmost of these algorithms were heuristics and do not provide theoretical guarantees for the quality\nof their results. This was recently addressed by Xu et al. [27] and Lim et al. [17] who initiated the\nstudy of biclustering algorithms with formal guarantees. They considered random bipartite graphs\nand proved under which conditions their algorithms can recover the ground-truth clusters.\nIn this paper, we consider a standard random graph model and propose a simple two-step algorithm\nwhich provably discovers the ground-truth clusters in bipartite graphs: (1) Cluster the vertices on the\nleft side of the graph based on the similarity of their neighborhoods (Section 3). (2) Infer the right\nside clusters based on the previously discovered left clusters using degree-thresholding (Section 4).\nOur algorithm allows to recover even tiny clusters of size O(n\u03b5), where n is the number of vertices\non the right side of the graph and \u03b5 > 0. Previously, existing algorithms could only discover clusters\nof size \u2126(\nn). Note that \ufb01nding tiny clusters is of high practical importance. For example, in an\nonline shop with millions of products (n \u2265 106), \ufb01nding only clusters of at least a thousand products\n\u221a\n(\nThe formal guarantees of our algorithm are provided at the end of this section. From a high-level\npoint of view, the algorithm can be seen as a way to leverage formal guarantees for mixture models\nand clustering algorithms into biclustering algorithms with formal guarantees. 
This partially explains\nwhy heuristics such as \u201capply k-means to both sides of the graph\u201d are very successful in practice.\nFinally, we implement a version of the proposed algorithm (Section 5) and we evaluate it on synthetic\nand on real-world data. The experiments show that in practice the algorithm can \ufb01nd extremely small\nclusters and it outperforms all algorithms we compare with (Section 6).\n\nn \u2265 103) is not very interesting. One would want the product clusters to be much smaller.\n\n\u221a\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00b4eal, Canada.\n\n\fBipartite Stochastic Block Models. We now introduce bipartite stochastic block models (SBMs)\nwhich we will be using throughout the paper. Let G = (U \u222a V, E) be a bipartite graph with m\nvertices in U and n vertices in V ; we call U the left side of G and V the right side of G.\n\nThe left side U is partitioned into clusters U1, . . . , Uk, i.e., Ui \u2229 Uj = \u2205 for i (cid:54)= j and(cid:83)\n\ni Ui = U.\nFor V there are clusters V1, . . . , Vk with Vi \u2286 V ; we do not assume that the Vj are disjoint or that\ntheir union equals V . The Ui are the left clusters of G and the Vj are the right clusters of G.\nFix two probabilities p > q \u2265 0. For any two vertices u \u2208 Ui and v \u2208 Vi, insert an edge with\nprobability p; for u \u2208 Ui and v (cid:54)\u2208 Vi, insert an edge with probability q.\nThe algorithmic task for bipartite SBMs is as follows. Given parameters k, p, q and a graph G\ngenerated in the previously described way, recover all clusters Ui and Vj.\n\nMain Theoretical Results. We propose the following simple algorithm:\n\n1. Recover the clusters Ui by clustering the vertices in U according to the similarity of their\n\nneighborhoods (see Section 3).\n\n2. 
For each recovered Ui, set Vi to all vertices with \u201cmany\u201d neighbors in Ui (see Section 4).\n\nTo state the formal guarantees of the proposed algorithm we require two parameters. We let (cid:96) be the\nsize of the smallest cluster on the left side, i.e., (cid:96) = mini |Ui|. Furthermore, let \u03b4 denote the smallest\ndifference between any two clusters on the right side; more formally, \u03b4 = mini(cid:54)=j |Vi\u2206Vj|, where\nVi\u2206Vj = (Vi \\ Vj) \u222a (Vj \\ Vi) is the symmetric difference of Vi and Vj.\nIn the theorem, D(p || q) denotes the Kullback\u2013\nWe now state the main result of this paper.\nLeibler divergence of Bernoulli random variables with parameters p, q \u2208 [0, 1], i.e., D(p || q) =\np lg( p\nTheorem 1. Suppose \u03c32 = max{p(1 \u2212 p), q(1 \u2212 q)} \u2265 lg6 n/n. There exist constants C1, C2 such\nthat if (cid:96) \u2265 C1 lg n/D(p || q) and\n\nq ) + (1 \u2212 p) lg( 1\u2212p\n1\u2212q ).\n\n(p \u2212 q)2\n\n\u03c32\n\n> C2k\n\nn + m lg m\n\n(cid:96)\u03b4\n\n,\n\n(1)\n\nthen there exists an algorithm which on input G, k, p and q returns all clusters Ui and Vi. The\nalgorithm succeeds with high probability.\n\nTo give a better interpretability of the theorem, consider its two main assumptions: (1) The condition\n(cid:96) \u2265 C1 lg n/D(p || q) is necessary so that the vertices in Vi have suf\ufb01ciently many neighbors in Ui.\n(2) To get a better understanding of Equation 1, consider the case where m = \u0398(n), k = O(1), and\np, q are constants. Also, ignore logarithmic factors. We obtain a smooth tradeoff between \u03b4 and (cid:96):\nThe inequality in Equation 1 is satis\ufb01ed if \u03b4 = \u0398(n\u03b5) and (cid:96) = \u0398(n1\u2212\u03b5). That is, if the right clusters\nare very small or similar (\u03b4 is small), the algorithm requires larger clusters on the left side ((cid:96) must be\nlarge). 
On the other hand, if the right clusters are very large and dissimilar (large \u03b4), the algorithm\nrequires only very small left clusters (small (cid:96) suf\ufb01ces). More generally, if p \u2212 q = \u0398(n\u2212C) and\np (cid:29) q, the algorithm requires (cid:96)\u03b4 = \u0398(n1+C).\n\u221a\nThe fact that the algorithm can recover clusters of size O(n\u03b5) is interesting since previous algorithms\nrequired min{(cid:96), \u03b4} = \u2126(\n\u221a\nn) (see Section 2). Furthermore, the lower bounds of Hajek, Wu and\nn) barrier is impossible in general graphs. Hajek et al. also\nXu [15] show that breaking the \u2126(\nprovide lower bounds in the bipartite setting which show that one cannot \ufb01nd biclusters of size k \u00d7 k\nfor k = o(\nn). We bypass this lower bound through the previously discussed smooth tradeoff\nbetween (cid:96) and \u03b4. We conjecture that the tradeoff we obtain is asymptotically optimal.\nAs a side result we also study the setting in which the algorithm only obtains an approximate\nclustering of the left side of the graph. We show that when the approximation of the left clusters is\nof good enough quality, then the right clusters can still be recovered exactly. We also observe this\nbehavior in our experiments in Section 6. We provide the result in the supplementary material.\n\n\u221a\n\nExperimental Evaluation. We implemented a version of the algorithm from Theorem 1 and\npresent the practical details in Section 5. The experimental results are reported in Section 6. In the\nexperiments, our main focus will be to verify whether, in practice, the algorithm can \ufb01nd the small\nclusters that the theoretical analysis promised.\n\n2\n\n\fOn synthetic data, the experiments show that, indeed, the algorithm \ufb01nds tiny clusters even in the\npresence of high destructive noise and it outperforms all methods that we compare against.\nThe algorithm is also qualitatively evaluated on real-world datasets. 
On these datasets it \ufb01nds clusters\nwhich are interesting and which have natural interpretations.\n\n2 Related Work\n\nStochastic Block Models (SBMs). During the last years, many papers on SBMs have been pub-\nlished. We only discuss bipartite SBMs here and refer to the survey by Abbe [1] for other settings.\nLim, Chen and Xu [17] study the biclustering of observations with general labels. When constrained\nto only two labels, their results provide a bipartite SBM. However, in the bipartite SBM case, [17]\nhas two drawbacks compared to the results presented here: (1) The data-generating process in [17]\nrules out certain nested structures of the sets Vi. E.g., [17] does not allow to have clusters V1, V2, V3\nsuch that V3 = V1 \u222a V2. (2) The main result of [17] relies on a notion of coherence, which measures\nhow dif\ufb01cult the structure of the clusters is to infer. Due to this dependency on coherence, the results\nof this paper and [17] are only partially comparable. In case of a constant number of clusters or\n\u201cworst-case coherence\u201d, though, the algorithm of [17] only works if both (cid:96) and \u03b4 are \u2126(\nZhou and Amini [28] study spectral methods for bipartite SBMs. [28] considers a more general\nconnectivity structure and obtains sharper bounds for the recovery rates than in this paper. However,\nin [28] the clusters Vi cannot overlap and, hence, the results of this paper and [28] are incomparable.\nAbbe and Sandon [2, 3] and Gao et al. [14] study optimal recovery for SBMs in general graphs. Their\nresults apply to bipartite graphs with a constant number of overlapping communities of linear size.\nZhou and Amini [29] improve these results for bipartite SBMs under a broader range of parameters.\n\u221a\nOne can use the result of McSherry [18] to recover the clusters of a bipartite graph but this has two\ncaveats: (1) It does not allow the Vi to overlap. 
(2) Both (cid:96) and \u03b4 must be of size \u2126(\nFlorescu and Perkins [11] provided an SBM for bipartite graphs with two linear-size communities on\neach side of the graph. Xu et al. [27] consider a biclustering setting with clusters of size \u2126(n).\n\nn).\n\n\u221a\n\nn).\n\nBoolean Matrix Factorization (BMF). Another way to \ufb01nd clusters in bipartite graphs is BMF.\nBMF takes the biadjacency matrix D \u2208 {0, 1}m\u00d7n of a bipartite graph and \ufb01nds factor matrices\nL \u2208 {0, 1}m\u00d7k and R \u2208 {0, 1}k\u00d7n such that D \u2248 L \u25e6 R, where \u25e6 is the Boolean matrix-matrix-\nproduct. In other words, BMF tries to approximate D with a Boolean-rank k matrix. The interpretation\nis that the columns of L contain the left clusters and the rows of R contain the right clusters. This\nsetting is more general than the one presented in this paper as it allows the clusters Ui to overlap.\nBMF was studied from applied [20, 21, 24\u201326] and also from theoretical [5, 7, 12] perspectives.\nSection 6 provides an experimental comparison of BMF algorithms and the algorithm from this paper.\n\n3 Recovering the Left Clusters\nWe describe how the clusters Ui can be recovered. Our approach is to cluster the vertices u \u2208 U\naccording to the similarity of their neighborhoods in V . The intuition is that if two vertices u and u(cid:48)\nare in the same cluster Ui, they should have relatively many neighbors in common (those in Vi). On\nthe other hand, if u and u(cid:48) are from different clusters Ui and Uj, their neighbors should be relatively\ndifferent (as Vi\u2206Vj is supposed to be large).\nTechnically, we will apply mixture models. We use the result by Mitra [22] since it is simple\nto state. We could as well use other mixture models such as the one by Dasgupta et al. [9] or\nclustering algorithms such as Kumar and Kannan [16], Bilu and Linial [6] or Cohen-Addad and\nSchwiegelshohn [8]. 
The different methods come with different assumptions on the data.\n\n3.1 Mixture Models and Mitra\u2019s Algorithm\n\nMixture Models on the Hypercube. From a high-level point of view, the question of mixture\nmodels is as follows: Given samples from different distributions, cluster the samples according to\nwhich distributions they were sampled from. We will now present the formal details behind this.\n\n3\n\n\fLet there be k probability distributions D1, . . . , Dk in {0, 1}n and denote the mean of Dr as \u00b5r \u2208\n[0, 1]n. Let \u03c32 be an entry-wise upper bound on all \u00b5r, i.e., \u00b5r(i) \u2264 \u03c32 for all r = 1, . . . , k and\n\ni = 1, . . . , n. For each distribution Dr de\ufb01ne a weight wr > 0 such that(cid:80)\nwe obtain m samples and denote the set containing all samples as T , i.e., T =(cid:83)\n\nFrom each distribution Dr, create wrm samples and denote the set of these samples as Tr. In total\n\nr wr = 1.\n\nr Tr.\n\nThe algorithmic problem in mixture models is as follows. Given T and k, \ufb01nd a partition P1, . . . , Pk\nof the samples in T such that {T1, . . . , Tk} = {P1, . . . , Pk}.\n\nMitra\u2019s Algorithm. We state the result by Mitra [22] and refer to the supplementary material for\nmore details on the algorithm. We de\ufb01ne a matrix A \u2208 {0, 1}m\u00d7n which has the samples from T in\nits rows. Thus, by clustering the rows of A we obtain a clustering of T . The formal guarantees are\n\nstated in the following lemma, where ||v||2 = ((cid:80)\n\ni )1/2.\n\ni v2\n\nLemma 2 (Mitra [22]). Suppose \u03c32 \u2265 lg6 n/n. Let \u03b6 = min{||\u00b5r \u2212 \u00b5s||2\nminr wr. Then there exists a constant c such that if\n\n2 : r (cid:54)= s} and wmin =\n\n(cid:18) m + n\n\nm\n\n(cid:19)\n\n+ lg m\n\n,\n\n\u03b6 > ck\u03c32 1\nwmin\n\nthen on input A and k, the output {P1, . . . , Pk} of Mitra\u2019s algorithm satis\ufb01es {P1, . . . , Pk} =\n{T1, . . . , Tk} with high probability. 
That is, the algorithm recovers the clusters Tr.\n\n3.2 Proposition and Analysis\n\nLet us come back to our original problem of recovering the left clusters of G. To \ufb01nd the left\nclusters Ui, we apply Mitra\u2019s algorithm to the rows of the biadjacency matrix D of G. Formally,\nthe biadjacency matrix D \u2208 {0, 1}m\u00d7n of G is the matrix with Duv = 1 iff there exists an edge\n(u, v) \u2208 G.\nProposition 3 states under which conditions this approach succeeds.\nProposition 3. Let all variables be as in Section 1. Let \u03b4 = mini(cid:54)=j |Vi\u2206Vj| and (cid:96) = mini |Ui|.\nSuppose \u03c32 = max{p(1 \u2212 p), q(1 \u2212 q)} \u2265 lg6 n/n. There exists a constant C such that if\n\n(p \u2212 q)2\n\nn + m lg m\n\n(2)\nthen applying Mitra\u2019s algorithm on D returns a partition { \u02dcU1, . . . , \u02dcUr} of D\u2019s rows such that\n{ \u02dcU1, . . . , \u02dcUr} = {U1, . . . , Ur} with high probability. That is, the algorithm recovers the left clusters\nUi of G.\n\n> Ck\n\n\u03c32\n\n(cid:96)\u03b4\n\n,\n\nProof. Observe that D is a matrix arising from a mixture model as discussed earlier: Consider a\nvertex u \u2208 Ui and its corresponding row Du in D. Then the probability that entry Duv = 1 is p if\nv \u2208 Vi and q if v (cid:54)\u2208 Vi. Furthermore, for two vertices u, u(cid:48) \u2208 Ui these distributions are exactly the\nsame.\nHence, we view the rows of D as samples from k distributions Di with distribution Di corresponding\nto cluster Ui. For each cluster Ui, we have |Ui| samples from Di. For the mean \u00b5i of Di, we have\ncomponent-wise \u00b5i(v) = p, if v \u2208 Vi, and \u00b5i(v) = q, if v (cid:54)\u2208 Vi. Thus, partitioning the rows of D\nwith a mixture model is exactly the same as recovering the clusters Ui of G.\nIt is left to check that the conditions of Lemma 2 are satis\ufb01ed. By assumption on the Vj, ||\u00b5i \u2212 \u00b5j||2\n2 \u2265\n(p \u2212 q)2\u03b4 for i (cid:54)= j. 
Since we have |Ui| samples from distribution Di, the mixing weights are\nwi = |Ui|/m and wmin = (cid:96)/m. To apply Lemma 2, we must satisfy the inequality\n\n(cid:18) m + n + m lg m\n\n(cid:19)\n\n(p \u2212 q)2\u03b4 > ck\u03c32 m\n(cid:96)\n\nm\n\n= ck\u03c32\n\n(cid:18) m + n + m lg m\n\n(cid:19)\n\n(cid:96)\n\n.\n\nBy rearranging terms and noticing that Cm lg m \u2265 c(m + m lg m) for large enough C, this is the\ninequality we required in the proposition (Equation 2).\n\n4\n\n\f4 Recovering the Right Clusters\n\nThis section presents an algorithm to recover the right clusters Vj given the left clusters Ui. The\nalgorithm is very simple: For each cluster Ui, \u02dcVi consists of all vertices from V which have \u201cmany\u201d\nneighbors in Ui. We will show that the algorithm succeeds with high probability.\n\nHigh-Degree Thresholding Algorithm. The input for the algorithm are p, q and the clusters\nU1, . . . , Uk. For each cluster Ui, the algorithm constructs \u02dcVi by adding all vertices v \u2208 V which have\nat least \u03b8|Ui| neighbors in Ui, where we set\n\n(cid:19)(cid:18)\n\n(cid:18) 1 \u2212 q\n\n1 \u2212 p\n\n(cid:18) p(1 \u2212 q)\n\nq(1 \u2212 p)\n\nlg\n\n(cid:19)(cid:19)\u22121\n\n\u03b8 = lg\n\n.\n\n(3)\n\nq ) + (1 \u2212 p) lg( 1\u2212p\n1\u2212q ).\n\nProposition and Analysis. We prove in Proposition 4 that for a \ufb01xed cluster Ui of suf\ufb01ciently\nlarge size, Vi = \u02dcVi with probability 1 \u2212 O(n\u22122). A union bound implies that \u02dcVi = Vi for all\ni = 1, . . . , k with high probability. In the proposition, we use the notation D(p || q) to denote\nthe Kullback\u2013Leibler divergence of Bernoulli random variables with parameters p, q \u2208 [0, 1], i.e.,\nD(p || q) = p lg( p\nProposition 4. There exists a constant C such that if |Ui| \u2265 C lg n/D(p || q), then \u02dcVi returned by\nthe high-degree thresholding algorithm satis\ufb01es Vi = \u02dcVi with probability at least 1 \u2212 O(1/n2). 
The\nalgorithm runs in time O(|Ui|n).\nProof. Consider a vertex v \u2208 V . The vertex v has an edge to u \u2208 Ui with probability p, if v \u2208 Vi,\nand with probability q, if v (cid:54)\u2208 Vi. Let Zv be the random variable denoting the number of edges from\nv to vertices in Ui; Zv is binomially distributed with |Ui| trials and success probability p (if v \u2208 Vi)\nor q (if v (cid:54)\u2208 Vi). To \ufb01nd out whether v \u2208 Vi, we must decide whether Zv is distributed with parameter\np or q. If we decide for the correct parameter then the decision to include v into \u02dcVi is correct.\nWe make the decision for the parameter based on the likelihood of observing Zv edges incident\nupon v. Parameter p is more likely if:\n\n(cid:0)|Ui|\n(cid:1)pZv (1 \u2212 p)|Ui|\u2212Zv\n(cid:0)|Ui|\n(cid:1)qZv (1 \u2212 q)|Ui|\u2212Zv\n\nZv\n\nZv\n\n(cid:18) p\n\n(cid:19)Zv(cid:18) 1 \u2212 p\n\n(cid:19)|Ui|\u2212Zv \u2265 1.\n\n1 \u2212 q\n\n=\n\nq\n\nSolving this inequality for Zv gives that one should decide for parameter p if Zv \u2265 \u03b8|Ui|, where \u03b8 is\nas in Equation 3.\nThe maximum likelihood approach above succeeds with probability at least 1\u2212 O(1/n3); this follows\nfrom [4, Theorem 6] if |Ui| \u2265 C lg n/D(p || q), where C is a suf\ufb01ciently large constant. The\nprobability for obtaining a correct result for all vertices v \u2208 V is at least 1 \u2212 O(1/n2); this follows\nfrom a union bound. Conditioning on this event we obtain Vi = \u02dcVi.\n\n5\n\nImplementation\n\nWhile so far we have been concerned with theory, we will now consider practice. The pseudocode\nof the algorithm we implemented is presented in Algorithm 1. As stated in Section 1, the algorithm\nperforms two steps: (1) Recover the clusters \u02dcU1, . . . , \u02dcUk in U. (2) Recover the clusters \u02dcV1, . . . , \u02dcVk in\nV based on the \u02dcUi. 
We call the algorithm pcv, which is short for project, cluster, vote.\nWhile for step (2) we use exactly the algorithm discussed in Section 4, we made some changes for\nstep (1). The main reason is that Mitra\u2019s algorithm discussed in Section 3 was developed in a way to\ngive theoretical guarantees and not necessarily to give the best results in practice.\nInstead, for step (1) we use an even simpler algorithm for recovering the clusters \u02dcUi: Project the\nbiadjacency of G on its \ufb01rst k left singular vectors and then run k-means. This delivers better results\nand is conjectured to give the same theoretical guarantees as Mitra\u2019s algorithm (see [18] or [22]).\nWe implemented Algorithm 1 in Python. To compute the truncated SVD we used scikit-learn [23].\nThe source code is available in the supplementary material.\n\n5\n\n\fAlgorithm 1 The pcv algorithm\nInput: G a bipartite m \u00d7 n graph, k, p, q\n1: procedure pcv(G, k, p, q)\nD \u2190 the m \u00d7 n biadjacency matrix of G\n2:\nA \u2190 rank k SVD of D\n3:\n\u02dcU1, . . . , \u02dcUk \u2190 the clusters obtained by running k-means on the rows of A\n4:\nfor i = 1, . . . , k do\n5:\n6:\n\n\u02dcVi \u2190 all vertices in V with at least \u03b8|Ui| neighbors in Ui, where \u03b8 is as in Equation 3\n\n(cid:46) Step (1)\n\n(cid:46) Step (2)\n\nWhen developing the algorithm, we also tried using other clustering methods than k-means. However,\nnone of them delivered consistently better results than k-means and the differences in the outputs\nwere mostly minor. Hence, we do not study this further here.\nWe note that due to k-means, pcv is a randomized algorithm. On the synthetic graphs we will consider,\nthis had almost no in\ufb02uence on the quality of the results. On real-world graphs, this randomness\nresulted in different clusterings in each run of the algorithm. 
However, some \u201cprominent clusters\u201d\nwere always there and the computed clusters always had an interpretable structure.\n\nParameters. The parameters p and q are only used to compute the parameter \u03b8 from Section 4. We\nnote that in practice it might be reasonable to pick a different threshold \u03b8 for each cluster depending\non its sparsity; however, this was not done in this paper. In the supplementary material we present\nand evaluate a heuristic for how p and q (and, hence, \u03b8) can be estimated.\nIt suf\ufb01ces if k is a suf\ufb01ciently tight upper bound on its true value. pcv will not necessarily output\nexactly k clusters; if k-means outputs less than k clusters, then pcv will do the same. In practice it is\nsometimes handy to use different values for k in the SVD and in k-means.\nWe further added a parameter L \u2208 N. In practice, often some of the \u02dcUi returned by pcv are tiny (e.g.,\ncontaining less than \ufb01ve vertices). To avoid creating too much output, we use the parameter L to\nignore all clusters \u02dcUi of size less than L. In the experiments we always set L = 10.\n\n6 Experiments\n\nIn this section, we practically evaluate the performance of pcv. Throughout the experiments our main\nobjective will be to understand how well pcv can recover small clusters on the right side of the graph.\nIn the synthetic experiments, we will be most interested in how small p can be so that pcv can still\nrecover clusters of size less than 10 on the right side of the graph. We picked real-world datasets from\nwhich we expect that they contain only very small clusters on the right side.\nThe experiments were done on a MacBook Air with a 1.6 GHz Intel Core i5 and 8 GB RAM. The\nsource code and the synthetic data are provided in the supplementary materials.\n\nAlgorithms. pcv was compared with the lim algorithm by Lim, Wu and Xu [17], message by\nRavanbakhsh, P\u00b4oczos and Greiner [24], and the lfm algorithm by Rukat, Holmes and Yau [26]. 
For\neach of the algorithms, implementations provided by the authors were used. message and lfm are\nBMF algorithms (see Section 2).\nWhen we report the running times of the algorithms, note that the quality of the implementations is\nincomparable. For example, lim is implemented in Matlab, message and pcv are purely implemented\nin Python and lfm is programmed in Python with certain subroutines precompiled using Numba.\n\n6.1 Synthetic Data\n\nLet us start by considering the performance of the algorithms on synthetically generated graphs. The\ngraphs were generated as described in Section 1.\nThe ground-truth clusters Ui and Vi were picked in the following way. For each Ui, (cid:96) vertices were\nadded to the (initially empty) left side of the graph. On the right side of the graph, we inserted\nn vertices. Each of the Vj consists of r vertices which were picked uniformly at random from the\n\n6\n\n\fn vertices. Due to the randomness in the graph generation, some of the Vj will overlap and most of\nthem will not. By size of a cluster we mean the number of vertices contained in the cluster.\nWhen not mentioned otherwise, the parameters were set to n = 1000, k = 8, (cid:96) = 70, and m = (cid:96) \u00b7 k\n(i.e., 1000 vertices on the right, 8 ground-truth clusters on both sides and left-side clusters of size 70).\nThe size of the right-side clusters was set to r = 8. The parameters p and q were set depending on\nthe dataset.\nFor each of the reported parameter settings, \ufb01ve random graphs were generated. The results that are\nreported in the following are averages over these datasets. When an algorithm was run multiple times\non the same dataset, we report the best result on the right clusters of the graph.\nDuring the experiments, all algorithms were given the correct parameters for k, p and q whenever\nthe algorithms allowed this. For lim and lfm we optimized their parameters; we report this in the\nsupplementary material.\n\nQuality Measure. 
Consider the k ground-truth clusters U1, . . . , Uk and let \u02dcU1, . . . , \u02dcUs be the s\nclusters returned by an algorithm. The quality Q of the solution \u02dcUj is computed as follows. For each\nground-truth cluster Ui, \ufb01nd the cluster \u02dcUj which maximizes the Jaccard coef\ufb01cient of Ui and \u02dcUj.\nThen sum over the Jaccard coef\ufb01cients for all ground-truth clusters Ui and normalize by k. Formally,\n\nk(cid:88)\n\ni=1\n\nQ =\n\n1\nk\n\nJ(Ui, \u02dcUj) \u2208 [0, 1],\n\nmax\n\nj=1,...,s\n\nwhere J(A, B) = |A \u2229 B|/|A \u222a B| is the Jaccard coef\ufb01cient. Higher values for Q imply a better\nquality of the solution. E.g., if Q = 1 then the clusters \u02dcUj match exactly the ground-truth clusters Ui.\nWe used the same quality measure for the clusters Vi. In the supplementary material we explain why\ndecided against using the reconstruction error of the biadjacency matrix of G as a quality measure.\n\nVarying p. We start by studying how much the results of the algorithms are affected by destructive\nnoise. To this end, we use varying values for p = 0.2, 0.25, 0.3, 0.5, 0.75, 0.95 and \ufb01x q = 0.03. The\nresults are presented in Figures 1(a)\u20131(c).\nWe see that on both sides of the graph, pcv and message outperform lfm and lim for p \u2264 0.3; for\np \u2265 0.5, lim picks up and delivers very good results.\nIn Figure 1(a) we see that on the left clusters, pcv and message deliver similar performances with\npcv picking up the signal better for p \u2265 0.5; the results of lim improve as p increases and they are\nperfect for p = 0.75, 0.95; lfm always delivers relatively poor results.\nFor the right clusters the situation is similar with message having slight advantages over pcv for\np \u2264 0.3; pcv and lim deliver better results than message in settings with less noise (p \u2265 0.75). 
It is\ninteresting to observe that pcv already recovers the ground-truth clusters on the right side for p \u2265 0.4\nand even for p = 0.3 the results are of good quality.\nThe running times of the algorithms are reported in Figure 1(c). pcv is the fastest method with lim\nand lfm being somewhat slower. message is by far the slowest method and we see that when p is\nsmall, message takes a long time until it converges.\n\nVarying sizes of the right clusters. We now study how small the right clusters Vi can get such that\nthey can still be recovered by the algorithms. To this end, we vary the size of the right clusters and\nnote that this corresponds to varying \u03b4 (for example, when all clusters are disjoint, \u03b4 is exactly twice\nthe size of the right clusters).\nPreviously, we saw that pcv, message and lim did well at the recovery of right clusters of size 8\neven for p = 0.4. We study this further by \ufb01xing p = 0.4, q = 0.03 and varying the size of the right\nclusters from 1 to 8. The results are reported in Figures 1(d)\u20131(f).\nThe results for clustering the left side of the graph are presented in Figure 1(d). We observe a clear\nranking with pcv being the best algorithm before message; lim is the third-best algorithm and lfm\nis the worst.\n\n7\n\n\f(a) Vary p: Left Cluster Quality\n\n(b) Vary p: Right Cluster Quality\n\n(c) Vary p: Running times (sec)\n\n(d) Vary \u03b4: Left Cluster Quality\n\n(e) Vary \u03b4: Right Cluster Quality\n\n(f) Vary \u03b4: Running times (sec)\n\n(g) Vary (cid:96): Left Cluster Quality\n\n(h) Vary (cid:96): Right Cluster Quality\n\n(i) Vary (cid:96): Running times (sec)\n\nFigure 1: Results on synthetic data. Figures 1(a)\u20131(c) have varying p, Figures 1(d)\u20131(f) have varying\nsizes of the right clusters, Figures 1(g)\u20131(i) have varying (cid:96). 
Markers are mean values over \ufb01ve\ndifferent datasets; error bars are one third of the standard deviation over the \ufb01ve datasets.\n\nFor the right side of the graph (Figure 1(e)) we observe that pcv outperforms message for ground-\ntruth clusters sizes less than 7; even for clusters of sizes 2 and 3, pcv \ufb01nds good solutions. The\nperformance of lim improves as the cluster sizes grow.\nThe running times (Figure 1(f)) are similar to what we have seen before for varying p.\n\nVarying (cid:96). We study how (cid:96), the size of the left clusters Ui, in\ufb02uences the results of the algorithms.\nWe used values (cid:96) = 20, 30, 40, 50, 70. The other parameters were \ufb01xed to p = 0.5, q = 0.03, k = 8\nand the size of the right clusters was set to 8. The results are reported in Figures 1(g)\u20131(i).\nOn the left clusters, pcv is the best algorithm with message also delivering good results; the results\nof lim are of good quality for (cid:96) \u2265 40. On the right clusters, message is initially ((cid:96) \u2264 30) slightly\nbetter than pcv and for (cid:96) \u2265 40, pcv and message deliver essentially perfect results; lim \ufb01nds good\nright clusters for (cid:96) \u2265 40. The running times are similar to what we have seen in previous experiments.\nIt is interesting and maybe even a bit surprising that even for (cid:96) = 20, pcv and message can \ufb01nd very\ngood clusters on the right side of the graph which only consist of 8 out of a 1000 vertices.\n\n8\n\n\fBad Parameters. The supplementary material contains experiments in which we repeat the previ-\nous experiment with varying \u03b4 but where we run the algorithms with wrong parameters.\n\nConclusion. We conclude that pcv was very good at \ufb01nding tiny clusters even with high destructive\nnoise. In most cases, pcv delivered the solutions of highest quality and pcv was the fastest algorithm.\n\n6.2 Real-World Data\n\npcv is qualitatively evaluated on two real-world datasets. 
Since the parameters required by pcv are not known, pcv was run with different parameter settings and the quality of the clusters was evaluated manually; the final setting of the parameters is reported for each dataset.

Datasets. The BookCrossing dataset1 originates from Ziegler et al. [30]. It consists of users on the left side of the graph and books on the right side of the graph; if a user rated a book, there is an edge between the corresponding vertices. The dataset was preprocessed by removing all books read by fewer than 11 users and all users who read fewer than 11 books. The resulting graph has 6195 users and 4958 books; the number of edges is 83550.
The 4News dataset is a subset of the 20Newsgroups dataset; it was preprocessed by Ata Kabán (see [19]). The data contains the occurrences of 800 words (right side of the graph) over 400 posts (left side of the graph) in four different Usenet newsgroups about cryptography, medicine, outer space, and Christianity; for each newsgroup there are 100 posts. The graph has 11260 edges.

Qualitative Evaluation. BookCrossing. For the BookCrossing dataset, pcv was run with parameters k = 20, θ = 0.2 and L = 10; pcv finished in less than 2 minutes.
pcv returns 12 user-clusters (i.e., on the left side of the graph) with size at least L. Out of these 12 user-clusters, 9 have a non-empty book-side (right side of the graph). The largest user-cluster contains 4268 vertices and has an empty book-side (right side). We will now discuss some of the clusters with non-empty right sides. All of those clusters have a natural interpretation.
The returned clusters mostly consist of books written by the same authors (as one would expect). Two clusters consisted of the Harry Potter books by Joanne K.
Rowling; the first cluster contained the five Harry Potter books published by 2004 (when the dataset was created) and 92 users; the other consisted of the first three books of the series and 60 users. There is one cluster containing four books written by Anne Rice (64 users), one cluster containing seven books written by John Grisham (67 users), and one cluster containing 46 books written by Stephen King (12 users). pcv also returns two clusters containing a single book each: The Da Vinci Code by Dan Brown (215 users) and The Lovely Bones by Alice Sebold (261 users).
4News. For this dataset we observe that it is useful to set the parameter k in the SVD and in the call to k-means to different values. This way, we can obtain more general or more specific clusters: setting the value of k for k-means to a smaller (larger) value creates fewer (more) clusters on the left side of the graph. This also makes the right-side clusters more general (specific).
We used k = 30 for the SVD and k = 50 for k-means to obtain relatively specific clusters. The value of k is this large because the dataset contains many outliers, which create a lot of left-side clusters of size 1. Further, we set θ = 0.3, L = 10.
For each of the four newsgroups, pcv finds clusters. In total, pcv finds five clusters, one of which has an empty right side (225 posts). The cluster (18 posts) returned for the cryptography newsgroup is public, system, govern, encrypt, decrypt, ke(y), secur(ity), person, escrow, clipper, chip (a clipper chip is an encryption device developed by the NSA). For the medicine newsgroup, pcv finds the cluster (24 posts) question, stud(y), year, effect, result, ve, call, doctor, patient, medic, read, level, peopl(e), thing.
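As an aside, the two-step use of k described above (a rank-k SVD of the biadjacency matrix, then k-means with a possibly different number of centers on the projected rows) can be sketched as follows. This is a minimal illustration under the stated assumptions; the function name and the use of NumPy and scikit-learn are our own choices, not the exact implementation used in the experiments.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_left_vertices(A, k_svd, k_means, seed=0):
    """Illustrative sketch: embed the rows of the biadjacency matrix A via a
    rank-k_svd SVD, then cluster the embeddings with k-means using k_means
    centers. k_svd controls how much of A's structure is kept; k_means
    controls how many (and hence how specific) left-side clusters arise."""
    # Truncated SVD: keep only the k_svd largest singular values/vectors.
    U, S, _ = np.linalg.svd(A, full_matrices=False)
    embedding = U[:, :k_svd] * S[:k_svd]  # low-dimensional row embeddings
    km = KMeans(n_clusters=k_means, n_init=10, random_state=seed)
    return km.fit_predict(embedding)
```

On data with a clean planted block structure, the returned labels coincide with the planted left clusters; on real data, choosing k_means larger than k_svd (as with k = 30 and k = 50 above) lets outliers split off into tiny clusters of their own.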
The cluster (19 posts) concept, system, orbit, space, year, nasa, cost, project, high, launch, da(y), part, peopl(e) explains the topics of the outer space newsgroup well. For the Christianity newsgroup we obtain the cluster (24 posts) christian, bibl(e), read, rutger, god, peopl(e), thing.

1http://www2.informatik.uni-freiburg.de/~cziegler/BX/

Acknowledgements

I wish to thank the anonymous reviewers for their helpful comments and for pointing out a heuristic to estimate parameters of the algorithm. I am grateful to my advisor Monika Henzinger for her support and for helpful discussions, to Pan Peng for valuable conversations about SBMs, and to Pauli Miettinen and Jilles Vreeken for getting me interested in biclustering.
The author gratefully acknowledges the financial support from the Doctoral Programme “Vienna Graduate School on Computational Optimization”, which is funded by the Austrian Science Fund (FWF, project no. W1260-N35).

References
[1] Emmanuel Abbe. Community detection and stochastic block models: recent developments. CoRR, abs/1703.10146, 2017.
[2] Emmanuel Abbe and Colin Sandon. Community detection in general stochastic block models: Fundamental limits and efficient algorithms for recovery. In FOCS, pages 670–688, 2015.
[3] Emmanuel Abbe and Colin Sandon. Recovering communities in the general stochastic block model without knowing the parameters. In NIPS, pages 676–684, 2015.
[4] Thomas Baignères, Pascal Junod, and Serge Vaudenay. How far can we go beyond linear cryptanalysis? In ASIACRYPT, pages 432–450, 2004.
[5] Frank Ban, Vijay Bhattiprolu, Karl Bringmann, Pavel Kolev, Euiwoong Lee, and David P. Woodruff. A PTAS for ℓp-low rank approximation. CoRR, abs/1807.06101, 2018. To appear in SODA'19.
[6] Yonatan Bilu and Nathan Linial. Are stable instances easy? In ICS, pages 332–341, 2010.
[7] L.
Sunil Chandran, Davis Issac, and Andreas Karrenbauer. On the parameterized complexity of biclique cover and partition. In IPEC, pages 11:1–11:13, 2016.
[8] Vincent Cohen-Addad and Chris Schwiegelshohn. On the local structure of stable clustering instances. In FOCS, pages 49–60, 2017.
[9] Anirban Dasgupta, John E. Hopcroft, Ravi Kannan, and Pradipta Prometheus Mitra. Spectral clustering with limited independence. In SODA, pages 1036–1045, 2007.
[10] Kemal Eren, Mehmet Deveci, Onur Küçüktunç, and Ümit V. Çatalyürek. A comparative analysis of biclustering algorithms for gene expression data. Briefings in Bioinformatics, 14(3):279–292, 2013.
[11] Laura Florescu and Will Perkins. Spectral thresholds in the bipartite stochastic block model. In COLT, pages 943–959, 2016.
[12] Fedor V. Fomin, Petr A. Golovach, Daniel Lokshtanov, Fahad Panolan, and Saket Saurabh. Approximation schemes for low-rank binary matrix approximation problems. CoRR, abs/1807.07156, 2018.
[13] M. Fortelius (coordinator). New and old worlds database of fossil mammals (NOW). Online. http://www.helsinki.fi/science/now/, 2003. Accessed: 2015-09-23.
[14] Chao Gao, Zongming Ma, Anderson Y. Zhang, and Harrison H. Zhou. Achieving optimal misclassification proportion in stochastic block models. JMLR, 18:60:1–60:45, 2017.
[15] Bruce E. Hajek, Yihong Wu, and Jiaming Xu. Computational lower bounds for community detection on random graphs. In COLT, pages 899–928, 2015.
[16] Amit Kumar and Ravindran Kannan. Clustering with spectral norm and the k-means algorithm. In FOCS, pages 299–308, 2010.
[17] Shiau Hong Lim, Yudong Chen, and Huan Xu. A convex optimization framework for biclustering. In ICML, pages 1679–1688, 2015.
[18] Frank McSherry. Spectral partitioning of random graphs. In FOCS, pages 529–537, 2001.
[19] Pauli Miettinen.
Matrix decomposition methods for data mining: Computational complexity and algorithms. PhD thesis, Helsingin yliopisto, 2009.
[20] Pauli Miettinen, Taneli Mielikäinen, Aristides Gionis, Gautam Das, and Heikki Mannila. The discrete basis problem. IEEE Trans. Knowl. Data Eng., 20(10):1348–1362, 2008.
[21] Pauli Miettinen and Jilles Vreeken. MDL4BMF: minimum description length for boolean matrix factorization. TKDD, 8(4):18:1–18:31, 2014.
[22] Pradipta Mitra. A simple algorithm for clustering mixtures of discrete distributions. Online. https://sites.google.com/site/ppmitra/invariant.pdf.
[23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. JMLR, 12:2825–2830, 2011.
[24] Siamak Ravanbakhsh, Barnabás Póczos, and Russell Greiner. Boolean matrix factorization and noisy completion via message passing. In ICML, pages 945–954, 2016.
[25] Tammo Rukat, Christopher C. Holmes, Michalis K. Titsias, and Christopher Yau. Bayesian boolean matrix factorisation. In ICML, pages 2969–2978, 2017.
[26] Tammo Rukat, Christopher C. Holmes, and Christopher Yau. Probabilistic boolean tensor decomposition. In ICML, pages 4410–4419, 2018.
[27] Jiaming Xu, Rui Wu, Kai Zhu, Bruce E. Hajek, R. Srikant, and Lei Ying. Jointly clustering rows and columns of binary matrices: algorithms and trade-offs. In SIGMETRICS, pages 29–41, 2014.
[28] Zhixin Zhou and Arash A. Amini. Analysis of spectral clustering algorithms for community detection: the general bipartite setting. CoRR, abs/1803.04547, 2018.
[29] Zhixin Zhou and Arash A. Amini. Optimal bipartite network clustering. CoRR, abs/1803.06031, 2018.
[30] Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, and Georg Lausen.
Improving recommendation lists through topic diversification. In WWW, pages 22–32, 2005.