{"title": "The Product Cut", "book": "Advances in Neural Information Processing Systems", "page_first": 3792, "page_last": 3800, "abstract": "We introduce a theoretical and algorithmic framework for multi-way graph partitioning that relies on a multiplicative cut-based objective. We refer to this objective as the Product Cut. We provide a detailed investigation of the mathematical properties of this objective and an effective algorithm for its optimization. The proposed model has strong mathematical underpinnings, and the corresponding algorithm achieves state-of-the-art performance on benchmark data sets.", "full_text": "The Product Cut\n\nNanyang Technological University\n\nLoyola Marymount University\n\nFacebook AI Research\n\nThomas Laurent\n\nArthur Szlam\n\nLos Angeles\n\ntlaurent@lmu.edu\n\nNew York\n\naszlam@fb.com\n\nXavier Bresson\n\nSingapore\n\nxavier.bresson@ntu.edu.sg\n\nJames H. von Brecht\n\nCalifornia State University, Long Beach\n\nLong Beach\n\njames.vonbrecht@csulb.edu\n\nAbstract\n\nWe introduce a theoretical and algorithmic framework for multi-way graph parti-\ntioning that relies on a multiplicative cut-based objective. We refer to this objective\nas the Product Cut. We provide a detailed investigation of the mathematical proper-\nties of this objective and an effective algorithm for its optimization. The proposed\nmodel has strong mathematical underpinnings, and the corresponding algorithm\nachieves state-of-the-art performance on benchmark data sets.\n\neH(P)\n\nPcut(P) =\n\nIntroduction\n\nH(P) = \u2212 R(cid:88)\n\n1\nWe propose the following model for multi-way graph partitioning. Let G = (V, W ) denote a weighted\n(cid:81)R\ngraph, with V its vertex set and W its weighted adjacency matrix. We de\ufb01ne the Product Cut of a\npartition P = (A1, . . . , AR) of the vertex set V as\nr=1 Z(Ar, Ac\nr)\n\n(1)\nwhere \u03b8r = |Ar|/|V | denotes the relative size of a set. 
This model provides a distinctive way to\nincorporate classical notions of a quality partition. The non-linear, non-local function Z(Ar, Ac\nr) of\na set measures its intra- and inter-connectivity with respect to the graph. The entropic balance H(P)\nmeasures deviations of the partition P from a collection of sets (A1, . . . , AR) with equal size. In this\nway, the Product Cut optimization parallels the classical Normalized Cut optimization [10, 15, 13] in\nterms of its underlying notion of cluster, and it arises quite naturally as a multiplicative version of the\nNormalized Cut.\nNevertheless, the two models strongly diverge beyond the point of this super\ufb01cial similarity. We\nprovide a detailed analysis to show that (1) settles the compromise between cut and balance in a\nfundamentally different manner than classical objectives, such as the Normalized Cut or the Cheeger\nCut. The sharp inequalities\n\n\u03b8r log \u03b8r,\n\nr=1\n\n,\n\n0 \u2264 Ncut(P) \u2264 1\n\ne\u2212H(P) \u2264 Pcut(P) \u2264 1\n\n(2)\nsuccinctly capture this distinction; the Product Cut exhibits a non-vanishing lower bound while the\nNormalized Cut does not. We show analytically and experimentally that this distinction leads to\nsuperior stability properties and performance. From an algorithmic point-of-view, we show how\nto cast the minimization of (1) as a convex maximization program. This leads to a simple, exact\ncontinuous relaxation of the discrete problem that has a clear mathematical structure. We leverage this\nformulation to develop a monotonic algorithm for optimizing (1) via a sequence of linear programs,\nand we introduce a randomized version of this strategy that leads to a simple yet highly effective\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\falgorithm. We also introduce a simple version of Algebraic Multigrid (AMG) tailored to our problem\nthat allows us to perform each step of the algorithm at very low cost. 
On graphs that contain reasonably well-balanced clusters of medium scale, the algorithm provides a strong combination of accuracy and efficiency. We conclude with an experimental evaluation and comparison of the algorithm on real world data sets to validate these claims.

2 The Product Cut Model

We begin by introducing our notation and by describing the rationale underlying our model. We use G = (V, W) to denote a graph on n vertices V = {v_1, ..., v_n} with weighted edges W = {w_ij}_{i,j=1}^n that encode similarity between vertices. We denote partitions of the vertex set into R subsets as P = (A_1, ..., A_R), with the understanding that the sets A_r ⊂ V satisfy the covering constraint A_1 ∪ ... ∪ A_R = V, the non-overlapping constraint A_r ∩ A_s = ∅ (r ≠ s), and the non-triviality constraint A_r ≠ ∅. We use f, g, h, u, v to denote vertex functions f : V → ℝ, which we view interchangeably as functions f(v_i) and as n-vectors f ∈ ℝ^n. For A ⊂ V we use |A| for its cardinality and 1_A for its indicator function. Finally, for a given graph G = (V, W) we use D := diag(W 1_V) to denote the diagonal matrix of weighted vertex degrees.

The starting point for our model arises from a well-known and widely used property of the random walk on a graph. Namely, a random walker initially located in a cluster A is unlikely to leave that cluster quickly [8]. Different approaches to quantifying this intuition lead to a variety of multi-way partitioning strategies for graphs [11, 12, 1]. The personalized page-rank methodology provides an example of this approach. Following [1], given a scalar 0 < α < 1 and a non-empty vertex subset A we define

    M_α := (Id − α W D^{−1})/(1 − α),    pr_A := M_α^{−1} 1_A/|A|    (3)

as its personalized page-rank vector.
As 1_A/|A| is the uniform distribution on the set A and W D^{−1} is the transition matrix of the random walk on the graph, pr_A corresponds to the stationary distribution of a random walker that, at each step, moves with probability α to a neighboring vertex by a usual random walk, and teleports with probability (1 − α) to the set A. If A has a reasonable cluster structure, then pr_A will concentrate on A and assign low probabilities to its complement. Given a high-quality partition P = (A_1, ..., A_R) of V, we therefore expect that σ_{i,r} := pr_{A_r}(v_i) should achieve its maximal value over 1 ≤ r ≤ R when r = r(i) is the class of the i-th vertex.

Viewed from this perspective, we can formulate an R-way graph partitioning problem as the task of selecting P = (A_1, ..., A_R) to maximize some combination of the collection {σ_{i,r(i)} : i ∈ V} of page-rank probabilities generated by the partition. Two intuitive options immediately come to mind, the arithmetic and geometric means of the collection:

    Maximize (1/n) ∑_r ∑_{v_i ∈ A_r} pr_{A_r}(v_i) over all partitions (A_1, ..., A_R) of V into R sets,    (4)

    Maximize (∏_r ∏_{v_i ∈ A_r} pr_{A_r}(v_i))^{1/n} over all partitions (A_1, ..., A_R) of V into R sets.    (5)

The first option corresponds to a straightforward variant of the classical Normalized Cut. The second option leads to a different type of cut-based objective that we term the Product Cut. The underlying reason for considering (5) is quite natural. If we view each pr_{A_r} as a probability distribution, then (5) corresponds to a formal likelihood of the partition. This proves quite analogous to re-formulating the classical k-means objective for partitioning n data points (x_1, ..., x_n) into R clusters (A_1, . . .
, A_R) in terms of maximizing a likelihood ∏_{r=1}^R ∏_{v_i ∈ A_r} exp(−‖x_i − m_r‖² / 2σ_r²) of Gaussian densities. While the Normalized Cut variant (4) is certainly popular, we show that it suffers from several defects that the Product Cut resolves. As the Product Cut can be effectively optimized and generally leads to higher quality partitions, it therefore provides a natural alternative.

To make these ideas precise, let us define the α-smoothed similarity matrix as Ω_α := M_α^{−1} and use {ω_ij}_{i,j=1}^n to denote its entries. Thus ω_ij = (M_α^{−1} 1_{v_j})_i = pr_{{v_j}}(v_i), and so ω_ij gives a non-local measure of similarity between the vertices v_i and v_j by means of the personalized page-rank diffusion process. The matrix Ω_α is column stochastic, non-symmetric, non-sparse, and has diagonal entries greater than (1 − α). Given a partition P = (A_1, ..., A_R), we define

    Pcut(P) := ∏_{r=1}^R Z(A_r, A_r^c)^{1/n} / e^{H(P)}    and    Ncut(P) := (1/R) ∑_{r=1}^R Cut(A_r, A_r^c)/Vol(A_r)    (6)

as its Product Cut and Normalized Cut, respectively. The non-linear, non-local function

    Z(A, A^c) := ∏_{v_i ∈ A} ( 1 + ∑_{j ∈ A^c} ω_ij / ∑_{j ∈ A} ω_ij )    (7)

of a set measures its intra- and inter-connectivity with respect to the graph, while H(P) denotes the entropic balance (1). The definitions of

    Cut(A, A^c) = ∑_{i ∈ A^c} ∑_{j ∈ A} ω_ij    and    Vol(A) = ∑_{i ∈ V} ∑_{j ∈ A} ω_ij

are standard. A simple computation then shows that maximizing the geometric average (5) is equivalent to minimizing the Product Cut, while maximizing the arithmetic average (4) is equivalent to minimizing the Normalized Cut. At a superficial level, both models wish to achieve the same goal.
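The quantities entering the Product Cut definition (6) can be evaluated directly with dense linear algebra on a small graph. The following NumPy sketch forms Ω_α explicitly and computes Pcut(P); the function and variable names are ours, and a practical implementation would approximately solve systems with M_α (the paper uses AMG) rather than invert it.

```python
import numpy as np

def product_cut(W, partition, alpha=0.9):
    """Evaluate Pcut(P) from definition (6) by dense linear algebra.

    W: symmetric weighted adjacency matrix, shape (n, n).
    partition: list of index arrays (A_1, ..., A_R).
    Illustrative sketch only: forming Omega_alpha = M_alpha^{-1} costs O(n^3).
    """
    n = W.shape[0]
    d = W.sum(axis=0)                                # weighted degrees, D = diag(W 1_V)
    M = (np.eye(n) - alpha * W / d) / (1 - alpha)    # M_alpha = (Id - alpha W D^{-1}) / (1 - alpha)
    Omega = np.linalg.inv(M)                         # alpha-smoothed similarity, omega_ij = Omega[i, j]
    H = -sum(len(A) / n * np.log(len(A) / n) for A in partition)   # entropic balance H(P)
    log_Z = 0.0
    for A in partition:
        Ac = np.setdiff1d(np.arange(n), A)
        intra = Omega[np.ix_(A, A)].sum(axis=1)      # sum_{j in A}   omega_ij, one value per v_i in A
        inter = Omega[np.ix_(A, Ac)].sum(axis=1)     # sum_{j in A^c} omega_ij
        log_Z += np.log1p(inter / intra).sum() / n   # log of Z(A, A^c)^{1/n}
    return np.exp(log_Z - H)
```

For two equal-size cliques with no connecting edges the clusters are mutually disconnected, so the value collapses to the lower bound e^{−H(P)} = 1/2; adding a weak edge between them moves the value strictly above that bound while remaining at most 1.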
The numerator of the Product Cut aims at a partition in which each vertex is weakly connected to vertices from other clusters and strongly connected to vertices from its own cluster. The entropy H(P) in the denominator is maximal when |A_1| = |A_2| = ... = |A_R|, and so aims at a well-balanced partition of the vertices. The objective (5) therefore promotes partitions with strongly intra-connected clusters and weakly inter-connected clusters that have comparable size. The Normalized Cut, defined here on Ω_α but usually posed over the original similarity matrix W, is exceedingly well-known [10, 15] and also aims at finding a good balance between low cut value and clusters of comparable sizes.

Despite this apparent parallel between the Product and Normalized Cuts, the two objectives behave quite differently both in theory and in practice. To illustrate this discrepancy at a high level, note first that the following sharp bounds

    0 ≤ Ncut(P) ≤ 1    (8)

hold for the Normalized Cut. The lower bound is attained for partitions P in which the clusters are mutually disconnected. For the Product Cut, we have

Theorem 1 The following inequality holds for any partition P:

    e^{−H(P)} ≤ Pcut(P) ≤ 1.    (9)

Moreover, the lower bound is attained for partitions P in which the clusters are mutually disconnected.

The lower bound in (9) can be directly read from (6) and (7), while the upper bound is non-trivial and proved in the supplementary material. This theorem goes to the heart of the difference between the Product and Normalized Cuts. To illustrate this, let P^(k) denote a sequence of partitions. Then (9) shows that

    lim_{k→∞} H(P^(k)) = 0 ⇒ lim_{k→∞} Pcut(P^(k)) = 1.    (10)

In other words, an arbitrarily ill-balanced partition leads to arbitrarily poor values of its Product Cut. The Normalized Cut does not possess this property.
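The entropic mechanism behind (10) is easy to see numerically: as a two-cluster partition becomes ill-balanced, H(P) → 0, so the lower bound e^{−H(P)} from (9), and hence the Product Cut itself, is pushed toward the worst possible value 1. A small sketch (the helper name is ours):

```python
import numpy as np

def entropic_balance(sizes):
    """H(P) = -sum_r theta_r log theta_r for cluster sizes |A_r|."""
    theta = np.asarray(sizes, dtype=float) / np.sum(sizes)
    return -np.sum(theta * np.log(theta))

# As one of two clusters absorbs nearly all vertices, e^{-H(P)} climbs toward 1:
for sizes in [(50, 50), (90, 10), (99, 1)]:
    print(sizes, np.exp(-entropic_balance(sizes)))   # ~0.50, ~0.72, ~0.95
```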
As an extreme but easy-to-analyze example, consider the case where G = (V, W) is a collection of isolated vertices. All possible partitions P consist of mutually disconnected clusters, and the lower bound is attained for both (8) and (9). Thus Ncut(P) = 0 for all P, and so all partitions are equivalent for the Normalized Cut. On the other hand Pcut(P) = e^{−H(P)}, which shows that, in the absence of "cut information," the Product Cut will choose the partition that maximizes the entropic balance. So in this case, any partition P for which |A_1| = ... = |A_R| will be a minimizer. In essence, this tighter lower bound for the Product Cut reflects its stronger balancing effect vis-a-vis the Normalized Cut.

2.1 (In-)Stability Properties of the Product Cut and Normalized Cut

In practice, the stronger balancing effect of the Product Cut manifests as a stronger tolerance to perturbations. We now delve deeper and contrast the two objectives by analyzing their stability properties using experimental data as well as a simplified model problem that isolates the source of the inherent difficulties.

Figure 1: The graphs G^0_n used for analyzing stability. (a) A_n in blue, B_n in green, C in red. (b) P^{0,good}_n = (A_n, B_n ∪ C). (c) P^{0,bad}_n = (A_n ∪ B_n, C).

Invoking ideas from dynamical systems theory, we say an objective is stable if an infinitesimal perturbation of a graph G = (V, W) leads to an infinitesimal perturbation of the optimal partition. If an infinitesimal perturbation leads to a dramatic change in the optimal partition, then the objective is unstable.

We use a simplified model to study the stability of the Product Cut and Normalized Cut objectives. Consider a graph G_n = (V_n, W_n) made of two clusters A_n and B_n containing n vertices each.
Each vertex in G_n has degree k and is connected to μk vertices in the opposite cluster, where 0 ≤ μ ≤ 1. The graph G^0_n is a perturbation of G_n constructed by adding a small cluster C of size n_0 ≪ n to the original graph. Each vertex of C has degree k_0 and is connected to μ_0 k_0 vertices in B_n and (1 − μ_0) k_0 vertices in C for some 0 ≤ μ_0 ≤ 1. In the perturbed graph G^0_n, a total of n_0 vertices in B_n are linked to C and have degree k + μ_0 k_0. See figure 1(a). The main properties of G_n and G^0_n are

• Unperturbed graph G_n: |A_n| = |B_n| = n, Cond_{G_n}(A_n) = μ, Cond_{G_n}(B_n) = μ.
• Perturbed graph G^0_n: |A_n| = |B_n| = n, Cond_{G^0_n}(A_n) = μ, Cond_{G^0_n}(B_n) ≈ μ, |C| = n_0 ≪ n, Cond_{G^0_n}(C) = μ_0,

where Cond_G(A) = Cut(A, A^c)/min(|A|, |A^c|) denotes the conductance of a set. If we consider the parameters μ, μ_0, k, k_0, n_0 as fixed and look at the perturbed graph G^0_n in the limit n → ∞ of a large number of vertices, then as n becomes larger the degree of the bulk vertices remains constant while the size |C| of the perturbation becomes infinitesimal.

To examine the influence of this infinitesimal perturbation for each model, let P_n = (A_n, B_n) denote the desired partition of the unperturbed graph G_n, and let P^{0,good}_n = (A_n, B_n ∪ C) and P^{0,bad}_n = (A_n ∪ B_n, C) denote the partitions of the perturbed graph G^0_n depicted in figure 1(b) and 1(c), respectively. As P^{0,good}_n ≈ P_n, a stable objective will prefer P^{0,good}_n to P^{0,bad}_n, while any objective preferring the converse is unstable. A detailed study of stability proves possible for this specific graph family. We summarize the conclusions of this analysis in the theorem below, which shows that the Normalized Cut is unstable in certain parameter regimes while the Product Cut is always stable.
The supplementary material contains the proof.

Theorem 2 Suppose that μ, μ_0, k, k_0, n_0 are fixed. Then

    μ_0 < 2μ ⇒ Ncut_{G^0_n}(P^{0,good}_n) > Ncut_{G^0_n}(P^{0,bad}_n) for n large enough.    (11)
    Pcut_{G^0_n}(P^{0,good}_n) < Pcut_{G^0_n}(P^{0,bad}_n) for n large enough.    (12)

Statement (11) simply says that the large cluster A_n must have a conductance μ at least twice better than the conductance μ_0 of the small perturbation cluster C in order to prevent instability. Thus adding an infinitesimally small cluster with mediocre conductance (up to two times worse than the conductance of the main structure) has the potential of radically changing the partition selected by the Normalized Cut. Moreover, this result holds for the classical Normalized Cut, its smoothed variant (4), as well as for similar objectives such as the Cheeger Cut and Ratio Cut. Conversely, (12) shows that adding an infinitesimally small cluster will not affect the partition selected by the Product Cut. The proof, while lengthy, is essentially just theorem 1 in disguise. To see this, note that the sequence of partitions P^{0,bad}_n becomes arbitrarily ill-balanced, which from (10) implies lim_{n→∞} Pcut_{G^0_n}(P^{0,bad}_n) = 1. However, the unperturbed graph G_n grows in a self-similar fashion as n → ∞, and so the Product Cut of P_n remains approximately a constant, say γ, for all n. Thus Pcut_{G_n}(P_n) ≈ γ < 1 for n large enough, and Pcut_{G^0_n}(P^{0,good}_n) ≈ Pcut_{G_n}(P_n) since |C| is infinitesimal. Therefore Pcut_{G^0_n}(P^{0,good}_n) ≈ γ < 1. Comparing this upper bound with the fact that lim_{n→∞} Pcut_{G^0_n}(P^{0,bad}_n) = 1, we see that the Product Cut of P^{0,bad}_n eventually becomes larger than the Product Cut of P^{0,good}_n. While we execute this program in full only for the example above, this line of argument is fairly general, and similar stability estimates are possible for more general families of graphs.

                WEBKB4         WEBKB4         CITESEER       CITESEER
                (Pcut algo.)   (Ncut algo.)   (Pcut algo.)   (Ncut algo.)
    e^{−H(P)}   .2506          .7946          .1722          .7494
    Pcut(P)     .5335          .8697          .4312          .8309
    Ncut(P)     .5257          .5004          .5972          .5217

Figure 2: The Product and Normalized Cuts on WEBKB4 (R = 4 clusters) and CITESEER (R = 6 clusters). The pie charts visually depict the sizes of the clusters in each partition. In both cases, NCut returns a super-cluster while PCut returns a well-balanced partition. The NCut objective prefers the ill-balanced partitions while the PCut objective dramatically prefers the balanced partitions.

This general contrast between the Product Cut and the Normalized Cut extends beyond the realm of model problems, as the user familiar with off-the-shelf NCut codes likely knows. When provided with "dirty" graphs, for example an e-mail network or a text data set, NCut has the aggravating tendency to return a super-cluster. That is, NCut often returns a partition P = (A_1, ..., A_R) in which a single set A_r contains the vast majority of the vertices. Figure 2 illustrates this phenomenon. It compares the partitions obtained for NCut (computed on Ω_α using a modification of the standard spectral approximation from [15]) and for PCut (computed using the algorithm presented in the next section) on two graphs constructed from text data sets.
The NCut algorithm returns highly ill-balanced partitions containing a super-cluster, while PCut returns an accurate and well-balanced partition. Other strategies for optimizing NCut obtain similarly unbalanced partitions. As an example, using the algorithm from [9] with the original sparse weight matrix W leads to relative cluster sizes of 99.2%, 0.5%, 0.2% and 0.1% for WEBKB4, and 98.5%, 0.4%, 0.3%, 0.3%, 0.3% and 0.2% for CITESEER. As our theoretical results indicate, these unbalanced partitions result from the normalized cut criterion itself and not from the algorithm used to minimize it.

3 The Algorithm

Our strategy for optimizing the Product Cut relies on a popular paradigm for discrete optimization, i.e. exact relaxation. We begin by showing that the discrete, graph-based formulation (5) can be relaxed to a continuous optimization problem, specifically a convex maximization program. We then prove that this relaxation is exact, in the sense that optimal solutions of the discrete and continuous problems coincide. With an exact relaxation in hand, we may then appeal to continuous optimization strategies (rather than discrete or greedy ones) for optimizing the Product Cut. This general idea of exact relaxation is intimately coupled with convex maximization.

Assume that the graph G = (V, W) is connected. Then by taking the logarithm of (5) we see that (5) is equivalent to the problem

    Maximize ∑_{r=1}^R ∑_{i ∈ A_r} log( (Ω_α 1_{A_r})_i / |A_r| ) over all partitions P = (A_1, ..., A_R) of V into R non-empty subsets.    (P)

The relaxation of (P) then follows from the usual approach. We first encode sets A_r ⊊ V as binary vertex functions 1_{A_r}, then relax the binary constraint to arrive at a continuous program.
Given a vertex function f ∈ ℝ^n_+ with non-negative entries, we define the continuous energy e(f) as

    e(f) := ⟨f, log( Ω_α f / ⟨f, 1_V⟩ )⟩ if f ≠ 0, and e(0) = 0,

where ⟨·,·⟩ denotes the usual dot product in ℝ^n and the logarithm applies entrywise. As (Ω_α f)_i > 0 whenever f ≠ 0, the continuous energy is well-defined. After noting that ∑_r e(1_{A_r}) is simply the objective value in problem (P), we arrive at the following continuous relaxation

    Maximize ∑_{r=1}^R e(f_r) over all (f_1, ..., f_R) ∈ ℝ^n_+ × ... × ℝ^n_+ satisfying ∑_{r=1}^R f_r = 1_V,    (P-rlx)

where the non-negative cone ℝ^n_+ consists of all vectors in ℝ^n with non-negative entries.

The following theorem provides the theoretical underpinning for our algorithmic approach. It establishes convexity of the relaxed objective for connected graphs.

Theorem 3 Assume that G = (V, W) is connected. Then the energy e(f) is continuous, positive 1-homogeneous and convex on ℝ^n_+. Moreover, the strict convexity property

    e(θf + (1 − θ)g) < θe(f) + (1 − θ)e(g) for all θ ∈ (0, 1)

holds whenever f, g ∈ ℝ^n_+ are linearly independent.

The continuity of e(f) away from the origin as well as the positive one-homogeneity are obvious, while the continuity of e(f) at the origin is easy to prove. The proof of convexity of e(f), provided in the supplementary material, is non-trivial and relies heavily on the particular structure of Ω_α itself. With convexity of e(f) in hand, we may prove the main theorem of this section.

Theorem 4 (Equivalence of (P) and (P-rlx)) Assume that G = (V, W) is connected and that V contains at least R vertices. If P = (A_1, ..., A_R) is a global optimum of (P) then (1_{A_1}, ..., 1_{A_R}) is a global optimum of (P-rlx).
Conversely, if (f_1, ..., f_R) is a global optimum of (P-rlx) then (f_1, ..., f_R) = (1_{A_1}, ..., 1_{A_R}) where (A_1, ..., A_R) is a global optimum of (P).

Proof. By strict convexity, the solution of the maximization (P-rlx) occurs at the extreme points of the constraint set Σ = {(f_1, ..., f_R) : f_r ∈ ℝ^n_+ and ∑_{r=1}^R f_r = 1_V}. Any such extreme point takes the form (1_{A_1}, ..., 1_{A_R}), where necessarily A_1 ∪ ... ∪ A_R = V and A_r ∩ A_s = ∅ (r ≠ s) hold. It therefore suffices to rule out extreme points that have an empty set of vertices. But if A ≠ B are non-empty then 1_A, 1_B are linearly independent, and so the inequality e(1_A + 1_B) < e(1_A) + e(1_B) holds by strict convexity and one-homogeneity. Thus, given a partition of the vertices into R − 1 non-empty subsets and one empty subset, we can obtain a better energy by splitting one of the non-empty vertex subsets into two non-empty subsets. Thus any globally maximal partition cannot contain empty subsets. □

With theorems 3 and 4 in hand, we may now proceed to optimize (P) by searching for optima of its exact relaxation. We tackle the latter problem by leveraging sequential linear programming or gradient thresholding strategies for convex maximization. We may write (P-rlx) as

    Maximize E(F) subject to F ∈ C and ψ_i(F) = 0 for i = 1, ..., n    (13)

where F = (f_1, ..., f_R) is the optimization variable, E(F) is the convex energy to be maximized, C is the bounded convex set [0, 1]^n × ... × [0, 1]^n, and the n affine constraints ψ_i(F) = 0 correspond to the row-stochastic constraints ∑_{r=1}^R f_{i,r} = 1. Given a current feasible estimate F^k of the solution, we obtain the next estimate F^{k+1} by solving the linear program

    Maximize L^k(F) subject to F ∈ C and ψ_i(F) = 0 for i = 1, ..., n    (14)

where L^k(F) = E(F^k) + ⟨∇E(F^k), F − F^k⟩ is the linearization of the energy E(F) around the current iterate. By convexity of E(F), this strategy monotonically increases E(F^k), since E(F^{k+1}) ≥ L^k(F^{k+1}) ≥ L^k(F^k) = E(F^k). The iterates F^k therefore encode a sequence of partitions of V that monotonically increase the energy at each step. Either the current iterate maximizes the linear form, in which case first-order optimality holds, or else the subsequent iterate produces a partition with a strictly larger objective value. The latter case can occur only a finite number of times, as only a finite number of partitions exist. Thus the sequence F^k converges after a finite number of iterations.

Algorithm 1 Randomized SLP for PCut
    Initialization: (f^0_1, ..., f^0_R) = (1_{A_1}, ..., 1_{A_R}) for (A_1, ..., A_R) a random partition of V
    for k = 0 to maxiter do
        for r = 1 to R do
            Set f̂_r = f^k_r / (∑_{i=1}^n f^k_{i,r}), then solve M_α u_r = f̂_r
            Set g_{i,r} = f_{i,r}/u_{i,r} for i = 1, ..., n, then solve M_α^T v_r = g_r
            Set h_r = log u_r + v_r − 1
        end for
        Choose s_k vertices at random and let I ⊂ V be these vertices.
        for all i ∈ V do
            If i ∈ I then f^{k+1}_{i,r} = 1 if r = arg max_s h_{i,s}, and 0 otherwise.
            If i ∉ I then f^{k+1}_{i,r} = 1 if h_{i,r} > 0, and 0 otherwise.
        end for
    end for

While simple and easy to implement, this algorithm suffers from a severe case of early termination. When initialized from a random partition, the iterates F^k almost immediately converge to a poor-quality solution. We may rescue this poor-quality algorithm and convert it to a highly effective one, while maintaining its simplicity, by randomizing the LP (14) at each step in the following way.
At step k we solve the LP

    Maximize L^k(F) subject to F ∈ C and ψ_i(F) = 0 for i ∈ I_k,    (15)

where the set I_k is a random subset of {1, 2, ..., n} obtained by drawing s_k constraints uniformly at random without replacement. The LP (15) is therefore a version of LP (14) in which we have dropped a random set of constraints. If we start by enforcing a small number s_k of constraints and slowly increment this number s_{k+1} = s_k + Δs_k as the algorithm progresses, we allow the algorithm time to explore the energy landscape. Enforcing more constraints as the iterates progress ensures that (15) eventually coincides with (14), so convergence of the iterates F^k of the randomized algorithm is still guaranteed. The attraction is that LP (15) has a simple, closed-form solution given by a variant of gradient thresholding. We derive the closed-form solution of LP (15) in section 1 of the supplementary material, and this leads to Algorithm 1 above.

The overall effectiveness of this strategy relies on two key ingredients. The first is a proper choice of the number of constraints s_k to enforce at each step. Selecting the rate at which s_k increases is similar, in principle, to selecting a learning-rate schedule for a stochastic gradient descent algorithm. If s_k increases too quickly then the algorithm will converge to poor-quality partitions. If s_k increases too slowly, the algorithm will find a quality solution but waste computational effort. A good rule of thumb is to increase s_k linearly at some constant rate Δs_k ≡ λ until all constraints are enforced, at which point we switch to the deterministic algorithm and terminate the process at convergence. The second key ingredient involves approximating solutions to the linear system M_α x = b quickly. We use a simple Algebraic Multigrid (AMG) technique, i.e. a stripped-down version of [7] or [6], to accomplish this.
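A compact NumPy sketch of the randomized iteration of Algorithm 1, under stated assumptions: dense `np.linalg.solve` stands in for the paper's AMG solver, `lam` plays the role of the constant rate Δs_k ≡ λ, and the small clipping constants guard against empty clusters. All names are ours; this is an illustration, not the released implementation.

```python
import numpy as np

def randomized_slp_pcut(W, R, alpha=0.9, maxiter=200, lam=1, seed=0):
    """Sketch of the randomized SLP for PCut (Algorithm 1)."""
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    M = (np.eye(n) - alpha * W / W.sum(axis=0)) / (1 - alpha)   # M_alpha
    F = np.zeros((n, R))
    F[np.arange(n), rng.integers(0, R, size=n)] = 1.0           # random initial partition
    s = lam
    for k in range(maxiter):
        fhat = F / np.maximum(F.sum(axis=0, keepdims=True), 1e-12)
        U = np.maximum(np.linalg.solve(M, fhat), 1e-12)         # solve M_alpha u_r = fhat_r
        Vv = np.linalg.solve(M.T, F / U)                        # solve M_alpha^T v_r = g_r
        Hg = np.log(U) + Vv - 1.0                               # h_r = log u_r + v_r - 1
        s_k = min(n, s)
        s += lam                                                # linear constraint schedule
        I = rng.choice(n, size=s_k, replace=False)
        Fn = (Hg > 0).astype(float)                             # unconstrained rows: threshold at 0
        Fn[I] = 0.0
        Fn[I, Hg[I].argmax(axis=1)] = 1.0                       # constrained rows: argmax of h
        if s_k == n and np.array_equal(Fn, F):
            break                                               # deterministic phase converged
        F = Fn
    return F.argmax(axis=1)
```

Once s_k reaches n every row is one-hot, so the final iterate encodes a genuine partition and the loop terminates at a first-order optimal point of LP (14).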
The main insight here is that exact solutions of M_α x = b are not needed, but not all approximate solutions are effective. We need an approximate solution x that has non-zero entries on all of V for thresholding to succeed, and this can be accomplished by AMG at very little cost.

4 Experiments

We conclude our study of the Product Cut model by presenting extensive experimental evaluation of the algorithm.¹ We intend these experiments to highlight the fact that, in addition to a strong theoretical model, the algorithm itself leads to state-of-the-art performance in terms of cluster purity on a variety of real world data sets. We provide experimental results on four text data sets (20NEWS, RCV1, WEBKB4, CITESEER) and four data sets containing images of handwritten digits (MNIST, PENDIGITS, USPS, OPTDIGITS). We provide the source for these data sets and details on their construction in the supplementary material. We compare our method against partitioning algorithms that, like the Product Cut, rely on graph-cut objective principles and that partition the graph in a direct, non-recursive manner.

¹The code is available at https://github.com/xbresson/pcut

Table 1: Algorithmic Comparison via Cluster Purity.

                  20NEWS  RCV1  WEBKB4  CITESEER  MNIST  PENDIGITS  USPS  OPTDIGITS
    size          20K     9.6K  4.2K    3.3K      70K    11K        9.3K  5.6K
    R             20      4     4       6         10     10         10    10
    RND           6       30    39      22        11     12         17    12
    NCUT          27      38    40      23        77     80         72    91
    LSD           34      38    46      53        76     86         70    91
    MTV           36      43    45      43        96     95         85    87
    GRACLUS       42      42    49      54        97     85         87    94
    NMFR          61      43    58      63        97     98         86    87
    PCut (.9,λ1)  61      53    58      63        97     87         89    98
    PCut (.9,λ2)  60      50    57      64        96     95         89    84
The NCut algorithm [15] is a widely used spectral algorithm that relies on a post-processing of the eigenvectors of the graph Laplacian to optimize the Normalized Cut energy. The NMFR algorithm [14] uses a graph-based random walk variant of the Normalized Cut. The LSD algorithm [2] provides a non-negative matrix factorization algorithm that relies upon a trace-based relaxation of the Normalized Cut objective. The MTV algorithm from [3] and the balanced k-cut algorithm from [9] provide total-variation based algorithms that attempt to find an optimal multi-way Cheeger cut of the graph by using ℓ1 optimization techniques. Both algorithms optimize the same objective and achieve similar purity values; we report results for [3] only. The GRACLUS algorithm [4, 5] uses a multi-level coarsening approach to optimize the NCut objective as formulated in terms of kernel k-means. Table 1 reports the accuracy obtained by these algorithms for each data set. We use cluster purity to quantify the quality of the calculated partition, defined according to the relation

    Purity = (1/n) ∑_{r=1}^R max_{1≤s≤R} n_{r,s},

where n_{r,s} denotes the number of data points in the r-th cluster that belong to the s-th ground-truth class.
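The purity score used throughout Table 1 can be computed as follows; this is the standard implementation of the measure, with names of our choosing.

```python
import numpy as np

def cluster_purity(labels, truth):
    """Fraction of points assigned to the majority ground-truth class
    of their own cluster (the standard purity measure)."""
    labels, truth = np.asarray(labels), np.asarray(truth)
    correct = 0
    for r in np.unique(labels):
        _, counts = np.unique(truth[labels == r], return_counts=True)
        correct += counts.max()          # majority class count within cluster r
    return correct / labels.size
```

A perfect partition (up to relabeling of the clusters) has purity 1, while a single super-cluster scores only the share of the largest ground-truth class, which is why the ill-balanced NCut partitions above fare poorly under this measure.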