{"title": "Convergence and Energy Landscape for Cheeger Cut Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 1385, "page_last": 1393, "abstract": "Unsupervised clustering of scattered, noisy and high-dimensional data points is an important and difficult problem. Continuous relaxations of balanced cut problems yield excellent clustering results. This paper provides rigorous convergence results for two algorithms that solve the relaxed Cheeger Cut minimization. The first algorithm is a new steepest descent algorithm and the second one is a slight modification of the Inverse Power Method algorithm \\cite{pro:HeinBuhler10OneSpec}. While the steepest descent algorithm has better theoretical convergence properties, in practice both algorithm perform equally. We also completely characterize the local minima of the relaxed problem in terms of the original balanced cut problem, and relate this characterization to the convergence of the algorithms.", "full_text": "Convergence and Energy Landscape for Cheeger Cut\n\nClustering\n\nXavier Bresson\n\nCity University of Hong Kong\n\nHong Kong\n\nxbresson@cityu.edu.hk\n\nThomas Laurent\n\nUniversity of California, Riversize\n\nRiverside, CA 92521\n\nlaurent@math.ucr.edu\n\nDavid Uminsky\n\nUniversity of San Francisco\nSan Francisco, CA 94117\nduminsky@usfca.edu\n\nJames H. von Brecht\n\nUniversity of California, Los Angeles\n\nLos Angeles, CA 90095\njub@math.ucla.edu\n\nAbstract\n\nThis paper provides both theoretical and algorithmic results for the (cid:96)1-relaxation\nof the Cheeger cut problem. The (cid:96)2-relaxation, known as spectral clustering, only\nloosely relates to the Cheeger cut; however, it is convex and leads to a simple op-\ntimization problem. The (cid:96)1-relaxation, in contrast, is non-convex but is provably\nequivalent to the original problem. 
The ℓ1-relaxation therefore trades convexity for exactness, yielding improved clustering results at the cost of a more challenging optimization. The first challenge is understanding convergence of algorithms. This paper provides the first complete proof of convergence for algorithms that minimize the ℓ1-relaxation. The second challenge entails comprehending the ℓ1-energy landscape, i.e. the set of possible points to which an algorithm might converge. We show that ℓ1-algorithms can get trapped in local minima that are not globally optimal, and we provide a classification theorem to interpret these local minima. This classification gives meaning to these suboptimal solutions and helps to explain, in terms of graph structure, when the ℓ1-relaxation provides the solution of the original Cheeger cut problem.\n\n
1 Introduction\n\nPartitioning data points into sensible groups is a fundamental problem in machine learning. Given a set of data points V = {x1, ···, xn} and similarity weights {wi,j}1≤i,j≤n, we consider the balanced Cheeger cut problem [4]:\n\n
Minimize C(S) = ( Σ_{xi ∈ S, xj ∈ Sc} wi,j ) / min(|S|, |Sc|) over all nonempty subsets S ⊊ V. (1)\n\n
Here |S| denotes the number of data points in S and Sc is the complementary set of S in V. While this problem is NP-hard, it has the following exact continuous ℓ1-relaxation:\n\n
Minimize E(f) = ( (1/2) Σ_{i,j} wi,j |fi − fj| ) / ( Σ_i |fi − med(f)| ) over all non-constant functions f : V → R. (2)\n\n
Here med(f) denotes the median of f ∈ Rn and fi ≡ f(xi). Recently, various algorithms have been proposed [12, 6, 7, 1, 9, 5] to minimize ℓ1-relaxations of the Cheeger cut (1) and of other related problems. Typically these ℓ1-algorithms provide excellent unsupervised clustering results and improve upon the standard ℓ2 (spectral clustering) method [10, 13] in terms of both Cheeger energy and classification error. However, complete theoretical guarantees of convergence for such algorithms do not exist. This paper provides the first proofs of convergence for ℓ1-algorithms that attempt to minimize (2).\n\n
In this work we consider two algorithms for minimizing (2). We present a new steepest descent (SD) algorithm and also consider a slight modification of the inverse power method (IPM) from [6]. We provide convergence results for both algorithms and also analyze the energy landscape. Specifically, we give a complete classification of local minima. This understanding of the energy landscape provides intuition for when and how the algorithms get trapped in local minima. Our numerical experiments show that the two algorithms perform equally well with respect to the quality of the achieved cut. Both algorithms produce state-of-the-art unsupervised clustering results. Finally, we remark that the SD algorithm has a better theoretical guarantee of convergence. This arises from the fact that the distance between two successive iterates necessarily converges to zero. In contrast, we cannot guarantee this holds for the IPM without further assumptions on the energy landscape. The simpler mathematical structure of the SD algorithm also provides better control of the energy descent.\n\n
Both algorithms take the form of a fixed point iteration f^{k+1} ∈ A(f^k), where f ∈ A(f) implies that f is a critical point.
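The two objectives can be made concrete on a toy graph. The sketch below (plain NumPy; the graph, weights and set S are invented for illustration and are not from the paper) evaluates the combinatorial energy C(S) of (1) and the relaxed energy E(f) of (2), and checks that E recovers C on the indicator function of S, which is the sense in which the relaxation is exact:

```python
import numpy as np

def cheeger_cut(W, S):
    """Combinatorial Cheeger energy C(S) = Cut(S, S^c) / min(|S|, |S^c|)."""
    n = W.shape[0]
    mask = np.zeros(n, dtype=bool)
    mask[list(S)] = True
    cut = W[mask][:, ~mask].sum()          # total weight crossing the cut
    return cut / min(mask.sum(), (~mask).sum())

def relaxed_energy(W, f):
    """Relaxed energy E(f) = (1/2) sum_ij w_ij |f_i - f_j| / sum_i |f_i - med(f)|."""
    T = 0.5 * np.sum(W * np.abs(f[:, None] - f[None, :]))   # total variation term
    B = np.sum(np.abs(f - np.median(f)))                    # balance term
    return T / B

# Toy 5-node weighted path: two clusters joined by a weak edge (weights invented).
W = np.zeros((5, 5))
for i, j, w in [(0, 1, 1.0), (1, 2, 0.1), (2, 3, 1.0), (3, 4, 1.0)]:
    W[i, j] = W[j, i] = w

S = {0, 1}
f = np.array([1.0, 1.0, 0.0, 0.0, 0.0])   # indicator of S; median zero since |S| < n/2

print(cheeger_cut(W, S))       # 0.05
print(relaxed_energy(W, f))    # 0.05 -- E agrees with C on the binary indicator
```

On a binary indicator of a set S with |S| ≤ n/2 the median is zero, so the balance term equals |S| and the total variation term equals the cut, reproducing (1) exactly.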
To prove convergence towards a fixed point typically requires three key ingredients: the first is monotonicity of A, that is E(z) ≤ E(f) for all z ∈ A(f); the second is some estimate that guarantees the successive iterates remain in a compact domain on which E is continuous; lastly, some type of continuity of the set-valued map A is required. For set-valued maps, closedness provides the correct notion of continuity [8]. Monotonicity of the IPM algorithm was proven in [6]. This property alone is not enough to obtain convergence, and the closedness property proves the most challenging ingredient to establish for the algorithms we consider. Section 2 elucidates the form these properties take for the SD and IPM algorithms. In Section 3 we show that if the iterates of either algorithm approach a neighborhood of a strict local minimum then both algorithms will converge to this minimum. We refer to this property as local convergence. When the energy is non-degenerate, Section 4 extends this local convergence to global convergence toward critical points for the SD algorithm by using the additional structure afforded by the gradient flow. In Section 5 we develop an understanding of the energy landscape of the continuous relaxation problem. For non-convex problems an understanding of local minima is crucial. We therefore provide a complete classification of the local minima of (2) in terms of the combinatorial local minima of (1) by means of an explicit formula. As a consequence of this formula, the problem of finding local minima of the combinatorial problem is equivalent to finding local minima of the continuous relaxation. The last section is devoted to numerical experiments.\n\nWe now present the SD algorithm.
Rewrite the Cheeger functional (2) as E(f) = T(f)/B(f), where the numerator T(f) is the total variation term and the denominator B(f) is the balance term. If T and B were differentiable, a mixed explicit-implicit gradient flow of the energy would take the form (f^{k+1} − f^k)/τ^k = −(∇T(f^{k+1}) − E(f^k)∇B(f^k))/B(f^k), where {τ^k} denotes a sequence of time steps. As T and B are not differentiable, particularly at the binary solutions of paramount interest, we must consider instead their subgradients\n\n
∂T(f) := {v ∈ Rn : T(g) − T(f) ≥ ⟨v, g − f⟩ ∀g ∈ Rn}, (3)\n
∂0B(f) := {v ∈ Rn : B(g) − B(f) ≥ ⟨v, g − f⟩ ∀g ∈ Rn and ⟨1, v⟩ = 0}. (4)\n\n
Here 1 ∈ Rn denotes the constant vector of ones. Also note that if f has zero median then B(f) = ‖f‖1 and ∂0B(f) = {v ∈ sign(f) : mean(v) = 0}. After an appropriate choice of time steps we arrive at the SD algorithm summarized in Table 1 (left), i.e. a non-smooth variation of steepest descent. A key property of the SD algorithm's iterates is that ‖f^{k+1} − f^k‖2 → 0. This property allows us to conclude global convergence of the SD algorithm in cases where we cannot conclude convergence for the IPM algorithm. We also summarize the IPM algorithm from [6] in Table 1 (right). Compared to the original algorithm from [6], we have added an extra step that projects onto the sphere S^{n−1}, that is f^{k+1} = h^k/‖h^k‖2.
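For a zero-median f the constrained subdifferential ∂0B(f) of (4) is never empty: one admissible element assigns ±1 on the support of f and a constant value on the zero set chosen so that the entries sum to zero. A minimal sketch of this construction (NumPy; the function name and the example vector are ours):

```python
import numpy as np

def subgradient_B(f, tol=1e-12):
    """One element v of the subdifferential of B(f) = ||f||_1 at a median-zero f,
    satisfying the extra zero-mean constraint <1, v> = 0 from (4)."""
    pos, neg = f > tol, f < -tol
    zero = ~(pos | neg)
    v = np.zeros_like(f)
    v[pos], v[neg] = 1.0, -1.0
    if zero.any():
        # spread the imbalance (n_minus - n_plus) over the zero set so sum(v) = 0
        v[zero] = (neg.sum() - pos.sum()) / zero.sum()
    return v

f = np.array([0.7, 0.7, 0.0, 0.0, -0.3])   # a median-zero example vector
v = subgradient_B(f)
print(v)            # entries 1, 1, -0.5, -0.5, -1
print(v.sum())      # 0.0
```

Because f has median zero, the constant placed on the zero set lies in [−1, 1], so v is a genuine subgradient of the ℓ1-norm with zero mean.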
While we do not think that this extra step is essential, it simplifies the proof of convergence. The successive iterates of both algorithms belong to the space\n\n
S0^{n−1} := {f ∈ Rn : ‖f‖2 = 1 and med(f) = 0}. (5)\n\n
Table 1: Left, A_SD: the SD algorithm. Right, A_IPM: the modified IPM algorithm [6].\n\n
A_SD (SD algorithm):\n
  f^0 a nonzero function with med(f^0) = 0; c a positive constant.\n
  while E(f^k) − E(f^{k+1}) ≥ TOL do\n
    v^k ∈ ∂0B(f^k)\n
    g^k = f^k + c v^k\n
    ĥ^k = arg min_{u ∈ Rn} T(u) + (E(f^k)/2c) ‖u − g^k‖2^2\n
    h^k = ĥ^k − med(ĥ^k) 1\n
    f^{k+1} = h^k/‖h^k‖2\n
  end while\n\n
A_IPM (modified IPM algorithm [6]):\n
  f^0 a nonzero function with med(f^0) = 0.\n
  while E(f^k) − E(f^{k+1}) ≥ TOL do\n
    v^k ∈ ∂0B(f^k)\n
    D^k = min_{‖u‖2 ≤ 1} T(u) − E(f^k)⟨u, v^k⟩\n
    g^k = arg min_{‖u‖2 ≤ 1} T(u) − E(f^k)⟨u, v^k⟩ if D^k < 0; g^k = f^k if D^k = 0\n
    h^k = g^k − med(g^k) 1\n
    f^{k+1} = h^k/‖h^k‖2\n
  end while\n\n
As the successive iterates have zero median, ∂0B(f^k) is never empty. For example, we can take v^k ∈ Rn so that v^k(xi) = 1 if f(xi) > 0, v^k(xi) = −1 if f(xi) < 0 and v^k(xi) = (n− − n+)/n0 if f(xi) = 0, where n+, n− and n0 denote the cardinalities of the sets {xi : f(xi) > 0}, {xi : f(xi) < 0} and {xi : f(xi) = 0}, respectively. Other possible choices also exist, so that v^k is not uniquely defined. This idea, i.e. choosing an element from the subdifferential with mean zero, was introduced in [6] and proves indispensable when dealing with median zero functions. As v^k is not uniquely defined in either algorithm, we must introduce the concepts of a set-valued map and a closed map, which is the proper notion of continuity in this context:\n\n
Definition 1 (Set-valued Map, Closed Maps). Let X and Y be two subsets of Rn. If for each x ∈ X there is a corresponding set F(x) ⊂ Y then F is called a set-valued map from X to Y.
We denote this by F : X ⇒ Y. The graph of F, denoted Graph(F), is defined by\n\n
Graph(F) = {(x, y) ∈ Rn × Rn : x ∈ X, y ∈ F(x)}.\n\n
A set-valued map F is called closed if Graph(F) is a closed subset of Rn × Rn.\n\n
With these notations in hand we can write f^{k+1} ∈ A_SD(f^k) (SD algorithm) and f^{k+1} ∈ A_IPM(f^k) (IPM algorithm), where A_SD, A_IPM : S0^{n−1} ⇒ S0^{n−1} are the appropriate set-valued maps. The notion of a closed map proves useful when analyzing the step ĥ^k ∈ H(f^k) in the SD algorithm. Particularly,\n\n
Lemma 1 (Closedness of H(f)). The following set-valued map H : S0^{n−1} ⇒ Rn is closed:\n\n
H(f) := arg min_u { T(u) + (E(f)/2c) ‖u − (f + c ∂0B(f))‖2^2 }.\n\n
Currently, we can only show that Lemma 1 holds at strict local minima for the analogous step, g^k, of the IPM algorithm. That Lemma 1 holds without this further restriction on f ∈ S0^{n−1} will allow us to demonstrate stronger global convergence results for the SD algorithm. Due to page limitations the supplementary material contains the proofs of all lemmas and theorems in this paper.\n\n
2 Properties of A_SD and A_IPM\n\n
This section establishes the required properties of the set-valued maps A_SD and A_IPM mentioned in the introduction. In Section 2.1 we first elucidate the monotonicity and compactness of A_SD and A_IPM. Section 2.2 demonstrates that a local notion of closedness holds for each algorithm. This form of closedness suffices to show local convergence toward isolated local minima (cf. Section 3). In particular, this more difficult and technical section is necessary as monotonicity alone does not guarantee this type of convergence.\n\n
2.1 Monotonicity and Compactness\n\n
We provide the monotonicity and compactness results for each algorithm in turn.
Lemmas 2 and 3 establish monotonicity and compactness for A_SD, while Lemmas 4 and 5 establish monotonicity and compactness for A_IPM.\n\n
Lemma 2 (Monotonicity of A_SD). Let f ∈ S0^{n−1} and define v, g, ĥ and h according to the SD algorithm. Then neither ĥ nor h is a constant vector. Moreover, the energy inequality\n\n
E(f) ≥ E(h) + (E(f)/B(h)) ‖ĥ − f‖2^2 / c (6)\n\n
holds. As a consequence, if z ∈ A_SD(f) then E(z) = E(h) < E(f) unless z = f.\n\n
Lemma 3 (Compactness of A_SD). Let f^0 ∈ S0^{n−1} and define a sequence of iterates (g^k, ĥ^k, h^k, f^{k+1}) according to the SD algorithm. Then for any such sequence\n\n
1 ≤ ‖g^k‖2 ≤ 1 + c√n, ‖ĥ^k‖2 ≤ ‖g^k‖2, and 0 < ‖h^k‖2 ≤ (1 + √n)‖ĥ^k‖2. (7)\n\n
Moreover, we have\n\n
med(ĥ^k) → 0, ‖ĥ^k − f^k‖2 → 0, ‖f^k − f^{k+1}‖2 → 0. (8)\n\n
Therefore S0^{n−1} attracts the sequences {ĥ^k} and {h^k}.\n\n
By the monotonicity result of Hein and Bühler [6] we have\n\n
Lemma 4 (Monotonicity of A_IPM). Let f ∈ S0^{n−1}. If z ∈ A_IPM(f) then E(z) < E(f) unless z = f.\n\n
To prove convergence for A_IPM using our techniques, we must also maintain control over the iterates after subtracting the median. This control is provided by the following lemma.\n\n
Lemma 5 (Compactness of A_IPM). Let f ∈ S0^{n−1} and define v, D, g and h according to the IPM.\n\n
1. The minimizer is unique when D < 0, i.e. g ∈ S^{n−1} is a single point.\n
2. 1 ≤ ‖h‖2 ≤ 1 + √n. In particular, A_IPM(f) is always well-defined for a given choice of v ∈ ∂0B(f).\n\n
2.2 Closedness Properties\n\n
The final ingredient to prove local convergence is some form of closedness.
We require closedness of the set-valued maps A at strict local minima of the energy. As the energy (2) is invariant under constant shifts and scalings, the usual notion of a strict local minimum on Rn does not apply. We must therefore remove the effects of these invariances when referring to a local minimum as strict. To this end, define the spherical and annular neighborhoods on S0^{n−1} by\n\n
Bε(f∞) := {‖f − f∞‖2 ≤ ε} ∩ S0^{n−1}, Aδ,ε(f∞) := {δ ≤ ‖f − f∞‖2 ≤ ε} ∩ S0^{n−1}.\n\n
With these in hand we introduce the proper definition of a strict local minimum.\n\n
Definition 2 (Strict Local Minima). Let f∞ ∈ S0^{n−1}. We say f∞ is a strict local minimum of the energy if there exists ε > 0 so that f ∈ Bε(f∞) and f ≠ f∞ imply E(f) > E(f∞).\n\n
This definition then allows us to formally define closedness at a strict local minimum in Definition 3. For the IPM algorithm this is the only form of closedness we are able to establish. Closedness at an arbitrary f ∈ S0^{n−1} (cf. Lemma 1) does in fact hold for the SD algorithm. Once again, this fact manifests itself in the stronger global convergence results for the SD algorithm in Section 4.\n\n
Definition 3 (CLM/CSLM Mappings). Let A(f) : S0^{n−1} ⇒ S0^{n−1} denote a set-valued mapping. We say A(f) is closed at local minima (CLM) if z^k ∈ A(f^k) and f^k → f∞ imply z^k → f∞ whenever f∞ is a local minimum of the energy. If z^k → f∞ holds only when f∞ is a strict local minimum then we say A(f) is closed at strict local minima (CSLM).\n\n
The CLM property for the SD algorithm, provided by Lemma 6, follows as a straightforward consequence of Lemma 1.
The CSLM property for the IPM algorithm, provided by Lemma 7, requires the additional hypothesis that the local minimum is strict.\n\n
Lemma 6 (CLM Property for A_SD). For f ∈ S0^{n−1} define g, ĥ and h according to the SD algorithm. Then A_SD(f) defines a CLM mapping.\n\n
Lemma 7 (CSLM Property for A_IPM). For f ∈ S0^{n−1} define v, D, g, h according to the IPM. Then A_IPM(f) defines a CSLM mapping.\n\n
3 Local Convergence of A_SD and A_IPM at Strict Local Minima\n\n
Due to the lack of convexity of the energy (2), at best we can only hope to obtain convergence to a local minimum of the energy. An analogue of Lyapunov's method from differential equations allows us to show that such convergence does occur provided the iterates reach a neighborhood of an isolated local minimum. To apply the lemmas from Section 2 we must assume that f∞ ∈ S0^{n−1} is a local minimum of the energy. We will assume further that f∞ is an isolated critical point of the energy according to the following definition.\n\n
Definition 4 (Isolated Critical Points). Let f ∈ S0^{n−1}. We say that f is a critical point of the energy E(f) if there exist w ∈ ∂T(f) and v ∈ ∂0B(f) so that 0 = w − E(f)v. This generalizes the usual quotient rule 0 = ∇T(f) − E(f)∇B(f). If there exists ε > 0 so that f is the only critical point in Bε(f) we say f is an isolated critical point of the energy.\n\n
Note that as any local minimum is a critical point of the energy, if f∞ is an isolated critical point and a local minimum then it is necessarily a strict local minimum. The CSLM property therefore applies.\n\n
Finally, to show convergence, the set-valued map A must possess one further property, i.e. the critical point property.\n\n
Definition 5 (Critical Point Property). Let A(f) : S0^{n−1} ⇒ S0^{n−1} denote a set-valued mapping.
We say that A(f) satisfies the critical point property (CP property) if, given any sequence satisfying f^{k+1} ∈ A(f^k), all limit points of {f^k} are critical points of the energy.\n\n
Analogously to the CLM property, for the SD algorithm the CP property follows as a direct consequence of Lemma 1. For the IPM algorithm it follows from closedness of the minimization step.\n\n
The proof of local convergence utilizes a version of Lyapunov's direct method for set-valued maps, and we adapt this technique from the strategy outlined in [8]. We first demonstrate that if any iterate f^k lies in a sufficiently small neighborhood Bγ(f∞) of the strict local minimum then all subsequent iterates remain in the neighborhood Bε(f∞) in which f∞ is an isolated critical point. By compactness and the CP property, any subsequence of {f^k} must have a further subsequence that converges to the only critical point in Bε(f∞), i.e. f∞. This implies that the whole sequence must converge to f∞ as well. We formalize this argument in Lemma 8 and its corollary, Theorem 1.\n\n
Lemma 8 (Lyapunov Stability at Strict Local Minima). Suppose A(f) is a monotonic, CSLM mapping. Fix f^0 ∈ S0^{n−1} and let {f^k} denote any sequence satisfying f^{k+1} ∈ A(f^k). If f∞ is a strict local minimum of the energy, then for any ε > 0 there exists a γ > 0 so that if f^0 ∈ Bγ(f∞) then {f^k} ⊂ Bε(f∞).\n\n
Theorem 1 (Local Convergence at Isolated Critical Points). Let A(f) : S0^{n−1} ⇒ S0^{n−1} denote a monotonic, CSLM, CP mapping. Let f^0 ∈ S0^{n−1} and suppose {f^k} is any sequence satisfying f^{k+1} ∈ A(f^k). Let f∞ denote a local minimum that is an isolated critical point of the energy.
If f^0 ∈ Bγ(f∞) for γ > 0 sufficiently small then f^k → f∞.\n\n
Note that both algorithms satisfy the hypotheses of Theorem 1, and therefore possess identical local convergence properties. A slight modification of the proof of Theorem 1 yields the following corollary, which also applies to both algorithms.\n\n
Corollary 1. Let f^0 ∈ S0^{n−1} be arbitrary, and define f^{k+1} ∈ A(f^k) according to either algorithm. If any accumulation point f* of the sequence {f^k} is both an isolated critical point of the energy and a local minimum, then the whole sequence f^k → f*.\n\n
4 Global Convergence for A_SD\n\n
To this point the convergence properties of both algorithms appear identical. However, we have yet to take full advantage of the superior mathematical structure afforded by the SD algorithm. In particular, from Lemma 3 we know that ‖f^{k+1} − f^k‖2 → 0 without any further assumptions regarding the initialization of the algorithm or the energy landscape. This fact combines with the fact that Lemma 1 also holds globally for f ∈ S0^{n−1} to yield Theorem 2. Once again, we arrive at this conclusion by adapting the proof from [8].\n\n
Theorem 2 (Convergence of the SD Algorithm). Take f^0 ∈ S0^{n−1} and fix a constant c > 0. Let {f^k} denote any sequence satisfying f^{k+1} ∈ A_SD(f^k). Then\n\n
1. Any accumulation point f* of the sequence is a critical point of the energy.\n
2. Either the sequence converges, or the set of accumulation points forms a continuum in S0^{n−1}.\n\n
We might hope to rule out the second possibility in statement 2 by showing that E can never have an uncountable number of critical points.
Unfortunately, we can exhibit (cf. the supplementary material) simple examples showing that a continuum of local or global minima can in fact occur. This degeneracy of a continuum of critical points arises from a lack of uniqueness in the underlying combinatorial problem. We explore this aspect of convergence further in Section 5.\n\n
By assuming additional structure in the energy landscape we can generalize the local convergence result, Theorem 1, to yield global convergence of both algorithms. This is the content of Corollary 2 for the SD algorithm and of Corollary 3 for the IPM algorithm. The hypotheses required for each corollary clearly demonstrate the benefit of knowing a priori that ‖f^{k+1} − f^k‖2 → 0 for the SD algorithm. For the IPM algorithm, we can only deduce this a posteriori from the fact that the iterates converge.\n\n
Corollary 2. Let f^0 ∈ S0^{n−1} be arbitrary and define f^{k+1} ∈ A_SD(f^k). If the energy has only countably many critical points in S0^{n−1} then {f^k} converges.\n\n
Corollary 3. Let f^0 ∈ S0^{n−1} be arbitrary and define f^{k+1} ∈ A_IPM(f^k). Suppose all critical points of the energy are isolated in S0^{n−1} and are either local maxima or local minima. Then {f^k} converges.\n\n
While at first glance Corollary 3 provides hope that global convergence holds for the IPM algorithm, our simple examples in the supplementary material demonstrate that even benign graphs with well-defined cuts have critical points of the energy that are neither local maxima nor local minima.\n\n
5 Energy Landscape of the Cheeger Functional\n\n
This section demonstrates that the continuous problem (2) provides an exact relaxation of the combinatorial problem (1).
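The continuum degeneracy mentioned above is easy to reproduce numerically. On a path graph with two bottleneck edges of equal weight, the nested sets S1 ⊊ S2 cut the two bottlenecks with equal Cheeger value, and every strict convex combination of their rescaled characteristic functions attains that same relaxed energy, so the minimizers form a continuum. The example below is ours (the graph and weights are invented for illustration); `relaxed_energy` implements (2):

```python
import numpy as np

def relaxed_energy(W, f):
    """Relaxed Cheeger energy (2): total variation over balance."""
    T = 0.5 * np.sum(W * np.abs(f[:, None] - f[None, :]))
    return T / np.sum(np.abs(f - np.median(f)))

# Path 0-1-2-3-4-5 with two weak "bottleneck" edges of equal weight eps.
eps = 0.1
W = np.zeros((6, 6))
for i, j, w in [(0, 1, 1.0), (1, 2, eps), (2, 3, 1.0), (3, 4, eps), (4, 5, 1.0)]:
    W[i, j] = W[j, i] = w

# S1 = {0,1} and S2 = {0,1,2,3} both have combinatorial Cheeger value eps / 2.
# Rescaled characteristic functions (|S1| <= n/2 < |S2|, so the second flips sign):
f1 = np.zeros(6); f1[[0, 1]] = 1.0 / eps      #  Cut(S1, S1c)^{-1} * chi_S1
f2 = np.zeros(6); f2[[4, 5]] = -1.0 / eps     # -Cut(S2, S2c)^{-1} * chi_{S2c}

# Every strict convex combination has the same energy eps/2: a continuum of minima.
for theta in (0.2, 0.5, 0.8):
    print(relaxed_energy(W, theta * f1 + (1 - theta) * f2))   # ~0.05 (= eps/2) each time
```

The code only verifies that the energy is constant along the segment; that these functions are in fact local minima is the content of the classification results proved later in this section.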
Specifically, we provide an explicit formula that gives an exact correspondence between the global minimizers of the continuous problem and the global minimizers of the combinatorial problem. This extends previous work [12, 11, 9] on the relationship between the global minima of (1) and (2). We also completely classify the local minima of the continuous problem by introducing a notion of local minimum for the combinatorial problem. Any local minimum of the combinatorial problem then determines a local minimum of the continuous problem by means of an explicit formula, and vice-versa. Theorem 4 provides this formula, which also gives a sharp condition for when a global minimum of the continuous problem is two-valued (binary), three-valued (trinary), or k-valued in the general case. This provides an understanding of the energy landscape, which is essential due to the lack of convexity present in the continuous problem. Most importantly, we can classify the types of local minima encountered and determine when they form a continuum. This is germane to the global convergence results of the previous sections. The proofs in this section follow closely the ideas from [12, 11].\n\n
5.1 Local and Global Minima\n\n
We first introduce the two fundamental definitions of this section. The first definition introduces the concept of when a set S ⊂ V of vertices is compatible with an increasing sequence S1 ⊊ S2 ⊊ ··· ⊊ Sk of vertex subsets. Loosely speaking, a set S is compatible with S1 ⊊ S2 ⊊ ··· ⊊ Sk whenever the cut defined by the pair (S, Sc) neither intersects nor crosses any of the cuts (Si, Si^c). Definition 6 formalizes this notion.\n\n
Definition 6 (Compatible Vertex Set).
A vertex set S is compatible with an increasing sequence S1 ⊊ S2 ⊊ ··· ⊊ Sk if S ⊆ S1, Sk ⊆ S, or\n\n
S1 ⊊ S2 ⊊ ··· ⊊ Si ⊆ S ⊆ Si+1 ⊊ ··· ⊊ Sk for some 1 ≤ i ≤ k − 1.\n\n
The concept of compatible cuts then allows us to introduce our notion of a local minimum of the combinatorial problem, i.e. Definition 7.\n\n
Definition 7 (Combinatorial k-Local Minima). An increasing collection of nontrivial sets S1 ⊊ S2 ⊊ ··· ⊊ Sk is called a k-local minimum of the combinatorial problem if C(S1) = C(S2) = ··· = C(Sk) ≤ C(S) for all S compatible with S1 ⊊ S2 ⊊ ··· ⊊ Sk.\n\n
Pursuing the previous analogy, a collection of cuts (S1, S1^c), ···, (Sk, Sk^c) forms a k-local minimum of the combinatorial problem precisely when they do not intersect, have the same energy, and all other non-intersecting cuts (S, Sc) have higher energy. The case of a 1-local minimum is paramount. A cut (S1, S1^c) defines a 1-local minimum if and only if it has lower energy than all cuts that do not intersect it. As a consequence, if a 1-local minimum is not a global minimum then the cut (S1, S1^c) necessarily intersects all of the cuts defined by the global minimizers. This is a fundamental characteristic of local minima: they are never “parallel” to global minima.\n\n
For the continuous problem, combinatorial k-local minima naturally correspond to vertex functions f ∈ Rn that take (k + 1) distinct values. We therefore define the concept of a (k + 1)-valued local minimum of the continuous problem.\n\n
Definition 8 (Continuous (k + 1)-valued Local Minima).
We call a vertex function f ∈ Rn a (k + 1)-valued local minimum of the continuous problem if f is a local minimum of E and if its range contains exactly k + 1 distinct values.\n\n
Theorem 3 provides the intuitive picture connecting these two concepts of minima, and it follows as a corollary of the more technical and explicit Theorem 4.\n\n
Theorem 3. The continuous problem has a (k + 1)-valued local minimum if and only if the combinatorial problem has a k-local minimum.\n\n
For example, if the continuous problem has a trinary local minimum in the usual sense then the combinatorial problem must have a 2-local minimum in the sense of Definition 7. As the cuts (S1, S1^c) and (S2, S2^c) defining a 2-local minimum do not intersect, a 2-local minimum separates the vertices of the graph into three disjoint domains. A trinary function therefore makes intuitive sense. We make this intuition precise in Theorem 4. Before stating it we require two further definitions.\n\n
Definition 9 (Characteristic Functions). Given ∅ ≠ S ⊂ V, define its characteristic function fS as\n\n
fS = Cut(S, Sc)^{−1} χS if |S| ≤ n/2, and fS = −Cut(S, Sc)^{−1} χ_{Sc} if |S| > n/2. (9)\n\n
Note that fS has median zero and TV-norm equal to 1.\n\n
Definition 10 (Strict Convex Hull). Given k functions f1, ···, fk, their strict convex hull is the set\n\n
sch{f1, ···, fk} = {θ1 f1 + ··· + θk fk : θi > 0 for 1 ≤ i ≤ k and θ1 + ··· + θk = 1}. (10)\n\n
Theorem 4 (Explicit Correspondence of Local Minima).\n\n
1. Suppose S1 ⊊ S2 ⊊ ··· ⊊ Sk is a k-local minimum of the combinatorial problem and let f ∈ sch{f_{S1}, ···, f_{Sk}}. Then any function of the form g = αf + β1 with α ≠ 0 defines a (k + 1)-valued local minimum of the continuous problem with E(g) = C(S1).\n\n
2.
Suppose that f is a (k + 1)-valued local minimum and let c1 > c2 > ··· > ck+1 denote its range. For 1 ≤ i ≤ k set Ωi = {f = ci}. Then the increasing collection of sets S1 ⊊ ··· ⊊ Sk given by\n\n
S1 = Ω1, S2 = Ω1 ∪ Ω2, ···, Sk = Ω1 ∪ ··· ∪ Ωk\n\n
is a k-local minimum of the combinatorial problem with C(Si) = E(f).\n\n
Remark 1 (Isolated vs Continuum of Local Minima). If a set S1 is a 1-local minimum then the strict convex hull (10) of its characteristic function reduces to the single binary function f_{S1}. Thus every 1-local minimum generates exactly one local minimum of the continuous problem in S0^{n−1}, and this local minimum is binary. On the other hand, if k ≥ 2 then every k-local minimum of the combinatorial problem generates a continuum (in S0^{n−1}) of non-binary local minima of the continuous problem. Thus, the hypotheses of Theorem 1, Corollary 2 or Corollary 3 can hold only if no such higher order k-local minima exist. When these theorems do apply the algorithms therefore converge to a binary function.\n\n
As a final consequence, we summarize the fact that Theorem 4 implies that the continuous relaxation of the Cheeger cut problem is exact. In other words,\n\n
Theorem 5. Given {f ∈ arg min E} an explicit formula exists to construct the set {S ∈ arg min C}, and vice-versa.\n\n
6 Experiments\n\n
In all experiments, we take the constant c = 1 in the SD algorithm. We use the method from [3] to solve the minimization problem in the SD algorithm and the method from [7] to solve the minimization problem in the IPM algorithm. We terminate each minimization when either a stopping tolerance of ε = 10^{−10} (i.e. ‖u^{j+1} − u^j‖1 ≤ ε) or 2,000 iterations is reached.
This yields a comparison of the idealized cases of the SD algorithm and the IPM algorithm. Our first experiment uses the two-moon dataset [2] in the same setting as in [12]. The second experiment utilizes pairs of image digits extracted from the MNIST dataset. The first table summarizes the results of these tests. It shows the mean Cheeger energy value (2), the mean classification error (% of misclassified data) and the mean computational time for both algorithms over 10 experiments, with the same random initialization for both algorithms in each individual experiment.\n\n
              SD Algorithm                        Modified IPM Algorithm [7]\n
              Energy   Error (%)   Time (sec.)    Energy   Error (%)   Time (sec.)\n
2 moons       0.126    14.12       1.98           0.145    8.69        2.06\n
4's and 9's   0.115    25.23       58.9           0.185    1.65        52.4\n
3's and 8's   0.086    1.219       48.1           0.086    1.217       49.2\n\n
Our second set of experiments applies both algorithms to multi-class clustering problems using a standard, recursive bi-partitioning method. We use the MNIST, USPS and COIL datasets. We preprocessed the data by projecting onto the first 50 principal components, and take k = 10 nearest neighbors for the MNIST and USPS datasets and k = 5 nearest neighbors for the COIL dataset. We used the same tolerances for the minimization problems, i.e. ε = 10^{−10} and 2,000 maximum iterations. The table below presents the mean Cheeger energy, classification error and time over 10 experiments as before.\n\n
                     SD Algorithm                       Modified IPM Algorithm [7]\n
                     Energy   Err. (%)   Time (min.)    Energy   Err. (%)   Time (min.)\n
MNIST (10 classes)   1.30     11.75      45.01          1.29     11.78      42.83\n
USPS (10 classes)    2.37     4.13       5.15           2.37     4.11       4.81\n
COIL (20 classes)    0.19     2.52       4.31           0.18     1.58       4.20\n\n
Overall, the results show that both algorithms perform equivalently for both two-class and multi-class clustering problems.\n\n
As our interest here lies in the theoretical properties of both algorithms, we will study practical implementation details for the SD algorithm in future work. For instance, as Hein and Bühler remark [6], solving the minimization problem for the IPM algorithm precisely is unnecessary. Analogously for the SD algorithm, we only need to lower the energy sufficiently before proceeding to the next iteration of the algorithm. It proves convenient to stop the minimization when a weaker form of the energy inequality (6) holds, such as\n\n
E(f) ≥ E(h) + θ (E(f)/B(h)) ‖ĥ − f‖2^2 / c\n\n
for some constant 0 < θ < 1. This condition provably holds in a finite number of iterations and still guarantees that ‖f^{k+1} − f^k‖2 → 0. The concrete decay estimate provided by the SD algorithm therefore allows us to give precise meaning to “sufficiently lowers the energy.” We investigate these aspects of the algorithm and prove convergence for this practical implementation in future work.\n\n
Reproducible research: The code is available at http://www.cs.cityu.edu.hk/~xbresson/codes.html\n\n
Acknowledgements: This work was supported by AFOSR MURI grant FA9550-10-1-0569 and Hong Kong GRF grant #110311.\n\n
References\n\n
[1] X. Bresson, X.-C. Tai, T.F. Chan, and A. Szlam. Multi-Class Transductive Learning based on ℓ1 Relaxations of Cheeger Cut and Mumford-Shah-Potts Model. UCLA CAM Report, 2012.\n\n
[2] T. Bühler and M. Hein. Spectral Clustering Based on the Graph p-Laplacian. In International Conference on Machine Learning, pages 81–88, 2009.\n\n
[3] A. Chambolle and T. Pock.
A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.

[4] J. Cheeger. A Lower Bound for the Smallest Eigenvalue of the Laplacian. Problems in Analysis, pages 195–199, 1970.

[5] F. R. K. Chung. Spectral Graph Theory, volume 92 of CBMS Regional Conference Series in Mathematics. Published for the Conference Board of the Mathematical Sciences, Washington, DC, 1997.

[6] M. Hein and T. Bühler. An Inverse Power Method for Nonlinear Eigenproblems with Applications in 1-Spectral Clustering and Sparse PCA. In Advances in Neural Information Processing Systems (NIPS), pages 847–855, 2010.

[7] M. Hein and S. Setzer. Beyond Spectral Clustering - Tight Relaxations of Balanced Graph Cuts. In Advances in Neural Information Processing Systems (NIPS), 2011.

[8] R.R. Meyer. Sufficient Conditions for the Convergence of Monotonic Mathematical Programming Algorithms. Journal of Computer and System Sciences, 12(1):108–121, 1976.

[9] S. Rangapuram and M. Hein. Constrained 1-Spectral Clustering. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1143–1151, 2012.

[10] J. Shi and J. Malik. Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 22(8):888–905, 2000.

[11] G. Strang. Maximal Flow Through A Domain. Mathematical Programming, 26:123–143, 1983.

[12] A. Szlam and X. Bresson. Total Variation and Cheeger Cuts. In Proceedings of the 27th International Conference on Machine Learning, pages 1039–1046, 2010.

[13] L. Zelnik-Manor and P. Perona. Self-tuning Spectral Clustering.
In Advances in Neural Information Processing Systems (NIPS), 2004.