{"title": "Minimum Average Cost Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 1759, "page_last": 1767, "abstract": "A number of objective functions in clustering problems can be described with submodular functions. In this paper, we introduce the minimum average cost criterion, and show that the theory of intersecting submodular functions can be used for clustering with submodular objective functions. The proposed algorithm does not require the number of clusters in advance, and it will be determined by the property of a given set of data points. The minimum average cost clustering problem is parameterized with a real variable, and surprisingly, we show that all information about optimal clusterings for all parameters can be computed in polynomial time in total. Additionally, we evaluate the performance of the proposed algorithm through computational experiments.", "full_text": "Minimum Average Cost Clustering\n\nKiyohito Nagano\n\nInstitute of Industrial Science\nUniversity of Tokyo, Japan\n\nThe Institute of Scienti\ufb01c and Industrial Research\n\nYoshinobu Kawahara\n\nOsaka University, Japan\n\nnagano@sat.t.u-tokyo.ac.jp\n\nkawahara@ar.sanken.osaka-u.ac.jp\n\nSatoru Iwata\n\nResearch Institute for Mathematical Sciences\n\nKyoto University, Japan\n\niwata@kurims.kyoto-u.ac.jp\n\nAbstract\n\nA number of objective functions in clustering problems can be described with\nsubmodular functions.\nIn this paper, we introduce the minimum average cost\ncriterion, and show that the theory of intersecting submodular functions can be\nused for clustering with submodular objective functions. The proposed algorithm\ndoes not require the number of clusters in advance, and it will be determined by\nthe property of a given set of data points. 
The minimum average cost clustering\nproblem is parameterized with a real variable, and surprisingly, we show that all\ninformation about optimal clusterings for all parameters can be computed in poly-\nnomial time in total. Additionally, we evaluate the performance of the proposed\nalgorithm through computational experiments.\n\n1\n\nIntroduction\n\nA clustering of a \ufb01nite set V of data points is a partition of V into subsets (called clusters) such\nthat data points in the same cluster are similar to each other. Basically, a clustering problem asks\nfor a partition P of V such that the intra-cluster similarity is maximized and/or the inter-cluster\nsimilarity is minimized. The clustering of data is one of the most fundamental unsupervised learning\nproblems. We use a criterion function de\ufb01ned on all partitions of V , and the clustering problem\nbecomes that of \ufb01nding a partition of V that minimizes the clustering cost under some constraints.\nSuppose that the inhomogeneity of subsets of the data set is measured by a nonnegative set function\nf : 2V \u2192 R with f (\u2205) = 0, where 2V denotes the set of all subsets of V , and the clustering cost\nof a partition P = {S1, S2, . . . , Sk} is de\ufb01ned by f [P] = f (S1) + \u00b7 \u00b7 \u00b7 + f (Sk). A number of set\nfunctions that represent the inhomogeneity, including cut functions of graphs and entropy functions,\nare known to be submodular [3, 4]. Throughout this paper, we suppose that f is submodular, that\nis, f (S) + f (T ) \u2265 f (S \u222a T ) + f (S \u2229 T ) for all S, T \u2286 V . 
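As a concrete illustration (not from the paper), the submodular inequality can be verified exhaustively for a cut function on a small graph; the graph and edge weights below are hypothetical.

```python
from itertools import combinations

# Hypothetical small example: cut function of an undirected graph on V = {0,1,2,3}.
# f(S) = total weight of edges with exactly one endpoint in S; cut functions are
# submodular (and symmetric).
V = frozenset(range(4))
edges = {(0, 1): 2.0, (1, 2): 1.0, (2, 3): 2.0, (0, 3): 1.0, (0, 2): 0.5}

def cut(S):
    S = frozenset(S)
    return sum(w for (u, v), w in edges.items() if (u in S) != (v in S))

def subsets(U):
    U = list(U)
    for r in range(len(U) + 1):
        for c in combinations(U, r):
            yield frozenset(c)

# check f(S) + f(T) >= f(S | T) + f(S & T) for every pair of subsets
ok = all(cut(S) + cut(T) >= cut(S | T) + cut(S & T) - 1e-9
         for S in subsets(V) for T in subsets(V))
print(ok)  # True
```

The brute-force check is exponential in |V| and is meant only to make the inequality tangible on a toy instance.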
A submodular function is known to be\na discrete counterpart of a convex function, and in recent years, its importance has been recognized\nin the \ufb01eld of machine learning.\nFor any given integer k with 1 \u2264 k \u2264 n, where n is the number of points in V , a partition P of\nV is called a k-partition if there are exactly k nonempty elements in P, and is called an optimal\nk-clustering if P is a k-partition that minimizes the cost f [P] among all k-partitions. A problem of\n\ufb01nding an optimal k-clustering is widely studied in combinatorial optimization and various \ufb01elds,\nand it is recognized as a natural formulation of a clustering problem [8, 9, 10]. Even if f is a cut\nfunction of a graph, which is submodular and symmetric, that is, f (V \u2212 S) = f (S) for all S \u2286 V ,\nthis problem is known to be NP-hard unless k can be regarded as a constant [5]. Zhao et al. [17]\nand Narasimhan et al. [10] dealt with the case when f is submodular and symmetric. Zhao et al.\n[17] gave a 2(1\u22121/k)-approximation algorithm using Queyranne\u2019s symmetric submodular function\nminimization algorithm [13]. Narasimhan et al. [10] showed that Queyranne\u2019s algorithm can be used\n\n1\n\n\ffor clustering problems with some speci\ufb01c natural criteria. For a general submodular function and\na small constant k, constant factor approximation algorithms for optimal k-clusterings are designed\nin [12, 18]. In addition, balanced clustering problems with submodular costs are studied in [8, 9].\nGenerally speaking, it is dif\ufb01cult to \ufb01nd an optimal k-clustering for any given k because the opti-\nmization problem is NP-hard even for simple special cases. Furthermore, the number of clusters\nhas to be determined in advance, regardless of the property of the data points, or an additional com-\nputation is required to \ufb01nd a proper number of clusters via some method like cross-validation. 
In\nthis paper, we introduce a new clustering criterion to resolve the above shortcomings of previous\napproaches [10]. In the minimum average cost (MAC) clustering problem we consider, the objec-\ntive function is the average cost of a partition P which combines the clustering cost f [P] and the\nnumber of clusters |P|. Now the number of clusters is not pre-determined, but it will be determined\nautomatically by solving the combinatorial optimization problem. We argue that the MAC clustering\nproblem represents a natural clustering criterion. In this paper, we show that the Dilworth truncation\nof an intersecting submodular function [2] (see also Chapter II of Fujishige [4] and Chapter 48 of\nSchrijver [14]) can be used to solve the clustering problem exactly and ef\ufb01ciently. To the best of\nour knowledge, this is the \ufb01rst time that the theory of intersecting submodular functions is used for\nclustering. The MAC clustering problem can be parameterized with a real-valued parameter \u03b2 \u2265 0,\nand the problem with respect to \u03b2 asks for a partition P of V that minimizes the average cost under\na constraint |P| > \u03b2. The main contribution of this paper is a polynomial time algorithm that solves\nthe MAC clustering problem exactly for any given parameter \u03b2. This result is in stark contrast to the\nNP-hardness of the optimal k-clustering problems. Even more surprisingly, our algorithm computes\nall information about MAC clusterings for all parameters in polynomial time in total.\nIn the case where f is a cut function of a graph, there are some related works. If f is a cut function\nand \u03b2 = 1, the optimal value of the MAC clustering problem coincides with the strength of a graph\n[1]. In addition, the computation of the principal sequence of partitions of a graph [7] is a special\ncase of the parametrized MAC clustering problem in an implicit way.\n\nThis paper is organized as follows. 
In Section 2, we formulate the minimum average cost clustering\nproblem, and show a structure property of minimum average cost clusterings.\nIn Section 3, we\npropose a framework of our algorithm for the minimum average cost clustering problem. In Section\n4, we explain the basic results on the theory of intersecting submodular functions, and describe the\nDilworth truncation algorithm which is used in Section 3 as a subroutine. Finally, we show the results\nof computational experiments in Section 5, and give concluding remarks in Section 6.\n\n2 Minimum Average Cost Clustering\n\nIn this section, we give a de\ufb01nition of minimum average cost clusterings. After that, we show\na structure property of them. Let V be a given set of n data points, and let f : 2V \u2192 R be a\nnonnegative submodular function with f (\u2205) = 0, which is not necessarily symmetric. For each\nsubset S \u2286 V , the value f (S) represents the inhomogeneity of data points in S. For a partition\nP = {S1, . . . , Sk}, the clustering cost is de\ufb01ned by f [P] = f (S1) + \u00b7 \u00b7 \u00b7 + f (Sk). We will\nintroduce the minimum average cost criterion in order to take both the clustering\ncost f [P] and the number of clusters |P| into consideration.\n\n2.1 De\ufb01nition\n\nConsider a k-partition P of V with k > 1, and compare P with the trivial partition {V } of V . Then,\nthe number of clusters has increased by k \u2212 1 and the clustering cost has increased by f [P] + c,\nwhere c is a constant. Therefore, it is natural to de\ufb01ne an average cost of P by f [P]/(|P| \u2212 1).\nSuppose that P \u2217 is a partition of V that minimizes the average cost among all partitions P of V with\n|P| > 1. Remark that the number of clusters of P \u2217 is determined not by us, but by the property of\nthe given data set. 
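On a tiny instance, the average cost criterion can be evaluated by brute force over all partitions; the weighted path graph below and its cut function are a hypothetical example (the case β = 1).

```python
# Hypothetical example: cut function of a weighted path 0-1-2-3.
edges = {(0, 1): 3.0, (1, 2): 1.0, (2, 3): 3.0}

def cut(S):
    return sum(w for (u, v), w in edges.items() if (u in S) != (v in S))

def partitions(items):
    # enumerate all partitions of a list, recursively
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for p in partitions(rest):
        for i in range(len(p)):
            yield p[:i] + [p[i] + [first]] + p[i + 1:]
        yield p + [[first]]

# minimize the average cost f[P] / (|P| - 1) over all partitions with |P| > 1
best = min(
    (sum(cut(set(B)) for B in P) / (len(P) - 1), P)
    for P in partitions([0, 1, 2, 3]) if len(P) > 1
)
print(best[0])  # 2.0  (achieved by the partition {{0,1},{2,3}})
```

The minimizer cuts only the light middle edge, and the number of clusters (two) emerges from the data rather than being fixed in advance, which is the point of the criterion.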
Therefore, it may be said that P \u2217 is a natural clustering.\nMore generally, using a parameter \u03b2 \u2208 [0, n) = {\u03c4 \u2208 R : 0 \u2264 \u03c4 < n}, we de\ufb01ne an extended\naverage cost by f [P]/(|P| \u2212 \u03b2). For any parameter \u03b2 \u2208 [0, n), we consider the minimum average\ncost (MAC) clustering problem\n\n\u03bb(\u03b2) := minP {f [P]/(|P| \u2212 \u03b2) : P is a partition of V , |P| > \u03b2}.   (1)\n\n2\n\n\fLet us say that a partition P is a \u03b2-MAC clustering if P is optimal for the problem (1) with respect\nto \u03b2 \u2208 [0, n). Naturally, the case where \u03b2 = 1 is fundamental. Furthermore, we can expect \ufb01ner\nclusterings for relatively large parameters. The problem (1) and the optimal k-clustering problem\n[10] are closely related.\nProposition 1. Let P be a \u03b2-MAC clustering for some \u03b2 \u2208 [0, n), and set k := |P|. Then we have\nf [P] \u2264 f [Q] for any k-partition Q of V . In other words, P is an optimal k-clustering.\nProof. By de\ufb01nition, we have k > \u03b2 and f [P]/(k \u2212 \u03b2) \u2264 f [Q]/(k \u2212 \u03b2) for any k-partition Q.\nMultiplying both sides by k \u2212 \u03b2 > 0 yields f [P] \u2264 f [Q].\n\nWe will show that all information about \u03b2-MAC clusterings for all parameters \u03b2 can be computed in\npolynomial time in total. Our algorithm requires the help of the theory of intersecting submodular\nfunctions [4, 14]. Proposition 1 says that if there exists a \u03b2-MAC clustering P satisfying |P| = k,\nthen we obtain an optimal k-clustering. Note that this fact is consistent with the NP-hardness of the\noptimal k-clustering problem because the information about MAC clusterings just gives a portion of\nthe information about optimal k-clusterings (k = 1, . . . , n).\n\n2.2 Structure property\n\nWe will investigate the structure of all \u03b2-MAC clusterings. Denote by R+ the set of nonnegative\nreal values. Let us choose a parameter \u03b2 \u2208 [0, n). 
If P is a partition of V satisfying |P| \u2264 \u03b2, we have \u2212\u03b2\u03bb \u2264 \u2212|P|\u03bb \u2264 f [P] \u2212 |P|\u03bb for all \u03bb \u2208 R+. Hence the minimum average cost \u03bb(\u03b2) de\ufb01ned in (1) is represented as\n\n\u03bb(\u03b2) = max{\u03bb \u2208 R+ : \u03bb \u2264 f [P]/(|P| \u2212 \u03b2) for all partitions P of V with |P| > \u03b2}\n= max{\u03bb \u2208 R+ : \u2212\u03b2\u03bb \u2264 f [P] \u2212 |P|\u03bb for all partitions P of V }\n= max{\u03bb \u2208 R+ : \u2212\u03b2\u03bb \u2264 h(\u03bb)},   (2)\n\nwhere h : R+ \u2192 R is de\ufb01ned by\n\nh(\u03bb) = minP {f [P] \u2212 |P|\u03bb : P is a partition of V } (\u03bb \u2265 0).   (3)\n\nThe function h does not depend on the parameter \u03b2. For \u03bb \u2265 0, we say that a partition P determines\nh at \u03bb if f [P] \u2212 |P|\u03bb = h(\u03bb). Apparently, the minimization problem (3) is dif\ufb01cult to solve for any\ngiven \u03bb \u2265 0. This point will be discussed in Section 4 in detail.\nLet us examine properties of the function h. For each partition P of V , de\ufb01ne a linear function\nhP : R+ \u2192 R as hP (\u03bb) = f [P] \u2212 |P|\u03bb. Since h is the minimum of these linear functions, h is\na piecewise-linear concave function on R+. The function h is illustrated in Figure 1 by the thick\ncurve. We have h(0) = f (V ) because f [{V }] \u2264 f [P] for any partition P of V . Moreover, it is easy\nto see that the set of singletons {{1}, {2}, . . . , {n}} determines h at a suf\ufb01ciently large \u03bb. In view\nof (2), the minimum average cost \u03bb(\u03b2) can be obtained by solving the equation \u2212\u03b2\u03bb = h(\u03bb) (see\nalso Figure 1). In addition, a \u03b2-MAC clustering can be characterized as follows.\nLemma 2. Given a parameter \u03b2 \u2208 [0, n), let P be a partition of V such that |P| > \u03b2 and\nh(\u03bb(\u03b2)) = f [P] \u2212 |P|\u03bb(\u03b2). Then P is a \u03b2-MAC clustering.\nProof. 
Since \u2212\u03b2\u03bb(\u03b2) = h(\u03bb(\u03b2)) = f [P] \u2212 |P|\u03bb(\u03b2), we have \u03bb(\u03b2) = f [P]/(|P| \u2212 \u03b2). For any partition Q of V with |Q| > \u03b2, we have \u2212\u03b2\u03bb(\u03b2) \u2264 f [Q] \u2212 |Q|\u03bb(\u03b2), and thus \u03bb(\u03b2) \u2264 f [Q]/(|Q| \u2212 \u03b2). Therefore, P is a \u03b2-MAC clustering.\n\nFigure 1: The function h\n\nFigure 2: The structure of h\n\nNow, we will present a structure property of the MAC problem (1). Suppose that the slopes of h take values \u2212s1 > \u2212s2 > \u00b7 \u00b7 \u00b7 > \u2212sd. As {s1, s2, . . . , sd} \u2286 {1, . . . , n}, we have d \u2264 n. The\n\n3\n\n\finterval R+ is split into d subintervals R1 = [0, \u03bb1), R2 = [\u03bb1, \u03bb2), . . . , Rd = [\u03bbd\u22121, +\u221e) such that, for each j = 1, . . . , d, the function h is linear and its slope is \u2212sj on Rj. Let Ps1, Ps2, . . . , Psd be partitions of V such that, for each j = 1, . . . , d, the partition Psj determines h at all \u03bb \u2208 Rj (see Figure 2 (a)). In particular, sd = n and the last partition Psd is the set of singletons {{1}, {2}, . . . , {n}}. Observe that the range I of the minimum average costs \u03bb(\u03b2) is I = [\u03bb(0), +\u221e). Suppose that j\u2217 is an index such that \u03bb(0) \u2208 Rj\u2217. Let d\u2217 = d \u2212 j\u2217 + 1, and let \u03bb\u2217j = \u03bbj+j\u2217\u22121 and s\u2217j = sj+j\u2217\u22121 for each j = 1, . . . , d\u2217. The interval I is split into d\u2217 subintervals I1 = [\u03bb(0), \u03bb\u22171), I2 = [\u03bb\u22171, \u03bb\u22172), . . . , Id\u2217 = [\u03bb\u2217d\u2217\u22121, +\u221e). Accordingly, the domain of \u03b2 is split into d\u2217 subintervals B1 = [0, \u03b21), B2 = [\u03b21, \u03b22), . . . , Bd\u2217 = [\u03b2d\u2217\u22121, n), where \u03b2j = \u2212h(\u03bb\u2217j )/\u03bb\u2217j for each j = 1, . . . , d\u2217 \u2212 1. Figure 2 (b) illustrates these two sets of subintervals {I1, . . . , Id\u2217} and {B1, . . . , Bd\u2217}. By Lemma 2, we directly obtain the structure property of the MAC problem (1):\nLemma 3. Let j \u2208 {1, . . . , d\u2217}. For any \u03b2 \u2208 Bj, the partition Ps\u2217j is a \u03b2-MAC clustering.\nLemma 3 implies that if we can \ufb01nd the collection {Ps1, Ps2, . . . , Psd}, then the MAC problem (1) will be completely solved. In the subsequent sections, we will give an algorithm that computes the collection {Ps1, Ps2, . . . , Psd} in polynomial time in total.\n\n3 The clustering algorithm\n\nIn this section, we present a framework of a polynomial time algorithm that \ufb01nds the collection {Ps1, Ps2, . . . , Psd} de\ufb01ned in \u00a72.2. That is, our algorithm computes all the breakpoints of the piecewise linear concave function h de\ufb01ned in (3). By Lemma 3, we can immediately construct a polynomial time algorithm that solves the MAC problem (1) completely.\n\nThe proposed algorithm uses the following procedure FINDPARTITION, which will be described in Section 4 precisely.\n\nProcedure FINDPARTITION(\u03bb): For any given \u03bb \u2265 0, this procedure computes the value h(\u03bb) and \ufb01nds a partition P of V that determines h at \u03bb.\n\nWe will use SFM(n) to denote the time required to minimize a general submodular function de\ufb01ned on 2V , where n = |V |. Submodular function minimization can be solved in polynomial time (see [6]). Although the minimization problem (3) is apparently hard, we show that the procedure FINDPARTITION can be designed to run in polynomial time.\nLemma 4. 
For any \u03bb \u2265 0, the procedure FINDPARTITION(\u03bb) runs in O(n \u00b7 SFM(n)).\n\nThe proof of Lemma 4, which will be given in \u00a74, utilizes the Dilworth truncation of an intersecting\nsubmodular function [4, 14].\nLet us call a partition P of V supporting if there exists \u03bb \u2265 0 such that h(\u03bb) = hP (\u03bb). By de\ufb01ni-\ntion, each Psj is supporting. In addition, for any \u03bb \u2265 0, FINDPARTITION(\u03bb) returns a supporting\npartition of V . Set Q1 := {V } and Qn := {{1}, {2}, . . . , {n}}. Q1 is a supporting partition of V\nbecause h(0) = f [{V }] = hQ1(0), and Qn is also supporting because Qn = Psd. For a supporting\npartition P of V , if |P| = sj for some j \u2208 {1, . . . , d}, then we can put Psj = P. For integers\n1 \u2264 k < \u2113 \u2264 n, de\ufb01ne R(k, \u2113) = {\u03bb \u2208 R+ : \u2212k \u2265 \u2202+h(\u03bb), and \u2202\u2212h(\u03bb) \u2265 \u2212\u2113}, where \u2202+h\nand \u2202\u2212h are the right and left derivatives of h, respectively, and we set \u2202\u2212h(0) = 0. Observe that\nR(k, \u2113) is an interval in R+. All breakpoints of h are included in R(1, n) = R+.\nSuppose that we are given two supporting partitions Qk and Q\u2113 such that |Qk| = k, |Q\u2113| = \u2113\nand k < \u2113. We describe the algorithm SPLIT(Qk, Q\u2113), which computes the information about all\nbreakpoints of h on the interval R(k, \u2113). This algorithm is a recursive one. First of all, the algorithm\nSPLIT decides whether \u201ck = sj and \u2113 = sj+1 for some j \u2208 {1, . . . , d \u2212 1}\u201d or not. Besides, if\nthe decision is negative, the algorithm \ufb01nds a supporting partition Qm such that |Qm| = m and\nk < m < \u2113. If the decision is positive, there is exactly one breakpoint on the interior of R(k, \u2113),\nwhich can be given by Qk and Q\u2113. Now we show how to execute these operations. 
For two linear functions hQk (\u03bb) and hQ\u2113 (\u03bb), the equality hQk (\u03bb) = hQ\u2113(\u03bb) holds at \u03bb = \u03bb\u0304 := (f [Q\u2113] \u2212 f [Qk])/(\u2113 \u2212 k). Set h\u0304 = hQk (\u03bb\u0304) = (\u2113f [Qk] \u2212 kf [Q\u2113])/(\u2113 \u2212 k). Clearly, we have h(\u03bb\u0304) \u2264 h\u0304. The algorithm SPLIT performs the procedure FINDPARTITION(\u03bb\u0304). Consider the case where h(\u03bb\u0304) = h\u0304 (see Figure 3 (a)). Then the algorithm gives an af\ufb01rmative answer, returns Qk and Q\u2113, and stops. Next, consider the case where h(\u03bb\u0304) < h\u0304 (see Figure 3 (b)). Then the algorithm gives a negative answer, and the partition\n\n4\n\n\fP returned by FINDPARTITION is supporting and satis\ufb01es k < |P| < \u2113. We set m = |P| and Qm = P. Finally, the algorithm performs SPLIT(Qk, Qm) and SPLIT(Qm, Q\u2113).\n\nFigure 3: Two different situations in SPLIT(Qk, Q\u2113)\n\nThe algorithm SPLIT can be summarized as follows.\n\nAlgorithm SPLIT(Qk, Q\u2113)\nInput : Supporting partitions of V , Qk and Q\u2113 such that |Qk| = k, |Q\u2113| = \u2113 and k < \u2113.\nOutput : The information about all breakpoints of h on the interval R(k, \u2113).\n1: Set \u03bb\u0304 := (f [Q\u2113] \u2212 f [Qk])/(\u2113 \u2212 k), and set h\u0304 := (\u2113f [Qk] \u2212 kf [Q\u2113])/(\u2113 \u2212 k). By performing FINDPARTITION(\u03bb\u0304), compute h(\u03bb\u0304) and a partition P of V that determines h(\u03bb\u0304).\n2: If h(\u03bb\u0304) = h\u0304 (positive case), return Qk and Q\u2113, and stop.\n3: If h(\u03bb\u0304) < h\u0304 (negative case), set m := |P|, Qm := P, and perform SPLIT(Qk, Qm) and SPLIT(Qm, Q\u2113).\n\nBy performing the algorithm SPLIT(Q1, Qn), where Q1 := {V } and Qn := {{1}, {2}, . . . , {n}}, the information of all breakpoints of h is obtained. Therefore, the collection {Ps1, Ps2, . . . 
, Psd}\nde\ufb01ned in \u00a72.2 can be obtained. Let us show that this algorithm runs in polynomial time.\nTheorem 5. The collection {Ps1, Ps2, . . . , Psd} can be computed in O(n2 \u00b7 SFM(n)) time. In\nother words, the information of all breakpoints of h can be computed in O(n2 \u00b7 SFM(n)) time.\n\nProof. By Lemma 4, it suf\ufb01ces to show that the number of calls of the procedure FINDPARTITION\nin the execution of SPLIT(Q1, Qn) is O(n). In the algorithm, after one call of FINDPARTITION,\n(i) we can obtain the information about one breakpoint of h, or (ii) a new supporting partition Qm\ncan be obtained. Clearly, the number of breakpoints of h is at most n. Throughout the execution\nof SPLIT(Q1, Qn), the algorithm computes a supporting k-partition at most once for each k \u2208\n{1, . . . , n}. Therefore, FINDPARTITION is called at most 2n times in total.\n\nThe main theorem of this paper directly follows from Lemma 3 and Theorem 5.\nTheorem 6. All information of optimal solutions to the minimum average cost clustering problem\n(1) for all parameters \u03b2 \u2208 [0, n) can be computed in O(n2 \u00b7 SFM(n)) time in total.\n\n4 Finding a partition\n\nIn the clustering algorithm of Section 3, we iteratively call the procedure FINDPARTITION, which\ncomputes h(\u03bb) de\ufb01ned in (3) and a partition P that determines h(\u03bb) for any given \u03bb \u2265 0. In this\nsection, we will see that the procedure FINDPARTITION can be implemented to run in polynomial\ntime with the aid of the Dilworth truncation of an intersecting submodular function [2], and give a\nproof of Lemma 4. The Dilworth truncation algorithm is sketched in the proof of Theorem 48.4 of\nSchrijver [14], and the algorithm described in \u00a74.2 is based on that algorithm.\n\n4.1 The Dilworth truncation of an intersecting submodular function\n\nWe start with de\ufb01nitions of an intersecting submodular function and the Dilworth truncation. 
Subsets S, T \u2286 V are intersecting if S \u2229 T \u2260 \u2205, S \\ T \u2260 \u2205, and T \\ S \u2260 \u2205. A set function g : 2V \u2192 R is intersecting submodular if g(S) + g(T ) \u2265 g(S \u222a T ) + g(S \u2229 T ) for all intersecting subsets S, T \u2286 V . Clearly, the fully submodular function\u00b9 f is also intersecting submodular. For any \u03bb \u2265 0,\n\n\u00b9To emphasize the difference between submodular and intersecting submodular functions, in what follows we refer to a submodular function as a fully submodular function.\n\n5\n\n\fde\ufb01ne f\u03bb : 2V \u2192 R as follows: f\u03bb(S) = 0 if S = \u2205, and f\u03bb(S) = f (S) \u2212 \u03bb otherwise. It is easy to see that f\u03bb is an intersecting submodular function.\nFor a fully submodular function f with f (\u2205) = 0, consider a polyhedron P(f ) = {x \u2208 Rn : x(S) \u2264 f (S), \u2205 \u2260 \u2200S \u2286 V }, where x(S) = \u03a3i\u2208S xi. The polyhedron P(f ) is called a submodular polyhedron. In the same manner, for an intersecting submodular function g with g(\u2205) = 0, de\ufb01ne P(g) = {x \u2208 Rn : x(S) \u2264 g(S), \u2205 \u2260 \u2200S \u2286 V }. As for P(f ), for each nonempty subset S \u2286 V , there exists a vector x \u2208 P(f ) such that x(S) = f (S) by the validity of the greedy algorithm of Edmonds [3]. On the other hand, the polyhedron P(g) does not necessarily satisfy such a property. Alternatively, the following property is known.\n\nTheorem 7 (Refer to Theorems 2.5, 2.6 of [4]). Given an intersecting submodular function g : 2V \u2192 R with g(\u2205) = 0, there exists a fully submodular function g\u0302 : 2V \u2192 R such that g\u0302(\u2205) = 0 and P(g\u0302) = P(g). Furthermore, the function g\u0302 can be represented as\n\ng\u0302(S) = min{\u03a3S\u2208P g(S) : P is a partition of S}.   (4)\n\nThe function g\u0302 in Theorem 7 is called the Dilworth truncation of g. If g is fully submodular, for each S \u2286 V , {S} is an optimal solution to the RHS of (4) and we have g\u0302(S) = g(S). 
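On a small ground set, the right-hand side of (4) can be evaluated by brute force over all partitions; the sketch below (exponential time, for illustration only — the point of the paper is to avoid exactly this) uses the two-point instance with f({1}) = 12, f({2}) = 8, f({1,2}) = 19 and λ = 2 discussed in this section.

```python
from itertools import combinations

def dilworth_truncation(g, S):
    """Evaluate the Dilworth truncation (4): min over all partitions of S of the
    sum of g over the blocks, by recursing on the block containing the first element."""
    S = tuple(S)
    if not S:
        return 0.0
    best = float("inf")
    first, rest = S[0], S[1:]
    for r in range(len(rest) + 1):
        for c in combinations(rest, r):
            block = frozenset((first,) + c)
            remainder = tuple(x for x in rest if x not in c)
            best = min(best, g(block) + dilworth_truncation(g, remainder))
    return best

# the two-point example: f({1}) = 12, f({2}) = 8, f({1,2}) = 19, lambda = 2
f = {frozenset(): 0.0, frozenset({1}): 12.0, frozenset({2}): 8.0, frozenset({1, 2}): 19.0}
lam = 2.0
f_lam = lambda S: 0.0 if not S else f[frozenset(S)] - lam

print(dilworth_truncation(f_lam, {1, 2}))  # 16.0
```

The minimum is attained by splitting {1,2} into singletons, reproducing the value 16 from the text.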
For a general intersecting submodular function g, however, the computation of g\u0302(S) is a nontrivial task.\nLet us see a small example. Suppose that a fully submodular function f : 2{1,2} \u2192 R satis\ufb01es f (\u2205) = 0, f ({1}) = 12, f ({2}) = 8, and f ({1, 2}) = 19. Set \u03bb = 2. There is no vector x \u2208 P(f\u03bb) such that x({1, 2}) = f\u03bb({1, 2}). The Dilworth truncation f\u0302\u03bb : 2V \u2192 R de\ufb01ned by (4) satis\ufb01es f\u0302\u03bb(S) = f\u03bb(S) for S \u2208 {\u2205, {1}, {2}}, and f\u0302\u03bb({1, 2}) = f\u03bb({1}) + f\u03bb({2}) = 16. Observe that f\u0302\u03bb is fully submodular and P(f\u0302\u03bb) = P(f\u03bb). Figure 4 illustrates these polyhedra.\n\nFigure 4: Polyhedra P(f ), P(f\u03bb), and P(f\u0302\u03bb)\n\nFigure 5: The greedy algorithm [3]\n\n4.2 Algorithm that \ufb01nds a partition\n\nLet us \ufb01x \u03bb \u2265 0, and describe FINDPARTITION(\u03bb). In view of equations (3), (4) and the de\ufb01nition of f\u0302\u03bb, we obtain h(\u03bb) = f\u0302\u03bb(V ) using the Dilworth truncation of f\u03bb. We ask for a partition P of V satisfying f\u0302\u03bb(V ) = f\u03bb[P] (= \u03a3T \u2208P f\u03bb(T )) because such a partition P of V determines h at \u03bb.\nWe know that f\u0302\u03bb : 2V \u2192 R is submodular, but f\u0302\u03bb(S) = min{f\u03bb[P] : P is a partition of S} cannot be obtained directly for each S \u2286 V . To evaluate f\u0302\u03bb(V ), we will use the greedy algorithm of Edmonds [3]. Denote the set of all extreme points of P(f\u0302\u03bb) \u2286 Rn by ex(P(f\u0302\u03bb)). In the example of \u00a74.1, we have ex(P(f\u0302\u03bb)) = {(10, 6)}. We set x0 \u2264 y for all y \u2208 ex(P(f\u0302\u03bb)). For example, set x0i := \u2212M for each i \u2208 V , where M = \u03bb + \u03a3j\u2208V {|f ({j})| + |f (V ) \u2212 f (V \u2212 {j})|}. For each i \u2208 V , let ei denote the i-th unit vector in Rn.\n\nLet L = (i1, . . . , in) be any ordering of V , and let V \u2113 = {i1, . . . , i\u2113} for each \u2113 = 1, . . . , n. Now we describe the framework of the greedy algorithm [3]. In the \u2113-th iteration (\u2113 = 1, . . . , n), we compute \u03b1\u2113 := max{\u03b1 : x\u2113\u22121 + \u03b1 \u00b7 ei\u2113 \u2208 P(f\u0302\u03bb)} and set x\u2113 := x\u2113\u22121 + \u03b1\u2113 \u00b7 ei\u2113. Finally, the algorithm returns z := xn. Figure 5 illustrates this process. By the following property, we can use the greedy algorithm to evaluate the value h(\u03bb) = f\u0302\u03bb(V ).\nTheorem 8 ([3]). For each \u2113 = 1, . . . , n, we have f\u0302\u03bb(V \u2113) = x\u2113(V \u2113) = z(V \u2113).\n\n6\n\n\fLet us see that the greedy algorithm with f\u0302\u03bb can be implemented to run in polynomial time. We discuss how to compute \u03b1\u2113 in each iteration. Since x\u2113\u22121 \u2208 P(f\u0302\u03bb) and P(f\u0302\u03bb) = P(f\u03bb), we have\n\n\u03b1\u2113 = max{\u03b1 : x\u2113\u22121 + \u03b1 \u00b7 ei\u2113 \u2208 P(f\u03bb)} = max{\u03b1 : x\u2113\u22121(S) + \u03b1 \u2264 f\u03bb(S), i\u2113 \u2208 \u2200S \u2286 V }\n= min{f (S) \u2212 x\u2113\u22121(S) \u2212 \u03bb : i\u2113 \u2208 \u2200S \u2286 V }\n= min{f (S) \u2212 x\u2113\u22121(S) \u2212 \u03bb : i\u2113 \u2208 \u2200S \u2286 V \u2113},   (5)\n\nwhere the last equality holds because of the choice of the initial vector x0 (remark that x\u2113\u22121i = x0i for all i \u2208 V \u2212 V \u2113). Hence, the value \u03b1\u2113 can be computed by minimizing a fully submodular function. It follows from Theorem 8 that the value h(\u03bb) = f\u0302\u03bb(V ) can be computed in O(n \u00b7 SFM(n)) time. In addition to the value h(\u03bb), a partition P of V such that f [P] \u2212 \u03bb|P| = h(\u03bb) is also required. 
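The greedy evaluation of h(λ) can be sketched on a toy instance: each αℓ in (5) is found by plain enumeration over the subsets of Vℓ containing iℓ, standing in for a submodular function minimization oracle. The weighted path graph below is a hypothetical example, not from the paper.

```python
from itertools import combinations

# Hypothetical instance: cut function of a weighted path 0-1-2-3.
edges = {(0, 1): 3.0, (1, 2): 1.0, (2, 3): 3.0}
V = {0, 1, 2, 3}
f = lambda S: sum(w for (u, v), w in edges.items() if (u in S) != (v in S))

def h_greedy(lam):
    """Evaluate h(lam) as z(V) via Edmonds' greedy algorithm on the Dilworth
    truncation of f_lam, with alpha_l computed by brute-force enumeration."""
    order = sorted(V)
    # the initial vector x0 = (-M, ..., -M) as prescribed in the text
    M = lam + sum(abs(f({j})) + abs(f(V) - f(V - {j})) for j in V)
    x = {i: -M for i in order}
    for l in range(1, len(order) + 1):
        i, Vl = order[l - 1], order[:l]
        rest = [j for j in Vl if j != i]
        # alpha_l = min{ f(S) - x(S) - lam : i in S, S subseteq V^l }  -- eq. (5)
        alpha = min(
            f(set(c) | {i}) - sum(x[j] for j in c) - x[i] - lam
            for r in range(len(rest) + 1) for c in combinations(rest, r)
        )
        x[i] += alpha
    return sum(x.values())  # z(V) = h(lam)

print(h_greedy(0.0), h_greedy(2.0))  # 0.0 -2.0
```

At λ = 0 the trivial partition {V} attains h, while at λ = 2 the minimum of f[P] − 2|P| is −2; a polynomial-time implementation would replace the enumeration by an SFM algorithm.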
For this purpose, we modify the above greedy algorithm, and obtain the procedure FINDPARTITION.\n\nProcedure FINDPARTITION(\u03bb)\nInput : A nonnegative real value \u03bb \u2265 0.\nOutput : A real value h\u03bb and a partition P\u03bb of V .\n1: Set P 0 := \u2205.\n2: For each \u2113 = 1, . . . , n, do:\nCompute \u03b1\u2113 = min{f (S) \u2212 x\u2113\u22121(S) \u2212 \u03bb : i\u2113 \u2208 \u2200S \u2286 V \u2113};\nFind a subset T \u2113 such that i\u2113 \u2208 T \u2113 \u2286 V \u2113 and f (T \u2113) \u2212 x\u2113\u22121(T \u2113) \u2212 \u03bb = \u03b1\u2113;\nSet x\u2113 := x\u2113\u22121 + \u03b1\u2113 \u00b7 ei\u2113, set U \u2113 := T \u2113 \u222a [ \u222a{S : S \u2208 P \u2113\u22121, T \u2113 \u2229 S \u2260 \u2205}], and set P \u2113 := {U \u2113} \u222a {S : S \u2208 P \u2113\u22121, T \u2113 \u2229 S = \u2205}.\n3: Return h\u03bb := z(V ) and P\u03bb := P n.\n\nBasically, this procedure FINDPARTITION(\u03bb) is the same algorithm as the above greedy algorithm. But now, we compute P \u2113 in each iteration. Figure 6 shows the computation of P \u2113 in the \u2113-th iteration of the procedure FINDPARTITION(\u03bb). For each \u2113 = 1, . . . , n, P \u2113 is a partition of V \u2113 = {i1, . . . , i\u2113}. Thus, P\u03bb is a partition of V .\n\nFigure 6: Computation of P \u2113\n\nLet x be a vector in P(f\u03bb). We say that a subset S \u2286 V is x-tight (with respect to f\u03bb) if f\u03bb(S) = x(S). By the intersecting submodularity of f\u03bb, if S and T are intersecting and both S and T are x-tight, then S \u222a T is also x-tight. Using this property, we obtain the following lemma.\nLemma 9. For each \u2113 = 1, . . . , n, we have f\u0302\u03bb(V \u2113) = x\u2113(V \u2113) = f\u03bb[P \u2113].\nProof. (Sketch) For each \u2113 = 1, . . . , n, observe that T \u2113 is x\u2113-tight. Thus, we can show by induction that any cluster in P \u2113 is x\u2113-tight for each \u2113 = 1, . . . , n. Thus, f\u03bb[P \u2113] = \u03a3S\u2208P \u2113 f\u03bb(S) = \u03a3S\u2208P \u2113 x\u2113(S) = x\u2113(V \u2113). Moreover, the equality f\u0302\u03bb(V \u2113) = x\u2113(V \u2113) follows from Theorem 8.\nThe procedure FINDPARTITION(\u03bb) returns h\u03bb \u2208 R and P\u03bb. By Theorem 8, we have h\u03bb = h(\u03bb), and by Lemma 9, we have f\u0302\u03bb(V ) = f\u03bb[P\u03bb], and thus the partition P\u03bb of V determines h(\u03bb). Clearly, the procedure runs in O(n \u00b7 SFM(n)) time. This completes the proof of Lemma 4.\n\n5 Experimental results\n\n5.1 Illustrative example\n\nWe \ufb01rst illustrate the proposed algorithm using two arti\ufb01cial datasets depicted in Figure 7. The upper dataset is generated from four Gaussians with unit variance (whose centers are located at (3,3), (3,-3), (-3,3) and (-3,-3), respectively), and the lower one consists of three circles with different radii together with a line. The numbers of samples in these examples are 100 and 310, respectively. Figure 7 shows typical examples of partitions calculated through Algorithm SPLIT given in Section 3. Now the function f is a cut function of a complete graph and the weight of each edge of that graph is determined by the Gaussian similarity function [15]. 
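Such a similarity-weighted cut function can be built as below; the four points are placed at the Gaussian centers used in Figure 7, and the bandwidth sigma is an assumed parameter, not a value reported in the paper.

```python
import math

def gaussian_similarity_edges(points, sigma):
    """Complete-graph edge weights w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    edges = {}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            edges[(i, j)] = math.exp(-d2 / (2.0 * sigma ** 2))
    return edges

def cut(S, edges):
    # f(S): total similarity crossing the boundary of S (symmetric and submodular)
    return sum(w for (u, v), w in edges.items() if (u in S) != (v in S))

# four points at the Gaussian centers from Figure 7; sigma = 3.0 is a hypothetical choice
pts = [(3, 3), (3, -3), (-3, 3), (-3, -3)]
e = gaussian_similarity_edges(pts, sigma=3.0)

# grouping nearby points cuts less total similarity than grouping diagonal ones
print(cut({0, 1}, e) < cut({0, 3}, e))  # True
```

Because nearby points receive large similarity weights, low-cut partitions keep them together, which is what makes the cut function a sensible inhomogeneity measure here.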
[Figure 7 panels omitted: scatter plots of the clusterings obtained at λ = 0.19, 0.54, 5.21 (four Gaussians, above) and λ = 0.87, 3.22, 4.90 (three circles, below).]

Figure 7: Illustrative examples with datasets from four Gaussians (above) and three circles (below).

The values of λ above the figures are the ones identified as breakpoints. Note that several partitions other than those shown in the figures were obtained through one execution of Algorithm SPLIT. As can be seen, the algorithm produced clusters of several different sizes with inclusion relations.

5.2 Empirical comparison

Next, we empirically compare the performance of the proposed algorithm with existing algorithms on several synthetic and real-world datasets from the UCI repository. The compared algorithms are the k-means method, spectral clustering with normalized cut [11], and maximum-margin clustering [16]; for the MAC clustering algorithm, we used cut functions as the objective functions. The three UCI datasets used in this experiment are 'Glass', 'Iris', and 'Libras', which consist of 214, 150, and 360 samples, respectively. For the existing algorithms, the number of clusters was selected through 5-fold cross-validation (again, note that our algorithm needs no such hyper-parameter tuning). Table 1 shows the clustering accuracy when applying the algorithms to the two artificial datasets (stated in Subsection 5.1) and the three UCI datasets. For our algorithm, the results with the best performance among the obtained partitions are shown. As can be seen, our algorithm is competitive with the existing leading algorithms on these datasets.

                    Gaussian   Circle   Iris   Libras   Glass
  k-means             1.0       0.88    0.79    0.93    0.85
  normalized cut      0.88      0.86    0.84    0.93    0.87
  maximum margin      0.99      1.0     0.96    0.97    0.90
  minimum average     0.99      1.0     0.99    0.97    0.97

Table 1: Clustering accuracy for the proposed and existing algorithms.

6 Concluding remarks

We have introduced a new concept, the minimum average cost clustering problem.
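The paper does not spell out how the clustering accuracy in Table 1 is computed. A standard convention, sketched below under that assumption, scores an unlabeled partition by the best one-to-one matching between predicted clusters and ground-truth classes; the function name and brute-force matching are illustrative, not from the paper.

```python
from itertools import permutations

def clustering_accuracy(true_labels, pred_labels):
    """Fraction of correctly grouped points under the best one-to-one
    relabeling of predicted clusters (brute force; fine for small k)."""
    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(pred_labels))
    best = 0.0
    # Try every assignment of predicted cluster ids to true class ids
    # and keep the one that maximizes agreement.
    for perm in permutations(true_ids, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        hits = sum(mapping[p] == t for t, p in zip(true_labels, pred_labels))
        best = max(best, hits / len(true_labels))
    return best
```

For larger numbers of clusters, the brute-force loop would typically be replaced by a Hungarian-algorithm matching, but for the 2–15 clusters in these datasets the exhaustive version suffices.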
We have shown that the set of minimum average cost clusterings has a compact representation, and, when the clustering cost is given by a submodular function, we have proposed a polynomial-time algorithm that computes all information about minimum average cost clusterings. This result contrasts sharply with the NP-hardness of the optimal k-clustering problem [5]. The present paper reinforces the importance of the theory of intersecting submodular functions from the viewpoint of clustering.

Acknowledgments

This work is supported in part by the JSPS Global COE program "Computationism as a Foundation for the Sciences", KAKENHI (20310088, 22700007, and 22700147), and the JST PRESTO program. We would also like to thank Takuro Fukunaga for his helpful comments.

References

[1] W. H. Cunningham: Optimal attack and reinforcement of a network. Journal of the ACM, 32 (1985), pp. 549–561.

[2] R. P. Dilworth: Dependence relations in a semimodular lattice. Duke Mathematical Journal, 11 (1944), pp. 575–587.

[3] J. Edmonds: Submodular functions, matroids, and certain polyhedra. Combinatorial Structures and Their Applications, R. Guy, H. Hanani, N. Sauer, and J. Schönheim, eds., Gordon and Breach, 1970, pp. 69–87.

[4] S. Fujishige: Submodular Functions and Optimization (Second Edition). Elsevier, Amsterdam, 2005.

[5] O. Goldschmidt and D. S. Hochbaum: A polynomial algorithm for the k-cut problem for fixed k. Mathematics of Operations Research, 19 (1994), pp. 24–37.

[6] S. Iwata: Submodular function minimization. Mathematical Programming, 112 (2008), pp. 45–64.

[7] V. Kolmogorov: A faster algorithm for computing the principal sequence of partitions of a graph. Algorithmica, 56, pp. 394–412.

[8] Y. Kawahara, K. Nagano, and Y. Okamoto: Submodular fractional programming for balanced clustering. Pattern Recognition Letters, to appear.

[9] M. Narasimhan and J.
Bilmes: Local search for balanced submodular clusterings. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI 2007), pp. 981–986.

[10] M. Narasimhan, N. Jojic, and J. Bilmes: Q-clustering. In Advances in Neural Information Processing Systems, 18 (2006), pp. 979–986. Cambridge, MA: MIT Press.

[11] A. Y. Ng, M. I. Jordan, and Y. Weiss: On spectral clustering: analysis and an algorithm. In Advances in Neural Information Processing Systems, 14 (2002), pp. 849–856.

[12] K. Okumoto, T. Fukunaga, and H. Nagamochi: Divide-and-conquer algorithms for partitioning hypergraphs and submodular systems. In Proceedings of the 20th International Symposium on Algorithms and Computation (ISAAC 2009), LNCS 5878, 2009, pp. 55–64.

[13] M. Queyranne: Minimizing symmetric submodular functions. Mathematical Programming, 82 (1998), pp. 3–12.

[14] A. Schrijver: Combinatorial Optimization: Polyhedra and Efficiency. Springer-Verlag, 2003.

[15] U. von Luxburg: A tutorial on spectral clustering. Statistics and Computing, 17 (2007), pp. 395–416.

[16] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans: Maximum margin clustering. In Advances in Neural Information Processing Systems, 17 (2005), pp. 1537–1544.

[17] L. Zhao, H. Nagamochi, and T. Ibaraki: Approximating the minimum k-way cut in a graph via minimum 3-way cuts. Journal of Combinatorial Optimization, 5 (2001), pp. 397–410.

[18] L. Zhao, H. Nagamochi, and T. Ibaraki: A unified framework for approximating multiway partition problems. In Proceedings of the 12th International Symposium on Algorithms and Computation (ISAAC 2001), LNCS 2223, 2001, pp.
682–694.