{"title": "Weighted Theta Functions and Embeddings with Applications to Max-Cut, Clustering and Summarization", "book": "Advances in Neural Information Processing Systems", "page_first": 1018, "page_last": 1026, "abstract": "We introduce a unifying generalization of the Lov\u00e1sz theta function, and the associated geometric embedding, for graphs with weights on both nodes and edges. We show how it can be computed exactly by semidefinite programming, and how to approximate it using SVM computations. We show how the theta function can be interpreted as a measure of diversity in graphs and use this idea, and the graph embedding in algorithms for Max-Cut, correlation clustering and document summarization, all of which are well represented as problems on weighted graphs.", "full_text": "Weighted Theta Functions and Embeddings with Applications to Max-Cut, Clustering and Summarization\n\nFredrik D. Johansson\nComputer Science & Engineering\nChalmers University of Technology\nG\u00f6teborg, SE-412 96, Sweden\nfrejohk@chalmers.se\n\nChiranjib Bhattacharyya\nComputer Science and Automation\nIndian Institute of Science\nBangalore 560012, Karnataka, India\nchiru@csa.iisc.ernet.in\n\nAnkani Chattoraj\u2217\nBrain & Cognitive Sciences\nUniversity of Rochester\nRochester, NY 14627-0268, USA\nachattor@ur.rochester.edu\n\nDevdatt Dubhashi\nComputer Science & Engineering\nChalmers University of Technology\nG\u00f6teborg, SE-412 96, Sweden\ndubhashi@chalmers.se\n\nAbstract\n\nWe introduce a unifying generalization of the Lov\u00e1sz theta function, and the associated geometric embedding, for graphs with weights on both nodes and edges. We show how it can be computed exactly by semidefinite programming, and how to approximate it using SVM computations. 
We show how the theta function can be interpreted as a measure of diversity in graphs and use this idea, and the graph embedding, in algorithms for Max-Cut, correlation clustering and document summarization, all of which are well represented as problems on weighted graphs.\n\n1 Introduction\n\nEmbedding structured data, such as graphs, in geometric spaces is a central problem in machine learning. In many applications, graphs are attributed with weights on the nodes and edges \u2013 information that needs to be well represented by the embedding. Lov\u00e1sz introduced a graph embedding together with the famous theta function in the seminal paper [19], giving his celebrated solution to the problem of computing the Shannon capacity of the pentagon. Indeed, Lov\u00e1sz\u2019s embedding is a very elegant and powerful representation of unweighted graphs that has come to play a central role in information theory, graph theory and combinatorial optimization [10, 8]. However, despite there being at least eight different formulations of $\\vartheta(G)$ for unweighted graphs, see for example [20], there does not appear to be a version that applies to graphs with weights on the edges. This is surprising, as it has a natural interpretation in the information theoretic problem of the original definition [19].\n\nA version of the Lov\u00e1sz number for edge-weighted graphs, and a corresponding geometrical representation, could open the way to new approaches to learning problems on data represented as similarity matrices. Here we propose such a generalization for graphs with weights on both nodes and edges, by combining a few key observations. Recently, Jethava et al. [14] discovered an interesting connection between the original theta function and a central problem in machine learning, namely the one-class Support Vector Machine (SVM) formulation. This kernel-based method gives yet another equivalent characterization of the Lov\u00e1sz number. 
Crucially, it is easily modified to yield an equivalent characterization of the closely related Delsarte version of the Lov\u00e1sz number introduced by Schrijver [24], which is more flexible and often more convenient to work with. Using this kernel characterization of the Delsarte version of the Lov\u00e1sz number, we define a theta function and embedding of weighted graphs, suitable for learning with data represented as similarity matrices.\n\n\u2217 This work was performed when the author was affiliated with CSE at Chalmers University of Technology.\n\nThe original theta function is limited to applications on small graphs, because of its formulation as a semidefinite program (SDP). In [14], Jethava et al. showed that their kernel characterization can be used to compute a number and an embedding of a graph that are often good approximations to the theta function and embedding, and that can be computed fast, scaling to very large graphs. Here we give the analogous approximate method for weighted graphs. We use this approximation to solve the weighted maximum cut problem faster than the classical SDP relaxation.\n\nFinally, we show that our edge-weighted theta function has a natural interpretation as a measure of diversity in graphs. We use this intuition to define a centroid-based correlation clustering algorithm that automatically chooses the number of clusters and initializes the centroids. 
We also show how to use the support vectors, computed in the kernel characterization with both node and edge weights, to perform extractive document summarization.\n\nTo summarize our main contributions:\n\n\u2022 We introduce a unifying generalization of the famous Lov\u00e1sz number applicable to graphs with weights on both nodes and edges.\n\u2022 We show that via our characterization, we can compute a good approximation to our weighted theta function and the corresponding embeddings using SVM computations.\n\u2022 We show that the weighted version of the Lov\u00e1sz number can be interpreted as a measure of diversity in graphs, and we use this to define a correlation clustering algorithm dubbed \u03d1-means that automatically a) chooses the number of clusters, and b) initializes centroids.\n\u2022 We apply the embeddings corresponding to the weighted Lov\u00e1sz numbers to solve weighted maximum cut problems faster than the classical SDP methods, with similar accuracy.\n\u2022 We apply the weighted kernel characterization of the theta function to document summarization, exploiting both node and edge weights.\n\n2 Extensions of Lov\u00e1sz and Delsarte numbers for weighted graphs\n\nBackground. Consider embeddings of undirected graphs $G = (V, E)$. Lov\u00e1sz introduced an elegant embedding, implicit in the definition of his celebrated theta function $\\vartheta(G)$ [19], famously an upper bound on the Shannon capacity and sandwiched between the independence number and the chromatic number of the complement graph.\n\n$$\\vartheta(G) = \\min_{\\{u_i\\}, c} \\max_i \\frac{1}{(c^\\top u_i)^2}, \\quad u_i^\\top u_j = 0, \\ \\forall (i, j) \\notin E, \\quad \\|u_i\\| = \\|c\\| = 1. \\tag{1}$$\n\nThe vectors $\\{u_i\\}, c$ are so-called orthonormal representations or labellings, the dimension of which is determined by the optimization. We refer to both $\\{u_i\\}$ and the matrix $U = [u_1, \\ldots, u_n]$ as an embedding of $G$, and use the two notations interchangeably. 
Jethava et al. [14] introduced a characterization of the Lov\u00e1sz $\\vartheta$ function that established a close connection with the one-class support vector machine [23]. They showed that, for an unweighted graph $G = (V, E)$,\n\n$$\\vartheta(G) = \\min_{K \\in \\mathcal{K}(G)} \\omega(K), \\quad \\text{where} \\tag{2}$$\n$$\\mathcal{K}(G) := \\{K \\succeq 0 \\mid K_{ii} = 1, \\ K_{ij} = 0, \\ \\forall (i, j) \\notin E\\}, \\tag{3}$$\n$$\\omega(K) := \\max_{\\alpha_i \\geq 0} f(\\alpha; K), \\quad f(\\alpha; K) := 2 \\sum_i \\alpha_i - \\sum_{i,j} K_{ij} \\alpha_i \\alpha_j \\tag{4}$$\n\nis the dual formulation of the one-class SVM problem, see [16]. Note that the conditions on $K$ only refer to the non-edges of $G$. In the sequel, $\\omega(K)$ and $f(\\alpha; K)$ always refer to the definitions in (4).\n\n2.1 New weighted versions of $\\vartheta(G)$\n\nA key observation in proving (2) is that the set of valid orthonormal representations is equivalent to the set of kernels $\\mathcal{K}$. This equivalence can be preserved in a natural way when generalizing the definition to weighted graphs: any constraint on the inner product $u_i^\\top u_j$ may be represented as constraints on the elements $K_{ij}$ of the kernel matrix.\n\nTo define weighted extensions of the theta function, we need to first pass to the closely related Delsarte version of the Lov\u00e1sz number introduced by Schrijver [24]. In the Delsarte version, the orthogonality constraint for non-edges is relaxed to $u_i^\\top u_j \\leq 0$, $(i, j) \\notin E$. With reference to the formulation (2) it is easy to observe that the Delsarte version is given by\n\n$$\\vartheta_1(G) = \\min_{K \\in \\mathcal{K}_1(G)} \\omega(K), \\quad \\text{where } \\mathcal{K}_1(G) := \\{K \\succeq 0 \\mid K_{ii} = 1, \\ K_{ij} \\leq 0, \\ \\forall (i, j) \\notin E\\}. \\tag{5}$$\n\nIn other words, the Lov\u00e1sz number corresponds to orthogonal labellings of $G$ with orthogonal vectors on the unit sphere assigned to non-adjacent nodes whereas the Delsarte version corresponds to obtuse labellings, i.e. 
the vectors corresponding to non-adjacent nodes are vectors on the unit sphere meeting at obtuse angles. In both cases, the corresponding number is essentially the half-angle of the smallest spherical cap containing all the vectors assigned to the nodes. Comparing (2) and (5) it follows that $\\vartheta_1(G) \\leq \\vartheta(G)$. In the sequel, we will use the Delsarte version and obtuse labellings to define weighted generalizations of the theta function.\n\nWe observe in passing that for any $K \\in \\mathcal{K}_1$, and for any independent set $I$ in the graph, taking $\\alpha_i = 1$ if $i \\in I$ and 0 otherwise,\n\n$$\\omega(K) \\geq 2 \\sum_i \\alpha_i - \\sum_{i,j} \\alpha_i \\alpha_j K_{ij} = \\sum_i \\alpha_i - \\sum_{i \\neq j} \\alpha_i \\alpha_j K_{ij} \\geq \\sum_i \\alpha_i = |I| \\tag{6}$$\n\nsince for each term in the second sum, either $(i, j)$ is an edge, in which case either $\\alpha_i$ or $\\alpha_j$ is zero, or $(i, j)$ is a non-edge, in which case $K_{ij} \\leq 0$. Thus, like $\\vartheta(G)$, the Delsarte version $\\vartheta_1(G)$ is also an upper bound on the stability or independence number $\\alpha(G)$.\n\nKernel characterization of theta functions on node-weighted graphs. The Lov\u00e1sz number has a classical extension to graphs with node weights $\\sigma = [\\sigma_1, \\ldots, \\sigma_n]^\\top$, see for example [17]. The generalization, in the Delsarte version (note the inequality constraint), is the following\n\n$$\\vartheta(G, \\sigma) = \\min_{\\{u_i\\}, c} \\max_i \\frac{\\sigma_i}{(c^\\top u_i)^2}, \\quad u_i^\\top u_j \\leq 0, \\ \\forall (i, j) \\notin E, \\quad \\|u_i\\| = \\|c\\| = 1. \\tag{7}$$\n\nBy passing to the dual of (7), see section 2.1 and [16], we may, as for unweighted graphs, characterize $\\vartheta(G, \\sigma)$ by a minimization over the set of kernels,\n\n$$\\mathcal{K}(G, \\sigma) := \\{K \\succeq 0 \\mid K_{ii} = 1/\\sigma_i, \\ K_{ij} \\leq 0, \\ \\forall (i, j) \\notin E\\} \\tag{8}$$\n\nand, just like in the unweighted case, $\\vartheta_1(G, \\sigma) = \\min_{K \\in \\mathcal{K}(G, \\sigma)} \\omega(K)$. 
When $\\sigma_i = 1, \\forall i \\in V$, this reduces to the unweighted case. We also note that for any $K \\in \\mathcal{K}(G, \\sigma)$ and for any independent set $I$ in the graph, taking $\\alpha_i = \\sigma_i$ if $i \\in I$ and 0 otherwise,\n\n$$\\omega(K) \\geq 2 \\sum_i \\alpha_i - \\sum_{i,j} \\alpha_i \\alpha_j K_{ij} = 2 \\sum_{i \\in I} \\sigma_i - \\sum_{i \\in I} \\frac{\\sigma_i^2}{\\sigma_i} - \\sum_{i \\neq j} \\alpha_i \\alpha_j K_{ij} \\geq \\sum_{i \\in I} \\sigma_i, \\tag{9}$$\n\nsince $K_{ij} \\leq 0 \\ \\forall (i, j) \\notin E$. Since this bound holds for every $K \\in \\mathcal{K}(G, \\sigma)$, the minimum $\\vartheta_1(G, \\sigma) = \\min_K \\omega(K)$ is an upper bound on the weight of the maximum-weight independent set.\n\nExtension to edge-weighted graphs. The kernel characterization of $\\vartheta_1(G)$ allows one to define a natural extension to data given as similarity matrices represented in the form of a weighted graph $G = (V, S)$. Here, $S$ is a similarity function on (unordered) node pairs, and $S(i, j) \\in [0, 1]$ with $+1$ representing complete similarity and 0 complete dissimilarity. The obtuse labellings corresponding to the Delsarte version are somewhat more flexible even for unweighted graphs, but are particularly well suited for weighted graphs. We define\n\n$$\\vartheta_1(G, S) := \\min_{K \\in \\mathcal{K}(G, S)} \\omega(K), \\quad \\text{where } \\mathcal{K}(G, S) := \\{K \\succeq 0 \\mid K_{ii} = 1, \\ K_{ij} \\leq S_{ij}\\}. \\tag{10}$$\n\nIn the case of an unweighted graph, where $S_{ij} \\in \\{0, 1\\}$, this reduces exactly to (5).\n\nTable 1: Characterizations of weighted theta functions. In the first row are characterizations following the original definition. In the second are kernel characterizations. The bottom row are versions of the LS-labelling [14]. In all cases, $\\|u_i\\| = \\|c\\| = 1$. 
$A$ refers to the adjacency matrix of $G$.\n\nCharacterization | Unweighted | Node-weighted | Edge-weighted\nOriginal definition | $\\min_{\\{u_i\\}} \\min_c \\max_i \\frac{1}{(c^\\top u_i)^2}$ subject to $u_i^\\top u_j \\leq 0, \\ \\forall (i, j) \\notin E$ | $\\min_{\\{u_i\\}} \\min_c \\max_i \\frac{\\sigma_i}{(c^\\top u_i)^2}$ subject to $u_i^\\top u_j = 0, \\ \\forall (i, j) \\notin E$ | $\\min_{\\{u_i\\}} \\min_c \\max_i \\frac{1}{(c^\\top u_i)^2}$ subject to $u_i^\\top u_j \\leq S_{ij}, \\ i \\neq j$\nKernel characterization | $\\mathcal{K}_G = \\{K \\succeq 0 \\mid K_{ii} = 1, \\ K_{ij} = 0, \\ \\forall (i, j) \\notin E\\}$ | $\\mathcal{K}_{G,\\sigma} = \\{K \\succeq 0 \\mid K_{ii} = 1/\\sigma_i, \\ K_{ij} = 0, \\ \\forall (i, j) \\notin E\\}$ | $\\mathcal{K}_{G,S} = \\{K \\succeq 0 \\mid K_{ii} = 1, \\ K_{ij} \\leq S_{ij}, \\ i \\neq j\\}$\nLS-labelling | $K_{LS} = \\frac{A}{\\lvert\\lambda_n(A)\\rvert} + I$ | $K_{LS}^{\\sigma} = \\frac{A}{\\sigma_{\\max} \\lvert\\lambda_n(A)\\rvert} + \\mathrm{diag}(\\sigma)^{-1}$ | $K_{LS}^{S} = \\frac{S}{\\lvert\\lambda_n(S)\\rvert} + I$\n\nUnifying weighted generalization. We may now combine both node and edge weights to form a fully general extension to the Delsarte version of the Lov\u00e1sz number,\n\n$$\\vartheta_1(G, \\sigma, S) = \\min_{K \\in \\mathcal{K}(G, \\sigma, S)} \\omega(K), \\quad \\mathcal{K}(G, \\sigma, S) := \\left\\{ K \\succeq 0 \\ \\middle| \\ K_{ii} = \\frac{1}{\\sigma_i}, \\ K_{ij} \\leq \\frac{S_{ij}}{\\sqrt{\\sigma_i \\sigma_j}} \\right\\}. \\tag{11}$$\n\nIt is easy to see that for unweighted graphs, $S_{ij} \\in \\{0, 1\\}$, $\\sigma_i = 1$, the definition reduces to the Delsarte version of the theta function in (5). $\\vartheta_1(G, \\sigma, S)$ is hence a strict generalization of $\\vartheta_1(G)$. All the proposed weighted extensions are defined by the same objective, $\\omega(K)$. The only difference is the set $\\mathcal{K}$, specialized in various ways, over which the minimum, $\\min_{K \\in \\mathcal{K}} \\omega(K)$, is computed. It is also important to note that with the generalization of the theta function comes an implicit generalization of the geometric representation of $G$. Specifically, for any feasible $K$ in (11), there is an embedding $U = [u_1, \\ldots, u_n]$ such that $K = U^\\top U$ with the properties $u_i^\\top u_j \\sqrt{\\sigma_i \\sigma_j} \\leq S_{ij}$, $\\|u_i\\|_2 = 1/\\sqrt{\\sigma_i}$, which can be retrieved using matrix decomposition. Note that $u_i^\\top u_j \\sqrt{\\sigma_i \\sigma_j}$ is exactly the cosine similarity between $u_i$ and $u_j$, which is a very natural choice when $S_{ij} \\in [0, 1]$.\n\nThe original definition of the (Delsarte) theta function and its extensions, as well as their kernel characterizations, can be seen in table 1. We can prove the equivalence of the embedding (top) and kernel characterizations (middle) using the following result.\n\nProposition 2.1. For any embedding $U \\in \\mathbb{R}^{d \\times n}$ with $K = U^\\top U$, and $f$ in (4), the following holds\n\n$$\\min_{c \\in S^{d-1}} \\max_i \\frac{1}{(c^\\top u_i)^2} = \\max_{\\alpha_i \\geq 0} f(\\alpha; K). \\tag{12}$$\n\nProof. The result is given as part of the proof of Theorem 3 in Jethava et al. [14]. See also [16].\n\nAs we have already established in section 2 that any set of geometric embeddings has a characterization as a set of kernel matrices, it follows that minimizing the LHS in (12) over a (constrained) set of orthogonal representations, $\\{u_i\\}$, is equivalent to minimizing the RHS over a kernel set $\\mathcal{K}$.\n\n3 Computation and fixed-kernel approximation\n\nThe weighted generalization of the theta function, $\\vartheta_1(G, \\sigma, S)$, defined in the previous section, may be computed as a semidefinite program. In fact $\\vartheta_1(G, \\sigma, S) = 1/(t^*)^2$ for $t^*$, the solution to the following problem. For details, see [16]. With $S^k_+$ the set of $k \\times k$ symmetric p.s.d. matrices,\n\n$$\\begin{aligned} \\underset{X}{\\text{maximize}} \\quad & t \\\\ \\text{subject to} \\quad & X \\in S^{n+1}_+, \\\\ & X_{i,n+1} \\geq t, \\ X_{ii} = 1/\\sigma_i, \\quad i \\in [n], \\\\ & X_{ij} \\leq S_{ij}/\\sqrt{\\sigma_i \\sigma_j}, \\quad i \\neq j, \\ i, j \\in [n]. \\end{aligned} \\tag{13}$$\n\nWhile polynomial in time complexity [13], solving the SDP is too slow in many cases. To address this, Jethava et al. 
[14] introduced a fast approximation to (the unweighted) $\\vartheta(G)$, dubbed SVM-theta. They showed that in some cases, the minimization over $K$ in (2) can be replaced by a fixed choice of $K$, while causing just a constant-factor error. Specifically, for unweighted graphs with adjacency matrix $A$, Jethava et al. [14] defined the so-called LS-labelling, $K_{LS}(G) = A/\\lvert\\lambda_n(A)\\rvert + I$, and showed that for large families of graphs $\\vartheta(G) \\leq \\omega(K_{LS}(G)) \\leq \\gamma \\vartheta(G)$, for a constant $\\gamma$.\n\nWe extend the LS-labelling to weighted graphs. For graphs with edge weights, represented by a similarity matrix $S$, the original definition may be used, with $S$ substituted for $A$. For node-weighted graphs we also must satisfy the constraint $K_{ii} = 1/\\sigma_i$, see (8). A natural choice, still ensuring positive semidefiniteness, is\n\n$$K_{LS}(G, \\sigma) = \\frac{A}{\\sigma_{\\max} \\lvert\\lambda_n(A)\\rvert} + \\mathrm{diag}(\\sigma)^{-1} \\tag{14}$$\n\nwhere $\\mathrm{diag}(\\sigma)^{-1}$ is the diagonal matrix $\\Sigma$ with elements $\\Sigma_{ii} = 1/\\sigma_i$, and $\\sigma_{\\max} = \\max_{i=1}^{n} \\sigma_i$. Both weighted versions of the LS-labelling are presented in table 1. The fully generalized labelling, for graphs with weights on both nodes and edges, $K_{LS}(G, \\sigma, S)$, can be obtained by substituting $S$ for $A$ in (14). As with the exact characterization, we note that $K_{LS}(G, \\sigma, S)$ reduces to $K_{LS}(G)$ for the uniform case, $S_{ij} \\in \\{0, 1\\}$, $\\sigma_i = 1$. For all versions of the LS-labelling of $G$, as with the exact characterization, a geometric embedding $U$ may be obtained from $K_{LS}$ using matrix decomposition.\n\n3.1 Computational complexity\n\nSolving the full problem in the kernel characterization (11) is not faster than computing the SDP characterization (13). However, for a fixed $K$, the one-class SVM can be solved in $O(n^2)$ time [12]. Retrieving the embedding $U : K = U^\\top U$ may be done using Cholesky or singular value decomposition (SVD). In general, algorithms for these problems have complexity $O(n^3)$. 
However, in many cases, a rank $d$ approximation to the decomposition is sufficient, see for example [9]. A thin (or truncated) SVD corresponding to the top $d$ singular values may be computed in $O(n^2 d)$ time [5] for $d = O(\\sqrt{n})$. The remaining issue is the computation of $K$. The complexity of computing the LS-labelling discussed in the previous section is dominated by the computation of the minimum eigenvalue $\\lambda_n(A)$. This can be done approximately in $\\tilde{O}(m)$ time, where $m$ is the number of edges of the graph [1]. Overall, the complexity of computing both the embedding $U$ and $\\omega(K)$ is $O(dn^2)$.\n\n4 The theta function as diversity in graphs: \u03d1-means clustering\n\nIn section 2, we defined extensions of the Delsarte version of the Lov\u00e1sz number, $\\vartheta_1(G)$, and the associated geometric embedding, for weighted graphs. Now we wish to show how both $\\vartheta(G)$ and the geometric embedding are useful for solving common machine learning tasks. We build on an intuition of $\\vartheta(G)$ as a measure of diversity in graphs, illustrated here by a few simple examples. For complete graphs $K_n$, it is well known that $\\vartheta(K_n) = 1$, and for empty graphs $\\bar{K}_n$, $\\vartheta(\\bar{K}_n) = n$. We may interpret these graphs as having 1 and $n$ clusters respectively. Graphs with several disjoint clusters make a natural middle-ground. For a graph $G$ that is a union of $k$ disjoint cliques, $\\vartheta(G) = k$.\n\nNow, consider the analogue of (6) for graphs with edge weights $S_{ij}$. For any $K \\in \\mathcal{K}(G, S)$ and for any subset $H$ of nodes, let $\\alpha_i = 1$ if $i \\in H$ and 0 otherwise. 
Then, since $K_{ij} \\leq S_{ij}$,\n\n$$2 \\sum_i \\alpha_i - \\sum_{i,j} \\alpha_i \\alpha_j K_{ij} = \\sum_i \\alpha_i - \\sum_{i \\neq j} \\alpha_i \\alpha_j K_{ij} \\geq |H| - \\sum_{i \\neq j, \\ i, j \\in H} S_{ij}.$$\n\nMaximizing this expression may be viewed as the trade-off of finding a subset of nodes that is both large and diverse; the objective function is the size of the set subjected to a penalty for non-diversity.\n\nIn general support vector machines, non-zero support values $\\alpha_i$ correspond to support vectors, defining the decision boundary. As a result, nodes $i \\in V$ with high values $\\alpha_i$ may be interpreted as an important and diverse set of nodes.\n\n4.1 \u03d1-means clustering\n\nAlgorithm 1 \u03d1-means clustering\n1: Input: Graph $G$, with weight matrix $S$ and node weights $\\sigma$.\n2: Compute kernel $K \\in \\mathcal{K}(G, \\sigma, S)$\n3: $\\alpha^*_i \\leftarrow \\arg\\max_{\\alpha_i} f(\\alpha; K)$, as in (4)\n4: Sort alphas according to $j_i$ such that $\\alpha_{j_1} \\geq \\alpha_{j_2} \\geq \\ldots \\geq \\alpha_{j_n}$\n5: Let $k = \\lceil \\hat{\\vartheta} \\rceil$ where $\\hat{\\vartheta} \\leftarrow \\omega(K) = f(\\alpha^*; K)$\n6: either a)\n7:   Initialize labels $Z_i = \\arg\\max_{j \\in \\{j_1, \\ldots, j_k\\}} K_{ij}$\n8:   Output: result of kernel k-means with kernel $K$, $k = \\lceil \\hat{\\vartheta} \\rceil$ and $Z$ as initial labels\n9: or b)\n10:   Compute $U : K = U^\\top U$, with columns $U_i$, and let $C \\leftarrow \\{U_{j_i} : i \\leq k\\}$\n11:   Output: result of k-means with $k = \\lceil \\hat{\\vartheta} \\rceil$ and $C$ as initial cluster centroids\n\nA common problem related to diversity in graphs is correlation clustering [3]. In correlation clustering, the task is to cluster a set of items $V = \\{1, \\ldots, n\\}$, based on their similarity, or correlation, $S : V \\times V \\to \\mathbb{R}^{n \\times n}$, without specifying the number of clusters beforehand. This is naturally posed as a problem of clustering the nodes of an edge-weighted graph. 
In a variant called overlapping correlation clustering [4], items may belong to several, overlapping, clusters. The usual formulation of correlation clustering is an integer linear program [3]. Making use of geometric embeddings, we may convert the graph clustering problem to the more standard problem of clustering a set of points $\\{u_i\\}_{i=1}^{n} \\in \\mathbb{R}^{d \\times n}$, allowing the use of an arsenal of established techniques, such as k-means clustering. However, we remind ourselves of two common problems with existing clustering algorithms.\n\nProblem 1: Number of clusters. Many clustering algorithms rely on the user making a good choice of $k$, the number of clusters. As this choice can have dramatic effect on both the accuracy and speed of the algorithm, heuristics for choosing $k$, such as Pham et al. [22], have been proposed.\n\nProblem 2: Initialization. Popular clustering algorithms such as Lloyd\u2019s k-means, or expectation-maximization for Gaussian mixture models, require an initial guess of the parameters. As a result, these algorithms are often run repeatedly with different random initializations.\n\nWe propose solutions to both problems based on $\\vartheta_1(G)$. To solve Problem 1, we choose $k = \\lceil \\vartheta_1(G) \\rceil$. This is motivated by $\\vartheta_1(G)$ being a measure of diversity. For Problem 2, we propose initializing parameters based on the observation that the non-zero $\\alpha_i$ are support vectors. Specifically, we let the initial clusters be represented by the set of $k$ nodes, $I \\subset V$, with the largest $\\alpha_i$. In k-means clustering, this corresponds to letting the initial centroids be $\\{u_i\\}_{i \\in I}$. We summarize these ideas in algorithm 1, comprising both \u03d1-means and kernel \u03d1-means clustering.\n\nIn section 3.1, we showed that computing the approximate weighted theta function and embedding can be done in $O(dn^2)$ time for a rank $d = O(\\sqrt{n})$ approximation to the SVD. As is well-known, Lloyd\u2019s algorithm has a very high worst-case complexity and will dominate the overall complexity.\n\n5 Experiments\n\n5.1 Weighted Maximum Cut\n\nThe maximum cut problem (Max-Cut), a fundamental problem in graph algorithms, with applications in machine learning [25], has famously been solved using geometric embeddings defined by semidefinite programs [9]. Here, given a graph $G$, we compute an embedding $U \\in \\mathbb{R}^{d \\times n}$, the SVM-theta labelling in [15], using the LS-labelling, $K_{LS}$. To reduce complexity, while preserving accuracy [9], we use a rank $d = \\sqrt{2n}$ truncated SVD, see section 3.1. We apply the Goemans-Williamson random hyperplane rounding [9] to partition the embedding into two sets of points, representing the cut. The rounding was repeated 5000 times, and the maximum cut is reported.\n\nHelmberg & Rendl [11] constructed a set of 54 graphs, 24 of which are weighted, that has since often been used as benchmarks for Max-Cut. We use the six weighted graphs for which there are multiple published results [6, 21]. Our approach is closest to that of the SDP-relaxation, which has time complexity $O(mn \\log^2 n / \\epsilon^3)$ [2]. In comparison, our method takes $O(n^{2.5})$ time, see section 3.1. The results are presented in table 2. For all graphs, the SVM approximation is comparable to or better than the SDP solution, and considerably faster than the best known method [21].\u00b9\n\nTable 2: Weighted maximum cut. $c$ is the weight of the produced cut.\n\nGraph | SDP [6] c | SDP [6] Time | SVM-\u03d1 c | SVM-\u03d1 Time | Best known [21] c | Best known [21] Time\nG11 | 528 | 165s | 522 | 3.13s | 564 | 171.8s\nG12 | 522 | 145s | 518 | 2.94s | 556 | 241.5s\nG13 | 542 | 145s | 540 | 2.97s | 580 | 227.5s\nG32 | 1280 | 1318s | 1286 | 35.5s | 1398 | 900.6s\nG33 | 1248 | 1417s | 1260 | 36.4s | 1376 | 925.6s\nG34 | 1264 | 1295s | 1268 | 37.9s | 1372 | 925.6s\n\n5.2 Correlation clustering\n\nWe evaluate several different versions of algorithm 1 in the task of correlation clustering, see section 4.1. We consider a) the full version (\u03d1-MEANS), b) one with $k = \\lceil \\hat{\\vartheta} \\rceil$ but random initialization of centroids (\u03d1-MEANS+RAND), c) one with $\\alpha$-based initialization but choosing $k$ according to Pham et al. [22] (k-MEANS+INIT) and d) $k$ according to [22] and random initialization (k-MEANS+RAND). For the randomly initialized versions, we use 5 restarts of k-means++. In all versions, we cluster the points of the embedding defined by the fixed kernel (LS-labelling) $K = K_{LS}(G, S)$.\n\nTable 3: Clustering of the (mini) newsgroup dataset. Average (and std. deviation) over 5 splits. $\\hat{k}$ is the average number of clusters predicted. The true number is $k = 16$.\n\nMethod | F1 | $\\hat{k}$ | Time\nVOTE/BOEM | 31.29 \u00b1 4.0 | 124 | 8.7m\nPIVOT/BOEM | 30.07 \u00b1 3.4 | 120 | 14m\nBEST/BOEM | 29.67 \u00b1 3.4 | 112 | 13m\nFIRST/BOEM | 26.76 \u00b1 3.8 | 109 | 14m\nk-MEANS+RAND | 17.31 \u00b1 1.3 | 2 | 15m\nk-MEANS+INIT | 20.06 \u00b1 6.8 | 3 | 5.2m\n\u03d1-MEANS+RAND | 35.60 \u00b1 4.3 | 25 | 45s\n\u03d1-MEANS | 36.20 \u00b1 4.9 | 25 | 11s\n\nElsner & Schudy [7] constructed five affinity matrices for a subset of the classical 20-newsgroups dataset. Each matrix, corresponding to a different split of the data, represents the similarity between messages in 16 different newsgroups. The task is to cluster the messages by their respective newsgroup. We run algorithm 1 on every split, and compute the F1-score [7], reporting the average and standard deviation over all splits, as well as the predicted number of clusters, $\\hat{k}$. We compare our results to several greedy methods described by Elsner & Schudy [7], see table 3. 
We only compare to their logarithmic weighting schema, as the difference to using additive weights was negligible [7]. The results are presented in table 3. We observe that the full \u03d1-means method achieves the highest F1-score, followed by the version with random initialization (instead of using embeddings of nodes with highest $\\alpha_i$, see algorithm 1). We note also that choosing $k$ by the method of Pham et al. [22] consistently results in too few clusters, and with the greedy search methods, far too many.\n\n5.3 Overlapping Correlation Clustering\n\nBonchi et al. [4] constructed a benchmark for overlapping correlation clustering based on two datasets for multi-label classification, Yeast and Emotion. The datasets consist of 2417 and 593 items belonging to one or more of 14 and 6 overlapping clusters respectively. Each set can be represented as an $n \\times k$ binary matrix $L$, where $k$ is the number of clusters and $n$ is the number of items, such that $L_{ic} = 1$ iff item $i$ belongs to cluster $c$. From $L$, a weight matrix $S$ is defined such that $S_{ij}$ is the Jaccard coefficient between rows $i$ and $j$ of $L$. $S$ is often sparse, as many of the pairs do not share a single cluster. The correlation clustering task is to reconstruct $L$ from $S$.\n\nTable 4: Clustering of the Yeast and Emotion datasets. \u2020The total time for finding the best solution. The times for OCC-ISECT for a single $k$ were 2.21s and 80.4s respectively.\n\nMethod | Emotion Prec. | Emotion Rec. | Emotion F1 | Emotion Time | Yeast Prec. | Yeast Rec. | Yeast F1 | Yeast Time\nOCC-ISECT [4] | 0.98 | 1 | 0.99 | 12.1\u2020 | 0.99 | 1.00 | 1.00 | 716s\u2020\n\u03d1-means (no k-means) | 1 | 1 | 1 | 0.34s | 0.94 | 1 | 0.97 | 6.67s\n\n\u00b9 Note that the timing results for the SDP method are from the original paper, published in 2001.\n\nHere, we use only the centroids $C = \\{u_{j_1}, \\ldots, u_{j_k}\\}$ produced by algorithm 1, without running k-means. We let each centroid $c = 1, \\ldots, k$ represent a cluster, and assign a node $i \\in V$ to that cluster, i.e. 
$\\hat{L}_{ic} = 1$, iff $u_i^\\top u_{j_c} > 0$. We compute the precision and recall following Bonchi et al. [4]. For comparison with Bonchi et al. [4], we run their algorithm called OCC-ISECT with the parameter $\\bar{k}$, bounding the number of clusters, in the interval $1, \\ldots, 16$ and select the one resulting in lowest cost.\n\nThe results are presented in table 4. For Emotion and Yeast, \u03d1-means estimated the number of clusters, $k$, to be 6 (the correct number) and 8 respectively. For OCC-ISECT, the values of $k$ with the lowest cost were 10 and 13. We note that while very similar in performance, the \u03d1-means algorithm is considerably faster than OCC-ISECT, especially when $k$ is unknown.\n\n5.4 Document summarization\n\nFinally, we briefly examine the idea of using $\\alpha_i$ to select a both relevant and diverse set of items, in a very natural application of the weighted theta function \u2013 extractive summarization [18]. In extractive summarization, the goal is to automatically summarize a text by picking out a small set of sentences that best represents the whole text. We may view the sentences of a text as the nodes of a graph, with edge weights $S_{ij}$, the similarity between sentences, and node weights $\\sigma_i$ representing the relevance of the sentence to the text as a whole. The trade-off between brevity and relevance described above can then be viewed as finding a set of nodes that has both high total weight and high diversity. This is naturally accomplished using our framework by computing $[\\alpha^*_1, \\ldots, \\alpha^*_n]^\\top = \\arg\\max_{\\alpha_i > 0} f(\\alpha; K)$ for fixed $K = K_{LS}(G, \\sigma, S)$ and picking the sentences with the highest $\\alpha^*_i$.\n\nWe apply this method to the multi-document summarization task of DUC-04\u00b2. We let $S_{ij}$ be the TF-IDF sentence similarity described by Lin & Bilmes [18], and let $\\sigma_i = (\\sum_j S_{ij})^2$. State-of-the-art systems, purpose-built for summarization, achieve around 0.39 in recall and F1 score [18]. Our method achieves a score of 0.33 on both measures, which is about the same as the basic version of [18]. This can likely be improved by tuning the trade-off between relevance and diversity, such as by making a more sophisticated choice of $S$ and $\\sigma$. However, we leave this to future work.\n\n\u00b2 http://duc.nist.gov/duc2004/\n\n6 Conclusions\n\nWe have introduced a unifying generalization of Lov\u00e1sz\u2019s theta function and the corresponding geometric embedding to graphs with node and edge weights, characterized as a minimization over a constrained set of kernel matrices. This allows an extension of a fast approximation of the Lov\u00e1sz number to weighted graphs, defined by an SVM problem for a fixed kernel matrix. We have shown that the theta function has a natural interpretation as a measure of diversity in graphs, a useful function in several machine learning problems. Exploiting these results, we have defined algorithms for weighted maximum cut, correlation clustering and document summarization.\n\nAcknowledgments\n\nThis work is supported in part by the Swedish Foundation for Strategic Research (SSF).\n\nReferences\n[1] S. Arora, E. Hazan, and S. Kale. Fast algorithms for approximate semidefinite programming using the multiplicative weights update method. In Foundations of Computer Science, 2005. FOCS 2005. 46th Annual IEEE Symposium on, pages 339\u2013348. IEEE, 2005.\n[2] S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121\u2013164, 2012.\n[3] N. Bansal, A. Blum, and S. Chawla. Correlation clustering. Machine Learning, 56(1-3):89\u2013113, 2004.\n[4] F. Bonchi, A. Gionis, and A. Ukkonen. Overlapping correlation clustering. Knowledge and information systems, 35(1):1\u201332, 2013.\n[5] M. Brand. Fast low-rank modifications of the thin singular value decomposition. 
Linear algebra and its applications, 415(1):20\u201330, 2006.\n[6] S. Burer and R. D. Monteiro. A projected gradient algorithm for solving the maxcut SDP relaxation. Optimization methods and Software, 15(3-4):175\u2013200, 2001.\n[7] M. Elsner and W. Schudy. Bounding and comparing methods for correlation clustering beyond ILP. In Proceedings of the Workshop on Integer Linear Programming for Natural Langauge Processing, pages 19\u201327. Association for Computational Linguistics, 2009.\n[8] M. X. Goemans. Semidefinite programming in combinatorial optimization. Math. Program., 79:143\u2013161, 1997.\n[9] M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM), 42(6):1115\u20131145, 1995.\n[10] M. Gr\u00f6tschel, L. Lov\u00e1sz, and A. Schrijver. Geometric Algorithms and Combinatorial Optimization, volume 2 of Algorithms and Combinatorics. Springer, 1988.\n[11] C. Helmberg and F. Rendl. A spectral bundle method for semidefinite programming. SIAM Journal on Optimization, 10(3):673\u2013696, 2000.\n[12] D. Hush, P. Kelly, C. Scovel, and I. Steinwart. QP algorithms with guaranteed accuracy and run time for support vector machines. Journal of Machine Learning Research, 7:733\u2013769, 2006.\n[13] G. Iyengar, D. J. Phillips, and C. Stein. Approximating semidefinite packing programs. SIAM Journal on Optimization, 21(1):231\u2013268, 2011.\n[14] V. Jethava, A. Martinsson, C. Bhattacharyya, and D. Dubhashi. Lov\u00e1sz \u03d1 function, SVMs and finding dense subgraphs. The Journal of Machine Learning Research, 14(1):3495\u20133536, 2013.\n[15] V. Jethava, J. Sznajdman, C. Bhattacharyya, and D. Dubhashi. Lov\u00e1sz \u03d1, SVMs and applications. In Information Theory Workshop (ITW), 2013 IEEE, pages 1\u20135. IEEE, 2013.\n[16] F. D. Johansson, A. Chattoraj, C. Bhattacharyya, and D. 
Dubhashi. Supplementary material, 2015.\n[17] D. E. Knuth. The sandwich theorem. Electr. J. Comb., 1, 1994.\n[18] H. Lin and J. Bilmes. A class of submodular functions for document summarization. In Proc. of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 510\u2013520. Association for Computational Linguistics, 2011.\n[19] L. Lov\u00e1sz. On the Shannon capacity of a graph. IEEE Transactions on Information Theory, 25(1):1\u20137, 1979.\n[20] L. Lov\u00e1sz and K. Vesztergombi. Geometric representations of graphs. Paul Erd\u0151s and his Mathematics, 1999.\n[21] R. Mart\u00ed, A. Duarte, and M. Laguna. Advanced scatter search for the max-cut problem. INFORMS Journal on Computing, 21(1):26\u201338, 2009.\n[22] D. T. Pham, S. S. Dimov, and C. Nguyen. Selection of k in k-means clustering. Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, 219(1):103\u2013119, 2005.\n[23] B. Sch\u00f6lkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural computation, 13(7):1443\u20131471, 2001.\n[24] A. Schrijver. A comparison of the Delsarte and Lov\u00e1sz bounds. Information Theory, IEEE Transactions on, 25(4):425\u2013429, 1979.\n[25] J. Wang, T. Jebara, and S.-F. Chang. Semi-supervised learning using greedy max-cut. The Journal of Machine Learning Research, 14(1):771\u2013800, 2013.", "award": [], "sourceid": 638, "authors": [{"given_name": "Fredrik", "family_name": "Johansson", "institution": "Chalmers University, Sweden"}, {"given_name": "Ankani", "family_name": "Chattoraj", "institution": "Chalmers University"}, {"given_name": "Chiranjib", "family_name": "Bhattacharyya", "institution": "Indian Institute of Science"}, {"given_name": "Devdatt", "family_name": "Dubhashi", "institution": "Chalmers University, Sweden"}]}