{"title": "Solving graph compression via optimal transport", "book": "Advances in Neural Information Processing Systems", "page_first": 8014, "page_last": 8025, "abstract": "We propose a new approach to graph compression by appeal to optimal transport. The transport problem is seeded with prior information about node importance, attributes, and edges in the graph. The transport formulation can be setup for either directed or undirected graphs, and its dual characterization is cast in terms of distributions over the nodes. The compression pertains to the support of node distributions and makes the problem challenging to solve directly. To this end, we introduce Boolean relaxations and specify conditions under which these relaxations are exact. The relaxations admit algorithms with provably fast convergence. Moreover, we provide an exact O(d log d) algorithm for the subproblem of projecting a d-dimensional vector to transformed simplex constraints. Our method outperforms state-of-the-art compression methods on graph classification.", "full_text": "Solving graph compression via optimal transport\n\nVikas K. Garg\nCSAIL, MIT\n\nTommi Jaakkola\n\nCSAIL, MIT\n\nvgarg@csail.mit.edu\n\ntommi@csail.mit.edu\n\nAbstract\n\nWe propose a new approach to graph compression by appeal to optimal transport.\nThe transport problem is seeded with prior information about node importance,\nattributes, and edges in the graph. The transport formulation can be setup for\neither directed or undirected graphs, and its dual characterization is cast in terms\nof distributions over the nodes. The compression pertains to the support of node\ndistributions and makes the problem challenging to solve directly. To this end, we\nintroduce Boolean relaxations and specify conditions under which these relaxations\nare exact. The relaxations admit algorithms with provably fast convergence. 
Moreover, we provide an exact O(d log d) algorithm for the subproblem of projecting a d-dimensional vector to transformed simplex constraints. Our method outperforms state-of-the-art compression methods on graph classification.

1 Introduction

Graphs are widely used to capture complex relational objects, from social interactions to molecular structures. Large, richly connected graphs can, however, be computationally unwieldy if used as-is, and spurious features present in the graphs that are unrelated to the task can be statistically distracting. Significant effort has therefore been devoted to developing methods for compressing or summarizing graphs into graph sketches [1]. Beyond computational gains, these sketches take center stage in numerous graph tasks such as partitioning [2, 3], unraveling complex or multi-resolution structures [4, 5, 6, 7], obtaining coarse-grained diffusion maps [8], and performing neural convolutions [9, 10, 11].

State-of-the-art compression methods broadly fall into two categories: (a) sparsification (removing edges) and (b) coarsening (merging vertices). These methods measure spectral similarity between the original graph and a compressed representation in terms of an (inverse) Laplacian quadratic form [12, 13, 14, 15, 16]. Thus, although some of these methods approximately preserve the graph spectrum (see, e.g., [1]), they are oblivious to, and thus less effective for, downstream tasks such as classification that rely on labels or attributes of the nodes. Also, the key compression steps in most of these methods are typically either heuristic or detached from their original objective [17].

We address these issues by taking a novel perspective that appeals to the theory of optimal transport [18], and develops its connections to minimum cost flow on graphs [19, 20].
Specifically, we interpret graph compression as minimizing the transport cost from a fixed initial distribution supported on all vertices to an unknown target distribution whose support size is limited by the amount of compression desired. Thus, the compressed graph in our case is a subgraph of the original graph, restricted to a subset of the vertices selected via the associated transport problem. The transport cost depends on the specified prior information, such as the importance of the nodes and their labels or attributes, and thus can be informed by the downstream task. Moreover, the transport cost takes the underlying geometry into account, unlike agnostic measures such as the KL-divergence [21].

There are several technical challenges that we must address. First, the standard notion of optimal transport on graphs is tailored to directed graphs, where the transport cost decomposes as a directed flow along the edge orientations [22].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

To circumvent this limitation, we extend optimal transport on graphs to handle both directed and undirected edges, and derive a dual that directly measures the discrepancy between distributions on the vertices. As a result, we can also compress mixed graphs that contain both directed and undirected edges [23, 24, 25, 26].

The second challenge comes from the compression itself, enforced in our approach as sparse support of the target distribution. Optimal transport (OT) is known to be computationally intensive, and almost all recent applications of OT in machine learning, e.g., [27, 28, 29, 30, 31, 32], rely on entropy regularization [33, 34, 35, 36, 37] for tractability. However, entropy is not conducive to sparse solutions since it discourages the variables from ever becoming zero [22]. In principle, one could consider convex alternatives such as enforcing an ℓ1 penalty [38].
However, such methods require iterative tuning to find a solution that matches the desired support, and require strong assumptions such as restricted eigenvalue, isometry, or nullspace conditions for recovery. Some of these issues were previously averted by introducing binary selection variables [39, 40] and using Boolean relaxations in the context of unconstrained real spaces (regression). However, these relaxations do not apply to our setting since the target distribution must reside in the simplex. We introduce constrained Boolean relaxations that are not only efficient, but also provide exactness certificates.

Our graph compression formulation also introduces new algorithmic challenges. For example, solving our sparsity-controlled dual transport problem involves a new subproblem of projecting onto the probability simplex ∆ under a diagonal transformation. Specifically, let D(ε) be a diagonal matrix with diagonal ε ∈ [0, 1]^d \ {0}. Then, for a given ε, the problem is to find the projection x ∈ R^d of a given vector y ∈ R^d such that D(ε)x ∈ ∆. This generalizes the well-studied problem of Euclidean projection onto the probability simplex [41, 42, 43], recovered if each εi is set to 1. We provide an exact O(d log d) algorithm for solving this generalized projection. Our approach leads to convex-concave saddle point problems with fast convergence via methods such as Mirror Prox [44].

To summarize, we make the following contributions. We propose an approach for graph compression based on optimal transport (OT).
Specifically, we (a) extend OT to undirected and mixed graphs (section 2), (b) introduce constrained Boolean relaxations for our dual OT problem, and provide exactness guarantees (section 3), (c) generalize Euclidean projection onto the simplex, and provide an efficient algorithm (section 3), and (d) demonstrate that our algorithm outperforms state-of-the-art compression methods, both in accuracy and compression time, on classifying graphs from standard real datasets. We also provide qualitative results showing that our approach yields meaningful compression in synthetic and real graphs (section 4).

2 Optimal transport for general edges

Let G⃗ = (V, E⃗) be a directed graph on nodes (or vertices) V and edges E⃗. We define the signed incidence matrix F⃗ by F⃗(e⃗, v) = 1 if e⃗ = (w, v) ∈ E⃗ for some w ∈ V, −1 if e⃗ = (v, w) ∈ E⃗ for some w ∈ V, and 0 otherwise. Let c(e⃗) ∈ R+ be the positive cost to transport unit mass along edge e⃗ ∈ E⃗, and ∆(V) the probability simplex on V. The shorthand a ≼ b denotes that a(i) ≤ b(i) for each component i. Let 0 be a vector of all zeros and 1 a vector of all ones. Let ρ0, ρ1 ∈ ∆(V) be distributions over the vertices in V. The optimal transport distance W⃗(ρ0, ρ1) from ρ0 to ρ1 is [22]:

    W⃗(ρ0, ρ1) = min_{J ∈ R^|E⃗|, 0 ≼ J}  Σ_{e⃗ ∈ E⃗} c(e⃗) J(e⃗)   s.t.   F⃗⊤J = ρ1 − ρ0,

where J(e⃗) is the non-negative mass transfer from tail to head on edge e⃗. Intuitively, W⃗(ρ0, ρ1) is the minimum cost of a directed flow from ρ0 to ρ1.
In order to extend this intuition to undirected graphs, we need to refine the notion of incidence and let mass flow in either direction. Specifically, let G = (V, E) be a connected undirected graph. We define the incidence matrix pertaining to G as F(e, v) = 1 if edge e is incident on v, and 0 otherwise. With each undirected edge e ∈ E, having cost c(e) ∈ R+, we associate two directed edges e+ and e−, each with cost c(e), and flow variables J+(e), J−(e) ≥ 0. Then, the total undirected flow pertaining to e is J+(e) + J−(e). Since we incur cost for flow in either direction, we define the optimal transport cost W(ρ0, ρ1) from ρ0 to ρ1 as

    min_{J+, J− ∈ R^|E|, 0 ≼ J+, J−}  Σ_{e ∈ E} c(e)(J+(e) + J−(e))   s.t.   F⊤(J− − J+) = ρ1 − ρ0.   (1)

We call a directed edge e+ active if J+(e) > 0, i.e., there is some positive flow on the edge (likewise for e−). Moreover, by extension, we call an undirected edge e active if at least one of e+ and e− is active. We claim that at most one of e+ and e− may be active for any edge e.

Theorem 1. The optimal solution to (1) must have J+(e) = 0 or J−(e) = 0 (or both) ∀ e ∈ E.

The proof is provided in the supplementary material. Thus, as in the directed graph setting, we either have flow in only one direction for each edge e, or no flow at all. Moreover, Theorem 1 facilitates generalizing the optimal transport distance to mixed graphs G̃(V, E, E⃗), i.e., where both directed and undirected edges may be present. In particular, we adapt the formulation in (1) with minor changes: (a) we associate bidirectional variables with each edge, directed or undirected. For the undirected edges e ∈ E, we replicate the constraints from (1).
For the directed edges e⃗, we follow the convention that J+(e⃗) denotes the outward flow along e⃗ (from tail to head) whereas J−(e⃗) denotes the incoming flow (from head to tail), and we impose the additional constraint J−(e⃗) = 0. In what follows we focus on undirected graphs G = (V, E), since the extensions to directed and mixed graphs are immediate due to Theorem 1.

3 Graph compression

We view graph compression as the problem of minimizing the optimal transport distance from an initial distribution ρ0 having full support on the vertices V to a target distribution ρ1 that is supported only on a subset SV(ρ1) of V. The compressed subgraph is obtained by restricting the original graph to the vertices in SV(ρ1) and the incident edges. The initial distribution ρ0 encodes any prior information. For instance, it might be taken as the stationary distribution of a random walk on the graph. Likewise, the cost function c encodes the preference for different edges. In particular, a high value of c(e) inhibits edge e from being active. This flexibility allows our framework to inform compression based on the specifics of different downstream applications by defining ρ0 and c appropriately.

3.1 Dual characterization of the transport distance

Note that (1) defines an optimization problem over edges. However, our perspective requires quantifying W(ρ0, ρ1) as an optimization over the vertices. Fortunately, strong duality comes to our rescue. Let c = (c(e), e ∈ E) be the column vector obtained by stacking the costs.
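Formulation (1) is just a linear program, so on small instances it can be solved directly as a sanity check. Below is a minimal sketch using SciPy's LP solver; the function name and the toy path graph are ours, and we adopt the natural signed reading in which J+(e) carries mass along a fixed orientation of edge e and J−(e) against it.

```python
# Sketch: solve the undirected transport problem (1) on a small graph.
# Decision variables are stacked as [J+; J-], one pair per undirected edge.
import numpy as np
from scipy.optimize import linprog

def undirected_ot(edges, costs, rho0, rho1, n):
    """min sum_e c(e)(J+(e) + J-(e))  s.t. net inflow at each node = rho1 - rho0."""
    m = len(edges)
    A = np.zeros((n, 2 * m))
    for k, (u, v) in enumerate(edges):
        A[u, k], A[v, k] = -1.0, 1.0          # J+(e): mass moves u -> v
        A[u, m + k], A[v, m + k] = 1.0, -1.0  # J-(e): mass moves v -> u
    c = np.concatenate([costs, costs])        # either direction costs c(e)
    res = linprog(c, A_eq=A, b_eq=rho1 - rho0, bounds=(0, None), method="highs")
    return res.fun, res.x[:m], res.x[m:]

# Toy path 0 - 1 - 2: move all mass from node 0 to node 2.
W, Jp, Jm = undirected_ot([(0, 1), (1, 2)], np.array([1.0, 1.0]),
                          np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]), 3)
```

On this instance the optimal cost is 2 (one unit of mass crosses each edge), and, consistent with Theorem 1, the vertex solution returned by the solver carries flow in only one direction per edge, i.e., min(J+(e), J−(e)) = 0 for every e.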
The dual of (1) is

    max_{0 ≼ y, 0 ≼ z, −c ≼ F(y−z) ≼ c}  (y − z)⊤(ρ1 − ρ0),   or equivalently,   max_{t ∈ R^|V|, −c ≼ Ft ≼ c}  t⊤(ρ1 − ρ0).   (2)

This alternative formulation of W(ρ0, ρ1) in (2) lets us define compression solely in terms of variables over vertices. Specifically, for a budget of at most k vertices, we solve

    min_{ρ1 ∈ ∆(V), ‖ρ1‖0 ≤ k}  max_{t ∈ R^|V|, −c ≼ Ft ≼ c}  t⊤(ρ1 − ρ0) + (λ/2)‖ρ1‖²,   (3)

where the inner objective is denoted Lλ(ρ1, t; ρ0), λ > 0 is a regularization hyperparameter, and ‖ρ1‖0 is the number of vertices with positive mass under ρ1, i.e., the cardinality of the support set SV(ρ1). The quadratic penalty is strongly convex in ρ1 and, as we shall see shortly, helps us leverage fast algorithms for the resulting saddle point problem. Note that a high value of λ encourages ρ1 toward a uniform distribution. We favor this penalty over entropy, which is not conducive to sparse solutions since entropy would forbid ρ1 from having zero mass at any vertex of the graph.

Our next result reveals the structure of the optimal solution ρ1* in (3). Specifically, ρ1* must be expressible as an affine function of ρ0 and F. Moreover, the constraints on active edges are tight. This reaffirms our intuition that ρ1* is obtained from ρ0 by transporting mass along a subset of the edges, i.e., the active edges; the remaining edges do not participate in the flow.

Theorem 2. The optimal ρ1* in (3) is of the form ρ1* = ρ0 + F⊤η, where η ∈ R^|E|. Furthermore, for any active edge e ∈ E, we must have (Ft*)(e) ∈ {c(e), −c(e)}.
3.2 Constrained Boolean relaxations

The formulation (3) is non-convex due to the support constraint on ρ1. Since recovery under ℓ1-based methods such as Lasso often requires extensive tuning, we resort to the method of Boolean relaxations, which affords explicit control much like the ℓ0 penalty. However, prior literature on Boolean relaxations is limited to variables that have no additional constraints beyond sparsity. Thus, in order to deal with the simplex constraints ρ1 ∈ ∆(V), we introduce constrained Boolean relaxations. Specifically, we define the characteristic function gV(x) = 0 if x ∈ ∆(V) and ∞ otherwise, and move the non-sparsity constraints inside the objective. This lets us delegate the sparsity constraints to binary variables, which can be relaxed to [0, 1]. Using the definition of Lλ, we can write (3) as

    min_{ρ1 ∈ R^|V|, ‖ρ1‖0 ≤ k}  max_{t ∈ R^|V|, −c ≼ Ft ≼ c}  Lλ(ρ1, t; ρ0) + gV(ρ1).

Denoting by ⊙ the Hadamard (elementwise) product, and introducing variables ε ∈ {0, 1}^|V|, we get

    min_{ε ∈ {0,1}^|V|, ‖ε‖0 ≤ k}  min_{ρ1 ∈ R^|V|}  max_{t ∈ R^|V|, −c ≼ Ft ≼ c}  Lλ(ρ1 ⊙ ε, t; ρ0) + gV(ρ1 ⊙ ε).

Adjusting the characteristic term as a constraint, we have the following equivalent problem:

    min_{ε ∈ {0,1}^|V|, ‖ε‖0 ≤ k}  min_{ρ1 ∈ R^|V|, ρ1 ⊙ ε ∈ ∆(V)}  max_{t ∈ R^|V|, −c ≼ Ft ≼ c}  Lλ(ρ1 ⊙ ε, t; ρ0).   (4)

Algorithm 1: Euclidean projection onto the d-simplex ∆ under a diagonal transformation.
  Input: y, ε
  Define I> ≜ {j ∈ [d] | εj > 0} and I= ≜ {j ∈ [d] | εj = 0}; y> ≜ {yj | j ∈ I>}; ε> ≜ {εj | j ∈ I>}.
  Sort y> into ŷ> and ε> into ε̂>, in non-increasing order of yj/εj, j ∈ I>. Rename indices in (ŷ>, ε̂>) to start from 1, and let π map j ∈ I> to π(j) ∈ [|ŷ>|], so that ŷ1/ε̂1 ≥ ŷ2/ε̂2 ≥ ... ≥ ŷ|I>|/ε̂|I>|.
  bj = ŷj + ε̂j (1 − Σ_{i=1}^{j} ε̂i ŷi) / (Σ_{i=1}^{j} ε̂i²),  ∀ j ∈ [|y>|]
  ℓ = max{j ∈ [|y>|] | bj > 0}
  α = (1 − Σ_{i=1}^{ℓ} ε̂i ŷi) / (Σ_{i=1}^{ℓ} ε̂i²)
  xj = max{ŷπ(j) + α ε̂π(j), 0}, ∀ j ∈ I>;   xj = yj, ∀ j ∈ I=

Algorithm 2: Mirror Prox algorithm to (approximately) find ε in the relaxation of (9). The step-sizes at time ℓ with respect to ε, t, and ζ are αℓ, βℓ, and γℓ respectively.
  Input: ρ0, k, λ; iterations T
  Define Ẽk = {ε ∈ [0, 1]^|V| | ε⊤1 ≤ k}, TF,c as in (5), and ψρ0(ε, t, ζ) as in (9).
  Initialize ε(0) = k1/|V|, t(0) = 0, and ζ(0) = 0.
  for ℓ = 0, 1, ..., T do
    Gradient step:
      ε̂(ℓ) = Proj_Ẽk(ε(ℓ) − αℓ ∇ε ψρ0(ε(ℓ), t(ℓ), ζ(ℓ)))
      t̂(ℓ) = Proj_TF,c(t(ℓ) + βℓ ∇t ψρ0(ε(ℓ), t(ℓ), ζ(ℓ)))
      ζ̂(ℓ) = ζ(ℓ) + γℓ ∇ζ ψρ0(ε(ℓ), t(ℓ), ζ(ℓ))
    Extra-gradient step:
      ε(ℓ+1) = Proj_Ẽk(ε(ℓ) − αℓ ∇ε ψρ0(ε̂(ℓ), t̂(ℓ), ζ̂(ℓ)))
      t(ℓ+1) = Proj_TF,c(t(ℓ) + βℓ ∇t ψρ0(ε̂(ℓ), t̂(ℓ), ζ̂(ℓ)))
      ζ(ℓ+1) = ζ(ℓ) + γℓ ∇ζ ψρ0(ε̂(ℓ), t̂(ℓ), ζ̂(ℓ))
  end for
  Output: ε̂ = Σ_{ℓ=1}^{T} αℓ ε̂(ℓ) / Σ_{ℓ=1}^{T} αℓ

Our formulation in (4) requires solving a new subproblem, namely, Euclidean projection onto the d-simplex ∆ under a diagonal transformation. Specifically, let D(ε) be a diagonal matrix with diagonal ε ∈ [0, 1]^d \ {0}. Then, for a given ε, the problem is to find the projection x ∈ R^d of a given vector y ∈ R^d such that D(ε)x = x ⊙ ε ∈ ∆. This problem generalizes Euclidean projection onto the probability simplex [41, 42, 43], which is recovered when we set ε to 1, i.e., the all-ones vector. Our next result shows that Algorithm 1 solves this problem exactly in O(d log d) time.

Theorem 3. Let ε ∈ [0, 1]^d \ {0} be a given vector of weights, and y ∈ R^d a given vector of values.
Algorithm 1 solves the following problem in O(d log d) time:

    min_{x ∈ R^d : x ⊙ ε ∈ ∆}  (1/2) ‖x − y‖².

Theorem 3 allows us to relax the problem (4), and to solve the relaxed problem efficiently, since the projection steps onto the other constraint sets can be handled by known methods [45, 46]. Specifically, since ε consists of only zeros and ones, ‖ε‖0 = ‖ε‖1 = ε⊤1 and ε ⊙ ε = ε. So we can write (4) as

    min_{ε ∈ Ek}  min_{ρ1 ∈ R^|V|, ρ1 ⊙ ε ∈ ∆(V)}  max_{t ∈ TF,c}  Lλ(ρ1 ⊙ ε ⊙ ε, t; ρ0),

where we denote the constraint sets for ε and t respectively by

    Ek ≜ {ε ∈ {0, 1}^|V| | ε⊤1 ≤ k}   and   TF,c ≜ {t ∈ R^|V| | −c ≼ Ft ≼ c}.   (5)

We can thus eliminate ε from the regularization term via the change of variable ρ1 ⊙ ε → ρ̃1:

    min_{ε ∈ Ek}  min_{ρ̃1 ∈ R^|V|, ρ̃1 ⊙ ε ∈ ∆(V)}  max_{t ∈ TF,c}  L0(ρ̃1 ⊙ ε, t; ρ0) + (λ/2)‖ρ̃1‖².   (6)

We note that (6) is a mixed-integer program due to the constraints Ek, and thus hard to solve. Nonetheless, we can relax the hard binary constraints on the coordinates of ε to [0, 1] intervals to obtain a saddle point formulation with a strongly convex term, and solve the relaxation efficiently, e.g., via customized versions of methods such as Mirror Prox [44], Accelerated Gradient Descent [47], or Primal-Dual Hybrid Gradient [48]. An attractive property of our relaxation is that if the solution ε̂ of the relaxed problem is integral, then ε̂ must be optimal for the non-relaxed hard problem (6), and thus for the original formulation (3).
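The projection of Theorem 3 translates directly into code. The following sketch is our reading of Algorithm 1 (reconstructed from the statement, so treat the details as illustrative rather than the authors' reference implementation); coordinates with εj = 0 are unconstrained by x ⊙ ε ∈ ∆ and therefore keep xj = yj.

```python
# Sketch of Algorithm 1: project y onto {x : x ⊙ ε ∈ Δ} in O(d log d) time.
import numpy as np

def transformed_simplex_projection(y, eps):
    y = np.asarray(y, dtype=float)
    eps = np.asarray(eps, dtype=float)
    x = y.copy()                              # coordinates with eps_j = 0 stay at y_j
    pos = eps > 0
    yp, ep = y[pos], eps[pos]
    order = np.argsort(-(yp / ep))            # sort by y_j / eps_j, non-increasing
    ys, es = yp[order], ep[order]
    csum_ey = np.cumsum(es * ys)              # running sums defining b_j and alpha
    csum_e2 = np.cumsum(es * es)
    b = ys + es * (1.0 - csum_ey) / csum_e2
    ell = np.nonzero(b > 0)[0].max()          # ℓ = last index with b_j > 0
    alpha = (1.0 - csum_ey[ell]) / csum_e2[ell]
    x[pos] = np.maximum(yp + alpha * ep, 0.0)  # alpha is a scalar, so no unsorting needed
    return x
```

With ε = 1 this reduces to the classical sorting-based Euclidean projection onto the probability simplex, and in general the output satisfies ε⊤x = 1 with x nonnegative on the constrained coordinates.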
We now pin down necessary and sufficient conditions for the optimality of ε̂.

Theorem 4. Let SV(ρ1*) = {v ∈ V | ρ1*(v) > 0} be the support of the optimal ρ1 in the original formulation (3). Let the indicator I_{SV*} ∈ {0, 1}^|V| be such that I_{SV*}(v) = 1 if v ∈ SV(ρ1*) and 0 otherwise. The relaxation of (6) is guaranteed to recover SV(ρ1*) if and only if there exists a tuple (γ, t̂, ν̂, ζ̂) ∈ R+ × R^|V| × R^|V|+ × R such that, for all vertices v ∈ V,

    |t̂(v) − ν̂(v) + ζ̂| > γ  if v ∈ SV(ρ1*),   and   |t̂(v) − ν̂(v) + ζ̂| < γ  if v ∉ SV(ρ1*),   (7)

where

    (t̂, ν̂, ζ̂) ∈ argmax_{t ∈ TF,c} max_{ν ∈ R^|V|+} max_{ζ ∈ R}  −(1/(2λ)) ‖(t − ν + ζ1) ⊙ I_{SV*}‖² + t⊤ρ0 + ζ.   (8)

The quantity |t̂(v) − ν̂(v) + ζ̂| in (7) may be viewed as the strength of a signal. In that sense, we require the vertices in the support of the optimal ρ1 to have a strictly higher signal than the vertices outside the support. Such signal detection conditions appear in various contexts and often have information-theoretic implications, e.g., for Ising models [49]. Condition (7) is also reminiscent of the β-min condition on regression coefficients for variable selection with Lasso in high-dimensional linear models [50].

For some applications, projecting onto the simplex, as required by (6), may be an expensive operation. We can invoke the minimax theorem to swap the order of ρ̃1 and t, and proceed with a Lagrangian dual to eliminate ρ̃1 at the expense of introducing a scalar variable.
Thus, effectively, we can replace the projection onto the simplex by a one-dimensional search. We state this equivalent formulation below.

Theorem 5. Problem (6), and thus the original formulation (3), is equivalent to

    min_{ε ∈ Ek}  max_{t ∈ TF,c, ζ ∈ R}  −(1/(2λ)) Σ_{v: t(v) > −ζ} (ε(v)(t(v) + ζ)² + 2λ t(v) ρ0(v))  −  Σ_{v: t(v) ≤ −ζ} t(v) ρ0(v)  −  ζ,   (9)

whose objective we denote ψρ0(ε, t, ζ).

We present a customized Mirror Prox procedure in Algorithm 2. The projections Proj_TF,c and Proj_Ẽk can be computed efficiently [45, 46]. If the solution ε̂ ∈ [0, 1]^|V| returned by the algorithm is not integral, we round it to have at most k vertices as the estimated support for the target distribution ρ1. The compressed graph is taken to be the subgraph spanned by these vertices.

3.3 Specifying the cost function

In our experiments, we fixed the cost of each edge, computed based on the agreement between the associated vertex labels. Here we illustrate briefly how to parameterize the cost and how the parameters could be learned. Define ℓ(i, j) = 1 for edge (i, j) if vertices i and j have the same label, and −1 otherwise. Let the cost function be parameterized by θ = (θs, θd), θs > 0, θd > 0, such that

    cθ(i, j) = 0.5(θd(1 − ℓ(i, j)) + θs(1 + ℓ(i, j))).

Table 1: Description of graph datasets, and comparison of accuracy on test data. We provide statistics on the number of graphs, number of classes, average number of nodes, and average number of edges in each dataset. The classification test accuracy (along with standard deviation) when each graph was (roughly) compressed to half is shown for each method for each training fraction in {0.2, ..., 0.8}.
The algorithm having the best performance is marked with '*' in each case. '-' entries indicate that the method failed to compress the dataset (e.g., due to matrix singularity). Columns: acc@0.2, acc@0.3, acc@0.4, acc@0.5, acc@0.6, acc@0.7, acc@0.8.

MSRC-21C (graphs: 209, classes: 20, avg nodes: 40.3, avg edges: 96.6)
  REC:        .485±.016  .543±.010  .595±.010  .625±.013  .641±.008  .696±.013  .738±.016
  Heavy:      .408±.016  .479±.015  .516±.009  .538±.009  .557±.011  .602±.022  .653±.011
  Affinity:   .413±.021  .489±.008  .516±.011  .549±.010  .560±.016  .607±.019  .654±.021
  Alg. Dist.: .452±.036  .498±.035  .524±.021  .535±.027  .531±.029  .590±.032  .652±.044
  OTC:       *.548±.004 *.605±.003 *.639±.006 *.679±.003 *.696±.002 *.742±.007 *.778±.005

DHFR (graphs: 467, classes: 2, avg nodes: 42.4, avg edges: 44.5)
  REC:        .681±.011  .704±.014  .724±.007  .738±.009  .749±.008  .756±.011  .771±.011
  Heavy:      .719±.010  .751±.010  .776±.012  .782±.008  .777±.009  .786±.014  .799±.013
  Affinity:   .717±.013  .733±.011  .745±.014  .761±.014  .771±.019  .767±.015  .785±.013
  Alg. Dist.: .743±.011  .761±.012  .768±.022  .786±.019  .810±.025 *.817±.033  .809±.030
  OTC:       *.757±.004 *.784±.003 *.797±.005 *.799±.003 *.811±.007  .814±.006 *.823±.004

MSRC-9 (graphs: 221, classes: 8, avg nodes: 40.6, avg edges: 97.9)
  REC:        .738±.011  .782±.010  .817±.009  .818±.013  .835±.020  .833±.018  .840±.013
  Heavy:      .648±.019  .710±.024  .766±.014  .773±.010  .786±.009  .796±.010  .813±.009
  Affinity:   .665±.015  .722±.005  .762±.010  .774±.014  .789±.026  .786±.019  .801±.017
  Alg. Dist.: .666±.048  .717±.051  .756±.029  .771±.039  .798±.032  .803±.030  .809±.046
  OTC:       *.784±.005 *.808±.005 *.826±.007 *.846±.003 *.839±.006 *.842±.007 *.854±.003

BZR-MD (graphs: 306, classes: 2, avg nodes: 21.3, avg edges: 225.06)
  REC:        .525±.011  .548±.015 *.563±.020  .553±.021  .563±.012  .569±.012  .587±.020
  Heavy:      .497±.000  .546±.000  .555±.000  .522±.000  .550±.000  .572±.000  .558±.000
  Affinity:   .508±.006  .534±.012  .534±.017  .532±.015  .549±.020  .567±.033  .562±.029
  Alg. Dist.: .497±.021  .546±.026  .555±.038  .522±.028  .550±.028  .572±.024  .558±.039
  OTC:       *.534±.000 *.569±.000  .547±.000 *.579±.000 *.572±.000 *.607±.000 *.603±.000

Mutagenicity (graphs: 4337, classes: 2, avg nodes: 30.3, avg edges: 30.8)
  REC:        .713±.006  .730±.006  .742±.005  .752±.004  .758±.005  .765±.007  .769±.007
  Heavy:      .718±.006  .738±.004  .753±.004  .763±.004  .771±.003  .779±.004  .783±.004
  Affinity:   -          -          -          -          -          -          -
  Alg. Dist.: -          -          -          -          -          -          -
  OTC:       *.749±.002 *.768±.003 *.779±.003 *.787±.004 *.792±.004 *.795±.003 *.799±.003

Thus, cθ(i, j) ∈ {θs, θd} depending on whether i and j have the same label. The cost parameters cannot be driven solely by the compression criterion, as this objective would lead to a trivial all-zero solution. Instead, θ must be partly driven by an external classification loss. In other words, we can learn θ by trading off the compression loss against, for example, the ability to correctly classify the resulting reduced graphs.
We leave this for future work.

3.4 Relation to other compression techniques

Most compression algorithms try to preserve the graph spectrum via a (multi-level) coarsening procedure: at each level they compute a matching of vertices and merge the matched vertices. For example, Heavy Edge [2] contracts those edges (i, j) that are incident on low-degree vertices. Likewise, REC [1] follows a randomized greedy procedure for generating a maximal matching incrementally. Let di be the degree of node i. Setting the cost c(i, j) = max(di, dj) in our framework incentivizes flow on edges with low-degree vertices and, in turn, compression of one of their endpoints. The vertices not in the support of the target distribution may then be viewed as being matched to (a subset of) the adjacent vertices that they transfer flow to. Unlike other methods, our approach is flexible in how c(i, j) is defined.

4 Experiments

We conducted several experiments to demonstrate the merits of our method. We start by describing the experimental setup. We fixed the values of the hyperparameters in Algorithm 2 for all our experiments. Specifically, we set the regularization coefficient λ = 1, and the gradient rates αℓ = 0.1, βℓ = 0.1, γℓ = 0.1 for each ℓ ∈ {0, 1, ..., T}. We also let ρ0 be the stationary distribution, by setting ρ0(v) for each v ∈ V to the ratio of deg(v), i.e., the degree of v, to the sum of the degrees of all the vertices. Note that the distribution thus obtained is the unique stationary distribution for connected non-bipartite graphs, and a stationary distribution for bipartite graphs. Moreover, for non-bipartite graphs it has a nice physical interpretation: any random walk on the graph converges to this distribution irrespective of the initial distribution. The objective of our experiments is three-fold.
Since compression is often employed as a preprocessing step for further tasks, we first show that our method compares favorably, in terms of test accuracy, to state-of-the-art compression methods on graph classification. We then demonstrate that our method performs best in terms of compression time. We finally show that our approach provides qualitatively meaningful compression on synthetic and real examples.

4.1 Classifying standard graph data

We used several standard graph datasets for our experiments, namely DHFR [51], BZR-MD [52], MSRC-9, MSRC-21C [53], and Mutagenicity [54]. We focused on these datasets since they represent a wide spectrum in terms of the number of graphs, number of classes, average number of nodes per graph, and average number of edges per graph (see Table 1 for details). All these datasets have a class label for each graph and, additionally, labels for each node in every graph.

We compare the test accuracy of our algorithm, OTC (short for Optimal Transport based Compression), to several state-of-the-art methods: REC [1], Heavy edge matching (Heavy) [2], Affinity vertex proximity (Affinity) [14], and Algebraic distance (Algebraic) [12, 13]. Amongst these, REC is a randomized method that iteratively contracts edges of the graph and thereby coarsens it. Since the method is randomized, REC yields a compressed graph that achieves a specified compression factor, i.e., the ratio of the number of nodes in the reduced graph to that in the original uncompressed graph, only in expectation. Therefore, in order to allow a fair comparison, we first run REC on each graph with the compression factor set to 0.5, and then execute the other baselines and our algorithm, i.e., Algorithm 2, with k set to the number of nodes in the compressed graph produced by REC.
To further mitigate the effects of randomness on the number of nodes returned by REC, we performed 5 independent runs of REC, and subsequently of all other methods.

Each collection of compressed graphs was then divided into train and test sets and used for our classification task. Since our datasets do not provide separate train and test sets, we employed the following procedure for each dataset. We partitioned each dataset into multiple train and test sets of varying sizes. Specifically, for each p ∈ {0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8}, we divided each dataset randomly into a train set containing a fraction p of the graphs in the dataset, and a test set containing the remaining 1 − p fraction. To mitigate the effects of chance, we formed 5 such independent train-test partitions for each fraction p and each dataset. We averaged the results over these splits to get one reading per collection, and thus 5 readings in total across collections, for each fraction and each method. We then averaged the test accuracy across collections for each method and fraction.

We now specify the cost function c for our algorithm. As described in Section 3, we can leverage the cost function to encode a preference for different edges in the compressed graph. For each graph, we set c = 0.01 for each edge e incident on nodes with the same label, and c = 0.02 for each e incident on nodes with different labels. Thus, in effect, we slightly biased the compression to prefer retaining edges with different labels at their endpoints. In our experiments, we employed support vector machines (SVMs) with the Weisfeiler-Leman subtree kernel [55] to quantify the similarity between graphs [53, 56, 57]. This kernel is based on the Weisfeiler-Leman test of isomorphism [58, 59], and thus naturally takes the node labels of the two graphs into account. We fixed the number of kernel iterations to 5. We also fixed T = 25 for our algorithm.
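The label-dependent cost specification above can be written down directly; the dictionary representation here is an illustrative assumption, not the authors' code:

```python
def edge_costs(edges, node_labels, same=0.01, diff=0.02):
    # Lower cost on same-label edges encourages flow (and hence compression)
    # there, slightly biasing the method toward retaining edges whose
    # endpoints carry different labels.
    return {
        (i, j): same if node_labels[i] == node_labels[j] else diff
        for (i, j) in edges
    }

labels = ["C", "C", "N"]  # hypothetical node labels
c = edge_costs([(0, 1), (1, 2)], labels)
```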
For each method and each train-test split, we used a separate 5-fold cross-validation procedure to tune the SVM error-penalty coefficient C over the set {0.1, 1, 10} when training an independent SVM model on the training portion.

Table 1 summarizes the performance of the different methods for each fraction of the data. As the numbers in bold indicate, our method generally outperformed the other methods across the datasets. On some datasets, the discrepancy between the average test accuracies of two algorithms is large for every fraction. Note that though OTC performs best on DHFR, the performance of most methods is similar (except REC, which lags behind). In contrast, REC performs better than all methods except OTC on MSRC-9. This suggests that REC performs well on graphs with strong connectivity, while others might be better on data with a long backbone besides these ring structures. We believe the robust performance of OTC across these datasets, comprising graphs with vastly different topologies, underscores the promise of our approach.

Figure 1: Comparison on standard graph datasets. The top row shows the average test accuracy and corresponding standard deviation for our method (OTC) and state-of-the-art baselines for different fractions of training data. The bottom row compares the corresponding compression times. Our method outperforms the other methods in terms of both accuracy and compression time.

Further, as Fig. 1 shows, OTC performed best in terms of compression time as well. We emphasize that the discrepancy in compression times became quite stark on the larger datasets (i.e., DHFR and Mutagenicity). To provide more evidence of the scalability of our method, we also experimented with the larger Tox21AR-LBD data,1 which consists of about 8600 graphs. Both our method and Algebraic distance performed very well in terms of classification accuracy (∼97%) on this data.
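The 5-fold model-selection loop described earlier in this section can be sketched as follows; `evaluate` is a hypothetical stand-in for training and scoring an SVM (with the Weisfeiler-Leman subtree kernel, in the actual pipeline) on given train/validation index sets:

```python
def pick_C(train_indices, evaluate, grid=(0.1, 1, 10), n_folds=5):
    # Split the training indices into n_folds interleaved folds, hold out
    # each fold in turn, and keep the C with the best average score.
    folds = [train_indices[f::n_folds] for f in range(n_folds)]
    best_C, best_score = None, float("-inf")
    for C in grid:
        score = sum(
            evaluate(C,
                     [i for g, fold in enumerate(folds) if g != f for i in fold],
                     folds[f])
            for f in range(n_folds)
        ) / n_folds
        if score > best_score:
            best_C, best_score = C, score
    return best_C

# Toy stand-in scorer that happens to prefer C = 1.
chosen = pick_C(list(range(20)), lambda C, tr, va: 1.0 - abs(C - 1))
```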
Our approach took about 39 seconds in total to compress the graphs in this dataset to 90% (low compression), and about 41 seconds in total to compress to 10% (high compression). In contrast, the Algebraic distance method took about 48 seconds to compress to 90% and significantly longer, i.e., 3.5 minutes, to compress to 10%. The other baselines failed to compress this data.

1 https://tripod.nih.gov/tox21/challenge/data.jsp

4.2 Compressing synthetic and real examples

We now describe our second set of experiments, with both synthetic and real data, showing that our method can be seeded with useful prior information toward preserving interesting patterns in the compressed structures. This flexibility in specifying the prior information makes our approach especially well-suited for downstream tasks, where domain knowledge is often available.

Fig. 2 demonstrates the effect of compressing a synthetic tree-structured graph. The penultimate level consists of four nodes, each of which has four child nodes as leaves. We introduce asymmetry with respect to the different nodes by specifying different combinations of c(e) for the edges e between the leaf nodes and their parents: there are respectively one, two, three, and four heavy edges (i.e., with c(e) = 0.5) from the internal nodes 1, 2, 3, and 4 to their children, i.e., the leaves in their subtrees. As shown in Fig. 2, our method adapts to the hierarchical structure as the amount of compression varies from just one node to about three-fourths of the entire graph. We also show meaningful compression on some examples from real datasets.

Figure 2: Visualizing compression on synthetic and real examples. (a) A synthetic graph structured as a 4-ary tree of depth 2. The root 0 is connected to its neighbors by edges having c(e) = 0.3. All the other edges have either c(e) = 0.5 (thickest) or c(e) = 0.1 (lightest). The left-out portions are shown in gray. (b) Compressed graph (k = 20): leaf node 13, which is in the same subtree as the three nodes 14-16 with heavy edges, is the first to go. (c) Compressed graph (k = 15): proceeding further, 9 and 10 are left out, followed by the remaining nodes (i.e., 5, 6, 7) connected by light edges. (d) Compressed graph (k = 5): only the root and its neighbors remain, despite bulkier subtrees, e.g., the one with node 4 and its neighbors, being discarded. Thus, our method yields meaningful compression on this synthetic example. (e-f) Compressed structures for sample graphs from Mutagenicity and MSRC-21C, which contain some other interesting motifs. In each case, the compressed graph consists of the red edges and the incident vertices, while the discarded parts are shown in gray. All the figures here are best viewed in color.

The bottom row of Fig.
2 shows two such examples, one each from Mutagenicity and MSRC-21C. For these graphs, we used the same specification for c as in Section 4.1. The example from Mutagenicity contains patterns, such as rings and backbone structures, that are ubiquitous in molecules and proteins. Likewise, the other example is a good representative of the MSRC-21C dataset from computer vision.

Thus, our method encodes prior information, and provides fast and effective graph compression for downstream applications.

Acknowledgments

We thank the anonymous reviewers for their thoughtful questions, which led to Sections 3.3 and 3.4 and the experiments on the Tox21 data. We are grateful to Andreas Loukas for the code of their algorithm [1]. VG and TJ were partially supported by a grant from the MIT-IBM collaboration.

References

[1] A. Loukas and P. Vandergheynst. Spectrally approximating large graphs with smaller graphs. In International Conference on Machine Learning (ICML), 2018.

[2] I. S. Dhillon, Y. Guan, and B. Kulis. Weighted graph cuts without eigenvectors: a multilevel approach. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 29, 2007.

[3] L. Wang, Y. Xiao, B. Shao, and H. Wang. How to partition a billion-node graph. In International Conference on Data Engineering (ICDE), pages 568–579, 2014.

[4] E. Ravasz and A.-L. Barabási. Hierarchical organization in complex networks. Physical Review E, 67(2):026112, 2003.

[5] B. Savas and I. Dhillon. Clustered low rank approximation of graphs in information science applications. In SIAM International Conference on Data Mining (SDM), 2011.

[6] R. Kondor, N. Teneva, and V. K. Garg. Multiresolution matrix factorization. In International Conference on Machine Learning (ICML), pages 1620–1628, 2014.

[7] N. Teneva, P. K. Mudrakarta, and R. Kondor. Multiresolution matrix compression.
In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1441–1449, 2016.

[8] S. Lafon and A. B. Lee. Diffusion maps and coarse-graining: A unified framework for dimensionality reduction, graph partitioning, and data set parameterization. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 28(9):1393–1403, 2006.

[9] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR), 2014.

[10] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Neural Information Processing Systems (NIPS), 2016.

[11] M. Simonovsky and N. Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[12] D. Ron, I. Safro, and A. Brandt. Relaxation-based coarsening and multiscale graph organization. Multiscale Modeling & Simulation, 9(1):407–423, 2011.

[13] J. Chen and I. Safro. Algebraic distance on graphs. SIAM Journal on Scientific Computing, 33(6):3468–3490, 2011.

[14] O. E. Livne and A. Brandt. Lean algebraic multigrid (LAMG): Fast graph Laplacian linear solver. SIAM Journal on Scientific Computing, 34, 2012.

[15] D. I. Shuman, M. J. Faraji, and P. Vandergheynst. A multiscale pyramid transform for graph signals. IEEE Transactions on Signal Processing, 64(8):2119–2134, 2016.

[16] F. Dörfler and F. Bullo. Kron reduction of graphs with applications to electrical networks. IEEE Transactions on Circuits and Systems I: Regular Papers, 60(1):150–163, 2013.

[17] G. B. Hermsdorff and L. M. Gunderson. A unifying framework for spectrum-preserving graph sparsification and coarsening. arXiv:1902.09702, 2019.

[18] C. Villani.
Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.

[19] G. Karakostas. Faster approximation schemes for fractional multicommodity flow problems. ACM Transactions on Algorithms, 4(1):13:1–13:17, 2008.

[20] M. B. Cohen, A. Madry, P. Sankowski, and A. Vladu. Negative-weight shortest paths and unit capacity minimum cost flow in Õ(m^{10/7} log W) time (extended abstract). In ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 752–771, 2017.

[21] J. Solomon, R. Rustamov, L. Guibas, and A. Butscher. Continuous-flow graph transportation distances. arXiv:1603.06927, 2016.

[22] M. Essid and J. Solomon. Quadratically regularized optimal transport on graphs. SIAM Journal on Scientific Computing, 40(4):A1961–A1986, 2018.

[23] P. Hansen, J. Kuplinsky, and D. de Werra. Mixed graph colorings. Mathematical Methods of Operations Research, 45(1):145–160, 1997.

[24] B. Ries. Coloring some classes of mixed graphs. Discrete Applied Mathematics, 155(1):1–6, 2007.

[25] R. G. Cowell, A. P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter. Probabilistic Networks and Expert Systems: Exact Computational Methods for Bayesian Networks. Springer Publishing Company, Incorporated, 1st edition, 2007.

[26] M. Beck, D. Blado, J. Crawford, T. Jean-Louis, and M. Young. On weak chromatic polynomials of mixed graphs. Graphs and Combinatorics, 31(1):91–98, 2015.

[27] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning (ICML), pages 214–223, 2017.

[28] N. Courty, R. Flamary, A. Habrard, and A. Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. In Neural Information Processing Systems (NIPS), 2017.

[29] I. Redko, N. Courty, R. Flamary, and D. Tuia. Optimal transport for multi-source domain adaptation under target shift.
In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 849–858, 2019.

[30] C. Frogner, C. Zhang, H. Mobahi, M. Araya, and T. A. Poggio. Learning with a Wasserstein loss. In Neural Information Processing Systems (NIPS), pages 2053–2061, 2015.

[31] D. Alvarez-Melis, T. Jaakkola, and S. Jegelka. Structured optimal transport. In International Conference on Artificial Intelligence and Statistics (AISTATS), volume 84, pages 1771–1780, 2018.

[32] T. Vayer, L. Chapel, R. Flamary, R. Tavenard, and N. Courty. Optimal transport for structured data. arXiv:1805.09114, 2018.

[33] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Neural Information Processing Systems (NIPS), 2013.

[34] J.-D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37:A1111–A1138, 2015.

[35] J. Solomon. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (TOG), 34(4):66:1–66:11, 2015.

[36] A. Genevay, M. Cuturi, G. Peyré, and F. Bach. Stochastic optimization for large-scale optimal transport. In Neural Information Processing Systems (NIPS), pages 3440–3448, 2016.

[37] V. Seguy, B. B. Damodaran, R. Flamary, N. Courty, A. Rolet, and M. Blondel. Large-scale optimal transport and mapping estimation. In International Conference on Learning Representations (ICLR), 2018.

[38] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267–288, 1996.

[39] M. Tan, I. W. Tsang, and L. Wang. Towards ultrahigh dimensional feature selection for big data. Journal of Machine Learning Research (JMLR), 15(1):1371–1429, 2014.

[40] M. Pilanci, M. J. Wainwright, and L.
El Ghaoui. Sparse learning via Boolean relaxations. Mathematical Programming, 151(1):63–87, 2015.

[41] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In International Conference on Machine Learning (ICML), 2008.

[42] W. Wang and M. A. Carreira-P. Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application. arXiv:1309.1541, 2013.

[43] L. Condat. Fast projection onto the simplex and the ℓ1 ball. Mathematical Programming, 158(1–2):575–585, 2016.

[44] A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.

[45] A. Beck and M. Teboulle. A fast dual proximal gradient algorithm for convex minimization and applications. Operations Research Letters, 42(1):1–6, 2014.

[46] V. K. Garg, O. Dekel, and L. Xiao. Learning small predictors. In Neural Information Processing Systems (NeurIPS), 2018.

[47] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.

[48] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.

[49] N. P. Santhanam and M. J. Wainwright. Information-theoretic limits of selecting binary graphical models in high dimensions. IEEE Transactions on Information Theory, 58(7):4117–4134, 2012.

[50] P. Bühlmann. Statistical significance in high-dimensional linear models. Bernoulli, 19(4):1212–1242, 2013.

[51] J. J. Sutherland, L. A. O'Brien, and D. F. Weaver.
Spline-fitting with a genetic algorithm: a method for developing classification structure-activity relationships. J. Chem. Inf. Comput. Sci., 43:1906–1915, 2003.

[52] N. Kriege and P. Mutzel. Subgraph matching kernels for attributed graphs. In International Conference on Machine Learning (ICML), 2012.

[53] M. Neumann, R. Garnett, C. Bauckhage, and K. Kersting. Propagation kernels: efficient graph kernels from propagated information. Machine Learning, 102(2):209–245, 2016.

[54] J. Kazius, R. McGuire, and R. Bursi. Derivation and validation of toxicophores for mutagenicity prediction. Journal of Medicinal Chemistry, 48(1):312–320, 2005.

[55] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and K. M. Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research (JMLR), 12:2539–2561, 2011.

[56] S. V. N. Vishwanathan, N. Schraudolph, R. Kondor, and K. Borgwardt. Graph kernels. Journal of Machine Learning Research (JMLR), 11:1201–1242, 2010.

[57] R. Kondor and H. Pan. The multiscale Laplacian graph kernel. In Neural Information Processing Systems (NIPS), pages 2990–2998, 2016.

[58] B. Y. Weisfeiler and A. A. Leman. A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia, 9, 1968.

[59] K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations (ICLR), 2019.

[60] M. Sion. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171–176, 1958.