{"title": "End to end learning and optimization on graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 4672, "page_last": 4683, "abstract": "Real-world applications often combine learning and optimization problems on graphs. For instance, our objective may be to cluster the graph in order to detect meaningful communities (or solve other common graph optimization problems such as facility location, maxcut, and so on). However, graphs or related attributes are often only partially observed, introducing learning problems such as link prediction which must be solved prior to optimization. Standard approaches treat learning and optimization entirely separately, while recent machine learning work aims to predict the optimal solution directly from the inputs. Here, we propose an alternative decision-focused learning approach that integrates a differentiable proxy for common graph optimization problems as a layer in learned systems. The main idea is to learn a representation that maps the original optimization problem onto a simpler proxy problem that can be efficiently differentiated through. Experimental results show that our ClusterNet system outperforms both pure end-to-end approaches (that directly predict the optimal solution) and standard approaches that entirely separate learning and optimization. Code for our system is available at https://github.com/bwilder0/clusternet.", "full_text": "End to end learning and optimization on graphs\n\nBryan Wilder\n\nHarvard University\n\nbwilder@g.harvard.edu\n\nEric Ewing\n\nUniversity of Southern California\n\nericewin@usc.edu\n\nBistra Dilkina\n\nUniversity of Southern California\n\ndilkina@usc.edu\n\nMilind Tambe\n\nHarvard University\n\nmilind_tambe@harvard.edu\n\nAbstract\n\nReal-world applications often combine learning and optimization problems on\ngraphs. 
For instance, our objective may be to cluster the graph in order to detect\nmeaningful communities (or solve other common graph optimization problems\nsuch as facility location, maxcut, and so on). However, graphs or related attributes\nare often only partially observed, introducing learning problems such as link\nprediction which must be solved prior to optimization. Standard approaches treat\nlearning and optimization entirely separately, while recent machine learning work\naims to predict the optimal solution directly from the inputs. Here, we propose an\nalternative decision-focused learning approach that integrates a differentiable proxy\nfor common graph optimization problems as a layer in learned systems. The main\nidea is to learn a representation that maps the original optimization problem onto a\nsimpler proxy problem that can be ef\ufb01ciently differentiated through. Experimental\nresults show that our CLUSTERNET system outperforms both pure end-to-end\napproaches (that directly predict the optimal solution) and standard approaches\nthat entirely separate learning and optimization. Code for our system is available at\nhttps://github.com/bwilder0/clusternet.\n\n1\n\nIntroduction\n\nWhile deep learning has proven enormously successful at a range of tasks, an expanding area of\ninterest concerns systems that can \ufb02exibly combine learning with optimization. Examples include\nrecent attempts to solve combinatorial optimization problems using neural architectures [45, 28, 8, 30],\nas well as work which incorporates explicit optimization algorithms into larger differentiable systems\n[3, 18, 47]. 
The ability to combine learning and optimization promises improved performance for\nreal-world problems which require decisions to be made on the basis of machine learning predictions\nby enabling end-to-end training which focuses the learned model on the decision problem at hand.\nWe focus on graph optimization problems, an expansive subclass of combinatorial optimization.\nWhile graph optimization is ubiquitous across domains, complete applications must also solve\nmachine learning challenges. For instance, the input graph is usually incomplete; some edges\nmay be unobserved or nodes may have attributes that are only partially known. Recent work has\nintroduced sophisticated methods for tasks such as link prediction and semi-supervised classi\ufb01cation\n[38, 29, 39, 25, 53], but these methods are developed in isolation of downstream optimization tasks.\nMost current solutions use a two-stage approach which \ufb01rst trains a model using a standard loss\nand then plugs the model\u2019s predictions into an optimization algorithm ([50, 10, 5, 9, 42]). However,\npredictions which minimize a standard loss function (e.g., cross-entropy) may be suboptimal for\nspeci\ufb01c optimization tasks, especially in dif\ufb01cult settings where even the best model is imperfect.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fA preferable approach is to incorporate the downstream optimization problem into the training of\nthe machine learning model. A great deal of recent work takes a pure end-to-end approach where\na neural network is trained to predict a solution to the optimization problem using supervised or\nreinforcement learning [45, 28, 8, 30]. However, this often requires a large amount of data and results\nin suboptimal performance because the network needs to discover algorithmic structure entirely from\nscratch. 
Between the extremes of an entirely two-stage approach and pure end-to-end architectures, decision-focused learning [18, 47] embeds a solver for the optimization problem as a differentiable layer within a learned system. This allows the model to train using the downstream performance that it induces as the loss, while leveraging prior algorithmic knowledge for optimization. The downside is that this approach requires manual effort to develop a differentiable solver for each particular problem and often results in cumbersome systems that must, e.g., call a quadratic programming solver every forward pass.\n\nWe propose a new approach that gets the best of both worlds: incorporate a solver for a simpler optimization problem as a differentiable layer, and then learn a representation that maps the (harder) problem of interest onto an instance of the simpler problem. Compared to earlier approaches to decision-focused learning, this places more emphasis on the representation learning component of the system and simplifies the optimization component. However, compared to pure end-to-end approaches, we only need to learn the reduction to the simpler problem instead of the entire algorithm.\n\nIn this work, we instantiate the simpler problem as a differentiable version of k-means clustering. Clustering is motivated by the observation that graph neural networks embed nodes into a continuous space, allowing us to approximate optimization over the discrete graph with optimization in continuous embedding space. We then interpret the cluster assignments as a solution to the discrete problem. We instantiate this approach for two classes of optimization problems: those that require partitioning the graph (e.g., community detection or maxcut), and those that require selecting a subset of K nodes (facility location, influence maximization, immunization, etc.). 
We don\u2019t claim that clustering is the right algorithmic structure for all tasks, but it is sufficient for many problems as shown in this paper.\n\nIn short, we make three contributions. First, we introduce a general framework for integrating graph learning and optimization, with a simpler optimization problem in continuous space as a proxy for the more complex discrete problem. Second, we show how to differentiate through the clustering layer, allowing it to be used in deep learning systems. Third, we show experimental improvements over both two-stage baselines and alternative end-to-end approaches on a range of example domains.\n\n2 Related work\n\nWe build on recent work on decision-focused learning [18, 47, 15], which incorporates a solver for an optimization problem into training in order to improve performance on a downstream decision problem. A related line of work develops and analyzes effective surrogate loss functions for predict-then-optimize problems [19, 6]. Some work in structured prediction also integrates differentiable solvers for discrete problems (e.g., image segmentation [16] or time series alignment [34]). Our work differs in two ways. First, we tackle more difficult optimization problems. Previous work mostly focuses on convex problems [18] or discrete problems with near-lossless convex relaxations [47, 16]. We focus on highly combinatorial problems where the methods of choice are hand-designed discrete algorithms. Second, in response to this difficulty, we differ methodologically in that we do not attempt to include a solver for the exact optimization problem at hand (or a close relaxation of it). Instead, we include a more generic algorithmic skeleton that is automatically finetuned to the optimization problem at hand.\n\nThere is also recent interest in training neural networks to solve combinatorial optimization problems [45, 28, 8, 30]. 
While we focus mostly on combining graph learning with optimization, our model can also be trained just to solve an optimization problem given complete information about the input. The main methodological difference is that we include more structure via a differentiable k-means layer instead of using more generic tools (e.g., feed-forward or attention layers). Another difference is that prior work mostly trains via reinforcement learning. By contrast, we use a differentiable approximation to the objective which removes the need for a policy gradient estimator. This is a benefit of our architecture, in which the final decision is fully differentiable in terms of the model parameters instead of requiring non-differentiable selection steps (as in [28, 8, 30]). We give our end-to-end baseline (\u201cGCN-e2e\u201d) the same advantage by training it with the same differentiable decision loss as our own model instead of forcing it to use noisier policy gradient estimates.\n\nFigure 1: Top: CLUSTERNET, our proposed system. Bottom: a typical two-stage approach.\n\nFinally, some work uses deep architectures as a part of a clustering algorithm [43, 31, 24, 41, 35], or includes a clustering step as a component of a deep network [21, 22, 52]. While some techniques are similar, the overall task we address and framework we propose are entirely distinct. Our aim is not to cluster a Euclidean dataset (as in [43, 31, 24, 41]), or to solve perceptual grouping problems (as in [21, 22]). Rather, we propose an approach for graph optimization problems. Perhaps the closest work is Neural EM [22], which uses an unrolled EM algorithm to learn representations of visual objects. Rather than using EM to infer representations for objects, we use k-means in graph embedding space to solve an optimization problem. There is also some work which uses deep networks for graph clustering [49, 51]. 
However, none of this work includes an explicit clustering algorithm in the network, and none consider our goal of integrating graph learning and optimization.\n\n3 Setting\n\nWe consider settings that combine learning and optimization. The input is a graph G = (V, E), which is in some way partially observed. We will formalize our problem in terms of link prediction as an example, but our framework applies to other common graph learning problems (e.g., semi-supervised classification). In link prediction, the graph is not entirely known; instead, we observe only training edges Etrain \u2282 E. Let A denote the adjacency matrix of the graph and Atrain denote the adjacency matrix with only the training edges. The learning task is to predict A from Atrain. In the domains we consider, the motivation for performing link prediction is to solve a decision problem for which the objective depends on the full graph. Specifically, we have a decision variable x, an objective function f(x, A), and a feasible set X. We aim to solve the optimization problem\n\nmax_{x \u2208 X} f(x, A).   (1)\n\nHowever, A is unobserved. We can also consider an inductive setting in which we observe graphs A1, ..., Am as training examples and then seek to predict edges for a partially observed graph from the same distribution. The most common approach to either setting is to train a model to reconstruct A from Atrain using a standard loss function (e.g., cross-entropy), producing an estimate \u00c2. The two-stage approach plugs \u00c2 into an optimization algorithm for Problem 1, maximizing f(x, \u00c2).\n\nWe propose end-to-end models which map from Atrain directly to a feasible decision x. The model will be trained to maximize f(x, Atrain), i.e., the quality of its decision evaluated on the training data (instead of a loss \u2113(\u00c2, Atrain) that measures purely predictive accuracy). 
One approach is to \u201clearn away\u201d the problem by training a standard model (e.g., a GCN) to map directly from Atrain to x. However, this forces the model to entirely rediscover algorithmic concepts, while two-stage methods are able to exploit highly sophisticated optimization methods. We propose an alternative that embeds algorithmic structure into the learned model, getting the best of both worlds.\n\n[Figure 1 appears here. In CLUSTERNET, the forward pass embeds and clusters the nodes and evaluates the objective, while the backward pass updates the node embeddings to improve the objective; in the typical two-stage approach, the forward pass embeds nodes, predicts edges, and evaluates accuracy, while the backward pass updates model parameters to improve accuracy.]\n\n4 Approach: CLUSTERNET\n\nOur proposed CLUSTERNET system (Figure 1) merges two differentiable components into a system that is trained end-to-end. First, a graph embedding layer which uses Atrain and any node features to embed the nodes of the graph into Rp. In our experiments, we use GCNs [29]. Second, a layer that performs differentiable optimization. This layer takes the continuous-space embeddings as input and uses them to produce a solution x to the graph optimization problem. 
Specifically, we propose to use a layer that implements a differentiable version of K-means clustering. This layer produces a soft assignment of the nodes to clusters, along with the cluster centers in embedding space.\n\nThe intuition is that cluster assignments can be interpreted as the solution to many common graph optimization problems. For instance, in community detection we can interpret the cluster assignments as assigning the nodes to communities. Or, in maxcut, we can use two clusters to assign nodes to either side of the cut. Another example is maximum coverage and related problems, where we attempt to select a set of K nodes which cover (are neighbors to) as many other nodes as possible. This problem can be approximated by clustering the nodes into K components and choosing nodes whose embedding is close to the center of each cluster. We do not claim that any of these problems is exactly reducible to K-means. Rather, the idea is that including K-means as a layer in the network provides a useful inductive bias. This algorithmic structure can be fine-tuned to specific problems by training the first component, which produces the embeddings, so that the learned representations induce clusterings with high objective value for the underlying downstream optimization task. We now explain the optimization layer of our system in greater detail. We start by detailing the forward and the backward pass for the clustering procedure, and then explain how the cluster assignments can be interpreted as solutions to the graph optimization problem.\n\n4.1 Forward pass\n\nLet x_j denote the embedding of node j and \u00b5_k denote the center of cluster k. r_jk denotes the degree to which node j is assigned to cluster k. In traditional K-means, this is a binary quantity, but we will relax it to a fractional value such that \u03a3_k r_jk = 1 for all j. Specifically, we take r_jk = exp(\u2212\u03b2 ||x_j \u2212 \u00b5_k||) / \u03a3_\u2113 exp(\u2212\u03b2 ||x_j \u2212 \u00b5_\u2113||), which is a soft-min assignment of each point to the cluster centers based on distance. While our architecture can be used with any norm || \u00b7 ||, we use the negative cosine similarity due to its strong empirical performance. \u03b2 is an inverse-temperature hyperparameter; taking \u03b2 \u2192 \u221e recovers the standard k-means assignment. We can optimize the cluster centers via an iterative process analogous to the typical k-means updates by alternately setting\n\n\u00b5_k = (\u03a3_j r_jk x_j) / (\u03a3_j r_jk)  \u2200k = 1...K,    r_jk = exp(\u2212\u03b2 ||x_j \u2212 \u00b5_k||) / \u03a3_\u2113 exp(\u2212\u03b2 ||x_j \u2212 \u00b5_\u2113||)  \u2200k = 1...K, j = 1...n.   (2)\n\nThese iterates converge to a fixed point where \u00b5 remains the same between successive updates [33]. The output of the forward pass is the final pair (\u00b5, r).\n\n4.2 Backward pass\n\nWe will use the implicit function theorem to analytically differentiate through the fixed point that the forward pass k-means iterates converge to, obtaining expressions for \u2202\u00b5/\u2202x and \u2202r/\u2202x. Previous work [18, 47] has used the implicit function theorem to differentiate through the KKT conditions of optimization problems; here we take a more direct approach that characterizes the update process itself. Doing so allows us to backpropagate gradients from the decision loss to the component that produced the embeddings x. Define a function f : R^{Kp} \u2192 R^{Kp} whose (i, \u2113) entry is\n\nf_{i,\u2113}(\u00b5, x) = \u00b5_i^\u2113 \u2212 (\u03a3_j r_ji x_j^\u2113) / (\u03a3_j r_ji).   (3)\n\nNow, (\u00b5, x) are a fixed point of the iterates if f(\u00b5, x) = 0. 
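As a concrete illustration, the fixed-point iteration of Equation 2 can be sketched in a few lines. This is a minimal NumPy sketch under our own simplifying assumptions: Euclidean distance rather than the negative cosine similarity used in the experiments, and the names `soft_assign`, `soft_kmeans`, and the explicit initial centers `mu0` are ours, not taken from the released code.

```python
import numpy as np

def soft_assign(x, mu, beta):
    """Soft-min assignment r_jk = exp(-beta * ||x_j - mu_k||) / sum_l exp(-beta * ||x_j - mu_l||)."""
    d = np.linalg.norm(x[:, None, :] - mu[None, :, :], axis=2)  # (n, K) distances
    logits = -beta * d
    r = np.exp(logits - logits.max(axis=1, keepdims=True))      # stabilized softmax over clusters
    return r / r.sum(axis=1, keepdims=True)

def soft_kmeans(x, mu0, beta=5.0, iters=200):
    """Run the Equation-2 iterates: alternate soft assignments and weighted center updates."""
    mu = mu0.astype(float).copy()
    for _ in range(iters):
        r = soft_assign(x, mu, beta)
        mu = (r.T @ x) / r.sum(axis=0)[:, None]  # mu_k = weighted mean of points under r
    return mu, soft_assign(x, mu, beta)
```

Because each update is a weighted average under soft-min weights, the iterates can be run to (numerical) convergence and the final pair `(mu, r)` read off as the layer's output.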
Applying the implicit function theorem yields that \u2202\u00b5/\u2202x = \u2212[\u2202f(\u00b5, x)/\u2202\u00b5]^{\u22121} \u2202f(\u00b5, x)/\u2202x, from which \u2202r/\u2202x can be easily obtained via the chain rule.\n\nExact backward pass: We now examine the process of calculating \u2202\u00b5/\u2202x. Both \u2202f(\u00b5, x)/\u2202x and \u2202f(\u00b5, x)/\u2202\u00b5 can be easily calculated in closed form (see appendix). Computing the former requires time O(nKp^2). Computing the latter requires O(npK^2) time, after which it must be inverted (or else iterative methods must be used to compute the product with its inverse). This requires time O(K^3 p^3) since it is a matrix of size (Kp) \u00d7 (Kp). While the exact backward pass may be feasible for some problems, it quickly becomes burdensome for large instances. We now propose a fast approximation.\n\nApproximate backward pass: We start from the observation that \u2202f/\u2202\u00b5 will often be dominated by its diagonal terms (the identity matrix). The off-diagonal entries capture the extent to which updates to one entry of \u00b5 indirectly impact other entries via changes to the cluster assignments r. However, when the cluster assignments are relatively firm, r will not be highly sensitive to small changes to the cluster centers. We find this to be typical empirically, especially since the optimal choice of the parameter \u03b2 (which controls the hardness of the cluster assignments) is typically fairly high. Under these conditions, we can approximate \u2202f/\u2202\u00b5 by its diagonal, \u2202f/\u2202\u00b5 \u2248 I. This in turn gives \u2202\u00b5/\u2202x \u2248 \u2212\u2202f/\u2202x.\n\nWe can formally justify this approximation when the clusters are relatively balanced and well-separated. More precisely, define c(j) = arg max_i r_ji to be the closest cluster to point j. 
Proposition 1 (proved in the appendix) shows that the quality of the diagonal approximation improves exponentially quickly in the product of two terms: \u03b2, the hardness of the cluster assignments, and \u03b4, which measures how well separated the clusters are. \u03b1 (defined below) measures the balance of the cluster sizes. We assume for convenience that the input is scaled so ||x_j||_1 \u2264 1 \u2200j.\n\nProposition 1. Suppose that for all points j, ||x_j \u2212 \u00b5_i|| \u2212 ||x_j \u2212 \u00b5_{c(j)}|| \u2265 \u03b4 for all i \u2260 c(j) and that for all clusters i, \u03a3_{j=1}^n r_ji \u2265 \u03b1 n. Moreover, suppose that \u03b2\u03b4 > log(2\u03b2K^2 / \u03b1). Then\n\n|| \u2202f/\u2202\u00b5 \u2212 I ||_1 \u2264 K^2 \u03b2 exp(\u2212\u03b4\u03b2) / ((1/2)\u03b1 \u2212 K^2 \u03b2 exp(\u2212\u03b4\u03b2)),\n\nwhere || \u00b7 ||_1 is the operator 1-norm.\n\nWe now show that the approximate gradient obtained by taking \u2202f/\u2202\u00b5 = I can be calculated by unrolling a single iteration of the forward-pass updates from Equation 2 at convergence. Examining Equation 3, we see that the first term (\u00b5_i^\u2113) is constant with respect to x, since here \u00b5 is a fixed value. Hence,\n\n\u2212\u2202f_k/\u2202x = \u2202/\u2202x [ (\u03a3_j r_jk x_j) / (\u03a3_j r_jk) ],\n\nwhich is just the update equation for \u00b5_k. 
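This one-step-unroll approximation can be checked numerically. The sketch below is our own illustration (not the paper's code), assuming a Euclidean norm and a toy instance with well-separated clusters; it compares a finite-difference estimate of the true fixed-point derivative against the derivative of a single update evaluated at convergence, which is what the unrolled step computes.

```python
import numpy as np

def update(mu, x, beta):
    """One forward-pass update (Equation 2): soft assignments r, then new centers mu."""
    d = np.linalg.norm(x[:, None, :] - mu[None, :, :], axis=2)
    logits = -beta * d
    r = np.exp(logits - logits.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    return (r.T @ x) / r.sum(axis=0)[:, None]

def fixed_point(x, mu0, beta, iters=300):
    """Iterate the update to (numerical) convergence."""
    mu = mu0.astype(float).copy()
    for _ in range(iters):
        mu = update(mu, x, beta)
    return mu

# Toy instance: two well-separated planar clusters, so beta * delta is large.
A = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05]])
x = np.vstack([A, A + 5.0])
beta, eps = 10.0, 1e-6
mu0 = x[[0, 5]]
mu_star = fixed_point(x, mu0, beta)

# Perturb one input coordinate and compare two derivative estimates.
xp = x.copy()
xp[0, 0] += eps
exact = (fixed_point(xp, mu0, beta) - mu_star) / eps    # d(mu*)/dx by re-solving
one_step = (update(mu_star, xp, beta) - mu_star) / eps  # single unrolled update at mu*
gap = np.abs(exact - one_step).max()
```

When the assignments are firm, `gap` is tiny: differentiating the last update alone recovers essentially the same Jacobian-vector products as differentiating through the full fixed point.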
Since the forward-pass updates are written entirely in terms of differentiable functions, we can automatically compute the approximate backward pass with respect to x (i.e., compute products with our approximations to \u2202\u00b5/\u2202x and \u2202r/\u2202x) by applying standard autodifferentiation tools to the final update of the forward pass. Compared to computing the exact analytical gradients, this avoids the need to explicitly reason about or invert \u2202f/\u2202\u00b5. The final iteration (the one which is differentiated through) requires time O(npK), linear in the size of the data.\n\nCompared to differentiating by unrolling the entire sequence of updates in the computational graph (as has been suggested for other problems [17, 4, 54]), our approach has two key advantages. First, it avoids storing the entire history of updates and backpropagating through all of them. The runtime for our approximation is independent of the number of updates needed to reach convergence. Second, we can in fact use entirely non-differentiable operations to arrive at the fixed point, e.g., heuristics for the K-means problem, stochastic methods which only examine subsets of the data, etc. This allows the forward pass to scale to larger datasets since we can use the best algorithmic tools available, not just those that can be explicitly encoded in the autodifferentiation tool\u2019s computational graph.\n\n4.3 Obtaining solutions to the optimization problem\n\nHaving obtained the cluster assignments r, along with the centers \u00b5, in a differentiable manner, we need a way to (1) differentiably interpret the clustering as a soft solution to the optimization problem, (2) differentiate a relaxation of the objective value of the graph optimization problem in terms of that solution, and then (3) round to a discrete solution at test time. 
We give a generic means of accomplishing these three steps for two broad classes of problems: those that involve partitioning the graph into K disjoint components, and those that involve selecting a subset of K nodes.\n\nPartitioning: (1) We can naturally interpret the cluster assignments r as a soft partitioning of the graph. (2) One generic continuous objective function (defined on soft partitions) follows from the random process of assigning each node j to a partition with probabilities given by r_j, repeating this process independently across all nodes. This gives the expected training decision loss \u2113 = E_{r_hard \u223c r}[f(r_hard, Atrain)], where r_hard \u223c r denotes this random assignment. \u2113 is now differentiable in terms of r, and can be computed in closed form via standard autodifferentiation tools for many problems of interest (see Section 5). We remark that when the expectation is not available in closed form, our approach could still be applied by repeatedly sampling r_hard \u223c r and using a policy gradient estimator to compute the gradient of the resulting objective. (3) At test time, we simply apply a hard maximum to r to obtain each node\u2019s assignment.\n\nTable 1: Performance on the community detection task\n\nMethod | Learning + optimization (cora, cite., prot., adol, fb) | Optimization (cora, cite., prot., adol, fb)\nClusterNet | 0.54, 0.55, 0.29, 0.49, 0.30 | 0.72, 0.73, 0.52, 0.58, 0.76\nGCN-e2e | 0.16, 0.02, 0.13, 0.12, 0.13 | 0.19, 0.03, 0.16, 0.20, 0.23\nTrain-CNM | 0.20, 0.42, 0.09, 0.01, 0.14 | 0.08, 0.34, 0.05, 0.57, 0.77\nTrain-Newman | 0.09, 0.15, 0.15, 0.15, 0.08 | 0.20, 0.23, 0.29, 0.30, 0.55\nTrain-SC | 0.03, 0.02, 0.03, 0.23, 0.19 | 0.09, 0.05, 0.06, 0.49, 0.61\nGCN-2stage-CNM | 0.17, 0.21, 0.18, 0.28, 0.13 | -\nGCN-2stage-Newman | 0.00, 0.00, 0.00, 0.14, 0.02 | -\nGCN-2stage-SC | 0.14, 0.16, 0.04, 0.31, 0.25 | -\n\nTable 2: Performance on the facility location task.\n\nMethod | Learning + optimization (cora, cite., prot., adol, fb) | Optimization (cora, cite., prot., adol, fb)\nClusterNet | 10, 14, 6, 6, 4 | 9, 14, 6, 5, 3\nGCN-e2e | 12, 15, 8, 6, 5 | 11, 14, 7, 6, 5\nTrain-greedy | 14, 16, 8, 8, 6 | 9, 14, 7, 6, 5\nTrain-gonzalez | 12, 17, 8, 6, 6 | 10, 15, 7, 7, 3\nGCN-2Stage-greedy | 14, 17, 8, 7, 6 | -\nGCN-2Stage-gonzalez | 13, 17, 8, 6, 6 | -\n\nSubset selection: (1) Here, it is less obvious how to obtain a subset of K nodes from the cluster assignments. Our continuous solution will be a vector x, 0 \u2264 x \u2264 1, where ||x||_1 = K. Intuitively, x_j is the probability of including node j in the solution. Our approach obtains x_j by placing greater probability mass on nodes that are near the cluster centers. Specifically, each center \u00b5_i is endowed with one unit of probability mass, which it allocates to the points as a_ij = softmin(\u03b7 ||x \u2212 \u00b5_i||)_j. The total probability allocated to node j is b_j = \u03a3_{i=1}^K a_ij. Since we may have b_j > 1, we pass b through a sigmoid function to cap the entries at 1; specifically, we take x = 2(\u03c3(\u03b3 b) \u2212 0.5), where \u03b3 is a tunable parameter. If the resulting x exceeds the budget constraint (||x||_1 > K), we instead output K x / ||x||_1 to ensure a feasible solution.\n\n(2) We interpret this solution in terms of the objective similarly as above. Specifically, we consider the result of drawing a discrete solution x_hard \u223c x where every node j is included (i.e., set to 1) independently with probability x_j from the end of step (1). The training objective is then E_{x_hard \u223c x}[f(x_hard, Atrain)]. For many problems, this can again be computed and differentiated through in closed form (see Section 5).\n\n(3) At test time, we need a feasible discrete vector x; note that independently rounding the individual entries may produce a vector with more than K ones. 
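Before turning to rounding, the fractional construction in step (1) of subset selection can be sketched directly. This is a minimal NumPy sketch under our own assumptions: Euclidean distances, and the name `soft_subset` with its `eta` and `gamma` defaults are illustrative choices, not the released implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_subset(x_embed, mu, K, eta=5.0, gamma=2.0):
    """Turn K cluster centers into a fractional K-subset indicator over n nodes.

    Each center carries one unit of probability mass, allocated to nodes by a
    softmin over distances; per-node totals are squashed to [0, 1) and rescaled
    if the budget ||x||_1 <= K is exceeded.
    """
    d = np.linalg.norm(x_embed[:, None, :] - mu[None, :, :], axis=2)  # (n, K)
    logits = -eta * d
    a = np.exp(logits - logits.max(axis=0, keepdims=True))
    a /= a.sum(axis=0, keepdims=True)      # a[:, i]: center i's allocation over nodes
    b = a.sum(axis=1)                      # total mass b_j on each node (sums to K)
    x = 2.0 * (sigmoid(gamma * b) - 0.5)   # squash to [0, 1); b >= 0 here
    if x.sum() > K:                        # enforce the budget constraint
        x = K * x / x.sum()
    return x
```

The output is the fractional vector x that the expected-objective training loss and the test-time rounding step operate on.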
Here, we apply a fairly generic approach based on pipage rounding [1], a randomized rounding scheme which has been applied to many problems (particularly those with submodular objectives). Pipage rounding can be implemented to produce a random feasible solution in time O(n) [26]; in practice we round several times and take the solution with the best decision loss on the observed edges. While pipage rounding has theoretical guarantees only for specific classes of functions, we find it to work well even in other domains (e.g., facility location). However, more domain-specific rounding methods can be applied if available.\n\n5 Experimental results\n\nWe now show experiments on domains that combine link prediction with optimization.\n\nLearning problem: In link prediction, we observe a partial graph and aim to infer which unobserved edges are present. In each of the experiments, we hold out 60% of the edges in the graph, with 40% observed during training. We used a graph dataset which is not included in our results to set our method\u2019s hyperparameters, which were kept constant across datasets (see appendix for details). The learning task is to use the training edges to predict whether the remaining edges are present, after which we will solve an optimization problem on the predicted graph. The objective is to find a solution with high objective value measured on the entire graph, not just the training edges.\n\nOptimization problems: We consider two optimization tasks, one from each of the broad classes introduced above. First, community detection aims to partition the nodes of the graph into K distinct subgroups which are dense internally, but with few edges across groups. 
Formally, the objective is to find a partition maximizing the modularity [37], defined as\n\nQ(r) = (1/2m) \u03a3_{u,v \u2208 V} \u03a3_{k=1}^K [A_uv \u2212 (d_u d_v)/(2m)] r_uk r_vk.\n\nHere, d_v is the degree of node v, and r_vk is 1 if node v is assigned to community k and zero otherwise. This measures the number of edges within communities compared to the expected number if edges were placed randomly. Our clustering module has one cluster for each of the K communities. Defining B to be the modularity matrix with entries B_uv = A_uv \u2212 (d_u d_v)/(2m), our training objective (the expected value of a partition sampled according to r) is (1/2m) Tr[r^T Btrain r].\n\nSecond, minmax facility location, where the problem is to select a subset S of K nodes from the graph, minimizing the maximum distance from any node to a facility (selected node). Letting d(v, S) be the shortest path length from a vertex v to a set of vertices S, the objective is min_{|S| \u2264 K} max_{v \u2208 V} d(v, S). To obtain the training loss, we take two steps. First, we replace d(v, S) by E_{S \u223c x}[d(v, S)], where S \u223c x denotes drawing a set from the product distribution with marginals x. This can easily be calculated in closed form [26]. Second, we replace the min with a softmin.\n\nBaseline learning methods: We instantiate CLUSTERNET using a 2-layer GCN for node embeddings, followed by a clustering layer. We compare to three families of baselines. First, GCN-2stage, the two-stage approach which first trains a model for link prediction, and then inputs the predicted graph into an optimization algorithm. For link prediction, we use the GCN-based system of [39] (we also adopt their training procedure, including negative sampling and edge dropout). For the optimization algorithms, we use standard approaches for each domain, outlined below. 
Second, \u201ctrain\u201d, which runs each optimization algorithm only on the observed training subgraph (without attempting any link prediction). Third, GCN-e2e, an end-to-end approach which does not include explicit algorithm structure. We train a GCN-based network to directly predict the final decision variable (r or x) using the same training objectives as our own model. Empirically, we observed best performance with a 2-layer GCN. This baseline allows us to isolate the benefits of including algorithmic structure.\n\nBaseline optimization approaches: In each domain, we compare to expert-designed optimization algorithms found in the literature. In community detection, we compare to \u201cCNM\u201d [11], an agglomerative approach, \u201cNewman\u201d, an approach that recursively partitions the graph [36], and \u201cSC\u201d, which performs spectral clustering [46] on the modularity matrix. In facility location, we compare to \u201cgreedy\u201d, the common heuristic of iteratively selecting the point with greatest marginal improvement in objective value, and \u201cgonzalez\u201d [20], an algorithm which iteratively selects the node furthest from the current set. \u201cgonzalez\u201d attains the optimal 2-approximation for this problem (note that the minmax facility location objective is non-submodular, ruling out the usual (1 \u2212 1/e)-approximation).\n\nDatasets: We use several standard graph datasets: cora [40] (a citation network with 2,708 nodes), citeseer [40] (a citation network with 3,327 nodes), protein [14] (a protein interaction network with 3,133 nodes), adol [12] (an adolescent social network with 2,539 vertices), and fb [13, 32] (an online social network with 2,888 nodes). For facility location, we use the largest connected component of the graph (since otherwise distances may be infinite). Cora and citeseer have node features (based on a bag-of-words representation of the document), which were given to all GCN-based methods. For the other datasets, we generated unsupervised node2vec features [23] using the training edges.\n\nTable 3: Inductive results. \u201c%\u201d is the fraction of test instances for which a method attains top performance (including ties). \u201cFinetune\u201d methods are excluded from this in the \u201cNo finetune\u201d section.\n\nCommunity detection | synthetic Avg. (%) | pubmed Avg. (%)\nNo finetune:\nClusterNet | 0.57 (26/30) | 0.30 (7/8)\nGCN-e2e | 0.26 (0/30) | 0.01 (0/8)\nTrain-CNM | 0.14 (0/30) | 0.16 (1/8)\nTrain-Newman | 0.24 (0/30) | 0.17 (0/8)\nTrain-SC | 0.16 (0/30) | 0.04 (0/8)\n2Stage-CNM | 0.51 (0/30) | 0.24 (0/8)\n2Stage-Newman | 0.01 (0/30) | 0.01 (0/8)\n2Stage-SC | 0.52 (4/30) | 0.15 (0/8)\nClstrNet-1train | 0.55 (0/30) | 0.25 (0/8)\nFinetune:\nClstrNet-ft | 0.60 (20/30) | 0.40 (2/8)\nClstrNet-ft-only | 0.60 (10/30) | 0.42 (6/8)\n\nFacility location | synthetic Avg. (%) | pubmed Avg. (%)\nNo finetune:\nClusterNet | 7.90 (25/30) | 7.88 (3/8)\nGCN-e2e | 8.63 (11/30) | 8.62 (1/8)\nTrain-greedy | 14.00 (0/30) | 9.50 (1/8)\nTrain-gonzalez | 10.30 (2/30) | 9.38 (1/8)\n2Stage-greedy | 9.60 (3/30) | 10.00 (0/8)\n2Stage-gonz. | 10.00 (2/30) | 6.88 (5/8)\nClstrNet-1train | 7.93 (12/30) | 7.88 (2/8)\nFinetune:\nClstrNet-ft | 8.08 (12/30) | 8.01 (3/8)\nClstrNet-ft-only | 7.84 (16/30) | 7.76 (4/8)\n\n5.1 Results on single graphs\n\nWe start out with results for the combined link prediction and optimization problem. Table 1 shows the objective value obtained by each approach on the full graph for community detection, with Table 2 showing facility location. We focus first on the \u201cLearning + Optimization\u201d column which shows the combined link prediction/optimization task. We use K = 5 clusters; K = 10 is very similar and may be found in the appendix. CLUSTERNET outperforms the baselines in nearly all cases, often substantially. GCN-e2e learns to produce nontrivial solutions, often rivaling the other baseline methods. 
However, the explicit structure used by our approach, CLUSTERNET, results in much higher performance.

Interestingly, the two-stage approach sometimes performs worse than the train-only baseline, which optimizes based only on the training edges (without attempting to learn). This indicates that approaches which attempt to accurately reconstruct the graph can sometimes miss qualities which are important for optimization, and in the worst case may simply add noise that overwhelms the signal in the training edges. To confirm that the two-stage method learned to make meaningful predictions, we give AUC values for each dataset in the appendix. The average AUC is 0.7584, indicating that the two-stage model does learn to make nontrivial predictions. However, the small amount of training data (only 40% of edges are observed) prevents it from perfectly reconstructing the true graph. This drives home the point that decision-focused learning methods such as CLUSTERNET can offer substantial benefits when highly accurate predictions are out of reach even for sophisticated learning methods.

We next examine an optimization-only task where the entire graph is available as input (the "Optimization" column of Tables 1 and 2). This tests CLUSTERNET's ability to learn to solve combinatorial optimization problems, compared to expert-designed algorithms, even when there is no partial information or learning problem in play. We find that CLUSTERNET is highly competitive, meeting and frequently exceeding the baselines. It is particularly effective for community detection, where we observe large (>3x) improvements over the best baseline on some datasets.
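For the facility location domain, the training loss sketched in the setup — the expected distance $\mathbb{E}_{S\sim x}[d(v,S)]$ under the product distribution with marginals $x$, combined with a smooth max over nodes — can be illustrated as follows. The closed-form expectation follows the construction in [26]; the log-sum-exp smoothing and the penalty for the event that no facility is selected are assumptions of this sketch rather than details from the paper:

```python
import numpy as np

def expected_min_distance(dist_v, x, penalty):
    """Closed-form E_{S~x}[d(v, S)] when S is drawn from the product
    distribution with marginals x: the nearest selected facility is node j
    exactly when j is selected and every strictly closer node is not.
    dist_v[j] is the distance from v to candidate j; `penalty` is the cost
    charged when no facility at all is selected (our assumption here)."""
    order = np.argsort(dist_v)
    d, p = dist_v[order], x[order]
    # none_closer[j] = probability that no node closer than the j-th is selected
    none_closer = np.concatenate(([1.0], np.cumprod(1.0 - p)[:-1]))
    return np.sum(d * p * none_closer) + np.prod(1.0 - x) * penalty

def soft_facility_loss(dist, x, temp=10.0):
    """Smooth surrogate for max_v E[d(v, S)]: per-node expected distances
    combined by a temperature-controlled log-sum-exp (a smooth max; the
    choice of smoothing here is ours, not necessarily the paper's)."""
    penalty = dist.max()
    exp_d = np.array([expected_min_distance(dist[v], x, penalty)
                      for v in range(dist.shape[0])])
    return np.log(np.sum(np.exp(temp * exp_d))) / temp

# Tiny example: 3 nodes on a path, with hop-count shortest-path distances.
dist = np.array([[0., 1., 2.],
                 [1., 0., 1.],
                 [2., 1., 0.]])
x = np.array([0.0, 1.0, 0.0])  # deterministically open the middle node
print(soft_facility_loss(dist, x, temp=50.0))  # close to the true minimax value 1
```

As the temperature grows, the log-sum-exp upper bound tightens toward the hard maximum, at the cost of sharper (less informative) gradients.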
For facility location, our method always at least ties the baselines, and frequently improves on them. These experiments provide evidence that our approach, which is automatically specialized during training to optimize on a given graph, can rival and exceed hand-designed algorithms from the literature. The alternate learning approach, GCN-e2e, an end-to-end method that tries to learn to predict optimization solutions directly from the node features, at best ties the baselines and typically underperforms. This underscores the benefit of including algorithmic structure as part of the end-to-end architecture.

5.2 Generalizing across graphs

Next, we investigate whether our method can learn generalizable strategies for optimization: can we train the model on one set of graphs drawn from some distribution and then apply it to unseen graphs? We consider two graph distributions. First, a synthetic generator introduced by [48], which is based on the spatial preferential attachment model [7] (details in the appendix); we use 20 training graphs, 10 validation, and 30 test. Second, a dataset obtained by splitting the pubmed graph into 20 components using metis [27]; we fix 10 training graphs, 2 validation, and 8 test. At test time, only 40% of the edges in each graph are revealed, matching the "Learning + optimization" setup above.

Table 3 shows the results. To start, we do not conduct any fine-tuning on the test graphs, evaluating purely the generalizability of the learned representations. CLUSTERNET outperforms all baseline methods on all tasks, except for facility location on pubmed, where it places second. We conclude that the learned model successfully generalizes to completely unseen graphs.
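For concreteness, the "gonzalez" baseline referenced above (Gonzalez's farthest-point heuristic for minmax, i.e. k-center, facility location) admits a very short implementation. This sketch assumes a precomputed matrix of pairwise shortest-path distances and an arbitrary starting node:

```python
import numpy as np

def gonzalez(dist, k, first=0):
    """Gonzalez's farthest-point heuristic for minmax (k-center) facility
    location: starting from an arbitrary node, repeatedly add the node
    farthest from the current set. A 2-approximation on metric distances.
    dist is an n x n matrix of pairwise shortest-path distances."""
    centers = [first]
    # d_to_set[v] = distance from v to the closest selected center so far
    d_to_set = dist[first].copy()
    while len(centers) < k:
        v = int(np.argmax(d_to_set))        # farthest node from current set
        centers.append(v)
        d_to_set = np.minimum(d_to_set, dist[v])
    return centers, float(d_to_set.max())   # selected set and its objective

# Path graph on 5 nodes with hop-count distances; pick k = 2 centers.
n = 5
dist = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :]).astype(float)
centers, obj = gonzalez(dist, k=2)
print(centers, obj)  # [0, 4] 2.0
```

Keeping the running vector of distances to the current set makes each iteration O(n), so the whole heuristic runs in O(nk) once distances are available.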
We next investigate (in the "Finetune" section of Table 3) whether CLUSTERNET's performance can be further improved by fine-tuning on the 40% of observed edges for each test graph (treating each test graph as an instance of the link prediction problem from Section 5.1, but initializing with the parameters of the model learned over the training graphs). We see that CLUSTERNET's performance typically improves, indicating that fine-tuning can extract additional gains if extra training time is available.

Interestingly, fine-tuning alone (without using the training graphs at all) yields similar performance (the row "ClstrNet-ft-only"). While our earlier results show that CLUSTERNET can learn generalizable strategies, doing so may not be necessary when there is the opportunity to fine-tune. This allows a trade-off between quality and runtime: without fine-tuning, applying our method at test time requires just a single forward pass, which is extremely efficient. If additional computational cost at test time is acceptable, fine-tuning can be used to improve performance. Complete runtimes for all methods are shown in the appendix. CLUSTERNET's forward pass (i.e., no fine-tuning) requires at most 0.23 seconds on the largest network and is always faster than the baselines (on identical hardware). Fine-tuning takes longer, on par with the slowest baseline.

Lastly, we investigate why pretraining provides little to no improvement over fine-tuning alone. Essentially, we find that CLUSTERNET is extremely sample-efficient: using only a single training graph results in nearly as good performance as the full training set (and still better than all of the baselines), as seen in the "ClstrNet-1train" row of Table 3.
That is, CLUSTERNET is capable of learning optimization strategies that generalize with strong performance to completely unseen graphs after observing only a single training example. This underscores the benefits of including algorithmic structure as part of the architecture, which guides the model towards learning meaningful strategies.

6 Conclusion

When machine learning is used to inform decision-making, it is often necessary to incorporate the downstream optimization problem into training. Here, we proposed a new approach to this decision-focused learning problem: include a differentiable solver for a simple proxy to the true, difficult optimization problem, and learn a representation that maps the difficult problem to the simpler one. This representation is trained in an entirely automatic way, using solution quality on the true downstream problem as the loss function. We find that this "middle path" for including algorithmic structure in learning improves over both two-stage approaches, which separate learning and optimization entirely, and purely end-to-end approaches, which use learning to directly predict the optimal solution. Here, we instantiated this framework for a class of graph optimization problems. We hope that future work will explore such ideas for other families of problems, paving the way for flexible and efficient optimization-based structure in deep learning.

Acknowledgements

This work was supported by the Army Research Office (MURI W911NF1810208). Wilder is supported by an NSF Graduate Research Fellowship. Dilkina is supported partially by NSF award #1914522 and by the U.S. Department of Homeland Security under Grant Award No. 2015-ST-061-CIRC01.
The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S. Department of Homeland Security.

References

[1] Alexander A Ageev and Maxim I Sviridenko. Pipage rounding: A new method of constructing algorithms with proven performance guarantee. Journal of Combinatorial Optimization, 8(3):307–328, 2004.

[2] Nesreen K Ahmed, Ryan Rossi, John Boaz Lee, Theodore L Willke, Rong Zhou, Xiangnan Kong, and Hoda Eldardiry. Learning role-based graph embeddings. arXiv preprint arXiv:1802.02896, 2018.

[3] Brandon Amos and J. Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks. In ICML, 2017.

[4] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.

[5] Ashwin Bahulkar, Boleslaw K Szymanski, N Orkun Baycik, and Thomas C Sharkey. Community detection with edge augmentation in criminal networks. In 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2018.

[6] Othman El Balghiti, Adam N Elmachtoub, Paul Grigas, and Ambuj Tewari. Generalization bounds in the predict-then-optimize framework. arXiv preprint arXiv:1905.11488, 2019.

[7] Marc Barthélemy. Spatial networks. Physics Reports, 499(1-3):1–101, 2011.

[8] Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940, 2016.

[9] Giulia Berlusconi, Francesco Calderoni, Nicola Parolini, Marco Verani, and Carlo Piccardi. Link prediction in criminal networks: A tool for criminal intelligence analysis.
PLoS ONE, 11(4):e0154244, 2016.

[10] Matthew Burgess, Eytan Adar, and Michael Cafarella. Link-prediction enhanced consensus clustering for complex networks. PLoS ONE, 11(5):e0153384, 2016.

[11] Aaron Clauset, Mark EJ Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.

[12] Koblenz Network Collection. Adolescent health. http://konect.uni-koblenz.de/networks/moreno_health, 2017.

[13] Koblenz Network Collection. Facebook (NIPS). http://konect.uni-koblenz.de/networks/ego-facebook, 2017.

[14] Koblenz Network Collection. Human protein (Vidal). http://konect.uni-koblenz.de/networks/maayan-vidal, 2017.

[15] Emir Demirovic, Peter J Stuckey, James Bailey, Jeffrey Chan, Chris Leckie, Kotagiri Ramamohanarao, and Tias Guns. Prediction + optimisation for the knapsack problem. In CPAIOR, 2019.

[16] Josip Djolonga and Andreas Krause. Differentiable learning of submodular models. In NeurIPS, 2017.

[17] Justin Domke. Generic methods for optimization-based modeling. In Artificial Intelligence and Statistics, pages 318–326, 2012.

[18] Priya Donti, Brandon Amos, and J Zico Kolter. Task-based end-to-end model learning in stochastic optimization. In Advances in Neural Information Processing Systems, pages 5484–5494, 2017.

[19] Adam N Elmachtoub and Paul Grigas. Smart "predict, then optimize". arXiv preprint arXiv:1710.08005, 2017.

[20] Teofilo F Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293–306, 1985.

[21] Klaus Greff, Antti Rasmus, Mathias Berglund, Tele Hao, Harri Valpola, and Jürgen Schmidhuber. Tagger: Deep unsupervised perceptual grouping. In NeurIPS, 2016.

[22] Klaus Greff, Sjoerd van Steenkiste, and Jürgen Schmidhuber. Neural expectation maximization. In NeurIPS, 2017.

[23] Aditya Grover and Jure Leskovec.
node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864. ACM, 2016.

[24] Xifeng Guo, Long Gao, Xinwang Liu, and Jianping Yin. Improved deep embedded clustering with local structure preservation. In IJCAI, 2017.

[25] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NIPS, 2017.

[26] Mohammad Karimi, Mario Lucic, Hamed Hassani, and Andreas Krause. Stochastic submodular maximization: The case of coverage functions. In Advances in Neural Information Processing Systems, 2017.

[27] George Karypis and Vipin Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392, 1998.

[28] Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In NIPS, 2017.

[29] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.

[30] Wouter Kool, Herke van Hoof, and Max Welling. Attention, learn to solve routing problems! In ICLR, 2019.

[31] Marc T Law, Raquel Urtasun, and Richard S Zemel. Deep spectral clustering learning. In ICML, 2017.

[32] Jure Leskovec and Julian J Mcauley. Learning to discover social circles in ego networks. In Advances in Neural Information Processing Systems, pages 539–547, 2012.

[33] David JC MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.

[34] Arthur Mensch and Mathieu Blondel. Differentiable dynamic programming for structured prediction and attention. In ICML, 2018.

[35] Azade Nazi, Will Hang, Anna Goldie, Sujith Ravi, and Azalia Mirhoseini. GAP: Generalizable approximate graph partitioning framework.
arXiv preprint arXiv:1903.00614, 2019.

[36] Mark EJ Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.

[37] Mark EJ Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577–8582, 2006.

[38] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710. ACM, 2014.

[39] M. Schlichtkrull, T. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, 2018.

[40] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93–93, 2008.

[41] Uri Shaham, Kelly Stanton, Henry Li, Boaz Nadler, Ronen Basri, and Yuval Kluger. SpectralNet: Spectral clustering using deep neural networks. In ICLR, 2018.

[42] Suo-Yi Tan, Jun Wu, Linyuan Lü, Meng-Jun Li, and Xin Lu. Efficient network disintegration under incomplete information: the comic effect of link prediction. Scientific Reports, 6:22916, 2016.

[43] Fei Tian, Bin Gao, Qing Cui, Enhong Chen, and Tie-Yan Liu. Learning deep representations for graph clustering. In Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.

[44] Michalis Titsias. One-vs-each approximation to softmax for scalable estimation of probabilities. In NeurIPS, 2016.

[45] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In NIPS, 2015.

[46] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

[47] Bryan Wilder, Bistra Dilkina, and Milind Tambe.
Melding the data-decisions pipeline: Decision-focused learning for combinatorial optimization. In AAAI, 2019.

[48] Bryan Wilder, Han Ching Ou, Kayla de la Haye, and Milind Tambe. Optimizing network structure for preventative health. In AAMAS, 2018.

[49] Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning, pages 478–487, 2016.

[50] Bowen Yan and Steve Gregory. Detecting community structure in networks using edge prediction methods. Journal of Statistical Mechanics: Theory and Experiment, 2012(09):P09008, 2012.

[51] Liang Yang, Xiaochun Cao, Dongxiao He, Chuan Wang, Xiao Wang, and Weixiong Zhang. Modularity based community detection with deep learning. In IJCAI, volume 16, pages 2252–2258, 2016.

[52] Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems, pages 4800–4810, 2018.

[53] Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. In NIPS, 2018.

[54] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1529–1537, 2015.