{"title": "Exact Combinatorial Optimization with Graph Convolutional Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 15580, "page_last": 15592, "abstract": "Combinatorial optimization problems are typically tackled by the branch-and-bound paradigm. We propose a new graph convolutional neural network model for learning branch-and-bound variable selection policies, which leverages the natural variable-constraint bipartite graph representation of mixed-integer linear programs. We train our model via imitation learning from the strong branching expert rule, and demonstrate on a series of hard problems that our approach produces policies that improve upon state-of-the-art machine-learning methods for branching and generalize to instances significantly larger than seen during training. Moreover, we improve for the first time over expert-designed branching rules implemented in a state-of-the-art solver on large problems. Code for reproducing all the experiments can be found at https://github.com/ds4dm/learn2branch.", "full_text": "Exact Combinatorial Optimization\n\nwith Graph Convolutional Neural Networks\n\nMaxime Gasse\n\nMila, Polytechnique Montr\u00e9al\nmaxime.gasse@polymtl.ca\n\nDidier Ch\u00e9telat\n\nPolytechnique Montr\u00e9al\n\ndidier.chetelat@polymtl.ca\n\nNicola Ferroni\n\nUniversity of Bologna\n\nn.ferroni@specialvideo.it\n\nLaurent Charlin\nMila, HEC Montr\u00e9al\n\nlaurent.charlin@hec.ca\n\nAndrea Lodi\n\nMila, Polytechnique Montr\u00e9al\nandrea.lodi@polymtl.ca\n\nAbstract\n\nCombinatorial optimization problems are typically tackled by the branch-and-\nbound paradigm. 
We propose a new graph convolutional neural network model for learning branch-and-bound variable selection policies, which leverages the natural variable-constraint bipartite graph representation of mixed-integer linear programs. We train our model via imitation learning from the strong branching expert rule, and demonstrate on a series of hard problems that our approach produces policies that improve upon state-of-the-art machine-learning methods for branching and generalize to instances significantly larger than seen during training. Moreover, we improve for the first time over expert-designed branching rules implemented in a state-of-the-art solver on large problems. Code for reproducing all the experiments can be found at https://github.com/ds4dm/learn2branch.

1 Introduction

Combinatorial optimization aims to find optimal configurations in discrete spaces where exhaustive enumeration is intractable. It has applications in fields as diverse as electronics, transportation, management, retail, and manufacturing [42], but also in machine learning, such as in structured prediction and maximum a posteriori inference [51; 34; 49]. Such problems can be extremely difficult to solve, and in fact most classical NP-hard computer science problems are examples of combinatorial optimization. Nonetheless, there exists a broad range of exact combinatorial optimization algorithms, which are guaranteed to find an optimal solution despite a worst-case exponential time complexity [52]. An important property of such algorithms is that, when interrupted before termination, they can usually provide an intermediate solution along with an optimality bound, which can be valuable information in theory and in practice. For example, after one hour of computation, an exact algorithm may give the guarantee that the best solution found so far lies within 2% of the optimum, even without knowing what the actual optimum is. 
This quality makes exact methods appealing and practical, and\nas such they constitute the core of modern commercial solvers.\nIn practice, most combinatorial optimization problems can be formulated as mixed-integer linear\nprograms (MILPs), in which case branch-and-bound (B&B) [35] is the exact method of choice.\nBranch-and-bound recursively partitions the solution space into a search tree, and computes relaxation\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fbounds along the way to prune subtrees that provably cannot contain an optimal solution. This\niterative process requires sequential decision-making, such as node selection: selecting the next node\nto evaluate, and variable selection: selecting the variable by which to partition the node\u2019s search\nspace [41]. This decision process traditionally follows a series of hard-coded heuristics, carefully\ndesigned by experts to minimize the average solving time on a representative set of MILP instances\n[21]. However, in many contexts it is common to repeatedly solve similar combinatorial optimization\nproblems, e.g., day-to-day production planning and lot-sizing problems [44], which may signi\ufb01cantly\ndiffer from the set of instances on which B&B algorithms are typically evaluated. It is then appealing\nto use statistical learning for tuning B&B algorithms automatically for a desired class of problems.\nHowever, this line of work raises two challenges. First, it is not obvious how to encode the state of a\nMILP B&B decision process [4], especially since both search trees and integer linear programs can\nhave a variable structure and size. 
Second, it is not clear how to formulate a model architecture that\nleads to rules which can generalize, at least to similar instances but also ideally to instances larger\nthan seen during training.\nIn this work we propose to address the above challenges by using graph convolutional neural networks.\nMore precisely, we focus on variable selection, also known as the branching problem, which lies\nat the core of the B&B paradigm yet is still not well theoretically understood [41], and adopt an\nimitation learning strategy to learn a fast approximation of strong branching, a high-quality but\nexpensive branching rule. While such an idea is not new [30; 4; 24], we propose to address the\nlearning problem in a novel way, through two contributions. First, we propose to encode the branching\npolicies into a graph convolutional neural network (GCNN), which allows us to exploit the natural\nbipartite graph representation of MILP problems, thereby reducing the amount of manual feature\nengineering. Second, we approximate strong branching decisions by using behavioral cloning with\na cross-entropy loss, a less dif\ufb01cult task than predicting strong branching scores [4] or rankings\n[30; 24]. We evaluate our approach on four classes of NP-hard problems, namely set covering,\ncombinatorial auction, capacitated facility location and maximum independent set. We compare\nagainst the previously proposed approaches of Khalil et al. [30], Alvarez et al. [4] and Hansknecht\net al. [24], as well as against the default hybrid branching rule in SCIP [20], a modern open-source\nsolver. 
The results show that our choice of model, state encoding, and training procedure leads to\npolicies that can offer a substantial improvement over traditional branching rules, and generalize well\nto larger instances than those used in training.\nIn Section 2, we review the broader literature of works that use statistical learning for branching.\nIn Section 3, we formally introduce the B&B framework, and formulate the branching problem as\na Markov decision process. In Section 4, we present our state representation, model, and training\nprocedure for addressing the branching problem. Finally, we discuss experimental results in Section 5.\n\n2 Related work\n\nFirst steps towards statistical learning of branching rules in B&B were taken by Khalil et al. [30], who\nlearn a branching rule customized to a single instance during the B&B process, as well as Alvarez et al.\n[4] and Hansknecht et al. [24] who learn a branching rule of\ufb02ine on a collection of similar instances,\nin a fashion similar to us. In each case a branching policy is learned by imitation of the strong\nbranching expert, although with a differently formulated learning problem. Namely, Khalil et al. [30]\nand Hansknecht et al. [24] treat it as a ranking problem and learn a partial ordering of the candidates\nproduced by the expert, while Alvarez et al. [4] treat it as a regression problem and learn directly\nthe strong branching scores of the candidates. In contrast, we treat it as a classi\ufb01cation problem\nand simply learn from the expert decisions, which allows imitation from experts that don\u2019t rely on\nbranching scores or orderings. These works also differ from ours in three other key aspects. First, they\nrely on extensive feature engineering, which is reduced by our graph convolutional neural network\napproach. Second, they do not evaluate generalization ability to instances larger than seen during\ntraining, which we propose to do. 
Finally, in each case performance was evaluated on a simpli\ufb01ed\nsolver, whereas we compare, for the \ufb01rst time and favorably, against a full-\ufb02edged solver with primal\nheuristics, cuts and presolving activated. We compare against these approaches in Section 5.\nOther works have considered using graph convolutional neural networks in the context of approximate\ncombinatorial optimization, where the objective is to \ufb01nd good solutions quickly, without seeking\nany optimality guarantees. The \ufb01rst work of this nature was by Khalil et al. [31], who proposed a\nGCNN model for learning greedy heuristics on several collections of combinatorial optimization\n\n2\n\n\fproblems de\ufb01ned on graphs. This was followed by Selsam et al. [47], who proposed a recurrent\nGCNN model, NeuroSAT, which can be interpreted as an approximate SAT solver when trained to\npredict satis\ufb01ability. Such works provide additional evidence that GCNNs can effectively capture\nstructural characteristics of combinatorial optimization problems.\nOther works consider using machine learning to improve variable selection in branch-and-bound,\nwithout directly learning a branching policy. Di Liberto et al. [15] learn a clustering-based classi\ufb01er\nto pick a variable selection rule at every branching decision up to a certain depth, while Balcan et al.\n[8] use the fact that many variable selection rules in B&B explicitly score the candidate variables,\nand propose to learn a weighting of different existing scores to combine their strengths. Other works\nlearn variable selection policies, but for algorithms less general than B&B. Liang et al. [39] learn a\nvariable selection policy for SAT solvers using a bandit approach, and Lederman et al. 
[36] extend their work by taking a reinforcement learning approach with graph convolutional neural networks. Unlike our approach, these works are restricted to conflict-driven clause learning methods in SAT solvers, and cannot be readily extended to B&B methods for arbitrary mixed-integer linear programs. In the same vein, Balunovic et al. [9] learn by imitation learning a variable selection procedure for SMT solvers that exploits specific aspects of this type of solver.

Finally, researchers have also focused on learning aspects of B&B algorithms other than variable selection. He et al. [25] learn a node selection heuristic by imitation learning of the oracle procedure that expands the node whose feasible set contains the optimal solution, while Song et al. [48] learn node selection and pruning heuristics by imitation learning of shortest paths to good feasible solutions, and Khalil et al. [32] learn primal heuristics for B&B algorithms. Those approaches are complementary to our work, and could in principle be combined to further improve solver performance. More generally, many authors have proposed machine learning approaches to fine-tune exact optimization algorithms, not necessarily for MILPs. A recent survey is provided by Bengio et al. [10].

3 Background

3.1 Problem definition

A mixed-integer linear program is an optimization problem of the form

arg min_x { c⊤x | Ax ≤ b, l ≤ x ≤ u, x ∈ Z^p × R^(n−p) },    (1)

where c ∈ R^n is called the objective coefficient vector, A ∈ R^(m×n) the constraint coefficient matrix, b ∈ R^m the constraint right-hand-side vector, l, u ∈ R^n respectively the lower and upper variable bound vectors, and p ≤ n the number of integer variables. 
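To make the notation in (1) concrete, here is a small, purely illustrative sketch (the function name and conventions are ours, not the paper's) that checks whether a point lies in the MILP feasible set:

```python
def is_milp_feasible(x, A, b, l, u, p, tol=1e-9):
    """Check membership in the feasible set of Eq. (1):
    Ax <= b, l <= x <= u, and the first p components integral.
    A is given as a list of rows; tolerances handle floating point."""
    linear_ok = all(sum(aij * xj for aij, xj in zip(row, x)) <= bi + tol
                    for row, bi in zip(A, b))
    bounds_ok = all(li - tol <= xi <= ui + tol for xi, li, ui in zip(x, l, u))
    integral_ok = all(abs(xi - round(xi)) <= tol for xi in x[:p])
    return linear_ok and bounds_ok and integral_ok
```

For instance, with A = [[1, 1]], b = [3], bounds [0, 2] on both variables and p = 1, the point (1.0, 1.5) is feasible while (1.5, 1.0) is not, since its first (integer) component is fractional.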
Under this representation, the size of a MILP is typically measured by the number of rows (m) and columns (n) of the constraint matrix. By relaxing the integrality constraint, one obtains a continuous linear program (LP) whose solution provides a lower bound to (1), and can be solved efficiently using, for example, the simplex algorithm. If a solution to the LP relaxation respects the original integrality constraint, then it is also a solution to (1). If not, then one may decompose the LP relaxation into two sub-problems, by splitting the feasible region according to a variable that does not respect integrality in the current LP solution x⋆,

∃i ≤ p | x⋆i ∉ Z,    xi ≤ ⌊x⋆i⌋ ∨ xi ≥ ⌈x⋆i⌉,    (2)

where ⌊·⌋ and ⌈·⌉ respectively denote the floor and ceiling functions. In practice, the two sub-problems will only differ from the parent LP in the variable bounds for xi, which get updated to ui = ⌊x⋆i⌋ in the left child and li = ⌈x⋆i⌉ in the right child.

The branch-and-bound algorithm [52, Ch. II.4], in its simplest formulation, repeatedly performs this binary decomposition, giving rise to a search tree. By design, the best LP solution in the leaf nodes of the tree provides a lower bound to the original MILP, whereas the best integral LP solution (if any) provides an upper bound. The solving process stops whenever both the upper and lower bounds are equal or when the feasible regions do not decompose anymore, thereby providing a certificate of optimality or infeasibility, respectively.

3.2 Branching rules

A key step in the B&B algorithm is selecting a fractional variable to branch on in (2), which can have a very significant impact on the size of the resulting search tree [2]. 
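The branching step in (2) only touches the bound vectors of the two children, which the following minimal sketch makes explicit (names are illustrative, not from the authors' code):

```python
import math

def fractional_candidates(x_star, p, tol=1e-6):
    """Indices i < p whose LP value violates integrality, as in Eq. (2)."""
    return [i for i in range(p) if abs(x_star[i] - round(x_star[i])) > tol]

def branch(l, u, x_star, i):
    """Split on variable i: the left child enforces x_i <= floor(x*_i),
    the right child enforces x_i >= ceil(x*_i); all other bounds are
    inherited unchanged from the parent node."""
    left_u, right_l = list(u), list(l)
    left_u[i] = math.floor(x_star[i])
    right_l[i] = math.ceil(x_star[i])
    return (list(l), left_u), (right_l, list(u))
```

For example, branching on x0 = 1.5 with bounds [0, 5] yields a left child with upper bound 1 and a right child with lower bound 2 on that variable.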
As such, branching rules are at the core of modern combinatorial optimization solvers, and have been the focus of extensive research [40; 43; 1; 17]. So far, the branching strategy consistently resulting in the smallest B&B trees is strong branching [5]. It does so by computing the expected bound improvement for each candidate variable before branching, which unfortunately requires the solution of two LPs for every candidate. In practice, running strong branching at every node is prohibitive, and modern B&B solvers instead rely on hybrid branching [3; 1], which computes strong branching scores only at the beginning of the solving process and gradually switches to simpler heuristics such as the conflict score (in the original article), the pseudo-cost [43], or a hand-crafted combination of the two. For a more extensive discussion of B&B branching strategies in MILP, the reader is referred to Achterberg et al. [3].

[Figure 1: B&B variable selection as a Markov decision process. On the left, a state st comprising the branch-and-bound tree, with a leaf node chosen by the solver to be expanded next (in pink). On the right, a new state st+1 resulting from branching on the variable at = x4.]

3.3 Markov decision process formulation

As remarked by He et al. [25], the sequential decisions made during B&B can be assimilated to a Markov decision process [26]. Consider the solver to be the environment, and the brancher the agent. 
At the t-th decision the solver is in a state st, which comprises the B&B tree with all past branching decisions, the best integer solution found so far, the LP solution of each node, the currently focused leaf node, as well as any other solver statistics (such as, for example, the number of times every primal heuristic has been called). The brancher then selects a variable at among all fractional variables A(st) ⊆ {1, . . . , p} at the currently focused node, according to a policy π(at | st). The solver in turn extends the B&B tree, solves the two child LP relaxations, runs any internal heuristic, prunes the tree if warranted, and finally selects the next leaf node to split. We are then in a new state st+1, and the brancher is called again to take the next branching decision. This process, illustrated in Figure 1, continues until the instance is solved, i.e., until there are no leaf nodes left for branching.

As a Markov decision process, B&B is episodic, where each episode amounts to solving a MILP instance. Initial states correspond to an instance being sampled from a group of interest, while final states mark the end of the optimization process. The probability of a trajectory τ = (s0, . . . , sT) ∈ T then depends on both the branching policy π and the remaining components of the solver,

pπ(τ) = p(s0) ∏_{t=0}^{T−1} ∑_{a∈A(st)} π(a | st) p(st+1 | st, a).

A natural approach to find good branching policies is reinforcement learning, with a carefully designed reward function. However, this raises several key issues, which we circumvent by adopting an imitation learning scheme, as discussed next.

4 Methodology

We now describe our approach for tackling the B&B variable selection problem in MILPs, where we use imitation learning and a dedicated graph convolutional neural network model. 
As the B&B variable selection problem can be formulated as a Markov decision process, a natural way of training a policy would be reinforcement learning [50]. However, this approach runs into many issues. Notably, as episode length is proportional to performance, and randomly initialized policies perform poorly, standard reinforcement learning algorithms are usually so slow early in training as to make total training time prohibitively long. Moreover, once the initial state corresponding to an instance is selected, the rest of the process is instance-specific, and so the Markov decision processes tend to be extremely large. In this work we choose instead to learn directly from an expert branching rule, an approach usually referred to as imitation learning [27].

[Figure 2: Left: our bipartite state representation st = (G, C, E, V) with n = 3 variables and m = 2 constraints. Right: our bipartite GCNN architecture for parametrizing our policy πθ(a | st).]

4.1 Imitation learning

We train by behavioral cloning [45] using the strong branching rule, which suffers a high computational cost but usually produces the smallest B&B trees, as mentioned in Section 3.2. 
We first run the expert on a collection of training instances of interest, record a dataset of expert state-action pairs D = {(si, a⋆i)}_{i=1}^{N}, and then learn our policy by minimizing the cross-entropy loss

L(θ) = −(1/N) ∑_{(s,a⋆)∈D} log πθ(a⋆ | s).    (3)

4.2 State encoding

We encode the state st of the B&B process at time t as a bipartite graph with node and edge features (G, C, E, V), described in Figure 2 (left). On one side of the graph are nodes corresponding to the constraints in the MILP, one per row in the current node's LP relaxation, with C ∈ R^(m×c) their feature matrix. On the other side are nodes corresponding to the variables in the MILP, one per LP column, with V ∈ R^(n×d) their feature matrix. An edge (i, j) ∈ E connects a constraint node i and a variable node j if the latter is involved in the former, that is, if Aij ≠ 0, and E ∈ R^(m×n×e) represents the (sparse) tensor of edge features. Note that under mild restrictions on the B&B solver (namely, enabling cuts only at the root node), the graph structure is the same for all LPs in the B&B tree, which reduces the cost of feature extraction. The exact features attached to the graph are described in the supplementary materials. We note that this is really only a subset of the solver state, which technically turns the process into a partially-observable Markov decision process [6], but also that excellent variable selection policies such as strong branching do well despite likewise relying on only a subset of the solver state.

4.3 Policy parametrization

We parametrize our variable selection policy πθ(a | st) as a graph convolutional neural network [23; 46; 12]. 
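As a rough sketch of the bipartite encoding of Section 4.2, the graph can be assembled from the constraint matrix alone; the actual feature set is richer (see the supplementary materials), and using the coefficient Aij as the sole edge feature here is our simplifying assumption:

```python
def bipartite_graph(A, C_feats, V_feats):
    """Build a (C, E, V) encoding from a dense constraint matrix A (m x n):
    one node per row and per column, and one edge per nonzero coefficient,
    stored as a sparse list of (i, j, e_ij) triples. Illustrative only."""
    edges = [(i, j, aij) for i, row in enumerate(A)
             for j, aij in enumerate(row) if aij != 0]
    return C_feats, edges, V_feats
```

Because only nonzero coefficients produce edges, the cost of a graph convolution over this structure scales with the number of nonzeros, matching the sparsity argument made below.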
Such models, also known as message-passing neural networks [19], are extensions of convolutional neural networks from grid-structured data (as in images or sounds) to arbitrary graphs. They have been successfully applied to a variety of machine learning tasks with graph-structured inputs, such as prediction of molecular properties [16; 19], program verification [38], and document classification in citation networks [33]. Graph convolutions exhibit many properties which make them a natural choice for graph-structured data in general, and MILP problems in particular: 1) they are well-defined no matter the input graph size; 2) their computational complexity is directly related to the density of the graph, which makes them an ideal choice for processing typically sparse MILP problems; and 3) they are permutation-invariant, that is, they will always produce the same output no matter the order in which the nodes are presented.

Our model takes as input our bipartite state representation st = (G, C, V, E) and performs a single graph convolution, in the form of two interleaved half-convolutions. In detail, because of the bipartite structure of the input graph, our graph convolution can be broken down into two successive passes, one from variables to constraints and one from constraints to variables. These passes take the form

ci ← fC(ci, ∑_{j:(i,j)∈E} gC(ci, vj, ei,j)),    vj ← fV(vj, ∑_{i:(i,j)∈E} gV(ci, vj, ei,j)),    (4)

for all i ∈ C, j ∈ V, where fC, fV, gC and gV are 2-layer perceptrons with ReLU activation functions. Following this graph-convolution layer, we obtain a bipartite graph with the same topology as the input, but with potentially different node features, so that each node now contains information from its neighbors. We obtain our policy by discarding the constraint nodes and applying a final 2-layer perceptron on the variable nodes, combined with a masked softmax activation to produce a probability distribution over the candidate branching variables (i.e., the non-fixed LP variables). The right side of Figure 2 provides an overview of our architecture.

Prenorm layers In the GCNN literature, it is common to normalize each convolution operation by the number of neighbours [33]. As noted by Xu et al. [53], this might result in a loss of expressiveness, as the model then becomes unable to perform a simple counting operation (e.g., in how many constraints a variable appears). Therefore we opt for un-normalized convolutions. However, this introduces a weight initialization issue. Indeed, weight initialization in standard CNNs relies on the number of input units to normalize the initial weights [22], which in a GCNN is unknown beforehand and depends on the dataset. To overcome this issue and stabilize the learning procedure, we adopt a simple affine transformation x ← (x − β)/σ, which we call a prenorm layer, applied right after the summation in (4). The β and σ parameters are initialized with respectively the empirical mean and standard deviation of x on the training dataset, and fixed once and for all before the actual training. Adopting both un-normalized convolutions and this pre-training procedure improves our generalization performance on larger problems, as will be shown in Section 5.

5 Experiments

We now present a comparative experiment against three competing machine learning approaches and SCIP's default branching rule to assess the value of our approach, as well as an ablation study to validate our architectural choices. 
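Before turning to the experiments, the variable-to-constraint half of (4), together with the prenorm layer, can be sketched as follows. The hidden size, random weights, and feature dimensions here are our own toy choices (the paper uses 64-dimensional embeddings), not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp2(d_in, d_out):
    """A 2-layer perceptron with relu; random weights, illustration only."""
    W1 = 0.1 * rng.standard_normal((d_in, d_out))
    W2 = 0.1 * rng.standard_normal((d_out, d_out))
    return lambda x: np.maximum(x @ W1, 0.0) @ W2

H = 4                     # hidden size (64 in the paper)
g_C = mlp2(2 + 2 + 1, H)  # message MLP: (c_i, v_j, e_ij) -> R^H
f_C = mlp2(2 + H, H)      # update MLP:  (c_i, summed message) -> R^H

def constraint_pass(C, V, edges, beta=0.0, sigma=1.0):
    """Variable-to-constraint half-convolution of Eq. (4): an un-normalized
    sum of messages over each constraint's neighborhood, followed by the
    prenorm layer (x - beta) / sigma, then the update MLP f_C.
    C: m x 2 constraint features, V: n x 2 variable features,
    edges: list of (i, j, e_ij) triples; all sizes are assumptions."""
    msg = np.zeros((len(C), H))
    for i, j, e in edges:
        msg[i] += g_C(np.concatenate([C[i], V[j], [e]]))
    msg = (msg - beta) / sigma
    return np.stack([f_C(np.concatenate([C[i], msg[i]])) for i in range(len(C))])
```

The constraint-to-variable pass is symmetric, and in this sketch β and σ would be set to the empirical mean and standard deviation of the summed messages over the training set, then frozen.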
Code for reproducing all the experiments can be found at\nhttps://github.com/ds4dm/learn2branch.\n\n5.1 Setup\n\nBenchmarks We evaluate our approach on four NP-hard problem benchmarks. Our \ufb01rst benchmark\nis comprised of set covering instances generated following Balas and Ho [7], with 1,000 columns.\nWe train and test on instances with 500 rows, and we evaluate on instances with 500 (Easy), 1,000\n(Medium) and 2,000 (Hard) rows. Our second benchmark is comprised of combinatorial auction\ninstances, generated following the arbitrary relationships procedure of Leyton-Brown et al. [37,\nSection 4.3]. We train and test on instances with 100 items for 500 bids, and we evaluate on instances\nwith 100 items for 500 bids (Easy), 200 items for 1,000 bids (Medium) and 300 items for 1,500\nbids (Hard). Our third benchmark is comprised of capacitated facility location instances generated\nfollowing Cornuejols et al. [14], with 100 facilities. We train and test on instances with 100 customers,\nand we evaluate on instances with 100 (Easy), 200 (Medium) and 400 (Hard) customers. Finally,\nour fourth benchmark is comprised of maximum independent set instances on Erd\u02ddos-R\u00e9nyi random\ngraphs, generated following the procedure of Bergman et al. [11, 4.6.4] with af\ufb01nity set to 4. We train\nand test on instances with graphs of 500 nodes, and we evaluate on instances with 500 (Easy), 1000\n(Medium) and 1500 nodes (Hard). These four benchmarks were chosen because they are challenging\nfor state-of-the-art solvers, but also representative of the types of integer programming problems\nencountered in practice. In particular, set covering problems capture the quintessence of integer linear\nprogramming, since column generation formulations can be written for virtually any dif\ufb01cult discrete\noptimization problem. Throughout all experiments we use SCIP 6.0.1 as the backend solver, with a\ntime limit of 1 hour. Following Karzan et al. 
[29], Fischetti and Monaci [17] and Khalil et al. [30], we allow cutting plane generation at the root node only, and deactivate solver restarts. All other SCIP parameters are kept to default so as to make comparisons as fair and reproducible as possible.

Table 1: Imitation learning accuracy on the test sets (acc@1 / acc@5 / acc@10, in %).

model   | Set Covering                   | Combinatorial Auction          | Capacitated Facility Location  | Maximum Independent Set
TREES   | 51.8±0.3 / 80.5±0.1 / 91.4±0.2 | 52.9±0.3 / 84.3±0.1 / 94.1±0.1 | 63.0±0.4 / 97.3±0.1 / 99.9±0.0 | 30.9±0.4 / 47.4±0.3 / 54.6±0.3
SVMRANK | 57.6±0.2 / 84.7±0.1 / 94.0±0.1 | 57.2±0.2 / 86.9±0.2 / 95.4±0.1 | 67.8±0.1 / 98.1±0.1 / 99.9±0.0 | 48.0±0.6 / 69.3±0.2 / 78.1±0.2
LMART   | 57.4±0.2 / 84.5±0.1 / 93.8±0.1 | 57.3±0.3 / 86.9±0.2 / 95.3±0.1 | 68.0±0.2 / 98.0±0.0 / 99.9±0.0 | 48.9±0.3 / 68.9±0.4 / 77.0±0.5
GCNN    | 65.5±0.1 / 92.4±0.1 / 98.2±0.0 | 61.6±0.1 / 91.0±0.1 / 97.8±0.1 | 71.2±0.2 / 98.6±0.1 / 99.9±0.0 | 56.5±0.2 / 80.8±0.3 / 89.0±0.1

Table 2: Policy evaluation on separate instances in terms of solving time, number of wins (fastest method) over number of solved instances, and number of resulting B&B nodes (lower is better). For each problem, the models are trained on easy instances only. See Section 5.1 for definitions. Each cell reports Time, Wins, Nodes for the Easy / Medium / Hard settings.

Set Covering:
FSB     | 17.30±6.1% 0/100 17±13.7%   | 411.34±4.3% 0/90 171±6.4%     | 3600.00±0.0% 0/0 n/a
RPB     | 8.98±4.8% 0/100 54±20.8%    | 60.07±3.7% 0/100 1741±7.9%    | 1677.02±3.0% 4/65 47299±4.9%
TREES   | 9.28±4.9% 0/100 187±9.4%    | 92.47±5.9% 0/100 2187±7.9%    | 2869.21±3.2% 0/35 59013±9.3%
SVMRANK | 8.10±3.8% 1/100 165±8.2%    | 73.58±3.1% 0/100 1915±3.8%    | 2389.92±2.3% 0/47 42120±5.4%
LMART   | 7.19±4.2% 14/100 167±9.0%   | 59.98±3.9% 0/100 1925±4.9%    | 2165.96±2.0% 0/54 45319±3.4%
GCNN    | 6.59±3.1% 85/100 134±7.6%   | 42.48±2.7% 100/100 1450±3.3%  | 1489.91±3.3% 66/70 29981±4.9%

Combinatorial Auction:
FSB     | 4.11±12.1% 0/100 6±30.3%    | 86.90±12.9% 0/100 72±19.4%    | 1813.33±5.1% 0/68 400±7.5%
RPB     | 2.74±7.8% 0/100 10±32.1%    | 17.41±6.6% 0/100 689±21.2%    | 136.17±7.9% 13/100 5511±11.7%
TREES   | 2.47±7.3% 0/100 86±15.9%    | 23.70±11.2% 0/100 976±14.4%   | 451.39±14.6% 0/95 10290±16.2%
SVMRANK | 2.31±6.8% 0/100 77±15.0%    | 23.10±9.8% 0/100 867±13.4%    | 364.48±7.7% 0/98 6329±7.7%
LMART   | 1.79±6.0% 75/100 77±14.9%   | 14.42±9.5% 1/100 873±14.3%    | 222.54±8.6% 0/100 7006±6.9%
GCNN    | 1.85±5.0% 25/100 70±12.0%   | 10.29±7.1% 99/100 657±12.2%   | 114.16±10.3% 87/100 5169±14.9%

Capacitated Facility Location:
FSB     | 30.36±19.6% 4/100 14±34.5%  | 214.25±15.2% 1/100 76±15.8%   | 742.91±9.1% 15/90 55±7.2%
RPB     | 26.55±16.2% 9/100 22±31.9%  | 156.12±11.5% 8/100 142±20.6%  | 631.50±8.1% 14/96 110±15.5%
TREES   | 28.96±14.7% 3/100 135±20.0% | 159.86±15.3% 3/100 401±11.6%  | 671.01±11.1% 1/95 381±11.1%
SVMRANK | 23.58±14.1% 11/100 117±20.5% | 130.86±13.6% 13/100 348±11.4% | 586.13±10.0% 21/95 321±8.8%
LMART   | 23.34±13.6% 16/100 117±20.7% | 128.48±15.4% 23/100 349±12.9% | 582.38±10.5% 15/95 314±7.0%
GCNN    | 22.10±15.8% 57/100 107±21.4% | 120.94±14.2% 52/100 339±11.8% | 563.36±10.7% 30/95 338±10.9%

Maximum Independent Set:
FSB     | 23.58±29.9% 9/100 7±35.9%   | 1503.55±20.9% 0/74 38±28.2%   | 3600.00±0.0% 0/0 n/a
RPB     | 8.77±11.8% 7/100 20±36.1%   | 110.99±24.4% 41/100 729±37.3% | 2045.61±18.3% 22/42 2675±24.0%
TREES   | 10.75±22.1% 1/100 76±44.2%  | 1183.37±34.2% 1/47 4664±45.8% | 3565.12±1.2% 0/3 38296±4.1%
SVMRANK | 8.83±14.9% 2/100 46±32.2%   | 242.91±29.3% 1/96 546±26.0%   | 2902.94±9.6% 1/18 6256±15.1%
LMART   | 7.31±12.7% 30/100 52±38.1%  | 219.22±36.0% 15/91 747±35.1%  | 3044.94±7.0% 0/12 8893±3.5%
GCNN    | 6.43±11.6% 51/100 43±40.2%  | 192.91±110.2% 42/82 1841±88.0% | 2024.37±30.6% 25/29 2997±26.3%

Baselines We compare against a human-designed state-of-the-art branching rule: reliability pseudocost (RPB), a variant of hybrid branching [1] which is used by default in SCIP. For completeness, we report as well the performance of full strong branching (FSB), our slow expert. We also compare against three machine learning branchers: the learning-to-score approach of Alvarez et al. [4] (TREES) based on an ExtraTrees [18] model, as well as the learning-to-rank approaches from Khalil et al. [30] (SVMRANK) and Hansknecht et al. [24] (LMART), based on an SVMrank [28] and a LambdaMART [13] model, respectively. 
The TREES model uses variable-wise features from our bipartite state, obtained by concatenating each variable's node features with statistics of the edge and constraint node features over its neighborhood. The SVMRANK and LMART models both use the original features proposed by Khalil et al. [30], which we re-implemented within SCIP. More training details for each machine learning method can be found in the supplementary materials.

Training We train the models on each benchmark separately. Namely, for each benchmark, we generate 100,000 branching samples extracted from 10,000 randomly generated instances for training, 20,000 branching samples from 2,000 instances for validation, and another 20,000 samples from 2,000 instances for test (see supplementary materials for details). We report in Table 1 the test accuracy of each machine learning model over five seeds, as the percentage of times the highest ranked decision of the model (acc@1), one of the five highest (acc@5) or one of the ten highest (acc@10) is a variable with the highest strong branching score.

Evaluation Evaluation is performed for each problem difficulty (Easy, Medium, Hard) on 20 new instances using five different seeds¹, which amounts to a total of 100 solving attempts per method. We report standard metrics for MILP benchmarking², that is: the 1-shifted geometric mean of the solving times in seconds, including running times of unsolved instances without extra penalization (Time); the hardware-independent final node counts of instances that are solved by all baselines (Nodes); and the number of times each branching policy results in the fastest solving time, over the number of instances solved (Wins). Policy evaluation results are displayed in Table 2.
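For concreteness, the two kinds of metrics used above, the imitation accuracy acc@k and the 1-shifted geometric mean used to aggregate solving times, can be sketched in a few lines. The function names and toy shapes below are ours, not taken from the paper's codebase:

```python
import numpy as np

def shifted_geometric_mean(values, shift=1.0):
    """Standard MILP benchmarking aggregate: the geometric mean of
    (value + shift), minus shift. shift=1 gives the 1-shifted mean."""
    values = np.asarray(values, dtype=float)
    return float(np.exp(np.mean(np.log(values + shift))) - shift)

def acc_at_k(pred_scores, sb_scores, k):
    """Fraction of branching samples where one of the model's k
    highest-ranked candidate variables attains the maximum strong
    branching (SB) score."""
    hits = 0
    for pred, sb in zip(pred_scores, sb_scores):
        topk = np.argsort(pred)[::-1][:k]        # indices of the k best predictions
        hits += np.max(sb[topk]) == np.max(sb)   # does one of them reach the best SB score?
    return hits / len(pred_scores)
```

The shift avoids degenerate behavior when some solving times are close to zero, while the geometric mean keeps a few very hard instances from dominating the aggregate.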
Note that we also report the average per-instance standard deviation, so "64 ± 13.6% nodes" means it took on average 64 nodes to solve an instance, and when solving one of those instances the number of nodes varied by 13.6% on average.

5.2 Comparative experiment

In terms of prediction accuracy (Table 1), GCNN clearly outperforms the baseline competitors on all four problems, while SVMRANK and LMART are on par with each other and the performance of TREES is the lowest.
In terms of solving performance (Table 2), the accuracy of each method is clearly reflected in the number of nodes required to solve the instances. Interestingly, the best method in terms of nodes is not necessarily the best in terms of total solving time, which also takes into account the computational cost of each branching policy, i.e., the feature extraction and inference time. The SVMRANK approach, despite being slightly better than LMART in terms of number of nodes, is also slower due to a worse trade-off between running time and number of nodes. Our GCNN model clearly dominates overall, except on combinatorial auction (Easy) and maximum independent set (Medium) instances, where LMART and RPB are respectively faster.
Our GCNN model generalizes well to instances of size larger than seen during training, and outperforms SCIP's default branching rule RPB in terms of running time in almost every configuration. In particular and strikingly, it significantly outperforms RPB in terms of nodes on medium and hard instances for the set covering and combinatorial auction problems. As expected, the FSB expert brancher is not competitive in terms of running time, despite producing very small search trees.
The maximum independent set problem seems particularly challenging for generalization, as all machine learning approaches report a lower number of solved instances than the default RPB brancher, and GCNN, despite being the fastest machine learning approach overall, exhibits a high variability both in terms of time and number of nodes.
For the first time in the literature, a machine-learning-based approach is compared with an essentially full-fledged MILP solver. For this reason, the results are particularly impressive: they indicate that GCNN is a very serious candidate for implementation within a MILP solver, as an additional tool to speed up the solving process. They also suggest that more could be gained from a tight integration within such a complex piece of software.

5.3 Ablation study

We present an ablation study of our proposed GCNN model on the set covering problem by comparing three variants of the convolution operation in (4): mean rather than sum convolutions (MEAN), sum convolutions without our prenorm layer (SUM) and finally sum convolutions with prenorm layers, which is the model we use throughout our experiments (GCNN).
Results on test instances are reported in Table 3. The solving performance of both variants MEAN and SUM is very similar to that of our baseline GCNN on small instances. On large instances however, the variants perform significantly worse in terms of both solving time and number of nodes, especially on hard instances.
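To make the difference between the two aggregation schemes concrete, here is a minimal numpy sketch of one variable-to-constraint message-passing step on a toy bipartite graph. The graph, features, and function names are ours for illustration; in particular, the normalization at the end only mimics the spirit of the prenorm layer (whose actual parameters are learned in the model of (4)) with fixed statistics:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bipartite graph: 3 constraints, 4 variables, edges as (constraint, variable).
edges = np.array([(0, 0), (0, 1), (1, 1), (1, 2), (2, 0), (2, 3)])
var_feats = rng.normal(size=(4, 8))  # hypothetical variable embeddings

def aggregate(var_feats, edges, n_cons, mode="sum"):
    """One message-passing step from variable nodes to constraint nodes."""
    out = np.zeros((n_cons, var_feats.shape[1]))
    np.add.at(out, edges[:, 0], var_feats[edges[:, 1]])   # scatter-add messages
    if mode == "mean":
        deg = np.bincount(edges[:, 0], minlength=n_cons).reshape(-1, 1)
        out = out / np.maximum(deg, 1)                    # normalize by node degree
    return out

sum_msg = aggregate(var_feats, edges, 3, "sum")
mean_msg = aggregate(var_feats, edges, 3, "mean")

# With sum aggregation, message magnitude grows with node degree, so on larger
# instances (higher-degree nodes) activations drift outside the range seen in
# training. An affine normalization of the summed messages (learned shift and
# scale in the actual prenorm layer; plain statistics here) keeps them stable:
beta, sigma = sum_msg.mean(), sum_msg.std()
prenorm_msg = (sum_msg - beta) / sigma
```

Mean aggregation sidesteps the magnitude issue but discards degree information; sum aggregation with a normalization layer keeps both, which is one reading of why SUM+prenorm generalizes best in the ablation.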
This empirical evidence supports our hypothesis that sum convolutions offer a better architectural prior than mean convolutions for the task of learning to branch, and that our prenorm layer helps stabilize training and improve generalization.

¹ In addition to re-training the ML models with a different seed, we use the seed parameter of the solver: all major MILP solvers have such a parameter, which randomizes some tie-breaking rules, so that aggregated results can be reported over the same instance.

² See e.g. http://plato.asu.edu/bench.html

Table 3: Ablation study of our GCNN model on the set covering problem. Sum convolutions generalize better to larger instances, especially when combined with a prenorm layer.

              Accuracies                         Easy                       Medium                       Hard
Model    acc@1      acc@5      acc@10    time     wins    nodes     time      wins    nodes     time        wins   nodes

MEAN   65.4 ±0.1  92.4 ±0.1  98.2 ±0.0  6.7 ±3%  13/100  134 ±6%  43.7 ±3%  19/100  1894 ±4%  1593.0 ±4%   6/70  62 227 ±6%
SUM    65.5 ±0.2  92.3 ±0.2  98.1 ±0.1  6.6 ±3%  27/100  134 ±6%  42.5 ±3%  45/100  1882 ±4%  1511.7 ±3%  22/70  57 864 ±4%
GCNN   65.5 ±0.1  92.4 ±0.1  98.2 ±0.0  6.6 ±3%  60/100  134 ±8%  42.5 ±3%  36/100  1870 ±3%  1489.9 ±3%  42/70  56 348 ±5%

6 Discussion

The objective of branch-and-bound is to solve combinatorial optimization problems as fast as possible. Branching policies must therefore balance the quality of the decisions taken with the time spent to take each decision. An extreme example of this tradeoff is strong branching: this policy takes excellent decisions leading to a low number of nodes overall, but every decision-making step is so slow that the overall running time is not competitive.
Early experiments showed that we could take better decisions and decrease the number of nodes slightly on average by training a GCNN policy with more layers or with a larger embedding size. However, this would also lead to increased computational costs for inference, slightly longer times at each decision, and in the end increased solving times on average. The policy architecture we propose is thus a compromise between learning capacity and inference speed, something that is not traditionally a concern within the machine learning community.
Another concern among the combinatorial optimization community is the ability of policies trained on small instances to generalize to larger instances. We were able to show that machine learning methods, and the GCNN model in particular, can generalize to considerably larger instances. However, in general it is expected that the improvement in performance decreases as our model is evaluated on progressively larger problems, as can already be observed from Table 2. In early experiments with even larger instances (huge), we observed a performance drop for the model trained on our small instances. This could presumably be remedied by training on larger instances in the first place, and indeed a model trained on medium instances did perform well on those huge instances. In any case, there are limits to the generalization ability of any learned branching policy, and since the limit is likely very dependent on the problem structure, it is difficult to give any precise quantitative estimates a priori. This desirable ability to generalize outside of the training distribution, sometimes termed transfer learning, is also not a traditional concern in the machine learning community.

7 Conclusion

We formulated branch-and-bound, the standard exact method for solving mixed-integer linear programs, as a Markov decision process.
In this context, we proposed and evaluated a novel approach for tackling the branching problem, by expressing the state of the branch-and-bound process as a bipartite graph, which reduces the need for feature engineering by naturally leveraging the variable-constraint structure of MILP problems, and allows for the encoding of branching policies as a graph convolutional neural network. We demonstrated on four NP-hard problems that, by adopting a simple imitation learning scheme, the policies learned by a GCNN outperform previously proposed machine learning approaches for branching, and can also outperform the default branching strategy of SCIP, a modern open-source solver. Most importantly, we demonstrated that the learned policies generalize to instance sizes larger than seen during training. This is essential since collecting strong branching decisions, hence training, can be computationally prohibitive on large instances. Our work indicates that the GCNN model, especially using sum convolutions with the proposed prenorm layer, is a good architectural prior for the task of branching in MILP.
In future work, we would like to assess the viability of our approach on a broader set of combinatorial problems, and also experiment with reinforcement learning methods for improving over the policies learned by imitation. Also, we believe that there is plenty of room for hybrid approaches combining traditional methods and machine learning for branching, and we would like to dig deeper into the learned policies in order to extract some knowledge of interest for the MILP community.

Acknowledgements

We would like to thank the anonymous reviewers whose contributions helped considerably improve the quality of this paper.
We would also like to thank Ambros Gleixner and Benjamin Müller for enlightening discussions and technical help regarding SCIP, as well as Gonzalo Muñoz, Aleksandr Kazachkov and Giulia Zarpellon for insightful discussions on variable selection. Finally, we thank Jason Jo, Meng Qu and Mike Pieper for their helpful comments on the structure of the paper.
This work was supported by the Canada First Research Excellence Fund (CFREF), IVADO, CIFAR, GERAD, and Canada Excellence Research Chairs (CERC).

References

[1] Tobias Achterberg and Timo Berthold. Hybrid branching. In Integration of AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems, 2009.
[2] Tobias Achterberg and Roland Wunderling. Mixed integer programming: Analyzing 12 years of progress. Facets of Combinatorial Optimization, pages 449–481, 2013.
[3] Tobias Achterberg, Thorsten Koch, and Alexander Martin. Branching rules revisited. Operations Research Letters, 33:42–54, 2005.
[4] Alejandro M. Alvarez, Quentin Louveaux, and Louis Wehenkel. A machine learning-based approximation of strong branching. INFORMS Journal on Computing, 29:185–195, 2017.
[5] David Applegate, Robert Bixby, Vašek Chvátal, and William Cook. Finding cuts in the TSP. Technical report, DIMACS, 1995.
[6] Karl J. Åström. Optimal control of Markov processes with incomplete state information. Journal of Mathematical Analysis and Applications, 10:174–205, 1965.
[7] Egon Balas and Andrew Ho. Set covering algorithms using cutting planes, heuristics, and subgradient optimization: a computational study. In Combinatorial Optimization, pages 37–60. Springer, 1980.
[8] Maria-Florina Balcan, Travis Dick, Tuomas Sandholm, and Ellen Vitercik. Learning to branch. In Proceedings of the International Conference on Machine Learning, 2018.
[9] Mislav Balunovic, Pavol Bielik, and Martin Vechev.
Learning to solve SMT formulas. In Advances in Neural Information Processing Systems 31, pages 10338–10349, 2018.
[10] Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. Machine learning for combinatorial optimization: a methodological tour d'horizon. arXiv:1811.06128, 2018.
[11] David Bergman, Andre A. Cire, Willem-Jan Van Hoeve, and John Hooker. Decision diagrams for optimization. Springer, 2016.
[12] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. In Proceedings of the Second International Conference on Learning Representations, 2014.
[13] Christopher J. C. Burges. From RankNet to LambdaRank to LambdaMART: An overview. Technical report, Microsoft Research, 2010.
[14] Gerard Cornuejols, Ramaswami Sridharan, and Jean Michel Thizy. A comparison of heuristics and relaxations for the capacitated plant location problem. European Journal of Operational Research, 50:280–297, 1991.
[15] Giovanni Di Liberto, Serdar Kadioglu, Kevin Leo, and Yuri Malitsky. DASH: Dynamic approach for switching heuristics. European Journal of Operational Research, 248:943–953, 2016.
[16] David K. Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems 28, pages 2224–2232, 2015.
[17] Matteo Fischetti and Michele Monaci. Branching on nonchimerical fractionalities. Operations Research Letters, 40:159–164, 2012.
[18] Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63:3–42, 2006.
[19] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry.
In Proceedings of the Thirty-Fourth International Conference on Machine Learning, pages 1263–1272, 2017.
[20] Ambros Gleixner, Michael Bastubbe, Leon Eifler, Tristan Gally, Gerald Gamrath, Robert Lion Gottwald, Gregor Hendel, Christopher Hojny, Thorsten Koch, Marco E. Lübbecke, Stephen J. Maher, Matthias Miltenberger, Benjamin Müller, Marc E. Pfetsch, Christian Puchert, Daniel Rehfeldt, Franziska Schlösser, Christoph Schubert, Felipe Serrano, Yuji Shinano, Jan Merlin Viernickel, Matthias Walter, Fabian Wegscheider, Jonas T. Witt, and Jakob Witzig. The SCIP Optimization Suite 6.0. ZIB report, Zuse Institute Berlin, July 2018.
[21] Ambros Gleixner, Gregor Hendel, Gerald Gamrath, Tobias Achterberg, Michael Bastubbe, Timo Berthold, Philipp Christophel, Kati Jarck, Thorsten Koch, Jeff Linderoth, Marco Lübbecke, Hans D. Mittelmann, Derya Ozyurt, Ted K. Ralphs, Domenico Salvagnin, and Yuji Shinano. MIPLIB 2017: Data-driven compilation of the 6th Mixed-Integer Programming Library. Technical report, Optimization Online, August 2019.
[22] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
[23] Marco Gori, Gabriele Monfardini, and Franco Scarselli. A new model for learning in graph domains. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, volume 2, pages 729–734, 2005.
[24] Christoph Hansknecht, Imke Joormann, and Sebastian Stiller. Cuts, primal heuristics, and learning to branch for the time-dependent traveling salesman problem. arXiv:1805.01415, 2018.
[25] He He, Hal Daumé III, and Jason Eisner. Learning to search in branch-and-bound algorithms. In Advances in Neural Information Processing Systems 27, pages 3293–3301, 2014.
[26] Ronald A. Howard.
Dynamic Programming and Markov Processes. MIT Press, Cambridge, MA, 1960.
[27] Ahmed Hussein, Mohamed M. Gaber, Eyad Elyan, and Chrisina Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys, 50:21, 2017.
[28] Thorsten Joachims. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 133–142, 2002.
[29] Fatma K. Karzan, George L. Nemhauser, and Martin W. P. Savelsbergh. Information-based branching schemes for binary linear mixed integer problems. Mathematical Programming Computation, 1:249–293, 2009.
[30] Elias B. Khalil, Pierre Le Bodic, Le Song, George Nemhauser, and Bistra Dilkina. Learning to branch in mixed integer programming. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 724–731, 2016.
[31] Elias B. Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems 30, pages 6348–6358, 2017.
[32] Elias B. Khalil, Bistra Dilkina, George L. Nemhauser, Shabbir Ahmed, and Yufen Shao. Learning to run heuristics in tree search. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, pages 659–666, 2017.
[33] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In Proceedings of the Fifth International Conference on Learning Representations, 2017.
[34] Alex Kulesza and Fernando Pereira. Structured learning with approximate inference. In Advances in Neural Information Processing Systems 21, pages 785–792, 2008.
[35] Ailsa H. Land and Alison G. Doig. An automatic method of solving discrete programming problems. Econometrica, 28:497–520, 1960.
[36] Gil Lederman, Markus N. Rabe, and Sanjit A. Seshia.
Learning heuristics for automated reasoning through deep reinforcement learning. arXiv:1807.08058, 2018.
[37] Kevin Leyton-Brown, Mark Pearson, and Yoav Shoham. Towards a universal test suite for combinatorial auction algorithms. In Proceedings of the Second ACM Conference on Electronic Commerce, pages 66–76, 2000.
[38] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel. Gated graph sequence neural networks. In Yoshua Bengio and Yann LeCun, editors, Proceedings of the Fourth International Conference on Learning Representations, 2016.
[39] Jia Hui Liang, Vijay Ganesh, Pascal Poupart, and Krzysztof Czarnecki. Learning rate based branching heuristic for SAT solvers. In International Conference on Theory and Applications of Satisfiability Testing, pages 123–140, 2016.
[40] Jeff T. Linderoth and Martin W. P. Savelsbergh. A computational study of search strategies for mixed integer programming. INFORMS Journal on Computing, 11:173–187, 1999.
[41] Andrea Lodi and Giulia Zarpellon. On learning and branching: a survey. TOP, 25:207–236, 2017.
[42] Vangelis T. Paschos, editor. Applications of Combinatorial Optimization. Mathematics and Statistics. Wiley-ISTE, second edition, 2014.
[43] Jagat Patel and John W. Chinneck. Active-constraint variable ordering for faster feasibility of mixed integer linear programs. Mathematical Programming, 110:445–474, 2007.
[44] Yves Pochet and Laurence A. Wolsey. Production planning by mixed integer programming. Springer Science and Business Media, 2006.
[45] Dean A. Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3:88–97, 1991.
[46] Franco Scarselli, Marco Gori, Ah C. Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Trans.
Neural Networks, 20(1):61–80, 2009.
[47] Daniel Selsam, Matthew Lamm, Benedikt Bünz, Percy Liang, Leonardo de Moura, and David L. Dill. Learning a SAT solver from single-bit supervision. In Proceedings of the Seventh International Conference on Learning Representations, 2019.
[48] Jialin Song, Ravi Lanka, Albert Zhao, Yisong Yue, and Masahiro Ono. Learning to search via retrospective imitation. arXiv:1804.00846, 2018.
[49] Vivek G. K. Srikumar and Dan Roth. On amortizing inference cost for structured prediction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1114–1124, 2012.
[50] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, second edition, 2018.
[51] Martin J. Wainwright, Tommi S. Jaakkola, and Alan S. Willsky. MAP estimation via agreement on trees: message-passing and linear programming. IEEE Transactions on Information Theory, 51:3697–3717, 2005.
[52] Laurence A. Wolsey. Integer Programming. Wiley-Blackwell, 1988.
[53] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In Proceedings of the Seventh International Conference on Learning Representations, 2019.