{"title": "Differentiable Learning of Submodular Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1013, "page_last": 1023, "abstract": "Can we incorporate discrete optimization algorithms within modern machine learning models? For example, is it possible to use in deep architectures a layer whose output is the minimal cut of a parametrized graph? Given that these models are trained end-to-end by leveraging gradient information, the introduction of such layers seems very challenging due to their non-continuous output. In this paper we focus on the problem of submodular minimization, for which we show that such layers are indeed possible. The key idea is that we can continuously relax the output without sacrificing guarantees. We provide an easily computable approximation to the Jacobian complemented with a complete theoretical analysis. Finally, these contributions let us experimentally learn probabilistic log-supermodular models via a bi-level variational inference formulation.", "full_text": "Differentiable Learning of Submodular Models

Josip Djolonga
Department of Computer Science
ETH Zurich
josipd@inf.ethz.ch

Andreas Krause
Department of Computer Science
ETH Zurich
krausea@ethz.ch

Abstract

Can we incorporate discrete optimization algorithms within modern machine learning models? For example, is it possible to incorporate in deep architectures a layer whose output is the minimal cut of a parametrized graph? Given that these models are trained end-to-end by leveraging gradient information, the introduction of such layers seems very challenging due to their non-continuous output. In this paper we focus on the problem of submodular minimization, for which we show that such layers are indeed possible. The key idea is that we can continuously relax the output without sacrificing guarantees. 
We provide an easily computable approximation\nto the Jacobian complemented with a complete theoretical analysis. Finally, these\ncontributions let us experimentally learn probabilistic log-supermodular models\nvia a bi-level variational inference formulation.\n\n1\n\nIntroduction\n\nDiscrete optimization problems are ubiquitous in machine learning. While the majority of them are\nprovably hard, a commonly exploitable trait that renders some of them tractable is that of submodular-\nity [1, 2]. In addition to capturing many useful phenomena, submodular functions can be minimized\nin polynomial time and also enjoy a powerful connection to convex optimization [3]. Both of these\nproperties have been used to great effect in both computer vision and machine learning, to e.g. com-\npute the MAP con\ufb01guration in undirected graphical models with long-reaching interactions [4]\nand higher-order factors [5], clustering [6], to perform variational inference in log-supermodular\nmodels [7, 8], or to design norms useful for structured sparsity problems [9, 10].\nDespite all the bene\ufb01ts of submodular functions, the question of how to learn them in a practical\nmanner remains open. Moreover, if we want to open the toolbox of submodular optimization to\nmodern practitioners, an intriguing question is how to to use them in conjunction with deep learning\nnetworks. For instance, we need to develop mechanisms that would enable them to be trained\ntogether in a fully end-to-end fashion. As a concrete example from the computer vision domain,\nconsider the problem of image segmentation. Namely, we are given as input an RGB representation\nx \u2208 Rn\u00d73 of an image captured by say a dashboard camera, and the goal is to identify the set of pixels\nA \u2286 {1, 2, . . . , n} that are occupied by pedestrians. 
While we could train a network θ : x → v ∈ Rn to output per-pixel scores, it would be helpful, especially in domains with limited data, to bias the predictions by encoding some prior beliefs about the expected output. For example, we might prefer segmentations that are spatially consistent. One common approach to encourage such configurations is to first define a graph G = (V, E) over the image by connecting neighbouring pixels, specify weights w over the edges, and then solve the following graph-cut problem

A*(w, v) = argmin_{A ⊆ V} F(A) = argmin_{A ⊆ V} Σ_{{i,j} ∈ E} w_{i,j} ⟦|A ∩ {i, j}| = 1⟧ + Σ_{i ∈ A} v_i,   (1)

where the indicator ⟦|A ∩ {i, j}| = 1⟧ equals one iff the predictions for pixels i and j disagree, and v_i is the score of pixel i. While this can be easily seen as a module computing the best configuration as a function of the edge weights and per-pixel scores, incorporating it as a layer in a deep network seems at first glance to be a futile task.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Even though the output is easily computable, it will be discontinuous and have no Jacobian, which is necessary for backpropagation. 
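For concreteness, the objective in eq. (1) can be evaluated, and for tiny graphs minimized by brute force, in a few lines. The following is our own illustrative sketch (the helper names cut_objective and best_cut, the toy graph and its weights are ours, not the paper's implementation):

```python
from itertools import combinations

def cut_objective(A, edges, w, v):
    """Eq. (1): weights of the edges cut by A plus the scores of the pixels in A."""
    A = set(A)
    cut = sum(w[e] for e in edges if len(A & set(e)) == 1)  # predictions disagree
    return cut + sum(v[i] for i in A)

def best_cut(n, edges, w, v):
    """Brute-force A*(w, v) over all 2^n subsets (only sensible for tiny n)."""
    subsets = (set(c) for r in range(n + 1) for c in combinations(range(n), r))
    return min(subsets, key=lambda A: cut_objective(A, edges, w, v))

# A two-pixel "image" joined by a single smoothness edge.
edges = [(0, 1)]
w = {(0, 1): 1.0}      # disagreement penalty
v = [-2.0, 0.5]        # per-pixel scores; negative favours selecting the pixel
A_star = best_cut(2, edges, w, v)
```

On this toy instance the minimizer is A* = {0, 1}: selecting both pixels collects the total score -1.5 while cutting no edge.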
However, as the above problem is an instance of submodular minimization, we can leverage its relationship to convexity and relax it to

y*(w, v) = argmin_{y ∈ Rn} f(y) + (1/2)‖y‖² = argmin_{y ∈ Rn} Σ_{{i,j} ∈ E} w_{i,j} |y_i − y_j| + vᵀy + (1/2)‖y‖².   (2)

In addition to having a continuous output, this relaxation has a very strong connection with the discrete problem, as the discrete optimizer can be obtained by thresholding y* as A* = {i ∈ V | y*_i > 0}. Moreover, as explained in §2, for every submodular function F there exists an easily computable convex function f so that this relationship holds. For general submodular functions, the negation of the solution to the relaxed problem (2) is known as the min-norm point [11]. In this paper we consider the problem of designing such modules that solve discrete optimization problems by leveraging this continuous formulation. To this end, our key technical contribution is to analyze the sensitivity of the min-norm point as a function of the parametrization of the function f. For the specific case above we will show how to compute ∂y*/∂w and ∂y*/∂v.

Continuing with the segmentation example, we might want to train a conditional model P(A | x) that can model the uncertainty in the predictions to be used in downstream decision making. A rich class of models are log-supermodular models, i.e., those of the form P(A | x) = exp(−F_{θ(x)}(A))/Z(θ) for some parametric submodular function F_θ. While they can capture very useful interactions, they are very hard to train in the maximum likelihood setting due to the presence of the intractable normalizer Z(θ). 
However, Djolonga and Krause [8] have shown that for any such distribution we\ncan \ufb01nd the closest fully factorized distribution Q(\u00b7 | x) minimizing a speci\ufb01c information theoretic\ndivergence D\u221e. In other words, we can exactly compute Q(\u00b7 | x) = arg minQ\u2208Q D\u221e(P (\u00b7 | x)(cid:107) Q),\nwhere Q is the family of fully factorized distributions. Most importantly, the optimal Q can also be\ncomputed from the min-norm point. Thus, a reasonable objective would be to learn a model \u03b8(x) so\nthat the best approximate distribution Q(\u00b7 | x) gives high likelihood to the training data point. This is\na complicated bi-level optimization problem (as Q implicitly depends on \u03b8) with an inner variational\ninference procedure, which we can again train end-to-end using our results. In other words, we can\noptimize the following algorithm end-to-end with respect to \u03b8.\n\nxi\n\n\u03b8(x)\u2212\u2212\u2212\u2192 \u03b8i \u2212\u2192 P = exp(\u2212F\u03b8i(A))/Z(\u03b8i) \u2212\u2192 Q = arg min\nQ\u2208Q\n\nD\u221e(P (cid:107) Q) \u2212\u2192 Q(Ai | xi). (3)\n\nRelated work. Sensitivity analysis of the set of optimal solutions has a long history in optimization\ntheory [12]. The problem of argmin-differentiation of the speci\ufb01c case resulting from graph cuts (i.e.\neq. (2)) has been considered in the computer vision literature, either by smoothing the objective [13],\nor by unrolling iterative methods [14]. The idea to train probabilistic models by evaluating them using\nthe marginals produced by an approximate inference algorithm has been studied by Domke [15] for\ntree-reweighted belief propagation and mean \ufb01eld, and for continuous models by Tappen [16]. These\nmethods either use the implicit function theorem, or unroll iterative optimization algorithms. The\nbene\ufb01ts of using an inconsistent estimator, which is what we do by optimizing eq. 
(3), in return for computationally tractable inference, have been discussed by Wainwright [17]. Amos and Kolter [18] discuss how to efficiently argmin-differentiate quadratic programs by perturbing the KKT conditions, an idea that goes back to Boot [19]. We make an explicit connection to their work in Theorem 4. In Section 4 we harness the connection between the min-norm problem and isotonic regression, which has been exploited to obtain better duality certificates [2], and by Kumar and Bach [20] to design an active-set algorithm for the min-norm problem. Chakravarti [21] analyzes the sensitivity of the optimal isotonic regression point with respect to perturbations of the input, but does not discuss the directional derivatives of the problem. Recently, Dolhansky and Bilmes [22] have used deep networks to parametrize submodular functions. Discrete optimization is also used in structured prediction [23, 24] for the computation of the loss function, which is closely related to our work if we use discrete optimization only at the last layer. However, in this case we have the advantage that we allow for arbitrary loss functions to be applied to the solution of the relaxation.

Contributions. We develop a very efficient approximate method (§4) for the computation of the Jacobian of the min-norm problem, inspired by our analysis of isotonic regression in §3, where we derive results that might be of independent interest. Even more importantly, from a practical perspective, this Jacobian has a very nice structure and we can multiply with it in linear time. This means that we can efficiently perform back-propagation if we use these layers in modern deep architectures. In §5 we show how to compute directional derivatives exactly in polynomial time, and give conditions under which our approximation is correct. 
This is also an interesting theoretical result\nas it quanti\ufb01es the stability of the min-norm point with respect to the model parameters. Lastly, we\nuse our results to learn log-supermodular models in \u00a76.\n\n2 Background on Submodular Minimization\n\nLet us introduce the necessary background on submodular functions. They are de\ufb01ned over subsets of\nsome ground set, which in the remaining of this paper we will w.l.o.g. assume to be V = {1, 2, . . . , n}.\nThen, a function F : 2V \u2192 R is said to be submodular iff for all A, B \u2286 V it holds that\n\nF (A \u222a B) + F (A \u2229 B) \u2264 F (A) + F (B).\n\nsatisfy the above with equality and are given as F (A) =(cid:80)\nfunction 2V \u2192 R de\ufb01ned as m(A) =(cid:80)\n\n(4)\nWe will furthermore w.l.o.g. assume that F is normalized so that F (\u2205) = 0. A very simple family of\nsubmodular functions are modular functions. These, seen as discrete analogues of linear functions,\ni\u2208A mi for some real numbers mi. As\ncommon practice in combinatorial optimization, we will treat any vector m \u2208 Rn as a modular\ni\u2208A mi. In addition to the graph cuts from the introduction\n(eq. (1)), another widely used class of functions are concave-of-cardinality functions, i.e. those of the\nform F (|A|) = h(|A|) for some concave h : R \u2192 R [5]. From eq. (4) we see that if we want to de\ufb01ne\na submodular function over a collection D (cid:40) 2V it has to be closed under union and intersection.\nSuch collections are known as lattices, and two examples that we will use are the simple lattice 2V\nand the trivial lattice {\u2205, V }. In the theory of submodular minimization, a critical object de\ufb01ned by a\npair consisting of a submodular function F and a lattice D \u2287 {\u2205, V } is the base polytope\n\nB(F | D) = {x \u2208 Rn | x(A) \u2264 F (A) for all A \u2208 D} \u2229 {x \u2208 Rn | x(V ) = F (V )}.\n\n(5)\nWe will also use the shorthand B(F ) = B(F | 2V ). 
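As a quick illustration of definition (4), the sketch below (our own toy example and helper names, not taken from the paper) checks the submodular inequality for a small cut function by brute force over all pairs of subsets:

```python
from itertools import combinations

def cut_function(edges, w):
    """The cut function F(A) of eq. (1) without the modular score term."""
    def F(A):
        return sum(w[e] for e in edges if len(set(A) & set(e)) == 1)
    return F

def is_submodular(F, n, tol=1e-12):
    """Check eq. (4), F(A ∪ B) + F(A ∩ B) <= F(A) + F(B), for all A, B ⊆ V."""
    subsets = [set(c) for r in range(n + 1) for c in combinations(range(n), r)]
    return all(F(A | B) + F(A & B) <= F(A) + F(B) + tol
               for A in subsets for B in subsets)

# Unit-weight triangle graph on V = {0, 1, 2}.
edges = [(0, 1), (1, 2), (0, 2)]
F = cut_function(edges, {e: 1.0 for e in edges})
```

Cut functions always pass this check and satisfy the normalization F(∅) = 0, whereas a supermodular function such as A ↦ |A|² would make is_submodular return False.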
Using the result of Edmonds [25], we know\nhow to maximize a linear function over B(F ) in O(n log n) time with n function evaluations of F .\nSpeci\ufb01cally, to compute maxy\u2208B(F ) zT y, we \ufb01rst choose a permutation \u03c3 : V \u2192 V that sorts z, i.e.\nso that z\u03c3(1) \u2265 z\u03c3(2) \u2265 \u00b7\u00b7\u00b7 \u2265 z\u03c3(n). Then, a maximizer f (\u03c3) \u2208 B(F ) can be computed as\n\n[f (\u03c3)]\u03c3(i) = F ({\u03c3(i)} | {\u03c3(1), . . . , \u03c3(i \u2212 1)}),\n\n(6)\nwhere the marginal gain of A given B is de\ufb01ned as F (A | B) = F (A \u222a B) \u2212 F (B). Hence, we\nknow how to compute the support function f (z) = supy\u2208B(F ) zT y, which is known as the Lov\u00e1sz\nextension [3]. First, this function is indeed an extension as f (1A) = F (A) for all A \u2286 V , where\n1A \u2208 {0, 1}n is the indicator vector for the set A. Second, it is convex as it is a supremum of linear\nfunctions. Finally, and most importantly, it lets us minimize submodular functions in polynomial\ntime with convex optimization because minA\u22082V F (A) = minz\u2208[0,1] f (z) and we can ef\ufb01ciently\nround the optimal continuous point to a discrete optimizer. Another problem, with a smooth objective,\nwhich is also explicitly tied to the problem of minimizing F is that of computing the min-norm point,\nwhich can be de\ufb01ned in two different ways as\n\ny\u2217 = arg min\ny\u2208B(F )\n\n1\n2\n\n(cid:107)y(cid:107)2, or equivalently as \u2212 y\u2217 = arg min\n\nf (y) +\n\ny\n\n(cid:107)y(cid:107)2,\n\n1\n2\n\n(7)\n\nwhere the equivalence comes from strong Fenchel duality [2]. The connection with submodular\nminimization comes from the following lemma, which we have already hinted at in the introduction.\ni \u2264 0}. Then A\u2212 (A0) is the\nLemma 1 ([1, Lem. 7.4]). 
De\ufb01ne A\u2212 = {i | y\u2217\nunique smallest (largest) minimizer of F .\n\ni < 0} and A0 = {i | y\u2217\n\nMoreover, if instead of hard-thresholding we send the min-norm point through a sigmoid, the result\nhas the following variational inference interpretation, which lets us optimize the pipeline in eq. (3).\nLemma 2 ([8, Thm. 3]). De\ufb01ne the in\ufb01nite R\u00e9nyi divergence between any distributions P and Q over\n\n2V as D\u221e(P || Q) = supA\u2286V log(cid:2)P (A)/Q(A)(cid:3). For P (A) \u221d exp(\u2212F (A)), the distribution Q\u2217\n\nminimizing D\u221e over all fully factorized distributions Q is given as\n\n\u03c3(\u2212y\u2217\ni )\ni /\u2208A\nwhere \u03c3(u) = 1/(1 + exp(\u2212u)) is the sigmoid function.\n\nQ(A) =\n\n\u03c3(y\u2217\ni ),\n\n(cid:89)\n\n(cid:89)\n\ni\u2208A\n\n3\n\n\f3 Argmin-Differentiation of Isotonic Regression\n\nWe will \ufb01rst analyze a simpler problem, i.e., that of isotonic regression, de\ufb01ned as\n\ny(x) = arg min\n\ny\u2208O\n\n(cid:107)y \u2212 x(cid:107)2,\n\n1\n2\n\n(8)\n\nwhere O = {y \u2208 Rn | yi \u2264 yi+1 for i = 1, 2, . . . , n \u2212 1}. The connection to our problem will be\nmade clear in Section 4, and it essentially follows from the fact that the Lov\u00e1sz extension is linear\non O. In this section, we will be interested in computing the Jacobian \u2202y/\u2202x, i.e., in understanding\nhow the solution y changes with respect to the input x. The function is well-de\ufb01ned because of\nthe strict convexity of the objective and the non-empty convex feasible set. Moreover, it can be\neasily computed in O(n) time using the pool adjacent violators algorithm (PAVA) [26]. This is\na well-studied problem in statistics, see e.g. [27]. To understand the behaviour of y(x), we will\nstart by stating the optimality conditions of problem (8). To simplify the notation, for any A \u2286 V\ni\u2208A xi. 
The optimality conditions can be stated via ordered\nwe will de\ufb01ne Meanx(A) = 1|A|\npartitions \u03a0 = (B1, B2, . . . , Bm) of V , meaning that the sets Bi are disjoint, \u222ak\nj=1Bj = V , and\n\u03a0 is ordered so that 1 + maxi\u2208Bj i = mini\u2208Bj+1 i. Speci\ufb01cally, for any such partition we de\ufb01ne\ny\u03a0 = (y1, y2, . . . , ym), where yj = Meanx(Bj)1|Bj| and 1k = {1}k is the vector of all ones. In\nother words, y\u03a0 is de\ufb01ned by taking block-wise averages of x with respect to \u03a0. By analyzing the\nKKT conditions of problem (8), we obtain the following well-known condition.\nLemma 3 ([26]). An ordered partition \u03a0 = (B1, B2, . . . , Bm) is optimal iff the following hold\n\n(cid:80)\n\n1. (Primal feasibility) For any two blocks Bj and Bj+1 we have\nMeanx(Bj) \u2264 Meanx(Bj+1).\n\n(9)\n2. (Dual feasibility) For every block B \u2208 \u03a0 and each i \u2208 B de\ufb01ne PreB(i) = {j \u2208 B | j \u2264 i}.\n\nThen, the condition reads\n\nMeanx(PreB(i)) \u2212 Meanx(B) \u2265 0.\n\n(10)\n\nPoints where eq. (9) is satis\ufb01ed with equality are of special interest, because of the following result.\nLemma 4. If for some Bj and Bj+1 the \ufb01rst condition is satis\ufb01ed with equality, we can merge the\ntwo sets so that the new coarser partition \u03a0(cid:48) will also be optimal.\n\nThus, in the remaining of this section we will assume that the sets Bj are chosen maximally. We will\nnow introduce a notion that will be crucial in the subsequent analysis.\nDe\ufb01nition 1. For any block B, we say that i \u2208 B is a breakpoint if Meanx(PreB(i)) = Meanx(B)\nand it is not the right end-point of B (i.e., i < maxi(cid:48)\u2208B i(cid:48)).\nFrom an optimization perspective, any breakpoint is equivalent to non-strict complementariness of the\ncorresponding Lagrange multiplier. 
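For reference, PAVA itself fits in a few lines. This is a standard textbook sketch (not the paper's implementation); it merges adjacent blocks whenever their means violate, or meet with equality, the required order, so the returned blocks are maximal in the sense of Lemma 4:

```python
def pava(x):
    """Isotonic regression, eq. (8): Euclidean projection of x onto
    {y : y_1 <= y_2 <= ... <= y_n} via pool adjacent violators, in O(n)."""
    blocks = []  # each block stores (sum, size); its fitted value is sum / size
    for xi in x:
        blocks.append((float(xi), 1))
        # Merge while the last two block means violate (or meet) monotonicity;
        # the comparison is cross-multiplied to avoid divisions.
        while len(blocks) > 1 and \
                blocks[-2][0] * blocks[-1][1] >= blocks[-1][0] * blocks[-2][1]:
            s2, n2 = blocks.pop()
            s1, n1 = blocks.pop()
            blocks.append((s1 + s2, n1 + n2))
    y = []
    for s, n in blocks:
        y.extend([s / n] * n)  # block-wise averages of x, as in Lemma 3
    return y
```

For example, pava([3, 1, 2]) merges all three coordinates into a single block with value 2.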
From a combinatorial perspective, they correspond to positions\nwhere we can re\ufb01ne \u03a0 into a \ufb01ner partition \u03a0(cid:48) that gives rise to the same point, i.e., y\u03a0 = y\u03a0(cid:48) (if we\nmerge blocks using Lemma 4, the point where we merge them will become a breakpoint). We can now\ndiscuss the differentiability of y(x). Because projecting onto convex sets is a proximal operator and\nthus non-expansive, we have the following as an immediate consequence of Rademacher\u2019s theorem.\nLemma 5. The function y(x) is 1-Lipschitz continuous and differentiable almost everywhere.\nWe will denote by \u2202\u2212\nxk the left and right partial derivatives with respect to xk. For any index\nk we will denote by u(k) (l(k)) the breakpoint with the smallest (largest) coordinate larger (smaller)\nthan k. De\ufb01ne it as +\u221e (\u2212\u221e) if no such point exists. Moreover, denote by \u03a0(z) the collection of\nindices where z takes on distinct values, i.e., \u03a0(z) = \u222an\nTheorem 1. Let k be any coordinate and let B \u2208 \u03a0(y(x)) be the block containing coordinate i.\nAlso de\ufb01ne B+ = {i \u2208 B | i \u2265 u(k)} and B\u2212 = {i \u2208 B | i \u2264 l(k)}. Hence, for any i \u2208 B\n\nxk and \u2202+\n\ni=1{{i(cid:48) \u2208 V | zi = zi(cid:48)}}.\n(yi) =(cid:74)i \u2208 B \\ B+(cid:75)/|B \\ B+|.\n\n\u2202\u2212\n\nxk\n\n(yi) =(cid:74)i \u2208 B \\ B\u2212(cid:75)/|B \\ B\u2212|,\n\n\u2202+\nxk\n\nand\n\n4\n\n\fNote that all of these derivatives will agree iff there are no breakpoints, which means that the existence\nof breakpoints is an isolated phenomenon due to Lemma 5. In this case the Jacobian exists and has a\nvery simple block-diagonal form. Namely, it is equal to\n\n= \u039b(y(x)) \u2261 blkdiag(C|B1|, C|B2|, . . . , C|Bm|),\n\n\u2202y\n\u2202x\n\n(11)\n\nwhere Ck = 1k\u00d7k/k is the averaging matrix with elements 1/k. We will use \u039b(z) for the matrix\ntaking block-wise averages with respect to the blocks \u03a0(z). 
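Multiplying by Λ(z) never requires forming the matrix: it suffices to average the input vector within each constant run of z. A minimal sketch with a helper name of our choosing (for an isotonic z, equal values appear consecutively, which the code assumes):

```python
def lam_mul(z, u, tol=1e-9):
    """Compute Λ(z) u from eq. (11): average u over each constant block of z."""
    out = [0.0] * len(z)
    start = 0
    for i in range(1, len(z) + 1):
        # Close the current block when z changes (or at the end of the vector).
        if i == len(z) or abs(z[i] - z[start]) > tol:
            mean = sum(u[start:i]) / (i - start)
            out[start:i] = [mean] * (i - start)
            start = i
    return out
```

For z = (1, 1, 2) and u = (1, 2, 3) this returns (1.5, 1.5, 3), at a cost linear in the length of the vectors.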
As promised in the introduction, Jacobian\nmultiplication \u039b(y(x))u is linear as we only have to perform block-wise averages.\n\n4 Min-Norm Differentiation\nIn this section we will assume that we have a function F\u03b8 parametrized by some \u03b8 \u2208 Rd that we seek\nto learn. For example, we could have a mixture model\n\nd(cid:88)\n\nF\u03b8(A) =\n\n\u03b8jGj(A),\n\n(12)\n\nfor some \ufb01xed submodular functions Gj : 2V \u2192 R. In this case, to ensure that the resulting function\nis submodular we also want to enforce \u03b8j \u2265 0 unless Gj is modular. We would like to note that the\ndiscussion in this section goes beyond such models. Remember that the min-norm point is de\ufb01ned as\n\nj=1\n\ny\u03b8 = \u2212 arg min\n\ny\n\nf\u03b8(y) +\n\n(cid:107)y(cid:107)2,\n\n1\n2\n\n(13)\n\nwhere f\u03b8 is the Lov\u00e1sz extension of F\u03b8. Hence, we want to compute \u2202y/\u2202\u03b8. To make the connection\nwith isotonic regression, remember how we evaluate the Lov\u00e1sz extension at y. First, we pick a\npermutation \u03c3 that sorts y, and then evaluate it as f\u03b8(y) = f\u03b8(\u03c3)T y, where f\u03b8(\u03c3) is de\ufb01ned in\neq. (6). Hence, the Lov\u00e1sz extension is linear on the set of all vectors that are sorted by \u03c3. Formally,\nfor any permutation \u03c3 the Lov\u00e1sz extension is equal to f\u03b8(\u03c3)T y on the order cone\n\nO(\u03c3) = {y | y\u03c3(n) \u2264 y\u03c3(n\u22121) \u2264 . . . \u2264 y\u03c3(1)}.\n\nGiven a permutation \u03c3, if we constrain eq. (13) to O(\u03c3) we can replace f\u03b8(y) by the linear function\nf\u03b8(\u03c3)T , so that the problem reduces to\n\ny\u03b8(\u03c3) = \u2212 arg min\ny\u2208O(\u03c3)\n\n1\n2\n\n(cid:107)y + f\u03b8(\u03c3)(cid:107)2,\n\n(14)\n\nwhich is an instance of isotonic regression if we relabel the elements of V using \u03c3. Roughly, the idea\nis to instead differentiate eq. (14) with f\u03b8(\u03c3) computed at the optimal point y\u03b8. 
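The evaluation of f_θ(σ) by the greedy rule of eq. (6) is one sort followed by n marginal gains. The sketch below (our own helper names, for an arbitrary set function passed as a Python callable) also evaluates the Lovász extension as f(y) = f(σ)ᵀy:

```python
def greedy_vertex(F, y):
    """Edmonds' greedy rule, eq. (6): a maximizer of y^T x over B(F)."""
    n = len(y)
    order = sorted(range(n), key=lambda i: -y[i])  # sigma sorts y descending
    f_sigma = [0.0] * n
    prefix = set()
    prev = F(prefix)
    for i in order:
        prefix = prefix | {i}
        cur = F(prefix)
        f_sigma[i] = cur - prev  # marginal gain F({i} | earlier elements)
        prev = cur
    return f_sigma

def lovasz(F, y):
    """Lovász extension f(y) = f(sigma)^T y."""
    fs = greedy_vertex(F, y)
    return sum(a * b for a, b in zip(fs, y))

# A modular example: the extension must be the linear function m^T y.
m = [2.0, -1.0, 0.5]
F_mod = lambda A: sum(m[i] for i in A)
```

For a modular F the extension is linear, and on indicator vectors it reproduces F, matching the extension property f(1_A) = F(A).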
However, because\nwe can choose an arbitrary order among the elements with equal values, there may be multiple\npermutations that sort y\u03b8, and this extra choice we have seems very problematic. Nevertheless, let us\ncontinue with this strategy and analyze the resulting approximations to the Jacobian. We propose the\nfollowing approximation to the Jacobian\n\n\u2202y\u03b8\n\u2202\u03b8\n\n\u00d7 \u2202f\u03b8(\u03c3)\n\u2202\u03b8\n\n= \u039b(y\u03b8) \u00d7 [\u2202\u03b81 f\u03b8(\u03c3)\n\n| \u2202\u03b82 f\u03b8(\u03c3)\n\n|\n\n\u00b7\u00b7\u00b7\n\n| \u2202\u03b8d f\u03b8(\u03c3)] ,\n\n\u2248 (cid:98)J\u03c3 \u2261 \u039b(y\u03b8)\n(cid:124) (cid:123)(cid:122) (cid:125)\n\n\u2248 \u2202y\u03b8 (\u03c3)\n\u2202f\u03b8 (\u03c3)\n\nwhere \u039b(y\u03b8) is used as an approximation of a Jacobian which might not exist. Fortunately, due to the\nspecial structure of the linearizations, we have the following result that the gradient obtained using\nthe above strategy does not depend on the speci\ufb01c permutation \u03c3 that was chosen.\n\nTheorem 2. If \u2202\u03b8k F (A) exists for all A \u2286 V the approximate Jacobians (cid:98)J\u03c3 are equal and do not\n\ndepend on the choice of \u03c3. Speci\ufb01cally, the j-th block of any element i \u2208 B \u2208 \u03a0(y\u03b8) is equal to\n\n1\n\n|B| \u2202\u03b8j F\u03b8(B | {i(cid:48) | [y\u03b8]i(cid:48) < [y\u03b8]i}).\n\n(15)\n\nProof sketch, details in supplement. Remember that \u039b(y\u03b8) averages f\u03b8(\u03c3) within each B \u2208 \u03a0(y\u03b8).\nMoreover, as \u03c3 sorts y\u03b8, the elements in B must be placed consecutively. The coordinates of f\u03b8(\u03c3)\nare marginal gains (6) and they will telescope inside the mean, which yields the claimed quantity.\n\n5\n\n\fGraph cuts. As a special, but important case, let us analyze how the approximate Jacobian looks\nlike for a cut function (eq. (1)), in which case eq. (13) reduces to eq. (2). Let \u03a0(y(w, v)) =\n(B1, B2, . . . , Bm). 
For any element i \u2208 V we will denote by \u03b7(i) \u2208 {1, 2, . . . , m} the index of the\n\nblock where it belongs to. Then, the approximate Jacobian (cid:98)J at \u03b8 = (w, v) has entries\n\n(cid:98)\u2202vj (yi) =(cid:74)\u03b7(i) = \u03b7(j)(cid:75)/|B\u03b7(i)|, and\n(cid:98)\u2202wi,j (yk) =\n\n\uf8f1\uf8f4\uf8f2\uf8f4\uf8f3sign(yi \u2212 yj)\n\nsign(yj \u2212 yi)\n0\n\n1|B\u03b7(k)|\n1|B\u03b7(k)|\n\nif \u03b7(k) = \u03b7(i), or\nif \u03b7(k) = \u03b7(j), and\notherwise,\n\nwhere the sign function is de\ufb01ned to be zero if the argument is zero. Intuitively, increasing the\nmodular term vi by \u03b4 will increase all the coordinates B in y that are in the same segment by \u03b4/|B|.\nOn the other hand, increasing the weight of an edge wi,j will have no effect if i and j are already on y\nin the same segment, and otherwise it will pull the segments containing i and j together by increasing\nthe smaller one and decreasing the larger one. In the supplementary we provide a pytorch module\nthat executes the back propagation pass in O(|V | + |E|) time in about 10 lines of code, and we also\nderive the approximate Jacobians for concave-of-cardinality and facility location functions.\n\n5 Analysis\n\nWe will now theoretically analyze the conditions under which our approximation is correct, and then\ngive a characterization of the exact directional derivative together with a polynomial algorithm that\ncomputes it. The \ufb01rst notion that will have implications for our analysis is that of (in)separability.\nDe\ufb01nition 2. The function F : 2V \u2192 R is said to be separable if there exists some B \u2286 V such that\nB /\u2208 {\u2205, V } and F (V ) = F (B) + F (V \\ B).\nThe term separable is indeed appropriate as it implies that F (A) = F (A \u2229 B) + F ((V \\ B) \u2229 A) for\nall A \u2286 V [2, Prop. 4.3], i.e., the function splits as a sum of two functions on disjoint domains. 
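To make the O(|V| + |E|) claim above concrete, here is our own plain-Python sketch of the corresponding backward pass (the paper's actual PyTorch module is in its supplementary; the names and the numerical grouping tolerance are ours). It implements the vector-Jacobian products implied by the entries above, with the sign convention of the accompanying intuition: increasing w_{i,j} moves the higher of the two segments down and the lower one up:

```python
def cut_backward(y, edges, grad_y, tol=1e-9):
    """Vector-Jacobian products for the graph-cut layer of Section 4.

    y:      the relaxed solution, one value per vertex
    edges:  list of (i, j) pairs
    grad_y: upstream gradient with respect to y
    Returns (grad_v, grad_w) in O(|V| + |E|) time.
    """
    # Group vertices into blocks of (numerically) equal values of y.
    block_of, block_sum, block_cnt, keys = {}, [], [], {}
    for i, yi in enumerate(y):
        k = round(yi / tol)  # hash equal values together
        if k not in keys:
            keys[k] = len(block_sum)
            block_sum.append(0.0)
            block_cnt.append(0)
        b = keys[k]
        block_of[i] = b
        block_sum[b] += grad_y[i]
        block_cnt[b] += 1
    gbar = [s / c for s, c in zip(block_sum, block_cnt)]  # block means of grad_y

    # dy/dv_j averages within the block of j, so grad_v = Λ(y) grad_y.
    grad_v = [gbar[block_of[i]] for i in range(len(y))]

    # Each edge contributes a signed difference of block means; edges inside a
    # single block contribute zero, as in the text.
    grad_w = []
    for (i, j) in edges:
        s = (y[i] > y[j]) - (y[i] < y[j])  # sign(y_i - y_j)
        grad_w.append(s * (gbar[block_of[j]] - gbar[block_of[i]]))
    return grad_v, grad_w
```

Note that grad_v is simply Λ(y) applied to the upstream gradient, so the whole pass is a handful of linear scans.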
Hence,\nwe can split the problem into two (on B and V \\ B) and analyze them independently. We would\nlike to point out that separability is checkable in cubic time using the algorithm of Queyranne [28].\nTo simplify the notation, we will assume that we want to compute the derivative at point \u03b8(cid:48) \u2208 Rd\nwhich results in the min-norm point y(cid:48) = y\u03b8 \u2208 Rn. We will furthermore assume that y(cid:48) takes on\nunique values \u03b31 < \u03b32 < \u00b7\u00b7\u00b7 < \u03b3k on sets B1, B2, . . . , Bk respectively, and we will de\ufb01ne the chain\n\u2205 = A0 \u2286 A1 \u2286 A2 \u2286 \u00b7\u00b7\u00b7 \u2286 Ak = V by Aj = \u222aj\nj(cid:48)=1Bj(cid:48). A central role in the analysis will be\nplayed by the set of constraints in B(F\u03b8) (see (5)) that are active at y\u03b8, which makes sense given that\nwe expect small perturbations in \u03b8(cid:48) to result in small changes in y\u03b8(cid:48) as well.\nDe\ufb01nition 3. For any submodular function F : 2V \u2192 R and any point z \u2208 B(F ) we shall denote by\nDF (z) the lattice of tight sets of z on B(F ), i.e.\n\nDF (z) = {A \u2286 V | z(A) = F (A)}.\n\nThe fact that the above set is indeed a lattice is well-known [1]. Moreover, note that DF (z) \u2287 {\u2205, V }.\nWe will also de\ufb01ne D(cid:48) = DF\u03b8(cid:48) (y(cid:48)), i.e., the lattice of tight sets at the min-norm point.\n\n5.1 When will the approximate approach work?\n\nWe will analyze suf\ufb01cient conditions so that irrespective of the choice of \u03c3, the isotonic regression\nproblem eq. (14) has no breakpoints, and the left and right derivatives agree. To this end, for any\nj \u2208 {1, 2, . . . , k} we de\ufb01ne the submodular function Fj : 2Bj as Fj(H) = F\u03b8(cid:48)(H | Aj\u22121), where\nwe have dropped the dependence on \u03b8(cid:48) as it will remain \ufb01xed throughout this section.\nTheorem 3. 
The approximate problem (14) is argmin-continuously differentiable irrespective of the\nchosen permutation \u03c3 sorting y\u03b8 if and only if any of the following equivalent conditions hold.\n\n(cid:2)Fj(H) \u2212 Fj(Bj)|H|/|Bj|(cid:3) = {\u2205, Bj}.\n\n(a) arg minH\u2208Bj\n(b) y(cid:48)\n\nBj\n\n\u2208 relint(B(Fj)), i.e. DFj (y(cid:48)\n\nBj\n\n) = {\u2205, Bj}, which is only possible if Fj is inseparable.\n\n6\n\n\fIn other words, we can equivalently say that the optimum has to lie on the interior of the face.\nMoreover, if \u03b8 \u2192 y\u03b8 is continuous1, this result implies that the min-norm point is locally de\ufb01ned as\naveraging within the same blocks using (15), so that the approximate Jacobian is exact.\nWe would like to point out that one can obtain the same derivatives as the ones suggested in \u00a74, if we\nperturb the KKT conditions, as done by Amos and Kolter [18]. If we use that approach, in addition to\nthe computational challenges, there is the problem of non-uniqueness of the Lagrange multiplier, and\nmoreover, some valid multipliers might be zero for some of the active constraints. This would render\nthe resulting linear system rank de\ufb01cient, and it is not clear how to proceed. Remember that when we\nanalyzed the isotonic regression problem in \u00a73 we had non-differentiability due to the exactly same\nreason \u2014 zero multipliers for active constraints, which in that case correspond to the breakpoints.\nTheorem 4. For a speci\ufb01c Lagrange multiplier there exists a solution to the perturbed KKT conditions\nderived by [18] that gives rise to the approximate Jacobians from Section 4. Moreover, this multiplier\nis unique if the conditions of Theorem 3 are satis\ufb01ed.\n\n5.2 Exact computation\n\nUnfortunately, computing the gradients exactly seems very complicated for arbitrary parametrizations\nF\u03b8, and we will focus our attention to mixture models of the form given in eq. (12). 
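For such mixtures we have ∂_{θ_j} F_θ = G_j, so the approximate Jacobian of eq. (15) can be assembled block by block. A sketch with our own helper names (values of y are grouped up to a numerical tolerance):

```python
def approx_jacobian(Gs, y, tol=1e-9):
    """Approximate Jacobian, eq. (15), for a mixture F_theta(A) = sum_j theta_j G_j(A).

    Entry (i, j) is G_j(B | L) / |B|, where B is the block of y-values equal
    to y_i and L = {i' : y_{i'} < y_i} is the union of all lower blocks.
    """
    n = len(y)
    levels = sorted(set(round(yi / tol) for yi in y))
    J = [[0.0] * len(Gs) for _ in range(n)]
    below = set()
    for lev in levels:  # sweep the blocks in increasing order of value
        B = {i for i in range(n) if round(y[i] / tol) == lev}
        for j, G in enumerate(Gs):
            gain = (G(below | B) - G(below)) / len(B)  # G_j(B | below) / |B|
            for i in B:
                J[i][j] = gain
        below |= B
    return J

# One modular component: the column must be the block-wise average of m.
m = [2.0, 4.0, 1.0]
G_mod = lambda A: sum(m[i] for i in A)
J = approx_jacobian([G_mod], [0.5, 0.5, -1.0])
```

With a single modular component the column reduces to block-wise averages of m, consistent with the averaging operator Λ from §3.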
The directions v\nwhere we will compute the directional derivatives will have in general non-negative components vj,\nunless Fj is modular. By leveraging the theory of Shapiro [29], and exploiting the structure of both\nthe min-norm point and the polyhedron B(Fv | D(cid:48)) we obtain at the following result.\nTheorem 5. Assume that F\u03b8(cid:48) is inseparable and let v be any direction so that Fv is submodular. The\ndirectional derivative \u2202y/\u2202\u03b8j at \u03b8(cid:48) in direction v is given by the solution of the following problem.\n\n(cid:107)d(cid:107)2,\n\n1\n2\n\nminimize\nsubject to d \u2208 B(Fv | D(cid:48)), and\n\nd\n\nd(Bj) = Fv(Aj) for j \u2208 {1, 2, . . . , k}.\n\n(16)\n\nFirst, note that this is again a min-norm problem, but now de\ufb01ned over a reduced lattice D(cid:48) with k\nadditional equality constraints. Fortunately, due to these additional equalities, we can split the above\nproblem into k separate min-norm problems. Namely, for each j \u2208 {1, 2, . . . , k} collect the lattice\nof tight sets that intersect Bj as D(cid:48)\nj = {H \u2229 Bj | H \u2208 D(cid:48)}, and de\ufb01ne the function Gj : 2Bj \u2192 R\nas Gj(A) = Fv(A | Aj\u22121), where note that the parameter vector \u03b8 is taken as the direction v in\nwhich we want to compute the derivative. Then, the block of the optimal solution of problem (16)\ncorresponding to Bj is equal to\n\nd\u2217\n\nBj\n\n= arg min\nyj\u2208B(Gj|D(cid:48)\nj )\n\n(cid:107)yj(cid:107)2,\n\n1\n2\n\n(17)\n\nwhich is again a min-norm point problem where the base polytope is de\ufb01ned using the lattice D(cid:48)\ncan then immediately draw a connection with the results from the previous subsection.\nCorollary 1. If all latices are trivial, the solution of (17) agrees with the approximate Jacobian (15).\n\nj. We\n\nHow to solve problem (16)? Fortunately, the divide-and-conquer algorithm of Groenevelt [30]\ncan be used to \ufb01nd the min-norm point over arbitrary lattices. 
To do this, we have to compute for\ni in arg minHj(cid:51)i Fj(Hj) \u2212 y(cid:48)(Hj), which can be done using\neach i \u2208 Bj the unique smallest set H\u2217\nsubmodular minimization after applying the reduction of Schrijver [31].\nLemma 6. Assume that Gj is equal to Gj(A) =(cid:74)i \u2208 A(cid:75) for some i \u2208 Bj. Then, the directional\nTo highlight the difference with the approximation from section 4, let us consider a very simple case.\nderivative is equal to 1|D|/|D| where D = {i(cid:48) | i \u2208 H\u2217\ni(cid:48)}.\nHence, while the approximate directional derivative would average over all elements in Bj, the\ntrue one averages only over a subset D \u2286 Bj and is possibly sparser. Lemma 6 gives us the exact\ndirectional derivatives for the graph cuts, as each component Gj will be either a cut function on\n\n1For example if the correspondence \u03b8 (cid:16) B(F\u03b8) is hemicontinuous due to Berge\u2019s theorem.\n\n7\n\n\ftwo vertices, or a function of the form in Lemma 6. In the \ufb01rst case the directional derivative is\nzero because 0 \u2208 B(Gj) \u2286 B(Gj | D(cid:48)\nj). In the second case, we can can either solve exactly\nusing Lemma 6 or use a more sophisticated approximation, generalizing the result from [32] \u2014\ngiven that Fj is separable over 2Bj iff the graph is disconnected, we can \ufb01rst separate the graph into\nconnected components, and then take averages within each connected component instead of over Bj.\n\n5.3 Structured attention and constraints\n\nRecently, there has been an interest in the design of structured attention mechanisms, which, as we\nwill now show, can be derived and furthermore generalized using the results in this paper. The \ufb01rst\nmechanism is the sparsemax of Martins and Astudillo [33]. 
It takes as input a vector and projects it onto the probability simplex, which is the base polytope corresponding to G(A) = min{|A|, 1}. Concurrently with this work, Niculae and Blondel [32] have analyzed the following problem

    y* = arg min_{y ∈ B(G)}  f(y) + 1/2 ‖y − z‖²,    (18)

for the special case when B(G) is the simplex and f is the Lovász extension of one of two specific submodular functions. We will consider the general case when G can be any concave-of-cardinality function and f is the Lovász extension of an arbitrary submodular function F. Note that if either f(y) or the constraint were not present in problem (18), we could simply leverage the theory we have developed so far to differentiate it. Fortunately, as done by Niculae and Blondel [32], we can utilize the result of Yu [34] to significantly simplify (18). Namely, because projection onto B(G) preserves the order of the coordinates [35, Lemma 1], we can write the optimal solution y* of (18) as

    y* = arg min_{y ∈ B(G)}  1/2 ‖y − y′‖²,   where   y′ = arg min_y  f(y) + 1/2 ‖y − z‖².

We can hence split problem (18) into two subtasks: first compute y′, and then project it onto B(G). As each operation reduces to a minimum-norm problem, we can differentiate each of them separately, and thus solve (18) by stacking two submodular layers one after the other.

6 Experiments

                         CNN                   CNN+GC
                   Mean     Std. Dev.    Mean     Std. Dev.
    Accuracy       0.8103   0.1034       0.9121   0.1391
    NLL            0.3919   0.2696       0.2681   0.1911
    # Fg. Objs.    96.9     30.6         65.8     25.3

Figure 1: Test set results.
We see that incorporating a graph cut solver improves both the accuracy and the negative log-likelihood (NLL), while producing more consistent segmentations with fewer foreground objects.

We consider the image segmentation task from the introduction, where we are given an RGB image x ∈ Rn×3 and are supposed to predict those pixels y ∈ {0, 1}n containing the foreground object. We used the Weizmann horse segmentation dataset [36], which we split into training, validation and test splits of sizes 180, 50 and 98, respectively. The implementation was done in PyTorch², and to compute the min-norm point we used the algorithm from [37]. To make the problem more challenging, at training time we randomly selected and revealed only 0.1% of the training set labels. We first trained a convolutional neural network with two hidden layers that directly predicts the per-pixel labels, which we refer to as CNN. Then, we added a second model, which we call CNN+GC, that has the same architecture as the first one, but with an additional graph cut layer, whose weights are parametrized by a convolutional neural network with one hidden layer. Details about the architectures can be found in the supplementary. We trained the models by maximizing the log-likelihood of the revealed pixels, which corresponds to the variational bi-level strategy (eq. (3)) due to Lemma 2. We trained using SGD, Adagrad [38] and Adam [39], and chose the model with the best validation score. As evident from the results presented in Figure 1, adding the discrete layer improves not only the accuracy (after thresholding the marginals at 0.5) and the log-likelihood, but it also gives more coherent results, as it makes predictions with fewer connected components (i.e., foreground objects).
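The "# Fg. Objs." metric in Figure 1 counts the connected components of the thresholded prediction. A self-contained sketch of such a count under 4-connectivity (our own illustration; the paper does not specify the connectivity or the implementation used):

```python
def count_foreground_objects(mask):
    """Count 4-connected components of 1s in a binary mask (list of lists)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for r in range(h):
        for c in range(w):
            if mask[r][c] and not seen[r][c]:
                count += 1
                stack = [(r, c)]  # flood-fill this component
                seen[r][c] = True
                while stack:
                    y, x = stack.pop()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return count

# Two separate foreground blobs -> 2 objects.
print(count_foreground_objects([[1, 1, 0],
                                [0, 0, 0],
                                [0, 1, 1]]))  # -> 2
```

A lower count on predictions thresholded at 0.5 indicates the more coherent segmentations reported for CNN+GC.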
Moreover, if we look at the predictions themselves in Figure 2, we can observe that the optimization layer not only removes spurious predictions, but there is also a qualitative difference in the marginals, as they are spatially more consistent.

² The code will be made available at https://www.github.com/josipd/nips-17-experiments.

Figure 2: Comparison of results from both models on four instances from the test set (up: CNN, down: CNN+GC). We can see that adding the graph-cut layers helps not only quantitatively, but also qualitatively, as the predictions are more spatially regular and vary smoothly inside the segments.

7 Conclusion

We have analyzed the sensitivity of the min-norm point for parametric submodular functions and provided both a very easy-to-implement practical approximate algorithm for general objectives, and a strong theoretical result characterizing the true directional derivatives for mixtures. These results allow the use of submodular minimization inside modern deep architectures, and they are also immediately applicable to bi-level variational learning of log-supermodular models of arbitrarily high order. Moreover, we believe that these theoretical results open up the new problem of developing algorithms that compute not only the min-norm point, but also the associated derivatives.

Acknowledgements. The research was partially supported by ERC StG 307036 and a Google European PhD Fellowship.

References

[1] S. Fujishige. Submodular functions and optimization. Annals of Discrete Mathematics vol. 58. 2005.

[2] F. Bach. “Learning with submodular functions: a convex optimization perspective”. Foundations and Trends® in Machine Learning 6.2-3 (2013).

[3] L. Lovász. “Submodular functions and convexity”. Mathematical Programming: The State of the Art. Springer, 1983, pp. 235–257.

[4] Y. Boykov and V. Kolmogorov.
\u201cAn experimental comparison of min-cut/max-\ufb02ow algorithms\nfor energy minimization in vision\u201d. IEEE Transactions on Pattern Analysis and Machine\nIntelligence 26.9 (2004), pp. 1124\u20131137.\n\n[5] P. Kohli, L. Ladicky, and P. H. Torr. \u201cRobust higher order potentials for enforcing label\n\nconsistency\u201d. Computer Vision and Pattern Recognition (CVPR). 2008.\n\n[6] M. Narasimhan, N. Jojic, and J. A. Bilmes. \u201cQ-clustering\u201d. Advances in Neural Information\n\n[7]\n\n[8]\n\nProcessing Systems (NIPS). 2006, pp. 979\u2013986.\nJ. Djolonga and A. Krause. \u201cFrom MAP to Marginals: Variational Inference in Bayesian\nSubmodular Models\u201d. Advances in Neural Information Processing Systems (NIPS). 2014.\nJ. Djolonga and A. Krause. \u201cScalable Variational Inference in Log-supermodular Models\u201d.\nInternational Conference on Machine Learning (ICML). 2015.\n\n[9] F. R. Bach. \u201cShaping level sets with submodular functions\u201d. Advances in Neural Information\n\nProcessing Systems (NIPS). 2011.\n\n[10] F. R. Bach. \u201cStructured sparsity-inducing norms through submodular functions\u201d. Advances in\n\nNeural Information Processing Systems (NIPS). 2010.\n\n9\n\n\f[11] S. Fujishige, T. Hayashi, and S. Isotani. The minimum-norm-point algorithm applied to\nsubmodular function minimization and linear programming. Kyoto University. Research\nInstitute for Mathematical Sciences [RIMS], 2006.\n\n[12] R. T. Rockafellar and R. J.-B. Wets. Variational analysis. Vol. 317. Springer Science &\n\nBusiness Media, 2009.\n\n[13] K. Kunisch and T. Pock. \u201cA bilevel optimization approach for parameter learning in variational\n\nmodels\u201d. SIAM Journal on Imaging Sciences 6.2 (2013), pp. 938\u2013983.\n\n[14] P. Ochs, R. Ranftl, T. Brox, and T. Pock. \u201cBilevel optimization with nonsmooth lower level\nproblems\u201d. International Conference on Scale Space and Variational Methods in Computer\nVision. Springer. 2015, pp. 654\u2013665.\nJ. 
Domke. \u201cLearning graphical model parameters with approximate marginal inference\u201d. IEEE\nTransactions on Pattern Analysis and Machine Intelligence 35.10 (2013), pp. 2454\u20132467.\n\n[15]\n\n[16] M. F. Tappen. \u201cUtilizing variational optimization to learn markov random \ufb01elds\u201d. Computer\n\nVision and Pattern Recognition (CVPR). 2007.\n\n[17] M. J. Wainwright. \u201cEstimating the wrong graphical model: Bene\ufb01ts in the computation-limited\n\nsetting\u201d. Journal of Machine Learning Research (JMLR) 7 (2006).\n\n[18] B. Amos and J. Z. Kolter. \u201cOptNet: Differentiable Optimization as a Layer in Neural Networks\u201d.\n\nInternational Conference on Machine Learning (ICML). 2017.\nJ. C. Boot. \u201cOn sensitivity analysis in convex quadratic programming problems\u201d. Operations\nResearch 11.5 (1963), pp. 771\u2013786.\n\n[19]\n\n[20] K. Kumar and F. Bach. \u201cActive-set Methods for Submodular Minimization Problems\u201d. arXiv\n\npreprint arXiv:1506.02852 (2015).\n\n[21] N. Chakravarti. \u201cSensitivity analysis in isotonic regression\u201d. Discrete Applied Mathematics\n\n45.3 (1993), pp. 183\u2013196.\n\n[22] B. Dolhansky and J. Bilmes. \u201cDeep Submodular Functions: De\ufb01nitions and Learning\u201d. Neural\n\nInformation Processing Society (NIPS). Barcelona, Spain, Dec. 2016.\nI. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. \u201cLarge margin methods for structured\nand interdependent output variables\u201d. Journal of Machine Learning Research (JMLR) 6.Sep\n(2005), pp. 1453\u20131484.\n\n[23]\n\n[24] B. Taskar, C. Guestrin, and D. Koller. \u201cMax-margin Markov networks\u201d. Advances in Neural\n\nInformation Processing Systems (NIPS). 2004, pp. 25\u201332.\nJ. Edmonds. \u201cSubmodular functions, matroids, and certain polyhedra\u201d. Combinatorial struc-\ntures and their applications (1970), pp. 69\u201387.\n\n[25]\n\n[26] M. J. Best and N. Chakravarti. 
\u201cActive set algorithms for isotonic regression; a unifying\n\nframework\u201d. Mathematical Programming 47.1-3 (1990), pp. 425\u2013439.\n\n[27] T. Robertson and T. Robertson. Order restricted statistical inference. Tech. rep. 1988.\n[28] M. Queyranne. \u201cMinimizing symmetric submodular functions\u201d. Mathematical Programming\n\n82.1-2 (1998), pp. 3\u201312.\n\n[29] A. Shapiro. \u201cSensitivity Analysis of Nonlinear Programs and Differentiability Properties of\n\nMetric Projections\u201d. SIAM Journal on Control and Optimization 26.3 (1988), pp. 628\u2013645.\n\n[30] H. Groenevelt. \u201cTwo algorithms for maximizing a separable concave function over a polyma-\n\ntroid feasible region\u201d. European Journal of Operational Research 54.2 (1991).\n\n[31] A. Schrijver. \u201cA combinatorial algorithm minimizing submodular functions in strongly poly-\n\nnomial time\u201d. Journal of Combinatorial Theory, Series B 80.2 (2000), pp. 346\u2013355.\n\n[32] V. Niculae and M. Blondel. \u201cA Regularized Framework for Sparse and Structured Neural\n\nAttention\u201d. arXiv preprint arXiv:1705.07704 (2017).\n\n[33] A. Martins and R. Astudillo. \u201cFrom softmax to sparsemax: A sparse model of attention and\n\nmulti-label classi\ufb01cation\u201d. International Conference on Machine Learning (ICML). 2016.\n\n[34] Y.-L. Yu. \u201cOn decomposing the proximal map\u201d. Advances in Neural Information Processing\n\nSystems. 2013, pp. 91\u201399.\n\n[35] D. Suehiro, K. Hatano, S. Kijima, E. Takimoto, and K. Nagano. \u201cOnline prediction under\n\nsubmodular constraints\u201d. International Conference on Algorithmic Learning Theory. 2012.\n\n[36] E. Borenstein and S. Ullman. \u201cCombined top-down/bottom-up segmentation\u201d. IEEE Transac-\n\ntions on Pattern Analysis and Machine Intelligence 30.12 (2008), pp. 2109\u20132125.\n\n10\n\n\f[37] A. Barbero and S. Sra. \u201cModular proximal optimization for multidimensional total-variation\n\nregularization\u201d. 
arXiv preprint arXiv:1411.0589 (2014).

[38] J. Duchi, E. Hazan, and Y. Singer. “Adaptive subgradient methods for online learning and stochastic optimization”. Journal of Machine Learning Research (JMLR) 12.Jul (2011), pp. 2121–2159.

[39] D. Kingma and J. Ba. “Adam: A method for stochastic optimization”. arXiv preprint arXiv:1412.6980 (2014).