{"title": "Optimal Algorithms for Continuous Non-monotone Submodular and DR-Submodular Maximization", "book": "Advances in Neural Information Processing Systems", "page_first": 9594, "page_last": 9604, "abstract": "In this paper we study the fundamental problems of maximizing a continuous non monotone submodular function over a hypercube, with and without coordinate-wise concavity. This family of optimization problems has several applications in machine learning, economics, and communication systems. Our main result is the first 1/2 approximation algorithm for continuous submodular function maximization; this approximation factor of is the best possible for algorithms that use only polynomially many queries. For the special case of DR-submodular maximization, we provide a faster 1/2-approximation algorithm that runs in (almost) linear time. Both of these results improve upon prior work [Bian et al., 2017, Soma and Yoshida, 2017, Buchbinder et al., 2012].\n\nOur first algorithm is a single-pass algorithm that uses novel ideas such as reducing the guaranteed approximation problem to analyzing a zero-sum game for each coordinate, and incorporates the geometry of this zero-sum game to fix the value at this coordinate. Our second algorithm is a faster single-pass algorithm that\nexploits coordinate-wise concavity to identify a monotone equilibrium condition sufficient for getting the required approximation guarantee, and hunts for the equilibrium point using binary search. We further run experiments to verify the performance of our proposed algorithms in related machine learning applications.", "full_text": "Optimal Algorithms for Continuous Non-monotone\nSubmodular and DR-Submodular Maximization\n\nRad Niazadeh\u21e4\n\nTim Roughgarden\u2020\n\nDepartment of Computer Science\n\nStanford University, Stanford, CA 95130\n\nDepartment of Computer Science\n\nStanford University, Stanford, CA 95130\n\nrad@cs.stanford.edu\n\ntim@cs.stanford.edu\n\nJoshua R. 
Wang\n\nGoogle, Mountain View, CA 94043\n\njoshuawang@google.com\n\nAbstract\n\nIn this paper we study the fundamental problems of maximizing a continuous non-monotone submodular function over a hypercube, with and without coordinate-wise concavity. This family of optimization problems has several applications in machine learning, economics, and communication systems. Our main result is the first 1/2-approximation algorithm for continuous submodular function maximization; the approximation factor of 1/2 is the best possible for algorithms that use only polynomially many queries. For the special case of DR-submodular maximization, i.e., when the submodular function is also coordinate-wise concave along all coordinates, we provide a faster 1/2-approximation algorithm that runs in almost linear time. Both of these results improve upon prior work [Bian et al., 2017a,b, Soma and Yoshida, 2017, Buchbinder et al., 2012, 2015].\nOur first algorithm is a single-pass algorithm that uses novel ideas such as reducing the guaranteed approximation problem to analyzing a zero-sum game for each coordinate, and incorporates the geometry of this zero-sum game to fix the value at this coordinate. Our second algorithm is a faster single-pass algorithm that exploits coordinate-wise concavity to identify a monotone equilibrium condition sufficient for getting the required approximation guarantee, and hunts for the equilibrium point using binary search. We further run experiments to verify the performance of our proposed algorithms in related machine learning applications.\n\n1 Introduction\n\nSubmodular optimization is a sweet spot between tractability and expressiveness, with numerous applications in machine learning (e.g., Krause and Golovin [2014], and see below) and with many algorithms that are both practical and enjoy good rigorous guarantees (e.g., Buchbinder et al. [2012, 2015]). 
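As a quick self-contained illustration of the tractable structure referenced here, the following Python sketch checks the diminishing-returns inequality for a small coverage function, a canonical submodular set function. The coverage sets are made up for illustration:

```python
from itertools import chain, combinations

# Hypothetical coverage instance: f(S) = size of the union of the chosen sets.
# Coverage functions are a canonical example of submodular set functions.
SETS = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c", "d", "e"}}

def f(S):
    # Value of the empty set is 0; otherwise, size of the union.
    return len(set().union(*(SETS[i] for i in S))) if S else 0

def subsets(xs):
    # All subsets of xs, as tuples.
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

# Diminishing returns: for all S <= T and e not in T,
#   f(S + e) - f(S) >= f(T + e) - f(T).
ground = list(SETS)
for S in map(set, subsets(ground)):
    for T in map(set, subsets(ground)):
        if S <= T:
            for e in set(ground) - T:
                assert f(S | {e}) - f(S) >= f(T | {e}) - f(T)
```

This diminishing-returns form is equivalent, for set functions, to the lattice inequality stated next.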
In general, a real-valued function F defined on a lattice L is submodular if and only if\n\nF(x ∨ y) + F(x ∧ y) ≤ F(x) + F(y)\n\nfor all x, y ∈ L, where x ∨ y and x ∧ y denote the join and meet, respectively, of x and y in the lattice L. Such functions are generally neither convex nor concave. In one of the most commonly studied examples, L is the lattice of subsets of a fixed ground set (or a sublattice thereof), with union and intersection playing the roles of join and meet, respectively.\n\n*Rad Niazadeh was supported by Stanford Computer Science Motwani Fellowship.\n†Tim Roughgarden was supported in part by Google Faculty Grant, Guggenheim Fellowship, and NSF Grants CCF-1524062 and CCF-1813188.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\n\fThis paper concerns a different well-studied setting, where L is a hypercube (i.e., [0, 1]^n), with componentwise maximum and minimum serving as the join and meet, respectively.3 We consider the fundamental problem of (approximately) maximizing a continuous and nonnegative submodular function over the hypercube.4 The function F is given as a “black box” and can only be accessed by querying its value at a point. We are interested in algorithms that use at most a polynomial (in n) number of queries. We do not assume that F is monotone (otherwise the problem is trivial).\nWe next briefly mention four applications of maximizing a non-monotone submodular function over a hypercube that are germane to machine learning and other related application domains.5\nNon-concave quadratic programming. In this problem, the goal is to maximize F(x) = (1/2) x^T Hx + h^T x + c, where the off-diagonal entries of H are non-positive. One application of this problem is to large-scale price optimization on the basis of demand forecasting models [Ito and Fujimaki, 2016].\nMap inference for Determinantal Point Processes (DPP). 
DPPs are elegant probabilistic models that arise in statistical physics and random matrix theory. DPPs can be used as generative models in applications such as text summarization, human pose estimation, and news threading tasks [Kulesza et al., 2012]. The approach in Gillenwater et al. [2012] to the problem boils down to maximizing a suitable submodular function over the hypercube, accompanied by an appropriate rounding (see also [Bian et al., 2017a]). One can also regularize this objective function with an ℓ2-norm regularizer to avoid overfitting, and the function will still remain submodular.\nLog-submodularity and mean-field inference. Another probabilistic model that generalizes DPPs and all other strong Rayleigh measures [Li et al., 2016, Zhang et al., 2015] is the class of log-submodular distributions over sets, i.e., p(S) ∝ exp(F(S)) where F(·) is a set submodular function. MAP inference over this distribution has applications in machine learning [Djolonga and Krause, 2014]. One variational approach towards this MAP inference task is to use mean-field inference to approximate the distribution p with a product distribution x ∈ [0, 1]^n, which again boils down to submodular function maximization over the hypercube (see [Bian et al., 2017a]).\nRevenue maximization over social networks. Here, there is a seller who wants to sell a product over a social network of buyers. To do so, it freely assigns trial products and fractions thereof to the buyers in the network [Bian et al., 2017b, Hartline et al., 2008]. 
For this problem, one can reduce it\nto maximizing an objective function that takes into account two parts: the revenue gain from those\nwho did not get a free product, where the revenue function for any such buyer is a non-negative\nnon-decreasing and submodular function Ri(x); and the revenue loss from those who received the\nfree product, where the revenue function for any such buyer is a non-positive non-increasing and\nsubmodular function \u00afRi(x). The combination for all buyers is a non-monotone submodular function.\nIt also is non-negative at ~0 and ~1, by extending the model and accounting for extra revenue gains\nfrom buyers with free trials.\n\nOur Results. Maximizing a submodular function over the hypercube is at least as dif\ufb01cult as over\nthe subsets of a ground set.6 For the latter problem, the best approximation ratio achievable by an\nalgorithm making a polynomial number of queries is 1\n2; the (information-theoretic) lower bound is\ndue to [Feige et al., 2007, 2011], the optimal algorithm to [Buchbinder et al., 2012, 2015]. Thus, the\nbest-case scenario for maximizing a submodular function over the hypercube (using polynomially\nmany queries) is a 1\n2-approximation. The main result of this paper achieves this best-case scenario:\n\nThere is an algorithm for maximizing a continuous submodular function over\nthe hypercube that guarantees a 1\n2-approximation while using only a polynomial\nnumber of queries to the function under mild continuity assumptions.\n\nOur algorithm is inspired by the bi-greedy algorithm of Buchbinder et al. 
[2015], which maximizes\na submodular set function; it maintains two solutions initialized at ~0 and ~1, goes over coordinates\n\n3Our results also extend easily to arbitrary axis-aligned boxes (i.e., \u201cbox constraints\u201d).\n4More generally, the function only has to be nonnegative at the points ~0 and ~1.\n5See the supplement for more details on these applications.\n6An instance of the latter problem can be converted to one of the former by extending the given set function f\n(with domain viewed as {0, 1}n) to its multilinear extension F de\ufb01ned on the hypercube (where F(x) =\nPS\u2713[n]Qi2S xiQi /2S(1 xi)f (S)). Sampling based on an \u21b5-approximate solution for the multilinear\nextension yields an equally good approximate solution to the original problem.\n\n2\n\n\fsequentially, and makes the two solutions agree on each coordinate. The algorithmic question here is\nhow to choose the new coordinate value for the two solutions, so that the algorithm gains enough\nvalue relative to the optimum in each iteration. Prior to our work, the best-known result was a\n3-approximation [Bian et al., 2017b], which generalized the simple non-optimal 1\n3-approximation\n1\ndeterministic bi-greedy algorithm of [Buchbinder et al., 2012, 2015] for set functions to continuous\ndomains. However, to get the optimal approximation factor and systematically passing the barrier of\npure continuous submodularity, our algorithm requires a number of new ideas, including a reduction\nto the analysis of a zero-sum game for each coordinate, and the use of the special geometry of this\ngame to bound the value of the game at its equilibrium. See Section 2 for more details.\nThe second and third applications above induce objective functions that, in addition to being sub-\nmodular, are concave in each coordinate. 
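A quick numerical sanity check of this distinction on toy quadratics F(x) = ½xᵀHx + hᵀx (the matrices below are made up for illustration): submodularity needs only non-positive off-diagonal entries of H, while coordinate-wise concavity additionally needs a non-positive diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)

def F(x, H, h):
    # Toy quadratic objective; submodular iff off-diagonals of H are <= 0.
    return 0.5 * x @ H @ x + h @ x

# Off-diagonals <= 0 but positive diagonal: submodular, yet convex (not
# concave) along each coordinate.
H_weak = np.array([[1.0, -2.0], [-2.0, 1.0]])
# All entries <= 0: submodular AND concave in each coordinate.
H_strong = np.array([[-1.0, -2.0], [-2.0, -1.0]])
h = np.array([1.0, 1.0])

for H in (H_weak, H_strong):
    for _ in range(1000):
        x, y = rng.random(2), rng.random(2)
        join, meet = np.maximum(x, y), np.minimum(x, y)
        # Submodularity over the hypercube:
        #   F(x v y) + F(x ^ y) <= F(x) + F(y).
        assert F(join, H, h) + F(meet, H, h) <= F(x, H, h) + F(y, H, h) + 1e-12
```

Both matrices pass the submodularity check; only `H_strong` also has the coordinate-wise concavity that the DR-submodular case below exploits.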
7 This class of functions is called DR-submodular in the\nliterature (e.g., in [Soma and Yoshida, 2015] and based on diminishing returns de\ufb01ned in [Kapralov\net al., 2013]). Here, an optimal 1\n2-approximation algorithm was recently already known on integer\nlattices [Soma and Yoshida, 2017], that can easily be generalized to our continuous setting as well;\nour contribution is a signi\ufb01cantly faster such bi-greedy algorithm. The main idea here is to identify a\nmonotone equilibrium condition suf\ufb01cient for getting the required approximation guarantee, which\nenables a binary search-type solution. See Section 3 for more details.\nWe also run experiments to verify the performance of our proposed algorithms in practical machine\nlearning applications. We observe that our algorithms match the performance of the prior work, while\nproviding either a better guaranteed approximation or a better running time.\n\nFurther Related Work. Buchbinder and Feldman [2016] derandomize the bi-greedy algorithm.\nStaib and Jegelka [2017] apply continuous submodular optimization to budget allocation, and develop\na new submodular optimization algorithm to this end. Hassani et al. [2017] give a 1\n2-approximation\nfor monotone continuous DR-submodular functions under convex constraints, which is later improved\ne )-approximation in Mokhtari et al. [2018] (even for stochastic functions). Gotovos et al.\nto (1 1\n[2015] consider (adaptive) submodular maximization when feedback is given after an element is\nchosen. Chen et al. [2018], Roughgarden and Wang [2018] consider submodular maximization in the\ncontext of online no-regret learning. Mirzasoleiman et al. [2013] show how to perform submodular\nmaximization with distributed computation. Submodular minimization is studied in Schrijver [2000],\nIwata et al. [2001]. See Bach et al. [2013] for a survey on more applications in machine learning.\n\nVariations of Continuous Submodularity. 
We consider non-monotone non-negative continuous submodular functions, i.e., F : [0, 1]^n → [0, 1] s.t. ∀x, y ∈ [0, 1]^n, F(x) + F(y) ≥ F(x ∨ y) + F(x ∧ y), where ∨ and ∧ are coordinate-wise max and min operations. Two related properties are weak Diminishing Returns Submodularity (weak DR-SM) and strong Diminishing Returns Submodularity (strong DR-SM) [Bian et al., 2017b], formally defined below. Indeed, weak DR-SM is equivalent to submodularity (see Proposition 3 in the supplement), and hence we use these terms interchangeably.\nDefinition 1 (Weak/Strong DR-SM). Consider a continuous function F : [0, 1]^n → [0, 1]:\n• Weak DR-SM (continuous submodular): ∀i ∈ [n], ∀x_{-i} ≤ y_{-i} ∈ [0, 1]^{n-1}, ∀z ∈ [0, 1], and ∀δ ≥ 0:\n\nF(z + δ, x_{-i}) − F(z, x_{-i}) ≥ F(z + δ, y_{-i}) − F(z, y_{-i})\n\n• Strong DR-SM (DR-submodular): ∀i ∈ [n], ∀x ≤ y ∈ [0, 1]^n, and ∀δ ≥ 0:\n\nF(x_i + δ, x_{-i}) − F(x) ≥ F(y_i + δ, y_{-i}) − F(y)\n\nAs simple corollaries, a twice-differentiable F is strong DR-SM if and only if all the entries of its Hessian are non-positive, and weak DR-SM if and only if all of the off-diagonal entries of its Hessian are non-positive. Also, weak DR-SM together with concavity along each coordinate is equivalent to strong DR-SM (see Proposition 3 in the supplementary materials for more details).\n\nCoordinate-wise Lipschitz Continuity. Consider univariate functions generated by fixing all but one of the coordinates of the original function F(·). In future sections, we sometimes require mild technical assumptions on the Lipschitz continuity of these single dimensional functions.\n\n7However, after regularization the function still remains submodular, but can lose coordinate-wise concavity.\n\n3\n\n\fFigure 1: Continuous curve r(z) in R^2 (dark blue), positive-orthant concave envelope (red).\n\nDefinition 2 (Coordinate-wise Lipschitz). A function F : [0, 1]^n → 
[0, 1] is coordinate-wise\nLipschitz continuous if there exists a constant C > 0 such that 8i 2 [n], 8xi 2 [0, 1]n, the single\nvariate function F(\u00b7, xi) is C-Lipschitz continuous, i.e.,\n\n8z1, z2 2 [0, 1] :\n\n|F(z1, xi) F (z2, xi)|\uf8ff C|z1 z2|\n\n2 Weak DR-SM Maximization: Continuous Randomized Bi-Greedy\n\n2-approximation algorithm (up to additive error ) for maximizing a contin-\nOur \ufb01rst main result is a 1\nuous submodular function F, a.k.a. weak DR-SM, which is information-theoretically optimal [Feige\net al., 2007, 2011]. This result assumes that F is coordinate-wise Lipschitz continuous.8 Before\ndescribing our algorithm, we introduce the notion of the positive-orthant concave envelope of a\ntwo-dimensional curve, which is useful for understanding our algorithm.\nDe\ufb01nition 3. Consider a curve r(z) = (g(z), h(z)) 2 R2 over the interval z 2 [Zl, Zu] such that:\n\n1. g : [Zl, Zu] ! [1,\u21b5 ] and h : [Zl, Zu] ! [1, ] are both continuous,\n2. g(Zl) = h(Zu) = 0, and h(Zl) = 2 [0, 1], g(Zu) = \u21b5 2 [0, 1].\n\nThen the positive-orthant concave envelope of r(\u00b7), denoted by conc-env(r), is the smallest concave\ncurve in the positive-orthant upper-bounding all the points {r(z) : z 2 [Zl, Zu]} (see Figure 1), i.e.,\nconc-env(r) , upper-face\u2713conv ({r(z) : z 2 [Zl, Zu]}) \\\u21e2(g0, h0) 2 [0, 1]2 :\n\u21b5 1\u25c6\nWe start by describing a vanilla version of our algorithm for maximizing F over the unit hypercube,\ntermed as continuous randomized bi-greedy (Algorithm 1). This version assumes blackbox oracle\naccess to algorithms for a few computations involving univariate functions of the form F(., xi) (e.g.,\nmaximization over [0, 1], computing conc-env(.), etc.). We \ufb01rst prove that the vanilla algorithm\n\ufb01nds a solution with an objective value of at least 1\n2 of the optimum. In Section 2.2, we show how to\napproximately implement these oracles in polynomial time when F is coordinate-wise Lipschitz.\nTheorem 1. 
If F(·) is non-negative and continuous submodular (or equivalently is weak DR-SM), then Algorithm 1 is a randomized 1/2-approximation algorithm, i.e., it returns ẑ ∈ [0, 1]^n s.t.\n\n2E[F(ẑ)] ≥ F(x*),\n\nwhere x* ∈ argmax_{x ∈ [0,1]^n} F(x) is the optimal solution.\n\n8Such an assumption is necessary, since otherwise the single-dimensional problem amounts to optimizing an arbitrary function and is hence intractable. Prior works, e.g., Bian et al. [2017b] and Bian et al. [2017a], implicitly require such an assumption to perform single-dimensional optimization.\n\n4\n\n\fFigure 2: Pentagon (M0, M1, Q1, Q2, M2) = ADV player's positive region against a mixed strategy over two points P1 and P2.\n\n2.1 Analysis of the Continuous Randomized Bi-Greedy (Proof of Theorem 1)\n\nWe start by defining these vectors, used in our analysis in the same spirit as Buchbinder et al. [2015]:\n\n∀i ∈ [n] : X(i) ≜ (ẑ1, . . . , ẑi, 0, 0, . . . , 0), X(0) ≜ (0, . . . , 0)\n∀i ∈ [n] : Y(i) ≜ (ẑ1, . . . , ẑi, 1, 1, . . . , 1), Y(0) ≜ (1, . . . , 1)\n∀i ∈ [n] : O(i) ≜ (ẑ1, . . . , ẑi, x*_{i+1}, . . . , x*_n), O(0) ≜ (x*_1, . . . , x*_n)\n\nNote that X(i) and Y(i) (or X(i−1) and Y(i−1)) are the values of X and Y at the end of (or at the beginning of) the ith iteration of Algorithm 1. In the remainder of this section, we give the high-level proof ideas and present some proof sketches. See the supplementary materials for the formal proofs.\n\nReduction to Coordinate-wise Zero-sum Games. For each coordinate i ∈ [n], we consider a sub-problem. In particular, define a two-player zero-sum game played between the algorithm player (denoted by ALG) and the adversary player (denoted by ADV). ALG selects a (randomized) strategy ẑi ∈ [0, 1], and ADV selects a (randomized) strategy x*_i ∈ [0, 1]. Recall the descriptions of g(z) and 
Recall the descriptions of g(z) and\nh(z) at iteration i of Algorithm 1,:\n\ng(z) = F(z, X(i1)\ni\n\n) F (Zl, X(i1)\ni\n\n) , h(z) = F(z, Y(i1)\n\ni\n\n) F (Zu, Y(i1)\n\ni\n\n).\n\nWe now de\ufb01ne the utility of ALG (negative of the utility of ADV) in our zero-sum game as follows:\n\nV (i)(\u02c6zi, x\u21e4i ) , 1\n\n2\n\ng(\u02c6zi) +\n\n1\n2\n\nh(\u02c6zi) max (g(x\u21e4i ) g(\u02c6zi), h(x\u21e4i ) h(\u02c6zi)) .\n\n(1)\n\nSuppose the expected utility of ALG is non-negative at the equilibrium of this game. In particular,\nsuppose ALG\u2019s randomized strategy \u02c6zi (in Algorithm 1) guarantees that for every strategy x\u21e4i of ADV\nthe expected utility of ALG is non-negative. If this statement holds for all of the zero-sum games\ncorresponding to different iterations i 2 [n], then Algorithm 1 is a 1\n2-approximation of the optimum.\nLemma 1. If 8i 2 [n] : E\u21e5V (i)(\u02c6zi, x\u21e4i )\u21e4 /n for constant > 0, then 2E [F(\u02c6z)] F (x\u21e4) .\n\n5\n\n\fAlgorithm 1: (Vanilla) Continuous Randomized Bi-Greedy\ninput: function F : [0, 1]n ! [0, 1] ;\noutput: vector \u02c6z = (\u02c6z1, . . . , \u02c6zn) 2 [0, 1]n ;\nInitialize X (0, . . . , 0) and Y (1, . . . , 1) ;\nfor i = 1 to n do\nZl 2 argmax\nZu 2 argmax\n\n;\n\nz2[0,1] F(z, Yi)\nz2[0,1] F(z, Xi)\n\nFind Zu, Zl 2 [0, 1] such that8><>:\nif Zu \uf8ff Zl then\n\u02c6zi Zl ;\nelse\n8z 2 [Zl, Zu], let(g(z) , F(z, Xi) F (Zl, Xi),\nh(z) , F(z, Yi) F (Zu, Yi),\nLet \u21b5 , g(Zu) and , h(Zl) ;\nLet r(z) , (g(z), h(z)) be a continuous two-dimensional curve in [1,\u21b5 ] \u21e5 [1, ] ;\nCompute conc-env(r) (i.e. 
positive-orthant concave envelope of r(t) as in De\ufb01nition 3) ;\nFind point P , intersection of conc-env(r) and the line h0 = g0 \u21b5 on g-h plane ;\nSuppose P = P1 + (1 )P2, where 2 [0, 1] and Pj = r(z(j)), z(j) 2 [Zl, Zu] for\n// see Figure 2\nRandomly pick \u02c6zi such that(\u02c6zi z(1) with probablity \nLet Xi \u02c6zi and Yi \u02c6zi ;\n\nj = 1, 2, and both points are also on the conc-env(r) ;\n\n// note that \u21b5, 0\n\n;\n\n;\n\n\u02c6zi z(2) o.w.\n\n// after this, X and Y will agree at coordinate i\n\nProof sketch. Our bi-greedy approach, \u00e1 la Buchbinder et al. [2012, 2015], revolves around analyzing\nthe evolving values of three points: X(i), Y(i), and O(i). These three points begin at all-zeroes,\nall-ones, and the optimum solution, respectively, and converge to the algorithm\u2019s \ufb01nal point. In each\niteration, we aim to relate the total increase in value of the \ufb01rst two points with the decrease in value\nof the third point. If we can show that the former quantity is at least twice the latter quantity, then a\ntelescoping sum proves that the algorithm\u2019s \ufb01nal choice of point scores at least half that of optimum.\nThe utility of our game is speci\ufb01cally engineered to compare the total increase in value of the \ufb01rst\ntwo points with the decrease in value of the third point. The positive term of the utility is half of this\nincrease in value, and the negative term is a bound on how large in magnitude the decrease in value\nmay be. As a result, an overall nonnegative utility implies that the increase beats the decrease by a\nfactor of two, exactly the requirement for our bi-greedy approach to work. Finally, an additive slack\nof /n in the utility of each game sums over n iterations for a total additive slack of .\n\nAnalyzing the Zero-sum Games. Fix an iteration i 2 [n] of Algorithm 1. We then have the\nfollowing.\nProposition 1. 
If ALG plays the (randomized) strategy \u02c6zi as described in Algorithm 1, then we have\nE\u21e5V (i)(\u02c6zi, x\u21e4i )\u21e4 0 against any strategy x\u21e4i of ADV.\nProof of Proposition 1. We do the proof by case analysis over two cases:\n\u21e4 Case Zl Zu (easy): See the supplementary materials for this case.\n\u21e4 Case Zl < Zu (hard): In this case, ALG plays a mixed strategy over two points. To determine\nthe two-point support, it considers the curve r = {(g(z), h(z))}z2[Zl,Zu] and \ufb01nds a point P on\nconc-env(r) (i.e., De\ufb01nition 3) that lies on the line h0 = g0 \u21b5, where recall that \u21b5 =\ng(Zu) 0 and = g(Zl) 0 (as Zu and Zl are the maximizers of F(z, X(i1)\n) and F(z, Y(i1)\n)\ni\nrespectively). Because this point is on the concave envelope it should be a convex combination of two\npoints on the curve r(z). Lets say P = P1 + (1 )P2, where P1 = r(z(1)) and P2 = r(z(2)), and\n 2 [0, 1]. The \ufb01nal strategy of ALG is a mixed strategy over {z(1), z(2)} with probabilities (, 1 ).\nFixing any mixed strategy of ALG over two points P1 = (g1, h1) and P2 = (g2, h2) with probabilities\n\ni\n\n6\n\n\f(, 1 ) (denoted by FP), de\ufb01ne the ADV\u2019s positive region, i.e.\n\n(g0, h0) 2 [1, 1] \u21e5 [1, 1] : E(g,h)\u21e0FP\uf8ff 1\n\n2\n\ng +\n\n1\n2\n\nh max(g0 g, h0 h) 0.\n\n2 h2 + 1\n\n2 h1 + 1\n\n2 g1 + 1\n\n2 h1) + (1 )( 3\n\n2 g2 + 1\n\nNow, suppose ALG plays a mixed strategy with the property that its corresponding ADV\u2019s positive\nregion covers the entire curve {g(z), h(z)}z2[0,1]. Then, for any strategy x\u21e4i of ADV the expected\nutility of ALG is non-negative. In the rest of the proof, we geometrically characterize the ADV\u2019s positive\nregion against a mixed strategy of ALG over any 2-point support, and then we show for the particular\nchoice of P1, P2 and in Algorithm 1 the positive region covers the entire curve {g(z), h(z)}z2[0,1].\nLemma 2. 
Suppose ALG plays a 2-point mixed strategy over P1 = r(z(1)) = (g1, h1) and P2 =\nr(z(1)) = (g2, h2) with probabilities (, 1 ), and w.l.o.g. h1 g1 h2 g2. Then ADV\u2019s positive\nregion is the pentagon (M0,M1,Q1,Q2,M2), where M0 = (1,1) and (see Figure 2):\n\n2 g1) + (1 )( 3\n\n2 g2),\n2 h2),1,\n\n1. M1 =1, ( 3\n2. M2 =( 3\n3. Q1 is the intersection of the lines leaving P1 with slope 1 and leaving M1 along the g-axis,\n4. Q2 is the intersection of the lines leaving P2 with slope 1 and leaving M2 along the h-axis.\nBy applying Lemma 2, we have the following main technical lemma. The proof is geometric and is\npictorially visible in Figure 2. This lemma \ufb01nishes the proof of Proposition 1.\nLemma 3 (main lemma). If ALG plays the two point mixed strategy described in Algorithm 1, then\nfor every x\u21e4i 2 [0, 1] the point (g0, h0) = (g(x\u21e4i ), h(x\u21e4i )) is in the ADV\u2019s positive region.\nProof sketch. For simplicity assume Zl = 0 and Zu = 1. To understand the ADV\u2019s positive region\nthat results from playing a two-point mixed strategy by ALG, we consider the positive region that\nresults from playing a one point pure strategy. When ALG chooses a point (g, h), the positive term of\nthe utility is one-half of its one-norm. The negative term of the utility is the worse between how much\nthe ADV\u2019s point is above ALG\u2019s point, and how much it is to the right of ALG\u2019s point. The resulting\npositive region is de\ufb01ned by an upper boundary g0 \uf8ff 3\n2 h.\nNext, let\u2019s consider what happens when we pick point (g1, h1) with probability and point (g2, h2)\nwith probability (1 ). We can compute the expected point: let (g3, h3) = (g1, h1) + (1 \n)(g2, h2). As suggested by Lemma 2, the positive region for our mixed strategy has three boundary\nconditions: an upper boundary, a right boundary, and a corner-cutting boundary. The \ufb01rst two\nboundary conditions correspond to a pure strategy which picks (g3, h3). 
By design, (g3, h3) is\nlocated so that these boundaries cover the entire [1,\u21b5 ] \u21e5 [1, ] rectangle. This leaves us with\nanalyzing the corner-cutting boundary, which is the focus of Figure 2. As it turns out, the intersections\nof this boundary with the two other boundaries lie on lines of slope 1 extending from (gj, hj)j=1,2.\nIf we consider the region between these two lines, the portion under the envelope (where the curve r\nmay lie) is distinct from the portion outside the corner-cutting boundary. However, if r were to ever\nviolate the corner-cutting boundary condition without violating the other two boundary conditions, it\nmust do so in this region. Hence the resulting positive region covers the entire curve r, as desired.\n\n2 h and a right boundary h0 \uf8ff 1\n\n2 g + 1\n\n2 g + 3\n\n2.2 Polynomial-time Implementation under Lipschitz Continuity: Overview\nAt each iteration, Algorithm 1 interfaces with F in two ways: (i) when performing optimization\nto compute Zl, Zu and (ii) when computing the upper-concave envelope. In both cases, we are\nconcerned with univariate projections of F, namely F(z, Xi) and F(z, Yi. Assuming F is\ncoordinate-wise Lipschitz continuous with constant C > 0, we choose a small \u270f> 0 and take\nperiodic samples at \u270f-spaced intervals from each one of these functions, for a total of O( 1\n\u270f ) samples.\nTo perform task (i), we simply return the the sample which resulted in the maximum function value.\nSince the actual maximum is \u270f-close to one of the samples, our maximum is at most an additive \u270fC\nlower in value. To perform task (ii), we use these samples to form an approximate r(z) curve, denoted\nby \u02c6r(z). Note that we then proceed exactly as described in Algorithm 1 to pick a (randomized) strategy\n\n7\n\n\f\u02c6zi using \u02c6r(z). Note that ADV can actually choose a point on the exact curve r(z). 
However the point\nshe chooses is close to one of our samples and hence is at most an additive \u270fC better in value with\nrespect to functions g(.) and h(.). Furthermore, we can compute the upper-concave envelope \u02c6r(z) in\ntime linear in the number of samples using Graham\u2019s algorithm [Graham, 1972]. Roughly speaking,\nthis is because we can go through the samples in order of z-coordinate, avoiding the sorting cost of\nrunning Graham\u2019s on completely unstructured data. Formally, we have the following proposition.\nSee the supplementary materials for detailed implementations (Algorithm 3 and Algorithm 4).\nProposition 2. If F is coordinate-wise Lipschitz continuous with constant C > 0, then Algorithm 1\ncan be implemented with O(n2/\u270f) calls to F and returning a (randomized) point \u02c6z s.t.\n\n2E [F(\u02c6z)] F (x\u21e4) 2C\u270f,\n\nwhere x\u21e4 2 argmax\n\nx2[0,1]n F(x) is the optimal solution.\n\n3 Strong DR-SM Maximization: Binary-Search Bi-Greedy\n\nOur second result is a fast binary search algorithm, achieving the tight 1\n2-approximation factor (up to\nadditive error ) in quasi-linear time in n, but only for the special case of strong DR-SM functions\n(a.k.a. DR-submodular); see De\ufb01nition 1. This algorithm leverages the coordinate-wise concavity\nto identify a coordinate-wise monotone equilibrium condition. In each iteration, it hunts for an\nequilibrium point by using binary search. Satisfying the equilibrium at each iteration then guarantees\nthe desired approximation factor. Formally we propose Algorithm 2. As a technical assumption, we\n\nAlgorithm 2: Binary-Search Continuous Bi-greedy\ninput: function F : [0, 1]n ! [0, 1], error \u270f> 0 ;\noutput: vector \u02c6z = (\u02c6z1, . . . , \u02c6zn) 2 [0, 1]n ;\nInitialize X (0, . . . , 0) and Y (1, . . . 
, 1) ;\nfor i = 1 to n do\nif ∂F/∂xi (0, X_{−i}) ≤ 0 then\nẑi ← 0\nelse if ∂F/∂xi (1, Y_{−i}) ≥ 0 then\nẑi ← 1\nelse\n// we do binary search.\nwhile Yi − Xi > ε/n do\nLet ẑi ← (Xi + Yi)/2 ;\nif ∂F/∂xi (ẑi, X_{−i}) · (1 − ẑi) + ∂F/∂xi (ẑi, Y_{−i}) · ẑi > 0 then\n// we need to increase ẑi.\nSet Xi ← ẑi ;\nelse\n// we need to decrease ẑi.\nSet Yi ← ẑi ;\nLet Xi ← ẑi and Yi ← ẑi ;\n// after this, X and Y will agree at coordinate i\n\nassume F is Lipschitz continuous with some constant C > 0, so that we can relate the precision of our binary search with additive error. We arrive at the theorem, whose proof is in the supplement.\nTheorem 2. If F(·) is non-negative and DR-submodular (a.k.a. strong DR-SM) and is coordinate-wise Lipschitz continuous with constant C > 0, then Algorithm 2 runs in time O(n log(n/ε)) and is a deterministic 1/2-approximation algorithm up to O(ε) additive error, i.e., returns ẑ ∈ [0, 1]^n s.t.\n\n2F(ẑ) ≥ F(x*) − 2Cε,\n\nwhere x* ∈ argmax_{x ∈ [0,1]^n} F(x) is the optimal solution.\n\nRunning Time. If we show that f(z) ≜ ∂F/∂xi (z, X_{−i})(1 − z) + ∂F/∂xi (z, Y_{−i}) z is monotone non-increasing in z, then clearly the binary search terminates in O(log(n/ε)) steps (note that the algorithm only does binary search in the case when f(0) > 0 and f(1) < 0). To see the monotonicity,\n\nf′(z) = (1 − z) ∂²F/∂xi² (z, X_{−i}) + z ∂²F/∂xi² (z, Y_{−i}) + (∂F/∂xi (z, Y_{−i}) − ∂F/∂xi (z, X_{−i})) ≤ 0\n\nwhere the inequality holds due to strong DR-SM and the fact that all of the Hessian entries (including diagonal) are non-positive. 
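The binary-search bi-greedy can be sketched in Python as follows, using two-sided numerical partial derivatives on a toy strong DR-SM quadratic. The instance (H, h) is made up for illustration; this is a minimal sketch, not the supplement's implementation:

```python
import numpy as np

def partial(F, x, i, delta=1e-6):
    # Two-sided numerical partial derivative of F along coordinate i.
    lo, hi = x.copy(), x.copy()
    lo[i] -= delta
    hi[i] += delta
    return (F(hi) - F(lo)) / (2 * delta)

def binary_search_bigreedy(F, n, eps=1e-4):
    # Maintain X (starts all-zeros) and Y (starts all-ones) and make them
    # agree coordinate by coordinate.
    X, Y = np.zeros(n), np.ones(n)
    for i in range(n):
        if partial(F, X, i) <= 0:
            z = 0.0
        elif partial(F, Y, i) >= 0:
            z = 1.0
        else:
            # Binary search for the root of
            #   f(z) = dF/dx_i(z, X_-i) (1 - z) + dF/dx_i(z, Y_-i) z,
            # which is non-increasing under strong DR-SM.
            lo, hi = 0.0, 1.0
            while hi - lo > eps / n:
                z = (lo + hi) / 2
                Xz, Yz = X.copy(), Y.copy()
                Xz[i] = Yz[i] = z
                if partial(F, Xz, i) * (1 - z) + partial(F, Yz, i) * z > 0:
                    lo = z  # need to increase z_i
                else:
                    hi = z  # need to decrease z_i
            z = (lo + hi) / 2
        X[i] = Y[i] = z  # X and Y now agree at coordinate i
    return X

# Toy strong DR-SM instance: all Hessian entries non-positive.
H = np.array([[-2.0, -1.0], [-1.0, -2.0]])
h = np.array([2.0, 1.5])
F = lambda x: 0.5 * x @ H @ x + h @ x
z_hat = binary_search_bigreedy(F, 2)
```

On this instance the first equilibrium condition solves 2 − 3z = 0 (so ẑ1 ≈ 2/3), after which the two partial derivatives coincide and the second coordinate is an ordinary root-find.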
Hence the total running time is O(n log(n/ε)).\n\n4 Experimental Results\n\nWe empirically measure the solution quality of three algorithms: Algorithm 1 (GAME), Algorithm 2 (BINARY) and the Bi-Greedy algorithm of Bian et al. [2017b] (BMBK). These are all based on a double-greedy framework, which we implemented to iterate over coordinates in a random order. These algorithms also do not solely rely on oracle access to the function; they invoke one-dimensional optimizers, concave envelopes, and derivatives. We implement the first and the second (Algorithm 3 and Algorithm 4 in the supplement), and numerically compute derivatives by discretization. We consider two application domains, namely Non-concave Quadratic Programming (NQP) [Bian et al., 2017b, Kim and Kojima, 2003, Luo et al., 2010], under both strong-DR and weak-DR, and maximization of the softmax extension for MAP inference of determinantal point processes [Kulesza et al., 2012, Gillenwater et al., 2012]. Each experiment consists of twenty repeated trials. For each experiment, we use n = 100 dimensional functions. Our experiments were implemented in Python. See the supplementary materials for the detailed specifics of each experiment. The results of our experiments are in Table 1, and the corresponding box and whisker plots are in Figure 3. The data suggest that for all three experiments the three algorithms obtain very similar objective values. For example, in the weak-DR NQP experiment, all three algorithms have standard deviations around 6 while their means differ by less than 1.\n\nFigure 3: Box and whisker plots of our experimental results. (a) Strong DR-SM NQP; (b) Weak DR-SM NQP; (c) Strong DR-SM Softmax.\n\nTable 1: Experimental results listing mean and standard deviation over T = 20 repeated trials with dimension n = 100.\n\nNQP, ∀i, j : Hi,j ≤ 0 (strong-DR) | NQP, ∀i ≠ j : Hi,j ≤ 0 (weak-DR) | Softmax Ext. (strong-DR)\nGAME: 1225.416454 ± 8.201871 | 1200.860403 ± 6.009484 | 24.056934 ± 3.794209\nBINARY: 1225.392136 ± 8.203827 | 1200.248876 ± 6.088293 | 23.945428 ± 3.770932\nBMBK: 1225.339063 ± 8.141104 | 1200.798114 ± 5.975035 | 24.055435 ± 3.796350\n\n5 Conclusion\n\nWe proposed a tight approximation algorithm for continuous submodular maximization, and a quasi-linear time tight approximation algorithm for the special case of DR-submodular maximization. Our experiments also verify the applicability of these algorithms in practical domains in machine learning. One interesting avenue for future research is to generalize our techniques to the maximization over any arbitrary separable convex set, which will result in a broader application domain.\n\n9\n\n\fReferences\nAnestis Antoniadis, Irène Gijbels, and Mila Nikolova. Penalized likelihood regression for generalized linear models with non-quadratic penalties. Annals of the Institute of Statistical Mathematics, 63(3):585–615, 2011.\n\nFrancis Bach et al. Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning, 6(2-3):145–373, 2013.\n\nAn Bian, Kfir Levy, Andreas Krause, and Joachim M Buhmann. Continuous DR-submodular maximization: Structure and algorithms. In Advances in Neural Information Processing Systems, pages 486–496, 2017a.\n\nAndrew An Bian, Baharan Mirzasoleiman, Joachim Buhmann, and Andreas Krause. Guaranteed non-convex optimization: Submodular maximization over continuous domains. In Artificial Intelligence and Statistics, pages 111–120, 2017b.\n\nNiv Buchbinder and Moran Feldman. Deterministic algorithms for submodular maximization problems. In Proceedings of the twenty-seventh annual ACM-SIAM symposium on Discrete algorithms, pages 392–403. SIAM, 2016.\n\nNiv Buchbinder, Moran Feldman, Joseph Seffi Naor, and Roy Schwartz. 
A tight linear time (1/2)-approximation for unconstrained submodular maximization. In Proceedings of the 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science, pages 649–658. IEEE Computer Society, 2012.

Niv Buchbinder, Moran Feldman, Joseph Seffi, and Roy Schwartz. A tight linear time (1/2)-approximation for unconstrained submodular maximization. SIAM Journal on Computing, 44(5):1384–1402, 2015.

Lin Chen, Hamed Hassani, and Amin Karbasi. Online continuous submodular maximization. In International Conference on Artificial Intelligence and Statistics, pages 1896–1905, 2018.

Josip Djolonga and Andreas Krause. From MAP to marginals: Variational inference in Bayesian submodular models. In Advances in Neural Information Processing Systems, pages 244–252, 2014.

Uriel Feige, Vahab S Mirrokni, and Jan Vondrak. Maximizing non-monotone submodular functions. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07), pages 461–471. IEEE, 2007.

Uriel Feige, Vahab S Mirrokni, and Jan Vondrak. Maximizing non-monotone submodular functions. SIAM Journal on Computing, 40(4):1133–1153, 2011.

Jennifer Gillenwater, Alex Kulesza, and Ben Taskar. Near-optimal MAP inference for determinantal point processes. In Advances in Neural Information Processing Systems, pages 2735–2743, 2012.

Alkis Gotovos, Amin Karbasi, and Andreas Krause. Non-monotone adaptive submodular maximization. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

Ronald L Graham. An efficient algorithm for determining the convex hull of a finite planar set. Information Processing Letters, 1(4):132–133, 1972.

Jason Hartline, Vahab Mirrokni, and Mukund Sundararajan. Optimal marketing strategies over social networks.
In Proceedings of the 17th International Conference on World Wide Web, pages 189–198. ACM, 2008.

Hamed Hassani, Mahdi Soltanolkotabi, and Amin Karbasi. Gradient methods for submodular maximization. In Advances in Neural Information Processing Systems, pages 5843–5853, 2017.

Shinji Ito and Ryohei Fujimaki. Large-scale price optimization via network flow. In Advances in Neural Information Processing Systems, pages 3855–3863, 2016.

Satoru Iwata, Lisa Fleischer, and Satoru Fujishige. A combinatorial strongly polynomial algorithm for minimizing submodular functions. Journal of the ACM (JACM), 48(4):761–777, 2001.

Michael Kapralov, Ian Post, and Jan Vondrák. Online submodular welfare maximization: Greedy is optimal. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1216–1225. SIAM, 2013.

Sunyoung Kim and Masakazu Kojima. Exact solutions of some nonconvex quadratic optimization problems via SDP and SOCP relaxations. Computational Optimization and Applications, 26(2):143–154, 2003.

Andreas Krause and Daniel Golovin. Submodular function maximization. In Tractability: Practical Approaches to Hard Problems, pages 71–104. Cambridge University Press, 2014.

Alex Kulesza, Ben Taskar, et al. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2–3):123–286, 2012.

Chengtao Li, Suvrit Sra, and Stefanie Jegelka. Fast mixing Markov chains for strongly Rayleigh measures, DPPs, and constrained sampling. In Advances in Neural Information Processing Systems, pages 4188–4196, 2016.

Zhi-Quan Luo, Wing-Kin Ma, Anthony Man-Cho So, Yinyu Ye, and Shuzhong Zhang. Semidefinite relaxation of quadratic optimization problems. IEEE Signal Processing Magazine, 27(3):20–34, 2010.

Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause.
Distributed submodular maximization: Identifying representative elements in massive data. In Advances in Neural Information Processing Systems, pages 2049–2057, 2013.

Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. Conditional gradient method for stochastic submodular maximization: Closing the gap. In International Conference on Artificial Intelligence and Statistics, pages 1886–1895, 2018.

Tim Roughgarden and Joshua R Wang. An optimal learning algorithm for online unconstrained submodular maximization. To appear in Proceedings of the 31st Conference on Learning Theory (COLT), 2018.

Alexander Schrijver. A combinatorial algorithm minimizing submodular functions in strongly polynomial time. Journal of Combinatorial Theory, Series B, 80(2):346–355, 2000.

Tasuku Soma and Yuichi Yoshida. A generalization of submodular cover via the diminishing return property on the integer lattice. In Advances in Neural Information Processing Systems, pages 847–855, 2015.

Tasuku Soma and Yuichi Yoshida. Non-monotone DR-submodular function maximization. In AAAI, volume 17, pages 898–904, 2017.

Matthew Staib and Stefanie Jegelka. Robust budget allocation via continuous submodular functions. arXiv preprint arXiv:1702.08791, 2017.

Jian Zhang, Josip Djolonga, and Andreas Krause. Higher-order inference for multi-class log-supermodular models. In Proceedings of the IEEE International Conference on Computer Vision, pages 1859–1867, 2015.