{"title": "Submodular Optimization with Submodular Cover and Submodular Knapsack Constraints", "book": "Advances in Neural Information Processing Systems", "page_first": 2436, "page_last": 2444, "abstract": "We investigate two new optimization problems \u2014 minimizing a submodular function subject to a submodular lower bound constraint (submodular cover) and maximizing a submodular function subject to a submodular upper bound constraint (submodular knapsack). We are motivated by a number of real-world applications in machine learning including sensor placement and data subset selection, which require maximizing a certain submodular function (like coverage or diversity) while simultaneously minimizing another (like cooperative cost). These problems are often posed as minimizing the difference between submodular functions [9, 25] which is in the worst case inapproximable. We show, however, that by phrasing these problems as constrained optimization, which is more natural for many applications, we achieve a number of bounded approximation guarantees. We also show that both these problems are closely related, and an approximation algorithm solving one can be used to obtain an approximation guarantee for the other. We provide hardness results for both problems thus showing that our approximation factors are tight up to log-factors. 
Finally, we empirically demonstrate the performance and good scalability properties of our algorithms.", "full_text": "Submodular Optimization with Submodular Cover and Submodular Knapsack Constraints\n\nRishabh Iyer\n\nDepartment of Electrical Engineering\n\nUniversity of Washington\n\nrkiyer@u.washington.edu\n\nJeff Bilmes\n\nDepartment of Electrical Engineering\n\nUniversity of Washington\n\nbilmes@u.washington.edu\n\nAbstract\n\nWe investigate two new optimization problems \u2014 minimizing a submodular\nfunction subject to a submodular lower bound constraint (submodular cover)\nand maximizing a submodular function subject to a submodular upper bound\nconstraint (submodular knapsack). We are motivated by a number of real-world\napplications in machine learning including sensor placement and data subset\nselection, which require maximizing a certain submodular function (like coverage\nor diversity) while simultaneously minimizing another (like cooperative cost).\nThese problems are often posed as minimizing the difference between submodular\nfunctions [9, 25] which is in the worst case inapproximable. We show, however,\nthat by phrasing these problems as constrained optimization, which is more natural\nfor many applications, we achieve a number of bounded approximation guarantees.\nWe also show that both these problems are closely related and an approximation\nalgorithm solving one can be used to obtain an approximation guarantee for\nthe other. We provide hardness results for both problems thus showing that\nour approximation factors are tight up to log-factors. Finally, we empirically\ndemonstrate the performance and good scalability properties of our algorithms.\n\n1 Introduction\n\nA set function f : 2^V \u2192 R is said to be submodular [4] if for all subsets S, T \u2286 V, it holds that\nf(S) + f(T) \u2265 f(S \u222a T) + f(S \u2229 T). 
Defining f(j|S) \u225c f(S \u222a j) \u2212 f(S) as the gain of j \u2208 V\nin the context of S \u2286 V, f is submodular if and only if f(j|S) \u2265 f(j|T) for all S \u2286 T and\nj \u2209 T. The function f is monotone iff f(j|S) \u2265 0, \u2200j \u2209 S, S \u2286 V. For convenience, we assume\nthe ground set is V = {1, 2, \u00b7\u00b7\u00b7 , n}. While general set function optimization is often intractable,\nmany forms of submodular function optimization can be solved near optimally or even optimally\nin certain cases. Submodularity, moreover, is inherent in a large class of real-world applications,\nparticularly in machine learning, which makes these methods extremely useful in practice.\nIn this paper, we study a new class of discrete optimization problems that have the following form:\nProblem 1 (SCSC): min{f(X) | g(X) \u2265 c},\nand Problem 2 (SCSK): max{g(X) | f(X) \u2264 b},\nwhere f and g are monotone non-decreasing submodular functions that also, w.l.o.g., are normalized\n(f(\u2205) = g(\u2205) = 0; a monotone non-decreasing normalized submodular function is called a polymatroid function), and where b and c refer to budget and cover parameters respectively. The\ncorresponding constraints are called the submodular cover [29] and submodular knapsack [1]\nrespectively and hence we refer to Problem 1 as Submodular Cost Submodular Cover (henceforth\nSCSC) and Problem 2 as Submodular Cost Submodular Knapsack (henceforth SCSK). Our motivation\nstems from an interesting class of problems that require minimizing a certain submodular function\nf while simultaneously maximizing another submodular function g. We shall see that these naturally\noccur in applications like sensor placement, data subset selection, and many other machine learning\napplications. 
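The definitions of submodularity and monotonicity, and the two problem statements above, can be checked by brute force on a toy instance. The following is a minimal sketch, not the method of the paper: the coverage functions, the ground set, and the helper names (subsets, coverage, scsc_brute_force, scsk_brute_force) are all illustrative assumptions, and the exhaustive search is only feasible for a handful of elements.

```python
from itertools import combinations

def subsets(ground):
    # All subsets of the ground set, as frozensets.
    items = sorted(ground)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def coverage(sets):
    # Toy polymatroid function: number of items covered by the chosen sets.
    def f(X):
        covered = set()
        for j in X:
            covered |= sets[j]
        return len(covered)
    return f

def is_submodular(f, ground):
    # Checks f(S) + f(T) >= f(S union T) + f(S intersect T) for all S, T.
    all_sets = subsets(ground)
    return all(f(S) + f(T) >= f(S | T) + f(S & T)
               for S in all_sets for T in all_sets)

def is_monotone(f, ground):
    # Checks that adding any element never decreases f.
    all_sets = subsets(ground)
    return all(f(S | {j}) >= f(S) for S in all_sets for j in ground)

def scsc_brute_force(f, g, ground, c):
    # Problem 1 (SCSC): min f(X) subject to g(X) >= c.
    feasible = [X for X in subsets(ground) if g(X) >= c]
    return min(feasible, key=f)

def scsk_brute_force(f, g, ground, b):
    # Problem 2 (SCSK): max g(X) subject to f(X) <= b.
    feasible = [X for X in subsets(ground) if f(X) <= b]
    return max(feasible, key=g)

# Hypothetical toy data: f plays the cooperative cost, g the coverage utility.
V = {1, 2, 3, 4}
f = coverage({1: {'a'}, 2: {'a', 'b'}, 3: {'c'}, 4: {'b', 'c'}})
g = coverage({1: {'u', 'v'}, 2: {'v', 'w'}, 3: {'x'}, 4: {'w', 'x'}})
X1 = scsc_brute_force(f, g, V, c=3)
X2 = scsk_brute_force(f, g, V, b=2)
```

On this instance the cheapest set reaching cover 3 has cost 2, and the best set within budget 2 has coverage 3, which illustrates how the two problems mirror each other.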
A standard approach used in literature [9, 25, 15] has been to transform these problems\ninto minimizing the difference between submodular functions (also called DS optimization):\n\nProblem 0: min_{X\u2286V} (f(X) \u2212 g(X)).    (1)\n\nWhile a number of heuristics are available for solving Problem 0, in the worst-case it is NP-hard\nand inapproximable [9], even when f and g are monotone. Although an exact branch and bound\nalgorithm has been provided for this problem [15], its complexity can be exponential in the worst case.\nOn the other hand, in many applications, one of the submodular functions naturally serves as part of a\nconstraint. For example, we might have a budget on a cooperative cost, in which case Problems 1 and\n2 become applicable. The utility of Problems 1 and 2 becomes apparent when we consider how they\noccur in real-world applications and how they subsume a number of important optimization problems.\nSensor Placement and Feature Selection: Often, the problem of choosing sensor locations can\nbe modeled [19, 9] by maximizing the mutual information between the chosen variables A and the\nunchosen set V\\A (i.e., f(A) = I(XA; XV\\A)). Alternatively, we may wish to maximize the mutual\ninformation between a set of chosen sensors XA and a quantity of interest C (i.e., f(A) = I(XA; C))\nassuming that the set of features XA are conditionally independent given C [19, 9]. Both these\nfunctions are submodular. Since there are costs involved, we want to simultaneously minimize the\ncost g(A). Often this cost is submodular [19, 9]. For example, there is typically a discount when\npurchasing sensors in bulk (economies of scale). This then becomes a form of either Problem 1 or 2.\nData subset selection: A data subset selection problem in speech and NLP involves finding a limited\nvocabulary which simultaneously has a large coverage. 
This is particularly useful, for example in\nspeech recognition and machine translation, where the complexity of the algorithm is determined\nby the vocabulary size. The motivation for this problem is to find the subset of training examples\nwhich will facilitate evaluation of prototype systems [23]. Often the objective functions encouraging\nsmall vocabulary subsets and large acoustic spans are submodular [23, 20] and hence this problem\ncan naturally be cast as an instance of Problems 1 and 2.\nPrivacy Preserving Communication: Given a set of random variables X1, \u00b7\u00b7\u00b7 , Xn, denote I as\nan information source, and P as private information that should be filtered out. Then the problem\nof choosing an information-containing but privacy-preserving set of\nrandom variables can be posed as instances of Problems 1 and 2, with f(A) = H(XA|I) and\ng(A) = H(XA|P), where H(\u00b7|\u00b7) is the conditional entropy.\nMachine Translation: Another application in machine translation is to choose a subset of training\ndata that is optimized for a given test data set, a problem previously addressed with modular functions\n[24]. Defining a submodular function with ground set over the union of training and test sample\ninputs V = Vtr \u222a Vte, we can set f : 2^{Vtr} \u2192 R+ to f(X) = f(X|Vte), and take g(X) = |X|, and\nb \u2248 0 in Problem 2 to address this problem. We call this the Submodular Span problem.\nApart from the real-world applications above, both Problems 1 and 2 generalize a number of well-studied\ndiscrete optimization problems. For example, the Submodular Set Cover problem (henceforth\nSSC) [29] occurs as a special case of Problem 1, with f modular and g submodular. Similarly,\nthe Submodular Cost Knapsack problem (henceforth SK) [28] is a special case of Problem 2, again\nwhen f is modular and g submodular. Both these problems subsume the Set Cover and Max k-Cover\nproblems [3]. 
When both f and g are modular, Problems 1 and 2 are called knapsack problems [16].\nThe following are some of our contributions. We show that Problems 1 and 2 are intimately\nconnected, in that any approximation algorithm for either problem can be used to provide guarantees\nfor the other problem as well. We then provide a framework of combinatorial algorithms based\non optimizing, sometimes iteratively, subproblems that are easy to solve. These subproblems\nare obtained by computing either upper or lower bound approximations of the cost functions or\nconstraining functions. We also show that many combinatorial algorithms like the greedy algorithm\nfor SK [28] and SSC [29] also belong to this framework and provide the first constant-factor\nbi-criterion approximation algorithm for SSC [29] and hence the general set cover problem [3]. We\nthen show how with suitable choices of approximate functions, we can obtain a number of bounded\napproximation guarantees and show the hardness for Problems 1 and 2, which in fact match some\nof our approximation guarantees. Our guarantees and hardness results depend on the curvature of\nthe submodular functions [2]. We observe a strong asymmetry in the results: the factors change\npolynomially with the curvature of f but only by a constant factor with the curvature of g, which\nmakes SK and SSC much easier than SCSK and SCSC.\n\n2 Background and Main Ideas\n\nWe first introduce several key concepts used throughout the paper. This paper includes only the\nmain results and we defer all the proofs and additional discussions to the extended version [11].\nGiven a submodular function f, we define the total curvature \u03baf as [2] (see footnote 2):\n\n\u03baf = 1 \u2212 min_{j\u2208V} f(j|V\\j) / f(j).\n\nIntuitively, the curvature 0 \u2264 \u03baf \u2264 1 measures the distance of f from modularity, and \u03baf = 0\nif and only if f is modular (or additive, i.e., f(X) = \u2211_{j\u2208X} f(j)). 
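The total curvature defined above, one minus the smallest ratio between the gain of j on top of all other elements and the singleton value f(j), is straightforward to compute from a value oracle. The sketch below is illustrative only: the three toy functions (modular, sqrt_card, saturated) and the helper name total_curvature are assumptions made for the example, chosen so that the curvature is 0, strictly between 0 and 1, and 1 respectively.

```python
from math import sqrt

def total_curvature(f, ground):
    # kappa_f = 1 - min over j of f(j | V minus j) / f(j); assumes f(j) > 0 for all j.
    V = frozenset(ground)
    ratios = []
    for j in V:
        gain_at_top = f(V) - f(V - {j})   # f(j | V minus j)
        singleton = f(frozenset({j}))     # f(j)
        ratios.append(gain_at_top / singleton)
    return 1 - min(ratios)

# Hypothetical toy functions on a three-element ground set.
weights = {1: 2.0, 2: 3.0, 3: 5.0}

def modular(X):
    # Additive function: curvature 0.
    return sum(weights[j] for j in X)

def sqrt_card(X):
    # Concave function of the cardinality: curvature strictly between 0 and 1.
    return sqrt(len(X))

def saturated(X):
    # Saturates after one element: curvature 1.
    return min(len(X), 1)

V = {1, 2, 3}
k_mod = total_curvature(modular, V)
k_sqrt = total_curvature(sqrt_card, V)
k_sat = total_curvature(saturated, V)
```

The fully saturated function is the regime in which the curvature-dependent factors in the paper are largest, while the modular function recovers the easy case.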
A number of approximation guarantees in the context of submodular optimization have been refined via the curvature of the submodular function [2, 13, 12]. In this paper, we shall witness the role of curvature also in determining the approximations and the hardness of Problems 1 and 2.\nThe main idea of this paper is a framework of algorithms based on choosing appropriate surrogate functions for f and g to optimize over. This framework is represented in Algorithm 1. We would like to choose surrogate functions \u02c6ft and \u02c6gt such that using them, Problems 1 and 2 become easier. If the algorithm is just single stage (not iterative), we represent the surrogates as \u02c6f and \u02c6g. The surrogate functions we consider in this paper are in the forms of bounds (upper or lower) and approximations.\n\nAlgorithm 1: General algorithmic framework to address both Problems 1 and 2\n1: for t = 1, 2, \u00b7\u00b7\u00b7 , T do\n2:   Choose surrogate functions \u02c6ft and \u02c6gt for f and g respectively, tight at X^{t\u22121}.\n3:   Obtain X^t as the optimizer of Problem 1 or 2 with \u02c6ft and \u02c6gt instead of f and g.\n4: end for\n\nModular lower bounds: Akin to convex functions, submodular functions have tight modular lower bounds. These bounds are related to the subdifferential \u2202_f(Y) of the submodular set function f at a set Y \u2286 V [4]. Denote a subgradient at Y by hY \u2208 \u2202_f(Y). The extreme points of \u2202_f(Y) may be computed via a greedy algorithm: Let \u03c0 be a permutation of V that assigns the elements in Y to the first |Y| positions (\u03c0(i) \u2208 Y if and only if i \u2264 |Y|). Each such permutation defines a chain with elements S^\u03c0_0 = \u2205, S^\u03c0_i = {\u03c0(1), \u03c0(2), . . . , \u03c0(i)} and S^\u03c0_{|Y|} = Y. This chain defines an extreme point h^\u03c0_Y of \u2202_f(Y) with entries h^\u03c0_Y(\u03c0(i)) = f(S^\u03c0_i) \u2212 f(S^\u03c0_{i\u22121}). Defined as above, h^\u03c0_Y forms a lower bound of f, tight at Y, i.e., h^\u03c0_Y(X) = \u2211_{j\u2208X} h^\u03c0_Y(j) \u2264 f(X), \u2200X \u2286 V and h^\u03c0_Y(Y) = f(Y).\n\nModular upper bounds: We can also define superdifferentials \u2202^f(Y) of a submodular function [14, 10] at Y. It is possible, moreover, to provide specific supergradients [10, 13] that define the following two modular upper bounds (when referring to either one, we use m^f_X):\n\nm^f_{X,1}(Y) \u225c f(X) \u2212 \u2211_{j\u2208X\\Y} f(j|X\\j) + \u2211_{j\u2208Y\\X} f(j|\u2205),\nm^f_{X,2}(Y) \u225c f(X) \u2212 \u2211_{j\u2208X\\Y} f(j|V\\j) + \u2211_{j\u2208Y\\X} f(j|X).\n\nThen m^f_{X,1}(Y) \u2265 f(Y) and m^f_{X,2}(Y) \u2265 f(Y), \u2200Y \u2286 V and m^f_{X,1}(X) = m^f_{X,2}(X) = f(X).\nMM algorithms using upper/lower bounds: Using the modular upper and lower bounds above in Algorithm 1 provides a class of Majorization-Minimization (MM) algorithms, akin to the algorithms proposed in [13] for submodular optimization and in [25, 9] for DS optimization (Problem 0 above). An appropriate choice of the bounds ensures that the algorithm always improves the objective values for Problems 1 and 2. In particular, choosing \u02c6ft as a modular upper bound of f tight at X^t, or \u02c6gt as a modular lower bound of g tight at X^t, or both, ensures that the objective value of Problems 1 and 2 always improves at every iteration as long as the corresponding surrogate problem can be solved exactly. 
Unfortunately, Problems 1 and 2 are NP-hard even if f or g (or both) are modular [3], and\ntherefore the surrogate problems themselves cannot be solved exactly in polynomial time unless P = NP. Fortunately, the surrogate\nproblems are often much easier than the original ones and can admit log or constant-factor guarantees.\nIn practice, moreover, these factors are almost 1. Furthermore, with a simple modification of the\niterative procedure of Algorithm 1, we can guarantee improvement at every iteration [11]. What\nis also fortunate and perhaps surprising, as we show in this paper below, is that unlike the case of\nDS optimization (where the problem is inapproximable in general [9]), the constrained forms of\noptimization (Problems 1 and 2) do have approximation guarantees.\n\nFootnote 2: We can assume, w.l.o.g., that f(j) > 0, g(j) > 0, \u2200j \u2208 V.\n\nEllipsoidal Approximation: We also consider ellipsoidal approximations (EA) of f. The main\nresult of Goemans et al. [6] is to provide an algorithm based on approximating the submodular\npolyhedron by an ellipsoid. They show that for any polymatroid function f, one can compute\nan approximation of the form \u221a(w^f(X)) for a certain modular weight vector w^f \u2208 R^V, such\nthat \u221a(w^f(X)) \u2264 f(X) \u2264 O(\u221an log n) \u221a(w^f(X)), \u2200X \u2286 V. A simple trick then provides a\ncurvature-dependent approximation [12]: we define the \u03baf-curve-normalized version of f as\nfollows: f^\u03ba(X) \u225c [f(X) \u2212 (1 \u2212 \u03baf) \u2211_{j\u2208X} f(j)] / \u03baf. Then, the submodular function\nf^ea(X) = \u03baf \u221a(w^{f^\u03ba}(X)) + (1 \u2212 \u03baf) \u2211_{j\u2208X} f(j) satisfies [12]:\n\nf^ea(X) \u2264 f(X) \u2264 O( \u221an log n / (1 + (\u221an log n \u2212 1)(1 \u2212 \u03baf)) ) f^ea(X), \u2200X \u2286 V.    (2)\n\nThat is, f^ea approximates f up to a multiplicative factor depending on \u221an and the curvature. 
We shall use\nthe result above in providing approximation bounds for Problems 1 and 2. In particular, the surrogate\nfunctions \u02c6f or \u02c6g in Algorithm 1 can be the ellipsoidal approximations above, and the multiplicative\nbounds transform into approximation guarantees for these problems.\n\n3 Relation between SCSC and SCSK\n\nIn this section, we show a precise relationship between Problems 1 and 2. From the formulation\nof Problems 1 and 2, it is clear that these problems are duals of each other. Indeed,\nin this section we show that the problems are polynomially transformable into each other.\n\nAlgorithm 2: Approx. algorithm for SCSK using an approximation algorithm for SCSC.\n1: Input: An SCSK instance with budget b, an [\u03c3, \u03c1] approx. algo. for SCSC, & \u03b5 \u2208 [0, 1).\n2: Output: [(1 \u2212 \u03b5)\u03c1, \u03c3] approx. for SCSK.\n3: c \u2190 g(V), \u02c6Xc \u2190 V.\n4: while f(\u02c6Xc) > \u03c3b do\n5:   c \u2190 (1 \u2212 \u03b5)c\n6:   \u02c6Xc \u2190 [\u03c3, \u03c1] approx. for SCSC using c.\n7: end while\n\nAlgorithm 3: Approx. algorithm for SCSC using an approximation algorithm for SCSK.\n1: Input: An SCSC instance with cover c, an [\u03c1, \u03c3] approx. algo. for SCSK, & \u03b5 > 0.\n2: Output: [(1 + \u03b5)\u03c3, \u03c1] approx. for SCSC.\n3: b \u2190 min_j f(j), \u02c6Xb \u2190 \u2205.\n4: while g(\u02c6Xb) < \u03c1c do\n5:   b \u2190 (1 + \u03b5)b\n6:   \u02c6Xb \u2190 [\u03c1, \u03c3] approx. for SCSK using b.\n7: end while\n\nWe first introduce the notion of bicriteria algorithms. An algorithm is a [\u03c3, \u03c1] bi-criterion algorithm for\nProblem 1 if it is guaranteed to obtain a set X such that f(X) \u2264 \u03c3f(X\u2217) (approximate optimality)\nand g(X) \u2265 c\u2032 = \u03c1c (approximate feasibility), where X\u2217 is an optimizer of Problem 1. 
Similarly, an\nalgorithm is a [\u03c1, \u03c3] bi-criterion algorithm for Problem 2 if it is guaranteed to obtain a set X such that\ng(X) \u2265 \u03c1g(X\u2217) and f(X) \u2264 b\u2032 = \u03c3b, where X\u2217 is the optimizer of Problem 2. In a bi-criterion\nalgorithm for Problems 1 and 2, typically \u03c3 \u2265 1 and \u03c1 \u2264 1. A non-bicriterion algorithm for Problem 1 is\nwhen \u03c1 = 1 and a non-bicriterion algorithm for Problem 2 is when \u03c3 = 1. Algorithms 2 and 3 provide\nthe schematics for using an approximation algorithm for one of the problems for solving the other.\nTheorem 3.1. Algorithm 2 is guaranteed to find a set \u02c6Xc which is a [(1 \u2212 \u03b5)\u03c1, \u03c3] approximation\nof SCSK in at most log_{1/(1\u2212\u03b5)}[g(V)/min_j g(j)] calls to the [\u03c3, \u03c1] approximate algorithm for SCSC.\nSimilarly, Algorithm 3 is guaranteed to find a set \u02c6Xb which is a [(1 + \u03b5)\u03c3, \u03c1] approximation of SCSC\nin log_{1+\u03b5}[f(V)/min_j f(j)] calls to a [\u03c1, \u03c3] approximate algorithm for SCSK.\n\nTheorem 3.1 implies that the complexities of Problems 1 and 2 are identical, and a solution to one of\nthem provides a solution to the other. Furthermore, as expected, the hardness of Problems 1 and 2 is\nalso almost identical. When f and g are polymatroid functions, moreover, we can provide bounded\napproximation guarantees for both problems, as shown in the next section. Alternatively, we can also do a\nbinary search instead of a linear search to transform Problems 1 and 2. This essentially turns the factor\nof O(1/\u03b5) into O(log 1/\u03b5). 
Due to lack of space, we defer this discussion to the extended version [11].\n\n4 Approximation Algorithms\n\nWe consider several algorithms for Problems 1 and 2, which can all be characterized by the framework\nof Algorithm 1, using the surrogate functions of the form of upper/lower bounds or approximations.\n\n4.1 Approximation Algorithms for SCSC\nWe first describe our approximation algorithms designed specifically for SCSC, leaving to \u00a74.2 the\npresentation of our algorithms slated for SCSK. We first investigate a special case, the submodular\nset cover (SSC), and then provide two algorithms: one (ISSC) is very practical with a weaker\ntheoretical guarantee, and the other (EASSC) is slow but has the tightest guarantee.\nSubmodular Set Cover (SSC): We start by considering a classical special case of SCSC (Problem\n1) where f is already a modular function and g is a submodular function. This problem occurs\nnaturally in a number of problems related to active/online learning [7] and summarization [21, 22].\nThis problem was first investigated by Wolsey [29], wherein he showed that a simple greedy algorithm\nachieves bounded (in fact, log-factor) approximation guarantees. We show that this greedy algorithm\ncan naturally be viewed in the framework of our Algorithm 1 by choosing appropriate surrogate\nfunctions \u02c6ft and \u02c6gt. The idea is to use the modular function f as its own surrogate \u02c6ft and choose the\nfunction \u02c6gt as a modular lower bound of g. Akin to the framework of algorithms in [13], the crucial\nfactor is the choice of the lower bound (or subgradient). Define the greedy subgradient as:\n\n\u03c0(i) \u2208 argmin { f(j) / g(j|S^\u03c0_{i\u22121}) | j \u2209 S^\u03c0_{i\u22121}, g(S^\u03c0_{i\u22121} \u222a j) < c }.    (3)\n\nOnce we reach an i where the constraint g(S^\u03c0_{i\u22121} \u222a j) < c can no longer be satisfied by any j \u2209 S^\u03c0_{i\u22121},\nwe choose the remaining elements for \u03c0 arbitrarily. Let the corresponding subgradient be referred\nto as h^\u03c0. Then we have the following lemma, which is an extension of [29], and which is a simpler\ndescription of the result stated formally in [11].\nLemma 4.1. The greedy algorithm for SSC [29] can be seen as an instance of Algorithm 1 by\nchoosing the surrogate function \u02c6f as f and \u02c6g as h^\u03c0 (with \u03c0 defined in Eqn. (3)).\nWhen g is integral, the guarantee of the greedy algorithm is Hg \u225c H(max_j g(j)), where\nH(d) = \u2211_{i=1}^{d} 1/i [29] (henceforth we will use Hg for this quantity). This factor is tight up to\nlower-order terms [3]. Furthermore, since this algorithm directly solves SSC, we call it the primal greedy.\nWe could also solve SSC by looking at its dual, which is SK [28]. Although SSC does not admit any\nconstant-factor approximation algorithms [3], we can obtain a constant-factor bi-criterion guarantee:\nLemma 4.2. Using the greedy algorithm for SK [28] as the approximation oracle in Algorithm 3\nprovides a [1 + \u03b5, 1 \u2212 e^{\u22121}] bi-criterion approximation algorithm for SSC, for any \u03b5 > 0.\n\nWe call this the dual greedy. This result follows immediately from the guarantee of the submodular\ncost knapsack problem [28] and Theorem 3.1. We remark that we can also use a simpler version\nof the greedy step at every iteration [21, 17] and we obtain a guarantee of [1 + \u03b5, (1 \u2212 e^{\u22121})/2].\nIn practice, however, both these factors are almost 1 and hence the simple variant of the greedy\nalgorithm suffices.\nIterated Submodular Set Cover (ISSC): We next investigate an algorithm for the general SCSC\nproblem when both f and g are submodular. 
The idea here is to iteratively solve the submodular\nset cover problem, which can be done by replacing f by a modular upper bound at every iteration.\nIn particular, this can be seen as a variant of Algorithm 1, where we start with X^0 = \u2205 and\nchoose \u02c6ft(X) = m^f_{X^t}(X) at every iteration. The surrogate problem at each iteration becomes\nmin{m^f_{X^t}(X) | g(X) \u2265 c}. Hence, each iteration is an instance of SSC and can be solved nearly\noptimally using the greedy algorithm. We can continue this algorithm for T iterations or until\nconvergence. An analysis very similar to the ones in [9, 13] will reveal polynomial time convergence.\nSince each iteration is only the greedy algorithm, this approach is also highly practical and scalable.\nTheorem 4.3. ISSC obtains an approximation factor of\nKg Hg / (1 + (Kg \u2212 1)(1 \u2212 \u03baf)) \u2264 n Hg / (1 + (n \u2212 1)(1 \u2212 \u03baf)), where\nKg = 1 + max{|X| : g(X) < c} and Hg is the approximation factor of the submodular set cover\nusing g.\n\nFrom the above, it is clear that Kg \u2264 n. Notice also that Hg is essentially a log-factor. We also\nsee an interesting effect of the curvature \u03baf of f. When f is modular (\u03baf = 0), we recover the\napproximation guarantee of the submodular set cover problem. Similarly, when f has restricted\ncurvature, the guarantees can be much better. Moreover, the approximation guarantee already holds\nafter the first iteration, so additional iterations can only further improve the objective.\nEllipsoidal Approximation based Submodular Set Cover (EASSC): In this setting, we use the\nellipsoidal approximation discussed in \u00a72. We can compute the \u03baf-curve-normalized version of f\n(f^\u03ba, see \u00a72), and then compute its ellipsoidal approximation \u221a(w^{f^\u03ba}). We then define the function\n\u02c6f(X) = f^ea(X) = \u03baf \u221a(w^{f^\u03ba}(X)) + (1 \u2212 \u03baf) \u2211_{j\u2208X} f(j) and use this as the surrogate function \u02c6f\nfor f. We choose \u02c6g as g itself. The surrogate problem becomes:\n\nmin { \u03baf \u221a(w^{f^\u03ba}(X)) + (1 \u2212 \u03baf) \u2211_{j\u2208X} f(j) | g(X) \u2265 c }.    (4)\n\nWhile the function \u02c6f(X) = f^ea(X) is not modular, it is a weighted sum of a concave over modular\nfunction and a modular function. Fortunately, we can use the result from [26], where they show\nthat any function of the form \u221a(w1(X)) + w2(X) can be optimized over any polytope P with an\napproximation factor of \u03b2(1 + \u03b5) for any \u03b5 > 0, where \u03b2 is the approximation factor of optimizing\na modular function over P. The complexity of this algorithm is polynomial in n and 1/\u03b5. We use\ntheir algorithm to minimize f^ea(X) over the submodular set cover constraint and hence we call this\nalgorithm EASSC.\nTheorem 4.4. EASSC obtains a guarantee of O( \u221an log n Hg / (1 + (\u221an log n \u2212 1)(1 \u2212 \u03baf)) ),\nwhere Hg is the approximation guarantee of the set cover problem.\n\nIf the function f has \u03baf = 1, we can use a much simpler algorithm. In particular, we can minimize\n(f^ea(X))^2 = w^f(X) at every iteration, giving a surrogate problem of the form min{w^f(X) | g(X) \u2265\nc}. This is directly an instance of SSC, and in contrast to EASSC, we just need to solve SSC once.\nWe call this algorithm EASSCc.\nCorollary 4.5. EASSCc obtains an approximation guarantee of O(\u221an log n \u221aHg).\n\n4.2 Approximation Algorithms for SCSK\n\nIn this section, we describe our approximation algorithms for SCSK. We note the dual nature of\nthe algorithms in this current section to those given in \u00a74.1. 
We first investigate a special case, the\nsubmodular knapsack (SK), and then provide three algorithms: two of them (Gr and ISK) are\npractical with slightly weaker theoretical guarantees, and the third (EASK) is not scalable\nbut has the tightest guarantee.\nSubmodular Cost Knapsack (SK): We start with a special case of SCSK (Problem 2), where f is\na modular function and g is a submodular function. In this case, SCSK turns into the SK problem for\nwhich the greedy algorithm with partial enumeration provides a 1 \u2212 e^{\u22121} approximation [28]. The\ngreedy algorithm can be seen as an instance of Algorithm 1 with \u02c6g being the modular lower bound of\ng and \u02c6f being f, which is already modular. In particular, define:\n\n\u03c0(i) \u2208 argmax { g(j|S^\u03c0_{i\u22121}) / f(j) | j \u2209 S^\u03c0_{i\u22121}, f(S^\u03c0_{i\u22121} \u222a {j}) \u2264 b },    (5)\n\nwhere the remaining elements are chosen arbitrarily. The following is an informal description of the\nresult described formally in [11].\nLemma 4.6. Choosing the surrogate function \u02c6f as f and \u02c6g as h^\u03c0 (with \u03c0 defined in Eqn. (5)) in\nAlgorithm 1 with appropriate initialization obtains a guarantee of 1 \u2212 1/e for SK.\nGreedy (Gr): A similar greedy algorithm can provide approximation guarantees for the general\nSCSK problem, with submodular f and g. Unlike the knapsack case in (5), however, at iteration\ni we choose an element j \u2209 S^\u03c0_{i\u22121} with f(S^\u03c0_{i\u22121} \u222a {j}) \u2264 b which maximizes g(j|S^\u03c0_{i\u22121}). In terms of\nAlgorithm 1, this is analogous to choosing a permutation \u03c0 such that:\n\n\u03c0(i) \u2208 argmax { g(j|S^\u03c0_{i\u22121}) | j \u2209 S^\u03c0_{i\u22121}, f(S^\u03c0_{i\u22121} \u222a {j}) \u2264 b }.    (6)\n\nTheorem 4.7. The greedy algorithm for SCSK obtains an approx. factor of\n(1/\u03bag) (1 \u2212 ((Kf \u2212 \u03bag)/Kf)^{kf}) \u2265 1/Kf, where Kf = max{|X| : f(X) \u2264 b} and\nkf = min{|X| : f(X) \u2264 b & \u2200j \u2209 X, f(X \u222a j) > b}.\n\nIn the worst case, kf = 1 and Kf = n, in which case the guarantee is 1/n. The bound above\nfollows from a simple observation that the constraint {f(X) \u2264 b} is down-monotone for a monotone\nfunction f. However, in this variant, we do not use any specific information about f. In particular,\nit holds for maximizing a submodular function g over any down-monotone constraint [2]. Hence\nit is conceivable that an algorithm that uses both f and g to choose the next element could provide\nbetter bounds. We do not, however, currently have the analysis for this.\nIterated Submodular Cost Knapsack (ISK): Here, we choose \u02c6ft(X) as a modular upper bound\nof f, tight at X^t. Let \u02c6gt = g. Then at every iteration, we solve max{g(X) | m^f_{X^t}(X) \u2264 b}, which is\na submodular maximization problem subject to a knapsack constraint (SK). As mentioned above,\ngreedy can solve this nearly optimally. We start with X^0 = \u2205, choose \u02c6f0(X) = \u2211_{j\u2208X} f(j) and then\niteratively continue this process until convergence (note that this is an ascent algorithm). We have the\nfollowing theoretical guarantee:\nTheorem 4.8. Algorithm ISK obtains a set X^t such that g(X^t) \u2265 (1 \u2212 e^{\u22121}) g(\u02dcX), where \u02dcX is the\noptimal solution of max { g(X) | f(X) \u2264 b (1 + (Kf \u2212 1)(1 \u2212 \u03baf)) / Kf } and where Kf = max{|X| : f(X) \u2264 b}.\n\nIt is worth pointing out that the above bound holds even after the first iteration of the algorithm. It is\ninteresting to note the similarity between this approach and ISSC. Notice that the guarantee above is\nnot a standard bi-criterion approximation. 
We show in the extended version [11] that with a simple\ntransformation, we can obtain a bicriterion guarantee.\nEllipsoidal Approximation based Submodular Cost Knapsack (EASK): Choosing the Ellipsoidal\nApproximation f^ea of f as a surrogate function, we obtain a simpler problem:\n\nmax { g(X) | \u03baf \u221a(w^{f^\u03ba}(X)) + (1 \u2212 \u03baf) \u2211_{j\u2208X} f(j) \u2264 b }.    (7)\n\nIn order to solve this problem, we look at its dual problem (i.e., Eqn. (4)) and use Algorithm 2 to\nconvert the guarantees. We call this procedure EASK. We then obtain guarantees very similar to\nTheorem 4.4.\nLemma 4.9. EASK obtains a guarantee of [1 + \u03b5, O( \u221an log n Hg / (1 + (\u221an log n \u2212 1)(1 \u2212 \u03baf)) )].\n\nIn the case when the submodular function has a curvature \u03baf = 1, we can actually provide a simpler\nalgorithm without needing to use the conversion algorithm (Algorithm 2). In this case, we can\ndirectly choose the ellipsoidal approximation of f as \u221a(w^f(X)) and solve the surrogate problem:\nmax{g(X) : w^f(X) \u2264 b^2}. This surrogate problem is a submodular cost knapsack problem, which\nwe can solve using the greedy algorithm. We call this algorithm EASKc. This guarantee is tight up to\nlog factors if \u03baf = 1.\nCorollary 4.10. Algorithm EASKc obtains a bi-criterion guarantee of [1 \u2212 e^{\u22121}, O(\u221an log n)].\n\n4.3 Extensions beyond SCSC and SCSK\n\nSCSC and SCSK can in fact be extended to more flexible and complicated constraints which can arise\nnaturally in many applications [18, 8]. These include multiple covering and knapsack constraints,\ni.e., min{f(X) | gi(X) \u2265 ci, i = 1, 2, \u00b7\u00b7\u00b7 , k} and max{g(X) | fi(X) \u2264 bi, i = 1, 2, \u00b7\u00b7\u00b7 , k}, and robust\noptimization problems like max{min_i gi(X) | f(X) \u2264 b}, where the functions f, g, fi\u2019s and gi\u2019s are\nsubmodular. We also consider SCSC and SCSK with non-monotone submodular functions. Due to\nlack of space, we defer these discussions to the extended version of this paper [11].\n\n4.4 Hardness\n\nIn this section, we provide the hardness for Problems 1 and 2. The lower bounds serve to show that\nthe approximation factors above are almost tight.\n\nTheorem 4.11. For any \u03ba > 0, there exist submodular functions with curvature \u03ba such that\nno polynomial time algorithm for Problems 1 and 2 achieves a bi-criterion factor better than\n\u03c3/\u03c1 = n^{1/2\u2212\u03b5} / (1 + (n^{1/2\u2212\u03b5} \u2212 1)(1 \u2212 \u03ba)) for any \u03b5 > 0.\n\nThe above result shows that EASSC and EASK meet the bounds above to log factors. We see an\ninteresting curvature-dependent influence on the hardness. We also see this phenomenon in the\napproximation guarantees of our algorithms. In particular, as soon as f becomes modular, the\nproblem becomes easy, even when g is submodular. This is not surprising since the submodular\nset cover problem and the submodular cost knapsack problem both have constant factor guarantees.\n\n5 Experiments\n\nIn this section, we empirically compare the performance of the various algorithms discussed in this\npaper. We are motivated by the speech data subset selection application [20, 23] with the submodular\nfunction f encouraging limited vocabulary while g tries to achieve acoustic variability. A natural\nchoice of the function f is a function of the form |\u0393(X)|, where \u0393(X) is the neighborhood function\non a bipartite graph constructed between the utterances and the words [23]. For the coverage function\ng, we use two types of coverage: one is a facility location function g1(X) = \u2211_{i\u2208V} max_{j\u2208X} sij,\nwhile the other is a saturated sum function g2(X) = \u2211_{i\u2208V} min{\u2211_{j\u2208X} sij, \u03b1 \u2211_{j\u2208V} sij}. 
Both\nthese functions are defined in terms of a similarity matrix $S = \\{s_{ij}\\}_{i,j \\in V}$, which we define on the\nTIMIT corpus [5], using the string kernel metric [27] for similarity. Since some of our algorithms, like\nthe Ellipsoidal Approximations, are computationally intensive, we restrict ourselves to 50 utterances.\nWe compare our different algorithms on Problems 1 and 2 with $f$ being the bipartite\nneighborhood and $g$ being the facility location and saturated sum functions, respectively. Furthermore,\nin our experiments, we observe that the neighborhood function $f$ has curvature $\\kappa_f = 1$.\nThus, it suffices to use the simpler versions of algorithm EA (i.e., algorithms EASSCc and\nEASKc). The results are shown in Figure 1. We observe that on the real-world instances, all our\nalgorithms perform almost comparably. This implies, moreover, that the iterative variants, viz. Gr,\nISSC and ISK, perform comparably to the more complicated EA-based ones, although EASSC and\nEASK have better theoretical guarantees. We also compare against a baseline of selecting random\nsets (of varying cardinality), and we see that our algorithms all perform much better. In terms of\nrunning time, computing the Ellipsoidal Approximation for $|\\Gamma(X)|$ with $|V| = 50$ takes about 5\nhours, while all the iterative variants (i.e., Gr, ISSC and ISK) take less than a second. This difference\nis much more prominent on larger instances (for example $|V| = 500$).\n\nFigure 1: Comparison of the algorithms in the text.\n\n6 Discussions\n\nIn this paper, we propose a unifying framework for Problems 1 and 2 based on suitable surrogate\nfunctions. We provide a number of iterative algorithms which are very practical and scalable (like\nGr, ISK and ISSC), and also algorithms like EASSC and EASK, which, though more intensive,\nobtain tight approximation bounds. 
Finally, we empirically compare our algorithms, and show that\nthe iterative algorithms compete empirically with the more complicated and theoretically better\napproximation algorithms. For future work, we would like to empirically evaluate our algorithms on\nmany of the real world problems described above, particularly the limited vocabulary data subset\nselection application for speech corpora, and the machine translation application.\nAcknowledgments: Special thanks to Kai Wei and Stefanie Jegelka for discussions, to Bethany\nHerwaldt for going through an early draft of this manuscript and to the anonymous reviewers for\nuseful reviews. This material is based upon work supported by the National Science Foundation\nunder Grant No. (IIS-1162606), a Google and a Microsoft award, and by the Intel Science and\nTechnology Center for Pervasive Computing.\n\n[Figure 1 panels (axes f(X) and g(X)): Fac. Location / Bipartite Neighbor. and Saturated Sum / Bipartite Neighbor. Legend: ISSC, EASSCc, ISK, Gr, EASKc, Random.]\n\nReferences\n[1] A. Atamt\u00fcrk and V. Narayanan. The submodular knapsack polytope. Discrete Optimization, 2009.\n[2] M. Conforti and G. Cornuejols. Submodular set functions, matroids and the greedy algorithm: tight worst-case bounds and some generalizations of the Rado-Edmonds theorem. Discrete Applied Mathematics, 7(3):251\u2013274, 1984.\n[3] U. Feige. A threshold of ln n for approximating set cover. Journal of the ACM (JACM), 1998.\n[4] S. Fujishige. Submodular functions and optimization, volume 58. Elsevier Science, 2005.\n[5] J. Garofolo, F. Lamel, L., J. W., Fiscus, D. Pallet, and N. Dahlgren. Timit, acoustic-phonetic continuous speech corpus. In DARPA, 1993.\n[6] M. Goemans, N. Harvey, S. Iwata, and V. Mirrokni. Approximating submodular functions everywhere. In SODA, pages 535\u2013544, 2009.\n[7] A. Guillory and J. Bilmes. Interactive submodular set cover. In ICML, 2010.\n[8] A. Guillory and J. Bilmes. 
Simultaneous learning and covering with adversarial noise. In ICML, 2011.\n[9] R. Iyer and J. Bilmes. Algorithms for approximate minimization of the difference between submodular functions, with applications. In UAI, 2012.\n[10] R. Iyer and J. Bilmes. The submodular Bregman and Lov\u00e1sz-Bregman divergences with applications. In NIPS, 2012.\n[11] R. Iyer and J. Bilmes. Submodular Optimization with Submodular Cover and Submodular Knapsack Constraints: Extended arxiv version, 2013.\n[12] R. Iyer, S. Jegelka, and J. Bilmes. Curvature and Optimal Algorithms for Learning and Minimizing Submodular Functions. In NIPS, 2013.\n[13] R. Iyer, S. Jegelka, and J. Bilmes. Fast semidifferential based submodular function optimization. In ICML, 2013.\n[14] S. Jegelka and J. A. Bilmes. Submodularity beyond submodular energies: coupling edges in graph cuts. In CVPR, 2011.\n[15] Y. Kawahara and T. Washio. Prismatic algorithm for discrete dc programming problems. In NIPS, 2011.\n[16] H. Kellerer, U. Pferschy, and D. Pisinger. Knapsack problems. Springer Verlag, 2004.\n[17] A. Krause and C. Guestrin. A note on the budgeted maximization on submodular functions. Technical Report CMU-CALD-05-103, Carnegie Mellon University, 2005.\n[18] A. Krause, B. McMahan, C. Guestrin, and A. Gupta. Robust submodular observation selection. Journal of Machine Learning Research (JMLR), 9:2761\u20132801, 2008.\n[19] A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. JMLR, 9:235\u2013284, 2008.\n[20] H. Lin and J. Bilmes. How to select a good training-data subset for transcription: Submodular active selection for sequences. In Interspeech, 2009.\n[21] H. Lin and J. Bilmes. Multi-document summarization via budgeted maximization of submodular functions. In NAACL, 2010.\n[22] H. Lin and J. Bilmes. A class of submodular functions for document summarization. 
In The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL/HLT-2011), Portland, OR, June 2011.\n[23] H. Lin and J. Bilmes. Optimal selection of limited vocabulary speech corpora. In Interspeech, 2011.\n[24] R. C. Moore and W. Lewis. Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference Short Papers, pages 220\u2013224. Association for Computational Linguistics, 2010.\n[25] M. Narasimhan and J. Bilmes. A submodular-supermodular procedure with applications to discriminative structure learning. In UAI, 2005.\n[26] E. Nikolova. Approximation algorithms for offline risk-averse combinatorial optimization, 2010.\n[27] J. Rousu and J. Shawe-Taylor. Efficient computation of gapped substring kernels on large alphabets. Journal of Machine Learning Research, 6(2):1323, 2006.\n[28] M. Sviridenko. A note on maximizing a submodular set function subject to a knapsack constraint. Operations Research Letters, 32(1):41\u201343, 2004.\n[29] L. A. Wolsey. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 2(4):385\u2013393, 1982.", "award": [], "sourceid": 1147, "authors": [{"given_name": "Rishabh", "family_name": "Iyer", "institution": "University of Washington"}, {"given_name": "Jeff", "family_name": "Bilmes", "institution": "University of Washington"}]}