{"title": "Stochastic Continuous Greedy++: When Upper and Lower Bounds Match", "book": "Advances in Neural Information Processing Systems", "page_first": 13087, "page_last": 13097, "abstract": "In this paper, we develop Stochastic Continuous Greedy++ (SCG++), the first efficient variant of a conditional gradient method for maximizing a continuous submodular function subject to a convex constraint. Concretely, for a monotone and continuous DR-submodular function, SCG++ achieves a tight $[(1-1/e)\\mathrm{OPT}-\\epsilon]$ solution while using $O(1/\\epsilon^2)$ stochastic gradients and $O(1/\\epsilon)$ calls to the linear optimization oracle. The best previously known algorithms either achieve a suboptimal $[(1/2)\\mathrm{OPT}-\\epsilon]$ solution with $O(1/\\epsilon^2)$ stochastic gradients or the tight $[(1-1/e)\\mathrm{OPT}-\\epsilon]$ solution with suboptimal $O(1/\\epsilon^3)$ stochastic gradients. We further provide an information-theoretic lower bound to showcase the necessity of $\\Omega(1/\\epsilon^2)$ stochastic oracle queries in order to achieve $[(1-1/e)\\mathrm{OPT}-\\epsilon]$ for monotone and DR-submodular functions. This result shows that our proposed SCG++ enjoys optimality in terms of both the approximation guarantee, i.e., the $(1-1/e)$ approximation factor, and stochastic gradient evaluations, i.e., $O(1/\\epsilon^2)$ calls to the stochastic oracle. 
By using stochastic continuous optimization as an interface, we also show that it is possible to obtain the $[(1-1/e)\\mathrm{OPT}-\\epsilon]$ tight approximation guarantee for maximizing a monotone but stochastic submodular set function subject to a general matroid constraint after at most $\\mathcal{O}(n^2/\\epsilon^2)$ calls to the stochastic function value, where $n$ is the number of elements in the ground set.", "full_text": "Stochastic Continuous Greedy++: When Upper and Lower Bounds Match∗

Hamed Hassani, ESE Department, University of Pennsylvania, Philadelphia, PA, hassani@seas.upenn.edu
Amin Karbasi, ECE Department, Yale University, New Haven, CT, amin.karbasi@yale.edu
Aryan Mokhtari, ECE Department, The University of Texas at Austin, Austin, TX, mokhtari@austin.utexas.edu
Zebang Shen, ESE Department, University of Pennsylvania, Philadelphia, PA, zebang@seas.upenn.edu

Abstract

In this paper, we develop Stochastic Continuous Greedy++ (SCG++), the first efficient variant of a conditional gradient method for maximizing a continuous submodular function subject to a convex constraint. Concretely, for a monotone and continuous DR-submodular function, SCG++ achieves a tight [(1 − 1/e)OPT − ε] solution while using O(1/ε²) stochastic gradients and O(1/ε) calls to the linear optimization oracle. The best previously known algorithms either achieve a suboptimal [(1/2)OPT − ε] solution with O(1/ε²) stochastic gradients or the tight [(1 − 1/e)OPT − ε] solution with suboptimal O(1/ε³) stochastic gradients. We further provide an information-theoretic lower bound to showcase the necessity of Ω(1/ε²) stochastic oracle queries in order to achieve [(1 − 1/e)OPT − ε] for monotone and DR-submodular functions. 
This result shows that our proposed SCG++ enjoys optimality in terms of both the approximation guarantee, i.e., the (1 − 1/e) approximation factor, and stochastic gradient evaluations, i.e., O(1/ε²) calls to the stochastic oracle. By using stochastic continuous optimization as an interface, we also show that it is possible to obtain the [(1 − 1/e)OPT − ε] tight approximation guarantee for maximizing a monotone but stochastic submodular set function subject to a general matroid constraint after at most O(n²/ε²) calls to the stochastic function value, where n is the number of elements in the ground set.

1 Introduction

In this paper, we consider the following non-oblivious stochastic submodular maximization problem:

max_{x∈C} F(x) := max_{x∈C} E_{z∼p(z;x)}[F̃(x; z)],    (1)

where x ∈ R^d_+ is the decision variable, C ⊆ R^d is a convex feasible set, z ∈ Z is a random variable with distribution p(z; x), and the submodular objective function F : R^d → R is defined as the expectation of a set of stochastic functions F̃ : R^d × Z → R. In this paper, we focus on a general case of stochastic submodular maximization in which the probability distribution of the random variable z depends on the variable x and may change during the optimization procedure.

∗The authors are listed in alphabetical order.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

One should note that the usual stochastic optimization, where the distribution p is independent of x, is a special case of Problem (1). A canonical example of the general stochastic submodular maximization problem in (1) is the multilinear extension of a discrete submodular function, where the stochasticity crucially depends on the decision variable x at which we evaluate. 
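To make this canonical example concrete, here is a minimal Monte Carlo sketch, under illustrative assumptions (a toy coverage function and hand-picked sample counts, not from the paper), of a multilinear extension whose sampling distribution depends on the decision variable x:

```python
import itertools
import random

def coverage(S, sets):
    # f(S) = size of the union of the sets indexed by S: a classic monotone submodular function.
    covered = set()
    for i in S:
        covered |= sets[i]
    return len(covered)

def sample_set(x, rng):
    # Non-oblivious sampling: element i is included with probability x[i],
    # so the distribution of the random set depends on the decision variable x.
    return {i for i in range(len(x)) if rng.random() < x[i]}

def multilinear_estimate(f, x, num_samples=4000, seed=0):
    # Monte Carlo estimate of F(x) = E_{z ~ x}[f(z(x))].
    rng = random.Random(seed)
    return sum(f(sample_set(x, rng)) for _ in range(num_samples)) / num_samples

def multilinear_exact(f, x):
    # Exact multilinear extension by enumerating all subsets (viable only for tiny ground sets).
    d = len(x)
    total = 0.0
    for r in range(d + 1):
        for S in itertools.combinations(range(d), r):
            p = 1.0
            for i in range(d):
                p *= x[i] if i in S else 1.0 - x[i]
            total += p * f(set(S))
    return total

# Toy instance (hypothetical data): three sets over a ground set of four items.
sets = [{0, 1}, {1, 2}, {2, 3}]
f = lambda S: coverage(S, sets)
x = [0.5, 0.5, 0.5]
est = multilinear_estimate(f, x)
exact = multilinear_exact(f, x)   # 2.5 for this instance
```

Evaluating F exactly requires summing over 2^d subsets, which is why in practice only stochastic estimates of F and of its derivatives are available, exactly the oracle model studied here.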
Specifically, consider a discrete submodular set function f : 2^V → R+ defined over the ground set V. The aim is to solve the problem max_{S∈I} f(S), where I is a matroid constraint. For this problem, the classic greedy algorithm leads to a 1/2 approximation guarantee, but one can achieve the optimal approximation guarantee of 1 − 1/e by maximizing its multilinear extension F : [0, 1]^V → R+, which is defined as

F(x) := E_{z∼x}[f(z(x))] := Σ_{S⊆V} f(S) Π_{i∈S} x_i Π_{j∉S} (1 − x_j),    (2)

where for the random set z(x) each element e is sampled with probability x_e. This problem is a special case of (1) if we define F̃(x, z) as f(z(x)) and p(x, z) as the distribution of the random set z(x), i.e., each coordinate z_e is generated according to a Bernoulli distribution with parameter x_e.

When F is monotone and continuous DR-submodular, Hassani et al. [17] showed that the Stochastic Gradient Ascent (SGA) method finds a solution to Problem (1) with a function value no less than [(1/2)OPT − ε] after computing O(1/ε²) stochastic gradients. Here, and throughout the paper, OPT denotes the optimal value of Problem (1). Hassani et al. [17] also provided examples for which SGA cannot, in general, achieve an approximation ratio better than 1/2. Later, Mokhtari et al. [22] proposed Stochastic Continuous Greedy (SCG), a conditional gradient method that achieves the tight [(1 − 1/e)OPT − ε] solution with O(1/ε³) calls to the linear optimization oracle while using O(1/ε³) stochastic gradients. While both SCG and SGA are first-order methods, meaning that they rely on stochastic gradients, SCG provably achieves a better result at the price of being slower. Therefore, a fundamental question is the following:

"Can we achieve the best of both worlds? 
That is, can we \ufb01nd a [(1\u2212 1/e)OPT\u2212 \u0001]\nsolution after at most O(1/\u00012) calls to the stochastic oracle?\"\n\nAnother question that naturally arises is about a lower bound on the number of stochastic gradient\nevaluations for \ufb01nding a (1 \u2212 1/e) approximate solution:\n\n\u201cWhat is the lower bound on the number of calls to the \ufb01rst-order stochastic oracle\nfor achieving a [(1 \u2212 1/e)OPT \u2212 \u0001] solution?\"\n\nIn this paper, we develop a tight lower bound on the number of calls to the stochastic oracle for\nachieving a [(1\u22121/e)OPT\u2212\u0001] solution, and propose an algorithm that achieves the sample complexity\nof the lower bound. The detail of our contributions follows.\nOur contributions. We develop Stochastic Continuous Greedy++ (SCG++ ), the \ufb01rst\nmethod that achieves the tight [(1 \u2212 1/e)OPT \u2212 \u0001] solution for Problem (1) with O(1/\u0001) calls to the\nlinear optimization program while using O(1/\u00012) stochastic gradients in total. Our technique relies\non a novel variance reduction method that estimates the difference of gradients in the non-oblivious\nstochastic setting without introducing extra bias. This is crucial in our analysis, as all the existing\nvariance reduction methods fail to correct for this bias and can only operate in the oblivious/classic\nstochastic setting. We further show that our result is optimal in all aspects. In particular, we provide\nan information-theoretic lower bound to showcase the necessity of O(1/\u00012) stochastic oracle queries\nin order to achieve [(1 \u2212 1/e)OPT \u2212 \u0001]. Note that under standard assumptions, one cannot achieve\nan approximation ratio better than (1 \u2212 1/e) for submodular functions [13]. 
By using stochastic\ncontinuous optimization as an interface, we also provide a (1 \u2212 1/e)OPT \u2212 \u0001 tight approximation\nguarantee for maximizing a monotone but stochastic submodular set function subject to a matroid\nconstraint with at most O(n/\u00012) calls to the stochastic oracle where n is the size of the ground set.\n\n2 Related Work\n\nSubmodular set functions capture the intuitive notion of diminishing returns and have become in-\ncreasingly important in various machine learning applications. Examples include data summarization\n[20, 21], dictionary learning [9], and variational inference [11], to name a few. It is known that for a\nmonotone submodular function and subject to a cardinality constraint, greedy algorithm achieves the\ntight (1 \u2212 1/e) approximation guarantee [25]. However, the vanilla greedy method does not provide\n\n2\n\n\fthe tightest guarantees for many classes of feasibility constraints. To circumvent this issue, the con-\ntinuous relaxation of submodular functions, through the multilinear extension, have been extensively\nstudied [31, 7, 8, 14, 16, 30]. In particular, it is known that the Continuous Greedy algorithm achieves\nthe tight (1 \u2212 1/e) approximation guarantee for monotone submodular functions under a general\nmatroid constraint [7] with a prohibitive query complexity of O(n8). The fastest existing solution\nfor maximizing a submodular function subject to a matroid constraint interplays between discrete\nand continuous domains to achieve a running time of O(n/\u00014) for \ufb01nding a (1 \u2212 1/e)OPT \u2212 \u0001\napproximate solution [4]. In contrast, we develop a pure continuous method that obtains the same\nguarantee with a running time of O(n2/\u00012).\nContinuous DR-submodular functions, an important subclass of non-convex functions, generalize\nthe notion of diminishing returns to the continuous domains [5]. 
Such functions naturally arise in\nmachine learning applications such as Map inference for Determinantal Point Processes [19] and\nrevenue maximization [26]. It has been recently shown that monotone continuous DR-submodular\nfunctions can be (approximately) maximized over convex bodies using \ufb01rst-order methods [5, 17, 22].\nWhen exact gradient information is available, [5] showed that the continuous greedy algorithm\nachieves [(1 \u2212 1/e)OPT \u2212 \u0001] with O(1/\u0001) gradient evaluations. However, the problem becomes\nconsiderably more challenging when we only have access to a stochastic \ufb01rst-order oracle. In\nparticular, Hassani et al. [17] showed that the stochastic gradient ascent achieves [1/2OPT \u2212 \u0001] by\nusing O(1/\u00012) stochastic gradients. In contrast, [22, 23] proposed a stochastic variant of continuous\ngreedy that achieves [(1 \u2212 1/e)OPT \u2212 \u0001] by using O(1/\u00013) stochastic gradients. This paper shows\nhow to achieve [(1 \u2212 1/e)OPT \u2212 \u0001] by O(1/\u00012) stochastic gradient evaluations.\n\n3 Preliminaries\nSubmodularity. A set function f : 2V \u2192 R+, de\ufb01ned on the ground set V , is submodular if\nf (A) + f (B) \u2265 f (A \u2229 B) + f (A \u222a B), for all subsets A, B \u2286 V . Even though submodularity is\nmostly considered on discrete domains, the notion can be naturally extended to arbitrary lattices [15].\ni=1 Xi where each Xi is a compact\nTo this aim, let us consider a subset of Rd\nsubset of R+. A function F : X \u2192 R+ is continuous submodular if \u2200(x, y) \u2208 X \u00d7 X\n\n+ of the form X =(cid:81)d\n\nF (x) + F (y) \u2265 F (x \u2228 y) + F (x \u2227 y),\n\n= max(x, y) (component-wise) and x \u2227 y\n.\n\n(3)\nwhere x \u2228 y\n.\n= min(x, y) (component-wise). A\nsubmodular function is monotone if for any x, y \u2208 X such that x \u2264 y, we have F (x) \u2264 F (y)\n(here, by x \u2264 y we mean that every element of x is less than that of y). 
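As a quick numerical sanity check of definition (3), the snippet below verifies continuous submodularity and monotonicity for F(x) = 1 − Π_i(1 − x_i), a standard monotone DR-submodular function on [0, 1]^d; the test points are arbitrary choices, not from the paper:

```python
def F(x):
    # F(x) = 1 - prod_i (1 - x_i): monotone and DR-submodular on [0, 1]^d
    # (every cross second derivative equals -prod_{k != i,j} (1 - x_k) <= 0).
    p = 1.0
    for xi in x:
        p *= 1.0 - xi
    return 1.0 - p

def join(x, y):
    return [max(a, b) for a, b in zip(x, y)]  # component-wise max, x ∨ y

def meet(x, y):
    return [min(a, b) for a, b in zip(x, y)]  # component-wise min, x ∧ y

x = [0.2, 0.7, 0.1]
y = [0.5, 0.3, 0.4]
lhs = F(x) + F(y)                      # 0.784 + 0.790 = 1.574
rhs = F(join(x, y)) + F(meet(x, y))    # 0.910 + 0.496 = 1.406
# Submodularity (3): lhs >= rhs; monotonicity: F(x ∧ y) <= F(x) <= F(x ∨ y).
```

Checking random pairs like this is of course no proof, but it is a cheap way to catch a function that is not continuous submodular before running a continuous greedy method on it.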
When twice differentiable,\nF is submodular if and only if all cross-second-derivatives are non-positive [3], i.e., we have\n\u2200i (cid:54)= j,\u2200x \u2208 X , \u22022F (x)/\u2202xi\u2202xj \u2264 0. This expression shows continuous submodular functions are\nnot convex nor concave in general, as concavity (convexity) implies that \u22072F (cid:22) 0 (resp.(cid:53)2F (cid:23) 0).\nA proper subclass of submodular functions are called DR-submodular [29] if for all x, y \u2208 X such\nthat x \u2264 y and any standard basis vector ei \u2208 Rn and a non-negative number z \u2208 R+ such that\nzei + x \u2208 X and zei + y \u2208 X , then, F (zei + x) \u2212 F (x) \u2265 F (zei + y) \u2212 F (y). One can easily\nverify that for a differentiable DR-submodular function the gradient is an antitone mapping, i.e., for\nall x, y \u2208 X such that x \u2264 y we have \u2207F (x) \u2265 \u2207F (y) [5].\nVariance Reduction. Beyond the vanilla stochastic gradient, variance reduced methods [28, 18, 10,\n27, 2] have succeeded in reducing stochastic \ufb01rst-order oracle complexity in oblivious stochastic\noptimization\n\nEz\u223cp(z)\n\n\u02dcF (x; z),\n\nx\u2208C F (x) := max\nmax\nx\u2208C\n\n(4)\nwhere each component function \u02dcF (\u00b7; z) is L-smooth. In contrast to (1), the underlying distribution\np of (4) is invariant to the variable x and is hence called oblivious. We will now explain a recent\nvariance reduction technique for solving (4) using stochastic gradient information. Consider the\nfollowing unbiased estimate of the gradient at the current iterate xt:\n\ngt := gt\u22121 + \u2207 \u02dcF (xt;M) \u2212 \u2207 \u02dcF (xt\u22121;M),\n\n(cid:80)\n(5)\nz\u2208M \u2207 \u02dcF (y; z) for some y \u2208 Rd, gt\u22121 is an unbiased gradient esti-\nwhere \u2207 \u02dcF (y;M) := 1|M|\nmator at xt\u22121, and M is a mini-batch of random samples drawn from p(z). 
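A minimal sketch of the recursive estimator in (5) on a toy quadratic objective (the data, objective, and iterate path are illustrative assumptions, not from the paper). The point to notice is that the same minibatch M is evaluated at both x_t and x_{t−1}, which is valid only because p(z) does not depend on x here:

```python
import random

def grad_sample(x, z):
    # Stochastic gradient of the toy objective F(x) = -E_z[(x - z)^2] / 2,
    # so grad F~(x; z) = z - x and grad F(x) = E[z] - x.
    return z - x

def variance_reduced_gradients(data, xs, batch_size, rng):
    # Recursive estimator (5): g_t = g_{t-1} + grad(x_t; M) - grad(x_{t-1}; M),
    # with the SAME minibatch M at both points (oblivious setting only).
    M0 = rng.sample(data, batch_size)
    g = sum(grad_sample(xs[0], z) for z in M0) / batch_size  # g_0
    estimates = [g]
    for t in range(1, len(xs)):
        M = rng.sample(data, batch_size)
        increment = sum(grad_sample(xs[t], z) - grad_sample(xs[t - 1], z)
                        for z in M) / batch_size
        g = g + increment
        estimates.append(g)
    return estimates

rng = random.Random(0)
data = [rng.gauss(1.0, 0.5) for _ in range(1000)]
mean_z = sum(data) / len(data)
xs = [0.1 * t for t in range(11)]   # slowly moving iterates, as in conditional gradient methods
gs = variance_reduced_gradients(data, xs, batch_size=200, rng=rng)
errors = [abs(g - (mean_z - x)) for g, x in zip(gs, xs)]
```

For this toy objective the per-sample increment (z − x_t) − (z − x_{t−1}) = x_{t−1} − x_t carries no noise at all, so the recursion's error stays exactly that of the initial batch; in general, L-smoothness of the components bounds the increment's variance by O(‖x_t − x_{t−1}‖²), which is what makes the recursion cheap for slowly moving iterates.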
[12] showed that, with\nthe gradient estimator (5), O(1/\u00013) stochastic gradient evaluations are suf\ufb01cient to \ufb01nd an \u0001-\ufb01rst-\norder stationary point of Problem (4), improving upon the O(1/\u00014) complexity of SGD. A crucial\n\n3\n\n\fproperty leading to the success of the variance reduction method given in (5) is that \u2207 \u02dcF (xt;M) and\n\u2207 \u02dcF (xt\u22121;M) use the same minibatch sample M in order to exploit the L-smoothness of component\nfunctions f (\u00b7; z). Such construction is only possible in the oblivious setting where p(z) is independent\nof the choice of x, and would introduce bias in the more general non-oblivious case (1). To see this,\nlet M be the minibatch of random variable z sampled according to distribution p(z; xt). We have\nE[\u2207 \u02dcF (xt;M)] = \u2207F (xt) but E[\u2207 \u02dcF (xt\u22121;M)] (cid:54)= \u2207F (xt\u22121) since the distribution p(z; xt\u22121) is\nnot the same as p(z; xt). The same argument renders all the existing variance reduction techniques\ninapplicable for the non-oblivious setting of Problem (1).\n\n4 Stochastic Continuous Greedy++\nIn this section, we present the Stochastic Continuous Greedy++ (SCG++) algorithm\nwhich is the \ufb01rst method to obtain a [(1 \u2212 1/e)OPT \u2212 \u0001] solution with O(1/\u00012) stochastic oracle\ncomplexity. The SCG++ algorithm essentially operates in a conditional gradient manner. To be\nmore precise, at each iteration t, given a gradient estimator gt, SCG++ solves the subproblem\n\n(6)\nto obtain an element vt in C as ascent direction, which is then added to the iterate xt+1 with a scaling\nfactor 1/T , i.e., the new iterate xt+1 is computed by following the update\n\nvt = argmax\n\nv\u2208C\n\n(cid:104)v, gt(cid:105)\n\nxt+1 = xt +\n\n1\nT\n\nvt,\n\n(7)\n\nwhere T is the total number of iterations of the algorithm. The iterates are assumed to be initialized\nat the origin which may not belong to the feasible set C. 
Though each iterate xt may not necessarily\nbe in C, the feasibility of the \ufb01nal iterate xT is guaranteed by the convexity of C. Note that the iterate\nsequence {xs}T\ns=0 can be regarded as a path from the origin (as we manually force x0 = 0) to some\nfeasible point in C. The key idea in SCG++ is to exploit the high correlation between the consecutive\niterates originated from the O(1/T )-sized increments to maintain a highly accurate estimate gt,\nwhich is the focus of the rest of this section. Note that by replacing the gradient approximation vector\ngt in the update of SCG++ by the exact gradient of the objective function, we recover the update\nrule of the continuous greedy method [7, 5].\nWe now proceed to describe our approach for evaluating the gradient approximation gt when we face\na non-oblivious problem as in (1). Given a sequence of iterates {xs}t\ns=0, the gradient of the objective\nfunction F at the iterate xt can be written in a path-integral form as\n\n\u2207F (xt) = \u2207F (x0) +\n\n\u2206s def= \u2207F (xs) \u2212 \u2207F (xs\u22121)\n\n.\n\n(8)\n\n(cid:111)\n\nt(cid:88)\n\n(cid:110)\n\ns=1\n\n(cid:90) 1\n\nBy obtaining an unbiased estimate of \u2206t = \u2207F (xt) \u2212 \u2207F (xt\u22121) and reusing the previous unbiased\nestimates for s < t, we obtain recursively an unbiased estimator of \u2207F (xt) which has a reduced\nvariance. Estimating \u2207F (xs) and \u2207F (xs\u22121) separately as suggested in (5) would cause the bias\nissue in the the non-oblivious case (see discussion at the end of section 3). Therefore, we propose an\napproach for directly estimating the difference \u2206t = \u2207F (xt) \u2212 \u2207F (xt\u22121) in an unbiased manner.\nWe construct an unbiased estimator gt of the gradient vector \u2207F (xt) by adding an unbiased estimate\n\u02dc\u2206t of the gradient difference \u2206t = \u2207F (xt) \u2212 \u2207F (xt\u22121) to gt\u22121, where gt\u22121 as an unbiased\nestimate of \u2207F (xt\u22121). 
Note that Δt = ∇F(xt) − ∇F(xt−1) can be written as

Δt = ∫₀¹ ∇²F(x(a))(xt − xt−1) da = [∫₀¹ ∇²F(x(a)) da] (xt − xt−1),    (9)

where x(a) := a·xt + (1 − a)·xt−1 for a ∈ [0, 1]. Therefore, if we sample the parameter a uniformly at random from the interval [0, 1], it can be easily verified that Δ̃t := ∇²F(x(a))(xt − xt−1) is an unbiased estimator of the gradient difference Δt, since

E_a[∇²F(x(a))(xt − xt−1)] = ∇F(xt) − ∇F(xt−1).    (10)

Therefore, all we need is an unbiased estimator of the Hessian-vector product ∇²F(y)(xt − xt−1) for the non-oblivious objective F at an arbitrary y ∈ C. In the following lemma, we present an unbiased estimator of ∇²F(y) for any y ∈ C that can be evaluated efficiently.

Algorithm 1 Stochastic Continuous Greedy++ (SCG++)
Input: Minibatch sizes |M0| and |M|, and total number of rounds T
1: Initialize x0 = 0;
2: for t = 1 to T do
3:   if t = 1 then
4:     Sample a minibatch M0 of z according to p(z; x0) and compute g0 := ∇F̃(x0; M0);
5:   else
6:     Sample a minibatch M of z according to p(z; x(a)), where a is chosen uniformly at random from [0, 1] and x(a) := a·xt + (1 − a)·xt−1;
7:     Compute the Hessian approximation ∇̃²t corresponding to M according to (12);
8:     Construct Δ̃t based on (13) (Option I) or (18) (Option II);
9:     Update the stochastic gradient approximation gt := gt−1 + Δ̃t;
10:  end if
11:  Compute the ascent direction vt := argmax_{v∈C} {v⊤gt};
12:  Update the variable xt+1 := xt + (1/T)·vt;
13: end for

Lemma 1. 
For any y \u2208 C, let z be the random variable with distribution p(z; y) and de\ufb01ne\n\n\u02dc\u22072F (y; z) def= \u02dcF (y; z)[\u2207 log p(z; y)][\u2207 log p(z; y)](cid:62) + [\u2207 \u02dcF (x; z)][\u2207 log p(z; y)](cid:62)\n+ [\u2207 log p(z; y)][\u2207 \u02dcF (y; z)](cid:62) + \u22072 \u02dcF (y; z) + \u02dcF (y; z)\u22072 log p(z; y).\n\n(11)\n\nThen, \u02dc\u22072F (y; z) is an unbiased estimator of \u22072F (y), i.e., Ez\u223cp(z;y)[ \u02dc\u22072F (y; z)] = \u22072F (y).\nThe result in Lemma 1 shows how to evaluate an unbiased estimator of the Hessian \u22072F (y). If we\nconsider a as a random variable with a uniform distribution over the interval [0, 1], then we can de\ufb01ne\nthe random variable z(a) with the probability distribution p(z(a); x(a)) where x(a) is de\ufb01ned as\nx(a) := a \u00b7 xt + (1 \u2212 a) \u00b7 xt\u22121. Considering these two random variables and the result in Lemma 1,\n\nwe can construct an unbiased estimator of the integral(cid:82) 1\n\n0 \u22072F (x(a))da in (9) by\n\n\u02dc\u22072F (x(a); z(a)),\n\n(12)\n\n(a,z(a))\u2208M\n\nt which is an unbiased estimator of(cid:82) 1\n\nwhere M is a minibatch containing |M| samples of random tuple (a, z(a)). Once we have access to\n\u02dc\u22072\n0 \u22072F (x(a))da, we can approximate the gradient difference\n\u2206t by its unbiased estimator which is de\ufb01ned as\n\u02dc\u2206t := \u02dc\u22072\n\n(13)\nt (xt \u2212 xt\u22121) requires O(d2)\nNote that for the general objective F (\u00b7), the matrix-vector product \u02dc\u22072\ncomputation and memory. 
To resolve this issue, in Section 4.1 we provide an implementation of\n(13) using only \ufb01rst-order information which has a computational and memory complexity of O(d).\nUsing \u02dc\u2206t as an unbiased estimator of the gradient difference \u2206t, we de\ufb01ne our gradient estimator as\n\nt (xt \u2212 xt\u22121).\n\n(cid:88)\n\n\u02dc\u22072\n\nt\n\ndef=\n\n1\n|M|\n\nt(cid:88)\n\ngt = \u2207 \u02dcF (x0;M0) +\n\n\u02dc\u2206t.\n\n(14)\n\nThis update can also be written in a recursive way as gt = gt\u22121 + \u02dc\u2206t, if we set g0 = \u2207 \u02dcF (x0;M0).\nNote that the proposed approach for gradient approximation in (14) has a variance reduction mecha-\nnism which leads to optimal computational complexity of SCG++ in terms of number of calls to\nthe stochastic oracle. We further highlight this point in Section 4.2.\n\ni=1\n\nImplementation of the Hessian-Vector Product\n\n4.1\nNow we focus on the computation of the gradient difference approximation \u02dc\u2206t in (13). We aim\nto come up with a scheme that avoids explicitly computing the matrix estimator \u02dc\u22072\nt which has a\n\n5\n\n\fcomplexity of O(d2), and present an approach directly approximating \u02dc\u2206t that only uses the \ufb01nite\ndifferences of gradients with a complexity of O(d). Based on (12), computing \u02dc\u22072\nt (xt \u2212 xt\u22121) is\nequivalent to computing |M| instances of \u02dc\u22072F (y; z)(xt \u2212 xt\u22121) for some y \u2208 C and z \u2208 Z. 
Denote d = xt − xt−1 and use the expression in (11) to write

∇̃²F(y; z) · d = F̃(y; z)[∇ log p(z; y)⊤d]∇ log p(z; y) + [∇ log p(z; y)⊤d]∇F̃(y; z) + [∇F̃(y; z)⊤d]∇ log p(z; y) + ∇²F̃(y; z) · d + F̃(y; z)∇² log p(z; y) · d.    (15)

Note that the first three terms can be computed in time O(d), and only the last two terms on the right-hand side of (15) involve O(d²) operations; these can be approximated by the following finite gradient difference scheme. For any twice differentiable function ψ : R^d → R and arbitrary d ∈ R^d with bounded Euclidean norm ‖d‖ ≤ D, we compute, for some small δ > 0,

φ(δ; ψ) := (∇ψ(y + δ·d) − ∇ψ(y − δ·d)) / (2δ) ≃ ∇²ψ(y) · d.    (16)

As the Hessian of ψ(·) is L2-Lipschitz continuous, the above approximation error can be bounded by ‖∇²ψ(y)·d − φ(δ; ψ)‖ = ‖∇²ψ(y)·d − ∇²ψ(x̃)·d‖ ≤ D²L2δ, where x̃ is obtained from the mean value theorem. This quantity can be made arbitrarily small by decreasing δ. In the next section, we show that setting δ = O(ε²) is sufficient, where ε is the target accuracy. 
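The finite-difference rule (16) is easy to sketch; below, ψ is a toy polynomial with a closed-form gradient and Hessian so the Hessian-vector product can be checked (the test function and points are illustrative assumptions, not from the paper):

```python
def hvp_finite_diff(grad, y, d, delta=1e-4):
    # phi(delta; psi) = (grad psi(y + delta*d) - grad psi(y - delta*d)) / (2*delta), cf. (16).
    # Approximates the Hessian-vector product Hess(psi)(y) @ d with two gradient calls: O(d) cost.
    y_plus = [a + delta * b for a, b in zip(y, d)]
    y_minus = [a - delta * b for a, b in zip(y, d)]
    gp, gm = grad(y_plus), grad(y_minus)
    return [(a - b) / (2.0 * delta) for a, b in zip(gp, gm)]

# Toy psi(y) = y0^2 * y1, with gradient (2*y0*y1, y0^2) and Hessian [[2*y1, 2*y0], [2*y0, 0]].
def grad_psi(y):
    return [2.0 * y[0] * y[1], y[0] ** 2]

y = [1.5, -0.5]
d = [0.3, 2.0]
approx = hvp_finite_diff(grad_psi, y, d)
exact = [2.0 * y[1] * d[0] + 2.0 * y[0] * d[1],  # first row of Hess(psi)(y) @ d
         2.0 * y[0] * d[0]]                       # second row of Hess(psi)(y) @ d
```

Since only two gradient evaluations are needed, the cost is O(d) rather than the O(d²) of forming the Hessian; per the analysis here, choosing δ = O(ε²) keeps the resulting bias below the target accuracy.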
By applying the technique of (16) to the two\nfunctions \u03c8(y) = \u02dcF (y; z) and \u03c8(y) = log p(z; y), we can approximate (15) in time O(d):\n\u03be\u03b4(y; z) def= \u02dcF (y; z)[\u2207 log p(z; y)(cid:62)d]\u2207 log p(z; y) + [\u2207 log p(z; y)(cid:62)d]\u2207 \u02dcF (x; z)\n+ [\u2207 \u02dcF (y; z)(cid:62)d][\u2207 log p(z; y)] + \u03c6(\u03b4; \u02dcF (y; z)) + \u03c6(\u03b4; log p(z; y)).\n\n(17)\n\nWe further can de\ufb01ne a minibatch version of that which is used in Option II of Step 8 in Algorithm 1,\n\n(cid:88)\n\n\u03be\u03b4(x;M) def=\n\n1\n|M|\n\n(a,z(a))\u2208M\n\n\u03be\u03b4(x(a); z(a)).\n\n(18)\n\n4.2 Convergence Analysis\nIn this section, we analyze the convergence of Algorithm 1 using (18) as the gradient-difference\nestimation. The result for (13) can be obtained similarly. We note that (13) is a special case of (18)\nby taking \u03b4 \u2192 0 (e.g., by letting \u03b4 = O(\u00012)). We \ufb01rst state the assumptions required for our analysis.\nAssumption 4.1 (function value at the origin). The function value F at the origin is F (0) \u2265 0.\nAssumption 4.2 (bounded stochastic function value). The stochastic function \u02dcF (x; z) has bounded\nfunction value for all z \u2208 Z and x \u2208 C: maxz\u2208Z,x\u2208C \u02dcF (x; z) \u2264 B.\nAssumption 4.3 (monotonicity and DR-submodularity). F is monotone and DR-submodular.\nAssumption 4.4 (compactness of feasible domain). The set C is compact with diameter D.\nAssumption 4.5 (bounded gradient norm). For all x \u2208 C, the stochastic gradient \u2207 \u02dcF has bounded\nnorm: \u2200z \u2208 Z,(cid:107)\u2207 \u02dcF (x; z)(cid:107) \u2264 G \u02dcF , and the norm of the gradient of log p has bounded fourth-order\nmoment, i.e., Ez\u223cp(x;z)(cid:107)\u2207 log p(z; x)(cid:107)4 \u2264 G4\nAssumption 4.6 (bounded second-order derivatives). 
\u2200x \u2208 C, the Hessian \u22072 \u02dcF has bounded\nspectral norm \u2200z \u2208 Z,(cid:107)\u22072 \u02dcF (x; z)(cid:107) \u2264 L \u02dcF , and spectral norm of the log-probability Hessian has\nbounded second moment: Ez\u223cp(z;x)(cid:107)\u22072 log p(z; x)(cid:107)2 \u2264 L2\np. Further we de\ufb01ne L = max{L \u02dcF , Lp}.\nAssumption 4.7 (continuity of the Hessian). The stochastic Hessian is L2,f -Lipschitz continuous,\ni.e, for all x, y \u2208 C and all z \u2208 Z, i.e., (cid:107)\u22072 \u02dcF (x; z) \u2212 \u22072 \u02dcF (y; z)(cid:107) \u2264 L2, \u02dcF(cid:107)x \u2212 y(cid:107). The Hessian\nof the log probability log p(x; z) is L2,p-Lipschitz continuous: for all x, y \u2208 C and all z \u2208 Z, i.e.,\n(cid:107)\u22072 log p(x; z) \u2212 \u22072 log p(y; z)(cid:107) \u2264 L2,p(cid:107)x \u2212 y(cid:107). Further, de\ufb01ne L2 = max{L2, \u02dcF , L2,p}.\nRemark 1. Assumption 4.7 is only used to show the \ufb01nite difference scheme (15) has bounded\nvariance, and the oracle complexity of our method does not depend on L2, \u02dcF and L2,p.\n\np. Further we de\ufb01ne G = max{G \u02dcF , Gp}.\n\nAs we mentioned in the previous section, the update for the stochastic gradient vector gt in the update\nof SCG++ is designed properly to reduce the noise of gradient approximation. In the following\nlemma, we formally characterize the variance of gradient approximation for SCG++ . To this end,\nwe also need to properly choose the minibatch sizes |M0| and |M|.\n\n6\n\n\fLemma 2. Consider SCG++ outlined in Algorithm 1 and assume that in Step 8 we follow (18) to\nconstruct the gradient difference approximation \u02dc\u2206t (Option II). 
If Assumptions (4.2), (4.4), (4.5),\n(4.6), and (4.7) hold and we set the minibatch sizes to |M0| = (G2/( \u00afL2D2\u00012)) and |M| = 2/\u0001, and\nthe error of Hessian-vector product approximation \u03b4 is O(\u00012) as in (31), then\n\nE(cid:2)(cid:107)gt \u2212 \u2207F (xt)(cid:107)2(cid:3) \u2264 (1 + \u0001t) \u00afL2D2\u00012,\n\n\u2200t \u2208 {0, . . . , T \u2212 1},\n\n(19)\n\nwhere \u00afL is a constant de\ufb01ned by \u00afL2 def= 4B2G4 + 16G4 + 4L2 + 4B2L2.\nLemma 2 shows that by |M| = O(\u0001\u22121) calls to the stochastic oracle at each iteration, the variance\nof gradient approximation in SCG++ after t iterations is of order O((1 + \u0001t)\u0001). In the following\ntheorem, we incorporate this result to characterize the convergence guarantee of SCG++ .\nTheorem 1. Consider the SCG++ method outlined in Algorithm 1 and assume that in Step 8\nwe follow the update in (18) to construct the gradient difference approximation \u02dc\u2206t (Option II). If\nAssumptions 4.1-4.7 hold, then the output of SCG++ denoted by xT satis\ufb01es\n\nE(cid:2)F (xT )(cid:3) \u2265 (1 \u2212 1/e)F (x\u2217) \u2212 2 \u00afLD2\u0001,\n\n2\u0001 , T = 1\n\n2 \u00afL2D2\u00012 , |M| = 1\n\n\u0001 , and \u03b4 = O(\u00012) as in (31). Here \u00afL is a constant\n\nby setting |M0| = G2\nde\ufb01ned by \u00afL2 def= 4B2G4 + 16G4 + 4L2 + 4B2L2.\nThe result in Theorem 1 shows that after at most T = O(1/\u0001) iterations the objective function value\nfor the output of SCG++ is at least (1 \u2212 1/e)OPT \u2212 O(\u0001). As the number of calls to the stochastic\noracle per iteration is of O(1/\u0001), to reach a [(1 \u2212 1/e)OPT \u2212 O(\u0001)] approximation guarantee the\nSCG++ method has an overall stochastic \ufb01rst-order oracle complexity of O(1/\u00012). We formally\ncharacterize this result in the following corollary.\nCorollary 1 (oracle complexities). 
To \ufb01nd a [(1 \u2212 1/e)OPT \u2212 \u0001] solution to Problem (1) using Algo-\nrithm 1 with Option II, the overall stochastic \ufb01rst-order oracle complexity is (2G2D2 + 4 \u00afL2D4)/\u00012\nand the overall linear optimization oracle complexity is 2 \u00afLD2/\u0001.\n\n5 Discrete Stochastic Submodular Maximization\nIn this section, we focus on extending our result in the previous section to the case where F is\nthe multilinear extension of a (stochastic) discrete submodular function f. This is also an instance\nof the non-oblivious stochastic optimization in (1). Indeed, once such a result is achieved, with\nproper rounding scheme such as randomized pipage rounding [6] or contention resolution method\n[32], we can extend our results to the discrete setting. Let V denote a \ufb01nite set of d elements, i.e.,\nV = {1, . . . , d}. Consider a discrete submodular function f : 2V \u2192 R+, which is de\ufb01ned as an\nexpectation over a set of functions f\u03b3 : 2V \u2192 R+. Our goal is to maximize f subject to some\nconstraint I, where I contains feasible subsets of V . In other words, we aim to solve the following\ndiscrete and stochastic submodular function maximization problem\nE\u03b3\u223cp(\u03b3)[f\u03b3(S)],\n\n(20)\nwhere p(\u03b3) is an arbitrary distribution. In particular, we assume the pair M = {V,I} forms a matroid\nwith rank r. The prototypical example is maximization under the cardinality constraint, i.e., for a\ngiven integer r, \ufb01nd S \u2286 V , |S| \u2264 r, which maximizes f. The challenge here is to \ufb01nd a solution\nwith near-optimal quality for the problem in (20) without computing the expectation in (20). That is,\nwe assume access to an oracle that, given a set S, outputs an independently chosen sample f\u03b3(S)\nwhere \u03b3 \u223c p(\u03b3). 
The focus of this section is on extending our result to the discrete domain and showing that SCG++ can be applied to maximize a stochastic submodular set function f, namely Problem (20), through the multilinear extension of f. Specifically, in lieu of solving (20) we can solve the continuous problem

max_{x∈C} F(x),    (21)

where F : [0, 1]^V → R₊ is the multilinear extension of f, defined as

F(x) := Σ_{S⊆V} f(S) Π_{i∈S} x_i Π_{j∉S} (1 − x_j) = Σ_{S⊆V} E_{γ∼p(γ)}[f_γ(S)] Π_{i∈S} x_i Π_{j∉S} (1 − x_j),    (22)

and the convex set C = conv{1_I : I ∈ I} is the matroid polytope [6]. Note that here x_i denotes the i-th component of the vector x. In other words, F(x) is the expected value of f over random sets wherein each element i is included independently with probability x_i. To solve (21) using SCG++, we need access to unbiased estimators of the gradient and the Hessian. We now construct the Hessian approximation ∇̃²_t using the result in [6], which is stated as Lemma 4 in the supplementary material. Let a be a uniform random variable on [0, 1], and let e = (e₁, · · · , e_d) be a random vector whose entries e_i are generated i.i.d. according to the uniform distribution on the unit interval [0, 1]. In each iteration, a minibatch M of |M| samples of {a, e, γ} (recall that γ is the random variable that parameterizes the component function f_γ), i.e., M = {a_k, e_k, γ_k}_{k=1}^{|M|}, is generated. Then for all k ∈ [|M|], we let x_{a_k} = a_k x_t + (1 − a_k) x_{t−1} and construct the random set S(x_{a_k}, e_k) from x_{a_k} and e_k in the following way: s ∈ S(x_{a_k}, e_k) if and only if [e_k]_s ≤ [x_{a_k}]_s for s ∈ [d].
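The random-set construction S(x, e) together with the expectation form of (22) suggests a simple Monte Carlo estimator of F(x); below is a minimal sketch (the function f and the point x are illustrative assumptions, not from the paper):

```python
import random

def random_set(x, e=None):
    """Round x in [0,1]^d to the random set S(x, e): include element s
    iff e_s <= x_s, with each e_s uniform on [0, 1]."""
    if e is None:
        e = [random.random() for _ in range(len(x))]
    return {s for s in range(len(x)) if e[s] <= x[s]}

def multilinear_estimate(f, x, num_samples=20000):
    """Unbiased Monte Carlo estimate of the multilinear extension
    F(x) = E[f(S(x, e))], averaging over fresh draws of e (and, in the
    stochastic case, fresh gamma inside f)."""
    return sum(f(random_set(x)) for _ in range(num_samples)) / num_samples

# Toy monotone submodular function (an assumption for illustration):
# f(S) = min(|S|, 2).
f = lambda S: min(len(S), 2)
est = multilinear_estimate(f, [0.5] * 4)
# For this f and x = (1/2, 1/2, 1/2, 1/2), F(x) = E[min(Bin(4, 1/2), 2)]
# = (0*1 + 1*4 + 2*11)/16 = 1.625, so est should be close to 1.625.
```

The same sampling scheme yields unbiased gradient estimates, since each partial derivative of F is a difference of two such expectations with one coordinate fixed.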
Having S(x_{a_k}, e_k) and γ_k, each off-diagonal entry of the Hessian estimator ∇̃²_t ∈ R^{d×d} is

[∇̃²_t]_{i,j} = (1/|M|) Σ_{k∈[|M|]} [ f_{γ_k}(S(x_{a_k}, e_k) ∪ {i, j}) − f_{γ_k}(S(x_{a_k}, e_k) ∪ {i} \ {j}) − f_{γ_k}(S(x_{a_k}, e_k) ∪ {j} \ {i}) + f_{γ_k}(S(x_{a_k}, e_k) \ {i, j}) ],    (23)

where i ≠ j, and [∇̃²_t]_{i,j} = 0 if i = j. As linear optimization over the rank-r matroid polytope always returns a vector v_t with at most r nonzero entries, the complexity of computing (23) is O(|M|rd). We use the above Hessian approximation to solve (21) as a special case of Problem (1) using SCG++.

Theorem 2. Consider D_γ := max_{i∈V} f_γ(i) as the maximum marginal value of f_γ, and define D_f := √(E_γ[D_γ²]). By using the minibatch sizes |M| = O(√(r³d) D_f/ε) and |M₀| = O(√d D_f/(√r ε²)), Algorithm 1 finds a [(1 − 1/e)OPT − 6ε] approximation of the multilinear extension problem in (21) in at most O(√(r³d) D_f/ε) iterations. Moreover, the overall stochastic oracle cost is O(r³dD_f²/ε²).

Since the cost of a single stochastic gradient computation is O(d), Theorem 2 shows that the overall computational complexity of Algorithm 1 is O(d²/ε²). Note that, in the multilinear extension case, the smoothness Assumption 4.6 required for the results in Section 4 is absent, which is why we need to develop a more sophisticated gradient-difference estimator to achieve a similar theoretical guarantee (more details are available in the appendix).

Remark 2 (optimality of oracle complexities). Note that to achieve the tight (1 − 1/e − ε) approximation, the O(1/ε²) stochastic oracle complexity in Theorem 2 is optimal in terms of its dependence on ε.
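Each entry of the estimator in (23) is a second-order finite difference of f averaged over the minibatch; the computation can be sketched as follows (the toy functions and base sets in the sanity checks are assumptions for illustration, not the paper's implementation):

```python
def hessian_entry_estimate(f_oracles, base_sets, i, j):
    """Estimate entry (i, j) of the Hessian of the multilinear extension as
    in (23): average, over minibatch pairs (f_{gamma_k}, S_k), the difference
    f(S u {i,j}) - f((S u {i}) \\ {j}) - f((S u {j}) \\ {i}) + f(S \\ {i,j}).
    Diagonal entries of the estimator are set to zero."""
    if i == j:
        return 0.0
    total = 0.0
    for f, S in zip(f_oracles, base_sets):
        total += (f(S | {i, j}) - f((S | {i}) - {j})
                  - f((S | {j}) - {i}) + f(S - {i, j}))
    return total / len(base_sets)

# Sanity checks on toy functions (illustrative assumptions):
f_mod = lambda S: float(sum(S))   # modular => all cross-differences vanish
f_cov = lambda S: min(len(S), 1)  # coverage-like => negative interaction
zero = hessian_entry_estimate([f_mod] * 3, [set(), {2}, {0, 3}], 0, 1)
neg = hessian_entry_estimate([f_cov], [set()], 0, 1)  # 1 - 1 - 1 + 0 = -1
```

Computing a Hessian-vector product ∇̃²_t v only requires the columns indexed by the at most r nonzero entries of v, which is where the O(|M|rd) cost noted above comes from.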
A lower bound on the stochastic oracle complexity is given in the following theorem.

6 Lower Bound

In this section, we show that reaching a (1 − 1/e − ε)-optimal solution of Problem (1) requires Ω(1/ε²) calls to an oracle which provides stochastic first-order information. To do so, we first construct a stochastic submodular set function f, defined through the expectation f(S) = E_{γ∼p(γ)}[f_γ(S)], with the following property: obtaining a (1 − 1/e − ε)-optimal solution for the maximization of f under a cardinality constraint (an instance of Problem (20)) requires at least Ω(1/ε²) samples of the form f_γ(·), where γ is generated i.i.d. from the distribution p. Such a lower bound on sample complexity extends directly to Problem (1) with a stochastic first-order oracle, by considering the multilinear extension of f, denoted by F, and noting that (i) Problems (20) and (21) have the same optimal values, and (ii) one can construct an unbiased estimator of the gradient of the multilinear extension using d independent samples from the underlying stochastic set function f. Hence, any method for maximizing (21) is also an algorithm for maximizing (20) with the same guarantee on the quality of the solution and with a sample complexity that differs by at most a factor of d. We now provide the formal statements of the above argument.

Theorem 3. There exists a distribution p(γ) and a monotone submodular function f : 2^V → R₊, given as f(S) = E_{γ∼p(γ)}[f_γ(S)], such that the following holds: in order to find a (1 − 1/e − ε)-optimal solution for (20) with a k-cardinality constraint, any algorithm requires at least min{exp(αk), β/ε²} stochastic samples f_γ(·).

Corollary 2.
There exists a DR-submodular function F : [0, 1]ⁿ → R, a convex constraint C, and a stochastic first-order oracle O_first, such that any algorithm for maximizing F subject to C requires at least min{exp(αn), β/ε²} queries to O_first.

7 Conclusion

In this paper, we developed SCG++, the first efficient variant of continuous greedy for maximizing a stochastic continuous DR-submodular function subject to a convex constraint. We showed that SCG++ achieves a tight [(1 − 1/e)OPT − ε] solution while using O(1/ε²) stochastic gradients. We further derived a tight lower bound on the number of calls to the first-order stochastic oracle for achieving a [(1 − 1/e)OPT − ε] approximate solution. This result shows that SCG++ has the optimal sample complexity for achieving the tight (1 − 1/e) approximation guarantee for monotone stochastic DR-submodular functions.

Acknowledgment
The work of H. Hassani was partially supported by NSF CPS-1837253. Karbasi's work is partially supported by NSF (IIS-1845032), ONR (N00014-19-1-2406), and AFOSR (FA9550-18-1-0160). Shen's work is supported by Zhejiang Provincial Natural Science Foundation of China under Grant No. LZ18F020002, and National Natural Science Foundation of China (Grants No. 61672376, 61751209, and 61472347).

References
[1] A. Agarwal, M. J. Wainwright, P. L. Bartlett, and P. K. Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, pages 1–9, 2009.

[2] Z. Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. In Advances in Neural Information Processing Systems, pages 2680–2691, 2018.

[3] F. Bach. Submodular functions: from discrete to continuous domains. arXiv preprint arXiv:1511.00394, 2015.

[4] A. Badanidiyuru and J. Vondrák.
Fast algorithms for maximizing submodular functions. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1497–1514, 2014.

[5] A. A. Bian, B. Mirzasoleiman, J. Buhmann, and A. Krause. Guaranteed non-convex optimization: Submodular maximization over continuous domains. In Artificial Intelligence and Statistics, pages 111–120, 2017.

[6] G. Calinescu, C. Chekuri, M. Pál, and J. Vondrák. Maximizing a submodular set function subject to a matroid constraint. In IPCO, volume 7, pages 182–196. Springer, 2007.

[7] G. Calinescu, C. Chekuri, M. Pál, and J. Vondrák. Maximizing a monotone submodular function subject to a matroid constraint. SIAM Journal on Computing, 40(6):1740–1766, 2011.

[8] C. Chekuri, J. Vondrák, and R. Zenklusen. Submodular function maximization via the multilinear relaxation and contention resolution schemes. SIAM Journal on Computing, 43(6):1831–1879, 2014.

[9] A. Das and D. Kempe. Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. In ICML, 2011.

[10] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.

[11] J. Djolonga and A. Krause. From MAP to marginals: Variational inference in Bayesian submodular models. In NIPS, 2014.

[12] C. Fang, C. J. Li, Z. Lin, and T. Zhang. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 687–697, 2018.

[13] U. Feige. A threshold of ln n for approximating set cover. Journal of the ACM (JACM), 45(4):634–652, 1998.

[14] M. Feldman, J. Naor, and R. Schwartz.
A unified continuous greedy algorithm for submodular maximization. In IEEE 52nd Annual Symposium on Foundations of Computer Science, pages 570–579, 2011.

[15] S. Fujishige. Submodular Functions and Optimization, volume 58 of Annals of Discrete Mathematics. North-Holland, Amsterdam, 2nd edition, 2005. ISBN 0-444-52086-4.

[16] S. O. Gharan and J. Vondrák. Submodular maximization by simulated annealing. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1098–1116, 2011.

[17] H. Hassani, M. Soltanolkotabi, and A. Karbasi. Gradient methods for submodular maximization. In Advances in Neural Information Processing Systems, pages 5841–5851, 2017.

[18] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[19] A. Kulesza and B. Taskar. Determinantal point processes for machine learning. arXiv preprint arXiv:1207.6083, 2012.

[20] H. Lin and J. Bilmes. A class of submodular functions for document summarization. In Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011.

[21] H. Lin and J. Bilmes. Word alignment via submodular maximization over matroids. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, pages 170–175. Association for Computational Linguistics, 2011.

[22] A. Mokhtari, H. Hassani, and A. Karbasi. Conditional gradient method for stochastic submodular maximization: Closing the gap. In International Conference on Artificial Intelligence and Statistics, pages 1886–1895, 2018.

[23] A. Mokhtari, H. Hassani, and A. Karbasi. Stochastic conditional gradient methods: From convex minimization to submodular maximization.
arXiv preprint arXiv:1804.09554, 2018.

[24] G. L. Nemhauser and L. A. Wolsey. Best algorithms for approximating the maximum of a submodular set function. Mathematics of Operations Research, 3(3):177–188, 1978.

[25] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions–I. Mathematical Programming, 14(1):265–294, 1978.

[26] R. Niazadeh, T. Roughgarden, and J. R. Wang. Optimal algorithms for continuous non-monotone submodular and DR-submodular maximization. 2018.

[27] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pages 314–323, 2016.

[28] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.

[29] T. Soma and Y. Yoshida. A generalization of submodular cover via the diminishing return property on the integer lattice. In NIPS, 2015.

[30] M. Sviridenko, J. Vondrák, and J. Ward. Optimal approximation for submodular and supermodular optimization with bounded curvature. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1134–1148, 2015.

[31] J. Vondrák. Optimal approximation for the submodular welfare problem in the value oracle model. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pages 67–74. ACM, 2008.

[32] J. Vondrák, C. Chekuri, and R. Zenklusen. Submodular function maximization via the multilinear relaxation and contention resolution schemes. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pages 783–792.
ACM, 2011.