{"title": "Tight Complexity Bounds for Optimizing Composite Objectives", "book": "Advances in Neural Information Processing Systems", "page_first": 3639, "page_last": 3647, "abstract": "We provide tight upper and lower bounds on the complexity of minimizing the average of m convex functions using gradient and prox oracles of the component functions. We show a significant gap between the complexity of deterministic vs randomized optimization. For smooth functions, we show that accelerated gradient descent (AGD) and an accelerated variant of SVRG are optimal in the deterministic and randomized settings respectively, and that a gradient oracle is sufficient for the optimal rate. For non-smooth functions, having access to prox oracles reduces the complexity and we present optimal methods based on smoothing that improve over methods using just gradient accesses.", "full_text": "Tight Complexity Bounds for Optimizing Composite\n\nObjectives\n\nToyota Technological Institute at Chicago\n\nToyota Technological Institute at Chicago\n\nBlake Woodworth\n\nChicago, IL, 60637\nblake@ttic.edu\n\nNathan Srebro\n\nChicago, IL, 60637\nnati@ttic.edu\n\nAbstract\n\nWe provide tight upper and lower bounds on the complexity of minimizing the\naverage of m convex functions using gradient and prox oracles of the component\nfunctions. We show a signi\ufb01cant gap between the complexity of deterministic vs\nrandomized optimization. For smooth functions, we show that accelerated gradi-\nent descent (AGD) and an accelerated variant of SVRG are optimal in the deter-\nministic and randomized settings respectively, and that a gradient oracle is suf\ufb01-\ncient for the optimal rate. 
For non-smooth functions, having access to prox oracles reduces the complexity and we present optimal methods based on smoothing that improve over methods using just gradient accesses.

1 Introduction

We consider minimizing the average of $m \ge 2$ convex functions:

$$\min_{x\in\mathcal{X}} \Big( F(x) := \frac{1}{m}\sum_{i=1}^m f_i(x) \Big) \qquad (1)$$

where $\mathcal{X} \subseteq \mathbb{R}^d$ is a closed, convex set, and where the algorithm is given access to the following gradient (or subgradient in the case of non-smooth functions) and prox oracle for the components:

$$h_F(x, i, \beta) = \big[\, f_i(x),\ \nabla f_i(x),\ \mathrm{prox}_{f_i}(x, \beta) \,\big] \qquad (2)$$

where

$$\mathrm{prox}_{f_i}(x, \beta) = \arg\min_{u\in\mathcal{X}} \Big\{ f_i(u) + \frac{\beta}{2}\|x - u\|^2 \Big\} \qquad (3)$$

A natural question is how to leverage the prox oracle, and how much benefit it provides over gradient access alone. The prox oracle is potentially much more powerful, as it provides global, rather than local, information about the function. For example, for a single function ($m = 1$), one prox oracle call (with $\beta = 0$) is sufficient for exact optimization. Several methods have recently been suggested for optimizing a sum or average of several functions using prox accesses to each component, both in the distributed setting where each component might be handled on a different machine (e.g. ADMM [7], DANE [18], DISCO [20]) and for functions that can be decomposed into several "easy" parts (e.g. PRISMA [13]). But as far as we are aware, no meaningful lower bound was previously known on the number of prox oracle accesses required even for the average of two functions ($m = 2$).

The optimization of composite objectives of the form (1) has also been extensively studied in the context of minimizing empirical risk over $m$ samples.
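As a concrete illustration of the oracle in (2)-(3), here is a minimal one-dimensional sketch for the component $f(u) = |u|$, whose prox has a closed soft-thresholding form; the brute-force checker and all function names are our own illustrative choices, not part of the paper:

```python
import math

def prox_abs(x, beta):
    """Prox oracle for f(u) = |u| in one dimension:
    argmin_u |u| + (beta/2)*(x - u)**2, i.e. soft-thresholding at 1/beta.
    With beta = 0 a single prox call returns an exact minimizer of f."""
    if beta == 0:
        return 0.0  # global minimizer of |u|
    return math.copysign(max(abs(x) - 1.0 / beta, 0.0), x)

def prox_brute_force(x, beta, lo=-10.0, hi=10.0, steps=80001):
    """Reference prox by grid search, used only to check the closed form."""
    best_u, best_val = lo, float("inf")
    for k in range(steps):
        u = lo + (hi - lo) * k / (steps - 1)
        val = abs(u) + 0.5 * beta * (x - u) ** 2
        if val < best_val:
            best_u, best_val = u, val
    return best_u
```

For example, `prox_abs(3.0, 0.5)` shrinks 3.0 by the threshold $1/\beta = 2$, and the grid search recovers the same point up to the grid resolution.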
Recently, stochastic methods such as SDCA [16], SAG [14], SVRG [8], and other variants have been presented which leverage the finite nature of the problem to reduce the variance in stochastic gradient estimates and obtain guarantees that dominate both batch and stochastic gradient descent. As methods with improved complexity, such

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

|                     | $L$-Lipschitz, convex, $\|x\| \le B$ | $L$-Lipschitz, $\lambda$-strongly convex | $\beta$-smooth, convex, $\|x\| \le B$ | $\beta$-smooth, $\lambda$-strongly convex |
|---------------------|--------------------------------------|------------------------------------------|----------------------------------------|--------------------------------------------|
| Deterministic upper | $\frac{mLB}{\epsilon}$ (Section 3)   | $\frac{mL}{\sqrt{\lambda\epsilon}}$ (Section 3) | $m\sqrt{\frac{\beta B^2}{\epsilon}}$ (AGD) | $m\sqrt{\frac{\beta}{\lambda}}\log\frac{\epsilon_0}{\epsilon}$ (AGD) |
| Deterministic lower | $\frac{mLB}{\epsilon}$ (Section 4)   | $\frac{mL}{\sqrt{\lambda\epsilon}}$ (Section 4) | $m\sqrt{\frac{\beta B^2}{\epsilon}}$ (Section 4) | $m\sqrt{\frac{\beta}{\lambda}}\log\frac{\epsilon_0}{\epsilon}$ (Section 4) |
| Randomized upper    | $\frac{L^2B^2}{\epsilon^2} \wedge \big(m\log\frac{1}{\epsilon} + \frac{\sqrt{m}LB}{\epsilon}\big)$ (SGD, A-SVRG) | $\frac{L^2}{\lambda\epsilon} \wedge \big(m\log\frac{1}{\epsilon} + \frac{\sqrt{m}L}{\sqrt{\lambda\epsilon}}\big)$ (SGD, A-SVRG) | $m + \sqrt{\frac{m\beta B^2}{\epsilon}}$ (A-SVRG) | $\big(m + \sqrt{\frac{m\beta}{\lambda}}\big)\log\frac{\epsilon_0}{\epsilon}$ (A-SVRG) |
| Randomized lower    | $\frac{L^2B^2}{\epsilon^2} \wedge \big(m + \frac{\sqrt{m}LB}{\epsilon}\big)$ (Section 5) | $\frac{L^2}{\lambda\epsilon} \wedge \big(m + \frac{\sqrt{m}L}{\sqrt{\lambda\epsilon}}\big)$ (Section 5) | $m + \sqrt{\frac{m\beta B^2}{\epsilon}}$ (Section 5) | $\big(m + \sqrt{\frac{m\beta}{\lambda}}\big)\log\frac{\epsilon_0}{\epsilon}$ (Section 5) |

Table 1: Upper and lower bounds on the number of grad-and-prox oracle accesses needed to find $\epsilon$-suboptimal solutions for each function class. These are exact up to constant factors except for the lower bounds for smooth and strongly convex functions, which hide extra $\log(\beta/\lambda)$ and $\log(\sqrt{m}\,\beta/\lambda)$ factors for deterministic and randomized algorithms.
Here, $\epsilon_0$ is the suboptimality of the point 0.

as accelerated SDCA [17], accelerated SVRG, and KATYUSHA [3] have been presented, researchers have also tried to obtain lower bounds on the best possible complexity in this setting, but as we survey below, these have not been satisfactory so far.

In this paper, after briefly surveying methods for smooth, composite optimization, we present methods for optimizing non-smooth composite objectives, which show that prox oracle access can indeed be leveraged to improve over methods using merely subgradient access (see Section 3). We then turn to studying lower bounds. We consider algorithms that access the objective $F$ only through the oracle $h_F$ and provide lower bounds on the number of such oracle accesses (and thus the runtime) required to find $\epsilon$-suboptimal solutions. We consider optimizing both Lipschitz (non-smooth) functions and smooth functions, and guarantees that do and do not depend on strong convexity, distinguishing between deterministic optimization algorithms and randomized algorithms. Our upper and lower bounds are summarized in Table 1.

As shown in the table, we provide matching upper and lower bounds (up to a log factor) for all function and algorithm classes. In particular, our bounds establish the optimality (up to log factors) of accelerated SDCA, SVRG, and SAG for randomized finite-sum optimization, and also the optimality of our deterministic smoothing algorithms for non-smooth composite optimization.

On the power of gradient vs. prox oracles   For non-smooth functions, we show that having access to prox oracles for the components can reduce the polynomial dependence on $\epsilon$ from $1/\epsilon^2$ to $1/\epsilon$, or from $1/(\lambda\epsilon)$ to $1/\sqrt{\lambda\epsilon}$ for $\lambda$-strongly convex functions. However, all of the optimal complexities for smooth functions can be attained with only component gradient access using accelerated gradient descent (AGD) or accelerated SVRG.
Thus the worst-case complexity cannot be improved (at least not significantly) by using the more powerful prox oracle.

On the power of randomization   We establish a significant gap between deterministic and randomized algorithms for finite-sum problems. Namely, the dependence on the number of components must be linear in $m$ for any deterministic algorithm, but can be reduced to $\sqrt{m}$ (in the typically significant term) using randomization. We emphasize that the randomization here is only in the algorithm, not in the oracle. We always assume the oracle returns an exact answer (for the requested component) and is not a stochastic oracle. The distinction is that the algorithm is allowed to flip coins in deciding what operations and queries to perform, but the oracle must return an exact answer to that query (of course, the algorithm could simulate a stochastic oracle).

Prior Lower Bounds   Several authors recently presented lower bounds for optimizing (1) in the smooth and strongly convex setting using component gradients. Agarwal and Bottou [1] presented a lower bound of $\Omega\big(m + \sqrt{\frac{m\beta}{\lambda}}\log\frac{1}{\epsilon}\big)$. However, their bound is valid only for deterministic algorithms (thus not including SDCA, SVRG, SAG, etc.); we not only consider randomized algorithms, but also show a much higher lower bound for deterministic algorithms (i.e. the bound of Agarwal and Bottou is loose). Improving upon this, Lan [9] shows a similar lower bound for a restricted class of randomized algorithms: the algorithm must select which component to query for a gradient by drawing an index from a fixed distribution, but the algorithm must otherwise be deterministic in how it uses the gradients, and its iterates must lie in the span of the gradients it has received. This restricted class includes SAG, but not SVRG nor perhaps other realistic attempts at improving over these.
Furthermore, both bounds allow only gradient accesses, not prox computations. Thus SDCA, which requires prox accesses, and potential variants are not covered by such lower bounds. We prove a similar lower bound to Lan's, but our analysis is much more general and applies to any randomized algorithm, making any sequence of queries to a gradient and prox oracle, and without assuming that iterates lie in the span of previous responses. In addition to smooth functions, we also provide lower bounds for non-smooth problems, which were not considered by these previous attempts. Another recent observation [15] was that with access only to random component subgradients, without knowing the component's identity, an algorithm must make $\Omega(m^2)$ queries to optimize well. This shows how relatively subtle changes in the oracle can have a dramatic effect on the complexity of the problem. Since the oracle we consider is quite powerful, our lower bounds cover a very broad family of algorithms, including SAG, SVRG, and SDCA.

Our deterministic lower bounds are inspired by a lower bound on the number of rounds of communication required for optimization when each $f_i$ is held by a different machine and when iterates lie in the span of certain permitted calculations [5]. Our construction for $m = 2$ is similar to theirs (though in a different setting), but their analysis considers neither scaling with $m$ (which has a different role in their setting) nor randomization.

Notation and Definitions   We use $\|\cdot\|$ to denote the standard Euclidean norm on $\mathbb{R}^d$. We say that a function $f$ is $L$-Lipschitz continuous on $\mathcal{X}$ if $\forall x, y \in \mathcal{X}$, $|f(x) - f(y)| \le L\|x - y\|$; $\beta$-smooth on $\mathcal{X}$ if it is differentiable and its gradient is $\beta$-Lipschitz on $\mathcal{X}$; and $\lambda$-strongly convex on $\mathcal{X}$ if $\forall x, y \in \mathcal{X}$, $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\lambda}{2}\|x - y\|^2$.
We consider optimizing (1) under four combinations of assumptions: each component $f_i$ is either $L$-Lipschitz or $\beta$-smooth, and either $F(x)$ is $\lambda$-strongly convex or its domain is bounded, $\mathcal{X} \subseteq \{x : \|x\| \le B\}$.

2 Optimizing Smooth Sums

We briefly review the best known methods for optimizing (1) when the components are $\beta$-smooth, yielding the upper bounds on the right half of Table 1. These upper bounds can be obtained using only component gradient access, without need for the prox oracle.

We can obtain exact gradients of $F(x)$ by computing all $m$ component gradients $\nabla f_i(x)$. Running accelerated gradient descent (AGD) [12] on $F(x)$ using these exact gradients achieves the upper complexity bounds for deterministic algorithms and smooth problems (see Table 1).

SAG [14], SVRG [8] and related methods use randomization to sample components, but also leverage the finite nature of the objective to control the variance of the gradient estimator used. Accelerating these methods using the Catalyst framework [10] ensures that for $\lambda$-strongly convex objectives we have $\mathbb{E}\big[F(x^{(k)}) - F(x^*)\big] < \epsilon$ after $k = O\big(\big(m + \sqrt{\frac{m\beta}{\lambda}}\big)\log^2\frac{\epsilon_0}{\epsilon}\big)$ iterations, where $F(0) - F(x^*) = \epsilon_0$. KATYUSHA [3] is a more direct approach to accelerating SVRG which avoids extraneous log-factors, yielding the complexity $k = O\big(\big(m + \sqrt{\frac{m\beta}{\lambda}}\big)\log\frac{\epsilon_0}{\epsilon}\big)$ indicated in Table 1.

When $F$ is not strongly convex, adding a regularizer to the objective and instead optimizing $F_\lambda(x) = F(x) + \frac{\lambda}{2}\|x\|^2$ with $\lambda = \epsilon/B^2$ results in an oracle complexity of $O\big(\big(m + \sqrt{\frac{m\beta B^2}{\epsilon}}\big)\log\frac{\epsilon_0}{\epsilon}\big)$. The log-factor in the second term can be removed using the more delicate reduction of Allen-Zhu and Hazan [4], which involves optimizing $F_\lambda(x)$ for progressively smaller values of $\lambda$, yielding the upper bound in the table.

KATYUSHA and Catalyst-accelerated SAG or SVRG use only gradients of the components.
Accelerated SDCA [17] achieves a similar complexity using gradient and prox oracle access.

3 Leveraging Prox Oracles for Lipschitz Sums

In this section, we present algorithms for leveraging the prox oracle to minimize (1) when each component is $L$-Lipschitz. This will be done by using the prox oracle to "smooth" each component, and optimizing the new, smooth sum which approximates the original problem. This idea was used in order to apply KATYUSHA [3] and accelerated SDCA [17] to non-smooth objectives. We are not aware of a previous explicit presentation of the AGD-based deterministic algorithm, which achieves the deterministic upper complexity indicated in Table 1.

The key is using a prox oracle to obtain gradients of the $\beta$-Moreau envelope of a non-smooth function $f$, defined as:

$$f^{(\beta)}(x) = \inf_{u\in\mathcal{X}} f(u) + \frac{\beta}{2}\|x - u\|^2 \qquad (4)$$

Lemma 1 ([13, Lemma 2.2], [6, Proposition 12.29], following [11]). Let $f$ be convex and $L$-Lipschitz continuous. For any $\beta > 0$,
1. $f^{(\beta)}$ is $\beta$-smooth
2. $\nabla (f^{(\beta)})(x) = \beta\big(x - \mathrm{prox}_f(x, \beta)\big)$
3. $f^{(\beta)}(x) \le f(x) \le f^{(\beta)}(x) + \frac{L^2}{2\beta}$

Consequently, we can consider the smoothed problem

$$\min_{x\in\mathcal{X}} \Big( \tilde{F}^{(\beta)}(x) := \frac{1}{m}\sum_{i=1}^m f_i^{(\beta)}(x) \Big). \qquad (5)$$

While $\tilde{F}^{(\beta)}$ is not, in general, the $\beta$-Moreau envelope of $F$, it is $\beta$-smooth, we can calculate the gradient of its components using the oracle $h_F$, and $\tilde{F}^{(\beta)}(x) \le F(x) \le \tilde{F}^{(\beta)}(x) + \frac{L^2}{2\beta}$.
Thus, to obtain an $\epsilon$-suboptimal solution to (1) using $h_F$, we set $\beta = L^2/\epsilon$ and apply any algorithm which can optimize (5) using gradients of the $L^2/\epsilon$-smooth components, to within $\epsilon/2$ accuracy. With the rates presented in Section 2, using AGD on (5) yields a complexity of $O\big(\frac{mLB}{\epsilon}\big)$ in the deterministic setting. When the functions are $\lambda$-strongly convex, smoothing with a fixed $\beta$ results in a spurious log-factor. To avoid this, we again apply the reduction of Allen-Zhu and Hazan [4], this time optimizing $\tilde{F}^{(\beta)}$ for increasingly large values of $\beta$. This leads to the upper bound of $O\big(\frac{mL}{\sqrt{\lambda\epsilon}}\big)$ when used with AGD (see Appendix A for details).

Similarly, we can apply an accelerated randomized algorithm (such as KATYUSHA) to the smoothed problem $\tilde{F}^{(\beta)}$ to obtain complexities of $O\big(m\log\frac{\epsilon_0}{\epsilon} + \frac{\sqrt{m}LB}{\epsilon}\big)$ and $O\big(m\log\frac{\epsilon_0}{\epsilon} + \frac{\sqrt{m}L}{\sqrt{\lambda\epsilon}}\big)$; this matches the presentation of Allen-Zhu [3] and is similar to that of Shalev-Shwartz and Zhang [17].

Finally, if $m > L^2B^2/\epsilon^2$ or $m > L^2/(\lambda\epsilon)$, stochastic gradient descent is a better randomized alternative, yielding complexities of $O(L^2B^2/\epsilon^2)$ or $O(L^2/(\lambda\epsilon))$.

4 Lower Bounds for Deterministic Algorithms

We now turn to establishing lower bounds on the oracle complexity of optimizing (1). We first consider only deterministic optimization algorithms. What we would like to show is that for any deterministic optimization algorithm we can construct a "hard" function for which the algorithm cannot find an $\epsilon$-suboptimal solution until it has made many oracle accesses. Since the algorithm is deterministic, we can construct such a function by simulating the (deterministic) behavior of the algorithm. This can be viewed as a game, where an adversary controls the oracle being used by the algorithm. At each iteration the algorithm queries the oracle with some triplet $(x, i, \beta)$ and the adversary responds with an answer. This answer must be consistent with all previous answers, but the adversary ensures it is also consistent with a composite function $F$ that the algorithm is far from optimizing.
The "hard" function is then gradually defined in terms of the behavior of the optimization algorithm.

To help us formulate our constructions, we define a "round" of queries as a series of queries in which $\lceil m/2 \rceil$ distinct functions $f_i$ are queried. The first round begins with the first query and continues until exactly $\lceil m/2 \rceil$ unique functions have been queried. The second round begins with the next query, and continues until exactly $\lceil m/2 \rceil$ more distinct components have been queried in the second round, and so on until the algorithm terminates. This definition is useful for analysis but requires no assumptions about the algorithm's querying strategy.

4.1 Non-Smooth Components

We begin by presenting a lower bound for deterministic optimization of (1) when each component $f_i$ is convex and $L$-Lipschitz continuous, but is not necessarily strongly convex, on the domain $\mathcal{X} = \{x : \|x\| \le B\}$. Without loss of generality, we can consider $L = B = 1$. We will construct functions of the following form:

$$f_i(x) = \frac{1}{\sqrt{2}}\,|b - \langle x, v_0 \rangle| + \frac{1}{2\sqrt{k}} \sum_{r=1}^{k} \delta_{i,r}\,|\langle x, v_{r-1} \rangle - \langle x, v_r \rangle|. \qquad (6)$$

where $k = \lfloor \frac{1}{12\epsilon} \rfloor$, $b = \frac{1}{\sqrt{k+1}}$, and $\{v_r\}$ is an orthonormal set of vectors in $\mathbb{R}^d$ chosen according to the behavior of the algorithm such that $v_r$ is orthogonal to all points at which the algorithm queries $h_F$ before round $r$, and where $\delta_{i,r}$ are indicators chosen so that $\delta_{i,r} = 1$ if the algorithm does not query component $i$ in round $r$ (and zero otherwise). To see how this is possible, consider the following truncations of (6):

$$f_i^t(x) = \frac{1}{\sqrt{2}}\,|b - \langle x, v_0 \rangle| + \frac{1}{2\sqrt{k}} \sum_{r=1}^{t-1} \delta_{i,r}\,|\langle x, v_{r-1} \rangle - \langle x, v_r \rangle| \qquad (7)$$

During each round $t$, the adversary answers queries according to $f_i^t$, which depends only on $v_r$, $\delta_{i,r}$ for $r < t$, i.e. from previous rounds.
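The "round" bookkeeping defined above is purely a way of grouping an arbitrary query sequence; a minimal sketch of that grouping (our own illustrative code, not from the paper):

```python
import math

def split_into_rounds(queries, m):
    """Partition a sequence of queried component indices into 'rounds':
    a round ends once ceil(m/2) distinct components have been queried
    within it, regardless of the querying strategy."""
    need = math.ceil(m / 2)
    rounds, current, seen = [], [], set()
    for i in queries:
        current.append(i)
        seen.add(i)
        if len(seen) == need:
            rounds.append(current)
            current, seen = [], set()
    if current:
        rounds.append(current)  # possibly incomplete final round
    return rounds
```

For instance, with $m = 4$ (so $\lceil m/2 \rceil = 2$) the query sequence 0, 0, 1, 2, 2, 3 splits into two rounds, [0, 0, 1] and [2, 2, 3].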
When the round is completed, $\delta_{i,t}$ is determined and $v_t$ is chosen to be orthogonal to the vectors $\{v_0, \ldots, v_{t-1}\}$ as well as every point queried by the algorithm so far, thus defining $f_i^{t+1}$ for the next round. In Appendix B.1 we prove that these responses based on $f_i^t$ are consistent with $f_i$.

The algorithm can only learn $v_r$ after it completes round $r$; until then every iterate is orthogonal to it by construction. The average of these functions reaches its minimum of $F(x^*) = 0$ at $x^* = b\sum_{r=0}^{k} v_r$, so we can view optimizing these functions as the task of discovering the vectors $v_r$: even if only $v_k$ is missing, a suboptimality better than $b/(6\sqrt{k}) > \epsilon$ cannot be achieved. Therefore, the deterministic algorithm must complete at least $k$ rounds of optimization, each comprising at least $\lceil m/2 \rceil$ queries to $h_F$, in order to optimize $F$. The key to this construction is that even though each term $|\langle x, v_{r-1} \rangle - \langle x, v_r \rangle|$ appears in $m/2$ components, and hence has a strong effect on the average $F(x)$, we can force a deterministic algorithm to make $\Omega(m)$ queries during each round before it finds the next relevant term. We obtain (for complete proof see Appendix B.1):

Theorem 1. For any $L, B > 0$, any $0 < \epsilon < \frac{LB}{12}$, any $m \ge 2$, and any deterministic algorithm $A$ with access to $h_F$, there exists a dimension $d = O\big(\frac{mLB}{\epsilon}\big)$, and $m$ functions $f_i$ defined over $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\| \le B\}$, which are convex and $L$-Lipschitz continuous, such that in order to find a point $\hat{x}$ for which $F(\hat{x}) - F(x^*) < \epsilon$, $A$ must make $\Omega\big(\frac{mLB}{\epsilon}\big)$ queries to $h_F$.

Furthermore, we can always reduce optimizing a function over $\|x\| \le B$ to optimizing a strongly convex function by adding the regularizer $\epsilon\|x\|^2/(2B^2)$ to each component, implying (see complete proof in Appendix B.2):

Theorem 2.
For any $L, \lambda > 0$, any $0 < \epsilon < \frac{L^2}{288\lambda}$, any $m \ge 2$, and any deterministic algorithm $A$ with access to $h_F$, there exists a dimension $d = O\big(\frac{mL}{\sqrt{\lambda\epsilon}}\big)$, and $m$ functions $f_i$ defined over $\mathcal{X} \subseteq \mathbb{R}^d$, which are $L$-Lipschitz continuous and $\lambda$-strongly convex, such that in order to find a point $\hat{x}$ for which $F(\hat{x}) - F(x^*) < \epsilon$, $A$ must make $\Omega\big(\frac{mL}{\sqrt{\lambda\epsilon}}\big)$ queries to $h_F$.

4.2 Smooth Components

When the components $f_i$ are required to be smooth, the lower bound construction is similar to (6), except it is based on squared differences instead of absolute differences. We consider the functions:

$$f_i(x) = \frac{\beta}{8}\Big( \delta_{i,1}\big(\langle x, v_0 \rangle^2 - 2a\langle x, v_0 \rangle\big) + \delta_{i,k}\,\langle x, v_k \rangle^2 + \sum_{r=1}^{k} \delta_{i,r}\big(\langle x, v_{r-1} \rangle - \langle x, v_r \rangle\big)^2 \Big) \qquad (8)$$

where $\delta_{i,r}$ and $v_r$ are as before. Again, we can answer queries at round $t$ based only on $\delta_{i,r}$, $v_r$ for $r < t$. This construction yields the following lower bounds (full details in Appendix B.3):

Theorem 3. For any $\beta, B, \epsilon > 0$, any $m \ge 2$, and any deterministic algorithm $A$ with access to $h_F$, there exists a sufficiently large dimension $d = O\big(m\sqrt{\beta B^2/\epsilon}\big)$, and $m$ functions $f_i$ defined over $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\| \le B\}$, which are convex and $\beta$-smooth, such that in order to find a point $\hat{x} \in \mathbb{R}^d$ for which $F(\hat{x}) - F(x^*) < \epsilon$, $A$ must make $\Omega\big(m\sqrt{\beta B^2/\epsilon}\big)$ queries to $h_F$.

In the strongly convex case, we use a very similar construction, adding the term $\frac{\lambda}{2}\|x\|^2$, which gives the following bound (see Appendix B.4):

Theorem 4.
For any $\beta, \lambda > 0$ such that $\frac{\beta}{\lambda} > 73$, any $\epsilon > 0$, any $\epsilon_0 > 3\epsilon$, any $m \ge 2$, and any deterministic algorithm $A$ with access to $h_F$, there exists a sufficiently large dimension $d = O\big(m\sqrt{\frac{\beta}{\lambda}}\log\big(\frac{\epsilon_0}{\epsilon}\big)\big)$, and $m$ functions $f_i$ defined over $\mathcal{X} \subseteq \mathbb{R}^d$, which are $\beta$-smooth and $\lambda$-strongly convex and where $F(0) - F(x^*) = \epsilon_0$, such that in order to find a point $\hat{x}$ for which $F(\hat{x}) - F(x^*) < \epsilon$, $A$ must make $\Omega\big(m\sqrt{\frac{\beta}{\lambda}}\log\big(\frac{\epsilon_0}{\epsilon}\big)\big)$ queries to $h_F$.

5 Lower Bounds for Randomized Algorithms

We now turn to randomized algorithms for (1). In the deterministic constructions, we relied on being able to set $v_r$ and $\delta_{i,r}$ based on the predictable behavior of the algorithm. This is impossible for randomized algorithms: we must choose the "hard" function before we know the random choices the algorithm will make, so the function must be "hard" more generally than before.

Previously, we chose vectors $v_r$ orthogonal to all previous queries made by the algorithm. For randomized algorithms this cannot be ensured. However, if we choose orthonormal vectors $v_r$ randomly in a high dimensional space, they will be nearly orthogonal to queries with high probability. Slightly modifying the absolute or squared difference from before makes near orthogonality sufficient. This issue increases the required dimension but does not otherwise affect the lower bounds.

More problematic is our inability to anticipate the order in which the algorithm will query the components, precluding the use of $\delta_{i,r}$. In the deterministic setting, if a term revealing a new $v_r$ appeared in half of the components, we could ensure that the algorithm must make $m/2$ queries to find it. However, a randomized algorithm could find it in two queries in expectation, which would eliminate the linear dependence on $m$ in the lower bound!
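The expectations invoked here follow from the geometric distribution: querying components uniformly at random, a term present in $s$ of the $m$ components is found after $m/s$ queries in expectation, so $s = m/2$ gives 2 queries while $s = 1$ gives $m$. A quick Monte Carlo check (our own sketch, not from the paper):

```python
import random

def expected_queries_to_find(m, s, trials=20000, seed=0):
    """Monte Carlo estimate of how many uniform component queries are
    needed to hit one of s 'marked' components out of m. The count is
    geometric with success probability s/m, so the mean is m/s."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        draws = 0
        while True:
            draws += 1
            if rng.randrange(m) < s:  # components 0..s-1 are marked
                break
        total += draws
    return total / trials
```

With $m = 100$, marking half the components gives an estimate near 2, while marking a single component gives an estimate near 100.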
Alternatively, if only one component included the term, a randomized algorithm would indeed need $\Omega(m)$ queries to find it, but that term's effect on the suboptimality of $F$ would be scaled down by $m$, again eliminating the dependence on $m$.

To establish an $\Omega(\sqrt{m})$ lower bound for randomized algorithms we must take a new approach. We define $\lfloor m/2 \rfloor$ pairs of functions which operate on $\lfloor m/2 \rfloor$ orthogonal subspaces of $\mathbb{R}^d$. Each pair of functions resembles the constructions from the previous section, but since there are many of them, the algorithm must solve $\Omega(m)$ separate optimization problems in order to optimize $F$.

5.1 Lipschitz Continuous Components

First consider the non-smooth, non-strongly-convex setting and assume for simplicity $m$ is even (otherwise we simply let the last function be zero). We define the helper function $\psi_c$, which replaces the absolute value operation and makes our construction resistant to small inner products between iterates and not-yet-discovered components:

$$\psi_c(z) = \max(0, |z| - c) \qquad (9)$$

Next, we define $m/2$ pairs of functions, indexed by $i = 1..m/2$:

$$f_{i,1}(x) = \frac{1}{\sqrt{2}}\,|b - \langle x, v_{i,0} \rangle| + \frac{1}{2\sqrt{k}} \sum_{r\ \mathrm{even}}^{k} \psi_c\big(\langle x, v_{i,r-1} \rangle - \langle x, v_{i,r} \rangle\big)$$
$$f_{i,2}(x) = \frac{1}{2\sqrt{k}} \sum_{r\ \mathrm{odd}}^{k} \psi_c\big(\langle x, v_{i,r-1} \rangle - \langle x, v_{i,r} \rangle\big) \qquad (10)$$

where $\{v_{i,r}\}_{r=0..k,\,i=1..m/2}$ are random orthonormal vectors and $k = \Theta\big(\frac{1}{\epsilon\sqrt{m}}\big)$. With $c$ sufficiently small and the dimensionality sufficiently high, with high probability the algorithm only learns the identity of new vectors $v_{i,r}$ by alternately querying $f_{i,1}$ and $f_{i,2}$; so revealing all $k+1$ vectors requires at least $k+1$ total queries. Until $v_{i,k}$ is revealed, an iterate is $\Omega(\epsilon)$-suboptimal on $(f_{i,1} + f_{i,2})/2$.
From here, we show that an $\epsilon$-suboptimal solution to $F(x)$ can be found only after at least $k+1$ queries are made to at least $m/4$ pairs, for a total of $\Omega(mk)$ queries. This time, since the optimum $x^*$ will need to have inner product $b$ with $\Theta(mk)$ vectors $v_{i,r}$, we need to have $b = \Theta\big(\frac{1}{\sqrt{mk}}\big) = \Theta\big(\sqrt{\epsilon/\sqrt{m}}\big)$, and the total number of queries is $\Omega(mk) = \Omega\big(\frac{\sqrt{m}LB}{\epsilon}\big)$. The $\Omega(m)$ term of the lower bound follows trivially since we require $\epsilon = O(1/\sqrt{m})$ (proofs in Appendix C.1):

Theorem 5. For any $L, B > 0$, any $0 < \epsilon < \frac{LB}{10\sqrt{m}}$, any $m \ge 2$, and any randomized algorithm $A$ with access to $h_F$, there exists a dimension $d = O\big(\frac{L^4B^6}{\epsilon^4}\log\frac{LB}{\epsilon}\big)$, and $m$ functions $f_i$ defined over $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\| \le B\}$, which are convex and $L$-Lipschitz continuous, such that to find a point $\hat{x}$ for which $\mathbb{E}[F(\hat{x}) - F(x^*)] < \epsilon$, $A$ must make $\Omega\big(m + \frac{\sqrt{m}LB}{\epsilon}\big)$ queries to $h_F$.

An added regularizer gives the result for strongly convex functions (see Appendix C.2):

Theorem 6. For any $L, \lambda > 0$, any $0 < \epsilon < \frac{L^2}{200\lambda m}$, any $m \ge 2$, and any randomized algorithm $A$ with access to $h_F$, there exists a dimension $d = O\big(\frac{L^4}{\lambda^3\epsilon}\log\frac{L}{\sqrt{\lambda\epsilon}}\big)$, and $m$ functions $f_i$ defined over $\mathcal{X} \subseteq \mathbb{R}^d$, which are $L$-Lipschitz continuous and $\lambda$-strongly convex, such that in order to find a point $\hat{x}$ for which $\mathbb{E}[F(\hat{x}) - F(x^*)] < \epsilon$, $A$ must make $\Omega\big(m + \frac{\sqrt{m}L}{\sqrt{\lambda\epsilon}}\big)$ queries to $h_F$.

The large dimension required by these lower bounds is the cost of omitting the assumption that the algorithm's queries lie in the span of previous oracle responses. If we do assume that the queries lie in that span, the necessary dimension is only on the order of the number of oracle queries needed.

When $\epsilon = \Omega(LB/\sqrt{m})$ in the non-strongly convex case or $\epsilon = \Omega\big(L^2/(\lambda m)\big)$ in the strongly convex case, the lower bounds for randomized algorithms presented above do not apply. Instead, we can obtain a lower bound based on an information theoretic argument. We first uniformly randomly
We \ufb01rst uniformly randomly\nchoose a parameter p, which is either (1/2 2\u270f) or (1/2 + 2\u270f). Then for i = 1, ..., m, in the non-\nstrongly convex case we make fi(x) = x with probability p and fi(x) = x with probability 1 p.\nOptimizing F (x) to within \u270f accuracy then implies recovering the bias of the Bernoulli random\nvariable, which requires \u2326(1/\u270f2) queries based on a standard information theoretic result [2, 19].\n2 kxk2 gives a \u2326(1/(\u270f)) lower bound in the -strongly convex setting. This\nSetting fi(x) = \u00b1x + \nis formalized in Appendix C.5.\n\nL2\n\n\u270f\n\n5.2 Smooth Components\nWhen the functions fi are smooth and not strongly convex, we de\ufb01ne another helper function c:\n\nc(z) =8<:\n\n0\n2(|z| c)2\nz2 2c2\n\n|z|\uf8ff c\nc < |z|\uf8ff 2c\n|z| > 2c\n\nand the following pairs of functions for i = 1, ..., m/2:\n\n(11)\n\n(12)\n\nfi,1(x) =\n\nfi,2(x) =\n\n1\n\n16\u2713hx, vi,0i2 2ahx, vi,0i +\n16\u2713c (hx, vi,ki) +\n\n1\n\nkXr odd\n\nc (hx, vi,r1i hx, vi,ri)\u25c6\n\nkXr even\n\nc (hx, vi,r1i hx, vi,ri)\u25c6\n\nwith vi,r as before. The same arguments apply, after replacing the absolute difference with squared\ndifference. A separate argument is required in this case for the \u2326(m) term in the bound, which we\nshow using a construction involving m simple linear functions (see Appendix C.3).\nTheorem 7. For any , B, \u270f > 0, any m 2, and any randomized algorithm A with access to hF ,\nthere exists a suf\ufb01ciently large dimension d = O\u21e3 2B6\n\u270f \u2318 + B2m log m\u2318 and m functions\nfi de\ufb01ned over X =x 2 Rd : kxk \uf8ff B , which are convex and -smooth, such that to \ufb01nd a point\n\u02c6x 2 Rd for which E [F (\u02c6x) F (x\u21e4)] <\u270f , A must make \u2326\u2713m +q mB2\n\n\u270f \u25c6 queries to hF .\n\nlog\u21e3 B 2\n\n\u270f2\n\n7\n\n\fIn the strongly convex case, we add the term kxk2 /2 to fi,1 and fi,2 (see Appendix C.4) to obtain:\nTheorem 8. 
For any $m \ge 2$, any $\beta, \lambda > 0$ such that $\frac{\beta}{\lambda} > 161m$, any $\epsilon > 0$, any $\epsilon_0 > 60\epsilon\sqrt{\frac{\beta}{\lambda m}}$, and any randomized algorithm $A$, there exists a dimension $d = O\big(\frac{\beta^{2.5}\epsilon_0}{\lambda^{2.5}\epsilon}\log^3\frac{\epsilon_0}{\epsilon} + \frac{m\epsilon_0}{\epsilon}\sqrt{\frac{m\beta}{\lambda}}\log m\big)$, domain $\mathcal{X} \subseteq \mathbb{R}^d$, $x_0 \in \mathcal{X}$, and $m$ functions $f_i$ defined on $\mathcal{X}$ which are $\beta$-smooth and $\lambda$-strongly convex, and such that $F(x_0) - F(x^*) = \epsilon_0$ and such that in order to find a point $\hat{x} \in \mathcal{X}$ such that $\mathbb{E}[F(\hat{x}) - F(x^*)] < \epsilon$, $A$ must make $\Omega\big(m + \sqrt{\frac{m\beta}{\lambda}}\log\frac{\epsilon_0}{\epsilon}\big)$ queries to $h_F$.

Remark: We consider (1) as a constrained optimization problem, thus the minimizer of $F$ could be achieved on the boundary of $\mathcal{X}$, meaning that the gradient need not vanish. If we make the additional assumption that the minimizer of $F$ lies on the interior of $\mathcal{X}$ (and is thus the unconstrained global minimum), Theorems 1-8 all still apply, with a slight modification to Theorems 3 and 7. Since the gradient now needs to vanish on $\mathcal{X}$, 0 is always $O(\beta B^2)$-suboptimal, and only values of $\epsilon$ in the range $0 < \epsilon < \frac{\beta B^2}{128}$ and $0 < \epsilon < \frac{9\beta B^2}{128}$ result in a non-trivial lower bound (see Remarks at the end of Appendices B.3 and C.3).

6 Conclusion

We provide a tight (up to a log factor) understanding of optimizing finite sum problems of the form (1) using a component prox oracle.

Randomized optimization of (1) has been the subject of much research in the past several years, starting with the presentation of SDCA and SAG, and continuing with accelerated variants. Obtaining lower bounds can be very useful for better understanding the problem, for knowing where it might or might not be possible to improve or where different assumptions would be needed to improve, and for establishing optimality of optimization methods. Indeed, several attempts have been made at lower bounds for the finite sum setting [1, 9].
But as we explain in the introduction, these were unsatisfactory and covered only limited classes of methods. Here we show that in a fairly general sense, accelerated SDCA, SVRG, SAG, and KATYUSHA are optimal up to a log factor. Improving on their runtime would require additional assumptions, or perhaps a stronger oracle. However, even if given "full" access to the component functions, all algorithms that we can think of utilize this information to calculate a prox vector. Thus, it is unclear what realistic oracle would be more powerful than the prox oracle we consider.

Our results highlight the power of randomization, showing that no deterministic algorithm can beat the linear dependence on $m$ and reduce it to the $\sqrt{m}$ dependence of the randomized algorithms. The deterministic algorithm for non-smooth problems that we present in Section 3 is also of interest in its own right. It avoids randomization, which is not usually problematic, but makes it fully parallelizable, unlike the optimal stochastic methods. Consider, for example, a supervised learning problem where $f_i(x) = \ell(\langle \phi_i, x \rangle, y_i)$ is the (non-smooth) loss on a single training example $(\phi_i, y_i)$, and the data is distributed across machines. Calculating a prox oracle involves applying the Fenchel conjugate of the loss function $\ell$, but even if a closed form is not available, this is often easy to compute numerically, and is used in algorithms such as SDCA. But unlike SDCA, which is inherently sequential, we can calculate all $m$ prox operations in parallel on the different machines, average the resulting gradients of the smoothed function, and take an accelerated gradient step to implement our optimal deterministic algorithm.
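A minimal single-process simulation of one such round, with 1-D components $f_i(u) = |u - a_i|$ standing in for per-machine losses, and a plain gradient step standing in for the accelerated step (all names and these simplifications are our own illustrative choices, not the paper's implementation):

```python
def prox_shifted_abs(x, a, beta):
    """Prox of f(u) = |u - a|: argmin_u |u - a| + (beta/2)*(x - u)**2."""
    z = x - a
    shrunk = max(abs(z) - 1.0 / beta, 0.0)
    return a + (shrunk if z >= 0 else -shrunk)

def smoothed_grad_step(x, shifts, beta, step):
    """One round of the deterministic scheme: each 'machine' computes its
    prox, we average the Moreau-envelope gradients beta*(x - prox), and
    take a gradient step (plain GD here instead of AGD, for brevity)."""
    grads = [beta * (x - prox_shifted_abs(x, a, beta)) for a in shifts]
    return x - step * sum(grads) / len(grads)

# Minimizing the average of |x - a_i| for a_i in {-1, 0, 1}: the median 0
# is a minimizer, and iterating the smoothed step drives x toward it.
x = 5.0
for _ in range(200):
    x = smoothed_grad_step(x, [-1.0, 0.0, 1.0], beta=10.0, step=0.05)
```

The prox calls are independent across components, so in a distributed deployment each machine would compute its own prox and only the scalar (in general, vector) gradients need to be averaged.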
This method attains a recent lower bound for distributed optimization, resolving a question raised by Arjevani and Shamir [5], and when the number of machines is very large it improves over all other known distributed optimization methods for the problem.

In studying finite sum problems, we were forced to explicitly study lower bounds for randomized optimization as opposed to stochastic optimization (where the source of randomness is the oracle, not the algorithm). Even for the classic problem of minimizing a smooth function using a first order oracle, we could not locate a published proof that applies to randomized algorithms. We provide a simple construction using ε-insensitive differences that allows us to easily obtain such lower bounds without resorting to assuming that the iterates are spanned by previous oracle responses (as was done, e.g., in [9]), and that could potentially be useful for establishing randomized lower bounds in other settings as well.

Acknowledgements: We thank Ohad Shamir for his helpful discussions and for pointing out [4].

References

[1] Alekh Agarwal and Léon Bottou. A lower bound for the optimization of finite sums. arXiv preprint arXiv:1410.0723, 2014.

[2] Alekh Agarwal, Martin J. Wainwright, Peter L. Bartlett, and Pradeep K. Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, pages 1–9, 2009.

[3] Zeyuan Allen-Zhu. Katyusha: The first truly accelerated stochastic gradient descent. arXiv preprint arXiv:1603.05953, 2016.

[4] Zeyuan Allen-Zhu and Elad Hazan. Optimal black-box reductions between optimization objectives. arXiv preprint arXiv:1603.05642, 2016.

[5] Yossi Arjevani and Ohad Shamir. Communication complexity of distributed convex learning and optimization.
In Advances in Neural Information Processing Systems, pages 1747–1755, 2015.

[6] Heinz H. Bauschke and Patrick L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer Science & Business Media, 2011.

[7] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[8] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[9] Guanghui Lan. An optimal randomized incremental gradient method. arXiv preprint arXiv:1507.02000, 2015.

[10] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3366–3374, 2015.

[11] Yu Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.

[12] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1983.

[13] Francesco Orabona, Andreas Argyriou, and Nathan Srebro. PRISMA: Proximal iterative smoothing algorithm. arXiv preprint arXiv:1206.2372, 2012.

[14] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, 2013.

[15] Shai Shalev-Shwartz. Stochastic optimization for machine learning. Slides of presentation at "Optimization Without Borders 2016", http://www.di.ens.fr/~aspremon/Houches/talks/Shai.pdf, 2016.

[16] Shai Shalev-Shwartz and Tong Zhang.
Stochastic dual coordinate ascent methods for regularized loss. The Journal of Machine Learning Research, 14(1):567–599, 2013.

[17] Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1-2):105–145, 2016.

[18] Ohad Shamir, Nathan Srebro, and Tong Zhang. Communication efficient distributed optimization using an approximate Newton-type method. arXiv preprint arXiv:1312.7853, 2013.

[19] Bin Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423–435. Springer, 1997.

[20] Yuchen Zhang and Lin Xiao. Communication-efficient distributed optimization of self-concordant empirical loss. arXiv preprint arXiv:1501.00263, 2015.