{"title": "A Universal Catalyst for First-Order Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 3384, "page_last": 3392, "abstract": "We introduce a generic scheme for accelerating first-order optimization methods in the sense of Nesterov, which builds upon a new analysis of the accelerated proximal point algorithm. Our approach consists of minimizing a convex objective by approximately solving a sequence of well-chosen auxiliary problems, leading to faster convergence. This strategy applies to a large class of algorithms, including gradient descent, block coordinate descent, SAG, SAGA, SDCA, SVRG, Finito/MISO, and their proximal variants. For all of these methods, we provide acceleration and explicit support for non-strongly convex objectives. In addition to theoretical speed-up, we also show that acceleration is useful in practice, especially for ill-conditioned problems where we measure significant improvements.", "full_text": "A Universal Catalyst for First-Order Optimization\n\nHongzhou Lin1, Julien Mairal1 and Zaid Harchaoui1,2\n\n1Inria\n\n2NYU\n\n{hongzhou.lin,julien.mairal}@inria.fr\n\nzaid.harchaoui@nyu.edu\n\nAbstract\n\nWe introduce a generic scheme for accelerating \ufb01rst-order optimization methods\nin the sense of Nesterov, which builds upon a new analysis of the accelerated prox-\nimal point algorithm. Our approach consists of minimizing a convex objective\nby approximately solving a sequence of well-chosen auxiliary problems, leading\nto faster convergence. This strategy applies to a large class of algorithms, in-\ncluding gradient descent, block coordinate descent, SAG, SAGA, SDCA, SVRG,\nFinito/MISO, and their proximal variants. For all of these methods, we provide\nacceleration and explicit support for non-strongly convex objectives. 
In addition to theoretical speed-up, we also show that acceleration is useful in practice, especially for ill-conditioned problems where we measure significant improvements.

1 Introduction

A large number of machine learning and signal processing problems are formulated as the minimization of a composite objective function F : ℝᵖ → ℝ:

    min_{x ∈ ℝᵖ} { F(x) ≜ f(x) + ψ(x) },                                            (1)

where f is convex and has Lipschitz continuous derivatives with constant L, and ψ is convex but may not be differentiable. The variable x represents model parameters and the role of f is to ensure that the estimated parameters fit some observed data. Specifically, f is often a large sum of functions

    f(x) ≜ (1/n) Σ_{i=1}^{n} f_i(x),                                                (2)

and each term f_i(x) measures the fit between x and a data point indexed by i. The function ψ in (1) acts as a regularizer; it is typically chosen to be the squared ℓ₂-norm, which is smooth, or to be a non-differentiable penalty such as the ℓ₁-norm or another sparsity-inducing norm [2]. Composite minimization also encompasses constrained minimization if we consider extended-valued indicator functions ψ that may take the value +∞ outside of a convex set C and 0 inside (see [11]).

Our goal is to accelerate gradient-based or first-order methods that are designed to solve (1), with a particular focus on large sums of functions (2). By "accelerating", we mean generalizing a mechanism invented by Nesterov [17] that improves the convergence rate of the gradient descent algorithm. More precisely, when ψ = 0, gradient descent steps produce iterates (x_k)_{k≥0} such that F(x_k) − F* = O(1/k), where F* denotes the minimum value of F.
Furthermore, when the objective F is strongly convex with constant μ, the rate of convergence becomes linear, in O((1 − μ/L)ᵏ). These rates were shown by Nesterov [16] to be suboptimal for the class of first-order methods; instead, optimal rates—O(1/k²) for the convex case and O((1 − √(μ/L))ᵏ) for the μ-strongly convex one—could be obtained by taking gradient steps at well-chosen points. Later, this acceleration technique was extended to deal with non-differentiable regularization functions ψ [4, 19].

For modern machine learning problems involving a large sum of n functions, a recent effort has been devoted to developing fast incremental algorithms [6, 7, 14, 24, 25, 27] that can exploit the particular structure of (2). Unlike full gradient approaches, which require computing and averaging n gradients ∇f(x) = (1/n) Σ_{i=1}^{n} ∇f_i(x) at every iteration, incremental techniques have a cost per iteration that is independent of n. The price to pay is the need to store a moderate amount of information regarding past iterates, but the benefit is significant in terms of computational complexity.

Main contributions. Our main achievement is a generic acceleration scheme that applies to a large class of optimization methods. By analogy with substances that increase chemical reaction rates, we call our approach a "catalyst". A method may be accelerated if it has a linear convergence rate for strongly convex problems. This is the case for full gradient [4, 19] and block coordinate descent methods [18, 21], which already have well-known accelerated variants. More importantly, it also applies to incremental algorithms such as SAG [24], SAGA [6], Finito/MISO [7, 14], SDCA [25], and SVRG [27]. Whether or not these methods could be accelerated was an important open question.
It was only known to be the case for dual coordinate ascent approaches such as SDCA [26] or SPDC [28] for strongly convex objectives. Our work provides a universal positive answer regardless of the strong convexity of the objective, which brings us to our second achievement.

Some approaches such as Finito/MISO, SDCA, or SVRG are only defined for strongly convex objectives. A classical trick to apply them to general convex functions is to add a small regularization ε‖x‖² [25]. The drawback of this strategy is that it requires choosing in advance the parameter ε, which is related to the target accuracy. A consequence of our work is to provide automatic, direct support for non-strongly convex objectives, thus removing the need to select ε beforehand.

Other contribution: Proximal MISO. The approach Finito/MISO, which was proposed in [7] and [14], is an incremental technique for solving smooth unconstrained μ-strongly convex problems when n is larger than a constant βL/μ (with β = 2 in [14]). In addition to providing acceleration and support for non-strongly convex objectives, we also make the following specific contributions:
• we extend the method and its convergence proof to deal with the composite problem (1);
• we fix the method to remove the "big data condition" n ≥ βL/μ.
The resulting algorithm can be interpreted as a variant of proximal SDCA [25] with a different step size and a more practical optimality certificate—that is, checking the optimality condition does not require evaluating a dual objective. Our construction is indeed purely primal: neither our proof of convergence nor the algorithm uses duality, while SDCA is originally a dual ascent technique.

Related work.
The catalyst acceleration can be interpreted as a variant of the proximal point algorithm [3, 9], which is a central concept in convex optimization, underlying augmented Lagrangian approaches and composite minimization schemes [5, 20]. The proximal point algorithm consists of solving (1) by minimizing a sequence of auxiliary problems involving a quadratic regularization term. In general, these auxiliary problems cannot be solved with perfect accuracy, and several notions of inexactness were proposed, including [9, 10, 22]. The catalyst approach hinges upon (i) an acceleration technique for the proximal point algorithm originally introduced in the pioneering work [9]; (ii) a more practical inexactness criterion than those proposed in the past.¹ As a result, we are able to control the rate of convergence for approximately solving the auxiliary problems with an optimization method M. In turn, we are also able to obtain the computational complexity of the global procedure for solving (1), which was not possible with previous analyses [9, 10, 22]. When instantiated in different first-order optimization settings, our analysis yields systematic acceleration.

Beyond [9], several works have inspired this paper. In particular, accelerated SDCA [26] is an instance of an inexact accelerated proximal point algorithm, even though this was not explicitly stated in [26]. Their proof of convergence relies on different tools than ours. Specifically, we use the concept of estimate sequence from Nesterov [17], whereas the direct proof of [26], in the context of SDCA, does not extend to non-strongly convex objectives. Nevertheless, part of their analysis proves to be helpful to obtain our main results. Another useful methodological contribution was the convergence analysis of inexact proximal gradient methods of [23]. Finally, similar ideas appear in the independent work [8].
Their results overlap in part with ours, but the two papers adopt different directions. Our analysis is, for instance, more general and provides support for non-strongly convex objectives. Another independent work with related results is [13], which introduces an accelerated method for the minimization of finite sums that is not based on the proximal point algorithm.

¹ Note that our inexactness criterion was also studied, among others, in [22], but the analysis of [22] led to the conjecture that this criterion was too weak to warrant acceleration. Our analysis refutes this conjecture.

2 The Catalyst Acceleration

We present here our generic acceleration scheme, which can operate on any first-order or gradient-based optimization algorithm with a linear convergence rate for strongly convex objectives.

Linear convergence and acceleration. Consider the problem (1) with a μ-strongly convex function F, where the strong convexity is defined with respect to the ℓ₂-norm. A minimization algorithm M, generating the sequence of iterates (x_k)_{k≥0}, has a linear convergence rate if there exist τ_{M,F} in (0, 1) and a constant C_{M,F} in ℝ such that

    F(x_k) − F* ≤ C_{M,F} (1 − τ_{M,F})ᵏ,                                           (3)

where F* denotes the minimum value of F. The quantity τ_{M,F} controls the convergence rate: the larger τ_{M,F}, the faster the convergence to F*. However, for a given algorithm M, the quantity τ_{M,F} usually depends on the ratio L/μ, which is often called the condition number of F.

The catalyst acceleration is a general approach that allows us to wrap an algorithm M into an accelerated algorithm A, which enjoys a faster linear convergence rate, with τ_{A,F} ≥ τ_{M,F}. As we will also see, the catalyst acceleration may be useful even when F is not strongly convex—that is, when μ = 0.
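Definition (3) is easy to probe empirically. The sketch below is our own illustration, not from the paper: it runs plain gradient descent on a synthetic strongly convex quadratic (all problem data is arbitrary) and checks that the suboptimality gaps indeed satisfy (3) with τ_{M,F} = μ/L.

```python
import numpy as np

# Empirical check of the linear-rate definition (3) for M = gradient descent
# on a strongly convex quadratic F(x) = 0.5 x'Ax - b'x. Data is illustrative.
rng = np.random.default_rng(0)
p = 20
A = np.diag(np.linspace(0.1, 10.0, p))     # eigenvalues in [mu, L] = [0.1, 10]
b = rng.standard_normal(p)
mu, L = 0.1, 10.0

F = lambda x: 0.5 * x @ A @ x - b @ x
x_star = np.linalg.solve(A, b)
F_star = F(x_star)

x = np.zeros(p)
gaps = []
for k in range(200):
    gaps.append(F(x) - F_star)
    x = x - (1.0 / L) * (A @ x - b)        # gradient step with step size 1/L

# Gradient descent satisfies (3) with tau_{M,F} = mu/L and C_{M,F} = F(x0) - F*.
tau = mu / L
ok = all(gaps[k] <= gaps[0] * (1 - tau) ** k + 1e-12 for k in range(len(gaps)))
print(ok)  # True
```

Fitting the slope of log(F(x_k) − F*) in such a run is also a simple way to estimate τ_{M,F} for a black-box method.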
In that case, we may even consider a method M that requires strong convexity to operate, and obtain an accelerated algorithm A that can minimize F with the near-optimal convergence rate Õ(1/k²).²

Our approach can accelerate a wide range of first-order optimization algorithms, starting from classical gradient descent. It also applies to randomized algorithms such as SAG, SAGA, SDCA, SVRG and Finito/MISO, whose rates of convergence are given in expectation. Such methods should be contrasted with stochastic gradient methods [15, 12], which minimize a different, non-deterministic function. Acceleration of stochastic gradient methods is beyond the scope of this work.

Catalyst action. We now highlight the mechanics of the catalyst algorithm, which is presented in Algorithm 1. It consists of replacing, at iteration k, the original objective function F by an auxiliary objective G_k, close to F up to a quadratic term:

    G_k(x) ≜ F(x) + (κ/2)‖x − y_{k−1}‖²,                                            (4)

where κ will be specified later and y_{k−1} is obtained by an extrapolation step described in (6). Then, at iteration k, the accelerated algorithm A minimizes G_k up to accuracy ε_k.

Substituting (4) for (1) has two consequences. On the one hand, minimizing (4) only provides an approximation of the solution of (1), unless κ = 0; on the other hand, the auxiliary objective G_k enjoys a better condition number than the original objective F, which makes it easier to minimize. For instance, when M is the regular gradient descent algorithm with ψ = 0, M has the rate of convergence (3) for minimizing F with τ_{M,F} = μ/L. However, owing to the additional quadratic term, G_k can be minimized by M with the rate (3) where τ_{M,G_k} = (μ + κ)/(L + κ) > τ_{M,F}.
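The conditioning gain is easy to quantify. A small numerical illustration (the values of L, μ, and κ below are arbitrary, not from the paper):

```python
# The auxiliary objective G_k in (4) is (L + kappa)-smooth and
# (mu + kappa)-strongly convex, so its condition number shrinks as kappa grows.
L, mu = 1e4, 1e-2                        # an ill-conditioned problem: L/mu = 1e6
for kappa in [0.0, 1.0, 100.0]:
    cond_G = (L + kappa) / (mu + kappa)  # condition number of G_k
    tau_G = (mu + kappa) / (L + kappa)   # rate of gradient descent on G_k
    print(kappa, cond_G, tau_G)
# kappa = 0 recovers the original conditioning; larger kappa makes G_k easier
# to minimize, but a worse proxy for F.
```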
In practice, there exists an "optimal" choice for κ, which controls both the time required by M for solving the auxiliary problems (4) and the quality of the approximation of F by the functions G_k. This choice will be driven by the convergence analysis in Sec. 3.1–3.3; see also Appendix C for special cases.

Acceleration via extrapolation and inexact minimization. Similar to the classical gradient descent scheme of Nesterov [17], Algorithm 1 involves an extrapolation step (6). As a consequence, the solution of the auxiliary problem (5) at iteration k + 1 is driven towards the extrapolated variable y_k. As shown in [9], this step is in fact sufficient to reduce the number of iterations of Algorithm 1 needed to solve (1) when ε_k = 0—that is, when running the exact accelerated proximal point algorithm.

Nevertheless, to control the total computational complexity of an accelerated algorithm A, it is necessary to take into account the complexity of solving the auxiliary problems (5) using M. This is where our approach differs from the classical proximal point algorithm of [9]. Essentially, both algorithms are the same, but we use the weaker inexactness criterion G_k(x_k) − G_k* ≤ ε_k, where the sequence (ε_k)_{k≥0} is fixed beforehand and only depends on the initial point. This subtle difference has important consequences: (i) in practice, this condition can often be checked by computing duality gaps; (ii) in theory, the methods M we consider have linear convergence rates, which allows us to control the complexity of step (5), and then to provide the computational complexity of A.

² In this paper, we use the notation O(·) to hide constants. The notation Õ(·)
also hides logarithmic factors.

Algorithm 1 Catalyst
input: initial estimate x₀ ∈ ℝᵖ, parameters κ and α₀, sequence (ε_k)_{k≥0}, optimization method M;
1: Initialize q = μ/(μ + κ) and y₀ = x₀;
2: while the desired stopping criterion is not satisfied do
3:    Find an approximate solution of the following problem using M:

          x_k ≈ argmin_{x ∈ ℝᵖ} { G_k(x) ≜ F(x) + (κ/2)‖x − y_{k−1}‖² }  such that  G_k(x_k) − G_k* ≤ ε_k.   (5)

4:    Compute α_k ∈ (0, 1) from the equation α_k² = (1 − α_k)α_{k−1}² + q α_k;
5:    Compute

          y_k = x_k + β_k(x_k − x_{k−1})  with  β_k = α_{k−1}(1 − α_{k−1}) / (α_{k−1}² + α_k).   (6)

6: end while
output: x_k (final estimate).

3 Convergence Analysis

In this section, we present the theoretical properties of Algorithm 1 for optimization methods M with deterministic convergence rates of the form (3). When the rate is given in expectation, a simple extension of our analysis, described in Section 4, is needed. For reasons of space, we shall sketch the proof mechanics here and defer the full proofs to Appendix B.

3.1 Analysis for μ-Strongly Convex Objective Functions

We first analyze the convergence rate of Algorithm 1 for solving problem (1), regardless of the complexity required to solve the subproblems (5).
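To make Algorithm 1 concrete, here is a minimal sketch in code. It is our own illustration, not the authors' implementation: it assumes a smooth objective (ψ = 0), uses plain gradient descent as the method M, replaces the stopping criterion G_k(x_k) − G_k* ≤ ε_k by a fixed inner iteration budget, and runs on a synthetic quadratic.

```python
import numpy as np

def catalyst(grad_F, x0, mu, L, kappa, n_outer=150, n_inner=25):
    """Sketch of Algorithm 1 with gradient descent as the inner method M.

    The auxiliary problem (5) minimizes G_k(x) = F(x) + (kappa/2)||x - y||^2,
    whose gradient is grad_F(x) + kappa*(x - y); G_k is (L + kappa)-smooth.
    """
    q = mu / (mu + kappa)
    alpha = np.sqrt(q)                       # alpha_0 = sqrt(q), as in Theorem 3.1
    x = x_prev = np.asarray(x0, dtype=float)
    y = x.copy()
    step = 1.0 / (L + kappa)
    for _ in range(n_outer):
        z = x.copy()                         # warm start at x_{k-1} (cf. Prop. 3.2)
        for _ in range(n_inner):             # inexact solve of (5), fixed budget
            z = z - step * (grad_F(z) + kappa * (z - y))
        x_prev, x = x, z
        # alpha_k solves alpha^2 = (1 - alpha)*alpha_{k-1}^2 + q*alpha (step 4)
        a2 = alpha * alpha
        alpha_next = 0.5 * (q - a2 + np.sqrt((a2 - q) ** 2 + 4 * a2))
        beta = alpha * (1 - alpha) / (a2 + alpha_next)   # extrapolation (6)
        y = x + beta * (x - x_prev)
        alpha = alpha_next
    return x

# Illustrative use on an ill-conditioned quadratic (synthetic data):
A = np.diag(np.linspace(1e-2, 1.0, 50))
b = np.ones(50)
mu, L = 1e-2, 1.0
F = lambda x: 0.5 * x @ A @ x - b @ x
x_hat = catalyst(lambda x: A @ x - b, np.zeros(50), mu, L, kappa=L - 2 * mu)
```

With κ = L − 2μ (the choice derived for gradient descent in Section 4), each auxiliary problem has condition number close to 2, so a handful of inner steps solve (5) nearly exactly, and the outer loop contracts at the accelerated rate governed by √q.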
We start with the μ-strongly convex case.

Theorem 3.1 (Convergence of Algorithm 1, μ-Strongly Convex Case).
Choose α₀ = √q with q = μ/(μ + κ) and

    ε_k = (2/9)(F(x₀) − F*)(1 − ρ)ᵏ  with  ρ < √q.

Then, Algorithm 1 generates iterates (x_k)_{k≥0} such that

    F(x_k) − F* ≤ C(1 − ρ)^{k+1}(F(x₀) − F*)  with  C = 8/(√q − ρ)².               (7)

This theorem characterizes the linear convergence rate of Algorithm 1. It is worth noting that the choice of ρ is left to the discretion of the user, but it can safely be set to ρ = 0.9√q in practice. The choice α₀ = √q was made for convenience since it leads to a simplified analysis, but larger values are also acceptable, from both theoretical and practical points of view. Following advice from Nesterov [17, page 81] originally dedicated to his classical gradient descent algorithm, we may for instance recommend choosing α₀ such that α₀² + (1 − q)α₀ − 1 = 0.

The choice of the sequence (ε_k)_{k≥0} is also subject to discussion, since the quantity F(x₀) − F* is unknown beforehand. Nevertheless, an upper bound may be used instead, which only affects the corresponding constant in (7). Such upper bounds can typically be obtained by computing a duality gap at x₀, or by using additional knowledge about the objective. For instance, when F is non-negative, we may simply choose ε_k = (2/9)F(x₀)(1 − ρ)ᵏ.

The proof of convergence uses the concept of estimate sequence invented by Nesterov [17], and introduces an extension to deal with the errors (ε_k)_{k≥0}. To control the accumulation of errors, we borrow the methodology of [23] for inexact proximal gradient algorithms. Our construction yields a convergence result that encompasses both strongly convex and non-strongly convex cases.
Note that estimate sequences were also used in [9], but, as noted by [22], the proof of [9] only applies when the extrapolation step (6) involves the true minimizer of (5), which is unknown in practice. To obtain a rigorous convergence result like (7), a different approach was needed.

Theorem 3.1 is important, but it does not yet provide the global computational complexity of the full algorithm, which includes the number of iterations performed by M for approximately solving the auxiliary problems (5). The next proposition characterizes the complexity of this inner loop.

Proposition 3.2 (Inner-Loop Complexity, μ-Strongly Convex Case).
Under the assumptions of Theorem 3.1, let us consider a method M generating iterates (z_t)_{t≥0} for minimizing the function G_k with a linear convergence rate of the form

    G_k(z_t) − G_k* ≤ A(1 − τ_M)ᵗ (G_k(z₀) − G_k*).                                 (8)

When z₀ = x_{k−1}, the precision ε_k is reached with a number of iterations T_M = Õ(1/τ_M), where the notation Õ hides some universal constants and some logarithmic dependencies in μ and κ.

This proposition is generic, since the assumption (8) is relatively standard for gradient-based methods [17]. It may now be used to obtain the global rate of convergence of an accelerated algorithm. By calling F_s the objective function value obtained after performing s = kT_M iterations of the method M, the true convergence rate of the accelerated algorithm A is

    F_s − F* = F(x_{s/T_M}) − F* ≤ C(1 − ρ)^{s/T_M}(F(x₀) − F*) ≤ C(1 − ρ/T_M)ˢ (F(x₀) − F*).   (9)

As a result, algorithm A has a global linear rate of convergence with parameter

    τ_{A,F} = ρ/T_M = Õ(τ_M √μ / √(μ + κ)),

where τ_M typically depends on κ (the greater, the faster M is).
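As a worked example, instantiating these quantities with M = gradient descent, for which τ_{M,G_k} = (μ + κ)/(L + κ), and with the choice κ = L − 2μ used later in Section 4, gives the following (our own sanity check, not a statement from the paper):

```latex
\tau_{\mathcal{M}} \;=\; \frac{\mu+\kappa}{L+\kappa}\,\Big|_{\kappa = L-2\mu}
  \;=\; \frac{L-\mu}{2(L-\mu)} \;=\; \frac{1}{2}
  \quad\Longrightarrow\quad T_{\mathcal{M}} = \tilde{O}(1),
\qquad
\tau_{\mathcal{A},F} \;=\; \tilde{O}\!\Big(\sqrt{\tfrac{\mu}{\mu+\kappa}}\Big)
  \;=\; \tilde{O}\!\Big(\sqrt{\tfrac{\mu}{L-\mu}}\Big).
```

Since each iteration of M then costs n gradient evaluations for the finite sum (2), this recovers a total complexity of Õ(n√(L/μ) log(1/ε)), matching the entry for FG in Table 1.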
Consequently, κ will be chosen to maximize the ratio τ_M/√(μ + κ). Note that for other algorithms M that do not satisfy (8), additional analysis and possibly a different initialization z₀ may be necessary (see Appendix D for an example).

3.2 Convergence Analysis for Convex but Non-Strongly Convex Objective Functions

We now state the convergence rate when the objective is not strongly convex, that is, when μ = 0.

Theorem 3.3 (Convergence of Algorithm 1, Convex but Non-Strongly Convex Case).
When μ = 0, choose α₀ = (√5 − 1)/2 and

    ε_k = 2(F(x₀) − F*) / (9(k + 2)^{4+η})  with  η > 0.                            (10)

Then, Algorithm 1 generates iterates (x_k)_{k≥0} such that

    F(x_k) − F* ≤ (8/(k + 2)²) [ (1 + 2/η)² (F(x₀) − F*) + (κ/2)‖x₀ − x*‖² ].      (11)

This theorem is the counterpart of Theorem 3.1 when μ = 0. The choice of η is left to the discretion of the user; empirically, it seems to have very little influence on the global convergence speed, as long as it is chosen small enough (e.g., we use η = 0.1 in practice). The theorem shows that Algorithm 1 achieves the optimal rate of convergence of first-order methods, but it does not take into account the complexity of solving the subproblems (5). Therefore, we need the following proposition:

Proposition 3.4 (Inner-Loop Complexity, Non-Strongly Convex Case).
Assume that F has bounded level sets. Under the assumptions of Theorem 3.3, let us consider a method M generating iterates (z_t)_{t≥0} for minimizing the function G_k with a linear convergence rate of the form (8).
Then, there exists T_M = Õ(1/τ_M) such that for any k ≥ 1, solving G_k with initial point x_{k−1} requires at most T_M log(k + 2) iterations of M.

We can now derive the global complexity of an accelerated algorithm A when M has a linear convergence rate (8) for κ-strongly convex objectives. To produce x_k, M is called at most kT_M log(k + 2) times. Using the global iteration counter s = kT_M log(k + 2), we get

    F_s − F* ≤ (8T_M² log²(s) / s²) [ (1 + 2/η)² (F(x₀) − F*) + (κ/2)‖x₀ − x*‖² ].   (12)

If M is a first-order method, this rate is near-optimal, up to a logarithmic factor, when compared to the optimal rate O(1/s²); this may be the price to pay for using a generic acceleration scheme.

4 Acceleration in Practice

We show here how to accelerate existing algorithms M and compare the convergence rates obtained before and after catalyst acceleration. For all the algorithms we consider, we study rates of convergence in terms of the total number of iterations (in expectation, when necessary) needed to reach accuracy ε. We first show how to accelerate full gradient and randomized coordinate descent algorithms [21]. Then, we discuss other approaches such as SAG [24], SAGA [6], or SVRG [27]. Finally, we present a new proximal version of the incremental gradient approaches Finito/MISO [7, 14], along with its accelerated version. Table 1 summarizes the acceleration obtained for the algorithms considered.

Deriving the global rate of convergence. The convergence rate of an accelerated algorithm A is driven by the parameter κ. In the strongly convex case, the best choice is the one that maximizes the ratio τ_{M,G_k}/√(μ + κ).
As discussed in Appendix C, this rule also holds when (8) is given in expectation, and in many cases where the constant C_{M,G_k} differs from the quantity A(G_k(z₀) − G_k*) in (8). When μ = 0, the choice of κ > 0 only affects the complexity by a multiplicative constant. A rule of thumb is then to maximize the ratio τ_{M,G_k}/√(L + κ) (see Appendix C for more details).

After choosing κ, the global iteration complexity is given by Comp ≤ k_in k_out, where k_in is an upper bound on the number of iterations performed by M per inner loop, and k_out is the upper bound on the number of outer-loop iterations, following from Theorems 3.1–3.3. Note that for simplicity, we always consider that L ≫ μ, such that we may write L − μ simply as "L" in the convergence rates.

4.1 Acceleration of Existing Algorithms

Composite minimization. Most of the algorithms we consider here, namely the proximal gradient method [4, 19], SAGA [6], and (Prox-)SVRG [27], can handle composite objectives with a regularization penalty ψ that admits a proximal operator prox_ψ, defined for any z as

    prox_ψ(z) ≜ argmin_{y ∈ ℝᵖ} { ψ(y) + (1/2)‖y − z‖² }.

Table 1 presents convergence rates that are valid for proximal and non-proximal settings, since most methods we consider are able to deal with such non-differentiable penalties. The exception is SAG [24], for which proximal variants have not been analyzed. The incremental method Finito/MISO has also been limited to non-proximal settings so far. In Section 4.2, we introduce the extension of MISO to composite minimization and establish its theoretical convergence rates.

Full gradient method. A first illustration is the algorithm obtained when accelerating the regular "full" gradient descent (FG), and how it contrasts with Nesterov's accelerated variant (AFG).
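The κ-selection rule can be checked numerically. A minimal sketch, assuming gradient descent as M, so that τ_{M,G_k} = (μ + κ)/(L + κ), with illustrative values of L and μ:

```python
import numpy as np

# Grid search for the kappa maximizing tau_{M,G_k} / sqrt(mu + kappa),
# the quantity driving the global rate in the strongly convex case.
L, mu = 100.0, 0.1
kappa = np.linspace(0.0, 3 * L, 300001)
ratio = ((mu + kappa) / (L + kappa)) / np.sqrt(mu + kappa)
kappa_best = kappa[np.argmax(ratio)]
print(kappa_best)
```

The maximizer comes out at κ = L − 2μ (here 99.8): writing t = μ + κ, the ratio equals √t / ((L − μ) + t), which a one-line calculus argument maximizes at t = L − μ. This agrees with the optimal choice for the full gradient method stated next.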
Here, the optimal choice for κ is L − 2μ. In the strongly convex case, we get an accelerated rate of convergence in Õ(n√(L/μ) log(1/ε)), which is the same as AFG up to logarithmic terms. A similar result can also be obtained for randomized coordinate descent methods [21].

Randomized incremental gradient. We now consider randomized incremental gradient methods such as SAG [24] and SAGA [6]. When μ > 0, we focus on the "ill-conditioned" setting n ≤ L/μ, where these methods have the complexity O((L/μ) log(1/ε)). Otherwise, their complexity becomes O(n log(1/ε)), which is independent of the condition number and seems theoretically optimal [1]. For these methods, the best choice for κ has the form κ = a(L − μ)/(n + b) − μ, with (a, b) = (2, −2) for SAG and (a, b) = (1/2, 1/2) for SAGA. A similar formula, with a constant L′ in place of L, holds for SVRG; we omit it here for brevity. SDCA [26] and Finito/MISO [7, 14] are actually related to incremental gradient methods, and the choice for κ has a similar form with (a, b) = (1, 1).

                   Comp. μ > 0              Comp. μ = 0   Catalyst μ > 0            Catalyst μ = 0
FG                 O(n(L/μ) log(1/ε))       O(nL/ε)       Õ(n√(L/μ) log(1/ε))       Õ(nL/√ε)
SAG [24]           O((L/μ) log(1/ε))        not avail.    Õ(√(nL/μ) log(1/ε))       Õ(√(nL)/√ε)
SAGA [6]           O((L/μ) log(1/ε))        not avail.    Õ(√(nL/μ) log(1/ε))       Õ(√(nL)/√ε)
Finito/MISO-Prox   O((L/μ) log(1/ε))        not avail.    Õ(√(nL/μ) log(1/ε))       Õ(√(nL)/√ε)
SDCA [25]          O((L/μ) log(1/ε))        not avail.    Õ(√(nL/μ) log(1/ε))       Õ(√(nL)/√ε)
SVRG [27]          O((L′/μ) log(1/ε))       not avail.    Õ(√(nL′/μ) log(1/ε))      Õ(√(nL′)/√ε)
Acc-FG [19]        O(n√(L/μ) log(1/ε))      O(nL/√ε)      no acceleration           no acceleration
Acc-SDCA [26]      Õ(√(nL/μ) log(1/ε))      not avail.    no acceleration           no acceleration

Table 1: Comparison of rates of convergence, before and after the catalyst acceleration, resp. in the strongly convex and non-strongly convex cases. To simplify, we only present the case where n ≤ L/μ when μ > 0; for all incremental algorithms, there is indeed no acceleration otherwise. The quantity L′ for SVRG is the average Lipschitz constant of the functions f_i (see [27]).

4.2 Proximal MISO and its Acceleration

Finito/MISO was proposed in [7] and [14] for solving the problem (1) when ψ = 0 and when f is a sum of n μ-strongly convex functions f_i as in (2), which are also differentiable with L-Lipschitz derivatives. The algorithm maintains a list of quadratic lower bounds of the functions f_i—say (d_i^k)_{i=1}^n at iteration k—and randomly updates one of them at each iteration by using strong-convexity inequalities. The current iterate x_k is then obtained by minimizing the lower bound of the objective:

    x_k = argmin_{x ∈ ℝᵖ} { D_k(x) = (1/n) Σ_{i=1}^{n} d_i^k(x) }.                  (13)

Interestingly, since D_k is a lower bound of F, we also have D_k(x_k) ≤ F*, and thus the quantity F(x_k) − D_k(x_k) can be used as an optimality certificate that upper-bounds F(x_k) − F*. Furthermore, this certificate was shown to converge to zero with a rate similar to SAG/SDCA/SVRG/SAGA under the condition n ≥ 2L/μ.
In this section, we show how to remove this condition and how to provide support for non-differentiable functions ψ whose proximal operator can be computed easily. We shall briefly sketch the main ideas here and refer to Appendix D for a thorough presentation.

The first idea, to deal with a nonsmooth regularizer ψ, is to change the definition of D_k:

    D_k(x) = (1/n) Σ_{i=1}^{n} d_i^k(x) + ψ(x),

which was also proposed in [7], without a convergence proof. Then, because the d_i^k's are quadratic functions, the minimizer x_k of D_k can be obtained by computing the proximal operator of ψ at a particular point. The second idea, to remove the condition n ≥ 2L/μ, is to modify the update of the lower bounds d_i^k. Assume that index i_k is selected among {1, . . . , n} at iteration k; then

    d_i^k(x) = (1 − δ) d_i^{k−1}(x) + δ ( f_i(x_{k−1}) + ⟨∇f_i(x_{k−1}), x − x_{k−1}⟩ + (μ/2)‖x − x_{k−1}‖² )   if i = i_k,
    d_i^k(x) = d_i^{k−1}(x)   otherwise.

Whereas the original Finito/MISO uses δ = 1, our new variant uses δ = min(1, μn/(2(L − μ))). The resulting algorithm turns out to be very close to variant "5" of proximal SDCA [25], which corresponds to using a different value for δ. The main difference between SDCA and MISO-Prox is that the latter does not use duality.
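These updates admit a compact implementation: every d_i^k is quadratic with curvature exactly μ, so it can be stored through a center z_i and an offset c_i, with d_i^k(x) = c_i + (μ/2)‖x − z_i‖², and the minimization of D_k reduces to a proximal step at the average center. The sketch below is our own illustration (not the authors' code), for ψ = λ‖·‖₁ and synthetic f_i of the form "least-squares term + (μ/2)‖x‖²"; the initialization shown is one simple choice among those discussed in Appendix D.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(0)
n, p, lam, mu = 50, 10, 0.01, 0.5
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)
L = np.max(np.sum(A * A, axis=1)) + mu           # smoothness bound for the f_i

def f_i(i, x):                                   # mu-strongly convex, L-smooth
    return 0.5 * (A[i] @ x - b[i]) ** 2 + 0.5 * mu * x @ x

def grad_f_i(i, x):
    return (A[i] @ x - b[i]) * A[i] + mu * x

def F(x):                                        # composite objective (1)
    return np.mean([f_i(i, x) for i in range(n)]) + lam * np.abs(x).sum()

# Lower bounds d_i(x) = c[i] + (mu/2)||x - z[i]||^2, initialized from x = 0.
delta = min(1.0, mu * n / (2 * (L - mu)))        # step removing n >= 2L/mu
x = np.zeros(p)
z = np.stack([x - grad_f_i(i, x) / mu for i in range(n)])
c = np.array([f_i(i, x) - grad_f_i(i, x) @ grad_f_i(i, x) / (2 * mu)
              for i in range(n)])

for _ in range(4000):
    x = soft_threshold(z.mean(axis=0), lam / mu)     # x_k = argmin D_k, a prox step
    i = rng.integers(n)
    g = grad_f_i(i, x)
    w = x - g / mu                                   # center of the new quadratic
    c_new = f_i(i, x) - g @ g / (2 * mu)
    m = (1 - delta) * z[i] + delta * w
    # Mixing two quadratics of curvature mu shifts the offset as well:
    c[i] = ((1 - delta) * c[i] + delta * c_new
            + 0.5 * mu * ((1 - delta) * z[i] @ z[i] + delta * w @ w - m @ m))
    z[i] = m

x = soft_threshold(z.mean(axis=0), lam / mu)
D = (c.mean() + 0.5 * mu * np.mean([(x - zi) @ (x - zi) for zi in z])
     + lam * np.abs(x).sum())
gap = F(x) - D                 # optimality certificate, upper-bounds F(x) - F*
```

On this toy problem, the certificate F(x_k) − D_k(x_k) drops to a small value within a few dozen passes over the data, without ever evaluating a dual objective.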
MISO-Prox also provides a different (and simpler) optimality certificate, F(x_k) − D_k(x_k), which is guaranteed to converge linearly, as stated in the next theorem.

Theorem 4.1 (Convergence of MISO-Prox).
Let (x_k)_{k≥0} be obtained by MISO-Prox. Then,

    E[F(x_k)] − F* ≤ (1/τ)(1 − τ)^{k+1} (F(x₀) − D₀(x₀))  with  τ ≥ min{ μ/(4L), 1/(2n) }.   (14)

Furthermore, we also have fast convergence of the certificate:

    E[F(x_k) − D_k(x_k)] ≤ (1/τ)(1 − τ)ᵏ (F* − D₀(x₀)).

The proof of convergence is given in Appendix D. Finally, we conclude this section by noting that MISO-Prox enjoys the catalyst acceleration, leading to the iteration complexity presented in Table 1. Since the convergence rate (14) does not have exactly the same form as (8), Propositions 3.2 and 3.4 cannot be used, and additional analysis, given in Appendix D, is needed. Practical forms of the algorithm are also presented there, along with discussions on how to initialize it.

5 Experiments

We evaluate the Catalyst acceleration on three methods that have never been accelerated in the past: SAG [24], SAGA [6], and MISO-Prox. We focus on ℓ₂-regularized logistic regression, where the regularization parameter μ yields a lower bound on the strong convexity parameter of the problem.

We use the three datasets of [14], namely real-sim, rcv1, and ocr, which are relatively large, with up to n = 2 500 000 points for ocr and p = 47 152 variables for rcv1. We consider three regimes: μ = 0 (no regularization), μ/L = 0.001/n, and μ/L = 0.1/n, which leads to significantly larger condition numbers than those used in other studies (μ/L ≈ 1/n in [14, 24]).
We compare MISO, SAG, and SAGA with their default parameters, which are recommended by their theoretical analysis (step-sizes 1/L for SAG and 1/(3L) for SAGA), and study several accelerated variants. The values of κ and ρ and the sequences (ε_k)_{k≥0} are those suggested in the previous sections, with η = 0.1 in (10). Other implementation details are presented in Appendix E.

The restarting strategy for M is key to achieving acceleration in practice. All of the methods we compare store n gradients evaluated at previous iterates of the algorithm. We always use the gradients from the previous run of M to initialize a new one. We detail in Appendix E the initialization for each method. Finally, we evaluated a heuristic that constrains M to always perform at most n iterations (one pass over the data); we call this variant AMISO2 for MISO, whereas AMISO1 refers to the regular "vanilla" accelerated variant, and we also use this heuristic to accelerate SAG.

The results are reported in Table 1. We always obtain a huge speed-up for MISO, which suffers from numerical stability issues when the condition number is very large (for instance, µ/L = 10⁻³/n = 4 · 10⁻¹⁰ for ocr). Here, not only does the Catalyst algorithm accelerate MISO, but it also stabilizes it. Whereas MISO is slower than SAG and SAGA in this "small µ" regime, AMISO2 is almost systematically the best performer. We are also able to accelerate SAG and SAGA in general, even though the improvement is less significant than for MISO. In particular, SAGA without acceleration proves to be the best method on ocr. One reason may be its ability to adapt to the unknown strong convexity parameter µ′ ≥ µ of the objective near the solution. When µ′/L ≥ 1/n, we indeed obtain a regime where acceleration does not occur (see Sec. 4).
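This regime can be made concrete with order-of-magnitude complexity estimates: incremental methods of this family need on the order of n + L/µ gradient evaluations to reach a target accuracy, while their Catalyst-accelerated versions need on the order of √(nL/µ) (up to logarithmic factors) when n ≤ L/µ, so acceleration only pays off when µ/L is well below 1/n. A rough sketch, with constants and log factors dropped and the function name ours:

```python
import math

def complexity_estimates(n, mu_over_L):
    """Order-of-magnitude gradient-evaluation counts (constants and log
    factors dropped): O(n + L/mu) without acceleration versus
    O(sqrt(n * L/mu)) with Catalyst, meaningful when n <= L/mu."""
    cond = 1.0 / mu_over_L              # condition number L/mu
    plain = n + cond
    accelerated = math.sqrt(n * cond)
    return plain, accelerated
```

For ocr (n = 2.5·10⁶) with µ/L = 10⁻³/n, this predicts roughly 2.5·10⁹ evaluations without acceleration versus 8·10⁷ with it; at µ/L = 1/n the two estimates coincide up to a factor 2, consistent with the observation that acceleration does not occur when µ′/L ≥ 1/n.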
Therefore, this experiment suggests that adaptivity to unknown strong convexity is of high interest for incremental optimization.

[Figure 1 appears here: nine panels showing, for each dataset (real-sim, rcv1, ocr) and each regime (µ = 0, µ/L = 10⁻³/n, µ/L = 10⁻¹/n), the objective function value or the relative duality gap (log scale) against the number of passes over the data, for MISO, AMISO1, AMISO2, SAG, ASAG, SAGA, and ASAGA.]

Figure 1: Objective function value (or duality gap) for
different numbers of passes performed over each dataset. The legend for all curves is on the top right. AMISO, ASAGA, and ASAG refer to the accelerated variants of MISO, SAGA, and SAG, respectively.

Acknowledgments

This work was supported by ANR (MACARON ANR-14-CE23-0003-01), the MSR-Inria joint centre, the CNRS-Mastodons program (Titan), and the NYU Moore-Sloan Data Science Environment.

References

[1] A. Agarwal and L. Bottou. A lower bound for the optimization of finite sums. In Proc. International Conference on Machine Learning (ICML), 2015.

[2] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.

[3] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2011.

[4] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[5] D. P. Bertsekas. Convex Optimization Algorithms. Athena Scientific, 2015.

[6] A. J. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Adv. Neural Information Processing Systems (NIPS), 2014.

[7] A. J. Defazio, T. S. Caetano, and J. Domke. Finito: A faster, permutable incremental gradient method for big data problems. In Proc. International Conference on Machine Learning (ICML), 2014.

[8] R. Frostig, R. Ge, S. M. Kakade, and A. Sidford. Un-regularizing: approximate proximal point algorithms for empirical risk minimization. In Proc. International Conference on Machine Learning (ICML), 2015.

[9] O. Güler. New proximal point algorithms for convex minimization. SIAM Journal on Optimization, 2(4):649–664, 1992.

[10] B. He and X. Yuan. An accelerated inexact proximal point algorithm for convex minimization.
Journal of Optimization Theory and Applications, 154(2):536–548, 2012.

[11] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I. Springer, 1996.

[12] A. Juditsky and A. Nemirovski. First order methods for nonsmooth convex large-scale optimization. Optimization for Machine Learning, MIT Press, 2012.

[13] G. Lan. An optimal randomized incremental gradient method. arXiv:1507.02000, 2015.

[14] J. Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.

[15] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

[16] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1983.

[17] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2004.

[18] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

[19] Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.

[20] N. Parikh and S. P. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):123–231, 2014.

[21] P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1–38, 2014.

[22] S. Salzo and S. Villa. Inexact and accelerated proximal point algorithms. Journal of Convex Analysis, 19(4):1167–1192, 2012.

[23] M. Schmidt, N. Le Roux, and F. Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. In Adv.
Neural Information Processing Systems (NIPS), 2011.

[24] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. arXiv:1309.2388, 2013.

[25] S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. arXiv:1211.2717, 2012.

[26] S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 2015.

[27] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

[28] Y. Zhang and L. Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In Proc. International Conference on Machine Learning (ICML), 2015.