{"title": "Optimal Black-Box Reductions Between Optimization Objectives", "book": "Advances in Neural Information Processing Systems", "page_first": 1614, "page_last": 1622, "abstract": "The diverse world of machine learning applications has given rise to a plethora of algorithms and optimization methods, finely tuned to the specific regression or classification task at hand.  We reduce the complexity of algorithm design for machine learning by reductions:  we develop reductions that take a method developed for one setting and apply it to the entire spectrum of smoothness and strong-convexity in applications.  Furthermore, unlike existing results, our new reductions are OPTIMAL and more PRACTICAL. We show how these new reductions give rise to new and faster running times on training linear classifiers for various families of loss functions, and conclude with experiments showing their successes also in practice.", "full_text": "Optimal Black-Box Reductions\nBetween Optimization Objectives\u2217\n\nZeyuan Allen-Zhu\n\nzeyuan@csail.mit.edu\n\nInstitute for Advanced Study\n\n& Princeton University\n\nElad Hazan\n\nehazan@cs.princeton.edu\n\nPrinceton University\n\nAbstract\n\nThe diverse world of machine learning applications has given rise to a plethora\nof algorithms and optimization methods, \ufb01nely tuned to the speci\ufb01c regression\nor classi\ufb01cation task at hand. We reduce the complexity of algorithm design\nfor machine learning by reductions: we develop reductions that take a method\ndeveloped for one setting and apply it to the entire spectrum of smoothness and\nstrong-convexity in applications.\nFurthermore, unlike existing results, our new reductions are optimal and more\npractical. 
We show how these new reductions give rise to new and faster running times on training linear classifiers for various families of loss functions, and conclude with experiments showing their success in practice as well.

1 Introduction

The basic machine learning problem of minimizing a regularizer plus a loss function comes in numerous variations and names. Examples include Ridge Regression, Lasso, Support Vector Machine (SVM), Logistic Regression, and many others. A multitude of optimization methods have been introduced for these problems, but in most cases they are specialized to very particular problem settings. Such specializations appear necessary, since objective functions for different classification and regularization tasks admit different convexity and smoothness parameters. We list below a few recent algorithms along with their applicable settings.
• Variance-reduction methods such as SAGA and SVRG [9, 14] intrinsically require the objective to be smooth, and do not work for non-smooth problems like SVM. This is because for loss functions such as the hinge loss, no unbiased gradient estimator can achieve a variance that approaches zero.
• Dual methods such as SDCA or APCG [20, 30] intrinsically require the objective to be strongly convex (SC), and do not directly apply to non-SC problems. This is because for a non-SC objective such as Lasso, its dual is not well-behaved due to the ℓ1 regularizer.
• Primal-dual methods such as SPDC [35] require the objective to be both smooth and SC. Many other algorithms are only analyzed for objectives that are both smooth and SC [7, 16, 17].
In this paper we investigate whether such specializations are inherent. Is it possible to take a convex optimization algorithm designed for one problem and apply it to different classification or regression settings in a black-box manner?
Such a reduction should ideally take full and optimal advantage of the objective properties, namely strong convexity and smoothness, in each setting.
Unfortunately, existing reductions are still very limited, for at least two reasons. First, they incur at least a logarithmic factor log(1/ε) in the running time, leading only to suboptimal convergence rates.² Second, after applying existing reductions, algorithms become biased, so the objective value does not converge to the global minimum. These theoretical concerns also translate into running-time losses and parameter-tuning difficulties in practice.
In this paper, we develop new and optimal regularization and smoothing reductions that can
• shave off a non-optimal log(1/ε) factor, and
• produce unbiased algorithms.
Besides these technical advantages, our new reductions also enable researchers to focus on designing algorithms for only one setting but infer optimal results more broadly. This is opposed to results such as [4, 25], where the authors develop ad hoc techniques to tweak specific algorithms, rather than all algorithms, and apply them to other settings without losing extra factors and without introducing bias. Our new reductions also enable researchers to prove lower bounds more broadly [32].

1.1 Formal Setting and Classical Approaches

Consider minimizing a composite objective function

  min_{x ∈ R^d} { F(x) := f(x) + ψ(x) } ,   (1.1)

where f(x) is a differentiable convex function and ψ(x) is a relatively simple (but possibly non-differentiable) convex function, sometimes referred to as the proximal function.

*The full version of this paper can be found at https://arxiv.org/abs/1603.05642. This paper is partially supported by NSF Grant no. 1523815 and Microsoft Research Grant no. 0518584.
²Recall that obtaining the optimal convergence rate is one of the main goals in operations research and machine learning. For instance, obtaining the optimal 1/ε rate for online learning was a major breakthrough since the log(1/ε)/ε rate was discovered [13, 15, 26].

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
Our goal is to find a point x ∈ R^d satisfying F(x) ≤ F(x*) + ε, where x* is a minimizer of F.
In most classification and regression problems, f(x) can be written as f(x) = (1/n) Σ_{i=1}^n f_i(⟨x, a_i⟩), where each a_i ∈ R^d is a feature vector. We refer to this as the finite-sum case of (1.1).
• CLASSICAL REGULARIZATION REDUCTION. Given a non-SC F(x), one can define a new objective F′(x) := F(x) + (σ/2)‖x − x_0‖², in which σ is on the order of ε. In order to minimize F(x), the classical regularization reduction calls an oracle algorithm to minimize F′(x) instead, and this oracle only needs to work with SC functions.
EXAMPLE. If F is L-smooth, one can apply accelerated gradient descent to minimize F′ and obtain an algorithm that converges in O(√(L/ε) · log(1/ε)) iterations in terms of minimizing the original F. This complexity has a suboptimal dependence on ε and shall be improved using our new regularization reduction.
• CLASSICAL SMOOTHING REDUCTION (FINITE-SUM CASE).³ Given a non-smooth F(x) of a finite-sum form, one can define a smoothed variant f̂_i(α) for each f_i(α) and let F′(x) = (1/n) Σ_{i=1}^n f̂_i(⟨a_i, x⟩) + ψ(x).⁴ In order to minimize F(x), the classical smoothing reduction calls an oracle algorithm to minimize F′(x) instead, and this oracle only needs to work with smooth functions.
EXAMPLE.
If F(x) is σ-SC and one applies accelerated gradient descent to minimize the smoothed F′, this yields an algorithm that converges in O((1/√(σε)) · log(1/ε)) iterations for minimizing the original F(x). Again, the additional factor log(1/ε) can be removed using our new smoothing reduction.
Besides the non-optimality, applying the above two reductions gives only biased algorithms. One has to tune the regularization or smoothing parameter, and the algorithm only converges to the minimum of the regularized or smoothed problem F′(x), which can be away from the true minimizer of F(x) by a distance proportional to the parameter. This makes the reductions hard to use in practice.

³The smoothing reduction is typically applied to the finite-sum form only. This is because, for a general high-dimensional function f(x), its smoothed variant f̂(x) may not be efficiently computable.
⁴More formally, one needs this variant to satisfy |f̂_i(α) − f_i(α)| ≤ ε for all α and be smooth at the same time. This can be done in at least two classical ways if f_i(α) is Lipschitz continuous.
One is to define f̂_i(α) = E_{v∈[−1,1]}[f_i(α + εv)] as an integral of f over the scaled unit interval (see for instance Chapter 2.3 of [12]); the other is to define f̂_i(α) = max_β { β·α − f*_i(β) − (ε/2)β² } using the Fenchel dual f*_i(β) of f_i(α) (see for instance [24]).

To introduce our new reductions, we first define a property on the oracle algorithm.
Our Black-Box Oracle. Consider an algorithm A that minimizes (1.1) when the objective F is L-smooth and σ-SC. We say that A satisfies the homogeneous objective decrease (HOOD) property in time Time(L, σ) if, for every starting vector x_0, A produces an output x′ satisfying F(x′) − F(x*) ≤ (F(x_0) − F(x*))/4 in time Time(L, σ). In other words, A decreases the objective distance to the minimum by a constant factor in time Time(L, σ), regardless of how large or small F(x_0) − F(x*) is. We give a few example algorithms that satisfy HOOD:
• Gradient descent and accelerated gradient descent satisfy HOOD with Time(L, σ) = O(L/σ)·C and Time(L, σ) = O(√(L/σ))·C respectively, where C is the time needed to compute a gradient ∇f(x) and perform a proximal gradient update [23]. Many subsequent works in this line of research also satisfy HOOD, including [3, 7, 16, 17].
• SVRG and SAGA [14, 34] solve the finite-sum form of (1.1) and satisfy HOOD with Time(L, σ) = O(n + L/σ)·C1, where C1 is the time needed to compute a stochastic gradient ∇f_i(x) and perform a proximal gradient update.
• Katyusha [1] solves the finite-sum form of (1.1) and satisfies HOOD with Time(L, σ) = O(n + √(nL/σ))·C1.
AdaptReg.
For objectives F(x) that are non-SC and L-smooth, our AdaptReg reduction calls an oracle satisfying HOOD a logarithmic number of times, each time with a SC objective F(x) + (σ/2)‖x − x_0‖² for an exponentially decreasing value of σ. In the end, AdaptReg produces an output x̂ satisfying F(x̂) − F(x*) ≤ ε with a total running time Σ_{t=0}^∞ Time(L, ε·2^t).
Since most algorithms have an inverse-polynomial dependence on σ in Time(L, σ), when summing up Time(L, ε·2^t) for positive values of t, we do not incur the additional factor log(1/ε), as opposed to the old reduction. In addition, AdaptReg is an unbiased and anytime algorithm: F(x̂) converges to F(x*) as time goes on without the necessity of changing parameters, so the algorithm can be interrupted at any time. We mention some theoretical applications of AdaptReg:
• Applying AdaptReg to SVRG, we obtain a running time O(n log(1/ε) + L/ε)·C1 for minimizing finite-sum, non-SC, and smooth objectives (such as Lasso and Logistic Regression). This improves on known theoretical running times obtained by non-accelerated methods, including O((n + L/ε) log(1/ε))·C1 through the old reduction, as well as O((n + L)/ε)·C1 through direct methods such as SAGA [9] and SAG [27].
• Applying AdaptReg to Katyusha, we obtain a running time O(n log(1/ε) + √(nL)/√ε)·C1 for minimizing finite-sum, non-SC, and smooth objectives (such as Lasso and Logistic Regression). This is the first and only known stochastic method that converges with the optimal 1/√ε rate (as opposed to log(1/ε)/√ε) for this class of objectives.
[1]

1.2 Our New Results

• Applying AdaptReg to methods that do not originally work for non-SC objectives, such as [7, 16, 17], we improve their running times by a factor of log(1/ε) when working with non-SC objectives.

AdaptSmooth and JointAdaptRegSmooth. For objectives F(x) that are finite-sum, σ-SC, but non-smooth, our AdaptSmooth reduction calls an oracle satisfying HOOD a logarithmic number of times, each time with a smoothed variant F^(λ)(x) of F and an exponentially decreasing smoothing parameter λ. In the end, AdaptSmooth produces an output x̂ satisfying F(x̂) − F(x*) ≤ ε with a total running time Σ_{t=0}^∞ Time(1/(ε·2^t), σ).
Since most algorithms have a polynomial dependence on L in Time(L, σ), when summing up Time(1/(ε·2^t), σ) for positive values of t, we do not incur an additional factor of log(1/ε), as opposed to the old reduction. AdaptSmooth is also an unbiased and anytime algorithm, for the same reason as AdaptReg.
In addition, AdaptReg and AdaptSmooth can effectively work together to solve the finite-sum, non-SC, and non-smooth case of (1.1); we call this joint reduction JointAdaptRegSmooth.
We mention some theoretical applications of AdaptSmooth and JointAdaptRegSmooth:
• Applying AdaptSmooth to Katyusha, we obtain a running time O(n log(1/ε) + √n/√(σε))·C1 for minimizing finite-sum, SC, and non-smooth objectives (such as SVM). Therefore, Katyusha combined with AdaptSmooth is the first and only known stochastic method that converges with the optimal 1/√ε rate (as opposed to log(1/ε)/√
[1]\n\n\u221a\nn\u221a\n\u03c3\u03b5\n\n\u03b5 +\n\nC1 for minimizing \ufb01nite-sum, SC, and non-smooth objectives (such as L1-SVM). Therefore,\nKatyusha combined with JointAdaptRegSmooth gives the \ufb01rst and only known stochastic\nmethod that converges with the optimal 1/\u03b5 rate (as opposed to log(1/\u03b5)/\u03b5) for this class of\nobjectives. [1]\n\n\u03b5 +\n\nn\n\u03b5\n\nRoadmap. We provide notations in Section 2 and discuss related works in Section 3. We propose\nAdaptReg in Section 4 and AdaptSmooth in Section 5. We leave proofs as well as the description\nand analysis of JointAdaptRegSmooth to the full version of this paper. We include experimental\nresults in Section 7.\n\n2 Preliminaries\nIn this paper we denote by \u2207f (x) the full gradient of f if it is differentiable, or the subgradient if f\nis only Lipschitz continuous. Recall some classical de\ufb01nitions on strong convexity and smoothness.\nDe\ufb01nition 2.1 (smoothness and strong convexity). For a convex function f : Rn \u2192 R,\n\u2022 f is \u03c3-strongly convex if \u2200x, y \u2208 Rn, it satis\ufb01es f (y) \u2265 f (x) + (cid:104)\u2207f (x), y \u2212 x(cid:105) + \u03c3\n\u2022 f is L-smooth if \u2200x, y \u2208 Rn, it satis\ufb01es (cid:107)\u2207f (x) \u2212 \u2207f (y)(cid:107) \u2264 L(cid:107)x \u2212 y(cid:107).\n\n2(cid:107)x \u2212 y(cid:107)2.\n\nCase 2: \u03c8(x) is non-SC and f (x) is L-smooth. Examples:\n\nCharacterization of SC and Smooth Regimes.\nIn this paper we give numbers to the following\n4 categories of objectives F (x) in (1.1). Each of them corresponds to some well-known training\nproblems in machine learning. (Letting (ai, bi) \u2208 Rd \u00d7 R be the i-th feature vector and label.)\nCase 1: \u03c8(x) is \u03c3-SC and f (x) is L-smooth. 
Examples:
• ridge regression: f(x) = (1/2n) Σ_{i=1}^n (⟨a_i, x⟩ − b_i)² and ψ(x) = (σ/2)‖x‖²₂;
• elastic net: f(x) = (1/2n) Σ_{i=1}^n (⟨a_i, x⟩ − b_i)² and ψ(x) = (σ/2)‖x‖²₂ + λ‖x‖₁.
Case 2: ψ(x) is non-SC and f(x) is L-smooth. Examples:
• Lasso: f(x) = (1/2n) Σ_{i=1}^n (⟨a_i, x⟩ − b_i)² and ψ(x) = λ‖x‖₁;
• logistic regression: f(x) = (1/n) Σ_{i=1}^n log(1 + exp(−b_i⟨a_i, x⟩)) and ψ(x) = λ‖x‖₁.
Case 3: ψ(x) is σ-SC and f(x) is non-smooth (but Lipschitz continuous). Examples:
• SVM: f(x) = (1/n) Σ_{i=1}^n max{0, 1 − b_i⟨a_i, x⟩} and ψ(x) = (σ/2)‖x‖²₂.
Case 4: ψ(x) is non-SC and f(x) is non-smooth (but Lipschitz continuous). Examples:
• ℓ1-SVM: f(x) = (1/n) Σ_{i=1}^n max{0, 1 − b_i⟨a_i, x⟩} and ψ(x) = λ‖x‖₁.

Definition 2.2 (HOOD property). We say an algorithm A(F, x_0) solving Case 1 of problem (1.1) satisfies the homogeneous objective decrease (HOOD) property with time Time(L, σ) if, for every starting point x_0, it produces an output x′ ← A(F, x_0) such that F(x′) − min_x F(x) ≤ (F(x_0) − min_x F(x))/4 in time Time(L, σ).⁵
In this paper, we denote by C the time needed for computing a full gradient ∇f(x) and performing a proximal gradient update of the form x′ ← arg min_x { ½‖x − x_0‖² + η(⟨∇f(x_0), x − x_0⟩ + ψ(x)) }. For the finite-sum case of problem (1.1), we denote by C1 the time needed for computing a stochastic (sub-)gradient ∇f_i(⟨a_i, x⟩) and performing a proximal gradient update of the form x′ ← arg min_x { ½‖x − x_0‖² + η(⟨∇f_i(⟨a_i, x_0⟩)a_i, x − x_0⟩ + ψ(x)) }.
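For ψ(x) = λ‖x‖₁ (the Lasso and ℓ1-SVM cases above), the proximal gradient update just defined admits the well-known soft-thresholding closed form. The following sketch is ours, for illustration only (the function name is not from the paper):

```python
import numpy as np

def prox_grad_step(x0, grad, eta, lam):
    """One proximal gradient update for psi(x) = lam * ||x||_1:
        x' = argmin_x { (1/2)||x - x0||^2
                        + eta * (<grad, x - x0> + lam * ||x||_1) }.
    The minimizer is the soft-thresholding of the gradient step
    x0 - eta*grad at level eta*lam."""
    z = x0 - eta * grad
    return np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)
```

Each such update costs one (stochastic) gradient evaluation plus O(d) arithmetic, which is what the unit costs C and C1 measure.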
For finite-sum forms of (1.1), C is usually on the order of n × C1.

⁵Although our definition is only for deterministic algorithms, if the guarantee is probabilistic, i.e., E[F(x′)] − min_x F(x) ≤ (F(x_0) − min_x F(x))/4, all the results of this paper remain true. Also, the constant 4 is arbitrary and can be replaced with any other constant bigger than 1.

Algorithm 1 The AdaptReg Reduction
Input: an objective F(·) in Case 2 (smooth and not necessarily strongly convex); a starting vector x_0; an initial regularization parameter σ_0; the number of epochs T; an algorithm A that solves Case 1 of problem (1.1).
Output: x̂_T.
1: x̂_0 ← x_0.
2: for t ← 0 to T − 1 do
3:   Define F^(σ_t)(x) := (σ_t/2)‖x − x_0‖² + F(x).
4:   x̂_{t+1} ← A(F^(σ_t), x̂_t).
5:   σ_{t+1} ← σ_t/2.
6: end for
7: return x̂_T.

3 Related Works

Catalyst/APPA [11, 19] turn non-accelerated methods into accelerated ones, which is different from the purpose of this paper. They can be used as regularization reductions from Case 2 to Case 1 (but not from Cases 3 or 4); however, they suffer a two-log-factor loss in the running time and perform poorly in practice [1]. In particular, Catalyst/APPA fix the regularization parameter throughout the algorithm, whereas our AdaptReg decreases it exponentially. Their results cannot imply ours.
Beck and Teboulle [5] give a reduction from Case 4 to Case 2.
Such a reduction does not suffer from a log-factor loss for an almost trivial reason: by setting the smoothing parameter λ = ε and applying any O(1/√ε)-convergence method for Case 2, we immediately obtain an O(1/ε) method for Case 4 without an extra log factor. Our main challenge in this paper is to provide log-free reductions to Case 1;⁶ simple ideas fail to produce log-free reductions in this case because all efficient algorithms solving Case 1 (due to linear convergence) have a log factor. In addition, the Beck–Teboulle reduction is biased, but ours is unbiased.
The so-called homotopy methods (e.g., methods with geometrically decreasing regularizer/smoothing weights) appear frequently in the literature [6, 25, 31, 33]. However, to the best of our knowledge, all existing homotopy analyses either work only for specific algorithms [6, 25, 31] or solve only special problems [33]. In other words, none of them provides all-purpose black-box reductions like we do.

4 AdaptReg: Reduction from Case 2 to Case 1

We now focus on solving Case 2 of problem (1.1): that is, f(·) is L-smooth, but ψ(·) is not necessarily SC. We achieve this by reducing the problem to an algorithm A solving Case 1 that satisfies HOOD.
AdaptReg works as follows (see Algorithm 1). At the beginning of AdaptReg, we set x̂_0 to equal x_0, an arbitrary given starting vector. AdaptReg consists of T epochs. At each epoch t = 0, 1, …, T − 1, we define a σ_t-strongly convex objective F^(σ_t)(x) := (σ_t/2)‖x − x_0‖² + F(x). Here, the parameter σ_{t+1} = σ_t/2 for each t ≥ 0, and σ_0 is an input parameter to AdaptReg that will be specified later. We run A on F^(σ_t)(x) with starting vector x̂_t in each epoch, and let the output be x̂_{t+1}.
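In code, the epoch structure of AdaptReg is very small. The sketch below is ours, with plain gradient descent standing in for the black-box HOOD oracle A (any Case-1 solver could be substituted):

```python
import numpy as np

def adapt_reg(oracle, grad_F, x0, sigma0, T):
    """AdaptReg sketch: repeatedly minimize the regularized objective
    F(x) + (sigma_t/2)||x - x0||^2 with a black-box oracle, halving
    sigma_t each epoch (Algorithm 1).  `oracle(grad, x_start)` is
    assumed to decrease its objective by a constant factor (HOOD)."""
    x_hat = x0.copy()
    sigma = sigma0
    for _ in range(T):
        # gradient of F^(sigma_t)(x) = (sigma_t/2)||x - x0||^2 + F(x)
        grad_reg = lambda x, s=sigma: s * (x - x0) + grad_F(x)
        x_hat = oracle(grad_reg, x_hat)
        sigma /= 2.0
    return x_hat

# A stand-in oracle: plain gradient descent for a fixed number of steps.
def gd_oracle(grad, x_start, steps=200, lr=0.05):
    x = x_start.copy()
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# usage: minimize the smooth, non-SC objective F(x) = (1/2)(<a,x> - 3)^2
a = np.array([1.0, 2.0])
grad_F = lambda x: (a @ x - 3.0) * a
x_out = adapt_reg(gd_oracle, grad_F, np.zeros(2), sigma0=1.0, T=10)
```

Because each epoch warm-starts from the previous output and the regularization weight is halved rather than fixed, the final iterate approaches a true minimizer of F instead of a biased regularized minimizer.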
After all T epochs are finished, AdaptReg simply outputs x̂_T.
We state our main theorem for AdaptReg below and prove it in the full version of this paper.
Theorem 4.1 (AdaptReg). Suppose that in problem (1.1) f(·) is L-smooth. Let x_0 be a starting vector such that F(x_0) − F(x*) ≤ Δ and ‖x_0 − x*‖² ≤ Θ. Then, AdaptReg with σ_0 = Δ/Θ and T = log₂(Δ/ε) produces an output x̂_T satisfying F(x̂_T) − min_x F(x) ≤ O(ε) in a total running time of Σ_{t=0}^{T−1} Time(L, σ_0·2^{−t}).⁷
Remark 4.2. We compare the parameter tuning effort needed for AdaptReg against the classical regularization reduction. In the classical reduction, there are two parameters: T, the number of iterations, which does not need tuning; and σ, which should ideally equal ε/Θ, an unknown quantity, so it requires tuning. In AdaptReg, we also need to tune only one parameter, namely σ_0. Our T need not be tuned because AdaptReg can be interrupted at any moment and the x̂_t of the current epoch can be outputted. In our experiments, we spent the same effort tuning σ in the classical reduction and σ_0 in AdaptReg.

⁶Designing reductions to Case 1 (rather than, for instance, Case 2) is crucial for several reasons. First, algorithm design for Case 1 is usually easier (especially in stochastic settings). Second, Case 3 can only be reduced to Case 1 but not Case 2. Third, lower bound results [32] require reductions to Case 1.
⁷If the HOOD property is only satisfied probabilistically as per Footnote 5, our error guarantee becomes probabilistic, i.e., E[F(x̂_T)] − min_x F(x) ≤ O(ε). This is also true for other reduction theorems of this paper.
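To see concretely where the log factor goes, one can expand the sum in Theorem 4.1 for an SVRG-type oracle with Time(L, σ) = O(n + L/σ)·C1; this is a sketch of the computation behind Corollary 4.3:

```latex
\sum_{t=0}^{T-1} \mathrm{Time}\big(L,\ \sigma_0 \cdot 2^{-t}\big)
  \;=\; \sum_{t=0}^{T-1} O\Big(n + \frac{L \cdot 2^{t}}{\sigma_0}\Big)\cdot C_1
  \;=\; O\Big(nT + \frac{L \cdot 2^{T}}{\sigma_0}\Big)\cdot C_1
  \;=\; O\Big(n \log\tfrac{\Delta}{\varepsilon} + \frac{L\Theta}{\varepsilon}\Big)\cdot C_1,
```

using σ_0 = Δ/Θ and T = log₂(Δ/ε), so that 2^T/σ_0 = Θ/ε. The geometric series is dominated by its last term, so only the O(n) part pays the factor T; the classical reduction instead pays log(1/ε) on both terms.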
As can be easily seen from the plots, tuning σ_0 is much easier than tuning σ.
Corollary 4.3. When AdaptReg is applied to SVRG, we solve the finite-sum case of Case 2 with running time Σ_{t=0}^{T−1} Time(L, σ_0·2^{−t}) = Σ_{t=0}^{T−1} O(n + L·2^t/σ_0)·C1 = O(n log(Δ/ε) + LΘ/ε)·C1. This is faster than O((n + LΘ/ε) log(Δ/ε))·C1 obtained through the old reduction, and faster than O((n + LΘ)/ε)·C1 obtained by SAGA [9] and SAG [27].
When AdaptReg is applied to Katyusha, we solve the finite-sum case of Case 2 with running time Σ_{t=0}^{T−1} Time(L, σ_0·2^{−t}) = Σ_{t=0}^{T−1} O(n + √(nL·2^t/σ_0))·C1 = O(n log(Δ/ε) + √(nLΘ/ε))·C1. This is faster than O((n + √(nL/ε)) log(Δ/ε))·C1 obtained through the old reduction on Katyusha [1].⁸

5 AdaptSmooth: Reduction from Case 3 to Case 1

We now focus on solving the finite-sum form of Case 3 of problem (1.1). Without loss of generality, we assume ‖a_i‖ = 1 for each i ∈ [n], because otherwise one can scale f_i accordingly. We solve this problem by reducing it to an oracle A which solves the finite-sum form of Case 1 and satisfies HOOD. Recall the following definition using the Fenchel conjugate:⁹
Definition 5.1. For each function f_i: R → R, let f*_i(β) := max_α { α·β − f_i(α) } be its Fenchel conjugate.
Then, we define the following smoothed variant of f_i parameterized by λ > 0: f_i^(λ)(α) := max_β { β·α − f*_i(β) − (λ/2)β² }. Accordingly, we define F^(λ)(x) := (1/n) Σ_{i=1}^n f_i^(λ)(⟨a_i, x⟩) + ψ(x).
From the properties of the Fenchel conjugate (see for instance the textbook [28]), we know that f_i^(λ)(·) is a (1/λ)-smooth function, and therefore the objective F^(λ)(x) falls into the finite-sum form of Case 1 of problem (1.1) with smoothness parameter L = 1/λ.
Our AdaptSmooth works as follows (see pseudocode in the full version). At the beginning of AdaptSmooth, we set x̂_0 to equal x_0, an arbitrary given starting vector. AdaptSmooth consists of T epochs. At each epoch t = 0, 1, …, T − 1, we define a (1/λ_t)-smooth objective F^(λ_t)(x) using Definition 5.1. Here, the parameter λ_{t+1} = λ_t/2 for each t ≥ 0, and λ_0 is an input parameter to AdaptSmooth that will be specified later. We run A on F^(λ_t)(x) with starting vector x̂_t in each epoch, and let the output be x̂_{t+1}. After all T epochs are finished, AdaptSmooth outputs x̂_T.
We state our main theorem for AdaptSmooth below and prove it in the full version of this paper.
Theorem 5.2. Suppose that in problem (1.1), ψ(·) is σ-strongly convex and each f_i(·) is G-Lipschitz continuous. Let x_0 be a starting vector such that F(x_0) − F(x*) ≤ Δ. Then, AdaptSmooth with λ_0 = Δ/G² and T = log₂(Δ/ε) produces an output x̂_T satisfying F(x̂_T) − min_x F(x) ≤ O(ε) in a total running time of Σ_{t=0}^{T−1} Time(2^t/λ_0, σ).
Remark 5.3. We emphasize that AdaptSmooth requires less parameter tuning effort than the old reduction, for the same reason as in Remark 4.2.
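As a concrete instance of Definition 5.1: for the hinge loss f_i(α) = max{0, 1 − α}, whose Fenchel conjugate is f*_i(β) = β on [−1, 0] (and +∞ elsewhere), the smoothed variant can be evaluated in closed form. This sketch is ours, for illustration only:

```python
def smoothed_hinge(alpha, lam):
    """f^(lam)(alpha) = max_{beta in [-1,0]} { beta*alpha - beta - (lam/2)*beta^2 }
    for the hinge loss f(alpha) = max(0, 1 - alpha).  The inner maximizer
    is beta = (alpha - 1)/lam clipped to [-1, 0], yielding a (1/lam)-smooth,
    Huber-like function within lam/2 of the hinge everywhere."""
    if alpha >= 1.0:            # maximizer beta* = 0
        return 0.0
    if alpha <= 1.0 - lam:      # maximizer beta* = -1
        return 1.0 - alpha - lam / 2.0
    return (1.0 - alpha) ** 2 / (2.0 * lam)   # interior maximizer
```

Halving λ doubles the smoothness parameter 1/λ while halving the approximation error, which is exactly the trade-off AdaptSmooth exploits by decreasing λ_t geometrically.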
Also, AdaptSmooth, when applied to Katyusha, provides the fastest running time for solving the Case 3 finite-sum form of (1.1), similar to Corollary 4.3.

6 JointAdaptRegSmooth: From Case 4 to Case 1

We show in the full version that AdaptReg and AdaptSmooth can work together to reduce the finite-sum form of Case 4 to Case 1. We call this reduction JointAdaptRegSmooth; it relies on a jointly exponentially decreasing sequence of (σ_t, λ_t), where σ_t is the weight of the convexity parameter that we add on top of F(x), and λ_t is the smoothing parameter that determines how we change each f_i(·). The analysis is analogous to a careful combination of the proofs for AdaptReg and AdaptSmooth.

⁸If the old reduction is applied on APCG, SPDC, or AccSDCA rather than Katyusha, then two log factors will be lost.
⁹For every explicitly given f_i(·), this Fenchel conjugate can be symbolically computed and fed into the algorithm. This pre-processing is needed for nearly all known algorithms in order for them to apply to non-smooth settings (such as SVRG, SAGA, SPDC, APCG, SDCA, etc.).

Figure 1: Comparing AdaptReg and the classical reduction on Lasso (with ℓ1 regularizer weight λ); panels: (a) covtype, λ = 10⁻⁶; (b) mnist, λ = 10⁻⁵; (c) rcv1, λ = 10⁻⁵. The y-axis is the objective distance to the minimum, and the x-axis is the number of passes over the dataset. The blue solid curves represent APCG under the old regularization reduction, and the red dashed curves represent APCG under AdaptReg. For other values of λ, or the results on SDCA, please refer to the full version of this paper.

7 Experiments

We perform experiments to confirm our theoretical speed-ups obtained for AdaptSmooth and AdaptReg. We work on minimizing Lasso and SVM objectives for the following three well-known datasets that can be found on the LibSVM website [10]: covtype, mnist, and rcv1.
We defer some dataset and implementation details to the full version of this paper.

7.1 Experiments on AdaptReg

To test the performance of AdaptReg, consider the Lasso objective, which is in Case 2 (i.e., non-SC but smooth). We apply AdaptReg to reduce it to Case 1 and apply either APCG [20], an accelerated method, or (Prox-)SDCA [29, 30], a non-accelerated method. Let us make a few remarks:
• APCG and SDCA are both indirect solvers for non-strongly convex objectives, and therefore regularization is intrinsically required in order to run them for Lasso or, more generally, Case 2.
• APCG and SDCA do not satisfy HOOD in theory. However, they still benefit from AdaptReg, as we shall see, demonstrating the practical value of AdaptReg.
A Practical Implementation. In principle, one can implement AdaptReg by setting the termination criterion of the oracle in the inner loop precisely as suggested by the theory, such as setting the number of iterations for SDCA to exactly T = O(n + L/σ_t) in the t-th epoch. However, in practice, it is more desirable to automatically terminate the oracle whenever the objective distance to the minimum has been sufficiently decreased. In all of our experiments, we simply compute the duality gap and terminate the oracle whenever the duality gap is below 1/4 times the last recorded duality gap of the previous epoch. For details see the full version of this paper.
Experimental Results. For each dataset, we consider three different magnitudes of regularization weights for the ℓ1 regularizer in the Lasso objective. This totals 9 analysis tasks for each algorithm. For each such task, we first implement the old reduction by adding an additional (σ/2)‖x‖² term to the Lasso objective and then apply APCG or SDCA.
We consider values of σ in the set {10^k, 3·10^k : k ∈ Z} and show the most representative six of them in the plots (blue solid curves in Figures 3 and 4). Naturally, for a larger value of σ the old reduction converges faster, but to a point that is farther from the exact minimizer because of the bias. We implement AdaptReg, where we choose the initial parameter σ_0 also from the set {10^k, 3·10^k : k ∈ Z}, and present the best one in each of the 18 plots (red dashed curves in Figures 3 and 4). Due to space limitations, we provide only 3 of the 18 plots, for medium-sized λ, in the main body of this paper (see Figure 1), and include Figures 3 and 4 only in the full version of this paper.
It is clear from our experiments that
• AdaptReg is more efficient than the old regularization reduction;
• AdaptReg requires no more parameter tuning than the classical reduction;
• AdaptReg is unbiased, which simplifies the parameter selection procedure.¹⁰

Figure 2: Comparing AdaptSmooth and the classical reduction on SVM (with ℓ2 regularizer weight σ); panels: (a) covtype, σ = 10⁻⁵; (b) mnist, σ = 10⁻⁴; (c) rcv1, σ = 10⁻⁴. The y-axis is the objective distance to the minimum, and the x-axis is the number of passes over the dataset. The blue solid curves represent SVRG under the old smoothing reduction, and the red dashed curves represent SVRG under AdaptSmooth. For other values of σ, please refer to the full version.

7.2 Experiments on AdaptSmooth

To test the performance of AdaptSmooth, consider the SVM objective, which is non-smooth but SC. We apply AdaptSmooth to reduce it to Case 1 and apply SVRG [14].
We emphasize that SVRG is an indirect solver for non-smooth objectives, and therefore smoothing is intrinsically required in order to run SVRG for SVM or, more generally, for Case 3.

A Practical Implementation. In principle, one can implement AdaptSmooth by setting the termination criterion of the oracle in the inner loop precisely as suggested by the theory, such as setting the number of iterations for SVRG to be exactly T = O(n + 1/(σλ_t)) in the t-th epoch. In practice, however, it is more desirable to automatically terminate the oracle whenever the objective distance to the minimum has decreased sufficiently. In all of our experiments, we simply compute the Euclidean norm of the full gradient of the objective and terminate the oracle whenever this norm is below 1/3 times the last recorded norm of the previous epoch. For details see the full version.

Experimental Results. For each dataset, we consider three different magnitudes of regularization weights for the ℓ2 regularizer in the SVM objective. This totals 9 analysis tasks. For each such task, we first implement the old reduction by smoothing the hinge loss functions (using Definition 5.1) with parameter λ > 0 and then apply SVRG. We consider different values of λ in the set {10^k, 3·10^k : k ∈ Z} and show the most representative six of them in the plots (blue solid curves in Figure 5). Naturally, for a larger λ the old reduction converges faster, but to a point that is farther from the exact minimizer due to its bias. We then implement AdaptSmooth, where we choose the initial smoothing parameter λ0 also from the set {10^k, 3·10^k : k ∈ Z}, and present the best one in each of the 9 plots (red dashed curves in Figure 5).
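The gradient-norm termination rule described above can be sketched analogously. Again this is a minimal sketch under stated assumptions, not the paper's code: `oracle_step` and `grad_norm` are hypothetical placeholders for one pass of the inner solver (e.g. SVRG) on the λ-smoothed objective and the Euclidean norm of its full gradient, and the halving of λ per epoch reflects AdaptSmooth's geometrically decreasing smoothing schedule.

```python
def adaptsmooth(x0, lam0, oracle_step, grad_norm, epochs=8, tol=1e-6):
    """Outer loop of AdaptSmooth with the practical termination rule:
    run the inner oracle on the lam-smoothed objective until the full
    gradient norm falls below 1/3 of the last norm recorded in the
    previous epoch, then halve the smoothing parameter lam.  (tol is
    a small floor to keep the inner loop finite.)"""
    x, lam = x0, lam0
    last_norm = grad_norm(x, lam)
    for _ in range(epochs):
        # inner loop: one oracle pass at a time, stop on the 1/3 norm test
        while grad_norm(x, lam) > max(last_norm / 3, tol):
            x = oracle_step(x, lam)
        last_norm = grad_norm(x, lam)
        lam /= 2  # shrink the smoothing parameter each epoch
    return x
```

On a toy one-dimensional SC problem with a Huber-smoothed absolute-value loss (gradient descent standing in for the oracle), the iterate approaches the minimizer of the original non-smooth objective as λ is driven down.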
Due to space limitations, we provide only 3 of the 9 plots, for small-sized σ, in the main body of this paper (see Figure 2), and include Figure 5 only in the full version.
It is clear from our experiments that
• AdaptSmooth is more efficient than the old smoothing reduction, especially when the desired training error is small;
• AdaptSmooth requires no more parameter tuning than the classical reduction;
• AdaptSmooth is unbiased and simplifies the parameter selection, for the same reason as Footnote 10.

10 It is easy to determine the best σ0 in AdaptReg; in contrast, in the old reduction, if the desired error is somehow changed for the application, one has to select a different σ and restart the algorithm.

References
[1] Zeyuan Allen-Zhu. Katyusha: The First Direct Acceleration of Stochastic Gradient Methods. ArXiv e-prints, abs/1603.05953, March 2016.
[2] Zeyuan Allen-Zhu and Lorenzo Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent. ArXiv e-prints, abs/1407.1537, July 2014.
[3] Zeyuan Allen-Zhu, Peter Richtárik, Zheng Qu, and Yang Yuan. Even faster accelerated coordinate descent using non-uniform sampling. In ICML, 2016.
[4] Zeyuan Allen-Zhu and Yang Yuan. Improved SVRG for Non-Strongly-Convex or Sum-of-Non-Convex Objectives. In ICML, 2016.
[5] Amir Beck and Marc Teboulle. Smoothing and first order methods: A unified framework. SIAM Journal on Optimization, 22(2):557–580, 2012.
[6] Radu Ioan Boţ and Christopher Hendrich. A variable smoothing algorithm for solving convex optimization problems. TOP, 23(1):124–150, 2015.
[7] Sébastien Bubeck, Yin Tat Lee, and Mohit Singh. A geometric alternative to Nesterov's accelerated gradient descent.
ArXiv e-prints, abs/1506.08187, June 2015.
[8] Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.
[9] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. In NIPS, 2014.
[10] Rong-En Fan and Chih-Jen Lin. LIBSVM Data: Classification, Regression and Multi-label. Accessed: 2015-06.
[11] Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In ICML, volume 37, pages 1–28, 2015.
[12] Elad Hazan. DRAFT: Introduction to online convex optimization. Foundations and Trends in Machine Learning, XX(XX):1–168, 2015.
[13] Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: Optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489–2512, 2014.
[14] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, NIPS 2013, pages 315–323, 2013.
[15] Simon Lacoste-Julien, Mark W. Schmidt, and Francis R. Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. ArXiv e-prints, abs/1212.2002, 2012.
[16] Yin Tat Lee and Aaron Sidford. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In FOCS, pages 147–156. IEEE, 2013.
[17] Laurent Lessard, Benjamin Recht, and Andrew Packard. Analysis and design of optimization algorithms via integral quadratic constraints. CoRR, abs/1408.3595, 2014.
[18] Hongzhou Lin. Private communication, 2016.
[19] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui.
A Universal Catalyst for First-Order Optimization. In NIPS, 2015.
[20] Qihang Lin, Zhaosong Lu, and Lin Xiao. An Accelerated Proximal Coordinate Gradient Method and its Application to Regularized Empirical Risk Minimization. In NIPS, pages 3059–3067, 2014.
[21] Arkadi Nemirovski. Prox-Method with Rate of Convergence O(1/t) for Variational Inequalities with Lipschitz Continuous Monotone Operators and Smooth Convex-Concave Saddle Point Problems. SIAM Journal on Optimization, 15(1):229–251, January 2004.
[22] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). In Doklady AN SSSR (translated as Soviet Mathematics Doklady), volume 269, pages 543–547, 1983.
[23] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume I. Kluwer Academic Publishers, 2004.
[24] Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, December 2005.
[25] Francesco Orabona, Andreas Argyriou, and Nathan Srebro. PRISMA: Proximal iterative smoothing algorithm. arXiv preprint arXiv:1206.2372, 2012.
[26] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, 2012.
[27] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, pages 1–45, 2013. Preliminary version appeared in NIPS 2012.
[28] Shai Shalev-Shwartz. Online Learning and Online Convex Optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.
[29] Shai Shalev-Shwartz and Tong Zhang. Proximal Stochastic Dual Coordinate Ascent. arXiv preprint arXiv:1211.2717, pages 1–18, 2012.
[30] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization.
Journal of Machine Learning Research, 14:567–599, 2013.
[31] Quoc Tran-Dinh. Adaptive smoothing algorithms for nonsmooth composite convex minimization. arXiv preprint arXiv:1509.00106, 2015.
[32] Blake Woodworth and Nati Srebro. Tight Complexity Bounds for Optimizing Composite Objectives. In NIPS, 2016.
[33] Lin Xiao and Tong Zhang. A proximal-gradient homotopy method for the sparse least-squares problem. SIAM Journal on Optimization, 23(2):1062–1091, 2013.
[34] Lin Xiao and Tong Zhang. A Proximal Stochastic Gradient Method with Progressive Variance Reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
[35] Yuchen Zhang and Lin Xiao. Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization. In ICML, 2015.