{"title": "Optimal Regularized Dual Averaging Methods for Stochastic Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 395, "page_last": 403, "abstract": "This paper considers a wide spectrum of regularized stochastic optimization problems where both the loss function and regularizer can be non-smooth. We develop a novel algorithm based on the regularized dual averaging (RDA) method, that can simultaneously achieve the optimal convergence rates for both convex and strongly convex loss. In particular, for strongly convex loss, it achieves the optimal rate of $O(\\frac{1}{N}+\\frac{1}{N^2})$ for $N$ iterations, which improves the best known rate $O(\\frac{\\log N }{N})$ of previous stochastic dual averaging algorithms. In addition, our method constructs the final solution directly from the proximal mapping instead of averaging of all previous iterates. For widely used sparsity-inducing regularizers (e.g., $\\ell_1$-norm), it has the advantage of encouraging sparser solutions. We further develop a multi-stage extension using the proposed algorithm as a subroutine, which achieves the uniformly-optimal rate $O(\\frac{1}{N}+\\exp\\{-N\\})$ for strongly convex loss.", "full_text": "Optimal Regularized Dual Averaging Methods for\n\nStochastic Optimization\n\nXi Chen\n\nMachine Learning Department\nCarnegie Mellon University\nxichen@cs.cmu.edu\n\nQihang Lin\n\nJavier Pe\u02dcna\n\nTepper School of Business\nCarnegie Mellon University\n\n{qihangl,jfp}@andrew.cmu.edu\n\nAbstract\n\nThis paper considers a wide spectrum of regularized stochastic optimization prob-\nlems where both the loss function and regularizer can be non-smooth. We develop\na novel algorithm based on the regularized dual averaging (RDA) method, that\ncan simultaneously achieve the optimal convergence rates for both convex and\nstrongly convex loss. 
In particular, for strongly convex loss, it achieves the optimal rate of $O(\frac{1}{N}+\frac{1}{N^2})$ for $N$ iterations, which improves the rate $O(\frac{\log N}{N})$ of previous regularized dual averaging algorithms. In addition, our method constructs the final solution directly from the proximal mapping instead of averaging all previous iterates. For widely used sparsity-inducing regularizers (e.g., the $\ell_1$-norm), it has the advantage of encouraging sparser solutions. We further develop a multi-stage extension using the proposed algorithm as a subroutine, which achieves the uniformly-optimal rate $O(\frac{1}{N}+\exp\{-N\})$ for strongly convex loss.

1 Introduction

Many risk minimization problems in machine learning can be formulated into a regularized stochastic optimization problem of the following form:

$$\min_{x\in\mathcal{X}} \{\phi(x) := f(x) + h(x)\}. \quad (1)$$

Here, the set of feasible solutions $\mathcal{X}$ is a convex set in $\mathbb{R}^n$, which is endowed with a norm $\|\cdot\|$ and the dual norm $\|\cdot\|_*$. The regularizer $h(x)$ is assumed to be convex, but could be non-differentiable. Popular examples of $h(x)$ include the $\ell_1$-norm and related sparsity-inducing regularizers. The loss function $f(x)$ takes the form $f(x) := \mathbb{E}_\xi(F(x,\xi)) = \int F(x,\xi)\,dP(\xi)$, where $\xi$ is a random vector with distribution $P$. In typical regression or classification tasks, $\xi$ is the input and response (or class label) pair. We assume that for every random vector $\xi$, $F(x,\xi)$ is a convex and continuous function in $x \in \mathcal{X}$. Therefore, $f(x)$ is also convex. Furthermore, we assume that there exist constants $L \ge 0$, $M \ge 0$ and $\tilde{\mu} \ge 0$ such that

$$\frac{\tilde{\mu}}{2}\|x-y\|^2 \le f(y) - f(x) - \langle y-x, f'(x)\rangle \le \frac{L}{2}\|x-y\|^2 + M\|x-y\|, \quad \forall x,y \in \mathcal{X}, \quad (2)$$

where $f'(x) \in \partial f(x)$, the subdifferential of $f$. We note that this assumption allows us to adopt a wide class of loss functions.
For example, if $f(x)$ is smooth and its gradient $f'(x) = \nabla f(x)$ is Lipschitz continuous, we have $L > 0$ and $M = 0$ (e.g., squared or logistic loss). If $f(x)$ is non-smooth but Lipschitz continuous, we have $L = 0$ and $M > 0$ (e.g., hinge loss). If $\tilde{\mu} > 0$, $f(x)$ is strongly convex and $\tilde{\mu}$ is the so-called strong convexity parameter.

In general, the optimization problem in Eq.(1) is challenging since the integration in $f(x)$ is computationally intractable for high-dimensional $P$. In many learning problems, we do not even know the underlying distribution $P$ but can only generate i.i.d. samples $\xi$ from $P$. A traditional approach is to consider the empirical loss minimization problem where the expectation in $f(x)$ is replaced by its empirical average on a set of training samples $\{\xi_1,\ldots,\xi_m\}$: $f_{\mathrm{emp}}(x) := \frac{1}{m}\sum_{i=1}^m F(x,\xi_i)$. However, for modern data-intensive applications, minimization of the empirical loss with an off-line optimization solver could suffer from very poor scalability.

In the past few years, many stochastic (sub)gradient methods [6, 5, 8, 12, 14, 10, 9, 11, 7, 18] have been developed to directly solve the stochastic optimization problem in Eq.(1); they enjoy low per-iteration complexity and the capability of scaling up to very large data sets.
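To make the loss classes above concrete, the following sketch (our illustration, not code from the paper; all function names are ours) implements stochastic subgradient oracles $G(x,\xi)$ for the smooth squared loss ($L > 0$, $M = 0$) and the non-smooth hinge loss ($L = 0$, $M > 0$), together with an $\ell_1$ regularizer $h(x)$:

```python
import numpy as np

def squared_loss_subgrad(x, a, b):
    """Gradient of F(x, xi) = 0.5*(a^T x - b)^2: smooth, so L > 0 and M = 0."""
    return (a @ x - b) * a

def hinge_loss_subgrad(x, a, b):
    """A subgradient of F(x, xi) = max(0, 1 - b * a^T x): non-smooth, L = 0, M > 0."""
    return -b * a if b * (a @ x) < 1.0 else np.zeros_like(x)

def l1_regularizer(x, lam):
    """h(x) = lam * ||x||_1: convex but non-differentiable."""
    return lam * np.abs(x).sum()

# One draw of a stochastic subgradient G(x, xi) for a sampled pair xi = (a, b):
rng = np.random.default_rng(0)
x = rng.normal(size=5)
a, b = rng.normal(size=5), 1.0
g = hinge_loss_subgrad(x, a, b)
phi_sample = max(0.0, 1 - b * (a @ x)) + l1_regularizer(x, 0.1)
```

Note that only the sampled pair $(a, b)$ enters each oracle call, matching the setting where the distribution $P$ is accessible only through i.i.d. samples.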
In particular, at the t-th iteration with the current iterate $x_t$, these methods randomly draw a sample $\xi_t$ from $P$; then compute the so-called "stochastic subgradient" $G(x_t,\xi_t) \in \partial_x F(x_t,\xi_t)$, where $\partial_x F(x_t,\xi_t)$ denotes the subdifferential of $F(x,\xi_t)$ with respect to $x$ at $x_t$; and update $x_t$ using $G(x_t,\xi_t)$. These algorithms fall into the class of stochastic approximation methods. Recently, Xiao [21] proposed the regularized dual averaging (RDA) method and its accelerated version (AC-RDA) based on Nesterov's primal-dual method [17]. Instead of only utilizing a single stochastic subgradient $G(x_t,\xi_t)$ at the current iteration, it updates the parameter vector using the average of all past stochastic subgradients $\{G(x_i,\xi_i)\}_{i=1}^t$ and hence leads to improved empirical performance.

In this paper, we propose a novel regularized dual averaging method, called optimal RDA or ORDA, which achieves the optimal rate of convergence for the expected error $\mathbb{E}[\phi(\hat{x}) - \phi(x^*)]$, where $\hat{x}$ is the solution from ORDA and $x^*$ is the optimal solution of Eq.(1). As compared to previous dual averaging methods, it has three main advantages:

1. For strongly convex $f(x)$, ORDA improves the convergence rate of stochastic dual averaging methods $O(\frac{\sigma^2\log N}{\tilde{\mu}N}) \approx O(\frac{\log N}{\tilde{\mu}N})$ [17, 21] to an optimal rate $O(\frac{\sigma^2+M^2}{\tilde{\mu}N} + \frac{L}{N^2}) \approx O(\frac{1}{\tilde{\mu}N})$, where $\sigma^2$ is the variance of the stochastic subgradient, $N$ is the number of iterations, and the parameters $\tilde{\mu}$, $M$ and $L$ of $f(x)$ are defined in Eq.(2).

2. ORDA is a self-adaptive and optimal algorithm for solving both convex and strongly convex $f(x)$, with the strong convexity parameter $\tilde{\mu}$ as an input. When $\tilde{\mu} = 0$, ORDA reduces to a variant of AC-RDA in [21] with the optimal rate for solving convex $f(x)$. Furthermore, our analysis allows $f(x)$ to be non-smooth, while AC-RDA requires the smoothness of $f(x)$. For strongly convex $f(x)$ with $\tilde{\mu} > 0$, our algorithm achieves the optimal rate of $O(\frac{\sigma^2+M^2}{\tilde{\mu}N} + \frac{L}{N^2})$, while AC-RDA does not utilize the advantage of strong convexity.

3. Existing RDA methods [21] and many other stochastic gradient methods (e.g., [14, 10]) can only show the convergence rate for the averaged iterates $\bar{x}_N = \sum_{t=1}^N \rho_t x_t / \sum_{t=1}^N \rho_t$, where the $\{\rho_t\}$ are nonnegative weights. However, in general, the averaged iterates $\bar{x}_N$ cannot keep the structure that the regularizer tends to enforce (e.g., sparsity, low rank, etc.). For example, when $h(x)$ is a sparsity-inducing regularizer ($\ell_1$-norm), although $x_t$ computed from the proximal mapping will be sparse as $t$ grows large, the averaged solution could be non-sparse. In contrast, our method directly generates the final solution from the proximal mapping, which leads to sparser solutions.

In addition to the rate of convergence, we also provide high probability bounds on the error of objective values. Utilizing a technical lemma from [3], we can show the same high probability bound as in RDA [21] but under a weaker assumption.

Furthermore, using ORDA as a subroutine, we develop the multi-stage ORDA, which obtains the convergence rate of $O(\frac{\sigma^2+M^2}{\tilde{\mu}N} + \exp\{-\sqrt{\tilde{\mu}/L}\,N\})$ for strongly convex $f(x)$ and achieves the so-called "uniformly-optimal" rate [15]. Recall that ORDA has the rate $O(\frac{\sigma^2+M^2}{\tilde{\mu}N} + \frac{L}{N^2})$. The rate of multi-stage ORDA thus improves the second term in the rate of ORDA from $O(\frac{L}{N^2})$ to $O(\exp\{-\sqrt{\tilde{\mu}/L}\,N\})$. Although the improvement is on the non-dominating term, multi-stage ORDA is an optimal algorithm for both stochastic and deterministic optimization.
In particular, for deterministic strongly convex and smooth $f(x)$ ($M = 0$), one can use the same algorithm but only replace the stochastic subgradient $G(x,\xi)$ by the deterministic gradient $\nabla f(x)$. Then the variance of the stochastic subgradient is $\sigma = 0$, the term $\frac{\sigma^2+M^2}{\tilde{\mu}N}$ in the rate equals $0$, and multi-stage ORDA becomes an optimal deterministic solver with the exponential rate $O(\exp\{-\sqrt{\tilde{\mu}/L}\,N\})$ for strongly convex $f(x)$. This is the reason why such a rate is "uniformly-optimal", i.e., optimal with respect to both stochastic and deterministic optimization.

Algorithm 1 Optimal Regularized Dual Averaging Method: ORDA($x_0$, $N$, $\Gamma$, $c$)

Input Parameters: Starting point $x_0 \in \mathcal{X}$, the number of iterations $N$, constants $\Gamma \ge L$ and $c \ge 0$.
Parameters for $f(x)$: Constants $L$, $M$ and $\tilde{\mu}$ for $f(x)$ in Eq.(2); set $\mu = \tilde{\mu}/\tau$.
Initialization: Set $\theta_t = \frac{2}{t+2}$; $\nu_t = \frac{2}{t+1}$; $\gamma_t = c(t+1)^{3/2} + \tau\Gamma$; $z_0 = x_0$.
Iterate for $t = 0, 1, 2, \ldots, N$:
  1. $y_t = \frac{(1-\theta_t)(\mu+\theta_t^2\gamma_t)}{\theta_t^2\gamma_t+(1-\theta_t^2)\mu}\, x_t + \frac{(1-\theta_t)\theta_t\mu+\theta_t^3\gamma_t}{\theta_t^2\gamma_t+(1-\theta_t^2)\mu}\, z_t$
  2. Sample $\xi_t$ from the distribution $P(\xi)$ and compute the stochastic subgradient $G(y_t,\xi_t)$.
  3. $g_t = \theta_t\nu_t \sum_{i=0}^t \frac{G(y_i,\xi_i)}{\nu_i}$
  4. $z_{t+1} = \arg\min_{x\in\mathcal{X}} \big\{ \langle x, g_t\rangle + h(x) + \theta_t\nu_t \sum_{i=0}^t \frac{\mu V(x,y_i)}{\tau\nu_i} + \theta_t\nu_t\gamma_{t+1} V(x,x_0) \big\}$
  5. $x_{t+1} = \arg\min_{x\in\mathcal{X}} \big\{ \langle x, G(y_t,\xi_t)\rangle + h(x) + \big(\frac{\mu}{\tau} + \frac{\gamma_t}{\tau\theta_t^2}\big) V(x,y_t) \big\}$
Output: $x_{N+1}$

2 Preliminary and Notations

In the framework of first-order stochastic optimization, the only available information about $f(x)$ is the stochastic subgradient. Formally speaking, the stochastic subgradient of $f(x)$ at $x$, $G(x,\xi)$, is a vector-valued function such that $\mathbb{E}_\xi G(x,\xi) = f'(x) \in \partial f(x)$. Following the existing literature, a standard assumption on $G(x,\xi)$ is made throughout the paper: there exists a constant $\sigma$ such that, for all $x \in \mathcal{X}$,

$$\mathbb{E}_\xi(\|G(x,\xi) - f'(x)\|_*^2) \le \sigma^2. \quad (3)$$

A key updating step in dual averaging methods, the proximal mapping, utilizes the Bregman divergence. Let $\omega(x): \mathcal{X} \to \mathbb{R}$ be a strongly convex and differentiable function; the Bregman divergence associated with $\omega(x)$ is defined as:

$$V(x,y) := \omega(x) - \omega(y) - \langle\nabla\omega(y), x-y\rangle. \quad (4)$$

One typical and simple example is $\omega(x) = \frac{1}{2}\|x\|_2^2$ together with $V(x,y) = \frac{1}{2}\|x-y\|_2^2$. One may refer to [21] for more examples. We can always scale $\omega(x)$ so that $V(x,y) \ge \frac{1}{2}\|x-y\|^2$ for all $x,y \in \mathcal{X}$. Following the assumption in [10], we assume that $V(x,y)$ grows quadratically with a parameter $\tau > 1$, i.e., $V(x,y) \le \frac{\tau}{2}\|x-y\|^2$ for all $x,y \in \mathcal{X}$. In fact, we could simply choose $\omega(x)$ with a $\tau$-Lipschitz continuous gradient so that the quadratic growth assumption is automatically satisfied.
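For the Euclidean choice $\omega(x) = \frac{1}{2}\|x\|_2^2$, a proximal mapping with $h(x) = \lambda\|x\|_1$ has the familiar soft-thresholding closed form. A minimal sketch (our illustration, assuming this Euclidean setup; names are ours):

```python
import numpy as np

def bregman_euclidean(x, y):
    """V(x, y) for omega(x) = 0.5*||x||_2^2, i.e. V(x, y) = 0.5*||x - y||_2^2."""
    return 0.5 * np.sum((x - y) ** 2)

def prox_l1(g, y, eta, lam):
    """Solve argmin_x <g, x> + lam*||x||_1 + (eta/2)*||x - y||_2^2.
    Completing the square gives soft-thresholding of u = y - g/eta at lam/eta."""
    u = y - g / eta                      # minimizer of the smooth quadratic part
    return np.sign(u) * np.maximum(np.abs(u) - lam / eta, 0.0)
```

The same soft-thresholding operation is the building block for both proximal-mapping steps of Algorithm 1 when $h$ is the $\ell_1$-norm, which is why the iterates it produces are sparse.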
3 Optimal Regularized Dual Averaging Method

In dual averaging methods [17, 21], the key proximal mapping step utilizes the average of all past stochastic subgradients to update the parameter vector. In particular, it takes the form $z_{t+1} = \arg\min_{x\in\mathcal{X}} \{\langle g_t, x\rangle + h(x) + \beta_t V(x,x_0)\}$, where $\beta_t$ is the step-size and $g_t = \frac{1}{t+1}\sum_{i=0}^t G(z_i,\xi_i)$. For strongly convex $f(x)$, the current dual averaging methods achieve a rate of $O(\frac{\sigma^2\log N}{\tilde{\mu}N})$, which is suboptimal. In this section, we propose a new dual averaging algorithm which adapts to both strongly and non-strongly convex $f(x)$ via the strong convexity parameter $\tilde{\mu}$ and achieves optimal rates in both cases. In addition, for previous dual averaging methods, to guarantee convergence, the final solution takes the form $\hat{x} = \frac{1}{N+1}\sum_{t=0}^N z_t$ and hence is not sparse in nature for sparsity-inducing regularizers. Instead of taking the average, we introduce another proximal mapping and generate the final solution directly from this second proximal mapping. This strategy provides sparser solutions in practice. It is worth noting that in RDA, $z_N$ has been proved to achieve the desirable sparsity pattern (i.e., the manifold identification property) [13]. However, according to [13], the convergence of $\phi(z_N)$ to the optimal $\phi(x^*)$ is established only under the more restrictive assumption that $x^*$ is a strong local minimizer of $\phi$ relative to the optimal manifold, and the convergence rate is quite slow. Without this assumption, the convergence of $\phi(z_N)$ is still unknown.

The proposed optimal RDA (ORDA) method is presented in Algorithm 1. To simplify our notations, we define the parameter $\mu = \tilde{\mu}/\tau$, which scales the strong convexity parameter $\tilde{\mu}$ by $\frac{1}{\tau}$, where $\tau$ is the quadratic growth constant. In general, the constant $\Gamma$, which defines the step-size parameter $\gamma_t$, is set to $L$. However, we allow $\Gamma$ to be an arbitrary constant greater than or equal to $L$ to facilitate the introduction of the multi-stage ORDA in a later section. The parameter $c$ is set to achieve the optimal rates for both convex and strongly convex loss. When $\mu > 0$ (or equivalently, $\tilde{\mu} > 0$), $c$ is set to $0$ so that $\gamma_t \equiv \tau\Gamma \ge \tau L$; while for $\mu = 0$, $c = \frac{\sqrt{\tau}(\sigma+M)}{\sqrt{V(x^*,x_0)}}$. Since $x^*$ is unknown in practice, one might replace $V(x^*,x_0)$ in $c$ by a tuning parameter.

Here, we make a few more explanations of Algorithm 1. In Step 1, the intermediate point $y_t$ is a convex combination of $x_t$ and $z_t$, and when $\mu = 0$, $y_t = (1-\theta_t)x_t + \theta_t z_t$. The choice of the combination weights is inspired by [10]. Second, with our choice of $\theta_t$ and $\nu_t$, it is easy to prove that $\sum_{i=0}^t \frac{1}{\nu_i} = \frac{1}{\theta_t\nu_t}$; therefore, $g_t$ in Step 3 is a convex combination of $\{G(y_i,\xi_i)\}_{i=0}^t$. As compared to RDA, which uses the uniform average of past subgradients, $g_t$ in ORDA is a weighted average of all past stochastic subgradients in which the subgradient from a later iteration has a larger weight (i.e., $G(y_i,\xi_i)$ has the weight $\frac{2(i+1)}{(t+1)(t+2)}$). In practice, instead of storing all past stochastic subgradients, $g_t$ can be simply updated from $g_{t-1}$: $g_t = \theta_t\nu_t\big(\frac{g_{t-1}}{\theta_{t-1}\nu_{t-1}} + \frac{G(y_t,\xi_t)}{\nu_t}\big)$. We also note that, since the error in the stochastic subgradient $G(y_t,\xi_t)$ affects the sparsity of $x_{t+1}$ via the second proximal mapping, to obtain stable sparsity recovery performance it is better to construct the stochastic subgradient with a small batch of samples [21, 1]. This helps to reduce the noise of the stochastic subgradient.
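The iteration above can be sketched in code. The following is a simplified illustration of Algorithm 1 for the non-strongly convex case ($\mu = 0$), specialized to the Euclidean divergence ($\tau = 1$) and $h(x) = \lambda\|x\|_1$, so that both proximal steps reduce to soft-thresholding; it is our sketch under these assumptions, not the paper's reference implementation:

```python
import numpy as np

def soft_threshold(u, thr):
    """Closed-form prox of thr*||.||_1 around u."""
    return np.sign(u) * np.maximum(np.abs(u) - thr, 0.0)

def orda_l1(grad_oracle, x0, n_iters, lam, Gamma, c):
    """Simplified ORDA sketch for min_x E[F(x, xi)] + lam*||x||_1 with mu = 0
    and the Euclidean Bregman divergence (tau = 1).
    grad_oracle(y) returns a stochastic gradient G(y, xi)."""
    x, z = x0.copy(), x0.copy()
    sum_weighted = np.zeros_like(x0)     # accumulates sum_i G(y_i, xi_i) / nu_i
    for t in range(n_iters):
        theta, nu = 2.0 / (t + 2), 2.0 / (t + 1)
        gamma_t = c * (t + 1) ** 1.5 + Gamma
        gamma_t1 = c * (t + 2) ** 1.5 + Gamma
        y = (1 - theta) * x + theta * z          # Step 1 (mu = 0 case)
        G = grad_oracle(y)                       # Step 2
        sum_weighted += G / nu
        g = theta * nu * sum_weighted            # Step 3: weighted average
        eta_z = theta * nu * gamma_t1            # Step 4: dual-averaging prox at x0
        z = soft_threshold(x0 - g / eta_z, lam / eta_z)
        eta_x = gamma_t / theta ** 2             # Step 5: second prox around y_t
        x = soft_threshold(y - G / eta_x, lam / eta_x)
    return x
```

Because the returned iterate comes out of a soft-thresholding step rather than an average, it can be exactly sparse, which is the point made in advantage 3 above.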
3.1 Convergence Rate

We present the convergence rate for ORDA. We start with a general theorem that does not plug in the values of the parameters. To simplify notation, we define $\Delta_t := G(y_t,\xi_t) - f'(y_t)$.

Theorem 1 For ORDA, if we require $c > 0$ when $\tilde{\mu} = 0$, then for any $t \ge 0$:

$$\phi(x_{t+1}) - \phi(x^*) \le \theta_t\nu_t\gamma_{t+1}V(x^*,x_0) + \theta_t\nu_t\sum_{i=0}^t \frac{(\|\Delta_i\|_*+M)^2}{2\nu_i\big(\frac{\mu}{\tau\theta_i} - \frac{\theta_i L}{\tau} + \theta_i\gamma_i\big)} + \theta_t\nu_t\sum_{i=0}^t \frac{\langle x^* - \hat{z}_i, \Delta_i\rangle}{\nu_i}, \quad (5)$$

where $\hat{z}_t = \frac{\theta_t\mu}{\mu+\gamma_t\theta_t^2}\,y_t + \frac{(1-\theta_t)\mu+\gamma_t\theta_t^2}{\mu+\gamma_t\theta_t^2}\,z_t$ is a convex combination of $y_t$ and $z_t$, and $\hat{z}_t = z_t$ when $\mu = 0$. Taking the expectation on both sides of Eq.(5):

$$\mathbb{E}\phi(x_{t+1}) - \phi(x^*) \le \theta_t\nu_t\gamma_{t+1}V(x^*,x_0) + (\sigma^2+M^2)\,\theta_t\nu_t\sum_{i=0}^t \frac{1}{\nu_i\big(\frac{\mu}{\tau\theta_i} - \frac{\theta_i L}{\tau} + \theta_i\gamma_i\big)}. \quad (6)$$

The proof of Theorem 1 is given in the Appendix. In the next two corollaries, we establish the rates of convergence in expectation for ORDA by choosing different values of $c$ based on $\tilde{\mu}$.

Corollary 1 For convex $f(x)$ with $\tilde{\mu} = 0$, by setting $c = \frac{\sqrt{\tau}(\sigma+M)}{\sqrt{V(x^*,x_0)}}$ and $\Gamma = L$, we obtain:

$$\mathbb{E}\phi(x_{N+1}) - \phi(x^*) \le \frac{4\tau L V(x^*,x_0)}{N^2} + \frac{8(\sigma+M)\sqrt{\tau V(x^*,x_0)}}{\sqrt{N}}. \quad (7)$$

Based on Eq.(6), the proof of Corollary 1 is straightforward, with details in the Appendix. Since $x^*$ is unknown in practice, one could set $c$ by replacing $V(x^*,x_0)$ with any value $D^* \ge V(x^*,x_0)$. By doing so, Eq.(7) remains valid after replacing all $V(x^*,x_0)$ by $D^*$. For convex $f(x)$ with $\tilde{\mu} = 0$, the rate in Eq.(7) achieves the uniformly-optimal rate according to [15]. In fact, if $f(x)$ is a deterministic and smooth function with $\sigma = M = 0$ (e.g., a smooth empirical loss), one only needs to change the stochastic subgradient $G(y_t,\xi_t)$ to $\nabla f(y_t)$. The resulting algorithm, which reduces to Algorithm 3 in [20], is an optimal deterministic first-order method with the rate $O(\frac{L V(x^*,x_0)}{N^2})$.

We note that the quadratic growth assumption on $V(x,y)$ is not necessary for convex $f(x)$. If one drops this assumption and replaces the last step of ORDA by $x_{t+1} = \arg\min_{x\in\mathcal{X}} \big\{\langle x, G(y_t,\xi_t)\rangle + h(x) + \big(\frac{\mu}{2} + \frac{\gamma_t}{2\theta_t^2}\big)\|x-y_t\|^2\big\}$, we can achieve the same rate as in Eq.(7) but with all $\tau$ removed from the right-hand side.
However, the quadratic growth assumption is indeed required for showing convergence for strongly convex $f(x)$, as in the next corollary.

Corollary 2 For strongly convex $f(x)$ with $\tilde{\mu} > 0$, we set $c = 0$ and $\Gamma = L$ and obtain:

$$\mathbb{E}\phi(x_{N+1}) - \phi(x^*) \le \frac{4\tau L V(x^*,x_0)}{N^2} + \frac{4\tau(\sigma^2+M^2)}{\mu N}. \quad (8)$$

The dominating term in Eq.(8), $O(\frac{1}{\mu N})$, is optimal and better than the $O(\frac{\log N}{\mu N})$ rate of previous dual averaging methods. However, ORDA has not achieved the uniformly-optimal rate, which takes the form $O(\frac{\sigma^2+M^2}{\mu N} + \exp(-\sqrt{\frac{\mu}{L}}\,N))$. In particular, for deterministic smooth and strongly convex $f(x)$ (i.e., an empirical loss with $\sigma = M = 0$), ORDA only achieves the rate $O(\frac{L}{N^2})$, while the optimal deterministic rate should be $O(\exp(-\sqrt{\frac{\mu}{L}}\,N))$ [16]. Inspired by the multi-restart technique in [7, 11], we present a multi-stage extension of ORDA in Section 4 which achieves the uniformly-optimal convergence rate.

3.2 High Probability Bounds

For stochastic optimization problems, another important evaluation criterion is the confidence level of the objective value. In particular, it is of great interest to find $\epsilon(N,\delta)$, a function monotonically decreasing in both $N$ and $\delta \in (0,1)$, such that the solution $x_{N+1}$ satisfies $\Pr(\phi(x_{N+1}) - \phi(x^*) \ge \epsilon(N,\delta)) \le \delta$. In other words, we want to show that with probability at least $1-\delta$, $\phi(x_{N+1}) - \phi(x^*) < \epsilon(N,\delta)$.
According to the Markov inequality, for any $\epsilon > 0$, $\Pr(\phi(x_{N+1}) - \phi(x^*) \ge \epsilon) \le \frac{\mathbb{E}(\phi(x_{N+1})-\phi(x^*))}{\epsilon}$. Therefore, we may take $\epsilon(N,\delta) = \frac{\mathbb{E}\phi(x_{N+1})-\phi(x^*)}{\delta}$. Under the basic assumption in Eq.(3), namely $\mathbb{E}_\xi(\|G(x,\xi) - f'(x)\|_*^2) \le \sigma^2$, and according to Corollaries 1 and 2, $\epsilon(N,\delta) = O\big(\frac{(\sigma+M)\sqrt{V(x^*,x_0)}}{\sqrt{N}\,\delta}\big)$ for convex $f(x)$, and $\epsilon(N,\delta) = O\big(\frac{\sigma^2+M^2}{\mu N\delta}\big)$ for strongly convex $f(x)$.

However, the above bounds are quite loose. To obtain tighter bounds, we strengthen the basic assumption on the stochastic subgradient in Eq.(3) to the "light-tail" assumption [14]. In particular, we assume that $\mathbb{E}\big(\exp\{\|G(x,\xi) - f'(x)\|_*^2/\sigma^2\}\big) \le \exp\{1\}$ for all $x \in \mathcal{X}$. By further making a boundedness assumption ($\|x^* - \hat{z}_t\| \le D$) and utilizing a technical lemma from [3], we obtain a much tighter high probability bound with $\epsilon(N,\delta) = O\big(\frac{\sqrt{\ln(1/\delta)}\,D\sigma}{\sqrt{N}}\big)$ for both convex and strongly convex $f(x)$. The details are presented in the Appendix.

4 Multi-stage ORDA for Stochastic Strongly Convex Optimization

As we showed in Section 3.1, for convex $f(x)$, ORDA achieves the uniformly-optimal rate. However, for strongly convex $f(x)$, although the dominating term of the convergence rate in Eq.(8) is optimal, the overall rate is not uniformly-optimal. Inspired by the multi-stage stochastic approximation methods [7, 9, 11], we propose the multi-stage extension of ORDA in Algorithm 2 for stochastic strongly convex optimization.
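The multi-stage idea — restart ORDA with a doubled iteration budget and an enlarged step-size constant at each stage, warm-starting from the previous stage's output — can be sketched as a thin wrapper. This is our simplified illustration; the exact stage sizes $N_k$ and constants $\Lambda_k$ of Algorithm 2 are replaced by their growth orders:

```python
def multi_stage_orda(orda, x0, n_stages, n1, L=1.0):
    """Restart sketch of multi-stage ORDA with simplified constants: stage k runs
    the ORDA subroutine (with c = 0) for N_k iterations using Gamma = Lambda_k + L,
    warm-started at the previous stage's output. N_k doubles per stage and
    Lambda_k grows like N_k^{3/2} (all absolute constants dropped)."""
    x, n_k = x0, n1
    for _ in range(n_stages):
        lambda_k = n_k ** 1.5             # Lambda_k ~ N_k^{3/2}
        x = orda(x, n_iters=n_k, Gamma=lambda_k + L)
        n_k *= 2                          # double the iteration budget
    return x
```

Any solver with the signature `orda(x, n_iters, Gamma)` can be plugged in as the subroutine, which is exactly how Algorithm 2 treats Algorithm 1.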
For each stage $1 \le k \le K$, we run ORDA in Algorithm 1 as a subroutine for $N_k$ iterations with the parameter $\gamma_t = c(t+1)^{3/2} + \tau\Gamma$, where $c = 0$ and $\Gamma = \Lambda_k + L$. Roughly speaking, we set $N_k = 2N_{k-1}$ and $\Lambda_k = 4\Lambda_{k-1}$. In other words, we double the number of iterations for the next stage but reduce the step-size. The multi-stage ORDA achieves the uniformly-optimal convergence rate, as shown in Theorem 2 with the proof in the Appendix. The proof technique follows the one in [11]. Due to this specialized proof technique, instead of showing $\mathbb{E}(\phi(x_N)) - \phi(x^*) \le \epsilon(N)$ as for ORDA, we show the number of iterations $N(\epsilon)$ needed to achieve an $\epsilon$-accurate solution: $\mathbb{E}(\phi(x_{N(\epsilon)})) - \phi(x^*) \le \epsilon$. The two forms of the convergence rate are equivalent.

Algorithm 2 Multi-stage ORDA for Stochastic Strongly Convex Optimization

Initialization: $x_0 \in \mathcal{X}$, a constant $V_0 \ge \phi(x_0) - \phi(x^*)$ and the number of stages $K$.
Iterate for $k = 1, 2, \ldots, K$:
  1. Set $N_k = \max\big\{4\sqrt{\frac{\tau L}{\mu}},\ \frac{2^{k+9}\tau(\sigma^2+M^2)}{\mu V_0}\big\}$
  2. Set $\Lambda_k = N_k^{3/2}\sqrt{\frac{2^{k-1}\mu(\sigma^2+M^2)}{\tau V_0}}$
  3. Generate $\tilde{x}_k$ by calling the subroutine ORDA($\tilde{x}_{k-1}$, $N_k$, $\Gamma = \Lambda_k + L$, $c = 0$)
Output: $\tilde{x}_K$

Theorem 2 If we run multi-stage ORDA for $K$ stages with $K = \log_2\big(\frac{V_0}{\epsilon}\big)$ for any given $\epsilon$, we have $\mathbb{E}(\phi(\tilde{x}_K)) - \phi(x^*) \le \epsilon$, and the total number of iterations is upper bounded by:

$$N = \sum_{k=1}^K N_k \le 4\sqrt{\frac{\tau L}{\mu}}\log_2\Big(\frac{V_0}{\epsilon}\Big) + \frac{1024\,\tau(\sigma^2+M^2)}{\mu\epsilon}. \quad (9)$$

5 Related Works

In the last few years, a number of stochastic gradient methods [6, 5, 8, 12, 14, 21, 10, 11, 7, 4, 3] have been developed to solve Eq.(1), especially for a sparsity-inducing $h(x)$. In Table 1, we compare the proposed ORDA and its multi-stage extension with some widely used stochastic gradient methods using the following metrics. For ease of comparison, we assume $f(x)$ is smooth with $M = 0$.

1. The convergence rate for solving (non-strongly) convex $f(x)$ and whether this rate achieves the uniformly-optimal (Uni-opt) rate.

2. The convergence rate for solving strongly convex $f(x)$ and whether (1) the dominating term of the rate is optimal, i.e., $O(\frac{\sigma^2}{\tilde{\mu}N})$, and (2) the overall rate is uniformly-optimal.

3. Whether the final solution $\hat{x}$, on which the convergence results are built, is generated from a weighted average of previous iterates (Avg) or from the proximal mapping (Prox). For sparsity-inducing regularizers, the solution obtained directly from the proximal mapping is often sparser than the averaged solution.

4. Whether an algorithm allows a general Bregman divergence in the proximal mapping or only the Euclidean distance $V(x,y) = \frac{1}{2}\|x-y\|_2^2$.

In Table 1, the algorithms in the first 7 rows are stochastic approximation algorithms, where only the current stochastic gradient is used at each iteration. The last 4 rows are dual averaging methods, where all past subgradients are used. Some algorithms in Table 1 make a more restrictive assumption on the stochastic gradient: $\exists G > 0$ such that $\mathbb{E}\|G(x,\xi)\|_*^2 \le G^2$ for all $x \in \mathcal{X}$. It is easy to verify that this assumption implies our basic assumption in Eq.(3) by Jensen's inequality.

As we can see from Table 1, the proposed ORDA possesses all the good properties except that its convergence rate for strongly convex $f(x)$ is not uniformly-optimal; multi-stage ORDA further improves this rate to be uniformly-optimal. In particular, SAGE [8] achieves only a nearly optimal rate for convex $f(x)$, since the parameter $D$ in its convergence rate is chosen such that $\mathbb{E}(\|x_t - x^*\|_2^2) \le D$ for all $t \ge 0$, and it could be much larger than $V \equiv V(x^*,x_0)$. In addition, SAGE requires the boundedness of the domain $\mathcal{X}$, the smoothness of $f(x)$, and only allows the Euclidean distance in the proximal mapping. As compared to AC-SA [10] and multi-stage AC-SA [11], our methods do not require the final averaging step; and as shown in our experiments, ORDA has better empirical performance due to the usage of all past stochastic subgradients. Furthermore, we improve the rates of RDA and extend AC-RDA to an optimal algorithm for both convex and strongly convex $f(x)$. Another highly relevant work is [9]: Juditsky et al. [9] proposed multi-stage algorithms to achieve the optimal strongly convex rate based on non-accelerated dual averaging methods.
However, the algorithms in [9] assume that $\phi(x)$ is a Lipschitz continuous function, i.e., that the subgradient of $\phi(x)$ is bounded. Therefore, when the domain $\mathcal{X}$ is unbounded, the algorithms in [9] cannot be directly applied.

| Algorithm | Convex $f(x)$: Rate | Uni-opt | Strongly Convex $f(x)$: Rate | Opt | Uni-opt | Final $\hat{x}$ | Bregman |
|---|---|---|---|---|---|---|---|
| FOBOS [6] | $O(\frac{G\sqrt{V}}{\sqrt{N}})$ | NO | $O(\frac{G^2\log N}{\tilde{\mu}N})$ | NO | NO | Prox | NO |
| COMID [5] | $O(\frac{G\sqrt{V}}{\sqrt{N}})$ | NO | $O(\frac{G^2\log N}{\tilde{\mu}N})$ | NO | NO | Prox | YES |
| SAGE [8] | $O(\frac{\sigma\sqrt{D}}{\sqrt{N}} + \frac{LD}{N^2})$ | NEARLY | $O(\frac{\sigma^2}{\tilde{\mu}N} + \frac{LD}{N^2})$ | YES | NO | Prox | NO |
| AC-SA [10] | $O(\frac{\sigma\sqrt{V}}{\sqrt{N}} + \frac{LV}{N^2})$ | YES | $O(\frac{\sigma^2}{\tilde{\mu}N} + \frac{LV}{N^2})$ | YES | NO | Avg | YES |
| M-AC-SA [11] | NA | NA | $O(\frac{\sigma^2}{\tilde{\mu}N} + \exp\{-\sqrt{\frac{\tilde{\mu}}{L}}N\})$ | YES | YES | Avg | YES |
| Epoch-GD [7] | NA | NA | $O(\frac{G^2}{\tilde{\mu}N})$ | YES | NO | Avg | NO |
| RDA [21] | $O(\frac{G\sqrt{V}}{\sqrt{N}})$ | NO | $O(\frac{G^2\log N}{\tilde{\mu}N})$ | NO | NO | Avg | YES |
| AC-RDA [21] | $O(\frac{\sigma\sqrt{V}}{\sqrt{N}} + \frac{LV}{N^2})$ | YES | NA | NA | NA | Avg | YES |
| ORDA | $O(\frac{\sigma\sqrt{V}}{\sqrt{N}} + \frac{LV}{N^2})$ | YES | $O(\frac{\sigma^2}{\tilde{\mu}N} + \frac{LV}{N^2})$ | YES | NO | Prox | YES |
| M-ORDA | NA | NA | $O(\frac{\sigma^2}{\tilde{\mu}N} + \exp\{-\sqrt{\frac{\tilde{\mu}}{L}}N\})$ | YES | YES | Prox | YES |

Table 1: Summary of different stochastic gradient algorithms. $V$ is short for $V(x^*,x_0)$; AC for "accelerated"; M for "multi-stage"; and NA stands for either "not applicable" or "no analysis of the rate".

Recently, the paper [18] developed another stochastic gradient method which achieves the rate $O(\frac{G^2}{\tilde{\mu}N})$ for strongly convex $f(x)$. However, for non-smooth $f(x)$, it requires averaging of the last few iterates, and this rate is not uniformly-optimal.

6 Simulated Experiments

In this section, we conduct simulated experiments to demonstrate the performance of ORDA and its multi-stage extension (M-ORDA). We compare ORDA and M-ORDA (the latter only for strongly convex loss) with several state-of-the-art stochastic gradient methods, including RDA and AC-RDA [21], AC-SA [10], FOBOS [6] and SAGE [8]. For a fair comparison, we compare all methods using the solutions that have expected convergence guarantees. For all algorithms, we tune the parameter related to the step-size (e.g., $c$ in ORDA for convex loss) within an appropriate range and choose the one that leads to the minimum objective value.

In this experiment, we solve a sparse linear regression problem: $\min_{x\in\mathbb{R}^n} f(x) + h(x)$, where $f(x) = \mathbb{E}_{a,b}((a^T x - b)^2) + \frac{\rho}{2}\|x\|_2^2$ and $h(x) = \lambda\|x\|_1$. The input vector $a$ is generated from $N(0, I_{n\times n})$, and the response $b = a^T x^* + \varepsilon$, where $x^*_i = 1$ for $1 \le i \le n/2$ and $0$ otherwise, and the noise $\varepsilon \sim N(0,1)$. When $\rho = 0$, the problem is the well-known Lasso [19], and when $\rho > 0$, it is known as the Elastic-net [22]. The regularization parameter $\lambda$ is tuned so that a deterministic solver on all the samples can correctly recover the underlying sparsity pattern. We set $n = 100$ and create a large pool of samples for generating stochastic gradients and evaluating objective values. The number of iterations $N$ is set to 500.
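The synthetic setup above can be sketched as follows (our illustration; the sample pool size below is an assumption, since the paper only says "a large pool"):

```python
import numpy as np

def make_sparse_regression(n=100, n_samples=5000, seed=0):
    """Synthetic Lasso/Elastic-net data: a ~ N(0, I_n), b = a^T x* + eps,
    with x*_i = 1 for the first n/2 coordinates and 0 otherwise."""
    rng = np.random.default_rng(seed)
    x_star = np.zeros(n)
    x_star[: n // 2] = 1.0
    A = rng.normal(size=(n_samples, n))          # rows are input vectors a
    b = A @ x_star + rng.normal(size=n_samples)  # noise eps ~ N(0, 1)
    return A, b, x_star

def objective(x, A, b, lam, rho=0.0):
    """Empirical phi(x) = mean((a^T x - b)^2) + (rho/2)*||x||_2^2 + lam*||x||_1;
    rho = 0 gives the Lasso, rho > 0 the Elastic-net."""
    return (np.mean((A @ x - b) ** 2)
            + 0.5 * rho * np.sum(x ** 2)
            + lam * np.abs(x).sum())
```

With $\rho > 0$ the loss is strongly convex with $\tilde{\mu} = \rho$, which is exactly the switch used below to move between the convex and strongly convex regimes.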
Since we focus on stochastic optimization instead of online learning, we can randomly draw samples from an underlying distribution. We therefore construct the stochastic gradient using the mini-batch strategy [2, 1] with a batch size of 50. We run each algorithm 100 times and report the mean of the objective value and of the F1-score measuring sparsity recovery performance. The F1-score is defined as 2·precision·recall/(precision + recall), where precision = Σ_{i=1}^n 1{x̂_i ≠ 0, x*_i ≠ 0} / Σ_{i=1}^n 1{x̂_i ≠ 0} and recall = Σ_{i=1}^n 1{x̂_i ≠ 0, x*_i ≠ 0} / Σ_{i=1}^n 1{x*_i ≠ 0}. The higher the F1-score, the better the recovery of the sparsity pattern. The standard deviations of both the objective value and the F1-score over the 100 runs are very small and are therefore omitted due to space limitations.

            ρ = 0          ρ = 1
           Obj    F1      Obj    F1
RDA       20.87  0.67    21.57  0.67
AC-RDA    20.67  0.67    21.12  0.67
AC-SA     20.66  0.67    21.01  0.67
FOBOS     20.98  0.83    21.19  0.84
SAGE      20.65  0.82    21.09  0.73
ORDA      20.56  0.92    20.97  0.87
M_ORDA    N.A.   N.A.    20.98  0.88

Table 2: Comparisons in objective value and F1-score.

Figure 1: Obj for Lasso. Figure 2: Obj for Elastic-Net.

We first set ρ = 0 to test the algorithms on (non-strongly) convex f(x). The results are presented in Table 2 (the first two columns). We also plot the decrease of the objective values over the first 200 iterations in Figure 1. From Table 2, ORDA performs the best in both objective value and recovery of the sparsity pattern. The optimal algorithms (e.g., AC-RDA, AC-SA, SAGE, ORDA) achieve lower final objective values, and their objective values also decrease faster. We note that for dual averaging methods, the solution generated from the (first) proximal mapping (e.g., z_t in ORDA) has almost perfect sparsity recovery performance. 
However, since there is no convergence guarantee for that solution, we do not report its results here.

We then set ρ = 1 to test the algorithms on strongly convex f(x). The results are presented in Table 2 (the last two columns) and in Figures 2 and 3. As we can see from Table 2, ORDA and M_ORDA perform the best. Although M_ORDA achieves the theoretically uniformly-optimal convergence rate, its empirical performance is almost identical to that of ORDA. This observation is consistent with our theoretical analysis, since the improvement in the convergence rate appears only in the non-dominating term. In addition, ORDA, M_ORDA, AC-SA and SAGE, with the convergence rate O(1/(μ̃N)), achieve lower objective values than the other algorithms, whose rate is O(log N/(μ̃N)). For better visualization, we do not include the comparison between M_ORDA and ORDA in Figure 2; instead, we present it separately in Figure 3. From Figure 3, the final objective values of both algorithms are very close. An interesting observation is that, for M_ORDA, each time a new stage starts, there is a sharp increase in the objective value followed by a quick drop.

Figure 3: ORDA vs. M_ORDA.

7 Conclusions and Future Works

In this paper, we propose a new dual averaging method which achieves the optimal rates for solving stochastic regularized problems with both convex and strongly convex loss functions. We further propose a multi-stage extension that achieves the uniformly-optimal convergence rate for strongly convex loss.

Although we study stochastic optimization problems in this paper, our algorithms can easily be converted into online optimization approaches, where a sequence of decisions {x_t}_{t=1}^N is generated according to Algorithm 1 or 2. 
We often measure the quality of an online learning algorithm via the so-called regret, defined as R_N(x*) = Σ_{t=1}^N ((F(x_t, ξ_t) + h(x_t)) − (F(x*, ξ_t) + h(x*))). Given the expected convergence rates in Corollaries 1 and 2, the expected regret can be easily derived. For example, for strongly convex f(x): E R_N(x*) ≤ Σ_{t=1}^N (E(φ(x_t)) − φ(x*)) ≤ Σ_{t=1}^N O(1/t) = O(ln N). However, it would be a challenging future work to derive a regret bound for ORDA rather than a bound on the expected regret. It would also be interesting to develop parallel extensions of ORDA (e.g., combining the distributed mini-batch strategy in [2] with ORDA) and to apply them to large-scale real problems.

References

[1] A. Cotter, O. Shamir, N. Srebro, and K. Sridharan. Better mini-batch algorithms via accelerated gradient methods. In Advances in Neural Information Processing Systems (NIPS), 2011.

[2] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Technical report, Microsoft Research, 2011.

[3] J. Duchi, P. L. Bartlett, and M. Wainwright. Randomized smoothing for stochastic optimization. arXiv:1103.4296v1, 2011.

[4] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. In Conference on Learning Theory (COLT), 2010.

[5] J. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari. Composite objective mirror descent. In Conference on Learning Theory (COLT), 2010.

[6] J. Duchi and Y. Singer. 
Efficient online and batch learning using forward-backward splitting. Journal of Machine Learning Research, 10:2873–2898, 2009.

[7] E. Hazan and S. Kale. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Conference on Learning Theory (COLT), 2011.

[8] C. Hu, J. T. Kwok, and W. Pan. Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems (NIPS), 2009.

[9] A. Juditsky and Y. Nesterov. Primal-dual subgradient methods for minimizing uniformly convex functions. August 2010.

[10] G. Lan and S. Ghadimi. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, part I: a generic algorithmic framework. Technical report, University of Florida, 2010.

[11] G. Lan and S. Ghadimi. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, part II: shrinking procedures and optimal algorithms. Technical report, University of Florida, 2010.

[12] J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777–801, 2009.

[13] S. Lee and S. J. Wright. Manifold identification of dual averaging methods for regularized stochastic online learning. In International Conference on Machine Learning (ICML), 2011.

[14] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

[15] A. Nemirovski and D. Yudin. Problem complexity and method efficiency in optimization. John Wiley, New York, 1983.

[16] Y. Nesterov. Introductory lectures on convex optimization: a basic course. Kluwer Academic Publishers, 2003.

[17] Y. Nesterov. Primal-dual subgradient methods for convex problems. 
Mathematical Programming, 120:221–259, 2009.

[18] A. Rakhlin, O. Shamir, and K. Sridharan. To average or not to average? Making stochastic gradient descent optimal for strongly convex problems. In International Conference on Machine Learning (ICML), 2012.

[19] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.

[20] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. SIAM Journal on Optimization (submitted), 2008.

[21] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.

[22] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2):301–320, 2005.