{"title": "Accelerated Gradient Methods for Stochastic Optimization and Online Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 781, "page_last": 789, "abstract": "Regularized risk minimization often involves non-smooth optimization, either because of the loss function (e.g., hinge loss) or the regularizer (e.g., $\\ell_1$-regularizer). Gradient descent methods, though highly scalable and easy to implement, are known to converge slowly on these problems. In this paper, we develop novel accelerated gradient methods for stochastic optimization while still preserving their computational simplicity and scalability. The proposed algorithm, called SAGE (Stochastic Accelerated GradiEnt), exhibits fast convergence rates on stochastic optimization with both convex and strongly convex objectives. Experimental results show that SAGE is faster than recent (sub)gradient methods including FOLOS, SMIDAS and SCD. Moreover, SAGE can also be extended for online learning, resulting in a simple but powerful algorithm.", "full_text": "Accelerated Gradient Methods for\n\nStochastic Optimization and Online Learning\n\nChonghai Hu\u266f\u2020, James T. Kwok\u266f, Weike Pan\u266f\n\u266f Department of Computer Science and Engineering\nHong Kong University of Science and Technology\n\nClear Water Bay, Kowloon, Hong Kong\n\n\u2020 Department of Mathematics, Zhejiang University\n\nHangzhou, China\n\nhino.hu@gmail.com, {jamesk,weikep}@cse.ust.hk\n\nAbstract\n\nRegularized risk minimization often involves non-smooth optimization, either be-\ncause of the loss function (e.g., hinge loss) or the regularizer (e.g., \u21131-regularizer).\nGradient methods, though highly scalable and easy to implement, are known to\nconverge slowly. In this paper, we develop a novel accelerated gradient method\nfor stochastic optimization while still preserving their computational simplicity\nand scalability. 
The proposed algorithm, called SAGE (Stochastic Accelerated GradiEnt), exhibits fast convergence rates on stochastic composite optimization with convex or strongly convex objectives. Experimental results show that SAGE is faster than recent (sub)gradient methods including FOLOS, SMIDAS and SCD. Moreover, SAGE can also be extended for online learning, resulting in a simple algorithm but with the best regret bounds currently known for these problems.

1 Introduction

Risk minimization is at the heart of many machine learning algorithms. Given a class of models parameterized by w and a loss function ℓ(·,·), the goal is to minimize E_XY[ℓ(w; X, Y)] w.r.t. w, where the expectation is over the joint distribution of input X and output Y. However, since the joint distribution is typically unknown in practice, a surrogate problem is to replace the expectation by its empirical average on a training sample {(x_1, y_1), . . . , (x_m, y_m)}. Moreover, a regularizer Ω(·) is often added for well-posedness. This leads to the minimization of the regularized risk

min_w (1/m) Σ_{i=1}^m ℓ(w; x_i, y_i) + λΩ(w),   (1)

where λ is a regularization parameter. In optimization terminology, the deterministic optimization problem in (1) can be considered as a sample average approximation (SAA) of the corresponding stochastic optimization problem:

min_w E_XY[ℓ(w; X, Y)] + λΩ(w).   (2)

Since both ℓ(·,·) and Ω(·) are typically convex, (1) is a convex optimization problem which can be conveniently solved even with standard off-the-shelf optimization packages.

However, with the proliferation of data-intensive applications in the text and web domains, data sets with millions or trillions of samples are nowadays not uncommon. Hence, off-the-shelf optimization solvers are too slow to be used. 
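As a concrete instance of the regularized risk (1), the sketch below evaluates the average square loss plus an ℓ1 regularizer (the same choices used later in the experiments). The data and function names are our own illustration, not code from the paper.

```python
import numpy as np

def regularized_risk(w, X, y, lam):
    """Empirical risk (1): average square loss plus an l1 regularizer.

    Illustrative sketch only; the square loss and l1 penalty mirror the
    experimental setup, and all names here are our own.
    """
    residual = X @ w - y
    loss = 0.5 * np.mean(residual ** 2)      # (1/m) * sum of per-sample losses
    return loss + lam * np.sum(np.abs(w))    # + lambda * Omega(w)

# Tiny usage example on a 2-sample, 2-feature problem.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
w = np.zeros(2)
risk = regularized_risk(w, X, y, lam=1e-6)
```

Minimizing this quantity over w is exactly the SAA problem (1) that the stochastic methods below attack by sampling one (or a few) of the m loss terms per iteration.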
Indeed, even tailor-made software for specific models, such as the sequential minimal optimization (SMO) method for the SVM, has superlinear computational complexity and thus is not feasible for large data sets. In light of this, the use of stochastic methods has recently drawn a lot of interest and many of these are highly successful. Most are based on (variants of) stochastic gradient descent (SGD). Examples include Pegasos [1], SGD-QN [2], FOLOS [3], and stochastic coordinate descent (SCD) [4]. The main advantages of these methods are that they are simple to implement, have low per-iteration complexity, and can scale up to large data sets. Their runtime is independent of, or even decreases with, the number of training samples [5, 6]. On the other hand, because of their simplicity, these methods have a slow convergence rate, and thus may require a large number of iterations.

While standard gradient schemes have a slow convergence rate, they can often be "accelerated". This stems from the pioneering work of Nesterov in 1983 [7], which is a deterministic algorithm for smooth optimization. Recently, it has also been extended for composite optimization, where the objective has a smooth component and a non-smooth component [8, 9]. This is particularly relevant to machine learning since the loss ℓ and regularizer Ω in (2) may be non-smooth. Examples include loss functions such as the hinge loss commonly used in the SVM, and regularizers such as the popular ℓ1 penalty in the Lasso [10] and basis pursuit. These accelerated gradient methods have also been successfully applied in the optimization problems of multiple kernel learning [11] and trace norm minimization [12]. 
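To illustrate the kind of acceleration referred to above, here is a minimal sketch of the classical deterministic Nesterov-style scheme for a smooth objective, in the FISTA form of [9] with no non-smooth part. The quadratic test problem and all names are our own, and this is the textbook deterministic method, not the stochastic algorithm developed in this paper.

```python
import numpy as np

def accelerated_gradient(grad, x0, L, n_iter):
    """Nesterov-style accelerated gradient descent for smooth f.

    Sketch under our own naming; `grad` is the gradient of f and L its
    Lipschitz constant. Achieves the O(1/k^2) rate of [7, 9].
    """
    x, y_prev, t = x0.copy(), x0.copy(), 1.0
    for _ in range(n_iter):
        y = x - grad(x) / L                            # gradient step
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        x = y + ((t - 1.0) / t_next) * (y - y_prev)    # momentum extrapolation
        y_prev, t = y, t_next
    return y_prev

# Toy quadratic f(x) = 0.5 x^T A x - b^T x; grad f(x) = A x - b, L = 2.
A = np.array([[2.0, 0.0], [0.0, 0.5]])
b = np.array([2.0, 1.0])
sol = accelerated_gradient(lambda x: A @ x - b, np.zeros(2), L=2.0, n_iter=300)
```

The momentum term is what lifts the O(1/k) rate of plain gradient descent to O(1/k²); the accelerated methods discussed in this paper carry the same extrapolation idea over to stochastic composite objectives.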
Very recently, Lan [13] made an initial attempt to further extend this for stochastic composite optimization, and obtained the convergence rate

O(L/N² + (M + σ)/√N).   (3)

Here, N is the number of iterations performed by the algorithm, L is the Lipschitz parameter of the gradient of the smooth term in the objective, M is the Lipschitz parameter of the nonsmooth term, and σ is the variance of the stochastic subgradient. Moreover, note that the first term of (3) is related to the smooth component in the objective while the second term is related to the non-smooth component. Complexity results [14, 13] show that (3) is the optimal convergence rate for any iterative algorithm solving stochastic (general) convex composite optimization.

However, as pointed out in [15], a very useful property that can improve the convergence rates in machine learning optimization problems is strong convexity. For example, (2) can be strongly convex either because of the strong convexity of ℓ (e.g., log loss, square loss) or Ω (e.g., ℓ2 regularization). On the other hand, [13] is more interested in general convex optimization problems and so strong convexity is not utilized. Moreover, though theoretically interesting, [13] may be of limited practical use as (1) the stepsize in its update rule depends on the often unknown σ; and (2) the number of iterations performed by the algorithm has to be fixed in advance.

Inspired by the successes of Nesterov's method, we develop in this paper a novel accelerated subgradient scheme for stochastic composite optimization. It achieves the optimal convergence rate of O(L/N² + σ/√N) for general convex objectives, and O((L + µ)/N² + σµ⁻¹/N) for µ-strongly convex objectives. Moreover, its per-iteration complexity is almost as low as that for standard (sub)gradient methods. Finally, we also extend the accelerated gradient scheme to online learning. We obtain O(√N) regret for general convex problems and O(log N) regret for strongly convex problems, which are the best regret bounds currently known for these problems.

2 Setting and Mathematical Background

First, we recapitulate a few notions in convex analysis.

(Lipschitz continuity) A function f(x) is L-Lipschitz if ‖f(x) − f(y)‖ ≤ L‖x − y‖.

Lemma 1. [14] The gradient of a differentiable function f(x) is Lipschitz continuous with Lipschitz parameter L if, for any x and y,

f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖x − y‖².   (4)

(Strong convexity) A function φ(x) is µ-strongly convex if φ(y) ≥ φ(x) + ⟨g(x), y − x⟩ + (µ/2)‖y − x‖² for any x, y and subgradient g(x) ∈ ∂φ(x).

Lemma 2. [14] Let φ(x) be µ-strongly convex and x* = arg min_x φ(x). Then, for any x,

φ(x) ≥ φ(x*) + (µ/2)‖x − x*‖².   (5)

We consider the following stochastic convex optimization problem, with a composite objective function

min_x {φ(x) ≡ E[F(x, ξ)] + ψ(x)},   (6)

where ξ is a random vector, f(x) ≡ E[F(x, ξ)] is convex and differentiable, and ψ(x) is convex but non-smooth. Clearly, this includes the optimization problem (2). Moreover, we assume that the gradient of f(x) is L-Lipschitz and φ(x) is µ-strongly convex (with µ ≥ 0). Note that when φ(x) is smooth (ψ(x) = 0), µ lower-bounds the smallest eigenvalue of its Hessian.

Recall that in smooth optimization, the gradient update x_{t+1} = x_t − λ∇f(x_t) on a function f(x) can be seen as proximal regularization of the linearized f at the current iterate x_t [16]. In other words, x_{t+1} = arg min_x (⟨∇f(x_t), x − x_t⟩ + (1/(2λ))‖x − x_t‖²). 
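This equivalence between the gradient step and the minimizer of the linearized-plus-proximal model can be checked numerically; the toy numbers below are our own.

```python
import numpy as np

# Numerical check: the gradient step x_t - lam * grad minimizes the model
#   <grad, x - x_t> + (1 / (2 * lam)) * ||x - x_t||^2.
# All values here are arbitrary illustrative choices.
x_t = np.array([0.3, -0.7])
grad = np.array([1.0, 2.0])
lam = 0.1

def model(x):
    d = x - x_t
    return grad @ d + (d @ d) / (2.0 * lam)

step = x_t - lam * grad                 # closed-form minimizer of the model
# any small perturbation should not decrease the model value
for d in (np.array([1e-3, 0.0]), np.array([0.0, -1e-3])):
    assert model(step) <= model(step + d)
```

Setting the model's gradient to zero, grad + (x − x_t)/λ = 0, recovers the same closed form, which is why the proximal view and the plain gradient step coincide in the smooth case.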
With the presence of a non-smooth component, we have the following more general notion.

(Gradient mapping) [8] In minimizing f(x) + ψ(x), where f is convex and differentiable and ψ is convex and non-smooth,

x_{t+1} = arg min_x (⟨∇f(x_t), x − x_t⟩ + (1/(2λ))‖x − x_t‖² + ψ(x))   (7)

is called the generalized gradient update, and δ = (1/λ)(x_t − x_{t+1}) is the (generalized) gradient mapping. Note that the quadratic approximation is made to the smooth component only. It can be shown that the gradient mapping is analogous to the gradient in smooth convex optimization [14, 8]. This is also a common construct used in recent stochastic subgradient methods [3, 17].

3 Accelerated Gradient Method for Stochastic Learning

Let G(x_t, ξ_t) ≡ ∇_x F(x, ξ_t)|_{x=x_t} be the stochastic gradient of F(x, ξ_t). We assume that it is an unbiased estimator of the gradient ∇f(x), i.e., E_ξ[G(x, ξ)] = ∇f(x). Algorithm 1 shows the proposed algorithm, which will be called SAGE (Stochastic Accelerated GradiEnt). It involves the updating of three sequences {x_t}, {y_t} and {z_t}. Note that y_t is the generalized gradient update, and x_{t+1} is a convex combination of y_t and z_t. The algorithm also maintains two parameter sequences {α_t} and {L_t}. We will see in Section 3.1 that different settings of these parameters lead to different convergence rates. Note that the only expensive step of Algorithm 1 is the computation of the generalized gradient update y_t, which is analogous to the subgradient computation in other subgradient-based methods. In general, its computational complexity depends on the structure of ψ(x). 
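For instance, when ψ(x) = λ_reg‖x‖₁, the update (7) has a closed-form soft-thresholding solution; the function below is our own sketch of this special case, with our own variable names.

```python
import numpy as np

def generalized_gradient_update(x_t, grad, lam_step, lam_reg):
    """Generalized gradient update (7) for psi(x) = lam_reg * ||x||_1.

    argmin_x <grad, x - x_t> + (1 / (2 * lam_step)) ||x - x_t||^2 + psi(x)
    reduces to soft-thresholding a plain gradient step. Sketch only.
    """
    v = x_t - lam_step * grad                       # plain gradient step
    shrink = lam_step * lam_reg
    return np.sign(v) * np.maximum(np.abs(v) - shrink, 0.0)

x_next = generalized_gradient_update(
    x_t=np.array([1.0, -0.2, 0.0]),
    grad=np.zeros(3),
    lam_step=0.5, lam_reg=0.4)
# components with |v| below the threshold are zeroed, others shrunk toward 0
```

Because small coordinates are set exactly to zero, this update is also what lets the method produce sparse iterates under a sparsity-promoting ψ.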
As will be seen in Section 3.3, this can often be efficiently obtained in many regularized risk minimization problems.

Algorithm 1 SAGE (Stochastic Accelerated GradiEnt).
  Input: Sequences {L_t} and {α_t}.
  Initialize: y_{−1} = z_{−1} = 0, α_0 = λ_0 = 1, L_0 = L + µ.
  for t = 0 to N do
    x_t = (1 − α_t) y_{t−1} + α_t z_{t−1}.
    y_t = arg min_x {⟨G(x_t, ξ_t), x − x_t⟩ + (L_t/2)‖x − x_t‖² + ψ(x)}.
    z_t = z_{t−1} − (L_t α_t + µ)^{−1} [L_t(x_t − y_t) + µ(z_{t−1} − x_t)].
  end for
  Output y_N.

3.1 Convergence Analysis

Define Δ_t ≡ G(x_t, ξ_t) − ∇f(x_t). Because of the unbiasedness of G(x_t, ξ_t), E_{ξ_t}[Δ_t] = 0. In the following, we will show that the value of φ(y_t) − φ(x) can be related to that of φ(y_{t−1}) − φ(x) for any x. Let δ_t ≡ L_t(x_t − y_t) be the gradient mapping involved in updating y_t. First, we introduce the following lemma.

Lemma 3. For t ≥ 0, φ(x) is quadratically bounded from below as

φ(x) ≥ φ(y_t) + ⟨δ_t, x − x_t⟩ + (µ/2)‖x − x_t‖² + ⟨Δ_t, y_t − x⟩ + ((2L_t − L)/(2L_t²))‖δ_t‖².

Proposition 1. Assume that for each t ≥ 0, ‖Δ_t‖* ≤ σ and L_t > L. Then

φ(y_t) − φ(x) + ((L_t α_t² + µα_t)/2)‖x − z_t‖²
  ≤ (1 − α_t)[φ(y_{t−1}) − φ(x)] + (L_t α_t²/2)‖x − z_{t−1}‖² + σ²/(2(L_t − L)) + α_t⟨Δ_t, x − z_{t−1}⟩.   (8)

Proof. Define V_t(x) = ⟨δ_t, x − x_t⟩ + (µ/2)‖x − x_t‖² + (L_t α_t/2)‖x − z_{t−1}‖². It is easy to see that z_t = arg min_{x∈R^d} V_t(x). Moreover, notice that V_t(x) is (L_t α_t + µ)-strongly convex. 
Hence, on applying Lemmas 2 and 3, we obtain that for any x,

V_t(z_t) ≤ V_t(x) − ((L_t α_t + µ)/2)‖x − z_t‖²
  = ⟨δ_t, x − x_t⟩ + (µ/2)‖x − x_t‖² + (L_t α_t/2)‖x − z_{t−1}‖² − ((L_t α_t + µ)/2)‖x − z_t‖²
  ≤ φ(x) − φ(y_t) − ((2L_t − L)/(2L_t²))‖δ_t‖² + (L_t α_t/2)‖x − z_{t−1}‖² − ((L_t α_t + µ)/2)‖x − z_t‖² + ⟨Δ_t, x − y_t⟩.

Then, φ(y_t) can be bounded from above, as:

φ(y_t) ≤ φ(x) + ⟨δ_t, x_t − z_t⟩ − (L_t α_t/2)‖z_t − z_{t−1}‖² − ((2L_t − L)/(2L_t²))‖δ_t‖²
  + (L_t α_t/2)‖x − z_{t−1}‖² − ((L_t α_t + µ)/2)‖x − z_t‖² + ⟨Δ_t, x − y_t⟩,   (9)

where the non-positive term −(µ/2)‖z_t − x_t‖² has been dropped from its right-hand-side. On the other hand, by applying Lemma 3 with x = y_{t−1}, we get

φ(y_t) − φ(y_{t−1}) ≤ ⟨δ_t, x_t − y_{t−1}⟩ + ⟨Δ_t, y_{t−1} − y_t⟩ − ((2L_t − L)/(2L_t²))‖δ_t‖²,   (10)

where the non-positive term −(µ/2)‖y_{t−1} − x_t‖² has also been dropped from the right-hand-side. On multiplying (9) by α_t and (10) by 1 − α_t, and then adding them together, we obtain

φ(y_t) − φ(x) ≤ (1 − α_t)[φ(y_{t−1}) − φ(x)] − ((2L_t − L)/(2L_t²))‖δ_t‖² + A + B + C − (L_t α_t²/2)‖z_t − z_{t−1}‖²,   (11)

where A = ⟨δ_t, α_t(x_t − z_t) + (1 − α_t)(x_t − y_{t−1})⟩, B = α_t⟨Δ_t, x − y_t⟩ + (1 − α_t)⟨Δ_t, y_{t−1} − y_t⟩, and C = (L_t α_t²/2)‖x − z_{t−1}‖² − ((L_t α_t² + µα_t)/2)‖x − z_t‖². In the following, we upper-bound A and B. 
First, by using the update rule of x_t in Algorithm 1 and Young's inequality¹, we have

A = ⟨δ_t, α_t(x_t − z_{t−1}) + (1 − α_t)(x_t − y_{t−1})⟩ + α_t⟨δ_t, z_{t−1} − z_t⟩
  = α_t⟨δ_t, z_{t−1} − z_t⟩ ≤ (L_t α_t²/2)‖z_t − z_{t−1}‖² + ‖δ_t‖²/(2L_t).   (12)

On the other hand, B can be bounded as

B = ⟨Δ_t, α_t x + (1 − α_t)y_{t−1} − x_t⟩ + ⟨Δ_t, x_t − y_t⟩ = α_t⟨Δ_t, x − z_{t−1}⟩ + ⟨Δ_t, δ_t⟩/L_t
  ≤ α_t⟨Δ_t, x − z_{t−1}⟩ + σ‖δ_t‖/L_t,   (13)

where the second equality is due to the update rule of x_t, and the last step is from the Cauchy-Schwartz inequality and the boundedness of Δ_t. Hence, plugging (12) and (13) into (11),

φ(y_t) − φ(x) ≤ (1 − α_t)[φ(y_{t−1}) − φ(x)] − ((L_t − L)/(2L_t²))‖δ_t‖² + σ‖δ_t‖/L_t + α_t⟨Δ_t, x − z_{t−1}⟩ + C
  ≤ (1 − α_t)[φ(y_{t−1}) − φ(x)] + σ²/(2(L_t − L)) + α_t⟨Δ_t, x − z_{t−1}⟩ + C,

where the last step is due to the fact that −ax² + bx ≤ b²/(4a) for a, b > 0. On re-arranging terms, we obtain (8).

¹Young's inequality states that ⟨x, y⟩ ≤ ‖x‖²/(2a) + a‖y‖²/2 for any a > 0.

Let the optimal solution in problem (6) be x*. From the update rules in Algorithm 1, we observe that the triplet (x_t, y_{t−1}, z_{t−1}) depends on the random process ξ_{[t−1]} ≡ {ξ_0, . . . , ξ_{t−1}} and hence is also random. Clearly, z_{t−1} and x* are independent of ξ_t. 
Thus,

E_{ξ_{[t]}}⟨Δ_t, x* − z_{t−1}⟩ = E_{ξ_{[t−1]}} E_{ξ_t}[⟨Δ_t, x* − z_{t−1}⟩ | ξ_{[t−1]}] = E_{ξ_{[t−1]}}⟨x* − z_{t−1}, E_{ξ_t}[Δ_t]⟩ = 0,

where the first equality uses E_x[h(x)] = E_y E_x[h(x)|y], and the last equality is from our assumption that the stochastic gradient G(x, ξ) is unbiased. Taking expectations on both sides of (8) with x = x*, we obtain the following corollary, which will be useful in proving the subsequent theorems.

Corollary 1.

E[φ(y_t)] − φ(x*) + ((L_t α_t² + µα_t)/2) E[‖x* − z_t‖²]
  ≤ (1 − α_t)(E[φ(y_{t−1})] − φ(x*)) + (L_t α_t²/2) E[‖x* − z_{t−1}‖²] + σ²/(2(L_t − L)).

So far, the choice of L_t and α_t in Algorithm 1 has been left unspecified. In the following, we will show that with a good choice of L_t and α_t, (the expectation of) φ(y_t) converges rapidly to φ(x*).

Theorem 1. Assume that E[‖x* − z_t‖²] ≤ D² for some D. Set

L_t = b(t + 1)^{3/2} + L,  α_t = 2/(t + 2),   (14)

where b > 0 is a constant. Then the expected error of Algorithm 1 can be bounded as

E[φ(y_N)] − φ(x*) ≤ 3D²L/N² + (3D²b + 5σ²/(3b)) / √N.   (15)

If σ were known, we could set b to the optimal choice of √5σ/(3D), and the bound in (15) would become 3D²L/N² + 2√5σD/√N.

Note that so far φ(x) is only assumed to be convex. As is shown in the following theorem, the convergence rate can be further improved by assuming strong convexity. This also requires another setting of α_t and L_t which is different from that in (14).

Theorem 2. 
Assume the same conditions as in Theorem 1, except that φ(x) is µ-strongly convex. Set

L_t = L + µλ⁻¹_{t−1},  α_t = √(λ_{t−1} + λ²_{t−1}/4) − λ_{t−1}/2,  for t ≥ 1,   (16)

where λ_t ≡ Π_{k=1}^t (1 − α_k) for t ≥ 1 and λ_0 = 1. Then, the expected error of Algorithm 1 can be bounded as

E[φ(y_N)] − φ(x*) ≤ 2(L + µ)D²/N² + 6σ²/(Nµ).   (17)

In comparison, FOLOS only converges as O(log(N)/N) for strongly convex objectives.

3.2 Remarks

As in recent studies on stochastic composite optimization [13], the error bounds in (15) and (17) consist of two terms: a faster term which is related to the smooth component and a slower term related to the non-smooth component. SAGE benefits from using the structure of the problem and accelerates the convergence of the smooth component. On the other hand, many stochastic (sub)gradient-based algorithms like FOLOS do not separate the smooth from the non-smooth part, but simply treat the whole objective as non-smooth. Consequently, convergence of the smooth component is also slowed down to O(1/√N).

As can be seen from (15) and (17), the convergence of SAGE is essentially encumbered by the variance of the stochastic subgradient. Recall that the variance of the average of p i.i.d. random variables is equal to 1/p of the original variance. Hence, as in Pegasos [1], σ can be reduced by estimating the subgradient from a data subset.

Unlike the AC-SA algorithm in [13], the settings of L_t and α_t in (14) do not require knowledge of σ and the number of iterations, both of which can be difficult to estimate in practice. Moreover, with the use of a sparsity-promoting ψ(x), SAGE can produce a sparse solution (as will be experimentally demonstrated in Section 5) while AC-SA cannot. 
This is because in SAGE, the output y_t is obtained from a generalized gradient update. With a sparsity-promoting ψ(x), this reduces to a (soft) thresholding step, and thus ensures a sparse solution. On the other hand, in each iteration of AC-SA, the output is a convex combination of two other variables. Unfortunately, adding two vectors is unlikely to produce a sparse vector.

3.3 Efficient Computation of y_t

The computational efficiency of Algorithm 1 hinges on the efficient computation of y_t. Recall that y_t is just the generalized gradient update, and so is not significantly more expensive than the gradient update in traditional algorithms. Indeed, the generalized gradient update is often a central component in various optimization and machine learning algorithms. In particular, Duchi and Singer [3] showed how this can be efficiently computed with various smooth and non-smooth regularizers, including the ℓ1, ℓ2, ℓ2², ℓ∞, Berhu and matrix norms. Interested readers are referred to [3] for details.

4 Accelerated Gradient Method for Online Learning

In this section, we extend the proposed accelerated gradient scheme for online learning of (2). The algorithm, shown in Algorithm 2, is similar to the stochastic version in Algorithm 1.

Algorithm 2 SAGE-based Online Learning Algorithm.
  Inputs: Sequences {L_t} and {α_t}, where L_t > L and 0 < α_t < 1.
  Initialize: z_1 = y_1.
  loop
    x_t = (1 − α_t) y_{t−1} + α_t z_{t−1}.
    Output y_t = arg min_x {⟨∇f_{t−1}(x_t), x − x_t⟩ + (L_t/2)‖x − x_t‖² + ψ(x)}.
    z_t = z_{t−1} − α_t(L_t + µα_t)^{−1} [L_t(x_t − y_t) + µ(z_{t−1} − x_t)].
  end loop

First, we introduce the following lemma, which plays a similar role as its stochastic counterpart, Lemma 3. Moreover, let δ_t ≡ L_t(x_t − y_t) be the gradient mapping related to the updating of y_t.

Lemma 4. 
For t ≥ 1, φ_{t−1}(x) can be quadratically bounded from below as

φ_{t−1}(x) ≥ φ_{t−1}(y_t) + ⟨δ_t, x − x_t⟩ + (µ/2)‖x − x_t‖² + ((2L_t − L)/(2L_t²))‖δ_t‖².

Proposition 2. For any x and t ≥ 1, assume that there exists a subgradient ĝ(x) ∈ ∂ψ(x) such that ‖∇f_t(x) + ĝ(x)‖* ≤ Q. Then for Algorithm 2,

φ_{t−1}(y_{t−1}) − φ_{t−1}(x) ≤ Q²/(2(1 − α_t)(L_t − L)) + (((1 − α_t²)L_t − α_t(1 − α_t)L)/2)‖y_{t−1} − z_{t−1}‖²
  + (L_t/(2α_t))‖x − z_{t−1}‖² − ((L_t + µα_t)/(2α_t))‖x − z_t‖² − (L_t/2)‖z_t − y_t‖².   (18)

Proof Sketch. Define τ_t = L_t α_t⁻¹. From the update rule of z_t, one can check that

z_t = arg min_x V_t(x) ≡ ⟨δ_t, x − x_t⟩ + (µ/2)‖x − x_t‖² + (τ_t/2)‖x − z_{t−1}‖².

Similar to the analysis in obtaining (9), we can obtain

φ_{t−1}(y_t) − φ_{t−1}(x) ≤ ⟨δ_t, x_t − z_t⟩ − ((2L_t − L)/(2L_t²))‖δ_t‖² − (τ_t/2)‖z_t − z_{t−1}‖²
  + (τ_t/2)‖x − z_{t−1}‖² − ((τ_t + µ)/2)‖x − z_t‖².   (19)

On the other hand,

⟨δ_t, x_t − z_t⟩ − ‖δ_t‖²/(2L_t) = (L_t/2)(‖z_t − x_t‖² − ‖z_t − y_t‖²)
  ≤ (L_t(1 − α_t)/2)‖z_{t−1} − y_{t−1}‖² + (L_t/(2α_t))‖z_t − z_{t−1}‖² − (L_t/2)‖z_t − y_t‖²,   (20)

on using the convexity of ‖·‖². 
Using (20), the inequality (19) becomes

φ_{t−1}(y_t) − φ_{t−1}(x) ≤ (L_t(1 − α_t)/2)‖z_{t−1} − y_{t−1}‖² − (L_t/2)‖z_t − y_t‖² − ((L_t − L)/(2L_t²))‖δ_t‖²
  + (τ_t/2)‖x − z_{t−1}‖² − ((τ_t + µ)/2)‖x − z_t‖².   (21)

On the other hand, by the convexity of φ_{t−1}(x) and Young's inequality, we have

φ_{t−1}(y_{t−1}) − φ_{t−1}(y_t) ≤ ⟨∇f_{t−1}(y_{t−1}) + ĝ_{t−1}(y_{t−1}), y_{t−1} − y_t⟩
  ≤ Q²/(2(1 − α_t)(L_t − L)) + ((1 − α_t)(L_t − L)/2)‖y_{t−1} − y_t‖².   (22)

Moreover, by using the update rule of x_t and the convexity of ‖·‖², we have

‖y_{t−1} − y_t‖² = ‖(y_{t−1} − x_t) + (x_t − y_t)‖² = ‖α_t(y_{t−1} − z_{t−1}) + (x_t − y_t)‖²
  ≤ α_t‖y_{t−1} − z_{t−1}‖² + (1 − α_t)⁻¹‖x_t − y_t‖² = α_t‖y_{t−1} − z_{t−1}‖² + ‖δ_t‖²/((1 − α_t)L_t²).   (23)

On using (23), it follows from (22) that

φ_{t−1}(y_{t−1}) − φ_{t−1}(y_t) ≤ Q²/(2(1 − α_t)(L_t − L)) + (α_t(1 − α_t)(L_t − L)/2)‖y_{t−1} − z_{t−1}‖² + ((L_t − L)/(2L_t²))‖δ_t‖².

Inequality (18) then follows immediately by adding this to (21).

Theorem 3. Assume that µ = 0, and ‖x* − z_t‖ ≤ D for t ≥ 1. Set α_t = a and L_t = aL√(t − 1) + L, where a ∈ (0, 1) is a constant. Then the regret of Algorithm 2 can be bounded as

Σ_{t=1}^N [φ_t(y_t) − φ_t(x*)] ≤ LD²/(2a) + [LD²/2 + Q²/(a(1 − a)L)]√N.

Theorem 4. Assume that µ > 0, and ‖x* − z_t‖ ≤ D for t ≥ 1. Set α_t = a, and L_t = aµt + L + a⁻¹(µ − L)₊, where a ∈ (0, 1) is a constant. 
Then the regret of Algorithm 2 can be bounded as

Σ_{t=1}^N [φ_t(y_t) − φ_t(x*)] ≤ ((2a + a⁻¹)µ/2 + L)D² + Q²/(2a(1 − a)µ) · log(N + 1).

In particular, with a = 1/2, the regret bound reduces to (3µ/2 + L)D² + (2Q²/µ) log(N + 1).

5 Experiments

In this section, we perform experiments on the stochastic optimization of (2). Two data sets are used² (Table 1). The first one is the pcmac data set, which is a subset of the 20-newsgroup data set from [18], while the second one is the RCV1 data set, which is a filtered collection of the Reuters RCV1 from [19]. We choose the square loss for ℓ(·,·) and the ℓ1 regularizer for Ω(·) in (2). As discussed in Section 3.3 and [3], the generalized gradient update can be efficiently computed by soft thresholding in this case. Moreover, we do not use strong convexity and so µ = 0.

We compare the proposed SAGE algorithm (with L_t and α_t in (14)) with three recent algorithms: (1) FOLOS [3]; (2) SMIDAS [4]; and (3) SCD [4]. For fair comparison, we compare their convergence behavior w.r.t. both the number of iterations and the number of data access operations, the latter of which has been advocated in [4] as an implementation-independent measure of time. Moreover, the efficiency tricks for sparse data described in [4] are also implemented. Following [4], we set the regularization parameter λ in (2) to 10⁻⁶. The η parameter in SMIDAS is searched over the range {10⁻⁶, 10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹}, and the one with the lowest ℓ1-regularized loss is used. As in Pegasos [1], the (sub)gradient is computed from small sample subsets. 

²Downloaded from http://people.cs.uchicago.edu/∼vikass/svmlin.html and http://www.cs.ucsb.edu/∼wychen/sc.html.
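As a toy illustration of this setup, the sketch below runs Algorithm 1 with the Theorem-1 schedule L_t = b(t+1)^{3/2} + L, α_t = 2/(t+2), ψ = λ‖·‖₁, and a mini-batch stochastic gradient. The least-squares data, batch size, schedule constant b, and all names are our own illustrative choices, not the paper's experimental configuration.

```python
import numpy as np

# Our own noiseless least-squares problem; labels come from a known w_true.
rng = np.random.default_rng(0)
m, d = 200, 5
X = rng.standard_normal((m, d))
w_true = np.array([1.0, -2.0, 0.0, 0.0, 0.5])
y_data = X @ w_true

def minibatch_grad(w, p=20):
    idx = rng.integers(0, m, size=p)           # sample a subset of size p
    Xb, yb = X[idx], y_data[idx]
    return Xb.T @ (Xb @ w - yb) / p            # unbiased estimate of grad f(w)

L = np.linalg.eigvalsh(X.T @ X / m).max()      # Lipschitz constant of grad f
lam, b = 1e-6, 0.1                             # l1 weight and schedule constant
y_t = z = np.zeros(d)
for t in range(500):
    alpha = 2.0 / (t + 2)
    Lt = b * (t + 1) ** 1.5 + L
    x = (1 - alpha) * y_t + alpha * z
    g = minibatch_grad(x)
    v = x - g / Lt                             # generalized gradient update ...
    y_t = np.sign(v) * np.maximum(np.abs(v) - lam / Lt, 0.0)   # ... is soft-thresholding
    z = z - (Lt * alpha) ** -1 * (Lt * (x - y_t))              # z-update with mu = 0
```

Averaging the gradient over a batch of p samples cuts the subgradient variance σ² by a factor of p, which is exactly the variance-reduction effect described in Section 3.2.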
The subset size p is set to min(0.01m, 500), where m is the data set size. This is used for all the algorithms except SCD, since SCD is based on coordinate descent and is quite different from the other stochastic subgradient algorithms.³ All the algorithms are trained with the same maximum amount of "time" (i.e., number of data access operations).

Table 1: Summary of the data sets.

  data set    #features    #instances    sparsity
  pcmac       7,511        1,946         0.73%
  RCV1        47,236       193,844       0.12%

Results are shown in Figure 1. As can be seen, SAGE requires much fewer iterations for convergence than the others (Figures 1(a) and 1(e)). Moreover, the additional costs of maintaining x_t and z_t are small, and the most expensive step in each SAGE iteration is in computing the generalized gradient update. Hence, its per-iteration complexity is comparable with the other (sub)gradient schemes, and its convergence in terms of the number of data access operations is still the fastest (Figures 1(b), 1(c), 1(f) and 1(g)). 
Moreover, the sparsity of the SAGE solution is comparable with those of the other algorithms (Figures 1(d) and 1(h)).

[Figure 1: Performance of the various algorithms on the pcmac (upper) and RCV1 (below) data sets. Panels plot the ℓ1-regularized loss versus the number of iterations and versus the number of data accesses, the error (%) versus the number of data accesses, and the density of w versus the number of data accesses, for SAGE, FOLOS, SMIDAS and SCD.]

6 Conclusion

In this paper, we developed a novel accelerated gradient method (SAGE) for stochastic convex composite optimization. 
It enjoys the computational simplicity and scalability of traditional (sub)gradient methods but is much faster, both theoretically and empirically. Experimental results show that SAGE outperforms recent (sub)gradient descent methods. Moreover, SAGE can also be extended to online learning, obtaining the best regret bounds currently known.

Acknowledgment

This research has been partially supported by the Research Grants Council of the Hong Kong Special Administrative Region under grant 615209.

³For the same reason, an SCD iteration is also very different from an iteration in the other algorithms. Hence, SCD is not shown in the plots of the regularized loss versus number of iterations.

References

[1] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning, pages 807–814, Corvalis, Oregon, USA, 2007.

[2] A. Bordes, L. Bottou, and P. Gallinari. SGD-QN: Careful quasi-Newton stochastic gradient descent. Journal of Machine Learning Research, 10:1737–1754, 2009.

[3] J. Duchi and Y. Singer. Online and batch learning using forward looking subgradients. Technical report, 2009.

[4] S. Shalev-Shwartz and A. Tewari. Stochastic methods for ℓ1 regularized loss minimization. In Proceedings of the 26th International Conference on Machine Learning, pages 929–936, Montreal, Quebec, Canada, 2009.

[5] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 20. 2008.

[6] S. Shalev-Shwartz and N. Srebro. SVM optimization: Inverse dependence on training set size. In Proceedings of the 25th International Conference on Machine Learning, pages 928–935, Helsinki, Finland, 2008.

[7] Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). 
Doklady AN SSSR (translated as Soviet Math. Dokl.), 269:543–547, 1983.

[8] Y. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion Paper 2007/76, Catholic University of Louvain, September 2007.

[9] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2:183–202, 2009.

[10] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B, 58:267–288, 1996.

[11] S. Ji, L. Sun, R. Jin, and J. Ye. Multi-label multiple kernel learning. In Advances in Neural Information Processing Systems 21. 2009.

[12] S. Ji and J. Ye. An accelerated gradient method for trace norm minimization. In Proceedings of the International Conference on Machine Learning, Montreal, Canada, 2009.

[13] G. Lan. An optimal method for stochastic composite optimization. Technical report, School of Industrial and Systems Engineering, Georgia Institute of Technology, 2009.

[14] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2003.

[15] S.M. Kakade and S. Shalev-Shwartz. Mind the duality gap: Logarithmic regret algorithms for online optimization. In Advances in Neural Information Processing Systems 21. 2009.

[16] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.

[17] S.J. Wright, R.D. Nowak, and M.A.T. Figueiredo. Sparse reconstruction by separable approximation. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Las Vegas, Nevada, USA, March 2008.

[18] V. Sindhwani and S.S. Keerthi. Large scale semi-supervised linear SVMs. 
In Proceedings of the SIGIR Conference on Research and Development in Information Retrieval, pages 477–484, Seattle, WA, USA, 2006.

[19] Y. Song, W.Y. Chen, H. Bai, C.J. Lin, and E.Y. Chang. Parallel spectral clustering. In Proceedings of the European Conference on Machine Learning, pages 374–389, Antwerp, Belgium, 2008.", "award": [], "sourceid": 997, "authors": [{"given_name": "Chonghai", "family_name": "Hu", "institution": null}, {"given_name": "Weike", "family_name": "Pan", "institution": null}, {"given_name": "James", "family_name": "Kwok", "institution": null}]}