{"title": "An Accelerated Proximal Coordinate Gradient Method", "book": "Advances in Neural Information Processing Systems", "page_first": 3059, "page_last": 3067, "abstract": "We develop an accelerated randomized proximal coordinate gradient (APCG) method for solving a broad class of composite convex optimization problems. In particular, our method achieves faster linear convergence rates for minimizing strongly convex functions than existing randomized proximal coordinate gradient methods. We show how to apply the APCG method to solve the dual of the regularized empirical risk minimization (ERM) problem, and devise efficient implementations that can avoid full-dimensional vector operations. For ill-conditioned ERM problems, our method obtains better convergence rates than the state-of-the-art stochastic dual coordinate ascent (SDCA) method.", "full_text": "An Accelerated Proximal Coordinate Gradient Method\n\nQihang Lin\n\nUniversity of Iowa\nIowa City, IA, USA\n\nqihang-lin@uiowa.edu\n\nZhaosong Lu\n\nSimon Fraser University\n\nBurnaby, BC, Canada\nzhaosong@sfu.ca\n\nLin Xiao\n\nMicrosoft Research\nRedmond, WA, USA\n\nlin.xiao@microsoft.com\n\nAbstract\n\nWe develop an accelerated randomized proximal coordinate gradient (APCG) method for solving a broad class of composite convex optimization problems. In particular, our method achieves faster linear convergence rates for minimizing strongly convex functions than existing randomized proximal coordinate gradient methods. We show how to apply the APCG method to solve the dual of the regularized empirical risk minimization (ERM) problem, and devise efficient implementations that avoid full-dimensional vector operations. 
For ill-conditioned ERM problems, our method obtains better convergence rates than the state-of-the-art stochastic dual coordinate ascent (SDCA) method.\n\n1 Introduction\n\nCoordinate descent methods have received extensive attention in recent years due to their potential for solving large-scale optimization problems arising from machine learning and other applications. In this paper, we develop an accelerated proximal coordinate gradient (APCG) method for solving convex optimization problems of the following form:\n\nminimize_{x ∈ R^N}  F(x) := f(x) + Ψ(x),   (1)\n\nwhere f is differentiable on dom(Ψ), and Ψ has a block separable structure. More specifically,\n\nΨ(x) = Σ_{i=1}^n Ψ_i(x_i),   (2)\n\nwhere each x_i denotes a sub-vector of x with cardinality N_i, and each Ψ_i : R^{N_i} → R ∪ {+∞} is a closed convex function. We assume the collection {x_i : i = 1, . . . , n} forms a partition of the components of x ∈ R^N. In addition to the capability of modeling nonsmooth regularization terms such as Ψ(x) = λ‖x‖_1, this model also includes optimization problems with block separable constraints. More precisely, each block constraint x_i ∈ C_i, where C_i is a closed convex set, can be modeled by an indicator function defined as Ψ_i(x_i) = 0 if x_i ∈ C_i and ∞ otherwise.\n\nAt each iteration, coordinate descent methods choose one block of coordinates x_i to sufficiently reduce the objective value while keeping other blocks fixed. One common and simple approach for choosing such a block is the cyclic scheme. The global and local convergence properties of the cyclic coordinate descent method have been studied in, for example, [21, 11, 16, 2, 5]. Recently, randomized strategies for choosing the block to update became more popular. 
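As a toy illustration of the block-separable model (1)–(2) (a minimal sketch with made-up data, not from the paper): N = 4 coordinates split into n = 2 blocks, one block carrying an ℓ1 regularizer and the other the indicator of a box constraint.

```python
import numpy as np

def make_psi_l1(lam):
    """Psi_i(x_i) = lam * ||x_i||_1 (nonsmooth regularizer)."""
    return lambda xi: lam * np.abs(xi).sum()

def make_psi_box(lo, hi):
    """Indicator of the box constraint lo <= x_i <= hi (0 inside, +inf outside)."""
    return lambda xi: 0.0 if np.all((xi >= lo) & (xi <= hi)) else np.inf

def F(x, blocks, f, psis):
    """Composite objective F(x) = f(x) + sum_i Psi_i(x_i),
    where `blocks` lists the coordinate indices of each block."""
    return f(x) + sum(psi(x[b]) for b, psi in zip(blocks, psis))

# Example: N = 4 split into n = 2 blocks of size 2.
blocks = [np.array([0, 1]), np.array([2, 3])]
psis = [make_psi_l1(0.5), make_psi_box(-1.0, 1.0)]
f = lambda x: 0.5 * np.dot(x, x)          # smooth part
x = np.array([1.0, -2.0, 0.5, 0.5])
val = F(x, blocks, f, psis)               # 2.75 + 1.5 + 0.0 = 4.25
```

A point that violates the box constraint makes F(x) = +∞, which is exactly how block constraints are absorbed into the objective.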
In addition to its theoretical benefits [13, 14, 19], numerous experiments have demonstrated that randomized coordinate descent methods are very powerful for solving large-scale machine learning problems [3, 6, 18, 19].\n\nInspired by the success of accelerated full gradient methods (e.g., [12, 1, 22]), several recent works have applied Nesterov's acceleration schemes to speed up randomized coordinate descent methods. In particular, Nesterov [13] developed an accelerated randomized coordinate gradient method for minimizing unconstrained smooth convex functions, which corresponds to the case of Ψ(x) ≡ 0 in (1).\n\n1\n\n\fLu and Xiao [10] gave a sharper convergence analysis of Nesterov's method, and Lee and Sidford [8] developed extensions with weighted random sampling schemes. More recently, Fercoq and Richtárik [4] proposed an APPROX (Accelerated, Parallel and PROXimal) coordinate descent method for solving the more general problem (1) and obtained accelerated sublinear convergence rates, but their method cannot exploit strong convexity to obtain accelerated linear rates.\n\nIn this paper, we develop a general APCG method that achieves accelerated linear convergence rates when the objective function is strongly convex. Without the strong convexity assumption, our method recovers the APPROX method [4]. Moreover, we show how to apply the APCG method to solve the dual of the regularized empirical risk minimization (ERM) problem, and devise efficient implementations that avoid full-dimensional vector operations. For ill-conditioned ERM problems, our method obtains faster convergence rates than the state-of-the-art stochastic dual coordinate ascent (SDCA) method [19], and the improved iteration complexity matches that of the accelerated SDCA method [20]. 
We present numerical experiments to illustrate the advantage of our method.\n\n1.1 Notations and assumptions\n\nFor any partition of x ∈ R^N into {x_i ∈ R^{N_i} : i = 1, . . . , n}, there is an N × N permutation matrix U partitioned as U = [U_1 · · · U_n], where U_i ∈ R^{N×N_i}, such that\n\nx = Σ_{i=1}^n U_i x_i,  and  x_i = U_i^T x,  i = 1, . . . , n.\n\nFor any x ∈ R^N, the partial gradient of f with respect to x_i is defined as\n\n∇_i f(x) = U_i^T ∇f(x),  i = 1, . . . , n.\n\nWe associate each subspace R^{N_i}, for i = 1, . . . , n, with the standard Euclidean norm, denoted by ‖ · ‖. We make the following assumptions which are standard in the literature on coordinate descent methods (e.g., [13, 14]).\n\nAssumption 1. The gradient of function f is block-wise Lipschitz continuous with constants L_i, i.e.,\n\n‖∇_i f(x + U_i h_i) − ∇_i f(x)‖ ≤ L_i ‖h_i‖,  ∀ h_i ∈ R^{N_i},  i = 1, . . . , n,  x ∈ R^N.\n\nFor convenience, we define the following norm in the whole space R^N:\n\n‖x‖_L = ( Σ_{i=1}^n L_i ‖x_i‖² )^{1/2},  ∀ x ∈ R^N.   (3)\n\nAssumption 2. There exists µ ≥ 0 such that for all y ∈ R^N and x ∈ dom(Ψ),\n\nf(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µ/2) ‖y − x‖²_L.\n\nThe convexity parameter of f with respect to the norm ‖ · ‖_L is the largest µ such that the above inequality holds. Every convex function satisfies Assumption 2 with µ = 0. If µ > 0, the function f is called strongly convex.\n\nWe note that an immediate consequence of Assumption 1 is\n\nf(x + U_i h_i) ≤ f(x) + ⟨∇_i f(x), h_i⟩ + (L_i/2) ‖h_i‖²,  ∀ h_i ∈ R^{N_i},  i = 1, . . . , n,  x ∈ R^N.   (4)\n\nThis together with Assumption 2 implies µ ≤ 1.\n\n2 The APCG method\n\nIn this section we describe the general APCG method, and several variants that are more suitable for implementation under different assumptions. These algorithms extend Nesterov's accelerated gradient methods [12, Section 2.2] to the composite and coordinate descent setting.\n\nWe first explain the notations used in our algorithms. The algorithms proceed in iterations, with k being the iteration counter. Lower case letters x, y, z represent vectors in the full space R^N, and x^(k), y^(k) and z^(k) are their values at the kth iteration. Each block coordinate is indicated with a subscript; for example, x^(k)_i represents the value of the ith block of the vector x^(k). The Greek letters α, β, γ are scalars, and α_k, β_k and γ_k represent their values at iteration k.\n\n2\n\n\fAlgorithm 1: the APCG method\n\nInput: x^(0) ∈ dom(Ψ) and convexity parameter µ ≥ 0.\nInitialize: set z^(0) = x^(0) and choose 0 < γ_0 ∈ [µ, 1].\nIterate: repeat for k = 0, 1, 2, . . .\n\n1. Compute α_k ∈ (0, 1/n] from the equation\n\nn² α_k² = (1 − α_k) γ_k + α_k µ,   (5)\n\nand set\n\nγ_{k+1} = (1 − α_k) γ_k + α_k µ,   β_k = α_k µ / γ_{k+1}.   (6)\n\n2. Compute y^(k) as\n\ny^(k) = ( α_k γ_k z^(k) + γ_{k+1} x^(k) ) / ( α_k γ_k + γ_{k+1} ).   (7)\n\n3. Choose i_k ∈ {1, . . . , n} uniformly at random and compute\n\nz^(k+1) = argmin_{x ∈ R^N} { (n α_k / 2) ‖x − (1 − β_k) z^(k) − β_k y^(k)‖²_L + ⟨∇_{i_k} f(y^(k)), x_{i_k}⟩ + Ψ_{i_k}(x_{i_k}) }.\n\n4. Set\n\nx^(k+1) = y^(k) + n α_k (z^(k+1) − z^(k)) + (µ/n) (z^(k) − y^(k)).   (8)\n\nThe general APCG method is given as Algorithm 1. At each iteration k, it chooses a random coordinate i_k ∈ {1, . . . , n} and generates y^(k), x^(k+1) and z^(k+1). 
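Step 1 of Algorithm 1 only requires solving a scalar quadratic equation. A minimal sketch of the parameter recursion (5)–(6), with illustrative values for µ and n:

```python
import math

def apcg_params(gamma_k, mu, n):
    """One step of the APCG parameter recursion (Step 1 of Algorithm 1):
    solve n^2 a^2 = (1 - a) * gamma_k + a * mu for the root a in (0, 1/n],
    then gamma_{k+1} = (1 - a) * gamma_k + a * mu and beta_k = a * mu / gamma_{k+1}."""
    # positive root of n^2 a^2 + (gamma_k - mu) a - gamma_k = 0
    b, c = gamma_k - mu, -gamma_k
    a = (-b + math.sqrt(b * b + 4.0 * n * n * gamma_k)) / (2.0 * n * n)
    gamma_next = (1.0 - a) * gamma_k + a * mu
    beta = a * mu / gamma_next
    return a, gamma_next, beta

# With gamma_0 = mu > 0 the recursion is stationary: alpha_k = sqrt(mu)/n for all k.
mu, n = 0.25, 10
a, g, beta = apcg_params(mu, mu, n)
```

Note that by the defining equation, γ_{k+1} = n²α_k², so the recursion stays well defined, and with γ_0 = µ it reproduces the constants used later in Algorithm 3.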
One can observe that x^(k+1) and z^(k+1) depend on the realization of the random variable\n\nξ_k = {i_0, i_1, . . . , i_k},\n\nwhile y^(k) is independent of i_k and only depends on ξ_{k−1}. To better understand this method, we make the following observations. For convenience, we define\n\nz̃^(k+1) = argmin_{x ∈ R^N} { (n α_k / 2) ‖x − (1 − β_k) z^(k) − β_k y^(k)‖²_L + ⟨∇f(y^(k)), x − y^(k)⟩ + Ψ(x) },   (9)\n\nwhich is a full-dimensional update version of Step 3. One can observe that z^(k+1) is updated as\n\nz^(k+1)_i = z̃^(k+1)_i if i = i_k,  and  z^(k+1)_i = (1 − β_k) z^(k)_i + β_k y^(k)_i if i ≠ i_k.   (10)\n\nNotice that from (5), (6), (7) and (8) we have\n\nx^(k+1) = y^(k) + n α_k ( z^(k+1) − (1 − β_k) z^(k) − β_k y^(k) ),   (11)\n\nwhich together with (10) yields\n\nx^(k+1)_i = y^(k)_i + n α_k ( z^(k+1)_i − z^(k)_i ) + (µ/n) ( z^(k)_i − y^(k)_i ) if i = i_k,  and  x^(k+1)_i = y^(k)_i if i ≠ i_k.   (12)\n\nThat is, in Step 4, we only need to update the block coordinate x^(k+1)_{i_k} and set the rest to be y^(k)_i.\n\nWe now state a theorem concerning the expected rate of convergence of the APCG method, whose proof can be found in the full report [9].\n\nTheorem 1. Suppose Assumptions 1 and 2 hold. Let F⋆ be the optimal value of problem (1), and {x^(k)} be the sequence generated by the APCG method. 
Then, for any k ≥ 0, there holds:\n\nE_{ξ_{k−1}}[F(x^(k))] − F⋆ ≤ min{ (1 − √µ/n)^k, ( 2n / (2n + k √γ_0) )² } · ( F(x^(0)) − F⋆ + (γ_0/2) R_0² ),\n\nwhere\n\nR_0 := min_{x⋆ ∈ X⋆} ‖x^(0) − x⋆‖_L,\n\nand X⋆ is the set of optimal solutions of problem (1).\n\n3\n\n\fOur result in Theorem 1 improves upon the convergence rates of the proximal coordinate gradient methods in [14, 10], which have convergence rates on the order of\n\nO( min{ (1 − µ/n)^k, n/(n+k) } ).\n\nFor n = 1, our result matches exactly that of the accelerated full gradient method in [12, Section 2.2].\n\n2.1 Two special cases\n\nHere we give two simplified versions of the APCG method, for the special cases of µ = 0 and µ > 0, respectively. Algorithm 2 shows the simplified version for µ = 0, which can be applied to problems without strong convexity, or when the convexity parameter µ is unknown.\n\nAlgorithm 2: APCG with µ = 0\n\nInput: x^(0) ∈ dom(Ψ).\nInitialize: set z^(0) = x^(0) and choose α_0 ∈ (0, 1/n].\nIterate: repeat for k = 0, 1, 2, . . .\n\n1. Compute y^(k) = (1 − α_k) x^(k) + α_k z^(k).\n2. Choose i_k ∈ {1, . . . , n} uniformly at random and compute\n\nz^(k+1)_{i_k} = argmin_{x ∈ R^{N_{i_k}}} { (n α_k L_{i_k} / 2) ‖x − z^(k)_{i_k}‖² + ⟨∇_{i_k} f(y^(k)), x − y^(k)_{i_k}⟩ + Ψ_{i_k}(x) },\n\nand set z^(k+1)_i = z^(k)_i for all i ≠ i_k.\n3. Set x^(k+1) = y^(k) + n α_k (z^(k+1) − z^(k)).\n4. Compute α_{k+1} = (1/2) ( √(α_k⁴ + 4 α_k²) − α_k² ).\n\nAccording to Theorem 1, Algorithm 2 has an accelerated sublinear convergence rate, that is,\n\nE_{ξ_{k−1}}[F(x^(k))] − F⋆ ≤ ( 2n / (2n + k n α_0) )² ( F(x^(0)) − F⋆ + (1/2) R_0² ).\n\nWith the choice of α_0 = 1/n, Algorithm 2 reduces to the APPROX method [4] with single block update at each iteration (i.e., τ = 1 in their Algorithm 1).\n\nFor the strongly convex case with µ > 0, we can initialize Algorithm 1 with the parameter γ_0 = µ, which implies γ_k = µ and α_k = β_k = √µ/n for all k ≥ 0. This results in Algorithm 3.\n\nAlgorithm 3: APCG with γ_0 = µ > 0\n\nInput: x^(0) ∈ dom(Ψ) and convexity parameter µ > 0.\nInitialize: set z^(0) = x^(0) and α = √µ/n.\nIterate: repeat for k = 0, 1, 2, . . .\n\n1. Compute y^(k) = ( x^(k) + α z^(k) ) / (1 + α).\n2. Choose i_k ∈ {1, . . . , n} uniformly at random and compute\n\nz^(k+1) = argmin_{x ∈ R^N} { (n α / 2) ‖x − (1 − α) z^(k) − α y^(k)‖²_L + ⟨∇_{i_k} f(y^(k)), x_{i_k} − y^(k)_{i_k}⟩ + Ψ_{i_k}(x_{i_k}) }.\n\n3. Set x^(k+1) = y^(k) + n α (z^(k+1) − z^(k)) + n α² (z^(k) − y^(k)).\n\nAs a direct corollary of Theorem 1, Algorithm 3 enjoys an accelerated linear convergence rate:\n\nE_{ξ_{k−1}}[F(x^(k))] − F⋆ ≤ (1 − √µ/n)^k ( F(x^(0)) − F⋆ + (µ/2) R_0² ).\n\nTo the best of our knowledge, this is the first time such an accelerated rate has been obtained for solving the general problem (1) (with strong convexity) using coordinate descent type of methods.\n\n4\n\n\f2.2 Efficient implementation\n\nThe APCG methods we presented so far all need to perform full-dimensional vector operations at each iteration. For example, y^(k) is updated as a convex combination of x^(k) and z^(k), and this can be very costly since in general they can be dense vectors. Moreover, for the strongly convex case (Algorithms 1 and 3), all blocks of z^(k+1) need to be updated at each iteration, although only the i_k-th block needs to compute the partial gradient and perform a proximal mapping. These full-dimensional vector updates cost O(N) operations per iteration and may cause the overall computational cost of APCG to be even higher than that of full gradient methods (see discussions in [13]).\n\nIn order to avoid full-dimensional vector operations, Lee and Sidford [8] proposed a change of variables scheme for accelerated coordinate descent methods for unconstrained smooth minimization. Fercoq and Richtárik [4] devised a similar scheme for efficient implementation in the µ = 0 case for composite minimization. Here we show that such a scheme can also be developed for the case of µ > 0 in the composite optimization setting. 
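As a concreteness check, here is a minimal numerical sketch of Algorithm 3 (our own toy instance, not from the paper): a random strongly convex quadratic f with Ψ ≡ 0 and scalar blocks, so the proximal step in Step 2 has a closed form. The problem data (Q, b) and the iteration budget are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Strongly convex quadratic f(x) = 0.5 x^T Q x - b^T x, Psi = 0,
# blocks = single coordinates, so the block Lipschitz constants are L_i = Q_ii.
n = 5
M = rng.standard_normal((n, n))
Q = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)
L = np.diag(Q).copy()
# convexity parameter of f w.r.t. the norm ||.||_L: min eig of diag(L)^{-1/2} Q diag(L)^{-1/2}
Dinv = 1.0 / np.sqrt(L)
mu = np.linalg.eigvalsh(Q * np.outer(Dinv, Dinv)).min()

alpha = np.sqrt(mu) / n
x = np.zeros(n); z = x.copy()
for k in range(2000):
    y = (x + alpha * z) / (1.0 + alpha)          # Step 1
    v = (1.0 - alpha) * z + alpha * y            # center of the prox term
    ik = rng.integers(n)                         # Step 2: random coordinate
    z_new = v.copy()
    grad_ik = Q[ik] @ y - b[ik]
    z_new[ik] = v[ik] - grad_ik / (n * alpha * L[ik])
    x = y + n * alpha * (z_new - z) + n * alpha**2 * (z - y)   # Step 3
    z = z_new

x_star = np.linalg.solve(Q, b)
F = lambda u: 0.5 * u @ Q @ u - b @ u
gap = F(x) - F(x_star)
```

On this instance the optimality gap contracts roughly like (1 − √µ/n)^k, so after 2000 iterations it is far below single-precision noise.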
For simplicity, we only present an equivalent implementation of the simplified APCG method described in Algorithm 3.\n\nAlgorithm 4: Efficient implementation of APCG with γ_0 = µ > 0\n\nInput: x^(0) ∈ dom(Ψ) and convexity parameter µ > 0.\nInitialize: set α = √µ/n and ρ = (1 − α)/(1 + α), and initialize u^(0) = 0 and v^(0) = x^(0).\nIterate: repeat for k = 0, 1, 2, . . .\n\n1. Choose i_k ∈ {1, . . . , n} uniformly at random and compute\n\n∆^(k)_{i_k} = argmin_{∆ ∈ R^{N_{i_k}}} { (n α L_{i_k} / 2) ‖∆‖² + ⟨∇_{i_k} f(ρ^{k+1} u^(k) + v^(k)), ∆⟩ + Ψ_{i_k}( −ρ^{k+1} u^(k)_{i_k} + v^(k)_{i_k} + ∆ ) }.\n\n2. Let u^(k+1) = u^(k) and v^(k+1) = v^(k), and update\n\nu^(k+1)_{i_k} = u^(k)_{i_k} − ( (1 − nα) / (2ρ^{k+1}) ) ∆^(k)_{i_k},   v^(k+1)_{i_k} = v^(k)_{i_k} + ( (1 + nα) / 2 ) ∆^(k)_{i_k}.   (13)\n\nOutput: x^(k+1) = ρ^{k+1} u^(k+1) + v^(k+1).\n\nThe following proposition is proved in the full report [9].\n\nProposition 1. The iterates of Algorithm 3 and Algorithm 4 satisfy the following relationships:\n\nx^(k) = ρ^k u^(k) + v^(k),   y^(k) = ρ^{k+1} u^(k) + v^(k),   z^(k) = −ρ^k u^(k) + v^(k).   (14)\n\nWe note that in Algorithm 4, only a single block coordinate of the vectors u^(k) and v^(k) is updated at each iteration, which costs O(N_{i_k}). However, computing the partial gradient ∇_{i_k} f(ρ^{k+1} u^(k) + v^(k)) may still cost O(N) in general. In the next section, we show how to further exploit structure in many ERM problems to completely avoid full-dimensional vector operations.\n\n3 Application to regularized empirical risk minimization (ERM)\n\nLet A_1, . . . , A_n be vectors in R^d, φ_1, . . . , φ_n be a sequence of convex functions defined on R, and g be a convex function on R^d. 
Regularized ERM aims to solve the following problem:\n\nminimize_{w ∈ R^d}  P(w),  with  P(w) = (1/n) Σ_{i=1}^n φ_i(A_i^T w) + λ g(w),\n\nwhere λ > 0 is a regularization parameter. For example, given a label b_i ∈ {±1} for each vector A_i, for i = 1, . . . , n, we obtain the linear SVM problem by setting φ_i(z) = max{0, 1 − b_i z} and g(w) = (1/2)‖w‖²_2. Regularized logistic regression is obtained by setting φ_i(z) = log(1 + exp(−b_i z)). This formulation also includes regression problems. For example, ridge regression is obtained by setting φ_i(z) = (1/2)(z − b_i)² and g(w) = (1/2)‖w‖²_2, and we get the Lasso if g(w) = ‖w‖_1.\n\n5\n\n\fLet φ*_i be the convex conjugate of φ_i, that is, φ*_i(u) = max_{z ∈ R} (zu − φ_i(z)). The dual of the regularized ERM problem is (see, e.g., [19])\n\nmaximize_{x ∈ R^n}  D(x),  with  D(x) = (1/n) Σ_{i=1}^n −φ*_i(−x_i) − λ g*( (1/(λn)) A x ),\n\nwhere A = [A_1, . . . , A_n]. This is equivalent to minimizing F(x) := −D(x), that is,\n\nminimize_{x ∈ R^n}  F(x) := (1/n) Σ_{i=1}^n φ*_i(−x_i) + λ g*( (1/(λn)) A x ).\n\nThe structure of F(x) above matches the formulation in (1) and (2) with f(x) = λ g*( (1/(λn)) A x ) and Ψ_i(x_i) = (1/n) φ*_i(−x_i), and we can apply the APCG method to minimize F(x). In order to exploit the fast linear convergence rate, we make the following assumption.\n\nAssumption 3. Each function φ_i is 1/γ-smooth, and the function g has convexity parameter 1.\n\nHere we slightly abuse notation by overloading γ, which also appeared in Algorithm 1. But in this section it solely represents the (inverse) smoothness parameter of φ_i. 
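For concreteness, a small numerical sketch of this primal-dual pair for ridge regression, where φ_i(z) = ½(z − b_i)² and g(w) = ½‖w‖², so that φ*_i(u) = ½u² + b_i u and g*(u) = ½‖u‖². The data, dimensions and λ below are arbitrary illustrative choices; weak duality P(w) ≥ D(x) can be checked directly:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, lam = 3, 8, 0.1
A = rng.standard_normal((d, n))   # columns are the A_i
b = rng.standard_normal(n)

def P(w):
    """Primal ridge objective: (1/n) sum_i 0.5 (A_i^T w - b_i)^2 + 0.5 lam ||w||^2."""
    return 0.5 * ((A.T @ w - b) ** 2).mean() + 0.5 * lam * w @ w

def D(x):
    """Dual objective: (1/n) sum_i -(0.5 x_i^2 - b_i x_i) - lam * g*(Ax/(lam n))."""
    return (-(0.5 * x ** 2 - b * x)).mean() \
        - 0.5 * lam * np.linalg.norm(A @ x / (lam * n)) ** 2

x = rng.standard_normal(n)
w = A @ x / (lam * n)     # primal point associated with the dual point x
```

Any primal-dual pair satisfies D(x) ≤ P(w); at the optimum (for this smooth, strongly convex pair) the gap closes.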
Assumption 3 implies that each φ*_i has strong convexity parameter γ (with respect to the local Euclidean norm), and that g* is differentiable with ∇g* Lipschitz continuous with constant 1. In the following, we split the function F(x) = f(x) + Ψ(x) by relocating the strong convexity term as follows:\n\nf(x) = λ g*( (1/(λn)) A x ) + (γ/(2n)) ‖x‖²,   Ψ(x) = (1/n) Σ_{i=1}^n ( φ*_i(−x_i) − (γ/2) ‖x_i‖² ).   (15)\n\nAs a result, the function f is strongly convex and each Ψ_i is still convex. Now we can apply the APCG method to minimize F(x) = −D(x), and obtain the following guarantee.\n\nTheorem 2. Suppose Assumption 3 holds and ‖A_i‖ ≤ R for all i = 1, . . . , n. In order to obtain an expected dual optimality gap E[D⋆ − D(x^(k))] ≤ ε by using the APCG method, it suffices to have\n\nk ≥ ( n + √(nR²/(λγ)) ) log(C/ε),   (16)\n\nwhere D⋆ = max_{x ∈ R^n} D(x) and the constant C = D⋆ − D(x^(0)) + (γ/(2n)) ‖x^(0) − x⋆‖².\n\nProof. The function f(x) in (15) has coordinate Lipschitz constants L_i = ‖A_i‖²/(λn²) + γ/n ≤ (R² + λγn)/(λn²) and convexity parameter γ/n with respect to the unweighted Euclidean norm. The strong convexity parameter of f(x) with respect to the norm ‖ · ‖_L defined in (3) is therefore at least\n\nµ = (γ/n) · ( λn² / (R² + λγn) ) = λγn / (R² + λγn).\n\nAccording to Theorem 1, we have E[D⋆ − D(x^(k))] ≤ (1 − √µ/n)^k C ≤ exp( −(√µ/n) k ) C. Therefore it suffices to have the number of iterations k larger than\n\n(n/√µ) log(C/ε) = n √( (R² + λγn)/(λγn) ) log(C/ε) = √( n² + nR²/(λγ) ) log(C/ε) ≤ ( n + √(nR²/(λγ)) ) log(C/ε).\n\nThis finishes the proof.\n\nSeveral state-of-the-art algorithms for ERM, including SDCA [19], SAG [15, 17] and SVRG [7, 23], obtain the iteration complexity\n\nO( ( n + R²/(λγ) ) log(1/ε) ).   (17)\n\nWe note that our result in (16) can be much better for ill-conditioned problems, i.e., when the condition number R²/(λγ) is larger than n. This is also confirmed by our numerical experiments in Section 4.\n\nThe complexity bound in (17) for the aforementioned work is for minimizing the primal objective P(w) or the duality gap P(w) − D(x), but our result in Theorem 2 is in terms of the dual optimality. In the full report [9], we show that the same guarantee on accelerated primal-dual convergence can be obtained by our method with an extra primal gradient step, without affecting the overall complexity. The experiments in Section 4 illustrate the superior performance of our algorithm on reducing the primal objective value, even without performing the extra step.\n\n6\n\n\fWe note that Shalev-Shwartz and Zhang [20] recently developed an accelerated SDCA method which achieves the same complexity O( ( n + √(n/(λγ)) ) log(1/ε) ) as our method. Their method calls the SDCA method in a full-dimensional accelerated gradient method in an inner-outer iteration procedure. In contrast, our APCG method is a straightforward single-loop coordinate gradient method.\n\n3.1 Implementation details\n\nHere we show how to exploit the structure of the regularized ERM problem to efficiently compute the coordinate gradient ∇_{i_k} f(y^(k)), and totally avoid full-dimensional updates in Algorithm 4. We focus on the special case g(w) = (1/2)‖w‖² and show how to compute ∇_{i_k} f(y^(k)). 
According to (15),\n\n∇_{i_k} f(y^(k)) = (1/(λn²)) A^T_{i_k} (A y^(k)) + (γ/n) y^(k)_{i_k}.\n\nSince we do not form y^(k) in Algorithm 4, we update A y^(k) by storing and updating two vectors in R^d: p^(k) = A u^(k) and q^(k) = A v^(k). The resulting method is detailed in Algorithm 5.\n\nAlgorithm 5: APCG for solving dual ERM\n\nInput: x^(0) ∈ dom(Ψ) and convexity parameter µ > 0.\nInitialize: set α = √µ/n and ρ = (1 − α)/(1 + α), and let u^(0) = 0, v^(0) = x^(0), p^(0) = 0 and q^(0) = A x^(0).\nIterate: repeat for k = 0, 1, 2, . . .\n\n1. Choose i_k ∈ {1, . . . , n} uniformly at random, compute the coordinate gradient\n\n∇^(k)_{i_k} = (1/(λn²)) ( ρ^{k+1} A^T_{i_k} p^(k) + A^T_{i_k} q^(k) ) + (γ/n) ( ρ^{k+1} u^(k)_{i_k} + v^(k)_{i_k} ).\n\n2. Compute the coordinate increment\n\n∆^(k)_{i_k} = argmin_{∆ ∈ R^{N_{i_k}}} { ( α(‖A_{i_k}‖² + λγn) / (2λn) ) ‖∆‖² + ⟨∇^(k)_{i_k}, ∆⟩ + (1/n) φ*_{i_k}( ρ^{k+1} u^(k)_{i_k} − v^(k)_{i_k} − ∆ ) }.\n\n3. Let u^(k+1) = u^(k) and v^(k+1) = v^(k), and update\n\nu^(k+1)_{i_k} = u^(k)_{i_k} − ( (1 − nα)/(2ρ^{k+1}) ) ∆^(k)_{i_k},   v^(k+1)_{i_k} = v^(k)_{i_k} + ( (1 + nα)/2 ) ∆^(k)_{i_k},\np^(k+1) = p^(k) − ( (1 − nα)/(2ρ^{k+1}) ) A_{i_k} ∆^(k)_{i_k},   q^(k+1) = q^(k) + ( (1 + nα)/2 ) A_{i_k} ∆^(k)_{i_k}.   (18)\n\nOutput: approximate primal and dual solutions\n\nw^(k+1) = (1/(λn)) ( ρ^{k+2} p^(k+1) + q^(k+1) ),   x^(k+1) = ρ^{k+1} u^(k+1) + v^(k+1).\n\nEach iteration of Algorithm 5 only involves the two inner products A^T_{i_k} p^(k), A^T_{i_k} q^(k) in computing ∇^(k)_{i_k} and the two vector additions in (18). They all cost O(d) rather than O(n). 
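The bookkeeping behind Algorithm 5 can be sanity-checked in isolation. The sketch below is our own illustration (not the paper's code): the per-step increment `delta` is a random placeholder standing in for the actual proximal step's output. It maintains p = Au and q = Av under single-block updates of the form (13)/(18), and confirms that A y^(k) = ρ^{k+1} p + q is available without ever forming y^(k):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 4, 6
A = rng.standard_normal((d, n))
alpha = 0.05                      # stands in for sqrt(mu)/n
rho = (1 - alpha) / (1 + alpha)

u = np.zeros(n); v = rng.standard_normal(n)
p = A @ u; q = A @ v              # p = A u, q = A v, maintained incrementally

for k in range(50):
    ik = rng.integers(n)
    delta = rng.standard_normal()            # placeholder for the prox output
    cu = -(1 - n * alpha) / (2 * rho ** (k + 1))
    cv = (1 + n * alpha) / 2
    u[ik] += cu * delta
    v[ik] += cv * delta
    p += cu * delta * A[:, ik]               # O(d) update instead of recomputing A u
    q += cv * delta * A[:, ik]
```

Because each step touches one column of A, the incremental updates of p and q cost O(d) per iteration, which is the point of the scheme.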
When the A_i's are sparse (the case for most large-scale problems), these operations can be carried out very efficiently. Basically, each iteration of Algorithm 5 only costs about twice as much as that of SDCA [6, 19].\n\n4 Experiments\n\nIn our experiments, we solve ERM problems with the smoothed hinge loss for binary classification. That is, we pre-multiply each feature vector A_i by its label b_i ∈ {±1} and use the loss function\n\nφ(a) = 0 if a ≥ 1;  φ(a) = 1 − a − γ/2 if a ≤ 1 − γ;  φ(a) = (1/(2γ)) (1 − a)² otherwise.\n\nThe conjugate function of φ is φ*(b) = b + (γ/2) b² if b ∈ [−1, 0] and ∞ otherwise. Therefore we have\n\nΨ_i(x_i) = (1/n) ( φ*(−x_i) − (γ/2) ‖x_i‖² ) = −x_i/n if x_i ∈ [0, 1], and ∞ otherwise.\n\nThe datasets used in our experiments are summarized in Table 1.\n\n7\n\n\f[Figure 1: a 4 × 3 grid of convergence plots; the rows correspond to λ ∈ {10^−5, 10^−6, 10^−7, 10^−8}, the columns to the datasets rcv1, covertype and news20, and each panel compares AFG, SDCA and APCG over 100 passes.]\n\nFigure 1: Comparing the APCG method with SDCA and the accelerated full gradient method (AFG) with adaptive line search. In each plot, the vertical axis is the primal objective gap P(w^(k)) − P⋆, and the horizontal axis is the number of passes through the entire dataset. The three columns correspond to the three datasets, and each row corresponds to a particular value of the regularization parameter λ.\n\nIn our experiments, we compare the APCG method with SDCA and the accelerated full gradient method (AFG) [12] with an additional line search procedure to improve efficiency. When the regularization parameter λ is not too small (around 10^−4), APCG performs similarly to SDCA, as predicted by our complexity results, and they both outperform AFG by a substantial margin.\n\nFigure 1 shows the results in the ill-conditioned setting, with λ varying from 10^−5 to 10^−8. Here we see that APCG has superior performance in reducing the primal objective value compared with SDCA and AFG, even though our theory only gives complexity guarantees for solving the dual ERM problem. AFG eventually catches up for cases with very large condition number (see the plots for λ = 10^−8).\n\ndatasets | number of samples n | number of features d | sparsity\nrcv1     | 20,242              | 47,236               | 0.16%\ncovtype  | 581,012             | 54                   | 22%\nnews20   | 19,996              | 1,355,191            | 0.04%\n\nTable 1: Characteristics of three binary classification datasets (available from the LIBSVM web page: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets).\n\n8\n\n\fReferences\n\n[1] A. Beck and M. 
Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.\n\n[2] A. Beck and L. Tetruashvili. On the convergence of block coordinate descent type methods. SIAM Journal on Optimization, 23(4):2037–2060, 2013.\n\n[3] K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. Coordinate descent method for large-scale l2-loss linear support vector machines. Journal of Machine Learning Research, 9:1369–1398, 2008.\n\n[4] O. Fercoq and P. Richtárik. Accelerated, parallel and proximal coordinate descent. Manuscript, arXiv:1312.5799, 2013.\n\n[5] M. Hong, X. Wang, M. Razaviyayn, and Z. Q. Luo. Iteration complexity analysis of block coordinate descent methods. Manuscript, arXiv:1310.6957, 2013.\n\n[6] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 408–415, 2008.\n\n[7] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323, 2013.\n\n[8] Y. T. Lee and A. Sidford. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. Manuscript, arXiv:1305.1922, 2013.\n\n[9] Q. Lin, Z. Lu, and L. Xiao. An accelerated proximal coordinate gradient method and its application to regularized empirical risk minimization. Technical Report MSR-TR-2014-94, Microsoft Research, 2014. (arXiv:1407.1296).\n\n[10] Z. Lu and L. Xiao. On the complexity analysis of randomized block-coordinate descent methods. Accepted by Mathematical Programming, Series A, 2014. (arXiv:1305.4723).\n\n[11] Z. Q. Luo and P. Tseng. On the convergence of the coordinate descent method for convex differentiable minimization. 
Journal of Optimization Theory and Applications, 72(1):7–35, 1992.\n\n[12] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Boston, 2004.\n\n[13] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.\n\n[14] P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1):1–38, 2014.\n\n[15] N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems 25, pages 2672–2680, 2012.\n\n[16] A. Saha and A. Tewari. On the non-asymptotic convergence of cyclic coordinate descent methods. SIAM Journal on Optimization, 23:576–601, 2013.\n\n[17] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Technical Report HAL 00860051, INRIA, Paris, France, 2013.\n\n[18] S. Shalev-Shwartz and A. Tewari. Stochastic methods for ℓ1 regularized loss minimization. In Proceedings of the 26th International Conference on Machine Learning (ICML), pages 929–936, Montreal, Canada, 2009.\n\n[19] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567–599, 2013.\n\n[20] S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In Proceedings of the 31st International Conference on Machine Learning (ICML), JMLR W&CP, 32(1):64–72, 2014.\n\n[21] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 140:513–535, 2001.\n\n[22] P. Tseng. 
On accelerated proximal gradient methods for convex-concave optimization. Unpublished manuscript, 2008.\n\n[23] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. Technical Report MSR-TR-2014-38, Microsoft Research, 2014. (arXiv:1403.4699).\n\n9\n\n\f", "award": [], "sourceid": 1586, "authors": [{"given_name": "Qihang", "family_name": "Lin", "institution": "University of Iowa"}, {"given_name": "Zhaosong", "family_name": "Lu", "institution": "Simon Fraser University"}, {"given_name": "Lin", "family_name": "Xiao", "institution": "MSR Redmond"}]}