{"title": "When Cyclic Coordinate Descent Outperforms Randomized Coordinate Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 6999, "page_last": 7007, "abstract": "The coordinate descent (CD) method is a classical optimization algorithm that has seen a revival of interest because of its competitive performance in machine learning applications. A number of recent papers provided convergence rate estimates for their deterministic (cyclic) and randomized variants that differ in the selection of update coordinates. These estimates suggest randomized coordinate descent (RCD) performs better than cyclic coordinate descent (CCD), although numerical experiments do not provide clear justification for this comparison. In this paper, we provide examples and more generally problem classes for which CCD (or CD with any deterministic order) is faster than RCD in terms of asymptotic worst-case convergence. Furthermore, we provide lower and upper bounds on the amount of improvement on the rate of CCD relative to RCD, which depends on the deterministic order used. We also provide a characterization of the best deterministic order (that leads to the maximum improvement in convergence rate) in terms of the combinatorial properties of the Hessian matrix of the objective function.", "full_text": "When Cyclic Coordinate Descent Outperforms\n\nRandomized Coordinate Descent\n\nMert G\u00fcrb\u00fczbalaban\u21e4, Asuman Ozdaglar\u2020, Pablo A. Parrilo\u2020, N. Denizcan Vanli\u2020\n\n\u21e4Rutgers University, mg1366@rutgers.edu\n\n\u2020 Massachusetts Institute of Technology, {asuman,parrilo,denizcan}@mit.edu\n\nAbstract\n\nThe coordinate descent (CD) method is a classical optimization algorithm that\nhas seen a revival of interest because of its competitive performance in machine\nlearning applications. 
A number of recent papers provided convergence rate\nestimates for their deterministic (cyclic) and randomized variants that differ in the\nselection of update coordinates. These estimates suggest randomized coordinate\ndescent (RCD) performs better than cyclic coordinate descent (CCD), although\nnumerical experiments do not provide clear justification for this comparison. In this\npaper, we provide examples and more generally problem classes for which CCD\n(or CD with any deterministic order) is faster than RCD in terms of asymptotic\nworst-case convergence. Furthermore, we provide lower and upper bounds on\nthe amount of improvement on the rate of CCD relative to RCD, which depends\non the deterministic order used. We also provide a characterization of the best\ndeterministic order (that leads to the maximum improvement in convergence rate)\nin terms of the combinatorial properties of the Hessian matrix of the objective\nfunction.\n\n1 Introduction\n\nWe consider solving smooth convex optimization problems using the coordinate descent (CD) method.\nThe CD method is an iterative algorithm that performs (approximate) global minimizations with\nrespect to a single coordinate (or several coordinates in the case of block CD) in a sequential manner.\nMore specifically, at each iteration $k$, an index $i_k \\in \\{1, 2, \\ldots, n\\}$ is selected and the decision vector\nis updated to approximately minimize the objective function in the $i_k$-th coordinate [3, 4]. The CD\nmethod can be deterministic or randomized depending on the choice of the update coordinates. If\nthe coordinate indices $i_k$ are chosen in a cyclic manner from the set $\\{1, 2, \\ldots, n\\}$, then the method\nis called the cyclic coordinate descent (CCD) method. When $i_k$ is sampled uniformly from the set\n$\\{1, 2, \\ldots, n\\}$, the resulting method is called the randomized coordinate descent (RCD) method (see footnote 1).\nThe CD method has a long history in optimization and its convergence was studied extensively\nin the 1980s and 1990s (cf. [5, 12, 13, 18]). It has seen a resurgence of interest because of its\napplicability and superior empirical performance in machine learning and large-scale data analysis\n[7, 8]. Several recent influential papers established non-asymptotic convergence rate estimates under\nvarious assumptions. Among these are Nesterov [15], which provided the first global non-asymptotic\nconvergence rates of RCD for convex and smooth problems (see also [11, 21, 22] for problems with\nnon-smooth terms), and Beck and Tetruashvili [1], which provided rate estimates for the block coordinate\ngradient descent method that yield rate results for CCD with exact minimization for quadratic\nproblems. Tighter rate estimates (with respect to [1]) for CCD were then presented in [23]. These rate\nestimates suggest that CCD can be slower than RCD (precisely $O(n^2)$ times slower for quadratic\nproblems, where $n$ is the dimension of the problem), which is puzzling in view of the faster empirical\nperformance of CCD over RCD for various problems (e.g., see numerical examples in [1, 24]). This\ngap was investigated in [24], which provided a quadratic problem that attains this performance gap.\n\nFootnote 1: Note that there are other coordinate selection rules as well (such as the Gauss-Southwell rule [17]). However,\nin this paper, we focus on cyclic and randomized rules.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nIn this paper, we investigate the performance comparison of deterministic and randomized coordinate\ndescent and provide examples and more generally problem classes for which CCD (or CD with any\ndeterministic order) is faster than RCD in terms of asymptotic worst-case convergence. 
Furthermore,\nwe provide lower and upper bounds on the amount of improvement on the rate of deterministic CD\nrelative to RCD. The amount of improvement depends on the deterministic order used. We also\nprovide a characterization of the best deterministic order (that leads to the maximum improvement\nin convergence rate) in terms of the combinatorial properties of the Hessian matrix of the objective\nfunction.\nIn order to clarify the rate comparison between CCD and RCD, we focus on quadratic optimization\nproblems. In particular, we consider the problem (see footnote 2)\n\n$\\min_{x \\in \\mathbb{R}^n} \\frac{1}{2} x^T A x,$  (1)\n\nwhere $A$ is a positive definite matrix. We consider two problem classes: i) $A$ is a 2-cyclic matrix,\nwhose formal definition is given in Definition 4.4, but an equivalent and insightful definition is the\nbipartiteness of the graph induced by the matrix $A - D$, where $D$ is the diagonal part of $A$; ii) $A$\nis an M-matrix, i.e., the off-diagonal entries of $A$ are nonpositive. These matrices arise in a large\nnumber of applications such as inference in attractive Gaussian-Markov random fields [14] and\nminimization of quadratic forms of graph Laplacians (for which $A = D - W$, where $W$ denotes the\nweighted adjacency matrix of the graph and $D$ is the diagonal matrix given by $D_{i,i} = \\sum_j W_{i,j}$), for\nexample in spectral partitioning [6] and semisupervised learning [2]. We build on the seminal work\nof Young [27] and Varga [25] on the analysis of the Gauss-Seidel method for solving linear systems of\nequations (with matrices satisfying certain properties) and provide a novel analysis that allows us to\ncompare the asymptotic worst-case convergence rate of CCD and RCD for the aforementioned class\nof problems and establish the faster performance of CCD with any deterministic order.\nOutline: In the next section, we formally introduce the CCD and RCD methods. 
In Section 3, we\npresent the notion of asymptotic convergence rate to compare the CCD and RCD methods and provide\na motivating example for which CCD converges faster than RCD. In Section 4, we present classes of\nproblems for which the asymptotic convergence rate of CCD is faster than that of RCD. We provide\nnumerical experiments in Section 5 and concluding remarks in Section 6.\nNotation: For a matrix $H$, we let $H_i$ denote its $i$th row and $H_{i,j}$ denote its entry at the $i$th row\nand $j$th column. For a vector $x$, we let $x_i$ denote its $i$th entry. Throughout the paper, we reserve\nsuperscripts for iteration counters of iterative algorithms and use $x^*$ to denote the optimal solution\nof problem (1). For a vector $x$, $\\|x\\|$ denotes its Euclidean norm and for a matrix $H$, $\\|H\\|$ denotes\nits operator norm. For matrices, $\\geq$ and $\\leq$ are entry-wise operators. The matrices $I$ and $0$ denote the\nidentity matrix and the zero matrix respectively and their dimensions can be understood from the\ncontext.\n\n2 Coordinate Descent Method\n\nStarting from an initial point $x^0 \\in \\mathbb{R}^n$, the coordinate descent (CD) method, at each iteration $k$, picks\na coordinate of $x$, say $i_k$, and updates the decision vector by performing exact minimization along\nthe $i_k$th coordinate, which for problem (1) yields\n\n$x^{k+1} = x^k - \\frac{1}{A_{i_k,i_k}} A_{i_k} x^k e_{i_k}, \\quad k = 0, 1, 2, \\ldots,$  (2)\n\nwhere $e_{i_k}$ is the unit vector, whose $i_k$th entry is 1 and the rest of its entries are 0. Note that this is a\nspecial case of the coordinate gradient projection method (see [1]), which at each iteration updates a\nsingle coordinate, say coordinate $i_k$, along the gradient component direction (with the particular step\nsize of $\\frac{1}{A_{i_k,i_k}}$).\n\nFootnote 2: For ease of presentation, we consider minimization of $\\frac{1}{2} x^T A x$, yet our results directly extend to problems\nof the type $\\frac{1}{2} x^T A x - b^T x$ for any $b \\neq 0$.\n\nThe coordinate index $i_k$ can be selected according to a deterministic or randomized\nrule:\n\n\u2022 When $i_k$ is chosen using the cyclic rule with order $1, \\ldots, n$ (i.e., $i_k = (k \\bmod n) + 1$), the\nresulting algorithm is called the cyclic coordinate descent (CCD) method. In order to write\nthe CCD iterations in a matrix form, we introduce the following decomposition\n\n$A = D - L - L^T,$\n\nwhere $D$ is the diagonal part of $A$ and $-L$ is the strictly lower triangular part of $A$. Then,\nover each epoch $\\ell \\geq 0$ (where an epoch is defined to be $n$ consecutive iterations), the CCD\niterations given in (2) can be written as\n\n$x^{(\\ell+1)n}_{\\mathrm{CCD}} = C \\, x^{\\ell n}_{\\mathrm{CCD}}, \\quad \\text{where } C = (D - L)^{-1} L^T.$  (3)\n\nNote that the epoch in (3) is equivalent to one iteration of the Gauss-Seidel (GS) method\napplied to the first-order optimality condition of (1), i.e., applied to the linear system $Ax = 0$\n[26].\n\n\u2022 When $i_k$ is chosen at random among $\\{1, \\ldots, n\\}$ with probabilities $\\{p_1, \\ldots, p_n\\}$ independently\nat each iteration $k$, the resulting algorithm is called the randomized coordinate descent\n(RCD) method. Given the $k$th iterate generated by the RCD algorithm, i.e., $x^k_{\\mathrm{RCD}}$, we have\n\n$\\mathbb{E}_k\\left[ x^{k+1}_{\\mathrm{RCD}} \\mid x^k_{\\mathrm{RCD}} \\right] = \\left( I - S D^{-1} A \\right) x^k_{\\mathrm{RCD}},$  (4)\n\nwhere $S = \\mathrm{diag}(p_1, \\ldots, p_n)$ contains the coordinate sampling probabilities and the conditional\nexpectation $\\mathbb{E}_k$ is taken over the random variable $i_k$ given $x^k_{\\mathrm{RCD}}$. 
Using the nested\nproperty of the expectations, the RCD iterations in expectation over each epoch $\\ell \\geq 0$ satisfy\n\n$\\mathbb{E} x^{(\\ell+1)n}_{\\mathrm{RCD}} = R \\, \\mathbb{E} x^{\\ell n}_{\\mathrm{RCD}} \\quad \\text{with} \\quad R := \\left( I - S D^{-1} A \\right)^n.$  (5)\n\n3 Comparison of the Convergence Rates of CCD and RCD Methods\n\nIn the following subsection, we define our basis of comparison for rates of CCD and RCD methods.\nTo measure the performance of these methods, we use the notion of the average worst-case asymptotic\nrate that has been studied extensively in the literature for characterizing the rate of iterative algorithms\n[25]. In Section 3.2, we construct an example for which the rate of CCD is more than twice the rate\nof RCD. This raises the question whether the best known convergence rates of CCD in the literature\nare tight or whether there exists a class of problems for which CCD provably attains better convergence\nrates than the best known rates for RCD, a question which we will answer in Section 4.\n\n3.1 Asymptotic Rate of Convergence for Iterative Algorithms\n\nConsider an iterative algorithm with update rule $x^{(\\ell+1)n} = C x^{\\ell n}$ (e.g., the CCD algorithm). The\nreduction in the distance to the optimal solution of the iterates generated by this algorithm after $\\ell$\nepochs is given by\n\n$\\frac{\\|x^{\\ell n} - x^*\\|}{\\|x^0 - x^*\\|} = \\frac{\\|C^\\ell (x^0 - x^*)\\|}{\\|x^0 - x^*\\|}.$  (6)\n\nNote that the right hand side of (6) can be as large as $\\|C^\\ell\\|$, hence in the worst-case, the average\ndecay of distance at each epoch of this algorithm is $\\|C^\\ell\\|^{1/\\ell}$. Over any finite number of epochs $\\ell \\geq 1$, we\nhave $\\|C^\\ell\\|^{1/\\ell} \\geq \\rho(C)$ and $\\|C^\\ell\\|^{1/\\ell} \\to \\rho(C)$ as $\\ell \\to \\infty$ by Gelfand's formula. Thus, we define the\nasymptotic worst-case convergence rate of an iterative algorithm (with iteration matrix $C$) as follows\n\n$\\mathrm{Rate(CCD)} := \\lim_{\\ell \\to \\infty} \\sup_{x^0 \\in \\mathbb{R}^n} -\\frac{1}{\\ell} \\log\\left( \\frac{\\|x^{\\ell n} - x^*\\|}{\\|x^0 - x^*\\|} \\right) = -\\log(\\rho(C)).$  (7)\n\nWe emphasize that this notion has been used extensively for studying the performance of iterative\nmethods such as GS and Jacobi methods [5, 18, 25, 27]. 
Note that according to our definition in (7),\na larger rate means a faster algorithm and we will use these terms interchangeably throughout the paper.\nAnalogously, for a randomized algorithm with expected update rule $\\mathbb{E} x^{(\\ell+1)n} = R \\, \\mathbb{E} x^{\\ell n}$ (e.g.,\nthe RCD algorithm), we consider the asymptotic convergence of the expected iterate error\n$\\|\\mathbb{E}(x^{\\ell n}) - x^*\\|$ and define the asymptotic worst-case convergence rate as\n\n$\\mathrm{Rate(RCD)} := \\lim_{\\ell \\to \\infty} \\sup_{x^0 \\in \\mathbb{R}^n} -\\frac{1}{\\ell} \\log\\left( \\frac{\\|\\mathbb{E}(x^{\\ell n}) - x^*\\|}{\\|x^0 - x^*\\|} \\right) = -\\log(\\rho(R)).$  (8)\n\nNote that in (8), we use the distance of the expected iterates $\\|\\mathbb{E} x^{\\ell n} - x^*\\|$ as our convergence criterion.\nOne can also use the expected distance (or the squared distance) of the iterates $\\mathbb{E}\\|x^{\\ell n} - x^*\\|$\nas the convergence criterion, which is a stronger convergence criterion than the one in (8). This\nfollows since $\\|\\mathbb{E} x^{\\ell n} - x^*\\| \\leq \\mathbb{E}\\|x^{\\ell n} - x^*\\|$ by Jensen's inequality and any convergence rate on\n$\\mathbb{E}\\|x^{\\ell n} - x^*\\|$ immediately implies at least the same convergence rate on $\\|\\mathbb{E} x^{\\ell n} - x^*\\|$ as well.\nSince we consider the reciprocal case, i.e., obtain a convergence rate on $\\|\\mathbb{E} x^{\\ell n} - x^*\\|$ and show that\nit is slower than that of CCD, our results naturally imply that the convergence rate on $\\mathbb{E}\\|x^{\\ell n} - x^*\\|$\nis also slower than that of CCD.\n\n3.2 A Motivating Example\n\nIn this section, we provide an example for which the (asymptotic worst-case convergence) rate of\nCCD is better than that of RCD and, building on this example, in Section 4, we construct a class\nof problems for which CCD attains a better rate than RCD. For some positive integer $n \\geq 1$, consider\nthe $2n \\times 2n$ symmetric matrix\n\n$A = I - L - L^T, \\quad \\text{where} \\quad L = \\frac{1}{n^2} \\begin{bmatrix} 0_{n \\times n} & 0_{n \\times n} \\\\ 1_{n \\times n} & 0_{n \\times n} \\end{bmatrix},$  (9)\n\nand $1_{n \\times n}$ is the $n \\times n$ matrix with all entries equal to 1 and $0_{n \\times n}$ is the $n \\times n$ zero matrix. 
Noting\nthat $A$ has a special structure ($A$ is equal to the sum of the identity matrix and the rank-two matrix\n$-L - L^T$), it is easy to check that $1 - 1/n$ and $1 + 1/n$ are eigenvalues of $A$ with the corresponding\neigenvectors $[1_{1 \\times n} \\; 1_{1 \\times n}]^T$ and $[1_{1 \\times n} \\; -1_{1 \\times n}]^T$. The remaining $2n - 2$ eigenvalues of $A$ are\nequal to 1.\nThe iteration matrix of the CCD algorithm when applied to the problem in (1) with the matrix (9) can\nbe found as\n\n$C = \\begin{bmatrix} 0_{n \\times n} & \\frac{1}{n^2} 1_{n \\times n} \\\\ 0_{n \\times n} & \\frac{1}{n^3} 1_{n \\times n} \\end{bmatrix}.$\n\nThe eigenvalues of $C$ are all zero except the eigenvalue of $1/n^2$ with the corresponding eigenvector\n$[n 1_{1 \\times n}, \\; 1_{1 \\times n}]^T$. Therefore, $\\rho(C) = 1/n^2$ and $\\mathrm{Rate(CCD)} = -\\log(\\rho(C)) = 2 \\log n$. On the other\nhand, the spectral radius of the expected iteration matrix of RCD can be found as\n\n$\\rho(R) = \\left(1 - \\frac{\\lambda_{\\min}(A)}{n}\\right)^n \\geq 1 - \\lambda_{\\min}(A) = \\frac{1}{n},$\n\nwhich yields $\\mathrm{Rate(RCD)} = -\\log(\\rho(R)) \\leq \\log n$. Thus, we conclude\n\n$\\frac{\\mathrm{Rate(CCD)}}{\\mathrm{Rate(RCD)}} \\geq 2, \\quad \\text{for all } n \\geq 1.$\n\nThat is, CCD is at least twice as fast as RCD in terms of the asymptotic rate. This motivates us\nto investigate if there exists a more general class of problems for which the asymptotic worst-case\nrate of CCD is larger than that of RCD. The answer to this question turns out to be positive as we\ndescribe in the following section.\n\n4 When Deterministic Orders Outperform Randomized Sampling\n\nIn this section, we present special classes of problems (of the form (1)) for which the asymptotic\nworst-case rate of CCD is larger than that of RCD. We begin our discussion by highlighting the main\nassumption we will use in this section.\nAssumption 4.1. $A$ is a symmetric positive definite matrix whose smallest eigenvalue is $\\mu$ and the\ndiagonal entries of $A$ are 1.\n\nGiven any positive definite matrix $A$ with diagonal part $D \\neq I$, the diagonal entries of the preconditioned\nmatrix $D^{-1/2} A D^{-1/2}$ are 1. Therefore, Assumption 4.1 is mild. 
The relationship between the\nsmallest eigenvalues of the original matrix and the preconditioned matrix is as follows. Let $\\lambda > 0$\nand $L_{\\max}$ denote the smallest eigenvalue and the largest diagonal entry of the original matrix,\nrespectively. Then, the smallest eigenvalue of the preconditioned matrix satisfies $\\mu \\geq \\lambda / L_{\\max}$.\nRemark 4.2. For the RCD algorithm, the coordinate index $i_k \\in \\{1, \\ldots, n\\}$ (at iteration $k$) can be\nchosen using different probability distributions $\\{p_1, \\ldots, p_n\\}$. Two common choices of distributions\nare $p_i = \\frac{1}{n}$, for all $i \\in \\{1, \\ldots, n\\}$, and $p_i = \\frac{A_{i,i}}{\\sum_{j=1}^{n} A_{j,j}}$ [15]. Since by Assumption 4.1, the diagonal\nentries of $A$ are 1, both of these distributions reduce to $p_i = \\frac{1}{n}$, for all $i \\in \\{1, \\ldots, n\\}$. Therefore,\nin the rest of the paper, we consider the RCD algorithm with uniform and independent coordinate\nselection at each iteration.\n\nIn the following lemma, we characterize the spectral radius of the RCD method. This worst-case\nrate has been presented in many works in the literature for strongly convex optimization problems\n[15, 26]. The proof is deferred to the Appendix.\nLemma 4.3. Suppose Assumption 4.1 holds. Then, the spectral radius of the expected iteration\nmatrix $R$ of the RCD algorithm (defined in (5)) is given by\n\n$\\rho(R) = \\left(1 - \\frac{\\mu}{n}\\right)^n.$  (10)\n\n4.1 Convergence Rate of CCD for 2-Cyclic Matrices\n\nIn this section, we introduce the class of 2-cyclic matrices and show that the asymptotic worst-case\nrate of CCD is more than twice that of RCD.\nDefinition 4.4 (2-Cyclic Matrix). A matrix $H$ is 2-cyclic if there exists a permutation matrix $P$ such\nthat\n\n$P H P^T = D + \\begin{bmatrix} 0 & B_1 \\\\ B_2 & 0 \\end{bmatrix},$  (11)\n\nwhere the diagonal null submatrices are square and $D$ is a diagonal matrix.\n\nThis definition can be interpreted as follows. Let $H$ be a 2-cyclic matrix, i.e., $H$ satisfies (11). Then,\nthe graph induced by the matrix $H - D$ is bipartite. 
The definition in (11) was first introduced in [27],\nwhere it had an alternative name: Property A. A generalization of this property was later introduced by\nVarga to the class of $p$-cyclic matrices [25], where $p \\geq 2$ can be arbitrary.\nWe next introduce the following definition that will be useful in Theorem 4.12 and explicitly identify\nthe class of matrices that satisfy this definition in Lemma 4.6.\nDefinition 4.5 (Consistently Ordered Matrix). For a matrix $H$, let $H = H_D - H_L - H_U$ be its\ndecomposition such that $H_D$ is a diagonal matrix, $H_L$ (and $H_U$) is a strictly lower (and upper)\ntriangular matrix. If the eigenvalues of the matrix $\\alpha H_L + \\alpha^{-1} H_U - \\gamma H_D$ are independent of $\\alpha$ for\nany $\\gamma \\in \\mathbb{R}$ and $\\alpha \\neq 0$, then $H$ is said to be consistently ordered.\nLemma 4.6. [27, Theorem 4.5] A matrix $H$ is 2-cyclic if and only if there exists a permutation matrix\n$P$ such that $P H P^T$ is consistently ordered.\n\nIn the next theorem, we characterize the convergence rate of the CCD algorithm applied to a 2-cyclic\nmatrix. Since $\\rho(R) \\geq 1 - \\mu$ by Lemma 4.3, the following theorem indicates that the spectral radius\nof the CCD iteration matrix is smaller than $\\rho^2(R)$.\nTheorem 4.7. Suppose Assumption 4.1 holds and $A$ is a consistently ordered 2-cyclic matrix. Then,\nthe spectral radius of the CCD algorithm is given by\n\n$\\rho(C) = (1 - \\mu)^2.$\n\nRemark 4.8. Note that our motivating example in (9) is an example of a consistently ordered 2-cyclic\nmatrix, for which Theorem 4.7 is applicable. In particular, for the example in (9), we can apply\nTheorem 4.7 with $\\mu = 1 - 1/n$ leading to $\\rho(C) = 1/n^2$, which exactly coincides with our previous\ncomputations of $\\rho(C)$ in Section 3.2. 
We also give an example in Appendix F, for which CCD is twice\nas fast as RCD for any arbitrary initialization with probability one.\n\nThe following corollary states that the asymptotic worst-case rate of CCD is more than twice as large\nas that of RCD for quadratic problems whose Hessian is a 2-cyclic matrix. This corollary directly\nfollows from Theorem 4.7 and definitions (7)-(8).\nCorollary 4.9. Suppose Assumption 4.1 holds and $A$ is a consistently ordered 2-cyclic matrix. Then,\nthe asymptotic worst-case rates of CCD and RCD satisfy\n\n$\\frac{\\mathrm{Rate(CCD)}}{\\mathrm{Rate(RCD)}} = 2\\nu_n, \\quad \\text{where} \\quad \\nu_n := \\frac{\\log(1 - \\mu)}{n \\log\\left(1 - \\frac{\\mu}{n}\\right)}.$  (12)\n\nIn the following remark, we highlight several properties of the constant $\\nu_n$.\nRemark 4.10. $\\nu_n$ is a monotonically increasing function of $n$ over the interval $[1, \\infty)$, where $\\nu_1 = 1$\nand $\\lim_{n \\to \\infty} \\nu_n = \\frac{-\\log(1 - \\mu)}{\\mu} > 1$. Furthermore, $\\lim_{\\mu \\to 0^+} \\nu_n = 1$.\nWe emphasize that the CCD method applied to (1) is equivalent to the Gauss-Seidel (GS) algorithm\napplied to the linear system $Ax = 0$ and when $A$ is a 2-cyclic matrix, the GS algorithm is twice as\nfast as the Jacobi algorithm [25, 27]. Hence, when $A$ is a 2-cyclic matrix and $\\mu$ is sufficiently small,\nthe RCD method is approximately as fast as the Jacobi algorithm.\n\n4.2 Convergence Rate of CCD for Irreducible M-Matrices\n\nIn this section, we first define the class of M-matrices and then present the convergence rate of the\nCCD algorithm applied to quadratic problems whose Hessian is an M-matrix.\nDefinition 4.11 (M-matrix). A real matrix $A$ with $A_{i,j} \\leq 0$ for all $i \\neq j$ is an M-matrix if $A$ has the\ndecomposition $A = sI - B$ such that $B \\geq 0$ and $s \\geq \\rho(B)$.\nWe emphasize that M-matrices arise in a variety of applications such as belief propagation over\nGaussian graphical models [14] and distributed control of positive systems [20]. 
Furthermore,\ngraph Laplacians are M-matrices; therefore, solving linear systems with M-matrices (or equivalently\nsolving (1) for an M-matrix $A$) arises in a variety of applications for analyzing random walks over\ngraphs as well as distributed optimization and consensus problems over graphs (cf. [10] for a survey).\nFor quadratic problems, the Hessian is an M-matrix if and only if the gradient descent mapping is an\nisotone operator [5, 22] and in Gaussian graphical models, M-matrices are often referred to as attractive\nmodels [14].\nIn the following theorem, we provide lower and upper bounds on the spectral radius of the iteration\nmatrix of CCD for quadratic problems, whose Hessian matrix is an irreducible M-matrix. In particular,\nwe show that the spectral radius of the iteration matrix of CCD is strictly smaller than that of RCD\nfor irreducible M-matrices.\nTheorem 4.12. Suppose Assumption 4.1 holds, $A$ is an irreducible M-matrix and $n \\geq 2$. Then, the\niteration matrix of the CCD algorithm $C = (I - L)^{-1} L^T$ satisfies the following inequality\n\n$(1 - \\mu)^2 \\leq \\rho(C) \\leq \\frac{1 - \\mu}{1 + \\mu},$  (13)\n\nwhere the inequality on the left holds with equality if and only if $A$ is a consistently ordered matrix.\n\nAn immediate consequence of Theorem 4.12 is that for quadratic problems whose Hessian is an\nirreducible M-matrix, the best cyclic order that should be used in CCD can be characterized as\nfollows.\nRemark 4.13. The standard CCD method follows the standard cyclic order $(1, 2, \\ldots, n)$ as described\nin Section 2. However, we can construct a CCD method that follows an alternative deterministic\norder by considering a permutation $\\pi$ of $\\{1, 2, \\ldots, n\\}$, and choosing the coordinates according to\nthe order $(\\pi(1), \\pi(2), \\ldots, \\pi(n))$ instead. 
For any given order $\\pi$, (1) can be reformulated as follows\n\n$\\min_{x_\\pi \\in \\mathbb{R}^n} \\frac{1}{2} x_\\pi^T A_\\pi x_\\pi, \\quad \\text{where} \\quad A_\\pi := P_\\pi A P_\\pi^T \\quad \\text{and} \\quad x_\\pi = P_\\pi x,$\n\nwhere $P_\\pi$ is the corresponding permutation matrix of $\\pi$. Supposing that Assumption 4.1 holds, the\ncorresponding CCD iterations for this problem can be written as follows\n\n$x_\\pi^{(\\ell+1)n} = C_\\pi x_\\pi^{\\ell n}, \\quad \\text{where} \\quad C_\\pi = (I - L_\\pi)^{-1} L_\\pi^T$\n\nand $-L_\\pi$ is the strictly lower triangular part of $A_\\pi$, so that $A_\\pi = I - L_\\pi - L_\\pi^T$.\n\nIf $A$ is an irreducible M-matrix and satisfies Assumption 4.1, then so does $A_\\pi$. Consequently,\nTheorem 4.12 yields the same upper and lower bounds (in (13)) on $\\rho(C_\\pi)$ as well, i.e., the spectral\nradius of the iteration matrix of CCD with any cyclic order $\\pi$ satisfies\n\n$(1 - \\mu)^2 \\leq \\rho(C_\\pi) \\leq \\frac{1 - \\mu}{1 + \\mu},$  (14)\n\nwhere the inequality on the left holds with equality if and only if $A_\\pi$ is a consistently ordered matrix.\nTherefore, if a consistent order $\\pi^*$ exists, then the CCD method with the consistent order $\\pi^*$ attains\nthe smallest spectral radius (or equivalently, the fastest asymptotic worst-case convergence rate)\namong the CCD methods with any cyclic order.\nRemark 4.14. The irreducibility of $A$ is essential to derive the lower bound in (13) of Theorem 4.12.\nHowever, the upper bound in (13) holds even when $A$ is a reducible matrix.\n\nWe next compare the spectral radii bounds for CCD (given in Theorem 4.12) and RCD (given in\nLemma 4.3). Since $\\mu > 0$, the right-hand side of (13) can be relaxed to $(1 - \\mu)^2 \\leq \\rho(C) < 1 - \\mu$.\nA direct consequence of this inequality is the following corollary, which states that the asymptotic\nworst-case rate of CCD is better than that of RCD by a factor that is strictly greater\nthan 1.\nCorollary 4.15. Suppose Assumption 4.1 holds, $A$ is an irreducible M-matrix and $n \\geq 2$. 
Then, the\nasymptotic worst-case rates of CCD and RCD satisfy\n\n$1 < \\nu_n < \\frac{\\mathrm{Rate(CCD)}}{\\mathrm{Rate(RCD)}} \\leq 2\\nu_n, \\quad \\text{where} \\quad \\nu_n := \\frac{\\log(1 - \\mu)}{n \\log\\left(1 - \\frac{\\mu}{n}\\right)},$  (15)\n\nand the inequality on the right holds with equality if and only if $A$ is a consistently ordered matrix.\n\nIn the following corollary, we highlight that as the smallest eigenvalue of $A$ goes to zero, the\nasymptotic worst-case rate of the CCD algorithm becomes twice the asymptotic worst-case rate of\nthe RCD algorithm.\nCorollary 4.16. Suppose Assumption 4.1 holds, $A$ is an irreducible M-matrix and $n \\geq 2$. Then, we\nhave\n\n$\\lim_{\\mu \\to 0^+} \\frac{\\mathrm{Rate(CCD)}}{\\mathrm{Rate(RCD)}} = 2.$\n\n5 Numerical Experiments\n\nIn this section, we compare the performance of CCD and RCD through numerical examples. First,\nwe consider the quadratic optimization problem in (1), where $A$ is an $n \\times n$ matrix defined as follows\n\n$A = I - L - L^T, \\quad \\text{where} \\quad L = \\frac{1}{n} \\begin{bmatrix} 0 & 0 \\\\ 1_{\\frac{n}{2} \\times \\frac{n}{2}} & 0 \\end{bmatrix},$  (16)\n\nand $1_{\\frac{n}{2} \\times \\frac{n}{2}}$ is the $\\frac{n}{2} \\times \\frac{n}{2}$ matrix with all entries equal to 1. Here, it can be easily checked that $A$ is a\nconsistently ordered, 2-cyclic matrix. By Theorem 4.7 and Corollary 4.9, the asymptotic worst-case\nconvergence rate of CCD on this example is\n\n$2\\nu_n = \\frac{2 \\log(1 - \\mu)}{n \\log\\left(1 - \\frac{\\mu}{n}\\right)} = \\frac{\\log(0.5)}{50 \\log\\left(1 - \\frac{1}{200}\\right)} \\approx 2.77$  (17)\n\ntimes faster than that of RCD (the constants in (17) correspond to $\\mu = 0.5$ and $n = 100$). This is illustrated in Figure 1 (left), where the distance to the optimal\nsolution is plotted on a logarithmic scale over epochs. Note that even though our results are asymptotic,\nwe see the same difference in performance in the early epochs (for small $\\ell$). On the other hand,\nwhen the matrix $A$ is not consistently ordered, according to Theorem 4.12, CCD is still faster but\nthe difference in the convergence rates decreases with respect to the consistent ordering case. To\nillustrate this, we need to generate an inconsistent ordering of the matrix $A$. 
For this goal, we generate\na permutation matrix $P$ and replace $A$ with $A_P := P A P^T$ in the optimization problem (1) (this is\nequivalent to solving the system $A_P x = 0$) so that $A_P$ is not consistently ordered (we generate $P$\nrandomly and compute $A_P$). Figure 1 (right) shows that for this inconsistent ordering CCD is still\nfaster than RCD, but not as fast as in the consistent case (the slope of the error decay line with blue markers is less\nsteep), as predicted by our theory.\n\n[Figure 1 panels: Consistent Ordering, Worst-Case Initialization (left); Inconsistent Ordering, Worst-Case Initialization (right). x-axis: Number of Epochs $\\ell$; y-axis: $\\log \\|x^\\ell - x^*\\|$; curves: CCD, RCD, Expected RCD.]\n\nFigure 1: Distance to the optimal solution of the iterates of CCD and RCD for the cyclic matrix in\n(16) (left figure) and a randomly permuted version of the same matrix (right figure) where the y-axis\nis on a logarithmic scale. 
The left (right) figure corresponds to the consistent (inconsistent) ordering\nfor the same quadratic optimization problem.\n\n[Figure 2 panels: M-Matrix, Worst-Case Initialization (left); M-Matrix, Random Initialization (right). x-axis: Number of Epochs $\\ell$; y-axis: $\\log\\left( \\|x^\\ell - x^*\\| / \\|x^0 - x^*\\| \\right)$; curves: CCD, RCD, Expected RCD.]\n\nFigure 2: Distance to the optimal solution of the iterates of CCD and RCD for the M-matrix\nin (18) for the worst-case initialization (left figure) and a random initialization (right figure).\n\nWe next consider the case where $A$ is an irreducible positive definite M-matrix. In particular, we\nconsider the matrix\n\n$A = (1 + \\delta) I - \\delta 1_{n \\times n},$  (18)\n\nwhere $1_{n \\times n}$ is the $n \\times n$ matrix with all entries equal to 1 as before and $\\delta = \\frac{1}{n+5}$. We set $n = 100$\nand plot the performance of CCD and RCD methods for the quadratic problem defined by this matrix.\nIn Figure 2, we compare the convergence rate of CCD and RCD for an initial point that corresponds\nto a worst-case (left figure) and for a random choice of an initial point (right figure). We conclude\nthat the asymptotic rate of CCD is faster than that of RCD, demonstrating our results in Theorem 4.12\nand Corollary 4.15.\n\n6 Conclusion\n\nIn this paper, we compare the CCD and RCD methods for quadratic problems, whose Hessian is\na 2-cyclic matrix or an M-matrix. We show by a novel analysis that for these classes of quadratic\nproblems, CCD is always faster than RCD in terms of the asymptotic worst-case rate. 
We also give a\ncharacterization of the best cyclic order to use in the CCD algorithm for these classes of problems and\nshow that with the best cyclic order, CCD enjoys more than a twice faster asymptotic worst-case rate\nwith respect to RCD. We also provide numerical experiments that show the tightness of our results.\n\nAcknowledgments\n\nThis work is supported by NSF DMS-1723085 and DARPA Foundations of Scalable Statistical\nLearning grants.\n\n8\n\n\fReferences\n[1] A. Beck and L. Tetruashvili. On the convergence of block coordinate descent type methods. SIAM Journal\n\non Optimization, 23(4):2037\u20132060, 2013.\n\n[2] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning\n\nfrom labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399\u20132434, 2006.\n\n[3] D. P. Bertsekas. Nonlinear programming. Athena Scienti\ufb01c, 1999.\n[4] D. P. Bertsekas. Convex Optimization Algorithms. Athena Scienti\ufb01c, 2015.\n[5] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-\n\nHall, Inc., 1989.\n\n[6] F. R. K. Chung. Spectral Graph Theory. American Mathematical Society, 1997.\n[7] J. Friedman, T. Hastie, H. H\u00f6\ufb02ing, and R. Tibshirani. Pathwise coordinate optimization. The Annals of\n\nApplied Statistics, 1(2):302\u2013332, 2007.\n\n[8] J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate\n\ndescent. Journal of Statistical Software, 33(1):1\u201322, 2010.\n\n[9] J. F. C. Kingman. A convexity property of positive matrices. The Quarterly Journal of Mathematics,\n\n12(1):283\u2013284, 1961.\n\n[10] S. J. Kirkland and M. Neumann. Group inverses of M-matrices and their applications. CRC Press, 2012.\n[11] Z. Lu and L. Xiao. On the complexity analysis of randomized block-coordinate descent methods. Mathe-\n\nmatical Programming, 152(1):615\u2013642, 2015.\n\n[12] Z.-Q. Luo and P. Tseng. 
On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7\u201335, 1992.\n[13] Z.-Q. Luo and P. Tseng. Error bounds and convergence analysis of feasible descent methods: a general approach. Annals of Operations Research, 46(1):157\u2013178, 1993.\n[14] D. M. Malioutov, J. K. Johnson, and A. S. Willsky. Walk-sums and belief propagation in Gaussian graphical models. Journal of Machine Learning Research, 7:2031\u20132064, 2006.\n[15] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341\u2013362, 2012.\n[16] R. D. Nussbaum. Convexity and log convexity for the spectral radius. Linear Algebra and its Applications, 73(Supplement C):59\u2013122, 1986.\n[17] J. Nutini, M. Schmidt, I. H. Laradji, M. Friedlander, and H. Koepke. Coordinate descent converges faster with the Gauss-Southwell rule than random selection. In Proceedings of the 32nd International Conference on Machine Learning, pages 1632\u20131641, 2015.\n[18] J. Ortega and W. Rheinboldt. Iterative Solution of Nonlinear Equations in Several Variables. Society for Industrial and Applied Mathematics, 2000.\n[19] R. J. Plemmons. M-matrix characterizations. I\u2014nonsingular M-matrices. Linear Algebra and its Applications, 18(2):175\u2013188, 1977.\n[20] A. Rantzer. Distributed control of positive systems. arXiv:1203.0047, 2014.\n[21] P. Richt\u00e1rik and M. Tak\u00e1\u010d. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156(1):433\u2013484, 2016.\n[22] A. Saha and A. Tewari. On the nonasymptotic convergence of cyclic coordinate descent methods. SIAM Journal on Optimization, 23(1):576\u2013601, 2013.\n[23] R. Sun and M. Hong. Improved iteration complexity bounds of cyclic block coordinate descent for convex problems. 
In Advances in Neural Information Processing Systems 28, pages 1306\u20131314, 2015.\n[24] R. Sun and Y. Ye. Worst-case complexity of cyclic coordinate descent: $O(n^2)$ gap with randomized version. arXiv:1604.07130, 2016.\n[25] R. S. Varga. Matrix Iterative Analysis. Springer Science & Business Media, 2009.\n[26] S. J. Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3\u201334, 2015.\n[27] D. M. Young. Iterative Solution of Large Linear Systems. Academic Press, 1971.\n", "award": [], "sourceid": 3507, "authors": [{"given_name": "Mert", "family_name": "Gurbuzbalaban", "institution": "Rutgers University"}, {"given_name": "Asuman", "family_name": "Ozdaglar", "institution": "Massachusetts Institute of Technology"}, {"given_name": "Pablo", "family_name": "Parrilo", "institution": "Massachusetts Institute of Technology"}, {"given_name": "Nuri", "family_name": "Vanli", "institution": "Massachusetts Institute of Technology"}]}