{"title": "Dykstra's Algorithm, ADMM, and Coordinate Descent: Connections, Insights, and Extensions", "book": "Advances in Neural Information Processing Systems", "page_first": 517, "page_last": 528, "abstract": "We study connections between Dykstra's algorithm for projecting onto an intersection of convex sets, the augmented Lagrangian method of multipliers or ADMM, and block coordinate descent. We prove that coordinate descent for a regularized regression problem, in which the penalty is a separable sum of support functions, is exactly equivalent to Dykstra's algorithm applied to the dual problem. ADMM on the dual problem is also seen to be equivalent, in the special case of two sets, with one being a linear subspace. These connections, aside from being interesting in their own right, suggest new ways of analyzing and extending coordinate descent. For example, from existing convergence theory on Dykstra's algorithm over polyhedra, we discern that coordinate descent for the lasso problem converges at an (asymptotically) linear rate. We also develop two parallel versions of coordinate descent, based on the Dykstra and ADMM connections.", "full_text": "Dykstra\u2019s Algorithm, ADMM, and Coordinate\nDescent: Connections, Insights, and Extensions\n\nDepartment of Statistics and Machine Learning Department\n\nRyan J. Tibshirani\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nryantibs@stat.cmu.edu\n\nAbstract\n\nWe study connections between Dykstra\u2019s algorithm for projecting onto an intersec-\ntion of convex sets, the augmented Lagrangian method of multipliers or ADMM,\nand block coordinate descent. We prove that coordinate descent for a regularized\nregression problem, in which the penalty is a separable sum of support functions,\nis exactly equivalent to Dykstra\u2019s algorithm applied to the dual problem. ADMM\non the dual problem is also seen to be equivalent, in the special case of two sets,\nwith one being a linear subspace. 
These connections, aside from being interesting\nin their own right, suggest new ways of analyzing and extending coordinate descent.\nFor example, from existing convergence theory on Dykstra\u2019s algorithm over\npolyhedra, we discern that coordinate descent for the lasso problem converges at\nan (asymptotically) linear rate. We also develop two parallel versions of coordinate\ndescent, based on the Dykstra and ADMM connections.\n\n1 Introduction\n\nIn this paper, we study two seemingly unrelated but closely connected convex optimization problems,\nand associated algorithms. The first is the best approximation problem: given closed, convex sets\nC1, . . . , Cd \u2286 Rn and y \u2208 Rn, we seek the point in C1 \u2229 \u00b7\u00b7\u00b7 \u2229 Cd (assumed nonempty) closest to y,\nand solve\n\n$$\\min_{u \\in \\mathbb{R}^n} \\; \\|y - u\\|_2^2 \\quad \\text{subject to} \\quad u \\in C_1 \\cap \\cdots \\cap C_d. \\tag{1}$$\n\nThe second problem is the regularized regression problem: given a response y \u2208 Rn and predictors\nX \u2208 Rn\u00d7p, and a block decomposition Xi \u2208 Rn\u00d7pi, i = 1, . . . , d of the columns of X (i.e., these\ncould be columns, or groups of columns), we build a working linear model by applying blockwise\nregularization over the coefficients, and solve\n\n$$\\min_{w \\in \\mathbb{R}^p} \\; \\frac{1}{2} \\|y - Xw\\|_2^2 + \\sum_{i=1}^d h_i(w_i), \\tag{2}$$\n\nwhere hi : Rpi \u2192 R, i = 1, . . . , d are convex functions, and we write wi \u2208 Rpi, i = 1, . . . , d for the\nappropriate block decomposition of a coefficient vector w \u2208 Rp (so that $Xw = \\sum_{i=1}^d X_i w_i$).\n\nTwo well-studied algorithms for problems (1), (2) are Dykstra\u2019s algorithm (Dykstra, 1983; Boyle and\nDykstra, 1986) and (block) coordinate descent (Warga, 1963; Bertsekas and Tsitsiklis, 1989; Tseng,\n1990), respectively. 
The jumping-off point for our work in this paper is the following fact: these two\nalgorithms are equivalent for solving (1) and (2). That is, for a particular relationship between the\nsets C1, . . . , Cd and penalty functions h1, . . . , hd, the problems (1) and (2) are duals of each other,\nand Dykstra\u2019s algorithm on the primal problem (1) is exactly the same as coordinate descent on the\ndual problem (2). We provide details in Section 2.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fThis equivalence between Dykstra\u2019s algorithm and coordinate descent can be essentially found in\nthe optimization literature, dating back to the late 1980s, and possibly earlier. (We say \u201cessentially\u201d\nhere because, to our knowledge, this equivalence has not been stated for a general regression matrix\nX, and only in the special case X = I; but, in truth, the extension to a general matrix X is fairly\nstraightforward.) Though this equivalence has been cited and discussed in various ways over the\nyears, we feel that it is not as well-known as it should be, especially in light of the recent resurgence\nof interest in coordinate descent methods. We revisit the connection between Dykstra\u2019s algorithm\nand coordinate descent, and draw further connections to a third method\u2014the augmented Lagrangian\nmethod of multipliers or ADMM (Glowinski and Marroco, 1975; Gabay and Mercier, 1976)\u2014that\nhas also received a great deal of attention recently. While these basic connections are interesting in\ntheir own right, they also have important implications for analyzing and extending coordinate descent.\nBelow we give a summary of our contributions.\n\n1. We prove in Section 2 (under a particular relationship between C1, . . . , Cd and h1, . . . , hd)\nthat Dykstra\u2019s algorithm for (1) is equivalent to block coordinate descent for (2). 
(This is a\nmild generalization of the previously known connection when X = I.)\n\n2. We also show in Section 2 that ADMM is closely connected to Dykstra\u2019s algorithm, in that\n\nADMM for (1), when d = 2 and C1 is a linear subspace, matches Dykstra\u2019s algorithm.\n\n3. Leveraging existing results on the convergence of Dykstra\u2019s algorithm for an intersection of\nhalfspaces, we establish in Section 3 that coordinate descent for the lasso problem has an\n(asymptotically) linear rate of convergence, regardless of the dimensions of X (i.e., without\nassumptions about strong convexity of the problem). We derive two different explicit forms\nfor the error constant, which shed light onto how correlations among the predictor variables\naffect the speed of convergence.\n\n4. Appealing to parallel versions of Dykstra\u2019s algorithm and ADMM, we present in Section 4\ntwo parallel versions of coordinate descent (each guaranteed to converge in full generality).\n5. We extend in Section 5 the equivalence between coordinate descent and Dykstra\u2019s algorithm\nto the case of nonquadratic loss in (2), i.e., non-Euclidean projection in (1). This leads to a\nDykstra-based parallel version of coordinate descent for (separably regularized) problems\nwith nonquadratic loss, and we also derive an alternative ADMM-based parallel version of\ncoordinate descent for the same class of problems.\n\n2 Preliminaries and connections\n\nDykstra\u2019s algorithm. Dykstra\u2019s algorithm was \ufb01rst proposed by Dykstra (1983), and was extended\nto Hilbert spaces by Boyle and Dykstra (1986). Since these seminal papers, a number of works have\nanalyzed and extended Dykstra\u2019s algorithm in various interesting ways. 
We will reference many of\nthese works in the coming sections, when we discuss connections between Dykstra\u2019s algorithm and\nother methods; for other developments, see the comprehensive books Deutsch (2001); Bauschke and\nCombettes (2011) and review article Bauschke and Koch (2013).\nDykstra\u2019s algorithm for the best approximation problem (1) can be described as follows. We initialize\nu(0) = y, z(\u2212d+1) = \u00b7\u00b7\u00b7 = z(0) = 0, and then repeat, for k = 1, 2, 3, . . .:\n\n$$\\begin{array}{l} u^{(k)} = P_{C_{[k]}}(u^{(k-1)} + z^{(k-d)}), \\\\ z^{(k)} = u^{(k-1)} + z^{(k-d)} - u^{(k)}, \\end{array} \\tag{3}$$\n\nwhere $P_C(x) = \\mathrm{argmin}_{c \\in C} \\|x - c\\|_2^2$ denotes the (Euclidean) projection of x onto a closed, convex\nset C, and [\u00b7] denotes the modulo operator taking values in {1, . . . , d}. What differentiates Dykstra\u2019s\nalgorithm from the classical alternating projections method of von Neumann (1950); Halperin (1962)\nis the sequence of (what we may call) dual variables z(k), k = 1, 2, 3, . . .. These track, in a cyclic\nfashion, the residuals from projecting onto C1, . . . , Cd. The simpler alternating projections method\nwill always converge to a feasible point in C1 \u2229 \u00b7\u00b7\u00b7 \u2229 Cd, but will not necessarily converge to the\nsolution in (1) unless C1, . . . , Cd are subspaces (in which case alternating projections and Dykstra\u2019s\nalgorithm coincide). Meanwhile, Dykstra\u2019s algorithm converges in general (for any closed, convex\nsets C1, . . . , Cd with nonempty intersection, see, e.g., Boyle and Dykstra (1986); Han (1988); Gaffke\nand Mathar (1989)). We note that Dykstra\u2019s algorithm (3) can be rewritten in a different form, which\nwill be helpful for future comparisons. First, we initialize $u_d^{(0)} = y$, $z_1^{(0)} = \\cdots = z_d^{(0)} = 0$, and then\nrepeat, for k = 1, 2, 3, . . .:\n\n$$u_0^{(k)} = u_d^{(k-1)},$$\n$$\\left. \\begin{array}{l} u_i^{(k)} = P_{C_i}(u_{i-1}^{(k)} + z_i^{(k-1)}), \\\\ z_i^{(k)} = u_{i-1}^{(k)} + z_i^{(k-1)} - u_i^{(k)}, \\end{array} \\right\\} \\quad \\text{for } i = 1, \\ldots, d. \\tag{4}$$\n\nCoordinate descent. Coordinate descent methods have a long history in optimization, and have\nbeen studied and discussed in early papers and books such as Warga (1963); Ortega and Rheinboldt\n(1970); Luenberger (1973); Auslender (1976); Bertsekas and Tsitsiklis (1989), though coordinate\ndescent was still likely in use much earlier. (Of course, for solving linear systems, coordinate descent\nreduces to Gauss-Seidel iterations, which date back to the 1800s.) Some key papers analyzing the\nconvergence of coordinate descent methods are Tseng and Bertsekas (1987); Tseng (1990); Luo and\nTseng (1992, 1993); Tseng (2001). In the last 10 or 15 years, considerable interest in coordinate\ndescent has developed across the optimization community. With the flurry of recent work, it would be\ndifficult to give a thorough account of the recent progress on the topic. To give just a few examples,\nrecent developments include finite-time (nonasymptotic) convergence rates for coordinate descent,\nand exciting extensions such as accelerated, parallel, and distributed versions of coordinate descent.\nWe refer to Wright (2015), an excellent survey that describes this recent progress.\nIn (block) coordinate descent$^1$ for (2), we initialize say w(0) = 0, and repeat, for k = 1, 2, 3, . . .:\n\n$$w_i^{(k)} = \\mathop{\\mathrm{argmin}}_{w_i \\in \\mathbb{R}^{p_i}} \\; \\frac{1}{2} \\Big\\| y - \\sum_{j < i} X_j w_j^{(k)} - \\sum_{j > i} X_j w_j^{(k-1)} - X_i w_i \\Big\\|_2^2 + h_i(w_i), \\quad i = 1, \\ldots, d. \\tag{5}$$\n\nWe assume here and throughout that Xi \u2208 Rn\u00d7pi, i = 1, . . . 
, d each have full column rank so that\nthe updates in (5) are uniquely defined (this is used for convenience, and is not a strong assumption;\nnote that there is no restriction on the dimensionality of the full problem in (2), i.e., we could still\nhave X \u2208 Rn\u00d7p with $p \\gg n$). The precise form of these updates, of course, depends on the penalty\nfunctions. Suppose that each hi is the support function of a closed, convex set Di \u2286 Rpi, i.e.,\n\n$$h_i(v) = \\max_{d \\in D_i} \\langle d, v \\rangle, \\quad \\text{for } i = 1, \\ldots, d.$$\n\nSuppose also that $C_i = (X_i^T)^{-1}(D_i) = \\{v \\in \\mathbb{R}^n : X_i^T v \\in D_i\\}$, the inverse image of $D_i$ under the\nlinear map $X_i^T$, for i = 1, . . . , d. Then, perhaps surprisingly, it turns out that the coordinate descent\niterations (5) are exactly the same as the Dykstra iterations (4), via a duality argument. We extract\nthe key relationship as a lemma below, for future reference, and then state the formal equivalence.\nProofs of these results, as with all results in this paper, are given in the supplement.\nLemma 1. Assume that Xi \u2208 Rn\u00d7pi has full column rank and $h_i(v) = \\max_{d \\in D_i} \\langle d, v \\rangle$ for a closed,\nconvex set Di \u2286 Rpi. Then for $C_i = (X_i^T)^{-1}(D_i) \\subseteq \\mathbb{R}^n$ and any b \u2208 Rn,\n\n$$\\hat{w}_i = \\mathop{\\mathrm{argmin}}_{w_i \\in \\mathbb{R}^{p_i}} \\; \\frac{1}{2} \\|b - X_i w_i\\|_2^2 + h_i(w_i) \\iff X_i \\hat{w}_i = (\\mathrm{Id} - P_{C_i})(b),$$\n\nwhere Id(\u00b7) denotes the identity mapping.\nTheorem 1. Assume the setup in Lemma 1, for each i = 1, . . . , d. Then problems (1), (2) are dual to\neach other, and their solutions, denoted $\\hat{u}$, $\\hat{w}$, respectively, satisfy $\\hat{u} = y - X\\hat{w}$. Further, Dykstra\u2019s\nalgorithm (4) and coordinate descent (5) are equivalent, and satisfy at all iterations k = 1, 2, 3, . . .:\n\n$$z_i^{(k)} = X_i w_i^{(k)} \\quad \\text{and} \\quad u_i^{(k)} = y - \\sum_{j \\leq i} X_j w_j^{(k)} - \\sum_{j > i} X_j w_j^{(k-1)}, \\quad \\text{for } i = 1, \\ldots, d.$$\n\nThe equivalence between coordinate descent and Dykstra\u2019s algorithm dates back to (at least) Han\n(1988); Gaffke and Mathar (1989), under the special case X = I. In fact, Han (1988), presumably\nunaware of Dykstra\u2019s algorithm, seems to have reinvented the method and established convergence\nthrough its relationship to coordinate descent. This work then inspired Tseng (1993) (who must have\nalso been unaware of Dykstra\u2019s algorithm) to improve the existing analyses of coordinate descent,\nwhich at the time all assumed smoothness of the objective function. (Tseng continued on to become\narguably the single most important contributor to the theory of coordinate descent of the 1990s and\n2000s, and his seminal work Tseng (2001) is still one of the most comprehensive analyses to date.)\nReferences to this equivalence can be found speckled throughout the literature on Dykstra\u2019s method,\nbut given the importance of the regularized problem form (2) for modern statistical and machine\nlearning estimation tasks, we feel that the connection between Dykstra\u2019s algorithm and coordinate\ndescent is not well-known enough and should be better explored. In what follows, we show that\nsome old work on Dykstra\u2019s algorithm, fed through this equivalence, yields new convergence results\nfor coordinate descent for the lasso and a new parallel version of coordinate descent.\n\n$^1$To be precise, this is cyclic coordinate descent, where exact minimization is performed along each block of\ncoordinates. Randomized versions of this algorithm have recently become popular, as have inexact or proximal\nversions. While these variants are interesting, they are not the focus of our paper.\n\nADMM. The augmented Lagrangian method of multipliers or ADMM was invented by Glowinski\nand Marroco (1975); Gabay and Mercier (1976). 
ADMM is a member of a class of methods generally\ncalled operator splitting techniques, and is equivalent (via a duality argument) to Douglas-Rachford\nsplitting (Douglas and Rachford, 1956; Lions and Mercier, 1979). Recently, there has been a strong\nrevival of interest in ADMM (and operator splitting techniques in general), arguably due (at least in\npart) to the popular monograph of Boyd et al. (2011), where it is argued that the ADMM framework\noffers an appealing flexibility in algorithm design, which permits parallelization in many nontrivial\nsituations. As with coordinate descent, it would be difficult to thoroughly describe recent developments\non ADMM, given the magnitude and pace of the literature on this topic. To give just a few examples,\nrecent progress includes finite-time linear convergence rates for ADMM (see Nishihara et al. 2015;\nHong and Luo 2017 and references therein), and accelerated extensions of ADMM (see Goldstein\net al. 2014; Kadkhodaie et al. 2015 and references therein).\nTo derive an ADMM algorithm for (1), we introduce auxiliary variables and equality constraints to\nput the problem in a suitable ADMM form. While different formulations for the auxiliary variables\nand constraints give rise to different algorithms, loosely speaking, these algorithms generally take on\nsimilar forms to Dykstra\u2019s algorithm for (1). The same is also true of ADMM for the set intersection\nproblem, a simpler task than the best approximation problem (1), in which we only seek a point in\nthe intersection C1 \u2229 \u00b7\u00b7\u00b7 \u2229 Cd, and solve\n\n$$\\min_{u \\in \\mathbb{R}^n} \\; \\sum_{i=1}^d I_{C_i}(u_i), \\tag{6}$$\n\nwhere IC(\u00b7) denotes the indicator function of a set C (equal to 0 on C, and \u221e otherwise). Consider\nthe case of d = 2 sets, in which case the translation of (6) into ADMM form is unambiguous. ADMM\nfor (6), properly initialized, appears highly similar to Dykstra\u2019s algorithm for (1); so similar, in fact,\nthat Boyd et al. (2011) mistook the two algorithms for being equivalent, which is not generally true,\nand was shortly thereafter corrected by Bauschke and Koch (2013).\nBelow we show that when d = 2, C1 is a linear subspace, and y \u2208 C1, an ADMM algorithm for (1)\n(and not the simpler set intersection problem (6)) is indeed equivalent to Dykstra\u2019s algorithm for (1).\nIntroducing auxiliary variables, the problem (1) becomes\n\n$$\\min_{u_1, u_2 \\in \\mathbb{R}^n} \\; \\|y - u_1\\|_2^2 + I_{C_1}(u_1) + I_{C_2}(u_2) \\quad \\text{subject to} \\quad u_1 = u_2.$$\n\nThe augmented Lagrangian is $L(u_1, u_2, z) = \\|y - u_1\\|_2^2 + I_{C_1}(u_1) + I_{C_2}(u_2) + \\rho \\|u_1 - u_2 + z\\|_2^2 - \\rho \\|z\\|_2^2$,\nwhere \u03c1 > 0 is an augmented Lagrangian parameter. ADMM repeats, for k = 1, 2, 3, . . .:\n\n$$\\begin{array}{l} u_1^{(k)} = P_{C_1}\\left( \\dfrac{y}{1 + \\rho} + \\dfrac{\\rho (u_2^{(k-1)} - z^{(k-1)})}{1 + \\rho} \\right), \\\\ u_2^{(k)} = P_{C_2}(u_1^{(k)} + z^{(k-1)}), \\\\ z^{(k)} = z^{(k-1)} + u_1^{(k)} - u_2^{(k)}. \\end{array} \\tag{7}$$\n\nSuppose we initialize $u_2^{(0)} = y$, z(0) = 0, and set \u03c1 = 1. Using linearity of PC1, the fact that y \u2208 C1,\nand a simple inductive argument, the above iterations can be rewritten as\n\n$$\\begin{array}{l} u_1^{(k)} = P_{C_1}(u_2^{(k-1)}), \\\\ u_2^{(k)} = P_{C_2}(u_1^{(k)} + z^{(k-1)}), \\\\ z^{(k)} = z^{(k-1)} + u_1^{(k)} - u_2^{(k)}, \\end{array} \\tag{8}$$\n\nwhich is precisely the same as Dykstra\u2019s iterations (4), once we realize that, due again to linearity of\nPC1, the sequence $z_1^{(k)}$, k = 1, 2, 3, . . . in Dykstra\u2019s iterations plays no role and can be ignored.\nThough d = 2 sets in (1) may seem like a rather special case, the strategy for parallelization in both\nDykstra\u2019s algorithm and ADMM stems from rewriting a general d-set problem as a 2-set problem, so\nthe above connection between Dykstra\u2019s algorithm and ADMM can be relevant even for problems\nwith d > 2, and will reappear in our later discussion of parallel coordinate descent. As a matter of\nconceptual interest only, we note that for general d (and no constraints on the sets being subspaces),\nDykstra\u2019s iterations (4) can be viewed as a limiting version of the ADMM iterations either for (1) or\nfor (6), as we send the augmented Lagrangian parameters to \u221e or to 0 at particular scalings. See the\nsupplement for details.\n\n3 Coordinate descent for the lasso\nThe lasso problem (Tibshirani, 1996; Chen et al., 1998), defined for a tuning parameter \u03bb \u2265 0 as\n\n$$\\min_{w \\in \\mathbb{R}^p} \\; \\frac{1}{2} \\|y - Xw\\|_2^2 + \\lambda \\|w\\|_1, \\tag{9}$$\n\nis a special case of (2) where the coordinate blocks are each of size 1, so that Xi \u2208 Rn, i = 1, . . . , p\nare just the columns of X, and wi \u2208 R, i = 1, . . . , p are the components of w. This problem fits into\nthe framework of (2) with $h_i(w_i) = \\lambda |w_i| = \\max_{d \\in D_i} d w_i$ for $D_i = [-\\lambda, \\lambda]$, for each i = 1, . . . , d.\nCoordinate descent is widely used for the lasso (9), both because of the simplicity of the coordinatewise\nupdates, which reduce to soft-thresholding, and because careful implementations can achieve\nstate-of-the-art performance, at the right problem sizes. The use of coordinate descent for the lasso\nwas popularized by Friedman et al. (2007, 2010), but was studied earlier or concurrently by several\nothers, e.g., Fu (1998); Sardy et al. (2000); Wu and Lange (2008).\nAs we know from Theorem 1, the dual of problem (9) is the best approximation problem (1), where\n$C_i = (X_i^T)^{-1}(D_i) = \\{v \\in \\mathbb{R}^n : |X_i^T v| \\leq \\lambda\\}$ is an intersection of two halfspaces, for i = 1, . . . , p.\nThis makes C1 \u2229 \u00b7\u00b7\u00b7 \u2229 Cd an intersection of 2p halfspaces, i.e., a (centrally symmetric) polyhedron.\nFor projecting onto a polyhedron, it is well-known that Dykstra\u2019s algorithm reduces to Hildreth\u2019s\nalgorithm (Hildreth, 1957), an older method for quadratic programming that itself has an interesting\nhistory in optimization. Theorem 1 hence shows coordinate descent for the lasso (9) is equivalent not\nonly to Dykstra\u2019s algorithm, but also to Hildreth\u2019s algorithm, for (1).\nThis equivalence suggests a number of interesting directions to consider. For example, key practical\nspeedups have been developed for coordinate descent for the lasso that enable this method to attain\nstate-of-the-art performance at the right problem sizes, such as clever updating rules and screening\nrules (e.g., Friedman et al. 2010; El Ghaoui et al. 2012; Tibshirani et al. 2012; Wang et al. 2015).\nThese implementation tricks can now be used with Dykstra\u2019s (Hildreth\u2019s) algorithm. On the flip side,\nas we show next, older results from Iusem and De Pierro (1990); Deutsch and Hundal (1994) on\nDykstra\u2019s algorithm for polyhedra lead to interesting new results on coordinate descent for the lasso.\nTheorem 2 (Adaptation of Iusem and De Pierro 1990). Assume the columns of X \u2208 Rn\u00d7p are in\ngeneral position, and \u03bb > 0. 
Then coordinate descent for the lasso (9) has an asymptotically linear\nconvergence rate, in that for large enough k,\n\n$$\\frac{\\|w^{(k+1)} - \\hat{w}\\|_\\Sigma}{\\|w^{(k)} - \\hat{w}\\|_\\Sigma} \\leq \\left( \\frac{a^2}{a^2 + \\lambda_{\\min}(X_A^T X_A) / \\max_{i \\in A} \\|X_i\\|_2^2} \\right)^{1/2}, \\tag{10}$$\n\nwhere $\\hat{w}$ is the lasso solution in (9), $\\Sigma = X^T X$, and $\\|z\\|_\\Sigma^2 = z^T \\Sigma z$ for $z \\in \\mathbb{R}^p$, $A = \\mathrm{supp}(\\hat{w})$ is\nthe active set of $\\hat{w}$, $a = |A|$ is its size, $X_A \\in \\mathbb{R}^{n \\times a}$ denotes the columns of X indexed by A, and\n$\\lambda_{\\min}(X_A^T X_A)$ denotes the smallest eigenvalue of $X_A^T X_A$.\nTheorem 3 (Adaptation of Deutsch and Hundal 1994). Assume the same conditions and notation\nas in Theorem 2. Then for large enough k,\n\n$$\\frac{\\|w^{(k+1)} - \\hat{w}\\|_\\Sigma}{\\|w^{(k)} - \\hat{w}\\|_\\Sigma} \\leq \\left( 1 - \\prod_{j=1}^{a-1} \\frac{\\|P^\\perp_{\\{i_{j+1},\\ldots,i_a\\}} X_{i_j}\\|_2^2}{\\|X_{i_j}\\|_2^2} \\right)^{1/2}, \\tag{11}$$\n\nwhere we enumerate A = {i1, . . . , ia}, i1 < . . . < ia, and we denote by $P^\\perp_{\\{i_{j+1},\\ldots,i_a\\}}$ the projection\nonto the orthocomplement of the column span of $X_{\\{i_{j+1},\\ldots,i_a\\}}$.\n\nThe results in Theorems 2, 3 both rely on the assumption of general position for the columns of X.\nThis is only used for convenience and can be removed at the expense of more complicated notation.\nLoosely put, the general position condition simply rules out trivial linear dependencies between small\nnumbers of columns of X, but places no restriction on the dimensions of X (i.e., it still allows for\n$p \\gg n$). It implies that the lasso solution $\\hat{w}$ is unique, and that XA (where $A = \\mathrm{supp}(\\hat{w})$) has full\ncolumn rank. See Tibshirani (2013) for a precise definition of general position and proofs of these\nfacts. 
We note that when XA has full column rank, the bounds in (10), (11) are strictly less than 1.\nRemark 1 (Comparing (10) and (11)). Clearly, both the bounds in (10), (11) are adversely affected\nby correlations among Xi, i \u2208 A (i.e., stronger correlations will bring each closer to 1). It seems to\nus that (11) is usually the smaller of the two bounds, based on simple mathematical and numerical\ncomparisons. More detailed comparisons would be interesting, but are beyond the scope of this paper.\nRemark 2 (Linear convergence without strong convexity). One striking feature of the results in\nTheorems 2, 3 is that they guarantee (asymptotically) linear convergence of the coordinate descent\niterates for the lasso, with no assumption about strong convexity of the objective. More precisely,\nthere are no restrictions on the dimensionality of X, so we enjoy linear convergence even without an\nassumption on the smooth part of the objective. This is in line with classical results on coordinate\ndescent for smooth functions, see, e.g., Luo and Tseng (1992). The modern finite-time convergence\nanalyses of coordinate descent do not, as far as we understand, replicate this remarkable property.\nFor example, Beck and Tetruashvili (2013); Li et al. (2016) establish finite-time linear convergence\nrates for coordinate descent, but require strong convexity of the entire objective.\nRemark 3 (Active set identification). The asymptotics developed in Iusem and De Pierro (1990);\nDeutsch and Hundal (1994) are based on a notion of (in)active set identification: the critical value of\nk after which (10), (11) hold is based on the (provably finite) iteration number at which Dykstra\u2019s\nalgorithm identifies the inactive halfspaces, i.e., at which coordinate descent identifies the inactive\nset of variables, $A^c = \\mathrm{supp}(\\hat{w})^c$. 
This might help explain why in practice coordinate descent for the\nlasso performs exceptionally well with warm starts, over a decreasing sequence of tuning parameter\nvalues \u03bb (e.g., Friedman et al. 2007, 2010): here, each coordinate descent run is likely to identify the\n(in)active set, and hence enter the linear convergence phase, at an early iteration number.\n\n4 Parallel coordinate descent\n\nParallel-Dykstra-CD. An important consequence of the connection between Dykstra\u2019s algorithm\nand coordinate descent is a new parallel version of the latter, stemming from an old parallel version\nof the former. A parallel version of Dykstra\u2019s algorithm is usually credited to Iusem and Pierro (1987)\nfor polyhedra and Gaffke and Mathar (1989) for general sets, but really the idea dates back to the\nproduct space formalization of Pierra (1984). We rewrite problem (1) as\n\n$$\\min_{u = (u_1, \\ldots, u_d) \\in \\mathbb{R}^{nd}} \\; \\sum_{i=1}^d \\gamma_i \\|y - u_i\\|_2^2 \\quad \\text{subject to} \\quad u \\in C_0 \\cap (C_1 \\times \\cdots \\times C_d), \\tag{12}$$\n\nwhere C0 = {(u1, . . . , ud) \u2208 Rnd : u1 = \u00b7\u00b7\u00b7 = ud}, and \u03b31, . . . , \u03b3d > 0 are weights that sum to 1.\nAfter rescaling appropriately to turn (12) into an unweighted best approximation problem, we can\napply Dykstra\u2019s algorithm, which sets $u_1^{(0)} = \\cdots = u_d^{(0)} = y$, $z_1^{(0)} = \\cdots = z_d^{(0)} = 0$, and repeats:\n\n$$u_0^{(k)} = \\sum_{i=1}^d \\gamma_i u_i^{(k-1)},$$\n$$\\left. \\begin{array}{l} u_i^{(k)} = P_{C_i}(u_0^{(k)} + z_i^{(k-1)}), \\\\ z_i^{(k)} = u_0^{(k)} + z_i^{(k-1)} - u_i^{(k)}, \\end{array} \\right\\} \\quad \\text{for } i = 1, \\ldots, d, \\tag{13}$$\n\nfor k = 1, 2, 3, . . .. The steps enclosed in the curly brace above can all be performed in parallel, so that\n(13) is a parallel version of Dykstra\u2019s algorithm (4) for (1). Applying Lemma 1, and a straightforward\ninductive argument, the above algorithm can be rewritten as follows. 
We set w(0) = 0, and repeat:\n\n$$w_i^{(k)} = \\mathop{\\mathrm{argmin}}_{w_i \\in \\mathbb{R}^{p_i}} \\; \\frac{1}{2} \\Big\\| y - Xw^{(k-1)} + X_i w_i^{(k-1)}/\\gamma_i - X_i w_i/\\gamma_i \\Big\\|_2^2 + h_i(w_i/\\gamma_i), \\quad i = 1, \\ldots, d, \\tag{14}$$\n\nfor k = 1, 2, 3, . . ., which we call parallel-Dykstra-CD (with CD being short for coordinate descent).\nAgain, note that each of the d coordinate updates in (14) can be performed in parallel, so that\n(14) is a parallel version of coordinate descent (5) for (2). Also, as (14) is just a reparametrization of\nDykstra\u2019s algorithm (13) for the 2-set problem (12), it is guaranteed to converge in full generality, as\nper the standard results on Dykstra\u2019s algorithm (Han, 1988; Gaffke and Mathar, 1989).\nTheorem 4. Assume that Xi \u2208 Rn\u00d7pi has full column rank and $h_i(v) = \\max_{d \\in D_i} \\langle d, v \\rangle$ for a closed,\nconvex set Di \u2286 Rpi, for i = 1, . . . , d. If (2) has a unique solution, then the iterates in (14) converge\nto this solution. More generally, if the interior of $\\cap_{i=1}^d (X_i^T)^{-1}(D_i)$ is nonempty, then the sequence\nw(k), k = 1, 2, 3, . . . from (14) has at least one accumulation point, and any such point solves (2).\nFurther, Xw(k), k = 1, 2, 3, . . . converges to $X\\hat{w}$, the optimal fitted value in (2).\n\nThere have been many recent exciting contributions to the parallel coordinate descent literature; two\nstandouts are Jaggi et al. (2014); Richtarik and Takac (2016), and numerous others are described in\nWright (2015). What sets parallel-Dykstra-CD apart, perhaps, is its simplicity: convergence of the\niterations (14), given in Theorem 4, just stems from the connection between coordinate descent and\nDykstra\u2019s algorithm, and the fact that the parallel Dykstra iterations (13) are nothing more than the\nusual Dykstra iterations after a product space reformulation. 
Moreover, parallel-Dykstra-CD for the\nlasso enjoys an (asymptotic) linear convergence rate under essentially no assumptions, thanks once\nagain to an old result on the parallel Dykstra (Hildreth) algorithm from Iusem and De Pierro (1990).\nThe details can be found in the supplement.\n\nParallel-ADMM-CD. As an alternative to the parallel method derived using Dykstra\u2019s algorithm,\nADMM can also offer a version of parallel coordinate descent. Since (12) is a best approximation\nproblem with d = 2 sets, we can refer back to our earlier ADMM algorithm in (7) for this problem.\nBy passing these ADMM iterations through the connection developed in Lemma 1, we arrive at what\nwe call parallel-ADMM-CD, which initializes $u_0^{(0)} = y$, w(\u22121) = w(0) = 0, and repeats:\n\n$$u_0^{(k)} = \\frac{(\\sum_{i=1}^d \\rho_i) u_0^{(k-1)}}{1 + \\sum_{i=1}^d \\rho_i} + \\frac{y - Xw^{(k-1)}}{1 + \\sum_{i=1}^d \\rho_i} + \\frac{X(w^{(k-2)} - w^{(k-1)})}{1 + \\sum_{i=1}^d \\rho_i},$$\n$$w_i^{(k)} = \\mathop{\\mathrm{argmin}}_{w_i \\in \\mathbb{R}^{p_i}} \\; \\frac{1}{2} \\Big\\| u_0^{(k)} + X_i w_i^{(k-1)}/\\rho_i - X_i w_i/\\rho_i \\Big\\|_2^2 + h_i(w_i/\\rho_i), \\quad i = 1, \\ldots, d, \\tag{15}$$\n\nfor k = 1, 2, 3, . . ., where \u03c11, . . . , \u03c1d > 0 are augmented Lagrangian parameters. In each iteration,\nthe updates to $w_i^{(k)}$, i = 1, . . . , d above can be done in parallel. Just based on their form, it seems\nthat (15) can be seen as a parallel version of coordinate descent (5) for problem (2). The next result\nconfirms this, leveraging standard theory for ADMM (Gabay, 1983; Eckstein and Bertsekas, 1992).\nTheorem 5. Assume that Xi \u2208 Rn\u00d7pi has full column rank and $h_i(v) = \\max_{d \\in D_i} \\langle d, v \\rangle$ for a closed,\nconvex set Di \u2286 Rpi, for i = 1, . . . , d. Then the sequence w(k), k = 1, 2, 3, . . . 
in (15) converges to\na solution in (2).\n\nThe parallel-ADMM-CD iterations in (15) and parallel-Dykstra-CD iterations in (14) differ in that,\nwhere the latter uses a residual y \u2212 Xw(k\u22121), the former uses an iterate $u_0^{(k)}$ that seems to have a\nmore complicated form, being a convex combination of $u_0^{(k-1)}$ and y \u2212 Xw(k\u22121), plus a quantity\nthat acts like a momentum term. It turns out that when \u03c11, . . . , \u03c1d sum to 1, the two methods (14),\n(15) are exactly the same. While this may seem like a surprising coincidence, it is in fact nothing\nmore than a reincarnation of the previously established equivalence between Dykstra\u2019s algorithm (4)\nand ADMM (8) for a 2-set best approximation problem, as here C0 is a linear subspace.\nOf course, with ADMM we need not choose probability weights for \u03c11, . . . , \u03c1d, and the convergence\nin Theorem 5 is guaranteed for any fixed values of these parameters. Thus, even though they were\nderived from different perspectives, parallel-ADMM-CD subsumes parallel-Dykstra-CD, and it is a\nstrictly more general approach. It is important to note that larger values of \u03c11, . . . , \u03c1d can often lead\nto faster convergence in practice, as we show in Figure 1. More detailed study and comparisons to\nrelated parallel methods are worthwhile, but are beyond the scope of this work.\n\n5 Discussion and extensions\n\nWe studied connections between Dykstra\u2019s algorithm, ADMM, and coordinate descent. Leveraging\nthese connections, we established an (asymptotically) linear convergence rate for coordinate descent\nfor the lasso, as well as two parallel versions of coordinate descent (one based on Dykstra\u2019s algorithm\nand the other on ADMM). 
Some extensions and possibilities for future work are described below.\n\nFigure 1: Suboptimality curves for serial coordinate descent, parallel-Dykstra-CD, and three tunings\nof parallel-ADMM-CD (i.e., three different values of $\\rho = \\sum_{i=1}^p \\rho_i$), each run over the same 30 lasso\nproblems with $n = 100$ and $p = 500$. For details of the experimental setup, see the supplement.\n[Figure 1 plot omitted. Left panel (No parallelization): suboptimality versus actual iteration number. Right panel (10% parallelization): suboptimality versus effective iteration number. Curves: Coordinate descent, Par-Dykstra-CD, and Par-ADMM-CD with rho = 10, 50, 200.]\n\nNonquadratic loss: Dykstra\u2019s algorithm and coordinate descent. Given a convex function $f$, a\ngeneralization of (2) is the regularized estimation problem\n\n$$\\min_{w \\in \\mathbb{R}^p} \\; f(Xw) + \\sum_{i=1}^d h_i(w_i). \\qquad (16)$$\n\nRegularized regression (2) is given by $f(z) = \\frac{1}{2}\\|y - z\\|_2^2$, and e.g., regularized classification (under\nthe logistic loss) by $f(z) = -y^T z + \\sum_{i=1}^n \\log(1 + e^{z_i})$. In (block) coordinate descent for (16), we\ninitialize say $w^{(0)} = 0$, and repeat, for $k = 1, 2, 3, \\ldots$:\n\n$$w_i^{(k)} = \\mathop{\\mathrm{argmin}}_{w_i \\in \\mathbb{R}^{p_i}} \\; f\\Big(\\sum_{j < i} X_j w_j^{(k)} + \\sum_{j > i} X_j w_j^{(k-1)} + X_i w_i\\Big) + h_i(w_i), \\quad i = 1, \\ldots, d. \\qquad (17)$$\n\nOn the other hand, given a differentiable and strictly convex function $g$, we can generalize (1) to the\nfollowing best Bregman-approximation problem,\n\n$$\\min_{u \\in \\mathbb{R}^n} \\; D_g(u, b) \\quad \\text{subject to} \\quad u \\in C_1 \\cap \\cdots \\cap C_d, \\qquad (18)$$\n\nwhere $D_g(u, b) = g(u) - g(b) - \\langle \\nabla g(b), u - b \\rangle$ is the Bregman divergence between $u$ and $b$ with\nrespect to $g$. When $g(v) = \\frac{1}{2}\\|v\\|_2^2$ (and $b = y$), this recovers the best approximation problem (1). As\nshown in Censor and Reich (1998); Bauschke and Lewis (2000), Dykstra\u2019s algorithm can be extended\nto apply to (18). We initialize $u_d^{(0)} = b$, $z_1^{(0)} = \\cdots = z_d^{(0)} = 0$, and repeat for $k = 1, 2, 3, \\ldots$:\n\n$$u_0^{(k)} = u_d^{(k-1)},$$\n$$u_i^{(k)} = (P^g_{C_i} \\circ \\nabla g^*)\\big(\\nabla g(u_{i-1}^{(k)}) + z_i^{(k-1)}\\big), \\quad z_i^{(k)} = \\nabla g(u_{i-1}^{(k)}) + z_i^{(k-1)} - \\nabla g(u_i^{(k)}), \\quad \\text{for } i = 1, \\ldots, d, \\qquad (19)$$\n\nwhere $P^g_C(x) = \\mathop{\\mathrm{argmin}}_{c \\in C} D_g(c, x)$ denotes the Bregman (rather than Euclidean) projection of $x$\nonto a set $C$, and $g^*$ is the conjugate function of $g$. Though it may not be immediately obvious, when\n$g(v) = \\frac{1}{2}\\|v\\|_2^2$ the above iterations (19) reduce to the standard (Euclidean) Dykstra iterations in (4).\nFurthermore, Dykstra\u2019s algorithm and coordinate descent are equivalent in the more general setting.\n\nTheorem 6. Let $f$ be a strictly convex, differentiable function that has full domain. Assume that\n$X_i \\in \\mathbb{R}^{n \\times p_i}$ has full column rank and $h_i(v) = \\max_{d \\in D_i} \\langle d, v \\rangle$ for a closed, convex set $D_i \\subseteq \\mathbb{R}^{p_i}$, for\n$i = 1, \\ldots, d$. Also, let $g(v) = f^*(-v)$, $b = -\\nabla f(0)$, and $C_i = (X_i^T)^{-1}(D_i) \\subseteq \\mathbb{R}^n$, $i = 1, \\ldots, d$.\nThen (16), (18) are dual to each other, and their solutions $\\hat{w}$, $\\hat{u}$ satisfy $\\hat{u} = -\\nabla f(X\\hat{w})$. Moreover,\nDykstra\u2019s algorithm (19) and coordinate descent (17) are equivalent, i.e., for k = 1, 2, 3, . . 
.:\n\n$$z_i^{(k)} = X_i w_i^{(k)} \\quad \\text{and} \\quad u_i^{(k)} = -\\nabla f\\Big(\\sum_{j \\leq i} X_j w_j^{(k)} + \\sum_{j > i} X_j w_j^{(k-1)}\\Big), \\quad \\text{for } i = 1, \\ldots, d.$$\n\nNonquadratic loss: parallel coordinate descent methods. For a general regularized estimation\nproblem (16), parallel coordinate descent methods can be derived by applying Dykstra\u2019s algorithm\nand ADMM to a product space reformulation of the dual. Interestingly, the subsequent coordinate\ndescent algorithms are no longer equivalent (for a unity augmented Lagrangian parameter), and they\nfeature quite different computational structures. Parallel-Dykstra-CD for (16) initializes $w^{(0)} = 0$,\nand repeats:\n\n$$w_i^{(k)} = \\mathop{\\mathrm{argmin}}_{w_i \\in \\mathbb{R}^{p_i}} \\; f\\Big(Xw^{(k-1)} - X_i w_i^{(k-1)}/\\gamma_i + X_i w_i/\\gamma_i\\Big) + h_i(w_i/\\gamma_i), \\quad i = 1, \\ldots, d, \\qquad (20)$$\n\nfor $k = 1, 2, 3, \\ldots$, and weights $\\gamma_1, \\ldots, \\gamma_d > 0$ that sum to 1. In comparison, parallel-ADMM-CD\nfor (16) begins with $u_0^{(0)} = 0$, $w^{(-1)} = w^{(0)} = 0$, and repeats:\n\n$$\\text{Find } u_0^{(k)} \\text{ such that:} \\quad u_0^{(k)} = -\\nabla f\\Big(\\Big(\\sum_{i=1}^d \\rho_i\\Big)\\big(u_0^{(k)} - u_0^{(k-1)}\\big) - X\\big(w^{(k-2)} - 2w^{(k-1)}\\big)\\Big),$$\n$$w_i^{(k)} = \\mathop{\\mathrm{argmin}}_{w_i \\in \\mathbb{R}^{p_i}} \\; \\frac{1}{2} \\Big\\| u_0^{(k)} + X_i w_i^{(k-1)}/\\rho_i - X_i w_i/\\rho_i \\Big\\|_2^2 + h_i(w_i/\\rho_i), \\quad i = 1, \\ldots, d, \\qquad (21)$$\n\nfor $k = 1, 2, 3, \\ldots$, and parameters $\\rho_1, \\ldots, \\rho_d > 0$. Derivation details are given in the supplement.\n\nNotice the stark contrast between the parallel-Dykstra-CD iterations (20) and the parallel-ADMM-CD\niterations (21). In (20), we perform (in parallel) coordinatewise $h_i$-regularized minimizations\ninvolving $f$, for $i = 1, \\ldots, d$. 
In (21), we perform a single quadratically-regularized minimization\ninvolving $f$ for the $u_0$-update, and then for the $w$-update, we perform (in parallel) coordinatewise\n$h_i$-regularized minimizations involving a quadratic loss, for $i = 1, \\ldots, d$ (these are typically much\ncheaper than the analogous minimizations for typical nonquadratic losses $f$ of interest).\n\nWe note that the $u_0$-update in the parallel-ADMM-CD iterations (21) simplifies for many losses $f$\nof interest; in particular, for separable loss functions of the form $f(v) = \\sum_{i=1}^n f_i(v_i)$, for convex,\nunivariate functions $f_i$, $i = 1, \\ldots, n$, the $u_0$-update separates into $n$ univariate minimizations. As an\nexample, consider the logistic lasso problem,\n\n$$\\min_{w \\in \\mathbb{R}^p} \\; -y^T Xw + \\sum_{i=1}^n \\log(1 + e^{x_i^T w}) + \\lambda \\|w\\|_1, \\qquad (22)$$\n\nwhere $x_i \\in \\mathbb{R}^p$, $i = 1, \\ldots, n$ denote the rows of $X$. Abbreviating $\\rho = \\sum_{i=1}^p \\rho_i$, and denoting by\n$\\sigma(x) = 1/(1 + e^{-x})$ the sigmoid function, and by $S_t(x) = \\mathrm{sign}(x)(|x| - t)_+$ the soft-thresholding\nfunction at a level $t > 0$, the parallel-ADMM-CD iterations (21) for (22) reduce to:\n\n$$\\text{Find } u_{0i}^{(k)} \\text{ such that:} \\quad u_{0i}^{(k)} = y_i - \\sigma\\Big(\\rho u_{0i}^{(k)} - \\rho u_{0i}^{(k-1)} - x_i^T\\big(w^{(k-2)} - 2w^{(k-1)}\\big)\\Big), \\quad i = 1, \\ldots, n,$$\n$$w_i^{(k)} = S_{\\lambda \\rho_i / \\|X_i\\|_2^2}\\bigg(\\frac{\\rho_i X_i^T (u_0^{(k)} + X_i w_i^{(k-1)}/\\rho_i)}{\\|X_i\\|_2^2}\\bigg), \\quad i = 1, \\ldots, p, \\qquad (23)$$\n\nfor $k = 1, 2, 3, \\ldots$. Now we see that both the $u_0$-update and $w$-update in (23) can be parallelized,\nand each coordinate update in the former can be done with, say, a simple bisection search.\n\nAsynchronous parallel algorithms, and coordinate descent in Hilbert spaces. We finish with\nsome directions for possible future work. 
Asynchronous variants of parallel coordinate descent are\ncurrently of great interest, e.g., see the review in Wright (2015). Given the link between ADMM and\ncoordinate descent developed in this paper, it would be interesting to investigate the implications of\nthe recent exciting progress on asynchronous ADMM, e.g., see Chang et al. (2016a,b) and references\ntherein, for coordinate descent. In a separate direction, much of the literature on Dykstra\u2019s algorithm\nemphasizes that this method works seamlessly in Hilbert spaces. It would be interesting to consider\nthe connections to (parallel) coordinate descent in infinite-dimensional function spaces, which we\nwould encounter, e.g., in alternating conditional expectation algorithms or backfitting algorithms in\nadditive models.\n\nReferences\n\nAlfred Auslender. Optimisation: Methodes Numeriques. Masson, 1976.\n\nHeinz H. Bauschke and Patrick L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2011.\n\nHeinz H. Bauschke and Valentin R. Koch. Projection methods: Swiss army knives for solving feasibility and best approximation problems with halfspaces. arXiv: 1301.4506, 2013.\n\nHeinz H. Bauschke and Adrian S. Lewis. Dykstra\u2019s algorithm with Bregman projections: a convergence proof. Optimization, 48:409\u2013427, 2000.\n\nAmir Beck and Luba Tetruashvili. On the convergence of block coordinate descent type methods. SIAM Journal on Optimization, 23(4):2037\u20132060, 2013.\n\nDimitri P. Bertsekas and John N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice Hall, 1989.\n\nStephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1\u2013122, 2011.\n\nJames P. Boyle and Richard L. Dykstra. 
A method for finding projections onto the intersection of convex sets in Hilbert spaces. Advances in Order Restricted Statistical Inference: Proceedings of the Symposium on Order Restricted Statistical Inference, pages 28\u201347, 1986.\n\nYair Censor and Simeon Reich. The Dykstra algorithm with Bregman projections. Communications in Applied Analysis, 48:407\u2013419, 1998.\n\nTsung-Hui Chang, Mingyi Hong, Wei-Cheng Liao, and Xiangfeng Wang. Asynchronous distributed ADMM for large-scale optimization\u2014part I: Algorithm and convergence analysis. IEEE Transactions on Signal Processing, 64(12):3118\u20133130, 2016a.\n\nTsung-Hui Chang, Wei-Cheng Liao, Mingyi Hong, and Xiangfeng Wang. Asynchronous distributed ADMM for large-scale optimization\u2014part II: Linear convergence analysis and numerical performance. IEEE Transactions on Signal Processing, 64(12):3131\u20133144, 2016b.\n\nScott Chen, David L. Donoho, and Michael Saunders. Atomic decomposition for basis pursuit. SIAM Journal on Scientific Computing, 20(1):33\u201361, 1998.\n\nFrank Deutsch. Best Approximation in Inner Product Spaces. Springer, 2001.\n\nFrank Deutsch and Hein Hundal. The rate of convergence of Dykstra\u2019s cyclic projections algorithm: The polyhedral case. Numerical Functional Analysis and Optimization, 15(5\u20136):537\u2013565, 1994.\n\nJim Douglas and H. H. Rachford. On the numerical solution of heat conduction problems in two and three space variables. Transactions of the American Mathematical Society, 82:421\u2013439, 1956.\n\nRichard L. Dykstra. An algorithm for restricted least squares regression. Journal of the American Statistical Association, 78(384):837\u2013842, 1983.\n\nJonathan Eckstein and Dimitri P. Bertsekas. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1):293\u2013318, 1992.\n\nLaurent El Ghaoui, Vivian Viallon, and Tarek Rabbani. 
Safe feature elimination in sparse supervised learning. Pacific Journal of Optimization, 8(4):667\u2013698, 2012.\n\nJerome Friedman, Trevor Hastie, Holger Hoefling, and Robert Tibshirani. Pathwise coordinate optimization. Annals of Applied Statistics, 1(2):302\u2013332, 2007.\n\nJerome Friedman, Trevor Hastie, and Robert Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1\u201322, 2010.\n\nWenjiang J. Fu. Penalized regressions: The bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397\u2013416, 1998.\n\nDaniel Gabay. Applications of the method of multipliers to variational inequalities. Studies in Mathematics and Its Applications, 15:299\u2013331, 1983.\n\nDaniel Gabay and Bertrand Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17\u201340, 1976.\n\nNorbert Gaffke and Rudolf Mathar. A cyclic projection algorithm via duality. Metrika, 36(1):29\u201354, 1989.\n\nRoland Glowinski and A. Marroco. Sur l\u2019approximation, par elements finis d\u2019ordre un, et la resolution, par penalisation-dualite d\u2019une classe de problemes de Dirichlet non lineaires. Modelisation Mathematique et Analyse Numerique, 9(R2):41\u201376, 1975.\n\nTom Goldstein, Brendan O\u2019Donoghue, Simon Setzer, and Richard Baraniuk. Fast alternating direction optimization methods. SIAM Journal on Imaging Sciences, 7(3):1588\u20131623, 2014.\n\nIsrael Halperin. The product of projection operators. Acta Scientiarum Mathematicarum, 23:96\u201399, 1962.\n\nShih-Ping Han. A successive projection algorithm. Mathematical Programming, 40(1):1\u201314, 1988.\n\nClifford Hildreth. A quadratic programming procedure. Naval Research Logistics Quarterly, 4(1):79\u201385, 1957.\n\nMingyi Hong and Zhi-Quan Luo. 
On the linear convergence of the alternating direction method of multipliers. Mathematical Programming, 162(1):165\u2013199, 2017.\n\nAlfredo N. Iusem and Alvaro R. De Pierro. On the convergence properties of Hildreth\u2019s quadratic programming algorithm. Mathematical Programming, 47(1):37\u201351, 1990.\n\nAlfredo N. Iusem and Alvaro R. De Pierro. A simultaneous iterative method for computing projections on polyhedra. SIAM Journal on Control and Optimization, 25(1):231\u2013243, 1987.\n\nMartin Jaggi, Virginia Smith, Martin Takac, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael I. Jordan. Communication-efficient distributed dual coordinate ascent. Advances in Neural Information Processing Systems, 27:3068\u20133076, 2014.\n\nMojtaba Kadkhodaie, Konstantina Christakopoulou, Maziar Sanjabi, and Arindam Banerjee. Accelerated alternating direction method of multipliers. International Conference on Knowledge Discovery and Data Mining, 21:497\u2013506, 2015.\n\nXingguo Li, Tuo Zhao, Raman Arora, Han Liu, and Mingyi Hong. An improved convergence analysis of cyclic block coordinate descent-type methods for strongly convex minimization. International Conference on Artificial Intelligence and Statistics, 19:491\u2013499, 2016.\n\nP. L. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16(6):964\u2013979, 1979.\n\nDavid Luenberger. Introduction to Linear and Nonlinear Programming. Addison-Wesley, 1973.\n\nZhi-Quan Luo and Paul Tseng. On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7\u201335, 1992.\n\nZhi-Quan Luo and Paul Tseng. On the convergence rate of dual ascent methods for linearly constrained convex minimization. Mathematics of Operations Research, 18(4):846\u2013867, 1993.\n\nRobert Nishihara, Laurent Lessard, Benjamin Recht, Andrew Packard, and Michael I. 
Jordan. A general analysis of the convergence of ADMM. International Conference on Machine Learning, 32:343\u2013352, 2015.\n\nJames M. Ortega and Werner C. Rheinboldt. Iterative Solution of Nonlinear Equations in Several Variables. Academic Press, 1970.\n\nG. Pierra. Decomposition through formalization in a product space. Mathematical Programming, 28(1):96\u2013115, 1984.\n\nPeter Richtarik and Martin Takac. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156(1):433\u2013484, 2016.\n\nSylvain Sardy, Andrew G. Bruce, and Paul Tseng. Block coordinate relaxation methods for nonparametric wavelet denoising. Journal of Computational and Graphical Statistics, 9(2):361\u2013379, 2000.\n\nRobert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1):267\u2013288, 1996.\n\nRobert Tibshirani, Jacob Bien, Jerome Friedman, Trevor Hastie, Noah Simon, Jonathan Taylor, and Ryan J. Tibshirani. Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society: Series B, 74(2):245\u2013266, 2012.\n\nRyan J. Tibshirani. The lasso problem and uniqueness. Electronic Journal of Statistics, 7:1456\u20131490, 2013.\n\nPaul Tseng. Dual ascent methods for problems with strictly convex costs and linear constraints: A unified approach. SIAM Journal on Control and Optimization, 28(1):214\u201329, 1990.\n\nPaul Tseng. Dual coordinate ascent methods for non-strictly convex minimization. Mathematical Programming, 59(1):231\u2013247, 1993.\n\nPaul Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475\u2013494, 2001.\n\nPaul Tseng and Dimitri P. Bertsekas. Relaxation methods for problems with strictly convex separable costs and linear constraints. 
Mathematical Programming, 38(3):303\u2013321, 1987.\n\nJohn von Neumann. Functional Operators, Volume II: The Geometry of Orthogonal Spaces. Princeton University Press, 1950.\n\nJie Wang, Peter Wonka, and Jieping Ye. Lasso screening rules via dual polytope projection. Journal of Machine Learning Research, 16:1063\u20131101, 2015.\n\nJack Warga. Minimizing certain convex functions. Journal of the Society for Industrial and Applied Mathematics, 11(3):588\u2013593, 1963.\n\nStephen J. Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3\u201334, 2015.\n\nTong Tong Wu and Kenneth Lange. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, 2(1):224\u2013244, 2008.", "award": [], "sourceid": 366, "authors": [{"given_name": "Ryan", "family_name": "Tibshirani", "institution": "Carnegie Mellon University"}]}