{"title": "Trajectory of Alternating Direction Method of Multipliers and Adaptive Acceleration", "book": "Advances in Neural Information Processing Systems", "page_first": 7357, "page_last": 7365, "abstract": "The alternating direction method of multipliers (ADMM) is one of the most widely used first-order optimisation methods in the literature owing to its simplicity, flexibility and efficiency. Over the years, numerous efforts are made to improve the performance of the method, such as the inertial technique. By studying the geometric properties of ADMM, we discuss the limitations of current inertial accelerated ADMM and then present and analyze an adaptive acceleration scheme for the method. Numerical experiments on problems arising from image processing, statistics and machine learning demonstrate the advantages of the proposed acceleration approach.", "full_text": "Trajectory of Alternating Direction Method of\n\nMultipliers and Adaptive Acceleration\n\nClarice Poon\u2217\n\nUniversity of Bath, Bath UK\n\ncmhsp20@bath.ac.uk\n\nJingwei Liang\u2217\n\nUniversity of Cambridge, Cambridge UK\n\njl993@cam.ac.uk\n\nAbstract\n\nThe alternating direction method of multipliers (ADMM) is one of the most widely\nused \ufb01rst-order methods in the literature owing to its simplicity, \ufb02exibility and\nef\ufb01ciency. Over the years, numerous efforts are made to improve the performance\nof ADMM, such as the inertial technique. By studying the geometric properties\nof ADMM, we discuss the limitations of current inertial accelerated ADMM, then\npresent and analyze an adaptive acceleration scheme for the method. 
Numerical experiments on problems arising from image processing, statistics and machine learning demonstrate the advantages of the proposed acceleration approach.\n\n1 Introduction\nConsider the following constrained and composite optimisation problem\nmin_{x\u2208Rn, y\u2208Rm} R(x) + J(y) such that Ax + By = b, (PADMM)\nwhere the following basic assumptions are imposed:\n(A.1) R \u2208 \u03930(Rn) and J \u2208 \u03930(Rm) are proper convex and lower semi-continuous functions.\n(A.2) A : Rn \u2192 Rp and B : Rm \u2192 Rp are injective linear operators.\n(A.3) ri(dom(R) \u2229 dom(J)) \u2260 \u2205, and the set of minimizers is non-empty.\nOver the past years, problem (PADMM) has attracted a great deal of interest as it covers many problems arising from data science, machine learning, statistics, inverse problems and imaging; see Section 5 for examples. In the literature, different methods have been proposed to handle the problem, among which the alternating direction method of multipliers (ADMM) is the most prevalent. Earlier works on ADMM include [17, 16, 15, 11], and recently it has gained increasing popularity, in part due to [6]. To derive ADMM, first consider the augmented Lagrangian associated to (PADMM): L(x, y; \u03c8) def= R(x) + J(y) + \u27e8\u03c8, Ax + By \u2212 b\u27e9 + (\u03b3/2)||Ax + By \u2212 b||2, where \u03b3 > 0 and \u03c8 \u2208 Rp is the Lagrangian multiplier. 
To find a saddle-point of L(x, y; \u03c8), ADMM applies the iteration\nxk = argmin_{x\u2208Rn} R(x) + (\u03b3/2)||Ax + Byk\u22121 \u2212 b + (1/\u03b3)\u03c8k\u22121||2,\nyk = argmin_{y\u2208Rm} J(y) + (\u03b3/2)||Axk + By \u2212 b + (1/\u03b3)\u03c8k\u22121||2,\n\u03c8k = \u03c8k\u22121 + \u03b3(Axk + Byk \u2212 b). (1)\nBy defining zk def= \u03c8k\u22121 + \u03b3Axk, we can rewrite the ADMM iteration (1) in the following form:\nxk = argmin_{x\u2208Rn} R(x) + (\u03b3/2)||Ax \u2212 (1/\u03b3)(zk\u22121 \u2212 2\u03c8k\u22121)||2,\nzk = \u03c8k\u22121 + \u03b3Axk,\nyk = argmin_{y\u2208Rm} J(y) + (\u03b3/2)||By + (1/\u03b3)(zk \u2212 \u03b3b)||2,\n\u03c8k = zk + \u03b3(Byk \u2212 b). (2)\nFor the rest of the paper, we will consider the above four-point formulation.\n\n\u2217Equal contributions.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nContributions The contribution of our paper is threefold. First, for the sequence {zk}k\u2208N of (2), we show that it has two different types of trajectory:\n\u2022 When both R, J are non-smooth functions, under the assumption that they are partly smooth (see Definition 2.1), we show that the eventual trajectory of {zk}k\u2208N is approximately a spiral, which can be characterized precisely if R, J are moreover locally polyhedral around the solution.\n\u2022 When at least one of R, J is smooth, we show that, depending on the choice of \u03b3, the eventual trajectory of {zk}k\u2208N can be either a straight line or a spiral.\nSecond, based on the trajectory of {zk}k\u2208N, we discuss the limitations of the current combination of ADMM and the inertial acceleration technique. In Section 3, we distinguish the situations where inertial acceleration will work and when it fails. 
More precisely: the inertial technique will work if the trajectory of {zk}k\u2208N is, or is close to, a straight line, and will fail if the trajectory is a spiral.\nOur core contribution is an adaptive acceleration for ADMM, which is inspired by the trajectory of ADMM and dubbed \u201cA3DMM\u201d. The limitation of the inertial technique, particularly its failure, implies that the right acceleration scheme should be able to follow the trajectory of the iterates. In Section 4, we propose an adaptive linear prediction scheme to accelerate ADMM which is able to follow the trajectory of the method. Our proposed A3DMM belongs to the realm of extrapolation methods, and provides an alternative geometrical interpretation for polynomial extrapolation methods such as Minimal Polynomial Extrapolation (MPE) [9] and Reduced Rank Extrapolation (RRE) [12, 21].\nRelated works Over the past decades, owing to the tremendous success of inertial acceleration [22, 5], the inertial technique has been widely adapted to accelerate other first-order methods. In terms of ADMM, related work can be found in [23, 18, 14], either from the proximal point algorithm perspective or via continuous-time dynamical systems. However, to ensure that inertial acceleration works, strong assumptions are imposed on R, J in (PADMM), such as smooth differentiability or strong convexity. When it comes to general non-smooth problems, these works may fail to provide acceleration. Recently, in [13], an O(1/k2) convergence rate was established for ADMM using Nesterov acceleration; however, the result holds only for the continuous dynamical system, while a discrete-time optimization scheme remains unavailable.\nFor more generic acceleration techniques, there is an extensive body of work in numerical analysis on the topic of convergence acceleration for sequences. 
The goal of convergence acceleration is, given an arbitrary sequence {zk}k\u2208N \u2282 Rn with limit z\u22c6, to find a transformation Ek : {zk\u2212j}_{j=1}^{q} \u2192 \u00afzk \u2208 Rn such that \u00afzk converges faster to z\u22c6. In general, the process by which {zk}k\u2208N is generated is unknown, q is chosen to be a small integer, and \u00afzk is referred to as the extrapolation of zk. Some of the best known examples include Richardson's extrapolation [24], the \u22062-process of Aitken [1] and Shanks' algorithm [26]. We refer to [7, 8, 27] and references therein for a detailed historical perspective on the development of these techniques. Much of the work on the extrapolation of vector sequences was initiated by Wynn [29], who generalized the work of Shanks to vector sequences. In the supplementary material, the formulation of some of these methods is provided. Particularly relevant to the present work are minimal polynomial extrapolation (MPE) [9] and reduced rank extrapolation (RRE) [12, 21] (the latter also a variant of Anderson acceleration, developed independently in [3]); see Section 4.2 for a brief discussion.\nMore recently, there has been a series of works on a regularised version of RRE stemming from [25]. We remark, however, that the regularisation parameter in these works relies on a grid search based on the objective function, so their applicability to the general ADMM setting is unclear.\nNotations Denote by Rn an n-dimensional Euclidean space equipped with scalar product \u27e8\u00b7, \u00b7\u27e9 and norm ||\u00b7||. Id denotes the identity operator on Rn. \u03930(Rn) denotes the class of proper convex and lower semi-continuous functions on Rn. For a nonempty convex set S \u2282 Rn, denote ri(S) its relative interior, par(S) the smallest subspace parallel to S, and PS the projection operator onto S. 
The sub-differential of a function R \u2208 \u03930(Rn) is defined by \u2202R(x) def= {g \u2208 Rn | R(x') \u2265 R(x) + \u27e8g, x' \u2212 x\u27e9, \u2200x' \u2208 Rn}. The spectral radius of a matrix M is denoted by \u03c1(M).\n\n2 Trajectory of ADMM\nIn this section, we discuss the trajectory of the sequence {zk}k\u2208N generated by ADMM based on the concept of \u201cpartial smoothness\u201d, which was first introduced in [19].\n2.1 Partial smoothness\nLet M \u2282 Rn be a C2-smooth submanifold, and denote TM(x) the tangent space of M at a point x \u2208 M.\nDefinition 2.1 (Partly smooth function [19]). A function R \u2208 \u03930(Rn) is partly smooth at \u00afx relative to a set M\u00afx if \u2202R(\u00afx) \u2260 \u2205 and M\u00afx is a C2 manifold around \u00afx, and moreover\nSmoothness: R restricted to M\u00afx is C2 around \u00afx.\nSharpness: The tangent space TM\u00afx(\u00afx) = par(\u2202R(\u00afx))\u22a5.\nContinuity: The set-valued mapping \u2202R is continuous at x relative to M\u00afx.\nThe class of partly smooth functions at \u00afx relative to M\u00afx is denoted by PSF\u00afx(M\u00afx). Popular examples of partly smooth functions can be found in [20, Chapter 5]. Loosely speaking, a partly smooth function behaves smoothly as we move along M\u00afx, and sharply if we move transversal to it.\n2.2 Trajectory of ADMM\nThe iteration of ADMM is non-linear in general owing to the non-smoothness and non-linearity of R and J. However, if they are partly smooth, the local C2-smoothness allows us to linearize the ADMM iteration, and hence to study the trajectory of the sequence generated by the method. We denote by (x\u22c6, y\u22c6, \u03c8\u22c6) a saddle-point of L(x, y; \u03c8) and let z\u22c6 = \u03c8\u22c6 + \u03b3Ax\u22c6.\nTo discuss the trajectory of ADMM, we rely on the sequence {zk}k\u2208N. 
Define vk def= zk \u2212 zk\u22121 and \u03b8k def= arccos( \u27e8vk, vk\u22121\u27e9 / (||vk|| ||vk\u22121||) ), the angle between vk and vk\u22121. We use {\u03b8k}k\u2208N to characterize the trajectory of {zk}k\u2208N. Given (x\u22c6, y\u22c6, \u03c8\u22c6), the first-order optimality condition entails \u2212AT\u03c8\u22c6 \u2208 \u2202R(x\u22c6) and \u2212BT\u03c8\u22c6 \u2208 \u2202J(y\u22c6); below we impose\n\u2212AT\u03c8\u22c6 \u2208 ri(\u2202R(x\u22c6)) and \u2212BT\u03c8\u22c6 \u2208 ri(\u2202J(y\u22c6)). (ND)\nBoth R, J are non-smooth Let MR_{x\u22c6}, MJ_{y\u22c6} be two smooth manifolds around x\u22c6, y\u22c6 respectively, and suppose R \u2208 PSF_{x\u22c6}(MR_{x\u22c6}), J \u2208 PSF_{y\u22c6}(MJ_{y\u22c6}) are partly smooth. Denote TR_{x\u22c6}, TJ_{y\u22c6} the tangent spaces of MR_{x\u22c6}, MJ_{y\u22c6} at x\u22c6, y\u22c6, respectively. Let AR def= A \u2218 P_{TR_{x\u22c6}}, BJ def= B \u2218 P_{TJ_{y\u22c6}}, and let TAR, TBJ be the ranges of AR, BJ respectively. Denote (\u03b1j)_{j=1,...} the principal angles (see Section D.2 in the supplementary for the definition) between TAR and TBJ, and let \u03b1F, \u03b1' be the smallest and second smallest of the \u03b1j which are larger than 0.\nTheorem 2.2. For problem (PADMM) and ADMM iteration (1), assume that conditions (A.1)-(A.3) hold; then (xk, yk, \u03c8k) converges to a saddle point (x\u22c6, y\u22c6, \u03c8\u22c6) of L(x, y; \u03c8). 
Suppose that R \u2208 PSF_{x\u22c6}(MR_{x\u22c6}), J \u2208 PSF_{y\u22c6}(MJ_{y\u22c6}) and condition (ND) holds; then\n(i) There exists a matrix M such that vk = M vk\u22121 + o(||vk\u22121||) holds for all k large enough.\n(ii) If, moreover, R, J are locally polyhedral around x\u22c6, y\u22c6, then vk = M vk\u22121 with M being normal and having eigenvalues of the form cos(\u03b1j)e^{\u00b1i\u03b1j}, and cos(\u03b8k) = cos(\u03b1F) + O(\u03b7^{2k}) with \u03b7 = cos(\u03b1')/cos(\u03b1F).\nRemark 2.3. The result indicates that, when both R, J are locally polyhedral, the trajectory of {zk}k\u2208N is a spiral. For the case where R, J are general partly smooth functions, though we cannot prove it, numerical evidence shows that the trajectory of {zk}k\u2208N can be either a straight line or a spiral.\nR and/or J is smooth Now we consider the case where at least one of R, J is smooth. For simplicity, suppose R is smooth and J remains non-smooth.\nProposition 2.4. For problem (PADMM) and ADMM iteration (1), assume that conditions (A.1)-(A.3) hold; then (xk, yk, \u03c8k) converges to a saddle point (x\u22c6, y\u22c6, \u03c8\u22c6) of L(x, y; \u03c8). Suppose R is locally C2 around x\u22c6, J \u2208 PSF_{y\u22c6}(MJ_{y\u22c6}) is partly smooth and condition (ND) holds for J; then Theorem 2.2(i) holds for all k large enough. If, moreover, A is a full-rank square matrix, then all the eigenvalues of M are real for \u03b3 > ||(AT A)^{\u22121/2} \u22072R(x\u22c6) (AT A)^{\u22121/2}||.\nRemark 2.5. When the spectrum of M is real, numerical evidence shows that the eventual trajectory of {zk}k\u2208N is a straight line, which is different from the case where both functions are non-smooth. If o(||vk\u22121||) vanishes fast enough, we can also prove that \u03b8k \u2192 0.\nIt should be emphasized that the trajectory is determined by the property of the leading eigenvalue of M. 
Therefore, for \u03b3 \u2264 ||(AT A)^{\u22121/2} \u22072R(x\u22c6) (AT A)^{\u22121/2}||, though M will have complex eigenvalues, the leading one is not necessarily complex. As a result, the trajectory of {zk}k\u2208N can be either a spiral (complex leading eigenvalue) or a straight line (real leading eigenvalue).\nIn Figure 1 (a) and (c), we present two examples of the trajectory of ADMM. Subfigure (a) shows a spiral trajectory in R2 obtained from solving a polyhedral problem, while subfigure (c) shows an eventual straight-line trajectory in R3.\n\nFigure 1: Trajectory of the sequence {zk}k\u2208N and effects of inertia on ADMM. (a) Spiral trajectory of ADMM; (b) failure of inertial ADMM on a spiral trajectory (\u03b3 = ||K||2/10); (c) eventual straight-line trajectory; (d) success of inertial ADMM on a straight-line trajectory (\u03b3 = ||K||2 + 0.1).\n\n3 The failure of inertial acceleration\nOne simple approach for combining the inertial technique with ADMM is described below:\nxk = argmin_{x\u2208Rn} R(x) + (\u03b3/2)||Ax \u2212 (1/\u03b3)(\u00afzk\u22121 \u2212 2\u03c8k\u22121)||2,\nzk = \u03c8k\u22121 + \u03b3Axk,\n\u00afzk = zk + ak(zk \u2212 zk\u22121),\nyk = argmin_{y\u2208Rm} J(y) + (\u03b3/2)||By + (1/\u03b3)(\u00afzk \u2212 \u03b3b)||2,\n\u03c8k = \u00afzk + \u03b3(Byk \u2212 b), (3)\nwhich considers only the momentum of {zk}k\u2208N without any stronger assumptions on R, J. The above scheme can be reformulated as an instance of the inertial proximal point algorithm, guaranteed to be convergent for ak < 1/3 [2]; we refer to [23] or [20, Chapter 4.3] for more details. To our knowledge, there is no acceleration guarantee for (3).\nRemark 3.1. Besides (3), other combinations of the inertial technique and ADMM have also been proposed; see for instance [23, 18]. 
To ensure acceleration guarantees, stronger assumptions, such as Lipschitz smoothness and strong convexity, are needed.\nWe use the LASSO problem to demonstrate the combination of the above inertial technique and ADMM, especially when it fails. The formulation of LASSO in the form of (PADMM) reads\nmin_{x,y\u2208Rn} \u03bc||x||1 + (1/2)||Ky \u2212 f||2 such that x \u2212 y = 0, (4)\nwhere K \u2208 Rm\u00d7n, m < n, is a random Gaussian matrix. Since (1/2)||Ky \u2212 f||2 is quadratic, owing to Proposition 2.4, the eventual trajectory of {zk}k\u2208N is a straight line if \u03b3 > ||K||2, and a spiral for some \u03b3 \u2264 ||K||2. Therefore, we consider two different choices of \u03b3, namely \u03b3 = ||K||2/10 and \u03b3 = ||K||2 + 0.1, and for each \u03b3 the following choices of ak are considered:\nak \u2261 0.3, ak \u2261 0.7 and ak = (k\u22121)/(k+3).\nThe third choice of ak corresponds to FISTA [10]. For the numerical example, we let K \u2208 R640\u00d72048 and \u03bc = 1, and f is the measurement of a 128-sparse signal. The results are shown in Figure 1 (b) & (d):\n\u2022 \u03b3 = ||K||2/10: The inertial scheme works only for ak \u2261 0.3, which is due to the fact that the trajectory of {zk}k\u2208N is a spiral for \u03b3 = ||K||2/10. As a result, the direction zk \u2212 zk\u22121 is not pointing towards z\u22c6, and hence is unable to provide satisfactory acceleration.\n\u2022 \u03b3 = ||K||2 + 0.1: All choices of ak work since {zk}k\u2208N eventually forms a straight line. 
Among these choices of ak, ak \u2261 0.7 is the fastest, while ak = (k\u22121)/(k+3) is eventually the slowest.\nIt should be noted that, though ADMM is faster under \u03b3 = ||K||2/10 than under \u03b3 = ||K||2 + 0.1, our main focus here is to show how the trajectory of {zk}k\u2208N affects the outcome of inertial acceleration. The above comparisons, particularly for \u03b3 = ||K||2/10, imply that the trajectory of {zk}k\u2208N is crucial for the acceleration outcome of the inertial scheme (3). Since the trajectory of {zk}k\u2208N depends on the properties of R, J and the choice of \u03b3, the right scheme, one achieving uniform acceleration regardless of R, J and \u03b3, should be able to adapt itself to the trajectory of the method. More discussion on the failure of inertia can be found in Section A of the supplementary material.\n\n4 A3DMM: adaptive acceleration for ADMM\nThe previous section shows that the trajectory of {zk}k\u2208N eventually settles onto a regular path, i.e. either a straight line or a spiral. In this section, we exploit this regularity to design an adaptive acceleration for ADMM, called \u201cA3DMM\u201d; see Algorithm 1.\nThe update of \u00afzk in (3) can be viewed as the special case q = 0 of the following extrapolation\n\u00afzk = E(zk, zk\u22121, \u00b7\u00b7\u00b7 , zk\u2212q\u22121). (5)\nThe idea is: given {zk\u2212j}_{j=0}^{q+1}, define vj def= zj \u2212 zj\u22121 and predict the future iterates by considering how the past directions vk\u22121, . . . , vk\u2212q approximate the latest direction vk. In particular, define Vk\u22121 def= [vk\u22121, \u00b7\u00b7\u00b7 , vk\u2212q] \u2208 Rn\u00d7q, and let ck def= argmin_{c\u2208Rq} ||Vk\u22121c \u2212 vk||2 = argmin_{c\u2208Rq} ||\u2211_{j=1}^{q} cj vk\u2212j \u2212 vk||2. The idea is then that Vk ck \u2248 vk+1 and so \u00afzk,1 def= zk + Vk ck \u2248 zk+1. 
By iterating this s times, we obtain \u00afzk,s \u2248 zk+s.\nMore precisely, given c \u2208 Rq, define the mapping H by H(c) def= [c_{1:q\u22121}, Id_{q\u22121}; c_q, 0_{1,q\u22121}] \u2208 Rq\u00d7q, i.e. c forms the first column and the shifted identity fills the remaining columns. Let Ck = H(ck), and note that Vk = Vk\u22121Ck. Define \u00afVk,0 def= Vk and, for s \u2265 1, \u00afVk,s def= \u00afVk,s\u22121Ck = VkCk^s, where Ck^s is the s-th power of Ck. Let (C)(:,1) be the first column of a matrix C; then\n\u00afzk,s = zk + \u2211_{i=1}^{s} (\u00afVk,i)(:,1) = zk + \u2211_{i=1}^{s} Vk(Ck^i)(:,1) = zk + Vk(\u2211_{i=1}^{s} Ck^i)(:,1), (6)\nwhich is the desired trajectory-following extrapolation. Now define the extrapolation\nE_{s,q}(zk, \u00b7\u00b7\u00b7 , zk\u2212q\u22121) def= Vk(\u2211_{i=1}^{s} Ck^i)(:,1),\nparameterized by s, q; we obtain the following trajectory-following adaptive acceleration for ADMM.\n\nAlgorithm 1: A3DMM - Adaptive Acceleration for ADMM\nInitial: Let s \u2265 1, q \u2265 1 be integers. Let \u00afz0 = z0 \u2208 Rp and V0 = 0 \u2208 Rp\u00d7(q+1).\nRepeat for k \u2265 1:\nyk = argmin_{y\u2208Rm} J(y) + (\u03b3/2)||By + (1/\u03b3)(\u00afzk\u22121 \u2212 \u03b3b)||2,\n\u03c8k = \u00afzk\u22121 + \u03b3(Byk \u2212 b),\nxk = argmin_{x\u2208Rn} R(x) + (\u03b3/2)||Ax \u2212 (1/\u03b3)(\u00afzk\u22121 \u2212 2\u03c8k)||2,\nzk = \u03c8k + \u03b3Axk,\nvk = zk \u2212 zk\u22121 and Vk = [vk, Vk\u22121(:, 1 : q \u2212 1)].\nIf mod(k, q+2) = 0: compute ck and Ck; if \u03c1(Ck) < 1: \u00afzk = zk + ak E_{s,q}(zk, \u00b7\u00b7\u00b7 , zk\u2212q\u22121).\nUntil: ||vk|| \u2264 tol.\n\nRemark 4.1. 
\u2022 The value of q is usually taken very small, e.g. q \u2264 10.\n\u2022 The extra computational cost of A3DMM is very small: about nq2 for computing the pseudoinverse of Vk\u22121.\n\u2022 The reason we change the order of updates in Algorithm 1 is that the update of yk requires only \u00afzk; this way, we only need to extrapolate zk, which incurs the minimal computational overhead. Moreover, the extrapolation can also be applied to xk, yk, \u03c8k under proper adaptation.\n\u2022 A3DMM carries out (q + 2) standard ADMM iterations to set up the extrapolation step E_{s,q}. As E_{s,q} contains the sum of the powers of Ck, it is guaranteed to be convergent when \u03c1(Ck) < 1. Therefore, we only apply E_{s,q} when the spectral radius satisfies \u03c1(Ck) < 1. In this case, there is a closed-form expression for E_{s,q} when s = +\u221e; see Eq. (7).\n\u2022 The purpose of adding ak in front of E_{s,q}(zk, \u00b7\u00b7\u00b7 , zk\u2212q\u22121) is so that we can control its value to ensure the convergence of the algorithm; see the discussion below.\n\n4.1 Convergence of A3DMM\nTo discuss the convergence of A3DMM, we treat the algorithm as a perturbation of the original ADMM. If the perturbation error is absolutely summable, then we obtain the convergence of A3DMM. More precisely, let \u03b5k \u2208 Rn be given by\n\u03b5k = 0 if mod(k, q + 2) \u2260 0, or if mod(k, q + 2) = 0 and \u03c1(Ck) \u2265 1;\n\u03b5k = ak E_{s,q}(zk, \u00b7\u00b7\u00b7 , zk\u2212q\u22121) if mod(k, q + 2) = 0 and \u03c1(Ck) < 1.\nSuppose the fixed-point formulation of ADMM can be written as zk = F(zk\u22121) for some F (see Section B.2 of the appendix for details). Then Algorithm 1 can be written as zk = F(zk\u22121 + \u03b5k\u22121), and we obtain the following convergence result for Algorithm 1, based on the classic convergence of the inexact Krasnosel'ski\u012d-Mann fixed-point iteration [4, Proposition 5.34].\nProposition 4.2. 
For problem (PADMM) and Algorithm 1, suppose that conditions (A.1)-(A.3) hold. If, moreover, \u2211_k ||\u03b5k|| < +\u221e, then zk \u2192 z\u22c6 \u2208 fix(F) def= {z \u2208 Rp : z = F(z)} and (xk, yk, \u03c8k) converges to (x\u22c6, y\u22c6, \u03c8\u22c6), which is a saddle point of L(x, y; \u03c8).\nOn-line updating rule The summability condition \u2211_k ||\u03b5k|| < +\u221e in general cannot be guaranteed. However, it can be enforced by a simple online updating rule. Let a \u2208 [0, 1] and b, \u03b4 > 0; then ak can be determined by ak = min{a, b/(k^{1+\u03b4}||zk \u2212 zk\u22121||)}.\nInexact A3DMM Observe that in A3DMM, when A, B are non-trivial, in general there are no closed-form solutions for xk and yk. Take xk for example: suppose it is computed approximately; then zk carries another approximation error \u03b5'k, and consequently zk = F(zk\u22121 + \u03b5k\u22121 + \u03b3\u03b5'k\u22121). If \u2211_k ||\u03b5'k\u22121|| < +\u221e holds, Proposition 4.2 remains true for the above perturbation form.\n\n4.2 Acceleration guarantee for A3DMM\nWe have so far alluded to the idea that the extrapolated point \u00afzk,s defined in (6) (which depends only on {zk\u2212j}_{j=0}^{q}) is an approximation to zk+s. In this section, we make this statement precise.\nRelationship to MPE and RRE We first show that \u00afzk,\u221e is (almost) equivalent to MPE; for s = +\u221e, (6) admits the closed form \u00afzk,\u221e = zk + Vk((Id \u2212 Ck)\u22121 \u2212 Id)(:,1). (7) 
Recall that\ni=0 C i.\nNow for the summation of the power of Ck in (6), when s = +\u221e, we have\n\ngiven a square matrix C, if its Neumann series is convergent, then there holds (Id\u2212C)\u22121 =(cid:80)+\u221e\n\nk = Ck(Id \u2212 Ck)\u22121 = (Id \u2212 Ck)\u22121 \u2212 Id.\n\n\u00afzk,\u221e def= zk + Vk\n\n(:,1) = zk \u2212 vk + Vk\n\ni=1 C i\nBack to (6), then we get\n\n(cid:0)(Id \u2212 Ck)\u22121(cid:1)\n(cid:0)zk \u2212(cid:80)q\u22121\nwhich turns out to be MPE, with the slight difference of taking the weighted sum of {zj}k\nas opposed to the weighted sum of {zj}k\u22121\nthe coef\ufb01cients c is computed in the following way: b \u2208 argmina\u2208Rq+1,(cid:80)\n\u00afzk,\u221e =(cid:80)q\u22121\nb0 (cid:54)= 0 and de\ufb01ne cj\n\nj=k\u2212q+1\nj=k\u2212q (See appendix for more details of MPE). Note that if\nj=0 ajvk\u2212j|| and\n= b0, and\nj=0 bjzk\u2212j is precisely the RRE update (again with the slight difference of summing over\niterates shifted by one iteration).\nAcceleration guarantee for A3DMM Let {zk}k\u2208N be a sequence in Rn and let vk\ndef= zk \u2212 zk\u22121.\nAssume that vk = M vk\u22121 for some M \u2208 Rn\u00d7n. Denote \u03bb(M ) the spectrum of M. The following\nproposition provides control on the extrapolation error for \u00afzk,s from (6).\nProposition 4.3. De\ufb01ne the coef\ufb01cient \ufb01tting error by \u0001k\n(i) For s \u2208 N, we have\n\ndef= \u2212bj/b0 for j = 1, . . . , q. Then (1 \u2212(cid:80)q\n\nj aj =1||(cid:80)q\nb0+(cid:80)q\n\ndef= minc\u2208Rq ||Vk\u22121c \u2212 vk||.\n\ni=1 ci)\u22121 =\n\nj=1 ck,jzk\u2212j\n\n(cid:96)=1 ||M (cid:96)|||(cid:80)s\u2212(cid:96)\ndef=(cid:80)s\n\n||\u00afzk,s \u2212 z(cid:63)|| \u2264 ||zk+s \u2212 z(cid:63)|| + Bs\u0001k,\ni=0(C i\n\nk)(1,1)|. If \u03c1(M ) < 1 and \u03c1(Ck) < 1, then(cid:80)\n\nBs is uniformly bounded in s. 
For s = +\u221e, B\u221e def= |1 \u2212(cid:80)\n\ni ck,i|\u22121(cid:80)\u221e\n\n(cid:96)=1 ||M||(cid:96)\n\nwhere Bs\n\n(8)\ni ck,i (cid:54)= 1 and\n\n(ii) Suppose that M is diagonalizable. Let (\u03bbj)j denote its distinct eigenvalues ordered such that\n\n|\u03bbj| \u2265 |\u03bbj+1| and |\u03bb1| = \u03c1(M ) < 1. Suppose that |\u03bbq| > |\u03bbq+1|.\n\n\u2022 Asymptotic bound (\ufb01xed q and as k \u2192 +\u221e): \u0001k = O(|\u03bbq+1|k).\n\u2022 Non-asymptotic bound (\ufb01xed q and k): Suppose \u03bb(M ) is real-valued and contained in\n\nb0\nj=1 bj\n\n[\u03b1, \u03b2] with \u22121 < \u03b1 < \u03b2 < 1. Then, let K def= 2||z0 \u2212 z(cid:63)||||(Id \u2212 M ) 1\n\n2|| and \u03b7 = 1\u2212\u03b1\n1\u2212\u03b2\n\n1 \u2212(cid:80)\n\n\u0001k\n\ni ck,i\n\n\u2264 K\u03b2k\u2212q(cid:0)\u221a\n\n\u03b7\u22121\u221a\n\n\u03b7+1\n\n(cid:1)q\n\n.\n\n(9)\n\nRemark 4.4.\n\n6\n\n\f\u2022 From Theorem 2.2(ii), when R and J are both polyhedral, we have a perfect local linearisation\nwith the corresponding linearisation matrix being normal and hence, the conditions of Proposition\n4.3 holds for all k large enough. The \ufb01rst bound (i) shows that the extrapolated point \u00afzk,s moves\nalong the true trajectory as s increases, up to the \ufb01tting error \u0001k. Although \u00afzk,\u221e is essentially an\nMPE update which is known to satisfy error bound (9) (see [28]), this proposition offers a further\ninterpretation of these extrapolation methods in terms of following the \u201csequence trajectory\u201d,\nand combined with our local analysis of ADMM, provides justi\ufb01cation of these methods for the\nacceleration of non-smooth optimisation problems.\n\u2022 Proposition 4.3 (ii) shows that extrapolation improves the convergence rate from O(|\u03bb1|k) to\nO(|\u03bbq+1|k), and the nonasymptotic bound shows that the improvement of extrapolation is\noptimal in the sense of Nesterov [22]. 
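To make the scheme concrete, the trajectory-following extrapolation z&#772;k,s of (6) can be sketched in a few lines. This is a minimal sketch based on our reading of Section 4, not the authors' released MATLAB code; the function and variable names are ours:

```python
import numpy as np

def extrapolate(Z, s):
    """Trajectory-following extrapolation built from the q+2 most
    recent iterates (columns of Z, oldest first)."""
    V = np.diff(Z, axis=1)           # differences v_j = z_j - z_{j-1}
    Vn = V[:, ::-1]                  # newest first: [v_k, v_{k-1}, ..., v_{k-q}]
    vk, Vprev = Vn[:, 0], Vn[:, 1:]  # v_k and V_{k-1} = [v_{k-1}, ..., v_{k-q}]
    c, *_ = np.linalg.lstsq(Vprev, vk, rcond=None)  # fit v_k by past directions
    q = c.size
    C = np.zeros((q, q))             # C_k = H(c): c is the first column,
    C[:, 0] = c                      # the shifted identity fills the rest
    C[:-1, 1:] = np.eye(q - 1)
    if np.max(np.abs(np.linalg.eigvals(C))) >= 1:   # spectral radius check
        return Z[:, -1]              # skip extrapolation, keep plain z_k
    acc, Ci = np.zeros(q), np.eye(q)
    for _ in range(s):               # sum of first columns of C^1, ..., C^s
        Ci = Ci @ C
        acc += Ci[:, 0]
    Vk = Vn[:, :q]                   # V_k = [v_k, ..., v_{k-q+1}]
    return Z[:, -1] + Vk @ acc       # z_k + V_k (sum_i C^i)_{(:,1)}
```

On a sequence whose differences satisfy an exact order-q linear recurrence with all modes inside the unit disc, the fit is exact and the predicted point lands far ahead of zk on the trajectory, consistent with the error control above.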
Recalling the form of the eigenvalues of M from Theorem 2.2, in the case of two non-smooth polyhedral terms we must have |\u03bb_{2j\u22121}| = |\u03bb_{2j}| > |\u03bb_{2j+1}| for all j \u2265 1. Hence, no acceleration can be guaranteed or observed when q = 1, while the choice q = 2 provides guaranteed acceleration.\nExtension of A3DMM to variants of ADMM is provided in Section B of the supplementary material.\n\n5 Numerical experiments\nBelow we present numerical experiments on affine constrained minimisation (e.g. Basis Pursuit) and LASSO problems to demonstrate the performance of A3DMM. Extra comparisons can be found in Section C of the supplementary material. In the numerical comparison below, we mainly compare with the original ADMM and its inertial version (3) with fixed ak \u2261 0.3. For the proposed A3DMM, two settings are considered: (q, s) = (4, 100) and (q, s) = (4, +\u221e). MATLAB source codes for reproducing the results can be found at: https://github.com/jliang993/A3DMM.\n\nFigure 2: Performance comparisons and {\u03b8k}k\u2208N of ADMM for the affine constrained problem. Panels: (a) \u21131-norm, (b) \u21131,2-norm, (c) nuclear norm ({\u03b8k}k\u2208N); (d) \u21131-norm, (e) \u21131,2-norm, (f) nuclear norm (performance).\n\nAffine constrained minimisation Consider the following constrained problem: given x\u00b0,\nmin_{x\u2208Rn} R(x) such that Kx = Kx\u00b0. (10)\nDenote the set \u03a9 def= {x \u2208 Rn : Kx = Kx\u00b0} and \u03b9\u03a9 its indicator function. Then (10) can be written as\nmin_{x,y\u2208Rn} R(x) + \u03b9\u03a9(y) such that x \u2212 y = 0, (11)\nwhich is a special case of (PADMM) with A = Id, B = \u2212Id and b = 0. 
Here K is generated from the standard Gaussian ensemble, and the following three choices of R are considered:\n\u21131-norm: (m, n) = (640, 2048), x\u00b0 is 128-sparse;\n\u21131,2-norm: (m, n) = (640, 2048), x\u00b0 has 32 non-zero blocks of size 4;\nNuclear norm: (m, n) = (1448, 64 \u00d7 64), x\u00b0 has rank 4.\nThe property of {\u03b8k}k\u2208N is shown in Figure 2 (a)-(c). Note that the indicator function \u03b9\u03a9(y) in (11) is polyhedral since \u03a9 is an affine subspace.\n\u2022 As the \u21131-norm is also polyhedral, we have in Figure 2(a) that \u03b8k converges to a constant, which complies with Theorem 2.2(ii).\n\u2022 Since the \u21131,2-norm and the nuclear norm are no longer polyhedral functions, \u03b8k eventually oscillates in a range, meaning that the trajectory of {zk}k\u2208N is an elliptical spiral.\nComparisons of the four schemes are shown in Figure 2 (d)-(f):\n\u2022 Since both functions in (11) are non-smooth, the eventual trajectory of {zk}k\u2208N for ADMM is a spiral; inertial ADMM fails to provide acceleration locally.\n\u2022 A3DMM is faster than both ADMM and inertial ADMM. For the two different settings of A3DMM, their performances are very close.\nLASSO We consider again the LASSO problem (4) with three datasets from LIBSVM2. The numerical experiments are provided below in Figure 3.\nIt can be observed that the proposed A3DMM is significantly faster than the other schemes, especially for s = +\u221e. Between ADMM and inertial ADMM, the inertial technique provides consistent acceleration for all three examples since \u03b8k \u2192 0; see the first row of Figure 3. 
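The angle diagnostic {\u03b8k}k\u2208N plotted in these figures can be reproduced in a few lines, using \u03b8k as defined in Section 2.2. The sketch below runs on synthetic straight-line and spiral sequences, not on the LIBSVM datasets:

```python
import numpy as np

def angles(Z):
    """theta_k = arccos(<v_k, v_{k-1}> / (||v_k|| ||v_{k-1}||)) for the
    iterates stored as columns of Z, with v_k = z_k - z_{k-1}."""
    V = np.diff(Z, axis=1)
    cos = np.sum(V[:, 1:] * V[:, :-1], axis=0) / (
        np.linalg.norm(V[:, 1:], axis=0) * np.linalg.norm(V[:, :-1], axis=0))
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Straight-line trajectory z_k = z* + 0.9^k u: all v_k are parallel, theta_k = 0.
# Planar spiral z_{k+1} = 0.9 * Rot(alpha) z_k: consecutive v_k differ by a
# rotation of alpha, so theta_k is the constant alpha.
```

A constant nonzero \u03b8k thus signals a spiral (inertia will struggle), while \u03b8k \u2192 0 signals an eventual straight line (inertia helps), matching the discussion in Section 3.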
For Figure 3(a), the oscillation after k = 2000 is due to machine precision.

Figure 3: Performance comparisons for the LASSO problem. Top row (a)-(c): 1 − cos(θk) for the covtype, ijcnn1 and phishing datasets; bottom row (d)-(f): ||xk − x⋆|| for the same datasets.

6 Conclusions

In this article, by analyzing the trajectory of the fixed-point sequence associated to ADMM and extrapolating along this trajectory, we discussed the limitations of current inertial accelerated ADMM and presented an adaptive acceleration scheme for the method. Furthermore, our local linear analysis allows for the application of previous results on extrapolation methods, and hence provides guaranteed (local) acceleration.

Acknowledgments

We would like to thank Arieh Iserles for pointing out the connection between trajectory-following adaptive acceleration and vector extrapolation. We would also like to thank the reviewers, whose comments helped to improve the paper. JL was partly supported by the Leverhulme Trust and the Newton Trust.

References

[1] A. C. Aitken. XXV. On Bernoulli's numerical solution of algebraic equations. Proceedings of the Royal Society of Edinburgh, 46:289–305, 1927.

[2] F. Alvarez and H. Attouch. An inertial proximal method for maximal monotone operators via discretization of a nonlinear oscillator with damping. Set-Valued Analysis, 9(1-2):3–11, 2001.

[3] D. G. Anderson. Iterative procedures for nonlinear integral equations. Journal of the ACM, 12(4):547–560, 1965.

[4] H. Bauschke and P. L. Combettes.
Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2011.

²https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

[5] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[6] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[7] C. Brezinski. Convergence acceleration during the 20th century. Numerical Analysis: Historical Developments in the 20th Century, page 113, 2001.

[8] C. Brezinski and M. R. Zaglia. Extrapolation Methods: Theory and Practice, volume 2. Elsevier, 2013.

[9] S. Cabay and L. W. Jackson. A polynomial extrapolation method for finding limits and antilimits of vector sequences. SIAM Journal on Numerical Analysis, 13(5):734–752, 1976.

[10] A. Chambolle and C. Dossal. On the convergence of the iterates of the "fast iterative shrinkage/thresholding algorithm". Journal of Optimization Theory and Applications, 166(3):968–982, 2015.

[11] J. Eckstein and D. P. Bertsekas. On the Douglas–Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1-3):293–318, 1992.

[12] R. P. Eddy. Extrapolating to the limit of a vector sequence. In Information Linkage between Applied Mathematics and Industry, pages 387–396. Elsevier, 1979.

[13] G. Franca, D. P. Robinson, and R. Vidal.
A dynamical systems perspective on nonsmooth constrained optimization. arXiv preprint arXiv:1808.04048, 2018.

[14] G. Franca, D. P. Robinson, and R. Vidal. ADMM and accelerated ADMM as continuous dynamical systems. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1559–1567. PMLR, 2018.

[15] D. Gabay. Chapter IX: Applications of the method of multipliers to variational inequalities. Studies in Mathematics and its Applications, 15:299–331, 1983.

[16] D. Gabay and B. Mercier. A dual algorithm for the solution of non linear variational problems via finite element approximation. Institut de recherche d'informatique et d'automatique, 1975.

[17] R. Glowinski and A. Marroco. Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires. ESAIM: Mathematical Modelling and Numerical Analysis, 9(R2):41–76, 1975.

[18] M. Kadkhodaie, K. Christakopoulou, M. Sanjabi, and A. Banerjee. Accelerated alternating direction method of multipliers. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 497–506. ACM, 2015.

[19] A. S. Lewis. Active sets, nonsmoothness, and sensitivity. SIAM Journal on Optimization, 13(3):702–725, 2003.

[20] J. Liang. Convergence rates of first-order operator splitting methods. PhD thesis, Normandie Université; GREYC CNRS UMR 6072, 2016.

[21] M. Mešina. Convergence acceleration for the iterative solution of the equations x = Ax + f. Computer Methods in Applied Mechanics and Engineering, 10(2):165–173, 1977.

[22] Y.
Nesterov. A method for solving the convex programming problem with convergence rate O(1/k²). Dokl. Akad. Nauk SSSR, 269(3):543–547, 1983.

[23] I. Pejcic and C. N. Jones. Accelerated ADMM based on accelerated Douglas–Rachford splitting. In 2016 European Control Conference (ECC), pages 1952–1957. IEEE, 2016.

[24] L. F. Richardson and J. A. Gaunt. VIII. The deferred approach to the limit. Philosophical Transactions of the Royal Society of London, Series A, 226(636-646):299–361, 1927.

[25] D. Scieur, A. d'Aspremont, and F. Bach. Regularized nonlinear acceleration. In Advances in Neural Information Processing Systems, pages 712–720, 2016.

[26] D. Shanks. Non-linear transformations of divergent and slowly convergent sequences. Journal of Mathematics and Physics, 34(1-4):1–42, 1955.

[27] A. Sidi. Practical Extrapolation Methods: Theory and Applications, volume 10. Cambridge University Press, 2003.

[28] A. Sidi. Vector Extrapolation Methods with Applications, volume 17. SIAM, 2017.

[29] P. Wynn. Acceleration techniques for iterated vector and matrix problems. Mathematics of Computation, 16(79):301–322, 1962.