{"title": "Non-Ergodic Alternating Proximal Augmented Lagrangian Algorithms with Optimal Rates", "book": "Advances in Neural Information Processing Systems", "page_first": 4811, "page_last": 4819, "abstract": "We develop two new non-ergodic alternating proximal augmented Lagrangian algorithms (NEAPAL) to solve a class of nonsmooth constrained convex optimization problems. Our approach relies on a novel combination of the augmented Lagrangian framework, alternating/linearization scheme, Nesterov's acceleration techniques, and adaptive strategy for parameters. Our algorithms have several new features compared to existing methods. Firstly, they have a Nesterov's acceleration step on the primal variables compared to the dual one in several methods in the literature.\nSecondly, they achieve non-ergodic optimal convergence rates under standard assumptions, i.e. an $\\mathcal{O}\\left(\\frac{1}{k}\\right)$ rate without any smoothness or strong convexity-type assumption, or an $\\mathcal{O}\\left(\\frac{1}{k^2}\\right)$ rate under only semi-strong convexity, where $k$ is the iteration counter. \nThirdly, they preserve or have better per-iteration complexity compared to existing algorithms. Fourthly, they can be implemented in a parallel fashion.\nFinally, all the parameters are adaptively updated without heuristic tuning.\nWe verify our algorithms on different numerical examples and compare them with some state-of-the-art methods.", "full_text": "Non-Ergodic Alternating Proximal Augmented\nLagrangian Algorithms with Optimal Rates\n\nDepartment of Statistics and Operations Research, University of North Carolina at Chapel Hill\n\nAddress: Hanes Hall 333, UNC-Chapel Hill, NC27599, USA.\n\nEmail: quoctd@email.unc.edu\n\nQuoc Tran-Dinh\u21e4\n\nAbstract\n\nWe develop two new non-ergodic alternating proximal augmented Lagrangian algorithms\n(NEAPAL) to solve a class of nonsmooth constrained convex optimization problems. 
Our approach relies on a novel combination of the augmented Lagrangian framework, an alternating/linearization scheme, Nesterov's acceleration techniques, and an adaptive strategy for the parameters. Our algorithms have several new features compared to existing methods. Firstly, they apply a Nesterov acceleration step to the primal variables, in contrast to the dual one used in several methods in the literature. Secondly, they achieve non-ergodic optimal convergence rates under standard assumptions, i.e., an $\mathcal{O}(1/k)$ rate without any smoothness or strong convexity-type assumption, or an $\mathcal{O}(1/k^2)$ rate under only semi-strong convexity, where $k$ is the iteration counter. Thirdly, they preserve or have better per-iteration complexity compared to existing algorithms. Fourthly, they can be implemented in a parallel fashion. Finally, all the parameters are adaptively updated without heuristic tuning. We verify our algorithms on different numerical examples and compare them with some state-of-the-art methods.

1 Introduction

Problem statement: We consider the following nonsmooth constrained convex problem:

$F^\star := \min_{z:=(x,y)\in\mathbb{R}^p} \Big\{ F(z) := f(x) + \sum_{i=1}^m g_i(y_i) \Big\}$ s.t. $Ax + \sum_{i=1}^m B_i y_i = c$,  (1)

where $f : \mathbb{R}^{\bar{p}} \to \mathbb{R}\cup\{+\infty\}$ and $g_i : \mathbb{R}^{p_i} \to \mathbb{R}\cup\{+\infty\}$ are proper, closed, and convex functions; $p := \bar{p} + \hat{p}$ with $\hat{p} := \sum_{i=1}^m p_i$; and $A \in \mathbb{R}^{n\times\bar{p}}$, $B_i \in \mathbb{R}^{n\times p_i}$, and $c \in \mathbb{R}^n$ are given. Here, we also define $y := [y_1,\cdots,y_m]$ as a column vector, $g(y) := \sum_{i=1}^m g_i(y_i)$, and $By := \sum_{i=1}^m B_i y_i$. We often assume that we do not explicitly form the matrices $A$ and $B_i$, but can only compute $Ax$, $B_i y_i$, and the adjoints $A^\top\lambda$ and $B_i^\top\lambda$ for any given $x$, $y_i$, and $\lambda$, and for $i = 1,\cdots,m$.

Problem (1) is sufficiently general to cope with many applications in different fields, including machine learning, statistics, image/signal processing, and model predictive control. 
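To make the template (1) concrete, the following sketch builds a small illustrative instance with $m = 2$ blocks, $f(x) = \frac{1}{2}\|x\|^2$, and $g_i(y_i) = \|y_i\|_1$, and checks the linear constraint residual. All dimensions and names here are hypothetical, chosen only for illustration.

```python
import numpy as np

# Illustrative instance of problem (1) with m = 2 blocks:
#   min_{x,y1,y2} f(x) + g1(y1) + g2(y2)  s.t.  A x + B1 y1 + B2 y2 = c,
# with f(x) = 0.5*||x||^2 and gi(yi) = ||yi||_1 (proper, closed, convex).
rng = np.random.default_rng(0)
n, p_bar, p1, p2 = 20, 30, 15, 15          # hypothetical dimensions
A  = rng.standard_normal((n, p_bar))
B1 = rng.standard_normal((n, p1))
B2 = rng.standard_normal((n, p2))

# Pick a point and define c so the linear constraint holds exactly.
x  = rng.standard_normal(p_bar)
y1 = rng.standard_normal(p1)
y2 = rng.standard_normal(p2)
c  = A @ x + B1 @ y1 + B2 @ y2

def F(x, y1, y2):
    """Objective F(z) = f(x) + g1(y1) + g2(y2) from (1)."""
    return 0.5 * np.dot(x, x) + np.abs(y1).sum() + np.abs(y2).sum()

# Feasibility residual ||A x + B1 y1 + B2 y2 - c||; zero by construction.
residual = np.linalg.norm(A @ x + B1 @ y1 + B2 @ y2 - c)
```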
In particular, (1) covers convex empirical risk minimization, support vector machines, LASSO-type, matrix completion, and compressive sensing problems as representative examples.

Our approach: Our approach relies on a novel combination of the augmented Lagrangian (AL) function and other classical and new techniques. First, we use the AL as a merit function. Next, we incorporate an acceleration step (either Nesterov's momentum [17] or Tseng's accelerated variant [25]) into the primal steps. Then, we alternate the augmented Lagrangian primal subproblem between $x$ and $y$. We also linearize the $y_i$-subproblems and parallelize them to reduce the per-iteration complexity. Finally, we incorporate an adaptive strategy proposed in [23] to derive explicit update rules for the algorithmic parameters. Our approach shares some similarities with the alternating direction method of multipliers (ADMM) and the alternating minimization algorithm (AMA), but is essentially different in several aspects, as discussed below.

Our contribution: Our contribution can be summarized as follows:

(a) We propose a novel algorithm called NEAPAL, Algorithm 1, to solve (1) under only convexity and strong duality assumptions. This algorithm can be viewed as a Nesterov-accelerated, alternating, linearizing, and parallel proximal AL method, which alternates between $x$ and $y_i$, and linearizes and parallelizes the $y_i$-subproblems.

*This work is partly supported by the NSF grant DMS-1619884, USA.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

(b) We prove an optimal $\mathcal{O}(1/k)$ rate of this algorithm in terms of $|F(z^k) - F^\star|$ and $\|Ax^k + By^k - c\|$. Our rate is achieved at the last iterate (i.e., in a non-ergodic sense), while our per-iteration complexity is the same as, or even better than, that of existing methods.

(c) When problem (1) is semi-strongly convex, i.e., $f$ is non-strongly convex and $g$ is strongly convex, we develop a new NEAPAL variant, Algorithm 2, that achieves an optimal $\mathcal{O}(1/k^2)$ rate. This rate holds either in a semi-ergodic (i.e., non-ergodic in $x$ and ergodic in $y$) sense or in a non-ergodic sense. The non-ergodic rate just requires one more proximal operator of $g$. This variant also possesses the same parallel computation feature as Algorithm 1.

From a practical point of view, Algorithm 1 has better per-iteration complexity than ADMM and AMA since the $y_i$-subproblems are linearized and parallelized. This per-iteration complexity is essentially the same as in primal-dual methods [3] when applied to composite convex problems with linear operators. When $f = 0$, we obtain fully parallel variants of Algorithms 1 and 2 which only require the proximal operators of $g_i$ and solve all the $y_i$-subproblems in parallel.

In terms of theory, Algorithm 1 achieves an optimal $\mathcal{O}(1/k)$ rate in a non-ergodic sense. Moreover, the dual step does not require averaging. Algorithm 2 only requires $F$ to be semi-strongly convex to achieve an optimal $\mathcal{O}(1/k^2)$ rate on the last iterate, which is a weaker assumption than that of the accelerated ADMM method in [11]. To the best of our knowledge, optimal rates at the last iterate have not yet been known in primal-dual methods such as [3].^2 The $\mathcal{O}(1/k^2)$ rate is also achieved in [29] for accelerated ADMM, but Algorithm 2 remains essentially different from [29]. First, it combines different acceleration schemes for $x$ and $y$. Second, the convergence rate can be achieved in either a non-ergodic or a semi-ergodic sense. Third, the parameters are updated explicitly.

Related work: Our algorithms developed in this paper can be cast into the framework of augmented Lagrangian-type methods. In this context, we briefly review some notable and recent works which are most related to our methods. 
The augmented Lagrangian method dates back to the work of Hestenes and Powell on nonlinear programming in the early 1970s. It soon became a powerful method for solving nonlinear optimization and constrained convex optimization problems. Alternating methods, in turn, date back to von Neumann's work. Among these algorithms, AMA and ADMM are the most popular. ADMM can be viewed either as the dual variant of Douglas-Rachford's method [8, 15] or as an alternating variant of AL methods [1]. ADMM is widely used in practice, especially in signal and image processing; [2] provides a comprehensive survey of ADMM in statistical learning. While the asymptotic convergence of ADMM has long been known, see, e.g., [8], its $\mathcal{O}(1/k)$ convergence rate seems to have been first proved in [13, 16]. However, such a rate in [13] is achieved through a gap function in the framework of variational inequalities and in an ergodic sense. The same $\mathcal{O}(1/k)$ non-ergodic rate was then proved in [12], but still on the sequence of differences $\|w^{k+1} - w^k\|^2$ combining both the primal and dual variables in $w$. Many other works also focus on theoretical aspects of ADMM by showing its $\mathcal{O}(1/k)$ convergence rate in the objective residual $|F(z^k) - F^\star|$ and the feasibility gap $\|Ax^k + By^k - c\|$. Notable works include [6, 11, 20, 22]. Extensions to stochastic settings as well as multi-block formulations have also been intensively studied in the literature, e.g., [4, 7]. Other researchers have attempted to optimize the rate of convergence in certain cases, such as quadratic problems, or by using the theory of feedback control [10, 19]. In terms of algorithms, the main steps of ADMM remain the same in most existing research papers. Some modifications have been made to ADMM, such as relaxation [6, 20, 22] or dual acceleration [11, 20]. Other extensions to Bregman distances and proximal settings remain essentially the same as the original version, see, e.g., [26]. 
Note that our algorithms can be cast into a primal-dual method such as [3, 23] rather than ADMM when solving composite problems with linear operators.

In terms of theory, most existing results show an ergodic convergence rate of $\mathcal{O}(1/k)$, in either a gap function or in both the objective residual and the constraint violation [5, 6, 11, 13, 20, 22, 27]. This rate has been shown to be optimal for ADMM-type methods under only convexity and strong duality in the recent work [14, 28]. When one function $f$ or $g$ is strongly convex, one can achieve an $\mathcal{O}(1/k^2)$ rate as shown in [29], but it is still on an averaging sequence. A recent work [14] proposed a linearized ADMM variant using Nesterov's acceleration step and showed an $\mathcal{O}(1/k)$ non-ergodic rate. This scheme is very similar to a special case of Algorithm 1. However, our scheme has a better per-iteration complexity than [14], since it updates the $y_i$ in parallel instead of alternating as in [14]. Besides, our analysis is much simpler than that of [14], which is extremely long and involves various parameters.

^2 In [14], a non-ergodic rate is obtained, but the algorithm is essentially different. However, a non-ergodic optimal rate of first-order methods for solving (1) was perhaps first proved in [24].

Paper organization: The rest of this paper is organized as follows. Section 2 recalls the dual problem of (1) and the optimality condition. It also provides a key lemma for the convergence analysis. Section 3 presents two new NEAPAL algorithms and analyzes their convergence rates. It also considers an extension. Section 4 provides some representative numerical examples.

Notations: We work on finite-dimensional spaces $\mathbb{R}^p$ and $\mathbb{R}^n$, equipped with the standard inner product $\langle\cdot,\cdot\rangle$ and norm $\|\cdot\|$. 
Given a proper, closed, and convex function $f$, $\mathrm{dom}(f)$ denotes its domain, $\partial f(\cdot)$ is its subdifferential, $f^*(y) := \sup_x \{\langle y, x\rangle - f(x)\}$ is its Fenchel conjugate, and $\mathrm{prox}_{\lambda f}(x) := \mathrm{argmin}_u \{ f(u) + \frac{1}{2\lambda}\|u - x\|^2 \}$ is its proximal operator, where $\lambda > 0$. We say that $\mathrm{prox}_f$ is tractably proximal if it can be computed efficiently in a closed form or by a polynomial-time algorithm. Several tractable proximity functions can be found in the literature. We say that $f$ has an $L_f$-Lipschitz gradient if it is differentiable and its gradient $\nabla f$ is Lipschitz continuous with Lipschitz constant $L_f \in [0, +\infty)$, and that $f$ is $\mu_f$-strongly convex if $f(\cdot) - \frac{\mu_f}{2}\|\cdot\|^2$ is convex, where $\mu_f > 0$ is its strong convexity parameter. For a given convex set $\mathcal{X}$, $\mathrm{ri}(\mathcal{X})$ denotes its relative interior. For a given matrix $A$, we denote by $\|A\|$ its operator (or spectral) norm.

2 Duality theory, fundamental assumption, and optimality conditions

The Lagrange function associated with (1) is $\mathcal{L}(x, y, \lambda) := f(x) + g(y) - \langle Ax + By - c, \lambda\rangle$, where $\lambda$ is the vector of Lagrange multipliers. The dual function is defined as

$d(\lambda) := \sup_{(x,y)\in\mathrm{dom}(F)} \{ \langle Ax + By - c, \lambda\rangle - f(x) - g(y) \} = f^*(A^\top\lambda) + g^*(B^\top\lambda) - \langle c, \lambda\rangle$,

where $\mathrm{dom}(F) := \mathrm{dom}(f) \times \mathrm{dom}(g)$, and $f^*$ and $g^*$ are the Fenchel conjugates of $f$ and $g$, respectively. The dual problem of (1) is

$d^\star := \min_{\lambda\in\mathbb{R}^n} \{ d(\lambda) \equiv f^*(A^\top\lambda) + g^*(B^\top\lambda) - \langle c, \lambda\rangle \}$.  (2)

We say that a point $(x^\star, y^\star, \lambda^\star) \in \mathrm{dom}(F) \times \mathbb{R}^n$ is a saddle point of the Lagrange function $\mathcal{L}$ if, for all $(x, y) \in \mathrm{dom}(F)$ and $\lambda \in \mathbb{R}^n$, one has

$\mathcal{L}(x^\star, y^\star, \lambda) \le \mathcal{L}(x^\star, y^\star, \lambda^\star) \le \mathcal{L}(x, y, \lambda^\star)$.  (3)

We denote by $\mathcal{S}^\star := \{(x^\star, y^\star, \lambda^\star)\}$ the set of saddle points of $\mathcal{L}$, by $\mathcal{Z}^\star := \{(x^\star, y^\star)\}$ the set of primal components of saddle points, and by $\Lambda^\star := \{\lambda^\star\}$ the set of corresponding multipliers. In this paper, we rely on the following general assumption used in any primal-dual-type method.

Assumption 2.1. Both functions $f$ and $g$ are proper, closed, and convex. The set of saddle points $\mathcal{S}^\star$ of the Lagrange function $\mathcal{L}$ is nonempty, and the optimal value $F^\star$ is finite and attainable at some $(x^\star, y^\star) \in \mathcal{Z}^\star$.

We assume that Assumption 2.1 holds throughout this paper without recalling it in the sequel. The optimality condition (or KKT condition) of (1) can be written as

$0 \in \partial f(x^\star) - A^\top\lambda^\star$, $0 \in \partial g(y^\star) - B^\top\lambda^\star$, and $Ax^\star + By^\star = c$.  (4)

Let us assume that the following Slater condition holds:

$\mathrm{ri}(\mathrm{dom}(F)) \cap \{(x, y) \mid Ax + By = c\} \ne \emptyset$.

Then, the optimality condition (4) is necessary and sufficient for the strong duality of (1) and (2) to hold, i.e., $F^\star + d^\star = 0$, the dual solution is attainable, and $\Lambda^\star$ is bounded, see, e.g., [1].

In practice, we can only find an approximation $\tilde{z}^\star := (\tilde{x}^\star, \tilde{y}^\star)$ to $z^\star$ of (1) in the following sense:

Definition 2.1. Given a tolerance $\varepsilon := (\varepsilon_p, \varepsilon_d) > 0$, we say that $\tilde{z}^\star := (\tilde{x}^\star, \tilde{y}^\star) \in \mathrm{dom}(F)$ is an $\varepsilon$-solution of (1) if $|F(\tilde{z}^\star) - F^\star| \le \varepsilon_p$ and $\|A\tilde{x}^\star + B\tilde{y}^\star - c\| \le \varepsilon_d$.

Let us define the augmented Lagrangian function $\mathcal{L}_\rho$ associated with the constrained problem (1) as

$\mathcal{L}_\rho(z, \lambda) := f(x) + g(y) - \langle\lambda, Ax + By - c\rangle + \frac{\rho}{2}\|Ax + By - c\|^2$,  (5)

where $z := (x, y)$, $\lambda$ is the corresponding multiplier, and $\rho > 0$ is a penalty parameter. The following lemma characterizes approximate solutions of (1); its proof is in Supplementary Document A.

Lemma 2.1. Let $S_\rho(z, \lambda) := \mathcal{L}_\rho(z, \lambda) - F^\star$ for $\mathcal{L}_\rho$ defined by (5). Then, for any $z = (x, y) \in \mathrm{dom}(F)$ and $\lambda^\star \in \Lambda^\star$, we have

$|F(z) - F^\star| \le \max\Big\{ S_\rho(z, \lambda) + \frac{\|\lambda\|^2}{2\rho},\ \frac{1}{\rho}\|\lambda^\star\| R_d \Big\}$ and $\|Ax + By - c\| \le \frac{R_d}{\rho}$,  (6)

where $R_d := \|\lambda - \lambda^\star\| + \sqrt{\|\lambda - \lambda^\star\|^2 + 2\rho S_\rho(z, \lambda)}$ and $\|\lambda - \lambda^\star\|^2 + 2\rho S_\rho(z, \lambda) \ge 0$.

Using Lemma 2.1, our goal is to generate a sequence $\{(z^k, \rho_k)\}$ such that $S_{\rho_k}(z^k, \hat{\lambda}^0)$ converges to zero. 
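The algorithms below access each $g_i$ only through its proximal operator. As a concrete example of a tractably proximal function, the prox of the scaled $\ell_1$-norm is the well-known coordinate-wise soft-thresholding map; a minimal NumPy sketch (illustrative, not part of the paper's code):

```python
import numpy as np

def prox_l1(u, lam):
    """prox_{lam*||.||_1}(u): the unique minimizer of
       lam*||x||_1 + 0.5*||x - u||^2, i.e. soft-thresholding."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

u = np.array([3.0, -0.5, 1.5, 0.0])
x = prox_l1(u, 1.0)   # entries: [2.0, -0.0, 0.5, 0.0]
```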
In this case, we obtain $z^k$ as an approximate solution of (1) in the sense of Definition 2.1.

3 Non-Ergodic Alternating Proximal Augmented Lagrangian Algorithms

We first propose a new primal-dual algorithm to solve nonsmooth and non-strongly convex problems of the form (1). Then, we present another variant for the semi-strongly convex case. Finally, we extend our methods to the sum of smooth and nonsmooth objectives.

3.1 NEAPAL for the non-strongly convex case

The classical augmented Lagrangian method minimizes the augmented Lagrangian function $\mathcal{L}_\rho$ in (5) over $x$ and $y$ altogether, which is often difficult. Our methods alternate between $x$ and $y$ to break the non-separability of the augmented term $\frac{\rho}{2}\|Ax + By - c\|^2$. Therefore, at each iteration $k$, given $\hat{z}^k := (\hat{x}^k, \hat{y}^k) \in \mathrm{dom}(F)$, $\hat{\lambda}^k \in \mathbb{R}^n$, $\rho_k > 0$, and $\gamma_k \ge 0$, we define the $x$-subproblem as

$S_{\gamma_k}(\hat{z}^k, \hat{\lambda}^k; \rho_k) := \mathrm{argmin}_{x\in\mathrm{dom}(f)} \Big\{ f(x) - \langle\hat{\lambda}^k, Ax\rangle + \frac{\rho_k}{2}\|Ax + B\hat{y}^k - c\|^2 + \frac{\gamma_k}{2}\|x - \hat{x}^k\|^2 \Big\}$.  (7)

If $\gamma_k > 0$, then (7) is well-defined and has a unique solution. If $\gamma_k = 0$, then we need to assume that (7) has an optimal solution, but not necessarily a unique one. For the $y$-subproblem, we linearize the augmented term to make use of the proximal operators of $g$. We also incorporate Nesterov's accelerated steps [18] into these primal subproblems. In summary, our algorithm is presented in Algorithm 1, which we call a Non-Ergodic Alternating Proximal Augmented Lagrangian (NEAPAL) method.

Algorithm 1 (Non-Ergodic Alternating Proximal Augmented Lagrangian Algorithm (NEAPAL))
1: Initialization: Choose $z^0 := (x^0, y^0) \in \mathrm{dom}(F)$, $\hat{\lambda}^0 \in \mathbb{R}^n$, $\rho_0 > 0$, and $\gamma_0 \ge 0$. Set $\tilde{z}^0 := z^0$.
2: For $k := 0$ to $k_{\max}$ perform
3:   (Parameter update) $\tau_k := \frac{1}{k+1}$, $\rho_k := \rho_0(k+1)$, $\beta_k := 2\rho_0 L_B(k+1)$, and $\eta_k := \frac{\rho_0}{2}$.
4:   (Acceleration step) $\hat{z}^k := (1-\tau_k)z^k + \tau_k\tilde{z}^k$ with $z = (x, y)$.
5:   ($x$-update) $x^{k+1} := S_{\gamma_k}(\hat{z}^k, \hat{\lambda}^k; \rho_k)$ by solving (7), and $r^k := Ax^{k+1} + B\hat{y}^k - c$.
6:   (Parallel $y$-update) For $i = 1$ to $m$, update $y_i^{k+1} := \mathrm{prox}_{g_i/\beta_k}\big( \hat{y}_i^k - \frac{1}{\beta_k} B_i^\top(\rho_k r^k - \hat{\lambda}^k) \big)$.
7:   (Momentum step) $\tilde{z}^{k+1} := \tilde{z}^k + \frac{1}{\tau_k}(z^{k+1} - \hat{z}^k)$.
8:   (Dual step) $\hat{\lambda}^{k+1} := \hat{\lambda}^k - \eta_k(A\tilde{x}^{k+1} + B\tilde{y}^{k+1} - c)$.
9:   ($\gamma$-update) Choose $0 \le \gamma_{k+1} \le \frac{k+2}{k+1}\gamma_k$ if necessary.
10: End for

The parameter $L_B$ in Algorithm 1 can be chosen as $L_B := \|B\|^2$, or $L_B := m\max\{\|B_i\|^2 \mid 1 \le i \le m\}$. Moreover, we have the flexibility to choose $\rho_0$ and $\gamma_0$. For example, we can fix $\gamma_0 > 0$ to make sure (7) is well-defined. But if $A = \mathbb{I}$, the identity operator, or $A$ is orthogonal, then we should choose $\gamma_0 = 0$.

Combining Step 4 and Step 7, we can show that the per-iteration complexity of Algorithm 1 is dominated by the subproblem (7) at Step 5, one proximal operator of $g$, one matrix-vector multiplication ($Ax$, $By$), and one adjoint $B^\top\lambda$. Hence, the per-iteration complexity of Algorithm 1 is better than that of standard ADMM [2]. We also observe the following additional features of Algorithm 1.

- Firstly, the subproblem (7) not only admits a unique solution, but is also strongly convex. Hence, if we use first-order methods to solve it, then we obtain a linear convergence rate. In particular, if $A = \mathbb{I}$ or $A$ is orthonormal, then we can choose $\gamma_0 = 0$, and (7) reduces to the proximal operator of $f$, i.e., $S_0(\hat{z}^k, \hat{\lambda}^k; \rho_k) := \mathrm{prox}_{f/\rho_k}\big( A^\top(c - B\hat{y}^k + \rho_k^{-1}\hat{\lambda}^k) \big)$.
- Secondly, we directly incorporate Nesterov's accelerated steps into the primal variables instead of the dual one as in [11, 20]. We can eliminate $\tilde{z}^k$ and update $\hat{z}^{k+1} := z^{k+1} + \frac{k}{k+2}(z^{k+1} - z^k)$. In this case, the dual variable $\hat{\lambda}^k$ can be updated as $\hat{\lambda}^{k+1} := \hat{\lambda}^k - \frac{\eta_k}{\tau_k}\big[ Ax^{k+1} + By^{k+1} - c - (1-\tau_k)(Ax^k + By^k - c) \big]$. This dual update collapses to the one in classical AL methods such as AMA and ADMM and their variants when $\tau_k = 1$ is fixed for all iterations.
- Thirdly, the parameters $\rho_k$ and $\beta_k$ are increasingly updated with the same rate $\mathcal{O}(k)$, and $\gamma_k$ can be increasing, decreasing, or fixed. Moreover, while the penalty parameter $\rho_k$ is updated at each iteration, the step-size $\eta_k$ in the dual step remains fixed.
- Fourthly, we can use different parameters $\beta_k^i$ for each $y_i$-subproblem, $i = 1,\cdots,m$, based on $L_{B_i} := m\|B_i\|^2$ for each component $i$. In this case, we can update $\beta_k^i := 2\rho_0 L_{B_i}(k+1)$.
- Finally, if $f = 0$, then we can remove the $x$-subproblem in Algorithm 1 to obtain a fully parallel variant of this algorithm. The convergence analysis of this variant requires some slight changes.

The convergence of Algorithm 1 is stated in the following theorem, whose proof can be found in Supplementary Document B.

Theorem 3.1. Let $\{z^k\}$ be the sequence generated by Algorithm 1. 
Then, for any $k \ge 1$, we have

$|F(z^k) - F^\star| \le \frac{1}{2\rho_0 k}\max\big\{ \rho_0 R_0^2 + \|\hat{\lambda}^0\|^2,\ 2R_d\|\lambda^\star\| \big\}$ and $\|Ax^k + By^k - c\| \le \frac{R_d}{\rho_0 k}$,  (8)

where $R_0^2 := \gamma_0\|x^0 - x^\star\|^2 + 2\rho_0 L_B\|y^0 - y^\star\|^2$ and $R_d := \|\hat{\lambda}^0 - \lambda^\star\| + \sqrt{\|\hat{\lambda}^0 - \lambda^\star\|^2 + \rho_0 R_0^2}$. Consequently, the sequence of last iterates $\{z^k\}$ globally converges to a solution $z^\star$ of (1) at a non-ergodic optimal $\mathcal{O}(1/k)$ rate, i.e., $|F(z^k) - F^\star| \le \mathcal{O}(1/k)$ and $\|Ax^k + By^k - c\| \le \mathcal{O}(1/k)$.

3.2 NEAPAL for the semi-strongly convex case

Now, we propose a new variant of Algorithm 1 that can exploit the semi-strong convexity of $F$. Without loss of generality, we assume that $g_i$ is strongly convex with convexity parameter $\mu_{g_i} > 0$ for all $i = 1,\cdots,m$. In this case, $g(y) = \sum_{i=1}^m g_i(y_i)$ is also strongly convex with parameter $\mu_g := \min\{\mu_{g_i} \mid 1 \le i \le m\} > 0$.

To exploit the strong convexity of $g$, we apply Tseng's accelerated scheme in [25] to the $y$-subproblem, while using Nesterov's momentum idea [17] for the $x$-subproblem to keep the non-ergodic convergence on $\{x^k\}$. The complete algorithm is described in Algorithm 2.

Algorithm 2 (scvx-NEAPAL for solving (1) with strongly convex objective term $g$)
1: Initialization: Choose $z^0 := (x^0, y^0) \in \mathrm{dom}(F)$, $\hat{\lambda}^0 \in \mathbb{R}^n$, $\rho_0 \in \big(0, \frac{\mu_g}{4L_B}\big]$, and $\gamma_0 \ge 0$.
2: Set $\tau_0 := 1$ and $\tilde{z}^0 := z^0$.
3: For $k := 0$ to $k_{\max}$ perform
4:   (Parameter update) Set $\rho_k := \frac{\rho_0}{\tau_k^2}$, $\gamma_k := \gamma_0$, $\beta_k := 2L_B\rho_k$, and $\eta_k := \frac{\rho_k\tau_k}{2}$.
5:   (Acceleration step) $\hat{z}^k := (1-\tau_k)z^k + \tau_k\tilde{z}^k$ with $z = (x, y)$.
6:   ($x$-update) $x^{k+1} := S_{\gamma_k}(\hat{z}^k, \hat{\lambda}^k; \rho_k)$ by solving (7), and $r^k := Ax^{k+1} + B\hat{y}^k - c$.
7:   ($x$-momentum step) $\tilde{x}^{k+1} := \tilde{x}^k + \frac{1}{\tau_k}(x^{k+1} - \hat{x}^k)$.
8:   (Parallel $\tilde{y}$-update) For $i = 1$ to $m$, update $\tilde{y}_i^{k+1} := \mathrm{prox}_{g_i/(\tau_k\beta_k)}\big( \tilde{y}_i^k - \frac{1}{\tau_k\beta_k} B_i^\top(\rho_k r^k - \hat{\lambda}^k) \big)$.
9:   (Dual step) $\hat{\lambda}^{k+1} := \hat{\lambda}^k - \eta_k(A\tilde{x}^{k+1} + B\tilde{y}^{k+1} - c)$.
10:  (Parallel $y$-update) For $i = 1$ to $m$, update $y_i^{k+1}$ using one of the following two options:
     $y_i^{k+1} := (1-\tau_k)y_i^k + \tau_k\tilde{y}_i^{k+1}$ (Option 1: Averaging step)
     $y_i^{k+1} := \mathrm{prox}_{g_i/(\rho_k L_B)}\big( \hat{y}_i^k - \frac{1}{\rho_k L_B} B_i^\top(\rho_k r^k - \hat{\lambda}^k) \big)$ (Option 2: Proximal step).
11:  ($\tau$-update) $\tau_{k+1} := \frac{\tau_k}{2}\big( \sqrt{\tau_k^2 + 4} - \tau_k \big)$.
12: End for

The parameter $L_B$ is chosen as in Algorithm 1, and $\mu_g := \min\{\mu_{g_i} \mid 1 \le i \le m\}$ in Algorithm 2. We can replace the choice of $\rho_0$ in Algorithm 2 by $0 < \rho_0 \le \min\big\{\frac{\mu_{g_i}}{4L_{B_i}} \mid 1 \le i \le m\big\}$, where $L_{B_i} := \|B_i\|^2$. Before analyzing the convergence of Algorithm 2, we make the following remarks:

(a) Firstly, Algorithm 2 linearizes the $y$-subproblem to reduce the per-iteration complexity. This step relies on Tseng's accelerated variant in [25] instead of Nesterov's optimal scheme [17] as in Algorithm 1. Hence, it uses two different options at Step 10 to form $y^{k+1}$.

(b) Secondly, if $y^{k+1}$ is updated using Option 1, then one can take a weighted averaging step on $y^k$ without incurring extra cost. Option 2 at Step 10 requires one additional $\mathrm{prox}_g$, but avoids the averaging on $y^k$ used in Option 1.

(c) Thirdly, we can eliminate the parameters $\rho_k$, $\beta_k$, and $\eta_k$ in Algorithm 2, so that it has only two parameters, $\tau_k$ and $\rho_0$, that need to be updated and initialized, respectively.

The following theorem proves convergence of Algorithm 2 (cf. Supplementary Document D).

Theorem 3.2. Assume that $g_i$ is $\mu_{g_i}$-strongly convex with $\mu_{g_i} > 0$ for all $i = 1,\cdots,m$, but $f$ is not necessarily strongly convex. Let $\{z^k\}$ be generated by Algorithm 2. 
Then, the following bounds hold:

$|F(z^k) - F^\star| \le \frac{2}{\rho_0(k+1)^2}\max\big\{ \rho_0 R_0^2 + \|\hat{\lambda}^0\|^2,\ 2R_d\|\lambda^\star\| \big\}$ and $\|Ax^k + By^k - c\| \le \frac{4R_d}{\rho_0(k+1)^2}$,  (9)

where $R_0^2 := \gamma_0\|x^0 - x^\star\|^2 + 2\rho_0 L_B\|y^0 - y^\star\|^2$ and $R_d := \|\hat{\lambda}^0 - \lambda^\star\| + \sqrt{\|\hat{\lambda}^0 - \lambda^\star\|^2 + 2\rho_0 R_0^2}$. Consequently, $\{z^k\}$ globally converges to $z^\star$ at an $\mathcal{O}(1/k^2)$ rate, either in a semi-ergodic sense (i.e., non-ergodic in $x^k$ and ergodic in $y^k$) if Option 1 is chosen, or in a non-ergodic sense if Option 2 is chosen, i.e., $|F(z^k) - F^\star| \le \mathcal{O}(1/k^2)$ and $\|Ax^k + By^k - c\| \le \mathcal{O}(1/k^2)$.

3.3 Extension to the sum of smooth and nonsmooth objective functions

We can consider (1) with $F(z) := f(x) + \hat{f}(x) + \sum_{i=1}^m [g_i(y_i) + \hat{g}_i(y_i)]$, where $\hat{f}$ and $\hat{g}_i$ are smooth with $L_{\hat{f}}$- and $L_{\hat{g}_i}$-Lipschitz gradients, respectively. In this case, the $x$- and $y_i$-subproblems in Algorithm 1 can be replaced respectively by

$x^{k+1} := \mathrm{argmin}_x \Big\{ f(x) + \langle\nabla\hat{f}(\hat{x}^k) - A^\top\hat{\lambda}^k, x - \hat{x}^k\rangle + \frac{\rho_k}{2}\|Ax + B\hat{y}^k - c\|^2 + \frac{\hat{\gamma}_k}{2}\|x - \hat{x}^k\|^2 \Big\}$,

$y_i^{k+1} := \mathrm{argmin}_{y_i} \Big\{ g_i(y_i) + \langle\nabla\hat{g}_i(\hat{y}_i^k) + B_i^\top(\rho_k r^k - \hat{\lambda}^k), y_i - \hat{y}_i^k\rangle + \frac{\hat{\beta}_k^i}{2}\|y_i - \hat{y}_i^k\|^2 \Big\}$,

where $\hat{\gamma}_k := \gamma_k L_A + L_{\hat{f}}$ and $\hat{\beta}_k^i := \beta_k L_{B_i} + L_{\hat{g}_i}$ for $i = 1,\cdots,m$. We can also modify Algorithm 2 and its convergence guarantee to handle this case, but we omit the details here.

4 Numerical experiments

We provide some numerical examples to illustrate our algorithms. More examples can be found in Supplementary Document E. All experiments are implemented in Matlab R2014b, running on a MacBook Pro. 
Retina, 2.7GHz Intel Core i5, 16GB RAM.

4.1 Square-root LASSO and square-root elastic-net

We consider the following square-root elastic-net problem as a modification of the model in [30]:

$F^\star := \min_{y\in\mathbb{R}^{\hat{p}}} \Big\{ F(y) := \|By - c\|_2 + \frac{\kappa_1}{2}\|y\|_2^2 + \kappa_2\|y\|_1 \Big\}$,  (10)

where $B \in \mathbb{R}^{n\times\hat{p}}$, $c \in \mathbb{R}^n$, and $\kappa_1 \ge 0$ and $\kappa_2 > 0$ are two regularization parameters. If $\kappa_1 = 0$, then (10) reduces to the well-known square-root LASSO model, which is fully nonsmooth.

Square-root LASSO problem: We first compare our algorithms with state-of-the-art methods on the square-root LASSO problem. Since this problem is fully nonsmooth and non-strongly convex, we implement three candidates to compare: ASGARD [23] and its restarting variant, and Chambolle-Pock's method [3]. For ASGARD, we use the same setting as in [23], and for Chambolle-Pock's (CP) method, we use step-sizes $\sigma = \tau = \|B\|^{-1}$ and $\theta = 1$. In Algorithm 1, we choose $\rho_0 := \frac{\|\lambda^\star\|}{\|B\|\|y^0 - y^\star\|}$ as suggested by Theorem 3.1 to trade off the objective residual and the feasibility gap, where $(x^\star, \lambda^\star)$ is computed by MOSEK up to the best accuracy. In Algorithm 2, we set $\rho_0 := \frac{\mu_g}{4\|B\|^2}$ as suggested by our theory, where $\mu_g := 0.1 \times \sigma_{\min}(B)$ is a guess for the restricted strong convexity parameter.

We generate $B$ randomly from the standard Gaussian distribution, without or with 50% correlated columns, and then normalize $B$ to obtain unit-norm columns. We generate $c$ as $c := By^\natural + \mathcal{N}(0, \sigma)$, where $y^\natural$ is an $s$-sparse vector, and $\sigma = 0$ (i.e., without noise) or $\sigma = 10^{-3}$ (i.e., with noise). In the square-root LASSO experiment, we set $\kappa_1 = 0$ and $\kappa_2 = 0.055$, which gives reasonable results close to $y^\natural$. We run these algorithms on two problem instances with $(n, p, s) = (700, 2000, 100)$, and the results are plotted in Figure 1. 
Here, NEAPAL is Algorithm 1, scvx-NEAPAL is Algorithm 2, NEAPAL-par is the parallel variant of Algorithm 1 obtained by setting $f = 0$, $g_1(y_1) := \|By_1 - c\|_2$, and $g_2(y_2) := \kappa_2\|y_2\|_1$, ASGARD-rs is the restarting ASGARD variant [23], and CP-Avg is the averaging sequence of CP.

Figure 1: Convergence behavior of the relative objective residuals of 6 algorithms for the square-root LASSO problem (10) after 500 iterations. Left: without noise; Right: with noise and 50% correlated columns.

We can observe from Figure 1 that Algorithm 1 and its parallel variant have similar performance and are comparable with ASGARD. Algorithm 2 also performs well compared to the other methods: it works better than Chambolle-Pock's method (CP) in early iterations, but becomes slower in late iterations. ASGARD-rs does not perform well due to the lack of strong convexity. While the last iterate of CP shows great progress, its averaging sequence, for which the convergence rate guarantee holds, is very slow in both cases: the standard setting and the setting with step-size $\tau = 1$.

Square-root elastic-net problems: Now, we consider the case $\kappa_1 = 0.01 > 0$ in (10), which is called the square-root elastic-net problem. Our data is generated as in the square-root LASSO experiment. 
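The synthetic data generation described above can be sketched as follows. Dimensions are scaled down from $(700, 2000, 100)$, and the sparsity and noise models are a simplified reading of the protocol; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, s, sigma = 70, 200, 10, 1e-3        # scaled-down from (700, 2000, 100)

# Gaussian design, normalized to unit-norm columns.
B = rng.standard_normal((n, p))
B /= np.linalg.norm(B, axis=0)

# s-sparse ground truth and noisy observations c = B y_nat + N(0, sigma).
y_nat = np.zeros(p)
support = rng.choice(p, size=s, replace=False)
y_nat[support] = rng.standard_normal(s)
c = B @ y_nat + sigma * rng.standard_normal(n)

def sqrt_elastic_net_obj(y, kappa1=0.01, kappa2=0.055):
    """Square-root elastic-net objective from (10); kappa1 = 0 gives
       the square-root LASSO."""
    return (np.linalg.norm(B @ y - c)
            + 0.5 * kappa1 * np.dot(y, y)
            + kappa2 * np.abs(y).sum())
```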
In this case, Algorithm 2 and Chambolle-Pock's method with strong convexity are used. The results of these algorithms and of the non-strongly convex variants are plotted in Figure 2.

Figure 2: Convergence behavior of the relative objective residuals of 6 algorithms for the square-root elastic-net problem (10) after 500 iterations. Left: without noise; Right: with noise and 50% correlated columns.

NEAPAL, NEAPAL-par, and scvx-NEAPAL all work well in this example, and are all comparable with ASGARD. CP makes slow progress in early iterations, but reaches better accuracy at the end. ASGARD-rs is the best due to the strong convexity of the problem; however, it does not have a theoretical guarantee. Again, the averaging sequence of CP is the slowest.

4.2 Low-rank matrix recovery with square-root loss

We consider a low-rank matrix recovery problem with square-root loss, which can be viewed as a penalized formulation of the model in [21]:

$F^\star := \min_{Y\in\mathbb{R}^{m\times q}} \big\{ F(Y) := \|\mathcal{B}(Y) - c\|_2 + \kappa\|Y\|_* \big\}$,  (11)

where $\|\cdot\|_*$ is the nuclear norm, $\mathcal{B} : \mathbb{R}^{m\times q} \to \mathbb{R}^n$ is a linear operator, $c \in \mathbb{R}^n$ is a given observation vector, and $\kappa > 0$ is a penalty parameter. By letting $z := (x, Y)$, $F(z) := \|x\|_2 + \kappa\|Y\|_*$, with the constraint $x + \mathcal{B}(Y) = c$, we can reformulate (11) into (1).

To avoid complicated subproblems in ADMM, we reformulate (11) as

$\min_{x,Y,Z} \big\{ \|x\|_2 + \kappa\|Z\|_* \mid x + \mathcal{B}(Y) = c,\ Y - Z = 0 \big\}$,

by introducing two auxiliary variables $x := c - \mathcal{B}(Y)$ and $Z := Y$. The main computation at each iteration of ADMM includes $\mathrm{prox}_{\kappa\|\cdot\|_*}$, $\mathcal{B}(Y)$, $\mathcal{B}^*(x)$, and the solution of the linear system $(\mathbb{I} + \mathcal{B}^*\mathcal{B})(Y) = e^k$, where $e^k$ is a residual term computed at each iteration. Since $\mathcal{B}$ and $\mathcal{B}^*$ are given as operators, we apply a preconditioned conjugate gradient (PCG) method to solve this system. We warm-start PCG and terminate it with a tolerance of $10^{-5}$ or after a maximum of 30 iterations. We tuned the penalty parameter $\rho$ of ADMM for our test and found that $\rho = 0.25$ works best. We call this variant "Tuned-ADMM".

We test three algorithms, Algorithm 1, ASGARD [23], and Tuned-ADMM, on 5 logo images: IBM, EPFL, MIT, TUM, and UNC. These images are pre-processed to obtain low-rank forms of ranks 45, 59, 6, 7, and 56, respectively. The measurement vector $c$ is generated as $c := \mathcal{B}(Y^\natural) + \mathcal{N}(0, 10^{-3}\max|Y^\natural_{ij}|)$ with Gaussian noise, where $Y^\natural$ is the original image, and $\mathcal{B}$ is a linear operator formed by a subsampled FFT with a 25% sampling rate. 
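The $\mathrm{prox}_{\kappa\|\cdot\|_*}$ step mentioned above is the standard singular-value soft-thresholding operator. A minimal sketch (illustrative; the paper's actual implementation is in Matlab and may differ):

```python
import numpy as np

def prox_nuclear(Y, kappa):
    """prox_{kappa*||.||_*}(Y): soft-threshold the singular values of Y
       (singular value thresholding)."""
    U, svals, Vt = np.linalg.svd(Y, full_matrices=False)
    svals = np.maximum(svals - kappa, 0.0)
    return U @ np.diag(svals) @ Vt

# Thresholding never increases the rank and shrinks small singular values
# to zero, which is why it promotes low-rank solutions.
rng = np.random.default_rng(1)
Y = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 8))  # rank <= 3
Z = prox_nuclear(Y, kappa=1.0)
```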
We run the 3 algorithms for 200 iterations, and the results are given in Table 1.

Table 1: The results and performance of 3 algorithms on 5 logo images of size 256 x 256.

       ASGARD [23]                          | Algorithm 1 (NEAPAL)                  | Tuned-ADMM
Name   Time  Error   F(Y^k)  PSNR rank Res  | Time  Error   F(Y^k)  PSNR rank Res   | Time  Error   F(Y^k)  PSNR rank Res
IBM    8.0   0.0615  0.293   72.4  34  0.107| 8.5   0.0604  0.297   72.4  34  0.107 | 12.7  0.0615  0.293   72.4  34  0.107
EPFL   8.2   0.0830  0.414   69.8  56  0.108| 8.1   0.0803  0.426   69.8  56  0.108 | 17.2  0.0830  0.414   69.8  56  0.108
MIT    7.9   0.0501  0.348   74.2   6  0.102| 7.5   0.0485  0.349   74.2   6  0.102 | 15.9  0.0502  0.348   74.2   6  0.102
TUM    7.5   0.0382  0.266   76.5  49  0.087| 7.6   0.0390  0.277   76.5  49  0.087 | 20.1  0.0384  0.267   76.5  49  0.087
UNC    8.3   0.0611  0.283   72.5  42  0.112| 7.7   0.0596  0.287   72.5  42  0.112 | 14.7  0.0611  0.283   72.5  42  0.112

The results in Table 1 show that ASGARD and NEAPAL work well and are comparable with Tuned-ADMM. However, NEAPAL and ASGARD are faster than Tuned-ADMM, which requires an inner PCG loop to solve its linear system. The recovered results for two images, TUM and MIT, are shown in Figure 3. Except for TUM, the three algorithms produce low-rank solutions as expected, and their PSNR (peak signal-to-noise ratio) is consistent. Moreover, the relative error $\mathrm{Error} := \|Y^k - Y^\natural\|_F / \|Y^\natural\|_F$ between $Y^k$ and the original image $Y^\natural$ is small in all cases.

[Figure 3: for TUM (original rank 7; recovered rank 49) and MIT (original rank 6; recovered rank 6), the original image and the recoveries by ASGARD, NEAPAL, and Tuned-ADMM.]

Figure 3: The low-rank recovery from three algorithms on two logo images: TUM and MIT.

5 Conclusion
We have proposed two novel primal-dual algorithms to solve a broad class of nonsmooth constrained convex optimization problems, and these algorithms have the following features. They offer the same or better per-iteration complexity as existing methods such as AMA or ADMM.
They achieve optimal convergence rates in the non-ergodic sense (i.e., at the last iterates) on the objective residual and feasibility violation, which is important in sparse and low-rank optimization as well as in image processing. They can be implemented in both sequential and parallel manners. The dual update step in Algorithms 1 and 2 can be viewed as the dual step in relaxed augmented Lagrangian-based methods, where the step-size differs from the penalty parameter. Our future research is to develop new variants of Algorithms 1 and 2, such as coordinate-descent, stochastic primal-dual, and asynchronous parallel algorithms. We also plan to investigate connections between our methods and primal-dual first-order methods such as the primal-dual hybrid gradient method, as well as projective and other splitting methods.

References
1. Dimitri P. Bertsekas. Constrained Optimization and Lagrange Multiplier Methods. Athena Scientific, 1996.
2. S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. and Trends in Machine Learning, 3(1):1–122, 2011.
3. A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis., 40(1):120–145, 2011.
4. C. Chen, B. He, Y. Ye, and X. Yuan. The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent. Math. Program., 155(1-2):57–79, 2016.
5. D. Davis. Convergence rate analysis of the forward-Douglas-Rachford splitting scheme. SIAM J. Optim., 25(3):1760–1786, 2015.
6. D. Davis and W. Yin. Faster convergence rates of relaxed Peaceman-Rachford and ADMM under regularity assumptions. Math. Oper. Res., 2014.
7. W. Deng, M.-J. Lai, Z. Peng, and W. Yin. Parallel multi-block ADMM with o(1/k) convergence. J. Scientific Computing, 71(2):712–736, 2017.
8. J. Eckstein and D. Bertsekas. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Math. Program., 55:293–318, 1992.
9. J. E. Esser. Primal-dual algorithm for convex models and applications to image restoration, registration and nonlocal inpainting. PhD Thesis, University of California, Los Angeles, Los Angeles, USA, 2010.
10. E. Ghadimi, A. Teixeira, I. Shames, and M. Johansson. Optimal parameter selection for the alternating direction method of multipliers: quadratic problems. IEEE Trans. Automat. Contr., 60(3):644–658, 2015.
11. T. Goldstein, B. O'Donoghue, and S. Setzer. Fast alternating direction optimization methods. SIAM J. Imaging Sci., 7(3):1588–1623, 2012.
12. B. He and X. Yuan. On non-ergodic convergence rate of Douglas-Rachford alternating direction method of multipliers. Numerische Mathematik, 130(3):567–577, 2012.
13. B.S. He, M. Tao, M.H. Xu, and X.M. Yuan. Alternating directions based contraction method for generally separable linearly constrained convex programming problems. Optimization, (to appear), 2011.
14. H. Li and Z. Lin. Accelerated alternating direction method of multipliers: an optimal O(1/k) nonergodic analysis. arXiv preprint arXiv:1608.06366, 2016.
15. P. L. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM J. Num. Anal., 16:964–979, 1979.
16. R.D.C. Monteiro and B.F. Svaiter. Iteration-complexity of block-decomposition algorithms and the alternating minimization augmented Lagrangian method. SIAM J. Optim., 23(1):475–507, 2013.
17. Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Doklady AN SSSR, 269:543–547, 1983. Translated as Soviet Math. Dokl.
18. Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization. Kluwer Academic Publishers, 2004.
19. R. Nishihara, L. Lessard, B. Recht, A. Packard, and M. Jordan. A general analysis of the convergence of ADMM. In ICML, Lille, France, pages 343–352, 2015.
20. Y. Ouyang, Y. Chen, G. Lan, and E. JR. Pasiliao. An accelerated linearized alternating direction method of multiplier. SIAM J. Imaging Sci., 8(1):644–681, 2015.
21. B. Recht, M. Fazel, and P.A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.
22. R. Shefi and M. Teboulle. On the rate of convergence of the proximal alternating linearized minimization algorithm for convex problems. EURO J. Comput. Optim., 4(1):27–46, 2016.
23. Q. Tran-Dinh, O. Fercoq, and V. Cevher. A smooth primal-dual optimization framework for nonsmooth composite convex minimization. SIAM J. Optim., pages 1–35, 2018.
24. Q. Tran-Dinh, C. Savorgnan, and M. Diehl. Combining Lagrangian decomposition and excessive gap smoothing technique for solving large-scale separable convex optimization problems. Compt. Optim. Appl., 55(1):75–111, 2013.
25. P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. Submitted to SIAM J. Optim., 2008.
26. H. Wang and A. Banerjee. Bregman alternating direction method of multipliers. Advances in Neural Information Processing Systems (NIPS), pages 1–18, 2013.
27. E. Wei, A. Ozdaglar, and A. Jadbabaie. A distributed Newton method for network utility maximization. IEEE Trans. Automat. Contr., 58(9):2162–2175, 2011.
28. B. E. Woodworth and N. Srebro. Tight complexity bounds for optimizing composite objectives. In Advances in Neural Information Processing Systems (NIPS), pages 3639–3647, 2016.
29. Y. Xu. Accelerated first-order primal-dual proximal methods for linearly constrained composite convex programming. SIAM J. Optim., 27(3):1459–1484, 2017.
30. H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2):301–320, 2005.