{"title": "Adaptive Primal-Dual Splitting Methods for Statistical Learning and Image Processing", "book": "Advances in Neural Information Processing Systems", "page_first": 2089, "page_last": 2097, "abstract": "The alternating direction method of multipliers (ADMM) is an important tool for solving complex optimization problems, but it involves minimization sub-steps that are often difficult to solve efficiently.   The Primal-Dual Hybrid Gradient (PDHG) method is a powerful alternative that often has simpler substeps than ADMM, thus producing lower complexity solvers. Despite the flexibility of this method, PDHG is often impractical because it requires the careful choice of multiple stepsize parameters. There is often no intuitive way to choose these parameters to maximize efficiency, or even achieve convergence.  We propose self-adaptive stepsize rules that automatically tune PDHG parameters for optimal convergence.  We rigorously analyze our methods, and identify convergence rates.  Numerical experiments show that adaptive PDHG has strong advantages over non-adaptive methods in terms of both efficiency and simplicity for the user.", "full_text": "Adaptive Primal-Dual Splitting Methods for\nStatistical Learning and Image Processing\n\nThomas Goldstein\u21e4\n\nDepartment of Computer Science\n\nUniversity of Maryland\n\nCollege Park, MD\n\nMin Li\u2020\n\nSchool of Economics and Management\n\nSoutheast University\n\nNanjing, China\n\nXiaoming Yuan\u2021\n\nDepartment of Mathematics\nHong Kong Baptist University\nKowloon Tong, Hong Kong\n\nAbstract\n\nThe alternating direction method of multipliers (ADMM) is an important tool for\nsolving complex optimization problems, but it involves minimization sub-steps\nthat are often dif\ufb01cult to solve ef\ufb01ciently. The Primal-Dual Hybrid Gradient\n(PDHG) method is a powerful alternative that often has simpler sub-steps than\nADMM, thus producing lower complexity solvers. 
Despite the \ufb02exibility of this\nmethod, PDHG is often impractical because it requires the careful choice of multi-\nple stepsize parameters. There is often no intuitive way to choose these parameters\nto maximize ef\ufb01ciency, or even achieve convergence. We propose self-adaptive\nstepsize rules that automatically tune PDHG parameters for optimal convergence.\nWe rigorously analyze our methods, and identify convergence rates. Numerical\nexperiments show that adaptive PDHG has strong advantages over non-adaptive\nmethods in terms of both ef\ufb01ciency and simplicity for the user.\n\n1\n\nIntroduction\n\nSplitting methods such as ADMM [1, 2, 3] have recently become popular for solving problems\nin distributed computing, statistical regression, and image processing. ADMM allows complex\nproblems to be broken down into sequences of simpler sub-steps, usually involving large-scale least\nsquares minimizations. However, in many cases these least squares minimizations are dif\ufb01cult to\ndirectly compute.\nIn such situations, the Primal-Dual Hybrid Gradient method (PDHG) [4, 5],\nalso called the linearized ADMM [4, 6], enables the solution of complex problems with a simpler\nsequence of sub-steps that can often be computed in closed form. This \ufb02exibility comes at a cost\n\u2013 the PDHG method requires the user to choose multiple stepsize parameters that jointly determine\nthe convergence of the method. Without having extensive analytical knowledge about the problem\nbeing solved (such as eigenvalues of linear operators), there is no intuitive way to select stepsize\nparameters to obtain fast convergence, or even guarantee convergence at all.\nIn this article we introduce and analyze self-adaptive variants of PDHG \u2013 variants that automatically\ntune stepsize parameters to attain (and guarantee) fast convergence without user input. Applying\nadaptivity to splitting methods is a dif\ufb01cult problem. 
It is known that naive adaptive variants of ADMM are non-convergent; however, recent results prove convergence when specific mathematical requirements are enforced on the stepsizes [7]. Despite this progress, the requirements for convergence of adaptive PDHG have gone unexplored. This is surprising, given that stepsize selection is a much bigger issue for PDHG than for ADMM because it requires multiple stepsize parameters.

*tomg@cs.umd.edu
†limin@seu.edu.cn
‡xmyuan@hkbu.edu.hk

The contributions of this paper are as follows. First, we describe applications of PDHG and its advantages over ADMM. We then introduce a new adaptive variant of PDHG. The new algorithm not only tunes parameters for fast convergence, but contains a line search that guarantees convergence when stepsize restrictions are unknown to the user. We analyze the convergence of adaptive PDHG, and rigorously prove convergence rate guarantees. Finally, we use numerical experiments to show the advantages of adaptivity on both convergence speed and ease of use.

2 The Primal-Dual Hybrid Gradient Method

The PDHG scheme has its roots in the Arrow-Hurwicz method, which was studied by Popov [8]. Research in this direction was reinvigorated by the introduction of PDHG, which converges rapidly for a wider range of stepsizes than Arrow-Hurwicz. PDHG was first presented in [9] and analyzed for convergence in [4, 5]. It was later studied extensively for image segmentation [10]. An extensive technical study of the method and its variants is given by He and Yuan [11]. Several extensions of PDHG, including simplified iterations for the case that f or g is differentiable, are presented by Condat [12]. Several authors have also derived PDHG as a preconditioned form of ADMM [4, 6].
PDHG solves saddle-point problems of the form

    min_{x∈X} max_{y∈Y}  f(x) + y^T Ax − g(y)    (1)

for convex f and g.
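To make the saddle-point template (1) concrete, here is a small self-contained sketch of a primal-dual hybrid gradient iteration (forward gradient steps on the bilinear term, proximal steps on f and g) applied to a toy instance with f(x) = ½‖x − b‖² and g(y) = ½‖y‖², for which both proximal steps have closed forms. The toy data and function names are ours, purely for illustration:

```python
import numpy as np

# Toy instance of (1): f(x) = 0.5||x - b||^2, g(y) = 0.5||y||^2, which is
# equivalent to the smooth problem min_x 0.5||x - b||^2 + 0.5||Ax||^2.
A = np.array([[1.0, 2.0], [0.0, 1.0], [1.0, -1.0]])
b = np.array([1.0, 2.0])

def prox_f(v, tau):
    # argmin_x 0.5||x - b||^2 + ||x - v||^2 / (2*tau)
    return (v + tau * b) / (1.0 + tau)

def prox_g(v, sigma):
    # argmin_y 0.5||y||^2 + ||y - v||^2 / (2*sigma)
    return v / (1.0 + sigma)

L = np.linalg.norm(A, 2) ** 2      # rho(A^T A)
tau = sigma = 0.9 / np.sqrt(L)     # so that tau * sigma < 1 / rho(A^T A)

x, y = np.zeros(2), np.zeros(3)
for _ in range(2000):
    x_new = prox_f(x - tau * (A.T @ y), tau)              # descent in x, prox of f
    y = prox_g(y + sigma * (A @ (2 * x_new - x)), sigma)  # ascent in y, prox of g
    x = x_new

x_star = np.linalg.solve(np.eye(2) + A.T @ A, b)  # closed-form minimizer, for comparison
```

At a fixed point of this iteration, x solves (I + AᵀA)x = b and y = Ax, so the loop can be checked against the direct solve.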
We will see later that an incredibly wide range of problems can be cast as (1). The steps of PDHG are given by

    x̂^{k+1} = x^k − τ_k A^T y^k                                   (2)
    x^{k+1} = argmin_{x∈X}  f(x) + (1/(2τ_k)) ‖x − x̂^{k+1}‖²      (3)
    ŷ^{k+1} = y^k + σ_k A(2x^{k+1} − x^k)                          (4)
    y^{k+1} = argmin_{y∈Y}  g(y) + (1/(2σ_k)) ‖y − ŷ^{k+1}‖²      (5)

where {τ_k} and {σ_k} are stepsize parameters. Steps (2) and (3) of the method update x, decreasing the energy (1) by first taking a gradient descent step with respect to the inner product term in (1) and then taking a "backward" or proximal step involving f. In steps (4) and (5), the energy (1) is increased by first marching up the gradient of the inner product term with respect to y, and then a backward step is taken with respect to g.
PDHG has been analyzed in the case of constant stepsizes, τ_k = τ and σ_k = σ. In particular, it is known to converge as long as στ < 1/ρ(A^T A) [4, 5, 11]. However, PDHG typically does not converge when non-constant stepsizes are used, even in the case that σ_k τ_k < 1/ρ(A^T A) [13]. Furthermore, it is unclear how to select stepsizes when the spectral properties of A are unknown. In this article, we identify the specific stepsize conditions that guarantee convergence in the presence of adaptivity, and propose a backtracking scheme that can be used when the spectral radius of A is unknown.

3 Applications

Linear Inverse Problems  Many inverse problems and statistical regressions have the form

    minimize  h(Sx) + f(Ax − b)    (6)

where f (the data term) is some convex function, h is a (convex) regularizer (such as the ℓ₁-norm), A and S are linear operators, and b is a vector of data. Recently, the alternating direction method
of multipliers (ADMM) has become a popular method for solving such problems. The ADMM relies on the change of variables y = Sx, and generates the following sequence of iterates for some stepsize τ, where λ^k is a Lagrange multiplier:

    x^{k+1} = argmin_x  f(Ax − b) + (Sx − y^k)^T λ^k + (τ/2) ‖Sx − y^k‖²
    y^{k+1} = argmin_y  h(y) + (Sx^{k+1} − y)^T λ^k + (τ/2) ‖Sx^{k+1} − y‖²    (7)
    λ^{k+1} = λ^k + τ(Sx^{k+1} − y^{k+1}).

The x-update in (7) requires the solution of a (potentially large) least-squares problem involving both A and S. Common formulations such as the consensus ADMM [14] solve these large sub-problems with direct matrix factorizations; however, this is often impractical when either the data matrices are extremely large or fast transforms (such as FFT, DCT, or Hadamard) cannot be used.
The problem (6) can be put into the form (1) using the Fenchel conjugate of the convex function h, denoted h*, which satisfies the important identity

    h(z) = max_y  y^T z − h*(y)

for all z in the domain of h. Replacing h in (6) with this expression involving its conjugate yields

    min_x max_y  f(Ax − b) + y^T Sx − h*(y),

which is of the form (1). The forward (gradient) steps of PDHG handle the matrix A explicitly, allowing linear inverse problems to be solved without any difficult least-squares sub-steps. We will see several examples of this below.

Scaled Lasso  The square-root lasso [15] or scaled lasso [16] is a variable selection regression that obtains sparse solutions to systems of linear equations. Scaled lasso has several advantages over classical lasso – it is more robust to noise and it enables setting penalty parameters without cross validation [15, 16]. Given a data matrix D and a vector b, the scaled lasso finds a sparse solution to the system Dx = b by solving

    min_x  μ‖x‖₁ + ‖Dx − b‖₂    (8)

for some scaling parameter μ. Note the ℓ₂ term in (8) is not squared as in classical lasso.
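Both terms in (8) are norms, so the dual variables that arise for this problem are constrained to norm balls (an ℓ∞ ball of radius μ and the unit ℓ₂ ball), and the corresponding PDHG proximal steps reduce to Euclidean projections with closed forms. A minimal NumPy sketch (function names are ours, not from the paper):

```python
import numpy as np

def project_linf_ball(y, mu):
    # Euclidean projection onto {y : ||y||_inf <= mu}: clip each coordinate.
    return np.clip(y, -mu, mu)

def project_l2_ball(y, radius=1.0):
    # Euclidean projection onto {y : ||y||_2 <= radius}: rescale if outside.
    norm = np.linalg.norm(y)
    return y if norm <= radius else (radius / norm) * y

y1 = project_linf_ball(np.array([3.0, -0.2, 1.5]), mu=1.0)  # -> [1.0, -0.2, 1.0]
y2 = project_l2_ball(np.array([3.0, 4.0]))                  # -> [0.6, 0.8]
```

Each projection costs O(n), which is why the dual sub-steps of PDHG are cheap for this problem.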
If we write

    μ‖x‖₁ = max_{‖y₁‖_∞ ≤ μ}  y₁^T x,    and    ‖Dx − b‖₂ = max_{‖y₂‖₂ ≤ 1}  y₂^T (Dx − b),

we can put (8) in the form (1):

    min_x  max_{‖y₁‖_∞ ≤ μ, ‖y₂‖₂ ≤ 1}  y₁^T x + y₂^T (Dx − b).    (9)

Unlike ADMM, PDHG does not require the solution of least-squares problems involving D.

Total-Variation Minimization  Total variation [17] is commonly used to solve problems of the form

    min_x  μ‖∇x‖₁ + ½‖Ax − f‖²    (10)

where x is a 2D array (image), ∇ is the discrete gradient operator, A is a linear operator, and f contains data. If we add a dual variable y and write μ‖∇x‖₁ = max_{‖y‖_∞ ≤ μ} y^T ∇x, we obtain

    min_x  max_{‖y‖_∞ ≤ μ}  ½‖Ax − f‖² + y^T ∇x    (11)

which is clearly of the form (1).
The PDHG solver using formulation (11) avoids the inversion of the gradient operator that is required by ADMM. This is useful in many applications. For example, in compressive sensing the matrix A may be a sub-sampled orthogonal Hadamard [18], wavelet, or Fourier transform [19, 20]. In this case, the proximal sub-steps of PDHG are solvable in closed form using fast transforms because they do not involve the gradient operator ∇. The sub-steps of ADMM involve both the gradient operator and the matrix A simultaneously, and thus require inner loops with expensive iterative solvers.

4 Adaptive Formulation

The convergence of PDHG can be measured by the size of the residuals, or gradients of (1) with respect to the primal and dual variables x and y. These primal and dual gradients are simply

    p^{k+1} = ∂f(x^{k+1}) + A^T y^{k+1},    and    d^{k+1} = ∂g(y^{k+1}) − Ax^{k+1}    (12)

where ∂f and ∂g denote the sub-differential of f and g. The sub-differential can be directly evaluated from the sequence of PDHG iterates using the optimality condition for (3): 0 ∈ ∂f(x^{k+1}) + (1/τ_k)(x^{k+1} − x̂^{k+1}). Rearranging this yields (1/τ_k)(x̂^{k+1} − x^{k+1}) ∈ ∂f(x^{k+1}).
The same method can be applied to (5) to obtain ∂g(y^{k+1}). Applying these results to (12) yields the closed form residuals

    p^{k+1} = (1/τ_k)(x^k − x^{k+1}) − A^T (y^k − y^{k+1}),
    d^{k+1} = (1/σ_k)(y^k − y^{k+1}) − A(x^k − x^{k+1}).    (13)

When choosing the stepsize for PDHG, there is a tradeoff between the primal and dual residuals. Choosing a large τ_k and a small σ_k drives down the primal residuals at the cost of large dual residuals. Choosing a small τ_k and large σ_k results in small dual residuals but large primal errors. One would like to choose stepsizes so that the larger of p^{k+1} and d^{k+1} is as small as possible. If we assume the residuals on step k+1 change monotonically with τ_k, then max{p^{k+1}, d^{k+1}} is minimized when p^{k+1} = d^{k+1}. This suggests that we tune τ_k to "balance" the primal and dual residuals.
To achieve residual balancing, we first select a parameter α₀ < 1 that controls the aggressiveness of adaptivity. On each iteration, we check whether the primal residual is at least twice the dual. If so, we increase the primal stepsize to τ_{k+1} = τ_k/(1 − α_k) and decrease the dual to σ_{k+1} = σ_k(1 − α_k). If the dual residual is at least twice the primal, we do the opposite. When we modify the stepsizes, we shrink the adaptivity level to α_{k+1} = ηα_k, for η ∈ (0, 1). We will see in Section 5 that this adaptivity level decay is necessary to guarantee convergence. In our implementation we use α₀ = η = 0.95.
In addition to residual balancing, we check the following backtracking condition after each iteration:

    (c/(2τ_k)) ‖x^{k+1} − x^k‖² − 2(y^{k+1} − y^k)^T A(x^{k+1} − x^k) + (c/(2σ_k)) ‖y^{k+1} − y^k‖² > 0    (14)

where c ∈ (0, 1) is a constant (we use c = 0.9 in our experiments). If condition (14) fails, then we shrink τ_k and σ_k before the next iteration.
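The two rules just described (residual balancing and the backtracking test (14)) can be sketched as follows; this is an illustrative rendering in NumPy, not the authors' reference implementation:

```python
import numpy as np

def adapt_stepsizes(p_norm, d_norm, tau, sigma, alpha, eta=0.95):
    # Residual balancing: grow the stepsize associated with whichever
    # residual is at least twice the other, and decay alpha by eta.
    if p_norm > 2.0 * d_norm:          # primal residual dominates
        return tau / (1 - alpha), sigma * (1 - alpha), alpha * eta
    if 2.0 * p_norm < d_norm:          # dual residual dominates
        return tau * (1 - alpha), sigma / (1 - alpha), alpha * eta
    return tau, sigma, alpha           # balanced: no update

def backtrack_ok(dx, dy, A, tau, sigma, c=0.9):
    # Condition (14) on the iterate differences dx = x^{k+1} - x^k and
    # dy = y^{k+1} - y^k; if it fails, both stepsizes are shrunk (e.g. halved).
    lhs = (c / (2 * tau)) * dx @ dx - 2 * dy @ (A @ dx) + (c / (2 * sigma)) * dy @ dy
    return lhs > 0
```

For example, `adapt_stepsizes(10.0, 1.0, 1.0, 1.0, 0.5)` returns `(2.0, 0.5, 0.475)`: the primal residual dominates, so τ grows, σ shrinks, and the adaptivity level decays.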
We will see in Section 5 that the backtracking condition (14) is sufficient to guarantee convergence. The complete scheme is listed in Algorithm 1.

Algorithm 1 Adaptive PDHG
1: Choose x⁰, y⁰, large τ₀ and σ₀, and set α₀ = η = 0.95.
2: while ‖p^k‖, ‖d^k‖ > tolerance do
3:   Compute (x^{k+1}, y^{k+1}) from (x^k, y^k) using the PDHG updates (2–5)
4:   Check the backtracking condition (14) and if it fails set τ_k ← τ_k/2, σ_k ← σ_k/2
5:   Compute the residuals (13), and use them for the following two adaptive updates
6:   If 2‖p^{k+1}‖ < ‖d^{k+1}‖, then set τ_{k+1} = τ_k(1 − α_k), σ_{k+1} = σ_k/(1 − α_k), and α_{k+1} = α_k η
7:   If ‖p^{k+1}‖ > 2‖d^{k+1}‖, then set τ_{k+1} = τ_k/(1 − α_k), σ_{k+1} = σ_k(1 − α_k), and α_{k+1} = α_k η
8:   If no adaptive updates were triggered, then τ_{k+1} = τ_k, σ_{k+1} = σ_k, and α_{k+1} = α_k
9: end while

5 Convergence Theory

In this section, we analyze Algorithm 1 and its rate of convergence. In our analysis, we consider adaptive variants of PDHG that satisfy the following assumptions. We will see later that these assumptions guarantee convergence of PDHG with rate O(1/k).

Algorithm 1 trivially satisfies Assumption A. The sequence {φ_k} measures the adaptive aggressiveness on iteration k, and serves the same role as α_k in Algorithm 1. The geometric decay of α_k ensures that Assumption B holds.
The backtracking rule explicitly guarantees Assumption C.

Assumptions for Adaptive PDHG
A  The sequences {τ_k} and {σ_k} are positive and bounded.
B  The sequence {φ_k} is summable, where φ_k = max{ (τ_k − τ_{k+1})/τ_k, (σ_k − σ_{k+1})/σ_k, 0 }.
C  Either X or Y is bounded, and there is a constant c ∈ (0, 1) such that for all k > 0
       (c/(2τ_k)) ‖x^{k+1} − x^k‖² − 2(y^{k+1} − y^k)^T A(x^{k+1} − x^k) + (c/(2σ_k)) ‖y^{k+1} − y^k‖² > 0.

5.1 Variational Inequality Formulation

For notational simplicity, we define the composite vector u^k = (x^k, y^k) and the matrices

    M_k = [ (1/τ_k)I   −A^T ;  −A   (1/σ_k)I ],    H_k = [ (1/τ_k)I   0 ;  0   (1/σ_k)I ],    and    Q(u) = ( A^T y ;  −Ax ).    (15)

This notation allows us to formulate the optimality conditions for (1) as a variational inequality (VI). If u* = (x*, y*) is a solution to (1), then x* is a minimizer of (1). More formally,

    f(x) − f(x*) + (x − x*)^T A^T y* ≥ 0    ∀ x ∈ X.    (16)

Likewise, (1) is maximized by y*, and so

    −g(y) + g(y*) + (y − y*)^T Ax* ≤ 0    ∀ y ∈ Y.    (17)

Subtracting (17) from (16) and letting h(u) = f(x) + g(y) yields the VI formulation

    h(u) − h(u*) + (u − u*)^T Q(u*) ≥ 0    ∀ u ∈ Ω,    (18)

where Ω = X × Y. We say ũ is an approximate solution to (1) with VI accuracy ε if

    h(u) − h(ũ) + (u − ũ)^T Q(ũ) ≥ −ε    ∀ u ∈ B₁(ũ) ∩ Ω,    (19)

where B₁(ũ) is a unit ball centered at ũ. In Theorem 1, we prove O(1/k) ergodic convergence of adaptive PDHG using the VI notion of convergence.

5.2 Preliminary Results

We now prove several results about the PDHG iterates that are needed to obtain a convergence rate.
Lemma 1. The iterates generated by PDHG (2–5) satisfy

    ‖u^k − u*‖²_{M_k} ≥ ‖u^{k+1} − u^k‖²_{M_k} + ‖u^{k+1} − u*‖²_{M_k}.

The proof of this lemma follows standard techniques, and is presented in the supplementary material. This next lemma bounds iterates generated by PDHG.
Lemma 2.
Suppose the stepsizes for PDHG satisfy Assumptions A, B and C. Then

    ‖u^k − u*‖²_{H_k} ≤ C_U

for some upper bound C_U > 0.
The proof of this lemma is given in the supplementary material.
Lemma 3. Under Assumptions A, B, and C, we have

    Σ_{k=1}^{n} ( ‖u^k − u‖²_{M_k} − ‖u^k − u‖²_{M_{k−1}} ) ≤ 2C_φ C_U + 2C_φ C_H ‖u − u*‖²,

where C_φ = Σ_{k=0}^{∞} φ_k and C_H is a constant such that ‖u − u*‖²_{H_k} ≤ C_H ‖u − u*‖².

Proof. Using the definition of M_k we obtain

    Σ_{k=1}^{n} ( ‖u^k − u‖²_{M_k} − ‖u^k − u‖²_{M_{k−1}} )
      = Σ_{k=1}^{n} [ (1/τ_k − 1/τ_{k−1}) ‖x^k − x‖² + (1/σ_k − 1/σ_{k−1}) ‖y^k − y‖² ]    (20)
      ≤ Σ_{k=1}^{n} φ_{k−1} ‖u^k − u‖²_{H_k}
      ≤ 2 Σ_{k=1}^{n} φ_{k−1} ( ‖u^k − u*‖²_{H_k} + ‖u − u*‖²_{H_k} )
      ≤ 2 Σ_{k=1}^{n} φ_{k−1} ( C_U + C_H ‖u − u*‖² )
      ≤ 2C_φ C_U + 2C_φ C_H ‖u − u*‖²,

where we have used the bound ‖u^k − u*‖²_{H_k} ≤ C_U from Lemma 2 and C_φ = Σ_{k=0}^{∞} φ_k. □

This final lemma provides a VI interpretation of the PDHG iteration.
Lemma 4. The iterates u^k = (x^k, y^k) generated by PDHG satisfy

    h(u) − h(u^{k+1}) + (u − u^{k+1})^T [ Q(u^{k+1}) + M_k(u^{k+1} − u^k) ] ≥ 0    ∀ u ∈ Ω.    (21)

Proof. Let u^k = (x^k, y^k) be a pair of PDHG iterates. The minimizers in (3) and (5) of PDHG satisfy the following for all x ∈ X:

    f(x) − f(x^{k+1}) + (x − x^{k+1})^T [ A^T y^{k+1} − A^T (y^{k+1} − y^k) + (1/τ_k)(x^{k+1} − x^k) ] ≥ 0,    (22)

and also for all y ∈ Y:

    g(y) − g(y^{k+1}) + (y − y^{k+1})^T [ −Ax^{k+1} − A(x^{k+1} − x^k) + (1/σ_k)(y^{k+1} − y^k) ] ≥ 0.    (23)

Adding these two inequalities and using the notation (15) yields the result. □

5.3 Convergence Rate

We now combine the above lemmas into our final convergence result.
Theorem 1. Suppose that the stepsizes in PDHG satisfy Assumptions A, B, and C.
Consider the sequence defined by

    ũ_t = (1/t) Σ_{k=1}^{t} u^k.

This sequence satisfies the convergence bound

    h(u) − h(ũ_t) + (u − ũ_t)^T Q(ũ_t) ≥ [ ‖u − u^t‖²_{M_t} − ‖u − u^0‖²_{M_0} − 2C_φ C_U − 2C_φ C_H ‖u − u*‖² ] / (2t).

Thus ũ_t converges to a solution of (1) with rate O(1/k) in the VI sense (19).

Proof. We begin with the following identity (a special case of the polar identity for vector spaces):

    (u − u^{k+1})^T M_k (u^k − u^{k+1}) = ½( ‖u − u^{k+1}‖²_{M_k} − ‖u − u^k‖²_{M_k} ) + ½ ‖u^k − u^{k+1}‖²_{M_k}.

We apply this to the VI formulation of the PDHG iteration (21) to get

    h(u) − h(u^{k+1}) + (u − u^{k+1})^T Q(u^{k+1}) ≥ ½( ‖u − u^{k+1}‖²_{M_k} − ‖u − u^k‖²_{M_k} ) + ½ ‖u^k − u^{k+1}‖²_{M_k}.    (24)

Note that

    (u − u^{k+1})^T Q(u − u^{k+1}) = (x − x^{k+1})^T A^T (y − y^{k+1}) − (y − y^{k+1})^T A(x − x^{k+1}) = 0,    (25)

and so (u − u^{k+1})^T Q(u) = (u − u^{k+1})^T Q(u^{k+1}). Also, Assumption C guarantees that ‖u^k − u^{k+1}‖²_{M_k} ≥ 0. These observations reduce (24) to

    h(u) − h(u^{k+1}) + (u − u^{k+1})^T Q(u) ≥ ½( ‖u − u^{k+1}‖²_{M_k} − ‖u − u^k‖²_{M_k} ).    (26)

We now sum (26) for k = 0 to t − 1, and invoke Lemma 3:

    2 Σ_{k=0}^{t−1} [ h(u) − h(u^{k+1}) + (u − u^{k+1})^T Q(u) ]
      ≥ ‖u − u^t‖²_{M_t} − ‖u − u^0‖²_{M_0} + Σ_{k=1}^{t} ( ‖u − u^k‖²_{M_{k−1}} − ‖u − u^k‖²_{M_k} )
      ≥ ‖u − u^t‖²_{M_t} − ‖u − u^0‖²_{M_0} − 2C_φ C_U − 2C_φ C_H ‖u − u*‖².    (27)

Because h is convex,

    Σ_{k=1}^{t} h(u^k) ≥ t h( (1/t) Σ_{k=1}^{t} u^k ) = t h(ũ_t),

so the left side of (27) satisfies

    2t [ h(u) − h(ũ_t) + (u − ũ_t)^T Q(u) ] ≥ 2 Σ_{k=0}^{t−1} [ h(u) − h(u^{k+1}) + (u − u^{k+1})^T Q(u) ].    (28)

Combining (27) and (28) yields the bound

    h(u) − h(ũ_t) + (u − ũ_t)^T Q(u) ≥ [ ‖u − u^t‖²_{M_t} − ‖u − u^0‖²_{M_0} − 2C_φ C_U − 2C_φ C_H ‖u − u*‖² ] / (2t).

Applying (19) proves the theorem. □

6 Numerical Results

We apply the original and adaptive PDHG to the test problems described in Section 3. We terminate the algorithms when both the primal and dual residual norms (i.e., ‖p^k‖ and ‖d^k‖) are smaller than 0.05.
We consider four variants of PDHG. The method "Adapt: Backtrack" denotes adaptive PDHG with backtracking. The method "Adapt: τσ = L" refers to the adaptive method without backtracking with τ₀ = σ₀ = 0.95 ρ(A^T A)^{−1/2}.
We also consider the non-adaptive PDHG with two different stepsize choices. The method "Const: τ, σ = √L" refers to the constant-stepsize method with both stepsize parameters equal to √L = ρ(A^T A)^{−1/2}. The method "Const: τ-final" refers to the constant-stepsize method, where the stepsizes are chosen to be the final values of the stepsizes used by "Adapt: τσ = L." This final method is meant to demonstrate the performance of PDHG with a stepsize that is customized to the problem at hand, but still non-adaptive. The specifics of each test problem are described below:

Figure 1: (left) Convergence curves for the TV denoising experiment with μ = 0.05.
The y-axis displays the difference between the objective (10) at the kth iterate and the optimal objective value. (right) Stepsize sequences, {τ_k}, for both adaptive schemes.

Table 1: Iteration counts for each problem, with runtime (sec) in parentheses.

    Problem              Adapt: Backtrack   Adapt: τσ = L   Const: τ, σ = √L   Const: τ-final
    Scaled Lasso (50%)   212 (0.33)         240 (0.38)      342 (0.60)         156 (0.27)
    Scaled Lasso (20%)   349 (0.22)         330 (0.21)      437 (0.25)         197 (0.11)
    Scaled Lasso (10%)   360 (0.21)         322 (0.18)      527 (0.28)         277 (0.15)
    TV, μ = .25          16 (0.0475)        16 (0.041)      78 (0.184)         48 (0.121)
    TV, μ = .05          50 (0.122)         51 (0.122)      281 (0.669)        97 (0.228)
    TV, μ = .01          109 (0.262)        122 (0.288)     927 (2.17)         152 (0.369)
    Compressive (20%)    163 (4.08)         168 (4.12)      501 (12.54)        246 (6.03)
    Compressive (10%)    244 (5.63)         274 (6.21)      908 (20.6)         437 (9.94)
    Compressive (5%)     382 (9.54)         438 (10.7)      1505 (34.2)        435 (9.95)

Scaled Lasso  We test our methods on (8) using the synthetic problem suggested in [21]. The test problem recovers a 1000-dimensional vector with 10 nonzero components using a Gaussian matrix.
Total Variation Minimization  We apply the model (10) with A = I to the "Cameraman" image. The image is scaled to the range [0, 255] and contaminated with noise of standard deviation 10. The image is denoised with μ = 0.25, 0.05, and 0.01. See Table 1 for time trial results. Note the similar performance of Algorithm 1 with and without backtracking, indicating that there is no advantage to knowing the constant L = ρ(A^T A)^{−1}. We plot convergence curves and show the evolution of τ_k in Figure 1. Note that τ_k is large for the first several iterates and then decays over time.
Compressed Sensing  We reconstruct a Shepp-Logan phantom from sub-sampled Hadamard measurements.
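A sub-sampled orthonormal Hadamard measurement operator of this kind can be sketched as follows (NumPy, with a small 256-sample stand-in for the full image; the sampling rate, seed, and signal are illustrative placeholders):

```python
import numpy as np

# Build an orthonormal Hadamard matrix by the Sylvester recursion.
H = np.array([[1.0]])
for _ in range(8):                      # 2^8 = 256 samples
    H = np.block([[H, H], [H, -H]])
H /= np.sqrt(H.shape[0])                # now H @ H.T = I

rng = np.random.default_rng(0)
rows = rng.choice(H.shape[0], size=H.shape[0] // 10, replace=False)  # keep ~10%

x = rng.standard_normal(H.shape[0])     # stand-in for the flattened phantom
b = H[rows] @ x                         # sub-sampled Hadamard measurements
# Because the rows of H are orthonormal, the measurement operator M = H[rows]
# satisfies M @ M.T = I, which keeps the PDHG proximal sub-steps closed-form.
```

In practice the dense matrix would be replaced by a fast Hadamard transform; the dense version above is only for illustrating the sampling structure.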
Data is generated by applying the Hadamard transform to a 256 × 256 discretization of the Shepp-Logan phantom, and then sampling 5%, 10%, and 20% of the coefficients at random.

7 Discussion and Conclusion

Several interesting observations can be made from the results in Table 1. First, both the backtracking ("Adapt: Backtrack") and non-backtracking ("Adapt: τσ = L") methods have similar performance on average for the imaging problems, with neither algorithm showing consistently better performance. Thus there is no cost to using backtracking instead of knowing the ideal stepsize ρ(A^T A). Finally, the method "Const: τ-final" (using non-adaptive, "optimized" stepsizes) did not always outperform the constant, non-optimized stepsizes. This occurs because the true "best" stepsize choice depends on the active set of the problem and the structure of the remaining error, and thus evolves over time. This is depicted in Figure 1, which shows the time dependence of τ_k. This shows that adaptive methods can achieve superior performance by evolving the stepsize over time.

8 Acknowledgments

This work was supported by the National Science Foundation (#1535902), the Office of Naval Research (#N00014-15-1-2676), and the Hong Kong Research Grants Council's General Research Fund (HKBU 12300515). The second author was supported in part by the Program for New Century Excellent University Talents under Grant No. NCET-12-0111, and the Qing Lan Project.

References
[1] R. Glowinski and A. Marroco. Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires. Rev. Française d'Automat. Inf. Recherche Opérationnelle, 9(2):41–76, 1975.
[2] Roland Glowinski and Patrick Le Tallec.
Augmented Lagrangian and Operator-Splitting Methods in Nonlinear Mechanics. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1989.
[3] Tom Goldstein and Stanley Osher. The Split Bregman method for ℓ1 regularized problems. SIAM J. Img. Sci., 2(2):323–343, April 2009.
[4] Ernie Esser, Xiaoqun Zhang, and Tony F. Chan. A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science. SIAM Journal on Imaging Sciences, 3(4):1015–1046, 2010.
[5] Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.
[6] Yuyuan Ouyang, Yunmei Chen, Guanghui Lan, and Eduardo Pasiliao Jr. An accelerated linearized alternating direction method of multipliers. arXiv preprint arXiv:1401.6607, 2014.
[7] B. He, H. Yang, and S.L. Wang. Alternating direction method with self-adaptive penalty parameters for monotone variational inequalities. Journal of Optimization Theory and Applications, 106(2):337–356, 2000.
[8] L.D. Popov. A modification of the Arrow-Hurwicz method for search of saddle points. Mathematical Notes of the Academy of Sciences of the USSR, 28:845–848, 1980.
[9] Mingqiang Zhu and Tony Chan. An efficient primal-dual hybrid gradient algorithm for total variation image restoration. UCLA CAM technical report, 08-34, 2008.
[10] T. Pock, D. Cremers, H. Bischof, and A. Chambolle. An algorithm for minimizing the Mumford-Shah functional. In Computer Vision, 2009 IEEE 12th International Conference on, pages 1133–1140, 2009.
[11] Bingsheng He and Xiaoming Yuan. Convergence analysis of primal-dual algorithms for a saddle-point problem: From contraction perspective. SIAM J. Img. Sci., 5(1):119–149, January 2012.
[12] Laurent Condat.
A primal-dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms. Journal of Optimization Theory and Applications, 158(2):460–479, 2013.
[13] Silvia Bonettini and Valeria Ruggiero. On the convergence of primal–dual hybrid gradient algorithms for total variation image restoration. Journal of Mathematical Imaging and Vision, 44(3):236–253, 2012.
[14] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends in Machine Learning, 2010.
[15] A. Belloni, Victor Chernozhukov, and L. Wang. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806, 2011.
[16] Tingni Sun and Cun-Hui Zhang. Scaled sparse linear regression. Biometrika, 99(4):879–898, 2012.
[17] L. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D, 60:259–268, 1992.
[18] Tom Goldstein, Lina Xu, Kevin Kelly, and Richard Baraniuk. The STONE transform: Multi-resolution image enhancement and real-time compressive video. Preprint available at arXiv.org (arXiv:1311.3405), 2013.
[19] M. Lustig, D. Donoho, and J. Pauly. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magnetic Resonance in Medicine, 58:1182–1195, 2007.
[20] Xiaoqun Zhang and J. Froment. Total variation based Fourier reconstruction and regularization for computer tomography. In Nuclear Science Symposium Conference Record, 2005 IEEE, volume 4, pages 2332–2336, Oct 2005.
[21] Robert Tibshirani. Regression shrinkage and selection via the lasso.
Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.", "award": [], "sourceid": 1257, "authors": [{"given_name": "Tom", "family_name": "Goldstein", "institution": "University of Maryland"}, {"given_name": "Min", "family_name": "Li", "institution": "Southeast University"}, {"given_name": "Xiaoming", "family_name": "Yuan", "institution": "Hong Kong Baptist University"}]}