{"title": "A quasi-Newton proximal splitting method", "book": "Advances in Neural Information Processing Systems", "page_first": 2618, "page_last": 2626, "abstract": "We describe efficient implementations of the proximity calculation for a useful class of functions; the implementations exploit the piece-wise linear nature of the dual problem. The second part of the paper applies the previous result to acceleration of convex minimization problems, and leads to an elegant quasi-Newton method. The optimization method compares favorably against state-of-the-art alternatives. The algorithm has extensive applications including signal processing, sparse regression and recovery, and machine learning and classification.", "full_text": "A quasi-Newton proximal splitting method\n\nS. Becker\u2217\n\nM.J. Fadili\u2020\n\nAbstract\n\nA new result in convex analysis on the calculation of proximity operators in cer-\ntain scaled norms is derived. We describe ef\ufb01cient implementations of the prox-\nimity calculation for a useful class of functions; the implementations exploit the\npiece-wise linear nature of the dual problem. The second part of the paper applies\nthe previous result to acceleration of convex minimization problems, and leads\nto an elegant quasi-Newton method. The optimization method compares favor-\nably against state-of-the-art alternatives. The algorithm has extensive applications\nincluding signal processing, sparse recovery and machine learning and classi\ufb01ca-\ntion.\n\n1\n\nIntroduction\n\nConvex optimization has proved to be extremely useful to all quantitative disciplines of science. A\ncommon trend in modern science is the increase in size of datasets, which drives the need for more\nef\ufb01cient optimization schemes. For large-scale unconstrained smooth convex problems, two classes\nof methods have seen the most success:\nlimited memory quasi-Newton methods and non-linear\nconjugate gradient (CG) methods. 
Both of these methods generally outperform simpler methods, such as gradient descent.
For problems with non-smooth terms and/or constraints, gradient descent generalizes to proximal gradient descent (which includes projected gradient descent as a sub-case), which is just the application of the forward-backward algorithm [1].
Unlike gradient descent, it is not easy to adapt quasi-Newton and CG methods to problems involving constraints and non-smooth terms. Much has been written on the topic, and approaches generally follow an active-set methodology. In the limit, as the active set is correctly identified, the methods behave similarly to their unconstrained counterparts. These methods have seen success, but are not as efficient or as elegant as the unconstrained versions. In particular, a sub-problem on the active set must be solved, and the accuracy of this sub-iteration must be tuned with heuristics in order to obtain competitive results.

1.1 Problem statement

Let H = (R^N, ⟨·,·⟩) be R^N equipped with the usual Euclidean scalar product ⟨x, y⟩ = Σ_{i=1}^N x_i y_i and associated norm ‖x‖ = √⟨x, x⟩. For a matrix V ∈ R^{N×N} in the cone of symmetric positive-definite matrices S++(N), we define H_V = (R^N, ⟨·,·⟩_V) with the scalar product ⟨x, y⟩_V = ⟨x, V y⟩ and norm ‖x‖_V corresponding to the metric induced by V. The dual space of H_V, under ⟨·,·⟩, is H_{V^{-1}}. We denote by I_H the identity operator on H.
A real-valued function f : H → R ∪ {+∞} is (0-)coercive if lim_{‖x‖→+∞} f(x) = +∞. The domain of f is defined by dom f = {x ∈ H : f(x) < +∞}, and f is proper if dom f ≠ ∅. 
We say that a real-valued function f is lower semi-continuous (lsc) if lim inf_{x→x₀} f(x) ≥ f(x₀). The class of all proper lsc convex functions from H to R ∪ {+∞} is denoted by Γ₀(H). The conjugate, or Legendre-Fenchel transform, of f on H is denoted f*.

* LJLL, CNRS-UPMC, Paris, France (stephen.becker@upmc.fr).
† GREYC, CNRS-ENSICAEN-Univ. of Caen, Caen, France (Jalal.Fadili@greyc.ensicaen.fr).

Our goal is the generic minimization of functions of the form

    min_{x∈H} { F(x) := f(x) + h(x) } ,   (P)

where f, h ∈ Γ₀(H). We also assume that the set of minimizers is nonempty (e.g. F is coercive) and that a standard domain qualification holds. We take f ∈ C¹(R^N) with L-Lipschitz continuous gradient, and we assume h is separable. We write x⋆ to denote an element of Argmin F(x).
The class we consider covers non-smooth convex optimization problems, including those with convex constraints. Here are some examples from regression, machine learning and classification.

Example 1 (LASSO).

    min_{x∈H} (1/2)‖Ax − b‖₂² + λ‖x‖₁ .   (1)

Example 2 (Non-negative least-squares (NNLS)).

    min_{x∈H} (1/2)‖Ax − b‖₂²   subject to x ≥ 0 .   (2)

Example 3 (Sparse Support Vector Machines). 
One would like to find a linear decision function which minimizes the objective

    min_{x∈H} (1/m) Σ_{i=1}^m L(⟨x, z_i⟩ + b, y_i) + λ‖x‖₁ ,   (3)

where for i = 1, ..., m, (z_i, y_i) ∈ R^N × {±1} is the training set, and L is a smooth loss function with Lipschitz-continuous gradient, such as the squared hinge loss L(ŷ_i, y_i) = max(0, 1 − ŷ_i y_i)² or the logistic loss L(ŷ_i, y_i) = log(1 + e^{−ŷ_i y_i}).

1.2 Contributions

This paper introduces a class of scaled norms for which we can compute a proximity operator; these results are themselves significant, for previous results only cover diagonal scaling (and the diagonal scaling result is trivial). Then, motivated by the discrepancy between constrained and unconstrained performance, we define a class of limited-memory quasi-Newton methods to solve (P) that extends naturally and elegantly from the unconstrained to the constrained case. Most well-known quasi-Newton methods for constrained problems, such as L-BFGS-B [2], are only applicable to box constraints l ≤ x ≤ u. The power of our approach is that it applies to a wide variety of useful non-smooth functionals (see §3.1.4 for a list) and that it does not rely on an active-set strategy. 
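Before describing the method, it may help to see the baseline that quasi-Newton metrics aim to accelerate: a minimal forward-backward (ISTA) sketch for the LASSO of Example 1. This is illustrative code, not the paper's implementation; it assumes the step size 1/L with L = ‖AᵀA‖₂ and uses soft-thresholding as the proximity operator of λ‖·‖₁.

```python
import numpy as np

def soft_threshold(x, t):
    # prox of t*||.||_1: componentwise shrinkage toward zero
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista_lasso(A, b, lam, n_iter=500):
    # Forward-backward iteration for min 0.5*||Ax - b||^2 + lam*||x||_1
    L = np.linalg.norm(A.T @ A, 2)                 # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)                   # forward (gradient) step
        x = soft_threshold(x - grad / L, lam / L)  # backward (proximal) step
    return x
```

With A orthogonal the iteration reaches the closed-form solution soft_threshold(Aᵀb, λ) immediately, which gives a quick sanity check.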
The approach uses the zero-memory SR1 algorithm, and we provide evidence that the non-diagonal term yields significant improvements over diagonal Hessians.

2 Quasi-Newton forward-backward splitting

2.1 The algorithm

In the following, define the quadratic approximation

    Q_k^B(x) = f(x_k) + ⟨∇f(x_k), x − x_k⟩ + (1/2)‖x − x_k‖_B² ,   (4)

where B ∈ S++(N).
The standard (non-relaxed) version of the forward-backward splitting algorithm (also known as proximal or projected gradient descent) to solve (P) updates to a new iterate x_{k+1} according to

    x_{k+1} = argmin_x Q_k^{B_k}(x) + h(x) = prox_{t_k h}(x_k − t_k ∇f(x_k)) ,   (5)

with B_k = t_k⁻¹ I_H and t_k ∈ ]0, 2/L[ (typically t_k = 1/L unless a line search is used).
Note that this specializes to gradient descent when h = 0. Similarly, if f is a strictly convex quadratic function and one takes B_k = ∇²f(x_k), then we obtain the Newton method. Let us get back to h ≠ 0. It is now well known that a fixed B = L·I_H is usually a poor choice. Since f is smooth and can be approximated by a quadratic, and inspired by quasi-Newton methods, this suggests picking B_k as an approximation of the Hessian. Here we propose a diagonal plus rank-1 approximation.
Our diagonal plus rank-1 quasi-Newton forward-backward splitting algorithm is listed in Algorithm 1 (with details for the quasi-Newton update in Algorithm 2; see §4). These algorithms are listed as simply as possible to emphasize their important components; the actual software used for the numerical tests is open-source and available at http://www.greyc.ensicaen.fr/~jfadili/software.html.

Algorithm 1: Zero-memory Symmetric Rank 1 (0SR1) algorithm to solve min f + h
Require: x_0 ∈ dom(f + h), Lipschitz constant estimate L of ∇f, stopping criterion ε
1: for k = 1, 2, 3, . . . 
do
2:   s_k ← x_k − x_{k−1}
3:   y_k ← ∇f(x_k) − ∇f(x_{k−1})
4:   Compute H_k via Algorithm 2, and define B_k = H_k⁻¹.
5:   Compute the rank-1 proximity operator (see §3)

         x̂_{k+1} ← prox_h^{B_k}(x_k − H_k ∇f(x_k))   (6)

6:   p_k ← x̂_{k+1} − x_k, and terminate if ‖p_k‖ < ε
7:   Line-search along the ray x_k + t p_k to determine x_{k+1}, or choose t = 1.
8: end for

2.2 Relation to prior work

First-order methods. The algorithm in (5) is variously known as proximal descent or the iterative shrinkage/thresholding algorithm (IST or ISTA). It has a well-grounded convergence theory, and also admits over-relaxation factors α ∈ (0, 1) [3].
The spectral projected gradient (SPG) method [4] was designed as an extension of the Barzilai-Borwein spectral step-length method to constrained problems. In [5], it was extended to non-smooth problems by allowing general proximity operators. The Barzilai-Borwein method [6] uses a specific choice of step length t_k motivated by quasi-Newton methods. Numerical evidence suggests the SPG/SpaRSA method is highly effective, although convergence results are not as strong as for ISTA.
FISTA [7] is a multi-step accelerated version of ISTA inspired by the work of Nesterov. The stepsize t is chosen in a similar way to ISTA; in our implementation, we tweak the original approach by using a Barzilai-Borwein step size, a standard line search, and restart [8], since this led to improved performance. Nesterov acceleration can be viewed as an over-relaxed version of ISTA with a specific, non-constant over-relaxation parameter α_k.
The above approaches assume B_k is a constant diagonal. The general diagonal case was considered in several papers in the 1980s as a simple quasi-Newton method, but was never widely adopted. More recent attempts include a static choice B_k ≡ B for a primal-dual method [9]. 
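The Barzilai-Borwein spectral step lengths mentioned above (and reused in §4, eq. (15)) are cheap to compute from successive iterates and gradients. A minimal sketch, with illustrative names:

```python
import numpy as np

def bb_step_lengths(s, y):
    # Barzilai-Borwein spectral step lengths from the differences
    # s = x_k - x_{k-1} and y = grad f(x_k) - grad f(x_{k-1}).
    tau_bb1 = np.dot(s, s) / np.dot(s, y)  # <s,s>/<s,y>
    tau_bb2 = np.dot(s, y) / np.dot(y, y)  # <s,y>/<y,y>
    return tau_bb1, tau_bb2
```

By the Cauchy-Schwarz inequality, tau_bb1 ≥ tau_bb2 whenever ⟨s, y⟩ > 0, matching the ordering τ_BB1 ≥ τ_BB2 stated in §4.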
A convergence rate analysis of forward-backward splitting with static and variable B_k, where one of the operators is maximal strongly monotone, is given in [10].

Active set approaches. Active set methods take a simple step, such as gradient projection, to identify active variables, and then use a more advanced quadratic model to solve for the free variables. A well-known such method is L-BFGS-B [2, 11], which handles general box-constrained problems; we test an updated version [12]. A recent bound-constrained solver is ASA [13], which uses a conjugate gradient (CG) solver on the free variables, and shows good results compared to L-BFGS-B, SPG, GENCAN and TRON. We also compare to several active set approaches specialized for ℓ₁ penalties: "Orthant-wise Learning" (OWL) [14], "Projected Scaled Sub-gradient + Active Set" (PSSas) [15], "Fixed-point continuation + Active Set" (FPC AS) [16], and "CG + IST" (CGIST) [17].

Other approaches. By transforming the problem into a standard conic program, the generic problem is amenable to interior-point methods (IPM). IPMs require solving a Newton-step equation, so first-order-like "Hessian-free" variants of IPM solve the Newton step approximately, either by approximately solving the equation or by subsampling the Hessian. The main issues are speed and robust stopping criteria for the approximations.
Yet another approach is to include the non-smooth term h in the quadratic approximation. Yu et al. [18] propose a non-smooth modification of BFGS and L-BFGS, and test on problems where h is typically a hinge loss or related function.
The projected quasi-Newton (PQN) algorithm [19, 20] is perhaps the most elegant and logical extension of quasi-Newton methods, but it involves solving a sub-iteration. 
PQN proposes the SPG algorithm [4] for the subproblems, and finds that this is an efficient tradeoff whenever the cost function (which is not involved in the sub-iteration) is relatively much more expensive to evaluate than projecting onto the constraints. Again, the cost of the sub-problem solver (and a suitable stopping criterion for this inner solve) are issues. As discussed in [21], it is possible to generalize PQN to general non-smooth problems whenever the proximity operator is known (since, as mentioned above, it is possible to extend SPG to this case).

3 Proximity operators and proximal calculus

For space limitation reasons, we only recall essential definitions. More notions and results from convex analysis, as well as the proofs, can be found in the supplementary material.

Definition 4 (Proximity operator [22]). Let h ∈ Γ₀(H). Then, for every x ∈ H, the function z ↦ (1/2)‖x − z‖² + h(z) achieves its infimum at a unique point denoted by prox_h x. The uniquely-valued operator prox_h : H → H thus defined is the proximity operator, or proximal mapping, of h.

3.1 Proximal calculus in H_V

Throughout, we denote by prox_h^V = (I_{H_V} + V⁻¹∂h)⁻¹, where ∂h is the subdifferential of h, the proximity operator of h w.r.t. the norm endowing H_V for some V ∈ S++(N). Note that since V ∈ S++(N), the proximity operator prox_h^V is well-defined.

Lemma 5 (Moreau identity in H_V). Let h ∈ Γ₀(H); then for any x ∈ H,

    prox_{ρh*}^V(x) + ρ V⁻¹ ∘ prox_{h/ρ}^{V⁻¹} ∘ V(x/ρ) = x ,   ∀ 0 < ρ < +∞ .   (7)

Corollary 6.

    prox_h^V(x) = x − V⁻¹ ∘ prox_{h*}^{V⁻¹} ∘ V(x) .   (8)

3.1.1 Diagonal+rank-1: General case

Theorem 7 (Proximity operator in H_V). 
Let h ∈ Γ₀(H) and V = D + uuᵀ, where D is diagonal with (strictly) positive diagonal elements d_i, and u ∈ R^N. Then

    prox_h^V(x) = D^{-1/2} ∘ prox_{h∘D^{-1/2}}(D^{1/2}x − v) ,   (9)

where v = α D^{-1/2}u and α is the unique root of

    p(α) = ⟨u, x − D^{-1/2} ∘ prox_{h∘D^{-1/2}} ∘ D^{1/2}(x − α D⁻¹u)⟩ + α ,   (10)

which is a Lipschitz continuous and strictly increasing function on R with Lipschitz constant 1 + Σ_i u_i²/d_i.

Remark 8.
• Computing prox_h^V amounts to solving a scalar optimization problem that involves the computation of prox_{h∘D^{-1/2}}. The latter can be much simpler to compute as D is diagonal (beyond the obvious separable case that we will consider shortly). This is typically the case when h is the indicator of the ℓ₁-ball or of the canonical simplex; the corresponding projector can be obtained in expected complexity O(N log N) by simply sorting the absolute values.
• It is of course straightforward to compute prox_{h*}^V from prox_h^V, either using Theorem 7, or using this theorem together with Corollary 6 and the Sherman-Morrison inversion lemma.

3.1.2 Diagonal+rank-1: Separable case

The following corollary is key to our novel optimization algorithm.

Corollary 9. Assume that h ∈ Γ₀(H) is separable, i.e. h(x) = Σ_{i=1}^N h_i(x_i), and V = D + uuᵀ, where D is diagonal with (strictly) positive diagonal elements d_i, and u ∈ R^N. Then

    prox_h^V(x) = ( prox_{h_i/d_i}(x_i − v_i/d_i) )_i ,   (11)

where v = αu and α is the unique root of

    p(α) = ⟨u, x − ( prox_{h_i/d_i}(x_i − α u_i/d_i) )_i⟩ + α ,   (12)

which is a Lipschitz continuous and strictly increasing function on R.

Proof: As D is diagonal and h is separable, and D ∈ S++(N), applying Theorem 7 yields the desired result.

Proposition 10. Assume that for 1 ≤ i ≤ N, prox_{h_i} is piecewise affine on R with k_i ≥ 1 segments, i.e.

    prox_{h_i}(x_i) = a_j x_i + b_j ,   t_j ≤ x_i ≤ t_{j+1} ,  j ∈ {1, . . . , k_i} .

Let k = Σ_{i=1}^N k_i. Then prox_h^V(x) can be obtained exactly by sorting at most the k real values ( (d_i/u_i)(x_i − t_j) )_{(i,j)∈{1,...,N}×{1,...,k_i}}.

Proof: Recall that (10) has a unique solution. When prox_{h_i} is piecewise affine with k_i segments, it is easy to see that p(α) in (12) is also piecewise affine, with slopes and intercepts changing at the k transition points ( (d_i/u_i)(x_i − t_j) )_{(i,j)∈{1,...,N}×{1,...,k_i}}. To get α⋆, it is sufficient to isolate the unique segment that intersects the abscissa axis. This can be achieved by sorting the values of the transition points, at an average complexity of O(k log k).

Remark 11.
• The above computational cost can be reduced in many situations by exploiting e.g. symmetry of the h_i's, identical functions, etc. This turns out to be the case for many functions of interest, e.g. the ℓ₁-norm, the indicator of the ℓ∞-ball or of the positive orthant, and many others; see the examples hereafter.
• Corollary 9 can be extended to the "block" separable case (i.e. 
separable in subsets of coordinates) when D is piecewise constant along the same block indices.

3.1.3 Semi-smooth Newton method

In many situations (see the examples below), the root of p(α) can be found exactly in polynomial complexity. If no closed form is available, one can appeal to an efficient iterative method to solve (10) (or (12)). As p is Lipschitz-continuous, hence so-called Newton (slantly) differentiable, semi-smooth Newton methods are good such solvers, with the proviso that one can design a simple slanting function which can be algorithmically exploited.
The semi-smooth Newton method for the solution of (10) can be stated as the iteration

    α_{t+1} = α_t − g(α_t)⁻¹ p(α_t) ,   (13)

where g is a generalized derivative of p.

Proposition 12 (Generalized derivative of p). If prox_{h∘D^{-1/2}} is Newton differentiable with generalized derivative G, then so is the mapping p, with generalized derivative

    g(α) = 1 + ⟨u, D^{-1/2} ∘ G(D^{1/2}x − α D^{-1/2}u) ∘ D^{-1/2}u⟩ .

Furthermore, g is nonsingular with a uniformly bounded inverse on R.

Proof: This follows from linearity and the chain rule [23, Lemma 3.5]. The second statement follows from the strictly increasing monotonicity of p as established in Theorem 7.

Thus, as p is Newton differentiable with a nonsingular generalized derivative whose inverse is also bounded, the general semi-smooth Newton convergence theorem implies that (13) converges superlinearly to the unique root of (10).

3.1.4 Examples

Many functions can be handled very efficiently using the results above. For instance, Table 1 summarizes a few of them, for which we obtain either an exact answer by sorting when possible, or else by minimizing w.r.t. a scalar variable (i.e. finding the unique root of (10)).

Function h — Algorithm
ℓ₁-norm — Separable: exact in O(N log N)
Hinge — Separable: exact in O(N log N)
ℓ∞-ball — Separable: exact in O(N log N), from the ℓ₁-norm by the Moreau identity
Box constraint — Separable: exact in O(N log N)
Positivity constraint — Separable: exact in O(N log N)
ℓ₁-ball — Nonseparable: semismooth Newton, and prox_{h∘D^{-1/2}} costs O(N log N)
ℓ∞-norm — Nonseparable: from the projector onto the ℓ₁-ball by the Moreau identity
Canonical simplex — Nonseparable: semismooth Newton, and prox_{h∘D^{-1/2}} costs O(N log N)
max function — Nonseparable: from the projector onto the simplex by the Moreau identity

Table 1: Summary of functions which have efficiently computable rank-1 proximity operators

4 A primal rank-1 SR1 algorithm

Following conventional quasi-Newton notation, we let B denote an approximation to the Hessian of f and H an approximation to the inverse Hessian. All quasi-Newton methods update an approximation to the (inverse) Hessian that satisfies the secant condition:

    H_k y_k = s_k ,   y_k = ∇f(x_k) − ∇f(x_{k−1}) ,   s_k = x_k − x_{k−1} .   (14)

Algorithm 1 follows the SR1 method [24], which uses a rank-1 update to the inverse Hessian approximation at every step. The SR1 method is perhaps less well known than BFGS, but it has the crucial property that its updates are rank-1, rather than rank-2, and it has been described as follows: "[SR1] has now taken its place alongside the BFGS method as the pre-eminent updating formula." [25]
We propose two important modifications to SR1. The first is to use limited memory, as is commonly done with BFGS. In particular, we use zero memory, which means that at every iteration a new diagonal plus rank-one matrix is formed. The other modification is to extend the SR1 method to the general setting of minimizing f + h where f is smooth but h need not be smooth; this further generalizes the case when h is an indicator function of a convex set. 
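In code, the zero-memory construction just described can be summarized as follows; this is a minimal sketch whose names are illustrative, with the safeguards (Barzilai-Borwein scaling by γ, clipping to [τ_min, τ_max], and the update-skipping test) mirroring Algorithm 2 below.

```python
import numpy as np

def zero_memory_sr1(s, y, gamma=0.8, tau_min=1e-8, tau_max=1e8):
    # Build the diagonal+rank-1 inverse Hessian approximation H = tau0*I + u u^T
    # from s = x_k - x_{k-1} and y = grad f(x_k) - grad f(x_{k-1}).
    tau_bb2 = np.dot(s, y) / np.dot(y, y)            # Barzilai-Borwein step length
    tau0 = gamma * np.clip(tau_bb2, tau_min, tau_max)
    r = s - tau0 * y                                  # residual of the secant equation
    if np.dot(r, y) <= 1e-8 * np.linalg.norm(y) * np.linalg.norm(r):
        u = np.zeros_like(s)                          # skip the rank-1 update
    else:
        u = r / np.sqrt(np.dot(r, y))                 # then (tau0*I + u u^T) y = s
    return tau0, u
```

When the update is not skipped, the secant condition holds by construction: (τ₀I + uuᵀ)y = τ₀y + r = s.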
Every step of the algorithm replaces f with a quadratic approximation, and keeps h unchanged. Because h is left unchanged, the subgradient of h is used in an implicit manner, in contrast to methods such as [18] that use an approximation to h as well and therefore take an explicit subgradient step.

Choosing H_0. In our experience, the choice of H_0 is best if scaled with the Barzilai-Borwein spectral step length

    τ_BB2 = ⟨s_k, y_k⟩ / ⟨y_k, y_k⟩   (15)

(we call it τ_BB2 to distinguish it from the other Barzilai-Borwein step size τ_BB1 = ⟨s_k, s_k⟩/⟨s_k, y_k⟩ ≥ τ_BB2).
In SR1 methods, the quantity ⟨s_k − H_0 y_k, y_k⟩ must be positive in order to have a well-defined update for u_k. The update is:

    H_k = H_0 + u_k u_kᵀ ,   u_k = (s_k − H_0 y_k) / √⟨s_k − H_0 y_k, y_k⟩ .   (16)

Algorithm 2: Sub-routine to compute the approximate inverse Hessian H_k
Require: k, s_k, y_k, 0 < γ < 1, 0 < τ_min < τ_max
1: if k = 1 then
2:   H_0 ← τ I_H, where τ > 0 is arbitrary
3:   u_k ← 0
4: else
5:   τ_BB2 ← ⟨s_k, y_k⟩/‖y_k‖²   {Barzilai-Borwein step length}
6:   Project τ_BB2 onto [τ_min, τ_max]
7:   H_0 ← γ τ_BB2 I_H
8:   if ⟨s_k − H_0 y_k, y_k⟩ ≤ 10⁻⁸ ‖y_k‖₂ ‖s_k − H_0 y_k‖₂ then
9:     u_k ← 0   {Skip the quasi-Newton update}
10:  else
11:    u_k ← (s_k − H_0 y_k)/√⟨s_k − H_0 y_k, y_k⟩
12:  end if
13: end if
14: return H_k = H_0 + u_k u_kᵀ   {B_k = H_k⁻¹ can be computed via the Sherman-Morrison formula}

For this reason, we choose H_0 = γ τ_BB2 I_H with 0 < γ < 1, and thus 0 ≤ ⟨s_k − H_0 y_k, y_k⟩ = (1 − γ)⟨s_k, y_k⟩. If ⟨s_k, y_k⟩ = 0, then there is no symmetric rank-one 
update that satisfies the secant condition. The inequality ⟨s_k, y_k⟩ > 0 is the curvature condition, and it is guaranteed for all strictly convex objectives. Following the recommendation in [26], we skip updates whenever ⟨s_k, y_k⟩ cannot be guaranteed to be non-zero given standard floating-point precision.
A value of γ = 0.8 works well in most situations. We have tested picking γ adaptively, as well as taking H_0 to be non-constant on the diagonal, but found no consistent improvements.

5 Numerical experiments and comparisons

Figure 1: (a) is the first LASSO test, (b) is the second LASSO test. [Both panels plot objective value error against time in seconds for 0-mem SR1, FISTA w/ BB, SPG/SpaRSA, L-BFGS-B, ASA, PSSas, OWL, CGIST and FPC-AS.]

Consider the unconstrained LASSO problem (1). Many codes, such as [27] and L-BFGS-B [2], handle only non-negativity or box constraints. Using the standard change of variables that introduces the positive and negative parts of x, the LASSO can be recast as

    min_{x⁺, x⁻ ≥ 0} (1/2)‖Ax⁺ − Ax⁻ − b‖² + λ 1ᵀ(x⁺ + x⁻)

and then x is recovered via x = x⁺ − x⁻. With such a formulation, solvers such as L-BFGS-B are applicable. However, this constrained problem has twice the number of variables, and the Hessian of the quadratic part changes from AᵀA to

    Ã = [ AᵀA   −AᵀA
          −AᵀA   AᵀA ] ,

which necessarily has (at least) n degenerate zero eigenvalues, and this adversely affects solvers.
A similar situation occurs with the hinge-loss function. Consider the shifted and reversed hinge loss function h(x) = max(0, x). 
Then one can split x = x⁺ − x⁻, add the constraints x⁺ ≥ 0 and x⁻ ≥ 0, and replace h(x) with 1ᵀx⁺. As before, the Hessian gains n degenerate eigenvalues.
We compared our proposed algorithm on the LASSO problem. The first example, in Fig. 1a, is a typical example from compressed sensing that takes A ∈ R^{m×n} with iid N(0, 1) entries, m = 1500 and n = 3000. We set λ = 0.1. L-BFGS-B does very well, followed closely by our proposed SR1 algorithm and PSSas. Note that L-BFGS-B and ASA are in Fortran and C, respectively (the other algorithms are in Matlab).
Our second example uses a square operator A with dimension n = 13³ = 2197, chosen as a 3D discrete differential operator. This example stems from a numerical analysis problem of solving a discretized PDE, as suggested by [28]. For this example, we set λ = 1. For all the solvers, we use the same parameters as in the previous example. Unlike the previous example, Fig. 1b now shows that L-BFGS-B is very slow on this problem. The FPC-AS method, very slow on the earlier test, is now the fastest. However, just as before, our SR1 method is nearly as good as the best algorithm. This robustness is one benefit of our approach, since the method does not rely on active-set identification parameters or inner iteration tolerances.

6 Conclusions

In this paper, we proposed a novel variable-metric (quasi-Newton) forward-backward splitting algorithm, designed to efficiently solve non-smooth convex problems structured as the sum of a smooth term and a non-smooth one. We introduced a class of weighted norms induced by diagonal plus rank-1 symmetric positive-definite matrices, and proposed a framework to compute the proximity operator in the weighted norm. 
The latter result is distinctly new and is of independent interest. We also provided clear evidence that the non-diagonal term yields significant acceleration over diagonal matrices.
The proposed method can be extended in several ways. Although we focused on forward-backward splitting, our approach can easily be extended to the generalized forward-backward algorithm of [29]. However, if we switch to a primal-dual setting, which is desirable because it can handle more complicated objective functionals, updating B_k is non-obvious, though one can envisage non-diagonal preconditioning methods.
Another improvement would be to derive efficient calculations for rank-2 proximity terms, thus allowing a 0-memory BFGS method. We are able to extend (result not presented here) Theorem 7 to diagonal plus rank-r matrices; however, in general, one must then solve an r-dimensional inner problem using the semismooth Newton method.
A final possible extension is to take B_k to be diagonal plus rank-1 on diagonal blocks, since if h is separable this can still be solved by our algorithm (see Remark 11). The challenge here is adapting this to a robust quasi-Newton update. For some matrices that are well approximated by low-rank blocks, such as H-matrices [30], it may be possible to choose B_k ≡ B as a fixed preconditioner.

Acknowledgments

SB would like to acknowledge the Fondation Sciences Mathématiques de Paris for his fellowship.

References

[1] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer-Verlag, New York, 2011.
[2] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm for bound constrained optimization. SIAM J. Sci. Computing, 16(5):1190–1208, 1995.
[3] P. L. Combettes and J. C. Pesquet. Proximal splitting methods in signal processing. In H. H. Bauschke, R. S. Burachik, P. L. Combettes, V. Elser, D. R. Luke, and H. 
Wolkowicz, editors, Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 185–212. Springer-Verlag, New York, 2011.
[4] E. G. Birgin, J. M. Martínez, and M. Raydan. Nonmonotone spectral projected gradient methods on convex sets. SIAM J. Optim., 10(4):1196–1211, 2000.
[5] S. Wright, R. Nowak, and M. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57:2479–2493, 2009.
[6] J. Barzilai and J. Borwein. Two point step size gradient method. IMA J. Numer. Anal., 8:141–148, 1988.
[7] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. on Imaging Sci., 2(1):183–202, 2009.
[8] B. O'Donoghue and E. Candès. Adaptive restart for accelerated gradient schemes. Preprint: arXiv:1204.3982, 2012.
[9] T. Pock and A. Chambolle. Diagonal preconditioning for first order primal-dual algorithms in convex optimization. In ICCV, 2011.
[10] G. H.-G. Chen and R. T. Rockafellar. Convergence rates in forward-backward splitting. SIAM Journal on Optimization, 7(2):421–444, 1997.
[11] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Software, 23(4):550–560, 1997.
[12] J. L. Morales and J. Nocedal. Remark on "Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound constrained optimization". ACM Trans. Math. Softw., 38(1):7:1–7:4, 2011.
[13] W. W. Hager and H. Zhang. A new active set algorithm for box constrained optimization. SIAM J. Optim., 17:526–557, 2006.
[14] A. Andrew and J. Gao. Scalable training of ℓ1-regularized log-linear models. In ICML, 2007.
[15] M. Schmidt, G. Fung, and R. Rosales. Fast optimization methods for ℓ1 regularization: A comparative study and two new approaches. 
In European Conference on Machine Learning, 2007.
[16] Z. Wen, W. Yin, D. Goldfarb, and Y. Zhang. A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization and continuation. SIAM J. Sci. Comput., 32(4):1832–1857, 2010.
[17] T. Goldstein and S. Setzer. High-order methods for basis pursuit. Technical report, CAM-UCLA, 2011.
[18] J. Yu, S.V.N. Vishwanathan, S. Guenter, and N. Schraudolph. A quasi-Newton approach to nonsmooth convex optimization problems in machine learning. J. Machine Learning Research, 11:1145–1200, 2010.
[19] M. Schmidt, E. van den Berg, M. Friedlander, and K. Murphy. Optimizing costly functions with simple constraints: A limited-memory projected quasi-Newton algorithm. In AISTATS, 2009.
[20] M. Schmidt, D. Kim, and S. Sra. Projected Newton-type methods in machine learning. In S. Sra, S. Nowozin, and S. Wright, editors, Optimization for Machine Learning. MIT Press, 2011.
[21] J. D. Lee, Y. Sun, and M. A. Saunders. Proximal Newton-type methods for minimizing convex objective functions in composite form. Preprint: arXiv:1206.1623, 2012.
[22] J.-J. Moreau. Fonctions convexes duales et points proximaux dans un espace hilbertien. CRAS Sér. A Math., 255:2897–2899, 1962.
[23] R. Griesse and D. A. Lorenz. A semismooth Newton method for Tikhonov functionals with sparsity constraints. Inverse Problems, 24(3):035007, 2008.
[24] C. Broyden. Quasi-Newton methods and their application to function minimization. Math. Comp., 21:577–593, 1967.
[25] N. Gould. Seminal papers in nonlinear optimization. In An Introduction to Algorithms for Continuous Optimization. Oxford University Computing Laboratory, 2006. http://www.numerical.rl.ac.uk/nimg/course/lectures/paper/paper.pdf.
[26] J. Nocedal and S. Wright. Numerical Optimization. Springer, 2nd edition, 2006.
[27] I. Dhillon, D. Kim, and S. Sra. 
Tackling box-constrained optimization via a new projected quasi-Newton approach. SIAM J. Sci. Comput., 32(6):3548–3563, 2010.
[28] R. Fletcher. On the Barzilai-Borwein method. In L. Qi, K. Teo, X. Yang, P. M. Pardalos, and D. W. Hearn, editors, Optimization and Control with Applications, volume 96 of Applied Optimization, pages 235–256. Springer US, 2005.
[29] H. Raguet, J. Fadili, and G. Peyré. Generalized forward-backward splitting. Technical report, Preprint Hal-00613637, 2011.
[30] W. Hackbusch. A sparse matrix arithmetic based on H-matrices. Part I: Introduction to H-matrices. Computing, 62:89–108, 1999.
[31] R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.", "award": [], "sourceid": 1233, "authors": [{"given_name": "Stephen", "family_name": "Becker", "institution": null}, {"given_name": "Jalal", "family_name": "Fadili", "institution": null}]}