{"title": "Local Linear Convergence of Forward--Backward under Partial Smoothness", "book": "Advances in Neural Information Processing Systems", "page_first": 1970, "page_last": 1978, "abstract": "In this paper, we consider the Forward--Backward proximal splitting algorithm to minimize the sum of two proper closed convex functions, one of which has a Lipschitz continuous gradient while the other is partly smooth relative to an active manifold $\\mathcal{M}$. We propose a generic framework in which we show that the Forward--Backward algorithm (i) correctly identifies the active manifold $\\mathcal{M}$ in a finite number of iterations, and then (ii) enters a local linear convergence regime that we characterize precisely. This gives a grounded and unified explanation of the typical behaviour that has been observed numerically for many problems encompassed in our framework, including the Lasso, the group Lasso, the fused Lasso and the nuclear norm regularization, to name a few. These results may have numerous applications, including in signal/image processing, sparse recovery and machine learning.", "full_text": "Local Linear Convergence of Forward\u2013Backward under Partial Smoothness\n\nJingwei Liang and Jalal M. Fadili\n\nGREYC, CNRS, ENSICAEN, Universit\u00e9 de Caen\n\n{Jingwei.Liang,Jalal.Fadili}@greyc.ensicaen.fr\n\nGabriel Peyr\u00e9\n\nCNRS, Ceremade, Universit\u00e9 Paris-Dauphine\nGabriel.Peyre@ceremade.dauphine.fr\n\nAbstract\n\nIn this paper, we consider the Forward\u2013Backward proximal splitting algorithm to minimize the sum of two proper convex functions, one of which has a Lipschitz continuous gradient while the other is partly smooth relative to an active manifold $\\mathcal{M}$. We propose a generic framework under which we show that the Forward\u2013Backward algorithm (i) correctly identifies the active manifold $\\mathcal{M}$ in a finite number of iterations, and then (ii) enters a local linear convergence regime that we characterize precisely. 
This gives a grounded and unified explanation of the typical behaviour that has been observed numerically for many problems encompassed in our framework, including the Lasso, the group Lasso, the fused Lasso and the nuclear norm regularization, to name a few. These results may have numerous applications, including in signal/image processing, sparse recovery and machine learning.\n\n1 Introduction\n\n1.1 Problem statement\n\nConvex optimization has become ubiquitous in most quantitative disciplines of science. A common trend in modern science is the increase in size of datasets, which drives the need for more efficient optimization methods. Our goal is the generic minimization of composite functions of the form\n\n$$\\min_{x \\in \\mathbb{R}^n} \\big\\{ \\Phi(x) = F(x) + J(x) \\big\\}, \\qquad (1.1)$$\n\nwhere\n\n(A.1) $J \\in \\Gamma_0(\\mathbb{R}^n)$, the set of proper, lower semi-continuous and convex functions;\n(A.2) $F$ is a convex and $C^{1,1}(\\mathbb{R}^n)$ function whose gradient is $\\beta$-Lipschitz continuous;\n(A.3) $\\operatorname{Argmin} \\Phi \\neq \\emptyset$.\n\nThe class of problems (1.1) covers many popular non-smooth convex optimization problems encountered in various fields throughout science and engineering, including signal/image processing, machine learning and classification. 
For instance, taking $F = \\frac{1}{2\\lambda}\\|y - A \\cdot\\|^2$ for some $A \\in \\mathbb{R}^{m \\times n}$ and $\\lambda > 0$, we recover the Lasso problem when $J = \\|\\cdot\\|_1$, the group Lasso for $J = \\|\\cdot\\|_{1,2}$, the fused Lasso for $J = \\|D^* \\cdot\\|_1$ with $D = [D_{\\mathrm{DIF}}, \\epsilon \\mathrm{Id}]$, where $D_{\\mathrm{DIF}}$ is the finite difference operator, anti-sparsity regularization when $J = \\|\\cdot\\|_\\infty$, and nuclear norm regularization when $J = \\|\\cdot\\|_*$.\n\nThe standard (non-relaxed) version of the Forward\u2013Backward (FB) splitting algorithm [18] for solving (1.1) updates to a new iterate $x_{k+1}$ according to\n\n$$x_{k+1} = \\operatorname{prox}_{\\gamma_k J}\\big(x_k - \\gamma_k \\nabla F(x_k)\\big), \\quad \\gamma_k \\in [\\epsilon, 2/\\beta - \\epsilon], \\qquad (1.2)$$\n\nwith $x_0 \\in \\mathbb{R}^n$ arbitrarily chosen and $0 < \\epsilon \\leq 1/\\beta$. Recall that the proximity operator is defined, for $\\gamma > 0$, as\n\n$$\\operatorname{prox}_{\\gamma J}(x) = \\operatorname{argmin}_{z \\in \\mathbb{R}^n} \\tfrac{1}{2\\gamma}\\|z - x\\|^2 + J(z).$$\n\n1.2 Contributions\n\nIn this paper, we present a unified local linear convergence analysis for the FB algorithm to solve (1.1) when $J$ is in addition partly smooth relative to a manifold $\\mathcal{M}$ (see Definition 2.1). The class of partly smooth functions is very large and encompasses all previously discussed examples as special cases. More precisely, we first show that FB has a finite identification property, meaning that after a finite number of iterations, say $K$, all iterates obey $x_k \\in \\mathcal{M}$ for $k \\geq K$. Exploiting this property, we then show that after such a large enough number of iterations, $x_k$ converges locally linearly. We characterize this regime and the rates precisely depending on the structure of the active manifold $\\mathcal{M}$. In general, $x_k$ converges locally Q-linearly, and when $\\mathcal{M}$ is a linear subspace, the convergence becomes R-linear. 
Several experimental results on some of the problems discussed above are provided to support our theoretical findings.\n\n1.3 Related work\n\nFinite support identification and local R-linear convergence of FB to solve the Lasso problem, though in an infinite-dimensional setting, are established in [3] under either a very restrictive injectivity assumption, or a non-degeneracy assumption which is a specialization of ours (see (3.1)) to the $\\ell_1$-norm. A similar result is proved in [12], for $F$ being a smooth convex and locally $C^2$ function and $J$ the $\\ell_1$-norm, under restricted injectivity and non-degeneracy assumptions. The $\\ell_1$-norm is a partly smooth function and hence covered by our results. [1] proved Q-linear convergence of FB to solve (1.1) for $F$ satisfying restricted smoothness and strong convexity assumptions, and $J$ being a so-called convex decomposable regularizer. Again, the latter is a small subclass of partly smooth functions, and their result is then covered by ours. For example, our framework covers the total variation (TV) semi-norm and $\\ell_\\infty$-norm regularizers, which are not decomposable.\n\nIn [13, 14], the authors have shown finite identification of active manifolds associated to partly smooth functions for various algorithms, including the (sub)gradient projection method, Newton-like methods and the proximal point algorithm. Their work extends that of e.g. [28] on identifiable surfaces from the convex case to a general non-smooth setting. Using these results, [15] considered the algorithm [25] to solve (1.1) where $J$ is partly smooth, but not necessarily convex, and $F$ is $C^2(\\mathbb{R}^n)$, and proved finite identification of the active manifold. However, the convergence rate remains an open problem in all these works.\n\n1.4 Notations\n\nFor a nonempty convex set $C \\subset \\mathbb{R}^n$, $\\operatorname{ri}(C)$ denotes its relative interior, $\\operatorname{aff}(C)$ is its affine hull, and $\\operatorname{par}(C)$ is the subspace parallel to it. 
We denote $P_C$ the orthogonal projector onto $C$, and for a matrix $A \\in \\mathbb{R}^{m \\times n}$, $A_C = A \\circ P_C$.\n\nSuppose $\\mathcal{M} \\subset \\mathbb{R}^n$ is a $C^2$-manifold around $x \\in \\mathbb{R}^n$, and denote $T_{\\mathcal{M}}(x)$ the tangent space of $\\mathcal{M}$ at $x$. The tangent model subspace is defined as\n\n$$T_x = \\operatorname{par}\\big(\\partial J(x)\\big)^\\perp.$$\n\nIt is easy to see that $P_{T_x}\\big(\\partial J(x)\\big)$ is single-valued, we therefore define the generalized sign vector\n\n$$e_x = P_{T_x}\\big(\\partial J(x)\\big).$$\n\nIt is straightforward to show that $e_x = P_{\\operatorname{aff}(\\partial J(x))}(0)$.\n\n2 Partial smoothness\n\nBesides (A.1), our central assumption is that $J$ is a partly smooth function. Partial smoothness of functions is originally defined in [16]. Our definition hereafter specializes it to functions in $\\Gamma_0(\\mathbb{R}^n)$.\n\nDefinition 2.1. Let $J \\in \\Gamma_0(\\mathbb{R}^n)$, and $x \\in \\mathbb{R}^n$ such that $\\partial J(x) \\neq \\emptyset$. $J$ is partly smooth at $x$ relative to a set $\\mathcal{M}$ containing $x$ if\n\n(1) (Smoothness) $\\mathcal{M}$ is a $C^2$-manifold around $x$ and $J$ restricted to $\\mathcal{M}$ is $C^2$ around $x$.\n(2) (Sharpness) The tangent space $T_{\\mathcal{M}}(x)$ is $T_x$.\n(3) (Continuity) The set-valued mapping $\\partial J$ is continuous at $x$ relative to $\\mathcal{M}$.\n\nIn the following, the class of partly smooth functions at $x$ relative to $\\mathcal{M}$ is denoted as $\\mathrm{PS}_x(\\mathcal{M})$. When $\\mathcal{M}$ is an affine manifold, then $\\mathcal{M} = x + T_x$, and we denote this subclass as $\\mathrm{PSA}_x(x + T_x)$. When $\\mathcal{M}$ is a linear manifold, then $\\mathcal{M} = T_x$, and we denote this subclass as $\\mathrm{PSL}_x(T_x)$.\n\nCapitalizing on the results of [16], it can be shown that under mild transversality assumptions, the set of continuous convex partly smooth functions is closed under addition and pre-composition by a linear operator. Moreover, absolutely permutation-invariant convex and partly smooth functions of the singular values of a real matrix, i.e. 
spectral functions, are convex and partly smooth spectral functions of the matrix [9].\n\nIt then follows that all the examples discussed in Section 1, including the $\\ell_1$, $\\ell_{1,2}$, $\\ell_\\infty$ norms, the TV semi-norm and the nuclear norm, are partly smooth. In fact, the nuclear norm is partly smooth at a matrix $x$ relative to the manifold $\\mathcal{M} = \\{x' : \\operatorname{rank}(x') = \\operatorname{rank}(x)\\}$. The first three regularizers are all part of the class $\\mathrm{PSL}_x(T_x)$, see Section 4 and [27] for details.\n\nWe now define a subclass of partly smooth functions where the active manifold is actually a subspace and the generalized sign vector $e_x$ is locally constant.\n\nDefinition 2.2. $J$ belongs to the class $\\mathrm{PSS}_x(T_x)$ if and only if $J \\in \\mathrm{PSA}_x(x + T_x)$ (resp. $J \\in \\mathrm{PSL}_x(T_x)$), and there exists a neighbourhood $U$ of $x$ such that, for all $x' \\in (x + T_x) \\cap U$ (resp. for all $x' \\in T_x \\cap U$),\n\n$$e_{x'} = e_x.$$\n\nA typical family of functions that comply with this definition is that of partly polyhedral functions [26, Section 6.5], which includes the $\\ell_1$ and $\\ell_\\infty$ norms, and the TV semi-norm.\n\n3 Local linear convergence of the FB method\n\nIn this section, we state our main result on finite identification and local linear convergence of FB.\n\nTheorem 3.1. Assume that (A.1)-(A.3) hold. Suppose that the FB scheme is used to create a sequence $x_k$ which converges to $x^\\star \\in \\operatorname{Argmin} \\Phi$ such that $J \\in \\mathrm{PS}_{x^\\star}(\\mathcal{M}_{x^\\star})$, $F$ is $C^2$ near $x^\\star$ and\n\n$$-\\nabla F(x^\\star) \\in \\operatorname{ri}\\big(\\partial J(x^\\star)\\big). \\qquad (3.1)$$\n\nThen the following holds.\n\n(1) The FB scheme (1.2) has the finite identification property, i.e. there exists $K \\geq 0$ such that for all $k \\geq K$, $x_k \\in \\mathcal{M}_{x^\\star}$.\n\n(2) Suppose moreover that there exists $\\alpha > 0$ such that\n\n$$P_T \\nabla^2 F(x^\\star) P_T \\succeq \\alpha \\mathrm{Id}, \\qquad (3.2)$$\n\nwhere $T := T_{x^\\star}$. 
Then for all $k \\geq K$, the following holds.\n\n(i) Q-linear convergence: if $0 < \\underline{\\gamma} \\leq \\gamma_k \\leq \\bar{\\gamma} < \\min\\big(2\\alpha\\beta^{-2}, 2\\beta^{-1}\\big)$, then given any $1 > \\rho > \\tilde{\\rho}$,\n\n$$\\|x_{k+1} - x^\\star\\| \\leq \\rho \\|x_k - x^\\star\\|,$$\n\nwhere $\\tilde{\\rho}^2 = \\max\\big\\{q(\\underline{\\gamma}), q(\\bar{\\gamma})\\big\\} \\in [0, 1[$ and $q(\\gamma) = 1 - 2\\alpha\\gamma + \\beta^2\\gamma^2$.\n\n(ii) R-linear convergence: if $J \\in \\mathrm{PSA}_{x^\\star}(x^\\star + T)$ or $J \\in \\mathrm{PSL}_{x^\\star}(T)$, then for $0 < \\underline{\\gamma} \\leq \\gamma_k \\leq \\bar{\\gamma} < \\min\\big(2\\alpha\\nu^{-2}, 2\\beta^{-1}\\big)$, where $\\nu \\leq \\beta$ is the Lipschitz constant of $P_T \\nabla F P_T$,\n\n$$\\|x_{k+1} - x^\\star\\| \\leq \\rho_k \\|x_k - x^\\star\\|,$$\n\nwhere $\\rho_k^2 = 1 - 2\\alpha\\gamma_k + \\nu^2\\gamma_k^2 \\in [0, 1[$. Moreover, if $\\alpha/\\nu^2 \\leq \\bar{\\gamma}$ and $\\gamma_k \\equiv \\alpha/\\nu^2$, then the optimal linear rate is achieved, namely\n\n$$\\rho^* = \\sqrt{1 - \\alpha^2/\\nu^2}.$$\n\nRemark 3.2.\n\n\u2022 The non-degeneracy assumption in (3.1) can be viewed as a geometric generalization of strict complementarity of non-linear programming. Building on the arguments of [14], it turns out that it is almost a necessary condition for finite identification of $\\mathcal{M}_{x^\\star}$.\n\u2022 Under the non-degeneracy and restricted strong convexity assumptions (3.1)-(3.2), one can actually show that $x^\\star$ is unique by extending the reasoning in [26].\n\u2022 For $F = G \\circ A$, where $G$ satisfies (A.2) and $A$ is a linear operator, assumption (3.2) and the constant $\\alpha$ can be restated in terms of local strong convexity of $G$ and restricted injectivity of $A$ on $T$, i.e. 
$\\operatorname{Ker}(A) \\cap T = \\{0\\}$.\n\u2022 When $F = \\frac{1}{2}\\|y - A \\cdot\\|^2$, not only is the minimizer $x^\\star$ unique, but the rates in Theorem 3.1 can also be refined further as the gradient operator $\\nabla F$ becomes linear.\n\u2022 Partial smoothness guarantees that $x_k$ arrives at the active manifold in finite time, hence raising the hope of acceleration using second-order information. For instance, one can think of turning to geometric methods along the manifold $\\mathcal{M}_{x^\\star}$, where faster convergence rates can be achieved. This is also the motivation behind the work of e.g. [19].\n\nWhen $J \\in \\mathrm{PSS}_{x^\\star}(T)$, it turns out that the restricted convexity assumption (3.2) of Theorem 3.1 can be removed in some cases, but at the price of less sharp rates.\n\nTheorem 3.3. Assume that (A.1)-(A.3) hold. For $x^\\star \\in \\operatorname{Argmin} \\Phi$, suppose that $J \\in \\mathrm{PSS}_{x^\\star}(T_{x^\\star})$, (3.1) is fulfilled, and there exists a subspace $V$ such that $\\operatorname{Ker}\\big(P_T \\nabla^2 F(x) P_T\\big) = V$ for any $x \\in B_\\epsilon(x^\\star)$, $\\epsilon > 0$. Let the FB scheme be used to create a sequence $x_k$ that converges to $x^\\star$ with $0 < \\underline{\\gamma} \\leq \\gamma_k \\leq \\bar{\\gamma} < \\min\\big(2\\alpha\\beta^{-2}, 2\\beta^{-1}\\big)$, where $\\alpha > 0$ (see the proof). Then there exist a constant $C > 0$ and $\\rho \\in [0, 1[$ such that for all $k$ large enough\n\n$$\\|x_k - x^\\star\\| \\leq C\\rho^k.$$\n\nA typical example where this result applies is for $F = G \\circ A$ with $G$ locally strongly convex, in which case $V = \\operatorname{Ker}(A_T)$.\n\n4 Numerical experiments\n\nIn this section, we describe some examples to demonstrate the applicability of our results. More precisely, we consider solving\n\n$$\\min_{x \\in \\mathbb{R}^n} \\tfrac{1}{2}\\|y - Ax\\|^2 + \\lambda J(x), \\qquad (4.1)$$\n\nwhere $y \\in \\mathbb{R}^m$ is the observation, $A : \\mathbb{R}^n \\to \\mathbb{R}^m$, $\\lambda$ is the tradeoff parameter, and $J$ is either the $\\ell_1$-norm, the $\\ell_\\infty$-norm, the $\\ell_{1,2}$-norm, the TV semi-norm or the nuclear norm.\n\nExample 4.1 ($\\ell_1$-norm). 
For $x \\in \\mathbb{R}^n$, the sparsity promoting $\\ell_1$-norm [7, 23] is\n\n$$J(x) = \\sum_{i=1}^n |x_i|.$$\n\nIt can be verified that $J$ is a polyhedral norm, and thus $J \\in \\mathrm{PSS}_x(T_x)$ for the model subspace\n\n$$\\mathcal{M} = T_x = \\big\\{u \\in \\mathbb{R}^n : \\operatorname{supp}(u) \\subseteq \\operatorname{supp}(x)\\big\\}, \\quad \\text{and} \\quad e_x = \\operatorname{sign}(x).$$\n\nThe proximity operator of the $\\ell_1$-norm is given by a simple soft-thresholding.\n\nExample 4.2 ($\\ell_{1,2}$-norm). The $\\ell_{1,2}$-norm is usually used to promote group-structured sparsity [29]. Let the support of $x \\in \\mathbb{R}^n$ be divided into non-overlapping blocks $\\mathcal{B}$ such that $\\bigcup_{b \\in \\mathcal{B}} b = \\{1, \\ldots, n\\}$. The $\\ell_{1,2}$-norm is given by\n\n$$J(x) = \\|x\\|_{\\mathcal{B}} = \\sum_{b \\in \\mathcal{B}} \\|x_b\\|,$$\n\nwhere $x_b = (x_i)_{i \\in b} \\in \\mathbb{R}^{|b|}$. $\\|\\cdot\\|_{\\mathcal{B}}$ in general is not polyhedral, yet it is partly smooth relative to the linear manifold\n\n$$\\mathcal{M} = T_x = \\big\\{u \\in \\mathbb{R}^n : \\operatorname{supp}_{\\mathcal{B}}(u) \\subseteq \\operatorname{supp}_{\\mathcal{B}}(x)\\big\\}, \\quad \\text{and} \\quad e_x = \\big(\\mathcal{N}(x_b)\\big)_{b \\in \\mathcal{B}},$$\n\nwhere $\\operatorname{supp}_{\\mathcal{B}}(x) = \\bigcup \\big\\{b : x_b \\neq 0\\big\\}$, $\\mathcal{N}(x) = x/\\|x\\|$ and $\\mathcal{N}(0) = 0$. The proximity operator of the $\\ell_{1,2}$-norm is given by a simple block soft-thresholding.\n\nExample 4.3 (Total Variation). As stated in the introduction, partial smoothness is preserved under pre-composition by a linear operator. Let $J_0$ be a closed convex function and $D$ a linear operator, and consider $J = J_0 \\circ D^*$. Popular examples are the TV semi-norm, in which case $J_0 = \\|\\cdot\\|_1$ and $D^* = D_{\\mathrm{DIF}}$ is a finite difference approximation of the derivative [22], or the fused Lasso for $D = [D_{\\mathrm{DIF}}, \\epsilon \\mathrm{Id}]$ [24].\n\nIf $J_0 \\in \\mathrm{PS}_{D^* x}(\\mathcal{M}_0)$, then it is shown in [16, Theorem 4.2] that under an appropriate transversality condition, $J \\in \\mathrm{PS}_x(\\mathcal{M})$ where\n\n$$\\mathcal{M} = \\big\\{u \\in \\mathbb{R}^n : D^* u \\in \\mathcal{M}_0\\big\\}.$$\n\nIn particular, for the case of the TV semi-norm, we have $J \\in \\mathrm{PSS}_x(T_x)$ with\n\n$$\\mathcal{M} = T_x = \\big\\{u \\in \\mathbb{R}^n : \\operatorname{supp}(D^* u) \\subseteq I\\big\\} \\quad \\text{and} \\quad e_x = P_{T_x} D \\operatorname{sign}(D^* x),$$\n\nwhere $I = \\operatorname{supp}(D^* x)$. 
The proximity operator for the 1D TV, though not available in closed form, can be obtained efficiently using either the taut string algorithm [10] or graph cuts [6].\n\nExample 4.4 (Nuclear norm). Low-rank is the spectral extension of vector sparsity to matrix-valued data $x \\in \\mathbb{R}^{n_1 \\times n_2}$, i.e. imposing the sparsity on the singular values of $x$. Let $x = U \\Lambda_x V^*$ be a reduced singular value decomposition (SVD) of $x$. The nuclear norm of $x$ is defined as\n\n$$J(x) = \\|x\\|_* = \\sum_{i=1}^r (\\Lambda_x)_i,$$\n\nwhere $\\operatorname{rank}(x) = r$. It has been used for instance as an SDP convex relaxation for many problems, including in machine learning [2, 11], matrix completion [21, 4] and phase retrieval [5].\n\nIt can be shown that the nuclear norm is partly smooth relative to the manifold [17, Example 2]\n\n$$\\mathcal{M} = \\big\\{z \\in \\mathbb{R}^{n_1 \\times n_2} : \\operatorname{rank}(z) = r\\big\\}.$$\n\nThe tangent space to $\\mathcal{M}$ at $x$ and $e_x$ are given by\n\n$$T_{\\mathcal{M}}(x) = \\big\\{z \\in \\mathbb{R}^{n_1 \\times n_2} : z = U L^* + M V^*, \\; \\forall L \\in \\mathbb{R}^{n_2 \\times r}, M \\in \\mathbb{R}^{n_1 \\times r}\\big\\}, \\quad \\text{and} \\quad e_x = U V^*.$$\n\nThe proximity operator of the nuclear norm is just soft-thresholding applied to the singular values.\n\nRecovery from random measurements\n\nIn these examples, the forward observation model is\n\n$$y = A x_0 + \\varepsilon, \\quad \\varepsilon \\sim \\mathcal{N}(0, \\delta^2), \\qquad (4.2)$$\n\nwhere $A \\in \\mathbb{R}^{m \\times n}$ is generated uniformly at random from the Gaussian ensemble with i.i.d. zero-mean and unit variance entries. 
The tested experimental settings are\n\n(a) $\\ell_1$-norm: $m = 48$ and $n = 128$, $x_0$ is 8-sparse;\n(b) Total Variation: $m = 48$ and $n = 128$, $D_{\\mathrm{DIF}} x_0$ is 8-sparse;\n(c) $\\ell_\\infty$-norm: $m = 123$ and $n = 128$, $x_0$ has 10 saturating entries;\n(d) $\\ell_{1,2}$-norm: $m = 48$ and $n = 128$, $x_0$ has 2 non-zero blocks of size 4;\n(e) Nuclear norm: $m = 1425$ and $n = 2500$, $x_0 \\in \\mathbb{R}^{50 \\times 50}$ and $\\operatorname{rank}(x_0) = 5$.\n\nThe number of measurements is chosen sufficiently large, $\\delta$ small enough and $\\lambda$ of the order of $\\delta$ so that [27, Theorem 1] applies, yielding that the minimizer of (4.1) is unique and verifies the non-degeneracy and restricted strong convexity assumptions (3.1)-(3.2).\n\nThe convergence profiles of $\\|x_k - x^\\star\\|$ are depicted in Figure 1(a)-(e). Only local curves after activity identification are shown. For $\\ell_1$, TV and $\\ell_\\infty$, the predicted rate coincides exactly with the observed one. This is because these regularizers are all partly polyhedral gauges, and the data fidelity is quadratic, hence making the predictions of Theorem 3.1(ii) exact. For the $\\ell_{1,2}$-norm, although its active manifold is still a subspace, the generalized sign vector $e_k$ is not locally constant, which entails that the predicted rate of Theorem 3.1(ii) slightly overestimates the observed one. For the nuclear norm, the active manifold is not linear, so Theorem 3.1(i) applies, and the observed and predicted rates are again close.\n\nTV deconvolution. In this image processing example, $y$ is a degraded image generated according to the same forward model as (4.2), but now $A$ is a convolution with a Gaussian kernel. The anisotropic TV regularizer is used. The convergence profile is shown in Figure 1(f). Assumptions (3.1)-(3.2) are checked a posteriori. 
This together with the fact that the anisotropic TV is polyhedral justifies that the predicted rate is again exact.\n\n[Figure 1: six log-scale plots of $\\|x_k - x^\\star\\|$ against the iteration counter $k$, each comparing a theoretical and a practical curve.]\n\nFigure 1: Observed and predicted local convergence profiles of the FB method (1.2) in terms of $\\|x_k - x^\\star\\|$ for different types of partly smooth functions. (a) $\\ell_1$-norm (Lasso); (b) TV semi-norm; (c) $\\ell_\\infty$-norm; (d) $\\ell_{1,2}$-norm; (e) Nuclear norm; (f) TV deconvolution.\n\nAcknowledgment\n\nThis work was partly supported by the European Research Council (ERC project SIGMA-Vision).\n\n5 Proofs\n\nLemma 5.1. Suppose that $J \\in \\mathrm{PS}_x(\\mathcal{M})$. Then for any $x' \\in \\mathcal{M} \\cap U$, where $U$ is a neighbourhood of $x$, the projector $P_{\\mathcal{M}}(x')$ is uniquely valued and $C^1$ around $x$, and thus\n\n$$x' - x = P_{T_x}(x' - x) + o\\big(\\|x' - x\\|\\big).$$\n\nIf $J \\in \\mathrm{PSA}_x(x + T_x)$ or $J \\in \\mathrm{PSL}_x(T_x)$, then $x' - x = P_{T_x}(x' - x)$.\n\nProof. Partial smoothness implies that $\\mathcal{M}$ is a $C^2$-manifold around $x$, then $P_{\\mathcal{M}}(x')$ is uniquely valued [20] and moreover $C^1$ near $x$ [17, Lemma 4]. 
Thus, continuous differentiability shows\n\n$$x' - x = P_{\\mathcal{M}}(x') - P_{\\mathcal{M}}(x) = \\mathrm{D}P_{\\mathcal{M}}(x)(x' - x) + o(\\|x' - x\\|),$$\n\nwhere $\\mathrm{D}P_{\\mathcal{M}}(x)$ is the derivative of $P_{\\mathcal{M}}$ at $x$. By virtue of [17, Lemma 4] and the sharpness property of $J$, this derivative is given by\n\n$$\\mathrm{D}P_{\\mathcal{M}}(x) = P_{T_{\\mathcal{M}}(x)} = P_{T_x}.$$\n\nThe case where $\\mathcal{M}$ is affine or linear is immediate. This concludes the proof.\n\nProof of Theorem 3.1.\n\n1. Classical convergence results of the FB scheme, e.g. [8], show that $x_k$ converges to some $x^\\star \\in \\operatorname{Argmin} \\Phi \\neq \\emptyset$ by assumption (A.3). Assumptions (A.1)-(A.2) entail that (3.1) is equivalent to $0 \\in \\operatorname{ri}\\big(\\partial \\Phi(x^\\star)\\big)$. Since $F \\in C^2$ around $x^\\star$, the smooth perturbation rule of partly smooth functions [16, Corollary 4.7] ensures that $\\Phi \\in \\mathrm{PS}_{x^\\star}(\\mathcal{M})$. By definition of $x_{k+1}$, we have\n\n$$\\tfrac{1}{\\gamma_k}\\big(G_k(x_k) - G_k(x_{k+1})\\big) \\in \\partial \\Phi(x_{k+1}),$$\n\nwhere $G_k = \\mathrm{Id} - \\gamma_k \\nabla F$. By the Baillon-Haddad theorem, $G_k$ is non-expansive, hence\n\n$$\\operatorname{dist}\\big(0, \\partial \\Phi(x_{k+1})\\big) \\leq \\tfrac{1}{\\gamma_k}\\|G_k(x_k) - G_k(x_{k+1})\\| \\leq \\tfrac{1}{\\gamma_k}\\|x_k - x_{k+1}\\|.$$\n\nSince $\\liminf \\gamma_k = \\underline{\\gamma} > 0$, we obtain $\\operatorname{dist}\\big(0, \\partial \\Phi(x_{k+1})\\big) \\to 0$. Owing to assumptions (A.1)-(A.2), $\\Phi$ is sub-differentially continuous and thus $\\Phi(x_k) \\to \\Phi(x^\\star)$. Altogether, this shows that the conditions of [13, Theorem 5.3] are fulfilled, whence the claim follows.\n\n2. Take $K > 0$ sufficiently large such that for all $k \\geq K$, $x_k \\in \\mathcal{M}_{x^\\star}$ and $x_k \\in B_\\epsilon(x^\\star)$.\n\n(i) Since $\\operatorname{prox}_{\\gamma_k J}$ is firmly non-expansive, hence non-expansive, we have\n\n$$\\|x_{k+1} - x^\\star\\| = \\|\\operatorname{prox}_{\\gamma_k J}(G_k x_k) - \\operatorname{prox}_{\\gamma_k J}(G_k x^\\star)\\| \\leq \\|G_k x_k - G_k x^\\star\\|. \\qquad (5.1)$$\n\nBy virtue of Lemma 5.1, we have $x_k - x^\\star = P_T(x_k - x^\\star) + o(\\|x_k - x^\\star\\|)$. 
This, together with local $C^2$ smoothness of $F$ and Lipschitz continuity of $\\nabla F$, entails\n\n$$\\langle x_k - x^\\star, \\nabla F(x_k) - \\nabla F(x^\\star) \\rangle = \\int_0^1 \\langle x_k - x^\\star, \\nabla^2 F(x^\\star + t(x_k - x^\\star))(x_k - x^\\star) \\rangle \\, dt$$\n$$= \\int_0^1 \\langle P_T(x_k - x^\\star), \\nabla^2 F(x^\\star + t(x_k - x^\\star)) P_T(x_k - x^\\star) \\rangle \\, dt + o\\big(\\|x_k - x^\\star\\|^2\\big)$$\n$$\\geq \\alpha \\|x_k - x^\\star\\|^2 + o\\big(\\|x_k - x^\\star\\|^2\\big). \\qquad (5.2)$$\n\nSince (3.2) holds and $\\nabla^2 F(x)$ depends continuously on $x$, there exists $\\epsilon > 0$ such that $P_T \\nabla^2 F(x) P_T \\succ \\alpha \\mathrm{Id}$ for all $x \\in B_\\epsilon(x^\\star)$. Thus, classical development of the right hand side of (5.1) yields\n\n$$\\|x_{k+1} - x^\\star\\|^2 \\leq \\|G_k x_k - G_k x^\\star\\|^2 = \\|(x_k - x^\\star) - \\gamma_k(\\nabla F(x_k) - \\nabla F(x^\\star))\\|^2$$\n$$= \\|x_k - x^\\star\\|^2 - 2\\gamma_k \\langle x_k - x^\\star, \\nabla F(x_k) - \\nabla F(x^\\star) \\rangle + \\gamma_k^2 \\|\\nabla F(x_k) - \\nabla F(x^\\star)\\|^2$$\n$$\\leq \\|x_k - x^\\star\\|^2 - 2\\gamma_k \\alpha \\|x_k - x^\\star\\|^2 + \\gamma_k^2 \\beta^2 \\|x_k - x^\\star\\|^2 + o\\big(\\|x_k - x^\\star\\|^2\\big)$$\n$$= \\big(1 - 2\\alpha\\gamma_k + \\beta^2\\gamma_k^2\\big)\\|x_k - x^\\star\\|^2 + o\\big(\\|x_k - x^\\star\\|^2\\big). \\qquad (5.3)$$\n\nTaking the lim sup in this inequality gives\n\n$$\\limsup_{k \\to +\\infty} \\|x_{k+1} - x^\\star\\|^2 / \\|x_k - x^\\star\\|^2 \\leq q(\\gamma_k) = 1 - 2\\alpha\\gamma_k + \\beta^2\\gamma_k^2. \\qquad (5.4)$$\n\nIt is clear that for $0 < \\underline{\\gamma} \\leq \\gamma \\leq \\bar{\\gamma} < \\min\\big(2\\alpha\\beta^{-2}, 2\\beta^{-1}\\big)$, $q(\\gamma) \\in [0, 1[$, and $q(\\gamma) \\leq \\tilde{\\rho}^2 = \\max\\big\\{q(\\underline{\\gamma}), q(\\bar{\\gamma})\\big\\}$. 
Inserting this in (5.4), and using classical arguments, yields the result.\n\n(ii) We give the proof for $\\mathcal{M} = T$; that for $\\mathcal{M} = x^\\star + T$ is similar. Since $x_k$ and $x^\\star$ belong to $T$, from $x_{k+1} = \\operatorname{prox}_{\\gamma_k J}(G_k x_k)$ we have\n\n$$G_k x_k - x_{k+1} \\in \\gamma_k \\partial J(x_{k+1}) \\;\\Rightarrow\\; x_{k+1} = P_T\\big(G_k x_k - \\gamma_k \\partial J(x_{k+1})\\big) = P_T G_k x_k - \\gamma_k e_{k+1}.$$\n\nSimilarly, we have $x^\\star = P_T G_k x^\\star - \\gamma_k e^\\star$. We then arrive at\n\n$$(x_{k+1} - x^\\star) + \\gamma_k(e_{k+1} - e^\\star) = (x_k - x^\\star) - \\gamma_k\\big(P_T \\nabla F(P_T x_k) - P_T \\nabla F(P_T x^\\star)\\big). \\qquad (5.5)$$\n\nMoreover, maximal monotonicity of $\\gamma_k \\partial J$ gives\n\n$$\\|(x_{k+1} - x^\\star) + \\gamma_k(e_{k+1} - e^\\star)\\|^2 = \\|x_{k+1} - x^\\star\\|^2 + 2\\langle x_{k+1} - x^\\star, \\gamma_k(e_{k+1} - e^\\star) \\rangle + \\gamma_k^2 \\|e_{k+1} - e^\\star\\|^2 \\geq \\|x_{k+1} - x^\\star\\|^2.$$\n\nIt is straightforward to see that now, (5.2) becomes\n\n$$\\langle x_k - x^\\star, P_T \\nabla F(P_T x_k) - P_T \\nabla F(P_T x^\\star) \\rangle \\geq \\alpha \\|x_k - x^\\star\\|^2.$$\n\nLet $\\nu$ be the Lipschitz constant of $P_T \\nabla F P_T$. Obviously $\\nu \\leq \\beta$. Developing $\\|P_T(G_k x_k - G_k x^\\star)\\|^2$ similarly to (5.3) we obtain\n\n$$\\|x_{k+1} - x^\\star\\|^2 \\leq \\big(1 - 2\\alpha\\gamma_k + \\nu^2\\gamma_k^2\\big)\\|x_k - x^\\star\\|^2 = \\rho_k^2 \\|x_k - x^\\star\\|^2,$$\n\nwhere $\\rho_k \\in [0, 1[$ for $0 < \\underline{\\gamma} \\leq \\gamma_k \\leq \\bar{\\gamma} < \\min\\big(2\\alpha/\\nu^2, 2/\\beta\\big)$. $\\rho_k$ is minimized at $\\gamma_k = \\alpha/\\nu^2$, which gives the proposed optimal rate whenever it obeys the given upper bound.\n\nProof of Theorem 3.3. 
Arguing similarly to the proof of Theorem 3.1(ii), and using in addition that $e^\\star = e_{x^\\star}$ is locally constant, we get\n\n$$x_{k+1} - x^\\star = (x_k - x^\\star) - \\gamma_k\\big(P_T \\nabla F(P_T x_k) - P_T \\nabla F(P_T x^\\star)\\big)$$\n$$= (x_k - x^\\star) - \\gamma_k \\int_0^1 P_T \\nabla^2 F(x^\\star + t(x_k - x^\\star)) P_T (x_k - x^\\star) \\, dt.$$\n\nDenote $H_t = P_T \\nabla^2 F(x^\\star + t(x_k - x^\\star)) P_T \\succeq 0$. Using that $H_t$ is self-adjoint, we have\n\n$$P_V x_{k+1} = P_V x_k.$$\n\nSince $x_k \\to x^\\star$, it follows that $P_V x_k = P_V x^\\star$ for all $k$ sufficiently large. Observing that $x_k - x^\\star = P_{V^\\perp}(x_k - x^\\star)$ for all large $k$, we get\n\n$$x_{k+1} - x^\\star = x_k - x^\\star - \\gamma_k \\int_0^1 P_{V^\\perp} H_t P_{V^\\perp}(x_k - x^\\star) \\, dt.$$\n\nObserve that $V^\\perp \\subset T$. By definition, $B_t = H_t^{1/2} P_{V^\\perp}$ is injective, and therefore there exists $\\sigma > 0$ such that $\\|B_t x\\|^2 > \\sigma \\|x\\|^2$ for all $x \\neq 0$ and $t \\in [0, 1]$. We then have\n\n$$\\|x_{k+1} - x^\\star\\|^2 = \\|x_k - x^\\star\\|^2 - 2\\gamma_k \\int_0^1 \\langle x_k - x^\\star, B_t^T B_t (x_k - x^\\star) \\rangle \\, dt + \\gamma_k^2 \\|P_{V^\\perp} P_T\\big(\\nabla F(x_k) - \\nabla F(x^\\star)\\big)\\|^2$$\n$$\\leq \\|x_k - x^\\star\\|^2 - 2\\gamma_k \\sigma \\|x_k - x^\\star\\|^2 + \\gamma_k^2 \\|P_T P_{V^\\perp}\\|^2 \\|\\nabla F(x_k) - \\nabla F(x^\\star)\\|^2$$\n$$\\leq \\|x_k - x^\\star\\|^2 - 2\\gamma_k \\sigma \\|x_k - x^\\star\\|^2 + \\gamma_k^2 \\beta^2 \\|P_{V^\\perp}\\|^2 \\|P_{V^\\perp}(x_k - x^\\star)\\|^2$$\n$$\\leq \\|x_k - x^\\star\\|^2 - 2\\gamma_k \\sigma \\|x_k - x^\\star\\|^2 + \\gamma_k^2 \\beta^2 \\|x_k - x^\\star\\|^2 = \\rho_k^2 \\|x_k - x^\\star\\|^2.$$\n\nIt is easy to see again that $\\rho_k \\in [0, 1[$ whenever $0 < \\underline{\\gamma} \\leq \\gamma_k \\leq \\bar{\\gamma} < \\min\\big(2\\beta^{-1}, 2\\sigma\\beta^{-2}\\big)$.\n\nReferences\n\n[1] A. Agarwal, S. Negahban, and M. 
Wainwright. Fast global convergence of gradient methods for high-dimensional statistical recovery. The Annals of Statistics, 40(5):2452\u20132482, 2012.\n\n[2] F. Bach. Consistency of trace norm minimization. The Journal of Machine Learning Research, 9:1019\u20131048, 2008.\n\n[3] K. Bredies and D. A. Lorenz. Linear convergence of iterative soft-thresholding. Journal of Fourier Analysis and Applications, 14(5-6):813\u2013837, 2008.\n\n[4] E. J. Cand\u00e8s and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717\u2013772, 2009.\n\n[5] E. J. Cand\u00e8s, T. Strohmer, and V. Voroninski. PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics, 66(8):1241\u20131274, 2013.\n\n[6] A. Chambolle and J. Darbon. A parametric maximum flow approach for discrete total variation regularization. In Image Processing and Analysis with Graphs. CRC Press, 2012.\n\n[7] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33\u201361, 1999.\n\n[8] P. L. Combettes and V. R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling & Simulation, 4(4):1168\u20131200, 2005.\n\n[9] A. Daniilidis, D. Drusvyatskiy, and A. S. Lewis. Orthogonal invariance and identifiability. To appear in SIAM J. Matrix Anal. Appl., 2014.\n\n[10] P. L. Davies and A. Kovac. Local extremes, runs, strings and multiresolution. Ann. Statist., 29:1\u201365, 2001.\n\n[11] E. Grave, G. Obozinski, and F. Bach. Trace lasso: a trace norm regularization for correlated designs. arXiv preprint arXiv:1109.1990, 2011.\n\n[12] E. Hale, W. Yin, and Y. Zhang. Fixed-point continuation for $\\ell_1$-minimization: Methodology and convergence. SIAM Journal on Optimization, 19(3):1107\u20131130, 2008.\n\n[13] W. L. Hare and A. S. 
Lewis. Identifying active constraints via partial smoothness and prox-regularity. Journal of Convex Analysis, 11(2):251\u2013266, 2004.\n\n[14] W. L. Hare and A. S. Lewis. Identifying active manifolds. Algorithmic Operations Research, 2(2):75\u201382, 2007.\n\n[15] W. L. Hare. Identifying active manifolds in regularization problems. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 261\u2013271. Springer, 2011.\n\n[16] A. S. Lewis. Active sets, nonsmoothness, and sensitivity. SIAM Journal on Optimization, 13(3):702\u2013725, 2003.\n\n[17] A. S. Lewis and J. Malick. Alternating projections on manifolds. Mathematics of Operations Research, 33(1):216\u2013234, 2008.\n\n[18] P. L. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16(6):964\u2013979, 1979.\n\n[19] S. A. Miller and J. Malick. Newton methods for nonsmooth convex minimization: connections among U-Lagrangian, Riemannian Newton and SQP methods. Mathematical Programming, 104(2-3):609\u2013633, 2005.\n\n[20] R. A. Poliquin, R. T. Rockafellar, and L. Thibault. Local differentiability of distance functions. Trans. Amer. Math. Soc., 352:5231\u20135249, 2000.\n\n[21] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471\u2013501, 2010.\n\n[22] L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1):259\u2013268, 1992.\n\n[23] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society. Series B. Methodological, 58(1):267\u2013288, 1996.\n\n[24] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91\u2013108, 2004.\n\n[25] P. 
Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Math. Prog. (Ser. B), 117, 2009.\n\n[26] S. Vaiter, M. Golbabaee, M. J. Fadili, and G. Peyr\u00e9. Model selection with low complexity priors. Available at arXiv:1304.6033, 2013.\n\n[27] S. Vaiter, G. Peyr\u00e9, and M. J. Fadili. Model consistency of partly smooth regularizers. Available at arXiv:1405.1004, 2014.\n\n[28] S. J. Wright. Identifiable surfaces in constrained optimization. SIAM Journal on Control and Optimization, 31(4):1063\u20131079, 1993.\n\n[29] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49\u201367, 2005.", "award": [], "sourceid": 1078, "authors": [{"given_name": "Jingwei", "family_name": "Liang", "institution": "ENSICAEN, Uni. de Caen"}, {"given_name": "Jalal", "family_name": "Fadili", "institution": "CNRS-ENSICAEN-Univ. Caen"}, {"given_name": "Gabriel", "family_name": "Peyr\u00e9", "institution": "Universit\u00e9 Paris Dauphine"}]}