{"title": "Learning step sizes for unfolded sparse coding", "book": "Advances in Neural Information Processing Systems", "page_first": 13100, "page_last": 13110, "abstract": "Sparse coding is typically solved by iterative optimization techniques, such as the Iterative Shrinkage-Thresholding Algorithm (ISTA). Unfolding and learning weights of ISTA using neural networks is a practical way to accelerate estimation. In this paper, we study the selection of adapted step sizes for ISTA. We show that a simple step size strategy can improve the convergence rate of ISTA by leveraging the sparsity of the iterates. However, it is impractical in most large-scale applications. Therefore, we propose a network architecture where only the step sizes of ISTA are learned. We demonstrate that for a large class of unfolded algorithms, if the algorithm converges to the solution of the Lasso, its last layers correspond to ISTA with learned step sizes. Experiments show that our method is competitive with state-of-the-art networks when the solutions are sparse enough.", "full_text": "Learning step sizes for unfolded sparse coding\n\nPierre Ablin\u2217 , Thomas Moreau\u2217, Mathurin Massias, Alexandre Gramfort\n\nInria - CEA\n\nUniversit\u00e9 Paris-Saclay\n\n{pierre.ablin,thomas.moreau,mathurin.massias,alexandre.gramfort}@inria.fr\n\nAbstract\n\nSparse coding is typically solved by iterative optimization techniques, such as\nthe Iterative Shrinkage-Thresholding Algorithm (ISTA). Unfolding and learning\nweights of ISTA using neural networks is a practical way to accelerate estimation.\nIn this paper, we study the selection of adapted step sizes for ISTA. We show\nthat a simple step size strategy can improve the convergence rate of ISTA by\nleveraging the sparsity of the iterates. However, it is impractical in most large-\nscale applications. Therefore, we propose a network architecture where only the\nstep sizes of ISTA are learned. 
We demonstrate that for a large class of unfolded\nalgorithms, if the algorithm converges to the solution of the Lasso, its last layers\ncorrespond to ISTA with learned step sizes. Experiments show that our method is\ncompetitive with state-of-the-art networks when the solutions are sparse enough.\n\n1 Introduction\n\nThe resolution of convex optimization problems by iterative algorithms has become a key part of\nmachine learning and signal processing pipelines, in particular with the Generalized Linear Models\nfor classi\ufb01cation [Nelder and Wedderburn, 1972]. Amongst these problems, special attention has\nbeen devoted to the Lasso [Tibshirani, 1996], due to the attractive sparsity properties of its solution\n(see Hastie et al. 2015 for an extensive review). For a given input x \u2208 Rn , a dictionary D \u2208 Rn\u00d7m\nand a regularization parameter \u03bb > 0 , the Lasso problem is\n\nz\u2217(x) \u2208 arg min z\u2208Rm Fx(z) with Fx(z) , (1/2)kx \u2212 Dzk2 + \u03bbkzk1 .\n\n(1)\n\nA variety of algorithms exist to solve Problem (1), e.g. proximal coordinate descent\n[Tseng, 2001, Friedman et al., 2007], Least Angle Regression [Efron et al., 2004] or proximal\nsplitting methods [Combettes and Bauschke, 2011]. The focus of this paper is on the Iterative\nShrinkage-Thresholding Algorithm (ISTA, Daubechies et al. 2004), which is a proximal-gradient\nmethod applied to Problem (1). ISTA starts from z(0) = 0 and iterates\n\nz(t+1) = ST(z(t) \u2212 (1/L) D\u22a4(Dz(t) \u2212 x), \u03bb/L) ,\n\n(2)\n\nwhere ST is the soft-thresholding operator de\ufb01ned as ST(x, u) , sign(x) max(|x| \u2212 u, 0) , and L\nis the greatest eigenvalue of D\u22a4D . In the general case, ISTA converges at rate 1/t , which can be\nimproved to the optimal rate 1/t2 [Nesterov, 1983]. 
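As a concrete companion to iteration (2) (not part of the original paper), the soft-thresholding operator and the ISTA loop can be sketched in a few lines of NumPy; the function names and problem sizes are our own illustrative choices:

```python
import numpy as np

def soft_thresh(x, u):
    # ST(x, u) = sign(x) * max(|x| - u, 0), applied entrywise
    return np.sign(x) * np.maximum(np.abs(x) - u, 0.0)

def lasso_cost(D, x, z, lam):
    # F_x(z) = 1/2 ||x - Dz||^2 + lam ||z||_1
    return 0.5 * np.sum((x - D @ z) ** 2) + lam * np.sum(np.abs(z))

def ista(D, x, lam, n_iter=100):
    # Iteration (2): constant step 1/L, where L is the greatest
    # eigenvalue of D^T D (the squared top singular value of D).
    L = np.linalg.norm(D, ord=2) ** 2
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = soft_thresh(z - D.T @ (D @ z - x) / L, lam / L)
    return z
```

Since each iteration is a majorization-minimization step, the cost Fx(z(t)) is non-increasing along the iterates.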
However, this optimality stands in the worst\npossible case, and linear rates are achievable in practice [Liang et al., 2014].\n\nA popular line of research to improve the speed of Lasso solvers is to try to identify the support of z\u2217 , in order to diminish the size of the optimization problem [El Ghaoui et al., 2012, Ndiaye et al., 2017, Johnson and Guestrin, 2015, Massias et al., 2018].\n\n\u2217Equal contribution\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fOnce the support is identi\ufb01ed, larger steps can also be taken, leading to improved rates for \ufb01rst-order algorithms [Liang et al., 2014, Poon et al., 2018, Sun et al., 2019].\n\nHowever, these techniques only consider the case where a single Lasso problem is solved. When\none wants to solve the Lasso for many samples {xi}, i = 1, . . . , N \u2013 e.g. in dictionary learning [Olshausen and\nField, 1997] \u2013 it is proposed by Gregor and Le Cun [2010] to learn a T-layer neural network of\nparameters \u0398 , \u03a6\u0398 : Rn \u2192 Rm , such that \u03a6\u0398(x) \u2243 z\u2217(x) . This Learned-ISTA (LISTA) algorithm\nyields better solution estimates than ISTA on new samples for the same number of iterations/layers.\nThis idea has led to a profusion of literature (summarized in Table A.1 in appendix), and is a popular\napproach to solve inverse problems. Recently, it has been hinted by Zhang and Ghanem [2018], Ito\net al. [2018], Liu et al. [2019] that only a few well-chosen parameters can be learned while retaining\nthe performance of LISTA.\n\nIn this article, we study strategies for LISTA where only step sizes are learned. In Section 3, we\npropose Oracle-ISTA, an analytic strategy to obtain larger step sizes in ISTA. We show that the\nproposed algorithm\u2019s convergence rate can be much better than that of ISTA. However, it requires\ncomputing a large number of Lipschitz constants, which is a burden in high dimension. 
This motivates\nthe introduction of Step-LISTA (SLISTA) networks in Section 4, where only a step size parameter is\nlearned per layer. As a theoretical justi\ufb01cation, we show in Theorem 4.4 that the last layers of any\ndeep LISTA network converging on the Lasso must correspond to ISTA iterations with learned step\nsizes. We validate the soundness of this approach with numerical experiments in Section 5.\n\n2 Notation and Framework\n\nNotation The \u21132 norm on Rn is k \u00b7 k . For p \u2208 [1,\u221e] , k \u00b7 kp is the \u2113p norm. The Frobenius matrix\nnorm is kMkF . The identity matrix of size m is Idm . ST is the soft-thresholding operator. Iterations\nare denoted z(t) . \u03bb > 0 is the regularization parameter. The Lasso cost function is Fx . \u03c8\u03b1(z, x) is\none iteration of ISTA with step \u03b1: \u03c8\u03b1(z, x) = ST(z \u2212 \u03b1D\u22a4(Dz \u2212 x), \u03b1\u03bb) . \u03c6\u03b8(z, x) is one iteration\nof a LISTA layer with parameters \u03b8 = (W, \u03b1, \u03b2): \u03c6\u03b8(z, x) = ST(z \u2212 \u03b1W \u22a4(Dz \u2212 x), \u03b2\u03bb) .\nThe set of integers between 1 and m is J1, mK . Given z \u2208 Rm , the support is supp(z) = {j \u2208 J1, mK : zj 6= 0} \u2282 J1, mK . For S \u2282 J1, mK , DS \u2208 Rn\u00d7|S| is the matrix containing the columns of\nD indexed by S . We denote by LS the greatest eigenvalue of D\u22a4S DS . The equicorrelation set is\nE = {j \u2208 J1, mK : |D\u22a4j (Dz\u2217 \u2212 x)| = \u03bb} . The equiregularization set is B\u221e = {x \u2208 Rn : kD\u22a4xk\u221e = 1} .\nNeural network parameters are written between brackets, e.g. \u0398 = {\u03b1(t), \u03b2(t)} for t = 0, . . . , T \u2212 1 . The sign function is\nsign(x) = 1 if x > 0 , \u22121 if x < 0 , and 0 if x = 0 .\nFramework This paragraph recalls some properties of the Lasso. Lemma 2.1 gives the \ufb01rst-order\noptimality conditions for the Lasso.\n\nLemma 2.1 (Optimality for the Lasso). 
The Karush-Kuhn-Tucker (KKT) conditions read\n\nz\u2217 \u2208 arg min Fx \u21d4 \u2200j \u2208 J1, mK, D\u22a4j (x \u2212 Dz\u2217) \u2208 \u03bb\u2202|z\u2217j| = {\u03bb sign(z\u2217j)} if z\u2217j 6= 0 , [\u2212\u03bb, \u03bb] if z\u2217j = 0 .\n\n(3)\n\nDe\ufb01ning \u03bbmax , kD\u22a4xk\u221e , it holds arg min Fx = {0} \u21d4 \u03bb \u2265 \u03bbmax . For some results in Section 3,\nwe will need the following assumption on the dictionary D:\nAssumption 2.2 (Uniqueness assumption). D is such that the solution of Problem (1) is unique for\nall \u03bb and x , i.e. arg min Fx = {z\u2217} .\nAssumption 2.2 may seem stringent since whenever m > n , Fx is not strictly convex. However, it\nwas shown in Tibshirani [2013, Lemma 4] \u2013 with earlier results from Rosset et al. 2004 \u2013 that if D is\nsampled from a continuous distribution, Assumption 2.2 holds for D with probability one.\nDe\ufb01nition 2.3 (Equicorrelation set). The KKT conditions motivate the introduction of the equicorrelation set E , {j \u2208 J1, mK : |D\u22a4j (Dz\u2217 \u2212 x)| = \u03bb} . Since j /\u2208 E =\u21d2 z\u2217j = 0 , E contains the\nsupport of any solution z\u2217 .\nWhen Assumption 2.2 holds, we have E = supp(z\u2217) [Tibshirani, 2013, Lemma 16].\n\n2\n\n\fWe consider samples x in the equiregularization set\n\nB\u221e , {x \u2208 Rn : kD\u22a4xk\u221e = 1} ,\n\n(4)\n\nwhich is the set of x such that \u03bbmax(x) = 1 . Therefore, when \u03bb \u2265 1 , the solution is z\u2217(x) = 0 for\nall x \u2208 B\u221e , and when \u03bb < 1 , z\u2217(x) 6= 0 for all x \u2208 B\u221e . For this reason, we assume 0 < \u03bb < 1 in\nthe following.\n\n3 Better step sizes for ISTA\n\nThe Lasso objective is the sum of an L-smooth function, (1/2)kx \u2212 D \u00b7 k2 , and a function with an explicit\nproximal operator, \u03bbk \u00b7 k1 . 
Proximal gradient descent for this problem, with the sequence of step\nsizes (\u03b1(t)) , consists in iterating\n\nz(t+1) = ST(z(t) \u2212 \u03b1(t)D\u22a4(Dz(t) \u2212 x), \u03bb\u03b1(t)) .\n\n(5)\n\nISTA follows these iterations with a constant step size \u03b1(t) = 1/L . In the following, denote\n\u03c8\u03b1(z, x) , ST(z \u2212 \u03b1D\u22a4(Dz \u2212 x), \u03b1\u03bb) . One iteration of ISTA can be cast as a majorization-minimization step [Beck and Teboulle, 2009]. Indeed, for all z \u2208 Rm ,\n\nFx(z) = (1/2)kx \u2212 Dz(t)k2 + (z \u2212 z(t))\u22a4D\u22a4(Dz(t) \u2212 x) + (1/2)kD(z \u2212 z(t))k2 + \u03bbkzk1\n\n(6)\n\n\u2264 (1/2)kx \u2212 Dz(t)k2 + (z \u2212 z(t))\u22a4D\u22a4(Dz(t) \u2212 x) + (L/2)kz \u2212 z(t)k2 + \u03bbkzk1 , Qx,L(z, z(t)) ,\n\n(7)\n\nwhere we have used the inequality (z \u2212 z(t))\u22a4D\u22a4D(z \u2212 z(t)) \u2264 Lkz \u2212 z(t)k2 . The minimizer of\nQx,L(\u00b7, z(t)) is \u03c81/L(z(t), x) , which is the next ISTA step.\nOracle-ISTA: an accelerated ISTA with larger step sizes Since the iterates are sparse, this\napproach can be re\ufb01ned. For S \u2282 J1, mK , let us de\ufb01ne the S-smoothness of D as\n\nLS , max z\u22a4D\u22a4Dz s.t. kzk = 1 and supp(z) \u2282 S ,\n\n(8)\n\nwith the convention L\u2205 = L . Note that LS is the greatest eigenvalue of D\u22a4S DS , where DS \u2208 Rn\u00d7|S|\nis the columns of D indexed by S . For all S , LS \u2264 L , since L is the solution of Equation (8)\nwithout support constraint. Assume supp(z(t)) \u2282 S . Combining Equations (6) and (8), we have\n\n\u2200z s.t. supp(z) \u2282 S, Fx(z) \u2264 Qx,LS (z, z(t)) .\n\n(9)\n\nThe minimizer of the r.h.s. is z = \u03c81/LS (z(t), x) . Furthermore, the r.h.s. is a tighter upper bound than\nthe one given in Equation (7) (see illustration in Figure 1). 
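As a quick numerical check of Equation (8) (our own illustration, not from the paper): LS is the squared top singular value of the sub-dictionary DS, so it never exceeds L and is typically much smaller for small supports:

```python
import numpy as np

def smoothness(D, S=None):
    # L_S from (8): greatest eigenvalue of D_S^T D_S, i.e. the
    # squared top singular value of the columns of D indexed by S.
    D_S = D if S is None else D[:, list(S)]
    return np.linalg.norm(D_S, ord=2) ** 2

rng = np.random.default_rng(0)
D = rng.standard_normal((10, 50))
L = smoothness(D)               # no support constraint: the usual L
L_S = smoothness(D, [0, 3, 7])  # restricted to a support of size 3
```

On the support S, the quadratic upper bound with curvature LS is tighter than the one with curvature L, which is what permits the larger step 1/LS.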
Therefore, using z(t+1) = \u03c81/LS (z(t), x)\nminimizes a tighter upper bound, provided that the following condition holds\n\nsupp(z(t+1)) \u2282 S .\n\n(\u22c6)\n\nFigure 1: Majorization illustration. If z(t) has support\nS , Qx,LS (\u00b7, z(t)) is a tighter upper bound of Fx than\nQx,L(\u00b7, z(t)) on the set of points of support S .\n\nOracle-ISTA (OISTA) is an accelerated version of ISTA which leverages the sparsity of the iterates\nin order to use larger step sizes. The method is summarized in Algorithm 1.\n\n3\n\n\fAlgorithm 1: Oracle-ISTA (OISTA) with larger step sizes\nInput: Dictionary D , target x , number of iterations T\nz(0) = 0\nfor t = 0, . . . , T \u2212 1 do\n\nCompute S = supp(z(t)) and LS using an oracle ;\nSet y(t+1) = \u03c81/LS (z(t), x) ;\nif Condition \u22c6 : supp(y(t+1)) \u2282 S then Set z(t+1) = y(t+1) ;\nelse Set z(t+1) = \u03c81/L(z(t), x) ;\n\nOutput: Sparse code z(T )\n\nOISTA computes y(t+1) = \u03c81/LS (z(t), x) , using the larger step size 1/LS , and checks if it satis\ufb01es the support\nCondition \u22c6 . When the condition is satis\ufb01ed, the step can be safely accepted. In particular, Equation (9)\nyields Fx(y(t+1)) \u2264 Fx(z(t)) . Otherwise, the algorithm falls back to the regular ISTA iteration\nwith the smaller step size. Hence, each iteration of the algorithm is guaranteed to decrease Fx . The\nfollowing proposition shows that OISTA converges in iterates, achieves \ufb01nite support identi\ufb01cation,\nand eventually reaches a safe regime where Condition \u22c6 is always true.\nProposition 3.1 (Convergence, \ufb01nite-time support identi\ufb01cation and safe regime). 
When Assump-\ntion 2.2 holds, the sequence (z(t)) generated by the algorithm converges to z\u2217 = arg min Fx .\nFurther, there exists an iteration T \u2217 such that for t \u2265 T \u2217 , supp(z(t)) = supp(z\u2217) , S\u2217 , and\nCondition \u22c6 is always satis\ufb01ed.\n\nSketch of proof (full proof in Subsection B.1). Using Zangwill\u2019s global convergence theorem [Zang-\nwill, 1969], we show that all accumulation points of (z(t)) are solutions of the Lasso. Since the solution\nis assumed unique, (z(t)) converges to z\u2217 . Then, we show that the algorithm achieves \ufb01nite-support\nidenti\ufb01cation with a technique inspired by Hale et al. [2008]. The algorithm gets arbitrarily close\nto z\u2217 , eventually with the same support. We \ufb01nally show that in a neighborhood of z\u2217 , the set of\npoints of support S\u2217 is stable by \u03c81/LS\u2217 (\u00b7, x) . The algorithm eventually reaches this region, and then\nCondition \u22c6 is true.\n\nIt follows that the algorithm enjoys the usual ISTA convergence results with L replaced by LS\u2217 .\nProposition 3.2 (Rates of convergence). For t > T \u2217 ,\n\nFx(z(t)) \u2212 Fx(z\u2217) \u2264 LS\u2217 kz\u2217 \u2212 z(T \u2217)k2 / (2(t \u2212 T \u2217)) .\n\nIf additionally inf kzk=1 kDS\u2217 zk2 = \u00b5\u2217 > 0 , then the convergence rate for t \u2265 T \u2217 is\n\nFx(z(t)) \u2212 Fx(z\u2217) \u2264 (1 \u2212 \u00b5\u2217/LS\u2217)t\u2212T \u2217 (Fx(z(T \u2217)) \u2212 Fx(z\u2217)) .\n\nSketch of proof (full proof in Subsection B.2). After iteration T \u2217 , OISTA is equivalent to ISTA applied on Fx restricted to vectors supported on S\u2217 . This function is LS\u2217-smooth, and \u00b5\u2217-strongly convex if \u00b5\u2217 > 0 .\nTherefore, the classical ISTA rates apply with an improved condition number.\n\nThese two rates are tighter than the usual ISTA rates \u2013 Lkz\u2217k2/(2t) in the convex case, and (1 \u2212 \u00b5/L)t (Fx(0) \u2212 Fx(z\u2217)) in the \u00b5-strongly convex case [Beck and Teboulle, 2009]. Finally, in the same way that ISTA\nconverges in one iteration when D is orthogonal (D\u22a4D = Idm), OISTA converges in one iteration if\nS\u2217 is identi\ufb01ed and DS\u2217 is orthogonal.\n\nProposition 3.3. Assume D\u22a4S\u2217 DS\u2217 = LS\u2217 Id|S\u2217| . Then, z(T \u2217+1) = z\u2217 .\n\nProof. For z s.t. supp(z) = S\u2217 , Fx(z) = Qx,LS\u2217 (z, z(T \u2217)) . Hence, the OISTA step minimizes\nFx .\n\nQuanti\ufb01cation of the rates improvement in a Gaussian setting The following proposition gives\nan asymptotic value for LS/L in a simple setting.\n\n4\n\n\fProposition 3.4. Assume that the entries of D \u2208 Rn\u00d7m are i.i.d. centered Gaussian variables with\nvariance 1 . Assume that S consists of k integers chosen uniformly at random in J1, mK . Assume that\nk, m, n \u2192 +\u221e with linear ratios m/n \u2192 \u03b3 and k/m \u2192 \u03b6 . Then\n\nLS/L \u2192 ((1 + \u221a\u03b6\u03b3) / (1 + \u221a\u03b3))2 .\n\n(10)\n\nThis is a direct application of the Marchenko-Pastur law [Marchenko and Pastur, 1967]. The law\nis illustrated on a toy dataset in Figure D.1. In Proposition 3.4, \u03b3 is the ratio between the number\nof atoms and the number of dimensions, and the average size of S is described by \u03b6 \u2264 1 . In an\novercomplete setting where \u03b3 \u226b 1 , this yields the approximation LS \u2243 \u03b6L in Equation (10). 
Therefore, if z\u2217 is very sparse (\u03b6 \u226a 1), the convergence rates of Proposition 3.2 are much\nbetter than those of ISTA.\n\nBacktracking Line Search A related strategy for \ufb01nding good step sizes is the use of backtracking\nline search (see for instance Nesterov 2013). The core idea is to compute iterate candidates for\nvarious step sizes and choose the one that gives the best cost decrease. This strategy is adaptive to\nthe actual state of the iterative procedure. However, it requires computing a new step size at each\niteration. At each iteration, backtracking considers step sizes of the form \u03b10\u03b2k for k \u2265 0 , where \u03b10 is an initial\nguess and \u03b2 < 1 is a shrinking factor. In practice, the hyperparameters \u03b10 and \u03b2 are critical and hard\nto tune. The need to search for a new step size at each iteration is the main difference with OISTA,\nwhich provides a \ufb01xed (though possibly intractable) rule to set the step size.\n\nExample Figure 2 compares OISTA, ISTA, FISTA, and backtracking ISTA on a toy problem.\nWe display two backtracking strategies, with different hyperparameters. We also compare this to a\ngreedy best step-size approach, where step sizes are chosen as \u03b1(t+1) = arg min\u03b1 Fx(\u03c8\u03b1(z(t), x)) .\nThe improved rate of convergence of OISTA over ISTA and FISTA is illustrated: one can indeed take\ngreater steps to increase the convergence speed. Further comparisons are displayed in Figure D.2 for\ndifferent regularization parameters \u03bb . While this demonstrates a faster rate of convergence, OISTA\nrequires computing several Lipschitz constants LS , which is cumbersome in high dimension. This\nmotivates the next section, where we propose to learn those steps.\n\nFigure 2: Convergence curves of OISTA,\nISTA, FISTA, backtracking ISTA and a\ngreedy best step-size strategy on a toy prob-\nlem with n = 10 , m = 50 , \u03bb = 0.5 . 
The\nbottom \ufb01gure displays the (normalized) steps\ntaken by OISTA and the best steps at each\niteration. Full experimental setup described\nin Appendix D.\n\n4 Learning unfolded algorithms\n\nNetwork architectures At each step, ISTA performs a linear operation to compute an update\nin the direction of the gradient D\u22a4(Dz(t) \u2212 x) and then an element-wise nonlinearity with the\nsoft-thresholding operator ST . The whole algorithm can be summarized as a recurrent neural network\n(RNN), presented in Figure 3a. Gregor and Le Cun [2010] introduced Learned-ISTA (LISTA), a\nneural network constructed by unfolding this RNN T times and learning the weights associated with each\nlayer. The unfolded network, presented in Figure 3b, iterates z(t+1) = ST(W (t)x x + W (t)z z(t), \u03bb\u03b2(t)) .\nIt outputs exactly the same vector as T iterations of ISTA when W (t)x = D\u22a4/L , W (t)z = Idm \u2212 D\u22a4D/L ,\nand \u03b2(t) = 1/L . Empirically, this network is able to output a better estimate of the sparse code solution\nwith fewer operations.\n\n5\n\n\f(a) ISTA - Recurrent Neural Network\n\n(b) LISTA - Unfolded network with T = 3\n\nFigure 3: Network architecture for ISTA (left) and LISTA (right).\n\nDue to the expression of the gradient, Chen et al. [2018] proposed to consider only a subclass of\nthe previous networks, where the weights Wx and Wz are coupled via Wz = Idm \u2212 W \u22a4x D . This is\nthe architecture we consider in the following. 
A layer of LISTA is a function \u03c6\u03b8 : Rm \u00d7 Rn \u2192 Rm\nparametrized by \u03b8 = (W, \u03b1, \u03b2) \u2208 Rn\u00d7m \u00d7 R+\u2217 \u00d7 R+\u2217 such that\n\n\u03c6\u03b8(z, x) = ST(z \u2212 \u03b1W \u22a4(Dz \u2212 x), \u03b2\u03bb) .\n\n(11)\n\nGiven a set of T layer parameters \u0398(T ) = {\u03b8(t)} for t = 0, . . . , T \u2212 1 , the LISTA network \u03a6\u0398(T ) : Rn \u2192 Rm is\n\u03a6\u0398(T ) (x) = z(T )(x) where z(t)(x) is de\ufb01ned by recursion\n\nz(0)(x) = 0, and z(t+1)(x) = \u03c6\u03b8(t) (z(t)(x), x) for t \u2208 J0, T \u2212 1K .\n\n(12)\n\nTaking W = D , \u03b1 = \u03b2 = 1/L yields the same outputs as T iterations of ISTA.\n\nTo alleviate the need to learn the large matrices W (t) , Liu et al. [2019] proposed to use a shared\nanalytic matrix WALISTA for all layers. The matrix is computed in a preprocessing stage by\n\nWALISTA = arg minW kW \u22a4Dk2F s.t. diag(W \u22a4D) = 1m .\n\n(13)\n\nThen, only the parameters (\u03b1(t), \u03b2(t)) are learned. This effectively reduces the number of parameters\nfrom (nm + 2) \u00d7 T to 2 \u00d7 T . However, we will see that ALISTA fails in our setup.\nStep-LISTA With regards to the study on step sizes for ISTA in Section 3, we propose to learn\napproximations of ISTA step sizes for the input distribution using the LISTA framework. The resulting\nnetwork, dubbed Step-LISTA (SLISTA), has T parameters \u0398SLISTA = {\u03b1(t)} for t = 0, . . . , T \u2212 1 , and follows the\niterations:\n\nz(t+1)(x) = ST(z(t)(x) \u2212 \u03b1(t)D\u22a4(Dz(t)(x) \u2212 x), \u03b1(t)\u03bb) .\n\n(14)\n\nThis is equivalent to a coupling in the LISTA parameters: a LISTA layer \u03b8 = (W, \u03b1, \u03b2) corresponds\nto a SLISTA layer if and only if (\u03b1/\u03b2) W = D . This network aims at learning good step sizes, like\nthe ones used in OISTA, without the computational burden of computing Lipschitz constants. 
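To make the parametrization concrete, here is a minimal NumPy sketch (ours; the authors' implementation uses pytorch) of the SLISTA forward pass (14); with all steps fixed to 1/L it reproduces T iterations of ISTA:

```python
import numpy as np

def soft_thresh(x, u):
    return np.sign(x) * np.maximum(np.abs(x) - u, 0.0)

def slista_forward(D, x, steps, lam):
    # SLISTA (14): one learned scalar step alpha^(t) per layer;
    # the dictionary D itself is used in every layer.
    z = np.zeros(D.shape[1])
    for alpha in steps:
        z = soft_thresh(z - alpha * D.T @ (D @ z - x), alpha * lam)
    return z

def lasso_cost(D, x, z, lam):
    return 0.5 * np.sum((x - D @ z) ** 2) + lam * np.sum(np.abs(z))
```

Learning then amounts to optimizing the T scalars in `steps`, instead of one n-by-m matrix and two scalars per layer as in full LISTA.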
The\nnumber of parameters compared to the classical LISTA architecture \u0398LISTA is greatly diminished,\nmaking the network easier to train. Learning curves are shown in Figure D.3 in the appendix.\n\nFigure 4 displays the learned steps of a SLISTA network on a toy example. The network learns larger\nstep sizes as the sparsity (and as a result, 1/LS) increases. It is interesting to note that the learned\nstep sizes tend to be larger than 1/LS but smaller than 2/LS . As step sizes in ]0, 2/LS[ guarantee\ndescent of the cost function, SLISTA learns step sizes that are adapted to solve the optimization\nproblem. Still, steps larger than 2/LS may be suitable depending on the geometry of the problem.\nFor instance, in Figure 2, the greedy best steps, which lead to the greatest decrease of the cost function,\nare larger than 2/LS .\n\nTraining the network We consider the framework where the network learns to solve the Lasso on\nB\u221e in an unsupervised way. Given a distribution p on B\u221e , the network is trained by solving\n\n\u02dc\u0398(T ) \u2208 arg min\u0398(T ) L(\u0398(T )) , Ex\u223cp[Fx(\u03a6\u0398(T ) (x))] .\n\n(15)\n\nMost of the literature on learned optimization trains the network with a different, supervised objective\n[Gregor and Le Cun, 2010, Xin et al., 2016, Chen et al., 2018, Liu et al., 2019]. Given a set of pairs\n(xi, zi) , the supervised approach tries to learn the parameters of the network such that \u03a6\u0398(xi) \u2243 zi ,\ne.g. by minimizing k\u03a6\u0398(xi) \u2212 zik2 . This training procedure differs critically from ours. For instance,\nISTA does not converge for the supervised problem in general while it does for the unsupervised\none. As Proposition 4.1 shows, the unsupervised approach allows learning to minimize the Lasso cost\nfunction Fx .\n\n6\n\n\fFigure 4: Steps learned with a 20-layer SLISTA network\non a 10 \u00d7 20 problem. 
For each layer t and each train-\ning sample x , we compute the support S(x, t) of z(t)(x) .\nThe brown (resp. green) curves display the quantiles of the\ndistribution of 1/LS(x,t) (resp. 2/LS(x,t)) for each layer\nt . Learned steps are mostly in ]0, 2/LS[ , which guarantees the decrease of the surrogate cost function. Full\nexperimental setup described in Appendix D.\n\nProposition 4.1 (Pointwise convergence). Let \u02dc\u0398(T ) be found by solving Problem (15).\nFor x \u2208 B\u221e such that p(x) > 0 , Fx(\u03a6 \u02dc\u0398(T ) (x)) \u2192 F \u2217x as T \u2192 +\u221e , almost everywhere.\n\nSketch of proof (full proof in Subsection C.1). Let \u0398(T )ISTA be the parameters corresponding to ISTA. For\nall T , we have Ex\u223cp[F \u2217x] \u2264 Ex\u223cp[Fx(\u03a6 \u02dc\u0398(T ) (x))] \u2264 Ex\u223cp[Fx(\u03a6\u0398(T )ISTA (x))] . Because ISTA converges, the right-hand term goes to Ex\u223cp[F \u2217x] . Hence, Ex\u223cp[Fx(\u03a6 \u02dc\u0398(T ) (x)) \u2212 F \u2217x] \u2192 0 . This implies\nalmost sure convergence of Fx(\u03a6 \u02dc\u0398(T ) (x)) \u2212 F \u2217x to 0 since it is non-negative.\n\nAsymptotical weight coupling theorem In this paragraph, we show the main result of this paper:\nany LISTA network minimizing Fx on B\u221e reduces to SLISTA in its deep layers (Theorem 4.4). It\nrelies on the following Lemmas.\nLemma 4.2 (Stability of solutions around Dj). Let D \u2208 Rn\u00d7m be a dictionary with non-duplicated\nunit-normed columns. Let c , maxl6=j |D\u22a4l Dj| < 1 . Then for all j \u2208 J1, mK and \u03b5 \u2208 Rn such that\nk\u03b5k < \u03bb(1 \u2212 c) and D\u22a4j \u03b5 = 0 , the vector (1 \u2212 \u03bb)ej minimizes Fx for x = Dj + \u03b5 .\nIt can be proven by verifying the KKT conditions (3) for (1 \u2212 \u03bb)ej , as detailed in Subsection C.2.\nLemma 4.3 (Weight coupling). Let D \u2208 Rn\u00d7m be a dictionary with non-duplicated unit-normed\ncolumns. 
Let \u03b8 = (W, \u03b1, \u03b2) be a set of parameters. Assume that all the couples (z\u2217(x), x) \u2208 Rm \u00d7 B\u221e\nsuch that z\u2217(x) \u2208 arg min Fx verify \u03c6\u03b8(z\u2217(x), x) = z\u2217(x) . Then, (\u03b1/\u03b2) W = D .\n\nSketch of proof (full proof in Subsection C.3). For j \u2208 J1, mK , consider x = Dj + \u03b5 , with\n\u03b5\u22a4Dj = 0 . For k\u03b5k small enough, x \u2208 B\u221e and \u03b5 veri\ufb01es the hypothesis of Lemma 4.2,\ntherefore z\u2217 = (1 \u2212 \u03bb)ej \u2208 arg min Fx . Writing \u03c6\u03b8(z\u2217, x) = z\u2217 for the j-th coordinate yields\n\u03b1W \u22a4j (\u03bbDj + \u03b5) = \u03bb\u03b2 . We can then verify that (\u03b1W \u22a4j \u2212 \u03b2D\u22a4j )(\u03bbDj + \u03b5) = 0 . This stands for\nany \u03b5 orthogonal to Dj and of norm small enough. Simple linear algebra shows that this implies\n\u03b1Wj \u2212 \u03b2Dj = 0 .\n\nLemma 4.3 states that the Lasso solutions are \ufb01xed points of a LISTA layer only if this layer\ncorresponds to a step size for ISTA. The following theorem extends the lemma by continuity, and\nshows that the deep layers of any converging LISTA network must tend toward a SLISTA layer.\nTheorem 4.4. Let D \u2208 Rn\u00d7m be a dictionary with non-duplicated unit-normed columns. Let\n\u0398(T ) = {\u03b8(t)}, t = 0, . . . , T , be the parameters of a sequence of LISTA networks such that the transfer function\nof layer t is z(t+1) = \u03c6\u03b8(t) (z(t), x) . Assume that\n\n(i) the sequence of parameters converges, i.e. \u03b8(t) \u2192 \u03b8\u2217 = (W \u2217, \u03b1\u2217, \u03b2\u2217) as t \u2192 \u221e ,\n\n(ii) the output of the network converges toward a solution z\u2217(x) of the Lasso (1) uniformly over\nthe equiregularization set B\u221e , i.e. supx\u2208B\u221e k\u03a6\u0398(T ) (x) \u2212 z\u2217(x)k \u2192 0 as T \u2192 \u221e .\n\nThen (\u03b1\u2217/\u03b2\u2217) W \u2217 = D .\n\nSketch of proof (full proof in Subsection C.4). 
Let \u03b5 > 0 and x \u2208 B\u221e . Using the triangle inequality, we have\n\nk\u03c6\u03b8\u2217 (z\u2217, x) \u2212 z\u2217k \u2264 k\u03c6\u03b8\u2217 (z\u2217, x) \u2212 \u03c6\u03b8(t) (z(t), x)k + k\u03c6\u03b8(t) (z(t), x) \u2212 z\u2217k .\n\n(16)\n\n7\n\n\fSince the z(t) and \u03b8(t) converge, they are valued over a compact set K . The function f : (z, x, \u03b8) 7\u2192\n\u03c6\u03b8(z, x) is continuous and piecewise-linear. It is therefore Lipschitz on K . Hence, we have k\u03c6\u03b8\u2217 (z\u2217, x) \u2212 \u03c6\u03b8(t) (z(t), x)k \u2264 \u03b5 for t large enough. Since \u03c6\u03b8(t) (z(t), x) = z(t+1) and z(t) \u2192 z\u2217 , k\u03c6\u03b8(t) (z(t), x) \u2212 z\u2217k \u2264 \u03b5 for t large enough. Finally, \u03c6\u03b8\u2217 (z\u2217, x) = z\u2217 . Lemma 4.3 allows us to conclude.\n\nTheorem 4.4 means that the deep layers of any LISTA network that converges to solutions of the\nLasso correspond to SLISTA iterations: W (t) aligns with D , and \u03b1(t), \u03b2(t) get coupled. This is\nillustrated in Figure 5, where a 40-layer LISTA network is trained on a 10 \u00d7 20 problem with\n\u03bb = 0.1 . As predicted by the theorem, \u03b1(t)W (t)/\u03b2(t) \u2192 D : the last layers only learn a step size. This\nis consistent with the observation of Moreau and Bruna [2017], which shows that the deep layers\nof LISTA stay close to ISTA. Further, Theorem 4.4 also shows that it is hopeless to optimize the\nunsupervised objective (15) with WALISTA (13), since this matrix is not aligned with D .\n\nFigure 5: Illustration of Theorem 4.4: for deep layers\nof LISTA, we have \u03b1(t)W (t)/\u03b2(t) \u2192 D , indicating\nthat the network ultimately only learns a step size. Full\nexperimental setup described in Appendix D.\n\n5 Numerical Experiments\n\nThis section provides numerical arguments to compare SLISTA to LISTA and ISTA. 
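As a side note (our own illustration, not from the paper), the coupling condition of Theorem 4.4 is straightforward to monitor numerically: the quantity plotted in Figure 5 is the Frobenius distance of a layer θ = (W, α, β) to the SLISTA constraint (α/β) W = D:

```python
import numpy as np

def coupling_gap(W, alpha, beta, D):
    # || alpha/beta * W - D ||_F : zero iff the LISTA layer
    # (W, alpha, beta) is exactly a SLISTA layer.
    return np.linalg.norm(alpha / beta * W - D)
```

For a SLISTA layer (W = D, α = β) the gap is exactly zero; Theorem 4.4 predicts that this gap vanishes in the deep layers of any converging LISTA network.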
All the experi-\nments were run using Python [Python Software Foundation, 2017] and pytorch [Paszke et al., 2017].\nThe code to reproduce the \ufb01gures is available online2.\n\nNetwork comparisons We compare the proposed approach SLISTA to the state-of-the-art learned\nmethods LISTA [Chen et al., 2018] and ALISTA [Liu et al., 2019] on synthetic and semi-real cases.\nIn the synthetic case, a dictionary D \u2208 Rn\u00d7m of Gaussian i.i.d. entries is generated. Each column is\nthen normalized to unit norm. A set of Gaussian i.i.d. samples (\u02dcxi), i = 1, . . . , N , in Rn is drawn. The input\nsamples are obtained as xi = \u02dcxi/kD\u22a4 \u02dcxik\u221e , so that xi \u2208 B\u221e for all i . We set m = 256\nand n = 64 .\n\nFor the semi-real case, we used the digits dataset from scikit-learn [Pedregosa et al., 2011], which\nconsists of 8 \u00d7 8 images of handwritten digits from 0 to 9 . We sample m = 256 images at random\nfrom this dataset and normalize them to generate our dictionary D . Compared to the simulated Gaussian\ndictionary, this dictionary has a much richer correlation structure, which is known to impair the\nperformance of learned algorithms [Moreau and Bruna, 2017]. The input distribution also consists\nof images from the digits dataset, normalized to lie in B\u221e .\nThe networks are trained by minimizing the empirical loss L (15) on a training set of size Ntrain =\n10, 000 and we report the loss on a test set of size Ntest = 10, 000 . Further details on training are in\nAppendix D.\n\nFigure 6 shows the test curves for different levels of regularization, \u03bb = 0.1 and 0.8 . SLISTA performs\nbest for high \u03bb , even for the challenging semi-real dictionary D . In a low regularization setting, LISTA\nperforms best, as SLISTA cannot learn much larger steps due to the low sparsity of the solution. 
In\nthis unsupervised setting, ALISTA does not converge, in accordance with Theorem 4.4.\n\n6 Conclusion\n\nWe showed that using larger step sizes is an ef\ufb01cient strategy to accelerate ISTA for sparse solutions\nof the Lasso. In order to make this approach practical, we proposed SLISTA, a neural network\narchitecture which learns such step sizes. Theorem 4.4 shows that the deepest layers of any converging\nLISTA architecture must converge to a SLISTA layer. Numerical experiments show that SLISTA\noutperforms LISTA in a high sparsity setting. A major bene\ufb01t of our approach is that it preserves\nthe dictionary. We plan on leveraging this property to apply SLISTA in convolutional or wavelet\ncases, where the structure of the dictionary allows for fast multiplications.\n\n2 The code can be found at https://github.com/tomMoral/adopty\n\n8\n\n\fFigure 6: Test loss of ISTA, ALISTA, LISTA and SLISTA on simulated and semi-real data for\ndifferent regularization parameters.\n\nAcknowledgements\n\nWe would like to thank the anonymous reviewers for their insightful comments which have improved\nthe quality of the paper. This project has received funding from the European Research Council\n(ERC) under the European Union\u2019s Horizon 2020 research and innovation program (Grant agreement\nNo. 676943).\n\nReferences\n\nJonas Adler, Axel Ringh, Ozan \u00d6ktem, and Johan Karlsson. Learning to solve inverse problems\nusing Wasserstein loss. preprint ArXiv, 1710.10898, 2017.\n\nAmir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse\nproblems. SIAM Journal on Imaging Sciences, 2(1):183\u2013202, 2009.\n\nMark Borgerding, Philip Schniter, and Sundeep Rangan. AMP-inspired deep networks for sparse\nlinear inverse problems. IEEE Transactions on Signal Processing, 65(16):4293\u20134308, 2017.\n\nXiaohan Chen, Jialin Liu, Zhangyang Wang, and Wotao Yin. 
Theoretical linear convergence of unfolded ISTA and its practical weights and thresholds. In Advances in Neural Information Processing Systems (NIPS), pages 9061–9071, 2018.

Patrick L. Combettes and Heinz H. Bauschke. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2011.

Ingrid Daubechies, Michel Defrise, and Christine De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413–1457, 2004.

Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. Ann. Statist., 32(2):407–499, 2004.

Laurent El Ghaoui, Vivian Viallon, and Tarek Rabbani. Safe feature elimination in sparse supervised learning. J. Pacific Optim., 8(4):667–698, 2012.

Jerome Friedman, Trevor Hastie, Holger Höfling, and Robert Tibshirani. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332, 2007.

Raja Giryes, Yonina C. Eldar, Alex M. Bronstein, and Guillermo Sapiro. Tradeoffs between convergence speed and reconstruction accuracy in inverse problems. IEEE Transactions on Signal Processing, 66(7):1676–1690, 2018.

Karol Gregor and Yann Le Cun. Learning Fast Approximations of Sparse Coding. In International Conference on Machine Learning (ICML), pages 399–406, 2010.

Elaine Hale, Wotao Yin, and Yin Zhang. Fixed-point continuation for ℓ1-minimization: Methodology and convergence. SIAM J.
Optim., 19(3):1107–1130, 2008.

Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, 2015.

John R. Hershey, Jonathan Le Roux, and Felix Weninger. Deep unfolding: Model-based inspiration of novel deep architectures. preprint ArXiv, 1409.2574, 2014.

Daisuke Ito, Satoshi Takabe, and Tadashi Wadayama. Trainable ISTA for sparse signal recovery. In IEEE International Conference on Communications Workshops, pages 1–6, 2018.

Tyler Johnson and Carlos Guestrin. Blitz: A principled meta-algorithm for scaling sparse optimization. In International Conference on Machine Learning (ICML), pages 1171–1179, 2015.

Jingwei Liang, Jalal Fadili, and Gabriel Peyré. Local linear convergence of forward–backward under partial smoothness. In Advances in Neural Information Processing Systems (NIPS), pages 1970–1978, 2014.

Jialin Liu, Xiaohan Chen, Zhangyang Wang, and Wotao Yin. ALISTA: Analytic weights are as good as learned weights in LISTA. In International Conference on Learning Representations (ICLR), 2019.

Vladimir A. Marchenko and Leonid Andreevich Pastur. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4):457, 1967.

Mathurin Massias, Alexandre Gramfort, and Joseph Salmon. Celer: a Fast Solver for the Lasso with Dual Extrapolation. In International Conference on Machine Learning (ICML), 2018.

Thomas Moreau and Joan Bruna. Understanding neural sparse coding with matrix factorization. In International Conference on Learning Representations (ICLR), 2017.

Eugene Ndiaye, Olivier Fercoq, Alexandre Gramfort, and Joseph Salmon. Gap safe screening rules for sparsity enforcing penalties. J. Mach. Learn. Res., 18(128):1–33, 2017.

J. A. Nelder and R. W. M. Wedderburn. Generalized Linear Models. Journal of the Royal Statistical Society. Series A (General), 135(3):370–384, 1972.
Yurii Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k^2). Soviet Math. Doklady, 269(3):543–547, 1983.

Yurii Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.

Bruno A. Olshausen and David J. Field. Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Clarice Poon, Jingwei Liang, and Carola-Bibiane Schönlieb. Local convergence properties of SAGA and prox-SVRG and acceleration. In International Conference on Machine Learning (ICML), 2018.

Python Software Foundation. Python Language Reference, version 3.6. http://python.org/, 2017.

Saharon Rosset, Ji Zhu, and Trevor Hastie. Boosting as a regularized path to a maximum margin classifier. J. Mach. Learn. Res., 5:941–973, 2004.

Pablo Sprechmann, Alex M. Bronstein, and Guillermo Sapiro. Learning efficient structured sparse models. In International Conference on Machine Learning (ICML), pages 615–622, 2012.

Pablo Sprechmann, Roee Litman, and TB Yakar. Efficient supervised sparse analysis and synthesis operators. In Advances in Neural Information Processing Systems (NIPS), pages 908–916, 2013.

Yifan Sun, Halyun Jeong, Julie Nutini, and Mark Schmidt. Are we there yet? Manifold identification of gradient-related proximal methods.
In Proceedings of Machine Learning Research, volume 89, pages 1110–1119. PMLR, 2019.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

Ryan Tibshirani. The lasso problem and uniqueness. Electron. J. Stat., 7:1456–1490, 2013.

Paul Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl., 109(3):475–494, 2001.

Zhangyang Wang, Qing Ling, and Thomas S. Huang. Learning deep ℓ0 encoders. In AAAI Conference on Artificial Intelligence, pages 2194–2200, 2015.

Bo Xin, Yizhou Wang, Wen Gao, and David Wipf. Maximal sparsity with deep networks? In Advances in Neural Information Processing Systems (NIPS), pages 4340–4348, 2016.

Yan Yang, Jian Sun, Huibin Li, and Zongben Xu. Deep ADMM-Net for compressive sensing MRI. In Advances in Neural Information Processing Systems (NIPS), pages 10–18, 2017.

Willard I. Zangwill. Convergence conditions for nonlinear programming algorithms. Management Science, 16(1):1–13, 1969.

Jian Zhang and Bernard Ghanem. ISTA-Net: Interpretable optimization-inspired deep network for image compressive sensing. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1828–1837, 2018.