{"title": "Accelerated Proximal Gradient Methods for Nonconvex Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 379, "page_last": 387, "abstract": "Nonconvex and nonsmooth problems have recently received considerable attention in signal/image processing, statistics and machine learning. However, solving the nonconvex and nonsmooth optimization problems remains a big challenge. Accelerated proximal gradient (APG) is an excellent method for convex programming. However, it is still unknown whether the usual APG can ensure the convergence to a critical point in nonconvex programming. To address this issue, we introduce a monitor-corrector step and extend APG for general nonconvex and nonsmooth programs. Accordingly, we propose a monotone APG and a non-monotone APG. The latter waives the requirement on monotonic reduction of the objective function and needs less computation in each iteration. To the best of our knowledge, we are the first to provide APG-type algorithms for general nonconvex and nonsmooth problems ensuring that every accumulation point is a critical point, and the convergence rates remain $O(1/k^2)$ when the problems are convex, in which k is the number of iterations. Numerical results testify to the advantage of our algorithms in speed.", "full_text": "Accelerated Proximal Gradient Methods for Nonconvex Programming\n\nHuan Li\n\nZhouchen Lin (corresponding author)\n\nKey Lab. of Machine Perception (MOE), School of EECS, Peking University, P. R. China\n\nCooperative Medianet Innovation Center, Shanghai Jiaotong University, P. R. China\n\nlihuanss@pku.edu.cn\n\nzlin@pku.edu.cn\n\nAbstract\n\nNonconvex and nonsmooth problems have recently received considerable attention in signal/image processing, statistics and machine learning. However, solving the nonconvex and nonsmooth optimization problems remains a big challenge. Accelerated proximal gradient (APG) is an excellent method for convex programming. 
However, it is still unknown whether the usual APG can ensure the convergence to a critical point in nonconvex programming. In this paper, we extend APG for general nonconvex and nonsmooth programs by introducing a monitor that satisfies the sufficient descent property. Accordingly, we propose a monotone APG and a nonmonotone APG. The latter waives the requirement on monotonic reduction of the objective function and needs less computation in each iteration. To the best of our knowledge, we are the first to provide APG-type algorithms for general nonconvex and nonsmooth problems ensuring that every accumulation point is a critical point, and the convergence rates remain O(1/k^2) when the problems are convex, in which k is the number of iterations. Numerical results testify to the advantage of our algorithms in speed.\n\n1 Introduction\n\nIn recent years, sparse and low rank learning has been a hot research topic and has led to a wide variety of applications in signal/image processing, statistics and machine learning. The l1-norm and nuclear norm, as the continuous and convex surrogates of the l0-norm and rank, respectively, have been used extensively in the literature; see, e.g., the recent collection [1]. Although the l1-norm and nuclear norm have achieved great success, in many cases they are suboptimal as they can promote sparsity and low-rankness only under very limited conditions [2, 3]. To address this issue, many nonconvex regularizers have been proposed, such as the lp-norm [4], Capped-l1 penalty [3], Log-Sum Penalty [2], Minimax Concave Penalty [5], Geman Penalty [6], Smoothly Clipped Absolute Deviation [7] and Schatten-p norm [8]. 
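To make the regularizers above concrete, here is a minimal sketch (our own Python illustration, not notation from the cited papers; the function names and the weight `lam` are assumptions) of how two of them are evaluated:

```python
import numpy as np

def capped_l1(w, lam, theta):
    # Capped-l1 penalty [3]: lam * sum_i min(|w_i|, theta).
    # Coefficients beyond magnitude theta incur a constant cost,
    # so large coefficients are not shrunk further (unlike the l1-norm).
    return lam * np.minimum(np.abs(w), theta).sum()

def lp_penalty(w, lam, p):
    # lp-"norm" penalty [4] with 0 < p <= 1; nonconvex for p < 1.
    return lam * np.sum(np.abs(w) ** p)
```

For w = (0.05, 2.0), lam = 1 and theta = 0.1, `capped_l1` charges 0.05 + 0.1 = 0.15 whereas the l1-norm would charge 2.05; this is the sense in which such penalties behave more like the l0-norm on large coefficients.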
This trend motivates a revived interest in the analysis and design of algorithms for solving nonconvex and nonsmooth problems, which can be formulated as\n\nmin_{x ∈ R^n} F(x) = f(x) + g(x),   (1)\n\nwhere f is differentiable (it can be nonconvex) and g can be both nonconvex and nonsmooth.\n\nAccelerated gradient methods have been at the heart of convex optimization research. In a series of celebrated works [9, 10, 11, 12, 13, 14], several accelerated gradient methods are proposed for problem (1) with convex f and g. In these methods, k iterations are sufficient to find a solution within O(1/k^2) error from the optimal objective value. Recently, Ghadimi and Lan [15] presented a unified treatment of accelerated gradient method (UAG) for convex, nonconvex and stochastic optimization. They proved that their algorithm converges^1 in nonconvex programming with nonconvex f but convex g and accelerates with an O(1/k^2) convergence rate in convex programming for problem (1). Convergence rate about the gradient mapping is also analyzed in [15].\n\nTable 1: Comparisons of GD (General Descent Method), iPiano, GIST, GDPA, IR, IFB, APG, UAG and our method for problem (1). The measurements include the assumption, whether the methods accelerate for convex programs (CP) and converge for nonconvex programs (NCP).\n\nMethod name | Assumption | Accelerate (CP) | Converge (NCP)\nGD [16, 17] | f + g: KL | No | Yes\niPiano [18] | nonconvex f, convex g | No | Yes\nGIST [19] | nonconvex f, g = g1 − g2, g1, g2 convex | No | Yes\nGDPA [20] | nonconvex f, g = g1 − g2, g1, g2 convex | No | Yes\nIR [8, 21] | special f and g | No | Yes\nIFB [22] | nonconvex f, nonconvex g | No | Yes\nAPG [12, 13] | convex f, convex g | Yes | Unclear\nUAG [15] | nonconvex f, convex g | Yes | Yes\nOurs | nonconvex f, nonconvex g | Yes | Yes\n\nAttouch et al. 
[16] proposed a unified framework to prove the convergence of a general class of descent methods using the Kurdyka-Łojasiewicz (KL) inequality for problem (1), and Frankel et al. [17] studied the convergence rates of general descent methods under the assumption that the desingularising function φ in the KL property has the form of (C/θ) t^θ. A typical example in their framework is the proximal gradient method. However, there is no literature showing that there exists an accelerated gradient method satisfying the conditions in their framework.\n\nOther typical methods for problem (1) include Inertial Forward-Backward (IFB) [22], iPiano [18], General Iterative Shrinkage and Thresholding (GIST) [19], Gradient Descent with Proximal Average (GDPA) [20] and Iteratively Reweighted algorithms (IR) [8, 21]. Table 1 demonstrates that the existing methods are not ideal. GD and IFB cannot accelerate the convergence for convex programs. GIST and GDPA require that g be explicitly written as a difference of two convex functions. iPiano demands the convexity of g, and IR is suitable only for some special cases of problem (1). APG can accelerate the convergence for convex programs; however, it is unclear whether APG converges to critical points for nonconvex programs. UAG can ensure the convergence for nonconvex programming; however, it requires g to be convex. This restricts the applications of UAG to solving nonconvexly regularized problems, such as sparse and low rank learning. To the best of our knowledge, extending the accelerated gradient method to general nonconvex and nonsmooth programs while keeping the O(1/k^2) convergence rate in the convex case remains an open problem.\n\nIn this paper we aim to extend Beck and Teboulle's APG [12, 13] to solve the general nonconvex and nonsmooth problem (1). 
APG first extrapolates a point yk by combining the current point and the previous point, then solves a proximal mapping problem. When extending APG to nonconvex programs the chief difficulty lies in the extrapolated point yk. We have little restriction on F(yk) when convexity is absent. In fact, F(yk) can be arbitrarily larger than F(xk) when yk is a bad extrapolation, especially when F is oscillatory. When xk+1 is computed by a proximal mapping at a bad yk, F(xk+1) may also be arbitrarily larger than F(xk). Beck and Teboulle's monotone APG [12] ensures F(xk+1) ≤ F(xk). However, this is not enough to ensure the convergence to critical points. To address this issue, we introduce a monitor satisfying the sufficient descent property to prevent a bad extrapolation of yk and then correct it by this monitor. In summary, our contributions include:\n\n1. We propose APG-type algorithms for general nonconvex and nonsmooth programs (1). We first extend Beck and Teboulle's monotone APG [12] by replacing their descent condition with a sufficient descent condition. This critical change ensures that every accumulation point is a critical point. Our monotone APG satisfies some modified conditions for the framework of [16, 17], and thus stronger results on convergence rate can be obtained under the KL assumption. Then we propose a nonmonotone APG, which allows for larger stepsizes when line search is used and reduces the average number of proximal mappings in each iteration. Thus it can further speed up the convergence in practice.\n\n1 Except for the work under the KL assumption, convergence for nonconvex problems in this paper and the references of this paper means that every accumulation point is a critical point.\n\n2. For our APGs, the convergence rates remain O(1/k^2) when the problems are convex. 
This\n\nresult is of great signi\ufb01cance when the objective function is locally convex in the neighbor-\nhoods of local minimizers even if it is globally nonconvex.\n\nk2\n\n2 Preliminaries\n\n2.1 Basic Assumptions\nNote that a function g : Rn \u2192 (\u2212\u221e, +\u221e] is said to be proper if dom g (cid:54)= \u2205, where dom g =\n{x \u2208 R : g(x) < +\u221e}. g is lower semicontinuous at point x0 if lim inf x\u2192x0 g(x) \u2265 g(x0). In\nproblem (1), we assume that f is a proper function with Lipschitz continuous gradients and g is\nproper and lower semicontinuous. We assume that F (x) is coercive, i.e., F is bounded from below\nand F (x) \u2192 \u221e when (cid:107)x(cid:107) \u2192 \u221e, where (cid:107) \u00b7 (cid:107) is the l2-norm.\n\n2.2 KL Inequality\nDe\ufb01nition 1. [23] A function f : Rn \u2192 (\u2212\u221e, +\u221e] is said to have the KL property at u \u2208\ndom\u2202f := {x \u2208 Rn : \u2202f (u) (cid:54)= \u2205} if there exists \u03b7 \u2208 (0, +\u221e], a neighborhood U of u and a\n\nfunction \u03d5 \u2208 \u03a6\u03b7, such that for all u \u2208 U(cid:84){u \u2208 Rn : f (u) < f (u) < f (u) + \u03b7}, the following\n\ninequality holds\n\n(2)\nwhere \u03a6\u03b7 stands for a class of function \u03d5 : [0, \u03b7) \u2192 R+ satisfying: (1) \u03d5 is concave and C 1 on\n(0, \u03b7); (2) \u03d5 is continuous at 0, \u03d5(0) = 0; and (3) \u03d5(cid:48)(x) > 0, \u2200x \u2208 (0, \u03b7).\n\n\u03d5(cid:48)(f (u) \u2212 f (u))dist(0, \u2202f (u)) > 1,\n\nAll semi-algebraic functions and subanalytic functions satisfy the KL property. Specially, the desin-\ngularising function \u03d5(t) of semi-algebraic functions can be chosen to be the form of C\n\u03b8 t\u03b8 with\n\u03b8 \u2208 (0, 1]. 
Typical semi-algebraic functions include real polynomial functions, ‖x‖_p with p ≥ 0, rank(X), and the indicator functions of the PSD cone, Stiefel manifolds and constant rank matrices [23].\n\n2.3 Review of APG in the Convex Case\n\nWe first review APG in the convex case. Beck and Teboulle [13] extended Nesterov's accelerated gradient method to the nonsmooth case. It is named the Accelerated Proximal Gradient (APG) method and consists of the following steps:\n\ny_k = x_k + ((t_{k−1} − 1)/t_k)(x_k − x_{k−1}),   (3)\n\nx_{k+1} = prox_{α_k g}(y_k − α_k ∇f(y_k)),   (4)\n\nt_{k+1} = (√(4 t_k^2 + 1) + 1)/2,   (5)\n\nwhere the proximal mapping is defined as prox_{αg}(x) = argmin_u g(u) + (1/(2α))‖x − u‖^2. APG is not a monotone algorithm, which means that F(x_{k+1}) may not be smaller than F(x_k). So Beck and Teboulle [12] further proposed a monotone APG, which consists of the following steps:\n\ny_k = x_k + (t_{k−1}/t_k)(z_k − x_k) + ((t_{k−1} − 1)/t_k)(x_k − x_{k−1}),   (6)\n\nz_{k+1} = prox_{α_k g}(y_k − α_k ∇f(y_k)),   (7)\n\nt_{k+1} = (√(4 t_k^2 + 1) + 1)/2,   (8)\n\nx_{k+1} = z_{k+1} if F(z_{k+1}) ≤ F(x_k), and x_{k+1} = x_k otherwise.   (9)\n\n3 APGs for Nonconvex Programs\n\nIn this section, we propose two APG-type algorithms for general nonconvex nonsmooth problems. We establish the convergence in the nonconvex case and the O(1/k^2) convergence rate in the convex case. When the KL property is satisfied we also provide stronger results on convergence rate.\n\n3.1 Monotone APG\n\nWe give two reasons for the difficulty of convergence analysis of the usual APG [12, 13] for nonconvex programs: (1) y_k may be a bad extrapolation; (2) in [12] only the descent property, F(x_{k+1}) ≤ F(x_k), is ensured. 
To address these issues, we need to monitor and correct y_k when it has the potential to fail, and the monitor should enjoy the property of sufficient descent, which is critical to ensure the convergence to a critical point. As is known, proximal gradient methods can ensure sufficient descent [16] (cf. (15)). So we use a proximal gradient step as the monitor. More specifically, our algorithm consists of the following steps:\n\ny_k = x_k + (t_{k−1}/t_k)(z_k − x_k) + ((t_{k−1} − 1)/t_k)(x_k − x_{k−1}),   (10)\n\nz_{k+1} = prox_{α_y g}(y_k − α_y ∇f(y_k)),   (11)\n\nv_{k+1} = prox_{α_x g}(x_k − α_x ∇f(x_k)),   (12)\n\nt_{k+1} = (√(4 t_k^2 + 1) + 1)/2,   (13)\n\nx_{k+1} = z_{k+1} if F(z_{k+1}) ≤ F(v_{k+1}), and x_{k+1} = v_{k+1} otherwise,   (14)\n\nwhere α_y and α_x can be fixed constants satisfying α_y < 1/L and α_x < 1/L, or dynamically computed by backtracking line search initialized by the Barzilai-Borwein rule^2. L is the Lipschitz constant of ∇f.\n\nOur algorithm is an extension of Beck and Teboulle's monotone APG [12]. The difference lies in the extra v, in the role of monitor, and the correction step of the x-update. In (9) F(z_{k+1}) is compared with F(x_k), while in (14) F(z_{k+1}) is compared with F(v_{k+1}). A further difference is that Beck and Teboulle's algorithm only ensures descent while our algorithm ensures sufficient descent, which means\n\nF(x_{k+1}) ≤ F(x_k) − δ‖v_{k+1} − x_k‖^2,   (15)\n\nwhere δ > 0 is a small constant. It is not difficult to understand that the descent property alone cannot ensure the convergence to a critical point in nonconvex programming. We present our convergence result in the following theorem^3.\n\nTheorem 1. Let f be a proper function with Lipschitz continuous gradients and g be proper and lower semicontinuous. 
For nonconvex f and nonconvex nonsmooth g, assume that F(x) is coercive. Then {x_k} and {v_k} generated by (10)-(14) are bounded. Let x* be any accumulation point of {x_k}; then 0 ∈ ∂F(x*), i.e., x* is a critical point.\n\nA remarkable aspect of our algorithm is that although we have made some modifications on Beck and Teboulle's algorithm, the O(1/k^2) convergence rate in the convex case still holds. Similar to Theorem 5.1 in [12], we have the following theorem on the accelerated convergence in the convex case.\n\nTheorem 2. For convex f and g, assume that ∇f is Lipschitz continuous. Let x* be any global optimum; then {x_k} generated by (10)-(14) satisfies\n\nF(x_{N+1}) − F(x*) ≤ (2/(α_y (N + 1)^2)) ‖x_0 − x*‖^2.   (16)\n\nWhen the objective function is locally convex in the neighborhood of local minimizers, Theorem 2 means that APG can ensure an O(1/k^2) convergence rate when approaching a local minimizer, thus accelerating the convergence.\n\nFor better reference, we summarize the proposed monotone APG algorithm in Algorithm 1.\n\nAlgorithm 1 Monotone APG\nInitialize z_1 = x_1 = x_0, t_1 = 1, t_0 = 0, α_y < 1/L, α_x < 1/L.\nfor k = 1, 2, 3, ... do\n    update y_k, z_{k+1}, v_{k+1}, t_{k+1} and x_{k+1} by (10)-(14).\nend for\n\n2 For the details of line search with Barzilai-Borwein initialization please see Supplementary Materials.\n\n3 The proofs in this paper can be found in Supplementary Materials.\n\n3.2 Convergence Rate under the KL Assumption\n\nThe KL property is a powerful tool and is studied by [16], [17] and [23] for a class of general descent methods. The usual APG in [12, 13] does not satisfy the sufficient descent property, which is crucial to the use of the KL property, and thus has no conclusions under the KL assumption. 
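To make the iteration concrete, Algorithm 1 can be sketched as follows. This is our own minimal Python sketch under stated assumptions: fixed stepsizes below 1/L are used instead of line search, and `F`, `grad_f` and `prox_g` are user-supplied callables, not part of the paper's notation.

```python
import numpy as np

def monotone_apg(x0, F, grad_f, prox_g, L, n_iters=100):
    """Sketch of Algorithm 1 (monotone APG), updates (10)-(14).

    prox_g(u, alpha) should return argmin_x g(x) + ||x - u||^2 / (2*alpha).
    """
    alpha_y = alpha_x = 0.99 / L          # fixed stepsizes < 1/L
    x_prev = x = z = x0
    t_prev, t = 0.0, 1.0
    for _ in range(n_iters):
        # (10): extrapolation combining z_k and the momentum term
        y = x + (t_prev / t) * (z - x) + ((t_prev - 1.0) / t) * (x - x_prev)
        # (11): proximal step at the extrapolated point
        z = prox_g(y - alpha_y * grad_f(y), alpha_y)
        # (12): monitor -- a plain proximal gradient step at x_k
        v = prox_g(x - alpha_x * grad_f(x), alpha_x)
        # (13): momentum parameter update
        t_prev, t = t, (np.sqrt(4.0 * t * t + 1.0) + 1.0) / 2.0
        # (14): correction -- keep the candidate with the smaller objective
        x_prev, x = x, z if F(z) <= F(v) else v
    return x
```

With g = 0 the proximal mapping is the identity and the sketch reduces to an accelerated gradient method whose monitor is a plain gradient step.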
On the other hand, due to the intermediate variables y_k, v_k and z_k, our algorithm is more complex than the general descent methods and also does not satisfy the conditions therein. However, due to the monitor-corrector steps (12) and (14), some modified conditions^4 can be satisfied and we can still get some exciting results under the KL assumption. Within the same framework as [17], we have the following theorem.\n\nTheorem 3. Let f be a proper function with Lipschitz continuous gradients and g be proper and lower semicontinuous. For nonconvex f and nonconvex nonsmooth g, assume that F(x) is coercive. If we further assume that f and g satisfy the KL property and the desingularising function has the form of φ(t) = (C/θ) t^θ for some C > 0, θ ∈ (0, 1], then\n\n1. If θ = 1, then there exists k_1 such that F(x_k) = F* for all k > k_1 and the algorithm terminates in finite steps.\n\n2. If θ ∈ [1/2, 1), then there exists k_2 such that for all k > k_2,\n\nF(x_k) − F* ≤ (d_1 C^2 / (1 + d_1 C^2))^{k − k_2} r_{k_2}.   (17)\n\n3. If θ ∈ (0, 1/2), then there exists k_3 such that for all k > k_3,\n\nF(x_k) − F* ≤ (C / ((k − k_3) d_2 (1 − 2θ)))^{1/(1 − 2θ)},   (18)\n\nwhere F* is the same function value at all the accumulation points of {x_k}, r_k = F(v_k) − F*, d_1 = ((1/(2α_x) + L/2) / (1/(2α_x) − L/2))^2 and d_2 = min{1/(2 d_1 C), (C/(1 − 2θ)) (2^{(2θ−1)/(2θ−2)} − 1) r_0^{2θ−1}}.\n\nWhen F(x) is a semi-algebraic function, the desingularising function φ(t) can be chosen to be of the form (C/θ) t^θ with θ ∈ (0, 1] [23]. In this case, as shown in Theorem 3, our algorithm converges in finite iterations when θ = 1, converges with a linear rate when θ ∈ [1/2, 1), and with a sublinear rate (at least O(1/k)) when θ ∈ (0, 1/2) for the gap F(x_k) − F*. This is the same as the results mentioned in [17], although our algorithm does not satisfy the conditions therein.\n\n3.3 Nonmonotone APG\n\nAlgorithm 1 is a monotone algorithm. When the problem is ill-conditioned, a monotone algorithm has to creep along the bottom of a narrow curved valley so that the objective function value does not increase, resulting in short stepsizes or even zigzagging and hence slow convergence [24]. Removing the requirement on monotonicity can improve convergence speed because larger stepsizes can be adopted when line search is used.\n\nOn the other hand, in Algorithm 1 we need to compute z_{k+1} and v_{k+1} in each iteration and use v_{k+1} to monitor and correct z_{k+1}. This is a conservative strategy. In fact, we can accept z_{k+1} as x_{k+1} directly if it satisfies some criterion showing that y_k is a good extrapolation. Then v_{k+1} is computed only when this criterion is not met. Thus, we can reduce the average number of proximal mappings, and accordingly the computation cost, in each iteration. So in this subsection we propose a nonmonotone APG to speed up convergence.\n\n4 For the details of the difference please see Supplementary Materials.\n\nIn monotone APG, (15) is ensured. In nonmonotone APG, we allow x_{k+1} to take a larger objective function value than F(x_k). Specifically, we allow x_{k+1} to yield an objective function value smaller than c_k, a relaxation of F(x_k). c_k should not be too far from F(x_k), so the average of F(x_k), F(x_{k−1}), ..., F(x_1) is a good choice. 
Thus we follow [24] to define c_k as a convex combination of F(x_k), F(x_{k−1}), ..., F(x_1) with exponentially decreasing weights:\n\nc_k = (Σ_{j=1}^{k} η^{k−j} F(x_j)) / (Σ_{j=1}^{k} η^{k−j}),   (19)\n\nwhere η ∈ [0, 1) controls the degree of nonmonotonicity. In practice c_k can be efficiently computed by the following recursion:\n\nq_{k+1} = η q_k + 1,   (20)\n\nc_{k+1} = (η q_k c_k + F(x_{k+1})) / q_{k+1},   (21)\n\nwhere q_1 = 1 and c_1 = F(x_1).\n\nAccording to (14), we can split (15) into two parts by the different choices of x_{k+1}. Accordingly, in nonmonotone APG we consider the following two conditions to replace (15):\n\nF(z_{k+1}) ≤ c_k − δ‖z_{k+1} − y_k‖^2,   (22)\n\nF(v_{k+1}) ≤ c_k − δ‖v_{k+1} − x_k‖^2.   (23)\n\nWe choose (22) as the criterion mentioned before. When (22) holds, we deem that y_k is a good extrapolation and accept z_{k+1} directly. Then we do not compute v_{k+1} in this case. However, (22) does not hold all the time. When it fails, we deem that y_k may not be a good extrapolation. In this case, we compute v_{k+1} by (12) satisfying (23), and then monitor and correct z_{k+1} by (14). (23) is ensured when α_x ≤ 1/L. When backtracking line search is used, such a v_{k+1} satisfying (23) can be found in finitely many steps^5.\n\nCombining (20), (21), (22) and x_{k+1} = z_{k+1}, we have\n\nc_{k+1} ≤ c_k − δ‖x_{k+1} − y_k‖^2 / q_{k+1}.   (24)\n\nSimilarly, replacing (22) and x_{k+1} = z_{k+1} by (23) and x_{k+1} = v_{k+1}, respectively, we have\n\nc_{k+1} ≤ c_k − δ‖x_{k+1} − x_k‖^2 / q_{k+1}.   (25)\n\nThis means that we replace the sufficient descent condition on F(x_k) in (15) by the sufficient descent of c_k.\n\nWe summarize the nonmonotone APG in Algorithm 2^6. 
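The equivalence between the definition (19) and the recursion (20)-(21) is easy to verify numerically; the following small sketch (ours, with hypothetical variable names) puts both forms side by side:

```python
def c_direct(F_hist, eta):
    # c_k from definition (19): exponentially weighted average of F(x_1..x_k).
    k = len(F_hist)
    num = sum(eta ** (k - j) * F_hist[j - 1] for j in range(1, k + 1))
    den = sum(eta ** (k - j) for j in range(1, k + 1))
    return num / den

def c_recursive(F_hist, eta):
    # Same quantity via the recursion (20)-(21):
    #   q_{k+1} = eta*q_k + 1,  c_{k+1} = (eta*q_k*c_k + F(x_{k+1})) / q_{k+1},
    # with q_1 = 1 and c_1 = F(x_1).
    q, c = 1.0, F_hist[0]
    for Fx in F_hist[1:]:
        q_next = eta * q + 1.0
        c = (eta * q * c + Fx) / q_next
        q = q_next
    return c
```

Both functions return the same weighted average, but the recursion needs only O(1) memory and work per iteration, which is why it is the form used in practice.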
Similar to monotone APG, nonmonotone APG also enjoys the convergence property in the nonconvex case and the O(1/k^2) convergence rate in the convex case. We present our convergence result in Theorem 4. Theorem 2 still holds for Algorithm 2 with no modification, so we omit it here.\n\nDefine Ω_1 = {k_1, k_2, ..., k_j, ...} and Ω_2 = {m_1, m_2, ..., m_j, ...}, such that in Algorithm 2, (22) holds and x_{k+1} = z_{k+1} is executed for all k = k_j ∈ Ω_1, while for all k = m_j ∈ Ω_2, (22) does not hold and (14) is executed. Then we have Ω_1 ∩ Ω_2 = ∅, Ω_1 ∪ Ω_2 = {1, 2, 3, ...}, and the following theorem holds.\n\nTheorem 4. Let f be a proper function with Lipschitz continuous gradients and g be proper and lower semicontinuous. For nonconvex f and nonconvex nonsmooth g, assume that F(x) is coercive. Then {x_k}, {v_k} and {y_{k_j}} with k_j ∈ Ω_1 generated by Algorithm 2 are bounded, and\n\n1. 
if Ω_1 or Ω_2 is finite, then for any accumulation point x* of {x_k}, we have 0 ∈ ∂F(x*);\n\n2. if Ω_1 and Ω_2 are both infinite, then for any accumulation point x* of {x_{k_j+1}}, y* of {y_{k_j}} with k_j ∈ Ω_1, and any accumulation point v* of {v_{m_j+1}}, x* of {x_{m_j}} with m_j ∈ Ω_2, we have 0 ∈ ∂F(x*), 0 ∈ ∂F(y*) and 0 ∈ ∂F(v*).\n\n5 See Lemma 2 in Supplementary Materials.\n\n6 Please see Supplementary Materials for nonmonotone APG with line search.\n\nAlgorithm 2 Nonmonotone APG\nInitialize z_1 = x_1 = x_0, t_1 = 1, t_0 = 0, η ∈ [0, 1), δ > 0, c_1 = F(x_1), q_1 = 1, α_x < 1/L, α_y < 1/L.\nfor k = 1, 2, 3, ... do\n    y_k = x_k + (t_{k−1}/t_k)(z_k − x_k) + ((t_{k−1} − 1)/t_k)(x_k − x_{k−1}),\n    z_{k+1} = prox_{α_y g}(y_k − α_y ∇f(y_k)).\n    if F(z_{k+1}) ≤ c_k − δ‖z_{k+1} − y_k‖^2 then\n        x_{k+1} = z_{k+1}.\n    else\n        v_{k+1} = prox_{α_x g}(x_k − α_x ∇f(x_k)),\n        x_{k+1} = z_{k+1} if F(z_{k+1}) ≤ F(v_{k+1}), and x_{k+1} = v_{k+1} otherwise.\n    end if\n    t_{k+1} = (√(4 t_k^2 + 1) + 1)/2,\n    q_{k+1} = η q_k + 1,\n    c_{k+1} = (η q_k c_k + F(x_{k+1})) / q_{k+1}.\nend for\n\n4 Numerical Results\n\nIn this section, we test the performance of our algorithms on the problem of Sparse Logistic Regression (LR)^7. Sparse LR is an attractive extension of LR as it can reduce overfitting and perform feature selection simultaneously. Sparse LR is widely used in areas such as bioinformatics [25] and text categorization [26]. In this section, we follow Gong et al. 
[19] to consider Sparse LR with a nonconvex regularizer:\n\nmin_w (1/n) Σ_{i=1}^{n} log(1 + exp(−y_i x_i^T w)) + r(w).   (26)\n\nWe choose r(w) as the capped-l1 penalty [3], defined as\n\nr(w) = λ Σ_{i=1}^{d} min(|w_i|, θ),  θ > 0.   (27)\n\nWe compare monotone APG (mAPG) and nonmonotone APG (nmAPG) with monotone GIST^8 (mGIST), nonmonotone GIST (nmGIST) [19] and IFB [22]. We test the performance on the real-sim data set^9, which contains 72309 samples of 20958 dimensions. We follow [19] to set λ = 0.0001, θ = 0.1λ and the starting point as zero vectors. In nmAPG we set η = 0.8. In IFB the inertial parameter β is set at 0.01 and the Lipschitz constant is computed by backtracking. To make a fair comparison, we first run mGIST. The algorithm is terminated when the relative change of two consecutive objective function values is less than 10^{−5} or the number of iterations exceeds 1000. This termination condition is the same as in [19]. Then we run nmGIST, mAPG, nmAPG and IFB. These four algorithms are terminated when they achieve an equal or smaller objective function value than that by mGIST or the number of iterations exceeds 1000. We randomly choose 90% of the data as training data and the rest as test data. The experiment result is averaged over 10 runs. All algorithms are run on Matlab 2011a and Windows 7 with an Intel Core i3 2.53 GHz CPU and 4GB memory. The result is reported in Table 2. We also plot the curves of objective function values vs. iteration number and CPU time in Figure 1.\n\n7 For the sake of space limitation we leave another experiment, Sparse PCA, in Supplementary Materials.\n\n8 http://www.public.asu.edu/ yje02/Software/GIST\n\n9 http://www.csie.ntu.tw/cjlin/libsvmtools/datasets\n\nTable 2: Comparisons of APG, GIST and IFB on the sparse logistic regression problem. 
The quantities include number of iterations, averaged number of line searches in each iteration, computing time (in seconds) and test error. They are averaged over 10 runs.\n\nMethod | #Iter. | #Line search | Time | Test error\nmGIST | 994 | 2.19 | 300.42 | 2.94%\nnmGIST | 806 | 1.69 | 222.22 | 2.94%\nIFB | 635 | 2.59 | 215.82 | 2.96%\nmAPG | 175 | 2.99 | 133.23 | 2.93%\nnmAPG | 146 | 1.01 | 42.99 | 2.97%\n\nWe have the following observations: (1) APG-type methods need much fewer iterations and less computing time than GIST and IFB to reach the same (or smaller) objective function values. As GIST is indeed a Proximal Gradient method (PG) and IFB is an extension of PG, this verifies that APG can indeed accelerate the convergence in practice. (2) nmAPG is faster than mAPG. We give two reasons: nmAPG avoids the computation of v_k most of the time and reduces the number of line searches in each iteration. We mention that in mAPG line search is performed in both (11) and (12), while in nmAPG only the computation of z_{k+1} needs a line search in every iteration; v_{k+1} is computed only when necessary. We note that the average number of line searches in nmAPG is nearly one. This means that (22) holds most of the time, so we can trust that z_k works well most of the time and only occasionally is v_k computed to correct z_k and y_k. On the other hand, nonmonotonicity allows for larger stepsizes, which results in fewer line searches.\n\nFigure 1: Comparison of the objective function values produced by APG, GIST and IFB. (a) Objective function value vs. iteration. (b) Objective function value vs. CPU time.\n\n5 Conclusions\n\nIn this paper, we propose two APG-type algorithms for efficiently solving general nonconvex nonsmooth problems, which are abundant in machine learning. 
We provide a detailed convergence analysis, showing that every accumulation point is a critical point for general nonconvex nonsmooth programs and that the convergence rate is maintained at O(1/k^2) for convex programs. Nonmonotone APG allows for larger stepsizes and needs less computation in each iteration, and thus is faster than monotone APG in practice. Numerical experiments testify to the advantage of the two algorithms.\n\nAcknowledgments\n\nZhouchen Lin is supported by National Basic Research Program of China (973 Program) (grant no. 2015CB352502), National Natural Science Foundation (NSF) of China (grant nos. 61272341 and 61231002), and Microsoft Research Asia Collaborative Research Program. He is the corresponding author.\n\nReferences\n\n[1] Y. Fu, editor. Low-rank and sparse modeling for visual analysis. Springer, 2014.\n\n[2] E.J. Candes, M.B. Wakin, and S.P. Boyd. Enhancing sparsity by reweighted l1 minimization. Journal of Fourier Analysis and Applications, 14(5):877-905, 2008.\n\n[3] T. Zhang. Analysis of multi-stage convex relaxation for sparse regularization. The Journal of Machine Learning Research, 11:1081-1107, 2010.\n\n[4] S. Foucart and M.J. Lai. Sparsest solutions of underdetermined linear systems via lq minimization for 0 < q ≤ 1. Applied and Computational Harmonic Analysis, 26(3):395-407, 2009.\n\n[5] C.H. Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894-942, 2010.\n\n[6] D. Geman and C. Yang. Nonlinear image recovery with half-quadratic regularization. 
IEEE Transactions on Image Processing, 4(7):932-946, 1995.\n\n[7] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348-1360, 2001.\n\n[8] K. Mohan and M. Fazel. Iterative reweighted algorithms for matrix rank minimization. The Journal of Machine Learning Research, 13(1):3441-3473, 2012.\n\n[9] Y.E. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Soviet Mathematics Doklady, 27(2):372-376, 1983.\n\n[10] Y.E. Nesterov. Smooth minimization of nonsmooth functions. Mathematical Programming, 103(1):127-152, 2005.\n\n[11] Y.E. Nesterov. Gradient methods for minimizing composite objective functions. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, 2007.\n\n[12] A. Beck and M. Teboulle. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Transactions on Image Processing, 18(11):2419-2434, 2009.\n\n[13] A. Beck and M. Teboulle. A fast iterative shrinkage thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2(1):183-202, 2009.\n\n[14] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. Technical report, University of Washington, Seattle, 2008.\n\n[15] S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. arXiv preprint arXiv:1310.3787, 2013.\n\n[16] H. Attouch, J. Bolte, and B.F. Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Mathematical Programming, 137:91-129, 2013.\n\n[17] P. Frankel, G. Garrigos, and J. Peypouquet. 
Splitting methods with variable metric for Kurdyka-Łojasiewicz functions and general convergence rates. Journal of Optimization Theory and Applications, 165:874-900, 2014.\n\n[18] P. Ochs, Y. Chen, T. Brox, and T. Pock. iPiano: Inertial proximal algorithms for nonconvex optimization. SIAM J. Imaging Sciences, 7(2):1388-1419, 2014.\n\n[19] P. Gong, C. Zhang, Z. Lu, J. Huang, and J. Ye. A general iterative shrinkage and thresholding algorithm for nonconvex regularized optimization problems. In ICML, pages 37-45, 2013.\n\n[20] W. Zhong and J. Kwok. Gradient descent with proximal average for nonconvex and composite regularization. In AAAI, 2014.\n\n[21] P. Ochs, A. Dosovitskiy, T. Brox, and T. Pock. On iteratively reweighted algorithms for non-smooth non-convex optimization in computer vision. SIAM J. Imaging Sciences, 2014.\n\n[22] R.I. Bot, E.R. Csetnek, and S. László. An inertial forward-backward algorithm for the minimization of the sum of two nonconvex functions. Preprint, 2014.\n\n[23] J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1-2):459-494, 2014.\n\n[24] H. Zhang and W.W. Hager. A nonmonotone line search technique and its application to unconstrained optimization. SIAM J. Optimization, 14:1043-1056, 2004.\n\n[25] S.K. Shevade and S.S. Keerthi. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17):2246-2253, 2003.\n\n[26] A. Genkin, D.D. Lewis, and D. Madigan. Large-scale Bayesian logistic regression for text categorization. Technometrics, 49(14):291-304, 2007.", "award": [], "sourceid": 266, "authors": [{"given_name": "Huan", "family_name": "Li", "institution": "Peking University"}, {"given_name": "Zhouchen", "family_name": "Lin", "institution": "Peking University"}]}