{"title": "HONOR: Hybrid Optimization for NOn-convex Regularized problems", "book": "Advances in Neural Information Processing Systems", "page_first": 415, "page_last": 423, "abstract": "Recent years have witnessed the superiority of non-convex sparse learning formulations over their convex counterparts in both theory and practice. However, due to the non-convexity and non-smoothness of the regularizer, how to efficiently solve the non-convex optimization problem for large-scale data is still quite challenging. In this paper, we propose an efficient \\underline{H}ybrid \\underline{O}ptimization algorithm for \\underline{NO}n convex \\underline{R}egularized problems (HONOR). Specifically, we develop a hybrid scheme which effectively integrates a Quasi-Newton (QN) step and a Gradient Descent (GD) step. Our contributions are as follows: (1) HONOR incorporates the second-order information to greatly speed up the convergence, while it avoids solving a regularized quadratic programming and only involves matrix-vector multiplications without explicitly forming the inverse Hessian matrix. (2) We establish a rigorous convergence analysis for HONOR, which shows that convergence is guaranteed even for non-convex problems, while it is typically challenging to analyze the convergence for non-convex problems. (3) We conduct empirical studies on large-scale data sets and results demonstrate that HONOR converges significantly faster than state-of-the-art algorithms.", "full_text": "HONOR: Hybrid Optimization for NOn-convex\n\nRegularized problems\n\nPinghua Gong\n\nJieping Ye\n\nUniveristy of Michigan, Ann Arbor, MI 48109\n\nUniveristy of Michigan, Ann Arbor, MI 48109\n\ngongp@umich.edu\n\njpye@umich.edu\n\nAbstract\n\nRecent years have witnessed the superiority of non-convex sparse learning formu-\nlations over their convex counterparts in both theory and practice. 
However, due to the non-convexity and non-smoothness of the regularizer, efficiently solving the non-convex optimization problem for large-scale data remains quite challenging. In this paper, we propose an efficient Hybrid Optimization algorithm for NOn-convex Regularized problems (HONOR). Specifically, we develop a hybrid scheme which effectively integrates a Quasi-Newton (QN) step and a Gradient Descent (GD) step. Our contributions are as follows: (1) HONOR incorporates second-order information to greatly speed up the convergence, while it avoids solving a regularized quadratic program and only involves matrix-vector multiplications without explicitly forming the inverse Hessian matrix. (2) We establish a rigorous convergence analysis for HONOR, which shows that convergence is guaranteed even for non-convex problems, although such analyses are typically challenging in the non-convex setting. (3) We conduct empirical studies on large-scale data sets, and the results demonstrate that HONOR converges significantly faster than state-of-the-art algorithms.

1 Introduction

Sparse learning with convex regularization has been successfully applied to a wide range of applications including marker gene identification [19], face recognition [22], image restoration [2], text corpora understanding [9] and radar imaging [20]. However, it has been shown recently that many convex sparse learning formulations are inferior to their non-convex counterparts in both theory and practice [27, 12, 23, 25, 16, 26, 24, 11]. Popular non-convex sparsity-inducing penalties include the Smoothly Clipped Absolute Deviation (SCAD) [10], the Log-Sum Penalty (LSP) [6] and the Minimax Concave Penalty (MCP) [23]. 
Although non-convex sparse learning has shown its advantage over the convex approach, it remains challenging to develop an efficient algorithm for the resulting non-convex optimization problem, especially for large-scale data.

DC programming [21] is a popular approach for solving non-convex problems whose objective functions can be expressed as the difference of two convex functions. However, a potentially non-trivial convex subproblem must be solved at each iteration, which is not practical for large-scale problems. SparseNet [16] can solve a least squares problem with a non-convex penalty. At each step, SparseNet solves a univariate subproblem with a non-convex penalty which admits a closed-form solution. However, to establish the convergence analysis, the parameter of the non-convex penalty must be restricted to an interval on which the univariate subproblem (with a non-convex penalty) is convex. Moreover, it is quite challenging to extend SparseNet to non-convex problems with a non-least-squares loss, as the univariate subproblem then generally does not admit a closed-form solution. The GIST algorithm [14] can solve a class of non-convex regularized problems by iteratively solving a possibly non-convex proximal operator problem, which in turn admits a closed-form solution. However, GIST does not fully exploit second-order information. 
The DC-PN algorithm [18] can incorporate second-order information to solve non-convex regularized problems, but it requires solving a non-trivial regularized quadratic subproblem at each iteration.

In this paper, we propose an efficient Hybrid Optimization algorithm for NOn-convex Regularized problems (HONOR), which incorporates second-order information to speed up the convergence. HONOR adopts a hybrid optimization scheme which chooses either a Quasi-Newton (QN) step or a Gradient Descent (GD) step per iteration, mainly depending on whether the iterate has very small components. If an iterate does not have any small component, the QN-step is adopted, which uses L-BFGS to exploit the second-order information. The key advantage of the QN-step is that it does not need to solve a regularized quadratic program and only involves matrix-vector multiplications without explicitly forming the inverse Hessian matrix. If an iterate has small components, we switch to a GD-step. Our detailed theoretical analysis sheds light on the effect of such a hybrid scheme on the convergence of the algorithm. Specifically, we provide a rigorous convergence analysis for HONOR, which shows that every limit point of the sequence generated by HONOR is a Clarke critical point. It is worth noting that the convergence analysis for a non-convex problem is typically much more challenging than for a convex one, because many important properties of convex problems may not hold for non-convex problems. 
Empirical studies are also conducted on large-scale data sets which include up to millions of samples and features; the results demonstrate that HONOR converges significantly faster than state-of-the-art algorithms.

2 Non-convex Sparse Learning

We focus on the following non-convex regularized optimization problem:

min_{x ∈ R^n} {f(x) = l(x) + r(x)},  (1)

where we make the following assumptions throughout the paper:

(A1) l(x) is coercive, continuously differentiable and ∇l(x) is Lipschitz continuous with constant L. Moreover, l(x) > −∞ for all x ∈ R^n.

(A2) r(x) = Σ_{i=1}^n ρ(|x_i|), where ρ(t) is non-decreasing, continuously differentiable and concave with respect to t in [0, ∞); ρ(0) = 0 and ρ′(0) ≠ 0, with ρ′(t) = ∂ρ(t)/∂t denoting the derivative of ρ(t) at the point t.

Remark 1 Assumption (A1) allows l(x) to be non-convex. Assumption (A2) implies that ρ(|x_i|) is generally non-convex with respect to x_i; the only convex case is ρ(|x_i|) = λ|x_i| with λ > 0. Moreover, ρ(|x_i|) is continuously differentiable with respect to x_i in (−∞, 0) ∪ (0, ∞) and non-differentiable at x_i = 0. In particular, ∂ρ(|x_i|)/∂x_i = σ(x_i)ρ′(|x_i|) for any x_i ≠ 0, where σ(x_i) = 1 if x_i > 0; σ(x_i) = −1 if x_i < 0; and σ(x_i) = 0 otherwise. In addition, ρ′(0) > 0 must hold (otherwise ρ′(0) < 0 would imply ρ(t) ≤ ρ(0) + ρ′(0)t < 0 for any t > 0, contradicting the fact that ρ(t) is non-decreasing). It is also easy to show that, under the assumptions above, both l(x) and r(x) are locally Lipschitz continuous. Thus, the Clarke subdifferential [7] is well-defined.

The commonly used least squares loss and the logistic regression loss satisfy assumption (A1); we can add a small term δ‖x‖² to make them coercive. 
The following popular non-convex regularizers satisfy assumption (A2), where λ > 0 and θ > 0, except that θ > 2 for SCAD.

• LSP: ρ(|x_i|) = λ log(1 + |x_i|/θ).

• SCAD: ρ(|x_i|) = λ|x_i| if |x_i| ≤ λ; (−x_i² + 2θλ|x_i| − λ²)/(2(θ − 1)) if λ < |x_i| ≤ θλ; (θ + 1)λ²/2 if |x_i| > θλ.

• MCP: ρ(|x_i|) = λ|x_i| − x_i²/(2θ) if |x_i| ≤ θλ; θλ²/2 if |x_i| > θλ.

Due to the non-convexity and non-differentiability of problem (1), the traditional subdifferential concept from convex optimization is not applicable here. Thus, we use the Clarke subdifferential [7] to characterize the optimality of problem (1). We say x̄ is a Clarke critical point of problem (1) if 0 ∈ ∂°f(x̄), where ∂°f(x̄) is the Clarke subdifferential of f(x) at x = x̄. To be self-contained, we briefly review the Clarke subdifferential: for a locally Lipschitz continuous function f(x), the Clarke generalized directional derivative of f(x) at x = x̄ along the direction d is defined as

f°(x̄; d) = lim sup_{x → x̄, α ↓ 0} [f(x + αd) − f(x)] / α.

Then, the Clarke subdifferential of f(x) at x = x̄ is defined as

∂°f(x̄) = {δ ∈ R^n : f°(x̄; d) ≥ d^T δ, ∀d ∈ R^n}.

Interested readers may refer to Proposition 4 in Supplement A for more properties of the Clarke subdifferential. 
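For concreteness, the three penalties above can be evaluated with a short NumPy sketch (a minimal illustration; the vectorized formulation and function names are ours, not part of the paper's implementation):

```python
import numpy as np

def lsp(x, lam, theta):
    """Log-Sum Penalty: sum_i lam * log(1 + |x_i| / theta)."""
    return np.sum(lam * np.log1p(np.abs(x) / theta))

def scad(x, lam, theta):
    """SCAD penalty (requires theta > 2), applied elementwise and summed."""
    a = np.abs(x)
    small = lam * a                                            # |x_i| <= lam
    mid = (-a**2 + 2 * theta * lam * a - lam**2) / (2 * (theta - 1))  # lam < |x_i| <= theta*lam
    large = (theta + 1) * lam**2 / 2                           # |x_i| > theta*lam
    return np.sum(np.where(a <= lam, small, np.where(a <= theta * lam, mid, large)))

def mcp(x, lam, theta):
    """MCP penalty, applied elementwise and summed."""
    a = np.abs(x)
    return np.sum(np.where(a <= theta * lam, lam * a - a**2 / (2 * theta), theta * lam**2 / 2))
```

Each branch mirrors the corresponding piecewise definition; a quick sanity check is that SCAD is continuous at |x_i| = λ and |x_i| = θλ, and MCP is continuous at |x_i| = θλ, so adjacent branches must agree at those points.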
We want to emphasize that some basic properties of the subdifferential of a convex function may not hold for the Clarke subdifferential of a non-convex function.

3 Proposed Optimization Algorithm: HONOR

Since each decomposable component function of the regularizer is non-differentiable only at the origin, the objective function is differentiable as long as the segment between any two consecutive iterates does not cross any axis. This motivates us to design an algorithm which keeps the current iterate in the same orthant as the previous iterate. Before we present the detailed HONOR algorithm, we introduce two functions as follows:

Define a function π : R^n → R^n with the i-th entry being:

π_i(x_i; y_i) = x_i if σ(x_i) = σ(y_i); 0 otherwise,

where y ∈ R^n (y_i is the i-th entry of y) is the parameter of the function π, and σ(·) is the sign function defined as follows: σ(x_i) = 1 if x_i > 0; σ(x_i) = −1 if x_i < 0; and σ(x_i) = 0 otherwise.

Define the pseudo-gradient ⋄f(x) whose i-th entry is given by:

⋄_i f(x) =
  ∇_i l(x) + ρ′(|x_i|),  if x_i > 0,
  ∇_i l(x) − ρ′(|x_i|),  if x_i < 0,
  ∇_i l(x) + ρ′(0),      if x_i = 0, ∇_i l(x) + ρ′(0) < 0,
  ∇_i l(x) − ρ′(0),      if x_i = 0, ∇_i l(x) − ρ′(0) > 0,
  0,                     otherwise,

where ρ′(t) is the derivative of ρ(t) at the point t.

Remark 2 If r(x) is convex, ⋄f(x) is the minimum-norm sub-gradient of f(x) at x, and thus −⋄f(x) is a descent direction. However, ⋄f(x) is not even a sub-gradient of f(x) if r(x) is non-convex. This indicates that some obvious concepts and properties for a convex problem may not hold in the non-convex case. 
Thus, it is significantly more challenging to develop and analyze algorithms for a non-convex problem.

Interestingly, we can still show that v^k = −⋄f(x^k) is a descent direction at the point x^k (refer to Supplement D and replace p^k = π(d^k; v^k) with v^k). To utilize the second-order information, we may perform the optimization along the direction d^k = H^k v^k, where H^k is a positive definite matrix containing the second-order information. However, d^k is not necessarily a descent direction. To address this issue, we use the following slightly modified direction p^k:

p^k = π(d^k; v^k).

We can show that p^k is a descent direction (the proof is provided in Supplement D). Thus, we can perform the optimization along the direction p^k. Recall that we need to keep the current iterate in the same orthant as the previous iterate. So the following iterative scheme is proposed:

x^k(α) = π(x^k + αp^k; ξ^k),  (2)

where

ξ^k_i = σ(x^k_i) if x^k_i ≠ 0; σ(v^k_i) if x^k_i = 0,  (3)

and α is a step size chosen by the following line search procedure: for constants α0 > 0, β, γ ∈ (0, 1) and m = 0, 1, · · ·, find the smallest integer m with α = α0 β^m such that the following inequality holds:

f(x^k(α)) ≤ f(x^k) − γα(v^k)^T d^k.  (4)

However, using only the above iterative scheme may not guarantee convergence. The main challenge is: if there exists a subsequence K such that {x^k_i}_K converges to zero, it is possible that for sufficiently large k ∈ K, |x^k_i| is arbitrarily small but never equal to zero (refer to the proof of Theorem 1 for more details). To address this issue, we propose a hybrid optimization scheme. 
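The pseudo-gradient, the alignment p^k = π(d^k; v^k), the orthant constraint ξ^k and the backtracking rule of Eq. (4) fit together as in the following minimal sketch (the function names are ours; `H_mv` stands in for the L-BFGS matrix-vector product H^k v^k, which we do not implement here — passing the identity map makes this a first-order method):

```python
import numpy as np

def pseudo_gradient(x, grad_l, rho_prime):
    """Pseudo-gradient of f = l + r; rho_prime(t) is the penalty derivative rho'(t)."""
    v = np.zeros_like(x)
    pos, neg, zero = x > 0, x < 0, x == 0
    v[pos] = grad_l[pos] + rho_prime(np.abs(x[pos]))
    v[neg] = grad_l[neg] - rho_prime(np.abs(x[neg]))
    left = grad_l + rho_prime(0.0)      # candidate value when x_i = 0
    right = grad_l - rho_prime(0.0)
    v[zero & (left < 0)] = left[zero & (left < 0)]
    v[zero & (right > 0)] = right[zero & (right > 0)]
    return v

def align(d, v):
    """p = pi(d; v): zero out entries of d whose sign disagrees with v."""
    return np.where(np.sign(d) == np.sign(v), d, 0.0)

def qn_step(x, f, grad_l, rho_prime, H_mv,
            alpha0=1.0, beta=0.5, gamma=1e-5, max_tries=50):
    """One QN-step with orthant projection and the backtracking rule of Eq. (4)."""
    v = -pseudo_gradient(x, grad_l, rho_prime)
    d = H_mv(v)                                   # d^k = H^k v^k
    p = align(d, v)                               # p^k = pi(d^k; v^k)
    xi = np.where(x != 0, np.sign(x), np.sign(v))  # orthant indicator xi^k
    fx, alpha = f(x), alpha0
    for _ in range(max_tries):
        z = x + alpha * p
        x_new = np.where(np.sign(z) == xi, z, 0.0)  # pi(x + alpha*p; xi)
        if f(x_new) <= fx - gamma * alpha * v.dot(d):  # Eq. (4)
            return x_new
        alpha *= beta
    return x  # line search exhausted: x is (near-)critical
```

Note how the projection π may zero out coordinates that would cross an axis, so the new iterate stays in the orthant encoded by ξ^k.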
Specifically, for a small constant ϵ > 0, if I^k = {i ∈ {1, · · · , n} : 0 < |x^k_i| ≤ min(‖v^k‖, ϵ), x^k_i v^k_i < 0} is not empty, we switch the iteration to the following gradient descent step (GD-step):

x^k(α) = argmin_x {∇l(x^k)^T (x − x^k) + (1/(2α))‖x − x^k‖² + r(x)},

where α is a step size chosen by the following line search procedure: for constants α0 > 0, β, γ ∈ (0, 1) and m = 0, 1, · · ·, find the smallest integer m with α = α0 β^m such that the following inequality holds:

f(x^k(α)) ≤ f(x^k) − (γ/(2α))‖x^k(α) − x^k‖².  (5)

The detailed steps of the algorithm are presented in Algorithm 1.

Remark 3 Algorithm 1 is similar to the OWL-QN-type algorithms in [1, 3, 4, 17, 13]. However, HONOR is significantly different from them: (1) The OWL-QN-type algorithms can only handle ℓ1-regularized convex problems, while HONOR is applicable to a class of non-convex problems beyond ℓ1-regularized ones. (2) The convergence analyses of the OWL-QN-type algorithms rely heavily on the convexity of the ℓ1-regularized problem. 
In contrast, the convergence analysis for HONOR applies to non-convex cases beyond the convex ones, which is a non-trivial extension.

Algorithm 1: HONOR: Hybrid Optimization for NOn-convex Regularized problems

1 Initialize x^0, H^0 and choose β, γ ∈ (0, 1), ϵ > 0, α0 > 0;
2 for k = 0 to maxiter do
3   Compute v^k ← −⋄f(x^k) and I^k = {i ∈ {1, · · · , n} : 0 < |x^k_i| ≤ ϵ^k, x^k_i v^k_i < 0}, where ϵ^k = min(‖v^k‖, ϵ);
4   Initialize α ← α0;
5   if I^k = ∅ then
6     (QN-step)
7     Compute d^k ← H^k v^k with a positive definite matrix H^k using L-BFGS;
8     Alignment: p^k ← π(d^k; v^k);
9     Compute x^k(α) ← π(x^k + αp^k; ξ^k);
10    while Eq. (4) is not satisfied do
11      α ← αβ; x^k(α) ← π(x^k + αp^k; ξ^k);
12    end
13  else
14    (GD-step)
15    Compute x^k(α) ← argmin_x {∇l(x^k)^T (x − x^k) + (1/(2α))‖x − x^k‖² + r(x)};
16    while Eq. (5) is not satisfied do
17      α ← αβ;
18      x^k(α) ← argmin_x {∇l(x^k)^T (x − x^k) + (1/(2α))‖x − x^k‖² + r(x)};
19    end
20  end
21  x^{k+1} ← x^k(α);
22  if some stopping criterion is satisfied then
23    stop and return x^{k+1};
24  end
25 end

4 Convergence Analysis

We first present a few basic propositions and then provide the convergence theorem based on them; all proofs of the presented propositions are carefully handled due to the lack of convexity. First of all, an optimality condition is presented (the proof is provided in Supplement B), which will be used directly in the proof of Theorem 1.

Proposition 1 Let x̄ = lim_{k∈K, k→∞} x^k, v^k = −⋄f(x^k) and v̄ = −⋄f(x̄), where K is a subsequence of {1, 2, · · · , k, k + 1, · · · }. 
If lim inf_{k∈K, k→∞} |v^k_i| = 0 for all i ∈ {1, · · · , n}, then v̄ = 0 and x̄ is a Clarke critical point of problem (1).

We subsequently show that a Lipschitz-continuous-like inequality holds in the following proposition (the proof is provided in Supplement C), which is crucial for proving the final convergence theorem.

Proposition 2 Let v^k = −⋄f(x^k), x^k(α) = π(x^k + αp^k; ξ^k) and q^k_α = (1/α)(π(x^k + αp^k; ξ^k) − x^k) with α > 0. Then under assumptions (A1) and (A2), we have

(i) ∇l(x^k)^T (x^k(α) − x^k) + r(x^k(α)) − r(x^k) ≤ −(v^k)^T (x^k(α) − x^k),  (6)

(ii) f(x^k(α)) ≤ f(x^k) − α(v^k)^T q^k_α + (α²L/2)‖q^k_α‖².  (7)

We next show that both line search criteria, in the QN-step [Eq. (4)] and the GD-step [Eq. (5)], are satisfied at any iteration k in a finite number of trials (the proof is provided in Supplement D).

Proposition 3 At any iteration k of the HONOR algorithm, if x^k is not a Clarke critical point of problem (1), then (a) for the QN-step, there exists an α ∈ [ᾱ^k, α0] with 0 < ᾱ^k ≤ α0 such that the line search criterion in Eq. (4) is satisfied; (b) for the GD-step, the line search criterion in Eq. (5) is satisfied whenever α ≥ β min(α0, (1 − γ)/L). That is, both line search criteria at any iteration k are satisfied in a finite number of trials.

We are now ready to provide the convergence proof for the HONOR algorithm:

Theorem 1 The sequence {x^k} generated by the HONOR algorithm has at least one limit point, and every limit point of {x^k} is a Clarke critical point of problem (1).

Proof It follows from Proposition 3 that both line search criteria in the QN-step [Eq. (4)] and the GD-step [Eq. 
(5)] at each iteration can be satisfied in a finite number of trials. Let α^k be the accepted step size at iteration k. Then we have

f(x^k) − f(x^{k+1}) ≥ γα^k (v^k)^T d^k = γα^k (v^k)^T H^k v^k  (QN-step),  (8)

or f(x^k) − f(x^{k+1}) ≥ (γ/(2α^k))‖x^{k+1} − x^k‖² ≥ (γ/(2α0))‖x^{k+1} − x^k‖²  (GD-step).  (9)

Recall that H^k is positive definite and γ > 0, α^k > 0, which together with Eqs. (8), (9) imply that {f(x^k)} is monotonically decreasing. Thus, {f(x^k)} converges to a finite value f̄, since f is bounded from below (note that l(x) > −∞ and r(x) ≥ 0 for all x ∈ R^n). Due to the boundedness of {x^k} (see Proposition 7 in Supplement F), the sequence {x^k} generated by the HONOR algorithm has at least one limit point x̄. Since f is continuous, there exists a subsequence K of {1, 2, · · · , k, k + 1, · · · } such that

lim_{k∈K, k→∞} x^k = x̄,  (10)

lim_{k→∞} f(x^k) = lim_{k∈K, k→∞} f(x^k) = f̄ = f(x̄).  (11)

In the following, we prove the theorem by contradiction. Assume that x̄ is not a Clarke critical point of problem (1). Then by Proposition 1, there exists at least one i ∈ {1, · · · , n} such that

lim inf_{k∈K, k→∞} |v^k_i| > 0.  (12)

We next consider the following two cases:
(a) There exist a subsequence K̃ of K and an integer k̃ > 0 such that for all k ∈ K̃, k ≥ k̃, the GD-step is adopted. 
Then for all k ∈ K̃, k ≥ k̃, we have

x^{k+1} = argmin_x {∇l(x^k)^T (x − x^k) + (1/(2α^k))‖x − x^k‖² + r(x)}.

Thus, by the optimality condition of the above problem and properties of the Clarke subdifferential (Proposition 4 in Supplement A), we have

0 ∈ ∇l(x^k) + (1/α^k)(x^{k+1} − x^k) + ∂°r(x^{k+1}).  (13)

Taking limits with k ∈ K̃ in Eq. (9) and considering Eqs. (10), (11), we have

lim_{k∈K̃, k→∞} ‖x^{k+1} − x^k‖² ≤ 0  ⇒  lim_{k∈K̃, k→∞} x^k = lim_{k∈K̃, k→∞} x^{k+1} = x̄.  (14)

Taking limits with k ∈ K̃ in Eq. (13) and considering Eq. (14), α^k ≥ β min(α0, (1 − γ)/L) [Proposition 3] and the fact that ∂°r(·) is upper-semicontinuous (upper-hemicontinuous) [8] (see Proposition 4 in Supplement A), we have

0 ∈ ∇l(x̄) + ∂°r(x̄) = ∂°f(x̄),

which contradicts the assumption that x̄ is not a Clarke critical point of problem (1).
(b) There exists an integer k̂ > 0 such that for all k ∈ K, k ≥ k̂, the QN-step is adopted. According to Remark 7 (in Supplement F), the smallest eigenvalue of H^k is uniformly bounded from below by a positive constant, which together with Eq. (12) implies

lim inf_{k∈K, k→∞} (v^k)^T H^k v^k > 0.  (15)

Taking limits with k ∈ K in Eq. (8), we have

lim_{k∈K, k→∞} γα^k (v^k)^T H^k v^k ≤ 0,

which together with γ ∈ (0, 1), α^k ∈ (0, α0] and Eq. (15) implies that

lim_{k∈K, k→∞} α^k = 0.  (16)

Eq. (12) implies that there exist an integer ǩ > 0 and a constant ϵ̄ > 0 such that ϵ^k = min(‖v^k‖, ϵ) ≥ ϵ̄ for all k ∈ K, k ≥ ǩ. 
Notice that for all k ∈ K, k ≥ k̂, the QN-step is adopted. Thus, we obtain that I^k = {i ∈ {1, · · · , n} : 0 < |x^k_i| ≤ ϵ^k, x^k_i v^k_i < 0} = ∅ for all k ∈ K, k ≥ k̂. We also notice that, if |x^k_i| ≥ ϵ̄, then there exists a constant ᾱ_i > 0 such that x^k_i(α) = π_i(x^k_i + αp^k_i; ξ^k_i) = x^k_i + αp^k_i for all α ∈ (0, ᾱ_i], as {p^k_i} is bounded (Proposition 8 in Supplement F). Therefore, we conclude that, for all k ∈ K, k ≥ k̄ = max(ǩ, k̂) and for all i ∈ {1, · · · , n}, at least one of the following three cases must happen:

x^k_i = 0  ⇒  x^k_i(α) = π_i(x^k_i + αp^k_i; ξ^k_i) = x^k_i + αp^k_i, ∀α > 0,
or |x^k_i| > ϵ^k ≥ ϵ̄  ⇒  x^k_i(α) = π_i(x^k_i + αp^k_i; ξ^k_i) = x^k_i + αp^k_i, ∀α ∈ (0, ᾱ_i],
or x^k_i v^k_i ≥ 0  ⇒  x^k_i p^k_i ≥ 0  ⇒  x^k_i(α) = π_i(x^k_i + αp^k_i; ξ^k_i) = x^k_i + αp^k_i, ∀α > 0.

It follows that there exists a constant ᾱ > 0 such that

q^k_α = (1/α)(x^k(α) − x^k) = p^k, ∀k ∈ K, k ≥ k̄, α ∈ (0, ᾱ].  (17)

Thus, considering |p^k_i| = |π_i(d^k_i; v^k_i)| ≤ |d^k_i| and v^k_i p^k_i ≥ v^k_i d^k_i for all i ∈ {1, · · · , n}, we have

‖q^k_α‖² = ‖p^k‖² ≤ ‖d^k‖² = (v^k)^T (H^k)² v^k, ∀k ∈ K, k ≥ k̄, α ∈ (0, ᾱ],  (18)

(v^k)^T q^k_α = (v^k)^T p^k ≥ (v^k)^T d^k = (v^k)^T H^k v^k, ∀k ∈ K, k ≥ k̄, α ∈ (0, ᾱ].  (19)

According to Proposition 8 (in Supplement F), we know that the largest eigenvalue of H^k is 
uniformly bounded from above by some positive constant M. Thus, we have

(v^k)^T (H^k)² v^k ≤ (2/(αL))(v^k)^T H^k v^k − ((2/(αL)) − M)(v^k)^T H^k v^k, ∀k,

which together with Eqs. (18), (19) and d^k = H^k v^k implies

‖q^k_α‖² ≤ (2/(αL))(v^k)^T q^k_α − ((2/(αL)) − M)(v^k)^T d^k, ∀k ∈ K, k ≥ k̄, α ∈ (0, ᾱ].  (20)

Considering Eqs. (7), (20), we have

f(x^k(α)) ≤ f(x^k) − α(1 − αLM/2)(v^k)^T d^k, ∀k ∈ K, k ≥ k̄, α ∈ (0, ᾱ],

which together with (v^k)^T d^k = (v^k)^T H^k v^k ≥ 0 implies that the line search criterion in the QN-step [Eq. (4)] is satisfied if

1 − αLM/2 ≥ γ, 0 < α ≤ α0 and 0 < α ≤ ᾱ, ∀k ∈ K, k ≥ k̄.

Considering the backtracking form of the line search in the QN-step [Eq. (4)], we conclude that the line search criterion in the QN-step [Eq. (4)] is satisfied whenever

α^k ≥ β min(min(ᾱ, α0), 2(1 − γ)/(LM)) > 0, ∀k ∈ K, k ≥ k̄.

This leads to a contradiction with Eq. (16).

By (a) and (b), we conclude that x̄ = lim_{k∈K, k→∞} x^k is a Clarke critical point of problem (1). □

5 Experiments

In this section, we evaluate the efficiency of HONOR on solving the non-convex regularized logistic regression problem¹ by setting l(x) = (1/N) Σ_{i=1}^N log(1 + exp(−y_i a_i^T x)), where a_i ∈ R^n is the i-th sample associated with the label y_i ∈ {1, −1}. Three non-convex regularizers (LSP, MCP and SCAD) are included in the experiments, where the parameters are set as λ = 1/N and θ = 10^{−2}λ (θ is set as 2 + 10^{−2}λ for SCAD, as it requires θ > 2). 
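As a reference point, the averaged logistic loss above and its gradient (the inputs to both the QN-step and the GD-step) can be sketched as follows (the array names `A`, `y` and the numerically stable `logaddexp` formulation are our choices):

```python
import numpy as np

def logistic_loss(x, A, y):
    """l(x) = (1/N) * sum_i log(1 + exp(-y_i * a_i^T x)); A is N x n, y in {+1, -1}^N."""
    z = -y * (A @ x)
    # log(1 + exp(z)) computed stably as logaddexp(0, z)
    return np.mean(np.logaddexp(0.0, z))

def logistic_grad(x, A, y):
    """Gradient of the averaged logistic loss."""
    z = -y * (A @ x)
    s = 1.0 / (1.0 + np.exp(-z))          # sigmoid(z)
    return A.T @ (-y * s) / len(y)
```

A useful sanity check is that l(0) = log 2 regardless of the data, and that the analytic gradient matches a central finite difference.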
We compare HONOR with the non-convex solver² GIST [14] on three large-scale, high-dimensional and sparse data sets, which are summarized in Table 1. All data sets can be downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.

Table 1: Data set statistics.

datasets   | ♯ samples N | dimensionality n
kdd2010a   | 510,302     | 20,216,830
kdd2010b   | 748,401     | 29,890,095
url        | 2,396,130   | 3,231,961

All algorithms are implemented in Matlab 2015a under a Linux operating system and executed on an Intel Core i7-4790 CPU (@3.6GHz) with 32GB memory. We choose the starting points x^0 for the compared algorithms using the same random vector whose entries are i.i.d. sampled from the standard Gaussian distribution. We terminate the compared algorithms if the relative change of two consecutive objective function values is less than 10^{−5} or the number of iterations exceeds 1000 (HONOR) or 10000 (GIST). For HONOR, we set γ = 10^{−5}, β = 0.5, α0 = 1 and the number of unrolling steps (memory size) in L-BFGS as m = 10. For GIST, we use the non-monotone line search in the experiments, as it usually performs better than its monotone counterpart. To show how the convergence behavior of HONOR varies with the parameter ϵ, we use three values: ϵ = 10^{−10}, 10^{−6}, 10^{−2}.

We report the objective function value (in log-scale) vs. 
CPU time (in seconds) plots in Figure 1. We can observe from Figure 1 that: (1) If ϵ is set to a small value, the QN-step is adopted at almost all iterations of HONOR and HONOR converges significantly faster than GIST for all three non-convex regularizers on all three data sets. This shows that using the second-order information greatly speeds up the convergence. (2) When ϵ increases, the ratio of GD-steps adopted in HONOR increases; meanwhile, the convergence performance of HONOR generally degrades. In some cases, setting a slightly larger ϵ and adopting a small number of GD-steps even slightly boosts the convergence performance of HONOR (the green curves in the first row). But setting ϵ to a very small value is always safe to guarantee the fast convergence of HONOR. (3) When ϵ is large enough, the GD-steps dominate all iterations of HONOR and HONOR converges much more slowly; in this case, HONOR converges even more slowly than GIST. The reason is that, at each iteration of HONOR, extra computational cost is required in addition to the basic computation of the GD-step. Moreover, the non-monotone line search is used in GIST while the monotone line search is adopted in the GD-step. (4) In some cases (the first row), GIST is trapped in a local solution which has a much larger objective function value than HONOR with a small ϵ. This implies that HONOR may have the potential to escape from the high error plateaus which often exist in high-dimensional non-convex problems. 

¹ We do not include the term δ‖x‖² in the objective and find that the proposed algorithm still works well.
² We do not involve SparseNet, DC programming and DC-PN in the comparison, because (1) adapting SparseNet to the logistic regression problem is challenging; (2) DC programming is shown to be much inferior to GIST; (3) the objective function value of DC-PN is larger than that of GIST in most cases [18].

These results show the great potential of HONOR for solving large-scale non-convex sparse learning problems.

[Figure 1 appears here: nine panels (LSP, MCP and SCAD on kdd2010a, kdd2010b and url) plotting the objective function value (log scale) against CPU time in seconds for HONOR with ϵ = 1e-10, 1e-6, 1e-2 and for GIST.]

Figure 1: Objective function value (in log-scale) vs. CPU time (in seconds) plots for different non-convex regularizers and different large-scale and high-dimensional data sets. The ratios of the GD-step adopted in HONOR are: LSP (kdd2010a): 0%, 1%, 34%; LSP (kdd2010b): 0%, 2%, 27%; LSP (url): 0.1%, 2%, 35%; MCP (kdd2010a): 0%, 88%, 100%; MCP (kdd2010b): 0%, 89%, 100%; MCP (url): 0%, 97%, 100%; SCAD (kdd2010a): 0%, 43%, 100%; SCAD (kdd2010b): 0%, 32%, 99.5%; SCAD (url): 0%, 79%, 100%.

6 Conclusions

In this paper, we propose an efficient optimization algorithm called HONOR for solving non-convex regularized sparse learning problems. HONOR incorporates second-order information to speed up the convergence in practice and uses a carefully designed hybrid optimization scheme to guarantee the convergence in theory. Experiments are conducted on large-scale data sets and the results show that HONOR converges significantly faster than state-of-the-art algorithms. 
In our future work, we plan to develop parallel/distributed variants of HONOR to tackle much larger data sets.

Acknowledgements

This work is supported in part by research grants from NIH (R01 LM010730, U54 EB020403) and NSF (IIS-0953662, III-1539991, III-1539722).

References

[1] G. Andrew and J. Gao. Scalable training of ℓ1-regularized log-linear models. In ICML, pages 33–40, 2007.
[2] J. Bioucas-Dias and M. Figueiredo. A new TwIST: Two-step iterative shrinkage/thresholding algorithms for image restoration. IEEE Transactions on Image Processing, 16(12):2992–3004, 2007.
[3] R. H. Byrd, G. M. Chin, J. Nocedal, and F. Oztoprak. A family of second-order methods for convex ℓ1-regularized optimization. Technical report, Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL, 2012.
[4] R. H. Byrd, G. M. Chin, J. Nocedal, and Y. Wu. Sample size selection in optimization methods for machine learning. Mathematical Programming, 134(1):127–155, 2012.
[5] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995.
[6] E. Candes, M. Wakin, and S. Boyd. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 14(5):877–905, 2008.
[7] F. Clarke. Optimization and Nonsmooth Analysis. John Wiley & Sons, New York, 1983.
[8] J. Dutta. Generalized derivatives and nonsmooth optimization, a finite dimensional tour. Top, 13(2):185–279, 2005.
[9] L. El Ghaoui, G. Li, V. Duong, V. Pham, A. Srivastava, and K. Bhaduri. Sparse machine learning methods for understanding large text corpora. In CIDU, pages 159–173, 2011.
[10] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
[11] J. Fan, L. Xue, and H. Zou. Strong oracle optimality of folded concave penalized estimation. Annals of Statistics, 42(3):819, 2014.
[12] G. Gasso, A. Rakotomamonjy, and S. Canu. Recovering sparse signals with a certain family of nonconvex penalties and DC programming. IEEE Transactions on Signal Processing, 57(12):4686–4698, 2009.
[13] P. Gong and J. Ye. A modified orthant-wise limited memory quasi-Newton method with convergence analysis. In ICML, 2015.
[14] P. Gong, C. Zhang, Z. Lu, J. Huang, and J. Ye. A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems. In ICML, volume 28, pages 37–45, 2013.
[15] J. Nocedal and S. Wright. Numerical Optimization. Springer, 1999.
[16] R. Mazumder, J. Friedman, and T. Hastie. SparseNet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association, 106(495), 2011.
[17] P. Olsen, F. Oztoprak, J. Nocedal, and S. Rennie. Newton-like methods for sparse inverse covariance estimation. In Advances in Neural Information Processing Systems (NIPS), pages 764–772, 2012.
[18] A. Rakotomamonjy, R. Flamary, and G. Gasso. DC proximal Newton for non-convex optimization problems. 2014.
[19] S. Shevade and S. Keerthi. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17):2246, 2003.
[20] X. Tan, W. Roberts, J. Li, and P. Stoica. Sparse learning via iterative minimization with application to MIMO radar imaging. IEEE Transactions on Signal Processing, 59(3):1088–1101, 2011.
[21] P. Tao and L. An. The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals of Operations Research, 133(1-4):23–46, 2005.
[22] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2):210–227, 2008.
[23] C. Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.
[24] C. Zhang and T. Zhang. A general theory of concave regularization for high-dimensional sparse estimation problems. Statistical Science, 27(4):576–593, 2012.
[25] T. Zhang. Analysis of multi-stage convex relaxation for sparse regularization. JMLR, 11:1081–1107, 2010.
[26] T. Zhang. Multi-stage convex relaxation for feature selection. Bernoulli, 2012.
[27] H. Zou and R. Li. One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4):1509, 2008.