{"title": "On the Convergence of the Concave-Convex Procedure", "book": "Advances in Neural Information Processing Systems", "page_first": 1759, "page_last": 1767, "abstract": "The concave-convex procedure (CCCP) is a majorization-minimization algorithm that solves d.c. (difference of convex functions) programs as a sequence of convex programs. In machine learning, CCCP is extensively used in many learning algorithms like sparse support vector machines (SVMs), transductive SVMs, sparse principal component analysis, etc. Though widely used in many applications, the convergence behavior of CCCP has not gotten a lot of specific attention. Yuille and Rangarajan analyzed its convergence in their original paper, however, we believe the analysis is not complete. Although the convergence of CCCP can be derived from the convergence of the d.c. algorithm (DCA), their proof is more specialized and technical than actually required for the specific case of CCCP. In this paper, we follow a different reasoning and show how Zangwills global convergence theory of iterative algorithms provides a natural framework to prove the convergence of CCCP, allowing a more elegant and simple proof. This underlines Zangwills theory as a powerful and general framework to deal with the convergence issues of iterative algorithms, after also being used to prove the convergence of algorithms like expectation-maximization, generalized alternating minimization, etc. In this paper, we provide a rigorous analysis of the convergence of CCCP by addressing these questions: (i) When does CCCP find a local minimum or a stationary point of the d.c. program under consideration? (ii) When does the sequence generated by CCCP converge? We also present an open problem on the issue of local convergence of CCCP.", "full_text": "On the Convergence of the Concave-Convex\n\nProcedure\n\nBharath K. 
Sriperumbudur
Department of Electrical and Computer Engineering
University of California, San Diego
La Jolla, CA 92093
bharathsv@ucsd.edu

Gert R. G. Lanckriet
Department of Electrical and Computer Engineering
University of California, San Diego
La Jolla, CA 92093
gert@ece.ucsd.edu

Abstract

The concave-convex procedure (CCCP) is a majorization-minimization algorithm that solves d.c. (difference of convex functions) programs as a sequence of convex programs. In machine learning, CCCP is extensively used in many learning algorithms, such as sparse support vector machines (SVMs), transductive SVMs, and sparse principal component analysis. Though widely used in many applications, the convergence behavior of CCCP has received little dedicated attention. Yuille and Rangarajan analyzed its convergence in their original paper; however, we believe the analysis is not complete. Although the convergence of CCCP can be derived from the convergence of the d.c. algorithm (DCA), the DCA proof is more specialized and technical than actually required for the specific case of CCCP. In this paper, we follow a different reasoning and show how Zangwill's global convergence theory of iterative algorithms provides a natural framework to prove the convergence of CCCP, allowing a more elegant and simple proof. This underlines Zangwill's theory as a powerful and general framework for dealing with the convergence issues of iterative algorithms, having also been used to prove the convergence of algorithms such as expectation-maximization and generalized alternating minimization. In this paper, we provide a rigorous analysis of the convergence of CCCP by addressing these questions: (i) When does CCCP find a local minimum or a stationary point of the d.c. program under consideration? (ii) When does the sequence generated by CCCP converge? 
We also present an open problem on the issue of local convergence of CCCP.

1 Introduction

The concave-convex procedure (CCCP) [30] is a majorization-minimization algorithm [15] that is popularly used to solve d.c. (difference of convex functions) programs of the form

    min_x  f(x)
    s.t.   ci(x) ≤ 0, i ∈ [m],
           dj(x) = 0, j ∈ [p],        (1)

where f(x) = u(x) − v(x) with u, v and ci being real-valued convex functions and dj being affine functions, all defined on Rn. Here, [m] := {1, . . . , m}. Suppose v is differentiable. The CCCP algorithm is an iterative procedure that solves the following sequence of convex programs:

    x(l+1) ∈ arg min_x  u(x) − xT∇v(x(l))
    s.t.     ci(x) ≤ 0, i ∈ [m],
             dj(x) = 0, j ∈ [p].        (2)

As can be seen from (2), the idea of CCCP is to linearize the concave part of f, which is −v, around a solution obtained in the current iterate, so that u(x) − xT∇v(x(l)) is convex in x; the non-convex program in (1) is therefore solved as a sequence of convex programs, as shown in (2). The original formulation of CCCP by Yuille and Rangarajan [30] deals with unconstrained and linearly constrained problems. However, the same formulation can be extended to handle any constraints (both convex and non-convex). CCCP has been extensively used to solve many non-convex programs (of the form in (1)) that appear in machine learning. For example, [3] proposed a successive linear approximation (SLA) algorithm for feature selection in support vector machines, which can be seen as a special case of CCCP. 
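To make the update in (2) concrete, here is a minimal unconstrained sketch in Python. The choice u(x) = x² and v(x) = √(1 + x²) is our own toy example, not from the paper; for this u, the convex subproblem arg min_x u(x) − x∇v(x(l)) has the closed-form solution x = ∇v(x(l))/2 (set the derivative 2x − ∇v(x(l)) to zero).

```python
import math

def cccp_step(x, grad_v_at_x):
    # Convex subproblem for u(x) = x**2:
    # argmin_x x**2 - x * grad_v_at_x  has the closed form  x = grad_v_at_x / 2.
    return grad_v_at_x / 2.0

def cccp(x0, grad_v, tol=1e-12, max_iter=1000):
    # Iterate (2): linearize v at the current point, minimize the convex surrogate.
    x = x0
    for _ in range(max_iter):
        x_new = cccp_step(x, grad_v(x))
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# f(x) = u(x) - v(x) with u(x) = x^2 and v(x) = sqrt(1 + x^2), both convex;
# the unique stationary point of f is x = 0.
grad_v = lambda x: x / math.sqrt(1.0 + x * x)
x_star = cccp(5.0, grad_v)
```

Each iteration solves one convex problem exactly, and the iterates decrease f monotonically toward the stationary point at 0, as the convergence analysis in this paper predicts.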
Other applications where CCCP has been used include sparse principal component analysis [27], transductive SVMs [11, 5, 28], feature selection in SVMs [22], structured estimation [10], missing data problems in Gaussian processes and SVMs [26], etc.

The algorithm in (2) starts at some random point x(0) ∈ {x : ci(x) ≤ 0, i ∈ [m]; dj(x) = 0, j ∈ [p]}, solves the program in (2), and therefore generates a sequence {x(l)}_{l=0}^∞. The goal of this paper is to study the convergence of {x(l)}_{l=0}^∞: (i) When does CCCP find a local minimum or a stationary point¹ of the program in (1)? (ii) Does {x(l)}_{l=0}^∞ converge? If so, to what and under what conditions? From a practical perspective, these questions are highly relevant, given that CCCP is widely applied in machine learning.

In their original CCCP paper, Yuille and Rangarajan [30, Theorem 2] analyzed its convergence, but we believe the analysis is not complete. They showed that {x(l)}_{l=0}^∞ satisfies the monotonic descent property, i.e., f(x(l+1)) ≤ f(x(l)), and argued that this descent property ensures the convergence of {x(l)}_{l=0}^∞ to a minimum or saddle point of the program in (1). However, no rigorous proof is provided that their claim holds for all u, v, {ci} and {dj}. Answering the questions above therefore requires a rigorous proof of the convergence of CCCP that explicitly states the conditions under which it holds.

In the d.c. programming literature, Pham Dinh and Hoai An [8] proposed a primal-dual subdifferential method called DCA (d.c. algorithm) for solving a general d.c. program of the form min{u(x) − v(x) : x ∈ Rn}, where u and v are assumed to be proper lower semi-continuous convex functions, a larger class than the class of differentiable functions. It can be shown that if v is differentiable, then DCA exactly reduces to CCCP. 
Unlike CCCP, DCA involves constructing two sets of convex programs (called the primal and dual programs) and solving them iteratively in succession, such that the solution of the primal is the initialization to the dual and vice versa. See [8] for details. [8, Theorem 3] proves the convergence of DCA for general d.c. programs. The proof is specialized and technical: it fundamentally relies on d.c. duality, but outlining it in any more detail requires a substantial discussion which would lead us too far here. In this work, we follow a fundamentally different approach and show that the convergence of CCCP, specifically, can be analyzed in a more simple and elegant way, by relying on Zangwill's global convergence theory of iterative algorithms. We make some simple assumptions on the functions involved in (1), which are not too restrictive and therefore applicable to many practical situations. The tools employed in our proof are of a completely different flavor than the ones used in the proof of DCA convergence: the DCA convergence analysis exploits d.c. duality, while we use the notion of point-to-set maps as introduced by Zangwill. Zangwill's theory is a powerful and general framework to deal with the convergence issues of iterative algorithms. It has also been used to prove the convergence of the expectation-maximization (EM) algorithm [29], generalized alternating minimization algorithms [12], multiplicative updates in non-negative quadratic programming [25], etc., and is therefore a natural framework to analyze the convergence of CCCP in a more direct way.

The paper is organized as follows. In Section 2, we provide a brief introduction to majorization-minimization (MM) algorithms and show that CCCP is obtained as a particular form of majorization-minimization. The goal of this section is also to establish the literature on MM algorithms and show where CCCP fits in it. In Section 3, we present Zangwill's theory of global convergence, which is a general framework to analyze the convergence behavior of iterative algorithms. This theory is used to address the global convergence of CCCP in Section 4. This involves analyzing the fixed points of the CCCP algorithm in (2) and then showing that the fixed points are the stationary points of the program in (1). The results in Section 4 are extended in Section 4.1 to analyze the convergence of the constrained concave-convex procedure that was proposed by [26] to deal with d.c. programs with d.c. constraints. We briefly discuss the local convergence issues of CCCP in Section 5 and conclude the section with an open question.

¹x∗ is said to be a stationary point of a constrained optimization problem if it satisfies the corresponding Karush-Kuhn-Tucker (KKT) conditions. Assuming constraint qualification, KKT conditions are necessary for the local optimality of x∗. See [2, Section 11.3] for details.

2 Majorization-minimization

MM algorithms can be thought of as a generalization of the well-known EM algorithm [7]. The general principle behind MM algorithms was first enunciated by the numerical analysts Ortega and Rheinboldt [23] in the context of line search methods. The MM principle appears in many places in statistical computation, including multidimensional scaling [6], robust regression [14], correspondence analysis [13], variable selection [16], sparse signal recovery [4], etc. We refer the interested reader to a tutorial on MM algorithms [15] and the references therein.

The general idea of MM algorithms is as follows. Suppose we want to minimize f over Ω ⊂ Rn. The idea is to construct a majorization function g over Ω × Ω such that

    f(x) ≤ g(x, y), ∀ x, y ∈ Ω,
    f(x) = g(x, x), ∀ x ∈ Ω.        (3)

Thus, g as a function of x is an upper bound on f and coincides with f at y. 
The majorization algorithm corresponding to this majorization function g updates x at iteration l by

    x(l+1) ∈ arg min_{x∈Ω} g(x, x(l)),        (4)

unless we already have x(l) ∈ arg min_{x∈Ω} g(x, x(l)), in which case the algorithm stops. The majorization function g is usually constructed by using Jensen's inequality for convex functions, the first-order Taylor approximation, or the quadratic upper bound principle [1]. However, any other method can also be used to construct g, as long as it satisfies (3). It is easy to show that the above iterative scheme decreases the value of f monotonically in each iteration, i.e.,

    f(x(l+1)) ≤ g(x(l+1), x(l)) ≤ g(x(l), x(l)) = f(x(l)),        (5)

where the first inequality and the last equality follow from (3), while the sandwiched inequality follows from (4).

Note that MM algorithms apply equally well to the maximization of f by simply reversing the inequality sign in (3) and changing the "min" to "max" in (4). In this case, MM refers to minorization-maximization, and g is called the minorization function. To put things in perspective, the EM algorithm can be obtained by constructing the minorization function g using Jensen's inequality for concave functions. The construction of such a g is referred to as the E-step, while (4), with the "min" replaced by "max", is referred to as the M-step. The algorithm in (3) and (4) is also referred to as the auxiliary function method, e.g., for non-negative matrix factorization [18]. [17] studied this algorithm under the name optimization transfer, while [19] referred to it as the SM algorithm, where "S" stands for the surrogate step (the majorization/minorization step) and "M" stands for the minimization/maximization step, depending on the problem at hand; g is called the surrogate function. 
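As a small numerical illustration of (3)–(5) (the specific f and the Lipschitz constant below are our own assumptions, not from the paper): when f has an L-Lipschitz gradient, the quadratic upper bound g(x, y) = f(y) + f′(y)(x − y) + (L/2)(x − y)² majorizes f, and minimizing g(·, y) over x gives the update x(l+1) = x(l) − f′(x(l))/L, whose iterates exhibit the monotone descent in (5).

```python
import math

# Toy objective with |f''(x)| = |2 - 3 sin x| <= 5, so L = 5 makes
# g(x, y) = f(y) + f'(y)(x - y) + (L/2)(x - y)**2 a valid majorizer of f.
f = lambda x: x**2 + 3.0 * math.sin(x)
fp = lambda x: 2.0 * x + 3.0 * math.cos(x)
L = 5.0

x = 2.0
values = [f(x)]
for _ in range(200):
    x = x - fp(x) / L          # x^{(l+1)} = argmin_x g(x, x^{(l)})
    values.append(f(x))

# Monotone descent (5): f(x^{(l+1)}) <= g(x^{(l+1)}, x^{(l)}) <= f(x^{(l)})
descent = all(b <= a + 1e-12 for a, b in zip(values, values[1:]))
```

The update is just a gradient step with step size 1/L, which shows how a standard first-order method falls out of the MM template for this particular choice of g.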
In the following example, we show that CCCP is an MM algorithm for a particular choice of the majorization function g.

Example 1 (Linear majorization). Consider the optimization problem min_{x∈Ω} f(x), where f = u − v, with u and v both real-valued, convex and defined on Rn, and v differentiable. Since v is convex, we have v(x) ≥ v(y) + (x − y)T∇v(y), ∀ x, y ∈ Ω. Therefore,

    f(x) ≤ u(x) − v(y) − (x − y)T∇v(y) =: g(x, y).        (6)

It is easy to verify that g is a majorization function of f. Therefore, we have

    x(l+1) ∈ arg min_{x∈Ω} g(x, x(l)) = arg min_{x∈Ω} u(x) − xT∇v(x(l)).        (7)

If Ω is a convex set, then the above procedure reduces to CCCP, which solves a sequence of convex programs. As mentioned before, CCCP was proposed for unconstrained and linearly constrained non-convex programs. This example shows that the same idea can be extended to any constraint set. If u and v are strictly convex, then strict descent is achieved in (5) unless x(l+1) = x(l); i.e., if x(l+1) ≠ x(l), then

    f(x(l+1)) < g(x(l+1), x(l)) < g(x(l), x(l)) = f(x(l)).        (8)

The first strict inequality follows from (6). The strict convexity of u leads to the strict convexity of g, and therefore g(x(l+1), x(l)) < g(x(l), x(l)) unless x(l+1) = x(l).

3 Global convergence theory of iterative algorithms

For an iterative procedure like CCCP to be useful, it must converge to a local optimum or a stationary point from all, or at least a significant number of, initialization states, and not exhibit other nonlinear system behaviors such as divergence or oscillation. This behavior can be analyzed using the global convergence theory of iterative algorithms developed by Zangwill [31]. Note that the term "global convergence" is a misnomer. 
We will clarify this below and also introduce some notation and terminology.

To understand the convergence of an iterative procedure like CCCP, we need the notion of a set-valued mapping, or point-to-set mapping, which is central to the theory of global convergence.² A point-to-set map Ψ from a set X into a set Y is defined as Ψ : X → P(Y), which assigns a subset of Y to each point of X, where P(Y) denotes the power set of Y. We introduce a few definitions related to the properties of point-to-set maps that will be used later. Suppose X and Y are two topological spaces. A point-to-set map Ψ is said to be closed at x0 ∈ X if xk → x0 as k → ∞, xk ∈ X, and yk → y0 as k → ∞, yk ∈ Ψ(xk), imply y0 ∈ Ψ(x0). This concept of closedness generalizes the concept of continuity for ordinary point-to-point mappings. A point-to-set map Ψ is said to be closed on S ⊂ X if it is closed at every point of S. A fixed point of the map Ψ : X → P(X) is a point x for which {x} = Ψ(x), whereas a generalized fixed point of Ψ is a point for which x ∈ Ψ(x). Ψ is said to be uniformly compact on X if there exists a compact set H independent of x such that Ψ(x) ⊂ H for all x ∈ X. Note that if X is compact, then Ψ is uniformly compact on X. Let φ : X → R be a continuous function. Ψ is said to be monotonic with respect to φ whenever y ∈ Ψ(x) implies that φ(y) ≤ φ(x). If, in addition, y ∈ Ψ(x) and φ(y) = φ(x) imply that y = x, then we say that Ψ is strictly monotonic.

Many iterative algorithms in mathematical programming can be described using the notion of point-to-set maps. Let X be a set and x0 ∈ X a given point. 
Then an algorithm A with initial point x0 is a point-to-set map A : X → P(X) which generates a sequence {xk}_{k=1}^∞ via the rule xk+1 ∈ A(xk), k = 0, 1, . . .. A is said to be globally convergent if, for any chosen initial point x0, the sequence {xk}_{k=0}^∞ generated by xk+1 ∈ A(xk) (or a subsequence) converges to a point for which a necessary condition of optimality holds. The property of global convergence expresses, in a sense, the certainty that the algorithm works. It is very important to stress that it does not imply (contrary to what the term might suggest) convergence to a global optimum for all initial points x0.

With the above concepts in place, we now state Zangwill's global convergence theorem [31, Convergence theorem A, page 91].

Theorem 2 ([31]). Let A : X → P(X) be a point-to-set map (an algorithm) that, given a point x0 ∈ X, generates a sequence {xk}_{k=0}^∞ through the iteration xk+1 ∈ A(xk). Also let a solution set Γ ⊂ X be given. Suppose

(1) All points xk are in a compact set S ⊂ X.
(2) There is a continuous function φ : X → R such that:
    (a) x ∉ Γ ⇒ φ(y) < φ(x), ∀ y ∈ A(x),
    (b) x ∈ Γ ⇒ φ(y) ≤ φ(x), ∀ y ∈ A(x).
(3) A is closed at x if x ∉ Γ.

Then the limit of any convergent subsequence of {xk}_{k=0}^∞ is in Γ. Furthermore, limk→∞ φ(xk) = φ(x∗) for all limit points x∗.

²Note that, depending on the objective and constraints, the minimizer of the CCCP algorithm in (2) need not be unique. Therefore, the algorithm takes x(l) as its input and returns a set of minimizers, from which an element x(l+1) is chosen. Hence the notion of point-to-set maps appears naturally in such iterative algorithms.

The general idea in showing the global convergence of an algorithm A is to invoke Theorem 2 by appropriately defining φ and Γ. For an algorithm A that solves the minimization problem min{f(x) : x ∈ Ω}, the solution set Γ is usually chosen to be the set of corresponding stationary points, and φ can be chosen to be the objective function itself, i.e., f, if f is continuous. In Theorem 2, the convergence of φ(xk) to φ(x∗) does not automatically imply the convergence of xk to x∗. However, if A is strictly monotonic with respect to φ, then Theorem 2 can be strengthened using the following result due to Meyer [20, Theorem 3.1, Corollary 3.2].

Theorem 3 ([20]). Let A : X → P(X) be a point-to-set map such that A is uniformly compact, closed and strictly monotonic on X, where X is a closed subset of Rn. If {xk}_{k=0}^∞ is any sequence generated by A, then all limit points will be fixed points of A, φ(xk) → φ(x∗) =: φ∗ as k → ∞, where x∗ is a fixed point, ‖xk+1 − xk‖ → 0, and either {xk}_{k=0}^∞ converges or the set of limit points of {xk}_{k=0}^∞ is connected. Define F(a) := {x ∈ F : φ(x) = a}, where F is the set of fixed points of A. If F(φ∗) is finite, then any sequence {xk}_{k=0}^∞ generated by A converges to some x∗ in F(φ∗).

Both these results use only basic facts of analysis and are simple to prove and understand. Using these results on the global convergence of algorithms, [29] studied the convergence properties of the EM algorithm, while [12] analyzed the convergence of generalized alternating minimization procedures. 
In the following section, we use these results to analyze the convergence of CCCP.

4 Convergence theorems for CCCP

Let us consider the CCCP algorithm in (2) pertaining to the d.c. program in (1). Let Acccp be the point-to-set map, x(l+1) ∈ Acccp(x(l)), such that

    Acccp(y) = arg min{u(x) − xT∇v(y) : x ∈ Ω},        (9)

where Ω := {x : ci(x) ≤ 0, i ∈ [m], dj(x) = 0, j ∈ [p]}. Let us assume that the {ci} are differentiable convex functions defined on Rn. We now present the global convergence theorem for CCCP.

Theorem 4 (Global convergence of CCCP−I). Let u and v be real-valued differentiable convex functions defined on Rn. Suppose ∇v is continuous. Let {x(l)}_{l=0}^∞ be any sequence generated by Acccp defined by (9). Suppose Acccp is uniformly compact³ on Ω and Acccp(x) is nonempty for any x ∈ Ω. Then, assuming suitable constraint qualification, all the limit points of {x(l)}_{l=0}^∞ are stationary points of the d.c. program in (1). In addition, liml→∞(u(x(l)) − v(x(l))) = u(x∗) − v(x∗), where x∗ is some stationary point of (1).

Before we proceed with the proof of Theorem 4, we need a few additional results. The idea of the proof is to show that any generalized fixed point of Acccp is a stationary point of (1), which is shown below in Lemma 5, and then to use Theorem 2 to analyze the generalized fixed points.

Lemma 5. Suppose x∗ is a generalized fixed point of Acccp and assume that the constraints in (9) are qualified at x∗. Then x∗ is a stationary point of the program in (1).

Proof. We have x∗ ∈ Acccp(x∗) and the constraints in (9) are qualified at x∗. 
Then there exist Lagrange multipliers {η∗_i}_{i=1}^m ⊂ R+ and {μ∗_j}_{j=1}^p ⊂ R such that the following KKT conditions hold:

    ∇u(x∗) − ∇v(x∗) + Σ_{i=1}^m η∗_i ∇ci(x∗) + Σ_{j=1}^p μ∗_j ∇dj(x∗) = 0,
    ci(x∗) ≤ 0, η∗_i ≥ 0, ci(x∗) η∗_i = 0, ∀ i ∈ [m],
    dj(x∗) = 0, μ∗_j ∈ R, ∀ j ∈ [p].        (10)

(10) is exactly the set of KKT conditions of (1), which are satisfied by (x∗, {η∗_i}, {μ∗_j}); therefore, x∗ is a stationary point of (1).

³Assuming that for every x ∈ Ω the set H(x) := {y : u(y) − u(x) ≤ v(y) − v(x), y ∈ Acccp(Ω)} is bounded is also sufficient for the result to hold.

Before proving Theorem 4, we need a result to test the closedness of Acccp. The following result from [12, Proposition 7] shows that the minimization of a continuous function forms a closed point-to-set map. A similar sufficient condition is also provided in [29, Equation 10].

Lemma 6 ([12]). Given a real-valued continuous function h on X × Y, define the point-to-set map Ψ : X → P(Y) by

    Ψ(x) = arg min_{y′∈Y} h(x, y′) = {y : h(x, y) ≤ h(x, y′), ∀ y′ ∈ Y}.        (11)

Then Ψ is closed at x if Ψ(x) is nonempty.

We are now ready to prove Theorem 4.

Proof of Theorem 4. The assumption of Acccp being uniformly compact on Ω ensures that condition (1) in Theorem 2 is satisfied. Let Γ be the set of all generalized fixed points of Acccp and let φ = f = u − v. Because of the descent property in (5), condition (2) in Theorem 2 is satisfied. By our assumption on u and v, g(x, y) = u(x) − v(y) − (x − y)T∇v(y) is continuous in x and y. 
Therefore, by Lemma 6, the assumption of non-emptiness of Acccp(x) for any x ∈ Ω ensures that Acccp is closed on Ω, so condition (3) in Theorem 2 is satisfied. Therefore, by Theorem 2, all the limit points of {x(l)}_{l=0}^∞ are generalized fixed points of Acccp and liml→∞(u(x(l)) − v(x(l))) = u(x∗) − v(x∗), where x∗ is some generalized fixed point of Acccp. Since, by Lemma 5, the generalized fixed points of Acccp are stationary points of (1), the result follows.

Remark 7. If Ω is compact, then Acccp is uniformly compact on Ω. In addition, since u is continuous on Ω, by the Weierstrass theorem⁴ [21] it is clear that Acccp(x) is nonempty for any x ∈ Ω, and Acccp is therefore also closed on Ω. This means that when Ω is compact, the result in Theorem 4 follows trivially from Theorem 2.

In Theorem 4, we considered the generalized fixed points of Acccp. The disadvantage of this setting is that it does not rule out "oscillatory" behavior [20]. To elaborate, we only required {x∗} ⊂ Acccp(x∗). For example, let Ω0 = {x1, x2}, let Acccp(x1) = Acccp(x2) = Ω0, and let u(x1) − v(x1) = u(x2) − v(x2) = 0. Then the sequence {x1, x2, x1, x2, . . .} could be generated by Acccp, with the convergent subsequences converging to the generalized fixed points x1 and x2. Such oscillatory behavior can be avoided if we require Acccp to have fixed points instead of generalized fixed points. With appropriate assumptions on u and v, the following stronger result on the convergence of CCCP can be obtained through Theorem 3.

Theorem 8 (Global convergence of CCCP−II). Let u and v be strictly convex, differentiable functions defined on Rn. Also assume that ∇v is continuous. Let {x(l)}_{l=0}^∞ be any sequence generated by Acccp defined by (9). 
Suppose Acccp is uniformly compact on Ω and Acccp(x) is nonempty for any x ∈ Ω. Then, assuming suitable constraint qualification, all the limit points of {x(l)}_{l=0}^∞ are stationary points of the d.c. program in (1); u(x(l)) − v(x(l)) → u(x∗) − v(x∗) =: f∗ as l → ∞, for some stationary point x∗; ‖x(l+1) − x(l)‖ → 0; and either {x(l)}_{l=0}^∞ converges or the set of limit points of {x(l)}_{l=0}^∞ is a connected and compact subset of S(f∗), where S(a) := {x ∈ S : u(x) − v(x) = a} and S is the set of stationary points of (1). If S(f∗) is finite, then any sequence {x(l)}_{l=0}^∞ generated by Acccp converges to some x∗ in S(f∗).

Proof. Since u and v are strictly convex, the strict descent property in (8) holds and therefore Acccp is strictly monotonic with respect to f. Under the assumptions made about Acccp, Theorem 3 can be invoked, which says that all the limit points of {x(l)}_{l=0}^∞ are fixed points of Acccp, which either converge or form a connected compact set. From Lemma 5, the set of fixed points of Acccp is already contained in the set of stationary points of (1), and the desired result follows from Theorem 3.

Theorems 4 and 8 answer the questions that we raised in Section 1. These results explicitly provide sufficient conditions on u, v, {ci} and {dj} under which the CCCP algorithm finds a stationary point of (1), along with the convergence of the sequence generated by the algorithm. From Theorem 8, it should be clear that convergence of f(x(l)) to f∗ does not automatically imply the convergence of x(l) to x∗. Convergence in the latter sense requires more stringent conditions, such as the finiteness of the set of stationary points of (1) that attain the value f∗.

⁴The Weierstrass theorem states: if f is a real continuous function on a compact set K ⊂ Rn, then the problem min{f(x) : x ∈ K} has an optimal solution x∗ ∈ K.

4.1 Extensions

So far, we have considered d.c. programs whose constraint set is convex. Let us now consider a general d.c. program given by

    min_x  u0(x) − v0(x)
    s.t.   ui(x) − vi(x) ≤ 0, i ∈ [m],        (12)

where the {ui} and {vi} are real-valued convex and differentiable functions defined on Rn. While dealing with kernel methods for missing variables, [26] encountered a problem of the form (12), for which they proposed the constrained concave-convex procedure given by

    x(l+1) ∈ arg min_x  u0(x) − v̂0(x; x(l))
    s.t.     ui(x) − v̂i(x; x(l)) ≤ 0, i ∈ [m],        (13)

where v̂i(x; x(l)) := vi(x(l)) + (x − x(l))T∇vi(x(l)). Note that, similar to CCCP, the algorithm in (13) solves a sequence of convex programs. Though [26, Theorem 1] provided a convergence analysis for the algorithm in (13), it is not complete, because the convergence of {x(l)}_{l=0}^∞ is assumed. In this subsection, we provide its convergence analysis, following an approach similar to the one we used for CCCP, by considering a point-to-set map Bccp associated with the iterative algorithm in (13), where x(l+1) ∈ Bccp(x(l)). In Theorem 10, we provide the global convergence result for the constrained concave-convex procedure, which is the counterpart of Theorem 4 for CCCP. We do not provide the stronger version of the result as in Theorem 8, as it can be obtained by assuming strict convexity of u0 and v0. 
Before proving Theorem 10, we need the analogue of Lemma 5, which we provide below.

Lemma 9. Suppose x∗ is a generalized fixed point of Bccp and assume that the constraints in (13) are qualified at x∗. Then, x∗ is a stationary point of the program in (12).

Proof. Based on the assumptions x∗ ∈ Bccp(x∗) and the constraint qualification at x∗ in (13), there exist Lagrange multipliers {η∗_i}, i ∈ [m], in R₊ (for simplicity, we assume all the constraints to be inequality constraints) such that the following KKT conditions hold:

    ∇u0(x∗) + Σ_{i=1}^m η∗_i (∇ui(x∗) − ∇vi(x∗)) = ∇v0(x∗),
    ui(x∗) − vi(x∗) ≤ 0,  η∗_i ≥ 0,  i ∈ [m],
    (ui(x∗) − vi(x∗)) η∗_i = 0,  i ∈ [m],                                 (14)

which are exactly the KKT conditions for (12), satisfied by (x∗, {η∗_i}); therefore, x∗ is a stationary point of (12).

Theorem 10 (Global convergence of constrained CCP). Let {ui}, {vi} be real-valued differentiable convex functions on Rⁿ. Assume ∇v0 to be continuous. Let {x(l)} be any sequence generated by Bccp defined in (13). Suppose Bccp is uniformly compact on Ω := {x : ui(x) − vi(x) ≤ 0, i ∈ [m]} and Bccp(x) is nonempty for any x ∈ Ω. Then, assuming suitable constraint qualification, all the limit points of {x(l)} are stationary points of the d.c. program in (12). In addition, lim_{l→∞} (u0(x(l)) − v0(x(l))) = u0(x∗) − v0(x∗), where x∗ is some stationary point of (12).

Proof. The proof is very similar to that of Theorem 4: we check that Bccp satisfies the conditions of Theorem 2 and then invoke Lemma 9. The assumptions in the statement of the theorem ensure that conditions (1) and (3) in Theorem 2 are satisfied. [26, Theorem 1] proves the descent property, analogous to (5), which follows directly from the linear majorization idea; therefore, the descent property in condition (2) of Theorem 2 also holds. The result then follows from Theorem 2 and Lemma 9.

5 On the local convergence of CCCP: An open problem

The study so far has been devoted to the global convergence analysis of CCCP and the constrained concave-convex procedure. As mentioned before, we say an algorithm is globally convergent if, for any chosen starting point x0, the sequence {xk} generated by xk+1 ∈ A(xk) converges to a point for which a necessary condition of optimality holds. In the results so far, we have shown that all the limit points of any sequence generated by CCCP (resp. its constrained version) are stationary points (local extrema or saddle points) of the program in (1) (resp. (12)). Suppose x0 is chosen to lie in an ε-neighborhood of a local minimum x∗; will the CCCP sequence then converge to x∗? If so, what is the rate of convergence? This is the question of local convergence that needs to be addressed.

[24] studied the local convergence of bound optimization algorithms (of which CCCP is an example) to compare the rate of convergence of such methods to that of gradient and second-order methods. In their work, they considered the unconstrained version of CCCP with Acccp taken to be a differentiable point-to-point map. They showed that, depending on the curvature of u and v, CCCP will exhibit either quasi-Newton behavior with fast, typically superlinear convergence, or extremely slow, first-order convergence behavior. However, extending these results to the constrained setup as in (2) is not obvious.
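This curvature dependence is easy to see on a one-dimensional quadratic toy problem (our own illustrative construction, not from [24]): for u(x) = ax²/2 and v(x) = bx²/2 + x with a > b > 0, the unconstrained CCCP update ∇u(x(l+1)) = ∇v(x(l)) is the affine map Ψ(x) = (bx + 1)/a, whose contraction factor is exactly |Ψ′| = b/a.

```python
def cccp_quadratic(a, b, x0, n_iters):
    """Unconstrained CCCP for u(x) = a*x^2/2, v(x) = b*x^2/2 + x (a > b > 0):
    each step solves u'(x_next) = v'(x_l), i.e. x_next = (b*x_l + 1)/a.
    Returns the sequence of errors |x_l - x*|."""
    x_star = 1.0 / (a - b)          # the unique minimizer of u - v
    x, errs = x0, []
    for _ in range(n_iters):
        x = (b * x + 1.0) / a       # point-to-point map Psi
        errs.append(abs(x - x_star))
    return errs

# Curvature mostly in u (b/a small): fast linear convergence.
fast = cccp_quadratic(a=10.0, b=0.1, x0=5.0, n_iters=10)
# Curvature of v nearly cancels that of u (b/a near 1): very slow.
slow = cccp_quadratic(a=10.0, b=9.9, x0=5.0, n_iters=10)

# Each step contracts the error by exactly rho = b/a = |Psi'(x*)|.
print(fast[1] / fast[0], slow[1] / slow[0])   # roughly 0.01 and 0.99
```

In this toy case Ψ is differentiable and ρ(Ψ′(x∗)) = b/a < 1, so the iterates converge linearly at that rate; the open question below is how to establish such differentiability and a spectral-radius bound for general constrained d.c. programs.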
The following result, due to Ostrowski, which can be found in [23, Theorem 10.1.3], provides a way to study the local convergence of iterative algorithms.

Proposition 11 (Ostrowski). Suppose that Ψ : U ⊂ Rⁿ → Rⁿ has a fixed point x∗ ∈ int(U) and Ψ is Fréchet-differentiable at x∗. If the spectral radius of Ψ′(x∗) satisfies ρ(Ψ′(x∗)) < 1, and if x0 is sufficiently close to x∗, then the iterates {xk} defined by xk+1 = Ψ(xk) all lie in U and converge to x∗.

A few remarks are in order regarding the use of Proposition 11 to study the local convergence of CCCP. Note that Proposition 11 treats Ψ as a point-to-point map, which can be obtained by choosing u and v to be strictly convex so that x(l+1) is the unique minimizer of (2). The point x∗ in Proposition 11 can be chosen to be a local minimum. Therefore, the desired result of local convergence with at least a linear rate is obtained if we show that ρ(Ψ′(x∗)) < 1. However, we are currently not aware of a way to compute the differential of Ψ, nor of conditions to impose on the functions in (2) so that Ψ is a differentiable map. This is an open question coming out of this work. On the other hand, the local convergence behavior of DCA has been established for two important classes of d.c. programs: (i) the trust region subproblem [9] (minimization of a quadratic function over a Euclidean ball) and (ii) nonconvex quadratic programs [8]. We are not aware of local optimality results for general d.c. programs using DCA.

6 Conclusion & Discussion

The concave-convex procedure (CCCP) is widely used in machine learning. In this work, we analyze its global convergence behavior by using results from the global convergence theory of iterative algorithms.
We explicitly state the conditions under which any sequence generated by CCCP converges to a stationary point of a d.c. program with convex constraints. The proposed approach allows an elegant and direct proof and is fundamentally different from the highly technical proof for the convergence of DCA, which implies convergence for CCCP. It illustrates the power and generality of Zangwill's global convergence theory as a framework for proving the convergence of iterative algorithms. We also briefly discuss the local convergence of CCCP and present an open question, whose settlement would address the local convergence behavior of CCCP.

Acknowledgments

The authors thank the anonymous reviewers for their constructive comments. They wish to acknowledge support from the National Science Foundation (grant DMS-MSPA 0625409), the Fair Isaac Corporation and the University of California MICRO program.

References

[1] D. Böhning and B. G. Lindsay. Monotonicity of quadratic-approximation algorithms. Annals of the Institute of Statistical Mathematics, 40(4):641–663, 1988.

[2] J. F. Bonnans, J. C. Gilbert, C. Lemaréchal, and C. A. Sagastizábal. Numerical Optimization: Theoretical and Practical Aspects. Springer-Verlag, 2006.

[3] P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support vector machines. In Proc. 15th International Conf. on Machine Learning, pages 82–90. Morgan Kaufmann, San Francisco, CA, 1998.

[4] E. J. Candès, M. Wakin, and S. Boyd. Enhancing sparsity by reweighted ℓ1 minimization. J. Fourier Anal. Appl., 2007. To appear.

[5] R. Collobert, F. Sinz, J. Weston, and L. Bottou. Large scale transductive SVMs. Journal of Machine Learning Research, 7:1687–1712, 2006.

[6] J. de Leeuw. Applications of convex analysis to multidimensional scaling. In J. R. Barra, F. Brodeau, G. Romier, and B.
Van Cutsem, editors, Recent Developments in Statistics, pages 133–146, Amsterdam, The Netherlands, 1977. North Holland Publishing Company.

[7] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B, 39:1–38, 1977.

[8] T. Pham Dinh and L. T. Hoai An. Convex analysis approach to d.c. programming: Theory, algorithms and applications. Acta Mathematica Vietnamica, 22(1):289–355, 1997.

[9] T. Pham Dinh and L. T. Hoai An. D.c. optimization algorithms for solving the trust region subproblem. SIAM Journal of Optimization, 8:476–505, 1998.

[10] C. B. Do, Q. V. Le, C. H. Teo, O. Chapelle, and A. J. Smola. Tighter bounds for structured estimation. In Advances in Neural Information Processing Systems 21, 2009. To appear.

[11] G. Fung and O. L. Mangasarian. Semi-supervised support vector machines for unlabeled data classification. Optimization Methods and Software, 15:29–44, 2001.

[12] A. Gunawardana and W. Byrne. Convergence theorems for generalized alternating minimization procedures. Journal of Machine Learning Research, 6:2049–2073, 2005.

[13] W. J. Heiser. Correspondence analysis with least absolute residuals. Comput. Stat. Data Analysis, 5:337–356, 1987.

[14] P. J. Huber. Robust Statistics. John Wiley, New York, 1981.

[15] D. R. Hunter and K. Lange. A tutorial on MM algorithms. The American Statistician, 58:30–37, 2004.

[16] D. R. Hunter and R. Li. Variable selection using MM algorithms. Annals of Statistics, 33:1617–1642, 2005.

[17] K. Lange, D. R. Hunter, and I. Yang. Optimization transfer using surrogate objective functions with discussion. Journal of Computational and Graphical Statistics, 9(1):1–59, 2000.

[18] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In T. K. Leen, T. G. Dietterich, and V.
Tresp, editors, Advances in Neural Information Processing Systems 13, pages 556–562. MIT Press, Cambridge, 2001.

[19] X.-L. Meng. Discussion on "Optimization transfer using surrogate objective functions". Journal of Computational and Graphical Statistics, 9(1):35–43, 2000.

[20] R. R. Meyer. Sufficient conditions for the convergence of monotonic mathematical programming algorithms. Journal of Computer and System Sciences, 12:108–121, 1976.

[21] M. Minoux. Mathematical Programming: Theory and Algorithms. John Wiley & Sons Ltd., 1986.

[22] J. Neumann, C. Schnörr, and G. Steidl. Combined SVM-based feature selection and classification. Machine Learning, 61:129–150, 2005.

[23] J. M. Ortega and W. C. Rheinboldt. Iterative Solution of Nonlinear Equations in Several Variables. Academic Press, New York, 1970.

[24] R. Salakhutdinov, S. Roweis, and Z. Ghahramani. On the convergence of bound optimization algorithms. In Proc. 19th Conference in Uncertainty in Artificial Intelligence, pages 509–516, 2003.

[25] F. Sha, Y. Lin, L. K. Saul, and D. D. Lee. Multiplicative updates for nonnegative quadratic programming. Neural Computation, 19:2004–2031, 2007.

[26] A. J. Smola, S. V. N. Vishwanathan, and T. Hofmann. Kernel methods for missing variables. In Proc. of the Tenth International Workshop on Artificial Intelligence and Statistics, 2005.

[27] B. K. Sriperumbudur, D. A. Torres, and G. R. G. Lanckriet. Sparse eigen methods by d.c. programming. In Proc. of the 24th Annual International Conference on Machine Learning, 2007.

[28] L. Wang, X. Shen, and W. Pan. On transductive support vector machines. In J. Verducci, X. Shen, and J. Lafferty, editors, Prediction and Discovery. American Mathematical Society, 2007.

[29] C. F. J. Wu. On the convergence properties of the EM algorithm. Annals of Statistics, 11(1):95–103, 1983.

[30] A. L. Yuille and A.
Rangarajan. The concave-convex procedure. Neural Computation, 15:915–936, 2003.

[31] W. I. Zangwill. Nonlinear Programming: A Unified Approach. Prentice-Hall, Englewood Cliffs, N.J., 1969.