{"title": "Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 1440, "page_last": 1448, "abstract": "Consider the problem of minimizing the sum of a smooth (possibly non-convex) and a convex (possibly nonsmooth) function involving a large number of variables. A popular approach to solve this problem is the block coordinate descent (BCD) method whereby at each iteration only one variable block is updated while the remaining variables are held fixed. With the recent advances in the developments of the multi-core parallel processing technology, it is desirable to parallelize the BCD method by allowing multiple blocks to be updated simultaneously at each iteration of the algorithm. In this work, we propose an inexact parallel BCD approach where at each iteration, a subset of the variables is updated in parallel by minimizing convex approximations of the original objective function. We investigate the convergence of this parallel BCD method for both randomized and cyclic variable selection rules. We analyze the asymptotic and non-asymptotic convergence behavior of the algorithm for both convex and non-convex objective functions. The numerical experiments suggest that for a special case of Lasso minimization problem, the cyclic block selection rule can outperform the randomized rule.", "full_text": "Parallel Successive Convex Approximation for\n\nNonsmooth Nonconvex Optimization\n\nMeisam Razaviyayn\u2217\n\nmeisamr@stanford.edu\n\nMingyi Hong\u2020\n\nmingyi@iastate.edu\n\nZhi-Quan Luo\u2021\nluozq@umn.edu\n\nJong-Shi Pang\u00a7\n\njongship@usc.edu\n\nAbstract\n\nConsider the problem of minimizing the sum of a smooth (possibly non-convex)\nand a convex (possibly nonsmooth) function involving a large number of variables.\nA popular approach to solve this problem is the block coordinate descent (BCD)\nmethod whereby at each iteration only one variable block is updated while the re-\nmaining variables are held \ufb01xed. With the recent advances in the developments of\nthe multi-core parallel processing technology, it is desirable to parallelize the BCD\nmethod by allowing multiple blocks to be updated simultaneously at each itera-\ntion of the algorithm. In this work, we propose an inexact parallel BCD approach\nwhere at each iteration, a subset of the variables is updated in parallel by mini-\nmizing convex approximations of the original objective function. We investigate\nthe convergence of this parallel BCD method for both randomized and cyclic vari-\nable selection rules. We analyze the asymptotic and non-asymptotic convergence\nbehavior of the algorithm for both convex and non-convex objective functions.\nThe numerical experiments suggest that for a special case of Lasso minimization\nproblem, the cyclic block selection rule can outperform the randomized rule.\n\n1\n\nIntroduction\n\nConsider the following optimization problem\n\nmin\n\nx\n\nn(cid:88)\n\ni=1\n\ngi(xi)\n\nh(x) (cid:44) f (x1, . . . , xn) +\n\nwhere Xi \u2286 Rmi is a closed convex set; the function f :(cid:81)n\nsibly non-convex); and g(x) (cid:44)(cid:80)n\n\ns.t. xi \u2208 Xi, i = 1, 2, . . . , n,\n(1)\ni=1 Xi \u2192 R is a smooth function (pos-\ni=1 gi(xi) is a separable convex function (possibly nonsmooth).\nThe above optimization problem appears in various \ufb01elds such as machine learning, signal process-\ning, wireless communication, image processing, social networks, and bioinformatics, to name just a\nfew. 
These optimization problems are typically of huge size and should be solved expeditiously.\nA popular approach for solving the above multi-block optimization problem is the block coordinate\ndescent (BCD) approach, where at each iteration of BCD, only one of the block variables is updated,\nwhile the remaining blocks are held \ufb01xed. Since only one block is updated at each iteration, the per-\niteration storage and computational demand of the algorithm is low, which is desirable in huge-size\nproblems. Furthermore, as observed in [1\u20133], these methods perform particulary well in practice.\n\n\u2217Electrical Engineering Department, Stanford University\n\u2020Industrial and Manufacturing Systems Engineering, Iowa State University\n\u2021Department of Electrical and Computer Engineering, University of Minnesota\n\u00a7Department of Industrial and Systems Engineering, University of Southern California\n\n1\n\n\fThe availability of high performance multi-core computing platforms makes it increasingly desir-\nable to develop parallel optimization methods. One category of such parallelizable methods is the\n(proximal) gradient methods. These methods are parallelizable in nature [4\u20138]; however, they are\nequivalent to successive minimization of a quadratic approximation of the objective function which\nmay not be tight; and hence suffer from low convergence speed in some practical applications [9].\nTo take advantage of the BCD method and parallel multi-core technology, different parallel BCD al-\ngorithms have been recently proposed in the literature. In particular, the references [10\u201312] propose\nparallel coordinate descent minimization methods for (cid:96)1-regularized convex optimization problems.\nUsing the greedy (Gauss-Southwell) update rule, the recent works [9,13] propose parallel BCD type\nmethods for general composite optimization problems. In contrast, references [2, 14\u201320] suggest\nrandomized block selection rule, which is more amenable to big data optimization problems, in\norder to parallelize the BCD method.\nMotivated by [1,9,15,21], we propose a parallel inexact BCD method where at each iteration of the\nalgorithm, a subset of the blocks is updated by minimizing locally tight approximations of the objec-\ntive function. Asymptotic and non-asymptotic convergence analysis of the algorithm is presented in\nboth convex and non-convex cases for different variable block selection rules. The proposed parallel\nalgorithm is synchronous, which is different than the existing lock-free methods in [22, 23].\nThe contributions of this work are as follows:\n\n\u2022 A parallel block coordinate descent method is proposed for non-convex nonsmooth prob-\nlems. To the best of our knowledge, reference [9] is the only paper in the literature that\nfocuses on parallelizing BCD for non-convex nonsmooth problems. This reference utilizes\ngreedy block selection rule which requires search among all blocks as well as communica-\ntion among processing nodes in order to \ufb01nd the best blocks to update. This requirement\ncan be demanding in practical scenarios where the communication among nodes are costly\nor when the number of blocks is huge. In fact, this high computational cost motivated the\nauthors of [9] to develop further inexact update strategies to ef\ufb01ciently alleviating the high\ncomputational cost of the greedy block selection rule.\n\n\u2022 The proposed parallel BCD algorithm allows both cyclic and randomized block variable\nselection rules. 
The deterministic (cyclic) update rule is different from the existing parallel randomized or greedy BCD methods in the literature; see, e.g., [2, 9, 13–20]. Based on our numerical experiments, this update rule is beneficial in solving the Lasso problem.

• The proposed method works not only with the constant step-size selection rule, but also with diminishing step-sizes, which is desirable when the Lipschitz constant of the objective function is not known.

• Unlike many existing algorithms in the literature, e.g., [13–15], our parallel BCD algorithm utilizes a general approximation of the original function which includes the linear/proximal approximation of the objective as a special case. The use of a general approximation instead of the linear/proximal approximation offers more flexibility and results in efficient algorithms for particular practical problems; see [21, 24] for specific examples.

• We present an iteration complexity analysis of the algorithm for both convex and non-convex scenarios. Unlike the existing non-convex parallel methods in the literature, such as [9], which only guarantee the asymptotic behavior of the algorithm, we provide non-asymptotic guarantees on the convergence of the algorithm as well.

2 Parallel Successive Convex Approximation

As stated in the introduction, a popular approach for solving (1) is the BCD method, where at iteration $r+1$ of the algorithm the block variable $x_i$ is updated by solving the subproblem
$$x_i^{r+1} = \arg\min_{x_i \in X_i} \; h(x_1^r, \ldots, x_{i-1}^r, x_i, x_{i+1}^r, \ldots, x_n^r). \qquad (2)$$
In many practical problems, the update rule (2) is not in closed form and hence not computationally cheap. One popular approach is to replace the function $h(\cdot)$ in (2) with a well-chosen local convex approximation $\tilde h_i(x_i, x^r)$. That is, at iteration $r+1$, the block variable $x_i$ is updated by
$$x_i^{r+1} = \arg\min_{x_i \in X_i} \; \tilde h_i(x_i, x^r), \qquad (3)$$
where $\tilde h_i(x_i, x^r)$ is a convex (possibly upper-bound) approximation of the function $h(\cdot)$ with respect to the $i$-th block around the current iterate $x^r$. This approach, also known as block successive convex approximation or block successive upper-bound minimization [21], has been widely used in different applications; see [21, 24] for more details and different useful approximation functions. In this work, we assume that the approximation function $\tilde h_i(\cdot, \cdot)$ is of the following form:
$$\tilde h_i(x_i, y) = \tilde f_i(x_i, y) + g_i(x_i).$$
Here $\tilde f_i(\cdot, y)$ is an approximation of the function $f(\cdot)$ around the point $y$ with respect to the $i$-th block.
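Before stating the assumptions on $\tilde f_i$, it may help to see what such a surrogate looks like in code. The sketch below writes down, for the least-squares choice $f(x) = \frac{1}{2}\|Ax - b\|^2$ and scalar blocks, the two classical surrogates listed after the assumptions: the proximal-linear one and the one that keeps $f$ exact in block $i$. The data and the proximal weight $\alpha$ are illustrative placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, alpha = 200, 1000, 1.0
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

def f(x):
    """Smooth part: f(x) = 0.5 * ||Ax - b||^2."""
    return 0.5 * np.linalg.norm(A @ x - b) ** 2

def grad_f(x):
    """Full gradient of f (recomputed on each call; fine for a sketch)."""
    return A.T @ (A @ x - b)

def surrogate_linear(i, x_i, y):
    """Proximal-linear surrogate: <grad_i f(y), x_i - y_i> + (alpha/2)*(x_i - y_i)^2."""
    return grad_f(y)[i] * (x_i - y[i]) + 0.5 * alpha * (x_i - y[i]) ** 2

def surrogate_block_exact(i, x_i, y):
    """Surrogate keeping f exact in block i: f(x_i, y_{-i}) + (alpha/2)*(x_i - y_i)^2."""
    z = y.copy()
    z[i] = x_i
    return f(z) + 0.5 * alpha * (x_i - y[i]) ** 2
```

Both choices agree with $f$ to first order in block $i$ at $x_i = y_i$ and are strongly convex in $x_i$, which is precisely what the assumptions below require.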
We further assume that $\tilde f_i(x_i, y) : X_i \times X \to \mathbb{R}$ satisfies the following assumptions:

• $\tilde f_i(\cdot, y)$ is continuously differentiable and uniformly strongly convex with parameter $\tau$, i.e.,
$$\tilde f_i(x_i, y) \ge \tilde f_i(x_i', y) + \langle \nabla_{x_i} \tilde f_i(x_i', y), \, x_i - x_i' \rangle + \frac{\tau}{2} \|x_i - x_i'\|^2, \quad \forall x_i, x_i' \in X_i, \ \forall y \in X.$$

• Gradient consistency assumption: $\nabla_{x_i} \tilde f_i(x_i, x) = \nabla_{x_i} f(x), \ \forall x \in X$ (here $x_i$ denotes the $i$-th block of $x$).

• $\nabla_{x_i} \tilde f_i(x_i, \cdot)$ is Lipschitz continuous on $X$ for all $x_i \in X_i$ with constant $\tilde L$, i.e.,
$$\|\nabla_{x_i} \tilde f_i(x_i, y) - \nabla_{x_i} \tilde f_i(x_i, z)\| \le \tilde L \|y - z\|, \quad \forall y, z \in X, \ \forall x_i \in X_i, \ \forall i. \qquad (4)$$

For instance, the following traditional proximal/quadratic approximations of $f(\cdot)$ satisfy the above assumptions when the feasible set is compact and $f(\cdot)$ is twice continuously differentiable:

• $\tilde f_i(x_i, y) = \langle \nabla_{y_i} f(y), \, x_i - y_i \rangle + \frac{\alpha}{2} \|x_i - y_i\|^2$, for $\alpha$ large enough;

• $\tilde f_i(x_i, y) = f(x_i, y_{-i}) + \frac{\alpha}{2} \|x_i - y_i\|^2$.

For other practically useful approximations of $f(\cdot)$ and their stochastic/incremental counterparts, see [21, 25, 26].

With the recent advances in the development of parallel processing machines, it is desirable to take advantage of multi-core machines by updating multiple blocks simultaneously in (3). Unfortunately, naively updating multiple blocks simultaneously using the approach (3) does not result in a convergent algorithm. Hence, we suggest modifying the update rule by using a well-chosen step-size. More precisely, we propose Algorithm 1 for solving the optimization problem (1).

Algorithm 1 Parallel Successive Convex Approximation (PSCA) Algorithm
  find a feasible point $x^0 \in X$ and set $r = 0$
  for $r = 0, 1, 2, \ldots$ do
    choose a subset $\mathcal{S}^r \subseteq \{1, \ldots, n\}$
    calculate $\hat x_i^r = \arg\min_{x_i \in X_i} \tilde h_i(x_i, x^r), \ \forall i \in \mathcal{S}^r$
    set $x_i^{r+1} = x_i^r + \gamma^r (\hat x_i^r - x_i^r), \ \forall i \in \mathcal{S}^r$, and set $x_i^{r+1} = x_i^r, \ \forall i \notin \mathcal{S}^r$
  end for
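To make Algorithm 1 concrete, here is a minimal sketch of the PSCA loop on a synthetic Lasso instance with scalar blocks, using the surrogate $\tilde f_i(x_i, y) = f(x_i, y_{-i}) + \frac{\alpha}{2}(x_i - y_i)^2$, whose block minimizer is a soft-thresholding step, together with a uniformly random block subset and a constant step-size $\gamma$. The subset size, $\alpha$, $\gamma$, and all helper names are assumptions made for illustration; this is not the authors' reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 1000
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
lam, alpha, gamma = 0.1, 1.0, 0.5          # l1 penalty, proximal weight, step-size (placeholders)
col_sq = np.sum(A ** 2, axis=0)            # ||a_i||^2 for every scalar block i

def soft_threshold(v, t):
    """Soft-thresholding: argmin_x 0.5*(x - v)**2 + t*|x|."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x = np.zeros(n)
residual = A @ x - b                        # maintained so each block update costs O(m)

for r in range(100):
    S = rng.choice(n, size=40, replace=False)    # randomized selection of the subset S^r
    x_hat = np.empty(len(S))
    for k, i in enumerate(S):                    # best responses w.r.t. the same iterate x^r;
        r_i = residual - A[:, i] * x[i]          # they are independent, hence parallelizable
        c = col_sq[i] + alpha
        v = (alpha * x[i] - A[:, i] @ r_i) / c
        x_hat[k] = soft_threshold(v, lam / c)    # closed-form minimizer of h_i-tilde(., x^r)
    for k, i in enumerate(S):                    # damped update x_i^{r+1} = x_i^r + gamma*(xhat_i - x_i^r)
        new_xi = x[i] + gamma * (x_hat[k] - x[i])
        residual += A[:, i] * (new_xi - x[i])
        x[i] = new_xi
```

The two inner loops are deliberately separated: every best response in $\mathcal{S}^r$ is computed against the same iterate $x^r$, which is exactly what makes the block subproblems independent and lets a multi-core implementation solve them in parallel before the damped update is applied.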
The procedure for selecting the subset $\mathcal{S}^r$ is intentionally left unspecified in Algorithm 1. This selection could be based on different rules. Reference [9] suggests the greedy variable selection rule, where at each iteration the best responses of all the variables are calculated and, at the end, only the block variables with the largest amount of improvement are updated. A drawback of this approach is the overhead caused by the calculation of all of the best responses at each iteration; this overhead is especially computationally demanding when the size of the problem is huge. In contrast to [9], we suggest the following cyclic and randomized variable selection rules:

• Cyclic: Given a partition $\{\mathcal{T}_0, \ldots, \mathcal{T}_{m-1}\}$ of the set $\{1, 2, \ldots, n\}$ with $\mathcal{T}_i \cap \mathcal{T}_j = \emptyset$ for all $i \neq j$ and $\bigcup_{\ell=0}^{m-1} \mathcal{T}_\ell = \{1, 2, \ldots, n\}$, the variable selection rule is called cyclic if
$$\mathcal{S}^{mr+\ell} = \mathcal{T}_\ell, \quad \forall \ell = 0, 1, \ldots, m-1, \ \forall r.$$

• Randomized: The variable selection rule is called randomized if at each iteration the blocks are chosen at random, possibly based on the past iterates, so that
$$\Pr(j \in \mathcal{S}^r \mid x^r, x^{r-1}, \ldots, x^0) = p_j^r \ge p_{\min} > 0, \quad \forall j = 1, 2, \ldots, n, \ \forall r.$$

3 Convergence Analysis: Asymptotic Behavior

We first make the standard assumption that $\nabla f(\cdot)$ is Lipschitz continuous with constant $L_{\nabla f}$, i.e.,
$$\|\nabla f(x) - \nabla f(y)\| \le L_{\nabla f} \|x - y\|,$$
and assume that $-\infty < \inf_{x \in X} h(x)$. Let us also define $\bar x$ to be a stationary point of (1) if there exists $d \in \partial g(\bar x)$ such that $\langle \nabla f(\bar x) + d, \, x - \bar x \rangle \ge 0, \ \forall x \in X$, i.e., the first-order optimality condition is satisfied at the point $\bar x$. The following lemma will help us study the asymptotic convergence of the PSCA algorithm.

Lemma 1 [9, Lemma 2] Define the mapping $\hat x(\cdot) : X \mapsto X$ as $\hat x(y) = (\hat x_i(y))_{i=1}^n$ with $\hat x_i(y) = \arg\min_{x_i \in X_i} \tilde h_i(x_i, y)$. Then the mapping $\hat x(\cdot)$ is Lipschitz continuous with constant $\hat L = \sqrt{n}\,\tilde L / \tau$, i.e.,
$$\|\hat x(y) - \hat x(z)\| \le \hat L \|y - z\|, \quad \forall y, z \in X.$$

Having derived the above result, we are now ready to state our first result, which studies the limiting behavior of the PSCA algorithm. This result is based on the sufficient decrease of the objective function, which has also been exploited in [9] for the greedy variable selection rule.

Theorem 1 Assume $\gamma^r \in (0, 1]$, $\sum_{r=1}^{\infty} \gamma^r = +\infty$, and $\limsup_{r \to \infty} \gamma^r < \bar\gamma \triangleq \min\{\tau / L_{\nabla f}, \ \tau / (\tau + \sqrt{n}\,\tilde L)\}$. Suppose either the cyclic or the randomized block selection rule is employed. For the cyclic update rule, assume further that $\{\gamma^r\}_{r=1}^{\infty}$ is a monotonically decreasing sequence. Then every limit point of the iterates is a stationary point of (1) – deterministically for the cyclic update rule and almost surely for the randomized block selection rule.

Proof Using the standard sufficient decrease argument (see the supplementary materials), one can show that
$$h(x^{r+1}) \le h(x^r) + \frac{\gamma^r(-\tau + \gamma^r L_{\nabla f})}{2} \, \|\hat x^r - x^r\|^2_{\mathcal{S}^r}. \qquad (5)$$
Since $\limsup_{r \to \infty} \gamma^r < \bar\gamma$, for sufficiently large $r$ there exists $\beta > 0$ such that
$$h(x^{r+1}) \le h(x^r) - \beta \gamma^r \|\hat x^r - x^r\|^2_{\mathcal{S}^r}. \qquad (6)$$
Taking the conditional expectation of both sides implies
$$\mathbb{E}[h(x^{r+1}) \mid x^r] \le h(x^r) - \beta \gamma^r \, \mathbb{E}\bigg[ \sum_{i=1}^n R_i^r \, \|\hat x_i^r - x_i^r\|^2 \ \bigg|\ x^r \bigg], \qquad (7)$$
where $R_i^r$ is a Bernoulli random variable which is one if $i \in \mathcal{S}^r$ and zero otherwise, with $\mathbb{E}[R_i^r \mid x^r] = p_i^r$. 
Clearly,\n\ni and therefore,\n\nE[h(xr+1) | xr] \u2264 h(xr) \u2212 \u03b2\u03b3rpmin(cid:107)(cid:98)xr \u2212 xr(cid:107)2, \u2200r.\n\n(8)\nThus {h(xr)} is a supermartingale with respect to the natural history; and by the supermartingale\nconvergence theorem [27, Proposition 4.2], h(xr) converges and we have\n\n\u221e(cid:88)\n\n\u03b3r(cid:107)(cid:98)xr \u2212 xr(cid:107)2 < \u221e,\n\nalmost surely.\n\nr=1\n\n(cid:80)\u221e\nr=1 \u03b3r(cid:107)(cid:98)xr \u2212 xr(cid:107)2 < \u221e. Fix a realization in that set. The equation (9) simply implies that,\nfor the \ufb01xed realization, lim inf r\u2192\u221e (cid:107)(cid:98)xr \u2212 xr(cid:107) = 0, since(cid:80)\nLet us now restrict our analysis to the set of probability one for which h(xr) converges and\nresult by proving that limr\u2192\u221e (cid:107)(cid:98)xr \u2212 xr(cid:107) = 0. Suppose the contrary that there exists \u03b4 > 0 such\nr \u03b3r = \u221e. Next we strengthen this\n\n(5)\n\n(6)\n\n(7)\n\n(9)\n\n4\n\n\fthat \u2206r (cid:44) (cid:107)(cid:98)xr \u2212 xr(cid:107) \u2265 2\u03b4 in\ufb01nitely often. Since lim inf r\u2192\u221e \u2206r = 0, there exists a subset of\n\nindices K and {ir} such that for any r \u2208 K,\n\n\u2206r < \u03b4,\n\n2\u03b4 < \u2206ir ,\n\nand \u03b4 \u2264 \u2206j \u2264 2\u03b4, \u2200j = r + 1, . . . , ir \u2212 1.\n\n(10)\n\nClearly,\n\n\u03b4 \u2212 \u2206r\n\n(i)\u2264 \u2206r+1 \u2212 \u2206r = (cid:107)(cid:98)xr+1 \u2212 xr+1(cid:107) \u2212 (cid:107)(cid:98)xr \u2212 xr(cid:107) (ii)\u2264 (cid:107)(cid:98)xr+1 \u2212(cid:98)xr(cid:107) + (cid:107)xr+1 \u2212 xr(cid:107)\n(iii)\u2264 (1 +(cid:98)L)(cid:107)xr+1 \u2212 xr(cid:107) (iv)\n\n= (1 +(cid:98)L)\u03b3r(cid:107)(cid:98)xr \u2212 xr(cid:107) \u2264 (1 +(cid:98)L)\u03b3r\u03b4,\n\n(11)\n\nwhere (i) and (ii) are due to (10) and the triangle inequality, respectively. The inequality (iii)\nis the result of Lemma 1; and (iv) is followed from the algorithm iteration update rule. Since\n1+(cid:98)L\nlim supr\u2192\u221e \u03b3r < 1\n\n, the above inequality implies that there exists an \u03b1 > 0 such that\n\n\u2206r > \u03b1,\n\n(12)\n\nlimr\u2192\u221e(cid:80)ir\u22121\n\nfor all r large enough. Furthermore, since the chosen realization satis\ufb01es (9), we have that\n\nt=r \u03b3t(\u2206t)2 = 0; which combined with (10) and (12), implies\n\nir\u22121(cid:88)\n\nt=r\n\nlim\nr\u2192\u221e\n\n\u03b3t = 0.\n\n(13)\n\nOn the other hand, using the similar reasoning as in above, one can write\n\n\u03b3t(cid:107)(cid:98)xt \u2212 xt(cid:107) \u2264 2\u03b4(1 +(cid:98)L)\n\n\u03b4 < \u2206ir \u2212 \u2206r = (cid:107)(cid:98)xir \u2212 xir(cid:107) \u2212 (cid:107)(cid:98)xr \u2212 xr(cid:107) \u2264 (cid:107)(cid:98)xir \u2212(cid:98)xr(cid:107) + (cid:107)xir \u2212 xr(cid:107)\nir\u22121(cid:88)\n\u2264 (1 +(cid:98)L)\nand hence lim inf r\u2192\u221e(cid:80)ir\u22121\ndoes not hold and we must have limr\u2192\u221e (cid:107)(cid:98)xr \u2212 xr(cid:107) = 0, almost surely. Now consider a limit\nj=1 converging to \u00afx. Using the de\ufb01nition of (cid:98)xrj , we have\nlimj\u2192\u221e(cid:101)hi((cid:98)xrj\nthe fact that limr\u2192\u221e (cid:107)(cid:98)xr \u2212 xr(cid:107) = 0, almost surely, we obtain(cid:101)hi(\u00afxi, \u00afx) \u2264 (cid:101)hi(xi, \u00afx),\n\n\u2200xi \u2208 Xi, \u2200i. Therefore, by letting j \u2192 \u221e and using\n\u2200xi \u2208\n\nt=r \u03b3t > 0, which contradicts (13). 
Therefore the contrary assumption\n\npoint \u00afx with the subsequence {xrj}\u221e\n\ni , xrj ) \u2264 (cid:101)hi(xi, xrj ),\n\nXi, \u2200i, almost surely; which in turn, using the gradient consistency assumption, implies\n\nt=r\n\nir\u22121(cid:88)\n\nt=r\n\n\u03b3t,\n\n(cid:104)\u2207f (\u00afx) + d, x \u2212 \u00afx(cid:105) \u2265 0, \u2200x \u2208 X , almost surely,\n\nfor some d \u2208 \u2202g(\u00afx), which completes the proof for the randomized block selection rule.\nNow consider the cyclic update rule with a limit point \u00afx. Due to the suf\ufb01cient decrease bound\n(6), we have limr\u2192\u221e h(xr) = h(\u00afx). Furthermore, by taking the summation over (6), we obtain\nk=1 to be the subsequence of\nk=1 \u03b3rk = \u221e,\ni (cid:107) = 0. Repeating the\ni \u2212 xrk\nabove argument with some slight modi\ufb01cations, which are omitted due to lack of space, we can\ni (cid:107) = 0 implying that the limit point \u00afx is a stationary point of (1). (cid:4)\n\n(cid:80)\u221e\nr=1 \u03b3r(cid:107)(cid:98)xr \u2212 xr(cid:107)2\niterations that block i is updated in. Clearly,(cid:80)\u221e\nSr < \u221e. Consider a \ufb01xed block i and de\ufb01ne {rk}\u221e\nsince {\u03b3r} is monotonically decreasing. Therefore, lim inf k\u2192\u221e (cid:107)(cid:98)xrk\ni \u2212 xrk\nshow that limk\u2192\u221e (cid:107)(cid:98)xrk\n\ni (cid:107)2 < \u221e and(cid:80)\u221e\n\nk=1 \u03b3rk(cid:107)(cid:98)xrk\n\ni \u2212 xrk\n\nRemark 1 Theorem 1 covers both diminishing and constant step-size selection rule; or the combi-\nnation of the two, i.e., decreasing the step-size until it is less than the constant \u00af\u03b3. It is also worth\nnoting that the diminishing step-size rule is especially useful when the knowledge of the problem\u2019s\nconstants L, \u02dcL, and \u03c4 is not available.\n\n4 Convergence Analysis: Iteration Complexity\n\nIn this section, we present iteration complexity analysis of the algorithm for both convex and non-\nconvex cases.\n\n5\n\n\f4.1 Convex Case\nWhen the function f (\u00b7) is convex, the overall objective function will become convex; and as a\nresult of Theorem 1, if a limit point exists, it is a global minimizer of (1).\nIn this scenario, it\nis desirable to derive the iteration complexity bounds of the algorithm. Note that our algorithm\nemploys linear combination of the two consecutive points at each iteration and hence it is different\nthan the existing algorithms in [2, 14\u201320]. Therefore, not only in the cyclic case, but also in the\nrandomized scenario, the iteration complexity analysis of PSCA is different than the existing results\nand should be investigated. Let us make the following assumptions for our iteration complexity\nanalysis:\n\nL\u2207f\n\n, \u2200r.\n\n\u2022 The step-size is constant with \u03b3r = \u03b3 < \u03c4\n\u2022 The level set {x | h(x) \u2264 h(x0)} is compact and the next two assumptions hold in this set.\n\u2022 The nonsmooth function g(\u00b7) is Lipschitz continuous, i.e., |g(x) \u2212 g(y)| \u2264 Lg(cid:107)x \u2212\ny(cid:107), \u2200x, y \u2208 X . 
This assumption is satis\ufb01ed in many practical problems such as (group)\nLasso.\n\nLemma 2 (Suf\ufb01cient Descent) There exists(cid:98)\u03b2,(cid:101)\u03b2 > 0, such that for all r \u2265 1, we have\n\n\u2022 The gradient of the approximation function (cid:101)fi(\u00b7, y) is uniformly Lipschitz with constant\nLi, i.e., (cid:107)\u2207xi(cid:101)fi(xi, y) \u2212 \u2207x(cid:48)\n\u2022 For randomized rule: E[h(xr+1) | xr] \u2264 h(xr) \u2212(cid:98)\u03b2(cid:107)(cid:98)xr \u2212 xr(cid:107)2.\n\u2022 For cyclic rule: h(xm(r+1)) \u2264 h(xmr) \u2212(cid:101)\u03b2(cid:107)xm(r+1) \u2212 xmr(cid:107)2.\n\ni, y)(cid:107) \u2264 Li(cid:107)xi \u2212 x(cid:48)\n\ni(cid:101)fi(x(cid:48)\n\ni(cid:107), \u2200xi, x(cid:48)\n\ni \u2208 Xi.\n\nProof The above result is an immediate consequence of (6) with(cid:98)\u03b2 (cid:44) \u03b2\u03b3pmin and(cid:101)\u03b2 (cid:44) \u03b2\nDue to the bounded level set assumption, there must exist constants (cid:98)Q, Q, R > 0 such that\nfor all xr. Next we use the constants Q, (cid:98)Q and R to bound the cost-to-go in the algorithm.\n\n(cid:107)\u2207xi(cid:101)fi((cid:98)xr, xr)(cid:107) \u2264 (cid:98)Q,\n\n(cid:107)xr \u2212 x\u2217(cid:107) \u2264 R,\n\n(cid:107)\u2207f (xr)(cid:107) \u2264 Q,\n\n\u03b3 .\n\n(cid:4)\n\n(14)\n\nLemma 3 (Cost-to-go Estimate) For all r \u2265 1, we have\n\n\u2022 For randomized rule:(cid:0)E[h(xr+1) | xr] \u2212 h(x\u2217)(cid:1)2 \u2264 2(cid:0)(Q + Lg)2 + nL2R2(cid:1)(cid:107)(cid:98)xr\u2212xr(cid:107)2\n\u2022 For cyclic rule:(cid:0)h(xm(r+1)) \u2212 h(x\u2217)(cid:1)2 \u2264 3n \u03b8(1\u2212\u03b3)2\n\n(cid:107)xm(r+1) \u2212 xmr(cid:107)2\n\nfor any optimal point x\u2217, where L (cid:44) maxi{Li} and \u03b8 (cid:44) L2\n\n\u03b32\n\ng + (cid:98)Q2 + 2nR2 \u02dcL2\n\n\u03b32\n\n(1\u2212\u03b3)2 + 2R2L2.\n\nProof Please see the supplementary materials for the proof.\n\nLemma 2 and Lemma 3 lead to the iteration complexity bound in the following theorem. The proof\nsteps of this result are similar to the ones in [28] and therefore omitted here for space reasons.\n\nTheorem 2 De\ufb01ne \u03c3 (cid:44) (\u03b3L\u2207f\u2212\u03c4 )\u03b3pmin\n\n4((Q+Lg)2+nL2R2) and(cid:101)\u03c3 (cid:44) (\u03b3L\u2207f\u2212\u03c4 )\u03b3\n\n6n\u03b8(1\u2212\u03b3)2 . Then\n\n\u2022 For randomized update rule: E [h(xr)] \u2212 h(x\u2217) \u2264 max{4\u03c3\u22122,h(x0)\u2212h(x\u2217),2}\n\u2022 For cyclic update rule: h(xmr) \u2212 h(x\u2217) \u2264 max{4(cid:101)\u03c3\u22122,h(x0)\u2212h(x\u2217),2}\n\n\u03c3\n\n1\nr .\n\n1\nr .\n\n(cid:101)\u03c3\n\n6\n\n\f4.2 Non-convex Case\n\nIn this subsection we study the iteration complexity of the proposed randomized algorithm for the\ngeneral nonconvex function f (\u00b7) assuming constant step-size selection rule. This analysis is only\nfor the randomized block selection rule. Since in the nonconvex scenario, the iterates may not\nconverge to the global optimum point, the closeness to the optimal solution cannot be considered\nfor the iteration complexity analysis. Instead, inspired by [29] where the size of the gradient of the\nobjective function is used as a measure of optimality, we consider the size of the objective proximal\ngradient as a measure of optimality. 
More precisely, we de\ufb01ne\n\n(cid:101)\u2207h(x) = x \u2212 arg min\n\ny\u2208X\n\n(cid:26)\n\n(cid:27)\n\n1\n2\n\n(cid:4)\n\n7\n\ntime that\n\ni , we have\n\n\u2200xi \u2208 Xi.\n\n(cid:104)\u2207f (x), y \u2212 x(cid:105) + g(y) +\n\n(cid:107)y \u2212 x(cid:107)2\n\n.\n\ni ) \u2265 0,\n\n\u2200xi \u2208 Xi.\n\nand h\u2217 = minx\u2208X h(x).\n\n(cid:98)\u03b2\ni \u2212(cid:101)yr\n\ni(cid:107)2. Clearly, (cid:101)\u2207h(xr) = (xr\n\n(cid:44) arg minyi\u2208Xi(cid:104)\u2207xi f (xr), yi \u2212\ni=1. The \ufb01rst order optimality condition\n\ni\n\ni )n\n\nTheorem 3 Consider randomized block selection rule. De\ufb01ne T\u0001 to be the \ufb01rst\n\nthe objective if g \u2261 0 and X = Rn. The following theorem, which studies the decrease rate of\n\nClearly, (cid:101)\u2207h(x) = 0 when x is a stationary point. Moreover, (cid:101)\u2207h(\u00b7) coincides with the gradient of\n(cid:107)(cid:101)\u2207h(x)(cid:107), could be viewed as an iteration complexity analysis of the randomized PSCA.\nE[(cid:107)(cid:101)\u2207h(xr)(cid:107)2] \u2264 \u0001. Then T\u0001 \u2264 \u03ba/\u0001 where \u03ba (cid:44) 2(L2+2L+2)(h(x0)\u2212h\u2217)\nProof To simplify the presentation of the proof, let us de\ufb01ne(cid:101)yr\ni(cid:105) + gi(yi) + 1\n2(cid:107)yi \u2212 xr\nxr\ni (cid:105) + gi(xi) \u2212 gi((cid:101)yr\ni , xi \u2212(cid:101)yr\n(cid:104)\u2207xif (xr) +(cid:101)yr\nof the above optimization problem implies\nFurthermore, based on the de\ufb01nition of(cid:98)xr\ni \u2212 xr\n(cid:104)\u2207xi(cid:101)fi((cid:98)xr\ni , xr), xi \u2212(cid:98)xr\ni(cid:105) + gi(xi) \u2212 gi((cid:98)xr\ni and(cid:101)yr\nPlugging in the points(cid:98)xr\ni ) \u2265 0,\n(16)\n(cid:104)\u2207xi(cid:101)fi((cid:98)xr\ni ,(cid:101)yr\ni \u2212(cid:101)yr\ni \u2212(cid:98)xr\ni in (15) and (16); and summing up the two equations will yield to\ni , xr) \u2212 \u2207xif (xr) + xr\ni(cid:105) \u2265 0.\n(cid:104)\u2207xi(cid:101)fi((cid:98)xr\ni , xr) \u2212 \u2207xi(cid:101)fi(xr\ni ,(cid:101)yr\ni \u2212(cid:101)yr\ni +(cid:98)xr\ni \u2212(cid:98)xr\ni \u2212(cid:98)xr\nUsing the gradient consistency assumption, we can write\nor equivalently, (cid:104)\u2207xi(cid:101)fi((cid:98)xr\ni , xr) \u2212 \u2207xi(cid:101)fi(xr\ni \u2212(cid:98)xr\ni ,(cid:101)yr\ni \u2212(cid:98)xr\ni(cid:105) \u2265 (cid:107)(cid:98)xr\n(cid:16)(cid:107)\u2207xi(cid:101)fi((cid:98)xr\ni(cid:107)(cid:17)(cid:107)(cid:101)yr\ni , xr) \u2212 \u2207xi(cid:101)fi(xr\ni \u2212(cid:98)xr\ni \u2212(cid:98)xr\ni(cid:107) \u2265 (cid:107)(cid:98)xr\ni , xr)(cid:107) + (cid:107)xr\nSince the function (cid:101)fi(\u00b7, x) is Lipschitz, we must have\ni \u2212(cid:98)xr\n(cid:107)(cid:98)xr\ni \u2212(cid:101)yr\ni(cid:107)\ni (cid:107) \u2264 (1 + Li)(cid:107)xr\nn(cid:88)\n(cid:0)(cid:107)xr\ni (cid:107)2(cid:1)\n(cid:107)(cid:101)\u2207h(xr)(cid:107)2 =\ni \u2212(cid:101)yr\ni \u2212(cid:101)yr\ni(cid:107)2 + (cid:107)(cid:98)xr\ni \u2212(cid:98)xr\ni (cid:107)2 \u2264 2\n(cid:0)(cid:107)xr\ni(cid:107)2(cid:1) \u2264 2(2 + 2L + L2)(cid:107)(cid:98)xr \u2212 xr(cid:107)2.\ni \u2212(cid:98)xr\ni \u2212(cid:98)xr\ni(cid:107)2 + (1 + Li)2(cid:107)xr\nE(cid:104)(cid:107)(cid:101)\u2207h(xr)(cid:107)2(cid:105) \u2264 T(cid:88)\nT(cid:88)\n2(2 + 2L + L2)E(cid:2)(cid:107)(cid:98)xr \u2212 xr(cid:107)2(cid:3)\n\u2264 T(cid:88)\nE(cid:2)h(xr) \u2212 h(xr+1)(cid:3) \u2264 2(2 + 2L + L2)\n(cid:98)\u03b2\n(cid:2)h(x0) \u2212 h\u2217(cid:3) = \u03ba,\n\ni(cid:105) \u2265 0,\ni \u2212(cid:101)yr\ni (cid:107)2. 
Applying\ni \u2212(cid:101)yr\ni (cid:107)2.\n\ni , xr) + xr\nCauchy-Schwarz and the triangle inequality will yield to\n\nUsing the inequality (17), the norm of the proximal gradient of the objective can be bounded by\n\nCombining the above inequality with the suf\ufb01cient decrease bound in (7), one can write\n\nE(cid:2)h(x0) \u2212 h(xT +1)(cid:3)\n\n(cid:107)xr\n\nn(cid:88)\nn(cid:88)\n\ni=1\n\ni=1\n\n\u2264 2\n\nr=0\n\nr=1\n\n2(2 + 2L + L2)\n\ni , xr) + xr\n\n(15)\n\n(17)\n\ni=1\n\n(cid:98)\u03b2\n\nr=0\n\n(cid:98)\u03b2\n\n\u2264 2(2 + 2L + L2)\nwhich implies that T\u0001 \u2264 \u03ba\n\u0001 .\n\n\f5 Numerical Experiments:\n\nIn this short section, we compare the numerical performance of the proposed algorithm with the\nclassical serial BCD methods. The algorithms are evaluated over the following Lasso problem:\n\nmin\n\nx\n\n1\n2\n\n(cid:107)Ax \u2212 b(cid:107)2\n\n2 + \u03bb(cid:107)x(cid:107)1,\n\n(cid:107)xi \u2212 yi(cid:107)2\n\n\u03b1\n2\n\nwhere the matrix A is generated according to the Nesterov\u2019s approach [5]. Two problem instances\nare considered: A \u2208 R2000\u00d710,000 with 1% sparsity level in x\u2217 and A \u2208 R1000\u00d7100,000 with 0.1%\nsparsity level in x\u2217. The approximation functions are chosen similar to the numerical experiments\nin [9], i.e., block size is set to one (mi = 1, \u2200i) and the approximation function\n\n(cid:101)f (xi, y) = f (xi, y\u2212i) +\n2(cid:107)Ax \u2212 b(cid:107)2 is the smooth part of the objective function. We choose\nis considered, where f (x) = 1\nconstant step-size \u03b3 and proximal coef\ufb01cient \u03b1. In general, careful selection of the algorithm pa-\nrameters results in better numerical convergence rate. The smaller values of step-size \u03b3 will result\nin less zigzag behavior for the convergence path of the algorithm; however, too small step sizes will\nclearly slow down the convergence speed. Furthermore, in order to make the approximation func-\ntion suf\ufb01ciently strongly convex, we need to choose \u03b1 large enough. However, choosing too large \u03b1\nvalues enforces the next iterates to stay close to the current iterate and results in slower convergence\nspeed; see the supplementary materials for related examples.\nFigure 1 and Figure 2 illustrate the behavior of cyclic and randomized parallel BCD method as\ncompared with their serial counterparts. The serial methods \u201cCyclic BCD\u201d and \u201cRandomized BCD\u201d\nare based on the update rule in (2) with the cyclic and randomized block selection rules, respectively.\nThe variable q shows the number of processors and on each processor we update 40 scalar variables\nin parallel. As can be seen in Figure 1 and Figure 2, parallelization of the BCD algorithm results in\nmore ef\ufb01cient algorithm. However, the computational gain does not grow linearly with the number\nof processors. In fact, we can see that after some point, the increase in the number of processors\nlead to slower convergence. This fact is due to the communication overhead among the processing\nnodes which dominates the computation time; see the supplementary materials for more numerical\nexperiments on this issue.\n\nFigure 1: Lasso Problem: A \u2208 R2,000\u00d710,000\n\nFigure 2: Lasso Problem: A \u2208 R1,000\u00d7100,000\n\nAcknowledgments: The authors are grateful to the University of Minnesota Graduate School Doc-\ntoral Dissertation Fellowship and AFOSR, grant number FA9550-12-1-0340 for the support during\nthis research.\n\nReferences\n[1] Y. Nesterov. 
Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
[2] P. Richtárik and M. Takáč. Efficient serial and parallel coordinate descent methods for huge-scale truss topology design. In Operations Research Proceedings, pages 27–32. Springer, 2012.
[3] Y. T. Lee and A. Sidford. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In 54th Annual Symposium on Foundations of Computer Science (FOCS), pages 147–156. IEEE, 2013.
[4] I. Necoara and D. Clipici. Efficient parallel coordinate descent algorithm for convex optimization problems with separable constraints: application to distributed MPC. Journal of Process Control, 23(3):243–253, 2013.
[5] Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.
[6] P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117(1-2):387–423, 2009.
[7] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[8] S. J. Wright, R. D. Nowak, and M. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479–2493, 2009.
[9] F. Facchinei, S. Sagratella, and G. Scutari. Flexible parallel algorithms for big data optimization. arXiv preprint arXiv:1311.2444, 2013.
[10] J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin. Parallel coordinate descent for l1-regularized loss minimization. arXiv preprint arXiv:1105.5379, 2011.
[11] C. Scherrer, A. Tewari, M. Halappanavar, and D. Haglin. Feature clustering for accelerating parallel coordinate descent. In NIPS, pages 28–36, 2012.
[12] C. Scherrer, M. Halappanavar, A. Tewari, and D. Haglin. Scaling up coordinate descent algorithms for large l1 regularization problems. arXiv preprint arXiv:1206.6409, 2012.
[13] Z. Peng, M. Yan, and W. Yin. Parallel and distributed sparse optimization. Preprint, 2013.
[14] I. Necoara and D. Clipici. Distributed coordinate descent methods for composite minimization. arXiv preprint arXiv:1312.5302, 2013.
[15] P. Richtárik and M. Takáč. Parallel coordinate descent methods for big data optimization. arXiv preprint arXiv:1212.0873, 2012.
[16] P. Richtárik and M. Takáč. On optimal probabilities in stochastic coordinate descent methods. arXiv preprint arXiv:1310.3438, 2013.
[17] O. Fercoq and P. Richtárik. Accelerated, parallel and proximal coordinate descent. arXiv preprint arXiv:1312.5799, 2013.
[18] O. Fercoq, Z. Qu, P. Richtárik, and M. Takáč. Fast distributed coordinate descent for non-strongly convex losses. arXiv preprint arXiv:1405.5300, 2014.
[19] O. Fercoq and P. Richtárik. Smooth minimization of nonsmooth functions with parallel coordinate descent methods. arXiv preprint arXiv:1309.5885, 2013.
[20] A. Patrascu and I. Necoara. A random coordinate descent algorithm for large-scale sparse nonconvex optimization. In European Control Conference (ECC), pages 2789–2794. IEEE, 2013.
[21] M. Razaviyayn, M. Hong, and Z.-Q. Luo. A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23(2):1126–1153, 2013.
[22] F. Niu, B. Recht, C. Ré, and S. J. Wright. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. Advances in Neural Information Processing Systems, 24:693–701, 2011.
[23] J. Liu, S. J. Wright, C. Ré, and V. Bittorf. An asynchronous parallel stochastic coordinate descent algorithm. arXiv preprint arXiv:1311.1873, 2013.
[24] J. Mairal. Optimization with first-order surrogate functions. arXiv preprint arXiv:1305.3120, 2013.
[25] J. Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. arXiv preprint arXiv:1402.4419, 2014.
[26] M. Razaviyayn, M. Sanjabi, and Z.-Q. Luo. A stochastic successive minimization method for nonsmooth nonconvex optimization with applications to transceiver design in wireless communication networks. arXiv preprint arXiv:1307.4457, 2013.
[27] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-dynamic programming. Athena Scientific, 1996.
[28] M. Hong, X. Wang, M. Razaviyayn, and Z.-Q. Luo. Iteration complexity analysis of block coordinate descent methods. arXiv preprint arXiv:1310.6957, 2013.
[29] Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer, 2004.", "award": [], "sourceid": 789, "authors": [{"given_name": "Meisam", "family_name": "Razaviyayn", "institution": "University of Minnesota"}, {"given_name": "Mingyi", "family_name": "Hong", "institution": "Iowa State University"}, {"given_name": "Zhi-Quan", "family_name": "Luo", "institution": "University of Minnesota, Twin Cities"}, {"given_name": "Jong-Shi", "family_name": "Pang", "institution": null}]}