{"title": "Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation", "book": "Advances in Neural Information Processing Systems", "page_first": 612, "page_last": 620, "abstract": "Many machine learning and signal processing problems can be formulated as linearly constrained convex programs, which could be efficiently solved by the alternating direction method (ADM). However, the subproblems in ADM are usually easily solvable only when the linear mappings in the constraints are identities. To address this issue, we propose a linearized ADM (LADM) method by linearizing the quadratic penalty term and adding a proximal term when solving the subproblems. For fast convergence, we also allow the penalty to change adaptively according to a novel update rule. We prove the global convergence of LADM with adaptive penalty (LADMAP). As an example, we apply LADMAP to solve low-rank representation (LRR), which is an important subspace clustering technique yet suffers from high computation cost. By combining LADMAP with a skinny SVD representation technique, we are able to reduce the complexity $O(n^3)$ of the original ADM based method to $O(rn^2)$, where $r$ and $n$ are the rank and size of the representation matrix, respectively, hence making LRR possible for large scale applications. Numerical experiments verify that for LRR our LADMAP based methods are much faster than state-of-the-art algorithms.", "full_text": "Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation

Zhouchen Lin
Visual Computing Group
Microsoft Research Asia

Risheng Liu
Zhixun Su
School of Mathematical Sciences
Dalian University of Technology

Abstract

Many machine learning and signal processing problems can be formulated as linearly constrained convex programs, which could be efficiently solved by the alternating direction method (ADM).
However, the subproblems in ADM are usually easily solvable only when the linear mappings in the constraints are identities. To address this issue, we propose a linearized ADM (LADM) method by linearizing the quadratic penalty term and adding a proximal term when solving the subproblems. For fast convergence, we also allow the penalty to change adaptively according to a novel update rule. We prove the global convergence of LADM with adaptive penalty (LADMAP). As an example, we apply LADMAP to solve low-rank representation (LRR), which is an important subspace clustering technique yet suffers from high computation cost. By combining LADMAP with a skinny SVD representation technique, we are able to reduce the complexity O(n³) of the original ADM based method to O(rn²), where r and n are the rank and size of the representation matrix, respectively, hence making LRR possible for large scale applications. Numerical experiments verify that for LRR our LADMAP based methods are much faster than state-of-the-art algorithms.

1 Introduction

Recently, compressive sensing [5] and sparse representation [19] have been hot research topics and have also found abundant applications in signal processing and machine learning. Many of the problems in these fields can be formulated as the following linearly constrained convex programs:

min_{x,y} f(x) + g(y), s.t. A(x) + B(y) = c,    (1)

where x, y and c could be either vectors or matrices, f and g are convex functions (e.g., the nuclear norm ‖·‖_* [2], Frobenius norm ‖·‖, l_{2,1} norm ‖·‖_{2,1} [13], and l_1 norm ‖·‖_1), and A and B are linear mappings.
Although the interior point method can be used to solve many convex programs, it may suffer from unbearably high computation cost when handling large scale problems.
For example, when using CVX, an interior point based toolbox, to solve nuclear norm minimization problems (namely, f(X) = ‖X‖_* in (1)), such as matrix completion [4], robust principal component analysis [18] and their combination [3], the complexity of each iteration is O(n⁶), where n × n is the matrix size. To overcome this issue, first-order methods are often preferred. The accelerated proximal gradient (APG) algorithm [16] is a popular technique due to its guaranteed O(k⁻²) convergence rate, where k is the iteration number. The alternating direction method (ADM) has also regained a lot of attention [11, 15]. It updates the variables alternately by minimizing the augmented Lagrangian function with respect to the variables in a Gauss-Seidel manner. While APG has to convert (1) into an approximate unconstrained problem by adding the linear constraints to the objective function as a penalty, hence only producing an approximate solution to (1), ADM can solve (1) exactly. However, when A or B is not the identity mapping, the subproblems in ADM may not have closed form solutions, so solving them is cumbersome.
In this paper, we propose a linearized version of ADM (LADM) to overcome the difficulty in solving the subproblems. The idea is to replace the quadratic penalty term with its linearization plus a proximal term. We also allow the penalty parameter to change adaptively and propose a novel and simple rule to update it. Linearization makes auxiliary variables unnecessary, hence saving memory and avoiding the expensive matrix inversions needed to update the auxiliary variables. Moreover, without the extra constraints introduced by the auxiliary variables, the convergence is also faster. Using a variable penalty parameter further speeds up the convergence.
The global convergence of LADM with adaptive penalty (LADMAP) is also proven.
As an example, we apply our LADMAP to solve the low-rank representation (LRR) problem [12]¹:

min_{Z,E} ‖Z‖_* + μ‖E‖_{2,1}, s.t. X = XZ + E,    (2)

where X is the data matrix. LRR is an important robust subspace clustering technique and has found wide applications in machine learning and computer vision, e.g., motion segmentation, face clustering, and temporal segmentation [12, 14, 6]. However, the existing LRR solver [12] is based on ADM, which suffers from O(n³) computation complexity due to the matrix-matrix multiplications and matrix inversions. Moreover, introducing auxiliary variables also slows down the convergence, as there are more variables and constraints. Such a heavy computation load prevents LRR from large scale applications. It is LRR that motivated us to develop LADMAP. We show that LADMAP can be successfully applied to LRR, obtaining faster convergence than the original solver. By further representing Z as its skinny SVD and utilizing an advanced functionality of the PROPACK [9] package, the complexity of solving LRR by LADMAP becomes only O(rn²), as there are no full sized matrix-matrix multiplications, where r is the rank of the optimal Z. Numerical experiments show the great speed advantage of our LADMAP based methods for solving LRR.
Our work is inspired by Yang et al. [20]. Nonetheless, our work differs from theirs in several distinct ways. First, they only proved the convergence of LADM for a specific problem, namely nuclear norm regularization, and their proof utilized some special properties of the nuclear norm, while we prove the convergence of LADM for the general problem (1). Second, they only proved convergence in the case of a fixed penalty, while we prove it in the case of a variable penalty.
Although they mentioned the dynamic updating rule proposed in [8], their proof cannot be straightforwardly applied to the case of variable penalty; moreover, that rule is for ADM only. Third, the convergence speed of LADM heavily depends on the choice of penalty, so it is difficult to choose an optimal fixed penalty that fits different problems and problem sizes, while our novel updating rule for the penalty, although simple, is effective across different problems and problem sizes. The linearization technique has also been used in other optimization methods. For example, Yin [22] applied this technique to the Bregman iteration for solving compressive sensing problems and proved that the linearized Bregman method converges to an exact solution only conditionally. In comparison, LADM (and LADMAP) always converges to an exact solution.

2 Linearized Alternating Direction Method with Adaptive Penalty

2.1 The Alternating Direction Method

ADM is now very popular for solving large scale machine learning problems [1]. When solving (1) by ADM, one operates on the following augmented Lagrangian function:

L(x, y, λ) = f(x) + g(y) + ⟨λ, A(x) + B(y) − c⟩ + (β/2)‖A(x) + B(y) − c‖²,    (3)

where λ is the Lagrange multiplier, ⟨·, ·⟩ is the inner product, and β > 0 is the penalty parameter.
The usual augmented Lagrange multiplier method is to minimize L w.r.t. x and y simultaneously. This is usually difficult and does not exploit the fact that the objective function is separable. To remedy this issue, ADM decomposes the minimization of L w.r.t. (x, y) into two subproblems that minimize w.r.t. x and y, respectively.

¹Here we switch to bold capital letters in order to emphasize that the variables are matrices.
More specifically, the iterations of ADM go as follows:

x_{k+1} = argmin_x L(x, y_k, λ_k)
        = argmin_x f(x) + (β/2)‖A(x) + B(y_k) − c + λ_k/β‖²,    (4)
y_{k+1} = argmin_y L(x_{k+1}, y, λ_k)
        = argmin_y g(y) + (β/2)‖B(y) + A(x_{k+1}) − c + λ_k/β‖²,    (5)
λ_{k+1} = λ_k + β[A(x_{k+1}) + B(y_{k+1}) − c].    (6)

In many machine learning problems, as f and g are matrix or vector norms, the subproblems (4) and (5) usually have closed form solutions when A and B are identities [2, 12, 21]. In this case, ADM is appealing. However, in many problems A and B are not identities. For example, in matrix completion A can be a selection matrix, and in LRR and 1D sparse representation A can be a general matrix. In this case, there are no closed form solutions to (4) and (5). Then (4) and (5) have to be solved iteratively. To overcome this difficulty, a common strategy is to introduce auxiliary variables u and v [12, 1] and reformulate problem (1) into an equivalent one:

min_{x,y,u,v} f(x) + g(y), s.t. A(u) + B(v) = c, x = u, y = v,    (7)

and the corresponding ADM iterations analogous to (4)-(6) can be deduced. With more variables and more constraints, more memory is required and the convergence of ADM also becomes slower. Moreover, to update u and v, whose subproblems are least squares problems, expensive matrix inversions are often necessary. Even worse, the convergence of ADM with more than two variables is not guaranteed [7].
To avoid introducing auxiliary variables and still solve subproblems (4) and (5) efficiently, inspired by Yang et al. [20], we propose a linearization technique for (4) and (5).
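When A and B are identities, each step of (4)-(6) has a closed form solution. As a minimal NumPy illustration (our own toy instance, not from the paper), ADM applied to min ‖x‖_1 + ½‖y‖², s.t. x + y = c alternates a soft-thresholding step, a ridge step, and the multiplier update:

```python
import numpy as np

def soft_threshold(v, t):
    # Closed form solution of argmin_x t*||x||_1 + 0.5*||x - v||^2.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def adm_toy(c, beta=1.0, iters=1000):
    # ADM for min ||x||_1 + 0.5*||y||^2  s.t.  x + y = c  (A = B = identity).
    x = np.zeros_like(c)
    y = np.zeros_like(c)
    lam = np.zeros_like(c)
    for _ in range(iters):
        # x-subproblem (4): argmin_x ||x||_1 + beta/2*||x + y - c + lam/beta||^2
        x = soft_threshold(c - y - lam / beta, 1.0 / beta)
        # y-subproblem (5): argmin_y 0.5*||y||^2 + beta/2*||y + x - c + lam/beta||^2
        y = beta * (c - x - lam / beta) / (1.0 + beta)
        # Multiplier update (6)
        lam = lam + beta * (x + y - c)
    return x, y

c = np.array([3.0, 0.5, -2.0])
x, y = adm_toy(c)
# Eliminating y = c - x shows the optimum is x* = soft_threshold(c, 1).
```

Both subproblems here are exactly solvable, which is what makes ADM attractive in the identity-mapping case; when A is a general matrix, the quadratic term couples the entries of x and this closed form is lost.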
To further accelerate the convergence of the algorithm, we also propose an adaptive rule for updating the penalty parameter.

2.2 Linearized ADM

By linearizing the quadratic term in (4) at x_k and adding a proximal term, we have the following approximation:

x_{k+1} = argmin_x f(x) + ⟨A*(λ_k) + βA*(A(x_k) + B(y_k) − c), x − x_k⟩ + (βη_A/2)‖x − x_k‖²
        = argmin_x f(x) + (βη_A/2)‖x − x_k + A*(λ_k + β(A(x_k) + B(y_k) − c))/(βη_A)‖²,    (8)

where A* is the adjoint of A and η_A > 0 is a parameter whose proper value will be analyzed later. The above approximation resembles that of APG [16], but we do not use APG to solve (4) iteratively. Similarly, subproblem (5) can be approximated by

y_{k+1} = argmin_y g(y) + (βη_B/2)‖y − y_k + B*(λ_k + β(A(x_{k+1}) + B(y_k) − c))/(βη_B)‖².    (9)

The update of the Lagrange multiplier still goes as (6)².

2.3 Adaptive Penalty

In previous ADM and LADM approaches [15, 21, 20], the penalty parameter β is fixed. Some scholars have observed that ADM with a fixed β can converge very slowly and that it is nontrivial to choose an optimal fixed β. So is LADM. Thus a dynamic β is preferred in real applications. Although Tao et al. [15] and Yang et al. [20] mentioned He et al.'s adaptive updating rule [8] in their papers, the rule is for ADM only. We propose the following adaptive updating strategy for the penalty parameter β:

β_{k+1} = min(β_max, ρβ_k),    (10)

where β_max is an upper bound of {β_k}. The value of ρ is defined as

ρ = ρ_0, if β_k max(√η_A ‖x_{k+1} − x_k‖, √η_B ‖y_{k+1} − y_k‖)/‖c‖ < ε_2;
    1,   otherwise,    (11)

where ρ_0 ≥ 1 is a constant. The condition to assign ρ = ρ_0 comes from the analysis on the stopping criteria (see Section 2.5). We recommend that β_0 = αε_2, where α depends on the size of c. Our updating rule is fundamentally different from He et al.'s for ADM [8], which aims at balancing the errors in the stopping criteria and involves several parameters.

²As in [20], we can also introduce a parameter γ and update λ as λ_{k+1} = λ_k + γβ[A(x_{k+1}) + B(y_{k+1}) − c]. We choose not to do so in this paper in order not to make the exposition of LADMAP too complex. The readers can refer to Supplementary Material for full details.

2.4 Convergence of LADMAP

To prove the convergence of LADMAP, we first have the following propositions.

Proposition 1

−β_k η_A(x_{k+1} − x_k) − A*(λ̃_{k+1}) ∈ ∂f(x_{k+1}), −β_k η_B(y_{k+1} − y_k) − B*(λ̂_{k+1}) ∈ ∂g(y_{k+1}),    (12)

where λ̃_{k+1} = λ_k + β_k[A(x_k) + B(y_k) − c], λ̂_{k+1} = λ_k + β_k[A(x_{k+1}) + B(y_k) − c], and ∂f and ∂g are subgradients of f and g, respectively.

This can be easily proved by checking the optimality conditions of (8) and (9).

Proposition 2 Denote the operator norms of A and B as ‖A‖ and ‖B‖, respectively. If {β_k} is non-decreasing and upper bounded, η_A > ‖A‖², η_B > ‖B‖², and (x*, y*, λ*) is any Karush-Kuhn-Tucker (KKT) point of problem (1) (see (13)-(14)), then: (1). {η_A‖x_k − x*‖² − ‖A(x_k − x*)‖² + η_B‖y_k − y*‖² + β_k⁻²‖λ_k − λ*‖²} is non-increasing. (2). ‖x_{k+1} − x_k‖ → 0, ‖y_{k+1} − y_k‖ → 0, ‖λ_{k+1} − λ_k‖ → 0.

The proof can be found in Supplementary Material. Then we can prove the convergence of LADMAP, as stated in the following theorem.

Theorem 3 If {β_k} is non-decreasing and upper bounded, η_A > ‖A‖², and η_B > ‖B‖², then the sequence {(x_k, y_k, λ_k)} generated by LADMAP converges to a KKT point of problem (1).

The proof can be found in Appendix A.

2.5 Stopping Criteria

The KKT conditions of problem (1) are that there exists a triple (x*, y*, λ*) such that

A(x*) + B(y*) − c = 0,    (13)
−A*(λ*) ∈ ∂f(x*), −B*(λ*) ∈ ∂g(y*).    (14)

The triple (x*, y*, λ*) is called a KKT point. So the first stopping criterion is the feasibility:

‖A(x_{k+1}) + B(y_{k+1}) − c‖/‖c‖ < ε_1.    (15)

As for the second KKT condition, we rewrite the second part of Proposition 1 as follows:

−β_k[η_B(y_{k+1} − y_k) + B*(A(x_{k+1} − x_k))] − B*(λ̃_{k+1}) ∈ ∂g(y_{k+1}).    (16)

So for λ̃_{k+1} to satisfy the second KKT condition, both β_k η_A‖x_{k+1} − x_k‖ and β_k‖η_B(y_{k+1} − y_k) + B*(A(x_{k+1} − x_k))‖ should be small enough. This leads to the second stopping criterion:

β_k max(η_A‖x_{k+1} − x_k‖/‖A*(c)‖, η_B‖y_{k+1} − y_k‖/‖B*(c)‖) ≤ ε′_2.    (17)

By estimating ‖A*(c)‖ and ‖B*(c)‖ by √η_A‖c‖ and √η_B‖c‖, respectively, we arrive at the second stopping criterion in use:

β_k max(√η_A‖x_{k+1} − x_k‖, √η_B‖y_{k+1} − y_k‖)/‖c‖ ≤ ε_2.    (18)

Finally, we summarize our LADMAP algorithm in Algorithm 1.

Algorithm 1 LADMAP for Problem (1)
Initialize: Set ε_1 > 0, ε_2 > 0, β_max ≫ β_0 > 0, η_A > ‖A‖², η_B > ‖B‖², x_0, y_0, λ_0, and k ← 0.
while (15) or (18) is not satisfied do
  Step 1: Update x by solving (8).
  Step 2: Update y by solving (9).
  Step 3: Update λ by (6).
  Step 4: Update β by (10) and (11).
  Step 5: k ← k + 1.
end while

3 Applying LADMAP to LRR

In this section, we apply LADMAP to solve the LRR problem (2). We further introduce acceleration tricks to reduce the computation complexity of each iteration.

3.1 Solving LRR by LADMAP

As the LRR problem (2) is a special case of problem (1), LADMAP can be directly applied to it. The two subproblems both have closed form solutions. In the subproblem for updating E, one may apply the l_{2,1}-norm shrinkage operator [12], with a threshold μβ_k⁻¹, to the matrix M_k = −XZ_k + X − Λ_k/β_k. In the subproblem for updating Z, one has to apply the singular value shrinkage operator [2], with a threshold (β_k η_X)⁻¹, to the matrix N_k = Z_k − η_X⁻¹ Xᵀ(XZ_k + E_{k+1} − X + Λ_k/β_k), where η_X > σ²_max(X).
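The two closed form solutions just mentioned can be sketched in NumPy (a minimal sketch, not the paper's code; `l21_shrink` implements the column-wise l_{2,1}-norm shrinkage in the spirit of Lemma 3.2 of [12], and `svt` the singular value shrinkage of [2]):

```python
import numpy as np

def l21_shrink(M, tau):
    # Column-wise shrinkage: the j-th column m_j is scaled by
    # max(||m_j|| - tau, 0) / ||m_j||, which solves
    # argmin_E tau*||E||_{2,1} + 0.5*||E - M||_F^2.
    norms = np.linalg.norm(M, axis=0)
    scale = np.maximum(norms - tau, 0.0) / np.maximum(norms, 1e-12)
    return M * scale

def svt(N, tau):
    # Singular value thresholding: shrink each singular value by tau,
    # which solves argmin_Z tau*||Z||_* + 0.5*||Z - N||_F^2.
    U, s, Vt = np.linalg.svd(N, full_matrices=False)
    s = np.maximum(s - tau, 0.0)
    return (U * s) @ Vt

# Example: thresholding a diagonal matrix shrinks its singular values.
N = np.diag([3.0, 1.0, 0.2])
Z = svt(N, 0.5)  # singular values become 2.5, 0.5, 0
```

Note that `svt` as written computes a full SVD; the point of Sections 3.1-3.2 is precisely to replace this by a partial SVD of rank r so that the per-iteration cost drops to O(rn²).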
If N_k is formed explicitly, the usual technique of partial SVD, using PROPACK [9] and rank prediction³, can be utilized to compute the leading r singular values and associated vectors of N_k efficiently, making the complexity of the SVD computation O(rn²), where r is the predicted rank of Z_{k+1} and n is the column number of X. Note that as β_k is non-decreasing, the predicted rank is almost non-decreasing, making the iterations computationally efficient.

3.2 Acceleration Tricks for LRR

Up to now, LADMAP for LRR is still of complexity O(n³), although partial SVD is already used. This is because forming M_k and N_k requires full sized matrix-matrix multiplications, e.g., XZ_k. To break this complexity bound, we introduce a decomposition technique to further accelerate LADMAP for LRR. By representing Z_k as its skinny SVD, Z_k = U_k Σ_k V_kᵀ, some of the full sized matrix-matrix multiplications are gone: they are replaced by successive reduced sized matrix-matrix multiplications. For example, when updating E, XZ_k is computed as ((XU_k)Σ_k)V_kᵀ, reducing the complexity to O(rn²). When computing the partial SVD of N_k, things are more complicated. If we formed N_k explicitly, we would be faced with computing Xᵀ(X + Λ_k/β_k), which is neither low-rank nor sparse⁴. Fortunately, in PROPACK the bi-diagonalization of N_k is done by the Lanczos procedure [9], which only requires computing the matrix-vector multiplications N_k v and uᵀN_k, where u and v are some vectors in the Lanczos procedure. So we may compute N_k v and uᵀN_k by multiplying the vectors v and u successively with the component matrices in N_k, rather than forming N_k explicitly. Hence the computation complexity of the partial SVD of N_k is still O(rn²). Consequently, with our acceleration techniques, the complexity of our accelerated LADMAP (denoted as LADMAP(A) for short) for LRR is O(rn²).
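The successive matrix-vector multiplication idea can be sketched with SciPy's `LinearOperator` as a hypothetical stand-in for PROPACK's Lanczos routine (not the paper's implementation; the factor shapes and the residual term `M` standing for E_{k+1} − X + Λ_k/β_k are illustrative assumptions):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, svds

rng = np.random.default_rng(0)
n, r = 60, 5
X = rng.standard_normal((n, n))
# Skinny factors of Z_k and the residual term M = E_{k+1} - X + Lam_k/beta_k.
U = rng.standard_normal((n, r))
S = np.diag(rng.random(r))
Vt = rng.standard_normal((r, n))
M = rng.standard_normal((n, n))
eta = 1.02 * np.linalg.norm(X, 2) ** 2  # eta_X > sigma_max(X)^2

def matvec(v):
    # N_k v = Z_k v - (1/eta) X^T (X Z_k v + M v), via reduced-size products.
    Zv = U @ (S @ (Vt @ v))
    return Zv - (X.T @ (X @ Zv + M @ v)) / eta

def rmatvec(u):
    # N_k^T u = Z_k^T u - (1/eta) (Z_k^T X^T + M^T) (X u).
    Xu = X @ u
    return Vt.T @ (S @ (U.T @ u)) - (Vt.T @ (S @ (U.T @ (X.T @ Xu))) + M.T @ Xu) / eta

N_implicit = LinearOperator((n, n), matvec=matvec, rmatvec=rmatvec)
# Leading singular triplets of N_k without ever forming it.
_, s_top, _ = svds(N_implicit, k=r)
```

Each matvec costs O(rn) for the skinny factors plus O(n²) for the products with X and M, so a rank-r Lanczos run stays at O(rn²), matching the complexity argument above.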
LADMAP(A) is summarized in Algorithm 2.

³The current PROPACK can only output a given number of singular values and vectors, so one has to predict the number of singular values that are greater than a threshold [11, 20, 16]. See Step 3 of Algorithm 2. Recently, we have modified PROPACK so that it can output the singular values that are greater than a threshold and their corresponding singular vectors. See [10].
⁴When forming N_k explicitly, XᵀXZ_k can be computed as ((Xᵀ(XU_k))Σ_k)V_kᵀ, whose complexity is still O(rn²), while XᵀE_{k+1} could also be accelerated as E_{k+1} is a column-sparse matrix.

Algorithm 2 Accelerated LADMAP for LRR (2)
Input: Observation matrix X and parameter μ > 0.
Initialize: Set E_0, Z_0 and Λ_0 to zero matrices, where Z_0 is represented as (U_0, Σ_0, V_0) ← (0, 0, 0). Set ε_1 > 0, ε_2 > 0, β_max ≫ β_0 > 0, η_X > σ²_max(X), r = 5, and k ← 0.
while (15) or (18) is not satisfied do
  Step 1: Update E_{k+1} = argmin_E μ‖E‖_{2,1} + (β_k/2)‖E + (XU_k)Σ_kV_kᵀ − X + Λ_k/β_k‖². This subproblem can be solved by using Lemma 3.2 in [12].
  Step 2: Update the skinny SVD (U_{k+1}, Σ_{k+1}, V_{k+1}) of Z_{k+1}. First, compute the partial SVD Ũ_rΣ̃_rṼ_rᵀ of the implicit matrix N_k, which is bi-diagonalized by the successive matrix-vector multiplication technique described in Section 3.2. Second, U_{k+1} = Ũ_r(:, 1:r′), V_{k+1} = Ṽ_r(:, 1:r′), Σ_{k+1} = Σ̃_r(1:r′, 1:r′) − (β_kη_X)⁻¹I, where r′ is the number of singular values in Σ̃_r that are greater than (β_kη_X)⁻¹.
  Step 3: Update the predicted rank r: if r′ < r, then r = min(r′ + 1, n); otherwise, r = min(r′ + round(0.05n), n).
  Step 4: Update Λ_{k+1} = Λ_k + β_k((XU_{k+1})Σ_{k+1}V_{k+1}ᵀ + E_{k+1} − X).
  Step 5: Update β_{k+1} by (10)-(11).
  Step 6: k ← k + 1.
end while

4 Experimental Results

In this section, we report numerical results on LADMAP, LADMAP(A) and other state-of-the-art algorithms, including APG⁵, ADM⁶ and LADM, for LRR based data clustering problems. APG, ADM, LADM and LADMAP all utilize the Matlab version of PROPACK [9]. For LADMAP(A), we provide two function handles to PROPACK which perform the successive matrix-vector multiplications. All experiments are run and timed on a PC with an Intel Core i5 CPU at 2.67GHz and with 4GB of memory, running Windows 7 and Matlab version 7.10.
We test and compare these solvers on both synthetic multiple subspaces data and the real world motion data (Hopkins155 motion segmentation database [17]). For APG, we set the parameters β_0 = 0.01, β_min = 10⁻¹⁰, θ = 0.9 in its continuation technique and the Lipschitz constant τ = σ²_max(X). The parameters of ADM and LADM are the same as those in [12] and [20], respectively. In particular, for LADM the penalty is fixed at β = 2.5/min(m, n), where m × n is the size of X. For LADMAP, we set ε_1 = 10⁻⁴, ε_2 = 10⁻⁵, β_0 = min(m, n)ε_2, β_max = 10¹⁰, ρ_0 = 1.9, and η_X = 1.02σ²_max(X). As the code of ADM was downloaded, its stopping criteria, ‖XZ_k + E_k − X‖/‖X‖ ≤ ε_1 and max(‖E_k − E_{k−1}‖/‖X‖, ‖Z_k − Z_{k−1}‖/‖X‖) ≤ ε_2, are used in all our experiments⁷.

4.1 On Synthetic Data

The synthetic test data, parameterized as (s, p, d, r̃), is created by the same procedure in [12].
s independent subspaces {S_i}_{i=1}^s are constructed, whose bases {U_i}_{i=1}^s are generated by U_{i+1} = TU_i, 1 ≤ i ≤ s − 1, where T is a random rotation and U_1 is a d × r̃ random orthogonal matrix. So each subspace has a rank of r̃ and the data has an ambient dimension of d. Then p data points are sampled from each subspace by X_i = U_iQ_i, 1 ≤ i ≤ s, with Q_i being an r̃ × p i.i.d. zero mean unit variance Gaussian matrix N(0, 1). 20% of the samples are randomly chosen to be corrupted by adding Gaussian noise with zero mean and standard deviation 0.1‖x‖. We empirically find that LRR achieves the best clustering performance on this data set when μ = 0.1, so we test all algorithms with μ = 0.1 in this experiment. To measure the relative errors in the solutions, we run LADMAP for 2000 iterations with β_max = 10³ to establish the ground truth solution (E_0, Z_0).
The computational comparison is summarized in Table 1. We can see that the iteration numbers and the CPU times of both LADMAP and LADMAP(A) are much less than those of the other methods, and LADMAP(A) is further much faster than LADMAP. Moreover, the advantage of LADMAP(A) is even greater when the ratio r̃/p, which is roughly the ratio of the rank of Z_0 to the size of Z_0, is smaller, which testifies to the complexity estimations on LADMAP and LADMAP(A) for LRR. It is noteworthy that the iteration numbers of ADM and LADM seem to grow with the problem sizes, while that of LADMAP is rather constant. Moreover, LADM is not faster than ADM. In particular, on the last data we were unable to wait until LADM stopped.

⁵Please see Supplementary Material for the details of solving LRR by APG.
⁶We use the Matlab code provided online by the authors of [12].
⁷Note that the second criterion differs from that in (18). However, this does not harm the convergence of LADMAP because (18) is always checked when updating β_{k+1} (see (11)).
Finally, as APG converges to an approximate solution to (2), its relative errors are larger and its clustering accuracy is lower than ADM and LADM based methods.

Table 1: Comparison among APG, ADM, LADM, LADMAP and LADMAP(A) on the synthetic data. For each quadruple (s, p, d, r̃), the LRR problem, with μ = 0.1, was solved for the same data using different algorithms. We present typical running time (in ×10³ seconds), iteration number, relative error (%) of output solution (Ê, Ẑ) and the clustering accuracy (%) of tested algorithms, respectively.

Size (s, p, d, r̃)   Method      Time     Iter.  ‖Ẑ−Z_0‖/‖Z_0‖  ‖Ê−E_0‖/‖E_0‖  Acc.
(10, 20, 200, 5)    APG         0.0332   110    2.2079         1.5096         81.5
                    ADM         0.0529   176    0.5491         0.5093         90.0
                    LADM        0.0603   194    0.5480         0.5024         90.0
                    LADMAP      0.0145   46     0.5480         0.5024         90.0
                    LADMAP(A)   0.0010   46     0.5480         0.5024         90.0
(15, 20, 300, 5)    APG         0.0869   106    2.4824         1.0341         80.0
                    ADM         0.1526   185    0.6519         0.4078         83.7
                    LADM        0.2943   363    0.6518         0.4076         86.7
                    LADMAP      0.0336   41     0.6518         0.4076         86.7
                    LADMAP(A)   0.0015   41     0.6518         0.4076         86.7
(20, 25, 500, 5)    APG         1.8837   117    2.8905         2.4017         72.4
                    ADM         3.7139   225    1.1191         1.0170         80.0
                    LADM        8.1574   508    0.6379         0.4268         80.0
                    LADMAP      0.7762   40     0.6379         0.4268         84.6
                    LADMAP(A)   0.0053   40     0.6379         0.4268         84.6
(30, 30, 900, 5)    APG         6.1252   116    3.0667         0.9199         69.4
                    ADM         11.7185  220    0.6865         0.4866         76.0
                    LADM        N.A.     N.A.   N.A.           N.A.           N.A.
                    LADMAP      2.3891   44     0.6864         0.4294         80.1
                    LADMAP(A)   0.0058   44     0.6864         0.4294         80.1

Table 2: Comparison among APG, ADM, LADM, LADMAP and LADMAP(A) on the Hopkins155 database. We present their average computing time (in seconds), average number of iterations and average classification errors (%) on all 156 sequences.

Method      Two Motion              Three Motion            All
            Time     Iter.  CErr.   Time      Iter.  CErr.  Time     Iter.  CErr.
APG         15.7836  90     5.77    46.4970   90     16.52  22.6277  90     8.36
ADM         53.3470  281    5.72    159.8644  284    16.52  77.0864  282    8.33
LADM        9.6701   110    5.77    22.1467   64     16.52  12.4520  99     8.36
LADMAP      3.6964   22     5.72    10.9438   22     16.52  5.3114   22     8.33
LADMAP(A)   2.1348   22     5.72    6.1098    22     16.52  3.0202   22     8.33

4.2 On Real World Data

We further test the performance of these algorithms on the Hopkins155 database [17]. This database consists of 156 sequences, each of which has 39 to 550 data vectors drawn from two or three motions. For computational efficiency, we preprocess the data by projecting it to be 5-dimensional using PCA. As μ = 2.4 is the best parameter for this database [12], we test all algorithms with μ = 2.4.
Table 2 shows the comparison among APG, ADM, LADM, LADMAP and LADMAP(A) on this database. We can also see that LADMAP and LADMAP(A) are much faster than APG, ADM, and LADM, and LADMAP(A) is also faster than LADMAP. However, in this experiment the advantage of LADMAP(A) over LADMAP is not as dramatic as that in Table 1. This is because on this data μ is chosen as 2.4, which cannot make the rank of the ground truth solution Z_0 much smaller than the size of Z_0.

5 Conclusions

In this paper, we propose a linearized alternating direction method with adaptive penalty for solving subproblems in ADM conveniently. With LADMAP, no auxiliary variables are required and the convergence is also much faster. We further apply it to solve the LRR problem and combine it with an acceleration trick so that the computation complexity is reduced from O(n³) to O(rn²), which is highly advantageous over the existing LRR solvers. Although we only present results on LRR, LADMAP is actually a general method that can be applied to other convex programs.

Acknowledgments

The authors would like to thank Dr. Xiaoming Yuan for pointing us to [20].
This work is partially supported by the grants of the NSFC-Guangdong Joint Fund (No. U0935004) and the NSFC Fund (No. 60873181, 61173103). R. Liu also thanks the support from CSC.

A Proof of Theorem 3

Proof By Proposition 2 (1), {(x_k, y_k, λ_k)} is bounded, hence has an accumulation point, say (x_{k_j}, y_{k_j}, λ_{k_j}) → (x^∞, y^∞, λ^∞). We accomplish the proof in two steps.
1. We first prove that (x^∞, y^∞, λ^∞) is a KKT point of problem (1).
By Proposition 2 (2), A(x_{k+1}) + B(y_{k+1}) − c = β_k⁻¹(λ_{k+1} − λ_k) → 0. This shows that any accumulation point of {(x_k, y_k)} is a feasible solution.
By letting k = k_j − 1 in Proposition 1 and the definition of subgradient, we have

f(x_{k_j}) + g(y_{k_j}) ≤ f(x*) + g(y*) + ⟨x_{k_j} − x*, −β_{k_j−1}η_A(x_{k_j} − x_{k_j−1}) − A*(λ̃_{k_j})⟩ + ⟨y_{k_j} − y*, −β_{k_j−1}η_B(y_{k_j} − y_{k_j−1}) − B*(λ̂_{k_j})⟩.

Let j → +∞; by observing Proposition 2 (2), we have

f(x^∞) + g(y^∞) ≤ f(x*) + g(y*) + ⟨x^∞ − x*, −A*(λ^∞)⟩ + ⟨y^∞ − y*, −B*(λ^∞)⟩
= f(x*) + g(y*) − ⟨A(x^∞ − x*), λ^∞⟩ − ⟨B(y^∞ − y*), λ^∞⟩
= f(x*) + g(y*) − ⟨A(x^∞) + B(y^∞) − A(x*) − B(y*), λ^∞⟩
= f(x*) + g(y*),    (19)

where we have used the fact that both (x*, y*) and (x^∞, y^∞) are feasible solutions. So we conclude that (x^∞, y^∞) is an optimal solution to (1).
Again, let k = k_j − 1 in Proposition 1; by the definition of subgradient, we have

f(x) ≥ f(x_{k_j}) + ⟨x − x_{k_j}, −β_{k_j−1}η_A(x_{k_j} − x_{k_j−1}) − A*(λ̃_{k_j})⟩, ∀x.

Fix x and let j → +∞; we see that

f(x) ≥ f(x^∞) + ⟨x − x^∞, −A*(λ^∞)⟩, ∀x.

So −A*(λ^∞) ∈ ∂f(x^∞). Similarly, −B*(λ^∞) ∈ ∂g(y^∞). Therefore, (x^∞, y^∞, λ^∞) is a KKT point of problem (1).
2. We next prove that the whole sequence {(x_k, y_k, λ_k)} converges to (x^∞, y^∞, λ^∞).
By choosing (x*, y*, λ*) = (x^∞, y^∞, λ^∞) in Proposition 2, we have η_A‖x_{k_j} − x^∞‖² − ‖A(x_{k_j} − x^∞)‖² + η_B‖y_{k_j} − y^∞‖² + β_{k_j}⁻²‖λ_{k_j} − λ^∞‖² → 0. By Proposition 2 (1), we readily have η_A‖x_k − x^∞‖² − ‖A(x_k − x^∞)‖² + η_B‖y_k − y^∞‖² + β_k⁻²‖λ_k − λ^∞‖² → 0. So (x_k, y_k, λ_k) → (x^∞, y^∞, λ^∞).
As (x^∞, y^∞, λ^∞) can be an arbitrary accumulation point of {(x_k, y_k, λ_k)}, we may conclude that {(x_k, y_k, λ_k)} converges to a KKT point of problem (1).

References
[1] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers.
In Michael Jordan, editor, Foundations and Trends in Machine Learning, 2010.

[2] J. Cai, E. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. preprint, 2008.

[3] E. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? J. ACM, 2011.

[4] E. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 2009.

[5] E. J. Candès and M. Wakin. An introduction to compressive sampling. IEEE Signal Processing Magazine, 2008.

[6] P. Favaro, R. Vidal, and A. Ravichandran. A closed form solution to robust subspace estimation and clustering. In CVPR, 2011.

[7] B. He, M. Tao, and X. Yuan. Alternating direction method with Gaussian back substitution for separable convex programming. SIAM Journal on Optimization, accepted.

[8] B. He, H. Yang, and S. Wang. Alternating direction method with self-adaptive penalty parameters for monotone variational inequality. J. Optimization Theory and Applications, 106:337–356, 2000.

[9] R. Larsen. Lanczos bidiagonalization with partial reorthogonalization. Department of Computer Science, Aarhus University, Technical report, DAIMI PB-357, 1998.

[10] Z. Lin. Some software packages for partial SVD computation. arXiv:1108.1548.

[11] Z. Lin, M. Chen, and Y. Ma. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. UIUC Technical Report UILU-ENG-09-2215, 2009, arXiv:1009.5055.

[12] G. Liu, Z. Lin, and Y. Yu. Robust subspace segmentation by low-rank representation. In ICML, 2010.

[13] J. Liu, S. Ji, and J. Ye. Multi-task feature learning via efficient l2,1 norm minimization. In UAI, 2009.

[14] Y. Ni, J. Sun, X. Yuan, S. Yan, and L. Cheong. Robust low-rank subspace segmentation with semidefinite guarantees. In ICDM Workshop, 2010.

[15] M. Tao and X. M. Yuan. Recovering low-rank and sparse components of matrices from incomplete and noisy observations. SIAM Journal on Optimization, 21(1):57–81, 2011.

[16] K. Toh and S. Yun. An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Pacific J. Optimization, 6:615–640, 2010.

[17] R. Tron and R. Vidal. A benchmark for the comparison of 3D motion segmentation algorithms. In CVPR, 2007.

[18] J. Wright, A. Ganesh, S. Rao, and Y. Ma. Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization. In NIPS, 2009.

[19] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. Huang, and S. Yan. Sparse representation for computer vision and pattern recognition. Proceedings of the IEEE, 2010.

[20] J. Yang and X. Yuan. Linearized augmented Lagrangian and alternating direction methods for nuclear norm minimization. submitted, 2011.

[21] J. Yang and Y. Zhang. Alternating direction algorithms for l1 problems in compressive sensing. SIAM J. Scientific Computing, 2010.

[22] W. Yin. Analysis and generalizations of the linearized Bregman method. SIAM Journal on Imaging Sciences, 2010.
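Theorem 3 guarantees that the linearized iterates converge to a KKT point, with the feasibility residual $A(x_{k+1}) + B(y_{k+1}) - c = \beta_k^{-1}(\lambda_{k+1} - \lambda_k)$ vanishing. The sketch below (not from the paper) illustrates this numerically on a toy two-block problem $\min \|x\|_1 + \|y\|_1$ s.t. $Ax + By = c$, where both subproblems reduce to soft-thresholding. The function name `ladm_l1`, the random test instance, and the simplified monotone penalty update $\beta_{k+1} = \min(\rho\beta_k, \beta_{\max})$ (in place of the paper's adaptive rule) are illustrative assumptions.

```python
import numpy as np

def shrink(v, t):
    # Soft-thresholding: the proximal operator of t * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ladm_l1(A, B, c, beta=0.1, beta_max=1e4, rho=1.05, iters=2000):
    # Minimize ||x||_1 + ||y||_1 s.t. A x + B y = c by linearizing the
    # quadratic penalty in each subproblem and (here, monotonically)
    # increasing the penalty parameter beta.
    m, n = A.shape
    p = B.shape[1]
    x, y, lam = np.zeros(n), np.zeros(p), np.zeros(m)
    # Proximal constants must dominate the squared operator norms:
    # eta_A > ||A||_2^2, eta_B > ||B||_2^2.
    etaA = 1.02 * np.linalg.norm(A, 2) ** 2
    etaB = 1.02 * np.linalg.norm(B, 2) ** 2
    for _ in range(iters):
        # x-step: prox of ||.||_1 at a gradient point of the penalty.
        lam_t = lam + beta * (A @ x + B @ y - c)   # tilde{lambda}_k
        x = shrink(x - A.T @ lam_t / (etaA * beta), 1.0 / (etaA * beta))
        # y-step: same linearization, using the fresh x.
        lam_h = lam + beta * (A @ x + B @ y - c)   # hat{lambda}_k
        y = shrink(y - B.T @ lam_h / (etaB * beta), 1.0 / (etaB * beta))
        # Multiplier and penalty updates.
        lam = lam + beta * (A @ x + B @ y - c)
        beta = min(beta_max, rho * beta)
    return x, y, np.linalg.norm(A @ x + B @ y - c)

# Toy demo: a feasible random instance built from sparse ground truth.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 20))
B = rng.standard_normal((5, 20))
x0 = np.zeros(20)
x0[:3] = [1.0, -2.0, 0.5]
y0 = np.zeros(20)
y0[:2] = [1.5, -1.0]
x, y, res = ladm_l1(A, B, A @ x0 + B @ y0)
# res is the final feasibility residual, driven towards 0 per Theorem 3.
```

Note that, per the residual identity above, the gradient step on the penalty is taken at the shifted multiplier $\tilde{\lambda}_k = \lambda_k + \beta_k(A(x_k) + B(y_k) - c)$ rather than at $\lambda_k$, which is what makes the $x$- and $y$-subproblems exact proximal steps.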