{"title": "Finding a sparse vector in a subspace: Linear sparsity using alternating directions", "book": "Advances in Neural Information Processing Systems", "page_first": 3401, "page_last": 3409, "abstract": "We consider the problem of recovering the sparsest vector in a subspace $ \\mathcal{S} \\subseteq \\mathbb{R}^p $ with $ \\text{dim}(\\mathcal{S})=n$. This problem can be considered a homogeneous variant of the sparse recovery problem, and finds applications in sparse dictionary learning, sparse PCA, and other problems in signal processing and machine learning. Simple convex heuristics for this problem provably break down when the fraction of nonzero entries in the target sparse vector substantially exceeds $1/ \\sqrt{n}$. In contrast, we exhibit a relatively simple nonconvex approach based on alternating directions, which provably succeeds even when the fraction of nonzero entries is $\\Omega(1)$. To our knowledge, this is the first practical algorithm to achieve this linear scaling. This result assumes a planted sparse model, in which the target sparse vector is embedded in an otherwise random subspace. Empirically, our proposed algorithm also succeeds in more challenging data models arising, e.g., from sparse dictionary learning.", "full_text": "Finding a sparse vector in a subspace:\n\nLinear sparsity using alternating directions\n\nQing Qu, Ju Sun, and John Wright\n\n{qq2105, js4038, jw2966}@columbia.edu\n\nDept. of Electrical Engineering, Columbia University, New York City, NY, USA, 10027\n\nAbstract\n\nWe consider the problem of recovering the sparsest vector in a subspace $\\mathcal{S} \\subseteq \\mathbb{R}^p$ with $\\dim(\\mathcal{S}) = n$. This problem can be considered a homogeneous variant of the sparse recovery problem, and finds applications in sparse dictionary learning, sparse PCA, and other problems in signal processing and machine learning. 
Simple convex heuristics for this problem provably break down when the fraction of nonzero entries in the target sparse vector substantially exceeds $1/\\sqrt{n}$. In contrast, we exhibit a relatively simple nonconvex approach based on alternating directions, which provably succeeds even when the fraction of nonzero entries is $\\Omega(1)$. To our knowledge, this is the first practical algorithm to achieve this linear scaling. This result assumes a planted sparse model, in which the target sparse vector is embedded in an otherwise random subspace. Empirically, our proposed algorithm also succeeds in more challenging data models arising, e.g., from sparse dictionary learning. Full version is online: http://arxiv.org/abs/1412.4659.\n\n1 Introduction\n\nSuppose we are given a linear subspace $\\mathcal{S}$ of a high-dimensional space $\\mathbb{R}^p$, which contains a sparse vector $x_0 \\neq 0$. Given an arbitrary basis of $\\mathcal{S}$, can we efficiently recover $x_0$? Equivalently, provided a matrix $A \\in \\mathbb{R}^{(p-n) \\times p}$, can we efficiently find a nonzero sparse vector $x$ such that $Ax = 0$? In the language of sparse approximation, can we solve\n\n$$\\min_{x} \\|x\\|_0 \\quad \\text{s.t.} \\quad Ax = 0, \\; x \\neq 0 \\; ? \\tag{1}$$\n\nVariants of this problem have been studied in the context of applications to numerical linear algebra [13], graphical model learning [27], nonrigid structure from motion [14], spectral estimation and Prony\u2019s problem [9], sparse PCA [29], blind source separation [28], dictionary learning [23], graphical model learning [3], and sparse coding on manifolds [19].\n\nHowever, in contrast to the standard sparse regression problem ($Ax = b$, $b \\neq 0$), for which convex relaxations perform nearly optimally for broad classes of designs $A$ [12, 16], the computational properties of problem (1) are not nearly as well understood. It has been known for several decades that the basic formulation\n\n$$\\min_{x} \\|x\\|_0, \\quad \\text{s.t.} \\quad x \\in \\mathcal{S} \\setminus \\{0\\}, \\tag{2}$$\n\nis NP-hard [13]. However, it is only recently that efficient computational surrogates with nontrivial recovery guarantees have been discovered. In the context of sparse dictionary learning, Spielman et al. [23] introduced a relaxation which replaces the nonconvex problem (2) with a sequence of linear programs:\n\n$$\\min_{x} \\|x\\|_1, \\quad \\text{s.t.} \\quad x_i = 1, \\; x \\in \\mathcal{S}, \\; 1 \\le i \\le p, \\tag{3}$$\n\nand proved that when $\\mathcal{S}$ is generated as a span of $n$ random sparse vectors, with high probability the relaxation recovers these vectors, provided the probability of an entry being nonzero is at most $\\theta \\in O(1/\\sqrt{n})$. In a planted sparse model, in which $\\mathcal{S}$ consists of a single sparse vector $x_0$ embedded in a \u201cgeneric\u201d subspace, Hand et al. proved that (3) also correctly recovers $x_0$, provided the fraction of nonzeros in $x_0$ scales as $\\theta \\in O(1/\\sqrt{n})$ [17]. Unfortunately, the results of [23, 17] are essentially sharp: when $\\theta$ substantially exceeds $1/\\sqrt{n}$, in both models the relaxation (3) provably breaks down. Moreover, the most natural semidefinite programming relaxation of (1),\n\n$$\\min_{X} \\|X\\|_1, \\quad \\text{s.t.} \\quad \\langle A^\\top A, X \\rangle = 0, \\; \\operatorname{trace}[X] = 1, \\; X \\succeq 0, \\tag{4}$$\n\nalso breaks down at exactly the same threshold of $\\theta \\sim 1/\\sqrt{n}$.$^{1}$\n\nOne might naturally conjecture that this $1/\\sqrt{n}$ threshold is simply an intrinsic price we must pay for having an efficient algorithm, even in these random models. Some evidence towards this conjecture might be borrowed from the surface similarity of (2)-(4) and sparse PCA [29]. In sparse PCA, there is a substantial gap between what can be achieved with efficient algorithms and the information theoretic optimum [8]. Is this also the case for recovering a sparse vector in a subspace? 
Is $\\theta \\in O(1/\\sqrt{n})$ simply the best we can do with efficient, guaranteed algorithms?\n\nRemarkably, this is not the case. Recently, Barak et al. introduced a new rounding technique for sum-of-squares relaxations, and showed that the sparse vector $x_0$ in the planted sparse model can be recovered when $p \\ge \\Omega(n^2)$ and $\\theta \\ge \\Omega(1)$ [6]. It is perhaps surprising that this is possible at all with a polynomial time algorithm. Unfortunately, the runtime of this approach is a high-degree polynomial in $p$, and so for machine learning problems in which $p$ is either a feature dimension or sample size, this algorithm is of theoretical interest only. However, it raises an interesting algorithmic question: Is there a practical algorithm that provably recovers a sparse vector with $\\theta \\gg 1/\\sqrt{n}$ nonzeros from a generic subspace $\\mathcal{S}$?\n\nIn this paper, we address this problem, under the following hypotheses: we assume the planted sparse model, in which a target sparse vector $x_0$ is embedded in an otherwise random $n$-dimensional subspace of $\\mathbb{R}^p$. We allow $x_0$ to have up to $\\theta_0 p$ nonzero entries, where $\\theta_0$ is a constant. We provide a relatively simple algorithm which, with very high probability, exactly recovers $x_0$, provided that $p \\ge \\Omega(n^4 \\log^2 n)$.\n\nOur algorithm is based on alternating directions, with two special twists. First, we introduce a special data driven initialization, which seems to be important for achieving $\\theta = \\Omega(1)$. Second, our theoretical results require a second, linear programming based rounding phase, which is similar to [23]. 
Our core algorithm has very simple iterations, of linear complexity in the size of the data, and hence should be scalable to moderate-to-large scale problems.\n\nIn addition to enjoying theoretical guarantees in a regime ($\\theta = \\Omega(1)$) that is out of the reach of previous practical algorithms, it performs well in simulations \u2013 succeeding empirically with $p \\ge \\Omega(n \\log n)$. It also performs well empirically on more challenging data models, such as the dictionary learning model, in which the subspace of interest contains not one, but $n$ target sparse vectors. Breaking the $O(1/\\sqrt{n})$ sparsity barrier with a practical algorithm is an important open problem in the nascent literature on algorithmic guarantees for dictionary learning [5, 4, 2, 1]. We are optimistic that the techniques introduced here will be applicable in this direction.$^{2}$\n\n2 Problem Formulation and Global Optimality\n\nWe study the problem of recovering a sparse vector $x_0 \\neq 0$ (up to scale), which is an element of a known subspace $\\mathcal{S} \\subset \\mathbb{R}^p$ of dimension $n$, provided an arbitrary orthonormal basis $Y \\in \\mathbb{R}^{p \\times n}$ for $\\mathcal{S}$. Our starting point is the nonconvex formulation (2). Both the objective and constraint are nonconvex, and hence not easy to optimize over. We relax (2) by replacing the $\\ell_0$ norm with the $\\ell_1$ norm. For the constraint $x \\neq 0$, which is necessary to avoid a trivial solution, we force $x$ to live on the unit sphere $\\|x\\|_2 = 1$, giving\n\n$$\\min_{x} \\|x\\|_1, \\quad \\text{s.t.} \\quad x \\in \\mathcal{S}, \\; \\|x\\|_2 = 1. \\tag{5}$$\n\n$^{1}$ This breakdown behavior is again in sharp contrast to the standard sparse approximation problem (with $b \\neq 0$), in which it is possible to handle very large fractions of nonzeros (say, $\\theta = \\Omega(1/\\log n)$, or even $\\theta = \\Omega(1)$) using a very simple $\\ell_1$ relaxation [12, 16].\n\n$^{2}$ In work currently in preparation [24], we show that in the dictionary learning problem, efficient algorithms based on nonconvex optimization also produce global solutions, even when $\\theta = \\Omega(1)$.\n\nThis formulation is still nonconvex, and so we should not expect to obtain an efficient algorithm that can solve it globally for general inputs $\\mathcal{S}$. Nevertheless, the geometry of the sphere is benign enough that for well-structured inputs it actually will be possible to give algorithms that find the global optimum of this problem.\n\nThe formulation (5) can be contrasted with (3), in which we optimize the $\\ell_1$ norm subject to the constraint $\\|x\\|_\\infty = 1$. Because $\\|\\cdot\\|_\\infty$ is polyhedral, that formulation immediately yields a sequence of linear programs. This is very convenient for computation and analysis, but suffers from the aforementioned breakdown behavior around $\\|x_0\\|_0 \\sim p/\\sqrt{n}$. In contrast, the sphere $\\|x\\|_2 = 1$ is a more complicated geometric constraint, but will allow much larger numbers of nonzeros in $x_0$. For example, if we consider the global optimizer of a variant of (5):\n\n$$\\min_{q \\in \\mathbb{R}^n} \\|Yq\\|_1, \\quad \\text{s.t.} \\quad \\|q\\|_2 = 1, \\tag{6}$$\n\nunder the planted sparse model (detailed below), $e_1$ is, up to sign, the unique optimizer of (6) with very high probability:\n\nTheorem 2.1 ($\\ell_1/\\ell_2$ recovery, planted sparse model). There exists a constant $\\theta_0 > 0$ such that if the subspace $\\mathcal{S}$ follows the planted sparse model\n\n$$\\mathcal{S} = \\operatorname{span}(x_0, g_1, \\dots, g_{n-1}) \\subset \\mathbb{R}^p, \\tag{7}$$\n\nwith $g_i \\sim_{\\text{i.i.d.}} \\mathcal{N}(0, 1/p)$, and $x_0 \\sim_{\\text{i.i.d.}} \\frac{1}{\\sqrt{\\theta p}} \\operatorname{Ber}(\\theta)$, with $x_0, g_1, \\dots, g_{n-1}$ mutually independent and $1/\\sqrt{n} < \\theta < \\theta_0$, then $\\pm e_1$ are the only global minimizers to (6) if $Y = [x_0, g_1, \\dots, g_{n-1}]$, provided $p \\ge \\Omega(n \\log n)$.$^{3}$\n\nHence, if we could find the global optimizer of (6), we would be able to recover $x_0$ whose number of nonzero entries is quite large \u2013 even linear in the dimension $p$ ($\\theta = \\Omega(1)$). On the other hand, it is not obvious that this should be possible: (6) is nonconvex. In the next section, we will describe a simple heuristic algorithm for (a near approximation of) the $\\ell_1/\\ell_2$ problem (6), which guarantees to find a stationary point. More surprisingly, we will then prove that for a class of random problem instances, this algorithm, plus an auxiliary rounding technique, actually recovers the global optimum \u2013 the target sparse vector $x_0$. The proof requires a detailed probabilistic analysis, which is sketched in Section 4.2.\n\nBefore continuing, it is worth noting that the formulation (5) is in no way novel \u2013 see, e.g., the work of [28] in blind source separation for precedent. However, the novelty originates from our algorithms and subsequent analysis.\n\n3 Algorithm based on Alternating Direction Method (ADM)\n\nTo develop an algorithm for solving (6), we work with the orthonormal basis $Y \\in \\mathbb{R}^{p \\times n}$ for $\\mathcal{S}$. For numerical purposes, and also for coping with noise in practical application, it is useful to consider a slight relaxation of (6), in which we introduce an auxiliary variable $x \\approx Yq$:\n\n$$\\min_{q, x} \\frac{1}{2} \\|Yq - x\\|_2^2 + \\lambda \\|x\\|_1, \\quad \\text{s.t.} \\quad \\|q\\|_2 = 1. \\tag{8}$$\n\nHere, $\\lambda > 0$ is a penalty parameter. It is not difficult to see that this problem is equivalent to minimizing the Huber m-estimator over $Yq$. 
This relaxation makes it possible to apply alternating direction method to this problem, which, starting from some initial point $q^{(0)}$, alternates between optimizing with respect to $x$ and optimizing with respect to $q$:\n\n$$x^{(k+1)} = \\arg\\min_{x} \\frac{1}{2} \\|Yq^{(k)} - x\\|_2^2 + \\lambda \\|x\\|_1, \\tag{9}$$\n\n$$q^{(k+1)} = \\arg\\min_{q} \\frac{1}{2} \\|Yq - x^{(k+1)}\\|_2^2, \\quad \\text{s.t.} \\quad \\|q\\|_2 = 1. \\tag{10}$$\n\nBoth (9) and (10) have simple closed form solutions:\n\n$$x^{(k+1)} = S_\\lambda[Yq^{(k)}], \\qquad q^{(k+1)} = \\frac{Y^\\top x^{(k+1)}}{\\|Y^\\top x^{(k+1)}\\|_2}, \\tag{11}$$\n\nwhere $S_\\lambda[x] = \\operatorname{sign}(x) \\max\\{|x| - \\lambda, 0\\}$ is the soft-thresholding operator. The proposed ADM algorithm is summarized in Algorithm 1.\n\n$^{3}$ In the full version [22], this theorem has been strengthened and become more practical.\n\nAlgorithm 1 Nonconvex ADM\nInput: A matrix $Y \\in \\mathbb{R}^{p \\times n}$ with $Y^\\top Y = I$, initialization $q^{(0)}$, threshold $\\lambda > 0$.\nOutput: The recovered sparse vector $\\hat{x}_0 = Yq^{(k)}$\n1: Set $k = 0$,\n2: while not converged do\n3:   $x^{(k+1)} = S_\\lambda[Yq^{(k)}]$,\n4:   $q^{(k+1)} = Y^\\top x^{(k+1)} / \\|Y^\\top x^{(k+1)}\\|_2$,\n5:   Set $k = k + 1$.\n6: end while\n\nThe algorithm is simple to state and easy to implement. However, if our goal is to recover the sparsest vector $x_0$, some additional tricks are needed.\n\nInitialization. Because the problem (6) is nonconvex, an arbitrary or random initialization is unlikely to produce a global minimizer.$^{4}$ Therefore, good initializations are critical for the proposed ADM algorithm to succeed. For this purpose, we suggest to use every normalized row of $Y$ as initializations for $q$, and solve a sequence of $p$ nonconvex programs (6) by the ADM algorithm.\n\nTo get an intuition of why our initialization works, recall the planted sparse model: $\\mathcal{S} = \\operatorname{span}(x_0, g_1, \\dots, g_{n-1})$. Write $Z = [x_0 \\mid g_1 \\mid \\cdots \\mid g_{n-1}] \\in \\mathbb{R}^{p \\times n}$. Suppose we take a row $z^i$ of $Z$, in which $x_0(i)$ is nonzero, then $x_0(i) = \\Theta(1/\\sqrt{\\theta p})$. Meanwhile, the entries of $g_1(i), \\dots, g_{n-1}(i)$ are all $\\mathcal{N}(0, 1/p)$, and so have size about $1/\\sqrt{p}$. Hence, when $\\theta$ is not too large, $x_0(i)$ will be somewhat bigger than most of the other entries in $z^i$. Put another way, $z^i$ is biased towards the first standard basis vector $e_1$.\n\nNow, under our probabilistic assumptions, $Z$ is very well conditioned: $Z^\\top Z \\approx I$.$^{5}$ Using, e.g., Gram-Schmidt, we can find a basis $\\bar{Y}$ for $\\mathcal{S}$ of the form\n\n$$\\bar{Y} = ZR, \\tag{12}$$\n\nwhere $R$ is upper triangular, and $R$ is itself well-conditioned: $R \\approx I$. Since the $i$-th row of $Z$ is biased in the direction of $e_1$ and $R$ is well-conditioned, the $i$-th row $\\bar{y}^i$ is also biased in the direction of $e_1$. Moreover, we know that the global optimizer $q^\\star$ should satisfy $\\bar{Y} q^\\star = x_0$. Since $Z e_1 = x_0$, we have $q^\\star = R^{-1} e_1 \\approx e_1$. Here, the approximation comes from $R \\approx I$. Hence, for this particular choice of $Y$, described in (12), the $i$-th row is biased in the direction of the global optimizer.\n\nWhat if we are handed some other basis $Y = \\bar{Y} U$, where $U$ is an orthogonal matrix? Suppose $q^\\star$ is a global optimizer to (6) with input matrix $\\bar{Y}$, then it is easy to check that, with input matrix $Y$, $U^\\top q^\\star$ is also a global optimizer to (6), which implies that our initialization is invariant to any rotation of the basis. Hence, even if we are handed an arbitrary basis for $\\mathcal{S}$, the $i$-th row is still biased in the direction of the global optimizer.\n\nRounding. Let $\\bar{q}$ denote the output of Algorithm 1. 
We will prove that with our particular initialization and an appropriate choice of $\\lambda$, the solution of our ADM algorithm falls within a certain radius of the globally optimal solution $q^\\star$ to (6). To recover $q^\\star$, or equivalently to recover the sparse vector $x_0 = Yq^\\star$, we solve the linear program\n\n$$\\min_{q} \\|Yq\\|_1 \\quad \\text{s.t.} \\quad \\langle r, q \\rangle = 1, \\tag{13}$$\n\nwith $r = \\bar{q}$. We will prove that if $r$ is close enough to $q^\\star$, then this relaxation exactly recovers $q^\\star$, and hence $x_0$.\n\n$^{4}$ More precisely, in our models, random initialization does work, but only when the subspace dimension $n$ is extremely low compared to the ambient dimension $p$.\n\n$^{5}$ This is the common heuristic that \u201ctall random matrices are well conditioned\u201d [25].\n\n4 Analysis\n\n4.1 Main Results\n\nIn this section, we describe our main theoretical result, which shows that with high probability, the algorithm described in the previous section succeeds.\n\nTheorem 4.1. Suppose that $\\mathcal{S}$ satisfies the planted sparse model, and let $Y$ be an arbitrary basis for $\\mathcal{S}$. Let $y^1, \\dots, y^p \\in \\mathbb{R}^n$ denote the (transposes of) the rows of $Y$. Apply Algorithm 1 with $\\lambda = 1/\\sqrt{p}$, using initializations $q^{(0)} = y^1, \\dots, y^p$, to produce outputs $\\bar{q}_1, \\dots, \\bar{q}_p$. Solve the linear program (13) with $r = \\bar{q}_1, \\dots, \\bar{q}_p$, to produce $\\hat{q}_1, \\dots, \\hat{q}_p$. Set $i^\\star \\in \\arg\\min_i \\|Y\\hat{q}_i\\|_0$. 
Then\n\n$$Y\\hat{q}_{i^\\star} = \\gamma x_0, \\tag{14}$$\n\nfor some $\\gamma \\neq 0$, with overwhelming probability, provided\n\n$$p > C n^4 \\log^2 n, \\quad \\text{and} \\quad \\frac{1}{\\sqrt{n}} \\le \\theta \\le \\theta_0. \\tag{15}$$\n\nHere, $C$ and $\\theta_0 > 0$ are universal constants.\n\nWe can see that the result in Theorem 4.1 is suboptimal compared to the global optimality condition and Barak et al.\u2019s result in sampling complexity: we require $p \\ge \\Omega(n^4 \\log^2 n)$, while the global optimality condition and Barak et al. demand $p \\ge \\Omega(n \\log n)$ and $p \\ge \\Omega(n^2)$, respectively. Nonetheless, compared to Barak et al., we believe this is the first practical and efficient method that is guaranteed to achieve $\\theta \\sim O(1)$ rate. The lower bound on $\\theta$ in Theorem 4.1 is mostly for convenience in the proof; in fact, the LP rounding stage of our algorithm already succeeds with high probability when $\\theta \\in O(1/\\sqrt{n})$.\n\n4.2 A Sketch of Analysis\n\nThe proof of our main result requires rather detailed technical analysis of the iteration-by-iteration properties of Algorithm 1. In this subsection, we briefly sketch the main ideas. For detailed proofs, please see the technical supplement to this paper.\n\nAs noted in Section 3, the ADM algorithm is invariant to change of basis. So, we can assume without loss of generality that we are working with the particular basis $\\bar{Y} = ZR$ defined in that section. In order to further streamline the presentation, we are going to sketch the proof under the assumption that\n\n$$Y = [x_0 \\mid g_1 \\mid \\cdots \\mid g_{n-1}], \\tag{16}$$\n\nrather than the orthogonalized version $\\bar{Y}$. This may seem implausible, but when $p$ is large $Y$ is already nearly orthogonal, and hence $Y$ is very close to $\\bar{Y}$. In fact, in our proof, we simply carry through the argument for $Y$, and then note that $Y$ and $\\bar{Y}$ are close enough that all steps of the proof still hold with $Y$ replaced by $\\bar{Y}$. With that noted, let $y^1, \\dots, y^p \\in \\mathbb{R}^n$ denote the transposes of the rows of $Y$, and note that these are independent random vectors. From (11), we can see one step of the ADM algorithm takes the form:\n\n$$q^{(k+1)} = \\frac{\\frac{1}{p} \\sum_{i=1}^{p} y^i S_\\lambda[(y^i)^\\top q^{(k)}]}{\\left\\| \\frac{1}{p} \\sum_{i=1}^{p} y^i S_\\lambda[(y^i)^\\top q^{(k)}] \\right\\|_2}. \\tag{17}$$\n\nThis is a very favorable form for analysis: if $q$ is viewed as fixed, the term in the numerator is a sum of $p$ independent random vectors. To this end, we define a vector valued random process $Q(q)$ on $q \\in \\mathbb{S}^{n-1}$, via\n\n$$Q(q) = \\frac{1}{p} \\sum_{i=1}^{p} y^i S_\\lambda[(y^i)^\\top q]. \\tag{18}$$\n\nWe study the behavior of the iteration (17) through the random process $Q(q)$. We wish to show that w.h.p. in our choice of $Y$, $q^{(k)}$ converges to $\\pm e_1$, so that the algorithm successfully retrieves the sparse vector $x_0 = Ye_1$. Thus, we hope that in general, $Q(q)$ is more concentrated on the first coordinate than $q$. Let us partition the vector $q$ as $q = [q_1; q_2]$, with $q_1 \\in \\mathbb{R}$ and $q_2 \\in \\mathbb{R}^{n-1}$, and correspondingly partition $Q(q) = [Q_1(q); Q_2(q)]$, where\n\n$$Q_1(q) = \\frac{1}{p} \\sum_{i=1}^{p} x_0(i) \\, S_\\lambda[(y^i)^\\top q] \\quad \\text{and} \\quad Q_2(q) = \\frac{1}{p} \\sum_{i=1}^{p} g^i \\, S_\\lambda[(y^i)^\\top q]. \\tag{19}$$\n\nThe inner product of $Q(q)/\\|Q(q)\\|_2$ and $e_1$ is strictly larger than the inner product of $q$ and $e_1$ if and only if\n\n$$\\frac{|Q_1(q)|}{|q_1|} > \\frac{\\|Q_2(q)\\|_2}{\\|q_2\\|_2}. \\tag{20}$$\n\nIn the appendix, we show that with high probability, this inequality holds uniformly over a significant portion of the sphere, so the algorithm moves in the correct direction. 
To complete the proof of Theorem 4.1, we combine the following observations:\n\n1. Good initializers. With high probability, at least one of the initializers $q^{(0)}$ satisfies $|q_1^{(0)}| > \\frac{1}{4\\sqrt{\\theta n}}$.\n\n2. Uniform progress away from the equator. With high probability, for every $q$ such that $\\frac{1}{2\\sqrt{\\theta n}} \\le |q_1| \\le C_\\star \\sqrt{\\theta}$, the bound\n\n$$\\frac{|Q_1(q)|}{|q_1|} - \\frac{\\|Q_2(q)\\|_2}{\\|q_2\\|_2} > \\frac{c}{np} \\tag{21}$$\n\nholds. This implies that if at any iteration $k$ of the algorithm, $|q_1^{(k)}| > \\frac{1}{2\\sqrt{\\theta n}}$, the algorithm will eventually obtain a point $q^{(k')}$, $k' > k$, for which $|q_1^{(k')}| > C_\\star \\sqrt{\\theta}$. Moreover the steady progress makes bounding the iteration complexity possible.\n\n3. No jumps away from the caps. With high probability, for all $q$ such that $|q_1| > C_\\star \\sqrt{\\theta}$,\n\n$$\\frac{|Q_1(q)|}{\\sqrt{|Q_1(q)|^2 + \\|Q_2(q)\\|_2^2}} \\ge 2\\sqrt{\\theta}. \\tag{22}$$\n\n4. Location of stationary points. Steps 1, 2 and 3 above imply that if Algorithm 1 ever obtains a point $q^{(k)}$ with $|q_1^{(k)}| > \\frac{1}{4\\sqrt{\\theta n}}$, it will converge to a point $\\bar{q}$ with $|\\bar{q}_1| > C_\\star \\sqrt{\\theta}$.\n\n5. Rounding succeeds when $|r_1| > 2\\sqrt{\\theta}$. With high probability, the linear programming based rounding (13) will produce $\\pm x_0$, up to scale, whenever it is provided with an input $r$ whose first coordinate has magnitude at least $2\\sqrt{\\theta}$.\n\nTaken together, these claims imply that from at least one of the initializers $q^{(0)}$, the ADM algorithm will produce an output $\\bar{q}$ which is accurate enough for LP rounding to exactly return $x_0$, up to scale. As $x_0$ is the sparsest nonzero vector in the subspace $\\mathcal{S}$ with overwhelming probability, it will be selected as $Y\\hat{q}_{i^\\star}$, and hence produced by the algorithm.\n\n5 Experimental Results\n\nIn this section, we show the performance of the proposed ADM algorithm on both synthetic and real datasets. On the synthetic dataset, we show the phase transition of our algorithm on both the planted sparse vector and dictionary learning models; for the real dataset, we demonstrate how seeking sparse vectors can help discover interesting patterns.\n\n5.1 Phase Transition on Synthetic Data\n\nFor the planted sparse model, for each pair of $(k, p)$, we generate the $n$ dimensional subspace $\\mathcal{S} \\subset \\mathbb{R}^p$ by a $k$ sparse vector $x_0$ with nonzero entries equal to 1 and a random Gaussian matrix $G \\in \\mathbb{R}^{p \\times (n-1)}$ with $G_{ij} \\sim_{\\text{i.i.d.}} \\mathcal{N}(0, 1/p)$, so that the basis $Y$ of the subspace $\\mathcal{S}$ can be constructed by $Y = \\operatorname{GS}([x_0, G]) U$, where $\\operatorname{GS}(\\cdot)$ denotes the Gram-Schmidt orthonormalization operator and $U \\in \\mathbb{R}^{n \\times n}$ is an arbitrary orthogonal matrix. We fix the relationship between $n$ and $p$ as $p = 5n \\log n$, and set the regularization parameter in (8) as $\\lambda = 1/\\sqrt{p}$. We use all the normalized rows of $Y$ as initializations of $q$ for the proposed ADM algorithm, and run every program for 5000 iterations. We declare the proposed method successful whenever $\\| x_0/\\|x_0\\|_2 - Yq \\|_2 \\le \\epsilon$ for at least one of the $p$ programs, for some error tolerance $\\epsilon = 10^{-3}$. 
For each pair of $(k, p)$, we repeat the simulation 5 times.\n\nFigure 1: Phase transition for the planted sparse model (left) and dictionary learning (right) using the ADM algorithm, with fixed relationship between $p$ and $n$: $p = 5n \\log n$. White indicates success and black indicates failure.\n\nSecond, we consider the same dictionary learning model as in [23]. Specifically, the observation is assumed to be $Y = A_0 X_0$, where $A_0$ is a square, invertible matrix, and $X_0$ an $n \\times p$ sparse matrix. Since $A_0$ is invertible, the row space of $Y$ is the same as that of $X_0$. For each pair of $(k, n)$, we generate $X_0 = [x_1, \\cdots, x_n]^\\top$, where each vector $x_i \\in \\mathbb{R}^p$ is $k$-sparse with every nonzero entry following an i.i.d. Gaussian distribution, and construct the observation by $Y^\\top = \\operatorname{GS}(X_0^\\top) U^\\top$. We repeat the same experiment as for the planted sparse model presented above. The only difference is that we declare the proposed method successful as long as one sparse row of $X_0$ is recovered by those $p$ programs.\n\nFig. 1 shows the phase transition between the sparsity level $k = \\theta p$ and $p$ for both models. It seems clear that for both problems our algorithm can work well into (beyond) the linear regime in sparsity level. Hence for the planted sparse model, closing the gap between our theoretical guarantee and the empirical performance is one future direction. Also, how to extend our analysis to dictionary learning is another interesting direction.\n\n5.2 Exploratory Experiments on Faces\n\nIt is well known in computer vision that convex objects subject only to illumination changes produce image collections that can be well approximated by a low-dimensional subspace in raw-pixel space [7]. We will play with face subspaces here. First, we extract face images of one person (65 images) under different illumination conditions. 
Then we apply robust principal component analysis [10] to the data and get a low dimensional subspace of dimension 10, i.e., the basis $Y \\in \\mathbb{R}^{32256 \\times 10}$. We apply the ADM algorithm to find the sparsest element in such a subspace, by randomly selecting 10% of the rows as initializations for $q$. We judge the sparsity in an $\\ell_1/\\ell_2$ sense, that is, the sparsest vector $\\hat{x}_0 = Yq^*$ should produce the smallest $\\|Yq\\|_1 / \\|Yq\\|_2$ among all results. Once some sparse vectors are found, we project the subspace onto the orthogonal complement of the sparse vectors already found, and continue the seeking process in the projected subspace. Fig. 2 shows the first four sparse vectors we get from the data. We can see they correspond well to different extreme illumination conditions.\n\nSecond, we manually select ten different persons\u2019 faces under the normal lighting condition. Again, the dimension of the subspace is 10 and $Y \\in \\mathbb{R}^{32256 \\times 10}$. We repeat the same experiment as stated above. Fig. 3 shows four sparse vectors we get from the data. Interestingly, the sparse vectors roughly correspond to differences of face images concentrated around the facial parts in which different people tend to differ from each other.\n\nIn sum, our algorithm seems to find useful sparse vectors for potential applications, like peculiarity discovery in the first setting, and locating differences in the second setting. Nevertheless, the main goal of this experiment is to invite readers to think about similar pattern discovery problems that might be cast as searching for a sparse vector in a subspace. 
The experiment also demonstrates in a concrete way the practicality of our algorithm, both in handling data sets of realistic size and in producing attractive results even outside of the (idealized) planted sparse model that we adopt for analysis.\n\nFigure 2: Four sparse vectors extracted by the ADM algorithm for one person in the Yale B database under different illuminations.\n\nFigure 3: Four sparse vectors extracted by the ADM algorithm for 10 persons in the Yale B database under normal illuminations.\n\n6 Discussion\n\nThe random models we assume for the subspace can be easily extended to other random models, particularly for dictionary learning. Moreover we believe the algorithm paradigm works far beyond the idealized models, as our preliminary experiments on face data have clearly shown. For the particular planted sparse model, the performance gap in terms of $(p, n, \\theta)$ between the empirical simulation and our result is likely due to analysis itself. Advanced techniques to bound the empirical process, such as decoupling [15] techniques, can be deployed in place of our crude union bound to cover all iterates. Our algorithmic paradigm as a whole sits well in the recent surge of research endeavors in provable and practical nonconvex approaches towards many problems of interest, often in large-scale setting [11, 20, 18, 21, 26]. We believe this line of research will become increasingly important in theory and practice. On the application side, the potential of seeking sparse/structured element in a subspace seems largely unexplored, despite the cases we mentioned at the start. We hope this work can invite more application ideas.\n\nAcknowledgements. JS thanks the Wei Family Private Foundation for their generous support. We thank Cun Mu, IEOR Department of Columbia University, for helpful discussion and input regarding this work. 
This work was partially supported by grants ONR N00014-13-1-0492, NSF 1343282, and funding from the Moore and Sloan Foundations.\n\nReferences\n\n[1] AGARWAL, A., ANANDKUMAR, A., JAIN, P., NETRAPALLI, P., AND TANDON, R. Learning sparsely used overcomplete dictionaries via alternating minimization. arXiv preprint arXiv:1310.7991 (2013).\n\n[2] AGARWAL, A., ANANDKUMAR, A., AND NETRAPALLI, P. Exact recovery of sparsely used overcomplete dictionaries. arXiv preprint arXiv:1309.1952 (2013).\n\n[3] ANANDKUMAR, A., HSU, D., JANZAMIN, M., AND KAKADE, S. M. When are overcomplete topic models identifiable? Uniqueness of tensor Tucker decompositions with structured sparsity. In Advances in Neural Information Processing Systems (2013), pp. 1986\u20131994.\n\n[4] ARORA, S., BHASKARA, A., GE, R., AND MA, T. More algorithms for provable dictionary learning. arXiv preprint arXiv:1401.0579 (2014).\n\n[5] ARORA, S., GE, R., AND MOITRA, A. New algorithms for learning incoherent and overcomplete dictionaries. arXiv preprint arXiv:1308.6273 (2013).\n\n[6] BARAK, B., KELNER, J., AND STEURER, D. Rounding sum-of-squares relaxations. arXiv preprint arXiv:1312.6652 (2013).\n\n[7] BASRI, R., AND JACOBS, D. W. Lambertian reflectance and linear subspaces. Pattern Analysis and Machine Intelligence, IEEE Transactions on 25, 2 (2003), 218\u2013233.\n\n[8] BERTHET, Q., AND RIGOLLET, P. Complexity theoretic lower bounds for sparse principal component detection. In Conference on Learning Theory (2013), pp. 1046\u20131066.\n\n[9] BEYLKIN, G., AND MONZ\u00d3N, L. On approximation of functions by exponential sums. Applied and Computational Harmonic Analysis 19, 1 (2005), 17\u201348.\n\n[10] CAND\u00c8S, E., LI, X., MA, Y., AND WRIGHT, J. Robust principal component analysis? Journal of the ACM 58, 3 (May 2011).\n\n[11] CAND\u00c8S, E. J., LI, X., AND SOLTANOLKOTABI, M. Phase retrieval via Wirtinger flow: Theory and algorithms. 
arXiv preprint arXiv:1407.1065 (2014).\n\n[12] CAND\u00c8S, E. J., AND TAO, T. Decoding by linear programming. IEEE Transactions on Information Theory 51, 12 (2005), 4203\u20134215.\n\n[13] COLEMAN, T. F., AND POTHEN, A. The null space problem I. Complexity. SIAM Journal on Algebraic Discrete Methods 7, 4 (1986), 527\u2013537.\n\n[14] DAI, Y., LI, H., AND HE, M. A simple prior-free method for non-rigid structure-from-motion factorization. In 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2012), IEEE, pp. 2018\u20132025.\n\n[15] DE LA PE\u00d1A, V., AND GIN\u00c9, E. Decoupling: From dependence to independence. Springer, 1999.\n\n[16] DONOHO, D. L. For most large underdetermined systems of linear equations the minimal \u21131-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics 59, 6 (2006), 797\u2013829.\n\n[17] HAND, P., AND DEMANET, L. Recovering the sparsest element in a subspace. arXiv preprint arXiv:1310.1654 (2013).\n\n[18] HARDT, M. On the provable convergence of alternating minimization for matrix completion. arXiv preprint arXiv:1312.0925 (2013).\n\n[19] HO, J., XIE, Y., AND VEMURI, B. On a nonlinear generalization of sparse coding and dictionary learning. In Proceedings of the 30th International Conference on Machine Learning (2013), pp. 1480\u20131488.\n\n[20] JAIN, P., NETRAPALLI, P., AND SANGHAVI, S. Low-rank matrix completion using alternating minimization. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing (2013), ACM, pp. 665\u2013674.\n\n[21] NETRAPALLI, P., JAIN, P., AND SANGHAVI, S. Phase retrieval using alternating minimization. In Advances in Neural Information Processing Systems (2013), pp. 2796\u20132804.\n\n[22] QU, Q., SUN, J., AND WRIGHT, J. Finding a sparse vector in a subspace: Linear sparsity using alternating directions. arXiv preprint arXiv:1412.4659 (2014).\n\n[23] SPIELMAN, D. A., WANG, H., AND WRIGHT, J. 
Exact recovery of sparsely-used dictionaries. In Proceedings of the 25th Annual Conference on Learning Theory (2012).\n\n[24] SUN, J., QU, Q., AND WRIGHT, J. Complete dictionary recovery over the sphere. In preparation (2014).\n\n[25] VERSHYNIN, R. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027 (2010).\n\n[26] YI, X., CARAMANIS, C., AND SANGHAVI, S. Alternating minimization for mixed linear regression. arXiv preprint arXiv:1310.3745 (2013).\n\n[27] ZHAO, Y.-B., AND FUKUSHIMA, M. Rank-one solutions for homogeneous linear matrix equations over the positive semidefinite cone. Applied Mathematics and Computation 219, 10 (2013), 5569\u20135583.\n\n[28] ZIBULEVSKY, M., AND PEARLMUTTER, B. A. Blind source separation by sparse decomposition in a signal dictionary. Neural Computation 13, 4 (2001), 863\u2013882.\n\n[29] ZOU, H., HASTIE, T., AND TIBSHIRANI, R. Sparse principal component analysis. Journal of Computational and Graphical Statistics 15, 2 (2006), 265\u2013286.\n", "award": [], "sourceid": 1743, "authors": [{"given_name": "Qing", "family_name": "Qu", "institution": "Columbia University"}, {"given_name": "Ju", "family_name": "Sun", "institution": "Columbia University"}, {"given_name": "John", "family_name": "Wright", "institution": "Columbia University"}]}