{"title": "Tight convex relaxations for sparse matrix factorization", "book": "Advances in Neural Information Processing Systems", "page_first": 3284, "page_last": 3292, "abstract": "Based on a new atomic norm, we propose a new convex formulation for sparse matrix factorization problems in which the number of nonzero elements of the factors is assumed fixed and known. The formulation counts sparse PCA with multiple factors, subspace clustering and low-rank sparse bilinear regression as potential applications. We compute slow rates and an upper bound on the statistical dimension of the suggested norm for rank 1 matrices, showing that its statistical dimension is an order of magnitude smaller than the usual l_1-norm, trace norm and their combinations. Even though our convex formulation is in theory hard and does not lead to provably polynomial time algorithmic schemes, we propose an active set algorithm leveraging the structure of the convex problem to solve it and show promising numerical results.", "full_text": "Tight convex relaxations for sparse matrix factorization

Emile Richard
Electrical Engineering
Stanford University

Guillaume Obozinski
Université Paris-Est
École des Ponts - ParisTech

Jean-Philippe Vert
MINES ParisTech
Institut Curie

Abstract

Based on a new atomic norm, we propose a new convex formulation for sparse matrix factorization problems in which the number of non-zero elements of the factors is assumed fixed and known. The formulation counts sparse PCA with multiple factors, subspace clustering and low-rank sparse bilinear regression as potential applications. We compute slow rates and an upper bound on the statistical dimension [1] of the suggested norm for rank 1 matrices, showing that its statistical dimension is an order of magnitude smaller than the usual ℓ1-norm, trace norm and their combinations.
Even though our convex formulation is in theory hard and does not lead to provably polynomial time algorithmic schemes, we propose an active set algorithm leveraging the structure of the convex problem to solve it and show promising numerical results.

1 Introduction

A range of machine learning problems such as link prediction in graphs containing community structure [16], phase retrieval [5], subspace clustering [18] or dictionary learning [12] amount to solving sparse matrix factorization problems, i.e., to inferring a low-rank matrix that can be factorized as the product of two sparse matrices with few columns (left factor) and few rows (right factor). Such a factorization allows more efficient storage, faster computation and more interpretable solutions, and especially leads to more accurate estimates in many situations. In the case of interaction networks, for example, this is related to the assumption that the network is organized as a collection of highly connected communities which can overlap. More generally, considering sparse low-rank matrices combines two natural forms of sparsity, in the spectrum and in the support, which can be motivated by the need to explain system behaviors by a superposition of latent processes which only involve a few parameters. Landmark applications of sparse matrix factorization are sparse principal component analysis (SPCA) [8, 21] and sparse canonical correlation analysis (SCCA) [19], which are widely used to analyze high-dimensional data such as genomic data.

In this paper, we propose new convex formulations for the estimation of sparse low-rank matrices. In particular, we assume that the matrix of interest should be factorized as a sum of rank-one factors that are each the product of a column vector and a row vector with respectively k and q non-zero entries, where k and q are known. We first introduce below the (k, q)-rank of a matrix as the minimum number of left and right factors, having respectively k and q non-zeros, required to reconstruct a matrix. This index is a more involved complexity measure for matrices than the rank in that it conditions on the number of non-zero elements of the left and right factors of the matrix. Based on this index, we propose a new atomic norm for matrices [7] by considering the convex hull of these rank-one factors restricted to the unit ball of the operator norm, resulting in a convex surrogate for the low (k, q)-rank matrix estimation problem. We analyze the statistical dimension of the new norm and compare it to that of linear combinations of the ℓ1 and trace norms. In the vector case, our atomic norm actually reduces to the k-support norm introduced by [2], and our analysis shows that its statistical power is not better than that of the ℓ1-norm. By contrast, in the matrix case, the statistical dimension of our norm is at least one order of magnitude better than combinations of the ℓ1-norm and the trace norm.

However, while in the vector case the computation remains feasible in polynomial time, the norm we introduce for matrices cannot be evaluated in polynomial time. We propose algorithmic schemes to approximately learn with the new norm. The same norm and meta-algorithms can be used as a regularizer in supervised problems such as multitask learning or quadratic regression and phase retrieval, highlighting the fact that our algorithmic contribution does not consist in providing more efficient solutions to the rank-1 SPCA problem, but in combining the atoms found by rank-1 solvers in a principled way.

2 Tight convex relaxations of sparse factorization constraints

In this section we propose a new matrix norm allowing us to formulate various sparse matrix factorization problems as convex optimization problems.
We start by defining the (k, q)-rank of a matrix in Section 2.1, a useful generalization of the rank which also quantifies the sparseness of a matrix factorization. We then introduce in Section 2.2 the (k, q)-trace norm, an atomic norm defined as the convex relaxation of the (k, q)-rank over the operator norm ball. We discuss further properties and potential applications of this norm used as a regularizer in Section 2.3.

2.1 The (k, q)-rank of a matrix

The rank of a matrix Z ∈ ℝ^{m1×m2} is the minimum number of rank-1 matrices needed to express Z as a linear combination of the form Z = ∑_{i=1}^r a_i b_i^⊤. The following definition generalizes this rank to incorporate conditions on the sparseness of the rank-1 elements:

Definition 1 ((k, q)-sparse decomposition and (k, q)-rank) For a matrix Z ∈ ℝ^{m1×m2}, we call (k, q)-sparse decomposition of Z any decomposition of the form Z = ∑_{i=1}^r c_i a_i b_i^⊤ where the a_i (resp. b_i) are unit vectors with at most k (resp. q) non-zero elements, and with minimal r, which we call the (k, q)-rank of Z.

The (k, q)-rank and a (k, q)-sparse decomposition of Z can equivalently be defined as the optimal value and a solution of the optimization problem:

    min ‖c‖_0   s.t.   Z = ∑_{i=1}^∞ c_i a_i b_i^⊤ ,   (a_i, b_i, c_i) ∈ A^{m1}_k × A^{m2}_q × ℝ_+ ,    (1)

where for any 1 ≤ j ≤ n, A^n_j = {a ∈ ℝ^n | ‖a‖_0 ≤ j, ‖a‖_2 = 1}. Since A^n_i ⊂ A^n_j when i ≤ j, we have for any k and q: rank(Z) ≤ (k, q)-rank(Z) ≤ ‖Z‖_0. The (k, q)-rank is useful to formalize problems such as sparse matrix factorization, which can be defined as approximating the solution of a matrix-valued problem by a matrix having low (k, q)-rank. For instance, the standard rank-1 SPCA problem consists in finding the symmetric matrix with (k, k)-rank equal to 1 providing the best approximation of the sample covariance matrix [21].

2.2 A convex relaxation for the (k, q)-rank

The (k, q)-rank is a discrete, nonconvex index, like the rank or the cardinality, leading to computational difficulties if one wants to learn matrices with small (k, q)-rank. We propose a convex relaxation of the (k, q)-rank aimed at mitigating these difficulties. For that purpose, we consider an atomic norm [7] that provides a convex relaxation of the (k, q)-rank, just like the ℓ1 norm and the trace norm are convex relaxations of the ℓ0 semi-norm and the rank, respectively. An atomic norm is a convex function defined from a small set of elements, called atoms, which constitute a basis on which an object of interest can be sparsely decomposed. The function (a norm if the set of atoms is centrally symmetric) is defined as the gauge of the convex hull of the atoms; in other terms, its unit ball, or level set of value 1, is the convex envelope of the atoms.
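As a concrete illustration of Definition 1, the short sketch below (our own illustrative code, not from the paper; the helper name `random_atom` is ours) builds a matrix admitting a (k, q)-sparse decomposition with r terms and checks numerically the inequality rank(Z) ≤ (k, q)-rank(Z) ≤ ‖Z‖_0.

```python
import numpy as np

def random_atom(m, s, rng):
    """Draw a unit vector with at most s non-zero entries (an element of A^m_s)."""
    a = np.zeros(m)
    support = rng.choice(m, size=s, replace=False)
    a[support] = rng.standard_normal(s)
    return a / np.linalg.norm(a)

rng = np.random.default_rng(0)
m1, m2, k, q, r = 30, 40, 3, 5, 4

# Z admits a (k, q)-sparse decomposition with r terms, so (k, q)-rank(Z) <= r.
atoms = [(random_atom(m1, k, rng), random_atom(m2, q, rng)) for _ in range(r)]
c = rng.uniform(1.0, 2.0, size=r)                     # non-negative weights
Z = sum(ci * np.outer(a, b) for ci, (a, b) in zip(c, atoms))

rank_Z = np.linalg.matrix_rank(Z)
nnz_Z = np.count_nonzero(Z)
# Sandwich: rank(Z) <= (k, q)-rank(Z) <= ||Z||_0.
assert rank_Z <= r <= nnz_Z
```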
For our atoms of interest, namely rank-1 factors of given sparsities k and q, we define:

Definition 2 ((k, q)-trace norm) Let A_{k,q} be the set of atoms A_{k,q} = {ab^⊤ : a ∈ A^{m1}_k, b ∈ A^{m2}_q}. For a matrix Z ∈ ℝ^{m1×m2}, the (k, q)-trace norm Ω_{k,q}(Z) is the atomic norm induced by A_{k,q}, i.e.,

    Ω_{k,q}(Z) = inf { ∑_{A∈A_{k,q}} c_A : Z = ∑_{A∈A_{k,q}} c_A A , c_A ≥ 0, ∀A ∈ A_{k,q} } .    (2)

In words, A_{k,q} is the set of matrices A ∈ ℝ^{m1×m2} such that (k, q)-rank(A) = 1 and ‖A‖_op = 1. The next lemma provides an explicit formulation for the (k, q)-trace norm and its dual:

Lemma 1 For any Z, K ∈ ℝ^{m1×m2}, and denoting G^m_k = {I ⊂ [[1, m]] : |I| = k}, we have

    Ω_{k,q}(Z) = inf { ∑_{(I,J)∈G^{m1}_k×G^{m2}_q} ‖Z^{(I,J)}‖_* : Z = ∑_{(I,J)} Z^{(I,J)} , supp(Z^{(I,J)}) ⊂ I × J } ,    (3)

and  Ω*_{k,q}(K) = max { ‖K_{I,J}‖_op : I ∈ G^{m1}_k , J ∈ G^{m2}_q } .

2.3 Learning matrices with sparse factors

In this section, we briefly discuss how the (k, q)-trace norm can be used to formulate various problems involving the estimation of sparse low-rank matrices. A way to learn a matrix Z with low empirical risk L(Z) and low (k, q)-rank is to use Ω_{k,q} as a regularizer and minimize an objective of the form

    min_{Z∈ℝ^{m1×m2}} L(Z) + λ Ω_{k,q}(Z) .    (4)

A number of problems can be formulated as variants of (4).

Bilinear regression. In bilinear regression, given two inputs x ∈ ℝ^{m1} and x′ ∈ ℝ^{m2}, one observes as output a noisy version of y = x^⊤ Z x′.
Assuming that Z has low (k, q)-rank means that the noiseless response is a sum of a small number of terms, each involving only a small number of features from each of the input vectors. To estimate Z within such a model from observations (x_i, x′_i, y_i)_{i=1,...,n}, one can consider the following formulation, in which ℓ is a convex loss:

    min_{Z∈ℝ^{m1×m2}} ∑_i ℓ( x_i^⊤ Z x′_i , y_i ) + λ Ω_{k,q}(Z) .    (5)

Subspace clustering. In subspace clustering, one assumes that the data can be clustered in such a way that the points in each cluster belong to a low-dimensional space. If we have a design matrix X ∈ ℝ^{n×p} with each row corresponding to an observation, then the previous assumption means that if X^{(j)} ∈ ℝ^{n_j×p} is the matrix formed by the rows of cluster j, there exists a low-rank matrix Z^{(j)} ∈ ℝ^{n_j×n_j} such that Z^{(j)} X^{(j)} = X^{(j)}. This means that there exists a block-diagonal matrix Z with low-rank diagonal blocks such that ZX = X. This idea, exploited recently by [18], implies that Z is a sum of low-rank sparse matrices, and this property still holds if the clustering is unknown. We therefore suggest that if all subspaces are of dimension k, Z may be estimated via

    min_{Z∈ℝ^{n×n}} Ω_{k,k}(Z)   s.t.   ZX = X .

Sparse PCA. One possible formulation of sparse PCA with multiple factors is the problem of approximating an empirical covariance matrix Σ̂_n by a low-rank matrix with sparse factors. This suggests to formulate sparse PCA as follows:

    min_Z { ‖Σ̂_n − Z‖_F : (k, k)-rank(Z) ≤ r and Z ⪰ 0 } ,    (6)

where k is the maximum number of non-zero coefficients allowed in each principal direction.
By contrast to sequential approaches that estimate the principal components one by one [11], this formulation requires finding simultaneously a set of complementary factors. If we require the decomposition of Z to be a sum of positive semi-definite (k, k)-sparse rank-one factors (which is a stronger assumption than assuming that Z is p.s.d.), the positivity constraint on Z is no longer necessary and a natural convex relaxation for (6) using another atomic norm (in fact only a gauge here) is

    min_{Z∈ℝ^{m×m}} ‖Σ̂_n − Z‖_F² + λ Ω_{k,⪰}(Z) ,    (7)

where Ω_{k,⪰} is the gauge of the set of atoms A_{k,⪰} := {aa^⊤, a ∈ A^m_k}.

3 Performance of the (k, q)-trace norm for denoising

In this section, we consider the problem of denoising a low-rank matrix Z⋆ ∈ ℝ^{m1×m2} with sparse factors corrupted by additive Gaussian noise, that is, noisy observations Y ∈ ℝ^{m1×m2} of the form Y = Z⋆ + σG, where σ > 0 and G is a random matrix with i.i.d. N(0, 1) entries. For a convex penalty Ω : ℝ^{m1×m2} → ℝ, we consider, for any λ > 0, the estimator

    Ẑ^λ_Ω = argmin_Z (1/2) ‖Z − Y‖_F² + λ Ω(Z) .    (8)

The following result is a straightforward generalization to any norm Ω of the so-called slow rates that are well known for the ℓ1 norm and other norms such as the trace norm (see e.g. [10]).

Lemma 2 If λ ≥ σ Ω*(G), then ‖Ẑ^λ_Ω − Z⋆‖_F² ≤ 4λ Ω(Z⋆) .

To derive an upper bound in estimation error from these inequalities, and to keep the argument as simple as possible, we consider the oracle[1] estimate Ẑ^Oracle_Ω equal to Ẑ^λ_Ω where λ = σΩ*(G). From Lemma 2 we immediately get

    E ‖Ẑ^Oracle_Ω − Z⋆‖_F² ≤ 4σ Ω(Z⋆) E Ω*(G) .    (9)

This upper bound can be computed for Z⋆ = ab^⊤ ∈ A_{k,q} for different norms. In particular, for such Z⋆ we have ‖ab^⊤‖_1 ≤ √(kq) and Ω_{k,q}(ab^⊤) = ‖ab^⊤‖_* = 1, which lead to the corollary:

Corollary 1 When Z⋆ = ab^⊤ ∈ A_{k,q} is an atom, the expected errors of the oracle estimators Ẑ^Oracle_{Ω_{k,q}}, Ẑ^Oracle_{ℓ1} and Ẑ^Oracle_{*}, using respectively the (k, q)-trace norm, the ℓ1 norm and the trace norm, are upper bounded as follows:

    E ‖Ẑ^Oracle_{Ω_{k,q}} − Z⋆‖_F² ≤ 8σ ( √( k log(m1/k) + 2k ) + √( q log(m2/q) + 2q ) ) ,
    E ‖Ẑ^Oracle_{ℓ1} − Z⋆‖_F² ≤ 2σ ‖Z⋆‖_1 √(2 log(m1 m2)) ≤ 2σ √( 2 kq log(m1 m2) ) ,
    E ‖Ẑ^Oracle_{*} − Z⋆‖_F² ≤ 2σ ( √m1 + √m2 ) .    (10)

When the smallest entry in absolute value of a or b is close to 0, the expected error is smaller for Ẑ^Oracle_{ℓ1}, reaching σ√(2 log(m1 m2)) on e_1 e_1^⊤, while not changing for the two other norms. But under the assumption that the smallest nonzero entries in absolute value of a and b are lower bounded by c/√(kq) with c a constant, the upper bounds on the rates obtained for the ℓ1 and trace norms are at least an order of magnitude larger than for the (k, q)-trace norm.
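To get a feel for the gap in Corollary 1, the short computation below (our own illustration, not from the paper) evaluates the three upper bounds for a strong atom with m1 = m2 = 10000, k = q = 50 and σ = 1, using the worst case ‖Z⋆‖_1 = √(kq) for the ℓ1 bound.

```python
import math

m1 = m2 = 10_000
k = q = 50
sigma = 1.0

# Corollary 1 upper bounds on the expected squared error at a strong atom ab^T.
bound_kq = 8 * sigma * (math.sqrt(k * math.log(m1 / k) + 2 * k)
                        + math.sqrt(q * math.log(m2 / q) + 2 * q))
bound_l1 = 2 * sigma * math.sqrt(2 * k * q * math.log(m1 * m2))
bound_trace = 2 * sigma * (math.sqrt(m1) + math.sqrt(m2))

# In this regime the (k, q)-trace norm bound is the smallest of the three.
assert bound_kq < bound_trace < bound_l1
```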
We report the order of magnitude of these upper bounds in Table 1 for m1 = m2 = m and k = q, assuming that the nonzero coefficients are lower bounded in magnitude by c/√(kq). Obviously the comparison of upper bounds is not enough to conclude to the superiority of the (k, q)-trace norm and, admittedly, the problem of denoising considered here is a special instance of linear regression in which the design matrix is the identity; since this is a case in which the design is trivially incoherent, it is possible to obtain fast rates for decomposable norms such as the ℓ1 or trace norm [13]. However, the slow rates obtained are the same if, instead of Y, a linear transformation of Z with incoherent design is observed, or when the signal to recover is only weakly sparse, which is not the case for the fast rates. Moreover, Lemma 2 applies to matrices of any rank, and Corollary 1 generalizes to rank greater than 1. We present in the next section more sophisticated results, based on bounds on the so-called statistical dimension of different norms [1].

[1] We call it an oracle estimate because the choice of λ depends on the unknown noise level. Virtually identical bounds (up to constants) holding with large probability could be derived for the non-oracle estimator by controlling the deviations of Ω*(G) from its expectation.

4 A bound on the statistical dimension of the (k, q)-trace norm

The squared Gaussian width [7, and references therein] and the statistical dimension introduced recently by Amelunxen et al. [1] provide quantified estimation guarantees. The two quantities are equal up to an additive term smaller than 1, and we thus present results only in terms of the statistical dimension. The sample complexity of exact recovery and robust recovery are characterized by this quantity [7]. It is also equal to the signal-to-noise ratio necessary for denoising [6] and demixing [1] (see supplementary section 3). The statistical dimension is defined as follows: if T_Ω(Z) is the tangent cone of a matrix norm Ω : ℝ^{m1×m2} → ℝ_+ at Z, then the statistical dimension of T_Ω(Z) is

    S(Z, Ω) := E[ ‖Π_{T_Ω(Z)}(G)‖_F² ] ,

where G ∈ ℝ^{m1×m2} is a random matrix with i.i.d. standard normal entries and Π_{T_Ω(Z)}(G) is the orthogonal projection of G onto the cone T_Ω(Z). In this section, we compute an upper bound on the statistical dimension of Ω_{k,q} at an atom A of A_{k,q}, which we denote by S(A, Ω_{k,q}), and compare it to results known for linear combinations of the ℓ1 and the trace norm of the form Γ_μ with

    ∀μ ∈ [0, 1], ∀Z ∈ ℝ^{m1×m2} ,   Γ_μ(Z) := (μ/√(kq)) ‖Z‖_1 + (1 − μ) ‖Z‖_* ,    (11)

which are norms that have been used in the literature to infer sparse low-rank matrices [17]. The ability to recover the support of a sparse vector typically depends on the size of its smallest non-zero coefficient. For the recovery of a sparse rank-1 matrix, this motivates the following definition:

Definition 3 Let A = ab^⊤ ∈ A_{k,q} with I0 = supp(a) and J0 = supp(b). Denote a²_min = min_{i∈I0} a_i² and b²_min = min_{j∈J0} b_j². We define the strength γ(a, b) ∈ (0, 1] as γ(a, b) := (k a²_min) ∧ (q b²_min).

The strength of an atom takes the maximal value of 1 when |a_i| = 1/√k, i ∈ I0 and |b_j| = 1/√q, j ∈ J0, where I0 and J0 are the supports of a and b. On the contrary, its strength is close to 0 as soon as one of its nonzero entries is close to zero.
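The quantities in (11) and Definition 3 are straightforward to compute; the sketch below (our own helper functions, named `gamma_mu` and `strength` for illustration) checks that Γ_0 is the trace norm, that Γ_1 is the rescaled ℓ1 norm, and that a flat atom has strength 1.

```python
import numpy as np

def gamma_mu(Z, mu, k, q):
    """Gamma_mu(Z) = (mu / sqrt(kq)) ||Z||_1 + (1 - mu) ||Z||_*  (eq. 11)."""
    l1 = np.abs(Z).sum()
    nuclear = np.linalg.norm(Z, 'nuc')
    return mu / np.sqrt(k * q) * l1 + (1 - mu) * nuclear

def strength(a, b, k, q):
    """gamma(a, b) = min(k * a_min^2, q * b_min^2) over the supports (Definition 3)."""
    a_min2 = np.min(a[a != 0] ** 2)
    b_min2 = np.min(b[b != 0] ** 2)
    return min(k * a_min2, q * b_min2)

k, q = 4, 3
a = np.zeros(10); a[:k] = 1 / np.sqrt(k)   # flat k-sparse unit vector
b = np.zeros(8);  b[:q] = 1 / np.sqrt(q)   # flat q-sparse unit vector
A = np.outer(a, b)

assert abs(strength(a, b, k, q) - 1.0) < 1e-12    # flat atom: gamma = 1
assert abs(gamma_mu(A, 0.0, k, q) - 1.0) < 1e-10  # trace norm of an atom is 1
assert abs(gamma_mu(A, 1.0, k, q) - 1.0) < 1e-10  # ||A||_1 = sqrt(kq), rescaled to 1
```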
We can now present our main result: a bound on the statistical dimension of Ω_{k,q} on A_{k,q}.

Proposition 1 For A = ab^⊤ ∈ A_{k,q} with strength γ = γ(a, b), there exist universal constants c1, c2, independent of m1, m2, k, q, such that

    S(A, Ω_{k,q}) ≤ (c1/γ²) (k + q) + (c2/γ) (k + q) log(m1 ∨ m2) .

Our proof, presented in the appendix, follows the scheme proposed in [7] and used for the trace norm and the ℓ1 norm. However, Ω_{k,q} is not decomposable, and some work is required to obtain precise upper bounds on the various quantities involved.

Note first that S must be larger than the number of degrees of freedom of elements of A_{k,q}, which is k + q − 1, so the bound could not possibly be improved beyond logarithmic factors; besides, the logarithmic dependence on the dimension of the ambient space is expected. To appreciate the result, it should be compared with the statistical dimension of the ℓ1-norm, which scales as the product of the size of the support with the logarithm of the dimension of the ambient space, that is, as kq log(m1 m2). Using Landau notation, we report in Table 1 the upper and lower bounds known for the statistical dimension of other norms in the case where k = q and m1 = m2 = m. The rates are known exactly up to constants for the ℓ1 and the trace norm (see e.g. [1]). Of particular relevance is the comparison with norms of the form Γ_μ, which have been introduced with the aim of improving over both the ℓ1-norm and the trace norm and have been the object of a significant literature [17, 15, 9].
Using Theorem 3.2 in [15], we prove in appendix 4 a lower bound on the statistical dimension of Γ_μ of order kq ∧ (m1 + m2), which holds for all values of μ and shows that, up to logarithmic factors, Ω_{k,q} is an order of magnitude smaller in terms of k ∧ q. In the right column of Table 1 we also report results in the vector case, that is, when m2 = q = 1. In fact, in that case, Ω_{k,1} is exactly the k-support norm proposed by [2]. But the statistical dimension of that norm and of the ℓ1 norm is the same, as it is known that the rate k log(p/k) cannot be improved ([4]). So, perhaps surprisingly, there is an improvement in the matrix case but not in the vector case.

    Matrix norm   | S                 | Ω(Z⋆) E Ω*(G)
    (k, q)-trace  | O(k log m)        | (k log(m/k))^{1/2}
    ℓ1            | Θ(k² log(m/k²))   | (k² log m)^{1/2}
    trace-norm    | Θ(m)              | m^{1/2}
    ℓ1 + trace-n. | Ω(k² ∧ m)         | O(m^{1/2} ∧ (k² log m)^{1/2})

    Vector norm   | S
    k-support     | Θ(k log(p/k))
    ℓ1            | Θ(k log(p/k))
    ℓ2            | p
    elastic net   | Θ(k log(p/k))

Table 1: Scaling of the statistical dimension S and of the upper bound Ω(Z⋆) E Ω*(G) on the estimation error (slow rates) of different matrix norms for elements of A_{k,q} with strength (see Definition 3) lower bounded by a constant (or equivalently with nonzero coefficients lower bounded in magnitude by c/√(kq) for c a constant). Leftmost columns: scalings for matrices with k = q, m = m1 = m2; rightmost columns: scalings for vectors with m1 = p and m2 = q = 1. We use the notations Ω and Θ, with f = Ω(g) meaning g = O(f), and f = Θ(g) meaning that both g = O(f) and f = O(g).

5 Algorithm

In this section, we present a working set algorithm that attempts to solve problem (4). Injecting the variational form (3) of Ω_{k,q} into (4) and eliminating the variable Z from the optimization problem using Z = ∑_{(I,J)∈S} Z^{(IJ)}, one obtains that, when S = G^{m1}_k × G^{m2}_q, problem (4) is equivalent to

    min_{Z^{(IJ)}∈ℝ^{m1×m2}} L( ∑_{(I,J)∈S} Z^{(IJ)} ) + λ ∑_{(I,J)∈S} ‖Z^{(IJ)}‖_*   s.t.   supp(Z^{(IJ)}) ⊂ I × J, (I, J) ∈ S.    (PS)

At the optimum of (4), however, most of the variables Z^{(IJ)} are equal to zero, and the solution is the same as the solution obtained from (PS) in which S is reduced to the set of non-zero matrices Z^{(IJ)} obtained at optimality, often called the active components. We thus propose to solve problem (4) using a so-called working set algorithm which solves a sequence of problems of the form (PS) for a growing sequence of working sets S, so as to keep a small number of non-zero matrices Z^{(IJ)} throughout. Problem (PS) is solved easily using approximate block coordinate descent on the (Z^{(IJ)})_{(I,J)∈S} [3, Chap. 4], which consists in iterating proximal operators of the trace norm on blocks I × J. The principle of the working set algorithm is to solve problem (PS) for the current working set S and to check whether a new component should be added. It can be shown that a component with support I × J should be added if and only if ‖[∇L(Z)]_{IJ}‖_op > λ for the current value of Z. If such a component is found, the corresponding (I, J) pair is added to S and problem (PS) is solved again.
Given that for any component in S we have ‖[∇L(Z)]_{IJ}‖_op ≤ λ at the optimum of (PS), the algorithm terminates if Ω*_{k,q}(∇L(Z)) ≤ λ. The main difficulty is that Ω*_{k,q}(K) = max{a^⊤ K b | a ∈ A^{m1}_k, b ∈ A^{m2}_q}, which is NP-hard to compute, since it reduces in particular to rank-1 sparse PCA when k = q and K is p.s.d. This implies that determining when the algorithm should stop and which new component to add is hard. However, a significant amount of research has been carried out on sparse PCA recently, and we thus propose to leverage some of the recently proposed relaxations and heuristics to solve this rank-1 sparse PCA problem (see [8, 20] and references therein). In particular, the Truncated Power Iteration (TPI) algorithm of [20] can easily be modified to compute Ω*_{k,q}, which corresponds to a generalization of rank-1 sparse PCA in which, in general, a ≠ b and k ≠ q.

In our numerical experiments, we used a variant of Truncated Power Iteration with multiple restarts, keeping track of the highest variance found. It should be noted that under RIP conditions on the matrix, [20] shows that the solution returned by TPI is guaranteed to solve the rank-1 sparse PCA problem. Also, even if TPI finds a pair (I, J) which is suboptimal, adding it to S does not hurt, as the algorithm may determine subsequently that it is not necessary. However, TPI might fail to find some of the components violating the optimality conditions and terminate the algorithm early.

The proposed algorithm cannot be guaranteed to solve (4) if Ω*_{k,q} is not computed exactly, but it exploits as much as possible the structure of the convex optimization problem to find a candidate solution. A similar active set algorithm can be designed to solve problems regularized by Ω_{k,⪰}. Formulations regularized by the trace norm require computing its proximal operator, and thus an SVD.
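The truncated power iterations just described can be sketched as follows. This is our own minimal implementation, not the authors' code: we alternately apply K and K^⊤, truncate to the k (resp. q) largest-magnitude coordinates and renormalize, and we add deterministic column initializations to the random restarts.

```python
import numpy as np

def truncate(v, s):
    """Keep the s largest-magnitude entries of v, zero the rest, renormalize."""
    w = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-s:]
    w[idx] = v[idx]
    n = np.linalg.norm(w)
    return w / n if n > 0 else w

def tpi_dual_norm(K, k, q, n_restarts=20, n_iter=100, seed=0):
    """Heuristic lower bound on Omega*_{k,q}(K) = max a^T K b over sparse unit a, b,
    via truncated power iterations with column and random initializations."""
    rng = np.random.default_rng(seed)
    m1, m2 = K.shape
    inits = [truncate(K[:, j], k) for j in range(m2)]
    inits += [truncate(rng.standard_normal(m1), k) for _ in range(n_restarts)]
    best = 0.0
    for a in inits:
        for _ in range(n_iter):
            b = truncate(K.T @ a, q)
            a = truncate(K @ b, k)
        best = max(best, a @ K @ b)
    return best

# Planted example: K is 3 times a (2, 3)-sparse atom, so Omega*_{2,3}(K) = 3.
a0 = np.zeros(20); a0[[2, 7]] = [0.6, 0.8]
b0 = np.zeros(30); b0[[1, 4, 9]] = np.array([2.0, 1.0, 2.0]) / 3.0
K = 3.0 * np.outer(a0, b0)
val = tpi_dual_norm(K, 2, 3)
assert abs(val - 3.0) < 1e-8
```

Since every iterate is a feasible (sparse, unit-norm) pair, the returned value is always a lower bound on Ω*_{k,q}(K), which is why a suboptimal run can only make the working set algorithm stop early, never report a false violation.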
However, even when m1, m2 are large, solving (PS) only involves the computation of trace norms of matrices of size k × q, so the SVDs that need to be computed are fairly small. This means that the computational bottleneck of the algorithm is clearly in finding candidate supports. It has been proved [20] that, under some conditions, this problem can be solved in linear time. Multiple restarts allow us to find good candidate supports in practice.

Figure 1: Estimates of the statistical dimensions of the ℓ1, trace and Ω_{k,q} norms at a matrix Z ∈ ℝ^{1000×1000} in different settings. From left to right: (a) Z is an atom in Ã_{k,k} for different values of k. (b) Z is a sum of r atoms in Ã_{k,k} with non-overlapping supports, with k = 10 and varying r. (c) Z is a sum of 3 atoms in Ã_{k,k} with non-overlapping supports, for varying k. (d) Z is a sum of 3 atoms in Ã_{k,k} with overlapping supports, for varying k.

6 Numerical experiments

In this section we report experimental results to assess the performance of sparse low-rank matrix estimation using different techniques. We start in Section 6.1 with simulations that confirm and illustrate the theoretical results on the statistical dimension of Ω_{k,q} and assess how they generalize to matrices with (k, q)-rank larger than 1. In Section 6.2 we compare several techniques for sparse PCA on simulated data.

6.1 Empirical estimates of the statistical dimension

In order to numerically estimate the statistical dimension S(Z, Ω) of a regularizer Ω at a matrix Z, we add to Z a random Gaussian noise matrix and observe Y = Z + σG, where G has i.i.d. entries following N(0, 1). We then denoise Y to form an estimate Ẑ of Z.
For small σ, the normalized mean-squared error (NMSE), defined as NMSE(σ) := E‖Ẑ − Z‖_F²/σ², is a good estimate of the statistical dimension, since [14] show that S(Z, Ω) = lim_{σ→0} NMSE(σ). Numerically, we therefore estimate S(Z, Ω) by the empirical NMSE(σ) for σ = 10⁻⁴, averaged over 20 replicates. We consider square matrices with m1 = m2 = 1000, and estimate the statistical dimension of the Ω_{k,q}, ℓ1 and trace norms at different matrices Z. The constrained denoiser has a simple closed form for the ℓ1 and the trace norm. For Ω_{k,q}, it can be obtained by a sequence of proximal projections with different parameters λ until Ω_{k,q}(Ẑ) has the correct value Ω_{k,q}(Z). Since the noise is small, we found that it was sufficient and faster to perform a (k, q)-SVD of Y by computing a proximal operator of Ω_{k,q} with a small λ, and then apply the ℓ1 constrained denoiser to the set of (k, q)-sparse singular values.

We first estimate the statistical dimensions of the three norms at an atom Z in Ã_{k,q} for different values of k = q, where Ã_{k,q} = {ab^⊤ ∈ A_{k,q} | ‖ab^⊤‖_∞ = 1/√(kq)} is the set of elements of A_{k,q} with nonzero entries of constant magnitude. Figure 1.a shows the results, which confirm the theoretical bounds summarized in Table 1. The statistical dimension of the trace norm does not depend on k, while that of the ℓ1 norm increases almost quadratically with k and that of Ω_{k,q} increases linearly with k. The linear versus quadratic dependence of the statistical dimension on k is reflected by the slopes of the curves in the log-log plot of Figure 1.a. As expected, Ω_{k,q} interpolates between the ℓ1 norm (for k = 1) and the trace norm (for k = m1), and outperforms both norms for intermediate values of k.
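For the ℓ1 norm, whose penalized denoiser is entrywise soft-thresholding, the NMSE recipe above can be reproduced in a few lines. The sketch below is our own illustration at a smaller scale (m = 60): it uses the penalized estimator minimized over a grid of λ rather than the constrained one, which approximates the same small-σ limit up to the grid resolution.

```python
import numpy as np

def soft_threshold(Y, lam):
    return np.sign(Y) * np.maximum(np.abs(Y) - lam, 0.0)

rng = np.random.default_rng(0)
m, s, sigma, reps = 60, 25, 1e-3, 20

Z = np.zeros((m, m))
support = rng.choice(m * m, size=s, replace=False)
Z.flat[support] = 1.0                      # s-sparse matrix with unit entries

lams = sigma * np.linspace(0.5, 5.0, 40)   # grid of thresholds
errs = np.zeros(len(lams))
for _ in range(reps):
    Y = Z + sigma * rng.standard_normal((m, m))
    for i, lam in enumerate(lams):
        errs[i] += np.linalg.norm(soft_threshold(Y, lam) - Z) ** 2
nmse = errs.min() / (reps * sigma ** 2)    # empirical NMSE, minimized over lambda

# The estimate should sit far below the ambient dimension m^2 = 3600
# and above the size s of the support.
assert s < nmse < 0.25 * m * m
```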
These experiments therefore confirm that our upper bound on S(Z, Ω_{k,q}) from Proposition 1 captures the correct order in k, although the constants can certainly be much improved, and that our algorithm manages, in this simple setting, to correctly approximate the solution of the convex minimization problem.

Second, we estimate the statistical dimension of Ω_{k,q} on matrices with (k, q)-rank larger than 1, a setting for which we proved no theoretical result. Figure 1.b shows the numerical estimate of S(Z, Ω_{k,q}) for matrices Z which are sums of r atoms in Ã_{k,k} with non-overlapping supports, for k = 10 and varying r. We observe that the increase in statistical dimension is roughly linear in the (k, q)-rank. For a fixed (k, q)-rank of 3, Figures 1.c and 1.d compare the estimated statistical dimensions of the three regularizers on matrices Z which are sums of 3 atoms in Ã_{k,k} with respectively non-overlapping or overlapping supports. The shapes of the different curves are overall similar to the rank-1 case, although the performance of Ω_{k,q} degrades when the supports of the atoms overlap. In both cases, Ω_{k,q} consistently outperforms the two other norms.

    Sample covariance | Trace       | ℓ1          | Trace + ℓ1  | Sequential  | Ω_{k,⪰}
    4.20 ± 0.02       | 0.98 ± 0.01 | 2.07 ± 0.01 | 0.96 ± 0.01 | 0.93 ± 0.08 | 0.59 ± 0.03

Table 2: Relative error of covariance estimation with different methods.
Overall these experiments suggest that the statistical dimension of $\Omega_{k,q}$ at a linear combination of $r$ atoms increases as $C\,r\,(k \log m_1 + q \log m_2)$, where the coefficient $C$ increases with the overlap among the supports of the atoms.

6.2 Comparison of algorithms for sparse PCA

In this section we compare the performance of different algorithms in estimating a sparsely factored covariance matrix that we denote $\Sigma^\star$. The observed sample consists of $n$ i.i.d. random vectors generated according to $\mathcal{N}(0, \Sigma^\star + \sigma^2 \mathrm{Id}_p)$, where $(k,k)\text{-rank}(\Sigma^\star) = 3$. The matrix $\Sigma^\star$ is formed by adding 3 blocks of rank 1, $\Sigma^\star = a_1 a_1^\top + a_2 a_2^\top + a_3 a_3^\top$, all having the same sparsity $\|a_i\|_0 = k = 10$, $3 \times 3$ overlaps and nonzero entries equal to $1/\sqrt{k}$. The noise level $\sigma = 0.8$ is set in order to keep the signal-to-noise ratio below the level $\sigma = 1$ at which a spectral gap appears and makes the spectral baseline (penalizing the trace of the PSD matrix) work. In our experiments the number of variables is $p = 200$ and $n = 80$ points are observed. To estimate the true covariance matrix from the noisy observations, the sample covariance matrix $\hat{\Sigma}_n = \frac{1}{n}\sum_{i=1}^n x_i x_i^\top$ is first formed and given as input to the various algorithms, each of which produces a new estimate $\hat{\Sigma}$. The methods we compared are the following:
• Sample covariance. Output $\hat{\Sigma}_n$ as the estimate of the covariance.
• $\ell_1$ penalty. Soft-threshold $\hat{\Sigma}_n$ elementwise.
• Trace penalty on the PSD cone. $\min_{Z \succeq 0} \frac{1}{2}\|Z - \hat{\Sigma}_n\|_F^2 + \lambda \operatorname{Tr} Z$.
• Trace + $\ell_1$ penalty. $\min_{Z \succeq 0} \frac{1}{2}\|Z - \hat{\Sigma}_n\|_F^2 + \lambda\, \Gamma_\mu(Z)$.
• $\Omega_{k,\succeq}$ penalty. $\min_{Z \in \mathbb{R}^{p \times p}} \frac{1}{2}\|Z - \hat{\Sigma}_n\|_F^2 + \lambda\, \Omega_{k,\succeq}(Z)$, with $\Omega_{k,\succeq}$ the gauge associated with $\mathcal{A}_{k,\succeq}$ introduced in Section 2.3.
• Sequential sparse PCA. This is the standard way of estimating multiple sparse principal components, which consists of solving the problem for a single component at each step $t = 1, \dots, r$ and deflating to switch to the next, $(t+1)$st, component. The deflation step used in this algorithm is the orthogonal projection $Z_{t+1} = (\mathrm{Id}_p - u_t u_t^\top)\, Z_t\, (\mathrm{Id}_p - u_t u_t^\top)$. The tuning parameters for this approach are the sparsity level $k$ and the number of principal components $r$.
The hyperparameters were chosen by holding out one portion of the training data (validation) and selecting the parameters whose estimator best approximates the empirical covariance of the validation set. We assumed the true value of $k$ known in advance for all algorithms.
We report the relative errors $\|\hat{\Sigma} - \Sigma^\star\|_F / \|\Sigma^\star\|_F$ over 10 runs of our experiments in Table 2. The results indicate that sparse PCA methods, whether based on $\Omega_{k,\succeq}$ or on the sequential method with deflation steps, outperform the spectral and $\ell_1$ baselines, and that penalizing $\Omega_{k,\succeq}$ is superior to the sequential approach. This was to be expected, since our algorithm minimizes a loss function close to the error measure used, whereas the sequential scheme does not optimize a well-defined objective.

7 Conclusion
We formulated the problem of matrix factorization with sparse factors of known sparsity as the minimization of an index, the $(k,q)$-rank, whose tight convex relaxation is the $(k,q)$-trace norm regularizer. This penalty is proved to have near-optimal statistical performance.
Despite the theoretical computational hardness of the problem in the worst case, exploiting its convex geometry allowed us to build an efficient algorithm to minimize it. Future work will consist of relaxing the constraint on the block sizes, and of exploring applications such as finding small communities against a large random-graph background.

Acknowledgments. This project was partially funded by Agence Nationale de la Recherche grant ANR-13-MONU-005-10 (CHORUS project) and by ERC grant SMAC-ERC-280032.

References
[1] D. Amelunxen, M. Lotz, M. McCoy, and J. Tropp. Living on the edge: a geometric theory of phase transition in convex optimization. Information and Inference, 3(3):224–294, 2014.
[2] A. Argyriou, R. Foygel, and N. Srebro. Sparse prediction with the k-support norm. In Advances in Neural Information Processing Systems (NIPS), pages 1466–1474, 2012.
[3] F. Bach. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2011.
[4] E. J. Candès and M. A. Davenport. How well can we estimate a sparse vector? Applied and Computational Harmonic Analysis, 34(2):317–323, 2013.
[5] E. J. Candès, T. Strohmer, and V. Voroninski. PhaseLift: exact and stable signal recovery from magnitude measurements via convex programming. Comm. Pure Appl. Math., 2011.
[6] V. Chandrasekaran and M. I. Jordan. Computational and statistical tradeoffs via convex relaxation. Proc. Natl. Acad. Sci. USA, 2013.
[7] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12, 2012.
[8] A. d'Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 2007.
[9] X. V. Doan and S. A. Vavasis. Finding approximately rank-one submatrices with the nuclear norm and $\ell_1$ norm.
SIAM Journal on Optimization, 23:2502–2540, 2013.
[10] V. Koltchinskii, K. Lounici, and A. Tsybakov. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Stat., 39(5):2302–2329, 2011.
[11] L. Mackey. Deflation methods for sparse PCA. Advances in Neural Information Processing Systems (NIPS), 21:1017–1024, 2008.
[12] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res., 11:19–60, 2010.
[13] S. N. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators. Statistical Science, 27(4):538–557, 2012.
[14] S. Oymak and B. Hassibi. Sharp MSE bounds for proximal denoising. Technical Report 1305.2714, arXiv, 2013.
[15] S. Oymak, A. Jalali, M. Fazel, Y. Eldar, and B. Hassibi. Simultaneously structured models with application to sparse and low-rank matrices. arXiv preprint 1212.3753, 2012.
[16] E. Richard, S. Gaiffas, and N. Vayatis. Link prediction in graphs with autoregressive features. In Neural Information Processing Systems, volume 15, pages 565–593. MIT Press, 2012.
[17] E. Richard, P.-A. Savalle, and N. Vayatis. Estimation of simultaneously sparse and low-rank matrices. In International Conference on Machine Learning (ICML), 2012.
[18] Y.-X. Wang, H. Xu, and C. Leng. Provable subspace clustering: when LRR meets SSC. In Advances in Neural Information Processing Systems (NIPS), pages 64–72, 2013.
[19] D. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse PCA and CCA. Biostatistics, 10(3):515–534, 2009.
[20] X.-T. Yuan and T. Zhang. Truncated power method for sparse eigenvalue problems. J. Mach. Learn. Res., 14(1):899–925, 2013.
[21] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, 2006.