{"title": "Sum-of-Squares Lower Bounds for Sparse PCA", "book": "Advances in Neural Information Processing Systems", "page_first": 1612, "page_last": 1620, "abstract": "This paper establishes a statistical versus computational trade-offfor solving a basic high-dimensional machine learning problem via a basic convex relaxation method. Specifically, we consider the {\\em Sparse Principal Component Analysis} (Sparse PCA) problem, and the family of {\\em Sum-of-Squares} (SoS, aka Lasserre/Parillo) convex relaxations. It was well known that in large dimension $p$, a planted $k$-sparse unit vector can be {\\em in principle} detected using only $n \\approx k\\log p$ (Gaussian or Bernoulli) samples, but all {\\em efficient} (polynomial time) algorithms known require $n \\approx k^2 $ samples. It was also known that this quadratic gap cannot be improved by the the most basic {\\em semi-definite} (SDP, aka spectral) relaxation, equivalent to a degree-2 SoS algorithms. Here we prove that also degree-4 SoS algorithms cannot improve this quadratic gap. This average-case lower bound adds to the small collection of hardness results in machine learning for this powerful family of convex relaxation algorithms. Moreover, our design of moments (or ``pseudo-expectations'') for this lower bound is quite different than previous lower bounds. Establishing lower bounds for higher degree SoS algorithms for remains a challenging problem.", "full_text": "Sum-of-Squares Lower Bounds for Sparse PCA\n\nTengyu Ma\u22171 and Avi Wigderson\u20202\n\n1Department of Computer Science, Princeton University\n2School of Mathematics, Institute for Advanced Study\n\nAbstract\n\nThis paper establishes a statistical versus computational trade-off for solving\na basic high-dimensional machine learning problem via a basic convex re-\nlaxation method. 
Specifically, we consider the Sparse Principal Component Analysis (Sparse PCA) problem, and the family of Sum-of-Squares (SoS, aka Lasserre/Parrilo) convex relaxations. It was well known that in large dimension p, a planted k-sparse unit vector can be in principle detected using only n ≈ k log p (Gaussian or Bernoulli) samples, but all efficient (polynomial time) algorithms known require n ≈ k^2 samples. It was also known that this quadratic gap cannot be improved by the most basic semi-definite (SDP, aka spectral) relaxation, equivalent to degree-2 SoS algorithms. Here we prove that degree-4 SoS algorithms also cannot improve this quadratic gap. This average-case lower bound adds to the small collection of hardness results in machine learning for this powerful family of convex relaxation algorithms. Moreover, our design of moments (or “pseudo-expectations”) for this lower bound is quite different from previous lower bounds. Establishing lower bounds for higher degree SoS algorithms remains a challenging problem.\n\n1 Introduction\n\nWe start with a general discussion of the tension between sample size and computational efficiency in statistical and learning problems. We then describe the concrete model and problem at hand: Sum-of-Squares algorithms and the Sparse-PCA problem. All are broad topics studied from different viewpoints, and the given references provide more information.\n\n1.1 Statistical vs. computational sample-size\n\nModern machine learning and statistical inference problems are often high dimensional, and it is highly desirable to solve them using far fewer samples than the ambient dimension. Luckily, we often know, or assume, some underlying structure of the objects sought, which allows such savings in principle. 
A typical such assumption is that the number of real degrees of freedom is far smaller than the dimension; examples include sparsity constraints for vectors, and low rank for matrices and tensors. The main difficulty that occurs in nearly all these problems is that while information theoretically the sought answer is present (with high probability) in a small number of samples, actually computing (or even approximating) it from these many samples is a computationally hard problem. It is often expressed as a non-convex optimization program which is NP-hard in the worst case, and seemingly hard even on random instances.\nGiven this state of affairs, relaxed formulations of such non-convex programs were proposed, which can be solved efficiently, but sometimes to achieve accurate results seem to require far more samples than existential bounds provide.\n\n∗Supported in part by Simons Award for Graduate Students in Theoretical Computer Science\n†Supported in part by NSF grant CCF-1412958\n\nThis phenomenon has been coined the “statistical versus computational trade-off” by Chandrasekaran and Jordan [1], who motivate and formalize one framework to study it in which efficient algorithms come from the Sum-of-Squares family of convex relaxations (which we shall presently discuss). They further give a detailed study of this trade-off for the basic de-noising problem [2, 3, 4] in various settings (some exhibiting the trade-off and others that do not). This trade-off was observed in other practical machine learning problems, in particular for the Sparse PCA problem that will be our focus, by Berthet and Rigollet [5].\nAs it turns out, the study of the same phenomenon was proposed even earlier in computational complexity, primarily from theoretical motivations. Decatur, Goldreich and Ron [6] initiated the study of “computational sample complexity” to study statistical versus computational trade-offs in sample size. 
In their framework efficient algorithms are arbitrary polynomial time ones, not restricted to any particular structure like convex relaxations. They point out for example that in the distribution-free PAC-learning framework of Vapnik-Chervonenkis and Valiant, there is often no such trade-off. The reason is that the number of samples is essentially determined (up to logarithmic factors, which we will mostly ignore here) by the VC-dimension of the given concept class learned, and moreover, an “Occam algorithm” (computing any consistent hypothesis) suffices for classification from these many samples. So, in the many cases where efficiently finding a hypothesis consistent with the data is possible, enough samples to learn are enough to do so efficiently! This paper also provides examples where this is not the case in PAC learning, and then turns to an extensive study of possible trade-offs for learning various concept classes under the uniform distribution. This direction was further developed by Servedio [7].\nThe fast growth of Big Data research, the variety of problems successfully attacked by various heuristics, and the attempts to find efficient algorithms with provable guarantees have made this a growing area of interaction between statisticians and machine learning researchers on the one hand, and optimization and computer scientists on the other. The trade-offs between sample size and computational complexity, which seem to be present for many such problems, reflect a curious “conflict” between these fields: in the first, more data is good news, as it allows more accurate inference and prediction, whereas in the second it is bad news, as a larger input size is a source of increased complexity and inefficiency. 
More importantly, understanding this phenomenon can serve as a guide to the design of better algorithms from both statistical and computational viewpoints, especially for problems in which data acquisition itself is costly, and not just computation. A basic question is thus for which problems such a trade-off is inherent, and what the limits are of what is achievable by efficient methods.\nEstablishing a trade-off has two parts. One has to prove an existential, information theoretic upper bound on the number of samples needed when efficiency is not an issue, and then prove a computational lower bound on the number of samples for the class of efficient algorithms at hand. Needless to say, it is desirable that the lower bounds hold for as wide a class of algorithms as possible, and that they match the best known upper bound achieved by algorithms from this class. The most general framework, the computational complexity framework of [6, 7], allows all polynomial-time algorithms. Here one cannot hope for unconditional lower bounds, and so existing lower bounds rely on computational assumptions, e.g. “cryptographic assumptions” such as that factoring integers has no polynomial time algorithm, or other average case assumptions. For example, hardness of refuting random 3CNF was used for establishing the sample-computational tradeoff for learning halfspaces [8], and hardness of finding a planted clique in random graphs was used for the tradeoff in sparse PCA [5, 9]. On the other hand, in frameworks such as [1], where the class of efficient algorithms is more restricted (e.g. a family of convex relaxations), one can hope to prove unconditional lower bounds, which are called “integrality gaps” in the optimization and algorithms literature. 
Our main result is of this nature, adding to the small number of such lower bounds for machine learning problems.\nWe now describe and motivate SoS convex relaxation algorithms, and the Sparse PCA problem.\n\n1.2 Sum-of-Squares convex relaxations\n\nSum-of-Squares algorithms (sometimes called the Lasserre hierarchy) encompass perhaps the strongest known algorithmic technique for a diverse set of optimization problems. It is a family of convex relaxations introduced independently around the year 2000 by Lasserre [10], Parrilo [11], and, in the (equivalent) context of proof systems, by Grigoriev [12]. These papers followed better and better understanding in real algebraic geometry [13, 14, 15, 16, 17, 18, 19] of David Hilbert’s famous 17th problem on certifying the non-negativity of a polynomial by writing it as a sum of squares (which explains the name of this method). We only briefly describe this important class of algorithms; far more can be found in the book [20] and the excellent extensive survey [21].\nThe SoS method provides a principled way of adding constraints to a linear or convex program in a way that obtains tighter and tighter convex sets containing all solutions of the original problem. This family of algorithms is parametrized by the degree d (sometimes called the number of rounds); as d gets larger, the approximation becomes better, but the running time becomes slower, specifically n^{O(d)}. Thus in practice one hopes that small degree (ideally constant) would provide a sufficiently good approximation, so that the algorithm would run in polynomial time. This method extends the standard semi-definite relaxation (SDP, sometimes called spectral), which is captured already by degree-2 SoS algorithms. 
Moreover, it is more powerful than two earlier families of relaxations: the\nSherali-Adams [22] and Lov\u00b4asz-Scrijver [23] hierarchies.\nThe introduction of these algorithms has made a huge splash in the optimization community, and\nnumerous applications of it to problems in diverse \ufb01elds were found that greatly improve solution\nquality and time performance over all past methods. For large classes of problems they are consid-\nered the strongest algorithmic technique known. Relevant to us is the very recent growing set of\napplications of constant-degree SoS algorithms to machine learning problems, such as [24, 25, 26].\nThe survey [27] contains some of these exciting developments. Section 2.1 contains some self-\ncontained material about the general framework SoS algorithms as well.\nGiven their power, it was natural to consider proving lower bounds on what SoS algorithms can do.\nThere has been an impressive progress on SoS degree lower bounds (via beautiful techniques) for\na variety of combinatorial optimization problems [28, 12, 29, 30]. However, for machine learning\nproblems relatively few such lower bounds (above SDP level) are known [26, 31] and follow via\nreductions to the above bounds. So it is interesting to enrich the set of techniques for proving such\nlimits on the power of SoS for ML. The lower bound we prove indeed seem to follow a different\nroute than previous such proofs.\n1.3 Sparse PCA\n\nSparse principal component analysis, the version of the classical PCA problem which assumes that\nthe direction of variance of the data has a sparse structure, is by now a central problem of high-\ndiminsional statistical analysis. In this paper we focus on the single-spiked covariance model intro-\nduced by Johnstone [32]. 
One observes n samples from a p-dimensional Gaussian distribution with covariance Σ = λvv^T + I, where (the planted vector) v is assumed to be a unit-norm sparse vector with at most k non-zero entries, and λ > 0 represents the strength of the signal. The task is to find (or estimate) the sparse vector v. More general versions of the problem allow several sparse directions/components and a general covariance matrix [33, 34]. Sparse PCA and its variants have a wide variety of applications ranging from signal processing to biology: see, e.g., [35, 36, 37, 38].\nThe hardness of Sparse PCA, at least in the worst case, can be seen through its connection to the (NP-hard) Clique problem in graphs. Note that if Σ is a {0, 1} adjacency matrix of a graph (with 1’s on the diagonal), then it has a k-sparse eigenvector v with eigenvalue k if and only if the graph has a k-clique. This connection between these two problems is actually deeper, and will appear again below, for our real, average case version above.\nFrom a theoretical point of view, Sparse PCA is one of the simplest examples where we observe a gap between the number of samples needed information theoretically and the number of samples needed for a polynomial time estimator: it has been well understood [39, 40, 41] that information theoretically, given n = O(k log p) samples^1, one can estimate v up to constant error (in Euclidean norm), using a non-convex (therefore not polynomial time) optimization algorithm. On the other hand, all the existing provable polynomial time algorithms [36, 42, 34, 43], which use either diagonal thresholding (for the single spiked model) or semidefinite programming (for general covariance), first introduced for this problem in [44], need at least quadratically many samples to solve the problem, namely n = O(k^2). 
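As an illustration of the single-spiked model and of why diagonal thresholding succeeds once n is well above k^2, the following sketch samples data with covariance λvv^T + I and recovers the support of v from the largest diagonal entries of the empirical covariance. All parameter values and variable names here are hypothetical choices for illustration, not taken from the paper.

```python
import numpy as np

# Illustrative sketch (assumed parameters, not the paper's experiments):
# sample n points from the single-spiked model Sigma = lam * v v^T + I with a
# planted k-sparse unit vector v, then run diagonal thresholding.
rng = np.random.default_rng(0)
p, k, lam, n = 500, 10, 4.0, 2000          # n >> k^2, so thresholding should work

# planted k-sparse unit vector with entries in {0, +-1/sqrt(k)}
v = np.zeros(p)
support = rng.choice(p, size=k, replace=False)
v[support] = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)

# X^j = sqrt(lam) * g_j * v + xi^j with standard Gaussian g_j and noise xi^j
g = rng.standard_normal(n)
xi = rng.standard_normal((n, p))
X = np.sqrt(lam) * g[:, None] * v[None, :] + xi   # n x p sample matrix

Sigma_hat = X.T @ X / n   # empirical covariance, close to lam * v v^T + I

# On the support E[Sigma_hat_ii] = 1 + lam/k; off the support it is 1.
top_k = np.argsort(np.diag(Sigma_hat))[-k:]
print(sorted(map(int, top_k)) == sorted(map(int, support)))
```

With these (hypothetical) sizes the diagonal gap of lam/k = 0.4 dwarfs the sampling noise, so the largest k diagonal entries recover the planted support.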
Moreover, Krauthgamer, Nadler and Vilenchik [45] and Berthet and Rigollet [41] have shown that for semi-definite programs (SDP) this bound is tight. Specifically, the natural SDP cannot even solve the detection problem: to distinguish data with covariance Σ = λvv^T + I from the null hypothesis in which no sparse vector is planted, namely the n samples are drawn from the Gaussian distribution with covariance matrix I.\n\n^1 We treat λ as a constant so that we omit the dependence on it for simplicity throughout the introduction section.\n\nRecall that the natural SDP for this problem (and many others) is just the first level of the SoS hierarchy, namely degree-2. Given the importance of Sparse PCA, it is an intriguing question whether one can solve it efficiently with far fewer samples by allowing degree-d SoS algorithms with larger d. A very interesting conditional negative answer was suggested by Berthet and Rigollet [41]. They gave an efficient reduction from the Planted Clique^2 problem to Sparse PCA, which shows in particular that degree-d SoS algorithms for Sparse PCA would imply similar ones for Planted Clique. Gao, Ma and Zhou [9] strengthened the result by establishing the hardness of the Gaussian single-spiked covariance model, which is an interesting subset of the models considered by [5]. These reductions are useful because nontrivial constant-degree SoS lower bounds for Planted Clique were recently proved in [30, 46] (see there for the precise description, history and motivation of Planted Clique). As [41, 9] argue, strong lower bounds for Planted Clique that are believed to hold, if true, would imply that the quadratic gap is tight for any constant d. Before the submission of this paper, the known lower bounds above for planted clique were not strong enough yet to yield any lower bound for Sparse PCA beyond the minimax sample complexity. 
We also note that the recent progress [47, 48] showing tight lower bounds for planted clique, together with the reductions of [5, 9], also implies the tight lower bounds for Sparse PCA shown in this paper.\n\n1.4 Our contribution\n\nWe give a direct, unconditional lower bound proof for computing Sparse PCA using degree-4 SoS algorithms, showing that they too require n = Ω̃(k^2) samples to solve the detection problem (Theorem 3.1), which is tight up to polylogarithmic factors when the strength of the signal λ is a constant. Indeed the theorem gives a lower bound for every strength λ, which becomes weaker as λ gets larger.\nOur proof proceeds by constructing the necessary pseudo-moments for the SoS program that achieve too high an objective value (in the jargon of optimization, we prove an “integrality gap” for these programs). As usual in such proofs, there is tension between having the pseudo-moments satisfy the constraints of the program and keeping them positive semidefinite (PSD). Differing from past lower bound proofs, we construct two different PSD moments, each approximately satisfying one set of constraints in the program while being negligible on the rest. Thus, their sum gives PSD moments which approximately satisfy all constraints. We then perturb these moments to satisfy the constraints exactly, and show that with high probability over the random data, this perturbation leaves the moments PSD.\nWe note several features of our lower bound proof which make the result particularly strong and general. First, it applies not only to the Gaussian distribution, but also to Bernoulli and other distributions. 
Indeed, we give a set of natural (pseudorandomness) conditions on the sampled data vectors under which the SoS algorithm is “fooled”, and show that these conditions are satisfied with high probability under many similar distributions (possessing strong concentration of measure). Next, our lower bound holds even if the hidden sparse vector is discrete, namely its entries come from the set {0, ±1/√k}. We also extend the lower bound for the detection problem to apply also to the estimation problem, in the regime when the ambient dimension is linear in the number of samples, namely n ≤ p ≤ Bn for constant B.\nOrganization: Section 2 provides more background on sparse PCA and SoS algorithms. We state our main results in Section 3. The complete paper is available as supplementary material and on arXiv.\n\n2 Formal description of the model and problem\n\nNotation: We will assume that n, k, p are all sufficiently large^3, and that n ≤ p. Throughout this paper, by “with high probability some event happens”, we mean that the failure probability is bounded by p^{−c} for every constant c, as p tends to infinity.\n\nSparse PCA estimation and detection problems: We will consider the simplest setting of sparse PCA, which is called the single-spiked covariance model in the literature [32] (note that restricting to a special case makes our lower bound hold in all generalizations of this simple model). In this model, the task is to recover a single sparse vector from noisy samples as follows. The “hidden data” is an unknown k-sparse vector v ∈ R^p with ‖v‖_0 = k and ‖v‖ = 1.\n\n^2 An average case version of the Clique problem in which the input is a random graph in which a much larger than expected clique is planted.\n^3 Or we assume that they go to infinity, as typically done in statistics. 
To make the task easier (and so the lower bound stronger), we even assume that v has discrete entries, namely that v_i ∈ {0, ±1/√k} for all i ∈ [p]. We observe n noisy samples X^1, . . . , X^n ∈ R^p that are generated as follows. Each is independently drawn as X^j = √λ g_j v + ξ^j from a distribution which generalizes both Gaussian and Bernoulli noise to v. Namely, the g_j’s are i.i.d. real random variables with mean 0 and variance 1, and the ξ^j’s are i.i.d. random vectors which have independent entries with mean zero and variance 1. Therefore under this model, the covariance of X^j is equal to λvv^T + I. Moreover, we assume that g_j and the entries of ξ^j are sub-gaussian^4 with variance proxy O(1). Given these samples, the estimation problem is to approximate the unknown sparse vector v (up to sign flip).\nIt is also interesting to consider the sparse component detection problem [41, 5], which is the decision problem of distinguishing from random samples the following two distributions:\n\nH0: the data X^j = ξ^j is purely random\nHv: the data X^j = ξ^j + √λ g_j v contains a hidden sparse signal with strength λ.\n\nRigollet [49] observed that a polynomial time algorithm for the estimation version of sparse PCA with constant error implies an algorithm for the detection problem with twice the number of samples. Thus, for polynomial time lower bounds, it suffices to consider the detection problem. We will use X as a shorthand for the p × n matrix [X^1, . . . , X^n]. We denote the rows of X as X_1^T, . . . , X_p^T; therefore the X_i’s are n-dimensional column vectors. The empirical covariance matrix is defined as Σ̂ = 1/
The empirical covariance matrix is de\ufb01ned as\n\u02c6\u03a3 = 1\n\nn XX T .\n\n1\nk\n\n\u221a\n\n(2.1)\n\nmax( \u02c6\u03a3) =\n\u03bbk\n\nStatistically optimal estimator/detector\nIt is well known that the following non-convex program\nachieves optimal statistical minimax rate for the estimation problem and the optimal sample com-\nplexity for the detection problem. Note that we scale the variables x up by a factor of\nk for\nsimplicity (the hidden vector now has entries from {0,\u00b11}).\n(cid:104) \u02c6\u03a3, xxT(cid:105)\n(cid:107)x(cid:107)2\n\n(2.2)\nProposition 2.1 ([42], [41], [39] informally stated). The non-convex program (2.1) statistically\noptimally solves the sparse PCA problem when n \u2265 Ck/\u03bb2 log p for some suf\ufb01ciently large C.\nNamely, the following hold with high probability. If X is generated from Hv, then optimal solution\nxopt of program (2.1) satis\ufb01es (cid:107) 1\nmax( \u02c6\u03a3) is at least\n3, and the objective value \u03bbk\nmax( \u02c6\u03a3) is at most\n3 . On the other hand, if X is generated from null hypothesis H0, then \u03bbk\n1 + 2\u03bb\n3 .\n1 + \u03bb\n\n\u00b7 max\nsubject to\n\n2 = k,(cid:107)x(cid:107)0 = k\n\nopt \u2212 vvT(cid:107) \u2264 1\n\nk \u00b7 xoptxT\n\nTherefore, for the detection problem, once can simply use the test \u03bbk\n\nthe case of H0 and Hv, with n = (cid:101)\u2126(k/\u03bb2) samples. However, this test is highly inef\ufb01cient, as the\n\nmax( \u02c6\u03a3) > 1 + \u03bb\n\n2 to distinguish\n\nbest known ways for computing \u03bbk\nways of solving this problem.\n\nmax( \u02c6\u03a3) take exponential time! We now turn to consider ef\ufb01cient\n\n2.1 Sum of Squares (Lasserre) Relaxations\n\nHere we will only brie\ufb02y introduce the basic ideas of Sum-of-Squares (Lasserre) relaxation that will\nbe used for this paper. 
We refer readers to the extensive [20, 21, 27] for detailed discussions of sum of squares algorithms and proofs and their applications to algorithm design.\nLet R[x]_d denote the set of all real polynomials of degree at most d in n variables x_1, . . . , x_n. We start by defining the notion of pseudo-moment (sometimes called pseudo-expectation). The intuition is that these pseudo-moments behave like the actual first d moments of a real probability distribution.\n\n^4 A real random variable X is sub-gaussian with variance proxy σ^2 if it has similar tail behavior as a gaussian distribution with variance σ^2. More formally, if for any t ∈ R, E[exp(tX)] ≤ exp(t^2 σ^2 / 2).\n\nDefinition 2.2 (pseudo-moment). A degree-d pseudo-moment M is a linear operator that maps R[x]_d to R and satisfies M(1) = 1 and M(p^2(x)) ≥ 0 for all real polynomials p(x) of degree at most d/2.\n\nFor a multi-set S ⊂ [n], we use x^S to denote the monomial ∏_{i∈S} x_i. Since M is a linear operator, it can clearly be described by all the values of M on the monomials of degree at most d; that is, the values M(x^S) for multi-sets S of size at most d uniquely determine M. Moreover, the nonnegativity constraint M(p(x)^2) ≥ 0 is equivalent to the positive semidefiniteness of the matrix-form (as defined below), and therefore the set of all pseudo-moments is convex.\n\nDefinition 2.3 (matrix-form). For an even integer d and any degree-d pseudo-moment M, we define the matrix-form of M as the trivial way of viewing all the values of M on monomials as a matrix: we use mat(M) to denote the matrix that is indexed by multi-subsets S of [n] of size at most d/2, and mat(M)_{S,T} = M(x^S x^T).\n\nGiven polynomials p(x) and q_1(x), . . .
, q_m(x) of degree at most d, and a polynomial program\n\nMaximize p(x)\nSubject to q_i(x) = 0, ∀i ∈ [m]   (2.3)\n\nwe can write a sum-of-squares based relaxation in the following way: instead of searching over x ∈ R^n, we search over all possible “pseudo-moments” M of a hypothetical distribution over solutions x that satisfy the constraints above. The key to the relaxation is to consider only moments up to degree d. Concretely, we have the following semidefinite program in roughly n^d variables:\n\nVariables: M(x^S), ∀S : |S| ≤ d\nMaximize: M(p(x))\nSubject to: M(q_i(x) x^K) = 0, ∀i, K : |K| + deg(q_i) ≤ d\nmat(M) ⪰ 0   (2.4)\n\nNote that (2.4) is a valid relaxation because for any solution x* of (2.3), if we define M(x^S) to be M(x^S) = (x*)^S, then M satisfies all the constraints and the objective value is p(x*). Therefore it is guaranteed that the optimal value of (2.4) is always at least that of (2.3).\nFinally, the key point is that this program can be solved efficiently, in time polynomial in its size, namely in time n^{O(d)}. As d grows, the added constraints make the “pseudo-distribution” defined by the moments closer and closer to an actual distribution, thus providing a tighter relaxation, at the cost of a larger running time to solve it. 
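To see concretely why (2.4) relaxes (2.3), one can check numerically that the true moments of any actual distribution over feasible solutions satisfy all of its constraints. The toy sketch below (the tiny instance and all names are invented for illustration) builds the degree-4 moment operator of a uniform distribution over two points obeying constraints of the form q_i(x) = x_i^3 − x_i = 0, and verifies M(1) = 1, the constraint equations, and PSD-ness of mat(M).

```python
import itertools
import numpy as np

# Toy sketch (hypothetical instance): the true moments M(x^S) = E[x^S] of a
# distribution over feasible points satisfy the degree-4 relaxation (2.4)
# for the constraints q_i(x) = x_i^3 - x_i = 0.
points = np.array([[1.0, -1.0, 0.0],
                   [0.0,  1.0, 1.0]])   # uniform over two points in {0,+-1}^3
n_vars, d = points.shape[1], 4

def M(multiset):
    """Moment of the monomial prod_{i in multiset} x_i under the distribution."""
    cols = points[:, list(multiset)] if multiset else np.ones((len(points), 0))
    return float(np.prod(cols, axis=1).mean())

# mat(M), indexed by multisets S of [n_vars] with |S| <= d/2 = 2 (Definition 2.3)
index = [()] + [(i,) for i in range(n_vars)] \
      + list(itertools.combinations_with_replacement(range(n_vars), 2))
mat = np.array([[M(S + T) for T in index] for S in index])

print(M(()) == 1.0)                                 # M(1) = 1
print(bool(np.linalg.eigvalsh(mat).min() > -1e-9))  # mat(M) is PSD
# constraint rows M(q_i(x) * x^K) = 0, here checked for |K| <= 1:
print(all(abs(M((i, i, i) + K) - M((i,) + K)) < 1e-12
          for i in range(n_vars)
          for K in [()] + [(j,) for j in range(n_vars)]))
```

The matrix-form is PSD because it equals E[u u^T] for the vector u of monomials of degree at most d/2, which is exactly the intuition behind Definition 2.3; pseudo-moments that do not come from any distribution are what the lower bound in this paper constructs.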
In the next section we apply this relaxation to the Sparse PCA problem and state our results.\n\n3 Main Results\n\nTo exploit the sum of squares relaxation framework described in Section 2.1, we first convert the statistically optimal estimator/detector (2.1) into the “polynomial” program version below:\n\nMaximize ⟨Σ̂, xx^T⟩   (3.1)\nsubject to ‖x‖_2^2 = k, and x_i^3 = x_i, ∀i ∈ [p]   (3.2 & 3.3)\n|x|_1 ≤ k   (3.4)\n\nThe non-convex sparsity constraint (2.2) is replaced by the polynomial constraint (3.3), which ensures that any solution vector x has entries in {0, ±1}, and so together with the constraint (3.2) guarantees that it has precisely k non-zero ±1 entries. The constraint (3.3) implies other natural constraints that one may add to the program in order to make it stronger: for example, the upper bound on each entry x_i, the lower bound on the non-zero entries of x_i, and the constraint ‖x‖_4^4 ≥ k which is used as a surrogate for k-sparse vectors in [25, 24]. Note that we also added an ℓ_1 sparsity constraint (3.4) (which is convex), as is often used in practice, which makes our lower bound even stronger. Of course, it is formally implied by the other constraints, but not in low-degree SoS.\nNow we are ready to apply the sum-of-squares relaxation scheme described in Section 2.1 to the polynomial program above. For the degree-4 relaxation we obtain the following semidefinite program SoS_4(Σ̂), which we view as an algorithm for both detection and estimation problems. Note
Note\n\n6\n\n\fM (xixj) \u02c6\u03a3ij\n\nSoS4( \u02c6\u03a3) = max (cid:88)\nand (cid:88)\nsubject to (cid:88)\ni xj) = M (xixj), and (cid:88)\n(cid:88)\n\ni ) = k\n\nM (x2\n\n(cid:96)\u2208[p]\n\ni\u2208[p]\n\nM (x3\n\ni,j\n\ni,j\u2208[p]\n\ni,j,s,t\u2208[p]\n\n|M (xixj)| \u2264 k2\n\n(Obj)\n\n(C1&2)\n\nM (x2\n\n(cid:96) xixj) = kM (xixj),\u2200i, j \u2208 [p]\n\n(C4)\n\nthat the same objective function, with only the three constraints (C1&2), (C6) gives the degree-2\nrelaxation, which is precisely the standard SDP relaxation of Sparse PCA studied in [42, 41, 45]. So\nclearly SoS4( \u02c6\u03a3) subsumes the SDP relaxation.\n\nAlgorithm 1 SoS4( \u02c6\u03a3): Degree-4 Sum of Squares Relaxation\nSolve the following SDP and obtain optimal objective value SoS4( \u02c6\u03a3) and maximizer M\u2217.\nVariables: M (S), for all mutli-sets S of size at most 4.\n\n|M (xixjxsxt)| \u2264 k4\n\nand M (cid:23) 0\n\n(C5&6)\n\nOutput: 1. For detection problem : output Hv if SoS4( \u02c6\u03a3) > (1 + 1\n2 = (M\u2217(xixj))i,j\u2208[p]\n\n2. For estimation problem: output M\u2217\n\n2 \u03bb)k, H0 otherwise\n\nBefore stating the lower bounds for both detection and estimation in the next two subsections, we\ncomment on the choices made for the outputs of the algorithm in both, as clearly other choices can be\nmade that would be interesting to investigate. For detection, we pick the natural threshold (1 + 1\n2 \u03bb)k\nfrom the statistically optimal detection algorithm of Section 2. Our lower bound of the objective\nunder H0 is actually a large constant multiple of \u03bbk, so we could have taken a higher threshold.\nTo analyze even higher ones would require analyzing the behavior of SoS4 under the (planted)\nalternative distribution Hv. 
For estimation we output the maximizer M*_2 of the objective function, and prove that it is not too correlated with the rank-1 matrix vv^T in the planted distribution Hv. This suggests, but does not prove, that the leading eigenvector of M*_2 (which is a natural estimator for v) is not too correlated with v. We finally note that Rigollet’s efficient reduction from detection to estimation is not in the SoS framework, and so our detection lower bound does not automatically imply the one for estimation.\nFor the detection problem, we prove that SoS_4(Σ̂) gives a large objective on the null hypothesis H0.\n\nTheorem 3.1. There exist absolute constants C and r such that for 1 ≤ λ < min{k^{1/4}, √n} and any p ≥ Cλn, k ≥ Cλ^{7/6} √n log^r p, the following holds. When the data X is drawn from the null hypothesis H0, then with high probability (1 − p^{−10}), the objective value of the degree-4 sum of squares relaxation SoS_4(Σ̂) is at least 10λk. Consequently, Algorithm 1 can’t solve the detection problem.\n\nTo parse the theorem and to understand its consequences, consider first the case when λ is a constant (which is also arguably the most interesting regime). Then the theorem says that when we have only n ≪ k^2 samples, the degree-4 SoS relaxation SoS_4 still overfits heavily to the randomness of the data X under the null hypothesis H0. Therefore, using SoS_4(Σ̂) > (1 + λ/2)k (or even 10λk) as a threshold will fail with high probability to distinguish H0 and Hv.\nWe note that for constant λ our result is essentially tight in terms of the dependencies between n, k, p. The condition p = Ω̃(n) is necessary since otherwise when p = o(n), even without the sum of squares relaxation, the objective value is controlled by (1 + o(1))k since Σ̂ has maximum eigenvalue 1 + o(1) in this regime. Furthermore, as mentioned in the introduction, k ≥ Ω̃(√
Furthermore, as mentioned in the introduction, k \u2265 (cid:101)\u2126(\n\nsum of squares relaxation, the objective value is controlled by (1 + o(1))k since \u02c6\u03a3 has maximum\nn) is\nalso necessary (up to poly-logarithmic factors), since when n (cid:29) k2, a simple diagonal thresholding\nalgorithm works for this simple single-spike model.\nWhen \u03bb is not considered as a constant, the dependence of the lower bound on \u03bb is not optimal, but\nclose. Ideally one could expect that as long as k (cid:29) \u03bb\nn, and p \u2265 \u03bbn, the objective value on the\nnull hypothesis is at least \u2126(\u03bbk). Tightening the \u03bb1/6 slack, and possibly extending the range of\n\n\u221a\n\n\u221a\n\n7\n\n\f\u03bb are left to future study. Finally, we note that he result can be extended to a lower bound for the\nestimation problem, which is presented in the supplementary material.\n\nReferences\n[1] Venkat Chandrasekaran and Michael I. Jordan. Computational and statistical tradeoffs via convex relax-\n\nation. Proceedings of the National Academy of Sciences, 110(13):E1181\u2013E1190, 2013.\n\n[2] IM Johnstone. Function estimation and gaussian sequence models. Unpublished manuscript, 2002.\n[3] D. L. Donoho. De-noising by soft-thresholding. IEEE Trans. Inf. Theor., 41(3):613\u2013627, May 1995.\n[4] David L. Donoho and Iain M. Johnstone. Minimax estimation via wavelet shrinkage. Ann. Statist.,\n\n26(3):879\u2013921, 06 1998.\n\n[5] Quentin Berthet and Philippe Rigollet. Complexity theoretic lower bounds for sparse principal component\ndetection. In COLT 2013 - The 26th Annual Conference on Learning Theory, June 12-14, 2013, Princeton\nUniversity, NJ, USA, pages 1046\u20131066, 2013.\n\n[6] Scott Decatur, Oded Goldreich, and Dana Ron. Computational sample complexity. In Proceedings of\nthe Tenth Annual Conference on Computational Learning Theory, COLT \u201997, pages 130\u2013142, New York,\nNY, USA, 1997. ACM.\n\n[7] Rocco A. Servedio. 
Computational sample complexity and attribute-efficient learning. Journal of Computer and System Sciences, 60(1):161–178, 2000.
[8] Amit Daniely, Nati Linial, and Shai Shalev-Shwartz. More data speeds up training time in learning halfspaces over sparse vectors. In Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 145–153, 2013.
[9] C. Gao, Z. Ma, and H. H. Zhou. Sparse CCA: Adaptive estimation and computational barriers. ArXiv e-prints, September 2014.
[10] Jean B. Lasserre. Global optimization with polynomials and the problem of moments. SIAM Journal on Optimization, 11(3):796–817, 2001.
[11] Pablo A. Parrilo. Structured Semidefinite Programs and Semialgebraic Geometry Methods in Robustness and Optimization. PhD thesis, California Institute of Technology, 2000.
[12] Dima Grigoriev. Linear lower bound on degrees of positivstellensatz calculus proofs for the parity. Theoretical Computer Science, 259(1):613–622, 2001.
[13] Emil Artin. Über die Zerlegung definiter Funktionen in Quadrate. In Abhandlungen aus dem mathematischen Seminar der Universität Hamburg, volume 5, pages 100–115. Springer, 1927.
[14] Jean-Louis Krivine. Anneaux préordonnés. Journal d'analyse mathématique, 1964.
[15] Gilbert Stengle. A nullstellensatz and a positivstellensatz in semialgebraic geometry. Mathematische Annalen, 207(2):87–97, 1974.
[16] N. Z. Shor. An approach to obtaining global extremums in polynomial mathematical programming problems. Cybernetics, 23(5):695–700, 1987.
[17] Konrad Schmüdgen. The k-moment problem for compact semi-algebraic sets.
Mathematische Annalen, 289(1):203–206, 1991.
[18] Mihai Putinar. Positive polynomials on compact semi-algebraic sets. Indiana University Mathematics Journal, 42(3):969–984, 1993.
[19] Yurii Nesterov. Squared functional systems and optimization problems. In Hans Frenk, Kees Roos, Tamás Terlaky, and Shuzhong Zhang, editors, High Performance Optimization, volume 33 of Applied Optimization, pages 405–440. Springer US, 2000.
[20] Jean Bernard Lasserre. An Introduction to Polynomial and Semi-Algebraic Optimization. Cambridge Texts in Applied Mathematics. Cambridge University Press, 2015.
[21] Monique Laurent. Sums of squares, moment matrices and optimization over polynomials. In Mihai Putinar and Seth Sullivant, editors, Emerging Applications of Algebraic Geometry, volume 149 of The IMA Volumes in Mathematics and its Applications, pages 157–270. Springer New York, 2009.
[22] Hanif D. Sherali and Warren P. Adams. A hierarchy of relaxations between the continuous and convex hull representations for zero-one programming problems. SIAM Journal on Discrete Mathematics, 3(3):411–430, 1990.
[23] L. Lovász and A. Schrijver. Cones of matrices and set-functions and 0–1 optimization. SIAM Journal on Optimization, 1(2):166–190, 1991.
[24] Boaz Barak, Jonathan A. Kelner, and David Steurer. Dictionary learning and tensor decomposition via the sum-of-squares method. In Proceedings of the Forty-seventh Annual ACM Symposium on Theory of Computing, STOC '15, 2015.
[25] Boaz Barak, Jonathan A. Kelner, and David Steurer. Rounding sum-of-squares relaxations. In STOC, pages 31–40, 2014.
[26] Boaz Barak and Ankur Moitra. Tensor prediction, Rademacher complexity and random 3-XOR. CoRR, abs/1501.06521, 2015.
[27] Boaz Barak and David Steurer.
Sum-of-squares proofs and the quest toward optimal algorithms. In Proceedings of the International Congress of Mathematicians (ICM), 2014. To appear.
[28] D. Grigoriev. Complexity of positivstellensatz proofs for the knapsack. Computational Complexity, 10(2):139–154, 2001.
[29] Grant Schoenebeck. Linear level Lasserre lower bounds for certain k-CSPs. In Proceedings of the 2008 49th Annual IEEE Symposium on Foundations of Computer Science, FOCS '08, pages 593–602, Washington, DC, USA, 2008. IEEE Computer Society.
[30] Raghu Meka, Aaron Potechin, and Avi Wigderson. Sum-of-squares lower bounds for planted clique. CoRR, abs/1503.06447, 2015.
[31] Z. Wang, Q. Gu, and H. Liu. Statistical limits of convex relaxations. ArXiv e-prints, March 2015.
[32] Iain M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist., 29(2):295–327, 2001.
[33] Zongming Ma. Sparse principal component analysis and iterative thresholding. Ann. Statist., 41(2):772–801, 2013.
[34] Vincent Q. Vu and Jing Lei. Minimax sparse principal subspace estimation in high dimensions. Ann. Statist., 41(6):2905–2947, 2013.
[35] U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96(12):6745–6750, 1999.
[36] Iain M. Johnstone and Arthur Yu Lu. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104(486):682–703, 2009.
[37] Xi Chen. Adaptive elastic-net sparse principal component analysis for pathway association testing. Statistical Applications in Genetics and Molecular Biology, 10, 2011.
[38] Rodolphe Jenatton, Guillaume Obozinski, and Francis R. Bach.
Structured sparse principal component analysis. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13-15, 2010, pages 366–373, 2010.
[39] Vincent Q. Vu and Jing Lei. Minimax rates of estimation for sparse PCA in high dimensions. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2012, La Palma, Canary Islands, April 21-23, 2012, pages 1278–1286, 2012.
[40] Debashis Paul and Iain M. Johnstone. Augmented sparse principal component analysis for high dimensional data. arXiv preprint arXiv:1202.1242, 2012.
[41] Quentin Berthet and Philippe Rigollet. Optimal detection of sparse principal components in high dimension. The Annals of Statistics, 41(4):1780–1815, 2013.
[42] Arash A. Amini and Martin J. Wainwright. High-dimensional analysis of semidefinite relaxations for sparse principal components. Ann. Statist., 37(5B):2877–2921, 2009.
[43] Yash Deshpande and Andrea Montanari. Sparse PCA via covariance thresholding. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 334–342, 2014.
[44] Alexandre d'Aspremont, Laurent El Ghaoui, Michael I. Jordan, and Gert R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–448, 2007.
[45] Robert Krauthgamer, Boaz Nadler, and Dan Vilenchik. Do semidefinite relaxations solve sparse PCA up to the information limit? The Annals of Statistics, 43(3):1300–1322, 2015.
[46] Y. Deshpande and A. Montanari. Improved sum-of-squares lower bounds for hidden clique and hidden submatrix problems. ArXiv e-prints, February 2015.
[47] Prasad Raghavendra and Tselil Schramm.
Tight lower bounds for planted clique in the degree-4 SOS program. CoRR, abs/1507.05136, 2015.
[48] Samuel B. Hopkins, Pravesh K. Kothari, and Aaron Potechin. SoS and planted clique: Tight analysis of MPW moments at all degrees and an optimal lower bound at degree four. CoRR, abs/1507.05230, 2015.
[49] Tengyu Ma and Philippe Rigollet. Personal communication, 2014.