{"title": "Cone-Constrained Principal Component Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 2717, "page_last": 2725, "abstract": "Estimating a vector from noisy quadratic observations is a task that arises naturally in many contexts, from dimensionality reduction, to synchronization and phase retrieval problems. It is often the case that additional information is available about the unknown vector (for instance, sparsity, sign or magnitude of its entries). Many authors propose non-convex quadratic optimization problems that aim at exploiting optimally this information. However, solving these problems is typically NP-hard. We consider a simple model for noisy quadratic observation of an unknown vector $\\bvz$. The unknown vector is constrained to belong to a cone $\\Cone \\ni \\bvz$. While optimal estimation appears to be intractable for the general problems in this class, we provide evidence that it is tractable when $\\Cone$ is a convex cone with an efficient projection. This is surprising, since the corresponding optimization problem is non-convex and --from a worst case perspective-- often NP hard. We characterize the resulting minimax risk in terms of the statistical dimension of the cone $\\delta(\\Cone)$. This quantity is already known to control the risk of estimation from gaussian observations and random linear measurements. It is rather surprising that the same quantity plays a role in the estimation risk from quadratic measurements.", "full_text": "Cone-constrained Principal Component Analysis\n\nYash Deshpande\n\nElectrical Engineering\nStanford University\n\nAndrea Montanari\n\nElectrical Engineering and Statistics\n\nStanford University\n\nEmile Richard\n\nElectrical Engineering\nStanford University\n\nAbstract\n\nEstimating a vector from noisy quadratic observations is a task that arises nat-\nurally in many contexts, from dimensionality reduction, to synchronization and\nphase retrieval problems. 
It is often the case that additional information is available about the unknown vector (for instance, sparsity, sign or magnitude of its entries). Many authors propose non-convex quadratic optimization problems that aim at exploiting this information optimally. However, solving these problems is typically NP-hard.\nWe consider a simple model for noisy quadratic observation of an unknown vector $v_0$. The unknown vector is constrained to belong to a cone $\\mathcal{C} \\ni v_0$. While optimal estimation appears to be intractable for the general problems in this class, we provide evidence that it is tractable when $\\mathcal{C}$ is a convex cone with an efficient projection. This is surprising, since the corresponding optimization problem is non-convex and, from a worst-case perspective, often NP-hard. We characterize the resulting minimax risk in terms of the statistical dimension of the cone $\\delta(\\mathcal{C})$. This quantity is already known to control the risk of estimation from Gaussian observations and random linear measurements. It is rather surprising that the same quantity plays a role in the estimation risk from quadratic measurements.\n\n1 Introduction\n\nIn many statistical estimation problems, observations can be modeled as noisy quadratic functions of an unknown vector $v_0 = (v_{0,1}, v_{0,2}, \\dots, v_{0,n})^{\\mathsf{T}} \\in \\mathbb{R}^n$. For instance, in positioning and graph localization [5, 24], one is given noisy measurements of pairwise distances $(v_{0,i} - v_{0,j})^2$ (where, for simplicity, we consider the case in which the underlying geometry is one-dimensional). In principal component analysis (PCA) [15], one is given a data matrix $X \\in \\mathbb{R}^{n \\times p}$, and tries to reduce its dimensionality by postulating an approximate factorization $X \\approx u_0 v_0^{\\mathsf{T}}$. Hence $X_{ij}$ can be interpreted as a noisy observation of the quadratic function $u_{0,i} v_{0,j}$. As a last example, there has been significant interest recently in phase retrieval problems [11, 6].\n
In this case, the unknown vector $v_0$ is, roughly speaking, an image, and the observations are proportional to the square modulus of a modulated Fourier transform $|F v_0|^2$.\nIn several of these contexts, a significant effort has been devoted to exploiting additional structure of the unknown vector $v_0$. For instance, in sparse PCA, various methods have been developed to exploit the fact that $v_0$ is known to be sparse [14, 25]. In sparse phase retrieval [13, 18], a similar assumption is made in the context of phase retrieval.\nAll of these attempts face a recurring dichotomy. On the one hand, additional information on $v_0$ can dramatically increase the estimation accuracy. On the other, only a fraction of this additional information is exploited by existing polynomial-time algorithms. For instance in sparse PCA, if it is known that only $k$ entries of the vector $v_0$ are non-vanishing, an optimal estimator is successful in identifying them from roughly $k$ samples (neglecting logarithmic factors) [2]. On the other hand, known polynomial-time algorithms require about $k^2$ samples [16, 7].\nThis fascinating phenomenon is, however, poorly understood so far. Classifying estimation problems as to whether optimal estimation accuracy can be achieved or not in polynomial time is an outstanding challenge. In this paper we develop a stylized model to study estimation from quadratic observations, under additional constraints. Special choices of the constraint set yield examples for which optimal estimation is thought to be intractable.\nHowever, we identify a large class of constraints for which estimation appears to be tractable, even though the corresponding maximum likelihood problem is non-convex. This shows that computational tractability is not immediately related to simple considerations of convexity or worst-case complexity.\nOur model assumes $v_0 \\in \\mathcal{C}_n$ with $\\mathcal{C}_n \\subseteq \\mathbb{R}^n$ a closed cone.\n
Observations are organized in a symmetric matrix $X = (X_{ij})_{1 \\le i,j \\le n}$ defined by\n\n$$X = \\beta\\, v_0 v_0^{\\mathsf{T}} + Z . \\tag{1}$$\n\nHere $Z$ is a symmetric noise matrix with independent entries $(Z_{ij})_{i \\le j}$ with $Z_{ij} \\sim \\mathsf{N}(0, 1/n)$ for $i < j$ and $Z_{ii} \\sim \\mathsf{N}(0, 2/n)$. We assume, without loss of generality, $\\|v_0\\|_2 = 1$, and hence $\\beta$ is the signal-to-noise ratio. We will assume $\\beta$ to be known to avoid non-essential complications.\nWe consider estimators that return normalized vectors $\\hat{v} : \\mathbb{R}^{n \\times n} \\to S^{n-1} \\equiv \\{v \\in \\mathbb{R}^n : \\|v\\|_2 = 1\\}$, and will characterize such an estimator through the risk function\n\n$$R_{\\mathcal{C}_n}(\\hat{v}; v_0) = \\frac{1}{2}\\, \\mathbb{E}\\big\\{ \\min\\big( \\|\\hat{v}(X) - v_0\\|_2^2 ,\\, \\|\\hat{v}(X) + v_0\\|_2^2 \\big) \\big\\} = 1 - \\mathbb{E}\\{ |\\langle \\hat{v}(X), v_0 \\rangle| \\} . \\tag{2}$$\n\nThe corresponding worst-case risk is $R(\\hat{v}; \\mathcal{C}_n) \\equiv \\sup_{v_0 \\in \\mathcal{C}_n} R_{\\mathcal{C}_n}(\\hat{v}; v_0)$, and the minimax risk is $R(\\mathcal{C}_n) = \\inf_{\\hat{v}} R(\\hat{v}; \\mathcal{C}_n)$.\n\nRemark 1.1. Let $\\mathcal{C}_n = \\mathcal{S}_{n,k}$ be the cone of vectors $v_0$ that have at most $k$ non-zero entries, all positive, and with equal magnitude. The problem of testing whether $\\beta = 0$ or $\\beta \\ge \\beta_0$ within the model (1) coincides with the problem of detecting a non-zero mean submatrix in a Gaussian matrix. For the latter, Ma and Wu [20] proved that it cannot be accomplished in polynomial time unless an algorithm exists for the so-called planted clique problem in a regime in which the latter is conjectured to be hard.\nThis suggests that the problem of estimating $v_0$ with rate-optimal minimax risk is hard for the constraint set $\\mathcal{C}_n = \\mathcal{S}_{n,k}$.\nWe next summarize our results. While, as shown by the last remark, optimal estimation is generically intractable for the model (1) under the constraint $v_0 \\in \\mathcal{C}_n$, we show that, roughly speaking, it is instead tractable if $\\mathcal{C}_n$ is a convex cone.\n
Note that this does not follow from elementary convexity considerations. Indeed, the maximum likelihood problem\n\n$$\\text{maximize } \\langle v, X v \\rangle \\quad \\text{subject to } v \\in \\mathcal{C}_n ,\\ \\|v\\|_2 = 1 , \\tag{3}$$\n\nis non-convex. Moreover, solving this optimization problem exactly is NP-hard even for simple choices of the convex cone $\\mathcal{C}_n$. For instance, if $\\mathcal{C}_n = \\mathcal{P}_n \\equiv \\{v \\in \\mathbb{R}^n : v \\ge 0\\}$ is an orthant, then solving the above is equivalent to copositive programming, which is NP-hard by reduction from maximum independent sets [12, Chapter 7].\nOur results naturally characterize the cone $\\mathcal{C}_n$ through its statistical dimension [1]. If $P_{\\mathcal{C}_n}$ denotes the orthogonal projection on $\\mathcal{C}_n$, then the fractional statistical dimension of $\\mathcal{C}_n$ is defined as\n\n$$\\delta(\\mathcal{C}_n) \\equiv \\frac{1}{n}\\, \\mathbb{E}\\big\\{ \\|P_{\\mathcal{C}_n}(g)\\|_2^2 \\big\\} , \\tag{4}$$\n\nwhere expectation is with respect to $g \\sim \\mathsf{N}(0, I_{n \\times n})$. Note that $\\delta(\\mathcal{C}_n) \\in [0, 1]$ can be significantly smaller than 1. For instance, if $\\mathcal{C}_n = \\mathcal{M}_n \\equiv \\{v \\in \\mathbb{R}^n_+ : \\forall i ,\\ v_{i+1} \\ge v_i\\}$ is the cone of non-negative, monotone increasing sequences, then [9, Lemma 4.2] proves that $\\delta(\\mathcal{C}_n) \\le 20 (\\log n)^2 / n$.\nBelow is an informal summary of our results, with titles referring to sections where these are established.\n\nInformation-theoretic limits. We prove that in order to estimate $v_0$ accurately, it is necessary to have $\\beta \\gtrsim \\sqrt{\\delta(\\mathcal{C}_n)}$. Namely, there exist universal constants $c_1, c_2 > 0$ such that, if $\\beta \\le c_1 \\sqrt{\\delta(\\mathcal{C}_n)}$, then $R(\\mathcal{C}_n) \\ge c_2$.\n\nMaximum likelihood estimator. Let $\\hat{v}^{\\mathrm{ML}}(X)$ be the maximum-likelihood estimator, i.e. any solution of Eq. (3). We then prove that, for $\\beta \\ge \\sqrt{\\delta(\\mathcal{C}_n)}$,\n\n$$R(\\hat{v}^{\\mathrm{ML}}; \\mathcal{C}_n) \\le \\frac{4 \\sqrt{\\delta(\\mathcal{C}_n)}}{\\beta} . \\tag{5}$$\n\nLow-complexity iterative estimator. In the special case $\\mathcal{C}_n = \\mathbb{R}^n$, the solution of the optimization problem (3) is given by the eigenvector with the largest eigenvalue. A standard low-complexity approach to computing the leading eigenvector is provided by the power method. We consider a simple generalization that, starting from the initialization $\\hat{v}^0$, alternates between projection onto $\\mathcal{C}_n$ and multiplication by $(X + \\rho I_n)$ (the term $\\rho > 0$ is added to improve convergence):\n\n$$\\hat{v}^{t+1} = \\frac{P_{\\mathcal{C}_n}(u^t)}{\\|P_{\\mathcal{C}_n}(u^t)\\|_2} , \\tag{6}$$\n\n$$u^t = (X + \\rho I_n)\\, \\hat{v}^t . \\tag{7}$$\n\nWe prove that, for $t \\gtrsim \\log n$ iterations, this algorithm yields an estimate with $R(\\hat{v}^t; \\mathcal{C}_n) \\lesssim \\sqrt{\\delta(\\mathcal{C}_n)} / \\beta$, and is hence order optimal, for $\\beta \\gtrsim \\sqrt{\\delta(\\mathcal{C}_n)}$. (Our proof technique requires the initialization to have a positive scalar product with $v_0$.)\n\nAs a side result of our analysis of the maximum likelihood estimator, we prove a new, elegant upper bound on the value of the optimization problem (3), denoted by $\\lambda_1(Z; \\mathcal{C}_n) \\equiv \\max_{v \\in \\mathcal{C}_n \\cap S^{n-1}} \\langle v, Z v \\rangle$. Namely\n\n$$\\mathbb{E}\\, \\lambda_1(Z; \\mathcal{C}_n) \\le 2 \\sqrt{\\delta(\\mathcal{C}_n)} . \\tag{8}$$\n\nIn the special case $\\mathcal{C}_n = \\mathbb{R}^n$, $\\lambda_1(Z; \\mathbb{R}^n)$ is the largest eigenvalue of $Z$, and the above inequality shows that this is bounded in expectation by 2. In this case, the bound is known to be asymptotically tight [10]. In the supplementary material, we prove that it is tight for certain other examples such as the non-negative orthant and circular cones (a.k.a. ice-cream cones). We conjecture that this inequality is asymptotically tight for general convex cones.\nUnless stated otherwise, in the following we will defer proofs to the Supplementary Material.\n\n2 Information-theoretic limits\n\nWe use an information-theoretic argument to show that, under the observation model (1), the minimax risk can be bounded below for $\\beta \\lesssim \\sqrt{\\delta(\\mathcal{C}_n)}$. As is standard, our bound employs the so-called packing number of $\\mathcal{C}_n$.\nDefinition 2.1. For a cone $\\mathcal{C}_n \\subseteq \\mathbb{R}^n$, we define its packing number $N(\\mathcal{C}_n, \\varepsilon)$ as the size of the maximal subset $\\mathcal{X}$ of $\\mathcal{C}_n \\cap S^{n-1}$ such that for every pair of distinct points $x_1, x_2 \\in \\mathcal{X}$, $\\|x_1 - x_2\\| \\ge \\varepsilon$.\nWe then have the following.\nTheorem 1. There exist universal constants $C_1, C_2 > 0$ such that for any closed convex cone $\\mathcal{C}_n$ with $\\delta(\\mathcal{C}_n) \\ge 3/n$:\n\n$$\\beta \\le C_1 \\sqrt{\\delta(\\mathcal{C}_n)} \\ \\Rightarrow\\ R(\\mathcal{C}_n) \\ge \\frac{C_2\\, \\delta(\\mathcal{C}_n)}{\\log(1/\\delta(\\mathcal{C}_n))} . \\tag{9}$$\n\nNotice that the last expression for the lower bound depends on the cone width, as is to be expected: even for $\\beta = 0$, it is possible to estimate $v_0$ with risk going to 0 if the cone $\\mathcal{C}_n$ \u2018shrinks\u2019 as $n \\to \\infty$. The proof of this theorem is provided in Section 2 of the supplement.\n\n3 Maximum likelihood estimator\n\nUnder the Gaussian noise model for $Z$, cf. Eq. (1), the likelihood of observing $X$ under a hypothesis $v$ is proportional to $\\exp(-\\|X - v v^{\\mathsf{T}}\\|_F^2 / 2)$. Using the constraint that $\\|v\\| = 1$, it follows that any solution of (3) is a maximum likelihood estimator.\n\nTheorem 2. Consider the model as in (1). Then, when $\\beta \\ge \\sqrt{\\delta(\\mathcal{C}_n)}$, any solution $\\hat{v}^{\\mathrm{ML}}(X)$ to the maximum likelihood problem (3) satisfies\n\n$$R_{\\mathcal{C}_n}(\\hat{v}^{\\mathrm{ML}}; \\mathcal{C}_n) \\le \\min\\left\\{ \\frac{4 \\sqrt{\\delta(\\mathcal{C}_n)}}{\\beta} ,\\ \\frac{16}{\\beta^2} \\right\\} . \\tag{10}$$\n\nThus, for $\\beta \\gtrsim \\sqrt{\\delta(\\mathcal{C}_n)}$, the risk of the maximum likelihood estimator decays as $\\sqrt{\\delta(\\mathcal{C}_n)} / \\beta$ while for $\\beta \\gtrsim 1$, it shifts to a faster decay of $1/\\beta^2$. We have made no attempt to optimize the constants in the statement of the theorem, though we believe that the correct leading constant in either case is 1.\nNote that without the cone constraint (or with $\\mathcal{C}_n = \\mathbb{R}^n$) the maximum likelihood estimator reduces to computing the principal eigenvector $\\hat{v}^{\\mathrm{PC}}$ of $X$. Recent results in random matrix theory [10] and statistical decision theory [4] prove that in the case of the principal eigenvector, a nontrivial risk (i.e. $R_{\\mathcal{C}_n}(\\hat{v}^{\\mathrm{PC}}; \\mathcal{C}_n) < 1$ asymptotically) is obtained only when $\\beta > 1$. Our result shows that this threshold is, instead, reduced to $\\sqrt{\\delta(\\mathcal{C}_n)}$, which can be significantly smaller than 1. The proof of this theorem is provided in Section 3 of the supplement.\n\n4 Low-complexity iterative estimator\n\nSections 2 and 3 provide theoretical insight into the fundamental limits of estimation of $v_0$ from quadratic observations of the form $\\beta v_0 v_0^{\\mathsf{T}} + Z$. However, as previously mentioned, the maximum likelihood estimator of Section 3 is NP-hard to compute, in general. In this section, we propose a simple iterative algorithm that generalizes the well-known power iteration to compute the principal eigenvector of a matrix. Furthermore, we prove that, given an initialization with positive scalar product with $v_0$, this algorithm achieves the same risk as the maximum likelihood estimator up to constants. Throughout, the cone $\\mathcal{C}_n$ is assumed to be convex.\nOur starting point is the power iteration to compute the principal eigenvector $\\hat{v}^{\\mathrm{PC}}$ of $X$. This is given by letting, for $t \\ge 0$: $\\hat{v}^{t+1} = X \\hat{v}^t / \\|X \\hat{v}^t\\|$. Under our observation model, we have $X = \\beta v_0 v_0^{\\mathsf{T}} + Z$ with $v_0 \\in \\mathcal{C}_n$. We can incorporate this information by projecting the iterates onto the cone $\\mathcal{C}_n$ (see e.g.\n
[19] for related ideas):\n\n$$\\hat{v}^t = \\frac{P_{\\mathcal{C}_n}(u^t)}{\\|P_{\\mathcal{C}_n}(u^t)\\|} , \\qquad u^{t+1} = X \\hat{v}^t + \\rho\\, \\hat{v}^t . \\tag{11}$$\n\nThe projection is defined in the standard way:\n\n$$P_{\\mathcal{C}_n}(x) \\equiv \\arg\\min_{y \\in \\mathcal{C}_n} \\|y - x\\|^2 . \\tag{12}$$\n\nIf $\\mathcal{C}_n$ is convex, then the projection is unique. We have implicitly assumed that the operation of projecting to the cone $\\mathcal{C}_n$ is available to the algorithm as a simple primitive. This is the case for many convex cones of interest, such as the orthant $\\mathcal{P}_n$, the monotone cone $\\mathcal{M}_n$, and ice-cream cones, for which the projection is easy to compute. For instance, if $\\mathcal{C}_n = \\mathcal{P}_n$ is the non-negative orthant, $P_{\\mathcal{C}_n}(x) = (x)_+$ is the non-negative part of $x$. For the monotone cone, the projection can be computed efficiently through the pool-adjacent-violators algorithm.\nThe memory term $\\rho\\, \\hat{v}^t$ is necessary for our proof technique to go through. It is straightforward to see that adding $\\rho I_n$ to the data $X$ does not change the optimizers of the problem (3). The following theorem provides deterministic conditions under which the distance between the iterative estimator and the vector $v_0$ can be bounded.\n\nTheorem 3. Let $\\hat{v}^t$ be the power iteration estimator (11). Assume $\\rho > \\Delta$ and that the noise matrix $Z$ satisfies:\n\n$$\\max\\big\\{ |\\langle x, Z y \\rangle| : x, y \\in \\mathcal{C}_n \\cap S^{n-1} \\big\\} \\le \\Delta . \\tag{13}$$\n\nIf $\\beta > 4\\Delta$, and the initial point $\\hat{v}^0 \\in \\mathcal{C}_n \\cap S^{n-1}$ satisfies $\\langle \\hat{v}^0, v_0 \\rangle \\ge 2\\Delta/\\beta$, then there exists $t_0 = t_0(\\Delta/\\beta, \\Delta/\\rho) < \\infty$ independent of $n$ such that, for all $t \\ge t_0$\n\n$$\\|\\hat{v}^t - v_0\\| \\le \\frac{4\\Delta}{\\beta} . \\tag{14}$$\n\nWe can apply this theorem to the Gaussian noise model to obtain the following bound on the risk of the power iteration estimator.\n\nCorollary 4.1. Under the model (1) let $\\varepsilon_n = 8 \\sqrt{\\log n / n}$.\n
Assume that $\\langle \\hat{v}^0, v_0 \\rangle > 0$ and\n\n$$\\beta > 2 \\big( \\sqrt{\\delta(\\mathcal{C}_n)} + \\varepsilon_n \\big) \\max\\big( 2 ,\\ \\langle \\hat{v}^0, v_0 \\rangle^{-1} \\big) . \\tag{15}$$\n\nThen\n\n$$R(\\hat{v}^t, \\mathcal{C}_n) \\le \\frac{2 \\delta(\\mathcal{C}_n) + \\varepsilon_n}{\\beta} . \\tag{16}$$\n\nIn other words, power iteration has risk within a constant from the maximum likelihood estimator, provided an initialization is available whose scalar product with $v_0$ is bounded away from zero. The proofs of Theorem 3 and Corollary 4.1 are provided in Section 4 of the supplement.\n\n5 A case study: sharp asymptotics and minimax results for the orthant\nIn this section, we will be interested in the example in which the cone $\\mathcal{C}_n$ is the non-negative orthant $\\mathcal{C}_n = \\mathcal{P}_n$. Non-negativity constraints within principal component analysis arise in non-negative matrix factorization (NMF). Initially introduced in the context of chemometrics [23, 22], NMF attracted considerable interest because of its applications in computer vision and topic modeling. In particular, Lee and Seung [17] demonstrated empirically that NMF successfully identifies parts of images, or topics in corpora of documents.\nNote that in applications of NMF to computer vision or topic modeling the setting is somewhat different from the model studied here: $X$ is rectangular instead of symmetric, and the rank is larger than one. Such generalizations of our analysis will be the object of future work.\nHere we will use the positive orthant to illustrate the results in previous sections. Further, we will show that stronger results can be proved in this case, thanks to the separable structure of this cone. Namely, we derive sharp asymptotics and we characterize the least-favorable vectors for the maximum likelihood estimator.\nWe denote by $\\lambda^+(X) = \\lambda_1(X; \\mathcal{C}_n = \\mathcal{P}_n)$ the value of the optimization problem (3).\n
Our first result yields the asymptotic value of this quantity for \u2018pure noise,\u2019 confirming the general conjecture put forward above.\n\nTheorem 4. We have almost surely $\\lim_{n \\to \\infty} \\lambda^+(Z) = 2 \\sqrt{\\delta(\\mathcal{P}_n)} = \\sqrt{2}$.\n\nNext we characterize the risk phase transition: this result confirms and strengthens Theorem 2.\nTheorem 5. Consider estimation in the non-negative orthant $\\mathcal{C}_n = \\mathcal{P}_n$ under the model (1). If $\\beta \\le 1/\\sqrt{2}$, then there exists a sequence of vectors $\\{v_0(n)\\}_{n \\ge 0}$ such that almost surely\n\n$$\\lim_{n \\to \\infty} R(\\hat{v}^{\\mathrm{ML}}; v_0(n)) = 1 . \\tag{17}$$\n\nFor $\\beta > 1/\\sqrt{2}$, there exists a function $\\beta \\mapsto R^+(\\beta)$ with $R^+(\\beta) < 1$ for all $\\beta > 1/\\sqrt{2}$, and $R^+(\\beta) \\ge 1 - 1/2\\beta^2$, such that the following happens. For any sequence of vectors $\\{v_0(n)\\}_{n \\ge 0}$, we have, almost surely\n\n$$\\limsup_{n \\to \\infty} R(\\hat{v}^{\\mathrm{ML}}; v_0(n)) \\le R^+(\\beta) . \\tag{18}$$\n\nIn other words, in the high-dimensional limit, the maximum likelihood estimator is positively correlated with the signal $v_0(n)$ if and only if $\\beta > \\sqrt{\\delta(\\mathcal{C}_n)} = 1/\\sqrt{2}$.\n\nExplicit (although non-elementary) expressions for $R^+(\\beta)$ can be computed, along with the limit value of the risk $R(\\hat{v}^{\\mathrm{ML}}; v_0(n))$ for sequences of vectors $\\{v_0(n)\\}_{n \\ge 1}$ whose empirical distribution of entries converges. These results go beyond the scope of the present paper (but see Fig. 1 below for illustration).\nAs a byproduct of our analysis, we can characterize the least-favorable choice of the signal $v_0$. Namely, for $k \\in [1, n]$, we let $u(n, k)$ denote a vector with $\\lfloor k \\rfloor$ non-zero entries, all equal to $1/\\sqrt{\\lfloor k \\rfloor}$. Then we can prove that the asymptotic minimax risk is achieved along sequences of vectors of this type.\n\nTheorem 6.\n
Consider estimation in the non-negative orthant $\\mathcal{C}_n = \\mathcal{P}_n$ under the model (1), and let $R^+(\\beta)$ be the same function as in Theorem 5. If $\\beta \\le 1/\\sqrt{2}$ then there exists $k_n = o(n)$ such that\n\n$$\\lim_{n \\to \\infty} R(\\hat{v}^{\\mathrm{ML}}; u(n, k_n)) = 1 . \\tag{19}$$\n\nIf $\\beta > 1/\\sqrt{2}$ then there exists $\\varepsilon_\\# = \\varepsilon_\\#(\\beta) \\in (0, 1]$ such that\n\n$$\\lim_{n \\to \\infty} R(\\hat{v}^{\\mathrm{ML}}; u(n, n \\varepsilon_\\#)) = R^+(\\beta) . \\tag{20}$$\n\nWe refer the reader to [21] for a detailed analysis of the case of non-negative PCA and the full proofs of Theorems 4, 5 and 6.\n\n5.1 Approximate Message Passing\nThe next question is whether, in the present example $\\mathcal{C}_n = \\mathcal{P}_n$, the risk of the maximum likelihood estimator can be achieved by a low-complexity iterative algorithm. We prove that this is indeed the case (up to an arbitrarily small error), thus confirming Theorem 3. In order to derive an asymptotically exact analysis, we consider an \u2018approximate message passing\u2019 modification of the power iteration.\nLet $f(x) = (x)_+ / \\|(x)_+\\|_2$ denote the normalized projector. We consider the iteration defined by $v^0 = (1, 1, \\dots, 1)^{\\mathsf{T}} / \\sqrt{n}$, $v^{-1} = (0, 0, \\dots, 0)^{\\mathsf{T}}$, and for $t \\ge 0$,\n\n$$v^{t+1} = X f(v^t) - b_t\\, f(v^{t-1}) , \\qquad b_t \\equiv \\frac{\\|(v^t)_+\\|_0}{\\sqrt{n}\\, \\|(v^t)_+\\|_2} . \\tag{AMP}$$\n\nThe algorithm AMP is a slight modification of the projected power iteration algorithm, obtained by adding at each step the \u201cmemory term\u201d $-b_t f(v^{t-1})$. As shown in [8, 3] this term plays a crucial role in allowing for an exact high-dimensional characterization. At each step the estimate produced by the sequence is $\\hat{v}^t = (v^t)_+ / \\|(v^t)_+\\|_2$. We have the following.\n\nTheorem 7. Let $X$ be generated as in (1).\n
Then we have, almost surely,\n\n$$\\lim_{t \\to \\infty} \\lim_{n \\to \\infty} \\big| \\langle \\hat{v}^{\\mathrm{ML}}, X \\hat{v}^{\\mathrm{ML}} \\rangle - \\langle \\hat{v}^t, X \\hat{v}^t \\rangle \\big| = 0 . \\tag{21}$$\n\n5.2 Numerical illustration: comparison with classical PCA\n\nWe performed numerical experiments on synthetic data generated according to the model (1) and with signal $v_0 = u(n, n\\varepsilon)$ as defined in the previous section. We provide in the Appendix formulas for the value of $\\lim_{n \\to \\infty} \\langle v_0, \\hat{v}^{\\mathrm{ML}} \\rangle$, which correspond to continuous black lines in Figure 1. We compare these predictions with empirical values obtained by running AMP.\nWe generated samples of size $n = 10^4$, sparsity level $\\varepsilon \\in \\{0.001, 0.1, 0.8\\}$, and signal-to-noise ratios $\\beta \\in \\{0.05, 0.10, \\dots, 1.5\\}$. In each case we run AMP for $t = 50$ iterations and plot the empirical average of $\\langle \\hat{v}^t, v_0 \\rangle$ over 32 instances. Even for such moderate values of $n$, the asymptotic predictions are remarkably accurate.\nObserve that sparse vectors (small $\\varepsilon$) correspond to the least favorable signal for small signal-to-noise ratio $\\beta$, while the situation is reversed for large values of $\\beta$. In dashed green we represented the theoretical prediction for $\\varepsilon \\to 0$. The value $\\beta = 1/\\sqrt{2}$ corresponds to the phase transition. At the bottom the images correspond to values of the correlation $\\langle v_0, \\hat{v}^{\\mathrm{ML}} \\rangle$ for a grid of values of $\\beta$ and $\\varepsilon$. The top left-hand frame in Figure 1 is obtained by repeating the experiment for a grid of values of $n$, with fixed $\\varepsilon = 0.05$ and several values of $\\beta$. For each point we plot the average of $\\langle \\hat{v}^t, v_0 \\rangle$ after $t = 50$ iterations, over 32 instances. The data suggest $\\langle \\hat{v}^{\\mathrm{ML}}, v_0 \\rangle + A n^{-b} \\approx \\lim_{n \\to \\infty} \\langle v_0, v^+ \\rangle$ with $b \\approx 0.5$.\n\nFigure 1: Numerical simulations with the model (1) for the positive orthant cone $\\mathcal{C}_n = \\mathcal{P}_n$. Top-left: empirical deviation from asymptotic prediction. Top-right: black lines represent the theoretical predictions of Theorem 5, and dots represent empirical values of $\\langle \\hat{v}^t, v_0 \\rangle$ for the AMP estimator (in red) and $\\langle v_1, v_0 \\rangle$ for standard PCA (in blue). Bottom: a comparison of theoretical asymptotic values (left frame) and empirical values (right frame) of $\\langle v_0, \\hat{v}^{\\mathrm{ML}} \\rangle$ for a range of $\\beta$ and $\\varepsilon$.\n\n6 Polyhedral cones and convex relaxations\nA polyhedral cone $\\mathcal{C}_n$ is a closed convex cone that can be represented in the form $\\mathcal{C}_n = \\{x \\in \\mathbb{R}^n : A x \\ge 0\\}$ for some matrix $A \\in \\mathbb{R}^{m \\times n}$. In Section 5 we considered the non-negative orthant, which is an example of a polyhedral cone with $A = I_n$. A number of other examples of practical interest fall within this category of cones. For instance, monotonicity or convexity of a vector $v = (v_1, \\dots, v_n)$ can be enforced, in their discrete version, through inequality constraints (respectively $v_{i+1} - v_i \\ge 0$ and $v_{i+1} - 2 v_i + v_{i-1} \\ge 0$), and hence give rise to polyhedral cones. Furthermore, it is possible to approximate any convex cone $\\mathcal{C}_n$ with a sequence of increasingly accurate polyhedral cones.\nFor a polyhedral cone, the maximum likelihood problem (3) reads:\n\n$$\\text{maximize } \\langle v, X v \\rangle \\quad \\text{subject to: } A v \\ge 0 ;\\ \\|v\\| = 1 . \\tag{22}$$\n\nThe modified power iteration (11) can be specialized to this case, via the appropriate projection. The projection remains computationally feasible provided the matrix $A$ is not too large.\n
Indeed, it is easy to show using convex duality that $P_{\\mathcal{C}_n}(u)$ is given by:\n\n$$P_{\\mathcal{C}_n}(u) = u + A^{\\mathsf{T}} x^\\star , \\qquad x^\\star \\in \\arg\\min\\big\\{ \\|A^{\\mathsf{T}} x + u\\|^2 : x \\ge 0 \\big\\} .$$\n\nThis reduces the projection onto a general polyhedral cone to a non-negative least squares problem, for which efficient routines exist. In special cases such as the orthant, the projection is available in closed form.\nIn the case of polyhedral cones, it is possible to relax the problem (22) using a natural convex surrogate. To see this, we introduce the variable $V = v v^{\\mathsf{T}}$ and write the following equivalent version of problem (22):\n\n$$\\text{maximize } \\langle X, V \\rangle \\quad \\text{subject to: } A V A^{\\mathsf{T}} \\ge 0 ;\\ \\mathrm{Tr}(V) = 1 ;\\ V \\succeq 0 ;\\ \\mathrm{rank}(V) = 1 .$$\n\nHere the constraint $A V A^{\\mathsf{T}} \\ge 0$ is to be interpreted as entry-wise non-negativity, while we write $V \\succeq 0$ to denote that $V$ is positive semidefinite. We can now relax this problem by dropping the rank constraint:\n\n$$\\text{maximize } \\langle X, V \\rangle \\quad \\text{subject to: } A V A^{\\mathsf{T}} \\ge 0 ;\\ \\mathrm{Tr}(V) = 1 ;\\ V \\succeq 0 . \\tag{23}$$\n\nNote that this increases the number of variables from $n$ to $n^2$, as $V \\in \\mathbb{R}^{n \\times n}$, which results in a significant cost increase for standard interior point methods, over the power iteration (11). Furthermore, if the solution $V$ is not rank one, it is not clear how one can use it to form an estimate $\\hat{v}$. On the other hand, this convex relaxation yields a principled approach to bounding the sub-optimality of the estimate provided by the power iteration. It is straightforward to derive the dual program of (23):\n\n$$\\text{minimize } \\lambda_1(X + A^{\\mathsf{T}} Y A) \\quad \\text{subject to: } Y \\ge 0 , \\tag{24}$$\n\nwhere $Y$ is the decision variable, the constraint is interpreted as entry-wise non-negativity as above, and $\\lambda_1(\\cdot)$ denotes the largest eigenvalue. If one can construct a dual witness $Y \\ge 0$ such that $\\lambda_1(X + A^{\\mathsf{T}} Y A) = \\langle \\hat{v}, X \\hat{v} \\rangle$ for any estimator $\\hat{v}$, then this estimator is the maximum likelihood estimator. In particular, using the power iteration estimator $\\hat{v} = \\hat{v}^t$, such a dual witness can provide a certificate of convergence of the power iteration (11).\nWe next describe a construction of a dual witness that we found empirically successful at large enough signal-to-noise ratio. Assume that a heuristic (for instance, the modified power iteration (11)) has produced an estimate $\\hat{v}$ that is a local maximizer of the problem (3). It is proved in the Supplementary Material that such a local maximizer must satisfy the modified eigenvalue equation:\n\n$$X \\hat{v} = \\lambda \\hat{v} - A^{\\mathsf{T}} \\mu , \\quad \\text{with } \\mu \\ge 0 \\text{ and } \\langle \\hat{v}, A^{\\mathsf{T}} \\mu \\rangle = 0 .$$\n\nWe then suggest the witness\n\n$$Y(\\hat{v}) = \\frac{1}{\\|A \\hat{v}\\|^2} \\Big( \\mu\\, \\hat{v}^{\\mathsf{T}} A^{\\mathsf{T}} + A \\hat{v}\\, \\mu^{\\mathsf{T}} \\Big) . \\tag{25}$$\n\nNote that $Y(\\hat{v})$ is non-negative by construction and hence dual feasible. A direct calculation shows that $\\hat{v}$ is an eigenvector of the matrix $X + A^{\\mathsf{T}} Y A$ with eigenvalue $\\lambda = \\langle \\hat{v}, X \\hat{v} \\rangle$. We then obtain the following sufficient condition for optimality.\nProposition 6.1. Let $\\hat{v}$ be a local maximizer of the problem (3). If $\\hat{v}$ is the principal eigenvector of $X + A^{\\mathsf{T}} Y(\\hat{v}) A$, then $\\hat{v}$ is a global maximizer.\n\nIn Figure 2 we plot the average value of the objective function over 50 instances of the problem for $\\mathcal{C}_n = \\mathcal{P}_n$, $n = 100$. We solved the maximum likelihood problem using the power iteration heuristic (11), and used the above construction to compute an upper bound via duality. It is possible to show that this upper bound cannot be tight unless $\\beta > 1$, but it appears to be quite accurate. We also solve the problem (24) directly for the case of non-negative PCA, and (rather surprisingly) the dual is tight for every $\\beta > 0$.\n\n[Figure 1 panels: deviation from asymptotic prediction versus $n$ (empirical values and an $n^{-1/2}$ reference line); $\\langle v_0, v^+ \\rangle$ versus $\\beta$ for non-negative PCA with $\\varepsilon = .001, .100, .800$; theory prediction over the $(\\beta, \\varepsilon)$ grid; empirical values over the $(\\beta, \\varepsilon)$ grid ($n = 1000$).]\n\n[Figure 2 plot: $\\lambda_1(X + Y)$ versus $\\beta$; legend: Power Iteration, Proposed dual witness, Exact dual witness.]\nFigure 2: Value of the maximum likelihood problem (3) for $\\mathcal{C}_n = \\mathcal{P}_n$, as approximated by power iteration. The red line is the value achieved by power iteration, and the blue points the upper bound obtained by dual witness (25). The gap at small $\\beta$ is due to the suboptimal choice of the dual witness, since solving exactly Problem (24) yields the dual witness with value given by the teal circles. As can be seen, they match exactly the value obtained by power iteration, showing zero duality gap! The simulation is for $n = 50$ and 40 Monte Carlo iterations.\n\nReferences\n[1] D. Amelunxen, M. Lotz, M. McCoy, and J. Tropp. Living on the edge: a geometric theory of phase transition in convex optimization. Submitted, 2013.\n[2] Arash A Amini and Martin J Wainwright. High-dimensional analysis of semidefinite relaxations for sparse principal components. In Information Theory, 2008. ISIT 2008. IEEE International Symposium on, pages 2454\u20132458. IEEE, 2008.\n[3] M. Bayati and A. Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Trans. on Inform. Theory, 57:764\u2013785, 2011.\n[4] Aharon Birnbaum, Iain M Johnstone, Boaz Nadler, and Debashis Paul. Minimax bounds for sparse PCA with noisy high-dimensional data. The Annals of Statistics, 41(3):1055\u20131084, 2013.\n[5] Pratik Biswas and Yinyu Ye.\n
Semidefinite programming for ad hoc wireless sensor network localization. In Proceedings of the 3rd international symposium on Information processing in sensor networks, pages 46\u201354. ACM, 2004.\n\n[6] Emmanuel J Candes, Thomas Strohmer, and Vladislav Voroninski. Phaselift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics, 66(8):1241\u20131274, 2013.\n\n[7] Yash Deshpande and Andrea Montanari. Sparse pca via covariance thresholding. arXiv preprint arXiv:1311.5179, 2013.\n\n[8] D. L. Donoho, A. Maleki, and A. Montanari. Message Passing Algorithms for Compressed Sensing. Proceedings of the National Academy of Sciences, 106:18914\u201318919, 2009.\n\n[9] David Donoho, Iain Johnstone, and Andrea Montanari. Accurate prediction of phase transitions in compressed sensing via a connection to minimax denoising. IEEE Transactions on Information Theory, 59(6):3396\u20133433, 2013.\n\n[10] Delphine F\u00e9ral and Sandrine P\u00e9ch\u00e9. The largest eigenvalue of rank one deformation of large wigner matrices. Communications in mathematical physics, 272(1):185\u2013228, 2007.\n\n[11] James R Fienup. Phase retrieval algorithms: a comparison. Applied optics, 21(15):2758\u20132769, 1982.\n\n[12] Bernd G\u00e4rtner and Jiri Matousek. Approximation algorithms and semidefinite programming. Springer, 2012.\n\n[13] Kishore Jaganathan, Samet Oymak, and Babak Hassibi. Recovery of sparse 1-d signals from the magnitudes of their fourier transform. In Information Theory Proceedings (ISIT), 2012 IEEE International Symposium On, pages 1473\u20131477. IEEE, 2012.\n\n[14] Iain M Johnstone and Arthur Yu Lu. Sparse principal components analysis. Unpublished manuscript, 2004.\n\n[15] Ian Jolliffe. Principal component analysis. Wiley Online Library, 2005.\n\n[16] Robert Krauthgamer, Boaz Nadler, and Dan Vilenchik. Do semidefinite relaxations really solve sparse pca? 
arXiv preprint arXiv:1306.3690, 2013.\n\n[17] Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788\u2013791, 1999.\n\n[18] Xiaodong Li and Vladislav Voroninski. Sparse signal recovery from quadratic measurements via convex programming. SIAM Journal on Mathematical Analysis, 45(5):3019\u20133033, 2013.\n\n[19] Ronny Luss and Marc Teboulle. Conditional gradient algorithms for rank-one matrix approximations with a sparsity constraint. SIAM Review, 55(1):65\u201398, 2013.\n\n[20] Zongming Ma and Yihong Wu. Computational barriers in minimax submatrix detection. arXiv preprint arXiv:1309.5914, 2013.\n\n[21] Andrea Montanari and Emile Richard. Non-negative principal component analysis: Message passing algorithms and sharp asymptotics. arXiv preprint arXiv:1406.4775, 2014.\n\n[22] Pentti Paatero. Least squares formulation of robust non-negative factor analysis. Chemometrics and intelligent laboratory systems, 37(1):23\u201335, 1997.\n\n[23] Pentti Paatero and Unto Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2):111\u2013126, 1994.\n\n[24] Amit Singer. A remark on global positioning from local distances. Proceedings of the National Academy of Sciences, 105(28):9507\u20139511, 2008.\n\n[25] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of computational and graphical statistics, 15(2):265\u2013286, 2006.", "award": [], "sourceid": 1406, "authors": [{"given_name": "Yash", "family_name": "Deshpande", "institution": "Stanford University"}, {"given_name": "Andrea", "family_name": "Montanari", "institution": "Stanford University"}, {"given_name": "Emile", "family_name": "Richard", "institution": "Stanford"}]}