{"title": "A statistical model for tensor PCA", "book": "Advances in Neural Information Processing Systems", "page_first": 2897, "page_last": 2905, "abstract": "We consider the Principal Component Analysis problem for large tensors of arbitrary order k under a single-spike (or rank-one plus noise) model. On the one hand, we use information theory, and recent results in probability theory, to establish necessary and sufficient conditions under which the principal component can be estimated using unbounded computational resources. It turns out that this is possible as soon as the signal-to-noise ratio beta becomes larger than C\\sqrt{k log k} (and in particular beta can remain bounded as the problem dimensions increase). On the other hand, we analyze several polynomial-time estimation algorithms, based on tensor unfolding, power iteration and message passing ideas from graphical models. We show that, unless the signal-to-noise ratio diverges in the system dimensions, none of these approaches succeeds. This is possibly related to a fundamental limitation of computationally tractable estimators for this problem. For moderate dimensions, we propose a hybrid approach that uses unfolding together with power iteration, and show that it significantly outperforms baseline methods. Finally, we consider the case in which additional side information is available about the unknown signal. We characterize the amount of side information that allows the iterative algorithms to converge to a good estimate.", "full_text": "A statistical model for tensor PCA\n\nAndrea Montanari\nStatistics & Electrical Engineering\nStanford University\n\nEmile Richard\nElectrical Engineering\nStanford University\n\nAbstract\n\nWe consider the Principal Component Analysis problem for large tensors of arbitrary order $k$ under a single-spike (or rank-one plus noise) model. 
On the one hand, we use information theory, and recent results in probability theory, to establish necessary and sufficient conditions under which the principal component can be estimated using unbounded computational resources. It turns out that this is possible as soon as the signal-to-noise ratio $\\beta$ becomes larger than $C\\sqrt{k \\log k}$ (and in particular $\\beta$ can remain bounded as the problem dimensions increase). On the other hand, we analyze several polynomial-time estimation algorithms, based on tensor unfolding, power iteration and message passing ideas from graphical models. We show that, unless the signal-to-noise ratio diverges in the system dimensions, none of these approaches succeeds. This is possibly related to a fundamental limitation of computationally tractable estimators for this problem. We discuss various initializations for tensor power iteration, and show that a tractable initialization based on the spectrum of the unfolded tensor significantly outperforms baseline methods, both statistically and computationally. Finally, we consider the case in which additional side information is available about the unknown signal. We characterize the amount of side information that allows the iterative algorithms to converge to a good estimate.\n\n1 Introduction\n\nGiven a data matrix $X$, Principal Component Analysis (PCA) can be regarded as a 'denoising' technique that replaces $X$ by its closest rank-one approximation. This optimization problem can be solved efficiently, and its statistical properties are well understood. The generalization of PCA to tensors is motivated by problems in which it is important to exploit higher-order moments, or in which data elements are naturally given more than two indices. Examples include topic modeling, video processing, collaborative filtering in the presence of temporal/context information, community detection [1], and spectral hypergraph theory. 
Further, finding a rank-one approximation to a tensor is a bottleneck for tensor-valued optimization algorithms that use conditional-gradient-type schemes. While tensor factorization is NP-hard [11], this does not necessarily imply intractability for natural statistical models. Over the last ten years, it was repeatedly observed that either convex optimization or greedy methods yield optimal solutions to statistical problems that are intractable from a worst-case perspective (well-known examples include sparse regression and low-rank matrix completion).\n\nIn order to investigate the fundamental tradeoffs between computational resources and statistical power in tensor PCA, we consider the simplest possible model where this arises, whereby an unknown unit vector $v_0$ is to be inferred from noisy multilinear measurements. Namely, for each unordered $k$-tuple $\\{i_1, i_2, \\dots, i_k\\} \\subseteq [n]$, we measure\n\n$$X_{i_1, i_2, \\dots, i_k} = \\beta\\, (v_0)_{i_1} (v_0)_{i_2} \\cdots (v_0)_{i_k} + Z_{i_1, i_2, \\dots, i_k} , \\tag{1}$$\n\nwith $Z$ Gaussian noise (see below for a precise definition), and wish to reconstruct $v_0$. In tensor notation, the observation model reads (see the end of this section for notations)\n\n$$X = \\beta\\, v_0^{\\otimes k} + Z . \\tag{Spiked Tensor Model}$$\n\nThis is analogous to the so-called 'spiked covariance model' used to study matrix PCA in high dimensions [12].\n\nIt is immediate to see that the maximum-likelihood estimator $\\widehat{v}^{\\rm ML}$ is given by a solution of the following problem:\n\n$$\\text{maximize } \\langle X, v^{\\otimes k} \\rangle \\quad \\text{subject to } \\|v\\|_2 = 1 . \\tag{Tensor PCA}$$\n\nSolving it exactly is, in general, NP-hard [11].\n\nWe next summarize our results. Note that, given a completely observed rank-one symmetric tensor $v_0^{\\otimes k}$ (i.e. for $\\beta = \\infty$), it is easy to recover the vector $v_0 \\in \\mathbb{R}^n$. It is therefore natural to ask: for which signal-to-noise ratios can one still reliably estimate $v_0$? 
The answer appears to depend dramatically on the computational resources.$^1$\n\nIdeal estimation. Assuming unbounded computational resources, we can solve the Tensor PCA optimization problem and hence implement the maximum likelihood estimator $\\widehat{v}^{\\rm ML}$. We use recent results in probability theory to show that this approach is successful for $\\beta \\ge \\mu_k = \\sqrt{k \\log k}\\,(1 + o_k(1))$. In particular, above this threshold$^2$ we have, with high probability,\n\n$$\\|\\widehat{v}^{\\rm ML} - v_0\\|_2^2 \\lesssim \\frac{2.01\\, \\mu_k}{\\beta} . \\tag{2}$$\n\nWe use an information-theoretic argument to show that no approach can do significantly better: no procedure can estimate $v_0$ accurately for $\\beta \\le c\\sqrt{k}$ (for $c$ a universal constant).\n\nTractable estimators: Unfolding. We consider two approaches to estimating $v_0$ that can be implemented in polynomial time. The first approach is based on tensor unfolding: starting from the tensor $X \\in \\bigotimes^k \\mathbb{R}^n$, we produce a matrix $\\mathrm{Mat}(X)$ of dimensions $n^q \\times n^{k-q}$. We then perform matrix PCA on $\\mathrm{Mat}(X)$. We show that this method is successful for $\\beta \\gtrsim n^{(\\lceil k/2 \\rceil - 1)/2}$. A heuristic argument suggests that the necessary and sufficient condition for tensor unfolding to succeed is in fact $\\beta \\gtrsim n^{(k-2)/4}$ (which is below the rigorous bound by a factor $n^{1/4}$ for $k$ odd). We can confirm this conjecture for $k$ even and under an asymmetric noise model.\n\nTractable estimators: Warm-start power iteration and Approximate Message Passing. We prove that power iteration, initialized uniformly at random, converges very rapidly to an accurate estimate provided $\\beta \\gtrsim n^{(k-1)/2}$. A heuristic argument suggests that the correct necessary and sufficient threshold is given by $\\beta \\gtrsim n^{(k-2)/2}$. 
Motivated by the last observation, we consider a 'warm-start' power iteration algorithm, in which we initialize power iteration with the output of tensor unfolding. This approach appears to have the same threshold signal-to-noise ratio as simple unfolding, but significantly better accuracy above that threshold. We extend power iteration to an approximate message passing (AMP) algorithm [7, 4]. We show that the behavior of AMP is qualitatively similar to that of naive power iteration. In particular, AMP fails for any $\\beta$ bounded as $n \\to \\infty$.\n\nSide information. Given the above computational complexity barrier, it is natural to study weaker versions of the original problem. Here we assume that extra information about $v_0$ is available. This can be provided by additional measurements or by approximately solving a related problem, for instance a matrix PCA problem as in [1]. We model this additional information as $y = \\gamma v_0 + g$ (with $g$ an independent Gaussian noise vector), and incorporate it in the initial condition of the AMP algorithm. We characterize exactly the threshold value $\\gamma_* = \\gamma_*(\\beta)$ above which AMP converges to an accurate estimator. The thresholds for various classes of algorithms are summarized below.\n\nMethod | Required $\\beta$ (rigorous) | Required $\\beta$ (heuristic)\n--- | --- | ---\nTensor Unfolding | $O(n^{(\\lceil k/2 \\rceil - 1)/2})$ | $n^{(k-2)/4}$\nTensor Power Iteration (with random init.) | $O(n^{(k-1)/2})$ | $n^{(k-2)/2}$\nMaximum Likelihood | $1$ | \u2013\nInformation-theory lower bound | $1$ | \u2013\n\n$^1$ Here we write $F(n) \\lesssim G(n)$ if there exists a constant $c$ independent of $n$ (but possibly dependent on $k$), such that $F(n) \\le c\\, G(n)$.\n\n$^2$ Note that, for $k$ even, $v_0$ can only be recovered modulo sign. For the sake of simplicity, we assume here that this ambiguity is correctly resolved.\n\nWe will conclude the paper with some insights that we believe provide useful guidance for tensor factorization heuristics. We illustrate these insights through simulations.\n\n1.1 Notations\n\nGiven $X \\in \\bigotimes^k \\mathbb{R}^n$ a real $k$-th order tensor, we let $\\{X_{i_1, \\dots, i_k}\\}_{i_1, \\dots, i_k}$ denote its coordinates and define a map $X : \\mathbb{R}^n \\to \\mathbb{R}^n$, by letting, for $v \\in \\mathbb{R}^n$,\n\n$$X\\{v\\}_i = \\sum_{j_1, \\dots, j_{k-1} \\in [n]} X_{i, j_1, \\dots, j_{k-1}}\\, v_{j_1} \\cdots v_{j_{k-1}} . \\tag{3}$$\n\nThe outer product of two tensors is $X \\otimes Y$, and, for $v \\in \\mathbb{R}^n$, we define $v^{\\otimes k} = v \\otimes \\cdots \\otimes v \\in \\bigotimes^k \\mathbb{R}^n$ as the $k$-th outer power of $v$. We define the inner product of two tensors $X, Y \\in \\bigotimes^k \\mathbb{R}^n$ as\n\n$$\\langle X, Y \\rangle = \\sum_{i_1, \\dots, i_k \\in [n]} X_{i_1, \\dots, i_k}\\, Y_{i_1, \\dots, i_k} . \\tag{4}$$\n\nWe define the Frobenius (Euclidean) norm of a tensor $X$ by $\\|X\\|_F = \\sqrt{\\langle X, X \\rangle}$, and its operator norm by\n\n$$\\|X\\|_{\\rm op} \\equiv \\max\\{ \\langle X, u_1 \\otimes \\cdots \\otimes u_k \\rangle : \\forall i \\in [k] , \\|u_i\\|_2 \\le 1 \\} . \\tag{5}$$\n\nFor the special case $k = 2$, it reduces to the ordinary $\\ell_2$ matrix operator norm. For $\\pi \\in S_k$, we will denote by $X^\\pi$ the tensor with permuted indices $X^\\pi_{i_1, \\dots, i_k} = X_{\\pi(i_1), \\dots, \\pi(i_k)}$. We call the tensor $X$ symmetric if, for any permutation $\\pi \\in S_k$, $X^\\pi = X$. 
It is proved in [23] that, for symmetric tensors, the value of problem Tensor PCA coincides with $\\|X\\|_{\\rm op}$ up to a sign. More precisely, for symmetric tensors we have the equivalent representation $\\|X\\|_{\\rm op} = \\max\\{ |\\langle X, u^{\\otimes k} \\rangle| : \\|u\\|_2 \\le 1 \\}$. We denote by $G \\in \\bigotimes^k \\mathbb{R}^n$ a tensor with independent and identically distributed entries $G_{i_1, \\dots, i_k} \\sim \\mathsf{N}(0, 1)$ (note that this tensor is not symmetric). We define the symmetric standard normal noise tensor $Z \\in \\bigotimes^k \\mathbb{R}^n$ by\n\n$$Z = \\frac{1}{k!} \\sqrt{\\frac{k}{n}} \\sum_{\\pi \\in S_k} G^\\pi . \\tag{6}$$\n\nWe use the loss function\n\n$$\\mathrm{Loss}(\\widehat{v}, v_0) \\equiv \\min\\big( \\|\\widehat{v} - v_0\\|_2^2 , \\|\\widehat{v} + v_0\\|_2^2 \\big) = 2 - 2 |\\langle \\widehat{v}, v_0 \\rangle| . \\tag{7}$$\n\n2 Ideal estimation\n\nIn this section we consider the problem of estimating $v_0$ under the observation model Spiked Tensor Model, when no constraint is imposed on the complexity of the estimator. Our first result is a lower bound on the loss of any estimator.\n\nTheorem 1. For any estimator $\\widehat{v} = \\widehat{v}(X)$ of $v_0$ from data $X$, such that $\\|\\widehat{v}(X)\\|_2 = 1$ (i.e. $\\widehat{v} : \\bigotimes^k \\mathbb{R}^n \\to S^{n-1}$), we have, for all $n \\ge 4$,\n\n$$\\beta \\le \\sqrt{\\frac{k}{10}} \\;\\Rightarrow\\; \\mathbb{E}\\, \\mathrm{Loss}(\\widehat{v}, v_0) \\ge \\frac{1}{32} . \\tag{8}$$\n\nIn order to establish a matching upper bound on the loss, we consider the maximum likelihood estimator $\\widehat{v}^{\\rm ML}$, obtained by solving the Tensor PCA problem. As in the case of matrix denoising, we expect the properties of this estimator to depend on the signal-to-noise ratio $\\beta$, and on the 'norm' of the noise $\\|Z\\|_{\\rm op}$ (i.e. on the value of the optimization problem Tensor PCA in the case $\\beta = 0$). For the matrix case $k = 2$, this coincides with the largest eigenvalue of $Z$. 
Classical random matrix theory shows that, in this case, $\\|Z\\|_{\\rm op}$ concentrates tightly around $2$ [10, 6, 3].\n\nIt turns out that tight results for $k \\ge 3$ follow immediately from a technically sophisticated analysis of the stationary points of random Morse functions by Auffinger, Ben Arous and \u010cern\u00fd [2].\n\nLemma 2.1. There exists a sequence of real numbers $\\{\\mu_k\\}_{k \\ge 2}$, such that\n\n$$\\limsup_{n \\to \\infty} \\|Z\\|_{\\rm op} \\le \\mu_k \\quad (k \\text{ odd}), \\tag{9}$$\n\n$$\\lim_{n \\to \\infty} \\|Z\\|_{\\rm op} = \\mu_k \\quad (k \\text{ even}). \\tag{10}$$\n\nFurther, $\\|Z\\|_{\\rm op}$ concentrates tightly around its expectation. Namely, for any $n$, $k$,\n\n$$\\mathbb{P}\\big( \\big| \\|Z\\|_{\\rm op} - \\mathbb{E} \\|Z\\|_{\\rm op} \\big| \\ge s \\big) \\le 2\\, e^{-n s^2/(2k)} . \\tag{11}$$\n\nFinally, $\\mu_k = \\sqrt{k \\log k}\\,(1 + o_k(1))$ for large $k$.\n\nFor instance, a large order-3 Gaussian tensor should have $\\|Z\\|_{\\rm op} \\approx 2.87$, while a large order-10 tensor has $\\|Z\\|_{\\rm op} \\approx 6.75$. As a simple consequence of Lemma 2.1, we establish an upper bound on the error incurred by the maximum likelihood estimator.\n\nTheorem 2. Let $\\mu_k$ be the sequence of real numbers introduced above. Letting $\\widehat{v}^{\\rm ML}$ denote the maximum likelihood estimator (i.e. the solution of Tensor PCA), we have, for $n$ large enough and all $s > 0$,\n\n$$\\beta \\ge \\mu_k \\;\\Rightarrow\\; \\mathrm{Loss}(\\widehat{v}^{\\rm ML}, v_0) \\le \\frac{2}{\\beta} (\\mu_k + s) , \\tag{12}$$\n\nwith probability at least $1 - 2 e^{-n s^2/(16k)}$.\n\nThe following upper bound on the value of the Tensor PCA problem is proved using the Sudakov-Fernique inequality. While it is looser than Lemma 2.1 (corresponding to the case $\\beta = 0$), we expect it to become sharp for $\\beta \\ge \\beta_k$, a suitably large constant.\n\nLemma 2.2. 
Under the Spiked Tensor Model, we have\n\n$$\\limsup_{n \\to \\infty} \\mathbb{E} \\|Z\\|_{\\rm op} \\le \\max_{\\tau \\ge 0} \\Big\\{ \\beta \\Big( \\frac{\\tau}{\\sqrt{1 + \\tau^2}} \\Big)^k + \\frac{k}{\\sqrt{1 + \\tau^2}} \\Big\\} . \\tag{13}$$\n\nFurther, for any $s \\ge 0$,\n\n$$\\mathbb{P}\\big( \\big| \\|Z\\|_{\\rm op} - \\mathbb{E} \\|Z\\|_{\\rm op} \\big| \\ge s \\big) \\le 2\\, e^{-n s^2/(2k)} . \\tag{14}$$\n\n3 Tensor Unfolding\n\nA simple and popular heuristic to obtain tractable estimators of $v_0$ consists in constructing a suitable matrix with the entries of $X$, and performing PCA on this matrix.\n\n3.1 Symmetric noise\n\nFor an integer $k \\ge q \\ge k/2$, we introduce the unfolding (also referred to as matricization or reshape) operator $\\mathrm{Mat}_q : \\bigotimes^k \\mathbb{R}^n \\to \\mathbb{R}^{n^q \\times n^{k-q}}$ as follows. For any indices $i_1, i_2, \\dots, i_k \\in [n]$, we let $a = 1 + \\sum_{j=1}^{q} (i_j - 1) n^{j-1}$ and $b = 1 + \\sum_{j=q+1}^{k} (i_j - 1) n^{j-q-1}$, and define\n\n$$[\\mathrm{Mat}_q(X)]_{a,b} = X_{i_1, i_2, \\dots, i_k} . \\tag{15}$$\n\nStandard convex relaxations of low-rank tensor estimation problems compute factorizations of $\\mathrm{Mat}_q(X)$ [22, 15, 17, 19]. Not all unfoldings (choices of $q$) are equivalent. It is natural to expect that this approach will be successful only if the signal-to-noise ratio exceeds the operator norm of the unfolded noise $\\|\\mathrm{Mat}_q(Z)\\|_{\\rm op}$. The next lemma suggests that the latter is minimal when $\\mathrm{Mat}_q(Z)$ is 'as square as possible'. A similar phenomenon was observed in a different context in [17].\n\nLemma 3.1. 
For any integer $k/2 \\le q \\le k$ we have, for some universal constant $C_k$,\n\n$$\\frac{1}{\\sqrt{(k-1)!}}\\, n^{(q-1)/2} \\Big( 1 - \\frac{C_k}{n^{\\max(q, k-q)}} \\Big) \\le \\mathbb{E} \\|\\mathrm{Mat}_q(Z)\\|_{\\rm op} \\le \\sqrt{k}\\, \\big( n^{(q-1)/2} + n^{(k-q-1)/2} \\big) . \\tag{16}$$\n\nFor all $n$ large enough, both bounds are minimized for $q = \\lceil k/2 \\rceil$. Further,\n\n$$\\mathbb{P}\\big\\{ \\big| \\|\\mathrm{Mat}_q(Z)\\|_{\\rm op} - \\mathbb{E} \\|\\mathrm{Mat}_q(Z)\\|_{\\rm op} \\big| \\ge t \\big\\} \\le 2\\, e^{-n t^2/(2k)} . \\tag{17}$$\n\nThe last lemma suggests the choice $q = \\lceil k/2 \\rceil$, which we shall adopt in the following, unless stated otherwise. We will drop the subscript from $\\mathrm{Mat}$.\n\nLet us recall the following standard result derived directly from Wedin's perturbation theorem [24], and stated in the context of the spiked model.\n\nTheorem 3 (Wedin perturbation). Let $M = \\beta u_0 w_0^{\\sf T} + \\Xi \\in \\mathbb{R}^{m \\times p}$ be a matrix with $\\|u_0\\|_2 = \\|w_0\\|_2 = 1$. Let $\\widehat{w}$ denote the top right singular vector of $M$. If $\\beta > 2 \\|\\Xi\\|_{\\rm op}$, then\n\n$$\\mathrm{Loss}(\\widehat{w}, w_0) \\le \\frac{8 \\|\\Xi\\|_{\\rm op}^2}{\\beta^2} . \\tag{18}$$\n\nTheorem 4. Letting $w = w(X)$ denote the top right singular vector of $\\mathrm{Mat}(X)$, we have the following, for some universal constant $C = C_k > 0$, and $b \\equiv (1/2)(\\lceil k/2 \\rceil - 1)$. If $\\beta \\ge 5\\, k^{1/2} n^b$ then, with probability at least $1 - n^{-2}$, we have\n\n$$\\mathrm{Loss}\\big( w, \\mathrm{vec}\\big( v_0^{\\otimes \\lfloor k/2 \\rfloor} \\big) \\big) \\le \\frac{C\\, k\\, n^{2b}}{\\beta^2} . \\tag{19}$$\n\n3.2 Asymmetric noise and recursive unfolding\n\nA technical complication in analyzing the random matrix $\\mathrm{Mat}_q(X)$ lies in the fact that its entries are not independent, because the noise tensor $Z$ is assumed to be symmetric. In the next theorem we consider the case of non-symmetric noise and even $k$. 
This allows us to leverage known results in random matrix theory [18, 8, 5] to obtain: (i) asymptotically sharp estimates on the critical signal-to-noise ratio; (ii) a lower bound on the loss below the critical signal-to-noise ratio. Namely, we consider observations\n\n$$\\widetilde{X} = \\beta\\, v_0^{\\otimes k} + \\frac{1}{\\sqrt{n}}\\, G , \\tag{20}$$\n\nwhere $G \\in \\bigotimes^k \\mathbb{R}^n$ is a standard Gaussian tensor (i.e. a tensor with i.i.d. standard normal entries).\n\nLet $w = w(\\widetilde{X}) \\in \\mathbb{R}^{n^{k/2}}$ denote the top right singular vector of $\\mathrm{Mat}(\\widetilde{X})$. For $k \\ge 4$ even, define $b \\equiv (k-2)/4$, as above. By [18, Theorem 4], or [5, Theorem 2.3], we have the following almost sure limits:\n\n$$\\beta \\le (1 - \\varepsilon)\\, n^b \\;\\Rightarrow\\; \\lim_{n \\to \\infty} \\langle w(\\widetilde{X}), \\mathrm{vec}(v_0^{\\otimes (k/2)}) \\rangle = 0 , \\tag{21}$$\n\n$$\\beta \\ge (1 + \\varepsilon)\\, n^b \\;\\Rightarrow\\; \\liminf_{n \\to \\infty} \\big| \\langle w(\\widetilde{X}), \\mathrm{vec}(v_0^{\\otimes (k/2)}) \\rangle \\big| \\ge \\sqrt{\\frac{\\varepsilon}{1 + \\varepsilon}} . \\tag{22}$$\n\nIn other words, $w(\\widetilde{X})$ is a good estimate of $v_0^{\\otimes (k/2)}$ if and only if $\\beta$ is larger than $n^b$.\n\nWe can use $w(\\widetilde{X}) \\in \\mathbb{R}^{n^{2b+1}}$ to estimate $v_0$ as follows. Construct the unfolding $\\mathrm{Mat}_1(w) \\in \\mathbb{R}^{n \\times n^{2b}}$ (with a slight abuse of notation) of $w$ by letting, for $i \\in [n]$ and $j \\in [n^{2b}]$,\n\n$$\\mathrm{Mat}_1(w)_{i,j} = w_{i + (j-1) n} . \\tag{23}$$\n\nWe then let $\\widehat{v}$ be the left principal vector of $\\mathrm{Mat}_1(w)$. We refer to this algorithm as recursive unfolding.\n\nTheorem 5. Let $\\widetilde{X}$ be distributed according to the non-symmetric model (20) with $k \\ge 4$ even, define $b \\equiv (k-2)/4$, 
and let $\\widehat{v}$ be the estimate obtained by two-step recursive matricization. If $\\beta \\ge (1 + \\varepsilon)\\, n^b$ then, almost surely,\n\n$$\\lim_{n \\to \\infty} \\mathrm{Loss}(\\widehat{v}, v_0) = 0 . \\tag{24}$$\n\nWe conjecture that the weaker condition $\\beta \\gtrsim n^{(k-2)/4}$ is indeed sufficient also for our original symmetric noise model, both for $k$ even and for $k$ odd.\n\n4 Power Iteration\n\nIterating over (multi-)linear maps induced by a (tensor) matrix is a standard method for finding leading eigenpairs; see [14] and references therein for tensor-related results. In this section we will consider a simple power iteration, and then its possible uses in conjunction with tensor unfolding. Finally, we will compare our analysis with results available in the literature.\n\n4.1 Naive power iteration\n\nThe simplest iterative approach is defined by the following recursion:\n\n$$v^0 = \\frac{y}{\\|y\\|_2} , \\quad \\text{and} \\quad v^{t+1} = \\frac{X\\{v^t\\}}{\\|X\\{v^t\\}\\|_2} . \\tag{Power Iteration}$$\n\nThe following result establishes convergence criteria for this iteration, first for generic noise $Z$ and then for standard normal noise (using Lemma 2.1).\n\nTheorem 6. Assume\n\n$$\\beta \\ge 2\\, e\\, (k-1)\\, \\|Z\\|_{\\rm op} , \\tag{25}$$\n\n$$\\frac{\\langle y, v_0 \\rangle}{\\|y\\|_2} \\ge \\Big[ \\frac{(k-1) \\|Z\\|_{\\rm op}}{\\beta} \\Big]^{1/(k-1)} . \\tag{26}$$\n\nThen for all $t \\ge t_0(k)$, the power iteration estimator satisfies $\\mathrm{Loss}(v^t, v_0) \\le 2 e \\|Z\\|_{\\rm op}/\\beta$. If $Z$ is a standard normal noise tensor, then conditions (25), (26) are satisfied with high probability provided\n\n$$\\beta \\ge 2 e k\\, \\mu_k = 6 \\sqrt{k^3 \\log k}\\, \\big( 1 + o_k(1) \\big) , \\tag{27}$$\n\n$$\\frac{\\langle y, v_0 \\rangle}{\\|y\\|_2} \\ge \\Big[ \\frac{k \\mu_k}{\\beta} \\Big]^{1/(k-1)} = \\beta^{-1/(k-1)} \\big( 1 + o_k(1) \\big) . \\tag{28}$$\n\nIn Section 6 we discuss two aspects of this result: (i) the requirement of a positive correlation between initialization and ground truth; (ii) possible scenarios under which the assumptions of Theorem 6 are satisfied.\n\n5 Asymptotics via Approximate Message Passing\n\nApproximate message passing (AMP) algorithms [7, 4] proved successful in several high-dimensional estimation problems including compressed sensing, low-rank matrix reconstruction, and phase retrieval [9, 13, 20, 21]. An appealing feature of this class of algorithms is that their high-dimensional limit can be characterized exactly through a technique known as 'state evolution'. Here we develop an AMP algorithm for tensor data, and its state evolution analysis, focusing on the fixed $\\beta$, $n \\to \\infty$ limit. Proofs follow the approach of [4] and will be presented in a journal publication.\n\nIn a nutshell, our AMP for Tensor PCA can be viewed as a sophisticated version of the power iteration method of the last section. 
With the notation $f(x) = x/\\|x\\|_2$, we define the AMP iteration over vectors $v^t \\in \\mathbb{R}^n$ by $v^0 = y$, $f(v^{-1}) = 0$, and\n\n$$\\begin{cases} v^{t+1} = X\\{f(v^t)\\} - b_t\\, f(v^{t-1}) , \\\\ b_t = (k-1) \\big( \\langle f(v^t), f(v^{t-1}) \\rangle \\big)^{k-2} . \\end{cases} \\tag{AMP}$$\n\nOur main conclusion is that the behavior of AMP is qualitatively similar to that of power iteration. However, we can establish stronger results in two respects:\n\n1. We can prove that, unless side information is provided about the signal $v_0$, the AMP estimates remain essentially orthogonal to $v_0$, for any fixed number of iterations. This corresponds to a converse to Theorem 6.\n\n2. Since state evolution is asymptotically exact, we can prove sharp phase transition results with explicit characterization of their locations.\n\nWe assume that the additional information takes the form of a noisy observation $y = \\gamma\\, v_0 + z$, where $z \\sim \\mathsf{N}(0, I_n/n)$. Our next results summarize the state evolution analysis.\n\nProposition 5.1. Let $k \\ge 2$ be a fixed integer. Let $\\{v_0(n)\\}_{n \\ge 1}$ be a sequence of unit norm vectors $v_0(n) \\in S^{n-1}$. Let also $\\{X(n)\\}_{n \\ge 1}$ denote a sequence of tensors $X(n) \\in \\bigotimes^k \\mathbb{R}^n$ generated following Spiked Tensor Model. Finally, let $v^t$ denote the $t$-th iterate produced by AMP, and consider its orthogonal decomposition\n\n$$v^t = v^t_{\\parallel} + v^t_{\\perp} , \\tag{29}$$\n\nwhere $v^t_{\\parallel}$ is proportional to $v_0$, and $v^t_{\\perp}$ is perpendicular. Then $v^t_{\\perp}$ is uniformly random, conditional on its norm. Further, almost surely\n\n$$\\lim_{n \\to \\infty} \\langle v^t, v_0 \\rangle = \\lim_{n \\to \\infty} \\langle v^t_{\\parallel}, v_0 \\rangle = \\tau_t , \\tag{30}$$\n\n$$\\lim_{n \\to \\infty} \\|v^t_{\\perp}\\|_2 = 1 , \\tag{31}$$\n\nwhere $\\tau_t$ is given recursively by letting $\\tau_0 = \\gamma$ and, for $t \\ge 0$ (we refer to this as state evolution):\n\n$$\\tau_{t+1}^2 = \\beta^2 \\Big( \\frac{\\tau_t^2}{1 + \\tau_t^2} \\Big)^{k-1} . \\tag{32}$$\n\nThe following result characterizes the minimum required additional information $\\gamma$ to allow AMP to escape from those undesired local optima. We will say that $\\{v^t\\}_t$ converges almost surely to a desired local optimum if\n\n$$\\lim_{t \\to \\infty} \\lim_{n \\to \\infty} \\mathrm{Loss}(v^t/\\|v^t\\|_2, v_0) \\le \\frac{4}{\\beta^2} .$$\n\nTheorem 7. Consider the Tensor PCA problem with $k \\ge 3$ and\n\n$$\\beta > \\omega_k \\equiv \\sqrt{(k-1)^{k-1}/(k-2)^{k-2}} \\sim \\sqrt{e k} .$$\n\nThen AMP converges almost surely to a desired local optimum if and only if $\\gamma > \\sqrt{1/\\epsilon_k(\\beta) - 1}$, where $\\epsilon_k(\\beta)$ is the largest solution of $(1 - \\epsilon)^{k-2}\\, \\epsilon = \\beta^{-2}$.\n\nIn the special case $k = 3$, and $\\beta > 2$, assuming $\\gamma > \\beta \\big( 1/2 - \\sqrt{1/4 - 1/\\beta^2} \\big)$, AMP tends to a desired local optimum. Numerically, $\\beta > 2.69$ is enough for AMP to achieve $\\langle v_0, \\widehat{v} \\rangle \\ge 0.9$ if $\\gamma > 0.45$.\n\nAs a final remark, we note that the methods of [16] can be used to show that, under the assumptions of Theorem 7, for $\\beta > \\beta_k$ a sufficiently large constant, AMP asymptotically solves the optimization problem Tensor PCA. 
Formally, we have, almost surely,\n\n$$\\lim_{t \\to \\infty} \\lim_{n \\to \\infty} \\Big| \\langle X, (v^t)^{\\otimes k} \\rangle - \\|X\\|_{\\rm op} \\Big| = 0 . \\tag{33}$$\n\n6 Numerical experiments\n\n6.1 Comparison of different algorithms\n\nOur empirical results are reported in the appendix. The main findings are consistent with the theory developed above:\n\n\u2022 Tensor power iteration (with random initialization) performs poorly with respect to other approaches that use some form of tensor unfolding. The gap widens as the dimension $n$ increases.\n\n\u2022 All algorithms based on initial unfolding (comprising PSD-constrained PCA and recursive unfolding) have essentially the same threshold. Above that threshold, those that process the singular vector (either by recursive unfolding or by tensor power iteration) perform better than simpler one-step algorithms.\n\nFor $k = 3$, our heuristic arguments suggest that tensor power iteration with random initialization will work for $\\beta \\gtrsim n^{1/2}$, while unfolding only requires $\\beta \\gtrsim n^{1/4}$ (our theorems guarantee this for, respectively, $\\beta \\gtrsim n$ and $\\beta \\gtrsim n^{1/2}$). We plot the average correlation $|\\langle \\widehat{v}, v_0 \\rangle|$ versus (respectively) $\\beta/n^{1/2}$ and $\\beta/n^{1/4}$. The curve superposition confirms that our prediction captures the correct behavior already for $n$ of the order of 50.\n\nFigure 1: Simultaneous PCA at $\\beta = 3$. Absolute correlation of the estimated principal component with the truth, $|\\langle \\widehat{v}, v_0 \\rangle|$; simultaneous PCA (black) compared with matrix (green) and tensor PCA (blue). Panels: $n = 50$, $n = 200$, $n = 500$, and $n \\to \\infty$ (theory); horizontal axis $\\lambda$.\n\n6.2 The value of side information\n\nOur next experiment concerns a simultaneous matrix and tensor PCA task: we are given a tensor $X \\in \\bigotimes^3 \\mathbb{R}^n$ from the Spiked Tensor Model with $k = 3$, and the signal-to-noise ratio $\\beta = 3$ is fixed. 
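\n\nAs a worked instance (a sketch that only substitutes the values $k = 3$ and $\\beta = 3$ of this experiment into the conditions of Theorem 7, with no additional assumptions), note first that $\\omega_3 = \\sqrt{2^2/1^1} = 2$, so that $\\beta = 3 > \\omega_3$. The special-case condition on the side information then evaluates to\n\n$$\\gamma > \\beta \\Big( \\tfrac{1}{2} - \\sqrt{\\tfrac{1}{4} - \\tfrac{1}{\\beta^2}} \\Big) = 3 \\Big( \\tfrac{1}{2} - \\sqrt{\\tfrac{5}{36}} \\Big) = \\tfrac{3 - \\sqrt{5}}{2} \\approx 0.382 .$$\n\n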
In addition, we observe $M = \\lambda\\, v_0 v_0^{\\sf T} + N$, where $N \\in \\mathbb{R}^{n \\times n}$ is a symmetric noise matrix with i.i.d. upper-diagonal elements $N_{i,j} \\sim \\mathsf{N}(0, 1/n)$, $i < j$, and the value of $\\lambda \\in [0, 2]$ varies. This experiment mimics a rank-1 version of the topic modeling method presented in [1], where $M$ is a matrix representing pairwise co-occurrences and $X$ triples.\n\nThe analysis in previous sections suggests using the leading eigenvector of $M$ as the initial point of the AMP algorithm for tensor PCA on $X$. We performed the experiments on 100 randomly generated instances with $n = 50, 200, 500$ and report in Figure 1 the mean values of $|\\langle v_0, \\widehat{v}(X) \\rangle|$ with confidence intervals.\n\nRandom matrix theory predicts $\\lim_{n \\to \\infty} \\langle \\widehat{v}_1(M), v_0 \\rangle = \\sqrt{1 - \\lambda^{-2}}$ [8]. Thus we can set $\\gamma = \\sqrt{1 - \\lambda^{-2}}$ and apply the theory of the previous section. In particular, Proposition 5.1 implies\n\n$$\\lim_{n \\to \\infty} \\langle \\widehat{v}(X), v_0 \\rangle = \\beta \\Big( \\tfrac{1}{2} + \\sqrt{\\tfrac{1}{4} - \\tfrac{1}{\\beta^2}} \\Big) \\quad \\text{if } \\gamma > \\beta \\Big( \\tfrac{1}{2} - \\sqrt{\\tfrac{1}{4} - \\tfrac{1}{\\beta^2}} \\Big) ,$$\n\nand $\\lim_{n \\to \\infty} \\langle \\widehat{v}(X), v_0 \\rangle = 0$ otherwise. Simultaneous PCA appears vastly superior to simple PCA. Our theory captures this difference quantitatively already for $n = 500$.\n\nAcknowledgements\n\nThis work was partially supported by the NSF grant CCF-1319979 and the grants AFOSR/DARPA FA9550-12-1-0411 and FA9550-13-1-0036.\n\nReferences\n\n[1] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. 
arXiv:1210.7559, 2012.\n[2] A. Auffinger, G. Ben Arous, and J. \u010cern\u00fd. Random matrices and complexity of spin glasses. Communications on Pure and Applied Mathematics, 66(2):165\u2013201, 2013.\n[3] Z. Bai and J. Silverstein. Spectral Analysis of Large Dimensional Random Matrices (2nd edition). Springer, 2010.\n[4] M. Bayati and A. Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Trans. on Inform. Theory, 57:764\u2013785, 2011.\n[5] F. Benaych-Georges and R. R. Nadakuditi. The singular values and vectors of low rank perturbations of large rectangular random matrices. Journal of Multivariate Analysis, 111:120\u2013135, 2012.\n[6] K. R. Davidson and S. J. Szarek. Local operator theory, random matrices and Banach spaces. In Handbook on the Geometry of Banach spaces, volume 1, pages 317\u2013366. Elsevier Science, 2001.\n[7] D. L. Donoho, A. Maleki, and A. Montanari. Message passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, 106:18914\u201318919, 2009.\n[8] D. F\u00e9ral and S. P\u00e9ch\u00e9. The largest eigenvalues of sample covariance matrices for a spiked population: diagonal case. Journal of Mathematical Physics, 50:073302, 2009.\n[9] A. K. Fletcher, S. Rangan, L. R. Varshney, and A. Bhargava. Neural reconstruction with approximate message passing (NeuRAMP). In Neural Information Processing Systems (NIPS), pages 2555\u20132563, 2011.\n[10] S. Geman. A limit theorem for the norm of random matrices. Annals of Probability, 8:252\u2013261, 1980.\n[11] C. 
Hillar and L. H. Lim. Most tensor problems are NP-hard. Journal of the ACM, 6, 2009.\n[12] I. M. Johnstone and A. Y. Lu. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104(486), 2009.\n[13] U. Kamilov, S. Rangan, A. K. Fletcher, and M. Unser. Approximate message passing with consistent parameter estimation and applications to sparse learning. In Neural Information Processing Systems (NIPS), pages 2447\u20132455, 2012.\n[14] T. Kolda and J. Mayo. Shifted power method for computing tensor eigenpairs. SIAM Journal on Matrix Analysis and Applications, 32(4):1095\u20131124, 2011.\n[15] J. Liu, P. Musialski, P. Wonka, and J. Ye. Tensor completion for estimating missing values in visual data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):208\u2013220, 2013.\n[16] A. Montanari and E. Richard. Non-negative principal component analysis: Message passing algorithms and sharp asymptotics. arXiv:1406.4775, 2014.\n[17] C. Mu, J. Huang, B. Wright, and D. Goldfarb. Square deal: Lower bounds and improved relaxations for tensor recovery. In International Conference in Machine Learning (ICML), 2013.\n[18] D. Paul. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, 17(4):1617, 2007.\n[19] B. Romera-Paredes and M. Pontil. A new convex relaxation for tensor completion. In Neural Information Processing Systems (NIPS), 2013.\n[20] P. Schniter and V. Cevher. Approximate message passing for bilinear models. In Proc. Workshop Signal Process. Adaptive Sparse Struct. Repr. (SPARS), page 68, 2011.\n[21] P. Schniter and S. Rangan. Compressive phase retrieval via generalized approximate message passing. In Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on, pages 815\u2013822. IEEE, 2012.\n[22] R. Tomioka, T. Suzuki, K. Hayashi, and H. Kashima. 
Statistical performance of convex tensor decomposition. In Neural Information Processing Systems (NIPS), 2011.\n[23] W. C. Waterhouse. The absolute-value estimate for symmetric multilinear forms. Linear Algebra and its Applications, 128:97\u2013105, 1990.\n[24] P. A. Wedin. Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics, 12(1):99\u2013111, 1972.", "award": [], "sourceid": 1506, "authors": [{"given_name": "Emile", "family_name": "Richard", "institution": "Stanford"}, {"given_name": "Andrea", "family_name": "Montanari", "institution": "Stanford University"}]}