{"title": "Interpolating Convex and Non-Convex Tensor Decompositions via the Subspace Norm", "book": "Advances in Neural Information Processing Systems", "page_first": 3106, "page_last": 3113, "abstract": "We consider the problem of recovering a low-rank tensor from its noisy observation. Previous work has shown a recovery guarantee with signal-to-noise ratio $O(n^{\\lceil K/2 \\rceil/2})$ for recovering a $K$th order rank one tensor of size $n\\times \\cdots \\times n$ by recursive unfolding. In this paper, we first improve this bound to $O(n^{K/4})$ by a much simpler approach, but with a more careful analysis. Then we propose a new norm called the \\textit{subspace} norm, which is based on the Kronecker products of factors obtained by the proposed simple estimator. The imposed Kronecker structure allows us to show a nearly ideal $O(\\sqrt{n}+\\sqrt{H^{K-1}})$ bound, in which the parameter $H$ controls the blend from the non-convex estimator to mode-wise nuclear norm minimization. Furthermore, we empirically demonstrate that the subspace norm achieves the nearly ideal denoising performance even with $H=O(1)$.", "full_text": "Interpolating Convex and Non-Convex Tensor Decompositions via the Subspace Norm\n\nQinqing Zheng\nUniversity of Chicago\nqinqing@cs.uchicago.edu\n\nRyota Tomioka\nToyota Technological Institute at Chicago\ntomioka@ttic.edu\n\nAbstract\n\nWe consider the problem of recovering a low-rank tensor from its noisy observation. Previous work has shown a recovery guarantee with signal-to-noise ratio O(n^{⌈K/2⌉/2}) for recovering a Kth order rank one tensor of size n × ··· × n by recursive unfolding. In this paper, we first improve this bound to O(n^{K/4}) by a much simpler approach, but with a more careful analysis. Then we propose a new norm called the subspace norm, which is based on the Kronecker products of factors obtained by the proposed simple estimator. 
The imposed Kronecker structure allows us to show a nearly ideal O(√n + √(H^{K−1})) bound, in which the parameter H controls the blend from the non-convex estimator to mode-wise nuclear norm minimization. Furthermore, we empirically demonstrate that the subspace norm achieves the nearly ideal denoising performance even with H = O(1).\n\n1 Introduction\n\nTensors are a natural way to express higher order interactions for a variety of data, and tensor decomposition has been successfully applied to wide areas ranging from chemometrics and signal processing to neuroimaging; see [15, 18] for a survey. Moreover, it has recently become an active area in the context of learning latent variable models [3].\n\nMany problems related to tensors, such as finding the rank or a best rank-one approximation of a tensor, are known to be NP-hard [11, 8]. Nevertheless we can address statistical problems, such as how well we can recover a low-rank tensor from its randomly corrupted version (tensor denoising) or from partial observations (tensor completion). Since we can convert a tensor into a matrix by an operation known as unfolding, recent work [25, 19, 20, 13] has shown that we do get nontrivial guarantees by using some norms or singular value decompositions. More specifically, Richard & Montanari [20] have shown that when a rank-one Kth order tensor of size n × ··· × n is corrupted by standard Gaussian noise, a nontrivial bound can be shown with high probability if the signal-to-noise ratio β/σ ≳ n^{⌈K/2⌉/2} by a method called the recursive unfolding¹. Note that β/σ ≳ √n is sufficient for matrices (K = 2) and also for tensors if we use the best rank-one approximation (which is known to be NP-hard) as an estimator. 
On the other hand, Jain & Oh [13] analyzed the tensor completion problem and proposed an algorithm that requires O(n^{3/2}·polylog(n)) samples for K = 3; while information theoretically we need at least Ω(n) samples, and the intractable maximum likelihood estimator would require O(n · polylog(n)) samples. Therefore, in both settings, there is a wide gap between the ideal estimator and current polynomial time algorithms. A subtle question that we will address in this paper is whether we need to unfold the tensor so that the resulting matrix becomes as square as possible, which was the reasoning underlying both [19, 20].\n\nAs a parallel development, non-convex estimators based on alternating minimization or nonlinear optimization [1, 21] have been widely applied and have performed very well when appropriately set up.\n\n¹We say a_n ≳ b_n if there is a constant C > 0 such that a_n ≥ C · b_n.\n\nTable 1: Comparison of required signal-to-noise ratio β/σ of different algorithms for recovering a Kth order rank one tensor of size n × ··· × n contaminated by Gaussian noise with standard deviation σ. See model (2). The bound for the ordinary unfolding is shown in Corollary 1. The bound for the subspace norm is shown in Theorem 2. The ideal estimator is proven in Appendix A.\n\nOverlapped/latent nuclear norm [23]: O(n^{(K−1)/2}); recursive unfolding [20]/square norm [19]: O(n^{⌈K/2⌉/2}); ordinary unfolding: O(n^{K/4}); subspace norm (proposed): O(√n + √(H^{K−1})); ideal: O(√(nK log K)).\n\nTherefore it would be of fundamental importance to connect the wisdom of non-convex estimators with the more theoretically motivated estimators that recently emerged.\n\nIn this paper, we explore such a connection by defining a new norm based on Kronecker products of factors that can be obtained by simple mode-wise singular value decomposition (SVD) of unfoldings (see notation section below), also known as the higher-order singular value decomposition (HOSVD) [6, 7]. We first study the non-asymptotic behavior of the leading singular vector from the ordinary (rectangular) unfolding X_(k) and show a nontrivial bound for signal-to-noise ratio β/σ ≳ n^{K/4}. Thus the result also applies to odd order tensors, confirming a conjecture in [20]. Furthermore, this motivates us to use the solution of mode-wise truncated SVDs to construct a new norm. We propose the subspace norm, which predicts an unknown low-rank tensor as a mixture of K low-rank tensors, in which each term takes the form\n\nfold_k(M^(k)(P̂^(1) ⊗ ··· ⊗ P̂^(k−1) ⊗ P̂^(k+1) ⊗ ··· ⊗ P̂^(K))^⊤),\n\nwhere fold_k is the inverse of unfolding (·)_(k), ⊗ denotes the Kronecker product, and P̂^(k) ∈ R^{n×H} is an orthonormal matrix estimated from the mode-k unfolding of the observed tensor, for k = 1, . . . , K; H is a user-defined parameter, and M^(k) ∈ R^{n×H^{K−1}}. Our theory tells us that with sufficiently high signal-to-noise ratio the estimated P̂^(k) spans the true factors.\n\nWe highlight our contributions below:\n\n1. We prove that the required signal-to-noise ratio for recovering a Kth order rank one tensor from the ordinary unfolding is O(n^{K/4}). 
Our analysis shows a curious two-phase behavior: with high probability, when n^{K/4} ≲ β/σ ≲ n^{K/2}, the error shows a fast decay as 1/β⁴; for β/σ ≳ n^{K/2}, the error decays slowly as 1/β². We confirm this in a numerical simulation.\n\n2. The proposed subspace norm is an interpolation between the intractable estimators that directly control the rank (e.g., HOSVD) and the tractable norm-based estimators. It becomes equivalent to the latent trace norm [23] when H = n, at the cost of an increased signal-to-noise ratio threshold (see Table 1).\n\n3. The proposed estimator is more efficient than previously proposed norm-based estimators, because the size of the SVD required in the algorithm is reduced from n × n^{K−1} to n × H^{K−1}.\n\n4. We also empirically demonstrate that the proposed subspace norm performs nearly optimally for constant order H.\n\nNotation\n\nLet X ∈ R^{n₁×n₂×···×n_K} be a Kth order tensor. We will often use n₁ = ··· = n_K = n to simplify the notation, but all the results in this paper generalize to general dimensions. The inner product between a pair of tensors is defined as the inner product of them as vectors; i.e., ⟨X, W⟩ = ⟨vec(X), vec(W)⟩. For u ∈ R^{n₁}, v ∈ R^{n₂}, w ∈ R^{n₃}, u ◦ v ◦ w denotes the n₁ × n₂ × n₃ rank-one tensor whose (i, j, k) entry is u_i v_j w_k. The rank of X is the minimum number of rank-one tensors required to write X as a linear combination of them. A mode-k fiber of tensor X is an n_k dimensional vector that is obtained by fixing all but the kth index of X. The mode-k unfolding X_(k) of tensor X is an n_k × ∏_{k′≠k} n_{k′} matrix constructed by concatenating all the mode-k fibers along columns. We denote the spectral and Frobenius norms for matrices by ‖·‖ and ‖·‖_F, respectively.\n\n2 The power of ordinary unfolding\n\n2.1 A perturbation bound for the left singular vector\n\nWe first establish a bound on recovering the left singular vector of a rank-one n × m matrix (with m > n) perturbed by random Gaussian noise. Consider the following model known as the information plus noise model [4]:\n\nX̃ = βuv^⊤ + σE,   (1)\n\nwhere u and v are unit vectors, β is the signal strength, σ is the noise standard deviation, and the noise matrix E is assumed to be random with entries sampled i.i.d. from the standard normal distribution. Our goal is to lower-bound the correlation between u and the top left singular vector û of X̃ for signal-to-noise ratio β/σ ≳ (mn)^{1/4} with high probability.\n\nA direct application of the classic Wedin perturbation theorem [28] to the rectangular matrix X̃ does not provide us the desired result. This is because it requires the signal-to-noise ratio β/σ ≥ 2‖E‖. Since the spectral norm of E scales as O_p(√n + √m) [27], this would mean that we require β/σ ≳ m^{1/2}; i.e., the threshold is dominated by the number of columns m, if m ≥ n.\n\nAlternatively, we can view û as the leading eigenvector of X̃X̃^⊤, a square matrix. Our key insight is that we can decompose X̃X̃^⊤ as follows:\n\nX̃X̃^⊤ = (β²uu^⊤ + mσ²I) + (σ²EE^⊤ − mσ²I) + βσ(uv^⊤E^⊤ + Evu^⊤).\n\nNote that u is the leading eigenvector of the first term because adding an identity matrix does not change the eigenvectors. 
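As a quick, self-contained numerical illustration (not part of the original analysis), the following numpy sketch draws one instance of model (1) and checks that the top left singular vector of X̃ aligns with u. The sizes n = 50, m = 2500 and the ratio β/σ = 60 are illustrative choices of ours, placed above √m so that the slower 1/β² regime of the theorem below applies.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma, beta = 50, 2500, 1.0, 60.0  # beta/sigma above sqrt(m) = 50

# information plus noise model (1): X = beta * u v^T + sigma * E
u = rng.standard_normal(n)
u /= np.linalg.norm(u)
v = rng.standard_normal(m)
v /= np.linalg.norm(v)
X = beta * np.outer(u, v) + sigma * rng.standard_normal((n, m))

# top left singular vector of X, i.e. the leading eigenvector of X X^T
uhat = np.linalg.svd(X, full_matrices=False)[0][:, 0]
corr = abs(uhat @ u)
print(corr)  # close to 1
```

Rerunning with β/σ below (nm)^{1/4} ≈ 18.8 instead makes the correlation collapse, which is the phase transition discussed next.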
Moreover, we notice that there are two noise terms: the first term is a centered Wishart matrix, and it is independent of the signal β; the second term is Gaussian distributed and depends on the signal β.\n\nThis implies a two-phase behavior corresponding to either the Wishart or the Gaussian noise term being dominant, depending on the value of β. Interestingly, we get a different speed of convergence for each of these phases, as we show in the next theorem (the proof is given in Appendix D.1).\n\nTheorem 1. There exists a constant C such that with probability at least 1 − 4e^{−n}, if m/n ≥ C,\n\n|⟨û, u⟩| ≥ 1 − Cnm/(β/σ)⁴, if √m > β/σ ≥ (Cnm)^{1/4},\n|⟨û, u⟩| ≥ 1 − Cn/(β/σ)², if β/σ ≥ √m;\n\notherwise, |⟨û, u⟩| ≥ 1 − Cn/(β/σ)² if β/σ ≥ √(Cn).\n\nIn other words, if X̃ has sufficiently many more columns than rows, as the signal-to-noise ratio β/σ increases, û first converges to u as 1/β⁴, and then as 1/β². Figure 1(a) illustrates these results. We randomly generate a rank-one 100 × 10000 matrix perturbed by Gaussian noise, and measure the distance between û and u. The phase transition happens at β/σ = (nm)^{1/4}, and there are two regimes of different convergence rates, as Theorem 1 predicts.\n\n2.2 Tensor Unfolding\n\nNow let's apply the above result to the tensor version of the information plus noise model studied by [20]. We consider a rank one n × ··· × n tensor (signal) contaminated by Gaussian noise as follows:\n\nY = X* + σE = βu^(1) ◦ ··· ◦ u^(K) + σE,   (2)\n\nwhere the factors u^(k) ∈ R^n, k = 1, . . . 
, K, are unit vectors, which are not necessarily identical, and the entries of E ∈ R^{n×···×n} are i.i.d. samples from the normal distribution N(0, 1). Note that this is slightly more general (and easier to analyze) than the symmetric setting studied by [20].\n\nFigure 1: Numerical demonstration of Theorem 1 and Corollary 1. (a) Synthetic experiment showing phase transition at β/σ = (nm)^{1/4} and regimes with different rates of convergence; see Theorem 1. (b) Synthetic experiment showing phase transition at β = σ(∏_k n_k)^{1/4} for odd order tensors; see Corollary 1.\n\nSeveral estimators for recovering X* from its noisy version Y have been proposed (see Table 1). Both the overlapped nuclear norm and the latent nuclear norm discussed in [23] achieve the relative performance guarantee\n\n|||X̂ − X*|||_F / β ≤ O_p(σ√(n^{K−1}) / β),   (3)\n\nwhere X̂ is the estimator. This bound implies that if we want to obtain relative error smaller than ε, we need the signal-to-noise ratio β/σ to scale as β/σ ≳ √(n^{K−1})/ε.\n\nMu et al. [19] proposed the square norm, defined as the nuclear norm of the matrix obtained by grouping the first ⌊K/2⌋ indices along the rows and the last ⌈K/2⌉ indices along the columns. This norm improves the right hand side of inequality (3) to O_p(σ√(n^{⌈K/2⌉})/β), which translates to requiring β/σ ≳ √(n^{⌈K/2⌉})/ε for obtaining relative error ε. The intuition here is that the more square the unfolding is, the better the bound becomes. 
However, there is no improvement for K = 3.\n\nRichard and Montanari [20] studied the (symmetric version of) model (2) and proved that a recursive unfolding algorithm achieves the factor recovery error dist(û^(k), u^(k)) = ε with β/σ ≳ √(n^{⌈K/2⌉})/ε with high probability, where dist(u, u′) := min(‖u − u′‖, ‖u + u′‖). They also showed that the randomly initialized tensor power method [7, 16, 3] can achieve the same error ε with the slightly worse threshold β/σ ≳ max(√n/ε², n^{K/2}√(K log K)), also with high probability.\n\nThe reasoning underlying both [19] and [20] is that square unfolding is better. However, if we take the (ordinary) mode-k unfolding\n\nY_(k) = βu^(k)(u^(k−1) ⊗ ··· ⊗ u^(1) ⊗ u^(K) ⊗ ··· ⊗ u^(k+1))^⊤ + σE_(k),   (4)\n\nwe can see (4) as an instance of the information plus noise model (1), where m/n = n^{K−2}. Thus the ordinary unfolding satisfies the condition of Theorem 1 for n or K large enough.\n\nCorollary 1. Consider a K(≥ 3)th order rank one tensor contaminated by Gaussian noise as in (2). There exists a constant C such that if n^{K−2} ≥ C, with probability at least 1 − 4Ke^{−n}, we have, for k = 1, . . . , K,\n\ndist²(û^(k), u^(k)) ≤ 2Cn^K/(β/σ)⁴, if n^{(K−1)/2} > β/σ ≥ C^{1/4}n^{K/4},\ndist²(û^(k), u^(k)) ≤ 2Cn/(β/σ)², if β/σ ≥ n^{(K−1)/2},\n\nwhere û^(k) is the leading left singular vector of the rectangular unfolding Y_(k).\n\nThis proves that, as conjectured by [20], the threshold β/σ ≳ n^{K/4} applies not only to the even order case but also to the odd order case. Note that Hopkins et al. [10] have shown a similar result without the sharp rate of convergence. 
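Corollary 1 is easy to probe numerically. The sketch below, with illustrative sizes and signal strength of our choosing (not the paper's experiments), forms a noisy rank-one 30 × 30 × 30 tensor and recovers each factor from the top left singular vector of the ordinary mode-k unfolding; here β/σ = 60, comfortably above the n^{K/4} ≈ 12.8 threshold.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K, sigma, beta = 30, 3, 1.0, 60.0  # beta/sigma well above n^{K/4}

# rank-one signal plus noise, model (2): Y = beta u^(1) ◦ u^(2) ◦ u^(3) + sigma E
us = [rng.standard_normal(n) for _ in range(K)]
us = [u / np.linalg.norm(u) for u in us]
Y = beta * np.einsum('i,j,k->ijk', *us) + sigma * rng.standard_normal((n, n, n))

def unfold(T, k):
    # mode-k unfolding: mode-k fibers become the columns
    return np.moveaxis(T, k, 0).reshape(T.shape[k], -1)

# ordinary-unfolding estimator: top left singular vector of each mode-k unfolding
corrs = [abs(np.linalg.svd(unfold(Y, k), full_matrices=False)[0][:, 0] @ us[k])
         for k in range(K)]
print(corrs)  # each entry close to 1
```

Lowering beta toward the threshold makes the correlations degrade, mode by mode, as the corollary predicts.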
The above corollary easily extends to a more general n₁ × ··· × n_K tensor by replacing the conditions by √(∏_{ℓ≠k} n_ℓ) > β/σ ≥ (C∏_{k=1}^K n_k)^{1/4} and β/σ ≥ √(∏_{ℓ≠k} n_ℓ), respectively. The result also holds when X* has rank higher than 1; see Appendix E.\n\nWe demonstrate this result in Figure 1(b). The models behind the experiment are slightly more general ones in which [n₁, n₂, n₃] = [20, 40, 60] or [40, 80, 120] and the signal X* is rank two with β₁ = 20 and β₂ = 10. The plot shows the inner products ⟨u₁^(1), û₁^(1)⟩ and ⟨u₂^(1), û₂^(1)⟩ as a measure of the quality of estimating the two mode-1 factors. The horizontal axis is the normalized noise standard deviation σ(∏_{k=1}^K n_k)^{1/4}. We can clearly see that the inner product decays symmetrically around β₁ and β₂, as predicted by Corollary 1, for both tensors.\n\n3 Subspace norm for tensors\n\nSuppose the true tensor X* ∈ R^{n×···×n} admits a minimum Tucker decomposition [26] of rank (R, . . . , R):\n\nX* = ∑_{i₁=1}^R ··· ∑_{i_K=1}^R β_{i₁i₂...i_K} u_{i₁}^(1) ◦ ··· ◦ u_{i_K}^(K).   (5)\n\nIf the core tensor C = (β_{i₁...i_K}) ∈ R^{R×···×R} is superdiagonal, the above decomposition reduces to the canonical polyadic (CP) decomposition [9, 15]. The mode-k unfolding of the true tensor X* can be written as follows:\n\nX*_(k) = U^(k)C_(k)(U^(1) ⊗ ··· ⊗ U^(k−1) ⊗ U^(k+1) ⊗ ··· ⊗ U^(K))^⊤,   (6)\n\nwhere C_(k) is the mode-k unfolding of the core tensor C, and U^(k) is the n × R matrix U^(k) = [u₁^(k), . . . , u_R^(k)] for k = 1, . . . , K. Note that U^(k) is not necessarily orthogonal.\n\nLet X*_(k) = P^(k)Λ^(k)Q^(k)⊤ be the SVD of X*_(k). We will observe that\n\nQ^(k) ∈ Span(P^(1) ⊗ ··· ⊗ P^(k−1) ⊗ P^(k+1) ⊗ ··· ⊗ P^(K)),   (7)\n\nbecause of (6) and U^(k) ∈ Span(P^(k)).\n\nCorollary 1 shows that the left singular vectors P^(k) can be recovered under mild conditions; thus the span of the right singular vectors can also be recovered. Inspired by this, we define a norm that models a tensor X as a mixture of tensors Z^(1), . . . , Z^(K). We require that the mode-k unfolding of Z^(k), i.e. Z^(k)_(k), has a low rank factorization Z^(k)_(k) = M^(k)S^(k)⊤, where M^(k) ∈ R^{n×H^{K−1}} is a variable, and S^(k) ∈ R^{n^{K−1}×H^{K−1}} is a fixed arbitrary orthonormal basis of some subspace, which we choose later to have the Kronecker structure in (7).\n\nIn the following, we define the subspace norm, suggest an approach to construct the right factor S^(k), and prove the denoising bound in the end.\n\n3.1 The subspace norm\n\nConsider a Kth order tensor of size n × ··· × n.\n\nDefinition 1. Let S^(1), . . . , S^(K) be matrices such that S^(k) ∈ R^{n^{K−1}×H^{K−1}} with H ≤ n. 
The subspace norm for a Kth order tensor X associated with {S^(k)}_{k=1}^K is defined as\n\n|||X|||_s := inf_{M^(1), . . . , M^(K)} ∑_{k=1}^K ‖M^(k)‖_*, if X ∈ Span({S^(k)}_{k=1}^K), and +∞ otherwise,\n\nwhere ‖·‖_* is the nuclear norm, and Span({S^(k)}_{k=1}^K) := {X ∈ R^{n×···×n} : ∃M^(1), . . . , M^(K), X = ∑_{k=1}^K fold_k(M^(k)S^(k)⊤)}.\n\nIn the next lemma (proven in Appendix D.2), we show that the dual norm of the subspace norm has a simple appealing form. As we see in Theorem 2, it avoids the O(√(n^{K−1})) scaling (see the first column of Table 1) by restricting the influence of the noise term in the subspace defined by S^(1), . . . , S^(K).\n\nLemma 1. The dual norm of |||·|||_s is a semi-norm\n\n|||X|||_{s*} = max_{k=1,...,K} ‖X_(k)S^(k)‖,\n\nwhere ‖·‖ is the spectral norm.\n\n3.2 Choosing the subspace\n\nA natural question that arises is how to choose the matrices S^(1), . . . , S^(K).\n\nLemma 2. Let X*_(k) = P^(k)Λ^(k)Q^(k)⊤ be the SVD of X*_(k), where P^(k) is n × R and Q^(k) is n^{K−1} × R. Assume that R ≤ n and that U^(k) has full column rank. It holds that for all k,\n\ni) U^(k) ∈ Span(P^(k)),\nii) Q^(k) ∈ Span(P^(1) ⊗ ··· ⊗ P^(k−1) ⊗ P^(k+1) ⊗ ··· ⊗ P^(K)).\n\nProof. We prove the lemma in Appendix D.4.\n\nCorollary 1 shows that when the signal-to-noise ratio is high enough, we can recover P^(k) with high probability. Hence we suggest the following three-step approach for tensor denoising:\n\n(i) For each k, unfold the observation tensor in mode k and compute the top H left singular vectors. 
Concatenate these vectors to obtain an n × H matrix P̂^(k).\n\n(ii) Construct S^(k) as\n\nS^(k) = P̂^(1) ⊗ ··· ⊗ P̂^(k−1) ⊗ P̂^(k+1) ⊗ ··· ⊗ P̂^(K).   (8)\n\n(iii) Solve the subspace norm regularized minimization problem\n\nmin_X (1/2)|||Y − X|||²_F + λ|||X|||_s,   (9)\n\nwhere the subspace norm is associated with the above defined {S^(k)}_{k=1}^K.\n\nSee Appendix B for details.\n\n3.3 Analysis\n\nLet Y ∈ R^{n×···×n} be a tensor corrupted by Gaussian noise with standard deviation σ as follows:\n\nY = X* + σE.\n\nWe define a slightly modified estimator X̂ as follows:\n\nX̂ = arg min_{X, {M^(k)}_{k=1}^K ∈ M(ρ)} { (1/2)|||Y − X|||²_F + λ|||X|||_s : X = ∑_{k=1}^K fold_k(M^(k)S^(k)⊤) },   (10)\n\nwhere M(ρ) is a restriction of the set of matrices M^(k) ∈ R^{n×H^{K−1}}, k = 1, . . . , K, defined as follows:\n\nM(ρ) := { {M^(k)}_{k=1}^K : ‖fold_k(M^(k))_(ℓ)‖ ≤ (ρ/K)(√n + √(H^{K−1})), ∀k ≠ ℓ }.\n\nThis restriction makes sure that M^(k), k = 1, . . . , K, are incoherent, i.e., each M^(k) has a spectral norm that is as low as that of a random matrix when unfolded at a different mode ℓ. Similar assumptions were used in low-rank plus sparse matrix decomposition [2, 12] and for the denoising bound for the latent nuclear norm [23].\n\nThen we have the following statement (we prove this in Appendix D.3).\n\nTheorem 2. 
Let X_p be any tensor that can be expressed as\n\nX_p = ∑_{k=1}^K fold_k(M_p^(k)S^(k)⊤),\n\nwhich satisfies the above incoherence condition {M_p^(k)}_{k=1}^K ∈ M(ρ), and let r_k be the rank of M_p^(k) for k = 1, . . . , K. In addition, we assume that each S^(k) is constructed as S^(k) = P̂^(1) ⊗ ··· ⊗ P̂^(k−1) ⊗ P̂^(k+1) ⊗ ··· ⊗ P̂^(K) with (P̂^(k))⊤P̂^(k) = I_H. Then there are universal constants c₀ and c₁ such that any solution X̂ of the minimization problem (10) with λ = |||X_p − X*|||_{s*} + c₀σ(√n + √(H^{K−1}) + √(2 log(K/δ))) satisfies the following bound\n\n|||X̂ − X*|||_F ≤ |||X_p − X*|||_F + c₁λ√(∑_{k=1}^K r_k),\n\nwith probability at least 1 − δ.\n\nNote that the right-hand side of the bound consists of two terms. The first term is the approximation error. This term will be zero if X* lies in Span({S^(k)}_{k=1}^K). This is the case if we choose S^(k) = I_{n^{K−1}} as in the latent nuclear norm, or if the condition of Corollary 1 is satisfied for the smallest β_R when we use the Kronecker product construction we proposed. Note that the regularization constant λ should also scale with the dual subspace norm of the residual X_p − X*.\n\nThe second term is the estimation error with respect to X_p. If we take X_p to be the orthogonal projection of X* onto Span({S^(k)}_{k=1}^K), we can ignore the contribution of the residual to λ, because (X_p − X*)_(k)S^(k) = 0. Then the estimation error scales mildly with the dimensions n, H^{K−1}, and with the sum of the ranks. 
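To make the construction concrete, the following sketch carries out steps (i) and (ii) on a noiseless rank-2 tensor and verifies that each mode-k unfolding indeed lies in the span of the corresponding S^(k), so the approximation error term vanishes. Sizes are arbitrary choices of ours, and the Kronecker factor ordering is matched to numpy's row-major unfolding rather than to the column ordering convention of eq. (4).

```python
import numpy as np

rng = np.random.default_rng(2)
n, H = 8, 2  # H equals the true rank here

def rand_orth(n, r):
    # random n x r matrix with orthonormal columns
    return np.linalg.qr(rng.standard_normal((n, r)))[0]

# rank-2 CP signal with beta_1 = 20, beta_2 = 10 (as in Lemma 2's setting)
U = [rand_orth(n, H) for _ in range(3)]
X = np.einsum('ir,jr,kr->ijk', U[0] * np.array([20.0, 10.0]), U[1], U[2])

def unfold(T, k):
    # mode-k unfolding with numpy's row-major column ordering
    return np.moveaxis(T, k, 0).reshape(T.shape[k], -1)

# step (i): top-H left singular vectors of each mode-k unfolding
P = [np.linalg.svd(unfold(X, k), full_matrices=False)[0][:, :H] for k in range(3)]

# step (ii): S^(k) is the Kronecker product over the other modes,
# ordered consistently with unfold() above
S = [np.kron(P[1], P[2]), np.kron(P[0], P[2]), np.kron(P[0], P[1])]

# X*_(k) lies in Span(S^(k)): orthogonal projection onto it changes nothing
ok = [np.allclose(unfold(X, k) @ S[k] @ S[k].T, unfold(X, k)) for k in range(3)]
print(ok)  # [True, True, True]
```

With noise added and β/σ above the Corollary 1 threshold, the projection residual stays small instead of exactly zero, which is what the first term of the bound measures.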
Note that if we take S^(k) = I_{n^{K−1}}, we have H^{K−1} = n^{K−1}, and we recover the guarantee (3).\n\n4 Experiments\n\nIn this section, we conduct tensor denoising experiments on synthetic and real datasets to numerically confirm our analysis in the previous sections.\n\n4.1 Synthetic data\n\nWe randomly generated the true rank two tensor X* of size 20 × 30 × 40 with singular values β₁ = 20 and β₂ = 10. The true factors are generated as random matrices with orthonormal columns. The observation tensor Y is then generated by adding Gaussian noise with standard deviation σ to X*. Our approach is compared to the CP decomposition, the overlapped approach, and the latent approach. The CP decomposition is computed by tensorlab [22] with 20 random initializations. We assume CP knows that the true rank is 2. For the subspace norm, we use Algorithm 2 described in Section 3. We also select the top 2 singular vectors when constructing the Û^(k)'s. We computed the solutions for 20 values of the regularization parameter λ logarithmically spaced between 1 and 100. For the overlapped and the latent norm, we use the ADMM described in [25]; we also computed 20 solutions with the same λ's used for the subspace norm.\n\nWe measure the performance in the relative error defined as |||X̂ − X*|||_F / |||X*|||_F. We report the minimum error obtained by choosing the optimal regularization parameter or the optimal initialization. Although the regularization parameter could be selected by leaving out some entries and measuring the error on these entries, we will not go into tensor completion here for the sake of simplicity.\n\nFigure 2 (a) and (b) show the results of this experiment. The left panel shows the relative error for 3 representative values of λ for the subspace norm. The black dash-dotted line shows the minimum error across all the λ's. 
The magenta dashed line shows the error corresponding to the theoretically motivated choice λ = σ(max_k(√(n_k)) + √(H^{K−1}) + √(2 log K)) for each σ. The two vertical lines are the thresholds on σ from Corollary 1 corresponding to β₁ and β₂, namely, β₁/(∏_k n_k)^{1/4} and β₂/(∏_k n_k)^{1/4}. It confirms that there is a rather sharp increase in the error around the theoretically predicted places (see also Figure 1(b)). We can also see that the optimal λ should grow linearly with σ. For large σ (small SNR), the best relative error is 1, since the optimal choice of the regularization parameter λ leads to predicting with X̂ = 0.\n\nFigure 2 (b) compares the performance of the subspace norm to the other approaches. For each method the smallest error corresponding to the optimal choice of the regularization parameter λ is shown.\n\nFigure 2: Tensor denoising. (a) The subspace approach with three representative λ's on synthetic data. (b) Comparison of different methods on synthetic data. (c) Comparison on amino acids data.\n\nIn addition, to place the numbers in context, we plot the line corresponding to\n\nRelative error = (√(R ∑_k n_k log(K)) / |||X*|||_F) · σ,   (11)\n\nwhich we call “optimistic”. This can be motivated from considering the (non-tractable) maximum likelihood estimator for CP decomposition (see Appendix A).\n\nClearly, the error of CP, the subspace norm, and “optimistic” grows at the same rate, much slower than overlap and latent. The error of CP increases beyond 1, as no regularization is imposed (see Appendix C for more experiments). 
We can see that both CP and the subspace norm are behaving near optimally in this setting, although such behavior is guaranteed for the subspace norm, whereas it is hard to give any such guarantee for the CP decomposition based on nonlinear optimization.\n\n4.2 Amino acids data\n\nThe amino acid dataset [5] is a semi-realistic dataset commonly used as a benchmark for low rank tensor modeling. It consists of five laboratory-made samples, each of which contains different amounts of tyrosine, tryptophan, and phenylalanine. The spectra of their excitation wavelengths (250-300 nm) and emissions (250-450 nm) are measured by fluorescence, which gives a 5 × 201 × 61 tensor. As the true factors are known to be these three acids, this data perfectly suits the CP model. The true rank is fed into CP and the proposed approach as H = 3. We computed the solutions of CP for 20 different random initializations, and the solutions of the other approaches with 20 different values of λ. For the subspace and the overlapped approach, the λ's are logarithmically spaced between 10³ and 10⁵. For the latent approach, the λ's are logarithmically spaced between 10⁴ and 10⁶. Again, we include the optimistic scaling (11) to put the numbers in context.\n\nFigure 2(c) shows the smallest relative error achieved by all the methods we compare. Similar to the synthetic data, both CP and the subspace norm behave near ideally, though the relative error of CP can be larger than 1 due to the lack of regularization. Interestingly, the theoretically suggested scaling of the regularization parameter λ is almost optimal.\n\n5 Conclusion\n\nWe have settled a conjecture posed by [20] and showed that an O(n^{K/4}) signal-to-noise ratio is indeed sufficient also for odd order tensors. Moreover, our analysis shows an interesting two-phase behavior of the error. This finding led us to the development of the proposed subspace norm. The proposed norm is defined with respect to a set of orthonormal matrices P̂^(1), . . . , P̂^(K), which are estimated by mode-wise singular value decompositions. We have analyzed the denoising performance of the proposed norm, and shown that the error can be bounded by the sum of two terms, which can be interpreted as an approximation error term coming from the first (non-convex) step, and an estimation error term coming from the second (convex) step.", "award": [], "sourceid": 1742, "authors": [{"given_name": "Qinqing", "family_name": "Zheng", "institution": "University of Chicago"}, {"given_name": "Ryota", "family_name": "Tomioka", "institution": "Toyota Technological Institute at Chicago"}]}