{"title": "Convex Tensor Decomposition via Structured Schatten Norm Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 1331, "page_last": 1339, "abstract": "We propose a new class of structured Schatten norms for tensors that includes two recently proposed norms (\"overlapped\" and \"latent\") for convex-optimization-based tensor decomposition. Based on the properties of the structured Schatten norms, we mathematically analyze the performance of the \"latent\" approach for tensor decomposition, which was empirically found to perform better than the \"overlapped\" approach in some settings. We show theoretically that this is indeed the case. In particular, when the unknown true tensor is low-rank in a specific mode, this approach performs as well as knowing the mode with the smallest rank. Along the way, we show a novel duality result for structured Schatten norms, which is also interesting in the general context of structured sparsity. We confirm through numerical simulations that our theory can precisely predict the scaling behaviour of the mean squared error.", "full_text": "Convex Tensor Decomposition via Structured Schatten Norm Regularization\n\nRyota Tomioka\n\nToyota Technological Institute at Chicago\n\nChicago, IL 60637\n\ntomioka@ttic.edu\n\nTaiji Suzuki\n\nDepartment of Mathematical and Computing Sciences\n\nTokyo Institute of Technology\n\nTokyo 152-8552, Japan\n\ns-taiji@is.titech.ac.jp\n\nAbstract\n\nWe study a new class of structured Schatten norms for tensors that includes two\nrecently proposed norms (\u201coverlapped\u201d and \u201clatent\u201d) for convex-optimization-based\ntensor decomposition. We analyze the performance of the \u201clatent\u201d approach\nfor tensor decomposition, which was empirically found to perform better than the\n\u201coverlapped\u201d approach in some settings. We show theoretically that this is indeed\nthe case. 
In particular, when the unknown true tensor is low-rank in a specific\nunknown mode, this approach performs as well as knowing the mode with the\nsmallest rank. Along the way, we show a novel duality result for structured Schatten\nnorms, which is also interesting in the general context of structured sparsity.\nWe confirm through numerical simulations that our theory can precisely predict\nthe scaling behaviour of the mean squared error.\n\n1 Introduction\n\nDecomposition of tensors [10, 14] (or multi-way arrays) into low-rank components arises naturally\nin many real-world data analysis problems. For example, in neuroimaging, spatio-temporal patterns\nof neural activities that are related to certain experimental conditions or subjects can be found by\ncomputing the tensor decomposition of the data tensor, which can be of size channels $\\times$ time-points\n$\\times$ subjects $\\times$ conditions [18]. More generally, any multivariate spatio-temporal data (e.g.,\nenvironmental monitoring) can be regarded as a tensor. If some of the observations are missing,\nlow-rank modeling enables the imputation of missing values. Tensor modeling may also be valuable for\ncollaborative filtering with temporal or contextual dimensions.\nConventionally, tensor decomposition has been tackled through non-convex optimization problems,\nusing alternating least squares or higher-order orthogonal iteration [6]. Compared to its empirical\nsuccess, little has been theoretically understood about the performance of tensor decomposition\nalgorithms. De Lathauwer et al. [5] showed an approximation bound for a truncated higher-order\nSVD (also known as the Tucker decomposition). Nevertheless, the generalization performance of\nthese approaches has remained largely open. 
Moreover, the model selection problem can be highly\nchallenging, especially for the Tucker model [5, 27], because we need to specify the rank $r_k$ for each\nmode (here a mode refers to one dimensionality of a tensor); that is, we have K hyper-parameters\nto choose for a K-way tensor, which is challenging even for K = 3.\nRecently a convex-optimization-based approach for tensor decomposition has been proposed by\nseveral authors [9, 15, 23, 25], and its performance has been analyzed in [26].\n\nFigure 1: Estimation of a low-rank $50 \\times 50 \\times 20$ tensor of rank $r \\times r \\times 3$ from noisy measurements.\nThe noise standard deviation is $\\sigma = 0.1$. The estimation errors of two convex-optimization-based\nmethods are plotted against the rank r of the first two modes. The solid lines show the error at the\nfixed regularization constant $\\lambda$, which is 0.89 for the overlapped approach and 3.79 for the latent\napproach (see also Figure 2). The dashed lines show the minimum error over candidates of the\nregularization constant $\\lambda$ from 0.1 to 100. In the inset, the errors of the two approaches are plotted\nagainst the regularization constant $\\lambda$ for rank r = 40 (marked with a gray dashed vertical line in\nthe outset). The two values (0.89 and 3.79) are marked with vertical dashed lines. Note that both\napproaches need no knowledge of the true rank; the rank is automatically learned.\n\nThe basic idea behind their convex approach, which we call the overlapped approach, is to unfold\u00b9 a\ntensor into matrices along different modes and penalize the unfolded matrices to be simultaneously\nlow-rank based on the Schatten 1-norm, which is also known as the trace norm and nuclear norm [7,\n22, 24]. 
This approach does not require the rank of the decomposition to be specified beforehand,\nand due to the low-rank inducing property of the Schatten 1-norm, the rank of the decomposition is\nautomatically determined.\nHowever, it has been noticed that the above overlapped approach has a limitation: it performs\npoorly for a tensor that is only low-rank in a certain mode. The authors of [25] proposed an\nalternative approach, which we call the latent approach, that decomposes a given tensor into a mixture of\ntensors that each are low-rank in a specific mode. Figure 1 demonstrates that the latent approach\nis preferable to the overlapped approach when the underlying tensor is almost full rank in all but\none mode. However, so far no theoretical analysis has been presented to support this empirical\nsuccess.\nIn this paper, we rigorously study the performance of the latent approach and show that the mean\nsquared error of the latent approach scales no greater than the minimum mode-k rank of the\nunderlying true tensor, which clearly explains why the latent approach performs better than the overlapped\napproach in Figure 1.\nAlong the way, we show a novel duality between the two types of norms employed in the above\ntwo approaches, namely the overlapped Schatten norm and the latent Schatten norm. This result\nis closely related to, and generalizes, the results in the structured sparsity literature [2, 13, 17, 21]. In fact,\nthe (plain) overlapped group lasso constrains the weights to be simultaneously group sparse over\noverlapping groups. The latent group lasso predicts with a mixture of group sparse weights [see\nalso 1, 3, 12]. 
These approaches clearly correspond to the two variations of tensor decomposition\nalgorithms we discussed above.\nFinally we empirically compare the overlapped approach and latent approach and show that even\nwhen the unknown tensor is simultaneously low-rank, which is a favorable situation for the\noverlapped approach, the latent approach performs better in many cases. Thus we provide both\ntheoretical and empirical evidence that for noisy tensor decomposition, the latent approach is preferable to\nthe overlapped approach. Our result is complementary to the previous studies [25, 26], which mainly\nfocused on the noiseless tensor completion setting.\n\n\u00b9 For a K-way tensor, there are K ways to unfold a tensor into a matrix. See Section 2.\n\n[Figure 1 plot: estimation error $|||\\mathcal{W} - \\mathcal{W}^*|||_F$ against the rank of the first two modes (size = [50 50 20]) for the overlapped and latent Schatten 1-norms; inset: error against the regularization constant $\\lambda$ for rank = [40 40 3].]\n\nThis paper is structured as follows. In Section 2, we provide basic definitions of the two variations of\nstructured Schatten norms, namely the overlapped/latent Schatten norms, and discuss their\nproperties, especially the duality between them. Section 3 presents our main theoretical contributions; we\nestablish the consistency of the latent approach, and we analyze the denoising performance of the\nlatent approach. In Section 4, we empirically confirm the scaling predicted by our theory. Finally,\nSection 5 concludes the paper. Most of the proofs are presented in the supplementary material.\n\n2 Structured Schatten norms for tensors\n\nIn this section, we define the overlapped Schatten norm and the latent Schatten norm and discuss\ntheir basic properties.\nFirst we need some basic definitions.\nLet $\\mathcal{W} \\in \\mathbb{R}^{n_1 \\times \\cdots \\times n_K}$ be a K-way tensor. 
We denote the total number of entries in $\\mathcal{W}$ by $N = \\prod_{k=1}^{K} n_k$. The dot product between two tensors $\\mathcal{W}$ and $\\mathcal{X}$ is defined as $\\langle \\mathcal{W}, \\mathcal{X} \\rangle = \\mathrm{vec}(\\mathcal{W})^\\top \\mathrm{vec}(\\mathcal{X})$;\ni.e., the dot product as vectors in $\\mathbb{R}^N$. The Frobenius norm of a tensor is defined as\n$|||\\mathcal{W}|||_F = \\sqrt{\\langle \\mathcal{W}, \\mathcal{W} \\rangle}$. Each dimensionality of a tensor is called a mode. The mode-k unfolding $W_{(k)} \\in \\mathbb{R}^{n_k \\times N/n_k}$\nis a matrix that is obtained by concatenating the mode-k fibers along columns; here a\nmode-k fiber is an $n_k$-dimensional vector obtained by fixing all the indices but the kth index of $\\mathcal{W}$.\nThe mode-k rank $r_k$ of $\\mathcal{W}$ is the rank of the mode-k unfolding $W_{(k)}$. We say that a tensor $\\mathcal{W}$ has\nmultilinear rank $(r_1, \\ldots, r_K)$ if the mode-k rank is $r_k$ for $k = 1, \\ldots, K$ [14]. The mode-k folding\nis the inverse of the unfolding operation.\n\n2.1 Overlapped Schatten norms\n\nThe low-rank inducing norm studied in [9, 15, 23, 25], which we call the overlapped Schatten 1-norm,\ncan be written as follows:\n\n$$|||\\mathcal{W}|||_{\\underline{S_1/1}} = \\sum_{k=1}^{K} \\|W_{(k)}\\|_{S_1}. \\qquad (1)$$\n\nIn this paper, we consider the following more general overlapped $S_p/q$-norm, which includes the\nSchatten 1-norm as the special case $(p, q) = (1, 1)$. The overlapped $S_p/q$-norm is written as follows:\n\n$$|||\\mathcal{W}|||_{\\underline{S_p/q}} = \\Big( \\sum_{k=1}^{K} \\|W_{(k)}\\|_{S_p}^{q} \\Big)^{1/q}, \\qquad (2)$$\n\nwhere $1 \\le p, q \\le \\infty$; here\n\n$$\\|W\\|_{S_p} = \\Big( \\sum_{j=1}^{r} \\sigma_j^p(W) \\Big)^{1/p}$$\n\nis the Schatten p-norm for matrices, where $\\sigma_j(W)$ is the jth largest singular value of $W$.\nWhen used as a regularizer, the overlapped Schatten 1-norm penalizes all modes of $\\mathcal{W}$ to be jointly\nlow-rank. 
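
As a minimal numerical sketch of the definitions above (not the authors' code), the mode-k unfolding, the mode-k ranks, and the overlapped Schatten 1-norm (1) can be computed with NumPy. The helper names unfold, mode_ranks, and overlapped_schatten1 and the small 6 x 5 x 4 example tensor are assumptions made for illustration. The column ordering of the unfolding used here is one arbitrary choice; it does not change the singular values.

```python
import numpy as np

def unfold(W, k):
    # Mode-k unfolding: rows are indexed by mode k, columns by all other modes.
    return np.moveaxis(W, k, 0).reshape(W.shape[k], -1)

def mode_ranks(W, tol=1e-8):
    # Multilinear rank: the rank of each mode-k unfolding.
    return [int(np.linalg.matrix_rank(unfold(W, k), tol=tol)) for k in range(W.ndim)]

def overlapped_schatten1(W):
    # Sum over modes of the nuclear norm (sum of singular values) of the unfolding.
    return sum(np.linalg.norm(unfold(W, k), 'nuc') for k in range(W.ndim))

rng = np.random.default_rng(0)
# Hypothetical example: random 2 x 3 x 2 core multiplied by orthonormal factors.
core = rng.standard_normal((2, 3, 2))
U = [np.linalg.qr(rng.standard_normal((n, r)))[0] for n, r in zip((6, 5, 4), core.shape)]
W = np.einsum('abc,ia,jb,kc->ijk', core, *U)

ranks = mode_ranks(W)
lhs = overlapped_schatten1(W)
# Sum of sqrt(r_k) times the Frobenius norm of the tensor.
rhs = sum(np.sqrt(r) for r in ranks) * np.linalg.norm(W)
print(ranks, lhs <= rhs)
```

The last two quantities compare the overlapped Schatten 1-norm with the sum of the square roots of the mode-k ranks times the Frobenius norm, which is the relation the analysis of the overlapped approach relies on.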
It is related to the overlapped group regularization [see 13, 16] in the sense that the same\nobject $\\mathcal{W}$ appears repeatedly in the norm.\nThe following inequality relates the overlapped Schatten 1-norm with the Frobenius norm, which\nwas a key step in the analysis of [26]:\n\n$$|||\\mathcal{W}|||_{\\underline{S_1/1}} \\le \\sum_{k=1}^{K} \\sqrt{r_k} \\, |||\\mathcal{W}|||_F, \\qquad (3)$$\n\nwhere $r_k$ is the mode-k rank of $\\mathcal{W}$.\nNow we are interested in the dual norm of the overlapped $S_p/q$-norm, because deriving the dual\nnorm is a key step in solving the minimization problem that involves the norm (2) [see 16], as\nwell as computing various complexity measures, such as Rademacher complexity [8] and Gaussian\nwidth [4]. It turns out that the dual norm of the overlapped $S_p/q$-norm is the latent $S_{p^*}/q^*$-norm as\nshown in the following lemma (proof is presented in Appendix A).\n\nLemma 1. The dual norm of the overlapped $S_p/q$-norm is the latent $S_{p^*}/q^*$-norm, where $1/p + 1/p^* = 1$ and $1/q + 1/q^* = 1$, which is defined as follows:\n\n$$|||\\mathcal{X}|||_{\\overline{S_{p^*}/q^*}} = \\inf_{(\\mathcal{X}^{(1)} + \\cdots + \\mathcal{X}^{(K)}) = \\mathcal{X}} \\Big( \\sum_{k=1}^{K} \\|X^{(k)}_{(k)}\\|_{S_{p^*}}^{q^*} \\Big)^{1/q^*}. \\qquad (4)$$\n\nHere the infimum is taken over the K-tuple of tensors $\\mathcal{X}^{(1)}, \\ldots, \\mathcal{X}^{(K)}$ that sums to $\\mathcal{X}$.\nIn the supplementary material, we show a slightly more general version of the above lemma that\nnaturally generalizes the duality between overlapped/latent group sparsity norms [1, 12, 17, 21]; see\nSection A. 
Note that when the groups have no overlap, the overlapped/latent group sparsity norms\nbecome identical, and the duality is the ordinary duality between the group $S_p/q$-norms and the\ngroup $S_{p^*}/q^*$-norms.\n\n2.2 Latent Schatten norms\n\nThe latent approach for tensor decomposition [25] solves the following minimization problem\n\n$$\\mathop{\\mathrm{minimize}}_{\\mathcal{W}^{(1)}, \\ldots, \\mathcal{W}^{(K)}} \\quad L(\\mathcal{W}^{(1)} + \\cdots + \\mathcal{W}^{(K)}) + \\lambda \\sum_{k=1}^{K} \\|W^{(k)}_{(k)}\\|_{S_1}, \\qquad (5)$$\n\nwhere $L$ is a loss function, $\\lambda$ is a regularization constant, and $W^{(k)}_{(k)}$ is the mode-k unfolding of\n$\\mathcal{W}^{(k)}$. Intuitively speaking, the latent approach for tensor decomposition predicts with a mixture of\nK tensors that each are regularized to be low-rank in a specific mode.\nNow, since the loss term in the minimization problem (5) only depends on the sum of the tensors\n$\\mathcal{W}^{(1)}, \\ldots, \\mathcal{W}^{(K)}$, minimization problem (5) is equivalent to the following minimization problem\n\n$$\\mathop{\\mathrm{minimize}}_{\\mathcal{W}} \\quad L(\\mathcal{W}) + \\lambda |||\\mathcal{W}|||_{\\overline{S_1/1}}.$$\n\nIn other words, we have identified the structured Schatten norm employed in the latent approach as\nthe latent $S_1/1$-norm (or latent Schatten 1-norm for short), which can be written as follows:\n\n$$|||\\mathcal{W}|||_{\\overline{S_1/1}} = \\inf_{(\\mathcal{W}^{(1)} + \\cdots + \\mathcal{W}^{(K)}) = \\mathcal{W}} \\sum_{k=1}^{K} \\|W^{(k)}_{(k)}\\|_{S_1}. \\qquad (6)$$\n\nAccording to Lemma 1, the dual norm of the latent $S_1/1$-norm is the overlapped $S_\\infty/\\infty$-norm\n\n$$|||\\mathcal{X}|||_{\\underline{S_\\infty/\\infty}} = \\max_{k} \\|X_{(k)}\\|_{S_\\infty}, \\qquad (7)$$\n\nwhere $\\|\\cdot\\|_{S_\\infty}$ is the spectral norm.\nThe following lemma is similar to inequality (3) and is a key in our 
analysis (proof is presented in\nAppendix B).\nLemma 2.\n\n$$|||\\mathcal{W}|||_{\\overline{S_1/1}} \\le \\Big( \\min_{k} \\sqrt{r_k} \\Big) |||\\mathcal{W}|||_F,$$\n\nwhere $r_k$ is the mode-k rank of $\\mathcal{W}$.\nCompared to inequality (3), the latent Schatten 1-norm is bounded by the minimal square root of the\nranks instead of the sum. This is the fundamental reason why the latent approach performs better\nthan the overlapped approach as in Figure 1.\n\n3 Main theoretical results\n\nIn this section, combining the duality we presented in the previous section with the techniques\nfrom Agarwal et al. [1], we study the generalization performance of the latent approach for tensor\ndecomposition in the context of recovering an unknown tensor $\\mathcal{W}^*$ from noisy measurements. This\nis the setting of the experiment in Figure 1. We first prove a generic consistency statement that does\nnot take the low-rank-ness of the truth into account. Next we show that a tighter bound that takes the\nlow-rank-ness into account can be obtained with some incoherence assumption. Finally, we discuss\nthe difference between the overlapped approach and the latent approach and provide an explanation for the\nempirically observed superior performance of the latent approach in Figure 1.\n\n3.1 Consistency\nLet $\\mathcal{W}^*$ be the underlying true tensor, and suppose that the noisy version $\\mathcal{Y}$ is obtained as follows:\n\n$$\\mathcal{Y} = \\mathcal{W}^* + \\mathcal{E},$$\n\nwhere $\\mathcal{E} \\in \\mathbb{R}^{n_1 \\times \\cdots \\times n_K}$ is the noise tensor.\nA consistency statement can be obtained as follows (proof is presented in Appendix C):\n\nTheorem 1. 
Assume that the regularization constant $\\lambda$ satisfies $\\lambda \\ge |||\\mathcal{E}|||_{\\underline{S_\\infty/\\infty}}$ (the overlapped $S_\\infty/\\infty$\nnorm of the noise). Then the estimator defined by $\\hat{\\mathcal{W}} = \\mathop{\\mathrm{argmin}}_{\\mathcal{W}} \\big( \\frac{1}{2} |||\\mathcal{Y} - \\mathcal{W}|||_F^2 + \\lambda |||\\mathcal{W}|||_{\\overline{S_1/1}} \\big)$\nsatisfies the inequality\n\n$$|||\\hat{\\mathcal{W}} - \\mathcal{W}^*|||_F \\le 2 \\lambda \\sqrt{\\min_k n_k}. \\qquad (8)$$\n\nIn particular, when the noise goes to zero, $\\mathcal{E} \\to 0$, the right hand side of inequality (8) shrinks to zero.\n\n3.2 Deterministic bound\nThe consistency statement in the previous section only deals with the sum $\\hat{\\mathcal{W}} = \\sum_{k=1}^{K} \\hat{\\mathcal{W}}^{(k)}$ and\nthe statement does not take into account the low-rank-ness of the truth. In this section, we establish\na tighter statement that bounds the errors of individual terms $\\hat{\\mathcal{W}}^{(k)}$.\nTo this end, we need some additional assumptions. 
First, we assume that the unknown tensor $\\mathcal{W}^*$ is\na mixture of K tensors that each are low-rank in a certain mode and we have a noisy observation $\\mathcal{Y}$\nas follows:\n\n$$\\mathcal{Y} = \\mathcal{W}^* + \\mathcal{E} = \\sum_{k=1}^{K} \\mathcal{W}^{*(k)} + \\mathcal{E}, \\qquad (9)$$\n\nwhere $\\bar{r}_k = \\mathrm{rank}(W^{*(k)}_{(k)})$ is the mode-k rank of the kth component $\\mathcal{W}^{*(k)}$; note that this does not\nequal the mode-k rank $r_k$ of $\\mathcal{W}^*$ in general.\nSecond, we assume that the spectral norm of the mode-k unfolding of the lth component is bounded\nby a constant $\\alpha$ for all $k \\ne l$ as follows:\n\n$$\\|W^{*(l)}_{(k)}\\|_{S_\\infty} \\le \\alpha \\quad (\\forall l \\ne k, \\; k, l = 1, \\ldots, K). \\qquad (10)$$\n\nNote that such an additional incoherence assumption has also been used in [1, 3, 11].\nWe employ the following optimization problem to recover the unknown tensor $\\mathcal{W}^*$:\n\n$$\\hat{\\mathcal{W}} = \\mathop{\\mathrm{argmin}}_{\\mathcal{W}} \\Big( \\frac{1}{2} |||\\mathcal{Y} - \\mathcal{W}|||_F^2 + \\lambda |||\\mathcal{W}|||_{\\overline{S_1/1}} \\quad \\mathrm{s.t.} \\quad \\mathcal{W} = \\sum_{k=1}^{K} \\mathcal{W}^{(k)}, \\; \\|W^{(l)}_{(k)}\\|_{S_\\infty} \\le \\alpha, \\; \\forall l \\ne k \\Big), \\qquad (11)$$\n\nwhere $\\lambda > 0$ is a regularization constant. Notice that we have introduced additional spectral norm\nconstraints to control the correlation between the components; see also [1].\nOur deterministic performance bound can be stated as follows (proof is presented in Appendix D):\nTheorem 2. Let $\\hat{\\mathcal{W}}^{(k)}$ be an optimal decomposition of $\\hat{\\mathcal{W}}$ induced by the latent Schatten 1-norm (6).\nAssume that the regularization constant $\\lambda$ satisfies $\\lambda \\ge 2 |||\\mathcal{E}|||_{\\underline{S_\\infty/\\infty}} + \\alpha (K - 1)$. 
Then there is\na universal constant $c$ such that any solution $\\hat{\\mathcal{W}}$ of the minimization problem (11) satisfies the\nfollowing deterministic bound:\n\n$$\\sum_{k=1}^{K} |||\\hat{\\mathcal{W}}^{(k)} - \\mathcal{W}^{*(k)}|||_F^2 \\le c \\lambda^2 \\sum_{k=1}^{K} \\bar{r}_k. \\qquad (12)$$\n\nMoreover, the overall error can be bounded in terms of the multilinear rank of $\\mathcal{W}^*$ as follows:\n\n$$|||\\hat{\\mathcal{W}} - \\mathcal{W}^*|||_F^2 \\le c \\lambda^2 \\min_k r_k. \\qquad (13)$$\n\nNote that in order to get inequality (13), we exploit the arbitrariness of the decomposition $\\mathcal{W}^* = \\sum_{k=1}^{K} \\mathcal{W}^{*(k)}$\nto replace the sum over the ranks with the minimal mode-k rank. This is possible\nbecause a singleton decomposition, i.e., $\\mathcal{W}^{*(k)} = \\mathcal{W}^*$ and $\\mathcal{W}^{*(k')} = 0$ for $k' \\ne k$, is allowed for\nany k.\nComparing the two inequalities (8) and (13), we see that there are two regimes. When the noise is small,\n(8) is tighter. On the other hand, when the noise is larger and/or $\\min_k r_k \\ll \\min_k n_k$, (13) is tighter.\n\n3.3 Gaussian noise\nWhen the elements of the noise tensor $\\mathcal{E}$ are Gaussian, we obtain the following theorem.\nTheorem 3. Assume that the elements of the noise tensor $\\mathcal{E}$ are independent zero-mean Gaussian\nrandom variables with variance $\\sigma^2$. In addition, assume without loss of generality that the\ndimensionalities of $\\mathcal{W}^*$ are sorted in the descending order, i.e., $n_1 \\ge \\cdots \\ge n_K$. 
Then there is a universal\nconstant $c$ such that, with probability at least $1 - \\delta$, any solution of the minimization problem (11)\nwith regularization constant $\\lambda = 2\\sigma \\big( \\sqrt{N/n_K} + \\sqrt{n_1} + \\sqrt{2 \\log(K/\\delta)} \\big) + \\alpha (K - 1)$ satisfies\n\n$$\\frac{1}{N} \\sum_{k=1}^{K} |||\\hat{\\mathcal{W}}^{(k)} - \\mathcal{W}^{*(k)}|||_F^2 \\le c F \\sigma^2 \\frac{\\sum_{k=1}^{K} \\bar{r}_k}{n_K}, \\qquad (14)$$\n\nwhere $F = \\Big( \\Big( 1 + \\sqrt{\\frac{n_1 n_K}{N}} \\Big) + \\Big( \\sqrt{2 \\log(K/\\delta)} + \\frac{\\alpha (K - 1)}{2 \\sigma} \\Big) \\sqrt{\\frac{n_K}{N}} \\Big)^2$ is a factor that mildly depends\non the dimensionalities and the constant $\\alpha$ in (10).\n\nNote that the theoretically optimal choice of regularization constant $\\lambda$ is independent of the ranks of\nthe truth $\\mathcal{W}^*$ or its factors in (9), which are unknown in practice.\nAgain we can obtain a bound corresponding to the minimum rank singleton decomposition as in\ninequality (13) as follows:\n\n$$\\frac{1}{N} |||\\hat{\\mathcal{W}} - \\mathcal{W}^*|||_F^2 \\le c F \\sigma^2 \\frac{\\min_k r_k}{n_K}, \\qquad (15)$$\n\nwhere $F$ is the same factor as in Theorem 3.\n\n3.4 Comparison with the overlapped approach\n\nInequality (15) explains the superior performance of the latent approach for tensor decomposition in\nFigure 1. 
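
The regularization constant and the factor F of Theorem 3 are closed-form expressions, so they can be evaluated directly. The sketch below reads the constant as 2σ(√(N/n_K) + √n_1 + √(2 log(K/δ))) + α(K − 1) and F as the squared factor that multiplies σ² in bound (14). The function names theorem3_lambda and theorem3_factor, the choice δ = 0.05, and the choice α = 0 are illustrative assumptions, not values taken from the paper.

```python
import math

def theorem3_lambda(dims, sigma, alpha=0.0, delta=0.05):
    # dims may be given in any order; n1 is the largest and nK the smallest dimension.
    K = len(dims)
    N = math.prod(dims)
    n1, nK = max(dims), min(dims)
    noise_term = math.sqrt(N / nK) + math.sqrt(n1) + math.sqrt(2 * math.log(K / delta))
    return 2 * sigma * noise_term + alpha * (K - 1)

def theorem3_factor(dims, sigma, alpha=0.0, delta=0.05):
    # Factor F appearing on the right hand side of bounds (14) and (15).
    K = len(dims)
    N = math.prod(dims)
    n1, nK = max(dims), min(dims)
    tail = math.sqrt(2 * math.log(K / delta)) + alpha * (K - 1) / (2 * sigma)
    return (1 + math.sqrt(n1 * nK / N) + tail * math.sqrt(nK / N)) ** 2

dims, sigma = (50, 50, 20), 0.1
lam = theorem3_lambda(dims, sigma)
F = theorem3_factor(dims, sigma)
print(lam, F)
```

Under this reading the two quantities are tied together by λ² = 4σ²(N/n_K)F, which is how bound (12) turns into bound (14) after dividing by N; neither quantity involves the ranks of the unknown tensor.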
The inequality obtained in [26] for the overlapped approach that uses the overlapped Schatten\n1-norm (1) can be stated as follows:\n\n$$\\frac{1}{N} |||\\hat{\\mathcal{W}} - \\mathcal{W}^*|||_F^2 \\le c' \\sigma^2 \\Big( \\frac{1}{K} \\sum_{k=1}^{K} \\sqrt{\\frac{1}{n_k}} \\Big)^2 \\Big( \\frac{1}{K} \\sum_{k=1}^{K} \\sqrt{r_k} \\Big)^2. \\qquad (16)$$\n\nComparing inequalities (15) and (16), we notice that the complexity of the overlapped approach\ndepends on the average (square root) of the mode-k ranks $r_1, \\ldots, r_K$, whereas that of the latent\napproach only grows linearly against the minimum mode-k rank. Interestingly, the latent approach\nperforms as if it knows the mode with the minimum rank, although such information is not given.\nRecently, Mu et al. [19] proved a lower bound on the number of measurements for solving linear\ninverse problems via the overlapped approach. Although the setting is different, the lower bound\ndepends on the minimum mode-k rank, which agrees with the complexity of the latent approach.\n\n4 Numerical results\n\nIn this section, we numerically confirm the theoretically obtained scaling behavior.\nThe goal of this experiment is to recover the true low-rank tensor $\\mathcal{W}^*$ from a noisy observation $\\mathcal{Y}$.\nWe randomly generated the true low-rank tensors $\\mathcal{W}^*$ of size $50 \\times 50 \\times 20$ or $80 \\times 80 \\times 40$ with\nvarious mode-k ranks $(r_1, r_2, r_3)$. A low-rank tensor is generated by first randomly drawing the\n$r_1 \\times r_2 \\times r_3$ core tensor from the standard normal distribution and multiplying each of its modes by an orthogonal factor\nmatrix drawn uniformly at random. The observation tensor $\\mathcal{Y}$ is obtained by adding Gaussian\nnoise with standard deviation $\\sigma = 0.1$. 
There are no missing entries in this experiment.\n\nFigure 2: The performance of the overlapped approach and the latent approach for tensor decomposition is\nshown against their theoretically predicted complexity measures (see Eqs. (17) and (18)). The right\npanel shows the improvement of the latent approach over the overlapped approach against the ratio\nof their complexity measures.\n\nFor each observation $\\mathcal{Y}$, we computed tensor decompositions using the overlapped approach and the\nlatent approach (11). For the optimization, we used the algorithms\u00b2 based on the alternating direction\nmethod of multipliers described in Tomioka et al. [25]. We computed the solutions for 20 candidate\nregularization constants ranging from 0.1 to 100 and report the results for three representative values\nfor each method.\nWe measured the quality of the solutions obtained by the two approaches by the mean squared error\n(MSE) $|||\\hat{\\mathcal{W}} - \\mathcal{W}^*|||_F^2 / N$. In order to make our theoretical predictions more concrete, we define\nthe quantities in the right hand side of the bounds (16) and (14) as Tucker rank (TR) complexity and\nLatent rank (LR) complexity, respectively, as follows:\n\n$$\\text{TR complexity} = \\Big( \\frac{1}{K} \\sum_{k=1}^{K} \\sqrt{\\frac{1}{n_k}} \\Big)^2 \\Big( \\frac{1}{K} \\sum_{k=1}^{K} \\sqrt{r_k} \\Big)^2, \\qquad (17)$$\n\n$$\\text{LR complexity} = \\frac{\\sum_{k=1}^{K} \\bar{r}_k}{n_K}, \\qquad (18)$$\n\nwhere without loss of generality we assume $n_1 \\ge \\cdots \\ge n_K$. We have ignored terms like $\\sqrt{n_k/N}$\nbecause they are negligible for $n_k \\approx 50$ and $N \\approx 50{,}000$. The TR complexity is equivalent to the\nnormalized rank in [26]. 
Note that the TR complexity (17) is defined in terms of the multilinear rank\n$(r_1, \\ldots, r_K)$ of the truth $\\mathcal{W}^*$, whereas the LR complexity (18) is defined in terms of the ranks of the\nlatent factors $(\\bar{r}_1, \\ldots, \\bar{r}_K)$ in (9). 
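
The two complexity measures are simple functions of the tensor dimensions and ranks. The following sketch takes the TR complexity as the squared product of the average of √(1/n_k) and the average of √r_k, and the LR complexity as the sum of the latent ranks divided by the smallest dimension. The function names and the example ranks are assumptions chosen to mirror the rank-(40, 40, 3) tensor of Figure 1, with a singleton decomposition on the third mode.

```python
import math

def tr_complexity(dims, ranks):
    # Tucker rank complexity: uses the multilinear rank of the true tensor.
    K = len(dims)
    avg_inv_dim = sum(math.sqrt(1.0 / n) for n in dims) / K
    avg_sqrt_rank = sum(math.sqrt(r) for r in ranks) / K
    return (avg_inv_dim * avg_sqrt_rank) ** 2

def lr_complexity(dims, latent_ranks):
    # Latent rank complexity: uses the ranks of the latent components.
    return sum(latent_ranks) / min(dims)

dims = (50, 50, 20)
tr = tr_complexity(dims, (40, 40, 3))
# Singleton decomposition: the whole tensor is assigned to the third component.
lr = lr_complexity(dims, (0, 0, 3))
print(tr, lr, tr / lr)
```

In this hypothetical setting the TR complexity is several times larger than the LR complexity, which is the regime in which the bounds favor the latent approach.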
In order to find a decomposition that minimizes the right hand side\nof (18), we ran the latent approach on the true tensor $\\mathcal{W}^*$ without noise, and took the minimum of\nthe sum of ranks found by the run and $\\min_k r_k$, i.e., the minimal mode-k rank (because a singleton\nsolution is also allowed). The whole procedure is repeated 10 times and averaged.\nFigure 2 shows the results of the experiment. The left panel shows the MSE of the overlapped\napproach against the TR complexity (17). The middle panel shows the MSE of the latent approach\nagainst the LR complexity (18). The right panel shows the improvement (i.e., MSE of the overlapped\napproach over that of the latent approach) against the ratio of the respective complexity measures.\nFirst, from the left panel, we can confirm that as predicted by [26], the MSE of the overlapped\napproach scales linearly against the TR complexity (17) for each value of the regularization constant.\nFrom the central panel, we can clearly see that the MSE of the latent approach scales linearly against\nthe LR complexity (18) as predicted by Theorem 3. 
The series with \u25b3 ($\\lambda = 3.79$ for $50 \\times 50 \\times 20$,\n$\\lambda = 5.46$ for $80 \\times 80 \\times 40$) is mostly below other series, which means that the optimal choice of the\nregularization constant is independent of the rank of the true tensor and only depends on the size;\nthis agrees with the condition on $\\lambda$ in Theorem 3. Since the blue series and red series with the same\nmarkers lie on top of each other (especially the series with \u25b3 for which the optimal regularization\nconstant is chosen), we can see that our theory predicts not only the scaling against the latent ranks\nbut also that against the size of the tensor correctly. Note that the regularization constants are scaled\nby roughly 1.6 to account for the difference in the dimensionality.\n\n\u00b2 The solver is available online: https://github.com/ryotat/tensor.\n\n[Figure 2 panels. Overlapped approach: mean squared error (overlap) against Tucker rank complexity, for size = [50 50 20] with $\\lambda$ = 0.43, 0.89, 3.79 and size = [80 80 40] with $\\lambda$ = 0.62, 1.27, 5.46. Latent approach: mean squared error (latent) against latent rank complexity, for size = [50 50 20] with $\\lambda$ = 0.89, 3.79, 11.29 and size = [80 80 40] with $\\lambda$ = 1.27, 5.46, 16.24. Comparison: MSE (overlap) / MSE (latent) against TR complexity / LR complexity.]\n\nThe right panel reveals that in many cases the latent approach performs better than the overlapped\napproach, i.e., MSE (overlap) / MSE (latent) is greater than one. Moreover, we can see that the success\nof the latent approach relative to the overlapped approach is correlated with a high TR complexity\nto LR complexity ratio. 
Indeed, we found that an optimal decomposition of the true tensor $\\mathcal{W}^*$\nwas typically a singleton decomposition corresponding to the smallest Tucker rank (see Section 3.2).\nNote that the two approaches perform almost identically when they are under-regularized (crosses).\nThe improvements here are milder than those in Figure 1. This is because most of the randomly\ngenerated low-rank tensors were simultaneously low-rank to some degree. It is encouraging that the\nlatent approach performs at least as well as the overlapped approach in such situations as well.\n\n5 Conclusion\n\nIn this paper, we have presented a framework for structured Schatten norms. The current framework\nincludes both the overlapped Schatten 1-norm and latent Schatten 1-norm recently proposed in the\ncontext of convex-optimization-based tensor decomposition [9, 15, 23, 25], and connects these\nstudies to the broader studies on structured sparsity [2, 13, 17, 21]. Moreover, we have shown a duality\nthat holds between the two types of norms.\nFurthermore, we have rigorously studied the performance of the latent approach for tensor\ndecomposition. We have shown the consistency of the latent Schatten 1-norm minimization. Next, we have\nanalyzed the denoising performance of the latent approach and shown that the error of the latent\napproach is upper bounded by the minimal mode-k rank, which contrasts sharply against the average\n(square root) dependency of the overlapped approach analyzed in [26]. This explains the empirically\nobserved superior performance of the latent approach compared to the overlapped approach. 
The\nmost difficult case for the overlapped approach is when the unknown tensor is only low-rank in one\nmode as in Figure 1.\nWe have also confirmed through numerical simulations that our analysis precisely predicts the\nscaling of the mean squared error as a function of the dimensionalities and the sum of ranks of the factors\nof the unknown tensor, which is dominated by the minimal mode-k rank. Unlike mode-k ranks, the\nranks of the factors are not easy to compute. However, note that the theoretically optimal choice of\nthe regularization constant does not depend on these quantities.\nThus, we have theoretically and empirically shown that for noisy tensor decomposition, the latent\napproach is more likely to perform better than the overlapped approach. Analyzing the performance\nof the latent approach for tensor completion would be an important future work.\nThe structured Schatten norms proposed in this paper include norms for tensors that are not\nemployed in practice yet. Therefore, it would be interesting to explore various extensions, such as\nusing the overlapped $S_1/\\infty$-norm instead of the $S_1/1$-norm or a non-sparse tensor decomposition.\nAcknowledgment: This work was carried out while both authors were at The University of Tokyo.\nThis work was partially supported by JSPS KAKENHI 25870192 and 25730013, and the Aihara\nProject, the FIRST program from JSPS, initiated by CSTP.\n\nReferences\n[1] A. Agarwal, S. Negahban, and M. J. Wainwright. Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. The Annals of Statistics, 40(2):1171\u20131197, 2012.\n\n[2] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. In Optimization for Machine Learning. MIT Press, 2011.\n\n[3] E. J. Candes, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Technical report, arXiv:0912.3599, 2009.\n\n[4] V. Chandrasekaran, B. Recht, P. Parrilo, and A. Willsky. 
The convex geometry of linear inverse problems, preprint. Technical report, arXiv:1012.0621v2, 2010.\n\n[5] L. De Lathauwer, B. De Moor, and J. Vandewalle. A multilinear singular value decomposition. SIAM J. Matrix Anal. Appl., 21(4):1253\u20131278, 2000.\n\n[6] L. De Lathauwer, B. De Moor, and J. Vandewalle. On the best rank-1 and rank-(R1, R2, ..., RN) approximation of higher-order tensors. SIAM J. Matrix Anal. Appl., 21(4):1324\u20131342, 2000.\n\n[7] M. Fazel, H. Hindi, and S. P. Boyd. A Rank Minimization Heuristic with Application to Minimum Order System Approximation. In Proc. of the American Control Conference, 2001.\n\n[8] R. Foygel and N. Srebro. Concentration-based guarantees for low-rank matrix reconstruction. Technical report, arXiv:1102.3923, 2011.\n\n[9] S. Gandy, B. Recht, and I. Yamada. Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse Problems, 27:025010, 2011.\n\n[10] F. L. Hitchcock. The expression of a tensor or a polyadic as a sum of products. J. Math. Phys., 6(1):164\u2013189, 1927.\n\n[11] D. Hsu, S. M. Kakade, and T. Zhang. Robust matrix decomposition with sparse corruptions. Information Theory, IEEE Transactions on, 57(11):7221\u20137234, 2011.\n\n[12] A. Jalali, P. Ravikumar, S. Sanghavi, and C. Ruan. A dirty model for multi-task learning. In Advances in NIPS 23, pages 964\u2013972. 2010.\n\n[13] R. Jenatton, J. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. J. Mach. Learn. Res., 12:2777\u20132824, 2011.\n\n[14] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455\u2013500, 2009.\n\n[15] J. Liu, P. Musialski, P. Wonka, and J. Ye. Tensor completion for estimating missing values in visual data. In Proc. ICCV, 2009.\n\n[16] J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Convex and network flow optimization for structured sparsity. J. Mach. Learn. 
Res., 12:2681\u20132720, 2011.\n\n[17] A. Maurer and M. Pontil. Structured sparsity and generalization. Technical report, arXiv:1108.3476, 2011.\n\n[18] M. M\u00f8rup. Applications of tensor (multiway array) factorizations and decompositions in data mining. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1):24\u201340, 2011.\n\n[19] C. Mu, B. Huang, J. Wright, and D. Goldfarb. Square deal: Lower bounds and improved relaxations for tensor recovery. arXiv preprint arXiv:1307.5870, 2013.\n\n[20] S. Negahban, P. Ravikumar, M. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers. In Advances in NIPS 22, pages 1348\u20131356. 2009.\n\n[21] G. Obozinski, L. Jacob, and J.-P. Vert. Group lasso with overlaps: the latent group lasso approach. Technical report, arXiv:1110.0413, 2011.\n\n[22] B. Recht, M. Fazel, and P. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471\u2013501, 2010.\n\n[23] M. Signoretto, L. De Lathauwer, and J. Suykens. Nuclear norms for tensors and their use for convex multilinear estimation. Technical Report 10-186, ESAT-SISTA, K.U.Leuven, 2010.\n\n[24] N. Srebro and A. Shraibman. Rank, trace-norm and max-norm. In Proc. of the 18th Annual Conference on Learning Theory (COLT), pages 545\u2013560. Springer, 2005.\n\n[25] R. Tomioka, K. Hayashi, and H. Kashima. Estimation of low-rank tensors via convex optimization. Technical report, arXiv:1010.0789, 2011.\n\n[26] R. Tomioka, T. Suzuki, K. Hayashi, and H. Kashima. Statistical performance of convex tensor decomposition. In Advances in NIPS 24, pages 972\u2013980. 2011.\n\n[27] L. R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279\u2013311, 1966.\n\n[28] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. 
Technical report, arXiv:1011.3027, 2010.\n", "award": [], "sourceid": 686, "authors": [{"given_name": "Ryota", "family_name": "Tomioka", "institution": "TTI Chicago"}, {"given_name": "Taiji", "family_name": "Suzuki", "institution": "Tokyo Institute of Technology"}]}