{"title": "Multitask learning meets tensor factorization: task imputation via convex optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 2825, "page_last": 2833, "abstract": "We study a multitask learning problem in which each task is parametrized by a weight vector and indexed by a pair of indices, which can be e.g, (consumer, time). The weight vectors can be collected into a tensor and the (multilinear-)rank of the tensor controls the amount of sharing of information among tasks. Two types of convex relaxations have recently been proposed for the tensor multilinear rank. However, we argue that both of them are not optimal in the context of multitask learning in which the dimensions or multilinear rank are typically heterogeneous. We propose a new norm, which we call the scaled latent trace norm and analyze the excess risk of all the three norms. The results apply to various settings including matrix and tensor completion, multitask learning, and multilinear multitask learning. Both the theory and experiments support the advantage of the new norm when the tensor is not equal-sized and we do not a priori know which mode is low rank.", "full_text": "Multitask learning meets tensor factorization: task\n\nimputation via convex optimization\n\nKishan Wimalawarne\n\nTokyo Institute of Technology\n\nMeguro-ku, Tokyo, Japan\n\nMasashi Sugiyama\n\nThe University of Tokyo\nBunkyo-ku, Tokyo, Japan\n\nkishan@sg.cs.titech.ac.jp\n\nsugi@k.u-tokyo.ac.jp\n\nRyota Tomioka\n\nTTI-C\n\nIllinois, Chicago, USA\ntomioka@ttic.edu\n\nAbstract\n\nWe study a multitask learning problem in which each task is parametrized by a\nweight vector and indexed by a pair of indices, which can be e.g, (consumer,\ntime). The weight vectors can be collected into a tensor and the (multilinear-)rank\nof the tensor controls the amount of sharing of information among tasks. Two\ntypes of convex relaxations have recently been proposed for the tensor multilin-\near rank. 
However, we argue that neither of them is optimal in the context of multitask learning, in which the dimensions or the multilinear rank are typically heterogeneous. We propose a new norm, which we call the scaled latent trace norm, and analyze the excess risk of all three norms. The results apply to various settings including matrix and tensor completion, multitask learning, and multilinear multitask learning. Both the theory and experiments support the advantage of the new norm when the tensor is not equal-sized and we do not know a priori which mode is low rank.

1 Introduction

We consider supervised multitask learning problems [1, 6, 7] in which the tasks are indexed by a pair of indices, known as multilinear multitask learning (MLMTL) [17, 19]. For example, when we would like to predict the ratings of different aspects (e.g., quality of service, food, etc.) of restaurants by different customers, the tasks would be indexed by aspects × customers. When each task is parametrized by a weight vector over features, the goal would be to learn a features × aspects × customers tensor. Another possible task dimension would be time, since the ratings may change over time.

This setting is interesting because it allows us to exploit the similarities across different customers as well as the similarities across different aspects or time points. Furthermore, it allows us to perform task imputation, that is, to learn weights for tasks for which we have no training examples. 
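To make the setup concrete, the following is a minimal numpy sketch (all sizes and the rank are made-up illustrations, not values from the paper) of a features × aspects × customers weight tensor with low multilinear rank. Every task's weight vector then lies in a shared low-dimensional subspace, which is what makes information sharing and task imputation possible:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: d features, P aspects, Q customers.
d, P, Q = 8, 3, 5
r1, r2, r3 = 2, 2, 2  # multilinear rank of the task tensor

# Tucker-structured weight tensor W = C x_1 U1 x_2 U2 x_3 U3.
C = rng.standard_normal((r1, r2, r3))
U1 = rng.standard_normal((d, r1))
U2 = rng.standard_normal((P, r2))
U3 = rng.standard_normal((Q, r3))
W = np.einsum('abc,ia,jb,kc->ijk', C, U1, U2, U3)

# The task indexed by (aspect p, customer q) is the mode-1 fiber W[:, p, q];
# every such weight vector lies in the r1-dimensional column space of U1,
# which is the structure that multitask learning can exploit.
w_pq = W[:, 1, 3]
coeff, *_ = np.linalg.lstsq(U1, w_pq, rcond=None)
print(np.allclose(U1 @ coeff, w_pq))  # True: the fiber lies in span(U1)
```

Because the whole tensor is determined by far fewer parameters than d · P · Q, a task (p, q) with no training examples can still inherit a sensible weight vector from the shared factors.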
On the other hand, the conventional matrix-based multitask learning (MTL) [2, 3, 13, 16] may fail to capture the higher-order structure if we consider learning a flat features × tasks matrix, and it would require at least r samples for each task, where r is the rank of the matrix to be learned.

Recently, several norms that induce low-rank tensors in the sense of the Tucker decomposition or the multilinear singular value decomposition [8, 9, 14, 25] have been proposed. The mean squared error for recovering an n_1 × ··· × n_K tensor of multilinear rank (r_1, ..., r_K) from its noisy version scales as O((1/K Σ_{k=1}^K √r_k)² (1/K Σ_{k=1}^K 1/√n_k)²) for the overlapped trace norm [23]. On the other hand, the error of the latent trace norm scales as O(min_k r_k / min_k n_k) in the same setting [21]. Thus, while the latent trace norm has the better dependence in terms of the multilinear rank r_k, it has the worse dependence in terms of the dimensions n_k.

Table 1: Tensor denoising performance using different norms. The mean squared error |||Ŵ − W*|||_F² / N is shown for the denoising algorithm (3) using the different norms for tensors.

    Overlapped trace norm:    O_p( (1/K Σ_{k=1}^K 1/√n_k)² (1/K Σ_{k=1}^K √r_k)² )
    Latent trace norm:        O_p( min_k r_k / min_k n_k )
    Scaled latent trace norm: O_p( min_k (r_k / n_k) )

Tensors that arise in multitask learning typically have heterogeneous dimensions. For example, the number of aspects for a restaurant (quality of service, food, atmosphere, etc.) would be much smaller than the number of customers or the number of features. In addition, it is a priori unclear which mode (or dimension) would have the most redundancy or sharing that could be exploited by multitask learning. 
Some of the modes may have full rank if there is no sharing of information along them. Therefore, both the latent trace norm and the overlapped trace norm would suffer either from the heterogeneous multilinear rank or from the heterogeneous dimensions in this context.

In this paper, we propose a modification of the latent trace norm whose mean squared error scales as O(min_k (r_k/n_k)) in the same setting, which is better than both of the previously proposed extensions of the trace norm for tensors. We study the excess risk of the three norms through their Rademacher complexities in various settings including matrix completion, multitask learning, and MLMTL. The new analysis also allows us to study the tensor completion setting, which was only empirically studied in [22, 23]. Our analysis consistently shows the advantage of the proposed scaled latent trace norm in various settings in which the dimensions or ranks are heterogeneous. Experiments on both synthetic and real data sets are also consistent with our theoretical findings.

2 Norms for tensors and their denoising performance

Let W ∈ R^{n_1 × ··· × n_K} be a K-way tensor. We denote the total number of entries by N := Π_{k=1}^K n_k. A mode-k fiber of W is an n_k-dimensional vector obtained by fixing all but the kth index. The mode-k unfolding W_(k) of W is the n_k × N/n_k matrix formed by concatenating all the N/n_k mode-k fibers along columns. 
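The mode-k unfolding and the multilinear rank can be computed in a few lines; the following is a sketch in numpy (the column ordering of the unfolding depends on the convention, but any fixed ordering gives the same ranks and trace norms; the example tensor and its rank are our own illustration):

```python
import numpy as np

def unfold(W, k):
    """Mode-k unfolding: move axis k to the front and flatten the rest,
    giving an n_k x (N / n_k) matrix whose columns are mode-k fibers."""
    return np.moveaxis(W, k, 0).reshape(W.shape[k], -1)

def multilinear_rank(W):
    """(r_1, ..., r_K) with r_k the rank of the mode-k unfolding."""
    return tuple(np.linalg.matrix_rank(unfold(W, k)) for k in range(W.ndim))

rng = np.random.default_rng(0)
# A 4 x 5 x 6 tensor of multilinear rank (2, 2, 2), built in Tucker form.
C = rng.standard_normal((2, 2, 2))
A, B, D = (rng.standard_normal((n, 2)) for n in (4, 5, 6))
W = np.einsum('abc,ia,jb,kc->ijk', C, A, B, D)
print(multilinear_rank(W))  # → (2, 2, 2)
```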
We say that W has multilinear rank (r_1, ..., r_K) if r_k = rank(W_(k)).

2.1 Existing norms for tensors

First we review two norms proposed in the literature in order to convexify tensor decomposition. The overlapped trace norm (see [12, 15, 18, 22]) is defined as the sum of the trace norms of the mode-k unfoldings as follows:

    |||W|||_overlap = Σ_{k=1}^K ∥W_(k)∥_tr,    (1)

where ∥·∥_tr is the trace norm (also known as the nuclear norm) [10, 20], which is defined as the sum of the singular values. Romera-Paredes et al. [17] have used the overlapped trace norm in MLMTL. The latent trace norm [21, 22] is defined as the infimum over K tensors as follows:

    |||W|||_latent = inf_{W^(1) + ··· + W^(K) = W} Σ_{k=1}^K ∥W^(k)_(k)∥_tr.    (2)

Table 1 summarizes the denoising performance in mean squared error analyzed in Tomioka and Suzuki [21] for the above two norms. The setting is as follows: we observe a noisy version Y of a tensor W* with multilinear rank (r_1, ..., r_K) and would like to recover W* by solving

    Ŵ = argmin_W (1/2) |||W − Y|||_F² + λ |||W|||_⋆,    (3)

where |||·|||_⋆ is either the overlapped trace norm or the latent trace norm. We can see that while the latent trace norm has the better dependence in terms of the multilinear rank, it has the worse dependence in terms of the dimensions. Intuitively, the latent trace norm recognizes the mode with the lowest rank. However, it does not have a good control of the dimensions; in fact, the factor 1/√(min_k n_k) comes from the fact that for a random tensor X with i.i.d. Gaussian entries, the expectation of the dual norm ∥X∥_latent* = max_k ∥X_(k)∥_op behaves like O_p(√(max_k N/n_k)), where ∥·∥_op is the operator norm.

2.2 A new norm

In order to correct the unfavorable behavior of the dual norm, we propose the scaled latent trace norm. It is defined similarly to the latent trace norm but with weights 1/√n_k as follows:

    |||W|||_scaled = inf_{W^(1) + ··· + W^(K) = W} Σ_{k=1}^K (1/√n_k) ∥W^(k)_(k)∥_tr.    (4)

Now the expectation of the dual norm ∥X∥_scaled* = max_k √n_k ∥X_(k)∥_op behaves like O_p(√N) for X with random i.i.d. Gaussian entries, and combined with the following relation

    |||W|||_scaled ≤ min_k √(r_k/n_k) |||W|||_F,    (5)

we obtain the scaling of the mean squared error in the last column of Table 1. We can see that the scaled latent trace norm recognizes the mode with the lowest rank relative to its dimension.

3 Theory for multilinear multitask learning

We consider T = PQ supervised learning tasks. Training samples (x_ipq, y_ipq)_{i=1}^{m_pq} ((p, q) ∈ S) are provided for a relatively small fraction of the task index pairs S ⊂ [P] × [Q]. Each task is parametrized by a weight vector w_pq ∈ R^d, which can be collected into a 3-way tensor W = (w_pq) ∈ R^{d×P×Q} whose (p, q) fiber is w_pq. 
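As a quick concrete check of the norm definitions in Section 2, the overlapped norm (1) and the two dual norms have direct closed forms in terms of unfoldings; the latent and scaled latent norms themselves require solving the infima in (2) and (4), so only their duals are computed in this sketch (the example tensor size is our own illustration):

```python
import numpy as np

def unfold(W, k):
    # Mode-k unfolding: n_k x (N / n_k) matrix of mode-k fibers.
    return np.moveaxis(W, k, 0).reshape(W.shape[k], -1)

def overlap_norm(W):
    # |||W|||_overlap: sum over modes of the nuclear norms of the unfoldings.
    return sum(np.linalg.norm(unfold(W, k), 'nuc') for k in range(W.ndim))

def latent_dual(W):
    # Dual of the latent trace norm: max_k of the operator norms.
    return max(np.linalg.norm(unfold(W, k), 2) for k in range(W.ndim))

def scaled_latent_dual(W):
    # Dual of the scaled latent trace norm: max_k sqrt(n_k) * operator norm.
    return max(np.sqrt(W.shape[k]) * np.linalg.norm(unfold(W, k), 2)
               for k in range(W.ndim))

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3, 10))  # heterogeneous dimensions
print(overlap_norm(X), latent_dual(X), scaled_latent_dual(X))
```

Averaging latent_dual and scaled_latent_dual over many Gaussian draws reproduces the O_p(√(max_k N/n_k)) versus O_p(√N) behavior discussed above.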
We define the learning problem as follows:

    Ŵ = argmin_{W ∈ R^{d×P×Q}} L̂(W), subject to |||W|||_⋆ ≤ B_0,    (6)

where the norm |||·|||_⋆ is either the overlapped trace norm, the latent trace norm, or the scaled latent trace norm, and the empirical risk L̂ is defined as follows:

    L̂(W) = (1/|S|) Σ_{(p,q)∈S} (1/m_pq) Σ_{i=1}^{m_pq} ℓ(⟨x_ipq, w_pq⟩ − y_ipq).

The true risk we are interested in minimizing is defined as follows:

    L(W) = (1/(PQ)) Σ_{p,q} E_{(x,y)∼P_pq} ℓ(⟨x, w_pq⟩ − y),

where P_pq is the distribution from which the samples (x_ipq, y_ipq)_{i=1}^{m_pq} are drawn.

The next lemma relates the excess risk L(Ŵ) − L(W*) with the expected dual norm E|||D|||_⋆* through Rademacher complexity.

Lemma 1. We assume that the output y_ipq is bounded as |y_ipq| ≤ b, and that the number of samples m_pq ≥ m > 0 for the observed tasks. We also assume that the loss function ℓ is Lipschitz continuous with constant Λ, bounded in [0, c], and ℓ(0) = 0. Let W* be any tensor such that |||W*|||_⋆ ≤ B_0. Then with probability at least 1 − δ, any minimizer of (6) satisfies the following bound:

    L(Ŵ) − L(W*) ≤ 2Λ ( (2B_0/|S|) E|||D|||_⋆* + b √(ρ/(|S|m)) ) + c′ √(log(2/δ)/(2|S|m)),

where c′ = c + 1, |||·|||_⋆* is the dual norm of |||·|||_⋆, and ρ := (1/|S|) Σ_{(p,q)∈S} m_pq/m. The tensor D ∈ R^{d×P×Q} is defined as the sum D = Σ_{(p,q)∈S} (1/m_pq) Σ_{i=1}^{m_pq} Z_ipq, where Z_ipq ∈ R^{d×P×Q} is defined as

    (p′, q′)th fiber of Z_ipq = σ_ipq x_ipq if (p′, q′) = (p, q), and 0 otherwise.

Here σ_ipq ∈ {−1, +1} are Rademacher random variables, and the expectation in the above inequality is with respect to σ_ipq, the random draw of tasks S, and the training samples (x_ipq, y_ipq)_{i=1}^{m_pq}.

Proof. The proof is a standard one following the line of [5] and is presented in Appendix A.

The next theorem computes the expected dual norm E|||D|||_⋆* for the three norms for tensors (the proof can be found in Appendix B).

Theorem 1. We assume that C_pq := E[x_ipq x_ipq^⊤] ⪯ (κ/d) I_d and that there is a constant R > 0 such that ∥x_ipq∥ ≤ R almost surely. Let us define

    D_1 := d + PQ,  D_2 := P + dQ,  D_3 := Q + dP.

In order to simplify the presentation, we assume that max_k D_k ≥ 3 and dPQ ≥ max(d², P², Q²). For the overlapped trace norm, the latent trace norm, and the scaled latent trace norm, the expectation E|||D|||_⋆* can be bounded as follows:

    (1/|S|) E|||D|||_overlap* ≤ C min_k ( √(κ D_k log D_k / (m|S| dPQ)) + (R/(m|S|)) log D_k ),    (7)

    (1/|S|) E|||D|||_latent* ≤ C′ ( √(κ max_k (D_k log D_k) / (m|S| dPQ)) + (R/(m|S|)) log(max_k D_k) ),    (8)

    (1/|S|) E|||D|||_scaled* ≤ C″ ( √(κ log(max_k D_k) / (m|S|)) + (R √(max_k n_k) / (m|S|)) log(max_k D_k) ),    (9)

where C, C′, C″ are constants, n_1 = d, n_2 = P, and n_3 = Q. Furthermore, if m|S| ≥ R²(max_k n_k) log(max_k D_k)/κ, the O(1/(m|S|)) terms in the above inequalities can be dropped.

Note that the assumption that the norm of x_ipq is bounded is natural because the target y_ipq is also bounded. The parameter κ in the assumption C_pq ⪯ (κ/d) I_d controls the amount of correlation in the data. 
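The random tensor D of Lemma 1 can be assembled directly from the inputs and Rademacher signs, and a crude Monte Carlo over its draws gives a handle on the quantity bounded in Theorem 1. The sketch below evaluates only the latent dual norm, max_k ∥D_(k)∥_op; all sizes, the Gaussian inputs, and the sampling of S are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, P, Q, m = 6, 4, 5, 3  # made-up sizes; m samples per observed task

# Observe a random subset S of task index pairs.
S = [(p, q) for p in range(P) for q in range(Q) if rng.random() < 0.5]

def sample_D():
    """One draw of D = sum_{(p,q) in S} (1/m_pq) sum_i sigma_ipq * Z_ipq."""
    D = np.zeros((d, P, Q))
    for (p, q) in S:
        x = rng.standard_normal((m, d))          # inputs x_ipq
        sigma = rng.choice([-1.0, 1.0], size=m)  # Rademacher signs
        # Only the (p, q) fiber of each Z_ipq is nonzero.
        D[:, p, q] = (sigma[:, None] * x).sum(axis=0) / m
    return D

def latent_dual(D):
    return max(np.linalg.norm(np.moveaxis(D, k, 0).reshape(D.shape[k], -1), 2)
               for k in range(3))

# Monte Carlo estimate of (1/|S|) E |||D|||_latent*, the left side of (8),
# up to constants and to the particular input distribution assumed here.
est = np.mean([latent_dual(sample_D()) for _ in range(50)])
print(est / len(S))
```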
Since Tr(C) = E∥x_ipq∥² ≤ R², we have κ = O(1) when the features are uncorrelated; on the other hand, we have κ = O(d) if they lie in a one-dimensional subspace. The number of samples m|S| = Õ(max_k n_k) is enough to drop the O(1/(m|S|)) term even if κ = O(1).

Now we state the consequences of Theorem 1 for the three norms for tensors. The common assumptions are the same as in Lemma 1 and Theorem 1. We also assume m|S| ≥ R²(max_k n_k) log(max_k D_k)/κ to drop the O(1/(m|S|)) terms. Let W* be any d × P × Q tensor with multilinear rank (r_1, r_2, r_3) and bounded element-wise as |||W*|||_ℓ∞ ≤ B.

Corollary 1 (Overlapped trace norm). With probability at least 1 − δ, any minimizer of (6) with |||W|||_overlap ≤ B √(∥r∥_{1/2} dPQ) satisfies the following inequality:

    L(Ŵ) − L(W*) ≤ c_1 Λ B √((κ/(m|S|)) ∥r∥_{1/2} min_k (D_k log D_k)) + c_2 Λ b √(ρ/(m|S|)) + c_3 √(log(2/δ)/(m|S|)),

where ∥r∥_{1/2} = ((1/3) Σ_{k=1}^3 √r_k)² and c_1, c_2, c_3 are constants.

Note that Tomioka et al. [23] obtained a bound that depends on ((1/3) Σ_{k=1}^3 √D_k)² instead of min_k (D_k log D_k). Although the minimum may look better than the average, our bound has the worse constant K = 3 hidden in c_1. The log D_k factor allows us to apply the above result to the setting of tensor completion, as we show below.

Corollary 2 (Latent trace norm). With probability at least 1 − δ, any minimizer of (6) with |||W|||_latent ≤ B √(min_k r_k dPQ) satisfies the following inequality:

    L(Ŵ) − L(W*) ≤ c′_1 Λ B √((κ/(m|S|)) min_k r_k max_k (D_k log D_k)) + c_2 Λ b √(ρ/(m|S|)) + c_3 √(log(2/δ)/(m|S|)),

where c′_1, c_2, c_3 are constants.

Corollary 3 (Scaled latent trace norm). 
With probability at least 1 − δ, any minimizer of (6) with |||W|||_scaled ≤ B √(min_k (r_k/n_k) dPQ) satisfies the following inequality:

    L(Ŵ) − L(W*) ≤ c″_1 Λ B √((κ/(m|S|)) min_k (r_k/n_k) dPQ log(max_k D_k)) + c_2 Λ b √(ρ/(m|S|)) + c_3 √(log(2/δ)/(m|S|)),

where n_1 = d, n_2 = P, n_3 = Q, and c″_1, c_2, c_3 are constants.

Table 2: Sample complexities of the overlapped trace norm, the latent trace norm, and the scaled latent trace norm in various settings: matrix completion [11], MTL [16], MLMTL (homogeneous and heterogeneous cases) [17], and tensor completion. The common 1/ε² factor is omitted from the sample complexities. (The individual entries are discussed in the text below.)

We summarize the implications of the above corollaries for different settings in Table 2. We almost recover the settings for matrix completion [11] and multitask learning (MTL) [16]. Note that these simpler problems sometimes disguise themselves as the more general tensor completion or multilinear multitask learning problems. Therefore it is important that the new tensor-based norms adapt to the simplicity of the problems in these cases.

Matrix completion is the case d = κ = m = r_1 = 1, and we assume that r_2 = r_3 = r < P, Q. The sample complexities are the numbers of samples |S| that we need to make the leading terms in Corollaries 1, 2, and 3 equal ε. We can see that the overlapped trace norm and the scaled latent trace norm recover the known result for matrix completion [11]. The plain latent trace norm requires O(PQ) samples because it recognizes the first mode as the mode with the lowest rank, 1. Although the rank r of the last two modes is low relative to their dimensions, the latent trace norm fails to recognize this. Note that ∥r∥_{1/2} ≤ r. This is not a contradiction, because in Cor. 1 we assume that the overlapped trace norm is bounded, which may or may not be true for matrix completion. In fact, in this case, the overlapped trace norm is an Elastic-net-type regularizer (trace norm + Frobenius norm).

In multitask learning (MTL), we set P = T (the number of tasks) and Q = 1. The first and the second modes have a low rank r. We also assume that all the pairs (p, q) are observed (|S| = T), as in [16]. The sample complexities are defined the same way as above with respect to the number of samples m, because |S| is fixed. Our bound for the overlapped trace norm is almost as good as the one in [16] but has a multiplicative log(d + T) factor (as opposed to their additive log(mT) term). Also note that the results in [16] can be applied when d is much larger than T. Turning back to our bounds, the scaled latent trace norm performs as well as knowing the mode with the lowest rank (the first and the second modes; see also [21]). However, similarly to the matrix completion case above, the plain latent trace norm fails to recognize the low-rank-ness of the first two modes and requires O(d) samples, because the third mode has the lowest rank.

In multilinear multitask learning (MLMTL) [17], any mode could possibly be low rank, but it is a priori unknown which. The sample complexities are defined the same way as above with respect to m|S|. The homogeneous case is when d = P = Q. The heterogeneous case is when the first mode or the third mode is low rank but P ≤ r < d. Similarly to the above two settings, the overlapped trace norm has a mild dependence on the dimensions but a higher dependence on the rank, ∥r∥_{1/2} ≥ min_k r_k. The latent trace norm performs as well as knowing the mode that has the lowest rank in the homogeneous case. However, it fails to recognize the mode with the lowest rank relative to its dimension. 
The scaled latent trace norm does this, and although it has a higher logarithmic dependence, it is competitive in both cases.

Finally, our bounds also hold for tensor completion. Although Tomioka et al. [22, 23] studied tensor completion algorithms, their analysis assumed that the inputs x_ipq are drawn from a Gaussian distribution, which does not hold for tensor completion. Note that in our setting x_ipq can be an indicator vector that has a one in the jth position uniformly over 1, ..., d. In this case, κ = 1. The sample complexities of the different norms with respect to m|S| are shown in the last row of Table 2. The sample complexity for the overlapped trace norm is the same as the one in [23] up to a logarithmic factor. The sample complexities for the latent and scaled latent trace norms are new. Again we can see that while the latent trace norm recognizes the mode with the lowest rank, the scaled latent trace norm is able to recognize the mode with the lowest rank relative to its dimension.

4 Experiments

We conducted several experiments to evaluate the performance of the tensor-based multitask learning settings discussed in Section 3. In Section 4.1, we discuss simulations conducted using synthetic data sets. In Sections 4.2 and 4.3, we discuss experiments on two real-world data sets, namely the Restaurant data set [26] and the School Effectiveness data set [3, 4]. Both of our real-world data sets have heterogeneous dimensions (see Figure 2), and it is a priori unclear which mode has the most information sharing.

4.1 Synthetic data sets

The true d × P × Q tensor W* was generated by first sampling an r_1 × r_2 × r_3 core tensor and then multiplying a random orthonormal matrix to each of its modes. 
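The generation procedure just described can be sketched as follows (orthonormal factors via QR; the noise variance 0.1 and the sizes of the third synthetic experiment are from the text, while m and other details are our own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, P, Q = 10, 3, 10
r1, r2, r3 = 3, 3, 8  # heterogeneous multilinear rank, as in Figure 1(c)
m = 5                 # assumed number of training examples per task

# Core tensor multiplied by a random orthonormal matrix along each mode.
core = rng.standard_normal((r1, r2, r3))
U1, _ = np.linalg.qr(rng.standard_normal((d, r1)))
U2, _ = np.linalg.qr(rng.standard_normal((P, r2)))
U3, _ = np.linalg.qr(rng.standard_normal((Q, r3)))
W_true = np.einsum('abc,ia,jb,kc->ijk', core, U1, U2, U3)

# For each task (p, q): Gaussian inputs, linear outputs with N(0, 0.1) noise.
X = rng.standard_normal((P, Q, m, d))
noise = np.sqrt(0.1) * rng.standard_normal((P, Q, m))
Y = np.einsum('pqmd,dpq->pqm', X, W_true) + noise
print(X.shape, Y.shape)
```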
For each task (p, q) ∈ [P] × [Q], we generated a training set of m pairs (x_ipq, y_ipq)_{i=1}^m by first sampling x_ipq from the standard normal distribution and then computing y_ipq = ⟨x_ipq, w_pq⟩ + ν_i, where ν_i was drawn from a zero-mean normal distribution with variance 0.1. We used the penalty formulation of (6) with the squared loss and selected the regularization parameter λ using two-fold cross validation on the training set from the range 0.01 to 10 with the interval 0.1.

In addition to the three norms for tensors discussed in the previous section, we evaluated the matrix-based multitask learning approaches that penalize the trace norm of the unfolding of W at a specific mode. The conventional convex multitask learning [2, 3, 16] corresponds to one of these approaches, namely the one that penalizes the trace norm of the first unfolding ∥W_(1)∥_tr. The convex MLMTL in [17] corresponds to the overlapped trace norm.

In the first experiment, we chose d = P = Q = 10 and r_1 = r_2 = r_3 = 3. Therefore, both the dimensions and the multilinear rank are homogeneous. The result is shown in Figure 1(a). The overlapped trace norm performed the best, the matrix-based approaches performed next, and the latent trace norm and the scaled latent trace norm were the worst. The scaling of the latent trace norm had no effect because the dimensions were homogeneous. Since the sample complexities for all the methods were the same in this setting (see Table 2), the difference in the performances could be explained by a constant factor K (= 3) that is not shown in the sample complexities.

In the second experiment, we chose the dimensions to be homogeneous as d = P = Q = 10, but (r_1, r_2, r_3) = (3, 6, 8). The result is shown in Figure 1(b). In this setting, the (scaled) latent trace norm and the mode-1 regularization performed the best. 
The lower the rank of the corresponding mode, the lower was the error of the matrix-based MTL approaches. The overlapped trace norm was somewhat in the middle of the three matrix-based approaches.

Figure 1: Results for the synthetic data sets. (a) Both the dimensions and the ranks are homogeneous: the true tensor is 10 × 10 × 10 with multilinear rank (3, 3, 3). (b) The dimensions are homogeneous but the ranks are heterogeneous: the true tensor is 10 × 10 × 10 with multilinear rank (3, 6, 8). (c) Both the dimensions and the ranks are heterogeneous: the true tensor is 10 × 3 × 10 with multilinear rank (3, 3, 8).

In the last experiment, we chose both the dimensions and the multilinear rank to be heterogeneous as (d, P, Q) = (10, 3, 10) and (r_1, r_2, r_3) = (3, 3, 8). The result is shown in Figure 1(c). Clearly the first mode had the lowest rank relative to its dimension. However, the latent trace norm recognized the second mode as the mode with the lowest rank and performed similarly to the mode-2 regularization. The overlapped trace norm performed better, but it was worse than the mode-1 regularization. The scaled latent trace norm performed comparably to the mode-1 regularization.

4.2 Restaurant data set

The Restaurant data set contains data for a recommendation system for restaurants in which different customers have given ratings to different aspects of each restaurant. Following the same approach as in [17], we modelled the problem as an MLMTL problem with d = 45 features, P = 3 aspects, and Q = 138 customers.

The total number of instances over all the tasks was 3483, and we randomly selected training sets of sizes 400, 800, 1200, 1600, 2000, 2400, and 2800. When the size was small, many tasks contained no training example. 
We also selected 250 instances as the validation set, and the rest was used as the test set. The regularization parameter for each norm was selected by minimizing the mean squared error on the validation set from the candidate values in the intervals [50, 1000] for the overlapped, [0.5, 40] for the latent, and [6000, 20000] for the scaled latent trace norms, respectively.

We also evaluated matrix-based MTL approaches on the different modes and ridge regression (Frobenius-norm regularization; abbreviated as RR) as baselines. The convex MLMTL in [17] corresponds to the overlapped trace norm.

The result is shown in Figure 2(a). We found the multilinear rank of the solution obtained by the overlapped trace norm to be typically (1, 3, 3). This was consistent with the fact that the performances of the mode-1 regularization and the ridge regression were equal. In other words, the effective dimension of the first mode (features) was one instead of 45. The latent trace norm recognized the first mode as the mode with the lowest rank, and it failed to take advantage of the low-rank-ness of the second and the third modes. The scaled latent trace norm performed the best, matching the performances of the mode-2 and mode-3 regularization. When the number of samples was above 2400, the latent trace norm caught up with the other methods, probably because the effective dimension became higher in this regime.

4.3 School data set

The data set comes from the Inner London Education Authority (ILEA) and consists of examination records from 15362 students at 139 schools in the years 1985, 1986, and 1987. We followed [4] for the preprocessing of categorical attributes and obtained 24 features. Previously Argyriou et al. 
[3] modeled this data set as a 27 × 139 matrix-based MTL problem in which the year was modeled as a trinomial attribute. Instead, here we model this data set as a 24 × 139 × 3 MLMTL problem in which the third mode corresponds to the year. Following earlier papers [3, 4], we used the percentage of explained variance, defined as 100 · (1 − (test MSE)/(variance of test y)), as the evaluation metric. We note that the variance of test y was roughly one because we standardized the data to have zero mean and unit variance [4].

Figure 2: Results for the real-world data sets. (a) Result for the 45 × 3 × 138 Restaurant data set. (b) Result for the 24 × 139 × 3 School data set.

The results are shown in Figure 2(b). First, ridge regression performed the worst because it was not able to take advantage of the low-rank-ness of any mode. Second, the plain latent trace norm performed similarly to the mode-3 regularization, probably because the dimension 3 was lower than the ranks of the other two modes. Clearly the scaled latent trace norm performed the best, matching the performance of the mode-2 regularization; probably the second mode had the most redundancy. The performance of the overlapped trace norm was comparable to or slightly better than the mode-1 regularization. 
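The evaluation metric just defined is a one-line helper; a sketch (the function name and the example targets are ours):

```python
import numpy as np

def explained_variance_pct(y_true, y_pred):
    """100 * (1 - test MSE / variance of the test targets)."""
    y_true = np.asarray(y_true)
    mse = np.mean((y_true - np.asarray(y_pred)) ** 2)
    return 100.0 * (1.0 - mse / np.var(y_true))

# Example with standardized-looking targets (mean exactly zero here).
y_true = np.array([0.5, -1.2, 0.3, 1.1, -0.7])
print(explained_variance_pct(y_true, y_true))        # → 100.0 (perfect predictions)
print(explained_variance_pct(y_true, np.zeros(5)))   # → 0.0 (predicting the mean)
```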
The percentage of explained variance of the latent trace norm exceeds 30% around sample size 4000 (around 30 samples per school), which is higher than the Hierarchical Bayes [4] (around 29.5%) and matrix-based MTL [3] (around 26.7%) that used around 80 samples per school.

5 Discussion

Using tensors for modeling multitask learning [17, 19] is a promising direction that allows us to take advantage of the similarity of tasks in multiple dimensions and even make predictions for a task with no training example. However, with multiple modes we face more hyperparameters to choose in the conventional nonconvex tensor decomposition framework. Convex relaxation of the tensor multilinear rank allows us to side-step this issue. In fact, we have shown that the sample complexity of the latent trace norm is as good as knowing the mode with the lowest rank. This is consistent with the analysis of [21] in the tensor denoising setting (see Table 1).
In the setting of tensor-based MTL, however, the notion of the mode with the lowest rank may be vacuous because some modes may have very low dimension. In fact, the sample complexity of the latent trace norm can be as bad as not using any low-rank-ness at all if there is a mode with dimension lower than the rank of the other modes. The scaled latent trace norm we proposed in this paper recognizes the mode with the lowest rank relative to its dimension and leads to competitive sample complexities in the various settings shown in Table 2.
Acknowledgment: MS acknowledges support from the JST CREST program.

References
[1] R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. J. Mach. Learn.
Res., 6:1817–1853, 2005.
[2] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Adv. Neural Inf. Process. Syst. 19, pages 41–48. MIT Press, Cambridge, MA, 2007.
[3] A. Argyriou, M. Pontil, Y. Ying, and C. A. Micchelli. A spectral regularization framework for multi-task structure learning. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Adv. Neural Inf. Process. Syst. 20, pages 25–32. Curran Associates, Inc., 2008.
[4] B. Bakker and T. Heskes. Task clustering and gating for Bayesian multitask learning. J. Mach. Learn. Res., 4:83–99, 2003.
[5] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3:463–482, 2002.
[6] J. Baxter. A model of inductive bias learning. J. Artif. Intell. Res., 12:149–198, 2000.
[7] R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
[8] L. De Lathauwer, B. De Moor, and J. Vandewalle. A multilinear singular value decomposition. SIAM J. Matrix Anal. Appl., 21(4):1253–1278, 2000.
[9] L. De Lathauwer, B. De Moor, and J. Vandewalle. On the best rank-1 and rank-(R1, R2, ..., RN) approximation of higher-order tensors. SIAM J. Matrix Anal. Appl., 21(4):1324–1342, 2000.
[10] M. Fazel, H. Hindi, and S. P. Boyd. A rank minimization heuristic with application to minimum order system approximation. In Proc. of the American Control Conference, 2001.
[11] R. Foygel and N. Srebro. Concentration-based guarantees for low-rank matrix reconstruction.
arXiv preprint arXiv:1102.3923, 2011.
[12] S. Gandy, B. Recht, and I. Yamada. Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse Problems, 27:025010, 2011.
[13] S. M. Kakade, S. Shalev-Shwartz, and A. Tewari. Regularization techniques for learning with matrices. J. Mach. Learn. Res., 13(1):1865–1890, 2012.
[14] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
[15] J. Liu, P. Musialski, P. Wonka, and J. Ye. Tensor completion for estimating missing values in visual data. In Proc. ICCV, 2009.
[16] A. Maurer and M. Pontil. Excess risk bounds for multitask learning with trace norm regularization. In JMLR W&CP 30 (COLT2013), pages 55–76. MIT Press, 2013.
[17] B. Romera-Paredes, H. Aung, N. Bianchi-Berthouze, and M. Pontil. Multilinear multitask learning. In Proceedings of the 30th International Conference on Machine Learning, pages 1444–1452, 2013.
[18] M. Signoretto, L. De Lathauwer, and J. Suykens. Nuclear norms for tensors and their use for convex multilinear estimation. Technical Report 10-186, ESAT-SISTA, K.U.Leuven, 2010.
[19] M. Signoretto, L. De Lathauwer, and J. A. K. Suykens. Learning tensors in reproducing kernel Hilbert spaces with multilinear spectral penalties. Technical report, arXiv:1310.4977, 2013.
[20] N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural Inf. Process. Syst. 17, pages 1329–1336. MIT Press, Cambridge, MA, 2005.
[21] R. Tomioka and T. Suzuki. Convex tensor decomposition via structured Schatten norm regularization. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Adv. Neural Inf. Process. Syst. 26, pages 1331–1339. 2013.
[22] R. Tomioka, K. Hayashi, and H. Kashima.
Estimation of low-rank tensors via convex optimization. Technical report, arXiv:1010.0789, 2011.
[23] R. Tomioka, T. Suzuki, K. Hayashi, and H. Kashima. Statistical performance of convex tensor decomposition. In Adv. Neural Inf. Process. Syst. 24, pages 972–980. 2011.
[24] J. A. Tropp. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., 12(4):389–434, 2012.
[25] L. R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966.
[26] B. Vargas-Govea, G. González-Serna, and R. Ponce-Medellín. Effects of relevant contextual features in the performance of a restaurant recommender system. In Proceedings of the 3rd Workshop on Context-Aware Recommender Systems, 2011.