{"title": "Minimax Optimal Alternating Minimization for Kernel Nonparametric Tensor Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3783, "page_last": 3791, "abstract": "We investigate the statistical performance and computational efficiency of the alternating minimization procedure for nonparametric tensor learning. Tensor modeling has been widely used for capturing the higher-order relations between multimodal data sources. In addition to linear models, nonlinear tensor models have received much attention recently because of their high flexibility. We consider an alternating minimization procedure for a general nonlinear model where the true function consists of components in a reproducing kernel Hilbert space (RKHS). In this paper, we show that the alternating minimization method achieves linear convergence as an optimization algorithm and that the generalization error of the resulting estimator attains the minimax optimal rate. We apply our algorithm to some multitask learning problems and show that the method indeed shows favorable performance.", "full_text": "Minimax Optimal Alternating Minimization\nfor Kernel Nonparametric Tensor Learning\n\nTaiji Suzuki*, Heishiro Kanagawa†\n\n*,†Department of Mathematical and Computing Science, Tokyo Institute of Technology\n*PRESTO, Japan Science and Technology Agency\n*Center for Advanced Integrated Intelligence Research, RIKEN\ns-taiji@is.titech.ac.jp, kanagawa.h.ab@m.titech.ac.jp\n\nHayato Kobayashi, Nobuyuki Shimizu, Yukihiro Tagami\nYahoo Japan Corporation\n{ hakobaya, nobushim, yutagami } @yahoo-corp.jp\n\nAbstract\n\nWe investigate the statistical performance and computational efficiency of the alternating minimization procedure for nonparametric tensor learning. Tensor modeling has been widely used for capturing the higher-order relations between multimodal data sources. 
In addition to linear models, nonlinear tensor models have received much attention recently because of their high flexibility. We consider an alternating minimization procedure for a general nonlinear model where the true function consists of components in a reproducing kernel Hilbert space (RKHS). In this paper, we show that the alternating minimization method achieves linear convergence as an optimization algorithm and that the generalization error of the resulting estimator attains the minimax optimal rate. We apply our algorithm to some multitask learning problems and show that the method indeed shows favorable performance.\n\n1 Introduction\n\nTensor modeling is widely used for capturing the higher-order relations between several data sources. For example, it has been applied to spatiotemporal data analysis [19], multitask learning [20, 2, 14] and collaborative filtering [15]. The success of tensor modeling is usually based on the low-rank property of the target parameter. As in the matrix case, the low-rank decomposition of tensors, e.g., the canonical polyadic (CP) decomposition [10, 11] and the Tucker decomposition [31], reduces the effective dimension of the statistical model, improves the generalization error, and gives a better understanding of the model through a condensed representation of the target system.\n\nAmong several tensor models, linear models have been extensively studied from both theoretical and practical points of view [16]. A difficulty of tensor model analysis is that typical tensor problems are non-convex and hence hard to solve exactly. To overcome this computational difficulty, several authors have proposed convex relaxation methods [18, 23, 9, 30, 29]. 
Unfortunately, however, convex relaxation methods lose statistical optimality in favor of computational efficiency [28].\n\nAnother promising approach is the alternating minimization procedure, which alternately updates each component of the tensor while keeping the other components fixed. The method has shown good performance in practice. Moreover, its theoretical analysis has been given by several authors [1, 13, 6, 3, 21, 36, 27, 37]. These theoretical analyses indicate that the estimator given by the alternating minimization procedure has a good generalization error, with a mild dependency on the size of the tensor, provided that the initial solution is properly set. In addition to the alternating minimization procedure, it has been shown that the Bayes estimator achieves the minimax optimality under quite weak assumptions [28].\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nNonparametric models have also been proposed for capturing nonlinear relations [35, 24, 22]. In particular, [24] extended linear tensor learning to the nonparametric learning problem using a kernel method and proposed a convex regularization method and an alternating minimization method. Recently, [14, 12] showed that the Bayesian approach has good theoretical properties for the nonparametric problem; in particular, it achieves the minimax optimality under weak assumptions. However, from a practical point of view, the Bayesian approach is computationally expensive compared with the alternating minimization approach. An interesting observation is that the practical performance of the alternating minimization procedure is quite good [24] and comparable to that of the Bayesian method [14], although its computational efficiency is much better. 
Despite the practical usefulness of the alternating minimization procedure, its statistical properties have not yet been investigated in the general nonparametric model.\n\nIn this paper, we theoretically analyze the alternating minimization procedure in the nonparametric model. We investigate its computational efficiency and analyze its statistical performance. It is shown that, if the true function is included in a reproducing kernel Hilbert space (RKHS), then the algorithm converges to a (possibly local) optimal solution at a linear rate, and the generalization error of the estimator achieves the minimax optimality if the initial point of the algorithm is within O(1) distance from the true function. Roughly speaking, the theoretical analysis shows that\n\n  ‖f̂^{(t)} − f*‖_{L2}^2 = O_p( dK n^{−1/(1+s)} log(dK) + dK (3/4)^t ),\n\nwhere f̂^{(t)} is the estimated nonlinear tensor at the t-th iteration of the alternating minimization procedure, n is the sample size, d is the rank of the true tensor, K is the number of modes, and s is the complexity of the RKHS. This indicates that the alternating minimization procedure can produce a minimax optimal estimator after O(log(n)) iterations.\n\n2 Problem setting: nonlinear tensor model\n\nHere, we describe the model to be analyzed. Suppose that we are given n input-output pairs {(x_i, y_i)}_{i=1}^n generated from the following system. The input x_i is a concatenation of K variables, i.e., x_i = (x_i^{(1)}, ..., x_i^{(K)}) ∈ X_1 × ... × X_K = X, where each x_i^{(k)} is an element of a set X_k and is generated from a distribution P_k. We consider the regression problem where the outputs {y_i}_{i=1}^n are observed according to the nonparametric tensor model [24]:\n\n  y_i = Σ_{r=1}^d Π_{k=1}^K f*_{(r,k)}(x_i^{(k)}) + ε_i,   (1)\n\nwhere {ε_i}_{i=1}^n represents i.i.d. zero-mean noise and each f*_{(r,k)} is a component of the true function f*(x) = f*(x^{(1)}, ..., x^{(K)}) = Σ_{r=1}^d Π_{k=1}^K f*_{(r,k)}(x^{(k)}) included in some RKHS H_{r,k}. In this regression problem, our objective is to estimate the true function based on the observations {(x_i, y_i)}_{i=1}^n. This model has been applied to several problems such as multitask learning, recommendation systems and spatiotemporal data analysis. Although we focus on the squared-loss regression problem, the discussion in this paper can easily be generalized to Lipschitz continuous and strongly convex losses as in [4].\n\nExample 1: multitask learning. Suppose that we have several tasks indexed by a two-dimensional index (s, t) ∈ [M_1] × [M_2]^1, and each task (s, t) is a regression problem for which there is a true function g*_{[s,t]}(x) that takes an input feature w ∈ X_3. The i-th input sample is given as x_i = (s_i, t_i, w_i), which is a combination of a task index (s_i, t_i) and an input feature w_i. By assuming that the true function g*_{[s,t]} is a linear combination of a few latent factors h_r as\n\n  g*_{[s,t]}(x) = Σ_{r=1}^d α_{s,r} β_{t,r} h_r(w)   (x = (s, t, w)),   (2)\n\nand the output is given as y_i = g*_{[s_i,t_i]}(x_i) + ε_i [20, 2, 14], then we can reduce the multitask learning problem of estimating {g*_{[s,t]}}_{s,t} to the tensor estimation problem, where f_{(r,1)}(s) = α_{s,r}, f_{(r,2)}(t) = β_{t,r}, f_{(r,3)}(w) = h_r(w).\n\n^1 We denote by [k] = {1, ..., k}.\n\nAlgorithm 1 Alternating minimization procedure for nonlinear tensor estimation\nRequire: Training data D_n = {(x_i, y_i)}_{i=1}^n, the regularization parameter C_n, iteration number T.\nEnsure: f̂ = Σ_{r=1}^d v̂_r^{(T)} Π_{k=1}^K f̂_{(r,k)}^{(T)} as the estimator.\nfor t = 1, ..., T do\n  Set f̃_{(r,k)} = f̂_{(r,k)}^{(t−1)} (∀(r,k)) and ṽ_r = v̂_r^{(t−1)} (∀r).\n  for (r, k) ∈ {1, ..., d} × {1, ..., K} do\n    The (r,k)-element of f̃ is updated as\n\n      f̃'_{(r,k)} = argmin_{f_{(r,k)} ∈ H_{r,k}} { (1/n) Σ_{i=1}^n [ y_i − ( f_{(r,k)}(x_i^{(k)}) Π_{k'≠k} f̃_{(r,k')}(x_i^{(k')}) + Σ_{r'≠r} ṽ_{r'} Π_{k'=1}^K f̃_{(r',k')}(x_i^{(k')}) ) ]^2 + C_n ‖f_{(r,k)}‖_{H_{r,k}}^2 }.   (4)\n\n    ṽ_r ← ‖f̃'_{(r,k)}‖_n,  f̃_{(r,k)} ← f̃'_{(r,k)}/ṽ_r.\n  end for\n  Set f̂_{(r,k)}^{(t)} = f̃_{(r,k)} (∀(r,k)) and v̂_r^{(t)} = ṽ_r (∀r).\nend for\n\n3 Alternating regularized least squares algorithm\n\nTo learn the nonlinear tensor factorization model (1), we propose to optimize the regularized empirical risk in an alternating way. That is, we optimize each component f_{(r,k)} with the other components {f_{(r',k')}}_{(r',k')≠(r,k)} fixed. 
Basically, we want to solve the following optimization problem:\n\n  min_{{f_{(r,k)}}: f_{(r,k)} ∈ H_{r,k}}  (1/n) Σ_{i=1}^n ( y_i − Σ_{r=1}^d Π_{k=1}^K f_{(r,k)}(x_i^{(k)}) )^2 + C_n Σ_{r=1}^d Σ_{k=1}^K ‖f_{(r,k)}‖_{H_{r,k}}^2,   (3)\n\nwhere the first term is the loss function measuring how well our guess Σ_{r=1}^d Π_{k=1}^K f_{(r,k)} fits the data, and the second term is a regularization term controlling the complexity of the learned function. However, this optimization problem is not convex and is difficult to solve exactly.\n\nWe found that this computational difficulty can be overcome if we impose some additional assumptions and aim at a better generalization error instead of exactly minimizing the training error. The optimization procedure we discuss to obtain such an estimator is the alternating minimization procedure, which minimizes the objective function (3) alternately with respect to each component f_{(r,k)}. For each component f_{(r,k)}, the objective function (3) is convex, and thus it is easy to obtain the optimal solution. Actually, the subproblem reduces to a variant of kernel ridge regression, and its solution can be obtained analytically.\n\nThe algorithm, which we call the alternating minimization procedure (AMP), is summarized in Algorithm 1. After minimizing the objective (Eq. (4)), the obtained solution is normalized so that its empirical L2-norm becomes 1, to remove the scaling freedom. The parameter C_n in Eq. 
(4) is a regularization parameter that is chosen appropriately.\n\nFor theoretical simplicity, we consider the following equivalent constrained formulation instead of the penalized one (4):\n\n  f̃'_{(r,k)} ∈ argmin_{f_{(r,k)} ∈ H_{r,k}, ‖f_{(r,k)}‖_{H_{r,k}} ≤ R̃} { (1/n) Σ_{i=1}^n ( y_i − f_{(r,k)}(x_i^{(k)}) Π_{k'≠k} f̃_{(r,k')}(x_i^{(k')}) − Σ_{r'≠r} ṽ_{r'} Π_{k'=1}^K f̃_{(r',k')}(x_i^{(k')}) )^2 },   (5)\n\nwhere the parameter R̃ is a regularization parameter for controlling the complexity of the estimated function.\n\n4 Assumptions and problem settings for the convergence analysis\n\nHere, we prepare some assumptions for our theoretical analysis. First, we assume that the distribution P(X) of the input feature x ∈ X is a product measure of P_k on each X_k. That is, P_X(dX) = P_1(dX_1) × ... × P_K(dX_K) for X = (X_1, ..., X_K) ∈ X = X_1 × ... × X_K. This is typically assumed in the analysis of linear tensor estimation methods [13, 6, 3, 21, 1, 36, 27, 37]. Thus, the L2-norm of a "rank-1" function f(x) = Π_{k=1}^K f_k(x^{(k)}) can be decomposed into ‖f‖_{L2(P_X)}^2 = ‖f_1‖_{L2(P_1)}^2 × ... × ‖f_K‖_{L2(P_K)}^2.\n\nHereafter, with a slight abuse of notation, we denote ‖f‖_{L2} = ‖f‖_{L2(P_k)} for a function f: X_k → R. The inner product in the space L2 is denoted by ⟨f, g⟩_{L2} := ∫ f(X) g(X) dP_X(X). Note that, because of the construction of P_X, it holds that ⟨f, g⟩_{L2} = Π_{k=1}^K ⟨f_k, g_k⟩_{L2} for functions f(x) = Π_{k=1}^K f_k(x^{(k)}) and g(x) = Π_{k=1}^K g_k(x^{(k)}), where x = (x^{(1)}, ..., x^{(K)}) ∈ X.\n\nNext, we assume that the norm of the true function is bounded away from zero and from above. Let the magnitude of the r-th component of the true function be v_r := ‖Π_{k=1}^K f*_{(r,k)}‖_{L2} and the normalized components be f**_{(r,k)} := f*_{(r,k)}/‖f*_{(r,k)}‖_{L2} (∀(r,k)).\n\nAssumption 1 (Boundedness Assumption).\n(A1-1) There exist 0 < v_min ≤ v_max such that v_min ≤ v_r ≤ v_max (∀r = 1, ..., d).\n(A1-2) The true function f*_{(r,k)} is included in the RKHS H_{r,k}, i.e., f*_{(r,k)} ∈ H_{r,k} (∀(r,k)), and there exists R > 0 such that max{v_r, 1} ‖f**_{(r,k)}‖_{H_{r,k}} ≤ R (∀(r,k)).\n(A1-3) The kernel function k_{(r,k)} associated with the RKHS H_{r,k} is bounded as sup_{x ∈ X_k} k_{(r,k)}(x, x) ≤ 1 (∀(r,k)).\n(A1-4) There exists L > 0 such that the noise is bounded as |ε_i| ≤ L (a.s.).\n\nAssumption 1 is a standard one for the analysis of the tensor model and the kernel regression model. 
Note that the boundedness condition of the kernel gives ‖f‖_∞ = sup_{x^{(k)}} |f(x^{(k)})| ≤ ‖f‖_{H_{r,k}} for all f ∈ H_{r,k}, because the Cauchy–Schwarz inequality gives |⟨f, k_{(r,k)}(·, x^{(k)})⟩_{H_{r,k}}| ≤ √(k_{(r,k)}(x^{(k)}, x^{(k)})) ‖f‖_{H_{r,k}} for all x^{(k)}. Thus, combining this with (A1-2), we also have ‖f**_{(r,k)}‖_∞ ≤ R. The last assumption (A1-4) is a bit restrictive. However, this assumption can be replaced with a Gaussian noise assumption; in that situation, we may use the Gaussian concentration inequality [17] instead of Talagrand's concentration inequality in the proof.\n\nNext, we characterize the complexity of each RKHS H_{r,k} by using the entropy number [33, 25]. The ε-covering number N(ε, G, L2(P_X)) with respect to L2(P_X) is the minimal number of balls with radius ε measured by L2(P_X) needed to cover a set G ⊂ L2(P_X). The i-th entropy number e_i(G, L2(P_X)) is defined as the infimum of ε > 0 such that N(ε, G, L2) ≤ 2^{i−1} [25]. Intuitively, if the entropy number is small, the space G is "simple"; otherwise, it is "complicated."\n\nAssumption 2 (Complexity Assumption). Let B_{H_{r,k}} be the unit ball of an RKHS H_{r,k}. There exist 0 < s < 1 and c such that\n\n  e_i(B_{H_{r,k}}, L2(P_X)) ≤ c i^{−1/(2s)},   (6)\n\nfor all 1 ≤ r ≤ d and 1 ≤ k ≤ K.\n\nThe optimal rate of the ordinary kernel ridge regression on an RKHS satisfying Assumption 2 is n^{−1/(1+s)} [26]. Next, we give a technical assumption about the L∞-norm.\n\nAssumption 3 (Infinity Norm Assumption). There exist 0 < s_2 ≤ 1 and c_2 such that\n\n  ‖f‖_∞ ≤ c_2 ‖f‖_{L2}^{1−s_2} ‖f‖_{H_{r,k}}^{s_2}   (∀f ∈ H_{r,k})   (7)\n\nfor all 1 ≤ r ≤ d and 1 ≤ k ≤ K.\n\nBy Assumption 1, this assumption is always satisfied for c_2 = 1 and s_2 = 1. 
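The sup-norm bound derived above from kernel boundedness and Cauchy–Schwarz (the trivial c_2 = 1, s_2 = 1 case of Assumption 3) can be checked numerically. A small sketch with a Gaussian kernel, for which k(x, x) = 1: a random RKHS element f = Σ_i α_i k(·, x_i) always satisfies sup_x |f(x)| ≤ ‖f‖_H. The centers, weights and kernel width below are arbitrary choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def k_gauss(a, b, gamma=30.0):
    # Gaussian kernel; note k(x, x) = 1, so sup_x k(x, x) <= 1 as in (A1-3)
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

centers = rng.uniform(0, 1, 20)          # expansion points x_i
alpha = rng.standard_normal(20)          # expansion weights

Kmat = k_gauss(centers, centers)
rkhs_norm = np.sqrt(alpha @ Kmat @ alpha)  # ||f||_H for f = sum_i alpha_i k(., x_i)

grid = np.linspace(0, 1, 2001)
f_vals = k_gauss(grid, centers) @ alpha    # f evaluated on a fine grid

sup_norm = np.max(np.abs(f_vals))          # lower bound on ||f||_inf
print(sup_norm, rkhs_norm)
```

The grid maximum can only undershoot the true sup-norm, so the inequality `sup_norm <= rkhs_norm` must hold for every draw.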
The case s_2 < 1 is the nontrivial situation and gives a tighter bound. We would like to note that the condition with s_2 < 1 is satisfied by many practically used kernels such as the Gaussian kernel. In particular, it is satisfied if the kernel is smooth enough that H_{r,k} is included in a Sobolev space W^{2,s_2}[0,1]. A more formal characterization of this condition using the notion of a real interpolation space can be found in [26] and Proposition 2.10 of [5].\n\nFinally, we assume an incoherence condition on {f*_{(r,k)}}_{r,k}. Roughly speaking, the incoherence of a set of functions {f_{(r,k)}}_{r,k} means that the components {f_{(r,k)}}_r are linearly independent across different 1 ≤ r ≤ d on the same mode k. This is required to distinguish each component. An analogous assumption has also been made in the literature on linear models [13, 6, 3, 21, 36, 27].\n\nDefinition 1 (Incoherence). A set of functions {f_{(r,k)}}_{r,k}, where f_{(r,k)} ∈ L2(P_k), is μ-incoherent if, for all k = 1, ..., K, it holds that\n\n  |⟨f_{(r,k)}, f_{(r',k)}⟩_{L2}| ≤ μ ‖f_{(r,k)}‖_{L2} ‖f_{(r',k)}‖_{L2}   (∀r ≠ r').\n\nAssumption 4 (Incoherence Assumption). There exists 1 > μ* ≥ 0 such that the true function {f*_{(r,k)}}_{r,k} is μ*-incoherent.\n\n5 Linear convergence of the alternating minimization procedure\n\nIn this section, we give the convergence analysis of the AMP algorithm. Under the assumptions presented in the previous section, it will be shown that the AMP algorithm achieves linear convergence in the sense of an optimization algorithm and achieves the minimax optimal rate in the sense of statistical performance. 
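The incoherence measure of Definition 1 can be estimated by Monte Carlo integration over P_k. A toy sketch with P_k = Uniform[0, 1]: sine and cosine of the same frequency are 0-incoherent, while two overlapping components have coherence close to 1. The particular test functions and sample size are illustrative assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 200_000)          # samples from P_k = Uniform[0, 1]

def coherence(f, g):
    # Monte-Carlo estimate of |<f, g>_{L2}| / (||f||_{L2} ||g||_{L2})
    fx, gx = f(x), g(x)
    return abs(np.mean(fx * gx)) / np.sqrt(np.mean(fx ** 2) * np.mean(gx ** 2))

f1 = lambda t: np.sin(2 * np.pi * t)
f2 = lambda t: np.cos(2 * np.pi * t)                        # orthogonal to f1
f3 = lambda t: np.sin(2 * np.pi * t) + 0.5 * np.cos(2 * np.pi * t)

mu_12 = coherence(f1, f2)   # near 0: an incoherent pair
mu_13 = coherence(f1, f3)   # near 0.9: a strongly coherent pair
print(mu_12, mu_13)
```

Assumption 4 asks that every off-diagonal coherence of the true components, like `mu_12` here, stays bounded away from 1.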
Roughly speaking, if the initial solution is sufficiently close to the true function (namely, within O(1) distance), then the solution generated by AMP converges linearly to the optimal solution, and the estimation accuracy of the final solution is O(dK n^{−1/(1+s)}) up to a log(dK) factor.\n\nWe analyze how close the updated estimator is to the true one when the (r,k)-th component is updated from f̃_{(r,k)} to f̃'_{(r,k)}. The tensor decomposition {f_{(r,k)}}_{r,k} of a nonlinear tensor model has a scaling freedom. Thus, we need to measure the accuracy based on a normalized representation to avoid the scaling-factor uncertainty. Let the normalized components of the estimator be f̄_{(r',k')} = f̃_{(r',k')}/‖f̃_{(r',k')}‖_{L2} (∀(r',k') ∈ [d] × [K]) and v̄_{r'} = ṽ_{r'} Π_{k'=1}^K ‖f̃_{(r',k')}‖_{L2} (∀r' ∈ [d]). On the other hand, the newly updated (r,k)-th element is denoted by f̃'_{(r,k)} (see Eq. (4)), and we denote by v̄'_r the correspondingly updated value of v̄_r: v̄'_r = ‖f̃'_{(r,k)}‖_{L2} Π_{k'≠k} ‖f̃_{(r,k')}‖_{L2}. The normalized newly updated element is denoted by f̄'_{(r,k)} = f̃'_{(r,k)}/‖f̃'_{(r,k)}‖_{L2}.\n\nFor an estimator (f̄, v̄) = ({f̄_{(r',k')}}_{r',k'}, {v̄_{r'}}_{r'}), which is a couple of the normalized components and the scaling factors, define\n\n  d_∞(f̄, v̄) := max_{(r',k')} { v_{r'} ‖f̄_{(r',k')} − f**_{(r',k')}‖_{L2} + |v_{r'} − v̄_{r'}| }.\n\nFor any λ_{1,n} > 0, λ_{2,n} > 0 and τ > 0, we let a_τ := max{1, L} max{1, τ} log(dK) and define ξ_n = ξ_n(λ_{1,n}, τ) and ξ'_n = ξ'_n(λ_{2,n}, τ) as^2\n\n  ξ_n := a_τ ( K^{(1+2s)/2} λ_{1,n}^{−s/2} / √n ∨ K^{(1+2s)/(1+s)} λ_{1,n}^{−(2s+(1−s)s_2)/(2(1+s))} / n^{1/(1+s)} ),\n  ξ'_n := a_τ ( λ_{2,n}^{−s/2} / √n ∨ λ_{2,n}^{−1/2} / n^{1/(1+s)} ).\n\nTheorem 2. Suppose that Assumptions 1–4 are satisfied, and the regularization parameter R̃ in Eq. (5) is set as R̃ = 2R. 
Let R̂ = 8R̃/min{v_min, 1} and suppose that we have already obtained an estimator f̃ satisfying the following conditions:\n• The RKHS-norms of {f̄_{(r',k')}}_{r',k'} are bounded as ‖f̄_{(r',k')}‖_{H_{r',k'}} ≤ R̂/2 (∀(r',k') ≠ (r,k)).\n• The distance from the true one is bounded as d_∞(f̄, v̄) ≤ γ.\nThen, for sufficiently small μ* and γ (independent of n), there exists an event with probability greater than 1 − 3 exp(−τ) on which any (f̄, v̄) satisfying the above conditions gives\n\n  ( v_r ‖f̄'_{(r,k)} − f**_{(r,k)}‖_{L2} + |v̄'_r − v_r| )^2 ≤ (1/2) d_∞(f̄, v̄)^2 + S_n R̂^{2K}   (8)\n\nfor any sufficiently large n, where S_n is defined for a constant C' depending on s, s_2, c, c_2 as\n\n  S_n := C' [ ξ'_n λ_{2,n}^{1/2} + ξ'_n^2 + d ξ_n λ_{1,n}^{1/2} + R̂^{2(K−1)(1/s_2 − 1)} (d ξ_n)^{2/s_2} (1 + v_max)^2 ].\n\nMoreover, if we denote by η_n the right-hand side of Eq. (8), then it holds that\n\n  ‖f̄'_{(r,k)}‖_{H_{r,k}} ≤ ( 2/(v_r − √η_n) ) R̃.\n\n^2 The symbol ∨ indicates the max operation, that is, a ∨ b := max{a, b}.\n\nThe proof and its detailed statement are given in the supplementary material (Theorem A.1). It is proven by using techniques such as the so-called peeling device [32] or, equivalently, the local Rademacher complexity [4], combined with a coordinate-descent optimization argument. 
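The one-step bound (8) is a contraction: iterating e_{t+1}^2 = e_t^2/2 + S (with S standing in for the statistical term S_n R̂^{2K}) halves the squared distance per sweep until it flattens out at the level 2S. A toy iteration with purely illustrative numbers:

```python
# toy iteration of the contraction e_{t+1}^2 = e_t^2 / 2 + S
S = 1e-4            # plays the role of S_n * Rhat^{2K}, the statistical error level
e2 = 1.0            # initial squared distance d_inf^2
history = [e2]
for _ in range(40):
    e2 = 0.5 * e2 + S
    history.append(e2)
# the fixed point of the recursion is e^2 = 2S
print(history[-1], 2 * S)
```

The optimization error thus decays geometrically (linear convergence) and the iterates stall only once they reach the statistical noise floor, matching the dK(3/4)^t + statistical-rate split in the final bound.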
Theorem 2 states that, if the initial solution is sufficiently close to the true one, then the updated estimator gets closer to the true one and its RKHS-norm remains bounded above by a constant. Importantly, it can be shown that the updated estimator still satisfies the conditions of Theorem 2 for large n. Since the bound given in Theorem 2 is uniform, inequality (8) can be applied recursively to the sequence f̂^{(t)} (t = 1, 2, ...).\n\nBy substituting λ_{1,n} = K^{−(1+s)/(1−s)} d^{−2/(1−s)} n^{−1/(1+s)} and λ_{2,n} = n^{−1/(1+s)}, we have that\n\n  S_n = O( ( n^{−1/(1+s)} ∨ n^{−1/(1+s) − (1−s_2) min{ (1−s)/(4(1+s)), 1/(s_2(1+s)) }} poly(d, K) ) log(dK) ),\n\nwhere poly(d, K) denotes a polynomial in d and K. Thus, if s_2 < 1 and n is sufficiently large compared with d and K, then the second term is smaller than the first term and we have S_n ≤ C n^{−1/(1+s)} with a constant C. Furthermore, we can bound the L2-distance from the true function as in the following theorem.\n\nTheorem 3. Let (f̂^{(t)}, v̂^{(t)}) be the estimator at the t-th iteration. In addition to the assumptions of Theorem 2, suppose that (f̂^{(1)}, v̂^{(1)}) satisfies d_∞(f̂^{(1)}, v̂^{(1)})^2 ≤ v_min^2/8 and S_n R̂^{2K} ≤ v_min^2/8, that s_2 < 1, and that n ≫ d, K. Then f̌^{(t)}(x) = Σ_{r=1}^d v̂_r^{(t)} Π_{k=1}^K f̂_{(r,k)}^{(t)}(x^{(k)}) satisfies\n\n  ‖f̌^{(t)} − f*‖_{L2}^2 = O( dK n^{−1/(1+s)} log(dK) + dK (3/4)^t )\n\nfor all t ≥ 2 uniformly with probability 1 − 3 exp(−τ).\n\nA more detailed argument is given in Theorem A.3 in the supplementary material. This means that after T = O(log(n)) iterations, we obtain an estimation accuracy of O(dK n^{−1/(1+s)} log(dK)). 
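The iteration count T = O(log n) can be made concrete: the optimization term dK(3/4)^t drops below the statistical term dK n^{−1/(1+s)} once t ≥ log(n)/((1+s) log(4/3)). A quick sketch (the choice s = 0.5 is an arbitrary illustration):

```python
import math

def iters_needed(n, s):
    # smallest t with (3/4)^t <= n^{-1/(1+s)}, i.e. the point where the
    # optimization error falls below the statistical error level
    return math.ceil(math.log(n) / ((1 + s) * math.log(4 / 3)))

for n in [10**3, 10**4, 10**5, 10**6]:
    print(n, iters_needed(n, s=0.5))
```

Even for n = 10^6 only a few dozen sweeps are required, which is consistent with the rapid convergence observed in the experiments of Section 7.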
The estimation accuracy bound O(dK n^{−1/(1+s)} log(dK)) is intuitively natural because we are estimating d × K functions {f*_{(r,k)}}_{r,k} and the optimal sample complexity for estimating one function f*_{(r,k)} is known to be n^{−1/(1+s)} [26]. Indeed, it has recently been shown that this accuracy bound is minimax optimal up to the log(dK) factor [14], that is,\n\n  inf_{f̂} sup_{f*} E[‖f̂ − f*‖^2] ≳ dK n^{−1/(1+s)},\n\nwhere the infimum is taken over all estimators and the supremum runs over all low-rank tensors f* with ‖f*_{(r,k)}‖_{H_{r,k}} ≤ R. The Bayes estimator also achieves this minimax lower bound [14]. Hence, a rough Bayes estimator would be a good initial solution satisfying the assumptions.\n\n6 Relation to existing works\n\nIn this section, we describe the relation of our work to existing works. First, our work can be seen as a nonparametric extension of the linear parametric tensor model. The AMP algorithm and related methods for the linear model have been extensively studied in recent years, e.g., [1, 13, 6, 3, 21, 36, 27, 37]. Overall, the tensor completion problem has mainly been studied rather than a general regression problem. Among the existing works, [37] analyzed the AMP algorithm for a low-rank matrix estimation problem. It was shown that, under an incoherence condition, the AMP algorithm converges to the optimum at a linear rate; however, their analysis is limited to the matrix case. [1] analyzed an alternating minimization approach for estimating a low-rank tensor with positive entries in a noisy observation setting. [13, 6] considered an AMP algorithm for tensor completion. Their estimation method is close to our AMP algorithm; however, their analysis is for linear tensor completion with/without noise and is a different direction from our general nonparametric regression setting. 
[3, 36] proposed estimation methods other than alternating minimization, which were specialized to the linear tensor completion problem.\n\nAs for the theoretical analysis of the nonparametric tensor regression model, some Bayes estimators have been analyzed very recently by [14, 12]. They analyzed Bayes methods with Gaussian process priors and showed that the Gaussian process methods possess a good statistical performance. In particular, [14] showed that the Gaussian process method for nonlinear tensor estimation yields the minimax optimality as an extension of the linear model analysis [28]. However, the Bayes estimators require posterior sampling such as Gibbs sampling, which is rather computationally expensive. On the other hand, the AMP algorithm yields a linear convergence rate and satisfies the minimax optimality. An interesting observation is that the AMP algorithm requires a stronger assumption than the Bayesian one; there would be a trade-off between computational efficiency and statistical properties.\n\n7 Numerical experiments\n\nWe numerically compare the following methods in multitask learning problems (Eq. (2)):\n• Gaussian process method (GP-MTL) [14]: the nonparametric Bayesian method with Gaussian process priors. It was shown that the generalization error of GP-MTL achieves the minimax optimal rate [14].\n• Our AMP method with different kernels for the latent factors h_r (see Eq. (2)): the Gaussian RBF kernel and the linear kernel. We also examined mixtures of them, e.g., 2 RBF kernels and 1 linear kernel among d = 3 components, indicated as Lin(1)+RBF(2).\n\nThe tensor rank for AMP and GP-MTL was fixed to d = 3 in the following two data sets. The kernel width and the regularization parameter were tuned by cross-validation. We also examined the scaled latent convex regularization method [34]. 
However, it did not perform well and was omitted.\n\n7.1 Restaurant data\n\nHere, we compared the methods on the Restaurant & Consumer Dataset [7]. The task was to predict consumer ratings of several aspects of different restaurants, which is a typical task of a recommendation system. The number of consumers was M_1 = 138, and each consumer gave scores on M_2 = 3 different aspects (food quality, service quality, and overall quality). Each restaurant was described by M_3 = 44 features as in [20], and the task was to predict the score of an aspect given by a certain consumer based on the restaurant feature vector. This is a multitask learning problem consisting of M_1 × M_2 = 414 (nonlinear) regression tasks where the input feature vector is M_3 = 44 dimensional. The kernel function representing the task similarities for Task 1 (restaurant) and Task 2 (aspect) is set as k(p, p') = δ_{p,p'} + 0.8 · (1 − δ_{p,p'}) (where the pair p, p' are restaurants or aspects)^3.\n\nFig. 1 shows the relative MSE (the discrepancy of the MSE from the best one) for different training sample sizes n, computed on the validation data and plotted against the number of iterations t, averaged over 10 repetitions. It can be seen that the validation error dropped rapidly to the optimal one. The best achievable validation error depended on the sample size. An interesting observation is that, until the algorithm converged to the best possible error, the error dropped at a linear rate; after it reached the bottom, the error was no longer improved.\n\nFig. 2 shows the performance comparison between the AMP method with different kernels and the Gaussian process method (GP-MTL). The performances of AMP and GP-MTL were almost identical. Although AMP is computationally quite efficient, as shown in Fig. 1, this did not degrade its statistical performance. 
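The two task kernels used in the experiments are easy to materialize. A sketch (sizes and features are toy assumptions): the restaurant/aspect kernel k(p, p') = δ_{p,p'} + 0.8(1 − δ_{p,p'}) is just the matrix 0.2 I + 0.8·11ᵀ, and the commute-time kernel of Sec. 7.2 is the pseudoinverse of a graph Laplacian built from cosine similarities. Both constructions yield positive semidefinite matrices, as a kernel must.

```python
import numpy as np

# 1) task-similarity kernel on restaurant / aspect indices:
#    k(p, p') = 1 if p == p' else 0.8  ->  K = 0.2*I + 0.8*ones
M = 5                                    # toy number of tasks
K_task = 0.2 * np.eye(M) + 0.8 * np.ones((M, M))
eig_task = np.linalg.eigvalsh(K_task)    # all eigenvalues should be > 0

# 2) commute-time kernel K = pinv(L) from a similarity graph,
#    as used for the shops in Sec. 7.2 (toy features here)
rng = np.random.default_rng(3)
feats = rng.uniform(0, 1, size=(6, 4))               # 6 "shops", 4 features
unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
W = unit @ unit.T                                    # cosine similarities
np.fill_diagonal(W, 0.0)                             # no self-loops
L = np.diag(W.sum(axis=1)) - W                       # graph Laplacian
K_ct = np.linalg.pinv(L)                             # commute-time kernel
eig_ct = np.linalg.eigvalsh((K_ct + K_ct.T) / 2)     # PSD up to round-off
print(eig_task.min(), eig_ct.min())
```

The Laplacian is PSD with a zero eigenvalue on the constant vector, so its pseudoinverse is likewise PSD, which is what makes K = L† a valid kernel.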
This is a remarkable property of the AMP algorithm.\n\n7.2 Online shopping data\n\nHere, we compared our AMP method with the existing method using data from Yahoo! Japan shopping, which contains various types of shops. The dataset is built on a purchase history that describes how many times each consumer bought each product in each shop. Our objective was to predict the quantity of a product purchased by a consumer at a specific shop. Each consumer was described by 65 features based on his/her properties such as age, gender, and industry type of his/her occupation. We executed the experiments on 100 items and 508 different shops. Hence, the problem was reduced to a multitask learning problem consisting of 100 × 508 regression tasks.\n\n^3 We also tested the delta kernel k(p, p') = δ_{p,p'}, but its performance was worse than that of the presented kernel.\n\nFigure 1: Convergence property of the AMP method: relative MSE against the number of iterations.\nFigure 2: Comparison between the AMP method with different kernels and GP-MTL on the restaurant data.\nFigure 3: Comparison between AMP and GP-MTL on the online shopping data with different kernels.\n\nSimilarly to [14], we put a commute-time kernel K = L† [8] on the shops, based on the Laplacian matrix L of a weighted graph constructed from two similarity measures between shops (where † denotes the pseudoinverse). Here, the Laplacian of the graph is given by L_{i,j} = (Σ_{j' ∈ V} w_{i,j'}) δ_{i,j} − w_{i,j}, where w_{i,j} is the similarity between shops (i, j). We employed the cosine similarity with different parameters as the similarity measures (indicated by "cossim" and "cosdis").\n\nBased on the above settings, we performed a comparison between AMP and GP-MTL with different similarity parameters. We used the Gaussian kernel for the latent factor h_r. The result is shown in Fig. 
3, which presents the validation error (MSE) against the size of the training data. We can see that, for both “cossim” and “cosdis,” AMP performed comparably to the GP-MTL method, and even better in some situations. It should be noted that AMP is much more computationally efficient than GP-MTL despite its high predictive performance. This experimental result justifies our theoretical analysis.

8 Conclusion

We have developed a convergence theory of the AMP method for nonparametric tensor learning. The AMP method has been used by several authors in the literature, but its theoretical analysis had not been addressed in the nonparametric setting. We showed that the AMP algorithm converges at a linear rate as an optimization algorithm and achieves the minimax optimal statistical error if the initial point is in an O(1)-neighborhood of the true function. We may use the Bayes estimator as a rough initial solution, but exploring a more sophisticated determination of the initial solution remains important future work.

Acknowledgment This work was partially supported by MEXT kakenhi (25730013, 25120012, 26280009, 15H01678 and 15H05707), JST-PRESTO and JST-CREST.

References

[1] A. Aswani. Low-rank approximation and completion of positive tensors. arXiv:1412.0620, 2014.

[2] M. T. Bahadori, Q. R. Yu, and Y. Liu. Fast multivariate spatio-temporal analysis via low rank tensor learning. In Advances in Neural Information Processing Systems 27, 2014.

[3] B. Barak and A. Moitra. Tensor prediction, Rademacher complexity and random 3-XOR. arXiv:1501.06521, 2015.

[4] P. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. The Annals of Statistics, 33:1487–1537, 2005.

[5] C. Bennett and R. Sharpley. Interpolation of Operators. Academic Press, Boston, 1988.

[6] S. Bhojanapalli and S. Sanghavi. A new sampling technique for tensors. arXiv:1502.05023, 2015.

[7] V.-G.
Blanca, G.-S. Gabriel, and P.-M. Rafael. Effects of relevant contextual features in the performance of a restaurant recommender system. In Proceedings of the 3rd Workshop on Context-Aware Recommender Systems, 2011.

[8] F. Fouss, A. Pirotte, J.-M. Renders, and M. Saerens. Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. IEEE Trans. on Knowl. and Data Eng., 19(3):355–369, Mar. 2007.

[9] S. Gandy, B. Recht, and I. Yamada. Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse Problems, 27:025010, 2011.

[10] F. L. Hitchcock. The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics, 6:164–189, 1927.

[11] F. L. Hitchcock. Multiple invariants and generalized rank of a p-way matrix or tensor. Journal of Mathematics and Physics, 7:39–79, 1927.

[12] M. Imaizumi and K. Hayashi. Doubly decomposing nonparametric tensor regression. In International Conference on Machine Learning (ICML2016), to appear, 2016.

[13] P. Jain and S. Oh. Provable tensor factorization with missing data. In Advances in Neural Information Processing Systems 27, pages 1431–1439. Curran Associates, Inc., 2014.

[14] H. Kanagawa, T. Suzuki, H. Kobayashi, N. Shimizu, and Y. Tagami. Gaussian process nonparametric tensor estimator and its minimax optimality. In International Conference on Machine Learning (ICML2016), pages 1632–1641, 2016.

[15] A. Karatzoglou, X. Amatriain, L. Baltrunas, and N. Oliver. Multiverse recommendation: N-dimensional tensor factorization for context-aware collaborative filtering. In Proceedings of the 4th ACM Conference on Recommender Systems 2010, pages 79–86, 2010.

[16] T. G. Kolda and B. W. Bader. Tensor decompositions and applications.
SIAM Rev., 51(3):455–500, Aug. 2009.

[17] M. Ledoux. The concentration of measure phenomenon. Number 89 in Mathematical Surveys and Monographs. American Mathematical Soc., 2005.

[18] J. Liu, P. Musialski, P. Wonka, and J. Ye. Tensor completion for estimating missing values in visual data. In Proceedings of the 12th International Conference on Computer Vision (ICCV), pages 2114–2121, 2009.

[19] M. Mørup. Applications of tensor (multiway array) factorizations and decompositions in data mining. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1):24–40, 2011.

[20] B. Romera-Paredes, H. Aung, N. Bianchi-Berthouze, and M. Pontil. Multilinear multitask learning. In Proceedings of the 30th International Conference on Machine Learning (ICML2013), volume 28 of JMLR Workshop and Conference Proceedings, pages 1444–1452, 2013.

[21] P. Shah, N. Rao, and G. Tang. Optimal low-rank tensor recovery from separable measurements: Four contractions suffice. arXiv:1505.04085, 2015.

[22] W. Shen and S. Ghosal. Adaptive Bayesian density regression for high-dimensional data. Bernoulli, 22(1):396–420, 2016.

[23] M. Signoretto, L. D. Lathauwer, and J. Suykens. Nuclear norms for tensors and their use for convex multilinear estimation. Technical Report 10-186, ESAT-SISTA, K.U.Leuven, 2010.

[24] M. Signoretto, L. D. Lathauwer, and J. A. K. Suykens. Learning tensors in reproducing kernel Hilbert spaces with multilinear spectral penalties. CoRR, abs/1310.4977, 2013.

[25] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.

[26] I. Steinwart, D. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In Proceedings of the Annual Conference on Learning Theory, pages 79–93, 2009.

[27] W. Sun, Z. Wang, H. Liu, and G. Cheng. Non-convex statistical optimization for sparse tensor graphical model.
In Advances in Neural Information Processing Systems, pages 1081–1089, 2015.

[28] T. Suzuki. Convergence rate of Bayesian tensor estimator and its minimax optimality. In Proceedings of the 32nd International Conference on Machine Learning (ICML2015), pages 1273–1282, 2015.

[29] R. Tomioka and T. Suzuki. Convex tensor decomposition via structured Schatten norm regularization. In Advances in Neural Information Processing Systems 26, pages 1331–1339, 2013.

[30] R. Tomioka, T. Suzuki, K. Hayashi, and H. Kashima. Statistical performance of convex tensor decomposition. In Advances in Neural Information Processing Systems 24, pages 972–980, 2011.

[31] L. R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966.

[32] S. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.

[33] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York, 1996.

[34] K. Wimalawarne, M. Sugiyama, and R. Tomioka. Multitask learning meets tensor factorization: task imputation via convex optimization. In Advances in Neural Information Processing Systems 27, pages 2825–2833, 2014.

[35] Z. Xu, F. Yan, and Y. A. Qi. InfTucker: t-process based infinite tensor decomposition. CoRR, abs/1108.6296, 2011.

[36] Z. Zhang and S. Aeron. Exact tensor completion using t-SVD. arXiv:1502.04689, 2015.

[37] T. Zhao, Z. Wang, and H. Liu. A nonconvex optimization framework for low rank matrix estimation. In Advances in Neural Information Processing Systems 28, pages 559–567.
Curran Associates, Inc., 2015.