{"title": "Distributed Flexible Nonlinear Tensor Factorization", "book": "Advances in Neural Information Processing Systems", "page_first": 928, "page_last": 936, "abstract": "Tensor factorization is a powerful tool to analyse multi-way data. Recently proposed nonlinear factorization methods, although capable of capturing complex relationships, are computationally quite expensive and may suffer a severe learning bias in case of extreme data sparsity. Therefore, we propose a distributed, flexible nonlinear tensor factorization model, which avoids the expensive computations and structural restrictions of the Kronecker-product in the existing TGP formulations, allowing an arbitrary subset of tensor entries to be selected for training. Meanwhile, we derive a tractable and tight variational evidence lower bound (ELBO) that enables highly decoupled, parallel computations and high-quality inference. Based on the new bound, we develop a distributed, key-value-free inference algorithm in the MapReduce framework, which can fully exploit the memory cache mechanism in fast MapReduce systems such as Spark. Experiments demonstrate the advantages of our method over several state-of-the-art approaches, in terms of both predictive performance and computational efficiency.", "full_text": "Distributed Flexible Nonlinear Tensor Factorization\n\nShandian Zhe\u00a7, Kai Zhang\u2020, Pengyuan Wang\u2021, Kuang-chih Lee(cid:93), Zenglin Xu(cid:92),\n\n\u00a7Dept. Computer Science, Purdue University, \u2020NEC Laboratories America, Princeton NJ,\n\n\u2021Dept. Marketing, University of Georgia at Athens, (cid:93)Yahoo! Research,\n\nYuan Qi(cid:91), Zoubin Gharamani(cid:63)\n\n(cid:92)Big Data Res. Center, School Comp. Sci. Eng., Univ. of Electr. Sci. & Tech. 
of China,\n\n(cid:91)Ant Financial Service Group, Alibaba, (cid:63)University of Cambridge\n\n\u00a7szhe@purdue.edu, \u2020kzhang@nec-labs.com, \u2021pengyuan@uga.edu,\n\n(cid:93)kclee@yahoo-inc.com, (cid:92)zlxu@uestc.edu.cn,\n(cid:91)alanqi0@outlook.com, (cid:63)zoubin@cam.ac.uk\n\nAbstract\n\nTensor factorization is a powerful tool to analyse multi-way data. Recently pro-\nposed nonlinear factorization methods, although capable of capturing complex\nrelationships, are computationally quite expensive and may suffer a severe learning\nbias in case of extreme data sparsity. Therefore, we propose a distributed, \ufb02exible\nnonlinear tensor factorization model, which avoids the expensive computations and\nstructural restrictions of the Kronecker-product in the existing TGP formulations,\nallowing an arbitrary subset of tensorial entries to be selected for training. Mean-\nwhile, we derive a tractable and tight variational evidence lower bound (ELBO) that\nenables highly decoupled, parallel computations and high-quality inference. Based\non the new bound, we develop a distributed, key-value-free inference algorithm in\nthe MAPREDUCE framework, which can fully exploit the memory cache mecha-\nnism in fast MAPREDUCE systems such as SPARK. Experiments demonstrate the\nadvantages of our method over several state-of-the-art approaches, in terms of both\npredictive performance and computational ef\ufb01ciency.\n\nIntroduction\n\n1\nTensors, or multidimensional arrays, are generalizations of matrices (from binary interactions) to\nhigh-order interactions between multiple entities. For example, we can extract a three-mode tensor\n(user, advertisement, context) from online advertising logs. To analyze tensor data, people usually\nturn to factorization approaches, which use a set of latent factors to represent each entity and\nmodel how the latent factors interact with each other to generate tensor elements. 
Classical tensor\nfactorization models, including Tucker [18] and CANDECOMP/PARAFAC (CP) [5], assume multi-\nlinear interactions and hence are unable to capture more complex, nonlinear relationships. Recently,\nXu et al. [19] proposed In\ufb01nite Tucker decomposition (InfTucker), which generalizes the Tucker\nmodel to in\ufb01nite feature space using a Tensor-variate Gaussian process (TGP) and is hence more\npowerful in modeling intricate nonlinear interactions. However, InfTucker and its variants [22, 23]\nare computationally expensive, because the Kronecker product between the covariances of all the\nmodes requires the TGP to model the entire tensor structure. In addition, they may suffer from\nthe extreme sparsity of real-world tensor data, i.e., when the proportion of the nonzero entries is\nextremely low. As is often the case, most of the zero elements in real tensors are meaningless: they\nsimply indicate missing or unobserved entries. Incorporating all of them in the training process may\naffect the factorization quality and lead to biased predictions.\nTo address these issues, we propose a distributed, \ufb02exible nonlinear tensor factorization model,\nwhich has several important advantages. First, it can capture highly nonlinear interactions in the\ntensor, and is \ufb02exible enough to incorporate arbitrary subset of (meaningful) tensor entries for the\ntraining. This is achieved by placing a Gaussian process prior over tensor entries, where the input\nis constructed by concatenating the latent factors from each mode and the intricate relationships\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fare captured by using the kernel function. By using such a construction, the covariance function\nis then free of the Kronecker-product structure, and as a result users can freely choose any subset\nof tensor elements for the training process and incorporate prior domain knowledge. 
For example,\none can choose a combination of balanced zero and nonzero elements to overcome the learning bias.\nSecond, the tight variational evidence lower bound (ELBO) we derived using functional derivatives\nand convex conjugates subsumes optimal variational posteriors, thus evades inef\ufb01cient, sequential\nE-M updates and enables highly ef\ufb01cient, parallel computations as well as improved inference quality.\nMoreover, the new bound allows us to develop a distributed, gradient-based optimization algorithm.\nFinally, we develop a simple yet very ef\ufb01cient procedure to avoid the data shuf\ufb02ing operation, a\nmajor performance bottleneck in the (key-value) sorting procedure in MAPREDUCE. That is, rather\nthan sending out key-value pairs, each mapper simply calculates and sends a global gradient vector\nwithout keys. This key-value-free procedure is general and can effectively prevent massive disk IOs\nand fully exploit the memory cache mechanism in fast MAPREDUCE systems, such as SPARK.\nEvaluation using small real-world tensor data have fully demonstrated the superior prediction accuracy\nof our model in comparison with InfTucker and other state-of-the-art; on large tensors with millions\nof nonzero elements, our approach is signi\ufb01cantly better than, or at least as good as two popular\nlarge-scale nonlinear factorization methods based on TGP: one uses hierarchical modeling to perform\ndistributed in\ufb01nite Tucker decomposition [22]; the other further enhances InfTucker by using Dirichlet\nprocess mixture prior over the latent factors and employs an online learning scheme [23]. Our method\nalso outperforms GigaTensor [8], a typical large-scale CP factorization algorithm, by a large margin.\nIn addition, our method achieves a faster training speed and enjoys almost linear speedup with respect\nto the number of computational nodes. 
We apply our model to CTR prediction for online advertising and achieve a significant, 20% improvement over the popular logistic regression and linear SVM approaches (Section 4 of the supplementary material).
2 Background
We first introduce the background knowledge. For convenience, we will use the same notations as in [19]. Specifically, we denote a K-mode tensor by M ∈ R^{d_1 × ... × d_K}, where the k-th mode is of dimension d_k. The tensor entry at location i (i = (i_1, ..., i_K)) is denoted by m_i. To introduce Tucker decomposition, we need to generalize matrix-matrix products to tensor-matrix products. Specifically, a tensor W ∈ R^{r_1 × ... × r_K} can multiply with a matrix U ∈ R^{s × t} at mode k when its dimension at mode k is consistent with the number of columns in U, i.e., r_k = t. The product is a new tensor of size r_1 × ... × r_{k-1} × s × r_{k+1} × ... × r_K. Each element is calculated by

(W ×_k U)_{i_1 ... i_{k-1} j i_{k+1} ... i_K} = Σ_{i_k=1}^{r_k} w_{i_1 ... i_K} u_{j i_k}.

The Tucker decomposition model uses a latent factor matrix U^(k) ∈ R^{d_k × r_k} in each mode k and a core tensor W ∈ R^{r_1 × ... × r_K}, and assumes the whole tensor M is generated by M = W ×_1 U^(1) ×_2 ... ×_K U^(K). Note that this is a multilinear function of W and {U^(1), ..., U^(K)}. It can be further simplified by restricting r_1 = r_2 = ... = r_K and the off-diagonal elements of W to be 0. In this case, the Tucker model becomes CANDECOMP/PARAFAC (CP).
The infinite Tucker decomposition (InfTucker) generalizes the Tucker model to an infinite feature space via a tensor-variate Gaussian process (TGP) [19]. Specifically, in a probabilistic framework, we assign a standard normal prior over each element of the core tensor W, and then marginalize out W to obtain the probability of the tensor given the latent factors:

p(M | U^(1), ..., U^(K)) = N(vec(M); 0, Σ^(1) ⊗ . .
. ⊗ Σ^(K))    (1)

where vec(M) is the vectorized whole tensor, Σ^(k) = U^(k) U^(k)ᵀ, and ⊗ is the Kronecker product. Next, we apply the kernel trick to model nonlinear interactions between the latent factors: each row u_t^(k) of the latent factors U^(k) is replaced by a nonlinear feature transformation φ(u_t^(k)), and thus an equivalent nonlinear covariance matrix Σ^(k) = k(U^(k), U^(k)) is used to replace U^(k) U^(k)ᵀ, where k(·,·) is the covariance function. After the nonlinear feature mapping, the original Tucker decomposition is performed in an (unknown) infinite feature space. Further, since the covariance of vec(M) is a function of the latent factors U = {U^(1), ..., U^(K)}, Equation (1) actually defines a Gaussian process (GP) on tensors, namely the tensor-variate GP (TGP) [19], where the inputs are based on U. Finally, we can use different noise models p(Y|M) to sample the observed tensor Y. For example, we can use Gaussian models and Probit models for continuous and binary observations, respectively.
3 Model
Despite being able to capture nonlinear interactions, InfTucker may suffer from the extreme sparsity issue in real-world tensor data sets. The reason is that its full covariance is a Kronecker product between the covariances over all the modes, {Σ^(1), ..., Σ^(K)} (see Equation (1)). Each Σ^(k) is of size d_k × d_k, and the full covariance is of size Π_k d_k × Π_k d_k. Thus TGP is projected onto the entire tensor with respect to the latent factors U, including all zero and nonzero elements, rather than a (meaningful) subset of them. However, real-world tensor data are usually extremely sparse, with a huge number of zero entries and a tiny portion of nonzero entries.
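To make the size blow-up concrete, the following numpy sketch (with purely illustrative mode dimensions, not the paper's datasets) contrasts the Kronecker-structured TGP covariance over all entries with a covariance over a selected subset of N entries:

```python
import numpy as np

rng = np.random.default_rng(0)
d = [30, 40, 50]  # per-mode dimensions (illustrative, not from the paper)

# Per-mode covariances Sigma^(k) = U^(k) U^(k)^T, as in Equation (1).
U = [rng.standard_normal((dk, 3)) for dk in d]
Sigma = [Uk @ Uk.T for Uk in U]

# The TGP covariance is Sigma^(1) kron ... kron Sigma^(K), whose side length
# is prod_k d_k -- it couples every tensor entry, zeros included.
full_rows = int(np.prod(d))          # 30 * 40 * 50 = 60000 rows (and columns)

# Even forming it is infeasible here; a tiny example shows the growth:
tiny_kron = np.kron(np.eye(3), np.eye(4))   # (3x3) kron (4x4) -> 12 x 12
print(full_rows, tiny_kron.shape)

# By contrast, a covariance over a selected subset of N entries is just N x N.
N = 200
```

Even at these modest dimensions the full Kronecker covariance would have 60000 × 60000 entries, while the subset covariance stays N × N regardless of the tensor's overall size.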
On one hand, because most zero entries are meaningless (they are either missing or unobserved), using them can adversely affect the tensor factorization quality and lead to biased predictions; on the other hand, incorporating numerous zero entries into GP models will result in large covariance matrices and high computational costs. Zhe et al. [22, 23] proposed to improve the scalability by modeling subtensors instead, but the sampled subtensors can still be very sparse. Even worse, because they are typically of small dimensions (for efficiency considerations), it is often possible to encounter subtensors full of zeros. This may further incur numerical instabilities in model estimation.
To address these issues, we propose a flexible Gaussian process tensor factorization model. While inheriting the nonlinear modeling power, our model disposes of the Kronecker-product structure in the full covariance and can therefore select an arbitrary subset of tensor entries for training.
Specifically, given a tensor M ∈ R^{d_1 × ... × d_K}, for each tensor entry m_i (i = (i_1, ..., i_K)), we construct an input x_i by concatenating the corresponding latent factors from all the modes: x_i = [u_{i_1}^(1), ..., u_{i_K}^(K)], where u_{i_k}^(k) is the i_k-th row of the latent factor matrix U^(k) for mode k. We assume that there is an underlying function f : R^{Σ_{j=1}^K d_j} → R such that m_i = f(x_i) = f([u_{i_1}^(1), ..., u_{i_K}^(K)]). This function is unknown and can be complex and nonlinear. To learn the function, we assign a Gaussian process prior over f: for any set of tensor entries S = {i_1, ..., i_N}, the function values f_S = {f(x_{i_1}), ..., f(x_{i_N})} are distributed according to a multivariate Gaussian distribution with mean 0 and covariance determined by X_S = {x_{i_1}, ..., x_{i_N}}:

p(f_S | U) = N(f_S | 0, k(X_S, X_S))
where k(·,·) is a (nonlinear) covariance function. Because k(x_i, x_j) = k([u_{i_1}^(1), ..., u_{i_K}^(K)], [u_{j_1}^(1), ..., u_{j_K}^(K)]), there is no Kronecker-product structural constraint, and so any subset of tensor entries can be selected for training. To prevent the learning process from being biased toward zero, we can use a set of entries with balanced zeros and nonzeros; furthermore, useful domain knowledge can also be incorporated to select meaningful entries for training. Note, however, that if we still use all the tensor entries and intentionally impose the Kronecker-product structure on the full covariance, our model reduces to InfTucker. Therefore, from the modeling perspective, the proposed model is more general.
We further assign a standard normal prior over the latent factors U. Given the selected tensor entries m = [m_{i_1}, ..., m_{i_N}], the observed entries y = [y_{i_1}, ..., y_{i_N}] are sampled from a noise model p(y|m). In this paper, we deal with both continuous and binary observations. For continuous data, we use the Gaussian model, p(y|m) = N(y | m, β^{-1} I), and the joint probability is

p(y, m, U) = Π_{t=1}^K N(vec(U^(t)) | 0, I) N(m | 0, k(X_S, X_S)) N(y | m, β^{-1} I)    (2)

where S = [i_1, ..., i_N]. For binary data, we use the Probit model in the following manner. We first introduce augmented variables z = [z_1, ..., z_N] and then decompose the Probit model into p(z_j | m_{i_j}) = N(z_j | m_{i_j}, 1) and p(y_{i_j} | z_j) = 1(y_{i_j} = 0) 1(z_j ≤ 0) + 1(y_{i_j} = 1) 1(z_j > 0), where 1(·) is the indicator function. Then the joint probability is

p(y, z, m, U) = Π_{t=1}^K N(vec(U^(t)) | 0, I) N(m | 0, k(X_S, X_S)) N(z | m, I) · Π_j [1(y_{i_j} = 0) 1(z_j ≤ 0) + 1(y_{i_j} = 1) 1(z_j > 0)].    (3)

4 Distributed Variational Inference
Real-world tensors often comprise a large number of entries, say, millions of nonzeros and billions of zeros, making exact inference in the proposed model totally intractable.
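The construction above can be sketched in a few lines of numpy (the RBF kernel, the sizes, and the chosen entries are illustrative assumptions; the paper selects kernels by cross-validation in the experiments):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = [20, 15, 10], 3                          # mode dimensions and rank (illustrative)
U = [rng.standard_normal((dk, r)) for dk in d]  # latent factor matrices U^(k)

def make_input(idx):
    """x_i = [u^(1)_{i1}, ..., u^(K)_{iK}]: concatenate one factor row per mode."""
    return np.concatenate([U[k][ik] for k, ik in enumerate(idx)])

def rbf(X, Y, ls=1.0):
    # Squared-exponential covariance on the concatenated inputs.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

# Select an arbitrary subset S of entries (e.g. balanced zeros/nonzeros).
S = [(0, 3, 7), (5, 0, 2), (19, 14, 9), (2, 2, 2)]
X_S = np.stack([make_input(i) for i in S])       # N x (sum of row lengths) inputs
K_SS = rbf(X_S, X_S)                             # N x N, no Kronecker structure

# GP prior over the selected entries, plus Gaussian noise as in Equation (2).
beta = 10.0
m = rng.multivariate_normal(np.zeros(len(S)), K_SS + 1e-8 * np.eye(len(S)))
y = m + rng.normal(0.0, beta ** -0.5, size=len(S))
```

Because the covariance is built entry-wise from the concatenated factors, its size is set by the number of selected entries, not by the tensor's dimensions.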
This motivates us to develop a distributed variational inference algorithm, presented as follows.
4.1 Tractable Variational Evidence Lower Bound
Since the GP covariance term k(X_S, X_S) (see Equations (2) and (3)) intertwines all the latent factors, exact inference in parallel is quite difficult. Therefore, we first derive a tractable variational evidence lower bound (ELBO), following the sparse Gaussian process framework of Titsias [17]. The key idea is to introduce a small set of inducing points B = {b_1, ..., b_p} and latent targets v = {v_1, ..., v_p} (p ≪ N). Then we augment the original model with a joint multivariate Gaussian distribution of the latent tensor entries m and targets v, p(m, v | U, B) = N([m, v]ᵀ | [0, 0]ᵀ, [K_SS, K_SB; K_BS, K_BB]), where K_SS = k(X_S, X_S), K_BB = k(B, B), K_SB = k(X_S, B) and K_BS = k(B, X_S). We use Jensen's inequality and conditional Gaussian distributions to construct the ELBO. Using a very similar derivation to [17], we can obtain a tractable ELBO for our model on continuous data, log p(y, U | B) ≥ L_1(U, B, q(v)), where

L_1(U, B, q(v)) = log p(U) + ∫ q(v) log (p(v|B) / q(v)) dv + Σ_j ∫ q(v) F_v(y_{i_j}, β) dv.    (4)

Here p(v|B) = N(v | 0, K_BB), q(v) is the variational posterior for the latent targets v, and F_v(·_j, ∗) = ∫ log(N(·_j | m_{i_j}, ∗)) N(m_{i_j} | μ_j, σ_j²) dm_{i_j}, where μ_j = k(x_{i_j}, B) K_BB^{-1} v and σ_j² = Σ(j, j) = k(x_{i_j}, x_{i_j}) − k(x_{i_j}, B) K_BB^{-1} k(B, x_{i_j}). Note that L_1 is decomposed into a summation of terms involving individual tensor entries i_j (1 ≤ j ≤ N). The additive form enables us to distribute the computation across multiple computers.
For binary data, we introduce a variational posterior q(z) and make the mean-field assumption that q(z) = Π_j q(z_j).
Following a similar derivation to the continuous case, we can obtain a tractable ELBO for binary data, log p(y, U | B) ≥ L_2(U, B, q(v), q(z)), where

L_2(U, B, q(v), q(z)) = log p(U) + ∫ q(v) log (p(v|B) / q(v)) dv + Σ_j ∫ q(z_j) log (p(y_{i_j} | z_j) / q(z_j)) dz_j + Σ_j ∫∫ q(v) q(z_j) F_v(z_j, 1) dz_j dv.    (5)

One can simply use the standard expectation-maximization (EM) framework to optimize (4) and (5) for model inference, i.e., the E step updates the variational posteriors {q(v), q(z)} and the M step updates the latent factors U, the inducing points B and the kernel parameters. However, the sequential E-M updates cannot fully exploit parallel computing resources. Due to the strong dependencies between the E step and the M step, the sequential E-M updates may take a large number of iterations to converge. Things become worse in the binary case: in the E step, the updates of q(v) and q(z) also depend on each other, making parallel inference even less efficient.
4.2 Tight and Parallelizable Variational Evidence Lower Bound
In this section, we further derive tight(er) ELBOs that subsume the optimal variational posteriors for q(v) and q(z). Thereby we can avoid the sequential E-M updates and perform decoupled, highly efficient parallel inference. Moreover, the inference quality is very likely to be improved using tighter bounds. Due to the space limit, we only present key ideas and results here; detailed discussions are given in Sections 1 and 2 of the supplementary material.
Tight ELBO for continuous tensors. We take the functional derivative of L_1 with respect to q(v) in (4).
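For the Gaussian noise model, the per-entry term F_v in (4) has a simple closed form. The sketch below (treating the second argument of F_v as the precision β, consistent with p(y|m) = N(y | m, β⁻¹I), and using arbitrary test values) checks it against a Monte Carlo estimate:

```python
import numpy as np

def fv_gauss(y, mu, var, beta):
    """F_v(y, beta) = ∫ log N(y | m, beta^-1) N(m | mu, var) dm, in closed form:
    log N(y | mu, beta^-1) - (beta / 2) * var."""
    log_n = -0.5 * np.log(2 * np.pi / beta) - 0.5 * beta * (y - mu) ** 2
    return log_n - 0.5 * beta * var

# Monte Carlo check of the closed form (arbitrary test values).
rng = np.random.default_rng(2)
y, mu, var, beta = 0.7, -0.2, 0.3, 4.0
m = rng.normal(mu, np.sqrt(var), size=2_000_000)
mc = np.mean(-0.5 * np.log(2 * np.pi / beta) - 0.5 * beta * (y - m) ** 2)
print(abs(mc - fv_gauss(y, mu, var, beta)))  # small (Monte Carlo error)
```

This closed form is what makes each summand in (4) cheap to evaluate: the Gaussian expectation adds only the −(β/2)σ_j² correction to the plug-in log-likelihood.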
By setting the derivative to zero, we obtain the optimal q(v) (which is a Gaussian distribution); substituting it back into L_1 and manipulating the terms, we achieve the following tighter ELBO.
Theorem 4.1. For continuous data, we have log p(y, U | B) ≥ L_1^*(U, B), where

L_1^*(U, B) = −(1/2) Σ_{k=1}^K ‖U^(k)‖_F² + (1/2) log |K_BB| − (1/2) log |K_BB + βA_1| − (β/2) a_2 − (β/2) a_3 + (β/2) tr(K_BB^{-1} A_1) + (β²/2) a_4ᵀ (K_BB + βA_1)^{-1} a_4 + (N/2) log (β / 2π),    (6)

where ‖·‖_F is the Frobenius norm, and

A_1 = Σ_j k(B, x_{i_j}) k(x_{i_j}, B),  a_2 = Σ_j y_{i_j}²,  a_3 = Σ_j k(x_{i_j}, x_{i_j}),  a_4 = Σ_j k(B, x_{i_j}) y_{i_j}.

Tight ELBO for binary tensors. The binary case is more difficult because q(v) and q(z) are coupled together (see (5)). We use the following steps: we first fix q(z) and plug in the optimal q(v) in the same way as in the continuous case. We then obtain an intermediate ELBO L̂_2 that only contains q(z). However, a quadratic term in L̂_2, (1/2) (K_BS⟨z⟩)ᵀ (K_BB + A_1)^{-1} (K_BS⟨z⟩), intertwines all of {q(z_j)}_j in L̂_2, making it infeasible to analytically derive or parallelly compute the optimal {q(z_j)}_j. To overcome this difficulty, we use the convex conjugate of the quadratic term and introduce a variational parameter λ to decouple the dependencies between {q(z_j)}_j. After that, we are able to derive the optimal {q(z_j)}_j using functional derivatives and to obtain the following tight ELBO.
Theorem 4.2.
For binary data, we have log p(y, U | B) ≥ L_2^*(U, B, λ), where

L_2^*(U, B, λ) = −(1/2) Σ_{k=1}^K ‖U^(k)‖_F² + (1/2) log |K_BB| − (1/2) log |K_BB + A_1| − (1/2) a_3 + (1/2) tr(K_BB^{-1} A_1) − (1/2) λᵀ K_BB λ + Σ_j log Φ((2y_{i_j} − 1) λᵀ k(B, x_{i_j})),    (7)

where Φ(·) is the cumulative distribution function of the standard Gaussian.
As we can see, due to the additive forms of the terms in L_1^* and L_2^*, such as A_1, a_2, a_3 and a_4, the computation of the tight ELBOs and their gradients can be efficiently performed in parallel.
4.3 Distributed Inference on the Tight Bound
4.3.1 Distributed Gradient-based Optimization
Given the tighter ELBOs in (6) and (7), we develop a distributed algorithm to optimize the latent factors U, the inducing points B, the variational parameters λ (for binary data) and the kernel parameters. We distribute the computations over multiple computational nodes (MAP step) and then collect the results to calculate the ELBO and its gradient (REDUCE step). A standard routine, such as gradient descent or L-BFGS, is then used to solve the optimization problem.
For binary data, we further find that λ can be updated with a simple fixed point iteration:

λ^(t+1) = (K_BB + A_1)^{-1} (A_1 λ^(t) + a_5),    (8)

where a_5 = Σ_j k(B, x_{i_j}) (2y_{i_j} − 1) N(k(B, x_{i_j})ᵀ λ^(t) | 0, 1) / Φ((2y_{i_j} − 1) k(B, x_{i_j})ᵀ λ^(t)).
Apparently, the update can be efficiently performed in parallel (due to the additive structure of A_1 and a_5). Moreover, the convergence is guaranteed by the following lemma. The proof is given in Section 3 of the supplementary material.
Lemma 4.3.
Given U and B, we have L_2^*(U, B, λ^(t+1)) ≥ L_2^*(U, B, λ^(t)), and the fixed point iteration (8) always converges.
To use the fixed point iteration, before we calculate the gradients with respect to U and B, we first optimize λ via (8) in an inner loop. In the outer control, we then employ gradient descent or L-BFGS to optimize U and B. This leads to an even tighter bound for our model: L_2^**(U, B) = max_λ L_2^*(U, B, λ) = max_{q(v), q(z)} L_2(U, B, q(v), q(z)). Empirically, this converges much faster than feeding the optimization algorithms with ∂λ, ∂U and ∂B altogether, especially for large data.
4.3.2 Key-Value-Free MAPREDUCE
We now present the detailed design of the MAPREDUCE procedures that fulfill our distributed inference. Basically, we first allocate a set of tensor entries S_t to each MAPPER t so that the corresponding components of the ELBO and the gradients are calculated; then the REDUCER aggregates the local results from each MAPPER to obtain the integrated, global ELBO and gradient.
We first consider the standard (key-value) design. For brevity, we take the gradient computation for the latent factors as an example. For each tensor entry i on a MAPPER, we calculate the corresponding gradients {∂u_{i_1}^(1), ..., ∂u_{i_K}^(K)} and then send out the key-value pairs {(k, i_k) → ∂u_{i_k}^(k)}_k, where the key indicates the mode and the index of the latent factors. The REDUCER aggregates gradients with the same key to recover the full gradient with respect to each latent factor.
Although the (key-value) MAPREDUCE has been successfully applied in numerous applications, it relies on an expensive data shuffling operation: the REDUCE step has to sort the MAPPERS' output by the keys before aggregation.
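The two designs can be contrasted in a toy in-memory simulation (an illustrative Python stand-in, not the paper's SPARK implementation); both produce the same global gradient, but only the key-value design requires grouping by key:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(3)
P = 12  # total number of parameters (illustrative)
# Each "tensor entry" touches a few parameter slots with some gradient value.
entries = [(rng.choice(P, size=3, replace=False), rng.standard_normal(3))
           for _ in range(40)]
partitions = [entries[:20], entries[20:]]   # two MAPPERS

# Key-value design: emit (key, value) pairs; the reducer must group by key.
pairs = [(k, v) for part in partitions for slots, vals in part
         for k, v in zip(slots, vals)]
grouped = defaultdict(float)
for k, v in pairs:
    grouped[k] += v
g_kv = np.zeros(P)
for k, v in grouped.items():
    g_kv[k] = v

# Key-value-free design: each mapper maintains a full gradient vector and
# updates only the relevant components; the reducer just sums the vectors.
def mapper(part):
    g = np.zeros(P)
    for slots, vals in part:
        g[slots] += vals
    return g

g_free = sum(mapper(part) for part in partitions)
print(np.allclose(g_kv, g_free))  # True: same gradient, no key sorting needed
```

In a real MAPREDUCE system the grouping step above is exactly the on-disk shuffle; summing fixed-length vectors avoids it entirely.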
Since the sorting is usually performed on disk due to the significant data size, intensive disk I/O and network communication become serious computational overheads. To overcome this deficiency, we devise a key-value-free MAPREDUCE scheme to avoid on-disk data shuffling operations. Specifically, on each MAPPER, a complete gradient vector is maintained for all the parameters, including U, B and the kernel parameters; however, only the relevant components of the gradient, as specified by the tensor entries allocated to this MAPPER, will be updated. After the updates, each MAPPER sends out the full gradient vector, and the REDUCER simply sums them up to obtain a global gradient vector without having to perform any extra data sorting. Note that a similar procedure can also be used to perform the fixed point iteration for λ (in binary tensors).
Efficient MAPREDUCE systems, such as SPARK [21], can fully optimize the non-shuffling MAP and REDUCE, where most of the data are buffered in memory and disk I/O is avoided as much as possible; by contrast, the performance with data shuffling degrades severely [3]. This is verified in our evaluations: on a small tensor of size 100 × 100 × 100, our key-value-free MAPREDUCE gains a 30-fold speedup over the traditional key-value process. Therefore, our algorithm can fully exploit the memory-cache mechanism to achieve fast inference.
4.4 Algorithm Complexity
Suppose we use N tensor entries for training, with p inducing points and T MAPPERS; the time complexity for each MAPPER node is O(p²N / T). Since p ≪ N is a fixed constant (p = 100 in our experiments), the time complexity is linear in the number of tensor entries.
The space complexity for each MAPPER node is O(Σ_{j=1}^K m_j r_j + p² + NK / T), in order to store the latent factors, their gradients, the covariance matrix on the inducing points, and the indices of the latent factors for each tensor entry. Again, the space complexity is linear in the number of tensor entries. In comparison, InfTucker utilizes the Kronecker-product properties to calculate the gradients and has to perform an eigenvalue decomposition of the covariance matrix in each tensor mode. Therefore it has higher time and space complexities (see [19] for details) and is not scalable to large dimensions.
5 Related work
Classical tensor factorization models include Tucker [18] and CP [5], based on which many excellent works have been developed [2, 16, 6, 20, 14, 7, 13, 8, 1]. Despite their widespread success, their underlying multilinear factorization structures prevent them from capturing more complex, nonlinear relationships in real-world applications. Infinite Tucker decomposition [19] and its distributed or online extensions [22, 23] overcome this limitation by modeling tensors or subtensors via tensor-variate Gaussian processes (TGP). However, these methods may suffer from the extreme sparsity of real-world tensors due to the Kronecker-product structure in the TGP formulations. Our model further addresses this issue by eliminating the Kronecker-product restriction, and can model an arbitrary subset of tensor entries. In theory, all such nonlinear factorization models belong to the family of random function prior models [11] for exchangeable multidimensional arrays.
Our distributed variational inference algorithm is based on sparse GP [12], an efficient approximation framework to scale up GP models. Sparse GP uses a small set of inducing points to break the dependency between random function values. Recently, Titsias [17] proposed a variational learning framework for sparse GP, based on which Gal et al.
[4] derived a tight variational lower bound for distributed inference of GP regression and GPLVM [10]. The derivation of the tight ELBO in our model for continuous tensors is similar to [4]; however, the gradient calculation is substantially different, because the input to our GP factorization model is the concatenation of the latent factors. Many tensor entries may partly share the same latent factors, causing a large number of key-value pairs to be sent during the distributed gradient calculation. This would incur an expensive data shuffling procedure that takes place on disk. To improve the computational efficiency, we develop a key-value-free MAPREDUCE scheme to avoid data shuffling and to fully exploit the memory-cache mechanism in efficient MAPREDUCE systems. This strategy is also applicable to other MAPREDUCE-based learning algorithms. In addition to continuous data, we also develop a tight ELBO for binary data based on the optimal variational posteriors. By introducing p extra variational parameters with convex conjugates (p is the number of inducing points), our inference can be performed efficiently in a distributed manner, which avoids explicit optimization over a large number of variational posteriors for the latent tensor entries and inducing targets. Our method can also be useful for GP classification problems.
6 Experiments
6.1 Evaluation on Small Tensor Data
For evaluation, we first compared our method with various existing tensor factorization methods. To this end, we used four small real datasets where all the methods are computationally feasible: (1) Alog, a real-valued tensor of size 200 × 100 × 200, representing the three-way interaction (user, action, resource) in a file access log. It contains 0.33% nonzero entries. (2) AdClick, a real-valued tensor of size 80 × 100 × 100, describing (user, publisher, advertisement) clicks for online advertising. It contains 2.39% nonzero entries.
(3) Enron, a binary tensor depicting the three-way relationship\n(sender, receiver, time) in emails. It contains 203\u00d7 203\u00d7 200 elements, of which 0.01% are nonzero.\n(4) NellSmall, a binary tensor of size 295 \u00d7 170 \u00d7 94, depicting the knowledge predicates (entity,\nrelationship, entity). The data set contains 0.05% nonzero elements.\nWe compared with CP, nonnegative CP (NN-CP) [15], high order SVD (HOSVD) [9], Tucker, in\ufb01nite\nTucker (InfTucker) [19] and its extension (InfTuckerEx) which uses the Dirichlet process mixture\n(DPM) prior to model latent clusters and local TGP to perform scalable, online factorization [23].\nNote that InfTucker and InfTuckerEx are nonlinear factorization approaches.\nFor testing, we used the same setting as in [23]. All the methods were evaluated via a 5-fold cross\nvalidation. The nonzero entries were randomly split into 5 folds; 4 folds were used for training and\nthe remaining non-zero entries and 0.1% zero entries were used for testing so that the number of\nnon-zero entries is comparable to the number of zero entries. In doing so, zero and nonzero entries are\ntreated equally important in testing, and the evaluation will not be dominated by large portion of zeros.\nFor InfTucker and InfTuckerEx, we performed extra cross-validations to select the kernel form (e.g.,\nRBF, ARD and Matern kernels) and the kernel parameters. For InfTuckerEx, we randomly sampled\nsubtensors and tuned the learning rate following [23]. 
For our model, the number of inducing points\nwas set to 100, and we used a balanced training set generated as follows: in addition to nonzero\nentries, we randomly sampled the same number of zero entries and made sure that they would not\noverlap with the testing zero elements.\nOur model used ARD kernel and the kernel parameters were estimated jointly with the latent factors.\nWe implemented our distributed inference algorithm with two optimization frameworks, gradient\ndescent and L-BFGS (denoted by Ours-GD and Ours-LBFGS respectively). For a comprehensive\nevaluation, we also examined CP on balanced training entries generated in the same way as our\nmodel, denoted by CP-2. The mean squared error (MSE) is used to evaluate predictive performance\non Alog and Click and area-under-curve (AUC) on Enron and NellSmall. The averaged results from\nthe 5-fold cross validation are reported.\nOur model achieves a higher prediction accuracy than InfTucker, and a better or comparable accuracy\nthan InfTuckerEx (see Figure 1). A t-test shows that our model outperforms InfTucker signi\ufb01cantly\n(p < 0.05) in almost all situations. Although InfTuckerEx uses the DPM prior to improve factoriza-\ntion, our model still obtains signi\ufb01cantly better predictions on Alog and AdClick and comparable or\nbetter performance on Enron and NellSmall. This might be attributed to the \ufb02exibility of our model\nin using balanced training entries to prevent the learning bias (toward numerous zeros). Similar\nimprovements can be observed from CP to CP-2. Finally, our model outperforms all the remaining\nmethods, demonstrating the advantage of our nonlinear factorization approach.\n6.2 Scalability Analysis\nTo examine the scalability of the proposed distributed inference algorithm, we used the following\nlarge real-world datasets: (1) ACC, A real-valued tensor describing three-way interactions (user,\naction, resource) in a code repository management system [23]. 
The tensor is of size 3K × 150 × 30K, where 0.009% of the elements are nonzero. (2) DBLP, a binary tensor depicting a three-way bibliography relationship (author, conference, keyword) [23]. The tensor was extracted from the DBLP database and contains 10K × 200 × 10K elements, where 0.001% are nonzero entries. (3) NELL, a binary tensor representing the knowledge predicates, in the form of (entity, entity, relationship) [22]. The tensor size is 20K × 12.3K × 280, and 0.0001% of the elements are nonzero.
The scalability of our distributed inference algorithm was examined with regard to the number of machines on the ACC dataset. The number of latent factors was set to 3, and we ran our algorithm using gradient descent. The results are shown in Figure 2(a), where the Y-axis shows the reciprocal of the running time multiplied by a constant, which corresponds to the running speed. As we can see, the speed of our algorithm scales up linearly with the number of machines.

Figure 1: The prediction results on small datasets, averaged over 5 runs. Panels: (a) Alog, (b) AdClick, (c) Enron, (d) NellSmall.

Figure 2: Prediction accuracy (averaged over 50 test datasets) on large tensor data, and the scalability. Panels: (a) Scalability, (b) ACC, (c) DBLP, (d) NELL.

6.3 Evaluation on Large Tensor Data
We then compared our approach with three state-of-the-art large-scale tensor factorization methods: GigaTensor [8], distributed infinite Tucker decomposition (DinTucker) [22], and InfTuckerEx [23]. Both GigaTensor and DinTucker are developed on HADOOP, while InfTuckerEx uses online inference. Our model was implemented on SPARK.
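The speed measure in Figure 2(a), the reciprocal of the running time up to a constant, has a simple concrete reading. The sketch below (the function name and the timings are hypothetical, not the paper's measurements) converts running times into relative speedups and parallel efficiency, where an efficiency near 1 corresponds to the linear scaling observed above.

```python
def parallel_efficiency(machines, run_times):
    """Turn running times into relative speeds (reciprocal running time,
    normalized to the smallest cluster) and divide by the ideal linear
    speedup, yielding parallel efficiency in (0, 1]."""
    base_m, base_t = machines[0], run_times[0]
    speedup = [base_t / t for t in run_times]   # speed relative to baseline
    ideal = [m / base_m for m in machines]      # perfectly linear scaling
    return [s / i for s, i in zip(speedup, ideal)]
```

For example, if doubling the machines halves the running time, the efficiency stays at 1; any shuffling or communication overhead would push it below 1.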
We ran GigaTensor, DinTucker and our approach on a large YARN cluster, and InfTuckerEx on a single computer.
We set the number of latent factors to 3 for the ACC and DBLP datasets, and 5 for the NELL dataset. Following the settings in [23, 22], we randomly chose 80% of the nonzero entries for training, and then sampled 50 test datasets from the remaining entries. For ACC and DBLP, each test dataset comprises 200 nonzero elements and 1,800 zero elements; for NELL, each test dataset contains 200 nonzero elements and 2,000 zero elements. GigaTensor was run with the default settings of its software package. For DinTucker and InfTuckerEx, we randomly sampled subtensors for distributed or online inference. The parameters, including the number and size of the subtensors and the learning rate, were selected in the same way as in [23]. The kernel form and parameters were chosen by cross-validation on the training tensor. For our model, we used the same settings as for the small data. We set 50 MAPPERS for GigaTensor, DinTucker and our model.
Figure 2(b)-(d) shows the predictive performance of all the methods. We observe that our approach consistently outperforms GigaTensor and DinTucker on all three datasets; our approach outperforms InfTuckerEx on ACC and DBLP and is slightly worse than InfTuckerEx on NELL. Note again that InfTuckerEx uses the DPM prior to enhance the factorization while our model does not. Finally, all the nonlinear factorization methods outperform GigaTensor, a distributed CP factorization algorithm, by a large margin, confirming the advantage of nonlinear factorizations on large data. In terms of speed, our algorithm is much faster than GigaTensor and DinTucker. For example, on the DBLP dataset, the average per-iteration running times were 1.45, 15.4 and 20.5 minutes for our model, GigaTensor and DinTucker, respectively.
This is not surprising, because (1) our model exploits the data sparsity and can exclude the numerous, meaningless zero elements from training; (2) our algorithm is based on SPARK, a more efficient MAPREDUCE system than HADOOP; and (3) our algorithm gets rid of data shuffling and can fully exploit the memory-cache mechanism of SPARK.

7 Conclusion
In this paper, we have proposed a novel flexible GP tensor factorization model. In addition, we have derived a tight ELBO for both continuous and binary problems, based on which we further developed an efficient distributed variational inference algorithm in the MAPREDUCE framework.

Acknowledgement
Dr. Zenglin Xu was supported by a grant from NSF China under No. 61572111. We thank IBM T.J. Watson Research Center for providing one dataset. We also thank Jiasen Yang for proofreading this paper.

References
[1] Choi, J. H. & Vishwanathan, S. (2014). DFacTo: Distributed factorization of tensors. In NIPS.
[2] Chu, W. & Ghahramani, Z. (2009). Probabilistic models for incomplete multi-dimensional arrays. In AISTATS.
[3] Davidson, A. & Or, A. (2013). Optimizing shuffle performance in Spark. University of California, Berkeley, Department of Electrical Engineering and Computer Sciences, Tech. Rep.
[4] Gal, Y., van der Wilk, M., & Rasmussen, C. (2014). Distributed variational inference in sparse Gaussian process regression and latent variable models. In NIPS.
[5] Harshman, R. A. (1970).
Foundations of the PARAFAC procedure: Model and conditions for an "explanatory" multi-mode factor analysis. UCLA Working Papers in Phonetics, 16, 1–84.
[6] Hoff, P. (2011). Hierarchical multilinear models for multiway data. Computational Statistics & Data Analysis.
[7] Hu, C., Rai, P., & Carin, L. (2015). Zero-truncated Poisson tensor factorization for massive binary tensors. In UAI.
[8] Kang, U., Papalexakis, E., Harpale, A., & Faloutsos, C. (2012). GigaTensor: scaling tensor analysis up by 100 times – algorithms and discoveries. In KDD.
[9] Lathauwer, L. D., Moor, B. D., & Vandewalle, J. (2000). A multilinear singular value decomposition. SIAM J. Matrix Anal. Appl., 21, 1253–1278.
[10] Lawrence, N. D. (2004). Gaussian process latent variable models for visualisation of high dimensional data. In NIPS.
[11] Lloyd, J. R., Orbanz, P., Ghahramani, Z., & Roy, D. M. (2012). Random function priors for exchangeable arrays with applications to graphs and relational data. In NIPS.
[12] Quiñonero-Candela, J. & Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6, 1939–1959.
[13] Rai, P., Hu, C., Harding, M., & Carin, L. (2015). Scalable probabilistic tensor factorization for binary and count data. In IJCAI.
[14] Rai, P., Wang, Y., Guo, S., Chen, G., Dunson, D., & Carin, L. (2014). Scalable Bayesian low-rank decomposition of incomplete multiway tensors. In ICML.
[15] Shashua, A. & Hazan, T. (2005). Non-negative tensor factorization with applications to statistics and computer vision. In ICML.
[16] Sutskever, I., Tenenbaum, J. B., & Salakhutdinov, R. R. (2009). Modelling relational data using Bayesian clustered tensor factorization. In NIPS.
[17] Titsias, M. K. (2009). Variational learning of inducing variables in sparse Gaussian processes. In AISTATS.
[18] Tucker, L. (1966).
Some mathematical notes on three-mode factor analysis. Psychometrika, 31, 279–311.
[19] Xu, Z., Yan, F., & Qi, Y. (2012). Infinite Tucker decomposition: Nonparametric Bayesian models for multiway data analysis. In ICML.
[20] Yang, Y. & Dunson, D. B. (2016). Bayesian conditional tensor factorizations for high-dimensional classification. Journal of the American Statistical Association, 656–669.
[21] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., & Stoica, I. (2012). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI.
[22] Zhe, S., Qi, Y., Park, Y., Xu, Z., Molloy, I., & Chari, S. (2016). DinTucker: Scaling up Gaussian process models on large multidimensional arrays. In AAAI.
[23] Zhe, S., Xu, Z., Chu, X., Qi, Y., & Park, Y. (2015). Scalable nonparametric multiway data analysis. In AISTATS.