{"title": "Predictive Matrix-Variate t Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1721, "page_last": 1728, "abstract": "It is becoming increasingly important to learn from a partially observed random matrix and predict its missing elements. We assume that the entire matrix is a single sample drawn from a matrix-variate t distribution and suggest a matrix-variate t model (MVTM) to predict the missing elements. We show that the MVTM generalizes a range of known probabilistic models, and automatically performs model selection to encourage sparse predictive models. Due to the non-conjugacy of its prior, it is difficult to make predictions by computing the mode or mean of the posterior distribution. We suggest an optimization method that sequentially minimizes a convex upper bound of the negative log-likelihood, which is very efficient and scalable. The experiments on toy data and the EachMovie dataset show good predictive accuracy of the model.", "full_text": "Predictive Matrix-Variate t Models\n\nShenghuo Zhu\n\nKai Yu\n\nYihong Gong\n\nNEC Labs America, Inc.\n\n10080 N. Wolfe Rd. SW3-350\n\nCupertino, CA 95014\n\n{zsh,kyu,ygong}@sv.nec-labs.com\n\nAbstract\n\nIt is becoming increasingly important to learn from a partially observed random matrix and predict its missing elements. We assume that the entire matrix is a single sample drawn from a matrix-variate t distribution and suggest a matrix-variate t model (MVTM) to predict the missing elements. We show that the MVTM generalizes a range of known probabilistic models, and automatically performs model selection to encourage sparse predictive models. Due to the non-conjugacy of its prior, it is difficult to make predictions by computing the mode or mean of the posterior distribution. We suggest an optimization method that sequentially minimizes a convex upper bound of the negative log-likelihood, which is very efficient and scalable. 
The experiments on toy data and the EachMovie dataset show good predictive accuracy of the model.\n\n1 Introduction\n\nMatrix analysis techniques, e.g., singular value decomposition (SVD), have been widely used in various data analysis applications. An important class of applications is to predict missing elements given a partially observed random matrix. For example, putting the ratings of users into a matrix form, the goal of collaborative filtering is to predict the unseen ratings in the matrix.\n\nTo predict unobserved elements in matrices, the structures of the matrices play an important role, for example, the similarity between columns and between rows. Such structures imply that the elements of a random matrix are no longer independent and identically distributed (i.i.d.). Without the i.i.d. assumption, many machine learning models are not applicable.\n\nIn this paper, we model the random matrix of interest as a single sample drawn from a matrix-variate t distribution, which is a generalization of the Student-t distribution. We call the predictive model under such a prior a matrix-variate t model (MVTM). Our study shows several interesting properties of the model. First, it continues a line of gradual generalizations across several known probabilistic models on random matrices, namely, from probabilistic principal component analysis (PPCA) [11], to Gaussian process latent-variable models (GPLVMs) [7], and to multi-task Gaussian processes (MTGPs) [13]. MVTMs can be further derived by analytically marginalizing out the hyper-parameters of these models. From a Bayesian modeling point of view, the marginalization of hyper-parameters amounts to an automatic model selection and usually leads to better generalization performance [8]. Second, the model selection by MVTMs explicitly encourages simpler predictive models that have lower ranks. 
Unlike direct rank minimization, the log-determinant terms in the matrix-variate t prior offer a continuous (though non-convex) optimization surface for the rank constraint. Third, like multivariate Gaussian distributions, a matrix-variate t prior is consistent under marginalization; that is, if a matrix follows a matrix-variate t distribution, then any of its sub-matrices follows a matrix-variate t distribution as well. This property allows us to generalize distributions for finite matrices to infinite stochastic processes.\n\nFigure 1: Models for matrix prediction. (a) MVTM. (b) and (c) are two normal-inverse-Wishart models, equivalent to the MVTM when the covariance variable S (or R) is marginalized. (d) MTGP, which requires optimizing the covariance variable S. Circle nodes represent random variables, shaded nodes (partially) observable variables, and text nodes given parameters.\n\nUnder a Gaussian noise model, the matrix-variate t distribution is not a conjugate prior. It is thus difficult to make predictions by computing the mode or mean of the posterior distribution. We suggest an optimization method that sequentially minimizes a convex upper bound of the negative log-likelihood, which is highly efficient and scalable. In the experiments, the algorithm shows very good efficiency and excellent prediction accuracy.\n\nThis paper is organized as follows. We review three existing models and introduce the matrix-variate t models in Section 2. The prediction methods are proposed in Section 3. In Section 4, the MVTM is compared with some other models. We illustrate the MVTM with experiments on a toy example and on movie-rating data in Section 5. 
We conclude in Section 6.\n\n2 Predictive Matrix-Variate t Models\n\n2.1 A Family of Probabilistic Models for Matrix Data\n\nIn this section we introduce three probabilistic models from the literature. Let Y be a p × m observational matrix and T be the underlying p × m noise-free random matrix. We assume Y_{i,j} = T_{i,j} + ε_{i,j}, ε_{i,j} ~ N(0, σ²), where Y_{i,j} denotes the (i,j)-th element of Y. If Y is partially observed, then Y_I denotes the set of observed elements and I is the corresponding index set.\n\nProbabilistic Principal Component Analysis (PPCA) [11] assumes that y_j, the j-th column vector of Y, is generated from a latent vector v_j in a k-dimensional linear space (k < p). The model is defined as y_j = W v_j + μ + ε_j and v_j ~ N_k(v_j; 0, I_k), where ε_j ~ N_p(ε_j; 0, σ² I_p) and W is a p × k loading matrix. By integrating out v_j, we obtain the marginal distribution y_j ~ N_p(y_j; μ, WW^⊤ + σ² I_p). Since the columns of Y are conditionally independent, letting S take the place of WW^⊤, PPCA is similar¹ to\n\nY_{i,j} = T_{i,j} + ε_{i,j},   T ~ N_{p,m}(T; 0, S, I_m),\n\nwhere N_{p,m}(·; 0, S, I_m) is a matrix-variate normal prior with zero mean, covariance S between rows, and identity covariance I_m between columns. PPCA aims to estimate the parameter W by maximum likelihood.\n\nGaussian Process Latent-Variable Model (GPLVM) [7] formulates a latent-variable model in a slightly unconventional way. It considers the same linear relationship from the latent representation v_j to the observations y_j. Instead of treating v_j as a random variable, GPLVM assigns a prior on W and sees {v_j} as parameters: y_j = W v_j + ε_j and W ~ N_{p,k}(W; 0, I_p, I_k), where the elements of W are independent Gaussian random variables. By marginalizing out W, we obtain a distribution in which each row of Y is an i.i.d. sample from a Gaussian process prior with covariance VV^⊤ + σ² I_m and V = [v1, . . . 
, vm]^⊤. Letting R take the place of VV^⊤, we rewrite a similar model as\n\nY_{i,j} = T_{i,j} + ε_{i,j},   T ~ N_{p,m}(T; 0, I_p, R).\n\n¹Because it requires S to be positive definite and W is usually low rank, they are not equivalent.\n\nFrom a matrix modeling point of view, GPLVM estimates the covariance between the rows and assumes the columns to be conditionally independent.\n\nMulti-task Gaussian Process (MTGP) [13] is a multi-task learning model where each column of Y is a predictive function of one task, sampled from a Gaussian process prior: y_j = t_j + ε_j and t_j ~ N_p(0, S), where ε_j ~ N_p(0, σ² I_p). It introduces a hierarchical model where an inverse-Wishart prior is added for the covariance,\n\nY_{i,j} = T_{i,j} + ε_{i,j},   T ~ N_{p,m}(T; 0, S, I_m),   S ~ IW_p(S; ν, I_p).\n\nMTGP utilizes the inverse-Wishart prior as regularization and obtains a maximum a posteriori (MAP) estimate of S.\n\n2.2 Matrix-Variate t Models\n\nThe models introduced in the previous section are closely related to each other. PPCA models the row covariance of Y, GPLVM models the column covariance, and MTGP assigns a hyper-prior to prevent over-fitting when estimating the (row) covariance. From a matrix modeling point of view, capturing the dependence structure of Y by its row or column covariance is a matter of choice; the two choices are not fundamentally different.² There is no reason to favor one over the other. By introducing the matrix-variate t models (MVTMs), they can be unified into the same model.\n\nFrom a Bayesian modeling viewpoint, one should marginalize out as many variables as possible [8]. We thus extend the MTGP model in two directions: (1) assume T ~ N_{p,m}(T; 0, S, I_m), which has covariances on both sides of the matrix; (2) marginalize the covariance S on one side (see Figure 1(b)). 
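The effect of this covariance-marginalization step can be illustrated numerically. The sketch below is our own illustration, not the authors' code: it assumes scipy's inverse-Wishart degree-of-freedom convention (under which a diagonal element of S is marginally inverse-gamma, so an element of T is marginally Student-t with df − p + 1 degrees of freedom), and the df value is arbitrary. Drawing a fresh covariance S per replicate shows the heavy tails that marginalizing S induces, compared with a fixed-covariance Gaussian:

```python
import numpy as np
from scipy.stats import invwishart, kurtosis

rng = np.random.default_rng(0)
p, n_draws, df = 3, 50_000, 14   # scipy convention: element dof = df - p + 1 = 12 (illustrative)

# One inverse-Wishart covariance per replicate, then one matrix-normal element.
S = invwishart.rvs(df=df, scale=np.eye(p), size=n_draws, random_state=rng)  # (n_draws, p, p)
t_elems = np.sqrt(S[:, 0, 0]) * rng.standard_normal(n_draws)  # marginally Student-t
g_elems = rng.standard_normal(n_draws)                        # fixed-covariance reference

excess_t = kurtosis(t_elems)  # positive excess kurtosis: heavy tails from marginalizing S
excess_g = kurtosis(g_elems)  # approximately zero for the Gaussian
```

For dof = 12 the theoretical excess kurtosis is 6/(12 − 4) ≈ 0.75, versus 0 for the Gaussian reference.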
Then we have a marginal distribution of T,\n\nPr(T) = ∫ N_{p,m}(T; 0, S, I_m) IW_p(S; ν, I_p) dS = t_{p,m}(T; ν, 0, I_p, I_m), (1)\n\nwhich is a matrix-variate t distribution. Because the inverse-Wishart distribution has different degree-of-freedom definitions in the literature, we use the definition in [5].\n\nFollowing the definition in [6], the matrix-variate t distribution of a p × m matrix T is given by\n\nt_{p,m}(T; ν, M, Σ, Ω) := (1/Z) |Σ|^{−m/2} |Ω|^{−p/2} |I_p + Σ^{−1}(T − M) Ω^{−1} (T − M)^⊤|^{−(ν+m+p−1)/2},\n\nwhere ν is the degree of freedom; M is a p × m matrix; Σ and Ω are positive definite matrices of size p × p and m × m, respectively; Z = π^{mp/2} Γ_p((ν+p−1)/2) / Γ_p((ν+m+p−1)/2); Γ_p(·) is the multivariate gamma function; and |·| stands for the determinant.\n\nThe model can be depicted as in Figure 1(a). One important property of the matrix-variate t distribution is that the marginal distribution of a sub-matrix still follows a matrix-variate t distribution with the same degree of freedom (see Section 3.1). Therefore, we can expand it to an infinite-dimensional stochastic process. By Eq. (1), we can see that Figure 1(a) and Figure 1(b) describe two equivalent models. Comparing them with the MTGP model represented in Figure 1(d), the difference lies in whether S is point-estimated or integrated out.\n\nInterestingly, the same matrix-variate t distribution can be equivalently derived by putting another hierarchical generative process on the covariance R, as described in Figure 1(c), where R follows an inverse-Wishart distribution. In other words, integrating out the covariance on either side, we obtain the same model. 
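For concreteness, the density above can be evaluated directly. The following sketch is our own (the helper name is hypothetical) and assumes the normalization Z = π^{mp/2} Γ_p((ν+p−1)/2)/Γ_p((ν+m+p−1)/2); it uses scipy's log multivariate gamma function:

```python
import numpy as np
from scipy.special import multigammaln

def matvar_t_logpdf(T, nu, M, Sigma, Omega):
    # Log-density of the matrix-variate t distribution t_{p,m}(T; nu, M, Sigma, Omega)
    # in the Gupta-Nagar parameterization assumed above.
    p, m = T.shape
    D = T - M
    # |I_p + Sigma^{-1} D Omega^{-1} D^T|, using solves instead of explicit inverses
    K = np.eye(p) + np.linalg.solve(Sigma, D @ np.linalg.solve(Omega, D.T))
    log_kernel = -0.5 * (nu + m + p - 1) * np.linalg.slogdet(K)[1]
    log_Z = (0.5 * m * p * np.log(np.pi)
             + multigammaln(0.5 * (nu + p - 1), p)
             - multigammaln(0.5 * (nu + m + p - 1), p)
             + 0.5 * m * np.linalg.slogdet(Sigma)[1]
             + 0.5 * p * np.linalg.slogdet(Omega)[1])
    return log_kernel - log_Z
```

For m = 1 this reduces to a multivariate Student-t density with ν degrees of freedom and shape Σ/ν, which gives a convenient cross-check.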
This implies that the model controls the complexity of the covariances on both sides of the matrix. Neither PPCA nor GPLVM has such a property.\n\nThe matrix-variate t distribution involves a determinant term of T, which becomes a log-determinant term in the log-likelihood or KL-divergence. The log-determinant term encourages sparsity of the matrix T, favoring lower rank. This property has been used as a heuristic for minimizing the rank of a matrix in [3]. Student-t priors have also been applied to enforce sparse kernel machines [10].\n\nHere we say a few words about the given parameters. Though we can use the evidence framework [8] or other methods to estimate ν, the results are not good in many cases (see [4]); usually we just set ν to a small number. Similarly, the estimated σ² does not give good results either, so we choose it by cross-validation. For the mean matrix M, in our experiments we simply use the sample average of all observed elements. For some tasks, when we have prior knowledge about the covariance between columns or between rows, we can use those covariance matrices in place of I_m or I_p.\n\n²GPLVM offers the advantage of using a nonlinear covariance function based on attributes.\n\n3 Prediction Methods\n\nWhen the evaluation of the prediction is a sum of individual losses, the optimal prediction is the individual mode of the marginal posterior distribution, i.e., arg max_{T_{ij}} Pr(T_{ij} | Y_I). However, there is no exact solution for the marginal posterior. We have two ways to approximate the optimal prediction.\n\nOne way to make predictions is to compute the mode of the joint posterior distribution of T, i.e., the prediction problem is\n\nT̂ = arg max_T {ln Pr(Y_I | T) + ln Pr(T)}. (2)\n\nThe computation of this estimate is usually easy; we discuss it in Section 3.3.\n\nAn alternative way is to use the individual means of the posterior distribution to approximate the individual modes. 
Since the collection of individual means is just the mean of the joint distribution, we only need to compute the joint posterior distribution. The problem of prediction by means is written as\n\nT̄ = E(T | Y_I). (3)\n\nHowever, it is usually difficult to compute the exact mean. One estimation method is the Monte Carlo method, which is computationally intensive. In Section 3.4, we discuss an approximation to compute the mean. In our experiments, prediction by means usually outperforms prediction by modes.\n\nBefore discussing the prediction methods, we introduce a few useful properties in Section 3.1 and suggest an optimization method as an efficient tool for prediction in Section 3.2.\n\n3.1 Properties\n\nThe MVTM has a rich set of properties. We list a few in the following theorem.\n\nTheorem 1. Let Θ be q × n, Φ be q × m, Ψ be p × n, and T be p × m. If\n\n[[Θ, Φ], [Ψ, T]] ~ t_{q+p, n+m}(·; ν, 0, diag(I_q, I_p), diag(I_n, I_m)), (4)\n\nthen\n\nPr(T) = t_{p,m}(T; ν, 0, I_p, I_m), (5)\nPr(T | Θ, Φ, Ψ) = t_{p,m}(T; ν+q+n, M, I_p + Ψ B Ψ^⊤, I_m + Φ^⊤ A Φ), (6)\nPr(Θ) = t_{q,n}(Θ; ν, 0, I_q, I_n), (7)\nPr(Φ | Θ) = t_{q,m}(Φ; ν+n, 0, A^{−1}, I_m), (8)\nPr(Ψ | Θ, Φ) = t_{p,n}(Ψ; ν+q, 0, I_p, B^{−1}) = Pr(Ψ | Θ), (9)\nE(T | Θ, Φ, Ψ) = M, (10)\nCov(vec(T^⊤) | Θ, Φ, Ψ) = (ν+q+n−2)^{−1} (I_p + Ψ B Ψ^⊤) ⊗ (I_m + Φ^⊤ A Φ), (11)\n\nwhere A := (Θ Θ^⊤ + I_q)^{−1}, B := (Θ^⊤ Θ + I_n)^{−1}, and M := Ψ Θ^⊤ A Φ = Ψ B Θ^⊤ Φ.\n\nThis theorem can be derived directly from Theorems 4.3.1 and 4.3.9 in [6] with a little calculus. It provides some insight into MVTMs. The marginal distribution in Eq. 
(5) has the same form as the joint distribution; therefore the matrix-variate t distribution is extensible to an infinite-dimensional stochastic process. As the conditional distribution in Eq. (6) is still a matrix-variate t distribution, we can use it to approximate the posterior distribution, which we do in Section 3.4.\n\nWe encounter log-determinant terms in the computation of the mode or mean estimates. The following theorem provides a quadratic upper bound for the log-determinant terms, which makes it possible to apply the optimization method in Section 3.2.\n\nLemma 1. If X is a p × p positive definite matrix, then ln |X| ≤ tr(X) − p. The equality holds when all eigenvalues of X equal one, in particular when X = I_p.\n\nProof. Let {λ1, · · · , λp} be the eigenvalues of X. We have ln |X| = Σ_i ln λ_i and tr(X) = Σ_i λ_i. Since ln λ_i ≤ λ_i − 1, the inequality follows. The equality holds when every λ_i = 1.\n\nTheorem 2. If Σ is a p × p positive definite matrix, Ω is an m × m positive definite matrix, and T and T0 are p × m matrices, then\n\nln |Σ + T Ω^{−1} T^⊤| ≤ h(T; T0, Σ, Ω) + h0(T0, Σ, Ω),\n\nwhere\n\nh(T; T0, Σ, Ω) := tr((Σ + T0 Ω^{−1} T0^⊤)^{−1} T Ω^{−1} T^⊤),\nh0(T0, Σ, Ω) := ln |Σ + T0 Ω^{−1} T0^⊤| + tr((Σ + T0 Ω^{−1} T0^⊤)^{−1} Σ) − p.\n\nThe equality holds when T = T0. Also it holds that\n\n∂ h(T; T0, Σ, Ω)/∂T |_{T=T0} = 2 (Σ + T0 Ω^{−1} T0^⊤)^{−1} T0 Ω^{−1} = ∂ ln |Σ + T Ω^{−1} T^⊤| / ∂T |_{T=T0}.\n\nApplying Lemma 1 with X = (Σ + T0 Ω^{−1} T0^⊤)^{−1} (Σ + T Ω^{−1} T^⊤) yields the inequality, and a little calculus gives the equality of the first-order derivatives. Note that h(·) is a quadratic convex function with respect to T, since (Σ + T0 Ω^{−1} T0^⊤)^{−1} and Ω^{−1} are positive definite.\n\n3.2 Optimization Method\n\nOnce the objective is given, prediction becomes an optimization problem. We use an EM-style optimization method to make the prediction. Let J(T) be the objective function to be minimized. If we can find an auxiliary function Q(T; T0) with the following properties, we can apply this method:\n\n1. J(T) ≤ Q(T; T0) and J(T0) = Q(T0; T0);\n2. ∂J(T)/∂T |_{T=T0} = ∂Q(T; T0)/∂T |_{T=T0};\n3. for a fixed T0, Q(T; T0) is quadratic and convex with respect to T.\n\nStarting from any T0, as long as we can find a T1 such that Q(T1; T0) < Q(T0; T0), we have J(T0) = Q(T0; T0) > Q(T1; T0) ≥ J(T1). If there exists a global minimum of J(T), there exists a global minimum of Q(T; T0) as well, because Q(T; T0) is an upper bound of J(T). Since Q(T; T0) is quadratic with respect to T, we can apply the Newton-Raphson method to minimize it. As long as T0 is not a local minimum, maximum or saddle point of J, we can find a T that reduces Q(T; T0), because Q(T; T0) has the same derivative as J(T) at T0. Usually a random starting point T0 is unlikely to be a local maximum, and then T1 cannot be a local maximum. 
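The bound of Theorem 2, which underlies the auxiliary functions used below, can be checked numerically. A minimal sketch with random positive definite Σ and Ω (our own illustration; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
p, m = 4, 3

def spd(n):
    # random symmetric positive definite matrix
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

Sigma, Omega = spd(p), spd(m)
T0 = rng.standard_normal((p, m))

C = Sigma + T0 @ np.linalg.solve(Omega, T0.T)   # Sigma + T0 Omega^{-1} T0^T
Cinv = np.linalg.inv(C)
h0 = np.linalg.slogdet(C)[1] + np.trace(Cinv @ Sigma) - p

def logdet(T):
    return np.linalg.slogdet(Sigma + T @ np.linalg.solve(Omega, T.T))[1]

def h(T):
    return np.trace(Cinv @ T @ np.linalg.solve(Omega, T.T))

# The bound holds for an arbitrary T, with equality at T = T0.
T = rng.standard_normal((p, m))
gap_random = h(T) + h0 - logdet(T)     # nonnegative
gap_at_T0 = h(T0) + h0 - logdet(T0)    # zero up to rounding
```

The matching first-order derivatives at T0 can likewise be confirmed by finite differences against 2(Σ + T0Ω^{−1}T0^⊤)^{−1}T0Ω^{−1}.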
If T0 is a local maximum, we can reselect a starting point that is not. After we find a Ti, we repeat the procedure to find a Ti+1 with J(Ti+1) < J(Ti), unless Ti is a local minimum or saddle point of J. Repeating this procedure, Ti converges to a local minimum or saddle point of J, as long as T0 is not a local maximum.\n\n3.3 Mode Prediction\n\nFollowing Eq. (2), the goal is to minimize the objective function\n\nĴ(T) := ℓ(T) + (ν+m+p−1)/2 · ln |I_p + T T^⊤|, (12)\n\nwhere ℓ(T) := −ln Pr(Y_I | T) = 1/(2σ²) Σ_{(i,j)∈I} (T_{ij} − Y_{ij})² + const. As Ĵ contains a log-determinant term, minimizing Ĵ by generic nonlinear optimization is slow. Here, we introduce an auxiliary function,\n\nQ(T; T0) := ℓ(T) + h(T; T0, I_p, I_m) + h0(T0, I_p, I_m). (13)\n\nBy Theorem 2, we have that Ĵ(T) ≤ Q(T; T0), Ĵ(T0) = Q(T0; T0), and Q(T; T0) has the same first-order derivative as Ĵ(T) at T0. Because ℓ and h are quadratic and convex, Q is quadratic and convex as well. Therefore, we can apply the optimization method in Section 3.2 to minimize Ĵ.\n\nHowever, when the size of T is large, finding T̂ is still time-consuming and requires a very large amount of space. In many tasks, we only need to infer a small portion of T̂. Therefore, we consider a low-rank approximation, using U V^⊤ to approximate T, where U is a p × k matrix and V is an m × k matrix. The problem of Eq. (2) is then approximated by arg min_{U,V} Ĵ(U V^⊤), which we solve by alternately optimizing U and V. We can put the final result in a canonical format as T̂ ≈ U S V^⊤, where U and V are semi-orthonormal and S is a k × k diagonal matrix. This result can be considered an SVD of an incomplete matrix under matrix-variate t regularization. The details are skipped because of the limited space.\n\n3.4 Variational Mean Prediction\n\nGiven the difficulty of explicitly computing the posterior distribution of T, we take a variational approach and approximate the posterior by a matrix-variate t distribution via an expanded model. We expand the model by adding matrix variates Θ, Φ and Ψ distributed as in Eq. (4). Since the marginal distribution, Eq. (5), is the same as the prior of T, we recover the original model by marginalizing out Θ, Φ and Ψ. However, instead of integrating them out, we use Θ, Φ and Ψ as parameters to approximate T's posterior distribution. The parameters are therefore estimated by minimizing\n\n−ln Pr(Y_I, Θ, Φ, Ψ) = −ln Pr(Θ, Φ, Ψ) − ln ∫ Pr(T | Θ, Φ, Ψ) Pr(Y_I | T) dT (14)\n\nover Θ, Φ and Ψ. The first term on the RHS of Eq. (14) can be written as\n\n−ln Pr(Θ, Φ, Ψ) = −ln Pr(Θ) − ln Pr(Φ | Θ) − ln Pr(Ψ | Θ, Φ)\n= (ν+q+n+p+m−1)/2 · ln |I_q + Θ Θ^⊤| + (ν+q+n+m−1)/2 · ln |I_m + Φ^⊤ A Φ| + (ν+q+n+p−1)/2 · ln |I_p + Ψ B Ψ^⊤| + const. (15)\n\nDue to the convexity of the negative logarithm, the second term on the RHS of Eq. (14) is bounded by\n\nℓ(Ψ B^{1/2} Θ^⊤ A^{1/2} Φ) + 1/(2σ²(ν+q+n−2)) Σ_{(i,j)∈I} (1 + [Ψ B Ψ^⊤]_{ii})(1 + [Φ^⊤ A Φ]_{jj}) + const., (16)\n\nbecause −ln Pr(Y_I | T) is quadratic with respect to T, so we only need integration against the mean and variance of T_{ij} under Pr(T | Θ, Φ, Ψ), which are given by Eqs. (10) and (11). 
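The mode-prediction loop of Section 3.3 can be sketched as follows. This is our own simplified full-matrix version (no low-rank factorization) of the majorize-minimize scheme: each iteration minimizes the quadratic surrogate Q(T; T0) of Eq. (13) exactly, which is possible column by column; all names and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
p, m, sigma2, nu = 6, 5, 0.1, 2.0
gamma = (nu + m + p - 1) / 2.0        # coefficient of the log-det term in Eq. (12)

Y = rng.standard_normal((p, m))
mask = rng.random((p, m)) < 0.7       # observed index set I

def J(T):
    # objective of Eq. (12), constants dropped
    loss = np.sum(mask * (T - Y) ** 2) / (2 * sigma2)
    return loss + gamma * np.linalg.slogdet(np.eye(p) + T @ T.T)[1]

T = np.zeros((p, m))
history = [J(T)]
for _ in range(20):
    # Surrogate Q(T; T0) = loss + gamma * tr(W T T^T), with W = (I + T0 T0^T)^{-1} (Theorem 2).
    W = np.linalg.inv(np.eye(p) + T @ T.T)
    T_new = np.empty_like(T)
    for j in range(m):                # Q is column-separable: one p-by-p SPD solve per column
        D = np.diag(mask[:, j] / sigma2)
        T_new[:, j] = np.linalg.solve(D + 2 * gamma * W, (mask[:, j] * Y[:, j]) / sigma2)
    T = T_new
    history.append(J(T))
```

Because each surrogate is minimized exactly, the objective values in history decrease monotonically, mirroring the convergence argument of Section 3.2.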
The parameter estimation not only reduces the loss (the ℓ(·) term) but also reduces the variance. This is why prediction by means usually outperforms prediction by modes.\n\nLet J be the sum of the right-hand sides of Eqs. (15) and (16), which can be considered an upper bound of Eq. (14) (ignoring constants). We estimate the parameters by minimizing J. Because A and B involve the inverses of quadratic terms of Θ, it is awkward to optimize Θ, Φ, Ψ directly. We therefore reparameterize J by U := Ψ B^{1/2}, V := Φ^⊤ A^{1/2}, and S := Θ^⊤. We can then easily apply the optimization method in Section 3.2 to find the optimal U, V and S. After estimating U, V and S, by Theorem 1 we can compute T̄ = M = U S V^⊤. The details are skipped because of the limited space.\n\n4 Related work\n\nMaximum Margin Matrix Factorization (MMMF) [9] is not in the framework of stochastic matrix analysis, but there are some similarities between MMMF and our mode estimation in Section 3.3. Using the trace norm of the matrix as regularization, MMMF overcomes the over-fitting problem in factorizing a matrix with missing values. From the regularization viewpoint, prediction by mode in an MVTM uses log-determinants as the regularization term in Eq. (12). The log-determinants encourage sparse predictive models.\n\nStochastic Relational Models (SRMs) [12] extend MTGPs by estimating the covariance matrices on each side. The covariance functions are estimated from the observations. By maximizing the marginalized likelihood, the estimated S and R reflect the information of the dependency structure, and the relationship can then be predicted with S and R. During the estimation of S and R, inverse-Wishart priors with parameters Σ and Ω are imposed on S and R, respectively. The MVTM differs from SRMs in integrating out the hyper-parameters rather than maximizing over them. 
As MacKay suggests [8], “one should integrate over as many variables as possible”.\n\nRobust Probabilistic Projections (RPP) [1] uses the Student-t distribution to extend PPCA by scaling each feature vector by an independent random variable. Written in matrix format, RPP is\n\nT ~ N_{p,m}(T; μ 1^⊤, WW^⊤, U),   U = diag{u_i},   u_i ~ IG(u_i | ν/2, ν/2),\n\nwhere IG is the inverse-Gamma distribution. Though RPP unties the scale factors between feature vectors, which could make the estimation more robust, it does not integrate out the covariance matrix, as we do in the MVTM. Moreover, inherited from PPCA, RPP implicitly uses an independence assumption between feature vectors. RPP also results in different models depending on which side we assume to be independent; it is therefore not suitable for matrix prediction.\n\n5 Experiments\n\nFigure 2: Experiments on synthetic data. RMSEs are shown in parentheses. (a) Original Matrix. (b) With Noise (0.32). (c) MMMF (0.27). (d) PPCA (0.26). (e) SRM (0.22). (f) MVTM mode (0.20). (g) MVTM mean (0.192). (h) MCMC (0.185).\n\nSynthetic data: We generate a 30 × 20 matrix (Figure 2(a)), then add noise with σ² = 0.1 (Figure 2(b)). 
The root mean squared noise is 0.32. We select 70% of the elements as observed data; the rest are held out for prediction. We apply MMMF [9], PPCA [11], MTGP [13], SRM [12], and our MVTM prediction-by-means and prediction-by-modes methods. The number of dimensions for the low-rank approximation is 10. We also apply an MCMC method to infer the matrix. The reconstructed matrices and the root mean squared errors of prediction on the unobserved elements (compared to the original matrix) are shown in Figures 2(c)-2(h), respectively. MTGP gives results similar to PPCA, so we do not show them.\n\nFigure 3: Singular values of the recovered matrices in descending order.\n\nThe MVTM favors sparse predictive models. To verify this, we depict the singular values of the MMMF method and the two MVTM prediction methods in Figure 3. Only two singular values of the MVTM prediction-by-means solution are non-zero. The singular values of the mode estimation decrease faster than the MMMF ones at the beginning, but more slowly after a threshold. This confirms that the log-determinants automatically determine the intrinsic rank of the matrices.\n\n       | user mean | movie mean | MMMF  | PPCA  | MVTM (mode) | MVTM (mean)\nRMSE   | 1.425     | 1.387      | 1.186 | 1.165 | 1.162       | 1.151\nMAE    | 1.141     | 1.103      | 0.943 | 0.915 | 0.898       | 0.887\n\nTable 1: RMSE (root mean squared error) and MAE (mean absolute error) of experiments on EachMovie data. All standard errors are 0.001 or less.\n\nEachMovie data: We test our algorithms on EachMovie [2]. The dataset contains 74,424 users’ 2,811,718 ratings of 1,648 movies; i.e., about 2.29% of the entries are rated, on a zero-to-five-star scale. 
We put all ratings into a matrix, and randomly select 80% as observed data to predict the remaining ratings. The random selection was carried out 10 times independently. We compare our approach with four other approaches: 1) USER MEAN, predicting a rating by the sample mean of the same user’s ratings; 2) MOVIE MEAN, predicting a rating by the sample mean of the ratings of the same movie; 3) MMMF [9]; 4) PPCA [11]. We do not have scalable implementations of the other approaches compared in the previous experiment. The number of dimensions is 10. The results are shown in Table 1. The two MVTM prediction methods outperform the other methods.\n\n6 Conclusions\n\nIn this paper we introduce matrix-variate t models for matrix prediction. The entire matrix is modeled as a sample drawn from a matrix-variate t distribution. An MVTM does not require an independence assumption over the elements. The implicit model selection of the MVTM encourages sparse models with lower ranks. To minimize the negative log-likelihood with its log-determinant terms, we propose an optimization method that sequentially minimizes a convex quadratic upper bound. The experiments show that the approach is accurate, efficient and scalable.\n\nReferences\n\n[1] C. Archambeau, N. Delannay, and M. Verleysen. Robust probabilistic projections. In ICML, 2006.\n\n[2] J. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In UAI-98, pages 43–52, 1998.\n\n[3] M. Fazel, H. Haitham, and S. P. Boyd. Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices. In Proceedings of the American Control Conference, 2003.\n\n[4] C. Fernandez and M. F. J. Steel. Multivariate Student-t regression models: Pitfalls and inference. Biometrika, 86(1):153–167, 1999.\n\n[5] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. 
Chapman & Hall/CRC, New York, 2nd edition, 2004.\n\n[6] A. K. Gupta and D. K. Nagar. Matrix Variate Distributions. Chapman & Hall/CRC, 2000.\n\n[7] N. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. J. Mach. Learn. Res., 6:1783–1816, 2005.\n\n[8] D. J. C. MacKay. Comparison of approximate methods for handling hyperparameters. Neural Comput., 11(5):1035–1068, 1999.\n\n[9] J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In ICML, 2005.\n\n[10] M. E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211–244, 2001.\n\n[11] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, B(61):611–622, 1999.\n\n[12] K. Yu, W. Chu, S. Yu, V. Tresp, and Z. Xu. Stochastic relational models for discriminative link prediction. In Advances in Neural Information Processing Systems 19 (NIPS), 2006.\n\n[13] K. Yu, V. Tresp, and A. Schwaighofer. Learning Gaussian processes from multiple tasks. In ICML, 2005.", "award": [], "sourceid": 896, "authors": [{"given_name": "Shenghuo", "family_name": "Zhu", "institution": null}, {"given_name": "Kai", "family_name": "Yu", "institution": null}, {"given_name": "Yihong", "family_name": "Gong", "institution": null}]}