{"title": "Stochastic Relational Models for Large-scale Dyadic Data using MCMC", "book": "Advances in Neural Information Processing Systems", "page_first": 1993, "page_last": 2000, "abstract": "Stochastic relational models provide a rich family of choices for learning and predicting dyadic data between two sets of entities. It generalizes matrix factorization to a supervised learning problem that utilizes attributes of objects in a hierarchical Bayesian framework. Previously empirical Bayesian inference was applied, which is however not scalable when the size of either object sets becomes tens of thousands. In this paper, we introduce a Markov chain Monte Carlo (MCMC) algorithm to scale the model to very large-scale dyadic data. Both superior scalability and predictive accuracy are demonstrated on a collaborative filtering problem, which involves tens of thousands users and a half million items.", "full_text": "Stochastic Relational Models for\n\nLarge-scale Dyadic Data using MCMC\n\nNEC Laboratories America, Cupertino, CA 95014, USA\n\nShenghuo Zhu\nYihong Gong\n{zsh, kyu, ygong}@sv.nec-labs.com\n\nKai Yu\n\nAbstract\n\nStochastic relational models (SRMs) [15] provide a rich family of choices for\nlearning and predicting dyadic data between two sets of entities. The models gen-\neralize matrix factorization to a supervised learning problem that utilizes attributes\nof entities in a hierarchical Bayesian framework. Previously variational Bayes in-\nference was applied for SRMs, which is, however, not scalable when the size of\neither entity set grows to tens of thousands. In this paper, we introduce a Markov\nchain Monte Carlo (MCMC) algorithm for equivalent models of SRMs in order to\nscale the computation to very large dyadic data sets. 
Both superior scalability and predictive accuracy are demonstrated on a collaborative filtering problem, which involves tens of thousands of users and half a million items.\n\n1 Stochastic Relational Models\n\nStochastic relational models (SRMs) [15] are generalizations of Gaussian process (GP) models [11] to the relational domain, where each observation is a dyadic datum, indexed by a pair of entities. They model dyadic data by a multiplicative interaction of two Gaussian process priors.\n\nLet U be the feature representation (or index) space of a set of entities. A pair-wise similarity in U is given by a kernel (covariance) function \u03a3 : U \u00d7 U \u2192 R. A Gaussian process (GP) defines a random function f : U \u2192 R, whose distribution is characterized by a mean function and the covariance function \u03a3, denoted by f \u223c N\u221e(0, \u03a3)1, where, for simplicity, we assume the mean to be the constant zero. A GP complies with the intuition regarding smoothness: if two entities ui and uj are similar according to \u03a3, then f(ui) and f(uj) are similar with high probability.\n\nA domain of dyadic data must involve another set of entities, which we represent (or index) by V. In a similar way, this entity set is associated with another kernel function \u2126. For example, in a typical collaborative filtering domain, U represents users while V represents items; then \u03a3 measures the similarity between users and \u2126 measures the similarity between items.\n\nAs the relation between a pair of entities from different sets, a dyadic variable y is indexed by the product space U \u00d7 V. An SRM aims to model y(u, v) by the following generative process.\n\nModel 1. The generative model of an SRM:\n1. Draw kernel functions \u03a3 \u223c IW\u221e(\u03b4, \u03a3\u25e6), and \u2126 \u223c IW\u221e(\u03b4, \u2126\u25e6);\n2. For k = 1, . . . 
, d: draw random functions fk \u223c N\u221e(0, \u03a3), and gk \u223c N\u221e(0, \u2126);\n3. For each pair (u, v): draw y(u, v) \u223c p(y(u, v)|z(u, v), \u03b3), where\n\nz(u, v) = (1/\u221ad) \u2211_{k=1}^{d} fk(u)gk(v) + b(u, v).\n\n1We denote an n-dimensional Gaussian distribution with a covariance matrix \u03a3 by Nn(0, \u03a3). Then N\u221e(0, \u03a3) explicitly indicates that a GP follows an \u201cinfinite dimensional\u201d Gaussian distribution.\n\nIn this model, IW\u221e(\u03b4, \u03a3\u25e6) and IW\u221e(\u03b4, \u2126\u25e6) are hyper priors, whose details will be introduced later. p(y|z, \u03b3) is the problem-specific noise model. For example, it can follow a Gaussian noise distribution y \u223c N1(z, \u03b3) if y is numerical, or a Bernoulli distribution if y is binary. The function b(u, v) is the bias function over U \u00d7 V. For simplicity, we assume b(u, v) = 0.\n\nIn the limit d \u2192 \u221e, the model converges to a special case where fk and gk can be analytically marginalized out and z becomes a Gaussian process z \u223c N\u221e(0, \u03a3 \u2297 \u2126) [15], with the covariance between pairs being a tensor kernel\n\nK((ui, vs), (uj, vt)) = \u03a3(ui, uj)\u2126(vs, vt).\n\nIn another special case, if \u03a3 and \u2126 are both fixed to be Dirac delta functions, and U, V are finite sets, it is easy to see that the model reduces to probabilistic matrix factorization.\n\nThe hyper prior IW\u221e(\u03b4, \u03a3\u25e6) is called an inverted Wishart process, which generalizes the finite n-dimensional inverted Wishart distribution [2]\n\nIW_n(\u03a3|\u03b4, \u03a3\u25e6) \u221d |\u03a3|^{-(\u03b4+2n)/2} etr(-\u03a3^{-1}\u03a3\u25e6/2),\n\nwhere \u03b4 is the degree-of-freedom parameter, and \u03a3\u25e6 is a positive definite kernel matrix. We note that the above definition is different from the popular formulation [3] or [4] in the machine learning community. 
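As a concrete companion to this definition (our own illustration, not part of the paper; the function name and the integer-degree-of-freedom restriction are assumptions), Dawid's IW_n(delta, Sigma0) density coincides with the more common inverse-Wishart convention when the degrees of freedom are taken as nu = delta + n - 1, so a draw can be sketched by inverting a Wishart sample:

```python
import numpy as np

def sample_dawid_iw(delta, Sigma0, rng):
    # Dawid's density |S|^{-(delta+2n)/2} etr(-S^{-1} Sigma0 / 2) equals the
    # common inverse-Wishart IW(nu, Sigma0) density with nu = delta + n - 1
    # degrees of freedom.  For integer nu >= n, draw W ~ Wishart_n(nu, Sigma0^{-1})
    # as a sum of Gaussian outer products and invert it.
    n = Sigma0.shape[0]
    nu = delta + n - 1
    assert nu >= n  # delta >= 1 keeps W invertible almost surely
    L = np.linalg.cholesky(np.linalg.inv(Sigma0))   # columns x ~ N(0, Sigma0^{-1})
    X = L @ rng.standard_normal((n, nu))            # nu Gaussian columns
    W = X @ X.T                                     # W ~ Wishart_n(nu, Sigma0^{-1})
    return np.linalg.inv(W)                         # Sigma ~ IW_n(delta, Sigma0)

rng = np.random.default_rng(0)
Sigma = sample_dawid_iw(3, np.eye(4), rng)          # one draw from IW_4(3, I)
```

Under this convention the mean of the draw is Sigma0/(delta - 2) for delta > 2, which makes the role of delta as a concentration parameter explicit.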
The advantage of this new notation is demonstrated by the following theorem [2].\n\nTheorem 1. Let A \u223c IW_m(\u03b4, K), with A and K positive definite, and let A and K be partitioned as\n\nA = (A11, A12; A21, A22), K = (K11, K12; K21, K22),\n\nwhere A11 and K11 are two n \u00d7 n submatrices, n < m. Then A11 \u223c IW_n(\u03b4, K11).\n\nThe new formulation of the inverted Wishart is consistent under marginalization. Therefore, similar to the way of deriving GPs from Gaussian distributions, we define a distribution of infinite-dimensional kernel functions, denoted by \u03a3 \u223c IW\u221e(\u03b4, \u03a3\u25e6), such that any sub kernel matrix of size m \u00d7 m follows \u03a3 \u223c IW_m(\u03b4, \u03a3\u25e6), where both \u03a3 and \u03a3\u25e6 are positive definite kernel functions. In the case where U and V are sets of entity indices, SRMs let \u03a3\u25e6 and \u2126\u25e6 both be Dirac delta functions, i.e., any of their sub kernel matrices is an identity matrix.\n\nSimilar to GP regression/classification, the major application of SRMs is supervised prediction based on observed relational values and input features of entities. Formally, let YI = {y(u, v)|(u, v) \u2208 I} be the set of noisy observations, where I \u2282 U \u00d7 V; the model aims to predict the noise-free values ZO = {z(u, v)|(u, v) \u2208 O} on O \u2282 U \u00d7 V. As our computation is always on a finite set containing both I and O, from now on we only consider the finite subset U0 \u00d7 V0, a finite support subset of U \u00d7 V that contains I \u222a O. 
Accordingly we let \u03a3 be the covariance matrix of \u03a3 on U0, and \u2126 be the covariance matrix of \u2126 on V0.\n\nPreviously a variational Bayesian method was applied to SRMs [15], which computes the maximum a posteriori estimates of \u03a3 and \u2126, given YI, and then predicts ZO based on the estimated \u03a3 and \u2126. There are two limitations of this empirical Bayesian approach: (1) the variational method is not a fully Bayesian treatment; ideally we wish to integrate out \u03a3 and \u2126; (2) more critically, the algorithm has complexity O(m^3 + n^3), with m = |U0| and n = |V0|, and is not scalable to a large relational domain where m or n exceeds several thousand. In this paper we introduce a fully Bayesian inference algorithm using Markov chain Monte Carlo sampling. By deriving equivalent sampling processes, we show that the algorithms can be applied to a dataset 10^3 times larger than in the previous work [15], and produce excellent accuracy.\n\nIn the rest of this paper, we present our algorithms for Bayesian inference of SRMs in Section 2. Some related work is discussed in Section 3, followed by experimental results of SRMs in Section 4. Section 5 concludes.\n\n2 Bayesian Models and MCMC Inference\n\nIn this paper, we tackle the scalability issue with a fully Bayesian paradigm. We estimate the expectation of ZO directly from YI using a Markov chain Monte Carlo (MCMC) algorithm (specifically, Gibbs sampling), instead of evaluating it from estimated \u03a3 or \u2126. Our contribution is in how to make the MCMC inference more efficient for large-scale data.\n\nWe first introduce some necessary notation here. Bold capital letters, e.g. X, indicate matrices. I(m) is an identity matrix of size m \u00d7 m. 
Nd, Nm,d, IW_m, \u03c7^-2 are the multivariate normal distribution, the matrix-variate normal distribution, the inverse-Wishart distribution, and the inverse chi-square distribution, respectively.\n\n2.1 Models with Non-informative Priors\n\nLet r = |I|, m = |U0| and n = |V0|. It is assumed that d \u226a min(m, n), and that the observed set, I, is sparse, i.e. r \u226a mn. First, we consider the case of \u03a3\u25e6 = \u03b1I(m) and \u2126\u25e6 = \u03b2I(n). Let {fk} on U0 be denoted by the matrix variate F of size m \u00d7 d, and {gk} on V0 by the matrix variate G of size n \u00d7 d. Then the generative model is written as Model 2 and depicted in Figure 1.\n\nModel 2. The generative model of a matrix-variate SRM:\n1. Draw \u03a3 \u223c IW_m(\u03b4, \u03b1I(m)) and \u2126 \u223c IW_n(\u03b4, \u03b2I(n));\n2. Draw F|\u03a3 \u223c Nm,d(0, \u03a3 \u2297 I(d)) and G|\u2126 \u223c Nn,d(0, \u2126 \u2297 I(d));\n3. Draw s^2 \u223c \u03c7^-2(\u03bd, \u03c3^2);\n4. Draw Y|F, G, s^2 \u223c Nm,n(Z, s^2 I(m) \u2297 I(n)), where Z = FG\u22a4,\n\nwhere Nm,d is the matrix-variate normal distribution of size m \u00d7 d, and \u03b1, \u03b2, \u03b4, \u03bd and \u03c3^2 are scalar parameters of the model. A slight difference between this finite model and Model 1 is that the coefficient 1/\u221ad is ignored for simplicity, because this coefficient can be absorbed by \u03b1 or \u03b2.\n\nAs we can explicitly compute Pr(\u03a3|F), Pr(\u2126|G), Pr(F|YI, G, \u03a3, s^2), Pr(G|YI, F, \u2126, s^2), and Pr(s^2|YI, F, G), we can apply a Gibbs sampling algorithm to compute ZO. However, the computational time complexity is at least O(m^3 + n^3), which is not practical for large-scale data.\n\nFigure 1: Model 2\n\n2.2 Gibbs Sampling Method\n\nTo overcome the inefficiency in sampling large covariance matrices, we rewrite the sampling process using the property of Theorem 2 to take advantage of d \u226a min(m, n).\n\nTheorem 2. If\n\n1. 
\u03a3 \u223c IW_m(\u03b4, \u03b1I(m)) and F|\u03a3 \u223c Nm,d(0, \u03a3 \u2297 I(d)),\n2. K \u223c IW_d(\u03b4, \u03b1I(d)) and H|K \u223c Nm,d(0, I(m) \u2297 K),\n\nthen the matrix variates F and H have the same distribution.\n\nFigure 2: Theorem 2\n\nProof sketch. Matrix variate F follows a matrix-variate t distribution, t(\u03b4, 0, \u03b1I(m), I(d)), which is written as\n\np(F) \u221d |I(m) + (\u03b1I(m))^{-1} F (I(d))^{-1} F\u22a4|^{-(\u03b4+m+d-1)/2} = |I(m) + \u03b1^{-1} FF\u22a4|^{-(\u03b4+m+d-1)/2}.\n\nMatrix variate H follows a matrix-variate t distribution, t(\u03b4, 0, I(m), \u03b1I(d)), which can be written as\n\np(H) \u221d |I(m) + (I(m))^{-1} H (\u03b1I(d))^{-1} H\u22a4|^{-(\u03b4+m+d-1)/2} = |I(m) + \u03b1^{-1} HH\u22a4|^{-(\u03b4+m+d-1)/2}.\n\nThus the matrix variates F and H have the same distribution.\n\nThis theorem allows us to sample a smaller covariance matrix K of size d \u00d7 d on the column side instead of sampling a large covariance matrix \u03a3 of size m \u00d7 m on the row side. The translation is depicted in Figure 2. This theorem applies to G as well, thus we rewrite the model as Model 3 (or Figure 3). A similar idea was used in our previous work [16].\n\nModel 3. The alternative generative model of a matrix-variate SRM:\n1. Draw K \u223c IW_d(\u03b4, \u03b1I(d)) and R \u223c IW_d(\u03b4, \u03b2I(d));\n2. Draw F|K \u223c Nm,d(0, I(m) \u2297 K), and G|R \u223c Nn,d(0, I(n) \u2297 R);\n3. Draw s^2 \u223c \u03c7^-2(\u03bd, \u03c3^2);\n4. Draw Y|F, G, s^2 \u223c Nm,n(Z, s^2 I(m) \u2297 I(n)), where Z = FG\u22a4.\n\nLet the column vector f_i be the i-th row of matrix F, and the column vector g_j be the j-th row of matrix G. In Model 3, {f_i} are independent given K, G and s^2. 
Similar independence applies to {g_j} as well. The conditional posterior distributions of K, R, {f_i}, {g_j} and s^2 can be easily computed, thus the Gibbs sampling for SRM is named BSRM (for Bayesian SRM). We use Gibbs sampling to compute the mean of ZO, which is derived from the samples of FG\u22a4. Because of the sparsity of I, each iteration in this sampling algorithm can be computed in O(d^2 r + d^3(m + n)) time complexity2, which is a dramatic reduction from the previous time complexity O(m^3 + n^3).\n\nFigure 3: Model 3\n\n2.3 Models with Informative Priors\n\nAn important characteristic of SRMs is that they allow the inclusion of certain prior knowledge of entities into the model. Specifically, the prior information is encoded as the prior covariance parameters, i.e. \u03a3\u25e6 and \u2126\u25e6. In the general case, it is difficult to run the sampling process due to the size of \u03a3\u25e6 and \u2126\u25e6. We assume that \u03a3\u25e6 and \u2126\u25e6 have a special form, i.e. \u03a3\u25e6 = F\u25e6(F\u25e6)\u22a4 + \u03b1I(m), where F\u25e6 is an m \u00d7 p matrix, and \u2126\u25e6 = G\u25e6(G\u25e6)\u22a4 + \u03b2I(n), where G\u25e6 is an n \u00d7 q matrix, and the magnitude of p and q is about the same as or less than that of d. This prior knowledge can be obtained from some additional features of entities.\n\nAlthough such an informative \u03a3\u25e6 prevents us from directly sampling each row of F independently, as we do in Model 3, we can expand the matrix F of size m \u00d7 d to (F, F\u25e6) of size m \u00d7 (d + p), and derive an equivalent model, where rows of F are conditionally independent given F\u25e6. Figure 4 illustrates this transformation.\n\nTheorem 3. Let \u03b4 > p and \u03a3\u25e6 = F\u25e6(F\u25e6)\u22a4 + \u03b1I(m), where F\u25e6 is an m \u00d7 p matrix. If\n\n1. 
\u03a3 \u223c IW_m(\u03b4, \u03a3\u25e6) and F|\u03a3 \u223c Nm,d(0, \u03a3 \u2297 I(d)),\n2. K = (K11, K12; K21, K22) \u223c IW_{d+p}(\u03b4 - p, \u03b1I(d+p)) and H|K \u223c Nm,d(F\u25e6 K22^{-1} K21, I(m) \u2297 K11\u00b72), where K11\u00b72 = K11 - K12 K22^{-1} K21,\n\nthen F and H have the same distribution.\n\nFigure 4: Theorem 3\n\nProof sketch. Consider the distribution\n\n(H1, H2)|K \u223c Nm,d+p(0, I(m) \u2297 K).   (1)\n\nBecause H1|H2 \u223c Nm,d(H2 K22^{-1} K21, I(m) \u2297 K11\u00b72), we have p(H) = p(H1|H2 = F\u25e6). On the other hand, we have a matrix-variate t distribution, (H1, H2) \u223c tm,d+p(\u03b4 - p, 0, \u03b1I(m), I(d+p)). By Theorem 4.3.9 in [4], we have H1|H2 \u223c tm,d(\u03b4, 0, \u03b1I(m) + H2 H2\u22a4, I(d)) = tm,d(\u03b4, 0, \u03a3\u25e6, I(d)), which implies p(F) = p(H1|H2 = F\u25e6) = p(H).\n\n2|Y - FG\u22a4|^2_I can be efficiently computed in O(dr) time.\n\nThe following corollary allows us to compute the posterior distribution of K efficiently.\n\nCorollary 4. K|H \u223c IW_{d+p}(\u03b4 + m, \u03b1I(d+p) + (H, F\u25e6)\u22a4(H, F\u25e6)).\n\nProof sketch. Because the normal distribution and the inverse-Wishart distribution are conjugate, we can derive the posterior distribution of K from Eq. (1).\n\nThus, we can explicitly sample from the conditional posterior distributions, as listed in Algorithm 1 (BSRM/F, for BSRM with features) in the Appendix. We note that when p = q = 0, Algorithm 1 (BSRM/F) reduces to the exact algorithm for BSRM. Each iteration in this sampling algorithm can be computed in O(d^2 r + d^3(m + n) + dpm + dqn) time complexity.\n\n2.4 Unblocking for Sampling Implementation\n\nThe blocking Gibbs sampling technique is commonly used to improve sampling efficiency by reducing the sample variance according to the Rao-Blackwell theorem (c.f. [9]). 
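For concreteness, the blocked (row-wise) draw discussed here, i.e. Step 2 of Algorithm 1 in the Appendix specialized to the no-feature case p = 0, can be sketched as follows; this is our own illustrative numpy sketch with hypothetical names, not the authors' implementation:

```python
import numpy as np

def sample_row_f(y_i, idx_i, G, K_inv, s2, rng):
    # Blocked Gibbs update for one row f_i of F in Model 3 (p = 0).
    # Conjugate Gaussian posterior:
    #   C  = (s^{-2} G_{Ii}^T G_{Ii} + K^{-1})^{-1}
    #   mu = C (s^{-2} G_{Ii}^T y_{i,Ii})
    G_i = G[idx_i]                                  # r_i x d rows of rated items
    C = np.linalg.inv(G_i.T @ G_i / s2 + K_inv)     # d x d posterior covariance
    mu = C @ (G_i.T @ y_i / s2)                     # d-vector posterior mean
    return mu + np.linalg.cholesky(C) @ rng.standard_normal(mu.shape[0])

rng = np.random.default_rng(1)
d, n = 3, 50
G = rng.standard_normal((n, d))                     # item-side factors
f_i = sample_row_f(rng.standard_normal(10), np.arange(10), G, np.eye(d), 1.0, rng)
```

Each such draw costs O(r_i d^2 + d^3), and summing over all rows of F and G recovers the O(d^2 r + d^3(m + n)) per-iteration cost quoted earlier; Step 4 of Algorithm 2 instead replaces this d-dimensional draw by d univariate draws.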
However, blocking Gibbs sampling is not necessarily computationally efficient. To improve the computational efficiency of Algorithm 1, we use unblocked sampling to reduce the major computational cost, which lies in Step 2 and Step 4. We consider sampling each element of F conditionally. The sampling process is written as Step 4 and Step 9 of Algorithm 2, which is called BSRM/F with conditional Gibbs sampling. We can reduce the computational cost of each iteration to O(dr + d^2(m + n) + dpm + dqn), which is comparable to other low-rank matrix factorization approaches. Though such a conditional sampling process increases the sample variance compared with Algorithm 1, we can afford more samples within a given amount of time due to its faster speed. Our experiments show that the overall computational cost of Algorithm 2 is usually less than that of Algorithm 1 when achieving the same accuracy. Additionally, since the {f_i} are independent, we can parallelize the for loops in Step 4 and Step 9 of Algorithm 2.\n\n3 Related Work\n\nSRMs fall into a class of statistical latent-variable relational models that explain relations by latent factors of entities. Recently a number of such models have been proposed, which can be roughly put into two groups, depending on whether the latent factors are continuous or discrete: (1) discrete latent-state relational models: a large body of research infers latent classes of entities and explains the entity relationship by the probability conditioned on the joint state of the participating entities, e.g., [6, 14, 7, 1]; in another work [10], binary latent factors are modeled; (2) continuous latent-variable relational models: many such models assume relational data are underlain by multiplicative effects between latent variables of entities, e.g. [5]. 
A simple example is matrix factorization, which has recently become very popular in collaborative filtering applications, e.g., [12, 8, 13].\n\nThe latest Bayesian probabilistic matrix factorization [13] reported state-of-the-art accuracy of matrix factorization on Netflix data. Interestingly, the model turns out to be similar to our Model 3 under the non-informative prior. This paper reveals the equivalence between different models and offers a more general Bayesian framework that allows informative priors from entity features to play a role. The framework also generalizes Gaussian processes [11] to a relational domain, where a nonparametric prior for stochastic relational processes is described.\n\n4 Experiments\n\nSynthetic data: We compare BSRM under noninformative priors against two other algorithms: the fast max-margin matrix factorization (fMMMF) in [12] with a square loss, and SRM using a variational Bayesian approach (SRM-VB) in [15]. We generate a 30 \u00d7 20 random matrix (Figure 5(a)), then add Gaussian noise with \u03c3^2 = 0.1 (Figure 5(b)). The root mean squared noise is 0.32. We select 70% of the elements as the observed data and use the rest of the elements for testing. The reconstructed matrices and root mean squared errors (RMSEs) of predictions on the test elements are shown in Figures 5(c)-5(e). BSRM outperforms the variational approach of SRMs and fMMMF. Note that because of the log-determinant penalty of the inverse Wishart prior, SRM-VB enforces the rank to be smaller, thus the result of SRM-VB looks smoother than that of BSRM.\n\n(a) Original Matrix (b) With Noise (0.32) (c) fMMMF (0.27) (d) SRM-VB (0.22) (e) BSRM (0.19)\n\nFigure 5: Experiments on synthetic data. 
RMSEs are shown in parentheses.\n\n        User Mean   Movie Mean   fMMMF [12]   VB [8]\nRMSE    1.425       1.387        1.165        1.186\nMAE     1.141       1.103        0.915        0.943\n\nTable 1: RMSE (root mean squared error) and MAE (mean absolute error) of the experiments on EachMovie data. All standard errors are 0.001 or less.\n\nEachMovie data: We test the accuracy and the efficiency of our algorithms on EachMovie. The dataset contains 74,424 users\u2019 2,811,718 ratings on 1,648 movies, i.e. about 2.29% of the user-movie pairs are rated, on a zero-to-five star scale. We put all the ratings into a matrix, and randomly select 80% as observed data to predict the remaining ratings. The random selection was carried out 10 times independently. We compare our approach against several competing methods: 1) User Mean, predicting ratings by the sample mean of the same user\u2019s ratings; 2) Movie Mean, predicting ratings by the sample mean of ratings on the same movie; 3) fMMMF [12]; 4) VB, introduced in [8], which is a probabilistic low-rank matrix factorization using variational approximation. Because of the data size, we cannot run the SRM-VB of [15]. We test the algorithms BSRM and BSRM/F, both following Algorithm 2, which run without and with features, respectively. The features used in BSRM/F are generated from the PCA result of the binary indicator matrix that indicates whether the user rates the movie. The top 10 factors of both the user side and the movie side are used as features, i.e. p = 10, q = 10. We run the experiments with different d = 16, 32, 100, 200, 300. The hyperparameters are set to some trivial values: \u03b4 = p + 1 = 11, \u03b1 = \u03b2 = 1, \u03c3^2 = 1, and \u03bd = 1. The results are shown in Tables 1 and 2. We find that the accuracy improves as d is increased. 
Once d is greater than 100, the further improvement is not very significant within a reasonable amount of running time.\n\nrank (d)         16       32       100      200      300\nBSRM    RMSE     1.0983   1.0924   1.0905   1.0903   1.0902\n        MAE      0.8411   0.8321   0.8335   0.8340   0.8393\nBSRM/F  RMSE     1.0952   1.0872   1.0848   1.0846   1.0852\n        MAE      0.8311   0.8280   0.8289   0.8293   0.8292\n\nTable 2: RMSE (root mean squared error) and MAE (mean absolute error) of experiments on EachMovie data. All standard errors are 0.001 or less.\n\nTo compare the overall computational efficiency of the two Gibbs sampling procedures, Algorithm 1 and Algorithm 2, we run both algorithms and record the running time and accuracy in RMSE. The dimensionality d is set to 100. We compute the average ZO and evaluate it after a certain number of iterations. The evaluation results are shown in Figure 6. We run both algorithms for 100 iterations as the burn-in period, so that we can have an independent starting sample. After the burn-in period, we restart the computation of the averaged ZO and evaluate it; therefore there are abrupt points at 100 iterations in both cases. The results show that the overall accuracy of Algorithm 2 is better at any given time.\n\nFigure 6: Time-Accuracy of Algorithm 1 and 2\n\nNetflix data: We also test the algorithms on the large collection of user ratings from netflix.com. The dataset consists of 100,480,507 ratings from 480,189 users on 17,770 movies. In addition, Netflix also provides a set of validation data with 1,408,395 ratings. 
To evaluate the prediction accuracy, there is a test set containing 2,817,131 ratings whose values are withheld and unknown to all participants.\n\nThe features used in BSRM/F are generated from the PCA result of a binary matrix that indicates whether or not the user rated the movie. The top 30 user-side factors are used as features, and no movie-side factors are used, i.e. p = 30, q = 0. The hyperparameters are set to some trivial values: \u03b4 = p + 1 = 31, \u03b1 = \u03b2 = 1, \u03c3^2 = 1, and \u03bd = 1. The results on the validation data are shown in Table 3. The submitted result of BSRM/F(400) achieves RMSE 0.8881 on the test set. The running time is around 21 minutes per iteration for 400 latent dimensions on an Intel Xeon 2GHz PC.\n\n        VB [8]   BPMF [13]   BSRM                       BSRM/F\n                             100      200      400      100      200      400\nRMSE    0.9141   0.8920      0.8930   0.8910   0.8895   0.8926   0.8880   0.8874\n\nTable 3: RMSE (root mean squared error) of experiments on Netflix data.\n\n5 Conclusions\n\nIn this paper, we study fully Bayesian inference for stochastic relational models (SRMs), for learning real-valued relations between entities of two sets. We overcome the scalability issue by transforming SRMs into equivalent models, which can be efficiently sampled. The experiments show that the fully Bayesian inference outperforms the previously used variational Bayesian inference on SRMs. In addition, some techniques for efficient computation in this paper can be applied to other large-scale Bayesian inference problems, especially for models involving inverse-Wishart distributions.\n\nAcknowledgment: We thank the reviewers and Sarah Tyler for constructive comments.\n\nReferences\n\n[1] E. Airoldi, D. Blei, S. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 2008.\n\n[2] A. P. Dawid. 
Some matrix-variate distribution theory: notational considerations and a Bayesian application. Biometrika, 68:265\u2013274, 1981.\n\n[3] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall, New York, 1995.\n\n[4] A. K. Gupta and D. K. Nagar. Matrix Variate Distributions. Chapman & Hall/CRC, 2000.\n\n[5] P. Hoff. Multiplicative latent factor models for description and prediction of social networks. Computational and Mathematical Organization Theory, 2007.\n\n[6] T. Hofmann. Latent semantic models for collaborative filtering. ACM Trans. Inf. Syst., 22(1):89\u2013115, 2004.\n\n[7] C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI), 2006.\n\n[8] Y. J. Lim and Y. W. Teh. Variational Bayesian approach to movie rating prediction. In Proceedings of KDD Cup and Workshop, 2007.\n\n[9] J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer, 2001.\n\n[10] E. Meeds, Z. Ghahramani, R. Neal, and S. T. Roweis. Modeling dyadic data with binary latent factors. In Advances in Neural Information Processing Systems 19, 2007.\n\n[11] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.\n\n[12] J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In ICML, 2005.\n\n[13] R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In The 25th International Conference on Machine Learning, 2008.\n\n[14] Z. Xu, V. Tresp, K. Yu, and H.-P. Kriegel. Infinite hidden relational models. In Proceedings of the 22nd International Conference on Uncertainty in Artificial Intelligence (UAI), 2006.\n\n[15] K. Yu, W. Chu, S. Yu, V. Tresp, and Z. Xu. 
Stochastic relational models for discriminative link prediction. In Advances in Neural Information Processing Systems 19 (NIPS), 2006.\n\n[16] S. Zhu, K. Yu, and Y. Gong. Predictive matrix-variate t models. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, NIPS \u201907: Advances in Neural Information Processing Systems 20, pages 1721\u20131728. MIT Press, Cambridge, MA, 2008.\n\nAppendix\n\nBefore presenting the algorithms, we introduce the necessary notation. Let Ii = {j|(i, j) \u2208 I} and Ij = {i|(i, j) \u2208 I}. A matrix with subscripts indicates its submatrix, which consists of its entries at the given indices in the subscripts; for example, X_{Ij,j} is a subvector of the j-th column of X whose row indices are in the set Ij, and X_{\u00b7,j} is the j-th column of X (\u00b7 indicates the full set). X_{i,j} denotes the (i, j)-th entry of X. |X|^2_I is the squared sum of the elements in set I, i.e. \u2211_{(i,j)\u2208I} X_{i,j}^2. We fill the unobserved elements in Y with 0 for simplicity of notation.\n\nAlgorithm 1 BSRM/F: Gibbs sampling for SRM with features\n1: Draw K \u223c IW_{d+p}(\u03b4 + m, \u03b1I(d+p) + (F, F\u25e6)\u22a4(F, F\u25e6));\n2: For each i \u2208 U0, draw f_i \u223c N_d(K(i)(s^{-2} G\u22a4 (Y_{i,\u00b7})\u22a4 + K11\u00b72^{-1} K12 K22^{-1} f\u25e6_i), K(i)), where K(i) = (s^{-2} (G_{Ii,\u00b7})\u22a4 G_{Ii,\u00b7} + K11\u00b72^{-1})^{-1};\n3: Draw R \u223c IW_{d+q}(\u03b4 + n, \u03b2I(d+q) + (G, G\u25e6)\u22a4(G, G\u25e6));\n4: For each j \u2208 V0, draw g_j \u223c N_d(R(j)(s^{-2} F\u22a4 Y_{\u00b7,j} + R11\u00b72^{-1} R12 R22^{-1} g\u25e6_j), R(j)), where R(j) = (s^{-2} (F_{Ij,\u00b7})\u22a4 F_{Ij,\u00b7} + R11\u00b72^{-1})^{-1};\n5: Draw s^2 \u223c \u03c7^{-2}(\u03bd + r, \u03c3^2 + |Y - FG\u22a4|^2_I).\n\nAlgorithm 2 BSRM/F: Conditional Gibbs sampling for SRM with features\n1: \u2206_{i,j} \u2190 Y_{i,j} - \u2211_k F_{i,k} G_{j,k}, for (i, j) \u2208 I;\n2: Draw \u03a6 \u223c W_{d+p}(\u03b4 + m + d + p - 1, (\u03b1I(d+p) + (F, F\u25e6)\u22a4(F, F\u25e6))^{-1});\n3: for each (i, k) \u2208 U0 \u00d7 {1, . . . , d} do\n4:   Draw f \u223c N_1(\u03c6^{-1}(s^{-2} \u2206_{i,Ii} G_{Ii,k} - F_{i,\u00b7} \u03a6_{\u00b7,k}), \u03c6^{-1}), where \u03c6 = s^{-2} (G_{Ii,k})\u22a4 G_{Ii,k} + \u03a6_{k,k};\n5:   Update F_{i,k} \u2190 F_{i,k} + f, and \u2206_{i,j} \u2190 \u2206_{i,j} - f G_{j,k}, for j \u2208 Ii;\n6: end for\n7: Draw \u03a8 \u223c W_{d+q}(\u03b4 + n + d + q - 1, (\u03b2I(d+q) + (G, G\u25e6)\u22a4(G, G\u25e6))^{-1});\n8: for each (j, k) \u2208 V0 \u00d7 {1, . . . , d} do\n9:   Draw g \u223c N_1(\u03c8^{-1}(s^{-2} (\u2206_{Ij,j})\u22a4 F_{Ij,k} - G_{j,\u00b7} \u03a8_{\u00b7,k}), \u03c8^{-1}), where \u03c8 = s^{-2} (F_{Ij,k})\u22a4 F_{Ij,k} + \u03a8_{k,k};\n10:  Update G_{j,k} \u2190 G_{j,k} + g, and \u2206_{i,j} \u2190 \u2206_{i,j} - g F_{i,k}, for i \u2208 Ij;\n11: end for\n12: Draw s^2 \u223c \u03c7^{-2}(\u03bd + r, \u03c3^2 + |\u2206|^2_I).", "award": [], "sourceid": 664, "authors": [{"given_name": "Shenghuo", "family_name": "Zhu", "institution": null}, {"given_name": "Kai", "family_name": "Yu", "institution": null}, {"given_name": "Yihong", "family_name": "Gong", "institution": null}]}