{"title": "Probabilistic Matrix Factorization", "book": "Advances in Neural Information Processing Systems", "page_first": 1257, "page_last": 1264, "abstract": null, "full_text": "Probabilistic Matrix Factorization\n\nRuslan Salakhutdinov and Andriy Mnih\n\nDepartment of Computer Science, University of Toronto\n\n6 King\u2019s College Rd, M5S 3G4, Canada\n\n{rsalakhu,amnih}@cs.toronto.edu\n\nAbstract\n\nMany existing approaches to collaborative \ufb01ltering can neither handle very large\ndatasets nor easily deal with users who have very few ratings. In this paper we\npresent the Probabilistic Matrix Factorization (PMF) model which scales linearly\nwith the number of observations and, more importantly, performs well on the\nlarge, sparse, and very imbalanced Net\ufb02ix dataset. We further extend the PMF\nmodel to include an adaptive prior on the model parameters and show how the\nmodel capacity can be controlled automatically. Finally, we introduce a con-\nstrained version of the PMF model that is based on the assumption that users who\nhave rated similar sets of movies are likely to have similar preferences. The result-\ning model is able to generalize considerably better for users with very few ratings.\nWhen the predictions of multiple PMF models are linearly combined with the\npredictions of Restricted Boltzmann Machines models, we achieve an error rate\nof 0.8861, that is nearly 7% better than the score of Net\ufb02ix\u2019s own system.\n\n1 Introduction\n\nOne of the most popular approaches to collaborative \ufb01ltering is based on low-dimensional factor\nmodels. The idea behind such models is that attitudes or preferences of a user are determined by\na small number of unobserved factors. In a linear factor model, a user\u2019s preferences are modeled\nby linearly combining item factor vectors using user-speci\ufb01c coef\ufb01cients. 
For example, for N users\nand M movies, the N \u00d7 M preference matrix R is given by the product of an N \u00d7 D user coef\ufb01cient\nmatrix U T and a D \u00d7 M factor matrix V [7]. Training such a model amounts to \ufb01nding the best\nrank-D approximation to the observed N \u00d7 M target matrix R under the given loss function.\nA variety of probabilistic factor-based models has been proposed recently [2, 3, 4]. All these models\ncan be viewed as graphical models in which hidden factor variables have directed connections to\nvariables that represent user ratings. The major drawback of such models is that exact inference is\nintractable [12], which means that potentially slow or inaccurate approximations are required for\ncomputing the posterior distribution over hidden factors in such models.\n\nLow-rank approximations based on minimizing the sum-squared distance can be found using Sin-\ngular Value Decomposition (SVD). SVD \ufb01nds the matrix \u02c6R = U T V of the given rank which min-\nimizes the sum-squared distance to the target matrix R. Since most real-world datasets are sparse,\nmost entries in R will be missing. In those cases, the sum-squared distance is computed only for\nthe observed entries of the target matrix R. As shown by [9], this seemingly minor modi\ufb01cation\nresults in a dif\ufb01cult non-convex optimization problem which cannot be solved using standard SVD\nimplementations.\n\nInstead of constraining the rank of the approximation matrix \u02c6R = U T V , i.e. the number of factors,\n[10] proposed penalizing the norms of U and V . 
Learning in this model, however, requires solving a sparse semi-definite program (SDP), making this approach infeasible for datasets containing millions of observations.\n\nFigure 1: The left panel shows the graphical model for Probabilistic Matrix Factorization (PMF). The right panel shows the graphical model for constrained PMF.\n\nMany of the collaborative filtering algorithms mentioned above have been applied to modelling user ratings on the Netflix Prize dataset that contains 480,189 users, 17,770 movies, and over 100 million observations (user/movie/rating triples). However, none of these methods have proved to be particularly successful for two reasons. First, none of the above-mentioned approaches, except for the matrix-factorization-based ones, scale well to large datasets. Second, most of the existing algorithms have trouble making accurate predictions for users who have very few ratings. A common practice in the collaborative filtering community is to remove all users with fewer than some minimal number of ratings. Consequently, the results reported on the standard datasets, such as MovieLens and EachMovie, seem impressive because the most difficult cases have been removed. For example, the Netflix dataset is very imbalanced, with \u201cinfrequent\u201d users rating fewer than 5 movies, while \u201cfrequent\u201d users rate over 10,000 movies. 
However, since the standardized test set includes\nthe complete range of users, the Net\ufb02ix dataset provides a much more realistic and useful benchmark\nfor collaborative \ufb01ltering algorithms.\n\nThe goal of this paper is to present probabilistic algorithms that scale linearly with the number of\nobservations and perform well on very sparse and imbalanced datasets, such as the Net\ufb02ix dataset.\nIn Section 2 we present the Probabilistic Matrix Factorization (PMF) model that models the user\npreference matrix as a product of two lower-rank user and movie matrices. In Section 3, we extend\nthe PMF model to include adaptive priors over the movie and user feature vectors and show how\nthese priors can be used to control model complexity automatically. In Section 4 we introduce a\nconstrained version of the PMF model that is based on the assumption that users who rate similar\nsets of movies have similar preferences. In Section 5 we report the experimental results that show\nthat PMF considerably outperforms standard SVD models. We also show that constrained PMF and\nPMF with learnable priors improve model performance signi\ufb01cantly. Our results demonstrate that\nconstrained PMF is especially effective at making better predictions for users with few ratings.\n\n2 Probabilistic Matrix Factorization (PMF)\n\nSuppose we have M movies, N users, and integer rating values from 1 to K 1. Let Rij represent\nthe rating of user i for movie j, U \u2208 RD\u00d7N and V \u2208 RD\u00d7M be latent user and movie feature\nmatrices, with column vectors Ui and Vj representing user-speci\ufb01c and movie-speci\ufb01c latent feature\nvectors respectively. Since model performance is measured by computing the root mean squared\nerror (RMSE) on the test set we \ufb01rst adopt a probabilistic linear model with Gaussian observation\nnoise (see \ufb01g. 1, left panel). 
We define the conditional distribution over the observed ratings as\n\np(R | U, V, \u03c3^2) = \u220f_{i=1}^{N} \u220f_{j=1}^{M} [ N(Rij | Ui^T Vj, \u03c3^2) ]^{Iij},   (1)\n\nwhere N(x | \u00b5, \u03c3^2) is the probability density function of the Gaussian distribution with mean \u00b5 and variance \u03c3^2, and Iij is the indicator function that is equal to 1 if user i rated movie j and equal to 0 otherwise.^1 We also place zero-mean spherical Gaussian priors [1, 11] on user and movie feature vectors:\n\np(U | \u03c3U^2) = \u220f_{i=1}^{N} N(Ui | 0, \u03c3U^2 I),   p(V | \u03c3V^2) = \u220f_{j=1}^{M} N(Vj | 0, \u03c3V^2 I).   (2)\n\n^1 Real-valued ratings can be handled just as easily by the models described in this paper.\n\nThe log of the posterior distribution over the user and movie features is given by\n\nln p(U, V | R, \u03c3^2, \u03c3V^2, \u03c3U^2) = \u2212(1/(2\u03c3^2)) \u2211_{i=1}^{N} \u2211_{j=1}^{M} Iij (Rij \u2212 Ui^T Vj)^2 \u2212 (1/(2\u03c3U^2)) \u2211_{i=1}^{N} Ui^T Ui \u2212 (1/(2\u03c3V^2)) \u2211_{j=1}^{M} Vj^T Vj \u2212 (1/2) ( (\u2211_{i=1}^{N} \u2211_{j=1}^{M} Iij) ln \u03c3^2 + N D ln \u03c3U^2 + M D ln \u03c3V^2 ) + C,   (3)\n\nwhere C is a constant that does not depend on the parameters. Maximizing the log-posterior over movie and user features with hyperparameters (i.e. the observation noise variance and prior variances) kept fixed is equivalent to minimizing the sum-of-squared-errors objective function with quadratic regularization terms:\n\nE = (1/2) \u2211_{i=1}^{N} \u2211_{j=1}^{M} Iij (Rij \u2212 Ui^T Vj)^2 + (\u03bbU/2) \u2211_{i=1}^{N} ||Ui||_Fro^2 + (\u03bbV/2) \u2211_{j=1}^{M} ||Vj||_Fro^2,   (4)\n\nwhere \u03bbU = \u03c3^2/\u03c3U^2, \u03bbV = \u03c3^2/\u03c3V^2, and ||\u00b7||_Fro denotes the Frobenius norm. A local minimum of the objective function given by Eq. 4 can be found by performing gradient descent in U and V. Note that this model can be viewed as a probabilistic extension of the SVD model, since if all ratings have been observed, the objective given by Eq. 4 reduces to the SVD objective in the limit of prior variances going to infinity.\n\nIn our experiments, instead of using a simple linear-Gaussian model, which can make predictions outside of the range of valid rating values, the dot product between user- and movie-specific feature vectors is passed through the logistic function g(x) = 1/(1 + exp(\u2212x)), which bounds the range of predictions:\n\np(R | U, V, \u03c3^2) = \u220f_{i=1}^{N} \u220f_{j=1}^{M} [ N(Rij | g(Ui^T Vj), \u03c3^2) ]^{Iij}.   (5)\n\nWe map the ratings 1, ..., K to the interval [0, 1] using the function t(x) = (x \u2212 1)/(K \u2212 1), so that the range of valid rating values matches the range of predictions our model makes. Minimizing the objective function given above using steepest descent takes time linear in the number of observations. A simple implementation of this algorithm in Matlab allows us to make one sweep through the entire Netflix dataset in less than an hour when the model being trained has 30 factors.\n\n3 Automatic Complexity Control for PMF Models\n\nCapacity control is essential to making PMF models generalize well. Given sufficiently many factors, a PMF model can approximate any given matrix arbitrarily well. The simplest way to control the capacity of a PMF model is by changing the dimensionality of feature vectors. However, when the dataset is unbalanced, i.e. the number of observations differs significantly among different rows or columns, this approach fails, since any single number of feature dimensions will be too high for some feature vectors and too low for others. 
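As a concrete illustration of the gradient training described in Section 2 (Eqs. 4 and 5), a single stochastic gradient sweep might look as follows in NumPy. This is a sketch, not the authors' Matlab implementation; the function names, matrix shapes, and learning-rate value are our own illustrative assumptions:

```python
import numpy as np

def t(x, K):
    """Map ratings 1..K to [0, 1], as in the paper: t(x) = (x - 1)/(K - 1)."""
    return (x - 1.0) / (K - 1.0)

def pmf_sgd_step(R_triples, U, V, lam_u, lam_v, lr):
    """One stochastic-gradient sweep over observed (user, movie, rating)
    triples for logistic PMF. U is D x N (user features), V is D x M
    (movie features); ratings are assumed already mapped to [0, 1] via t.
    Each step descends the per-observation term of the regularized
    squared-error objective (Eq. 4) with the logistic link of Eq. 5."""
    for i, j, r in R_triples:
        x = U[:, i] @ V[:, j]
        p = 1.0 / (1.0 + np.exp(-x))       # g(Ui^T Vj)
        err = (r - p) * p * (1.0 - p)      # d/dx of 0.5*(r - g(x))^2, negated
        grad_u = -err * V[:, j] + lam_u * U[:, i]
        grad_v = -err * U[:, i] + lam_v * V[:, j]
        U[:, i] -= lr * grad_u
        V[:, j] -= lr * grad_v
    return U, V
```

A full training run would simply repeat such sweeps, using the minibatching and momentum described in Section 5.2.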
Regularization parameters such as \u03bbU and \u03bbV defined above provide a more flexible approach to regularization. Perhaps the simplest way to find suitable values for these parameters is to consider a set of reasonable parameter values, train a model for each setting of the parameters in the set, and choose the model that performs best on the validation set. The main drawback of this approach is that it is computationally expensive, since instead of training a single model we have to train a multitude of models. We will show that the method proposed by [6], originally applied to neural networks, can be used to determine suitable values for the regularization parameters of a PMF model automatically without significantly affecting the time needed to train the model.\n\nAs shown above, the problem of approximating a matrix in the L2 sense by a product of two low-rank matrices that are regularized by penalizing their Frobenius norm can be viewed as MAP estimation in a probabilistic model with spherical Gaussian priors on the rows of the low-rank matrices. The complexity of the model is controlled by the hyperparameters: the noise variance \u03c3^2 and the parameters of the priors (\u03c3U^2 and \u03c3V^2 above). Introducing priors for the hyperparameters and maximizing the log-posterior of the model over both parameters and hyperparameters as suggested in [6] allows model complexity to be controlled automatically based on the training data. Using spherical priors for user and movie feature vectors in this framework leads to the standard form of PMF with \u03bbU and \u03bbV chosen automatically. This approach to regularization allows us to use methods that are more sophisticated than the simple penalization of the Frobenius norm of the feature matrices. For example, we can use priors with diagonal or even full covariance matrices as well as adjustable means for the feature vectors. 
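To make the closed-form hyperparameter updates concrete: for a Gaussian prior with an adjustable mean and spherical covariance, and a flat (improper) hyperprior, maximizing the log-posterior over the hyperparameters with the feature vectors held fixed gives the sample mean and pooled per-dimension variance of the current features. A minimal sketch under those assumptions (function name and shapes are our own):

```python
import numpy as np

def update_spherical_prior(U):
    """Closed-form MAP update, under a flat hyperprior, for a Gaussian
    prior N(Ui | mu, sigma2 * I) on the columns of the D x N feature
    matrix U: the optimal mu is the sample mean of the feature vectors,
    and the optimal sigma2 is the mean squared deviation, pooled over
    all D dimensions."""
    D, N = U.shape
    mu = U.mean(axis=1)                                # adjustable prior mean
    sigma2 = ((U - mu[:, None]) ** 2).sum() / (N * D)  # pooled variance
    return mu, sigma2
```

Alternating such updates with gradient steps on the features gives the scheme described below; fixing mu = 0 recovers the fixed-prior model with \u03bbU = \u03c3^2/\u03c3U^2.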
Mixture of Gaussians priors can also be handled quite easily.\n\nIn summary, we find a point estimate of parameters and hyperparameters by maximizing the log-posterior given by\n\nln p(U, V, \u03c3^2, \u0398U, \u0398V | R) = ln p(R | U, V, \u03c3^2) + ln p(U | \u0398U) + ln p(V | \u0398V) + ln p(\u0398U) + ln p(\u0398V) + C,   (6)\n\nwhere \u0398U and \u0398V are the hyperparameters for the priors over user and movie feature vectors respectively and C is a constant that does not depend on the parameters or hyperparameters.\n\nWhen the prior is Gaussian, the optimal hyperparameters can be found in closed form if the movie and user feature vectors are kept fixed. Thus to simplify learning we alternate between optimizing the hyperparameters and updating the feature vectors using steepest ascent with the values of the hyperparameters fixed. When the prior is a mixture of Gaussians, the hyperparameters can be updated by performing a single step of EM. In all of our experiments we used improper priors for the hyperparameters, but it is easy to extend the closed-form updates to handle conjugate priors for the hyperparameters.\n\n4 Constrained PMF\n\nOnce a PMF model has been fitted, users with very few ratings will have feature vectors that are close to the prior mean, or the average user, so the predicted ratings for those users will be close to the movie average ratings. In this section we introduce an additional way of constraining user-specific feature vectors that has a strong effect on infrequent users.\n\nLet W \u2208 R^{D\u00d7M} be a latent similarity constraint matrix. We define the feature vector for user i as\n\nUi = Yi + (\u2211_{k=1}^{M} Iik Wk) / (\u2211_{k=1}^{M} Iik),   (7)\n\nwhere I is the observed indicator matrix with Iij taking on value 1 if user i rated movie j and 0 otherwise.^2 
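Eq. 7 maps directly to code. The sketch below (our own naming and shapes) also applies the zero fallback used when a user has no ratings at all:

```python
import numpy as np

def constrained_user_features(Y, W, I):
    """Eq. 7: Ui = Yi + (sum_k Iik Wk) / (sum_k Iik).
    Y: D x N user offsets, W: D x M latent similarity constraint matrix,
    I: N x M 0/1 indicator of which movies each user rated.
    Users with no ratings get a zero offset (the footnoted fallback)."""
    counts = I.sum(axis=1)                     # number of ratings per user
    offsets = W @ I.T                          # D x N: sum_k Iik Wk per user
    safe = np.where(counts > 0, counts, 1.0)   # avoid division by zero
    offsets = np.where(counts > 0, offsets / safe, 0.0)
    return Y + offsets
```

Because the offset averages the W columns of exactly the movies a user rated, two users who rated the same movies share the same prior mean, which is the intuition developed next.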
Intuitively, the ith column of the W matrix captures the effect that a user having rated a particular movie has on the prior mean of the user\u2019s feature vector. As a result, users that have seen the same (or similar) movies will have similar prior distributions for their feature vectors. Note that Yi can be seen as the offset added to the mean of the prior distribution to get the feature vector Ui for the user i. In the unconstrained PMF model Ui and Yi are equal because the prior mean is fixed at zero (see fig. 1). We now define the conditional distribution over the observed ratings as\n\np(R | Y, V, W, \u03c3^2) = \u220f_{i=1}^{N} \u220f_{j=1}^{M} [ N(Rij | g([Yi + (\u2211_{k=1}^{M} Iik Wk)/(\u2211_{k=1}^{M} Iik)]^T Vj), \u03c3^2) ]^{Iij}.   (8)\n\nWe regularize the latent similarity constraint matrix W by placing a zero-mean spherical Gaussian prior on it:\n\np(W | \u03c3W) = \u220f_{k=1}^{M} N(Wk | 0, \u03c3W^2 I).   (9)\n\n^2 If no rating information is available about some user i, i.e. all entries of the Ii vector are zero, the value of the ratio in Eq. 7 is set to zero.\n\nFigure 2: Left panel: Performance of SVD, PMF and PMF with adaptive priors, using 10D feature vectors, on the full Netflix validation data. Right panel: Performance of SVD, Probabilistic Matrix Factorization (PMF) and constrained PMF, using 30D feature vectors, on the validation data. The y-axis displays RMSE (root mean squared error), and the x-axis shows the number of epochs, or passes, through the entire training dataset.\n\nAs with the PMF model, maximizing the log-posterior is equivalent to minimizing the sum-of-squared-errors function with quadratic regularization terms:\n\nE = (1/2) \u2211_{i=1}^{N} \u2211_{j=1}^{M} Iij ( Rij \u2212 g([Yi + (\u2211_{k=1}^{M} Iik Wk)/(\u2211_{k=1}^{M} Iik)]^T Vj) )^2 + (\u03bbY/2) \u2211_{i=1}^{N} ||Yi||_Fro^2 + (\u03bbV/2) \u2211_{j=1}^{M} ||Vj||_Fro^2 + (\u03bbW/2) \u2211_{k=1}^{M} ||Wk||_Fro^2,   (10)\n\nwith \u03bbY = \u03c3^2/\u03c3Y^2, \u03bbV = \u03c3^2/\u03c3V^2, and \u03bbW = \u03c3^2/\u03c3W^2. We can then perform gradient descent in Y, V, and W to minimize the objective function given by Eq. 10. The training time for the constrained PMF model scales linearly with the number of observations, which allows for a fast and simple implementation. As we show in our experimental results section, this model performs considerably better than a simple unconstrained PMF model, especially on infrequent users.\n\n5 Experimental Results\n\n5.1 Description of the Netflix Data\n\nAccording to Netflix, the data were collected between October 1998 and December 2005 and represent the distribution of all ratings Netflix obtained during this period. The training dataset consists of 100,480,507 ratings from 480,189 randomly-chosen, anonymous users on 17,770 movie titles. As part of the training data, Netflix also provides validation data, containing 1,408,395 ratings. In addition to the training and validation data, Netflix also provides a test set containing 2,817,131 user/movie pairs with the ratings withheld. The pairs were selected from the most recent ratings for a subset of the users in the training dataset. 
To reduce the unintentional overfitting to the test set that plagues many empirical comparisons in the machine learning literature, performance is assessed by submitting predicted ratings to Netflix, which then posts the root mean squared error (RMSE) on an unknown half of the test set. As a baseline, Netflix provided the test score of its own system trained on the same data, which is 0.9514.\n\nTo provide additional insight into the performance of different algorithms we created a smaller and much more difficult dataset from the Netflix data by randomly selecting 50,000 users and 1850 movies. The toy dataset contains 1,082,982 training and 2,462 validation user/movie pairs. Over 50% of the users in the training dataset have fewer than 10 ratings.\n\n5.2 Details of Training\n\nTo speed up training, instead of performing batch learning we subdivided the Netflix data into mini-batches of size 100,000 (user/movie/rating triples), and updated the feature vectors after each mini-batch. After trying various values for the learning rate and momentum and experimenting with various values of D, we chose to use a learning rate of 0.005 and a momentum of 0.9, as this setting of parameters worked well for all values of D we have tried.\n\n5.3 Results for PMF with Adaptive Priors\n\nTo evaluate the performance of PMF models with adaptive priors we used models with 10D features. This dimensionality was chosen in order to demonstrate that even when the dimensionality of features is relatively low, SVD-like models can still overfit and that there are some performance gains to be had by regularizing such models automatically. We compared an SVD model, two fixed-prior PMF models, and two PMF models with adaptive priors. The SVD model was trained to minimize the sum-squared distance only to the observed entries of the target matrix. The feature vectors of the SVD model were not regularized in any way. 
The two \ufb01xed-prior PMF models differed in their\nregularization parameters: one (PMF1) had \u03bbU = 0.01 and \u03bbV = 0.001, while the other (PMF2)\nhad \u03bbU = 0.001 and \u03bbV = 0.0001. The \ufb01rst PMF model with adaptive priors (PMFA1) had Gaus-\nsian priors with spherical covariance matrices on user and movie feature vectors, while the second\nmodel (PMFA2) had diagonal covariance matrices. In both cases, the adaptive priors had adjustable\nmeans. Prior parameters and noise covariances were updated after every 10 and 100 feature matrix\nupdates respectively. The models were compared based on the RMSE on the validation set.\n\nThe results of the comparison are shown on Figure 2 (left panel). Note that the curve for the PMF\nmodel with spherical covariances is not shown since it is virtually identical to the curve for the model\nwith diagonal covariances. Comparing models based on the lowest RMSE achieved over the time of\ntraining, we see that the SVD model does almost as well as the moderately regularized PMF model\n(PMF2) (0.9258 vs. 0.9253) before over\ufb01tting badly towards the end of training. While PMF1\ndoes not over\ufb01t, it clearly under\ufb01ts since it reaches the RMSE of only 0.9430. The models with\nadaptive priors clearly outperform the competing models, achieving the RMSE of 0.9197 (diagonal\ncovariances) and 0.9204 (spherical covariances). These results suggest that automatic regularization\nthrough adaptive priors works well in practice. Moreover, our preliminary results for models with\nhigher-dimensional feature vectors suggest that the gap in performance due to the use of adaptive\npriors is likely to grow as the dimensionality of feature vectors increases. 
While the use of diagonal\ncovariance matrices did not lead to a signi\ufb01cant improvement over the spherical covariance matrices,\ndiagonal covariances might be well-suited for automatically regularizing the greedy version of the\nPMF training algorithm, where feature vectors are learned one dimension at a time.\n\n5.4 Results for Constrained PMF\n\nFor experiments involving constrained PMF models, we used 30D features (D = 30), since this\nchoice resulted in the best model performance on the validation set. Values of D in the range of\n[20, 60] produce similar results. Performance results of SVD, PMF, and constrained PMF on the\ntoy dataset are shown on Figure 3. The feature vectors were initialized to the same values in all\nthree models. For both PMF and constrained PMF models the regularization parameters were set to\n\u03bbU = \u03bbY = \u03bbV = \u03bbW = 0.002. It is clear that the simple SVD model over\ufb01ts heavily. The con-\nstrained PMF model performs much better and converges considerably faster than the unconstrained\nPMF model. Figure 3 (right panel) shows the effect of constraining user-speci\ufb01c features on the\npredictions for infrequent users. Performance of the PMF model for a group of users that have fewer\nthan 5 ratings in the training datasets is virtually identical to that of the movie average algorithm that\nalways predicts the average rating of each movie. The constrained PMF model, however, performs\nconsiderably better on users with few ratings. As the number of ratings increases, both PMF and\nconstrained PMF exhibit similar performance.\n\nOne other interesting aspect of the constrained PMF model is that even if we know only what movies\nthe user has rated, but do not know the values of the ratings, the model can make better predictions\nthan the movie average model. 
For the toy dataset, we randomly sampled an additional 50,000 users, and for each of these users compiled a list of movies the user had rated and then discarded the actual ratings. The constrained PMF model achieved an RMSE of 1.0510 on the validation set compared to an RMSE of 1.0726 for the simple movie average model. This experiment strongly suggests that knowing only which movies a user rated, but not the actual ratings, can still help us to model that user\u2019s preferences better.\n\nFigure 3: Left panel: Performance of SVD, Probabilistic Matrix Factorization (PMF) and constrained PMF on the validation data. The y-axis displays RMSE (root mean squared error), and the x-axis shows the number of epochs, or passes, through the entire training dataset. Right panel: Performance of constrained PMF, PMF, and the movie average algorithm that always predicts the average rating of each movie. The users were grouped by the number of observed ratings in the training data.\n\nFigure 4: Left panel: Performance of constrained PMF, PMF, and the movie average algorithm that always predicts the average rating of each movie. The users were grouped by the number of observed ratings in the training data, with the x-axis showing those groups, and the y-axis displaying RMSE on the full Netflix validation data for each such group. Middle panel: Distribution of users in the training dataset. Right panel: Performance of constrained PMF and constrained PMF that makes use of the additional rated/unrated information obtained from the test dataset.\n\nPerformance results on the full Netflix dataset are similar to the results on the toy dataset. For both the PMF and constrained PMF models the regularization parameters were set to \u03bbU = \u03bbY = \u03bbV = \u03bbW = 0.001. Figure 2 (right panel) shows that constrained PMF significantly outperforms the unconstrained PMF model, achieving an RMSE of 0.9016. A simple SVD achieves an RMSE of about 0.9280 and after about 10 epochs begins to overfit. 
Figure 4 (left panel) shows that the constrained PMF model is able to generalize considerably better for users with very few ratings. Note that over 10% of users in the training dataset have fewer than 20 ratings. As the number of ratings increases, the effect of the offset in Eq. 7 diminishes, and both PMF and constrained PMF achieve similar performance.\n\nThere is a more subtle source of information in the Netflix dataset. Netflix tells us in advance which user/movie pairs occur in the test set, so we have an additional category: movies that were viewed but for which the rating is unknown. This is a valuable source of information about users who occur several times in the test set, especially if they have only a small number of ratings in the training set. The constrained PMF model can easily take this information into account. Figure 4 (right panel) shows that this additional source of information further improves model performance.\n\nWhen we linearly combine the predictions of PMF, PMF with a learnable prior, and constrained PMF, we achieve an error rate of 0.8970 on the test set. When the predictions of multiple PMF models are linearly combined with the predictions of multiple RBM models, recently introduced by [8], we achieve an error rate of 0.8861, nearly 7% better than the score of Netflix\u2019s own system.\n\n6 Summary and Discussion\n\nIn this paper we presented Probabilistic Matrix Factorization (PMF) and its two derivatives: PMF with a learnable prior and constrained PMF. We also demonstrated that these models can be efficiently trained and successfully applied to a large dataset containing over 100 million movie ratings.\n\nEfficiency in training PMF models comes from finding only point estimates of model parameters and hyperparameters, instead of inferring the full posterior distribution over them. 
If we were to take a fully Bayesian approach, we would put hyperpriors over the hyperparameters and resort to MCMC methods [5] to perform inference. While this approach is computationally more expensive, preliminary results strongly suggest that a fully Bayesian treatment of the presented PMF models would lead to a significant increase in predictive accuracy.\n\nAcknowledgments\n\nWe thank Vinod Nair and Geoffrey Hinton for many helpful discussions. This research was supported by NSERC.\n\nReferences\n\n[1] Delbert Dueck and Brendan Frey. Probabilistic sparse matrix factorization. Technical Report PSI TR 2004-023, Dept. of Computer Science, University of Toronto, 2004.\n\n[2] Thomas Hofmann. Probabilistic latent semantic analysis. In Proceedings of the 15th Conference on Uncertainty in AI, pages 289\u2013296, San Francisco, California, 1999. Morgan Kaufmann.\n\n[3] Benjamin Marlin. Modeling user rating profiles for collaborative filtering. In Sebastian Thrun, Lawrence K. Saul, and Bernhard Sch\u00f6lkopf, editors, NIPS. MIT Press, 2003.\n\n[4] Benjamin Marlin and Richard S. Zemel. The multiple multiplicative factor model for collaborative filtering. In Machine Learning, Proceedings of the Twenty-first International Conference (ICML 2004), Banff, Alberta, Canada, July 4-8, 2004. ACM, 2004.\n\n[5] Radford M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, September 1993.\n\n[6] S. J. Nowlan and G. E. Hinton. Simplifying neural networks by soft weight-sharing. Neural Computation, 4:473\u2013493, 1992.\n\n[7] Jason D. M. Rennie and Nathan Srebro. Fast maximum margin matrix factorization for collaborative prediction. 
In Luc De Raedt and Stefan Wrobel, editors, Machine Learning, Proceedings of the Twenty-Second International Conference (ICML 2005), Bonn, Germany, August 7-11, 2005, pages 713\u2013719. ACM, 2005.\n\n[8] Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. Restricted Boltzmann machines for collaborative filtering. In Machine Learning, Proceedings of the Twenty-fourth International Conference (ICML 2007). ACM, 2007.\n\n[9] Nathan Srebro and Tommi Jaakkola. Weighted low-rank approximations. In Tom Fawcett and Nina Mishra, editors, Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, pages 720\u2013727. AAAI Press, 2003.\n\n[10] Nathan Srebro, Jason D. M. Rennie, and Tommi Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems, 2004.\n\n[11] Michael E. Tipping and Christopher M. Bishop. Probabilistic principal component analysis. Technical Report NCRG/97/010, Neural Computing Research Group, Aston University, September 1997.\n\n[12] Max Welling, Michal Rosen-Zvi, and Geoffrey Hinton. Exponential family harmoniums with an application to information retrieval. In NIPS 17, pages 1481\u20131488, Cambridge, MA, 2005. MIT Press.", "award": [], "sourceid": 1007, "authors": [{"given_name": "Andriy", "family_name": "Mnih", "institution": null}, {"given_name": "Russ", "family_name": "Salakhutdinov", "institution": null}]}