{"title": "Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 64, "page_last": 72, "abstract": null, "full_text": "Nonparametric Max-Margin Matrix Factorization for
Collaborative Prediction
Minjie Xu, Jun Zhu and Bo Zhang
State Key Laboratory of Intelligent Technology and Systems (LITS)
Tsinghua National Laboratory for Information Science and Technology (TNList)
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
chokkyvista06@gmail.com, {dcszj,dcszb}@mail.tsinghua.edu.cn

Abstract
We present a probabilistic formulation of max-margin matrix factorization and build accordingly a nonparametric Bayesian model which automatically resolves the unknown number of latent factors. Our work demonstrates a successful example of integrating Bayesian nonparametrics and max-margin learning, which are conventionally two separate paradigms with complementary advantages. We develop an efficient variational algorithm for posterior inference, and our extensive empirical studies on the large-scale MovieLens and EachMovie data sets appear to justify the aforementioned dual advantages.

1 Introduction

Collaborative prediction is the task of predicting users' potential preferences for currently unrated items (e.g., movies) based on their currently observed preferences and their relations with other users. One typical setting formalizes it as a matrix completion problem, i.e., filling in the missing entries (preferences) of a partially observed user-by-item matrix. Often there is extra information available (e.g., users' age and gender; movies' genre, year, etc.) [10] to help with the task.
Among other popular approaches, factor-based models have been used extensively in collaborative prediction. The underlying idea behind such models is that only a small number of latent factors influence the preferences.
In a linear factor model, a user's rating of an item is modeled as a linear combination of these factors, with user-specific coefficients and item-specific factor values. Thus, given an N x M preference matrix for N users and M items, a K-factor model fits it with an N x K coefficient matrix U and an M x K factor matrix V as UV^T. Various computational methods have been successfully developed to implement this idea, including probabilistic matrix factorization (PMF) [13, 12] and deterministic reconstruction/approximation error minimization, e.g., max-margin matrix factorization (M3F) with hinge loss [14, 11, 16].
One common problem in latent factor models is how to determine the number of factors, which is unknown a priori. A typical solution relies on some general model selection procedure, e.g., cross-validation, which explicitly enumerates and compares many candidate models and thus can be computationally expensive. On the other hand, probabilistic matrix factorization models lend themselves naturally to recent advances in Bayesian nonparametrics that bypass explicit model selection [17, 1]. However, it remains largely unexplored how to bring such advantages into deterministic max-margin matrix factorization models, particularly the very successful M3F.
To address this problem, we present infinite probabilistic max-margin matrix factorization (iPM3F), a nonparametric Bayesian-style M3F model that utilizes nonparametric Bayesian techniques to automatically resolve the unknown number of latent factors in M3F models. The first key step towards iPM3F is a general probabilistic formulation of the standard M3F, based on the maximum entropy discrimination principle [4]. We can then principally extend it to a nonparametric model, which in theory has an unbounded number of latent factors.
To avoid overfitting, we impose a sparsity-inducing Indian buffet process prior on the latent coefficient matrix, selecting only an appropriate number of active factors. We develop an efficient variational method to infer posterior distributions and learn parameters (if any exist), and our extensive empirical results on MovieLens and EachMovie demonstrate appealing performance.
The rest of the paper is structured as follows. In Section 2, we briefly review the formalization of max-margin matrix factorization; in Section 3, we present a general probabilistic formulation of M3F, then its nonparametric extension and a fully Bayesian formulation; in Section 4, we discuss how to perform learning and inference; in Section 5, we give empirical results on two prevalent collaborative filtering data sets; and finally, we conclude in Section 6.

2 Max-margin matrix factorization

Given a preference matrix Y \in R^{N x M}, which is partially observed and usually sparse, we denote the observed entry indices by I. The task of traditional matrix factorization is to find a low-rank matrix X \in R^{N x M} that approximates Y under some loss measure, e.g., the commonly used squared error, and to use X_{ij} as the reconstruction of the missing entries Y_{ij} wherever ij \notin I. Max-margin matrix factorization (M3F) [14] extends the model by using a sparsity-inducing norm regularizer for a low-norm factorization and adopting the hinge loss as the error measure, which is applicable to binary, discrete ordinal, or categorical data. For the binary case where Y_{ij} \in {+1, -1} and one predicts by \hat{Y}_{ij} = sign(X_{ij}), the optimization problem of M3F is defined as

\min_X \; \|X\|_\Sigma + C \sum_{ij \in I} h(Y_{ij} X_{ij}),   (1)

where h(x) = \max(0, 1 - x) is the hinge loss and \|X\|_\Sigma is the nuclear norm of X.
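For concreteness, the objective in (1) can be evaluated in a few lines of NumPy. This sketch is ours, not the paper's code; the function names are illustrative, and the nuclear norm is computed from the singular values of X.

```python
import numpy as np

def hinge(x):
    # h(x) = max(0, 1 - x), applied element-wise
    return np.maximum(0.0, 1.0 - x)

def m3f_objective(X, Y, observed, C=1.0):
    # Binary M3F objective (1): nuclear norm of X plus hinged
    # reconstruction error over the observed entries of Y (in {-1, +1}).
    nuclear = np.linalg.svd(X, compute_uv=False).sum()
    loss = hinge(Y[observed] * X[observed]).sum()
    return nuclear + C * loss

X = np.array([[2.0, 0.0], [0.0, 2.0]])
Y = np.array([[1.0, -1.0], [-1.0, 1.0]])
mask = np.ones_like(Y, dtype=bool)  # all entries observed
obj = m3f_objective(X, Y, mask)     # nuclear norm 4 + hinge loss 2 = 6
```

In practice only the entries indexed by I enter the loss, which is what the boolean mask stands in for here.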
M3F can be equivalently reformulated as a semi-definite program (SDP) and thus learned using standard SDP solvers, but this is unfortunately very slow and only scales up to thousands of users and items. As shown in [14], the nuclear norm can be written in a variational form, namely

\|X\|_\Sigma = \min_{X = U V^\top} \frac{1}{2} \left( \|U\|_F^2 + \|V\|_F^2 \right).   (2)

Based on this equivalence, a fast M3F model was proposed in [11], which uses gradient descent to solve an equivalent problem over U and V instead:

\min_{U, V} \; \frac{1}{2} \left( \|U\|_F^2 + \|V\|_F^2 \right) + C \sum_{ij \in I} h(Y_{ij} U_i V_j^\top),   (3)

where U \in R^{N x K} is the user coefficient matrix, V \in R^{M x K} the item factor matrix, and K the number of latent factors. We use U_i to denote the ith row of U, and V_j likewise.
The fast M3F model can scale up to millions of users and items. But one resulting, unaddressed problem is that the unknown number of latent factors, K, must be specified a priori. Below we present a nonparametric Bayesian approach, which effectively bypasses this model selection problem and produces very robust predictions. We also design a blockwise coordinate descent algorithm that directly solves problem (3) rather than working on a smoothed relaxation [11], and it turns out to be just as efficient and accurate. To save space, we defer this part to Appendix B.

3 Nonparametric Bayesian max-margin matrix factorization

Now we present the nonparametric Bayesian max-margin matrix factorization models. We start with a brief introduction to maximum entropy discrimination, which lays the basis for our methods.

3.1 Maximum entropy discrimination

We consider the binary classification setting since it suffices for our model.
Given a set of training data {(x_d, y_d)}_{d=1}^D (y_d \in {+1, -1}) and a discriminant function F(x; \Theta) parameterized by \Theta, maximum entropy discrimination (MED) [4] seeks to learn a distribution p(\Theta) rather than perform a point estimation of \Theta, as is the case with standard SVMs, which typically lack a direct probabilistic interpretation. Accordingly, MED takes the expectation of the original discriminant function with respect to p(\Theta) and has the new prediction rule

\hat{y} = \mathrm{sign} \left( E_p[F(x; \Theta)] \right).   (4)

To find p(\Theta), MED solves the following relative-entropic regularized risk minimization problem

\min_{p(\Theta)} \; \mathrm{KL}(p(\Theta) \| p_0(\Theta)) + C \sum_d h_\ell(y_d E_p[F(x_d; \Theta)]),   (5)

where p_0(\Theta) is the pre-specified prior distribution of \Theta, KL(p \| p_0) the Kullback-Leibler divergence, or relative entropy, between two distributions, C the regularization constant, and h_\ell(x) = \max(0, \ell - x) (\ell > 0) the generalized hinge loss.
By defining F as the log-likelihood ratio of a Bayesian generative model¹, MED provides an elegant way to integrate discriminative max-margin learning and Bayesian generative modeling. In fact, MED subsumes SVM as a special case and has been extended to incorporate latent variables [5, 18] and to perform structured output prediction [21]. Recent work has further extended MED to unite Bayesian nonparametrics and max-margin learning [20, 19], which had largely been treated as isolated topics, for learning better classification models. The present work contributes a novel generalization of MED that handles the challenging matrix factorization problem.

3.2 Probabilistic max-margin matrix factorization

Like PMF [12], we treat U and V as random variables, whose joint prior distribution is denoted by p_0(U, V). Our goal is then to infer their posterior distribution p(U, V)² after a set of observations has been provided. We first consider the binary case where Y_{ij} takes values in {+1, -1}.
If the factorization, U and V, is given, we can naturally define the discriminant function F as

F((i, j); U, V) = U_i V_j^\top.   (6)

Furthermore, since both U and V are random variables, we need to resolve this uncertainty in order to derive a prediction rule. Here, we choose the canonical MED approach, namely the expectation operator, which is linear and has shown promise in [18, 19], rather than the log-marginalized-likelihood-ratio approach [5], which requires an extra likelihood model. Hence, substituting the discriminant function (6) into (4), we have the prediction rule

\hat{Y}_{ij} = \mathrm{sign} \left( E_p[U_i V_j^\top] \right).   (7)

Then, following the MED learning principle, we define probabilistic max-margin matrix factorization (PM3F) as solving the following optimization problem

\min_{p(U,V)} \; \mathrm{KL}(p(U,V) \| p_0(U,V)) + C \sum_{ij \in I} h(Y_{ij} E_p[U_i V_j^\top]).   (8)

Note that our probabilistic formulation is strictly more general than the original M3F model, which is in fact a special case of PM3F under a standard Gaussian prior and a mean-field assumption on p(U, V). Specifically, if we assume p_0(U, V) = \prod_i N(U_i | 0, I) \prod_j N(V_j | 0, I) and p(U, V) = p(U) p(V), then one can prove that p(U) = \prod_i N(U_i | \Phi_i, I) and p(V) = \prod_j N(V_j | \Psi_j, I), and PM3F reduces accordingly to an M3F problem (3), namely

\min_{\Phi, \Psi} \; \frac{1}{2} \left( \|\Phi\|_F^2 + \|\Psi\|_F^2 \right) + C \sum_{ij \in I} h(Y_{ij} \Phi_i \Psi_j^\top).   (9)

Ratings: For ordinal ratings Y_{ij} \in {1, 2, ..., L}, we use the same strategy as in [14] to define the loss function. Specifically, we introduce thresholds \theta_0 \leq \theta_1 \leq \cdots \leq \theta_L, where \theta_0 = -\infty and \theta_L = +\infty, to discretize R into L intervals.
The prediction rule changes accordingly to

\hat{Y}_{ij} = \max \{ r \mid E_p[U_i V_j^\top] \geq \theta_r \} + 1.   (10)

In a hard-margin setting, we would require that

\theta_{Y_{ij} - 1} + 1 \leq E_p[U_i V_j^\top] \leq \theta_{Y_{ij}} - 1.   (11)

In a soft-margin setting, we instead define the loss as

\sum_{ij \in I} \left[ \sum_{r=1}^{Y_{ij} - 1} h(E_p[U_i V_j^\top] - \theta_r) + \sum_{r=Y_{ij}}^{L-1} h(\theta_r - E_p[U_i V_j^\top]) \right] = \sum_{ij \in I} \sum_{r=1}^{L-1} h(T_{ij}^r (\theta_r - E_p[U_i V_j^\top])),   (12)

where T_{ij}^r = +1 for r \geq Y_{ij} and -1 for r < Y_{ij}. The loss thus defined is an upper bound on the sum of absolute differences between the predicted and true ratings, a loss measure closely related to the Normalized Mean Absolute Error (NMAE) [7, 14].
Furthermore, we can learn a more flexible model that captures users' diverse rating criteria by replacing the user-common thresholds \theta_r in the prediction rule (10) and the loss (12) with user-specific ones \theta_{ir}. Finally, we may as well treat the additionally introduced thresholds \theta_{ir} as random variables and infer their posterior distribution, hereby giving the full PM3F model as solving

\min_{p(U,V,\theta)} \; \mathrm{KL}(p(U,V,\theta) \| p_0(U,V,\theta)) + C \sum_{ij \in I} \sum_{r=1}^{L-1} h(T_{ij}^r (E_p[\theta_{ir}] - E_p[U_i V_j^\top])).   (13)

¹F can also be directly specified without any reference to probabilistic models [4], as is our case.
²We abbreviate the posterior p(U, V | Y) since we do not specify a likelihood p(Y | U, V) anyway.

3.3 Infinite PM3F (iPM3F)

As we have stated, one common problem with finite factor-based models, including PM3F, is that we need to explicitly select the number of latent factors, i.e., K. In this section, we present an infinite PM3F model which, through Bayesian nonparametric techniques, automatically adapts and selects the number of latent factors during learning.
Without loss of generality, we consider learning a binary³ coefficient matrix Z \in {0, 1}^{N x K}. For finite-sized binary matrices, we may define their prior via a Beta-Bernoulli process [8]. In the infinite case, however, we allow Z to have an infinite number of columns.
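To make the ordinal machinery above concrete, here is a small sketch (ours, not the paper's code) of the prediction rule (10) and the per-entry threshold loss (12); the threshold values are illustrative, chosen symmetric around 0 with step 2 as in the paper's experimental setup.

```python
import numpy as np

def predict_rating(x, thresholds):
    # Rule (10): x plays the role of E_p[U_i V_j^T];
    # thresholds = [theta_1, ..., theta_{L-1}] (theta_0 = -inf is implicit).
    # The rating is 1 plus the number of thresholds that x meets or exceeds.
    return int(np.sum(x >= np.asarray(thresholds))) + 1

def ordinal_loss(x, y, thresholds):
    # Loss (12) for one observed rating y: hinge penalties push x above
    # every threshold below y and below every threshold at or above y.
    h = lambda t: max(0.0, 1.0 - t)
    loss = 0.0
    for r, theta in enumerate(thresholds, start=1):
        t_sign = 1.0 if r >= y else -1.0  # T_ij^r
        loss += h(t_sign * (theta - x))
    return loss

# L = 5 ratings, thresholds symmetric around 0 with step size 2
thetas = [-3.0, -1.0, 1.0, 3.0]
r = predict_rating(0.5, thetas)  # 0.5 exceeds -3 and -1 only -> rating 3
```

A score of 0.5 lands in the interval [theta_2, theta_3) = [-1, 1) and so predicts rating 3, while the loss penalizes it for sitting less than a unit margin away from theta_3 = 1.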
Similar to the nonparametric matrix factorization model of [17], we adopt the IBP prior over unbounded binary matrices as established in [3] and, furthermore, we focus on its stick-breaking construction [15], which facilitates the development of efficient inference algorithms. Specifically, let \pi_k \in (0, 1) be a parameter associated with each column of Z (with respect to its left-ordered equivalence class). Then the IBP prior can be described by the following generative process

Z_{ik} \sim \mathrm{Bernoulli}(\pi_k) \quad i.i.d. for i = 1, ..., N (\forall k),   (14)

\pi_1 = \nu_1, \quad \pi_k = \nu_k \pi_{k-1} = \prod_{i=1}^{k} \nu_i, \quad where \nu_i \sim \mathrm{Beta}(\alpha, 1) \; i.i.d. for i = 1, ..., +\infty.   (15)

This process results in a descending sequence of \pi_k. Specifically, given a finite data set (N < +\infty), the probability of seeing the kth factor decreases exponentially with k, and the number of active factors K_+ follows a Poisson(\alpha H_N) distribution, where H_N is the Nth harmonic number. Alternatively, we could use a Beta process prior over Z as in [9].
As for the counterpart, we place an isotropic Gaussian prior over the item factor matrix V. With the priors specified, we may follow the above probabilistic framework to perform max-margin training, with U replaced by Z. In summary, the stick-breaking construction for the IBP prior results in an augmented iPM3F problem for binary data:

\min_{p(\nu, Z, V)} \; \mathrm{KL}(p(\nu, Z, V) \| p_0(\nu, Z, V)) + C \sum_{ij \in I} h(Y_{ij} E_p[Z_i V_j^\top]),   (16)

where p_0(\nu, Z, V) = p_0(\nu) p_0(Z | \nu) p_0(V) with

\nu_k \sim \mathrm{Beta}(\alpha, 1) \quad i.i.d. for k = 1, ..., +\infty,
Z_{ik} | \nu \sim \mathrm{Bernoulli}(\pi_k) \quad i.i.d. for i = 1, ..., N (\forall k),
V_{jk} \sim N(0, \sigma^2) \quad i.i.d. for j = 1, ..., M, k = 1, ..., +\infty.

For ordinal ratings, we augment the iPM3F problem from (13) likewise and, apart from adopting the same prior assumptions for \nu, Z and V, assume p_0(\theta) = p_0(\theta | \nu, Z, V) with

\theta_{ir} \sim N(\mu_r, \sigma_\theta^2) \quad i.i.d. for i = 1, ..., N, r = 1, . . .
, L - 1,

where \mu_1 < \cdots < \mu_{L-1} are specified as prior guidance towards an ascending sequence of large-margin thresholds.
³Learning real-valued coefficients can easily be done as in [3] by defining U = Z \odot W, where W is a real-valued matrix and \odot denotes the Hadamard, or element-wise, product.

3.4 The fully Bayesian model (iBPM3F)

To take iPM3F one step further towards a fully Bayesian model, we introduce priors for the hyper-parameters and perform fully Bayesian inference [12], where model parameters and hyper-parameters are integrated out when making predictions. This approach naturally fits our MED-based model thanks to the adoption of the expectation operator in the prediction rules (7) and (10). Another observation is that the hyper-parameter \sigma in a way serves the same role as the regularization constant C, and thus we also try simplifying the model by omitting C in iBPM3F. We admit, though, that however many levels of hyper-parameters are stacked, treated as stochastic, and integrated out, there always remains a gap between our model and a canonical Bayesian one, since we reject a likelihood. We believe the connection is better justified under the general regularized Bayesian inference framework [19] with a trivial non-informative likelihood.
Here we use the same Gaussian-Wishart prior as in [12] over the latent factor matrix V as well as its hyper-parameters \mu and \Lambda, thus yielding a doubly augmented problem for binary data:

\min_{p(\nu, Z, \mu, \Lambda, V)} \; \mathrm{KL}(p(\nu, Z, \mu, \Lambda, V) \| p_0(\nu, Z, \mu, \Lambda, V)) + \sum_{ij \in I} h(Y_{ij} E_p[Z_i V_j^\top]),   (17)

where we've omitted the regularization constant C and set p_0(\nu, Z, \mu, \Lambda, V) to factorize as p_0(\nu) p_0(Z | \nu) p_0(\mu, \Lambda) p_0(V | \mu, \Lambda), with \nu and Z enjoying the same priors as in iPM3F and

(\mu, \Lambda) \sim \mathcal{GW}(\mu_0, \beta_0, W_0, \nu_0) = N(\mu | \mu_0, (\beta_0 \Lambda)^{-1}) \, \mathcal{W}(\Lambda | W_0, \nu_0),
V_j | \mu, \Lambda \sim N(V_j | \mu, \Lambda^{-1}) \quad i.i.d. for j = 1, . . .
, M.

Note that exactly the same process applies to the full model for ordinal ratings.

4 Learning and inference under truncated mean-field assumptions

Now, we briefly discuss how to perform learning and inference in iPM3F; for iBPM3F, similar procedures apply. We defer the details to Appendix D to save space. Specifically, we introduce a simple variational inference method to approximate the optimal posterior, which turns out to perform well in practice. We make the following truncated mean-field assumption

p(\nu, Z, V) = p(\nu) p(Z) p(V) = \left[ \prod_{k=1}^{K} p(\nu_k) \right] \left[ \prod_{i=1}^{N} \prod_{k=1}^{K} p(Z_{ik}) \right] p(V),   (18)

where K is the truncation level and

\nu_k \sim \mathrm{Beta}(\gamma_{k1}, \gamma_{k2}) \quad i.i.d. for k = 1, ..., K,   (19)
Z_{ik} \sim \mathrm{Bernoulli}(\psi_{ik}) \quad i.i.d. for i = 1, ..., N, k = 1, ..., K.   (20)

Note that we make no further assumption on the functional form of p(V), and that we factorize p(Z) into element-wise i.i.d. p(Z_{ik}) and parameterize it with Bernoulli(\psi_{ik}) merely for a simpler notation in the subsequent derivation.
In fact, it can be shown that p(Z) indeed enjoys all these properties under the mildest truncated mean-field assumption p(\nu, Z, V) = p(\nu) p(Z) p(V).
For ordinal ratings, we make an additional mean-field assumption

p(\nu, Z, V, \theta) = p(\nu, Z, V) p(\theta),   (21)

where p(\nu, Z, V) is treated exactly as for binary data and p(\theta) is left in free form.
One noteworthy point is that given p(Z), we may calculate the expectation of the posterior effective dimensionality of the latent factor space as

E_p[K_+] = \sum_{k=1}^{K} \left( 1 - \prod_{i=1}^{N} (1 - \psi_{ik}) \right).   (22)

The problem can then be solved with an iterative procedure that alternates between optimizing one component at a time, as outlined below (we defer the details to Appendix D):
Infer p(V): The linear discriminant function and the isotropic Gaussian prior on V lead to an isotropic Gaussian posterior p(V) = \prod_{j=1}^{M} N(V_j | \phi_j, \sigma^2 I), where the M mean vectors \phi_j can be obtained by solving M independent binary SVMs

\min_{\phi_j} \; \frac{1}{2 \sigma^2} \|\phi_j\|^2 + C \sum_{i | ij \in I} h(Y_{ij} \psi_i \phi_j^\top).   (23)

Infer p(\nu) and p(Z): Since \nu is marginalized out before exerting any influence on the loss term, its update is independent of the loss, and hence we adopt the same update rules as in [2]. The subproblem on p(Z) decomposes into N independent convex optimization problems, one for each i:

\min_{\psi_i} \; \sum_{k=1}^{K} \left( E_Z[\log p(Z_{ik})] - E_{\nu, Z}[\log p_0(Z_{ik} | \nu)] \right) + C \sum_{j | ij \in I} h(Y_{ij} \psi_i \phi_j^\top),   (24)

where E_Z[\log p(Z_{ik})] = \psi_{ik} \log \psi_{ik} + (1 - \psi_{ik}) \log(1 - \psi_{ik}), E_{\nu, Z}[\log p_0(Z_{ik} | \nu)] = \psi_{ik} \sum_{j=1}^{k} E_\nu[\log \nu_j] + (1 - \psi_{ik}) E_\nu[\log(1 - \prod_{j=1}^{k} \nu_j)], and E_\nu[\log \nu_k] = \Psi(\gamma_{k1}) - \Psi(\gamma_{k1} + \gamma_{k2}) with \Psi(\cdot) the digamma function, while E_\nu[\log(1 - \prod_{j=1}^{k} \nu_j)] \geq \mathcal{L}_k, where \mathcal{L}_k is the multivariate lower bound of [2]. We could use a subgradient technique similar to [19] to approximately solve for \psi_i. Here we introduce an alternative solution, which is as efficient and guarantees convergence as the iterations proceed: we update \psi_i via coordinate descent, with each conditionally optimal \psi_{ik} sought by binary search.
(See Appendix D.1.3 for details.)
Infer p(\theta): p(\theta) remains an isotropic Gaussian, p(\theta) = \prod_{i=1}^{N} \prod_{r=1}^{L-1} N(\theta_{ir} | \bar{\theta}_{ir}, \sigma_\theta^2), and the mean \bar{\theta}_{ir} of each component solves the corresponding subproblem

\min_{\bar{\theta}_{ir}} \; \frac{1}{2 \sigma_\theta^2} (\bar{\theta}_{ir} - \mu_r)^2 + C \sum_{j | ij \in I} h(T_{ij}^r (\bar{\theta}_{ir} - \psi_i \phi_j^\top)),   (25)

to which the binary search solver for each \psi_{ik} also applies. Note that as \sigma_\theta \to +\infty, the Gaussian distribution regresses to a uniform distribution and problem (25) reduces accordingly to the corresponding conditional subproblem for \theta in the original M3F (Appendix B.3).

5 Experiments and discussions

We conduct experiments on the MovieLens 1M and EachMovie data sets, and compare our results with fast M3F [11] and two probabilistic matrix factorization methods, PMF [13] and BPMF [12].
Data sets: The MovieLens data set contains 1,000,209 anonymous ratings (ranging from 1 to 5) of 3,952 movies made by 6,040 users, among which 3,706 movies are actually rated and every user has at least 20 ratings. The EachMovie data set contains 2,811,983 ratings of 1,628 movies made by 72,916 users, among which 1,623 movies are actually rated and 36,656 users have at least 20 ratings. As in [7, 11], we discarded users with fewer than 20 ratings, leaving us with 2,579,985 ratings. There are 6 possible rating values, {0, 0.2, ..., 1}, which we mapped to {1, 2, ..., 6}.
Protocol: As in [7, 11], we test our method in a pure collaborative prediction setting, neglecting any external information other than the user-item-rating triplets in the data sets. We also adopt the all-but-one protocol to partition the data into training and test sets, i.e., we randomly withhold one observed rating per user for the test set and use the rest for training. A validation set, when needed, is constructed likewise from the constructed training set. Also as described in [7], we consider both weak and strong generalization.
For weak generalization, the training ratings for all users are always available, so a single-stage training process suffices; for strong generalization, training is first carried out on a subset of users, and then, keeping the learned latent factor matrix V fixed, we train the model a second time on the remaining users for their user profiles (coefficients Z and thresholds \theta) and perform prediction on these users only. We partition the users accordingly as in [7, 11], namely 5,000 and 1,040 users for weak and strong generalization respectively in MovieLens, and 30,000 and 6,565 in EachMovie. We repeat the random partition thrice, compute the Normalized Mean Absolute Error (NMAE) as the error measure, and report the averaged performance.⁴
Implementation details: We perform cross-validation to choose the best regularization constant C for iPM3F, as well as to guide early stopping during the learning process. The candidate C values are the same 11 values, log-evenly distributed between 0.13/4 and 0.12, as in [11]. We set the truncation level K = 100 (same for the M3F and PMF models), \alpha = 3, \sigma = 1, and \sigma_\theta = 1.5; \mu_1, ..., \mu_{L-1} are set to be symmetric with respect to 0, with a step size of 2; and we set the margin parameter \ell = 9. Although M3F is invariant to \ell (Appendix B.4), we find that setting \ell = 9 achieves a good balance between performance and training time (Figure 1). The difference is largely attributed to the uniform convergence criterion we used when solving the SVM subproblems. Finally, for iBPM3F, we find that although removing C can achieve results competitive with iPM3F, keeping C produces even better performance.
Hence we learn iBPM3F using the C selected for iPM3F.

Table 1: NMAE performance of different models on MovieLens and EachMovie.
Algorithm   MovieLens weak   MovieLens strong   EachMovie weak   EachMovie strong
M3F [11]    .4156 ± .0037    .4203 ± .0138      .4397 ± .0006    .4341 ± .0025
PMF [13]    .4332 ± .0033    .4413 ± .0074      .4466 ± .0016    .4579 ± .0016
BPMF [12]   .4235 ± .0023    .4450 ± .0085      .4352 ± .0014    .4445 ± .0005
M3F*        .4176 ± .0016    .4227 ± .0072      .4348 ± .0023    .4301 ± .0034
iPM3F       .4031 ± .0030    .4135 ± .0109      .4211 ± .0019    .4224 ± .0051
iBPM3F      .4050 ± .0029    .4089 ± .0146      .4268 ± .0029    .4403 ± .0040

5.1 Experimental results

Table 1 presents the NMAE performance of the different models, where the performance of M3F is cited from the corresponding paper [11] and represents the state of the art. We observe that iPM3F significantly outperforms M3F, PMF and BPMF in terms of NMAE on both data sets under both settings. Moreover, we find that the fully Bayesian formulation of iPM3F achieves performance comparable to iPM3F in most cases, and that our coordinate descent algorithm for M3F (denoted M3F*) performs quite similarly to the original gradient descent algorithm for M3F.
In summary, the effect of endowing M3F models with a probabilistic formulation is intriguing: not only is the performance of the model largely improved, but, with the help of Bayesian nonparametric techniques, the effort of selecting the number of latent factors is saved as well.
Another observation from Table 1 is that in general almost all models perform worse on EachMovie than on MovieLens. A closer investigation finds that the EachMovie data set has a special rating. When a user has rated an item as zero stars, he might either express a genuine dislike or, when the weight of the rating is less than 1, indicate that he never plans to see that movie since it just sounds awful. Ideally we should treat such a declaration as less authoritative than a regular zero-star rating and hence omit it from the data set. We have tried this setting by removing these special ratings.⁵ Table 2 presents the NMAE results of the different models. Again, the coordinate descent M3F* performs comparably with fast M3F; iPM3F performs better than all the other methods; and iBPM3F performs comparably with iPM3F.

Table 2: NMAE on the purged EachMovie.
Algorithm   weak            strong
M3F [11]    .4009 ± .0012   .4028 ± .0064
PMF [13]    .4153 ± .0016   .4329 ± .0059
BPMF [12]   .4021 ± .0011   .4119 ± .0062
M3F*        .4059 ± .0012   .4095 ± .0052
iPM3F       .3954 ± .0026   .3977 ± .0034
iBPM3F      .3982 ± .0021   .4026 ± .0067

Figure 1: Influence of the margin parameter \ell on M3F* (NMAE and average time per iteration); we fixed \ell = 9 across the experiments. Figure 2: Objective values during the training of iPM3F on MovieLens 1M (three random partitions). Figure 3: NMAE during the training of iPM3F on MovieLens 1M (three random partitions).

5.2 Closer analysis of iPM3F

The posterior dimensionality: As indicated in Eq. (22), we may calculate the expectation of the effective dimensionality K_+ of the latent factor space to get a rough sense of how the iPM3F model automatically chooses the latent dimensionality. Since we take \alpha = 3 in the IBP prior (15) and N \approx 10^4, the expected prior dimensionality \alpha H_N is about 30.
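As a quick sanity check (ours, not from the paper), the Poisson prior mean \alpha H_N quoted above can be computed directly from the harmonic number:

```python
from math import fsum

def harmonic(n):
    # H_n = sum_{i=1}^{n} 1/i, accumulated with fsum for accuracy
    return fsum(1.0 / i for i in range(1, n + 1))

alpha = 3.0  # IBP concentration used in the experiments
N = 10_000   # order of magnitude of the number of users
prior_dim = alpha * harmonic(N)  # expected number of active factors, about 29.4
```

For N on the order of 10^4, H_N is just under 10, so the prior expects roughly 30 active factors, consistent with the text.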
We find that when the truncation level K is set small, e.g., 60 or 80, the expected posterior dimensionality saturates very quickly, often within the first few iterations, while for sufficiently large K, e.g., 150 or 200, iPM3F tends to output a sparse Z of expected dimensionality around 135 or 110 respectively. (For each truncation level, we rerun our model and perform cross-validation to select the best regularization constant C.) This interesting observation verifies our model's capability of automatic model complexity control.

⁴Note that M3F models output discretized ordinal ratings while PMF models output real-valued ratings.
⁵After discarding users with fewer than 20 normal ratings, we are left with 35,281 users and 2,315,060 ratings.

Table 3: Performance of iPM3F with and without probabilistic treatment of \theta.
Algorithm    w/ prob.        w/o prob.       margin
MovieLens    .4031 ± .0030   .4056 ± .0043   .0024 ± .0013
EachMovie    .4211 ± .0019   .4256 ± .0011   .0045 ± .0016
pEachMovie   .3954 ± .0026   .4026 ± .0023   .0072 ± .0045

Stability: As Figures 2 and 3 show, iPM3F performs quite stably across the 3 different randomly partitioned subsets. iBPM3F exhibits a similar trait, but its test performance does not keep dropping as the objective value decreases. We therefore use a validation set to guide early stopping during the learning process, terminating when the validation error starts to rebound.
Treating thresholds \theta: When predicting ordinal ratings, the introduced thresholds are very important since they underpin the large-margin principle of max-margin matrix factorization models. Nevertheless, without a proper probabilistic treatment, the subproblems on the thresholds (25) are not strictly convex, very often giving rise to a range of candidate thresholds that are all equally optimal.
Under our probabilistic model, however, we can easily eliminate this non-strict convexity by introducing a Gaussian prior for the thresholds, as stated in Section 3.3. We compare the performance of iPM3F both with and without the probabilistic treatment of \theta and, as shown in Table 3, the improvement is outstanding.
Finally, Table 4 presents the running time of the various models on both the EachMovie and MovieLens data sets. For M3F, the original paper [11] reported about 5h on MovieLens with a standard 3.06GHz Pentium 4 CPU and about 15h on EachMovie, which is fairly acceptable for factorizing a matrix with millions of entries. Our current implementations of M3F* and iPM3F consume about 4.5h and 10h on MovieLens and EachMovie respectively with a 3.00GHz Core i5 CPU. A closer investigation discovers that most of the running time in PM3F models is spent on learning U (or Z) and V, which breaks down into a set of SVM optimization problems solved with SVMstruct [6]. More efficient SVM solvers can be applied immediately to further improve efficiency. Furthermore, the blockwise coordinate descent algorithm can naturally be parallelized, since the sub-problems of learning the different U_i (or V_j) are not coupled. We leave this improvement for future work.

Table 4: Running time of different models.
Algorithm   MovieLens   EachMovie   Iters
M3F [11]    5h          15h         100
PMF [13]    8.7m        25m         50
BPMF [12]   19m         1h          50
M3F*        4h          10h         50
  U, V      3.8h        9.5h
  \theta    125s        750s
iPM3F       4.6h        5.5h        50
  V         4.3h        4.3h
  \theta    18m         1h

6 Conclusions

We have presented an infinite probabilistic max-margin matrix factorization method, which utilizes the advantages of nonparametric Bayesian techniques to bypass the model selection problem of max-margin matrix factorization methods. We have also developed efficient blockwise coordinate descent algorithms for variational inference and performed extensive evaluations on two large benchmark data sets.
Empirical results demonstrate appealing performance.

Acknowledgments
This work is supported by the National Basic Research Program (973 Program) of China (Nos. 2013CB329403, 2012CB316301), the National Natural Science Foundation of China (Nos. 91120011, 61273023), and the Tsinghua University Initiative Scientific Research Program (No. 20121088071).

References
[1] N. Ding, Y. Qi, R. Xiang, I. Molloy, and N. Li. Nonparametric Bayesian matrix factorization by power-EP. In Proceedings of the 24th AAAI Conference on Artificial Intelligence, 2010.
[2] F. Doshi-Velez, K. Miller, J. Van Gael, and Y.W. Teh. Variational inference for the Indian buffet process. Journal of Machine Learning Research, 5:137-144, 2009.
[3] T. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, 2005.
[4] T. Jaakkola, M. Meila, and T. Jebara. Maximum entropy discrimination. In Advances in Neural Information Processing Systems, 1999.
[5] T. Jebara. Discriminative, generative and imitative learning. PhD thesis, 2002.
[6] T. Joachims, T. Finley, and C.N. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27-59, 2009.
[7] B. Marlin and R.S. Zemel. The multiple multiplicative factor model for collaborative filtering. In Proceedings of the 21st International Conference on Machine Learning, 2004.
[8] E. Meeds, Z. Ghahramani, R. Neal, and S. Roweis. Modeling dyadic data with binary latent factors. In Advances in Neural Information Processing Systems, 2007.
[9] J. Paisley and L. Carin. Nonparametric factor analysis with Beta process priors. In Proceedings of the 26th International Conference on Machine Learning, 2009.
[10] I. Porteous, A. Asuncion, and M. Welling. Bayesian matrix factorization with side information and Dirichlet process mixtures. In Proceedings of the 24th AAAI Conference on Artificial Intelligence, 2010.
[11] J.D.M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction.
In Proceedings of the 22nd International Conference on Machine Learning, 2005.
[12] R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning, 2008.
[13] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, 2008.
[14] N. Srebro, J.D.M. Rennie, and T. Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems, 2005.
[15] Y.W. Teh, D. Gorur, and Z. Ghahramani. Stick-breaking construction of the Indian buffet process. In Proceedings of the 21st AAAI Conference on Artificial Intelligence, 2007.
[16] M. Weimer, R. Karatzoglou, and A. Smola. Improving maximum margin matrix factorization. Machine Learning, 72(3):263-276, 2008.
[17] F. Wood and T.L. Griffiths. Particle filtering for nonparametric Bayesian matrix factorization. In Advances in Neural Information Processing Systems, 2007.
[18] J. Zhu, A. Ahmed, and E.P. Xing. MedLDA: Maximum margin supervised topic models for regression and classification. In Proceedings of the 26th International Conference on Machine Learning, 2009.
[19] J. Zhu, N. Chen, and E.P. Xing. Infinite latent SVM for classification and multi-task learning. In Advances in Neural Information Processing Systems, 2011.
[20] J. Zhu, N. Chen, and E.P. Xing. Infinite SVM: a Dirichlet process mixture of large-margin kernel machines. In Proceedings of the 28th International Conference on Machine Learning, 2011.
[21] J. Zhu and E.P. Xing. Maximum entropy discrimination Markov networks. Journal of Machine Learning Research, 10:2531-2569, 2009.
", "award": [], "sourceid": 4581, "authors": [{"given_name": "Minjie", "family_name": "Xu", "institution": null}, {"given_name": "Jun", "family_name": "Zhu", "institution": null}, {"given_name": "Bo", "family_name": "Zhang", "institution": null}]}