{"title": "Maximum-Margin Matrix Factorization", "book": "Advances in Neural Information Processing Systems", "page_first": 1329, "page_last": 1336, "abstract": null, "full_text": "Maximum-Margin Matrix Factorization\n\nNathan Srebro\n\nDept. of Computer Science\n\nUniversity of Toronto\n\nToronto, ON, CANADA\n\nJason D. M. Rennie\nTommi S. Jaakkola\nComputer Science and Arti\ufb01cial Intelligence Lab\n\nMassachusetts Institute of Technology\n\nCambridge, MA, USA\n\nnati@cs.toronto.edu\n\njrennie,tommi@csail.mit.edu\n\nAbstract\n\nWe present a novel approach to collaborative prediction, using low-norm\ninstead of low-rank factorizations. The approach is inspired by, and has\nstrong connections to, large-margin linear discrimination. We show how\nto learn low-norm factorizations by solving a semi-de\ufb01nite program, and\ndiscuss generalization error bounds for them.\n\n1 Introduction\n\nFitting a target matrix Y with a low-rank matrix X by minimizing the sum-squared error is\na common approach to modeling tabulated data, and can be done explicitly in terms of the\nsingular value decomposition of Y . It is often desirable, though, to minimize a different\nloss function: loss corresponding to a speci\ufb01c probabilistic model (where X are the mean\nparameters, as in pLSA [1], or the natural parameters [2]); or loss functions such as hinge\nloss appropriate for binary or discrete ordinal data. Loss functions other than squared-error\nyield non-convex optimization problems with multiple local minima. 
Even with a squared-\nerror loss, when only some of the entries in Y are observed, as is the case for collaborative\n\ufb01ltering, local minima arise and SVD techniques are no longer applicable [3].\nLow-rank approximations constrain the dimensionality of the factorization X = U V 0.\nOther constraints, such as sparsity and non-negativity [4], have also been suggested for\nbetter capturing the structure in Y , and also lead to non-convex optimization problems.\n\nIn this paper we suggest regularizing the factorization by constraining the norm of U and\nV \u2014constraints that arise naturally when matrix factorizations are viewed as feature learn-\ning for large-margin linear prediction (Section 2). Unlike low-rank factorizations, such\nconstraints lead to convex optimization problems that can be formulated as semi-de\ufb01nite\nprograms (Section 4). Throughout the paper, we focus on using low-norm factorizations\nfor \u201ccollaborative prediction\u201d: predicting unobserved entries of a target matrix Y , based on\na subset S of observed entries YS. In Section 5, we present generalization error bounds for\ncollaborative prediction using low-norm factorizations.\n\n2 Matrix Factorization as Feature Learning\n\nUsing a low-rank model for collaborative prediction [5, 6, 3] is straightforward: A low-\nrank matrix X is sought that minimizes a loss versus the observed entries YS. Unobserved\n\n\fentries in Y are predicted according to X. Matrices of rank at most k are those that can\nbe factored into X = U V 0, U \u2208 Rn\u00d7k, V \u2208 Rm\u00d7k, and so seeking a low-rank matrix is\nequivalent to seeking a low-dimensional factorization.\nIf one of the matrices, say U, is \ufb01xed, and only the other matrix V 0 needs to be learned, then\n\ufb01tting each column of the target matrix Y is a separate linear prediction problem. 
Each row of U functions as a "feature vector", and each column of V′ is a linear predictor, predicting the entries in the corresponding column of Y based on the "features" in U.

In collaborative prediction, both U and V are unknown and need to be estimated. This can be thought of as learning feature vectors (rows in U) for each of the rows of Y, enabling good linear prediction across all of the prediction problems (columns of Y) concurrently, each with a different linear predictor (columns of V′). The features are learned without any external information or constraints, which is impossible for a single prediction task (we would use the labels as features). The underlying assumption that enables us to do this in a collaborative filtering situation is that the prediction tasks (columns of Y) are related, in that the same features can be used for all of them, though possibly in different ways.

Low-rank collaborative prediction corresponds to regularizing by limiting the dimensionality of the feature space\u2014each column is a linear prediction problem in a low-dimensional space. Instead, we suggest allowing an unbounded dimensionality for the feature space, and regularizing by requiring a low-norm factorization, while predicting with large-margin.

Consider adding to the loss a penalty term which is the sum of squares of entries in U and V, i.e. ||U||_Fro^2 + ||V||_Fro^2 (||·||_Fro denotes the Frobenius norm). Each "conditional" problem (fitting U given V and vice versa) again decomposes into a collection of standard, this time regularized, linear prediction problems. 
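As a concrete illustration of this decomposition (a sketch, not code from the paper; squared loss stands in for the losses discussed below, and all names and values are hypothetical), the following numpy snippet fixes V and fits each row of U by solving an independent regularized least-squares problem over that row's observed entries:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 8, 6, 3
U_true = rng.normal(size=(n, k))
V = rng.normal(size=(m, k))          # fixed "feature" matrix
Y = U_true @ V.T                     # target matrix
observed = rng.random((n, m)) < 0.7  # mask of observed entries
lam = 0.1                            # regularization strength (hypothetical value)

# With V fixed, each row i of U solves an independent ridge problem
# over the observed entries of row i of Y.
U = np.zeros((n, k))
for i in range(n):
    Vi = V[observed[i]]              # features of the observed columns
    yi = Y[i, observed[i]]
    U[i] = np.linalg.solve(Vi.T @ Vi + lam * np.eye(k), Vi.T @ yi)
```

Each row's subproblem touches only that row's observations, which is exactly why the conditional problems decompose.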
With an appropriate loss function, or constraints on the observed entries, these correspond to large-margin linear discrimination problems. For example, if we learn a binary observation matrix by minimizing a hinge loss plus such a regularization term, each conditional problem decomposes into a collection of SVMs.

3 Maximum-Margin Matrix Factorizations

Matrices with a factorization X = UV′, where U and V have low Frobenius norm (recall that the dimensionality of U and V is no longer bounded!), can be characterized in several equivalent ways, and are known as low trace norm matrices:

Definition 1. The trace norm1 ||X||_Σ is the sum of the singular values of X.

Lemma 1. ||X||_Σ = min_{X=UV′} ||U||_Fro ||V||_Fro = min_{X=UV′} (1/2)(||U||_Fro^2 + ||V||_Fro^2)

The characterization in terms of the singular value decomposition allows us to characterize low trace norm matrices as the convex hull of bounded-norm rank-one matrices:

Lemma 2. {X | ||X||_Σ ≤ B} = conv{ uv′ | u ∈ R^n, v ∈ R^m, |u|_2 = |v|_2 = √B }

In particular, the trace norm is a convex function, and the set of bounded trace norm matrices is a convex set. For convex loss functions, seeking a bounded trace norm matrix minimizing the loss versus some target matrix is a convex optimization problem.

This contrasts sharply with minimizing loss over low-rank matrices\u2014a non-convex problem. Although the sum-squared error versus a fully observed target matrix can be minimized efficiently using the SVD (despite the optimization problem being non-convex!), minimizing other loss functions, or even minimizing a squared loss versus a partially observed matrix, is a difficult optimization problem with multiple local minima [3].

1Also known as the nuclear norm and the Ky-Fan n-norm.

In fact, the trace norm has been suggested as a convex surrogate to the rank for various rank-minimization problems [7]. 
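Lemma 1 can be checked numerically: the factorization built from the SVD, with the singular values split evenly between the two factors, attains both characterizations of the trace norm (an illustrative sketch, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 4))
sv = np.linalg.svd(X, compute_uv=False)
trace_norm = float(sv.sum())  # Definition 1: sum of the singular values

# Factor X = U V' through its SVD X = Us diag(s) Vs', with
# U = Us sqrt(diag(s)) and V = Vs sqrt(diag(s)).
Us, s, Vst = np.linalg.svd(X, full_matrices=False)
U = Us * np.sqrt(s)
V = Vst.T * np.sqrt(s)
assert np.allclose(U @ V.T, X)

prod = np.linalg.norm(U) * np.linalg.norm(V)                 # ||U||_Fro ||V||_Fro
half_sum = 0.5 * (np.linalg.norm(U) ** 2 + np.linalg.norm(V) ** 2)
# At this factorization both quantities equal the trace norm,
# so the minima in Lemma 1 are attained.
```

For other factorizations of the same X the two quantities can only be larger, consistent with the minima in the lemma.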
Here, we justify the trace norm directly, both as a natural extension of large-margin methods and by providing generalization error bounds.

To simplify presentation, we focus on binary labels, Y ∈ {±1}^{n×m}. We consider hard-margin matrix factorization, where we seek a minimum trace norm matrix X that matches the observed labels with a margin of one: Y_ia X_ia ≥ 1 for all ia ∈ S. We also consider soft-margin learning, where we minimize a trade-off between the trace norm of X and its hinge-loss relative to Y_S:

    minimize_X  ||X||_Σ + c Σ_{ia∈S} max(0, 1 − Y_ia X_ia)        (1)

As in maximum-margin linear discrimination, there is an inverse dependence between the norm and the margin. Fixing the margin and minimizing the trace norm is equivalent to fixing the trace norm and maximizing the margin. As in large-margin discrimination with certain infinite dimensional (e.g. radial) kernels, the data is always separable with sufficiently high trace norm (a trace norm of √(n|S|) is sufficient to attain a margin of one).

The max-norm variant   Instead of constraining the norms of rows in U and V on average, we can constrain all rows of U and V to have small L2 norm, replacing the trace norm with ||X||_max = min_{X=UV′} (max_i |U_i|)(max_a |V_a|), where U_i, V_a are rows of U, V. Low-max-norm discrimination has a clean geometric interpretation. First, note that predicting the target matrix with the signs of a rank-k matrix corresponds to mapping the "items" (columns) to points in R^k, and the "users" (rows) to homogeneous hyperplanes, such that each user's hyperplane separates his positive items from his negative items. 
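Both the soft-margin objective (1) and the max-norm of a given factorization are straightforward to evaluate; the following numpy sketch does so on made-up data (all values hypothetical, only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 6, 5
Y = np.sign(rng.normal(size=(n, m)))   # binary +/-1 labels
S = rng.random((n, m)) < 0.5           # observed subset
X = rng.normal(size=(n, m))            # candidate prediction matrix
c = 0.5                                # slack trade-off (hypothetical value)

# Soft-margin objective (1): trace norm plus hinge loss on the observed entries.
trace_norm = float(np.linalg.svd(X, compute_uv=False).sum())
hinge = float(np.maximum(0.0, 1.0 - Y[S] * X[S]).sum())
objective = trace_norm + c * hinge

# Max-norm of a *given* factorization X = U V' (an upper bound on ||X||_max,
# since the definition minimizes over all factorizations):
U, V = rng.normal(size=(n, 3)), rng.normal(size=(m, 3))
max_norm_UV = float(np.linalg.norm(U, axis=1).max() * np.linalg.norm(V, axis=1).max())
```

The hinge term only ever touches entries in S, mirroring the sum in (1).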
Hard-margin low-max-norm prediction corresponds to mapping the users and items to points and hyperplanes in a high-dimensional unit sphere such that each user's hyperplane separates his positive and negative items with a large-margin (the margin being the inverse of the max-norm).

4 Learning Maximum-Margin Matrix Factorizations

In this section we investigate the optimization problem of learning a MMMF, i.e. a low norm factorization UV′, given a binary target matrix. Bounding the trace norm of UV′ by (1/2)(||U||_Fro^2 + ||V||_Fro^2), we can characterize the trace norm in terms of the trace of a positive semi-definite matrix:

Lemma 3 ([7, Lemma 1]). For any X ∈ R^{n×m} and t ∈ R: ||X||_Σ ≤ t iff there exist A ∈ R^{n×n} and B ∈ R^{m×m} such that [ A X ; X′ B ] ⪰ 0 and tr A + tr B ≤ 2t.

Proof. Note that for any matrix W, ||W||_Fro^2 = tr WW′. If [ A X ; X′ B ] ⪰ 0, we can write it as a product [ U ; V ][ U′ V′ ]. We have X = UV′ and (1/2)(||U||_Fro^2 + ||V||_Fro^2) = (1/2)(tr A + tr B) ≤ t, establishing ||X||_Σ ≤ t. Conversely, if ||X||_Σ ≤ t we can write it as X = UV′ with tr UU′ + tr VV′ ≤ 2t and consider the p.s.d. matrix [ UU′ X ; X′ VV′ ].

Lemma 3 can be used in order to formulate minimizing the trace norm as a semi-definite optimization problem (SDP). 
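The construction in Lemma 3's proof can be verified numerically: taking A = UU′ and B = VV′ from a trace-norm-attaining factorization, the block matrix is positive semi-definite and tr A + tr B equals 2||X||_Σ (an illustrative check, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 3))

# Trace-norm-attaining factorization from the SVD.
Us, s, Vst = np.linalg.svd(X, full_matrices=False)
U = Us * np.sqrt(s)
V = Vst.T * np.sqrt(s)

A, B = U @ U.T, V @ V.T
block = np.block([[A, X], [X.T, B]])   # the matrix [ A X ; X' B ] of Lemma 3

# p.s.d.: all eigenvalues nonnegative (up to numerical noise),
# since block = [U; V][U' V'] is a Gram matrix.
eigs = np.linalg.eigvalsh(block)

# tr A + tr B = ||U||_Fro^2 + ||V||_Fro^2 = 2 ||X||_Sigma at this factorization.
tsum = float(np.trace(A) + np.trace(B))
trace_norm = float(s.sum())
```

This exhibits the "if" direction of the lemma with the bound tr A + tr B ≤ 2t met with equality at t = ||X||_Σ.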
Soft-margin matrix factorization (1) can be written as2:

    min  (1/2)(tr A + tr B) + c Σ_{ia∈S} ξ_ia
    s.t. [ A X ; X′ B ] ⪰ 0,   Y_ia X_ia ≥ 1 − ξ_ia,   ξ_ia ≥ 0   ∀ ia ∈ S        (2)

2A ⪰ 0 denotes A is positive semi-definite.

Associating a dual variable Q_ia with each constraint on X_ia, the dual of (2) is [8, Section 5.4.2]:

    max  Σ_{ia∈S} Q_ia
    s.t. [ I  (−Q ⊗ Y) ; (−Q ⊗ Y)′  I ] ⪰ 0,   0 ≤ Q_ia ≤ c        (3)

where Q ⊗ Y denotes the sparse matrix (Q ⊗ Y)_ia = Q_ia Y_ia for ia ∈ S and zeros elsewhere. The problem is strictly feasible, and there is no duality gap. The p.s.d. constraint in the dual (3) is equivalent to bounding the spectral norm of Q ⊗ Y, and the dual can also be written as an optimization problem subject to a bound on the spectral norm, i.e. a bound on the singular values of Q ⊗ Y:

    max  Σ_{ia∈S} Q_ia
    s.t. ||Q ⊗ Y||_2 ≤ 1,   0 ≤ Q_ia ≤ c   ∀ ia ∈ S        (4)

In typical collaborative prediction problems, we observe only a small fraction of the entries in a large target matrix. Such a situation translates to a sparse dual semi-definite program, with the number of variables equal to the number of observed entries. Large-scale SDP solvers can take advantage of such sparsity.

The prediction matrix X* minimizing (1) is part of the primal optimal solution of (2), and can be extracted from it directly. 
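Feasibility of a candidate Q for the dual form (4) is cheap to check numerically: the box constraints entrywise and the spectral-norm constraint via an SVD. A sketch with hypothetical data (c is chosen small enough here that feasibility is guaranteed, since the spectral norm is bounded by the Frobenius norm):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, c = 6, 5, 0.1
Y = np.sign(rng.normal(size=(n, m)))
S = rng.random((n, m)) < 0.4                 # observed entries
Q = np.where(S, rng.uniform(0, c, size=(n, m)), 0.0)

# (Q ⊗ Y): Q_ia Y_ia on the observed entries, zeros elsewhere.
QY = Q * Y * S
spectral_norm = float(np.linalg.svd(QY, compute_uv=False).max())

# Dual feasibility for (4): box constraints and spectral-norm bound.
box_ok = bool(np.all((Q >= 0) & (Q <= c)))
feasible = box_ok and spectral_norm <= 1.0
```

The matrix QY is supported only on S, which is the sparsity the large-scale solvers mentioned above can exploit.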
Nevertheless, it is interesting to study how the optimal prediction matrix X* can be directly recovered from a dual optimal solution Q* alone. Although unnecessary when relying on interior point methods used by most SDP solvers (as these return a primal/dual optimal pair), this can enable us to use specialized optimization methods, taking advantage of the simple structure of the dual.

Recovering X* from Q*   As for linear programming, recovering a primal optimal solution directly from a dual optimal solution is not always possible for SDPs. However, at least for the hard-margin problem (no slack) this is possible, and we describe below how an optimal prediction matrix X* can be recovered from a dual optimal solution Q* by calculating a singular value decomposition and solving linear equations.

Given a dual optimal Q*, consider its singular value decomposition Q* ⊗ Y = UΛV′. Recall that all singular values of Q* ⊗ Y are bounded by one, and consider only the columns Ũ ∈ R^{n×p} of U and Ṽ ∈ R^{m×p} of V with singular value one. It is possible to show [8, Section 5.4.3], using complementary slackness, that for some matrix R ∈ R^{p×p}, X* = ŨRR′Ṽ′ is an optimal solution to the maximum margin matrix factorization problem (1). Furthermore, p(p+1)/2 is bounded above by the number of non-zero Q*_ia. When Q*_ia > 0, and assuming hard-margin constraints, i.e. no box constraints in the dual, complementary slackness dictates that X*_ia = Ũ_i RR′ Ṽ′_a = Y_ia, providing us with a linear equation on the p(p+1)/2 entries in the symmetric RR′. For hard-margin matrix factorization, we can therefore recover the entries of RR′ by solving a system of linear equations, with a number of variables bounded by the number of observed entries.

Recovering specific entries   The approach described above requires solving a large system of linear equations (with as many variables as observations). Furthermore, especially when the observations are very sparse (only a small fraction of the entries in the target matrix are observed), the dual solution is much more compact than the prediction matrix: the dual involves a single number for each observed entry. It might be desirable to avoid storing the prediction matrix X* explicitly, and calculate a desired entry X*_{i0a0}, or at least its sign, directly from the dual optimal solution Q*.

Consider adding the constraint X_{i0a0} > 0 to the primal SDP (2). If there exists an optimal solution X* to the original SDP with X*_{i0a0} > 0, then this is also an optimal solution to the modified SDP, with the same objective value. Otherwise, the optimal solution of the modified SDP is not optimal for the original SDP, and the optimal value of the modified SDP is higher (worse) than the optimal value of the original SDP.

Introducing the constraint X_{i0a0} > 0 to the primal SDP (2) corresponds to introducing a new variable Q_{i0a0} to the dual SDP (3), appearing in Q ⊗ Y (with Y_{i0a0} = 1) but not in the objective. In this modified dual, the optimal solution Q* of the original dual would always be feasible. But, if X*_{i0a0} < 0 in all primal optimal solutions, then the modified primal SDP has a higher value, and so does the dual, and Q* is no longer optimal for the new dual. By checking the optimality of Q* for the modified dual, e.g. by attempting to re-optimize it, we can recover the sign of X*_{i0a0}.

We can repeat this test once with Y_{i0a0} = 1 and once with Y_{i0a0} = −1, corresponding to the constraints X_{i0a0} > 0 and X_{i0a0} < 0. 
If Y_{i0a0} X*_{i0a0} < 0 (in all optimal solutions), then the dual solution can be improved by introducing Q_{i0a0} with a sign of Y_{i0a0}.

Predictions for new users   So far, we assumed that learning is done on the known entries in all rows. It is commonly desirable to predict entries in a new partially observed row of Y (a new user), not included in the original training set. This essentially requires solving a "conditional" problem, where V is already known, and a new row of U is learned (the predictor for the new user) based on a new partially observed row of X. Using maximum-margin matrix factorization, this is a standard SVM problem.

Max-norm MMMF as a SDP   The max-norm variant can also be written as a SDP, with the primal and dual taking the forms:

    min  t + c Σ_{ia∈S} ξ_ia
    s.t. [ A X ; X′ B ] ⪰ 0,   A_ii, B_aa ≤ t ∀ i, a,
         Y_ia X_ia ≥ 1 − ξ_ia,   ξ_ia ≥ 0   ∀ ia ∈ S        (5)

    max  Σ_{ia∈S} Q_ia
    s.t. [ Γ  (−Q ⊗ Y) ; (−Q ⊗ Y)′  Δ ] ⪰ 0,   Γ, Δ diagonal,
         tr Γ + tr Δ = 1,   0 ≤ Q_ia ≤ c   ∀ ia ∈ S        (6)

5 Generalization Error Bounds for Low Norm Matrix Factorizations

Similarly to standard feature-based prediction approaches, collaborative prediction methods can also be analyzed in terms of their generalization ability: How confidently can we predict entries of Y based on our error on the observed entries Y_S? We present here generalization error bounds that hold for any target matrix Y, and for a random subset of observations S, and bound the average error across all entries in terms of the observed margin error3. The central assumption, paralleling the i.i.d. source assumption for standard feature-based prediction, is that the observed subset S is picked uniformly at random.

Theorem 4. 
For all target matrices Y ∈ {±1}^{n×m} and sample sizes |S| > n log n, and for a uniformly selected sample S of |S| entries in Y, with probability at least 1 − δ over the sample selection, the following holds for all matrices X ∈ R^{n×m} and all γ > 0:

    (1/nm) |{ia | X_ia Y_ia ≤ 0}|  <  (1/|S|) |{ia ∈ S | X_ia Y_ia ≤ γ}|
        + K (||X||_Σ / (γ √(nm))) (ln m)^{1/4} √((n + m) ln n / |S|)
        + √( ln(1 + |log(||X||_Σ/γ)|) / |S| ) + √( ln(4/δ) / (2|S|) )        (7)

and

    (1/nm) |{ia | X_ia Y_ia ≤ 0}|  <  (1/|S|) |{ia ∈ S | X_ia Y_ia ≤ γ}|
        + 12 (||X||_max / γ) √((n + m) / |S|)
        + √( ln(1 + |log(||X||_Σ/γ)|) / |S| ) + √( ln(4/δ) / (2|S|) )        (8)

where K is a universal constant that does not depend on Y, n, m, γ or any other quantity.

3The bounds presented here are special cases of bounds for general loss functions that we present and prove elsewhere [8, Section 6.2]. To prove the bounds we bound the Rademacher complexity of bounded trace norm and bounded max-norm matrices (i.e. balls w.r.t. these norms). The unit trace norm ball is the convex hull of outer products of unit norm vectors. It is therefore enough to bound the Rademacher complexity of such outer products, which boils down to analyzing the spectral norm of random matrices. As a consequence of Grothendieck's inequality, the unit max-norm ball is within a factor of two of the convex hull of outer products of sign vectors. The Rademacher complexity of such outer products can be bounded by considering their cardinality.

To understand the scaling of these bounds, consider n × m matrices X = UV′ where the norms of rows of U and V are bounded by r, i.e. matrices with ||X||_max ≤ r2. 
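The row-norm bound just stated can be checked numerically: after rescaling every row of U and V to L2 norm r, the factorization certifies ||X||_max ≤ r², and the trace norm is also bounded, since ||X||_Σ ≤ ||U||_Fro ||V||_Fro = r²√(nm). A sketch with hypothetical dimensions:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, k, r = 7, 5, 3, 2.0
U = rng.normal(size=(n, k))
V = rng.normal(size=(m, k))
# Rescale so every row of U and V has L2 norm exactly r.
U = r * U / np.linalg.norm(U, axis=1, keepdims=True)
V = r * V / np.linalg.norm(V, axis=1, keepdims=True)
X = U @ V.T

# This factorization certifies ||X||_max <= r^2 ...
max_norm_bound = float(np.linalg.norm(U, axis=1).max() * np.linalg.norm(V, axis=1).max())
# ... and ||X||_Sigma <= ||U||_Fro ||V||_Fro = (r sqrt(n))(r sqrt(m)) = r^2 sqrt(nm).
trace_norm = float(np.linalg.svd(X, compute_uv=False).sum())
```

Here max_norm_bound is exactly r² by construction, and the computed trace norm respects the r²√(nm) ceiling.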
The trace norm of such matrices is bounded by r2√(nm), and so the two bounds agree up to log-factors\u2014the cost of allowing the norm to be low on-average but not uniformly. Recall that the conditional problem, where V is fixed and only U is learned, is a collection of low-norm (large-margin) linear prediction problems. When the norms of rows in U and V are bounded by r, a similar generalization error bound on the conditional problem would include the term (r2/γ)√(n/|S|), matching the bounds of Theorem 4 up to log-factors\u2014learning both U and V does not introduce significantly more error than learning just one of them.

Also of interest is the comparison with bounds for low-rank matrices, for which ||X||_Σ ≤ √(rank X) ||X||_Fro. In particular, for n × m rank-k X with entries bounded by B, ||X||_Σ ≤ √(knm) B, and the second term in the right-hand side of (7) becomes:

    K (B/γ) (ln m)^{1/4} √( k(n + m) ln n / |S| )        (9)

Although this is the best (up to log factors) that can be expected from scale-sensitive bounds4, taking a combinatorial approach, the dependence on the magnitude of the entries in X (and the margin) can be avoided [9].

6 Implementation and Experiments

Ratings   In many collaborative prediction tasks, the labels are not binary, but rather are discrete "ratings" in several ordered levels (e.g. one star through five stars). Separating R levels by thresholds −∞ = θ_0 < θ_1 < ··· < θ_R = ∞, and generalizing hard-margin constraints for binary labels, one can require θ_{Y_ia} + 1 ≤ X_ia ≤ θ_{Y_ia + 1} − 1. A soft-margin version of these constraints, with slack variables for the two constraints on each observed rating, corresponds to a generalization of the hinge loss which is a convex bound on the zero/one level-agreement error (ZOE) [10]. 
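The immediate-threshold construction just described can be sketched in numpy: one hinge term for each side of the interval [θ_y + 1, θ_{y+1} − 1], together with the ZOE it bounds (threshold values and data are hypothetical; rating levels are 0-indexed here for convenience):

```python
import numpy as np

rng = np.random.default_rng(6)
R = 5
# Thresholds -inf = theta_0 < theta_1 < ... < theta_R = +inf (hypothetical values).
theta = np.array([-np.inf, -2.0, -0.5, 0.5, 2.0, np.inf])
y = rng.integers(0, R, size=20)   # observed rating levels, indexed 0..R-1 here
x = rng.normal(size=20)           # real-valued predictions X_ia

# Immediate-threshold hinge: one slack per violated side of
#   theta_y + 1 <= x <= theta_{y+1} - 1.
lo = np.maximum(0.0, (theta[y] + 1.0) - x)       # lower constraint violation
hi = np.maximum(0.0, x - (theta[y + 1] - 1.0))   # upper constraint violation
immediate_loss = float((lo + hi).sum())

# Zero/one level-agreement error (ZOE): fraction where the level predicted
# by binning x among the thresholds differs from the observed level.
pred = np.searchsorted(theta[1:-1], x)           # bin index in 0..R-1
zoe = float(np.mean(pred != y))
```

Whenever the predicted bin disagrees with y, at least one of the two hinges is at least one, so the summed loss upper-bounds the number of level disagreements.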
To obtain a loss which is a convex bound on the mean-absolute-error (MAE\u2014the difference, in levels, between the predicted level and the true level), we introduce R − 1 slack variables for each observed rating\u2014one for each of the R − 1 constraints X_ia ≥ θ_r for r < Y_ia and X_ia ≤ θ_r for r ≥ Y_ia. Both of these soft-margin problems ("immediate-threshold" and "all-threshold") can be formulated as SDPs similar to (2)-(3). Furthermore, it is straightforward to learn also the thresholds (they appear as variables in the primal, and correspond to constraints in the dual)\u2014either a single set of thresholds for the entire matrix, or a separate threshold vector for each row of the matrix (each "user"). Doing the latter allows users to "use ratings differently" and alleviates the need to normalize the data.

4For general loss functions, bounds as in Theorem 4 depend only on the Lipschitz constant of the loss, and (9) is the best (up to log factors) that can be achieved without explicitly bounding the magnitude of the loss function.

Experiments   We conducted preliminary experiments on a subset of the 100K MovieLens Dataset5, consisting of the 100 users and 100 movies with the most ratings. We used CSDP [11] to solve the resulting SDPs6. The ratings are on a discrete scale of one through five, and we experimented with both generalizations of the hinge loss above, allowing per-user thresholds. We compared against WLRA and K-Medians (described in [12]) as "Baseline" learners. We randomly split the data into four sets. For each of the four possible test sets, we used the remaining sets to calculate a 3-fold cross-validation (CV) error for each method (WLRA, K-medians, trace norm and max-norm MMMF with immediate-threshold and all-threshold hinge loss) using a range of parameters (rank for WLRA, number of centers for K-medians, slack cost for MMMF). 
For each of the four splits, we selected the two MMMF learners with lowest CV ZOE and MAE and the two Baseline learners with lowest CV ZOE and MAE, and measured their error on the held-out test data. Table 1 lists these CV and test errors, and the average test error across all four test sets. On average and on three of the four test sets, MMMF achieves lower MAE than the Baseline learners; on all four of the test sets, MMMF achieves lower ZOE than the Baseline learners.

Test                                ZOE                                       MAE
Set   Method                    CV      Test   Method                    CV      Test
1     WLRA rank 2               0.547   0.575  K-Medians K=2             0.678   0.691
2     WLRA rank 2               0.550   0.562  K-Medians K=2             0.686   0.681
3     WLRA rank 1               0.562   0.543  K-Medians K=2             0.700   0.681
4     WLRA rank 2               0.557   0.553  K-Medians K=2             0.685   0.696
Avg.                                    0.558                                    0.687
1     max-norm C=0.0012         0.543   0.562  max-norm C=0.0012         0.669   0.677
2     trace norm C=0.24         0.550   0.552  max-norm C=0.0011         0.675   0.683
3     max-norm C=0.0012         0.551   0.527  max-norm C=0.0012         0.668   0.646
4     max-norm C=0.0012         0.544   0.550  max-norm C=0.0012         0.667   0.686
Avg.                                    0.548                                    0.673

Table 1: Baseline (top) and MMMF (bottom) methods and parameters that achieved the lowest cross validation error (on the training data) for each train/test split, and the error for this predictor on the test data. All listed MMMF learners use the "all-threshold" objective.

7 Discussion

Learning maximum-margin matrix factorizations requires solving a sparse semi-definite program. We experimented with generic SDP solvers, and were able to learn with up to tens of thousands of labels. 
We propose that just as generic QP solvers do not perform\nwell on SVM problems, special purpose techniques, taking advantage of the very simple\nstructure of the dual (3), are necessary in order to solve large-scale MMMF problems.\n\nSDPs were recently suggested for a related, but different, problem: learning the features\n\n5http://www.cs.umn.edu/Research/GroupLens/\n6Solving with immediate-threshold loss took about 30 minutes on a 3.06GHz Intel Xeon.\nSolving with all-threshold loss took eight to nine hours. The MATLAB code is available at\nwww.ai.mit.edu/\u02dcnati/mmmf\n\n\f(or equivalently, kernel) that are best for a single prediction task [13]. This task is hopeless\nif the features are completely unconstrained, as they are in our formulation. Lanckriet et al\nsuggest constraining the allowed features, e.g. to a linear combination of a few \u201cbase fea-\nture spaces\u201d (or base kernels), which represent the external information necessary to solve\na single prediction problem. It is possible to combine the two approaches, seeking con-\nstrained features for multiple related prediction problems, as a way of combining external\ninformation (e.g. details of users and of items) and collaborative information.\n\nAn alternate method for introducing external information into our formulation is by adding\nto U and/or V additional \ufb01xed (non-learned) columns representing the external features.\nThis method degenerates to standard SVM learning when Y is a vector rather than a matrix.\n\nAn important limitation of the approach we have described, is that observed entries are\nassumed to be uniformly sampled. This is made explicit in the generalization error bounds.\nSuch an assumption is typically unrealistic, as, e.g., users tend to rate items they like. At\nan extreme, it is often desirable to make predictions based only on positive samples. 
Even in such situations, it is still possible to learn a low-norm factorization, by using appropriate loss functions, e.g. derived from probabilistic models incorporating the observation process. However, obtaining generalization error bounds in this case is much harder. Simply allowing an arbitrary sampling distribution and calculating the expected loss based on this distribution (which is not possible with the trace norm, but is possible with the max-norm [8]) is not satisfying, as this would guarantee low error on items the user is likely to want anyway, but not on items we predict he would like.

Acknowledgments   We would like to thank Sam Roweis for pointing out [7].

References

[1] T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning Journal, 42(1):177-196, 2001.

[2] M. Collins, S. Dasgupta, and R. Schapire. A generalization of principal component analysis to the exponential family. In Advances in Neural Information Processing Systems 14, 2002.

[3] Nathan Srebro and Tommi Jaakkola. Weighted low rank approximation. In 20th International Conference on Machine Learning, 2003.

[4] D.D. Lee and H.S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788-791, 1999.

[5] T. Hofmann. Latent semantic models for collaborative filtering. ACM Trans. Inf. Syst., 22(1):89-115, 2004.

[6] Benjamin Marlin. Modeling user rating profiles for collaborative filtering. In Advances in Neural Information Processing Systems, volume 16, 2004.

[7] Maryam Fazel, Haitham Hindi, and Stephen P. Boyd. A rank minimization heuristic with application to minimum order system approximation. In Proceedings American Control Conference, volume 6, 2001.

[8] Nathan Srebro. Learning with Matrix Factorization. PhD thesis, Massachusetts Institute of Technology, 2004.

[9] N. Srebro, N. Alon, and T. Jaakkola. 
Generalization error bounds for collaborative prediction with low-rank matrices. In Advances in Neural Information Processing Systems 17, 2005.

[10] Amnon Shashua and Anat Levin. Ranking with large margin principle: Two approaches. In Advances in Neural Information Processing Systems, volume 15, 2003.

[11] B. Borchers. CSDP, a C library for semidefinite programming. Optimization Methods and Software, 11(1):613-623, 1999.

[12] B. Marlin. Collaborative filtering: A machine learning perspective. Master's thesis, University of Toronto, 2004.

[13] G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27-72, 2004.
", "award": [], "sourceid": 2655, "authors": [{"given_name": "Nathan", "family_name": "Srebro", "institution": null}, {"given_name": "Jason", "family_name": "Rennie", "institution": null}, {"given_name": "Tommi", "family_name": "Jaakkola", "institution": null}]}