Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices

Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)


Authors

Nathan Srebro, Noga Alon, Tommi Jaakkola

Abstract

We prove generalization error bounds for predicting entries in a partially observed matrix by fitting the observed entries with a low-rank matrix. In justifying the analysis approach we take to obtain the bounds, we present an example of a class of functions of finite pseudodimension such that the sums of functions from this class have unbounded pseudodimension.

1 Introduction

"Collaborative filtering" refers to the general task of providing users with information on what items they might like, or dislike, based on their preferences so far and how they relate to the preferences of other users. This approach contrasts with a more traditional feature- based approach where predictions are made based on features of the items.

For feature-based approaches, we are accustomed to studying prediction methods in terms of probabilistic post-hoc generalization error bounds. Such results provide us with a (probabilistic) bound on the performance of our predictor on future examples, in terms of its performance on the training data. These bounds hold without any assumptions on the true "model", that is, the true dependence of the labels on the features, other than the central assumption that the training examples are drawn i.i.d. from the distribution of interest.

In this paper we suggest studying the generalization ability of collaborative prediction methods. By "collaborative prediction" we indicate that the objective is to be able to predict user preferences for items, that is, entries in some unknown target matrix $Y$ of user-item "ratings", based on observing a subset $Y_S$ of the entries in this matrix.1 We present

1In other collaborative filtering tasks, the objective is to provide each user with a few items that overlap his top-rated items, while it is not important to correctly predict the user's ratings for other items. Note that it is possible to derive generalization error bounds for this objective based on bounds for the "prediction" objective.



arbitrary source distribution      <->  target matrix $Y$
random training set                <->  random set $S$ of observed entries
hypothesis                         <->  predicted matrix $X$
training error                     <->  observed discrepancy $D_S(X;Y)$
generalization error               <->  true discrepancy $D(X;Y)$

Figure 1: Correspondence with post-hoc bounds on the generalization error for standard feature-based prediction tasks

bounds on the true average overall error $D(X;Y) = \frac{1}{nm}\sum_{i=1}^{n}\sum_{a=1}^{m} \mathrm{loss}(X_{ia}; Y_{ia})$ of the predictions $X$ in terms of the average error over the observed entries $D_S(X;Y) = \frac{1}{|S|}\sum_{(i,a)\in S} \mathrm{loss}(X_{ia}; Y_{ia})$, without making any assumptions on the true nature of the preferences $Y$. What we do assume is that the subset $S$ of entries that we observe is chosen uniformly at random. This strong assumption parallels the i.i.d. source assumption for feature-based prediction.
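As a concrete reading of these definitions, the following sketch (ours; the function names and the use of NumPy are our own choices) computes the observed discrepancy $D_S(X;Y)$ and the true discrepancy $D(X;Y)$ for a given prediction matrix and loss:

    import numpy as np

    def observed_discrepancy(X, Y, S, loss):
        # Average loss over the observed index pairs S (a list of (i, a) tuples).
        return np.mean([loss(X[i, a], Y[i, a]) for (i, a) in S])

    def true_discrepancy(X, Y, loss):
        # Average loss over all n*m entries.
        n, m = Y.shape
        return np.mean([loss(X[i, a], Y[i, a]) for i in range(n) for a in range(m)])

    # e.g. the zero-one sign-agreement loss of Section 2:
    zero_one = lambda x, y: float(y * x <= 0)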

In particular, we present generalization error bounds on prediction using low-rank models.

Collaborative prediction using low-rank models is fairly straightforward. A low-rank matrix X is sought that minimizes the average observed error $D_S(X;Y)$. Unobserved entries in $Y$ are then predicted according to $X$. The premise behind such a model is that there are only a small number of factors influencing the preferences, and that a user's preference vector is determined by how each factor applies to that user. Different methods differ in how they relate real-valued entries in $X$ to preferences in $Y$, and in the associated measure of discrepancy. For example, entries in $X$ can be seen as parameters for a probabilistic model of the entries in $Y$, either mean parameters [1] or natural parameters [2], and a maximum likelihood criterion used. Or, other loss functions, such as squared error [3, 2], or zero-one loss versus the signs of entries in $X$, can be minimized.
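As an illustration of this procedure, here is a minimal sketch (ours, not the method of any of the cited papers) that fits a rank-k factorization to the observed entries by gradient descent on the observed squared error; the learning rate, iteration count, and initialization scale are arbitrary choices:

    import numpy as np

    def fit_low_rank(Y, S, k, lr=0.01, epochs=500, seed=0):
        # Fit X = U @ V to the entries of Y observed in S by minimizing D_S(X; Y)
        # under squared-error loss; unobserved entries are then predicted from X.
        rng = np.random.default_rng(seed)
        n, m = Y.shape
        U = 0.1 * rng.standard_normal((n, k))
        V = 0.1 * rng.standard_normal((k, m))
        rows = np.array([i for (i, a) in S])
        cols = np.array([a for (i, a) in S])
        for _ in range(epochs):
            X = U @ V
            G = np.zeros((n, m))
            G[rows, cols] = 2.0 * (X[rows, cols] - Y[rows, cols]) / len(S)  # dD_S/dX
            U, V = U - lr * (G @ V.T), V - lr * (U.T @ G)                   # simultaneous step
        return U @ V

In practice one would also choose k, the step size, and the stopping point, e.g. on held-out observed entries.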

Prior Work Previous results bounding the error of collaborative prediction using a low-rank matrix all assume the true target matrix Y is well-approximated by a low-rank matrix. This corresponds to a large eigengap between the top few singular values of Y and the remaining singular values. Azar et al [3] give asymptotic results on the convergence of the predictions to the true preferences, assuming the preferences have an eigengap. Drineas et al [4] analyze the sample complexity needed to predict a matrix with an eigengap, and suggest strategies for actively querying entries in the target matrix. To our knowledge, this is the first analysis of the generalization error of low-rank methods that does not make any assumptions on the true target matrix.

Generalization error bounds (and related online learning bounds) were previously discussed for collaborative prediction applications, but only when prediction was done for each user separately, using a feature-based method, with the other users' preferences as features [5, 6]. Although these address a collaborative prediction application, the learning setting is a standard feature-based setting. These methods are also limited, in that learning must be performed separately for each user.

Shawe-Taylor et al [7] discuss assumption-free post-hoc bounds on the residual errors of low-rank approximation. These results apply to a different setting, where a subset of the rows are fully observed, and bound a different quantity--the distance between rows and the learned subspace, rather than the distance to predicted entries.

Organization In Section 2 we present a generalization error bound for zero-one loss, based on a combinatorial result which we prove in Section 3. In Section 4 we generalize the bound to arbitrary loss functions. Finally, in Section 5 we justify the combinatorial approach taken, by considering an alternate approach (viewing rank-k matrices as combinations of k rank-1 matrices) and showing why it does not work.

2 Generalization Error Bound for Zero-One Error

We begin by considering binary labels $Y_{ia}$ and a zero-one sign agreement loss:

$$\mathrm{loss}(X_{ia}; Y_{ia}) = \mathbf{1}_{Y_{ia} X_{ia} \le 0} \qquad (1)$$

Theorem 1. For any matrix $Y \in \{\pm 1\}^{n\times m}$, $n, m > 2$, $\delta > 0$ and integer $k$, with probability at least $1 - \delta$ over choosing a subset $S$ of entries in $Y$ uniformly among all subsets of $|S|$ entries, the discrepancy with respect to the zero-one sign agreement loss satisfies2:

$$\forall_{X,\ \operatorname{rank} X \le k}\qquad D(X;Y) \;<\; D_S(X;Y) + \sqrt{\frac{k(n+m)\log\frac{16em}{k} - \log\delta}{2|S|}}$$
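To get a feel for the bound, the excess-error term can be evaluated numerically; the following helper (ours) uses base-two logarithms, per footnote 2:

    import numpy as np

    def theorem1_gap(n, m, k, s, delta=0.05):
        # Excess-error term of Theorem 1 for |S| = s observed entries.
        return np.sqrt((k * (n + m) * np.log2(16 * np.e * m / k) - np.log2(delta)) / (2 * s))

    # For a 1000 x 1000 matrix and rank 5, theorem1_gap(1000, 1000, 5, 500_000) is about 0.36.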

To prove the theorem, we employ standard arguments about the generalization error for hypothesis classes of bounded cardinality.

First fix $Y$ as well as $X \in \mathbb{R}^{n\times m}$. When an index pair $(i,a)$ is chosen uniformly at random, $\mathrm{loss}(X_{ia}; Y_{ia})$ is a Bernoulli random variable with probability $D(X;Y)$ of being one. If the entries of $S$ are chosen independently and uniformly, $|S|\,D_S(X;Y)$ is binomially distributed with mean $|S|\,D(X;Y)$, and using Chernoff's inequality:

$$\Pr_S\left( D(X;Y) \ge D_S(X;Y) + \epsilon \right) \;\le\; e^{-2|S|\epsilon^2} \qquad (2)$$

The distribution of $S$ in Theorem 1 is slightly different, as $S$ is chosen without repetitions. The mean of $D_S(X;Y)$ is the same, but it is more concentrated, and (2) still holds.
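The claim that sampling without repetitions only helps can be checked empirically; the following small simulation (ours, on synthetic zero-one losses) compares both sampling schemes against the right-hand side of (2):

    import numpy as np

    rng = np.random.default_rng(0)
    losses = rng.integers(0, 2, size=10_000).astype(float)  # loss(X_ia; Y_ia) over all entries
    D, s, eps, trials = losses.mean(), 200, 0.1, 10_000

    def tail(replace):
        # Empirical probability of the event in (2) under the given sampling scheme.
        exceed = 0
        for _ in range(trials):
            sample = rng.choice(losses, size=s, replace=replace)
            exceed += (D >= sample.mean() + eps)
        return exceed / trials

    print("with replacement:   ", tail(True))
    print("without replacement:", tail(False))
    print("Chernoff bound (2): ", np.exp(-2 * s * eps**2))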

Now consider all rank-$k$ matrices. Noting that $\mathrm{loss}(X_{ia}; Y_{ia})$ depends only on the sign of $X_{ia}$, it is enough to consider the equivalence classes of matrices with the same sign patterns. Let $f(n,m,k)$ be the number of such equivalence classes, i.e. the number of possible sign configurations of $n\times m$ matrices of rank at most $k$:

$$F(n,m,k) = \left\{\operatorname{sign} X \in \{-,0,+\}^{n\times m} \,\middle|\, X \in \mathbb{R}^{n\times m},\ \operatorname{rank} X \le k\right\}, \qquad f(n,m,k) = |F(n,m,k)|$$

where $\operatorname{sign} X$ denotes the element-wise sign matrix: $(\operatorname{sign} X)_{ia} = 1$ if $X_{ia} > 0$, $0$ if $X_{ia} = 0$, and $-1$ if $X_{ia} < 0$.

For all matrices in an equivalence class, the random variable $D_S(X;Y)$ is the same, and taking a union bound of the events $D(X;Y) \ge D_S(X;Y) + \epsilon$ for each of these $f(n,m,k)$ random variables we have:

$$\Pr_S\left( \exists_{X,\ \operatorname{rank} X \le k}\ D(X;Y) \ge D_S(X;Y) + \sqrt{\frac{\log f(n,m,k) - \log\delta}{2|S|}} \right) \;\le\; \delta \qquad (3)$$

by using (2) and setting $\epsilon = \sqrt{\frac{\log f(n,m,k) - \log\delta}{2|S|}}$. The proof of Theorem 1 rests on bounding $f(n,m,k)$, which we will do in the next section.
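For completeness, the step from (2) to (3) is the standard union-bound calculation (spelled out here): the union bound gives
$$\Pr_S\left( \exists_{X,\ \operatorname{rank} X \le k}\ D(X;Y) \ge D_S(X;Y) + \epsilon \right) \;\le\; f(n,m,k)\, e^{-2|S|\epsilon^2},$$
which equals $\delta$ when $\epsilon = \sqrt{(\ln f(n,m,k) - \ln\delta)/(2|S|)}$; replacing natural logarithms with base-two logarithms (footnote 2) only increases $\epsilon$, so the bound still holds.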

Note that since the equivalence classes we defined do not depend on the sample set, no symmetrization argument is necessary.

 2All logarithms are base two

3 Sign Configurations of a Low-Rank Matrix

In this section, we bound the number $f(n,m,k)$ of sign configurations of $n\times m$ rank-$k$ matrices over the reals. Such a bound was previously considered in the context of unbounded error communication complexity. Alon, Frankl and Rödl [8] showed that $f(n,m,k) \le \min_h \left(8\lceil nm/h\rceil\right)^{(n+m)k+h+m}$, and used counting arguments to establish that some (in fact, most) binary matrices can only be realized by high-rank matrices, and therefore correspond to functions with high unbounded error communication complexity.

Here, we follow a general course outlined by Alon [9] to obtain a simpler, and slightly tighter, bound based on the following result due to Warren:

Let $P_1, \ldots, P_r$ be real polynomials in $q$ variables, and let $C$ be the complement of the variety defined by $\prod_i P_i$, i.e. the set of points in which all $r$ polynomials are non-zero: $C = \{x \in \mathbb{R}^q \mid \forall_i\ P_i(x) \neq 0\}$.

Theorem 2 (Warren [10]). If all $r$ polynomials are of degree at most $d$, then the number of connected components of $C$ is at most:
$$c(C) \;\le\; 2(2d)^q \sum_{i=0}^{q} 2^i \binom{r}{i} \;\le\; \left(\frac{4edr}{q}\right)^{q}$$

where the second inequality holds when r > q > 2.

The signs of the polynomials $P_1, \ldots, P_r$ are fixed inside each connected component of $C$, and so $c(C)$ bounds the number of sign configurations of $P_1, \ldots, P_r$ that do not contain zeros. To bound the overall number of sign configurations, the polynomials are modified slightly (see Appendix), yielding:

Corollary 3 ([9, Proposition 5.5]). The number of $-/0/+$ sign configurations of $r$ polynomials, each of degree at most $d$, over $q$ variables, is at most $(8edr/q)^q$ (for $r > q > 2$).

In order to apply these bounds to low-rank matrices, recall that any matrix $X$ of rank at most $k$ can be written as a product $X = UV$ where $U \in \mathbb{R}^{n\times k}$ and $V \in \mathbb{R}^{k\times m}$. Consider the $k(n+m)$ entries of $U, V$ as variables, and the $nm$ entries of $X$ as polynomials of degree two over these variables:

$$X_{ia} = \sum_{\ell=1}^{k} U_{i\ell} V_{\ell a}$$

Applying Corollary 3 we obtain:

Lemma 4. $f(n,m,k) \;\le\; \left(\frac{8e\cdot 2\cdot nm}{k(n+m)}\right)^{k(n+m)} \;\le\; \left(\frac{16em}{k}\right)^{k(n+m)}$
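Spelled out (for the reader's convenience; this derivation is implicit in the text): Corollary 3 is applied with degree $d = 2$, $r = nm$ polynomials, and $q = k(n+m)$ variables, so
$$f(n,m,k) \;\le\; \left(\frac{8edr}{q}\right)^{q} \;=\; \left(\frac{16e\,nm}{k(n+m)}\right)^{k(n+m)} \;\le\; \left(\frac{16em}{k}\right)^{k(n+m)},$$
where the last inequality uses $nm/(n+m) \le m$.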

Substituting this bound in (3) establishes Theorem 1. The upper bound on $f(n,m,k)$ is tight up to a multiplicative factor in the exponent:

Lemma 5. For $m > k^2$, $f(n,m,k) \ge m^{\frac{1}{2}(k-1)n}$

Proof. Fix any matrix $V \in \mathbb{R}^{m\times k}$ with rows in general position, and consider the number $f(n,V,k)$ of sign configurations of matrices $UV^\top$, where $U$ varies over all $n\times k$ matrices. Focusing only on $+/-$ sign configurations (no zeros in $UV^\top$), each row of $\operatorname{sign} UV^\top$ is a homogeneous linear classification of the rows of $V$, i.e. of $m$ vectors in general position in $\mathbb{R}^k$. There are exactly $2\sum_{i=0}^{k-1}\binom{m-1}{i}$ possible homogeneous linear classifications of $m$ vectors in general position in $\mathbb{R}^k$, and so this many options for each row of $\operatorname{sign} UV^\top$. We can therefore bound:
$$f(n,m,k) \;\ge\; f(n,V,k) \;\ge\; \left(2\sum_{i=0}^{k-1}\binom{m-1}{i}\right)^{n} \;\ge\; \binom{m}{k-1}^{n} \;\ge\; \left(\frac{m}{k-1}\right)^{n(k-1)} \;\ge\; m^{\frac{1}{2}(k-1)n}$$
where the last inequality uses $m > k^2$, so that $m/(k-1) > \sqrt{m}$.

4 Generalization Error Bounds for Other Loss Functions

In Section 2 we considered generalization error bounds for a zero-one loss function. More commonly, though, other loss functions are used, and it is desirable to obtain generalization error bounds for general loss functions.

When dealing with other loss functions, the magnitudes of the entries in the matrix are important, and not only their signs. It is therefore no longer enough to bound the number of sign configurations. Instead, we will bound not only the number of ways low-rank matrices can behave with respect to a threshold of zero, but the number of possible ways low-rank matrices can behave relative to any set of thresholds. That is, for any threshold matrix $T \in \mathbb{R}^{n\times m}$, we will show that the number of possible sign configurations of $(X - T)$, where $X$ is low-rank, is small. Intuitively, this captures the complexity of the class of low-rank matrices not only around zero, but throughout all possible values.

We then use standard results from statistical machine learning to obtain generalization error bounds from the bound on the number of relative sign configurations. The number of relative sign configurations serves as a bound on the pseudodimension--the maximum number of entries for which there exists a set of thresholds such that all relative sign configurations (limited to these entries) are possible. The pseudodimension can in turn be used to show the existence of a small ε-net, which is used to obtain generalization error bounds.

Recall the definition of the pseudodimension of a class of real-valued functions:

Definition 1. A class $F$ of real-valued functions pseudo-shatters the points $x_1, \ldots, x_n$ with thresholds $t_1, \ldots, t_n$ if for every binary labeling of the points $(s_1, \ldots, s_n) \in \{+,-\}^n$ there exists $f \in F$ s.t. $f(x_i) \le t_i$ iff $s_i = -$. The pseudodimension of a class $F$ is the supremum over $n$ for which there exist $n$ points and thresholds that can be shattered.

In order to apply known results linking the pseudodimension to covering numbers, we consider matrices $X \in \mathbb{R}^{n\times m}$ as real-valued functions $X : [n]\times[m] \to \mathbb{R}$ from index pairs to entries in the matrix. The class $\mathcal{X}_k$ of rank-$k$ matrices can now be seen as a class of real-valued functions over the domain $[n]\times[m]$. We bound the pseudodimension of this class by bounding, for any threshold matrix $T \in \mathbb{R}^{n\times m}$, the number of relative sign matrices:

$$F_T(n,m,k) = \left\{\operatorname{sign}(X - T) \in \{-,0,+\}^{n\times m} \,\middle|\, X \in \mathbb{R}^{n\times m},\ \operatorname{rank} X \le k\right\}, \qquad f_T(n,m,k) = |F_T(n,m,k)|$$

Lemma 6. For any $T \in \mathbb{R}^{n\times m}$, we have $f_T(n,m,k) \le \left(\frac{16em}{k}\right)^{k(n+m)}$.

Proof. We take a similar approach to that of Lemma 4, writing rank-$k$ matrices as a product $X = UV$ where $U \in \mathbb{R}^{n\times k}$ and $V \in \mathbb{R}^{k\times m}$. Consider the $k(n+m)$ entries of $U, V$ as variables, and the $nm$ entries of $X - T$ as polynomials of degree two over these variables:

$$(X - T)_{ia} = \sum_{\ell=1}^{k} U_{i\ell} V_{\ell a} - T_{ia}$$

Applying Corollary 3 yields the desired bound.

Corollary 7. The pseudodimension of the class $\mathcal{X}_k$ of $n\times m$ matrices over the reals of rank at most $k$ is at most $k(n+m)\log\frac{16em}{k}$.
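The step from Lemma 6 to Corollary 7, made explicit: if some $d$ index pairs can be pseudo-shattered with a threshold matrix $T$, then all $2^d$ relative sign patterns restricted to those entries must be realized by rank-$k$ matrices, so
$$2^d \;\le\; f_T(n,m,k) \;\le\; \left(\frac{16em}{k}\right)^{k(n+m)} \qquad\Longrightarrow\qquad d \;\le\; k(n+m)\log\frac{16em}{k}.$$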

We can now invoke standard generalization error bounds in terms of the pseudodimension (Theorem 11 in the Appendix) to obtain:

Theorem 8. For any monotone loss function with $|\mathrm{loss}| \le M$, any matrix $Y \in \{\pm 1\}^{n\times m}$, $n, m > 2$, $\delta > 0$ and integer $k$, with probability at least $1 - \delta$ over choosing a subset $S$ of entries in $Y$ uniformly among all subsets of $|S|$ entries:

$$\forall_{X,\ \operatorname{rank} X \le k}\qquad D(X;Y) \;<\; D_S(X;Y) + 6\sqrt{\frac{k(n+m)\log\frac{16em}{k}\,\log\frac{M|S|}{k(n+m)} - \log\delta}{|S|}}$$

5 Low-Rank Matrices as Combined Classifiers

Rank-k matrices are those matrices which are a sum of k rank-1 matrices. If we view matrices as functions from pairs of indices to the reals, we can think of rank-k matrices as "combined" classifiers, and attempt to bound their complexity as such, based on the low complexity of the "basis" functions, i.e. rank-1 matrices.

A similar approach is taken in related work on learning with low-norm (maximum margin) matrix factorization [11, 12], where the hypothesis class can be viewed as a convex combination of rank-1 unit-norm matrices. Scale-sensitive (i.e. dependent on the margin, or the slope of the loss function) generalization error bounds for this class are developed based on the graceful behavior of scale-sensitive complexity measures (e.g. log covering numbers and the Rademacher complexity) with respect to convex combinations. Taking a similar view, it is possible to obtain scale-sensitive generalization error bounds for low-rank matrices. In this section we question whether it is possible to obtain scale-insensitive bounds, similar to Theorems 1 and 8, by viewing low-rank matrices as combined classifiers.

It cannot be expected that scale-insensitive complexity would be preserved when taking convex combinations of an unbounded number of base functions. However, the VC-dimension, a scale-insensitive measure of complexity, does scale gracefully when taking linear combinations of a bounded number of functions from a low VC-dimension class of indicator functions. Using this, we can obtain generalization error bounds for linear combinations of signs of rank-one matrices, but not signs of linear combinations of rank-one matrices. An alternate candidate scale-insensitive complexity measure is the pseudodimension of a class of real-valued functions. If we could bound the pseudodimension of the class of sums of k functions from a bounded-pseudodimension base class of real-valued functions, we could avoid the sign-configuration counting and obtain generalization error bounds for rank-k matrices. Unfortunately, the following counterexample shows that this is not possible.

Theorem 9. There exists a family $F$, closed under scalar multiplication, whose pseudodimension is at most five, and such that $\{f_1 + f_2 \mid f_1, f_2 \in F\}$ does not have finite pseudodimension.

Proof. We describe a class $F$ of real-valued functions over the positive integers $\mathbb{N}$. To do so, consider a one-to-one mapping of finite sets of positive integers to the positive integers. For each finite $A \subset \mathbb{N}$ define two functions3: $f_A(x) = 2^{xA} + \mathbf{1}_{x\in A}$ and $g_A(x) = 2^{xA}$. Let $F$ be the set of all scalar multiples of these functions.

For every $A \subset \mathbb{N}$, $f_A - g_A$ is the indicator function of $A$; since $F$ is closed under scalar multiplication, $f_A - g_A = f_A + (-1)\cdot g_A$ is a sum of two functions in $F$. This implies that every finite subset can be shattered, and so the pseudodimension of $\{f_1 + f_2 : f_1, f_2 \in F\}$ is unbounded.
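A concrete rendering of this construction (ours; the particular set-to-integer encoding below is an arbitrary one-to-one choice):

    def encode(A):
        # One arbitrary one-to-one mapping of finite sets of positive integers to positive integers.
        return sum(2 ** a for a in A) + 1

    def f(A):
        return lambda x: 2 ** (x * encode(A)) + (1 if x in A else 0)

    def g(A):
        return lambda x: 2 ** (x * encode(A))

    # f_A - g_A = (+1) * f_A + (-1) * g_A is a sum of two members of F
    # (F is closed under scalar multiplication) and equals the indicator of A:
    A = {2, 5}
    h = lambda x: f(A)(x) - g(A)(x)
    print([h(x) for x in range(1, 7)])  # [0, 1, 0, 0, 1, 0]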

It remains to show that the pseudodimension of $F$ is less than six. To do so, we note that there are no positive integers $A < B$ and $x < y$, and positive reals $\alpha, \beta > 0$, such that $\alpha(2^{xB} + 1) > \beta\,2^{xA}$ and $\alpha\,2^{yB} < \beta(2^{yA} + 1)$. It follows that for any $A < B$ and any $\alpha, \beta > 0$, on an initial segment (possibly empty) of $\mathbb{N}$ we have $\alpha g_B \le \alpha f_B \le \beta g_A \le \beta f_A$, while on the rest of $\mathbb{N}$ we have $\beta g_A \le \beta f_A < \alpha g_B \le \alpha f_B$.

 3We use A to refer both to a positive integer and the finite set it maps to.

In particular, any pair of functions $(f_A, f_B)$ or $(f_A, g_B)$ or $(g_A, g_B)$ in $F$ that are not associated with the same subset (i.e. $A \neq B$) cross each other at most once. This also holds when $\alpha$ or $\beta$ is negative, as the functions never change sign.
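For completeness, here is one way (not spelled out in the text) to verify the claim above that $\alpha(2^{xB}+1) > \beta\,2^{xA}$ and $\alpha\,2^{yB} < \beta(2^{yA}+1)$ cannot both hold. Dividing the two inequalities gives
$$\frac{2^{xA}}{2^{xB}+1} \;<\; \frac{\alpha}{\beta} \;<\; \frac{2^{yA}+1}{2^{yB}}, \qquad\text{hence}\qquad 2^{xA+yB} \;<\; \bigl(2^{xB}+1\bigr)\bigl(2^{yA}+1\bigr).$$
But $x < y$ and $A < B$ are positive integers, so $xA + yB \ge xB + yA + 1$ and
$$2^{xA+yB} \;\ge\; 2\cdot 2^{xB+yA} \;\ge\; 2^{xB+yA} + 2^{xB} + 2^{yA} + 2 \;>\; \bigl(2^{xB}+1\bigr)\bigl(2^{yA}+1\bigr),$$
a contradiction.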

For any six naturals $x_1 < x_2 < \cdots < x_6$ and six thresholds, consider the three labellings $(+,-,+,-,+,-)$, $(-,+,-,+,-,+)$, $(+,+,-,-,+,+)$. The three functions realizing these labellings must cross each other at least twice, but by the above arguments, there are no three functions in $F$ such that every pair crosses each other at least twice.4

6 Discussion

Alon, Frankl and Rödl [8] use a result of Milnor similar to Warren's Theorem 2. Milnor's and Warren's theorems were previously used for bounding the VC-dimension of certain geometric classes [13], and of general concept classes parametrized by real numbers, in terms of the complexity of the boolean formulas over polynomials used to represent them [14]. This last general result can be used to bound the VC-dimension of signs of $n\times m$ rank-$k$ matrices by $2k(n+m)\log(48enm)$, yielding a bound similar to Theorem 1 with an extra $\log|S|$ term. In this paper, we take a simpler path, applying Warren's theorem directly, and thus avoiding the $\log|S|$ term and reducing the other logarithmic term. Applying Warren's theorem directly also enables us to bound the pseudodimension and obtain the bound of Theorem 8 for general loss functions.

Another notable application of Milnor's result, which likely inspired these later uses, is for bounding the number of configurations of $n$ points in $\mathbb{R}^d$ with different possible linear classifications [15, 16]. Viewing signs of rank-$k$ $n\times m$ matrices as $n$ linear classifications of $m$ points in $\mathbb{R}^k$, this bound can be used to bound $f(n,m,k) < 2^{km\log 2n + k(k+1)n\log n}$ without using Warren's Theorem directly [8, 12]. The bound of Lemma 4 avoids the quadratic dependence on $k$ in the exponent.

Acknowledgments We would like to thank Peter Bartlett for pointing out [13, 14]. N.S. and T.J. would like to thank Erik Demaine for introducing them to oriented matroids.

A Proof of Corollary 3

Consider a set $R \subset \mathbb{R}^q$ containing one variable configuration for each possible sign pattern. Set $\epsilon = \frac{1}{2}\min_{1\le i\le r,\, x\in R,\, P_i(x)\neq 0} |P_i(x)| > 0$. Now consider the $2r$ polynomials $P_i^+(x) = P_i(x) + \epsilon$ and $P_i^-(x) = P_i(x) - \epsilon$, and $C' = \{x \in \mathbb{R}^q \mid \forall_i\ P_i^+(x) \neq 0,\ P_i^-(x) \neq 0\}$. Different points in $R$ (representing all sign configurations) lie in different connected components of $C'$. Invoking Theorem 2 on $C'$ establishes Corollary 3.

The count in Corollary 3 differentiates between positive, negative and zero signs. However, we are only concerned with the positivity of $Y_{ia}X_{ia}$ (in the proof of Theorem 1) or of $X_{ia} - T_{ia}$ (in the proof of Theorem 8), and do not need to differentiate between zero and negative values. Invoking Theorem 2 on $C^+ = \{x \in \mathbb{R}^q \mid \forall_i\ P_i^+(x) \neq 0\}$ yields:

Corollary 10. The number of $-/+$ sign configurations (where zero is considered negative) of $r$ polynomials, each of degree at most $d$, over $q$ variables, is at most $(4edr/q)^q$ (for $r > q > 2$).

Applying Corollary 10 to the $nm$ degree-two polynomials $Y_{ia}\sum_{\ell=1}^{k} U_{i\ell}V_{\ell a}$ establishes that, for any $Y$, the number of configurations of sign agreements of rank-$k$ matrices with $Y$ is bounded by $(8em/k)^{k(n+m)}$, which yields a constant of 8 instead of 16 inside the logarithm in Theorem 1. Applying Corollary 10 instead of Corollary 3 allows us to similarly tighten the bounds in Corollary 7 and in Theorem 8.

      4A more careful analysis shows that F has pseudodimension three.

B Generalization Error Bound in terms of the Pseudodimension

Theorem 11. Let $F$ be a class of real-valued functions $f : X \to \mathbb{R}$ with pseudodimension $d$, and let $\mathrm{loss} : \mathbb{R}\times Y \to \mathbb{R}$ be a bounded monotone loss function (i.e. for all $y$, $\mathrm{loss}(x, y)$ is monotone in $x$), with $\mathrm{loss} \le M$. For any joint distribution over $(X, Y)$, consider an i.i.d. sample $S = (X_1, Y_1), \ldots, (X_n, Y_n)$. Then for any $\epsilon > 0$:

$$\Pr_S\left( \exists_{f\in F}\ \mathbb{E}_{X,Y}[\mathrm{loss}(f(X), Y)] > \frac{1}{n}\sum_{i=1}^{n}\mathrm{loss}(f(X_i), Y_i) + \epsilon \right) \;<\; 4e(d+1)\left(\frac{32eM}{\epsilon}\right)^{d} e^{-\frac{\epsilon^2 n}{32}}$$

The bound is a composition of a generalization error bound in terms of the L1 covering number [17, Theorem 17.1], a bound on the L1 covering number in terms of the pseudodimension [18] and the observation that composition with a monotone function does not increase the pseudodimension [17, Theorem 12.3].