{"title": "Blind Regression: Nonparametric Regression for Latent Variable Models via Collaborative Filtering", "book": "Advances in Neural Information Processing Systems", "page_first": 2155, "page_last": 2163, "abstract": "We introduce the framework of {\\em blind regression} motivated by {\\em matrix completion} for recommendation systems: given $m$ users, $n$ movies, and a subset of user-movie ratings, the goal is to predict the unobserved user-movie ratings given the data, i.e., to complete the partially observed matrix. Following the framework of non-parametric statistics, we posit that user $u$ and movie $i$ have features $x_1(u)$ and $x_2(i)$ respectively, and their corresponding rating $y(u,i)$ is a noisy measurement of $f(x_1(u), x_2(i))$ for some unknown function $f$. In contrast with classical regression, the features $x = (x_1(u), x_2(i))$ are not observed, making it challenging to apply standard regression methods to predict the unobserved ratings. Inspired by the classical Taylor's expansion for differentiable functions, we provide a prediction algorithm that is consistent for all Lipschitz functions. In fact, the analysis through our framework naturally leads to a variant of collaborative filtering, shedding insight into the widespread success of collaborative filtering in practice. Assuming each entry is sampled independently with probability at least $\\max(m^{-1+\\delta},n^{-1/2+\\delta})$ with $\\delta > 0$, we prove that the expected fraction of our estimates with error greater than $\\epsilon$ is less than $\\gamma^2 / \\epsilon^2$ plus a polynomially decaying term, where $\\gamma^2$ is the variance of the additive entry-wise noise term. 
Experiments with the MovieLens and Netflix datasets suggest that our algorithm provides principled improvements over basic collaborative filtering and is competitive with matrix factorization methods.", "full_text": "Blind Regression: Nonparametric Regression for Latent Variable Models via Collaborative Filtering

Christina E. Lee, Yihua Li, Devavrat Shah, Dogyoon Song
Laboratory for Information and Decision Systems
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology
{celee, liyihua, devavrat, dgsong}@mit.edu

Abstract

We introduce the framework of blind regression motivated by matrix completion for recommendation systems: given m users, n movies, and a subset of user-movie ratings, the goal is to predict the unobserved user-movie ratings given the data, i.e., to complete the partially observed matrix. Following the framework of non-parametric statistics, we posit that user u and movie i have features x1(u) and x2(i) respectively, and their corresponding rating y(u, i) is a noisy measurement of f(x1(u), x2(i)) for some unknown function f. In contrast with classical regression, the features x = (x1(u), x2(i)) are not observed, making it challenging to apply standard regression methods to predict the unobserved ratings. Inspired by the classical Taylor's expansion for differentiable functions, we provide a prediction algorithm that is consistent for all Lipschitz functions. In fact, the analysis through our framework naturally leads to a variant of collaborative filtering, shedding insight into the widespread success of collaborative filtering in practice.
Assuming each entry is sampled independently with probability at least max(m^{−1+δ}, n^{−1/2+δ}) with δ > 0, we prove that the expected fraction of our estimates with error greater than ε is less than γ²/ε² plus a polynomially decaying term, where γ² is the variance of the additive entry-wise noise term. Experiments with the MovieLens and Netflix datasets suggest that our algorithm provides principled improvements over basic collaborative filtering and is competitive with matrix factorization methods.

1 Introduction

In this paper, we provide a statistical framework for performing nonparametric regression over latent variable models. We are initially motivated by the problem of matrix completion arising in the context of designing recommendation systems. In the popularized setting of Netflix, there are m users, indexed by u ∈ [m], and n movies, indexed by i ∈ [n]. Each user u has a rating for each movie i, denoted as y(u, i). The system observes ratings for only a small fraction of user-movie pairs. The goal is to predict ratings for the rest of the unknown user-movie pairs, i.e., to complete the partially observed m × n rating matrix. To be able to obtain meaningful predictions from the partially observed matrix, it is essential to impose structure on the data.
We assume each user u and movie i is associated to features x1(u) ∈ X1 and x2(i) ∈ X2 for some compact metric spaces X1, X2 equipped with Borel probability measures. Following the philosophy of non-parametric statistics, we assume that there exists some function f : X1 × X2 → R such that the rating of user u for movie i is given by

y(u, i) = f(x1(u), x2(i)) + η_ui,    (1)

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

where η_ui is some independent bounded noise.
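As a concrete illustration of the observation model (1), the following sketch simulates partially observed ratings from latent features. The particular f, the unit-interval latent spaces, and the uniform noise law are our own illustrative choices; they merely satisfy the paper's assumptions (compact latent spaces, Lipschitz f, bounded zero-mean noise, Bernoulli(p) sampling).

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 200, 300, 0.3              # users, movies, sampling probability

# Latent features drawn i.i.d. from a compact space (here [0, 1]).
x1 = rng.uniform(0, 1, size=m)       # user features (never observed)
x2 = rng.uniform(0, 1, size=n)       # movie features (never observed)

def f(a, b):                         # an arbitrary Lipschitz function
    return np.sin(2 * a) + a * b

eta = rng.uniform(-0.1, 0.1, size=(m, n))   # bounded, zero-mean noise
Y = f(x1[:, None], x2[None, :]) + eta       # full rating matrix
M = rng.random((m, n)) < p                  # Bernoulli(p) reveal mask
Y_obs = np.where(M, Y, np.nan)              # what the algorithm sees
```

The completion task is to fill in the NaN entries of `Y_obs` given only the observed entries and the indices, without access to `x1`, `x2`, or `f`.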
We observe ratings for a subset of the user-movie pairs, and the goal is to use the given data to predict f(x1(u), x2(i)) for all (u, i) ∈ [m] × [n] whose rating is unknown. In classical nonparametric regression, we observe the input features x1(u), x2(i) along with the rating y(u, i) for each datapoint, and thus we can approximate the function f well using local approximation techniques as long as f satisfies mild regularity conditions. However, in our setting, we do not observe the latent features x1(u), x2(i); we only observe the indices (u, i). Therefore, we use blind regression to refer to the challenge of performing regression with unobserved latent input variables. This paper addresses the question: does there exist a meaningful prediction algorithm for general nonparametric regression when the input features are unobserved?

Related Literature. Matrix completion has received enormous attention in the past decade. Matrix factorization based approaches, such as low-rank approximation, and neighborhood based approaches, such as collaborative filtering, have been the primary ways to address the problem. In recent years, there has been exciting intellectual development in the context of matrix factorization based approaches. Since any matrix can be factorized, its entries can be described by a function f in (1) of the form f(x1, x2) = x1ᵀx2, and the goal of factorization is to recover the latent features for each row and column. [25] was one of the earlier works to suggest the use of low-rank matrix approximation, observing that a low-rank matrix has a comparatively small number of free parameters. Subsequently, statistically efficient approaches were suggested using optimization based estimators, proving that matrix factorization can fill in the missing entries with sample complexity as low as rn log n, where r is the rank of the matrix [5, 23, 11, 21, 10].
There has been an exciting line of ongoing work to make the resulting algorithms faster and scalable [7, 17, 4, 15, 24, 20].
Many of these approaches are based on the structural assumption that the underlying matrix is low-rank and the matrix entries are reasonably "incoherent". Unfortunately, the low-rank assumption may not hold in practice. The recent work [8] makes precisely this observation, showing that a simple non-linear, monotonic transformation of a low-rank matrix could easily produce an effectively high-rank matrix, despite few free model parameters. They provide an algorithm and analysis specific to the form of their model, which achieves sample complexity of O((mn)^{2/3}). However, their algorithm only applies to functions f which are a nonlinear monotonic transformation of the inner product of the latent features. [6] proposes the universal singular value thresholding estimator (USVT) and provides an analysis under a similar model in which f is assumed to be a bounded Lipschitz function. They achieve a sample complexity, i.e., the required fraction of measurements over the total mn entries, which scales with the latent space dimension q according to Ω(m^{−2/(q+2)}) for a square matrix, whereas we achieve a sample complexity of Ω(m^{−1/2+δ}) (which is independent of q) as long as the latent dimension scales as o(log n).
The term collaborative filtering was coined in [9], and this technique is widely used in practice due to its simplicity and ability to scale. There are two main paradigms in neighborhood-based collaborative filtering: the user-user paradigm and the item-item paradigm. To recommend items to a user in the user-user paradigm, one first looks for similar users, and then recommends items liked by those similar users. In the item-item paradigm, in contrast, items similar to those liked by the user are found and subsequently recommended.
Much empirical evidence exists that the item-item paradigm performs well in many cases [16, 14, 22]; however, the theoretical understanding of the method has been limited. In recent works, latent mixture models or cluster models have been introduced to explain the collaborative filtering algorithm as well as the empirically observed superior performance of the item-item paradigm, c.f. [12, 13, 1, 2, 3]. However, these results assume a specific parametric model, such as a mixture distribution model for preferences across users and movies. We hope that by providing an analysis for collaborative filtering within our broader nonparametric model, we can provide a more complete understanding of the potentials and limitations of collaborative filtering.
The algorithm that we propose in this work is inspired by local functional approximations, specifically Taylor's approximation and classical kernel regression, which also relies on local smoothed approximations, c.f. [18, 26]. However, since kernel regression and other similar methods use explicit knowledge of the input features, their analysis and proof techniques do not extend to our context of blind regression, in which the features are latent. Although our estimator takes a similar form of computing a convex combination of nearby datapoints weighted according to a function of the latent distance, the analysis required is entirely different.

Contributions. The key contribution of our work is in providing a statistical framework for nonparametric regression over latent variable models.
We refrain from any specific modeling assumptions on f, keeping mild regularity conditions aligned with the philosophy of non-parametric statistics. We assume that the latent features are drawn independently from an identical distribution (IID) over bounded metric spaces; the function f is Lipschitz with respect to the latent spaces; entries are observed independently with some probability p; and the additive noise in observations is independently distributed with zero mean and bounded support. In spite of the minimal assumptions of our model, we provide a consistent matrix completion algorithm with finite sample error bounds. Furthermore, as a coincidental by-product, we find that our framework provides an explanation of the practical mystery of "why collaborative filtering algorithms work well in practice".
There are two conceptual parts to our algorithm. First, we derive an estimate of f(x1(u), x2(i)) for an unobserved index pair (u, i) by using a first order local Taylor approximation expanded around the points corresponding to (u, i′), (u′, i), and (u′, i′). This leads to the estimate

ŷ(u, i) ≡ y(u′, i) + y(u, i′) − y(u′, i′) ≈ f(x1(u), x2(i)),    (2)

which is accurate as long as x1(u′) is close to x1(u) or x2(i′) is close to x2(i). In kernel regression, distances between input features are used to upper bound the error of individual estimates, but since the latent features are not observed, we need another method to determine which of these estimates are reliable. Secondly, under mild regularity conditions, we upper bound the squared error of the estimate in (2) by the variance of the difference between commonly observed entries in rows (u, v) or columns (i, j).
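The estimate in (2) is exact whenever f is additive in its two arguments, since then the first-order expansion carries no error; a small self-contained check (the additive f and the numbers below are our own toy example):

```python
import numpy as np

def basic_estimate(Y, u, i, v, j):
    """Taylor-motivated estimate (2) of entry (u, i), built from the
    three entries (u, j), (v, i), (v, j)."""
    return Y[u, j] + Y[v, i] - Y[v, j]

# For additive f(x1, x2) = g1(x1) + g2(x2), the estimate is exact:
# (g1(u) + g2(j)) + (g1(v) + g2(i)) - (g1(v) + g2(j)) = g1(u) + g2(i).
g1 = np.array([0.1, 0.5, 0.9])
g2 = np.array([0.2, 0.4, 0.8, 1.0])
Y = g1[:, None] + g2[None, :]
print(basic_estimate(Y, 0, 0, 2, 3) - Y[0, 0])   # ~0 up to float error
```

For general Lipschitz f, the estimate inherits an error governed by how close the anchor features x1(v), x2(j) are to x1(u), x2(i), which is exactly what the reliability weights below try to quantify from data.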
We empirically estimate this quantity and use it, as we would use distance in the latent space, to appropriately weight the individual estimates in forming a final prediction. If we choose only the datapoints with minimum empirical row variance, we recover user-user nearest neighbor collaborative filtering. Inspired by kernel regression, we also propose computing the weights according to a Gaussian kernel applied to the minimum of the row or column sample variances.
As the main technical result, we show that the user-user nearest neighbor variant of collaborative filtering with our similarity metric yields a consistent estimator for any Lipschitz function as long as we observe a max(m^{−1+δ}, n^{−1/2+δ}) fraction of the matrix with δ > 0. In the process, we obtain finite sample error bounds, whose details are stated in Theorem 1. We compared the Gaussian kernel variant of our algorithm to classic collaborative filtering algorithms and a matrix factorization based approach (softImpute) on predicting user-movie ratings for the Netflix and MovieLens datasets. Experiments suggest that our method improves over existing collaborative filtering methods, and sometimes outperforms matrix-factorization-based approaches depending on the dataset.

2 Setup

Operating assumptions. There are m users and n movies. The rating of user u ∈ [m] for movie i ∈ [n] is given by (1), taking the form y(u, i) = f(x1(u), x2(i)) + η_ui.
We make the following assumptions.

(a) X1 and X2 are compact metric spaces endowed with metrics dX1 and dX2 respectively, with diameter bounded by B_X:

dX1(x1, x1′) ≤ B_X, ∀ x1, x1′ ∈ X1, and dX2(x2, x2′) ≤ B_X, ∀ x2, x2′ ∈ X2.    (3)

(b) f : X1 × X2 → R is L-Lipschitz with respect to the ∞-product metric:

|f(x1, x2) − f(x1′, x2′)| ≤ L max{dX1(x1, x1′), dX2(x2, x2′)}, ∀ x1, x1′ ∈ X1, x2, x2′ ∈ X2.

(c) The latent features of each user u and movie i, x1(u) and x2(i), are sampled independently according to Borel probability measures PX1 and PX2 on (X1, TX1) and (X2, TX2), where TX denotes the Borel σ-algebra of a metric space X.

(d) The additive noise for all data points is independent and bounded, with mean zero and variance γ²: for all u ∈ [m], i ∈ [n],

η_ui ∈ [−B_η, B_η], E[η_ui] = 0, Var[η_ui] = γ².    (4)

(e) The rating of each entry is revealed (observed) with probability p, independently.

Notation. Let the random variable M_ui = 1 if the rating of user u and movie i is revealed and 0 otherwise; M_ui is an independent Bernoulli random variable with parameter p. Let N1(u) denote the set of column indices of observed entries in row u, and let N2(i) denote the set of row indices of observed entries in column i. That is,

N1(u) ≜ {i : M(u, i) = 1} and N2(i) ≜ {u : M(u, i) = 1}.    (5)

For rows v ≠ u, N1(u, v) ≜ N1(u) ∩ N1(v) denotes the column indices of commonly observed entries of rows (u, v). For columns i ≠ j, N2(i, j) ≜ N2(i) ∩ N2(j) denotes the row indices of commonly observed entries of columns (i, j). We refer to this as the overlap between two rows or columns.

3 Algorithm Intuition

Local Taylor Approximation.
We propose a prediction algorithm for unknown ratings based on insights from the classical Taylor approximation of a function. Suppose X1 ≅ X2 ≅ R, and we wish to predict the unknown rating, f(x1(u), x2(i)), of user u ∈ [m] for movie i ∈ [n]. Using the first order Taylor expansion of f around (x1(v), x2(j)) for some v ∈ [m], v ≠ u, and j ∈ [n], j ≠ i, it follows that

f(x1(u), x2(i)) ≈ f(x1(v), x2(j)) + (x1(u) − x1(v)) ∂f(x1(v), x2(j))/∂x1 + (x2(i) − x2(j)) ∂f(x1(v), x2(j))/∂x2.

We are not able to directly compute this expression, as we do not know the latent features, the function f, or the partial derivatives of f. However, we can again apply the Taylor expansion to f(x1(v), x2(i)) and f(x1(u), x2(j)) around (x1(v), x2(j)), which results in a set of equations with the same unknown terms. It follows from rearranging terms and substitution that

f(x1(u), x2(i)) ≈ f(x1(v), x2(i)) + f(x1(u), x2(j)) − f(x1(v), x2(j)),

as long as the first order Taylor approximation is accurate. Thus if the noise term in (1) is small, we can approximate f(x1(u), x2(i)) using the observed ratings y(v, j), y(u, j) and y(v, i) according to

ŷ(u, i) = y(u, j) + y(v, i) − y(v, j).    (6)

Reliability of Local Estimates. We will show that the variance of the difference between two rows or columns upper bounds the estimation error. Therefore, in order to ensure the accuracy of the above estimate, we use empirical observations to estimate the variance of the difference between two rows or columns, which directly relates to an error bound.
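This empirical proxy, the sample variance of the difference between two rows over their commonly observed columns (formalized as (7) in the algorithm description), can be sketched in a few lines. The pairwise form in (7) is algebraically equal to the unbiased sample variance of the entrywise differences, which is what the sketch computes; the β threshold and variable names are ours.

```python
import numpy as np

def row_variance(Y_obs, u, v, beta=2):
    """Empirical variance of the difference between rows u and v over
    their commonly observed columns; equals the pairwise form in (7)."""
    common = ~np.isnan(Y_obs[u]) & ~np.isnan(Y_obs[v])
    if common.sum() < beta:
        return np.inf                   # overlap too small to compare
    d = Y_obs[u, common] - Y_obs[v, common]
    return d.var(ddof=1)                # unbiased sample variance
```

Column variances are computed symmetrically by transposing the roles of rows and columns.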
By expanding (6) according to (1), the error f(x1(u), x2(i)) − ŷ(u, i) is equal to

(f(x1(u), x2(i)) − f(x1(v), x2(i))) − (f(x1(u), x2(j)) − f(x1(v), x2(j))) − η_vi + η_vj − η_uj.

If we condition on x1(u) and x1(v),

E[(Error)² | x1(u), x1(v)] = 2 Var_{x∼PX2}[f(x1(u), x) − f(x1(v), x) | x1(u), x1(v)] + 3γ².

Similarly, if we condition on x2(i) and x2(j), it follows that the expected squared error is bounded by the variance of the difference between the ratings of columns i and j. This theoretically motivates weighting the estimates according to the variance of the difference between the rows or columns.

4 Algorithm Description

We provide the algorithm for predicting an unknown entry in position (u, i) using the available data. Given a parameter β ≥ 2, define the β-overlapping neighbors of u and i respectively as

S^β_u(i) = {v s.t. v ∈ N2(i), v ≠ u, |N1(u, v)| ≥ β},
S^β_i(u) = {j s.t. 
j ∈ N1(u), j ≠ i, |N2(i, j)| ≥ β}.

For each v ∈ S^β_u(i), compute the empirical row variance between u and v,

s²_uv = (1 / (2|N1(u, v)|(|N1(u, v)| − 1))) Σ_{i,j ∈ N1(u,v)} ((y(u, i) − y(v, i)) − (y(u, j) − y(v, j)))².    (7)

Similarly, compute the empirical column variance between i and j, for all j ∈ S^β_i(u),

s²_ij = (1 / (2|N2(i, j)|(|N2(i, j)| − 1))) Σ_{u,v ∈ N2(i,j)} ((y(u, i) − y(u, j)) − (y(v, i) − y(v, j)))².    (8)

Let B^β(u, i) denote the set of positions (v, j) such that the entries y(v, j), y(u, j) and y(v, i) are observed, and the overlaps between rows (u, v) and between columns (i, j) are both at least β:

B^β(u, i) = {(v, j) ∈ S^β_u(i) × S^β_i(u) s.t. M(v, j) = 1}.

Compute the final estimate as a convex combination of the estimates derived in (6) for (v, j) ∈ B^β(u, i),

ŷ(u, i) = ( Σ_{(v,j)∈B^β(u,i)} w_ui(v, j) (y(u, j) + y(v, i) − y(v, j)) ) / ( Σ_{(v,j)∈B^β(u,i)} w_ui(v, j) ),    (9)

where the weights w_ui(v, j) are defined as a function of (7) and (8). We proceed to discuss a few choices for the weight function, each of which results in a different algorithm.

User-User or Item-Item Nearest Neighbor Weights. We can evenly distribute the weights among only the entries in the nearest neighbor row, i.e., the row with minimal empirical variance,

w_vj = I(v = u*), for u* ∈ arg min_{v ∈ S^β_u(i)} s²_uv.

If we substitute these weights in (9), we recover an estimate which is asymptotically equivalent to the mean-adjusted variant of the classical user-user nearest neighbor (collaborative filtering) algorithm,

ŷ(u, i) = y(u*, i) + m_uu*,

where m_uu* is the empirical mean of the difference of ratings between rows u and u*.
For any u, v,

m_uv = (1 / |N1(u, v)|) Σ_{j ∈ N1(u,v)} (y(u, j) − y(v, j)).

Equivalently, we can evenly distribute the weights among entries in the nearest neighbor column, i.e., the column with minimal empirical variance, recovering the classical mean-adjusted item-item nearest neighbor collaborative filtering algorithm. Theorem 1 proves that this simple algorithm produces a consistent estimator, and we provide the finite sample error analysis. Due to the similarities, our analysis also directly implies the proof of correctness and consistency for the classic user-user and item-item collaborative filtering methods.

User-Item Gaussian Kernel Weights. Inspired by kernel regression, we introduce a variant of the algorithm which computes the weights according to a Gaussian kernel function with bandwidth parameter λ, substituting the minimum row or column sample variance as a proxy for the distance,

w_vj = exp(−λ min{s²_uv, s²_ij}).

When λ = ∞, the estimate only depends on the basic estimates whose row or column has the minimum sample variance. When λ = 0, the algorithm equally averages all basic estimates. We applied this variant of our algorithm to both movie recommendation and image inpainting data, which show that our algorithm improves upon user-user and item-item classical collaborative filtering.

Connections to Cosine Similarity Weights. In our algorithm, we determine the reliability of estimates as a function of the sample variance, which is equivalent to the squared distance of the mean-adjusted values. In classical collaborative filtering, cosine similarity is commonly used, which can be approximated as a different choice of the weight kernel over the squared difference.

5 Main Theorem

Let E ⊂ [m] × [n] denote the set of user-movie pairs for which the algorithm predicts a rating.
For ε > 0, the overall ε-risk of the algorithm is the fraction of estimates whose error is larger than ε,

Risk_ε = (1/|E|) Σ_{(u,i)∈E} I(|f(x1(u), x2(i)) − ŷ(u, i)| > ε).    (10)

In Theorem 1, we upper bound the expected ε-risk, proving that the user-user nearest neighbor estimator is consistent, i.e., in the presence of no noise, the estimates converge to the true values as m, n go to infinity. We may assume m ≤ n without loss of generality.

Theorem 1. For a fixed ε > 0, as long as p ≥ max{m^{−1+δ}, n^{−1/2+δ}} (where δ > 0), for any ρ = ω(n^{−2δ/3}), the user-user nearest-neighbor variant of our method with β = np²/2 achieves

E[Risk_ε] ≤ 3ρ + (γ²/ε²)(1 + o(1)) + O( exp(−C m^δ) + m^δ exp(−n^{2δ/3}/(5B²)) ),

where B = 2(L B_X + B_η) and C = h(√(ρ/L²)) ∧ 1/6, for h(r) := inf_{x0∈X1} P_{x∼PX1}(dX1(x, x0) ≤ r).

For a generic β, we can also provide precise error bounds of a similar form, with modified rates of convergence. Choosing β to grow with np² ensures that as n goes to infinity, the required overlap between rows also goes to infinity, thus the empirical mean and variance computed in the algorithm converge precisely to the true mean and variance. The parameter ρ in Theorem 1 is introduced purely for the purpose of analysis, and is not used within the implementation of the algorithm.
The function h behaves as a lower bound of the cumulative distribution function of PX1, and it always exists under our assumption that X1 is compact.
It is used to ensure that for any u ∈ [m], with high probability, there exists another row v ∈ S^β_u(i) such that dX1(x1(u), x1(v)) is small, implying by the Lipschitz condition that we can use the values of row v to approximate the values of row u well. For example, if PX1 is a uniform distribution over a unit cube in q-dimensional Euclidean space, then h(r) = min(1, r)^q, and our error bound becomes meaningful for n ≥ (L²/ρ)^{q/2δ}. On the other hand, if PX1 is supported over finitely many points, then h(r) = min_{x∈supp(PX1)} PX1(x) is a positive constant, and the role of the latent dimension becomes irrelevant. Intuitively, the "geometry" of PX1 through h near 0 determines the impact of the latent space dimension on the sample complexity, and our results hold as long as the latent dimension q = o(log n).

6 Proof Sketch

For any evaluation set of unobserved entries E, the expectation of the ε-risk is

E[Risk_ε] = (1/|E|) Σ_{(u,i)∈E} P(|f(x1(u), x2(i)) − ŷ(u, i)| > ε) = P(|f(x1(u), x2(i)) − ŷ(u, i)| > ε),

because the indexing of the entries is exchangeable and identically distributed. To bound the expected risk, it is sufficient to provide a tail bound for the probability of the error. For any fixed a, b ∈ X1 and random variable x ∼ PX2, we denote the mean and variance of the difference f(a, x) − f(b, x) by

μ_ab ≜ E_x[f(a, x) − f(b, x)] = E[m_uv | x1(u) = a, x1(v) = b],
σ²_ab ≜ Var_x[f(a, x) − f(b, x)] = E[s²_uv | x1(u) = a, x1(v) = b] − 2γ²,

which we point out are also the expectations of the empirical means and variances computed by the algorithm when we condition on the latent representations of the users.
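The identity E[s²_uv | x1(u) = a, x1(v) = b] = σ²_ab + 2γ² is easy to verify numerically. The Monte Carlo sketch below uses an illustrative Lipschitz f and uniform latent and noise distributions of our own choosing; the empirical row variance averaged over trials should match the population quantity.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 0.2, 0.9                                  # fixed row features
f = lambda x1, x2: np.sin(2 * x1) + x1 * x2      # illustrative Lipschitz f
h = 0.1                                          # noise bound: eta ~ U[-h, h]
gamma2 = h ** 2 / 3                              # noise variance gamma^2
k, trials = 50, 2000                             # overlap size, repetitions

s2 = np.empty(trials)
for t in range(trials):
    x = rng.uniform(0, 1, size=k)                # shared column features
    yu = f(a, x) + rng.uniform(-h, h, size=k)    # ratings of row u
    yv = f(b, x) + rng.uniform(-h, h, size=k)    # ratings of row v
    s2[t] = (yu - yv).var(ddof=1)                # empirical row variance (7)

xg = rng.uniform(0, 1, size=200_000)
sigma2_ab = (f(a, xg) - f(b, xg)).var()          # population sigma^2_ab
print(s2.mean(), sigma2_ab + 2 * gamma2)         # the two should agree
```

This is exactly why the sample variance can stand in for latent distance: its conditional expectation is the latent-space quantity σ²_ab shifted by the known constant 2γ².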
The computation of ŷ(u, i) involves two steps: first the algorithm determines the neighboring row with the minimum sample variance, u* = arg min_{v∈S^β_u(i)} s²_uv, and then it computes the estimate by adjusting according to the empirical mean, ŷ(u, i) := y(u*, i) + m_uu*.
The proof involves three key steps, each stated within a lemma. Lemma 1 proves that with high probability the observations are dense enough that there is a sufficient number of rows with entry overlap larger than β, i.e., the number of candidate rows, |S^β_u(i)|, concentrates around (m − 1)p. This relies on concentration of binomial random variables via Chernoff's bound.

Lemma 1. Given p > 0, 2 ≤ β ≤ np²/2 and α > 0, for any (u, i) ∈ [m] × [n],

P( |S^β_u(i)| ∉ (1 ± α)(m − 1)p ) ≤ 2 exp( −α²(m − 1)p / 3 ) + (m − 1) exp( −np²/8 ).

Lemma 2 proves that since the latent features are sampled IID from a bounded metric space, for any index pair (u, i), there exists a "good" neighboring row v ∈ S^β_u(i) whose σ²_{x1(u)x1(v)} is small.

Lemma 2. Consider u ∈ [m] and a set S ⊂ [m] \ {u}. Then for any ρ > 0,

P( min_{v∈S} σ²_{x1(u)x1(v)} > ρ ) ≤ (1 − h(√(ρ/L²)))^{|S|},

where h(r) := inf_{x0∈X1} P_{x∼PX1}(dX1(x, x0) ≤ r).

Subsequently, conditioned on the event that |S^β_u(i)| ≈ (m − 1)p, Lemmas 3 and 4 prove that the sample mean and sample variance of the differences between two rows concentrate around the true mean and true variance with high probability. This involves using the Lipschitz and boundedness assumptions on f and X1, as well as the Bernstein and Maurer-Pontil inequalities.

Lemma 3.
Given u, v ∈ [m], i ∈ [n] and β ≥ 2, for any α > 0,

P( |μ_{x1(u)x1(v)} − m_uv| > α | v ∈ S^β_u(i) ) ≤ exp( −3βα² / (6B² + 2Bα) ),

where recall that B = 2(L B_X + B_η).

Lemma 4. Given u ∈ [m], i ∈ [n], and β ≥ 2, for any ρ > 0,

P( |s²_uv − (σ²_{x1(u)x1(v)} + 2γ²)| > ρ | v ∈ S^β_u(i) ) ≤ 2 exp( −βρ² / (4B²(2L B_X² + 4γ² + ρ)) ),

where recall that B = 2(L B_X + B_η).

Given that there exists a neighbor v ∈ S^β_u(i) whose true variance σ²_{x1(u)x1(v)} is small, and conditioned on the event that all the sample variances concentrate around the true variances, it follows that the true variance between u and its nearest neighbor u* is small with high probability. Finally, conditioned on the event that |S^β_u(i)| ≈ (m − 1)p and that the true variance between the target row and the nearest neighbor row is small, we bound the tail probability of the estimation error using Chebyshev's inequality. The only term in the error probability which does not decay to zero is the error from Chebyshev's inequality, which dominates the final expression, leading to the final result.

7 Experiments

We evaluated the performance of our algorithm on predicting user-movie ratings on the MovieLens 1M and Netflix datasets. For the implementation of our method, we used user-item Gaussian kernel weights for the final estimator. We chose the overlap parameter β = 2 to ensure the algorithm is able to compute an estimate for all missing entries.
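For concreteness, a simplified single-entry version of the estimator used in these experiments (row and column variances as in (7)-(8), Gaussian kernel weights, and the final combination (9)) can be sketched as follows. This is our own minimal sketch, not the tuned experimental code; it makes no attempt at efficiency.

```python
import numpy as np

def predict(Y_obs, u, i, lam=2.0, beta=2):
    """Sketch of the user-item Gaussian kernel estimator: combine the
    basic estimates (6) with weights exp(-lam * min(s2_row, s2_col))."""
    m, n = Y_obs.shape
    obs = ~np.isnan(Y_obs)

    num = den = 0.0
    cols = [j for j in range(n) if j != i and obs[u, j]]
    # Column variances s2_ij over rows where both columns are observed.
    s2_col = {}
    for j in cols:
        c = obs[:, i] & obs[:, j]
        s2_col[j] = (Y_obs[c, i] - Y_obs[c, j]).var(ddof=1) if c.sum() >= beta else None
    for v in range(m):
        if v == u or not obs[v, i]:
            continue
        c = obs[u] & obs[v]                      # commonly observed columns
        if c.sum() < beta:
            continue
        s2_row = (Y_obs[u, c] - Y_obs[v, c]).var(ddof=1)
        for j in cols:
            if not obs[v, j] or s2_col[j] is None:
                continue
            w = np.exp(-lam * min(s2_row, s2_col[j]))
            num += w * (Y_obs[u, j] + Y_obs[v, i] - Y_obs[v, j])
            den += w
    return num / den if den > 0 else np.nan
```

Setting lam very large recovers the nearest-neighbor variant (only minimum-variance estimates matter), while lam = 0 averages all basic estimates equally, matching the λ discussion in Section 4.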
When β is larger, the algorithm requires rows (or columns) to have more commonly rated movies (or users). Although this increases the reliability of the estimates, it also reduces the fraction of entries for which the estimate is defined. We optimized the bandwidth parameter λ of the Gaussian kernel by evaluating the method with multiple values of λ and choosing the value which minimizes the error.
We compared our method with user-user collaborative filtering, item-item collaborative filtering, and softImpute from [20]. We chose the classic mean-adjusted collaborative filtering method, in which the weights are proportional to the cosine similarity of pairs of users or items (i.e., movies). SoftImpute is a matrix-factorization-based method which iteratively replaces missing elements in the matrix with those obtained from a soft-thresholded SVD.
For both the MovieLens and Netflix datasets, the ratings are integers from 1 to 5. From each dataset, we generated 100 smaller user-movie rating matrices, in which we randomly subsampled 2000 users and 2000 movies. For each rating matrix, we randomly select and withhold a percentage of the known ratings for the test set, while the remaining portion of the data set is revealed to the algorithm for computing the estimates. After the algorithm computes its predictions for unrevealed movie-user pairs, we evaluate the Root Mean Squared Error (RMSE) of the predictions against the withheld test set, where RMSE is defined as the square root of the mean of the squared prediction errors over the evaluation set. Figure 1 plots the RMSE of our method along with classic collaborative filtering and softImpute evaluated against 10%, 30%, 50%, and 70% withheld test sets.
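The evaluation metric just described is standard; a short sketch of the RMSE computation over a withheld test set (variable names are ours):

```python
import numpy as np

def rmse(pred, truth, test_mask):
    """Root mean squared error over the withheld (test) entries."""
    err = pred[test_mask] - truth[test_mask]
    return float(np.sqrt(np.mean(err ** 2)))

# Tiny example: two test entries with errors 0.5 and -1.0.
truth = np.array([[4.0, 3.0], [5.0, 2.0]])
pred = np.array([[4.5, 3.0], [4.0, 2.0]])
mask = np.array([[True, False], [True, False]])
print(rmse(pred, truth, mask))   # sqrt((0.25 + 1.0) / 2) ~ 0.79
```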
The RMSE is averaged over the 100 subsampled rating matrices, and 95% confidence intervals are provided.

Figure 1: Performance of algorithms on Netflix and MovieLens datasets with 95% confidence intervals. The λ values used by our algorithm are 2.8 (10%), 2.3 (30%), 1.7 (50%), and 1 (70%) for MovieLens, and 1.8 (10%), 1.7 (30%), 1.6 (50%), and 1.5 (70%) for Netflix.

Figure 1 suggests that our algorithm achieves a systematic improvement over classical user-user and item-item collaborative filtering. SoftImpute performs the worst on the MovieLens dataset but the best on the Netflix dataset. This behavior could stem from the different underlying assumptions: low rank for matrix factorization methods versus Lipschitz continuity for collaborative filtering methods, which could lead to dataset-dependent performance.

8 Discussion

We introduced a generic framework of blind regression, i.e., nonparametric regression over latent variable models. We allow the model to be any Lipschitz function f over any bounded feature spaces X1, X2, while imposing the limitation that the input features are latent. This is applicable to a wide variety of problems, including recommendation systems, social network analysis, community detection, crowdsourcing, and product demand prediction. Many parametric models (e.g., low rank assumptions) can be framed as special cases of our model.

Despite the generality and limited assumptions of our model, we present a simple similarity-based estimator, and we provide theoretical guarantees bounding its error within the noise level γ². The analysis provides theoretical grounds for the popularity of similarity-based methods. To the best of our knowledge, this is the first provable guarantee on the performance of neighbor-based collaborative filtering within a fully nonparametric model.
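Schematically, the estimate underlying this guarantee uses a neighboring row as a first-order proxy. For a neighbor v of the target user u and an anchor movie j rated by both, the intuition can be sketched as follows (an informal restatement under the model y(u, i) = f(x1(u), x2(i)) + noise, not a verbatim equation from the analysis):

```latex
% Taylor-style first-order approximation: when x_1(v) is close to x_1(u),
% the row difference is nearly constant across columns, so
f(x_1(u), x_2(i)) \approx f(x_1(v), x_2(i))
                   + \big( f(x_1(u), x_2(j)) - f(x_1(v), x_2(j)) \big),
% which suggests the entry-wise estimate
\hat{y}(u, i) = y(v, i) + y(u, j) - y(v, j),
% averaged over neighbors v (and anchors j), with weights determined by
% the sample variance s^2_{uv} of the observed row differences.
```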
Our algorithm and analysis follow from local Taylor approximation, along with the observation that the sample variance between rows or columns is a good indicator of "closeness", i.e., of the similarity of their function values. The algorithm essentially estimates the local metric information between the latent features from observed data, and then performs local smoothing in a manner similar to classical kernel regression.

Due to the local nature of our algorithm, our sample complexity does not depend on the latent dimension, whereas Chatterjee's USVT estimator [6] requires sampling almost every entry when the latent dimension is large. This difference arises because Chatterjee's result stems from showing that a Lipschitz function can be approximated by a piecewise constant function, which upper bounds the rank of the target matrix; this discretization incurs a large penalty with respect to the dimension of the latent space. Since our method follows from local approximations, we only require sufficient sampling so that locally there are enough close neighbor points.

The connection of our framework to regression suggests many natural future directions. We can extend model (1) to multivariate functions f, which translates to the problem of higher-order tensor completion. Variations of the algorithm and analysis that we provide for matrix completion extend to tensor completion, due to the flexible and generic assumptions of our model. It would also be useful to extend the results to capture general noise models, sparser sampling regimes, or mixed models with both parametric and nonparametric, or both latent and observed, variables.

Acknowledgements: This work is supported in parts by ARO under MURI award 133668-5079809, by NSF under grants CMMI-1462158 and CMMI-1634259, and additionally by a Samsung Scholarship, Siebel Scholarship, NSF Graduate Fellowship, and Claude E.
Shannon Research Assistantship.

References

[1] S. Aditya, O. Dabeer, and B. K. Dey. A channel coding perspective of collaborative filtering. IEEE Transactions on Information Theory, 57(4):2327–2341, 2011.
[2] G. Bresler, G. H. Chen, and D. Shah. A latent source model for online collaborative filtering. In Advances in Neural Information Processing Systems, pages 3347–3355, 2014.
[3] G. Bresler, D. Shah, and L. F. Voloch. Collaborative filtering with low regret. arXiv preprint arXiv:1507.05371, 2015.
[4] D. Cai, X. He, X. Wu, and J. Han. Non-negative matrix factorization on manifold. In Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on, pages 63–72. IEEE, 2008.
[5] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
[6] S. Chatterjee et al. Matrix estimation by universal singular value thresholding. The Annals of Statistics, 43(1):177–214, 2015.
[7] M. Fazel, H. Hindi, and S. P. Boyd. Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices. In Proceedings of ACC, volume 3, pages 2156–2162. IEEE, 2003.
[8] R. S. Ganti, L. Balzano, and R. Willett. Matrix completion under monotonic single index models. In Advances in Neural Information Processing Systems, pages 1864–1872, 2015.
[9] D. Goldberg, D. Nichols, B. M. Oki, and D. Terry. Using collaborative filtering to weave an information tapestry. Commun. ACM, 1992.
[10] P. Jain, P. Netrapalli, and S. Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing, pages 665–674. ACM, 2013.
[11] R. Keshavan, A. Montanari, and S. Oh. Matrix completion from a few entries. IEEE Trans. Inf. Theory, 56(6), 2009.
[12] J. Kleinberg and M. Sandler. Convergent algorithms for collaborative filtering. In Proceedings of the 4th ACM Conference on Electronic Commerce, pages 1–10. ACM, 2003.
[13] J. Kleinberg and M. Sandler. Using mixture models for collaborative filtering. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pages 569–578. ACM, 2004.
[14] Y. Koren and R. Bell. Advances in collaborative filtering. In Recommender Systems Handbook, pages 145–186. Springer US, 2011.
[15] Z. Lin, A. Ganesh, J. Wright, L. Wu, M. Chen, and Y. Ma. Fast convex optimization algorithms for exact recovery of a corrupted low-rank matrix. CAMSAP, 61, 2009.
[16] G. Linden, B. Smith, and J. York. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 7(1):76–80, 2003.
[17] Z. Liu and L. Vandenberghe. Interior-point method for nuclear norm approximation with application to system identification. SIAM Journal on Matrix Analysis and Applications, 31(3):1235–1256, 2010.
[18] Y. Mack and B. W. Silverman. Weak and strong uniform consistency of kernel regression estimates. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 61(3):405–415, 1982.
[19] A. Maurer and M. Pontil. Empirical Bernstein bounds and sample variance penalization. ArXiv e-prints, July 2009.
[20] R. Mazumder, T. Hastie, and R. Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. The Journal of Machine Learning Research, 11:2287–2322, 2010.
[21] S. Negahban and M. J. Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. The Journal of Machine Learning Research, 13(1):1665–1697, 2012.
[22] X. Ning, C. Desrosiers, and G. Karypis. Recommender Systems Handbook, chapter A Comprehensive Survey of Neighborhood-Based Recommendation Methods, pages 37–76. Springer US, 2015.
[23] A. Rohde, A. B. Tsybakov, et al. Estimation of high-dimensional low-rank matrices. The Annals of Statistics, 39(2):887–930, 2011.
[24] B.-H. Shen, S. Ji, and J. Ye. Mining discrete patterns via binary matrix factorization. In Proceedings of the 15th ACM SIGKDD International Conference, pages 757–766. ACM, 2009.
[25] N. Srebro, N. Alon, and T. S. Jaakkola. Generalization error bounds for collaborative prediction with low-rank matrices. In Advances in Neural Information Processing Systems, pages 1321–1328, 2004.
[26] M. P. Wand and M. C. Jones. Kernel Smoothing. CRC Press, 1994.