{"title": "Missing Not at Random in Matrix Completion: The Effectiveness of Estimating Missingness Probabilities Under a Low Nuclear Norm Assumption", "book": "Advances in Neural Information Processing Systems", "page_first": 14900, "page_last": 14909, "abstract": "Matrix completion is often applied to data with entries missing not at random (MNAR). For example, consider a recommendation system where users tend to only reveal ratings for items they like. In this case, a matrix completion method that relies on entries being revealed at uniformly sampled row and column indices can yield overly optimistic predictions of unseen user ratings. Recently, various papers have shown that we can reduce this bias in MNAR matrix completion if we know the probabilities of different matrix entries being missing. These probabilities are typically modeled using logistic regression or naive Bayes, which make strong assumptions and lack guarantees on the accuracy of the estimated probabilities. In this paper, we suggest a simple approach to estimating these probabilities that avoids these shortcomings. Our approach follows from the observation that missingness patterns in real data often exhibit low nuclear norm structure. We can then estimate the missingness probabilities by feeding the (always fully-observed) binary matrix specifying which entries are revealed to an existing nuclear-norm-constrained matrix completion algorithm by Davenport et al. [2014]. Thus, we tackle MNAR matrix completion by solving a different matrix completion problem first that recovers missingness probabilities. We establish finite-sample error bounds for how accurate these probability estimates are and how well these estimates debias standard matrix completion losses for the original matrix to be completed. 
Our experiments show that the proposed debiasing strategy can improve a variety of existing matrix completion algorithms, and achieves downstream matrix completion accuracy at least as good as logistic regression and naive Bayes debiasing baselines that require additional auxiliary information.", "full_text": "Missing Not at Random in Matrix Completion: The Effectiveness of Estimating Missingness Probabilities Under a Low Nuclear Norm Assumption

Wei Ma* and George H. Chen*
Carnegie Mellon University
Pittsburgh, PA 15213
{weima,georgechen}@cmu.edu

Abstract

Matrix completion is often applied to data with entries missing not at random (MNAR). For example, consider a recommendation system where users tend to only reveal ratings for items they like. In this case, a matrix completion method that relies on entries being revealed at uniformly sampled row and column indices can yield overly optimistic predictions of unseen user ratings. Recently, various papers have shown that we can reduce this bias in MNAR matrix completion if we know the probabilities of different matrix entries being missing. These probabilities are typically modeled using logistic regression or naive Bayes, which make strong assumptions and lack guarantees on the accuracy of the estimated probabilities. In this paper, we suggest a simple approach to estimating these probabilities that avoids these shortcomings. Our approach follows from the observation that missingness patterns in real data often exhibit low nuclear norm structure. We can then estimate the missingness probabilities by feeding the (always fully-observed) binary matrix specifying which entries are revealed or missing to an existing nuclear-norm-constrained matrix completion algorithm by Davenport et al. [2014]. Thus, we tackle MNAR matrix completion by solving a different matrix completion problem first that recovers missingness probabilities. 
We establish finite-sample error bounds for how accurate these probability estimates are and how well these estimates debias standard matrix completion losses for the original matrix to be completed. Our experiments show that the proposed debiasing strategy can improve a variety of existing matrix completion algorithms, and achieves downstream matrix completion accuracy at least as good as logistic regression and naive Bayes debiasing baselines that require additional auxiliary information.

1 Introduction

Many modern applications involve partially observed matrices where entries are missing not at random (MNAR). For example, in restaurant recommendation, consider a ratings matrix X ∈ (ℝ ∪ {⋆})^{m×n} where rows index users and columns index restaurants, and the entries of the matrix correspond to user-supplied restaurant ratings or "⋆" to indicate "missing". A user who is never in London is unlikely to go to and subsequently rate London restaurants, and a user who is vegan is unlikely to go to and rate restaurants that focus exclusively on meat. In particular, the entries in the ratings matrix are not revealed uniformly at random. As another example, in a health care context, the partially observed matrix X could instead have rows index patients and columns index medically relevant measurements such as latest readings from lab tests. Which measurements are taken is not uniform at random and depends, for instance, on what diseases the patients have. Matrix completion can be used in both examples: predicting missing entries in the recommendation context, or imputing missing features before possibly using the imputed feature vectors in a downstream prediction task.

*Equal contribution. Code available at https://github.com/georgehc/mnar_mc

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Missingness mask matrices (with rows indexing users and columns indexing items) exhibit low-rank block structure for the (a) Coat and (b) MovieLens-100k datasets. Black indicates an entry being revealed. For each dataset, we show the missingness mask matrix on the left and the corresponding block structure identified using spectral biclustering [Kluger et al., 2003] on the right; rows and columns have been rearranged based on the biclustering result.

The vast majority of existing theory on matrix completion assumes that entries are revealed with the same probability independently (e.g., Candès and Recht [2009], Cai et al. [2010], Keshavan et al. [2010a,b], Recht [2011], Chatterjee [2015], Song et al. [2016]). Recent approaches to handling entries being revealed with nonuniform probabilities have shown that estimating these entry revelation probabilities can substantially improve matrix completion accuracy on recommendation data [Liang et al., 2016, Schnabel et al., 2016, Wang et al., 2018a,b, 2019]. Specifically, these methods all involve estimating the matrix P ∈ [0,1]^{m×n}, where P_{u,i} is the probability of entry (u,i) being revealed for the partially observed matrix X. We refer to this matrix P as the propensity score matrix. By knowing (or having a good estimate of) P, we can debias a variety of existing matrix completion methods that do not account for MNAR entries [Schnabel et al., 2016].

In this paper, we focus on the problem of estimating the propensity score matrix P and examine how error in estimating P impacts downstream matrix completion accuracy. 
Existing work [Liang et al., 2016, Schnabel et al., 2016, Wang et al., 2018b, 2019] typically models entries of P as outputs of a simple predictor such as logistic regression or naive Bayes. In the generative modeling work by Liang et al. [2016] and Wang et al. [2018b], P is estimated as part of a larger Bayesian model, whereas in the work by Schnabel et al. [2016] and Wang et al. [2019] that debiases matrix completion via inverse probability weighting (e.g., Imbens and Rubin [2015]), P is estimated as a pre-processing step.

Rather than specifying parametric models for P, we instead hypothesize that in real data, P often has a particular low nuclear norm structure (precise details are given in Assumptions A1 and A2 in Section 3; special cases include P being low rank or having clustering structure in rows/columns). Thus, with enough rows and columns in the partially observed matrix X, we should be able to recover P from the missingness mask matrix M ∈ {0,1}^{m×n}, where M_{u,i} = 1{X_{u,i} ≠ ⋆}. For example, for two real datasets Coat [Schnabel et al., 2016] and MovieLens-100k [Harper and Konstan, 2016], their missingness mask matrices M (note that these are always fully-observed) have block structure, as shown in Figure 1, suggesting that they are well-modeled as being generated from a low rank P; with values of P bounded away from 0, such a low rank P is a special case of the general low nuclear norm structure we consider. In fact, the low rank missingness patterns of Coat and MovieLens-100k can be explained by topic modeling, as we illustrate in Appendix A.

We can recover propensity score matrix P from missingness matrix M using the existing 1-bit matrix completion algorithm by Davenport et al. [2014]. This algorithm, which we refer to as 1BITMC, solves a convex program that amounts to nuclear-norm-constrained maximum likelihood estimation. We remark that Davenport et al. 
developed their algorithm for matrix completion where entries are missing independently with the same probability and the revealed ratings are binary. We intentionally apply their algorithm instead to the matrix M of binary values, for which there are no missing entries. Thus, rather than completing a matrix, we use 1BITMC to denoise M to produce propensity score matrix estimate P̂. Then we use P̂ to help debias the actual matrix completion problem that we care about: completing the original partially observed matrix X.

Our contributions are as follows:

• We establish finite-sample bounds on the mean-squared error (MSE) for estimating propensity score matrix P using 1BITMC and also on its debiasing effect for standard MSE or mean absolute error (MAE) matrix completion losses (the debiasing is via weighting entries inversely by their estimated propensity scores).

• We empirically examine the effectiveness of using 1BITMC to estimate propensity score matrix P compared to logistic regression or naive Bayes baselines. In particular, we use the estimated propensity scores from these three methods to debias a variety of matrix completion algorithms, where we find that 1BITMC typically yields downstream matrix completion accuracy as good as or better than the other two methods. The 1BITMC-debiased variants of matrix completion algorithms often do better than their original unmodified counterparts and can outperform some existing matrix completion algorithms that handle nonuniformly sampled data.

2 Model and Algorithms

Model. Consider a signal matrix S ∈ ℝ^{m×n}, a noise matrix W ∈ ℝ^{m×n}, and a propensity score matrix P ∈ [0,1]^{m×n}. All three of these matrices are unknown. 
We observe the matrix X ∈ (ℝ ∪ {⋆})^{m×n}, where X_{u,i} = S_{u,i} + W_{u,i} with probability P_{u,i}, independent of everything else; otherwise X_{u,i} = ⋆, indicating that the entry is missing. We denote Ω to be the set of entries that are revealed (i.e., Ω = {(u,i) : u ∈ [m], i ∈ [n] s.t. X_{u,i} ≠ ⋆}), and we denote X* := S + W to be the noise-corrupted data if we had observed all the entries. Matrix completion aims to estimate S given X, exploiting some structural assumption on S (e.g., low nuclear norm, low rank, a latent variable model).

Debiasing matrix completion with inverse propensity scoring. Suppose we want to estimate S with low mean squared error (MSE). If no entries are missing so that we directly observe X*, then the MSE of an estimate Ŝ of S is

  L_full-MSE(Ŝ) := (1/(mn)) Σ_{u=1}^{m} Σ_{i=1}^{n} (Ŝ_{u,i} − X*_{u,i})².

However, we actually observe X, which in general has missing entries. The standard approach is to instead use the observed MSE:

  L_MSE(Ŝ) := (1/|Ω|) Σ_{(u,i)∈Ω} (Ŝ_{u,i} − X_{u,i})².

If the probability of every entry in X being revealed is the same (i.e., the matrix P consists of only one unique nonzero value), then the loss L_MSE(Ŝ) is an unbiased estimate of the loss L_full-MSE(Ŝ). However, this is no longer guaranteed to hold when entries are missing with different probabilities. To handle this more general setting, we can debias the loss L_MSE by weighting each observation inversely by its propensity score, a technique referred to as inverse propensity scoring (IPS) or inverse probability weighting in causal inference [Thompson, 2012, Imbens and Rubin, 2015, Little and Rubin, 2019, Schnabel et al., 2016]:

  L_IPS-MSE(Ŝ|P) := (1/(mn)) Σ_{(u,i)∈Ω} (Ŝ_{u,i} − X_{u,i})² / P_{u,i}.    (1)

Assuming P is known, the IPS loss L_IPS-MSE(Ŝ|P) is an unbiased estimate for L_full-MSE(Ŝ).²

Any matrix completion method that uses the naive MSE loss L_MSE can then be modified to instead use the unbiased loss L_IPS-MSE. For example, the standard approach of minimizing L_MSE with nuclear norm regularization can be modified where we instead solve the following convex program:

  Ŝ = argmin_{Γ ∈ ℝ^{m×n}} L_IPS-MSE(Γ|P) + λ‖Γ‖_*,    (2)

where λ > 0 is a user-specified parameter, and ‖·‖_* denotes the nuclear norm. Importantly, using the loss L_IPS-MSE requires either knowing or having an estimate for the propensity score matrix P.

Instead of squared error, we could look at other kinds of error such as absolute error, in which case we would consider MAE instead of MSE. Also, instead of nuclear norm, other regularizers could be used in optimization problem (2). Lastly, we remark that the inverse-propensity-scoring loss L_IPS-MSE is not the only way to use propensity scores to weight. Another example is the Self-Normalized Inverse Propensity Scoring (SNIPS) estimator [Trotter and Tukey, 1956, Swaminathan and Joachims, 2015], which replaces the denominator term mn in equation (1) by Σ_{(u,i)∈Ω} 1/P_{u,i}. This estimator tends to have lower variance than the IPS estimator but incurs a small bias [Hesterberg, 1995]. For ease of analysis, our theory focuses on debiasing with IPS. 

² Note that L_IPS-MSE(Ŝ|P) = (1/(mn)) Σ_{u=1}^{m} Σ_{i=1}^{n} 1{(u,i) ∈ Ω} (Ŝ_{u,i} − X*_{u,i})² / P_{u,i}. Taking the expectation with respect to which entries are revealed, E_Ω[L_IPS-MSE(Ŝ|P)] = (1/(mn)) Σ_{u=1}^{m} Σ_{i=1}^{n} P_{u,i} (Ŝ_{u,i} − X*_{u,i})² / P_{u,i} = L_full-MSE(Ŝ).
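To make the weighting concrete, the IPS loss in equation (1) and its self-normalized SNIPS variant can be written in a few lines of NumPy. The sketch below is our own illustration (function and variable names are ours, not from the paper's released code); `revealed` plays the role of the missingness mask M:

```python
import numpy as np

def ips_mse(S_hat, X, revealed, P):
    """IPS-weighted MSE: (1/(mn)) * sum over revealed (u,i) of
    (S_hat[u,i] - X[u,i])^2 / P[u,i]; unbiased for the fully-observed MSE."""
    m, n = X.shape
    sq_err = np.where(revealed, (S_hat - X) ** 2, 0.0)
    return float(np.sum(sq_err / P) / (m * n))

def snips_mse(S_hat, X, revealed, P):
    """SNIPS variant: replaces the mn denominator by the sum of 1/P[u,i]
    over revealed entries (lower variance, but a small bias)."""
    sq_err = np.where(revealed, (S_hat - X) ** 2, 0.0)
    return float(np.sum(sq_err / P) / np.sum(revealed / P))
```

When every entry is revealed and P ≡ 1, both functions reduce to the ordinary MSE; when the propensities are constant but less than 1, IPS and SNIPS differ only by the multiplicative constant discussed next.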
Algorithmically, for a given P, whether one uses IPS or SNIPS for estimating S in optimization problem (2) does not matter since they differ by a multiplicative constant; tuning regularization parameter λ would account for such constants. In experiments, for reporting test set errors, we use SNIPS since IPS can be quite sensitive to how many revealed entries are taken into consideration.

Estimating the propensity score matrix. We can estimate P based on the missingness mask matrix M ∈ {0,1}^{m×n}, where M_{u,i} = 1{X_{u,i} ≠ ⋆}. Specifically, we use the nuclear-norm-constrained maximum likelihood estimator proposed by Davenport et al. [2014] for 1-bit matrix completion, which we refer to as 1BITMC. The basic idea of 1BITMC is to model P as the result of applying a user-specified link function σ : ℝ → [0,1] to each entry of a parameter matrix A ∈ ℝ^{m×n} so that P_{u,i} = σ(A_{u,i}); σ can for instance be taken to be the standard logistic function σ(x) = 1/(1 + e^{−x}). Then we estimate A assuming that it satisfies nuclear norm and entry-wise max norm constraints, namely that

  A ∈ F_{τ,γ} := {Γ ∈ ℝ^{m×n} : ‖Γ‖_* ≤ τ√(mn), ‖Γ‖_max ≤ γ},

where τ > 0 and γ > 0 are user-specified parameters. Then 1BITMC is given as follows:

1. Solve the constrained Bernoulli maximum likelihood problem:

  Â = argmax_{Γ ∈ F_{τ,γ}} Σ_{u=1}^{m} Σ_{i=1}^{n} [M_{u,i} log σ(Γ_{u,i}) + (1 − M_{u,i}) log(1 − σ(Γ_{u,i}))].    (3)

For specific choices of σ such as the standard logistic function, this optimization problem is convex and can, for instance, be solved via projected gradient descent.

2. Construct the matrix P̂ ∈ [0,1]^{m×n}, where P̂_{u,i} := σ(Â_{u,i}).

3 Theoretical Guarantee

For P̂ computed via 1BITMC, our theory bounds how close P̂ is to P and also how close the IPS loss L_IPS-MSE(Ŝ|P̂) is to the fully-observed MSE L_full-MSE(Ŝ). We first state our assumptions on the propensity score matrix P and the partially observed matrix X. As introduced previously, 1BITMC models P via parameter matrix A ∈ ℝ^{m×n} and link function σ : ℝ → [0,1] such that P_{u,i} = σ(A_{u,i}). For ease of exposition, throughout this section, we take σ to be the standard logistic function: σ(x) = 1/(1 + e^{−x}). Following Davenport et al. [2014], we assume that:

A1. A has bounded nuclear norm: there exists a constant θ ∈ (0,∞) such that ‖A‖_* ≤ θ√(mn).

A2. Entries of A are bounded in absolute value: there exists a constant α ∈ (0,∞) such that ‖A‖_max := max_{u∈[m], i∈[n]} |A_{u,i}| ≤ α. In other words, P_{u,i} ∈ [σ(−α), σ(α)] for all u ∈ [m] and i ∈ [n], where σ is the standard logistic function.

As stated, Assumption A2 requires probabilities in P to be bounded away from both 0 and 1. With small changes to 1BITMC and our theoretical analysis, it is possible to allow for entries in P to be 1, i.e., propensity scores should be bounded from 0 but not necessarily from 1. 
We defer discussing this setting to Appendix C as the changes are somewhat technical; the resulting theoretical guarantee is qualitatively similar to our guarantee for 1BITMC below.

Assumptions A1 and A2 together are more general than assuming that A has low rank and has entries bounded in absolute value. In particular, when Assumption A2 holds and A has rank r ∈ (0, min{m,n}], then Assumption A1 holds with θ = α√r (since ‖A‖_* ≤ √r ‖A‖_F ≤ √(rmn) ‖A‖_max ≤ α√(rmn), where ‖·‖_F denotes the Frobenius norm). Note that a special case of A being low rank is A having clustering structure in rows, columns, or both. Thus, our theory also covers the case in which P has row/column clustering with entries bounded away from 0.

As for the partially observed matrix X, we assume that its values are bounded, regardless of which entries are revealed (so our assumption will be on X*, the version of X that is fully-observed):

A3. There exists a constant φ ∈ (0,∞) such that ‖X*‖_max ≤ φ.

For example, in a recommendation systems context where X is the ratings matrix, Assumption A3 holds if the ratings fall within a closed range of values (such as like/dislike where X_{u,i} ∈ {+1,−1} and φ = 1, or a rating out of five stars where X_{u,i} ∈ [1,5] and φ = 5).

For simplicity, we do not place assumptions on signal matrix S or noise matrix W aside from their sum X* having bounded entries. Different assumptions on S and W lead to different matrix completion algorithms. Many of these algorithms can be debiased using estimated propensity scores. We focus our theoretical analysis on this debiasing step and experimentally apply the debiasing to a variety of matrix completion algorithms. 
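As a quick numerical check of how Assumption A2 plus low rank implies Assumption A1 (a sketch of our own, not from the paper's released code), we can generate a random rank-r matrix A with bounded entries and verify the chain of inequalities ‖A‖_* ≤ √r ‖A‖_F ≤ √(rmn) ‖A‖_max, i.e., that A1 holds with θ = α√r:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 200, 300, 5

# Rank-(at most) r parameter matrix with entries bounded in absolute value (A2).
A = rng.uniform(-0.5, 0.5, size=(m, r)) @ rng.uniform(-0.5, 0.5, size=(r, n))

nuc = np.linalg.norm(A, 'nuc')   # nuclear norm ||A||_*
fro = np.linalg.norm(A, 'fro')   # Frobenius norm ||A||_F
alpha = np.abs(A).max()          # entry-wise max norm ||A||_max

assert nuc <= np.sqrt(r) * fro + 1e-8        # ||A||_* <= sqrt(r) ||A||_F
assert fro <= np.sqrt(m * n) * alpha + 1e-8  # ||A||_F <= sqrt(mn) ||A||_max

theta = alpha * np.sqrt(r)                   # so A1 holds with theta = alpha * sqrt(r)
assert nuc <= theta * np.sqrt(m * n) + 1e-8
```

The first inequality uses that A has at most r nonzero singular values; the second follows from bounding each entry by α.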
We remark that there are existing papers that discuss how to handle MNAR data when S is low rank and W consists of i.i.d. zero-mean Gaussian (or sub-Gaussian) noise, a setup related to principal component analysis (e.g., Sportisse et al. [2018, 2019], Zhu et al. [2019]; a comparative study is provided by Dray and Josse [2015]).

Our main result is as follows. We defer the proof to Appendix B.

Theorem 1. Under Assumptions A1–A3, suppose that we run algorithm 1BITMC with user-specified parameters satisfying τ ≥ θ and γ ≥ α to obtain the estimate P̂ of propensity score matrix P. Let Ŝ ∈ ℝ^{m×n} be any matrix satisfying ‖Ŝ‖_max ≤ ψ for some ψ ≥ φ. Let δ ∈ (0,1). Then there exists a universal constant C > 0 such that provided that m + n ≥ C, with probability at least 1 − C/(m+n) − δ over randomness in which entries are revealed in X, we simultaneously have

  (1/(mn)) Σ_{u=1}^{m} Σ_{i=1}^{n} (P̂_{u,i} − P_{u,i})² ≤ 4eτ (1/√m + 1/√n),    (4)

  |L_IPS-MSE(Ŝ|P̂) − L_full-MSE(Ŝ)| ≤ (8ψ²√(eτ))/(σ(−γ)σ(−α)) (1/m^{1/4} + 1/n^{1/4}) + (4ψ²)/(σ(−α)) √((1/(2mn)) log(2/δ)).    (5)

This theorem implies that under Assumptions A1–A3, with the number of rows and columns going to infinity, the IPS loss L_IPS-MSE(Ŝ|P̂) with P̂ computed using the 1BITMC algorithm is a consistent estimator for the fully-observed MSE loss L_full-MSE(Ŝ).

We remark that our result easily extends to using MAE instead of MSE. If we define

  L_full-MAE(Ŝ) := (1/(mn)) Σ_{u=1}^{m} Σ_{i=1}^{n} |Ŝ_{u,i} − X*_{u,i}|,

  L_IPS-MAE(Ŝ|P) := (1/(mn)) Σ_{(u,i)∈Ω} |Ŝ_{u,i} − X_{u,i}| / P_{u,i},

then the MAE version of Theorem 1 would replace (5) with

  |L_IPS-MAE(Ŝ|P̂) − L_full-MAE(Ŝ)| ≤ (4ψ√(eτ))/(σ(−γ)σ(−α)) (1/m^{1/4} + 1/n^{1/4}) + (2ψ)/(σ(−α)) √((1/(2mn)) log(2/δ)).

Equations (2) and (3) both correspond to convex programs that can be efficiently solved via proximal gradient methods [Parikh and Boyd, 2014]. Hence we can find a Ŝ that minimizes L_IPS-MSE(Ŝ|P̂), and it is straightforward to show that when m, n → ∞, this Ŝ also minimizes L_full-MSE(Ŝ) since |L_IPS-MSE(Ŝ|P̂) − L_full-MSE(Ŝ)| → 0.

4 Experiments

We now assess how well 1BITMC debiases matrix completion algorithms on synthetic and real data.

4.1 Synthetic Data

Data. We create two synthetic datasets that are intentionally catered toward propensity scores being well-explained by naive Bayes and logistic regression. 1) MovieLoverData: the dataset comes from the Movie-Lovers toy example (Figure 1 in Schnabel et al. 
[2016], which is based on Table 1 of Steck [2010]), where we set parameter p = 0.5; 2) UserItemData: for the second dataset, the "true" rating matrix and propensity score matrix are generated by the following steps. We generate U₁ ∈ [0,1]^{m×20}, V₁ ∈ [0,1]^{n×20} by sampling entries i.i.d. from Uniform[0,1], and then form S̃ = U₁V₁⊤. We scale the values of S̃ to be from 1 to 5 and round to the nearest integer to produce the true ratings matrix S. Next, we generate row and column feature vectors U₂ ∈ ℝ^{m×20}, V₂ ∈ ℝ^{n×20} by sampling entries i.i.d. from a normal distribution N(0, 1/64). We further generate w₁ ∈ [0,1]^{20×1}, w₂ ∈ [0,1]^{20×1} by sampling entries i.i.d. from Uniform[0,1]. Then we form the propensity score matrix P ∈ [0,1]^{m×n} by setting P_{u,i} = σ(U₂[u]w₁ + V₂[i]w₂), where σ is the standard logistic function, and U₂[u] denotes the u-th row of U₂. For both datasets, we set m = 200, n = 300. We also assume that i.i.d. noise N(0,1) is added to each matrix entry of signal matrix S in producing the partially revealed matrix X. All the ratings are clipped to [1,5] and rounded. By sampling based on P, we generate Ω, the training set indices. The true ratings matrix S is used for testing. We briefly explain why Assumptions A1–A3 hold for these two datasets in Appendix D.

Algorithms comparison. We compare two types of algorithms for matrix completion. The first type does not account for entries being MNAR. This type of algorithm includes Probabilistic Matrix Factorization (PMF) [Mnih and Salakhutdinov, 2008], Funk's SVD [Funk, 2006], SVD++ [Koren, 2008], and SOFTIMPUTE [Mazumder et al., 2010]. 
The second type accounts for MNAR entries and includes max-norm-constrained matrix completion (MAXNORM) [Cai and Zhou, 2016], EXPOMF [Liang et al., 2016], and weighted-trace-norm-regularized matrix completion (WTN) [Srebro and Salakhutdinov, 2010]. For all the algorithms above (except for EXPOMF), the ratings in the squared error loss can be debiased by the propensity scores (as shown in equation (1)), and the propensity scores can be estimated from logistic regression (LR) (which requires extra user or item feature data), naive Bayes (NB) (specifically equation (18) of Schnabel et al. [2016], which requires a small set of missing at random (MAR) ratings), and 1BITMC [Davenport et al., 2014]. Hence we have a series of weighted variants of the existing algorithms. For example, 1BITMC-PMF means that the PMF method is used and that the inverse propensity scores estimated from 1BITMC are used as weights for debiasing.

Metrics. We use MSE and MAE to measure the estimation quality of the propensity scores. Similarly, we also use MSE and MAE to compare the estimated full rating matrix with the true rating matrix S (denoted as full-MSE or full-MAE). We also report SNIPS-MSE (SNIPS-MAE); these are evaluated on test set entries (i.e., all matrix entries in these synthetic datasets) using the true P.

Experiment setup. For all algorithms, we tune hyperparameters through 5-fold cross-validation using grid search. For the debiased methods (LR-⋆, NB-⋆, 1BITMC-⋆), we first estimate the propensity score matrix and then optimize the debiased loss. We note that MovieLoverData does not contain user/item features. Thus, naive Bayes can be used to estimate P for both MovieLoverData and UserItemData, while logistic regression is only applicable for UserItemData. In using logistic regression to estimate propensity scores, we can use all user/item features, only user features, or only item features (denoted as LR, LR-U, LR-I, respectively). 
Per dataset, we generate P and S once before generating 10 samples of noisy revealed ratings X based on P and S. We apply all the algorithms stated above to these 10 experimental repeats.

Results. Before looking at the performance of matrix completion methods, we first inspect the accuracy of the estimated propensity scores. Since we know the true propensity scores for the synthetic datasets, we can compare the true P with the estimated P̂ directly, as presented in Table 1. Given how we constructed the synthetic datasets, unsurprisingly naive Bayes on MovieLoverData and logistic regression on UserItemData achieve the best performance at estimating propensity scores. In both cases, 1BITMC still achieves reasonably low errors in estimating the propensity score matrices.

Algorithm   | MovieLoverData MSE | MovieLoverData MAE | UserItemData MSE | UserItemData MAE
Naive Bayes | 0.0346 ± 0.0002    | 0.1665 ± 0.0007    | 0.0150 ± 0.0001  | 0.0990 ± 0.0005
LR          | N/A                | N/A                | 0.0002 ± 0.0001  | 0.0105 ± 0.0017
LR-U        | N/A                | N/A                | 0.0070 ± 0.0000  | 0.0667 ± 0.0002
LR-I        | N/A                | N/A                | 0.0065 ± 0.0000  | 0.0639 ± 0.0001
1BITMC      | 0.0520 ± 0.0003    | 0.1724 ± 0.0006    | 0.0119 ± 0.0000  | 0.0881 ± 0.0002

Table 1: Estimation accuracy of propensity score matrix (average ± standard deviation across 10 experimental repeats).

Now we compare the matrix completion methods directly and report the performance of different methods in Table 2. Note that we only show the MSE-based results; the MAE-based results are presented in Appendix D.

Algorithm          | MovieLoverData MSE | MovieLoverData SNIPS-MSE | UserItemData MSE | UserItemData SNIPS-MSE
PMF                | 0.326 ± 0.042 | 0.325 ± 0.041 | 0.161 ± 0.002 | 0.160 ± 0.002
NB-PMF             | 0.363 ± 0.013 | 0.363 ± 0.012 | 0.144 ± 0.002 | 0.145 ± 0.002
LR-PMF             | N/A           | N/A           | 0.159 ± 0.002 | 0.164 ± 0.003
1BITMC-PMF         | 0.299 ± 0.014 | 0.299 ± 0.013 | 0.146 ± 0.002 | 0.146 ± 0.002
SVD                | 1.359 ± 0.033 | 1.360 ± 0.034 | 0.139 ± 0.001 | 0.139 ± 0.001
NB-SVD             | 0.866 ± 0.028 | 0.866 ± 0.027 | 0.147 ± 0.001 | 0.147 ± 0.002
LR-SVD             | N/A           | N/A           | 0.147 ± 0.001 | 0.152 ± 0.002
1BITMC-SVD         | 0.861 ± 0.028 | 0.862 ± 0.028 | 0.139 ± 0.001 | 0.139 ± 0.001
SVD++              | 0.343 ± 0.023 | 0.343 ± 0.021 | 0.140 ± 0.001 | 0.140 ± 0.001
NB-SVD++           | 0.968 ± 0.020 | 0.987 ± 0.020 | 0.152 ± 0.002 | 0.153 ± 0.002
LR-SVD++           | N/A           | N/A           | 0.154 ± 0.001 | 0.160 ± 0.002
1BITMC-SVD++       | 0.345 ± 0.023 | 0.345 ± 0.021 | 0.140 ± 0.001 | 0.140 ± 0.001
SOFTIMPUTE         | 0.374 ± 0.009 | 0.374 ± 0.008 | 0.579 ± 0.002 | 0.556 ± 0.003
NB-SOFTIMPUTE      | 0.495 ± 0.010 | 0.495 ± 0.009 | 0.599 ± 0.004 | 0.588 ± 0.004
LR-SOFTIMPUTE      | N/A           | N/A           | 0.602 ± 0.003 | 0.581 ± 0.004
1BITMC-SOFTIMPUTE  | 0.412 ± 0.011 | 0.412 ± 0.010 | 0.588 ± 0.002 | 0.564 ± 0.003
MAXNORM            | 0.674 ± 0.052 | 0.674 ± 0.053 | 0.531 ± 0.002 | 0.507 ± 0.002
NB-MAXNORM         | 0.371 ± 0.050 | 0.371 ± 0.049 | 0.541 ± 0.006 | 0.520 ± 0.007
LR-MAXNORM         | N/A           | N/A           | 0.544 ± 0.004 | 0.521 ± 0.005
1BITMC-MAXNORM     | 0.396 ± 0.036 | 0.395 ± 0.035 | 0.542 ± 0.003 | 0.519 ± 0.004
WTN                | 3.791 ± 0.032 | 3.790 ± 0.035 | 0.551 ± 0.002 | 0.528 ± 0.002
NB-WTN             | 3.262 ± 0.093 | 3.262 ± 0.094 | 0.557 ± 0.002 | 0.535 ± 0.002
LR-WTN             | N/A           | N/A           | 0.553 ± 0.002 | 0.532 ± 0.002
1BITMC-WTN         | 3.788 ± 0.039 | 3.787 ± 0.042 | 0.551 ± 0.002 | 0.528 ± 0.002
EXPOMF             | 0.820 ± 0.005 | 0.822 ± 0.005 | 1.170 ± 0.008 | 1.218 ± 0.009

Table 2: MSE-based metrics of matrix completion methods on synthetic datasets (average ± standard deviation across 10 experimental repeats).

The debiased variants generally perform as well as or better than their original unmodified counterparts. 1BITMC-PMF achieves the best accuracy on MovieLoverData, and both SVD and 1BITMC-SVD perform the best on UserItemData. The debiasing using 1BITMC can improve the performance of PMF, SVD, MAXNORM, and WTN on MovieLoverData, and PMF is improved on UserItemData. In general, debiasing using 1BITMC leads to higher matrix completion accuracy than debiasing using LR and NB.

4.2 Real-World Data

Data. We consider two real-world datasets. 1) Coat: the dataset contains ratings from 290 users on 300 items [Schnabel et al., 2016]. The dataset contains both MNAR ratings as well as MAR ratings. Both user and item features are available for this dataset. 2) MovieLens-100k: the dataset contains 100k ratings from 943 users on 1,682 movies, and it does not contain any MAR ratings [Harper and Konstan, 2016].

Experiment setup. Since the Coat dataset contains both MAR and MNAR data, we are able to train the algorithms on the MNAR data and test on the MAR data. In this way, the MSE (MAE) on the testing set directly reflects the matrix completion accuracy. For MovieLens-100k, we split the data into 90/10 train/test sets 10 times. For both datasets, we use 5-fold cross-validation to tune the hyperparameters through grid search. 
The SNIPS-related measures are computed on the test data using propensities estimated by 1BITMC-PMF on the training data.

                      Coat              MovieLens-100k
Algorithm             MSE    SNIPS-MSE  MSE            SNIPS-MSE
PMF                   1.000  1.051      0.896 ± 0.013  0.902 ± 0.013
NB-PMF                1.034  1.117      N/A            N/A
LR-PMF                1.025  1.107      N/A            N/A
1BITMC-PMF            0.999  1.052      0.845 ± 0.012  0.853 ± 0.011
SVD                   1.203  1.270      0.862 ± 0.013  0.872 ± 0.012
NB-SVD                1.246  1.346      N/A            N/A
LR-SVD                1.234  1.334      N/A            N/A
1BITMC-SVD            1.202  1.272      0.821 ± 0.011  0.832 ± 0.011
SVD++                 1.208  1.248      0.838 ± 0.013  0.849 ± 0.012
NB-SVD++              1.488  1.608      N/A            N/A
LR-SVD++              1.418  1.532      N/A            N/A
1BITMC-SVD++          1.248  1.274      0.833 ± 0.012  0.843 ± 0.011
SOFTIMPUTE            1.064  1.150      0.929 ± 0.015  0.950 ± 0.015
NB-SOFTIMPUTE         1.052  1.138      N/A            N/A
LR-SOFTIMPUTE         1.069  1.156      N/A            N/A
1BITMC-SOFTIMPUTE     0.998  1.078      0.933 ± 0.014  0.953 ± 0.014
MAXNORM               1.168  1.263      0.911 ± 0.011  0.925 ± 0.011
NB-MAXNORM            1.460  1.578      N/A            N/A
LR-MAXNORM            1.537  1.662      N/A            N/A
1BITMC-MAXNORM        1.471  1.590      0.977 ± 0.017  0.992 ± 0.019
WTN                   1.396  1.509      0.939 ± 0.013  0.952 ± 0.013
NB-WTN                1.329  1.437      N/A            N/A
LR-WTN                1.340  1.448      N/A            N/A
1BITMC-WTN            1.396  1.509      0.934 ± 0.013  0.946 ± 0.013
EXPOMF                2.602  2.813      2.461 ± 0.077  2.558 ± 0.083

Table 3: MSE-based metrics of matrix completion methods on Coat and MovieLens-100k (results for MovieLens-100k are averages ± standard deviations across 10 experimental repeats).

Results. The performance of each algorithm is presented in Table 3. We report the MSE-based results; MAE results are in Appendix D. 1BITMC-SOFTIMPUTE and 1BITMC-PMF perform the best on Coat in terms of MSE, and 1BITMC-SVD outperforms the rest on MovieLens-100k.
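As a concrete illustration of the SNIPS-based evaluation, the following sketch (under the standard self-normalized inverse-propensity-scoring definition, not the authors' exact code) weights each revealed entry's squared error by the reciprocal of its estimated reveal probability and normalizes by the total weight rather than the entry count.

```python
import numpy as np

def snips_mse(X_true, X_pred, mask, propensities):
    """Self-normalized inverse-propensity (SNIPS) estimate of MSE.

    Each revealed entry's squared error is weighted by the reciprocal of its
    estimated reveal probability; normalizing by the summed weights (rather
    than by the number of revealed entries) gives the self-normalized estimator.
    """
    w = mask / propensities                 # inverse-propensity weights
    return np.sum(w * (X_true - X_pred) ** 2) / np.sum(w)
```

With uniform propensities this reduces to the ordinary MSE over revealed entries; under MNAR missingness, the weighting upweights errors on entries that were unlikely to be revealed.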
The debiasing approach does not improve the accuracy of MAXNORM and WTN.

5 Conclusions

In this paper, we examined the effectiveness of debiasing matrix completion algorithms using missingness probabilities (propensity scores) estimated via another matrix completion algorithm: 1BITMC by Davenport et al. [2014], which relies on low nuclear norm structure and which we apply to a fully revealed missingness mask matrix (so we are performing matrix denoising rather than completion). Our numerical experiments indicate that debiasing using 1BITMC achieves downstream matrix completion accuracy at least as good as debiasing using the logistic regression and naive Bayes baselines, despite 1BITMC not using auxiliary information such as row/column feature vectors. Moreover, debiasing matrix completion algorithms with 1BITMC can boost accuracy, in some cases achieving the best or nearly the best performance across all algorithms we tested. These experimental findings suggest that a low nuclear norm assumption on missingness patterns is reasonable.

In terms of theoretical analysis, we have not addressed the full generality of MNAR data in matrix completion. For example, we still assume that each entry is revealed independently of the other entries. In reality, one matrix entry being revealed could increase (or decrease) the chance of another entry being revealed. As another loose end, our theory breaks down when a missingness probability is exactly 0. For example, consider when the matrix to be completed corresponds to feature vectors collected from patients. A clinical measurement that only makes sense for women has 0 probability of being revealed for men; in such scenarios, imputing the missing value does not actually make sense. These are two of the many open problems in robustly handling MNAR data with guarantees.

References

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart.
Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.

T. Tony Cai and Wen-Xin Zhou. Matrix completion via max-norm constrained optimization. Electronic Journal of Statistics, 10(1):1493–1525, 2016.

Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.

Sourav Chatterjee. Matrix estimation by universal singular value thresholding. The Annals of Statistics, 43(1):177–214, 2015.

Mark A. Davenport, Yaniv Plan, Ewout Van Den Berg, and Mary Wootters. 1-bit matrix completion. Information and Inference, 3(3):189–223, 2014.

Stéphane Dray and Julie Josse. Principal component analysis with missing values: a comparative survey of methods. Plant Ecology, 216(5):657–667, 2015.

Simon Funk. Netflix update: Try this at home, 2006.

F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2016.

Tim Hesterberg. Weighted average importance sampling and defensive mixture distributions. Technometrics, 37(2):185–194, 1995.

Guido W. Imbens and Donald B. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.

Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998, 2010a.

Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from noisy entries. Journal of Machine Learning Research, 11(Jul):2057–2078, 2010b.

Yuval Kluger, Ronen Basri, Joseph T. Chang, and Mark Gerstein.
Spectral biclustering of microarray data: Coclustering genes and conditions. Genome Research, 13(4):703–716, 2003.

Yehuda Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 426–434. ACM, 2008.

Dawen Liang, Laurent Charlin, James McInerney, and David M. Blei. Modeling user exposure in recommendation. In Proceedings of the 25th International Conference on World Wide Web, pages 951–961. International World Wide Web Conferences Steering Committee, 2016.

Roderick J. A. Little and Donald B. Rubin. Statistical Analysis with Missing Data. Wiley, 3rd edition, 2019.

Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11(Aug):2287–2322, 2010.

Andriy Mnih and Ruslan R. Salakhutdinov. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, pages 1257–1264, 2008.

Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends® in Optimization, 1(3):127–239, 2014.

Benjamin Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12(Dec):3413–3430, 2011.

Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. Recommendations as treatments: Debiasing learning and evaluation. In International Conference on Machine Learning, pages 1670–1679, 2016.

Yoav Seginer. The expected norm of random matrices. Combinatorics, Probability and Computing, 9(2):149–166, 2000.

Dogyoon Song, Christina E. Lee, Yihua Li, and Devavrat Shah. Blind regression: Nonparametric regression for latent variable models via collaborative filtering.
In Advances in Neural Information Processing Systems, pages 2155–2163, 2016.

Aude Sportisse, Claire Boyer, and Julie Josse. Imputation and low-rank estimation with missing non at random data. arXiv preprint arXiv:1812.11409, 2018.

Aude Sportisse, Claire Boyer, and Julie Josse. Estimation and imputation in probabilistic principal component analysis with missing not at random data. 2019.

Nathan Srebro and Ruslan R. Salakhutdinov. Collaborative filtering in a non-uniform world: Learning with the weighted trace norm. In Advances in Neural Information Processing Systems, pages 2056–2064, 2010.

Harald Steck. Training and testing of recommender systems on data missing not at random. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 713–722. ACM, 2010.

Adith Swaminathan and Thorsten Joachims. The self-normalized estimator for counterfactual learning. In Advances in Neural Information Processing Systems, pages 3231–3239, 2015.

Steven K. Thompson. Sampling. Wiley, 3rd edition, 2012.

Hale F. Trotter and John W. Tukey. Conditional Monte Carlo for normal samples. In Symposium on Monte Carlo Methods, pages 64–79, 1956.

Menghan Wang, Mingming Gong, Xiaolin Zheng, and Kun Zhang. Modeling dynamic missingness of implicit feedback for recommendation. In Advances in Neural Information Processing Systems, pages 6669–6678, 2018a.

Menghan Wang, Xiaolin Zheng, Yang Yang, and Kun Zhang. Collaborative filtering with social exposure: A modular approach to social recommendation. In AAAI Conference on Artificial Intelligence, 2018b.

Xiaojie Wang, Rui Zhang, Yu Sun, and Jianzhong Qi. Doubly robust joint learning for recommendation on data missing not at random. In International Conference on Machine Learning, 2019.

Ziwei Zhu, Tengyao Wang, and Richard J. Samworth.
High-dimensional principal component analysis with heterogeneous missingness. arXiv preprint arXiv:1906.12125, 2019.