{"title": "A Sparse Interactive Model for Matrix Completion with Side Information", "book": "Advances in Neural Information Processing Systems", "page_first": 4071, "page_last": 4079, "abstract": "Matrix completion methods can benefit from side information besides the partially observed matrix. The use of side features describing the row and column entities of a matrix has been shown to reduce the sample complexity for completing the matrix. We propose a novel sparse formulation that explicitly models the interaction between the row and column side features to approximate the matrix entries. Unlike early methods, this model does not require the low-rank condition on the model parameter matrix. We prove that when the side features can span the latent feature space of the matrix to be recovered, the number of observed entries needed for an exact recovery is $O(\\log N)$ where $N$ is the size of the matrix. When the side features are corrupted latent features of the matrix with a small perturbation, our method can achieve an $\\epsilon$-recovery with $O(\\log N)$ sample complexity, and maintains a $\\O(N^{3/2})$ rate similar to classfic methods with no side information. An efficient linearized Lagrangian algorithm is developed with a strong guarantee of convergence. Empirical results show that our approach outperforms three state-of-the-art methods both in simulations and on real world datasets.", "full_text": "A Sparse Interactive Model for Matrix Completion\n\nwith Side Information\n\nJin Lu\n\nGuannan Liang\n\nJiangwen Sun\n\nJinbo Bi\n\n{jin.lu, guannan.liang, jiangwen.sun, jinbo.bi}@uconn.edu\n\nUniversity of Connecticut\n\nStorrs, CT 06269\n\nAbstract\n\nMatrix completion methods can bene\ufb01t from side information besides the partial-\nly observed matrix. The use of side features that describe the row and column\nentities of a matrix has been shown to reduce the sample complexity for complet-\ning the matrix. 
We propose a novel sparse formulation that explicitly models the interaction between the row and column side features to approximate the matrix entries. Unlike earlier methods, this model does not require the low-rank condition on the model parameter matrix. We prove that when the side features span the latent feature space of the matrix to be recovered, the number of observed entries needed for an exact recovery is O(log N) where N is the size of the matrix. If the side features are corrupted latent features of the matrix with a small perturbation, our method can achieve an ε-recovery with O(log N) sample complexity. If side information is useless, our method maintains an O(N^{3/2}) sampling rate similar to classic methods. An efficient linearized Lagrangian algorithm is developed with a convergence guarantee. Empirical results show that our approach outperforms three state-of-the-art methods both in simulations and on real world datasets.\n\n1 Introduction\n\nMatrix completion has been a basis of many machine learning approaches for computer vision [6], recommender systems [21, 24], signal processing [19, 27], among many others. Classically, low-rank matrix completion methods are based on matrix decomposition techniques which require only the partially observed data in the matrix [15, 3, 14], solving the following problem\n\nmin_E ‖E‖∗, subject to R_Ω(E) = R_Ω(F),   (1)\n\nwhere F ∈ R^{m×n} is the partially observed low-rank matrix (with a rank of r) that needs to be recovered, Ω ⊆ {1, ..., m} × {1, ..., n} is the set of indices where the corresponding entries of F are observed, the mapping R_Ω(M): R^{m×n} → R^{m×n} gives another matrix whose (i, j)-th entry is M_{i,j} if (i, j) ∈ Ω (or 0 otherwise), and ‖E‖∗ is the nuclear norm of E. 
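To make the notation of problem (1) concrete, the mask operator R_Ω and the nuclear norm can be sketched in a few lines of numpy (the function names are ours, for illustration only):

```python
import numpy as np

def R_Omega(M, Omega):
    """Mask operator R_Omega: keep entries with indices in Omega, zero out the rest."""
    out = np.zeros_like(M)
    rows, cols = zip(*Omega)
    out[rows, cols] = M[rows, cols]
    return out

def nuclear_norm(M):
    """||M||_*: the sum of the singular values of M."""
    return np.linalg.svd(M, compute_uv=False).sum()

# Toy rank-2 matrix with a roughly 60%-observed entry set
rng = np.random.default_rng(0)
F = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 5))
Omega = [(i, j) for i in range(6) for j in range(5) if rng.random() < 0.6]
# Problem (1) seeks the minimum-nuclear-norm E agreeing with F on Omega
observed = R_Omega(F, Omega)
```

Solvers for (1) repeatedly combine these two ingredients, e.g., via singular value thresholding [3].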
Early theoretical analysis [4, 5, 20] proves that O(N r log^2 N) entries are sufficient for an exact recovery if the observed entries are uniformly sampled at random, where N = max{n, m}.\n\nRecent studies have started to explore side information for matrix completion and factorization [1, 18, 7, 17, 8]. For example, to infer the missing ratings in a user-movie rating matrix, descriptors of the users and movies are often known and may help to build a content-based recommender system. For instance, kids tend to like cartoons, so the age of a user likely interacts with the cartoon feature of a movie. When few ratings are known, this side information could be the main source for completing the matrix. Although several works found in empirical studies that side features are helpful [17, 18], those methods are based on non-convex matrix factorization formulations without any theoretical guarantees. Three recent methods have focused on convex nuclear-norm regularized objectives, which lead to theoretical guarantees on matrix recovery [13, 28, 9, 16]. These methods all construct an inductive model X^T G Y so that R_Ω(X^T G Y) = R_Ω(F), where the side matrices X and Y consist of side features, respectively, for the row entities (e.g., users) and column entities (e.g., movies) of a (rating) matrix.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nThis inductive model has a parameter matrix G which is either required to be low rank [13] or to have a minimal nuclear norm ‖G‖∗ [28]. Recovering G of a (usually) smaller size is argued to be easier than directly recovering the matrix F. Under a very strong assumption of 'perfect' side information, i.e., both X and Y are orthonormal matrices and lie respectively in the latent column and row space of the matrix F, the method in [28] is proved to require a much reduced sample complexity of O(log N) for an exact recovery of F. 
Because most side features X and Y are not perfect in practice, a very recent work [9] proposes to use a residual matrix N to handle noisy side features. This method constructs an inductive model X^T G Y + N to approximate F and requires both G and N to be low rank, or to have a low nuclear norm. It uses the nuclear norm of the residual to quantify the usefulness of side information, and proves an O(log N) sampling rate for an ε-recovery when X and Y span the full latent feature space of F, and o(N) sample complexity when X and Y contain corrupted latent features of F. An ε-recovery means that the expected discrepancy between the predicted matrix and the true matrix is less than an arbitrarily small ε > 0 with a certain probability.\n\nIn this paper, we propose a new method for matrix recovery by constructing a sparse interactive model X^T G Y to approximate F, where G can be sparse but does not need to be low rank. The (i, j)-th element of G determines the role of the interaction between the i-th feature of users and the j-th feature of products. The low-rank property of F is commonly assumed to characterize the observation that similar users tend to rate similar products similarly [4]. When using an inductive approximation F = X^T G Y, rank(F) ≤ rank(G), so a low-rank requirement on G is a sufficient condition for F to be low rank. Previous relevant methods [13, 28, 9] all impose the low-rank condition on G, which is, however, not a necessary condition for F to be low rank (it only becomes necessary when X and Y are full rank). Given general side matrices X ∈ R^{d1×m} and Y ∈ R^{d2×n} where the numbers of features d1, d2 ≪ N, limiting the interactive model G ∈ R^{d1×d2} to be low rank can be an over-restrictive constraint. 
In our model, we use a low-rank matrix E to directly approximate F and then estimate E from the interactive model of X and Y with a sparse regularizer on G. We show empirically that a low-rank F can be recovered from a corresponding full (or high) rank G. Our contributions are summarized as follows:\n(i) We propose a new formulation that estimates both E and G by imposing a nuclear-norm constraint on E but a general regularizer on G, e.g., the sparse regularizer ‖G‖_1. The proposed model has recovery guarantees depending on the quality of the side features: (1) when X and Y are full row rank and span the entire latent feature space of F (but are not required to satisfy the much stronger condition of being orthonormal as in [28]), O(log N) observations are still sufficient for our method to achieve an exact recovery of F. (2) When the side matrices are not full rank and are corrupted versions of the original latent features of F, i.e., X and Y do not contain a sufficient basis to exactly recover F, O(log N) observed entries can be sufficient for an ε-recovery.\n(ii) A new linearized alternating direction method of multipliers (LADMM) is developed to efficiently solve the proposed formulation. Existing methods that use side information are solved by standard block-wise coordinate descent algorithms, which are guaranteed to converge to a global solution only when each block-wise subproblem has a unique solution [26]. Our LADMM has a stronger convergence property [29] and benefits from the linear convergence rate of ADMM [11, 23].\n(iii) Prior methods focus on the recovery of F, and little light has been shed on whether the interactive model G can be retrieved. Because of the explicit use of E and G, our method aims to directly recover both. The unique G in the case of exact recovery of F can be attained by our algorithm. 
When G is not unique in the ε-recovery case, our algorithm converges to a point in the optimal solution set.\n\n2 The Proposed Interactive Model\n\nTo utilize the side information in X and Y to complete F, we consider building a predictive model from the observed components that predicts the missing ones. One can simply build a linear model: f = x^T u + y^T v + g, where x and y are the feature vectors respectively for a user and a product, and u, v and g are model parameters. In real life applications, interactive terms between the features in X and Y can be very important. For example, male users tend to rate science fiction and action movies higher than female users do, which can be informative when predicting their ratings. Therefore, a linear model with no interactive terms can be too simple and have low predictive power for missing entries. We hence add interactive terms by introducing an interaction matrix H ∈ R^{d1×d2} into the predictive model, which can be written as: f = x^T H y + x^T u + y^T v + g. By defining x̄ = [x^T 1]^T, ȳ = [y^T 1]^T and G ∈ R^{(a=d1+1)×(b=d2+1)} = [H u; v^T g], the above model can be simplified to: f = x̄^T G ȳ. The following optimization problem can be solved to obtain the model parameter G:\n\nmin_{G,E} g(G) + λ_E ‖E‖∗, subject to X̄^T G Ȳ = E, R_Ω(E) = R_Ω(F),\n\nwhere E is a completed version of F, X̄ ∈ R^{a×m} and Ȳ ∈ R^{b×n} are the two matrices created by augmenting a row of all ones to X and to Y, respectively, and g(G) and ‖E‖∗ are used to incorporate the (sparsity) prior on G and the low-rank prior on E. Because the side information can be noisy and not all the features and their interactions are helpful for predicting F, a sparse G is often expected. Our implementation uses g(G) = ‖G‖_1. 
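As a sanity check on this construction (a minimal sketch; the helper names are ours), the block matrix G = [H u; v^T g] applied to the augmented features reproduces all four terms of the model:

```python
import numpy as np

def pack_G(H, u, v, g):
    """Assemble G = [[H, u], [v^T, g]] so that xbar^T G ybar
    equals x^T H y + x^T u + y^T v + g."""
    top = np.hstack([H, u.reshape(-1, 1)])
    bottom = np.hstack([v.reshape(1, -1), [[g]]])
    return np.vstack([top, bottom])

def predict(G, x, y):
    """f = xbar^T G ybar with augmented features xbar = [x; 1], ybar = [y; 1]."""
    return np.append(x, 1.0) @ G @ np.append(y, 1.0)

H = np.array([[1.0, 2.0]])                                 # d1 = 1, d2 = 2
u, v, g = np.array([0.5]), np.array([1.0, -1.0]), 2.0
x, y = np.array([3.0]), np.array([1.0, 2.0])
G = pack_G(H, u, v, g)
# identical to evaluating the four model terms separately
assert np.isclose(predict(G, x, y), x @ H @ y + x @ u + y @ v + g)
```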
It is natural to impose the low-rank requirement on E because it is a completed version of the low-rank matrix F. The tuning parameter λ_E balances the two priors in the objective.\n\nWithout loss of generality and for convenience of notation, we simply use X and Y to denote the augmented matrices. Denote the Frobenius norm of a matrix by ‖·‖_F. To account for Gaussian noise, we relax the equality constraint X^T G Y = E, replace it by minimizing the squared residual ‖X^T G Y − E‖_F^2, and solve the following convex optimization problem to obtain G and E:\n\nmin_{G,E} (1/2) ‖X^T G Y − E‖_F^2 + λ_G g(G) + λ_E ‖E‖∗, subject to R_Ω(E) = R_Ω(F),   (2)\n\nwhere λ_G is another tuning parameter that, together with λ_E, balances the three terms in the objective. In particular, the regularizer g(·) in our theoretical analysis can be any matrix norm that satisfies ‖M‖∗ ≤ C g(M), ∀M, for a constant C; for instance, g(·) can be ‖G‖_1, ‖G‖_F, or ‖G‖_2. Throughout this paper, the matrices X (and Y) refer to either the original X ∈ R^{d1×m} (and Y ∈ R^{d2×n}) or the augmented X̄ ∈ R^{a×m} (and Ȳ ∈ R^{b×n}), depending on the user-specified model.\n\nOur formulation (2) differs in several ways from existing methods that make use of side information for matrix completion. Existing methods [28, 13, 9] solve the problem by finding Ĥ that minimizes ‖H‖∗ subject to R_Ω(X^T H Y) = R_Ω(F), but we expand the model to include the linear terms within the interactive model. The proposed model adds the flexibility to consider both linear and quadratic interactive terms, and allows the algorithm to determine the terms that should be used in the model by enforcing sparsity in H (or G). 
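For concreteness, the objective and constraint of (2) with g(G) = ‖G‖_1 can be evaluated as follows (a sketch with names of our own choosing):

```python
import numpy as np

def objective(G, E, X, Y, lam_G, lam_E):
    """Objective of problem (2) with g(G) = ||G||_1:
    0.5*||X^T G Y - E||_F^2 + lam_G*||G||_1 + lam_E*||E||_*."""
    fit = 0.5 * np.linalg.norm(X.T @ G @ Y - E, 'fro') ** 2
    return (fit + lam_G * np.abs(G).sum()
                + lam_E * np.linalg.svd(E, compute_uv=False).sum())

def feasible(E, F, mask):
    """Constraint R_Omega(E) = R_Omega(F): E matches F on observed entries."""
    return np.allclose(E[mask], F[mask])

# Tiny example: X = Y = I and E = X^T G Y exactly, so the fit term is zero
X = Y = np.eye(2)
G = np.diag([2.0, 3.0])
E = X.T @ G @ Y
val = objective(G, E, X, Y, lam_G=1.0, lam_E=1.0)  # 0 + 5 + 5 = 10
```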
Because E = X^T G Y, the rank of G bounds that of E from above. The existing methods all control the rank of G (e.g., by minimizing ‖G‖∗) to incorporate the prior of a low-rank E (and thus a low-rank F) in their formulations. However, when the rank of G is not properly chosen during hyperparameter tuning, it may not even be a sufficient condition to ensure a low-rank E (if rank(E) ≪ the pre-specified rank(G)). It is easy to see that, besides G, a low-rank X or Y can lead to a low-rank E as well. Enforcing a low-rank condition on H or G may limit the search space of the interactive model and thus impair the prediction performance on missing matrix entries, as demonstrated in our empirical results. Moreover, one can observe that when λ_G is sufficiently large, Eq. (2) reduces to the standard matrix completion problem (1) without side information because G may degenerate into a zero matrix, so our formulation remains applicable when there is no access to useful side information.\n\n3 Recovery Analysis\n\nLet E0 and G0 be the two matrices such that R_Ω(F) = R_Ω(E0) and E0 = X^T G0 Y. In this section, we give our theoretical results on the sample complexity for achieving an exact recovery of E0 and G0 when X and Y are both full row rank (i.e., rank(X) = a and rank(Y) = b), and an ε-recovery of E0 when the two side matrices are corrupted and less informative. The proofs of all theorems are given in the supplementary materials.\n\n3.1 Sample Complexity for Exact Recovery\n\nBefore presenting our results, we give a few definitions. Let F = U Σ V^T, X^T = U_X Σ_X V_X^T and Y^T = U_Y Σ_Y V_Y^T be the singular value decompositions of F, X^T and Y^T, respectively, where all Σ matrices are full rank, meaning that singular vectors corresponding to the singular value 0 are not included in the respective U and V matrices. 
Let\n\nP_U = U U^T ∈ R^{m×m}, P_V = V V^T ∈ R^{n×n},\nP_X = U_X U_X^T = X^T V_X Σ_X^{-2} V_X^T X ∈ R^{m×m}, P_Y = U_Y U_Y^T = Y^T V_Y Σ_Y^{-2} V_Y^T Y ∈ R^{n×n},\n\nwhere P_U, P_V, P_X and P_Y project a vector onto the subspaces spanned, respectively, by the columns of U and V and the rows of X and Y. For any matrix M ∈ R^{m×n} that satisfies M = P_X M P_Y, we define two linear operators P_T: R^{m×n} → R^{m×n} and P_T⊥: R^{m×n} → R^{m×n} as follows:\n\nP_T(M) = P_U M P_Y + P_X M P_V − P_U M P_V,\nP_T⊥(M) = (P_X − P_U) M (P_Y − P_V) = P_X⊥ M P_Y⊥.\n\nLet μ0 and μ1 be the two coherence measures of F, defined as in [4, 16]:\n\nμ0 = max( (m/r) max_{1≤i≤m} ‖P_U e_i‖^2, (n/r) max_{1≤j≤n} ‖P_V e_j‖^2 ), μ1 = (mn/r) max_{i,j} ([U V^T]_{i,j})^2,\n\nwhere e_i is the unit vector with the i-th entry equal to 1. Let μ_XY be the coherence measurement between X and Y, defined as:\n\nμ_XY = max( max_{1≤i≤m} m‖x_i‖_2^2 / a, max_{1≤j≤n} n‖y_j‖_2^2 / b ).\n\nWith the above definitions, we show in the following theorem that when X and Y are both full row rank, (G0, E0) is the unique solution to Eq. (2) with high probability as long as there are O(r log N) observed components in F. 
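The projection and coherence quantities above are directly computable; a small numpy sketch (helper names are ours):

```python
import numpy as np

def col_projector(M):
    """Orthogonal projector onto the column space of M, via thin SVD."""
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    U = U[:, s > 1e-10 * s.max()]
    return U @ U.T

def coherence_mu0(F):
    """mu0 = max((m/r) max_i ||P_U e_i||^2, (n/r) max_j ||P_V e_j||^2);
    note ||P_U e_i||^2 is just the squared norm of the i-th row of U."""
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    r = int((s > 1e-10 * s.max()).sum())
    U, V = U[:, :r], Vt[:r, :].T
    m, n = F.shape
    return max((m / r) * (U ** 2).sum(axis=1).max(),
               (n / r) * (V ** 2).sum(axis=1).max())

# The all-ones matrix is maximally incoherent: mu0 equals 1 (up to rounding)
mu0 = coherence_mu0(np.ones((4, 4)))
```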
In other words, with a sampling rate of O(r log N), our method can fully recover both E0 and G0 with high probability when X and Y are full row rank.\n\nTheorem 1 Let μ = max(μ0, μ_XY), σ = max(‖Σ_X^{-1}‖∗, ‖Σ_Y^{-1}‖∗), N = max(m, n), q0 = (1/2)(1 + log a − log r), T0 = (128p/3) σ^2 μ^2 (ab + r^2) log N, and T1 = (8p/3) σμ max(μ1, μ)(1 + log a − log r) r(a + b) log N, where p is a constant. Assume T1 ≥ q0 T0 and that X and Y are both full row rank. For any p > 1, with probability at least 1 − 4(q0 + 1)N^{−p+1} − 2q0 N^{−p+2}, (G0, E0) is the unique optimizer of Problem (2) with a necessary sampling rate as low as O(r log N). More precisely, the sample size |Ω| should satisfy |Ω| ≥ (64p/3) σμ max(μ1, μ) r(a + b) log N.\n\nWhen r ≪ N and r = O(1), the sampling rate for the exact recovery of both E0 and G0 reduces to O(log N). A similar sampling rate for a full recovery of E0 has been developed in [28], where both X and Y, however, need to be orthonormal matrices in the derivation. In Theorem 1, because σ is mainly determined by the smallest singular values of the side information matrices, and the sampling rate increases as σ increases, side information matrices of lower rank would require more observed entries of F for a full recovery of F. An advanced model without the orthonormality assumption has been given in [9], but exact recovery is not discussed there. In our case, the two matrices are only required to be full row rank. Moreover, the theoretical and empirical results in our work give the first careful investigation of the recovery of both G0 and E0.\n\n3.2 Sample Complexity for ε-Recovery\n\nThe condition of full-row-rank side information matrices may not be satisfied in some cases, so E0 (or F) cannot be fully recovered. 
We analyze the error bound of our model and prove a reduced sample complexity, in comparison with standard matrix completion methods, for an ε-recovery when the side information matrices are not full row rank or their rank is difficult to attain.\n\nTheorem 2 Assume ‖E‖∗ ≤ α, ‖G‖_1 ≤ γ, ‖X^T G Y − E‖_F ≤ φ, and that the perfect side feature matrices (containing latent features of F) are corrupted with ΔX and ΔY where ‖ΔX‖_F ≤ s1, ‖ΔY‖_F ≤ s2, and S = max(s1, s2). To ε-recover F such that the expected loss E[l(f, F)] < ε for a given arbitrarily small ε > 0, O(min((γ^2 + φ^2) log N, S^2 α √N) / ε^2) observations are sufficient for our model when the corruption factors of the side information are bounded.\n\nTheorem 2 follows from the fact that the trace norm of E and the ℓ1-norm of G affect the sample complexity of our model. It matches the intuition that a higher-rank matrix ought to require more observations to recover. Besides, for the discovery of G, a sparse interactive matrix can decrease the sample complexity, which implies that the side information, even when imperfect, can be informative enough that the original matrix can be compressed by sparse coding via the estimated interaction between the features of the row and column entities of the matrix. Our empirical evaluations have confirmed the utility of even imperfect side features.\n\nWhen the rank of the original data matrix is r = O(1) (r ≪ N), and correspondingly α = O(1), Theorem 2 points out that only an O(log N) sampling rate is required for an ε-recovery. 
The classic matrix completion analysis without side information shows that under certain conditions, one can achieve O(N poly log N) sample complexity for both perfect recovery [4] and ε-recovery [25], which is higher than our complexity. However, the condition for these existing bounds is that the observed entries follow a certain distribution. Recent studies [22] found that if no specific distribution is pre-assumed for the observed entries, an O(N^{3/2}) sampling rate is sufficient for an ε-recovery. Compared to those results, our analysis does not require any assumption on the distribution of observed entries. When X and Y contain insufficient interaction information about F and ‖E‖∗ = O(N), the sample complexity of our method increases to O(N^{3/2}) in the worst case, which means that our model maintains the same complexity as the classic methods.\n\n4 Adaptive LADMM Algorithm\n\nIn this section, we develop an adaptive LADMM algorithm [29] to solve problem (2). First, we show that ADMM is applicable to our problem, and we then derive the LADMM steps. A convergence proof is established to guarantee the performance of our algorithm. Because ADMM requires separable blocks of variables, we first define C = E − X^T G Y and use it in Eq. (2). Then the augmented Lagrangian function of (2) is given by\n\nL(E, G, C, M1, M2, β) = (1/2)‖C‖_F^2 + λ_E ‖E‖∗ + λ_G ‖G‖_1 + ⟨M1, R_Ω(E − F)⟩ + ⟨M2, E − X^T G Y − C⟩ + (β/2)‖R_Ω(E − F)‖_F^2 + (β/2)‖E − X^T G Y − C‖_F^2,\n\nwhere M1, M2 ∈ R^{m×n} are Lagrange multipliers and β > 0 is the penalty parameter. 
Given C^k, G^k, E^k, M1^k and M2^k at iteration k, each group of variables yields its respective subproblem:\n\nC^{k+1} = argmin_C L(E^k, G^k, C, M1^k, M2^k, β^k),\nG^{k+1} = argmin_G L(E^k, G, C^{k+1}, M1^k, M2^k, β^k),   (3)\nE^{k+1} = argmin_E L(E, G^{k+1}, C^{k+1}, M1^k, M2^k, β^k).\n\nAfter solving these subproblems, we update the multipliers M1 and M2 as follows:\n\nM1^{k+1} = M1^k + β^k R_Ω(E^{k+1} − F),   (4)\nM2^{k+1} = M2^k + β^k (E^{k+1} − X^T G^{k+1} Y − C^{k+1}).   (5)\n\nWe focus on demonstrating the iterative steps of the adaptive LADMM. Given C^k, G^k, E^k, M1^k and M2^k, Algorithm 1 describes how to obtain the next iterate (C, E, G, M1, M2). A closed-form solution has been derived for each subproblem in the supplementary material.\n\nAlgorithm 1 The adaptive LADMM algorithm to solve for C^k, G^k, E^k, k = 1, ..., K\nInput: X, Y and R_Ω(F) with parameters λ_G, λ_E, τ_A, τ_B, ρ and β_max.\nOutput: C, G, E\n1: Initialize E^0, G^0, M1^0, M2^0. Compute A = Y^T ⊗ X^T. Set k = 0 and let e = vec(E), g = vec(G), m = vec(M), c = vec(C).\nrepeat\n2: C^{k+1} = (β^k / (β^k + 1)) (E^k − X^T G^k Y + M2^k / β^k).\n3: G^{k+1} = reshape(max(|g^k − f1^k / τ_A| − λ_G / (τ_A β^k), 0) ⊙ sgn(g^k − f1^k / τ_A)), where f1^k = A^T (A g^k + c^k − e^k − m2^k / β^k).\n4: E^{k+1} = SVT(E^k − (f2^k + f3^k) / (2 τ_B), λ_E / (2 β^k τ_B)), where f2^k = R_Ω(E^k − F + M1^k / β^k) and f3^k = E^k − X^T G^{k+1} Y − C^k + M2^k / β^k.\n5: M1^{k+1} = M1^k + β^k R_Ω(E^{k+1} − F).\n6: M2^{k+1} = M2^k + β^k (E^{k+1} − X^T G^{k+1} Y − C^{k+1}).\n7: β^{k+1} = min(β_max, ρ β^k).\n8: k = k + 1.\nuntil convergence\nReturn C, G, E.\n\nThe adaptive parameter in Algorithm 1 is ρ > 1, and β_max controls the upper bound of {β^k}. 
The operator reshape(g) converts a vector g ∈ R^{ab} into a matrix G ∈ R^{a×b}, which is the inverse operator of vec(G). The operator SVT(E, t) is the singular value thresholding process defined in [3] for soft-thresholding the singular values of an arbitrary matrix E by a threshold t. The matrix A = Y^T ⊗ X^T, where ⊗ denotes the Kronecker product. In the initialization step, M1^0 and M2^0 are randomly drawn from the standard Gaussian distribution; we initialize E^0 and G^0 by the iterative soft-thresholding algorithm [2] and the SVT operator, respectively.\n\nThe adaptive LADMM can effectively solve the proposed optimization problem in several aspects. First, the convergence of the commonly used block-wise coordinate descent (BCD) method, sometimes referred to as alternating minimization, typically requires that the optimization problem be strictly convex (or quasiconvex and hemivariate). The strongest result for BCD so far is established in [26], which requires each alternating subproblem to be optimized in every iteration to its unique optimal solution. This requirement is often restrictive in practice. Our convex (but not strictly convex) problem can be solved by the adaptive LADMM with the global convergence guarantee characterized in Theorem 3. Second, two of the subproblems are non-smooth due to the ℓ1-norm or the nuclear norm, so it can be difficult to obtain a closed-form formula to efficiently compute a solution by standard optimization tools; the adaptive LADMM, however, uses a linearization technique which leads to a closed-form solution for each linearized subproblem and significantly enhances the efficiency of the iterative process. Third, the adaptive LADMM can be parallelized in practice by a scheme similar to that of ADMM. 
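The two closed forms at the heart of Algorithm 1 are the entrywise ℓ1 shrinkage (the G-step) and the singular value thresholding operator SVT (the E-step); a minimal sketch of both (function names are ours):

```python
import numpy as np

def soft_threshold(M, t):
    """Entrywise shrinkage: the proximal operator of t*||.||_1 (G-step)."""
    return np.sign(M) * np.maximum(np.abs(M) - t, 0.0)

def svt(M, t):
    """Singular value thresholding: the proximal operator of t*||.||_* (E-step)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt

A = np.diag([3.0, 1.0])
shrunk = soft_threshold(A, 2.0)   # 3 shrinks to 1, 1 is zeroed out
low_rank = svt(A, 2.0)            # the singular values shrink the same way here
```

Each linearized subproblem in Algorithm 1 reduces to one call of these operators, which is what makes every iteration closed-form.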
It is also noted that the convergence rate of LADMM [11] and parallel LADMM is O(1/k) [23], whereas the BCD method still lacks clear theoretical results on its convergence rate.\n\nTheorem 3 Define the operators A and B as A(G) = (0; −X^T G Y) and B(E) = (R_Ω(E); E), and let M = (M1; M2). If β^k is non-decreasing and upper-bounded, τ_A > ‖A‖^2, and τ_B > ‖B‖^2, then the sequence {(C^k, G^k, E^k, M^k)} generated by the adaptive LADMM Algorithm 1 converges to a global minimizer of Eq. (2).\n\n5 Experimental Results\n\nWe validated our method in both simulations and the analysis of two real world datasets: the MovieLens (movie rating) and NCI-DREAM (drug discovery) datasets. The three most recent matrix completion methods that also utilize side information, MAXIDE [28], IMC [13] and DirtyIMC [9], were compared against our method. The design of our experiments focused on demonstrating the effectiveness of our method in practice. The performance of all methods was measured by the relative mean squared error (RMSE) calculated on missing entries: ‖R_Ω̄(X^T G Y − F)‖^2 / ‖R_Ω̄(F)‖^2, where Ω̄ denotes the set of missing entries. For both synthetic and real-world datasets, we randomly set q percent of the components in each observed matrix F to be missing. The hyperparameters λ and the rank of G (required by IMC and DirtyIMC) were tuned via the same cross validation process: we randomly picked 10% of the given entries to form a validation set. Models were then obtained by applying each method to the remaining entries with a specific choice of λ from 10^{-3}, 10^{-2}, ..., 10^4. The average validation RMSE was examined by repeating the above procedure six times. The hyperparameter values that gave the best average validation RMSE were chosen for each method. 
For IMC and DirtyIMC, the best rank of G was chosen from 1 to 15 within each data split. For each choice of q, we repeated the above entire procedure six times and reported the average RMSE on the missing entries.\n\n5.1 Synthetic Datasets\n\nWe created two different simulation tests, with and without full row rank X and Y. For all the synthetic datasets, we first randomly created X and Y. To make our simulations reminiscent of real situations where the distributions of side features can be heterogeneous, the data for each feature in both X and Y were generated according to a distribution randomly selected from the Gaussian, Poisson and Gamma distributions. We created the sparse G matrices as follows. The locations of the non-zero entries of G were randomly picked and their values were drawn from N(0, 100); we repeated this several times to choose the matrices that showed full or high rank. We then generated F with F = X^T G Y + N, where N represents noise and each component N_{i,j} was drawn from N(0, 1). For each simulated F, we ran all methods with q ∈ [10%, 80%] in steps of 10%.\n\nWe compared the different methods in three settings, labeled as synthetic experiments I, II and III in our results. In the first setting, the dimensions of X and Y were set to 15 × 50 and 20 × 140, and all features in these two matrices were randomly generated to make them full row rank. The last two settings corresponded to the second test where X and Y were not full row rank. The dimensions of X and Y were set to 16 × 50, 21 × 140 and to 20 × 50, 25 × 140, respectively, for these two settings, where the first 15 features in X and 20 features in Y were randomly created, but the remaining features were generated by arbitrary linear combinations of the randomly created features. 
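A data generator in this spirit, together with the relative-RMSE metric used above, can be sketched as follows (a simplification: we draw all side features from a single Gaussian rather than the mixed Gaussian/Poisson/Gamma scheme, and the names are ours):

```python
import numpy as np

def make_synthetic(d1, m, d2, n, nnz, seed=0):
    """F = X^T G Y + noise, with a sparse G whose nonzeros are N(0, 100)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((d1, m))
    Y = rng.standard_normal((d2, n))
    G = np.zeros((d1, d2))
    idx = rng.choice(d1 * d2, size=nnz, replace=False)
    G.flat[idx] = rng.normal(0.0, 10.0, size=nnz)   # std 10 => variance 100
    F = X.T @ G @ Y + rng.standard_normal((m, n))   # N(0, 1) noise
    return X, Y, G, F

def rmse_missing(G, X, Y, F, missing_mask):
    """Relative MSE on the missing entries, as in the evaluation above."""
    diff = (X.T @ G @ Y - F)[missing_mask]
    return (diff ** 2).sum() / (F[missing_mask] ** 2).sum()

X, Y, G, F = make_synthetic(15, 50, 20, 140, nnz=60)  # setting-I sizes
```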
For all three settings, we used 10 synthetic datasets and reported the mean and standard deviation of the RMSE on missing values, as shown in Figure 1.\n\nFigure 1: The comparison of RMSE for synthetic experiments I, II, and III.\n\nOur approach outperformed all other compared methods significantly in almost all these settings. When the missing rate q increased, the RMSE of our method grew much more slowly than that of the other methods. We studied the rank of the recovered G and E in the first setting. For all methods, the corresponding G and E that gave the best performance were examined. The ranks of G and E from our method, MAXIDE, IMC and DirtyIMC were 15, 8, 1, 1 and 15, 15, 1, 2, respectively. These results suggest that incorporating the strong prior of a low-rank G might hurt the recovery performance. The retrieved model matrices G of all compared methods (when using q = 10% missing entries in one of the 10 synthetic datasets), together with the true G, are plotted in Figure 2. Only our method was able to recover the true G; all the other methods merely found approximations.\n\nFigure 2: The heatmap of the true G and recovered G matrices in synthetic experiment I.\n\n5.2 Real-world Datasets\n\nWe used the two relatively large datasets that we could find as suitable for our empirical evaluation. Note that early methods employing side information were often tested on datasets with either X or Y but not both, although some of them might be larger than the two datasets we used.\n\n5.2.1 MovieLens. This dataset was downloaded from [12] and contained 100,000 user ratings (integers from 1 to 5) from 943 users on 1682 movies. There were 20 movie features such as genre and release date, as well as 24 user features describing users' demographic information such as age and gender. We compared all methods with four different q values: 20-50%. 
The RMSE values of each method are shown in Table 1; our approach significantly outperformed the other methods, especially when q was large. Figure 3 shows the constructed G matrix, which reveals some interesting patterns. For instance, male users tended to rate action, science fiction, thriller and war movies high but children's movies low, matching some common intuitions.
5.2.2 NCI-DREAM Challenge. The data on the reactions of 46 breast cancer cell lines to 26 drugs, and the expression data of 18,633 genes for all the cell lines, were provided by the NCI-DREAM Challenge [10].

                          MovieLens Data                      NCI-DREAM Challenge
Methods          20%      30%      40%      50%      20%      30%      40%      50%
Our approach     0.276    0.284    0.279    0.292    0.181    0.139    0.145    0.190
                (±0.001) (±0.002) (±0.001) (±0.001) (±0.069) (±0.010) (±0.018) (±0.031)
MAXIDE           0.424    0.425    0.419    0.421    0.268    0.240    0.255    0.288
                (±0.016) (±0.013) (±0.008) (±0.013) (±0.036) (±0.007) (±0.016) (±0.022)
IMC              0.935    0.943    0.945    0.959    0.437    0.489    0.557    0.637
                (±0.001) (±0.001) (±0.001) (±0.001) (±0.031) (±0.003) (±0.013) (±0.011)
DirtyIMC         0.705    0.738    0.775    0.814    0.432    0.475    0.551    0.632
                (±0.001) (±0.001) (±0.001) (±0.001) (±0.033) (±0.008) (±0.018) (±0.011)

Table 1: The comparison of RMSE values (mean ± standard deviation) of the different methods on the real-world datasets.

For each drug, we had 14 features that describe its chemical and physical properties, such as molecular weight, XLogP3 and hydrogen bond donor count; these were downloaded from the National Center for Biotechnology Information (http://pubchem.ncbi.nlm.nih.gov/). For the cell line features, we ran principal component analysis (PCA) and used the top 45 principal components, which accounted for more than 99.99% of the total data variance. We compared the four methods with four different q values: 20%, 30%, 40% and 50%. The RMSE values of all methods are provided in Table 1, where our method again shows the best performance. We examined the ranks of both G and E obtained by all the methods: they were 15, 15, 1, 1 for G and 2, 15, 1, 2 for E for our approach, MAXIDE, IMC and DirtyIMC, respectively. This demonstrates that a low-rank E together with a high-rank G gives the best performance on this dataset. In other words, requiring a low-rank G may hurt the performance of recovering a low-rank E.
The G constructed by our method is plotted in Figure 4, where columns represent cell line features (i.e., principal components) and rows represent drug features. Please refer to the supplementary material for the names of these features. According to this figure, the drug features XLogP (F2), hydrogen bond donor count (HBD, F3), hydrogen bond acceptor count (HBA, F4) and rotatable bond number (F5) all played important roles in drug sensitivity.
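As an aside on the feature construction: the PCA step described above (keeping the fewest principal components whose cumulative explained variance exceeds 99.99%) can be sketched as follows. This is our illustration via a plain SVD, not the authors' pipeline, and the toy dimensions in the comments are assumptions:

```python
import numpy as np

def pca_features(expr, var_target=0.9999):
    """Project samples onto the fewest principal components whose
    cumulative explained variance exceeds var_target.

    expr: (n_samples, n_genes) array, e.g. 46 cell lines x 18633 genes.
    Returns the (n_samples, k) component scores.
    """
    centered = expr - expr.mean(axis=0)
    # Thin SVD of the centered data; squared singular values give
    # the variance explained by each component.
    U, s, _ = np.linalg.svd(centered, full_matrices=False)
    explained = (s ** 2) / (s ** 2).sum()
    k = int(np.searchsorted(np.cumsum(explained), var_target) + 1)
    return U[:, :k] * s[:k]
```

With the expression data here, this selection yields the 45 components used as cell line features.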
This result aligns well with biological knowledge, as all four features are important descriptors of cellular entry and retention.

Figure 3: Heatmap of G for MovieLens.

Figure 4: Heatmap of sign(G) log(|G|) for NCI-DREAM, shown on a log scale for better illustration.

6 Conclusion
In this paper, we have proposed a novel sparse inductive model that uses side features describing the row and column entities of a partially observed matrix to predict its missing entries. This method models the linear predictive power of the side features as well as the interaction between the features of the row and column entities. Theoretical analysis shows that this model has the advantage of reduced sample complexity over classical matrix completion methods, requiring only O(log N) observed entries to achieve a perfect recovery of the original matrix when the side features reflect the true latent feature space of the matrix. When the side features are less informative, our model requires O(log N) observations for an ε-recovery of the matrix. Unlike early methods that use a BCD algorithm, we have developed a LADMM algorithm to optimize the proposed formulation. Since the optimization problem is convex, this algorithm converges to a global solution. Computational results demonstrate the superior performance of this method over three recent methods. Future work includes the examination of other types and qualities of side information, and an investigation of whether our method can benefit a variety of related problems, such as multi-label learning and semi-supervised clustering.
Acknowledgments
Jinbo Bi and her students Jin Lu, Guannan Liang and Jiangwen Sun were supported by NSF grants IIS-1320586, DBI-1356655, and CCF-1514357, and NIH grant R01DA037349.

References
[1] J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. A new approach to collaborative filtering: Operator estimation with spectral regularization.
The Journal of Machine Learning Research, 10:803-826, 2009.
[2] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
[3] J.-F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956-1982, Mar. 2010.
[4] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717-772, 2009.
[5] E. J. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053-2080, 2010.
[6] P. Chen and D. Suter. Recovering the missing components in a large noisy low-rank matrix: Application to SfM. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(8):1051-1063, Aug. 2004.
[7] T. Chen, W. Zhang, Q. Lu, K. Chen, Z. Zheng, and Y. Yu. SVDFeature: a toolkit for feature-based collaborative filtering. The Journal of Machine Learning Research, 13(1):3619-3622, 2012.
[8] K.-Y. Chiang, C.-J. Hsieh, and I. S. Dhillon. Robust principal component analysis with side information. In Proceedings of the 33rd International Conference on Machine Learning, pages 2291-2299, 2016.
[9] K.-Y. Chiang, C.-J. Hsieh, and I. S. Dhillon. Matrix completion with noisy side information. In Advances in Neural Information Processing Systems 28, pages 3429-3437, 2015.
[10] A. Daemen, O. L. Griffith, L. M. Heiser, N. J. Wang, O. M. Enache, Z. Sanborn, F. Pepin, S. Durinck, J. E. Korkola, M. Griffith, et al. Modeling precision treatment of breast cancer. Genome Biology, 14(10):R110, 2013.
[11] E. X. Fang, B. He, H. Liu, and X. Yuan. Generalized alternating direction method of multipliers: new theoretical insights and applications. Mathematical Programming Computation, 7(2):149-187, 2015.
[12] F. M.
Harper and J. A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems, 5(4):19:1-19:19, Dec. 2015.
[13] P. Jain and I. S. Dhillon. Provable inductive matrix completion. arXiv preprint arXiv:1306.0626, 2013.
[14] R. Keshavan, A. Montanari, and S. Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980-2998, June 2010.
[15] Z. Lin, M. Chen, and Y. Ma. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. Mathematical Programming, 2010.
[16] G. Liu and P. Li. Low-rank matrix completion in the presence of high coherence. IEEE Transactions on Signal Processing, 64(21):5623-5633, Nov. 2016.
[17] A. K. Menon, K.-P. Chitrapura, S. Garg, D. Agarwal, and N. Kota. Response prediction using collaborative filtering with hierarchies and side-information. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 141-149. ACM, 2011.
[18] N. Natarajan and I. S. Dhillon. Inductive matrix completion for predicting gene-disease associations. Bioinformatics, 30(12):i60-i68, 2014.
[19] X. Ning and G. Karypis. Sparse linear methods with side information for top-n recommendations. In Proceedings of the Sixth ACM Conference on Recommender Systems, RecSys '12, pages 155-162, New York, NY, USA, 2012. ACM.
[20] B. Recht. A simpler approach to matrix completion. The Journal of Machine Learning Research, 12:3413-3430, 2011.
[21] J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pages 713-719, New York, NY, USA, 2005. ACM.
[22] O. Shamir and S. Shalev-Shwartz. Matrix completion with the trace norm: learning, bounding, and transducing. The Journal of Machine Learning Research, 15(1):3401-3423, 2014.
[23] W. Shi, Q. Ling, G. Wu, and W. Yin. A proximal gradient algorithm for decentralized composite optimization. IEEE Transactions on Signal Processing, 63(22):6013-6023, Nov. 2015.
[24] V. Sindhwani, S. Bucak, J. Hu, and A. Mojsilovic. One-class matrix completion with low-density factorizations. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 1055-1060, Dec. 2010.
[25] N. Srebro and A. Shraibman. Rank, trace-norm and max-norm. In Proceedings of the 18th Annual Conference on Learning Theory (COLT), pages 545-560, 2005.
[26] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475-494, 2001.
[27] Z. Weng and X. Wang. Low-rank matrix completion for array signal processing. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 2697-2700. IEEE, 2012.
[28] M. Xu, R. Jin, and Z.-H. Zhou. Speedup matrix completion with side information: Application to multi-label learning. In Advances in Neural Information Processing Systems 26, pages 2301-2309, 2013.
[29] J. Yang and X.-M. Yuan. Linearized augmented Lagrangian and alternating direction methods for nuclear norm minimization. Mathematics of Computation, 82, 2013.