{"title": "Markov Random Fields for Collaborative Filtering", "book": "Advances in Neural Information Processing Systems", "page_first": 5473, "page_last": 5484, "abstract": "In this paper, we model the dependencies among the items that are recommended to a user in a collaborative-filtering problem via a Gaussian Markov Random Field (MRF). We build upon Besag's auto-normal parameterization and pseudo-likelihood, which not only enables computationally efficient learning, but also connects the areas of MRFs and sparse inverse covariance estimation with autoencoders and neighborhood models, two successful approaches in collaborative filtering. We propose a novel approximation for learning sparse MRFs, where the trade-off between recommendation-accuracy and training-time can be controlled. At only a small fraction of the training-time compared to various baselines, including deep nonlinear models, the proposed approach achieved competitive ranking-accuracy on all three well-known data-sets used in our experiments, and notably a 20% gain in accuracy on the data-set with the largest number of items.", "full_text": "Markov Random Fields for Collaborative Filtering\n\nHarald Steck\n\nNet\ufb02ix\n\nLos Gatos, CA 95032\nhsteck@netflix.com\n\nAbstract\n\nIn this paper, we model the dependencies among the items that are recommended\nto a user in a collaborative-\ufb01ltering problem via a Gaussian Markov Random\nField (MRF). We build upon Besag\u2019s auto-normal parameterization and pseudo-\nlikelihood [7], which not only enables computationally ef\ufb01cient learning, but also\nconnects the areas of MRFs and sparse inverse covariance estimation with au-\ntoencoders and neighborhood models, two successful approaches in collaborative\n\ufb01ltering. We propose a novel approximation for learning sparse MRFs, where\nthe trade-off between recommendation-accuracy and training-time can be con-\ntrolled. 
At only a small fraction of the training-time compared to various baselines,\nincluding deep nonlinear models, the proposed approach achieved competitive\nranking-accuracy on all three well-known data-sets used in our experiments, and\nnotably a 20% gain in accuracy on the data-set with the largest number of items.\n\n1\n\nIntroduction\n\nCollaborative \ufb01ltering has witnessed signi\ufb01cant improvements in recent years, largely due to models\nbased on low-dimensional embeddings, like weighted matrix factorization (e.g., [26, 39]) and deep\nlearning [23, 22, 33, 47, 62, 58, 20, 11], including autoencoders [58, 33]. Also neighborhood-based\napproaches are competitive in certain regimes (e.g., [1, 53, 54]), despite being simple heuristics based\non item-item (or user-user) similarity matrices (like cosine similarity). In this paper, we outline that\nMarkov Random Fields (MRF) are closely related to autoencoders as well as to neighborhood-based\napproaches. We build on the enormous progress made in learning MRFs, in particular in sparse\ninverse covariance estimation (e.g., [36, 59, 15, 2, 60, 44, 45, 63, 55, 24, 25, 52, 56, 51]). Much of\nthe literature on sparse inverse covariance estimation focuses on the regime where the number of data\npoints n is much smaller than the number of variables m in the model (n < m).\nThis paper is concerned with a different regime, where the number n of data-points (i.e., users) and\nthe number m of variables (i.e., items) are both large as well as n > m, which is typical for many\ncollaborative \ufb01ltering applications. We use an MRF as to model the dependencies (i.e., similarities)\namong the items that are recommended to a user, while a user corresponds to a sample drawn from\nthe distribution of the MRF. In this regime (n > m), learning a sparse model may not lead to\nsigni\ufb01cant improvements in prediction accuracy (compared to a dense model). 
Instead, we exploit model-sparsity as to reduce the training-time considerably, as computational cost is a main concern when both n and m are large. To this end, we propose a novel approximation that enables one to trade prediction-accuracy for training-time. This trade-off subsumes the two extreme cases commonly considered in the literature, namely regressing each variable against its neighbors in the MRF and inverting the covariance matrix.

This paper is organized as follows. In the next section, we review Besag's auto-normal parameterization and pseudo-likelihood, and the resulting closed-form solution for the fully-connected MRF. We then state the key Corollary in Section 2.2.2, which is the basis of our novel sparse approximation outlined in Section 3. We discuss the connections to various related approaches in Section 4. The empirical evaluation on three well-known data-sets in Section 5 demonstrates the high accuracy achieved by this approach, while requiring only a small fraction of the training-time as well as of the number of parameters compared to the best competing model.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

2 Pseudo-Likelihood

In this section, we lay the groundwork for the novel sparse approximation outlined in Section 3.

2.1 Model Parameterization

In this section, we assume that the graph G of the MRF is given, and each node i ∈ I in G corresponds to an item in the collaborative-filtering problem. The m = |I| nodes are associated with the (row) vector of random variables X = (X_1, ..., X_m) that follows a zero-mean1 multivariate Gaussian distribution N(0, Σ).2 A user corresponds to a sample drawn from this distribution. As to make recommendations, we use the expectation E[X_i | X_{I\{i}} = x_{I\{i}}] as the predicted score for item i given a user's observed interactions x_{I\{i}} with all the other items I \ {i}. 
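To make this scoring rule concrete, here is a minimal numpy sketch (all sizes and values are illustrative; B holds the regression coefficients introduced in Eq. 1 below, with a zero diagonal so that item i is excluded from its own covariates):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5                                  # illustrative number of items

# Regression coefficients beta_{j,i}; the zero diagonal excludes item i
# from the covariates of its own prediction.
B = rng.standard_normal((m, m))
np.fill_diagonal(B, 0.0)

x = rng.standard_normal(m)             # one user's (row) vector of interactions

# Predicted scores for all items at once: E[X_i | rest] = x . B[:, i].
scores = x @ B

# Equivalent per-item form, summing only over j != i:
i = 2
assert np.isclose(scores[i], sum(B[j, i] * x[j] for j in range(m) if j != i))
```

Because the diagonal of B is zero, the matrix product x @ B computes every item's conditional-mean score in a single operation.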
When learning the MRF, maximizing the (L1/L2-norm regularized) likelihood of the MRF that is parameterized according to the Hammersley-Clifford theorem [18] can become computationally expensive, given that the number of items and users is often large in collaborative filtering applications. For this reason, we use Besag's auto-normal parameterization of the MRF [7, 8]: the conditional mean of each X_i is parameterized in terms of a regression against the remaining nodes:

E[X_i | X_{I\{i}} = x_{I\{i}}] = Σ_{j ∈ I\{i}} β_{j,i} x_j = x · B_{·,i}    (1)

where β_{j,i} = 0 if the edge between nodes i and j is absent in G, i.e., the regression of each node i involves only its neighbors in G. In the last equality in Eq. 1, we switched to matrix notation, where B ∈ R^{m×m} with B_{j,i} := β_{j,i} for i ≠ j, and with a zero diagonal, B_{i,i} = 0, as to exclude X_i from the covariates in the regression regarding its own mean in Eq. 1. B_{·,i} denotes the ith column of B, and x a realization of X regarding a user. Besides B, the vector of conditional variances σ⃗² := (σ²_1, ..., σ²_m) are further model parameters: var(X_i | X_{I\{i}} = x_{I\{i}}) = σ²_i. The fact that the covariance matrix Σ is symmetric imposes the constraint σ²_i β_{i,j} = σ²_j β_{j,i} on the auto-normal parameterization [7]. Moreover, the positive definiteness of Σ gives rise to an additional constraint, which in general can only be verified if the numerical values of the parameters are known [7]. For computational efficiency, we will not explicitly enforce either one of these constraints in this paper.

2.2 Parameter Fitting

Besag's pseudo-likelihood yields asymptotically consistent estimates when the auto-normal parameterization is used [7].3 The log pseudo-likelihood is defined as the sum of the conditional log likelihoods of the items: L^(pseudo)(X | B, σ⃗²) = Σ_{i∈I} L(X_{·,i} | X_{·,I\{i}}; B_{·,i}, σ²_i), where X is the given user-item-interaction data-matrix X ∈ R^{n×m} regarding n users and m items. While this approach allows for a real-valued data-matrix X (e.g., the duration that a user listened to a song), in our experiments in Section 5, following the experimental set-up in [33], we use a binary matrix X, where 1 indicates an observed user-item interaction (e.g., a user listened to a song). X_{·,i} denotes column i of matrix X, while column i is dropped in X_{·,I\{i}}. Note that we assume i.i.d. data. Substituting the Gaussian density function for each (univariate) conditional likelihood results in

L^(pseudo)(X | B, σ⃗²) = −Σ_{i∈I} { (1/(2σ²_i)) ||X_{·,i} − X B_{·,i}||²_2 + (1/2) log 2πσ²_i }.

If the symmetry-constraint (cf. σ²_i β_{i,j} = σ²_j β_{j,i} in the previous section) is dropped, the parameters B̂ that maximize this pseudo-likelihood also maximize the decoupled pseudo-likelihood

L^(decoupled pseudo)(X | B) = −Σ_{i∈I} ||X_{·,i} − X B_{·,i}||²_2 = −||X − X B||²_F ,    (2)

1 In fact, it is possible to drop this common assumption in certain cases, see Appendix.
2 We use 'item', 'node' and 'random variable' interchangeably in this paper.
3 Interestingly, despite its similarity to Eq. 1, the parameterization X_i = Σ_{j∈I\{i}} β_{i,j} X_j + ε_i, where ε_i is independent Gaussian noise with zero mean and variance σ²_i, does not lead to consistent estimates [7].

where ||·||_F denotes the Frobenius norm of a matrix. Note that any weighting scheme w_i > 0, including w_i = 1/(2σ²_i) and w_i = 1, in Σ_{i∈I} w_i ||X_{·,i} − X B_{·,i}||²_2 results in the same optimum B̂. This is obvious from the fact that this sum is optimized by optimizing each column B_{·,i} independently of the other columns, as they are decoupled in the absence of the symmetry constraint. Note that, unlike the pseudo-likelihood, Eq. 2 becomes independent of the (unknown) conditional variances σ²_i.

2.2.1 Complete Graph

The result for the complete graph is useful for the next section. Starting from Eq. 2, we add L2-norm regularization with hyper-parameter λ > 0:

B̂ = arg min_B ||X − X B||²_F + λ · ||B||²_F   where diag(B) = 0    (3)

where we explicitly re-stated the zero-diagonal constraint, see Section 2.1. 
The method of Lagrangian multipliers immediately yields the closed-form solution (see derivation below):

B̂ = I − Ĉ · dMat(1 ⊘ diag(Ĉ))   where   Ĉ = S_λ^{−1}   and   S_λ = n^{−1}(X^⊤ X + λ · I),    (4)

where I denotes the identity matrix, dMat(·) a diagonal matrix, ⊘ the elementwise division, and diag(·) the diagonal of the estimated concentration matrix Ĉ, which is the inverse of the L2-norm regularized empirical covariance matrix S_λ. Note that B̂_{i,j} = −Ĉ_{i,j}/Ĉ_{j,j} for i ≠ j.

Derivation: Eq. 3 can be solved via the method of Lagrangian multipliers: setting the derivative of the Lagrangian ||X − X B||²_F + λ · ||B||²_F + 2γ^⊤ · diag(B) to zero, where γ ∈ R^m is the vector of Lagrangian multipliers regarding the equality constraint diag(B) = 0, it follows after re-arranging terms: B̂ = (X^⊤X + λ · I)^{−1} (X^⊤X − dMat(γ)) = n^{−1} Ĉ (n Ĉ^{−1} − λ · I − dMat(γ)) = I − n^{−1} Ĉ · dMat(γ + λ), where γ is determined by the constraint 0 = diag(B̂) = diag(I) − n^{−1} diag(Ĉ) ⊙ (γ + λ), where ⊙ denotes the elementwise product. Hence, γ + λ = n ⊘ diag(Ĉ). ∎

2.2.2 Subgraphs

The result from the previous section carries over immediately to certain subgraphs:

Corollary: Let D ⊆ I be a subset of nodes that forms a fully connected subgraph in G. Let C be the Markov blanket of D in graph G such that each j ∈ C is connected to each i ∈ D. 
Then the non-zero parameter-estimates in the columns i ∈ D of B̂ based on the pseudo-likelihood are asymptotically consistent, and given by B̂_{j,i} = −Ĉ_{j,i}/Ĉ_{i,i} for all i ∈ D and j ∈ C ∪ D \ {i}, where the submatrix of matrix Ĉ ∈ R^{|I|×|I|} regarding the nodes C ∪ D is determined by the inverse of the submatrix of the empirical covariance matrix:4 Ĉ[C ∪ D; C ∪ D] = S_λ[C ∪ D; C ∪ D]^{−1}.

Proof: This follows trivially when considering the nodes in D as the so-called dependents and the nodes in C as the conditioners in the coding technique used in [7]. The estimate in Eq. 4 for the complete graph carries over to the nodes D, as each i ∈ D is connected to all j ∈ C ∪ D, and D given the conditioners C is independent of all remaining nodes in graph G. ∎

3 Sparse Approximation

In collaborative-filtering problems with a large number of items, the graph G can be expected to be (approximately) sparse, where related items form densely connected subgraphs, while items in different subgraphs are only sparsely connected. An absent edge in graph G is equivalent to a zero entry in the concentration matrix C [31, 36] and in the matrix of regression coefficients B (see Eq. 4). In our approach, the goal is to trade accuracy for training-time, rather than to learn the 'true' graph G and the most accurate parameters at any computational cost. This is important in practical applications, where recommender systems have to be re-trained regularly under time-constraints as to ingest the most recent data. To this end, we use model-sparsity as a means for speeding up the training (rather than for improving accuracy), as it reduces the number of parameters that need to be learned. 
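To make the Corollary concrete, the following numpy sketch checks it numerically at the population level: for a toy precision matrix Θ with a known sparsity structure (all sizes and values here are illustrative), inverting only the submatrix of the covariance over C ∪ D recovers the same estimates B̂_{j,i} = −Ĉ_{j,i}/Ĉ_{i,i} as applying Eq. 4 to the full matrix:

```python
import numpy as np

# Toy precision matrix Theta on 6 nodes with known structure:
# D = {0, 1} is fully connected; C = {2, 3} is its Markov blanket
# (each node in C is connected to each node in D); R = {4, 5} is the
# rest of the graph, connected to C only.
m = 6
Theta = 4.0 * np.eye(m)                       # diagonally dominant -> pos. def.
for a, b in [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 4), (3, 5)]:
    Theta[a, b] = Theta[b, a] = 0.5
Sigma = np.linalg.inv(Theta)                  # population covariance of the MRF

# Full solution in the style of Eq. 4, using the population covariance:
C_full = np.linalg.inv(Sigma)                 # concentration matrix
B_full = np.eye(m) - C_full / np.diag(C_full)  # B[j, i] = -C[j, i] / C[i, i]

# Corollary: invert only the submatrix over C u D = {0, 1, 2, 3} ...
CD = [0, 1, 2, 3]
C_sub = np.linalg.inv(Sigma[np.ix_(CD, CD)])

# ... and recover the same estimates for the columns i in D = {0, 1}:
for a, i in enumerate(CD):
    if i not in (0, 1):
        continue
    for b, j in enumerate(CD):
        if j != i:
            assert np.isclose(-C_sub[b, a] / C_sub[a, a], B_full[j, i])
```

The agreement is exact here because Θ[D, R] = 0, so the Schur-complement correction from marginalizing out R only affects the C × C block of the submatrix inverse.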
The Corollary outlined above can be used for an approximation where a large number of small submatrices of the concentration matrix Ĉ has to be inverted, each regarding a set of related items D conditioned on their Markov blanket C. This can be computationally much more efficient (1) compared to inverting the entire concentration matrix Ĉ at once (like in Eq. 4, which can be computationally expensive if the number of items is large), or (2) compared to regressing each individual node against its neighbors as is commonly done in the literature (e.g., [7, 21, 36, 38]).

4 When used as indices regarding a (sub-)matrix, a set of indices is used as if it were a list of sorted indices.

Our approximation is comprised of two parts: first, the empirical covariance matrix is thresholded (see next section), resulting in a sparsity pattern that has to be sufficiently sparse as to enable computationally efficient estimation of the (approximate) regression coefficients in B in the second part (see Section 3.2). An implementation of the algorithm, and the used values of the hyper-parameters, are publicly available at https://github.com/hasteck/MRF_NeurIPS_2019.

3.1 Approximate Graph Structure

Numerous approaches for learning the sparse graph structure (and parameters) have been proposed in recent years, e.g., [36, 59, 15, 2, 60, 44, 45, 63, 55, 24, 25, 52, 56, 51]. 
Interestingly, simply applying a threshold to the empirical covariance matrix (in absolute value) [10, 9, 17, 57, 35, 48, 14, 61, 13] can recover the same sparsity pattern as the graphical lasso does [59, 15, 2] under certain assumptions, regarding the connected components [57, 35], as well as the edges [48, 14, 61, 13] in the graph. While it may be computationally expensive to verify that the underlying assumptions are met by a given empirical covariance matrix, the rule of thumb given in [48] is that the assumptions can be expected to hold if the resulting matrix is 'very' sparse. For computational efficiency, we hence apply a threshold to S_λ as to obtain a sufficiently sparse matrix A ∈ R^{|I|×|I|} reflecting the sparsity pattern. Additionally, we apply an upper limit on the number of non-zero entries per column in A (retaining the entries with the largest values in S_λ), as to bound the maximal training-cost of each iteration (see the second-to-last paragraph in the next section). We allow at most 1,000 non-zero entries per column in our experiments in Section 5, based on the trade-off between training time and prediction accuracy: a smaller value tends to reduce the training-time, but it might also degrade the prediction accuracy of the learned sparse model. In Table 2, this threshold actually affects only about 2% of the items when using the sparsity level of 0.5%, while it has no effect at the sparsity level of 0.1%. Apart from that, allowing an item to have up to 1,000 similar items (e.g., songs in the MSD data) seems a reasonably large number in practice.

3.2 Approximate Parameter-Estimates

In this section, we outline our novel approach for approximate parameter-estimation given the sparsity pattern A from the previous section. 
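The structure step of Section 3.1 (thresholding S_λ and capping the per-column support) can be sketched as follows; the threshold and cap values below are illustrative stand-ins, not the values used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 50
S = rng.standard_normal((m, m))
S = (S + S.T) / 2.0          # stand-in for the regularized covariance S_lambda

threshold = 1.0              # illustrative; chosen so that A is 'very' sparse
max_nnz_per_col = 10         # illustrative cap (the paper uses 1,000)

# Keep only entries that are large in absolute value.
A = np.where(np.abs(S) >= threshold, S, 0.0)

# Enforce the per-column cap, retaining the largest-magnitude entries.
for i in range(m):
    col = np.abs(A[:, i])
    nnz = np.flatnonzero(col)
    if nnz.size > max_nnz_per_col:
        keep = nnz[np.argsort(col[nnz])[-max_nnz_per_col:]]
        mask = np.zeros(m, dtype=bool)
        mask[keep] = True
        A[~mask, i] = 0.0

assert all(np.count_nonzero(A[:, i]) <= max_nnz_per_col for i in range(m))
```

The cap bounds the size of the largest neighborhood N(i), and thereby the cost of the largest submatrix inversion in the iterations described next.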
In this approach, the trade-off between approximation-accuracy\nand training-time is controlled by the value of the hyper-parameter r \u2208 [0, 1] used in step 2 below.\nGiven the sparsity pattern A, we \ufb01rst create a list L of all items i \u2208 I, sorted in descending order\nby each item\u2019s number of neighbors in A, i.e., number of non-zero entries in column i (ties may be\nbroken according to the items\u2019 popularities). Our iterative approach is based on this list L, which\ngets modi\ufb01ed until it is empty, which marks the end of the iterations. We also use a set S, initialized\nto be empty. Each iteration is comprised of the following four steps:\nStep 1: We take the \ufb01rst element from list L, say item i, and insert it into set S. Then we determine\nits neighbors N (i) based on the ith column of the sparsity pattern in matrix A.\nStep 2: We now split the set N (i) \u222a {i} into two disjoint sets such that set D(i) contains node\ni as well as the m(i) = round(r \u00b7 |N (i)|) nodes that have the largest empirical covariances with\nnode i (in absolute value), where r \u2208 [0, 1] is the chosen hyper-parameter. The second set is\nC(i) := (N (i) \u222a {i}) \\ D(i). We now make the key assumption of this approach (and do not verify\nit as to save computation time), namely that C(i) is a Markov blanket of D(i) in the sparse graph G.\nObviously, this is a strong assumption, and cannot be expected to hold in general. It may not be\nunreasonable, however, to expect this to be an approximation in the sense that C(i) contains many\nnodes of the (actual) Markov blanket of D(i) in graph G for two reasons: (1) if m(i) = 0, then\nC(i) = N (i) is indeed the Markov blanket of D(i) = {i}; (2) given that we chose D(i) to contain\nthe variables with the largest covariances to node i, their Markov blankets likely have many nodes\nin common with the Markov blanket of node i. 
As we increase the value of m(i) ≤ |N(i)|, the approximation-accuracy obviously deteriorates (except for the case that N(i) ∪ {i} is a connected component in graph G). For these reasons, the value of m(i) (which is controlled by the chosen value of r) allows one to control the trade-off between approximation accuracy and computational cost.

Step 3: Given set D(i) and its (assumed) Markov blanket C(i), we now assume that these nodes are connected as required by the Corollary above. Note that this may assume additional edges to be present, resulting in additional regression parameters that need to be estimated. Obviously, this is a further approximation. However, the decrease in statistical efficiency can be expected to be rather small in the typical setting of collaborative filtering, where the number of data points (i.e., users) usually is much larger than the number of nodes (i.e., items) in the (typically small) subset D(i) ∪ C(i). We now can apply the Corollary in Section 2.2.2, and obtain the estimates for all the columns j ∈ D(i) in matrix B̂ at once. This is the key to the computational efficiency of this approach: for about the same computational cost as estimating the single column i in B̂, we now obtain the (approximate) estimates for 1 + m(i) columns (see Section 2.2.2 for details).

Step 4: Finally, we remove all the 1 + m(i) items in D(i) from the sorted list L, and go to step 1 unless L is empty. Obviously, as we increase the value of r (and hence m(i)), the size of list L decreases by a larger number in each iteration, eventually requiring fewer iterations, which reduces the training-time. If we choose m(i) = 0, then D(i) = {i}, and there is no computational speed-up (and also no approximation) compared to the baseline of solving one regression problem per column in B̂ w.r.t. 
the pseudo-likelihood.

Upon completion of the iterations, we have estimates for all columns of B̂. In fact, for many entries (j, i), there may be multiple estimates; for instance if node i ∈ D(k) and node j ∈ D(k) ∪ C(k) for several different nodes k ∈ S. As to aggregate possibly multiple estimates for entry B̂_{j,i} into a single value, we simply use their average in our experiments.

The computational complexity of this iterative scheme can be controlled by the sparsity level chosen in Section 3.1, as well as the chosen value r ∈ [0, 1], which determines the values m(i) (see step 2). When using the Coppersmith-Winograd algorithm for matrix inversion, it is given by O(Σ_{i∈S} (1 + |N(i)|)^{2.376}), where the size of S depends on the chosen value r. Note that the sum may be dominated by the largest value |N(i)|, which motivated us to cap this value in Section 3.1. Note that set S can be computed in linear time in |I| by iterating through steps 1, 2, and 4 (skipping step 3). Once S is determined, the computation of step 3 for different i ∈ S is embarrassingly parallel. In comparison, in the standard approach of separately regressing each item i against its neighbors N(i), we have O(Σ_{i∈I} |N(i)|^{2.376}), i.e., the sum here extends over all i ∈ I (instead of subset S ⊆ I only). In the other extreme, inverting the entire covariance matrix incurs the cost O(|I|^{2.376}).

Lacking an analytical error-bound, the accuracy of this approximation may be assessed empirically, by simply learning B̂ under different choices regarding the sparsity level (see Section 3.1) and the value r (see step 2 above). Given that recommender systems are re-trained frequently as to ingest the most recent data, only these regular updates require efficient computations in practice. 
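The four steps above can be sketched as follows; this is an illustrative, unoptimized implementation (the function name and default values are ours, and the assumed Markov-blanket property of C(i) is deliberately not verified, as in the paper's approximation):

```python
import numpy as np

def fit_sparse_mrf(X, A, r, lam=1.0):
    """Set-wise approximation of Section 3.2 (steps 1-4), as a sketch.

    X: n x m data matrix; A: m x m sparsity pattern (nonzero = edge);
    r in [0, 1] controls the size of each set D(i).
    """
    n, m = X.shape
    S = (X.T @ X + lam * np.eye(m)) / n     # regularized covariance S_lambda
    B_sum = np.zeros((m, m))                # running sum of estimates ...
    B_cnt = np.zeros((m, m))                # ... and how many per entry

    # List L: items sorted by number of neighbors in A, descending.
    order = sorted(range(m), key=lambda i: -np.count_nonzero(A[:, i]))
    remaining = set(order)

    for i in order:
        if i not in remaining:
            continue
        # Step 1: neighbors N(i) from column i of the sparsity pattern A.
        N = [j for j in np.flatnonzero(A[:, i]) if j != i]
        # Step 2: D(i) = {i} plus the round(r*|N(i)|) neighbors with the
        # largest |covariance| to i; C(i) = the remaining neighbors
        # (assumed, not verified, to be a Markov blanket of D(i)).
        by_cov = sorted(N, key=lambda j: -abs(S[j, i]))
        k = int(round(r * len(N)))
        D = [i] + by_cov[:k]
        CD = D + by_cov[k:]
        # Step 3: one submatrix inversion yields the estimates for all
        # columns in D(i) at once (Corollary of Section 2.2.2).
        C_sub = np.linalg.inv(S[np.ix_(CD, CD)])
        for a, col in enumerate(CD[:len(D)]):        # columns in D(i)
            for b, row in enumerate(CD):
                if row != col:
                    B_sum[row, col] += -C_sub[b, a] / C_sub[a, a]
                    B_cnt[row, col] += 1
        # Step 4: remove all items in D(i) from the list.
        remaining -= set(D)

    # Average possibly multiple estimates per entry.
    return np.where(B_cnt > 0, B_sum / np.maximum(B_cnt, 1.0), 0.0)
```

As a sanity check on the two extremes: with a dense pattern A and r = 1, a single submatrix inversion covers all items, and the result coincides with the closed-form solution of Eq. 4; with r = 0, the scheme reduces to one regression problem per column.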
Additional models with higher accuracy (and increased training-time) may be learned occasionally as to assess the accuracy of the models that get trained regularly.

4 Related Work

We discuss the connections to various related approaches in this section.

Several non-Gaussian distributions are also covered by Besag's auto-models [5, 6], including the logistic auto-model for binary data. Binary data were also considered in [2, 42]. While we rely on the Gaussian distribution for computational efficiency, note that, regarding model-fit, Eq. 4 and the Corollary provide the best least-squares fit of a linear model for any distribution of X, as Eq. 4 is the solution of Eq. 3. The empirical results in Section 5 corroborate that this is an effective trade-off between accuracy and training-time.

Sparse Inverse Covariance Estimation has seen tremendous progress beyond the graphical lasso [59, 15, 2]. A main focus was computationally efficient optimization of the full likelihood [24, 25, 52, 44, 51, 55, 45], often in the regime where n < m (e.g., [24, 52, 51]), or the regime of small data (e.g., [63, 56]). Node-wise regression was considered for structure-learning in [36] and for parameter-learning in [60], which is along the lines of Besag's pseudo-likelihood [7, 8]. The pseudo-likelihood was generalized in [34]. Our paper focuses on a different regime, with large n, m and n > m, as typical for collaborative filtering. In Section 3.2, we outlined a novel kind of sparse approximation, using set-wise rather than node-wise regression, which is commonly used in the literature.

Dependency Networks [21] also regress each node against its neighbors. As this may result in inconsistencies when learning from finite data, a kind of Gibbs sampling is used as to obtain a consistent joint distribution. This increases the computational cost. 
Given that collaborative \ufb01ltering\ntypically operates in the regime of large n, m and n > m, we rely on the asymptotic consistency of\nBesag\u2019s pseudo-likelihood for computational ef\ufb01ciency.\nIn SLIM [38], the objective is similar to Eq. 3, but is comprised of two additional terms: (1) sparsity-\npromoting L1-norm regularization and (2) a non-negativity constraint on the learned regression\nparameters. As we can see in Table 1, this not only reduces accuracy but also increases training-time,\ncompared to \u02c6B(dense). In [38], also the variant fsSLIM was proposed, where \ufb01rst the sparsity pattern\nwas determined via a k-nearest-neighbor approach, and then a separate regression problem was solved\nfor each node. This node-wise regression is a special case (for r = 0) of our set-based approximation\noutlined in Section 3.2. The variants proposed in [46, 32] drop the constraint of a zero diagonal for\ncomputational ef\ufb01ciency, which however is an essential property of Besag\u2019s auto-models [7]. The\nlogistic loss is used in [46], which requires one to solve a separate logistic regression problem for\neach node, which is computationally expensive.\nAutoencoders and Deep Learning have led to many improvements in collaborative \ufb01ltering [23, 22,\n33, 47, 62, 58, 20, 11]. In the pseudo-likelihood of the MRF in Eq. 3, the objective is to reproduce\nX from X (using B), like in an autoencoder, cf. also our short paper [50]. However, there is no\nencoder, decoder or hidden layer in the MRF in Eq. 3. The learned B is typically of full rank, and the\nconstraint diag(B) = 0 is essential for generalizing to unseen data. The empirical evidence in Section\n5 corroborates that this is a viable alternative to using low-dimensional embeddings, as in typical\nautoencoders, as to generalize to unseen data. In fact, recent work on deep collaborative \ufb01ltering\ncombines low-rank and full-rank models for improved recommendation accuracy [11]. 
Moreover,\nrecent progress in deep learning has also led to full-rank models, like invertible deep networks [27], as\nwell as \ufb02ow-based generative models [16, 30, 40, 12, 29]. Adapting these approaches to collaborative\n\ufb01ltering appears to be promising future work in light of the experimental results obtained by the\nfull-rank shallow model in this paper.\nNeighborhood Approaches are typically based on a heuristic item-item (or user-user) similarity\nmatrix (e.g. cosine similarity), e.g., [1, 53, 54] and references therein. Our approach yields three key\ndifferences to cosine-similarity and the like: (1) a principled way of learning/optimizing the similarity\nmatrix \u02c6B from data; (2) Eq. 4 shows that the conceptually correct similarity matrix is not based on (a\nre-scaled version of) the covariance matrix, but on its inverse; (3) the similarity matrix is asymmetric\n(cf. Eq. 4) rather than symmetric.\n\n5 Experiments\n\nIn our experiments, we empirically evaluate the closed-form (dense) solution \u02c6B(dense) (see Eq. 4)\nas well as the sparse approximation outlined in Section 3. We follow the experimental set-up in\n[33] and use their publicly available code for reproducibility.5 Three well-known data sets were\nused in the experiments in [33]: MovieLens 20 Million (ML-20M) [19], Net\ufb02ix Prize (Net\ufb02ix) [3],\nand the Million Song Data (MSD) [4]. They were pre-processed and \ufb01ltered for items and users\nwith a certain activity level in [33], resulting in the data-set sizes shown in Table 1. We use all the\napproaches evaluated on these three data-sets in [33] as baselines:\n\n\u2022 Sparse Linear Method (SLIM) [38] as discussed in Section 4.\n\u2022 Weighted Matrix Factorization (WMF ) [26, 39]: A linear model with a latent representation\nof users and items. 
Variants like NSVD [41] or FISM [28] obtained very similar accuracies.
• Collaborative Denoising Autoencoder (CDAE) [58]: nonlinear model with 1 hidden layer.
• Denoising Autoencoder (MULT-DAE) and Variational Autoencoder (MULT-VAE) [33]: deep nonlinear models, trained w.r.t. the multinomial likelihood. Three hidden layers were found to obtain the best accuracy on these data-sets, see Section 4.3 in [33]. Note that rather shallow architectures are commonly found to obtain the highest accuracy in collaborative-filtering (which is different from other application areas of deep learning, like image classification, where deeper architectures often achieve higher accuracy).

5 The code regarding ML-20M in [33] is publicly available at https://github.com/dawenl/vae_cf, and can be modified for the other two data-sets as described in [33].

We do not compare to Neural Collaborative Filtering (NCF), its extension NeuCF [20] and to Bayesian Personalized Ranking (BPR) [43], as their accuracies were found to be below par on the three data-sets ML-20M, Netflix, and MSD in [33]. NCF and NeuCF [20] were competitive only on unrealistically small data-sets in [33].

We follow the evaluation protocol used in [33],5 which is based on strong generalization, i.e., the training, validation and test sets are disjoint in terms of the users. Normalized Discounted Cumulative Gain (nDCG@100) and Recall (@20 and @50) served as ranking metrics for evaluation in [33]. For further details of the experimental set-up, the reader is referred to [33].

Note that the training-data matrix X here is binary, where 1 indicates an observed user-item interaction. This obviously violates our assumption of a Gaussian distribution, which we made for reasons of computational efficiency. In this case, our approach yields the best least-squares fit of a linear model, as discussed in Section 4. 
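As a concrete illustration of how the closed-form dense solution is computed and then used for top-N recommendation on binary data, a minimal numpy sketch (all sizes and hyper-parameter values below are arbitrary; note that the factor 1/n in S_λ cancels in Eq. 4, so it can be omitted):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, lam, topn = 500, 40, 100.0, 5       # illustrative sizes / hyper-parameters

# Binary user-item interaction matrix (1 = observed interaction).
X = (rng.random((n, m)) < 0.1).astype(float)

# Closed-form dense solution in the style of Eq. 4; the 1/n in S_lambda
# cancels in the ratio -C[j,i]/C[i,i], so we invert X^T X + lam*I directly.
P = np.linalg.inv(X.T @ X + lam * np.eye(m))
B = np.eye(m) - P / np.diag(P)            # zero diagonal by construction

# Recommend for one user: score all items (Eq. 1), mask seen items, take top-N.
x = X[0]
scores = x @ B
scores[x > 0] = -np.inf                   # do not re-recommend seen items
top = np.argsort(-scores)[:topn]

assert np.allclose(np.diag(B), 0.0)
assert len(set(top) & set(np.flatnonzero(x))) == 0
```

The entire training step is a single m × m matrix inversion, which is what makes the closed-form solution so much faster to train than the iterative baselines compared below.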
The empirical results corroborate that this is a viable trade-off between accuracy and training-time, as discussed in the following.

Closed-Form Dense Solution: Table 1 summarizes the experimental results across the three data-sets. It shows that the closed-form solution B̂(dense) (see Eq. 4) obtains an nDCG@100 that is about 1% lower on ML-20M, about 3% better on Netflix, and a remarkable 24% better on MSD than the best competing model, MULT-VAE.

It is an interesting question as to why this simple full-rank model outperforms the deep nonlinear MULT-VAE by such a large margin on the MSD data. We suspect that the hourglass architecture of MULT-VAE (where the smallest hidden layer has 200 dimensions in [33]) severely restricts the information that can flow between the 41,140-dimensional input and output layers (regarding the 41,140 items in the MSD data), so that many relevant dependencies between items may get lost. For instance, compared to the full-rank model, MULT-VAE recommends long-tail items considerably less frequently among the top-N items, on average across all test users in the MSD data; see also [50]. As the MSD data contain about twice as many items as the other two data-sets, this would also explain why the difference in ranking accuracy between MULT-VAE and the proposed full-rank model is the largest on the MSD data. While the ranking accuracy of MULT-VAE may be improved by considerably increasing the number of dimensions, note that this would prolong the training time at least linearly, which is already 4 hours 30 minutes for MULT-VAE on the MSD data (see Table 1). Apart from that, as a simple sanity check, once the full-rank matrix B̂(dense) was learned, we applied a low-rank approximation (SVD), and found that even 3,000 dimensions resulted in about a 10% drop in nDCG@100 on the MSD data.

Table 1: The closed-form dense solution B̂(dense) (see Eq. 4) obtains competitive ranking-accuracy while requiring only a small fraction of the training time, compared to the various models empirically evaluated in [33]. The standard errors of the ranking-metrics are about 0.002, 0.001, and 0.001 on the ML-20M, Netflix, and MSD data [33], respectively.

                          ML-20M                          Netflix                         MSD
models           nDCG@100 Recall@20 Recall@50    nDCG@100 Recall@20 Recall@50    nDCG@100 Recall@20 Recall@50
B̂(dense)           0.423    0.392     0.522        0.397    0.364     0.448        0.391    0.334     0.430
reproduced from [33]:
MULT-VAE           0.426    0.395     0.537        0.386    0.351     0.444        0.316    0.266     0.364
MULT-DAE           0.419    0.387     0.524        0.380    0.344     0.438        0.313    0.266     0.363
CDAE               0.418    0.391     0.523        0.376    0.343     0.428        0.237    0.188     0.283
SLIM               0.401    0.370     0.495        0.379    0.347     0.428        --did not finish in [33]--
WMF                0.386    0.360     0.498        0.351    0.316     0.404        0.257    0.211     0.312
training times:
B̂(dense)           2 min 0 sec                     1 min 30 sec                    15 min 45 sec
MULT-VAE           28 min 10 sec                   1 hour 26 min                   4 hours 30 min
data-set           136,677 users                   463,435 users                   571,355 users
properties:        20,108 movies                   17,769 movies                   41,140 songs
                   10 million interactions         57 million interactions         34 million interactions
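To illustrate why learning B̂(dense) takes only minutes: it boils down to a single matrix inversion on the item-item (rather than user-item) matrix. The sketch below assumes an Eq.-4-like form in which Ĉ is the inverse of the ℓ2-regularized item-item Gram matrix; the regularizer λ and all names are illustrative assumptions, not the paper's exact choices:

```python
import numpy as np

def dense_closed_form(X, lam=500.0):
    """Sketch of an Eq.-4-style closed-form item-item model:
    B = I - C * dMat(gamma), with a zero diagonal by construction."""
    m = X.shape[1]
    # assumption: C-hat is the inverse of the regularized Gram matrix
    C = np.linalg.inv(X.T @ X + lam * np.eye(m))
    gamma = 1.0 / np.diag(C)      # chosen so that diag(B) = 0
    return np.eye(m) - C * gamma  # scales column j of C by gamma[j]

# scores for ranking the items of each user: X @ B
```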
This motivated us to pursue sparse full-rank rather than dense low-rank approximations in this paper, which is naturally facilitated by MRFs.

Besides the differences in accuracy, Table 1 also shows that the training-times of MULT-VAE are more than ten times larger than the few minutes required to learn B̂(dense) on all three data-sets. The reasons are that MULT-VAE is trained on the user-item data-matrix X and uses stochastic gradient descent to optimize the ELBO (which involves several expensive computations in each step); in contrast, the proposed MRF uses a closed-form solution and is trained on the item-item data-matrix (note that #items ≪ #users in our experiments). Also note that the training-times of MULT-VAE reported in Table 1 are optimistic, as they are based on only 50 iterations, where the training of MULT-VAE may not have fully converged yet (the reported accuracies of MULT-VAE are based on 200 iterations). These times were obtained on an AWS instance with 64 GB memory and 16 vCPUs for learning B̂(dense), and with a GPU for training MULT-VAE (which was about five times faster than training MULT-VAE on 16 vCPUs).

Sparse Approximation: Given the short training-times of the closed-form solution on these three data-sets, we demonstrate the speed-up obtained by the sparse approximation (see Section 3) on the MSD data, where the training of the closed-form solution took the longest. Table 2 shows that the training-time can be reduced from about 16 minutes for the closed-form solution to under a minute, with only a relatively small loss in accuracy: while the loss in accuracy is statistically significant (standard error is about 0.001), it is still very small compared to the difference to MULT-VAE, the most accurate competing model in Table 1.

We can also see in Table 2 that different trade-offs between accuracy and training-time can be obtained by using a sparser model and/or a larger hyper-parameter r (which increases the sizes of the subsets of items in step 2 in Section 3.2). First, the special case r = 0 corresponds to regressing each individual item (instead of a subset of items) against its neighbors in the MRF, which is commonly done in the literature (e.g., [7, 21, 36, 38]). Table 2 illustrates that this can be computationally more expensive than inverting the entire covariance matrix at once (cf. the MRF with sparsity 0.5% and r = 0 vs. the dense solution). Second, comparing the MRF with sparsity 0.5% and r = 0.5 vs. the model with sparsity 0.1% and r = 0, we can see that the former obtains a better Recall@50 than the latter, and also requires less training time. This illustrates that it can be beneficial to learn a denser model (0.5% vs. 0.1% sparsity here) combined with a larger value of r (0.5 vs. 0 here). Note that optimizing Recall@50 (vs. @20) is important in applications where a large number of items has to be recommended, like for instance on the homepages of video streaming services, where typically hundreds of videos are recommended.

Table 2: Sparse approximation (see Section 3) on MSD data (standard error ≈ 0.001): ranking-accuracy can be traded for training-time, controlled by the sparsity-level and the parameter r ∈ [0, 1] (defined in Section 3). For comparison, the closed-form solution B̂(dense) and the best competing model, MULT-VAE, from Table 1 are also shown.

                 nDCG@100   Recall@20   Recall@50   Training Time
B̂(dense)           0.391      0.334       0.430     15 min 45 sec
0.5% sparse approximation (see Section 3):
  r = 0            0.390      0.333       0.427     21 min 12 sec
  r = 0.1          0.387      0.331       0.424      3 min 27 sec
  r = 0.5          0.385      0.330       0.424      2 min  1 sec
0.1% sparse approximation (see Section 3):
  r = 0            0.385      0.330       0.421      3 min  7 sec
  r = 0.1          0.382      0.327       0.417      1 min 10 sec
  r = 0.5          0.381      0.327       0.417            39 sec
MULT-VAE           0.316      0.266       0.364     4 hours 30 min
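For illustration, the r = 0 special case above (one regression per individual item against its MRF neighbors) might be sketched as follows; selecting neighbors by largest absolute correlation is our illustrative stand-in for the sparsity pattern of Section 3, which is not reproduced here, and all names are hypothetical:

```python
import numpy as np

def sparse_r0(X, n_neighbors=50, lam=10.0):
    """r = 0 case: one small ridge regression per item, against its neighbors."""
    m = X.shape[1]
    Xc = X - X.mean(axis=0)
    corr = np.corrcoef(Xc, rowvar=False)   # illustrative neighbor criterion
    G = Xc.T @ Xc
    B = np.zeros((m, m))
    for j in range(m):
        c = np.abs(corr[:, j]).copy()
        c[j] = -np.inf                     # exclude the item itself (zero diagonal)
        nbrs = np.argsort(-c)[:n_neighbors]
        A = G[np.ix_(nbrs, nbrs)] + lam * np.eye(len(nbrs))
        B[nbrs, j] = np.linalg.solve(A, G[nbrs, j])
    return B
```

Solving many such small systems is embarrassingly parallel, but, as Table 2 shows, it can still be slower than one full inversion; grouping items into larger subsets (r > 0) amortizes this cost.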
The proposed sparse approximation enables one to choose the optimal trade-off between training-time and ranking-accuracy for a given real-world application.

Also note that, at sparsity levels 0.5% and 0.1%, our sparse model contains about the same number of parameters as a dense matrix of size 41,140×200 and 41,140×40, respectively. In comparison, MULT-VAE in [33] is comprised of layers with dimensions 41,140 → 600 → 200 → 600 → 41,140 regarding the 41,140 items in the MSD data, i.e., it uses two matrices of size 41,140×600. Hence, our sparse approximation (1) has only a fraction of the parameters, (2) requires orders of magnitude less training time, and (3) still obtains about 20% better ranking-accuracy than MULT-VAE, the best competing model, in Table 2 (see also Table 1).

Popularity Bias: The popularity bias in the model's predictions is very important for obtaining high recommendation accuracy; see also [49]. The different item-popularities affect the means and the covariances in the Gaussian MRF, and we used the standard procedure of centering the user-item matrix X (zero mean) and re-scaling the columns of X prior to training. Once training was completed, and when making predictions, we scaled the predicted values back to the original space, so that the predictions reflected the full popularity bias in the training data (which can be expected to be the same as the popularity bias in the test data, due to the way the data were split). This is particularly important when learning the sparse model: theoretically, its sparsity pattern is determined by the correlation matrix (which quantifies the strength of statistical dependence between the nodes in the Gaussian MRF), while the values of the non-zero entries (after scaling them back to the original space) are determined by the covariance matrix.
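A minimal sketch of this centering and re-scaling pipeline; the exponent on the standard deviation and the back-transform shown are our illustrative reading of the procedure, not verbatim from the paper's code:

```python
import numpy as np

def fit_scaling(X, alpha=0.75):
    """Center each column and divide by std**alpha (alpha is a hyper-parameter;
    alpha = 1 would correspond to working with the correlation matrix)."""
    mu = X.mean(axis=0)
    s = np.maximum(X.std(axis=0), 1e-12) ** alpha  # guard against constant columns
    return (X - mu) / s, mu, s

def predict_original_space(X_new, B, mu, s):
    """Score in the scaled space, then map back so the predictions reflect
    the popularity bias of the training data."""
    return (((X_new - mu) / s) @ B) * s + mu
```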
In practice, we divided each column i of X by s_i = std_i^α, where std_i is the column's empirical standard deviation; a grid search over the exponent α ∈ {0, 1/4, 1/2, 3/4, 1} yielded the best accuracy for α = 3/4 on the MSD data (note that α = 1 would result in the correlation matrix). Coincidentally, this is the same value as was used in word2vec [37] to remove the word-popularities in text-data so as to learn word-similarities.

Conclusions

Geared toward collaborative filtering, where typically the numbers of users n (data points) and items m (variables) are both large and n > m, we presented a computationally efficient approximation to learning sparse Gaussian Markov Random Fields (MRF). The key idea is to solve a large number of regression problems, each regarding a small subset of items. The size of each subset can be controlled, which enables one to trade accuracy for training-time. As special (and extreme) cases, this approach subsumes the approaches commonly considered in the literature, namely regressing each item (i.e., a subset of size one) against its neighbors in the MRF, as well as inverting the entire covariance matrix at once. Apart from that, the auto-normal parameterization of MRFs prevents self-similarity of items (i.e., a zero diagonal in the weight-matrix), which we found to be an effective alternative to the low-dimensional embeddings used in autoencoders for enabling the learned model to generalize to unseen data. Requiring several orders of magnitude less training time, the proposed sparse approximation resulted in a model with fewer parameters than the competing models, while obtaining about 20% better ranking accuracy on the data-set with the largest number of items in our experiments.

Appendix

Let $\hat{\mu}^\top \neq 0$ denote the row vector of the (empirical) column-means of the given user-item data-matrix $X$.
If we assume that matrix $B$ fulfills the eigenvector-constraint $\hat{\mu}^\top B = \hat{\mu}^\top$, then it holds that $(X - \mathbf{1}\hat{\mu}^\top) - (X - \mathbf{1}\hat{\mu}^\top)B = X - XB$ (where $\mathbf{1}$ denotes a column vector of ones in the outer product with $\hat{\mu}^\top$). In other words, the learned $\hat{B}$ is invariant under centering the columns of the training data $X$. When the constraint $\hat{\mu}^\top B = \hat{\mu}^\top$ is added to the training objective in Eq. 3, the method of Lagrangian multipliers again yields the closed-form solution:

$$\hat{B} = I - \left(I - \frac{\hat{C}\hat{\mu}\hat{\mu}^\top}{\hat{\mu}^\top \hat{C}\hat{\mu}}\right) \hat{C} \cdot \mathrm{dMat}(\tilde{\gamma}),$$

where now $\tilde{\gamma} = \mathbf{1} \oslash \mathrm{diag}\left(\left(I - \frac{\hat{C}\hat{\mu}\hat{\mu}^\top}{\hat{\mu}^\top \hat{C}\hat{\mu}}\right)\hat{C}\right)$ for the zero diagonal of $\hat{B}$. The difference to Eq. 4 is merely the additional factor $I - \frac{\hat{C}\hat{\mu}\hat{\mu}^\top}{\hat{\mu}^\top \hat{C}\hat{\mu}}$ due to the constraint $\hat{\mu}^\top B = \hat{\mu}^\top$. In our experiments, however, we did not observe this to cause any significant effect regarding the ranking metrics.

Acknowledgments

I am very grateful to Tony Jebara for his encouragement, and to Dawen Liang for providing the code for the experimental setup of all three data-sets.

References

[1] F. Aiolli. Efficient top-N recommendation for very large scale binary rated datasets. In ACM Conference on Recommender Systems (RecSys), 2013.

[2] O. Banerjee, L.E. Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9, 2008.

[3] J. Bennett and S. Lanning. The Netflix Prize.
In Workshop at SIGKDD-07, ACM Conference on Knowledge Discovery and Data Mining, 2007.

[4] T. Bertin-Mahieux, D.P.W. Ellis, B. Whitman, and P. Lamere. The million song dataset. In International Society for Music Information Retrieval Conference (ISMIR), 2011.

[5] J. Besag. Nearest-neighbor systems and the auto-logistic model for binary data. Journal of the Royal Statistical Society, Series B, 34:75–83, 1972.

[6] J. Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, Series B, 36:192–236, 1974.

[7] J. Besag. Statistical analysis of non-lattice data. The Statistician, 24:179–95, 1975.

[8] J. Besag. Efficiency of pseudo-likelihood estimation for simple Gaussian fields. Biometrika, 64, 1977.

[9] T. Blumensath and M.E. Davies. Iterative hard thresholding for compressed sensing, 2008. arXiv:0805.0510.

[10] T. Blumensath and M.E. Davies. Iterative thresholding for sparse approximations. Journal of Fourier Analysis and Applications, 14, 2008.

[11] H.T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, and H. Shah. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems (DLRS), pages 7–10, 2016.

[12] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. In Int. Conference on Learning Representations (ICLR), 2017.

[13] S. Fattahi and S. Sojoudi. Graphical lasso and thresholding: Equivalence and closed form solution. Journal of Machine Learning Research, 20, 2019.

[14] S. Fattahi, R.Y. Zhang, and S. Sojoudi. Sparse inverse covariance estimation for chordal structures. In European Control Conference (ECC), 2018.

[15] J. Friedman, T. Hastie, and R. Tibshirani.
Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9, 2008.

[16] M. Germain, K. Gregor, I. Murray, and H. Larochelle. MADE: Masked autoencoder for distribution estimation. In International Conference on Machine Learning (ICML), 2015.

[17] D. Guillot, B. Rajaratnam, B. Rolfs, A. Maleki, and I. Wong. Iterative thresholding algorithm for sparse inverse covariance estimation. In Advances in Neural Information Processing Systems (NIPS), 2012.

[18] J. M. Hammersley and P. E. Clifford. Markov fields on finite graphs and lattices. Unpublished manuscript, 1971.

[19] F. M. Harper and J. A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5, 2015.

[20] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua. Neural collaborative filtering. In International World Wide Web Conference (WWW), 2017.

[21] D. Heckerman, D.M. Chickering, C. Meek, R. Rounthwaite, and C. Kadie. Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research, 1:49–75, 2000.

[22] B. Hidasi and A. Karatzoglou. Recurrent neural networks with top-k gains for session-based recommendations. In International Conference on Information and Knowledge Management (CIKM), 2017. arXiv:1706.03847.

[23] B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk. Session-based recommendations with recurrent neural networks, 2015. arXiv:1511.06939.

[24] C.-J. Hsieh, M.A. Sustik, I.S. Dhillon, P.K. Ravikumar, and R. Poldrack. BIG & QUIC: Sparse inverse covariance estimation for a million variables. In Advances in Neural Information Processing Systems (NIPS), 2013.

[25] C.-J. Hsieh, M.A. Sustik, I.S. Dhillon, P.K. Ravikumar, and R. Poldrack. QUIC: Quadratic approximation for sparse inverse covariance matrix estimation. Journal of Machine Learning Research, 2014.

[26] Y. Hu, Y. Koren, and C.
Volinsky. Collaborative filtering for implicit feedback datasets. In IEEE International Conference on Data Mining (ICDM), 2008.

[27] J.-H. Jacobsen, A. Smeulders, and E. Oyallon. i-RevNet: Deep invertible networks. In Int. Conference on Learning Representations (ICLR), 2018.

[28] S. Kabbur, X. Ning, and G. Karypis. FISM: Factored item similarity models for top-N recommender systems. In ACM Conference on Knowledge Discovery and Data Mining (KDD), 2013.

[29] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions, 2018. arXiv:1807.03039.

[30] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improving variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems (NIPS), 2016.

[31] S. L. Lauritzen. Graphical Models. Oxford University Press, 1996.

[32] M. Levy and K. Jack. Efficient top-N recommendation by linear regression. In RecSys Large Scale Recommender Systems Workshop, 2013.

[33] D. Liang, R. G. Krishnan, M. D. Hoffman, and T. Jebara. Variational autoencoders for collaborative filtering. In International World Wide Web Conference (WWW), 2018.

[34] B.G. Lindsay. Composite likelihood methods. Contemporary Mathematics, 80, 1988.

[35] R. Mazumder and T. Hastie. Exact covariance thresholding into connected components for large-scale graphical lasso. Journal of Machine Learning Research, 13:781–94, 2012.

[36] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34, 2006.

[37] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Conference on Neural Information Processing Systems (NIPS), 2013.

[38] X. Ning and G. Karypis.
SLIM: Sparse linear methods for top-N recommender systems. In IEEE International Conference on Data Mining (ICDM), pages 497–506, 2011.

[39] R. Pan, Y. Zhou, B. Cao, N. Liu, R. Lukose, M. Scholz, and Q. Yang. One-class collaborative filtering. In IEEE International Conference on Data Mining (ICDM), 2008.

[40] G. Papamakarios, I. Murray, and T. Pavlakou. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems (NIPS), 2017.

[41] A. Paterek. Improving regularized singular value decomposition for collaborative filtering. In KDDCup, 2007.

[42] P. Ravikumar, M. Wainwright, and J. Lafferty. High-dimensional Ising model selection using ℓ1-regularized logistic regression. Annals of Statistics, 38, 2010.

[43] S. Rendle, Ch. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In Conference on Uncertainty in Artificial Intelligence (UAI), pages 452–61, 2009.

[44] K. Scheinberg, S. Ma, and D. Goldfarb. Sparse inverse covariance selection via alternating linearization methods. In Advances in Neural Information Processing Systems (NIPS), 2010.

[45] M. Schmidt. Graphical Model Structure Learning with L1-Regularization. PhD thesis, University of British Columbia, Vancouver, Canada, 2011.

[46] S. Sedhain, A. K. Menon, S. Sanner, and D. Braziunas. On the effectiveness of linear models for one-class collaborative filtering. In AAAI Conference on Artificial Intelligence, 2016.

[47] S. Sedhain, A. K. Menon, S. Sanner, and L. Xie. AutoRec: Autoencoders meet collaborative filtering. In International World Wide Web Conference (WWW), 2015.

[48] S. Sojoudi. Equivalence of graphical lasso and thresholding for sparse graphs. Journal of Machine Learning Research, 2016.

[49] H. Steck. Item popularity and recommendation accuracy.
In ACM Conference on Recommender Systems (RecSys), pages 125–32, 2011.

[50] H. Steck. Embarrassingly shallow autoencoders for sparse data. In International World Wide Web Conference (WWW), 2019.

[51] I. Stojkovic, V. Jelisavcic, V. Milutinovic, and Z. Obradovic. Fast sparse Gaussian Markov random fields learning based on Cholesky factorization. In Int. Joint Conf. on Artificial Intelligence (IJCAI), 2017.

[52] E. Treister and J.S. Turek. A block-coordinate descent approach for large-scale sparse inverse covariance estimation. In Advances in Neural Information Processing Systems (NIPS), 2014.

[53] K. Verstrepen and B. Goethals. Unifying nearest neighbors collaborative filtering. In ACM Conference on Recommender Systems (RecSys), 2014.

[54] M. N. Volkovs and G. W. Yu. Effective latent models for binary feedback in recommender systems. In ACM Conference on Research and Development in Information Retrieval (SIGIR), 2015.

[55] H. Wang, A. Banerjee, C.-J. Hsieh, P.K. Ravikumar, and I.S. Dhillon. Large-scale distributed sparse precision estimation. In Advances in Neural Information Processing Systems (NIPS), 2013.

[56] L. Wang, X. Ren, and Q. Gu. Precision matrix estimation in high dimensional Gaussian graphical models with faster rates. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.

[57] D.M. Witten, J.H. Friedman, and N. Simon. New insights and faster computations for the graphical lasso. Journal of Computational and Graphical Statistics, 20:892–900, 2011.

[58] Y. Wu, C. DuBois, A. X. Zheng, and M. Ester. Collaborative denoising auto-encoders for top-N recommender systems. In ACM Conference on Web Search and Data Mining (WSDM), 2016.

[59] M. Yuan. Model selection and estimation in the Gaussian graphical model. Biometrika, 2007.

[60] M. Yuan. High dimensional inverse covariance matrix estimation via linear programming.
Journal of Machine Learning Research, 11:2261–86, 2010.

[61] R.Y. Zhang, S. Fattahi, and S. Sojoudi. Large-scale sparse inverse covariance estimation via thresholding and max-det matrix completion. In International Conference on Machine Learning (ICML), 2018.

[62] Y. Zheng, B. Tang, W. Ding, and H. Zhou. A neural autoregressive approach to collaborative filtering. In International Conference on Machine Learning (ICML), 2016.

[63] S. Zhou, P. Rütimann, M. Xu, and P. Bühlmann. High-dimensional covariance estimation based on Gaussian graphical models. Journal of Machine Learning Research, 12, 2011.