{"title": "Learning Nonsymmetric Determinantal Point Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 6718, "page_last": 6728, "abstract": "Determinantal point processes (DPPs) have attracted substantial attention as an elegant probabilistic model that captures the balance between quality and diversity within sets. DPPs are conventionally parameterized by a positive semi-definite kernel matrix, and this symmetric kernel encodes only repulsive interactions between items. These so-called symmetric DPPs have significant expressive power, and have been successfully applied to a variety of machine learning tasks, including recommendation systems, information retrieval, and automatic summarization, among many others. Efficient algorithms for learning symmetric DPPs and sampling from these models have been reasonably well studied. However, relatively little attention has been given to nonsymmetric DPPs, which relax the symmetric constraint on the kernel. Nonsymmetric DPPs allow for both repulsive and attractive item interactions, which can significantly improve modeling power, resulting in a model that may better fit for some applications. We present a method that enables a tractable algorithm, based on maximum likelihood estimation, for learning nonsymmetric DPPs from data composed of observed subsets. Our method imposes a particular decomposition of the nonsymmetric kernel that enables such tractable learning algorithms, which we analyze both theoretically and experimentally. 
We evaluate our model on synthetic and real-world datasets, demonstrating improved predictive performance compared to symmetric DPPs, which have previously shown strong performance on modeling tasks associated with these datasets.", "full_text": "Learning Nonsymmetric Determinantal Point\n\nProcesses\n\nMike Gartrell\nCriteo AI Lab\n\nm.gartrell@criteo.com\n\nElvis Dohmatob\nCriteo AI Lab\n\ne.dohmatob@criteo.com\n\nVictor-Emmanuel Brunel\n\nENSAE ParisTech\n\nvictor.emmanuel.brunel@ensae.fr\n\nSyrine Krichene \u21e4\n\nCriteo AI Lab\n\nsyrinekrichene@google.com\n\nAbstract\n\nDeterminantal point processes (DPPs) have attracted substantial attention as an\nelegant probabilistic model that captures the balance between quality and diversity\nwithin sets. DPPs are conventionally parameterized by a positive semi-de\ufb01nite ker-\nnel matrix, and this symmetric kernel encodes only repulsive interactions between\nitems. These so-called symmetric DPPs have signi\ufb01cant expressive power, and\nhave been successfully applied to a variety of machine learning tasks, including rec-\nommendation systems, information retrieval, and automatic summarization, among\nmany others. Ef\ufb01cient algorithms for learning symmetric DPPs and sampling from\nthese models have been reasonably well studied. However, relatively little attention\nhas been given to nonsymmetric DPPs, which relax the symmetric constraint on\nthe kernel. Nonsymmetric DPPs allow for both repulsive and attractive item inter-\nactions, which can signi\ufb01cantly improve modeling power, resulting in a model that\nmay better \ufb01t for some applications. We present a method that enables a tractable\nalgorithm, based on maximum likelihood estimation, for learning nonsymmetric\nDPPs from data composed of observed subsets. Our method imposes a particular\ndecomposition of the nonsymmetric kernel that enables such tractable learning\nalgorithms, which we analyze both theoretically and experimentally. 
We evaluate our model on synthetic and real-world datasets, demonstrating improved predictive performance compared to symmetric DPPs, which have previously shown strong performance on modeling tasks associated with these datasets.

1 Introduction

Determinantal point processes (DPPs) have attracted growing attention from the machine learning community as an elegant probabilistic model for the relationship between items within observed subsets, drawn from a large collection of items. DPPs have been well studied for their theoretical properties [1, 4, 9, 13, 18, 20, 21], and have been applied to numerous machine learning applications, including document summarization [7, 24], recommender systems [11], object retrieval [1], sensor placement [17], information retrieval [19], and minibatch selection [29]. Efficient algorithms for DPP learning [10, 12, 14, 25, 26] and sampling [2, 22, 27] have been reasonably well studied. DPPs are conventionally parameterized by a positive semi-definite (PSD) kernel matrix, and due to this symmetric kernel, they are able to encode only repulsive interactions between items. Despite this limitation, symmetric DPPs have significant expressive power, and have proven effective in the aforementioned applications. However, the ability to encode only repulsive interactions, or negative correlations between pairs of items, does have important limitations in some settings.

*Currently at Google.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

For example, consider the case of a recommender system for a shopping website, where the task is to provide good recommendations for items to complete a user's shopping basket prior to checkout.
For models that can only encode negative correlations, such as the symmetric DPP, it is impossible to directly encode positive interactions between items; e.g., that a purchased basket containing a video game console would be more likely to also contain a game controller. One way to resolve this limitation is to consider nonsymmetric DPPs, which relax the symmetric constraint on the kernel.

Nonsymmetric DPPs allow the model to encode both repulsive and attractive item interactions, which can significantly improve modeling power. With one notable exception [5], little attention has been given to nonsymmetric DPPs within the machine learning community. We present a method for learning fully nonsymmetric DPP kernels from data composed of observed subsets, where we leverage a low-rank decomposition of the nonsymmetric kernel that enables a tractable learning algorithm based on maximum likelihood estimation (MLE).

Contributions Our work makes the following contributions:
• We present a decomposition of the nonsymmetric DPP kernel that enables a tractable MLE-based learning algorithm. To the best of our knowledge, this is the first MLE-based learning algorithm for nonsymmetric DPPs.
• We present a general framework for the theoretical analysis of the properties of the maximum likelihood estimator for a somewhat restricted class of nonsymmetric DPPs, which shows that this estimator enjoys certain consistency guarantees.
• Through an extensive experimental evaluation on several synthetic and real-world datasets, we highlight the significant improvements in modeling power that nonsymmetric DPPs provide in comparison to symmetric DPPs.
We see that nonsymmetric DPPs are more effective at recovering correlation structure within data, particularly for data that contains large disjoint collections of items.

Unlike previous work on signed DPPs [5], our work does not make the very limiting assumption that the correlation kernel of the DPP is symmetric in the absolute values of its entries. This gives our model much more flexibility. Moreover, our learning algorithm, based on maximum likelihood estimation, allows us to leverage a low-rank assumption on the kernel, while the method of moments used in [5] does not seem to allow this. Finally, the learning algorithm in [5] has computational complexity of O(M^6), where M is the size of the ground set (e.g., the item catalog), making it computationally infeasible for most practical scenarios. In contrast, our learning algorithm has a substantially lower time complexity of O(M^3), which allows our approach to be used on many real-world datasets.

2 Background

A DPP models a distribution over subsets of a finite ground set Y that is parametrized by a matrix L ∈ R^{|Y|×|Y|}, such that for any J ⊆ Y,

    Pr(J) ∝ det(L_J),    (1)

where L_J = [L_ij]_{i,j∈J} is the submatrix of L indexed by J. Since the normalization constant for Eq. 1 follows from the observation that Σ_{J⊆Y} det(L_J) = det(L + I), we have, for all J ⊆ Y,

    P_L(J) = det(L_J) / det(L + I).    (2)

Without loss of generality, we will assume that Y = {1, 2, ..., M}, which we also denote by [M], where M ≥ 1 is the cardinality of Y. It is common to assume that L is a positive semi-definite matrix in order to ensure that P_L defines a probability distribution on the power set of [M] [20]. More generally, any matrix L whose principal minors det(L_J), J ⊆ [M], are nonnegative is admissible to define a probability distribution as in (2) [5]; such matrices are called P0-matrices.
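Equations (1) and (2) can be checked numerically. The following is a minimal numpy sketch (ours, not from the paper) that verifies the normalization identity Σ_{J⊆Y} det(L_J) = det(L + I) on a small random PSD kernel:

```python
import itertools
import numpy as np

def subset_prob(L, J):
    """P_L(J) = det(L_J) / det(L + I), Eq. (2); det of the empty submatrix is 1."""
    M = L.shape[0]
    num = np.linalg.det(L[np.ix_(J, J)]) if len(J) > 0 else 1.0
    return num / np.linalg.det(L + np.eye(M))

# Check that the probabilities of all 2^M subsets sum to one, i.e.,
# sum_{J subset of Y} det(L_J) = det(L + I).
rng = np.random.default_rng(0)
M = 4
G = rng.standard_normal((M, M))
L = G @ G.T  # a PSD (hence P0) kernel
total = sum(subset_prob(L, list(J))
            for r in range(M + 1)
            for J in itertools.combinations(range(M), r))
assert abs(total - 1.0) < 1e-8
```

The exhaustive sum over 2^M subsets is only feasible for tiny M; it serves here purely to illustrate why det(L + I) is the right normalization constant.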
Recall that any matrix L can be decomposed uniquely as the sum of a symmetric matrix S and a skew-symmetric matrix A, namely S = (L + L^T)/2 and A = (L − L^T)/2. The following lemma gives a simple sufficient condition on S for L to be a P0-matrix.

Lemma 1. Let L ∈ R^{M×M} be an arbitrary matrix. If L + L^T is PSD, then L is a P0-matrix.

An important consequence is that a matrix of the form D + A, where D is diagonal with positive diagonal entries and A is skew-symmetric, is a P0-matrix. Such a matrix would capture only nonnegative correlations, as explained in the next section.

2.1 Capturing Positive and Negative Correlations

When DPPs are used to model real data, they are often formulated in terms of the L matrix as described above, called an L-ensemble. However, DPPs can alternatively be represented in terms of the M × M matrix K, where K = I − (L + I)^{-1}. Using the K representation,

    Pr(J ⊆ Y) = det(K_J),    (3)

where Y is a random subset drawn from P. K is called the marginal kernel; since here we are defining marginal probabilities that need not sum to 1, no normalization constant is needed. DPPs are conventionally parameterized by a PSD K or L matrix, which is symmetric. However, K and L need not be symmetric. As shown in [5], K is admissible if and only if L is a P0-matrix, that is, all of its principal minors are nonnegative. The class of P0-matrices is much larger, and allows us to accommodate nonsymmetric K and L matrices. To enforce the P0 constraint on L during learning, we impose the decomposition of L described in Section 4. Since we see as a consequence of Lemma 1 that the sum of a PSD matrix and a skew-symmetric matrix is a P0-matrix, this allows us to support nonsymmetric kernels while ensuring that L is a P0-matrix.
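The consequence of Lemma 1 noted above, and the marginal kernel K = I − (L + I)^{-1}, can be illustrated with a small numpy sketch (our own example, not from the paper): a kernel D + A with positive diagonal D and skew-symmetric A has nonnegative principal minors, and its marginal kernel has valid singleton marginals on the diagonal.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
M = 4
G = rng.standard_normal((M, M))
A = (G - G.T) / 2                       # skew-symmetric part: A^T = -A
D = np.diag(rng.uniform(0.5, 2.0, M))   # positive diagonal
L = D + A                               # L + L^T = 2D is PSD, so L is a P0-matrix (Lemma 1)

# Every principal minor det(L_J) is nonnegative.
for r in range(1, M + 1):
    for J in itertools.combinations(range(M), r):
        assert np.linalg.det(L[np.ix_(J, J)]) >= -1e-10

# Marginal kernel of the corresponding L-ensemble: K = I - (L + I)^{-1}.
K = np.eye(M) - np.linalg.inv(L + np.eye(M))
# Diagonal entries of K are the singleton marginals Pr(i in Y), so they lie in [0, 1].
assert np.all(np.diag(K) >= 0) and np.all(np.diag(K) <= 1)
```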
As we will see in the following, there are significant advantages to accommodating nonsymmetric kernels in terms of modeling power.

As shown in [20], the eigenvalues of K are bounded above by one, while L need only be a PSD or P0-matrix. Furthermore, K gives the marginal probabilities of subsets, while L directly models the atomic probabilities of each observed subset of Y. For these reasons, most work on learning DPPs from data uses the L representation of a DPP.

If J = {i} is a singleton set, then Pr(i ∈ Y) = K_ii. The diagonal entries of K directly correspond to the marginal inclusion probabilities for each element of Y. If J = {i, j} is a set containing two elements, then we have

    Pr(i, j ∈ Y) = det [ K_ii  K_ij ; K_ji  K_jj ] = K_ii K_jj − K_ij K_ji.    (4)

Therefore, the off-diagonal elements determine the correlations between pairs of items; that is, cov(1_{i∈Y}, 1_{j∈Y}) = −K_ij K_ji. For a symmetric K, the signs and magnitudes of K_ij and K_ji are the same, resulting in cov(1_{i∈Y}, 1_{j∈Y}) = −K_ij² ≤ 0. We see that in this case, the off-diagonal elements represent negative correlations between pairs of items, where a larger magnitude of K_ij leads to a lower probability of i and j co-occurring, while a smaller magnitude of K_ij indicates a higher co-occurrence probability. If K_ij = 0, then there is no correlation between this pair of items. Since the −K_ij² term is always nonpositive, the symmetric model is able to capture only nonpositive correlations between items. In fact, symmetric DPPs induce a strong form of negative dependence between items, called negative association [3].

For a nonsymmetric K, the signs of K_ij and K_ji may differ, resulting in cov(1_{i∈Y}, 1_{j∈Y}) = −K_ij K_ji ≥ 0. In this case, the off-diagonal elements represent positive correlations between pairs of items, where a larger value of −K_ij K_ji leads to a higher probability of i and j co-occurring, while a smaller value of −K_ij K_ji indicates a lower co-occurrence probability.
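The covariance identity above can be checked directly on toy 2 × 2 marginal kernels (our own illustrative values):

```python
import numpy as np

def pair_cov(K, i, j):
    """cov(1_{i in Y}, 1_{j in Y}) = Pr(i,j in Y) - Pr(i in Y) Pr(j in Y) = -K_ij K_ji."""
    return -K[i, j] * K[j, i]

# Symmetric K: the off-diagonal pair contributes -K_ij^2 <= 0 (repulsion only).
K_sym = np.array([[0.5, 0.2],
                  [0.2, 0.5]])
assert abs(pair_cov(K_sym, 0, 1) - (-0.04)) < 1e-12

# Nonsymmetric K with opposite off-diagonal signs: -K_ij K_ji >= 0 (attraction).
K_non = np.array([[0.5,  0.2],
                  [-0.2, 0.5]])
assert abs(pair_cov(K_non, 0, 1) - 0.04) < 1e-12
```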
Of course, the signs of the off-diagonal elements for some pairs (i, j) may be the same in a nonsymmetric K, which allows the model to also capture negative correlations. Therefore, a nonsymmetric K can capture both negative and positive correlations between pairs of items.

3 General guarantees in maximum likelihood estimation for DPPs

In this section we define the log-likelihood function and study the Fisher information of the model. The Fisher information controls whether the maximum likelihood estimator, computed on n iid samples, will be √n-consistent. When the matrix L is not invertible (i.e., if it is only a P0-matrix and not a P-matrix), the support of P_L, defined as the collection of all subsets J ⊆ [M] such that P_L(J) ≠ 0, depends on L, and the Fisher information will not be defined in general. Hence, we will assume, in this section, that L is invertible and that we only maximize the log-likelihood over classes of invertible matrices L.

Consider a subset Θ of the set of all P-matrices of size M. Given a collection of n observed subsets {Y_1, ..., Y_n} composed of items from Y = [M], our learning task is to fit a DPP kernel L based on this data. For all L ∈ Θ, the log-likelihood is defined as

    f̂_n(L) = (1/n) Σ_{i=1}^n log P_L(Y_i) = Σ_{J⊆[M]} p̂_J log P_L(J) = Σ_{J⊆[M]} p̂_J log det(L_J) − log det(L + I),    (5)

where p̂_J is the proportion of observed samples that equal J. Now, assume that Y_1, ..., Y_n are iid copies of a DPP with kernel L* ∈ Θ. For all L ∈ Θ, the population log-likelihood is defined as the expectation of f̂_n(L), i.e.,

    f(L) = E[log P_L(Y_1)] = Σ_{J⊆[M]} p*_J log det(L_J) − log det(L + I),    (6)

where p*_J = E[p̂_J] = P_{L*}(J). The maximum likelihood estimator (MLE) is defined as a maximizer L̂ of f̂_n(L) over the parameter space Θ.
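The empirical log-likelihood of Eq. (5) can be sketched in a few lines of numpy (our own sketch; log-determinants via `slogdet` for numerical stability):

```python
import numpy as np

def log_lik(L, samples):
    """Empirical log-likelihood of Eq. (5):
    f_n(L) = (1/n) sum_i log det(L_{Y_i}) - log det(L + I)."""
    M = L.shape[0]
    _, logdet_norm = np.linalg.slogdet(L + np.eye(M))
    total = 0.0
    for Y in samples:
        sign, logdet = np.linalg.slogdet(L[np.ix_(Y, Y)])
        assert sign > 0, "det(L_Y) must be positive for an invertible P-matrix kernel"
        total += logdet
    return total / len(samples) - logdet_norm

L = np.array([[2.0, 0.0],
              [0.0, 3.0]])
# A single observed subset {0}: log P_L({0}) = log(2 / det(L + I)) = log(2/12).
assert np.isclose(log_lik(L, [[0]]), np.log(2.0 / 12.0))
```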
Since L̂ can be viewed as a perturbed version of L*, it is convenient to introduce the space H, defined as the linear subspace of R^{M×M} spanned by Θ, and to define the successive derivatives of f̂_n and f as multilinear forms on H. As we will see later on, the complexity of the model is captured by the size of the space H. The following lemma provides a few examples. We say that a matrix L is a signed matrix if, for all i ≠ j, L_{i,j} = ε_{i,j} L_{j,i} for some ε_{i,j} ∈ {−1, 1}.

Lemma 2. 1. If Θ is the set of all positive definite matrices, it is easy to see that H is the set of all symmetric matrices.
2. If Θ is the set of all P-matrices, then H = R^{M×M}.
3. If Θ is the collection of signed P-matrices, then H = R^{M×M}.
4. If Θ is the set of P-matrices of the form S + A, where S is a symmetric matrix and A is a skew-symmetric matrix (i.e., A^T = −A), then H = R^{M×M}.
5. If Θ is the set of signed P-matrices with known sign pattern (i.e., there exists (ε_{i,j})_{i>j} ⊆ {−1, 1} such that for all L ∈ Θ and all i > j, L_{j,i} = ε_{i,j} L_{i,j}), then H is the collection of all signed matrices with that same sign pattern. In particular, if Θ is the set of all P-matrices of the form D + A, where D is diagonal and A is skew-symmetric, then H is the collection of all matrices that are the sum of a diagonal and a skew-symmetric matrix.

It is easy to see that the population log-likelihood f is infinitely differentiable on the relative interior of Θ, and that for all L in the relative interior of Θ and all H ∈ H,

    df(L)(H) = Σ_{J⊆[M]} p*_J tr(L_J^{-1} H_J) − tr((I + L)^{-1} H)    (7)

and

    d²f(L)(H, H) = −Σ_{J⊆[M]} p*_J tr((L_J^{-1} H_J)²) + tr(((I + L)^{-1} H)²).    (8)

Hence, we have the following theorem. The case of symmetric kernels is studied in [6], and the following result is a straightforward extension to arbitrary parameter spaces.
For completeness, we include the proof in the appendix. For a set Θ ⊆ R^{N×N}, we call the relative interior of Θ its interior in the linear space spanned by Θ.

Theorem 1. Let Θ be a set of P-matrices and let L* be in the relative interior of Θ. Then, for all H ∈ H, df(L*)(H) = 0. Moreover, the Fisher information is the negative Hessian of f at L*, and is given by

    −d²f(L*)(H, H) = Var( tr((L*_Y)^{-1} H_Y) ),    (9)

where Y is a DPP with kernel L*.

It follows that the Fisher information is positive definite if and only if any H ∈ H that verifies

    tr((L*_J)^{-1} H_J) = 0, for all J ⊆ [M],    (10)

must be H = 0. When Θ is the space of symmetric and positive definite kernels, the Fisher information is definite if and only if L* is irreducible, i.e., it is not block-diagonal up to a permutation of its rows and columns [6]. In that case, it is shown that the MLE learns L* at the rate n^{-1/2}. In general, this property fails, and even irreducible kernels can induce a singular Fisher information.

Lemma 3. Let Θ be a subset of P-matrices.
1. Let L* ∈ Θ and H ∈ H satisfy (10). Then, for all i ∈ [M], H_{i,i} = 0.
2. Let i, j ∈ [M] with i ≠ j. Let L* ∈ Θ be such that L*_{i,j} ≠ 0, and suppose H satisfies the following property: there exists ε ≠ 0 such that H_{j,i} = ε H_{i,j} for all H ∈ H. Then, if H ∈ H satisfies (10), H_{i,j} = H_{j,i} = 0.
3. Let L* ∈ Θ be block diagonal.
Then, any H ∈ H supported outside of the diagonal blocks of L* satisfies (10).

In particular, this lemma implies that if Θ is a class of signed P-matrices with prescribed sign pattern (i.e., L_{i,j} = ε_{i,j} L_{j,i} for all i ≠ j and all L ∈ Θ, where the ε_{i,j}'s are ±1 and do not depend on L), then if L* lies in the relative interior of Θ and has no zero entries, the Fisher information is definite. In the symmetric case, it is shown in [6] that the only matrices H satisfying (10) must be supported off the diagonal blocks of L*, i.e., the third part of Lemma 3 is an equivalence. In the appendix, we provide a few very simple counterexamples that show that this equivalence is no longer valid in the nonsymmetric case.

4 Model

To add support for positive correlations to the DPP, we consider nonsymmetric L matrices. In particular, our approach incorporates a skew-symmetric perturbation of the PSD L. Recall that any matrix L can be uniquely decomposed as L = S + A, where S is symmetric and A is skew-symmetric. We impose a decomposition on A as A = BC^T − CB^T, where B and C are low-rank M × D′ matrices, and we use a low-rank factorization of S, S = VV^T, where V is a low-rank M × D matrix, as described in [12]. This factorization also enforces S to be PSD and hence, by Lemma 1, L to be a P0-matrix.

We define a regularization term, R(V, B, C), as

    R(V, B, C) = −α Σ_{i=1}^M (1/λ_i) ||v_i||₂² − β Σ_{i=1}^M (1/λ_i) ||b_i||₂² − γ Σ_{i=1}^M (1/λ_i) ||c_i||₂²,    (11)

where λ_i counts the number of occurrences of item i in the training set; v_i, b_i, and c_i are the corresponding row vectors of V, B, and C, respectively; and α, β, γ > 0 are tunable hyperparameters. This regularization formulation is similar to that proposed in [12].
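The kernel decomposition and regularizer just described can be sketched as follows (a minimal numpy illustration of ours; variable names, shapes, and random values are our own choices, not the paper's implementation):

```python
import numpy as np

# Kernel decomposition L = V V^T + (B C^T - C B^T).
rng = np.random.default_rng(2)
M, D, Dp = 6, 3, 2
V = rng.standard_normal((M, D))    # low-rank factor of the symmetric PSD part S
B = rng.standard_normal((M, Dp))   # low-rank factors of the skew-symmetric part A
C = rng.standard_normal((M, Dp))
L = V @ V.T + (B @ C.T - C @ B.T)
# The symmetric part of L is exactly V V^T, which is PSD, so L is P0 by Lemma 1.
assert np.allclose((L + L.T) / 2, V @ V.T)

def regularizer(V, B, C, counts, alpha, beta, gamma):
    """R(V,B,C) of Eq. (11): -alpha sum_i ||v_i||^2 / lambda_i
    - beta sum_i ||b_i||^2 / lambda_i - gamma sum_i ||c_i||^2 / lambda_i,
    where lambda_i (here `counts`) is the training-set count of item i."""
    inv = 1.0 / counts
    return (-alpha * np.sum(inv * np.sum(V * V, axis=1))
            - beta * np.sum(inv * np.sum(B * B, axis=1))
            - gamma * np.sum(inv * np.sum(C * C, axis=1)))

# With all hyperparameters at zero, the regularizer vanishes.
assert regularizer(V, B, C, np.ones(M), 0.0, 0.0, 0.0) == 0.0
```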
From the above, we have the full formulation of the regularized log-likelihood of our model:

    φ(V, B, C) = Σ_{i=1}^n [ log det( V_{Y_i} V_{Y_i}^T + (B_{Y_i} C_{Y_i}^T − C_{Y_i} B_{Y_i}^T) ) − log det( VV^T + (BC^T − CB^T) + I ) ] + R(V, B, C).    (12)

The computational complexity of Eq. 12 is dominated by computing the determinant in the second term (the normalization constant), which is O(M³). Furthermore, since ∂(log det(L))/∂L_ij = tr(L^{-1} ∂L/∂L_ij), the computational complexity of computing the gradient of Eq. 12 during learning is dominated by computing the matrix inverse in the gradient of the second term, (L + I)^{-1}, which is also O(M³). Therefore, we see that the low-rank decomposition of the kernel in our nonsymmetric model does not afford any improvement over a full-rank model in terms of computational complexity. However, our low-rank decomposition does provide savings in terms of the memory required to store model parameters, since our low-rank model has space complexity O(MD + 2MD′), while a full-rank version of this nonsymmetric model has space complexity O(M² + 2M²). When D ≪ M and D′ ≪ M, which is typical in many settings, this results in significant space savings.

5 Experiments

We run extensive experiments on several synthetic and real-world datasets. Since the focus of our work is on improving DPP modeling power and comparing nonsymmetric and symmetric DPPs, we use the standard symmetric low-rank DPP as the baseline model for our experiments.

Preventing numerical instabilities The first term on the right side of Eq. (12) will be singular whenever |Y_i| > D, where Y_i is an observed subset. Therefore, to address this in practice, we set D to the size of the largest subset observed in the data, as explained in [12]. Furthermore, the first term on the right side of Eq. (12) may be singular even when |Y_i| ≤ D.
In this case, we know that we are not at a maximum, since the value of the objective becomes −∞. Numerically, to prevent such singularities, our implementation adds a small εI correction to each L_{Y_i} when optimizing Eq. 12 (we set ε = 10⁻⁵ in our experiments).

5.1 Datasets

We perform next-item prediction and AUC-based classification experiments on two real-world datasets composed of purchased shopping baskets:

1. Amazon Baby Registries: This public dataset consists of 111,006 registries or "baskets" of baby products, and has been used in prior work on DPP learning [11, 14, 25]. The registries are collected from 15 different categories, such as "apparel", "diapers", etc., and the items in each category are disjoint. We evaluate our models on the popular apparel category.
We also perform an evaluation on a dataset composed of the three most popular categories: apparel, diaper, and feeding. We construct this dataset, composed of three large disjoint categories of items with a catalog of 100 items in each category, to highlight the differences in how nonsymmetric and symmetric DPPs model data. In particular, we will see that the nonsymmetric DPP uses positive correlations to capture item co-occurrences within baskets, while negative correlations are used to capture disjoint pairs of items. In contrast, since symmetric DPPs can only represent negative correlations, they must attempt to capture both co-occurring items and disjoint items using only negative correlations.
2. UK Retail: This is a public dataset [8] that contains 25,898 baskets drawn from a catalog of 4,070 items. The dataset contains transactions from a non-store online retail company that primarily sells unique all-occasion gifts, and many customers are wholesalers.
We omit all baskets with more than 100 items, which allows us to use a low-rank factorization of the symmetric DPP (D = 100) that scales well in training and prediction time, while also keeping memory consumption for model parameters at a manageable level.
3. Synthetic data: We also perform an evaluation on synthetically generated data. Our data generator allows us to explicitly control the item catalog size, the distribution of set sizes, and the item co-occurrence distribution. By controlling these parameters, we are able to empirically study how the nonsymmetric and symmetric models behave for data with a specified correlation structure.

5.2 Experimental setup and metrics

Next-item prediction involves identifying the best item to add to a subset of selected items (e.g., basket completion), and is the primary prediction task we evaluate. We compute a next-item prediction for a basket J by conditioning the DPP on the event that all items in J are observed. As described in [13], we compute this conditional kernel, L^J, as L^J = L_{J̄} − L_{J̄,J} L_J^{-1} L_{J,J̄}, where J̄ = Y \ J, L_{J̄} is the restriction of L to the rows and columns indexed by J̄, and L_{J̄,J} consists of the J̄ rows and J columns of L. The computational complexity of this operation is dominated by the three matrix multiplications, which cost O(M²|J|).

We compare the performance of all methods using a standard recommender-system metric: mean percentile rank (MPR). An MPR of 50 is equivalent to random selection; an MPR of 100 indicates that the model perfectly predicts the held-out item. MPR is a recall-based metric that we use to evaluate the model's predictive power by measuring how well it predicts the next item in a basket; it is a standard choice for recommender systems [15, 23]. See Appendix C for a formal description of how the MPR metric is computed. We evaluate the discriminative power of each model using the AUC metric.
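The conditional kernel L^J described above is a Schur complement, and can be sketched as follows (our own numpy sketch; the block-diagonal sanity check is an illustrative example we chose, not the paper's):

```python
import numpy as np

def conditional_kernel(L, J):
    """Schur complement L^J = L_Jbar - L_{Jbar,J} L_J^{-1} L_{J,Jbar}: the kernel
    of the DPP conditioned on all items of the basket J being observed."""
    M = L.shape[0]
    J = list(J)
    Jbar = [i for i in range(M) if i not in set(J)]
    return (L[np.ix_(Jbar, Jbar)]
            - L[np.ix_(Jbar, J)] @ np.linalg.inv(L[np.ix_(J, J)]) @ L[np.ix_(J, Jbar)])

# Sanity check: for a block-diagonal L (two independent item groups), conditioning
# on an item in the first group leaves the second group's kernel unchanged.
L1 = np.array([[2.0, 0.5],
               [0.5, 2.0]])
L2 = np.array([[3.0, 1.0],
               [1.0, 3.0]])
L = np.block([[L1, np.zeros((2, 2))],
              [np.zeros((2, 2)), L2]])
LJ = conditional_kernel(L, [0])  # condition on item 0 (first group)
assert LJ.shape == (3, 3)
assert np.allclose(LJ[1:, 1:], L2)
```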
For this task, we generate a set of negative subsets uniformly at random. For each positive subset J⁺ in the test set, we generate a negative subset J⁻ of the same length by drawing |J⁺| samples uniformly at random, ensuring that the same item is not drawn more than once for a subset. We compute the AUC for the model on these positive and negative subsets, where the score for each subset is the log-likelihood that the model assigns to it. This task measures the ability of the model to discriminate between observed positive subsets (ground-truth subsets) and randomly generated subsets.

For all experiments, a random selection of 80% of the baskets is used for training, and the remaining 20% is used for testing. We use a small held-out validation set for tracking convergence and tuning hyperparameters. Convergence is reached during training when the relative change in validation log-likelihood falls below a pre-determined threshold, which is set identically for all models. We implement our models using PyTorch², and use the Adam [16] optimization algorithm to train our models.

5.3 Results on synthetic datasets

We run a series of synthetic experiments to examine the differences between nonsymmetric and symmetric DPPs. In all of these experiments, we define an oracle that controls the generative process for the data. The oracle uses a deterministic policy to generate a dataset composed of positive baskets (items that co-occur) and negative baskets (items that do not co-occur). This generative policy defines the expected normalized determinant, det(K_J), for each pair of items, and a threshold that limits the maximum determinantal volume for a positive basket and the minimum volume for a negative basket. This threshold is used to compute AUC results for this set of positives and negatives. Note that the negative sets are used only during evaluation.
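The negative-subset construction used for the AUC task can be sketched in one helper (our own sketch; the function name and example values are ours):

```python
import numpy as np

def sample_negative(positive, catalog_size, rng):
    """Draw |J+| distinct items uniformly at random to form a negative subset J-."""
    return rng.choice(catalog_size, size=len(positive), replace=False)

rng = np.random.default_rng(3)
positive = [2, 7, 11]
negative = sample_negative(positive, catalog_size=100, rng=rng)
assert len(negative) == len(positive)
assert len(set(negative.tolist())) == len(positive)  # no repeated items
```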
For each experiment, in Figures 1, 2, and 3, we plot a transformed version of the learned K matrices for the nonsymmetric and symmetric models, where each element (i, j) of this matrix is re-weighted by det(K_{{i,j}}) for the corresponding pair. For each plotted transformation of K, a magenta element corresponds to a negative correlation, which will tend to result in the model predicting that the corresponding pair is a negative pair. Black and cyan elements correspond to smaller and larger positive correlations, respectively, for the nonsymmetric model, and to very small negative correlations for the symmetric model; the model will tend to predict that the corresponding pair is positive in these cases. We perform the AUC-based evaluation for each pair {i, j} by comparing the det(K_{{i,j}}) predicted by the model with the ground-truth determinantal volume provided by the oracle; this task is equivalent to performing basket completion for the pair. In Figures 1, 2, and 3, we also show the prediction error for each pair, where cyan corresponds to low error and magenta corresponds to high error.

Recovering positive examples for low-sparsity data In this experiment we aim to show that the nonsymmetric model is just as capable as the symmetric model at learning negative correlations when trained on data containing few negative correlations and many positive correlations. We choose a setting where the symmetric model performs well. We construct a dataset that contains no large disjoint collections of items, with 100 baskets of size six and a catalog of 100 items. To reduce the impact of negative correlations between items, we use a categorical distribution, with nonuniform event probabilities, for sampling the items that populate each basket, with large coverage of possible item pairs. This ensures few negative correlations, since there is a low probability that two items will never co-occur.
For the nonsymmetric DPP, the oracle expects the model to predict a low negative correlation, or a positive correlation, for a pair of products that have a high co-occurrence probability in the data. The results of this experiment are shown in Figure 1. We see from the plots showing the transformed K matrices that both the nonsymmetric and symmetric models recover approximately the same structure, resulting in similar error plots and a similar predictive AUC of approximately 0.8 for both models.

Recovering negative examples for high-sparsity data We construct a more challenging scenario for this experiment, which reveals an important limitation of the symmetric DPP. The symmetric DPP requires a relatively high density of observed item pairs (positive pairs) in order to learn the negative structure of the data that describes items that do not co-occur. During learning, the DPP will maximize determinantal volumes for positive pairs, while the det(L + I) normalization constant maintains a representation of the global volume of the parameter space for the entire item catalog. For a high density of observed positive pairs, increasing the volume allocated to positive pairs will result in a decrease in the volume assigned to many negative pairs, in order to maintain approximately the same global volume represented by the normalization constant. For a low density of positive pairs, the model will not allocate low volumes to many negative pairs. This phenomenon affects both the nonsymmetric and symmetric models. Therefore, the difference in each model's ability to capture negative structure within a low-density region of positive pairs can be explained in terms of how each model maximizes determinantal volumes using positive and negative correlations. In the case of the symmetric DPP, the model can increase determinantal volumes by using smaller negative correlations, resulting in off-diagonal K_ij = K_ji values that approach zero.
As these off-diagonal parameters approach zero, this behavior has the side effect of also increasing the determinantal volumes of subsets within disjoint groups, since these volumes are also affected by these small parameter values.

²Our code is available at https://github.com/cgartrel/nonsymmetric-DPP-learning

Figure 1: Results for synthetic experiment showing model recovery of the structure of positive examples for low-sparsity data.

Figure 2: Results for synthetic experiment showing model recovery of the structure of negative examples for high-sparsity data. 14 disjoint groups are used for data generation.

In contrast, the nonsymmetric model behaves differently; determinantal volumes can be maximized by switching the signs of the off-diagonal entries of K and increasing the magnitude of these parameters, rather than reducing them to near zero. This behavior allows the model to assign higher volumes to positive pairs than to negative pairs within disjoint groups in many cases, thus allowing the nonsymmetric model to recover disjoint structure.

In our experiment, the oracle controls the sparsity of the data by setting the number of disjoint groups of items; positive pairs within each disjoint group are generated uniformly at random, in order to focus on the effect of disjoint groups. For the AUC evaluation, negative baskets are constructed so that they contain items from at least two different disjoint groups. When constructing our dataset, we set the number of disjoint groups to 14, with 100 baskets of size six and a catalog of 100 items. The results of our experiment are shown in Figure 2. We see from the error plot that the symmetric model cannot effectively learn the structure of the data, leading to high error in many areas, including within the disjoint blocks; the symmetric model provides an AUC of 0.5 as a result.
In contrast, the nonsymmetric model is able to approximately recover the block structure, resulting in an AUC of 0.7.

Recovering positive examples for data that mixes disjoint sparsity with popularity-based positive structure  For our final synthetic experiment, we construct a scenario that combines aspects of our two previous experiments. In this experiment, we consider three disjoint groups. For each disjoint group, we use a categorical distribution with nonuniform event probabilities for sampling items within baskets, which induces a positive correlation structure within each group. Therefore, the oracle will expect to see a high negative correlation for disjoint pairs, compared to all non-disjoint pairs within a particular disjoint group. For items with a high co-occurrence probability, we expect the symmetric DPP to recover a near-zero negative correlation, and the nonsymmetric DPP to recover a positive correlation. Furthermore, we expect both the nonsymmetric and symmetric models to recover higher marginal probabilities, or Kii values, for more popular items. The determinantal volumes for positive pairs containing popular items will thus tend to be larger than the volumes of negative pairs. Therefore, for baskets containing popular items, we expect that both the nonsymmetric and symmetric models will be able to easily discriminate between positive and negative baskets. When constructing positive baskets, popular items are sampled with high probability, proportional to their popularity. We therefore expect that both models will be able to recover some signal about the correlation structure of the data within each disjoint group, resulting in a predictive AUC higher than 0.5, since the popularity-based positive correlation structure provides signal about item pairs within each group.
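A generator of the kind just described can be sketched as follows. The group sizes, the Dirichlet-drawn popularity weights, and all variable names are our own illustrative choices, not the oracle's exact parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

n_groups, group_size, basket_size, n_baskets = 3, 30, 6, 100
groups = [np.arange(g * group_size, (g + 1) * group_size)
          for g in range(n_groups)]

# Nonuniform event probabilities per group: popular items are sampled
# more often, inducing positive correlations among popular item pairs.
popularity = [rng.dirichlet(np.full(group_size, 0.3))
              for _ in range(n_groups)]

baskets = []
for _ in range(n_baskets):
    g = rng.integers(n_groups)  # a basket never mixes disjoint groups
    items = rng.choice(groups[g], size=basket_size,
                       replace=False, p=popularity[g])
    baskets.append(set(items.tolist()))
```

Because every basket is drawn from a single group, items in different groups never co-occur (the disjoint sparsity), while the skewed per-group weights produce the popularity-based positive structure within each group.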
However, we expect that the nonsymmetric model will provide better predictive performance than the symmetric model, since its properties enable recovery of disjoint structure (as discussed previously). We see the expected results in Figure 3, which are further confirmed by the predictive AUC results: 0.7 for the symmetric model, and 0.75 for the nonsymmetric model.

5.4 Results on real-world datasets

To examine how the nonsymmetric model behaves when trained on a real-world dataset with clear disjoint structure, we first train and evaluate the model on the three-category Amazon baby registry dataset. This dataset is composed of three disjoint categories of items, where each disjoint category is composed of 100 items and approximately 10,000 baskets.

Figure 3: Results for synthetic experiment showing model recovery of positive structure for data with popularity-based positive examples and disjoint groups. Three disjoint sets and popularity-based weighted random generation are used for the positive examples.

Table 1: MPR and AUC results for the Amazon Diaper, Amazon three-category (Apparel + Diaper + Feeding), and UK retail datasets. Results show mean and 95% confidence estimates obtained using bootstrapping. Bold values indicate improvement over the symmetric low-rank DPP outside of the confidence interval. We use D = 30, α = 0 for both Amazon datasets; D′ = 100 for the Amazon 3-category dataset; D′ = 30 for the Amazon apparel dataset; D = 100, D′ = 20, α = 1 for the UK dataset; and β = γ = 0 for all datasets.

          Amazon: Apparel               Amazon: 3-category            UK Retail
Metric    Sym DPP        Nonsym DPP     Sym DPP        Nonsym DPP     Sym DPP        Nonsym DPP
MPR       77.42 ± 1.12   80.32 ± 0.75   60.61 ± 0.94   75.09 ± 0.85   76.79 ± 0.60   79.45 ± 0.57
AUC       0.66 ± 0.01    0.73 ± 0.01    0.70 ± 0.01    0.79 ± 0.01    0.57 ± 0.001   0.65 ± 0.01
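The 95% confidence estimates in Table 1 are obtained via bootstrapping; one common recipe is a percentile bootstrap over per-instance metric scores. The paper does not specify the variant, so the following sketch, including the function name and defaults, is our own assumption:

```python
import numpy as np

def bootstrap_ci(scores, n_boot=10_000, level=0.95, seed=0):
    """Mean and percentile-bootstrap confidence interval for a metric
    computed as the mean of per-instance scores (e.g., per-basket terms)."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Resample with replacement n_boot times and record each resample mean.
    idx = rng.integers(0, len(scores), size=(n_boot, len(scores)))
    means = scores[idx].mean(axis=1)
    lo, hi = np.quantile(means, [(1 - level) / 2, (1 + level) / 2])
    return scores.mean(), lo, hi

# Illustrative per-basket scores; real scores would come from the evaluation.
mean, lo, hi = bootstrap_ci(np.random.default_rng(1).normal(0.7, 0.1, 500))
```

Reporting `mean ± (hi - lo) / 2` then matches the "mean and 95% confidence estimate" format of Table 1.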
Given the structure of this dataset, with a small item catalog for each category and a large number of baskets relative to the size of the catalog, we would expect a relatively high density of positive pairwise item correlations within each category. Furthermore, since each category is disjoint, we would expect the model to recover a low density of positive correlations between pairs of items that are disjoint, since these items do not co-occur within observed baskets. We see the experimental results for this three-category dataset in Figure 4. As expected, positive correlations dominate within each category, e.g., within category 1, the model encodes 80.5% of the pairwise interactions as positive correlations. For pairwise interactions between items in two disjoint categories, we see that negative correlations dominate, e.g., between C1 and C2, the model encodes 97.2% of the pairwise interactions as negative correlations (or equivalently, 2.8% as positive interactions).

Table 1 shows the results of our performance evaluation on the Amazon and UK datasets. Compared to the symmetric DPP, we see that the nonsymmetric DPP provides moderate to large improvements on both the MPR and AUC metrics for all datasets. In particular, we see a substantial improvement on the three-category Amazon dataset, providing further evidence that the nonsymmetric DPP is far more effective than the symmetric DPP at recovering the structure of data that contains large disjoint components.

6 Conclusion

By leveraging a low-rank decomposition of the nonsymmetric DPP kernel, we have introduced a tractable MLE-based algorithm for learning nonsymmetric DPPs from data. To the best of our knowledge, this is the first MLE-based learning algorithm for nonsymmetric DPPs.
A general framework for the theoretical analysis of the properties of the maximum likelihood estimator for a somewhat restricted class of nonsymmetric DPPs reveals that this estimator has certain statistical guarantees regarding its consistency. While symmetric DPPs are limited to capturing only repulsive item interactions, nonsymmetric DPPs allow for both repulsive and attractive item interactions, which lead to fundamental changes in model behavior. Through an extensive experimental evaluation on several synthetic and real-world datasets, we have demonstrated that nonsymmetric DPPs can provide significant improvements in modeling power and predictive performance compared to symmetric DPPs. We believe that our contributions open the door to an array of future work on nonsymmetric DPPs, including an investigation of sampling algorithms, reductions in the computational complexity of learning, and further theoretical understanding of the properties of the model.

Figure 4: Percentage of positive pairwise correlations encoded by the nonsymmetric DPP when trained on the three-category Amazon baby registry dataset, as a fraction of all possible pairwise correlations. Category n is denoted by Cn.

        C1       C2       C3
C1    80.5%     2.8%     3.7%
C2     2.8%    71.6%     4.5%
C3     3.7%     4.5%    97.8%