{"title": "Fast Greedy MAP Inference for Determinantal Point Process to Improve Recommendation Diversity", "book": "Advances in Neural Information Processing Systems", "page_first": 5622, "page_last": 5633, "abstract": "The determinantal point process (DPP) is an elegant probabilistic model of repulsion with applications in various machine learning tasks including summarization and search. However, the maximum a posteriori (MAP) inference for DPP which plays an important role in many applications is NP-hard, and even the popular greedy algorithm can still be too computationally expensive to be used in large-scale real-time scenarios. To overcome the computational challenge, in this paper, we propose a novel algorithm to greatly accelerate the greedy MAP inference for DPP. In addition, our algorithm also adapts to scenarios where the repulsion is only required among nearby few items in the result sequence. We apply the proposed algorithm to generate relevant and diverse recommendations. Experimental results show that our proposed algorithm is significantly faster than state-of-the-art competitors, and provides a better relevance-diversity trade-off on several public datasets, which is also confirmed in an online A/B test.", "full_text": "Fast Greedy MAP Inference for Determinantal Point\n\nProcess to Improve Recommendation Diversity\n\nLaming Chen\n\nHulu LLC\n\nBeijing, China\n\nlaming.chen@hulu.com\n\nGuoxin Zhang\u2217\n\nzhangguoxin@kuaishou.com\n\nKwai Inc.\n\nBeijing, China\n\nHanning Zhou\n\nHulu LLC\n\nBeijing, China\n\nericzhouh@gmail.com\n\nAbstract\n\nThe determinantal point process (DPP) is an elegant probabilistic model of repul-\nsion with applications in various machine learning tasks including summarization\nand search. 
However, the maximum a posteriori (MAP) inference for DPP which\nplays an important role in many applications is NP-hard, and even the popular\ngreedy algorithm can still be too computationally expensive to be used in large-\nscale real-time scenarios. To overcome the computational challenge, in this paper,\nwe propose a novel algorithm to greatly accelerate the greedy MAP inference for\nDPP. In addition, our algorithm also adapts to scenarios where the repulsion is\nonly required among nearby few items in the result sequence. We apply the pro-\nposed algorithm to generate relevant and diverse recommendations. Experimental\nresults show that our proposed algorithm is signi\ufb01cantly faster than state-of-the-art\ncompetitors, and provides a better relevance-diversity trade-off on several public\ndatasets, which is also con\ufb01rmed in an online A/B test.\n\n1\n\nIntroduction\n\nThe determinantal point process (DPP) was \ufb01rst introduced in [33] to give the distributions of fermion\nsystems in thermal equilibrium. The repulsion of fermions is described precisely by DPP, making it\nnatural for modeling diversity. Besides its early applications in quantum physics and random matrices\n[35], it has also been recently applied to various machine learning tasks such as multiple-person\npose estimation [27], image search [28], document summarization [29], video summarization [19],\nproduct recommendation [18], and tweet timeline generation [49]. Compared with other probabilistic\nmodels such as the graphical models, one primary advantage of DPP is that it admits polynomial-time\nalgorithms for many types of inference, including conditioning and sampling [30].\nOne exception is the important maximum a posteriori (MAP) inference, i.e., \ufb01nding the set of items\nwith the highest probability, which is NP-hard [25]. Consequently, approximate inference methods\nwith low computational complexity are preferred. 
A near-optimal MAP inference method for DPP\nis proposed in [17]. However, this algorithm is a gradient-based method with high computational\ncomplexity for evaluating the gradient in each iteration, making it impractical for large-scale real-time\napplications. Another method is the widely used greedy algorithm [37], justi\ufb01ed by the fact that the\nlog-probability of set in DPP is submodular. Despite its relatively weak theoretical guarantees [13], it\nis widely used due to its promising empirical performance [29, 19, 49]. Known exact implementations\nof the greedy algorithm [17, 32] have O(M 4) complexity, where M is the total number of items. Han\net al.\u2019s recent work [20] reduces the complexity down to O(M 3) by introducing some approximations,\nwhich sacri\ufb01ces accuracy. In this paper, we propose an exact implementation of the greedy algorithm\nwith O(M 3) complexity, and it runs much faster than the approximate one [20] empirically.\n\n\u2217This work was conducted while the author was with Hulu.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fThe essential characteristic of DPP is that it assigns higher probability to sets of items that are diverse\nfrom each other [30]. In some applications, the selected items are displayed as a sequence, and the\nnegative interactions are restricted only among nearby few items. For example, when recommending\na long sequence of items to the user, each time only a small portion of the sequence catches the user\u2019s\nattention. In this scenario, requiring items far away from each other to be diverse is unnecessary.\nDeveloping fast algorithm for this scenario is another motivation of this paper.\nContributions. In this paper, we propose a novel algorithm to greatly accelerate the greedy MAP\ninference for DPP. 
By updating the Cholesky factor incrementally, our algorithm reduces the complexity down to O(M³), and runs in O(N²M) time to return N items, making it practical for large-scale real-time scenarios. To the best of our knowledge, this is the first exact implementation of the greedy MAP inference for DPP with such a low time complexity.

In addition, we also adapt our algorithm to scenarios where the diversity is only required within a sliding window. Supposing the window size is w < N, the complexity can be reduced to O(wNM). This feature makes it particularly suitable for scenarios where we need a long sequence of items diversified within a short sliding window.

Finally, we apply our proposed algorithm to the recommendation task. Recommending diverse items gives users opportunities to explore and discover novel and serendipitous items, and also enables the service to discover users' new interests. As shown in the experimental results on public datasets and in an online A/B test, the DPP-based approach enjoys a favorable trade-off between relevance and diversity compared with the known methods.

2 Background and Related Work

Notations. Sets are represented by uppercase letters such as Z, and #Z denotes the number of elements in Z. Vectors and matrices are represented by bold lowercase letters and bold uppercase letters, respectively. (·)^⊤ denotes the transpose of the argument vector or matrix. ⟨x, y⟩ is the inner product of two vectors x and y. Given subsets X and Y, L_{X,Y} is the sub-matrix of L indexed by X in rows and Y in columns. For notational simplicity, we let L_{X,X} = L_X, L_{X,{i}} = L_{X,i}, and L_{{i},X} = L_{i,X}. det(L) is the determinant of L, and det(L_∅) = 1 by convention.

2.1 Determinantal Point Process

DPP is an elegant probabilistic model with the ability to express negative interactions [30]. Formally, a DPP P on a discrete set Z = {1, 2, . . . 
, M} is a probability measure on 2^Z, the set of all subsets of Z. When P gives nonzero probability to the empty set, there exists a matrix L ∈ R^{M×M} such that for every subset Y ⊆ Z, the probability of Y is

P(Y) ∝ det(L_Y),

where L is a real, positive semidefinite (PSD) kernel matrix indexed by the elements of Z. Under this distribution, many types of inference tasks, including marginalization, conditioning, and sampling, can be performed in polynomial time, with the exception of the MAP inference

Y_map = argmax_{Y ⊆ Z} det(L_Y).

In some applications, we need to impose a cardinality constraint on Y to return a subset of fixed size with the highest probability, resulting in the MAP inference for k-DPP [28].

Besides the works on the MAP inference for DPP introduced in Section 1, some other works propose to draw samples and return the one with the highest probability. In [16], a fast sampling algorithm with complexity O(N²M) is proposed when the eigendecomposition of L is available. Though the update rules of [16] and our work are similar, there are two major differences that make our approach more efficient. First, [16] requires the eigendecomposition of L, with time complexity O(M³). This computational overhead dominates the overall running time when we only need to return a small number of items. By contrast, our approach requires only O(N²M) complexity overall to return N items. Second, the sampling algorithm of DPP usually needs to perform multiple sampling trials to achieve empirical performance comparable with the greedy algorithm, which further increases the computational complexity.

2.2 Greedy Submodular Maximization

A set function is a real-valued function defined on 2^Z. If the marginal gains of a set function f are non-increasing, i.e., for any i ∈ Z and any X ⊆ Y ⊆ Z \ {i},

f(X ∪ {i}) − f(X) ≥ f(Y ∪ {i}) − f(Y),

then f is submodular. 
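For intuition, this diminishing-returns property can be checked numerically for the log-determinant objective used throughout this paper. The following is our own small sketch (function and variable names are ours, not from the paper):

```python
import numpy as np

def log_det(L, Y):
    """f(Y) = log det(L_Y), with f(empty set) = log 1 = 0."""
    if not Y:
        return 0.0
    idx = np.array(sorted(Y))
    _, val = np.linalg.slogdet(L[np.ix_(idx, idx)])
    return val

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
L = B @ B.T + 0.5 * np.eye(5)   # a random positive definite kernel

# Check the diminishing-returns inequality for nested X subset of Y, i not in Y.
X, Y, i = {0}, {0, 1, 2}, 3
gain_X = log_det(L, X | {i}) - log_det(L, X)
gain_Y = log_det(L, Y | {i}) - log_det(L, Y)
assert gain_X >= gain_Y - 1e-9   # holds for any PSD kernel
```

The inequality holds for every nested pair here, consistent with the submodularity of log det discussed next.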
The log-probability function in DPP, f(Y) = log det(L_Y), is submodular, as revealed in [17]. Submodular maximization corresponds to finding a set which maximizes a submodular function. The MAP inference for DPP is a submodular maximization problem.

Submodular maximization is generally NP-hard. A popular approximation approach is based on the greedy algorithm [37]. Initialized as ∅, in each iteration, an item which maximizes the marginal gain,

j = argmax_{i ∈ Z \ Y_g} f(Y_g ∪ {i}) − f(Y_g),

is added to Y_g, until the maximal marginal gain becomes negative or the cardinality constraint is violated. When f is monotone, i.e., f(X) ≤ f(Y) for any X ⊆ Y, the greedy algorithm admits a (1 − 1/e)-approximation guarantee subject to a cardinality constraint [37]. For general submodular maximization with no constraints, a modified greedy algorithm guarantees a (1/2)-approximation [10]. Despite these relatively weak theoretical guarantees, the greedy algorithm is widely used for DPP due to its promising empirical performance [29, 19, 49].

2.3 Diversity of Recommendation

Improving recommendation diversity has been an active field in machine learning and information retrieval. Some works addressed this problem in a generic setting to achieve a better trade-off between arbitrary relevance and dissimilarity functions [11, 9, 51, 8, 21]. However, they used only pairwise dissimilarities to characterize the overall diversity property of the list, which may not capture some complex relationships among items (e.g., the characteristics of one item can be described as a simple linear combination of another two). 
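To make the parenthetical concrete: when one item's feature vector is a linear combination of two others, every pairwise dissimilarity can remain strictly positive while the determinant of the similarity submatrix collapses to zero. A small numeric illustration with vectors of our own choosing:

```python
import numpy as np

f1 = np.array([1.0, 0.0])
f2 = np.array([0.0, 1.0])
f3 = (f1 + f2) / np.linalg.norm(f1 + f2)   # a linear combination of f1 and f2

F = np.stack([f1, f2, f3])
S = F @ F.T                                # cosine similarities

# Pairwise dissimilarities 1 - S_ij are all strictly positive ...
pairwise = [1 - S[i, j] for i, j in [(0, 1), (0, 2), (1, 2)]]
assert min(pairwise) > 0.25

# ... yet the set-level volume det(S) is (numerically) zero,
# revealing the redundancy that pairwise scores cannot see.
assert abs(np.linalg.det(S)) < 1e-9
```

Determinant-based measures such as DPP probabilities detect this kind of redundancy because rank deficiency of the feature set drives det(S) to zero.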
Some other works tried to build new recommender systems to\npromote diversity through the learning process [3, 43, 48], but this makes the algorithms less generic\nand unsuitable for direct integration into existing recommender systems.\nSome works proposed to de\ufb01ne the similarity metric based on the taxonomy information [52, 2,\n12, 45, 4, 44]. However, the semantic taxonomy information is not always available, and it may be\nunreliable to de\ufb01ne similarity based on them. Several other works proposed to de\ufb01ne the diversity\nmetric based on explanation [50], clustering [7, 5, 31], feature space [40], or coverage [47, 39].\nIn this paper, we apply the DPP model and our proposed algorithm to optimize the trade-off between\nrelevance and diversity. Unlike existing techniques based on pairwise dissimilarities, our method\nde\ufb01nes the diversity in the feature space of the entire subset. Notice that our approach is essentially\ndifferent from existing DPP-based methods for recommendation. In [18, 34, 14, 15], they proposed\nto recommend complementary products to the ones in the shopping basket, and the key is to learn the\nkernel matrix of DPP to characterize the relations among items. By contrast, we aim to generate a\nrelevant and diverse recommendation list through the MAP inference.\nThe diversity considered in our paper is different from the aggregate diversity in [1, 38]. Increasing\naggregate diversity promotes long tail items, while improving diversity prefers diverse items in each\nrecommendation list.\n\n3 Fast Greedy MAP Inference\n\nIn this section, we present a fast implementation of the greedy MAP inference algorithm for DPP. In\neach iteration, item\n\nj = arg maxi\u2208Z\\Yg log det(LYg\u222a{i}) \u2212 log det(LYg )\n\n(1)\n\nis added to the already selected item set Yg. Since L is a PSD matrix, all of its principal minors are\nalso PSD. 
Suppose det(L_{Y_g}) > 0, and the Cholesky decomposition of L_{Y_g} is available as

L_{Y_g} = VV^⊤,

where V is an invertible lower triangular matrix. For any i ∈ Z \ Y_g, the Cholesky decomposition of L_{Y_g ∪ {i}} can be derived as

L_{Y_g ∪ {i}} = [ L_{Y_g}  L_{Y_g,i} ; L_{i,Y_g}  L_{ii} ] = [ V  0 ; c_i  d_i ] [ V  0 ; c_i  d_i ]^⊤,   (2)

where the row vector c_i and the scalar d_i ≥ 0 satisfy

V c_i^⊤ = L_{Y_g,i},   (3)
d_i² = L_{ii} − ‖c_i‖₂².   (4)

In addition, according to Equ. (2), it can be derived that

det(L_{Y_g ∪ {i}}) = det(VV^⊤) · d_i² = det(L_{Y_g}) · d_i².   (5)

Therefore, Opt. (1) is equivalent to

j = argmax_{i ∈ Z \ Y_g} log(d_i²).   (6)

Once Opt. (6) is solved, according to Equ. (2), the Cholesky decomposition of L_{Y_g ∪ {j}} becomes

L_{Y_g ∪ {j}} = [ V  0 ; c_j  d_j ] [ V  0 ; c_j  d_j ]^⊤,   (7)

where c_j and d_j are readily available. The Cholesky factor of L_{Y_g} can therefore be efficiently updated after a new item is added to Y_g.

For each item i, c_i and d_i can also be updated incrementally. After Opt. (6) is solved, define c′_i and d′_i as the new vector and scalar of i ∈ Z \ (Y_g ∪ {j}). According to Equ. (3) and Equ. (7), we have

[ V  0 ; c_j  d_j ] c′_i^⊤ = L_{Y_g ∪ {j},i} = [ L_{Y_g,i} ; L_{ji} ].   (8)

Combining Equ. (8) with Equ. (3), we conclude

c′_i = [c_i  (L_{ji} − ⟨c_j, c_i⟩)/d_j] ≐ [c_i  e_i].

Then Equ. (4) implies

d′_i² = L_{ii} − ‖c′_i‖₂² = L_{ii} − ‖c_i‖₂² − e_i² = d_i² − e_i².   (9)

Initially, Y_g = ∅, and Equ. (5) implies d_i² = det(L_{ii}) = L_{ii}. The complete algorithm is summarized in Algorithm 1.

Algorithm 1 Fast Greedy MAP Inference
1: Input: kernel L, stopping criteria
2: Initialize: c_i = [], d_i² = L_{ii}, j = argmax_{i ∈ Z} log(d_i²), Y_g = {j}
3: while stopping criteria not satisfied do
4:   for i ∈ Z \ Y_g do
5:     e_i = (L_{ji} − ⟨c_j, c_i⟩)/d_j
6:     c_i = [c_i  e_i], d_i² = d_i² − e_i²
7:   end for
8:   j = argmax_{i ∈ Z \ Y_g} log(d_i²), Y_g = Y_g ∪ {j}
9: end while
10: Return: Y_g

The stopping criterion is d_j² < 1 for unconstrained MAP inference, or #Y_g > N when the cardinality constraint is imposed. For the latter case, we introduce a small number ε > 0 and add d_j² < ε to the stopping criteria for the numerical stability of calculating 1/d_j. In the k-th iteration, for each item i ∈ Z \ Y_g, updating c_i and d_i involves the inner product of two vectors of length k, resulting in overall complexity O(kM). Therefore, Algorithm 1 runs in O(M³) time for unconstrained MAP inference and O(N²M) to return N items. Notice that this is achieved with additional O(NM) (or O(M²) for the unconstrained case) space for c_i and d_i.

4 Diversity within Sliding Window

In some applications, the selected items are displayed as a sequence, and the diversity is only required within a sliding window. Denote the window size as w. We modify Opt. (1) to

j = argmax_{i ∈ Z \ Y_g} log det(L_{Y_g^w ∪ {i}}) − log det(L_{Y_g^w}),   (10)

where Y_g^w ⊆ Y_g contains the w − 1 most recently added items. When #Y_g ≥ w, a simple modification of the method of [32] solves Opt. (10) with complexity O(w²M). We adapt our algorithm to this scenario so that Opt. (10) can be solved in O(wM) time.

In Section 3, we showed how to efficiently select a new item when V, c_i, and d_i are available. For Opt. (10), V is the Cholesky factor of L_{Y_g^w}. After Opt. (10) is solved, we can similarly update V, c_i, and d_i for L_{Y_g^w ∪ {j}}. When the number of items in Y_g^w is w − 1, to update Y_g^w, we also need to remove the earliest added item in Y_g^w. The detailed derivations of updating V, c_i, and d_i when the earliest added item is removed are given in the supplementary material.

Algorithm 2 Fast Greedy MAP Inference with a Sliding Window
1: Input: kernel L, window size w, stopping criteria
2: Initialize: V = [], c_i = [], d_i² = L_{ii}, j = argmax_{i ∈ Z} log(d_i²), Y_g = {j}
3: while stopping criteria not satisfied do
4:   Update V according to Equ. (7)
5:   for i ∈ Z \ Y_g do
6:     e_i = (L_{ji} − ⟨c_j, c_i⟩)/d_j
7:     c_i = [c_i  e_i], d_i² = d_i² − e_i²
8:   end for
9:   if #Y_g ≥ w then
10:     v = V_{2:,1}, V = V_{2:,2:}, a_i = c_{i,1}, c_i = c_{i,2:}
11:     for l = 1, · · · , w − 1 do
12:       t² = V_{ll}² + v_l²
13:       V_{l+1:,l} = (V_{l+1:,l} V_{ll} + v_{l+1:} v_l)/t, v_{l+1:} = (v_{l+1:} t − V_{l+1:,l} v_l)/V_{ll}
14:       for i ∈ Z \ Y_g do
15:         c_{i,l} = (c_{i,l} V_{ll} + a_i v_l)/t, a_i = (a_i t − c_{i,l} v_l)/V_{ll}
16:       end for
17:       V_{ll} = t
18:     end for
19:     for i ∈ Z \ Y_g do
20:       d_i² = d_i² + a_i²
21:     end for
22:   end if
23:   j = argmax_{i ∈ Z \ Y_g} log(d_i²), Y_g = Y_g ∪ {j}
24: end while
25: Return: Y_g

The complete algorithm is summarized in Algorithm 2. 
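Both algorithms are built on the same incremental update of c_i and d_i². As a concrete illustration, the following is our own simplified NumPy sketch of Algorithm 1 (fixed cardinality N, no ε handling); it is not the authors' released code:

```python
import numpy as np

def fast_greedy_map(L, N):
    """Greedy MAP inference for a DPP with PSD kernel L (Algorithm 1 sketch).

    For every unselected item i it maintains the Cholesky row c_i and
    d_i^2 = L_ii - |c_i|^2, so iteration k costs O(k*M) instead of O(k^3*M).
    """
    M = L.shape[0]
    cis = np.zeros((N, M))             # cis[:k, i] holds c_i after k selections
    d2 = np.diag(L).astype(float).copy()
    j = int(np.argmax(d2))
    Yg = [j]
    while len(Yg) < N:
        k = len(Yg) - 1
        # e_i = (L_ji - <c_j, c_i>) / d_j, computed for all i at once
        ei = (L[j, :] - cis[:k, j] @ cis[:k, :]) / np.sqrt(d2[j])
        cis[k, :] = ei
        d2 = d2 - ei ** 2
        d2[Yg] = -np.inf               # never reselect chosen items
        j = int(np.argmax(d2))
        Yg.append(j)
    return Yg
```

On small random kernels this selection coincides with a naive greedy that recomputes log-determinants from scratch, which is a useful sanity check of the update rules.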
Line 10-21 shows how to update V, c_i, and d_i in place after the earliest item is removed. In the k-th iteration, where k ≥ w, updating V, all c_i, and all d_i requires O(w²), O(wM), and O(M) time, respectively. The overall complexity of Algorithm 2 is O(wNM) to return N ≥ w items. Numerical stability is discussed in the supplementary material.

5 Improving Recommendation Diversity

In this section, we describe a DPP-based approach for recommending relevant and diverse items to users. For a user u, the profile item set P_u is defined as the set of items that the user likes. Based on P_u, a recommender system recommends items R_u to the user.

The approach takes three inputs: a candidate item set C_u, a score vector r_u which indicates how relevant the items in C_u are, and a PSD matrix S which measures the similarity of each pair of items. The first two inputs can be obtained from the internal results of many traditional recommendation algorithms. The third input, the similarity matrix S, can be obtained from the attributes of items, their interaction relations with users, or a combination of both. This approach can be regarded as a ranking algorithm balancing the relevance of items and their similarities.

To apply the DPP model in the recommendation task, we need to construct the kernel matrix. As revealed in [30], the kernel matrix can be written as a Gram matrix, L = B^⊤B, where the columns of B are vectors representing the items. We can construct each column vector B_i as the product of the item score r_i ≥ 0 and a normalized vector f_i ∈ R^D with ‖f_i‖₂ = 1. 
The entries of kernel L can be written as

L_{ij} = ⟨B_i, B_j⟩ = ⟨r_i f_i, r_j f_j⟩ = r_i r_j ⟨f_i, f_j⟩.   (11)

We can think of ⟨f_i, f_j⟩ as measuring the similarity between item i and item j, i.e., ⟨f_i, f_j⟩ = S_{ij}. Therefore, the kernel matrix for user u can be written as

L = Diag(r_u) · S · Diag(r_u),

where Diag(r_u) is a diagonal matrix whose diagonal vector is r_u. The log-probability of R_u is

log det(L_{R_u}) = Σ_{i ∈ R_u} log(r²_{u,i}) + log det(S_{R_u}).   (12)

The second term in Equ. (12) is maximized when the item representations of R_u are orthogonal, and therefore it promotes diversity. It clearly shows how the DPP model incorporates the relevance and diversity of the recommended items.

A nice feature of the methods in [11, 51, 8] is that they involve a tunable parameter which allows users to adjust the trade-off between relevance and diversity. According to Equ. (12), the original DPP model does not offer such a mechanism. We modify the log-probability of R_u to

log P(R_u) ∝ θ · Σ_{i ∈ R_u} r_{u,i} + (1 − θ) · log det(S_{R_u}),

where θ ∈ [0, 1]. This corresponds to a DPP with kernel

L′ = Diag(exp(α r_u)) · S · Diag(exp(α r_u)),

where α = θ/(2(1 − θ)). We can also get the marginal gain of log-probability

log P(R_u ∪ {i}) − log P(R_u) ∝ θ · r_{u,i} + (1 − θ) · (log det(S_{R_u ∪ {i}}) − log det(S_{R_u})).   (13)

Then Algorithm 1 and Algorithm 2 can be easily modified to maximize (13) with kernel matrix S. Notice that we need the similarity S_{ij} ∈ [0, 1] for the recommendation task, where 0 means the most diverse and 1 means the most similar. This may be violated when the inner product of normalized vectors ⟨f_i, f_j⟩ can take negative values. 
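Forming the θ-controlled kernel L′ is a one-line transformation of the similarity matrix. A hedged sketch (function and array names are ours, and the similarity matrix here is assumed already nonnegative):

```python
import numpy as np

def tradeoff_kernel(r, S, theta):
    """L' = Diag(exp(alpha*r)) S Diag(exp(alpha*r)) with alpha = theta/(2(1-theta)).

    theta in [0, 1): larger theta weights relevance more; theta = 0 is pure diversity.
    """
    alpha = theta / (2.0 * (1.0 - theta))
    w = np.exp(alpha * r)
    return w[:, None] * S * w[None, :]   # elementwise scaling of rows and columns

# With theta = 0 (alpha = 0) the kernel reduces to the similarity matrix itself.
S = np.array([[1.0, 0.3], [0.3, 1.0]])
r = np.array([2.0, 1.0])
assert np.allclose(tradeoff_kernel(r, S, 0.0), S)
```

Since Diag(exp(α r_u)) only rescales rows and columns, L′ stays PSD whenever S is, so the greedy machinery above applies unchanged.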
In the extreme case, the most diverse pair is f_i = −f_j, but the determinant of the corresponding sub-matrix is 0, the same as for f_i = f_j. To guarantee nonnegativity, we can take a linear mapping while keeping S a PSD matrix, e.g.,

S_{ij} = (1 + ⟨f_i, f_j⟩)/2 = ⟨ [1/√2, f_i/√2], [1/√2, f_j/√2] ⟩ ∈ [0, 1].

6 Experimental Results

In this section, we evaluate and compare our proposed algorithms on a synthetic dataset and on real-world recommendation tasks. Algorithms are implemented in Python with vectorization. The experiments are performed on a laptop with a 2.2 GHz Intel Core i7 and 16 GB RAM.

6.1 Synthetic Dataset

In this subsection, we evaluate the performance of Algorithm 1 on the MAP inference for DPP. We follow the experimental setup in [20]. The entries of the kernel matrix satisfy Equ. (11), where r_i = exp(0.01 x_i + 0.2) with x_i ∈ R drawn from the normal distribution N(0, 1), and f_i ∈ R^D with D equal to the total item size M and entries drawn i.i.d. from N(0, 1) and then normalized.

Our proposed faster exact algorithm (FaX) is compared with the Schur complement combined with lazy evaluation (Lazy) [36] and the faster approximate algorithm (ApX) [20]. The parameters of the reference algorithms are chosen as suggested in [20]. The gradient-based method in [17] and the double greedy algorithm in [10] are not compared because, as reported in [20], they performed worse than ApX. We report the speedup over Lazy of each algorithm, as well as the ratio of log-probability [20],

log det(L_Y) / log det(L_{Y_Lazy}),

where Y and Y_Lazy are the outputs of an algorithm and of Lazy, respectively. 

Figure 1: Comparison of Lazy, ApX, and our FaX under different M when N = 1000 (left), and under different N when M = 6000 (right).
We compare these metrics when the total item size M varies from 2000 to 6000 with the selected item size N = 1000, and when N varies from 400 to 2000 with M = 6000. The results are averaged over 10 independent trials and shown in Figure 1. In both cases, FaX runs significantly faster than ApX, which is the state-of-the-art fast greedy MAP inference algorithm for DPP. FaX is about 100 times faster than Lazy, while ApX is about 3 times faster, as reported in [20]. The accuracy of FaX is the same as that of Lazy, because both are exact implementations of the greedy algorithm. ApX loses about 0.2% accuracy.

6.2 Short Sequence Recommendation

In this subsection, we evaluate the performance of Algorithm 1 when recommending short sequences of items to users on the following two public datasets.

Netflix Prize²: This dataset contains users' ratings of movies. We keep ratings of four or higher and binarize them. We only keep users who have watched at least 10 movies and movies that are watched by at least 100 users. This results in 436,674 users and 11,551 movies with 56,406,404 ratings.

Million Song Dataset [6]: This dataset contains users' play counts of songs. We binarize play counts of more than once. We only keep users who listen to at least 20 songs and songs that are listened to by at least 100 users. This results in 213,949 users and 20,716 songs with 8,500,651 play counts.

For each dataset, we construct the test set by randomly selecting one interacted item for each user, and use the rest of the data for training. We adopt an item-based recommendation algorithm [24] on the training set to learn an item-item PSD similarity matrix S. For each user, the profile set P_u consists of the interacted items in the training set, and the candidate set C_u is formed by the union of the 50 most similar items of each item in P_u. The median of #C_u is 735 and 811 on Netflix Prize and Million Song Dataset, respectively. 
For any item in C_u, the relevance score is the aggregated similarity to all items in P_u [22]. With S, C_u, and the score vector r_u, the algorithms recommend N = 20 items.

Performance metrics of recommendation include mean reciprocal rank (MRR) [46], intra-list average distance (ILAD) [51], and intra-list minimal distance (ILMD). They are defined as

MRR = mean_{u ∈ U} p_u⁻¹,  ILAD = mean_{u ∈ U} mean_{i,j ∈ R_u, i ≠ j} (1 − S_{ij}),  ILMD = mean_{u ∈ U} min_{i,j ∈ R_u, i ≠ j} (1 − S_{ij}),

where U is the set of all users, and p_u is the smallest rank position of the items in the test set. MRR measures relevance, while ILAD and ILMD measure diversity. We also compare the metric popularity-weighted recall (PW Recall) [42] in the supplementary material. For all these metrics, higher values are preferred.

Our DPP-based algorithm (DPP) is compared with maximal marginal relevance (MMR) [11], max-sum diversification (MSD) [8], entropy regularizer (Entropy) [40], and the coverage-based algorithm (Cover) [39]. They all involve a tunable parameter to adjust the trade-off between relevance and diversity. For Cover, the parameter is γ ∈ [0, 1], which defines the saturation function f(t) = t^γ.

In the first experiment, we test the impact of the trade-off parameter θ ∈ [0, 1] of DPP on the Netflix Prize dataset. The results are shown in Figure 2. As θ increases, MRR improves at first, achieves the best

²Netflix Prize website, http://www.netflixprize.com/

Figure 2: Impact of trade-off parameter θ on Netflix Prize dataset.

Figure 3: Comparison of trade-off performance between relevance and diversity under different choices of trade-off parameters on Netflix Prize (left) and Million Song Dataset (right). The standard
The standard\nerror of MRR is about 0.0003 and 0.0006 on Net\ufb02ix Prize and Million Song Dataset, respectively.\n\nTable 1: Comparison of average / upper 99% running time (in milliseconds).\n\nDataset\n\nMMR\n\nMSD\n\nEntropy\n\nCover\n\nDPP\n\nNet\ufb02ix Prize\n\nMillion Song Dataset\n\n0.23 / 0.50\n0.23 / 0.41\n\n0.21 / 0.41\n0.22 / 0.34\n\n200.74 / 2883.82\n26.45 / 168.12\n\n120.19 / 1332.21\n23.76 / 173.64\n\n0.73 / 1.75\n0.76 / 1.46\n\nvalue when \u03b8 \u2248 0.7, and then decreases a little bit. ILAD and ILMD are monotonously decreasing as\n\u03b8 increases. When \u03b8 = 1, DPP returns items with the highest relevance scores. Therefore, taking\nmoderate amount of diversity into consideration, better performance can be achieved.\nIn the second experiment, by varying the trade-off parameters, the trade-off performance between\nrelevance and diversity are compared in Figure 3. The parameters are chosen such that different\nalgorithms have approximately the same range of MRR. As can be seen, Cover performs the best\non Net\ufb02ix Prize but becomes the worst on Million Song Dataset. Among the other algorithms, DPP\nenjoys the best relevance-diversity trade-off performance. Their average and upper 99% running time\nare compared in Table 1. MMR, MSD, and DPP run signi\ufb01cantly faster than Entropy and Cover.\nSince DPP runs in less than 2ms with probability 99%, it can be used in real-time scenarios.\nWe conducted an online A/B test in a movie recommender system for four weeks. For each user,\ncandidate movies with relevance scores were generated by an online scoring model. An of\ufb02ine matrix\nfactorization algorithm [26] was trained daily to generate movie representations which were used to\ncalculate similarities. For the control group, 5% users were randomly selected and presented with\nN = 8 movies with the highest relevance scores. 
For the treatment group, another 5% of users, selected at random, were presented with N movies generated by DPP with a fine-tuned trade-off parameter. Two online metrics, the improvements in number of titles watched and in watch minutes, are reported in Table 2. The results are also compared with another 5% of randomly selected users served by MMR. As can be seen, DPP performed better than both the system without a diversification algorithm and the one with MMR.

6.3 Long Sequence Recommendation

In this subsection, we evaluate the performance of Algorithm 2 when recommending long sequences of items to users. For each dataset, we construct the test set by randomly selecting 5 interacted items for each user, and use the rest for training. Each long sequence contains N = 100 items. We choose window size w = 10 so that every w successive items in the sequence are diverse. The value of w usually depends on the specific application. Generally speaking, if each time a user can only see a small portion of the sequence, w can be of the order of the portion size. Other settings are the same as in the previous subsection.

Table 2: Performance improvement of MMR and DPP over Control in an online A/B test.

Algorithm | Improvement of No. Titles Watched | Improvement of Watch Minutes
MMR       | 0.84%                             | 0.86%
DPP       | 1.33%                             | 1.52%

Figure 4: Comparison of trade-off performance between relevance and diversity under different choices of trade-off parameters on Netflix Prize (left) and Million Song Dataset (right). The standard error of nDCG is about 0.00025 and 0.0005 on Netflix Prize and Million Song Dataset, respectively.

Performance metrics include normalized discounted cumulative gain (nDCG) [23], intra-list average local distance (ILALD), and intra-list minimal local distance (ILMLD). 
The latter two are defined as

ILALD = mean_{u ∈ U} mean_{i,j ∈ R_u, i ≠ j, d_ij ≤ w} (1 − S_{ij}),  ILMLD = mean_{u ∈ U} min_{i,j ∈ R_u, i ≠ j, d_ij ≤ w} (1 − S_{ij}),

where d_ij is the position distance of items i and j in R_u. Similarly, higher metrics are desirable. To make a fair comparison, we modify the diversity terms in MMR and MSD so that they only consider the w − 1 most recently added items. Entropy and Cover are not tested because they are not suitable for this scenario. By varying the trade-off parameters, the trade-off performance between relevance and diversity of MMR, MSD, and DPP is compared in Figure 4. The parameters are chosen such that different algorithms have approximately the same range of nDCG. As can be seen, DPP performs the best with respect to the relevance-diversity trade-off. We also compare the metric PW Recall in the supplementary material.

7 Conclusion and Future Work

In this paper, we presented a fast and exact implementation of the greedy MAP inference for DPP. The time complexity of our algorithm is O(M³), which is significantly lower than that of state-of-the-art exact implementations. Our proposed acceleration technique can be applied to other problems with the log-determinant of PSD matrices in the objective function, such as the entropy regularizer [40]. We also adapted our fast algorithm to scenarios where the diversity is only required within a sliding window. Experiments showed that our algorithm runs significantly faster than state-of-the-art algorithms, and that our proposed approach provides a better relevance-diversity trade-off on the recommendation task. Potential future research directions include learning the optimal trade-off parameter automatically and the theoretical analysis of Algorithm 2.

Acknowledgments

We thank Bangsheng Tang, Yin Zheng, Shenglong Lv, and the reviewers for many helpful discussions and suggestions.

References

[1] G. 
Adomavicius and Y. Kwon. Maximizing aggregate recommendation diversity: A graph-theoretic approach. In Proceedings of DiveRS 2011, pages 3–10, 2011.

[2] R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong. Diversifying search results. In Proceedings of WSDM 2009, pages 5–14. ACM, 2009.

[3] A. Ahmed, C. H. Teo, S. V. N. Vishwanathan, and A. Smola. Fair and balanced: Learning to present news stories. In Proceedings of WSDM 2012, pages 333–342. ACM, 2012.

[4] A. Ashkan, B. Kveton, S. Berkovsky, and Z. Wen. Optimal greedy diversity for recommendation. In Proceedings of IJCAI 2015, pages 1742–1748, 2015.

[5] T. Aytekin and M. Ö. Karakaya. Clustering-based diversity improvement in top-N recommendation. Journal of Intelligent Information Systems, 42(1):1–18, 2014.

[6] T. Bertin-Mahieux, D. P. W. Ellis, B. Whitman, and P. Lamere. The million song dataset. In Proceedings of ISMIR 2011, page 10, 2011.

[7] R. Boim, T. Milo, and S. Novgorodov. Diversification and refinement in collaborative filtering recommender. In Proceedings of CIKM 2011, pages 739–744. ACM, 2011.

[8] A. Borodin, H. C. Lee, and Y. Ye. Max-sum diversification, monotone submodular functions and dynamic updates. In Proceedings of SIGMOD-SIGACT-SIGAI, pages 155–166. ACM, 2012.

[9] K. Bradley and B. Smyth. Improving recommendation diversity. In Proceedings of AICS 2001, pages 85–94, 2001.

[10] N. Buchbinder, M. Feldman, J. Seffi, and R. Schwartz. A tight linear time (1/2)-approximation for unconstrained submodular maximization. SIAM Journal on Computing, 44(5):1384–1402, 2015.

[11] J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of SIGIR 1998, pages 335–336.
ACM, 1998.

[12] P. Chandar and B. Carterette. Preference based evaluation measures for novelty and diversity. In Proceedings of SIGIR 2013, pages 413–422. ACM, 2013.

[13] A. Çivril and M. Magdon-Ismail. On selecting a maximum volume sub-matrix of a matrix and related problems. Theoretical Computer Science, 410(47-49):4801–4811, 2009.

[14] M. Gartrell, U. Paquet, and N. Koenigstein. Bayesian low-rank determinantal point processes. In Proceedings of RecSys 2016, pages 349–356. ACM, 2016.

[15] M. Gartrell, U. Paquet, and N. Koenigstein. Low-rank factorization of determinantal point processes. In Proceedings of AAAI 2017, pages 1912–1918, 2017.

[16] J. Gillenwater. Approximate inference for determinantal point processes. University of Pennsylvania, 2014.

[17] J. Gillenwater, A. Kulesza, and B. Taskar. Near-optimal MAP inference for determinantal point processes. In Proceedings of NIPS 2012, pages 2735–2743, 2012.

[18] J. A. Gillenwater, A. Kulesza, E. Fox, and B. Taskar. Expectation-maximization for learning determinantal point processes. In Proceedings of NIPS 2014, pages 3149–3157, 2014.

[19] B. Gong, W. L. Chao, K. Grauman, and F. Sha. Diverse sequential subset selection for supervised video summarization. In Proceedings of NIPS 2014, pages 2069–2077, 2014.

[20] I. Han, P. Kambadur, K. Park, and J. Shin. Faster greedy MAP inference for determinantal point processes. In Proceedings of ICML 2017, pages 1384–1393, 2017.

[21] J. He, H. Tong, Q. Mei, and B. Szymanski. GenDeR: A generic diversified ranking algorithm. In Proceedings of NIPS 2012, pages 1142–1150, 2012.

[22] N. Hurley and M. Zhang. Novelty and diversity in top-N recommendation – analysis and evaluation. ACM Transactions on Internet Technology, 10(4):14, 2011.

[23] K. Järvelin and J. Kekäläinen.
IR evaluation methods for retrieving highly relevant documents. In Proceedings of SIGIR 2000, pages 41–48. ACM, 2000.

[24] G. Karypis. Evaluation of item-based top-N recommendation algorithms. In Proceedings of CIKM 2001, pages 247–254. ACM, 2001.

[25] C. W. Ko, J. Lee, and M. Queyranne. An exact algorithm for maximum entropy sampling. Operations Research, 43(4):684–691, 1995.

[26] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8), 2009.

[27] A. Kulesza and B. Taskar. Structured determinantal point processes. In Proceedings of NIPS 2010, pages 1171–1179, 2010.

[28] A. Kulesza and B. Taskar. k-DPPs: Fixed-size determinantal point processes. In Proceedings of ICML 2011, pages 1193–1200, 2011.

[29] A. Kulesza and B. Taskar. Learning determinantal point processes. In Proceedings of UAI 2011, pages 419–427. AUAI Press, 2011.

[30] A. Kulesza and B. Taskar. Determinantal point processes for machine learning. Foundations and Trends® in Machine Learning, 5(2–3):123–286, 2012.

[31] S. C. Lee, S. W. Kim, S. Park, and D. K. Chae. A single-step approach to recommendation diversification. In Proceedings of WWW 2017 Companion, pages 809–810. ACM, 2017.

[32] C. Li, S. Sra, and S. Jegelka. Gaussian quadrature for matrix inverse forms with applications. In Proceedings of ICML 2016, pages 1766–1775, 2016.

[33] O. Macchi. The coincidence approach to stochastic point processes. Advances in Applied Probability, 7(1):83–122, 1975.

[34] Z. Mariet and S. Sra. Fixed-point algorithms for learning determinantal point processes. In Proceedings of ICML 2015, pages 2389–2397, 2015.

[35] M. L. Mehta and M. Gaudin. On the density of eigenvalues of a random matrix. Nuclear Physics, 18:420–427, 1960.

[36] M. Minoux.
Accelerated greedy algorithms for maximizing submodular set functions. In Optimization Techniques, pages 234–243. Springer, 1978.

[37] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions – I. Mathematical Programming, 14(1):265–294, 1978.

[38] K. Niemann and M. Wolpers. A new collaborative filtering approach for increasing the aggregate diversity of recommender systems. In Proceedings of SIGKDD 2013, pages 955–963. ACM, 2013.

[39] S. A. Puthiya Parambath, N. Usunier, and Y. Grandvalet. A coverage-based approach to recommendation diversity on similarity graph. In Proceedings of RecSys 2016, pages 15–22. ACM, 2016.

[40] L. Qin and X. Zhu. Promoting diversity in recommendation by entropy regularizer. In Proceedings of IJCAI 2013, pages 2698–2704, 2013.

[41] J. R. Schott. Matrix algorithms, volume 1: Basic decompositions. Journal of the American Statistical Association, 94(448):1388–1388, 1999.

[42] H. Steck. Item popularity and recommendation accuracy. In Proceedings of RecSys 2011, pages 125–132. ACM, 2011.

[43] R. Su, L. A. Yin, K. Chen, and Y. Yu. Set-oriented personalized ranking for diversified top-N recommendation. In Proceedings of RecSys 2013, pages 415–418. ACM, 2013.

[44] C. H. Teo, H. Nassif, D. Hill, S. Srinivasan, M. Goodman, V. Mohan, and S. V. N. Vishwanathan. Adaptive, personalized diversity for visual discovery. In Proceedings of RecSys 2016, pages 35–38. ACM, 2016.

[45] S. Vargas, L. Baltrunas, A. Karatzoglou, and P. Castells. Coverage, redundancy and size-awareness in genre diversity for recommender systems. In Proceedings of RecSys 2014, pages 209–216. ACM, 2014.

[46] E. M. Voorhees. The TREC-8 question answering track report. In Proceedings of TREC 1999, pages 77–82, 1999.

[47] L. Wu, Q. Liu, E. Chen, N. J. Yuan, G. Guo, and X. Xie.
Relevance meets coverage: A unified framework to generate diversified recommendations. ACM Transactions on Intelligent Systems and Technology, 7(3):39, 2016.

[48] L. Xia, J. Xu, Y. Lan, J. Guo, W. Zeng, and X. Cheng. Adapting Markov decision process for search result diversification. In Proceedings of SIGIR 2017, pages 535–544. ACM, 2017.

[49] J. G. Yao, F. Fan, W. X. Zhao, X. Wan, E. Y. Chang, and J. Xiao. Tweet timeline generation with determinantal point processes. In Proceedings of AAAI 2016, pages 3080–3086, 2016.

[50] C. Yu, L. Lakshmanan, and S. Amer-Yahia. It takes variety to make a world: Diversification in recommender systems. In Proceedings of EDBT 2009, pages 368–378. ACM, 2009.

[51] M. Zhang and N. Hurley. Avoiding monotony: Improving the diversity of recommendation lists. In Proceedings of RecSys 2008, pages 123–130. ACM, 2008.

[52] C. N. Ziegler, S. M. McNee, J. A. Konstan, and G. Lausen. Improving recommendation lists through topic diversification. In Proceedings of WWW 2005, pages 22–32. ACM, 2005.