{"title": "Kronecker Determinantal Point Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 2694, "page_last": 2702, "abstract": "Determinantal Point Processes (DPPs) are probabilistic models over all subsets of a ground set of N items. They have recently gained prominence in several applications that rely on diverse subsets. However, their applicability to large problems is still limited due to the O(N^3) complexity of core tasks such as sampling and learning. We enable efficient sampling and learning for DPPs by introducing KronDPP, a DPP model whose kernel matrix decomposes as a tensor product of multiple smaller kernel matrices. This decomposition immediately enables fast exact sampling. But contrary to what one may expect, leveraging the Kronecker product structure for speeding up DPP learning turns out to be more difficult. We overcome this challenge, and derive batch and stochastic optimization algorithms for efficiently learning the parameters of a KronDPP.", "full_text": "Kronecker Determinantal Point Processes\n\nZelda Mariet\nMassachusetts Institute of Technology\nCambridge, MA 02139\nzelda@csail.mit.edu\n\nSuvrit Sra\nMassachusetts Institute of Technology\nCambridge, MA 02139\nsuvrit@mit.edu\n\nAbstract\n\nDeterminantal Point Processes (DPPs) are probabilistic models over all subsets\nof a ground set of N items. They have recently gained prominence in several\napplications that rely on \u201cdiverse\u201d subsets. However, their applicability to large\nproblems is still limited due to the O(N^3) complexity of core tasks such as sampling\nand learning. We enable efficient sampling and learning for DPPs by introducing\nKRONDPP, a DPP model whose kernel matrix decomposes as a tensor product of\nmultiple smaller kernel matrices. This decomposition immediately enables fast\nexact sampling. 
But contrary to what one may expect, leveraging the Kronecker\nproduct structure for speeding up DPP learning turns out to be more difficult. We\novercome this challenge, and derive batch and stochastic optimization algorithms\nfor efficiently learning the parameters of a KRONDPP.\n\n1 Introduction\nDeterminantal Point Processes (DPPs) are discrete probability models over the subsets of a ground\nset of N items. They provide an elegant model to assign probabilities to an exponentially large\nsample space, while permitting tractable (polynomial time) sampling and marginalization. They are often\nused to provide models that balance \u201cdiversity\u201d and quality, characteristics valuable to numerous\nproblems in machine learning and related areas [17].\nThe antecedents of DPPs lie in statistical mechanics [24], but since the seminal work of [15] they\nhave made inroads into machine learning. By now they have been applied to a variety of problems\nsuch as document and video summarization [6, 21], sensor placement [14], recommender\nsystems [31], and object retrieval [2]. More recently, they have been used to compress fully-connected\nlayers in neural networks [26] and to provide optimal sampling procedures for the Nystr\u00f6m\nmethod [20]. The more general study of DPP properties has also garnered a significant amount\nof interest, see e.g., [1, 5, 7, 12, 16\u201318, 23].\nHowever, despite their elegance and tractability, widespread adoption of DPPs is impeded by the\nO(N^3) cost of basic tasks such as (exact) sampling [12, 17] and learning [10, 12, 17, 25]. This\ncost has motivated a string of recent works on approximate sampling methods such as MCMC\nsamplers [13, 20] or core-set based samplers [19]. The task of learning a DPP from data has received\nless attention; the methods of [10, 25] cost O(N^3) per iteration, which is clearly unacceptable for\nrealistic settings. 
This burden is partially ameliorated in [9], who restrict to learning low-rank DPPs,\nthough at the expense of being unable to sample subsets larger than the chosen rank.\nThese considerations motivate us to introduce KRONDPP, a DPP model that uses Kronecker (tensor)\nproduct kernels. As a result, KRONDPP enables us to learn large sized DPP kernels, while also\npermitting efficient (exact and approximate) sampling. The use of Kronecker products to scale\nmatrix models is a popular and effective idea in several machine-learning settings [8, 27, 28, 30].\nBut as we will see, its efficient execution for DPPs turns out to be surprisingly challenging.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nTo make our discussion more concrete, we recall some basic facts now. Suppose we have a ground\nset of N items \ud835\udcb4 = {1, ..., N}. A discrete DPP over \ud835\udcb4 is a probability measure P on 2^\ud835\udcb4\nparametrized by a positive definite matrix K (the marginal kernel) such that 0 \u2aaf K \u2aaf I, so that for\nany Y \u2208 2^\ud835\udcb4 drawn from P, the measure satisfies\n\n\u2200A \u2286 \ud835\udcb4, P(A \u2286 Y) = det(K_A), (1)\n\nwhere K_A is the submatrix of K indexed by elements in A (i.e., K_A = [K_{ij}]_{i,j\u2208A}). If a DPP\nwith marginal kernel K assigns nonzero probability to the empty set, the DPP can alternatively be\nparametrized by a positive definite matrix L (the DPP kernel) so that\n\nP(Y) \u221d det(L_Y) \u21d2 P(Y) = det(L_Y) / det(L + I). (2)\n\nA brief manipulation (see e.g., [17, Eq. 15]) shows that when the inverse exists, L = K(I - K)^{-1}.\nThe determinants, such as in the normalization constant in (2), make operations over DPPs typically\ncost O(N^3), which is a key impediment to their scalability.\nTherefore, if we consider a class of DPP kernels whose structure makes it easy to compute determinants,\nwe should be able to scale up DPPs. 
An alternative approach towards scalability is to restrict\nthe size of the subsets, as done in k-DPP [16] or when using rank-k DPP kernels [9] (where k \u226a N).\nWithout further assumptions, both approaches still require O(N^3) preprocessing for exact sampling;\nanother caveat is that they limit the DPP model by assigning zero probabilities to sets of cardinality\ngreater than k.\nIn contrast, KRONDPP uses a kernel matrix of the form L = L_1 \u2297 ... \u2297 L_m, where each sub-kernel\nL_i is a smaller positive definite matrix. This decomposition has two key advantages: (i) it\nsignificantly lowers the number of parameters required to specify the DPP from N^2 to O(N^{2/m})\n(assuming the sub-kernels are roughly the same size); and (ii) it enables fast sampling and learning.\nFor ease of exposition, we describe specific details of KRONDPP for m = 2; as will become clear\nfrom the analysis, typically the special cases m = 2 and m = 3 should suffice to obtain low-complexity\nsampling and learning algorithms.\n\nContributions. Our main contribution is the KRONDPP model along with efficient algorithms for\nsampling from it and learning a Kronecker factored kernel. Specifically, inspired by the algorithm\nof [25], we develop KRK-PICARD (Kronecker-Kernel Picard), a block-coordinate ascent procedure\nthat generates a sequence of Kronecker factored estimates of the DPP kernel while ensuring monotonic\nprogress on its (difficult, nonconvex) objective function. More importantly, we show how\nto implement KRK-PICARD to run in O(N^2) time when implemented as a batch method, and in\nO(N^{3/2}) time and O(N) space, when implemented as a stochastic method. As alluded to above,\nunlike many other uses of Kronecker models, KRONDPP does not admit trivial scaling up, largely\ndue to extensive dependence of DPPs on arbitrary submatrices of the DPP kernel. 
An interesting\ntheoretical nugget that arises from our analysis is the combinatorial problem that we call subset clustering,\na problem whose (even approximate) solution can lead to further speedups of our algorithms.\n\n2 Preliminaries\n\nWe begin by recalling basic properties of Kronecker products needed in our analysis; we omit proofs\nof these well-known results for brevity. The Kronecker (tensor) product of two matrices A \u2208 R^{p\u00d7q} and\nB \u2208 R^{r\u00d7s} is defined as the pr \u00d7 qs block matrix A \u2297 B = [a_{ij} B]_{i,j=1}^{p,q}.\nWe denote the block a_{ij} B in A \u2297 B by (A \u2297 B)_{(ij)} for any valid pair (i, j), and extend the notation\nto non-Kronecker product matrices to indicate the submatrix of size r \u00d7 s at position (i, j).\nProposition 2.1. Let A, B, C, D be matrices of sizes so that AC and BD are well-defined. Then,\n(i) If A, B \u2ab0 0, then A \u2297 B \u2ab0 0;\n(ii) If A and B are invertible then so is A \u2297 B, with (A \u2297 B)^{-1} = A^{-1} \u2297 B^{-1};\n(iii) (A \u2297 B)(C \u2297 D) = (AC) \u2297 (BD).\nAn important consequence of Prop. 2.1(iii) is the following corollary.\nCorollary 2.2. Let A = P_A D_A P_A^\u22a4 and B = P_B D_B P_B^\u22a4 be the eigenvector decompositions of A\nand B. Then, A \u2297 B diagonalizes as (P_A \u2297 P_B)(D_A \u2297 D_B)(P_A \u2297 P_B)^\u22a4.\n\nWe will also need the notion of partial trace operators, which are perhaps less well-known:\nDefinition 2.3. Let A \u2208 R^{N_1 N_2 \u00d7 N_1 N_2}. The partial traces Tr_1(A) and Tr_2(A) are defined as\nfollows:\n\nTr_1(A) := [Tr(A_{(ij)})]_{1\u2264i,j\u2264N_1} \u2208 R^{N_1\u00d7N_1}, Tr_2(A) := \u2211_{i=1}^{N_1} A_{(ii)} \u2208 R^{N_2\u00d7N_2}.\n\nThe action of partial traces is easy to visualize: indeed, Tr_1(A \u2297 B) = Tr(B)A and Tr_2(A \u2297 B) =\nTr(A)B. For us, the most important property of partial trace operators is their positivity.\nProposition 2.4. 
Tr_1 and Tr_2 are positive operators, i.e., for A \u227b 0, Tr_1(A) \u227b 0 and Tr_2(A) \u227b 0.\n\nProof. Please refer to [4, Chap. 4].\n\n3 Learning the kernel matrix for KRONDPP\n\nIn this section, we consider the key difficult task for KRONDPPs: learning a Kronecker product\nkernel matrix from n observed subsets Y_1, ..., Y_n. Using the definition (2) of P(Y_i), maximum-likelihood\nlearning of a DPP with kernel L results in the optimization problem:\n\narg max_{L\u227b0} \u03d5(L), \u03d5(L) = (1/n) \u2211_{i=1}^{n} (log det(L_{Y_i}) - log det(L + I)). (3)\n\nThis problem is nonconvex and conjectured to be NP-hard [15, Conjecture 4.1]. Moreover the\nconstraint L \u227b 0 is nontrivial to handle. Writing U_i as the indicator matrix for Y_i of size N \u00d7 |Y_i|\nso that L_{Y_i} = U_i^\u22a4 L U_i, the gradient of \u03d5 is easily seen to be\n\n\u2206 := \u2207\u03d5(L) = (1/n) \u2211_{i=1}^{n} U_i L_{Y_i}^{-1} U_i^\u22a4 - (L + I)^{-1}. (4)\n\nIn [25], the authors derived an iterative method (\u201cthe Picard iteration\u201d) for computing an L that\nsolves \u2206 = 0 by running the simple iteration\n\nL \u2190 L + L\u2206L. (5)\n\nMoreover, iteration (5) is guaranteed to monotonically increase the log-likelihood \u03d5 [25]. But these\nbenefits accrue at a cost of O(N^3) per iteration, and furthermore a direct application of (5) cannot\nguarantee the Kronecker structure required by KRONDPP.\n\n3.1 Optimization algorithm\n\nOur aim is to obtain an efficient algorithm to (locally) optimize (3). Beyond its nonconvexity, the\nKronecker structure L = L_1 \u2297 L_2 imposes another constraint. 
As in [25] we first rewrite \u03d5 as a\nfunction of S = L^{-1}, and re-arrange terms to write it as\n\n\u03d5(S) = f(S) + g(S), with f(S) = log det(S) and g(S) = (1/n) \u2211_{i=1}^{n} log det(U_i^\u22a4 S^{-1} U_i) - log det(I + S). (6)\n\nIt is easy to see that f is concave, while a short argument shows that g is convex [25]. An appeal to\nthe convex-concave procedure [29] then shows that updating S by solving \u2207f(S^{(k+1)}) + \u2207g(S^{(k)}) =\n0, which is what (5) does [25, Thm. 2.2], is guaranteed to monotonically increase \u03d5.\nBut for KRONDPP this idea does not apply so easily: due to the constraint L = L_1 \u2297 L_2 the function\n\ng_\u2297 : (S_1, S_2) \u21a6 (1/n) \u2211_{i=1}^{n} log det(U_i^\u22a4 (S_1 \u2297 S_2)^{-1} U_i) - log det(I + S_1 \u2297 S_2)\n\nfails to be convex, precluding an easy generalization. Nevertheless, for fixed S_1 or S_2 the functions\n\nf_1 : S_1 \u21a6 f(S_1 \u2297 S_2), g_1 : S_1 \u21a6 g(S_1 \u2297 S_2), f_2 : S_2 \u21a6 f(S_1 \u2297 S_2), g_2 : S_2 \u21a6 g(S_1 \u2297 S_2)\n\nare once again concave or convex. Indeed, the map \u2297 : S_1 \u21a6 S_1 \u2297 S_2 is linear and f is concave,\nso f_1 = f \u25e6 \u2297 is also concave; similarly, f_2 is seen to be concave and g_1 and g_2 are convex. Hence,\nby generalizing the arguments of [29, Thm. 2] to our \u201cblock-coordinate\u201d setting, updating via\n\n\u2207f_i(S_i^{(k+1)}) = -\u2207g_i(S_i^{(k)}), for i = 1, 2, (7)\n\nshould increase the log-likelihood \u03d5 at each iteration. We prove below that this is indeed the case,\nand that updating as per (7) ensures positive definiteness of the iterates as well as monotonic ascent.\n\n3.1.1 Positive definite iterates and ascent\n\nIn order to show the positive definiteness of the solutions to (7), we first derive their closed form.\nProposition 3.1 (Positive definite iterates). 
For S_1 \u227b 0, S_2 \u227b 0, the solutions to (7) are given by\nthe following expressions:\n\n\u2207f_1(X) = -\u2207g_1(S_1) \u21d4 X^{-1} = Tr_1((I \u2297 S_2)(L + L\u2206L)) / N_2,\n\u2207f_2(X) = -\u2207g_2(S_2) \u21d4 X^{-1} = Tr_2((S_1 \u2297 I)(L + L\u2206L)) / N_1.\n\nMoreover, these solutions are positive definite.\n\nProof. The details are somewhat technical, and are hence given in Appendix A. We know that\nL \u227b 0 \u21d2 L + L\u2206L \u2ab0 0, because L - L(I + L)^{-1}L \u227b 0. Since the partial trace operators are\npositive (Prop. 2.4), it follows that the solutions to (7) are also positive definite.\n\nWe are now ready to establish that these updates ensure monotonic ascent in the log-likelihood.\nTheorem 3.2 (Ascent). Starting with L_1^{(0)} \u227b 0, L_2^{(0)} \u227b 0, updating according to (7) generates\npositive definite iterates L_1^{(k)} and L_2^{(k)}, and the sequence {\u03d5(L_1^{(k)} \u2297 L_2^{(k)})}_{k\u22650} is non-decreasing.\n\nProof. Updating according to (7) generates positive definite matrices S_i, and hence positive definite\nsubkernels L_i = S_i^{-1}. Moreover, due to the convexity of g_1 and concavity of f_1, for matrices A, B \u227b 0\n\nf_1(B) \u2264 f_1(A) + \u2207f_1(A)^\u22a4 (B - A),\ng_1(A) \u2265 g_1(B) + \u2207g_1(B)^\u22a4 (A - B).\n\nHence, f_1(A) + g_1(A) \u2265 f_1(B) + g_1(B) + (\u2207f_1(A) + \u2207g_1(B))^\u22a4 (A - B).\nThus, if S_1^{(k)}, S_1^{(k+1)} verify (7), by setting A = S_1^{(k+1)} and B = S_1^{(k)} we have\n\n\u03d5(L_1^{(k+1)} \u2297 L_2^{(k)}) = f_1(S_1^{(k+1)}) + g_1(S_1^{(k+1)}) \u2265 f_1(S_1^{(k)}) + g_1(S_1^{(k)}) = \u03d5(L_1^{(k)} \u2297 L_2^{(k)}).\n\nThe same reasoning holds for L_2, which proves the theorem.\n\nAs Tr_1((I \u2297 S_2)L) = N_2 L_1 (and similarly for L_2), updating as in (7) is equivalent to updating\n\nL_1 \u2190 L_1 + Tr_1((I \u2297 L_2^{-1})(L\u2206L)) / N_2, L_2 \u2190 L_2 + Tr_2((L_1^{-1} \u2297 I)(L\u2206L)) / N_1.\n\nGeneralization. We can generalize the updates to take an additional step-size parameter a:\n\nL_1 \u2190 L_1 + a Tr_1((I \u2297 L_2^{-1})(L\u2206L)) / N_2, L_2 \u2190 L_2 + a Tr_2((L_1^{-1} \u2297 I)(L\u2206L)) / N_1.\n\nExperimentally, a > 1 (as long as the updates remain positive definite) can provide faster convergence,\nalthough the monotonicity of the log-likelihood is no longer guaranteed. We found experimentally\nthat the range of admissible a is larger than for Picard, but decreases as N grows larger.\nThe arguments above easily generalize to the multiblock case. Thus, when learning L = L_1 \u2297 ... \u2297\nL_m, by writing E_{ij} the matrix with a 1 in position (i, j) and zeros elsewhere, we update L_k as\n\n(L_k)_{ij} \u2190 (L_k)_{ij} + N_k/(N_1 ... N_m) Tr[(L_1^{-1} \u2297 ... \u2297 L_{k-1}^{-1} \u2297 E_{ij} \u2297 L_{k+1}^{-1} \u2297 ... \u2297 L_m^{-1})(L\u2206L)].\n\nFrom the above updates it is not transparent whether the Kronecker product saves us any computation.\nIn particular, it is not clear whether the updates can be implemented to run faster than O(N^3).\nWe show in the next section how to implement these updates efficiently.\n\n3.1.2 Algorithm and complexity analysis\n\nFrom Theorem 3.2, we obtain Algorithm 1 (which is different from the Picard iteration in [25],\nbecause it operates alternatingly on each subkernel). 
It is important to note that a further speedup\nto Algorithm 1 can be obtained by performing stochastic updates, i.e., instead of computing the\nfull gradient of the log-likelihood, we perform our updates using only one subset Y_i (or a small minibatch)\nat each step rather than iterating over the entire training set; this uses the stochastic gradient\n\u2206 = U_i L_{Y_i}^{-1} U_i^\u22a4 - (I + L)^{-1}. The crucial strength of Algorithm 1 lies in the following result:\n\nAlgorithm 1 KRK-PICARD iteration\n\nInput: Matrices L_1, L_2, training set T, parameter a.\nfor i = 1 to maxIter do\n  L_1 \u2190 L_1 + a Tr_1((I \u2297 L_2^{-1})(L\u2206L)) / N_2 // or update stochastically\n  L_2 \u2190 L_2 + a Tr_2((L_1^{-1} \u2297 I)(L\u2206L)) / N_1 // or update stochastically\nend for\nreturn (L_1, L_2)\n\nTheorem 3.3 (Complexity). For N_1 \u2248 N_2 \u2248 \u221aN, the updates in Algorithm 1 can be computed in\nO(n\u03ba^3 + N^2) time and O(N^2) space, where \u03ba is the size of the largest training subset. Furthermore,\nstochastic updates can be computed in O(N\u03ba^2 + N^{3/2}) time and O(N + \u03ba^2) space.\nIndeed, by leveraging the properties of the Kronecker product, the updates can be obtained without\ncomputing L\u2206L. This result is non-trivial: the components of \u2206, (1/n) \u2211_i U_i L_{Y_i}^{-1} U_i^\u22a4 and (I + L)^{-1},\nmust be considered separately for computational efficiency. The proof is provided in App. B. However,\nit seems that considering more than 2 subkernels does not lead to further speed-ups.\nThis is a marked improvement over [25], which runs in O(N^2) space and O(n\u03ba^3 + N^3) time (non-stochastic)\nor O(N^3) time (stochastic); Algorithm 1 also provides faster stochastic updates than [9] (see footnote 1).\nHowever, one may wonder if by learning the sub-kernels by alternating updates the log-likelihood\nconverges to a sub-optimal limit. The next section discusses how to jointly update L_1 and L_2.\n\nFootnote 1: For example, computing matrix B in [9] (defined after Eq. 7), which is a necessary step for (stochastic)\ngradient ascent, costs O(N^2) due to matrix multiplications.\n\n3.2 Joint updates\nWe also analyzed the possibility of updating L_1 and L_2 jointly: we update L \u2190 L + L\u2206L and then\nrecover the Kronecker structure of the kernel by defining the updates L_1\u2032 and L_2\u2032 such that:\n\n(L_1\u2032, L_2\u2032) minimizes \u2225L + L\u2206L - L_1\u2032 \u2297 L_2\u2032\u2225_F^2 subject to L_1\u2032 \u227b 0, L_2\u2032 \u227b 0, \u2225L_1\u2032\u2225 = \u2225L_2\u2032\u2225. (8)\n\nWe show in appendix C that such solutions exist and can be computed from the first singular value\nand vectors of the matrix R = [vec((L^{-1} + \u2206)_{(ij)})^\u22a4]_{i,j=1}^{N_1}. Note however that in this case, there is\nno guaranteed increase in log-likelihood. The pseudocode for the related algorithm (JOINT-PICARD)\nis given in appendix C.1. An analysis similar to the proof of Thm. 3.3 shows that the updates can be\nobtained in O(n\u03ba^3 + max(N_1, N_2)^4) time.\n\n3.3 Memory-time trade-off\n\nAlthough KRONDPPS have tractable learning algorithms, the memory requirements remain high for\nnon-stochastic updates, as the matrix \u0398 = (1/n) \u2211_i U_i L_{Y_i}^{-1} U_i^\u22a4 needs to be stored, requiring O(N^2)\nmemory. However, if the training set can be subdivided such that\n\n{Y_1, ..., Y_n} = \u222a_{k=1}^{m} S_k s.t. \u2200k, |\u222a_{Y\u2208S_k} Y| < z, (9)\n\n\u0398 can be decomposed as (1/n) \u2211_{k=1}^{m} \u0398_k with \u0398_k = \u2211_{Y_i\u2208S_k} U_i L_{Y_i}^{-1} U_i^\u22a4. Due to the bound in Eq. 9,\neach \u0398_k will be sparse, with only z^2 non-zero coefficients. We can then store each \u0398_k with minimal\nstorage and update L_1 and L_2 in O(n\u03ba^3 + mz^2 + N^{3/2}) time and O(mz^2 + N) space.\nDetermining the existence of such a partition of size m is a variant of the NP-Hard Subset-Union\nKnapsack Problem (SUKP) [11] with m knapsacks and where the value of each item (i.e. each Y_i)\nis equal to 1: a solution to SUKP of value n with m knapsacks is equivalent to a solution to Eq. 9.\nHowever, an approximate partition can also be simply constructed via a greedy algorithm.\n\n4 Sampling\nSampling exactly (see Alg. 2 and [17]) from a full DPP kernel costs O(N^3 + Nk^3) where k is the\nsize of the sampled subset. The bulk of the computation lies in the initial eigendecomposition of L;\nthe k orthonormalizations cost O(Nk^3). Although the eigendecomposition need only happen once\nfor many iterations of sampling, exact sampling is nonetheless intractable in practice for large N.\n\nAlgorithm 2 Sampling from a DPP kernel L\n\nInput: Matrix L.\nEigendecompose L as {(\u03bb_i, v_i)}_{1\u2264i\u2264N}.\nJ \u2190 \u2205\nfor i = 1 to N do\n  J \u2190 J \u222a {i} with probability \u03bb_i/(\u03bb_i + 1).\nend for\nV \u2190 {v_i}_{i\u2208J}, Y \u2190 \u2205\nwhile |V| > 0 do\n  Sample i from {1, ..., N} with probability (1/|V|) \u2211_{v\u2208V} v_i^2\n  Y \u2190 Y \u222a {i}, V \u2190 V_\u22a5, where V_\u22a5 is an orthonormal basis of the subspace of V orthogonal to e_i\nend while\nreturn Y\n\nIt follows from Cor. 2.2 that for KRONDPPS, the eigenvalues \u03bb_i can be obtained in O(N_1^3 + N_2^3),\nand the k eigenvectors in O(kN) operations. For N_1 \u2248 N_2 \u2248 \u221aN, exact sampling thus only costs\nO(N^{3/2} + Nk^3). 
If L = L_1 \u2297 L_2 \u2297 L_3, the same reasoning shows that exact sampling becomes\nlinear in N, only requiring O(Nk^3) operations.\nOne can also resort to MCMC sampling; for instance such a sampler was considered in [13] (though\nwith an incorrect mixing time analysis). The results of [20] hold only for k-DPPs, but suggest\ntheir MCMC sampler may possibly take O(N^2 log(N/\u03f5)) time for full DPPs, which is impractical.\nNevertheless if one develops faster MCMC samplers, they should also be able to profit from the\nKronecker product structure offered by KRONDPP.\n\n5 Experimental results\nIn order to validate our learning algorithm, we compared KRK-PICARD to JOINT-PICARD and to\nthe Picard iteration (PICARD) on multiple real and synthetic datasets (see footnote 2).\n\n5.1 Synthetic tests\nTo enable a fair comparison between algorithms, we test them on synthetic data drawn from a full\n(non-Kronecker) ground-truth DPP kernel. The sub-kernels were initialized by L_i = X^\u22a4 X, with\nX\u2019s coefficients drawn uniformly from [0, \u221a2]; for PICARD, L was initialized with L_1 \u2297 L_2.\nFor Figures 1a and 1b, training data was generated by sampling 100 subsets from the true kernel\nwith sizes uniformly distributed between 10 and 190.\n\n[Figure 1: normalized log-likelihood versus time (s) for PICARD, KRK-PICARD, JOINT-PICARD and KRK-PICARD (stochastic). Panels: (a) N_1 = N_2 = 50; (b) N_1 = N_2 = 100; (c) N_1 = 100, N_2 = 500. a = 1; the thin dotted lines indicate the standard deviation from the mean.]\n\nFootnote 2: All experiments were repeated 5 times and averaged, using MATLAB on a Linux Mint system with 16GB\nof RAM and an i7-4710HQ CPU @ 2.50GHz.\n\nTo evaluate KRK-PICARD on matrices too large to fit in memory and with large \u03ba, we drew samples\nfrom a 50\u00b710^3 \u00d7 50\u00b710^3 kernel of rank 1,000 (on average |Y_i| \u2248 1,000), and learned the kernel\nstochastically (only KRK-PICARD could be run due to the memory requirements of other methods);\nthe likelihood drastically improves in only two steps (Fig. 1c).\nAs shown in Figures 1a and 1b, KRK-PICARD converges significantly faster than PICARD, especially\nfor large values of N. However, although JOINT-PICARD also increases the log-likelihood\nat each iteration, it converges much slower and has a high standard deviation, whereas the standard\ndeviations for PICARD and KRK-PICARD are barely noticeable. For these reasons, we drop the\ncomparison to JOINT-PICARD in the subsequent experiments.\n\n5.2 Small-scale real data: baby registries\n\nWe compared KRK-PICARD to PICARD and EM [10] on the baby registry dataset (described in depth\nin [10]), which has also been used to evaluate other DPP learning algorithms [9, 10, 25]. The\ndataset contains 17 categories of baby-related products obtained from Amazon. We learned kernels\nfor the 6 largest categories (N = 100); in this case, PICARD is sufficiently efficient to be preferred\nto KRK-PICARD; this comparison serves only to evaluate the quality of the final kernel estimates.\nThe initial marginal kernel K for EM was sampled from a Wishart distribution with N degrees of\nfreedom and an identity covariance matrix, then scaled by 1/N; for PICARD, L was set to K(I - K)^{-1};\nfor KRK-PICARD, L_1 and L_2 were chosen (as in JOINT-PICARD) by minimizing \u2225L - L_1 \u2297 L_2\u2225.\nConvergence was determined when the objective change dipped below a threshold \u03b4. 
As\none EM iteration takes longer than one Picard iteration but increases the likelihood more, we set\n\u03b4_PIC = \u03b4_KRK = 10^{-4} and \u03b4_EM = 10^{-5}.\nThe final log-likelihoods are shown in Table 1; we set the step-sizes to their largest possible values,\ni.e. a_PIC = 1.3 and a_KRK = 1.8. Table 1 shows that KRK-PICARD obtains comparable, albeit\nslightly worse log-likelihoods than PICARD and EM, which confirms that for tractable N, the better\nmodeling capability of full kernels makes them preferable to KRONDPPS.\n\nTable 1: Final log-likelihoods for each large category of the baby registries dataset\n\n(a) Training set\n\nCategory   EM      PICARD   KRK-PICARD\napparel    -10.1   -10.2    -10.7\nbath       -8.6    -8.8     -9.1\nbedding    -8.7    -8.8     -9.3\ndiaper     -10.5   -10.7    -11.1\nfeeding    -12.1   -12.1    -12.5\ngear       -9.3    -9.3     -9.6\n\n(b) Test set\n\nCategory   EM      PICARD   KRK-PICARD\napparel    -10.1   -10.2    -10.7\nbath       -8.6    -8.8     -9.1\nbedding    -8.7    -8.8     -9.3\ndiaper     -10.6   -10.7    -11.2\nfeeding    -12.2   -12.2    -12.6\ngear       -9.2    -9.2     -9.5\n\n5.3 Large-scale real dataset: GENES\n\nFinally, to evaluate KRK-PICARD on large matrices of real-world data, we train it on data from the\nGENES [3] dataset (which has also been used to evaluate DPPs in [3, 19]). This dataset consists of\n10,000 genes, each represented by 331 features corresponding to the distance of a gene to hubs in\nthe BioGRID gene interaction network.\nWe construct a ground truth Gaussian DPP kernel on the GENES dataset and use it to obtain 100\ntraining samples with sizes uniformly distributed between 50 and 200 items. Similarly to the synthetic\nexperiments, we initialized KRK-PICARD\u2019s kernel by setting L_i = X_i^\u22a4 X_i where X_i is a\nrandom matrix of size N_1 \u00d7 N_1; for PICARD, we set the initial kernel L = L_1 \u2297 L_2.\nFigure 2 shows the performance of both algorithms. 
As with the synthetic experiments, KRK-PICARD\nconverges much faster; stochastic updates increase its performance even more, as shown in\nFig. 2b. Average runtimes and speed-up are given in Table 2: KRK-PICARD runs almost an order of\nmagnitude faster than PICARD, and stochastic updates are more than two orders of magnitude faster,\nwhile providing slightly larger initial increases to the log-likelihood.\n\n[Figure 2: normalized log-likelihood versus time (s) for PICARD, KRK-PICARD and KRK-PICARD (stochastic). Panels: (a) Non-stochastic learning; (b) Stochastic vs. non-stochastic. n = 150, a = 1.]\n\nTable 2: Average runtime and performance on the GENES dataset for N_1 = N_2 = 100\n\nMetric                     PICARD               KRK-PICARD           KRK-PICARD (stochastic)\nAverage runtime            161.5 \u00b1 17.7 s       8.9 \u00b1 0.2 s          1.2 \u00b1 0.02 s\nNLL increase (1st iter.)   (2.81 \u00b1 0.03)\u00b710^4   (2.96 \u00b1 0.02)\u00b710^4   (3.13 \u00b1 0.04)\u00b710^4\n\n6 Conclusion and future work\n\nWe introduced KRONDPPS, a variant of DPPs with kernels structured as the Kronecker product of m\nsmaller matrices, and showed that typical operations over DPPs such as sampling and learning the\nkernel from data can be made efficient for KRONDPPS on previously intractable ground set sizes.\nBy carefully leveraging the properties of the Kronecker product, we derived for m = 2 a low-complexity\nalgorithm to learn the kernel from data which guarantees positive iterates and a monotonic\nincrease of the log-likelihood, and runs in O(n\u03ba^3 + N^2) time. 
This algorithm provides even\nmore significant speed-ups and memory gains in the stochastic case, requiring only O(N^{3/2} + N\u03ba^2)\ntime and O(N + \u03ba^2) space. Experiments on synthetic and real data showed that KRONDPPS can be\nlearned efficiently on sets large enough that L does not fit in memory.\nOur experiments showed that KRONDPP\u2019s reduced number of parameters (compared to full kernels)\ndid not impact its performance noticeably. However, a more in-depth investigation of its expressivity\nmay be valuable for future study. Similarly, a deeper study of initialization procedures for DPP\nlearning algorithms, including KRK-PICARD, is an important direction.\nWhile discussing learning the kernel, we showed that L_1 and L_2 cannot be updated simultaneously\nin a CCCP-style iteration since g is not convex over (S_1, S_2). However, it can be shown that g is\ngeodesically convex over the Riemannian manifold of positive definite matrices, which suggests that\nderiving an iteration which would take advantage of the intrinsic geometry of the problem may be a\nviable line of future work.\nKRONDPPS also enable fast sampling, in O(N^{3/2} + Nk^3) operations when using two sub-kernels,\nand in O(Nk^3) when using three sub-kernels. This speedup allows for exact sampling at comparable\nor even better costs than previous algorithms for approximate sampling. However, the subset size k\nis still limiting, due to the O(Nk^3) cost of sampling and learning. A key aspect of future work on\nobtaining truly scalable DPPs is to overcome this computational bottleneck.\n\nAcknowledgements\n\nSS acknowledges partial support from NSF grant IIS-1409802.\n\nReferences\n[1] R. Affandi, A. Kulesza, E. Fox, and B. Taskar. Nystr\u00f6m approximation for large-scale Determinantal Point Processes. In Artificial Intelligence and Statistics (AISTATS), 2013.\n[2] R. Affandi, E. Fox, R. Adams, and B. Taskar. 
Learning the parameters of Determinantal Point Process kernels. In ICML, 2014.\n[3] N. K. Batmanghelich, G. Quon, A. Kulesza, M. Kellis, P. Golland, and L. Bornn. Diversifying sparsity using variational determinantal point processes. arXiv:1411.6307, 2014.\n[4] R. Bhatia. Positive Definite Matrices. Princeton University Press, 2007.\n[5] A. Borodin. Determinantal point processes. arXiv:0911.1153, 2009.\n[6] W. Chao, B. Gong, K. Grauman, and F. Sha. Large-margin determinantal point processes. In Uncertainty in Artificial Intelligence (UAI), 2015.\n[7] L. Decreusefond, I. Flint, N. Privault, and G. L. Torrisi. Determinantal point processes, 2015.\n[8] S. Flaxman, A. Wilson, D. Neill, H. Nickisch, and A. Smola. Fast Kronecker inference in Gaussian processes with non-Gaussian likelihoods. In ICML, pages 607\u2013616, 2015.\n[9] M. Gartrell, U. Paquet, and N. Koenigstein. Low-rank factorization of determinantal point processes for recommendation. arXiv:1602.05436, 2016.\n[10] J. Gillenwater, A. Kulesza, E. Fox, and B. Taskar. Expectation-Maximization for learning Determinantal Point Processes. In NIPS, 2014.\n[11] O. Goldschmidt, D. Nehme, and G. Yu. Note: On the set-union knapsack problem. Naval Research Logistics, 41:833\u2013842, 1994.\n[12] J. B. Hough, M. Krishnapur, Y. Peres, and B. Vir\u00e1g. Determinantal processes and independence. Probability Surveys, 3:206\u2013229, 2006.\n[13] B. Kang. Fast determinantal point process sampling with application to clustering. In Advances in Neural Information Processing Systems 26, pages 2319\u20132327. Curran Associates, Inc., 2013.\n[14] A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: theory, efficient algorithms and empirical studies. JMLR, 9:235\u2013284, 2008.\n[15] A. Kulesza. Learning with Determinantal Point Processes. PhD thesis, University of Pennsylvania, 2013.\n[16] A. Kulesza and B. Taskar. 
k-DPPs: Fixed-size Determinantal Point Processes. In ICML, 2011.\n[17] A. Kulesza and B. Taskar. Determinantal Point Processes for machine learning, volume 5. Foundations and Trends in Machine Learning, 2012.\n[18] F. Lavancier, J. M\u00f8ller, and E. Rubak. Determinantal point process models and statistical inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(4):853\u2013877, 2015.\n[19] C. Li, S. Jegelka, and S. Sra. Efficient sampling for k-determinantal point processes. arXiv:1509.01618, 2015.\n[20] C. Li, S. Jegelka, and S. Sra. Fast DPP sampling for Nystr\u00f6m with application to kernel methods. arXiv:1603.06052, 2016.\n[21] H. Lin and J. Bilmes. Learning mixtures of submodular shells with application to document summarization. In Uncertainty in Artificial Intelligence (UAI), 2012.\n[22] C. V. Loan and N. Pitsianis. Approximation with Kronecker products. In Linear Algebra for Large Scale and Real Time Applications, pages 293\u2013314. Kluwer Publications, 1993.\n[23] R. Lyons. Determinantal probability measures. Publications Math\u00e9matiques de l\u2019Institut des Hautes \u00c9tudes Scientifiques, 98(1):167\u2013212, 2003.\n[24] O. Macchi. The coincidence approach to stochastic point processes. Adv. Appl. Prob., 7(1), 1975.\n[25] Z. Mariet and S. Sra. Fixed-point algorithms for learning determinantal point processes. In ICML, 2015.\n[26] Z. Mariet and S. Sra. Diversity networks. Int. Conf. on Learning Representations (ICLR), 2016. URL arXiv:1511.05077.\n[27] J. Martens and R. B. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In ICML, 2015.\n[28] G. Wu, Z. Zhang, and E. Y. Chang. Kronecker factorization for speeding up kernel machines. In SIAM Data Mining (SDM), pages 611\u2013615, 2005.\n[29] A. L. Yuille and A. Rangarajan. The concave-convex procedure (CCCP). 
In Advances in Neural Information Processing Systems 14, pages 1033\u20131040. MIT Press, 2002.\n[30] X. Zhang, F. X. Yu, R. Guo, S. Kumar, S. Wang, and S.-F. Chang. Fast orthogonal projection based on Kronecker product. In ICCV, 2015.\n[31] T. Zhou, Z. Kuscsik, J.-G. Liu, M. Medo, J. R. Wakeling, and Y.-C. Zhang. Solving the apparent diversity-accuracy dilemma of recommender systems. PNAS, 107(10):4511\u20134515, 2010.\n", "award": [], "sourceid": 1379, "authors": [{"given_name": "Zelda", "family_name": "Mariet", "institution": "MIT"}, {"given_name": "Suvrit", "family_name": "Sra", "institution": "MIT"}]}