{"title": "On two ways to use determinantal point processes for Monte Carlo integration", "book": "Advances in Neural Information Processing Systems", "page_first": 7770, "page_last": 7779, "abstract": "When approximating an integral by a weighted sum of function evaluations, determinantal point processes (DPPs) provide a way to enforce repulsion between the evaluation points.\nThis negative dependence is encoded by a kernel.\nFifteen years before the discovery of DPPs, Ermakov & Zolotukhin (EZ, 1960) had the intuition of sampling a DPP and solving a linear system to compute an unbiased Monte Carlo estimator of the integral.\nIn the absence of DPP machinery to derive an efficient sampler and analyze their estimator, the idea of Monte Carlo integration with DPPs was stored in the cellar of numerical integration. \nRecently, Bardenet & Hardy (BH, 2019) came up with a more natural estimator with a fast central limit theorem (CLT).\nIn this paper, we first take the EZ estimator out of the cellar, and analyze it using modern arguments.\nSecond, we provide an efficient implementation to sample exactly a particular multidimensional DPP called multivariate Jacobi ensemble.\nThe latter satisfies the assumptions of the aforementioned CLT. \nThird, our new implementation lets us investigate the behavior of the two unbiased Monte Carlo estimators in yet unexplored regimes.\nWe demonstrate experimentally good properties when the kernel is adapted to a basis of functions in which the integrand is sparse or has fast-decaying coefficients.\nIf such a basis and the level of sparsity are known (e.g., we integrate a linear combination of kernel eigenfunctions), the EZ estimator can be the right choice, but otherwise it can display an erratic behavior.", "full_text": "On two ways to use determinantal point processes for Monte Carlo integration\n\nGuillaume Gautier\u2020*\ng.gautier@inria.fr\n\nR\u00e9mi Bardenet\u2020\nremi.bardenet@gmail.com\n\nMichal Valko\u2021*\u2020\nvalkom@deepmind.com\n\n\u2020Univ. Lille, CNRS, Centrale Lille, UMR 9189 \u2013 CRIStAL, 59651 Villeneuve d\u2019Ascq, France\n\n*Inria Lille-Nord Europe, 40 avenue Halley 59650 Villeneuve d\u2019Ascq, France\n\n\u2021DeepMind Paris, 14 Rue de Londres, 75009 Paris, France\n\nAbstract\n\nWhen approximating an integral by a weighted sum of function evaluations, determinantal\npoint processes (DPPs) provide a way to enforce repulsion between\nthe evaluation points. This negative dependence is encoded by a kernel. Fifteen\nyears before the discovery of DPPs, Ermakov & Zolotukhin (EZ, 1960) had the\nintuition of sampling a DPP and solving a linear system to compute an unbiased\nMonte Carlo estimator of the integral. In the absence of DPP machinery to derive\nan efficient sampler and analyze their estimator, the idea of Monte Carlo integration\nwith DPPs was stored in the cellar of numerical integration. Recently, Bardenet &\nHardy (BH, 2019) came up with a more natural estimator with a fast central limit\ntheorem (CLT). In this paper, we first take the EZ estimator out of the cellar, and analyze\nit using modern arguments. Second, we provide an efficient implementation\u00b9\nto sample exactly a particular multidimensional DPP called multivariate Jacobi\nensemble. The latter satisfies the assumptions of the aforementioned CLT. Third,\nour new implementation lets us investigate the behavior of the two unbiased Monte\nCarlo estimators in yet unexplored regimes. We demonstrate experimentally good\nproperties when the kernel is adapted to a basis of functions in which the integrand is\nsparse or has fast-decaying coefficients. 
If such a basis and the level of sparsity are\nknown (e.g., we integrate a linear combination of kernel eigenfunctions), the EZ\nestimator can be the right choice, but otherwise it can display an erratic behavior.\n\n1 Introduction\n\nNumerical integration is a core task of many machine learning applications, including most Bayesian\nmethods (Robert, 2007). Both deterministic (Davis & Rabinowitz, 1984; Dick & Pillichshammer,\n2010) and random (Robert & Casella, 2004) algorithms have been proposed; see also (Evans &\nSwartz, 2000) for a survey. All methods require evaluating the integrand at carefully chosen points,\ncalled quadrature nodes, and combining these evaluations to minimize the approximation error.\nRecently, a stream of work has made use of prior knowledge on the smoothness of the integrand using\nkernels. Oates et al. (2017) and Liu & Lee (2017) used kernel-based control variates, splitting the\ncomputational budget into regressing the integrand and integrating the residual. Bach (2017) looked\nfor the best way to sample i.i.d. nodes and combine the resulting evaluations. Finally, Bayesian\nquadrature (O\u2019Hagan, 1991; Husz\u00e1r & Duvenaud, 2012; Briol et al., 2015), herding (Chen et al.,\n2010; Bach et al., 2012), or the biased importance sampling estimate of Delyon & Portier (2016) all\nfavor dissimilar nodes, where dissimilarity is measured by a kernel. Our work falls in this last cluster.\nWe build on the particular approach of Bardenet & Hardy (2019) for Monte Carlo integration based\non projection determinantal point processes (DPPs, Hough et al., 2006; Kulesza & Taskar, 2012).\nDPPs are a repulsive distribution over configurations of points, where repulsion is again parametrized\nby a kernel. 
In a sense, DPPs are the kernel machines of point processes.\n\n\u00b9 github.com/guilgautier/DPPy\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFifteen years before Macchi (1975) even formalized DPPs, Ermakov & Zolotukhin (EZ, 1960) had\nthe intuition to use a determinantal structure to sample quadrature nodes, followed by solving a\nlinear system to aggregate the evaluations of the integrand into an unbiased estimator. This linear\nsystem yields a simple and interpretable characterization of the variance of their estimator. Ermakov\n& Zolotukhin\u2019s result did not diffuse much\u00b2 in the Monte Carlo community, partly because the\nmathematical and computational machinery to analyze and implement it was not available. Seemingly\nunaware of this previous work, Bardenet & Hardy (2019) came up with a more natural estimator of\nthe integral of interest, and they could build upon the thorough study of DPPs in random matrix theory\n(Johansson, 2006) to obtain a fast central limit theorem (CLT). Since then, DPPs with stationary\nkernels have also been used by Mazoyer et al. (2019) for Monte Carlo integration. In any case,\nthese DPP-based Monte Carlo estimators crucially depend on efficient sampling procedures for the\ncorresponding (potentially multidimensional) DPP.\n\nOur contributions. First, we reveal the close link between DPPs and the approach of Ermakov &\nZolotukhin (1960). Second, we provide a simple proof of their result and survey the properties of the\nEZ estimator with modern arguments. In particular, when the integrand is a linear combination of the\neigenfunctions of the kernel of the underlying DPP, the corresponding Fourier-like coefficients can\nbe estimated with zero variance. In other words, one sample of the corresponding DPP yields perfect\ninterpolation of the underlying integrand, by solving a linear system. 
Third, we propose an efficient\nPython implementation for exact sampling of a particular DPP, called multivariate Jacobi ensemble.\nThe code\u00b9 is available in the DPPy toolbox of Gautier et al. (2019). This implementation allows us to\nnumerically investigate the behavior of the two Monte Carlo estimators derived by Bardenet & Hardy\n(2019) and Ermakov & Zolotukhin (1960), in regimes yet unexplored for either of the two. Fourth,\nimportant theoretical properties of both estimators, like the CLT of Bardenet & Hardy (2019), are\ntechnically involved. A CLT for EZ promises to be even more difficult to establish. The current\nempirical investigation provides a motivation and guidelines for more theoretical work. Our point\nis not to compare DPP-based Monte Carlo estimators to the wide choice of numerical integration\nalgorithms, but to get a fine understanding of their properties so as to fine-tune their design and guide\ntheoretical developments.\n\n2 Quadrature, DPPs, and the multivariate Jacobi ensemble\n\nIn this section, we quickly survey classical quadrature rules. Then, we define DPPs and give the key\nproperties that make them useful for Monte Carlo integration. Finally, among so-called projection\nDPPs, we introduce the multivariate Jacobi ensemble used by Bardenet & Hardy (2019) to generate\nquadrature nodes, and on which we base our experimental work.\n\n2.1 Standard quadrature\nFollowing Briol et al. (2015, Section 2.1), let $\\mu(dx) = \\omega(x)\\,dx$ be a positive Borel measure on\n$\\mathbb{X} \\subset \\mathbb{R}^d$ with finite mass and density $\\omega$ w.r.t. the Lebesgue measure. This paper aims to compute\nintegrals of the form $\\int f(x)\\,\\mu(dx)$ for some test function $f : \\mathbb{X} \\to \\mathbb{R}$. A quadrature rule approximates\nsuch integrals as a weighted sum of evaluations of $f$ at some nodes $\\{x_1, \\dots, x_N\\} \\subset \\mathbb{X}$,\n\n$$\\int f(x)\\,\\mu(dx) \\approx \\sum_{n=1}^{N} \\omega_n f(x_n), \\tag{1}$$\n\nwhere the weights $\\omega_n \\triangleq \\omega_n(x_1, \\dots, x_N)$ do not need to be non-negative nor sum to one.\nAmong the many quadrature designs mentioned in the introduction (Evans & Swartz, 2000, Section 5),\nwe pay special attention to the textbook example of the (deterministic) Gauss-Jacobi rule. This scheme\napplies to dimension $d = 1$, for $\\mathbb{X} \\triangleq [-1, 1]$ and $\\omega(x) \\triangleq (1 - x)^a (1 + x)^b$ with $a, b > -1$. In this\ncase, the nodes $\\{x_1, \\dots, x_N\\}$ are taken to be the zeros of $p_N$, the orthonormal Jacobi polynomial\nof degree $N$, and the weights $\\omega_n \\triangleq 1/K(x_n, x_n)$ with $K(x, x) \\triangleq \\sum_{k=0}^{N-1} p_k(x)^2$. In particular, this\nspecific quadrature rule integrates polynomials up to degree $2N - 1$ exactly (Davis &\nRabinowitz, 1984, Section 2.7). In a sense, the DPPs of Bardenet & Hardy (2019) are a random,\nmultivariate generalization of Gauss-Jacobi quadrature, as we shall see in Section 3.1.\n\n\u00b2 Many thanks to Mathieu Gerber of Univ. Bristol, UK, for digging up this result from his human memory.\n\nMonte Carlo integration can be defined as random choices of nodes in (1). Importance sampling, for\ninstance, corresponds to i.i.d. nodes, while Markov chain Monte Carlo corresponds to nodes drawn\nfrom a carefully chosen Markov chain; see, e.g., Robert & Casella (2004) for more details. Finally,\nquasi-Monte Carlo (QMC, Dick & Pillichshammer, 2010) applies to $\\mu$ uniform over a compact subset\nof $\\mathbb{R}^d$, and constructs deterministic nodes that spread uniformly, as measured by their discrepancy.\n\n2.2 Projection DPPs\nDPPs can be understood as a parametric class of point processes, specified by a base measure $\\mu$ and\na kernel $K : \\mathbb{X} \\times \\mathbb{X} \\to \\mathbb{C}$. The latter is commonly assumed to be Hermitian and trace-class. For\nthe resulting process to be well defined, it is necessary and sufficient that the kernel $K$ is positive\nsemi-definite with eigenvalues in $[0, 1]$, see, e.g., Soshnikov (2000, Theorem 3). When the eigenvalues\nfurther belong to $\\{0, 1\\}$, we speak of a projection kernel and a projection DPP. 
One practical feature\nof projection DPPs is that they almost surely produce samples with fixed cardinality, equal to the\nrank $N$ of the kernel. More generally, they are the building blocks of DPPs. Indeed, under general\nassumptions, all DPPs are mixtures of projection DPPs (Hough et al., 2006, Theorem 7). Hereafter,\nunless specifically stated, $K$ is assumed to be a real-valued, symmetric, projection kernel.\nOne way to define a projection DPP with $N$ points is to take $N$ functions $\\phi_0, \\dots, \\phi_{N-1}$ orthonormal\nw.r.t. $\\mu$, i.e., $\\langle \\phi_k, \\phi_\\ell \\rangle \\triangleq \\int \\phi_k(x)\\phi_\\ell(x)\\,\\mu(dx) = \\delta_{k\\ell}$, and consider the kernel $K_N$ associated to the\northogonal projector onto $\\mathcal{H}_N \\triangleq \\operatorname{span}\\{\\phi_k, 0 \\leq k \\leq N-1\\}$, i.e.,\n\n$$K_N(x, y) \\triangleq \\sum_{k=0}^{N-1} \\phi_k(x)\\phi_k(y). \\tag{2}$$\n\nWe say that the set $\\{x_1, \\dots, x_N\\} \\subset \\mathbb{X}$ is drawn from the projection DPP with base measure $\\mu$ and\nkernel $K_N$, denoted by $\\{x_1, \\dots, x_N\\} \\sim \\mathrm{DPP}(\\mu, K_N)$, when $(x_1, \\dots, x_N)$ has joint distribution\n\n$$\\frac{1}{N!} \\det\\big(K_N(x_p, x_n)\\big)_{p,n=1}^{N}\\, \\mu^{\\otimes N}(dx). \\tag{3}$$\n\n$\\mathrm{DPP}(\\mu, K_N)$ indeed defines a probability measure over sets since (3) is invariant by permutation\nand the orthonormality of the $\\phi_k$s yields the normalization. See also Appendix A.1 for more details\non the construction of projection DPPs from sets of linearly independent functions.\nThe repulsion of projection DPPs may be understood geometrically by considering the Gram formulation\nof the kernel (2), namely\n\n$$K_N(x, y) = \\Phi(x)^{\\top}\\Phi(y), \\quad \\text{where } \\Phi(x) \\triangleq (\\phi_0(x), \\dots, \\phi_{N-1}(x))^{\\top}. \\tag{4}$$\n\nThis allows to rewrite the joint distribution (3) as\n\n$$\\frac{1}{N!} \\underbrace{\\det\\big(\\Phi(x_{1:N})\\Phi(x_{1:N})^{\\top}\\big)}_{=(\\det \\Phi(x_{1:N}))^2}\\, \\mu^{\\otimes N}(dx), \\quad \\text{where } \\Phi(x_{1:N}) \\triangleq \\begin{pmatrix} \\phi_0(x_1) & \\dots & \\phi_{N-1}(x_1) \\\\\\\\ \\vdots & & \\vdots \\\\\\\\ \\phi_0(x_N) & \\dots & \\phi_{N-1}(x_N) \\end{pmatrix}. \\tag{5}$$\n\nThus, the larger the determinant of the feature matrix $\\Phi(x_{1:N})$, i.e., the larger the volume of the\nparallelotope spanned by the feature vectors $\\Phi(x_1), \\dots, \\Phi(x_N)$, the more likely $x_1, \\dots, x_N$ co-occur.\n\n2.3 The multivariate Jacobi ensemble\nIn this part, we specify a projection kernel. We follow Bardenet & Hardy (2019) and take its\neigenfunctions to be multivariate orthonormal polynomials. In dimension $d = 1$, letting $(\\phi_k)_{k \\geq 0}$ in (2)\nbe the orthonormal polynomials w.r.t. $\\mu$ results in a projection DPP called an orthogonal polynomial\nensemble (OPE, K\u00f6nig, 2004). When $d > 1$, orthonormal polynomials can still be uniquely defined\nby applying the Gram-Schmidt procedure to a set of monomials, provided the base measure is not\npathological. However, there is no natural order on multivariate monomials: an ordering $b : \\mathbb{N}^d \\to \\mathbb{N}$\nmust be picked before we apply Gram-Schmidt to the monomials in $L^2(\\mu)$. We follow Bardenet\n& Hardy (2019, Section 2.1.3) and consider multi-indices $k \\triangleq (k_1, \\dots, k_d) \\in \\mathbb{N}^d$ ordered by their\nmaximum degree $\\max_i k_i$, and for constant maximum degree, by the usual lexicographic order. We\nstill denote the corresponding multivariate orthonormal polynomials by $(\\phi_k)_{k \\in \\mathbb{N}^d}$.\n\nBy multivariate OPE we mean the projection DPP with base measure $\\mu(dx) \\triangleq \\omega(x)\\,dx$ and orthogonal\nprojection kernel $K_N(x, y) \\triangleq \\sum_{b(k)=0}^{N-1} \\phi_k(x)\\phi_k(y)$. When the base measure is separable, i.e.,\n$\\omega(x) = \\omega_1(x_1) \\times \\dots \\times \\omega_d(x_d)$, multivariate orthonormal polynomials are products of univariate\nones, and the kernel (2) reads\n\n$$K_N(x, y) = \\sum_{b(k)=0}^{N-1} \\prod_{i=1}^{d} \\phi^i_{k_i}(x_i)\\, \\phi^i_{k_i}(y_i), \\tag{6}$$\n\nwhere $(\\phi^i_\\ell)_{\\ell \\geq 0}$ are the orthonormal polynomials w.r.t. $\\omega_i(z)\\,dz$. For $\\mathbb{X} = [-1, 1]^d$ and $\\omega_i(z) = (1 - z)^{a_i}(1 + z)^{b_i}$, with $a_i, b_i > -1$, the resulting DPP is called a multivariate Jacobi ensemble.\n\n3 Monte Carlo integration with projection DPPs\n\nOur goal is to design random quadrature rules (1) on $\\mathbb{X} \\triangleq [-1, 1]^d$ with desirable properties. We\nfocus on computing $\\int f(x)\\,\\mu(dx)$ with the two unbiased DPP-based Monte Carlo estimators of\nBardenet & Hardy (BH, 2019) and Ermakov & Zolotukhin (EZ, 1960). 
We start by presenting the\nnatural BH estimator which, when associated to the multivariate Jacobi ensemble, comes with a CLT\nwith a faster rate than classical Monte Carlo. Then, we survey the properties of the less obvious EZ\nestimator. Using a generalization of the Cauchy-Binet formula we provide a slight improvement of\nthe key result of EZ. Despite the lack of result illustrating a fast convergence rate, the EZ estimator\nhas a practical and interpretable variance. In particular, this estimator turns a single DPP sample\ninto a perfect integrator as well as a perfect interpolator of functions that are linear combinations\nof eigenfunctions of the associated kernel. Finally, we detail our exact sampling procedure for\nmultivariate Jacobi ensemble, which allows to exploit the best of both the BH and EZ estimators.\n\n3.1 A natural estimator\nFor $f \\in L^1(\\mu)$, Bardenet & Hardy (2019) consider\n\n$$\\widehat{I}^{\\mathrm{BH}}_N(f) \\triangleq \\sum_{n=1}^{N} \\frac{f(x_n)}{K_N(x_n, x_n)}, \\tag{7}$$\n\nas an unbiased estimator of $\\int f(x)\\,\\mu(dx)$, with variance (see, e.g., Lavancier et al., 2012, Section 2.1)\n\n$$\\mathrm{Var}\\big[\\widehat{I}^{\\mathrm{BH}}_N(f)\\big] = \\frac{1}{2} \\int \\left( \\frac{f(x)}{K_N(x, x)} - \\frac{f(y)}{K_N(y, y)} \\right)^2 K_N(x, y)^2\\, \\mu(dx)\\mu(dy), \\tag{8}$$\n\nwhich clearly captures a notion of smoothness of $f$ w.r.t. $K_N$ but its interpretation is not obvious.\nFor $\\mathbb{X} = [-1, 1]^d$, the interest in multivariate Jacobi ensemble among DPPs comes from the fact that\n(7) can be understood as a (randomized) multivariate counterpart of the Gauss-Jacobi quadrature\nintroduced in Section 2.1. Moreover, for $f$ essentially $C^1$, Bardenet & Hardy (2019, Theorem 2.7)\nproved a CLT with faster-than-classical-Monte-Carlo decay,\n\n$$\\sqrt{N^{1+1/d}} \\left( \\widehat{I}^{\\mathrm{BH}}_N(f) - \\int f(x)\\,\\mu(dx) \\right) \\xrightarrow[N \\to \\infty]{\\text{law}} \\mathcal{N}\\big(0, \\Omega^2_{f,\\omega}\\big), \\tag{9}$$\n\nwith $\\Omega^2_{f,\\omega} \\triangleq \\frac{1}{2} \\sum_{k \\in \\mathbb{N}^d} (k_1 + \\dots + k_d)\\, \\mathcal{F}_{\\frac{f\\omega}{\\omega_{\\mathrm{eq}}}}(k)^2$, where $\\mathcal{F}_g$ denotes the Fourier transform of $g$, and\n$\\omega_{\\mathrm{eq}}(x) \\triangleq 1/\\prod_{i=1}^{d} \\pi\\sqrt{1 - (x_i)^2}$. In the fast CLT (9), the asymptotic variance is governed by the\nsmoothness of $f$ since $\\Omega_{f,\\omega}$ is a measure of the decay of the Fourier coefficients of the integrand.\n\n3.2 The Ermakov-Zolotukhin estimator\nWe start by stating the main finding of Ermakov & Zolotukhin (1960), see also Evans & Swartz (2000,\nSection 6.4.3) and references therein. To the best of our knowledge, we are the first to make the connection\nwith projection DPPs, as defined in Section 2.2. This allows us to give a slight improvement\nand provide a simpler proof of the original result, based on a generalization of the Cauchy-Binet\nformula (Johansson, 2006). Finally, we apply Ermakov & Zolotukhin\u2019s (1960) technique to build an\nunbiased estimator of $\\int f(x)\\,\\mu(dx)$, which comes with a practical and interpretable variance.\n\nTheorem 1. Consider $f \\in L^2(\\mu)$ and $N$ functions $\\phi_0, \\dots, \\phi_{N-1} \\in L^2(\\mu)$ orthonormal w.r.t. $\\mu$. Let\n$\\{x_1, \\dots, x_N\\} \\sim \\mathrm{DPP}(\\mu, K_N)$, with $K_N(x, y) = \\sum_{k=0}^{N-1} \\phi_k(x)\\phi_k(y)$. Consider the linear system\n\n$$\\begin{pmatrix} \\phi_0(x_1) & \\dots & \\phi_{N-1}(x_1) \\\\\\\\ \\vdots & & \\vdots \\\\\\\\ \\phi_0(x_N) & \\dots & \\phi_{N-1}(x_N) \\end{pmatrix} \\begin{pmatrix} y_1 \\\\\\\\ \\vdots \\\\\\\\ y_N \\end{pmatrix} = \\begin{pmatrix} f(x_1) \\\\\\\\ \\vdots \\\\\\\\ f(x_N) \\end{pmatrix}. \\tag{10}$$\n\nThen, the solution of (10) is unique, $\\mu$-almost surely, with coordinates $y_k = \\frac{\\det \\Phi_{\\phi_{k-1},f}(x_{1:N})}{\\det \\Phi(x_{1:N})}$,\nwhere $\\Phi_{\\phi_{k-1},f}(x_{1:N})$ is the matrix obtained by replacing the $k$-th column of $\\Phi(x_{1:N})$ by $f(x_{1:N})$.\nMoreover, for all $1 \\leq k \\leq N$, the coordinate $y_k$ of the solution vector satisfies\n\n$$\\mathbb{E}[y_k] = \\langle f, \\phi_{k-1} \\rangle, \\quad \\text{and} \\quad \\mathrm{Var}[y_k] = \\|f\\|^2 - \\sum_{\\ell=0}^{N-1} \\langle f, \\phi_\\ell \\rangle^2. \\tag{11}$$\n\nWe improved the original result by showing that $\\mathrm{Cov}[y_j, y_k] = 0$, for all $1 \\leq j \\neq k \\leq N$.\nBefore we provide the proof, also detailed in Appendix A.2, several remarks are in order. We start by\nconsidering a function $f \\triangleq \\sum_{k=0}^{M-1} \\langle f, \\phi_k \\rangle \\phi_k$, $1 \\leq M \\leq \\infty$, where $(\\phi_k)_{k \\geq 0}$ forms an orthonormal\nbasis of $L^2(\\mu)$, e.g., the Fourier basis or wavelet bases (Mallat & Peyr\u00e9, 2009). 
Next, we build the\northogonal projection kernel $K_N$ onto $\\mathcal{H}_N \\triangleq \\operatorname{span}\\{\\phi_0, \\dots, \\phi_{N-1}\\}$ as in (2). Then, Theorem 1\nshows that solving (10), with points $\\{x_1, \\dots, x_N\\} \\sim \\mathrm{DPP}(\\mu, K_N)$, provides unbiased estimates of\nthe $N$ Fourier-like coefficients $(\\langle f, \\phi_k \\rangle)_{k=0}^{N-1}$. Remarkably, these estimates are uncorrelated and have\nthe same variance (11) equal to the residual $\\sum_{k=N}^{\\infty} \\langle f, \\phi_k \\rangle^2$. The faster the decay of the coefficients,\nthe smaller the variance. In particular, for $M \\leq N$, i.e., $f \\in \\mathcal{H}_N$, the estimators have zero variance.\nPut differently, $f$ can be reconstructed perfectly from only one sample of $\\mathrm{DPP}(\\mu, K_N)$.\n\nProof. First, the joint distribution (5) of $(x_1, \\dots, x_N)$ is proportional to $(\\det \\Phi(x_{1:N}))^2 \\mu^{\\otimes N}(dx)$.\nThus, the matrix $\\Phi(x_{1:N})$ defining the linear system (10) is invertible, $\\mu$-almost surely, and the\nexpression of the coordinates follows from Cramer\u2019s rule. Then, we treat the case $k = 1$, the others\nfollow the same lines. The proof relies on the orthonormality of the $\\phi_k$s and a generalization of the\nCauchy-Binet formula (A.1), cf. Lemma A. The expectation in (11) reads\n\n$$\\mathbb{E}\\left[ \\frac{\\det \\Phi_{\\phi_0,f}(x_{1:N})}{\\det \\Phi(x_{1:N})} \\right] \\overset{(5)}{=} \\frac{1}{N!} \\int \\det \\Phi_{\\phi_0,f}(x_{1:N}) \\det \\Phi(x_{1:N})\\, \\mu^{\\otimes N}(dx) \\overset{(A.1)}{=} \\det \\begin{pmatrix} \\langle f, \\phi_0 \\rangle & (\\langle f, \\phi_\\ell \\rangle)_{\\ell=1}^{N-1} \\\\\\\\ 0_{N-1,1} & I_{N-1} \\end{pmatrix} = \\langle f, \\phi_0 \\rangle. \\tag{12}$$\n\nSimilarly, the second moment reads\n\n$$\\mathbb{E}\\left[ \\left( \\frac{\\det \\Phi_{\\phi_0,f}(x_{1:N})}{\\det \\Phi(x_{1:N})} \\right)^2 \\right] \\overset{(5)}{=} \\frac{1}{N!} \\int \\det \\Phi_{\\phi_0,f}(x_{1:N}) \\det \\Phi_{\\phi_0,f}(x_{1:N})\\, \\mu^{\\otimes N}(dx) \\overset{(A.1)}{=} \\det \\begin{pmatrix} \\|f\\|^2 & (\\langle f, \\phi_\\ell \\rangle)_{\\ell=1}^{N-1} \\\\\\\\ (\\langle f, \\phi_k \\rangle)_{k=1}^{N-1} & I_{N-1} \\end{pmatrix} = \\|f\\|^2 - \\sum_{k=1}^{N-1} \\langle f, \\phi_k \\rangle^2. \\tag{13}$$\n\nFinally, the variance in (11) equals (13) $-$ (12)$^2$. The covariance is treated in Appendix A.2.\n\nIn the setting of the multivariate Jacobi ensemble described in Section 2.3, the first orthonormal\npolynomial $\\phi_0$ is constant, equal to $\\mu([-1, 1]^d)^{-1/2}$. Hence, a direct application of Theorem 1 yields\n\n$$\\widehat{I}^{\\mathrm{EZ}}_N(f) \\triangleq \\frac{y_1}{\\phi_0} = \\mu([-1, 1]^d)^{1/2}\\, \\frac{\\det \\Phi_{\\phi_0,f}(x_{1:N})}{\\det \\Phi(x_{1:N})}, \\tag{14}$$\n\nas an unbiased estimator of $\\int_{[-1,1]^d} f(x)\\,\\mu(dx)$, see Appendix A.3. We also show that (14) can be\nviewed as a quadrature rule (1) with weights summing to $\\mu([-1, 1]^d)$. Unlike the variance of $\\widehat{I}^{\\mathrm{BH}}_N(f)$\nin (8), the variance of $\\widehat{I}^{\\mathrm{EZ}}_N(f)$ clearly reflects the accuracy of the approximation of $f$ by its projection\nonto $\\mathcal{H}_N$. In particular, it allows to integrate and interpolate polynomials up to \u201cdegree\u201d $b^{-1}(N - 1)$,\nperfectly. Nonetheless, its limiting theoretical properties, like a CLT, look hard to establish. In\nparticular, the dependence of each quadrature weight on all quadrature nodes makes the estimator a\npeculiar object that doesn\u2019t fit the assumptions of traditional CLTs for DPPs (Soshnikov, 2000).\n\n3.3 How to sample from projection DPPs and the multivariate Jacobi ensemble\nTo perform Monte Carlo integration with DPPs, it is crucial to sample the points and evaluate the\nweights efficiently. However, sampling from continuous DPPs remains a challenge. In this part, we\nreview briefly the main technique for projection DPP sampling before we develop our method to\ngenerate samples from the multivariate Jacobi ensemble. The code\u00b9 is available in the DPPy toolbox\nof Gautier et al. (2019); the associated documentation\u00b3 contains many more details on DPP sampling.\nIn both finite and continuous cases, except for some specific instances, exact DPP sampling usually\nrequires the spectral decomposition of the underlying kernel (Lavancier et al., 2012, Section 2.4).\nHowever, for projection DPPs, prior knowledge of the eigenfunctions is not necessary, only an oracle\nto evaluate the kernel is required. Next, we describe the generic exact sampler of Hough et al. (2006,\nAlgorithm 18). 
It is based on the chain rule and valid exclusively for projection DPPs.\nFor simplicity, consider a projection $\\mathrm{DPP}(\\mu, K_N)$ with $\\mu(dx) = \\omega(x)\\,dx$ and $K_N$ as in (2). This\nDPP has exactly $N$ points, $\\mu$-almost surely (Hough et al., 2006, Lemma 17). To get a valid sample\n$\\{x_1, \\dots, x_N\\}$, it is enough to apply the chain rule to sample $(x_1, \\dots, x_N)$ and forget the order the\npoints were selected. The chain rule scheme can be derived from two different perspectives. Either\nusing Schur complements to express the determinant in the joint distribution (3),\n\n$$\\frac{K_N(x_1, x_1)}{N}\\, \\omega(x_1)\\,dx_1 \\prod_{n=2}^{N} \\frac{K_N(x_n, x_n) - \\mathbf{K}_{n-1}(x_n)^{\\top} \\mathbf{K}_{n-1}^{-1} \\mathbf{K}_{n-1}(x_n)}{N - (n-1)}\\, \\omega(x_n)\\,dx_n, \\tag{15}$$\n\nwhere $\\mathbf{K}_{n-1}(\\cdot) = (K_N(x_1, \\cdot), \\dots, K_N(x_{n-1}, \\cdot))^{\\top}$, and $\\mathbf{K}_{n-1} = (K_N(x_p, x_q))_{p,q=1}^{n-1}$. Or geometrically\nusing the base$\\times$height formula to express the squared volume in the joint distribution (5),\n\n$$\\frac{\\|\\Phi(x_1)\\|^2}{N}\\, \\omega(x_1)\\,dx_1 \\prod_{n=2}^{N} \\frac{\\operatorname{distance}^2\\big(\\Phi(x_n), \\operatorname{span}\\{\\Phi(x_p)\\}_{p=1}^{n-1}\\big)}{N - (n-1)}\\, \\omega(x_n)\\,dx_n. \\tag{16}$$\n\nNote that the numerators in (15) correspond to the incremental posterior variances of a noise-free\nGaussian process model with kernel $K_N$ (Rasmussen & Williams, 2006), giving yet another intuition\nfor repulsion. When using the chain rule, the practical challenge is twofold: find efficient ways to (i)\nevaluate the conditional densities, (ii) sample exactly from the conditionals.\nIn this work, we take $\\mathbb{X} = [-1, 1]^d$ and focus on sampling the multivariate Jacobi ensemble with\nparameters $|a_i|, |b_i| \\leq 1/2$, cf. Section 2.3. We remodeled the original implementation\u2074 of the\nmultivariate Jacobi ensemble sampler accompanying the work of Bardenet & Hardy (BH, 2019) in a\nmore Pythonic way. In particular, we address the previous challenges in the following way:\n(i) contrary to BH, we leverage the Gram structure to vectorize the computations and consider (16)\ninstead of (15), and evaluate $K_N(x, y)$ via (4) instead of (6). 
The overall procedure is akin to a\nsequential Gram-Schmidt orthogonalization of the feature vectors $\\Phi(x_1), \\dots, \\Phi(x_N)$. Moreover we\npay special attention to avoiding unnecessary evaluations of orthogonal polynomials (OP) when computing\na feature vector $\\Phi(x)$. This is done by coupling the slicing feature of the Python language with\nthe dedicated method scipy.special.eval_jacobi, used to evaluate the three-term recurrence\nrelations satisfied by each of the univariate Jacobi polynomials. Given the chosen ordering $b$, the\ncomputation of $\\Phi(x)$ requires the evaluation of $d$ recurrence relations up to depth $\\sqrt[d]{N}$.\n(ii) like BH, we sample each conditional in turn using a rejection sampling mechanism with the same\nproposal distribution. But BH take as proposal $\\omega_{\\mathrm{eq}}(x)\\,dx$, which corresponds to the limiting marginal\nof the multivariate Jacobi ensemble as $N$ goes to infinity; see (Simon, 2011, Section 3.11). On our\nside, we use a two-layer rejection sampling scheme. We rather sample from the $n$-th conditional using\nthe marginal distribution $N^{-1} K_N(x, x)\\,\\omega(x)\\,dx$ as proposal and rejection constant $N/(N - (n-1))$.\nThis allows us to reduce the number of (costly) evaluations of the acceptance ratio. The marginal\ndistribution itself is sampled using the same proposal $\\omega_{\\mathrm{eq}}(x)\\,dx$ and rejection constant as BH. The\nrejection constant, of order $2^d$, is derived from Chow et al. (1994) and Gautschi (2009). We further\nreduced the number of OP evaluations by considering $N^{-1} K_N(x, x)\\,\\omega(x)\\,dx$ as a mixture, where\neach component in (6) involves only one OP. In the end, the expected total number of rejections\nis of order $2^d N \\log N$. 
Finally, we implemented a specific rejection-free method for the univariate\nJacobi ensemble; a special continuous projection DPP which can be sampled exactly in $O(N^2)$, by\ncomputing the eigenvalues of a random tridiagonal matrix (Killip & Nenciu, 2004, Theorem 2).\n\n\u00b3 dppy.readthedocs.io \u2074 github.com/rbardenet/dppmc\n\nAll these improvements resulted in dramatic speedups. For example, on a modern laptop, sampling a\n2D Jacobi ensemble with $N = 1000$ points, see Figure 1(a), takes less than a minute, compared to\nhours previously. For more details on the sampling procedure, we refer to Appendix A.4.\n\n[Figure 1 panels: (a) $\\propto \\omega_1$, $\\propto \\omega_2$, $\\omega_{\\mathrm{eq}}$; (b) $\\langle$time$\\rangle$ to get one sample; (c) $\\langle$#rejections$\\rangle$ to get one sample]\n\nFigure 1: (a) A sample from a 2D Jacobi ensemble with $N = 1000$ points. (b)-(c) $a_i, b_i = -1/2$,\nthe colors and numbers correspond to the dimension. For $d = 1$, the tridiagonal model (tri) of Killip\n& Nenciu offers tremendous time savings. (c) The total number of rejections grows as $2^d N \\log(N)$.\n\n4 Empirical investigation\n\nWe perform three main sets of experiments to investigate the properties of the BH (7) and EZ (14)\nestimators of the integral $\\int f(x)\\,\\mu(dx)$. We add the baseline vanilla Monte Carlo, where points are\ndrawn i.i.d. proportionally to $\\mu$. The two estimators are built from the multivariate Jacobi ensemble,\ncf. Section 2.3. First, we extend, for larger $N$, the experiments of Bardenet & Hardy (2019) illustrating\ntheir fast CLT (9) on a smooth function. Then, we illustrate Theorem 1 by considering polynomial\nfunctions that can be either fully or partially decomposed in the eigenbasis of the DPP kernel. Finally,\nwe compare the potential of both estimators on various non smooth functions.\n\n[Figure 2 panels: (a)-(d) $d = 1, 2, 3, 4$; (e)-(h) $d = 1, 2, 3, 4$]\n\nFigure 2: (a)-(d) cf. Section 4.1, the numbers in the legend are the slope and $R^2$. (e)-(h) cf. Section 4.2.\n\n4.1 The bump experiment\n\nBardenet & Hardy (2019, Section 3) illustrate the behavior of $\\widehat{I}^{\\mathrm{BH}}_N$ and its CLT (9) on a unimodal,\nsmooth bump function; see Appendix B.1. The expected variance decay is of order $1/N^{1+1/d}$. We\nreproduce their experiment in Figure 2 for larger $N$, and compare with the behavior of $\\widehat{I}^{\\mathrm{EZ}}_N$. In\nshort, $\\widehat{I}^{\\mathrm{EZ}}_N$ dramatically outperforms $\\widehat{I}^{\\mathrm{BH}}_N$ in $d \\leq 2$, with surprisingly fast empirical convergence rates.\nWhen $d \\geq 3$, performance decreases, and $\\widehat{I}^{\\mathrm{BH}}_N$ shows both faster and more regular variance decay.\nTo know whether we can hope for a CLT for $\\widehat{I}^{\\mathrm{EZ}}_N$, we performed Kolmogorov-Smirnov tests for\n$N = 300$, which yielded small p-values across dimensions, from 0.03 to 0.24. This is compared to\nthe same p-values for $\\widehat{I}^{\\mathrm{BH}}_N$, which range from 0.60 to 0.99. The results are presented in Appendix B.1.\nThe lack of normality of $\\widehat{I}^{\\mathrm{EZ}}_N$ is partly due to a few outliers. Where these outliers come from is left\nfor future work; ill-conditioning of the linear system (10) is an obvious candidate. Besides, contrary\nto $\\widehat{I}^{\\mathrm{BH}}_N$, the estimator $\\widehat{I}^{\\mathrm{EZ}}_N$ has no guarantee to preserve the sign of integrands having constant sign.\n\n4.2 Integrating sums of eigenfunctions\n\nIn the next series of experiments, we are mainly interested in testing the variance decay of $\\widehat{I}^{\\mathrm{EZ}}_N(f)$\nprescribed by Theorem 1. To that end, we consider functions of the form\n\n$$f(x) = \\sum_{b(k)=0}^{M-1} \\frac{1}{b(k) + 1}\\, \\phi_k(x), \\tag{17}$$\n\nwhose integral w.r.t. $\\mu$ is to be estimated based on realizations of the multivariate Jacobi ensemble with\nkernel $K_N(x, y) = \\sum_{b(k)=0}^{N-1} \\phi_k(x)\\phi_k(y)$, where $N \\neq M$ a priori. This means that the function $f$\ncan be either fully ($M \\leq N$) or partially ($M > N$) decomposed in the eigenbasis of the kernel. In\nboth cases, we let the number of points $N$ used to build the two estimators vary from 10 to 100 in\ndimensions $d = 1$ to 4. In the first setting, we set $M = 70$. 
Thus, $N$ eventually reaches the number\nof functions used to build $f$ in (17), after which $\\widehat{I}^{\\mathrm{EZ}}_N$ is an exact estimator, see Figure 2(e)-(h). The\nsecond setting has $M = N + 1$, so that the number of points $N$ is never enough for the variance in\n(11) to be zero. The results of both settings are to be found in Appendix B.2.\nIn the first case, for each dimension $d$, we indeed observe a drop in the variance of $\\widehat{I}^{\\mathrm{EZ}}_N$ once the\nnumber of points of the DPP hits the threshold $N = M$. This is in perfect agreement with Theorem 1:\nonce $f \\in \\mathcal{H}_M \\subseteq \\mathcal{H}_N$, the variance in (11) is zero. In the second setting, as $N$ increases the\ncontribution of the extra mode $\\phi_{b^{-1}(N)}$ in (17) decreases as $\\frac{1}{N}$. Hence, from Theorem 1 we expect a\nvariance decay of order $\\frac{1}{N^2}$, which we observe in practice.\n\n4.3 Further experiments\nIn Appendices B.3-B.6 we test the robustness of both BH and EZ estimators, when applied to functions\npresenting discontinuities or which do not belong to the span of the eigenfunctions of the kernel.\nAlthough the conditions of the CLT (9) associated to $\\widehat{I}^{\\mathrm{BH}}$ are violated, the corresponding variance\ndecay is smooth but not as fast. For $\\widehat{I}^{\\mathrm{EZ}}$, the performance deteriorates with the dimension. Indeed,\nthe cross terms arising from the Taylor expansion of the different functions introduce monomials,\nassociated to large coefficients, that do not belong to $\\mathcal{H}_N$. Sampling more points would reduce the\nvariance (11). But more importantly, for EZ to excel, this suggests adapting the kernel to the basis\nwhere the integrand is known to be sparse or to have fast-decaying coefficients. In regimes where BH\nand EZ do not shine, vanilla Monte Carlo becomes competitive for small values of $N$.\n\n5 Conclusion\n\nErmakov & Zolotukhin (EZ, 1960) proposed a non-obvious unbiased Monte Carlo estimator using\nprojection DPPs. 
It requires solving a linear system, which in turn involves evaluating both the $N$ eigenfunctions of the corresponding kernel and the integrand at the $N$ points of the DPP sample. This is yet another connection between DPPs and linear algebra. In fact, solving this linear system provides unbiased estimates of the Fourier-like coefficients of the integrand $f$ with respect to each of the $N$ eigenfunctions of the DPP kernel. Remarkably, these estimators have identical variance, and this variance measures the accuracy of the approximation of $f$ by its projection onto these eigenfunctions. With modern arguments, we have provided a much shorter proof of these properties than in the original work of Ermakov & Zolotukhin (1960). Beyond this, little is known about the EZ estimator. While coming with a less interpretable variance, the more direct estimator proposed by Bardenet & Hardy (BH, 2019) has an intrinsic connection with the classical Gauss quadrature and further enjoys stronger theoretical properties when used with a multivariate Jacobi ensemble.\nOur experiments highlight the key features of both estimators when the underlying DPP is a multivariate Jacobi ensemble, and further demonstrate the known properties of the BH estimator in yet unexplored regimes. Although EZ shows a surprisingly fast empirical convergence rate for $d \\leq 2$, its behavior is more erratic for $d \\geq 3$. Ill-conditioning of the linear system is a potential source of outliers in the distribution of the estimator. Regularization may help but would introduce a stability/bias trade-off. More generally, EZ seems worth investigating for integration or even interpolation tasks where the function is known to be decomposable in the eigenbasis of the kernel, i.e., in a setting similar to the one of Bach (2017). Finally, the new implementation of an exact sampler for the multivariate Jacobi ensemble unlocks more large-scale empirical investigations and calls for more theoretical work. 
The associated code$^1$ is available in the DPPy toolbox of Gautier et al. (2019).\n\nReferences\n\nBach, F. On the Equivalence between Kernel Quadrature Rules and Random Feature Expansions. Journal of Machine Learning Research, 2017. arXiv:1502.06800.\n\nBach, F., Lacoste-Julien, S., and Obozinski, G. On the Equivalence between Herding and Conditional Gradient Algorithms. In International Conference on Machine Learning (ICML), 2012. arXiv:1203.4523.\n\nBardenet, R. and Hardy, A. Monte Carlo with Determinantal Point Processes. Annals of Applied Probability, in press, 2019. arXiv:1605.00361.\n\nBriol, F.-X., Oates, C. J., Girolami, M., and Osborne, M. A. Frank-Wolfe Bayesian Quadrature: Probabilistic Integration with Theoretical Guarantees. In Advances in Neural Information Processing Systems (NeurIPS), 2015. arXiv:1506.02681.\n\nChen, Y., Welling, M., and Smola, A. Super-Samples from Kernel Herding. In Conference on Uncertainty in Artificial Intelligence (UAI), 2010. arXiv:1203.3472.\n\nChow, Y., Gatteschi, L., and Wong, R. A Bernstein-type inequality for the Jacobi polynomial. Proceedings of the American Mathematical Society, 1994.\n\nDavis, P. J. and Rabinowitz, P. Methods of numerical integration. Academic Press. 1984.\n\nDelyon, B. and Portier, F. Integral approximation by kernel smoothing. Bernoulli, 2016. arXiv:1409.0733.\n\nDick, J. and Pillichshammer, F. Digital nets and sequences: discrepancy and quasi-Monte Carlo integration. Cambridge University Press. 2010.\n\nErmakov, S. M. and Zolotukhin, V. G. Polynomial Approximations and the Monte-Carlo Method. Theory of Probability & Its Applications, 1960.\n\nEvans, M. and Swartz, T. Approximating integrals via Monte Carlo and deterministic methods. Oxford University Press. 2000.\n\nGautier, G., Polito, G., Bardenet, R., and Valko, M. DPPy: DPP Sampling with Python. 
Journal of Machine Learning Research - Machine Learning Open Source Software (JMLR-MLOSS), in press, 2019. arXiv:1809.07258.\n\nGautschi, W. How sharp is Bernstein\u2019s Inequality for Jacobi polynomials? Electronic Transactions on Numerical Analysis, 2009.\n\nHough, J. B., Krishnapur, M., Peres, Y., and Vir\u00e1g, B. Determinantal Processes and Independence. Probability Surveys, 2006. arXiv:math/0503110.\n\nHusz\u00e1r, F. and Duvenaud, D. Optimally-Weighted Herding is Bayesian Quadrature. In Conference on Uncertainty in Artificial Intelligence (UAI), 2012. arXiv:1204.1664.\n\nJohansson, K. Random matrices and determinantal processes. Les Houches Summer School Proceedings, 2006.\n\nKillip, R. and Nenciu, I. Matrix models for circular ensembles. International Mathematics Research Notices, 2004. arXiv:math/0410034.\n\nK\u00f6nig, W. Orthogonal polynomial ensembles in probability theory. Probability Surveys, 2004. arXiv:math/0403090.\n\nKulesza, A. and Taskar, B. Determinantal Point Processes for Machine Learning. Foundations and Trends in Machine Learning, 2012. arXiv:1207.6083.\n\nLavancier, F., M\u00f8ller, J., and Rubak, E. Determinantal point process models and statistical inference: Extended version. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 2012. arXiv:1205.4818.\n\nLiu, Q. and Lee, J. D. Black-Box Importance Sampling. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2017. arXiv:1610.05247.\n\nMacchi, O. The coincidence approach to stochastic point processes. Advances in Applied Probability, 1975.\n\nMallat, S. and Peyr\u00e9, G. A wavelet tour of signal processing: the sparse way. Elsevier/Academic Press. 2009.\n\nMazoyer, A., Coeurjolly, J.-F., and Amblard, P.-O. Projections of determinantal point processes. ArXiv e-prints, 2019. arXiv:1901.02099v3.\n\nOates, C. J., Girolami, M., and Chopin, N. 
Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2017. arXiv:1410.2392.\n\nO\u2019Hagan, A. Bayes\u2013Hermite quadrature. Journal of Statistical Planning and Inference, 1991.\n\nRasmussen, C. E. and Williams, C. K. I. Gaussian processes for machine learning. MIT Press. 2006.\n\nRobert, C. P. The Bayesian choice: from decision-theoretic foundations to computational implementation. Springer. 2007.\n\nRobert, C. P. and Casella, G. Monte Carlo statistical methods. Springer-Verlag New York. 2004.\n\nSimon, B. Szeg\u0151\u2019s theorem and its descendants: Spectral theory for L2 perturbations of orthogonal polynomials. M. B. Porter Lecture Series, Princeton Univ. Press, Princeton, NJ. 2011.\n\nSoshnikov, A. Determinantal random point fields. Russian Mathematical Surveys, 2000. arXiv:math/0002099.\n", "award": [], "sourceid": 4215, "authors": [{"given_name": "Guillaume", "family_name": "Gautier", "institution": "CNRS, INRIA, Univ. Lille"}, {"given_name": "R\u00e9mi", "family_name": "Bardenet", "institution": "University of Lille"}, {"given_name": "Michal", "family_name": "Valko", "institution": "DeepMind Paris and Inria Lille - Nord Europe"}]}