{"title": "Geometrically Coupled Monte Carlo Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 195, "page_last": 206, "abstract": "Monte Carlo sampling in high-dimensional, low-sample settings is important in many machine learning tasks. We improve current methods for sampling in Euclidean spaces by avoiding independence, and instead consider ways to couple samples. We show fundamental connections to optimal transport theory, leading to novel sampling algorithms, and providing new theoretical grounding for existing strategies. We compare our new strategies against prior methods for improving sample efficiency, including QMC, by studying discrepancy. We explore our findings empirically, and observe benefits of our sampling schemes for reinforcement learning and generative modelling.", "full_text": "Geometrically Coupled Monte Carlo Sampling\n\nMark Rowland(cid:3)\n\nUniversity of Cambridge\n\nmr504@cam.ac.uk\n\nKrzysztof Choromanski*\n\nGoogle Brain Robotics\nkchoro@google.com\n\nFran\u00e7ois Chalus\n\nUniversity of Cambridge\nchalusf3@gmail.com\n\nAldo Pacchiano\n\nUniversity of California, Berkeley\n\npacchiano@berkeley.edu\n\nTam\u00e1s Sarl\u00f3s\nGoogle Research\n\nstamas@google.com\n\nRichard E. Turner\n\nUniversity of Cambridge\n\nret26@cam.ac.uk\n\nAdrian Weller\n\nUniversity of Cambridge\n\nAlan Turing Institute\naw665@cam.ac.uk\n\nAbstract\n\nMonte Carlo sampling in high-dimensional, low-sample settings is important in\nmany machine learning tasks. We improve current methods for sampling in Eu-\nclidean spaces by avoiding independence, and instead consider ways to couple\nsamples. We show fundamental connections to optimal transport theory, leading\nto novel sampling algorithms, and providing new theoretical grounding for exist-\ning strategies. We compare our new strategies against prior methods for improving\nsample ef\ufb01ciency, including quasi-Monte Carlo, by studying discrepancy. 
We explore our findings empirically, and observe benefits of our sampling schemes for reinforcement learning and generative modelling.

1 Introduction and related work

Monte Carlo (MC) methods are popular in many areas of machine learning, including approximate Bayesian inference (Robert and Casella, 2005; Rezende et al., 2014; Kingma and Welling, 2014; Welling and Teh, 2011), reinforcement learning (RL) (Salimans et al., 2017; Choromanski et al., 2018c; Mania et al., 2018), and random feature approximations for kernel methods (Rahimi and Recht, 2007; Yu et al., 2016). Typically, Monte Carlo samples are drawn independently. In many applications, however, there may be an imbalance between the computational cost of drawing MC samples from the distribution of interest and the subsequent cost incurred by downstream computation with the samples. For example, when a sample represents the configuration of weights in a policy network for an RL problem, the cost of computing forward passes, backpropagating gradients through the network, and interacting with the environment is much greater than that of drawing the sample itself. Since a high proportion of total time is spent computing with each sample relative to the cost of generating the sample, it may be possible to improve efficiency by replacing the default of independent, identically distributed samples with samples having some non-trivial coupling. Such approaches have been studied in computational statistics for decades, often under the guise of variance reduction.
Related methods such as control variates, quasi-Monte Carlo (QMC) (Halton, 1960; Aistleitner and Dick, 2015; Dick et al., 2015; Brauchart and Dick, 2012; Sloan and Wozniakowski, 1998; Avron et al., 2016), herding (Chen et al., 2010; Huszar and Duvenaud, 2012) and antithetic sampling (Hammersley and Morton, 1956; Salimans et al., 2017) have also been explored. Methods used in recent machine learning applications include orthogonality constraints (Yu et al., 2016; Choromanski et al., 2018b,c, 2017, 2018a). In this paper, we investigate improvements to MC sampling through carefully designed joint distributions, with an emphasis on the low-sample, high-dimensional regime, which is often relevant for practical machine learning applications (Rezende et al., 2014; Kingma and Welling, 2014; Salimans et al., 2017). We call our approach Geometrically Coupled Monte Carlo (GCMC) since, as we will see, it is geometrically motivated. Importantly, we focus on Monte Carlo sampling, in contrast to (pseudo-)deterministic approaches such as QMC and herding, as unbiasedness of estimators is often an important property for stochastic approximation. Whilst approaches such as herding and QMC are known to have superior asymptotic performance to Monte Carlo methods in low dimensions, this may not hold in high-dimensional, low-sample regimes, where they do not provide any theoretical improvement guarantees. We summarize our main contributions below.

*Equal contribution

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
Throughout the paper, we save proofs of our results for the Appendix; where appropriate, we provide proof sketches to aid intuition.

• We frame the problem of finding an optimal coupling amongst a collection of samples as a multi-marginal transport (MMT) problem: this generalises the notion of optimal transport, which has seen many applications in machine learning (see for example Arjovsky et al., 2017). We show several settings where the MMT problem can be solved analytically. We recover some existing coupling strategies (based on orthogonal matrices), and derive novel strategies involving coupling the norms of pairs of samples.

• To connect to QMC, we show that sets of geometrically coupled Monte Carlo samples give rise to low-discrepancy sequences. To our knowledge, we present the first explanation of the success of structured orthogonal matrices for scalable RBF kernel approximation via discrepancy theory.

• We provide exponentially small upper bounds on failure probabilities for estimators of gradients of Gaussian smoothings of blackbox functions based on the gradient sensing mechanism, both for unstructured and orthogonal settings (Choromanski et al., 2018c). These methods can be used to learn good quality policies for reinforcement learning tasks.

• We empirically measure the discrepancy of sequences produced by our method and show that they enable us to learn good quality policies for quadruped robot navigation in low-sample, high-dimensional regimes, where standard QMC approaches based on Halton sequences and related constructions fail.

2 Optimal couplings, herding, and optimal transport

Consider the problem of computing the expectation $I_f = \mathbb{E}_{X\sim\eta}[f(X)]$, where $\eta \in \mathcal{P}(\mathbb{R}^d)$ is a multivariate probability distribution and $f:\mathbb{R}^d\to\mathbb{R}$ is some measurable function in $L^1(\eta)$. A standard Monte Carlo approach is to approximate $I_f$ by $\hat{I}^{\mathrm{iid}}_f = \frac{1}{m}\sum_{i=1}^m f(X_i)$, where the samples $X_1,\dots,X_m \sim \eta$ are taken independently. This estimator is clearly unbiased. The main question that we are interested in is what joint distributions (or couplings) over the ensemble of samples $(X_1,\dots,X_m)$ lead to estimators of the expectation above which are still unbiased, but have lower mean squared error (MSE) than the i.i.d. estimator $\hat{I}^{\mathrm{iid}}_f$, where the MSE is defined for a general estimator $\hat{I}_f$ by:

$$\mathrm{MSE}(\hat{I}_f) = \mathbb{E}\left[\big(\hat{I}_f - I_f\big)^2\right]. \qquad (1)$$

For sufficiently rich function classes $F \subseteq L^2(\eta)$, a coupling of the random variables $(X_1,\dots,X_m)$ that achieves optimal MSE simultaneously for all functions $f \in F$ need not exist. We illustrate this with examples in the Appendix Section 8.2. This motivates the approach below to define optimality of a coupling by taking into account average performance across a function class of interest.

2.1 K-optimal couplings

We begin by defining formally the notion of coupling.

Definition 2.1. Given a probability distribution $\eta \in \mathcal{P}(\mathbb{R}^d)$ and $m \in \mathbb{N}$, we denote by $\Lambda_m(\eta)$ the set of all joint distributions of $m$ random variables $(X_1,\dots,X_m)$, where each random variable $X_i$ has the marginal distribution $\eta$. More formally,

$$\Lambda_m(\eta) = \left\{\mu \in \mathcal{P}(\mathbb{R}^{d\times m}) \,\middle|\, (\pi_i)_\#\mu = \eta \text{ for } i=1,\dots,m\right\},$$

where $\pi_i : \mathbb{R}^{d\times m} \to \mathbb{R}^d$ denotes projection onto the $i$th set of $d$ coordinates, for $i=1,\dots,m$. Note that if $X_{1:m} \sim \mu \in \Lambda_m(\eta)$, then because of the restriction on the marginals of $X_{1:m}$, the estimator $m^{-1}\sum_{i=1}^m f(X_i)$ is unbiased for $\mathbb{E}_{X\sim\eta}[f(X)]$, for any $f \in L^1(\eta)$.

We now define the following notion of optimality of a coupling.
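Before formalising optimality, the gap between i.i.d. sampling and a simple coupling can be seen numerically. The following sketch (our own toy example, not from the paper) compares the empirical MSE of the plain Monte Carlo estimator against the classical antithetic coupling, which pairs each sample with its negation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, trials = 4, 8, 2000

def f(x):
    # Toy integrand with an odd (linear) part that antithetic pairs cancel.
    s = x.sum(axis=-1)
    return s + np.cos(s)

# I_f = E[s + cos(s)] with s ~ N(0, d): the linear part has mean 0,
# and E[cos(s)] = exp(-d/2) for a centred Gaussian with variance d.
I_f = np.exp(-d / 2)

def mse(sampler):
    errs = [(f(sampler()).mean() - I_f) ** 2 for _ in range(trials)]
    return float(np.mean(errs))

def iid():
    return rng.standard_normal((m, d))

def antithetic():
    X = rng.standard_normal((m // 2, d))
    return np.vstack([X, -X])  # couple samples in pairs (X, -X)

mse_iid, mse_anti = mse(iid), mse(antithetic)
```

Both samplers produce rows that are marginally $\mathcal{N}(0, I_d)$, so both estimators are unbiased; the coupled one removes the variance contributed by the odd part of $f$, and `mse_anti` comes out several times smaller than `mse_iid` on this integrand. The couplings studied below choose the joint law of $(X_1,\dots,X_m)$ to minimise exactly this kind of MSE, on average over a function class.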
Similar notions have appeared in the literature when samples are taken to be non-random, or when selecting importance distributions, sometimes referred to as kernel quadrature (Rasmussen and Ghahramani, 2003; Briol et al., 2017).

Definition 2.2 (K-optimal coupling). Given a kernel $K:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$, a $K$-optimal coupling is a solution to the optimisation problem

$$\arg\min_{\mu\in\Lambda_m(\eta)} \mathbb{E}_{f\sim\mathcal{GP}(0,K)}\left[\mathbb{E}_{X_{1:m}\sim\mu}\left[\left(\frac{1}{m}\sum_{i=1}^m f(X_i) - I_f\right)^2\right]\right]. \qquad (2)$$

That is, a $K$-optimal coupling is one that gives the best MSE on average when the function concerned is drawn from the Gaussian process $\mathcal{GP}(0,K)$. For background on Gaussian processes, see (Rasmussen and Williams, 2005).

Remark 2.3. There are measure-theoretic subtleties in making sure that the objective in Expression (2) is well-defined. For readability, we treat these issues in the Appendix (Section 7), but remark here that it is sufficient to restrict to kernels $K$ for which sample paths of the corresponding Gaussian process are continuous, which we do for the remainder of the paper.

Our ultimate aim is to characterise $K$-optimal couplings under a variety of conditions algorithmically, to enable practical implementation. We discuss the identification of $K$-optimal couplings, along with precise statements of algorithms, in Section 2.3. First we develop the theoretical properties of $K$-optimal couplings, starting with the intimate connection between $K$-optimal couplings and multi-marginal transport theory (Pass, 2014). This theory is a generalisation of optimal transport theory to the case where there are more than two marginal distributions.

Theorem 2.4. The optimisation problem defining $K$-optimality in Equation (2) is equivalent to the following multi-marginal transport problem:

$$\arg\min_{\mu\in\Lambda_m(\eta)} \mathbb{E}_{X_{1:m}\sim\mu}\left[\sum_{i\neq j} K(X_i, X_j)\right].$$

Remark 2.5. The optimal transport problem of Theorem 2.4 has an interesting difference from most optimal transport problems arising in machine learning: in general, its cost function is repulsive, so it seeks a transport plan where transport paths are typically long, as opposed to the short transport paths sought when the cost is given by e.g. a metric. Intuitively, the optimal transport cost rewards space-filling couplings, for which it is uncommon to observe collections of samples close together.

2.2 Minimax couplings and herding

Definition 2.2 (K-optimality) considers best average-case behaviour. We could instead use a "minimax" definition of optimality, by examining best worst-case behaviour.

Definition 2.6 (Minimax coupling). Given a function class $F \subseteq L^2(\eta)$, we say that $\mu \in \Lambda_m(\eta)$ is an $F$-minimax coupling if it is a solution to the following optimisation problem:

$$\arg\min_{\mu\in\Lambda_m(\eta)} \sup_{f\in F} \mathbb{E}_{X_{1:m}\sim\mu}\left[\left(\frac{1}{m}\sum_{i=1}^m f(X_i) - I_f\right)^2\right]. \qquad (3)$$

In general, the minimax coupling objective appearing in Equation (3) is intractable. However, there is an elegant connection to concepts from the kernel herding literature that may be established by taking the function class $F$ to be the unit ball in some reproducing kernel Hilbert space (RKHS).

Proposition 2.7. Suppose that the function class $F$ is the unit ball in some RKHS given by a kernel $K:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$.
Then the component

$$\sup_{f\in F} \mathbb{E}_{X_{1:m}\sim\mu}\left[\left(\frac{1}{m}\sum_{i=1}^m f(X_i) - I_f\right)^2\right]$$

of the minimax coupling objective in Equation (3) may be upper-bounded by the following objective:

$$\mathbb{E}_{X_{1:m}\sim\mu}\left[\left\|\theta_K\left(\frac{1}{m}\sum_{i=1}^m \delta_{X_i}\right) - \theta_K(\eta)\right\|^2_{\mathcal{H}_K}\right], \qquad (4)$$

where $\theta_K : \mathcal{P}(\mathbb{R}^d)\to\mathcal{H}_K$ is the kernel mean embedding into the RKHS $\mathcal{H}_K$ associated with $K$.

We note the intimate connection of the objective in Equation (4) with maximum mean discrepancy (MMD) (Gretton et al., 2012) and herding (Chen et al., 2010; Huszar and Duvenaud, 2012). First, the integrand appearing in Equation (4) is exactly the squared MMD between $m^{-1}\sum_{i=1}^m \delta_{X_i}$ and $\eta$ with respect to the kernel $K$. Second, if we instead take $m^{-1}\sum_{i=1}^m \delta_{X_i}$ to be a non-random measure of the form $m^{-1}\sum_{i=1}^m \delta_{x_i}$, viewing Expression (4) as a function of the delta locations $x_1,\dots,x_m$ results in exactly the herding optimisation problem. A connection between variance-reduced sampling and herding has also been noted in the context of random permutations (Lomelí et al., 2018). As well as these similarities, there are important differences between herding and the notion described here. Because all samples are regarded as random variables which are constrained to be marginally distributed according to $\eta$, a coupling maintains the usual unbiasedness guarantees of finite-sample Monte Carlo estimators.
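The repulsive transport cost of Theorem 2.4 can also be probed numerically. The sketch below (our own illustration, with a Gaussian RBF kernel as one admissible choice of $K$) estimates $\mathbb{E}[\sum_{i\neq j}K(X_i,X_j)]$ for i.i.d. Gaussian samples and for orthogonal samples with chi-distributed norms:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, trials = 8, 4, 3000

def K(x, y):
    # Gaussian RBF kernel (one admissible choice of K).
    return np.exp(-np.sum((x - y) ** 2) / 2.0)

def mmt_cost(sampler):
    total = 0.0
    for _ in range(trials):
        X = sampler()
        total += sum(K(X[i], X[j]) for i in range(m) for j in range(m) if i != j)
    return total / trials

def iid():
    return rng.standard_normal((m, d))

def orthogonal():
    # m <= d exactly orthogonal directions via QR, with independent chi_d
    # lengths, so each row is still marginally N(0, I_d).
    Q, R = np.linalg.qr(rng.standard_normal((d, m)))
    Q = Q * np.sign(np.diag(R))
    r = np.linalg.norm(rng.standard_normal((m, d)), axis=1)
    return (Q * r).T

cost_iid, cost_ort = mmt_cost(iid), mmt_cost(orthogonal)
```

For orthogonal directions $\|X_i - X_j\|^2 = \|X_i\|^2 + \|X_j\|^2$ exactly, which keeps pairs of samples far apart and drives the repulsive transport cost below its i.i.d. value, in line with the space-filling intuition of Remark 2.5.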
In contrast, herding is theoretically supported by fast asymptotic rates of convergence for a wide variety of estimators, but because samples are chosen in a deterministic way, estimator properties based on finite numbers of herding samples are harder to describe statistically. Often there are good reasons to eschew unbiasedness of an estimator in favour of fast convergence rates; however, unbiasedness of gradient estimators is crucial for optimisation algorithms to perform correctly, as is well-established in the stochastic approximation literature. Bellemare et al. (2017) provide a discussion of this phenomenon in the context of generative modelling.

Interestingly, the following result shows that solutions of Problem (4) coincide exactly with $K$-optimal couplings of Definition 2.2.

Theorem 2.8. Given a probability distribution $\eta\in\mathcal{P}(\mathbb{R}^d)$ and a kernel $K:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$, a coupling $\mu\in\Lambda_m(\eta)$ is $K$-optimal iff it solves the optimisation problem in Expression (4).

Connections similar to Theorem 2.8 have previously been established in the study of identifying deterministic quadrature points (Paskov, 1993); we also highlight (Kanagawa et al., 2018) as a recent review of such connections. In contrast, here we take random quadrature points with fixed marginal distributions.

2.3 Solving for K-optimal couplings

In this section, we study the objective defining $K$-optimal couplings, as given in Definition 2.2. The problem is intractable to solve analytically in general, so we present several solutions in settings with additional restrictions, either on the number of samples $m$ in the problem, or on the types of couplings considered. The theoretical statements are given in Theorems 2.9 and 2.10, with the corresponding practical algorithms given as Algorithms 1 and 2. We emphasise that solving Problem (2) in general remains an interesting direction for future work.

Theorem 2.9.
Let $\eta\in\mathcal{P}(\mathbb{R}^d)$ be isotropic, and let $K:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$ be a stationary isotropic kernel, such that $K(x,y)$ is a strictly decreasing, strictly convex function of $\|x-y\|$. Then the $K$-optimal coupling of 2 samples $(X_1, X_2)$ from $\eta$ is given by first drawing $X_1\sim\eta$, and then setting the direction of $X_2$ to be opposite to that of $X_1$, and setting the norm $\|X_2\|$ so that

$$F_R(\|X_2\|) + F_R(\|X_1\|) = 1, \qquad (5)$$

where $F_R$ is the CDF associated with the norm of a random vector distributed according to $\eta$.

The proof of this theorem can be found in the Appendix Section 9, and relies on first showing that any optimal coupling must be antithetic, and second that an antithetic coupling must satisfy Equation (5) in order for the marginals to be equal to $\eta$. In the Appendix Section 8 we illustrate with a counterexample that the convexity assumption is required. Indeed, if most of the mass of $\eta$ is near the origin and the RBF kernel is larger around 0, then the classical antithetic coupling $X_2 = -X_1$ performs better.

Algorithm 1 (Antithetic inverse lengths coupling of Theorem 2.9)
for i = 1, ..., m do:
  Draw $X_i \sim \eta$.
  Set $X_{m+i} = -X_i \cdot F_R^{-1}(1 - F_R(\|X_i\|)) / \|X_i\|$.
end for
Output: $X_1,\dots,X_{2m}$ marginally $\eta$-distributed, with low MSE.

Algorithm 2 (Orthogonal coupling of Theorem 2.10)
for i = 1, ..., m do:
  Draw $X_i \sim \eta$ conditionally orthogonal to $X_1,\dots,X_{i-1}$.
  Set $X_{m+i} = -X_i$.
end for
Output: $X_1,\dots,X_{2m}$ marginally $\eta$-distributed, with low MSE.

Further extending the above situation, we restrict our attention to antithetic couplings and establish that the optimal way to couple $m$ antithetic pairs $(X_i, X_{m+i}) = (X_i, -X_i)$ is to draw sequentially orthogonal samples, if the dimension of the space allows it and the marginal $\eta$ is spherically symmetric. Introduce the following notation for the set of antithetic couplings with independent lengths:

$$\Lambda^{\mathrm{anti}}_{2m}(\eta) = \left\{\mathrm{Law}(X_1,\dots,X_{2m}) \in \Lambda_{2m}(\eta) \,\middle|\, \|X_i\|,\ 1\le i\le m \text{ are independent};\ X_i = -X_{m+i}\right\}.$$

Theorem 2.10. Let $\eta\in\mathcal{P}(\mathbb{R}^d)$ be isotropic and let $K:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$ be a stationary isotropic kernel, such that $K(x,y) = \Phi(\|x-y\|^2)$, where $\Phi$ is a decreasing, convex function. If $\mathrm{Law}(X_1,\dots,X_{2m})$, with $m\le d$, is a solution to the constrained optimal coupling problem

$$\arg\min_{\mu\in\Lambda^{\mathrm{anti}}_{2m}(\eta)} \mathbb{E}_{X_{1:2m}\sim\mu}\left[\sum_{i,j=1}^{2m} \Phi\big(\|X_i - X_j\|^2\big)\right],$$

then it satisfies $\langle X_i, X_j\rangle = 0$ almost surely for all $1\le i<j\le m$.

The proof of this theorem can be found in the Appendix Section 9 and relies on reformulating the objective function and showing that the exact minimum is attained thanks to convexity. This result illustrates the advantage that orthogonal samples can have over i.i.d. samples; see (Yu et al., 2016) for earlier such settings.
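For a Gaussian marginal $\eta = \mathcal{N}(0, I_d)$, the norm CDF $F_R$ of Algorithm 1 is the chi CDF with $d$ degrees of freedom, so both algorithms can be sketched directly. This is our own minimal sketch (using `scipy.stats.chi` for $F_R$ and its inverse, and a QR decomposition for the conditionally orthogonal draws), not the authors' implementation:

```python
import numpy as np
from scipy.stats import chi

rng = np.random.default_rng(2)
d = 3

def algorithm1(m):
    """Antithetic inverse-lengths coupling: X_{m+i} points opposite X_i,
    with its norm chosen so that F_R(||X_{m+i}||) = 1 - F_R(||X_i||)."""
    X = rng.standard_normal((m, d))
    r = np.linalg.norm(X, axis=1, keepdims=True)
    r_anti = chi.ppf(1.0 - chi.cdf(r, df=d), df=d)
    return np.vstack([X, -X / r * r_anti])

def algorithm2(m):
    """Orthogonal coupling: exactly orthogonal directions (via QR), chi_d
    norms so rows stay marginally N(0, I_d), plus antithetic twins."""
    assert m <= d
    Q, R = np.linalg.qr(rng.standard_normal((d, m)))
    Q = Q * np.sign(np.diag(R))
    r = np.linalg.norm(rng.standard_normal((m, d)), axis=1)
    X = (Q * r).T
    return np.vstack([X, -X])
```

In both cases every output row remains marginally $\mathcal{N}(0, I_d)$, so estimators built on these ensembles stay unbiased; only the joint law changes.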
Details on how to efficiently sample orthogonal samples can be found in (Stewart, 1980); exact simulation of $d$ orthogonal samples is possible in $O(d^3)$ time, whilst empirically good-quality samples can be obtained from approximate algorithms in $O(d^2\log d)$ time. We emphasise that we focus on applications where these increases in sampling costs are insignificant relative to the downstream costs of computing with the samples (such as simulating rollouts in RL environments, as in Section 5.1). However, we note that an interesting direction for future work would be to incorporate a notion of computational complexity into the $K$-optimality objective, to trade off statistical efficiency against sampling costs.

3 Low discrepancy of geometrically coupled samples

Having described our notions of optimal couplings in the previous section and obtained several sampling schemes, we now provide an interesting connection between our geometrically coupled samples and the low-discrepancy sequences studied in the QMC literature. Our main interest is in the local discrepancy function $\mathrm{disr}_S : \mathbb{R}^d \to \mathbb{R}$, parametrised by a given set of samples $S = \{X_1,\dots,X_{|S|}\}$ and defined as follows:

$$\mathrm{disr}_S(u) = \mathrm{Vol}(J_u) - \frac{|\{i : X_i \in J_u\}|}{|S|},$$

where $J_u = [0,u_1)\times\dots\times[0,u_d)$ and $\mathrm{Vol}(J_u) = \prod_{j=1}^d u_j$. Now define the star discrepancy function $D^*(S)$ as $D^*(S) = \sup_{u\in[0,1]^d} |\mathrm{disr}_S(u)|$. This function measures the discrepancy between the empirical sample $S$ and the uniform distribution on the hypercube $[0,1]^d$.

Consider an expression $I_f = \mathbb{E}_{X\sim\lambda}[f(X)]$, where $\lambda\in\mathcal{P}(\mathbb{R}^1)$, and a set of samples $S = \{X_1,\dots,X_{|S|}\}$ that is used in a given (Q)MC estimator to approximate $I_f$.
The star discrepancy function $D^*_\lambda$ with respect to a distribution $\lambda$ is defined on $S$ as $D^*_\lambda(S) \stackrel{\mathrm{def}}{=} D^*(F_\lambda(S)) = \sup_{u\in[0,1]} |\mathrm{disr}_{F_\lambda(S)}(u)|$, where $F_\lambda(S) = \{F_\lambda(X_i)\}_{i=1,\dots,|S|}$ and $F_\lambda$ stands for the CDF of $\lambda$. In other words, to measure the discrepancy between an arbitrary distribution $\lambda\in\mathcal{P}(\mathbb{R}^1)$ and a set of samples $S$, the set of samples is transformed to the interval $[0,1]$ via the CDF $F_\lambda$, and the discrepancy between the uniform distribution on $[0,1]$ and the transformed sequence $F_\lambda(S)$ is calculated.

We will focus here on distributions $\lambda\in\mathcal{P}(\mathbb{R}^1)$, which we call regular distributions, corresponding to random variables $X$ defined as $X = g^\top z$, where $z\in\mathbb{R}^d$ is a deterministic vector and $g\in\mathbb{R}^d$ is taken from some isotropic distribution $\tau$ (e.g. a multivariate Gaussian distribution). Regular distributions play an important role in machine learning. It is easy to show that the random feature map approximation of radial basis function (RBF) kernels such as Gaussian kernels can be rewritten as $I_f = \mathbb{E}_{X\sim\lambda}[f(X)]$, where $f(x) \stackrel{\mathrm{def}}{=} \cos(x)$ and $\lambda$ is a regular distribution (Rahimi and Recht, 2007). To sample points from $\lambda$, we will use the standard set $S_{\mathrm{iid}}$ of independent samples, as well as the set of orthogonal samples $S_{\mathrm{ort}}$, where the marginal distribution of each $g_i$ is unchanged, but different $g_i$ are conditioned to be exactly orthogonal (see Choromanski et al., 2018b, for explicit constructions). Our main result of this section shows that the local discrepancy $\mathrm{disr}_{F_\lambda(S)}(u)$ for a fixed $u\in[0,1]$ is better concentrated around 0 for regular distributions $\lambda$ if orthogonal sets of samples $S$ are used instead of independent samples. Indeed, in both cases one can obtain exponentially small upper bounds on failure probabilities, but these are sharper when orthogonal samples are used.

Theorem 3.1 (Local discrepancy and regular distributions). Denote by $S_{\mathrm{iid}}$ a set of independent samples, each taken from a regular distribution $\lambda$, and by $S_{\mathrm{ort}}$ the set of orthogonal samples for that distribution. Let $s = |S_{\mathrm{iid}}| = |S_{\mathrm{ort}}|$. Then for any fixed $u\in[0,1]$ and $a\in\mathbb{R}_+$ the following holds: $\mathbb{P}[|\mathrm{disr}_{F_\lambda(S_{\mathrm{iid}})}(u)| > a] \le 2e^{-\frac{sa^2}{8}} \stackrel{\mathrm{def}}{=} p_{\mathrm{iid}}(a)$, and for some $p_{\mathrm{ort}}$ satisfying $p_{\mathrm{ort}} < p_{\mathrm{iid}}$ pointwise it holds that $\mathbb{P}[|\mathrm{disr}_{F_\lambda(S_{\mathrm{ort}})}(u)| > a] \le p_{\mathrm{ort}}(a)$. Also, $\mathrm{Var}(\mathrm{disr}_{F_\lambda(S_{\mathrm{ort}})}(u)) < \mathrm{Var}(\mathrm{disr}_{F_\lambda(S_{\mathrm{iid}})}(u))$.

Sharper concentration results regarding local discrepancies translate to sharper concentration results for the star discrepancy function $D^*_\lambda$ via an $\epsilon$-net argument, and thus ultimately to sharper results regarding the approximation error of MC estimators using regular distributions, via the celebrated Koksma-Hlawka inequality ((Avron et al., 2016); see Theorem 10.4 in the Appendix). We conclude that orthogonal samples (special instantiations of the GCMC mechanism) lead to strictly better guarantees regarding the approximation error of $I_f$ for functions $f$ with bounded variation and regular distributions $\lambda$ than standard MC mechanisms. This is the case in particular for random feature map based approximators of RBF kernels. The advantages of orthogonal samples in this setting were partially understood before for certain classes of RBF kernels (Choromanski et al., 2018b; Yu et al., 2016), but to the best of our knowledge, general non-asymptotic results and the connection with discrepancy theory were not known.

In Figure 1 we show a kernel density estimate of the distributions of the $D^*$ discrepancies (computed with respect to $\mathcal{N}(0,1)$) of 50,000 sample sequences $\big(g_i^\top z/\|z\|\big)_{i=1,\dots,40}$ for a range of coupling algorithms used to generate the Gaussian samples $g_i$.
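The one-dimensional star discrepancy underlying this comparison can be computed exactly from the sorted transformed sample via the classical formula $D^*(S)=\max_i \max\big(i/n - u_{(i)},\, u_{(i)} - (i-1)/n\big)$. A small sketch of our own, using the standard normal CDF in the role of $F_\lambda$:

```python
import math
import numpy as np

def star_discrepancy_1d(u):
    """Exact D*(S) for points u_1, ..., u_n in [0, 1]."""
    u = np.sort(np.asarray(u, dtype=float))
    n = len(u)
    i = np.arange(1, n + 1)
    return float(max((i / n - u).max(), (u - (i - 1) / n).max()))

def normal_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

rng = np.random.default_rng(3)
g = rng.standard_normal(40)  # i.i.d. Gaussian sequence
disc_iid = star_discrepancy_1d([normal_cdf(t) for t in g])
a = rng.standard_normal(20)  # antithetic sequence: pairs (a, -a)
disc_anti = star_discrepancy_1d([normal_cdf(t) for t in np.concatenate([a, -a])])
```

Averaged over many draws, the antithetic sequence places its transformed points symmetrically around $1/2$ and attains lower discrepancy on average, matching the comparison reported in Figure 1.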
We see that using antithetic samples with coupled lengths as in Algorithm 1 leads to a sequence with lower discrepancy on average. We also observe that coupling the samples to be orthogonal reduces the discrepancy. This confirms the above results. Finally, the figure shows that an algorithm designed to have low discrepancy (RQMC) still reaches a lower discrepancy than a classical sampling method, but this difference can be mitigated by using antithetic samples.

Figure 1: Histograms of the $D^*$ discrepancy for different sampling methods: samples $g_i$ have i.i.d., orthogonal or RQMC directions, with uncoupled lengths or lengths coupled according to Algorithms 1 or 2.

4 Geometric coupling for estimating gradients of function smoothings

Here we provide results on the concentration of zeroth-order gradient estimators for reinforcement learning applications, helping to explain their efficacy. This area is one of the main applications of the GCMC methods introduced in Section 2, and we present experiments for these applications in Section 5.1. To our knowledge, we provide the first result showing exponential concentration for the Evolution Strategies (ES) gradient estimator (Salimans et al., 2017) in this setting. We also provide exponential concentration bounds for orthogonal gradient estimators.

Recall that given a function $F:\Theta\to\mathbb{R}$ to be minimised, the vanilla ES gradient estimator is defined as:

$$\hat{\nabla}^V_N F_\sigma(\theta) = \frac{1}{N\sigma}\sum_{i=1}^N F(\theta + \sigma\epsilon_i)\,\epsilon_i, \quad \text{where } \epsilon_i \sim \mathcal{N}(0, I) \text{ are all i.i.d.} \qquad (6)$$

In what follows we assume that $F$ is uniformly bounded over its domain by $\mathcal{F}$.
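Expression (6) is straightforward to realise. The sketch below (our own minimal version, with a toy quadratic $F$, not the paper's RL setup) also checks it against the known gradient of the Gaussian smoothing:

```python
import numpy as np

rng = np.random.default_rng(4)

def es_grad(F, theta, sigma, N):
    """Vanilla ES estimator: (1/(N sigma)) sum_i F(theta + sigma eps_i) eps_i."""
    eps = rng.standard_normal((N, theta.shape[0]))
    Fv = np.array([F(theta + sigma * e) for e in eps])
    return (Fv[:, None] * eps).mean(axis=0) / sigma

# For F(x) = -||x||^2 the smoothing is F_sigma(theta) = -||theta||^2 - sigma^2 d,
# so grad F_sigma(theta) = -2 theta exactly.
F = lambda x: -np.dot(x, x)
theta = np.array([1.0, -1.0])
g = es_grad(F, theta, sigma=0.5, N=20000)
```

With N = 20000 the estimate lands close to $(-2, 2)$. The estimator is unbiased for $\nabla F_\sigma(\theta)$, but its variance grows as $\sigma\to 0$, which is exactly what the concentration results of this section quantify.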
In the case that $F$ is a sum of discounted rewards, an upper bound of $R$ for the reward function yields an upper bound of $\frac{R}{1-\gamma}$ for $F$, where $\gamma$ is the discount factor. Whenever $F$ is bounded in absolute value, the random vector $\hat{\nabla}^V_N F_\sigma(\theta)$ is sub-Gaussian.

Theorem 4.1. If $F$ is a bounded function such that $|F| \le R_1$, then the vanilla ES estimator is a sub-Gaussian vector with parameter $\frac{2R_1\sqrt{8c^2+1}}{\sqrt{N}\sigma}$, with $c = 24e$, and therefore for any $t\ge 0$:

$$\mathbb{P}\left[\max_{j=1,\dots,d}\left|\big(\hat{\nabla}^V_N F_\sigma(\theta)\big)_j - \mathbb{E}\big[\big(\hat{\nabla}^V_N F_\sigma(\theta)\big)_j\big]\right| \ge t\right] \le 2d\, e^{-\frac{t^2 N\sigma^2}{2R_1^2(8c^2+1)}},$$

for a universal constant $c$.

For the case of pairs of antithetically coupled gradient estimators, one can obtain a similar bound with comparable performance using this technique.

4.1 Bounds for orthogonal estimators

We show that a general class of orthogonal gradient estimators presents similar exponential concentration properties to the vanilla ES estimator. Proving these bounds is substantially more challenging because of the correlation structure between samples. To our knowledge, these are the first results showing exponential concentration for structured gradient estimators, yielding insight as to why they perform well in practice. We provide concentration bounds for gradient estimators of the form:

$$\hat{\nabla}^{\mathrm{Ort}}_d F(\theta) = \frac{1}{d\sigma}\sum_{i=1}^d \nu_i b_i F(\theta + \sigma\nu_i b_i),$$

where the random vectors $\nu_i\in\mathbb{R}^d$ are sampled uniformly from the unit sphere using a sequentially orthogonal process, and the $b_i$ are zero-mean signed lengths, sampled from sub-Gaussian distributions each with sub-Gaussian parameter $\beta_i$, independent from each other and from all other sources of randomness.
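One draw of the orthogonal estimator $\hat{\nabla}^{\mathrm{Ort}}_d F(\theta)$ can be sketched under these assumptions. In our own illustrative version below, the exactly orthogonal unit directions come from a QR decomposition and the signed lengths are chi-distributed with random signs (so that $\nu_i b_i$ is marginally $\mathcal{N}(0, I_d)$):

```python
import numpy as np

rng = np.random.default_rng(5)

def es_grad_ort(F, theta, sigma):
    """One orthogonal draw: (1/(d sigma)) sum_i nu_i b_i F(theta + sigma nu_i b_i)."""
    d = theta.shape[0]
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    Q = Q * np.sign(np.diag(R))          # columns: orthonormal directions nu_i
    b = np.linalg.norm(rng.standard_normal((d, d)), axis=1)  # chi_d magnitudes
    b *= rng.choice([-1.0, 1.0], size=d)                     # zero-mean signed lengths
    g = np.zeros(d)
    for i in range(d):
        g += b[i] * Q[:, i] * F(theta + sigma * b[i] * Q[:, i])
    return g / (d * sigma)

F = lambda x: -np.dot(x, x)
theta = np.array([1.0, -1.0])
est = np.mean([es_grad_ort(F, theta, sigma=0.5) for _ in range(20000)], axis=0)
```

Since each $\nu_i b_i$ matches a vanilla ES perturbation in distribution, the average again approaches $\nabla F_\sigma(\theta) = -2\theta$ on this toy quadratic; the difference is that the $d$ directions within each draw are coupled to be exactly orthogonal.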
Let $c := \sqrt{2\,\big((24e)^2+1\big)}$. Whenever the function $F$ is bounded, the random vector $\hat{\nabla}^{\mathrm{Ort}}_d F(\theta)$ is sub-Gaussian.

Theorem 4.2. Let $B = \max_i \mathbb{E}[|b_i|]$ and $\beta = \max_i \beta_i$, and suppose $|F|\le R$. Then the orthogonal gradient estimator $\hat{\nabla}^{\mathrm{Ort}}_d F(\theta)$ is sub-Gaussian with parameter $\sqrt{\frac{\beta^2 c^2 R^2}{\sigma^2 d^2} + \frac{R^2 B^2}{4d\sigma^2}}$.

Assuming $N = Td$ and the availability of $T$ i.i.d. orthogonal estimators (indexed by $j$), define:

$$\hat{\nabla}^{\mathrm{Ort}}_N F(\theta) = \frac{1}{T}\sum_{j=1}^T \hat{\nabla}^{\mathrm{Ort},j}_d F(\theta).$$

Theorem 4.3. The gradient estimator $\hat{\nabla}^{\mathrm{Ort}}_N F(\theta)$ is sub-Gaussian with parameter

$$\frac{1}{\sqrt{T}}\sqrt{\frac{\beta^2 c^2 R^2}{\sigma^2 d^2} + \frac{R^2 B^2}{4\sigma^2 d}} = \frac{1}{\sqrt{N}}\sqrt{\frac{\beta^2 c^2 R^2}{d\sigma^2} + \frac{R^2 B^2}{4\sigma^2}},$$

and therefore:

$$\mathbb{P}\left[\max_{j=1,\dots,d}\left|\big(\hat{\nabla}^{\mathrm{Ort}}_N F(\theta)\big)_j - \mathbb{E}\big[\big(\hat{\nabla}^{\mathrm{Ort}}_N F(\theta)\big)_j\big]\right| \ge t\right] \le 2d\,\exp\left(-\frac{t^2 N\sigma^2}{\frac{\beta^2 c^2 R^2}{d} + \frac{R^2 B^2}{4}}\right).$$

5 Experiments

5.1 Learning efficient navigation policies with ES strategies

We consider the task of closed-loop policy optimization to train stable walking behaviors for quadruped locomotion of the Minitaur platform in the Bullet simulator (Coumans and Bai, 2016–2018). We train neural network policies with $d \ge 96$ parameters and optimize the blackbox function $F$ that takes as input the parameters of the neural network and outputs the total reward, by applying MC estimators of gradients of Gaussian smoothings of $F$, as described in Expression (6). The main aim of the experiments is to compare policies learnt by using i.i.d. samples, as in Expression (6), against estimators using GCMC methods.
We test four different control variate terms that lead to four different variants of the MC algorithm: vanilla (no control variate), forward finite-difference (see Choromanski et al., 2018c, for details), antithetic, and antithetic-coupled (see below). For each of these four variants we use different sampling strategies for calculating the MC estimator: MCGaussian and Halton (baselines), MCGaussianOrthogonal, MCGaussianOrthogonalFixed, and MCRandomHadamard. These correspond to: independent Gaussian samples (Salimans et al., 2017); samples constructed from randomized Halton sequences, used on a regular basis in QMC methods; Gaussian orthogonal samples (introduced first in Choromanski et al. (2018c), but not tested for m < d or in the locomotion task setting); Gaussian orthogonal samples with renormalized lengths (each length equals √d); and finally, rows of random Hadamard matrices (which approximate Gaussian orthogonal samples but are easier to compute; see Choromanski et al., 2018c). For the antithetic variant using Gaussian orthogonal samples, we also test the variant which couples the lengths of antithetic pairs of samples as in Algorithm 1; we refer to this as antithetic-coupled. We tested different numbers of samples m, with the emphasis on MC estimators satisfying m ≪ d. We chose m = 8, 16, 32, 48, 56, 64, 96. Full details of the sampling mechanisms described above are given in the Appendix.

Figure 2 shows a comparison of different MC methods using the antithetic variant for m = 8, 32, 48 samples given to the MC estimator per iteration of the optimization routine (with the exception of the Halton approach, where we used m = 96 samples to demonstrate that even with this larger number of samples, standard QMC methods fail).
Walkable policies are characterized by total reward R > 10.0. We notice that the structured approaches outperform the unstructured one, and that the QMC method based on Halton sequences did not lead to walkable policies. Since this is also the case for the other settings we consider, we exclude it from the subsequent plots.

Figure 2: Training curves for different MC methods. Subfigures: (a) m = 8; (b) m = 32; (c) m = 48, 600 iterations; (d) m = 48, 100 iterations. The labels iid, ort, coupled, fixed, and halton-96 correspond to: MCGaussian, MCGaussianOrthogonal, antithetic-coupled, MCGaussianOrthogonalFixed, and the Halton-based QMC method. Subfigure (d) is a zoomed version of Subfigure (c) after just 100 iterations and with the Halton approach excluded.

For m = 32 we excluded the comparison with MCGaussian, since it performed substantially worse than the other methods, and with MCGaussianOrthogonalFixed, since it was very similar to MCGaussianOrthogonal (for clarity). Again for clarity, for m = 8 we plot the max-reward curves, where the maximal reward over already constructed policies is plotted instead of the current one (thus these curves are monotonic). In Subfigure (a) the curves stabilize after about 87 iterations (for the MCGaussianOrthogonal strategy the curve ultimately exceeds reward 10.0, but only after > 500 iterations).

We conclude that for m = 8 the coupling mechanism is the only one that leads to walkable policies, and for m = 32 it leads to the best policy among all considered structured mechanisms. More experimental results are given in the Appendix. We also attach videos showing how policies learned by applying certain structured mechanisms work in practice (details in the Appendix).
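For concreteness, one common way to build the random Hadamard rows used by MCRandomHadamard is the HD construction: a Hadamard matrix with a random Rademacher diagonal. The sketch below assumes this single-block construction and a power-of-two dimension; the structured mechanisms of Choromanski et al. (2018c) may chain several such blocks, and the function names here are our own.

```python
import numpy as np

def _sylvester_hadamard(d):
    """d x d {+1,-1} Hadamard matrix for d a power of two (Sylvester)."""
    H = np.ones((1, 1))
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H

def random_hadamard_rows(m, d, rng=None):
    """m rows of H @ D: exactly orthogonal, entries +/-1, norm sqrt(d) each.

    H is a Hadamard matrix and D a diagonal of independent Rademacher
    signs, so (H @ D)(H @ D)^T = d * I. The rows approximate Gaussian
    orthogonal samples while being much cheaper to generate and apply.
    """
    assert d > 0 and d & (d - 1) == 0, "Sylvester construction needs d = 2^k"
    assert m <= d
    rng = rng or np.random.default_rng()
    signs = rng.choice([-1.0, 1.0], size=d)
    return _sylvester_hadamard(d)[:m] * signs[None, :]
```

Because the matrix is never stored as a dense Gaussian draw and Hadamard transforms admit fast O(d log d) multiplication, this mechanism scales well to large d.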
Testing all variants of the MC mechanism mentioned above, we managed to successfully train stable walking behaviours using only m = 8 samples per iteration for just k = 5 settings: MCGaussianOrthogonal-antithetic-coupled, MCGaussianOrthogonal-antithetic, MCGaussianOrthogonal-forward-fd, MCRandomHadamard-antithetic, and MCRandomHadamard-vanilla. Thus all 5 policies correspond to some variant of our GCMC mechanism.

We did not conduct hyperparameter tuning to obtain the above curves. We used hyperparameters applied on a regular basis in other Monte Carlo algorithms for policy optimization; in particular we chose σ = 0.1 and η = 0.01, where σ stands for the standard deviation of the entries of the Gaussian vectors used for MC and η is the gradient step size. The experiments were conducted in a distributed environment on a cluster of machines, where each machine was responsible for evaluating exactly one sample.

5.2 Variance-reduced ELBO estimation for deep generative models

In this section, we test GCMC sampling strategies on a deep generative modelling application. We consider a variational autoencoder (VAE) (Rezende et al., 2014; Kingma and Welling, 2014) with latent variable z with prior p(z), observed variable x with trainable generative model p_\theta(x|z), and trainable recognition model q_\phi(z|x). In the standard VAE training algorithm, the evidence lower bound (ELBO) for a single training point x is:

\mathbb{E}_{z \sim q_\phi(\cdot \mid x)} \left[ \log p_\theta(x, z) - \log q_\phi(z \mid x) \right].

This objective is then optimised by estimating gradients using a combination of m ∈ ℕ i.i.d. Monte Carlo samples together with the reparametrisation trick. We adjust the training algorithm by using a variety of GCMC sampling algorithms, rather than i.i.d. sampling.
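To make the coupled-sampling idea concrete in the reparametrised setting, the sketch below shows one simple coupling, antithetic pairing, for a diagonal-Gaussian recognition model q_\phi(z|x) = N(mu, diag(sigma^2)). This is only one of several possible couplings (the paper's own algorithms are given in Algorithm 1 and the appendix), and the function name is hypothetical.

```python
import numpy as np

def antithetic_reparam_samples(mu, sigma, m, rng=None):
    """m reparametrised samples from N(mu, diag(sigma^2)) in antithetic pairs.

    For each base noise vector eps, both mu + sigma*eps and mu - sigma*eps
    are emitted, so the sample mean of z is exactly mu and odd integrands
    (in eps) are estimated with zero variance.
    """
    assert m % 2 == 0, "antithetic pairing needs an even sample count"
    rng = rng or np.random.default_rng()
    eps = rng.standard_normal((m // 2, mu.shape[0]))
    eps = np.concatenate([eps, -eps], axis=0)          # coupled pairs
    return mu[None, :] + sigma[None, :] * eps
```

Since the coupling acts only on the base noise, gradients with respect to mu and sigma still flow through the reparametrisation exactly as in standard VAE training.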
We train on MNIST, and report the average train and test ELBO after 50 epochs for a variety of sampling algorithms and numbers of samples m, to understand the effect of these sampling methods on speeding up learning. The full results and experiment specifications are given in Appendix Section 12. We observe that GCMC methods consistently lead to better log-likelihoods than i.i.d. sampling; in fact, GCMC methods with 2 samples perform better than i.i.d. methods using 8 samples. We highlight concurrent work (Buchholz et al., 2018) that presents an in-depth study of quasi-Monte Carlo integration for variational inference.

6 Conclusion

We have introduced Monte Carlo coupling strategies in Euclidean spaces for improving algorithms that typically operate in a high-dimensional, low-sample regime, demonstrating fundamental connections to multi-marginal transport. In future work, it will be interesting to explore applications in other areas such as random feature kernel approximation. We also highlight the more general solution of the K-optimality criterion, and the incorporation of a sampling cost penalty into the corresponding objective, as interesting problems left open by this paper.

Acknowledgements

We thank Jiri Hron, María Lomelí, and the anonymous reviewers for helpful comments on the manuscript. We thank Yingzhen Li for her VAE implementation. MR acknowledges support by EPSRC grant EP/L016516/1 for the Cambridge Centre for Analysis. AW acknowledges support from the David MacKay Newton research fellowship at Darwin College, The Alan Turing Institute under EPSRC grant EP/N510129/1 & TU/B/000074, and the Leverhulme Trust via the CFI.

References

Christoph Aistleitner and Josef Dick. Functions of bounded variation, signed measures, and a general Koksma-Hlawka inequality. Acta Arithmetica, 167(2):143–171, 2015.

Charalambos D. Aliprantis and Kim C. Border.
Infinite Dimensional Analysis: a Hitchhiker's Guide. Springer, 2006.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning (ICML), 2017.

Haim Avron, Vikas Sindhwani, Jiyan Yang, and Michael W. Mahoney. Quasi-Monte Carlo feature maps for shift-invariant kernels. Journal of Machine Learning Research, 17:120:1–120:38, 2016. URL http://jmlr.org/papers/v17/14-538.html.

Marc G. Bellemare, Ivo Danihelka, Will Dabney, Shakir Mohamed, Balaji Lakshminarayanan, Stephan Hoyer, and Rémi Munos. The Cramer distance as a solution to biased Wasserstein gradients. arXiv, abs/1705.10743, 2017.

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Johann S. Brauchart and Josef Dick. Quasi-Monte Carlo rules for numerical integration over the unit sphere S^2. Numerische Mathematik, 121(3):473–502, 2012. doi: 10.1007/s00211-011-0444-6. URL https://doi.org/10.1007/s00211-011-0444-6.

François-Xavier Briol, Chris J. Oates, Jon Cockayne, Wilson Ye Chen, and Mark Girolami. On the sampling problem for kernel quadrature. In International Conference on Machine Learning (ICML), 2017.

Alexander Buchholz, Florian Wenzel, and Stephan Mandt. Quasi-Monte Carlo variational inference. In International Conference on Machine Learning (ICML), 2018.

Yutian Chen, Max Welling, and Alexander J. Smola. Super-samples from kernel herding. In Uncertainty in Artificial Intelligence (UAI), 2010.

K. Choromanski, C. Downey, and B. Boots. Initialization matters: Orthogonal predictive state recurrent neural networks. In International Conference on Learning Representations (ICLR), 2018a.

K. Choromanski, M. Rowland, T. Sarlos, V. Sindhwani, R. Turner, and A. Weller. The geometry of random features.
In Artificial Intelligence and Statistics (AISTATS), 2018b.

Krzysztof Choromanski, Mark Rowland, Vikas Sindhwani, Richard Turner, and Adrian Weller. Structured evolution with compact architectures for scalable policy optimization. In International Conference on Machine Learning (ICML), 2018c.

Krzysztof Marcin Choromanski, Mark Rowland, and Adrian Weller. The unreasonable effectiveness of structured random orthogonal embeddings. In Neural Information Processing Systems (NIPS), 2017.

Erwin Coumans and Yunfei Bai. PyBullet, a Python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016–2018.

Josef Dick, Aicke Hinrichs, and Friedrich Pillichshammer. Proof techniques in quasi-Monte Carlo theory. Journal of Complexity, 31(3):327–371, 2015. doi: 10.1016/j.jco.2014.09.003. URL https://doi.org/10.1016/j.jco.2014.09.003.

Evarist Giné and Richard Nickl. Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge University Press, 2015.

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Artificial Intelligence and Statistics (AISTATS), 2010.

Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, March 2012. ISSN 1532-4435.

Paul R. Halmos. Measure Theory, volume 18. Springer, 2013.

John Halton. On the efficiency of certain quasi-random sequences of points in evaluating multi-dimensional integrals. Numerische Mathematik, 2:84–90, 1960.

J. M. Hammersley and K. W. Morton. A new Monte Carlo technique: antithetic variates. Mathematical Proceedings of the Cambridge Philosophical Society, 52:449–475, 1956.

Ferenc Huszár and David K. Duvenaud. Optimally-weighted herding is Bayesian quadrature.
In Uncertainty in Artificial Intelligence (UAI), 2012.

Motonobu Kanagawa, Philipp Hennig, Dino Sejdinovic, and Bharath K. Sriperumbudur. Gaussian processes and kernel methods: A review on connections and equivalences. arXiv, 2018.

Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR), 2014.

M. Lomelí, M. Rowland, A. Gretton, and Z. Ghahramani. Antithetic and Monte Carlo kernel estimators for partial rankings. arXiv, 2018.

Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv, 2018.

M. B. Marcus and L. A. Shepp. Sample behavior of Gaussian processes. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory, pages 423–441, 1972.

Harald Niederreiter. Random Number Generation and Quasi-Monte Carlo Methods. Society for Industrial and Applied Mathematics, 1992.

S. H. Paskov. Average case complexity of multivariate integration for smooth functions. Journal of Complexity, 9(2):291–312, 1993.

Brendan Pass. Multi-marginal optimal transport: Theory and applications. ESAIM: Mathematical Modelling and Numerical Analysis, 49, 2014.

A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Neural Information Processing Systems (NIPS), 2007.

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2005. ISBN 026218253X.

C. E. Rasmussen and Z. Ghahramani. Bayesian Monte Carlo. In Neural Information Processing Systems (NIPS), 2003.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning (ICML), 2014.

Christian P. Robert and George Casella.
Monte Carlo Statistical Methods. Springer-Verlag, 2005.

Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv, 2017.

Ian H. Sloan and Henryk Wozniakowski. When are quasi-Monte Carlo algorithms efficient for high dimensional integrals? Journal of Complexity, 14(1):1–33, 1998. doi: 10.1006/jcom.1997.0463. URL https://doi.org/10.1006/jcom.1997.0463.

G. W. Stewart. The efficient generation of random orthogonal matrices with an application to condition estimators. SIAM Journal on Numerical Analysis, 17(3):403–409, 1980. ISSN 00361429. URL http://www.jstor.org/stable/2156882.

Michel Talagrand. Regularity of Gaussian processes. Acta Mathematica, 159:99–149, 1987.

Max Welling and Yee Whye Teh. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning (ICML), 2011.

F. Yu, A. Suresh, K. Choromanski, D. Holtmann-Rice, and S. Kumar. Orthogonal random features. In Neural Information Processing Systems (NIPS), 2016.