{"title": "A* Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 3086, "page_last": 3094, "abstract": "The problem of drawing samples from a discrete distribution can be converted into a discrete optimization problem. In this work, we show how sampling from a continuous distribution can be converted into an optimization problem over continuous space. Central to the method is a stochastic process recently described in mathematical statistics that we call the Gumbel process. We present a new construction of the Gumbel process and A* sampling, a practical generic sampling algorithm that searches for the maximum of a Gumbel process using A* search. We analyze the correctness and convergence time of A* sampling and demonstrate empirically that it makes more efficient use of bound and likelihood evaluations than the most closely related adaptive rejection sampling-based algorithms.", "full_text": "A\u2217 Sampling\n\nChris J. Maddison\nDept. of Computer Science\nUniversity of Toronto\ncmaddis@cs.toronto.edu\n\nDaniel Tarlow, Tom Minka\nMicrosoft Research\n{dtarlow,minka}@microsoft.com\n\nAbstract\n\nThe problem of drawing samples from a discrete distribution can be converted into a discrete optimization problem [1, 2, 3, 4]. In this work, we show how sampling from a continuous distribution can be converted into an optimization problem over continuous space. Central to the method is a stochastic process recently described in mathematical statistics that we call the Gumbel process. 
We present a new construction of the Gumbel process and A\u2217 Sampling, a practical generic sampling algorithm that searches for the maximum of a Gumbel process using A\u2217 search. We analyze the correctness and convergence time of A\u2217 Sampling and demonstrate empirically that it makes more efficient use of bound and likelihood evaluations than the most closely related adaptive rejection sampling-based algorithms.\n\n1 Introduction\n\nDrawing samples from arbitrary probability distributions is a core problem in statistics and machine learning. Sampling methods are used widely when training, evaluating, and predicting with probabilistic models. In this work, we introduce a generic sampling algorithm that returns exact independent samples from a distribution of interest. This line of work is important as we seek to include probabilistic models as subcomponents in larger systems, and as we seek to build probabilistic modelling tools that are usable by non-experts; in these cases, guaranteeing the quality of inference is highly desirable. There is a range of existing approaches for exact sampling. Some are specialized to specific distributions [5], but exact generic methods are based either on (adaptive) rejection sampling [6, 7, 8] or Markov Chain Monte Carlo (MCMC) methods where convergence to the stationary distribution can be guaranteed [9, 10, 11].\n\nThis work approaches the problem from a different perspective. Specifically, it is inspired by an algorithm for sampling from a discrete distribution that is known as the Gumbel-Max trick. The algorithm works by adding independent Gumbel perturbations to each configuration of a discrete negative energy function and returning the argmax configuration of the perturbed negative energy function. The result is an exact sample from the corresponding Gibbs distribution. 
Previous work [1, 3] has used this property to motivate samplers based on optimizing random energy functions but has been forced to resort to approximate sampling due to the fact that in structured output spaces, exact sampling appears to require instantiating exponentially many Gumbel perturbations.\n\nOur first key observation is that we can apply the Gumbel-Max trick without instantiating all of the (possibly exponentially many) Gumbel perturbations. The same basic idea then allows us to extend the Gumbel-Max trick to continuous spaces where there will be infinitely many independent perturbations. Intuitively, for any given random energy function, there are many perturbation values that are irrelevant to determining the argmax so long as we have an upper bound on their values. We will show how to instantiate the relevant ones and bound the irrelevant ones, allowing us to find the argmax \u2014 and thus an exact sample.\n\nThere are a number of challenges that must be overcome along the way, which are addressed in this work. First, what does it mean to independently perturb space in a way analogous to perturbations in the Gumbel-Max trick? We introduce the Gumbel process, a special case of a stochastic process recently defined in mathematical statistics [12], which generalizes the notion of perturbation over space. Second, we need a method for working with a Gumbel process that does not require instantiating infinitely many random variables. This leads to our novel construction of the Gumbel process, which draws perturbations according to a top-down ordering of their values. Just as the stick breaking construction of the Dirichlet process gives insight into algorithms for the Dirichlet process, our construction gives insight into algorithms for the Gumbel process. We demonstrate this by developing A\u2217 sampling, which leverages the construction to draw samples from arbitrary continuous distributions. 
We study the relationship between A\u2217 sampling and adaptive rejection sampling-based methods and identify a key difference that leads to more efficient use of bound and likelihood computations. We investigate the behaviour of A\u2217 sampling on a variety of illustrative and challenging problems.\n\n2 The Gumbel Process\n\nThe Gumbel-Max trick is an algorithm for sampling from a categorical distribution over classes i \u2208 {1, . . . , n} with probability proportional to exp(\u03c6(i)). The algorithm proceeds by adding independent Gumbel-distributed noise to the log-unnormalized mass \u03c6(i) and returns the optimal class of the perturbed distribution. In more detail, G \u223c Gumbel(m) is a Gumbel with location m if P(G \u2264 g) = exp(\u2212exp(\u2212g + m)). The Gumbel-Max trick follows from the structure of Gumbel distributions and basic properties of order statistics; if G(i) are i.i.d. Gumbel(0), then argmax_{i} {G(i) + \u03c6(i)} \u223c exp(\u03c6(i))/\u2211_{i} exp(\u03c6(i)). Further, for any B \u2286 {1, . . . , n}\n\nmax_{i\u2208B} {G(i) + \u03c6(i)} \u223c Gumbel(log \u2211_{i\u2208B} exp(\u03c6(i)))   (1)\nargmax_{i\u2208B} {G(i) + \u03c6(i)} \u223c exp(\u03c6(i))/\u2211_{i\u2208B} exp(\u03c6(i))   (2)\n\nEq. 1 is known as max-stability: the highest order statistic of a sample of independent Gumbels also has a Gumbel distribution with a location that is the log partition function [13]. Eq. 2 is a consequence of the fact that Gumbels satisfy Luce\u2019s choice axiom [14]. Moreover, the max and argmax are independent random variables; see Appendix for proofs.\n\nWe would like to generalize the interpretation to continuous distributions as maximizing over the perturbation of a density p(x) \u221d exp(\u03c6(x)) on Rd. 
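As a concrete illustration of the discrete trick above, the following sketch (our own illustrative code, not the paper's; `gumbel_max_sample` and the toy weights are assumptions) perturbs each log-unnormalized mass with independent Gumbel(0) noise and returns the argmax:

```python
import math
import random
from collections import Counter

def gumbel_max_sample(phi, rng):
    # Add independent Gumbel(0) noise to each log-unnormalized mass phi[i]
    # and return the argmax; -log(-log(U)) inverts the Gumbel(0) CDF.
    return max(range(len(phi)),
               key=lambda i: phi[i] - math.log(-math.log(rng.random())))

rng = random.Random(0)
phi = [math.log(w) for w in (0.5, 0.3, 0.2)]  # log-unnormalized masses
n = 20000
counts = Counter(gumbel_max_sample(phi, rng) for _ in range(n))
freqs = [counts[i] / n for i in range(3)]  # approaches (0.5, 0.3, 0.2)
```

Repeating the draw many times, the empirical argmax frequencies approach the normalized masses, as Eq. 2 predicts.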
The perturbed density should have properties analogous to the discrete case, namely that the max in B \u2286 Rd should be distributed as Gumbel(log \u222b_{x\u2208B} exp(\u03c6(x))) and the argmax in B should be distributed \u221d 1(x \u2208 B) exp(\u03c6(x)). The Gumbel process is a generalization satisfying these properties.\n\nDefinition 1. Adapted from [12]. Let \u00b5(B) be a sigma-finite measure on sample space \u2126, B \u2286 \u2126 measurable, and G\u00b5(B) a random variable. G\u00b5 = {G\u00b5(B) | B \u2286 \u2126} is a Gumbel process if\n1. (marginal distributions) G\u00b5(B) \u223c Gumbel(log \u00b5(B)).\n2. (independence of disjoint sets) G\u00b5(B) \u22a5 G\u00b5(Bc).\n3. (consistency constraints) for measurable A, B \u2286 \u2126, G\u00b5(A \u222a B) = max(G\u00b5(A), G\u00b5(B)).\n\nThe marginal distributions condition ensures that the Gumbel process satisfies the requirement on the max. The consistency requirement ensures that a realization of a Gumbel process is consistent across space. Together with independence these ensure the argmax requirement: the argmax of a Gumbel process restricted to B' is distributed according to the probability distribution that is proportional to the sigma-finite measure \u00b5 restricted to B'. In particular, let \u00af\u00b5(B | B') = \u00b5(B \u2229 B')/\u00b5(B') be the probability distribution proportional to \u00b5. If G\u00b5(B) is the optimal value of some perturbed density restricted to B, then the event that the optimum over \u2126 is contained in B is equivalent to the event that G\u00b5(B) \u2265 G\u00b5(Bc). The conditions ensure that P(G\u00b5(B) \u2265 G\u00b5(Bc)) = \u00af\u00b5(B | \u2126) [12]. Thus, for example, we can use the Gumbel process for a continuous measure \u00b5(B) = \u222b_{x\u2208B} exp(\u03c6(x)) on Rd to model a perturbed density function where the optimum is distributed \u221d exp(\u03c6(x)). 
Notice that this definition is a generalization of the finite case; if \u2126 is finite, then the collection G\u00b5 corresponds exactly to maxes over subsets of independent Gumbels.\n\n3 Top-Down Construction for the Gumbel Process\n\nWhile [12] defines and constructs a general class of stochastic processes that include the Gumbel process, the construction that proves their existence gives little insight into how to execute a continuous version of the Gumbel-Max trick. Here we give an alternative algorithmic construction that will form the foundation of our practical sampling algorithm. In this section we assume log \u00b5(\u2126) can be computed tractably; this assumption will be lifted in Section 4. To explain the construction, we consider the discrete case as an introductory example.\n\nAlgorithm 1 Top-Down Construction\ninput sample space \u2126, sigma-finite measure \u00b5(B)\n(B1, Q) \u2190 (\u2126, Queue)\nG1 \u223c Gumbel(log \u00b5(\u2126))\nX1 \u223c \u00af\u00b5(\u00b7 | \u2126)\nQ.push(1)\nk \u2190 1\nwhile !Q.empty() do\n  p \u2190 Q.pop()\n  L, R \u2190 partition(Bp \u2212 {Xp})\n  for C \u2208 {L, R} do\n    if C \u2260 \u2205 then\n      k \u2190 k + 1\n      Bk \u2190 C\n      Gk \u223c TruncGumbel(log \u00b5(Bk), Gp)\n      Xk \u223c \u00af\u00b5(\u00b7 | Bk)\n      Q.push(k)\n      yield (Gk, Xk)\n\nSuppose G\u00b5(i) \u223c Gumbel(\u03c6(i)) is a set of independent Gumbel random variables for i \u2208 {1, . . . , n}. It would be straightforward to sample the variables, then build a heap of the G\u00b5(i) values and also have heap nodes store the index i associated with their value. Let Bi be the set of indices that appear in the subtree rooted at the node with index i. A property of the heap is that the root (G\u00b5(i), i) pair is the max and argmax of the set of Gumbels with index in Bi. The key idea of our construction is to sample the independent set of random variables by instantiating this heap from root to leaves. That is, we will first sample the root node, which is the global max and argmax, then we will recurse, sampling the root\u2019s two children conditional upon the root. At the end, we will have sampled a heap full of values and indices; reading off the value associated with each index will yield a draw of independent Gumbels from the target distribution.\n\nWe sketch an inductive argument. For the base case, sample the max and its index i\u2217 using their distributions that we know from Eq. 1 and Eq. 2. Note the max and argmax are independent. Also let Bi\u2217 = {1, . . . , n} be the set of all indices. Now, inductively, suppose we have sampled a partial heap and would like to recurse downward starting at (G\u00b5(p), p). Partition the remaining indices to be sampled, Bp \u2212 {p}, into two subsets L and R and let l \u2208 L be the left argmax and r \u2208 R be the right argmax. Let [\u2265p] be the indices that have been sampled already. Then\n\np(G\u00b5(l) = gl, G\u00b5(r) = gr, {G\u00b5(k) = gk}_{k\u2208[\u2265p]} | [\u2265p]) \u221d p(max_{i\u2208L} G\u00b5(i) = gl) p(max_{i\u2208R} G\u00b5(i) = gr) \u220f_{k\u2208[\u2265p]} pk(G\u00b5(k) = gk) 1(gk \u2265 gL(k) \u2227 gk \u2265 gR(k))   (3)\n\nwhere L(k) and R(k) denote the left and right children of k and the constraints should only be applied amongst nodes [\u2265p] \u222a {l, r}. This implies\n\np(G\u00b5(l) = gl, G\u00b5(r) = gr | {G\u00b5(k) = gk}_{k\u2208[\u2265p]}, [\u2265p]) \u221d p(max_{i\u2208L} G\u00b5(i) = gl) p(max_{i\u2208R} G\u00b5(i) = gr) 1(gp > gl) 1(gp > gr).   (4)\n\nEq. 4 is the joint density of two independent Gumbels truncated at G\u00b5(p). We could sample the children maxes and argmaxes by sampling the independent Gumbels in L and R respectively and computing their maxes, rejecting those that exceed the known value of G\u00b5(p). Better, the truncated Gumbel distributions can be sampled efficiently via CDF inversion (footnote 1), and the independent argmaxes within L and R can be sampled using Eq. 2. Note that any choice of partitioning strategy for L and R leads to the same distribution over the set of Gumbel values.\n\n1 G \u223c TruncGumbel(\u03c6, b) if G has CDF exp(\u2212exp(\u2212min(g, b) + \u03c6))/exp(\u2212exp(\u2212b + \u03c6)). To sample efficiently, return G = \u2212log(exp(\u2212b \u2212 \u03b3 + \u03c6) \u2212 log(U)) \u2212 \u03b3 + \u03c6 where U \u223c uniform[0, 1].\n\nThe basic structure of this top-down sampling procedure allows us to deal with infinite spaces; we can still generate an infinite descending heap of Gumbels and locations as if we had made a heap from an infinite list. The algorithm (which appears as Algorithm 1) begins by sampling the optimal value G1 \u223c Gumbel(log \u00b5(\u2126)) over sample space \u2126 and its location X1 \u223c \u00af\u00b5(\u00b7 | \u2126). X1 is removed from the sample space and the remaining sample space is partitioned into L and R. The optimal Gumbel values for L and R are sampled from a Gumbel with location the log measure of their respective sets, but truncated at G1. The locations are sampled independently from their sets, and the procedure recurses. As in the discrete case, this yields a stream of (Gk, Xk) pairs, which we can think of as being nodes in a heap of the Gk\u2019s.\n\nIf G\u00b5(x) is the value of the perturbed negative energy at x, then Algorithm 1 instantiates this function at countably many points by setting G\u00b5(Xk) = Gk. In the discrete case we eventually sample the complete perturbed density, but in the continuous case we simply generate an infinite stream of locations and values. 
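To make the finite-case construction concrete, the following sketch (our own illustrative code, not the paper's; the halving partition, `top_down`, and `trunc_gumbel` are assumptions) instantiates the heap from root to leaves and returns one joint draw of independent Gumbels:

```python
import math
import random

def trunc_gumbel(log_mu, bound, rng):
    # Gumbel(log_mu) conditioned on being <= bound, by inverting the
    # truncated CDF exp(-exp(-min(g, b) + m)) / exp(-exp(-b + m)).
    return log_mu - math.log(math.exp(log_mu - bound) - math.log(rng.random()))

def top_down(weights, rng):
    """One joint draw {i: G(i)} with G(i) ~ Gumbel(log weights[i]), independent."""
    def argmax_loc(idx):
        # argmax within a region is distributed proportionally to the weights
        r = rng.random() * sum(weights[i] for i in idx)
        for i in idx:
            r -= weights[i]
            if r <= 0:
                return i
        return idx[-1]

    gumbels = {}
    stack = [(list(range(len(weights))), None)]  # (region, truncation bound)
    while stack:
        idx, bound = stack.pop()
        log_mu = math.log(sum(weights[i] for i in idx))
        if bound is None:  # root: untruncated max over the whole space
            g = log_mu - math.log(-math.log(rng.random()))
        else:              # child regions are truncated at the parent's value
            g = trunc_gumbel(log_mu, bound, rng)
        x = argmax_loc(idx)
        gumbels[x] = g
        rest = [i for i in idx if i != x]
        mid = len(rest) // 2
        for child in (rest[:mid], rest[mid:]):  # arbitrary halving partition
            if child:
                stack.append((child, g))
    return gumbels

rng = random.Random(1)
weights = [0.5, 0.3, 0.2]
reps = 5000
draws = [top_down(weights, rng) for _ in range(reps)]
means = [sum(d[i] for d in draws) / reps for i in range(3)]
```

With these weights the per-index means approach log w_i plus the Euler constant (about 0.577), as expected for Gumbel(log w_i) marginals, regardless of the partitioning strategy.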
The sense in which Algorithm 1 constructs a Gumbel process is that the collection {max{Gk | Xk \u2208 B} | B \u2286 \u2126} satisfies Definition 1. The intuition should be provided by the introductory argument; a full proof appears in the Appendix. An important note is that because Gk\u2019s are sampled in descending order along a path in the tree, when the first Xk lands in set B, the value of max{Gk | Xk \u2208 B} will not change as the algorithm continues.\n\n4 A\u2217 Sampling\n\nThe Top-Down construction is not executable in general, because it assumes log \u00b5(\u2126) can be computed efficiently. A\u2217 sampling is an algorithm that executes the Gumbel-Max trick without this assumption by exploiting properties of the Gumbel process. Henceforth A\u2217 sampling refers exclusively to the continuous version.\n\nA\u2217 sampling is possible because we can transform one Gumbel process into another by adding the difference in their log densities. Suppose we have two continuous measures \u00b5(B) = \u222b_{x\u2208B} exp(\u03c6(x)) and \u03bd(B) = \u222b_{x\u2208B} exp(i(x)). Let pairs (Gk, Xk) be draws from the Top-Down construction for G\u03bd. If o(x) = \u03c6(x) \u2212 i(x) is bounded, then we can recover G\u00b5 by adding the difference o(Xk) to every Gk; i.e., {max{Gk + o(Xk) | Xk \u2208 B} | B \u2286 Rd} is a Gumbel process with measure \u00b5. As an example, if \u03bd were a prior and o(x) a bounded log-likelihood, then we could simulate the Gumbel process corresponding to the posterior by adding o(Xk) to every Gk from a run of the construction for \u03bd.\n\nThis \u201clinearity\u201d allows us to decompose a target log density function into a tractable i(x) and boundable o(x). The tractable component is analogous to the proposal distribution in a rejection sampler. A\u2217 sampling searches for argmax{Gk + o(Xk)} within the heap of (Gk, Xk) pairs from the Top-Down construction of G\u03bd. The search is an A\u2217 procedure: nodes in the search tree correspond to increasingly refined regions in space, and the search is guided by upper and lower bounds that are computed for each region. Lower bounds for region B come from drawing the max Gk and argmax Xk of G\u03bd within B and evaluating Gk + o(Xk). Upper bounds come from the fact that\n\nmax{Gk + o(Xk) | Xk \u2208 B} \u2264 max{Gk | Xk \u2208 B} + M(B),\n\nwhere M(B) is a bounding function for a region, M(B) \u2265 o(x) for all x \u2208 B. M(B) is not random and can be implemented using methods from e.g., convex duality or interval analysis. The first term on the RHS is the Gk value used in the lower bound.\n\nAlgorithm 2 A\u2217 Sampling\ninput proposal log density i(x), difference in log densities o(x), bounding function M(B), and partition\n(LB, X\u2217, k) \u2190 (\u2212\u221e, null, 1)\nQ \u2190 PriorityQueue\nG1 \u223c Gumbel(log \u03bd(Rd))\nX1 \u223c exp(i(x))/\u03bd(Rd)\nM1 \u2190 M(Rd)\nQ.pushWithPriority(1, G1 + M1)\nwhile !Q.empty() and LB < Q.topPriority() do\n  p \u2190 Q.popHighest()\n  LBp \u2190 Gp + o(Xp)\n  if LB < LBp then\n    LB \u2190 LBp\n    X\u2217 \u2190 Xp\n  L, R \u2190 partition(Bp, Xp)\n  for C \u2208 {L, R} do\n    if C \u2260 \u2205 then\n      k \u2190 k + 1\n      Bk \u2190 C\n      Gk \u223c TruncGumbel(log \u03bd(Bk), Gp)\n      Xk \u223c 1(x \u2208 Bk) exp(i(x))/\u03bd(Bk)\n      if LB < Gk + Mp then\n        Mk \u2190 M(Bk)\n        if LB < Gk + Mk then\n          Q.pushWithPriority(k, Gk + Mk)\noutput (LB, X\u2217)\n\nFigure 1: Illustration of A\u2217 sampling.\n\nThe algorithm appears in Algorithm 2 and an execution is illustrated in Fig. 1. The algorithm begins with a global upper bound (dark blue dashed). G1 and X1 are sampled, and the first lower bound LB1 = G1 + o(X1) is computed. Space is split, upper bounds are computed for the new children regions (medium blue dashed), and the new nodes are put on the queue. The region with highest upper bound is chosen, the maximum Gumbel in the region, (G2, X2), is sampled, and LB2 is computed. The current region is split at X2 (producing light blue dashed bounds), after which LB2 is greater than the upper bound for any region on the queue, so LB2 is guaranteed to be the max over the infinite tree of Gk + o(Xk). Because max{Gk + o(Xk) | Xk \u2208 B} is a Gumbel process with measure \u00b5, this means that X2 is an exact sample from p(x) \u221d exp(\u03c6(x)) and LB2 is an exact sample from Gumbel(log \u00b5(Rd)). Proofs of termination and correctness are in the Appendix.\n\nA\u2217 Sampling Variants. There are several variants of A\u2217 sampling. When more than one sample is desired, bound information can be reused across runs of the sampler. In particular, suppose we have a partition of Rd with bounds on o(x) for each region. A\u2217 sampling could use this by running a search independently for each region and returning the max Gumbel. The maximization can be done lazily by using A\u2217 search, only expanding nodes in regions that are needed to determine the global maximum. The second variant trades bound computations for likelihood computations by drawing more than one sample from the auxiliary Gumbel process at each node in the search tree. In this way, more lower bounds are computed (costing more likelihood evaluations), but if this leads to better lower bounds, then more regions of space can be pruned, leading to fewer bound evaluations. Finally, an interesting special case of A\u2217 sampling can be implemented when o(x) is unimodal in 1D. In this case, at every split of a parent node, one child can immediately be pruned, so the \u201csearch\u201d can be executed without a queue. 
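As a hedged, self-contained sketch (our own code, not the authors'), the queue-based search can be exercised on the 1D family used later in Section 6.1, p(x) \u221d exp(\u2212x)(1 + x)^(\u2212a) on x > 0, assuming an Exponential(1) proposal, the interval bound M([l, r]) = o(l) (valid because o is decreasing), splitting at the sampled point, and collapsing the two-stage Mp/Mk check into a single bound test:

```python
import heapq
import math
import random

def astar_sample(a, rng):
    # Target p(x) propto exp(-x) * (1+x)^(-a) on x > 0, proposal density
    # exp(-x), so i(x) = -x and o(x) = -a*log(1+x) is the bounded difference.
    o = lambda x: -a * math.log1p(x)
    M = lambda l, r: o(l)  # o is decreasing, so its max on [l, r] is at l
    nu = lambda l, r: math.exp(-l) - (0.0 if r == math.inf else math.exp(-r))

    def proposal(l, r):
        # inverse-CDF draw from the exp(-x) proposal restricted to [l, r]
        return -math.log(math.exp(-l) - rng.random() * nu(l, r))

    def gumbel(mu):
        return mu - math.log(-math.log(rng.random()))

    def trunc_gumbel(mu, b):
        # Gumbel(mu) conditioned to be <= b, via CDF inversion
        return mu - math.log(math.exp(mu - b) - math.log(rng.random()))

    LB, X_star = -math.inf, None
    G1, X1 = gumbel(math.log(nu(0.0, math.inf))), proposal(0.0, math.inf)
    heap = [(-(G1 + M(0.0, math.inf)), 0.0, math.inf, G1, X1)]
    while heap and LB < -heap[0][0]:
        _, l, r, G, X = heapq.heappop(heap)
        if LB < G + o(X):                  # lower bound from (G, X) in the region
            LB, X_star = G + o(X), X
        for cl, cr in ((l, X), (X, r)):    # split the region at X
            Gk = trunc_gumbel(math.log(nu(cl, cr)), G)
            if LB < Gk + M(cl, cr):        # prune regions that cannot win
                heapq.heappush(heap, (-(Gk + M(cl, cr)), cl, cr, Gk,
                                      proposal(cl, cr)))
    return X_star

rng = random.Random(2)
samples = [astar_sample(2.0, rng) for _ in range(2000)]
```

When a = 0 the difference o vanishes, every child is pruned immediately, and the sampler reduces to a single draw from the Exponential(1) proposal, which matches the rejection-sampling equivalence discussed in Section 5.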
It simply maintains the currently active node and drills down until\nit has provably found the optimum.\n5 Comparison to Rejection Samplers\nOur \ufb01rst result relating A\u2217 sampling to rejection sampling is that if the same global bound M =\nM (Rd) is used at all nodes within A\u2217 sampling, then the runtime of A\u2217 sampling is equivalent to that\nof standard rejection sampling. That is, the distribution over the number of iterations is distributed\nas a Geometric distribution with rate parameter \u00b5(Rd)/(exp(M )\u03bd(Rd)). A proof is in the Appendix\nas part of the proof of termination.\nWhen bounds are re\ufb01ned, A\u2217 sampling bears similarity to adaptive rejection sampling-based algo-\nrithms. In particular, while it appears only to have been applied in discrete domains, OS\u2217 [7] is a\ngeneral class of adaptive rejection sampling methods that maintain piecewise bounds on the target\ndistribution. If piecewise constant bounds are used (henceforth we assume OS\u2217 uses only constant\nbounds) the procedure can be described as follows: at each step, (1) a region B with bound M (B) is\nsampled with probability proportional to \u03bd(B) exp(M (B)), (2) a point is drawn from the proposal\ndistribution restricted to the chosen region; (3) standard accept/rejection computations are performed\nusing the regional bound, and (4) if the point is rejected, a region is chosen to be split into two, and\nnew bounds are computed for the two regions that were created by the split. This process repeats\nuntil a point is accepted.\nSteps (2) and (4) are performed identically in A\u2217 when sampling argmax Gumbel locations and when\nsplitting a parent node. A key difference is how regions are chosen in step (1). In OS\u2217, a region\nis drawn according to volume of the region under the proposal. 
Note that piece selection could be implemented using the Gumbel-Max trick, in which case we would choose the piece with maximum GB + M(B) where GB \u223c Gumbel(log \u03bd(B)). In A\u2217 sampling the region with highest upper bound is chosen, where the upper bound is GB + M(B). The difference is that GB values are reset after each rejection in OS\u2217, while they persist in A\u2217 sampling until a sample is returned.\n\nThe effect of the difference is that A\u2217 sampling more tightly couples together where the accepted sample will be and which regions are refined. Unlike OS\u2217, it can go so far as to prune a region from the search, meaning there is zero probability that the returned sample will be from that region, and that region will never be refined further. OS\u2217, on the other hand, is blind towards where the sample that will eventually be accepted comes from and will on average waste more computation refining regions that ultimately are not useful in drawing the sample. In experiments, we will see that A\u2217 consistently dominates OS\u2217, refining the function less while also using fewer likelihood evaluations. This is possible because the persistence inside A\u2217 sampling focuses the refinement on the regions that are important for accepting the current sample.\n\n(a) vs. peakiness  (b) vs. # pts  (c) Problem-dependent scaling\nFigure 2: (a) Drill down algorithm performance on p(x) = exp(\u2212x)/(1 + x)a as a function of a. (b) Effect of different bounding strategies as a function of number of data points; number of likelihood and bound evaluations are reported. (c) Results of varying observation noise in several nonlinear regression problems.\n\n6 Experiments\n\nThere are three main aims in this section. First, understand the empirical behavior of A\u2217 sampling as parameters of the inference problem and o(x) bounds vary. 
Second, demonstrate generality by\nshowing that A\u2217 sampling algorithms can be instantiated in just a few lines of model-speci\ufb01c code by\nexpressing o(x) symbolically, and then using a branch and bound library to automatically compute\nbounds. Finally, compare to OS\u2217 and an MCMC method (slice sampling).\nIn all experiments,\nregions in the search trees are hyper rectangles (possibly with in\ufb01nite extent); to split a region A,\nchoose the dimension with the largest side length and split the dimension at the sampled Xk point.\n\n6.1 Scaling versus Peakiness and Dimension\nIn the \ufb01rst experiment, we sample from p(x) = exp(\u2212x)/(1 + x)a for x > 0, a > 0 using exp(\u2212x)\nas the proposal distribution. In this case, o(x) = \u2212a log(1 + x) which is unimodal, so the drill down\nvariant of A\u2217 sampling can be used. As a grows, the function becomes peakier; while this presents\nsigni\ufb01cant dif\ufb01culty for vanilla rejection sampling, the cost to A\u2217 is just the cost of locating the peak,\nwhich is essentially binary search. Results averaged over 1000 runs appear in Fig. 2 (a).\nIn the second experiment, we run A\u2217 sampling on the clutter problem [15], which estimates the\nmean of a \ufb01xed covariance isotropic Gaussian under the assumption that some points are outliers.\nWe put a Gaussian prior on the inlier mean and set i(x) to be equal to the prior, so o(x) contains\njust the likelihood terms. To compute bounds on the total log likelihood, we compute upper bounds\non the log likelihood of each point independently then sum up these bounds. We will refer to these\nas \u201cconstant\u201d bounds. In D dimensions, we generated 20 data points with half within [\u22125,\u22123]D\nand half within [2, 4]D, which ensures that the posterior is sharply bimodal, making vanilla MCMC\nquickly inappropriate as D grows. 
The cost of drawing an exact sample as a function of D (averaged over 100 runs) grows exponentially in D, but the problem remains reasonably tractable as D grows (D = 3 requires 900 likelihood evaluations, D = 4 requires 4000). The analogous OS\u2217 algorithm run on the same set of problems requires 16% to 40% more computation on average over the runs.\n\n6.2 Bounding Strategies\n\nHere we investigate alternative strategies for bounding o(x) in the case where o(x) is a sum of per-instance log likelihoods. To allow easy implementation of a variety of bounding strategies, we choose the simple problem of estimating the mean of a 1D Gaussian given N observations. We use three types of bounds: constant bounds as in the clutter problem; linear bounds, where we compute linear upper bounds on each term of the sum, then sum the linear functions and take the max over the region; and quadratic bounds, which are the same as linear except quadratic bounds are computed on each term. In this problem, quadratic bounds are tight. We evaluate A\u2217 sampling using each of the bounding strategies, varying N. See Fig. 2 (b) for results.\n\nFor N = 1, all bound types are equivalent when each expands around the same point. For larger N, the looseness of each per-point bound becomes important. The figure shows that, for large N, using linear bounds multiplies the number of evaluations by 3, compared to tight bounds. Using constant bounds multiplies the number of evaluations by O(\u221aN). 
The Appendix explains why this happens and shows that this behavior is expected for any estimation problem where the width of the posterior shrinks with N.\n\n6.3 Using Generic Interval Bounds\n\nHere we study the use of bounds that are derived automatically by means of interval methods [16]. This suggests how A\u2217 sampling (or OS\u2217) could be used within a more general purpose probabilistic programming setting. We chose a number of nonlinear regression models inspired by problems in physics, computational ecology, and biology. For each, we use FuncDesigner [17] to symbolically construct o(x) and automatically compute the bounds needed by the samplers.\n\nSeveral expressions for y = f(x) appear in the legend of Fig. 2 (c), where letters a through f denote parameters that we wish to sample. The model in all cases is yn = f(xn) + \u03b5n where n is the data point index and \u03b5n is Gaussian noise. We set uniform priors from a reasonable range for all parameters (see Appendix) and generated a small (N = 3) set of training data from the model so that posteriors are multimodal. The peakiness of the posterior can be controlled by the magnitude of the observation noise; we varied this from large to small to produce problems over a range of difficulties.\n\nWe use A\u2217 sampling to sample from the posterior five times for each model and noise setting and report the average number of likelihood evaluations needed in Fig. 2 (c) (y-axis). To establish the difficulty of the problems, we estimate the expected number of likelihood evaluations needed by a rejection sampler to accept a sample. The savings over rejection sampling is often exponentially large, but it varies per problem and is not necessarily tied to the dimension. In the example where savings are minimal, there are many symmetries in the model, which leads to uninformative bounds.\n\nWe also compared to OS\u2217 on the same class of problems. 
Here we generated 20 random instances with a fixed intermediate observation noise value for each problem and drew 50 samples, resetting the bounds after each sample. The average cost (heuristically set to # likelihood evaluations plus 2 \u00d7 # bound evaluations) of OS\u2217 for the five models in Fig. 2 (c) respectively was 21%, 30%, 11%, 21%, and 27% greater than for A\u2217.\n\n6.4 Robust Bayesian Regression\n\nHere our aim is to do Bayesian inference in a robust linear regression model yn = wTxn + \u03b5n where noise \u03b5n is distributed as standard Cauchy and w has an isotropic Gaussian prior. Given a dataset D = {(xn, yn)}_{n=1}^{N}, our goal is to draw samples from the posterior P(w | D). This is a challenging problem because the heavy-tailed noise model can lead to multimodality in the posterior over w. The log likelihood is L(w) = \u2212\u2211_n log(1 + (wTxn \u2212 yn)2). We generated N data points with input dimension D in such a way that the posterior is bimodal and symmetric by setting w\u2217 = [2, ..., 2]T, generating X' \u223c randn(N/2, D) and y' \u223c X'w\u2217 + .1 \u00d7 randn(N/2), then setting X = [X'; X'] and y = [y'; \u2212y']. There are then equally-sized modes near w\u2217 and \u2212w\u2217. We decompose the posterior into a uniform i(\u00b7) within the interval [\u221210, 10]D and put all of the prior and likelihood terms into o(\u00b7). Bounds are computed per point; in some regions the per point bounds are linear, and in others they are quadratic. Details appear in the Appendix.\n\nWe compare to OS\u2217, using two refinement strategies that are discussed in [7]. The first is directly analogous to A\u2217 sampling and is the method we have used in the earlier OS\u2217 comparisons. When a point is rejected, refine the piece that was proposed from at the sampled point, and split the dimension with largest side length. The second method splits the region with largest probability under the proposal. 
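To make per-point bounding concrete for this model, here is a hypothetical constant upper bound (simpler than the piecewise linear/quadratic bounds the paper defers to its Appendix) for a single Cauchy term \u2212log(1 + (wTxn \u2212 yn)2) over a box region, computed by interval arithmetic; `term_upper_bound` and the toy data are our own illustration:

```python
import math
from itertools import product

def term_upper_bound(x, y, lo, hi):
    # Interval extension of s = w . x - y over the per-dimension box [lo, hi].
    s_lo = sum(min(l * xi, h * xi) for xi, l, h in zip(x, lo, hi)) - y
    s_hi = sum(max(l * xi, h * xi) for xi, l, h in zip(x, lo, hi)) - y
    # -log(1 + s^2) is largest where |s| is smallest over the interval.
    s_min = 0.0 if s_lo <= 0.0 <= s_hi else min(abs(s_lo), abs(s_hi))
    return -math.log(1.0 + s_min * s_min)

# Sanity check on a small grid: the bound dominates the true term.
x, y = [0.5, -2.0], 0.3
lo, hi = [1.0, -1.0], [2.0, -0.5]
ub = term_upper_bound(x, y, lo, hi)
grid = product([1.0 + i * 0.1 for i in range(11)],
               [-1.0 + i * 0.05 for i in range(11)])
ok = all(-math.log(1.0 + (w0 * x[0] + w1 * x[1] - y) ** 2) <= ub + 1e-9
         for w0, w1 in grid)
```

Summing such bounds over data points gives an M(B) of the kind the clutter experiment calls "constant" bounds; here the bound is attained at the box corner that minimizes |wTx \u2212 y|.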
We ran experiments on several random draws of the data and report performance along the two axes that are the dominant costs: how many bound computations were used, and how many likelihood evaluations were used. To weigh the tradeoff between the two, we did a rough asymptotic calculation of the costs of bounds versus likelihood computations and set the cost of a bound computation to be D + 1 times the cost of a likelihood computation.

In the first experiment, we ask each algorithm to draw a single exact sample from the posterior. Here, we also report results for the variants of A∗ sampling and OS∗ that trade off likelihood computations for bound computations, as discussed in Section 4. A representative result appears in Fig. 3 (left). Across operating points, A∗ consistently uses fewer bound evaluations and fewer likelihood evaluations than both OS∗ refinement strategies.

In the second experiment, we ask each algorithm to draw 200 samples from the posterior and experiment with the variants that reuse bound information across samples. A representative result appears in Fig. 3 (right). Here we see that the extra refinement done by OS∗ early on allows it to use fewer likelihood evaluations at the expense of more bound computations, but A∗ sampling operates at a point that is not achievable by OS∗. For all of these problems, we ran a random direction slice sampler [18] that was given 10 times the computational budget that A∗ sampling used to draw 200 samples. The slice sampler had trouble mixing when D > 1. Across the five runs for D = 2, the sampler switched modes once, and it never switched modes when D > 2.

Figure 3: A∗ (circles) versus OS∗ (squares and diamonds) computational costs on Cauchy regression experiments of varying dimension.
Square is the refinement strategy that splits the node where the rejected point was sampled; Diamond refines the region with the largest mass under the proposal distribution. Red lines denote contours of equal total computational cost, spaced on a log scale in 10% increments. Color of markers denotes the rate of refinement, ranging from (darkest) refining on every rejection (for OS∗) or one lower bound evaluation per node expansion (for A∗) to (lightest) refining on 10% of rejections (for OS∗) or performing Poisson(1/.1 − 1) + 1 lower bound evaluations per node expansion (for A∗). (left) Cost of drawing a single sample, averaged over 20 random data sets. (right) Drawing 200 samples, averaged over 5 random data sets. Results are similar over a range of N's and D = 1, . . . , 4.

7 Discussion

This work answers a natural question: is there a Gumbel-Max trick for continuous spaces, and can it be leveraged to develop tractable algorithms for sampling from continuous distributions?

In the discrete case, recent work on "Perturb and MAP" (P&M) methods [1, 19, 2] that draw samples as the argmaxes of random energy functions has shown value in developing approximate, correlated perturbations. It is natural to think about continuous analogs in which exactness is abandoned in favor of more efficient computation. A question is whether the approximations can be developed in a principled way, as in [3], which showed that a particular form of correlated discrete perturbation gives rise to bounds on the log partition function. Can analogous rigorous approximations be established in the continuous case? We hope this work is a starting point for exploring that question.

We do not solve the problem of high dimensions.
There are simple examples where bounds become uninformative in high dimensions, such as when sampling a density that is uniform over a hypersphere using hyperrectangular search regions. In this case, little is gained over vanilla rejection sampling. An open question is whether the split between i(·) and o(·) can be adapted to be node-specific during the search. An adaptive rejection sampler would be able to do this, which would allow leveraging parameter-varying bounds in the proposal distributions. This might be an important degree of freedom to exercise, particularly when scaling up to higher dimensions.

There are several possible follow-ons, including the discrete version of A∗ sampling and evaluating A∗ sampling as an estimator of the log partition function. In future work, we would like to explore taking advantage of conditional independence structure to perform more intelligent search, hopefully helping the method scale to larger dimensions. Example starting points might be ideas from AND/OR search [20] or branch and bound algorithms that only branch on a subset of dimensions [21].

Acknowledgments

This research was supported by NSERC. We thank James Martens and Radford Neal for helpful discussions, Elad Mezuman for help developing early ideas related to this work, and Roger Grosse for suggestions that greatly improved this work.

References

[1] G. Papandreou and A. Yuille. Perturb-and-MAP Random Fields: Using Discrete Optimization to Learn and Sample from Energy Models. In ICCV, pages 193–200, November 2011.

[2] Daniel Tarlow, Ryan Prescott Adams, and Richard S Zemel. Randomized Optimum Models for Structured Prediction. In AISTATS, pages 21–23, 2012.

[3] Tamir Hazan and Tommi S Jaakkola.
On the Partition Function and Random Maximum A-Posteriori Perturbations. In ICML, pages 991–998, 2012.

[4] Stefano Ermon, Carla P Gomes, Ashish Sabharwal, and Bart Selman. Embed and Project: Discrete Sampling with Universal Hashing. In NIPS, pages 2085–2093, 2013.

[5] George Papandreou and Alan L Yuille. Gaussian Sampling by Local Perturbations. In NIPS, pages 1858–1866, 2010.

[6] W.R. Gilks and P. Wild. Adaptive Rejection Sampling for Gibbs Sampling. Applied Statistics, 41(2):337–348, 1992.

[7] Marc Dymetman, Guillaume Bouchard, and Simon Carter. The OS* Algorithm: a Joint Approach to Exact Optimization and Sampling. arXiv preprint arXiv:1207.0742, 2012.

[8] V Mansinghka, D Roy, E Jonas, and J Tenenbaum. Exact and Approximate Sampling by Systematic Stochastic Search. JMLR, 5:400–407, 2009.

[9] James Gary Propp and David Bruce Wilson. Exact Sampling with Coupled Markov Chains and Applications to Statistical Mechanics. Random Structures and Algorithms, 9(1-2):223–252, 1996.

[10] Antonietta Mira, Jesper Moller, and Gareth O Roberts. Perfect Slice Samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(3):593–606, 2001.

[11] Faheem Mitha. Perfect Sampling on Continuous State Spaces. PhD thesis, University of North Carolina, Chapel Hill, 2003.

[12] Hannes Malmberg. Random Choice over a Continuous Set of Options. Master's thesis, Department of Mathematics, Stockholm University, 2013.

[13] E. J. Gumbel and J. Lieblein. Statistical Theory of Extreme Values and Some Practical Applications: a Series of Lectures. US Govt. Print. Office, 1954.

[14] John I. Yellott Jr. The Relationship between Luce's Choice Axiom, Thurstone's Theory of Comparative Judgment, and the Double Exponential Distribution. Journal of Mathematical Psychology, 15(2):109–144, 1977.

[15] Thomas P Minka.
Expectation Propagation for Approximate Bayesian Inference. In UAI, pages 362–369. Morgan Kaufmann Publishers Inc., 2001.

[16] Eldon Hansen and G William Walster. Global Optimization Using Interval Analysis: Revised and Expanded, volume 264. CRC Press, 2003.

[17] Dmitrey Kroshko. FuncDesigner. http://openopt.org/FuncDesigner, June 2014.

[18] Radford M Neal. Slice Sampling. Annals of Statistics, pages 705–741, 2003.

[19] Tamir Hazan, Subhransu Maji, and Tommi Jaakkola. On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations. In NIPS, pages 1268–1276, 2013.

[20] Robert Eugeniu Mateescu. AND/OR Search Spaces for Graphical Models. PhD thesis, University of California, 2007.

[21] Manmohan Chandraker and David Kriegman. Globally Optimal Bilinear Programming for Computer Vision Applications. In CVPR, pages 1–8, 2008.