{"title": "Bayesian Optimization for Probabilistic Programs", "book": "Advances in Neural Information Processing Systems", "page_first": 280, "page_last": 288, "abstract": "We present the first general purpose framework for marginal maximum a posteriori estimation of probabilistic program variables. By using a series of code transformations, the evidence of any probabilistic program, and therefore of any graphical model, can be optimized with respect to an arbitrary subset of its sampled variables. To carry out this optimization, we develop the first Bayesian optimization package to directly exploit the source code of its target, leading to innovations in problem-independent hyperpriors, unbounded optimization, and implicit constraint satisfaction; delivering significant performance improvements over prominent existing packages. We present applications of our method to a number of tasks including engineering design and parameter optimization.", "full_text": "Bayesian Optimization for Probabilistic Programs\n\nTom Rainforth\u2020 Tuan Anh Le\u2020\n\nJan-Willem van de Meent\u2021\n\nMichael A. Osborne\u2020\n\nFrank Wood\u2020\n\n\u2020 Department of Engineering Science, University of Oxford\n\n\u2021 College of Computer and Information Science, Northeastern University\n\n{twgr,tuananh,mosb,fwood}@robots.ox.ac.uk, j.vandemeent@northeastern.edu\n\nAbstract\n\nWe present the \ufb01rst general purpose framework for marginal maximum a pos-\nteriori estimation of probabilistic program variables. By using a series of code\ntransformations, the evidence of any probabilistic program, and therefore of any\ngraphical model, can be optimized with respect to an arbitrary subset of its sampled\nvariables. 
To carry out this optimization, we develop the \ufb01rst Bayesian optimization\npackage to directly exploit the source code of its target, leading to innovations in\nproblem-independent hyperpriors, unbounded optimization, and implicit constraint\nsatisfaction; delivering signi\ufb01cant performance improvements over prominent exist-\ning packages. We present applications of our method to a number of tasks including\nengineering design and parameter optimization.\n\n1\n\nIntroduction\n\nProbabilistic programming systems (PPS) allow probabilistic models to be represented in the form\nof a generative model and statements for conditioning on data [4, 9, 10, 16, 17, 29]. Their core\nphilosophy is to decouple model speci\ufb01cation and inference, the former corresponding to the user-\nspeci\ufb01ed program code and the latter to an inference engine capable of operating on arbitrary\nprograms. Removing the need for users to write inference algorithms signi\ufb01cantly reduces the burden\nof developing new models and makes effective statistical methods accessible to non-experts.\nAlthough signi\ufb01cant progress has been made on the problem of general purpose inference of program\nvariables, less attention has been given to their optimization. Optimization is an essential tool for\neffective machine learning, necessary when the user requires a single estimate. It also often forms a\ntractable alternative when full inference is infeasible [18]. Moreover, coincident optimization and\ninference is often required, corresponding to a marginal maximum a posteriori (MMAP) setting\nwhere one wishes to maximize some variables, while marginalizing out others. 
Examples of MMAP\nproblems include hyperparameter optimization, expectation maximization, and policy search [27].\nIn this paper we develop the \ufb01rst system that extends probabilistic programming (PP) to this more\ngeneral MMAP framework, wherein the user speci\ufb01es a model in the same manner as existing\nsystems, but then selects some subset of the sampled variables in the program to be optimized, with\nthe rest marginalized out using existing inference algorithms. The optimization query we introduce\ncan be implemented and utilized in any PPS that supports an inference method returning a marginal\nlikelihood estimate. This framework increases the scope of models that can be expressed in PPS and\ngives additional \ufb02exibility in the outputs a user can request from the program.\nMMAP estimation is dif\ufb01cult as it corresponds to the optimization of an intractable integral, such that\nthe optimization target is expensive to evaluate and gives noisy results. Current PPS inference engines\nare typically unsuited to such settings. We therefore introduce BOPP1 (Bayesian optimization for\nprobabilistic programs) which couples existing inference algorithms from PPS, like Anglican [29],\nwith a new Gaussian process (GP) [22] based Bayesian optimization (BO) [11, 15, 20, 23] package.\n\n1Code available at http://www.github.com/probprog/bopp/\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fFigure 1: Simulation-based optimization of radiator powers subject to varying solar intensity. Shown\nare output heat maps from Energy2D [30] simulations at one intensity, corresponding from left to\nright to setting all the radiators to the same power, the best result from a set of randomly chosen\npowers, and the best setup found after 100 iterations of BOPP. 
The far right plot shows convergence of the evidence of the respective model, giving the median and 25/75% quartiles.\n\n(defopt house-heating [alphas] [powers]\n  (let [solar-intensity (sample weather-prior)\n        powers (sample (dirichlet alphas))\n        temperatures (simulate solar-intensity powers)]\n    (observe abc-likelihood temperatures)))\n\nFigure 2: BOPP query for optimizing the power allocation to radiators in a house. Here weather-prior is a distribution over the solar intensity and a uniform Dirichlet prior with concentration alphas is placed over the powers. Calling simulate performs an Energy2D simulation of house temperatures. The utility of the resulting output is conditioned upon using abc-likelihood. Calling doopt on this query invokes the BOPP algorithm to perform MMAP estimation, where the second input powers indicates the variable to be optimized.\n\nTo demonstrate the functionality provided by BOPP, we consider an example application of engineering design. Engineering design relies extensively on simulations which typically have two things in common: the desire of the user to find a single best design and an uncertainty in the environment in which the designed component will live. Even when these simulations are deterministic, this is an approximation to a truly stochastic world. By expressing the utility of a particular design-environment combination using an approximate Bayesian computation (ABC) likelihood [5], one can pose this as a MMAP problem, optimizing the design while marginalizing out the environmental uncertainty.\nFigure 1 illustrates how BOPP can be applied to engineering design, taking the example of optimizing the distribution of power between radiators in a house so as to homogenize the temperature, while marginalizing out possible weather conditions and subject to a total energy budget. 
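The ABC likelihood used in this query can be illustrated independently of any PPS: a common choice is a Gaussian kernel on the distance between the simulator output and a desired summary. The sketch below is purely illustrative; the simulator, target, and epsilon parameter are hypothetical stand-ins, not the Energy2D setup.

```python
import math
import random

def abc_log_likelihood(simulated, target, epsilon):
    """Gaussian ABC kernel: log p is proportional to -d(simulated, target)^2 / (2 eps^2)."""
    d2 = sum((s - t) ** 2 for s, t in zip(simulated, target))
    return -d2 / (2.0 * epsilon ** 2)

# Hypothetical "simulator": room temperatures as a noisy function of power split.
def simulate_temperatures(powers, rng):
    return [20.0 + 5.0 * p + rng.gauss(0.0, 0.1) for p in powers]

rng = random.Random(0)
target = [22.0, 22.0, 22.0]          # homogeneous temperature target
even = simulate_temperatures([0.4, 0.4, 0.4], rng)
skew = simulate_temperatures([1.0, 0.1, 0.1], rng)
# A more even power allocation should score higher under the ABC likelihood.
print(abc_log_likelihood(even, target, 0.5) > abc_log_likelihood(skew, target, 0.5))
```

Marginalizing such a likelihood over the environment (here, the weather prior) is exactly what turns a deterministic design problem into the MMAP problem described above.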
The probabilistic program shown in Figure 2 allows us to define a prior over the uncertain weather, while conditioning on the output of a deterministic simulator (here Energy2D [30], a finite element package for heat transfer) using an ABC likelihood. BOPP now allows the required coincident inference and optimization to be carried out automatically, directly returning increasingly optimal configurations.\nBO is an attractive choice for the required optimization in MMAP as it is typically efficient in the number of target evaluations, operates on non-differentiable targets, and incorporates noise in the target function evaluations. However, applying BO to probabilistic programs presents challenges, such as the need to give robust performance on a wide range of problems with varying scaling and potentially unbounded support. Furthermore, the target program may contain unknown constraints, implicitly defined by the generative model, and variables whose type is unknown (i.e. they may be continuous or discrete).\nOn the other hand, the availability of the target source code in a PPS presents opportunities to overcome these issues and go beyond what can be done with existing BO packages. BOPP exploits the source code in a number of ways, such as optimizing the acquisition function using the original generative model to ensure the solution satisfies the implicit constraints, performing adaptive domain scaling to ensure that GP kernel hyperparameters can be set according to problem-independent hyperpriors, and defining an adaptive non-stationary mean function to support unbounded BO. Together, these innovations mean that BOPP can be run in a manner that is fully black-box from the user's perspective, requiring only the identification of the target variables relative to current syntax for operating on arbitrary programs. 
We further show that BOPP is competitive with existing BO engines for direct optimization on common benchmark problems that do not require marginalization.\n\n2 Background\n\n2.1 Probabilistic Programming\n\nProbabilistic programming systems allow users to define probabilistic models using a domain-specific programming language. A probabilistic program implicitly defines a distribution on random variables, whilst the system back-end implements general-purpose inference methods.\nPPS such as Infer.Net [17] and Stan [4] can be thought of as defining graphical models or factor graphs. Our focus will instead be on systems such as Church [9], Venture [16], WebPPL [10], and Anglican [29], which employ a general-purpose programming language for model specification. In these systems, the set of random variables is dynamically typed, such that it is possible to write programs in which this set differs from execution to execution. This allows an unspecified number of random variables and incorporation of arbitrary black box deterministic functions, such as was exploited by the simulate function in Figure 2. The price for this expressivity is that inference methods must be formulated in such a manner that they are applicable to models where the density function is intractable and can only be evaluated during forwards simulation of the program.\nOne such general purpose system, Anglican, will be used as a reference in this paper. In Anglican, models are defined using the inference macro defquery. These models, which we refer to as queries [9], specify a joint distribution p(Y, X) over data Y and variables X. 
Inference on the model is performed using the macro doquery, which produces a sequence of approximate samples from the conditional distribution p(X|Y) and, for importance sampling based inference algorithms (e.g. sequential Monte Carlo), a marginal likelihood estimate p(Y).\nRandom variables in an Anglican program are specified using sample statements, which can be thought of as terms in the prior. Conditioning is specified using observe statements which can be thought of as likelihood terms. Outputs of the program, taking the form of posterior samples, are indicated by the return values. There is a finite set of sample and observe statements in a program source code, but the number of times each statement is called can vary between executions. We refer the reader to http://www.robots.ox.ac.uk/~fwood/anglican/ for more details.\n\n2.2 Bayesian Optimization\n\nConsider an arbitrary black-box target function f : ϑ → R that can be evaluated for an arbitrary point θ ∈ ϑ to produce, potentially noisy, outputs ŵ ∈ R. BO [15, 20] aims to find the global maximum\n\nθ* = argmax_{θ∈ϑ} f(θ).   (1)\n\nThe key idea of BO is to place a prior on f that expresses belief about the space of functions within which f might live. When the function is evaluated, the resultant information is incorporated by conditioning upon the observed data to give a posterior over functions. This allows estimation of the expected value and uncertainty in f(θ) for all θ ∈ ϑ. From this, an acquisition function ζ : ϑ → R is defined, which assigns an expected utility to evaluating f at particular θ, based on the trade-off between exploration and exploitation in finding the maximum. 
When direct evaluation of f is expensive, the acquisition function constitutes a cheaper to evaluate substitute, which is optimized to ascertain the next point at which the target function should be evaluated in a sequential fashion. By interleaving optimization of the acquisition function, evaluating f at the suggested point, and updating the surrogate, BO forms a global optimization algorithm that is typically very efficient in the required number of function evaluations, whilst naturally dealing with noise in the outputs. Although alternatives such as random forests [3, 14] or neural networks [26] exist, the most common prior used for f is a GP [22]. For further information on BO we refer the reader to the recent review by Shahriari et al [24].\n\n3 Problem Formulation\n\nGiven a program defining the joint density p(Y, X, θ) with fixed Y, our aim is to optimize with respect to a subset of the variables θ whilst marginalizing out latent variables X\n\nθ* = argmax_{θ∈ϑ} p(θ|Y) = argmax_{θ∈ϑ} p(Y, θ) = argmax_{θ∈ϑ} ∫ p(Y, X, θ) dX.   (2)\n\nTo provide syntax to differentiate between θ and X, we introduce a new query macro defopt. The syntax of defopt is identical to defquery except that it has an additional input identifying the variables to be optimized. To allow for the interleaving of inference and optimization required in MMAP estimation, we further introduce doopt, which, analogous to doquery, returns a lazy sequence {θ̂*m, Ω̂*m, û*m}m=1,... where Ω̂*m ⊆ X are the program outputs associated with θ = θ̂*m and each û*m ∈ R+ is an estimate of the corresponding log marginal log p(Y, θ̂*m) (see Section 4.2). 
The\nsequence is de\ufb01ned such that, at any time, \u02c6\u03b8\u2217m corresponds to the point expected to be most optimal\nof those evaluated so far and allows both inference and optimization to be carried out online.\nAlthough no restrictions are placed on X, it is necessary to place some restrictions on how programs\nuse the optimization variables \u03b8 = \u03c61:K speci\ufb01ed by the optimization argument list of defopt.\nFirst, each optimization variable \u03c6k must be bound to a value directly by a sample statement with\n\ufb01xed measure-type distribution argument. This avoids change of variable complications arising from\nnonlinear deterministic mappings. Second, in order for the optimization to be well de\ufb01ned, the\nprogram must be written such that any possible execution trace binds each optimization variable \u03c6k\nexactly once. Finally, although any \u03c6k may be lexically multiply bound, it must have the same base\nmeasure in all possible execution traces, because, for instance, if the base measure of a \u03c6k were to\nchange from Lebesgue to counting, the notion of optimality would no longer admit a conventional\ninterpretation. 
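To make (2) concrete, consider a toy model where everything is Gaussian and the answer is known in closed form: with p(θ) = Normal(0, 1), p(X) = Normal(0, 1) and p(Y|X, θ) = Normal(X + θ, 1), the marginal is p(Y, θ) = N(θ; 0, 1) N(Y; θ, 2), which is maximized at θ* = Y/3. The following grid-plus-Monte-Carlo sketch is purely illustrative (BOPP replaces the grid with BO and the hand-written weights with inference on q-marg), but it recovers this optimum:

```python
import math
import random

def normal_logpdf(x, mu, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def log_marginal(theta, y, xs):
    # Monte Carlo estimate of log p(Y, theta) = log p(theta) + log E_x[p(Y | x, theta)]
    logw = [normal_logpdf(y, x + theta, 1.0) for x in xs]
    m = max(logw)
    log_mean = m + math.log(sum(math.exp(l - m) for l in logw) / len(logw))
    return normal_logpdf(theta, 0.0, 1.0) + log_mean

rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(20000)]   # common random numbers across theta
y = 1.5
grid = [i * 0.05 - 1.0 for i in range(61)]          # theta in [-1, 2]
theta_star = max(grid, key=lambda t: log_marginal(t, y, xs))
print(theta_star)  # should lie near the analytic MMAP value Y/3 = 0.5
```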
Note that although the transformation implementations shown in Figure 3 do not contain runtime exception generators that disallow continued execution of programs that violate these constraints, those actually implemented in the BOPP system do.\n\n4 Bayesian Program Optimization\n\nIn addition to the syntax introduced in the previous section, there are five main components to BOPP:\n- A program transformation, q→q-marg, allowing estimation of the evidence p(Y, θ) at a fixed θ.\n- A high-performance, GP based, BO implementation for actively sampling θ.\n- A program transformation, q→q-prior, used for automatic and adaptive domain scaling, such that a problem-independent hyperprior can be placed over the GP hyperparameters.\n- An adaptive non-stationary mean function to support unbounded optimization.\n- A program transformation, q→q-acq, and annealing maximum likelihood estimation method to optimize the acquisition function subject to the implicit constraints imposed by the generative model.\n\nTogether these allow BOPP to perform online MMAP estimation for arbitrary programs in a manner that is black-box from the user's perspective - requiring only the definition of the target program in the same way as existing PPS and identifying which variables to optimize. The BO component of BOPP is both probabilistic programming and language independent, and is provided as a stand-alone package.2 It requires as input only a target function, a sampler to establish rough input scaling, and a problem-specific optimizer for the acquisition function that imposes the problem constraints.\nFigure 3 provides a high level overview of the algorithm invoked when doopt is called on a query q that defines a distribution p(Y, a, θ, b). We wish to optimize θ whilst marginalizing out a and b, as indicated by the second input to q. 
In summary, BOPP performs iterative optimization in 5 steps:\n\n- Step 1 (blue arrows) generates unweighted samples from the transformed prior program q-prior (top center), constructed by removing all conditioning. This initializes the domain scaling for θ.\n- Step 2 (red arrows) evaluates the marginal p(Y, θ) at a small number of the generated θ̂ by performing inference on the marginal program q-marg (middle centre), which returns samples from the distribution p(a, b|Y, θ) along with an estimate of p(Y, θ). The evaluated points (middle right) provide an initial domain scaling of the outputs and starting points for the BO surrogate.\n- Step 3 (black arrow) fits a mixture of GPs posterior [22] to the scaled data (bottom centre) using a problem independent hyperprior. The solid blue line and shaded area show the posterior mean and ±2 standard deviations respectively. The new estimate of the optimum θ̂* is the value for which the mean estimate is largest, with û* equal to the corresponding mean value.\n\n2Code available at http://www.github.com/probprog/deodorant/\n\nFigure 3: Overview of the BOPP algorithm, description given in main text. p-a, p-θ, p-b and lik all represent distribution object constructors. factor is a special distribution constructor that assigns probability p(y) = y, in this case y = ζ(θ).\n- Step 4 (purple arrows) constructs an acquisition function ζ : ϑ → R+ (bottom left) using the GP posterior. 
This is optimized, giving the next point to evaluate \u02c6\u03b8next, by performing annealed impor-\ntance sampling on a transformed program q-acq (middle left) in which all observe statements\nare removed and replaced with a single observe assigning probability \u03b6(\u03b8) to the execution.\n\n- Step 5 (green arrow) evaluates \u02c6\u03b8next using q-marg and continues to step 3.\n\n4.1 Program Transformation to Generate the Target\n\nConsider the defopt query q in Figure 3, the body of which de\ufb01nes the joint distribution p (Y, a, \u03b8, b).\nCalculating (2) (de\ufb01ning X = {a, b}) using a standard optimization scheme presents two issues: \u03b8 is\na random variable within the program rather than something we control and its probability distribution\nis only de\ufb01ned conditioned on a.\nWe deal with both these issues simultaneously using a program transformation similar to the disin-\ntegration transformation in Hakaru [31]. Our marginal transformation returns a new query object,\nq-marg as shown in Figure 3, that de\ufb01nes the same joint distribution on program variables and\ninputs, but now accepts the value for \u03b8 as an input. This is done by replacing all sample statements\nassociated with \u03b8 with equivalent observe<- statements, taking \u03b8 as the observed value, where\nobserve<- is identical to observe except that it returns the observed value. As both sample and\nobserve operate on the same variable type - a distribution object - this transformation can always\nbe made, while the identical returns of sample and observe<- trivially ensures validity of the\ntransformed program.\n\n4.2 Bayesian Optimization of the Marginal\n\nThe target function for our BO scheme is log p(Y, \u03b8), noting argmax f (\u03b8) = argmax log f (\u03b8) for\nany f : \u03d1 \u2192 R+. The log is taken because GPs have unbounded support, while p (Y, \u03b8) is always\npositive, and because we expect variations over many orders of magnitude. 
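This target and its noise can be mimicked in a few lines for a toy Gaussian model (an illustrative sketch, not the Anglican machinery): forward-simulate the latent at each sample statement, accumulate the observe log-weight, and average the weights. Here x ~ Normal(0, 1) and y|x, θ ~ Normal(x + θ, 1), so the true value log N(y; θ, 2) is available for comparison:

```python
import math
import random

def normal_logpdf(x, mu, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def log_marginal_estimate(theta, y, n, rng):
    # One noisy importance-sampling estimate of log p(Y | theta):
    # forward-simulate the latent (the "sample") and weight by the "observe".
    logw = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        logw.append(normal_logpdf(y, x + theta, 1.0))
    m = max(logw)
    return m + math.log(sum(math.exp(l - m) for l in logw) / n)

rng = random.Random(2)
theta, y = 0.5, 1.5
truth = normal_logpdf(y, theta, 2.0)  # analytic log p(Y | theta) for this toy model
estimates = [log_marginal_estimate(theta, y, 5000, rng) for _ in range(5)]
print(truth, estimates)  # the estimates scatter around the analytic value
```

Repeated calls give different values, which is precisely why the BO surrogate must accommodate observation noise.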
PPS with importance sampling based inference engines, e.g. sequential Monte Carlo [29] or the particle cascade [21], can return noisy estimates of this target given the transformed program q-marg.\n\nThe program listings embedded in Figure 3 are reproduced below:\n\n(defopt q [y] [θ]\n  (let [a (sample (p-a))\n        θ (sample (p-θ a))\n        b (sample (p-b a θ))]\n    (observe (lik a θ b) y)\n    [a b]))\n\n(a) Original query\n\n(defquery q-marg [y θ̂]\n  (let [a (sample (p-a))\n        θ (observe<- (p-θ a) θ̂)\n        b (sample (p-b a θ))]\n    (observe (lik a θ b) y)\n    [a b]))\n\n(b) Conditional query\n\nLeft: a simple example optimization query where we want to optimize θ. Right: the same query after the transformation applied by BOPP to make the query amenable to optimization. Note p-a represents a distribution object, whilst p-θ, p-b and lik all represent functions which return distribution objects.\n\n(defquery q-prior [y]\n  (let [a (sample (p-a))\n        θ (sample (p-θ a))]\n    θ))\n\n(a) Prior query\n\n(defquery q-acq [y ζ]\n  (let [a (sample (p-a))\n        θ (sample (p-θ a))]\n    (observe (factor) (ζ θ))\n    θ))\n\n(b) Acquisition query\n\nLeft: a transformation of q that samples from the prior p(θ). Right: a transformation of q used in the optimization of the acquisition function. Observing from factor assigns a probability exp ζ(θ) to the execution, i.e. (factor) returns a distribution object for which the log probability density function is the identity function.\n\nOur BO scheme uses a GP prior and a Gaussian likelihood. Though the rationale for the latter is predominantly computational, giving an analytic posterior, there are also theoretical results suggesting that this choice is appropriate [2]. We use as a default covariance function a combination of a Matérn-3/2 and Matérn-5/2 kernel. By using automatic domain scaling as described in the next section, problem independent priors are placed over the GP hyperparameters such as the length scales and observation noise. Inference over hyperparameters is performed using Hamiltonian Monte Carlo (HMC) [6], giving an unweighted mixture of GPs. Each term in this mixture has an analytic distribution fully specified by its mean function µ^i_m : ϑ → R and covariance function k^i_m : ϑ×ϑ → R, where m indexes the BO iteration and i the hyperparameter sample.\nThis posterior is first used to estimate which of the previously evaluated θ̂j is the most optimal, by taking the point with highest expected value, û*m = max_{j∈1...m} (1/N) Σ_{i=1}^N µ^i_m(θ̂j). This completes the definition of the output sequence returned by the doopt macro.
Note that as the posterior updates globally with each new observation, the relative estimated optimality of previously evaluated points changes at each iteration. Secondly it is used to define the acquisition function ζ, for which we take the expected improvement [25], defining σ^i_m(θ) = √(k^i_m(θ, θ)) and γ^i_m(θ) = (µ^i_m(θ) − û*m) / σ^i_m(θ),\n\nζ(θ) = Σ_{i=1}^N [ (µ^i_m(θ) − û*m) Φ(γ^i_m(θ)) + σ^i_m(θ) φ(γ^i_m(θ)) ]   (3)\n\nwhere φ and Φ represent the pdf and cdf of a unit normal distribution respectively. We note that more powerful, but more involved, acquisition functions, e.g. [12], could be used instead.\n\n4.3 Automatic and Adaptive Domain Scaling\n\nDomain scaling, by mapping to a common space, is crucial for BOPP to operate in the required black-box fashion as it allows a general purpose and problem independent hyperprior to be placed on the GP hyperparameters. BOPP therefore employs an affine scaling to a [−1, 1] hypercube for both the inputs and outputs of the GP. To initialize scaling for the input variables, we sample directly from the generative model defined by the program. This is achieved using a second transformed program, q-prior, which removes all conditioning, i.e. observe statements, and returns θ. This transformation also introduces code to terminate execution of the query once all θ are sampled, in order to avoid unnecessary computation. As observe statements return nil, this transformation trivially preserves the generative model of the program, but the probability of the execution changes. 
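The expected improvement terms in (3) require nothing beyond the pdf and cdf of a unit normal. A minimal standalone sketch for a single GP term (plain Python, not the BOPP package):

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, best):
    """EI for one GP term; the acquisition in (3) sums this over hyperparameter samples."""
    if sigma <= 0.0:
        return max(mu - best, 0.0)
    gamma = (mu - best) / sigma
    return (mu - best) * normal_cdf(gamma) + sigma * normal_pdf(gamma)

# At the incumbent (mu == best) EI reduces to sigma * pdf(0): pure exploration value.
print(round(expected_improvement(0.0, 1.0, 0.0), 4))  # 0.3989
```

EI is always non-negative and grows with both the predicted mean and the predictive uncertainty, which is the exploration-exploitation trade-off described in Section 2.2.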
Simulating from the generative model does not require inference or calling potentially expensive likelihood functions and is therefore computationally inexpensive. By running inference on q-marg given a small number of these samples as arguments, a rough initial characterization of output scaling can also be achieved. If points are observed that fall outside the hypercube under the initial scaling, the domain scaling is appropriately updated3 so that the target for the GP remains the [−1, 1] hypercube.\n\n4.4 Unbounded Bayesian Optimization via Non-Stationary Mean Function Adaptation\n\nUnlike standard BO implementations, BOPP is not provided with external constraints and we therefore develop a scheme for operating on targets with potentially unbounded support. Our method exploits the knowledge that the target function is a probability density, implying that the area that must be searched in practice to find the optimum is finite, by defining a non-stationary prior mean function. This takes the form of a bump function that is constant within a region of interest, but decays rapidly outside. Specifically we define this bump function in the transformed space as\n\nµprior(r; re, r∞) = { 0, if r ≤ re;  log((r − re)/(r∞ − re)) + (r − re)/(r∞ − re), otherwise }   (4)\n\nwhere r is the radius from the origin, re is the maximum radius of any point generated in the initial scaling or subsequent evaluations, and r∞ is a parameter set to 1.5re by default. Consequently, the acquisition function also decays and new points are never suggested arbitrarily far away. Adaptation\n\n3An important exception is that the output mapping to the bottom of the hypercube remains fixed such that low likelihood new points are not incorporated. 
This ensures stability when considering unbounded problems.\n\nFigure 4: Convergence on an unconstrained bimodal problem with p(θ) = Normal(0, 0.5) and p(Y|θ) = Normal(5 − |θ|, 0.5) giving significant prior misspecification. The top plots show a regressed GP, with the solid line corresponding to the mean and the shading showing ±2 standard deviations. The bottom plots show the corresponding acquisition functions.\n\nFigure 5: Comparison of BOPP used as an optimizer to prominent BO packages on common benchmark problems. The dashed lines show the final mean error of SMAC (red), Spearmint (green) and TPE (black) as quoted by [7]. The dark blue line shows the mean error for BOPP averaged over 100 runs, whilst the median and 25/75% percentiles are shown in cyan. Results for Spearmint on Branin and SMAC on SVM on-grid are omitted because both BOPP and the respective algorithms averaged zero error to the provided number of significant figures in [7].\n\nof the scaling will automatically update this mean function appropriately, learning a region of interest that matches that of the true problem, without complicating the optimization by over-extending this region. We note that our method shares similarity with the recent work of Shahriari et al [23], but overcomes the sensitivity of their method upon a user-specified bounding box representing soft constraints, by initializing automatically and adapting as more data is observed.\n\n4.5 Optimizing the Acquisition Function\n\nOptimizing the acquisition function for BOPP presents the issue that the query contains implicit constraints that are unknown to the surrogate function. 
The problem of unknown constraints has been previously addressed in the literature [8, 13] by assuming that the constraints take the form of a black-box function, which is modeled with a second surrogate function and must be evaluated in a guess-and-check strategy to establish whether a point is valid. Along with the potentially significant expense such a method incurs, this approach is inappropriate for equality constraints or when the target variables are potentially discrete. For example, the Dirichlet distribution in Figure 2 introduces an equality constraint on powers, namely that its components must sum to 1.

We therefore take an alternative approach based on directly using the program to optimize the acquisition function. To do so, we consider a transformed program q-acq that is identical to q-prior (see Section 4.3), but adds an additional observe statement that assigns a weight ζ(θ) to the execution. By setting ζ(θ) to the acquisition function, the maximum likelihood corresponds to the optimum of the acquisition function subject to the implicit program constraints.
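The essence of this construction can be sketched as follows. All names here (`q_prior`, `acquisition`, the annealing schedule) are hypothetical stand-ins, and the local random-walk LMH moves used by BOPP are replaced by independent draws from the prior purely for brevity; the point is that every candidate is generated by the prior program and therefore satisfies the implicit constraints by construction:

```python
import math
import random

def q_prior():
    # Hypothetical stand-in for the transformed program q-prior: sampling
    # from it automatically satisfies the program's implicit constraints
    # (here, an equality constraint that the two components sum to 1, in
    # the spirit of the Dirichlet example).
    x = random.random()
    return (x, 1.0 - x)

def acquisition(theta):
    # Hypothetical acquisition function zeta(theta), peaked at (0.3, 0.7).
    return math.exp(-50.0 * (theta[0] - 0.3) ** 2)

def optimize_acquisition(n_steps=2000, betas=(1.0, 4.0, 16.0, 64.0)):
    # Annealed MH search over q-acq: raising zeta to increasing powers
    # (the annealing schedule betas) concentrates the chain around the
    # acquisition optimum, while feasibility is guaranteed because all
    # proposals come from the prior program.
    theta = q_prior()
    best = theta
    for beta in betas:
        for _ in range(n_steps):
            prop = q_prior()
            ratio = acquisition(prop) / max(acquisition(theta), 1e-300)
            accept_prob = 1.0 if ratio >= 1.0 else ratio ** beta
            if random.random() < accept_prob:
                theta = prop
            if acquisition(theta) > acquisition(best):
                best = theta
    return best
```

Note that the returned point both maximizes the (toy) acquisition function and satisfies the sum-to-one constraint exactly, without any explicit constraint handling in the optimizer.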
We obtain a maximum likelihood estimate for q-acq using a variant of annealed importance sampling [19], in which lightweight Metropolis Hastings (LMH) [28] with local random-walk moves is used as the base transition kernel.
Figure 6: Convergence for transition dynamics parameters of the Pickover attractor in terms of the cumulative best log p(Y, θ) (left) and distance to the "true" θ used in generating the data (right). Solid lines show the median over 100 runs, whilst the shaded regions show the 25/75% quantiles.

5 Experiments

We first demonstrate the ability of BOPP to carry out unbounded optimization using a 1D problem with a significant prior-posterior mismatch, as shown in Figure 4. It shows BOPP adapting to the target and effectively establishing a maximum in the presence of multiple modes. After 20 evaluations the acquisitions begin to explore the right mode; after 50, both modes have been fully uncovered.
Next we compare BOPP to the prominent BO packages SMAC [14], Spearmint [25] and TPE [3] on a number of classical benchmarks, as shown in Figure 5. These results demonstrate that BOPP provides substantial advantages over these systems when used simply as an optimizer on both continuous and discrete optimization problems. In particular, it offers a large advantage over SMAC and TPE on the continuous problems (Branin and Hartmann), due to using a more powerful surrogate, and over Spearmint on the others due to not needing to make approximations to deal with discrete problems.
Finally we demonstrate the performance of BOPP on an MMAP problem. Comparison here is more difficult due to the dearth of existing alternatives for PPS.
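For reference, the MMAP target log p(Y, θ) for the toy problem of Figure 4 can be written out directly. The following is a hypothetical sketch that treats the second argument of Normal as a standard deviation and leaves the observed value y as a free argument, since its value is not stated in the caption:

```python
import math

def log_normal(x, mu, sigma):
    # Log density of Normal(mu, sigma) evaluated at x.
    return (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
            - (x - mu) ** 2 / (2.0 * sigma ** 2))

def log_joint(theta, y):
    # log p(Y, theta) for the Figure 4 problem:
    #   theta ~ Normal(0, 0.5),  Y | theta ~ Normal(5 - |theta|, 0.5).
    # The absolute value makes the target symmetric in theta, hence bimodal.
    return log_normal(theta, 0.0, 0.5) + log_normal(y, 5.0 - abs(theta), 0.5)
```

The symmetry log_joint(θ, y) = log_joint(−θ, y) is the source of the two modes that BOPP must uncover.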
In particular, simply running inference on the original query does not return estimates for p(Y, θ). We consider the possible alternative of using our conditional code transformation to design a particle marginal Metropolis Hastings (PMMH, [1]) sampler, which operates in a similar fashion to BOPP except that new θ are chosen using an MH step instead of being actively sampled with BO. For these MH steps we consider both LMH [28] with proposals from the prior, and the random-walk MH (RMH) variant introduced in Section 4.5. Results for estimating the dynamics parameters of a chaotic Pickover attractor, while using an extended Kalman smoother to estimate the latent states, are shown in Figure 6. Model details are given in the supplementary material, along with additional experiments.

6 Discussion and Future Work

We have introduced a new method for carrying out MMAP estimation of probabilistic program variables using Bayesian optimization, representing the first unified framework for optimization and inference of probabilistic programs. By using a series of code transformations, our method allows an arbitrary program to be optimized with respect to a defined subset of its variables, whilst marginalizing out the rest. To carry out the required optimization, we introduce a new GP-based BO package that exploits the availability of the target source code to provide a number of novel features, such as automatic domain scaling and constraint satisfaction.
The concepts we introduce lead directly to a number of extensions of interest, including but not restricted to smart initialization of inference algorithms, adaptive proposals, and nested optimization. Further work might consider maximum marginal likelihood estimation and risk minimization. Though these cases require only minor algorithmic changes, they involve distinct theoretical considerations.

Acknowledgements

Tom Rainforth is supported by a BP industrial grant.
Tuan Anh Le is supported by a Google studentship, project code DF6700. Frank Wood is supported under DARPA PPAML through the U.S. AFRL under Cooperative Agreement FA8750-14-2-0006, Sub Award number 61160290-111668.

References
[1] C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.
[2] J. Bérard, P. Del Moral, A. Doucet, et al. A lognormal central limit theorem for particle approximations of normalizing constants. Electronic Journal of Probability, 19(94):1–28, 2014.
[3] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In NIPS, pages 2546–2554, 2011.
[4] B. Carpenter, A. Gelman, M. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. A. Brubaker, J. Guo, P. Li, and A. Riddell. Stan: a probabilistic programming language. Journal of Statistical Software, 2015.
[5] K. Csilléry, M. G. Blum, O. E. Gaggiotti, and O. François. Approximate Bayesian Computation (ABC) in practice. Trends in Ecology & Evolution, 25(7):410–418, 2010.
[6] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, 1987.
[7] K. Eggensperger, M. Feurer, F. Hutter, J. Bergstra, J. Snoek, H. Hoos, and K. Leyton-Brown. Towards an empirical foundation for assessing Bayesian optimization of hyperparameters. In NIPS Workshop on Bayesian Optimization in Theory and Practice, pages 1–5, 2013.
[8] J. R. Gardner, M. J. Kusner, Z. E. Xu, K. Q. Weinberger, and J. Cunningham. Bayesian optimization with inequality constraints. In ICML, pages 937–945, 2014.
[9] N. Goodman, V. Mansinghka, D. M. Roy, K. Bonawitz, and J. B. Tenenbaum. Church: a language for generative models. In UAI, pages 220–229, 2008.
[10] N. D. Goodman and A. Stuhlmüller. The Design and Implementation of Probabilistic Programming Languages.
2014.
[11] M. U. Gutmann and J. Corander. Bayesian optimization for likelihood-free inference of simulator-based statistical models. JMLR, 17:1–47, 2016.
[12] J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In NIPS, pages 918–926, 2014.
[13] J. M. Hernández-Lobato, M. A. Gelbart, R. P. Adams, M. W. Hoffman, and Z. Ghahramani. A general framework for constrained Bayesian optimization using information-based search. JMLR, 17:1–53, 2016.
[14] F. Hutter, H. H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent Optimization, pages 507–523. Springer, 2011.
[15] D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.
[16] V. Mansinghka, D. Selsam, and Y. Perov. Venture: a higher-order probabilistic programming platform with programmable inference. arXiv preprint arXiv:1404.0099, 2014.
[17] T. Minka, J. Winn, J. Guiver, and D. Knowles. Infer.NET 2.4, Microsoft Research Cambridge, 2010.
[18] K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[19] R. M. Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.
[20] M. A. Osborne, R. Garnett, and S. J. Roberts. Gaussian processes for global optimization. In 3rd International Conference on Learning and Intelligent Optimization (LION3), pages 1–15, 2009.
[21] B. Paige, F. Wood, A. Doucet, and Y. W. Teh. Asynchronous anytime sequential Monte Carlo. In NIPS, pages 3410–3418, 2014.
[22] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[23] B. Shahriari, A. Bouchard-Côté, and N. de Freitas. Unbounded Bayesian optimization via regularization. In AISTATS, 2016.
[24] B.
Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.
[25] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In NIPS, pages 2951–2959, 2012.
[26] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Ali, R. P. Adams, et al. Scalable Bayesian optimization using deep neural networks. In ICML, 2015.
[27] J.-W. van de Meent, B. Paige, D. Tolpin, and F. Wood. Black-box policy search with probabilistic programs. In AISTATS, pages 1195–1204, 2016.
[28] D. Wingate, A. Stuhlmueller, and N. D. Goodman. Lightweight implementations of probabilistic programming languages via transformational compilation. In AISTATS, pages 770–778, 2011.
[29] F. Wood, J. W. van de Meent, and V. Mansinghka. A new approach to probabilistic programming inference. In AISTATS, pages 2–46, 2014.
[30] C. Xie. Interactive heat transfer simulations for everyone. The Physics Teacher, 50(4), 2012.
[31] R. Zinkov and C.-C. Shan. Composing inference algorithms as program transformations. arXiv preprint arXiv:1603.01882, 2016.