{"title": "Parallel Predictive Entropy Search for Batch Global Optimization of Expensive Objective Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 3330, "page_last": 3338, "abstract": "We develop \\textit{parallel predictive entropy search} (PPES), a novel algorithm for Bayesian optimization of expensive black-box objective functions. At each iteration, PPES aims to select a \\textit{batch} of points which will maximize the information gain about the global maximizer of the objective. Well known strategies exist for suggesting a single evaluation point based on previous observations, while far fewer are known for selecting batches of points to evaluate in parallel. The few batch selection schemes that have been studied all resort to greedy methods to compute an optimal batch. To the best of our knowledge, PPES is the first non-greedy batch Bayesian optimization strategy. We demonstrate the benefit of this approach in optimization performance on both synthetic and real world applications, including problems in machine learning, rocket science and robotics.", "full_text": "Parallel Predictive Entropy Search for Batch Global Optimization of Expensive Objective Functions\n\nAmar Shah\nDepartment of Engineering\nCambridge University\nas793@cam.ac.uk\n\nZoubin Ghahramani\nDepartment of Engineering\nUniversity of Cambridge\nzoubin@eng.cam.ac.uk\n\nAbstract\n\nWe develop parallel predictive entropy search (PPES), a novel algorithm for Bayesian optimization of expensive black-box objective functions. At each iteration, PPES aims to select a batch of points which will maximize the information gain about the global maximizer of the objective. Well known strategies exist for suggesting a single evaluation point based on previous observations, while far fewer are known for selecting batches of points to evaluate in parallel. 
The few batch selection schemes that have been studied all resort to greedy methods to compute an optimal batch. To the best of our knowledge, PPES is the first non-greedy batch Bayesian optimization strategy. We demonstrate the benefit of this approach in optimization performance on both synthetic and real world applications, including problems in machine learning, rocket science and robotics.\n\n1 Introduction\n\nFinding the global maximizer of a non-concave objective function based on sequential, noisy observations is a fundamental problem in various real world domains, e.g. engineering design [1], finance [2] and algorithm optimization [3]. We are interested in objective functions which are unknown but may be evaluated pointwise at some expense, be it computational, economical or other. The challenge is to find the maximizer of the expensive objective function in as few sequential queries as possible, in order to minimize the total expense.\n\nA Bayesian approach to this problem probabilistically models the unknown objective function, f. Based on the posterior belief about f given evaluations of the objective function, one can decide where to evaluate f next in order to maximize a chosen utility function. Bayesian optimization [4] has been successfully applied in a range of difficult, expensive global optimization tasks, including optimizing a robot controller to maximize gait speed [5] and discovering a chemical derivative of a particular molecule which best treats a particular disease [6].\n\nTwo key choices need to be made when implementing a Bayesian optimization algorithm: (i) a model choice for f and (ii) a strategy for deciding where to evaluate f next. A common approach for modeling f is to use a Gaussian process prior [7], as it is highly flexible and amenable to analytic calculations. However, other models have been shown to be useful in some Bayesian optimization tasks, e.g. 
Student-t process priors [8] and deep neural networks [9]. Most research in the Bayesian optimization literature considers the problem of deciding how to choose a single location where f should be evaluated next. However, it is often possible to probe several points in parallel. For example, you may possess 2 identical robots on which you can test different gait parameters in parallel. Or your computer may have multiple cores on which you can run algorithms in parallel with different hyperparameter settings.\n\nWhilst there are many established strategies to select a single point to probe next, e.g. expected improvement, probability of improvement and upper confidence bound [10], there are few well known strategies for selecting batches of points. To the best of our knowledge, every batch selection strategy proposed in the literature involves a greedy algorithm, which chooses individual points until the batch is filled. Greedy choice making can be severely detrimental; for example, a greedy approach to the travelling salesman problem could potentially lead to the uniquely worst global solution [11]. In this work, our key contribution is to provide what we believe is the first non-greedy algorithm to choose a batch of points to probe next in the task of parallel global optimization. Our approach is to choose a set of points which, in expectation, maximally reduces our uncertainty about the location of the maximizer of the objective function. The algorithm we develop, parallel predictive entropy search, extends the methods of [12, 13] to multiple point batch selection. In Section 2, we formalize the problem and discuss previous approaches before developing parallel predictive entropy search in Section 3. 
Finally, we demonstrate the benefit of our non-greedy strategy on synthetic as well as real-world objective functions in Section 4.\n\n2 Problem Statement and Background\n\nOur aim is to maximize an objective function f : X → R, which is unknown but can be (noisily) evaluated pointwise at multiple locations in parallel. In this work, we assume X is a compact subset of R^D. At each decision, we must select a set of Q points S_t = {x_{t,1}, ..., x_{t,Q}} ⊂ X, where the objective function would next be evaluated in parallel. Each evaluation leads to a scalar observation y_{t,q} = f(x_{t,q}) + ε_{t,q}, where we assume ε_{t,q} ~ N(0, σ²) i.i.d. We wish to minimize a future regret, r_T = f(x*) − f(x̃_T), where x* ∈ argmax_{x∈X} f(x) is an optimal decision (assumed to exist) and x̃_T is our guess of where the maximizer of f is after evaluating T batches of input points. It is highly intractable to make decisions T steps ahead in the setting described, therefore it is common to consider the regret of the very next decision. In this work, we shall assume f is a draw from a Gaussian process with constant mean λ ∈ R and differentiable kernel function k : X² → R.\n\nMost Bayesian optimization research focuses on choosing a single point to query at each decision, i.e. Q = 1. A popular strategy in this setting is to choose the point with highest expected improvement over the current best evaluation, i.e. the maximizer of a_EI(x|D) = E[max(f(x) − f(x_best), 0) | D] = σ(x)[φ(τ(x)) + τ(x)Φ(τ(x))], where D is the set of observations, x_best is the best evaluation point so far, μ(x) = E[f(x)|D], σ(x) = √(Var[f(x)|D]), τ(x) = (μ(x) − f(x_best))/σ(x), and φ(.) and Φ(.) are the standard Gaussian p.d.f. and c.d.f.\n\nAside from being an intuitive approach, a key advantage of the expected improvement strategy is that it is computable analytically and is infinitely differentiable, making the problem of finding argmax_{x∈X} a_EI(x|D) amenable to a plethora of gradient based optimization methods. Unfortunately, the corresponding strategy for selecting Q > 1 points to evaluate in parallel does not lead to an analytic expression. [14] considered an approach which sequentially used the EI criterion to greedily choose a batch of points to query next, which [3] formalized and utilized by defining a_EI-MCMC(x | D, {x_{q'}}_{q'=1}^{q}) = ∫ a_EI(x | D ∪ {x_{q'}, y_{q'}}_{q'=1}^{q}) p({y_{q'}}_{q'=1}^{q} | D, {x_{q'}}_{q'=1}^{q}) dy_1 ... dy_q, i.e. the expected gain in evaluating x after evaluating {x_{q'}, y_{q'}}_{q'=1}^{q}, which can be approximated using Monte Carlo samples, hence the name EI-MCMC. Choosing a batch of points S_t using the EI-MCMC policy is doubly greedy: (i) the EI criterion is greedy as it inherently aims to minimize one-step regret, r_t, and (ii) the EI-MCMC approach starts with an empty set and populates it sequentially (and hence greedily), deciding the best single point to include until |S_t| = Q.\n\nA similar but different approach called simulated matching (SM) was introduced by [15]. Let π be a baseline policy which chooses a single point to evaluate next (e.g. EI). SM aims to select a batch S_t of size Q which includes a point 'close to' the best point which π would have chosen when applied sequentially Q times, with high probability. 
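The analytic EI expression above is straightforward to implement. The following is a minimal NumPy/SciPy sketch, not the paper's code; the function and variable names are ours:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """Analytic EI for maximization:
    a_EI(x|D) = sigma(x) * [phi(tau(x)) + tau(x) * Phi(tau(x))],
    where tau(x) = (mu(x) - f(x_best)) / sigma(x)."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    tau = (mu - f_best) / sigma
    # phi is the standard Gaussian pdf, Phi its cdf
    return sigma * (norm.pdf(tau) + tau * norm.cdf(tau))
```

Because this expression is smooth in the posterior mean and standard deviation, it can be fed directly to a gradient-based optimizer, which is exactly the advantage noted above.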
Formally, SM aims to maximize\n\na_SM(S_t|D) = −E_{S_π^Q}[ E_f[ min_{x∈S_t} (x − argmax_{x'∈S_π^Q} f(x'))² | D, S_π^Q ] ],\n\nwhere S_π^Q is the set of Q points which policy π would query if employed sequentially. A greedy k-medoids based algorithm is proposed to approximately maximize the objective, which the authors justify by the submodularity of the objective function.\n\nThe upper confidence bound (UCB) strategy [16] is another method used by practitioners to decide where to evaluate an objective function next. The UCB approach is to maximize a_UCB(x|D) = μ(x) + α_t^{1/2} σ(x), where α_t is a domain-specific time-varying positive parameter which trades off exploration and exploitation. In order to extend this approach to the parallel setting, [17] noted that the predictive variance of a Gaussian process depends only on where observations are made, and not on the observations themselves. Therefore, they suggested the GP-BUCB method, which greedily populates the set S_t by maximizing a UCB type equation Q times sequentially, updating σ at each step, whilst maintaining the same μ for each batch. Finally, a variant of the GP-UCB was proposed by [18]. The first point of the set S_t is chosen by optimizing the UCB objective. Thereafter, a 'relevant region' R_t ⊂ X which contains the maximizer of f with high probability is defined. Points are greedily chosen from this region to maximize the information gain about f, measured by expected reduction in entropy, until |S_t| = Q. This method was named Gaussian process upper confidence bound with pure exploration (GP-UCB-PE).\n\nEach approach discussed resorts to a greedy batch selection process. To the best of our knowledge, no batch Bayesian optimization method to date has avoided a greedy algorithm. We avoid a greedy batch selection approach with PPES, which we develop in the next section.\n\n3 Parallel Predictive Entropy Search\n\nOur approach is to maximize information [19] about the location of the global maximizer x*, which we measure in terms of the negative differential entropy of p(x*|D). Analogous to [13], PPES aims to choose the set of Q points, S_t = {x_q}_{q=1}^Q, which maximizes\n\na_PPES(S_t|D) = H[p(x*|D)] − E_{p({y_q}_{q=1}^Q | D, S_t)}[ H[p(x* | D ∪ {x_q, y_q}_{q=1}^Q)] ],   (1)\n\nwhere H[p(x)] = −∫ p(x) log p(x) dx is the differential entropy of its argument and the expectation above is taken with respect to the posterior joint predictive distribution of {y_q}_{q=1}^Q given the previous evaluations, D, and the set S_t. Evaluating (1) exactly is typically infeasible. The prohibitive aspects are that p(x* | D ∪ {x_q, y_q}_{q=1}^Q) would have to be evaluated for many different combinations of {x_q, y_q}_{q=1}^Q, and the entropy computations are not analytically tractable in themselves. Significant approximations need to be made to (1) before it becomes practically useful [12]. A convenient equivalent formulation of the quantity in (1) can be written as the mutual information between x* and {y_q}_{q=1}^Q given D [20]. 
By symmetry of the mutual information, we can rewrite a_PPES as\n\na_PPES(S_t|D) = H[p({y_q}_{q=1}^Q | D, S_t)] − E_{p(x*|D)}[ H[p({y_q}_{q=1}^Q | D, S_t, x*)] ],   (2)\n\nwhere p({y_q}_{q=1}^Q | D, S_t, x*) is the joint posterior predictive distribution for {y_q}_{q=1}^Q given the observed data, D, and the location of the global maximizer of f. The key advantage of the formulation in (2) is that the objective is based on entropies of predictive distributions of the observations, which are much more easily approximated than the entropies of distributions on x*.\n\nIn fact, the first term of (2) can be computed analytically. Suppose p({y_q}_{q=1}^Q | D, S_t) is multivariate Gaussian with covariance K; then H[p({y_q}_{q=1}^Q | D, S_t)] = 0.5 log[det(2πe(K + σ²I))]. We develop an approach to approximate the expectation of the predictive entropy in (2), using an expectation propagation based method which we discuss in the following section.\n\n3.1 Approximating the Predictive Entropy\n\nAssuming a sample of x*, we discuss our approach to approximating H[p({y_q}_{q=1}^Q | D, S_t, x*)] in (2) for a set of query points S_t. Note that we can write\n\np({y_q}_{q=1}^Q | D, S_t, x*) = ∫ p({f_q}_{q=1}^Q | D, S_t, x*) ∏_{q=1}^Q p(y_q|f_q) df_1 ... df_Q,   (3)\n\nwhere p({f_q}_{q=1}^Q | D, S_t, x*) is the posterior distribution of the objective function at the locations x_q ∈ S_t, given previous evaluations D, and that x* is the global maximizer of f. Recall that p(y_q|f_q) is Gaussian for each q. Our approach will be to derive a Gaussian approximation to p({f_q}_{q=1}^Q | D, S_t, x*), which would lead to an analytic approximation to the integral in (3). The posterior predictive distribution of the Gaussian process, p({f_q}_{q=1}^Q | D, S_t), is multivariate Gaussian distributed. 
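The analytic first term of (2) is just the differential entropy of a multivariate Gaussian with the noise variance added to the predictive covariance. A minimal sketch (function names are ours):

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of a multivariate Gaussian:
    H = 0.5 * log det(2 * pi * e * cov)."""
    sign, logdet = np.linalg.slogdet(2.0 * np.pi * np.e * cov)
    assert sign > 0, "covariance must be positive definite"
    return 0.5 * logdet

def predictive_entropy(K, noise_var):
    """First term of the PPES objective (2): entropy of the joint noisy
    predictive distribution with covariance K + noise_var * I."""
    Q = K.shape[0]
    return gaussian_entropy(K + noise_var * np.eye(Q))
```

Using `slogdet` rather than `det` avoids overflow or underflow for larger batches; this is a numerical convenience, not something prescribed by the paper.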
However, by further conditioning on the location x*, the global maximizer of f, we impose the condition that f(x) ≤ f(x*) for any x ∈ X. Imposing this constraint for all x ∈ X is extremely difficult and makes the computation of p({f_q}_{q=1}^Q | D, S_t, x*) highly intractable. We instead impose the following two conditions: (i) f(x) ≤ f(x*) for each x ∈ S_t, and (ii) f(x*) ≥ y_max + ε, where y_max is the largest observed noisy objective function value and ε ~ N(0, σ²). Constraint (i) is equivalent to imposing that f(x*) is larger than objective function values at current query locations, whilst condition (ii) makes f(x*) larger than previous objective function evaluations, accounting for noise. Denoting the two conditions C, and the variables f = [f_1, ..., f_Q]^T and f+ = [f; f*], where f* = f(x*), we incorporate the conditions as follows\n\np(f | D, S_t, x*) ≈ ∫ p(f+ | D, S_t, x*) Φ((f* − y_max)/σ) ∏_{q=1}^Q I(f* ≥ f_q) df*,   (4)\n\nwhere I(.) is an indicator function. The integral in (4) can be approximated using expectation propagation [21]. The Gaussian process predictive p(f+ | D, S_t, x*) is N(f+; m+, K+). We approximate the integrand of (4) with w(f+) = N(f+; m+, K+) ∏_{q=1}^{Q+1} Z̃_q N(c_q^T f+; μ̃_q, τ̃_q), where each Z̃_q and τ̃_q are positive, μ̃_q ∈ R, and for q ≤ Q, c_q is a vector of length Q + 1 with qth entry −1, (Q+1)st entry 1, and remaining entries 0, whilst c_{Q+1} = [0, ..., 0, 1]^T. The approximation w(f+) approximates the Gaussian CDF, Φ(.), and each indicator function, I(.), with a univariate, scaled Gaussian PDF. 
The site parameters, {Z̃_q, μ̃_q, τ̃_q}_{q=1}^{Q+1}, are learned using a fast EP algorithm, for which details are given in the supplementary material, where we show that w(f+) = Z N(f+; μ+, Σ+), where\n\nμ+ = Σ+ ( K+^{-1} m+ + Σ_{q=1}^{Q+1} (μ̃_q/τ̃_q) c_q ),   Σ+ = ( K+^{-1} + Σ_{q=1}^{Q+1} (1/τ̃_q) c_q c_q^T )^{-1},   (5)\n\nand hence p(f+ | D, S_t, C) ≈ N(f+; μ+, Σ+). Since multivariate Gaussians are consistent under marginalization, a convenient corollary is that p(f | D, S_t, x*) ≈ N(f; μ, Σ), where μ is the vector containing the first Q elements of μ+, and Σ is the matrix containing the first Q rows and columns of Σ+. Since sums of independent Gaussians are also Gaussian distributed, we see that p({y_q}_{q=1}^Q | D, S_t, x*) ≈ N([y_1, ..., y_Q]^T; μ, Σ + σ²I). The final convenient attribute of our Gaussian approximation is that the differential entropy of a multivariate Gaussian can be computed analytically, such that H[p({y_q}_{q=1}^Q | D, S_t, x*)] ≈ 0.5 log[det(2πe(Σ + σ²I))].\n\n3.2 Sampling from the Posterior over the Global Maximizer\n\nSo far, we have considered how to approximate H[p({y_q}_{q=1}^Q | D, S_t, x*)], given the global maximizer, x*. We in fact would like the expected value of this quantity over the posterior distribution of the global maximizer, p(x*|D). Literally, p(x*|D) ≡ p(f(x*) = max_{x∈X} f(x) | D), the posterior probability that x* is the global maximizer of f. 
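Given converged site parameters, the Gaussian moments in (5) can be assembled directly. The following is a minimal sketch with names of our choosing, using dense inverses for clarity rather than the numerically robust factorizations one would use in practice:

```python
import numpy as np

def ep_gaussian_moments(m_plus, K_plus, mu_tilde, tau_tilde, C):
    """Assemble the Gaussian EP approximation N(mu+, Sigma+) from converged
    site parameters, following Eq. (5):
      Sigma+ = (K+^{-1} + sum_q c_q c_q^T / tau_q)^{-1}
      mu+    = Sigma+ (K+^{-1} m+ + sum_q (mu_q / tau_q) c_q)
    C is a list of the constraint vectors c_q."""
    K_inv = np.linalg.inv(K_plus)
    prec = K_inv.copy()                 # accumulate the precision matrix
    nat_mean = K_inv @ m_plus           # accumulate the natural mean
    for c, mu_t, tau_t in zip(C, mu_tilde, tau_tilde):
        prec += np.outer(c, c) / tau_t
        nat_mean += (mu_t / tau_t) * c
    Sigma = np.linalg.inv(prec)
    return Sigma @ nat_mean, Sigma
```

With a single site, this reduces to multiplying two Gaussians in natural parameters, which is a quick sanity check on the formula.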
Computing the distribution p(x*|D) is intractable, but it is possible to approximately sample from it and compute a Monte Carlo based approximation of the desired expectation. We consider two approaches to sampling from the posterior of the global maximizer: (i) a maximum a posteriori (MAP) method, and (ii) a random feature approach.\n\nMAP sample from p(x*|D). The MAP of p(x*|D) is its posterior mode, given by x*_MAP = argmax_{x*∈X} p(x*|D). We may approximate the expected value of the predictive entropy by replacing the posterior distribution of x* with a single point estimate at x*_MAP. There are two key advantages to using the MAP estimate in this way. Firstly, it is simple to compute x*_MAP, as it is the global maximizer of the posterior mean of f given the observations D. Secondly, choosing to use x*_MAP assists the EP algorithm developed in Section 3.1 to converge as desired. This is because the condition f(x*) ≥ f(x) for x ∈ X is easy to enforce when x* = x*_MAP, the global maximizer of the posterior mean of f. When x* is sampled such that the posterior mean at x* is significantly suboptimal, the EP approximation may be poor. Whilst using the MAP estimate approximation is convenient, it is after all a point estimate and fails to characterize the full posterior distribution. We therefore consider a method to draw samples from p(x*|D) using random features.\n\nRandom Feature Samples from p(x*|D). A naive approach to sampling from p(x*|D) would be to sample g ~ p(f|D) and choose argmax_{x∈X} g. Unfortunately, this would require sampling g over an uncountably infinite space, which is infeasible. A slightly less naive method would be to sequentially construct g whilst optimizing it, instead of evaluating it everywhere in X. However, this approach would have cost O(m³), where m is the number of function evaluations of g necessary to find its optimum. We propose, as in [13], to sample and optimize an analytic approximation to g.\n\nBy Bochner's theorem [22], a stationary kernel function, k, has a Fourier dual s(w), which is equal to the spectral density of k. Setting p(w) = s(w)/α, a normalized density, we can write\n\nk(x, x') = α E_{p(w)}[e^{−i w^T (x − x')}] = 2α E_{p(w,b)}[cos(w^T x + b) cos(w^T x' + b)],   (6)\n\nwhere b ~ U[0, 2π]. Let φ(x) = √(2α/m) cos(Wx + b) denote an m-dimensional feature mapping, where W and b consist of m stacked samples from p(w, b); then the kernel k can be approximated by the inner product of these features, k(x, x') ≈ φ(x)^T φ(x') [23]. The linear model g(x) = φ(x)^T θ + λ, where θ|D ~ N(A^{−1}Φ^T(y − λ1), σ²A^{−1}), is an approximate sample from p(f|D), where y is a vector of objective function evaluations, A = Φ^TΦ + σ²I and Φ^T = [φ(x_1) ... φ(x_n)]. In fact, lim_{m→∞} g is a true sample from p(f|D) [24].\n\nThe generative process above suggests the following approach to approximately sampling from p(x*|D): (i) sample random features φ^(i) and corresponding posterior weights θ^(i) using the process above, (ii) construct g^(i)(x) = φ^(i)(x)^T θ^(i) + λ, and (iii) finally compute x*^(i) = argmax_{x∈X} g^(i)(x) using gradient based methods.\n\n3.3 Computing and Optimizing the PPES Approximation\n\nLet ψ denote the set of kernel parameters and the observation noise variance, σ². 
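The three-step generative procedure above can be sketched for a one-dimensional squared-exponential kernel as follows. This is a simplified illustration rather than the paper's implementation: a grid search stands in for the gradient-based maximization in step (iii), and all names are ours:

```python
import numpy as np

def sample_posterior_maximizer(X, y, lengthscale, alpha, noise_var, lam,
                               m=500, grid=None, rng=None):
    """Draw one approximate sample from p(x*|D) for a 1-D GP with a
    squared-exponential kernel, via random Fourier features."""
    rng = np.random.default_rng() if rng is None else rng
    # The spectral density of the SE kernel is Gaussian, so w ~ N(0, 1/l^2).
    W = rng.normal(0.0, 1.0 / lengthscale, size=(m, 1))
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)
    feat = lambda x: np.sqrt(2.0 * alpha / m) * np.cos(x @ W.T + b)
    Phi = feat(X)                                  # n x m feature matrix
    A = Phi.T @ Phi + noise_var * np.eye(m)
    mean = np.linalg.solve(A, Phi.T @ (y - lam))   # posterior mean of theta
    cov = noise_var * np.linalg.inv(A)             # posterior covariance
    theta = rng.multivariate_normal(mean, cov)
    g = lambda x: feat(x) @ theta + lam            # analytic sample path
    # Step (iii): maximize the sample path (grid search for illustration).
    grid = np.linspace(0.0, 1.0, 1001)[:, None] if grid is None else grid
    return grid[np.argmax(g(grid))]
```

Because g is an analytic function of x, in practice its gradient is available in closed form and step (iii) can use any gradient-based optimizer instead of a grid.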
Our posterior belief about ψ is summarized by the posterior distribution p(ψ|D) ∝ p(ψ)p(D|ψ), where p(ψ) is our prior belief about ψ and p(D|ψ) is the GP marginal likelihood given the parameters ψ. For a fully Bayesian treatment of ψ, we must marginalize a_PPES with respect to p(ψ|D). The expectation with respect to the posterior distribution of ψ is approximated with Monte Carlo samples. A similar approach is taken in [3, 13]. Combining the EP based method to approximate the predictive entropy with either of the two methods discussed in the previous section to approximately sample from p(x*|D), we can construct â_PPES, an approximation to (2), defined by\n\nâ_PPES(S_t|D) = (1/2M) Σ_{i=1}^M [ log[det(K^(i) + σ²^(i) I)] − log[det(Σ^(i) + σ²^(i) I)] ],   (7)\n\nwhere K^(i) is constructed using ψ^(i), the ith of M samples from p(ψ|D), and Σ^(i) is constructed as in Section 3.1, assuming the global maximizer is x*^(i) ~ p(x*|D, ψ^(i)). The PPES approximation is simple and amenable to gradient based optimization. Our goal is to choose S_t = {x_1, ..., x_Q} which maximizes â_PPES in (7). Since our kernel function is differentiable, we may consider taking the derivative of â_PPES with respect to x_{q,d}, the dth component of x_q,\n\n∂â_PPES/∂x_{q,d} = (1/2M) Σ_{i=1}^M [ trace( (K^(i) + σ²^(i) I)^{−1} ∂K^(i)/∂x_{q,d} ) − trace( (Σ^(i) + σ²^(i) I)^{−1} ∂Σ^(i)/∂x_{q,d} ) ].   (8)\n\nComputing ∂K^(i)/∂x_{q,d} is simple directly from the definition of the chosen kernel function. Σ^(i) is a function of K^(i), {c_q}_{q=1}^{Q+1} and the site parameters {τ̃_q^(i)}_{q=1}^{Q+1}; we know how to compute ∂K^(i)/∂x_{q,d}, and each c_q is a constant vector. 
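The Monte Carlo objective in (7) reduces to a difference of log-determinants averaged over hyperparameter samples. A minimal sketch, assuming each K^(i) and EP-constrained Σ^(i) have already been computed (names are ours):

```python
import numpy as np

def ppes_objective(K_samples, Sigma_samples, noise_vars):
    """Monte Carlo PPES approximation, Eq. (7):
      (1/2M) * sum_i [ log det(K_i + s2_i I) - log det(Sigma_i + s2_i I) ]
    K_i is the prior predictive covariance of the batch under hyperparameter
    sample i; Sigma_i is its EP-constrained counterpart from Section 3.1."""
    total = 0.0
    for K, Sigma, s2 in zip(K_samples, Sigma_samples, noise_vars):
        I = s2 * np.eye(K.shape[0])
        total += (np.linalg.slogdet(K + I)[1]
                  - np.linalg.slogdet(Sigma + I)[1])
    return total / (2.0 * len(K_samples))
```

Since the EP constraints can only shrink the predictive uncertainty, each summand is non-negative, so the objective is an average information gain across hyperparameter samples.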
Hence our only concern is how the EP site parameters, {τ̃_q^(i)}_{q=1}^{Q+1}, vary with x_{q,d}. Rather remarkably, we may invoke a result from Section 2.1 of [25], which says that converged site parameters, {Z̃_q, μ̃_q, τ̃_q}_{q=1}^{Q+1}, have 0 derivative with respect to parameters of p(f+ | D, S_t, x*). There is a key distinction between explicit dependencies (where Σ actually depends on K) and implicit dependencies, where a site parameter, τ̃_q, might depend implicitly on K. A similar approach is taken in [26], and discussed in [7]. We therefore compute\n\n∂Σ+^(i)/∂x_{q,d} = Σ+^(i) K+^(i)−1 (∂K+^(i)/∂x_{q,d}) K+^(i)−1 Σ+^(i).   (9)\n\nOn first inspection, it may seem computationally too expensive to compute derivatives with respect to each q and d. However, note that we may compute and store the matrices K+^(i)−1, (K^(i) + σ²^(i) I)^{−1} and (Σ^(i) + σ²^(i) I)^{−1} once, and that ∂K+^(i)/∂x_{q,d} is symmetric with exactly one non-zero row and non-zero column, which can be exploited for fast matrix multiplication and trace computations.\n\nFigure 1: Assessing the quality of our approximations to the parallel predictive entropy search strategy. (a) Synthetic objective function (blue line) defined on [0, 1], with noisy observations (black squares). (b) Ground truth a_PPES defined on [0, 1]², obtained by rejection sampling. (c) Our approximation â_PPES using expectation propagation. 
Dark regions correspond to pairs (x, x') with high utility, whilst faint regions correspond to pairs (x, x') with low utility.\n\n4 Empirical Study\n\nIn this section, we study the performance of PPES in comparison to the aforementioned methods. We model f as a Gaussian process with constant mean λ and covariance kernel k. Observations of the objective function are considered to be independently drawn from N(f(x), σ²). In our experiments, we choose to use a squared-exponential kernel of the form k(x, x') = γ² exp[−0.5 Σ_d (x_d − x'_d)²/l_d²]. Therefore the set of model hyperparameters is {λ, γ, l_1, ..., l_D, σ}; a broad Gaussian hyperprior is placed on λ and uninformative Gamma priors are used for the other hyperparameters.\n\nIt is worth investigating how well â_PPES (7) is able to approximate a_PPES (2). In order to test the approximation in a manner amenable to visualization, we generate a sample f from a Gaussian process prior on X = [0, 1], with γ² = 1, σ² = 10⁻⁴ and l² = 0.025, and consider batches of size Q = 2. We set M = 200. A rejection sampling based approach is used to compute the ground truth a_PPES, defined on X^Q = [0, 1]². We first discretize [0, 1]², and sample p(x*|D) in (2) by evaluating samples from p(f|D) on the discrete points and choosing the input with highest function value. Given x*, we compute H[p(y_1, y_2 | D, x_1, x_2, x*)] using rejection sampling. Samples from p(f|D) are evaluated on discrete points in [0, 1]² and rejected if the highest function value occurs not at x*. We add independent Gaussian noise with variance σ² to the non-rejected samples from the previous step and approximate H[p(y_1, y_2 | D, x_1, x_2, x*)] using kernel density estimation [27].\n\nFigure 1 includes illustrations of (a) the objective function to be maximized, f, with 5 noisy observations, (b) the a_PPES ground truth obtained using the rejection sampling method, and finally (c) â_PPES using the EP method we develop in the previous section. The black squares on the axes of Figures 1(b) and 1(c) represent the locations in X = [0, 1] where f has been noisily sampled, and the darker the shade, the larger the function value; the lightly shaded horizontal and vertical lines in these figures are aligned with these sampled locations. The figures representing a_PPES and â_PPES appear to be symmetric, as is expected, since the set S_t = {x, x'} is not an ordered set: all points in the set are probed in parallel, i.e. S_t = {x, x'} = {x', x}. The surface of â_PPES is similar to that of a_PPES. In particular, the â_PPES approximation often appeared to be an annealed version of the ground truth a_PPES, in the sense that peaks were more pronounced and non-peak areas were flatter. Since we are interested in argmax_{{x,x'}∈X²} a_PPES({x, x'}), our key concern is that the peaks of â_PPES occur at the same input locations as those of a_PPES. This appears to be the case in our experiment, suggesting that argmax â_PPES is a good approximation for argmax a_PPES.\n\nWe now test the performance of PPES in the task of finding the optimum of various objective functions. For each experiment, we compare PPES (M = 200) to EI-MCMC (with 100 MCMC samples), simulated matching with a UCB baseline policy, GP-BUCB and GP-UCB-PE. We use the random features method to sample from p(x*|D), rejecting samples which lead to failed EP runs. 
An experiment of an objective function, f, consists of sampling 5 input points uniformly at random and running each algorithm starting with these samples and their corresponding (noisy) function values. We measure performance after t batch evaluations using immediate regret, r_t = |f(x̃_t) − f(x*)|, where x* is the known optimizer of f and x̃_t is the recommendation of an algorithm after t batch evaluations. We perform 100 experiments for each objective function, and report the median of the immediate regret obtained for each algorithm. The confidence bands represent one standard deviation obtained from bootstrapping. The empirical distribution of the immediate regret is heavy tailed, making the median more representative of where most data points lie than the mean.\n\nFigure 2: Median of the immediate regret of the PPES and 4 other algorithms over 100 experiments on benchmark synthetic objective functions ((a) Branin, (b) Cosines, (c) Shekel, (d) Hartmann), using batches of size Q = 3.\n\nOur first set of experiments is on a set of synthetic benchmark objective functions including Branin-Hoo [28], a mixture of cosines [29], a Shekel function with 10 modes [30] (each defined on [0, 1]²) and the Hartmann-6 function [28] (defined on [0, 1]⁶). We choose batches of size Q = 3 at each decision time. The plots in Figure 2 illustrate the median immediate regrets found for each algorithm. 
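The reporting procedure described above, median immediate regret per batch index with a one-standard-deviation band from bootstrapping, can be sketched as follows (names are ours):

```python
import numpy as np

def median_with_bootstrap_band(regrets, n_boot=1000, rng=None):
    """Median immediate regret across experiments at each batch index, plus a
    one-standard-deviation band obtained by bootstrap resampling experiments.
    regrets: array of shape (n_experiments, n_batches)."""
    rng = np.random.default_rng() if rng is None else rng
    regrets = np.asarray(regrets, dtype=float)
    med = np.median(regrets, axis=0)
    n = regrets.shape[0]
    boot = np.empty((n_boot, regrets.shape[1]))
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)     # resample experiments with replacement
        boot[i] = np.median(regrets[idx], axis=0)
    return med, boot.std(axis=0)
```

Bootstrapping the median rather than reporting the standard deviation of the raw regrets matches the motivation given above: with heavy-tailed regret distributions, the spread of the median is the quantity of interest.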
The results suggest that the PPES algorithm performs close to best, if not the best, for each problem considered. EI-MCMC does significantly better on the Hartmann function, which is a relatively smooth function with very few modes, where greedy search appears beneficial. Entropy-based strategies are more exploratory in higher dimensions. Nevertheless, PPES does significantly better than GP-UCB-PE on 3 of the 4 problems, suggesting that our non-greedy batch selection procedure enhances performance versus a greedy entropy based policy.\n\nWe now consider maximization of real world objective functions. The first, boston, returns the negative of the prediction error of a neural network trained on a random train/test split of the Boston Housing dataset [31]. The weight-decay parameter and number of training iterations for the neural network are the parameters to be optimized over. The next function, hydrogen, returns the amount of hydrogen produced by particular bacteria as a function of pH and nitrogen levels of a growth medium [32]. Thirdly we consider a function, rocket, which runs a simulation of a rocket [33] being launched from the Earth's surface and returns the time taken for the rocket to land on the Earth's surface. The variables to be optimized over are the launch height from the surface, the mass of fuel to use and the angle of launch with respect to the Earth's surface. If the rocket does not return, the function returns 0. Finally we consider a function, robot, which returns the walking speed of a bipedal robot [34]. The function's input parameters, which live in [0, 1]⁸, parametrize the robot's controller. We add Gaussian noise with σ = 0.1 to the noiseless function. Note that none of the functions we consider is available analytically. 
boston trains a neural network and returns a test error, whilst rocket and robot run physical simulations involving differential equations before returning the desired quantity. Since the hydrogen dataset is available only at discrete points, we define hydrogen to return the predictive mean of a Gaussian process trained on the dataset.
Figure 3 shows the median values of immediate regret achieved by each method over 200 random initializations. We consider batches of size Q = 2 and Q = 4. We find that PPES consistently outperforms competing methods on the functions considered. Because the SM-UCB, GP-BUCB and GP-UCB-PE algorithms select batches greedily and do not require MCMC sampling, they are amenable to large batch experiments; for example, [17] consider optimization in R^45 with batches of size 10. However, all three algorithms perform poorly when selecting batches of smaller size. The performance on the hydrogen function illustrates an interesting phenomenon: whilst the immediate regret of PPES is mediocre initially, it drops rapidly as more batches are evaluated.
This behaviour is likely due to the non-greediness of our approach. EI-MCMC makes good initial progress, but then fails to explore the input space as well as PPES is able to. Recall that after each batch evaluation, an algorithm is required to output x̃t, its best estimate for the maximizer of the objective function.
[Figure 3: Median of the immediate regret of PPES and 4 other algorithms over 100 experiments on real world objective functions. Figures in the top row use batches of size Q = 2, whilst figures on the bottom row use batches of size Q = 4. Panels: (a) boston, (b) hydrogen, (c) rocket, (d) robot.]

We observed that competing algorithms tended to evaluate points with higher objective function values than those chosen by PPES, yet when it came to recommending x̃t, PPES tended to do a better job. Our belief is that this occurred exactly because the PPES objective aims to maximize information gain rather than improvement in objective function value.
The rocket function has a strong discontinuity, making it difficult to maximize. If the fuel mass, launch height and/or angle are too high, the rocket will not return to the Earth's surface, resulting in a 0 function value. It can be argued that a stationary kernel Gaussian process is a poor model for this function, yet it is worth investigating the performance of GP-based models since a practitioner may not know a priori whether or not their black-box function is smooth. PPES seemed to handle this function best: it had fewer samples which resulted in a 0 function value than each of the competing methods, and made fewer recommendations which led to a 0 function value.
The relative increase in PPES performance from increasing the batch size from Q = 2 to Q = 4 is small for the robot function compared to the other functions considered. We believe this is a consequence of using a slightly naive optimization procedure to save computation time. Our optimization procedure first computes âPPES at 1000 points selected uniformly at random, and performs gradient ascent from the best point. Since âPPES is defined on X^Q = [0, 1]^32, this method may miss a global optimum. The other methods all select their batches greedily, and hence only need to optimize in X = [0, 1]^8. However, this should easily be avoided by using a more exhaustive gradient-based optimizer.

5 Conclusions
We have developed parallel predictive entropy search, an information theoretic approach to batch Bayesian optimization. Our method is greedy in the sense that it aims to maximize the one-step information gain about the location of x∗, but it is not greedy in how it selects a set of points to evaluate next. Previous methods are doubly greedy, in that they look one step ahead, and also select a batch of points greedily. Competing methods are prone to under-exploring, which hurts their performance on multi-modal, noisy objective functions, as we demonstrate in our experiments.

References

[1] G. Wang and S. Shan. Review of Metamodeling Techniques in Support of Engineering Design Optimization. Journal of Mechanical Design, 129(4):370–380, 2007.
[2] W. Ziemba and R. Vickson. Stochastic Optimization Models in Finance.
World Scientific, Singapore, 2006.
[3] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. NIPS, 2012.
[4] J. Mockus. Bayesian Approach to Global Optimization: Theory and Applications. Kluwer, 1989.
[5] D. Lizotte, T. Wang, M. Bowling, and D. Schuurmans. Automatic Gait Optimization with Gaussian Process Regression. IJCAI, pages 944–949, 2007.
[6] D. M. Negoescu, P. I. Frazier, and W. B. Powell. The Knowledge-Gradient Algorithm for Sequencing Experiments in Drug Discovery. INFORMS Journal on Computing, 23(3):346–363, 2011.
[7] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[8] A. Shah, A. G. Wilson, and Z. Ghahramani. Student-t Processes as Alternatives to Gaussian Processes. AISTATS, 2014.
[9] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N.
Satish, N. Sundaram, M. Patwary, Prabhat, and R. P. Adams. Scalable Bayesian Optimization Using Deep Neural Networks. ICML, 2015.
[10] E. Brochu, M. Cora, and N. de Freitas. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Applications to Active User Modeling and Hierarchical Reinforcement Learning. Technical Report TR-2009-23, University of British Columbia, 2009.
[11] G. Gutin, A. Yeo, and A. Zverovich. Traveling salesman should not be greedy: domination analysis of greedy-type heuristics for the TSP. Discrete Applied Mathematics, 117:81–86, 2002.
[12] P. Hennig and C. J. Schuler. Entropy Search for Information-Efficient Global Optimization. JMLR, 2012.
[13] J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive Entropy Search for Efficient Global Optimization of Black-box Functions. NIPS, 2014.
[14] D. Ginsbourger, J. Janusevskis, and R. Le Riche. Dealing with Asynchronicity in Parallel Gaussian Process Based Optimization. 2011.
[15] J. Azimi, A. Fern, and X. Z. Fern. Batch Bayesian Optimization via Simulation Matching. NIPS, 2010.
[16] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. ICML, 2010.
[17] T. Desautels, A. Krause, and J. Burdick. Parallelizing Exploration-Exploitation Tradeoffs with Gaussian Process Bandit Optimization. ICML, 2012.
[18] E. Contal, D. Buffoni, D. Robicquet, and N. Vayatis. Parallel Gaussian Process Optimization with Upper Confidence Bound and Pure Exploration. In Machine Learning and Knowledge Discovery in Databases, pages 225–240. Springer Berlin Heidelberg, 2013.
[19] D. J. MacKay. Information-Based Objective Functions for Active Data Selection. Neural Computation, 4(4):590–604, 1992.
[20] N. Houlsby, J. M. Hernández-Lobato, F. Huszár, and Z. Ghahramani. Collaborative Gaussian Processes for Preference Learning.
NIPS, 2012.
[21] T. P. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Massachusetts Institute of Technology, 2001.
[22] S. Bochner. Lectures on Fourier Integrals. Princeton University Press, 1959.
[23] A. Rahimi and B. Recht. Random Features for Large-Scale Kernel Machines. NIPS, 2007.
[24] R. M. Neal. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto, 1995.
[25] M. Seeger. Expectation Propagation for Exponential Families. Technical Report, U.C. Berkeley, 2008.
[26] J. P. Cunningham, P. Hennig, and S. Lacoste-Julien. Gaussian Probabilities and Expectation Propagation. arXiv, 2013. http://arxiv.org/abs/1111.6832.
[27] I. Ahmad and P. E. Lin. A Nonparametric Estimation of the Entropy for Absolutely Continuous Distributions. IEEE Trans. on Information Theory, 22(3):372–375, 1976.
[28] D. Lizotte. Practical Bayesian Optimization. PhD thesis, University of Alberta, 2008.
[29] B. S. Anderson, A. W. Moore, and D. Cohn. A Nonparametric Approach to Noisy and Costly Optimization. ICML, 2000.
[30] J. Shekel. Test Functions for Multimodal Search Techniques. Information Science and Systems, 1971.
[31] K. Bache and M. Lichman. UCI Machine Learning Repository, 2013.
[32] E. H. Burrows, W. K. Wong, X. Fern, F. W. R. Chaplen, and R. L. Ely. Optimization of pH and Nitrogen for Enhanced Hydrogen Production by Synechocystis sp. PCC 6803 via Statistical and Machine Learning Methods. Biotechnology Progress, 25(4):1009–1017, 2009.
[33] J. E. Hasbun. Classical Mechanics with MATLAB Applications. Jones & Bartlett Learning, 2008.
[34] E. Westervelt and J. Grizzle. Feedback Control of Dynamic Bipedal Robot Locomotion. Control and Automation Series.
CRC Press, 2007.