{"title": "Bayesian Optimization with a Finite Budget: An Approximate Dynamic Programming Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 883, "page_last": 891, "abstract": "We consider the problem of optimizing an expensive objective function when a finite budget of total evaluations is prescribed. In that context, the optimal solution strategy for Bayesian optimization can be formulated as a dynamic programming instance. This results in a complex problem with uncountable, dimension-increasing state space and an uncountable control space. We show how to approximate the solution of this dynamic programming problem  using rollout, and propose rollout heuristics specifically designed for the Bayesian optimization setting. We present numerical experiments showing that the resulting algorithm for optimization with a finite budget outperforms several popular Bayesian optimization algorithms.", "full_text": "Bayesian Optimization with a Finite Budget:\n\nAn Approximate Dynamic Programming Approach\n\nMassachusetts Institute of Technology\n\nMassachusetts Institute of Technology\n\nRemi R. Lam\n\nCambridge, MA\nrlam@mit.edu\n\nKaren E. Willcox\n\nCambridge, MA\n\nkwillcox@mit.edu\n\nDavid H. Wolpert\nSanta Fe Institute\n\nSanta Fe, NM\n\ndhw@santafe.edu\n\nAbstract\n\nWe consider the problem of optimizing an expensive objective function when a\n\ufb01nite budget of total evaluations is prescribed. In that context, the optimal solution\nstrategy for Bayesian optimization can be formulated as a dynamic programming in-\nstance. This results in a complex problem with uncountable, dimension-increasing\nstate space and an uncountable control space. We show how to approximate the\nsolution of this dynamic programming problem using rollout, and propose rollout\nheuristics speci\ufb01cally designed for the Bayesian optimization setting. 
We present\nnumerical experiments showing that the resulting algorithm for optimization with\na \ufb01nite budget outperforms several popular Bayesian optimization algorithms.\n\n1\n\nIntroduction\n\nOptimizing an objective function is a central component of many algorithms in machine learning\nand engineering. It is also essential to many scienti\ufb01c models, concerning everything from human\nbehavior, to protein folding, to population biology. Often, the objective function to optimize is\nnon-convex and does not have a known closed-form expression. In addition, the evaluation of this\nfunction can be expensive, involving a time-consuming computation (e.g., training a neural network,\nnumerically solving a set of partial differential equations, etc.) or a costly experiment (e.g., drilling a\nborehole, administering a treatment, etc.). Accordingly, there is often a \ufb01nite budget specifying the\nmaximum number of evaluations of the objective function allowed to perform the optimization.\nBayesian optimization (BO) has become a popular optimization technique for solving problems\ngoverned by such expensive objective functions [17, 9, 2]. BO iteratively updates a statistical model\nand uses it as a surrogate for the objective function. At each iteration, this statistical model is used to\nselect the next design to evaluate. Most BO algorithms are greedy, ignoring how the design selected\nat a given iteration will affect the future steps of the optimization. Thus, the decisions made are\ntypically one-step optimal. Because of this shortsightedness, such algorithms balance, in a greedy\nfashion, the BO exploration-exploitation trade-off: evaluating designs to improve the statistical model\nor to \ufb01nd the optimizer of the objective function.\nIn contrast to greedy algorithms, a lookahead approach is aware of the remaining evaluations and can\nbalance the exploration-exploitation trade-off in a principled way. 
A lookahead approach builds an\noptimal strategy that maximizes a long-term reward over several steps. That optimal strategy is the\nsolution of a challenging dynamic programming (DP) problem whose complexity stems, in part, from\nthe increasing dimensionality of the involved spaces as the budget increases, and from the presence\nof nested maximizations and expectations. This is especially challenging when the design space takes\nan uncountable set of values.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nThe first contribution of this paper is to use rollout [1], an approximate dynamic programming\n(ADP) algorithm, to circumvent the nested maximizations of the DP formulation. This leads to a\nproblem significantly simpler to solve. Rollout uses suboptimal heuristics to guide the simulation\nof optimization scenarios over several steps. Those simulations allow us to quantify the long-term\nbenefits of evaluating a given design. The heuristics used by rollout are typically problem-dependent.\nThe second contribution of this paper is to build heuristics adapted to BO with a finite budget that\nleverage existing greedy BO strategies. As demonstrated with numerical experiments, this can lead to\nimprovements in performance.\nThe following section of this paper provides a brief description of Gaussian processes and their use\nin Bayesian optimization (Sec. 2), followed by a brief overview of dynamic programming (Sec. 3).\nSec. 4 develops the connection between BO and DP and discusses some of the related work. We then\npropose to employ the rollout algorithm (with heuristics adapted to BO) to mitigate the complexity of\nthe DP algorithm (Sec. 5). In Sec. 6, we numerically investigate the proposed algorithm and present\nour conclusions in Sec. 
7.\n\n2 Bayesian Optimization\n\nWe consider the following optimization problem:\n\n(OP) x^* = \\operatorname{argmin}_{x \\in \\mathcal{X}} f(x), (1)\n\nwhere x is a d-dimensional vector of design variables. The design space, X , is a bounded subset of\nR^d, and f : X \u2192 R is an objective function that is expensive to evaluate. We are interested in finding\na minimizer x\u2217 of the objective function using a finite budget of N function evaluations. We refer to\nthis problem as the original problem (OP).\nIn the Bayesian optimization (BO) setting, the (deterministic or noisy) objective function f is modeled\nas a realization of a stochastic process, typically a Gaussian process (GP) G, on a probability space\n(\u2126, \u03a3, P), which defines a prior distribution over functions. A GP is fully defined by a mean function\nm : X \u2192 R (often set to zero without loss of generality) and a covariance kernel \u03ba : X^2 \u2192 R (see\n[16] for an overview of GP):\n\nf \\sim \\mathcal{G}(m, \\kappa). (2)\n\nThe BO algorithm starts with an initial design x1 and its associated value y1 = f (x1) provided by\nthe user. This defines the first training set S1 = {(x1, y1)}. At each iteration k \u2208 {1,\u00b7\u00b7\u00b7 , N}, the\nGP prior is updated, using Bayes\u2019 rule, to obtain posterior distributions conditioned on the current\ntraining set Sk = {(xi, yi)}_{i=1}^k containing the past evaluated designs and observations. 
For any\n(potentially non-evaluated) design x \u2208 X , the posterior mean \u00b5k(x) and posterior variance \u03c3k^2(x) of\nthe GP, conditioned on Sk, are known in closed form and are considered cheap to evaluate:\n\n\\mu_k(x) = K(X_k, x)^\\top [K(X_k, X_k) + \\lambda I]^{-1} Y_k, (3)\n\n\\sigma_k^2(x) = \\kappa(x, x) - K(X_k, x)^\\top [K(X_k, X_k) + \\lambda I]^{-1} K(X_k, x), (4)\n\nwhere K(Xk, Xk) is the k \u00d7 k matrix whose ijth entry is \u03ba(xi, xj), K(Xk, x) (respectively Yk) is\nthe k \u00d7 1 vector whose ith entry is \u03ba(xi, x) (respectively yi), and \u03bb is the noise variance. A new\ndesign xk+1 is then selected and evaluated with this objective function to provide an observation\nyk+1 = f (xk+1). This new pair (xk+1, yk+1) is added to the current training set Sk to define the\ntraining set for the next iteration Sk+1 = Sk \u222a {(xk+1, yk+1)}.\nIn BO, the next design to evaluate is selected by solving an auxiliary problem (AP), typically of the\nform:\n\n(AP) x_{k+1} = \\operatorname{argmax}_{x \\in \\mathcal{X}} U_k(x; S_k), (5)\n\nwhere Uk is a utility function to maximize. 
The rationale is that, because the optimization run-time or\ncost is dominated by the evaluation of the expensive function f, time and effort should be dedicated\nto choosing a good and informative (in a sense defined by the auxiliary problem) design to evaluate.\nSolving this auxiliary problem (sometimes called maximization of an acquisition or utility function)\ndoes not involve the evaluation of the expensive objective function f, but only the posterior quantities\nof the GP and, thus, is considered cheap.\nExamples of utility functions, Uk, used to select the next design to evaluate in Bayesian optimization\ninclude maximizing the probability of improvement (PI) [12], maximizing the expected improvement\n(EI) in the efficient global optimization (EGO) algorithm [10], minimizing a linear combination \u00b5 \u2212 \u03b1\u03c3\nof the posterior mean \u00b5 and standard deviation \u03c3 in GP upper confidence bound (GP-UCB) [18], or\nmaximizing a metric quantifying the information gain [19, 6, 7]. However, the aforementioned utility\nfunctions are oblivious to the number of objective function evaluations left and, thus, lead to greedy\noptimization strategies. Devising methods that account for the remaining budget would make it possible to better\nplan the sequence of designs to evaluate, balance in a principled way the exploration-exploitation\ntrade-off encountered in BO, and thus potentially lead to performance gains.\n\n3 Dynamic Programming\n\nIn this section, we review some of the key features of dynamic programming (DP), which addresses\noptimal decision making under uncertainty for dynamical systems. BO with a finite budget can be\nseen as such a problem. 
It has the following characteristics: (1) a statistical model to represent the\nobjective function, (2) a system dynamic that describes how this statistical model is updated as new\ninformation is collected, and (3) a goal that can be quantified with a long-term reward. DP provides\nus with a mathematical formulation to address this class of problem. A full overview of DP can be\nfound in [1, 15].\nWe consider a system governed by a discrete-stage dynamic. At each stage k, the system is fully\ncharacterized by a state zk \u2208 Zk. A control uk, from a control space Uk(zk), that generally depends\non the state, is applied. Given a state zk and a control uk, a random disturbance wk \u2208 Wk(zk, uk)\noccurs, characterized by a random variable Wk with probability distribution P(\u00b7|zk, uk). Then, the\nsystem evolves to a new state zk+1 \u2208 Zk+1, according to the system dynamic. This can be written in\nthe following form:\n\n\\forall k \\in \\{1, \\cdots, N\\}, \\forall (z_k, u_k, w_k) \\in \\mathcal{Z}_k \\times \\mathcal{U}_k \\times \\mathcal{W}_k, \\quad z_{k+1} = F_k(z_k, u_k, w_k), (6)\n\nwhere z1 is an initial state, N is the total number of stages, or horizon, and Fk : Zk \u00d7 Uk \u00d7 Wk \u2192\nZk+1 is the dynamic of the system at stage k (where the spaces\u2019 dependencies are dropped for ease\nof notation).\nWe seek the construction of an optimal policy (optimal in a sense yet to be defined). A policy,\n\u03c0 = {\u03c01,\u00b7\u00b7\u00b7 , \u03c0N}, is a sequence of rules, \u03c0k : Zk \u2192 Uk, for k = 1,\u00b7\u00b7\u00b7 , N, mapping a state\nzk to a control uk = \u03c0k(zk).\nAt each stage k, a stage-reward function rk : Zk \u00d7 Uk \u00d7 Wk \u2192 R quantifies the benefits of applying\na control uk to a state zk, subject to a disturbance wk. A final reward function rN+1 : ZN+1 \u2192 R\nsimilarly quantifies the benefits of ending at a state zN+1. 
Thus, the expected reward starting from\nstate z1 and using policy \u03c0 is:\n\nJ_\\pi(z_1) = \\mathbb{E}\\left[ r_{N+1}(z_{N+1}) + \\sum_{k=1}^{N} r_k(z_k, \\pi_k(z_k), w_k) \\right], (7)\n\nwhere the expectation is taken with respect to the disturbances. An optimal policy, \u03c0\u2217, is a policy\nthat maximizes this (long-term) expected reward over the set of admissible policies \u03a0:\n\nJ^*(z_1) = J_{\\pi^*}(z_1) = \\max_{\\pi \\in \\Pi} J_\\pi(z_1), (8)\n\nwhere J\u2217 is the optimal reward function, also called optimal value function. Using Bellman\u2019s\nprinciple of optimality, the optimal reward is given by a nested formulation and can be computed\nusing the following DP recursive algorithm, working backward from k = N to k = 1:\n\nJ_{N+1}(z_{N+1}) = r_{N+1}(z_{N+1}), (9)\n\nJ_k(z_k) = \\max_{u_k \\in \\mathcal{U}_k} \\mathbb{E}[r_k(z_k, u_k, w_k) + J_{k+1}(F_k(z_k, u_k, w_k))]. (10)\n\nThe optimal reward J\u2217(z1) is then given by J1(z1), and if u_k^\u2217 = \u03c0_k^\u2217(zk) maximizes the right hand\nside of Eq. 10 for all k and all zk, then the policy \u03c0\u2217 = {\u03c0_1^\u2217,\u00b7\u00b7\u00b7 , \u03c0_N^\u2217} is optimal (e.g., [1], p.23).\n\n4 Bayesian Optimization with a Finite Budget\n\nIn this section, we define the auxiliary problem of BO with a finite budget (Eq. 5) as a DP instance.\nAt each iteration k, we seek to evaluate the design that will lead, once the evaluation budget N\nhas been consumed, to the maximum reduction of the objective function. In general, the value of\nthe objective function f (x) at a design x is unknown before its evaluation and, thus, estimating\nthe long-term effect of an evaluation is not possible. However, using the GP representing f, it is\npossible to characterize the unknown f (x) by a distribution. 
This can be used to simulate sequences\nof designs and function values (i.e., optimization scenarios), compute their rewards and associated\nprobabilities, without evaluating f. Using this simulation machinery, it is possible to capture the\ngoal of achieving a long-term reward in a utility function Uk. We now formulate the simulation of\noptimization scenarios in the DP context and proceed with the definition of such a utility function Uk.\nWe consider that the process of optimization is a dynamical system. At each iteration k, this system\nis fully characterized by a state zk equal to the training set Sk. The system is acted upon by a control\nuk equal to the design xk+1 selected to be evaluated. For a given state and control, the value of the\nobjective function is unknown and modeled as a random variable Wk, characterized by:\n\nW_k \\sim \\mathcal{N}\\left(\\mu_k(x_{k+1}), \\sigma_k^2(x_{k+1})\\right), (11)\n\nwhere \u00b5k(xk+1) and \u03c3k^2(xk+1) are the posterior mean and variance of the GP at xk+1, conditioned\non Sk. We define a disturbance wk to be equal to a realization fk+1 of Wk. Thus, wk = fk+1\nrepresents a possible (simulated) value of the objective function at xk+1. 
Note that this simulated\nvalue of the objective function, fk+1, is not the value of the objective function yk+1 = f (xk+1).\nHence, we have the following identities: Zk = (X \u00d7 R)^k, Uk = X and Wk = R.\nThe new state zk+1 is then defined to be the augmented training set Sk+1 = Sk \u222a {(xk+1, fk+1)},\nand the system dynamic can be written as:\n\nS_{k+1} = F_k(S_k, x_{k+1}, f_{k+1}) = S_k \\cup \\{(x_{k+1}, f_{k+1})\\}. (12)\n\nThe disturbances wk+1 at iteration k + 1 are then characterized, using Bayes\u2019 rule, by the posterior\nof the GP conditioned on the training set Sk+1.\nTo optimally control this system (i.e., to use an optimal strategy to solve OP), we define the stage-reward\nfunction at iteration k to be the reduction in the objective function obtained at stage k:\n\nr_k(S_k, x_{k+1}, f_{k+1}) = \\max\\left\\{0, f_{\\min}^{S_k} - f_{k+1}\\right\\}, (13)\n\nwhere f_{\\min}^{S_k} is the minimum value of the objective function in the training set Sk. We define the final\nreward to be zero: rN+1(SN+1) = 0. The utility function, at a given iteration k characterized by Sk,\nis defined to be the expected reward:\n\n\\forall x_{k+1} \\in \\mathcal{X}, \\quad U_k(x_{k+1}; S_k) = \\mathbb{E}[r_k(S_k, x_{k+1}, f_{k+1}) + J_{k+1}(F_k(S_k, x_{k+1}, f_{k+1}))], (14)\n\nwhere the expectation is taken with respect to the disturbances, and Jk+1 is defined by Eqs. 
9-10.\nNote that E[rk(Sk, xk+1, fk+1)] is simply the expected improvement given, for all x \u2208 X , by:\n\n\\mathrm{EI}(x; S_k) = \\left(f_{\\min}^{S_k} - \\mu_k(x)\\right) \\Phi\\left(\\frac{f_{\\min}^{S_k} - \\mu_k(x)}{\\sigma_k(x)}\\right) + \\sigma_k(x) \\phi\\left(\\frac{f_{\\min}^{S_k} - \\mu_k(x)}{\\sigma_k(x)}\\right), (15)\n\nwhere \u03a6 is the standard Gaussian CDF and \u03c6 is the standard Gaussian PDF.\nIn other words, the GP is used to simulate possible scenarios, and the next design to evaluate is\nchosen to maximize the decrease of the objective function, over the remaining iterations, averaged\nover all possible simulated scenarios.\nSeveral related methods have been proposed to go beyond greedy BO strategies. Optimal formulations\nfor BO with a finite budget have been explored in [14, 4]. Both formulations involve nested\nmaximizations and expectations. Those authors note that their N-steps lookahead methods scale\npoorly with the number of steps considered (i.e., the budget N); they are able to solve the problem\nfor two-steps lookahead. For some specific instances of BO (e.g., finding the super-level set of a\none-dimensional function), the optimal multi-step strategy can be computed efficiently [3]. Approximation\ntechniques accounting for more steps have been recently proposed. They leverage partial\ntree exploration [13] or Lipschitz reward function [11] and have been applied to cases where the\ncontrol spaces Uk are finite (e.g., at each iteration, uk is one of the 4 or 8 directions that a robot\ncan take to move before it evaluates f). Theoretical performance guarantees are provided for the\nalgorithm proposed in [11]. Another approximation technique for non-greedy BO has been proposed\nin GLASSES [5] and is applicable to uncountable control space Uk. 
It builds an approximation of\nthe N-steps lookahead formulation by using a one-step lookahead algorithm with approximation\nof the value function Jk+1. The approximate value function is induced by a heuristic oracle based\non a batch Bayesian optimization method. The oracle is used to select up to 15 steps at once to\napproximate the value function.\nIn this paper, we propose to use rollout, an ADP algorithm, to address the intractability of the DP\nformulation. The proposed approach is not restricted to countable control spaces, and accounts for\nmore than two steps. This is achieved by approximating the value function Jk+1 with simulations\nover several steps, where the information acquired at each simulated step is explicitly used to simulate\nthe next step. Note that this is a closed-loop approach, in comparison to GLASSES [5] which is an\nopen-loop approach. In contrast to the DP formulation, the decision made at each simulated step of\nthe rollout is not optimal, but guided by problem-dependent heuristics. In this paper we propose the\nuse of heuristics adapted to BO, leveraging existing greedy BO strategies.\n\n5 Rollout for Bayesian Optimization\n\nSolving the auxiliary problem de\ufb01ned by Eqs. 5,14 is challenging. It requires the solution of nested\nmaximizations and expectations for which there is no closed-form expression known. In \ufb01nite spaces,\nthe DP algorithm already suffers from the curse of dimensionality. In this particular setting, the state\nspaces Zk = (X \u00d7 R)k are uncountable and their dimension increases by d + 1 at each stage. The\ncontrol spaces Uk = X are also uncountable, but of \ufb01xed dimension. Thus, solving Eq. 5 with utility\nfunction de\ufb01ned by Eq. 14 is intractable.\nTo simplify the problem, we use ADP to approximate Uk with the rollout algorithm (see [1, 15] for\nan overview). It is a one-step lookahead technique where Jk+1 is approximated using simulations\nover several future steps. 
The difference with the DP formulation is that, in those simulated future\nsteps, rollout relaxes the requirement to optimally select a design (which is the origin of the nested\nmaximizations). Instead, rollout uses a suboptimal heuristic to decide which control to apply for a\ngiven state. This suboptimal heuristic is problem-dependent and, in the context of BO with a finite\nbudget, we propose to use existing greedy BO algorithms as such a heuristic. Our algorithm proceeds\nas follows.\nFor any iteration k, the optimal reward to go, Jk+1 (Eq. 14), is approximated by Hk+1, the reward to\ngo induced by a heuristic \u03c0 = (\u03c01,\u00b7\u00b7\u00b7 , \u03c0N ), also called base policy. Hk+1 is recursively given by:\n\nH_N(S_N) = \\mathrm{EI}(\\pi_N(S_N); S_N), (16)\n\nH_n(S_n) = \\mathbb{E}[r_n(S_n, \\pi_n(S_n), f_{n+1}) + \\gamma H_{n+1}(F(S_n, \\pi_n(S_n), f_{n+1}))], (17)\n\nfor all n \u2208 {k + 1,\u00b7\u00b7\u00b7 , N \u2212 1}, where \u03b3 \u2208 [0, 1] is a discount factor incentivizing the early collection\nof reward. A discount factor \u03b3 = 0 leads to a greedy strategy that maximizes the immediate collection\nof reward. This corresponds to maximizing the EI. On the other hand, \u03b3 = 1 means that there is no\ndifferentiation between collecting reward early or late in the optimization. Note that Hk+1 is defined\nby recursion, and involves nested expectations. However, the nested maximizations are replaced by\nthe use of the base policy \u03c0. An important point is that, even if its definition is recursive, Hk+1 can\nbe computed in a forward manner, unlike Jk+1 which has to be computed in a backward fashion (see\nEqs. 9,10). The DP and the rollout formulations are illustrated in Fig. 1. The approximated reward\nHk+1 is then numerically approximated by \\tilde{H}_{k+1} using several simplifications. First, we use a rolling\nhorizon, h, to alleviate the curse of dimensionality. 
At a given iteration k, a rolling horizon limits the\nnumber of stages considered to compute the approximate reward to go by replacing the horizon N by\n\\tilde{N} = min{k + h, N}. Second, expectations are taken with respect to the (Gaussian) disturbances\nand are approximated using Gauss-Hermite quadrature. We obtain the following formulation:\n\n\\tilde{H}_{\\tilde{N}}(S_{\\tilde{N}}) = \\mathrm{EI}(\\pi_{\\tilde{N}}(S_{\\tilde{N}}); S_{\\tilde{N}}), (18)\n\n\\tilde{H}_n(S_n) = \\sum_{q=1}^{N_q} \\alpha^{(q)} \\left[ r_n\\left(S_n, \\pi_n(S_n), f_{n+1}^{(q)}\\right) + \\gamma \\tilde{H}_{n+1}\\left(F\\left(S_n, \\pi_n(S_n), f_{n+1}^{(q)}\\right)\\right) \\right], (19)\n\nfor all n \u2208 {k + 1,\u00b7\u00b7\u00b7 , \\tilde{N} \u2212 1}, where Nq \u2208 N is the number of quadrature weights \u03b1^(q) \u2208 R\nand points f_{k+1}^{(q)} \u2208 R, and rk is the stage-reward defined by Eq. 13. 
Finally, for all iterations\nk \u2208 {1,\u00b7\u00b7\u00b7 , N \u2212 1} and for all xk+1 \u2208 X , we define the utility function to be:\n\nU_k(x_{k+1}; S_k) = \\sum_{q=1}^{N_q} \\alpha^{(q)} \\left[ r_k\\left(S_k, x_{k+1}, f_{k+1}^{(q)}\\right) + \\gamma \\tilde{H}_{k+1}\\left(F\\left(S_k, x_{k+1}, f_{k+1}^{(q)}\\right)\\right) \\right]. (20)\n\nWe note that for the last iteration, k = N, the utility function is known in closed form:\n\nU_N(x_{N+1}; S_N) = \\mathrm{EI}(x_{N+1}; S_N). (21)\n\nFigure 1: Graphs representing the DP (left) and the rollout (right) formulations (in the binary\ndecisions, binary disturbances case). Each white circle represents a training set, each black circle\nrepresents a training set and a design. Double arrows are decisions that depend on decisions lower in\nthe graph (leading to nested optimizations in the DP formulation), single arrows represent decisions\nmade using a heuristic (independent of the lower part of the graph). Dashed lines are simulated values\nof the objective function and lead to the computation of expectations. Note the simpler structure of\nthe rollout graph compared to the DP one.\n\nThe base policy \u03c0 used as a heuristic in the rollout is problem-dependent. A good heuristic \u03c0 should\nbe cheap to compute and induce an expected reward J\u03c0 close to the optimal expected reward J\u03c0\u2217\n(Eq. 7). In the context of BO with a finite budget, this heuristic should mimic an optimal strategy that\nbalances the exploration-exploitation trade-off. We propose to use existing BO strategies, in particular,\nmaximization of the expected improvement (which has an exploratory behavior) and minimization of\nthe posterior mean (which has an exploitation behavior) to build the base policy. For every iteration\nk \u2208 {1,\u00b7\u00b7\u00b7 , N \u2212 1}, we define \u03c0 = {\u03c0k+1,\u00b7\u00b7\u00b7 , \u03c0_{\\tilde{N}}} such that, at stage n \u2208 {k + 1,\u00b7\u00b7\u00b7 , \\tilde{N} \u2212 1}, the\npolicy component, \u03c0n : Zn \u2192 X , maps a state zn = Sn to the design xn+1 that maximizes the\nexpected improvement (Eq. 15):\n\nx_{n+1} = \\operatorname{argmax}_{x \\in \\mathcal{X}} \\mathrm{EI}(x; S_n). (22)\n\nThe last policy component, \u03c0_{\\tilde{N}} : Z_{\\tilde{N}} \u2192 X , is defined to map a state z_{\\tilde{N}} = S_{\\tilde{N}} to the design x_{\\tilde{N}+1}\nthat minimizes the posterior mean (Eq. 3):\n\nx_{\\tilde{N}+1} = \\operatorname{argmin}_{x \\in \\mathcal{X}} \\mu_{\\tilde{N}}(x). (23)\n\nEach evaluation of the utility function requires O(N_q^h) applications of a heuristic. 
In our approach,\nthe heuristic involves optimizing a quantity that requires O(|Sk|^2) of work (rank-1 update of the\nCholesky decomposition to update the GP, and back-substitution for the posterior variance).\nTo summarize, we propose to use rollout, a one-step lookahead algorithm that approximates Jk+1.\nThis approximation is computed using simulation over several steps (e.g., more than 3 steps), where\nthe information collected after a simulated step is explicitly used to simulate the next step (i.e., it is a\nclosed-loop approach). This is achieved using a heuristic instead of the optimal strategy, and thus,\nleads to a formulation without nested maximizations.\n\n6 Experiments and Discussion\n\nIn this section, we apply the proposed algorithm to several optimization problems with a finite budget\nand demonstrate its performance on GP generated and classic test functions.\nWe use a zero-mean GP with square-exponential kernel (hyper-parameters: maximum variance\n\u03c3^2 = 4, length scale L = 0.1, noise variance \u03bb = 10^\u22123) to generate 24 objective functions defined\non X = [0, 1]^2. We generate 10 designs from a uniform distribution on X , and use them as 10\ndifferent initial guesses for optimization. Thus, for each optimization, the initial training set S1\ncontains one training point. All algorithms are given a budget of N = 15 evaluations. For each\ninitial guess and each objective function, we run the BO algorithm with the following utility\nfunctions: PI, EI and GP-UCB (with the parameter balancing exploration and exploitation set to\n\u03b1 = 3). We also run the rollout algorithm proposed in Sec. 5 and defined by Eqs. 
5,20, for the same\nobjective functions and with the same initial guesses for different parameters of the rolling horizon\nh \u2208 {2, 3, 4, 5} and discount factor \u03b3 \u2208 {0.5, 0.7, 0.9, 1.0}. All algorithms use the same kernel and\nhyper-parameters as those used to generate the objective functions.\nGiven a limited evaluation budget, we evaluate the performance of an algorithm for the original\nproblem (Eq. 1) in terms of gap G [8]. The gap measures the best decrease in objective function from\nthe first to the last iteration, normalized by the maximum reduction possible:\n\nG = \\frac{f_{\\min}^{S_1} - f_{\\min}^{S_{N+1}}}{f_{\\min}^{S_1} - f(x^*)}. (24)\n\nThe mean and the median performances of the rollout algorithm are computed for the 240 experiments\nfor the 16 configurations of discount factors and rolling horizons. The results are reported in Table 1.\n\nTable 1: Mean (left) and median (right) performance G over 24 objective functions and 10 initial\nguesses for different rolling horizons h and discount factors \u03b3.\n\n      Mean                              Median\n\u03b3     h = 2   h = 3   h = 4   h = 5     h = 2   h = 3   h = 4   h = 5\n0.5   0.790   0.811   0.799   0.817     0.849   0.862   0.858   0.856\n0.7   0.787   0.786   0.787   0.836     0.849   0.830   0.806   0.878\n0.9   0.816   0.767   0.827   0.828     0.896   0.839   0.876   0.850\n1.0   0.818   0.793   0.842   0.812     0.870   0.861   0.917   0.858\n\nThe mean gap achieved is G = 0.698 for PI, G = 0.762 for EI and G = 0.711 for GP-UCB.\nAll the configurations of the rollout algorithm outperform the three greedy BO algorithms. The\nbest performance is achieved by the configuration \u03b3 = 1.0 and h = 4. For this configuration, the\nperformance increase with respect to EI is about 8% (0.842 versus 0.762). The worst mean configuration (\u03b3 = 0.9 and\nh = 3) still outperforms EI by 0.5% (0.767 versus 0.762).\nThe median performance achieved is G = 0.738 for PI, G = 0.777 for EI and G = 0.770 for\nGP-UCB. 
All the configurations of the rollout algorithm outperform the three greedy BO algorithms.\nThe best performance is achieved by the configuration \u03b3 = 1.0 and h = 4 (same as best mean\nperformance). For this configuration, the performance increase with respect to EI is about 14% (0.917 versus 0.777).\nThe worst rollout configuration (\u03b3 = 0.7 and h = 4) still outperforms EI by 2.9% (0.806 versus 0.777). The complete\ndistribution of gaps achieved by the greedy BO algorithms and the best and worst configurations of\nthe rollout is shown in Fig. 2.\nWe notice that increasing the length of the rolling horizon does not necessarily increase the gap (see\nTable 1). This is a classic result from DP (Sec. 6.5.1 of [1]). We also notice that discounting the\nfuture rewards has no clear effect on the gap. For all discount factors tested, we notice that reward is\nnot only collected at the last stage (see Fig. 2). This is a desirable property. Indeed, in a case where\nthe optimization has to be stopped before the end of the budget is reached, one would wish to have\ncollected part of the reward.\n\nFigure 2: Left: Histogram of gap for the rollout (best and worst mean configurations tested) and\ngreedy BO algorithms. Right: Median gap of the rollout (for the best and worst mean configurations\ntested) and other algorithms as a function of iteration (budget of N = 15).\n\nWe now evaluate the performance on test functions (taken from http://www.sfu.ca/~ssurjano/optimization.html). We consider four rollout configurations R-4-9\n(h = 4, \u03b3 = 0.9), R-4-10 (h = 4, \u03b3 = 1.0), R-5-9 (h = 5, \u03b3 = 0.9) and R-5-10 (h = 5, \u03b3 = 1.0)\nand two additional BO algorithms: PES [7] and the non-greedy GLASSES [5]. We use a square-exponential\nkernel for each algorithm (hyper-parameters: maximum variance \u03c3^2 = 4, noise variance\n\u03bb = 10^\u22123, length scale L set to 10% of the design space length scale). 
We generate 40 designs from\na uniform distribution on X , and use them as 40 different initial guesses for optimization. Each\nalgorithm is given N = 15 evaluations. The mean and median gap (over the 40 initial guesses) for\neach function define 8 metrics (shown in Table 2). We found that rollout had the best metric 3 times\nout of 8, and was never the worst algorithm. PES was found to perform best on 3 metrics out of\n8 but was the worst algorithm for 2 metrics out of 8. GLASSES was never the best algorithm and\nperformed the worst in one metric. Note that the rollout configuration R-4-9 outperforms GLASSES\non 5 metrics out of 6 (excluding the case of the Griewank function). Thus, our rollout algorithm\nperforms well and shows robustness.\n\nTable 2: Mean and median gap G over 40 initial guesses.\n\nFunction name    Metric   PI      EI      UCB     PES     GLASSES   R-4-9   R-4-10   R-5-9   R-5-10\nBranin-Hoo       Mean     0.847   0.818   0.848   0.861   0.846     0.904   0.898    0.887   0.903\nBranin-Hoo       Median   0.922   0.909   0.910   0.983   0.909     0.959   0.943    0.921   0.950\nGoldstein-Price  Mean     0.873   0.866   0.733   0.819   0.782     0.895   0.784    0.861   0.743\nGoldstein-Price  Median   0.983   0.981   0.899   0.987   0.919     0.991   0.985    0.989   0.928\nGriewank         Mean     0.827   0.884   0.913   0.972   1*        0.882   0.885    0.930   0.867\nGriewank         Median   0.904   0.953   0.970   0.987   1*        0.967   0.962    0.960   0.954\nSix-hump Camel   Mean     0.850   0.887   0.817   0.664   0.776     0.860   0.825    0.793   0.803\nSix-hump Camel   Median   0.893   0.970   0.915   0.801   0.941     0.926   0.900    0.941   0.907\n\n* See footnote 2.\n\n7 Conclusions\n\nWe presented a novel algorithm to perform Bayesian optimization with a finite budget of evaluations.\nThe next design to evaluate is chosen to maximize a utility function that quantifies long-term rewards.\nWe propose to employ an approximate dynamic programming algorithm, rollout, to approximate\nthis utility function. 
Rollout leverages heuristics to circumvent the need for nested maximizations.\nWe propose to build such a heuristic using existing suboptimal Bayesian optimization strategies, in\nparticular maximization of the expected improvement and minimization of the posterior mean. The\nproposed approximate dynamic programming algorithm is empirically shown to outperform popular\ngreedy and non-greedy Bayesian optimization algorithms on multiple test cases.\n\nThis work was supported in part by the AFOSR MURI on multi-information sources of multi-physics\nsystems under Award Number FA9550-15-1-0038, program manager Dr. Jean-Luc Cambier.\n\n2 This gap G = 1 results from an arbitrary choice made by one optimizer used by GLASSES to evaluate the\norigin. The origin happens to be the minimizer of the Griewank function. We thus exclude those results from the\nanalysis.\n\n[Figure 2 panels. Left: number of realizations versus gap G. Right: gap G versus iteration k. Legend: Rollout (Best), Rollout (Worst), EI, UCB, PI.]\n\nReferences\n\n[1] D. P. Bertsekas. Dynamic programming and optimal control, volume 1. Athena Scientific, 1995.\n\n[2] E. Brochu, V. M. Cora, and N. De Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.\n\n[3] J. M. Cashore, L. Kumarga, and P. I. Frazier. Multi-step Bayesian optimization for one-dimensional feasibility determination. Working paper. Retrieved from https://people.orie.cornell.edu/pfrazier/pub/workingpaper-CashoreKumargaFrazier.pdf.\n\n[4] D. Ginsbourger and R. Le Riche. Towards Gaussian process-based optimization with finite time horizon. In mODa 9\u2013Advances in Model-Oriented Design and Analysis, pages 89\u201396. Springer, 2010.\n\n[5] J. Gonz\u00e1lez, M. Osborne, and N. D. Lawrence. 
GLASSES: Relieving the myopia of Bayesian optimisation. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 790\u2013799, 2016.\n\n[6] P. Hennig and C. J. Schuler. Entropy search for information-efficient global optimization. The Journal of Machine Learning Research, 13(1):1809\u20131837, 2012.\n\n[7] J. M. Hern\u00e1ndez-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems, pages 918\u2013926, 2014.\n\n[8] D. Huang, T. T. Allen, W. I. Notz, and N. Zeng. Global optimization of stochastic black-box systems via sequential kriging meta-models. Journal of Global Optimization, 34(3):441\u2013466, 2006.\n\n[9] D. R. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21(4):345\u2013383, 2001.\n\n[10] D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455\u2013492, 1998.\n\n[11] C. K. Ling, K. H. Low, and P. Jaillet. Gaussian process planning with Lipschitz continuous reward functions: Towards unifying Bayesian optimization, active learning, and beyond. In 30th AAAI Conference on Artificial Intelligence, 2016.\n\n[12] D. J. Lizotte. Practical Bayesian Optimization. PhD thesis, Edmonton, Alta., Canada, 2008. AAINR46365.\n\n[13] R. Marchant, F. Ramos, and S. Sanner. Sequential Bayesian optimisation for spatial-temporal monitoring. 2015.\n\n[14] M. A. Osborne, R. Garnett, and S. J. Roberts. Gaussian processes for global optimization. In 3rd International Conference on Learning and Intelligent Optimization (LION3), pages 1\u201315, 2009.\n\n[15] W. B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality, volume 842. John Wiley & Sons, 2011.\n\n[16] C. E. Rasmussen and C. K. I. 
Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.\n\n[17] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951\u20132959, 2012.\n\n[18] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, pages 1015\u20131022, 2010.\n\n[19] J. Villemonteix, E. Vazquez, and E. Walter. An informational approach to the global optimization of expensive-to-evaluate functions. Journal of Global Optimization, 44(4):509\u2013534, 2009.\n", "award": [], "sourceid": 537, "authors": [{"given_name": "Remi", "family_name": "Lam", "institution": "MIT"}, {"given_name": "Karen", "family_name": "Willcox", "institution": "MIT"}, {"given_name": "David", "family_name": "Wolpert", "institution": "NASA Ames Research  Center"}]}