{"title": "Practical Two-Step Lookahead Bayesian Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 9813, "page_last": 9823, "abstract": "Expected improvement and other acquisition functions widely used in Bayesian optimization use a \"one-step\" assumption: they value objective function evaluations assuming no future evaluations will be performed. Because we usually evaluate over multiple steps, this assumption may leave substantial room for improvement. Existing theory gives acquisition functions looking multiple steps in the future but calculating them requires solving a high-dimensional continuous-state continuous-action Markov decision process (MDP). Fast exact solutions of this MDP remain out of reach of today's methods. As a result, previous two- and multi-step lookahead Bayesian optimization algorithms are either too expensive to implement in most practical settings or resort to heuristics that may fail to fully realize the promise of two-step lookahead. This paper proposes a computationally efficient algorithm that provides an accurate solution to the two-step lookahead Bayesian optimization problem in seconds to at most several minutes of computation per batch of evaluations. The resulting acquisition function provides increased query efficiency and robustness compared with previous two- and multi-step lookahead methods in both single-threaded and batch experiments. This unlocks the value of two-step lookahead in practice. We demonstrate the value of our algorithm with extensive experiments on synthetic test functions and real-world problems.", "full_text": "Practical Two-Step Look-Ahead\n\nBayesian Optimization\n\nJian Wu\n\nwujian046@gmail.com\n\nSchool of Operations Research and Information Engineering\n\nPeter I. 
Frazier*\n\nCornell University\nIthaca, NY 14850\npf98@cornell.edu\n\nAbstract\n\nExpected improvement and other acquisition functions widely used in Bayesian op-\ntimization use a “one-step” assumption: they value objective function evaluations\nassuming no future evaluations will be performed. Because we usually evaluate\nover multiple steps, this assumption may leave substantial room for improvement.\nExisting theory gives acquisition functions looking multiple steps in the future but\ncalculating them requires solving a high-dimensional continuous-state continuous-\naction Markov decision process (MDP). Fast exact solutions of this MDP remain\nout of reach of today's methods. As a result, previous two- and multi-step looka-\nhead Bayesian optimization algorithms are either too expensive to implement in\nmost practical settings or resort to heuristics that may fail to fully realize the\npromise of two-step lookahead. This paper proposes a computationally efficient\nalgorithm that provides an accurate solution to the two-step lookahead Bayesian\noptimization problem in seconds to at most several minutes of computation per\nbatch of evaluations. The resulting acquisition function provides increased query\nefficiency and robustness compared with previous two- and multi-step lookahead\nmethods in both single-threaded and batch experiments. This unlocks the value of\ntwo-step lookahead in practice. We demonstrate the value of our algorithm with\nextensive experiments on synthetic test functions and real-world problems.\n\n1\n\nIntroduction\n\nWe consider minimization of a continuous black-box function f over a hyperrectangle A ⊂ R^d.\nWe suppose evaluations f(x) are time-consuming to obtain, do not provide first- or second-order\nderivative information and are noise-free. 
Such problems arise when tuning hyperparameters of\ncomplex machine learning models [Snoek et al., 2012] and optimizing engineering systems using\nphysics-based simulators [Forrester et al., 2008].\nWe consider this problem within a Bayesian optimization (BayesOpt) framework [Brochu et al., 2010,\nFrazier, 2018]. BayesOpt methods contain two components: (1) a statistical model over f, typically a\nGaussian process [Rasmussen and Williams, 2006]; and (2) an acquisition function computed from\nthe statistical model that quantifies the value of evaluating f. After a first stage of evaluations of f,\noften at points chosen uniformly at random from A, we behave iteratively: we fit the statistical model\nto all available data; then optimize the resulting acquisition function (which can be evaluated quickly\nand often provides derivative information) to find the best point(s) at which to evaluate f; perform\nthese evaluations; and repeat until our evaluation budget is exhausted.\n\n*Peter Frazier is also a Staff Data Scientist at Uber\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fThe most widely-used acquisition functions use a one-step lookahead approach. They consider the\ndirect effect of the evaluation on an immediate measure of solution quality, and do not consider\nevaluations that will be performed later. This includes expected improvement (EI) [Jones et al., 1998],\nprobability of improvement (PI) [Kushner, 1964], entropy search (ES) [Hernández-Lobato et al., 2014,\nWang and Jegelka, 2017], and the knowledge gradient (KG) [Wu and Frazier, 2016]. By myopically\nmaximizing the immediate improvement in solution quality, they may sacrifice even greater gains in\nsolution quality obtainable through coordinated action across multiple evaluations.\nResearchers have sought to address this shortcoming through non-myopic acquisition functions. 
The\ndecision of where to sample next in BayesOpt can be formulated as a partially observable Markov\ndecision process (POMDP) [Ginsbourger and Riche, 2010]. The solution to this POMDP is given by\nthe Bellman recursion [Lam et al., 2016] and yields a non-myopic acquisition function that provides\nthe best possible average-case performance under the prior. However, the “curse of dimensionality”\n[Powell, 2007] prevents solving this POMDP for even small-scale problems.\nThe past literature [Lam et al., 2016, Osborne et al., 2009, Ginsbourger and Riche, 2010, González\net al., 2016] instead approximates the solution to this POMDP to create non-myopic acquisition\nfunctions. Two-step lookahead is particularly attractive [Osborne et al., 2009, Ginsbourger and Riche,\n2010, González et al., 2016] because it is substantially easier to compute than looking ahead more than\ntwo steps, but still promises a performance improvement over the one-step acquisition functions used\nin practice. Indeed, Ginsbourger and Riche [2010] argue that using two-step lookahead encourages\na particularly beneficial form of exploration: evaluating a high-uncertainty region benefits future\nevaluations; if the evaluation reveals the region was better than expected, then future evaluations\nevaluate nearby to find improvements in solution quality. This benefit occurs even if the first\nevaluation does not generate a direct improvement in solution quality. In numerical experiments,\nOsborne et al. [2009], Ginsbourger and Riche [2010] show that two-step lookahead improves over\none-step lookahead in a range of practical problems.\nAt the same time, optimizing two-step acquisition functions is computationally challenging. Unlike\ncommon one-step acquisition functions like expected improvement, they cannot be computed in\nclosed form and instead require a time-consuming simulation with nested optimization. 
Simulation\ncreates noise and prevents straightforward differentiation, which hampers optimizing these two-\nstep acquisition functions precisely. Existing approaches [Osborne et al., 2009, Ginsbourger and\nRiche, 2010, González et al., 2016] use derivative-free optimizers, which can require a large number\nof iterations to optimize precisely, particularly as the dimension d of the feasible space grows.\n(Numerical experiments in Osborne et al. [2009], Ginsbourger and Riche [2010] are restricted to\nproblems with d ≤ 3.) As a result, existing two- and multi-step methods require a prohibitive amount\nof computation (e.g., Lam [2018] reports that the method in Lam et al. [2016] requires between 10\nminutes and 1 hour per evaluation even on low-dimensional problems). If sufficient computation is\nnot performed, then errors in the acquisition-function optimization overwhelm the benefits provided\nby two-step lookahead and query efficiency degrades compared to a one-step acquisition function\nsupporting precise optimization. Similar challenges arise for the multi-step method proposed in\nLam et al. [2016]. This computational challenge has largely prevented the widespread adoption of\nnon-myopic acquisition functions in practice.\n\nContributions. This article makes two key innovations unlocking the power of two-step lookahead\nin practice. First, we provide an estimator based on the envelope theorem for the gradient of the two-\nstep lookahead acquisition function. Second, we show how Monte Carlo variance reduction methods\ncan further reduce the computational cost of estimating both the two-step lookahead acquisition\nfunction and its gradient. These techniques can be used within multistart stochastic gradient ascent to\nefficiently generate multiple approximate stationary points of the acquisition function, from which\nwe can select the best to provide an efficient approximate optimum. 
Together, these innovations\nsupport optimizing the acquisition function accurately with computation requiring between a few\nseconds and several minutes on a single core. Moreover, this computation can be easily parallelized\nacross cores. It also scales better in the batch size and dimension of the black-box function compared\nwith the common practice of using a derivative-free optimizer. An implementation is available\nin the Cornell-MOE codebase, https://github.com/wujian16/Cornell-MOE, and the code to\nreplicate our experiments is available at https://github.com/wujian16/TwoStep-BayesOpt.\nOur approach leverages computational techniques developed in the literature. The first is infinitesimal\nperturbation analysis [Heidelberger et al., 1988] and the envelope theorem [Milgrom and Segal, 2002],\n\n2\n\n\fpreviously used in Bayesian optimization to optimize the knowledge gradient acquisition function\n(which is myopic, as noted above) by Wu et al. [2017]. This built on earlier work using infinitesimal\nperturbation analysis without the envelope theorem to optimize the expected improvement acquisition\nfunction (also myopic) in the batch setting [Wang et al., 2016]. The second is a pair of variance\nreduction methods: Gauss-Hermite quadrature [Liu and Pierce, 1994] and importance sampling\n[Asmussen and Glynn, 2007]. Our paper is the first to demonstrate the power of these techniques for\nnon-myopic Bayesian optimization.\n\n2 The Two-Step Optimal (2-OPT) Acquisition Function\n\nThis section defines the two-step lookahead acquisition function. This acquisition function is optimal\nwhen there are two stages of measurements remaining, and so we call it 2-OPT. Before defining\n2-OPT, we first provide notation and brief background from Gaussian process regression in Sect. 2.1.\nWe then define 2-OPT in Sect. 2.2 and show how to estimate it with Monte Carlo in Sect. 2.3. 
While\n2-OPT has been defined implicitly in past work, we include a complete description to provide a\nframework and notation supporting our novel efficient method for maximizing it in Sect. 3.\n\n2.1 Gaussian process model for the objective f\nWe place a Gaussian process (GP) prior on the objective f. Although standard, here we briefly\ndescribe inference under a GP to provide notation used later. Our GP prior is characterized by a mean\nfunction µ(·) and a kernel function K(·,·). The posterior distribution on f after observing f at data\npoints D = (x(1), . . . , x(m)) is a GP with mean function and kernel defined respectively by\n\nµ(x) + K(x, D)K(D, D)^{-1}(f(D) - µ(D)),\nK(x, x') - K(x, D)K(D, D)^{-1}K(D, x').\n\n(1)\n\nIn (1), f(D) = (f(x(1)), . . . , f(x(m))), and similarly for µ(D). The expressions K(x, D), K(D, x),\nand K(D, D) evaluate to a row vector, a column vector, and a square matrix, respectively.\n\n2.2 Two-step lookahead acquisition function\nHere we define the 2-OPT acquisition function from a theoretical (but not yet computational)\nperspective. This formulation follows previous work on two-step and multi-step acquisition functions\n[Lam et al., 2016, Osborne et al., 2009, Ginsbourger and Riche, 2010, González et al., 2016]. 2-OPT\ngives optimal average-case behavior when we have two stages of evaluations remaining, and the\nsecond stage of evaluation may be chosen based on the results from the first.\nTo support batch evaluations while maintaining computational tractability, our first stage of evaluations\nuses a batch of q ≥ 1 simultaneous evaluations, while the second stage uses a single evaluation.\nThroughout, we assume that we have already observed a collection of data points D, so that the\ncurrent posterior distribution is a GP with a mean function µ0 and kernel K0 given by (1), and use\nE0 to indicate the expectation taken with respect to this distribution. 
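The posterior update in (1) can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the squared-exponential kernel and zero prior mean below are stand-in assumptions (the experiments in Section 4 use a constant mean and an ARD Matérn 5/2 kernel).

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    # Illustrative squared-exponential kernel standing in for K(., .).
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)

def gp_posterior(x, D, fD, K=rbf_kernel):
    # Eq. (1) with a zero prior mean: the posterior mean is
    # K(x,D) K(D,D)^{-1} f(D), and the posterior covariance is
    # K(x,x') - K(x,D) K(D,D)^{-1} K(D,x').
    KDD = K(D, D) + 1e-10 * np.eye(len(D))  # jitter for numerical stability
    KxD = K(x, D)
    mean = KxD @ np.linalg.solve(KDD, fD)
    cov = K(x, x) - KxD @ np.linalg.solve(KDD, KxD.T)
    return mean, cov
```

Because evaluations are noise-free, the posterior mean interpolates the data: queried at the observed points themselves, `gp_posterior` returns the observed values with (numerically) zero variance.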
We let f*_0 = min f(D) be the\nbest value observed thus far.\nWe index quantities associated with the first stage of evaluations by 1 and the second by 2. We let\nX1 indicate the set of q points to be evaluated in the first stage. We let f(X1) = (f(x) : x ∈ X1)\nindicate the corresponding vector of observed values and let min f(X1) be the smallest value in\nthis vector. We let x2 indicate the single point observed in the second stage.\nFor each i = 1, 2, we define f*_i to be the smallest value observed by the end of stage i, so f*_1 =\nmin(f*_0, f(X1)) and f*_2 = min(f*_1, f(x2)). We let µi be the mean function and Ki the kernel for\nthe posterior distribution given D and observations available at the end of stage i. We let Ei indicate\nthe expectation taken with respect to the corresponding Gaussian process.\nThe overall loss whose expected value we seek to minimize is f*_2.\nTo find the optimal sampling strategy, we follow the dynamic programming principle. We first write\nthe expected loss achievable at the end of the second stage, conditioned on the selection of points\n(X1) and results (f(X1)) from the first stage. If we choose the final evaluation optimally, then this\nexpected loss is L1 = min_{x2 ∈ A} E1[f*_2]. This posterior and thus also L1 depends on X1 and f(X1).\n\n3\n\n\fFollowing the derivation of the expected improvement in Jones et al. 
[1998], we rewrite this as\n\nL1 = min_{x2 ∈ A} E1[f*_1 - (f*_1 - f(x2))^+] = f*_1 - max_{x2 ∈ A} E1[(f*_1 - f(x2))^+] = f*_1 - max_{x2 ∈ A} EI1(x2),\n\nwhere y^+ = max(y, 0) is the positive part function and EI1(x2) is the expected improvement under\nthe GP after the first evaluation has been performed:\n\nEI1(x2) = EI(f*_1 - µ1(x2), K1(x2, x2)).\n\n(2)\n\nHere EI(m, v) = m Φ(m/√v) + √v φ(m/√v) gives the expected improvement at a point where the\ndifference between the best observed value and the mean is m and the variance is v. Φ is the standard\nnormal cdf and φ is the standard normal pdf.\nWith this expression for the value achievable at the start of the second stage, the expected value\nachieved at the start of the first stage is:\n\nE0[L1] = E0[f*_1 - max_{x2 ∈ A} EI1(x2)] = E0[f*_0 - (f*_0 - min f(X1))^+ - max_{x2 ∈ A} EI1(x2)]\n= f*_0 - EI0(X1) - E0[max_{x2 ∈ A} EI1(x2)],\n\n(3)\n\nwhere EI0(X1) = E0[(f*_0 - min f(X1))^+] is the multipoints expected improvement [Ginsbourger\net al., 2010] under the GP with mean µ0 and kernel K0.\nWe define our two-step acquisition function to be\n\n2-OPT_δ(X1) = EI0(X1) + E0[max_{x2 ∈ A(δ)} EI1(x2)],\n\n(4)\n\nwhere A(δ) is a set similar to A defined below. Because f*_0 does not depend on X1, finding the X1\nthat minimizes (3) is equivalent to finding the value that maximizes (4) (when A = A(δ)). In the\ndefinition of 2-OPT, we emphasize that E0[max_{x2 ∈ A(δ)} EI1(x2)] depends on X1 through the fact\nthat f(X1) influences the mean function µ1 and kernel K1 from which EI1 is computed.\nWe define A(δ) to be a compact set of points in A separated by at least δ from all points in X1 and\nfrom all points x with K0(x, x) = 0. 2-OPT(X1) means 2-OPT_δ(X1) with δ = 0, i.e., with A = A(δ). The\nparameter δ ≥ 0 is introduced purely to overcome a technical hurdle in our theoretical result and we\nbelieve in practice it can be set to 0. 
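The closed form EI(m, v) = m Φ(m/√v) + √v φ(m/√v) is simple to implement and can be checked against a Monte Carlo estimate of E[(m + √v Z)^+] with Z ~ N(0, 1); the constants in the check are arbitrary illustrative values.

```python
import numpy as np
from scipy.stats import norm

def EI(m, v):
    # EI(m, v) = m * Phi(m / sqrt(v)) + sqrt(v) * phi(m / sqrt(v)):
    # the expected improvement when the gap between the incumbent and
    # the posterior mean is m and the posterior variance is v.
    s = np.sqrt(v)
    z = m / s
    return m * norm.cdf(z) + s * norm.pdf(z)

# Monte Carlo check: EI(m, v) equals E[(m + sqrt(v) * Z)^+] for Z ~ N(0, 1).
rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)
mc_estimate = np.maximum(0.3 + np.sqrt(0.5) * z, 0.0).mean()
```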
Indeed, the theory allows setting δ to an extremely small value,\nsuch as 10^{-5}, and the maximum of EI1(x2) over x2 ∈ A is seldom this close to a point in X1: the\nposterior variance vanishes at points in X1 and EI1(x2) increases as x2 moves away from them.\nFigure 1 illustrates 2-OPT's behavior and shows how it explores more than EI.\n\n2.3 Monte Carlo estimation of 2-OPT(·)\n2-OPT(X1) cannot be computed in closed form. We can, however, estimate it using Monte Carlo.\nWe first use the reparameterization trick [Wilson et al., 2018] to write f(X1) as µ0(X1) + C0(X1)Z,\nwhere Z is a q-dimensional independent standard normal random variable and C0(X1) is the Cholesky\ndecomposition of K0(X1, X1).\nWe assume that K0(X1, X1) is positive definite so C0(X1) is of full rank.\nThen, under (1), for generic x,\n\nµ1(x) = µ0(x) + K0(x, X1)K0(X1, X1)^{-1}(f(X1) - µ0(X1))\n= µ0(x) + K0(x, X1)(C0(X1)C0(X1)^T)^{-1}C0(X1)Z\n= µ0(x) + σ0(x, X1)Z,\n\nK1(x, x) = K0(x, x) - K0(x, X1)K0(X1, X1)^{-1}K0(X1, x)\n= K0(x, x) - K0(x, X1)(C0(X1)C0(X1)^T)^{-1}K0(X1, x)\n= K0(x, x) - σ0(x, X1)σ0(x, X1)^T,\n\n4\n\n\fFigure 1: We demonstrate 2-OPT and EI minimizing a 1-d synthetic function sampled from a GP. Each row\nshows the posterior on f (mean +/- one standard deviation) and the corresponding acquisition function, for EI\n(left) and 2-OPT (right). We plot progress over three iterations. On the first iteration, EI evaluates a point that\nrefines an existing local optimum and could have provided a small one-step improvement, but provides little\ninformation of use in future evaluations. 
In contrast, 2-OPT explores more aggressively, which helps it identify a\nnew global minimum in the next iteration.\n\nwhere 0(x, X1) = K0(x, X1)C0(X1)1.\nWith this notation, we can write EI1(x2) explicitly as\nEI1(x2) = EI1(X1, x2, Z) := EI(f\u21e41 \u00b50(x2) 0(x2, X1)Z, K0(x2) 0(x2, X1)0(x2, X1)T )\nwhere we have introduced more explicitly in expanded notation EI1(X1, x2, Z) the quantities on\nwhich EI1(x2) depends, and written it explicitly in terms of the function EI(m, v).\nThus, we can rewrite the 2-OPT acquisition function as 2-OPT(X1) = E0[\\2-OPT(X1, Z)] where\n\n\\2-OPT(X1, Z) = (f\u21e40 min f (X1))+ + max\nx22A()\n\nEI1(x2)\n\n= max(f\u21e40 \u00b50(X1) C0(X1)Z)+ + max\nx22A()\n\nEI1(X1, x2, Z),\n\nwhere max(y)+ is the largest non-negative component of y, or 0 if all components are negative.\nThen, to estimate 2-OPT(X1), we sample Z and compute \\2-OPT(X1, Z) using a nonlinear global\noptimization routine to calculate the inner maximization. Averaging many such replications provides\na strongly consistent estimate of 2-OPT(X1).\nPrevious approaches [Osborne et al., 2009, Ginsbourger and Riche, 2010, Gonz\u00e1lez et al., 2016]\nuse this or a similar simulation method to obtain an estimator of 2-OPT, and then use this estimator\nwithin a derivative-free optimization approach. This requires extensive computation because:\n\n1. The nested optimization over x2 is time-consuming and must be done for each simulation.\n2. Noise in the simulation requires either a noise-tolerant derivative-free optimization method\nthat would typically require more iterations, or requires that the simulation be averaged over\nenough replications on each iteration to make noise negligible. This increases the number of\nsimulations required to optimize accurately.\n\n3. 
It does not leverage derivative information, causing optimization to require more iterations,\nespecially as the dimension d of the search space or the batch size q grows.\n\n3 Efficiently Optimizing 2-OPT\n\nHere we describe a novel computational approach to optimizing 2-OPT that is substantially more\nefficient than previously proposed methods. Our approach includes two components: a novel\nsimulation-based stochastic gradient estimator, which can be used within multistart stochastic gradient\nascent; and variance reduction techniques that reduce the variance of this stochastic gradient estimator.\n\n5\n\n\f3.1 Estimation of the Gradient of 2-OPT\n\nWe now show how to obtain an approximately unbiased estimator of the gradient of 2-OPT(X1).\nThe main idea is to exchange the expectation and gradient operators when taking the gradient with\nrespect to X1:\n\n∇2-OPT(X1) = E0[∇2-OPT-hat(X1, Z)]\n= E0[∇max(f*_0 - µ0(X1) - C0(X1)Z)^+ + ∇max_{x2 ∈ A(δ)} EI1(X1, x2, Z)]\n= E0[∇max(f*_0 - µ0(X1) - C0(X1)Z)^+ + ∇EI1(X1, x*_2, Z)],\n\nwhere x*_2 ∈ argmax_{x2 ∈ A(δ)} EI1(X1, x2, Z) is fixed and the last equation follows under some\nregularity conditions by the envelope theorem [Milgrom and Segal, 2002]. The following theorem\nshows this estimator of ∇2-OPT(X1) is unbiased. Its proof is in the supplement.\nTheorem 1. We assume:\n\n• The domain A(δ) is nonempty and compact and δ > 0.\n• The mean function µ0 and kernel K0 are continuously differentiable.\n• The kernel K0 is non-degenerate, in the sense that the posterior variance, K1(x, x), at a\npoint is non-zero if the prior variance, K0(x, x), is strictly positive and that point has not\nbeen sampled (x is not in X1).\n\nLet x*_2 be a global maximizer in A(δ) of EI1(X1, x2, Z). 
Then,\n\ng(X1, Z) := ∇max(f*_0 - µ0(X1) - C0(X1)Z)^+ + ∇EI1(X1, x*_2, Z)\n\n(5)\n\nexists almost surely and is an unbiased estimator of ∇2-OPT(X1), where the gradient is taken with\nrespect to X1 while holding A(δ) fixed.\n\nWe then use this stochastic gradient estimator within stochastic gradient ascent [Kushner and Yin,\n2003] with multiple restarts to find a collection of stationary points X1 (each X1 is a single point in\nR^d if q = 1 or a collection of q points in R^d if q > 1). We use Monte Carlo to evaluate 2-OPT(X1)\nfor each of these stationary points and select as our approximate maximizer of 2-OPT the point or\nbatch of points with the largest estimated 2-OPT(X1). In practice we perform this procedure using\nδ = 0, although Theorem 1 only guarantees an unbiased gradient estimator when δ > 0.\n\n3.2 Variance reduction\n\nWe now describe variance reduction techniques that further improve computation time and accuracy.\n\nGauss-Hermite Quadrature (fully sequential setting)\nIn the fully sequential setting where we\npropose one point at each iteration (q = 1), we use Gauss-Hermite quadrature [Liu and Pierce, 1994]\nto estimate 2-OPT(X1) and its gradient. These quantities are both expectations over the 1-d standard\nGaussian random variable Z. Gauss-Hermite quadrature estimates the expectation of a random\nvariable g(Z) by a weighted sum Σ_{i=1}^n w_i g(z_i) with well-chosen weights w_i and locations z_i. In\npractice, we find n = 20 accurately estimates 2-OPT(X1) and its gradient.\n\nImportance sampling (batch setting)\nIn the batch setting, Gauss-Hermite quadrature scales poorly\nwith batch size q since the number of weighted points required grows exponentially with the dimension\nover which we integrate, which is q. 
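For a scalar standard normal Z, the Gauss-Hermite rule needs a change of variables before it matches E[g(Z)]: NumPy's `hermgauss` targets the weight e^{-x^2}, so the nodes are rescaled by √2 and the weights normalized by √π. A minimal sketch with a stand-in integrand (not the 2-OPT integrand):

```python
import numpy as np

def gauss_hermite_expectation(g, n=20):
    # Approximate E[g(Z)], Z ~ N(0, 1), by the weighted sum sum_i w_i g(z_i).
    # hermgauss integrates against exp(-x^2); substituting z = sqrt(2) * x
    # and dividing the weights by sqrt(pi) converts it to the N(0, 1) density.
    nodes, omega = np.polynomial.hermite.hermgauss(n)
    z = np.sqrt(2.0) * nodes
    w = omega / np.sqrt(np.pi)
    return float(np.sum(w * g(z)))
```

With n = 20 nodes, smooth integrands such as low-order moments of Z are reproduced essentially to machine precision, consistent with the small n found sufficient above.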
In the batch setting, we adopt another variance reduction\ntechnique: importance sampling [Asmussen and Glynn, 2007].\nRecall that our Monte Carlo estimators of 2-OPT and its gradient involve a sampled multipoints EI\nterm max(f*_0 - µ0(X1) - C0(X1)Z)^+.\nFor high-dimensional test functions or after we have many function evaluations, most draws of\nZ result in this multipoints EI term taking a value of 0. This occurs when all components of\nµ0(X1) + C0(X1)Z are larger than f*_0. For such Z, the derivative of this immediate improvement\n\n6\n\n\fterm is also 0. Also, for such Z, the second term in our Monte Carlo estimator of 2-OPT and its\ngradient, max_{x2 ∈ A} EI1(X1, x2, Z), also tends to be small and to have a small gradient.\nAs a result, when calculating the expected value of these samples of 2-OPT or its gradient, we include\nmany 0s. This can make the variance of estimators based on averaging these samples large relative\nto their expected value. This in turn makes gradient-based optimization and comparison using Monte\nCarlo estimates challenging.\nTo address this, we simulate Z from a multivariate Gaussian distribution with a larger standard\ndeviation v > 1, calling it Zv. This substantially increases the chance that at least one component\nof µ0(X1) + C0(X1)Zv will fall below f*_0. We find v = 3 works well in test problems.\nTo compensate for sampling from a different distribution, we multiply by the likelihood ratio between\nthe density for which we wish to calculate the expectation, which is the multivariate standard\nnormal density, and the density from which Zv was sampled. Letting φ(·; 0, v^2 I) indicate the q-
Letting '(\u00b7; 0, v2I) indicate the q-\ndimensional normal multivariate density with mean 0 and covariance matrix v2I, this likelihood ratio\nis '(Zv; 0, I)/'(Zv; 0, v2I).\nThe\nare,\n\\2-OPT(X1, Zv)'(Zv; 0, I)/'(Zv; 0, v2I) and g(X1, Zv)'(Zv; 0, I)/'(Zv; 0, v2I).\n\n2-OPT and\n\nrespectively,\n\nestimators\n\nof\n\nits\n\ngradients\n\nresulting\n\nunbiased\n\n4 Numerical experiments\n\nWe test our algorithms on common synthetic functions and widely-benchmarked real-world problems.\nWe compare with acquisition functions widely used in practice including GP-LCB [Srinivas et al.,\n2010], PI, EI [Snoek et al., 2012] and KG [Wu and Frazier, 2016] and multi-step lookahead methods\nfrom Lam et al. [2016] and Gonz\u00e1lez et al. [2016].\n2-OPT is substantially more robust than competing methods, providing performance that is best\nor close to best across essentially all problems, iterations and performance measures. In contrast,\nwhile other methods like EI and KG sometimes outperform 2-OPT, they also sometimes substantially\nunderperform. For example, EI has simple regret two orders of magnitude worse than 2-OPT on\nHartmann6 and KG is 3 times worse on 10d Rosenbrock.\nMoreover, the computation time of one iteration of 2-OPT is fast enough to be practical, varying from\nseconds to several minutes on a single core in all our experiments, and can be easily parallelized across\ncores. This is approximately an order of magnitude faster than the benchmark multi-step lookahead\nmethods. 2-OPT\u2019s strong empirical performance together with a supporting fast computational\nmethod unlocks the value of two-step lookahead in practice.\n\nExperimental details Following Snoek et al. [2012], we use a constant mean prior and the ARD\nMat\u00b4ern 5/2 kernel. We integrate over GP hyperparameters by sampling 16 sets of values using the\nemcee package [Foreman-Mackey et al., 2013]. 
We initiate our algorithms by randomly sampling 3\npoints from a Latin hypercube design and then start the Bayesian optimization iterative process. We\nuse 100 random initializations in the synthetic and real functions experiments, 40 in the comparisons\nto multi-step lookahead methods (replicating the experiment setup of Lam et al. [2016]), and 10 for\ncomparisons of computation time.\nSynthetic functions, compared with one-step methods. First, we test 2-OPT and benchmark\nmethods on 6 well-known synthetic test functions chosen from Bingham [2015] ranging from 2d to\n10d: 2d Branin, 2d Camel, 5d Ackley5, 6d Hartmann6, 8d Cosine and 10d Levy. Figure 2 shows the\n90% quantile of the log immediate regret for 6 of these 8 benchmarks. Figure 5 in the supplement\nreports the mean of the base 10 logarithm of the immediate regret (plus or minus one standard error)\non these functions along with two more added in the author response period: 2d Michalewicz and\n10d Rosenbrock.\nSynthetic functions, compared with multi-step methods. To compare with non-myopic algorithms\nproposed in Gonz\u00e1lez et al. [2016] and Lam et al. [2016], we replicate the experimental settings in\nLam et al. [2016] and add 2-OPT\u2019s performance to their Table 2. We report the results in Table 1.\nGLASSES was proposed in Gonz\u00e1lez et al. [2016] and the four columns R-4-9, R-4-10, R-5-9, and\nR-5-10 are algorithm variants proposed in Lam et al. [2016].\n\n7\n\n\fFigure 2: Synthetic test functions, 90% quantile of log10 immediate regret compared with common\none-step heuristics. 2-OPT provides substantially more robust performance.\n\nFigure 3: Realistic benchmarks: HPOlib (top): 2-OPT is competitive with the best of the competitors\nin each benchmark. ATO (bottom left): 2-OPT outperforms EI slightly and clearly outperforms KG.\nAll algorithms converge to nearly the same performance. 
Robot Pushing: 2-OPT slightly outperforms\nPES and clearly outperforms EI and KG.\n\n8\n\n\fFunction name     Stat    PI    EI    UCB   PES   GLASSES  R-4-9  R-4-10  R-5-9  R-5-10  2-OPT\nBranin-Hoo        Mean   .847  .818  .848  .861  .846     .904   .898    .887   .903    .9995\n                  Median .922  .909  .910  .983  .909     .959   .943    .921   .950    .9994\nGoldstein-Price   Mean   .873  .866  .733  .819  .782     .895   .784    .861   .743    .9651\n                  Median .983  .981  .899  .987  .919     .991   .985    .989   .928    .9911\nGriewank          Mean   .827  .884  .913  .972  1.0      .882   .885    .930   .867    .9321\n                  Median .904  .953  .970  .987  1.0      .967   .962    .960   .954    .9801\n6-hump Camel      Mean   .850  .887  .817  .664  .776     .860   .825    .793   .803    .9010\n                  Median .893  .970  .915  .801  .941     .926   .900    .941   .907    .9651\n\nTable 1: Performance of our two-step acquisition function (2-OPT) on test functions compared\nwith non-myopic and other benchmark algorithms originally reported in Lam et al. [2016]. Each\nvalue reported is the “gap”: the ratio of the overall improvement obtained by the algorithm to the\nimprovement possible by a globally optimal solution. A gap of 1 represents finding the optimal\nsolution; 0 represents no improvement in solution quality. The best gap appears in boldface.\n\nValues reported are the “gap” [Huang et al., 2006], which is the ratio of the improvement obtained\nby an algorithm to the improvement possible by a globally optimal solution. Letting x̂_N be the\nbest solution found by the algorithm and x̂_0 be the best solution found in the initial stage, the gap\nis G = (f(x̂_0) - f(x̂_N))/(f(x̂_0) - min_x f(x)). A gap of 1 indicates that the algorithm found a\nglobally optimal solution, while 0 indicates no improvement.\n2-OPT is best in 5 out of 8 problems (tied for best on one of these problems), and second-best in the\nremaining 3. 
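The gap used in Table 1 is straightforward to compute from the three function values in its definition; the numbers in the check below are illustrative, not taken from the experiments.

```python
def gap(f_init_best, f_final_best, f_global_min):
    # G = (f(x_hat_0) - f(x_hat_N)) / (f(x_hat_0) - min_x f(x)):
    # 1 means the run reached a global optimum; 0 means no improvement
    # over the best point found in the initial stage.
    return (f_init_best - f_final_best) / (f_init_best - f_global_min)
```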
It outperforms or ties the non-myopic competitors on all problem instances.\nIn the supplement, Figure 4 shows the time required for acquisition-function optimization on 1\ncore of an AWS t2.2xlarge instance for 2-OPT, EI, KG, and GLASSES. Time for other problems\nis similar, with higher-dimensional problems requiring more time. 2-OPT's computation time is\ncomparable to KG, about 10 times slower than EI, and about 10 times faster than GLASSES. Code\nfrom Lam et al. [2016] was unavailable when these experiments were performed, but Lam [2018]\nreports that the time required is between 10 minutes and 1 hour, even on low-dimensional problems.\n\nRealistic benchmarks Figure 3 shows performance on a collection of more realistic benchmarks:\nHPOlib, ATO, and Robot Pushing.\nThe HPOlib library was developed in Eggensperger et al. [2013] based on hyperparameter tuning\nbenchmarks from Snoek et al. [2012]. We benchmark on the two most widely used test problems\nthere: logistic regression and SVM. On both problems, 2-OPT performs comparably to the best of\nthe competitors, with 2-OPT and EI slightly outperforming KG on logistic regression.\nThe assemble-to-order (ATO) benchmark [Hong and Nelson, 2006, Poloczek et al., 2017] is a\nreinforcement learning problem with a parameterized control policy where the goal is to optimize\nan 8-dimensional inventory target vector to maximize profit in a business setting. 2-OPT provides a\nsubstantial benefit over competitors from the start and remains best over the whole process. After\n40 iterations, EI catches up to 2-OPT, while KG lags both EI and 2-OPT until iteration ~100, where all the\nalgorithms converge with comparable performance.\nThe robot pushing problem is a 14-dimensional reinforcement learning problem considered in Wang\nand Jegelka [2017]. 
2-OPT outperforms all the competitors on this benchmark.

5 Conclusions

In this article, we propose the first computationally efficient two-step lookahead BayesOpt algorithm. The algorithm comes in both sequential and batch forms, and reduces computation time compared with previous proposals while improving performance. In experiments, we find that two-step lookahead provides additional value compared to several one-step lookahead heuristics.

² Lam et al. [2016] reports that GLASSES achieves gap=1 on Griewank because it arbitrarily evaluates at the origin, which happens to be a global minimizer. Following Lam et al. [2016], we exclude these results.

References

S. Asmussen and P. W. Glynn. Stochastic simulation: Algorithms and analysis, volume 57. Springer Science & Business Media, 2007.

D. Bingham. Optimization test problems. http://www.sfu.ca/~ssurjano/optimization.html, 2015.

E. Brochu, V. M. Cora, and N. De Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.

K. Eggensperger, M. Feurer, F. Hutter, J. Bergstra, J. Snoek, H. Hoos, and K. Leyton-Brown. Towards an empirical foundation for assessing Bayesian optimization of hyperparameters. In NIPS Workshop on Bayesian Optimization in Theory and Practice, volume 10, page 3, 2013.

D. Foreman-Mackey, D. W. Hogg, D. Lang, and J. Goodman. emcee: The MCMC hammer. Publications of the Astronomical Society of the Pacific, 125(925):306, 2013.

A. Forrester, A. Sobester, and A. Keane. Engineering design via surrogate modelling: A practical guide. John Wiley & Sons, 2008.

P. I. Frazier. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.

D. Ginsbourger and R. Le Riche.
Towards Gaussian process-based optimization with finite time horizon. In mODa 9 – Advances in Model-Oriented Design and Analysis, pages 89–96, 2010.

D. Ginsbourger, R. Le Riche, and L. Carraro. Kriging is well-suited to parallelize optimization. In Computational Intelligence in Expensive Optimization Problems, pages 131–162. Springer, 2010.

J. González, M. Osborne, and N. Lawrence. GLASSES: Relieving the myopia of Bayesian optimisation. In Artificial Intelligence and Statistics, pages 790–799, 2016.

P. Heidelberger, X.-R. Cao, M. A. Zazanis, and R. Suri. Convergence properties of infinitesimal perturbation analysis estimates. Management Science, 34(11):1281–1302, 1988.

J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems, pages 918–926, 2014.

L. J. Hong and B. L. Nelson. Discrete optimization via simulation using COMPASS. Operations Research, 54(1):115–129, 2006.

D. Huang, T. T. Allen, W. I. Notz, and N. Zeng. Global optimization of stochastic black-box systems via sequential kriging meta-models. Journal of Global Optimization, 34(3):441–466, 2006.

D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.

H. Kushner and G. G. Yin. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003.

H. J. Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Fluids Engineering, 86(1):97–106, 1964.

R. Lam. Personal communication, 2018.

R. Lam, K. Willcox, and D. H. Wolpert. Bayesian optimization with a finite budget: An approximate dynamic programming approach.
In Advances in Neural Information Processing Systems, pages 883–891, 2016.

P. L'Ecuyer. A unified view of the IPA, SF, and LR gradient estimation techniques. Management Science, 36(11):1364–1383, 1990.

Q. Liu and D. A. Pierce. A note on Gauss–Hermite quadrature. Biometrika, 81(3):624–629, 1994.

P. Milgrom and I. Segal. Envelope theorems for arbitrary choice sets. Econometrica, 70(2):583–601, 2002.

M. A. Osborne, R. Garnett, and S. J. Roberts. Gaussian processes for global optimization. In 3rd International Conference on Learning and Intelligent Optimization (LION3), pages 1–15. Citeseer, 2009.

M. Poloczek, J. Wang, and P. I. Frazier. Multi-information source optimization. In Advances in Neural Information Processing Systems, 2017. arXiv preprint arXiv:1603.00389.

W. B. Powell. Approximate Dynamic Programming: Solving the curses of dimensionality, volume 703. John Wiley & Sons, 2007.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. ISBN 0-262-18253-X.

S. P. Smith. Differentiation of the Cholesky algorithm. Journal of Computational and Graphical Statistics, 4(2):134–147, 1995.

J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.

N. Srinivas, A. Krause, M. Seeger, and S. M. Kakade. Gaussian process optimization in the bandit setting: No regret and experimental design. In ICML, pages 1015–1022, 2010.

B. S. Thomson, J. B. Bruckner, and A. M. Bruckner. Elementary real analysis. ClassicalRealAnalysis.com, 2008.

J. Wang, S. C. Clark, E. Liu, and P. I. Frazier. Parallel Bayesian global optimization of expensive functions. arXiv preprint arXiv:1602.05149, to appear in Operations Research, 2016.

Z. Wang and S. Jegelka.
Max-value entropy search for efficient Bayesian optimization. In Proceedings of the 34th International Conference on Machine Learning – Volume 70, pages 3627–3635. JMLR.org, 2017.

J. Wilson, F. Hutter, and M. Deisenroth. Maximizing acquisition functions for Bayesian optimization. In Advances in Neural Information Processing Systems, pages 9884–9895, 2018.

J. Wu and P. Frazier. The parallel knowledge gradient method for batch Bayesian optimization. In Advances in Neural Information Processing Systems, pages 3126–3134, 2016.

J. Wu, M. Poloczek, A. G. Wilson, and P. I. Frazier. Bayesian optimization with gradients. In Advances in Neural Information Processing Systems, 2017. arXiv preprint arXiv:1703.04389.