{"title": "Bayesian Optimization with Gradients", "book": "Advances in Neural Information Processing Systems", "page_first": 5267, "page_last": 5278, "abstract": "Bayesian optimization has shown success in global optimization of expensive-to-evaluate multimodal objective functions. However, unlike most optimization methods, Bayesian optimization typically does not use derivative information. In this paper we show how Bayesian optimization can exploit derivative information to find good solutions with fewer objective function evaluations. In particular, we develop a novel Bayesian optimization algorithm, the derivative-enabled knowledge-gradient (dKG), which is one-step Bayes-optimal, asymptotically consistent, and provides greater one-step value of information than in the derivative-free setting. dKG accommodates noisy and incomplete derivative information, comes in both sequential and batch forms, and can optionally reduce the computational cost of inference through automatically selected retention of a single directional derivative. We also compute the dKG acquisition function and its gradient using a novel fast discretization-free technique. We show dKG provides state-of-the-art performance compared to a wide range of optimization procedures with and without gradients, on benchmarks including logistic regression, deep learning, kernel learning, and k-nearest neighbors.", "full_text": "Bayesian Optimization with Gradients\n\nJian Wu 1 Matthias Poloczek 2 Andrew Gordon Wilson 1\n1 Cornell University, 2 University of Arizona\n\nPeter I. Frazier 1\n\nAbstract\n\nBayesian optimization has been successful at global optimization of expensive-\nto-evaluate multimodal objective functions. However, unlike most optimization\nmethods, Bayesian optimization typically does not use derivative information. In\nthis paper we show how Bayesian optimization can exploit derivative information to\n\ufb01nd good solutions with fewer objective function evaluations. 
In particular, we develop a novel Bayesian optimization algorithm, the derivative-enabled knowledge-gradient (d-KG), which is one-step Bayes-optimal, asymptotically consistent, and provides greater one-step value of information than in the derivative-free setting. d-KG accommodates noisy and incomplete derivative information, comes in both sequential and batch forms, and can optionally reduce the computational cost of inference through automatically selected retention of a single directional derivative. We also compute the d-KG acquisition function and its gradient using a novel fast discretization-free technique. We show d-KG provides state-of-the-art performance compared to a wide range of optimization procedures with and without gradients, on benchmarks including logistic regression, deep learning, kernel learning, and k-nearest neighbors.

1 Introduction

Bayesian optimization [3, 17] is able to find global optima with a remarkably small number of potentially noisy objective function evaluations. Bayesian optimization has thus been particularly successful for automatic hyperparameter tuning of machine learning algorithms [10, 11, 35, 38], where objectives can be extremely expensive to evaluate, noisy, and multimodal.

Bayesian optimization supposes that the objective function (e.g., the predictive performance with respect to some hyperparameters) is drawn from a prior distribution over functions, typically a Gaussian process (GP), maintaining a posterior as we observe the objective in new places. Acquisition functions, such as expected improvement [15, 17, 28], upper confidence bound [37], predictive entropy search [14] or the knowledge gradient [32], determine a balance between exploration and exploitation, to decide where to query the objective next.
By choosing points with the largest acquisition function values, one seeks to identify a global optimum using as few objective function evaluations as possible.

Bayesian optimization procedures do not generally leverage derivative information, beyond a few exceptions described in Sect. 2. By contrast, other types of continuous optimization methods [36] use gradient information extensively. The broader use of gradients for optimization suggests that gradients should also be quite useful in Bayesian optimization: (1) Gradients inform us about the objective's relative value as a function of location, which is well-aligned with optimization. (2) In d-dimensional problems, gradients provide d distinct pieces of information about the objective's relative value in each direction, constituting d + 1 values per query together with the objective value itself. This advantage is particularly significant for high-dimensional problems. (3) Derivative information is available in many applications at little additional cost. Recent work [e.g., 23] makes gradient information available for hyperparameter tuning. Moreover, in the optimization of engineering systems modeled by partial differential equations, which pre-dates most hyperparameter tuning applications [8], adjoint methods provide gradients cheaply [16, 29]. And even when derivative information is not readily available, we can compute approximate derivatives in parallel through finite differences.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

In this paper, we explore the "what, when, and why" of Bayesian optimization with derivative information. We also develop a Bayesian optimization algorithm that effectively leverages gradients in hyperparameter tuning to outperform the state of the art.
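To make the finite-difference remark above concrete, here is a minimal sketch of estimating an approximate gradient via central differences, with the 2d perturbed evaluations dispatched in parallel. The function name `fd_gradient`, the step size `h`, and the thread-pool parallelism are our own illustrative choices, not part of the paper:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def fd_gradient(f, x, h=1e-6, max_workers=None):
    """Central-difference approximation to the gradient of f at x.

    The 2*d perturbed evaluations are mutually independent, so they can
    run in parallel: approximate derivatives come at little extra
    wall-clock cost beyond evaluating the objective itself.
    """
    x = np.asarray(x, dtype=float)
    d = x.size
    points = []
    for i in range(d):
        e = np.zeros(d)
        e[i] = h
        points.append(x + e)   # forward perturbation along coordinate i
        points.append(x - e)   # backward perturbation along coordinate i
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        vals = list(pool.map(f, points))
    return np.array([(vals[2 * i] - vals[2 * i + 1]) / (2.0 * h)
                     for i in range(d)])
```

For a smooth objective the central difference is accurate to O(h^2), which is ample for the noiseless toy case; with noisy objectives the step size would need to be chosen with the noise level in mind.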
This algorithm accommodates incomplete and noisy gradient observations, can be used in both the sequential and batch settings, and can optionally reduce the computational overhead of inference by selecting the single most valuable directional derivative to retain. For this purpose, we develop a new acquisition function, called the derivative-enabled knowledge-gradient (d-KG). d-KG generalizes the previously proposed batch knowledge gradient method of Wu and Frazier [44] to the derivative setting, and replaces its approximate discretization-based method for calculating the knowledge-gradient acquisition function by a novel faster exact discretization-free method. We note that this discretization-free method is also of interest beyond the derivative setting, as it can be used to improve knowledge-gradient methods for other problem settings. We also provide a theoretical analysis of the d-KG algorithm, showing (1) that it is one-step Bayes-optimal by construction when derivatives are available; (2) that it provides one-step value greater than in the derivative-free setting, under mild conditions; and (3) that its estimator of the global optimum is asymptotically consistent.

In numerical experiments we compare with state-of-the-art batch Bayesian optimization algorithms with and without derivative information, and the gradient-based optimizer BFGS with full gradients.

We assume familiarity with GPs and Bayesian optimization, for which we recommend Rasmussen and Williams [31] and Shahriari et al. [34] as a review. In Section 2 we begin by describing related work. In Sect. 3 we describe our Bayesian optimization algorithm exploiting derivative information. In Sect. 4 we compare the performance of our algorithm with several competing methods on a collection of synthetic and real problems.

The code for this paper is available at https://github.com/wujian16/Cornell-MOE.

2 Related Work

Osborne et al. 
[26] propose fully Bayesian optimization procedures that use derivative observations to improve the conditioning of the GP covariance matrix. Samples taken near previously observed points use only the derivative information to update the covariance matrix. Unlike our current work, derivative information does not affect the acquisition function. We directly compare with Osborne et al. [26] within the KNN benchmark in Sect. 4.2.

Lizotte [22, Sect. 4.2.1 and Sect. 5.2.4] incorporates derivatives into Bayesian optimization, modeling the derivatives of a GP as in Rasmussen and Williams [31, Sect. 9.4]. Lizotte [22] shows that Bayesian optimization with the expected improvement (EI) acquisition function and complete gradient information at each sample can outperform BFGS. Our approach has six key differences: (i) we allow for noisy and incomplete derivative information; (ii) we develop a novel acquisition function that outperforms EI with derivatives; (iii) we enable batch evaluations; (iv) we implement and compare batch Bayesian optimization with derivatives across several acquisition functions, on benchmarks and new applications such as kernel learning, logistic regression, deep learning and k-nearest neighbors, further revealing empirically where gradient information will be most valuable; (v) we provide a theoretical analysis of Bayesian optimization with derivatives; (vi) we develop a scalable implementation.

Very recently, Koistinen et al. [19] use GPs with derivative observations for minimum energy path calculations of atomic rearrangements, and Ahmed et al. [1] study expected improvement with gradient observations. In Ahmed et al. [1], a randomly selected directional derivative is retained in each iteration for computational reasons, which is similar to our approach of retaining a single directional derivative, though it differs in its random selection, in contrast with our value-of-information-based selection.
Our approach is complementary to these works.

For batch Bayesian optimization, several recent algorithms have been proposed that choose a set of points to evaluate in each iteration [5, 6, 12, 18, 24, 33, 35, 39]. Within this area, our approach to handling batch observations is most closely related to the batch knowledge gradient (KG) of Wu and Frazier [44]. We generalize this approach to the derivative setting, and provide a novel exact method for computing the knowledge-gradient acquisition function that avoids the discretization used in Wu and Frazier [44]. This generalization improves speed and accuracy, and is also applicable to other knowledge gradient methods in continuous search spaces.

Recent advances improving both access to derivatives and computational tractability of GPs make Bayesian optimization with gradients increasingly practical and timely for discussion.

3 Knowledge Gradient with Derivatives

Sect. 3.1 reviews a general approach to incorporating derivative information into GPs for Bayesian optimization. Sect. 3.2 introduces a novel acquisition function d-KG, based on the knowledge gradient approach, which utilizes derivative information. Sect. 3.3 computes this acquisition function and its gradient efficiently using a novel fast discretization-free approach. Sect. 3.4 shows that this algorithm provides greater value of information than in the derivative-free setting, is one-step Bayes-optimal, and is asymptotically consistent when used over a discretized feasible space.

3.1 Derivative Information

Given an expensive-to-evaluate function f, we wish to find argmin_{x∈A} f(x), where A ⊂ R^d is the domain of optimization. We place a GP prior over f : A → R, which is specified by its mean function µ : A → R and kernel function K : A × A → R.
We first suppose that for each sample of x we observe the function value and all d partial derivatives, possibly with independent normally distributed noise, and then later discuss relaxation to observing only a single directional derivative. Since the gradient is a linear operator, the gradient of a GP is also a GP (see also Sect. 9.4 in Rasmussen and Williams [31]), and the function and its gradient follow a multi-output GP with mean function µ̃ and kernel function K̃ defined below:

    µ̃(x) = (µ(x), ∇µ(x))^T,    K̃(x, x') = [[K(x, x'), J(x, x')], [J(x', x)^T, H(x, x')]],      (3.1)

where J(x, x') = (∂K(x, x')/∂x'_1, · · · , ∂K(x, x')/∂x'_d) and H(x, x') is the d × d Hessian of K(x, x').

When evaluating at a point x, we observe the noise-obscured function value y(x) and gradient ∇y(x). Jointly, these observations form a (d + 1)-dimensional vector with conditional distribution

    (y(x), ∇y(x))^T | f(x), ∇f(x) ~ N( (f(x), ∇f(x))^T, diag(σ²(x)) ),      (3.2)

where σ² : A → R^{d+1}_{≥0} gives the variance of the observational noise. If σ² is not known, we may estimate it from data. The posterior distribution is again a GP. We refer to the mean function of this posterior GP after n samples as µ̃^(n)(·) and its kernel function as K̃^(n)(·, ·). Suppose that we have sampled at n points X := {x^(1), x^(2), · · · , x^(n)} and observed (y, ∇y)^(1:n), where each observation consists of the function value and the gradient at x^(i).
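For concreteness, the joint kernel K̃ in (3.1) can be assembled explicitly in one dimension, where J and H have simple closed forms for the squared-exponential kernel. The sketch below is our own illustration (function names and the fixed length-scale are ours), not the paper's implementation:

```python
import numpy as np

def sq_exp(x, xp, ell=1.0):
    """Squared-exponential kernel K(x, x') in one dimension."""
    return np.exp(-(x - xp) ** 2 / (2.0 * ell ** 2))

def joint_block(x, xp, ell=1.0):
    """2x2 block of the multi-output kernel K~ for d = 1: the covariance of
    (f(x), f'(x)) with (f(x'), f'(x')).  For this kernel,
    J(x, x')  = dK/dx'        = K * (x - x') / ell^2,
    H(x, x')  = d2K/(dx dx')  = K * (1/ell^2 - (x - x')^2 / ell^4)."""
    k = sq_exp(x, xp, ell)
    r = x - xp
    return np.array([[k,                 k * r / ell ** 2],
                     [-k * r / ell ** 2, k * (1.0 / ell ** 2 - r ** 2 / ell ** 4)]])

def joint_cov(xs, ell=1.0):
    """Covariance of the stacked vector (f(x_i), f'(x_i)) over all points,
    assembled blockwise from joint_block."""
    n = len(xs)
    C = np.zeros((2 * n, 2 * n))
    for i in range(n):
        for j in range(n):
            C[2 * i:2 * i + 2, 2 * j:2 * j + 2] = joint_block(xs[i], xs[j], ell)
    return C
```

Because (f, f') is a valid multi-output GP, the assembled matrix is symmetric positive semi-definite, and at r = 0 the derivative variance is 1/ell², matching H evaluated at x = x'.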
Then µ̃^(n)(·) and K̃^(n)(·, ·) are given by

    µ̃^(n)(x) = µ̃(x) + K̃(x, X) ( K̃(X, X) + diag{σ²(x^(1)), · · · , σ²(x^(n))} )^{-1} ( (y, ∇y)^(1:n) − µ̃(X) ),
    K̃^(n)(x, x') = K̃(x, x') − K̃(x, X) ( K̃(X, X) + diag{σ²(x^(1)), · · · , σ²(x^(n))} )^{-1} K̃(X, x').      (3.3)

If our observations are incomplete, then we remove the rows and columns in (y, ∇y)^(1:n), µ̃(X), K̃(·, X), K̃(X, X) and K̃(X, ·) of Eq. (3.3) corresponding to partial derivatives (or function values) that were not observed. If we can observe directional derivatives, then we add rows and columns corresponding to these observations, where entries in µ̃(X) and K̃(·, ·) are obtained by noting that a directional derivative is a linear transformation of the gradient.

3.2 The d-KG Acquisition Function

We propose a novel Bayesian optimization algorithm to exploit available derivative information, based on the knowledge gradient approach [9]. We call this algorithm the derivative-enabled knowledge gradient (d-KG).

The algorithm proceeds iteratively, selecting in each iteration a batch of q points in A that has a maximum value of information (VOI). Suppose we have observed n points, and recall from Section 3.1 that µ̃^(n)(x) is the (d + 1)-dimensional vector giving the posterior mean for f(x) and its d partial derivatives at x. Sect. 
3.1 discusses how to remove the assumption that all d + 1 values are provided. The expected value of f(x) under the posterior distribution is µ̃^(n)_1(x). If after n samples we were to make an irrevocable (risk-neutral) decision now about the solution to our overarching optimization problem and receive a loss equal to the value of f at the chosen point, we would choose argmin_{x∈A} µ̃^(n)_1(x) and suffer conditional expected loss min_{x∈A} µ̃^(n)_1(x). Similarly, if we made this decision after n + q samples our conditional expected loss would be min_{x∈A} µ̃^(n+q)_1(x). Therefore, we define the d-KG factor for a given set of q candidate points z^(1:q) as

    d-KG(z^(1:q)) = min_{x∈A} µ̃^(n)_1(x) − E_n[ min_{x∈A} µ̃^(n+q)_1(x) | x^((n+1):(n+q)) = z^(1:q) ],      (3.4)

where E_n[·] is the expectation taken with respect to the posterior distribution after n evaluations, and the distribution of µ̃^(n+q)_1(·) under this posterior marginalizes over the observations (y(z^(1:q)), ∇y(z^(1:q))) = (y(z^(i)), ∇y(z^(i)) : i = 1, . . . , q) upon which it depends. We subsequently refer to Eq. (3.4) as the inner optimization problem.

The d-KG algorithm then seeks to evaluate the batch of points next that maximizes the d-KG factor,

    max_{z^(1:q)⊂A} d-KG(z^(1:q)).      (3.5)

We refer to Eq. (3.5) as the outer optimization problem. d-KG solves the outer optimization problem using the method described in Section 3.3.

The d-KG acquisition function differs from the batch knowledge gradient acquisition function in Wu and Frazier [44] because here the posterior mean µ̃^(n+q)_1(x) at time n + q depends on ∇y(z^(1:q)). This in turn requires calculating the distribution of these gradient observations under the time-n posterior and marginalizing over them. Thus, the d-KG algorithm differs from KG not just in that gradient observations change the posterior, but also in that the prospect of future gradient observations changes the acquisition function. An additional major distinction from Wu and Frazier [44] is that d-KG employs a novel discretization-free method for computing the acquisition function (see Section 3.3).

Fig. 1 illustrates the behavior of d-KG and d-EI on a 1-d example. d-EI generalizes expected improvement (EI) to batch acquisition with derivative information [22]. d-KG clearly chooses a better point to evaluate than d-EI.

Including all d partial derivatives can be computationally prohibitive since GP inference scales as O(n³(d + 1)³). To overcome this challenge while retaining the value of derivative observations, we can include only one directional derivative from each iteration in our inference. d-KG can naturally decide which derivative to include, and can adjust our choice of where to best sample given that we observe more limited information. We define the d-KG acquisition function for observing only the function value and the derivative with direction θ at z^(1:q) as

    d-KG(z^(1:q), θ) = min_{x∈A} µ̃^(n)_1(x) − E_n[ min_{x∈A} µ̃^(n+q)_1(x) | x^((n+1):(n+q)) = z^(1:q); θ ],      (3.6)

where conditioning on θ is here understood to mean that µ̃^(n+q)_1(x) is the conditional mean of f(x) given y(z^(1:q)) and θ^T∇y(z^(1:q)) = (θ^T∇y(z^(i)) : i = 1, . . . , q). 
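For intuition about the quantity inside the expectation, the knowledge-gradient factor can be estimated by plain Monte Carlo when A is replaced by a finite grid, for q = 1 and no derivative observations. The paper's actual method (Sect. 3.3) avoids this discretization, so the sketch below is purely illustrative, and every name in it is our own:

```python
import numpy as np

def kg_factor_mc(mu, Sigma, z, noise_var=1e-2, n_samples=2000, seed=0):
    """Monte Carlo estimate of the knowledge-gradient factor for sampling
    grid index z of a finite (discretized) domain, with q = 1 and no
    derivative observations.

    mu:    current posterior mean over the grid points
    Sigma: current posterior covariance over the grid points

    One noisy observation at z updates the posterior mean to
    mu + sigma_tilde * Z with Z ~ N(0, 1), where sigma_tilde is the
    scaled covariance column below; d-KG additionally conditions this
    fantasy update on (directional) derivative observations.
    """
    rng = np.random.default_rng(seed)
    sigma_tilde = Sigma[:, z] / np.sqrt(Sigma[z, z] + noise_var)
    Z = rng.standard_normal(n_samples)
    # one fantasy posterior mean vector per sample, minimized over the grid
    fantasy_minima = np.min(mu[:, None] + np.outer(sigma_tilde, Z), axis=0)
    return float(np.min(mu) - fantasy_minima.mean())
```

Since the expected minimum of the fantasy means never exceeds the current minimum, the factor is nonnegative up to Monte Carlo error, mirroring the value-of-information interpretation in (3.4).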
The full algorithm is as follows.

Algorithm 1 d-KG with Relevant Directional Derivative Detection
1: for t = 1 to N do
2:    (z^(1:q)*, θ*) = argmax_{z^(1:q), θ} d-KG(z^(1:q), θ)
3:    Augment data with y(z^(1:q)*) and θ*^T ∇y(z^(1:q)*). Update our posterior on (f(x), ∇f(x)).
4: end for
Return x* = argmin_{x∈A} µ̃^(Nq)_1(x)

Figure 1: KG [44] and EI [39] refer to acquisition functions without gradients. d-KG and d-EI refer to the counterparts with gradients. The topmost plots show (1) the posterior surfaces of a function sampled from a one-dimensional GP without and with incorporating observations of the gradients. The posterior variance is smaller if the gradients are incorporated; (2) the utility of sampling each point under the value of information criteria of KG (d-KG) and EI (d-EI) in both settings. If no derivatives are observed, both KG and EI will query a point with high potential gain (i.e. a small expected function value). On the other hand, when gradients are observed, d-KG makes a considerably better sampling decision, whereas d-EI samples essentially the same location as EI. The plots in the bottom row depict the posterior surface after the respective sample. Interestingly, KG benefits more from observing the gradients than EI (the last two plots): d-KG's observation yields accurate knowledge of the optimum's location, while d-EI's observation leaves substantial uncertainty.

3.3 Efficient Exact Computation of d-KG

Calculating and maximizing d-KG is difficult when A is continuous because the term min_{x∈A} µ̃^(n+q)_1(x) in Eq. (3.6) requires optimizing over a continuous domain, and then we must integrate this optimal value through its dependence on y(z^(1:q)) and θ^T∇y(z^(1:q)).
Previous work on the knowledge gradient in continuous domains [30, 32, 44] approaches this computation by taking minima within expectations not over the full domain A but over a discretized finite approximation. This approach supports analytic integration in Scott et al. [32] and Poloczek et al. [30], and a sampling-based scheme in Wu and Frazier [44]. However, the discretization in this approach introduces error and scales poorly with the dimension of A.

Here we propose a novel method for calculating an unbiased estimator of the gradient of d-KG which we then use within stochastic gradient ascent to maximize d-KG. This method avoids discretization, and thus is exact. It also improves speed significantly over a discretization-based scheme.

In Section A of the supplement we show that the d-KG factor can be expressed as

    d-KG(z^(1:q), θ) = E_n[ min_{x∈A} µ̂^(n)_1(x) − min_{x∈A} ( µ̂^(n)_1(x) + σ̂^(n)_1(x, θ, z^(1:q)) W ) ],      (3.7)

where µ̂^(n) is the mean function of (f(x), θ^T∇f(x)) after n evaluations, W is a 2q-dimensional standard normal random column vector, and σ̂^(n)_1(x, θ, z^(1:q)) is the first row of a 2 × 2q dimensional matrix, which is related to the kernel function of (f(x), θ^T∇f(x)) after n evaluations, with an exact form specified in (A.2) of the supplement.

Under sufficient regularity conditions [21], one can interchange the gradient and expectation operators,

    ∇d-KG(z^(1:q), θ) = −E_n[ ∇ min_{x∈A} ( µ̂^(n)_1(x) + σ̂^(n)_1(x, θ, z^(1:q)) W ) ],

where here the gradient is with respect to z^(1:q) and θ. If (x, z^(1:q), θ) ↦ µ̂^(n)_1(x) + σ̂^(n)_1(x, θ, z^(1:q)) W is continuously differentiable and A is compact, the envelope theorem [25] implies

    ∇d-KG(z^(1:q), θ) = −E_n[ ∇ ( µ̂^(n)_1(x*(W)) + σ̂^(n)_1(x*(W), θ, z^(1:q)) W ) ],      (3.8)

where x*(W) ∈ argmin_{x∈A} ( µ̂^(n)_1(x) + σ̂^(n)_1(x, θ, z^(1:q)) W ). To find x*(W), one can utilize a multi-start gradient descent method since the gradient is analytically available for the objective µ̂^(n)_1(x) + σ̂^(n)_1(x, θ, z^(1:q)) W. Practically, we find that the learning rate l^inner_t = 0.03/t^0.7 is robust for finding x*(W).

The expression (3.8) implies that ∇( µ̂^(n)_1(x*(W)) + σ̂^(n)_1(x*(W), θ, z^(1:q)) W ) is an unbiased estimator of ∇d-KG(z^(1:q), θ), when the regularity conditions it assumes hold. We can use this unbiased gradient estimator within stochastic gradient ascent [13], optionally with multiple starts, to solve the outer optimization problem argmax_{z^(1:q), θ} d-KG(z^(1:q), θ), and can use a similar approach when observing full gradients to solve (3.5). For the outer optimization problem, we find that the learning rate l^outer_t = 10 l^inner_t performs well over all the benchmarks we tested.

Bayesian Treatment of Hyperparameters. We adopt a fully Bayesian treatment of hyperparameters similar to Snoek et al. [35].
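The sample-then-differentiate pattern behind (3.8) can be checked on a toy problem where the inner minimizer is available in closed form. With the toy inner objective h(x, z, W) = (x − W)² + zx (entirely our own construction, not the d-KG objective), the minimizer is x*(W) = W − z/2, so E_W[min_x h] = −z²/4 has gradient −z/2, while ∂h/∂z held at x = x*(W) equals x*(W); averaging that per-sample derivative recovers the true gradient. In d-KG the inner minimizer is instead found by multi-start gradient descent:

```python
import numpy as np

def x_star(z, W):
    # closed-form inner minimizer of the toy h(x, z, W) = (x - W)**2 + z * x
    return W - z / 2.0

def envelope_grad_estimate(z, n_samples=200_000, seed=1):
    """Unbiased estimate of d/dz E_W[min_x h(x, z, W)] via the envelope
    theorem: differentiate h in z while holding x at the per-sample
    minimizer x*(W).  Since dh/dz = x here, the estimator is simply the
    average of x*(W) over W ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal(n_samples)
    return float(np.mean(x_star(z, W)))
```

At z = 1 the true gradient is −1/2, and the Monte Carlo average concentrates around that value at the usual 1/sqrt(n_samples) rate.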
We draw M samples of hyperparameters φ^(i) for 1 ≤ i ≤ M via the emcee package [7] and average our acquisition function across them to obtain

    d-KG^Integrated(z^(1:q), θ) = (1/M) Σ_{i=1}^{M} d-KG(z^(1:q), θ; φ^(i)),      (3.9)

where the additional argument φ^(i) in d-KG indicates that the computation is performed conditioning on hyperparameters φ^(i). In our experiments, we found this method to be computationally efficient and robust, although a more principled treatment of unknown hyperparameters within the knowledge gradient framework would instead marginalize over them when computing µ̃^(n+q)(x) and µ̃^(n).

3.4 Theoretical Analysis

Here we present three theoretical results giving insight into the properties of d-KG, with proofs in the supplementary material. For the sake of simplicity, we suppose all partial derivatives are provided to d-KG. Similar results hold for d-KG with relevant directional derivative detection. We begin by stating that the value of information (VOI) obtained by d-KG exceeds the VOI that can be achieved in the derivative-free setting.

Proposition 1. Given identical posteriors µ̃^(n),

    d-KG(z^(1:q)) ≥ KG(z^(1:q)),

where KG is the batch knowledge gradient acquisition function without gradients proposed by Wu and Frazier [44]. This inequality is strict under mild conditions (see Sect. B in the supplement).

Next, we show that d-KG is one-step Bayes-optimal by construction.

Proposition 2. If only one iteration is left and we can observe both function values and partial derivatives, then d-KG is Bayes-optimal among all feasible policies.

As a complement to the one-step optimality, we show that d-KG is asymptotically consistent if the feasible set A is finite. Asymptotic consistency means that d-KG will choose the correct solution when the number of samples goes to infinity.

Theorem 1. 
If the function f(x) is sampled from a GP with known hyperparameters, the d-KG algorithm is asymptotically consistent, i.e.

    lim_{N→∞} f(x*(d-KG, N)) = min_{x∈A} f(x)

almost surely, where x*(d-KG, N) is the point recommended by d-KG after N iterations.

4 Experiments

We evaluate the performance of the proposed algorithm d-KG with relevant directional derivative detection (Algorithm 1) on six standard synthetic benchmarks (see Fig. 2). Moreover, we examine its ability to tune the hyperparameters for the weighted k-nearest neighbor metric, logistic regression, deep learning, and for a spectral mixture kernel (see Fig. 3).

We provide an easy-to-use Python package with the core written in C++, available at https://github.com/wujian16/Cornell-MOE.

We compare d-KG to several state-of-the-art methods: (1) The batch expected improvement method (EI) of Wang et al. [39] that does not utilize derivative information, and an extension of EI that incorporates derivative information, denoted d-EI. d-EI is similar to Lizotte [22] but handles incomplete gradients and supports batches. (2) The batch GP-UCB-PE method of Contal et al. [5] that does not utilize derivative information, and an extension that does. (3) The batch knowledge gradient algorithm without derivative information (KG) of Wu and Frazier [44]. Moreover, we generalize the method of Osborne et al. [26] to batches and evaluate it on the KNN benchmark. All of the above algorithms allow incomplete gradient observations. In benchmarks that provide the full gradient, we additionally compare to the gradient-based method L-BFGS-B provided in scipy. We suppose that the objective function f is drawn from a Gaussian process GP(µ, Σ), where µ is a constant mean function and Σ is the squared exponential kernel.
We sample M = 10 sets of hyperparameters by the emcee package [7].

Recall that the immediate regret is defined as the loss with respect to a global optimum. The plots for synthetic benchmark functions, shown in Fig. 2, report the log10 of the immediate regret of the solution that each algorithm would pick as a function of the number of function evaluations. Plots for other experiments report the objective value of the solution instead of the immediate regret. Error bars give the mean value plus and minus one standard deviation. The number of replications is stated in each benchmark's description.

4.1 Synthetic Test Functions

We evaluate all methods on six test functions chosen from Bingham [2]. To demonstrate the ability to benefit from noisy derivative information, we sample additive normally distributed noise with zero mean and standard deviation σ = 0.5 for both the objective function and its partial derivatives. σ is unknown to the algorithms and must be estimated from observations. We also investigate how incomplete gradient observations affect algorithm performance. We experiment with two different batch sizes: we use a batch size q = 4 for the Branin, Rosenbrock, and Ackley functions; otherwise, we use a batch size q = 8. Fig. 2 summarizes the experimental results.

Functions with Full Gradient Information. For the 2d Branin function on the domain [−5, 15] × [0, 15], the 5d Ackley function on [−2, 2]^5, and the 6d Hartmann function on [0, 1]^6, we assume that the full gradient is available. Looking at the results for the Branin function in Fig. 2, d-KG outperforms its competitors after 40 function evaluations and obtains the best solution overall (within the limit of function evaluations). BFGS makes faster progress than the Bayesian optimization methods during the first 20 evaluations, but subsequently stalls and fails to obtain a competitive solution.
On the Ackley function d-EI makes fast progress during the first 50 evaluations but also fails to make subsequent progress. Conversely, d-KG requires about 50 evaluations to improve on the performance of d-EI, after which d-KG achieves the best overall performance again. For the Hartmann function d-KG clearly dominates its competitors over all function evaluations.

Functions with Incomplete Derivative Information. For the 3d Rosenbrock function on [−2, 2]^3 we only provide a noisy observation of the third partial derivative. Both EI and d-EI get stuck early. d-KG, on the other hand, finds a near-optimal solution after ~50 function evaluations; KG, without derivatives, catches up after ~75 evaluations and performs comparably afterwards. The 4d Levy benchmark on [−10, 10]^4, where the fourth partial derivative is observable with noise, shows a different ordering of the algorithms: EI has the best performance, beating even its formulation that uses derivative information. One explanation could be that the smoothness and regular shape of the function surface benefit this acquisition criterion. For the 8d Cosine mixture function on [−1, 1]^8 we provide two noisy partial derivatives. d-KG and UCB with derivatives perform better than the EI-type criteria, and achieve the best performance, with d-KG beating UCB with derivatives slightly. In general, we see that d-KG successfully exploits noisy derivative information and has the best overall performance.

4.2 Real-World Test Functions

Weighted k-Nearest Neighbor. Suppose a cab company wishes to predict the duration of trips. Clearly, the duration not only depends on the endpoints of the trip, but also on the day and time.

Figure 2: The average performance of 100 replications (the log10 of the immediate regret vs. the number of function evaluations). d-KG performs significantly better than its competitors for all benchmarks except the Levy function. 
In Branin and Hartmann, we also plot black lines showing the performance of BFGS.

In this benchmark we tune a weighted k-nearest neighbor (KNN) metric to optimize predictions of these durations, based on historical data. A trip is described by the pick-up time t, the pick-up location (p1, p2), and the drop-off point (d1, d2). Then the estimate of the duration is obtained as a weighted average over all trips D_{m,t} in our database that happened in the time interval t ± m minutes, where m is a tunable hyperparameter:

    Prediction(t, p1, p2, d1, d2) = ( Σ_{i∈D_{m,t}} duration_i × weight(i) ) / ( Σ_{i∈D_{m,t}} weight(i) ).

The weight of trip i ∈ D_{m,t} in this prediction is given by

    weight(i) = ( (t − t^i)²/l1² + (p1 − p1^i)²/l2² + (p2 − p2^i)²/l3² + (d1 − d1^i)²/l4² + (d2 − d2^i)²/l5² )^{−1},

where (t^i, p1^i, p2^i, d1^i, d2^i) are the respective parameter values for trip i, and (l1, l2, l3, l4, l5) are tunable hyperparameters. Thus, we have 6 hyperparameters to tune: (m, l1, l2, l3, l4, l5). We choose m in [30, 200], l1² in [10^1, 10^8], and l2², . . . , l5² each in [10^{−8}, 10^{−1}].

We use the yellow cab NYC public data set from June 2016, sampling 10000 records from June 1–25 as training data and 1000 trip records from June 26–30 as validation data. Our test criterion is the root mean squared error (RMSE), for which we compute the partial derivatives on the validation dataset with respect to the hyperparameters (l1, l2, l3, l4, l5), while the hyperparameter m is not differentiable. In Fig. 3 we see that d-KG overtakes the alternatives, and that the UCB and KG acquisition functions also benefit from exploiting derivative information.

Kernel Learning. Spectral mixture kernels [40] can be used for flexible kernel learning to enable long-range extrapolation.
These kernels are obtained by modeling a spectral density by a mixture of Gaussians. While any stationary kernel can be described by a spectral mixture kernel with a particular setting of its hyperparameters, initializing and learning these parameters can be difficult. Although we have access to an analytic closed form of the (marginal likelihood) objective, this function is (i) expensive to evaluate and (ii) highly multimodal. Moreover, (iii) derivative information is available. Thus, learning flexible kernel functions is a perfect candidate for our approach.

The task is to train a 2-component spectral mixture kernel on an airline data set [40]. We must determine the mixture weights, means, and variances for each of the two Gaussians. Fig. 3 summarizes performance for batch size q = 8. BFGS is sensitive to its initialization and human intervention and is often trapped in local optima. d-KG, on the other hand, more consistently finds a good solution, and obtains the best solution of all algorithms (within the step limit). Overall, we observe that gradient information is highly valuable in performing this kernel learning task.

Logistic Regression and Deep Learning. We tune logistic regression and a feedforward neural network with 2 hidden layers on the MNIST dataset [20], a standard classification task for handwritten digits. The training set contains 60000 images, the test set 10000. We tune 4 hyperparameters for logistic regression: the l2 regularization parameter from 0 to 1, the learning rate from 0 to 1, the mini-batch size from 20 to 2000, and the number of training epochs from 5 to 50. The first derivatives with respect to the first two parameters can be obtained via the technique of Maclaurin et al. [23]. For the neural network, we additionally tune the number of hidden units in [50, 500].

Fig. 3 reports the mean and standard deviation of the mean cross-entropy loss (or its log scale) on the test set over 20 replications.
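Returning to the weighted KNN benchmark above, its prediction rule is straightforward to sketch. The following minimal pure-Python version uses an illustrative trip representation (a feature tuple plus a duration) and hypothetical function names; it is a sketch of the stated formula, not the paper's implementation.

```python
def trip_weight(query, trip_features, l):
    """Inverse squared weighted distance between the query
    (t, p1, p2, d1, d2) and one historical trip, with length
    scales l = (l1, ..., l5), matching weight(i) in the text."""
    dist2 = sum((q - f) ** 2 / li ** 2
                for q, f, li in zip(query, trip_features, l))
    return 1.0 / dist2  # assumes the query never coincides exactly with a trip

def predict_duration(query, trips, m, l):
    """Weighted average of durations over all trips whose pick-up
    time lies within +/- m minutes of the query's pick-up time."""
    t = query[0]
    window = [(f, d) for f, d in trips if abs(f[0] - t) <= m]
    weights = [trip_weight(query, f, l) for f, _ in window]
    return sum(w * d for w, (_, d) in zip(weights, window)) / sum(weights)

# Two nearby historical trips and one far-away trip (filtered out by m).
trips = [((10.0, 0.0, 0.0, 1.0, 1.0), 5.0),
         ((12.0, 0.1, 0.1, 1.1, 1.1), 7.0),
         ((100.0, 5.0, 5.0, 6.0, 6.0), 50.0)]
estimate = predict_duration((11.0, 0.05, 0.05, 1.05, 1.05),
                            trips, m=5.0, l=(1.0, 1.0, 1.0, 1.0, 1.0))
```

Because m only enters through the discrete time window D_{m,t}, the objective is not differentiable in m, which matches the setup above in which derivatives are supplied only for (l1, ..., l5).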
d-KG outperforms the other approaches, which suggests that derivative information is helpful. Our algorithm proves its value in tuning a deep neural network, which harmonizes with research on computing the gradients of hyperparameters [23, 27].

Figure 3: Results for the weighted KNN benchmark, the spectral mixture kernel benchmark, logistic regression, and the deep neural network (from left to right), all with batch size 8 and averaged over 20 replications.

5 Discussion

Bayesian optimization is successfully applied to low-dimensional problems where we wish to find a good solution with a very small number of objective function evaluations. We considered several such benchmarks, as well as logistic regression, deep learning, kernel learning, and k-nearest neighbor applications. We have shown that in this context derivative information can be extremely useful: we can greatly decrease the number of objective function evaluations, especially when building upon the knowledge gradient acquisition function, even when derivative information is noisy and only available for some variables.

Bayesian optimization is increasingly being used to automate parameter tuning in machine learning, where objective functions can be extremely expensive to evaluate. For example, the parameters to learn through Bayesian optimization could even be the hyperparameters of a deep neural network. We expect derivative information with Bayesian optimization to help enable such promising applications, moving us towards fully automatic and principled approaches to statistical machine learning.

In the future, one could combine derivative information with flexible deep projections [43], and recent advances in scalable Gaussian processes for O(n) training and O(1) test time predictions [41, 42].
These steps would help make Bayesian optimization applicable to a much wider range of problems, wherever standard gradient-based optimizers are used, even when we have analytic objective functions that are not expensive to evaluate, while retaining faster convergence and robustness to multimodality.

Acknowledgments

Wilson was partially supported by NSF IIS-1563887. Frazier, Poloczek, and Wu were partially supported by NSF CAREER CMMI-1254298, NSF CMMI-1536895, NSF IIS-1247696, AFOSR FA9550-12-1-0200, AFOSR FA9550-15-1-0038, and AFOSR FA9550-16-1-0046.

References

[1] M. O. Ahmed, B. Shahriari, and M. Schmidt. Do we need "harmless" Bayesian optimization and "first-order" Bayesian optimization? In NIPS BayesOpt, 2016.

[2] D. Bingham. Optimization test problems. http://www.sfu.ca/~ssurjano/optimization.html, 2015.

[3] E. Brochu, V. M. Cora, and N. De Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.

[4] NYC Taxi & Limousine Commission. NYC Trip Record Data. http://www.nyc.gov/html/tlc/, June 2016. Last accessed on 2016-10-10.

[5] E. Contal, D. Buffoni, A. Robicquet, and N. Vayatis. Parallel Gaussian process optimization with upper confidence bound and pure exploration. In Machine Learning and Knowledge Discovery in Databases, pages 225-240. Springer, 2013.

[6] T. Desautels, A. Krause, and J. W. Burdick. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. The Journal of Machine Learning Research, 15(1):3873-3923, 2014.

[7] D. Foreman-Mackey, D. W. Hogg, D. Lang, and J. Goodman. emcee: The MCMC hammer. Publications of the Astronomical Society of the Pacific, 125(925):306, 2013.

[8] A. Forrester, A. Sobester, and A. Keane. Engineering design via surrogate modelling: a practical guide.
John Wiley & Sons, 2008.

[9] P. Frazier, W. Powell, and S. Dayanik. The knowledge-gradient policy for correlated normal beliefs. INFORMS Journal on Computing, 21(4):599-613, 2009.

[10] J. R. Gardner, M. J. Kusner, Z. E. Xu, K. Q. Weinberger, and J. Cunningham. Bayesian optimization with inequality constraints. In International Conference on Machine Learning, pages 937-945, 2014.

[11] M. Gelbart, J. Snoek, and R. Adams. Bayesian optimization with unknown constraints. In International Conference on Machine Learning, pages 250-259, Corvallis, Oregon, 2014.

[12] J. Gonzalez, Z. Dai, P. Hennig, and N. Lawrence. Batch Bayesian optimization via local penalization. In AISTATS, pages 648-657, 2016.

[13] H. J. Kushner and G. G. Yin. Stochastic approximation and recursive algorithms and applications. Springer, 2003.

[14] J. M. Hernandez-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems, pages 918-926, 2014.

[15] D. Huang, T. T. Allen, W. I. Notz, and N. Zeng. Global optimization of stochastic black-box systems via sequential kriging meta-models. Journal of Global Optimization, 34(3):441-466, 2006.

[16] A. Jameson. Re-engineering the design process through computation. Journal of Aircraft, 36(1):36-50, 1999.

[17] D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455-492, 1998.

[18] T. Kathuria, A. Deshpande, and P. Kohli. Batched Gaussian process bandit optimization via determinantal point processes. In Advances in Neural Information Processing Systems, pages 4206-4214, 2016.

[19] O.-P. Koistinen, E. Maras, A. Vehtari, and H. Jonsson. Minimum energy path calculations with Gaussian process regression.
Nanosystems: Physics, Chemistry, Mathematics, 7(6), 2016.

[20] Y. LeCun, C. Cortes, and C. J. Burges. The MNIST database of handwritten digits, 1998.

[21] P. L'Ecuyer. Note: On the interchange of derivative and expectation for likelihood ratio derivative estimators. Management Science, 41(4):738-747, 1995.

[22] D. J. Lizotte. Practical Bayesian optimization. PhD thesis, University of Alberta, 2008.

[23] D. Maclaurin, D. Duvenaud, and R. P. Adams. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pages 2113-2122, 2015.

[24] S. Marmin, C. Chevalier, and D. Ginsbourger. Efficient batch-sequential Bayesian optimization with moments of truncated Gaussian vectors. arXiv preprint arXiv:1609.02700, 2016.

[25] P. Milgrom and I. Segal. Envelope theorems for arbitrary choice sets. Econometrica, 70(2):583-601, 2002.

[26] M. A. Osborne, R. Garnett, and S. J. Roberts. Gaussian processes for global optimization. In 3rd International Conference on Learning and Intelligent Optimization (LION3), pages 1-15. Citeseer, 2009.

[27] F. Pedregosa. Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning, pages 737-746, 2016.

[28] V. Picheny, D. Ginsbourger, Y. Richet, and G. Caplin. Quantile-based optimization of noisy computer experiments with tunable precision. Technometrics, 55(1):2-13, 2013.

[29] R.-E. Plessix. A review of the adjoint-state method for computing the gradient of a functional with geophysical applications. Geophysical Journal International, 167(2):495-503, 2006.

[30] M. Poloczek, J. Wang, and P. I. Frazier. Multi-information source optimization. In Advances in Neural Information Processing Systems, 2017. Accepted for publication. arXiv preprint arXiv:1603.00389.

[31] C. E. Rasmussen and C. K. I. Williams.
Gaussian Processes for Machine Learning. MIT Press, 2006. ISBN 0-262-18253-X.

[32] W. Scott, P. Frazier, and W. Powell. The correlated knowledge gradient for simulation optimization of continuous parameters using Gaussian process regression. SIAM Journal on Optimization, 21(3):996-1026, 2011.

[33] A. Shah and Z. Ghahramani. Parallel predictive entropy search for batch global optimization of expensive objective functions. In Advances in Neural Information Processing Systems, pages 3312-3320, 2015.

[34] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148-175, 2016.

[35] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951-2959, 2012.

[36] J. Snyman. Practical mathematical optimization: an introduction to basic optimization theory and classical and new gradient-based algorithms, volume 97. Springer Science & Business Media, 2005.

[37] N. Srinivas, A. Krause, M. Seeger, and S. M. Kakade. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning, pages 1015-1022, 2010.

[38] K. Swersky, J. Snoek, and R. P. Adams. Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems, pages 2004-2012, 2013.

[39] J. Wang, S. C. Clark, E. Liu, and P. I. Frazier. Parallel Bayesian global optimization of expensive functions. arXiv preprint arXiv:1602.05149, 2016.

[40] A. G. Wilson and R. P. Adams. Gaussian process kernels for pattern discovery and extrapolation. In International Conference on Machine Learning, pages 1067-1075, 2013.

[41] A. G. Wilson and H. Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP).
In International Conference on Machine Learning, pages 1775-1784, 2015.

[42] A. G. Wilson, C. Dann, and H. Nickisch. Thoughts on massively scalable Gaussian processes. arXiv preprint arXiv:1511.01870, 2015.

[43] A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing. Deep kernel learning. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 370-378, 2016.

[44] J. Wu and P. Frazier. The parallel knowledge gradient method for batch Bayesian optimization. In Advances in Neural Information Processing Systems, pages 3126-3134, 2016.