{"title": "The Parallel Knowledge Gradient Method for Batch Bayesian Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 3126, "page_last": 3134, "abstract": "In many applications of black-box optimization, one can evaluate multiple points simultaneously, e.g. when evaluating the performances of several different neural network architectures in a parallel computing environment. In this paper, we develop a novel batch Bayesian optimization algorithm --- the parallel knowledge gradient method. By construction, this method provides the one-step Bayes optimal batch of points to sample. We provide an efficient strategy for computing this Bayes-optimal batch of points, and we demonstrate that the parallel knowledge gradient method finds global optima significantly faster than previous batch Bayesian optimization algorithms on both synthetic test functions and when tuning hyperparameters of practical machine learning algorithms, especially when function evaluations are noisy.", "full_text": "The Parallel Knowledge Gradient Method\n\nfor Batch Bayesian Optimization\n\nJian Wu, Peter I. Frazier\n\nCornell University\nIthaca, NY, 14853\n\n{jw926, pf98}@cornell.edu\n\nAbstract\n\nIn many applications of black-box optimization, one can evaluate multiple points\nsimultaneously, e.g. when evaluating the performances of several different neural\nnetworks in a parallel computing environment. In this paper, we develop a novel\nbatch Bayesian optimization algorithm \u2014 the parallel knowledge gradient method.\nBy construction, this method provides the one-step Bayes optimal batch of points\nto sample. 
We provide an efficient strategy for computing this Bayes-optimal batch of points, and we demonstrate that the parallel knowledge gradient method finds global optima significantly faster than previous batch Bayesian optimization algorithms on both synthetic test functions and when tuning hyperparameters of practical machine learning algorithms, especially when function evaluations are noisy.\n\n1 Introduction\n\nIn Bayesian optimization [19] (BO), we wish to optimize a derivative-free expensive-to-evaluate function f with feasible domain A \u2286 Rd,\n\nminx\u2208A f(x),\n\nwith as few function evaluations as possible. In this paper, we assume that membership in the domain A is easy to evaluate and we can evaluate f only at points in A. We assume that evaluations of f are either noise-free, or have additive independent normally distributed noise. We consider the parallel setting, in which we perform more than one simultaneous evaluation of f.\nBO typically puts a Gaussian process prior distribution on the function f, updating this prior distribution with each new observation of f, and choosing the next point or points to evaluate by maximizing an acquisition function that quantifies the benefit of evaluating the objective as a function of where it is evaluated. In comparison with other global optimization algorithms, BO often finds \u201cnear optimal\u201d function values with fewer evaluations [19]. As a consequence, BO is useful when function evaluation is time-consuming, such as when training and testing complex machine learning algorithms (e.g. deep neural networks) or tuning algorithms on large-scale datasets (e.g. ImageNet) [4]. 
Recently, BO has become popular in machine learning as it is highly effective in tuning hyperparameters of machine learning algorithms [8, 9, 19, 22].\nMost previous work in BO assumes that we evaluate the objective function sequentially [13], though a few recent papers have considered parallel evaluations [3, 5, 18, 25]. In practice, we can often evaluate several different choices in parallel, for example when multiple machines simultaneously train the machine learning algorithm with different sets of hyperparameters. In this paper, we assume that we can access q \u2265 1 evaluations simultaneously at each iteration. We then develop a new parallel acquisition function to guide where to evaluate next, based on a decision-theoretic analysis.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nOur Contributions. We propose a novel batch BO method which measures the information gain of evaluating q points via a new acquisition function, the parallel knowledge gradient (q-KG). This method is derived using a decision-theoretic analysis that chooses the set of points to evaluate next that is optimal in the average-case with respect to the posterior when there is only one batch of points remaining. Naively maximizing q-KG would be extremely computationally intensive, especially when q is large, and so, in this paper, we develop a method based on infinitesimal perturbation analysis (IPA) [25] to evaluate q-KG\u2019s gradient efficiently, allowing its efficient optimization. In our experiments on both synthetic functions and tuning practical machine learning algorithms, q-KG consistently finds better function values than other parallel BO algorithms, such as parallel EI [2, 19, 25], batch UCB [5] and parallel UCB with exploration [3]. q-KG provides especially large value when function evaluations are noisy. 
The code in this paper is available at https://github.com/wujian16/qKG.\nThe rest of the paper is organized as follows. Section 2 reviews related work. Section 3 gives background on Gaussian processes and defines notation used later. Section 4 proposes our new acquisition function q-KG for batch BO. Section 5 provides our computationally efficient approach to maximizing q-KG. Section 6 presents the empirical performance of q-KG and several benchmarks on synthetic functions and real problems. Finally, Section 7 concludes the paper.\n\n2 Related work\n\nWithin the past several years, the machine learning community has revisited BO [8, 9, 18, 19, 20, 22] due to its huge success in tuning hyperparameters of complex machine learning algorithms. BO algorithms consist of two components: a statistical model describing the function and an acquisition function guiding evaluations. In practice, the Gaussian process (GP) [16] is the most widely used statistical model due to its flexibility and tractability. Much of the literature in BO focuses on designing good acquisition functions that reach optima with as few evaluations as possible. Maximizing this acquisition function usually provides a single point to evaluate next, with common acquisition functions for sequential Bayesian optimization including probability of improvement (PI) [23], expected improvement (EI) [13], upper confidence bound (UCB) [21], entropy search (ES) [11], and knowledge gradient (KG) [17].\nRecently, a few papers have extended BO to the parallel setting, aiming to choose a batch of points to evaluate next in each iteration, rather than just a single point. [10, 19] suggest parallelizing EI by iteratively constructing a batch, in each iteration adding the point with maximal single-evaluation EI averaged over the posterior distribution of previously selected points. 
[10] also proposes an algorithm called \u201cconstant liar\u201d, which iteratively constructs a batch of points to sample by maximizing single-evaluation EI while pretending that points previously added to the batch have already returned values.\nThere is also work extending UCB to the parallel setting. [5] proposes the GP-BUCB policy, which selects points sequentially by a UCB criterion until filling the batch. Each time one point is selected, the algorithm updates the kernel function while keeping the mean function fixed. [3] proposes an algorithm combining UCB with pure exploration, called GP-UCB-PE. In this algorithm, the first point is selected according to a UCB criterion; then the remaining points are selected to encourage the diversity of the batch. These two algorithms extending UCB do not require Monte Carlo sampling, making them fast and scalable. However, UCB criteria are usually designed to minimize cumulative regret rather than immediate regret, causing these methods to underperform in BO, where we wish to minimize simple regret.\nThe parallel methods above construct the batch of points in an iterative greedy fashion, optimizing some single-evaluation acquisition function while holding the other points in the batch fixed. The acquisition function we propose considers the batch of points collectively, and we choose the batch to jointly optimize this acquisition function. 
Other recent papers that value points collectively include [2], which optimizes the parallel EI by a closed-form formula; [15, 25], in which gradient-based methods are proposed to jointly optimize a parallel EI criterion; and [18], which proposes a parallel version of the ES algorithm and uses Monte Carlo sampling to optimize the parallel ES acquisition function.\nWe compare against methods from a number of these previous papers in our numerical experiments, and demonstrate that we provide an improvement, especially in problems with noisy evaluations.\nOur method is also closely related to the knowledge gradient (KG) method [7, 17] for the non-batch (sequential) setting, which chooses the Bayes-optimal point to evaluate if only one iteration is left [17], where the final solution that we choose is not restricted to be one of the points we evaluate. (Expected improvement is Bayes-optimal if the solution is restricted to be one of the points we evaluate.) We go beyond this previous work in two aspects. First, we generalize to the parallel setting. Second, while the sequential setting allows evaluating the KG acquisition function exactly, evaluation requires Monte Carlo in the parallel setting, and so we develop more sophisticated computational techniques to optimize our acquisition function. Recently, [26] studies a nested batch knowledge gradient policy. However, they optimize over a finite discrete feasible set, where the gradient of KG does not exist. As a result, their computation of KG is much less efficient than ours. Moreover, they focus on a nesting structure from materials science not present in our setting.\n\n3 Background on Gaussian processes\n\nIn this section, we state our prior on f, briefly discuss well known results about Gaussian processes (GP), and introduce notation used later. 
We put a Gaussian process prior over the function f : A \u2192 R, which is specified by its mean function \u00b5(x) : A \u2192 R and kernel function K(x1, x2) : A \u00d7 A \u2192 R. We assume either exact or independent normally distributed measurement errors, i.e. the evaluation y(xi) at point xi satisfies\n\ny(xi) | f(xi) \u223c N(f(xi), \u03c32(xi)),\n\nwhere \u03c32 : A \u2192 R+ is a known function describing the variance of the measurement errors. If \u03c32 is not known, we can also estimate it as we do in Section 6.\nSupposing we have measured f at n points x(1:n) := {x(1), x(2), \u00b7\u00b7\u00b7, x(n)} and obtained corresponding measurements y(1:n), we can then combine these observed function values with our prior to obtain a posterior distribution on f. This posterior distribution is still a Gaussian process with mean function \u00b5(n) and kernel function K(n) as follows:\n\n\u00b5(n)(x) = \u00b5(x) + K(x, x(1:n)) (K(x(1:n), x(1:n)) + diag{\u03c32(x(1)), \u00b7\u00b7\u00b7, \u03c32(x(n))})\u22121 (y(1:n) \u2212 \u00b5(x(1:n))),\nK(n)(x1, x2) = K(x1, x2) \u2212 K(x1, x(1:n)) (K(x(1:n), x(1:n)) + diag{\u03c32(x(1)), \u00b7\u00b7\u00b7, \u03c32(x(n))})\u22121 K(x(1:n), x2). (3.1)\n\n4 Parallel knowledge gradient (q-KG)\n\nIn this section, we propose a novel parallel Bayesian optimization algorithm by generalizing the concept of the knowledge gradient from [7] to the parallel setting. The knowledge gradient policy in [7] for discrete A chooses the next sampling decision by maximizing the expected incremental value of a measurement, without assuming (as expected improvement does) that the point returned as the optimum must be a previously sampled point.\nWe now show how to compute this expected incremental value of an additional iteration in the parallel setting. Suppose that we have observed n function values. 
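As a concrete illustration of the posterior update (3.1), here is a minimal numpy sketch; the function and variable names are ours for illustration, not taken from the paper's code:

```python
import numpy as np

def gp_posterior(X, y, x_star, kernel, mu, noise_var):
    """Posterior mean and kernel of a GP at test points x_star, following (3.1).
    X, y: n observed inputs/outputs; noise_var: per-point measurement variances."""
    K_nn = kernel(X, X) + np.diag(noise_var)            # K(x^(1:n), x^(1:n)) + diag{sigma^2}
    K_sn = kernel(x_star, X)                            # K(x, x^(1:n))
    mean = mu(x_star) + K_sn @ np.linalg.solve(K_nn, y - mu(X))        # mu^(n)
    cov = kernel(x_star, x_star) - K_sn @ np.linalg.solve(K_nn, K_sn.T)  # K^(n)
    return mean, cov

# toy usage with a squared-exponential kernel and zero prior mean
rbf = lambda A, B: np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2)
X = np.array([0.0, 1.0, 2.0])
y = np.sin(X)
mean, cov = gp_posterior(X, y, np.array([0.5, 1.5]), rbf,
                         lambda x: np.zeros_like(x), 1e-6 * np.ones(3))
```

With near-zero noise, the posterior mean interpolates the observations, matching the behavior of (3.1).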
If we were to stop measuring now, minx\u2208A \u00b5(n)(x) would be the minimum of the predictor of the GP. If instead we took one more batch of samples, minx\u2208A \u00b5(n+q)(x) would be the minimum of the predictor of the GP. The difference between these quantities, minx\u2208A \u00b5(n)(x) \u2212 minx\u2208A \u00b5(n+q)(x), is the increment in expected solution quality (given the posterior after n + q samples) that results from the additional batch of samples.\nThis increment in solution quality is random given the posterior after n samples, because minx\u2208A \u00b5(n+q)(x) is itself a random variable due to its dependence on the outcome of the samples. We can compute the probability distribution of this difference (with more details given below), and the q-KG algorithm values the sampling decision z(1:q) := {z1, z2, \u00b7\u00b7\u00b7, zq} according to its expected value, which we call the parallel knowledge gradient factor and indicate using the notation q-KG. Formally, we define the q-KG factor for a set of candidate points to sample z(1:q) as\n\nq-KG(z(1:q), A) = minx\u2208A \u00b5(n)(x) \u2212 En[minx\u2208A \u00b5(n+q)(x) | y(z(1:q))], (4.1)\n\nwhere En[\u00b7] := E[\u00b7 | x(1:n), y(1:n)] is the expectation taken with respect to the posterior distribution after n evaluations. Then we choose to evaluate the next batch of q points that maximizes the parallel knowledge gradient,\n\nmax z(1:q)\u2282A q-KG(z(1:q), A). (4.2)\n\nBy construction, the parallel knowledge gradient policy is Bayes-optimal for minimizing the minimum of the predictor of the GP if only one decision is remaining. The q-KG algorithm will reduce to the parallel EI algorithm if function evaluations are noise-free and the final recommendation is restricted to the previous sampling decisions. 
This is because, under the two conditions above, the increment in expected solution quality becomes\n\nminx\u2208x(1:n) \u00b5(n)(x) \u2212 minx\u2208x(1:n)\u222az(1:q) \u00b5(n+q)(x) = min y(1:n) \u2212 min{min y(1:n), minx\u2208z(1:q) \u00b5(n+q)(x)} = (min y(1:n) \u2212 minx\u2208z(1:q) \u00b5(n+q)(x))+,\n\nwhich is exactly the parallel EI acquisition function. However, computing q-KG and its gradient is very expensive. We will address the computational issues in Section 5. The full description of the q-KG algorithm is summarized as follows.\n\nAlgorithm 1 The q-KG algorithm\nRequire: the number of initial stage samples I, and the number of main stage sampling iterations N.\n1: Initial Stage: draw I initial samples from a Latin hypercube design in A, x(i) for i = 1, . . . , I.\n2: Main Stage:\n3: for s = 1 to N do\n4: Solve (4.2), i.e. get (z\u22171, z\u22172, \u00b7\u00b7\u00b7, z\u2217q) = argmax z(1:q)\u2282A q-KG(z(1:q), A)\n5: Sample these points (z\u22171, z\u22172, \u00b7\u00b7\u00b7, z\u2217q), re-train the hyperparameters of the GP by MLE, and update the posterior distribution of f.\n6: end for\n7: return x\u2217 = argmin x\u2208A \u00b5(I+N q)(x).\n\n5 Computation of q-KG\n\nIn this section, we provide the strategy to maximize q-KG by a gradient-based optimizer. In Section 5.1 and Section 5.2, we describe how to compute q-KG and its gradient when A is finite in (4.1). Section 5.3 describes an effective way to discretize A in (4.1). The reader should note that there are two As here: one is the A in (4.1), which is used to compute the q-KG factor given a sampling decision z(1:q); the other is the feasible domain in (4.2) (z(1:q) \u2282 A) that we optimize over. 
We are discretizing the first A.\n\n5.1 Estimating q-KG when A is finite in (4.1)\n\nFollowing [7], we express \u00b5(n+q)(x) as\n\n\u00b5(n+q)(x) = \u00b5(n)(x) + K(n)(x, z(1:q)) (K(n)(z(1:q), z(1:q)) + diag{\u03c32(z(1)), \u00b7\u00b7\u00b7, \u03c32(z(q))})\u22121 (y(z(1:q)) \u2212 \u00b5(n)(z(1:q))).\n\nBecause y(z(1:q)) \u2212 \u00b5(n)(z(1:q)) is normally distributed with zero mean and covariance matrix K(n)(z(1:q), z(1:q)) + diag{\u03c32(z(1)), \u00b7\u00b7\u00b7, \u03c32(z(q))} with respect to the posterior after n observations, we can rewrite \u00b5(n+q)(x) as\n\n\u00b5(n+q)(x) = \u00b5(n)(x) + \u02dc\u03c3n(x, z(1:q))Zq, (5.1)\n\nwhere Zq is a standard q-dimensional normal random vector, and\n\n\u02dc\u03c3n(x, z(1:q)) = K(n)(x, z(1:q))(D(n)(z(1:q))T)\u22121,\n\nwhere D(n)(z(1:q)) is the Cholesky factor of the covariance matrix K(n)(z(1:q), z(1:q)) + diag{\u03c32(z(1)), \u00b7\u00b7\u00b7, \u03c32(z(q))}. Now we can compute the q-KG factor using Monte Carlo sampling when A is finite: we sample Zq, compute (5.1), plug it into (4.1), repeat many times, and average.\n\n5.2 Estimating the gradient of q-KG when A is finite in (4.1)\n\nIn this section, we propose an unbiased estimator of the gradient of q-KG using IPA when A is finite. Accessing a stochastic gradient makes optimization much easier. By (5.1), we express q-KG as\n\nq-KG(z(1:q), A) = EZq [g(z(1:q), A, Zq)], (5.2)\n\nwhere g(z(1:q), A, Zq) = minx\u2208A \u00b5(n)(x) \u2212 minx\u2208A (\u00b5(n)(x) + \u02dc\u03c3n(x, z(1:q))Zq). Under the condition that \u00b5 and K are continuously differentiable, one can show that (please see the details in the supplementary materials)\n\n\u2202/\u2202zij q-KG(z(1:q), A) = EZq [\u2202/\u2202zij g(z(1:q), A, Zq)], (5.3)\n\nwhere zij is the jth dimension of the ith point in z(1:q). 
By the formula of g,\n\n\u2202/\u2202zij g(z(1:q), A, Zq) = \u2202/\u2202zij \u00b5(n)(x\u2217(before)) \u2212 \u2202/\u2202zij \u00b5(n)(x\u2217(after)) \u2212 (\u2202/\u2202zij \u02dc\u03c3n(x\u2217(after), z(1:q)))Zq,\n\nwhere x\u2217(before) = argmin x\u2208A \u00b5(n)(x), x\u2217(after) = argmin x\u2208A (\u00b5(n)(x) + \u02dc\u03c3n(x, z(1:q))Zq), and\n\n\u2202/\u2202zij \u02dc\u03c3n(x\u2217(after), z(1:q)) = (\u2202/\u2202zij K(n)(x\u2217(after), z(1:q)))(D(n)(z(1:q))T)\u22121 \u2212 K(n)(x\u2217(after), z(1:q))(D(n)(z(1:q))T)\u22121 (\u2202/\u2202zij D(n)(z(1:q))T)(D(n)(z(1:q))T)\u22121.\n\nNow we can sample many times and average to estimate the gradient of q-KG via (5.3). This technique is called infinitesimal perturbation analysis (IPA) in gradient estimation [14]. Since we can estimate the gradient of q-KG efficiently when A is finite, we can apply standard gradient-based optimization algorithms, such as multi-start stochastic gradient ascent, to maximize q-KG.\n\n5.3 Approximating q-KG when A is infinite in (4.1) through discretization\n\nWe have specified how to maximize q-KG when A is finite in (4.1), but usually A is infinite. In this case, we discretize A to approximate q-KG, and then maximize the approximate q-KG. The discretization itself is an interesting research topic [17].\nIn this paper, the discrete set An is not chosen statically, but evolves over time: specifically, we suggest drawing M samples from the global optima of the posterior distribution of the Gaussian process (please refer to [11, 18] for a description of this technique). This sample set, denoted by A^M_n, is then extended by the locations of previously sampled points x(1:n) and the set of candidate points z(1:q). 
Then (4.1) can be restated as\n\nq-KG(z(1:q), An) = minx\u2208An \u00b5(n)(x) \u2212 En[minx\u2208An \u00b5(n+q)(x) | y(z(1:q))], (5.4)\n\nwhere An = A^M_n \u222a x(1:n) \u222a z(1:q). For the experimental evaluation we recompute A^M_n in every iteration after updating the posterior of the Gaussian process.\n\n6 Numerical experiments\n\nWe conduct experiments in two different settings: the noise-free setting and the noisy setting. In both settings, we test the algorithms on well-known synthetic functions chosen from [1] and practical problems. Following previous literature [19], we use a constant mean prior and the ARD Mat\u00e9rn 5/2 kernel. In the noisy setting, we assume that \u03c32(x) is constant across the domain A, and we estimate it together with other hyperparameters in the GP using maximum likelihood estimation (MLE). We set M = 1000 to discretize the domain following the strategy in Section 5.3. In general, the q-KG algorithm performs as well as or better than state-of-the-art benchmark algorithms on both synthetic and real problems. It performs especially well in the noisy setting.\n\nFigure 1: Performances on noise-free synthetic functions with q = 4. We report the mean and the standard deviation of the log10 scale of the immediate regret vs. the number of function evaluations.\n\nBefore describing the details of the empirical results, we highlight the implementation details of our method and the open-source implementations of the benchmark methods. Our implementation inherits the open-source implementation of parallel EI from the Metrics Optimization Engine [24], which is fully implemented in C++ with a python interface. We reuse their GP regression and GP hyperparameter fitting methods and implement the q-KG method in C++. 
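As an illustration of the Monte Carlo scheme of Section 5.1, the following Python sketch estimates the q-KG factor (4.1) on a finite grid via (5.1). It is a simplified sketch under our own naming, not the paper's C++ implementation:

```python
import numpy as np

def qkg_mc(mu_n, K_n, A_idx, z_idx, noise_var, num_mc=5000, seed=0):
    """Monte Carlo estimate of the q-KG factor over a finite set.
    mu_n, K_n: posterior mean vector and covariance matrix on a grid;
    A_idx indexes the finite set A, z_idx the q candidate points."""
    rng = np.random.default_rng(seed)
    q = len(z_idx)
    # covariance of the batch observations and its Cholesky factor D^(n)
    S = K_n[np.ix_(z_idx, z_idx)] + noise_var * np.eye(q)
    D = np.linalg.cholesky(S)
    # sigma~_n(x, z^(1:q)) = K^(n)(x, z)(D^T)^{-1}, via a triangular solve
    sigma_tilde = np.linalg.solve(D, K_n[np.ix_(z_idx, A_idx)]).T  # shape (|A|, q)
    best_now = mu_n[A_idx].min()
    Z = rng.standard_normal((num_mc, q))
    # mu^(n+q) = mu^(n) + sigma~_n Z_q, per (5.1); average the inner minimum
    best_next = (mu_n[A_idx][None, :] + Z @ sigma_tilde.T).min(axis=1)
    return best_now - best_next.mean()

# toy usage on a 5-point grid with a squared-exponential covariance
grid = np.linspace(0.0, 1.0, 5)
K = np.exp(-0.5 * (grid[:, None] - grid[None, :]) ** 2)
value = qkg_mc(np.zeros(5), K, np.arange(5), np.array([1, 3]), noise_var=0.1)
```

Note that the estimate is nonnegative in expectation: sampling can only lower the expected minimum of the posterior mean.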
Besides comparing to parallel EI in [24], we also compare our method to a well-known heuristic parallel EI implemented in Spearmint [12], the parallel UCB algorithm (GP-BUCB) and parallel UCB with pure exploration (GP-UCB-PE), both implemented in Gpoptimization [6].\n\n6.1 Noise-free problems\n\nIn this section, we focus our attention on the noise-free setting, in which we can evaluate the objective exactly. We show that the parallel knowledge gradient outperforms or is competitive with state-of-the-art benchmarks on several well-known test functions and when tuning practical machine learning algorithms.\n\n6.1.1 Synthetic functions\n\nFirst, we test our algorithm along with the benchmarks on 4 well-known synthetic test functions: Branin2 on the domain [\u221215, 15]2, Rosenbrock3 on the domain [\u22122, 2]3, Ackley5 on the domain [\u22122, 2]5, and Hartmann6 on the domain [0, 1]6. We initiate our algorithms by randomly sampling 2d + 2 points from a Latin hypercube design, where d is the dimension of the problem. Figure 1 reports the mean and the standard deviation of the base 10 logarithm of the immediate regret by running 100 random initializations with batch size q = 4.\nThe results show that q-KG is significantly better on Rosenbrock3, Ackley5 and Hartmann6, and is slightly worse than the best of the other benchmarks on Branin2. Especially on Rosenbrock3 and Ackley5, q-KG makes dramatic progress in early iterations.\n\n6.1.2 Tuning logistic regression and convolutional neural networks (CNN)\n\nIn this section, we test the algorithms on two practical problems: tuning logistic regression on the MNIST dataset and tuning a CNN on the CIFAR10 dataset. 
We set the batch size to q = 4.\nFirst, we tune logistic regression on the MNIST dataset. This task is to classify handwritten digits from images, and is a 10-class classification problem. We train logistic regression on a training set with 60000 instances with a given set of hyperparameters and test it on a test set with 10000 instances. We tune 4 hyperparameters: mini batch size from 10 to 2000, training iterations from 100 to 10000, the \u21132 regularization parameter from 0 to 1, and learning rate from 0 to 1. We report the mean and standard deviation of the test error for 20 independent runs. From the results, one can see that both algorithms make progress in the initial stage, while q-KG maintains this progress for longer and results in a better algorithm configuration in general.\n\nFigure 2: Performances on tuning machine learning algorithms with q = 4\n\nIn the second experiment, we tune a CNN on the CIFAR10 dataset. This is also a 10-class classification problem. We train the CNN on the 50000 training data with certain hyperparameters and test it on the test set with 10000 instances. 
For the network architecture, we choose the one in the tensorflow tutorial. It consists of 2 convolutional layers, 2 fully connected layers, and on top of them a softmax layer for final classification. We tune a total of 8 hyperparameters: the mini batch size from 10 to 1000, training epochs from 1 to 10, the \u21132 regularization parameter from 0 to 1, learning rate from 0 to 1, the kernel size from 2 to 10, the number of channels in convolutional layers from 10 to 1000, the number of hidden units in fully connected layers from 100 to 1000, and the dropout rate from 0 to 1. We report the mean and standard deviation of the test error for 5 independent runs. In this example, q-KG makes better (more aggressive) progress than parallel EI even in the initial stage and maintains this advantage to the end. This architecture has been carefully tuned by a human expert to achieve a test error of around 14%, and our automatic algorithm improves it to around 11%.\n\n6.2 Noisy problems\n\nIn this section, we study problems with noisy function evaluations. Our results show that the performance gains over benchmark algorithms from q-KG evident in the noise-free setting are even larger in the noisy setting.\n\n6.2.1 Noisy synthetic functions\n\nWe test on the same 4 synthetic functions from the noise-free setting, and add independent Gaussian noise with standard deviation \u03c3 = 0.5 to the function evaluations. The algorithms are not given this standard deviation, and must learn it from data.\nThe results in Figure 3 show that q-KG is consistently better than or at least competitive with all competing methods. Also observe that the performance advantage of q-KG is larger than for noise-free problems.\n\n6.2.2 Noisy logistic regression with small test sets\n\nTesting on a large test set such as ImageNet is slow, especially when we must test many times for different hyperparameters. 
To speed up hyperparameter tuning, we may instead test the algorithm on a subset of the testing data to approximate the test error on the full set. We study the performance of our algorithm and benchmarks in this scenario, focusing on tuning logistic regression on MNIST. We train logistic regression on the full training set of 60,000, but we test the algorithm on 1,000 randomly selected samples from the test set, which provides a noisy approximation of the test error on the full test set.\n\nFigure 3: Performances on noisy synthetic functions with q = 4. We report the mean and the standard deviation of the log10 scale of the immediate regret vs. the number of function evaluations.\n\nFigure 4: Tuning logistic regression on smaller test sets with q = 4\n\nWe report the mean and standard deviation of the test error on the full set using the hyperparameters recommended by each parallel BO algorithm for 20 independent runs. The result shows that q-KG is better than both versions of parallel EI, and its final test error is close to the noise-free test error (which is substantially more expensive to obtain). As we saw with the synthetic test functions, q-KG\u2019s performance advantage in the noisy setting is wider than in the noise-free setting.\n\nAcknowledgments\n\nThe authors were partially supported by NSF CAREER CMMI-1254298, NSF CMMI-1536895, NSF IIS-1247696, AFOSR FA9550-12-1-0200, AFOSR FA9550-15-1-0038, and AFOSR FA9550-16-1-0046.\n\n7 Conclusions\n\nIn this paper, we introduce a novel batch Bayesian optimization method, q-KG, derived from a decision-theoretic perspective, and develop a computational method to implement it efficiently. 
We show that q-KG outperforms or is competitive with state-of-the-art benchmark algorithms on several synthetic functions and in tuning practical machine learning algorithms.\n\nReferences\n\n[1] Bingham, D. (2015). Optimization test problems. http://www.sfu.ca/~ssurjano/optimization.html.\n\n[2] Chevalier, C. and Ginsbourger, D. (2013). Fast computation of the multi-points expected improvement with applications in batch selection. In Learning and Intelligent Optimization, pages 59\u201369. Springer.\n\n[3] Contal, E., Buffoni, D., Robicquet, A., and Vayatis, N. (2013). Parallel gaussian process optimization with upper confidence bound and pure exploration. In Machine Learning and Knowledge Discovery in Databases, pages 225\u2013240. Springer.\n\n[4] Deng, L. and Yu, D. (2014). Deep learning: Methods and applications. Foundations and Trends in Signal Processing, 7(3\u20134):197\u2013387.\n\n[5] Desautels, T., Krause, A., and Burdick, J. W. (2014). 
Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization. The Journal of Machine Learning Research, 15(1):3873\u20133923.\n\n[6] Contal, E. et al. (2015). Gpoptimization. https://reine.cmla.ens-cachan.fr/e.contal/gpoptimization.\n\n[7] Frazier, P., Powell, W., and Dayanik, S. (2009). The knowledge-gradient policy for correlated normal beliefs. INFORMS Journal on Computing, 21(4):599\u2013613.\n\n[8] Gardner, J. R., Kusner, M. J., Xu, Z. E., Weinberger, K. Q., and Cunningham, J. (2014). Bayesian optimization with inequality constraints. In ICML, pages 937\u2013945.\n\n[9] Gelbart, M., Snoek, J., and Adams, R. (2014). Bayesian optimization with unknown constraints. In Proceedings of the Thirtieth Annual Conference on Uncertainty in Artificial Intelligence (UAI-14), pages 250\u2013259, Corvallis, Oregon. AUAI Press.\n\n[10] Ginsbourger, D., Le Riche, R., and Carraro, L. (2010). Kriging is well-suited to parallelize optimization. In Computational Intelligence in Expensive Optimization Problems, pages 131\u2013162. Springer.\n\n[11] Hern\u00e1ndez-Lobato, J. M., Hoffman, M. W., and Ghahramani, Z. (2014). Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems, pages 918\u2013926.\n\n[12] Snoek, J., Larochelle, H., and Adams, R. P. (2015). Spearmint. https://github.com/HIPS/Spearmint.\n\n[13] Jones, D. R., Schonlau, M., and Welch, W. J. (1998). Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455\u2013492.\n\n[14] L\u2019Ecuyer, P. (1990). A unified view of the IPA, SF, and LR gradient estimation techniques. Management Science, 36(11):1364\u20131383.\n\n[15] Marmin, S., Chevalier, C., and Ginsbourger, D. (2015). Differentiating the multipoint expected improvement for optimal batch design. 
In International Workshop on Machine Learning, Optimization and Big Data, pages 37\u201348. Springer.\n\n[16] Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.\n\n[17] Scott, W., Frazier, P., and Powell, W. (2011). The correlated knowledge gradient for simulation optimization of continuous parameters using gaussian process regression. SIAM Journal on Optimization, 21(3):996\u20131026.\n\n[18] Shah, A. and Ghahramani, Z. (2015). Parallel predictive entropy search for batch global optimization of expensive objective functions. In Advances in Neural Information Processing Systems, pages 3312\u20133320.\n\n[19] Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951\u20132959.\n\n[20] Snoek, J., Swersky, K., Zemel, R., and Adams, R. (2014). Input warping for bayesian optimization of non-stationary functions. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1674\u20131682.\n\n[21] Srinivas, N., Krause, A., Seeger, M., and Kakade, S. M. (2010). Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1015\u20131022.\n\n[22] Swersky, K., Snoek, J., and Adams, R. P. (2013). Multi-task bayesian optimization. In Advances in Neural Information Processing Systems, pages 2004\u20132012.\n\n[23] T\u00f6rn, A. and Zilinskas, A. (1989). Global Optimization, volume 350 of Lecture Notes in Computer Science. Springer.\n\n[24] Wang, J., Clark, S. C., Liu, E., and Frazier, P. I. (2014). Metrics optimization engine. http://yelp.github.io/MOE/. Last accessed on 2016-01-21.\n\n[25] Wang, J., Clark, S. C., Liu, E., and Frazier, P. I. (2015a). Parallel bayesian global optimization of expensive functions.\n\n[26] Wang, Y., Reyes, K. G., Brown, K. A., Mirkin, C. A., and Powell, W. B. 
(2015b). Nested-batch-mode learning and stochastic optimization with an application to sequential multistage testing in materials science. SIAM Journal on Scientific Computing, 37(3):B361\u2013B381.\n", "award": [], "sourceid": 1567, "authors": [{"given_name": "Jian", "family_name": "Wu", "institution": "Cornell University"}, {"given_name": "Peter", "family_name": "Frazier", "institution": "Cornell University"}]}