{"title": "Regret bounds for meta Bayesian optimization with an unknown Gaussian process prior", "book": "Advances in Neural Information Processing Systems", "page_first": 10477, "page_last": 10488, "abstract": "Bayesian optimization usually assumes that a Bayesian prior is given. However, the strong theoretical guarantees in Bayesian optimization are often regrettably compromised in practice because of unknown parameters in the prior. In this paper, we adopt a variant of empirical Bayes and show that,  by estimating the Gaussian process prior from offline data sampled from the same prior and constructing unbiased estimators of the posterior, variants of both GP-UCB and \\emph{probability of improvement} achieve a near-zero regret bound, which decreases to a constant proportional to the observational noise as the number of offline data and the number of online evaluations increase. Empirically, we have verified our approach on challenging simulated robotic problems featuring task and motion planning.", "full_text": "Regret bounds for meta Bayesian optimization\n\nwith an unknown Gaussian process prior\n\nZi Wang\u2217\nMIT CSAIL\n\nziw@csail.mit.edu\n\nBeomjoon Kim\u2217\n\nMIT CSAIL\n\nbeomjoon@mit.edu\n\nLeslie Pack Kaelbling\n\nMIT CSAIL\n\nlpk@csail.mit.edu\n\nAbstract\n\nBayesian optimization usually assumes that a Bayesian prior is given. However,\nthe strong theoretical guarantees in Bayesian optimization are often regrettably\ncompromised in practice because of unknown parameters in the prior. In this paper,\nwe adopt a variant of empirical Bayes and show that, by estimating the Gaussian\nprocess prior from of\ufb02ine data sampled from the same prior and constructing\nunbiased estimators of the posterior, variants of both GP-UCB and probability\nof improvement achieve a near-zero regret bound, which decreases to a constant\nproportional to the observational noise as the number of of\ufb02ine data and the\nnumber of online evaluations increase. 
Empirically, we have veri\ufb01ed our approach\non challenging simulated robotic problems featuring task and motion planning.\n\n1\n\nIntroduction\n\nBayesian optimization (BO) is a popular approach to optimizing black-box functions that are expen-\nsive to evaluate. Because of expensive evaluations, BO aims to approximately locate the function\nmaximizer without evaluating the function too many times. This requires a good strategy to adaptively\nchoose where to evaluate based on the current observations.\nBO adopts a Bayesian perspective and assumes that there is a prior on the function; typically, we use\na Gaussian process (GP) prior. Then, the information collection strategy can rely on the prior to focus\non good inputs, where the goodness is determined by an acquisition function derived from the GP\nprior and current observations. In past literature, it has been shown both theoretically and empirically\nthat if the function is indeed drawn from the given prior, there are many acquisition functions that\nBO can use to locate the function maximizer quickly [51, 5, 53].\nHowever, in reality, the prior we choose to use in BO often does not re\ufb02ect the distribution from\nwhich the function is drawn. Hence, we sometimes have to estimate the hyper-parameters of a chosen\nform of the prior on the \ufb02y as we collect more data [50]. One popular choice is to estimate the prior\nparameters using empirical Bayes with, e.g., the maximum likelihood estimator [44] .\nDespite the vast literature that shows many empirical Bayes approaches have well-founded theoretical\nguarantees such as consistency [40] and admissibility [26], it is dif\ufb01cult to analyze a version of BO\nthat uses empirical Bayes because of the circular dependencies between the estimated parameters and\nthe data acquisition strategies. 
The requirement to select the prior model and estimate its parameters\nleads to a BO version of the chicken-and-egg dilemma: the prior model selection depends on the data\ncollected and the data collection strategy depends on having a \u201ccorrect\u201d prior. Theoretically, there is\nlittle evidence that BO with unknown parameters in the prior can work well. Empirically, there is\nevidence showing it works well in some situations, but not others [33, 23], which is not surprising in\nlight of no free lunch results [56, 22].\n\n\u2217Equal contribution.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fIn this paper, we propose a simple yet effective strategy for learning a prior in a meta-learning setting\nwhere training data on functions from the same Gaussian process prior are available. We use a variant\nof empirical Bayes that gives unbiased estimates for both the parameters in the prior and the posterior\ngiven observations of the function we wish to optimize. We analyze the regret bounds in two settings:\n(1) \ufb01nite input space, and (2) compact input space in Rd. We clarify additional assumptions on the\ntraining data and form of Gaussian processes of both settings in Sec. 4.1 and Sec. 4.2. We prove\ntheorems that show a near-zero regret bound for variants of GP-UCB [2, 51] and probability of\nimprovement (PI) [29, 53]. The regret bound decreases to a constant proportional to the observational\nnoise as online evaluations and of\ufb02ine data size increase.\nFrom a more pragmatic perspective on Bayesian optimization for important areas such as robotics,\nwe further explore how our approach works for problems in task and motion planning domains [27],\nand we explain why the assumptions in our theorems make sense for these problems in Sec. 5. 
Indeed,\nassuming a common kernel, such as squared exponential or Mat\u00e9rn, is very limiting for robotic\nproblems that involve discontinuity and non-stationarity. However, with our approach of setting the\nprior and posterior parameters, BO outperforms all other methods in the task and motion planning\nbenchmark problems.\nThe contributions of this paper are (1) a stand-alone BO module that takes in only a multi-task training\ndata set as input and then actively selects inputs to ef\ufb01ciently optimize a new function and (2) analysis\nof the regret of this module. The analysis is constructive, and determines appropriate hyperparameter\nsettings for the GP-UCB acquisition function. Thus, we make a step forward to resolving the problem\nthat, despite being used for hyperparameter tuning, BO algorithms themselves have hyperparameters.\n\n2 Background and related work\n\nBO optimizes a black-box objective function through sequential queries. We usually assume knowl-\nedge of a Gaussian process [44] prior on the function, though other priors such as Bayesian neural\nnetworks and their variants [17, 30] are applicable too. Then, given possibly noisy observations\nand the prior distribution, we can do Bayesian posterior inference and construct acquisition func-\ntions [29, 38, 2] to search for the function optimizer.\nHowever, in practice, we do not know the prior and it must be estimated. One of the most popular\nmethods of prior estimation in BO is to optimize mean/kernel hyper-parameters by maximizing\ndata-likelihood of the current observations [44, 19]. Another popular approach is to put a prior on\nthe mean/kernel hyper-parameters and obtain a distribution of such hyper-parameters to adapt the\nmodel given observations [20, 50]. These methods require a predetermined form of the mean function\nand the kernel function. 
In the existing literature, mean functions are usually set to be 0 or linear\nand the popular kernel functions include Mat\u00e9rn kernels, Gaussian kernels, linear kernels [44] or\nadditive/product combinations of the above [11, 24].\nMeta BO aims to improve the optimization of a given objective function by learning from past\nexperiences with other similar functions. Meta BO can be viewed as a special case of transfer\nlearning or multi-task learning. One well-studied instance of meta BO is the machine learning (ML)\nhyper-parameter tuning problem on a dataset, where, typically, the validation errors are the functions\nto optimize [14]. The key question is how to transfer the knowledge from previous experiments on\nother datasets to the selection of ML hyper-parameters for the current dataset.\nTo determine the similarity between validation error functions on different datasets, meta-features of\ndatasets are often used [6]. With those meta-features of datasets, one can use contextual Bayesian\noptimization approaches [28] that operate with a probabilistic functional model on both the dataset\nmeta-features and ML hyper-parameters [3]. Feurer et al. [16], on the other hand, used meta-features\nof datasets to construct a distance metric, and to sort hyper-parameters that are known to work for\nsimilar datasets according to their distances to the current dataset. The best k hyper-parameters are\nthen used to initialize a vanilla BO algorithm. If the function meta-features are not given, one can\nestimate the meta-features, such as the mean and variance of all observations, using Monte Carlo\nmethods [52], maximum likelihood estimates [57] or maximum a posteriori estimates [43, 42].\nAs an alternative to using meta-features of functions, one can construct a kernel between functions.\nFor functions that are represented by GPs, Malkomes et al. 
[36] studied a \u201ckernel kernel\u201d, a kernel\nfor kernels, such that one can use BO with a \u201ckernel kernel\u201d to select which kernel to use to model or\noptimize an objective function [35] in a Bayesian way. However, [36] requires an initial set of kernels\nto select from. Instead, Golovin et al. [18] introduced a setting where the functions come in sequence\nand the posterior of the former function becomes the prior of the current function. Removing the\nassumption that functions come sequentially, Feurer et al. [15] proposed a method to learn an additive\nensemble of GPs that are known to fit all of those past \u201ctraining functions\u201d.\nTheoretically, it has been shown that meta BO methods that use information from similar functions\nmay result in an improvement for the cumulative regret bound [28, 47] or the simple regret bound [42]\nunder the assumption that the GP priors are given. If the form of the GP kernel is given and the prior\nmean function is 0 but the kernel hyper-parameters are unknown, it is possible to obtain a regret\nbound given a range of these hyper-parameters [54]. In this paper, we prove a regret bound for meta\nBO where the GP prior is unknown; that is, neither the range of GP hyper-parameters nor the\nform of the kernel or mean function is given.\nA more ambitious approach to solving meta BO is to train an end-to-end system, such as a recurrent\nneural network [21], that takes the history of observations as an input and outputs the next point to\nevaluate [8]. Though it has been demonstrated that the method in [8] can learn to trade off exploration\nand exploitation for a short horizon, it is unclear how many \u201ctraining instances\u201d, in the form of\nobservations of BO performed on similar functions, are necessary to learn the optimization strategies\nfor any given horizon of optimization. 
In this paper, we show both theoretically and empirically how\nthe number of \u201ctraining instances\u201d in our method affects the performance of BO.\nOur methods are most similar to the BOX algorithm [27], which uses evaluations of previous\nfunctions to make point estimates of a mean and covariance matrix on the values over a discrete\ndomain. Our methods for the discrete setting (described in Sec. 4.1) directly improve on BOX by\nchoosing the exploration parameters in GP-UCB more effectively. This general strategy is extended\nto the continuous-domain setting in Sec. 4.2, in which we extend a method for learning the GP\nprior [41] and then use the learned prior in GP-UCB and PI.\nLearning how to learn, or \u201cmeta learning\u201d, has a long history in machine learning [46]. It was\nargued that learning how to learn is \u201clearning the prior\u201d [4] with \u201cpoint sets\u201d [37], a set of iid sets\nof potentially non-iid points. We follow this simple intuition and present a meta BO approach that\nlearns its GP prior from the data collected on functions that are assumed to have been drawn from the\nsame prior distribution.\nEmpirical Bayes [45, 26] is a standard methodology for estimating unknown parameters of a\nBayesian model. Our approach is a variant of empirical Bayes. We can view our computations\nas the construction of a sequence of estimators for a Bayesian model. The key difference from\ntraditional empirical Bayes methods is that we are able to prove a regret bound for a BO method\nthat uses estimated parameters to construct priors and posteriors. 
In particular, we use frequentist\nconcentration bounds to analyze Bayesian procedures, which is one way to certify empirical Bayes in\nstatistics [49, 13].\n\n3 Problem formulation and notations\n\nUnlike the standard BO setting, we do not assume knowledge of the mean or covariance in the GP\nprior, but we do assume the availability of a dataset of iid sets of potentially non-iid observations on\nfunctions sampled from the same GP prior. Then, given a new, unknown function sampled from that\nsame distribution, we would like to find its maximizer.\nMore formally, we assume there exists a distribution $GP(\\mu, k)$, and both the mean $\\mu : X \\to \\mathbb{R}$ and the\nkernel $k : X \\times X \\to \\mathbb{R}$ are unknown. Nevertheless, we are given a dataset $\\bar{D}_N = \\{[(\\bar{x}_{ij}, \\bar{y}_{ij})]_{j=1}^{M_i}\\}_{i=1}^{N}$,\nwhere $\\bar{y}_{ij}$ is drawn independently from $\\mathcal{N}(f_i(\\bar{x}_{ij}), \\sigma^2)$ and $f_i : X \\to \\mathbb{R}$ is drawn independently from\n$GP(\\mu, k)$. The noise level $\\sigma$ is unknown as well. We will specify inputs $\\bar{x}_{ij}$ in Sec. 4.1 and Sec. 4.2.\nGiven a new function $f$ sampled from $GP(\\mu, k)$, our goal is to maximize it by sequentially querying\nthe function and constructing $D_T = [(x_t, y_t)]_{t=1}^{T}$, $y_t \\sim \\mathcal{N}(f(x_t), \\sigma^2)$. We study two evaluation\ncriteria: (1) the best-sample simple regret $r_T = \\max_{x \\in X} f(x) - \\max_{t \\in [T]} f(x_t)$, which indicates the\nvalue of the best query in hindsight, and (2) the simple regret $R_T = \\max_{x \\in X} f(x) - f(\\hat{x}^*_T)$, which\nmeasures how good the inferred maximizer $\\hat{x}^*_T$ is.\n\nNotation We use $\\mathcal{N}(u, V)$ to denote a multivariate Gaussian distribution with mean $u$ and variance\n$V$ and use $\\mathcal{W}(V, n)$ to denote a Wishart distribution with $n$ degrees of freedom and scale matrix\n$V$. We also use $[n]$ to denote $[1, \\cdots, n]$, $\\forall n \\in \\mathbb{Z}^+$. 
We overload function notation for evaluations\non vectors $\\mathbf{x} = [x_i]_{i=1}^{n}$, $\\mathbf{x}' = [x'_j]_{j=1}^{n'}$ by denoting the output column vector as $\\mu(\\mathbf{x}) = [\\mu(x_i)]_{i=1}^{n}$,\nand the output matrix as $k(\\mathbf{x}, \\mathbf{x}') = [k(x_i, x'_j)]_{i \\in [n], j \\in [n']}$, and we overload the kernel function\n$k(\\mathbf{x}) = k(\\mathbf{x}, \\mathbf{x})$.\n\n4 Meta BO and its theoretical guarantees\n\nInstead of hand-crafting the mean $\\mu$ and kernel $k$, we estimate them using the training dataset $\\bar{D}_N$.\nOur approach is fairly straightforward: in the offline phase, the training dataset $\\bar{D}_N$ is collected and we\nobtain estimates of the mean function $\\hat{\\mu}$ and kernel $\\hat{k}$; in the online phase, we treat\n$GP(\\hat{\\mu}, \\hat{k})$ as the Bayesian \u201cprior\u201d to do Bayesian optimization. We illustrate the two phases in Fig. 1.\nIn Alg. 1, we depict our algorithm, assuming the dataset $\\bar{D}_N$ has been collected. We use\nESTIMATE($\\bar{D}_N$) to denote the \u201cprior\u201d estimation and INFER($D_t; \\hat{\\mu}, \\hat{k}$) the \u201cposterior\u201d inference,\nboth of which we will introduce in Sec. 4.1 and Sec. 4.2. 
\n\nAlgorithm 1 Meta Bayesian optimization\n1: function META-BO($\\bar{D}_N$, $f$)\n2:   $\\hat{\\mu}(\\cdot), \\hat{k}(\\cdot, \\cdot) \\leftarrow$ ESTIMATE($\\bar{D}_N$)\n3:   return BO($f, \\hat{\\mu}, \\hat{k}$)\n4: end function\n5: function BO($f, \\hat{\\mu}, \\hat{k}$)\n6:   $D_0 \\leftarrow \\emptyset$\n7:   for $t = 1, \\cdots, T$ do\n8:     $\\hat{\\mu}_{t-1}(\\cdot), \\hat{k}_{t-1}(\\cdot) \\leftarrow$ INFER($D_{t-1}; \\hat{\\mu}, \\hat{k}$)\n9:     $\\alpha_{t-1}(\\cdot) \\leftarrow$ ACQUISITION($\\hat{\\mu}_{t-1}, \\hat{k}_{t-1}$)\n10:     $x_t \\leftarrow \\arg\\max_{x \\in X} \\alpha_{t-1}(x)$\n11:     $y_t \\leftarrow$ OBSERVE($f(x_t)$)\n12:     $D_t \\leftarrow D_{t-1} \\cup [(x_t, y_t)]$\n13:   end for\n14:   return $D_T$\n15: end function\n\nFor acquisition functions, we consider special cases of probability of improvement\n(PI) [53, 29] and upper confidence bound (GP-UCB) [51, 2]:\n\n$$\\alpha^{\\text{PI}}_{t-1}(x) = \\frac{\\hat{\\mu}_{t-1}(x) - \\hat{f}^*}{\\hat{k}_{t-1}(x)^{1/2}}, \\qquad \\alpha^{\\text{GP-UCB}}_{t-1}(x) = \\hat{\\mu}_{t-1}(x) + \\zeta_t \\hat{k}_{t-1}(x)^{1/2}.$$\n\nHere, PI assumes additional information (see footnote 2) in the form of the upper bound on function value $\\hat{f}^* \\geq\n\\max_{x \\in X} f(x)$. For GP-UCB, we set its hyperparameter $\\zeta_t$ to be\n\n$$\\zeta_t = \\frac{\\left( 6 \\left( N - 3 + t + 2 \\sqrt{t \\log \\frac{6}{\\delta}} + 2 \\log \\frac{6}{\\delta} \\right) / \\left( \\delta N (N - t - 1) \\right) \\right)^{1/2} + \\left( 2 \\log \\frac{3}{\\delta} \\right)^{1/2}}{\\left( 1 - 2 \\left( \\frac{1}{N-t} \\log \\frac{6}{\\delta} \\right)^{1/2} \\right)^{1/2}},$$\n\nwhere $N$ is the size of the dataset $\\bar{D}_N$ and $\\delta \\in (0, 1)$. With probability $1 - \\delta$, the regret bound in\nThm. 2 or Thm. 4 holds with these special cases of GP-UCB and PI. Under two different settings of\nthe search space $X$, finite $X$ and compact $X \\subset \\mathbb{R}^d$, we show how our algorithm works in detail and\nwhy it works via regret analyses on the best-sample simple regret. Finally, in Sec. 4.3 we show how\nthe simple regret can be bounded. 
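As a hedged illustration of the exploration weight and the two acquisition rules defined above, the following Python sketch transcribes the formula for zeta_t and scores a single candidate point with GP-UCB and PI. The function names (zeta, acquisition_ucb, acquisition_pi) and the argument names are assumptions introduced here for illustration; they are not taken from the paper or its released code. The arguments mean and var stand for the estimated posterior mean and variance at the candidate point.

```python
import math


def zeta(t, n_train, delta):
    # Exploration weight zeta_t for GP-UCB, transcribed from the formula above.
    # The denominator is real and positive only if n_train - t > 4 * log(6 / delta).
    log6 = math.log(6.0 / delta)
    inner = 6.0 * (n_train - 3 + t + 2.0 * math.sqrt(t * log6) + 2.0 * log6)
    numer = math.sqrt(inner / (delta * n_train * (n_train - t - 1)))
    numer += math.sqrt(2.0 * math.log(3.0 / delta))
    denom_inner = 1.0 - 2.0 * math.sqrt(log6 / (n_train - t))
    if denom_inner <= 0.0:
        raise ValueError('need n_train - t > 4 * log(6 / delta)')
    return numer / math.sqrt(denom_inner)


def acquisition_ucb(mean, var, zeta_t):
    # GP-UCB score: estimated mean plus zeta_t times the estimated standard deviation.
    return mean + zeta_t * math.sqrt(var)


def acquisition_pi(mean, var, f_upper):
    # PI score as defined above: (mean - upper bound) / standard deviation.
    # f_upper is the assumed upper bound on the function value.
    return (mean - f_upper) / math.sqrt(var)
```

In this sketch, zeta(1, 1000, 0.1) is finite and positive, whereas zeta(5, 10, 0.1) raises an error because the training set is too small relative to t, which is in line with the dataset-size condition stated in Thm. 2.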
The proofs of the analyses can be found in the appendix.\n\n4.1 X is a finite set\n\nWe first study the simplest case, where the function domain $X = [\\bar{x}_j]_{j=1}^{M}$ is a finite set with cardinality\n$|X| = M \\in \\mathbb{Z}^+$. For convenience, we treat this set as an ordered vector of items indexed by\n$j \\in [M]$. We collect the training dataset $\\bar{D}_N = \\{[(\\bar{x}_j, \\bar{\\delta}_{ij} \\bar{y}_{ij})]_{j=1}^{M}\\}_{i=1}^{N}$, where $\\bar{y}_{ij}$ are independently\ndrawn from $\\mathcal{N}(f_i(\\bar{x}_j), \\sigma^2)$, $f_i$ are drawn independently from $GP(\\mu, k)$ and $\\bar{\\delta}_{ij} \\in \\{0, 1\\}$. Because\nthe training data can be collected offline by querying the functions $\\{f_i\\}_{i=1}^{N}$ in parallel, it is not\nunreasonable to assume that such a dataset $\\bar{D}_N$ is available. If $\\bar{\\delta}_{ij} = 0$, it means the $(i, j)$-th entry of\nthe dataset $\\bar{D}_N$ is missing, perhaps as a result of a failed experiment.\n\nFootnote 2: Alternatively, an upper bound $\\hat{f}^*$ can be estimated adaptively [53]. Note that here we are maximizing the PI\nacquisition function and hence $\\alpha^{\\text{PI}}_{t-1}(x)$ is a negative version of what was defined in [53].\n\nEstimating GP parameters If $\\bar{\\delta}_{ij} < 1$, we have missing entries in the observation matrix\n$\\bar{Y} = [\\bar{\\delta}_{ij} \\bar{y}_{ij}]_{i \\in [N], j \\in [M]} \\in \\mathbb{R}^{N \\times M}$. Under additional assumptions specified in [7], including that\n$\\mathrm{rank}(Y) = r$ and the total number of valid observations $\\sum_{i=1}^{N} \\sum_{j=1}^{M} \\bar{\\delta}_{ij} \\geq O(r N^{6/5} \\log N)$, we can use\nmatrix completion [7] to fully recover the matrix $\\bar{Y}$ with high probability. In the following, we proceed\nby considering completed observations only.\nLet the completed observation matrix be $Y = [\\bar{y}_{ij}]_{i \\in [N], j \\in [M]}$. 
We use an unbiased sample mean and\ncovariance estimator for $\\mu$ and $k$; that is, $\\hat{\\mu}(X) = \\frac{1}{N} Y^{\\mathrm{T}} 1_N$ and $\\hat{k}(X) = \\frac{1}{N-1} (Y - 1_N \\hat{\\mu}(X)^{\\mathrm{T}})^{\\mathrm{T}} (Y - 1_N \\hat{\\mu}(X)^{\\mathrm{T}})$,\nwhere $1_N$ is an $N$ by 1 vector of ones. It is well known that $\\hat{\\mu}$ and $\\hat{k}$ are independent and\n$\\hat{\\mu}(X) \\sim \\mathcal{N}(\\mu(X), \\frac{1}{N}(k(X) + \\sigma^2 I))$, $\\hat{k}(X) \\sim \\mathcal{W}(\\frac{1}{N-1}(k(X) + \\sigma^2 I), N - 1)$ [1].\n\nFigure 1: Our approach estimates the mean function $\\hat{\\mu}$ and kernel $\\hat{k}$\nfrom functions sampled from $GP(\\mu, k)$ in the offline phase. Those\nsampled functions are illustrated by colored lines. In the online phase,\na new function $f$ sampled from the same $GP(\\mu, k)$ is given and we\ncan estimate its posterior mean function $\\hat{\\mu}_t$ and covariance function $\\hat{k}_t$\nwhich will be used for Bayesian optimization.\n\nConstructing estimators of the posterior Given noisy observations $D_t = \\{(x_\\tau, y_\\tau)\\}_{\\tau=1}^{t}$, we can\ndo Bayesian posterior inference to obtain $f \\sim GP(\\mu_t, k_t)$. By the GP assumption, we get\n\n$$\\mu_t(x) = \\mu(x) + k(x, \\mathbf{x}_t) (k(\\mathbf{x}_t) + \\sigma^2 I)^{-1} (\\mathbf{y}_t - \\mu(\\mathbf{x}_t)), \\quad \\forall x \\in X, \\tag{1}$$\n$$k_t(x, x') = k(x, x') - k(x, \\mathbf{x}_t) (k(\\mathbf{x}_t) + \\sigma^2 I)^{-1} k(\\mathbf{x}_t, x'), \\quad \\forall x, x' \\in X, \\tag{2}$$\n\nwhere $\\mathbf{y}_t = [y_\\tau]_{\\tau=1}^{t}$, $\\mathbf{x}_t = [x_\\tau]_{\\tau=1}^{t}$ [44]. The problem is that neither the posterior mean $\\mu_t$ nor\nthe covariance $k_t$ is computable because the Bayesian prior mean $\\mu$, the kernel $k$ and the noise\nparameter $\\sigma$ are all unknown. 
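The sample mean and sample covariance estimators of this subsection can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions: the function name estimate_prior is hypothetical, the observation matrix Y is assumed to be fully observed (already completed) and is passed as a list of N rows with M entries each, and no matrix completion step is included.

```python
def estimate_prior(Y):
    # Y: list of N rows (one per training function), each with M observed values.
    n, m = len(Y), len(Y[0])
    # Sample mean over training functions: (1/N) * Y^T 1_N.
    mu_hat = [sum(row[j] for row in Y) / n for j in range(m)]
    # Unbiased sample covariance: (1/(N-1)) * (Y - 1 mu^T)^T (Y - 1 mu^T).
    k_hat = [
        [
            sum((row[a] - mu_hat[a]) * (row[b] - mu_hat[b]) for row in Y) / (n - 1)
            for b in range(m)
        ]
        for a in range(m)
    ]
    return mu_hat, k_hat


# Toy usage with N = 3 training functions observed on M = 2 inputs.
mu_hat, k_hat = estimate_prior([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
```

Because the training observations are noisy, the returned covariance estimates the kernel plus the noise variance on the diagonal, which is why the distribution statements above involve k(X) plus sigma squared times the identity.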
How can we estimate $\\mu_t$ and $k_t$ without knowing those prior parameters?\nWe introduce the following unbiased estimators for the posterior mean and covariance,\n\n$$\\hat{\\mu}_t(x) = \\hat{\\mu}(x) + \\hat{k}(x, \\mathbf{x}_t) \\hat{k}(\\mathbf{x}_t, \\mathbf{x}_t)^{-1} (\\mathbf{y}_t - \\hat{\\mu}(\\mathbf{x}_t)), \\quad \\forall x \\in X, \\tag{3}$$\n$$\\hat{k}_t(x, x') = \\frac{N-1}{N-t-1} \\left( \\hat{k}(x, x') - \\hat{k}(x, \\mathbf{x}_t) \\hat{k}(\\mathbf{x}_t, \\mathbf{x}_t)^{-1} \\hat{k}(\\mathbf{x}_t, x') \\right), \\quad \\forall x, x' \\in X. \\tag{4}$$\n\nNotice that unlike Eq. (1) and Eq. (2), our estimators $\\hat{\\mu}_t$ and $\\hat{k}_t$ do not depend on any unknown values\nor an additional estimate of the noise parameter $\\sigma$. In Lemma 1, we show that our estimators are\nindeed unbiased and we derive their concentration bounds.\nLemma 1. Pick probability $\\delta \\in (0, 1)$. For any nonnegative integer $t < T$, conditioned on\nthe observations $D_t = \\{(x_\\tau, y_\\tau)\\}_{\\tau=1}^{t}$, the estimators in Eq. (3) and Eq. (4) satisfy $E[\\hat{\\mu}_t(X)] =\n\\mu_t(X)$, $E[\\hat{k}_t(X)] = k_t(X) + \\sigma^2 I$. Moreover, if the size of the training dataset satisfies $N \\geq T + 2$,\nthen for any input $x \\in X$, with probability at least $1 - \\delta$, both\n\n$$|\\hat{\\mu}_t(x) - \\mu_t(x)|^2 < a_t (k_t(x) + \\sigma^2) \\quad \\text{and} \\quad 1 - 2\\sqrt{b_t} < \\hat{k}_t(x) / (k_t(x) + \\sigma^2) < 1 + 2\\sqrt{b_t} + 2 b_t$$\n\nhold, where $a_t = \\frac{4 \\left( N - 2 + t + 2 \\sqrt{t \\log(4/\\delta)} + 2 \\log(4/\\delta) \\right)}{\\delta N (N - t - 2)}$ and $b_t = \\frac{1}{N-t-1} \\log \\frac{4}{\\delta}$.\n\nRegret bounds We show a near-zero upper bound on the best-sample simple regret of meta BO\nwith GP-UCB and PI that uses specific parameter settings in Thm. 2. In particular, for both GP-UCB\nand PI, the regret bound converges to a residual whose scale depends on the noise level $\\sigma$ in the\nobservations.\nTheorem 2. 
Assume there exists a constant $c \\geq \\max_{x \\in X} k(x)$ and a training dataset is available\nwhose size is $N \\geq 4 \\log \\frac{6}{\\delta} + T + 2$. Then, with probability at least $1 - \\delta$, the best-sample simple\nregret in $T$ iterations of meta BO with special cases of either GP-UCB or PI satisfies\n\n$$r^{\\text{UCB}}_T < \\eta^{\\text{UCB}}_T(N) \\lambda_T, \\quad r^{\\text{PI}}_T < \\eta^{\\text{PI}}_T(N) \\lambda_T, \\quad \\lambda_T^2 = O(\\rho_T / T) + \\sigma^2,$$\n\nwhere $\\eta^{\\text{UCB}}_T(N) = (m + C_1) \\left( \\frac{\\sqrt{1+m}}{\\sqrt{1-m}} + 1 \\right)$, $\\eta^{\\text{PI}}_T(N) = (m + C_2) \\left( \\frac{\\sqrt{1+m}}{\\sqrt{1-m}} + 1 \\right) + C_3$, $m = O\\left( \\sqrt{\\frac{1}{N-T}} \\right)$,\n$C_1, C_2, C_3 > 0$ are constants, and $\\rho_T = \\max_{A \\subseteq X, |A| = T} \\frac{1}{2} \\log |I + \\sigma^{-2} k(A)|$.\n\nThis bound reflects how training instances $N$ and BO iterations $T$ affect the best-sample simple\nregret. The coefficients $\\eta^{\\text{UCB}}_T$ and $\\eta^{\\text{PI}}_T$ both converge to constants (more details in the appendix), with\ncomponents converging at rate $O(1/(N - T)^{1/2})$. The convergence of the shared term $\\lambda_T$ depends on\n$\\rho_T$, the maximum information gain between function $f$ and up to $T$ observations $\\mathbf{y}_T$. If, for example,\neach input lies in $\\mathbb{R}^d$ and $k(x, x') = x^{\\mathrm{T}} x'$, then $\\rho_T = O(d \\log(T))$ [51], in which case $\\lambda_T$\nconverges to the observational noise level $\\sigma$ at rate $O\\left( \\sqrt{\\frac{d \\log(T)}{T}} \\right)$. Together, the bounds indicate\nthat the best-sample simple regret of both our settings of GP-UCB and PI decreases to a constant\nproportional to noise level $\\sigma$.\n\n4.2 X \u2282 Rd is compact\n\nFor compact $X \\subset \\mathbb{R}^d$, we consider the primal form of GPs. We further assume that there exist basis\nfunctions $\\Phi = [\\phi_s]_{s=1}^{K} : X \\to \\mathbb{R}^K$, mean parameter $u \\in \\mathbb{R}^K$ and covariance parameter $\\Sigma \\in \\mathbb{R}^{K \\times K}$\nsuch that $\\mu(x) = \\Phi(x)^{\\mathrm{T}} u$ and $k(x, x') = \\Phi(x)^{\\mathrm{T}} \\Sigma \\Phi(x')$. 
Notice that $\\Phi(x) \\in \\mathbb{R}^K$ is a column vector\nand $\\Phi(\\mathbf{x}_t) \\in \\mathbb{R}^{K \\times t}$ for any $\\mathbf{x}_t = [x_\\tau]_{\\tau=1}^{t}$. This means, for any input $x \\in X$, the observation satisfies\n$y \\sim \\mathcal{N}(f(x), \\sigma^2)$, where $f = \\Phi(x)^{\\mathrm{T}} W \\sim GP(\\mu, k)$ and the linear operator $W \\sim \\mathcal{N}(u, \\Sigma)$ [39]. In\nthe following analyses, we assume the basis functions $\\Phi$ are given.\nWe assume that a training dataset $\\bar{D}_N = \\{[(\\bar{x}_j, \\bar{y}_{ij})]_{j=1}^{M}\\}_{i=1}^{N}$ is given, where $\\bar{x}_j \\in X \\subset \\mathbb{R}^d$, $\\bar{y}_{ij}$ are\nindependently drawn from $\\mathcal{N}(f_i(\\bar{x}_j), \\sigma^2)$, $f_i$ are drawn independently from $GP(\\mu, k)$ and $M \\geq K$.\nEstimating GP parameters Because the basis functions $\\Phi$ are given, learning the mean function\n$\\mu$ and the kernel $k$ in the GP is equivalent to learning the mean parameter $u$ and the covariance\nparameter $\\Sigma$ that parameterize the distribution of the linear operator $W$. Notice that $\\forall i \\in [N]$,\n\n$$\\bar{\\mathbf{y}}_i = \\Phi(\\bar{\\mathbf{x}})^{\\mathrm{T}} W_i + \\bar{\\epsilon}_i \\sim \\mathcal{N}(\\Phi(\\bar{\\mathbf{x}})^{\\mathrm{T}} u, \\Phi(\\bar{\\mathbf{x}})^{\\mathrm{T}} \\Sigma \\Phi(\\bar{\\mathbf{x}}) + \\sigma^2 I),$$\n\nwhere $\\bar{\\mathbf{y}}_i = [\\bar{y}_{ij}]_{j=1}^{M} \\in \\mathbb{R}^M$, $\\bar{\\mathbf{x}} = [\\bar{x}_j]_{j=1}^{M} \\in \\mathbb{R}^{M \\times d}$ and $\\bar{\\epsilon}_i = [\\bar{\\epsilon}_{ij}]_{j=1}^{M} \\in \\mathbb{R}^M$. If the matrix\n$\\Phi(\\bar{\\mathbf{x}}) \\in \\mathbb{R}^{K \\times M}$ has linearly independent rows, one unbiased estimator of $W_i$ is\n\n$$\\hat{W}_i = (\\Phi(\\bar{\\mathbf{x}})^{\\mathrm{T}})^{+} \\bar{\\mathbf{y}}_i = (\\Phi(\\bar{\\mathbf{x}}) \\Phi(\\bar{\\mathbf{x}})^{\\mathrm{T}})^{-1} \\Phi(\\bar{\\mathbf{x}}) \\bar{\\mathbf{y}}_i \\sim \\mathcal{N}(u, \\Sigma + \\sigma^2 (\\Phi(\\bar{\\mathbf{x}}) \\Phi(\\bar{\\mathbf{x}})^{\\mathrm{T}})^{-1}).$$\n\nLet $\\mathbf{W} = [\\hat{W}_i]_{i=1}^{N} \\in \\mathbb{R}^{N \\times K}$. We use the estimators $\\hat{u} = \\frac{1}{N} \\mathbf{W}^{\\mathrm{T}} 1_N$ and $\\hat{\\Sigma} = \\frac{1}{N-1} (\\mathbf{W} - 1_N \\hat{u}^{\\mathrm{T}})^{\\mathrm{T}} (\\mathbf{W} - 1_N \\hat{u}^{\\mathrm{T}})$\nto estimate the GP parameters. 
Again, $\\hat{u}$ and $\\hat{\\Sigma}$ are independent and\n\n$$\\hat{u} \\sim \\mathcal{N}\\left( u, \\frac{1}{N} (\\Sigma + \\sigma^2 (\\Phi(\\bar{\\mathbf{x}}) \\Phi(\\bar{\\mathbf{x}})^{\\mathrm{T}})^{-1}) \\right), \\quad \\hat{\\Sigma} \\sim \\mathcal{W}\\left( \\frac{1}{N-1} \\left( \\Sigma + \\sigma^2 (\\Phi(\\bar{\\mathbf{x}}) \\Phi(\\bar{\\mathbf{x}})^{\\mathrm{T}})^{-1} \\right), N - 1 \\right) \\quad [1].$$\n\nConstructing estimators of the posterior We assume the total number of evaluations $T < K$.\nGiven noisy observations $D_t = \\{(x_\\tau, y_\\tau)\\}_{\\tau=1}^{t}$, we have $\\mu_t(x) = \\Phi(x)^{\\mathrm{T}} u_t$ and $k_t(x, x') =\n\\Phi(x)^{\\mathrm{T}} \\Sigma_t \\Phi(x')$, where the posterior of $W \\sim \\mathcal{N}(u_t, \\Sigma_t)$ satisfies\n\n$$u_t = u + \\Sigma \\Phi(\\mathbf{x}_t) (\\Phi(\\mathbf{x}_t)^{\\mathrm{T}} \\Sigma \\Phi(\\mathbf{x}_t) + \\sigma^2 I)^{-1} (\\mathbf{y}_t - \\Phi(\\mathbf{x}_t)^{\\mathrm{T}} u), \\tag{5}$$\n$$\\Sigma_t = \\Sigma - \\Sigma \\Phi(\\mathbf{x}_t) (\\Phi(\\mathbf{x}_t)^{\\mathrm{T}} \\Sigma \\Phi(\\mathbf{x}_t) + \\sigma^2 I)^{-1} \\Phi(\\mathbf{x}_t)^{\\mathrm{T}} \\Sigma. \\tag{6}$$\n\nSimilar to the strategy used in Sec. 4.1, we construct an estimator for the posterior of $W$ to be\n\n$$\\hat{u}_t = \\hat{u} + \\hat{\\Sigma} \\Phi(\\mathbf{x}_t) (\\Phi(\\mathbf{x}_t)^{\\mathrm{T}} \\hat{\\Sigma} \\Phi(\\mathbf{x}_t))^{-1} (\\mathbf{y}_t - \\Phi(\\mathbf{x}_t)^{\\mathrm{T}} \\hat{u}), \\tag{7}$$\n$$\\hat{\\Sigma}_t = \\frac{N-1}{N-t-1} \\left( \\hat{\\Sigma} - \\hat{\\Sigma} \\Phi(\\mathbf{x}_t) (\\Phi(\\mathbf{x}_t)^{\\mathrm{T}} \\hat{\\Sigma} \\Phi(\\mathbf{x}_t))^{-1} \\Phi(\\mathbf{x}_t)^{\\mathrm{T}} \\hat{\\Sigma} \\right). \\tag{8}$$\n\nWe can compute the conditional mean and variance of the observation on $x \\in X$ to be\n$\\hat{\\mu}_t(x) = \\Phi(x)^{\\mathrm{T}} \\hat{u}_t$ and $\\hat{k}_t(x) = \\Phi(x)^{\\mathrm{T}} \\hat{\\Sigma}_t \\Phi(x)$. For convenience of notation, we define $\\bar{\\sigma}^2(x) =\n\\sigma^2 \\Phi(x)^{\\mathrm{T}} (\\Phi(\\bar{\\mathbf{x}}) \\Phi(\\bar{\\mathbf{x}})^{\\mathrm{T}})^{-1} \\Phi(x)$.\n\nLemma 3. Pick probability $\\delta \\in (0, 1)$. Assume $\\Phi(\\bar{\\mathbf{x}})$ has full row rank. For any nonnegative integer\n$t < T$, $T \\leq K$, conditioned on the observations $D_t = \\{(x_\\tau, y_\\tau)\\}_{\\tau=1}^{t}$, $E[\\hat{\\mu}_t(x)] = \\mu_t(x)$, $E[\\hat{k}_t(x)] =\nk_t(x) + \\bar{\\sigma}^2(x)$. 
Moreover, if the size of the training dataset satisfies $N \\geq T + 2$, then for any input\n$x \\in X$, with probability at least $1 - \\delta$, both\n\n$$|\\hat{\\mu}_t(x) - \\mu_t(x)|^2 < a_t (k_t(x) + \\bar{\\sigma}^2(x)) \\quad \\text{and} \\quad 1 - 2\\sqrt{b_t} < \\hat{k}_t(x) / (k_t(x) + \\bar{\\sigma}^2(x)) < 1 + 2\\sqrt{b_t} + 2 b_t$$\n\nhold, where $a_t = \\frac{4 \\left( N - 2 + t + 2 \\sqrt{t \\log(4/\\delta)} + 2 \\log(4/\\delta) \\right)}{\\delta N (N - t - 2)}$ and $b_t = \\frac{1}{N-t-1} \\log \\frac{4}{\\delta}$.\n\nRegret bounds Similar to the finite $X$ case, we can also show a near-zero regret bound for compact\n$X \\subset \\mathbb{R}^d$. The following theorem clarifies our results. The convergence rates are the same as Thm. 2.\nNote that $\\lambda_T^2$ converges to $\\bar{\\sigma}^2(\\cdot)$ instead of $\\sigma^2$ in Thm. 2 and $\\bar{\\sigma}^2(\\cdot)$ is proportional to $\\sigma^2$.\nTheorem 4. Assume all the assumptions in Thm. 2 and that $\\Phi(\\bar{\\mathbf{x}})$ has full row rank. With probability\nat least $1 - \\delta$, the best-sample simple regret in $T$ iterations of meta BO with either GP-UCB or PI\nsatisfies\n\n$$r^{\\text{UCB}}_T < \\eta^{\\text{UCB}}_T(N) \\lambda_T, \\quad r^{\\text{PI}}_T < \\eta^{\\text{PI}}_T(N) \\lambda_T, \\quad \\lambda_T^2 = O(\\rho_T / T) + \\bar{\\sigma}(x_\\tau)^2,$$\n\nwhere $\\eta^{\\text{UCB}}_T(N) = (m + C_1) \\left( \\frac{\\sqrt{1+m}}{\\sqrt{1-m}} + 1 \\right)$, $\\eta^{\\text{PI}}_T(N) = (m + C_2) \\left( \\frac{\\sqrt{1+m}}{\\sqrt{1-m}} + 1 \\right) + C_3$, $m = O\\left( \\sqrt{\\frac{1}{N-T}} \\right)$,\n$C_1, C_2, C_3 > 0$ are constants, $\\tau = \\arg\\min_{t \\in [T]} k_{t-1}(x_t)$ and $\\rho_T = \\max_{A \\subseteq X, |A| = T} \\frac{1}{2} \\log |I + \\sigma^{-2} k(A)|$.\n\n4.3 Bounding the simple regret by the best-sample simple regret\n\nOnce we have the observations $D_T = \\{(x_t, y_t)\\}_{t=1}^{T}$, we can infer where the arg max of the function\nis. For all the cases in which $X$ is discrete or compact and the acquisition function is GP-UCB or PI,\nwe choose the inferred arg max to be $\\hat{x}^*_T = x_\\tau$ where $\\tau = \\arg\\max_{t \\in [T]} y_t$. 
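To make the two evaluation criteria concrete, here is a small Python sketch of the inference rule just described and of the two regrets from Sec. 3. The helper names (infer_maximizer, regrets) are assumptions made for this illustration. Computing either regret requires the true function f, so this is only possible in simulation; the toy objective and the noisy observations below are made up for the example.

```python
def infer_maximizer(xs, ys):
    # Return the queried input whose noisy observation is largest.
    tau = max(range(len(ys)), key=lambda t: ys[t])
    return xs[tau]


def regrets(f, domain, xs, ys):
    # f: true objective (known only in simulation); domain: finite iterable of inputs.
    f_max = max(f(x) for x in domain)
    best_sample = f_max - max(f(x) for x in xs)   # best-sample simple regret r_T
    simple = f_max - f(infer_maximizer(xs, ys))   # simple regret R_T
    return best_sample, simple


# Toy example: the true maximizer x = 2 was queried, but noise makes x = 1 look best.
toy_f = lambda x: -(x - 2) ** 2
r_T, R_T = regrets(toy_f, range(5), [0, 1, 2], [-4.0, 0.3, -0.2])
```

In this toy run the best-sample simple regret is 0 while the simple regret is 1, which shows how observation noise alone can separate the two quantities.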
We show in Lemma 5 that, with high probability, the difference between the simple regret $R_T$ and the best-sample simple regret $r_T$ is at most proportional to the observation noise $\\sigma$.\nLemma 5. With probability at least $1 - \\delta$, $R_T \\leq r_T + 2\\left(2 \\log \\frac{1}{\\delta}\\right)^{\\frac{1}{2}} \\sigma$.\nTogether with the bounds on the best-sample simple regret from Thm. 2 and Thm. 4, our result shows that, with high probability, the simple regret decreases to a constant proportional to the noise level $\\sigma$ as the number of iterations and the number of training functions increase.\n\n5 Experiments\n\nWe evaluate our algorithm on four different black-box function optimization problems, involving discrete or continuous function domains. One problem is optimizing a synthetic function in $\\mathbb{R}^2$, and the rest are optimizing decision variables in robotic task and motion planning problems that were used in [27] (footnote 3).\nAt a high level, our task and motion planning benchmarks involve computing kinematically feasible collision-free motions for picking and placing objects in a scene cluttered with obstacles. This problem has a similar setup to experimental design: the robot can \u201cexperiment\u201d by assigning values to decision variables including grasps, base poses, and object placements until it finds a feasible plan. Given the assigned values for these variables, the robot program makes a call to a planner (footnote 4) which then attempts to find a sequence of motions that achieve these grasps and placements. We score the variable assignment based on the results of planning, assigning a very low score if the problem was infeasible and otherwise scoring based on plan length or obstacle clearance. An example problem is given in Figure 2.\n\nFigure 2: Two instances of a picking problem. A problem instance is defined by the arrangement and number of obstacles, which vary randomly across different instances. The objective is to select a grasp that can pick the blue box, marked with a circle, without violating kinematic and collision constraints. [27].\n\nFootnote 3: Our code is available at https://github.com/beomjoonkim/MetaLearnBO.\n\nFigure 3: Learning curves (top) and rewards vs number of iterations (bottom) for optimizing synthetic functions sampled from a GP and two scoring functions from.\n\nPlanning problem instances are characterized by arrangements of obstacles in the scene and the shape of the target object to be manipulated, and each problem instance defines a different score function. Our objective is to optimize the score function for a new problem instance, given sets of decision-variable and score pairs from a set of previous planning problem instances as training data.\nIn the two robotics domains, we discretize the original function domain using samples from the past planning experience, by extracting the values of the decision variables and their scores from successful plans. This is inspired by the previous successful use of BO in a discretized domain [9] to efficiently solve an adaptive locomotion problem.\nWe compare our approach, called point estimate meta Bayesian optimization (PEM-BO), to three baseline methods. The first is a plain Bayesian optimization method that uses a kernel function to represent the covariance matrix, which we call Plain. Plain optimizes its GP hyperparameters by maximizing the data likelihood. The second is a transfer learning sequential model-based optimization [57] method that, like PEM-BO, uses past function evaluations, but assumes that functions sampled from the same GP have similar response surface values. We call this method TLSM-BO. The third is random selection, which we call Random. 
We present the results on the UCB acquisition function in the paper; results on the PI acquisition function are available in the appendix.\nIn all domains, we use the $\\zeta_t$ value as specified in Sec. 4. For continuous domains, we use $\\Phi(x) = [\\cos(x^T \\beta^{(i)} + \\beta_0^{(i)})]_{i=1}^K$ as our basis functions. In order to train the weights $W_i$, $\\beta^{(i)}$, and $\\beta_0^{(i)}$, we represent the function $\\Phi(x)^T W_i$ with a 1-hidden-layer neural network with cosine activation function and a linear output layer with function-specific weights $W_i$. We then train this network on the entire dataset $\\bar{D}_N$. Then, fixing $\\Phi(x)$, for each set of pairs $(\\bar{y}_i, \\bar{x}_i)$, $i = 1, \\cdots, N$, we analytically solve the linear regression problem $y_i \\approx \\Phi(x_i)^T W_i$ as described in Sec. 4.2.\nOptimizing a continuous synthetic function In this problem, the objective is to optimize a black-box function sampled from a GP, whose domain is $\\mathbb{R}^2$, given a set of evaluations of different functions from the same GP. Specifically, we consider a GP with a squared exponential kernel function. The purpose of this problem is to show that PEM-BO, which estimates the mean and covariance matrix based on $\\bar{D}_N$, would perform similarly to BO methods that start with an appropriate prior. We have training data from N = 100 functions with M = 1000 sample points each.\n\nFootnote 4: We use Rapidly-exploring random tree (RRT) [32] with a predefined random seed, but other choices are possible.\n\n[Figure 3 plots, panels (a)(b)(c)(d)(e): x-axes Proportion of training dataset and Number of evaluations; y-axis Rewards; curves Random, Plain-UCB, PEM-BO-UCB, TLSM-BO-UCB.]\n\nFigure 3(a) shows the learning curve when we use different portions of the data. The x-axis represents the percentage of the dataset used to train the basis functions, u, and W from the training dataset, and the y-axis represents the best function value found after 10 evaluations on a new function. We can see that even with just ten percent of the training data points, PEM-BO performs just as well as Plain, which uses the appropriate kernel for this particular problem. Compared to PEM-BO, which can efficiently use all of the dataset, we had to limit the number of training data points for TLSM-BO to 1000, because even performing inference requires O(NM) time. This leads to its noticeably worse performance than Plain and PEM-BO.\nFigure 3(d) shows how $\\max_{t \\in [T]} y_t$ evolves, where $T \\in [1, 100]$. As we can see, PEM-BO using the UCB acquisition function performs similarly to Plain with the same acquisition function. 
TLSM-BO again suffers because we had to limit the number of training data points.\nOptimizing a grasp In the robot-planning problem shown in Figure 2, the robot has to choose a grasp for picking the target object in a cluttered scene. A planning problem instance is defined by the poses of obstacles and the target objects, which change the feasibility of a grasp across different instances.\nThe reward function is the negative of the length of the picking motion if the motion is feasible, and $-k \\in \\mathbb{R}$ otherwise, where $-k$ is suitably lower than the negated lengths of possible trajectories. We construct the discrete set of grasps by using grasps that worked in the past planning problem instances. The original space of grasps is $\\mathbb{R}^{58}$, which describes position, direction, roll, and depth of a robot gripper with respect to the object, as used in [10]. For both Plain and TLSM-BO, we use a squared exponential kernel function on this original grasp space to represent the covariance matrix. We note that this is a poor choice of kernel, because the grasp space includes angles, making it a non-vector space. These methods also choose a grasp from the discrete set. We train on a dataset with N = 1800 previous problems, and let M = 162.\nFigure 3(b) shows the learning curve with T = 5. The x-axis is the percentage of the dataset used for training, ranging from one percent to ten percent. Initially, when we use just one percent of the training data points, PEM-BO performs as poorly as TLSM-BO, which, again, had only 1000 training data points. However, PEM-BO outperforms both TLSM-BO and Plain after that. The main reason that PEM-BO outperforms these approaches is that their prior, which is defined by the squared exponential kernel, is not suitable for this problem. 
PEM-BO, on the other hand, was able to avoid this problem by estimating a distribution over values at the discrete sample points that commits only to their joint normality, but not to any metric on the underlying space. These trends are also shown in Figure 3(e), where we plot $\\max_{t \\in [T]} y_t$ for $T \\in [1, 100]$. PEM-BO outperforms the baselines significantly.\nOptimizing a grasp, base pose, and placement We now consider a more difficult task that involves both picking and placing objects in a cluttered scene. A planning problem instance is defined by the poses of obstacles and the poses and shapes of the target object to be picked and placed. The reward function is again the negative of the length of the picking motion if the motion is feasible, and $-k \\in \\mathbb{R}$ otherwise. For both Plain and TLSM-BO, we use three different squared exponential kernels on the original spaces of grasp, base pose, and object placement pose, respectively, and then add them together to define the kernel for the whole set. For this domain, N = 1500 and M = 1000.\nFigure 3(c) shows the learning curve when T = 5. The x-axis is the percentage of the dataset used for training, ranging from one percent to ten percent. Initially, when we use just one percent of the training data points, PEM-BO does not perform well. Similar to the previous domain, it then significantly outperforms both TLSM-BO and Plain once the amount of training data increases. This is also reflected in Figure 3(f), where we plot $\\max_{t \\in [T]} y_t$ for $T \\in [1, 100]$. PEM-BO outperforms the baselines. Notice that Plain and TLSM-BO perform worse than Random, as a result of making inappropriate assumptions on the form of the kernel.\n\n6 Conclusion\n\nWe proposed a new framework for meta BO that estimates its Gaussian process prior based on past experience with functions sampled from the same prior. 
We established regret bounds for our approach without the reliance on a known prior and showed its good performance on task and motion planning benchmark problems.\n\nAcknowledgments\n\nWe would like to thank Stefanie Jegelka, Tamara Broderick, Trevor Campbell, and Tom\u00e1s Lozano-P\u00e9rez for discussions and comments. We would like to thank Sungkyu Jung and Brian Axelrod for discussions on Wishart distributions. We gratefully acknowledge support from NSF grants 1420316, 1523767 and 1723381, from AFOSR grant FA9550-17-1-0165, from Honda Research and Draper Laboratory. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors.\n\nReferences\n[1] Theodore Wilbur Anderson. An Introduction to Multivariate Statistical Analysis. Wiley New York, 1958.\n\n[2] Peter Auer. Using confidence bounds for exploitation-exploration tradeoffs. JMLR, 3:397\u2013422, 2002.\n\n[3] R\u00e9mi Bardenet, M\u00e1ty\u00e1s Brendel, Bal\u00e1zs K\u00e9gl, and Michele Sebag. Collaborative hyperparameter tuning. In ICML, 2013.\n\n[4] J Baxter. A Bayesian/information theoretic model of bias learning. In COLT, New York, New York, USA, 1996.\n\n[5] Ilija Bogunovic, Jonathan Scarlett, Andreas Krause, and Volkan Cevher. Truncated variance reduction: A unified approach to Bayesian optimization and level-set estimation. In NIPS, 2016.\n\n[6] Pavel Brazdil, Jo\u00e3o Gama, and Bob Henery. Characterizing the applicability of classification algorithms using meta-level learning. In ECML, 1994.\n\n[7] Emmanuel J Cand\u00e8s and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.\n\n[8] Yutian Chen, Matthew W Hoffman, Sergio G\u00f3mez Colmenarejo, Misha Denil, Timothy P Lillicrap, Matt Botvinick, and Nando de Freitas. 
Learning to learn without gradient descent by gradient descent. In ICML, 2017.\n\n[9] A. Cully, J. Clune, D. Tarapore, and J. Mouret. Robots that adapt like animals. Nature, 2015.\n\n[10] R. Diankov. Automated Construction of Robotic Manipulation Programs. PhD thesis, CMU Robotics Institute, August 2010.\n\n[11] David K Duvenaud, Hannes Nickisch, and Carl E Rasmussen. Additive Gaussian processes. In NIPS, 2011.\n\n[12] M. L. Eaton. Multivariate Statistics: A Vector Space Approach. Beachwood, Ohio, USA: Institute of Mathematical Statistics, 2007.\n\n[13] Bradley Efron. Bayes, oracle Bayes, and empirical Bayes. 2017.\n\n[14] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning. In NIPS, 2015.\n\n[15] Matthias Feurer, Benjamin Letham, and Eytan Bakshy. Scalable meta-learning for Bayesian optimization. arXiv preprint arXiv:1802.02219, 2018.\n\n[16] Matthias Feurer, Jost Springenberg, and Frank Hutter. Initializing Bayesian hyperparameter optimization via meta-learning. In AAAI, 2015.\n\n[17] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.\n\n[18] Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Elliot Karro, and D. Sculley. Google Vizier: A service for black-box optimization. In KDD, 2017.\n\n[19] Philipp Hennig and Christian J Schuler. Entropy search for information-efficient global optimization. JMLR, 13:1809\u20131837, 2012.\n\n[20] Jos\u00e9 Miguel Hern\u00e1ndez-Lobato, Matthew W Hoffman, and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In NIPS, 2014.\n\n[21] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735\u20131780, 1997.\n\n[22] Christian Igel and Marc Toussaint. 
A no-free-lunch theorem for non-uniform distributions of target functions. Journal of Mathematical Modelling and Algorithms, 3(4):313\u2013322, 2005.\n\n[23] Kirthevasan Kandasamy, Willie Neiswanger, Jeff Schneider, Barnabas Poczos, and Eric Xing. Neural architecture search with Bayesian optimisation and optimal transport. arXiv preprint arXiv:1802.07191, 2018.\n\n[24] Kirthevasan Kandasamy, Jeff Schneider, and Barnabas Poczos. High dimensional Bayesian optimisation and bandits via additive models. In ICML, 2015.\n\n[25] Kenji Kawaguchi, Bo Xie, Vikas Verma, and Le Song. Deep semi-random features for nonlinear function approximation. In AAAI, 2017.\n\n[26] Robert W Keener. Theoretical Statistics: Topics for a Core Course. Springer, 2011.\n\n[27] Beomjoon Kim, Leslie Pack Kaelbling, and Tom\u00e1s Lozano-P\u00e9rez. Learning to guide task and motion planning using score-space representation. In ICRA, 2017.\n\n[28] Andreas Krause and Cheng S Ong. Contextual Gaussian process bandit optimization. In NIPS, 2011.\n\n[29] Harold J Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Fluids Engineering, 86(1):97\u2013106, 1964.\n\n[30] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In NIPS, 2017.\n\n[31] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302\u20131338, 2000.\n\n[32] Steven M LaValle and James J Kuffner Jr. Rapidly-exploring random trees: Progress and prospects. In Workshop on the Algorithmic Foundations of Robotics (WAFR), 2000.\n\n[33] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. 
In International Conference on Learning Representations (ICLR), 2016.\n\n[34] Karim Lounici et al. High-dimensional covariance matrix estimation with missing observations. Bernoulli, 20(3):1029\u20131058, 2014.\n\n[35] Gustavo Malkomes and Roman Garnett. Towards automated Bayesian optimization. In ICML AutoML Workshop, 2017.\n\n[36] Gustavo Malkomes, Charles Schaff, and Roman Garnett. Bayesian optimization for automated model selection. In NIPS, 2016.\n\n[37] T P Minka and R W Picard. Learning how to learn is learning with point sets. Technical report, MIT Media Lab, 1997.\n\n[38] J. Mo\u010dkus. On Bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, 1974.\n\n[39] R.M. Neal. Bayesian Learning for Neural Networks. Lecture Notes in Statistics 118. Springer, 1996.\n\n[40] Sonia Petrone, Judith Rousseau, and Catia Scricciolo. Bayes and empirical Bayes: do they merge? Biometrika, 101(2):285\u2013302, 2014.\n\n[41] John C Platt, Christopher JC Burges, Steven Swenson, Christopher Weare, and Alice Zheng. Learning a Gaussian process prior for automatically generating music playlists. In NIPS, 2002.\n\n[42] Matthias Poloczek, Jialei Wang, and Peter Frazier. Multi-information source optimization. In NIPS, 2017.\n\n[43] Matthias Poloczek, Jialei Wang, and Peter I Frazier. Warm starting Bayesian optimization. In Winter Simulation Conference (WSC). IEEE, 2016.\n\n[44] Carl Edward Rasmussen and Christopher KI Williams. Gaussian processes for machine learning. The MIT Press, 2006.\n\n[45] Herbert Robbins. An empirical Bayes approach to statistics. In Third Berkeley Symp. Math. Statist. Probab., 1956.\n\n[46] J Schmidhuber. On learning how to learn learning strategies. Technical report, FKI-198-94 (revised), 1995.\n\n[47] Alistair Shilton, Sunil Gupta, Santu Rana, and Svetha Venkatesh. Regret bounds for transfer learning in Bayesian optimisation. In AISTATS, 2017.\n\n[48] Minoru Siotani. 
Tolerance regions for a multivariate normal population. Annals of the Institute of Statistical Mathematics, 16(1):135\u2013153, 1964.\n\n[49] Suzanne Sniekers, Aad van der Vaart, et al. Adaptive Bayesian credible sets in regression with a Gaussian process prior. Electronic Journal of Statistics, 9(2):2475\u20132527, 2015.\n\n[50] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In NIPS, 2012.\n\n[51] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In ICML, 2010.\n\n[52] Kevin Swersky, Jasper Snoek, and Ryan P Adams. Multi-task Bayesian optimization. In NIPS, 2013.\n\n[53] Zi Wang and Stefanie Jegelka. Max-value entropy search for efficient Bayesian optimization. In ICML, 2017.\n\n[54] Ziyu Wang and Nando de Freitas. Theoretical analysis of Bayesian optimisation with unknown Gaussian process hyper-parameters. In NIPS workshop on Bayesian Optimization, 2014.\n\n[55] Eric W. Weisstein. Square root inequality. MathWorld\u2013A Wolfram Web Resource. http://mathworld.wolfram.com/SquareRootInequality.html, 1999-2018.\n\n[56] David H Wolpert and William G Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67\u201382, 1997.\n\n[57] Dani Yogatama and Gideon Mann. Efficient transfer learning method for automatic hyperparameter tuning. In AISTATS, 2014.", "award": [], "sourceid": 6720, "authors": [{"given_name": "Zi", "family_name": "Wang", "institution": "MIT"}, {"given_name": "Beomjoon", "family_name": "Kim", "institution": "MIT"}, {"given_name": "Leslie", "family_name": "Kaelbling", "institution": "MIT"}]}