{"title": "Gaussian Process Bandit Optimisation with Multi-fidelity Evaluations", "book": "Advances in Neural Information Processing Systems", "page_first": 992, "page_last": 1000, "abstract": "In many scientific and engineering applications, we are tasked with the optimisation of an expensive to evaluate black box function $f$. Traditional methods for this problem assume access to only this single function. However, in many cases, cheap approximations to $f$ may be obtainable. For example, the expensive real world behaviour of a robot can be approximated by a cheap computer simulation. We can use these approximations to eliminate low function value regions cheaply, use the expensive evaluations of $f$ in a small but promising region, and speedily identify the optimum. We formalise this task as a \\emph{multi-fidelity} bandit problem where the target function and its approximations are sampled from a Gaussian process. We develop MF-GP-UCB, a novel method based on upper confidence bound techniques. In our theoretical analysis we demonstrate that it exhibits precisely the above behaviour, and achieves better regret than strategies which ignore multi-fidelity information. MF-GP-UCB outperforms such naive strategies and other multi-fidelity methods on several synthetic and real experiments.", "full_text": "Gaussian Process Bandit Optimisation with Multi-fidelity Evaluations

Kirthevasan Kandasamy ♮, Gautam Dasarathy ♦, Junier Oliva ♮, Jeff Schneider ♮, Barnabás Póczos ♮
♮ Carnegie Mellon University, ♦ Rice University
{kandasamy, joliva, schneide, bapoczos}@cs.cmu.edu, gautamd@rice.edu

Abstract

In many scientific and engineering applications, we are tasked with the optimisation of an expensive to evaluate black box function f. Traditional methods for this problem assume access to only this single function.
However, in many cases, cheap approximations to f may be obtainable. For example, the expensive real world behaviour of a robot can be approximated by a cheap computer simulation. We can use these approximations to eliminate low function value regions cheaply, use the expensive evaluations of f in a small but promising region, and speedily identify the optimum. We formalise this task as a multi-fidelity bandit problem where the target function and its approximations are sampled from a Gaussian process. We develop MF-GP-UCB, a novel method based on upper confidence bound techniques. In our theoretical analysis we demonstrate that it exhibits precisely the above behaviour, and achieves better regret than strategies which ignore multi-fidelity information. MF-GP-UCB outperforms such naive strategies and other multi-fidelity methods on several synthetic and real experiments.

1 Introduction

In stochastic bandit optimisation, we wish to optimise a payoff function f : X → R by sequentially querying it and obtaining bandit feedback, i.e. when we query at any x ∈ X, we observe a possibly noisy evaluation of f(x). f is typically expensive, and the goal is to identify its maximum while keeping the number of queries as low as possible. Applications include hyper-parameter tuning of expensive machine learning algorithms, optimal policy search in complex systems, and scientific experiments [20, 23, 27]. Historically, bandit problems were studied in settings where the goal is to maximise the cumulative reward of all queries to the payoff, instead of just finding the maximum. Applications in this setting include clinical trials and online advertising.

Conventional methods in these settings assume access to only this single expensive function of interest f. We will collectively refer to them as single fidelity methods. In many practical problems, however, cheap approximations to f might be available.
For instance, when tuning hyper-parameters of learning algorithms, the goal is to maximise a cross validation (CV) score on a training set, which can be expensive if the training set is large. However, CV curves tend to vary smoothly with training set size; therefore, we can train and cross validate on small subsets to approximate the CV accuracies on the entire dataset. For a concrete example, consider kernel density estimation (KDE), where we need to tune the bandwidth h of a kernel. Figure 1 shows the CV likelihood against h for a dataset of size n = 3000 and a smaller subset of size n = 300. The two maximisers are different, which is to be expected since optimal hyper-parameters are functions of the training set size. That said, the curve for n = 300 approximates the n = 3000 curve quite well. Since training/CV on small n is cheap, we can use it to eliminate bad values of the hyper-parameters and reserve the expensive experiments with the entire dataset for the promising candidates (e.g. the boxed region in Fig. 1).

In online advertising, the goal is to maximise the cumulative number of clicks over a given period. In the conventional bandit treatment, each query to f is the display of an ad for a specific time, say one hour. However, we may display ads for shorter intervals, say a few minutes, to approximate hourly performance. The estimate is biased, as displaying an ad for a longer interval changes user behaviour, but will nonetheless be useful in gauging the long run click-through rate. In optimal policy search in robotics and automated driving, vastly cheaper computer simulations are used to approximate the expensive real world performance of the system.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
Scientific experiments can be approximated to varying degrees using less expensive data collection, analysis, and computational techniques.

In this paper, we cast these tasks as multi-fidelity bandit optimisation problems, assuming the availability of cheap approximate functions (fidelities) to the payoff f. Our contributions are:

1. We present a formalism for multi-fidelity bandit optimisation using Gaussian process (GP) assumptions on f and its approximations. We develop a novel algorithm, Multi-Fidelity Gaussian Process Upper Confidence Bound (MF-GP-UCB), for this setting.
2. Our theoretical analysis proves that MF-GP-UCB explores the space at lower fidelities and uses the high fidelities in successively smaller regions to zero in on the optimum. As lower fidelity queries are cheaper, MF-GP-UCB achieves better regret than single fidelity strategies.
3. Empirically, we demonstrate that MF-GP-UCB outperforms single fidelity methods on a series of synthetic examples, three hyper-parameter tuning tasks and one inference problem in astrophysics. Our MATLAB implementation and experiments are available at github.com/kirthevasank/mf-gp-ucb.

Related Work: Since the seminal work of Robbins [25], the multi-armed bandit problem has been studied extensively in the K-armed setting. Recently, there has been a surge of interest in the optimism under uncertainty principle for K-armed bandits, typified by upper confidence bound (UCB) methods [2, 4]. UCB strategies have also been used in bandit tasks with linear [6] and GP [28] payoffs. There is a plethora of work on single fidelity methods for global optimisation, with both noisy and noiseless evaluations. Some examples are branch and bound techniques such as dividing rectangles (DiRect) [12], simulated annealing, genetic algorithms and more [17, 18, 22].
A suite of single fidelity methods in the GP framework closely related to our work is Bayesian optimisation (BO). While there are several techniques for BO [13, 21, 30], of particular interest to us is the Gaussian process upper confidence bound (GP-UCB) algorithm of Srinivas et al. [28].

Many applied domains of research, such as aerodynamics, industrial design and hyper-parameter tuning, have studied multi-fidelity methods [9, 11, 19, 29]; a plurality of them use BO techniques. However, none of these treatments formalises or analyses any notion of regret in the multi-fidelity setting. In contrast, MF-GP-UCB is an intuitive UCB idea with good theoretical properties. Several works have analysed multi-fidelity methods in specific contexts such as hyper-parameter tuning, active learning and reinforcement learning [1, 5, 26, 33]. Their settings and assumptions are substantially different from ours. Critically, none of them are in the more difficult bandit setting, where there is a price for exploration. Due to space constraints, we discuss them in detail in Appendix A.3. The multi-fidelity setting poses substantially new theoretical and algorithmic challenges. We build on GP-UCB and our recent work on multi-fidelity bandits in the K-armed setting [16]. Section 2 presents our formalism, including a notion of regret for multi-fidelity GP bandits. Section 3 presents our algorithm. The theoretical analysis is in Appendix C, with a synopsis for the 2-fidelity case in Section 4. Section 6 presents our experiments. Appendix A.1 tabulates the notation used in the manuscript.

2 Preliminaries

We wish to maximise a payoff function f : X → R where X ≡ [0, r]^d. We can interact with f only by querying at some x ∈ X and obtaining a noisy observation y = f(x) + ε. Let x⋆ ∈ argmax_{x∈X} f(x) and f⋆ = f(x⋆). Let x_t ∈ X be the point queried at time t.
The goal of a bandit strategy is to maximise the sum of rewards Σ_{t=1}^n f(x_t), or equivalently minimise the cumulative regret Σ_{t=1}^n (f⋆ − f(x_t)) after n queries; i.e. we compete against an oracle which queries at x⋆ at all t.

Our primary distinction from the classical setting is that we have access to M − 1 successively accurate approximations f^(1), f^(2), ..., f^(M−1) to the payoff f = f^(M). We refer to these approximations as fidelities. We encode the fact that fidelity m approximates fidelity M via the assumption ‖f^(M) − f^(m)‖∞ ≤ ζ^(m), where ζ^(1) > ζ^(2) > ··· > ζ^(M) = 0. Each query at fidelity m expends a cost λ^(m) of a resource, e.g. computational effort or advertising time, where λ^(1) < λ^(2) < ··· < λ^(M). A strategy for multi-fidelity bandits is a sequence of query-fidelity pairs {(x_t, m_t)}_{t≥0}, where (x_n, m_n) could depend on the previous query-observation-fidelity tuples {(x_t, y_t, m_t)}_{t=1}^{n−1}. Here y_t = f^(m_t)(x_t) + ε. After n steps we will have queried any of the M fidelities multiple times.

Figure 1: Left: Average CV log likelihood on datasets of size 300, 3000 on a synthetic KDE task. The crosses are the maxima. Right: Illustration of GP-UCB at time t. The figure shows f(x) (solid black line), the UCB φ_t(x) (dashed blue line) and queries until t − 1 (black crosses). We query at x_t = argmax_{x∈X} φ_t(x) (red star).

Some smoothness assumptions on the f^(m)'s are needed to make the problem tractable. A standard in the Bayesian nonparametric literature is to use a Gaussian process (GP) prior [24] with covariance kernel κ. In this work we focus on the squared exponential (SE) kernel κ_{σ,h} and the Matérn kernel κ_{ν,h}, as they are popularly used in practice and their theoretical properties are well studied. Writing z = ‖x − x′‖₂, they are defined as

\[ \kappa_{\sigma,h}(x, x') = \sigma \exp\!\left(-\frac{z^2}{2h^2}\right), \qquad \kappa_{\nu,h}(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, z}{h}\right)^{\!\nu} B_\nu\!\left(\frac{\sqrt{2\nu}\, z}{h}\right), \]

where Γ, B_ν are the Gamma and modified Bessel functions. A convenience the GP framework offers is that posterior distributions are analytically tractable. If f ∼ GP(0, κ), and we have observations D_n = {(x_i, y_i)}_{i=1}^n, where y_i = f(x_i) + ε and ε ∼ N(0, η²) is Gaussian noise, the posterior distribution for f(x) | D_n is also Gaussian N(µ_n(x), σ_n²(x)) with

\[ \mu_n(x) = k^\top \Delta^{-1} Y, \qquad \sigma_n^2(x) = \kappa(x, x) - k^\top \Delta^{-1} k. \quad (1) \]

Here Y ∈ R^n with Y_i = y_i, k ∈ R^n with k_i = κ(x, x_i), and ∆ = K + η²I ∈ R^{n×n} where K_{i,j} = κ(x_i, x_j). In keeping with the above, we make the following assumptions on our problem.

Assumption 1. A1: The functions at all fidelities are sampled from GPs, f^(m) ∼ GP(0, κ) for all m = 1, ..., M. A2: ‖f^(M) − f^(m)‖∞ ≤ ζ^(m) for all m = 1, ..., M. A3: ‖f^(M)‖∞ ≤ B.

The purpose of A3 is primarily to define the regret. In Remark 7, Appendix A.4 we argue that these assumptions are probabilistically valid, i.e. the latter two events occur with nontrivial probability when we sample the f^(m)'s from a GP. So a generative mechanism would keep sampling the functions and deliver them when the conditions hold true. A point x ∈ X can be queried at any of the M fidelities.
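As a concrete illustration of the posterior formulas in (1), here is a minimal pure-Python sketch (our own toy example, not the authors' MATLAB implementation), using the SE kernel with invented hyper-parameters σ = 1, h = 0.5 and a small Gauss-Jordan solver in place of a linear algebra library.

```python
import math

def se_kernel(x, xp, sigma=1.0, h=0.5):
    """Squared exponential kernel kappa_{sigma,h}(x, x') for scalar inputs."""
    return sigma * math.exp(-(x - xp) ** 2 / (2 * h ** 2))

def solve(A, b):
    """Solve A v = b by Gauss-Jordan elimination (A is a small dense matrix)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c and M[r][c] != 0.0:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b_ for a, b_ in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def gp_posterior(x, X, Y, eta=0.1):
    """Posterior mean and variance of f(x) | D_n, as in Equation (1)."""
    K = [[se_kernel(xi, xj) + (eta ** 2 if i == j else 0.0)   # Delta = K + eta^2 I
          for j, xj in enumerate(X)] for i, xi in enumerate(X)]
    k = [se_kernel(x, xi) for xi in X]
    alpha = solve(K, Y)                                        # Delta^{-1} Y
    beta = solve(K, k)                                         # Delta^{-1} k
    mu = sum(ki * ai for ki, ai in zip(k, alpha))              # k^T Delta^{-1} Y
    var = se_kernel(x, x) - sum(ki * bi for ki, bi in zip(k, beta))
    return mu, var

X, Y = [0.0, 1.0, 2.0], [0.0, 1.0, 0.5]
mu, var = gp_posterior(1.0, X, Y)   # near an observed point: mean ~ y, small variance
```

Far from the data, the posterior reverts to the prior: the mean approaches 0 and the variance approaches κ(x, x) = σ.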
When we query at fidelity m, we observe y = f^(m)(x) + ε, where ε ∼ N(0, η²).

We now present our notion of cumulative regret R(Λ) after spending capital Λ of a resource in the multi-fidelity setting. R(Λ) should reduce to the conventional definition of regret for any single fidelity strategy that queries only at the Mth fidelity. As only the optimum of f = f^(M) is of interest to us, queries at fidelities less than M should yield the lowest possible reward, −B according to A3. Accordingly, we set the instantaneous reward q_t at time t to be −B if m_t ≠ M and f^(M)(x_t) if m_t = M. If we let r_t = f⋆ − q_t denote the instantaneous regret, we have r_t = f⋆ + B if m_t ≠ M and f⋆ − f(x_t) if m_t = M. R(Λ) should also factor in the cost of the fidelity of each query. Finally, we should also receive reward −B for any unused capital. Accordingly, we define R(Λ) as

\[ R(\Lambda) = \Lambda f_\star - \left[ \sum_{t=1}^{N} \lambda^{(m_t)} q_t + \left( \Lambda - \sum_{t=1}^{N} \lambda^{(m_t)} \right)(-B) \right] \;\leq\; 2B\Lambda_{\mathrm{res}} + \sum_{t=1}^{N} \lambda^{(m_t)} r_t, \quad (2) \]

where Λ_res = Λ − Σ_{t=1}^N λ^(m_t). Here, N is the (random) number of queries at all fidelities within capital Λ, i.e. the largest n such that Σ_{t=1}^n λ^(m_t) ≤ Λ. According to (2) above, we wish to compete against an oracle that uses all its capital Λ to query x⋆ at the Mth fidelity. R(Λ) is at best 0, when we follow the oracle, and at most 2ΛB. Our goal is a strategy that has small regret for all values of (sufficiently large) Λ, i.e. the equivalent of an anytime strategy, as opposed to a fixed time horizon strategy in the usual bandit setting.
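The definition in (2) is easy to operationalise. Below is a hedged sketch (ours, with invented example numbers) that computes R(Λ) for a sequence of query-fidelity-cost triples; note that q_t = −B whenever m_t ≠ M, and that unused capital also earns −B.

```python
def multifidelity_regret(plays, Lambda, f_star, B, M):
    """R(Lambda) from Equation (2).

    `plays` is a list of (value, m, cost) triples, where `value` stands for
    f^(M)(x_t) at the queried point, m is the fidelity, and cost is lambda^(m).
    """
    spent = 0.0
    payoff = 0.0
    for value, m, cost in plays:
        if spent + cost > Lambda:       # N = largest n with total cost <= Lambda
            break
        spent += cost
        q = value if m == M else -B     # fidelities below M yield the lowest reward
        payoff += cost * q
    payoff += (Lambda - spent) * (-B)   # unused capital also earns -B
    return Lambda * f_star - payoff

# An oracle spending all its capital at x* at fidelity M incurs zero regret:
oracle = multifidelity_regret([(1.0, 2, 1.0)] * 10, 10.0, 1.0, 2.0, 2)
```

Conversely, a strategy that only probes the cheap fidelity collects −B everywhere and its regret approaches the worst case 2ΛB.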
For the purpose of optimisation, we also define the simple regret as S(Λ) = min_t r_t = f⋆ − max_t q_t. S(Λ) is the difference between f⋆ and the best highest fidelity query (and f⋆ + B if we have never queried at fidelity M). Since S(Λ) ≤ (1/Λ) R(Λ), any strategy with asymptotically sublinear regret, lim_{Λ→∞} (1/Λ) R(Λ) = 0, also has vanishing simple regret.

Since, to our knowledge, this is the first attempt to formalise regret for multi-fidelity problems, the definition of R(Λ) in (2) necessitates justification. Consider a two fidelity robot gold mining problem, where the second fidelity is a real world robot trial, costing λ^(2) dollars, and the first fidelity is a computer simulation costing λ^(1). A multi-fidelity algorithm queries the simulator to learn about the real world. But it does not collect any actual gold during a simulation; hence no reward, which according to our assumptions is −B. Meanwhile, the oracle is investing this capital on the best experiment and collecting ∼ f⋆ gold. Therefore, the regret at this time instant is f⋆ + B. However, we weight this by the cost to account for the fact that the simulation costs only λ^(1). Note that lower fidelities use up capital but yield the lowest reward. The goal, however, is to leverage information from these cheap queries to query prudently at the highest fidelity and obtain better regret.

That said, other multi-fidelity settings might require different definitions for R(Λ). In online advertising, the lower fidelities (displaying ads for shorter periods) would still yield rewards. In clinical trials, the regret at the highest fidelity due to a bad treatment would be, say, a dead patient.
However, a bad treatment on a simulation may not warrant a large penalty. We use the definition in (2) because it is more aligned with our optimisation experiments: lower fidelities are useful to the extent that they guide search on the expensive f^(M), but there is no reward for finding the optimum of a cheap f^(m).

A crucial challenge for a multi-fidelity method is to not get stuck at the optimum of a lower fidelity, which is typically suboptimal for f^(M). While exploiting information from the lower fidelities, it is also important to explore sufficiently at f^(M). In our experiments we demonstrate that naive strategies which do not do so get stuck at the optimum of a lower fidelity.

A note on GP-UCB: Sequential optimisation methods adopting UCB principles maintain a high probability upper bound φ_t : X → R for f(x) for all x ∈ X [2]. For GP-UCB, φ_t takes the form φ_t(x) = µ_{t−1}(x) + β_t^{1/2} σ_{t−1}(x), where µ_{t−1}, σ_{t−1} are the posterior mean and standard deviation of the GP conditioned on the previous t − 1 queries. The key intuition is that the mean µ_{t−1} encourages an exploitative strategy – in that we want to query where we know the function is high – and the confidence band β_t^{1/2} σ_{t−1} encourages an explorative strategy – in that we want to query at regions where we are uncertain about f, lest we miss out on high valued regions. We have illustrated GP-UCB in Fig 1 and reviewed the algorithm and its theoretical properties in Appendix A.2.

3 MF-GP-UCB

The proposed algorithm, MF-GP-UCB, will also maintain a UCB for f^(M), obtained via the previous queries at all fidelities. Denote the posterior GP mean and standard deviation of f^(m), conditioned only on the previous queries at fidelity m, by µ_t^(m), σ_t^(m) respectively (see (1)).
Then define

\[ \varphi_t^{(m)}(x) = \mu_{t-1}^{(m)}(x) + \beta_t^{1/2} \sigma_{t-1}^{(m)}(x) + \zeta^{(m)}, \;\; \forall\, m, \qquad \varphi_t(x) = \min_{m=1,\dots,M} \varphi_t^{(m)}(x). \quad (3) \]

For appropriately chosen β_t, µ_{t−1}^(m)(x) + β_t^{1/2} σ_{t−1}^(m)(x) will upper bound f^(m)(x) with high probability. By A2, φ_t^(m)(x) upper bounds f^(M)(x) for all m. We have M such upper bounds, and their minimum φ_t(x) gives the best bound. Our next query is at the maximiser of this UCB, x_t = argmax_{x∈X} φ_t(x).

Next we need to decide which fidelity to query at. Consider any m < M. The ζ^(m) conditions on f^(m) constrain the value of f^(M) – the confidence band β_t^{1/2} σ_{t−1}^(m) for f^(m) is lengthened by ζ^(m) to obtain confidence on f^(M). If β_t^{1/2} σ_{t−1}^(m)(x_t) for f^(m) is large, it means that we have not constrained f^(m) sufficiently well at x_t and should query at the mth fidelity. On the other hand, querying indefinitely in the same region to reduce β_t^{1/2} σ_{t−1}^(m) in that region will not help us much, as the ζ^(m) elongation caps off how much we can learn about f^(M) from f^(m); i.e. even if we knew f^(m) perfectly, we would only have constrained f^(M) to within a ±ζ^(m) band. Our algorithm captures this simple intuition. Having selected x_t, we begin by checking at the first fidelity. If β_t^{1/2} σ_{t−1}^(1)(x_t) is smaller than a threshold γ^(1), we proceed to the second fidelity. If at any stage β_t^{1/2} σ_{t−1}^(m)(x_t) ≥ γ^(m), we query at fidelity m_t = m. If we proceed all the way to fidelity M, we query at m_t = M. We will discuss choices for γ^(m) shortly. We summarise the resulting procedure in Algorithm 1.

Fig 2 illustrates MF-GP-UCB on a 2-fidelity problem.
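The two rules above, the minimum-of-UCBs acquisition in (3) and the threshold-based fidelity selection of Algorithm 1, can be sketched as follows; the posterior objects here are stand-in callables returning a (mean, standard deviation) pair, and all numbers are illustrative rather than taken from the paper.

```python
import math

def mf_ucb(x, posteriors, zetas, beta_t):
    """phi_t(x) = min_m [mu^(m)(x) + beta_t^(1/2) sigma^(m)(x) + zeta^(m)], Eq. (3).

    posteriors[m] returns (mu, sigma) for fidelity m+1 (Python lists are 0-indexed).
    """
    ucbs = []
    for post, zeta in zip(posteriors, zetas):
        mu, sigma = post(x)
        ucbs.append(mu + math.sqrt(beta_t) * sigma + zeta)
    return min(ucbs)                 # the minimum of the M bounds is the tightest

def choose_fidelity(x, posteriors, gammas, beta_t):
    """Return the first fidelity m whose confidence width beta_t^(1/2) sigma^(m)(x)
    still exceeds the threshold gamma^(m); otherwise the highest fidelity M."""
    M = len(posteriors)
    for m in range(M - 1):
        _, sigma = posteriors[m](x)
        if math.sqrt(beta_t) * sigma >= gammas[m]:
            return m + 1             # fidelities reported 1-indexed, as in Algorithm 1
    return M

# Two stand-in posteriors: fidelity 1 is well constrained, fidelity 2 is not.
p1 = lambda x: (0.5, 0.4)
p2 = lambda x: (0.0, 1.0)
phi = mf_ucb(0.0, [p1, p2], [0.2, 0.0], 4.0)   # min(0.5+0.8+0.2, 0.0+2.0) = 1.5
```

With γ^(1) = 1.0 the width 0.8 at fidelity 1 is below threshold, so the query proceeds to fidelity 2; with γ^(1) = 0.5 it stays at fidelity 1.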
Initially, MF-GP-UCB mostly explores X at the first fidelity: β_t^{1/2} σ_{t−1}^(1) is large and we are yet to constrain f^(1) well enough to proceed to f^(2). By t = 14, we have constrained f^(1) around the optimum and have started querying at f^(2) in this region.

Algorithm 1 MF-GP-UCB
Inputs: kernel κ, bounds {ζ^(m)}_{m=1}^M, thresholds {γ^(m)}_{m=1}^M.
• For m = 1, ..., M: D_0^(m) ← ∅, (µ_0^(m), σ_0^(m)) ← (0, κ^{1/2}).
• for t = 1, 2, ...
  1. x_t ← argmax_{x∈X} φ_t(x). (See Equation (3))
  2. m_t ← min_m { m : β_t^{1/2} σ_{t−1}^(m)(x_t) ≥ γ^(m) or m = M }. (See Appendix B, C for β_t)
  3. y_t ← Query f^(m_t) at x_t.
  4. Update D_t^(m_t) ← D_{t−1}^(m_t) ∪ {(x_t, y_t)}. Obtain µ_t^(m_t), σ_t^(m_t) conditioned on D_t^(m_t) (see (1)).

Figure 2: Illustration of MF-GP-UCB for a 2-fidelity problem initialised with 5 random points at the first fidelity. In the top figures, the solid lines in brown and blue are f^(1), f^(2) respectively, and the dashed lines are φ_t^(1), φ_t^(2). The solid green line is φ_t = min(φ_t^(1), φ_t^(2)). The small crosses are queries from 1 to t − 1 and the red star is the maximiser of φ_t, i.e. the next query x_t. x⋆, the optimum of f^(2), is shown in magenta. In the bottom figures, the solid orange line is β_t^{1/2} σ_{t−1}^(1)(x) and the dashed black line is γ^(1). When β_t^{1/2} σ_{t−1}^(1)(x_t) ≤ γ^(1) we play at fidelity m_t = 2 and otherwise at m_t = 1. See Fig. 6 in Appendix B for an extended simulation.

Notice how φ_t^(2) dips to change φ_t in this region. MF-GP-UCB has identified the maximum with just 3 queries to f^(2). In Appendix B we provide an extended simulation and discuss further insights.

Finally, we make an essential observation. The posterior for any f^(m)(x) conditioned on previous queries at all fidelities is not Gaussian due to the ζ^(m) constraints (A2). However, |f^(m)(x) − µ_{t−1}^(m)(x)| < β_t^{1/2} σ_{t−1}^(m)(x) holds with high probability, since, by conditioning only on queries at the mth fidelity, we have Gaussianity for f^(m)(x). Next we summarise our main theoretical contributions.

4 Summary of Theoretical Results

For pedagogical reasons we present our results for the M = 2 case. Appendix C contains statements and proofs for general M. We also ignore constants and polylog terms when they are dominated by other terms. ≲, ≍ denote inequality and equality ignoring constants. We begin by defining the Maximum Information Gain (MIG), which characterises the statistical difficulty of GP bandits.

Definition 2. (Maximum Information Gain) Let f ∼ GP(0, κ). Consider any A ⊂ R^d and let Ã = {x_1, ..., x_n} ⊂ A be a finite subset. Let f_Ã, ε_Ã ∈ R^n be such that (f_Ã)_i = f(x_i), (ε_Ã)_i ∼ N(0, η²), and y_Ã = f_Ã + ε_Ã. Let I denote the Shannon mutual information. The Maximum Information Gain of A is Ψ_n(A) = max_{Ã⊂A, |Ã|=n} I(y_Ã; f_Ã).

The MIG, which depends on the kernel κ and the set A, is an important quantity in our analysis. For a given κ, it typically scales with the volume of A; i.e. if A = [0, r]^d then Ψ_n(A) ∈ O(r^d Ψ_n([0, 1]^d)). For the SE kernel, Ψ_n([0, 1]^d) ∈ O((log n)^{d+1}), and for the Matérn kernel, Ψ_n([0, 1]^d) ∈ O(n^{d(d+1)/(2ν+d(d+1))}) [28]. Recall that N is the (random) number of queries by a multi-fidelity strategy within capital Λ at either fidelity.
Let n_Λ = ⌊Λ/λ^(2)⌋ be the (non-random) number of queries by a single fidelity method operating only at the second fidelity. As λ^(1) < λ^(2), N could be large for an arbitrary multi-fidelity method. However, our analysis reveals that for MF-GP-UCB, N is on the order of n_Λ.

Fundamental to the 2-fidelity problem is the set X_g = {x ∈ X : f⋆ − f^(1)(x) ≤ ζ^(1)}. X_g is a high valued region for f^(2)(x): for all x ∈ X_g, f^(2)(x) is at most 2ζ^(1) away from the optimum. More interestingly, when ζ^(1) is small, i.e. when f^(1) is a good approximation to f^(2), X_g will be much smaller than X. This is precisely the target domain for this research. For instance, in the robot gold mining example, a cheap computer simulator can be used to eliminate several bad policies, and we could reserve the real world trials for the promising candidates. If a multi-fidelity strategy were to use the second fidelity queries only in X_g, then the regret would only have Ψ_n(X_g) dependence after n high fidelity queries. In contrast, a strategy that only operates at the highest fidelity (e.g. GP-UCB) will have Ψ_n(X) dependence. In the scenario described above, Ψ_n(X_g) ≪ Ψ_n(X), and the multi-fidelity strategy will have significantly better regret than a single fidelity strategy. MF-GP-UCB roughly achieves this goal. In particular, we consider a slightly inflated set X̃_{g,ρ} = {x ∈ X : f⋆ − f^(1)(x) ≤ ζ^(1) + ργ^(1)} of X_g, where ρ > 0.
The following result, which characterises the regret of MF-GP-UCB in terms of X̃_{g,ρ}, is the main theorem of this paper.

Theorem 3 (Regret of MF-GP-UCB – Informal). Let X = [0, r]^d and f^(1), f^(2) ∼ GP(0, κ) satisfy Assumption 1. Pick δ ∈ (0, 1) and run MF-GP-UCB with β_t ≍ d log(t/δ). Then, with probability > 1 − δ, for sufficiently large Λ and for all α ∈ (0, 1), there exists ρ depending on α such that

\[ R(\Lambda) \;\lesssim\; \lambda^{(2)} \sqrt{n_\Lambda \beta_{n_\Lambda} \Psi_{n_\Lambda}(\widetilde{\mathcal{X}}_{g,\rho})} \;+\; \lambda^{(1)} \sqrt{n_\Lambda \beta_{n_\Lambda} \Psi_{n_\Lambda}(\mathcal{X})} \;+\; \lambda^{(2)} \sqrt{n_\Lambda^{\alpha} \beta_{n_\Lambda} \Psi_{n_\Lambda^{\alpha}}(\mathcal{X})} \;+\; \lambda^{(1)} \xi_{n, \widetilde{\mathcal{X}}_{g,\rho}, \gamma^{(1)}}. \]

As we will explain shortly, the latter two terms are of lower order. It is instructive to compare the above rates against that for GP-UCB (see Theorem 4, Appendix A.2). Dropping the common and subdominant terms, the rate for MF-GP-UCB is λ^(2) Ψ_{n_Λ}^{1/2}(X̃_{g,ρ}) + λ^(1) Ψ_{n_Λ}^{1/2}(X), whereas for GP-UCB it is λ^(2) Ψ_{n_Λ}^{1/2}(X). When λ^(1) ≪ λ^(2) and vol(X̃_{g,ρ}) ≪ vol(X), the rates for MF-GP-UCB are very appealing. When the approximation worsens (X_g, X̃_{g,ρ} become larger) and the costs λ^(1), λ^(2) become comparable, the bound for MF-GP-UCB decays gracefully. In the worst case, MF-GP-UCB is never worse than GP-UCB up to constant terms. Intuitively, the above result states that MF-GP-UCB explores the entire X using f^(1), but uses "most" of its queries to f^(2) inside X̃_{g,ρ}.

Now let us turn to the latter two terms in the bound. The third term is the regret due to the second fidelity queries outside X̃_{g,ρ}. We are able to show that the number of such queries is O(n_Λ^α) for all α > 0, for an appropriate ρ. This strong result is only possible in the multi-fidelity setting.
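Purely for intuition, the dominant terms above can be compared numerically under the SE-kernel scaling Ψ_n(A) ∼ vol(A)(log n)^{d+1}; every constant below is invented for illustration, and none of this substitutes for the formal statement in Appendix C.

```python
import math

def psi(n, vol, d):
    """SE-kernel MIG scaling Psi_n(A) ~ vol(A) * (log n)^(d+1), constants dropped."""
    return vol * math.log(n) ** (d + 1)

def dominant_regret(n, beta_n, lam1, lam2, vol_X, vol_Xg, d):
    """Dominant regret terms: MF-GP-UCB vs a single fidelity GP-UCB strategy."""
    mf = (lam2 * math.sqrt(n * beta_n * psi(n, vol_Xg, d))    # f^(2) inside Xg~
          + lam1 * math.sqrt(n * beta_n * psi(n, vol_X, d)))  # f^(1) over all of X
    sf = lam2 * math.sqrt(n * beta_n * psi(n, vol_X, d))      # GP-UCB over all of X
    return mf, sf

# Cheap, accurate approximation: lambda^(1) << lambda^(2) and vol(Xg~) << vol(X).
mf, sf = dominant_regret(1000, 10.0, 1.0, 100.0, 1.0, 0.01, 2)
```

With these invented numbers the multi-fidelity bound is roughly an order of magnitude smaller; when the costs and volumes coincide, it degrades gracefully to within a constant factor of the single fidelity bound.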
For example, in GP-UCB the best bound one can achieve on the number of plays in a suboptimal set is O(n_Λ^{1/2}) for the SE kernel, and worse for the Matérn kernel. The last term is due to the first fidelity plays inside X̃_{g,ρ}; it scales with vol(X̃_{g,ρ}) and polylogarithmically with n, both of which are small. However, it has a 1/poly(γ^(1)) dependence, which could be bad if γ^(1) is too small: intuitively, if γ^(1) is too small, then we will wait for a long time in step 2 of Algorithm 1 for β_t^{1/2} σ_{t−1}^(1) to decrease without proceeding to f^(2), incurring large regret (f⋆ + B) in the process. Our analysis reveals that an optimal choice for the SE kernel scales as γ^(1) ≍ (λ^(1) ζ^(1) / (t λ^(2)))^{1/(d+2)} at time t. However, this is of little practical use, as the leading constant depends on several problem dependent quantities such as Ψ_n(X_g). In Section 5 we describe a heuristic to set γ^(m) which worked well in our experiments.

Theorem 3 can be generalised to cases where the kernels κ^(m) and observation noises η^(m) are different at each fidelity. The changes to the proofs are minimal. In fact, our practical implementation uses different kernels. As with any nonparametric method, our algorithm has exponential dependence on dimension. This can be alleviated by assuming additional structure in the problem [8, 15]. Finally, we note that the above rates translate to bounds on the simple regret S(Λ) for optimisation.

5 Implementation Details

Our implementation uses some standard techniques in Bayesian optimisation to learn the kernel, such as initialisation with random queries and periodic marginal likelihood maximisation. These techniques might already be known to a reader familiar with the BO literature. We have elaborated
We have elaborated\nthese in Appendix B but now focus on the \u03b3(m), \u03b6 (m) parameters of our method.\nAlgorithm 1 assumes that the \u03b6 (m)\u2019s are given with the problem description, which is hardly the\ncase in practice. In our implementation, instead of having to deal with M \u2212 1, \u03b6 (m) values we set\n(\u03b6 (1), \u03b6 (2), . . . , \u03b6 (M\u22121)) = ((M \u2212 1)\u03b6, (M \u2212 2)\u03b6, . . . , \u03b6) so we only have one value \u03b6. This for\n\n6\n\n\fFigure 3: The simple regret S(\u039b) against the spent capital \u039b on synthetic functions. The title states the\nfunction, its dimensionality, the number of \ufb01delities and the costs we used for each \ufb01delity in the experiment.\nAll curves barring DiRect (which is a deterministic), were produced by averaging over 20 experiments. The\nerror bars indicate one standard error. See Figures 8, 9 10 in Appendix D for more synthetic results. The last\npanel shows the number of queries at different function values at each \ufb01delity for the Hartmann-3D example.\ninstance, is satis\ufb01ed if (cid:107)f (m) \u2212 f (m\u22121)(cid:107)\u221e \u2264 \u03b6 which is stronger than Assumption A2. Initially, we\nstart with small \u03b6. Whenever we query at any \ufb01delity m > 1 we also check the posterior mean of\nthe (m \u2212 1)th \ufb01delity. If |f (m)(xt) \u2212 \u00b5(m\u22121)\n(xt)| > \u03b6, we query again at xt, but at the (m \u2212 1)th\n\ufb01delity. If |f (m)(xt) \u2212 f (m\u22121)(xt)| > \u03b6, we update \u03b6 to twice the violation. To set \u03b3(m)\u2019s we use\nthe following intuition: if the algorithm, is stuck at \ufb01delity m for too long then \u03b3(m) is probably too\nsmall. We start with small values for \u03b3(m). If the algorithm does not query above the mth \ufb01delity for\nmore than \u03bb(m+1)/\u03bb(m) iterations, we double \u03b3(m). 
We found our implementation to be fairly robust, even recovering from quite bad approximations at the lower fidelities (see Appendix D.3).

6 Experiments

We compare MF-GP-UCB to the following methods. Single fidelity methods: GP-UCB; EI: the expected improvement criterion for BO [13]; DiRect: the dividing rectangles method [12]. Multi-fidelity methods: MF-NAIVE: a naive baseline where we use GP-UCB to query at the first fidelity a large number of times and then query at the last fidelity at the points queried at $f^{(1)}$, in decreasing order of $f^{(1)}$-value; MF-SKO: the multi-fidelity sequential kriging method from [11]. Previous works on multi-fidelity methods (including MF-SKO) had not made their code available and were not straightforward to implement; hence, we could not compare to all of them. We discuss this further in Appendix D, along with some other single and multi-fidelity baselines we tried but excluded from the comparison to avoid clutter in the figures. We also detail the design choices and hyper-parameters for all methods in Appendix D.

Synthetic Examples: We use the Currin exponential ($d = 2$), Park ($d = 4$) and Borehole ($d = 8$) functions in $M = 2$ fidelity experiments, and the Hartmann functions in $d = 3$ and $6$ with $M = 3$ and $4$ fidelities respectively. The first three are taken from previous multi-fidelity literature [32], while we tweaked the Hartmann functions to obtain the lower fidelities for the latter two cases. We show the simple regret $S(\Lambda)$ against capital $\Lambda$ for the Borehole and Hartmann-3D functions in Fig. 3, with the rest deferred to Appendix D due to space constraints. MF-GP-UCB outperforms the other methods. Appendix D also contains results for the cumulative regret $R(\Lambda)$ and the formulae for these functions. A common occurrence with MF-NAIVE was that once we started querying at fidelity $M$, the regret barely decreased.
The diagnosis in all cases was the same: it was stuck around the maximum of $f^{(1)}$, which is suboptimal for $f^{(M)}$. This suggests that while we have cheap approximations, the problem is by no means trivial. As explained previously, it is also important to “explore” at the higher fidelities to achieve good regret. The advantage of MF-GP-UCB over single fidelity methods is that it confines this exploration to a small set containing the optimum. In our experiments we found that MF-SKO did not consistently beat the single fidelity methods. Despite our best efforts to reproduce this (and another) multi-fidelity method, we found them to be quite brittle (Appendix D.1).

The third panel of Fig. 3 shows a histogram of the number of queries at each fidelity after 184 queries of MF-GP-UCB, for different ranges of $f^{(3)}(x)$, for the Hartmann-3D function. Many of the queries at low $f^{(3)}$ values are at fidelity 1, but as the function value increases these decrease and the second fidelity queries increase. The third fidelity dominates very close to the optimum but is used sparingly elsewhere. This corroborates the prediction in our analysis that MF-GP-UCB uses low fidelities to explore and successively higher fidelities in promising regions to zero in on $x_\star$. (Also see Fig. 6, Appendix B.)

[Figure 3 panels: BoreHole-8D ($M=2$, costs $[1;10]$); Hartmann-3D ($M=3$, costs $[1;10;100]$); query frequencies by fidelity for Hartmann-3D.]

Figure 4: Results on the hyper-parameter tuning experiments. The title states the experiment, dimensionality (number of hyperparameters) and training set size at each fidelity. All curves were produced by averaging over 10 experiments. The error bars indicate one standard error.
The lengths of the curves differ in time because we ran each method for a pre-specified number of iterations and they concluded at different times.

Real Experiments: We present results on three hyper-parameter tuning tasks (results in Fig. 4) and a maximum likelihood inference task in astrophysics (Fig. 5). We compare methods on computation time, since that is the “cost” in all experiments. We include the processing time for each method in the comparison (i.e. the cost of determining the next query).

Classification using SVMs (SVM): We trained an SVM on the magic gamma dataset using the SMO algorithm to an accuracy of $10^{-12}$. The goal is to tune the kernel bandwidth and the soft margin coefficient in the ranges $(10^{-3}, 10^{1})$ and $(10^{-1}, 10^{5})$ respectively, on a dataset of size 2000. We set this up as an $M = 2$ fidelity experiment with the entire training set at the second fidelity and 500 points at the first. Each query was 5-fold cross validation on these training sets.

Regression using Additive Kernels (SALSA): We used the regression method from [14] on the 4-dimensional coal power plant dataset. We tuned the 6 hyper-parameters (the regularisation penalty, the kernel scale, and the kernel bandwidth for each dimension), each in the range $(10^{-3}, 10^{4})$, using 5-fold cross validation. This experiment used $M = 3$ with 2000, 4000 and 8000 points at each fidelity.

Viola & Jones face detection (V&J): The V&J classifier [31], which uses a cascade of weak classifiers, is a popular method for face detection. To classify an image, we pass it through each classifier. If at any point the classifier score falls below a threshold, the image is classified negative; if it passes through the entire cascade, it is classified positive. One of the more popular implementations comes with OpenCV and uses a cascade of 22 weak classifiers.
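The cascade logic just described is simple enough to state in code. Below is a minimal sketch of the classification rule (our own illustrative function, not the OpenCV implementation); the per-stage thresholds are exactly the parameters being tuned:

```python
def cascade_classify(stage_scores, thresholds):
    """Pass one example through a cascade of weak classifiers.

    stage_scores: the score each weak classifier assigns to the image.
    thresholds:   one threshold per stage; in the V&J experiment these
                  are the 22 parameters being optimised.
    Returns True (face) only if every stage's score clears its threshold;
    the image is rejected at the first stage that falls below threshold.
    """
    for score, tau in zip(stage_scores, thresholds):
        if score < tau:
            return False   # rejected early: classified negative
    return True            # survived all stages: classified positive
```

Tuning then amounts to optimising the map from the threshold vector to validation accuracy, evaluated cheaply on 300 images at the first fidelity and on 3000 at the second.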
The threshold values in OpenCV are pre-set based on some heuristics, and there is no reason to think they are optimal for a given face detection task. The goal is to tune these 22 thresholds by optimising for them over a training set. We modified the OpenCV implementation to take in the thresholds as parameters. As our domain $\mathcal{X}$ we chose a neighbourhood around the configuration used in OpenCV. We set this up as an $M = 2$ fidelity experiment where the second fidelity used 3000 images from the V&J face database and the first used 300. Interestingly, on an independent test set, the configurations found by MF-GP-UCB consistently achieved over 90% accuracy while the OpenCV configuration achieved only 87.4% accuracy.

Type Ia Supernovae: We use Type Ia supernovae data [7] for maximum likelihood inference on 3 cosmological parameters: the Hubble constant $H_0 \in (60, 80)$, and the dark matter and dark energy fractions $\Omega_M, \Omega_\Lambda \in (0, 1)$. Unlike typical parametric maximum likelihood problems, the likelihood is only available as a black box. It is computed using the Robertson–Walker metric, which requires a one dimensional numerical integration for each sample in the dataset. We set this up as an $M = 3$ fidelity task. The goal is to maximise the likelihood at the third fidelity, where the integration was performed using the trapezoidal rule on a grid of size $10^6$. For the first and second fidelities, we used grids of size $10^2$ and $10^4$ respectively. The results are given in Fig. 5.

Figure 5: Results on the supernova inference problem. The y-axis is the log likelihood, so higher is better. MF-NAIVE is not visible as it performed very poorly.

Conclusion: We introduced and studied the multi-fidelity bandit under Gaussian process assumptions. We present, to our knowledge, the first formalism of regret and the first theoretical results in this setting.
They demonstrate that MF-GP-UCB explores the space via cheap lower fidelities and leverages the higher fidelities on successively smaller regions, hence achieving better regret than single fidelity strategies. Experimental results demonstrate the efficacy of our method.

[Figure 4/5 panels: SVM-2D ($M=2$, $n_{tr}=[500, 2000]$); SALSA-6D ($M=3$, $n_{tr}=[2000, 4000, 8000]$); V&J-22D ($M=2$, $n_{tr}=[300, 3000]$); Supernova-3D ($M=3$, grids $[10^2, 10^4, 10^6]$).]

References

[1] Alekh Agarwal, John C. Duchi, Peter L. Bartlett, and Clement Levrard. Oracle inequalities for computationally budgeted model selection. In COLT, 2011.

[2] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res., 2003.

[3] E. Brochu, V. M. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical RL. CoRR, 2010.

[4] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 2012.

[5] Mark Cutler, Thomas J. Walsh, and Jonathan P. How. Reinforcement learning with multi-fidelity simulators. In ICRA, 2014.

[6] V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback. In COLT, 2008.

[7] T. M. Davis et al. Scrutinizing exotic cosmological models using ESSENCE supernova data combined with other cosmological probes. Astrophysical Journal, 2007.

[8] J. Djolonga, A. Krause, and V. Cevher. High-dimensional Gaussian process bandits. In NIPS, 2013.

[9] Alexander I. J.
Forrester, András Sóbester, and Andy J. Keane. Multi-fidelity optimization via surrogate modelling. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Science, 2007.

[10] Subhashis Ghosal and Anindya Roy. Posterior consistency of Gaussian process prior for nonparametric binary regression. Annals of Statistics, 2006.

[11] D. Huang, T. T. Allen, W. I. Notz, and R. A. Miller. Sequential kriging optimization using multiple-fidelity evaluations. Structural and Multidisciplinary Optimization, 2006.

[12] D. R. Jones, C. D. Perttunen, and B. E. Stuckman. Lipschitzian optimization without the Lipschitz constant. J. Optim. Theory Appl., 1993.

[13] Donald R. Jones, Matthias Schonlau, and William J. Welch. Efficient global optimization of expensive black-box functions. J. of Global Optimization, 1998.

[14] Kirthevasan Kandasamy and Yaoliang Yu. Additive approximations in high dimensional nonparametric regression via the SALSA. In ICML, 2016.

[15] Kirthevasan Kandasamy, Jeff Schneider, and Barnabás Póczos. High dimensional Bayesian optimisation and bandits via additive models. In ICML, 2015.

[16] Kirthevasan Kandasamy, Gautam Dasarathy, Jeff Schneider, and Barnabás Póczos. The multi-fidelity multi-armed bandit. In NIPS, 2016.

[17] K. Kawaguchi, L. P. Kaelbling, and T. Lozano-Pérez. Bayesian optimization with exponential convergence. In NIPS, 2015.

[18] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 1983.

[19] A. Klein, S. Bartels, S. Falkner, P. Hennig, and F. Hutter. Towards efficient Bayesian optimization for big data. In BayesOpt, 2015.

[20] R. Martinez-Cantin, N. de Freitas, A. Doucet, and J. Castellanos. Active policy learning for robot planning and exploration under uncertainty. In Proceedings of Robotics: Science and Systems, 2007.

[21] Jonas Mockus.
Application of Bayesian approach to numerical methods of global and stochastic optimization. Journal of Global Optimization, 1994.

[22] R. Munos. Optimistic optimization of deterministic functions without the knowledge of its smoothness. In NIPS, 2011.

[23] D. Parkinson, P. Mukherjee, and A. R. Liddle. A Bayesian model selection analysis of WMAP3. Physical Review, 2006.

[24] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[25] Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 1952.

[26] A. Sabharwal, H. Samulowitz, and G. Tesauro. Selecting near-optimal learners via incremental data allocation. In AAAI, 2015.

[27] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In NIPS, 2012.

[28] Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. In ICML, 2010.

[29] Kevin Swersky, Jasper Snoek, and Ryan P. Adams. Multi-task Bayesian optimization. In NIPS, 2013.

[30] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 1933.

[31] Paul A. Viola and Michael J. Jones. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001.

[32] Shifeng Xiong, Peter Z. G. Qian, and C. F. Jeff Wu. Sequential design and analysis of high-accuracy and low-accuracy computer codes. Technometrics, 2013.

[33] C. Zhang and K. Chaudhuri. Active learning from weak and strong labelers.
In NIPS, 2015.